fix: per-group char classes in toRegex to prevent false positives by jan-kubica · Pull Request #76 · stella/stdnum

jan-kubica · 2026-03-26T18:36:05Z

Summary

toRegex() inferred a single character class ([A-Z0-9]) for all groups when the compact form mixed letters and digits. For de.svnr (German SSN), this matched all-caps prose like "OF NOVEMBER 6" as a candidate, consuming the span and blocking the correct date pattern.
Added inferPerGroupInfo() which derives per-group character classes from format() output. For SVNR "12 010188 M 01 1", this produces \d{2} \d{6} [A-Z]{1} \d{2} \d{1} instead of [A-Z0-9] everywhere.
Letter-only groups before the first digit group (format-prepended prefixes like "CHE") are excluded — handled by inferPrefix separately.

Test plan

bun test __test__/patterns.test.ts — all 17 pattern tests pass (includes de.svnr, ch.uid, multi-length, prefix patterns)
bun test — full suite (4185 tests) passes
Manual: toRegex(de.svnr) produces \d{2}[\s\-./]?\d{6}[\s\-./]?[A-Z]{1}[\s\-./]?\d{2}[\s\-./]?\d{1} and rejects "OF NOVEMBER 6"
Downstream: anonymize detection-batch (37 tests) passes with this change

toRegex() inferred a single character class for all positions from examples. For validators where the compact form mixes letters and digits in distinct positions (e.g., German SVNR "12010188M011"), this produced [A-Z0-9] for every group. The overly broad pattern matched all-caps prose like "OF NOVEMBER 6" as a valid candidate, consuming the span and preventing the correct date pattern from firing. Add inferPerGroupInfo() which derives per-group character classes from the formatted output. For SVNR, "12 010188 M 01 1" now produces \d{2} \d{6} [A-Z]{1} \d{2} \d{1} instead of [A-Z0-9] for all positions. Letter-only groups before the first digit group (format-prepended prefixes like "CHE") are still excluded.

greptile-apps · 2026-03-26T18:40:58Z

Greptile Summary

This PR fixes a false-positive matching bug where toRegex() used a single broad [A-Z0-9] character class for all groups of mixed letter/digit validators, causing patterns like German SVNR to match all-caps prose (e.g. "OF NOVEMBER 6").

The fix introduces inferPerGroupInfo(), which formats the first compact example and derives per-group sizes and character classes — so de.svnr's "12 010188 M 01 1" now produces \d{2}[\s\-./]?\d{6}[\s\-./]?[A-Z]{1}[\s\-./]?\d{2}[\s\-./]?\d{1} instead of five repetitions of [A-Z0-9]. charClassFor() replaces inline scanning in inferCharClass with a reusable helper. Two targeted regression tests cover the de.svnr match and reject cases.

Key observations:

The guard lengths.length <= 1 correctly restricts per-group logic to single-length validators, preventing misapplication to multi-length validators.
Exclusion of letter-only groups before the first digit group correctly avoids double-counting prefixes already handled by inferPrefix.
IBAN/ISIN-style validators (parts that internally mix letters and digits) are correctly excluded by the isAlphanumeric short-circuit.
Two prior P2 notes from earlier review rounds — charClassFor duplication with inferCharClass, and the missing length-guard in groupsToPatternPerClass — remain open but are non-blocking.

Confidence Score: 4/5

Safe to merge; fix is correct and well-tested, two minor prior P2s remain unaddressed but are non-blocking.

The per-group inference logic is sound: it correctly excludes IBAN/ISIN-style mixed parts, excludes prefix-only letter groups, and only engages for single-length validators. Regression tests cover the specific failure mode (all-caps prose false positive) and the three canonical input forms. The two unaddressed P2s from prior rounds (charClassFor duplication, missing array-length guard in groupsToPatternPerClass) are acknowledged but genuinely non-blocking.

src/patterns.ts — the two prior P2 follow-ups (charClassFor/inferCharClass deduplication, groupsToPatternPerClass length guard) are still open.

Important Files Changed

Filename	Overview
src/patterns.ts	Adds charClassFor, inferPerGroupInfo, and groupsToPatternPerClass to derive tighter per-group regex patterns for mixed letter/digit validators; toRegex now prefers the per-group path when available. Two prior P2 notes (duplication with inferCharClass, missing length-guard in groupsToPatternPerClass) remain unaddressed.
test/patterns.test.ts	Adds two focused regression tests for de.svnr: one verifying correct per-group matching (compact, spaced, dot-separated) and one verifying that all-caps prose is rejected. Good coverage for the targeted bug.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[toRegex called] --> B[inferPrefix / inferGroups / inferLengths / inferCharClass]
    B --> C[inferPerGroupInfo]
    C --> D{perGroup not null AND lengths le 1}
    D -- Yes --> E[groupsToPatternPerClass - per-group classes]
    D -- No --> F{groups not null AND lengths le 1}
    F -- Yes --> G[groupsToPattern - single char class]
    F -- No --> H{lengths == 1}
    H -- Yes --> I[flat cc-N pattern]
    H -- No --> J{lengths gt 1}
    J -- Yes --> K[cc min-max range]
    J -- No --> L[cc 6-20 fallback]
    E --> M{prefix exists}
    G --> M
    I --> M
    K --> M
    L --> M
    M -- Yes --> N[prepend prefix + SEP]
    M -- No --> O[wrap in word boundaries]
    N --> O
    O --> P[return RegExp]

_{Reviews (2): Last reviewed commit: "fix: address review comments" | Re-trigger Greptile}

- Extract shared charClassFor helper; inferCharClass now delegates to it instead of duplicating the letter/digit scanning logic - Add regression tests for de.svnr: per-group char class matching and all-caps prose rejection

jan-kubica · 2026-03-26T18:44:55Z

Addressing the Greptile summary items:

Duplicate scanning logic (P2): Accepted. Extracted charClassFor as the shared core; inferCharClass now delegates to it. See fa6fb9b.

No length-parity guard (P2): Pushing back. sizes and classes are derived from the same filtered array in inferPerGroupInfo, making length divergence structurally impossible. A guard for an impossible state adds dead code.

Missing regression test (P2, outside-diff): Accepted and implemented. Added two tests in fa6fb9b: one for per-group char class matching (de.svnr formatted/compact/dot-separated) and one for all-caps prose rejection (OF NOVEMBER 6, AS OF NOVEMBER 6, 2024, BY, AMENDMENT DATED NOVEMBER 6).

CC on behalf of @jan-kubica

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

This comment was marked as resolved.

Sign in to view

fix: address review comments

fa6fb9b

- Extract shared charClassFor helper; inferCharClass now delegates to it instead of duplicating the letter/digit scanning logic - Add regression tests for de.svnr: per-group char class matching and all-caps prose rejection

jan-kubica marked this pull request as ready for review March 26, 2026 18:56

jan-kubica merged commit b776c5f into main Mar 26, 2026
7 of 10 checks passed

devin-ai-integration bot reviewed Mar 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: per-group char classes in toRegex to prevent false positives#76

fix: per-group char classes in toRegex to prevent false positives#76
jan-kubica merged 2 commits intomainfrom
fix/per-group-char-classes

jan-kubica commented Mar 26, 2026 •

edited by devin-ai-integration bot

Loading

Uh oh!

greptile-apps bot commented Mar 26, 2026 •

edited

Loading

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

This comment was marked as resolved.

Uh oh!

jan-kubica commented Mar 26, 2026

Uh oh!

Uh oh!

devin-ai-integration bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jan-kubica commented Mar 26, 2026 • edited by devin-ai-integration bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

greptile-apps bot commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

This comment was marked as resolved.

Uh oh!

jan-kubica commented Mar 26, 2026

Uh oh!

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jan-kubica commented Mar 26, 2026 •

edited by devin-ai-integration bot

Loading

greptile-apps bot commented Mar 26, 2026 •

edited

Loading