Skip to content

fix: per-group char classes in toRegex to prevent false positives#76

Merged
jan-kubica merged 2 commits intomainfrom
fix/per-group-char-classes
Mar 26, 2026
Merged

fix: per-group char classes in toRegex to prevent false positives#76
jan-kubica merged 2 commits intomainfrom
fix/per-group-char-classes

Conversation

@jan-kubica
Copy link
Copy Markdown
Contributor

@jan-kubica jan-kubica commented Mar 26, 2026

Summary

  • toRegex() inferred a single character class ([A-Z0-9]) for all groups when the compact form mixed letters and digits. For de.svnr (German SSN), this matched all-caps prose like "OF NOVEMBER 6" as a candidate, consuming the span and blocking the correct date pattern.
  • Added inferPerGroupInfo() which derives per-group character classes from format() output. For SVNR "12 010188 M 01 1", this produces \d{2} \d{6} [A-Z]{1} \d{2} \d{1} instead of [A-Z0-9] everywhere.
  • Letter-only groups before the first digit group (format-prepended prefixes like "CHE") are excluded — handled by inferPrefix separately.

Test plan

  • bun test __test__/patterns.test.ts — all 17 pattern tests pass (includes de.svnr, ch.uid, multi-length, prefix patterns)
  • bun test — full suite (4185 tests) passes
  • Manual: toRegex(de.svnr) produces \d{2}[\s\-./]?\d{6}[\s\-./]?[A-Z]{1}[\s\-./]?\d{2}[\s\-./]?\d{1} and rejects "OF NOVEMBER 6"
  • Downstream: anonymize detection-batch (37 tests) passes with this change

Open with Devin

toRegex() inferred a single character class for all positions from
examples. For validators where the compact form mixes letters and
digits in distinct positions (e.g., German SVNR "12010188M011"),
this produced [A-Z0-9] for every group. The overly broad pattern
matched all-caps prose like "OF NOVEMBER 6" as a valid candidate,
consuming the span and preventing the correct date pattern from
firing.

Add inferPerGroupInfo() which derives per-group character classes
from the formatted output. For SVNR, "12 010188 M 01 1" now
produces \d{2} \d{6} [A-Z]{1} \d{2} \d{1} instead of [A-Z0-9]
for all positions. Letter-only groups before the first digit group
(format-prepended prefixes like "CHE") are still excluded.
@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Mar 26, 2026

Greptile Summary

This PR fixes a false-positive matching bug where toRegex() used a single broad [A-Z0-9] character class for all groups of mixed letter/digit validators, causing patterns like German SVNR to match all-caps prose (e.g. "OF NOVEMBER 6").

The fix introduces inferPerGroupInfo(), which formats the first compact example and derives per-group sizes and character classes — so de.svnr's "12 010188 M 01 1" now produces \d{2}[\s\-./]?\d{6}[\s\-./]?[A-Z]{1}[\s\-./]?\d{2}[\s\-./]?\d{1} instead of five repetitions of [A-Z0-9]. charClassFor() replaces inline scanning in inferCharClass with a reusable helper. Two targeted regression tests cover the de.svnr match and reject cases.

Key observations:

  • The guard lengths.length <= 1 correctly restricts per-group logic to single-length validators, preventing misapplication to multi-length validators.
  • Exclusion of letter-only groups before the first digit group correctly avoids double-counting prefixes already handled by inferPrefix.
  • IBAN/ISIN-style validators (parts that internally mix letters and digits) are correctly excluded by the isAlphanumeric short-circuit.
  • Two prior P2 notes from earlier review rounds — charClassFor duplication with inferCharClass, and the missing length-guard in groupsToPatternPerClass — remain open but are non-blocking.

Confidence Score: 4/5

Safe to merge; fix is correct and well-tested, two minor prior P2s remain unaddressed but are non-blocking.

The per-group inference logic is sound: it correctly excludes IBAN/ISIN-style mixed parts, excludes prefix-only letter groups, and only engages for single-length validators. Regression tests cover the specific failure mode (all-caps prose false positive) and the three canonical input forms. The two unaddressed P2s from prior rounds (charClassFor duplication, missing array-length guard in groupsToPatternPerClass) are acknowledged but genuinely non-blocking.

src/patterns.ts — the two prior P2 follow-ups (charClassFor/inferCharClass deduplication, groupsToPatternPerClass length guard) are still open.

Important Files Changed

Filename Overview
src/patterns.ts Adds charClassFor, inferPerGroupInfo, and groupsToPatternPerClass to derive tighter per-group regex patterns for mixed letter/digit validators; toRegex now prefers the per-group path when available. Two prior P2 notes (duplication with inferCharClass, missing length-guard in groupsToPatternPerClass) remain unaddressed.
test/patterns.test.ts Adds two focused regression tests for de.svnr: one verifying correct per-group matching (compact, spaced, dot-separated) and one verifying that all-caps prose is rejected. Good coverage for the targeted bug.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[toRegex called] --> B[inferPrefix / inferGroups / inferLengths / inferCharClass]
    B --> C[inferPerGroupInfo]
    C --> D{perGroup not null AND lengths le 1}
    D -- Yes --> E[groupsToPatternPerClass - per-group classes]
    D -- No --> F{groups not null AND lengths le 1}
    F -- Yes --> G[groupsToPattern - single char class]
    F -- No --> H{lengths == 1}
    H -- Yes --> I[flat cc-N pattern]
    H -- No --> J{lengths gt 1}
    J -- Yes --> K[cc min-max range]
    J -- No --> L[cc 6-20 fallback]
    E --> M{prefix exists}
    G --> M
    I --> M
    K --> M
    L --> M
    M -- Yes --> N[prepend prefix + SEP]
    M -- No --> O[wrap in word boundaries]
    N --> O
    O --> P[return RegExp]
Loading

Reviews (2): Last reviewed commit: "fix: address review comments" | Re-trigger Greptile

greptile-apps[bot]

This comment was marked as resolved.

- Extract shared charClassFor helper; inferCharClass now delegates
  to it instead of duplicating the letter/digit scanning logic
- Add regression tests for de.svnr: per-group char class matching
  and all-caps prose rejection
@jan-kubica
Copy link
Copy Markdown
Contributor Author

Addressing the Greptile summary items:

Duplicate scanning logic (P2): Accepted. Extracted charClassFor as the shared core; inferCharClass now delegates to it. See fa6fb9b.

No length-parity guard (P2): Pushing back. sizes and classes are derived from the same filtered array in inferPerGroupInfo, making length divergence structurally impossible. A guard for an impossible state adds dead code.

Missing regression test (P2, outside-diff): Accepted and implemented. Added two tests in fa6fb9b: one for per-group char class matching (de.svnr formatted/compact/dot-separated) and one for all-caps prose rejection (OF NOVEMBER 6, AS OF NOVEMBER 6, 2024, BY, AMENDMENT DATED NOVEMBER 6).

CC on behalf of @jan-kubica

@jan-kubica jan-kubica marked this pull request as ready for review March 26, 2026 18:56
@jan-kubica jan-kubica merged commit b776c5f into main Mar 26, 2026
7 of 10 checks passed
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

Open in Devin Review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant