fix: per-group char classes in toRegex to prevent false positives #76
jan-kubica merged 2 commits into main
toRegex() inferred a single character class for all positions from
examples. For validators where the compact form mixes letters and
digits in distinct positions (e.g., German SVNR "12010188M011"),
this produced [A-Z0-9] for every group. The overly broad pattern
matched all-caps prose like "OF NOVEMBER 6" as a valid candidate,
consuming the span and preventing the correct date pattern from
firing.
Add inferPerGroupInfo() which derives per-group character classes
from the formatted output. For SVNR, "12 010188 M 01 1" now
produces \d{2} \d{6} [A-Z]{1} \d{2} \d{1} instead of [A-Z0-9]
for all positions. Letter-only groups before the first digit group
(format-prepended prefixes like "CHE") are still excluded.
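The per-group inference described above could look roughly like the sketch below. The helper names (`classFor`, `perGroupInfo`, `toPatternSource`) and the `GroupInfo` type are hypothetical and not taken from the PR; the real `charClassFor` and `inferPerGroupInfo` may differ, and the prefix exclusion for leading letter-only groups is left out here.

```typescript
// Hypothetical names throughout; this is an illustration, not the code in src/patterns.ts.
type GroupInfo = { charClass: string; length: number };

// Pick the narrowest class that covers every character in one group.
function classFor(group: string): string {
  if (/^[0-9]+$/.test(group)) return "\\d";
  if (/^[A-Z]+$/.test(group)) return "[A-Z]";
  return "[A-Z0-9]"; // fallback for mixed groups
}

// Split the formatted output on separators and describe each group.
function perGroupInfo(formatted: string): GroupInfo[] {
  return formatted
    .split(/[\s\-./]+/)
    .filter((g) => g.length > 0)
    .map((g) => ({ charClass: classFor(g), length: g.length }));
}

// Join the groups with an optional separator and wrap in word boundaries.
function toPatternSource(groups: GroupInfo[]): string {
  const SEP = "[\\s\\-./]?";
  const body = groups.map((g) => `${g.charClass}{${g.length}}`).join(SEP);
  return "\\b" + body + "\\b";
}

const source = toPatternSource(perGroupInfo("12 010188 M 01 1"));
console.log(source);
```

Because each group keeps its own class, a letter can only appear where the formatted example had one.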
| Filename | Overview |
|---|---|
| src/patterns.ts | Adds charClassFor, inferPerGroupInfo, and groupsToPatternPerClass to derive tighter per-group regex patterns for mixed letter/digit validators; toRegex now prefers the per-group path when available. Two prior P2 notes (duplication with inferCharClass, missing length-guard in groupsToPatternPerClass) remain unaddressed. |
| test/patterns.test.ts | Adds two focused regression tests for de.svnr: one verifying correct per-group matching (compact, spaced, dot-separated) and one verifying that all-caps prose is rejected. Good coverage for the targeted bug. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[toRegex called] --> B[inferPrefix / inferGroups / inferLengths / inferCharClass]
B --> C[inferPerGroupInfo]
C --> D{perGroup not null AND lengths le 1}
D -- Yes --> E[groupsToPatternPerClass - per-group classes]
D -- No --> F{groups not null AND lengths le 1}
F -- Yes --> G[groupsToPattern - single char class]
F -- No --> H{lengths == 1}
H -- Yes --> I[flat cc-N pattern]
H -- No --> J{lengths gt 1}
J -- Yes --> K[cc min-max range]
J -- No --> L[cc 6-20 fallback]
E --> M{prefix exists}
G --> M
I --> M
K --> M
L --> M
M -- Yes --> N[prepend prefix + SEP]
M -- No --> O[wrap in word boundaries]
N --> O
O --> P[return RegExp]
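The branch order in the flowchart can be read as a simple cascade. The sketch below is one interpretation, not the PR's code: `choosePath` and `flatBody` are hypothetical names, `lengths` is assumed to be the list of distinct compact lengths seen across examples, and "cc-N", "cc min-max" and "cc 6-20" are assumed to mean the quantifiers `{N}`, `{min,max}` and `{6,20}` applied to the shared character class.

```typescript
// Hypothetical names; mirrors the flowchart's decision order only.
type Path = "per-group" | "single-class-groups" | "flat" | "range" | "fallback";

function choosePath(hasPerGroup: boolean, hasGroups: boolean, lengths: number[]): Path {
  if (hasPerGroup && lengths.length <= 1) return "per-group";
  if (hasGroups && lengths.length <= 1) return "single-class-groups";
  if (lengths.length === 1) return "flat";
  if (lengths.length > 1) return "range";
  return "fallback";
}

// Body for the three ungrouped branches, given a shared character class cc.
function flatBody(cc: string, lengths: number[]): string {
  if (lengths.length === 1) return `${cc}{${lengths[0]}}`;
  if (lengths.length > 1) return `${cc}{${Math.min(...lengths)},${Math.max(...lengths)}}`;
  return `${cc}{6,20}`;
}

console.log(choosePath(true, true, [12]));
console.log(flatBody("[A-Z0-9]", [9, 11]));
```

Under this reading, the per-group path is tried first and the older single-class paths remain as fallbacks for validators with several lengths or no group structure.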
Reviews (2): Last reviewed commit: "fix: address review comments"
- Extract shared charClassFor helper; inferCharClass now delegates to it instead of duplicating the letter/digit scanning logic
- Add regression tests for de.svnr: per-group char class matching and all-caps prose rejection
Addressing the Greptile summary items:
- Duplicate scanning logic (P2): Accepted. Extracted
- No length-parity guard (P2): Pushing back.
- Missing regression test (P2, outside-diff): Accepted and implemented. Added two tests in fa6fb9b: one for per-group char class matching (

CC on behalf of @jan-kubica
Summary
- `toRegex()` inferred a single character class (`[A-Z0-9]`) for all groups when the compact form mixed letters and digits. For `de.svnr` (German SSN), this matched all-caps prose like "OF NOVEMBER 6" as a candidate, consuming the span and blocking the correct date pattern.
- Add `inferPerGroupInfo()`, which derives per-group character classes from `format()` output. For SVNR "12 010188 M 01 1", this produces `\d{2} \d{6} [A-Z]{1} \d{2} \d{1}` instead of `[A-Z0-9]` everywhere.
- Letter-only groups before the first digit group (format-prepended prefixes like "CHE") are excluded; they are handled by `inferPrefix` separately.

Test plan
- `bun test __test__/patterns.test.ts`: all 17 pattern tests pass (includes `de.svnr`, `ch.uid`, multi-length, prefix patterns)
- `bun test`: full suite (4185 tests) passes
- `toRegex(de.svnr)` produces `\d{2}[\s\-./]?\d{6}[\s\-./]?[A-Z]{1}[\s\-./]?\d{2}[\s\-./]?\d{1}` and rejects "OF NOVEMBER 6"
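The last check in the test plan can be reproduced by hand. The sketch below rebuilds the pattern quoted there, wraps it in `\b` as the flowchart's final step suggests (an assumption about the exact output), and runs it against the compact, spaced and dot-separated forms plus the all-caps prose. The `broad` pattern is only a stand-in showing how a shared `[A-Z0-9]` class over-matches; it is not the exact pre-fix output, and "ACKNOWLEDGED" is an illustrative input not taken from the PR.

```typescript
// Optional separator between groups, as quoted in the test plan.
const SEP = "[\\s\\-./]?";

// Per-group pattern from the test plan; the \b wrapping is assumed.
const tight = new RegExp(
  "\\b" + ["\\d{2}", "\\d{6}", "[A-Z]{1}", "\\d{2}", "\\d{1}"].join(SEP) + "\\b"
);

// Stand-in for a single shared class over the same group lengths (assumption).
const broad = new RegExp(
  "\\b" + [2, 6, 1, 2, 1].map((n) => `[A-Z0-9]{${n}}`).join(SEP) + "\\b"
);

const valid = ["12010188M011", "12 010188 M 01 1", "12.010188.M.01.1"];
for (const s of valid) console.log(s, tight.test(s)); // each prints true

console.log(tight.test("OF NOVEMBER 6")); // false: no leading digit pair
console.log(broad.test("ACKNOWLEDGED")); // true: any 12 capitals fit the shared class
console.log(tight.test("ACKNOWLEDGED")); // false
```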