feat: add EU personal ID validators (Phase 2) #3
Merged
jan-kubica merged 8 commits into main on Mar 19, 2026
14 new personal identification validators:
- BE: NN (National Number, mod-97 dual-century)
- BG: EGN (Unified Civil Number, weighted checksum)
- DK: CPR (Personal ID, date-only, no checksum)
- EE: IK (Isikukood, two-pass weighted checksum)
- ES: DNI (National ID, mod-23 letter)
- ES: NIE (Foreigner ID, prefix replacement + DNI)
- FI: HETU (Personal ID, mod-31 alphanumeric check)
- GR: AMKA (Social Security, Luhn)
- IE: PPS (Personal Public Service, mod-23)
- LT: Asmens kodas (Personal Code, reuses EE IK)
- NL: BSN (Citizen Service Number, 11-proof)
- RO: CNP (Personal Numeric Code, weighted + county)
- SE: Personnummer (Personal ID, Luhn + birth date)
- SI: EMŠO (Master Citizen Number, weighted mod-11)

Oracle results (200,000 random inputs):
- python-stdnum: 0 disagreements on all 14
- stdnum-js (JS): disagreements on 9 countries (confirmed as stdnum-js bugs by python tiebreak)
- Rust, Ruby, other JS oracles: 0

56 total validators, 27 countries, 310 unit tests.
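To illustrate the "mod-23 letter" and "prefix replacement + DNI" entries in the list above, here is a minimal sketch of how the two Spanish checks relate. The function names are hypothetical and are not taken from the PR's source files; the letter table and the X/Y/Z to 0/1/2 mapping are the commonly documented DNI/NIE rules.

```typescript
// Lookup table for the DNI check letter, indexed by (number mod 23).
const DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE";

// Hypothetical helper: check letter for an 8-digit DNI body.
function dniCheckLetter(body: string): string {
  return DNI_LETTERS.charAt(Number(body) % 23);
}

function isValidDni(input: string): boolean {
  const value = input.replace(/[\s-]/g, "").toUpperCase();
  if (!/^\d{8}[A-Z]$/.test(value)) return false;
  return dniCheckLetter(value.slice(0, 8)) === value.charAt(8);
}

function isValidNie(input: string): boolean {
  const value = input.replace(/[\s-]/g, "").toUpperCase();
  if (!/^[XYZ]\d{7}[A-Z]$/.test(value)) return false;
  // Prefix replacement: X -> 0, Y -> 1, Z -> 2, then the DNI rule applies.
  const prefix = String("XYZ".indexOf(value.charAt(0)));
  return isValidDni(prefix + value.slice(1));
}
```

Routing the NIE through the DNI check keeps the letter table in one place, which is presumably why the PR describes NIE as "prefix replacement + DNI".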
| Filename | Overview |
|---|---|
| src/be/nn.ts | Belgian National Number validator; mod-97 checksum logic is correct, but the 2000s-century gate (yy + 2000 <= new Date().getFullYear()) introduces time-dependent validation that can produce inconsistent results across timezones or test runs. |
| src/dk/cpr.ts | Danish CPR validator; century inference table is correct, but adds a future birth-date rejection guard not clearly specified in the CPR spec, making validation time-dependent and potentially diverging from python-stdnum for inputs encoding years 2027–2036. |
| src/ee/ik.ts | Estonian Isikukood validator; two-pass checksum is correctly implemented and reused by lt/asmens.ts. Minor: the ?? 1900 fallback on line 70 is dead code since g is constrained to [1,8] by the guard above. |
| src/ie/pps.ts | Irish PPS validator; length bound was corrected to > 9 in this PR. New-format (9-char) checksum contribution is implemented. The JS oracle (stdnum-js) still only generates 8-char inputs for cross-validation, leaving the new-format checksum branch uncovered by the JS oracle (already flagged in prior review thread). |
| src/ro/cnp.ts | Romanian CNP validator; g=9 passes the guard but falls through to centuryMap[9] ?? 1900 and proceeds to validate the date bytes — unlike lt/asmens.ts which explicitly skips date validation for g=9. County code whitelist and checksum (special-case sum=10 → check=1) are correctly implemented. |
| src/se/personnummer.ts | Swedish Personnummer validator; compact correctly handles 10-digit, 11-char (with separator), and 12/13-char formats. Century inference in getBirthDate is time-dependent (uses new Date().getFullYear()), which can cause the inferred year for a hyphen-separated input to flip as the real-world year advances. |
| src/si/emso.ts | Slovenian EMŠO validator; year threshold was corrected from 800 to 900 (per the JMBG standard) in this PR. The calcCheckDigit formula correctly mirrors Python's (-total % 11) % 10. |
| scripts/oracle.ts | Oracle cross-validation script; ES NIE and FI HETU full-separator coverage added to python SPECS. SE Personnummer now includes + separator and 12-digit generators. IE PPS JS_SPECS still generates only 8-char inputs. FI HETU JS_SPECS still only tests - and A separators. (Both already flagged in prior review threads.) |
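The time-dependence noted for be/nn.ts, dk/cpr.ts, and se/personnummer.ts can be made testable by passing the reference year in rather than reading the clock inside the validator. The sketch below shows the idea for the Belgian dual-century mod-97 check; `beNnCentury` and its `currentYear` parameter are hypothetical and not part of the PR.

```typescript
// Hypothetical result of a century-aware mod-97 check.
type Century = 1900 | 2000 | null;

// Returns the century under which the checksum holds, or null if neither does.
// `currentYear` is injected instead of read from `new Date()`, so the same
// input always yields the same answer for a given reference year.
function beNnCentury(compact: string, currentYear: number): Century {
  if (!/^\d{11}$/.test(compact)) return null;
  const body = compact.slice(0, 9);
  const check = Number(compact.slice(9));
  const yy = Number(compact.slice(0, 2));

  // Born before 2000: checksum over the 9-digit body.
  if (97 - (Number(body) % 97) === check) return 1900;

  // Born 2000 or later: checksum over "2" + body, gated on the reference year.
  if (yy + 2000 <= currentYear && 97 - (Number("2" + body) % 97) === check) {
    return 2000;
  }
  return null;
}
```

With the year as an argument, a test can pin it (for example 2016 versus 2026) and assert both outcomes for the same input, which is not possible when the gate calls `new Date()` directly.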
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
Input["Raw input string"] --> Compact["compact()\nstrip separators / normalise"]
Compact --> LenCheck{"length\nvalid?"}
LenCheck -- No --> ErrLen["INVALID_LENGTH"]
LenCheck -- Yes --> FmtCheck{"digits /\nletters valid?"}
FmtCheck -- No --> ErrFmt["INVALID_FORMAT"]
FmtCheck -- Yes --> CompCheck{"century /\ncomponent\nextraction"}
CompCheck --> DateCheck{"isValidDate?\n(be/nn: month only;\ndk/cpr: + future guard;\nse: time-dependent century)"}
DateCheck -- No --> ErrComp["INVALID_COMPONENT"]
DateCheck -- Yes --> Checksum{"checksum\nvalid?\n(mod-97, Luhn,\nweighted-sum,\ntwo-pass, mod-23)"}
Checksum -- No --> ErrCsum["INVALID_CHECKSUM"]
Checksum -- Yes --> Valid["{ valid: true, compact }"]
style ErrLen fill:#f66,color:#fff
style ErrFmt fill:#f66,color:#fff
style ErrComp fill:#f96,color:#fff
style ErrCsum fill:#f66,color:#fff
style Valid fill:#6a6,color:#fff
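The stages in the flowchart above can be sketched as a single function that returns early with the first failing error code. This is an illustration only: it uses GR AMKA (Luhn) as the example, the type and function names are hypothetical, and the component step is simplified to day and month range checks rather than a full calendar date check.

```typescript
type ErrorCode =
  | "INVALID_LENGTH"
  | "INVALID_FORMAT"
  | "INVALID_COMPONENT"
  | "INVALID_CHECKSUM";

type Result =
  | { valid: true; compact: string }
  | { valid: false; error: ErrorCode };

// Standard Luhn check over a digit string (check digit included).
function luhnValid(digits: string): boolean {
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

function validateAmka(input: string): Result {
  const compact = input.replace(/[\s-]/g, "");
  if (compact.length !== 11) return { valid: false, error: "INVALID_LENGTH" };
  if (!/^\d+$/.test(compact)) return { valid: false, error: "INVALID_FORMAT" };
  // Simplified component check (assumption): day and month ranges only.
  const day = Number(compact.slice(0, 2));
  const month = Number(compact.slice(2, 4));
  if (day < 1 || day > 31 || month < 1 || month > 12) {
    return { valid: false, error: "INVALID_COMPONENT" };
  }
  if (!luhnValid(compact)) return { valid: false, error: "INVALID_CHECKSUM" };
  return { valid: true, compact };
}
```

The order of the checks matters for the reported error: an input with both a bad month and a bad check digit reports INVALID_COMPONENT, matching the flowchart's ordering.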
Last reviewed commit: "fix: address review ..."
- Fix IE PPS length check: max 9, not 10
- Fix BE NN error message: "0..12" not "1..12" (month 0 is valid for counter-exhaustion)
- Add ES NIE to oracle cross-validation
All three concerns from the confidence score section have been addressed in commit 0cf7903:
CC on behalf of @jan-kubica
- NL BSN: reject all-zeros "000000000" (python-stdnum rejects it; our mod-11 check incorrectly passed since 0 % 11 === 0)
- RO CNP: accept gender digit 9 (foreigners with temporary residence; python-stdnum accepts it)
- DK CPR: remove future date rejection (python-stdnum does not enforce this; CPR numbers can be pre-assigned for future births)
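The NL BSN all-zeros case is worth spelling out, since the checksum alone cannot catch it. Below is a minimal sketch of the 11-proof with an explicit zero guard; `isValidBsn` is a hypothetical name, and the input is assumed to be already compacted to 9 characters.

```typescript
// 11-proof weights: the last digit is weighted -1.
const BSN_WEIGHTS = [9, 8, 7, 6, 5, 4, 3, 2, -1];

function isValidBsn(compact: string): boolean {
  if (!/^\d{9}$/.test(compact)) return false;
  // Explicit guard: the weighted sum of all zeros is 0, and 0 % 11 === 0,
  // so the checksum alone would accept "000000000".
  if (/^0+$/.test(compact)) return false;
  const total = BSN_WEIGHTS.reduce(
    (sum, weight, i) => sum + weight * (compact.charCodeAt(i) - 48),
    0,
  );
  return total % 11 === 0;
}
```

Any checksum of the form "weighted sum is divisible by k" accepts the all-zeros input, so the same guard is worth considering for other validators with that shape.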
Added boundary value injection to oracle (all-zeros, all-nines, off-by-one lengths, repeated digits). This immediately caught 3 bugs:
- FR NIF: incorrectly rejected all-zeros (python-stdnum accepts: 0 % 511 == 0)
- BE VAT: incorrectly accepted all-zeros (python-stdnum rejects)
- NL VAT: missing zero-padding for numeric part (8-digit inputs like "41442283B01" must pad to "041442283B01" before validation)

Oracle `digs()` generator now mixes 70% random values with 30% targeted edge cases (Hypothesis strategy pattern). Every `digs(n)` call injects all-zeros, all-nines, sequential digits, single repeated digits, and off-by-one lengths.
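A generator with the 70/30 mix described above might look roughly like the following. This is a sketch under assumptions, not the actual `digs()` from scripts/oracle.ts: the `edgeCases` helper, the injectable `rng` parameter, and the particular repeated digits chosen are all illustrative.

```typescript
type Rng = () => number; // returns a float in [0, 1)

// Targeted boundary values for an n-digit field (illustrative selection).
function edgeCases(n: number): string[] {
  const sequential = Array.from({ length: n }, (_, i) => String((i + 1) % 10)).join("");
  const repeated = ["1", "5", "7"].map((d) => d.repeat(n));
  return [
    "0".repeat(n), // all zeros
    "9".repeat(n), // all nines
    sequential, // 1234...
    ...repeated, // single repeated digit
    "1".repeat(Math.max(n - 1, 0)), // one digit short
    "1".repeat(n + 1), // one digit long
  ];
}

function digs(n: number, rng: Rng = Math.random): string {
  if (rng() < 0.7) {
    // About 70% of calls: uniformly random digits.
    return Array.from({ length: n }, () => String(Math.floor(rng() * 10))).join("");
  }
  // About 30% of calls: pick one of the targeted edge cases.
  const cases = edgeCases(n);
  return cases[Math.floor(rng() * cases.length)] ?? "0".repeat(n);
}
```

Passing the random source as a parameter lets a test force either branch with a stub, for example `digs(4, () => 0.9)` to exercise the edge-case path.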
Belgian VAT numbers before 2007 were 9 digits.
Official SPF Finances spec says older 9-digit
numbers should start with a leading zero. Added
zero-padding in compact().
Verified against:
- Official: finance.belgium.be (pre-2007 format)
- python-stdnum: compact('990246769') → '0990246769'
- jsvat: accepts both 9 and 10-digit forms
- Oracle: 5/5 runs with 0 disagreements
This bug was found by the Hypothesis-style edge
case injection (digs(9) generates 9-digit values
that exercise the padding path).
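The padding step described in this commit can be sketched as follows. `compactBeVat` is a hypothetical name, and the set of separators stripped here is an assumption; only the 9-to-10-digit padding is taken from the commit message.

```typescript
function compactBeVat(input: string): string {
  // Strip common separators (assumed set) and normalise case.
  let value = input.replace(/[\s.\-/]/g, "").toUpperCase();
  // Drop the country prefix if present.
  if (value.startsWith("BE")) value = value.slice(2);
  // Pre-2007 numbers were 9 digits; normalise to the 10-digit form.
  if (value.length === 9) value = "0" + value;
  return value;
}
```

Doing the padding in `compact()` rather than in the validator means every later stage (length, format, checksum) only has to handle the 10-digit form.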
Mutant testing: for each valid value, corrupt single digits and verify the checksum rejects them. Proves checksum strength per algorithm:
- 100% detection: IBAN (mod-97), Luhn, DE VAT (ISO 7064), NL BSN (11-proof), HR OIB, PL NIP, FR SIREN, IT IVA, BE NN
- ~96-98%: CZ IČO, CZ RČ, EE IK, SI EMŠO, GB UTR (inherent mod-11 limitation, not bugs)

Also:
- Bump default sample count from 2K to 10K
- Configurable via ORACLE_SAMPLES env var
- Mutant escapes are informational, not failures
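The mutant-testing idea can be sketched with Luhn as the checksum under test. The helper names below are hypothetical and the real oracle script may structure this differently; the sketch only covers single-digit substitutions, which is the corruption the commit message describes.

```typescript
// Standard Luhn check over a digit string (check digit included).
function luhnValid(digits: string): boolean {
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// All single-digit substitutions of a digit string.
function singleDigitMutants(value: string): string[] {
  const mutants: string[] = [];
  for (let i = 0; i < value.length; i++) {
    for (let d = 0; d <= 9; d++) {
      const digit = String(d);
      if (digit !== value.charAt(i)) {
        mutants.push(value.slice(0, i) + digit + value.slice(i + 1));
      }
    }
  }
  return mutants;
}

// Fraction of mutants the validator rejects (1 means every corruption is caught).
function detectionRate(valid: string, isValid: (s: string) => boolean): number {
  const mutants = singleDigitMutants(valid);
  const caught = mutants.filter((m) => !isValid(m)).length;
  return caught / mutants.length;
}
```

For example, `detectionRate("79927398713", luhnValid)` evaluates all 99 single-digit mutants of a Luhn-valid number. Because `detectionRate` takes the validator as an argument, the same harness can be pointed at any checksum in the list.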
Extract duplicated code into shared modules:
- _util/date.ts: isValidDate (was in 11 files)
- _util/result.ts: err() helper (was in 56 files)
- _checksums/mod1110.ts: ISO 7064 Mod 11,10 (was in de/vat, de/idnr, hr/vat)
- Replace 14 inline weighted-sum loops with shared weightedSum (LV personal kept inline: non-zero initial sum incompatible with shared fn)
- Hoist centuryMap to module level in ee/ik, ro/cnp
- Fix import paths in de/vat, de/idnr (relative → #util/* aliases)
- Restore DK CPR future date rejection (python-stdnum does enforce it, contrary to earlier claim)

311 tests pass, oracle verified.
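A shared weighted-sum helper of the kind mentioned above could look like the sketch below, shown together with an EMŠO-style check digit that mirrors Python's `(-total % 11) % 10`. The signatures are assumptions, not the PR's actual `weightedSum`; `pyMod` is a hypothetical helper added because JavaScript's `%` keeps the sign of the dividend while Python's takes the sign of the divisor.

```typescript
// Sum of digit[i] * weights[i]; assumes `digits` has at least weights.length digits.
function weightedSum(digits: string, weights: readonly number[]): number {
  return weights.reduce(
    (sum, weight, i) => sum + weight * (digits.charCodeAt(i) - 48),
    0,
  );
}

// Python-style modulo: the result is always in [0, n) for positive n.
function pyMod(a: number, n: number): number {
  return ((a % n) + n) % n;
}

const EMSO_WEIGHTS = [7, 6, 5, 4, 3, 2, 7, 6, 5, 4, 3, 2];

// Mirrors Python's (-total % 11) % 10 over the first 12 digits.
function emsoCheckDigit(first12: string): number {
  return pyMod(-weightedSum(first12, EMSO_WEIGHTS), 11) % 10;
}
```

A direct port that writes `(-total % 11) % 10` in TypeScript would yield a negative value for most inputs, which is the kind of divergence the oracle comparison against python-stdnum is designed to surface.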
- SI EMŠO: fix year threshold from 800 to 900 per official JMBG standard (Wikipedia, JMBG spec). python-stdnum uses 800 but the standard says 900. No practical difference (800-899 range has no living citizens) but matches the official spec.
- Oracle: expand IE PPS to cover 9-char new format (7 digits + check letter + A/B/H)
- Oracle: expand FI HETU to cover all 13 separators (+, -, Y, X, W, V, U, A, B, C, D, E, F)
- Oracle: expand SE Personnummer to cover + separator and 12-digit format
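To show why all 13 HETU separators need coverage, here is a minimal sketch of how the separator selects the century before the mod-31 check runs. The names are hypothetical, the sketch does not restrict the individual-number range or reject future dates, and it is not the code in src/fi/hetu.ts.

```typescript
// Check characters indexed by (DDMMYYNNN mod 31).
const HETU_CHECK_CHARS = "0123456789ABCDEFHJKLMNPRSTUVWXY";

// Century implied by each of the 13 separator characters.
function hetuCentury(sep: string): number | null {
  if (sep.length !== 1) return null;
  if (sep === "+") return 1800;
  if ("-YXWVU".includes(sep)) return 1900;
  if ("ABCDEF".includes(sep)) return 2000;
  return null;
}

function isValidHetu(input: string): boolean {
  const m = /^(\d{2})(\d{2})(\d{2})([+\-YXWVUABCDEF])(\d{3})([0-9A-Y])$/.exec(
    input.toUpperCase(),
  );
  if (m === null) return false;
  const [, dd = "", mm = "", yy = "", sep = "", individual = "", check = ""] = m;
  const century = hetuCentury(sep);
  if (century === null) return false;
  // Calendar check via a UTC round trip, so the result is timezone-independent.
  const year = century + Number(yy);
  const month = Number(mm) - 1;
  const day = Number(dd);
  const date = new Date(Date.UTC(year, month, day));
  if (
    date.getUTCFullYear() !== year ||
    date.getUTCMonth() !== month ||
    date.getUTCDate() !== day
  ) {
    return false;
  }
  return HETU_CHECK_CHARS.charAt(Number(dd + mm + yy + individual) % 31) === check;
}
```

The check character does not depend on the separator, so an oracle that only generates `-` and `A` would never exercise the century lookup for the other eleven characters.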
Summary
14 new personal identification validators for KYC/AML compliance:
BE NN, BG EGN, DK CPR, EE IK, ES DNI, ES NIE, FI HETU, GR AMKA, IE PPS, LT Asmens, NL BSN, RO CNP, SE Personnummer, SI EMŠO.
Total: 56 validators, 27 countries, 310 unit tests.
Oracle results (200,000 random inputs, 4 languages, 11 libraries)
All stdnum-js disagreements confirmed as bugs in stdnum-js by python-stdnum tiebreaker.
Test plan