
feat: add EU personal ID validators (Phase 2) #3

Merged: jan-kubica merged 8 commits into main from feat/eu-personal-ids on Mar 19, 2026

@jan-kubica (Contributor) commented Mar 18, 2026

Summary

14 new personal identification validators for KYC/AML compliance:
BE NN, BG EGN, DK CPR, EE IK, ES DNI, ES NIE, FI HETU, GR AMKA, IE PPS, LT Asmens, NL BSN, RO CNP, SE Personnummer, SI EMŠO.

Total: 56 validators, 27 countries, 310 unit tests.

Oracle results (200,000 random inputs, 4 languages, 11 libraries)

| Oracle | Language | Personal IDs | Disagreements |
| --- | --- | --- | --- |
| python-stdnum | Python | All 14 | 0 |
| stdnum-js | JS | All 14 | 9 countries (their bugs) |
| Rust crates | Rust | IBAN, Luhn | 0 |
| Ruby valvat | Ruby | DE, PL | 0 |
| JS ibantools, iban.js, luhn, fast-luhn | JS | 4 | 0 |

All stdnum-js disagreements confirmed as bugs in stdnum-js by python-stdnum tiebreaker.

Test plan

  • 310 unit tests pass
  • Lint clean
  • Oracle: 0 disagreements with python-stdnum on all formats
  • Oracle: cross-checked against stdnum-js (independent JS implementation)


14 new personal identification validators:
- BE: NN (National Number, mod-97 dual-century)
- BG: EGN (Unified Civil Number, weighted checksum)
- DK: CPR (Personal ID, date-only, no checksum)
- EE: IK (Isikukood, two-pass weighted checksum)
- ES: DNI (National ID, mod-23 letter)
- ES: NIE (Foreigner ID, prefix replacement + DNI)
- FI: HETU (Personal ID, mod-31 alphanumeric check)
- GR: AMKA (Social Security, Luhn)
- IE: PPS (Personal Public Service, mod-23)
- LT: Asmens kodas (Personal Code, reuses EE IK)
- NL: BSN (Citizen Service Number, 11-proof)
- RO: CNP (Personal Numeric Code, weighted + county)
- SE: Personnummer (Personal ID, Luhn + birth date)
- SI: EMŠO (Master Citizen Number, weighted mod-11)
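As an illustration of the "mod-23 letter" and "prefix replacement + DNI" entries above, here is a minimal sketch. The function names are hypothetical and not taken from this PR; the letter table is the publicly documented DNI table, and the NIE step maps X/Y/Z to 0/1/2 before reusing the DNI check.

```typescript
// Illustrative sketch only; names are hypothetical, not this PR's API.
const DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"; // 23 check letters

// Check letter for an 8-digit DNI body: index = number mod 23.
function dniCheckLetter(digits: string): string {
  return DNI_LETTERS.charAt(Number(digits) % 23);
}

function isValidDni(value: string): boolean {
  const v = value.toUpperCase();
  if (!/^\d{8}[A-Z]$/.test(v)) return false;
  return dniCheckLetter(v.slice(0, 8)) === v.charAt(8);
}

// NIE: replace the leading X/Y/Z with 0/1/2, then validate like a DNI.
function isValidNie(value: string): boolean {
  const v = value.toUpperCase();
  if (!/^[XYZ]\d{7}[A-Z]$/.test(v)) return false;
  const prefixDigit = String("XYZ".indexOf(v.charAt(0)));
  return isValidDni(prefixDigit + v.slice(1));
}
```

For example, `12345678` gives 12345678 mod 23 = 14, which selects `Z`, so `12345678Z` passes while `12345678A` does not.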

Oracle results (200,000 random inputs):
- python-stdnum: 0 disagreements on all 14
- stdnum-js (JS): disagreements on 9 countries
  (confirmed as stdnum-js bugs by python tiebreak)
- Rust, Ruby, other JS oracles: 0

56 total validators, 27 countries, 310 unit tests.

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR adds 14 EU personal ID validators (BE NN, BG EGN, DK CPR, EE IK, ES DNI, ES NIE, FI HETU, GR AMKA, IE PPS, LT Asmens, NL BSN, RO CNP, SE Personnummer, SI EMŠO), bringing the library to 56 validators across 27 countries. The implementations are generally well-structured, reuse shared utilities (weightedSum, isValidDate, twoPassCheck), and are cross-validated against python-stdnum with 0 disagreements on 200,000 random inputs. Several issues remain:

  • Time-dependent validation appears in three validators: be/nn.ts gates the 2000s-century checksum on new Date().getFullYear(), dk/cpr.ts rejects birth dates in the future, and se/personnummer.ts infers the century from the current year in getBirthDate. All three produce results that can change as the calendar year advances, making automated tests brittle and behaviour inconsistent across timezones.
  • RO CNP g=9 silently uses century 1900 and validates birth-date bytes without special-casing the way lt/asmens.ts does for the same digit value. If Romanian g=9 records encode placeholder date bytes (as the "incomplete registration" semantics imply), this will produce false INVALID_COMPONENT rejections that the all-random oracle is statistically unlikely to surface.
  • EE IK has a dead ?? 1900 fallback at line 70 — the guard above already constrains g to [1, 8], and all those values are present in centuryMap, so the nullish-coalescing default is never reached.
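One common way to remove the time dependence described in the first bullet is to pass the reference year in as a parameter instead of reading `new Date()` inside the validator. The sketch below assumes the usual Personnummer convention (`-` for holders under 100, `+` for 100 or older); `inferBirthYear` and `currentYear` are hypothetical names, not functions from this PR.

```typescript
// Hypothetical helper: infer the 4-digit birth year of a 10-digit Personnummer.
// The reference year is an explicit argument so tests can pin it.
function inferBirthYear(yy: number, sep: "-" | "+", referenceYear: number): number {
  let century = Math.floor(referenceYear / 100);
  if (yy > referenceYear % 100) century -= 1; // yy would otherwise lie in the future
  if (sep === "+") century -= 1;              // centenarian marker
  return century * 100 + yy;
}

// Production callers would pass the real year; UTC avoids timezone drift.
const currentYear = (): number => new Date().getUTCFullYear();
```

With the year pinned, the flip the review describes becomes an explicit, testable case: `inferBirthYear(27, "-", 2026)` yields 1927, while `inferBirthYear(27, "-", 2027)` yields 2027.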

Confidence Score: 3/5

  • The PR is broadly safe to merge but contains time-dependent validation logic in three validators and an unhandled g=9 edge-case in RO CNP that could produce false rejections in production.
  • The checksum algorithms and date-decoding tables are implemented correctly and backed by a 200,000-sample oracle. However, three validators (be/nn, dk/cpr, se/personnummer) embed new Date() calls that make validation results time-dependent, which will cause test brittleness and subtle production surprises as years advance. The RO CNP g=9 fallback to century 1900 without skipping date validation diverges from the pattern established by the nearly-identical lt/asmens.ts and risks false negatives for a valid class of Romanian CNPs. These issues lower confidence below "safe to merge as-is".
  • Pay close attention to src/ro/cnp.ts (g=9 date-validation bypass), src/dk/cpr.ts (future-date guard), src/be/nn.ts (year-gated century check), and src/se/personnummer.ts (time-dependent century inference).

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/be/nn.ts | Belgian National Number validator; mod-97 checksum logic is correct, but the 2000s-century gate (yy + 2000 <= new Date().getFullYear()) introduces time-dependent validation that can produce inconsistent results across timezones or test runs. |
| src/dk/cpr.ts | Danish CPR validator; century inference table is correct, but adds a future birth-date rejection guard not clearly specified in the CPR spec, making validation time-dependent and potentially diverging from python-stdnum for inputs encoding years 2027–2036. |
| src/ee/ik.ts | Estonian Isikukood validator; two-pass checksum is correctly implemented and reused by lt/asmens.ts. Minor: the ?? 1900 fallback on line 70 is dead code since g is constrained to [1,8] by the guard above. |
| src/ie/pps.ts | Irish PPS validator; length bound was corrected to > 9 in this PR. New-format (9-char) checksum contribution is implemented. The JS oracle (stdnum-js) still only generates 8-char inputs for cross-validation, leaving the new-format checksum branch uncovered by the JS oracle (already flagged in prior review thread). |
| src/ro/cnp.ts | Romanian CNP validator; g=9 passes the guard but falls through to centuryMap[9] ?? 1900 and proceeds to validate the date bytes, unlike lt/asmens.ts, which explicitly skips date validation for g=9. County code whitelist and checksum (special-case sum=10 → check=1) are correctly implemented. |
| src/se/personnummer.ts | Swedish Personnummer validator; compact correctly handles 10-digit, 11-char (with separator), and 12/13-char formats. Century inference in getBirthDate is time-dependent (uses new Date().getFullYear()), which can cause the inferred year for a `-`-separated input to flip as the real-world year advances. |
| src/si/emso.ts | Slovenian EMŠO validator; year threshold was corrected from 800 to 900 (per the JMBG standard) in this PR. The calcCheckDigit formula correctly mirrors Python's (-total % 11) % 10. |
| scripts/oracle.ts | Oracle cross-validation script; ES NIE and FI HETU full-separator coverage added to python SPECS. SE Personnummer now includes + separator and 12-digit generators. IE PPS JS_SPECS still generates only 8-char inputs. FI HETU JS_SPECS still only tests - and A separators. (Both already flagged in prior review threads.) |
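The EMŠO row notes that calcCheckDigit mirrors Python's `(-total % 11) % 10`. That needs care in TypeScript because `%` keeps the sign of the dividend, whereas Python's result takes the sign of the divisor. The sketch below shows one way to bridge the gap; `pyMod` and `emsoCheckDigit` are hypothetical names, and the weights are the commonly documented JMBG weights rather than values copied from this PR.

```typescript
// Python-style modulo: the result is always in [0, n) for positive n.
function pyMod(a: number, n: number): number {
  return ((a % n) + n) % n;
}

// Commonly documented JMBG/EMŠO weights (an assumption of this sketch).
const EMSO_WEIGHTS = [7, 6, 5, 4, 3, 2, 7, 6, 5, 4, 3, 2];

// Hypothetical equivalent of Python's (-total % 11) % 10 over the first 12 digits.
function emsoCheckDigit(first12: string): number {
  const total = EMSO_WEIGHTS.reduce(
    (acc, weight, i) => acc + weight * Number(first12.charAt(i)),
    0,
  );
  return pyMod(-total, 11) % 10;
}
```

For the digits `010100650000` the weighted total is 82; JavaScript evaluates `-82 % 11` as -5, while the Python-style result is 6, which is the check digit the formula expects.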

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Input["Raw input string"] --> Compact["compact()\nstrip separators / normalise"]
    Compact --> LenCheck{"length\nvalid?"}
    LenCheck -- No --> ErrLen["INVALID_LENGTH"]
    LenCheck -- Yes --> FmtCheck{"digits /\nletters valid?"}
    FmtCheck -- No --> ErrFmt["INVALID_FORMAT"]
    FmtCheck -- Yes --> CompCheck{"century /\ncomponent\nextraction"}
    CompCheck --> DateCheck{"isValidDate?\n(be/nn: month only;\ndk/cpr: + future guard;\nse: time-dependent century)"}
    DateCheck -- No --> ErrComp["INVALID_COMPONENT"]
    DateCheck -- Yes --> Checksum{"checksum\nvalid?\n(mod-97, Luhn,\nweighted-sum,\ntwo-pass, mod-23)"}
    Checksum -- No --> ErrCsum["INVALID_CHECKSUM"]
    Checksum -- Yes --> Valid["{ valid: true, compact }"]

    style ErrLen fill:#f66,color:#fff
    style ErrFmt fill:#f66,color:#fff
    style ErrComp fill:#f96,color:#fff
    style ErrCsum fill:#f66,color:#fff
    style Valid fill:#6a6,color:#fff
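The stages in the flowchart can be sketched as a single function that returns the first failing stage. This is an illustration only: it uses an AMKA-style number (11 digits, DDMMYY prefix, Luhn check digit) as the example, and `validateAmka`, `isRealDate`, `luhnOk`, and the `Result` shape are assumptions modelled on the error codes shown above, not the PR's actual code.

```typescript
type ErrorCode = "INVALID_LENGTH" | "INVALID_FORMAT" | "INVALID_COMPONENT" | "INVALID_CHECKSUM";
type Result = { valid: true; compact: string } | { valid: false; error: ErrorCode };

const fail = (error: ErrorCode): Result => ({ valid: false, error });

// True if (year, month, day) is a real calendar date.
function isRealDate(year: number, month: number, day: number): boolean {
  const d = new Date(Date.UTC(year, month - 1, day));
  return d.getUTCFullYear() === year && d.getUTCMonth() === month - 1 && d.getUTCDate() === day;
}

// Standard Luhn check over a digit string (every second digit from the right is doubled).
function luhnOk(digits: string): boolean {
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits.charAt(digits.length - 1 - i));
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Hypothetical validator following the flowchart: compact, length, format, component, checksum.
function validateAmka(input: string): Result {
  const compact = input.replace(/[\s-]/g, "");
  if (compact.length !== 11) return fail("INVALID_LENGTH");
  if (!/^\d+$/.test(compact)) return fail("INVALID_FORMAT");
  const day = Number(compact.slice(0, 2));
  const month = Number(compact.slice(2, 4));
  const yy = Number(compact.slice(4, 6));
  // The century is not encoded, so accept the date if it exists in either century.
  if (!isRealDate(1900 + yy, month, day) && !isRealDate(2000 + yy, month, day)) {
    return fail("INVALID_COMPONENT");
  }
  if (!luhnOk(compact)) return fail("INVALID_CHECKSUM");
  return { valid: true, compact };
}
```

Checking both centuries keeps this sketch free of any `new Date()` "now" lookup, which sidesteps the time dependence flagged in the review at the cost of a slightly looser date check.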

Last reviewed commit: "fix: address review ..."


- Fix IE PPS length check: max 9, not 10
- Fix BE NN error message: "0..12" not "1..12"
  (month 0 is valid for counter-exhaustion)
- Add ES NIE to oracle cross-validation
@jan-kubica
Contributor Author

All three concerns from the confidence score section have been addressed in commit 0cf7903:

  1. IE PPS length check: Fixed v.length > 10 → v.length > 9. Good catch.
  2. ES NIE oracle gap: Added ES NIE to python-stdnum oracle specs with X/Y/Z prefix arbitrary.
  3. BE NN error message: Fixed to "Month must be in 0..12" to match the actual allowed range.
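To make the PPS length fix in item 1 concrete, here is a rough sketch of a mod-23 PPS check with the corrected upper bound. It covers only the 8-character form and the 9-character form with an A/B/H suffix mentioned elsewhere in this PR; legacy suffixes are deliberately not handled. `ppsCheckLetter` and `isValidPps` are hypothetical names, and the weights (8 down to 2, with 9 for the suffix letter) are the commonly documented ones.

```typescript
const PPS_ALPHABET = "WABCDEFGHIJKLMNOPQRSTUV"; // index = value mod 23

// Hypothetical helper: check letter for 7 digits plus an optional suffix letter.
function ppsCheckLetter(digits: string, suffix = ""): string {
  let sum = 0;
  for (let i = 0; i < 7; i++) sum += (8 - i) * Number(digits.charAt(i));
  if (suffix !== "") sum += 9 * PPS_ALPHABET.indexOf(suffix); // A=1, B=2, H=8
  return PPS_ALPHABET.charAt(sum % 23);
}

function isValidPps(value: string): boolean {
  const v = value.toUpperCase();
  if (v.length > 9) return false; // corrected bound: at most 9 characters
  const m = /^(\d{7})([A-W])([ABH])?$/.exec(v);
  if (m === null) return false;
  const [, digits = "", check = "", suffix = ""] = m;
  return ppsCheckLetter(digits, suffix) === check;
}
```

With the old `> 10` bound, a 10-character input would have slipped past the length gate and only been caught (if at all) by later format checks.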

CC on behalf of @jan-kubica


- NL BSN: reject all-zeros "000000000" (python-
  stdnum rejects it; our mod-11 check incorrectly
  passed since 0 % 11 === 0)
- RO CNP: accept gender digit 9 (foreigners with
  temporary residence; python-stdnum accepts it)
- DK CPR: remove future date rejection (python-
  stdnum does not enforce this; CPR numbers can
  be pre-assigned for future births)
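The all-zeros case in the NL BSN item is easy to see in a small sketch of the 11-proof. The weights are 9 down to 2 for the first eight digits and -1 for the last, so an all-zero input sums to 0 and passes unless it is rejected explicitly. `isValidBsn` is a hypothetical name, and this version assumes an already-compacted 9-digit string.

```typescript
// Hypothetical sketch of the 11-proof with an explicit all-zeros guard.
function isValidBsn(value: string): boolean {
  if (!/^\d{9}$/.test(value)) return false;
  if (Number(value) === 0) return false; // sum would be 0, and 0 % 11 === 0
  let sum = 0;
  for (let i = 0; i < 8; i++) sum += (9 - i) * Number(value.charAt(i));
  sum -= Number(value.charAt(8)); // the last digit carries weight -1
  return sum % 11 === 0;
}
```

For `111222333` the weighted sum is 69 - 3 = 66, which is divisible by 11, so it passes; `000000000` is stopped by the guard rather than by the arithmetic.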

Added boundary value injection to oracle (all-zeros,
all-nines, off-by-one lengths, repeated digits).
This immediately caught 3 bugs:

- FR NIF: incorrectly rejected all-zeros
  (python-stdnum accepts: 0 % 511 == 0)
- BE VAT: incorrectly accepted all-zeros
  (python-stdnum rejects)
- NL VAT: missing zero-padding for numeric part
  (8-digit inputs like "41442283B01" must pad to
  "041442283B01" before validation)

Oracle `digs()` generator now mixes 70% random
values with 30% targeted edge cases (Hypothesis
strategy pattern). Every `digs(n)` call injects
all-zeros, all-nines, sequential digits, single
repeated digits, and off-by-one lengths.
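A rough sketch of what such a mixed generator might look like is shown here. It is not the oracle's real `digs()`: this version picks one edge case per call with roughly 30% probability, and it takes the random source as a parameter (`rng`, a hypothetical addition) so the behaviour can be pinned in tests.

```typescript
type Rng = () => number; // returns a float in [0, 1), like Math.random

// Targeted boundary values for a digit string of nominal length n.
function edgeCases(n: number): string[] {
  // Single repeated digits; this includes all-zeros and all-nines.
  const repeated = Array.from({ length: 10 }, (_, d) => String(d).repeat(n));
  const sequential = Array.from({ length: n }, (_, i) => String((i + 1) % 10)).join("");
  const offByOne = ["1".repeat(Math.max(n - 1, 0)), "1".repeat(n + 1)];
  return [...repeated, sequential, ...offByOne];
}

// Hypothetical generator: ~70% uniform random digits, ~30% targeted edge cases.
function digs(n: number, rng: Rng = Math.random): string {
  if (rng() < 0.3) {
    const cases = edgeCases(n);
    return cases[Math.floor(rng() * cases.length)] ?? "";
  }
  return Array.from({ length: n }, () => String(Math.floor(rng() * 10))).join("");
}
```

Purely uniform sampling would almost never produce `000000000` for a 9-digit format (about one chance in a billion per draw), which is why the boundary values have to be injected deliberately.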
Belgian VAT numbers before 2007 were 9 digits.
Official SPF Finances spec says older 9-digit
numbers should start with a leading zero. Added
zero-padding in compact().

Verified against:
- Official: finance.belgium.be (pre-2007 format)
- python-stdnum: compact('990246769') → '0990246769'
- jsvat: accepts both 9 and 10-digit forms
- Oracle: 5/5 runs with 0 disagreements

This bug was found by the Hypothesis-style edge
case injection (digs(9) generates 9-digit values
that exercise the padding path).
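The padding step can be sketched as follows, using the `990246769` example quoted above. `compactBeVat` and `beVatChecksumOk` are hypothetical names rather than the PR's functions, and the checksum shown is the commonly documented mod-97 rule for Belgian VAT numbers.

```typescript
// Hypothetical sketch of the padding step; not the PR's actual compact().
function compactBeVat(input: string): string {
  let v = input.toUpperCase().replace(/[\s.\-]/g, "");
  if (v.startsWith("BE")) v = v.slice(2);
  // Pre-2007 numbers had 9 digits; normalise them with a leading zero.
  if (/^\d{9}$/.test(v)) v = "0" + v;
  return v;
}

// Mod-97 check: the last two digits equal 97 - (first eight digits mod 97).
function beVatChecksumOk(compact: string): boolean {
  if (!/^\d{10}$/.test(compact)) return false;
  const body = Number(compact.slice(0, 8));
  const check = Number(compact.slice(8));
  return 97 - (body % 97) === check;
}
```

Without the padding, the 9-digit form fails the length gate before the checksum is ever evaluated, even though the padded form `0990246769` satisfies the mod-97 rule.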

Mutant testing: for each valid value, corrupt
single digits and verify the checksum rejects them.
Proves checksum strength per algorithm:

100% detection: IBAN (mod-97), Luhn, DE VAT
  (ISO 7064), NL BSN (11-proof), HR OIB, PL NIP,
  FR SIREN, IT IVA, BE NN
~96-98%: CZ IČO, CZ RČ, EE IK, SI EMŠO, GB UTR
  (inherent mod-11 limitation, not bugs)

Also:
- Bump default sample count from 2K to 10K
- Configurable via ORACLE_SAMPLES env var
- Mutant escapes are informational, not failures
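The mutant-testing idea can be sketched in a few lines: generate every single-digit substitution of a known-valid value and measure how many the validator rejects. The helper names below are hypothetical, and Luhn is used only as a sample validator because its 100% single-digit detection is easy to confirm.

```typescript
// All strings that differ from `value` in exactly one digit position.
function singleDigitMutants(value: string): string[] {
  const out: string[] = [];
  for (let i = 0; i < value.length; i++) {
    const ch = value.charAt(i);
    if (ch < "0" || ch > "9") continue; // only corrupt digit positions
    for (let d = 0; d <= 9; d++) {
      if (String(d) !== ch) out.push(value.slice(0, i) + String(d) + value.slice(i + 1));
    }
  }
  return out;
}

// Fraction of mutants the validator rejects (1 means every corruption was caught).
function detectionRate(valid: string, isValid: (s: string) => boolean): number {
  const mutants = singleDigitMutants(valid);
  const caught = mutants.filter((m) => !isValid(m)).length;
  return caught / mutants.length;
}

// Plain Luhn check, used here only as a sample validator.
function luhnValid(digits: string): boolean {
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits.charAt(digits.length - 1 - i));
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}
```

Calling `detectionRate("79927398713", luhnValid)` exercises 99 mutants (11 positions times 9 alternative digits). A rate below 1 for another algorithm would correspond to the informational "mutant escapes" mentioned above rather than to a test failure.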

Extract duplicated code into shared modules:

- _util/date.ts: isValidDate (was in 11 files)
- _util/result.ts: err() helper (was in 56 files)
- _checksums/mod1110.ts: ISO 7064 Mod 11,10
  (was in de/vat, de/idnr, hr/vat)
- Replace 14 inline weighted-sum loops with
  shared weightedSum (LV personal kept inline:
  non-zero initial sum incompatible with shared fn)
- Hoist centuryMap to module level in ee/ik, ro/cnp
- Fix import paths in de/vat, de/idnr (relative →
  #util/* aliases)
- Restore DK CPR future date rejection (python-
  stdnum does enforce it, contrary to earlier claim)

311 tests pass, oracle verified.

- SI EMŠO: fix year threshold from 800 to 900 per
  official JMBG standard (Wikipedia, JMBG spec).
  python-stdnum uses 800 but the standard says 900.
  No practical difference (800-899 range has no
  living citizens) but matches the official spec.
- Oracle: expand IE PPS to cover 9-char new format
  (7 digits + check letter + A/B/H)
- Oracle: expand FI HETU to cover all 13 separators
  (+, -, Y, X, W, V, U, A, B, C, D, E, F)
- Oracle: expand SE Personnummer to cover + separator
  and 12-digit format
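For context on the 13 HETU separators listed above, here is a minimal sketch of how a separator can map to a century and how the mod-31 check character is computed. The century mapping (`+` for the 1800s, `-`/Y/X/W/V/U for the 1900s, A to F for the 2000s) is stated as an assumption of this sketch, the function names are hypothetical, and date validation is omitted for brevity.

```typescript
const HETU_CHECK = "0123456789ABCDEFHJKLMNPRSTUVWXY"; // 31 check characters

// Century implied by each separator (an assumption of this sketch).
function hetuCentury(sep: string): number | undefined {
  if (sep.length !== 1) return undefined;
  if (sep === "+") return 1800;
  if ("-YXWVU".includes(sep)) return 1900;
  if ("ABCDEF".includes(sep)) return 2000;
  return undefined;
}

// Checks the separator and the mod-31 check character; the date itself is not validated here.
function isValidHetu(value: string): boolean {
  const m = /^(\d{6})([-+A-FU-Y])(\d{3})([0-9A-Y])$/.exec(value.toUpperCase());
  if (m === null) return false;
  const [, date = "", sep = "", individual = "", check = ""] = m;
  if (hetuCentury(sep) === undefined) return false;
  // The check character depends on the nine digits only, not on the separator.
  return HETU_CHECK.charAt(Number(date + individual) % 31) === check;
}
```

Because the check character ignores the separator, an oracle that only generates `-` and `A` inputs never exercises the other eleven separator branches, which is the coverage gap this commit closes.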
Comment thread src/dk/cpr.ts
Comment thread src/be/nn.ts
Comment thread src/ro/cnp.ts
Comment thread src/se/personnummer.ts
@jan-kubica jan-kubica merged commit 45055de into main Mar 19, 2026
5 of 6 checks passed