chore(oracle): add brazilian-utils, rut.js, and django-localflavor comparators #109

Merged
jan-kubica merged 5 commits into main from chore/oracle-add-comparators on Jun 3, 2026

Conversation

@jan-kubica
Contributor

Summary

Expands scripts/oracle.ts with three new cross-validation backends that probe identifier families not previously covered by python-stdnum / jsvat / stdnum-js.

  • @brazilian-utils/brazilian-utils (JS) — covers br.cpf, br.cnpj (with version: 2 for the alphanumeric format).
  • rut.js (JS) — covers cl.rut. Marked survey-only because rut.js rejects RUT bodies with leading zeros as a stylistic policy; our checksum-only validator accepts them (and so does python-stdnum).
  • django-localflavor (Python, gated by a new hasLocalflavor() probe) — adds 16 mappings: ar.cuit, ar.dni, au.abn, au.acn, au.tfn, br.cpf, br.cnpj, ca.sin, cl.rut, es.dni, in_.aadhaar, in_.pan, mx.clabe, mx.curp, mx.rfc, us.ssn. The Python bridge configures Django with empty settings (settings.configure(USE_I18N=False)) so no full project setup is needed.

A LOCALFLAVOR_FORMAT per-key shape transformer is included so that fields requiring punctuated input (e.g., CASocialInsuranceNumberField requires XXX-XXX-XXX) can still be exercised against our compact form.
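A per-key shape transformer of this kind can be sketched as follows. The map name mirrors LOCALFLAVOR_FORMAT from the description, but the single entry and the helper shapeFor are illustrative assumptions, not the PR's actual code.

```typescript
// Hypothetical sketch: maps an identifier key to a function that reshapes our
// compact form into the punctuated form a localflavor field expects.
type Shape = (compact: string) => string;

const LOCALFLAVOR_FORMAT: Record<string, Shape> = {
  // CASocialInsuranceNumberField expects XXX-XXX-XXX.
  // Inputs that are not 9 characters long are passed through unchanged.
  "ca.sin": (v) =>
    v.length === 9 ? `${v.slice(0, 3)}-${v.slice(3, 6)}-${v.slice(6)}` : v,
};

// Apply the shape registered for a key, or return the values untouched.
function shapeFor(key: string, values: string[]): string[] {
  const shape = LOCALFLAVOR_FORMAT[key];
  return shape ? values.map(shape) : values;
}

console.log(shapeFor("ca.sin", ["046454286"])[0]); // 046-454-286
```

Keeping the reshaping next to the comparator, not inside the validator under test, means our side is still probed with the compact form while the oracle receives the form it insists on.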

Several pairings are pre-marked survey-only with documented reasons (localflavor's BRCNPJField predates the July 2026 alphanumeric format; INAadhaarNumberField skips Verhoeff; ARCUITField is missing prefixes 50/51/55 that python-stdnum also accepts; etc.).
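One way to picture the survey-only marking is a lookup keyed by source:key, the same form used later in this thread (for example python-stdnum:mx.rfc). The table name SURVEY_ONLY, the Tier type, and this tierFor body are assumptions for illustration; only the tierFor(source, key) call shape appears in the diff.

```typescript
type Tier = "gate" | "survey";

// Hypothetical table: each survey-only pairing carries its documented reason.
const SURVEY_ONLY = new Map<string, string>([
  ["rut.js:cl.rut", "rejects RUT bodies with leading zeros"],
  ["localflavor:in_.aadhaar", "INAadhaarNumberField skips Verhoeff"],
]);

// Pairings listed in the table are survey-only; everything else gates.
function tierFor(source: string, key: string): Tier {
  return SURVEY_ONLY.has(`${source}:${key}`) ? "survey" : "gate";
}

console.log(tierFor("rut.js", "cl.rut")); // survey
console.log(tierFor("demo-lib", "xx.demo")); // gate (not listed in this sketch's table)
```

Storing the reason alongside the key keeps the justification for each demotion reviewable in one place.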

CC on behalf of @jan-kubica

Test plan

  • bun install
  • bun run lint (passes)
  • bun run typecheck (passes)
  • bun test (4229 passing)
  • bun run oracle reaches the same gate disagreement count as before (the new comparators in gate mode show 0 disagreements at ORACLE_SAMPLES=1000)
  • Optional: install Python deps in a .venv (pip install django django-localflavor) and re-run bun run oracle to exercise the localflavor branch

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request integrates several new validation oracles into the oracle script, including django-localflavor, brazilian-utils, and rut.js, along with their respective dependencies and configurations. The review feedback highlights a security and portability concern regarding the use of a hardcoded /tmp path for executing the temporary Python script, suggesting the use of tmpdir() and ensuring proper cleanup. Additionally, it is recommended to refactor the manual oracle registrations for brazilian-utils and rut.js to use the safe helper for more robust error handling.

Comment thread scripts/oracle.ts Outdated
Comment on lines +688 to +696
writeFileSync("/tmp/_stdnum_localflavor.py", s);
return execSync(`${PYTHON} /tmp/_stdnum_localflavor.py`, {
  input: json,
  encoding: "utf-8",
  timeout: 60_000,
})
  .trim()
  .split("\n")
  .map((l) => l === "1");

security-high high

Avoid using a hardcoded /tmp path, which is insecure (vulnerable to symlink attacks and multi-user collisions) and non-portable (fails on Windows). Instead, use tmpdir() with a process-unique filename, and clean it up in a finally block.

  const tempFile = `${tmpdir()}/_stdnum_localflavor_${process.pid}.py`;
  writeFileSync(tempFile, s);
  try {
    return execSync(`${PYTHON} ${tempFile}`, {
      input: json,
      encoding: "utf-8",
      timeout: 60_000,
    })
      .trim()
      .split("\n")
      .map((l) => l === "1");
  } finally {
    try {
      unlinkSync(tempFile);
    } catch {}
  }

Contributor Author

@jan-kubica jan-kubica Jun 3, 2026

Applied in fd776d6: localflavorBatch now writes to join(tmpdir(), `_stdnum_localflavor_${process.pid}.py`) and cleans up via unlinkSync in a finally block. The other hardcoded /tmp paths (pyBatch, pyIdnBatch, rubyScript, phpBatch) predate this PR and follow the same pattern; happy to migrate them together in a follow-up to keep this PR scoped to the new comparators.

CC on behalf of @jan-kubica
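A PID-suffixed name under tmpdir() is still predictable. If that ever matters, one possible variation (not what this PR does) is to create a private directory with mkdtempSync and remove it afterwards. The helper name withTempScript and the directory prefix are hypothetical.

```typescript
import { existsSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: writes `source` into a freshly created private
// directory, hands the script path to `run`, and always removes the directory.
function withTempScript<T>(source: string, run: (scriptPath: string) => T): T {
  const dir = mkdtempSync(join(tmpdir(), "stdnum-oracle-"));
  const scriptPath = join(dir, "script.py");
  try {
    writeFileSync(scriptPath, source);
    return run(scriptPath);
  } finally {
    rmSync(dir, { recursive: true, force: true });
  }
}

// Usage: a real caller would invoke the interpreter on scriptPath inside the
// callback; this demo only reads the file back to stay self-contained.
let seenPath = "";
const echoed = withTempScript("print('1')\n", (p) => {
  seenPath = p;
  return readFileSync(p, "utf-8");
});
console.log(echoed.trim(), existsSync(seenPath)); // print('1') false
```

Because the directory name is randomized by the OS-level call, two oracle runs cannot collide even if they share a PID across containers.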

Comment thread scripts/oracle.ts
Comment on lines +20 to +23
import {
  isValidCnpj,
  isValidCpf,
} from "@brazilian-utils/brazilian-utils";

medium

Import tmpdir from node:os to enable cross-platform and secure temporary file creation.

Suggested change
- import {
-   isValidCnpj,
-   isValidCpf,
- } from "@brazilian-utils/brazilian-utils";
+ import {
+   isValidCnpj,
+   isValidCpf,
+ } from "@brazilian-utils/brazilian-utils";
+ import { tmpdir } from "node:os";

Contributor Author

@jan-kubica jan-kubica Jun 3, 2026

Imported in fd776d6.

CC on behalf of @jan-kubica

Comment thread scripts/oracle.ts
Comment on lines 51 to +53
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";
import { validate as validateRut } from "rut.js";

medium

Import unlinkSync from node:fs to clean up the temporary Python script after execution.

Suggested change
- import { execSync } from "node:child_process";
- import { writeFileSync } from "node:fs";
- import { validate as validateRut } from "rut.js";
+ import { execSync } from "node:child_process";
+ import { writeFileSync, unlinkSync } from "node:fs";
+ import { validate as validateRut } from "rut.js";

Contributor Author

@jan-kubica jan-kubica Jun 3, 2026

Imported in fd776d6.

CC on behalf of @jan-kubica

Comment thread scripts/oracle.ts Outdated
Comment on lines +1144 to +1167
e.push({
  name: "br.cpf (vs brazilian-utils)",
  source: "brazilian-utils",
  key: "br.cpf",
  tier: tierFor("brazilian-utils", "br.cpf"),
  validate: (v) => v.map(isValidCpf),
});
e.push({
  name: "br.cnpj (vs brazilian-utils)",
  source: "brazilian-utils",
  key: "br.cnpj",
  tier: tierFor("brazilian-utils", "br.cnpj"),
  validate: (v) =>
    v.map((x) => isValidCnpj(x, { version: 2 })),
});

// rut.js (always available)
e.push({
  name: "cl.rut (vs rut.js)",
  source: "rut.js",
  key: "cl.rut",
  tier: tierFor("rut.js", "cl.rut"),
  validate: (v) => v.map(validateRut),
});

medium

Refactor the manual e.push calls for brazilian-utils and rut.js to use the safe helper. This ensures consistent error handling (preventing the oracle runner from crashing if these libraries throw unexpected exceptions) and reduces boilerplate.

  safe(
    "br.cpf (vs brazilian-utils)",
    "brazilian-utils",
    "br.cpf",
    (v) => v.map(isValidCpf),
  );
  safe(
    "br.cnpj (vs brazilian-utils)",
    "brazilian-utils",
    "br.cnpj",
    (v) => v.map((x) => isValidCnpj(x, { version: 2 })),
  );

  // rut.js (always available)
  safe(
    "cl.rut (vs rut.js)",
    "rut.js",
    "cl.rut",
    (v) => v.map(validateRut),
  );

Contributor Author

@jan-kubica jan-kubica Jun 3, 2026

Done in fd776d6 — both brazilian-utils entries and the rut.js entry now go through safe() for consistent error handling.

CC on behalf of @jan-kubica
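For readers unfamiliar with the pattern, a wrapper like safe() might look roughly like the sketch below. The entry fields are copied from the diff above; the error policy (log the failure and return null so the caller can skip the batch) is an assumption, since the thread does not show the real helper's body, and tierFor is stubbed.

```typescript
type Tier = "gate" | "survey";
type Validate = (values: string[]) => boolean[];

interface Entry {
  name: string;
  source: string;
  key: string;
  tier: Tier;
  // null signals that the comparator threw and the batch should be skipped
  // (an assumed policy for this sketch).
  validate: (values: string[]) => boolean[] | null;
}

const entries: Entry[] = [];

// Stub: the real tierFor is not shown in this thread.
const tierFor = (_source: string, _key: string): Tier => "gate";

// Register a comparator whose exceptions are caught and never propagate.
function safe(name: string, source: string, key: string, validate: Validate): void {
  entries.push({
    name,
    source,
    key,
    tier: tierFor(source, key),
    validate: (values) => {
      try {
        return validate(values);
      } catch (err) {
        console.error(`${name}: comparator threw, skipping batch`, err);
        return null;
      }
    },
  });
}

// Demo comparators (both hypothetical): one always throws, one checks length.
safe("demo (vs always-throws)", "demo", "xx.demo", () => {
  throw new Error("boom");
});
safe("demo (vs length-9)", "demo", "xx.len9", (v) => v.map((x) => x.length === 9));
```

Centralizing the try/catch in one helper also removes the repeated tierFor(source, key) argument pairs, which is where copy-paste mistakes tend to creep in.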

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd776d6291

Comment thread scripts/oracle.ts
Comment on lines +1151 to +1152
(v) =>
localflavorBatch(path, shape ? v.map(shape) : v),

P2: Add valid generators for the new localflavor mappings

With these new entries, any mapped validator that does not declare lengths and has no CUSTOM_ARB override falls through arbFor to the default 10-digit generator. That makes mappings such as au.abn, au.acn, au.tfn, br.cpf, and us.ssn compare only invalid-length samples, so bun run oracle can report zero gate disagreements while never exercising valid values for those new localflavor comparators; add per-key arbs or lengths before treating them as gate coverage.
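The effect described here can be reproduced with a small sketch. cpfValid below is a simplified CPF check (length plus the two mod-11 check digits; real libraries also reject repeated-digit inputs), and digitSamples is a stand-in for the oracle's default generator, so both are illustrative assumptions.

```typescript
// Simplified CPF check: exactly 11 digits, then two mod-11 check digits.
function cpfValid(s: string): boolean {
  if (!/^\d{11}$/.test(s)) return false;
  const d = [...s].map(Number);
  const check = (n: number): number => {
    // Weights run from n + 1 down to 2 over the first n digits.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += d[i]! * (n + 1 - i);
    const r = sum % 11;
    return r < 2 ? 0 : 11 - r;
  };
  return check(9) === d[9] && check(10) === d[10];
}

// Uniform random digit strings of a fixed length (stand-in for a default arb).
function digitSamples(length: number, count: number): string[] {
  return Array.from({ length: count }, () =>
    Array.from({ length }, () => String(Math.floor(Math.random() * 10))).join(""),
  );
}

const countValid = (len: number): number =>
  digitSamples(len, 1000).filter(cpfValid).length;

console.log(countValid(10)); // 0: wrong length, so every sample is rejected
console.log(countValid(11)); // varies per run; roughly 1 in 100 on average
```

With 10-digit samples both sides reject everything, so "0 disagreements" carries no information about the checksum logic.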

Contributor Author

Addressed in e5e34d8. Two things going on:

  1. Digit-only mappings (au.abn, au.acn, au.tfn, br.cpf, ca.sin, us.ssn) are now actually covered after the rebase onto main, which picked up the lengthsFromExamples fix in inferArb from #107 (fix(oracle): probe validators with their real lengths). These validators don't declare lengths, but their examples are 11/9/9/11/9/9 chars respectively, so the arb generates the right length and the comparators do exercise the checksum path. Re-running with ORACLE_SAMPLES=1000 shows valid-sample rates that match expectations (~1-3% for Luhn and similar weighted checks, 21/1000 for us.ssn, etc.).

  2. Alphanumeric mappings were the genuine gap — in_.pan (5 letters + 4 digits + 1 letter), mx.curp (18-char structured), and mx.rfc (12/13-char persona física/moral) were producing 0/N valid samples because the default arb is digit-only. Added per-key CUSTOM_ARB entries that respect each format's character classes (mx.curp vowel/consonant constraints, mx.rfc persona-física vs moral lengths). The new arbs immediately surfaced real semantic differences between us and the oracles, which I marked survey-only with documented reasons:

    • python-stdnum:mx.rfc — their is_valid() defaults to validate_check_digits=False; ours always verifies the SAT mod-11 check digit.
    • localflavor:mx.rfc — MXRFCField requires the 2nd char of a persona física to be a vowel; we follow the SAT regex on python-stdnum.
    • python-stdnum:in_.pan — they accept holder-type 'K' (deprecated per their own comment) and reject 0000-serial PANs; ours excludes 'K' and accepts 0000.

Gate-mode disagreement count stays 0 for the new mappings that remain in gate.
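The length inference mentioned in point 1 can be pictured as below. Only the name lengthsFromExamples comes from this thread; the signature and body are guesses at its behaviour, and the sample strings are placeholders whose lengths are the only thing that matters.

```typescript
// Hypothetical: derive the candidate lengths from a validator's examples,
// deduplicated and sorted ascending.
function lengthsFromExamples(examples: readonly string[]): number[] {
  return Array.from(new Set(examples.map((e) => e.length))).sort((a, b) => a - b);
}

console.log(lengthsFromExamples(["00000000000"]).join(",")); // 11
console.log(lengthsFromExamples(["AAAA000000AA0", "AAA000000AA0"]).join(",")); // 12,13
```

A generator that draws its length from this list produces inputs that at least reach the checksum branch, which is the property the review comment asked for.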

CC on behalf of @jan-kubica
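As an illustration of what a format-aware generator buys, here is a sketch that produces PAN-shaped strings (5 letters, 4 digits, 1 letter, as described above). It uses Math.random in place of whatever arbitrary library the oracle uses, and the names fromSegments and panShaped are hypothetical; the real CUSTOM_ARB entries may also constrain individual positions.

```typescript
const LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const DIGITS = "0123456789";

// Pick one character uniformly at random from an alphabet.
const pick = (alphabet: string): string =>
  alphabet.charAt(Math.floor(Math.random() * alphabet.length));

// Build a string from a list of [alphabet, repeat] segments.
function fromSegments(segments: Array<[string, number]>): string {
  return segments
    .map(([alphabet, n]) => Array.from({ length: n }, () => pick(alphabet)).join(""))
    .join("");
}

// PAN shape: 5 letters, 4 digits, 1 letter.
const panShaped = (): string =>
  fromSegments([[LETTERS, 5], [DIGITS, 4], [LETTERS, 1]]);

console.log(panShaped());
```

A digit-only generator can never satisfy this shape, which is consistent with the 0/N valid samples reported before the per-key arbs were added.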

@jan-kubica jan-kubica closed this Jun 3, 2026
@jan-kubica jan-kubica reopened this Jun 3, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 3, 2026
…mparators

Expand the cross-validation oracle in scripts/oracle.ts with three new
backends that probe identifier families not previously covered by
python-stdnum / jsvat / stdnum-js:

- @brazilian-utils/brazilian-utils (JS): br.cpf, br.cnpj (v2 alphanumeric).
- rut.js (JS): cl.rut. Marked survey-only — rut.js rejects RUT bodies
  with leading zeros as a stylistic policy; our checksum-only validator
  accepts them, matching python-stdnum.
- django-localflavor (Python, optional via hasLocalflavor() probe):
  16 mappings across ar.cuit, ar.dni, au.{abn,acn,tfn}, br.{cpf,cnpj},
  ca.sin, cl.rut, es.dni, in_.{aadhaar,pan}, mx.{clabe,curp,rfc}, us.ssn.

Survey-only annotations were added for pairings where the upstream
library has documented gaps (e.g., localflavor's BRCNPJField predates
the July 2026 alphanumeric format; INAadhaarNumberField skips Verhoeff;
ARCUITField is missing prefixes 50/51/55 that python-stdnum also accepts).

Gate-mode disagreement count is unchanged at 0 for the new comparators
that remain in gate, validated with ORACLE_SAMPLES=1000.

- localflavorBatch now writes its temp Python script via tmpdir() with
  a PID-suffixed filename and cleans it up in a finally block, instead
  of hardcoding /tmp/_stdnum_localflavor.py. The existing /tmp paths
  in pyBatch / pyIdnBatch / rubyScript / phpBatch were not introduced
  by this PR and are left alone; they can be migrated together in a
  follow-up.

- The brazilian-utils and rut.js oracle entries are now registered
  through the existing safe() helper rather than direct e.push() calls,
  matching the pattern used by the Python and Ruby backends. This adds
  consistent try/catch handling so an unexpected library exception
  cannot crash the oracle runner mid-batch.
@jan-kubica jan-kubica force-pushed the chore/oracle-add-comparators branch from 96fddc5 to 9a10b67 Compare June 3, 2026 22:21
Codex P2 review flagged that mappings without a `lengths` declaration
and without a `CUSTOM_ARB` entry fall through to the default 10-digit
generator. After the main rebase, `inferArb` reads lengths from the
validator's `examples` so the digit-only mappings (au.abn, au.acn,
au.tfn, br.cpf, ca.sin, us.ssn) get the right length and do exercise
the checksum path. The remaining gap is alphanumeric formats: in_.pan,
mx.curp, and mx.rfc have no per-key arb and were producing 0/N valid
samples in the gate run, meaning the comparators probed nothing useful.

This patch:
- adds CUSTOM_ARB entries that respect each format's character classes
  (letter vs digit positions, mx.curp vowel/consonant constraints,
  mx.rfc persona física vs moral lengths),
- marks pairings as survey-only where the new arbs surface real
  semantic differences:
  * python-stdnum:mx.rfc — their is_valid() skips check-digit by
    default; we always verify it,
  * localflavor:mx.rfc — their MXRFCField requires the 2nd char of a
    persona física to be a vowel; we follow the SAT regex on
    python-stdnum,
  * python-stdnum:in_.pan — they accept holder-type 'K' (deprecated)
    and reject 0000-serial; ours excludes 'K' and accepts 0000.

Gate mode now shows non-zero valid-sample counts for these mappings
(in_.pan vs localflavor: 190/500 valid, mx.curp: 3-4/500, mx.rfc:
covered via per-tier marking) with 0 disagreements where retained.
@jan-kubica jan-kubica merged commit 36bb384 into main Jun 3, 2026
7 checks passed
@jan-kubica jan-kubica deleted the chore/oracle-add-comparators branch June 3, 2026 22:49