-
Notifications
You must be signed in to change notification settings - Fork 42
Rutefig/reg 579 fix noir proof generation #110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rutefig/reg 579 fix noir proof generation #110
Conversation
…aystack and regex definition json
…roup and rerun regex, input and test generation
… and tests - fails input that enters in conditional path
The epsilon closure computation was incorrectly tracking capture groups globally, causing captures from unrelated conditional branches to leak into other paths. This resulted in incorrect capture behavior when conditionals preceded capture groups in regex patterns. Changes: - Modified EpsilonClosure to use per_state_captures (BTreeMap) instead of global captures set - Updated capture tracking to maintain separate capture sets for each epsilon path - Fixed start state logic to check for start captures across all epsilon paths - Regenerated Circom and Noir templates with corrected logic - Updated test inputs to reflect correct capture behavior - Added email_addr test case demonstrating conditional before capture group
WalkthroughRefactors email regex pattern matching to handle email addresses with prefixes and suffixes (e.g., Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
compiler/src/ir/intermediate.rs (1)
431-438: Remove unused struct.The
PathInfostruct is defined but never used in the codebase. Consider removing this dead code.Apply this diff:
-/// Information about captures encountered on a specific epsilon path -#[derive(Debug, Clone)] -struct PathInfo { - /// The destination state of this path - target_state: usize, - /// Captures encountered on epsilon transitions along this path - captures: BTreeSet<(usize, bool)>, -} -circom/circuits/tests/email_addr.test.ts (1)
89-130: Reduce code duplication between test cases.The two test cases share ~40 lines of nearly identical code (lines 91-129 and 134-172). The only differences are the input string and the assertions remain the same. Consider extracting a helper function:
async function testEmailExtraction( circuit: Circuit, graph: Graph, inputString: string, expectedEmail: string ) { const maxMatchBytes = 64; const maxHaystackBytes = 640; const { type, ...circuitInputs }: CircuitInputs = JSON.parse( genCircuitInputs( JSON.stringify(graph), inputString, maxHaystackBytes, maxMatchBytes, ProvingFramework.Circom ) ); let { captureGroupIds, captureGroupStarts, ...rest } = circuitInputs; let captureGroup1Id = captureGroupIds[0]; let captureGroup1Start = captureGroupStarts[0]; rest.captureGroup1Id = captureGroup1Id; rest.captureGroup1Start = captureGroup1Start; const witness = await circuit.calculateWitness(rest); await circuit.checkConstraints(witness); expect(1n).toEqual(witness[1]!); const capture1Start = 2; const captureArray = []; for (let i = 0; i < maxMatchBytes; i++) { const charCode = Number(witness[capture1Start + i]!); if (charCode === 0) break; captureArray.push(String.fromCharCode(charCode)); } const extractedEmail = captureArray.join(""); expect(extractedEmail).toEqual(expectedEmail); } it("should match email address with to: prefix", async () => { await testEmailExtraction( circuit, graph, "to:example@example.com\r\n", "example@example.com" ); }); it("should match email address after name and <", async () => { await testEmailExtraction( circuit, graph, "to: example <example@example.com>\r\n", "example@example.com" ); });Also applies to: 132-173
circom/circuits/tests/subject_all.test.ts (2)
1-86: Extract shared test utilities to reduce duplication.This file shares significant structural similarity with
email_addr.test.ts, including:
- Identical type definitions (lines 15-49)
- Similar setup patterns in beforeAll
- Duplicated capture extraction logic
Consider creating a shared test utilities module:
Create
circom/circuits/tests/utils/test-helpers.ts:// Type definitions export interface CircomTesterOptions { include: string; } export interface Circuit { calculateWitness(input: any): Promise<bigint[]>; checkConstraints(witness: bigint[]): Promise<void>; } export interface CircuitInputs { type?: string; captureGroupIds: number[][]; captureGroupStarts: number[][]; inHaystack: number[]; [key: string]: any; } export interface Graph { [key: string]: any; } // Helper functions export async function setupCircuit( regexJsonPath: string, templateName: string, circuitPath: string, outputGraphPath: string, outputCircuitPath: string, wasmTester: any ): Promise<{ graph: Graph; circuit: Circuit }> { // Implementation combining common beforeAll logic } export async function testCaptureExtraction( circuit: Circuit, graph: Graph, inputString: string, expectedCapture: string, maxHaystackBytes: number = 640, maxMatchBytes: number = 64, captureStartIndex: number = 2 ): Promise<void> { // Implementation combining common test logic }
89-256: Reduce code duplication across test cases.The four test cases (lines 89-130, 132-172, 174-214, 216-256) contain significant duplication. Each test repeats ~40 lines of setup and extraction logic with only the input string and expected output varying.
Consider refactoring similar to the suggestion for
email_addr.test.ts, or use the shared helper from the test utilities module.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
.gitignore(1 hunks)CLAUDE.md(1 hunks)circom/circuits/common/email_addr_regex.circom(6 hunks)circom/circuits/tests/circuits/test_email_addr_regex.circom(1 hunks)circom/circuits/tests/circuits/test_subject_all_regex.circom(1 hunks)circom/circuits/tests/email_addr.test.ts(1 hunks)circom/circuits/tests/subject_all.test.ts(1 hunks)circom/regexes/email_addr.json(1 hunks)compiler/src/ir/intermediate.rs(5 hunks)noir/common/email_addr.json(1 hunks)noir/common/sample_haystacks/circuit_inputs/body_hash_pass_0.json(1 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_0.json(7 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_1.json(8 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_10.json(0 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_11.json(0 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_12.json(0 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_13.json(0 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_14.json(0 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_15.json(0 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_16.json(0 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_17.json(0 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_18.json(0 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_19.json(0 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_2.json(8 hunks)noir/common/sample_haystacks/circuit_inputs/email_addr_pass_20.json(0 hunks)
💤 Files with no reviewable changes (11)
- noir/common/sample_haystacks/circuit_inputs/email_addr_pass_13.json
- noir/common/sample_haystacks/circuit_inputs/email_addr_pass_19.json
- noir/common/sample_haystacks/circuit_inputs/email_addr_pass_14.json
- noir/common/sample_haystacks/circuit_inputs/email_addr_pass_15.json
- noir/common/sample_haystacks/circuit_inputs/email_addr_pass_20.json
- noir/common/sample_haystacks/circuit_inputs/email_addr_pass_10.json
- noir/common/sample_haystacks/circuit_inputs/email_addr_pass_18.json
- noir/common/sample_haystacks/circuit_inputs/email_addr_pass_12.json
- noir/common/sample_haystacks/circuit_inputs/email_addr_pass_16.json
- noir/common/sample_haystacks/circuit_inputs/email_addr_pass_11.json
- noir/common/sample_haystacks/circuit_inputs/email_addr_pass_17.json
🔇 Additional comments (19)
.gitignore (1)
18-19: LGTM!Standard gitignore additions for local development artifacts.
noir/common/sample_haystacks/circuit_inputs/body_hash_pass_0.json (1)
308-308: LGTM - Regenerated test data.The initial state change from 0 to 3 reflects the updated NFA state graph after the epsilon closure fix. This is expected regenerated test data.
CLAUDE.md (1)
1-253: Excellent documentation!Comprehensive repository guide covering architecture, workflows, and development patterns. This will significantly improve developer onboarding and AI assistant context.
circom/circuits/tests/circuits/test_subject_all_regex.circom (1)
5-14: Well-documented test harness.The inline documentation clearly explains parameter alignment requirements and warns about witness generation errors. This is a good pattern for test circuits.
noir/common/sample_haystacks/circuit_inputs/email_addr_pass_2.json (2)
4-27: LGTM - Regenerated test data.The haystack now starts with "to:" (bytes 116, 111, 58), reflecting the updated email pattern that includes an optional prefix. This is expected regenerated test data.
306-1521: LGTM - Updated state machine data.The match_length, state arrays, and capture metadata are updated consistently with the new regex pattern. All changes align with the pattern update in email_addr.json.
compiler/src/ir/intermediate.rs (5)
189-196: LGTM - Critical fix for start captures.The fix correctly uses
per_state_captures.get(&r_state)to only add captures from the specific epsilon path tor_state, preventing capture leakage from unrelated branches. This is the core of the fix.
199-207: LGTM - End capture handling.Correctly iterates over all per-state captures and filters for end markers (
!is_start). The logic properly accumulates end captures from the target state's epsilon closure.
220-229: LGTM - Start state preservation.The fix correctly checks all epsilon paths from the start state for start captures before adding alternative start states. This prevents bypassing captures when taking epsilon shortcuts.
276-327: LGTM - Per-path capture tracking.The refactored epsilon closure computation correctly:
- Tracks captures per destination state in
per_state_captures- Accumulates captures along each epsilon path in
current_path_captures- Stores path-specific captures via
closure.per_state_captures.entry(state)This prevents the global capture leakage bug.
444-495: Excellent regression test!The test validates the fix by:
- Testing the problematic pattern
(?:a)?([bc])- Verifying both paths (with and without optional)
- Asserting exactly 1 capture start event (not duplicated)
This will catch regressions of the epsilon closure bug.
circom/circuits/tests/circuits/test_email_addr_regex.circom (1)
5-14: Well-documented test harness.Clear documentation of parameter requirements and the consequences of mismatches. Consistent with the pattern in test_subject_all_regex.circom.
noir/common/email_addr.json (1)
3-14: Verify the PublicPattern length reduction.The regex pattern structure looks correct with the optional prefix and suffix to test the epsilon closure fix. However, the
PublicPatternlength changed significantly from 320 to 64. Ensure this reduction is intentional and doesn't truncate valid email addresses in your test scenarios.noir/common/sample_haystacks/circuit_inputs/email_addr_pass_1.json (1)
3-37: Test data correctly represents email with "to:" prefix and angle bracket wrapper.The haystack now encodes
to:example <example@example.com>\r\n, testing email extraction when there's an optional prefix and angle brackets. The capture start index at position 12 correctly skips the prefix to capture just the email address portion.noir/common/sample_haystacks/circuit_inputs/email_addr_pass_0.json (1)
3-39: Test data provides complementary coverage for CRLF-prefixed pattern.This test case exercises the alternative branch where the pattern starts with
\r\nbeforeto:, complementing the pass_1 test case. This is valuable for verifying the epsilon closure fix handles both paths through the optional conditional correctly.circom/circuits/common/email_addr_regex.circom (3)
9-9: Generated circuit matches the updated regex pattern.The header comment correctly reflects the composite pattern from the regex definition: prefix
(?:\r\n|^)to:(?:[^<]+<)?, main capture[a-zA-Z0-9!#$%&*+\\-\\/=?^_{|}~.]+@[a-zA-Z0-9_.-]+, and suffix>?\r\n`.
21-25: Start state handling correctly implements multiple entry points.The circuit properly expands from 1 to 3 start states with updated validation using MultiOR component. This correctly handles the alternative branches in the prefix pattern
(?:\r\n|^)to:, preventing the capture leakage issue described in the PR.Also applies to: 37-51
235-235: Verify that 64-byte capture limit is sufficient.The capture output was reduced from 320 to 64 bytes, consistent with the regex definition change. Please confirm this size is adequate for your email address use case, as RFC 5321 allows full email addresses up to 320+ bytes.
Related to the same concern raised for
circom/regexes/email_addr.json.circom/regexes/email_addr.json (1)
7-10: Confirm the rationale for the 64-byte limit and document it.The review concern is valid: the reduction from 320 to 64 bytes significantly restricts email address sizes below RFC 5321 compliance (which allows up to 320+ bytes total). While this appears to be a deliberate zk-circuit parameter constraint (for circuit size/performance), the reason for choosing 64 specifically is undocumented.
The test file shows this is a circuit parameter (
maxMatchBytes = 64), but there is no explanation of whether 64 bytes satisfies your use case requirements or if this represents a necessary performance tradeoff. Many legitimate emails exceed 64 bytes (e.g., longer local-parts combined with multi-level domain names).Action: Either document in code comments or README why 64 bytes is the appropriate limit for this system, or increase it if your use case requires broader email support.
Fix Noir Proof Generation - Epsilon Path Capture Group Tracking
Summary
This PR fixes a critical bug in the regex compiler's epsilon closure computation that caused capture groups to leak between unrelated conditional branches. The bug manifested when optional patterns (conditionals) preceded capture groups in regex patterns, resulting in incorrect capture behavior and proof generation failures.
Problem: When a regex pattern had an optional non-capturing group before a capture group (e.g.,
(?:optional)?(...)), the epsilon closure would track captures globally, causing captures from paths that took the optional branch to leak into paths that skipped it.Solution: Modified epsilon closure computation to track captures per epsilon path rather than globally, using a per-state capture map to maintain separate capture sets for each reachable state.
Root Cause Analysis
The bug was in
compiler/src/ir/intermediate.rsin theEpsilonClosurestruct and its computation:Before (buggy code):
The epsilon closure tracked all captures in a single global set, meaning:
email_addr: (?:\r\n|^)to:(?:[^<]+<)?(email@domain.com), paths that skipped the optional(?:[^<]+<)?would still see captures from paths that took itAfter (fixed code):
Now each epsilon path maintains its own capture set, preventing cross-contamination between conditional branches.
Changes Made
Compiler Changes (
compiler/src/ir/intermediate.rs)Modified
EpsilonClosurestructure (line 423-429):captures: BTreeSetto per-stateper_state_captures: BTreeMapUpdated epsilon closure computation (line 275-320):
current_path_capturesparameter to DFS traversalper_state_capturesmapFixed transition building (line 186-206):
Updated start state logic (line 217-228):
Added regression test (line 444-495):
(?:a)?([bc])which exhibits the bugRegex Pattern Updates
email_addr pattern (
noir/common/email_addr.json,circom/regexes/email_addr.json):(?:\r\n|^)to:(?:[^<]+<)?(email@domain)>?\r\n(?:[^<]+<)?before the capture groupto:example <example@example.com>(takes optional path with<)to:example@example.com(skips optional path, no<)Generated Template Updates
Regenerated all framework templates with corrected logic:
circom/circuits/common/email_addr_regex.circomand graph JSON (+2932 lines)noir/src/templates/circuits/*.nrTest Updates
email_addr_pass_*.jsonfiles that were based on old regex definitiontest_email_addr_regex.circomandtest_subject_all_regex.circomDocumentation
CLAUDE.mdfile with repository overview, architecture, and development patternsImpact
User-Facing Changes
email_addrnow correctly capture email addresses regardless of whether optional formatting is presentBreaking Changes
None. This is a pure bug fix that makes the compiler behavior correct. Existing code will work better, not differently.
Testing & Verification
Automated Tests
bun test)bun run test:circom)test_epsilon_elimination_optional_pattern_before_capture()verifies the fix<>formatting)Manual Verification
The fix was validated against the
email_addrpattern with these test cases:Pass cases (correctly match and capture):
to:example <example@example.com>- Takes optional(?:[^<]+<)?pathto:example@example.com- Skips optional pathto:example <example@example.com>(with CRLF) - Standard email formatFail cases (correctly reject):
to: example@example.com- Extra space after colon (not in pattern)Both paths through the conditional now produce correct captures without leakage.
Files Changed
Core compiler: 1 file
compiler/src/ir/intermediate.rs: Epsilon closure tracking fix (+121 lines)Regex definitions: 2 files
noir/common/email_addr.json: Pattern remains same, haystacks updatedcircom/regexes/email_addr.json: Pattern remains sameGenerated templates: 14 files
email_addr_regex.circom,email_addr_graph.json, test circuitsTest inputs: 68 files updated/deleted
Documentation: 2 files
.gitignore: Added patternsCLAUDE.md: Repository documentationMigration Notes
No migration required. This is a transparent bug fix that improves correctness. Circuits using patterns with optional conditionals before capture groups will now generate correct proofs.
Related Issues
Checklist
Summary by CodeRabbit
Release Notes
New Features
Tests
Documentation