fix(web): treat parens and | as word chars when regex mode is enabled #1050

Merged
brendan-kellam merged 3 commits into main from brendan-kellam/fix-regex-paren-parsing-SOU-760
Mar 26, 2026

Conversation

brendan-kellam (Contributor) commented Mar 26, 2026

Summary

  • Adds a Lezer regex dialect to the query language grammar via @dialects { regex }
  • In regex mode, the parenToken, closeParenToken, negateToken, and wordToken tokenizers all check stack.dialectEnabled(Dialect_regex) and treat (, ), and | as plain word characters instead of query grouping operators
  • Fixes a bug where a query like (test|render)< with regex mode enabled was incorrectly parsed as AND(ParenExpr("test|render"), Term("<")) instead of a single Term("(test|render)<")
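
The difference between the two parses can be illustrated with a small standalone model. This is a minimal sketch, not the code in tokens.ts: `tokenize` and `Token` are hypothetical names, whitespace is assumed to be the only word terminator in regex mode, and the Lezer `stack.dialectEnabled(Dialect_regex)` check is replaced by a plain `regexMode` flag.

```typescript
// Simplified model of dialect-aware tokenization (hypothetical, for illustration).
type Token =
  | { kind: "openParen" }
  | { kind: "closeParen" }
  | { kind: "word"; text: string };

function tokenize(query: string, regexMode: boolean): Token[] {
  const tokens: Token[] = [];
  const isSpace = (c: string): boolean => /\s/.test(c);
  let i = 0;
  while (i < query.length) {
    const ch = query.charAt(i);
    if (isSpace(ch)) {
      i++;
      continue;
    }
    // Outside regex mode, parentheses are query grouping operators.
    if (!regexMode && ch === "(") {
      tokens.push({ kind: "openParen" });
      i++;
      continue;
    }
    if (!regexMode && ch === ")") {
      tokens.push({ kind: "closeParen" });
      i++;
      continue;
    }
    // Consume a word. In regex mode only whitespace ends it, so "(", ")"
    // and "|" stay inside the word.
    let end = i;
    while (end < query.length && !isSpace(query.charAt(end))) {
      const c = query.charAt(end);
      if (!regexMode && (c === "(" || c === ")")) break;
      end++;
    }
    tokens.push({ kind: "word", text: query.slice(i, end) });
    i = end;
  }
  return tokens;
}
```

With `regexMode` set to `true`, `tokenize("(test|render)<", true)` yields a single word token. With it set to `false`, the same input yields an open paren, the word `test|render`, a close paren, and the word `<`, which corresponds to the incorrect AND(ParenExpr, Term) shape described above.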

Fixes SOU-760

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed regex queries containing parentheses (e.g., (test|render)<) being incorrectly split into multiple search terms instead of being treated as a single regex pattern.
  • New Features

    • Added a regex dialect mode to properly handle complex regex patterns with special characters and operators.

brendan-kellam and others added 2 commits March 25, 2026 23:22
Adds a Lezer 'regex' dialect to the query language grammar. In regex
mode, the tokenizers no longer emit openParen/closeParen tokens, so
parentheses and pipe characters are consumed as plain word characters
rather than query grouping operators. This fixes a bug where a query
like (test|render)< was incorrectly parsed as AND(ParenExpr, Term)
instead of a single regexp Term.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
coderabbitai bot (Contributor) commented Mar 26, 2026

Walkthrough

This change implements a regex dialect for the query language parser, enabling parentheses and regex patterns to be parsed as single tokens rather than as query syntax operators. The update modifies tokenizers, adds grammar dialect support, and integrates the regex parser into the web search interface.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Grammar and Parser Configuration: packages/queryLanguage/src/parser.terms.ts, packages/queryLanguage/src/parser.ts, packages/queryLanguage/src/query.grammar | Added Dialect_regex constant and enabled regex dialect in parser deserialization; added @dialects { regex } directive to grammar. |
| Tokenization: packages/queryLanguage/src/tokens.ts | Updated parenToken, closeParenToken, wordToken, and negateToken to accept parser stack and disable operator/grouping interpretation when regex dialect is enabled; parentheses and regex alternations are now consumed as literal characters. |
| Test Suite: packages/queryLanguage/test/grammar.regex.test.ts, packages/queryLanguage/test/grammar.test.ts, packages/queryLanguage/test/regex.txt | Added dedicated regex dialect test file with 18 test cases covering alternation, anchors, quantifiers, character classes, and prefix filter combinations; updated main test to skip regex test cases. |
| Web Search Integration: packages/web/src/features/search/parser.ts | Added regexParser configuration and updated parseQuerySyntaxIntoIR to select parser based on isRegexEnabled flag. |
| Documentation: CHANGELOG.md | Added Unreleased changelog entry documenting fix for regex queries with parentheses being incorrectly split. |
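
The parser selection in the Web Search Integration row might look roughly like the sketch below. Lezer's `LRParser.configure` accepts a `dialect` option, which is the usual way to turn on a grammar dialect. Whether packages/web/src/features/search/parser.ts is written exactly this way is an assumption. `ConfigurableParser` and `makeParserSelector` are hypothetical names; the interface is a stand-in for the generated parser so the snippet does not depend on the grammar build.

```typescript
// Minimal stand-in for the part of a Lezer LRParser used here (hypothetical
// interface; the real parser comes from the generated grammar module).
interface ConfigurableParser {
  configure(config: { dialect?: string }): ConfigurableParser;
  parse(input: string): unknown;
}

// Hypothetical helper: build the regex-dialect parser once, then choose
// between it and the default parser for each query.
function makeParserSelector(
  base: ConfigurableParser,
): (isRegexEnabled: boolean) => ConfigurableParser {
  const regexParser = base.configure({ dialect: "regex" });
  return (isRegexEnabled) => (isRegexEnabled ? regexParser : base);
}
```

Configuring the dialect once up front, rather than on every call, avoids rebuilding a parser instance for each query.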

Sequence Diagram

sequenceDiagram
    participant User
    participant WebParser as Web Parser
    participant TokenStack as Parser Stack
    participant Tokenizer
    participant QueryParser as Query Parser
    participant IR as IR Transformer

    User->>WebParser: parseQuerySyntaxIntoIR(query, {isRegexEnabled: true})
    WebParser->>WebParser: Select regexParser
    WebParser->>TokenStack: Initialize with dialect: "regex"
    WebParser->>Tokenizer: Tokenize query with dialect enabled
    
    alt Regex Mode Enabled
        Tokenizer->>TokenStack: dialectEnabled("regex") → true
        Tokenizer->>Tokenizer: Treat (foo|bar) as single word token
        Tokenizer->>Tokenizer: Disable paren operators
    else Normal Mode
        Tokenizer->>TokenStack: dialectEnabled("regex") → false
        Tokenizer->>Tokenizer: Treat ( ) as grouping operators
    end
    
    Tokenizer->>QueryParser: Return tokens
    QueryParser->>QueryParser: Parse with regex dialect
    QueryParser->>IR: Return AST
    IR->>IR: Transform to IR representation
    IR->>User: Return parsed IR

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested reviewers

  • msukkari

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The pull request title accurately describes the main change: treating parentheses and pipe characters as word characters when regex mode is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


coderabbitai bot (Contributor) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/queryLanguage/src/tokens.ts (2)

456-490: ⚠️ Potential issue | 🟡 Minor

Only emit negate for real prefix filters.

This still treats any -...: token as negation. With regex mode enabled, patterns like -https?:// or -[A-Z]:\d+ get stolen from wordToken even though they are not in PREFIXES. Checking the known prefix set from offset would keep -repo: working without breaking colon-bearing regexes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/queryLanguage/src/tokens.ts` around lines 456 - 490, The code
currently treats any "-...:" sequence as a prefix negation; modify the logic
that checks for a colon to instead extract the identifier between offset and the
colon and verify it is a known prefix from the PREFIXES set before emitting the
negate token. Concretely, inside the loop that sets foundColon (the block
referencing input.peek, peekOffset, foundColon) replace the naive
colon-detection with logic that builds the substring from offset to the colon
(or use an existing helper) and test PREFIXES.has(substring) (or equivalent) —
only if it's a known prefix should you call input.advance() and
input.acceptToken(negate); all other cases (e.g. regexes like -https?://) should
fall through to wordToken handling. Ensure this change coexists with the
existing Dialect_regex and hasBalancedParensAt checks.
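
The prefix check suggested above could be sketched as follows. This is an illustration only: `KNOWN_PREFIXES` and `isNegatedPrefixAt` are hypothetical names, and apart from `repo:` (mentioned in the comment) the listed prefixes are placeholders, not the real contents of PREFIXES in tokens.ts.

```typescript
// Placeholder prefix list; the real PREFIXES set lives in tokens.ts.
const KNOWN_PREFIXES: readonly string[] = ["repo:", "file:", "lang:"];

// True only when text[offset] is "-" and a known prefix follows immediately.
// Anything else (for example a regex that merely contains ":") is left for
// the word tokenizer.
function isNegatedPrefixAt(text: string, offset: number): boolean {
  if (text.charAt(offset) !== "-") return false;
  return KNOWN_PREFIXES.some((prefix) => text.startsWith(prefix, offset + 1));
}
```

Under this check `-repo:foo` is still recognised as a negated filter, while `-https?://` and `-[A-Z]:\d+` are not, because neither `https?:` nor `[A-Z]:` is in the prefix list.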

316-337: ⚠️ Potential issue | 🟠 Major

Make or disambiguation regex-aware.

isOrKeyword() still short-circuits before the new regex fast path, so terms like foo or$, foo or(test), or repo:or|foo never reach this branch. In regex mode, or should only be reserved when it is actually acting as the boolean operator, not when it is the start of a regex term/value.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/queryLanguage/src/tokens.ts` around lines 316 - 337, The
short-circuit check using isOrKeyword is blocking the regex-mode fast path;
update the logic so isOrKeyword is skipped when regex dialect is enabled (or
make isOrKeyword itself respect stack.dialectEnabled(Dialect_regex)).
Concretely, either move the stack.dialectEnabled(Dialect_regex) branch above the
isOrKeyword/startsWithPrefix checks so the regex consumer runs first, or change
isOrKeyword to return false when stack.dialectEnabled(Dialect_regex) is true;
use the existing symbols (isOrKeyword, startsWithPrefix,
stack.dialectEnabled(Dialect_regex), input, EOF, isWhitespace,
input.acceptToken(word)) to locate and apply the change.
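
One way to make the `or` check regex-aware is sketched below. It is a guess at the intended behaviour, not the actual isOrKeyword implementation: `isOrOperatorAt` is a hypothetical helper that reserves `or` only when it stands alone between whitespace, and it assumes the keyword is lowercase.

```typescript
// Hypothetical regex-mode check: "or" counts as the boolean operator only
// when it is preceded and followed by whitespace.
function isOrOperatorAt(text: string, offset: number): boolean {
  if (!text.startsWith("or", offset)) return false;
  const isSpace = (c: string): boolean => c === " " || c === "\t" || c === "\n";
  if (offset === 0) return false; // nothing on the left-hand side to combine
  const before = text.charAt(offset - 1);
  const after = text.charAt(offset + 2); // "" at end of input
  return isSpace(before) && isSpace(after);
}
```

With this rule `foo or bar` keeps its boolean meaning, while the `or` in `foo or$`, `foo or(test)`, and `repo:or|foo` is left to be consumed as part of a regex term or value.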

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5771cc2d-1869-496b-958e-bc281d5a2ee8

📥 Commits

Reviewing files that changed from the base of the PR and between 34a1435 and 04354d7.

📒 Files selected for processing (9)
  • CHANGELOG.md
  • packages/queryLanguage/src/parser.terms.ts
  • packages/queryLanguage/src/parser.ts
  • packages/queryLanguage/src/query.grammar
  • packages/queryLanguage/src/tokens.ts
  • packages/queryLanguage/test/grammar.regex.test.ts
  • packages/queryLanguage/test/grammar.test.ts
  • packages/queryLanguage/test/regex.txt
  • packages/web/src/features/search/parser.ts

@brendan-kellam brendan-kellam merged commit 2b35bb0 into main Mar 26, 2026
9 checks passed
@brendan-kellam brendan-kellam deleted the brendan-kellam/fix-regex-paren-parsing-SOU-760 branch March 26, 2026 06:36