fix(web): treat parens and | as word chars when regex mode is enabled by brendan-kellam · Pull Request #1050 · sourcebot-dev/sourcebot

brendan-kellam · 2026-03-26T06:23:07Z

Summary

- Adds a Lezer regex dialect to the query language grammar via `@dialects { regex }`.
- In regex mode, the `parenToken`, `closeParenToken`, `negateToken`, and `wordToken` tokenizers all check `stack.dialectEnabled(Dialect_regex)` and treat `(`, `)`, and `|` as plain word characters instead of query grouping operators.
- Fixes a bug where a query like `(test|render)<` with regex mode enabled was incorrectly parsed as `AND(ParenExpr("test|render"), Term("<"))` instead of a single `Term("(test|render)<")`.
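
The tokenizer behaviour summarised above can be illustrated with a small, self-contained model. This is only a sketch: `tokenize`, the `Token` type, and the boolean `regexDialect` flag are hypothetical stand-ins for the real Lezer external tokenizers in `tokens.ts`, which consult `stack.dialectEnabled(Dialect_regex)` and handle more cases (prefixes, negation, balanced parentheses inside terms).

```typescript
// Hypothetical, simplified model of the tokenizers; not the code from tokens.ts.
type Token =
  | { kind: "openParen" }
  | { kind: "closeParen" }
  | { kind: "word"; text: string };

function tokenize(query: string, regexDialect: boolean): Token[] {
  const tokens: Token[] = [];
  const isParen = (ch: string | undefined): boolean => ch === "(" || ch === ")";
  let i = 0;
  while (i < query.length) {
    const ch = query[i];
    if (ch === " ") {
      i++;
      continue;
    }
    // Outside regex mode, parentheses are grouping operators.
    if (!regexDialect && ch === "(") {
      tokens.push({ kind: "openParen" });
      i++;
      continue;
    }
    if (!regexDialect && ch === ")") {
      tokens.push({ kind: "closeParen" });
      i++;
      continue;
    }
    // A word runs to the next space. In regex mode it also swallows ( and ),
    // so a pattern such as (test|render)< stays in one piece.
    let end = i;
    while (
      end < query.length &&
      query[end] !== " " &&
      (regexDialect || !isParen(query[end]))
    ) {
      end++;
    }
    tokens.push({ kind: "word", text: query.slice(i, end) });
    i = end;
  }
  return tokens;
}
```

In this model, `tokenize("(test|render)<", false)` yields an open paren, the word `test|render`, a close paren, and the word `<`, which corresponds to the buggy `AND(ParenExpr, Term)` shape. With `regexDialect` set to `true`, the same input yields a single word.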

Fixes SOU-760

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Fixed regex queries containing parentheses (e.g., (test|render)<) being incorrectly split into multiple search terms instead of being treated as a single regex pattern.
New Features
- Added a regex dialect mode to properly handle complex regex patterns with special characters and operators.

Adds a Lezer 'regex' dialect to the query language grammar. In regex mode, the tokenizers no longer emit openParen/closeParen tokens, so parentheses and pipe characters are consumed as plain word characters rather than query grouping operators. This fixes a bug where a query like (test|render)< was incorrectly parsed as AND(ParenExpr, Term) instead of a single regexp Term.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-03-26T06:23:27Z

Walkthrough

This change implements a regex dialect for the query language parser, enabling parentheses and regex patterns to be parsed as single tokens rather than as query syntax operators. The update modifies tokenizers, adds grammar dialect support, and integrates the regex parser into the web search interface.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Grammar and Parser Configuration `packages/queryLanguage/src/parser.terms.ts`, `packages/queryLanguage/src/parser.ts`, `packages/queryLanguage/src/query.grammar` | Added `Dialect_regex` constant and enabled regex dialect in parser deserialization; added `@dialects { regex }` directive to grammar. |
| Tokenization `packages/queryLanguage/src/tokens.ts` | Updated `parenToken`, `closeParenToken`, `wordToken`, and `negateToken` to accept parser stack and disable operator/grouping interpretation when regex dialect is enabled; parentheses and regex alternations are now consumed as literal characters. |
| Test Suite `packages/queryLanguage/test/grammar.regex.test.ts`, `packages/queryLanguage/test/grammar.test.ts`, `packages/queryLanguage/test/regex.txt` | Added dedicated regex dialect test file with 18 test cases covering alternation, anchors, quantifiers, character classes, and prefix filter combinations; updated main test to skip regex test cases. |
| Web Search Integration `packages/web/src/features/search/parser.ts` | Added `regexParser` configuration and updated `parseQuerySyntaxIntoIR` to select parser based on `isRegexEnabled` flag. |
| Documentation `CHANGELOG.md` | Added Unreleased changelog entry documenting fix for regex queries with parentheses being incorrectly split. |
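
The parser selection described in the Web Search Integration row might look roughly like the sketch below. It assumes the regex parser is derived from the base parser through Lezer's `configure({ dialect })` option. To stay runnable without `@lezer/lr`, the example uses a hypothetical stub (`makeStubParser`) and a structural `DialectParser` interface in place of the real `LRParser`. The name `selectParser` is also an assumption and is not taken from the PR.

```typescript
// Structural stand-in for the single Lezer method this sketch relies on.
interface DialectParser {
  readonly dialect: string | undefined;
  configure(config: { dialect?: string }): DialectParser;
}

// Hypothetical stub so the example runs without @lezer/lr installed.
function makeStubParser(dialect?: string): DialectParser {
  return {
    dialect,
    configure: (config: { dialect?: string }): DialectParser =>
      makeStubParser(config.dialect ?? dialect),
  };
}

// Build both parsers once and reuse them across queries.
const baseParser = makeStubParser();
const regexParser = baseParser.configure({ dialect: "regex" });

// Pick the parser from the search option, mirroring the isRegexEnabled flag.
function selectParser(options: { isRegexEnabled?: boolean }): DialectParser {
  return options.isRegexEnabled ? regexParser : baseParser;
}
```

Configuring the dialect once, rather than per query, keeps the choice a cheap lookup at parse time.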

Sequence Diagram

sequenceDiagram
    participant User
    participant WebParser as Web Parser
    participant TokenStack as Parser Stack
    participant Tokenizer
    participant QueryParser as Query Parser
    participant IR as IR Transformer

    User->>WebParser: parseQuerySyntaxIntoIR(query, {isRegexEnabled: true})
    WebParser->>WebParser: Select regexParser
    WebParser->>TokenStack: Initialize with dialect: "regex"
    WebParser->>Tokenizer: Tokenize query with dialect enabled
    
    alt Regex Mode Enabled
        Tokenizer->>TokenStack: dialectEnabled("regex") → true
        Tokenizer->>Tokenizer: Treat (foo|bar) as single word token
        Tokenizer->>Tokenizer: Disable paren operators
    else Normal Mode
        Tokenizer->>TokenStack: dialectEnabled("regex") → false
        Tokenizer->>Tokenizer: Treat ( ) as grouping operators
    end
    
    Tokenizer->>QueryParser: Return tokens
    QueryParser->>QueryParser: Parse with regex dialect
    QueryParser->>IR: Return AST
    IR->>IR: Transform to IR representation
    IR->>User: Return parsed IR

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

- feat(web): Streamed code search #623: Introduced the original queryLanguage parser structure that these changes extend with regex dialect support.
- fix(parser): Allow parenthesis in query and filter terms #788: Previously modified parentheses tokenization in `tokens.ts`, which this change builds upon for regex mode.
- fix(web): content: filter now respects regex mode #947: Modified `packages/web/src/features/search/parser.ts` to respect the `isRegexEnabled` flag in downstream IR transformation, complementing this parser-level integration.

Suggested reviewers

msukkari

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title accurately describes the main change: treating parentheses and pipe characters as word characters when regex mode is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

packages/queryLanguage/src/tokens.ts (2)
456-490: ⚠️ Potential issue | 🟡 Minor

Only emit negate for real prefix filters.

This still treats any `-...:` token as negation. With regex mode enabled, patterns like `-https?://` or `-[A-Z]:\d+` are claimed by the negate tokenizer before `wordToken` can consume them, even though they do not start with an entry in `PREFIXES`. Checking the text that follows the `-` against the known prefix set would keep `-repo:` working without breaking colon-bearing regexes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/queryLanguage/src/tokens.ts` around lines 456 - 490, The code
currently treats any "-...:" sequence as a prefix negation; modify the logic
that checks for a colon to instead extract the identifier between offset and the
colon and verify it is a known prefix from the PREFIXES set before emitting the
negate token. Concretely, inside the loop that sets foundColon (the block
referencing input.peek, peekOffset, foundColon) replace the naive
colon-detection with logic that builds the substring from offset to the colon
(or use an existing helper) and test PREFIXES.has(substring) (or equivalent) —
only if it's a known prefix should you call input.advance() and
input.acceptToken(negate); all other cases (e.g. regexes like -https?://) should
fall through to wordToken handling. Ensure this change coexists with the
existing Dialect_regex and hasBalancedParensAt checks.
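
A minimal sketch of the prefix-aware check suggested above. It assumes negation should fire only when the text between `-` and the first `:` is a known prefix. `KNOWN_PREFIXES` is a hypothetical subset (the real `PREFIXES` list lives in `tokens.ts` and is not reproduced here), and `shouldEmitNegate` is an illustrative name. The sketch covers only the `-prefix:` case and leaves out the negated-group handling that the real tokenizer also performs.

```typescript
// Hypothetical subset of prefixes, for illustration only.
const KNOWN_PREFIXES: ReadonlySet<string> = new Set(["repo", "file", "lang", "content"]);

// Returns true only when `input` at `offset` looks like "-<knownPrefix>:".
function shouldEmitNegate(input: string, offset: number): boolean {
  if (input[offset] !== "-") return false;
  const start = offset + 1;
  let end = start;
  // Scan to the first colon, giving up at whitespace or end of input.
  while (end < input.length && input[end] !== ":" && input[end] !== " ") {
    end++;
  }
  if (input[end] !== ":") return false;
  // Only a known prefix counts; anything else falls through to word handling.
  return KNOWN_PREFIXES.has(input.slice(start, end));
}
```

Under these assumptions `-repo:foo` is negated, while `-https?://` and `-[A-Z]:\d+` are left for the word tokenizer because `https?` and `[A-Z]` are not known prefixes.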
316-337: ⚠️ Potential issue | 🟠 Major

Make `or` disambiguation regex-aware.

`isOrKeyword()` still short-circuits before the new regex fast path, so terms like `foo or$`, `foo or(test)`, or `repo:or|foo` never reach this branch. In regex mode, `or` should only be reserved when it is actually acting as the boolean operator, not when it is the start of a regex term/value.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/queryLanguage/src/tokens.ts` around lines 316 - 337, The
short-circuit check using isOrKeyword is blocking the regex-mode fast path;
update the logic so isOrKeyword is skipped when regex dialect is enabled (or
make isOrKeyword itself respect stack.dialectEnabled(Dialect_regex)).
Concretely, either move the stack.dialectEnabled(Dialect_regex) branch above the
isOrKeyword/startsWithPrefix checks so the regex consumer runs first, or change
isOrKeyword to return false when stack.dialectEnabled(Dialect_regex) is true;
use the existing symbols (isOrKeyword, startsWithPrefix,
stack.dialectEnabled(Dialect_regex), input, EOF, isWhitespace,
input.acceptToken(word)) to locate and apply the change.
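
One way to picture the regex-aware disambiguation is the sketch below. It is not the actual `isOrKeyword` implementation, whose details are not shown in this review. The function name `isOrOperator`, the boolean `regexDialect` flag, and the exact rules are assumptions chosen for illustration: `or` counts as the operator only when it stands alone between whitespace, and a trailing `or` at end of input is treated as a plain word. Outside regex mode the sketch additionally assumes `or(` introduces a group.

```typescript
// Hypothetical helper: is the "or" at `offset` acting as the boolean operator?
function isOrOperator(input: string, offset: number, regexDialect: boolean): boolean {
  if (input.slice(offset, offset + 2) !== "or") return false;
  // Must start a token: beginning of input or preceded by a space.
  if (offset > 0 && input[offset - 1] !== " ") return false;
  const after = input[offset + 2];
  if (after === " ") return true;
  // Assumption: in normal mode "or(" starts a grouped operand; in regex mode
  // "(" is just another regex character, so "or(test)" is one term.
  if (!regexDialect && after === "(") return true;
  return false;
}
```

With these rules, `foo or bar` still parses `or` as the operator in regex mode, while `foo or$`, `foo or(test)`, and `repo:or|foo` keep `or` as part of the term.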

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5771cc2d-1869-496b-958e-bc281d5a2ee8

📥 Commits

Reviewing files that changed from the base of the PR and between 34a1435 and 04354d7.

📒 Files selected for processing (9)

CHANGELOG.md
packages/queryLanguage/src/parser.terms.ts
packages/queryLanguage/src/parser.ts
packages/queryLanguage/src/query.grammar
packages/queryLanguage/src/tokens.ts
packages/queryLanguage/test/grammar.regex.test.ts
packages/queryLanguage/test/grammar.test.ts
packages/queryLanguage/test/regex.txt
packages/web/src/features/search/parser.ts

brendan-kellam and others added 2 commits March 25, 2026 23:22

chore: update CHANGELOG for #1050

04354d7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge branch 'main' into brendan-kellam/fix-regex-paren-parsing-SOU-760

df7ee95

coderabbitai bot reviewed Mar 26, 2026

brendan-kellam merged commit 2b35bb0 into main Mar 26, 2026
9 checks passed

brendan-kellam deleted the brendan-kellam/fix-regex-paren-parsing-SOU-760 branch March 26, 2026 06:36

github-actions bot mentioned this pull request Mar 26, 2026

Sourcebot Roadmap 🚀 #459

Open

brendan-kellam merged 3 commits into main from brendan-kellam/fix-regex-paren-parsing-SOU-760
