fix(web): treat parens and | as word chars when regex mode is enabled (#1050)
Adds a Lezer `regex` dialect to the query language grammar. In regex mode, the tokenizers no longer emit `openParen`/`closeParen` tokens, so parentheses and pipe characters are consumed as plain word characters rather than query grouping operators. This fixes a bug where a query like `(test|render)<` was incorrectly parsed as `AND(ParenExpr, Term)` instead of a single regexp `Term`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
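The behavior change described above can be illustrated with a toy tokenizer. This is only a sketch: `tokenize`, the `Token` type, and the `regexMode` flag are hypothetical stand-ins, not the Lezer external tokenizers in `tokens.ts`.

```typescript
// Toy model of the fix; NOT the real tokenizers from tokens.ts.
// In regex mode, parens are never emitted as grouping tokens.
type Token = { kind: "openParen" | "closeParen" | "word"; text: string };

function isSpace(ch: string): boolean {
  return /\s/.test(ch);
}

function tokenize(query: string, regexMode: boolean): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < query.length) {
    const ch = query.charAt(i);
    if (isSpace(ch)) {
      i++;
      continue;
    }
    // Normal mode only: parens are grouping operators.
    if (!regexMode && ch === "(") {
      tokens.push({ kind: "openParen", text: ch });
      i++;
      continue;
    }
    if (!regexMode && ch === ")") {
      tokens.push({ kind: "closeParen", text: ch });
      i++;
      continue;
    }
    // Consume a word. In regex mode only whitespace ends it, so
    // "(", ")" and "|" stay inside the word.
    const start = i;
    while (i < query.length) {
      const c = query.charAt(i);
      if (isSpace(c)) break;
      if (!regexMode && (c === "(" || c === ")")) break;
      i++;
    }
    tokens.push({ kind: "word", text: query.slice(start, i) });
  }
  return tokens;
}

const regexTokens = tokenize("(test|render)<", true);
const normalTokens = tokenize("(test|render)<", false);
```

In this sketch the regex-mode call yields one `word` token for the whole pattern, while the normal-mode call splits it into a paren, a word, a paren, and a trailing word, which mirrors the mis-parse the PR describes.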
Walkthrough
This change implements a regex dialect for the query language parser, enabling parentheses and regex patterns to be parsed as single tokens rather than as query syntax operators. The update modifies tokenizers, adds grammar dialect support, and integrates the regex parser into the web search interface.
Sequence Diagram
sequenceDiagram
participant User
participant WebParser as Web Parser
participant TokenStack as Parser Stack
participant Tokenizer
participant QueryParser as Query Parser
participant IR as IR Transformer
User->>WebParser: parseQuerySyntaxIntoIR(query, {isRegexEnabled: true})
WebParser->>WebParser: Select regexParser
WebParser->>TokenStack: Initialize with dialect: "regex"
WebParser->>Tokenizer: Tokenize query with dialect enabled
alt Regex Mode Enabled
Tokenizer->>TokenStack: dialectEnabled("regex") → true
Tokenizer->>Tokenizer: Treat (foo|bar) as single word token
Tokenizer->>Tokenizer: Disable paren operators
else Normal Mode
Tokenizer->>TokenStack: dialectEnabled("regex") → false
Tokenizer->>Tokenizer: Treat ( ) as grouping operators
end
Tokenizer->>QueryParser: Return tokens
QueryParser->>QueryParser: Parse with regex dialect
QueryParser->>IR: Return AST
IR->>IR: Transform to IR representation
IR->>User: Return parsed IR
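The parser-selection step in the diagram might look roughly like the following. Everything here is an assumption for illustration: `SketchParser`, `selectParser`, and `ParseOptions` are hypothetical names, and the class only imitates the idea of a parser configured with a dialect string. It is not the code in `packages/web/src/features/search/parser.ts`.

```typescript
// Hypothetical stand-in for a dialect-aware parser.
type ParseOptions = { isRegexEnabled?: boolean };

class SketchParser {
  private readonly dialects: ReadonlySet<string>;

  constructor(dialects: ReadonlySet<string> = new Set<string>()) {
    this.dialects = dialects;
  }

  // Returns a new parser with the given space-separated dialects enabled.
  configure(config: { dialect: string }): SketchParser {
    const names = config.dialect.split(" ").filter((name) => name.length > 0);
    return new SketchParser(new Set(names));
  }

  dialectEnabled(name: string): boolean {
    return this.dialects.has(name);
  }
}

// Build both parsers once, then pick per query.
const baseParser = new SketchParser();
const regexParser = baseParser.configure({ dialect: "regex" });

function selectParser(options: ParseOptions = {}): SketchParser {
  return options.isRegexEnabled ? regexParser : baseParser;
}
```

Configuring the regex parser once and reusing it keeps per-query selection to a simple branch; whether the real code does this is not stated in the PR.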
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
packages/queryLanguage/src/tokens.ts (2)
456-490: ⚠️ Potential issue | 🟡 Minor
Only emit `negate` for real prefix filters.
This still treats any `-...:` token as negation. With regex mode enabled, patterns like `-https?://` or `-[A-Z]:\d+` get stolen from `wordToken` even though they are not in `PREFIXES`. Checking the known prefix set from `offset` would keep `-repo:` working without breaking colon-bearing regexes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/queryLanguage/src/tokens.ts` around lines 456 - 490, The code currently treats any "-...:" sequence as a prefix negation; modify the logic that checks for a colon to instead extract the identifier between offset and the colon and verify it is a known prefix from the PREFIXES set before emitting the negate token. Concretely, inside the loop that sets foundColon (the block referencing input.peek, peekOffset, foundColon) replace the naive colon-detection with logic that builds the substring from offset to the colon (or use an existing helper) and test PREFIXES.has(substring) (or equivalent) — only if it's a known prefix should you call input.advance() and input.acceptToken(negate); all other cases (e.g. regexes like -https?://) should fall through to wordToken handling. Ensure this change coexists with the existing Dialect_regex and hasBalancedParensAt checks.
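A minimal sketch of the suggested check, assuming a string-based helper rather than the Lezer `input.peek` loop. The `PREFIXES` contents and the `isNegatedPrefix` helper below are hypothetical; the real prefix set in `tokens.ts` may differ.

```typescript
// Hypothetical prefix list for illustration only.
const PREFIXES: ReadonlySet<string> = new Set(["repo:", "file:", "lang:"]);

// True only when text at `offset` is "-" followed by a known prefix.
function isNegatedPrefix(text: string, offset: number): boolean {
  if (text.charAt(offset) !== "-") return false;
  const colon = text.indexOf(":", offset + 1);
  if (colon === -1) return false;
  // Candidate includes the trailing colon, e.g. "repo:".
  const candidate = text.slice(offset + 1, colon + 1);
  return PREFIXES.has(candidate);
}
```

Under these assumptions, `-repo:foo` is recognized as a negated prefix, while `-https?://` and `-[A-Z]:\d+` are not, so they would fall through to word handling.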
316-337: ⚠️ Potential issue | 🟠 Major
Make `or` disambiguation regex-aware.
`isOrKeyword()` still short-circuits before the new regex fast path, so terms like `foo or$`, `foo or(test)`, or `repo:or|foo` never reach this branch. In regex mode, `or` should only be reserved when it is actually acting as the boolean operator, not when it is the start of a regex term/value.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/queryLanguage/src/tokens.ts` around lines 316 - 337, The short-circuit check using isOrKeyword is blocking the regex-mode fast path; update the logic so isOrKeyword is skipped when regex dialect is enabled (or make isOrKeyword itself respect stack.dialectEnabled(Dialect_regex)). Concretely, either move the stack.dialectEnabled(Dialect_regex) branch above the isOrKeyword/startsWithPrefix checks so the regex consumer runs first, or change isOrKeyword to return false when stack.dialectEnabled(Dialect_regex) is true; use the existing symbols (isOrKeyword, startsWithPrefix, stack.dialectEnabled(Dialect_regex), input, EOF, isWhitespace, input.acceptToken(word)) to locate and apply the change.
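One way to picture the second option (making the keyword check itself dialect-aware) is the sketch below. The `isOrOperator` helper and both delimiter rules are assumptions made for illustration; the actual `isOrKeyword` logic in `tokens.ts` is not shown in this review.

```typescript
// Hypothetical, string-based stand-in for a dialect-aware isOrKeyword.
function isOrOperator(text: string, offset: number, regexMode: boolean): boolean {
  if (text.slice(offset, offset + 2) !== "or") return false;
  const next = text.charAt(offset + 2); // "" at end of input
  if (regexMode) {
    // Assumption: in regex mode only a whitespace-delimited "or" is the
    // operator, so "or$", "or(test)" and "or|foo" remain word text.
    return next !== "" && /\s/.test(next);
  }
  // Assumption for normal mode: any non-word character ends the keyword.
  return next === "" || !/[A-Za-z0-9_]/.test(next);
}
```

With these rules, `foo or bar` still treats `or` as the operator in regex mode, while `foo or$` and `foo or(test)` do not.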
📒 Files selected for processing (9)
- CHANGELOG.md
- packages/queryLanguage/src/parser.terms.ts
- packages/queryLanguage/src/parser.ts
- packages/queryLanguage/src/query.grammar
- packages/queryLanguage/src/tokens.ts
- packages/queryLanguage/test/grammar.regex.test.ts
- packages/queryLanguage/test/grammar.test.ts
- packages/queryLanguage/test/regex.txt
- packages/web/src/features/search/parser.ts
Summary
- Adds a `regex` dialect to the query language grammar via `@dialects { regex }`
- `parenToken`, `closeParenToken`, `negateToken`, and `wordToken` tokenizers all check `stack.dialectEnabled(Dialect_regex)` and treat `(`, `)`, and `|` as plain word characters instead of query grouping operators
- Fixes a bug where `(test|render)<` with regex mode enabled was incorrectly parsed as `AND(ParenExpr("test|render"), Term("<"))` instead of a single `Term("(test|render)<")`

Fixes SOU-760
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
- Fixed regex queries (e.g., `(test|render)<`) being incorrectly split into multiple search terms instead of being treated as a single regex pattern.
New Features