feat: TEXT field search with exact phrase, fuzzy, proximity, OR/union, scoring #17

Merged
rbs333 merged 28 commits into main from feature/text-search-exact-phrase on Apr 6, 2026

Conversation

@nkanu17 nkanu17 commented Mar 31, 2026

Summary

Comprehensive TEXT field search overhaul: fixes exact phrase matching (= operator) and adds the full RediSearch text search surface area, including fuzzy LD levels, suffix/infix matching, OR/union, proximity search, BM25 scoring, and special character escaping.

Feature Reference

| Feature | SQL Syntax | RediSearch Output | Notes |
| --- | --- | --- | --- |
| Exact phrase | `WHERE name = 'bank of red'` | `@name:"bank red"` | Stopwords stripped (not indexed by Redis) |
| Negated phrase | `WHERE name != 'test'` | `-@name:"test"` | |
| Tokenized search | `WHERE fulltext(title, 'gaming laptop')` | `@title:(gaming laptop)` | AND semantics |
| OR / union | `WHERE fulltext(title, 'laptop OR tablet')` | `@title:(laptop\|tablet)` | Terms escaped |
| Prefix | `WHERE title LIKE 'lap%'` | `@title:lap*` | Unchanged |
| Suffix | `WHERE title LIKE '%phone'` | `@title:*phone` | New |
| Contains / infix | `WHERE title LIKE '%phone%'` | `@title:*phone*` | New |
| Fuzzy LD=1 | `WHERE fuzzy(title, 'laptap')` | `@title:%laptap%` | Default level |
| Fuzzy LD=2 | `WHERE fuzzy(title, 'laptap', 2)` | `@title:%%laptap%%` | |
| Fuzzy LD=3 | `WHERE fuzzy(title, 'laptap', 3)` | `@title:%%%laptap%%%` | Max LD |
| Proximity (slop) | `WHERE fulltext(title, 'a b', 2)` | `@title:(a b) => { $slop: 2; }` | |
| Proximity + inorder | `WHERE fulltext(title, 'a b', 2, true)` | `@title:(a b) => { $slop: 2; $inorder: true; }` | |
| BM25 scoring | `SELECT score() AS s FROM ...` | `WITHSCORES SCORER BM25` | Default scorer |
| Custom scorer | `SELECT score('TFIDF') AS s FROM ...` | `WITHSCORES SCORER TFIDF` | |
| Score-only SELECT | `SELECT score() AS s FROM ...` | `RETURN 0` + `WITHSCORES` | No payload leak |
| Verbatim | `ParsedQuery.verbatim = True` | `VERBATIM` | API-only flag |
| Nostopwords | `ParsedQuery.nostopwords = True` | `NOSTOPWORDS` | API-only flag |
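
The fuzzy, LIKE, and proximity rows in the feature reference are essentially string rewrites. The sketch below reproduces those outputs with hypothetical helper names (`fuzzy_condition`, `like_condition`, `proximity_condition`); they are not the functions in `sql_redis/query_builder.py`, and escaping and argument validation are deliberately left out.

```python
def fuzzy_condition(field, term, level=1):
    """Wrap the term in `level` percent signs on each side."""
    marks = "%" * level
    return f"@{field}:{marks}{term}{marks}"


def like_condition(field, pattern):
    """Map a leading/trailing SQL % onto a RediSearch * wildcard."""
    leading = "*" if pattern.startswith("%") else ""
    trailing = "*" if len(pattern) > 1 and pattern.endswith("%") else ""
    return f"@{field}:{leading}{pattern.strip('%')}{trailing}"


def proximity_condition(field, terms, slop, inorder=False):
    """Attach $slop (and optionally $inorder) as query attributes."""
    attrs = f"$slop: {slop};"
    if inorder:
        attrs += " $inorder: true;"
    return f"@{field}:({terms}) => {{ {attrs} }}"


print(fuzzy_condition("title", "laptap", 2))
print(like_condition("title", "%phone%"))
print(proximity_condition("title", "a b", 2, inorder=True))
```

Each call mirrors one row of the table, so the helpers can serve as a quick mental model of the translation rather than as a replacement for the real builder.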

Problem

WHERE name = 'bank of red' previously stripped stopwords and produced @name:(bank red) as a tokenized query, matching any document containing "bank" and "red" anywhere rather than the exact phrase. The = operator now produces a proper double-quoted exact phrase query (@name:"bank red"), with stopwords stripped to match how RediSearch indexes text (stopwords are not stored in the inverted index and their positions are not preserved). This was the entry point; the PR grew to cover all missing RediSearch text search features.

Key Implementation Details

Stopword handling

Both = (exact phrase) and fulltext() (tokenized search) strip default Redis stopwords before sending queries to RediSearch. This is necessary because RediSearch does not index stopwords, so including them in queries causes syntax errors or failed matches. A UserWarning is emitted when stopwords are removed. Users who need stopwords indexed can create their index with STOPWORDS 0.
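
A minimal sketch of the stopword stripping described above, assuming the stopword set below matches the default list documented for RediSearch (verify it against your Redis version). `exact_phrase_query` is a hypothetical name, not the PR's function, and the sketch does not handle phrases that consist only of stopwords.

```python
import warnings

# Assumption: the default RediSearch stopword list as documented by Redis.
DEFAULT_STOPWORDS = frozenset({
    "a", "is", "the", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "it", "no", "not", "of", "on", "or", "such",
    "that", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
})


def exact_phrase_query(field, phrase):
    """Drop stopwords, warn if any were removed, and quote what remains."""
    words = phrase.split()
    kept = [w for w in words if w.lower() not in DEFAULT_STOPWORDS]
    removed = [w for w in words if w.lower() in DEFAULT_STOPWORDS]
    if removed:
        warnings.warn(
            f"Stopwords removed from query: {removed}",
            UserWarning,
            stacklevel=2,
        )
    return f'@{field}:"{" ".join(kept)}"'


print(exact_phrase_query("name", "bank of red"))
```

The warning matters because the translated query no longer matches the user's literal input; an index created with `STOPWORDS 0` would not need this step.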

Escaping

All text search operators escape RediSearch-reserved characters (|, -, (, ), @, ~, ", \) via _escape_fulltext_term(). OR operands are individually escaped before joining with |.
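
The escaping rule can be sketched as follows. This is an illustration under stated assumptions, not the body of `_escape_fulltext_term()`: the reserved set is the one listed in this description (later commits mention escaping further characters), `escape_term` and `fulltext_or` are hypothetical names, only uppercase `OR` is treated as a union operator as the commit notes describe, and multi-word operands are not handled.

```python
import re

# Reserved characters as listed in the PR description: | - ( ) @ ~ " \
RESERVED = frozenset('|-()@~"\\')


def escape_term(term):
    """Backslash-escape every reserved character in a single term."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in term)


def fulltext_or(field, text):
    """Split on uppercase OR, escape each operand, and join with |."""
    operands = [part.strip() for part in re.split(r"\s+OR\s+", text)]
    if any(not operand for operand in operands):
        raise ValueError("empty OR operand")
    joined = "|".join(escape_term(operand) for operand in operands)
    return f"@{field}:({joined})"


print(escape_term("anti-virus"))
print(fulltext_or("title", "laptop OR tablet"))
```

Escaping each operand before joining keeps the `|` separators themselves unescaped while preventing a `-` or `@` inside user data from being read as negation or a field reference.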

Score parsing

WITHSCORES + RETURN 0 changes the Redis response stride from 3 to 2 (no field arrays). Both sync and async executors handle this via _ScoreParseMixin._has_return_0(). Score alias collision with document fields is detected across all rows and auto-prefixed with __score_. Bytes field keys are normalized to strings for consistent collision detection regardless of decode_responses setting.
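
To make the stride change concrete, here is a rough sketch of parsing a flat `FT.SEARCH ... WITHSCORES` reply of the form `[count, id, score, fields, ...]`, which becomes `[count, id, score, ...]` when `RETURN 0` is present. The function names and the row layout are assumptions for illustration; the PR's `_ScoreParseMixin` may be structured differently.

```python
def _to_str(key):
    """Normalize bytes field names (decode_responses=False) to str."""
    return key.decode() if isinstance(key, bytes) else key


def parse_scored_reply(reply, has_return_0):
    """Walk the flat reply with stride 2 (RETURN 0) or stride 3."""
    stride = 2 if has_return_0 else 3
    rows = []
    for i in range(1, len(reply), stride):
        row = {"id": reply[i], "score": float(reply[i + 1]), "fields": {}}
        if not has_return_0:
            flat = reply[i + 2]  # [name1, value1, name2, value2, ...]
            row["fields"] = {
                _to_str(k): v for k, v in zip(flat[::2], flat[1::2])
            }
        rows.append(row)
    return rows


def resolve_score_alias(alias, rows):
    """Prefix the alias until it collides with no field in any row."""
    taken = {name for row in rows for name in row["fields"]}
    while alias in taken:
        alias = "__score_" + alias
    return alias


reply = [2, "doc:1", "1.5", ["title", "laptop", "s", "x"],
         "doc:2", "0.5", [b"title", b"tablet"]]
rows = parse_scored_reply(reply, has_return_0=False)
print(resolve_score_alias("s", rows))
```

Resolving the alias once over all rows, instead of per row, keeps the score column name stable across the result set.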

Validation

  • Fuzzy level must be 1, 2, or 3
  • Slop must be a non-negative integer
  • score() + GROUP BY (FT.AGGREGATE) raises ValueError
  • fulltext(), fuzzy(), and score() arguments must be literals (placeholders allowed for parameter substitution)
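
The first two rules could be enforced roughly like this. The helper names are hypothetical; the explicit `bool` check is there because `True` and `False` are `int` instances in Python and would otherwise pass as 1 and 0, which the commit notes call out as a case to reject.

```python
def validate_fuzzy_level(level):
    """Accept only the integers 1, 2, or 3 (no bools, no floats)."""
    if isinstance(level, bool) or not isinstance(level, int):
        raise ValueError(f"fuzzy level must be an integer, got {level!r}")
    if level not in (1, 2, 3):
        raise ValueError(f"fuzzy level must be 1, 2, or 3, got {level}")
    return level


def validate_slop(slop):
    """Accept only non-negative integers (no bools, no floats)."""
    if isinstance(slop, bool) or not isinstance(slop, int):
        raise ValueError(f"slop must be an integer, got {slop!r}")
    if slop < 0:
        raise ValueError(f"slop must be non-negative, got {slop}")
    return slop


print(validate_fuzzy_level(2), validate_slop(0))
```

Rejecting `2.9` outright, rather than truncating it to `2`, avoids silently running a different query than the one the caller wrote.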

Double-negation normalization

NOT (field != 'x') resolves to @field:"x" instead of double-negating.
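
One way to picture this normalization is to treat the `NOT` flag and the `!=` operator as two negations that cancel out. The function below is a hypothetical sketch, not the translator's code, and it skips escaping and stopword handling.

```python
def phrase_condition(field, value, operator="=", negated=False):
    """Emit a single '-' only when exactly one negation applies."""
    is_negative = negated != (operator == "!=")  # XOR of the two negations
    prefix = "-" if is_negative else ""
    return f'{prefix}@{field}:"{value}"'


print(phrase_condition("field", "x", operator="!=", negated=True))
print(phrase_condition("name", "test", operator="!="))
```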

Files Changed

| File | Change |
| --- | --- |
| `sql_redis/query_builder.py` | Exact phrase with stopword stripping for `=`/`!=`, fuzzy levels, suffix/infix, OR, proximity, escaping |
| `sql_redis/parser.py` | `fulltext(f, v, slop, inorder)`, `fuzzy(f, v, level)`, `score()` in SELECT, argument validation |
| `sql_redis/analyzer.py` | Routes FULLTEXT/FUZZY conditions, scoring spec |
| `sql_redis/translator.py` | Wires `fuzzy_level`/`slop`/`inorder`, emits `WITHSCORES`/`SCORER`, double-negation normalization |
| `sql_redis/executor.py` | `_ScoreParseMixin` for WITHSCORES + RETURN 0 response parsing, score alias collision guard with bytes normalization |
| `tests/test_query_builder.py` | All new operators, escaping, fuzzy levels, proximity, OR, stopword stripping |
| `tests/test_translator.py` | End-to-end: fuzzy LD 2/3, suffix/infix, OR, proximity, scoring |
| `tests/test_sql_parser.py` | Parser tests for fulltext/fuzzy args, `score()`, slop validation |

Tests

  • 456 tests pass (all unit + integration)
  • make lint clean (black, isort)

Breaking Changes

= on TEXT fields now produces @field:"value" (double-quoted exact phrase) instead of @field:value (tokenized). Stopwords are stripped from the phrase to match how RediSearch indexes text. Callers relying on the old tokenized behavior should switch to fulltext().

@nkanu17 nkanu17 requested a review from Copilot March 31, 2026 16:03

nkanu17 commented Mar 31, 2026

@codex review

Copilot AI left a comment

Pull request overview

Updates TEXT equality semantics in the SQL→RediSearch translation layer so = on TEXT fields uses exact phrase matching (quoted) to preserve stopwords, addressing cases like "bank of red" where tokenization/stopword stripping changes meaning.

Changes:

  • Change TEXT = (and !=) query generation to wrap values in quotes for exact phrase semantics.
  • Keep MATCH as tokenized search, using parenthesized multi-term syntax and stopword warnings.
  • Update/expand tests to reflect the new query syntax and error-message variability across Redis versions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `sql_redis/query_builder.py` | Adjusts TEXT condition building to quote `=`/`!=` values and changes multi-word handling for tokenized search. |
| `tests/test_translator.py` | Updates translator expectations for TEXT `=` to produce quoted phrase queries. |
| `tests/test_query_builder.py` | Updates/adds QueryBuilder TEXT tests for exact phrase behavior and MATCH multi-word behavior. |
| `tests/test_parameter_substitution.py` | Relaxes empty-string TEXT error assertion to accommodate Redis version differences. |

Comment thread sql_redis/query_builder.py Outdated
Comment thread sql_redis/query_builder.py Outdated
Comment thread tests/test_parameter_substitution.py Outdated
Comment thread tests/test_query_builder.py Outdated
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 👍

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@nkanu17 nkanu17 force-pushed the feature/text-search-exact-phrase branch from a29ca6a to 45a4716 March 31, 2026 16:15
@nkanu17 nkanu17 changed the title from "fix: TEXT field = operator now uses exact phrase matching" to "Fix TEXT field = operator to use exact phrase matching" Mar 31, 2026
@nkanu17 nkanu17 requested a review from Copilot March 31, 2026 16:31

nkanu17 commented Mar 31, 2026

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🎉

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

sql_redis/query_builder.py:90

  • The multi-field branch returns early using the raw value, which bypasses the operator-specific formatting/escaping and also drops negation handling. This reintroduces incorrect behavior for =/!= (missing quotes / stopword preservation), and multi-word values won’t use the new parenthesized MATCH semantics. Consider computing search_value first (including escaping/stopword filtering) and then applying it to both single- and multi-field cases, and include prefix for negated / != queries (e.g., -(@f1|f2:...)).
        # Derive negation from both the flag and the operator itself,
        # consistent with how build_tag_condition handles != via operator.
        prefix = "-" if negated or operator == "!=" else ""

        # Handle multi-field search
        if isinstance(field, list):
            field_str = "|".join(field)
            return f"(@{field_str}:{value})"
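
A hypothetical sketch of the restructuring this comment suggests: compute the operator-specific search value first, then apply it, together with the negation prefix, to both the single-field and multi-field paths. The function body is an assumption for illustration and omits the stopword and escaping logic of the real `build_text_condition()`.

```python
def build_text_condition(field, operator, value, negated=False):
    """Format the value once, then share it across both field shapes."""
    prefix = "-" if negated or operator == "!=" else ""

    if operator in ("=", "!="):
        # Exact phrase: quote the value, escaping embedded double quotes.
        search_value = '"' + value.replace('"', '\\"') + '"'
    else:
        search_value = value

    if isinstance(field, list):
        field_str = "|".join(field)
        return f"{prefix}(@{field_str}:{search_value})"
    return f"{prefix}@{field}:{search_value}"


print(build_text_condition(["f1", "f2"], "!=", "test"))
```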

Comment thread sql_redis/query_builder.py
Comment thread sql_redis/translator.py Outdated
Comment thread sql_redis/analyzer.py Outdated
Comment thread tests/test_query_builder.py Outdated
@nkanu17 nkanu17 force-pushed the feature/text-search-exact-phrase branch from 45a4716 to 39aa28d March 31, 2026 16:40
@nkanu17 nkanu17 requested a review from Copilot March 31, 2026 17:05

nkanu17 commented Mar 31, 2026

@codex review

@nkanu17 nkanu17 force-pushed the feature/text-search-exact-phrase branch from 39aa28d to a7820d0 March 31, 2026 17:08
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

sql_redis/query_builder.py:90

  • The multi-field early return bypasses all operator handling (including the new exact-phrase quoting/escaping for =/!=) and also ignores negated/!= when building the prefix. If multi-field search is part of the public API, it should apply the same operator semantics as the single-field path (or explicitly disallow operators other than FULLTEXT/MATCH and raise).
        # Derive negation from both the flag and the operator itself,
        # consistent with how build_tag_condition handles != via operator.
        prefix = "-" if negated or operator == "!=" else ""

        # Handle multi-field search
        if isinstance(field, list):
            field_str = "|".join(field)
            return f"(@{field_str}:{value})"

Comment thread sql_redis/query_builder.py
Comment thread sql_redis/query_builder.py
Comment thread tests/test_query_builder.py

nkanu17 commented Mar 31, 2026

@copilot review


nkanu17 commented Mar 31, 2026

@copilot review


nkanu17 commented Mar 31, 2026

@copilot review

Copilot AI commented Mar 31, 2026

@copilot review

Addressed all three open review threads in commit 506307b:

  1. MATCH/FULLTEXT consistency — removed all "MATCH" operator references from tests and the docstring. build_text_condition() now documents only =, !=, FULLTEXT, LIKE, FUZZY.

  2. RediSearch special-char escaping: added TEXT_QUERY_SPECIAL_CHARS = frozenset({"\\", "-", "@", "|", "(", ")"}) and _escape_fulltext_term() classmethod. Applied to all FULLTEXT single-term, multi-word, and OR paths to prevent characters like - (e.g., "anti-virus") or @ in user data from altering query semantics. Intentional RediSearch features (~, *, %, ^) are excluded from escaping. New test test_text_fulltext_special_chars_escaped verifies "anti-virus" → @description:anti\-virus.

  3. Black formatting — ran black on tests/test_sql_queries.py to fix pre-existing f-string formatting issues in a file I modified in a prior commit.

@copilot review

Same changes apply — see reply above (commit 506307b).


nkanu17 commented Mar 31, 2026

@copilot review

nkanu17 and others added 27 commits April 3, 2026 14:10
…y, BM25 scoring

- Parser: add LIKE handler, Boolean extraction for inorder param
- QueryBuilder: suffix/infix patterns, fuzzy LD 2-3, slop/inorder attributes, scoring
- Translator: score_alias on TranslatedQuery, WITHSCORES/SCORER args
- Executor: stride-3 response parsing for WITHSCORES (sync + async)
- Tests: 137 new tests (82 unit QB, 37 unit translator, 18 integration)
- All 402 tests pass, mypy clean
…tors

Addresses Copilot review feedback:
- Multi-field early return now applies _escape_text_value and prefix for =/!=
- Documented MATCH as internal alias for FULLTEXT in docstring
- score() only SELECT emits RETURN 0 to prevent full document payload leak
- score() with FT.AGGREGATE raises ValueError (WITHSCORES is FT.SEARCH-only)
- Score alias collision guard in executor (both sync/async)
- Add _escape_fulltext_term with double-quote escaping
- Apply escaping to LIKE and FUZZY operators for special chars
- Escape multi-field non-exact text search values
- Mark verbatim/nostopwords as API-only fields (no SQL parser path yet)
- 9 new tests (128 total QB+translator), 411 total pass
- Fix stride mismatch in executor when WITHSCORES + RETURN 0 (NOCONTENT)
  produces [count, id, score, ...] without field arrays
- Add _ScoreParseMixin with _has_return_0() helper for both sync/async executors
- Escape individual OR operands in fulltext() to prevent accidental
  negation (anti-virus) and field injection (@field)
- Validate slop >= 0 in parser, raise ValueError for negative values
- Add tests for all three fixes
…idation

- Escape individual terms in multi-word FULLTEXT branch while preserving
  ~ optional-term prefix (prevents @field injection and accidental negation)
- Move score alias collision check out of per-row loop into
  _resolve_score_alias() so all rows use the same column name
- Reject float slop values (e.g. 2.9) instead of silently truncating
- Add tests for all three fixes (350 total)
- When no RETURN clause exists (SELECT *), use first row's field names
  to detect score alias collisions instead of skipping the check
- Reject non-integer fuzzy levels (e.g. 2.9) with ValueError, matching
  the slop validation behavior
- Add test for fuzzy float rejection (351 total)
… boolean validation

- Refactor build_text_condition so multi-field queries share the same
  operator-specific formatting as single-field (fixes multi-word and OR
  terms escaping the field scope in multi-field queries)
- Reject fuzzy() and fulltext() on non-TEXT fields (TAG, NUMERIC, etc.)
  with ValueError instead of silently producing incorrect queries
- Reject boolean values for slop and fuzzy level arguments (true/false
  would silently coerce to 1/0 via int())
- Add 7 new tests (358 total)
… alias loop

- OR parsing now case-insensitive and whitespace-tolerant (regex split)
- Escape * and + in TEXT_QUERY_SPECIAL_CHARS to prevent wildcard/mandatory
  injection in FULLTEXT/FUZZY terms
- Add LIKE to the non-TEXT field guard (alongside FUZZY/FULLTEXT)
- Validate inorder argument strictly: reject values like 'yes', 2, etc.
- Score alias collision uses while loop for repeated prefix collisions
- Add 7 new tests (365 total)
- Feature reference table with SQL syntax and RediSearch output
- Examples for exact phrase, fuzzy, OR, proximity, LIKE patterns, scoring
- Notes on operator restrictions and auto-escaping
…ists()

- Preserve ~ optional-term prefix on single-word FULLTEXT (was escaped)
- Wrap multi-word OR operands in parentheses to prevent precedence issues
  (e.g. 'gaming laptop OR tablet' → (gaming laptop)|tablet)
- Add IS NULL / IS NOT NULL (ismissing) section to README
- Add exists() function section to README
- Add to What's Implemented checklist
- Add 3 new tests (368 total)
- Raise ValueError for empty OR operands ('laptop OR ', ' OR tablet')
  instead of crashing with IndexError
- LIKE error message now says 'LIKE can only be used...' instead of
  'like() can only be used...' since LIKE is not a function
- Fix README table: negation of single-term FULLTEXT doesn't have parens
- Add 2 new tests (370 total)
…opword warning

- Preserve caller-provided scorer casing (score('MyScorer') no longer
  uppercased to MYSCORER)
- Raise ValueError when multiple score() expressions in same query
- Preserve ~ optional-term prefix in OR operands
- Fix misleading stopword warning when all tokens are stopwords
- Remove redundant inner pytest import in test_translator.py
- Add 3 new tests (373 total)
…topword warning grammar

- Raise ValueError when score() receives more than one argument
- Validate slop is a non-negative int at QueryBuilder level (not just parser)
- Fix stopword warning grammar for all-stopword inputs
- Add 3 new tests (376 total)
…/fuzzy signatures

- LIKE '%gaming laptop%' now generates @title:(*gaming laptop*) to prevent
  token leaking across fields in Dialect 2
- fulltext(title) and fuzzy(title) with < 2 args now raise ValueError
  instead of silently dropping the predicate
- Update existing test expectation for insufficient args
- Add 3 new tests (379 total)
…as collision, extra arg validation

P1 fixes:
- Lowercase 'or' no longer parsed as boolean OR in FULLTEXT; only uppercase
  'OR' triggers union semantics ('bank or america' stays as AND search)
- ORDER BY score() alias DESC omits invalid SORTBY (RediSearch sorts by
  relevance by default); ORDER BY score ASC raises ValueError

P2 fixes:
- Score alias collision detection now checks per-row instead of first-row-only,
  preventing field overwrite when later rows have different field sets
- fulltext() rejects >4 args, fuzzy() rejects >3 args (was silently ignoring)
- Applied to both sync and async executor paths

Add 8 new tests (384 total)
…al arg

- fulltext(UPPER(title), ...) and fuzzy('title', ...) now raise ValueError
  instead of silently dropping the predicate
- score(my_column) raises ValueError; only literal scorer names accepted
- Update existing test expectations for non-column first arg
- Add 2 new tests (386 total)
- Fix FULLTEXT operator naming (MATCH → FULLTEXT) in query_builder and tests
- Resolve score alias once per result set to prevent column name flip-flop
- Add fuzzy_level range validation (1-3) in parser
- Fix import formatting (isort/black) in translator.py and analyzer.py
… resolution, OR docs fix

- Validate fulltext/fuzzy/score args are literals (allow Placeholders)
- Collect all field names across result set before resolving score alias
- Clarify OR operator case-sensitivity in README
When decode_responses=False, Redis returns field names as bytes. The
collision check now decodes them so alias detection works consistently.
RediSearch does not index stopwords, so exact phrase queries like
"diagnosing and treating" fail with a syntax error. The = operator
now strips default stopwords before wrapping in double quotes, matching
how the indexer assigns consecutive positions. A warning is emitted
when stopwords are removed.

Also removes outdated 'Use = operator for exact phrase matching that
preserves stopwords' hint from FULLTEXT stopword warning.
- Detect WITHSCORES mode via translated.score_alias instead of
  scanning args for the literal token, preventing false positives
  if a field is named WITHSCORES.
- Reject score('') with a clear ValueError so the caller's intent
  is not silently lost by falling back to the default scorer.
Update README to reflect that both exact phrase (=) and tokenized
search (fulltext()) strip default Redis stopwords before sending
queries. Adds a dedicated Stopword Handling section with examples
and the STOPWORDS 0 workaround. Replaces outdated 'preserves
stopwords' language.
- Add bool guard in _build_condition and _convert_to_numeric so
  WHERE price = true raises ValueError instead of producing
  invalid @price:[True True] syntax.
- Expand OR detection regex to match leading/trailing OR
  (e.g. 'laptop OR', 'OR tablet') so the empty-operand check
  catches these malformed inputs instead of silently dropping OR
  as a stopword.
Strip Redis default stopwords from each OR operand, matching the
existing multi-word FULLTEXT path behavior.  If an operand becomes
empty after filtering (all tokens are stopwords), fall back to the
original words so the query is not silently truncated.  A UserWarning
is emitted listing the removed stopwords.
@nkanu17 nkanu17 force-pushed the feature/text-search-exact-phrase branch from 198104b to 45c7fb7 April 3, 2026 18:11
@rbs333 rbs333 merged commit 13c82d2 into main Apr 6, 2026
8 checks passed
4 participants