Merged
28 commits
c3832ca
fix: TEXT field = operator now uses exact phrase matching
nkanu17 Mar 31, 2026
cc7e9af
feat: advanced text search — fuzzy LD 2/3, suffix/infix, OR, proximit…
nkanu17 Mar 31, 2026
94c9d45
fix: multi-field text search applies escaping/negation for =/!= opera…
nkanu17 Mar 31, 2026
1ec2b00
fix: address review feedback — FULLTEXT consistency, term escaping, B…
Copilot Mar 31, 2026
7fb840d
fix: address Codex/Copilot review feedback (cycle 2)
nkanu17 Mar 31, 2026
3ffccb2
fix: double-negation XOR semantics, remove dead verbatim/nostopwords …
Copilot Mar 31, 2026
33b2b9e
fix: WITHSCORES+RETURN0 parsing, OR operand escaping, slop validation
nkanu17 Apr 1, 2026
4d5c093
fix: multi-word FULLTEXT escaping, stable score alias, slop float val…
nkanu17 Apr 1, 2026
e43827a
fix: score alias collision for SELECT *, reject float fuzzy levels
nkanu17 Apr 1, 2026
2bc82aa
fix: multi-field FULLTEXT scoping, reject fuzzy/fulltext on non-TEXT,…
nkanu17 Apr 1, 2026
354111f
fix: case-insensitive OR, escape */+, LIKE guard, inorder validation,…
nkanu17 Apr 1, 2026
20abd76
docs: add TEXT search section to README
nkanu17 Apr 1, 2026
6e4ec67
fix: single-term ~ prefix, multi-word OR grouping; docs: IS NULL & ex…
nkanu17 Apr 1, 2026
09286e6
fix: empty OR operand guard, LIKE error message, README negation typo
nkanu17 Apr 1, 2026
055337f
fix: scorer case preservation, duplicate score guard, OR ~ prefix, st…
nkanu17 Apr 1, 2026
8a69d06
fix: score() arg count validation, slop validation in QueryBuilder, s…
nkanu17 Apr 1, 2026
3d3485c
fix: parenthesize multi-word LIKE patterns, raise on invalid fulltext…
nkanu17 Apr 1, 2026
5810417
fix: 4 review issues — case-sensitive OR, score ORDER BY, per-row ali…
nkanu17 Apr 1, 2026
2ca128b
fix: reject non-column args in fulltext/fuzzy, validate score() liter…
nkanu17 Apr 1, 2026
be1a839
address PR #17 review comments
nkanu17 Apr 2, 2026
597c56d
Address review comments: strict parser validation, stable score alias…
nkanu17 Apr 2, 2026
610bbf2
Normalize bytes field keys to str in score alias collision detection
nkanu17 Apr 2, 2026
6bd0745
Fix black formatting
nkanu17 Apr 2, 2026
8e2753a
Strip stopwords from exact phrase queries (= operator)
nkanu17 Apr 2, 2026
74dbaf7
Harden WITHSCORES detection and reject empty scorer names
nkanu17 Apr 2, 2026
9c857e9
Document stopword stripping behavior for = and fulltext()
nkanu17 Apr 2, 2026
13c4a32
Reject boolean values in numeric context, catch dangling OR
nkanu17 Apr 3, 2026
45c7fb7
Filter stopwords in OR operands before building text query
nkanu17 Apr 3, 2026
110 changes: 109 additions & 1 deletion README.md
@@ -154,9 +154,11 @@ The layered approach emerged from TDD — writing tests first revealed natural b
- [x] Computed fields: `price * 0.9 AS discounted`
- [x] Vector KNN search: `vector_distance(field, :param)`
- [x] Hybrid search (filters + vector)
-- [x] Full-text search: `LIKE 'prefix%'` (prefix), `fulltext(field, 'terms')` function
- [x] Full-text search: exact phrase, fuzzy, proximity, OR/union, LIKE patterns, BM25 scoring (see below)
- [x] GEO field queries with full operator support (see below)
- [x] Date functions: `YEAR()`, `MONTH()`, `DAY()`, `DATE_FORMAT()`, etc. (see below)
- [x] `IS NULL` / `IS NOT NULL` via `ismissing()` (requires Redis 7.4+, see below)
- [x] `exists()` function for field presence checks (see below)

## What's Not Implemented (Yet...)

@@ -166,6 +168,112 @@ The layered approach emerged from TDD — writing tests first revealed natural b
- [ ] DISTINCT
- [ ] Index creation from SQL (CREATE INDEX)

### TEXT Search

Full-text search on TEXT fields with multiple search modes:

| Feature | SQL Syntax | RediSearch Output | Notes |
|---------|-----------|-------------------|-------|
| Exact phrase | `title = 'gaming laptop'` | `@title:"gaming laptop"` | Stopwords stripped |
| Tokenized search | `fulltext(title, 'gaming laptop')` | `@title:(gaming laptop)` | Stopwords stripped |
| Fuzzy LD=1 | `fuzzy(title, 'laptap')` | `@title:%laptap%` | |
| Fuzzy LD=2 | `fuzzy(title, 'laptap', 2)` | `@title:%%laptap%%` | |
| Fuzzy LD=3 | `fuzzy(title, 'laptap', 3)` | `@title:%%%laptap%%%` | |
| OR / union | `fulltext(title, 'laptop OR tablet')` | `@title:(laptop\|tablet)` | |
| Prefix | `title LIKE 'lap%'` | `@title:lap*` | |
| Suffix | `title LIKE '%top'` | `@title:*top` | |
| Contains | `title LIKE '%apt%'` | `@title:*apt*` | |
| Proximity (slop) | `fulltext(title, 'gaming laptop', 2)` | `@title:(gaming laptop) => { $slop: 2; }` | |
| Proximity + order | `fulltext(title, 'gaming laptop', 2, true)` | `@title:(gaming laptop) => { $slop: 2; $inorder: true; }` | |
| Optional term | `fulltext(title, 'laptop ~gaming')` | `@title:(laptop ~gaming)` | |
| BM25 score | `SELECT score() AS relevance FROM idx` | `FT.SEARCH ... WITHSCORES` | |
| Negation | `NOT fulltext(title, 'refurbished')` | `-@title:refurbished` | |

**Examples:**

```sql
-- Exact phrase match (stopwords like "of" are stripped automatically)
SELECT * FROM products WHERE title = 'bank of america'
-- Produces: @title:"bank america"

-- Fuzzy search for typos (Levenshtein distance 2)
SELECT * FROM products WHERE fuzzy(title, 'laptap', 2)

-- OR search across terms
SELECT * FROM products WHERE fulltext(title, 'laptop OR tablet OR phone')

-- Proximity: terms within 3 words of each other, in order
SELECT * FROM products WHERE fulltext(title, 'gaming laptop', 3, true)

-- Suffix/contains pattern matching
SELECT * FROM products WHERE title LIKE '%phone%'

-- BM25 relevance scoring
SELECT title, score() AS relevance FROM products WHERE fulltext(title, 'laptop')

-- Multi-field search
SELECT * FROM products WHERE fulltext(title, 'laptop') OR fulltext(description, 'laptop')
```
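
The fuzzy and `LIKE` rows of the table can be sketched as two small string builders. This is a minimal illustration in Python, not the library's translator: the function names `fuzzy_clause` and `like_clause` are hypothetical, term escaping is left out, and a pattern without any `%` is simply passed through as a plain term, which is an assumption.

```python
def fuzzy_clause(field: str, term: str, distance: int = 1) -> str:
    """Wrap the term in `distance` pairs of % signs (Levenshtein distance 1-3)."""
    # bool is a subclass of int, so reject it explicitly along with floats.
    if isinstance(distance, bool) or not isinstance(distance, int):
        raise ValueError("fuzzy distance must be an integer")
    if distance not in (1, 2, 3):
        raise ValueError("fuzzy distance must be 1, 2, or 3")
    marks = "%" * distance
    return f"@{field}:{marks}{term}{marks}"


def like_clause(field: str, pattern: str) -> str:
    """Map a leading/trailing SQL % wildcard to a RediSearch * wildcard."""
    leading = pattern.startswith("%")
    trailing = pattern.endswith("%") and len(pattern) > 1
    core = pattern.strip("%")
    if not core or "%" in core:
        raise ValueError(f"unsupported LIKE pattern: {pattern!r}")
    prefix = "*" if leading else ""
    suffix = "*" if trailing else ""
    return f"@{field}:{prefix}{core}{suffix}"


print(fuzzy_clause("title", "laptap", 2))  # @title:%%laptap%%
print(like_clause("title", "%apt%"))       # @title:*apt*
```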

**Stopword handling:**

Both `=` (exact phrase) and `fulltext()` (tokenized search) automatically strip [Redis default stopwords](https://redis.io/docs/latest/develop/ai/search-and-query/advanced-concepts/stopwords/) before sending queries to RediSearch. This is necessary because RediSearch does not index stopwords, so including them in queries causes syntax errors or failed matches. A `UserWarning` is emitted when stopwords are removed.

For example, `WHERE title = 'bank of america'` produces `@title:"bank america"` because "of" is a default stopword and is never stored in the inverted index. The stripped phrase still matches correctly because the indexer assigns consecutive token positions after dropping stopwords.

To include stopwords in your queries, create your index with `STOPWORDS 0`:

```
FT.CREATE myindex ON HASH PREFIX 1 doc: STOPWORDS 0 SCHEMA title TEXT
```
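
The stopword stripping described above can be sketched roughly as follows. The helper names are hypothetical and `STOPWORDS` is only an illustrative subset of the Redis default list; the actual implementation in this PR may differ in tokenization and in the warning text.

```python
import warnings

# Illustrative subset of the Redis default stopword list (assumption: the
# real code consults the full list).
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "in", "is", "to", "for", "on"})


def strip_stopwords(text: str) -> str:
    """Drop stopwords from a whitespace-tokenized string and warn if any were removed."""
    tokens = text.split()
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    removed = [t for t in tokens if t.lower() in STOPWORDS]
    if removed:
        warnings.warn(
            f"Stopwords removed from query: {removed}", UserWarning, stacklevel=2
        )
    return " ".join(kept)


def exact_phrase_clause(field: str, phrase: str) -> str:
    """Build the double-quoted exact-phrase form used for `=` on TEXT fields."""
    return f'@{field}:"{strip_stopwords(phrase)}"'


print(exact_phrase_clause("title", "bank of america"))  # @title:"bank america"
```

A phrase made up entirely of stopwords would leave an empty quoted string in this sketch; handling that case is left out.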

**Notes:**
- `=` on TEXT fields performs **exact phrase** matching (double-quoted)
- `fulltext()` performs **tokenized** AND search (parenthesized)
- Both operators strip stopwords and emit a warning when they do
- `fuzzy()` and `fulltext()` only work on TEXT fields; using them on TAG or NUMERIC raises `ValueError`
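- OR must be **uppercase**: `'laptop OR tablet'` triggers union; lowercase `'laptop or tablet'` is not a union and is treated as a regular AND search (note that `or` is itself a Redis default stopword, so it is subject to the stopword stripping described above)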
- Special characters (`@`, `|`, `-`, `*`, `+`, etc.) in search terms are automatically escaped
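
The uppercase-`OR` rule and the escaping rule from the notes above can be sketched together. This is an illustration under stated assumptions, not the PR's code: `fulltext_clause` and `escape_term` are hypothetical names, the escaped character set is limited to the five characters the notes list explicitly, and stopword stripping is left out, so a lowercase `or` stays in the output here.

```python
import re

# Assumed escape set: only the characters named in the notes (@ | - * +).
_SPECIAL = re.compile(r"([@|\-*+])")


def escape_term(term: str) -> str:
    """Backslash-escape characters that have meaning in the query syntax."""
    return _SPECIAL.sub(r"\\\1", term)


def fulltext_clause(field: str, text: str) -> str:
    """Build a tokenized AND query, or a union when an uppercase OR token is present."""
    tokens = text.split()
    if not tokens:
        raise ValueError("empty search text")
    operands: list[list[str]] = [[]]
    for tok in tokens:
        if tok == "OR":  # case-sensitive: lowercase "or" is an ordinary word
            operands.append([])
        else:
            operands[-1].append(escape_term(tok))
    if len(operands) == 1:
        return f"@{field}:({' '.join(operands[0])})"
    if any(not words for words in operands):
        raise ValueError("empty OR operand")
    # Multi-word operands are grouped so the union applies to the whole operand.
    parts = [w[0] if len(w) == 1 else "(" + " ".join(w) + ")" for w in operands]
    return f"@{field}:({'|'.join(parts)})"


print(fulltext_clause("title", "laptop OR tablet"))  # @title:(laptop|tablet)
print(fulltext_clause("title", "laptop or tablet"))  # @title:(laptop or tablet)
```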

### IS NULL / IS NOT NULL (ismissing)

Check for missing (absent) fields using standard SQL `IS NULL` / `IS NOT NULL` syntax. Requires **Redis 7.4+** (RediSearch 2.10+) with `INDEXMISSING` declared on the field.

| SQL | RediSearch Output |
|-----|-------------------|
| `WHERE email IS NULL` | `ismissing(@email)` |
| `WHERE email IS NOT NULL` | `-ismissing(@email)` |

```sql
-- Find users without an email
SELECT * FROM users WHERE email IS NULL

-- Find users with an email
SELECT * FROM users WHERE email IS NOT NULL

-- Combine with other filters
SELECT * FROM users WHERE category = 'eng' AND email IS NULL
```

**Note:** The field must be declared with `INDEXMISSING` in the index schema. A warning is emitted at translation time as a reminder.
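
As a rough sketch of the mapping in the table above, a translator could emit the `ismissing()` clause and the reminder warning like this. The function name `null_check_clause` and the warning text are hypothetical; only the output strings come from the table.

```python
import warnings


def null_check_clause(field: str, negate: bool = False) -> str:
    """Translate IS NULL / IS NOT NULL into an ismissing() filter clause."""
    # Reminder only: the translator cannot know whether the index schema
    # actually declares INDEXMISSING on this field.
    warnings.warn(
        f"IS NULL on '{field}' requires INDEXMISSING in the index schema",
        UserWarning,
        stacklevel=2,
    )
    clause = f"ismissing(@{field})"
    return f"-{clause}" if negate else clause


print(null_check_clause("email"))               # ismissing(@email)
print(null_check_clause("email", negate=True))  # -ismissing(@email)
```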

### exists() — Field Presence Check

Check whether a field has a value using `exists()` in SELECT or HAVING. This uses `FT.AGGREGATE` with `APPLY exists(@field)`.

```sql
-- Check if fields exist (returns 1 or 0)
SELECT name, exists(email) AS has_email FROM users

-- Filter to only rows where a field exists
SELECT name FROM users HAVING exists(email) = 1

-- Combine with other computed fields
SELECT name, exists(email) AS has_email, exists(phone) AS has_phone FROM users
```

**Note:** `exists()` is different from `IS NOT NULL` — it works via `FT.AGGREGATE APPLY` and doesn't require `INDEXMISSING` on the field, but returns `1`/`0` rather than filtering rows directly.

### DATE/DATETIME Handling

Redis does not have a native DATE field type. Dates are stored as **NUMERIC fields** with Unix timestamps.
147 changes: 134 additions & 13 deletions sql_redis/executor.py
@@ -103,7 +103,50 @@ class QueryResult:
count: int


-class Executor:
class _ScoreParseMixin:
"""Shared helpers for score-related response parsing."""

@staticmethod
def _has_return_0(args: list[str]) -> bool:
"""Return True when the args contain 'RETURN 0' (no document fields)."""
try:
idx = args.index("RETURN")
return args[idx + 1] == "0"
except (ValueError, IndexError):
return False

@staticmethod
def _resolve_score_alias(
score_alias: str | None,
args: list[str],
first_row_fields: set[str] | None = None,
) -> str:
"""Determine a stable score column name that won't collide with
document fields. The alias is resolved once and reused for every
row so all rows share the same column name.

When a RETURN clause is present, the returned field names are used
for collision detection. When RETURN is absent (SELECT *), the
caller should pass ``first_row_fields``. Despite its name, this is
the union of field names across all result rows (keys may be str or
bytes), so collisions are detected even when documents have
different field sets."""
alias = score_alias or "__score"
# Extract RETURN field names from args to detect collision
try:
idx = args.index("RETURN")
count = int(args[idx + 1])
return_fields = set(args[idx + 2 : idx + 2 + count])
except (ValueError, IndexError):
# Normalize bytes keys to str so collision detection works
# regardless of decode_responses setting.
raw = first_row_fields or set()
return_fields = {k.decode() if isinstance(k, bytes) else k for k in raw}
while alias in return_fields:
alias = f"__score_{alias}"
return alias
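
To make the collision behavior concrete, here is a standalone copy of the alias logic above, written as a plain function for illustration. The name `resolve_score_alias` and the sample argument lists are made up for this example; they are not real translator output.

```python
from __future__ import annotations


def resolve_score_alias(
    score_alias: str | None,
    args: list[str],
    row_fields: set | None = None,
) -> str:
    """Pick a score column name that does not collide with any returned field."""
    alias = score_alias or "__score"
    try:
        idx = args.index("RETURN")
        count = int(args[idx + 1])
        taken = set(args[idx + 2 : idx + 2 + count])
    except (ValueError, IndexError):
        # No usable RETURN clause: fall back to field names seen in the rows,
        # decoding bytes keys so the comparison works either way.
        taken = {k.decode() if isinstance(k, bytes) else k for k in (row_fields or set())}
    while alias in taken:
        alias = f"__score_{alias}"
    return alias


with_return = ["products", "@title:(laptop)", "WITHSCORES", "RETURN", "1", "title"]
print(resolve_score_alias("relevance", with_return))  # relevance
print(resolve_score_alias("title", with_return))      # __score_title
print(resolve_score_alias(None, ["products", "*", "WITHSCORES"], {b"__score"}))
# __score___score
```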


class Executor(_ScoreParseMixin):
"""Executes SQL queries against Redis."""

def __init__(self, client: redis.Redis, schema_registry: SchemaRegistry) -> None:
@@ -166,12 +209,55 @@ def execute(self, sql: str, *, params: dict | None = None) -> QueryResult:
rows = []

if translated.command == "FT.SEARCH":
-# FT.SEARCH format: [count, key1, [fields1], key2, [fields2], ...]
-# Skip document keys (odd indices), take field lists (even indices after count)
-for i in range(2, len(raw_result), 2):
-row_data = raw_result[i]
-row = dict(zip(row_data[::2], row_data[1::2]))
-rows.append(row)
# Use the explicit score_alias signal rather than scanning args
# for the literal token "WITHSCORES", which could false-positive
# if a returned field happened to be named "WITHSCORES".
with_scores = translated.score_alias is not None
# RETURN 0 suppresses document fields (like NOCONTENT);
# with WITHSCORES the reply is [count, id, score, id, score, ...]
no_content = self._has_return_0(translated.args)

# Pre-resolve score alias; may be deferred for SELECT *
score_alias: str | None = None

if with_scores and no_content:
# WITHSCORES + RETURN 0: [count, id1, score1, id2, score2, ...]
# Stride of 2: key, score (no field array)
score_alias = self._resolve_score_alias(
translated.score_alias, translated.args
)
for i in range(1, len(raw_result) - 1, 2):
score = raw_result[i + 1]
row = {score_alias: score}
rows.append(row)
elif with_scores:
# WITHSCORES format: [count, key1, score1, [fields1], key2, score2, [fields2], ...]
# Stride of 3: key, score, field_list
# First pass: collect all field names across all rows so the
# alias avoids collisions with any document field, not just
# the first row's fields.
all_field_names: set[str] = set()
parsed_rows: list[tuple[dict, Any]] = []
for i in range(1, len(raw_result) - 2, 3):
score = raw_result[i + 1]
row_data = raw_result[i + 2]
row = dict(zip(row_data[::2], row_data[1::2]))
all_field_names.update(row.keys())
parsed_rows.append((row, score))
resolved_alias = self._resolve_score_alias(
translated.score_alias,
translated.args,
first_row_fields=all_field_names,
)
for row, score in parsed_rows:
row[resolved_alias] = score
rows.append(row)
else:
# Standard format: [count, key1, [fields1], key2, [fields2], ...]
for i in range(2, len(raw_result), 2):
row_data = raw_result[i]
row = dict(zip(row_data[::2], row_data[1::2]))
rows.append(row)
else:
# FT.AGGREGATE format: [count, [fields1], [fields2], ...]
for row_data in raw_result[1:]:
@@ -181,7 +267,7 @@ def execute(self, sql: str, *, params: dict | None = None) -> QueryResult:
return QueryResult(rows=rows, count=count)


-class AsyncExecutor:
class AsyncExecutor(_ScoreParseMixin):
"""Async version of Executor for use with redis.asyncio clients."""

def __init__(
@@ -258,11 +344,46 @@ async def execute(self, sql: str, *, params: dict | None = None) -> QueryResult:
rows = []

if translated.command == "FT.SEARCH":
-# FT.SEARCH format: [count, key1, [fields1], key2, [fields2], ...]
-for i in range(2, len(raw_result), 2):
-row_data = raw_result[i]
-row = dict(zip(row_data[::2], row_data[1::2]))
-rows.append(row)
with_scores = translated.score_alias is not None
no_content = self._has_return_0(translated.args)

score_alias: str | None = None

if with_scores and no_content:
# WITHSCORES + RETURN 0: [count, id1, score1, id2, score2, ...]
score_alias = self._resolve_score_alias(
translated.score_alias, translated.args
)
for i in range(1, len(raw_result) - 1, 2):
score = raw_result[i + 1]
row = {score_alias: score}
rows.append(row)
elif with_scores:
# WITHSCORES format: [count, key1, score1, [fields1], ...]
# First pass: collect all field names across all rows so the
# alias avoids collisions with any document field.
all_field_names: set[str] = set()
parsed_rows: list[tuple[dict, Any]] = []
for i in range(1, len(raw_result) - 2, 3):
score = raw_result[i + 1]
row_data = raw_result[i + 2]
row = dict(zip(row_data[::2], row_data[1::2]))
all_field_names.update(row.keys())
parsed_rows.append((row, score))
resolved_alias = self._resolve_score_alias(
translated.score_alias,
translated.args,
first_row_fields=all_field_names,
)
for row, score in parsed_rows:
row[resolved_alias] = score
rows.append(row)
else:
# Standard format: [count, key1, [fields1], key2, [fields2], ...]
for i in range(2, len(raw_result), 2):
row_data = raw_result[i]
row = dict(zip(row_data[::2], row_data[1::2]))
rows.append(row)
else:
# FT.AGGREGATE format: [count, [fields1], [fields2], ...]
for row_data in raw_result[1:]:
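
The three FT.SEARCH reply shapes handled in both executors can be exercised in isolation with a small sketch. `parse_search_reply` is a hypothetical helper that mirrors the strides used in the diff (2, 3, and 2). The sample replies are hand-written for illustration rather than captured from a server, and alias collision handling is left out.

```python
from typing import Any


def parse_search_reply(
    raw: list[Any], with_scores: bool, no_content: bool, alias: str = "__score"
) -> list[dict]:
    """Turn a flat FT.SEARCH reply into a list of row dicts."""
    rows: list[dict] = []
    if with_scores and no_content:
        # [count, id1, score1, id2, score2, ...]: stride 2, no field arrays
        for i in range(1, len(raw) - 1, 2):
            rows.append({alias: raw[i + 1]})
    elif with_scores:
        # [count, id1, score1, [fields1], ...]: stride 3
        for i in range(1, len(raw) - 2, 3):
            fields = raw[i + 2]
            row = dict(zip(fields[::2], fields[1::2]))
            row[alias] = raw[i + 1]
            rows.append(row)
    else:
        # [count, id1, [fields1], id2, [fields2], ...]: field arrays at even indices
        for i in range(2, len(raw), 2):
            fields = raw[i]
            rows.append(dict(zip(fields[::2], fields[1::2])))
    return rows


plain = [2, "doc:1", ["title", "a"], "doc:2", ["title", "b"]]
scored = [2, "doc:1", "1.5", ["title", "a"], "doc:2", "0.5", ["title", "b"]]
ids_only = [2, "doc:1", "1.5", "doc:2", "0.5"]

print(parse_search_reply(plain, with_scores=False, no_content=False))
print(parse_search_reply(scored, with_scores=True, no_content=False))
print(parse_search_reply(ids_only, with_scores=True, no_content=True))
```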