
feat(search): add fuzzy match tier for typo tolerance#124

Merged
x3ek merged 2 commits into main from feat/103-fuzzy-search on Jul 3, 2026

Conversation

@x3ek

@x3ek x3ek commented Jul 3, 2026


Closes #103

Adds a third, lowest-ranked match tier to the search scorer using stdlib difflib: query tokens of length >= 4 that fail both exact and prefix matching now match field tokens whose SequenceMatcher ratio is >= 0.8.
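A minimal sketch of the three-tier lookup described above, for illustration only. The function name `match_tier`, the constant names, and the prefix direction (field token starts with the query token) are assumptions, not the code in `search.py`.

```python
from difflib import SequenceMatcher

# Values taken from the PR description; the names are hypothetical.
MIN_FUZZY_LEN = 4
FUZZY_THRESHOLD = 0.8


def match_tier(query_token, field_tokens):
    """Return "exact", "prefix", "fuzzy", or None for one query token."""
    if query_token in field_tokens:
        return "exact"
    # Assumed prefix rule: a field token begins with the query token.
    if any(tok.startswith(query_token) for tok in field_tokens):
        return "prefix"
    # Fuzzy is only attempted after exact and prefix have both failed,
    # and only for query tokens of at least MIN_FUZZY_LEN characters.
    if len(query_token) >= MIN_FUZZY_LEN:
        for tok in field_tokens:
            if SequenceMatcher(None, query_token, tok).ratio() >= FUZZY_THRESHOLD:
                return "fuzzy"
    return None
```

With field tokens `["gumbo", "soup"]`, the query `gumob` falls through exact and prefix and lands in the fuzzy tier, while the three-character `sop` is too short to be considered for fuzzy matching at all.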

Ranking guarantee: weights alone cannot keep every fuzzy hit below every real hit across fields (a fuzzy title weight would outscore an exact body weight), so posts that needed any fuzzy match sort behind posts matched entirely by exact/prefix; the third-tier weights only order posts within the fuzzy class. AND semantics, the /search contract, and the index format are unchanged.
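The ranking guarantee can be pictured as a two-part sort key: fuzzy dependence first, score second. The sketch below is illustrative; the `Scored` record, the `rank` helper, and the example scores are hypothetical and not the PR's actual weights.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    slug: str
    score: float       # weighted sum over fields and tiers (weights assumed)
    needs_fuzzy: bool  # True if any query token matched only via fuzzy


def rank(results):
    # False sorts before True, so posts matched entirely by exact/prefix
    # come first; score only orders posts within each class.
    return sorted(results, key=lambda r: (r.needs_fuzzy, -r.score))


# Hypothetical scores: a fuzzy title hit (3.0) outscores an exact body hit (2.0),
# which is the case that weights alone cannot fix.
results = [
    Scored("title-fuzzy", 3.0, True),
    Scored("body-exact", 2.0, False),
    Scored("body-fuzzy", 1.0, True),
]
ordered = [r.slug for r in rank(results)]
```

Here `ordered` places `body-exact` ahead of `title-fuzzy` despite its lower score, and the two fuzzy-dependent posts are then ordered by score among themselves.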

Performance: fuzzy only runs for tokens that already failed exact and prefix, with a real_quick_ratio pre-filter; measured ~0.6 ms per post worst case against a 500-token body vocabulary. rapidfuzz remains the escape hatch per the issue if content scale grows.
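As a rough illustration of the pre-filter, `SequenceMatcher.real_quick_ratio()` returns a cheap upper bound on `ratio()` based on the two lengths alone, so a pair whose bound is already below the threshold can be rejected without the full comparison. The helper name `fuzzy_at_least` is hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_at_least(a, b, threshold=0.8):
    """True if ratio(a, b) >= threshold, skipping the full scan when possible."""
    sm = SequenceMatcher(None, a, b)
    # real_quick_ratio() is an upper bound on ratio() computed from lengths only.
    if sm.real_quick_ratio() < threshold:
        return False
    return sm.ratio() >= threshold


# Lengths 5 and 13 give an upper bound of 2*5/18, well under 0.8,
# so this pair is rejected before ratio() is ever computed.
length_bound = SequenceMatcher(None, "gumbo", "intergalactic").real_quick_ratio()
```

The filter only helps when token lengths differ substantially; same-length tokens always pass it and still pay for the full `ratio()` call.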

Tests: 9 new cases covering the 0.8 boundary, short-token exclusion, typo recall (gumob, intergalatic), in-field tier ordering, cross-field never-outrank, and AND semantics. One existing test repointed: its old probe token now legitimately fuzzy-matches, kept as a positive fuzzy case.

Live-verified: /search?q=gumob and /search?q=intergalatic both return the gumbo post; exact queries unaffected.

🤖 Generated with Claude Code

Query tokens of 4+ chars that fail exact and prefix now match field
tokens at difflib SequenceMatcher ratio 0.8 or higher. Tier ordering is
strict (exact > prefix > fuzzy) and posts needing any fuzzy match always
rank behind posts matched entirely by exact/prefix. AND semantics and
the /search contract are unchanged; the index format is untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@x3ek x3ek added this to the SquishMark 1.0 milestone Jul 3, 2026
@x3ek x3ek requested a review from Copilot July 3, 2026 16:37

Copilot AI left a comment


Pull request overview

This PR extends SquishMark’s server-side search scorer with a third (lowest) fuzzy-matching tier to improve typo tolerance, while keeping AND semantics and the /search response contract intact.

Changes:

  • Added a fuzzy tier (stdlib difflib.SequenceMatcher) to token scoring, gated by minimum token length and a ratio threshold.
  • Updated ranking to ensure posts that require any fuzzy-only token match sort behind posts matched entirely via exact/prefix.
  • Added focused unit tests for fuzzy threshold boundaries, short-token exclusion, ranking behavior, and AND semantics.
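One way to picture how AND semantics and fuzzy dependence interact, as a sketch under assumptions: every query token must reach some tier, and a post is flagged as fuzzy-dependent as soon as one token matched only via fuzzy. The `evaluate_post` name and the `tier_of` callback are hypothetical; the stub dictionary stands in for a real tier lookup.

```python
def evaluate_post(query_tokens, tier_of):
    """Return (matched, needs_fuzzy) for one post.

    tier_of maps a query token to "exact", "prefix", "fuzzy", or None.
    """
    needs_fuzzy = False
    for token in query_tokens:
        tier = tier_of(token)
        if tier is None:
            # AND semantics: one unmatched token excludes the post.
            return False, False
        if tier == "fuzzy":
            needs_fuzzy = True
    return True, needs_fuzzy


# Stub tier lookup for a single hypothetical post.
tiers = {"gumbo": "exact", "recipe": "prefix", "gumob": "fuzzy"}

clean = evaluate_post(["gumbo", "recipe"], tiers.get)
typo = evaluate_post(["gumob", "recipe"], tiers.get)
miss = evaluate_post(["gumbo", "pizza"], tiers.get)
```

The `needs_fuzzy` flag is what a sort key would consult to keep fuzzy-dependent posts behind posts matched entirely by exact/prefix.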

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/squishmark/services/search.py | Adds fuzzy token matching, updates scoring to track fuzzy dependence, and adjusts sort keys to keep fuzzy-dependent posts behind real matches. |
| tests/test_search.py | Adds unit tests validating fuzzy recall, threshold behavior, and ranking guarantees. |

Comment thread src/squishmark/services/search.py Outdated
Keeps fuzzy from perturbing order among exact/prefix posts and skips the
fuzzy scan entirely on queries with real hits.
@x3ek x3ek merged commit ada0721 into main Jul 3, 2026
5 checks passed
@x3ek x3ek deleted the feat/103-fuzzy-search branch July 3, 2026 17:37
@x3ek x3ek mentioned this pull request Jul 3, 2026

Development

Successfully merging this pull request may close these issues.

Add fuzzy matching to search for typo tolerance

2 participants