feat(search): add fuzzy match tier for typo tolerance #124
Merged
Query tokens of 4+ chars that fail exact and prefix now match field tokens at difflib SequenceMatcher ratio 0.8 or higher. Tier ordering is strict (exact > prefix > fuzzy) and posts needing any fuzzy match always rank behind posts matched entirely by exact/prefix. AND semantics and the /search contract are unchanged; the index format is untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pull request overview
This PR extends SquishMark’s server-side search scorer with a third (lowest) fuzzy-matching tier to improve typo tolerance, while keeping AND semantics and the /search response contract intact.
Changes:
- Added a fuzzy tier (stdlib `difflib.SequenceMatcher`) to token scoring, gated by minimum token length and a ratio threshold.
- Updated ranking to ensure posts that require any fuzzy-only token match sort behind posts matched entirely via exact/prefix.
- Added focused unit tests for fuzzy threshold boundaries, short-token exclusion, ranking behavior, and AND semantics.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/squishmark/services/search.py` | Adds fuzzy token matching, updates scoring to track fuzzy dependence, and adjusts sort keys to keep fuzzy-dependent posts behind real matches. |
| `tests/test_search.py` | Adds unit tests validating fuzzy recall, threshold behavior, and ranking guarantees. |
Keeps fuzzy from perturbing order among exact/prefix posts and skips the fuzzy scan entirely on queries with real hits.
Closes #103
Adds a third, lowest-ranked match tier to the search scorer using stdlib difflib: query tokens of length >= 4 that fail both exact and prefix matching are compared against field tokens and match at SequenceMatcher ratio >= 0.8.
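For readers who want the tier cascade in concrete form, here is a minimal sketch of the idea, not the code in `search.py`. The function name `match_tier`, the constant names, and the assumption that "prefix" means the query token is a prefix of a field token are all illustrative; only the length gate (>= 4) and the ratio threshold (>= 0.8) come from this PR.

```python
from difflib import SequenceMatcher

# Values taken from the PR description; the names are hypothetical.
MIN_FUZZY_LEN = 4
FUZZY_RATIO = 0.8


def match_tier(query_token, field_tokens):
    """Return "exact", "prefix", "fuzzy", or None for one query token."""
    # Tier 1: the query token appears verbatim in the field.
    if query_token in field_tokens:
        return "exact"
    # Tier 2 (assumed meaning): some field token starts with the query token.
    if any(tok.startswith(query_token) for tok in field_tokens):
        return "prefix"
    # Tier 3: only tokens of sufficient length are eligible for fuzzy matching.
    if len(query_token) >= MIN_FUZZY_LEN:
        for tok in field_tokens:
            if SequenceMatcher(None, query_token, tok).ratio() >= FUZZY_RATIO:
                return "fuzzy"
    return None
```

With these rules, `match_tier("gumob", {"gumbo"})` lands in the fuzzy tier (the two tokens sit exactly on the 0.8 ratio), while a three-character typo such as `"gmb"` is never fuzzy-matched because of the length gate.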
Ranking guarantee: weights alone cannot keep every fuzzy hit below every real hit across fields (a fuzzy title weight would outscore an exact body weight), so posts that needed any fuzzy match sort behind posts matched entirely by exact/prefix; the third-tier weights only order posts within the fuzzy class. AND semantics, the /search contract, and the index format are unchanged.
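The ranking guarantee can be illustrated with a two-part sort key. This is a sketch under stated assumptions: `ScoredPost`, `rank`, and the weight values are hypothetical and chosen only to show why sorting by score alone would let a fuzzy title hit outrank an exact body hit.

```python
from dataclasses import dataclass


@dataclass
class ScoredPost:
    slug: str
    score: float        # sum of per-field tier weights (hypothetical values)
    needed_fuzzy: bool  # True if any query token matched only via the fuzzy tier


def rank(posts):
    # False sorts before True, so posts matched entirely by exact/prefix
    # come first; the score only orders posts within each class.
    return sorted(posts, key=lambda p: (p.needed_fuzzy, -p.score))


# Hypothetical weights: a fuzzy title match outweighs an exact body match.
TITLE_FUZZY_WEIGHT = 3.0
BODY_EXACT_WEIGHT = 1.0

posts = [
    ScoredPost("typo-in-title", TITLE_FUZZY_WEIGHT, needed_fuzzy=True),
    ScoredPost("exact-in-body", BODY_EXACT_WEIGHT, needed_fuzzy=False),
]
ranked_slugs = [p.slug for p in rank(posts)]
```

Here `ranked_slugs` places `exact-in-body` first even though its score is lower, which is the behavior the class-based sort is meant to enforce.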
Performance: fuzzy only runs for tokens that already failed exact and prefix, with a real_quick_ratio pre-filter; measured ~0.6 ms per post worst case against a 500-token body vocabulary. rapidfuzz remains the escape hatch per the issue if content scale grows.
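The pre-filter mentioned above might look roughly like the following. `SequenceMatcher.real_quick_ratio()` is a stdlib method that returns a cheap upper bound on `ratio()` based only on the two lengths, so a pair that fails it cannot pass the full comparison. The helper name `fuzzy_ratio_at_least` is hypothetical, and how the actual scorer wires this in is not shown in this PR text.

```python
from difflib import SequenceMatcher


def fuzzy_ratio_at_least(query_token, field_token, threshold=0.8):
    """Return True if the tokens reach the threshold, skipping hopeless pairs cheaply."""
    sm = SequenceMatcher(None, query_token, field_token)
    # Length-only upper bound: if even this is below the threshold,
    # the full ratio() cannot reach it, so skip the expensive computation.
    if sm.real_quick_ratio() < threshold:
        return False
    return sm.ratio() >= threshold
```

Pairs with very different lengths, such as `"gumb"` against `"intergalactic"`, are rejected by the bound alone; near-equal lengths such as `"intergalatic"` against `"intergalactic"` fall through to the full ratio.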
Tests: 9 new cases covering the 0.8 boundary, short-token exclusion, typo recall (gumob, intergalatic), in-field tier ordering, cross-field never-outrank, and AND semantics. One existing test repointed: its old probe token now legitimately fuzzy-matches, kept as a positive fuzzy case.
Live-verified: /search?q=gumob and /search?q=intergalatic both return the gumbo post; exact queries unaffected.
🤖 Generated with Claude Code