Author: Scott Fear
Lightweight fuzzy search engine for chemical and pharmaceutical names, designed to resolve misspellings, phonetic confusions, and transcription errors in clinical data.
For example, resolving siprofloxasin -> ciprofloxacin or diasipam -> diazepam.
Built from scratch in Python using a character n-gram inverted index, with no external search or ranking frameworks beyond the Python standard library and a single HTTP dependency (requests) for data fetching.
ChemResolver is a two-stage retrieval and ranking pipeline:
1. Indexing: Each drug name is broken into character-level trigrams (e.g., aspirin -> {asp, spi, pir, iri, rin}). These are stored in an inverted index mapping each trigram to the set of words that contain it.
2. Retrieval: A query is similarly tokenised into trigrams. Candidate matches are retrieved by looking up shared trigrams in the index, and filtered by a minimum overlap threshold.
Note: All strings are normalised before indexing and querying (lowercased, spaces removed, and punctuation stripped). For instance, Ciprofloxacin -> ciprofloxacin.
3. Scoring: Each candidate is ranked using a weighted combination of four similarity signals:
| Signal | Weight | Description |
|---|---|---|
| Jaccard similarity | 35% | Character-level set overlap |
| Levenshtein similarity | 35% | Normalised edit distance |
| Prefix bonus | 15% | Shared leading characters |
| N-gram overlap | 15% | Dice + TF-IDF weighted trigram overlap |
TF-IDF weighting downweights common trigrams (e.g., ine, tion) so rarer, more distinctive trigrams contribute more to the score.
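The three stages above can be sketched in a few dozen lines. This is an illustrative reconstruction, not the code in ngram_index.py/similarity.py/ranker.py: the `min_overlap=2` threshold is an assumed value, and the TF-IDF weighting of the trigram-overlap signal is omitted (plain Dice only), so scores will differ slightly from the real ranker.

```python
WEIGHTS = {"jaccard": 0.35, "levenshtein": 0.35, "prefix": 0.15, "ngram": 0.15}

def normalise(s: str) -> str:
    # Lowercase and keep only letters/digits (spaces and punctuation stripped).
    return "".join(c for c in s.lower() if c.isalnum())

def trigrams(word: str) -> set:
    # Character trigrams, e.g. "aspirin" -> {"asp", "spi", "pir", "iri", "rin"}.
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_index(words):
    # Stage 1 -- inverted index: trigram -> set of words containing it.
    index = {}
    for w in map(normalise, words):
        for t in trigrams(w):
            index.setdefault(t, set()).add(w)
    return index

def candidates(index, query, min_overlap=2):
    # Stage 2 -- count shared trigrams per word, keep those above the threshold.
    counts = {}
    for t in trigrams(normalise(query)):
        for w in index.get(t, ()):
            counts[w] = counts.get(w, 0) + 1
    return [w for w, n in counts.items() if n >= min_overlap]

def levenshtein(a: str, b: str) -> int:
    # Row-by-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def score(query: str, candidate: str) -> float:
    # Stage 3 -- weighted combination of the four signals from the table above.
    q, c = normalise(query), normalise(candidate)
    jaccard = len(set(q) & set(c)) / len(set(q) | set(c))
    lev = 1.0 - levenshtein(q, c) / max(len(q), len(c))
    shared = 0
    for x, y in zip(q, c):   # shared leading characters
        if x != y:
            break
        shared += 1
    prefix = shared / max(len(q), len(c))
    tq, tc = trigrams(q), trigrams(c)
    dice = 2 * len(tq & tc) / (len(tq) + len(tc)) if tq and tc else 0.0
    return (WEIGHTS["jaccard"] * jaccard + WEIGHTS["levenshtein"] * lev
            + WEIGHTS["prefix"] * prefix + WEIGHTS["ngram"] * dice)
```

Ranking the retrieved candidates by `score` resolves the running example: `siprofloxasin` shares seven trigrams with `ciprofloxacin`, which then wins on every scoring signal.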
```
chemresolver/
├─ pipeline.py                # Top-level API: build() and search()
├─ ngram_index.py             # Inverted index and query logic
├─ similarity.py              # Jaccard and Levenshtein similarity functions
├─ ranker.py                  # Scoring and debug breakdown
├─ utils.py                   # Utility functions: normalise(), load_json(), save_json()
├─ eval.py                    # Accuracy evaluation on query/expected pairs
├─ benchmark.py               # Index build time and query latency benchmarks
├─ fetch_word_list.py         # Fetches (approved) drug names from ChEMBL API
├─ generate_dataset.py        # Auto-generates fuzzy eval dataset using fuzzing
├─ generate_dataset_hard.py   # Generates harder eval dataset with multi-transform fuzzing
├─ word_list.json             # 4,005 approved drug names (generated by fetch_word_list.py)
├─ eval_data.json             # 4,003 auto-generated fuzzy query/expected pairs (generated by generate_dataset.py)
└─ eval_data_hard.json        # 3,970 hard fuzzy eval pairs (generated by generate_dataset_hard.py)
```
No dependencies beyond the standard library and requests.
```
pip install requests
```

Run evaluation:

```
python eval.py                      # standard eval (eval_data.json)
python eval.py eval_data_hard.json  # hard eval
```

Run benchmark:

```
python benchmark.py
```

Regenerate dataset:

```
python fetch_word_list.py        # fetches word_list.json using ChEMBL API
python generate_dataset.py       # generates eval_data.json via fuzzing
python generate_dataset_hard.py  # generates eval_data_hard.json via multi-transform fuzzing
```

Use the pipeline directly:
```python
from pipeline import Pipeline

p = Pipeline()
p.build(["aspirin", "ibuprofen", "ciprofloxacin", "amoxicillin"])
results = p.search("siprofloxasin", top_k=2)
for r in results:
    print(r.word, r.score)
```

word_list.json contains 4,005 approved drug names (clinical trial phase 4) fetched from the ChEMBL API using fetch_word_list.py. Pagination is handled automatically.
eval_data.json contains 4,003 automatically generated fuzzy query/expected pairs. Each drug name (from word_list.json) is fuzzed using one of five transforms:
- Deletion: remove a random character
- Transposition: swap two adjacent characters
- Substitution: replace a character with a different letter
- Insertion: insert a random character
- Phonetic swap: apply a pharmaceutical-specific substitution (e.g., ph -> f, mycin -> misin, cillin -> silin)
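The five transforms can be sketched as below. Function names and the fallback behaviour of the phonetic swap are illustrative, not necessarily what generate_dataset.py does, and the substitution table is limited to the three examples given above:

```python
import random
import string

PHONETIC = {"ph": "f", "mycin": "misin", "cillin": "silin"}

def delete_char(w, rng):
    # Remove a random character.
    i = rng.randrange(len(w))
    return w[:i] + w[i + 1:]

def transpose(w, rng):
    # Swap two adjacent characters.
    i = rng.randrange(len(w) - 1)
    return w[:i] + w[i + 1] + w[i] + w[i + 2:]

def substitute(w, rng):
    # Replace a character with a different letter.
    i = rng.randrange(len(w))
    repl = rng.choice([c for c in string.ascii_lowercase if c != w[i]])
    return w[:i] + repl + w[i + 1:]

def insert_char(w, rng):
    # Insert a random character.
    i = rng.randrange(len(w) + 1)
    return w[:i] + rng.choice(string.ascii_lowercase) + w[i:]

def phonetic_swap(w, rng):
    # Apply one pharmaceutical-specific substitution; fall back to a plain
    # substitution when no pattern matches (assumed fallback).
    hits = [p for p in PHONETIC if p in w]
    if not hits:
        return substitute(w, rng)
    return w.replace(rng.choice(hits), PHONETIC[rng.choice(hits)] if False else PHONETIC[hits[0]], 1) if False else w.replace((p := rng.choice(hits)), PHONETIC[p], 1)

TRANSFORMS = [delete_char, transpose, substitute, insert_char, phonetic_swap]

def fuzz(word, rng=None):
    # Pick one of the five transforms at random.
    rng = rng or random.Random()
    return rng.choice(TRANSFORMS)(word, rng)
```

Applying two transforms in sequence (`fuzz(fuzz(word))`) yields the harder dataset described below.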
eval_data_hard.json contains 3,970 harder fuzzy pairs generated by generate_dataset_hard.py. Each word has two transforms applied sequentially, producing more severe misspellings to better stress-test the pipeline.
Example generated pairs (standard eval):

```
vancomisin   -> vancomycin
quinakrine   -> quinacrine
veapamil     -> verapamil
ondaznsetron -> ondansetron
cloxcaillin  -> cloxacillin
```
Standard eval (eval_data.json):

```
=== Evaluation Summary ===
Total queries: 4003
Correct (hits): 3987
Misses: 16
No results: 6
Exact Match Accuracy: 99.60%
```

Hard eval (eval_data_hard.json):

```
=== Evaluation Summary ===
Total queries: 3970
Correct (hits): 3832
Misses: 138
No results: 34
Exact Match Accuracy: 96.52%
```
```
=== Index Benchmark ===
Words indexed: 4005
Index build time: 0.059425s (59.425 ms)

=== Query Benchmark ===
Total queries: 12009
Average query time: 79.153 ms
Median query time: 26.733 ms
p99 query time: 448.368 ms
```
Average query time is higher than median due to a small number of slow outlier queries. Certain common trigrams (e.g., ine, cil) will match a large number of candidates in a 4,005-word index, making those queries disproportionately expensive to score. Median is the more representative measure of typical performance.
Note: At 1,000 words, average and median query times were well under 5 ms.
Pass debug=True to search() to inspect per-component scores:
```python
results = p.search("siprofloxasin", top_k=1, debug=True)
r = results[0]
print(r.word, r.jaccard, r.levenshtein, r.prefix, r.tfidf)
```

- Performance degrades with very large vocabularies due to high-frequency n-grams
- Designed for single-word queries; multi-word inputs are flattened via normalisation rather than structured as multi-token full-text search