
wgetscott/ChemResolver


ChemResolver

Author: Scott Fear

Lightweight fuzzy search engine for chemical and pharmaceutical names, designed to resolve misspellings, phonetic confusions, and transcription errors in clinical data.

For example, resolving siprofloxasin -> ciprofloxacin or diasipam -> diazepam.

Built from scratch in Python using a character n-gram inverted index, with no external search or ranking frameworks beyond the Python standard library and a single HTTP dependency (requests) for data fetching.


How It Works

ChemResolver is a two-stage retrieval and ranking pipeline:

1. Indexing: Each drug name is broken into character-level trigrams (e.g., aspirin -> {asp, spi, pir, iri, rin}). These are stored in an inverted index mapping each trigram to the set of words that contain it.

2. Retrieval: A query is similarly tokenised into trigrams. Candidate matches are retrieved by looking up shared trigrams in the index, and filtered by a minimum overlap threshold.

Note: All strings are normalised before indexing and querying (lowercased, spaces removed, and punctuation stripped). For instance, Ciprofloxacin -> ciprofloxacin.
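Steps 1 and 2 can be sketched roughly as follows (a minimal illustration; the function names here are illustrative, not the project's actual API, and the overlap threshold is an assumed default):

```python
import re
from collections import defaultdict

def normalise(s):
    # Lowercase and strip spaces/punctuation, e.g. "Ciprofloxacin " -> "ciprofloxacin"
    return re.sub(r"[^a-z0-9]", "", s.lower())

def trigrams(word):
    # Character-level trigrams: "aspirin" -> {"asp", "spi", "pir", "iri", "rin"}
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_index(words):
    # Inverted index: trigram -> set of words containing it
    index = defaultdict(set)
    for w in words:
        for g in trigrams(normalise(w)):
            index[g].add(w)
    return index

def candidates(index, query, min_overlap=2):
    # Retrieve words sharing at least min_overlap trigrams with the query
    counts = defaultdict(int)
    for g in trigrams(normalise(query)):
        for w in index.get(g, ()):
            counts[w] += 1
    return {w for w, c in counts.items() if c >= min_overlap}
```

A misspelled query still shares most of its trigrams with the intended word, so the correct candidate survives retrieval even when no exact match exists.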

3. Scoring: Each candidate is ranked using a weighted combination of four similarity signals:

Signal                  Weight   Description
Jaccard similarity      35%      Character-level set overlap
Levenshtein similarity  35%      Normalised edit distance
Prefix bonus            15%      Shared leading characters
N-gram overlap          15%      Dice + TF-IDF weighted trigram overlap

TF-IDF weighting downweights common trigrams (e.g., ine, tion) so rarer, more distinctive trigrams contribute more to the score.
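The weighted combination might look roughly like this (a minimal sketch using the weights from the table; the Dice term here omits the TF-IDF weighting for brevity, and the function names are illustrative rather than the project's actual API):

```python
def jaccard(a, b):
    # Character-level set overlap
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def levenshtein(a, b):
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def lev_similarity(a, b):
    # Edit distance normalised to [0, 1]; 1.0 means identical
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def prefix_bonus(a, b):
    # Length of the shared prefix, normalised by the longer string
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n / max(len(a), len(b))

def dice(a, b):
    # Unweighted Dice coefficient over trigram sets
    ga = {a[i:i + 3] for i in range(len(a) - 2)}
    gb = {b[i:i + 3] for i in range(len(b) - 2)}
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def score(query, candidate):
    return (0.35 * jaccard(query, candidate)
            + 0.35 * lev_similarity(query, candidate)
            + 0.15 * prefix_bonus(query, candidate)
            + 0.15 * dice(query, candidate))
```

Under this scheme a near-miss like siprofloxasin scores far higher against ciprofloxacin than against unrelated names, since it wins on all four signals at once.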


Project Structure

chemresolver/
├─ pipeline.py              # Top-level API: build() and search()
├─ ngram_index.py           # Inverted index and query logic
├─ similarity.py            # Jaccard and Levenshtein similarity functions
├─ ranker.py                # Scoring and debug breakdown 
├─ utils.py                 # Utility functions: normalise(), load_json(), save_json()
├─ eval.py                  # Accuracy evaluation on query/expected pairs
├─ benchmark.py             # Index build time and query latency benchmarks
├─ fetch_word_list.py       # Fetches (approved) drug names from ChEMBL API
├─ generate_dataset.py      # Generates standard eval dataset via single-transform fuzzing
├─ generate_dataset_hard.py # Generates harder eval dataset with multi-transform fuzzing
├─ word_list.json           # 4,005 approved drug names (generated by fetch_word_list.py)
├─ eval_data.json           # 4,003 auto-generated fuzzy query/expected pairs (generated by generate_dataset.py)
└─ eval_data_hard.json      # 3,970 hard fuzzy eval pairs (generated by generate_dataset_hard.py)

Quickstart

No dependencies beyond the standard library and requests.

pip install requests

Run evaluation:

python eval.py  # standard eval (eval_data.json)
python eval.py eval_data_hard.json  # hard eval

Run benchmark:

python benchmark.py

Regenerate dataset:

python fetch_word_list.py       # fetches word_list.json using ChEMBL API
python generate_dataset.py      # generates eval_data.json via fuzzing
python generate_dataset_hard.py # generates eval_data_hard.json via multi-transform fuzzing

Use the pipeline directly:

from pipeline import Pipeline

p = Pipeline()
p.build(["aspirin", "ibuprofen", "ciprofloxacin", "amoxicillin"])

results = p.search("siprofloxasin", top_k=2)
for r in results:
    print(r.word, r.score)

Dataset

word_list.json contains 4,005 approved drug names (clinical trial phase 4) fetched from the ChEMBL API using fetch_word_list.py. Pagination is handled automatically.
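Following ChEMBL's paginated responses might look roughly like this (a sketch, not the project's actual fetch_word_list.py; the endpoint path, max_phase filter, and response field names are assumptions based on the public ChEMBL REST API):

```python
BASE = "https://www.ebi.ac.uk"

def fetch_approved_names(get=None):
    # Walks ChEMBL's paginated molecule endpoint by following page_meta.next.
    # `get` is injectable for testing; by default it uses requests.get.
    if get is None:
        import requests  # the project's single third-party dependency
        get = lambda url: requests.get(url, timeout=30).json()
    url = "/chembl/api/data/molecule.json?max_phase=4&limit=1000"
    names = []
    while url:
        page = get(BASE + url)
        for mol in page.get("molecules", []):
            if mol.get("pref_name"):  # skip entries without a preferred name
                names.append(mol["pref_name"])
        url = page.get("page_meta", {}).get("next")  # None on the last page
    return names
```

Making the HTTP call injectable keeps the pagination logic testable without network access.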

eval_data.json contains 4,003 automatically generated fuzzy query/expected pairs. Each drug name (from word_list.json) is fuzzed using one of five transforms:

  • Deletion: remove a random character
  • Transposition: swap two adjacent characters
  • Substitution: replace a character with a different letter
  • Insertion: insert a random character
  • Phonetic swap: apply a pharmaceutical-specific substitution (e.g., ph->f, mycin->misin, cillin->silin)
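A single-transform fuzzer along these lines could be sketched as (illustrative only; the actual transform logic and phonetic substitution table in generate_dataset.py may differ):

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def fuzz(word, rng=random):
    # Apply one of the five transforms to produce a misspelled query.
    phonetic = [("ph", "f"), ("mycin", "misin"), ("cillin", "silin")]
    kind = rng.choice(["delete", "transpose", "substitute", "insert", "phonetic"])
    if kind == "phonetic":
        for old, new in phonetic:
            if old in word:
                return word.replace(old, new, 1)
        kind = "substitute"  # no phonetic pattern applies; fall back
    i = rng.randrange(len(word))
    if kind == "delete":
        return word[:i] + word[i + 1:]
    if kind == "transpose":
        i = min(i, len(word) - 2)  # need two adjacent characters to swap
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if kind == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    # substitute: replace word[i] with a different letter
    return word[:i] + rng.choice([c for c in ALPHABET if c != word[i]]) + word[i + 1:]
```

The hard dataset would apply a function like this twice in sequence to the same word.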

eval_data_hard.json contains 3,970 harder fuzzy pairs generated by generate_dataset_hard.py. Each word has two transforms applied sequentially, producing more severe misspellings to better stress-test the pipeline.

Example generated pairs (standard eval):

vancomisin -> vancomycin
quinakrine -> quinacrine
veapamil -> verapamil
ondaznsetron -> ondansetron
cloxcaillin -> cloxacillin

Results

Standard (single-transform, 4,003 queries)

=== Evaluation Summary ===
Total queries: 4003
Correct (hits): 3987
Misses: 16
No results: 6
Exact Match Accuracy: 99.60%

Hard (multi-transform, 3,970 queries)

=== Evaluation Summary ===
Total queries: 3970
Correct (hits): 3832
Misses: 138
No results: 34
Exact Match Accuracy: 96.52%

Benchmark (4,005 word index, 3 repeats)

=== Index Benchmark ===
Words indexed: 4005
Index build time: 0.059425s (59.425 ms)

=== Query Benchmark ===
Total queries: 12009
Average query time: 79.153 ms
Median query time: 26.733 ms
p99 query time: 448.368 ms

Average query time is higher than median due to a small number of slow outlier queries. Certain common trigrams (e.g., ine, cil) will match a large number of candidates in a 4,005-word index, making those queries disproportionately expensive to score. Median is the more representative measure of typical performance.

Note: At 1,000 words, average and median query times were well under 5 ms.

Debug Mode

Pass debug=True to search() to inspect per-component scores:

results = p.search("siprofloxasin", top_k=1, debug=True)
r = results[0]
print(r.word, r.jaccard, r.levenshtein, r.prefix, r.tfidf)

Limitations

  • Performance degrades with very large vocabularies due to high-frequency n-grams
  • Designed for single-word queries; multi-word inputs are flattened via normalisation rather than structured as multi-token full-text search
