
wgetscott/ChemResolver


ChemResolver

Author: Scott Fear

Lightweight fuzzy search engine for chemical and pharmaceutical names, designed to resolve misspellings, phonetic confusions, and transcription errors in clinical data.

For example, resolving siprofloxasin -> ciprofloxacin or diasipam -> diazepam.

Built from scratch in Python using a character n-gram inverted index, with no external search or ranking frameworks beyond the Python standard library and a single HTTP dependency (requests) for data fetching.


How It Works

ChemResolver is a two-stage retrieval and ranking pipeline:

1. Indexing: Each drug name is broken into character-level trigrams (e.g., aspirin -> {asp, spi, pir, iri, rin}). These are stored in an inverted index mapping each trigram to the set of words that contain it.

2. Retrieval: A query is similarly tokenised into trigrams. Candidate matches are retrieved by looking up shared trigrams in the index, and filtered by a minimum overlap threshold.

Note: All strings are normalised before indexing and querying (lowercased, spaces removed, and punctuation stripped). For instance, Ciprofloxacin -> ciprofloxacin.
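Steps 1 and 2 can be sketched roughly as follows (a minimal illustration; the function names here are illustrative, not the project's actual API, and the overlap threshold is an assumed default):

```python
import re
from collections import defaultdict

def normalise(s):
    # Lowercase and strip spaces/punctuation, e.g. "Ciprofloxacin " -> "ciprofloxacin"
    return re.sub(r"[^a-z0-9]", "", s.lower())

def trigrams(word):
    # Character-level trigrams: "aspirin" -> {"asp", "spi", "pir", "iri", "rin"}
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_index(words):
    # Inverted index: trigram -> set of words containing it
    index = defaultdict(set)
    for w in words:
        for g in trigrams(normalise(w)):
            index[g].add(w)
    return index

def candidates(index, query, min_overlap=2):
    # Retrieve words sharing at least min_overlap trigrams with the query
    counts = defaultdict(int)
    for g in trigrams(normalise(query)):
        for w in index.get(g, ()):
            counts[w] += 1
    return {w for w, c in counts.items() if c >= min_overlap}
```

A misspelled query still shares most of its trigrams with the intended word, so the correct candidate survives retrieval even when no exact match exists.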

3. Scoring: Each candidate is ranked using a weighted combination of four similarity signals:

Signal                  Weight   Description
Jaccard similarity      35%      Character-level set overlap
Levenshtein similarity  35%      Normalised edit distance
Prefix bonus            15%      Shared leading characters
N-gram overlap          15%      Dice + TF-IDF weighted trigram overlap

TF-IDF weighting downweights common trigrams (e.g., ine, tion) so rarer, more distinctive trigrams contribute more to the score.
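The weighted combination might look roughly like this (a minimal sketch using the weights from the table; the Dice term here omits the TF-IDF weighting for brevity, and the function names are illustrative rather than the project's actual API):

```python
def jaccard(a, b):
    # Character-level set overlap
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def levenshtein(a, b):
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def lev_similarity(a, b):
    # Edit distance normalised to [0, 1]; 1.0 means identical
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def prefix_bonus(a, b):
    # Length of the shared prefix, normalised by the longer string
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n / max(len(a), len(b))

def dice(a, b):
    # Unweighted Dice coefficient over trigram sets
    ga = {a[i:i + 3] for i in range(len(a) - 2)}
    gb = {b[i:i + 3] for i in range(len(b) - 2)}
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def score(query, candidate):
    return (0.35 * jaccard(query, candidate)
            + 0.35 * lev_similarity(query, candidate)
            + 0.15 * prefix_bonus(query, candidate)
            + 0.15 * dice(query, candidate))
```

Under this scheme a near-miss like siprofloxasin scores far higher against ciprofloxacin than against unrelated names, since it wins on all four signals at once.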


Project Structure

chemresolver/
├─ pipeline.py              # Top-level API: build() and search()
├─ ngram_index.py           # Inverted index and query logic
├─ similarity.py            # Jaccard and Levenshtein similarity functions
├─ ranker.py                # Scoring and debug breakdown 
├─ utils.py                 # Utility functions: normalise(), load_json(), save_json()
├─ eval.py                  # Accuracy evaluation on query/expected pairs
├─ benchmark.py             # Index build time and query latency benchmarks
├─ fetch_word_list.py       # Fetches (approved) drug names from ChEMBL API
├─ generate_dataset.py      # Generates standard eval dataset via single-transform fuzzing
├─ generate_dataset_hard.py # Generates harder eval dataset with multi-transform fuzzing
├─ word_list.json           # 4,005 approved drug names (generated by fetch_word_list.py)
├─ eval_data.json           # 4,003 auto-generated fuzzy query/expected pairs (generated by generate_dataset.py)
└─ eval_data_hard.json      # 3,970 hard fuzzy eval pairs (generated by generate_dataset_hard.py)

Quickstart

No dependencies beyond the standard library and requests.

pip install requests

Run evaluation:

python eval.py  # standard eval (eval_data.json)
python eval.py eval_data_hard.json  # hard eval

Run benchmark:

python benchmark.py

Regenerate dataset:

python fetch_word_list.py       # fetches word_list.json using ChEMBL API
python generate_dataset.py      # generates eval_data.json via fuzzing
python generate_dataset_hard.py # generates eval_data_hard.json via multi-transform fuzzing

Use the pipeline directly:

from pipeline import Pipeline

p = Pipeline()
p.build(["aspirin", "ibuprofen", "ciprofloxacin", "amoxicillin"])

results = p.search("siprofloxasin", top_k=2)
for r in results:
    print(r.word, r.score)

Dataset

word_list.json contains 4,005 approved drug names (clinical trial phase 4) fetched from the ChEMBL API using fetch_word_list.py. Pagination is handled automatically.
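Following ChEMBL's paginated responses might look roughly like this (a sketch, not the project's actual fetch_word_list.py; the endpoint path, max_phase filter, and response field names are assumptions based on the public ChEMBL REST API):

```python
BASE = "https://www.ebi.ac.uk"

def fetch_approved_names(get=None):
    # Walks ChEMBL's paginated molecule endpoint by following page_meta.next.
    # `get` is injectable for testing; by default it uses requests.get.
    if get is None:
        import requests  # the project's single third-party dependency
        get = lambda url: requests.get(url, timeout=30).json()
    url = "/chembl/api/data/molecule.json?max_phase=4&limit=1000"
    names = []
    while url:
        page = get(BASE + url)
        for mol in page.get("molecules", []):
            if mol.get("pref_name"):  # skip entries without a preferred name
                names.append(mol["pref_name"])
        url = page.get("page_meta", {}).get("next")  # None on the last page
    return names
```

Making the HTTP call injectable keeps the pagination logic testable without network access.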

eval_data.json contains 4,003 automatically generated fuzzy query/expected pairs. Each drug name (from word_list.json) is fuzzed using one of five transforms:

  • Deletion: remove a random character
  • Transposition: swap two adjacent characters
  • Substitution: replace a character with a different letter
  • Insertion: insert a random character
  • Phonetic swap: apply a pharmaceutical-specific substitution (e.g., ph->f, mycin->misin, cillin->silin)
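A single-transform fuzzer along these lines could be sketched as (illustrative only; the actual transform logic and phonetic substitution table in generate_dataset.py may differ):

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def fuzz(word, rng=random):
    # Apply one of the five transforms to produce a misspelled query.
    phonetic = [("ph", "f"), ("mycin", "misin"), ("cillin", "silin")]
    kind = rng.choice(["delete", "transpose", "substitute", "insert", "phonetic"])
    if kind == "phonetic":
        for old, new in phonetic:
            if old in word:
                return word.replace(old, new, 1)
        kind = "substitute"  # no phonetic pattern applies; fall back
    i = rng.randrange(len(word))
    if kind == "delete":
        return word[:i] + word[i + 1:]
    if kind == "transpose":
        i = min(i, len(word) - 2)  # need two adjacent characters to swap
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if kind == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    # substitute: replace word[i] with a different letter
    return word[:i] + rng.choice([c for c in ALPHABET if c != word[i]]) + word[i + 1:]
```

The hard dataset would apply a function like this twice in sequence to the same word.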

eval_data_hard.json contains 3,970 harder fuzzy pairs generated by generate_dataset_hard.py. Each word has two transforms applied sequentially, producing more severe misspellings to better stress-test the pipeline.

Example generated pairs (standard eval):

vancomisin -> vancomycin
quinakrine -> quinacrine
veapamil -> verapamil
ondaznsetron -> ondansetron
cloxcaillin -> cloxacillin

Results

Standard (single-transform, 4,003 queries)

=== Evaluation Summary ===
Total queries: 4003
Correct (hits): 3987
Misses: 16
No results: 6
Exact Match Accuracy: 99.60%

Hard (multi-transform, 3,970 queries)

=== Evaluation Summary ===
Total queries: 3970
Correct (hits): 3832
Misses: 138
No results: 34
Exact Match Accuracy: 96.52%

Benchmark (4,005 word index, 3 repeats)

=== Index Benchmark ===
Words indexed: 4005
Index build time: 0.059425s (59.425 ms)

=== Query Benchmark ===
Total queries: 12009
Average query time: 79.153 ms
Median query time: 26.733 ms
p99 query time: 448.368 ms

Average query time is higher than median due to a small number of slow outlier queries. Certain common trigrams (e.g., ine, cil) will match a large number of candidates in a 4,005-word index, making those queries disproportionately expensive to score. Median is the more representative measure of typical performance.

Note: At 1,000 words, average and median query times were well under 5 ms.

Debug Mode

Pass debug=True to search() to inspect per-component scores:

results = p.search("siprofloxasin", top_k=1, debug=True)
r = results[0]
print(r.word, r.jaccard, r.levenshtein, r.prefix, r.tfidf)

Limitations

  • Performance degrades with very large vocabularies due to high-frequency n-grams
  • Designed for single-word queries; multi-word inputs are flattened via normalisation rather than structured as multi-token full-text search
