# Fast Dictionary-Based NER for TB Drug Discovery

This notebook demonstrates `FastNERExtractor` — a fast, deterministic NER
engine that uses curated YAML gazetteers instead of an LLM.

1. Basic usage
2. How matching works (exact, regex, fuzzy)
3. Working with results (same as LLM extractor)
4. Custom gazetteers
5. Adding new entity types
6. Batch extraction & performance

## Setup

```bash
pip install "structflo-ner[fast]"
```

In [17]:
from structflo.ner.fast import FastNERExtractor

## 1. Basic Usage
The built-in gazetteers cover TB drug discovery entities.

In [18]:
fast = FastNERExtractor()

text = (
    "Bedaquiline (TMC207) is a diarylquinoline that inhibits the "
    "mycobacterial ATP synthase subunit c encoded by atpE (Rv1305). "
    "It shows potent activity against Mycobacterium tuberculosis "
    "including MDR-TB and XDR-TB. This compound was identified through "
    "whole-cell screening and targets the energy metabolism pathway."
)

result = fast.extract(text)
result

## 2. How Matching Works

The extractor uses a three-phase matching strategy:

| Phase | Method | What it catches |
|---|---|---|
| 1 | **Exact match** | Case-sensitive and normalized dictionary lookups with word-boundary enforcement |
| 1b | **Regex patterns** | Auto-derived from accession number seeds (Rv tags, UniProt, PDB, etc.) |
| 2 | **Fuzzy match** | Typos and minor variants via rapidfuzz (configurable threshold) |

Each entity's `attributes` dict includes the `match_method` used.
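
To make the first two phases concrete, here is a minimal standalone sketch of dictionary lookup with word-boundary enforcement plus a regex for Rv locus tags. It is not `FastNERExtractor`'s actual implementation: the tiny `GAZETTEER`, the `RV_TAG` pattern, and the helper names are all assumptions made for illustration.

```python
import re

# Hypothetical mini-gazetteer: lowercased term -> entity_type.
GAZETTEER = {
    "bedaquiline": "compound_name",
    "atpe": "gene_name",
    "tb": "disease",
}

# Assumed pattern for Rv locus tags such as Rv1305 or Rv0005c.
RV_TAG = re.compile(r"\bRv\d{4}[A-Za-z]?\b")


def exact_matches(text):
    """Phase 1: case-insensitive lookup that refuses to match inside a word."""
    hits = []
    for term, entity_type in GAZETTEER.items():
        # (?<!\w) and (?!\w) reject matches embedded in a longer token.
        pattern = re.compile(rf"(?<!\w){re.escape(term)}(?!\w)", re.IGNORECASE)
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), entity_type, m.group(), "exact"))
    return hits


def regex_matches(text):
    """Phase 1b: accession-like tokens caught by pattern rather than by list."""
    return [
        (m.start(), m.end(), "accession_number", m.group(), "regex")
        for m in RV_TAG.finditer(text)
    ]


sample = "Bedaquiline targets atpE (Rv1305) in MDR-TB; tblastn is not a disease."
matches = sorted(exact_matches(sample) + regex_matches(sample))
for start, end, entity_type, surface, method in matches:
    print(f"[{start:2d}:{end:2d}] {entity_type:17s} {surface!r} ({method})")
```

The boundary checks are what let `TB` match inside `MDR-TB` (a hyphen is not a word character) while skipping the `tb` at the start of `tblastn`.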

In [19]:
# Inspect match methods
for entity in result.all_entities():
    method = entity.attributes.get("match_method", "")
    canonical = entity.attributes.get("canonical", entity.text)
    print(f"{entity.entity_type:25s} | {entity.text:30s} | method={method:6s} | canonical={canonical}")

compound_name             | Bedaquiline                    | method=exact  | canonical=Bedaquiline
compound_name             | TMC207                         | method=exact  | canonical=TMC207
gene_name                 | atpE                           | method=exact  | canonical=atpE
disease                   | Mycobacterium tuberculosis     | method=exact  | canonical=Mycobacterium tuberculosis
disease                   | TB                             | method=exact  | canonical=TB
disease                   | TB                             | method=exact  | canonical=TB
accession_number          | Rv1305                         | method=exact  | canonical=Rv1305
product                   | ATP synthase subunit c         | method=exact  | canonical=ATP synthase subunit c
functional_category       | energy metabolism              | method=exact  | canonical=energy metabolism
screening_method          | whole-cell screening           | method=exact  | canonical=whole-cell screening


### Fuzzy matching

Catches typos and minor spelling variants. The threshold (default 85) controls sensitivity.

In [20]:
# "Bedaquilne" is a typo for "Bedaquiline"
fuzzy_result = fast.extract("Bedaquilne showed activity against TB")

for c in fuzzy_result.compounds:
    print(f"Found: {c.text!r} -> canonical: {c.attributes.get('canonical', c.text)!r} (method: {c.attributes['match_method']})")

Found: 'Bedaquilne' -> canonical: 'Bedaquiline' (method: fuzzy)
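
As a rough illustration of how a 0-100 similarity threshold behaves, the sketch below scores the typo against a small candidate list. It uses the standard library's `difflib` as a stand-in for rapidfuzz, so the scores will not be identical to what the extractor computes; `COMPOUNDS`, `similarity`, and `fuzzy_lookup` are hypothetical names.

```python
from difflib import SequenceMatcher

# Hypothetical candidate list; the built-in gazetteer is much larger.
COMPOUNDS = ["Bedaquiline", "Delamanid", "Pretomanid"]


def similarity(a, b):
    """Similarity on a 0-100 scale (difflib stand-in, not rapidfuzz's scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(token, candidates, threshold=85):
    """Return (canonical, score) for the best candidate at or above threshold."""
    if threshold <= 0:
        # Assumption: follows the notebook's convention that 0 disables fuzzy matching.
        return None
    best = max(candidates, key=lambda c: similarity(token, c))
    score = similarity(token, best)
    return (best, score) if score >= threshold else None


print(fuzzy_lookup("Bedaquilne", COMPOUNDS))  # clears the default threshold
print(fuzzy_lookup("showed", COMPOUNDS))      # ordinary word, no candidate is close
```

With this scorer the typo lands at about 95, comfortably above 85, while an unrelated word scores far too low to match. Raising the threshold trades recall on typos for fewer false positives.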


In [21]:
# Disable fuzzy matching for strict mode
strict = FastNERExtractor(fuzzy_threshold=0)
strict_result = strict.extract("Bedaquilne showed activity against TB")

print(f"Compounds (strict): {strict_result.compounds}")  # empty — typo not matched
print(f"Diseases  (strict): {[d.text for d in strict_result.diseases]}")  # TB still matched

Compounds (strict): []
Diseases  (strict): ['TB']


## 3. Working with Results

`FastNERExtractor` returns the same `NERResult` objects as the LLM-based `NERExtractor`.
All downstream tooling works identically.

In [22]:
# Typed entity lists
print("Compounds:", [c.text for c in result.compounds])
print("Targets:", [t.text for t in result.targets])
print("Diseases:", [d.text for d in result.diseases])
print("Accessions:", [a.text for a in result.accessions])
print("Screening methods:", [s.text for s in result.screening_methods])
print("Products:", [p.text for p in result.products])
print("Functional categories:", [f.text for f in result.functional_categories])

Compounds: ['Bedaquiline', 'TMC207']
Targets: ['atpE']
Diseases: ['Mycobacterium tuberculosis', 'TB', 'TB']
Accessions: ['Rv1305']
Screening methods: ['whole-cell screening']
Products: ['ATP synthase subunit c']
Functional categories: ['energy metabolism']


In [23]:
# Export to DataFrame
df = result.to_dataframe()
df

| | text | entity_type | entity_class | char_start | char_end | alignment | match_method |
|---|---|---|---|---|---|---|---|
| 0 | Bedaquiline | compound_name | ChemicalEntity | 0 | 11 | | exact |
| 1 | TMC207 | compound_name | ChemicalEntity | 13 | 19 | | exact |
| 2 | atpE | gene_name | TargetEntity | 108 | 112 | | exact |
| 3 | Mycobacterium tuberculosis | disease | DiseaseEntity | 156 | 182 | | exact |
| 4 | TB | disease | DiseaseEntity | 197 | 199 | | exact |
| 5 | TB | disease | DiseaseEntity | 208 | 210 | | exact |
| 6 | Rv1305 | accession_number | AccessionEntity | 114 | 120 | | exact |
| 7 | ATP synthase subunit c | product | ProductEntity | 74 | 96 | | exact |
| 8 | energy metabolism | functional_category | FunctionalCategoryEntity | 286 | 303 | | exact |
| 9 | whole-cell screening | screening_method | ScreeningMethodEntity | 249 | 269 | | exact |


In [24]:
# Character offsets let you highlight entities in the source text
for entity in result.all_entities():
    if entity.char_start is not None:
        span = text[entity.char_start:entity.char_end]
        print(f"[{entity.char_start:3d}:{entity.char_end:3d}] {entity.entity_type:25s} | {span!r}")

[  0: 11] compound_name             | 'Bedaquiline'
[ 13: 19] compound_name             | 'TMC207'
[108:112] gene_name                 | 'atpE'
[156:182] disease                   | 'Mycobacterium tuberculosis'
[197:199] disease                   | 'TB'
[208:210] disease                   | 'TB'
[114:120] accession_number          | 'Rv1305'
[ 74: 96] product                   | 'ATP synthase subunit c'
[286:303] functional_category       | 'energy metabolism'
[249:269] screening_method          | 'whole-cell screening'
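
One way to use those offsets for highlighting is sketched below. The `highlight` helper and the `[[surface|label]]` markup are hypothetical, and the spans are plain tuples rather than `NERResult` entities so that the example runs on its own.

```python
def highlight(text, spans):
    """Wrap each (start, end, label) span as [[surface|label]].

    Spans are applied right to left so that earlier offsets stay valid.
    Overlapping spans are not handled in this sketch.
    """
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = f"{out[:start]}[[{out[start:end]}|{label}]]{out[end:]}"
    return out


sentence = "Bedaquiline (TMC207) is a diarylquinoline."
# Offsets copied from the first two rows of the output above.
spans = [(0, 11, "compound_name"), (13, 19, "compound_name")]
marked = highlight(sentence, spans)
print(marked)
```

Working from the end of the string backwards avoids having to recompute offsets after each insertion.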


## 4. Custom Gazetteers

You can extend the built-in gazetteers by adding terms programmatically:

In [11]:
# Add extra terms on top of the built-in gazetteers
custom = FastNERExtractor(
    extra_gazetteers={
        "target": ["MyNovelTarget", "KinaseX"],
        "compound_name": ["CompoundABC"],
    }
)

r = custom.extract("CompoundABC inhibits MyNovelTarget in M. tuberculosis")
print("Compounds:", [c.text for c in r.compounds])
print("Targets:", [t.text for t in r.targets])
print("Diseases:", [d.text for d in r.diseases])

Compounds: ['CompoundABC']
Targets: ['MyNovelTarget']
Diseases: ['tuberculosis']


## 5. Adding New Entity Types

To add a new gazetteer, just drop a YAML file into the gazetteers directory.
The filename (without `.yml`) must match an `entity_type` from the entity class map.

**Built-in entity types:**

| Filename | entity_type | Python class |
|---|---|---|
| `target.yml` | target | `TargetEntity` |
| `gene_name.yml` | gene_name | `TargetEntity` |
| `compound_name.yml` | compound_name | `ChemicalEntity` |
| `disease.yml` | disease | `DiseaseEntity` |
| `accession_number.yml` | accession_number | `AccessionEntity` |
| `screening_method.yml` | screening_method | `ScreeningMethodEntity` |
| `functional_category.yml` | functional_category | `FunctionalCategoryEntity` |
| `product.yml` | product | `ProductEntity` |

**What's auto-derived from names:**
- Case variants (InhA, inha, INHA)
- Hyphen-optional forms (DprE-1 ↔ DprE1)
- Period-optional forms (M. tuberculosis ↔ M tuberculosis)
- Greek letter expansion (β-lactam ↔ beta-lactam)
- Regex patterns for accession number seeds (Rv, MT, UniProt, PDB, RefSeq)
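
The name-derived variants listed above could be generated along these lines. This is a simplified sketch, not the library's loader: `GREEK` covers only a few letters, `derive_variants` is a hypothetical helper, and the plain string replacement would also rewrite a word such as "alpha" when it appears inside a longer token.

```python
# Assumed, deliberately incomplete Greek letter table.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}


def derive_variants(name):
    """Return the set of surface forms derived from one gazetteer entry."""
    variants = {name}
    # Greek letter expansion, in both directions.
    for letter, word in GREEK.items():
        for v in list(variants):
            variants.add(v.replace(letter, word))
            variants.add(v.replace(word, letter))
    # Hyphen-optional forms.
    for v in list(variants):
        variants.add(v.replace("-", ""))
    # Period-optional forms.
    for v in list(variants):
        variants.add(v.replace(".", ""))
    # Case variants.
    for v in list(variants):
        variants.update({v.lower(), v.upper()})
    return variants


for entry in ["InhA", "DprE-1", "M. tuberculosis", "β-lactam"]:
    print(entry, "->", sorted(derive_variants(entry)))
```

Precomputing variants at load time keeps the matching step a plain dictionary lookup, at the cost of a larger term set.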

In [12]:
# See what gazetteers are loaded by default
from structflo.ner.fast._loader import load_all_gazetteers

gazetteers = load_all_gazetteers()
for entity_type, terms in gazetteers.items():
    print(f"{entity_type:25s} | {len(terms):3d} terms | first 5: {terms[:5]}")

accession_number          | 47823 terms | first 5: ['B586_RS00005', 'B586_RS00010', 'B586_RS00015', 'B586_RS00020', 'B586_RS00025']
compound_name             |  50 terms | first 5: ['Bedaquiline', 'Delamanid', 'Pretomanid', 'Isoniazid', 'Rifampicin']
disease                   |  24 terms | first 5: ['Mycobacterium tuberculosis', 'Mtb', 'tuberculosis', 'TB', 'MDR-TB']
functional_category       |  45 terms | first 5: ['DNA replication', 'PE/PPE', 'amino acid metabolism', 'arabinogalactan biosynthesis', 'cell wall and cell processes']
gene_name                 | 37157 terms | first 5: ['35kd_ag', 'AS1726', 'AS1890', 'ASdes', 'ASpks']
product                   |  34 terms | first 5: ['enoyl-ACP reductase', 'decaprenylphosphoryl-beta-D-ribose oxidase', 'decaprenylphosphoryl-beta-D-ribose 2-epimerase', 'ATP synthase subunit c', 'polyketide synthase']
screening_method          | 131 terms | first 5: ['affinity-based screening', 'affinity screening', 'biochemical screening', 'fragment-based sc

## 6. Batch Extraction & Performance

The fast extractor handles a single abstract in well under a second, and `extract` also accepts a list of texts for batch processing.

In [13]:
abstracts = [
    "Bedaquiline inhibits AtpE (Rv1305) with nanomolar potency against MDR-TB.",
    "Delamanid (OPC-67683) is activated by Ddn and targets mycolic acid biosynthesis in M. tuberculosis.",
    "Pretomanid (PA-824) requires activation by Ddn (Rv3547) and kills both replicating and non-replicating Mtb.",
    "PBTZ169 (Macozinone) inhibits DprE1 (Rv3790), an essential enzyme in cell wall biosynthesis.",
    "SQ109 targets MmpL3, a trehalose monomycolate transporter essential for cell wall assembly.",
    "Fragment-based screening identified InhA inhibitors that bypass katG-mediated activation.",
    "CRISPRi screening revealed QcrB (Rv2196) as a vulnerable target in energy metabolism.",
    "Structure-based drug design targeting KasA (Rv2245) yielded novel fatty acid biosynthesis inhibitors.",
]

results = fast.extract(abstracts)

for i, r in enumerate(results):
    entities = r.all_entities()
    print(f"Abstract {i+1}: {len(entities)} entities — {[e.text for e in entities]}")

Abstract 1: 4 entities — ['Bedaquiline', 'AtpE', 'TB', 'Rv1305']
Abstract 2: 5 entities — ['Delamanid', 'OPC-67683', 'Ddn', 'tuberculosis', 'mycolic acid biosynthesis']
Abstract 3: 5 entities — ['Pretomanid', 'PA-824', 'Ddn', 'Mtb', 'Rv3547']
Abstract 4: 5 entities — ['PBTZ169', 'Macozinone', 'DprE1', 'Rv3790', 'cell wall biosynthesis']
Abstract 5: 3 entities — ['SQ109', 'MmpL3', 'trehalose monomycolate transporter']
Abstract 6: 3 entities — ['InhA', 'katG', 'Fragment-based screening']
Abstract 7: 4 entities — ['QcrB', 'Rv2196', 'energy metabolism', 'CRISPRi screening']
Abstract 8: 4 entities — ['KasA', 'Rv2245', 'fatty acid biosynthesis', 'Structure-based drug design']


In [15]:
%%timeit -n 10
# Benchmark: extract from a single abstract
fast.extract(text)

393 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [16]:
%%timeit -n 5
# Benchmark: extract from 8 abstracts
fast.extract(abstracts)

862 ms ± 27.2 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
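
The two timings suggest that batching amortizes a fixed per-call cost. The back-of-the-envelope calculation below assumes a simple linear model, `time(n) = overhead + n * per_text`, which is only an approximation: the single-text benchmark uses a longer passage than the eight abstracts, so the split between overhead and per-text cost should be read as indicative at best.

```python
single_ms = 393.0  # one text, from the first %%timeit cell
batch_ms = 862.0   # eight abstracts, from the second %%timeit cell
n_batch = 8

# Average cost per abstract when processed as a batch.
per_abstract_in_batch = batch_ms / n_batch

# Assumed linear model: time(n) = overhead + n * per_text.
per_text = (batch_ms - single_ms) / (n_batch - 1)
overhead = single_ms - per_text

print(f"Average per abstract in batch: {per_abstract_in_batch:.0f} ms")
print(f"Estimated marginal cost per text: {per_text:.0f} ms")
print(f"Estimated fixed overhead per call: {overhead:.0f} ms")
```

Under that assumption, most of a single call is fixed overhead, which is why passing a list to `extract` is noticeably cheaper per abstract than calling it once per text.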


## Comparing Fast vs LLM Extraction

The fast extractor is ideal as a **first pass** for bulk screening.
Use the LLM extractor for deeper analysis where context and novel entities matter.

| | `FastNERExtractor` | `NERExtractor` |
|---|---|---|
| Speed | ~0.1-0.4 s per abstract (per the benchmarks above) | ~10-60s per abstract |
| Novel entities | Only known terms | Discovers new entities |
| Context awareness | None (string matching) | Full contextual understanding |
| Cost | Free (no API calls) | API costs or GPU |
| Setup | Zero config | API key or Ollama |
| Output | `NERResult` | `NERResult` (identical) |