# Fast Dictionary-Based NER for TB Drug Discovery

This notebook demonstrates `FastNERExtractor`, a fast, deterministic NER
engine that uses curated YAML gazetteers instead of an LLM. It covers:

1. Basic usage
2. How matching works (exact, regex, fuzzy)
3. Working with results (same as LLM extractor)
4. Custom gazetteers
5. Adding new entity types
6. Batch extraction & performance

## Setup

```bash
uv add "structflo-ner[fast]"
# for DataFrame support
uv add "structflo-ner[fast,dataframe]"
```

In [None]:
from structflo.ner.fast import FastNERExtractor

## 1. Basic Usage

No API key, no LLM, no network — just instantiate and extract.
The built-in gazetteers cover TB drug discovery entities.

In [None]:
fast = FastNERExtractor()

text = (
    "Bedaquiline (TMC207) is a diarylquinoline that inhibits the "
    "mycobacterial ATP synthase subunit c encoded by atpE (Rv1305). "
    "It shows potent activity against Mycobacterium tuberculosis "
    "including MDR-TB and XDR-TB. The compound was identified through "
    "whole-cell screening and targets the energy metabolism pathway."
)

result = fast.extract(text)
result

## 2. How Matching Works

The extractor uses a three-phase matching strategy:

| Phase | Method | What it catches |
|---|---|---|
| 1 | **Exact match** | Case-sensitive and normalized dictionary lookups with word-boundary enforcement |
| 1b | **Regex patterns** | Auto-derived from accession number seeds (Rv tags, UniProt, PDB, etc.) |
| 2 | **Fuzzy match** | Typos and minor variants via rapidfuzz (configurable threshold) |

Each entity's `attributes` dict includes the `match_method` used.

In [None]:
# Inspect match methods
for entity in result.all_entities():
    method = entity.attributes.get("match_method", "")
    canonical = entity.attributes.get("canonical", entity.text)
    print(f"{entity.entity_type:25s} | {entity.text:30s} | method={method:6s} | canonical={canonical}")
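
The phases in the table above can be read as a cascade: try an exact lookup first, then accession-style patterns, and fall back to fuzzy comparison only for tokens that are still unmatched. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `match_token`, the tiny `TERMS` dictionary, the `ACCESSION` pattern, and the use of the standard library's `difflib` in place of rapidfuzz are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical mini-gazetteer: normalized form -> canonical name.
TERMS = {"bedaquiline": "Bedaquiline", "inha": "InhA"}
# Hypothetical accession pattern, modelled on the Rv locus-tag example.
ACCESSION = re.compile(r"Rv\d{4}c?")


def match_token(token, fuzzy_threshold=85):
    """Return (canonical, method) for a token, or None if nothing matches."""
    key = token.lower()
    # Phase 1: exact lookup on the normalized token.
    if key in TERMS:
        return TERMS[key], "exact"
    # Phase 1b: pattern match for accession-like tokens.
    if ACCESSION.fullmatch(token):
        return token, "regex"
    # Phase 2: fuzzy fallback; a threshold of 0 switches it off.
    if fuzzy_threshold > 0:
        best = max(TERMS, key=lambda t: SequenceMatcher(None, key, t).ratio())
        score = 100 * SequenceMatcher(None, key, best).ratio()
        if score >= fuzzy_threshold:
            return TERMS[best], "fuzzy"
    return None


for tok in ["InhA", "Rv2043c", "Bedaquilne", "aspirin"]:
    print(tok, "->", match_token(tok))
```

Ordering the phases from cheapest and most precise to most permissive means the fuzzy comparison, which is the slowest step and the most likely to produce false positives, only runs on what the earlier phases leave behind.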

### Regex matching for accession numbers

Seed entries in `accession_number.yml` auto-derive regex patterns.
For example, `Rv0005` teaches the system `Rv\d{4}[c]?`, so *all* Rv locus tags
are matched — not just the ones listed.

In [None]:
# Rv2043c is NOT in the YAML, but the regex pattern catches it
result2 = fast.extract("PptT is encoded by Rv2043c and is essential for mycolic acid biosynthesis.")

print("Accessions found:")
for a in result2.accessions:
    print(f"  {a.text} (method: {a.attributes['match_method']})")

print("\nTargets found:")
for t in result2.targets:
    print(f"  {t.text}")
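
One way to picture how a seed such as `Rv0005` could generalize into a pattern is to replace each run of digits with a digit-count class and keep the letter prefix literal. The sketch below does that and adds an optional trailing `c`, mirroring the `Rv\d{4}[c]?` example. The function `derive_pattern` and its `optional_suffix` parameter are hypothetical; the library's actual derivation rules may differ.

```python
import re


def derive_pattern(seed, optional_suffix="c"):
    """Generalize a seed accession: digit runs become \\d{n}, the rest stays literal."""
    parts = []
    for chunk in re.findall(r"\d+|\D+", seed):
        if chunk.isdigit():
            parts.append(r"\d{%d}" % len(chunk))
        else:
            parts.append(re.escape(chunk))
    # Word boundaries keep the pattern from firing inside longer tokens.
    return re.compile(r"\b" + "".join(parts) + re.escape(optional_suffix) + r"?\b")


rv = derive_pattern("Rv0005")
sample = "PptT is encoded by Rv2043c; AtpE by Rv1305."
print(rv.pattern)
print(rv.findall(sample))
```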

### Fuzzy matching

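Fuzzy matching catches typos and minor spelling variants. The `fuzzy_threshold` parameter (default 85) sets the minimum similarity score for a match: raise it to be stricter, lower it to be more tolerant, or set it to `0` to disable fuzzy matching.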

In [None]:
# "Bedaquilne" is a typo for "Bedaquiline"
fuzzy_result = fast.extract("Bedaquilne showed activity against TB")

for c in fuzzy_result.compounds:
    print(f"Found: {c.text!r} -> canonical: {c.attributes.get('canonical', c.text)!r} (method: {c.attributes['match_method']})")

In [None]:
# Disable fuzzy matching for strict mode
strict = FastNERExtractor(fuzzy_threshold=0)
strict_result = strict.extract("Bedaquilne showed activity against TB")

print(f"Compounds (strict): {strict_result.compounds}")  # empty — typo not matched
print(f"Diseases  (strict): {[d.text for d in strict_result.diseases]}")  # TB still matched
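
To get a feel for what a threshold of 85 means, it helps to score a few candidate spellings against the dictionary term directly. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in, scaled to 0-100. rapidfuzz's scorers use different algorithms, so the numbers the extractor actually sees may differ. `similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


candidates = ["Bedaquilne", "Bedaquilin", "Bedaq", "Delamanid"]
for cand in candidates:
    score = similarity(cand, "Bedaquiline")
    verdict = "match" if score >= 85 else "no match"
    print(f"{cand:12s} score={score:5.1f} -> {verdict}")
```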

## 3. Working with Results

`FastNERExtractor` returns the same `NERResult` objects as the LLM-based `NERExtractor`.
All downstream tooling works identically.

In [None]:
# Typed entity lists
print("Compounds:", [c.text for c in result.compounds])
print("Targets:", [t.text for t in result.targets])
print("Diseases:", [d.text for d in result.diseases])
print("Accessions:", [a.text for a in result.accessions])
print("Screening methods:", [s.text for s in result.screening_methods])
print("Products:", [p.text for p in result.products])
print("Functional categories:", [f.text for f in result.functional_categories])

In [None]:
# Export to DataFrame
df = result.to_dataframe()
df

In [None]:
# Character offsets let you highlight entities in the source text
for entity in result.all_entities():
    if entity.char_start is not None:
        span = text[entity.char_start:entity.char_end]
        print(f"[{entity.char_start:3d}:{entity.char_end:3d}] {entity.entity_type:25s} | {span!r}")
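
Building on those offsets, a small helper can render the text with entities marked inline. The sketch below works on plain `(start, end, label)` tuples so that it runs without the library; with a real result, the tuples could be built from `entity.char_start`, `entity.char_end`, and `entity.entity_type`. `highlight` is a hypothetical helper and assumes the spans do not overlap.

```python
def highlight(source, spans):
    """Wrap each (start, end, label) span as [text|label]; spans must not overlap."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(source[cursor:start])  # untouched text before the span
        out.append(f"[{source[start:end]}|{label}]")
        cursor = end
    out.append(source[cursor:])  # trailing text after the last span
    return "".join(out)


sentence = "Bedaquiline inhibits AtpE."
spans = [(21, 25, "target"), (0, 11, "compound_name")]
print(highlight(sentence, spans))
```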

## 4. Custom Gazetteers

Each gazetteer is a YAML file containing only a flat list of names, one list item per line. The filename (without `.yml`) becomes the `entity_type`.

```yaml
# my_gazetteers/target.yml
- MyNovelTarget
- AnotherTarget
- KinaseX
```

```python
fast = FastNERExtractor(gazetteer_dir="my_gazetteers/")
```

You can also add terms programmatically without creating files:

In [None]:
# Add extra terms on top of the built-in gazetteers
custom = FastNERExtractor(
    extra_gazetteers={
        "target": ["MyNovelTarget", "KinaseX"],
        "compound_name": ["CompoundABC"],
    }
)

r = custom.extract("CompoundABC inhibits MyNovelTarget in M. tuberculosis")
print("Compounds:", [c.text for c in r.compounds])
print("Targets:", [t.text for t in r.targets])
print("Diseases:", [d.text for d in r.diseases])

## 5. Adding New Entity Types

To add a new gazetteer, drop a YAML file into the gazetteers directory.
The filename (without `.yml`) must match an `entity_type` in the entity class map; the built-in types are listed in the table below.

**Built-in entity types:**

| Filename | entity_type | Python class |
|---|---|---|
| `target.yml` | target | `TargetEntity` |
| `gene_name.yml` | gene_name | `TargetEntity` |
| `compound_name.yml` | compound_name | `ChemicalEntity` |
| `disease.yml` | disease | `DiseaseEntity` |
| `accession_number.yml` | accession_number | `AccessionEntity` |
| `screening_method.yml` | screening_method | `ScreeningMethodEntity` |
| `functional_category.yml` | functional_category | `FunctionalCategoryEntity` |
| `product.yml` | product | `ProductEntity` |

**What's auto-derived from names:**
- Case variants (InhA, inha, INHA)
- Hyphen-optional forms (DprE-1 ↔ DprE1)
- Period-optional forms (M. tuberculosis ↔ M tuberculosis)
- Greek letter expansion (β-lactam ↔ beta-lactam)
- Regex patterns for accession number seeds (Rv, MT, UniProt, PDB, RefSeq)
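
The name-based derivations listed above amount to expanding each name into a small set of surface forms. A minimal sketch of such an expansion follows. `variants` and the `GREEK` table are hypothetical, cover only a few letters, and are not taken from the library, which may generate or restrict forms differently.

```python
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}  # illustrative subset


def variants(name):
    """Return surface forms: Greek spelled out, hyphens/periods dropped, case-folded."""
    forms = {name}
    # Greek letter expansion (β-lactam -> beta-lactam).
    for letter, word in GREEK.items():
        forms |= {f.replace(letter, word) for f in forms}
    # Hyphen-optional and period-optional forms.
    forms |= {f.replace("-", "") for f in forms}
    forms |= {f.replace(".", "") for f in forms}
    # Case variants.
    forms |= {f.lower() for f in forms} | {f.upper() for f in forms}
    return forms


print(sorted(variants("DprE-1")))
print(sorted(variants("M. tuberculosis")))
```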

In [None]:
# See what gazetteers are loaded by default
from structflo.ner.fast._loader import load_all_gazetteers

gazetteers = load_all_gazetteers()
for entity_type, terms in gazetteers.items():
    print(f"{entity_type:25s} | {len(terms):3d} terms | first 5: {terms[:5]}")

## 6. Batch Extraction & Performance

The fast extractor processes an abstract in a few milliseconds, roughly three orders of magnitude faster than LLM-based extraction, which typically takes seconds per abstract.

In [None]:
abstracts = [
    "Bedaquiline inhibits AtpE (Rv1305) with nanomolar potency against MDR-TB.",
    "Delamanid (OPC-67683) is activated by Ddn and targets mycolic acid biosynthesis in M. tuberculosis.",
    "Pretomanid (PA-824) requires activation by Ddn (Rv3547) and kills both replicating and non-replicating Mtb.",
    "PBTZ169 (Macozinone) inhibits DprE1 (Rv3790), an essential enzyme in cell wall biosynthesis.",
    "SQ109 targets MmpL3, a trehalose monomycolate transporter essential for cell wall assembly.",
    "Fragment-based screening identified InhA inhibitors that bypass katG-mediated activation.",
    "CRISPRi screening revealed QcrB (Rv2196) as a vulnerable target in energy metabolism.",
    "Structure-based drug design targeting KasA (Rv2245) yielded novel fatty acid biosynthesis inhibitors.",
]

results = fast.extract(abstracts)

for i, r in enumerate(results):
    entities = r.all_entities()
    print(f"Abstract {i+1}: {len(entities)} entities — {[e.text for e in entities]}")

In [None]:
%%timeit -n 100
# Benchmark: extract from a single abstract
fast.extract(text)

In [None]:
%%timeit -n 10
# Benchmark: extract from 8 abstracts
fast.extract(abstracts)
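
`%%timeit` only works inside IPython or Jupyter. For a script or CI job, a rough equivalent can be built from `time.perf_counter`. The helper below, `time_call`, is a hypothetical utility and is demonstrated on a stand-in function so that it runs without the library.

```python
import statistics
import time


def time_call(fn, *args, repeats=100):
    """Call fn(*args) `repeats` times and return the median wall time in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def count_words(s):
    # Stand-in workload; swap in the extractor call when timing real extraction.
    return len(s.split())


print(f"median: {time_call(count_words, 'a b c ' * 1000):.3f} ms")
```

With the extractor from this notebook, the call would be `time_call(fast.extract, text)`. The median is reported rather than the mean so that a single slow run (for example, a cold cache) does not skew the figure.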

## Comparing Fast vs LLM Extraction

The fast extractor is ideal as a **first pass** for bulk screening.
Use the LLM extractor for deeper analysis where context and novel entities matter.

| | `FastNERExtractor` | `NERExtractor` |
|---|---|---|
| Speed | ~1–5 ms per abstract | ~2–5 s per abstract |
| Novel entities | Only known terms | Discovers new entities |
| Context awareness | None (string matching) | Full contextual understanding |
| Cost | Free (no API calls) | API costs or GPU |
| Setup | Zero config | API key or Ollama |
| Output | `NERResult` | `NERResult` (identical) |
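
The first-pass idea can be sketched as a simple triage step: run the fast extractor over everything and escalate only the documents it leaves uncovered. The routing rule used here (escalate when fewer than `min_entities` are found) is an assumption chosen for illustration, not a recommendation from the library, and `triage` and `toy_extract` are hypothetical names.

```python
def triage(docs, fast_extract, min_entities=1):
    """Split docs into those the fast pass covers and those to send to a slower extractor."""
    covered, escalate = [], []
    for doc in docs:
        entities = fast_extract(doc)
        (covered if len(entities) >= min_entities else escalate).append(doc)
    return covered, escalate


KNOWN = {"bedaquiline", "inha", "dpre1"}


def toy_extract(doc):
    # Stand-in for a dictionary extractor: returns known terms found in the text.
    return [w for w in doc.lower().replace(".", "").split() if w in KNOWN]


docs = ["Bedaquiline inhibits ATP synthase.", "A novel scaffold kills Mtb."]
covered, escalate = triage(docs, toy_extract)
print("fast pass only:", covered)
print("send to LLM:   ", escalate)
```

With the real extractor, `fast_extract` could be something like `lambda d: fast.extract(d).all_entities()`.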