# Quickstart | structflo.ner

This notebook walks through the core features of `structflo.ner`:

1. Basic extraction with a cloud model (Gemini)
2. Local extraction with Ollama
3. Using built-in profiles
4. Custom profiles
5. Working with results
6. Batch extraction

## Setup

```bash
pip install structflo-ner
```

In [2]:
from structflo.ner import NERExtractor, TB


In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)


## 1. Cloud model (Gemini)

The default model is `gemini-2.5-flash`. Pass your API key or set the `GEMINI_API_KEY` environment variable.

In [None]:
extractor = NERExtractor(api_key="YOUR_GEMINI_KEY")  # or set GEMINI_API_KEY env var

text = (
    "Gefitinib (ZD1839) is a first-generation EGFR tyrosine kinase inhibitor "
    "with IC50 = 0.033 µM, approved for non-small cell lung cancer (NSCLC). "
    "Its SMILES is COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1."
)

result = extractor.extract(text)
result
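
If you would rather not hard-code the key, one option is to read it from the environment yourself and fail early with a clear message. The sketch below uses only the standard library; `require_env` is a hypothetical helper name and not part of `structflo.ner`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or pass api_key explicitly.")
    return value


# Hypothetical usage with the extractor from the cell above:
# extractor = NERExtractor(api_key=require_env("GEMINI_API_KEY"))
```

Failing at startup tends to be easier to debug than an authentication error raised in the middle of an extraction call.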

## 2. Local model via Ollama

Run extraction on your own hardware.

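Make sure Ollama is running locally and the model has been pulled. `ollama serve` keeps running in the foreground, so run `ollama pull` in a second terminal: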
```bash
ollama serve
ollama pull qwen2.5:72b
```
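
Before constructing the extractor, it can help to confirm that the server is actually reachable. The following is a minimal sketch that queries Ollama's `/api/tags` endpoint (which lists locally available models) with the standard library; `ollama_is_up` is a hypothetical helper name, and the default URL assumes the standard local port used in this notebook.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on `base_url`, False otherwise."""
    try:
        # Any HTTP 200 from /api/tags means the server process is accepting requests.
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `ollama_is_up()` before the first extraction gives a quick yes/no answer instead of a long traceback when the server has not been started.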


In [30]:
extractor = NERExtractor(
    model_id="qwen2.5:72b",
    model_url="http://localhost:11434",
)

In [32]:
text = ("Gefitinib (ZD1839) is a first-generation EGFR inhibitor with IC50 = 0.033 µM approved for NSCLC."
        "Its SMILES is COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1.")
result = extractor.extract(text)
result

In [33]:
result.to_dataframe()

| | text | entity_type | entity_class | char_start | char_end | alignment | synonyms | therapeutic_area | value | unit | assay_type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gefitinib | compound_name | ChemicalEntity | 0 | 9 | match_exact | ZD1839 | | | | |
| 1 | ZD1839 | compound_name | ChemicalEntity | 11 | 17 | match_exact | | | | | |
| 2 | COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1 | smiles | ChemicalEntity | 110 | 156 | match_exact | | | | | |
| 3 | EGFR | target | TargetEntity | 41 | 45 | match_exact | | | | | |
| 4 | NSCLC | disease | DiseaseEntity | 90 | 95 | match_exact | | oncology | | | |
| 5 | IC50 = 0.033 µM | bioactivity | BioactivityEntity | 61 | 76 | match_exact | | | 0.033 | µM | IC50 |


## TB-specific extractor: pass `profile=TB`

In [6]:

extractor = NERExtractor(
    model_id="qwen2.5:72b",
    model_url="http://localhost:11434",
    profile=TB,
)

In [10]:
text = (
    "Bedaquiline (TMC207) is a diarylquinoline that inhibits the "
    "mycobacterial ATP synthase subunit c encoded by atpE (Rv1305). "
    "It shows potent activity against Mycobacterium tuberculosis "
    "including MDR-TB and XDR-TB. This compound was identified through "
    "whole-cell screening and targets the energy metabolism pathway."
)
result = extractor.extract(text)
result

## 3. Built-in profiles

Profiles control which entity types are extracted. Use them to focus the model on specific categories.

In [12]:
from structflo.ner import CHEMISTRY, BIOLOGY

# Extract only chemical entities; the per-call profile is used for this extraction
chem_result = extractor.extract(text, profile=CHEMISTRY)
print("Compounds:", chem_result.compounds)
print("Targets:", chem_result.targets)  # empty: targets are not part of the CHEMISTRY profile

Compounds: [ChemicalEntity(text='Bedaquiline', entity_type='compound_name', char_start=0, char_end=11, attributes={'synonyms': 'TMC207'}, alignment='match_exact'), ChemicalEntity(text='TMC207', entity_type='compound_name', char_start=13, char_end=19, attributes={}, alignment='match_exact')]
Targets: []


In [13]:
# Merge profiles to combine entity types
combined = CHEMISTRY.merge(BIOLOGY)
print(f"Profile: {combined.name}")
print(f"Entity classes: {combined.entity_classes}")

combined_result = extractor.extract(text, profile=combined)
print("Compounds:", combined_result.compounds)
print("Targets:", combined_result.targets)

Profile: chemistry+biology
Entity classes: ['compound_name', 'smiles', 'cas_number', 'molecular_formula', 'target', 'gene_name', 'protein_name']
Compounds: [ChemicalEntity(text='Bedaquiline', entity_type='compound_name', char_start=0, char_end=11, attributes={'synonyms': 'TMC207'}, alignment='match_exact'), ChemicalEntity(text='TMC207', entity_type='compound_name', char_start=13, char_end=19, attributes={}, alignment='match_exact')]
Targets: [TargetEntity(text='mycobacterial ATP synthase subunit c', entity_type='target', char_start=60, char_end=96, attributes={'gene_name': 'atpE (Rv1305)', 'organism': 'Mycobacterium tuberculosis'}, alignment='match_exact'), TargetEntity(text='energy metabolism pathway', entity_type='target', char_start=286, char_end=311, attributes={'protein_family': 'pathway', 'organism': 'Mycobacterium tuberculosis'}, alignment='match_exact')]
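
To make the merge behaviour above concrete, here is a toy model of it. This is an illustration only and not the library's implementation: `ToyProfile` is a hypothetical stand-in, and the split of entity types between the two profiles is an assumption inferred from the printed `Entity classes` list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyProfile:
    """Hypothetical stand-in for a profile: a name plus the entity types it extracts."""

    name: str
    entity_classes: tuple = ()

    def merge(self, other: "ToyProfile") -> "ToyProfile":
        # Concatenate both tuples and drop duplicates while preserving first-seen order.
        merged = tuple(dict.fromkeys(self.entity_classes + other.entity_classes))
        return ToyProfile(name=f"{self.name}+{other.name}", entity_classes=merged)


# Assumed split of the entity types printed above.
chemistry = ToyProfile("chemistry", ("compound_name", "smiles", "cas_number", "molecular_formula"))
biology = ToyProfile("biology", ("target", "gene_name", "protein_name"))

combined = chemistry.merge(biology)
print(combined.name)
print(list(combined.entity_classes))
```

Under these assumptions the toy merge reproduces the name `chemistry+biology` and the seven entity types shown in the output above.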


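## 5. Working with results

The cells below reuse `result` from the TB extraction above, so the output includes TB-specific entity types such as `accession_number` and `screening_method`.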

In [15]:
# Access typed entity lists
print("Compounds:", result.compounds)
print("Targets:", result.targets)
print("Bioactivities:", result.bioactivities)
print("Diseases:", result.diseases)
print("Mechanisms:", result.mechanisms)

Compounds: [ChemicalEntity(text='Bedaquiline', entity_type='compound_name', char_start=0, char_end=11, attributes={'synonyms': 'TMC207'}, alignment='match_exact'), ChemicalEntity(text='TMC207', entity_type='compound_name', char_start=13, char_end=19, attributes={}, alignment='match_exact')]
Targets: [TargetEntity(text='ATP synthase subunit c', entity_type='target', char_start=74, char_end=96, attributes={'gene_name': 'atpE', 'protein_family': 'ATP synthase'}, alignment='match_exact')]
Bioactivities: []
Diseases: [DiseaseEntity(text='MDR-TB', entity_type='disease', char_start=193, char_end=199, attributes={'therapeutic_area': 'infectious disease'}, alignment='match_exact'), DiseaseEntity(text='XDR-TB', entity_type='disease', char_start=204, char_end=210, attributes={'therapeutic_area': 'infectious disease'}, alignment='match_exact')]
Mechanisms: []


In [16]:
# Flat list of all entities
for entity in result.all_entities():
    print(f"{entity.entity_type:20s} | {entity.text}")

compound_name        | Bedaquiline
compound_name        | TMC207
target               | ATP synthase subunit c
disease              | MDR-TB
disease              | XDR-TB
accession_number     | Rv1305
functional_category  | energy metabolism pathway
screening_method     | whole-cell screening
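
The flat list is convenient for quick aggregation. The sketch below groups entity texts by `entity_type`; it uses a small stand-in class (`ToyEntity`, hypothetical) with the same two attributes as the loop above, so it runs without a model. With a real result you would presumably pass `result.all_entities()` instead.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToyEntity:
    """Hypothetical stand-in exposing the two attributes used in the loop above."""

    entity_type: str
    text: str


def group_by_type(entities):
    """Map each entity_type to the list of entity texts, in encounter order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.entity_type].append(entity.text)
    return dict(grouped)


# Entities copied from the printed output above.
entities = [
    ToyEntity("compound_name", "Bedaquiline"),
    ToyEntity("compound_name", "TMC207"),
    ToyEntity("target", "ATP synthase subunit c"),
    ToyEntity("disease", "MDR-TB"),
    ToyEntity("disease", "XDR-TB"),
    ToyEntity("accession_number", "Rv1305"),
]

for entity_type, texts in group_by_type(entities).items():
    print(f"{entity_type:20s} | {len(texts)} | {', '.join(texts)}")
```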


In [17]:
# Export to pandas DataFrame
df = result.to_dataframe()
df

| | text | entity_type | entity_class | char_start | char_end | alignment | synonyms | gene_name | protein_family | therapeutic_area |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bedaquiline | compound_name | ChemicalEntity | 0 | 11 | match_exact | TMC207 | | | |
| 1 | TMC207 | compound_name | ChemicalEntity | 13 | 19 | match_exact | | | | |
| 2 | ATP synthase subunit c | target | TargetEntity | 74 | 96 | match_exact | | atpE | ATP synthase | |
| 3 | MDR-TB | disease | DiseaseEntity | 193 | 199 | match_exact | | | | infectious disease |
| 4 | XDR-TB | disease | DiseaseEntity | 204 | 210 | match_exact | | | | infectious disease |
| 5 | Rv1305 | accession_number | AccessionEntity | 114 | 120 | match_exact | | | | |
| 6 | energy metabolism pathway | functional_category | FunctionalCategoryEntity | 286 | 311 | match_exact | | | | |
| 7 | whole-cell screening | screening_method | ScreeningMethodEntity | 249 | 269 | match_exact | | | | |
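
The `char_start` and `char_end` columns behave like a half-open Python slice into the source text. As a sanity check, the sketch below slices the TB example text with the offsets from the table and compares each slice with the reported entity text. The rows are copied from the output above; `span_matches` is a hypothetical helper name.

```python
source_text = (
    "Bedaquiline (TMC207) is a diarylquinoline that inhibits the "
    "mycobacterial ATP synthase subunit c encoded by atpE (Rv1305). "
    "It shows potent activity against Mycobacterium tuberculosis "
    "including MDR-TB and XDR-TB. This compound was identified through "
    "whole-cell screening and targets the energy metabolism pathway."
)

# (text, char_start, char_end) copied from the DataFrame output.
rows = [
    ("Bedaquiline", 0, 11),
    ("TMC207", 13, 19),
    ("ATP synthase subunit c", 74, 96),
    ("MDR-TB", 193, 199),
    ("XDR-TB", 204, 210),
    ("Rv1305", 114, 120),
    ("energy metabolism pathway", 286, 311),
    ("whole-cell screening", 249, 269),
]


def span_matches(source: str, text: str, start: int, end: int) -> bool:
    """True if source[start:end] reproduces the entity text exactly."""
    return source[start:end] == text


mismatches = [row for row in rows if not span_matches(source_text, *row)]
print("Mismatched spans:", mismatches)
```

A check like this is a cheap guard when entities are later used to highlight or annotate the original document.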


In [18]:
# Serialize to dict (useful for JSON export)
import json

print(json.dumps(result.to_dict(), indent=2))

{
  "source_text": "Bedaquiline (TMC207) is a diarylquinoline that inhibits the mycobacterial ATP synthase subunit c encoded by atpE (Rv1305). It shows potent activity against Mycobacterium tuberculosis including MDR-TB and XDR-TB. This compound was identified through whole-cell screening and targets the energy metabolism pathway.",
  "compounds": [
    {
      "text": "Bedaquiline",
      "entity_type": "compound_name",
      "char_start": 0,
      "char_end": 11,
      "attributes": {
        "synonyms": "TMC207"
      },
      "alignment": "match_exact"
    },
    {
      "text": "TMC207",
      "entity_type": "compound_name",
      "char_start": 13,
      "char_end": 19,
      "attributes": {},
      "alignment": "match_exact"
    }
  ],
  "targets": [
    {
      "text": "ATP synthase subunit c",
      "entity_type": "target",
      "char_start": 74,
      "char_end": 96,
      "attributes": {
        "gene_name": "atpE",
        "protein_family": "ATP synthase"
      },
      "alignment": "match_exact"
    }
  ],
  ... (remaining output truncated)

## 6. Batch extraction

Pass a list of texts to extract from multiple documents.

In [19]:
texts = [
    "Imatinib inhibits BCR-ABL with IC50 = 0.6 µM in CML.",
    "Trastuzumab targets HER2 in breast cancer patients.",
    "Remdesivir (GS-5734) is an antiviral with EC50 = 0.77 µM against SARS-CoV-2.",
]

results = extractor.extract(texts)

for i, r in enumerate(results):
    print(f"\n--- Text {i+1} ---")
    for entity in r.all_entities():
        print(f"  {entity.entity_type:20s} | {entity.text}")


--- Text 1 ---
  compound_name        | Imatinib
  target               | BCR-ABL
  disease              | CML
  bioactivity          | IC50 = 0.6 µM

--- Text 2 ---
  compound_name        | Trastuzumab
  target               | HER2
  disease              | breast cancer

--- Text 3 ---
  compound_name        | Remdesivir
  compound_name        | GS-5734
  disease              | SARS-CoV-2
  bioactivity          | EC50 = 0.77 µM
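
For larger batches, one way to persist the output is JSON Lines, with one serialized result per line. The sketch below is an assumption about a convenient storage layout, not a feature of `structflo.ner`; `write_jsonl` and `read_jsonl` are hypothetical helpers, and the sample records are plain dicts shaped loosely like the `to_dict()` output shown earlier.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line to `path`."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            # ensure_ascii=False keeps characters such as "µ" readable in the file.
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Stand-in records; with real results this would presumably be
# [r.to_dict() for r in results].
sample_records = [
    {"source_text": "Imatinib inhibits BCR-ABL with IC50 = 0.6 µM in CML."},
    {"source_text": "Trastuzumab targets HER2 in breast cancer patients."},
]
```

Calling `write_jsonl(sample_records, "entities.jsonl")` produces a file that can be appended to or streamed line by line, which is harder with a single large JSON array.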
