<a href="https://colab.research.google.com/github/spyysalo/dl-binf-summer-school-2023/blob/main/chemical_ner_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named entity recognition with a fine-tuned model

This notebook demonstrates named entity recognition (NER) with a fine-tuned model on Colab.

First, we'll install the required Python package. The [transformers](https://huggingface.co/docs/transformers/index) package is used to load the model and run prediction.

In [1]:
!pip install --quiet transformers

Next, we'll import the `AutoTokenizer` and `AutoModelForTokenClassification` classes and the `pipeline` function. The two classes load tokenizers and models from the [Hugging Face Hub](https://huggingface.co/models), and `pipeline` wraps them into an easy-to-use pipeline.

In [2]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

Load a model fine-tuned for chemical NER, together with its tokenizer, from the Hugging Face Hub.

In [4]:
MODEL_NAME = 'alvaroalon2/biobert_chemical_ner'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)

Wrap the tokenizer and model in a pipeline that takes care of tokenization, decoding of output, and interpreting output as named entity spans.

In [13]:
pipe = pipeline(
    'ner',
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy='first',
    device=model.device,
)
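To make the role of `aggregation_strategy='first'` more concrete, the following is a minimal sketch of the idea in plain Python, not the library's actual implementation. It assumes WordPiece-style tokens in which continuation pieces start with `##`, and simplified labels without `B-`/`I-` prefixes. The token split and per-token labels are hypothetical and were written by hand for illustration: each word takes the label of its first piece, and adjacent words with the same label are then joined into one entity.

```python
def aggregate_first(tokens, labels):
    """Merge WordPiece tokens into words; each word keeps its first piece's label."""
    words = []  # list of (word, label) pairs
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            # Continuation piece: extend the previous word and ignore this label.
            word, first_label = words[-1]
            words[-1] = (word + token[2:], first_label)
        else:
            words.append((token, label))
    return words


def group_entities(words, outside="O"):
    """Join runs of adjacent words that share the same non-outside label."""
    groups = []
    prev_label = outside
    for word, label in words:
        if label != outside and label == prev_label:
            text, _ = groups[-1]
            groups[-1] = (text + " " + word, label)
        elif label != outside:
            groups.append((word, label))
        prev_label = label
    return groups


# Hypothetical tokenization and (deliberately inconsistent) per-token labels.
tokens = ["in", "cip", "##ro", "##flo", "##xa", "##cin", "with", "bo", "##ron"]
labels = ["O", "CHEMICAL", "CHEMICAL", "O", "CHEMICAL", "CHEMICAL", "O", "CHEMICAL", "O"]

words = aggregate_first(tokens, labels)
entities = group_entities(words)
print(words)
print(entities)
```

In this sketch the stray `O` predictions on `##flo` and `##ron` do not split the words, because only the first piece of each word decides its label.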

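Test the pipeline on the text of a PubMed abstract and print the type, start and end character offsets, and text of each detected entity.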

In [24]:
text = '''Synthesis and antibacterial evaluation of a novel tricyclic oxaborole-fused fluoroquinolone.    We have designed and synthesized a novel class of compounds based on fluoroquinolone antibacterial prototype. The design concept involved the replacement of the 3-carboxylic acid in ciprofloxacin with an oxaborole-fused ring as an acid-mimicking group. The synthetic method employed in this work provides a good example of incorporating boron atom in complex molecules with multiple functional groups. The antibacterial activity of the newly synthesized compounds has been evaluated.'''

names = pipe(text)

for n in names:
    print(f"{n['entity_group']}\t{n['start']}\t{n['end']}\t{n['word']}")

CHEMICAL	50	69	tricyclic oxaborole
CHEMICAL	76	91	fluoroquinolone
CHEMICAL	165	180	fluoroquinolone
CHEMICAL	257	274	3 - carboxylic acid
CHEMICAL	278	291	ciprofloxacin
CHEMICAL	300	309	oxaborole
CHEMICAL	433	438	boron
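Note that the `word` field is decoded from tokens, so it can differ from the input (for example `3 - carboxylic acid` above has spaces around the hyphen). The `start` and `end` fields are character offsets into the input string, so slicing the text with them recovers the exact surface form. The sketch below illustrates this without running the model: the short text and the entity dictionaries are hand-written stand-ins that mimic the pipeline output format, and `make_entity` is a hypothetical helper used only to build them.

```python
text = "replacement of the 3-carboxylic acid in ciprofloxacin"


def make_entity(surface, word):
    """Build a dict shaped like one pipeline result (hypothetical helper)."""
    start = text.index(surface)
    return {
        "entity_group": "CHEMICAL",
        "start": start,
        "end": start + len(surface),
        "word": word,  # decoded token text, may not match the input exactly
    }


# Hand-written stand-ins for pipeline output.
names = [
    make_entity("3-carboxylic acid", "3 - carboxylic acid"),
    make_entity("ciprofloxacin", "ciprofloxacin"),
]

# Slice the original text with the character offsets to get exact mentions.
surfaces = [text[n["start"]:n["end"]] for n in names]

for n, surface in zip(names, surfaces):
    print(f"{n['entity_group']}\t{n['word']!r}\t{surface!r}")
```

The same slicing, `text[n['start']:n['end']]`, can be applied to the `names` list returned by `pipe(text)` when exact mention strings are needed.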
