## BERT NER Models Analysis

This notebook compares the following BERT-based models for named entity recognition (NER) in the biomedical domain, using text drawn from several corpora.

Models Considered:
1. [d4data/biomedical-ner-all](https://huggingface.co/d4data/biomedical-ner-all)
2. [siddharthtumre/biobert-finetuned-ner](https://huggingface.co/siddharthtumre/biobert-finetuned-ner)
3. [fidukm34/biobert_v1.1_pubmed-finetuned-ner](https://huggingface.co/fidukm34/biobert_v1.1_pubmed-finetuned-ner)

Document Sources:
1. PubMed
2. bioRxiv
3. medRxiv


#### Installing Libraries

In [None]:
# !pip install transformers pandas
# !pip install torch

import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline


In [33]:
pubmed_df = pd.read_csv("../data/pubmed_papers.tsv", sep="\t")

biorxiv_df = pd.read_csv("../data/biorxiv_papers.tsv", sep="\t")

### Model 1: d4data/biomedical-ner-all


- Built on top of *distilbert-base-uncased*
- Trained on the MACCROBAT dataset, which consists of 200 annotated documents and their corresponding PMIDs


In [7]:
MODEL_NAME = "d4data/biomedical-ner-all"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)

# aggregation_strategy="simple" groups adjacent tokens that share an entity type into one span
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

Device set to use cuda:0


#### Checking the model BIO tags

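The labels follow the BIO scheme: `B-` marks the beginning of an entity, `I-` marks a token inside an entity, and `O` marks a token outside any entity.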
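As a minimal sketch of how BIO tags map back to entity spans, the helper below groups tagged tokens into `(entity_type, text)` pairs. The function name `bio_to_spans` and the example tokens are illustrative assumptions; the pipeline used in this notebook performs its own grouping internally.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity; the token is dropped.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence, tagged by hand for illustration.
tokens = ["Prosthetic", "joint", "infections", "are", "serious"]
tags = ["B-Disease", "I-Disease", "I-Disease", "O", "O"]
print(bio_to_spans(tokens, tags))  # [('Disease', 'Prosthetic joint infections')]
```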


In [8]:
for tag_id, label in model.config.id2label.items():
    print(f"{tag_id}: {label}")

0: O
1: B-Activity
2: B-Administration
3: B-Age
4: B-Area
5: B-Biological_attribute
6: B-Biological_structure
7: B-Clinical_event
8: B-Color
9: B-Coreference
10: B-Date
11: B-Detailed_description
12: B-Diagnostic_procedure
13: B-Disease_disorder
14: B-Distance
15: B-Dosage
16: B-Duration
17: B-Family_history
18: B-Frequency
19: B-Height
20: B-History
21: B-Lab_value
22: B-Mass
23: B-Medication
24: B-Non[biological](Detailed_description
25: B-Nonbiological_location
26: B-Occupation
27: B-Other_entity
28: B-Other_event
29: B-Outcome
30: B-Personal_[back](Biological_structure
31: B-Personal_background
32: B-Qualitative_concept
33: B-Quantitative_concept
34: B-Severity
35: B-Sex
36: B-Shape
37: B-Sign_symptom
38: B-Subject
39: B-Texture
40: B-Therapeutic_procedure
41: B-Time
42: B-Volume
43: B-Weight
44: I-Activity
45: I-Administration
46: I-Age
47: I-Area
48: I-Biological_attribute
49: I-Biological_structure
50: I-Clinical_event
51: I-Color
52: I-Coreference
53: I-Date
54: I-Detailed_desc

#### Printing the recognised entities and the corresponding entity groups

In [None]:
for idx, row in pubmed_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    
    print(f"\n--- PMID: {pmid} ---")

    try:
        entities = ner_pipeline(text)
        print(entities)
        for ent in entities:
            print(f" - {ent['word']} ({ent['entity_group']}): {ent['score']:.3f}")
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")



--- PMID: 28600191 ---
[{'entity_group': 'Detailed_description', 'score': np.float32(0.8960703), 'word': 'lipid - polymer hybrid nano', 'start': 0, 'end': 25}, {'entity_group': 'Detailed_description', 'score': np.float32(0.5850217), 'word': '##le', 'start': 31, 'end': 33}, {'entity_group': 'Detailed_description', 'score': np.float32(0.87572235), 'word': 'lipid - polymer hybrid nano', 'start': 128, 'end': 153}, {'entity_group': 'Detailed_description', 'score': np.float32(0.9021482), 'word': '##ticles', 'start': 156, 'end': 162}, {'entity_group': 'Detailed_description', 'score': np.float32(0.6210673), 'word': 'lp', 'start': 164, 'end': 166}, {'entity_group': 'Coreference', 'score': np.float32(0.7854445), 'word': '##hn', 'start': 166, 'end': 168}, {'entity_group': 'Coreference', 'score': np.float32(0.8072966), 'word': '##hn', 'start': 509, 'end': 511}, {'entity_group': 'Coreference', 'score': np.float32(0.5084899), 'word': 'lp', 'start': 677, 'end': 679}, {'entity_group': 'Coreference', 
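The raw output above is hard to compare across models, and it mixes confident predictions with ones scored near 0.5. One possible way to tidy it is to flatten each document's entities into plain records and drop low-scoring ones. This is only a sketch: `to_records`, the `doc_id` and `model` fields, and the 0.6 threshold are assumptions, not part of the notebook.

```python
def to_records(doc_id, entities, model_name, min_score=0.0):
    """Flatten pipeline entities into plain dicts, keeping scores >= min_score."""
    records = []
    for ent in entities:
        score = float(ent["score"])  # converts np.float32 to a plain float
        if score < min_score:
            continue
        records.append({
            "doc_id": doc_id,
            "model": model_name,
            "entity_group": ent["entity_group"],
            "word": ent["word"],
            "score": round(score, 3),
            "start": ent["start"],
            "end": ent["end"],
        })
    return records


# Two entities copied from the output above for PMID 28600191.
sample = [
    {"entity_group": "Detailed_description", "score": 0.8960703,
     "word": "lipid - polymer hybrid nano", "start": 0, "end": 25},
    {"entity_group": "Coreference", "score": 0.5084899,
     "word": "lp", "start": 677, "end": 679},
]
records = to_records(28600191, sample, "d4data/biomedical-ner-all", min_score=0.6)
print(records)
```

A list of such records can be passed to `pd.DataFrame(records)` to build one table per model for side-by-side comparison.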

In [34]:
for idx, row in biorxiv_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    
    print(f"\n--- PMID: {pmid} ---")

    try:
        entities = ner_pipeline(text)
        print(entities)
        for ent in entities:
            print(f" - {ent['word']} ({ent['entity_group']}): {ent['score']:.3f}")
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



--- PMID: 1 ---
[]

--- PMID: 2 ---
[{'entity_group': 'Disease', 'score': np.float32(0.9378797), 'word': 'or', 'start': 65, 'end': 67}, {'entity_group': 'Disease', 'score': np.float32(0.9494301), 'word': '##th', 'start': 67, 'end': 69}, {'entity_group': 'Disease', 'score': np.float32(0.9672202), 'word': '##op', 'start': 69, 'end': 71}, {'entity_group': 'Disease', 'score': np.float32(0.96502066), 'word': '##ae', 'start': 71, 'end': 73}, {'entity_group': 'Disease', 'score': np.float32(0.812373), 'word': '##dic device infection', 'start': 73, 'end': 93}, {'entity_group': 'Disease', 'score': np.float32(0.98867023), 'word': 'Pro', 'start': 95, 'end': 98}, {'entity_group': 'Disease', 'score': np.float32(0.9901896), 'word': '##st', 'start': 98, 'end': 100}, {'entity_group': 'Disease', 'score': np.float32(0.9863632), 'word': '##hetic joint infections', 'start': 100, 'end': 122}, {'entity_group': 'Disease', 'score': np.float32(0.9716655), 'word': 'pro', 'start': 624, 'end': 627}, {'entity_grou
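The warning printed at the top of this output appears because the pipeline is called once per row. A possible alternative, sketched below, is to pass all texts in a single call with a `batch_size` argument, which the `transformers` pipeline accepts at call time. The wrapper name `run_ner_batched`, the batch size of 8, and the stand-in `fake_ner` callable are assumptions made so the example runs without downloading a model.

```python
def run_ner_batched(ner, texts, batch_size=8):
    """Run a token-classification pipeline over many texts in one call.

    `ner` is expected to behave like a transformers pipeline: given a list of
    strings it returns one list of entities per input string.
    """
    texts = list(texts)
    if not texts:
        return []
    return ner(texts, batch_size=batch_size)


def fake_ner(texts, batch_size=1):
    """Stand-in for ner_pipeline, used here only to keep the sketch runnable."""
    return [[{"word": text.split()[0]}] for text in texts]


doc_ids = [1, 2]
texts = ["Title one. Abstract one.", "Title two. Abstract two."]
for doc_id, entities in zip(doc_ids, run_ner_batched(fake_ner, texts)):
    print(doc_id, entities)
```

In the notebook this would be called as `run_ner_batched(ner_pipeline, texts)`. Unlike the per-row `try`/`except` loop, a single failing text would raise for the whole call, so the choice depends on how clean the input data is.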

### Model 2: siddharthtumre/biobert-finetuned-ner

- Built on top of *dmis-lab/biobert-base-cased-v1.2*
- Trained on the JNLPBA dataset


In [36]:
MODEL_NAME = "siddharthtumre/biobert-finetuned-ner" 
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)


ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

Device set to use cuda:0


In [37]:
print(model.config.id2label)

{0: 'O', 1: 'B-DNA', 10: 'I-protein', 2: 'I-DNA', 3: 'B-RNA', 4: 'I-RNA', 5: 'B-cell_line', 6: 'I-cell_line', 7: 'B-cell_type', 8: 'I-cell_type', 9: 'B-protein'}
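To see which entity types a model can predict, independent of the `B-`/`I-` prefixes, the label map can be reduced to a set of type names. This is a small sketch; the helper name `entity_types` is an assumption, and the dictionary is copied from the output above.

```python
def entity_types(id2label):
    """Return the sorted entity types in a BIO label map, without B-/I- prefixes."""
    return sorted({
        label.split("-", 1)[1]      # drop the "B" or "I" prefix
        for label in id2label.values()
        if "-" in label             # skips the "O" (outside) label
    })


id2label = {0: 'O', 1: 'B-DNA', 10: 'I-protein', 2: 'I-DNA', 3: 'B-RNA',
            4: 'I-RNA', 5: 'B-cell_line', 6: 'I-cell_line', 7: 'B-cell_type',
            8: 'I-cell_type', 9: 'B-protein'}
print(entity_types(id2label))  # ['DNA', 'RNA', 'cell_line', 'cell_type', 'protein']
```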


In [38]:
for idx, row in pubmed_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    
    print(f"\n--- PMID: {pmid} ---")

    try:
        entities = ner_pipeline(text)
        for ent in entities:
            print(f" - {ent['word']} ({ent['entity_group']}): {ent['score']:.3f}")
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



--- PMID: 28600191 ---
 - lphnp (protein): 0.985

--- PMID: 19243676 ---

--- PMID: 10793294 ---
 - rabbit tracheal epithelial ( rte ) cells (cell_line): 0.996
 - airway epithelial cells (cell_type): 0.998
 - rte cells (cell_line): 0.979
 - catalase (protein): 0.994
 - cat (protein): 0.993
 - antioxidant enzyme (protein): 0.859
 - superoxide dismutase (protein): 0.996
 - sod (protein): 0.990
 - rte cells (cell_line): 0.992
 - c - myc mrna (RNA): 0.996
 - c - jun (DNA): 0.707
 - c - (DNA): 0.700
 - cytokeratin 13 (protein): 0.999
 - rte cells (cell_line): 0.976
 - c - myc (DNA): 0.998
 - cat (protein): 0.997

--- PMID: 24463521 ---

--- PMID: 25984610 ---

--- PMID: arxiv1  Self-supervised deep learning of gene-gene interactions for improved gene expression recovery   Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool to gain biological insights at the cellular level. However, due to technical limitations of the existing sequencing technologies, low gene expression v
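The truncation warning at the top of this output indicates that the tokenizer applies no length limit, while BERT-style models typically accept at most 512 tokens. One way to handle long abstracts is to split the text into overlapping windows, run NER on each window, and shift the character offsets back to the full text. The sketch below assumes windows of 200 whitespace-separated words with a 20-word overlap; these numbers, the helper names, and the `fake_ner` stand-in are illustrative assumptions, and a word count is only a rough proxy for the token count.

```python
import re


def chunk_text(text, max_words=200, overlap=20):
    """Yield (char_offset, chunk) windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    step = max_words - overlap
    for i in range(0, len(words), step):
        window = words[i:i + max_words]
        start, end = window[0][0], window[-1][1]
        yield start, text[start:end]
        if i + max_words >= len(words):
            break  # the last window already reached the end of the text


def ner_long_text(ner, text, max_words=200, overlap=20):
    """Run `ner` on each window and map entity offsets back to `text`."""
    seen, merged = set(), []
    for offset, chunk in chunk_text(text, max_words, overlap):
        for ent in ner(chunk):
            shifted = dict(ent, start=ent["start"] + offset, end=ent["end"] + offset)
            key = (shifted["start"], shifted["end"], shifted["entity_group"])
            if key not in seen:  # skip duplicates found in overlapping regions
                seen.add(key)
                merged.append(shifted)
    return merged


def fake_ner(chunk):
    """Stand-in for ner_pipeline: tags every occurrence of 'catalase'."""
    return [{"entity_group": "protein", "word": m.group(), "score": 1.0,
             "start": m.start(), "end": m.end()}
            for m in re.finditer(r"catalase", chunk)]


text = " ".join(["word"] * 250 + ["catalase"])
ents = ner_long_text(fake_ner, text)
print(text[ents[0]["start"]:ents[0]["end"]])  # catalase
```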

In [39]:
for idx, row in biorxiv_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    
    print(f"\n--- PMID: {pmid} ---")

    try:
        entities = ner_pipeline(text)
        print(entities)
        for ent in entities:
            print(f" - {ent['word']} ({ent['entity_group']}): {ent['score']:.3f}")
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")


--- PMID: 1 ---
[{'entity_group': 'protein', 'score': np.float32(0.57718456), 'word': '##racter', 'start': 1563, 'end': 1569}]
 - ##racter (protein): 0.577

--- PMID: 2 ---
[{'entity_group': 'DNA', 'score': np.float32(0.8006522), 'word': 'minion', 'start': 1630, 'end': 1636}, {'entity_group': 'DNA', 'score': np.float32(0.69167835), 'word': 'human dna', 'start': 1675, 'end': 1684}]
 - minion (DNA): 0.801
 - human dna (DNA): 0.692

--- PMID: 3 ---
[{'entity_group': 'DNA', 'score': np.float32(0.6796471), 'word': '##phid', 'start': 36, 'end': 40}, {'entity_group': 'DNA', 'score': np.float32(0.9549623), 'word': 'short read reference genome', 'start': 697, 'end': 724}, {'entity_group': 'DNA', 'score': np.float32(0.9568055), 'word': 'pea aphid reference genome', 'start': 751, 'end': 777}, {'entity_group': 'DNA', 'score': np.float32(0.994645), 'word': 'pea aphid genomes', 'start': 804, 'end': 821}, {'entity_group': 'DNA', 'score': np.float32(0.9460317), 'word': 'rnaseq', 'start': 965, 'end': 

### Model 3: fidukm34/biobert_v1.1_pubmed-finetuned-ner

- Built on top of *monologg/biobert_v1.1_pubmed*
- Trained on the NCBI Disease dataset


In [44]:
MODEL_NAME = "fidukm34/biobert_v1.1_pubmed-finetuned-ner" 
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)


ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

Device set to use cuda:0


In [41]:
print(model.config.id2label)

{0: 'O', 1: 'B-Disease', 2: 'I-Disease'}


In [47]:
for idx, row in pubmed_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    
    print(f"\n--- PMID: {pmid} ---")

    try:
        entities = ner_pipeline(text)
        print(entities)
        for ent in entities:
            print(f" - {ent['word']} ({ent['entity_group']}): {ent['score']:.3f}")
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")


--- PMID: 28600191 ---
[]

--- PMID: 19243676 ---
[]

--- PMID: 10793294 ---
[{'entity_group': 'Disease', 'score': np.float32(0.9983646), 'word': 'sq', 'start': 784, 'end': 786}, {'entity_group': 'Disease', 'score': np.float32(0.99785185), 'word': '##ua', 'start': 786, 'end': 788}, {'entity_group': 'Disease', 'score': np.float32(0.99221253), 'word': '##mous metaplasia', 'start': 788, 'end': 803}, {'entity_group': 'Disease', 'score': np.float32(0.99773467), 'word': 'sq', 'start': 1355, 'end': 1357}, {'entity_group': 'Disease', 'score': np.float32(0.9971437), 'word': '##ua', 'start': 1357, 'end': 1359}, {'entity_group': 'Disease', 'score': np.float32(0.9921528), 'word': '##mous metaplasia', 'start': 1359, 'end': 1374}]
 - sq (Disease): 0.998
 - ##ua (Disease): 0.998
 - ##mous metaplasia (Disease): 0.992
 - sq (Disease): 0.998
 - ##ua (Disease): 0.997
 - ##mous metaplasia (Disease): 0.992

--- PMID: 24463521 ---
[]

--- PMID: 25984610 ---
[{'entity_group': 'Disease', 'score': np.float32(
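In the output above, a single mention such as "squamous metaplasia" is split into word-piece fragments (`sq`, `##ua`, `##mous metaplasia`) whose character offsets touch. A possible post-processing step, sketched below, merges fragments of the same entity group when one starts exactly where the previous one ends. The helper name `merge_fragments` and the choice to keep the minimum score of the merged pieces are assumptions. The pipeline's word-level aggregation strategies (`first`, `average`, `max`) may be worth trying as an alternative.

```python
def merge_fragments(entities):
    """Merge same-group entities whose character spans are directly adjacent."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if (prev is not None
                and ent["entity_group"] == prev["entity_group"]
                and ent["start"] == prev["end"]):
            word = ent["word"]
            # Strip the word-piece continuation marker before joining.
            prev["word"] += word[2:] if word.startswith("##") else word
            prev["end"] = ent["end"]
            # Assumption: report the least confident piece as the span score.
            prev["score"] = min(prev["score"], float(ent["score"]))
        else:
            merged.append(dict(ent, score=float(ent["score"])))
    return merged


# Entities copied from the output above for PMID 10793294.
fragments = [
    {"entity_group": "Disease", "score": 0.9983646, "word": "sq", "start": 784, "end": 786},
    {"entity_group": "Disease", "score": 0.99785185, "word": "##ua", "start": 786, "end": 788},
    {"entity_group": "Disease", "score": 0.99221253, "word": "##mous metaplasia", "start": 788, "end": 803},
    {"entity_group": "Disease", "score": 0.99773467, "word": "sq", "start": 1355, "end": 1357},
    {"entity_group": "Disease", "score": 0.9971437, "word": "##ua", "start": 1357, "end": 1359},
    {"entity_group": "Disease", "score": 0.9921528, "word": "##mous metaplasia", "start": 1359, "end": 1374},
]
merged = merge_fragments(fragments)
print([ent["word"] for ent in merged])  # ['squamous metaplasia', 'squamous metaplasia']
```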

In [43]:
for idx, row in biorxiv_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    
    print(f"\n--- PMID: {pmid} ---")

    try:
        entities = ner_pipeline(text)
        print(entities)
        for ent in entities:
            print(f" - {ent['word']} ({ent['entity_group']}): {ent['score']:.3f}")
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")


--- PMID: 1 ---
[]

--- PMID: 2 ---
[{'entity_group': 'Disease', 'score': np.float32(0.9378797), 'word': 'or', 'start': 65, 'end': 67}, {'entity_group': 'Disease', 'score': np.float32(0.9494301), 'word': '##th', 'start': 67, 'end': 69}, {'entity_group': 'Disease', 'score': np.float32(0.9672202), 'word': '##op', 'start': 69, 'end': 71}, {'entity_group': 'Disease', 'score': np.float32(0.96502066), 'word': '##ae', 'start': 71, 'end': 73}, {'entity_group': 'Disease', 'score': np.float32(0.812373), 'word': '##dic device infection', 'start': 73, 'end': 93}, {'entity_group': 'Disease', 'score': np.float32(0.98867023), 'word': 'Pro', 'start': 95, 'end': 98}, {'entity_group': 'Disease', 'score': np.float32(0.9901896), 'word': '##st', 'start': 98, 'end': 100}, {'entity_group': 'Disease', 'score': np.float32(0.9863632), 'word': '##hetic joint infections', 'start': 100, 'end': 122}, {'entity_group': 'Disease', 'score': np.float32(0.9716655), 'word': 'pro', 'start': 624, 'end': 627}, {'entity_grou