## scispaCy NER Models Analysis

This notebook explores the following scispaCy models for named entity recognition (NER) in the biomedical domain, using text from two corpora.

Models Considered:
1. en_ner_craft_md
2. en_ner_jnlpba_md
3. en_ner_bc5cdr_md
4. en_ner_bionlp13cg_md

Document Sources:
1. PubMed
2. bioRxiv


In [45]:
# !pip install scispacy
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_craft_md-0.5.0.tar.gz
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_jnlpba_md-0.5.0.tar.gz
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bionlp13cg_md-0.5.0.tar.gz

import spacy
import pandas as pd


In [46]:
model_craft = spacy.load("en_ner_craft_md")
model_bc5cdr = spacy.load("en_ner_bc5cdr_md")
model_jnlpba = spacy.load("en_ner_jnlpba_md")
model_bionlp = spacy.load("en_ner_bionlp13cg_md")



In [47]:
print("CRAFT MODEL:", model_craft.component_names)
print("JNLPBA MODEL:", model_jnlpba.component_names)
print("BC5CDR MODEL:", model_bc5cdr.component_names)
print("BIONLP MODEL:", model_bionlp.component_names)

CRAFT MODEL: ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner']
JNLPBA MODEL: ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner']
BC5CDR MODEL: ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner']
BIONLP MODEL: ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner']


In [48]:
print("CRAFT LABELS:\n")
for label in model_craft.get_pipe("ner").labels:
    print(label)
print("\nJNLPBA LABELS:\n")
for label in model_jnlpba.get_pipe("ner").labels:
    print(label)
print("\nBC5CDR LABELS:\n")
for label in model_bc5cdr.get_pipe("ner").labels:
    print(label)
print("\nBIONLP LABELS:\n")
for label in model_bionlp.get_pipe("ner").labels:
    print(label)

CRAFT LABELS:

CHEBI
CL
GGP
GO
SO
TAXON

JNLPBA LABELS:

CELL_LINE
CELL_TYPE
DNA
PROTEIN
RNA

BC5CDR LABELS:

CHEMICAL
DISEASE

BIONLP LABELS:

AMINO_ACID
ANATOMICAL_SYSTEM
CANCER
CELL
CELLULAR_COMPONENT
DEVELOPING_ANATOMICAL_STRUCTURE
GENE_OR_GENE_PRODUCT
IMMATERIAL_ANATOMICAL_ENTITY
MULTI_TISSUE_STRUCTURE
ORGAN
ORGANISM
ORGANISM_SUBDIVISION
ORGANISM_SUBSTANCE
PATHOLOGICAL_FORMATION
SIMPLE_CHEMICAL
TISSUE
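
The label sets printed above can also be compared programmatically. The sketch below hardcodes the labels from that output instead of loading the models, then reports how many labels each model has and whether any label name appears in more than one model. The names `LABELS` and `shared_labels` are illustrative and not part of scispaCy.

```python
from collections import defaultdict

# Label sets copied from the printed output; no model needs to be loaded.
LABELS = {
    "en_ner_craft_md": ["CHEBI", "CL", "GGP", "GO", "SO", "TAXON"],
    "en_ner_jnlpba_md": ["CELL_LINE", "CELL_TYPE", "DNA", "PROTEIN", "RNA"],
    "en_ner_bc5cdr_md": ["CHEMICAL", "DISEASE"],
    "en_ner_bionlp13cg_md": [
        "AMINO_ACID", "ANATOMICAL_SYSTEM", "CANCER", "CELL",
        "CELLULAR_COMPONENT", "DEVELOPING_ANATOMICAL_STRUCTURE",
        "GENE_OR_GENE_PRODUCT", "IMMATERIAL_ANATOMICAL_ENTITY",
        "MULTI_TISSUE_STRUCTURE", "ORGAN", "ORGANISM",
        "ORGANISM_SUBDIVISION", "ORGANISM_SUBSTANCE",
        "PATHOLOGICAL_FORMATION", "SIMPLE_CHEMICAL", "TISSUE",
    ],
}


def shared_labels(label_sets):
    """Return {label: [models]} for label names used by more than one model."""
    owners = defaultdict(list)
    for model_name, labels in label_sets.items():
        for label in labels:
            owners[label].append(model_name)
    return {label: models for label, models in owners.items() if len(models) > 1}


label_counts = {name: len(labels) for name, labels in LABELS.items()}
overlap = shared_labels(LABELS)

print(label_counts)
print("Shared label names:", overlap)
```

No label name is shared between the four models, so comparing their predictions requires mapping between label schemes rather than matching label strings directly.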


In [49]:
pubmed_df = pd.read_csv("../data/pubmed_papers.tsv", sep="\t")

biorxiv_df = pd.read_csv("../data/biorxiv_papers.tsv", sep="\t")
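
Each document in this notebook is passed to the models as the title followed by the abstract. When a TSV cell is empty, pandas reads it as `NaN`, and an f-string would turn that into the literal text "nan". The sketch below shows one way to guard against that. `build_text` is a hypothetical helper, and it is an assumption that these files could contain empty cells at all.

```python
import math


def build_text(title, abstract):
    """Join title and abstract, skipping values that are None, NaN, or blank."""
    parts = []
    for value in (title, abstract):
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue  # pandas represents an empty cell as NaN
        value = str(value).strip()
        if value:
            parts.append(value)
    return " ".join(parts)


print(build_text("A title", float("nan")))
print(build_text("A title", "An abstract."))
```

Because the helper strips surrounding whitespace, character offsets could differ slightly from those produced by a plain f-string when a field has leading or trailing spaces.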

### Model: en_ner_craft_md

In [50]:
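for _, row in pubmed_df.iterrows():
    pmid = row["PMID"]
    # Title and abstract are joined into one string, so the character
    # offsets printed below are relative to this combined text.
    text = f"{row['Title']} {row['Abstract']}"
    print(f"\n--- PMID: {pmid} ---")

    try:
        doc = model_craft(text)
        # One line per entity: text, label, start offset, end offset.
        for ent in doc.ents:
            print(ent.text, ent.label_, ent.start_char, ent.end_char)
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")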


--- PMID: 28600191 ---

--- PMID: 19243676 ---
food CHEBI 478 482

--- PMID: 10793294 ---
rabbit TAXON 17 23
rabbit TAXON 102 108
cells CL 135 140
epithelial cells CL 225 241
oxygen species CHEBI 287 301
hydroxyl radical CHEBI 398 414
cells CL 420 425
lipid CHEBI 436 441
catalase GGP 500 508
enzyme SO 647 653
Superoxide dismutase GGP 666 686
SOD GGP 688 691
cells CL 739 744
c-myc mRNA GGP 900 910
cytokeratin 13 GGP 1070 1084
message SO 1107 1114
protein CHEBI 1119 1126
cells CL 1139 1144
c-myc GGP 1224 1229
lipid CHEBI 1290 1295

--- PMID: 24463521 ---
Drosophila TAXON 57 67
Drosophila melanogaster TAXON 467 490
flies TAXON 1335 1340

--- PMID: 25984610 ---
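
The same loop is repeated in this notebook for every model and corpus. A possible refactoring, sketched below, collects the entities into a list of dictionaries instead of printing them. `collect_entities` is a hypothetical helper. It only assumes that `nlp` is callable and returns an object whose `ents` expose `text`, `label_`, `start_char`, and `end_char`, as spaCy `Doc` objects do.

```python
def collect_entities(nlp, documents):
    """Run `nlp` over (doc_id, text) pairs and return one dict per entity."""
    rows = []
    for doc_id, text in documents:
        doc = nlp(text)
        for ent in doc.ents:
            rows.append({
                "doc_id": doc_id,
                "text": ent.text,
                "label": ent.label_,
                "start": ent.start_char,
                "end": ent.end_char,
            })
    return rows


# Hypothetical usage with the objects defined in this notebook:
# docs = ((row["PMID"], f"{row['Title']} {row['Abstract']}")
#         for _, row in pubmed_df.iterrows())
# craft_rows = collect_entities(model_craft, docs)
# craft_df = pd.DataFrame(craft_rows)
```

Keeping the results as rows makes it easier to filter, count, or join the predictions of different models than working from printed output.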


In [51]:
for idx, row in biorxiv_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    print(f"\n--- PMID: {pmid} ---")

    try:
        doc = model_craft(text)
        for ent in doc.ents:
            print(ent.text, ent.label_, ent.start_char, ent.end_char)
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")


--- PMID: 1 ---


Peptide SO 0 7
human TAXON 98 103
protein CHEBI 192 199
protein CHEBI 254 261
protein CHEBI 353 360
Protein CHEBI 375 382
bacterial species TAXON 450 467
human TAXON 500 505
microbiome TAXON 510 520
human TAXON 540 545
microbiome TAXON 653 663
microbes TAXON 694 702
peptide SO 769 776
human TAXON 863 868
microbiomes TAXON 873 884
drug CHEBI 906 910
peptides SO 950 958
protein CHEBI 973 980
peptide SO 1079 1086
peptides SO 1112 1120
peptide SO 1180 1187
peptide SO 1330 1337
peptide SO 1400 1407
proteins CHEBI 1467 1475
proteins CHEBI 1574 1582
peptide SO 1634 1641
human TAXON 1716 1721

--- PMID: 2 ---
gold CHEBI 278 282
sequence SO 561 569
bacterial TAXON 599 608
DNA SO 653 656
DNA SO 1086 1089
sequences SO 1303 1312
sequence SO 1637 1645
human DNA TAXON 1675 1684
genome SO 1711 1717
genome SO 1793 1799

--- PMID: 3 ---
gene SO 4 8
aphid genome assemblies TAXON 35 58
genes SO 89 94
gene SO 99 103
genome SO 131 137
gene SO 176 180
sequence SO 319 327
genome SO 339 345
gene SO 400 404
...
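
Long listings like the one above are easier to read as tallies. The sketch below counts labels and repeated mentions using the CRAFT entities printed for document 2. The data is copied from that output, and the variable names are illustrative.

```python
from collections import Counter

# (text, label) pairs copied from the CRAFT output for document 2.
entities = [
    ("gold", "CHEBI"), ("sequence", "SO"), ("bacterial", "TAXON"),
    ("DNA", "SO"), ("DNA", "SO"), ("sequences", "SO"), ("sequence", "SO"),
    ("human DNA", "TAXON"), ("genome", "SO"), ("genome", "SO"),
]

# How often each label occurs in the document.
label_counts = Counter(label for _, label in entities)

# How often each (lowercased mention, label) pair occurs.
mention_counts = Counter((text.lower(), label) for text, label in entities)

print(label_counts)
for (text, label), n in mention_counts.most_common(3):
    print(f"{text} ({label}): {n}")
```

In this document, seven of the ten CRAFT entities are `SO` mentions of generic terms such as "sequence", "DNA", and "genome".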

### Model: en_ner_bc5cdr_md


In [52]:
for idx, row in pubmed_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    print(f"\n--- PMID: {pmid} ---")

    try:
        doc = model_bc5cdr(text)
        for ent in doc.ents:
            print(ent.text, ent.label_, ent.start_char, ent.end_char)
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")


--- PMID: 28600191 ---


Lipid-polymer CHEMICAL 0 13

--- PMID: 19243676 ---

--- PMID: 10793294 ---
oxygen CHEMICAL 287 293
hydroxyl CHEMICAL 398 406
Superoxide CHEMICAL 666 676
squamous metaplasia DISEASE 784 803
squamous metaplasia DISEASE 1355 1374

--- PMID: 24463521 ---

--- PMID: 25984610 ---
alexithymia DISEASE 13 24
psychosomatic DISEASE 30 43
psoriasis DISEASE 54 63
Alexithymia DISEASE 66 77
psychosomatic illness DISEASE 246 267
alexithymia DISEASE 314 325
psoriasis DISEASE 384 393
anxiety DISEASE 483 490
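
Because every model reports character offsets into the same text, predictions from two models can be aligned by span overlap. The sketch below compares a subset of the CRAFT and BC5CDR entities printed for PMID 10793294. The tuples are copied from the outputs, and `overlaps` is a hypothetical helper.

```python
def overlaps(a, b):
    """True if two (text, label, start, end) spans share at least one character."""
    return a[2] < b[3] and b[2] < a[3]


# Subset of the printed entities for PMID 10793294.
craft = [
    ("oxygen species", "CHEBI", 287, 301),
    ("hydroxyl radical", "CHEBI", 398, 414),
    ("Superoxide dismutase", "GGP", 666, 686),
    ("cells", "CL", 739, 744),
]
bc5cdr = [
    ("oxygen", "CHEMICAL", 287, 293),
    ("hydroxyl", "CHEMICAL", 398, 406),
    ("Superoxide", "CHEMICAL", 666, 676),
    ("squamous metaplasia", "DISEASE", 784, 803),
]

# Every CRAFT/BC5CDR pair whose character ranges intersect.
pairs = [(a, b) for a in craft for b in bc5cdr if overlaps(a, b)]
for a, b in pairs:
    print(f"{a[0]!r} ({a[1]}) <-> {b[0]!r} ({b[1]})")
```

All three overlapping pairs differ in span boundaries, and "Superoxide dismutase" also differs in type: CRAFT labels the full enzyme name `GGP`, while BC5CDR labels only "Superoxide" as `CHEMICAL`.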


In [53]:
for idx, row in biorxiv_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    print(f"\n--- PMID: {pmid} ---")

    try:
        doc = model_bc5cdr(text)
        for ent in doc.ents:
            print(ent.text, ent.label_, ent.start_char, ent.end_char)
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")


--- PMID: 1 ---


TNPA CHEMICAL 1419 1423

--- PMID: 2 ---
infection DISEASE 84 93
Prosthetic joint infections DISEASE 95 122
infections DISEASE 641 651
infection DISEASE 1337 1346
infections DISEASE 1550 1560

--- PMID: 3 ---

--- PMID: 4 ---
MGMG CHEMICAL 0 4
bioactivity DISEASE 108 119
MGMG CHEMICAL 438 442
MGMG CHEMICAL 889 893
MGMG CHEMICAL 1058 1062
MGMG CHEMICAL 1482 1486

--- PMID: 5 ---
amino acid CHEMICAL 153 163


### Model: en_ner_jnlpba_md


In [54]:
for idx, row in pubmed_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    print(f"\n--- PMID: {pmid} ---")

    try:
        doc = model_jnlpba(text)
        for ent in doc.ents:
            print(ent.text, ent.label_, ent.start_char, ent.end_char)
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")


--- PMID: 28600191 ---


LPHNPs PROTEIN 735 741
LPHNP PROTEIN 836 841

--- PMID: 19243676 ---

--- PMID: 10793294 ---
rabbit tracheal epithelial ( CELL_TYPE 102 130
airway epithelial cells CELL_TYPE 218 241
RTE cells CELL_TYPE 416 425
catalase PROTEIN 500 508
CAT PROTEIN 510 513
Superoxide dismutase PROTEIN 666 686
SOD PROTEIN 688 691
RTE cells CELL_TYPE 735 744
c-myc mRNA RNA 900 910
c-jun DNA 917 922
c-fos DNA 927 932
cytokeratin 13 DNA 1070 1084
RTE cells CELL_TYPE 1135 1144
c-myc DNA 1224 1229
CAT PROTEIN 1265 1268

--- PMID: 24463521 ---

--- PMID: 25984610 ---
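
One JNLPBA span above, "rabbit tracheal epithelial (", ends with a dangling opening parenthesis. The sketch below shows one possible post-processing step that trims such characters and adjusts the offsets. `trim_span` is a hypothetical helper, not a scispaCy feature. It removes only opening brackets at the end and closing brackets at the start, so names such as "H(2)O(2)" are left intact.

```python
def trim_span(text, start):
    """Drop dangling brackets and whitespace from a span; return (text, start, end)."""
    # Closing brackets at the start of a span are treated as dangling.
    no_lead = text.lstrip(" \t\n)]}")
    new_start = start + (len(text) - len(no_lead))
    # Opening brackets at the end of a span are treated as dangling.
    cleaned = no_lead.rstrip(" \t\n([{")
    return cleaned, new_start, new_start + len(cleaned)


print(trim_span("rabbit tracheal epithelial (", 102))
print(trim_span("H(2)O(2)", 56))
```

For the first span the trimmed offsets are 102 to 128, which makes the boundaries directly comparable with spans from the other models.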


In [55]:
for idx, row in biorxiv_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    print(f"\n--- PMID: {pmid} ---")

    try:
        doc = model_jnlpba(text)
        for ent in doc.ents:
            print(ent.text, ent.label_, ent.start_char, ent.end_char)
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")


--- PMID: 1 ---

--- PMID: 2 ---
Oxford Nanopore Technologies MinION DNA 816 851
MinION sequence DNA 1630 1645
human DNA DNA 1675 1684

--- PMID: 3 ---
pea aphid genome DNA 31 47
mis-assemble duplicated genes DNA 469 498
promoters DNA 1333 1342

--- PMID: 4 ---
MGMG-generated molecules PROTEIN 1281 1305

--- PMID: 5 ---
co-translational protein PROTEIN 547 571
extreme codon DNA 1364 1377


### Model: en_ner_bionlp13cg_md


In [56]:
for idx, row in pubmed_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    print(f"\n--- PMID: {pmid} ---")

    try:
        doc = model_bionlp(text)
        for ent in doc.ents:
            print(ent.text, ent.label_, ent.start_char, ent.end_char)
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")


--- PMID: 28600191 ---
LPHNP SIMPLE_CHEMICAL 836 841

--- PMID: 19243676 ---

--- PMID: 10793294 ---
rabbit tracheal epithelium ORGANISM 17 43
H(2)O(2)-induced SIMPLE_CHEMICAL 56 72
rabbit tracheal epithelial ORGANISM 102 128
airway epithelial cells CELL 218 241
reactive oxygen species SIMPLE_CHEMICAL 278 301
ROS SIMPLE_CHEMICAL 303 306
hydroxyl SIMPLE_CHEMICAL 398 406
RTE cells CELL 416 425
lipid IMMATERIAL_ANATOMICAL_ENTITY 436 441
catalase GENE_OR_GENE_PRODUCT 500 508
CAT GENE_OR_GENE_PRODUCT 510 513
3-day-old cultures CELL 562 580
7-day-old cultures CELL 598 616
Superoxide dismutase GENE_OR_GENE_PRODUCT 666 686
SOD SIMPLE_CHEMICAL 688 691
RTE cells CELL 735 744
squamous metaplasia CANCER 784 803
c-myc GENE_OR_GENE_PRODUCT 900 905
c-jun GENE_OR_GENE_PRODUCT 917 922
c-fos GENE_OR_GENE_PRODUCT 927 932
squamous CANCER 1012 1020
cytokeratin GENE_OR_GENE_PRODUCT 1070 1081
RTE cells CELL 1135 1144
c-myc GENE_OR_GENE_PRODUCT 1224 1229
CAT CANCER 1265 1268
lipid SIMPLE_CHEMICAL 1290 1295
...
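
Where several models tag exactly the same span, their labels can be placed side by side. The sketch below groups a few entities printed for PMID 10793294 by `(start, end, text)`. The tuples are copied from the CRAFT, JNLPBA, and BioNLP13CG outputs in this notebook, and the short model keys are illustrative.

```python
from collections import defaultdict

# (text, label, start, end) copied from the outputs for PMID 10793294.
predictions = {
    "craft": [
        ("catalase", "GGP", 500, 508), ("SOD", "GGP", 688, 691),
        ("c-myc", "GGP", 1224, 1229), ("lipid", "CHEBI", 1290, 1295),
    ],
    "jnlpba": [
        ("catalase", "PROTEIN", 500, 508), ("SOD", "PROTEIN", 688, 691),
        ("c-myc", "DNA", 1224, 1229), ("CAT", "PROTEIN", 1265, 1268),
    ],
    "bionlp13cg": [
        ("catalase", "GENE_OR_GENE_PRODUCT", 500, 508),
        ("SOD", "SIMPLE_CHEMICAL", 688, 691),
        ("c-myc", "GENE_OR_GENE_PRODUCT", 1224, 1229),
        ("CAT", "CANCER", 1265, 1268),
        ("lipid", "SIMPLE_CHEMICAL", 1290, 1295),
    ],
}

# Map each exact span to the label every model assigned to it.
by_span = defaultdict(dict)
for model_name, ents in predictions.items():
    for text, label, start, end in ents:
        by_span[(start, end, text)][model_name] = label

for (start, end, text), labels in sorted(by_span.items()):
    print(f"{text} [{start}:{end}] -> {labels}")
```

In this sample, "SOD" is a gene or protein for CRAFT and JNLPBA but a `SIMPLE_CHEMICAL` for BioNLP13CG, and "CAT" is a `PROTEIN` for JNLPBA but `CANCER` for BioNLP13CG.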

In [57]:
for idx, row in biorxiv_df.iterrows():
    pmid = row["PMID"]
    text = f"{row['Title']} {row['Abstract']}"
    print(f"\n--- PMID: {pmid} ---")

    try:
        doc = model_bionlp(text)
        for ent in doc.ents:
            print(ent.text, ent.label_, ent.start_char, ent.end_char)
    except Exception as e:
        print(f"Error processing PMID {pmid}: {e}")


--- PMID: 1 ---


human gut ORGANISM 98 107
human gut ORGANISM 500 509
human ORGANISM 540 545
gut ORGANISM_SUBDIVISION 690 693
human gut ORGANISM 863 872
TNPA SIMPLE_CHEMICAL 1419 1423
human gut ORGANISM 1716 1725

--- PMID: 2 ---
joint MULTI_TISSUE_STRUCTURE 106 111
joint MULTI_TISSUE_STRUCTURE 635 640
DNA CELLULAR_COMPONENT 653 656
fluids ORGANISM_SUBSTANCE 691 697
DNA CELLULAR_COMPONENT 1086 1089
joint MULTI_TISSUE_STRUCTURE 1544 1549
human DNA ORGANISM 1675 1684
extracts ORGANISM_SUBSTANCE 1688 1696

--- PMID: 3 ---
pea aphid genome GENE_OR_GENE_PRODUCT 31 47
pea aphid TISSUE 751 760
pea aphid TISSUE 804 813
pea aphids TISSUE 889 899

--- PMID: 4 ---
MGMG SIMPLE_CHEMICAL 0 4
Cell CELL 6 10
MGMG SIMPLE_CHEMICAL 438 442
cellular CELL 507 515
Cell CELL 648 652
MGMG SIMPLE_CHEMICAL 889 893
MGMG SIMPLE_CHEMICAL 1058 1062
MGMG SIMPLE_CHEMICAL 1482 1486

--- PMID: 5 ---
amino acid AMINO_ACID 153 163
layer MULTI_TISSUE_STRUCTURE 453 458
co-translational CELLULAR_COMPONENT 547 563
humans ORGANISM 746 752
hum