# Scientific named entity recognition using pretrained models for knowledge extraction and interpretation

In [1]:
import spacy  # spaCy: a production-oriented NLP library for processing large volumes of text. It can be used to
              # build information extraction or natural language understanding systems, or to pre-process text for deep learning.
import scispacy  # scispaCy: spaCy pipelines and models for scientific/biomedical documents
from spacy import displacy
from scispacy.linking import EntityLinker  # The EntityLinker is a spaCy component that links named entities to a knowledge base.
                                           # It performs a string-overlap search (character 3-grams) on the named entities,
                                           # comparing them with the concepts in the knowledge base via approximate nearest-neighbour search.
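
The character 3-gram matching described in the comments above can be illustrated in plain Python. This is only a minimal sketch of the idea, not scispaCy's implementation: it uses raw 3-gram counts and an exhaustive comparison, whereas the library relies on an approximate nearest-neighbour index. The function names (`char_ngrams`, `cosine_similarity`, `rank_concepts`) and the toy concept list are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_concepts(mention: str, concept_names: list[str]) -> list[tuple[str, float]]:
    """Score every concept name against the mention, best match first (exhaustive search)."""
    mention_vec = char_ngrams(mention)
    scored = [(name, cosine_similarity(mention_vec, char_ngrams(name))) for name in concept_names]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy concept names (hypothetical); a real knowledge base holds far more entries.
ranked = rank_concepts("doxorubicin", ["doxorubicin transport", "response to doxorubicin", "myoglobin"])
for name, score in ranked:
    print(f"{score:.3f}  {name}")
```

Concept names that contain the mention share most of its 3-grams and therefore score high, while an unrelated name such as "myoglobin" shares almost none.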

In [2]:
text = "Cancer treatment with doxorubicin (DOX) can induce cumulative dose-dependent cardiotoxicity. \
           Currently, there are no specific biomarkers that can identify patients at risk during the initial \
           doses of chemotherapy. The aim of this study was to examine plasma cytokines/chemokines and potential \
           cardiovascular biomarkers for the prediction of DOX-induced cardiotoxicity. Plasma samples were collected before (T0), \
           and after the first (T1) and the second (T2) cycles of DOX-based chemotherapy of 27 breast cancer patients, \
           including five patients who presented with >10% decline of left ventricular ejection fraction (LVEF), \
           five patients with LVEF decline of 5-10%, and 17 patients who maintained normal LVEF at the end of chemotherapy \
           (240 mg/m2 cumulative dose of DOX from four cycles of treatment). Multiplex immunoassays were used to screen plasma \
           samples for 40 distinct chemokines, nine matrix metalloproteinases, 33 potential markers of cardiovascular diseases, \
           and the fourth-generation cardiac troponin T assay. The results showed that the patients with abnormal decline of LVEF (>10%) \
           had lower levels of CXCL6 and sICAM-1 and higher levels of CCL23 and CCL27 at T0; higher levels of CCL23 and lower levels of \
           CXCL5, CCL26, CXCL6, GM-CSF, CXCL1, IFN-γ, IL-2, IL-8, CXCL11, CXCL9, CCL17, and CCL25 at T1; and higher levels of MIF and CCL23 \
           at T2 than the patients who maintained normal LVEF. Patients with LVEF decline of 5-10% had lower plasma levels of CXCL1, CCL3, GDF-15, \
           and haptoglobin at T0; lower levels of IL-16, FABP3, and myoglobin at T1; and lower levels of myoglobin and CCL23 at T2 as compared to \
           the patients who maintained normal LVEF. This pilot study identified potential biomarkers that may help predict which patients are vulnerable to \
           DOX-induced cardiotoxicity although further validation is needed in a larger cohort of patients. Impact statement Drug-induced cardiotoxicity is\
           one of the major concerns in drug development and clinical practice. It is critical to detect potential cardiotoxicity early before onset of \
           symptomatic cardiac dysfunction or heart failure. Currently there are no qualified clinical biomarkers for the prediction of cardiotoxicity \
           caused by cancer treatment such as doxorubicin (DOX). By using multiplex immunoassays, we identified proteins with significantly changed \
           plasma levels in a group of breast cancer patients who were treated with DOX-based chemotherapy and produced cardiotoxicity. These proteins \
           were associated with immune response and were identified before DOX treatment or at early doses of treatment, thus they could be potential \
           predictive biomarkers of cardiotoxicity although further validation is required to warrant their clinical values."

In [3]:
nlp = spacy.load("en_core_sci_lg") # using pretrained model

In [4]:
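# linker_name selects the knowledge base. Note: the candidates printed under "Linked to GO" below are
# Gene Ontology concepts, which suggests that run used linker_name="go" rather than "umls".
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})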



<scispacy.linking.EntityLinker at 0x2ee80244850>

In [5]:
doc = nlp(text)

  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]


In [6]:
cra = spacy.load("en_ner_craft_md")  # F1 76.11; labels: GGP, SO, TAXON, CHEBI, GO, CL (cell types, chemicals, proteins, genes)
jnl = spacy.load("en_ner_jnlpba_md")  # F1 71.62; labels: DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN (cell lines, cell types, DNAs, RNAs, proteins)
bc5 = spacy.load("en_ner_bc5cdr_md")  # F1 84.49; labels: DISEASE, CHEMICAL (chemicals and diseases)
bio = spacy.load("en_ner_bionlp13cg_md")  # F1 77.75; labels: AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT,
                                          # DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY,
                                          # MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE,
                                          # PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE (cancer genetics)

In [7]:
cra.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})
jnl.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})
bc5.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})
bio.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

MemoryError: 
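
The `MemoryError` above presumably arises because every added linker brings its own knowledge base and index into memory on top of the models already loaded. The rendering cells that follow only read `doc.ents`, so the linker is not strictly needed for them. One possible workaround, sketched below under that assumption, is to run the NER models one at a time and keep only the recognized spans. The helper `entities_per_model` and its `load` parameter are hypothetical names introduced for this example; `load` defaults to `spacy.load`.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def entities_per_model(
    model_names: Iterable[str],
    text: str,
    load: Optional[Callable[[str], Callable]] = None,
) -> Dict[str, List[Tuple[str, str]]]:
    """Run each NER model in turn and keep only (entity text, label) pairs."""
    if load is None:
        import spacy  # imported lazily so the helper can be tried with a stand-in loader
        load = spacy.load

    results = {}
    for name in model_names:
        model = load(name)  # only one NER model is referenced at a time
        doc = model(text)
        results[name] = [(ent.text, ent.label_) for ent in doc.ents]
        del model, doc  # drop references before loading the next model
    return results


# Example call (assumes the four scispaCy NER models are installed):
# ents = entities_per_model(
#     ["en_ner_craft_md", "en_ner_jnlpba_md", "en_ner_bc5cdr_md", "en_ner_bionlp13cg_md"], text
# )
```

This trades the linked `Doc` objects for a much smaller dictionary of spans, so it fits when only the entity labels are of interest.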

In [None]:
cra_doc = cra(text)
jnl_doc = jnl(text)
bc5_doc = bc5(text)
bio_doc = bio(text)

In [None]:
displacy.render(cra_doc, style='ent')

In [31]:
displacy.render(jnl_doc, style='ent')

In [32]:
displacy.render(bc5_doc, style='ent')

In [33]:
displacy.render(bio_doc, style='ent')

## Example: Li-Rong Yu et al., "Immune response proteins as predictive biomarkers of doxorubicin-induced cardiotoxicity in breast cancer patients" (abstract only)

## Display recognized scientific entities

In [21]:
displacy.render(doc, style='ent')

In [12]:
entity = doc.ents[2]
entity

doxorubicin

## Linked to GO

In [15]:
# Show the GO concepts that "doxorubicin" is linked to, each with its similarity score (char-3gram matching)
entity._.kb_ents

[('C3547938', 0.8894673585891724),
 ('C3822556', 0.879368245601654),
 ('C3549011', 0.8477047681808472)]
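
Each entry in the list above is a `(CUI, score)` pair, so candidates can be narrowed down with ordinary list handling. The following is a small sketch that keeps only candidates above a cutoff; the helper `filter_candidates` and its `min_score` and `top_k` parameters are hypothetical, and the 0.85 cutoff is an arbitrary value chosen for illustration.

```python
# (CUI, score) pairs copied from the output above.
kb_ents = [
    ("C3547938", 0.8894673585891724),
    ("C3822556", 0.879368245601654),
    ("C3549011", 0.8477047681808472),
]


def filter_candidates(candidates, min_score=0.85, top_k=None):
    """Keep candidates scoring at least min_score, best first, optionally only the top_k."""
    kept = sorted(
        (cand for cand in candidates if cand[1] >= min_score),
        key=lambda cand: cand[1],
        reverse=True,
    )
    return kept if top_k is None else kept[:top_k]


confident = filter_candidates(kb_ents, min_score=0.85)       # drops the 0.8477 candidate
best = filter_candidates(kb_ents, min_score=0.0, top_k=1)    # single highest-scoring candidate
print(confident)
print(best)
```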

In [16]:
linker = nlp.get_pipe("scispacy_linker")
for go_ent in entity._.kb_ents:
    print(linker.kb.cui_to_entity[go_ent[0]])

CUI: C3547938, Name: doxorubicin transport
Definition: The directed movement of a doxorubicin into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. [GOC:TermGenie, PMID:12057006, PMID:15090538, PMID:19063901, PMID:19651502, PMID:9651400]
TUI(s): T043
Aliases: (total: 2): 
	 (1S,3S)-3,5,12-trihydroxy-3-(hydroxyacetyl)-10-methoxy-6,11-dioxo-1,2,3,4,6,11-hexahydrotetracen-1-yl 3-amino-2,3,6-trideoxy-alpha-L-lyxo-hexopyranoside transport, doxorubicine transport
CUI: C3822556, Name: response to doxorubicin
Definition: Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a doxorubicin stimulus. [GOC:dw, GOC:TermGenie, PMID:23648065]
TUI(s): T043
Aliases: (total: 0): 
	 
CUI: C3549011, Name: doxorubicin metabolic process
Definition: The chemical reactions and pathways involving doxorubicin, an anthracycline antibiotic, used i