## Pre-example: general domain text

In [1]:
import spacy
from spacy import displacy

from collections import defaultdict
import numpy as np 
import pandas as pd

In [4]:
obama_text = 'Barack Hussein Obama II is an American politician and attorney who served as the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, Obama was the first African American president of the United States. He previously served as a U.S. senator from Illinois from 2005 to 2008 and an Illinois state senator from 1997 to 2004. '

In [3]:
nlp = spacy.load("en_core_web_md")

In [5]:
doc_obama = nlp(obama_text)

In [6]:
displacy.render(doc_obama, style='ent')

# NER with pre-trained models
In this notebook: 
1. NER with the spaCy medium English model
2. NER with the scispaCy medium English science model
3. Alignment of extracted entities 

## 1. spaCy

### Read in abstracts as a string: 

In [8]:
with open('/home/serenalotreck/projects/knowledge-graph/data/abstract-jasmonicac-set.txt') as f:
    text = " ".join([x.strip() for x in f])

print(text[:1000])

1. Plant J. 2017 Aug;91(3):491-504. doi: 10.1111/tpj.13585. Epub 2017 Jun 9.  Regulation of the turnover of ACC synthases by phytohormones and heterodimerization in Arabidopsis.  Lee HY(1), Chen YC(1), Kieber JJ(2), Yoon GM(1).  Author information: (1)Department of Botany and Plant Pathology, Purdue University, West Lafayette, IN, 47907, USA. (2)Department of Biology, University of North Carolina, Chapel Hill, NC, 27599, USA.  Ethylene influences many aspects of plant growth and development. The biosynthesis of ethylene is highly regulated by a variety of internal and external cues. A key target of this regulation is 1-aminocyclopropane-1-carboxylic acid (ACC) synthases (ACS), generally the rate-limiting step in ethylene biosynthesis, which is regulated both transcriptionally and post-transcriptionally. Prior studies have demonstrated that cytokinin and brassinosteroid (BR) act as regulatory inputs to elevate ethylene biosynthesis by increasing the stability of ACS proteins. Here, we d

### Load the spaCy model:

In [9]:
nlp = spacy.load("en_core_web_md")

### Create doc object:

In [10]:
doc = nlp(text)

### Check out the extracted entities:

Get a look at the entities using displaCy:

In [11]:
displacy.render(doc, style='ent')

What types were identified, and how many of each?

In [12]:
types = defaultdict(int)
for ent in doc.ents:
    types[ent.label_] += 1
    
print(types)

defaultdict(<class 'int'>, {'CARDINAL': 164, 'ORG': 304, 'PERSON': 154, 'GPE': 123, 'DATE': 56, 'WORK_OF_ART': 7, 'NORP': 3, 'FAC': 10, 'TIME': 1, 'LOC': 3, 'QUANTITY': 8, 'PRODUCT': 20, 'EVENT': 3, 'LAW': 3, 'MONEY': 2, 'PERCENT': 2, 'ORDINAL': 2, 'LANGUAGE': 1})
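
The same tally can also be written with `collections.Counter`, which can list labels by frequency. This is a minimal sketch: `count_labels` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for spaCy `Span` objects so the block runs without loading a model. In the notebook you would pass `doc.ents` instead.

```python
from collections import Counter
from types import SimpleNamespace


def count_labels(ents):
    """Count how many entity mentions carry each label."""
    return Counter(ent.label_ for ent in ents)


# Stand-ins for spaCy Span objects (only .text and .label_ are used here).
toy_ents = [
    SimpleNamespace(text="Purdue University", label_="ORG"),
    SimpleNamespace(text="West Lafayette", label_="GPE"),
    SimpleNamespace(text="University of North Carolina", label_="ORG"),
]

label_counts = count_labels(toy_ents)
# most_common() lists labels from most to least frequent.
print(label_counts.most_common())
```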


Some of these types, such as `WORK_OF_ART`, make no sense for our application. What spans were identified as this type? What spans were identified under each type?

In [13]:
type_value = defaultdict(set)
for ent in doc.ents:
    type_value[ent.label_].add(ent.text)
    
print(type_value['WORK_OF_ART'])

{'Safitri FA(1', 'Arabidopsis', 'MeJA', 'The Authors', 'Rodríguez D', 'PMC6501071'}


In [14]:
print(type_value['FAC']) # FAC is facility 

{'Plant', 'Conti G(3)(4', 'jasmonate', 'Gymnasium Place', 'López-Climent M', 'CpMYC2', 'Kitayama M', 'LC-Orbitrap-MS', 'the COI1-JAZ-DELLA-PIF', 'Gómez-Cadenas A'}


In [15]:
print(type_value['LANGUAGE'])

{'PMC3358897'}


In [16]:
print(type_value['LAW'])

{'Aug;4(8):750-1', 'AtMYC2-2', 'TMV-Cg'}


What about entities that might make sense, like `PERSON`?

In [17]:
print(type_value['PERSON'], f'\nNumber of PERSON entities: {len(type_value["PERSON"])}')

{'Wang JR(5)(3', 'Bax', 'Dong L(1', 'Thomashow MF', 'Li N(1)(2', 'Centro Hispano-Luso de Investigaciones Agrarias', 'Deng M(3', 'Mun BG(1', 'Seebold K', 'Hoehenwarter W', 'Wei ZZ(3', 'Weckwerth W.  ', 'Cutler AJ', 'Chen L(2', 'Abid K(1', 'PMC2504347', 'Hou', 'Jiménez JA', 'Abid', 'Arabidopsis', 'Lee SK(2', 'Kasr El-Aini Street', 'Luo D(1', 'Jiang YF(3', 'Zong LJ(3', 'gibberellic acid', 'Lantz AT(2', 'Romeis T(3', 'Wang Y(3', 'Li Q,', 'Chen YC(1', 'Lu', 'Li', 'Yang N(2', 'Kachroo P.  ', 'Saeed UH(1', 'Li J', 'Zhang W(1', 'Pellny TK', 'Childs KL(5', 'Wu J(3', 'brassinosteroids', 'P. sojae', 'Chung IK(4', 'Wang', 'Isoprene', 'Xu BJ(3', 'Hedden P, Driscoll S', 'PMC7076656', 'Xu P(1', 'Mei CS', 'Brachypodium distachyon', 'Mizuno T.  ', 'Reyes D', 'Zhang S(1', 'Guo ZR(3', 'Rodriguez', 'GLABRA1', 'Biosci Biotechnol Biochem.', 'Hurlingham', 'Yu K', 'Hause B(2', 'Giavalisco', 'Zavallo', 'Nicotiana tabacum', 'Hou X(1', 'Wang X(1', 'Facultad de Biología', 'Chen R(1', 'Yang Y', 'Lee CM', 'Nicotian

spaCy seems to get confused by the notation used to indicate author affiliations, but otherwise picks out people fairly well. Notice, however, that it labels Nicotiana benthamiana, brassinosteroids, monoterpenes, isoprene, and JA as people!
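
Since many of the odd `PERSON` spans carry affiliation markers such as `(1` or `(5)(3`, one option is to flag them before reviewing the list. The sketch below assumes that an opening parenthesis followed by a digit is a reliable sign of this notation; `has_affiliation_marker` is a hypothetical helper, not part of spaCy.

```python
import re

# A "(" immediately followed by a digit, as in "Dong L(1" or "Wang JR(5)(3".
AFFILIATION_MARKER = re.compile(r"\(\d")


def has_affiliation_marker(span_text):
    """Return True if the span text contains an author-affiliation marker."""
    return AFFILIATION_MARKER.search(span_text) is not None


# A few span texts copied from the PERSON output above.
person_spans = ["Wang JR(5)(3", "Thomashow MF", "Dong L(1", "gibberellic acid"]
flagged = [s for s in person_spans if has_affiliation_marker(s)]
print(flagged)
```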

In [18]:
print(type_value['GPE'], f'\nNumber of GPE entities: {len(type_value["GPE"])}')

{'Chapel Hill', 'Yangling', 'Halle', 'Spain', 'Canada', 'Daegu', 'Salem', 'Sun TP', 'pnas.1201616109', 'Wuhan', 'Egypt', 'Buenos', 'Argentina', 'China', '0C6', 'lignan', 'Germany', 'Mardan', 'New Phytol', 'Lexington', 'Saskatoon', 'Navarre D', 'NH 03755', 'Yoon GM(1', 'Pakistan', 'West Lafayette', 'Sichuan', 'Ottawa', 'Pisa', 'Hancock RD', 'Korea', 'UK', 'plants9060785', 'NC', 'USA', 'Milwaukee WI', 'Cairo', 'Kentucky', 'gl1 plants', 'Cheng Q(1', 'East Lansing', 'Kunming 650224', 'Aires', 'Kong L(3', 'Salamanca', 'Chengdu', 'Basel', 'Italy', 'Berlin', 'Cellular', 'Shanghai 200032', 'Hanover', '0W9'} 
Number of GPE entities: 53


In [19]:
# The numbers here differ from the earlier count dictionary because that one counted
# every mention, whereas these sets keep only the unique span texts for each type
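
To see how a per-mention count and a set of unique span texts can disagree, here is a minimal sketch with made-up mentions (the tuples are illustrative, not taken from the abstracts):

```python
from collections import defaultdict

# (span text, label) pairs; "USA" is mentioned twice on purpose.
mentions = [("USA", "GPE"), ("China", "GPE"), ("USA", "GPE")]

mention_counts = defaultdict(int)  # one count per mention
unique_spans = defaultdict(set)    # one entry per distinct span text
for text, label in mentions:
    mention_counts[label] += 1
    unique_spans[label].add(text)

print(mention_counts["GPE"], len(unique_spans["GPE"]))
```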

## 2. scispaCy

### Load the model:

In [20]:
nlp_sci = spacy.load("en_core_sci_md")

### Create doc object:

In [21]:
doc_sci = nlp_sci(text)

### Check out the extracted entities:

In [23]:
displacy.render(doc_sci, style='ent')

What types does this model identify?

In [24]:
sci_types = defaultdict(set)
for ent in doc_sci.ents:
    sci_types[ent.label_].add(ent.text)
    
print(sci_types.keys())
    

dict_keys(['ENTITY'])


In [25]:
print(len(sci_types['ENTITY']))

1196


#### The core model does not have entity types! 
It only marks entity spans, all labeled `ENTITY`. According to the example in [this blog post](https://towardsdatascience.com/using-scispacy-for-named-entity-recognition-785389e7918d), the core model identifies more entities than the specialized NER models, which only identify the biological entities that they have types for. Let's check it out!<br>

Types included in the 4 NER models:
1. en_ner_craft_md: types are \[GGP, SO, TAXON, CHEBI, GO, CL\]. 
2. en_ner_jnlpba_md: types are \[DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN\]
3. en_ner_bc5cdr_md: types are \[DISEASE, CHEMICAL\]
4. en_ner_bionlp13cg_md: types are \[AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE\]

In this notebook I'm going to play around with the first two models. I predict that model 2, en_ner_jnlpba_md, will perform best on this data, because only one of its types isn't relevant to plant biology.

### 1. en_ner_craft_md

In [26]:
nlp_craft = spacy.load("en_ner_craft_md")

In [27]:
doc_craft = nlp_craft(text)

In [28]:
displacy.render(doc_craft, style='ent')

### 2. en_ner_jnlpba_md

In [29]:
nlp_jnlp = spacy.load("en_ner_jnlpba_md")

In [30]:
doc_jnlp = nlp_jnlp(text)

In [31]:
displacy.render(doc_jnlp, style='ent')

After manually inspecting the NER outputs of these three scispaCy models, it appears that the full pipeline model, with only type ENTITY, actually has the highest recall on the entities we want to extract from this text. However, this is a qualitative observation. In order to get a quantitative evaluation, we need to complete our Gold Standard by deciding what entities we do want to extract.

## 3. Visualization of model comparison

While skimming the displaCy renderings gives an idea of which models extracted more entities, let's look in more depth at the differences between models. First, we'll build a table using spaCy's IOB scheme: a dataframe where each row is one token from the text, and the remaining columns give the IOB value and entity type assigned to that token by each model. <br>

If a token receives an I, it is inside an entity span. If it receives a B, it is the beginning of an entity span, and if it receives an O, it is outside any entity span (not an entity).
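
To make the scheme concrete, the sketch below rebuilds entity spans from per-token tags. `iob_to_spans` is a hypothetical helper and the rows are toy data. Joining tokens with a single space is a simplification that ignores the original whitespace.

```python
def iob_to_spans(rows):
    """Group (text, iob, ent_type) token rows into (span_text, ent_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for text, iob, ent_type in rows:
        if iob == "B":
            # A new entity starts; close any span that was still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [text], ent_type
        elif iob == "I" and current_tokens:
            current_tokens.append(text)
        else:
            # "O" (or a stray "I" with no open span) ends the current span.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


rows = [
    ("of", "O", ""),
    ("ACC", "B", "PROTEIN"),
    ("synthases", "I", "PROTEIN"),
    ("by", "O", ""),
    ("phytohormones", "B", "ENTITY"),
]
print(iob_to_spans(rows))
```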

First, let's get the tokens and their annotations from each doc object.

In [35]:
doc_iob = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
doc_sci_iob = [(t.text, t.ent_iob_, t.ent_type_) for t in doc_sci]
doc_craft_iob = [(t.text, t.ent_iob_, t.ent_type_) for t in doc_craft]
doc_jnlp_iob = [(t.text, t.ent_iob_, t.ent_type_) for t in doc_jnlp]

In [37]:
print(len(doc_iob), len(doc_sci_iob), len(doc_craft_iob), len(doc_jnlp_iob))

8389 7960 7960 7960


It looks like spaCy and scispaCy split tokens differently. This is most likely because scispaCy uses a custom tokenizer, but I'll have to check.

Since they split tokens differently, it doesn't necessarily make sense to compare the spaCy and scispaCy model entities, so I'll just compare the entities extracted from the scispaCy models.
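
If the spaCy and scispaCy entities do need to be compared later, one option that does not depend on tokenization is to compare character offsets, since each spaCy `Span` exposes `start_char` and `end_char` into the original string. The sketch below uses hand-written `(start_char, end_char)` tuples in place of real model output, and `overlapping_pairs` is a hypothetical helper.

```python
def overlapping_pairs(spans_a, spans_b):
    """Return pairs of (start, end) character spans that overlap."""
    pairs = []
    for a_start, a_end in spans_a:
        for b_start, b_end in spans_b:
            # Half-open intervals overlap when each starts before the other ends.
            if a_start < b_end and b_start < a_end:
                pairs.append(((a_start, a_end), (b_start, b_end)))
    return pairs


# In the notebook these could be built as
# [(ent.start_char, ent.end_char) for ent in doc.ents].
spans_spacy = [(0, 5), (20, 31)]
spans_sci = [(0, 5), (24, 40), (50, 60)]

exact = set(spans_spacy) & set(spans_sci)          # identical boundaries
partial = overlapping_pairs(spans_spacy, spans_sci)  # any overlap at all
print(exact, partial)
```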

Build a dataframe for each model from its tuples of token text, IOB value, and type:

In [39]:
sci_iob_df = pd.DataFrame(doc_sci_iob, columns=['text', 'sci_ent_iob', 'sci_ent_type'])
craft_iob_df = pd.DataFrame(doc_craft_iob, columns=['text', 'craft_ent_iob', 'craft_ent_type'])
jnlp_iob_df = pd.DataFrame(doc_jnlp_iob, columns=['text', 'jnlp_ent_iob', 'jnlp_ent_type'])

Check that the tokens are indeed identical:

In [53]:
sci_iob_df['text'].equals(craft_iob_df['text'])

True

In [54]:
craft_iob_df['text'].equals(jnlp_iob_df['text'])

True

Add on the columns for IOB value and type from each model. Using pd.merge on the 'text' column produces many duplicate rows, because token text is not a unique key. Since I've confirmed that all three text columns are equivalent, I'm going to tack on the relevant columns by position instead.

In [57]:
all_iob_df = sci_iob_df.copy()  # copy so that sci_iob_df is not modified in place
all_iob_df['craft_ent_iob'] = craft_iob_df['craft_ent_iob']
all_iob_df['craft_ent_type'] = craft_iob_df['craft_ent_type']
all_iob_df['jnlp_ent_iob'] = jnlp_iob_df['jnlp_ent_iob']
all_iob_df['jnlp_ent_type'] = jnlp_iob_df['jnlp_ent_type']
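
To see the row multiplication on a small scale, the sketch below merges two made-up frames that share a repeated token, then joins them by row position with `pd.concat` instead. The frames and their values are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"text": ["the", "ACC", "the"], "sci_ent_iob": ["O", "B", "O"]})
right = pd.DataFrame({"text": ["the", "ACC", "the"], "craft_ent_iob": ["O", "O", "O"]})

# Merging on a non-unique key pairs every "the" on the left
# with every "the" on the right.
merged = pd.merge(left, right, on="text")
print(len(merged), "rows after merge vs", len(left), "tokens")

# Joining by row position keeps exactly one row per token.
side_by_side = pd.concat([left, right.drop(columns="text")], axis=1)
print(side_by_side.shape)
```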

In [58]:
all_iob_df.head()

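| | text | sci_ent_iob | sci_ent_type | craft_ent_iob | craft_ent_type | jnlp_ent_iob | jnlp_ent_type |
|---|---|---|---|---|---|---|---|
| 0 | 1 | O | | O | | O | |
| 1 | . | O | | O | | O | |
| 2 | Plant | B | ENTITY | O | | O | |
| 3 | J. | O | | O | | O | |
| 4 | 2017 | O | | O | | O | |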


Now, let's drop all the rows where none of the models label the text as an entity, so we can better compare.

In [62]:
index_sci = all_iob_df[all_iob_df['sci_ent_iob']=='O'].index.tolist()
index_craft = all_iob_df[all_iob_df['craft_ent_iob']=='O'].index.tolist()
index_jnlp = all_iob_df[all_iob_df['jnlp_ent_iob']=='O'].index.tolist()

In [67]:
index_drop = [i for i in index_sci if (i in index_craft and i in index_jnlp)]
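
The membership tests above scan Python lists, which gets slow as the token count grows. An equivalent selection can be sketched with a boolean mask. The toy frame below is made up and only mirrors the column names used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["1", "Plant", "ACC", "of"],
    "sci_ent_iob": ["O", "B", "B", "O"],
    "craft_ent_iob": ["O", "O", "O", "O"],
    "jnlp_ent_iob": ["O", "O", "B", "O"],
})

iob_cols = ["sci_ent_iob", "craft_ent_iob", "jnlp_ent_iob"]
# True for rows where every model tagged the token as outside an entity.
all_outside = toy[iob_cols].eq("O").all(axis=1)

dropped = toy[all_outside]   # tokens no model tagged
kept = toy[~all_outside]     # tokens tagged by at least one model
print(kept["text"].tolist())
```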

In [69]:
# Spot check the indices
to_drop = all_iob_df.loc[index_drop]  # index_drop holds index labels, so select with .loc
to_drop

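| | text | sci_ent_iob | sci_ent_type | craft_ent_iob | craft_ent_type | jnlp_ent_iob | jnlp_ent_type |
|---|---|---|---|---|---|---|---|
| 0 | 1 | O | | O | | O | |
| 1 | . | O | | O | | O | |
| 3 | J. | O | | O | | O | |
| 4 | 2017 | O | | O | | O | |
| 6 | - | O | | O | | O | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 7955 | no | O | | O | | O | |
| 7956 | conflict | O | | O | | O | |
| 7957 | of | O | | O | | O | |
| 7958 | interest | O | | O | | O | |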


In [70]:
# Drop rows where no model identified the token as an entity
entities_iob = all_iob_df.drop(index_drop)
entities_iob

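| | text | sci_ent_iob | sci_ent_type | craft_ent_iob | craft_ent_type | jnlp_ent_iob | jnlp_ent_type |
|---|---|---|---|---|---|---|---|
| 2 | Plant | B | ENTITY | O | | O | |
| 5 | Aug;91(3):491 | B | ENTITY | O | | O | |
| 22 | turnover | B | ENTITY | O | | O | |
| 24 | ACC | B | ENTITY | O | | B | PROTEIN |
| 25 | synthases | B | ENTITY | O | | I | PROTEIN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 7938 | PMID | B | ENTITY | O | | O | |
| 7942 | Indexed | B | ENTITY | O | | O | |
| 7944 | MEDLINE | B | ENTITY | O | | O | |
| 7950 | statement | B | ENTITY | O | | O | |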


Now let's break these down further. <br>

1. The sci model is clearly overzealous - what entities did it identify that the other models didn't?
2. The sci model is also more aggressive about breaking spans up into multiple entities - are any of the entities more than one token long? (Are any of the IOB values == I?)
3. What entities do CRAFT and JNLPBA models have in common? What do they not have in common?
4. Are there any entities NOT identified by the sci model that are identified by others?
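
As a rough sketch of how these four questions could be approached, the block below works on toy rows with this notebook's column names. In the notebook the rows could come from `entities_iob.to_dict("records")`. `tagged` is a hypothetical helper, and comparing by token text is a simplification: repeated tokens collapse into one entry, so a real analysis would compare by row index.

```python
# Toy rows using this notebook's column names; values are illustrative.
rows = [
    {"text": "Plant", "sci_ent_iob": "B", "craft_ent_iob": "O", "jnlp_ent_iob": "O"},
    {"text": "ACC", "sci_ent_iob": "B", "craft_ent_iob": "O", "jnlp_ent_iob": "B"},
    {"text": "synthases", "sci_ent_iob": "B", "craft_ent_iob": "O", "jnlp_ent_iob": "I"},
    {"text": "Arabidopsis", "sci_ent_iob": "O", "craft_ent_iob": "B", "jnlp_ent_iob": "O"},
]


def tagged(rows, column):
    """Set of token texts that a model placed inside an entity (B or I)."""
    return {r["text"] for r in rows if r[column] in ("B", "I")}


sci = tagged(rows, "sci_ent_iob")
craft = tagged(rows, "craft_ent_iob")
jnlp = tagged(rows, "jnlp_ent_iob")

sci_only = sci - (craft | jnlp)                                  # question 1
sci_has_multitoken = any(r["sci_ent_iob"] == "I" for r in rows)  # question 2
craft_jnlp_shared = craft & jnlp                                 # question 3
missed_by_sci = (craft | jnlp) - sci                             # question 4
print(sci_only, sci_has_multitoken, craft_jnlp_shared, missed_by_sci)
```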

#### 1. Overzealous sci model
We want to look at rows where the IOB value is O for both of the other two models.

In [75]:
index_ents = entities_iob[(entities_iob['craft_ent_iob']=='O') & (entities_iob['jnlp_ent_iob']=='O')].index.tolist()

Another question that may be interesting: what are the things that weren't identified as entities by any model? Do we care about them, or want them to be extracted as entities in the future?
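
One way to start on this question is to count the most frequent tokens that no model tagged, skipping punctuation and numbers. This is a sketch on a toy list. In the notebook the input could be `to_drop['text']`, and `frequent_untagged` is a hypothetical helper.

```python
from collections import Counter


def frequent_untagged(tokens, n=3):
    """Most common alphabetic tokens among those no model tagged."""
    words = [t.lower() for t in tokens if t.isalpha()]
    return Counter(words).most_common(n)


untagged = ["1", ".", "J.", "2017", "-", "no", "conflict", "of", "interest", "of"]
print(frequent_untagged(untagged, n=1))
```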