# Named Entity Recognition with SciSpacy
The amount of raw text data is growing rapidly. Processing this text to extract meaningful pieces of information can be very helpful. This task is called Information Extraction.

One of the important components of an Information Extraction pipeline is Named Entity Recognition (NER). It takes in a piece of text and labels tokens (or sets of tokens) with their respective named entity types. For example:

<img src="img/NER Example.png">
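To make the idea concrete, a NER result can be thought of as a list of labeled character spans over the input text. The sketch below is only an illustration: the sentence, the spans, and the `Entity` class are hypothetical and hand-written, not the output of any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A labeled span of text (hypothetical container for illustration)."""
    start: int   # index of the first character of the span
    end: int     # index one past the last character of the span
    label: str   # entity type, e.g. DISEASE or CHEMICAL


text = "Aspirin can reduce fever."

# Hand-written annotations, not model output.
entities = [
    Entity(start=0, end=7, label="CHEMICAL"),
    Entity(start=19, end=24, label="DISEASE"),
]

for ent in entities:
    print(text[ent.start:ent.end], ent.start, ent.end, ent.label)
```

Storing character offsets rather than the entity strings themselves keeps each label tied to an exact position in the original text.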

## Biomedical NER
Biomedical named entity recognition (Bio-NER) is a major task in processing biomedical texts. It identifies mentions of entities such as RNA, proteins, cell types, cell lines, DNA, drugs, and diseases.

We'll make use of the SciSpacy library, which builds on Spacy, a very popular NLP library, and provides models for the biomedical domain.

## Installing SciSpacy
SciSpacy has several pretrained models available for biomedical NER. Each of these models targets different entity types, as shown below.

<img src="img/SciSpacy.png" height="500px">


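For training an NER model, the dataset is formatted according to the BIO (Beginning, Inside, Outside) tagging scheme. The first token of an entity is tagged `B-<TYPE>`, the remaining tokens of the same entity are tagged `I-<TYPE>`, and every token outside an entity is tagged `O`.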
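As a rough illustration of the BIO scheme, the sketch below converts token-level entity spans into one tag per token. The function name `bio_tags`, the example sentence, and its annotations are hypothetical and chosen only for demonstration.

```python
def bio_tags(tokens, entities):
    """Return one BIO tag per token.

    tokens   -- list of token strings
    entities -- list of (first_token_index, end_token_index_exclusive, label)
    """
    tags = ["O"] * len(tokens)           # everything starts outside an entity
    for start, end, label in entities:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the entity
    return tags


# Hand-annotated example sentence (not model output).
tokens = ["Upper", "respiratory", "tract", "infections",
          "trigger", "exacerbations", "."]
entities = [(0, 4, "DISEASE"), (5, 6, "DISEASE")]

for token, tag in zip(tokens, bio_tags(tokens, entities)):
    print(f"{token}\t{tag}")
```

The `B-` prefix is what lets two adjacent entities of the same type be told apart, since each new entity starts with a fresh `B-` tag.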

The Spacy NER system uses a word embedding strategy based on subword features, together with a deep convolutional neural network with residual connections, which help with the vanishing and exploding gradient problems. The system is designed to give a good balance of efficiency, accuracy and adaptability.

(Source: https://blog.vsoftconsulting.com/blog/understanding-named-entity-recognition-pre-trained-models)
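To give a feel for what "subword features" can mean, the sketch below derives a few surface features from a single token. This is an assumption-laden simplification for illustration: the function `subword_features` and its exact feature definitions are hypothetical and are not Spacy's implementation.

```python
def subword_features(token):
    """Return simple sub-word features for a token (illustrative only)."""
    shape = []
    for ch in token:
        if ch.isdigit():
            shape.append("d")                          # any digit
        elif ch.isalpha():
            shape.append("X" if ch.isupper() else "x")  # letter case
        else:
            shape.append(ch)                           # keep punctuation as-is
    return {
        "norm": token.lower(),     # case-normalized form
        "prefix": token[:1],       # first character
        "suffix": token[-3:],      # last three characters
        "shape": "".join(shape),   # orthographic pattern
    }


print(subword_features("SARS-CoV-2"))
```

Features like these let a model generalize to rare biomedical terms it has never seen as whole words, because unseen tokens still share prefixes, suffixes, and shapes with known ones.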

In this demo, we'll use the ```en_ner_bc5cdr_md``` and ```en_ner_bionlp13cg_md``` models. To download and set them up, run the following.
```
pip3 install scispacy
pip3 install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bc5cdr_md-0.4.0.tar.gz
pip3 install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bionlp13cg_md-0.4.0.tar.gz
```
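After installing, it can be useful to confirm that the packages are importable before running the rest of the notebook. The helper below is a minimal sketch using only the standard library; the function name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["spacy", "scispacy", "en_ner_bc5cdr_md", "en_ner_bionlp13cg_md"]
print("Missing packages:", missing_packages(required))
```

An empty list means all four packages were found in the current environment.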


### Loading libraries

In [1]:
import spacy
import scispacy

import en_ner_bc5cdr_md   
import en_ner_bionlp13cg_md

from spacy import displacy



### Loading pre-trained model

In [86]:
nlp = spacy.load("en_ner_bc5cdr_md")
nlp1 = spacy.load("en_ner_bionlp13cg_md")
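Since each model targets different entity types, it can help to list the labels a loaded pipeline can produce. The sketch below assumes the pipeline has a component named `"ner"`, which is the usual name for Spacy's entity recognizer; the helper `entity_labels` is hypothetical.

```python
def entity_labels(nlp):
    """Return the sorted entity labels of a loaded pipeline's NER component."""
    return sorted(nlp.get_pipe("ner").labels)


# Usage, assuming `nlp` and `nlp1` were loaded as in the cell above:
# print(entity_labels(nlp))
# print(entity_labels(nlp1))
```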

### Get the data

In [13]:
import spacy

# spacy.load('en')
# spacy.load("en_core_web_sm")
# nlp=spacy.load("en_core_web_sm")

from spacy.lang.en import English
parser = English()

def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        # elif token.like_url:
        #     lda_tokens.append('URL')
        # elif token.orth_.startswith('@'):
        #     lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

In [14]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('wordnet')

from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma(word):
    return WordNetLemmatizer().lemmatize(word)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\choij\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [15]:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    # Keep only tokens longer than four characters that are not stop words
    tokens = [token for token in tokens if len(token) > 4 and token not in en_stop]
    return tokens

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\choij\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [82]:
import pandas as pd

text_data = []
covid_dataset = pd.read_csv('../COVID-Dataset/metadata_April10_2020.csv', encoding = "ISO-8859-1")

for index, row in covid_dataset.iterrows():
    if index%1000==0: print(index)
    # if index==2: break
    # tokens = prepare_text_for_lda(str(row['title'])+str(row['abstract']))
    tokens = str(row['title'])+" "+str(row['abstract'])
    # print(type(tokens))
    text_data.append(tokens)
# print(text_data)

0
1000
2000
...
51000
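The loading loop above builds one "title abstract" string per row. For very large files, a streaming reader that avoids holding the whole table in memory may be preferable. The sketch below uses only the standard library; the function name `title_abstract_texts` and the sample rows are hypothetical. Note one difference from the pandas version: missing values become empty strings here rather than the string `nan`.

```python
import csv
import io


def title_abstract_texts(csv_file):
    """Yield 'title abstract' strings from an open CSV file with those columns."""
    for row in csv.DictReader(csv_file):
        title = row.get("title") or ""
        abstract = row.get("abstract") or ""
        yield f"{title} {abstract}".strip()


# Tiny in-memory stand-in for the metadata file (made-up rows).
sample = io.StringIO("title,abstract\nPaper A,About viruses\nPaper B,\n")
print(list(title_abstract_texts(sample)))

# With the real file, something like the following should work:
# with open('../COVID-Dataset/metadata_April10_2020.csv',
#           encoding="ISO-8859-1", newline="") as f:
#     text_data = list(title_abstract_texts(f))
```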


### Initializing text and feeding it to spacy pipeline

In [57]:
# text = '''Myeloid derived suppressor cells (MDSC) are immature 
#           myeloid cells with immunosuppressive activity. 
#           They accumulate in tumor-bearing mice and humans 
#           with different types of cancer, including hepatocellular 
#           carcinoma (HCC).'''

# doc = nlp(text)
# print(list(doc.sents))

In [89]:
import os
from os.path import join

import pandas as pd

# Entity counts per label for the BioNLP13CG model: {label: {entity text: frequency}}
d1 = dict()
s1 = set()  # every entity label seen so far

for i, each in enumerate(text_data):
    if i % 100 == 0: print(i)
    doc = nlp1(each)
    for ent in doc.ents:
        s1.add(ent.label_)
        counts = d1.setdefault(ent.label_, {})
        counts[ent.text] = counts.get(ent.text, 0) + 1

# Sort the labels alphabetically
d1 = {k: v for k, v in sorted(d1.items(), key=lambda item: item[0])}
# print(d1)

path = join(os.getcwd(), "../our_data1/")
os.makedirs(path, exist_ok=True)  # create the output folder if it is missing

for k, v in d1.items():
    print(k)
    # Write one CSV per entity label; labels that already have a file are skipped
    if not os.path.exists(join(path, k + '.csv')):
        df = pd.DataFrame(list(v.items()), columns=[k, "frequency"])
        df.to_csv(join(path, k + '.csv'), index=False)

0
100
200
...
18400
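Calling the pipeline once per document, as in the counting loop above, can be slow on tens of thousands of abstracts. Spacy pipelines also offer `nlp.pipe`, which processes texts in batches. The sketch below shows how the same per-label counting might look with it; the function name `count_entities` and the batch size of 64 are assumptions, not tuned values.

```python
from collections import Counter, defaultdict


def count_entities(nlp, texts, batch_size=64):
    """Count entity strings per label, streaming texts through nlp.pipe."""
    counts = defaultdict(Counter)
    # nlp.pipe yields one processed document per input text, in order
    for doc in nlp.pipe(texts, batch_size=batch_size):
        for ent in doc.ents:
            counts[ent.label_][ent.text] += 1
    return counts


# Usage, assuming `nlp1` and `text_data` from the cells above:
# d1 = count_entities(nlp1, text_data)
```

Using `defaultdict(Counter)` removes the need to check whether a label or entity string has been seen before.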

In [83]:
# Entity counts per label for the BC5CDR model: {label: {entity text: frequency}}
d = dict()
s = set()  # every entity label seen so far

for i, each in enumerate(text_data):
    if i % 500 == 0: print(i)
    doc = nlp(each)
    for ent in doc.ents:
        s.add(ent.label_)
        counts = d.setdefault(ent.label_, {})
        counts[ent.text] = counts.get(ent.text, 0) + 1

0
500
1000
...
51000


In [84]:
# Sort the labels alphabetically before printing and saving
d = {k: v for k, v in sorted(d.items(), key=lambda item: item[0])}
print(d)

import os
from os.path import join

import pandas as pd

path = join(os.getcwd(), "../our_data/")
os.makedirs(path, exist_ok=True)  # create the output folder if it is missing

for k, v in d.items():
    # Write one CSV per entity label; labels that already have a file are skipped
    if not os.path.exists(join(path, k + '.csv')):
        df = pd.DataFrame(list(v.items()), columns=[k, "frequency"])
        df.to_csv(join(path, k + '.csv'), index=False)

heral T cell lymphopenia': 1, 'lyp/lyp': 1, 'organ-specific autoimmunity': 1, 'mood disorder': 2, 'coronaviruses, influenza A': 1, 'psychotic': 1, 'mood episodes': 1, 'unipolar or': 1, 'suicidal behavior': 1, 'febrile infection': 1, 'pneumonitides': 1, 'pandemic illness': 1, 'over-protection': 1, 'LoVo': 1, 'aplastic anemia/bone marrow aplasia': 1, 'acute hemorrhagic syndrome': 1, 'promyelocytic leukemia zinc': 1, 'promyelocytic leukemia': 1, 'vocal fold dysfunction': 1, 'HY': 5, 'FECV Type II': 1, 'FECV Type II viruses': 1, 'infection of E. coli +': 1, 'fetuses/newborns': 1, 'coccidian oocysts': 1, 'E. debliecki': 1, 'E. scabra': 1, 'E. suis': 2, 'E. spinosa': 1, 'Isosporoid oocysts': 1, 'notamment de la': 1, 'restricted species tropism': 1, 'HIV to hepatotropic viruses': 1, 'TBSA': 2, 'inhalation injury': 1, 'non-small cell lung cancers': 1, 'Coronavirus JHMV Abstract A': 1, 'Necrotic Enteritis': 1, 'white tail disease': 1, 'rhinovirus reinfection': 1, 'respiratory syndrome coronavir
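The printed dictionary is hard to read at this size, so looking at only the most frequent entities per label may be more useful. The sketch below is one way to do that with the standard library; the helper `top_entities` is hypothetical, and the sample counts are a handful of entries copied from the printed output above.

```python
from collections import Counter


def top_entities(counts, n=10):
    """Return the `n` most frequent (entity, frequency) pairs for one label."""
    return Counter(counts).most_common(n)


# A few counts copied from the printed output above.
sample_counts = {"mood disorder": 2, "psychotic": 1, "HY": 5, "TBSA": 2}
print(top_entities(sample_counts, n=3))

# Usage with the full results, assuming `d` from the cell above:
# for label, counts in d.items():
#     print(label, top_entities(counts))
```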

### Entities extracted from text

In [30]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
    print('-------')

upper respiratory tract infections 49 83 DISEASE
-------
exacerbations 115 128 DISEASE
-------
aerosols 748 756 CHEMICAL
-------
aerosols 808 816 CHEMICAL
-------
AVL 923 926 DISEASE
-------


In [6]:
displacy.render(doc, style="ent")
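displaCy can also render entities that are supplied as plain data rather than as a processed document, which is handy for visualizing spans that were saved earlier. The sketch below builds the dictionary that displaCy's manual mode expects. The helper `to_displacy_manual` and the example sentence are hypothetical, and the spans are hand-written rather than model output.

```python
def to_displacy_manual(text, spans):
    """Build the dict that displaCy accepts when called with manual=True.

    spans -- iterable of (start_char, end_char, label) tuples
    """
    # displaCy expects entities in order of appearance, so sort by offsets
    ents = [{"start": s, "end": e, "label": label} for s, e, label in sorted(spans)]
    return {"text": text, "ents": ents, "title": None}


example = to_displacy_manual(
    "Aerosols may worsen upper respiratory tract infections.",
    [(20, 54, "DISEASE"), (0, 8, "CHEMICAL")],
)
print(example["ents"])

# With Spacy installed, the dict can be rendered in a notebook:
# from spacy import displacy
# displacy.render(example, style="ent", manual=True)
```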