<h1> Custom Named Entity Recognizer</h1>

Named entity recognizers are generally trained on web or news articles. As a result, they often fail to recognize entities from specialized domains such as finance or medicine. For such use cases, we need to train the NER model on domain-specific labelled data.

If enough training data is available, we can train an NER model from scratch. If not, we can retrain a previously trained NER model on the new data.

**For this use case, since our dataset is small, we will retrain a pre-trained model.**
<br>

<h3>Problem Description</h3>

The core problem we are trying to solve is this: given a text (sentence), find the different types of entities in it, e.g. names of people, organizations, places, etc.

We can model it as a **many-to-many** sequence prediction problem, i.e. the input is a sentence (a sequence of tokens) and the output is a sequence of tags, one per token.
<br>
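
As a small illustration of the many-to-many framing, the sketch below pairs each token of a sentence with exactly one tag. The tokens come from the start of the first training example printed later in this notebook, and the `B`/`I`/`O` tags follow the dataset files; the variable names are only illustrative.

```python
# One tag per token: B = beginning of a disease mention, I = inside, O = outside.
tokens = ["Selegiline", "-", "induced", "postural", "hypotension",
          "in", "Parkinson", "'", "s", "disease"]
tags = ["O", "O", "O", "B", "I", "O", "B", "I", "I", "I"]

# Input and output sequences must have the same length.
assert len(tokens) == len(tags)

pairs = list(zip(tokens, tags))
for token, tag in pairs:
    print(f"{token:<12} {tag}")
```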

<h3>Dataset Description</h3>

The dataset used is BC5CDR-disease from NCBI, with nearly 2K labelled sentences. It can be found here: https://github.com/dmis-lab/biobert
<br>

<h3>Plan of Attack</h3>

We are going to retrain a pre-trained model: the **1D-CNN model** provided by spaCy. spaCy's NLP pipeline provides a tagger, a parser and a named entity recognizer. Since our objective is to train only the NER, we will disable the tagger and parser before retraining.

In [2]:
import spacy
from spacy import displacy

In [3]:
## For training models on a GPU. Use spacy.prefer_gpu() if you are not sure whether a GPU is available:
## it returns False when no GPU is found, whereas spacy.require_gpu() raises an error.
## Always execute the line below before loading spaCy language models.

# spacy.require_gpu()
spacy.prefer_gpu()

False

In [4]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tagger', 'parser', 'ner']

## Formatting data for training :-

Format of the data to be supplied to the spaCy model:

[["Sent_1",{'entities':[(start, end, label),(start, end, label),....]}],

["Sent_2",{'entities':[(start, end, label),(start, end, label),....]}], ....]
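
The `start` and `end` values are character offsets into the sentence, with `end` exclusive, so slicing the text with them should give back the entity text. A minimal check, using a shortened copy of the first training example printed further below (the helper name `spans_from_annotation` is made up for this illustration):

```python
def spans_from_annotation(text, annotation):
    """Return (entity_text, label) pairs by slicing the text with each offset pair."""
    return [(text[start:end], label) for start, end, label in annotation["entities"]]

text = "Selegiline - induced postural hypotension in Parkinson ' s disease"
annotation = {"entities": [(21, 29, "B_Dis"), (30, 41, "I_Dis"),
                           (45, 54, "B_Dis"), (55, 56, "I_Dis"),
                           (57, 58, "I_Dis"), (59, 66, "I_Dis")]}

spans = spans_from_annotation(text, annotation)
print(spans)
```

A check like this is a cheap way to catch off-by-one errors in the offsets before training.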

In [5]:
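def load_data(file_path):
    """Read a tab-separated token/tag file and build spaCy-style training data.

    Each non-blank line holds "token<TAB>tag"; a blank line ends a sentence.
    Only sentences with at least one disease mention are kept.
    """
    train_data, unique_labels = [], []
    entities, sentence = [], []
    start, end = 0, 0  # character offsets of the current token

    with open(file_path, 'r') as file:
        for line in file:
            fields = line.strip("\n").split("\t")

            if len(fields) > 1:
                word, label = fields[0], fields[1]
                if label != 'O':
                    label = label + "_Dis"
                sentence.append(word)
                start = end
                end += len(word) + 1  # +1 for the space that joins the tokens

                if label in ("B_Dis", "I_Dis"):
                    entities.append((start, end - 1, label))

                if label != 'O' and label not in unique_labels:
                    unique_labels.append(label)

            else:
                # Blank line: the sentence is complete
                if entities:
                    train_data.append([" ".join(sentence),
                                       {'entities': entities}])
                # Reset after every sentence, so that a sentence without
                # entities is not merged into the next one
                start, end = 0, 0
                entities, sentence = [], []

    # Keep the last sentence if the file does not end with a blank line
    if entities:
        train_data.append([" ".join(sentence), {'entities': entities}])

    return train_data, unique_labels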

In [6]:
train_data, unique_labels = load_data(r"/content/train.tsv")
test_data, _ = load_data(r"/content/test.tsv")

print("Train data size :- ",len(train_data))
print("Test data size :- ",len(test_data))

Train data size :-  2658
Test data size :-  2842


In [7]:
for data in train_data[0:5]:
  print(data)

["Selegiline - induced postural hypotension in Parkinson ' s disease : a longitudinal study on the effects of drug withdrawal .", {'entities': [(21, 29, 'B_Dis'), (30, 41, 'I_Dis'), (45, 54, 'B_Dis'), (55, 56, 'I_Dis'), (57, 58, 'I_Dis'), (59, 66, 'I_Dis')]}]
["OBJECTIVES : The United Kingdom Parkinson ' s Disease Research Group ( UKPDRG ) trial found an increased mortality in patients with Parkinson ' s disease ( PD ) randomized to receive 10 mg selegiline per day and L - dopa compared with those taking L - dopa alone .", {'entities': [(32, 41, 'B_Dis'), (42, 43, 'I_Dis'), (44, 45, 'I_Dis'), (46, 53, 'I_Dis'), (132, 141, 'B_Dis'), (142, 143, 'I_Dis'), (144, 145, 'I_Dis'), (146, 153, 'I_Dis'), (156, 158, 'B_Dis')]}]
['Recently , we found that therapy with selegiline and L - dopa was associated with selective systolic orthostatic hypotension which was abolished by withdrawal of selegiline .', {'entities': [(92, 100, 'B_Dis'), (101, 112, 'I_Dis'), (113, 124, 'I_Dis')]}]
['This unwanted e

In [10]:
## Checking if SpaCy's NER is able to detect any named entity from clinical data.
doc = nlp(test_data[0][0])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
displacy.render(doc, jupyter=True, style = "ent")

Entities []


**spaCy's NER model didn't find any entities, which is expected: it wasn't trained on clinical text.**

## Retraining model :-

Steps to be followed :-

- Add the new labels to the NER component of the pipeline
- Disable the other pipeline components before training
- Start training

In [11]:
ner = nlp.get_pipe("ner")

for label in unique_labels:
    ner.add_label(label)

## Names of the other pipeline components, to be disabled during training so that only the NER is updated
disable_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

In [12]:
import time
import random
from spacy.util import minibatch, compounding
from pathlib import Path

In [13]:
epochs = 50

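with nlp.disable_pipes(*disable_pipes):
  # resume_training() (spaCy v2.1+) keeps the pre-trained weights;
  # begin_training() is meant for new models and may re-initialize them.
  optimizer = nlp.resume_training()

  for epoch in range(epochs):
    st = time.time()
    random.shuffle(train_data)
    losses = {}  # accumulated over all batches of this epoch

    batches = minibatch(train_data, size=compounding(16.0, 64.0, 1.5))
    for batch in batches:
      texts, annotations = zip(*batch)
      nlp.update(texts, annotations, drop=0.5, losses=losses, sgd=optimizer)
      # Printed after every batch of every 10th epoch, so the running
      # total grows within an epoch.
      if epoch % 10 == 0:
        print("Losses = ", losses)

    epoch_time = time.time() - st
    # print("Time taken - {:.2f} mins".format(epoch_time / 60))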

Losses =  {'ner': 881.079833984375}
Losses =  {'ner': 2082.011489868164}
Losses =  {'ner': 3284.989227294922}
Losses =  {'ner': 5649.9578857421875}
Losses =  {'ner': 8555.014343261719}
Losses =  {'ner': 10774.422836303711}
Losses =  {'ner': 12778.362442016602}
Losses =  {'ner': 15236.054641723633}
Losses =  {'ner': 17730.669326782227}
Losses =  {'ner': 20238.814407348633}
Losses =  {'ner': 22412.823440551758}
Losses =  {'ner': 24920.975311279297}
Losses =  {'ner': 26905.799285888672}
Losses =  {'ner': 28618.208953857422}
Losses =  {'ner': 30844.873428344727}
Losses =  {'ner': 32618.490478515625}
Losses =  {'ner': 34438.96694946289}
Losses =  {'ner': 36409.355545043945}
Losses =  {'ner': 38423.28039550781}
Losses =  {'ner': 40357.236404418945}
Losses =  {'ner': 42049.9609375}
Losses =  {'ner': 43933.59896850586}
Losses =  {'ner': 46078.80177307129}
Losses =  {'ner': 48117.87957763672}
Losses =  {'ner': 50354.362213134766}
Losses =  {'ner': 51898.64566040039}
Losses =  {'ner': 53733.6313
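
The batch sizes in the training loop above come from `compounding(16.0, 64.0, 1.5)`. To make the schedule concrete, the sketch below is a plain-Python stand-in written for this explanation (`compounding_sketch` is not a spaCy function). It assumes the series starts at `start`, is multiplied by `compound` at each step and is capped at `stop`, and that the batcher truncates each value to an integer.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

# First six batch sizes under the assumptions stated above
batch_sizes = [int(size) for size in islice(compounding_sketch(16.0, 64.0, 1.5), 6)]
print(batch_sizes)  # [16, 24, 36, 54, 64, 64]
```

Starting with small batches and growing them is a common choice for small datasets: early updates are frequent and noisy, later ones are smoother.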

In [14]:
print(test_data[0][0])
print(test_data[0][1])

Torsade de pointes ventricular tachycardia during low dose intermittent dobutamine treatment in a patient with dilated cardiomyopathy and congestive heart failure .
{'entities': [(0, 7, 'B_Dis'), (8, 10, 'I_Dis'), (11, 18, 'I_Dis'), (19, 30, 'B_Dis'), (31, 42, 'I_Dis'), (111, 118, 'B_Dis'), (119, 133, 'I_Dis'), (138, 148, 'B_Dis'), (149, 154, 'I_Dis'), (155, 162, 'I_Dis')]}


## Testing newly trained model

In [15]:
doc = nlp(test_data[0][0])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
displacy.render(doc, jupyter=True, style = "ent")

Entities [('pointes', 'B_Dis'), ('ventricular', 'I_Dis'), ('tachycardia', 'I_Dis'), ('dilated', 'B_Dis'), ('cardiomyopathy', 'I_Dis'), ('congestive', 'B_Dis'), ('heart', 'I_Dis'), ('failure', 'I_Dis')]


In [16]:
doc = nlp(test_data[100][0])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
displacy.render(doc, jupyter=True, style = "ent")

Entities [('visual', 'B_Dis'), ('attention', 'I_Dis'), ('deterioration', 'B_Dis')]


**We can see that the retrained model identifies most of the named entities in these examples, which suggests that it has learned to recognize the domain-specific entities.**
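
Spot-checking a few sentences gives only a rough impression. One way to quantify it is span-level precision, recall and F1. The sketch below is a plain-Python illustration and not part of the original notebook: `prf_score` is a hypothetical helper, and it counts a prediction as correct only when start, end and label all match the gold annotation. It is applied here to the single test sentence `test_data[0]`, with the predicted entities printed above converted to character offsets.

```python
def prf_score(gold_spans, predicted_spans):
    """Span-level precision, recall and F1 over parallel lists of span lists."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_spans, predicted_spans):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # exact matches
        fp += len(pred - gold)   # predicted but not in gold
        fn += len(gold - pred)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Gold annotation of test_data[0], as printed earlier
gold = [[(0, 7, 'B_Dis'), (8, 10, 'I_Dis'), (11, 18, 'I_Dis'), (19, 30, 'B_Dis'),
         (31, 42, 'I_Dis'), (111, 118, 'B_Dis'), (119, 133, 'I_Dis'),
         (138, 148, 'B_Dis'), (149, 154, 'I_Dis'), (155, 162, 'I_Dis')]]
# Model output for the same sentence, converted to offsets by hand.
# With spaCy this could be built as:
#   [(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]
predicted = [[(11, 18, 'B_Dis'), (19, 30, 'I_Dis'), (31, 42, 'I_Dis'),
              (111, 118, 'B_Dis'), (119, 133, 'I_Dis'), (138, 148, 'B_Dis'),
              (149, 154, 'I_Dis'), (155, 162, 'I_Dis')]]

precision, recall, f1 = prf_score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Running the same function over all of `test_data` would give a figure for the whole test set rather than for one sentence.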

In [17]:
## Saving the whole pipeline (including the retrained NER) to disk
output_dir = Path("/content/spacy_example")
nlp.to_disk(output_dir)

In [18]:
nlp_retrained = spacy.load(output_dir)
doc = nlp_retrained(test_data[2000][0])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
displacy.render(doc, jupyter=True, style = "ent")

Entities [('CML', 'B_Dis')]


In [19]:
doc = nlp_retrained(test_data[2200][0])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
displacy.render(doc, jupyter=True, style = "ent")

Entities [('catalepsy', 'B_Dis')]


## Problems with retraining NLP models :-

Retraining NLP models on newly introduced data can lead to **catastrophic forgetting**. In the context of NER, the model may, after retraining, fail to recognize entities that it recognized correctly before.

A solution to this problem is **pseudo-rehearsal**. Simply put, we augment our new data with examples of the kind the model was originally trained on, labelled by the original model itself, so that the model keeps seeing what it had already learned. It is explained in detail here: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
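
A minimal sketch of how the data for pseudo-rehearsal could be prepared, under some assumptions: `build_rehearsal_data`, `mix_training_data` and `toy_annotate` are hypothetical names introduced here, and the toy annotator only stands in for the original model so that the example runs without spaCy. In practice the annotations would come from a copy of the pre-trained model loaded before retraining.

```python
import random

def build_rehearsal_data(annotate, raw_texts):
    """Label raw texts with the ORIGINAL model's predictions.

    `annotate(text)` must return a list of (start, end, label) tuples.
    """
    return [[text, {"entities": list(annotate(text))}] for text in raw_texts]

def mix_training_data(new_data, rehearsal_data, seed=0):
    """Combine both data sets and shuffle them so that batches contain a mix."""
    mixed = list(new_data) + list(rehearsal_data)
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy stand-in for the original model, used only to make this sketch runnable.
# With spaCy it could be something like:
#   original_nlp = spacy.load("en_core_web_sm")
#   annotate = lambda t: [(e.start_char, e.end_char, e.label_)
#                         for e in original_nlp(t).ents]
def toy_annotate(text):
    start = text.find("London")
    return [(start, start + len("London"), "GPE")] if start >= 0 else []

new_data = [["catalepsy was observed .", {"entities": [(0, 9, "B_Dis")]}]]
rehearsal_data = build_rehearsal_data(toy_annotate, ["She moved to London ."])
mixed_data = mix_training_data(new_data, rehearsal_data)
print(mixed_data)
```

The mixed list has the same format as `train_data`, so it could be passed to the same training loop.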