<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/NER/Train_NER_spaCy_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to train a spaCy NER model

In this notebook, we show how to train a Named-Entity Recognition (NER) model in spaCy to identify specific entities that we labelled beforehand.

For this example, we use the code presented in the blog post:
**Automatic Summarization of Resumes using NER**

URL: https://dataturks.com/blog/named-entity-recognition-in-resumes.php

GITHUB: https://github.com/DataTurks-Engg/Entity-Recognition-In-Resumes-SpaCy

In this example, we train a NER model to identify entities in resumes (skills, name, education, etc.). We use the input data created by `DataTurks` (the company that developed the original code). They use a custom JSON format for their input data, so we have to transform it into the format that spaCy accepts.

Import packages

In [0]:
import json
import random
import logging
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from spacy.gold import GoldParse
from spacy.scorer import Scorer
from sklearn.metrics import accuracy_score

## Input data adaptation

The input data (training and test) is formatted as JSON with a custom structure defined by `DataTurks`.

To train a model, spaCy takes a list of tuples ```[ (entry1, entry2) ]```, where each tuple is composed of:
 * 1st part ```(entry1)```: input text
 * 2nd part ```(entry2)```: a dictionary whose `entities` key holds the list of annotated entities. Each annotated entity is defined by three components:
   - 1. start_point: character offset where the entity starts
   - 2. end_point: character offset just past the end of the entity (exclusive)
   - 3. label
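
As a minimal sketch of this structure, the snippet below builds one training tuple by hand and checks that each span slices out the intended text. The sentence and the `Company` label are made up for illustration and are not taken from the DataTurks data.

```python
# Hypothetical example text; offsets are character positions with an exclusive end.
text = "Abhishek Jha works at Accenture in Bengaluru"
entities = [
    (0, 12, "Name"),
    (22, 31, "Company"),   # illustrative label, not necessarily one used by DataTurks
    (35, 44, "Location"),
]

# One training example in the (text, {"entities": [...]}) layout
example = (text, {"entities": entities})

# Slicing the text with each span is a quick way to verify the offsets
for start, end, label in entities:
    print(label, "->", text[start:end])
```

Slicing with `text[start:end]` works directly because the end offset is exclusive, just like Python slices.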

In [0]:
from urllib.request import urlopen
def get_lines_from_ulr(urlpath):
  content = urlopen(urlpath).read().decode("utf-8") 
  return content.splitlines( )

def get_lines_from_file(filepath):
  with open(filepath, 'r') as f:
    lines = f.readlines()
    return lines


def convert_dataturks_to_spacy(filepath=None, urlpath=None):
    try:
        training_data = []
        lines=[]
        
        if filepath :
          lines = get_lines_from_file(filepath)
        else:
          lines = get_lines_from_ulr(urlpath)

        for line in lines:

            #Each line represents a JSON
            data = json.loads(line)

            #JSON has two main keys: 
            #  - content: full text of the resume; 
            #  - annotation: list of identified parts of text. It is structured:
            #      - label: keyword of the type of annotation (ex., skills, email, etc.)
            #      - points: span information for the annotation:
            #            > start: character offset of the first character of the annotation
            #            > end: character offset of the last character of the annotation (inclusive)
            #            > text: content of the annotation
            text = data['content']
            entities = []
            for annotation in data['annotation']:
                #only a single point in text annotation.
                point = annotation['points'][0]
                labels = annotation['label']
                # handle both list of labels or a single label.
                if not isinstance(labels, list):
                    labels = [labels]

                for label in labels:
                    # DataTurks spans are inclusive on both ends [start, end]; spaCy expects an exclusive end [start, end)
                    entities.append((point['start'], point['end'] + 1 ,label))
                
            #Training data for the spacy is a tuple: 
            #    1. full text
            #    2. list of annotated entities [(start_point, end_point, label)]
            training_data.append((text, {"entities" : entities}))

        return training_data
    except Exception as e:
        logging.exception("Unable to process \n" + "error = " + str(e))
        return None
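
To make the inclusive-to-exclusive conversion concrete, here is a small sketch that applies the same steps to a single record. The record itself is hypothetical; it only mimics the `content` / `annotation` / `points` layout described in the comments above.

```python
import json

# Hypothetical one-line record in the DataTurks layout (end offsets are inclusive)
line = json.dumps({
    "content": "Jane Doe, Python developer",
    "annotation": [
        {"label": ["Name"], "points": [{"start": 0, "end": 7, "text": "Jane Doe"}]},
        {"label": "Skills", "points": [{"start": 10, "end": 15, "text": "Python"}]},
    ],
})

record = json.loads(line)
text = record["content"]
spans = []
for annotation in record["annotation"]:
    point = annotation["points"][0]
    labels = annotation["label"]
    if not isinstance(labels, list):  # a label may be a single string or a list
        labels = [labels]
    for label in labels:
        # +1 turns the inclusive end into the exclusive end that spaCy expects
        spans.append((point["start"], point["end"] + 1, label))

# After the conversion, each span slices out exactly the annotated text
for (start, end, label), annotation in zip(spans, record["annotation"]):
    assert text[start:end] == annotation["points"][0]["text"]
print(spans)
```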


## Model training

We are going to use the input labelled data to train the NER component of the spaCy pipeline to identify the parts of one resume.

In [0]:
import spacy
nlp = spacy.load("en_core_web_sm") # load the pretrained small English pipeline

Load the input data and convert it to the spaCy format

In [78]:
TRAIN_DATA = convert_dataturks_to_spacy(None, urlpath="https://raw.githubusercontent.com/DataTurks-Engg/Entity-Recognition-In-Resumes-SpaCy/master/testdata.json")
print(f'Number of resumes to train : {len(TRAIN_DATA)}')
print('Example input:')
TRAIN_DATA[0]

Number of resumes to train : 20
Example input:


("Abhishek Jha\nApplication Development Associate - Accenture\n\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a\n\n• To work for an organization which provides me the opportunity to improve my skills\nand knowledge for my individual and company's growth in best possible ways.\n\nWilling to relocate to: Bangalore, Karnataka\n\nWORK EXPERIENCE\n\nApplication Development Associate\n\nAccenture -\n\nNovember 2017 to Present\n\nRole: Currently working on Chat-bot. Developing Backend Oracle PeopleSoft Queries\nfor the Bot which will be triggered based on given input. Also, Training the bot for different possible\nutterances (Both positive and negative), which will be given as\ninput by the user.\n\nEDUCATION\n\nB.E in Information science and engineering\n\nB.v.b college of engineering and technology -  Hubli, Karnataka\n\nAugust 2013 to June 2017\n\n12th in Mathematics\n\nWoodbine modern school\n\nApril 2011 to March 2013\n\n10th\n\nKendriya Vidyalaya\n

Get the Named-Entity Recognition (NER) component of the pipeline, creating it if it does not exist yet

In [0]:
# create the built-in pipeline components and add them to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe('ner')

Collect the labels used in the input data and add them to the NER component

In [0]:
# add labels
for _, annotations in TRAIN_DATA:
  for ent in annotations.get('entities'):
    ner.add_label(ent[2])
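
Before training, it can help to see which labels the data contains and how often each one occurs. The following is a minimal sketch on a tiny hypothetical dataset; `sample_data` stands in for `TRAIN_DATA` and uses the same `(text, {"entities": [...]})` layout.

```python
from collections import Counter

# Hypothetical stand-in for TRAIN_DATA, using the same tuple layout
sample_data = [
    ("Jane Doe, Python developer", {"entities": [(0, 8, "Name"), (10, 16, "Skills")]}),
    ("John Roe, Madrid", {"entities": [(0, 8, "Name"), (10, 16, "Location")]}),
]

# Count how many annotated spans carry each label
label_counts = Counter(
    label
    for _, annotations in sample_data
    for _, _, label in annotations["entities"]
)

for label, count in label_counts.most_common():
    print(label, count)
```

Labels with very few examples are likely to be learned poorly, so a count like this gives a rough idea of what to expect from the evaluation.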

**Training the NER model with the new labels**

Disable the other components in the pipeline during training

In [52]:
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.begin_training()
    for itn in range(10):
        print("Starting iteration " + str(itn))
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update(
                [text],  # batch of texts
                [annotations],  # batch of annotations
                drop=0.2,  # dropout - make it harder to memorise data
                sgd=optimizer,  # callable to update weights
                losses=losses)
        print(losses)

Starting iteration 0
{'ner': 10714.913392737508}
Starting iteration 1
{'ner': 9426.441921290443}
Starting iteration 2
{'ner': 8264.782624161802}
Starting iteration 3
{'ner': 7602.219878260046}
Starting iteration 4
{'ner': 7774.348719151691}
Starting iteration 5
{'ner': 7243.808744228678}
Starting iteration 6
{'ner': 6951.896825157804}
Starting iteration 7
{'ner': 7294.837433060631}
Starting iteration 8
{'ner': 6837.749791778624}
Starting iteration 9
{'ner': 6894.473289556801}


## Model evaluation

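We evaluate the trained NER model using the test data provided by `DataTurks`. Note that this is the same `testdata.json` file that was loaded as `TRAIN_DATA` above, so the scores measure how well the model fits data it has already seen rather than how well it generalizes.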

In [53]:
# Test the model and evaluate it
examples = convert_dataturks_to_spacy(None, "https://raw.githubusercontent.com/DataTurks-Engg/Entity-Recognition-In-Resumes-SpaCy/master/testdata.json")
c = 0  # index of the current resume, used in the output file name
for text, annot in examples:
    doc_to_test = nlp(text)

    # Group the predicted entity texts by label and write them to resume<c>.txt
    d = {}
    for ent in doc_to_test.ents:
        d.setdefault(ent.label_, []).append(ent.text)
    with open("resume" + str(c) + ".txt", "w") as out:
        for i in d:
            out.write("\n\n")
            out.write(i + ":" + "\n")
            for j in set(d[i]):
                out.write(j.replace('\n', '') + "\n")

    # Per-label metrics: [done flag, precision, recall, F-score, accuracy, count]
    # Note: d is rebuilt for every resume, so the report printed after the loop
    # only covers the labels predicted in the last resume.
    d = {}
    for ent in doc_to_test.ents:
        d[ent.label_] = [0, 0, 0, 0, 0, 0]
    for ent in doc_to_test.ents:
        if d[ent.label_][0] == 0:
            # Token-level gold tags built from the annotated character spans
            gold = GoldParse(nlp.make_doc(text), entities=annot.get("entities"))
            # One-vs-rest view: each token is either this label or "Not <label>"
            y_true = [ent.label_ if ent.label_ in x else 'Not ' + ent.label_ for x in gold.ner]
            y_pred = [x.ent_type_ if x.ent_type_ == ent.label_ else 'Not ' + ent.label_ for x in doc_to_test]
            # fscore is named explicitly so it cannot be confused with a file handle
            (p, r, fscore, s) = precision_recall_fscore_support(y_true, y_pred, average='weighted')
            a = accuracy_score(y_true, y_pred)
            d[ent.label_][0] = 1
            d[ent.label_][1] += p
            d[ent.label_][2] += r
            d[ent.label_][3] += fscore
            d[ent.label_][4] += a
            d[ent.label_][5] += 1
    c += 1
for i in d:
    print("\n For Entity " + i + "\n")
    print("Accuracy : " + str((d[i][4] / d[i][5]) * 100) + "%")
    print("Precision : " + str(d[i][1] / d[i][5]))
    print("Recall : " + str(d[i][2] / d[i][5]))
    print("F-score : " + str(d[i][3] / d[i][5]))




 For Entity Name

Accuracy : 99.83805668016194%
Precision : 0.9983831936194594
Recall : 0.9983805668016195
F-score : 0.9981113185060555

 For Entity Location

Accuracy : 99.10931174089069%
Precision : 0.9838647235218079
Recall : 0.9910931174089069
F-score : 0.987465692416357
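
Because the evaluation above compares one tag per token, and most tokens in a resume belong to no entity, the weighted scores tend to sit close to 1. As a complementary view, here is a minimal sketch of span-level scoring with exact matching. It is not part of the original DataTurks code: the function name `span_prf` and the sample spans are hypothetical, and a predicted entity counts as correct only if its start, end, and label all match a gold entity.

```python
def span_prf(gold_spans, pred_spans):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted spans for one text
gold = [(0, 8, "Name"), (10, 16, "Skills"), (20, 26, "Location")]
pred = [(0, 8, "Name"), (10, 14, "Skills")]  # second span has the wrong end offset

p, r, f1 = span_prf(gold, pred)
print("precision", p, "recall", r, "f1", f1)
```

With the notebook's objects, the predicted spans could be collected as `(ent.start_char, ent.end_char, ent.label_)` for each `ent` in `doc_to_test.ents` and compared against `annot["entities"]`.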
