# Training Data Preparation

This notebook uses the spaCy annotator, built on ipywidgets, to prepare a training data set for Named Entity Recognition (NER). First, the list of resumes is read. Then it is converted to the spaCy input data format. The annotator lets users quickly assign (custom) labels to one or more entities in the text.
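As a rough illustration of the target format, the sketch below builds one training record of the shape used later in this notebook: the text paired with a dictionary of `(start, end, label)` character spans. The sample sentence and the `find_span` helper are hypothetical and are not part of the annotator or of spaCy.

```python
# Hypothetical sample text, for illustration only.
text = "John Smith worked at Acme Corp as a Data Analyst"

def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase), label)

# Character offsets are half-open: text[start:end] is the entity string.
entities = [
    find_span(text, "John Smith", "Name"),
    find_span(text, "Acme Corp", "Companies worked at"),
    find_span(text, "Data Analyst", "Designation"),
]

# One training record: the text plus its labelled spans.
record = (text, {"entities": entities})
print(record)
```

Slicing the text with each span, as in `text[start:end]`, is a quick way to confirm that the offsets point at the intended entity.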

## Libraries

In [52]:
import spacy_annotator as spa
import pandas
import os
import numpy
import natsort 
from pdfminer.high_level import extract_text
import re
import spacy
import pickle
import random
from spacy.training.example import Example

## Data Preparation

In [191]:
# reading extracted contents from resumes
input_data = pandas.read_excel('input.xlsx')

In [192]:
# select the slice of resumes to convert into the spaCy input data format
start = 90
end = 100
input_data = input_data[start:end].copy()  # copy so that new columns can be added to the slice safely

In [193]:
# getting the resume filenames
path = './train'
files = os.listdir(path)

In [194]:
# sort the resume filenames
files = natsort.natsorted(files)
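Natural sorting matters here because a plain lexicographic sort would place `resume10.pdf` before `resume2.pdf`, so the files would no longer line up with the rows of `input_data`. The sketch below shows the idea using only the standard library. `natural_key` is a hypothetical helper for illustration, not how `natsort` is implemented.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so numbers compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', name)]

# Hypothetical filenames, for illustration only.
names = ['resume10.pdf', 'resume2.pdf', 'resume1.pdf']

lexicographic = sorted(names)               # compares character by character
natural = sorted(names, key=natural_key)    # compares embedded numbers as integers

print(lexicographic)
print(natural)
```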

In [195]:
files = files[start:end]  # same slice as the rows selected from input_data

In [196]:
files

['resume91.pdf',
 'resume92.pdf',
 'resume93.pdf',
 'resume94.pdf',
 'resume95.pdf',
 'resume96.pdf',
 'resume97.pdf',
 'resume98.pdf',
 'resume99.pdf',
 'resume100.pdf']

In [197]:
# extract the text content of each resume PDF
resume_text_list = []
for f in files:
    resume_text = extract_text(os.path.join(path, f))
    resume_text_list.append(resume_text)

In [198]:
# append a new resume content column 
input_data['resume_text'] = resume_text_list

In [200]:
def cleanResume(resumeText):
    ''' This function cleans the resume contents by removing URLs and punctuation and collapsing whitespace (including newlines).'''
    
    resumeText = re.sub(r'http\S+\s*', ' ', resumeText)  # remove URLs
    resumeText = re.sub('[%s]' % re.escape("""!"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"""), ' ', resumeText)  # remove punctuation
    resumeText = re.sub(r'\s+', ' ', resumeText)  # collapse whitespace runs, including newlines, into one space
    
    return resumeText
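To see what these cleaning steps do to a resume, the sketch below applies the same three substitutions to a made-up string. It assumes `string.punctuation` is an acceptable stand-in for the punctuation list above, and the sample text is hypothetical.

```python
import re
import string

# Hypothetical resume fragment, for illustration only.
sample = "Email: jane@example.com\nSite: https://example.com/cv  Skills: C++, SQL"

# Step 1: drop URLs (anything starting with "http" up to the next whitespace).
no_urls = re.sub(r'http\S+\s*', ' ', sample)

# Step 2: replace every punctuation character with a space.
no_punct = re.sub('[%s]' % re.escape(string.punctuation), ' ', no_urls)

# Step 3: collapse runs of whitespace, including the newline, into one space.
collapsed = re.sub(r'\s+', ' ', no_punct)

print(collapsed)
```

Note that the email address and `C++` both lose their punctuation in this sketch, which is worth keeping in mind when annotating the `Email Address` and `Skills` labels on cleaned text.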

In [201]:
# apply cleanResume to every resume text
input_data['cleaned_resume'] = input_data.resume_text.apply(cleanResume)

In [203]:
# taking out only the cleaned resume contents
cleaned_resume = input_data[['cleaned_resume']]

In [205]:
# reset the index so that rows are numbered from 0
cleaned_resume = cleaned_resume.reset_index(drop=True)

In [207]:
# loading en_core_web_sm, a small pretrained English pipeline trained on written web text (for example: blogs, news, comments) that includes vocabulary, syntax and entities.
nlp = spacy.load("en_core_web_sm")

In [208]:
# Creating my own list of Entities
annotator = spa.Annotator(labels=['College Name',
 'Companies worked at',
 'Degree',
 'Designation',
 'Email Address',
 'Graduation Year',
 'Location',
 'Name',
 'Skills',
 'UNKNOWN',
 'Work Start Year',
 'Work End Year',
 'Years of Experience'], model=nlp)

## Annotator

In [210]:
# this line opens the ipywidgets interface, where we can assign each text to its entity by highlighting the text.
# The output df_labels holds the resume text together with the start and end indices of each entity.
df_labels = annotator.annotate(df=cleaned_resume, col_text="cleaned_resume")

HTML(value='-1 examples annotated, 11 examples left')

[Widget output: one text input per label (College Name through Years of Experience), a row of buttons starting with 'submit', and an output area.]

In [211]:
type(df_labels['annotations'])

pandas.core.series.Series

In [212]:
# writing the annotations in df_labels to a text file so that they can be combined with the original training data set of 200 resumes.
df_labels['annotations'].to_csv('90_to_100_resumes.txt', sep=' ', index=False)

An open-source project on GitHub trained a spaCy model on 200 resumes using a similar approach. Its training data is in text format: each record holds the resume text together with the start and end indices of its entities.

Next, we combine our prepared training data of 100 resumes with the original training data of 200 resumes from GitHub.
In total, this gives 300 rows of training data.

In [99]:
# reading the training data of 200 resumes
with open('resume_train_data.pkl', 'rb') as f:
    resume_train_data = pickle.load(f)

In [101]:
# total 200 resumes data
len(resume_train_data)

200

In [102]:
# appending my 100 resumes to the original 200 resumes
for d in df_labels['annotations']:
    resume_train_data.append(d)
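Because the combined list mixes two sources, it can help to check that every record has the expected shape before training. The sketch below assumes each record is a `(text, {'entities': [(start, end, label), ...]})` pair, as the training code in this notebook expects. `is_valid_record` and the sample records are hypothetical.

```python
def is_valid_record(record):
    """Check that a record looks like (text, {'entities': [(start, end, label), ...]})."""
    try:
        text, annotations = record
        spans = annotations['entities']
    except (TypeError, ValueError, KeyError):
        return False
    # Every span needs three fields and offsets that fall inside the text.
    return isinstance(text, str) and all(
        len(span) == 3 and 0 <= span[0] < span[1] <= len(text)
        for span in spans
    )

# Hypothetical records, for illustration only.
records = [
    ("Jane Doe", {"entities": [(0, 8, "Name")]}),    # well-formed
    ("Jane Doe", {"entities": [(0, 20, "Name")]}),   # end offset past the text
]
valid_records = [r for r in records if is_valid_record(r)]
print(len(valid_records), "of", len(records), "records are valid")
```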

## Model building and training

In [104]:
# creating a blank English spaCy pipeline
nlp = spacy.blank('en')

def train_model(resume_train_data):
    '''
    This function trains the NER component of the spaCy model on the new entities
    '''

    # add the NER component to the blank model, or reuse it if it already exists
    if 'ner' not in nlp.pipe_names:
        ner = nlp.add_pipe('ner')
    else:
        ner = nlp.get_pipe('ner')

    # register the user-defined entity labels with the NER component
    for _, annotation in resume_train_data:
        for ent in annotation['entities']:
            ner.add_label(ent[2])

    # a pipeline may contain other components; disable them so that only NER is updated
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

    with nlp.select_pipes(disable=other_pipes):  # only train NER
        optimizer = nlp.initialize()  # spaCy v3 replacement for the deprecated begin_training()
        for itn in range(10):
            print('starting iteration ' + str(itn))
            random.shuffle(resume_train_data)  # shuffle the training data each iteration
            losses = {}
            skipped = 0
            for text, annotations in resume_train_data:
                try:
                    # create an Example from the raw text and its annotations
                    doc = nlp.make_doc(text)
                    example = Example.from_dict(doc, annotations)
                    # update the model with the text and its entities
                    nlp.update([example], sgd=optimizer, losses=losses, drop=0.3)
                except Exception:
                    # count records that fail instead of silently ignoring them
                    skipped += 1

            print(losses, 'skipped records:', skipped)

In [105]:
def trim_entity_spans(data: list) -> list:
    '''
    Removes leading and trailing whitespace from entity spans and returns the cleaned data.
    '''
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])

    return cleaned_data

In [106]:
# trim the input data
resume_train_data_cleaned = trim_entity_spans(resume_train_data)
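Trimming whitespace does not catch spans that overlap each other, which is another common reason a record fails during NER training. As a minimal sketch, assuming spans are `(start, end, label)` triples with half-open offsets, the hypothetical helper below flags every span that starts before an earlier span has ended.

```python
def find_overlapping_spans(entities):
    """Return the spans that begin before an earlier span (in start order) has ended."""
    ordered = sorted(entities, key=lambda span: (span[0], span[1]))
    flagged = []
    furthest_end = -1  # largest end offset seen so far
    for start, end, label in ordered:
        if start < furthest_end:
            flagged.append((start, end, label))
        furthest_end = max(furthest_end, end)
    return flagged

# Hypothetical spans, for illustration only.
spans = [(0, 20, 'Name'), (2, 5, 'Degree'), (25, 30, 'Location')]
print(find_overlapping_spans(spans))
```

Records with flagged spans could be reviewed by hand or dropped before training. Which span to keep is a labelling decision that this sketch does not make.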

In [None]:
# build and train a spaCy model on the trimmed data
train_model(resume_train_data_cleaned)

Training takes some time, so I store the trained model on disk afterwards for easy retrieval later.

## Export Model

In [108]:
# store this model to the disk
nlp.to_disk('nlp_model_with_100_resumes_tuned')