# NLP - Session 17 - CV and Resume Summarization

#### Resume NER Training
In this blog, we build a spaCy model that extracts the main points from a resume. We train the model on about 200 resumes. Once it is trained, we extract the text from a new resume and pass it to the model to get the summary.

Collecting training data is a crucial step when building any machine learning model, and it can be a painful process. In this project, we use about 200 resumes to train our model.

You can download the dataset from https://github.com/laxmimerit/Resume-and-CV-Summarization-and-Parsing-with-Spacy-in-Python

For the training data format, follow the example here: https://spacy.io/usage/training#training-data

In [None]:
import pickle
import random

import spacy

We will load the training data. Each example is a pair: the text of a resume extracted from a PDF file, followed by a dictionary whose `entities` list gives the start index, end index, and label of each value in the resume text. In the example given below, `Companies worked at` is a custom label and there are multiple values for it in the resume.

In [None]:
# Load the pickled training data; the context manager closes the file afterwards
with open("data/cv-resume.pkl", "rb") as f:
    train_data = pickle.load(f)
train_data[0]
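To make the format concrete, here is a small hand-made example in the same shape, together with a hypothetical helper `check_offsets` that confirms every span falls inside the text. The sample sentence, the `Name` label, and the helper name are made up for illustration; they are not taken from the dataset.

```python
# A made-up example in the same (text, {"entities": [...]}) shape as train_data.
sample = (
    "Alice Clark worked at Microsoft and Infosys.",
    {
        "entities": [
            (0, 11, "Name"),
            (22, 31, "Companies worked at"),
            (36, 43, "Companies worked at"),
        ]
    },
)


def check_offsets(example):
    """Return (label, value) pairs, raising if a span falls outside the text."""
    text, annotation = example
    pairs = []
    for start, end, label in annotation["entities"]:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"bad span {(start, end)} for label {label!r}")
        pairs.append((label, text[start:end]))
    return pairs


for label, value in check_offsets(sample):
    print(f"{label}: {value}")
```

Running the same helper over every item in `train_data` is a quick way to spot annotations whose indices do not match the text.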

We will first load a blank spaCy English model. Then we will write a function that takes the training data as input. In the function, we first add an `ner` (Named Entity Recognition) component at the last position in the pipeline. Then we add our custom labels to that component.

Now we prepare for training. We disable all the pipeline components except `ner`, because `ner` is the only component we train. We train for 10 iterations. At each iteration we go through the whole training data again, shuffled first so that the model doesn’t make any generalizations based on the order of examples. Another technique to improve the learning results is to set a dropout rate, a rate at which to randomly “drop” individual features and representations. This makes it harder for the model to memorize the training data. We use a dropout of 0.2, which means that each feature or internal representation has a 1/5 likelihood of being dropped.
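As a rough illustration of what a 0.2 dropout rate means, the sketch below zeroes each value in a list with probability 0.2. It is a toy model of the idea, not spaCy's internal implementation, and the helper name `drop_features` is hypothetical.

```python
import random


def drop_features(features, rate, rng):
    """Zero each feature independently with probability `rate` (toy dropout)."""
    return [0.0 if rng.random() < rate else value for value in features]


rng = random.Random(0)  # fixed seed so the run is repeatable
features = [1.0] * 10_000
dropped = drop_features(features, rate=0.2, rng=rng)

fraction_dropped = dropped.count(0.0) / len(dropped)
print(f"fraction dropped: {fraction_dropped:.3f}")  # close to 0.2
```

Because a different subset is dropped on every update, the model cannot rely on any single feature being present.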

Lastly, we will train the model on our data.

In [None]:
nlp = spacy.blank("en")


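def train_model(train_data):
    # NOTE: this cell uses the spaCy v2 training API (create_pipe, begin_training).
    # Reuse an existing NER component if there is one, so `ner` is always defined
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe("ner")

    # Register every custom label found in the annotations
    for _, annotation in train_data:
        for ent in annotation["entities"]:
            ner.add_label(ent[2])

    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(10):
            print("Starting iteration " + str(itn))
            random.shuffle(train_data)  # shuffles the list in place
            losses = {}
            skipped = 0
            for text, annotations in train_data:
                try:
                    nlp.update(
                        [text],  # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,  # dropout - make it harder to memorise data
                        sgd=optimizer,  # callable to update weights
                        losses=losses,
                    )
                except Exception:
                    # Some examples fail to update; count them instead of hiding them
                    skipped += 1

            print(losses, "skipped examples:", skipped)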
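The `try`/`except` in the training loop means that any example which raises an error is skipped. One possible cause, offered here as an assumption rather than a diagnosis of this dataset, is an annotation whose entity spans overlap, which spaCy's `doc.ents` cannot represent. A minimal sketch of a check for that, using a hypothetical helper `find_overlaps` and made-up spans:

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that overlap each other."""
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    overlaps = []
    for i, current in enumerate(ordered):
        for later in ordered[i + 1:]:
            if later[0] >= current[1]:
                break  # spans are sorted by start, so no later span can overlap
            overlaps.append((current, later))
    return overlaps


# Made-up spans: the second one starts before the first one ends.
entities = [(0, 11, "Name"), (6, 20, "Designation"), (25, 34, "Location")]
print(find_overlaps(entities))
```

Calling `find_overlaps(annotation["entities"])` for each item in the training data before training would show how many examples are affected.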

In [None]:
train_model(train_data)

Training takes a long time, so we save the trained model to disk for later use.

In [None]:
nlp.to_disk("nlp_model")

Now we will load the saved model into nlp_model.

In [None]:
nlp_model = spacy.load("nlp_model")

This is the first resume in our training data. Because `train_model()` calls `random.shuffle(train_data)`, which reorders the list in place, the resume at the first position is now different from the one shown when we loaded the data.

In [None]:
train_data[0][0]
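If you would rather keep the original order of the data, one option is to shuffle a copy instead of the list itself. The sketch below uses a stand-in list rather than the real `train_data`.

```python
import random

resumes = ["resume A", "resume B", "resume C", "resume D"]  # stand-in data

# random.sample with k=len(...) returns a new shuffled list; the input is untouched
shuffled = random.sample(resumes, k=len(resumes))

print(resumes[0])  # still "resume A"
print(sorted(shuffled) == sorted(resumes))  # True: same items, possibly new order
```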

Now we will pass this resume to our model and see the results. The f-string pads each label to a width of 30 characters so that the output lines up in columns.

In [None]:
doc = nlp_model(train_data[0][0])
for ent in doc.ents:
    print(f"{ent.label_.upper():{30}}- {ent.text}")
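The format spec `:{30}` in the f-string above is a nested field that evaluates to the width `30`, so each label is left-aligned and padded with spaces to 30 characters. A small standalone illustration, with a made-up label and value:

```python
label = "Skills"  # made-up entity label
value = "Python"  # made-up entity text

line = f"{label.upper():{30}}- {value}"
print(line)
print(len(f"{label.upper():{30}}"))  # 30
```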

Now we will test our model on an unseen resume. Because the resume is a PDF file, we first extract its text using PyMuPDF. Then we pass the text to our model and see the results.

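You can install PyMuPDF with the following commands. The first one activates the conda environment used in this session (`tensorflow20`); change it to match your own environment.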

`conda activate tensorflow20`

`!pip install PyMuPDF`

We open `Alice Clark CV.pdf` and extract the text of each page using `getText()`. After that, we replace the newline characters `'\n'` in the text with spaces.

In [None]:
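import fitz  # PyMuPDF

fname = "data/Alice Clark CV.pdf"
doc = fitz.open(fname)
# Concatenate the text of every page
# (newer PyMuPDF releases rename getText() to get_text())
text = ""
for page in doc:
    text = text + str(page.getText())

# Replace newlines with spaces so the resume becomes one long line
tx = " ".join(text.split("\n"))
print(tx)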
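Note that joining on `'\n'` alone leaves doubled spaces wherever the extracted text has blank lines or trailing spaces. Whether that matters depends on how the training text was prepared, which is not described here, so treat the alternative below as something to try rather than a fix. The sample string is made up.

```python
raw = "Alice Clark\n\nAI / Machine Learning \nDelhi, India\n"  # made-up page text

joined = " ".join(raw.split("\n"))  # the approach used in this notebook
collapsed = " ".join(raw.split())  # collapses every run of whitespace

print(repr(joined))
print(repr(collapsed))
```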

Now we will pass the extracted text to our model and get the summary.

In [None]:
doc = nlp_model(tx)
for ent in doc.ents:
    print(f"{ent.label_.upper():{30}}- {ent.text}")
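If you want the summary as a data structure rather than printed lines, the `(label, text)` pairs can be grouped by label. A minimal sketch, assuming a hypothetical helper `group_entities` and made-up pairs standing in for `[(ent.label_, ent.text) for ent in doc.ents]`:

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (label, text) pairs into {label: [text, ...]}, keeping order."""
    grouped = defaultdict(list)
    for label, text in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Made-up pairs; with the trained model you would build them from doc.ents.
pairs = [
    ("Name", "Alice Clark"),
    ("Companies worked at", "Microsoft"),
    ("Companies worked at", "Infosys"),
]
summary = group_entities(pairs)
print(summary)
```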

To get a more accurate summary, you can train the model on more data samples and include different kinds of resumes in the training set.