# Problem Statement

Dictionary-based approaches don't work in a fast-evolving professional landscape. CutShort gets thousands of resumes every day. How would you learn new skills automatically from them?

# About

Hiring partners usually look for candidates with a specific set of skills. Parsing resumes and extracting information about the individual is a good start, but learning new skills from resumes is a potential game changer for any hiring organisation: skills keep evolving over time, and candidates very often forget to list skills relevant to their experience. Automatically learning new skills is therefore a win-win for both candidates and the hiring team.

# Phase 1: Business Understanding

## Isolate Business Units 

In this problem, the business units of interest are **Skills**.

## Objectives

Our business objective is to learn new skills from resumes.

# Phase 2: Data Understanding

## Look at Data

The data comes from the Dataturks repository [Entity Recognition In Resumes Data](https://github.com/DataTurks-Engg/Entity-Recognition-In-Resumes-SpaCy), which contains 220 annotated resumes.

For this challenge, we are learning new skills from resumes.

The key fields in this dataset are:
- content - text of the resume
- annotation - labelled spans such as Name, Organisation, Skills, etc.

We are only interested in skills at this moment.
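Each record's `annotation` field is a list of dictionaries, each with a `label` list and a `points` list of spans. The following is a minimal sketch of collecting every span that carries a given label. The helper name `texts_for_label` and the sample record are illustrative assumptions, not part of the dataset or its tooling.

```python
def texts_for_label(annotations, wanted="Skills"):
    """Return the text of every annotated span whose label list contains `wanted`."""
    texts = []
    for ann in annotations or []:  # a record may have no annotations at all
        if wanted in (ann.get("label") or []):
            for point in ann.get("points", []):
                texts.append(point["text"])
    return texts


# Hypothetical record shaped like one line of the JSON file
sample = {
    "content": "Jane Doe\nPython developer\nSkills: Python, SQL",
    "annotation": [
        {"label": ["Name"], "points": [{"start": 0, "end": 7, "text": "Jane Doe"}]},
        {"label": ["Skills"], "points": [{"start": 34, "end": 44, "text": "Python, SQL"}]},
    ],
}

skills = texts_for_label(sample["annotation"])
print(skills)
```

Unlike a lookup on the first annotation only, this walks every annotation of a record, so skills are found regardless of where they sit in the list.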

## Imports

In [12]:
import pandas as pd
from collections import Counter
import spacy,re,json
from spacy.gold import GoldParse
from spacy.scorer import Scorer
from spacy import displacy  

from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

In [13]:
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

## Exploratory Data Analysis

In [14]:
PATH ='/Users/raj/Desktop/ML-Learning_skills/data/entity_recognition_in_resumes.json'

with open(PATH, 'r') as f:
    lines = f.readlines()
df=[]
for line in lines:
    data = json.loads(line)
    df.append(data)
    
df=pd.DataFrame(df)

In [15]:
df.head(5)

| | annotation | content | extras |
|---|---|---|---|
| 0 | [{'label': ['Skills'], 'points': [{'start': 12... | Abhishek Jha\nApplication Development Associat... | |
| 1 | [{'label': ['Email Address'], 'points': [{'sta... | Afreen Jamadar\nActive member of IIIT Committe... | |
| 2 | [{'label': ['Skills'], 'points': [{'start': 37... | Akhil Yadav Polemaina\nHyderabad, Telangana - ... | |
| 3 | [{'label': ['Skills'], 'points': [{'start': 80... | Alok Khandai\nOperational Analyst (SQL DBA) En... | |
| 4 | [{'label': ['Degree'], 'points': [{'start': 20... | Ananya Chavan\nlecturer - oracle tutorials\n\n... | |


In [16]:
# Number of Rows and columns
df.shape

(220, 3)

In [17]:
# Unique labels of the first annotation in each resume
first_labels = []
for annotations in df['annotation']:
    label = str(annotations[0]['label']).strip('[]')
    first_labels.append(label)

set(first_labels)

{'',
 "'College Name'",
 "'Companies worked at'",
 "'Degree'",
 "'Designation'",
 "'Email Address'",
 "'Graduation Year'",
 "'Location'",
 "'Name'",
 "'Skills'",
 "'Years of Experience'"}

In [18]:
# Skill texts from resumes whose first annotation is labelled 'Skills'
uniq_skills = []
for annotations in df['annotation']:
    label = str(annotations[0]['label']).strip('[\'\']')
    if label == 'Skills':
        for point in annotations[0]['points']:
            uniq_skills.append(point['text'])
print(set(uniq_skills))

{'CRM (3 years), DATABASE (3 years), ORACLE (3 years), Tosca (3 years), Automation Testing (3\nyears), Selenium (1 year), Core Java (1 year)\n\nADDITIONAL INFORMATION\n\nKey Skills:\n❖ Software tools: IBM Rational Collaborative Lifecycle Management\n❖ Testing Tool: IBM Rational Quality Management on Jazz Server\n❖ Test Automation Tools: TOSCA, Selenium\n❖ Programming Language: Core Java\n❖ IDE: Eclipse\n❖ Database: Oracle, EDB, Sqlserver\n❖ Database Tools: SQL Developer, Toad, Tora\n❖ Software tools: Filezilla, MobaXterm, Putty, Office tools\n❖ Platforms: Windows, UNIX\n❖ Domain Software Knowledge: Finacle Core Banking Solution, Finacle CRM Solution.\n\nSkills: Fast learner, leadership quality, team player, presentation skills, work devotee, punctual,\ngood communication and listening skills.', '\nTypewriting, Editing\n', 'excel, powerpoint, vlookup, formula, filters, paint, recruitment, (1 year)\n', 'Linux (Less than 1 year), Microsoft Office (Less than 1 year), MS OFFICE (Less than 1

Note that the annotated skills are not individual skills: annotators labelled whole lines, or parts of lines, in the skills section of the resume, including punctuation and newline characters. Such spans are poor input for a spaCy NER model.

# Phase 3 : Data Preparation

We will now re-annotate the training data with the open-source tool [doccano](https://github.com/chakki-works/doccano). We do not know what guidelines the Dataturks annotators were given, and we observed that they labelled whole lines of skills, together with punctuation and newline characters, rather than individual skills. We want to avoid both when we annotate our training data.

**Guideline**: Label individual skills wherever they appear in the resume content, not just in the skills section, and exclude punctuation and newline characters.
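The guideline can also be checked mechanically after annotation. The sketch below shrinks a character span until it no longer starts or ends with whitespace or punctuation. The helper `trim_span` and the sample text are hypothetical and only illustrate the idea.

```python
import string

# Characters the guideline asks annotators to leave out of a skill span
STRIP_CHARS = string.whitespace + string.punctuation


def trim_span(text, start, end):
    """Shrink [start, end) so it neither starts nor ends with whitespace or punctuation."""
    while start < end and text[start] in STRIP_CHARS:
        start += 1
    while end > start and text[end - 1] in STRIP_CHARS:
        end -= 1
    return start, end


text = "SKILLS  Java, Python\n"
start, end = trim_span(text, 14, 21)  # the raw span covers "Python\n"
print(repr(text[start:end]))
```

Skills such as C++ or C# legitimately end in punctuation, so a real cleanup step would need an exception list before trimming blindly.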
    
The key fields in the annotated data are:

- id - serial number assigned by the doccano annotation tool
- text - content of the resume
- meta - no metadata in our case
- annotation_approver - null in our case
- labels - skills annotated in each resume's content

![alt text](annotator_1.png "Doccano Annotator")

![alt text](annotator_2.png "Annotating")

In [19]:
# Replace newline characters with spaces in the resume text before annotating
PATH ='/Users/raj/Desktop/ML-Learning_skills/data/entity_recognition_in_resumes.json'

with open(PATH, 'r') as f:
    lines = f.readlines()
d=[]
for line in lines:
    data = json.loads(line)
    text = data['content']
    d.append(text.replace('\n', ' '))
    
d=pd.DataFrame(d)
d.to_csv('/Users/raj/Desktop/ML-Learning_skills/data/annotate_resume.csv',index=False)

Upload this CSV file to doccano and, once annotation is done, export the result from the tool as a JSON file.

In [73]:
# Preparing Training data for NER
PATH ='/Users/raj/Desktop/ML-Learning_skills/data/file_ner.json1'

annotated_data = []
lines=[]
with open(PATH, 'r') as f:
    lines = f.readlines()

for line in lines:
    data = json.loads(line)
    # each entry of data['labels'] is [start, end, label], the entity format spaCy expects
    annotated_data.append((data['text'], {"entities": data['labels']}))
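Before training, it can help to sanity-check the exported spans, since out-of-range or overlapping entity offsets typically cause errors or get dropped during NER training. The following is a minimal sketch under that assumption. The helper `find_bad_spans` and the sample data are hypothetical.

```python
def find_bad_spans(text, entities):
    """Return spans that are empty, fall outside the text, or overlap an earlier span."""
    bad = []
    last_end = 0
    for start, end, label in sorted(entities, key=lambda e: (e[0], e[1])):
        if start < 0 or end > len(text) or start >= end:
            bad.append((start, end, label))      # empty or out of range
        elif start < last_end:
            bad.append((start, end, label))      # overlaps the previous accepted span
        else:
            last_end = end
    return bad


# Hypothetical example in the (text, [start, end, label]) shape used above
text = "Java and SQL developer"
entities = [[0, 4, "Skills"], [9, 12, "Skills"], [2, 6, "Skills"], [20, 40, "Skills"]]
problems = find_bad_spans(text, entities)
print(problems)
```

Running such a check over every `(text, {"entities": ...})` pair in `annotated_data` would flag records worth re-opening in the annotation tool.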

In [74]:
annotated_data[0]

('"Abhishek Jha Application Development Associate - Accenture  Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a  • To work for an organization which provides me the opportunity to improve my skills and knowledge for my individual and company\'s growth in best possible ways.  Willing to relocate to: Bangalore, Karnataka  WORK EXPERIENCE  Application Development Associate  Accenture -  November 2017 to Present  Role: Currently working on Chat-bot. Developing Backend Oracle PeopleSoft Queries for the Bot which will be triggered based on given input. Also, Training the bot for different possible utterances (Both positive and negative), which will be given as input by the user.  EDUCATION  B.E in Information science and engineering  B.v.b college of engineering and technology -  Hubli, Karnataka  August 2013 to June 2017  12th in Mathematics  Woodbine modern school  April 2011 to March 2013  10th  Kendriya Vidyalaya  April 2001 to March 2011  SKILLS  C (

We have annotated 220 resumes for skills. We will now train our model on 200 resumes and test it on the remaining 20.

In [79]:
training_data=  annotated_data[0:200]
test_data = annotated_data[200:220]

In [80]:
print(len(training_data), len(test_data))

200 20
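The slice above holds out the last 20 resumes in file order. The first rows of the data appear to be sorted by candidate name, so a seeded shuffle before splitting may give a more representative test set. The sketch below shows one way to do that. The helper `shuffled_split` and the seed are assumptions, not what was used to produce the results in this notebook.

```python
import random


def shuffled_split(items, n_test, seed=0):
    """Return (train, test) after shuffling a copy of `items` with a fixed seed."""
    shuffled = list(items)
    if not 0 < n_test < len(shuffled):
        raise ValueError("n_test must be between 1 and len(items) - 1")
    random.Random(seed).shuffle(shuffled)  # private RNG, global random state untouched
    return shuffled[:-n_test], shuffled[-n_test:]


# Stand-in for annotated_data: 220 hypothetical record indices
train, test = shuffled_split(range(220), n_test=20)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models trained at different times.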


# Phase 4: Modeling

In this phase, we train the model on the 200 annotated resumes for 100 iterations. Training was run on Google Colab with a GPU runtime to speed it up, and the saved model is then used here for predictions.

In [None]:
def main(model=None, output_dir=None, n_iter=100):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in training_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(training_data)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(training_data, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)

    # test the trained model
    for text, _ in training_data:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in training_data:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

# trained this model with 100 iterations in google colab using GPU runtime with loss: Losses {'ner': 465.24571515528373} 
main(output_dir='/Users/raj/Desktop/ML-Learning_skills/ner_model/') 
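The training loop draws its batch sizes from `compounding(4.0, 32.0, 1.001)`. The pure-Python sketch below mirrors the documented behaviour of that schedule (start value, multiplied by the compound rate at every step, capped at the stop value) to show how slowly it grows. `compounding_sketch` is a hypothetical re-implementation for illustration, not spaCy's code, and the integer truncation of batch sizes is an assumption.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# First few batch sizes of the schedule used in main()
first_sizes = list(islice(compounding_sketch(4.0, 32.0, 1.001), 3))

# Number of batches before the truncated size would grow from 4 to 5
steps_until_5 = next(
    i for i, v in enumerate(compounding_sketch(4.0, 32.0, 1.001)) if int(v) >= 5
)
print(first_sizes, steps_until_5)
```

With 200 training examples and a schedule that is re-created on every iteration of the outer loop, each epoch has roughly 50 batches, so under these assumptions the batch size would stay at 4 throughout. A larger compound rate, or a schedule created once outside the loop, would let it grow.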

In [96]:
# Loading the saved model
ner_model = spacy.load('/Users/raj/Desktop/ML-Learning_skills/ner_model/')

In [447]:
# test the saved model
test_text = '"Abhishek Jha Application Development Associate - Accenture  Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a  • To work for an organization which provides me the opportunity to improve my skills and knowledge for my individual and company\'s growth in best possible ways.  Willing to relocate to: Bangalore, Karnataka  WORK EXPERIENCE  Application Development Associate  Accenture -  November 2017 to Present  Role: Currently working on Chat-bot. Develop Keras models for the Bot which will be triggered based on given input. Also, Training the bot for different possible utterances (Both positive and negative), which will be given as input by the user.  EDUCATION  B.E in Information science and engineering  B.v.b college of engineering and technology -  Hubli, Karnataka  August 2013 to June 2017  12th in Mathematics  Woodbine modern school  April 2011 to March 2013  10th  Kendriya Vidyalaya  April 2001 to March 2011  SKILLS  KERAS (Less than 1 year), Database (Less than 1 year), Database Management (Less than 1 year), Database Management System (Less than 1 year), Java (Less than 1 year)  ADDITIONAL INFORMATION  Technical Skills  https://www.indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a?isid=rex-download&ikw=download-top&co=IN   • Programming language: C, C++, Java • Oracle PeopleSoft • Internet Of Things • Machine Learning • Database Management System • Computer Networks • Operating System worked on: Linux, Windows, Mac  Non - Technical Skills  • Honest and Hard-Working • Tolerant and Flexible to Different Situations • Polite and Calm • Team-Player"'

I added the skill **KERAS** to this resume content; it was not seen during labelling. Let's see whether our model can pick it up as a new skill.

In [448]:
doc = ner_model(test_text)

In [449]:
displacy.render(doc, style="ent")

- We observe that the model identified **KERAS** as a new skill in the resume.

# Phase 5: Evaluation

Our goal was to learn new skills from resumes. So, how well does our model do that?

We will test our trained model on the 20 annotated test resumes, which the model did not see during training.

In [370]:
# Test the model and evaluate it
nlp = spacy.load('/Users/raj/Desktop/ML-Learning_skills/ner_model/')
examples = test_data
c = 0  # index of the current resume, used in the output file name
for text, annot in examples:
    doc_to_test = nlp(text)

    # Write the predicted entities of this resume, grouped by label, to resume<c>.txt
    d = {}
    for ent in doc_to_test.ents:
        d.setdefault(ent.label_, []).append(ent.text)
    with open("resume" + str(c) + ".txt", "w") as out:
        for i in set(d.keys()):
            out.write("\n\n")
            out.write(i + ":" + "\n")
            for j in set(d[i]):
                out.write(j.replace('\n', '') + "\n")

    # Token-level scores per label: [computed flag, precision, recall, F-score, accuracy, count]
    # d is re-created for every resume, so the summary below reports the last resume processed
    d = {}
    for ent in doc_to_test.ents:
        d[ent.label_] = [0, 0, 0, 0, 0, 0]
    for ent in doc_to_test.ents:
        doc_gold_text = nlp.make_doc(text)
        gold = GoldParse(doc_gold_text, entities=annot.get("entities"))
        y_true = [ent.label_ if ent.label_ in x else 'Not ' + ent.label_ for x in gold.ner]
        y_pred = [x.ent_type_ if x.ent_type_ == ent.label_ else 'Not ' + ent.label_ for x in doc_to_test]
        if d[ent.label_][0] == 0:
            # fscore is named so that it does not shadow a file handle
            (p, r, fscore, s) = precision_recall_fscore_support(y_true, y_pred, average='weighted')
            a = accuracy_score(y_true, y_pred)
            d[ent.label_][0] = 1
            d[ent.label_][1] += p
            d[ent.label_][2] += r
            d[ent.label_][3] += fscore
            d[ent.label_][4] += a
            d[ent.label_][5] += 1
    c += 1
for i in d:
    print("\n For Entity " + i + "\n")
    print("Accuracy : " + str((d[i][4] / d[i][5]) * 100) + "%")
    print("Precision : " + str(d[i][1] / d[i][5]))
    print("Recall : " + str(d[i][2] / d[i][5]))
    print("F-score : " + str(d[i][3] / d[i][5]))


 For Entity Skills

Accuracy : 98.7878787878788%
Precision : 0.9880266075388027
Recall : 0.9878787878787879
F-score : 0.9848856664807586
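The scores above are computed per token with a weighted average, so the many tokens that are not skills contribute heavily to the result. A stricter complementary view is exact-match scoring per entity, pooled over all test resumes. The sketch below assumes gold and predicted spans are available as `(start, end, label)` triples. The helper `entity_prf` and the sample spans are hypothetical.

```python
def entity_prf(gold_spans, pred_spans):
    """Exact-match precision, recall and F1 pooled over per-document span lists."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_spans, pred_spans):
        gold = {tuple(s) for s in gold}  # tuples so that list-style spans are hashable
        pred = {tuple(s) for s in pred}
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical spans for two resumes
gold = [[(0, 4, "Skills"), (9, 12, "Skills")], [(5, 11, "Skills")]]
pred = [[(0, 4, "Skills")], [(5, 11, "Skills"), (20, 25, "Skills")]]
precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, f1)
```

With the trained model, the predicted spans for one resume could be taken as `(ent.start_char, ent.end_char, ent.label_)` for each `ent` in `doc.ents`, and the gold spans from the `"entities"` list of the matching test record.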


In [385]:
# Collect every skill the model predicts in the test data, lowercased
test_skills = []
for text, _ in test_data:
    doc = nlp(text)
    for ent in doc.ents:
        test_skills.append(ent.text.lower())
print(set(test_skills))

{'operations', 'html', 'ms-access', 'sap bi', 'server management', 'pl/sql', 'microsoft office', 'microsoft azure', '− conversant', 'asp.net', 'excel', 'web development', 'data management', 'ms office', 'siem', 'css3', 'angularjs', 'ms-office', 'css', 'splunk', 'sql', 'network management', 'manual testing', 'html5', 'network media', 'javascript', 'word', 'power point', 'windows', '5', 'javascript/jquery', 'derivatives', 'cfa', 'network according', 'c', 'auditing', 'data backup', 'c#', 'sap hana', 'project management', 'java', 'lan', 'sql server', 'microsoft office 2007', 'software management', 'exchange', 'unix', 'vba', 'web server', 'incident management', 'sap', 'powerpoint'}
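To decide which of these predictions are actually new, they can be compared against the skills already known, for example the span texts annotated in the training data. The following is a minimal sketch of that comparison. The helpers `normalise` and `new_skills` and the sample lists are hypothetical.

```python
def normalise(skill):
    """Lowercase and collapse whitespace so 'MS  Office' and 'ms office' compare equal."""
    return " ".join(skill.lower().split())


def new_skills(predicted, known):
    """Return the predicted skills whose normalised form is not in the known set."""
    known_norm = {normalise(k) for k in known}
    return sorted({normalise(p) for p in predicted} - known_norm)


# Hypothetical inputs: known skills from training annotations, predictions from the model
known = ["Java", "SQL", "MS Office"]
predicted = ["java", "Keras", "ms  office", "Splunk"]
result = new_skills(predicted, known)
print(result)
```

Spelling variants such as 'ms-office' and 'ms office' in the output above would still count as different skills here, so a production version would likely need fuzzier matching.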


# Phase 6: Deployment

The trained model can be deployed on a cloud provider such as AWS.

# Credits      

- [Automatic Summarization of Resumes using Spacy](https://medium.com/@dataturks/automatic-summarization-of-resumes-with-ner-8b97a5f562b)
- [Training NER models using Spacy](https://spacy.io/usage/training#ner)