# Resume and CV Parsing with spaCy

Resume parsing is the process of converting unstructured resume data into a structured format.

Applicants' resumes differ in presentation, design, fonts, and layout.

An ideal system should extract the essential content of these resumes, such as the candidate's experience, skills, and academic record, as quickly as possible and regardless of how each resume looks, so that recruiters can work from the extracted information directly.

## What is NER? 

Named Entity Recognition (NER) is a task in which a model takes a string of text (a sentence or a paragraph) as input and identifies the spans that name entities such as people, places, and organizations, along with other domain-specific terms.
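To make this concrete, the following minimal sketch represents entities the way the training data in this notebook does, as `(start, end, label)` character offsets into the text. The sample sentence, its offsets, and the helper name `entity_texts` are made up for illustration.

```python
# Illustrative only: a made-up sentence with hand-written entity offsets.
text = "Alice Clark works at Microsoft in Delhi"

# Each entity is (start_char, end_char, label); end_char is exclusive.
entities = [
    (0, 11, "Name"),
    (21, 30, "Companies worked at"),
    (34, 39, "Location"),
]


def entity_texts(text, entities):
    """Return (label, surface text) pairs by slicing the character offsets."""
    return [(label, text[start:end]) for start, end, label in entities]


for label, span in entity_texts(text, entities):
    print(f"{label:20} - {span}")
```

Slicing the text with the offsets is also a quick way to spot annotations that start or end in the wrong place.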

## Data Preparation 

In [1]:
import spacy
import pickle
import random

In [2]:
# Use a context manager so the file handle is closed after loading.
with open('train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)

In [8]:
train_data[0][1].get('entities')

[(1749, 1755, 'Companies worked at'),
 (1696, 1702, 'Companies worked at'),
 (1417, 1423, 'Companies worked at'),
 (1356, 1793, 'Skills'),
 (1209, 1215, 'Companies worked at'),
 (1136, 1248, 'Skills'),
 (928, 932, 'Graduation Year'),
 (858, 889, 'College Name'),
 (821, 856, 'Degree'),
 (787, 791, 'Graduation Year'),
 (744, 750, 'Companies worked at'),
 (722, 742, 'Designation'),
 (658, 664, 'Companies worked at'),
 (640, 656, 'Designation'),
 (574, 580, 'Companies worked at'),
 (555, 573, 'Designation'),
 (470, 493, 'Companies worked at'),
 (444, 469, 'Designation'),
 (308, 314, 'Companies worked at'),
 (234, 240, 'Companies worked at'),
 (175, 198, 'Companies worked at'),
 (93, 137, 'Email Address'),
 (39, 48, 'Location'),
 (13, 38, 'Designation'),
 (0, 12, 'Name')]
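Some spans in the list above overlap: `(1356, 1793, 'Skills')` contains `(1749, 1755, 'Companies worked at')`, for example. spaCy's NER expects non-overlapping entities, so such examples are likely to be rejected during training. A minimal sketch of one possible clean-up follows. The helper `drop_overlaps` and its "longest span wins" policy are assumptions, not part of the original notebook; keeping the shorter spans instead may suit this dataset better.

```python
def drop_overlaps(entities):
    """Keep a non-overlapping subset of (start, end, label) spans.

    Longer spans win; this is one possible policy, not the only one.
    """
    kept = []
    # Visit the longest spans first so they take priority over shorter ones.
    for start, end, label in sorted(entities, key=lambda e: (e[0] - e[1], e[0])):
        # Two half-open spans are disjoint when one ends before the other starts.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


sample = [
    (1749, 1755, "Companies worked at"),
    (1696, 1702, "Companies worked at"),
    (1356, 1793, "Skills"),
    (928, 932, "Graduation Year"),
]
print(drop_overlaps(sample))
```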

## NER with spaCy

In [11]:
import warnings

In [14]:
nlp = spacy.blank('en')


def train_model(train_data):
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")
        
    # add labels
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    
    #--------------------------
    
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
   
    # only train NER
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        # reset and initialize the weights randomly – but only if we're
        # training a new model
        optimizer = nlp.begin_training()
            
        for itn in range(20):
            print('Starting iterations ', str(itn))
            random.shuffle(train_data)
            
            
            losses = {}
            index = 0
            for text, annotations in train_data:
                print(index)
                index = index + 1
                try:
                    nlp.update(
                        [text],  # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,  # dropout - make it harder to memorise data
                        sgd=optimizer,  # callable to update weights
                        losses=losses)
                except Exception:
                    # Skip examples spaCy cannot process, e.g. resumes whose
                    # annotated entity spans overlap.
                    pass
                
            print(losses)




In [15]:
train_model(train_data)

Starting iterations  0
0
  gold = GoldParse(doc, **gold)
...
{'ner': 11544.824150722769}
Starting iterations  1
...
{'ner': 12812.486236617091}
Starting iterations  2
...
{'ner': 8085.747669925401}
Starting iterations  3
...
{'ner': 3840.9885842388862}


In [18]:
nlp.to_disk('nlp_model')

## Model Testing 

In [19]:
nlp_model = spacy.load('nlp_model')

In [22]:
text = train_data[0][0]

In [25]:
doc = nlp_model(text)
for ent in doc.ents:
    print(f'{ent.label_.upper():{30}}- {ent.text}')

NAME                          - Vijayalakshmi Govindarajan
DESIGNATION                   - SAP as a Consultant
COMPANIES WORKED AT           - SAP Basis
LOCATION                      - Chennai
EMAIL ADDRESS                 - indeed.com/r/Vijayalakshmi-Govindarajan/ d71bfb70a66b0046
COMPANIES WORKED AT           - SAP Basis
DEGREE                        - MCA in Computer Applications
COLLEGE NAME                  - Thiagarajar School of Management
DEGREE                        - BSc
COLLEGE NAME                  - Sri Sathya Sai Institute of Higher Learning
DEGREE                        - HSC
COLLEGE NAME                  - TVS Lakshmi Matriculation Higher Secondary School
DEGREE                        - TVS Lakshmi Matriculation Higher Secondary School -
SKILLS                        - JAVA (6 years), ORACLE (6 years), SAP (6 years), ABAP (Less than 1 year), ACCESS (Less than 1 year)  ADDITIONAL INFORMATION  TECHNICAL EXPERTISE
COMPANIES WORKED AT           - SAP Basis
COMPANIES WORKED
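The check above runs the model on `train_data[0]`, a resume it was trained on, so it says little about how the model handles unseen resumes. A minimal sketch of holding out part of the data before training is shown below. The 80/20 ratio, the seed, and the helper name `split_data` are assumptions.

```python
import random


def split_data(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of data and split it into (train, test) lists."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


# Stand-in for the (text, annotations) pairs loaded from train_data.pkl.
examples = [(f"resume {i}", {"entities": []}) for i in range(10)]
train_set, test_set = split_data(examples)
print(len(train_set), len(test_set))
```

With a split like this, `train_model` would be called on the training part only, and the held-out part used for checks such as the one above.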

## CV Parsing from PDF Data 

In [26]:
!pip install PyMuPDF
# PyPDF2 was tried first but did not extract the text properly.

Collecting PyMuPDF
  Downloading PyMuPDF-1.17.4-cp37-cp37m-win_amd64.whl (5.1 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.17.4


In [27]:
import fitz  # PyMuPDF is imported under the name fitz

In [28]:
fname = 'Alice Clark CV.pdf'
doc = fitz.open(fname)
text = ""

for page in doc:
    # getText() matches the PyMuPDF 1.17.x installed above;
    # later releases renamed it to get_text().
    text = text + str(page.getText())

In [30]:
text = " ".join(text.split('\n'))

In [31]:
text

'Alice Clark  AI / Machine Learning    Delhi, India Email me on Indeed  •  20+ years of experience in data handling, design, and development  •  Data Warehouse: Data analysis, star/snow flake scema data modelling and design specific to  data warehousing and business intelligence  •  Database: Experience in database designing, scalability, back-up and recovery, writing and  optimizing SQL code and Stored Procedures, creating functions, views, triggers and indexes.  Cloud platform: Worked on Microsoft Azure cloud services like Document DB, SQL Azure,  Stream Analytics, Event hub, Power BI, Web Job, Web App, Power BI, Azure data lake  analytics(U-SQL)  Willing to relocate anywhere    WORK EXPERIENCE  Software Engineer  Microsoft – Bangalore, Karnataka  January 2000 to Present  1. Microsoft Rewards Live dashboards:  Description: - Microsoft rewards is loyalty program that rewards Users for browsing and shopping  online. Microsoft Rewards members can earn points when searching with Bing, br
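Joining on newlines leaves runs of spaces in the text, as the output above shows. If a single-space form is preferred, a minimal sketch is to split on any whitespace instead. The helper name `normalize_whitespace` is hypothetical, and whether this changes the predictions depends on how the training text was normalized, which the notebook does not state.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


raw = "Alice Clark \nAI / Machine Learning \n \nDelhi, India"
print(normalize_whitespace(raw))
```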

In [32]:
doc = nlp_model(text)
for ent in doc.ents:
    print(f'{ent.label_.upper():{30}}- {ent.text}')

NAME                          - Alice Clark
LOCATION                      - Delhi
COMPANIES WORKED AT           - Microsoft
DESIGNATION                   - Software Engineer
COMPANIES WORKED AT           - Microsoft – Bangalore, Karnataka  January 2000 to Present  1. Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COMPANIES WORKED AT           - Microsoft
COLLEGE NAME                  - Indian Institute of Technology – Mumbai
SKILLS                        - SKILLS  Machine Learning, Natural Language Processing, and Big Data Handling    ADDITIONAL INFORMATION  Professional Skills  • Excellent analytical, problem solving, communication, knowledge transfer and interpersonal  skills with ability to interact with individuals at all the levels  • Quick learner and maintains cordial relationship with project manager and team members and  good performer both in team
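The predictions repeat some values several times (`Microsoft`, for example). To get closer to the structured format described at the start, a minimal sketch can group the entity texts by label and drop exact duplicates. `group_entities` is a hypothetical helper; it takes `(label, text)` pairs, which could be built from the model output as `[(ent.label_, ent.text) for ent in doc.ents]`.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (label, text) pairs into {label: [unique texts in first-seen order]}."""
    grouped = defaultdict(list)
    for label, text in pairs:
        value = " ".join(text.split())  # tidy internal whitespace
        if value not in grouped[label]:
            grouped[label].append(value)
    return dict(grouped)


# Sample pairs modeled on the output above.
pairs = [
    ("Name", "Alice Clark"),
    ("Location", "Delhi"),
    ("Companies worked at", "Microsoft"),
    ("Designation", "Software Engineer"),
    ("Companies worked at", "Microsoft"),
]
print(group_entities(pairs))
```

The resulting dictionary can be serialized to JSON or written to a table, one row per resume.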