# Problem Statement

We want to identify and parse key fields such as Name, Location, and Designation from resumes, which are unstructured text.
The goal of the project is to extract these key data points from a resume, and we plan to use NLP with spaCy to build the model.

Dataset: pre-annotated resume data, with the entities of each resume stored in a dictionary (extracted from Indeed resumes).

## Data Loading and Processing

In [1]:
# Importing the Libraries

import pandas as pd
import numpy as np
import re
import spacy
import pickle, random

In [2]:
# Load the Training Data

# Use a context manager so the file is closed after loading
with open('train_data.pkl', 'rb') as f:
    data = pickle.load(f)

In [3]:
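# Inspect a random training example
# (random.randint(0, len(data)) includes len(data) and can raise an IndexError, so random.choice is used)

random.choice(data)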

('Chaban kumar Debbarma Tripura - Email me on Indeed: indeed.com/r/Chaban-kumar-Debbarma/bf721c55fb380d19  Willing to relocate to: Agartala, Tripura - Tripura  WORK EXPERIENCE  Microsoft  -  June 2018 to December 2018  I want full time jobs  EDUCATION  10th  School  https://www.indeed.com/r/Chaban-kumar-Debbarma/bf721c55fb380d19?isid=rex-download&ikw=download-top&co=IN',
 {'entities': [(277, 328, 'Email Address'),
   (257, 263, 'College Name'),
   (251, 255, 'Degree'),
   (175, 185, 'Companies worked at'),
   (139, 147, 'Location'),
   (52, 103, 'Email Address'),
   (22, 30, 'Location'),
   (0, 21, 'Name')]})
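
Each annotation is a `(start, end, label)` tuple of character offsets into the resume text, so slicing the text with those offsets should recover the labelled string. A minimal sketch, using a hypothetical sample (`sample_text`, `sample_annotation`, and `spans_to_strings` are illustrative names, not part of the dataset):

```python
# Hypothetical example in the same (text, {'entities': [...]}) layout as the training data
sample_text = "Jane Doe Chennai - Data Analyst"
sample_annotation = {
    'entities': [(19, 31, 'Designation'), (9, 16, 'Location'), (0, 8, 'Name')]
}


def spans_to_strings(text, annotation):
    """Return (label, substring) pairs for every annotated span, in text order."""
    pairs = []
    for start, end, label in sorted(annotation['entities']):
        # Offsets are character positions; `end` is exclusive, as in Python slicing
        pairs.append((label, text[start:end]))
    return pairs


for label, value in spans_to_strings(sample_text, sample_annotation):
    print(f"{label:15} -> {value!r}")
```

Running the same check over real examples is a quick way to spot spans that include stray whitespace or punctuation.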

In [4]:
# Get the Length of the Training Data

len(data)

200
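
With only 200 annotated resumes, it can help to set a few aside so that validation does not rely only on examples the model has already seen. A minimal sketch, assuming a simple random split (the helper name `split_data`, the 80/20 ratio, and the seed are illustrative choices, not part of the original notebook):

```python
import random


def split_data(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split it into (train, test) lists."""
    shuffled = list(examples)              # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # seeded generator for a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 200 (text, annotation) tuples loaded from the pickle file
dummy_data = [(f"resume {i}", {'entities': []}) for i in range(200)]
train_split, test_split = split_data(dummy_data)
print(len(train_split), len(test_split))
```

The training function could then be given `train_split` only, keeping `test_split` for validation.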

## Model Building and Architecture Design

In [5]:
# Instantiate a blank English pipeline. It contains no trained components yet

nlp = spacy.blank('en')

In [6]:
# Define the Training Function

def model_train(train_data):
    # Add an NER component to the blank pipeline if it is not already present
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        # Add the component at the last position of the pipeline
        nlp.add_pipe(ner, last=True)
    else:
        # Reuse the existing component so that `ner` is always defined
        ner = nlp.get_pipe('ner')
    
    # Register every entity label found in the training data
    for _, annotation in train_data:
        # Each training example is a (text, {'entities': [(start, end, label), ...]}) tuple.
        # Only the annotations are needed here, so the text part is discarded as '_'.
        for ent in annotation['entities']:  # 'entities' is a list of (start, end, label) tuples
            ner.add_label(ent[2])  # The label is at index 2 of each tuple
            
    # Preparing the Data for Training the Model
    
    # Check for Other Pipelines
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    
    # https://spacy.io/usage/training#ner - Updating the Named Entity Recognizer
    
    # Only train NER Model - Disable other pipes present
    with nlp.disable_pipes(*other_pipes):
        # Instantiate Training
        optimizer = nlp.begin_training()
        
        # Train the model for 25 iterations
        for iteration in range(25):
            print(f'The Starting Iteration is {iteration}')
            # Shuffle the data in place at the start of each iteration
            random.shuffle(train_data)
            # Reset the loss dictionary for this iteration
            losses = {}
            # Update the model one example at a time
            for text, annotation in train_data:
                try:
                    nlp.update([text],         # Batch of texts
                               [annotation],   # Batch of annotations
                               drop=0.2,       # Dropout rate
                               sgd=optimizer,  # Optimizer that updates the weights
                               losses=losses)
                except Exception:
                    # Examples that raise an error during the update are skipped silently
                    pass
            print(f"The Losses are {losses}")

In [7]:
# Train the model on the loaded training data

model_train(train_data= data)

The Starting Iteration is 0
The Losses are {'ner': 12093.852423867782}
The Starting Iteration is 1
The Losses are {'ner': 10287.665329437561}
The Starting Iteration is 2
The Losses are {'ner': 9285.426861480897}
The Starting Iteration is 3
The Losses are {'ner': 7686.42610134936}
The Starting Iteration is 4
The Losses are {'ner': 6999.581332481909}
The Starting Iteration is 5
The Losses are {'ner': 6396.00451275155}
The Starting Iteration is 6
The Losses are {'ner': 4782.8191661516585}
The Starting Iteration is 7
The Losses are {'ner': 5105.98984045243}
The Starting Iteration is 8
The Losses are {'ner': 4534.768797782992}
The Starting Iteration is 9
The Losses are {'ner': 4592.296797033751}
The Starting Iteration is 10
The Losses are {'ner': 4155.295845598931}
The Starting Iteration is 11
The Losses are {'ner': 4497.874369037021}
The Starting Iteration is 12
The Losses are {'ner': 5166.3914691195705}
The Starting Iteration is 13
The Losses are {'ner': 3662.8170937136288}
The Starting I

In [8]:
# Saving the NLP Model

nlp.to_disk('Resume_NLP_Model')
# The call above creates a folder containing the "ner", "vocab" and tokenizer data

## Model Validation

In [2]:
# Loading the NLP Model from the Disk

model = spacy.load('Resume_NLP_Model')

In [10]:
# Validate the results on a training example first. model_train shuffled `data` in place, so the order differs from the original import

print(data[12] , '\n\n' , '-'*120)

print(f"\nThe text passed to the model is: \n\n { data[12][0] }")

('Aarti Pimplay Operations Center Shift Manager (OCSM)  - Email me on Indeed: indeed.com/r/Aarti-Pimplay/778c7a91033a71ca  To work with an organization where I can contribute to the growth of the organization through my skill &amp; knowledge for mutual benefit and to learn and excel in highly competitive environment  WORK EXPERIENCE  Operations Center Shift Manager (OCSM)  Microsoft India -  August 2012 to January 2016  • Handling escalations, notifications, task organization, distribution of work, site status enquiries • Monitoring the Incidents handled by the team in real time • Supervising the reporting of Incidents to respective stake holders • Ensuring proper workflow of Incident and major incident processes are followed • Escalate events that have a potential MS impacts to Security Analyst or as directed by the Escalation Matrix • Initiate problem tickets based on the recurring incidents identified • Reviewing the problem records to ensure timely closure of issues • Responsible f

In [11]:
# Check the Working of the Model for the Above data

doc= model( data[12][0] ) # Load only Text Value

# Get the Entities 

for entity in doc.ents:
    # Print Entities with Padding
    print(f" {entity.label_.upper() :{20}} ------> { entity.text } ")

 NAME                 ------> Aarti Pimplay 
 DESIGNATION          ------> Operations Center Shift Manager (OCSM) 
 EMAIL ADDRESS        ------> indeed.com/r/Aarti-Pimplay/778c7a91033a71ca 
 DESIGNATION          ------> Operations Center Shift Manager (OCSM) 
 COMPANIES WORKED AT  ------> Microsoft India 
 COMPANIES WORKED AT  ------> Microsoft India 
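
Printing the entities gives a qualitative check. A rough quantitative check is to compare predicted `(start, end, label)` spans against the gold annotations. A minimal sketch with exact-match scoring follows; `span_scores` and the sample spans are hypothetical, and in practice the predicted spans could be built from `doc.ents` using `ent.start_char`, `ent.end_char`, and `ent.label_`:

```python
def span_scores(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans for illustration only
gold = [(0, 13, 'Name'), (14, 52, 'Designation'), (76, 119, 'Email Address')]
pred = [(0, 13, 'Name'), (14, 52, 'Designation'), (60, 70, 'Location')]

precision, recall, f1 = span_scores(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a prediction that is off by one character counts as wrong, which makes annotation quality visible in the score.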


## Model Testing

Testing the model with my own resume.

In [3]:
# Import the libraries

import sys, fitz # Fitz is taken from PyMuPDF Library

In [4]:
# Path of the resume PDF

file = "Ravishankar Ramakrishnan_2020.pdf"

In [5]:
# Read the document and extract its text (no training happens here)

doc_test = fitz.open(file)

# Start with an empty string

text = ""

# Loop through the pages and append the text of each one
# (getText() is the older PyMuPDF name; newer releases call it get_text())

for page in doc_test:
    text = text + str( page.getText() )

In [6]:
print(text)

 
RAVISHANKAR RAMAKRISHNAN 
 
Phone: (91) 7010974018 
mailtoravi7895@gmail.com 
Adambakkam 
Chennai, 600088 
 
 
 
An Aspiring Data Scientist and Computer Science Engineer having 2.1 years’ Experience as 
Junior Data Scientist at a Talent Strategy Consulting firm and 2.4 years Overall. My 
expertise lies in performing Exploratory Data Analysis, Business Analytics, Feature 
Engineering, Data Mining, Data Visualization, Predictive Modelling, Statistics and Natural 
Language Processing. My Functional Management expertise lies around Six Sigma, 
Corporate Strategy, Advanced Management techniques, Growth Hacking and Digital 
Marketing 
 
EDUCATION 
 
MTech BITS Pilani, Data Science and Engineering 
 Apr 2019 - Present 
SUBJECTS: “Data Mining, Data Structures and Algorithms, Machine Learning, 
Data Science, Data Visualization, Statistics” 
 
BE 
Sathyabama University, Computer Science Engineering 
 June 2016 
 
12TH  New Prince, Computer Science 
May 2012 
 
SKILLS 
 
Technical Tools 
 
Pyth

In [7]:
# The extracted text contains many newline characters and bullet glyphs, so we clean it first

# Replace newlines and the bullet glyph (U+F0B7) with spaces

text_process = " ".join(re.split( '\n|\uf0b7', text ))

In [8]:
text_process

'  RAVISHANKAR RAMAKRISHNAN    Phone: (91) 7010974018  mailtoravi7895@gmail.com  Adambakkam  Chennai, 600088        An Aspiring Data Scientist and Computer Science Engineer having 2.1 years’ Experience as  Junior Data Scientist at a Talent Strategy Consulting firm and 2.4 years Overall. My  expertise lies in performing Exploratory Data Analysis, Business Analytics, Feature  Engineering, Data Mining, Data Visualization, Predictive Modelling, Statistics and Natural  Language Processing. My Functional Management expertise lies around Six Sigma,  Corporate Strategy, Advanced Management techniques, Growth Hacking and Digital  Marketing    EDUCATION    MTech BITS Pilani, Data Science and Engineering   Apr 2019 - Present  SUBJECTS: “Data Mining, Data Structures and Algorithms, Machine Learning,  Data Science, Data Visualization, Statistics”    BE  Sathyabama University, Computer Science Engineering   June 2016    12TH  New Prince, Computer Science  May 2012    SKILLS    Technical Tools    Pyt
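
The joined string still contains runs of spaces where consecutive newlines used to be. If a more compact string is wanted, one option is to collapse whitespace. A minimal sketch (`normalize_whitespace` is a hypothetical helper; whether this improves predictions depends on how the training text was spaced, which is not verified here):

```python
import re


def normalize_whitespace(raw_text, bullet='\uf0b7'):
    """Replace the bullet glyph with a space, then collapse all whitespace runs."""
    without_bullets = raw_text.replace(bullet, ' ')
    # \s+ matches spaces, tabs and newlines; each run becomes a single space
    return re.sub(r'\s+', ' ', without_bullets).strip()


# Hypothetical snippet shaped like the PDF extraction output
sample = " \nJANE DOE \n \nChennai, 600088 \n\uf0b7 Python\t R "
cleaned = normalize_whitespace(sample)
print(repr(cleaned))
```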

In [9]:
# Run the model on the cleaned resume text

doc= model( text_process ) # Pass the cleaned text

# Get the Entities 

for entity in doc.ents:
    # Print Entities with Padding
    print(f" {entity.label_.upper() :{20}} ------> { entity.text } ")

 GRADUATION YEAR      ------> 7010974018 
 YEARS OF EXPERIENCE  ------> 2.1 years 
 DEGREE               ------> MTech BITS Pilani, 
 COLLEGE NAME         ------> Sathyabama University, Computer Science Engineering 
 GRADUATION YEAR      ------> 2016 
 SKILLS               ------> Technical Tools    Python, R, SQL, Tableau, Scala, Spark, PySpark, Hadoop HDFS, Flask, HTML, CSS, Excel  Solver, MEAN Stack etc.    Libraries    Numpy, Pandas, Scikit-learn, Matplotlib, Seaborn, Beautifulsoup, Selenium, Py2Exe, Plotly, 
 LOCATION             ------> 2018 
 SKILLS               ------> Tamil, English, Hindi (Beginner) 


The result above is from the model trained for 25 epochs.

We obtained the following result when training the model for 10 epochs:
![10Epoch](10Epoch_Test.PNG)

<h3><center>GARBAGE IN == GARBAGE OUT</center></h3>

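The predictions contain clear errors: the phone number is labelled as a graduation year and "2018" as a location. Collecting and preparing the training data more carefully should improve the predictions.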
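
One concrete way to improve the training data is to clean the entity spans before training: trim leading and trailing whitespace from each span and drop spans that overlap one that has already been accepted. A minimal sketch under those assumptions (`clean_entities` is a hypothetical helper, and keeping the longer span on overlap is an arbitrary choice):

```python
def clean_entities(text, entities):
    """Trim whitespace from spans and drop empty or overlapping ones."""
    trimmed = []
    for start, end, label in entities:
        # Shrink the span until it no longer starts or ends on whitespace
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:
            trimmed.append((start, end, label))

    # Prefer longer spans, then keep only spans that do not overlap a kept one
    trimmed.sort(key=lambda span: (-(span[1] - span[0]), span[0]))
    kept = []
    for start, end, label in trimmed:
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Hypothetical example: one span with a trailing space and one overlapping span
text = "Jane Doe Chennai - Data Analyst"
raw = [(0, 9, 'Name'), (9, 16, 'Location'), (12, 16, 'Location'), (19, 31, 'Designation')]
cleaned_spans = clean_entities(text, raw)
print(cleaned_spans)
```

Applying such a step to every `(text, annotation)` pair before training may also reduce the number of examples that are silently skipped during the update loop.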

**Thanks!**

# Test

In [23]:
# Run the model on the cleaned resume text and collect the entities

doc= model( text_process ) # Pass the cleaned text

# Collect each entity as a dictionary
lister_vals = []

for entity in doc.ents:
    # Print the entity label, padded to 20 characters, and its text
    print(f" {entity.label_.upper() :{20}} ------> { entity.text } ")
    # Store the label and text so they can be loaded into a DataFrame
    lister_vals.append({'Entity': entity.label_.upper(),
                        'Value': entity.text})

 GRADUATION YEAR      ------> 7010974018 
 YEARS OF EXPERIENCE  ------> 2.1 years 
 DEGREE               ------> MTech BITS Pilani, 
 COLLEGE NAME         ------> Sathyabama University, Computer Science Engineering 
 GRADUATION YEAR      ------> 2016 
 SKILLS               ------> Technical Tools    Python, R, SQL, Tableau, Scala, Spark, PySpark, Hadoop HDFS, Flask, HTML, CSS, Excel  Solver, MEAN Stack etc.    Libraries    Numpy, Pandas, Scikit-learn, Matplotlib, Seaborn, Beautifulsoup, Selenium, Py2Exe, Plotly, 
 LOCATION             ------> 2018 
 SKILLS               ------> Tamil, English, Hindi (Beginner) 


In [25]:
lister_vals1 = pd.DataFrame(lister_vals)

In [26]:
lister_vals1

|   | Entity | Value |
|---|--------|-------|
| 0 | GRADUATION YEAR | 7010974018 |
| 1 | YEARS OF EXPERIENCE | 2.1 years |
| 2 | DEGREE | MTech BITS Pilani, |
| 3 | COLLEGE NAME | Sathyabama University, Computer Science Engine... |
| 4 | GRADUATION YEAR | 2016 |
| 5 | SKILLS | Technical Tools Python, R, SQL, Tableau, Sc... |
| 6 | LOCATION | 2018 |
| 7 | SKILLS | Tamil, English, Hindi (Beginner) |
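
For downstream use, the flat list of entity/value records can be grouped so that each label maps to all of its values. A minimal sketch in plain Python (`group_by_entity` and the sample records are illustrative; the records mirror the `{'Entity': ..., 'Value': ...}` dictionaries collected in `lister_vals`):

```python
from collections import defaultdict


def group_by_entity(records):
    """Map each entity label to the list of values predicted for it."""
    grouped = defaultdict(list)
    for record in records:
        # Strip surrounding whitespace left over from the PDF extraction
        grouped[record['Entity']].append(record['Value'].strip())
    return dict(grouped)


# Illustrative records in the same shape as lister_vals
records = [
    {'Entity': 'DEGREE', 'Value': 'MTech BITS Pilani,'},
    {'Entity': 'SKILLS', 'Value': 'Tamil, English, Hindi (Beginner)'},
    {'Entity': 'GRADUATION YEAR', 'Value': '2016'},
    {'Entity': 'SKILLS', 'Value': 'Python, R, SQL '},
]

grouped_records = group_by_entity(records)
for label, values in grouped_records.items():
    print(f"{label:20} -> {values}")
```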
