<h1><center>NER - Resume Extractor</center></h1>
<h1><center>By: Team ZeRoS</center></h1>

## Problem Description:
When companies recruit for a position, they usually end up receiving thousands, if not millions, of resumes. Such a huge number makes reviewing every resume an extremely difficult and tedious job for HR employees. As a result, many companies have adopted systems that collect the necessary information by having candidates fill in an application form with all the required fields. This solution works well for employers; candidates, however, find it illogical to spend tens of hours polishing their CVs and cover letters only to discover that they must spend another hour or so re-entering the same information in the designated fields.

## Business Understanding:
HR teams often find that manually evaluating resumes in bulk, many of them padded with excess information, becomes tedious and hectic. We can therefore automate this process by reading CVs in several file formats and then applying basic Natural Language Processing techniques such as word parsing, chunking, regex parsing, and/or Named Entity Recognition to capture information like name, email ID, address, educational qualification, and experience from a large number of documents in seconds.
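
As a small illustration of the regex-based capture mentioned above, the following sketch pulls e-mail addresses out of raw resume text. The pattern, the function name `extract_emails`, and the sample text are assumptions made for illustration; the pattern is deliberately simple and is not a complete e-mail validator.

```python
import re

# Simplified e-mail pattern (an assumption for illustration, not RFC-complete)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return all e-mail-like substrings found in the text, in order of appearance."""
    return EMAIL_RE.findall(text)

# Hypothetical resume snippet
sample = "Jane Doe\nData Analyst\nContact: jane.doe@example.com, Pune"
print(extract_emails(sample))
```

Fields with a less regular shape, such as names or employers, are where Named Entity Recognition is more likely to help than a fixed pattern.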

## File Ingestion and Schema Validation

In [136]:
%%writefile testutility.py
import logging
import os
import subprocess
import yaml
import pandas as pd
import datetime 
import gc
import re

def read_yml_file(file_path):
    with open(file_path, 'r') as stream:
        try:
            return yaml.safe_load(stream)
        except yaml.YAMLError as exc:
            logging.error(exc)

def replacer(string, char):
    pattern = char + '{2,}'
    string = re.sub(pattern, char, string)
    return string

def col_validate(df, col_config):
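  # Normalize the file's column names the same way as the expected names:
  # trim whitespace, lowercase, and collapse repeated spaces.
  # (The original call cols.str.replace("_", "") discarded its result, so it is dropped.)
  cols = df.columns.str.strip().str.lower()
  cols = list(map(lambda x: replacer(x, ' '), list(cols)))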
  expected_col = list(map(lambda x: x.lower(),  col_config['columns']))
  cols.sort()
  expected_col.sort()
  if len(cols) == len(expected_col) and cols == expected_col:
    print("Column name and Column length Validation Passed!!")
    return 1
  else:
    print("Column name and Column length Validation Failed..")
    mismatched_columns_file = list(set(cols).difference(expected_col))
    print("Following File columns are not in the YAML file", mismatched_columns_file)
    missing_YAML_file = list(set(expected_col).difference(cols))
    print("Following YAML columns are not in the file uploaded", missing_YAML_file)
    logging.info(f'df columns: {cols}')
    logging.info(f'expected columns: {expected_col}')
    return 0

# Function to get the size of dataset
def humanbytes(B):
  'Return the given bytes as a human friendly KB, MB, GB, or TB string'
  B = float(B)
  KB = float(1024)
  MB = float(KB ** 2) # 1,048,576
  GB = float(KB ** 3) # 1,073,741,824
  TB = float(KB ** 4) # 1,099,511,627,776

  if B < KB:
    # Use the singular only for exactly one byte
    return '{0} {1}'.format(B, 'Bytes' if B != 1 else 'Byte')
  elif KB <= B < MB:
    return '{0:.2f} KB'.format(B/KB)
  elif MB <= B < GB:
    return '{0:.2f} MB'.format(B/MB)
  elif GB <= B < TB:
    return '{0:.2f} GB'.format(B/GB)
  elif TB <= B:
    return '{0:.2f} TB'.format(B/TB)

Overwriting testutility.py


# Writing YAML File

In [137]:
%%writefile file.yaml
columns: 
  - annotation
  - content
dataset_name: resumes
dtypes: 
  annotation: list
  content: str
file_name: Resume
file_type: json
skip_leading_rows: 1
table_name: resumes

Overwriting file.yaml


# Reading the Configuration File

In [138]:
# Read config file
import testutility as util
col_config = util.read_yml_file("file.yaml")

In [139]:
#inspecting data of config file
col_config

{'columns': ['annotation', 'content'],
 'dataset_name': 'resumes',
 'dtypes': {'annotation': 'list', 'content': 'str'},
 'file_name': 'Resume',
 'file_type': 'json',
 'skip_leading_rows': 1,
 'table_name': 'resumes'}

In [140]:
# Read the file using yaml config file
file_type = col_config['file_type']
source_file = col_config['file_name'] + f'.{file_type}'
source_file

'Resume.json'

In [141]:
import pandas as pd

df = pd.read_json(source_file, lines=True)
df.to_csv('dataframe.csv', index = None)
df.head()

Unnamed: 0,content,annotation
0,Govardhana K\nSenior Software Engineer\n\nBeng...,"[{'label': ['Companies worked at'], 'points': ..."
1,"Harini Komaravelli\nTest Analyst at Oracle, Hy...","[{'label': ['Companies worked at'], 'points': ..."
2,Hartej Kathuria\nData Analyst Intern - Oracle ...,"[{'label': ['Skills'], 'points': [{'start': 22..."
3,Ijas Nizamuddin\nAssociate Consultant - State ...,"[{'label': ['Skills'], 'points': [{'start': 46..."
4,"Imgeeyaul Ansari\njava developer\n\nPune, Maha...","[{'label': ['Skills'], 'points': [{'start': 18..."


In [142]:
# Getting the size of the dataset
size = df.memory_usage(deep=True).sum()
file_size = util.humanbytes(size)
print(file_size)

1.41 MB


In [143]:
# Checking the shape of dataframe
df.shape

(200, 2)

In [144]:
# Checking no. of NA Values
df.isnull().sum()

content       0
annotation    0
dtype: int64

In [145]:
df['annotation'][0]

[{'label': ['Companies worked at'],
  'points': [{'start': 1749, 'end': 1754, 'text': 'Oracle'}]},
 {'label': ['Companies worked at'],
  'points': [{'start': 1696, 'end': 1701, 'text': 'Oracle'}]},
 {'label': ['Companies worked at'],
  'points': [{'start': 1417, 'end': 1422, 'text': 'Oracle'}]},
 {'label': ['Skills'],
  'points': [{'start': 1356,
    'end': 1792,
    'text': 'Languages: Core Java, Go Lang, Data Structures & Algorithms, Oracle\nPL-SQL programming, Sales Force with APEX.\nTools: RADTool, Jdeveloper, NetBeans, Eclipse, SQL developer,\nPL/SQL Developer, WinSCP, Putty\nWeb Technologies: JavaScript, XML, HTML, Webservice\n\nOperating Systems: Linux, Windows\nVersion control system SVN & Git-Hub\nDatabases: Oracle\nMiddleware: Web logic, OC4J\nProduct FLEXCUBE: Oracle FLEXCUBE Versions 10.x, 11.x and 12.x'}]},
 {'label': ['Companies worked at'],
  'points': [{'start': 1209, 'end': 1214, 'text': 'Oracle'}]},
 {'label': ['Skills'],
  'points': [{'start': 1136,
    'end': 1247,
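
For training an NER model, annotations of this shape are usually flattened into character spans. The sketch below shows one way to do that. The function name `annotation_to_spans` is hypothetical, and the `+ 1` rests on an assumption read off the output above: `'Oracle'` (6 characters) is annotated as 1749 to 1754, which suggests that `end` is inclusive.

```python
def annotation_to_spans(annotation):
    """Flatten one resume's annotation list into (start, end_exclusive, label) tuples."""
    spans = []
    for item in annotation:
        for label in item["label"]:
            for point in item["points"]:
                # Assumption: 'end' is inclusive in this dataset, so add 1
                # to obtain a Python-style exclusive end offset.
                spans.append((point["start"], point["end"] + 1, label))
    return sorted(spans)

# Two entries copied from the annotation output shown above
sample_annotation = [
    {"label": ["Companies worked at"],
     "points": [{"start": 1749, "end": 1754, "text": "Oracle"}]},
    {"label": ["Companies worked at"],
     "points": [{"start": 1696, "end": 1701, "text": "Oracle"}]},
]
print(annotation_to_spans(sample_annotation))
```

With exclusive ends, `content[start:end]` can be compared against each point's `text` as a sanity check before training.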


In [146]:
df['content'][0]

'Govardhana K\nSenior Software Engineer\n\nBengaluru, Karnataka, Karnataka - Email me on Indeed: indeed.com/r/Govardhana-K/\nb2de315d95905b68\n\nTotal IT experience 5 Years 6 Months\nCloud Lending Solutions INC 4 Month • Salesforce Developer\nOracle 5 Years 2 Month • Core Java Developer\nLanguages Core Java, Go Lang\nOracle PL-SQL programming,\nSales Force Developer with APEX.\n\nDesignations & Promotions\n\nWilling to relocate: Anywhere\n\nWORK EXPERIENCE\n\nSenior Software Engineer\n\nCloud Lending Solutions -  Bangalore, Karnataka -\n\nJanuary 2018 to Present\n\nPresent\n\nSenior Consultant\n\nOracle -  Bangalore, Karnataka -\n\nNovember 2016 to December 2017\n\nStaff Consultant\n\nOracle -  Bangalore, Karnataka -\n\nJanuary 2014 to October 2016\n\nAssociate Consultant\n\nOracle -  Bangalore, Karnataka -\n\nNovember 2012 to December 2013\n\nEDUCATION\n\nB.E in Computer Science Engineering\n\nAdithya Institute of Technology -  Tamil Nadu\n\nSeptember 2008 to June 2012\n\nhttps://www.

In [147]:
# Validating Input File
if util.col_validate(df, col_config) == 0:
    print("Validation Failed!!")
else:
    print("Column Validation Passed..")

Column name and Column length Validation Passed!!
Column Validation Passed..
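
The YAML configuration also declares a `dtypes` mapping, which `col_validate` does not inspect. A minimal sketch of such a check follows, written against plain Python records so that it stays self-contained. The function name `validate_dtypes`, the `TYPE_MAP` lookup table, and the sample records are all hypothetical.

```python
# Map the type names used in file.yaml to Python types (hypothetical lookup table)
TYPE_MAP = {"list": list, "str": str}

def validate_dtypes(records, dtypes):
    """Return (row_index, column, actual_type_name) for every value of the wrong type."""
    problems = []
    for i, row in enumerate(records):
        for column, type_name in dtypes.items():
            value = row.get(column)
            if not isinstance(value, TYPE_MAP[type_name]):
                problems.append((i, column, type(value).__name__))
    return problems

# Same mapping as the 'dtypes' entry of the config shown earlier
dtypes = {"annotation": "list", "content": "str"}

# Made-up records; the second one has a missing 'content' value
records = [
    {"content": "Jane Doe\nData Analyst", "annotation": [{"label": ["Skills"]}]},
    {"content": None, "annotation": []},
]
print(validate_dtypes(records, dtypes))
```

With the DataFrame used here, the records could be obtained with `df.to_dict("records")` and the mapping with `col_config['dtypes']`.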


# Cleaning the Data

In [148]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from num2words import num2words

# Dictionary of English Contractions
contractions_dict = { "ain't": "are not","aren't": "are not",
                     "can't": "cannot","can't've": "cannot have",
                     "'cause": "because","could've": "could have","couldn't": "could not",
                     "couldn't've": "could not have", "didn't": "did not","doesn't": "does not",
                     "don't": "do not","hadn't": "had not","hadn't've": "had not have",
                     "hasn't": "has not","haven't": "have not","he'd": "he would",
                     "he'd've": "he would have","he'll": "he will", "he'll've": "he will have",
                     "how'd": "how did","how'd'y": "how do you","how'll": "how will",
                     "I'd": "I would", "I'd've": "I would have","I'll": "I will",
                     "I'll've": "I will have","I'm": "I am","I've": "I have", "isn't": "is not",
                     "it'd": "it would","it'd've": "it would have","it'll": "it will",
                     "it'll've": "it will have", "let's": "let us","ma'am": "madam",
                     "mayn't": "may not","might've": "might have","mightn't": "might not", 
                     "mightn't've": "might not have","must've": "must have","mustn't": "must not",
                     "mustn't've": "must not have", "needn't": "need not",
                     "needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not",
                     "oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not",
                     "shan't've": "shall not have","she'd": "she would","she'd've": "she would have",
                     "she'll": "she will", "she'll've": "she will have","should've": "should have",
                     "shouldn't": "should not", "shouldn't've": "should not have","so've": "so have",
                     "that'd": "that would","that'd've": "that would have", "there'd": "there would",
                     "there'd've": "there would have", "they'd": "they would",
                     "they'd've": "they would have","they'll": "they will",
                     "they'll've": "they will have", "they're": "they are","they've": "they have",
                     "to've": "to have","wasn't": "was not","we'd": "we would",
                     "we'd've": "we would have","we'll": "we will","we'll've": "we will have",
                     "we're": "we are","we've": "we have", "weren't": "were not","what'll": "what will",
                     "what'll've": "what will have","what're": "what are", "what've": "what have",
                     "when've": "when have","where'd": "where did", "where've": "where have",
                     "who'll": "who will","who'll've": "who will have","who've": "who have",
                     "why've": "why have","will've": "will have","won't": "will not",
                     "won't've": "will not have", "would've": "would have","wouldn't": "would not",
                     "wouldn't've": "would not have","y'all": "you all", "y'all'd": "you all would",
                     "y'all'd've": "you all would have","y'all're": "you all are",
                     "y'all've": "you all have", "you'd": "you would","you'd've": "you would have",
                     "you'll": "you will","you'll've": "you will have", "you're": "you are",
                     "you've": "you have"}

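# Regular expression for finding contractions. Longer keys come first so that
# "can't've" is tried before its prefix "can't"; matching ignores case because
# clean_text lowercases the resume before expanding contractions.
contractions_re = re.compile(
    '(%s)' % '|'.join(re.escape(k) for k in sorted(contractions_dict, key=len, reverse=True)),
    flags=re.IGNORECASE)

# Function for expanding contractions
def expand_contractions(text, contractions_dict=contractions_dict):
    # Lowercased lookup so that keys such as "I'd" also match "i'd"
    lookup = {k.lower(): v for k, v in contractions_dict.items()}
    def replace(match):
        return lookup[match.group(0).lower()]
    return contractions_re.sub(replace, text)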


def clean_text(resume):
    
    resume = resume.lower()
    
    resume = expand_contractions(resume)
    
    resume = re.sub("\n", ' ', resume)
    resume = re.sub(r'[,•()➢❑]', ' ', resume)
    resume = re.sub(r'\s\s+|\s-\s|\.\s', ' ', resume)
    
    tokenized_words = resume.split(" ")
    
    #length = len(tokenized_words)
    #for i in range(length):
     #   if tokenized_words[i].isdigit():
      #      tokenized_words[i] = num2words(tokenized_words[i])
        
    sw = set(stopwords.words('english'))

    # Drop stopwords, then lemmatize the remaining tokens
    lm = WordNetLemmatizer()
    tokens_without_sw = [lm.lemmatize(w) for w in tokenized_words if w not in sw]

    final_resume = " ".join(tokens_without_sw)
    
    return final_resume
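
The three `re.sub` calls in `clean_text` can leave double spaces behind, because matches do not overlap: once `\s-\s` has consumed `" - "`, a single following space no longer qualifies for `\s\s+`. The sketch below isolates just those regex steps (without stopword removal or lemmatization) to show the effect. The name `normalize_punctuation` is hypothetical; the sample line is taken from the first resume.

```python
import re

def normalize_punctuation(text):
    """Apply only the lowercasing and regex steps of clean_text."""
    text = text.lower()
    text = re.sub("\n", " ", text)
    text = re.sub(r"[,•()➢❑]", " ", text)
    text = re.sub(r"\s\s+|\s-\s|\.\s", " ", text)
    return text

sample = "Oracle -  Bangalore, Karnataka -\n\nNovember 2016 to December 2017"

step1 = normalize_punctuation(sample)
print(repr(step1))   # 'oracle  bangalore karnataka  november 2016 to december 2017'

# One possible follow-up: split on any whitespace and re-join with single spaces
step2 = " ".join(step1.split())
print(repr(step2))   # 'oracle bangalore karnataka november 2016 to december 2017'
```

Splitting with `str.split()` (no argument) instead of `split(" ")` would also avoid empty tokens when the text is tokenized.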

In [149]:
resume_content = df['content']

resume_content

0      Govardhana K\nSenior Software Engineer\n\nBeng...
1      Harini Komaravelli\nTest Analyst at Oracle, Hy...
2      Hartej Kathuria\nData Analyst Intern - Oracle ...
3      Ijas Nizamuddin\nAssociate Consultant - State ...
4      Imgeeyaul Ansari\njava developer\n\nPune, Maha...
                             ...                        
195    Mansi Thanki\nStudent\n\nJamnagar, Gujarat - E...
196    Anil Kumar\nMicrosoft Azure (Basic Management)...
197    Siddharth Choudhary\nMicrosoft Office Suite - ...
198    Valarmathi Dhandapani\nInvestment Banking Oper...
199    Pradeep Kumar\nSecurity Analyst in Infosys - C...
Name: content, Length: 200, dtype: object

In [150]:
# Clean every resume. Building the frame in one step avoids naming a variable
# `dict` (which shadows the built-in) and growing the DataFrame row by row.
clean_resumes = pd.DataFrame({'content': resume_content.apply(clean_text)})
    
    
clean_resumes

Unnamed: 0,content
0,govardhana k senior software engineer bengalur...
1,harini komaravelli test analyst oracle hyderab...
2,hartej kathuria data analyst intern oracle ret...
3,ijas nizamuddin associate consultant state str...
4,imgeeyaul ansari java developer pune maharasht...
...,...
195,mansi thanki student jamnagar gujarat email in...
196,anil kumar microsoft azure basic management de...
197,siddharth choudhary microsoft office suite exp...
198,valarmathi dhandapani investment banking opera...


In [151]:
# A sample of the clean resume data

clean_resumes['content'][0]

'govardhana k senior software engineer bengaluru karnataka karnataka email indeed: indeed.com/r/govardhana-k/ b2de315d95905b68 total experience 5 year 6 month cloud lending solution inc 4 month salesforce developer oracle 5 year 2 month core java developer language core java go lang oracle pl-sql programming sale force developer apex  designation & promotion willing relocate: anywhere work experience senior software engineer cloud lending solution  bangalore karnataka  january 2018 present present senior consultant oracle  bangalore karnataka  november 2016 december 2017 staff consultant oracle  bangalore karnataka  january 2014 october 2016 associate consultant oracle  bangalore karnataka  november 2012 december 2013 education b.e computer science engineering adithya institute technology  tamil nadu september 2008 june 2012 https://www.indeed.com/r/govardhana-k/b2de315d95905b68?isid=rex-download&ikw=download-top&co=in https://www.indeed.com/r/govardhana-k/b2de315d95905b68?isid=rex-dow

In [152]:
# This resume is missing the spaces between many of its words, which makes it barely readable.
# Deleting this row is the simplest solution in this case.

clean_resumes['content'][154]

"jacob philip kottayam kerala email indeed: indeed.com/r/jacob-philip/db00d831146c9228 strategicsales experienceinsales skills.currently success  work experience sale marketing specialist assistantbusinessdevelopmentmanager  dubai ae  february 2017 october 2017 uae -builtstrong clientrelationshipsandprovidedhighvalue-addingservices resultingina15% company marketshareincrease developstools practicesacrosstheorganization negotiatingcontractsandpackages negotiatingthetermsofanagreementwithaviewto closingsale expense andnew businessdata workedcloselywithpartners throughconductingqualityassurancetests.actasthepointofcontactandcommunicate projectstatustoallparticipantsinourteam ordinator marketingco  january 2015 june 2016 bhimajewelers kerala india systemreportforms plannedandexecutedeventsandmarketingprograms producingfivetimestargetnumberof qualifiedleads implement marketing advertising campaign assembling analyzing sale forecastsandincreasedperformanceby52percent preparesmarketingreports

In [153]:
clean_resumes.drop([154], axis = 0, inplace = True)
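
Row 154 was found by inspection. A rough heuristic for spotting similar rows automatically is to flag texts whose average token length is unusually high. The sketch below is only an illustration: the function names and the threshold of 12 characters are assumptions, and long tokens such as URLs can inflate the average, so flagged rows should still be reviewed by hand.

```python
def mean_token_length(text):
    """Average length of whitespace-separated tokens (0.0 for empty text)."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

def flag_fused_rows(texts, threshold=12.0):
    """Return the positions of texts whose mean token length exceeds the threshold."""
    return [i for i, t in enumerate(texts) if mean_token_length(t) > threshold]

# Short excerpts from the cleaned resumes shown above
texts = [
    "senior software engineer cloud lending solution",
    "builtstrong clientrelationshipsandprovidedhighvalue-addingservices resultingina15%",
]
print(flag_fused_rows(texts))
```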

# Featurization Using Tf-Idf

In [154]:
### We use the Tf-Idf vectorizer to weight every word and collocation (1- to 3-grams)
### found in these resumes by how distinctive it is for each document.
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(ngram_range = (1, 3))
tf_idf = vect.fit_transform(clean_resumes['content'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vect.get_feature_names_out()
#print(terms)
print(tf_idf)

  (0, 459)	0.04099920254609702
  (0, 360)	0.04099920254609702
  (0, 178)	0.03803341461440172
  (0, 116439)	0.04099920254609702
  (0, 44280)	0.04099920254609702
  (0, 74526)	0.04099920254609702
  (0, 44277)	0.04099920254609702
  (0, 82309)	0.04099920254609702
  (0, 72772)	0.04099920254609702
  (0, 62767)	0.04099920254609702
  (0, 117792)	0.04099920254609702
  (0, 68239)	0.04099920254609702
  (0, 74606)	0.04099920254609702
  (0, 30160)	0.04099920254609702
  (0, 50198)	0.04099920254609702
  (0, 46566)	0.04099920254609702
  (0, 104470)	0.04099920254609702
  (0, 105218)	0.04099920254609702
  (0, 26164)	0.04099920254609702
  (0, 116449)	0.035929154589110036
  (0, 118658)	0.03803341461440172
  (0, 62247)	0.03803341461440172
  (0, 105334)	0.035929154589110036
  (0, 73924)	0.03085910663212305
  (0, 117964)	0.04099920254609702
  :	:
  (198, 119196)	0.006696917115165575
  (198, 52504)	0.020090751345496725
  (198, 52688)	0.006696917115165575
  (198, 37486)	0.006696917115165575
  (198, 107278)	0.02
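
Each printed line has the form `(document index, term index)  weight`. To make those weights less opaque, here is a pure-Python sketch of the calculation for unigrams. It uses the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which is how scikit-learn documents its default `TfidfVectorizer` settings. The whitespace tokenization, the function name `tfidf`, and the toy documents are simplifying assumptions, so the numbers are not meant to reproduce the output above.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

# Toy documents (made up for illustration)
docs = ["java developer oracle", "data analyst oracle", "java java developer"]
result = tfidf(docs)
for i, w in enumerate(result):
    top = sorted(w.items(), key=lambda kv: kv[1], reverse=True)[:2]
    print(i, top)
```

Terms that occur in fewer documents receive a higher idf, so in the second toy document `data` outweighs `oracle`, which also appears in the first.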

# POS Tagging

In [158]:
import spacy
import en_core_web_sm

# Load the small English pipeline (en_core_web_sm) for vocabulary, syntax & entities
nlp = en_core_web_sm.load()

# The "nlp" object is used to create documents with linguistic annotations.
tags = nlp(clean_resumes['content'][100])

for word in tags:
    print(word.text, word.pos_)

puneet VERB
bhandari NOUN
sap NOUN
sd PART
lead VERB
microsoft ADJ
pune ADJ
maharashtra ADJ
email NOUN
indeed ADV
: PUNCT
indeed.com/r/puneet-bhandari/c9002fa44d6760bd NOUN
willing ADJ
relocate VERB
: PUNCT
anywhere ADV
work NOUN
experience NOUN
sap NOUN
sd PART
lead VERB
microsoft NOUN
  SPACE
august NOUN
2010 NUM
present ADJ
team NOUN
size NOUN
: PUNCT
8 NUM
duration NOUN
: PUNCT
seven NUM
month NOUN
scope NOUN
: PUNCT
* PUNCT
enhancement NOUN
mexico NOUN
invoicing NOUN
process NOUN
per ADP
current ADJ
regulation NOUN
* PUNCT
requirement NOUN
gathering VERB
third ADJ
party NOUN
client NOUN
new ADJ
process NOUN
* PUNCT
responsible ADJ
implementing VERB
change NOUN
system NOUN
area NOUN
exposure NOUN
: PUNCT
* PUNCT
understand VERB
as ADP
- PUNCT
is VERB
process NOUN
develop VERB
to- X
design NOUN
document NOUN
meet NOUN
business NOUN
government NOUN
requirement NOUN
* PUNCT
requirement NOUN
gathering NOUN
sd ADP
process NOUN
client NOUN
* SYM
developed VERB
solution NOUN
blueprint NOU

indore ADJ
madhya NOUN
pradesh NOUN
2002 NUM
2003 NUM
higher ADJ
secondary ADJ
certificate NOUN
shanti NOUN
nagar ADJ
high ADJ
school NOUN
maharashtra NOUN
board NOUN
  SPACE
mumbai VERB
maharashtra ADJ
1999 NUM
2000 NUM
skill NOUN
sap VERB
sd ADP
7 NUM
year NOUN
