# Arbyte 👨‍💻

Your one-stop solution for job applications.

## Data Exploration

Import libraries

In [1]:
import pandas as pd
import huggingface_hub as hf_hub
import json

Log into HuggingFace for access to data

In [2]:
hf_hub.notebook_login()


Import Dataset from AzharAli05/Resume-Screening-Dataset

In [5]:
# Login using e.g. `huggingface-cli login` to access this dataset
df = pd.read_csv("hf://datasets/AzharAli05/Resume-Screening-Dataset/dataset.csv")

In [5]:
df

Unnamed: 0,Role,Resume,Decision,Reason_for_decision,Job_Description
0,E-commerce Specialist,Here's a professional resume for Jason Jones:\...,reject,Lacked leadership skills for a senior position.,Be part of a passionate team at the forefront ...
1,Game Developer,Here's a professional resume for Ann Marshall:...,select,Strong technical skills in AI and ML.,Help us build the next-generation products as ...
2,Human Resources Specialist,Here's a professional resume for Patrick Mccla...,reject,Insufficient system design expertise for senio...,We need a Human Resources Specialist to enhanc...
3,E-commerce Specialist,Here's a professional resume for Patricia Gray...,select,Impressive leadership and communication abilit...,Be part of a passionate team at the forefront ...
4,E-commerce Specialist,Here's a professional resume for Amanda Gross:...,reject,Lacked leadership skills for a senior position.,We are looking for an experienced E-commerce S...
...,...,...,...,...,...
10169,Product Manager,Here's a sample resume for Diana Miller:\n\n**...,reject,Unsatisfactory references or background check.,Here is a comprehensive job description for a ...
10170,UI Engineer,Here's a sample resume for Grace Taylor:\n\n**...,reject,Lack of relevant skills or experience.,Here is a sample job description for a UI Engi...
10171,UI Engineer,Here's a sample resume for Hank Brown:\n\n**Ha...,select,Growth mindset and adaptability.,Here is a job description for a UI Engineer ro...
10172,Data Engineer,Here's a sample resume for Diana Wilson:\n\n**...,reject,Lack of relevant skills or experience.,Here is a comprehensive job description for a ...


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10174 entries, 0 to 10173
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Role                 10174 non-null  object
 1   Resume               10174 non-null  object
 2   Decision             10174 non-null  object
 3   Reason_for_decision  10174 non-null  object
 4   Job_Description      10174 non-null  object
dtypes: object(5)
memory usage: 397.5+ KB


Extract Small Sample Size for Proof of Concept

In [7]:
poc_sample = df.head(5)
poc_sample

Unnamed: 0,Role,Resume,Decision,Reason_for_decision,Job_Description
0,E-commerce Specialist,Here's a professional resume for Jason Jones:\...,reject,Lacked leadership skills for a senior position.,Be part of a passionate team at the forefront ...
1,Game Developer,Here's a professional resume for Ann Marshall:...,select,Strong technical skills in AI and ML.,Help us build the next-generation products as ...
2,Human Resources Specialist,Here's a professional resume for Patrick Mccla...,reject,Insufficient system design expertise for senio...,We need a Human Resources Specialist to enhanc...
3,E-commerce Specialist,Here's a professional resume for Patricia Gray...,select,Impressive leadership and communication abilit...,Be part of a passionate team at the forefront ...
4,E-commerce Specialist,Here's a professional resume for Amanda Gross:...,reject,Lacked leadership skills for a senior position.,We are looking for an experienced E-commerce S...


In [8]:
for resume in poc_sample["Resume"]:
    print(len(resume))

2613
429
3046
2344
2784


In [9]:
for job_dec in poc_sample["Job_Description"]:
    print(len(job_dec))

137
110
136
136
129


In [10]:
poc_sample["Job_Description"][2]

"We need a Human Resources Specialist to enhance our team's technical capabilities and contribute to solving complex business challenges."

Here, we save the sample data as a JSON file for the Proof of Concept (PoC).

In [11]:
poc_sample.to_json("../poc/poc_data.json", orient="records", indent=4)


### Issue
- Job descriptions in the dataset are significantly shorter than the resumes. Job descriptions scraped from LinkedIn or JobStreet will be much longer and more detailed.
- If the model only sees short JDs during training, it will likely perform badly at inference time on the longer scraped JDs.
- The training JDs and the scraped JDs therefore need to be brought to a comparable length.
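
To put a rough number on the gap, the character counts printed above for the five-row sample can be compared directly. This is a minimal sketch that uses only those printed values; the variable names are illustrative, and the ratio describes the sample, not the full dataset.

```python
from statistics import median

# Character counts copied from the outputs of the two length loops above.
resume_lengths = [2613, 429, 3046, 2344, 2784]
jd_lengths = [137, 110, 136, 136, 129]

median_resume = median(resume_lengths)
median_jd = median(jd_lengths)
ratio = median_resume / median_jd

print(f"median resume length: {median_resume} chars")
print(f"median JD length:     {median_jd} chars")
print(f"resume/JD ratio:      {ratio:.1f}x")
```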

### Solution
- We add an LLM Summarization layer right after scraping the JDs, to shorten them to the length familiar to the machine learning model.
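
One way such a summarization layer could look is sketched below. It is an assumption-laden illustration, not the project's implementation: `summarize_jd` and its `generate` callable are hypothetical names, `generate` stands in for whatever LLM client is used, and the 140-character cap is only a guess based on the sample lengths printed earlier.

```python
from typing import Callable


def summarize_jd(jd_text: str, generate: Callable[[str], str], max_chars: int = 140) -> str:
    """Shorten a scraped job description to roughly the training-set length.

    `generate` is a hypothetical callable that sends a prompt to an LLM
    and returns the generated text.
    """
    prompt = (
        "Summarize the following job description in one or two sentences, "
        f"using at most {max_chars} characters:\n\n{jd_text}"
    )
    # Collapse all runs of whitespace so the length check is meaningful.
    summary = " ".join(generate(prompt).split())
    if len(summary) <= max_chars:
        return summary
    # Safety net: the model may ignore the limit, so cut at a word boundary.
    clipped = summary[:max_chars]
    if " " in clipped:
        clipped = clipped.rsplit(" ", 1)[0]
    return clipped.rstrip()


# Usage with a stub in place of a real LLM call.
def fake_llm(prompt: str) -> str:
    return "We need a Data Engineer to build and maintain data pipelines."


print(summarize_jd("(long scraped job description)", fake_llm))
```

Passing the LLM call in as a function keeps the length handling testable without a running model.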

## Data Cleaning

Drop null values if any. 

In [6]:
df = df.dropna()
df

Unnamed: 0,Role,Resume,Decision,Reason_for_decision,Job_Description
0,E-commerce Specialist,Here's a professional resume for Jason Jones:\...,reject,Lacked leadership skills for a senior position.,Be part of a passionate team at the forefront ...
1,Game Developer,Here's a professional resume for Ann Marshall:...,select,Strong technical skills in AI and ML.,Help us build the next-generation products as ...
2,Human Resources Specialist,Here's a professional resume for Patrick Mccla...,reject,Insufficient system design expertise for senio...,We need a Human Resources Specialist to enhanc...
3,E-commerce Specialist,Here's a professional resume for Patricia Gray...,select,Impressive leadership and communication abilit...,Be part of a passionate team at the forefront ...
4,E-commerce Specialist,Here's a professional resume for Amanda Gross:...,reject,Lacked leadership skills for a senior position.,We are looking for an experienced E-commerce S...
...,...,...,...,...,...
10169,Product Manager,Here's a sample resume for Diana Miller:\n\n**...,reject,Unsatisfactory references or background check.,Here is a comprehensive job description for a ...
10170,UI Engineer,Here's a sample resume for Grace Taylor:\n\n**...,reject,Lack of relevant skills or experience.,Here is a sample job description for a UI Engi...
10171,UI Engineer,Here's a sample resume for Hank Brown:\n\n**Ha...,select,Growth mindset and adaptability.,Here is a job description for a UI Engineer ro...
10172,Data Engineer,Here's a sample resume for Diana Wilson:\n\n**...,reject,Lack of relevant skills or experience.,Here is a comprehensive job description for a ...


We drop the `Reason_for_decision` column: we are trying to predict the decision, and the reason is only known after the fact, so keeping it would leak the label.

In [7]:
df = df.drop(["Reason_for_decision"], axis=1)
df

Unnamed: 0,Role,Resume,Decision,Job_Description
0,E-commerce Specialist,Here's a professional resume for Jason Jones:\...,reject,Be part of a passionate team at the forefront ...
1,Game Developer,Here's a professional resume for Ann Marshall:...,select,Help us build the next-generation products as ...
2,Human Resources Specialist,Here's a professional resume for Patrick Mccla...,reject,We need a Human Resources Specialist to enhanc...
3,E-commerce Specialist,Here's a professional resume for Patricia Gray...,select,Be part of a passionate team at the forefront ...
4,E-commerce Specialist,Here's a professional resume for Amanda Gross:...,reject,We are looking for an experienced E-commerce S...
...,...,...,...,...
10169,Product Manager,Here's a sample resume for Diana Miller:\n\n**...,reject,Here is a comprehensive job description for a ...
10170,UI Engineer,Here's a sample resume for Grace Taylor:\n\n**...,reject,Here is a sample job description for a UI Engi...
10171,UI Engineer,Here's a sample resume for Hank Brown:\n\n**Ha...,select,Here is a job description for a UI Engineer ro...
10172,Data Engineer,Here's a sample resume for Diana Wilson:\n\n**...,reject,Here is a comprehensive job description for a ...


Remove newlines "\n" from the resume column

In [9]:
def remove_newlines(text):
    return text.replace('\n', ' ')

df['Resume'] = df['Resume'].apply(remove_newlines)
df['Resume'][0]

'Here\'s a professional resume for Jason Jones:  Jason Jones E-commerce Specialist  Contact Information:  * Email: [jasonjones@email.com](mailto:jasonjones@email.com) * Phone: 555-123-4567 * LinkedIn: linkedin.com/in/jasonjones  Summary: Results-driven E-commerce Specialist with 5+ years of experience in inventory management, SEO, online advertising, and analytics. Proven track record of increasing online sales, improving website traffic, and optimizing inventory levels. Skilled in analyzing complex data sets, identifying trends, and making data-driven decisions. Passionate about staying up-to-date with the latest e-commerce trends and technologies.  Professional Experience:  E-commerce Specialist, XYZ Corporation (2018-Present)  * Managed inventory levels across multiple channels, resulting in a 25% reduction in stockouts and a 15% reduction in overstocking * Developed and implemented SEO strategies that increased website traffic by 30% and improved search engine rankings by 20% * Cre

The resumes share a constant preamble, such as "Here's a professional resume for [Name of Candidate]:" or "Here's a sample resume for [Name of Candidate]:", which adds no semantic value to the embeddings. We remove it with a regex before turning the resumes into embeddings.

In [10]:
pattern = r"Here.*?resume for .*?:\s+"
df['Resume'] = df['Resume'].str.replace(pattern, "", regex=True)

Last check to see if the pattern still exists.

In [11]:
df["Resume"].str.contains(pattern).any()

np.False_
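
To make the effect of the pattern concrete, here is a small standalone check using the standard-library `re` module. The sample string is made up to mimic the dataset's preamble after newlines have been replaced by spaces; it is not a row from the data.

```python
import re

# Same pattern as in the notebook cell above.
pattern = r"Here.*?resume for .*?:\s+"

# Hypothetical sample, shaped like a resume after newline removal.
sample = "Here's a professional resume for Jane Doe:  Jane Doe Data Analyst  Summary: ..."

cleaned = re.sub(pattern, "", sample)
still_present = re.search(pattern, cleaned) is not None

print(cleaned)
print(still_present)
```

Note that the pattern is not anchored to the start of the string, so it would also remove a matching phrase that appears later in a resume. Prefixing it with `^` is a stricter option if that ever turns out to be a problem.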

## Data Engineering

In this section, we define the target label as `Decision`: whether the application was selected or rejected. We then define the features for the ML model. (For the PoC we do not train the model yet; we simulate its predictions using the actual labels of the small sample.)

### Features

1. **Role-Resume Sim** : The semantic similarity score between "Role" and "Resume" is used as a feature, because we want to capture the Job-to-Resume match as a whole. If a candidate has previously worked in the same kind of role (for example, a Junior Data Scientist applying for Senior Data Scientist), the score is boosted. Conversely, if the SHAP (Shapley value) feature explanation shows that a low prediction is driven by the Job-to-Resume score, we know to give the feedback "Your resume doesn't sound like the role" and tailor the resume accordingly.

2. **Resume-JD Sim** : The semantic similarity score between "Resume" and "Job Description" is also used as a feature. This is the more detailed, Skillstack-to-Skillstack comparison: resume content describing skills or experience that is semantically close to the job description raises this score. It also provides a straightforward SHAP-based feedback mechanism for the LLM that tailors the resume. For example, if the predicted decision is a rejection and SHAP attributes a major influence to this feature, we can instruct the LLM to add words or change the phrasing so that the resume becomes semantically closer to the job description.

3. **Word Overlap** : "Resume" and "Job Description" are also used to engineer a word overlap count: how many words from the Job Description are present in the resume and how many are missing. Essentially, it's a measure of how similar the wording is. This also feeds the feedback mechanism, telling us whether the LLM should bring more words from the job description into the resume.

4. **Tech Skills Overlap** : Since this is intended specifically for the tech industry, it is worth checking whether the overlapping words are actually in-demand skills. To do this, we compare the overlap against a list of skill keywords taken from a website that lists the keywords detected by the Applicant Tracking Systems (ATS) that companies use to filter candidate resumes.

5. **Job Title Match Score** : This is a binary flag for whether the Job Title is explicitly mentioned in the resume. A frequently quoted statistic says that candidates who include the job title in their resume are 10.6 times more likely to land the job. This is also direct feedback for the LLM: if the score is 0 (job title not mentioned) and the decision is a rejection driven by this score, we know to suggest including the job title in the resume.
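
The five features above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the final feature pipeline: all function names are hypothetical, the tokenizer is a simple regex, the skill list is a placeholder for the ATS keyword list, and single-token matching will miss multi-word skills such as "machine learning".

```python
import math
import re


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (features 1 and 2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tokenize(text):
    """Lowercased set of word-like tokens; keeps '+' and '#' for names like C++ and C#."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def word_overlap(resume, jd):
    """Feature 3: (JD words present in the resume, JD words missing from it)."""
    jd_words = tokenize(jd)
    present = jd_words & tokenize(resume)
    return len(present), len(jd_words - present)


def skills_overlap(resume, jd, skill_keywords):
    """Feature 4: overlapping words that also appear in a skills keyword list."""
    shared = tokenize(resume) & tokenize(jd)
    return len(shared & {s.lower() for s in skill_keywords})


def job_title_match(resume, role):
    """Feature 5: 1 if the job title appears verbatim in the resume, else 0."""
    return int(role.lower() in resume.lower())


# Made-up example texts and a placeholder skill list.
resume = "Data Analyst with Python and SQL experience"
jd = "Seeking a Data Analyst skilled in SQL, Python and Tableau"
skills = ["Python", "SQL", "Tableau"]

overlap = word_overlap(resume, jd)
tech = skills_overlap(resume, jd, skills)
title = job_title_match(resume, "Data Analyst")
print(overlap, tech, title)
```

In the notebook, the vectors passed to `cosine_similarity` would be the role, resume and job description embeddings generated later.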

### Synthetic Augmentation of Job Descriptions

To recap, the job descriptions in the data read like short summaries of real postings (about 110 to 140 characters in the five-row sample above). We need longer, more detailed job descriptions that reflect industry standards. To get them, we pass each job description to an LLM prompted to rewrite it synthetically, and we set a low temperature to limit hallucination. A low temperature makes token sampling close to deterministic, so the model favors the words most commonly associated with the job description.

> We acknowledge the risk of hallucination that still exists. This is the best method found to engineer the data to suit our machine learning use case.
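
The effect of temperature can be illustrated with a temperature-scaled softmax over a few made-up logits. This is only a sketch of the general idea: the logits are hypothetical, and the real sampling pipeline in Ollama (which may also apply top-k or top-p filtering) is not modeled here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

p_default = softmax_with_temperature(logits, 1.0)
p_low = softmax_with_temperature(logits, 0.1)

print("T=1.0:", [round(p, 4) for p in p_default])
print("T=0.1:", [round(p, 4) for p in p_low])
```

At the lower temperature almost all probability mass sits on the top-scoring token, which is why the output becomes more repeatable. It reduces randomness, but it does not by itself guarantee that the most likely token is factually correct.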

In [43]:
import ollama

def augment_data(role, job_desc):
    client = ollama.Client()
    response = client.generate(
        model="llama3.2",  # the 3.2 model has fewer parameters and is faster
        prompt = f"""
        Role: {role}
        Context: {job_desc}
        Task: Rewrite this job description to be a standard, professional length (<4000 words). 
        In it, only include two sections, industry standard key responsibilities and technical skills associated with {role}.
        Make sure the technical skills are relevant to the role of {role} and are commonly used in the industry.
        Name only 6 key responsibilities and 6 technical skills.
        Name no company names, locations or company descriptions. Don't add placeholders for them.
        Name no company values or benefits.
        Format the output as:
        Key Responsibilities:
        - Responsibility 1
        - Responsibility 2
        - Responsibility 3
        - Responsibility 4
        - Responsibility 5
        - Responsibility 6
        Technical Skills:
        - Skill 1
        - Skill 2
        - Skill 3
        - Skill 4
        - Skill 5
        - Skill 6
        """,
        options={
            "temperature": 0.1,  # lower temperature: more deterministic output, intended to reduce hallucination
            "num_ctx": 500,  # smaller context window to reduce the workload on VRAM
            }
    )
    return response.response

test_augmented_jd = augment_data(df['Role'][0], df['Job_Description'][0])
print(test_augmented_jd)

Key Responsibilities:

- Develop and implement e-commerce strategies to drive business growth and improve customer engagement.
- Analyze market trends, competitor activity, and customer behavior to inform product development and marketing initiatives.
- Collaborate with cross-functional teams to design and launch new products, features, and experiences that meet evolving customer needs.
- Optimize website performance, user experience, and conversion rates through data-driven decision making and A/B testing.
- Manage and analyze large datasets to identify insights and trends that inform business decisions and drive growth.
- Stay up-to-date with the latest e-commerce technologies, platforms, and tools to ensure the company remains competitive in the market.

Technical Skills:

- Programming languages: Python, JavaScript, or Ruby
- E-commerce platforms: Shopify, Magento, or WooCommerce
- Data analysis and machine learning libraries: Pandas, NumPy, scikit-learn, or TensorFlow
- Web develo

Augmenting one job description takes around 6-7 s. For the full dataset of about 10k rows (10k augmentations) that adds up to roughly 17 to 20 hours, which is too long. Instead, we take the unique (role, job description) pairs and generate only one augmented description per pair.
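
The idea of generating once per unique pair can be sketched with a plain dictionary cache. The notebook cells that follow achieve the same thing with `drop_duplicates` and a later `merge`; the helper below (`augment_unique`, a hypothetical name) only illustrates the principle, with a stub standing in for the slow LLM call.

```python
def augment_unique(rows, augment_fn):
    """Call augment_fn once per distinct (role, job_desc) pair.

    rows: iterable of (role, job_desc) tuples.
    Returns the augmented text for every row, plus the number of real calls made.
    """
    cache = {}
    results = []
    for role, job_desc in rows:
        key = (role, job_desc)
        if key not in cache:
            cache[key] = augment_fn(role, job_desc)  # the expensive call
        results.append(cache[key])
    return results, len(cache)


# Stub in place of the LLM-backed augment function.
def fake_augment(role, job_desc):
    return f"Key Responsibilities for {role}"


rows = [
    ("Data Engineer", "Build pipelines."),
    ("UI Engineer", "Build interfaces."),
    ("Data Engineer", "Build pipelines."),  # duplicate pair, served from the cache
    ("Data Engineer", "Maintain the warehouse."),
]

augmented, n_calls = augment_unique(rows, fake_augment)
print(len(augmented), "rows,", n_calls, "LLM calls")
```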

In [32]:
roles = df["Role"].unique().tolist()
roles

['E-commerce Specialist',
 'Game Developer',
 'Human Resources Specialist',
 'Mobile App Developer',
 'UX Designer',
 'Cloud Engineer',
 'Digital Marketing Specialist',
 'AI Researcher',
 'UI Engineer',
 'AR/VR Developer',
 'Machine Learning Engineer',
 'Database Administrator',
 'Data Engineer',
 'Cybersecurity Analyst',
 'Robotics Engineer',
 'Business Analyst',
 'Data Analyst',
 'Cloud Architect',
 'Data Architect',
 'QA Engineer',
 'System Administrator',
 'DevOps Engineer',
 'Product Manager',
 'Data Scientist',
 'Full Stack Developer',
 'Blockchain Developer',
 'Software Engineer',
 'Content Writer',
 'IT Support Specialist',
 'UI Designer',
 'Cybersecurity Specialist',
 'HR Specialist',
 'Network Engineer',
 'Graphic Designer',
 'UI/UX Designer',
 'AI Engineer',
 'Project Manager',
 'Software Developer',
 'product manager',
 'software engineer',
 'data engineer',
 'ui engineer',
 'data scientist',
 'data analyst',
 'ui designer']

In [34]:
role_counts = {}
total_counts = 0
for role in roles:
    mask = df["Role"] == role
    role_counts[role] = df[mask]["Job_Description"].nunique()
    total_counts += role_counts[role]

role_counts
total_counts

3802

There are only 3802 unique role-job description pairs that we need to augment.

In [36]:
unique_df = df[["Role", "Job_Description"]].drop_duplicates().reset_index(drop=True)
unique_df

Unnamed: 0,Role,Job_Description
0,E-commerce Specialist,Be part of a passionate team at the forefront ...
1,Game Developer,Help us build the next-generation products as ...
2,Human Resources Specialist,We need a Human Resources Specialist to enhanc...
3,E-commerce Specialist,Be part of a passionate team at the forefront ...
4,E-commerce Specialist,We are looking for an experienced E-commerce S...
...,...,...
3797,Product Manager,Here is a comprehensive job description for a ...
3798,UI Engineer,Here is a sample job description for a UI Engi...
3799,UI Engineer,Here is a job description for a UI Engineer ro...
3800,Data Engineer,Here is a comprehensive job description for a ...


In [37]:
print("Length of unique job descriptions dataframe:", len(unique_df))
print("Length of original dataframe:", len(df))

Length of unique job descriptions dataframe: 3802
Length of original dataframe: 10174


Let's now augment the data on the 3802 rows.

In [44]:
from tqdm import tqdm

augmented_descriptions = []
for index, row in tqdm(unique_df.iterrows(), total=len(unique_df), desc="Augmenting Job Descriptions"):
    role = row["Role"]
    job_desc = row["Job_Description"]
    augmented_jd = augment_data(role, job_desc)
    augmented_descriptions.append(augmented_jd)
unique_df["Augmented_Job_Description"] = augmented_descriptions

Augmenting Job Descriptions: 100%|██████████| 3802/3802 [3:05:25<00:00,  2.93s/it]


By using a 3B-parameter model instead of an 8B one and lowering the context window, we reduced the time per output from about 7 s to about 3 s, a reduction of roughly 57%.
> **Before** : (8B model and 4096-token context window) = 7s/output. ~7 hours for 3802 rows.

> **After** : (3B model and 500-token context window) = 3s/output. ~3 hours for 3802 rows.
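
A quick check of the arithmetic behind these timings, using the 3802 unique rows, the 7 s per output quoted for the 8B model, and the 2.93 s/it rate reported by the progress bar above. The helper name is illustrative.

```python
def total_hours(n_rows, seconds_per_row):
    """Total wall-clock time in hours for n_rows sequential generations."""
    return n_rows * seconds_per_row / 3600


n_unique = 3802
before_rate = 7.0   # seconds per output, 8B model (as quoted above)
after_rate = 2.93   # seconds per output, 3B model (from the tqdm output)

before_hours = total_hours(n_unique, before_rate)
after_hours = total_hours(n_unique, after_rate)
reduction = 1 - after_rate / before_rate

print(f"before: {before_hours:.1f} h")
print(f"after:  {after_hours:.1f} h")
print(f"time saved per output: {reduction:.0%}")
```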

In [45]:
unique_df

Unnamed: 0,Role,Job_Description,Augmented_Job_Description
0,E-commerce Specialist,Be part of a passionate team at the forefront ...,Key Responsibilities:\n\n- Develop and impleme...
1,Game Developer,Help us build the next-generation products as ...,Key Responsibilities:\n\n- Design and develop ...
2,Human Resources Specialist,We need a Human Resources Specialist to enhanc...,Job Title: Human Resources Specialist\n\nWe ar...
3,E-commerce Specialist,Be part of a passionate team at the forefront ...,Key Responsibilities:\n\n- Design and implemen...
4,E-commerce Specialist,We are looking for an experienced E-commerce S...,Key Responsibilities:\n\n- Develop and impleme...
...,...,...,...
3797,Product Manager,Here is a comprehensive job description for a ...,Here is the rewritten job description:\n\nKey ...
3798,UI Engineer,Here is a sample job description for a UI Engi...,Key Responsibilities:\n\n- Design and develop ...
3799,UI Engineer,Here is a job description for a UI Engineer ro...,Here is the rewritten job description:\n\nKey ...
3800,Data Engineer,Here is a comprehensive job description for a ...,Here is the rewritten job description:\n\nKey ...


In [50]:
pattern = r"Here is the rewritten job description"
mask = unique_df['Augmented_Job_Description'].str.contains(pattern,regex=True)
unique_df[mask]

Unnamed: 0,Role,Job_Description,Augmented_Job_Description
1124,Digital Marketing Specialist,Here is a detailed job description for a Digit...,Here is the rewritten job description:\n\nKey ...
1126,HR Specialist,**Job Title:** HR Specialist\n\n**Job Summary:...,Here is the rewritten job description:\n\nKey ...
1127,Network Engineer,**Job Title:** Network Engineer\n\n**Job Summa...,Here is the rewritten job description:\n\nKey ...
1133,Business Analyst,**Job Title:** Business Analyst\n\n**Job Summa...,Here is the rewritten job description:\n\nKey ...
1134,Database Administrator,**Job Title:** Database Administrator\n\n**Job...,Here is the rewritten job description:\n\nKey ...
...,...,...,...
3792,Data Scientist,Here is a comprehensive job description for a ...,Here is the rewritten job description:\n\nKey ...
3797,Product Manager,Here is a comprehensive job description for a ...,Here is the rewritten job description:\n\nKey ...
3799,UI Engineer,Here is a job description for a UI Engineer ro...,Here is the rewritten job description:\n\nKey ...
3800,Data Engineer,Here is a comprehensive job description for a ...,Here is the rewritten job description:\n\nKey ...


We don't want these repetitive patterns in the embeddings. They add no value, so we'll clean them up.

In [52]:
unique_df["Augmented_Job_Description"] = unique_df["Augmented_Job_Description"].str.replace(pattern, "", regex=True)
unique_df

Unnamed: 0,Role,Job_Description,Augmented_Job_Description
0,E-commerce Specialist,Be part of a passionate team at the forefront ...,Key Responsibilities:\n\n- Develop and impleme...
1,Game Developer,Help us build the next-generation products as ...,Key Responsibilities:\n\n- Design and develop ...
2,Human Resources Specialist,We need a Human Resources Specialist to enhanc...,Job Title: Human Resources Specialist\n\nWe ar...
3,E-commerce Specialist,Be part of a passionate team at the forefront ...,Key Responsibilities:\n\n- Design and implemen...
4,E-commerce Specialist,We are looking for an experienced E-commerce S...,Key Responsibilities:\n\n- Develop and impleme...
...,...,...,...
3797,Product Manager,Here is a comprehensive job description for a ...,:\n\nKey Responsibilities:\n- Develop and exec...
3798,UI Engineer,Here is a sample job description for a UI Engi...,Key Responsibilities:\n\n- Design and develop ...
3799,UI Engineer,Here is a job description for a UI Engineer ro...,:\n\nKey Responsibilities:\n- Design and devel...
3800,Data Engineer,Here is a comprehensive job description for a ...,":\n\nKey Responsibilities:\n- Design, build, a..."
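
The output above shows that a leading ":" is left behind, because the pattern stops just before the colon. A slightly broader pattern, sketched here with the standard-library `re` module on a made-up string, would also consume the colon and the whitespace that follows. This is a suggested variant, not what the notebook ran.

```python
import re

# Anchored at the start; optionally eats the colon and trailing whitespace.
preamble = re.compile(r"^\s*Here is the rewritten job description\s*:?\s*")

# Hypothetical model outputs, with and without the preamble.
with_preamble = "Here is the rewritten job description:\n\nKey Responsibilities:\n- Design and build pipelines"
without_preamble = "Key Responsibilities:\n- Design and build pipelines"

cleaned_a = preamble.sub("", with_preamble)
cleaned_b = preamble.sub("", without_preamble)

print(repr(cleaned_a))
print(cleaned_a == cleaned_b)
```

With pandas, the same pattern string could be passed to `Series.str.replace(..., regex=True)`.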


It also contains "\n" newline characters, which we don't want, so we remove them.

In [None]:
unique_df["Augmented_Job_Description"] = unique_df["Augmented_Job_Description"].str.replace(r"\n+", "", regex=True)
unique_df

Unnamed: 0,Role,Job_Description,Augmented_Job_Description
0,E-commerce Specialist,Be part of a passionate team at the forefront ...,Key Responsibilities:- Develop and implement e...
1,Game Developer,Help us build the next-generation products as ...,Key Responsibilities:- Design and develop enga...
2,Human Resources Specialist,We need a Human Resources Specialist to enhanc...,Job Title: Human Resources SpecialistWe are se...
3,E-commerce Specialist,Be part of a passionate team at the forefront ...,Key Responsibilities:- Design and implement e-...
4,E-commerce Specialist,We are looking for an experienced E-commerce S...,Key Responsibilities:- Develop and implement d...
...,...,...,...
3797,Product Manager,Here is a comprehensive job description for a ...,:Key Responsibilities:- Develop and execute pr...
3798,UI Engineer,Here is a sample job description for a UI Engi...,Key Responsibilities:- Design and develop visu...
3799,UI Engineer,Here is a job description for a UI Engineer ro...,:Key Responsibilities:- Design and develop use...
3800,Data Engineer,Here is a comprehensive job description for a ...,":Key Responsibilities:- Design, build, and mai..."
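
Replacing newlines with an empty string joins the words on either side, which is visible in the output above ("Human Resources SpecialistWe are se..."). A variant that substitutes a single space keeps word boundaries intact. The sketch below uses a made-up string and a hypothetical helper name to show the difference.

```python
import re


def flatten_newlines(text):
    """Replace each run of newlines (and surrounding whitespace) with one space."""
    return re.sub(r"\s*\n+\s*", " ", text).strip()


sample = "Job Title: Human Resources Specialist\n\nWe are seeking a specialist."

fused = re.sub(r"\n+", "", sample)   # behaviour of the cell above
spaced = flatten_newlines(sample)    # suggested alternative

print(fused)
print(spaced)
```

Fused tokens such as "SpecialistWe" are likely to be split into unusual subword pieces by the embedding model's tokenizer, so the spaced version is probably the safer input for embeddings.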


In [54]:
df_merged = pd.merge(df, unique_df, on=["Role", "Job_Description"], how="left")
df_merged

Unnamed: 0,Role,Resume,Decision,Job_Description,Augmented_Job_Description
0,E-commerce Specialist,Jason Jones E-commerce Specialist Contact Inf...,reject,Be part of a passionate team at the forefront ...,Key Responsibilities:- Develop and implement e...
1,Game Developer,Ann Marshall Contact Information: * Email: [a...,select,Help us build the next-generation products as ...,Key Responsibilities:- Design and develop enga...
2,Human Resources Specialist,Patrick Mcclain Human Resources Specialist Co...,reject,We need a Human Resources Specialist to enhanc...,Job Title: Human Resources SpecialistWe are se...
3,E-commerce Specialist,Patricia Gray Contact Information: * Email: [...,select,Be part of a passionate team at the forefront ...,Key Responsibilities:- Design and implement e-...
4,E-commerce Specialist,Amanda Gross Contact Information: * Email: [a...,reject,We are looking for an experienced E-commerce S...,Key Responsibilities:- Develop and implement d...
...,...,...,...,...,...
10169,Product Manager,**Diana Miller** **Contact Information:** * E...,reject,Here is a comprehensive job description for a ...,:Key Responsibilities:- Develop and execute pr...
10170,UI Engineer,**Grace Taylor** **Contact Information:** * ...,reject,Here is a sample job description for a UI Engi...,Key Responsibilities:- Design and develop visu...
10171,UI Engineer,**Hank Brown** **UI Engineer** **Contact Info...,select,Here is a job description for a UI Engineer ro...,:Key Responsibilities:- Design and develop use...
10172,Data Engineer,**Diana Wilson** **Contact Information:** * A...,reject,Here is a comprehensive job description for a ...,":Key Responsibilities:- Design, build, and mai..."


Save the file.

In [55]:
df_merged.to_csv("../data/processed_resume_screening_dataset.csv", index=False)

Load csv.

In [11]:
df_augmented = pd.read_csv("../data/processed_resume_screening_dataset.csv")
df_augmented

Unnamed: 0,Role,Resume,Decision,Job_Description,Augmented_Job_Description
0,E-commerce Specialist,Jason Jones E-commerce Specialist Contact Inf...,reject,Be part of a passionate team at the forefront ...,Key Responsibilities:- Develop and implement e...
1,Game Developer,Ann Marshall Contact Information: * Email: [a...,select,Help us build the next-generation products as ...,Key Responsibilities:- Design and develop enga...
2,Human Resources Specialist,Patrick Mcclain Human Resources Specialist Co...,reject,We need a Human Resources Specialist to enhanc...,Job Title: Human Resources SpecialistWe are se...
3,E-commerce Specialist,Patricia Gray Contact Information: * Email: [...,select,Be part of a passionate team at the forefront ...,Key Responsibilities:- Design and implement e-...
4,E-commerce Specialist,Amanda Gross Contact Information: * Email: [a...,reject,We are looking for an experienced E-commerce S...,Key Responsibilities:- Develop and implement d...
...,...,...,...,...,...
10169,Product Manager,**Diana Miller** **Contact Information:** * E...,reject,Here is a comprehensive job description for a ...,:Key Responsibilities:- Develop and execute pr...
10170,UI Engineer,**Grace Taylor** **Contact Information:** * ...,reject,Here is a sample job description for a UI Engi...,Key Responsibilities:- Design and develop visu...
10171,UI Engineer,**Hank Brown** **UI Engineer** **Contact Info...,select,Here is a job description for a UI Engineer ro...,:Key Responsibilities:- Design and develop use...
10172,Data Engineer,**Diana Wilson** **Contact Information:** * A...,reject,Here is a comprehensive job description for a ...,":Key Responsibilities:- Design, build, and mai..."


Final cleaning: replace tabs, carriage returns, newlines and Markdown asterisks with spaces.

In [12]:
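import re  # the standard-library module is enough for this simple pattern

def clean_text(text):
    # Replace tabs, carriage returns, newlines and Markdown asterisks with a space
    pattern = r'[\t\r\n*]'
    return re.sub(pattern, ' ', text)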

df_augmented['Cleaned_JD'] = df_augmented['Augmented_Job_Description'].apply(clean_text)
df_augmented['Cleaned_Resume'] = df_augmented['Resume'].apply(clean_text)
df_augmented['Cleaned_Resume'][10172]

"  Diana Wilson     Contact Information:      Address: 123 Main St, Anytown, USA 12345   Phone: (555) 555-5555   Email: [diana.wilson@email.com](mailto:diana.wilson@email.com)   LinkedIn: linkedin.com/in/dianawilson    Summary:   Highly motivated and detail-oriented data engineer with 5+ years of experience in designing, developing, and maintaining large-scale data systems. Skilled in data warehousing, ETL, data visualization, and machine learning. Proficient in a range of programming languages and tools, including Python, Java, SQL, and Hadoop.    Technical Skills:      Programming languages: Python, Java, SQL, R   Data engineering tools: Apache Beam, Apache Spark, Apache Kafka, Apache Hadoop   Data warehousing: Amazon Redshift, Google BigQuery   Data visualization: Tableau, Power BI, D3.js   Machine learning: scikit-learn, TensorFlow, PyTorch   Operating systems: Windows, Linux, macOS   Cloud platforms: Amazon Web Services, Google Cloud Platform, Microsoft Azure    Professional Exper

### Generating Embeddings

We first need to generate the embeddings.

For speed, we collect all of the texts in a list and pass it to the sentence transformer (embedding model) in a single call. This cut the runtimes as follows:

> Resume and Job Description Embeddings generation from : **80 Minutes** to **3 Minutes**.

> Job Title Embedding generation : **75 Minutes** to **0.2 Minutes**.

In [13]:
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2', device='cuda' if torch.cuda.is_available() else 'cpu')

In [14]:
all_resumes = df_augmented['Cleaned_Resume'].tolist()
all_embeddings = model.encode(all_resumes, show_progress_bar=True)

Batches:   0%|          | 0/318 [00:00<?, ?it/s]

In [15]:
df_augmented['Resume_Embeddings'] = list(all_embeddings)
len(df_augmented['Resume_Embeddings'][2])

384

Repeat for Job Description Embeddings and Job Title

In [16]:
all_jds = df_augmented['Cleaned_JD'].tolist()
all_jd_embeddings = model.encode(all_jds, show_progress_bar=True)

Batches:   0%|          | 0/318 [00:00<?, ?it/s]

In [17]:
df_augmented['JD_Embeddings'] = list(all_jd_embeddings)
len(df_augmented['JD_Embeddings'][0])

384

In [18]:
all_roles = df_augmented['Role'].tolist()
all_roles_embeddings = model.encode(all_roles, show_progress_bar=True)

Batches:   0%|          | 0/318 [00:00<?, ?it/s]

In [19]:
df_augmented["Role_Embeddings"] = list(all_roles_embeddings)
len(df_augmented["Role_Embeddings"][0])

384
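
Since the `Role` column holds only a few dozen distinct values, one further option is to encode each distinct role once and map the vectors back to the rows. The helper below is a hypothetical sketch of that idea and is not part of the notebook; `encode_fn` stands in for `model.encode`, and a stub is used here so the example runs without the model.

```python
def encode_unique(texts, encode_fn):
    """Encode each distinct text once, then expand back to the original order."""
    unique_texts = list(dict.fromkeys(texts))  # de-duplicate, keep first-seen order
    vectors = encode_fn(unique_texts)          # one batched call on the unique texts
    lookup = dict(zip(unique_texts, vectors))
    return [lookup[text] for text in texts]


# Stub encoder that records its calls; a real run would pass model.encode.
calls = []


def fake_encode(batch):
    calls.append(list(batch))
    return [[float(len(text))] for text in batch]


roles = ["Data Engineer", "UI Engineer", "Data Engineer", "UI Engineer", "Data Engineer"]
role_vectors = encode_unique(roles, fake_encode)

print(len(role_vectors), "vectors from", len(calls[0]), "encoded texts")
```

In the notebook this would be called as `encode_unique(df_augmented['Role'].tolist(), model.encode)`, assuming `model.encode` returns one vector per input text in order.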

In [26]:
df_augmented.to_pickle("../data/embedded_resume_screening_dataset.pkl")

In [28]:
df_embedded = pd.read_pickle("../data/embedded_resume_screening_dataset.pkl")
df_embedded

Unnamed: 0,Role,Resume,Decision,Job_Description,Augmented_Job_Description,Cleaned_JD,Cleaned_Resume,Resume_Embeddings,JD_Embeddings,Role_Embeddings
0,E-commerce Specialist,Jason Jones E-commerce Specialist Contact Inf...,reject,Be part of a passionate team at the forefront ...,Key Responsibilities:- Develop and implement e...,Key Responsibilities:- Develop and implement e...,Jason Jones E-commerce Specialist Contact Inf...,"[-0.17973936, -0.07596597, -0.09723778, -0.114...","[-0.054287765, -0.13262808, 0.15113121, -0.081...","[-0.010778938, -0.33309612, 0.049484182, 0.016..."
1,Game Developer,Ann Marshall Contact Information: * Email: [a...,select,Help us build the next-generation products as ...,Key Responsibilities:- Design and develop enga...,Key Responsibilities:- Design and develop enga...,Ann Marshall Contact Information: Email: [a...,"[0.13636228, -0.17706871, 0.0489342, 0.0220676...","[0.0084996885, -0.0960481, 0.1524509, -0.02108...","[-0.14157347, -0.13269219, -0.009378961, -0.32..."
2,Human Resources Specialist,Patrick Mcclain Human Resources Specialist Co...,reject,We need a Human Resources Specialist to enhanc...,Job Title: Human Resources SpecialistWe are se...,Job Title: Human Resources SpecialistWe are se...,Patrick Mcclain Human Resources Specialist Co...,"[-0.12360888, -0.0028093122, -0.059053816, -0....","[-0.10314158, 0.08767564, 0.106681176, 0.14662...","[-0.44442204, 0.1633573, -0.12067189, 0.144497..."
3,E-commerce Specialist,Patricia Gray Contact Information: * Email: [...,select,Be part of a passionate team at the forefront ...,Key Responsibilities:- Design and implement e-...,Key Responsibilities:- Design and implement e-...,Patricia Gray Contact Information: Email: [...,"[0.08986741, -0.15142985, -0.020609418, 0.0756...","[-0.044191867, -0.20852506, 0.1408912, -0.0783...","[-0.010778938, -0.33309612, 0.049484182, 0.016..."
4,E-commerce Specialist,Amanda Gross Contact Information: * Email: [a...,reject,We are looking for an experienced E-commerce S...,Key Responsibilities:- Develop and implement d...,Key Responsibilities:- Develop and implement d...,Amanda Gross Contact Information: Email: [a...,"[-0.16076297, -0.256428, 0.077939644, 0.081880...","[-0.098053396, -0.0045841606, 0.017642075, -0....","[-0.010778938, -0.33309612, 0.049484182, 0.016..."
...,...,...,...,...,...,...,...,...,...,...
10169,Product Manager,**Diana Miller** **Contact Information:** * E...,reject,Here is a comprehensive job description for a ...,:Key Responsibilities:- Develop and execute pr...,:Key Responsibilities:- Develop and execute pr...,Diana Miller Contact Information: E...,"[-0.089459516, -0.040913753, -0.1686066, 0.121...","[-0.13952348, 0.021376673, -0.040870953, -0.10...","[-0.23212944, -0.029141244, 0.11806843, -0.110..."
10170,UI Engineer,**Grace Taylor** **Contact Information:** * ...,reject,Here is a sample job description for a UI Engi...,Key Responsibilities:- Design and develop visu...,Key Responsibilities:- Design and develop visu...,Grace Taylor Contact Information: ...,"[-0.08140646, -0.13228643, -0.045160342, 0.020...","[-0.104550816, -0.24781276, 0.09464185, 0.1046...","[-0.45176134, -0.18218127, 0.14167099, 0.09667..."
10171,UI Engineer,**Hank Brown** **UI Engineer** **Contact Info...,select,Here is a job description for a UI Engineer ro...,:Key Responsibilities:- Design and develop use...,:Key Responsibilities:- Design and develop use...,Hank Brown UI Engineer Contact Info...,"[-0.35859102, 0.08209543, -0.12568502, -0.0462...","[-0.13481516, -0.1497447, 0.089010455, 0.03778...","[-0.45176134, -0.18218127, 0.14167099, 0.09667..."
10172,Data Engineer,**Diana Wilson** **Contact Information:** * A...,reject,Here is a comprehensive job description for a ...,":Key Responsibilities:- Design, build, and mai...",":Key Responsibilities:- Design, build, and mai...",Diana Wilson Contact Information: A...,"[0.11309587, -0.123301044, -0.20182684, 0.0077...","[-0.2218479, -0.09838769, -0.17422803, -0.1901...","[-0.3564379, 0.2009739, -0.4245388, -0.3691941..."


### Creating Feature 1 (Resume-JD Similarity)

Example similarity score for a rejected resume

In [31]:
from sklearn.metrics.pairwise import cosine_similarity

embedding1 = df_embedded['Resume_Embeddings'][0].reshape(1, -1)
embedding2 = df_embedded['JD_Embeddings'][0].reshape(1, -1)
similarity_score = cosine_similarity(embedding1, embedding2)
similarity_score

array([[0.5053542]], dtype=float32)

Example similarity score for an accepted resume

In [32]:
embedding1 = df_embedded['Resume_Embeddings'][1].reshape(1, -1)
embedding2 = df_embedded['JD_Embeddings'][1].reshape(1, -1)
similarity_score = cosine_similarity(embedding1, embedding2)
similarity_score

array([[0.4116441]], dtype=float32)

Apply to whole dataset.

In [34]:
def compute_similarity(embedding1, embedding2):
    emb1_reshaped = embedding1.reshape(1, -1)
    emb2_reshaped = embedding2.reshape(1, -1)
    similarity = cosine_similarity(emb1_reshaped, emb2_reshaped)
    return similarity[0][0]

df_embedded['Resume_JD_Sim'] = df_embedded.apply(lambda row: compute_similarity(row['Resume_Embeddings'], row['JD_Embeddings']), axis=1)
df_embedded['Resume_JD_Sim']

0        0.505354
1        0.411644
2        0.604468
3        0.351232
4        0.428203
           ...   
10169    0.435778
10170    0.367484
10171    0.459990
10172    0.403455
10173    0.305488
Name: Resume_JD_Sim, Length: 10174, dtype: float32
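The row-wise `apply` above calls `cosine_similarity` once per row. As a possible alternative, the same per-row score can be computed in one vectorized pass. This is a minimal sketch, not the notebook's method: `rowwise_cosine` is a hypothetical helper, and the toy arrays stand in for what `np.stack(df_embedded['Resume_Embeddings'])` and `np.stack(df_embedded['JD_Embeddings'])` would produce.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of a and b, both of shape (n, d)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Dot product of row i of a with row i of b
    dot = np.einsum("ij,ij->i", a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Rows containing a zero vector get similarity 0.0 instead of a division by zero
    return np.divide(dot, norms, out=np.zeros_like(dot), where=norms > 0)

# Toy stand-ins for the stacked resume and JD embedding matrices (assumption)
resume_vecs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
jd_vecs = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
sims = rowwise_cosine(resume_vecs, jd_vecs)
print(sims)
```

On the real data the result could be assigned directly to `df_embedded['Resume_JD_Sim']`, since the output has one value per row in the original order.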

In [38]:
df_embedded = df_embedded.drop(["Similarity_Score"], axis=1)
df_embedded

Unnamed: 0,Role,Resume,Decision,Job_Description,Augmented_Job_Description,Cleaned_JD,Cleaned_Resume,Resume_Embeddings,JD_Embeddings,Role_Embeddings,Resume_JD_Sim
0,E-commerce Specialist,Jason Jones E-commerce Specialist Contact Inf...,reject,Be part of a passionate team at the forefront ...,Key Responsibilities:- Develop and implement e...,Key Responsibilities:- Develop and implement e...,Jason Jones E-commerce Specialist Contact Inf...,"[-0.17973936, -0.07596597, -0.09723778, -0.114...","[-0.054287765, -0.13262808, 0.15113121, -0.081...","[-0.010778938, -0.33309612, 0.049484182, 0.016...",0.505354
1,Game Developer,Ann Marshall Contact Information: * Email: [a...,select,Help us build the next-generation products as ...,Key Responsibilities:- Design and develop enga...,Key Responsibilities:- Design and develop enga...,Ann Marshall Contact Information: Email: [a...,"[0.13636228, -0.17706871, 0.0489342, 0.0220676...","[0.0084996885, -0.0960481, 0.1524509, -0.02108...","[-0.14157347, -0.13269219, -0.009378961, -0.32...",0.411644
2,Human Resources Specialist,Patrick Mcclain Human Resources Specialist Co...,reject,We need a Human Resources Specialist to enhanc...,Job Title: Human Resources SpecialistWe are se...,Job Title: Human Resources SpecialistWe are se...,Patrick Mcclain Human Resources Specialist Co...,"[-0.12360888, -0.0028093122, -0.059053816, -0....","[-0.10314158, 0.08767564, 0.106681176, 0.14662...","[-0.44442204, 0.1633573, -0.12067189, 0.144497...",0.604468
3,E-commerce Specialist,Patricia Gray Contact Information: * Email: [...,select,Be part of a passionate team at the forefront ...,Key Responsibilities:- Design and implement e-...,Key Responsibilities:- Design and implement e-...,Patricia Gray Contact Information: Email: [...,"[0.08986741, -0.15142985, -0.020609418, 0.0756...","[-0.044191867, -0.20852506, 0.1408912, -0.0783...","[-0.010778938, -0.33309612, 0.049484182, 0.016...",0.351232
4,E-commerce Specialist,Amanda Gross Contact Information: * Email: [a...,reject,We are looking for an experienced E-commerce S...,Key Responsibilities:- Develop and implement d...,Key Responsibilities:- Develop and implement d...,Amanda Gross Contact Information: Email: [a...,"[-0.16076297, -0.256428, 0.077939644, 0.081880...","[-0.098053396, -0.0045841606, 0.017642075, -0....","[-0.010778938, -0.33309612, 0.049484182, 0.016...",0.428203
...,...,...,...,...,...,...,...,...,...,...,...
10169,Product Manager,**Diana Miller** **Contact Information:** * E...,reject,Here is a comprehensive job description for a ...,:Key Responsibilities:- Develop and execute pr...,:Key Responsibilities:- Develop and execute pr...,Diana Miller Contact Information: E...,"[-0.089459516, -0.040913753, -0.1686066, 0.121...","[-0.13952348, 0.021376673, -0.040870953, -0.10...","[-0.23212944, -0.029141244, 0.11806843, -0.110...",0.435778
10170,UI Engineer,**Grace Taylor** **Contact Information:** * ...,reject,Here is a sample job description for a UI Engi...,Key Responsibilities:- Design and develop visu...,Key Responsibilities:- Design and develop visu...,Grace Taylor Contact Information: ...,"[-0.08140646, -0.13228643, -0.045160342, 0.020...","[-0.104550816, -0.24781276, 0.09464185, 0.1046...","[-0.45176134, -0.18218127, 0.14167099, 0.09667...",0.367484
10171,UI Engineer,**Hank Brown** **UI Engineer** **Contact Info...,select,Here is a job description for a UI Engineer ro...,:Key Responsibilities:- Design and develop use...,:Key Responsibilities:- Design and develop use...,Hank Brown UI Engineer Contact Info...,"[-0.35859102, 0.08209543, -0.12568502, -0.0462...","[-0.13481516, -0.1497447, 0.089010455, 0.03778...","[-0.45176134, -0.18218127, 0.14167099, 0.09667...",0.459990
10172,Data Engineer,**Diana Wilson** **Contact Information:** * A...,reject,Here is a comprehensive job description for a ...,":Key Responsibilities:- Design, build, and mai...",":Key Responsibilities:- Design, build, and mai...",Diana Wilson Contact Information: A...,"[0.11309587, -0.123301044, -0.20182684, 0.0077...","[-0.2218479, -0.09838769, -0.17422803, -0.1901...","[-0.3564379, 0.2009739, -0.4245388, -0.3691941...",0.403455


### Creating Feature 2 (Role-Resume Similarity)

In [43]:
df_embedded["Role_Resume_Sim"] = df_embedded.apply(lambda row: compute_similarity(row['Role_Embeddings'], row['Resume_Embeddings']), axis=1)
df_embedded

Unnamed: 0,Role,Resume,Decision,Job_Description,Augmented_Job_Description,Cleaned_JD,Cleaned_Resume,Resume_Embeddings,JD_Embeddings,Role_Embeddings,Resume_JD_Sim,Role_Resume_Sim
0,E-commerce Specialist,Jason Jones E-commerce Specialist Contact Inf...,reject,Be part of a passionate team at the forefront ...,Key Responsibilities:- Develop and implement e...,Key Responsibilities:- Develop and implement e...,Jason Jones E-commerce Specialist Contact Inf...,"[-0.17973936, -0.07596597, -0.09723778, -0.114...","[-0.054287765, -0.13262808, 0.15113121, -0.081...","[-0.010778938, -0.33309612, 0.049484182, 0.016...",0.505354,0.646826
1,Game Developer,Ann Marshall Contact Information: * Email: [a...,select,Help us build the next-generation products as ...,Key Responsibilities:- Design and develop enga...,Key Responsibilities:- Design and develop enga...,Ann Marshall Contact Information: Email: [a...,"[0.13636228, -0.17706871, 0.0489342, 0.0220676...","[0.0084996885, -0.0960481, 0.1524509, -0.02108...","[-0.14157347, -0.13269219, -0.009378961, -0.32...",0.411644,0.502842
2,Human Resources Specialist,Patrick Mcclain Human Resources Specialist Co...,reject,We need a Human Resources Specialist to enhanc...,Job Title: Human Resources SpecialistWe are se...,Job Title: Human Resources SpecialistWe are se...,Patrick Mcclain Human Resources Specialist Co...,"[-0.12360888, -0.0028093122, -0.059053816, -0....","[-0.10314158, 0.08767564, 0.106681176, 0.14662...","[-0.44442204, 0.1633573, -0.12067189, 0.144497...",0.604468,0.587193
3,E-commerce Specialist,Patricia Gray Contact Information: * Email: [...,select,Be part of a passionate team at the forefront ...,Key Responsibilities:- Design and implement e-...,Key Responsibilities:- Design and implement e-...,Patricia Gray Contact Information: Email: [...,"[0.08986741, -0.15142985, -0.020609418, 0.0756...","[-0.044191867, -0.20852506, 0.1408912, -0.0783...","[-0.010778938, -0.33309612, 0.049484182, 0.016...",0.351232,0.512997
4,E-commerce Specialist,Amanda Gross Contact Information: * Email: [a...,reject,We are looking for an experienced E-commerce S...,Key Responsibilities:- Develop and implement d...,Key Responsibilities:- Develop and implement d...,Amanda Gross Contact Information: Email: [a...,"[-0.16076297, -0.256428, 0.077939644, 0.081880...","[-0.098053396, -0.0045841606, 0.017642075, -0....","[-0.010778938, -0.33309612, 0.049484182, 0.016...",0.428203,0.641594
...,...,...,...,...,...,...,...,...,...,...,...,...
10169,Product Manager,**Diana Miller** **Contact Information:** * E...,reject,Here is a comprehensive job description for a ...,:Key Responsibilities:- Develop and execute pr...,:Key Responsibilities:- Develop and execute pr...,Diana Miller Contact Information: E...,"[-0.089459516, -0.040913753, -0.1686066, 0.121...","[-0.13952348, 0.021376673, -0.040870953, -0.10...","[-0.23212944, -0.029141244, 0.11806843, -0.110...",0.435778,0.481033
10170,UI Engineer,**Grace Taylor** **Contact Information:** * ...,reject,Here is a sample job description for a UI Engi...,Key Responsibilities:- Design and develop visu...,Key Responsibilities:- Design and develop visu...,Grace Taylor Contact Information: ...,"[-0.08140646, -0.13228643, -0.045160342, 0.020...","[-0.104550816, -0.24781276, 0.09464185, 0.1046...","[-0.45176134, -0.18218127, 0.14167099, 0.09667...",0.367484,0.471639
10171,UI Engineer,**Hank Brown** **UI Engineer** **Contact Info...,select,Here is a job description for a UI Engineer ro...,:Key Responsibilities:- Design and develop use...,:Key Responsibilities:- Design and develop use...,Hank Brown UI Engineer Contact Info...,"[-0.35859102, 0.08209543, -0.12568502, -0.0462...","[-0.13481516, -0.1497447, 0.089010455, 0.03778...","[-0.45176134, -0.18218127, 0.14167099, 0.09667...",0.459990,0.501841
10172,Data Engineer,**Diana Wilson** **Contact Information:** * A...,reject,Here is a comprehensive job description for a ...,":Key Responsibilities:- Design, build, and mai...",":Key Responsibilities:- Design, build, and mai...",Diana Wilson Contact Information: A...,"[0.11309587, -0.123301044, -0.20182684, 0.0077...","[-0.2218479, -0.09838769, -0.17422803, -0.1901...","[-0.3564379, 0.2009739, -0.4245388, -0.3691941...",0.403455,0.499404


### Creating Feature 3 (Resume-JD Word Overlap)

The fraction of distinct JD words that also appear in the resume (normalized by the number of distinct JD words).

In [68]:
def calculate_overlap(row):
    resume_words = set(str(row['Cleaned_Resume']).lower().split())
    jd_words = set(str(row['Cleaned_JD']).lower().split())
    common_words = resume_words.intersection(jd_words)
    return len(common_words)/ len(jd_words) if len(jd_words) > 0 else 0

df_embedded['Word_Overlap'] = df_embedded.apply(calculate_overlap, axis=1)
df_embedded[['Cleaned_Resume', 'Cleaned_JD', 'Word_Overlap']].head()

Unnamed: 0,Cleaned_Resume,Cleaned_JD,Word_Overlap
0,Jason Jones E-commerce Specialist Contact Inf...,Key Responsibilities:- Develop and implement e...,0.298246
1,Ann Marshall Contact Information: Email: [a...,Key Responsibilities:- Design and develop enga...,0.069767
2,Patrick Mcclain Human Resources Specialist Co...,Job Title: Human Resources SpecialistWe are se...,0.269504
3,Patricia Gray Contact Information: Email: [...,Key Responsibilities:- Design and implement e-...,0.231579
4,Amanda Gross Contact Information: Email: [a...,Key Responsibilities:- Develop and implement d...,0.300885


### Creating Feature 4 (Tech Skills Overlap)

### Scraping ATS Tech Keywords

Here we extract the ATS keywords that we will use to generate Feature 4 above.

In [46]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
# options.add_argument('--headless')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

In [47]:
url = "https://www.jobscan.co/blog/top-resume-keywords-boost-resume/#information-technology-computer-science"
driver.get(url)

In [48]:
header_xpath = "/html/body/div[5]/div[2]/div[1]/div/div[2]/main/article/div[1]/h3[10]"
tech_header = driver.find_element(By.XPATH, header_xpath)
tech_header.text

'Information Technology, Computer Science'

In [49]:
keyword_element = tech_header.find_element(By.XPATH, "./following-sibling::*[1]")
keyword_element.text

'.NET\nalgorithms\nandroid\narchitecture\narchitectures\naudio\nAutoCAD\nAWS\nbig data\nbusiness analysis\nbusiness continuity\nC (programming language)\nC#\nC++\nCAD\ncertification\nCisco\ncloud\ncompliance\ncomputer applications\ncomputer science\ncontrols\nCSS\nD (programming language)\ndata center\ndata collection\ndata entry\ndata management\ndatabase\ndatasets\ndesign\ndevelopment activities\ndigital marketing\ndigital media\ndistribution\nDNS\necommerce\ne-commerce\nend user\nexperimental\nexperiments\nframeworks\nfront-end\nGIS\ngraphic design\nhardware\nHTML5\nI-DEAS\ninformation management\ninformation security\ninformation technology\nintranet\niOS\niPhone\nIT infrastructure\nITIL\nJava\nJavascript\nJIRA\nLAN\nlicensing\nLinux\nmachine learning\nMATLAB\nmatrix\nmechanical engineering\nmigration\nmobile\nmodeling\nnetworking\noperations management\noracle\nOS\nprocess development\nprocess improvement\nprocess improvements\nproduct design\nproduct development\nproduct knowledg

In [52]:
tech_keywords = keyword_element.text.splitlines()
tech_keywords

['.NET',
 'algorithms',
 'android',
 'architecture',
 'architectures',
 'audio',
 'AutoCAD',
 'AWS',
 'big data',
 'business analysis',
 'business continuity',
 'C (programming language)',
 'C#',
 'C++',
 'CAD',
 'certification',
 'Cisco',
 'cloud',
 'compliance',
 'computer applications',
 'computer science',
 'controls',
 'CSS',
 'D (programming language)',
 'data center',
 'data collection',
 'data entry',
 'data management',
 'database',
 'datasets',
 'design',
 'development activities',
 'digital marketing',
 'digital media',
 'distribution',
 'DNS',
 'ecommerce',
 'e-commerce',
 'end user',
 'experimental',
 'experiments',
 'frameworks',
 'front-end',
 'GIS',
 'graphic design',
 'hardware',
 'HTML5',
 'I-DEAS',
 'information management',
 'information security',
 'information technology',
 'intranet',
 'iOS',
 'iPhone',
 'IT infrastructure',
 'ITIL',
 'Java',
 'Javascript',
 'JIRA',
 'LAN',
 'licensing',
 'Linux',
 'machine learning',
 'MATLAB',
 'matrix',
 'mechanical engineerin

Now the logic is simple.

1. Intersect the JD words with the tech keyword list, keeping only the JD skills that are among the most common ATS tech keywords.
2. Intersect that set with the words in the resume.
3. The result measures what share of the common ATS tech skills listed in the JD is also present in the resume (normalized to the range 0 to 1).

In [69]:
# Build the lowercase keyword set once instead of once per row
tech_keywords_set = {kw.lower() for kw in tech_keywords}

def calculate_tech_keyword_overlap(row):
    jd_words = set(str(row['Cleaned_JD']).lower().split())
    resume_words = set(str(row['Cleaned_Resume']).lower().split())
    # Step 1: ATS tech keywords that the JD mentions
    jd_tech_keywords = jd_words.intersection(tech_keywords_set)
    # Step 2: of those, the ones the resume also mentions
    resume_tech_keywords = resume_words.intersection(jd_tech_keywords)
    # Step 3: normalize by the number of JD tech keywords (0 if the JD has none)
    return len(resume_tech_keywords) / len(jd_tech_keywords) if len(jd_tech_keywords) > 0 else 0

df_embedded['Tech_Keyword_Overlap'] = df_embedded.apply(calculate_tech_keyword_overlap, axis=1)
df_embedded['Tech_Keyword_Overlap']

0        0.250000
1        0.000000
2        0.000000
3        0.250000
4        0.333333
           ...   
10169    0.000000
10170    0.625000
10171    0.375000
10172    0.666667
10173    0.500000
Name: Tech_Keyword_Overlap, Length: 10174, dtype: float64
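One limitation of the whitespace split used above is that multi-word keywords such as `machine learning` or `big data` can never equal a single token, so they are never counted. The sketch below shows one possible phrase-aware variant using regular expressions. The helper names `find_keywords` and `tech_overlap`, the sample keyword list, and the sample texts are all assumptions made for illustration.

```python
import re

def find_keywords(text, keywords):
    """Return the lowercase keywords that occur in text as whole words or phrases."""
    text = text.lower()
    found = set()
    for kw in keywords:
        kw_lower = kw.lower()
        # Lookarounds instead of \b so keywords like 'C++' or '.NET' still match,
        # while 'Java' does not match inside 'JavaScript'
        pattern = r"(?<![a-z0-9])" + re.escape(kw_lower) + r"(?![a-z0-9])"
        if re.search(pattern, text):
            found.add(kw_lower)
    return found

def tech_overlap(resume_text, jd_text, keywords):
    """Share of the JD's tech keywords that the resume also contains."""
    jd_keywords = find_keywords(jd_text, keywords)
    if not jd_keywords:
        return 0.0
    resume_keywords = find_keywords(resume_text, jd_keywords)
    return len(resume_keywords) / len(jd_keywords)

# Invented sample data, not taken from the dataset
sample_keywords = ["machine learning", "C++", "Java", "big data"]
jd = "We need machine learning, C++ and big data skills."
resume = "Built machine learning models in C++ and JavaScript."

# The JD mentions 3 keywords; the resume covers 2 of them
score = tech_overlap(resume, jd, sample_keywords)
print(round(score, 4))  # 0.6667
```

This scans the text once per keyword, so it is slower than a set intersection; whether the extra matches justify the cost on 10,174 rows is a judgment call.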

### Creating Feature 5 (Role Binary Count)

This is the simplest one. We want to see whether the job title (Role) is explicitly mentioned in the resume.

In [59]:
def job_title_presence(row):
    job_title = str(row['Role']).lower()
    resume_text = str(row['Cleaned_Resume']).lower()
    return 1 if job_title in resume_text else 0

df_embedded['Job_Title_Presence'] = df_embedded.apply(job_title_presence, axis=1)
df_embedded['Job_Title_Presence'].value_counts()

Job_Title_Presence
1    10022
0      152
Name: count, dtype: int64

In [62]:
df_embedded["Decision"].value_counts()

Decision
reject    5114
select    5060
Name: count, dtype: int64

Research suggests that including the job title in the resume boosts the chances of securing the job by 10.6 times, but the data seems to suggest otherwise: almost every resume (10,022 of 10,174) already contains it. Since the select/reject split is close to 50:50, we can make an educated guess that most rejected resumes also contain the job title, which disputes the research claim. So let's investigate further.

In [67]:
rejected_mask = df_embedded["Decision"] == "reject"
df_embedded[rejected_mask]["Job_Title_Presence"].value_counts()

Job_Title_Presence
1    5038
0      76
Name: count, dtype: int64
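The counts printed above are enough to compare the two groups directly. The short calculation below derives the figures for selected resumes by subtraction and computes the share of each group that contains the job title; all inputs are copied from the outputs above.

```python
# Counts taken from the value_counts() outputs above
total_with_title, total_without_title = 10022, 152
rejected_with_title, rejected_without_title = 5038, 76

# Selected counts follow by subtraction
selected_with_title = total_with_title - rejected_with_title            # 4984
selected_without_title = total_without_title - rejected_without_title   # 76

rejected_rate = rejected_with_title / (rejected_with_title + rejected_without_title)
selected_rate = selected_with_title / (selected_with_title + selected_without_title)

print(f"rejected with title: {rejected_rate:.2%}")
print(f"selected with title: {selected_rate:.2%}")
```

Both shares come out at roughly 98.5%, so the binary flag barely distinguishes selected from rejected resumes in this dataset.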

This suggests that a continuous measure of job title presence may serve better than a binary one. The assumption is that the binary flag is diluted by misleading matches (generic title words such as "Specialist", "Engineer", or "Manager") that set it to 1 even when the match isn't necessarily accurate.

In [74]:
def title_presence_analysis(row):
    job_title = str(row['Role']).lower()
    resume_text = str(row['Cleaned_Resume']).lower()
    # Exact phrase match: the full title appears verbatim in the resume
    if job_title in resume_text:
        return 1.0
    # Otherwise score the fraction of title words found anywhere in the resume
    role_words = set(job_title.split())
    resume_words = set(resume_text.split())
    intersection = role_words.intersection(resume_words)
    return len(intersection) / len(role_words)

df_embedded['Title_Presence_Analysis'] = df_embedded.apply(title_presence_analysis, axis=1)
df_embedded['Title_Presence_Analysis'].value_counts()

Title_Presence_Analysis
1.000000    10058
0.666667       87
0.500000       26
0.333333        2
0.000000        1
Name: count, dtype: int64
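The fractional values in this output come from partial title matches. As an illustration, the sketch below applies the same rule to invented strings; `title_presence` is a hypothetical standalone version of `title_presence_analysis` that takes plain text instead of a DataFrame row.

```python
def title_presence(role, resume_text):
    """1.0 for a verbatim title match, else the fraction of title words in the resume."""
    role = role.lower()
    resume_text = resume_text.lower()
    if role in resume_text:
        return 1.0
    role_words = set(role.split())
    resume_words = set(resume_text.split())
    return len(role_words & resume_words) / len(role_words)

# Invented examples, not taken from the dataset
full = title_presence("UI Engineer", "UI Engineer at Acme")
partial = title_presence("Human Resources Specialist", "Senior specialist in human capital")
none = title_presence("Data Engineer", "Accountant")

print(full, round(partial, 4), none)  # 1.0 0.6667 0.0
```

The second case scores 2/3 because "human" and "specialist" appear in the resume but "resources" does not.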

In [78]:
df_embedded

Unnamed: 0,Role,Resume,Decision,Job_Description,Augmented_Job_Description,Cleaned_JD,Cleaned_Resume,Resume_Embeddings,JD_Embeddings,Role_Embeddings,Resume_JD_Sim,Role_Resume_Sim,Word_Overlap,Tech_Keyword_Overlap,Job_Title_Presence,Title_Presence_Analysis,Title_Present
0,E-commerce Specialist,Jason Jones E-commerce Specialist Contact Inf...,reject,Be part of a passionate team at the forefront ...,Key Responsibilities:- Develop and implement e...,Key Responsibilities:- Develop and implement e...,Jason Jones E-commerce Specialist Contact Inf...,"[-0.17973936, -0.07596597, -0.09723778, -0.114...","[-0.054287765, -0.13262808, 0.15113121, -0.081...","[-0.010778938, -0.33309612, 0.049484182, 0.016...",0.505354,0.646826,0.298246,0.250000,1,1.0,1.0
1,Game Developer,Ann Marshall Contact Information: * Email: [a...,select,Help us build the next-generation products as ...,Key Responsibilities:- Design and develop enga...,Key Responsibilities:- Design and develop enga...,Ann Marshall Contact Information: Email: [a...,"[0.13636228, -0.17706871, 0.0489342, 0.0220676...","[0.0084996885, -0.0960481, 0.1524509, -0.02108...","[-0.14157347, -0.13269219, -0.009378961, -0.32...",0.411644,0.502842,0.069767,0.000000,1,1.0,1.0
2,Human Resources Specialist,Patrick Mcclain Human Resources Specialist Co...,reject,We need a Human Resources Specialist to enhanc...,Job Title: Human Resources SpecialistWe are se...,Job Title: Human Resources SpecialistWe are se...,Patrick Mcclain Human Resources Specialist Co...,"[-0.12360888, -0.0028093122, -0.059053816, -0....","[-0.10314158, 0.08767564, 0.106681176, 0.14662...","[-0.44442204, 0.1633573, -0.12067189, 0.144497...",0.604468,0.587193,0.269504,0.000000,1,1.0,1.0
3,E-commerce Specialist,Patricia Gray Contact Information: * Email: [...,select,Be part of a passionate team at the forefront ...,Key Responsibilities:- Design and implement e-...,Key Responsibilities:- Design and implement e-...,Patricia Gray Contact Information: Email: [...,"[0.08986741, -0.15142985, -0.020609418, 0.0756...","[-0.044191867, -0.20852506, 0.1408912, -0.0783...","[-0.010778938, -0.33309612, 0.049484182, 0.016...",0.351232,0.512997,0.231579,0.250000,1,1.0,1.0
4,E-commerce Specialist,Amanda Gross Contact Information: * Email: [a...,reject,We are looking for an experienced E-commerce S...,Key Responsibilities:- Develop and implement d...,Key Responsibilities:- Develop and implement d...,Amanda Gross Contact Information: Email: [a...,"[-0.16076297, -0.256428, 0.077939644, 0.081880...","[-0.098053396, -0.0045841606, 0.017642075, -0....","[-0.010778938, -0.33309612, 0.049484182, 0.016...",0.428203,0.641594,0.300885,0.333333,1,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10169,Product Manager,**Diana Miller** **Contact Information:** * E...,reject,Here is a comprehensive job description for a ...,:Key Responsibilities:- Develop and execute pr...,:Key Responsibilities:- Develop and execute pr...,Diana Miller Contact Information: E...,"[-0.089459516, -0.040913753, -0.1686066, 0.121...","[-0.13952348, 0.021376673, -0.040870953, -0.10...","[-0.23212944, -0.029141244, 0.11806843, -0.110...",0.435778,0.481033,0.276596,0.000000,1,1.0,1.0
10170,UI Engineer,**Grace Taylor** **Contact Information:** * ...,reject,Here is a sample job description for a UI Engi...,Key Responsibilities:- Design and develop visu...,Key Responsibilities:- Design and develop visu...,Grace Taylor Contact Information: ...,"[-0.08140646, -0.13228643, -0.045160342, 0.020...","[-0.104550816, -0.24781276, 0.09464185, 0.1046...","[-0.45176134, -0.18218127, 0.14167099, 0.09667...",0.367484,0.471639,0.362637,0.625000,1,1.0,1.0
10171,UI Engineer,**Hank Brown** **UI Engineer** **Contact Info...,select,Here is a job description for a UI Engineer ro...,:Key Responsibilities:- Design and develop use...,:Key Responsibilities:- Design and develop use...,Hank Brown UI Engineer Contact Info...,"[-0.35859102, 0.08209543, -0.12568502, -0.0462...","[-0.13481516, -0.1497447, 0.089010455, 0.03778...","[-0.45176134, -0.18218127, 0.14167099, 0.09667...",0.459990,0.501841,0.337079,0.375000,1,1.0,1.0
10172,Data Engineer,**Diana Wilson** **Contact Information:** * A...,reject,Here is a comprehensive job description for a ...,":Key Responsibilities:- Design, build, and mai...",":Key Responsibilities:- Design, build, and mai...",Diana Wilson Contact Information: A...,"[0.11309587, -0.123301044, -0.20182684, 0.0077...","[-0.2218479, -0.09838769, -0.17422803, -0.1901...","[-0.3564379, 0.2009739, -0.4245388, -0.3691941...",0.403455,0.499404,0.386667,0.666667,1,1.0,1.0


Based on these results, we can consider omitting this feature: its variance is too low for it to add much value to the model.
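The low variance can be quantified from the `Title_Presence_Analysis` value counts printed earlier. The snippet below is a rough check that recomputes the mean and the population variance from those counts, treating 0.666667 and 0.333333 as 2/3 and 1/3.

```python
# Value counts copied from the Title_Presence_Analysis output
value_counts = {1.0: 10058, 2 / 3: 87, 0.5: 26, 1 / 3: 2, 0.0: 1}

n = sum(value_counts.values())
mean = sum(value * count for value, count in value_counts.items()) / n
# Population variance: count-weighted average squared deviation from the mean
variance = sum(count * (value - mean) ** 2 for value, count in value_counts.items()) / n
share_at_one = value_counts[1.0] / n

print(f"n={n}, mean={mean:.4f}, variance={variance:.5f}, share at 1.0={share_at_one:.2%}")
```

With nearly 99% of rows at exactly 1.0, the feature is almost constant, which supports dropping it.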

## Complete DataFrame with Features

Now we clean up and create the final dataframe with only our intended features, dropping columns we no longer need.

In [80]:
data = df_embedded.drop(["Role","Resume","Job_Description","Augmented_Job_Description","Cleaned_Resume","Cleaned_JD","Resume_Embeddings","JD_Embeddings","Role_Embeddings","Job_Title_Presence","Title_Presence_Analysis","Title_Present"], axis=1)
data

Unnamed: 0,Decision,Resume_JD_Sim,Role_Resume_Sim,Word_Overlap,Tech_Keyword_Overlap
0,reject,0.505354,0.646826,0.298246,0.250000
1,select,0.411644,0.502842,0.069767,0.000000
2,reject,0.604468,0.587193,0.269504,0.000000
3,select,0.351232,0.512997,0.231579,0.250000
4,reject,0.428203,0.641594,0.300885,0.333333
...,...,...,...,...,...
10169,reject,0.435778,0.481033,0.276596,0.000000
10170,reject,0.367484,0.471639,0.362637,0.625000
10171,select,0.459990,0.501841,0.337079,0.375000
10172,reject,0.403455,0.499404,0.386667,0.666667


In [81]:
data.to_pickle("../data/final_resume_screening_dataset.pkl")