## Problem Statement
Build an ML system that:

-reads resume text

-compares it with a job description

-ranks candidates based on relevance

-Identifies missing skills

In [1]:
import pandas as pd
import numpy as np
import re 
import string 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
import pandas as pd
df = pd.read_csv("task 3 dataset/Resume/Resume.csv")
df.head()

Unnamed: 0,ID,Resume_str,Resume_html,Category
0,16852973,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,"<div class=""fontsize fontface vmargins hmargin...",HR
1,22323967,"HR SPECIALIST, US HR OPERATIONS ...","<div class=""fontsize fontface vmargins hmargin...",HR
2,33176873,HR DIRECTOR Summary Over 2...,"<div class=""fontsize fontface vmargins hmargin...",HR
3,27018550,HR SPECIALIST Summary Dedica...,"<div class=""fontsize fontface vmargins hmargin...",HR
4,17812897,HR MANAGER Skill Highlights ...,"<div class=""fontsize fontface vmargins hmargin...",HR


In [3]:
df = df[['Resume_str', 'Category']]
df.head()

Unnamed: 0,Resume_str,Category
0,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,HR
1,"HR SPECIALIST, US HR OPERATIONS ...",HR
2,HR DIRECTOR Summary Over 2...,HR
3,HR SPECIALIST Summary Dedica...,HR
4,HR MANAGER Skill Highlights ...,HR


## Resume Text Cleaning & Preprocessing

In this step,we clean the resume text by:

- Removing URLs

- Removing special characters

- Converting text to lowercase


In [4]:
import re

def clean_text(text):
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.lower()
    return text

df['cleaned_resume'] = df['Resume_str'].apply(clean_text)

df[['cleaned_resume']].head()

Unnamed: 0,cleaned_resume
0,hr administrator marketing associate hr admin...
1,hr specialist us hr operations summary versat...
2,hr director summary over 20 years experience ...
3,hr specialist summary dedicated driven and dy...
4,hr manager skill highlights hr skills hr depa...


## Convert Resume Text into Numerical Features(TF-IDF)

We Convert the cleaned resume text into numerical vectors using TF-IDF. This helps measure similarity between resumes and the job description.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')

x = tfidf.fit_transform(df['cleaned_resume'])

x.shape

(2484, 40133)

In [6]:
job_description = """ Looking for an HR Specialist with experience in recuitment, employee engagement, payroll management, and HR operations.Strong Communication and leadership skills required."""

In [7]:
job_vector = tfidf.transform([job_description])

## Resume-to-job Similarity Scoring

we:

Transform the job description using the same TF-IDF model.

compute cosine similarity between job description and all resumes.

Assign similarity scores.

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(job_vector, x)

similarity_scores

array([[0.15914048, 0.19349424, 0.17776378, ..., 0.01802586, 0.01015912,
        0.01250728]], shape=(1, 2484))

In [9]:
df['Similarity_Score'] = similarity_scores.flatten()
df.head()

Unnamed: 0,Resume_str,Category,cleaned_resume,Similarity_Score
0,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,HR,hr administrator marketing associate hr admin...,0.15914
1,"HR SPECIALIST, US HR OPERATIONS ...",HR,hr specialist us hr operations summary versat...,0.193494
2,HR DIRECTOR Summary Over 2...,HR,hr director summary over 20 years experience ...,0.177764
3,HR SPECIALIST Summary Dedica...,HR,hr specialist summary dedicated driven and dy...,0.159648
4,HR MANAGER Skill Highlights ...,HR,hr manager skill highlights hr skills hr depa...,0.443194


## Ranking Candidates Based on Relevance

Resumes are sorted in descending order of similarity score. Higher score means better match to job role.

In [10]:
ranked_df = df.sort_values(by='Similarity_Score', ascending=False)
ranked_df.head(5)

Unnamed: 0,Resume_str,Category,cleaned_resume,Similarity_Score
4,HR MANAGER Skill Highlights ...,HR,hr manager skill highlights hr skills hr depa...,0.443194
101,REGIONAL HR BUSINESS PARTNER Hu...,HR,regional hr business partner human resources ...,0.391982
65,HR CONSULTING Summary 7+ yea...,HR,hr consulting summary 7 years of experience a...,0.382787
58,HR CONSULTANT Summary C...,HR,hr consultant summary certified human resourc...,0.379556
92,GLOBAL HR MANAGER Summary ...,HR,global hr manager summary a global hr profess...,0.369236


## Step 5: Skill Extraction & Skill Gap Identification

In this step:
- We define required skills for the job
- Extract matching skills from resumes
- Identify missing skills

In [11]:
df.columns

Index(['Resume_str', 'Category', 'cleaned_resume', 'Similarity_Score'], dtype='object')

In [12]:
df['Candidate_ID'] = df.index

In [13]:
ranked_df = df.sort_values(by='Similarity_Score', ascending=False)

ranked_df[['Candidate_ID', 'Category', 'Similarity_Score']].head(5)

Unnamed: 0,Candidate_ID,Category,Similarity_Score
4,4,HR,0.443194
101,101,HR,0.391982
65,65,HR,0.382787
58,58,HR,0.379556
92,92,HR,0.369236


In [14]:
job_description = """
Looking for HR Manager with skills in recruitment, employee relations,
performance management, payroll, HR policies, communication, leadership.
"""

In [15]:
required_skills = [
    "recruitment",
    "employee relations",
    "performance management",
    "payroll",
    "hr policies",
    "communication",
    "leadership"
]

In [16]:
df.columns

Index(['Resume_str', 'Category', 'cleaned_resume', 'Similarity_Score',
       'Candidate_ID'],
      dtype='object')

In [17]:
def find_missing_skills(resume_text):
    missing = []
    for skill in required_skills:
        if skill.lower() not in resume_text.lower():
            missing.append(skill)
    return missing

df['Missing_Skills'] = df['cleaned_resume'].apply(find_missing_skills)

In [18]:
ranked_df = df.copy()

In [19]:
ranked_df = ranked_df.sort_values(by='Similarity_Score', ascending=False)

In [20]:
ranked_df[['Candidate_ID', 'Category', 'Similarity_Score', 'Missing_Skills']].head(5)

Unnamed: 0,Candidate_ID,Category,Similarity_Score,Missing_Skills
4,4,HR,0.443194,[communication]
101,101,HR,0.391982,"[employee relations, hr policies, communication]"
65,65,HR,0.382787,[communication]
58,58,HR,0.379556,"[recruitment, communication]"
92,92,HR,0.369236,"[employee relations, payroll, hr policies]"


## Final Output
The resumes are ranked based on similarity score with the job description.

Missing required HR skills are identified for each candidate.

Top candidates are displayed with similarity score and missing skills.