# Resume / Candidate Screening System
## Machine Learning Internship Task 3

This project builds an NLP-based resume screening system that:

- Reads and preprocesses resume text
- Extracts relevant skills and keywords
- Matches resumes against a given job description
- Scores and ranks candidates based on role fit
- Identifies missing or critical skills

This simulates a real-world ML-powered hiring support tool used in HR-tech systems.

In [3]:
import pandas as pd

df = pd.read_csv("../data/Resume.csv")
df.head()

|   | ID | Resume_str | Resume_html | Category |
|---|---|---|---|---|
| 0 | 16852973 | HR ADMINISTRATOR/MARKETING ASSOCIATE\... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 1 | 22323967 | HR SPECIALIST, US HR OPERATIONS ... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 2 | 33176873 | HR DIRECTOR Summary Over 2... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 3 | 27018550 | HR SPECIALIST Summary Dedica... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 4 | 17812897 | HR MANAGER Skill Highlights ... | `<div class="fontsize fontface vmargins hmargin...` | HR |


In [5]:
df.columns

Index(['ID', 'Resume_str', 'Resume_html', 'Category'], dtype='object')

## Resume Text Cleaning & Preprocessing

In this step, we clean the raw resume text to prepare it for analysis.

We will:
- Convert text to lowercase
- Remove digits
- Remove punctuation
- Remove stopwords
- Collapse extra whitespace

This ensures consistent processing for skill extraction and similarity matching.

In [6]:
import nltk
from nltk.corpus import stopwords
import string
import re

# Download stopwords (run once)
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # Remove digits
    # Delete punctuation outright; note that this joins "a/b" into "ab"
    text = text.translate(str.maketrans('', '', string.punctuation))
    # split() with no arguments also collapses runs of whitespace
    tokens = [word for word in text.split() if word not in stop_words]
    return " ".join(tokens)

# Apply cleaning
df['cleaned_resume'] = df['Resume_str'].apply(clean_text)

df[['Resume_str', 'cleaned_resume']].head()


|   | Resume_str | cleaned_resume |
|---|---|---|
| 0 | HR ADMINISTRATOR/MARKETING ASSOCIATE\... | hr administratormarketing associate hr adminis... |
| 1 | HR SPECIALIST, US HR OPERATIONS ... | hr specialist us hr operations summary versati... |
| 2 | HR DIRECTOR Summary Over 2... | hr director summary years experience recruitin... |
| 3 | HR SPECIALIST Summary Dedica... | hr specialist summary dedicated driven dynamic... |
| 4 | HR MANAGER Skill Highlights ... | hr manager skill highlights hr skills hr depar... |
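
The first row of the output shows a side effect of deleting punctuation outright: `ADMINISTRATOR/MARKETING` becomes the single token `administratormarketing`. One possible variant, sketched below, replaces punctuation with a space instead. The names `clean_text_spaced` and `DEMO_STOP_WORDS` are hypothetical and not part of the notebook; the tiny stopword set only keeps the sketch self-contained, and in the notebook the NLTK `stop_words` set would be passed in instead.

```python
import re
import string

# Tiny illustrative stopword set (an assumption for this sketch);
# in the notebook you would pass the NLTK `stop_words` set instead.
DEMO_STOP_WORDS = {"a", "an", "and", "the", "of", "in", "with", "for", "to"}

# Translation table mapping every punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text_spaced(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, drop digits, and replace punctuation with spaces."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)        # digits become a space
    text = text.translate(_PUNCT_TO_SPACE)  # "a/b" becomes "a b"
    tokens = [word for word in text.split() if word not in stop_words]
    return " ".join(tokens)


print(clean_text_spaced("HR ADMINISTRATOR/MARKETING ASSOCIATE"))
```

The trade-off is that hyphenated names are split as well: `scikit-learn` becomes the two tokens `scikit` and `learn` rather than `scikitlearn`. Either choice works for TF-IDF as long as the job description and the resumes go through the same function.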


## Defining the Target Job Description

We now define a sample job description for the role of "Data Scientist".

The resume screening system will:
- Compare each resume against this job description
- Calculate similarity scores
- Rank candidates based on role fit

In [7]:
job_description = """
We are looking for a Data Scientist with strong skills in:

Python programming
Machine learning
Deep learning
Data analysis
Statistics
SQL
Data visualization
Pandas and NumPy
Scikit-learn
Model evaluation
Communication skills

The candidate should have experience in building predictive models,
working with structured and unstructured data,
and deploying machine learning solutions.
"""

# Clean the job description
cleaned_job_description = clean_text(job_description)

print(cleaned_job_description[:500])

looking data scientist strong skills python programming machine learning deep learning data analysis statistics sql data visualization pandas numpy scikitlearn model evaluation communication skills candidate experience building predictive models working structured unstructured data deploying machine learning solutions


## Converting Text to Numerical Representation (TF-IDF)

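To compare resumes with the job description,
we convert the text into numerical vectors using TF-IDF
(term frequency-inverse document frequency).

TF-IDF gives a term a high weight when it occurs often in one document
but in few documents overall, so words that appear in most resumes contribute little.

We will:
- Combine all resumes with the job description into one corpus
- Fit a single vectorizer on that corpus, capped at the 5,000 most frequent terms
- Transform every document into a vector in the same feature space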

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000)

# Combine job description + all resumes
all_documents = [cleaned_job_description] + df['cleaned_resume'].tolist()

tfidf_matrix = vectorizer.fit_transform(all_documents)

print("TF-IDF Matrix Shape:", tfidf_matrix.shape)

TF-IDF Matrix Shape: (2485, 5000)
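
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a three-document toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is how scikit-learn documents the default behaviour of `TfidfVectorizer`. The function name `tfidf_vectors` and the toy documents are made up for illustration, and the sketch ignores details such as the library's tokenisation rules and `max_features`.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count multiplied by the smoothed IDF.
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    "python machine learning sql",
    "sales marketing sql",
    "hr recruiting sql",
]
vectors = tfidf_vectors(docs)

# "sql" occurs in every document, so within the first document it is
# weighted below "python", which occurs in only one.
print(vectors[0]["sql"] < vectors[0]["python"])
```

Because each vector is scaled to unit length, the cosine similarity between two documents reduces to the dot product of their vectors.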


## Resume-to-Job Similarity Scoring

We now compute cosine similarity between the job description
and each resume.

Cosine similarity measures the angle between two TF-IDF vectors.
Because TF-IDF weights are non-negative, the score ranges from 0
(no shared terms) to 1 (identical term distribution).

Higher similarity score → Better role fit.

In [10]:
from sklearn.metrics.pairwise import cosine_similarity

# First row = Job description
job_vector = tfidf_matrix[0]

# Remaining rows = Resumes
resume_vectors = tfidf_matrix[1:]

# Compute similarity
similarity_scores = cosine_similarity(job_vector, resume_vectors)

# Flatten to 1D array
similarity_scores = similarity_scores.flatten()

# Add scores to DataFrame
df['Similarity_Score'] = similarity_scores

df[['Category', 'Similarity_Score']].head()

|   | Category | Similarity_Score |
|---|---|---|
| 0 | HR | 0.05707 |
| 1 | HR | 0.011717 |
| 2 | HR | 0.00732 |
| 3 | HR | 0.008862 |
| 4 | HR | 0.018948 |
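
As a worked illustration of the score itself, the sketch below computes cosine similarity by hand for sparse vectors stored as dictionaries. The helper name `cosine` and the toy weights are assumptions made for this example; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document matches nothing
    return dot / (norm_u * norm_v)


job = {"python": 0.6, "sql": 0.8}
resume_same = {"python": 0.6, "sql": 0.8}   # same terms, same weights
resume_partial = {"python": 1.0}            # shares one term with the job
resume_unrelated = {"recruiting": 1.0}      # shares no terms

for name, vec in [("same", resume_same),
                  ("partial", resume_partial),
                  ("unrelated", resume_unrelated)]:
    print(name, round(cosine(job, vec), 3))
```

A resume with no terms in common with the job description scores exactly 0, which helps explain why the HR resumes above land so close to zero.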


## Ranking Candidates Based on Similarity Score

We now rank resumes by their similarity
to the job description, highest first.
The top of the list holds the closest matches to the job role.

This mimics how automated resume screening tools
prioritize applicants.

In [11]:
# Sort resumes by similarity (descending)
ranked_candidates = df.sort_values(by="Similarity_Score", ascending=False)

# Reset index
ranked_candidates = ranked_candidates.reset_index(drop=True)

# Show top 10 candidates
ranked_candidates[['Category', 'Similarity_Score']].head(10)

|   | Category | Similarity_Score |
|---|---|---|
| 0 | CONSULTANT | 0.361864 |
| 1 | ENGINEERING | 0.330282 |
| 2 | AUTOMOBILE | 0.253319 |
| 3 | BANKING | 0.242877 |
| 4 | INFORMATION-TECHNOLOGY | 0.214309 |
| 5 | SALES | 0.198439 |
| 6 | AGRICULTURE | 0.198043 |
| 7 | SALES | 0.197443 |
| 8 | DESIGNER | 0.195183 |
| 9 | ARTS | 0.187023 |
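
The top ten spans categories such as CONSULTANT, AUTOMOBILE and ARTS, so it can be useful to check which terms actually drive a given score. When both vectors are L2-normalised, the cosine score is the sum of the per-term products of the two vectors, so sorting those products shows the terms that contribute most. The sketch below uses made-up, unnormalised toy vectors and a hypothetical helper `top_shared_terms`; with the notebook's sparse matrix, the same products could presumably be obtained by multiplying the job row and a resume row element-wise and mapping column indices back to terms via `vectorizer.get_feature_names_out()`.

```python
def top_shared_terms(job_vec, resume_vec, k=3):
    """Rank shared terms by their contribution to the dot product."""
    contributions = {
        term: weight * resume_vec[term]
        for term, weight in job_vec.items()
        if term in resume_vec
    }
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy weights, invented for illustration only.
job_vec = {"data": 0.5, "learning": 0.5, "python": 0.4, "sql": 0.3}
resume_vec = {"data": 0.6, "consulting": 0.5, "sql": 0.2, "python": 0.1}

for term, score in top_shared_terms(job_vec, resume_vec):
    print(f"{term}: {score:.2f}")
```

In this toy case the generic term `data` outweighs `python` and `sql` combined, which is the kind of pattern that can push a non-technical resume up the ranking.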


## Skill Gap Identification

To enhance the screening system,
we now identify which required skills
are missing from each top-ranked resume.

This helps recruiters understand:
- Why a candidate ranked higher
- Which skills are missing

In [12]:
# Define required skills for Data Scientist role
required_skills = [
    "python", "machine learning", "deep learning", "sql",
    "statistics", "data analysis", "pandas", "numpy",
    "scikit-learn", "data visualization"
]

def get_missing_skills(resume_text):
    # Plain substring check on the lowercased text,
    # so "sql" also counts as present inside "mysql"
    text = resume_text.lower()
    return [skill for skill in required_skills if skill not in text]

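# Apply only to the top 5 candidates for the demo, using the raw resume text:
# cleaning strips the hyphen from "scikit-learn", so it could never match there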
top_candidates = ranked_candidates.head(5).copy()

top_candidates["Missing_Skills"] = top_candidates["Resume_str"].apply(get_missing_skills)

top_candidates[['Category', 'Similarity_Score', 'Missing_Skills']]

|   | Category | Similarity_Score | Missing_Skills |
|---|---|---|---|
| 0 | CONSULTANT | 0.361864 | [deep learning, statistics, pandas, numpy, sci... |
| 1 | ENGINEERING | 0.330282 | [deep learning, numpy, scikit-learn] |
| 2 | AUTOMOBILE | 0.253319 | [machine learning, deep learning, scikit-learn] |
| 3 | BANKING | 0.242877 | [deep learning, statistics, data analysis, pan... |
| 4 | INFORMATION-TECHNOLOGY | 0.214309 | [machine learning, deep learning, statistics, ... |
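
The check above is a plain substring test, so a short skill name such as `sql` is counted as present whenever it appears inside a longer word (for example `nosql`). If whole-word matching is preferred, one option is a regular expression with word-boundary lookarounds, sketched below. The function name `find_missing_skills` and the sample sentence are hypothetical; whether a mention of NoSQL should count as SQL experience is a judgment call for the recruiter.

```python
import re


def find_missing_skills(resume_text, required_skills):
    """Return required skills that do not appear as whole words or phrases."""
    text = resume_text.lower()
    missing = []
    for skill in required_skills:
        # The lookarounds reject matches embedded in a longer word,
        # while re.escape keeps the hyphen in "scikit-learn" literal.
        pattern = r"(?<!\w)" + re.escape(skill.lower()) + r"(?!\w)"
        if not re.search(pattern, text):
            missing.append(skill)
    return missing


skills = ["sql", "python", "scikit-learn"]
print(find_missing_skills("Built NoSQL pipelines using Python.", skills))
```

Here `python` is found even though it is followed by a full stop, while `sql` is reported as missing because it only occurs inside `nosql`.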


## Final Conclusion

In this project, we built a Resume Screening and Ranking System 
using Natural Language Processing techniques.

### What Was Implemented:

- Resume text cleaning and preprocessing
- Job description cleaning with the same preprocessing as the resumes
- TF-IDF based feature extraction
- Cosine similarity matching between resumes and job role
- Candidate ranking based on similarity score
- Skill gap identification for required competencies

### How the System Works:

1. The job description and the resumes are cleaned with the same preprocessing function.
2. A single TF-IDF vectorizer is fit on the job description and all resumes, so every document shares one feature space.
3. Cosine similarity measures how closely each resume matches the job description.
4. Candidates are ranked from highest to lowest similarity.
5. Missing required skills are identified for the top-ranked candidates (the top 5 in this demo).

### Business Impact:

This system helps recruiters:

- Automatically shortlist candidates
- Reduce manual resume screening time
- Compare applicants on a consistent, keyword-based score
- Identify skill gaps quickly
- Improve hiring efficiency

The approach mirrors how many modern HR-tech platforms
perform resume-to-job matching at scale.

This project demonstrates how NLP can transform 
unstructured resume data into structured, 
actionable hiring insights.