<a href="https://colab.research.google.com/github/sycamore-st/resume-job-matching-system/blob/main/Resume%E2%80%93Job_Matching_with_Deep_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Resume–Job Matching System Using Deep Learning
## Objective:
To build a semantic matching system that recommends the most relevant job postings for a given resume (and vice versa), using NLP and deep learning techniques.

**Key Components:**

1. Environment Check
  * Verifies GPU and RAM availability in the runtime (e.g., Google Colab).

2. Data Loading
	*	Loads two datasets:
	  *	📄 postings.csv: job listings with title, description, skills, etc.
	  *	👩‍💼 Resume.csv: raw text resumes from job applicants.

3. Text Preprocessing
	*	Cleans job descriptions using spaCy, removing boilerplate content.
	*	Cleans resumes using regex-based text normalization.

4. Text Embedding with Sentence Transformers
	*	Uses `all-MiniLM-L6-v2` to embed job descriptions and resumes into dense vector representations.

5. Semantic Similarity Matching
	*	Uses cosine similarity to:
	  *	Find top-matching resumes for a given job.
	  *	Find top-matching jobs for a given resume.
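
For reference, the cosine similarity used in step 5 can be written as follows. The symbols $\mathbf{r}$ (a resume embedding), $\mathbf{j}$ (a job embedding), and $d$ (the embedding dimension) are chosen here for illustration and do not appear elsewhere in the notebook:

$$
\operatorname{sim}(\mathbf{r}, \mathbf{j})
= \frac{\mathbf{r} \cdot \mathbf{j}}{\lVert \mathbf{r} \rVert \, \lVert \mathbf{j} \rVert}
= \frac{\sum_{i=1}^{d} r_i \, j_i}{\sqrt{\sum_{i=1}^{d} r_i^2} \, \sqrt{\sum_{i=1}^{d} j_i^2}}
$$

The score lies in $[-1, 1]$ and depends only on the angle between the two vectors, not on their lengths. For `all-MiniLM-L6-v2`, $d = 384$.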

### Environment Check
Check if the Colab runtime has access to a GPU and how much RAM is available.

In [5]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Wed Apr 23 12:57:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   33C    P8             11W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
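
The `!nvidia-smi` line is IPython shell syntax, so the cell above only runs inside a notebook. A minimal sketch of a comparable check using only the standard library is shown below; the function name `gpu_summary` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def gpu_summary() -> str:
    """Return nvidia-smi output, or a short message when no GPU tooling is found."""
    # shutil.which returns None when the executable is not on PATH
    if shutil.which("nvidia-smi") is None:
        return "Not connected to a GPU (nvidia-smi not found)"
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    # A non-zero exit code usually means the driver could not reach a GPU
    if result.returncode != 0:
        return "Not connected to a GPU"
    return result.stdout


print(gpu_summary())
```

Checking the exit code instead of searching the output for the word `failed` avoids depending on the exact wording of the error message.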
                                                

In [6]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 56.9 gigabytes of available RAM

You are using a high-RAM runtime!


### Load Job Postings Dataset
Load and preview the job postings data from a CSV file. This helps us understand the structure and size of the dataset.

In [7]:
import pandas as pd
import os
import re

import torch

In [8]:
postings_path = "/content/sample_data/postings.csv"
postings_df = pd.read_csv(postings_path, on_bad_lines='skip')

In [9]:
# postings_df was already loaded in the previous cell, so the CSV is not read again here

# Print number of rows and columns
print(postings_df.shape)

# Display the first few rows of the dataset to understand its structure
postings_df.head()

(123849, 31)


Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,This position requires a baseline understandin...,1712896000000.0,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,,1713452000000.0,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0


### Sample Job Postings
Take a 1000-row random sample from the full dataset to make development and testing faster.

In [10]:
postings_sample_df = postings_df.sample(1000)
postings_sample_df.shape

(1000, 31)

### Load spaCy for NLP Processing
We use spaCy to parse and clean the job description text, removing boilerplate and irrelevant info.

In [11]:
import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

def nlp_clean_job_description(text, min_len=6):

    """
    A function to clean and extract relevant sentences from the job descriptions.
    Keep only the meaningful parts like responsibilities or required skills.
    """

    doc = nlp(text)

    # Keywords to guide relevance
    keep_keywords = [
        'responsibilities', 'duties', 'tasks', 'requirements',
        'qualifications', 'skills', 'experience',
        'values', 'mission', 'about us', 'role'
    ]

    remove_keywords = [
        'equal opportunity', 'how to apply', 'location', 'schedule',
        'compensation', 'benefits', 'job type', 'work hours', 'insurance',
        'pay', 'vacation', 'salary', 'application instructions'
    ]

    clean_sentences = []

    for sent in doc.sents:
        s_lower = sent.text.lower()

        # Exclude boilerplate
        if any(x in s_lower for x in remove_keywords):
            continue

        # Keep only if contains a keep keyword or is sufficiently descriptive
        if any(x in s_lower for x in keep_keywords) or len(sent.text.split()) > min_len:
            clean_sentences.append(sent.text.strip())

    return " ".join(clean_sentences)


def combine_features(row):
    """Combine title, description, and skills fields from each job into a single string for embedding."""
    features = []
    for col in ['title', 'description', 'skills_desc']:
        if not pd.isnull(row[col]):
            # Note: every field goes through the sentence filter, so a title shorter
            # than min_len words with no keep keyword comes back as an empty string.
            features.append(
                f"{col.capitalize()}: {nlp_clean_job_description(row[col])}\n")
    return ' '.join(features)

In [12]:
%%time
# Apply preprocessing and feature combination
postings_sample_df['cleaned_jd'] = postings_sample_df.apply(
    combine_features,
    axis=1
  )
print(postings_sample_df.iloc[0]['cleaned_jd'])

Title: 
 Description: Programs are designed in partnership with New York City Transit's departmental management and where appropriate, other MTA agencies, in response to mandates by Federal, State and City entities. Operations Training is seeking a highly motivated person to join the team of instructors at the Signals Learning Center. The instructor is also responsible for, but not limited to, supervising, and instructing students in math, electronics, fiber optic training as well as the operation, inspection, troubleshooting and maintenance of NYC Transit signal equipment; developing and writing lesson plans; and developing and maintaining training aids and mock-ups. The instructor will also be responsible for planning, organizing, and coordinating the delivery of training, both computer-based and instructor-led training. In addition, the instructor will maintain records and prepare reports and evaluations on trainees, prepare, and deliver formal presentations, review new technology c
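
The empty `Title:` line in the output above follows from sending the title through the same sentence filter as the description: a two- or three-word title has no keep keyword and is not longer than `min_len` words, so it is dropped. The sketch below shows one possible way to keep the title verbatim. It makes several assumptions: a simple regex splitter stands in for spaCy's sentence segmentation, the keyword lists are shortened, rows are plain dicts with missing keys instead of NaN values, and the names `filter_sentences` and `combine_fields` are hypothetical.

```python
import re

# Stand-in for spaCy's sentence segmentation (an assumption for this sketch):
# split after ., ! or ? followed by whitespace.
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

# Shortened keyword lists, for illustration only
REMOVE_KEYWORDS = ["equal opportunity", "how to apply", "benefits", "salary"]
KEEP_KEYWORDS = ["responsibilities", "requirements", "skills", "experience"]


def filter_sentences(text: str, min_len: int = 6) -> str:
    """Apply the keep/remove rule from the cell above to regex-split sentences."""
    kept = []
    for sentence in _SENTENCE_SPLIT.split(text.strip()):
        lowered = sentence.lower()
        if any(k in lowered for k in REMOVE_KEYWORDS):
            continue
        if any(k in lowered for k in KEEP_KEYWORDS) or len(sentence.split()) > min_len:
            kept.append(sentence.strip())
    return " ".join(kept)


def combine_fields(row: dict) -> str:
    """Keep the title verbatim; filter only the long free-text fields."""
    parts = []
    if row.get("title"):
        parts.append(f"Title: {row['title'].strip()}")
    for col in ("description", "skills_desc"):
        if row.get(col):
            parts.append(f"{col.capitalize()}: {filter_sentences(row[col])}")
    return "\n".join(parts)


job = {
    "title": "Service Technician",
    "description": (
        "Looking for an HVAC service tech with several years of field experience. "
        "Great benefits offered."
    ),
}
print(repr(filter_sentences(job["title"])))  # the short title does not survive the filter
print(combine_fields(job))
```

Because the title is usually the most compact description of the role, keeping it intact may give the embedding model a useful signal that the filtered version loses.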

### Load Resume Dataset

In [13]:
resume_path = '/content/sample_data/Resume.csv'
resume_df = pd.read_csv(
    resume_path,
    on_bad_lines='skip',
    encoding='utf-8',
    engine='python'
    )

# Display the first few rows of the dataset to understand its structure
resume_df.head()

Unnamed: 0,ID,Resume_str,Resume_html,Category
0,16852973,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,"<div class=""fontsize fontface vmargins hmargin...",HR
1,22323967,"HR SPECIALIST, US HR OPERATIONS ...","<div class=""fontsize fontface vmargins hmargin...",HR
2,33176873,HR DIRECTOR Summary Over 2...,"<div class=""fontsize fontface vmargins hmargin...",HR
3,27018550,HR SPECIALIST Summary Dedica...,"<div class=""fontsize fontface vmargins hmargin...",HR
4,17812897,HR MANAGER Skill Highlights ...,"<div class=""fontsize fontface vmargins hmargin...",HR



### Clean Resume Text
Normalize, denoise, and structure resume text for better embedding representation.

In [15]:
resume_df['Resume_str'].iloc[0]

"         HR ADMINISTRATOR/MARKETING ASSOCIATE\n\nHR ADMINISTRATOR       Summary     Dedicated Customer Service Manager with 15+ years of experience in Hospitality and Customer Service Management.   Respected builder and leader of customer-focused teams; strives to instill a shared, enthusiastic commitment to customer service.         Highlights         Focused on customer satisfaction  Team management  Marketing savvy  Conflict resolution techniques     Training and development  Skilled multi-tasker  Client relations specialist           Accomplishments      Missouri DOT Supervisor Training Certification  Certified by IHG in Customer Loyalty and Marketing by Segment   Hilton Worldwide General Manager Training Certification  Accomplished Trainer for cross server hospitality systems such as    Hilton OnQ  ,   Micros    Opera PMS   , Fidelio    OPERA    Reservation System (ORS) ,   Holidex    Completed courses and seminars in customer service, sales strategies, inventory control, loss pr

In [16]:
import re

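def clean_resume_text(raw_text):
    if pd.isnull(raw_text):
        return ""

    # Step 1: Remove unusual characters, normalize whitespace, and lowercase
    text = re.sub(r'[^\w\s.,+/\-]', '', raw_text)  # keep key symbols
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace (including newlines) to single spaces
    text = text.strip().lower()  # lowercase

    # Step 2: Fix bullet-style compression
    # Note: this has no effect as written, because step 1 has already removed
    # the newlines and uppercase letters that the pattern looks for.
    text = re.sub(r'(\w)[\s]*[\r\n]+[\s]*([A-Z])', r'\1. \2', text)

    # Step 3: Remove placeholder tokens ('company name', 'city, state', 'n/a')
    text = re.sub(r'(company name[\s,]*)+', '', text, flags=re.IGNORECASE)
    text = re.sub(r'(city\s*,?\s*state)+', '', text, flags=re.IGNORECASE)
    # Note: without word boundaries this also strips "na" inside words
    # (e.g. "manager" becomes "mager").
    text = re.sub(r'n/?a', '', text, flags=re.IGNORECASE)

    return text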

In [17]:
clean_resume_text(resume_df['Resume_str'].iloc[0])

'hr administrator/marketing associate hr administrator summary dedicated customer service mager with 15+ years of experience in hospitality and customer service magement. respected builder and leader of customer-focused teams strives to instill a shared, enthusiastic commitment to customer service. highlights focused on customer satisfaction team magement marketing savvy conflict resolution techniques training and development skilled multi-tasker client relations specialist accomplishments missouri dot supervisor training certification certified by ihg in customer loyalty and marketing by segment hilton worldwide general mager training certification accomplished trainer for cross server hospitality systems such as hilton onq , micros opera pms , fidelio opera reservation system ors , holidex completed courses and semirs in customer service, sales strategies, inventory control, loss prevention, safety, time magement, leadership and performance assessment. experience hr administrator/mar
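
The output above contains words such as `mager`, `magement`, and `semirs`: the pattern `n/?a` also matches the letters "na" inside ordinary words. The following is a sketch of one possible variant, not the notebook's original function. It uses word boundaries (`\b`) and handles line breaks before lowercasing. The name `clean_resume_text_v2` is hypothetical, and the embeddings and similarity scores shown later in this notebook were produced with the original function, not this one.

```python
import re


def clean_resume_text_v2(raw_text):
    """Variant of clean_resume_text with reordered steps and word-bounded patterns."""
    # Treat anything that is not a string (None, NaN) as an empty resume
    if not isinstance(raw_text, str):
        return ""

    # Turn line breaks before a capitalised word into sentence breaks while
    # newlines and capital letters still exist.
    text = re.sub(r"(\w)[ \t]*[\r\n]+[ \t]*([A-Z])", r"\1. \2", raw_text)

    # Drop unusual characters, keeping a few symbols common in resumes.
    text = re.sub(r"[^\w\s.,+/\-]", "", text)

    # Remove placeholder tokens; \b keeps the patterns from matching inside words.
    text = re.sub(
        r"\b(?:company name|city\s*,?\s*state)\b[\s,]*", " ", text, flags=re.IGNORECASE
    )
    text = re.sub(r"\bn/?a\b", " ", text, flags=re.IGNORECASE)

    # Collapse whitespace last so the removals above leave no gaps.
    return re.sub(r"\s+", " ", text).strip().lower()


sample = "HR MANAGER\nSummary  Dedicated manager; N/A  Company Name City , State seminars"
print(clean_resume_text_v2(sample))
```

With this ordering, "manager" and "seminars" are left intact while the standalone "N/A" token is still removed.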

In [18]:
%%time
resume_df['cleaned_resume'] = resume_df['Resume_str'].apply(clean_resume_text)

CPU times: user 3.08 s, sys: 10.9 ms, total: 3.09 s
Wall time: 3.09 s


### Generate Text Embeddings
Load a SentenceTransformer model and encode both resumes and job descriptions into dense vector representations.

In [19]:
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')  # Or try 'paraphrase-mpnet-base-v2'


def generate_embeddings(text) -> torch.Tensor:
    return model.encode(
        text,
        convert_to_tensor=True
      ).cpu().reshape(1, -1)


# Generate embeddings one text at a time (each call encodes a single string, so there is no batching)
resume_df['embeddings'] = resume_df['cleaned_resume'].apply(generate_embeddings)
postings_sample_df['embeddings'] = postings_sample_df['cleaned_jd'].apply(
    generate_embeddings
)
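
Calling `generate_embeddings` through `.apply` encodes one string per call. `SentenceTransformer.encode` also accepts a list of strings together with a `batch_size` argument, which lets the library batch the work itself. The sketch below wraps that call; the names `encode_texts` and `_ToyEncoder` are hypothetical, and the toy encoder exists only so the example runs without downloading a model.

```python
import numpy as np


def encode_texts(texts, encoder, batch_size=64):
    """Encode all texts in one call and return an (n_texts, dim) float array.

    `encoder` is assumed to expose a SentenceTransformer-style
    encode(sentences, batch_size=..., convert_to_numpy=...) method.
    """
    embeddings = encoder.encode(
        list(texts), batch_size=batch_size, convert_to_numpy=True
    )
    return np.asarray(embeddings, dtype=np.float32)


class _ToyEncoder:
    """Hypothetical stand-in so the sketch runs without a real model."""

    def encode(self, sentences, batch_size=32, convert_to_numpy=True):
        # Two made-up features per text: character count and word count
        return np.array(
            [[len(s), len(s.split())] for s in sentences], dtype=np.float32
        )


matrix = encode_texts(
    ["python developer", "registered nurse with icu experience"], _ToyEncoder()
)
print(matrix.shape)  # (2, 2)
```

With the notebook's `model`, a call such as `encode_texts(resume_df['cleaned_resume'].tolist(), model)` would return one matrix with a row per resume, instead of a column of separate `(1, dim)` tensors.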


### Find Top Resume Matches for a Job
Given a random job, compute similarity scores with all resumes and display the top matches based on cosine similarity.

In [20]:
def find_top_candidates(
    embedding: torch.Tensor,
    *,
    candidate_series: pd.Series,
    top_n=5
):
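    # Cosine similarity between the query embedding and each candidate, as a plain float
    similarities = candidate_series.apply(
        lambda x: float(cosine_similarity(embedding, x)[0][0])
    )
    # Positional indices (usable with .iloc) of the top_n scores, best first
    top_indices = similarities.to_numpy().argsort()[-top_n:][::-1]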

    return top_indices
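
If the candidate embeddings are stacked into a single matrix, the same ranking can be computed with one matrix product instead of one `cosine_similarity` call per row. The sketch below uses NumPy only; the name `top_n_by_cosine` is hypothetical, and it assumes that no embedding is the zero vector.

```python
import numpy as np


def top_n_by_cosine(query, candidates, top_n=5):
    """Return (positions, scores) of the top_n rows of `candidates` most similar to `query`."""
    query = np.asarray(query, dtype=np.float64).reshape(-1)
    candidates = np.asarray(candidates, dtype=np.float64)

    # Normalise to unit length so a dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)

    scores = candidates @ query
    # argsort is ascending, so take the last top_n positions and reverse them
    order = np.argsort(scores)[-top_n:][::-1]
    return order, scores[order]


# Toy 2-D vectors standing in for sentence embeddings
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
positions, scores = top_n_by_cosine([1.0, 0.0], candidates, top_n=2)
print(positions, scores)
```

The returned positions are row positions in the matrix, so they can be used with `.iloc` in the same way as the indices returned by `find_top_candidates`.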


In [21]:
# Find the top 5 candidates for a random job in the job description dataset

job_data = postings_sample_df['cleaned_jd'].sample(1).iloc[0]

print(job_data)

Title: 
 Description: About Liquid Death

Liquid Death is a healthy beverage company with ice-cold sustainable cans designed to murder your thirst. Founded in January of 2019 and based in Los Angeles, California, Liquid Death is one of the nation’s fastest growing beverage brands taking a completely unnecessary approach to canned water and iced teas. Unnecessary things tend to be far more interesting, fun, hilarious, captivating, memorable, exciting, and cult-worthy. Taking the world’s healthiest beverage and making it just as unnecessarily entertaining as the unhealthy brands has put Liquid Death on the map for LinkedIn's Top Startups 2022, Contagious's Brand of the Year 2022, Ad Age Top Marketer of the Year 2022, and Medium’s cult-worthy brands among other accolades. As Liquid Death continues to bring unnecessarily awesome beverage options to more people, Liquid Death is equally as excited to promote and help fund alternative art, music, and entertainment alongside the brand. Ensure 

In [22]:
# Embed the randomly sampled job and rank all resumes against it
job_embedding = generate_embeddings(job_data)
top_candidate_indices = find_top_candidates(
    job_embedding,
    candidate_series=resume_df['embeddings']
    )

print("Top candidates for job:")

for i, candidate_index in enumerate(top_candidate_indices):
    candidate = resume_df.iloc[candidate_index]
    print(
        f"""
        Candidate {i + 1}: {candidate['Category']}
        Similarity Score:
          {cosine_similarity(job_embedding, candidate['embeddings'])[0][0]:.4f}
        Resume: {candidate['cleaned_resume']}
        """
        )

Top candidates for job:

        Candidate 1: DIGITAL-MEDIA
        Similarity Score: 
          0.3934
        Resume: business development digital media marketing specialist summary a self-starter and dymic professiol with over nine years of sales, marketing and customer service experience. key strengths include critical thinking, creativity in developing new sales strategies, resourceful problem solving and the ability to maximize resources. outstanding oral and written skills with demonstrated success in building relationships with co-workers, magement, exterl partners and customers. bilingual communication skills in portuguese tive language and english business level. accomplishments awarded the best therapeutic nutritiol representative of brazil for contributions to me of project . top performance award for the best therapeutic nutritiol representative of brazil in 2010 - abbott nutrition award for developing a marketing and distribution plan for home care service - abbott nutrit

In [23]:
resume_data = resume_df['cleaned_resume'].sample(1).iloc[0]
print(resume_data)

events public relations leader summary i am an marketing specialist that creates and executes first class corporate and store events, marketing plans, and social media content to support stores sales objectives as well as companys overall objectives. i am seeking a corporate event planning or marketing position. planned multiple events for new scheels stores including a number of pr events as well as formal events. major projects included social media development for our 26 stores and planning multiple expos and conferences. experience 12/2015 to current events public relations leader  collaborate with marketing leaders to understand stores markets and put together the best event and marketing plans for each region. create an annual strategy of events that promote and align with stores goals and creates customer and store interactions. lead the development and execution of strategic events, trade shows, demos, expos, event sponsorships, community involvement, and conferences. develop a

### Find Top Job Matches for a Resume
Given a random resume, compute similarity scores with all job descriptions and return the top job matches.

In [24]:
resume_embedding = generate_embeddings(resume_data)

similar_job_indices = find_top_candidates(
      embedding=resume_embedding,
      candidate_series=postings_sample_df['embeddings']
    )

print("Top job matches for resume:")
print(resume_data)  # Print the sampled (cleaned) resume that is being matched
print("-" * 50)  # Separator

for i, job_index in enumerate(similar_job_indices):
    job_posting = postings_sample_df.iloc[job_index]

    score = cosine_similarity(
                resume_embedding, job_posting['embeddings']
              )[0][0]

    print(
        f"""
        Match {i + 1}: {job_posting['title']}
        Similarity Score: {score:.4f}
        Description: {job_posting['cleaned_jd']}
        {"-" * 50}
        """
      )

Top job matches for resume:
events public relations leader summary i am an marketing specialist that creates and executes first class corporate and store events, marketing plans, and social media content to support stores sales objectives as well as companys overall objectives. i am seeking a corporate event planning or marketing position. planned multiple events for new scheels stores including a number of pr events as well as formal events. major projects included social media development for our 26 stores and planning multiple expos and conferences. experience 12/2015 to current events public relations leader  collaborate with marketing leaders to understand stores markets and put together the best event and marketing plans for each region. create an annual strategy of events that promote and align with stores goals and creates customer and store interactions. lead the development and execution of strategic events, trade shows, demos, expos, event sponsorships, community involvement, and conferences. develop a