# 📄 Resume-to-O*NET Semantic Similarity Pipeline

This notebook computes semantic similarity between **resume text** and multiple **O*NET categories** (Knowledge, Skills, Abilities, Work Activities) using a transformer-based embedding model. The goal is to identify the O*NET concepts a resume mentions by comparing noun phrases extracted from the resume against O*NET entities with cosine similarity.

---

## 🚀 Features

- Loads resumes from a SQLite database with user annotations (`rating = 5`)
- Processes multiple O*NET CSV datasets from a directory
- Uses noun phrase extraction to analyze resume content
- Computes semantic similarity using sentence embeddings
- Saves similarity results to CSV files for each O*NET category

---

## 📦 Input Files

### SQLite Database

The database must include:
- `resumes` table with columns: `id`, `resume_text`
- `annotations` table with: `resume_id`, `rating`
- `predicted_jobs` table with: `resume_id`, `job_title`

Only resumes with `rating = 5` will be processed.
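
As an illustration of the selection described above, here is a minimal sketch that joins the three tables and keeps only resumes rated 5. The helper name `load_top_rated_resumes` and the in-memory demo rows are assumptions made for this example; only the table and column names come from the schema listed above.

```python
import sqlite3
import pandas as pd

def load_top_rated_resumes(conn, rating=5):
    """Return resumes whose annotation rating equals `rating`, with predicted job titles."""
    query = """
        SELECT r.id AS resume_id, r.resume_text, p.job_title
        FROM resumes AS r
        JOIN annotations AS a ON a.resume_id = r.id
        LEFT JOIN predicted_jobs AS p ON p.resume_id = r.id
        WHERE a.rating = ?
    """
    return pd.read_sql_query(query, conn, params=(rating,))

# Tiny in-memory database with made-up rows, only to exercise the query
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resumes (id INTEGER PRIMARY KEY, resume_text TEXT);
    CREATE TABLE annotations (resume_id INTEGER, rating INTEGER);
    CREATE TABLE predicted_jobs (resume_id INTEGER, job_title TEXT);
    INSERT INTO resumes VALUES (1, 'Managed a data team'), (2, 'Drove forklifts');
    INSERT INTO annotations VALUES (1, 5), (2, 3);
    INSERT INTO predicted_jobs VALUES (1, 'Data Manager'), (2, 'Forklift Operator');
""")
df_demo = load_top_rated_resumes(conn)
print(df_demo)
```

The `LEFT JOIN` keeps a top-rated resume even when it has no predicted job; switch to an inner join if such resumes should be skipped.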

### O*NET CSV Files

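Each CSV file should contain:
- `onetsoc_code`: the O*NET-SOC occupation code
- `job_title`: the occupation title
- `<category>_entity`: the concept, skill, ability, or activity, where `<category>` is the lowercased file name (e.g., `abilities_entity` in `abilities.csv`)
- `data_value`: importance or level rating (numeric)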

These files are automatically processed one by one.
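
Before running the pipeline, it can help to confirm that every file has the columns the code expects. The sketch below is one possible check, not part of the notebook: the helper `check_onet_dir` is hypothetical, and the required-column list mirrors the check inside `compute_resume_similarity` further down, which derives the entity column name from the file name.

```python
import tempfile
from pathlib import Path

import pandas as pd

REQUIRED_BASE = ["onetsoc_code", "job_title", "data_value"]

def check_onet_dir(onet_dir):
    """Map each CSV's category name to the list of required columns it lacks."""
    report = {}
    for path in sorted(Path(onet_dir).glob("*.csv")):
        category = path.stem.lower()
        header = pd.read_csv(path, nrows=0).columns  # read the header row only
        required = [f"{category}_entity", *REQUIRED_BASE]
        report[category] = [col for col in required if col not in header]
    return report

# Demo with two throwaway files: one complete, one missing columns
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({
        "onetsoc_code": ["11-1011.00"],
        "job_title": ["Chief Executives"],
        "abilities_entity": ["Oral Comprehension"],
        "data_value": [4.62],
    }).to_csv(Path(tmp) / "abilities.csv", index=False)
    pd.DataFrame({"job_title": ["Chief Executives"]}).to_csv(Path(tmp) / "skills.csv", index=False)
    report = check_onet_dir(tmp)

print(report)
```

An empty list means the file is ready to use; anything else names the columns to add or rename.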

---

## 🧠 How It Works

1. Load resumes with `rating = 5` from the SQLite database.
2. Loop over each O*NET CSV file (e.g., Knowledge, Skills, etc.).
3. For each resume:
   - Filter relevant O*NET entities by matching job titles.
   - Extract noun phrases from the resume text.
   - Generate sentence embeddings for both noun phrases and O*NET entities.
   - Compute cosine similarity between each pair.
   - Keep matches with similarity ≥ 0.65.
4. Save the similarity results into a separate CSV file for each category.
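
The similarity and threshold steps can be sketched with plain NumPy, independent of any embedding model. The function name `cosine_matches` and the two-dimensional toy vectors are assumptions for illustration; in the notebook the vectors come from the sentence-embedding model.

```python
import numpy as np

def cosine_matches(phrase_vecs, entity_vecs, threshold=0.65):
    """Return (phrase_idx, entity_idx, score) for every pair scoring >= threshold."""
    # Normalize rows to unit length so the dot product equals cosine similarity
    a = phrase_vecs / np.linalg.norm(phrase_vecs, axis=1, keepdims=True)
    b = entity_vecs / np.linalg.norm(entity_vecs, axis=1, keepdims=True)
    sims = a @ b.T  # shape: (n_phrases, n_entities)
    rows, cols = np.nonzero(sims >= threshold)
    return [(int(i), int(j), float(sims[i, j])) for i, j in zip(rows, cols)]

# Toy vectors standing in for embeddings
phrases = np.array([[1.0, 0.0], [1.0, 1.0]])
entities = np.array([[1.0, 0.0], [0.0, 1.0]])
matches = cosine_matches(phrases, entities)
for phrase_idx, entity_idx, score in matches:
    print(phrase_idx, entity_idx, round(score, 4))
```

The first phrase is orthogonal to the second entity, so that pair is dropped; the diagonal phrase scores about 0.7071 against both entities and is kept twice.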

---

## 🧾 Output Format

Each output file contains:

| Column           | Description                                                        |
|------------------|--------------------------------------------------------------------|
| resume_id        | Resume ID                                                          |
| resume_text      | Full resume text (repeated on every match row)                     |
| noun_phrase      | Noun phrase extracted from the resume                              |
| ksa_entity       | Matching O*NET entity                                              |
| similarity_score | Cosine similarity, rounded to 4 decimals (only scores ≥ 0.65 kept) |
| entity_job_title | Job title of the first O*NET row that lists the entity             |
| onetsoc_code     | O*NET-SOC code of that same row                                    |
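
One way to consume a file in this format is to keep only the strongest match for each noun phrase. The sketch below assumes nothing beyond the `resume_id`, `noun_phrase`, and `similarity_score` columns; the helper `best_match_per_phrase`, the placeholder `entity` column, and all values in the demo frame are made up for illustration.

```python
import pandas as pd

def best_match_per_phrase(df):
    """Keep the highest-scoring row for each (resume_id, noun_phrase) pair."""
    idx = df.groupby(["resume_id", "noun_phrase"])["similarity_score"].idxmax()
    return df.loc[idx].reset_index(drop=True)

# Made-up rows; "entity" is a placeholder for the entity column
demo = pd.DataFrame({
    "resume_id": [1, 1, 1],
    "noun_phrase": ["team leadership", "team leadership", "public speaking"],
    "entity": ["Coordination", "Management", "Oral Expression"],
    "similarity_score": [0.70, 0.81, 0.77],
})
best = best_match_per_phrase(demo)
print(best)
```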

---

## ✅ Sample Output



In [1]:
import os
import sqlite3
import pandas as pd
import logging
from textblob import TextBlob
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import torch
from tqdm import tqdm

In [2]:
# --------------------------------------------------
# Setup: Logging and device check
# --------------------------------------------------

logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
logging.info(f"Using device: {device}")

# --------------------------------------------------
# Load SentenceTransformer embedding model
# --------------------------------------------------

model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

# --------------------------------------------------
# Output directory for similarity matrices
# --------------------------------------------------

output_dir = "../data/similarity_outputs"
os.makedirs(output_dir, exist_ok=True)

# --------------------------------------------------
# Function: Compute similarity between resume and O*NET entities
# --------------------------------------------------

INFO: Using device: cuda:0
INFO: Load pretrained SentenceTransformer: all-MiniLM-L6-v2


In [18]:
def compute_resume_similarity(resume_row, df_onet, model, category_name, threshold=0.65, batch_size=32):
    results = []

    # Dynamic column names
    entity_col = f"{category_name}_entity"
    soc_code_col = "onetsoc_code"
    job_title_col = "job_title"
    data_value_col = "data_value"

    # Check for required columns
    required_cols = [entity_col, soc_code_col, job_title_col, data_value_col]
    for col in required_cols:
        if col not in df_onet.columns:
            logging.error(f"Missing column '{col}' in O*NET data for category '{category_name}'")
            return results

    # Resume info
    resume_id = resume_row["resume_id"]
    resume_text = resume_row["resume_text"]
    original_job = resume_row["original_job"]

    # Extract unique noun phrases
    blob = TextBlob(resume_text)
    noun_phrases = list(set(blob.noun_phrases))
    if not noun_phrases:
        logging.warning(f"No noun phrases found in resume ID {resume_id}")
        return results

    # Encode noun phrases and the unique O*NET entities.
    # The entity column repeats each entity for every occupation and scale,
    # so deduplicate before encoding; keep the first row per entity for lookups.
    resume_embeddings = model.encode(noun_phrases, convert_to_numpy=True, batch_size=batch_size)
    first_rows = (
        df_onet.dropna(subset=[entity_col])
        .drop_duplicates(subset=entity_col)
        .set_index(entity_col)
    )
    entity_values = first_rows.index.tolist()
    entity_embeddings = model.encode(entity_values, convert_to_numpy=True, batch_size=batch_size)

    # print(f"Length of {category_name} entities: {len(entity_values)}")
    # print(f"Length of noun phrase list: {len(noun_phrases)}")

    # Compute cosine similarities (rows: noun phrases, columns: entities)
    similarity_matrix = cosine_similarity(resume_embeddings, entity_embeddings)

    # Collect every (noun phrase, entity) pair at or above the threshold
    for i, j in np.argwhere(similarity_matrix >= threshold):
        entity = entity_values[j]
        matched = first_rows.loc[entity]  # first O*NET row listing this entity
        results.append({
            "resume_id": resume_id,
            "resume_text": resume_text,
            "noun_phrase": noun_phrases[i],
            "ksa_entity": entity,
            "similarity_score": round(float(similarity_matrix[i, j]), 4),
            "entity_job_title": matched[job_title_col],
            "onetsoc_code": matched[soc_code_col]
        })

    return results
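
Because the O*NET entities are the same for every resume within a category, their embeddings could be computed once and reused instead of being re-encoded on each call. The sketch below illustrates that idea under stated assumptions: `get_entity_embeddings` and the module-level cache are hypothetical helpers, and `CountingEncoder` is a toy stand-in for the real model so the example runs without downloading weights.

```python
import numpy as np

class CountingEncoder:
    """Toy stand-in for the embedding model; counts how often encode() is called."""

    def __init__(self):
        self.calls = 0

    def encode(self, texts, **kwargs):
        self.calls += 1
        # Toy 2-d vectors (text length, vowel count); not real embeddings
        return np.array(
            [[len(t), sum(ch in "aeiou" for ch in t.lower())] for t in texts],
            dtype=float,
        )

_entity_cache = {}

def get_entity_embeddings(model, category_name, entity_values):
    """Encode the entities for a category once and reuse the result afterwards."""
    if category_name not in _entity_cache:
        _entity_cache[category_name] = model.encode(entity_values, convert_to_numpy=True)
    return _entity_cache[category_name]

encoder = CountingEncoder()
for _ in range(3):  # e.g. three resumes matched against the same category
    vecs = get_entity_embeddings(encoder, "abilities", ["Oral Comprehension", "Stamina"])
print(encoder.calls, vecs.shape)
```

The cache is keyed by category name only, so it assumes the entity list for a category does not change during a run.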

In [13]:
def main():
    resume_path = "../data/annotations_scenario_1/cleaned_resumes.csv"
    df_resumes = pd.read_csv(resume_path)
    df_resumes = df_resumes[df_resumes["annotation_1"].astype(int) >= 3]

    onet_dir = "../data/o_net_files"
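    # NOTE: the [:1] slice limits this run to the first file returned by os.listdir;
    # remove it to process every O*NET category.
    for filename in os.listdir(onet_dir)[:1]: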
        if filename.endswith(".csv"):
            category = os.path.splitext(filename)[0].lower()
            logging.info(f"🔍 Processing category: {category}")

            df_onet = pd.read_csv(os.path.join(onet_dir, filename))

            all_results = []
            for _, row in tqdm(df_resumes.iterrows(), total=len(df_resumes), desc=f"Matching for {category}"):
                matches = compute_resume_similarity(row, df_onet, model, category)
                all_results.extend(matches)

            # Save similarity matrix for this category
            df_out = pd.DataFrame(all_results).drop_duplicates()
            out_path = os.path.join(output_dir, f"resume_{category}_similarity_matrix.csv")
            df_out.to_csv(out_path, index=False)
            logging.info(f"✅ Saved: {out_path}")

    logging.info("🎉 Done computing similarity for all O*NET categories.")



In [19]:
# --------------------------------------------------
# Entry point
# --------------------------------------------------

if __name__ == "__main__":
    main()

Matching for abilities: 100%|██████████| 92/92 [00:06<00:00, 13.25it/s]

Length of abilities entities: 52
Length of noun phrase list: 169
Length of abilities entities: 52
Length of noun phrase list: 188
...
Length of abilities entities: 52
Length of noun phrase list: 61

In [4]:
df_onet = pd.read_csv("../data/o_net_files/abilities.csv")
entity_values = df_onet["abilities_entity"].dropna().drop_duplicates().tolist()
# for j, entity in enumerate(entity_values):
#     matched = df_onet[df_onet["abilities_entity"] == entity].iloc[0]
#     print(j, entity, matched["job_title"])


In [10]:
entity_values = df_onet.loc[df_onet.scale_name=="Importance", "abilities_entity"].dropna().drop_duplicates().tolist()

In [13]:
df_onet[df_onet["abilities_entity"] == "Oral Comprehension"]

| Unnamed: 0 | onetsoc_code | job_title | abilities_entity | scale_name | scale_id | data_value |
|------------|--------------|-----------|------------------|------------|----------|------------|
| 0 | 11-1011.00 | Chief Executives | Oral Comprehension | Importance | IM | 4.62 |
| 1 | 11-1011.00 | Chief Executives | Oral Comprehension | Level | LV | 4.88 |
| 104 | 11-1011.03 | Chief Sustainability Officers | Oral Comprehension | Importance | IM | 4.00 |
| 105 | 11-1011.03 | Chief Sustainability Officers | Oral Comprehension | Level | LV | 4.62 |
| 208 | 11-1021.00 | General and Operations Managers | Oral Comprehension | Importance | IM | 4.00 |
| ... | ... | ... | ... | ... | ... | ... |
| 91105 | 53-7073.00 | Wellhead Pumpers | Oral Comprehension | Level | LV | 2.88 |
| 91208 | 53-7081.00 | Refuse and Recyclable Material Collectors | Oral Comprehension | Importance | IM | 3.00 |
| 91209 | 53-7081.00 | Refuse and Recyclable Material Collectors | Oral Comprehension | Level | LV | 2.88 |
| 91312 | 53-7121.00 | Tank Car, Truck, and Ship Loaders | Oral Comprehension | Importance | IM | 3.25 |
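
The output above shows each entity listed twice per occupation, once on the Importance scale and once on the Level scale. A possible way to work with both values side by side is to pivot the scale into columns. This is a sketch rather than part of the pipeline: it uses only the first four rows shown above, and the variable names `rows` and `wide` are arbitrary.

```python
import pandas as pd

# First four rows from the output above
rows = pd.DataFrame({
    "onetsoc_code": ["11-1011.00", "11-1011.00", "11-1011.03", "11-1011.03"],
    "job_title": [
        "Chief Executives", "Chief Executives",
        "Chief Sustainability Officers", "Chief Sustainability Officers",
    ],
    "abilities_entity": ["Oral Comprehension"] * 4,
    "scale_name": ["Importance", "Level", "Importance", "Level"],
    "data_value": [4.62, 4.88, 4.00, 4.62],
})

# One row per (occupation, entity), with Importance and Level as separate columns
wide = rows.pivot_table(
    index=["onetsoc_code", "job_title", "abilities_entity"],
    columns="scale_name",
    values="data_value",
    aggfunc="first",
).reset_index()
print(wide)
```

With this layout, a single filter such as `wide["Importance"] >= 4` selects entities by importance without losing the matching level value.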
