## 🎯 Resume-to-O*NET Knowledge Mapping (High-Quality Annotations)

### Purpose
This notebook demonstrates a focused exploration of how high-quality resumes align with the **O*NET Knowledge taxonomy** using semantic similarity techniques.

> ✅ This exploration is based on:
> - **Objective 1**
> - **Annotations Set 1**
> - Resumes with a **rating score of 5**

The goal is to:
- Extract meaningful **noun phrases** from resume text
- Match them semantically to relevant **O*NET knowledge entities**
- Assess the **semantic similarity** between the resume content and standardized knowledge requirements
- Compare against the **importance (`data_value`)** of those knowledge areas for a given job role


---

### Methodology

1. **Data Selection**  
   - Resumes are queried from a SQLite database where the annotation rating is 5.
   - Each resume is associated with a predicted job title.

2. **Text Processing**  
   - `TextBlob` is used to extract **noun phrases** from each resume, replicating the method described in the referenced research.

3. **Knowledge Entity Matching**  
   - For each resume, we select the corresponding **O*NET knowledge entities** by fuzzy-matching the job title to the `job_title` column in the O*NET knowledge dataset.

4. **Embedding & Similarity Scoring**  
   - Both noun phrases and knowledge entities are encoded using the `all-MiniLM-L6-v2` **SentenceTransformer**.
   - **Cosine similarity** is calculated between each noun phrase and each knowledge entity.
   - All matches with **similarity ≥ 0.65** are retained.

5. **Output Construction**  
   - For each `(noun phrase, knowledge entity)` pair, the following are recorded:
     - Resume ID
     - Job Title
     - Noun Phrase
     - O*NET Knowledge Entity
     - Cosine Similarity Score
     - O*NET `data_value` (importance level of the knowledge for that job)

---

### Outcome
The output DataFrame provides an interpretable mapping between **resume content and job-specific knowledge requirements**. It enables:
- Visualizing **how well a resume covers the most important knowledge areas** for a job
- Exploring **semantic overlap** between applicant experience and formal occupation standards
- Evaluating entity alignment at a **granular, phrase-level resolution**

This exploration supports downstream use cases like:
- Automated resume-job fit scoring
- Gap analysis between candidate skills and job requirements
- Training data analysis for classification models

---

### References
Alonso, R., Dessí, D., Meloni, A., & Reforgiato Recupero, D. (2025).  
**A novel approach for job matching and skill recommendation using transformers and the O\*NET database**.  
*Big Data Research, 39*, 100509. [DOI: 10.1016/j.bdr.2024.100509](https://doi.org/10.1016/j.bdr.2024.100509)

This notebook replicates and adapts core techniques from the above paper, including noun phrase extraction with TextBlob and semantic similarity scoring against O\*NET entities.


In [1]:
import sqlite3
import pandas as pd
from textblob import TextBlob
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load the embedding model
import torch

# Set the target GPU (e.g., GPU 0 or GPU 1)
device = torch.device("cuda:0")  # or "cuda:1" for the second GPU

# Load the model
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)


# Connect to the SQLite database
db_path = "../data/annotations_scenario_1/annotations_scenario_1.db"
conn = sqlite3.connect(db_path)

# Step 1: Query 10 resumes with rating = 5
query_resumes = """
SELECT r.id AS resume_id, r.resume_text, pj.job_title
FROM resumes r
JOIN annotations a ON r.id = a.resume_id
JOIN predicted_jobs pj ON r.id = pj.resume_id
WHERE a.rating = 5
"""
df_resumes = pd.read_sql_query(query_resumes, conn)
conn.close()

# Step 2: Load the O*NET knowledge dataset (previously processed)
df_onet = pd.read_csv("../data/annotations_scenario_1/processed_onet_knowledge.csv")

# Step 3: Initialize a list to collect similarity results
similarity_results = []

# Step 4: Iterate over resumes
for _, row in df_resumes.iterrows():
    resume_id = row["resume_id"]
    resume_text = row["resume_text"]
    job_title = row["job_title"]

    # Get knowledge entities for this job title using pattern matching
    df_knowledge = df_onet[df_onet["job_title"].str.contains(job_title, case=False, na=False, regex=True)]

    if df_knowledge.empty:
        print(f"⚠️ No knowledge entities found for job: {job_title} (Resume ID: {resume_id})")
        continue

    # Extract noun phrases from the resume
    blob = TextBlob(resume_text)
    noun_phrases = list(set(blob.noun_phrases))  # Remove duplicates

    if not noun_phrases:
        print(f"⚠️ No noun phrases found in resume ID {resume_id}")
        continue

    # Encode noun phrases and knowledge entities
    resume_embeddings = model.encode(noun_phrases, convert_to_numpy=True)
    knowledge_entities = df_knowledge["knowledge_entity"].tolist()
    knowledge_embeddings = model.encode(knowledge_entities, convert_to_numpy=True)

    # Compute pairwise cosine similarity
    similarity_matrix = cosine_similarity(resume_embeddings, knowledge_embeddings)

    # Store all similarity scores ≥ 0.65, and include data_value from df_knowledge
    threshold = 0.65
    for i, noun_phrase in enumerate(noun_phrases):
        for j, knowledge_entity in enumerate(knowledge_entities):
            score = similarity_matrix[i, j]
            if score >= threshold:
                data_value = df_knowledge.iloc[j]["data_value"]
                similarity_results.append({
                    "resume_id": resume_id,
                    "job_title": job_title,
                    "noun_phrase": noun_phrase,
                    "knowledge_entity": knowledge_entity,
                    "similarity_score": score,
                    "data_value": data_value
                })



# Convert results to DataFrame
df_similarity = pd.DataFrame(similarity_results).drop_duplicates()

# Display the similarity results
print(df_similarity.head())

# (Optional) Save to CSV
# df_similarity.to_csv("resume_knowledge_similarity_matrix.csv", index=False)

print("✅ Done computing similarity based on noun phrases.")


AssertionError: Torch not compiled with CUDA enabled

In [None]:
df_similarity["resume_id"].value_counts().reset_index()

# Lets connect to neo4j

In [None]:
# !pip install neo4j

In [None]:
from neo4j import GraphDatabase

# Replace with your Neo4j URI and credentials
NEO4J_URI = "bolt://20.14.162.151:7687" 
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "recluse2025"

# Connect to Neo4j
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# Transaction function to push nodes and relationships
def push_to_neo4j(tx, record):
    tx.run("""
        MERGE (r:Resume {id: $resume_id})
        SET r.job_title = $job_title

        MERGE (n:NounPhrase {text: $noun_phrase})
        MERGE (k:Knowledge {entity: $knowledge_entity})
        MERGE (j:JobTitle {title: $job_title})

        MERGE (r)-[:CONTAINS]->(n)
        MERGE (n)-[s:SIMILAR_TO]->(k)
        SET s.score = $similarity_score

        MERGE (k)-[rj:REQUIRED_FOR]->(j)
        SET rj.importance = $data_value
    """, 
    resume_id=record["resume_id"],
    job_title=record["job_title"],
    noun_phrase=record["noun_phrase"],
    knowledge_entity=record["knowledge_entity"],
    similarity_score=round(record["similarity_score"], 4),
    data_value=record["data_value"])


# Push rows from DataFrame to Neo4j
with driver.session() as session:
    for _, row in df_similarity.iterrows():
        session.execute_write(push_to_neo4j, row)

driver.close()