### Similarity between O*Net Knowledge and Resumes

This is a quick test of what the similarity between a whole resume and the O*NET knowledge data looks like. The assumption is that comparing the embedding of an entire resume against each Knowledge entity will yield a much lower score than the 65% reported in the referenced research paper.
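For orientation, here is a minimal sketch of the cosine similarity measure that the notebook relies on. The vectors are toy three-dimensional stand-ins with hypothetical values, not real model embeddings, and `cosine_sim` is an illustrative helper rather than part of the notebook.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for embeddings (hypothetical values, not model output)
resume_vec = np.array([1.0, 1.0, 0.0])
knowledge_vec = np.array([1.0, 0.0, 0.0])

score = cosine_sim(resume_vec, knowledge_vec)
print(round(score, 4))  # 0.7071, i.e. 1 / sqrt(2)
```

A long resume mixes many topics into one vector, so its direction is unlikely to line up closely with the embedding of a short knowledge label, which is the intuition behind expecting low scores here.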


In [1]:
import sqlite3
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Connect to the SQLite database
conn = sqlite3.connect("../data/annotations_scenario_1/annotations_scenario_1.db")

# Step 1: Query resumes with rating = 5
# Note: LIMIT 10 caps the joined (resume, predicted job) rows, not the number of distinct resumes
query_resumes = """
SELECT r.id AS resume_id, r.resume_text, pj.job_title
FROM resumes r
JOIN annotations a ON r.id = a.resume_id
JOIN predicted_jobs pj ON r.id = pj.resume_id
WHERE a.rating = 5
LIMIT 10;
"""
df_resumes = pd.read_sql_query(query_resumes, conn)

# Close the connection
conn.close()

# Step 2: Load the O*NET knowledge dataset (previously processed)
df_onet = pd.read_csv("../data/annotations_scenario_1/processed_onet_knowledge.csv")

# Step 3: Initialize an empty list to store similarity results
similarity_results = []

# Step 4: Compute similarity for each resume and its corresponding job knowledge entities
for _, row in df_resumes.iterrows():
    resume_id = row["resume_id"]
    resume_text = row["resume_text"]
    job_title = row["job_title"]

    # Get knowledge entities for this job title
    df_knowledge = df_onet[df_onet["job_title"] == job_title]

    if df_knowledge.empty:
        print(f"⚠️ No knowledge entities found for job: {job_title} (Resume ID: {resume_id})")
        continue  # Skip if no knowledge data exists for this job

    # Generate embeddings (encode all knowledge entities in one batch)
    resume_embedding = model.encode(resume_text, convert_to_numpy=True)
    knowledge_embeddings = model.encode(df_knowledge["knowledge_entity"].tolist(), convert_to_numpy=True)

    # Compute similarity: one row of scores, one column per knowledge entity
    similarity_scores = cosine_similarity([resume_embedding], knowledge_embeddings)

    # Store results
    for knowledge_entity, score in zip(df_knowledge["knowledge_entity"], similarity_scores[0]):
        similarity_results.append({"resume_id": resume_id, "job_title": job_title, "knowledge_entity": knowledge_entity, "similarity_score": score})

# Convert results to DataFrame
df_similarity = pd.DataFrame(similarity_results)

# Remove duplicates if any remain
df_similarity.drop_duplicates(inplace=True)

# Print a preview of the similarity matrix
print(df_similarity.head())

# # Save to CSV for further analysis
# df_similarity.to_csv("../data/annotations_scenario_1/resume_knowledge_similarity_matrix.csv", index=False)

# print("✅ Similarity matrix saved as 'resume_knowledge_similarity_matrix.csv'.")


   resume_id             job_title               knowledge_entity  \
0          4  Computer Programmers  Administration and Management   
2          4  Computer Programmers                 Administrative   
4          4  Computer Programmers       Economics and Accounting   
6          4  Computer Programmers            Sales and Marketing   
8          4  Computer Programmers  Customer and Personal Service   

   similarity_score  
0          0.309047  
2          0.231593  
4          0.223551  
6          0.269907  
8          0.232240  


In [2]:
# Print the highest similarity score
max_similarity = df_similarity["similarity_score"].max()
highest_match = df_similarity[df_similarity["similarity_score"] == max_similarity]

print("\n🎯 Highest Similarity Score:")
print(highest_match)





🎯 Highest Similarity Score:
     resume_id                                     job_title  \
346          5                   Computer Network Architects   
412          5                          Computer Programmers   
478          5         Computer Systems Engineers/Architects   
544          5  Computer and Information Research Scientists   
610          5                            Robotics Engineers   

              knowledge_entity  similarity_score  
346  Computers and Electronics          0.418219  
412  Computers and Electronics          0.418219  
478  Computers and Electronics          0.418219  
544  Computers and Electronics          0.418219  
610  Computers and Electronics          0.418219  


### Analysis:

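As anticipated, even the best score (0.418, about 42%) is well below the 65% threshold. The top match repeats across five job titles for resume 5 because the score depends only on the resume text and the knowledge entity, not on the job title.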
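The comparison against the threshold can also be made explicit. The sketch below is illustrative only: `df_demo` is a hypothetical stand-in for `df_similarity`, filled with a few rows copied from the outputs above, and `THRESHOLD`, `unique_pairs`, `best_per_resume`, and `share_above` are names introduced here.

```python
import pandas as pd

THRESHOLD = 0.65  # the 65% figure cited from the referenced paper

# Hypothetical stand-in for df_similarity, using rows shown in the outputs above
df_demo = pd.DataFrame({
    "resume_id": [4, 4, 4, 5, 5],
    "job_title": [
        "Computer Programmers",
        "Computer Programmers",
        "Computer Programmers",
        "Computer Network Architects",
        "Computer Programmers",
    ],
    "knowledge_entity": [
        "Administration and Management",
        "Administrative",
        "Economics and Accounting",
        "Computers and Electronics",
        "Computers and Electronics",
    ],
    "similarity_score": [0.309047, 0.231593, 0.223551, 0.418219, 0.418219],
})

# The score depends only on (resume, knowledge entity), so collapse
# rows that differ only by job title before summarizing.
unique_pairs = df_demo.drop_duplicates(subset=["resume_id", "knowledge_entity"])

# Best score per resume, and the share of pairs at or above the threshold
best_per_resume = unique_pairs.groupby("resume_id")["similarity_score"].max()
share_above = (unique_pairs["similarity_score"] >= THRESHOLD).mean()

print(best_per_resume)
print(f"Share of pairs at or above {THRESHOLD}: {share_above:.0%}")
```

Running the same three summary lines on the full `df_similarity` would show whether any resume reaches the threshold, rather than relying on the single maximum.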