### Data Exploration

The research paper uses sentence embeddings of nouns and noun phrases. This analysis checks whether other modern approaches can reach the same or a better score. Ultimately, we want to see whether the new approach at least matches the high annotated scores.

1. Compare the embedding of the entire resume to the individual entities of a category. Expect the similarity to be lower than in the research paper.
2. ColBERT indexing and search.
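
The two approaches score on different scales, which matters when comparing them. The sketch below uses made-up 2-d toy vectors (not real model output) to contrast a single-vector cosine similarity with ColBERT-style late interaction, where each query token vector is matched to its best document token vector and the maxima are summed. The helper names `cosine` and `maxsim` are illustrative only and are not part of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for every query token vector, take the best
    cosine match among the document token vectors, then sum the maxima."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


# Toy vectors, invented for illustration
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Approach 1: one vector per text, score bounded by 1
single = cosine([1.0, 1.0], [1.0, 0.0])

# Approach 2: one vector per token, score bounded by the number of query vectors
late = maxsim(query_tokens, doc_tokens)

print(round(single, 4), late)
```

Because the late-interaction score is a sum over query token vectors, its upper bound grows with the query length, so raw ColBERT scores should not be read against the 0 to 1 cosine range without some normalization.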

In [7]:
import pandas as pd

In [4]:


# Load the O*NET Knowledge Excel file
knowledge_file = "data/annotations_scenario_1/Knowledge.xlsx"  # Update with the actual filename
df_onet = pd.read_excel(knowledge_file)

# Select relevant columns
df_onet = df_onet[["O*NET-SOC Code", "Title", "Element Name", "Scale ID", "Data Value"]]

# Filter for only importance (IM) and level (LV)
df_onet = df_onet[df_onet["Scale ID"].isin(["IM", "LV"])]

# Rename columns for consistency
df_onet.rename(columns={
    "O*NET-SOC Code": "onetsoc_code",
    "Title": "job_title",
    "Element Name": "knowledge_entity",
    "Scale ID": "scale_id",
    "Data Value": "data_value"
}, inplace=True)

# Display the processed data in a Pandas DataFrame
print(df_onet.head())  # Show the first few rows

# Save to CSV if you want to inspect it further
df_onet.to_csv("data/annotations_scenario_1/processed_onet_knowledge.csv", index=False)


  onetsoc_code         job_title               knowledge_entity scale_id  \
0   11-1011.00  Chief Executives  Administration and Management       IM   
1   11-1011.00  Chief Executives  Administration and Management       LV   
2   11-1011.00  Chief Executives                 Administrative       IM   
3   11-1011.00  Chief Executives                 Administrative       LV   
4   11-1011.00  Chief Executives       Economics and Accounting       IM   

   data_value  
0        4.78  
1        6.50  
2        2.42  
3        2.69  
4        4.04  
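
Each knowledge entity appears twice in this long format, once for importance (`IM`) and once for level (`LV`). If one row per job and entity is more convenient downstream, a pivot along the following lines might help. This is a minimal sketch on four rows copied from the output above; the column names `importance` and `level` are assumptions chosen for readability.

```python
import pandas as pd

# Four rows taken from the printed preview above
long_df = pd.DataFrame({
    "onetsoc_code": ["11-1011.00"] * 4,
    "job_title": ["Chief Executives"] * 4,
    "knowledge_entity": ["Administration and Management"] * 2 + ["Administrative"] * 2,
    "scale_id": ["IM", "LV", "IM", "LV"],
    "data_value": [4.78, 6.50, 2.42, 2.69],
})

# One row per (job, entity), with IM and LV spread into separate columns
wide_df = (
    long_df.pivot_table(
        index=["onetsoc_code", "job_title", "knowledge_entity"],
        columns="scale_id",
        values="data_value",
        aggfunc="first",
    )
    .reset_index()
    .rename_axis(columns=None)
    .rename(columns={"IM": "importance", "LV": "level"})
)

print(wide_df)
```

With one row per entity, later steps would no longer need to drop duplicate similarity rows caused by the paired `IM`/`LV` records.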


In [5]:
# Load the O*NET Occupation Data Excel file
occupation_file = "data/annotations_scenario_1/Occupation Data.xlsx"  # Update with the actual filename

df_occupation = pd.read_excel(occupation_file)

# Select relevant columns
df_occupation = df_occupation[["O*NET-SOC Code", "Title", "Description"]]

# Rename columns for consistency
df_occupation.rename(columns={
    "O*NET-SOC Code": "onetsoc_code",
    "Title": "job_title",
    "Description": "job_description"
}, inplace=True)

# Display the first few rows
print(df_occupation.head())

# Save to CSV for further inspection (optional)
df_occupation.to_csv("data/annotations_scenario_1/processed_onet_occupation.csv", index=False)


  onetsoc_code                            job_title  \
0   11-1011.00                     Chief Executives   
1   11-1011.03        Chief Sustainability Officers   
2   11-1021.00      General and Operations Managers   
3   11-1031.00                          Legislators   
4   11-2011.00  Advertising and Promotions Managers   

                                     job_description  
0  Determine and formulate policies and provide o...  
1  Communicate and coordinate with management, sh...  
2  Plan, direct, or coordinate the operations of ...  
3  Develop, introduce, or enact laws and statutes...  
4  Plan, direct, or coordinate advertising polici...  


In [2]:
import sqlite3
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Connect to the SQLite database
conn = sqlite3.connect("data/annotations_scenario_1/annotations_scenario_1.db")

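# Step 1: Query up to 10 (resume, predicted job) rows for resumes with rating = 5
# Note: LIMIT applies to the joined rows, so fewer than 10 distinct resumes may be returned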
query_resumes = """
SELECT r.id AS resume_id, r.resume_text, pj.job_title
FROM resumes r
JOIN annotations a ON r.id = a.resume_id
JOIN predicted_jobs pj ON r.id = pj.resume_id
WHERE a.rating = 5
LIMIT 10;
"""
df_resumes = pd.read_sql_query(query_resumes, conn)

# Close the connection
conn.close()

# Step 2: Load the O*NET knowledge dataset (previously processed)
df_onet = pd.read_csv("data/annotations_scenario_1/processed_onet_knowledge.csv")

# Step 3: Initialize an empty list to store similarity results
similarity_results = []

# Step 4: Compute similarity for each resume and its corresponding job knowledge entities
for _, row in df_resumes.iterrows():
    resume_id = row["resume_id"]
    resume_text = row["resume_text"]
    job_title = row["job_title"]

    # Get knowledge entities for this job title
    df_knowledge = df_onet[df_onet["job_title"] == job_title]

    if df_knowledge.empty:
        print(f"⚠️ No knowledge entities found for job: {job_title} (Resume ID: {resume_id})")
        continue  # Skip if no knowledge data exists for this job

    # Generate embeddings (all knowledge entities are encoded in one batch)
    resume_embedding = model.encode(resume_text, convert_to_numpy=True)
    knowledge_embeddings = model.encode(df_knowledge["knowledge_entity"].tolist(), convert_to_numpy=True)

    # Compute similarity
    similarity_scores = cosine_similarity([resume_embedding], knowledge_embeddings)

    # Store results
    for knowledge_entity, score in zip(df_knowledge["knowledge_entity"], similarity_scores[0]):
        similarity_results.append({"resume_id": resume_id, "job_title": job_title, "knowledge_entity": knowledge_entity, "similarity_score": score})

# Convert results to DataFrame
df_similarity = pd.DataFrame(similarity_results)

# Remove duplicate rows (each knowledge entity occurs twice in df_onet, once per scale: IM and LV)
df_similarity.drop_duplicates(inplace=True)

# Print a preview of the similarity matrix
print(df_similarity.head())

# Save to CSV for further analysis
df_similarity.to_csv("data/annotations_scenario_1/resume_knowledge_similarity_matrix.csv", index=False)

print("✅ Similarity matrix saved as 'resume_knowledge_similarity_matrix.csv'.")


   resume_id             job_title               knowledge_entity  \
0          4  Computer Programmers  Administration and Management   
2          4  Computer Programmers                 Administrative   
4          4  Computer Programmers       Economics and Accounting   
6          4  Computer Programmers            Sales and Marketing   
8          4  Computer Programmers  Customer and Personal Service   

   similarity_score  
0          0.309047  
2          0.231593  
4          0.223551  
6          0.269907  
8          0.232240  
✅ Similarity matrix saved as 'resume_knowledge_similarity_matrix.csv'.
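
The preview shows the same resume repeated across rows, which suggests that the `LIMIT 10` in the query caps joined rows rather than distinct resumes. The sketch below reproduces that effect on an in-memory SQLite database and shows one possible alternative that limits resumes first and joins predicted jobs afterwards. The table and column names mirror the query above; the inserted rows are invented for illustration, and the real schema may contain more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resumes (id INTEGER PRIMARY KEY, resume_text TEXT);
CREATE TABLE annotations (resume_id INTEGER, rating INTEGER);
CREATE TABLE predicted_jobs (resume_id INTEGER, job_title TEXT);
-- Invented sample data: resume 1 has three predicted jobs
INSERT INTO resumes VALUES (1, 'resume one'), (2, 'resume two'), (3, 'resume three');
INSERT INTO annotations VALUES (1, 5), (2, 5), (3, 5);
INSERT INTO predicted_jobs VALUES
    (1, 'Job A'), (1, 'Job B'), (1, 'Job C'),
    (2, 'Job A'), (3, 'Job B');
""")

# LIMIT on the joined result: two rows, but both belong to resume 1
row_limited = conn.execute("""
SELECT r.id, pj.job_title
FROM resumes r
JOIN annotations a ON r.id = a.resume_id
JOIN predicted_jobs pj ON r.id = pj.resume_id
WHERE a.rating = 5
ORDER BY r.id, pj.job_title
LIMIT 2;
""").fetchall()

# LIMIT on distinct resumes first, then attach every predicted job
resume_limited = conn.execute("""
SELECT top.id, pj.job_title
FROM (
    SELECT DISTINCT r.id AS id, r.resume_text AS resume_text
    FROM resumes r
    JOIN annotations a ON r.id = a.resume_id
    WHERE a.rating = 5
    ORDER BY r.id
    LIMIT 2
) AS top
JOIN predicted_jobs pj ON top.id = pj.resume_id
ORDER BY top.id, pj.job_title;
""").fetchall()

conn.close()

print(row_limited)
print(resume_limited)
```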


In [8]:
# Print the highest similarity score
max_similarity = df_similarity["similarity_score"].max()
highest_match = df_similarity[df_similarity["similarity_score"] == max_similarity]

print("\n🎯 Highest Similarity Score:")
print(highest_match)





🎯 Highest Similarity Score:
     resume_id                                     job_title  \
346          5                   Computer Network Architects   
412          5                          Computer Programmers   
478          5         Computer Systems Engineers/Architects   
544          5  Computer and Information Research Scientists   
610          5                            Robotics Engineers   

              knowledge_entity  similarity_score  
346  Computers and Electronics          0.418219  
412  Computers and Electronics          0.418219  
478  Computers and Electronics          0.418219  
544  Computers and Electronics          0.418219  
610  Computers and Electronics          0.418219  
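
The five tied rows share the same resume and knowledge entity and differ only in job title, which is expected here because the score is computed from the resume text and the entity text alone. A minimal sketch of collapsing such ties, using the rows printed above, could look like this. The helper `best_per_pair` is hypothetical and not part of the notebook.

```python
# Rows copied from the output above
rows = [
    {"resume_id": 5, "job_title": "Computer Network Architects",
     "knowledge_entity": "Computers and Electronics", "similarity_score": 0.418219},
    {"resume_id": 5, "job_title": "Computer Programmers",
     "knowledge_entity": "Computers and Electronics", "similarity_score": 0.418219},
    {"resume_id": 5, "job_title": "Computer Systems Engineers/Architects",
     "knowledge_entity": "Computers and Electronics", "similarity_score": 0.418219},
    {"resume_id": 5, "job_title": "Computer and Information Research Scientists",
     "knowledge_entity": "Computers and Electronics", "similarity_score": 0.418219},
    {"resume_id": 5, "job_title": "Robotics Engineers",
     "knowledge_entity": "Computers and Electronics", "similarity_score": 0.418219},
]


def best_per_pair(rows):
    """Keep one entry per (resume_id, knowledge_entity) and collect the job
    titles that share it, so ties across job titles collapse to one record."""
    best = {}
    for row in rows:
        key = (row["resume_id"], row["knowledge_entity"])
        entry = best.setdefault(
            key, {"similarity_score": row["similarity_score"], "job_titles": []}
        )
        entry["similarity_score"] = max(entry["similarity_score"], row["similarity_score"])
        entry["job_titles"].append(row["job_title"])
    return best


collapsed = best_per_pair(rows)
for (resume_id, entity), entry in collapsed.items():
    print(resume_id, entity, entry["similarity_score"], len(entry["job_titles"]))
```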


In [4]:
import torch
from ragatouille import RAGPretrainedModel

# Load the pretrained ColBERTv2 checkpoint through RAGatouille
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")


  self.scaler = torch.cuda.amp.GradScaler()


In [None]:

# Step 1: Group by resume_id to ensure one resume per ID
df_resumes_grouped = df_resumes.groupby("resume_id")["resume_text"].first().reset_index()

# Step 2: Index Each Resume Independently
for resume_id, resume_text in zip(df_resumes_grouped["resume_id"], df_resumes_grouped["resume_text"]):
    index_name = f"resume_{resume_id}"  # Unique index name per resume
    
    print(f"Indexing Resume ID: {resume_id}...")  # Debugging output
    RAG.index(
        collection=[resume_text],  # Store the full resume as a single document
        index_name=index_name,
        max_document_length=180,
        split_documents=True  # Ragatouille will handle chunking
    )

print("✅ All resumes have been indexed successfully (only once per ID)!")


Indexing Resume ID: 4...
New index_name received! Updating current index_name (resume_4) to resume_4
This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Mar 28, 14:12:15] #> Note: Output directory .ragatouille/colbert/indexes/resume_4 already exists


[Mar 28, 14:12:15] #> Will delete 10 files already at .ragatouille/colbert/indexes/resume_4 in 20 seconds...
#> Starting...
#> Starting...


  self.scaler = torch.cuda.amp.GradScaler()


nranks = 2 	 num_gpus = 2 	 device=1
[Mar 28, 14:12:40] [1] 		 #> Encoding 2 passages..


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
  self.scaler = torch.cuda.amp.GradScaler()
  return torch.cuda.amp.autocast() if self.activated else NullContextManager()


nranks = 2 	 num_gpus = 2 	 device=0
[Mar 28, 14:12:43] [0] 		 #> Encoding 4 passages..
[Mar 28, 14:12:44] [1] 		 avg_doclen_est = 139.625 	 len(local_sample) = 2
[Mar 28, 14:12:44] [0] 		 avg_doclen_est = 139.625 	 len(local_sample) = 4
[Mar 28, 14:12:44] [0] 		 Creating 256 partitions.
[Mar 28, 14:12:44] [0] 		 *Estimated* 837 embeddings.
[Mar 28, 14:12:44] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/resume_4/plan.json ..


  sub_sample = torch.load(sub_sample_path)
Process Process-2:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.11/site-packages/colbert/infra/launcher.py", line 134, in setup_new_process
    return_val = callee(config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File "/opt/conda/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 68, in run
    self.train(shared_lists) # Trains centroids from selected passages
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 232, in train
    centroids = self._train_kmeans(samp

In [9]:
import json

# Step 1: Prepare queries (all O*NET knowledge entities)
knowledge_queries = df_onet["knowledge_entity"].drop_duplicates().tolist() # All queries at once

# Step 2: Search for each indexed resume
similarity_results = []
for resume_id in df_resumes["resume_id"].unique():
    index_name = f"resume_{resume_id}"  # Resume index name

    print(f"🔍 Searching Resume ID: {resume_id}...")  # Debugging output

    # Perform search using all knowledge entities as queries
    retrieved_docs = RAG.search(query=knowledge_queries, index_name=index_name, k=3)  

    # With a list of queries, RAG.search returns one list of hits per query;
    # each hit is a dict with keys such as "content", "score" and "rank"
    for knowledge_entity, hits in zip(knowledge_queries, retrieved_docs):
        for hit in hits:
            similarity_results.append({
                "resume_id": resume_id,
                "knowledge_entity": knowledge_entity,  # The knowledge entity used as the query
                "matched_resume_chunk": hit["content"],  # The matching chunk from the resume
                "similarity_score": round(hit["score"], 4)  # ColBERT relevance score
            })

# Convert results to DataFrame
df_similarity = pd.DataFrame(similarity_results)

# Step 3: Find the highest similarity score
max_similarity = df_similarity["similarity_score"].max()
highest_match = df_similarity[df_similarity["similarity_score"] == max_similarity]

print("\n🎯 Highest Similarity Score using Ragatouille (Knowledge as Queries):")
print(highest_match)

# Save to CSV
df_similarity.to_csv("data/annotations_scenario_1/ragatouille_resume_knowledge_similarity_matrix.csv", index=False)
print("✅ Similarity matrix saved as 'ragatouille_resume_knowledge_similarity_matrix.csv'.")


🔍 Searching Resume ID: 4...


AssertionError: 
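
The assertion carries no message, so its cause is not visible here. One possibility worth ruling out is that the index for a resume was never completed, since the batch indexing run above ended in a traceback. The sketch below checks for the file `ivf.pid.pt`, which the indexing log reports saving near the end of a successful run, under the index directory shown in the logs. Treating that file as a completeness marker is an assumption, and `split_by_index_presence` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

# Index location as it appears in the indexing logs
INDEX_ROOT = Path(".ragatouille/colbert/indexes")


def split_by_index_presence(resume_ids, index_root=INDEX_ROOT):
    """Split resume IDs into those whose index directory contains ivf.pid.pt
    (assumed here to mark a finished index) and those where it is missing."""
    present, missing = [], []
    for resume_id in resume_ids:
        marker = Path(index_root) / f"resume_{resume_id}" / "ivf.pid.pt"
        (present if marker.is_file() else missing).append(resume_id)
    return present, missing


# Demonstration on a temporary directory rather than the real index folder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "resume_4").mkdir()
    (root / "resume_4" / "ivf.pid.pt").write_bytes(b"")  # looks finished
    (root / "resume_5").mkdir()
    (root / "resume_5" / "plan.json").write_text("{}")  # only the plan was written
    present, missing = split_by_index_presence([4, 5], index_root=root)

print("present:", present, "missing:", missing)
```

Running the check before the search loop would make it possible to skip or re-index the missing IDs instead of failing on the first one.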

In [61]:
resume_text_4 = df_resumes[df_resumes["resume_id"] == 4]["resume_text"].iloc[0]

RAG.index(
    collection=[resume_text_4],
    index_name="resume_4",
    max_document_length=180,
    split_documents=True
)

New index_name received! Updating current index_name (resume_4) to resume_4
This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Mar 27, 20:31:43] #> Note: Output directory .ragatouille/colbert/indexes/resume_4 already exists


[Mar 27, 20:31:43] #> Will delete 10 files already at .ragatouille/colbert/indexes/resume_4 in 20 seconds...
[Mar 27, 20:32:04] [0] 		 #> Encoding 6 passages..


100%|██████████| 1/1 [00:00<00:00,  1.27it/s]

[Mar 27, 20:32:05] [0] 		 avg_doclen_est = 136.0 	 len(local_sample) = 6
[Mar 27, 20:32:05] [0] 		 Creating 256 partitions.
[Mar 27, 20:32:05] [0] 		 *Estimated* 816 embeddings.
[Mar 27, 20:32:05] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/resume_4/plan.json ..



  sub_sample = torch.load(sub_sample_path)


used 6 iterations (0.007s) to cluster 776 items into 256 clusters
[0.034, 0.031, 0.035, 0.033, 0.026, 0.036, 0.032, 0.039, 0.036, 0.027, 0.032, 0.045, 0.03, 0.028, 0.034, 0.034, 0.032, 0.039, 0.033, 0.036, 0.032, 0.044, 0.036, 0.03, 0.029, 0.034, 0.039, 0.028, 0.042, 0.029, 0.039, 0.04, 0.04, 0.036, 0.03, 0.03, 0.025, 0.041, 0.04, 0.038, 0.037, 0.029, 0.037, 0.031, 0.032, 0.038, 0.028, 0.034, 0.032, 0.038, 0.027, 0.036, 0.034, 0.032, 0.032, 0.041, 0.03, 0.031, 0.032, 0.039, 0.035, 0.033, 0.029, 0.032, 0.044, 0.036, 0.042, 0.039, 0.034, 0.029, 0.033, 0.026, 0.031, 0.034, 0.028, 0.032, 0.041, 0.039, 0.03, 0.032, 0.031, 0.035, 0.024, 0.033, 0.034, 0.037, 0.038, 0.033, 0.031, 0.032, 0.033, 0.03, 0.03, 0.041, 0.029, 0.027, 0.04, 0.028, 0.031, 0.037, 0.03, 0.036, 0.032, 0.037, 0.036, 0.028, 0.025, 0.036, 0.032, 0.025, 0.038, 0.033, 0.036, 0.04, 0.034, 0.036, 0.041, 0.039, 0.033, 0.028, 0.028, 0.027, 0.035, 0.042, 0.039, 0.034, 0.032, 0.03]


0it [00:00, ?it/s]

[Mar 27, 20:32:05] [0] 		 #> Encoding 6 passages..



  0%|          | 0/1 [00:00<?, ?it/s][A
100%|██████████| 1/1 [00:00<00:00,  7.20it/s][A
1it [00:00,  6.77it/s]
100%|██████████| 1/1 [00:00<00:00, 2064.13it/s]

[Mar 27, 20:32:05] #> Optimizing IVF to store map from centroids to list of pids..
[Mar 27, 20:32:05] #> Building the emb2pid mapping..
[Mar 27, 20:32:05] len(emb2pid) = 816



100%|██████████| 256/256 [00:00<00:00, 80118.03it/s]

[Mar 27, 20:32:05] #> Saved optimized IVF to .ragatouille/colbert/indexes/resume_4/ivf.pid.pt





Done indexing!


'.ragatouille/colbert/indexes/resume_4'

In [63]:
RAG.search(query="Mathematics", index_name="resume_5", k=3) 

New index_name received! Updating current index_name (resume_4) to resume_5
Loading searcher for index resume_5 for the first time... This may take a few seconds
[Mar 27, 20:32:28] #> Loading codec...
[Mar 27, 20:32:28] #> Loading IVF...
[Mar 27, 20:32:28] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 3221.43it/s]

[Mar 27, 20:32:28] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 1198.37it/s]

Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . Mathematics, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([ 101,    1, 5597,  102,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])






[{'content': 'Reports and Forecasts Education Details PGP in Data Science Mumbai, Maharashtra Aegis School of data science & Business B. E. in Electronics & Communication Electronics & Communication Indore, Madhya Pradesh IES IPS Academy Data Scientist Data Scientist with PR Canada Skill Details Algorithms- Exprience - 6 months BI- Exprience - 6 months Business Intelligence- Exprience - 6 months Machine Learning- Exprience - 24 months Visualization- Exprience - 24 months spark- Exprience - 24 months python- Exprience - 36 months tableau- Exprience - 36 months Data Analysis- Exprience - 24 monthsCompany Details company - Aegis school of Data Science & Business description - Mostly working on industry project for providing solution along with Teaching Appointments: Teach undergraduate and graduate-level courses in Spark and Machine Learning as an adjunct faculty member at Aegis School of Data Science,',
  'score': 11.318258285522461,
  'rank': 1,
  'document_id': 'c00c7312-edb5-4962-a355