## Resume Anonymization

In [1]:
import kagglehub
import os
from pathlib import Path
os.environ["KAGGLEHUB_CACHE"] = str(Path.cwd() / "data" / "kagglehub")

import pandas as pd

path = kagglehub.dataset_download("snehaanbhawal/resume-dataset")

print("Path to dataset files:", path)

Path to dataset files: /Users/socialaistudio/Code/hiring_agent/data/kagglehub/datasets/snehaanbhawal/resume-dataset/versions/1


In [2]:
resume_df = pd.read_csv(path + "/Resume/resume.csv")
resume_df.drop(columns=['ID', 'Resume_html'], inplace=True)
resume_df.rename(columns={'Resume_str': 'resume'}, inplace=True)
resume_df = resume_df[['Category', 'resume']]
resume_df.head()

Unnamed: 0,Category,resume
0,HR,HR ADMINISTRATOR/MARKETING ASSOCIATE\...
1,HR,"HR SPECIALIST, US HR OPERATIONS ..."
2,HR,HR DIRECTOR Summary Over 2...
3,HR,HR SPECIALIST Summary Dedica...
4,HR,HR MANAGER Skill Highlights ...


In [3]:
# resume_df.iloc[0].resume
resume_df.Category.unique(), len(resume_df.Category.unique())

(array(['HR', 'DESIGNER', 'INFORMATION-TECHNOLOGY', 'TEACHER', 'ADVOCATE',
        'BUSINESS-DEVELOPMENT', 'HEALTHCARE', 'FITNESS', 'AGRICULTURE',
        'BPO', 'SALES', 'CONSULTANT', 'DIGITAL-MEDIA', 'AUTOMOBILE',
        'CHEF', 'FINANCE', 'APPAREL', 'ENGINEERING', 'ACCOUNTANT',
        'CONSTRUCTION', 'PUBLIC-RELATIONS', 'BANKING', 'ARTS', 'AVIATION'],
       dtype=object),
 24)

In [4]:
# Select 2 resume samples from each category
sampled_resumes = pd.concat([resume_df[resume_df.Category == cat].sample(
    2, random_state=42) for cat in resume_df.Category.unique()
]).reset_index(drop=True)
sampled_resumes.head(5)

Unnamed: 0,Category,resume
0,HR,ASSISTANT MANAGER - HR www...
1,HR,HR ASSISTANT Summary Highly ...
2,DESIGNER,SENIOR GRAPHIC DESIGNER Summary...
3,DESIGNER,WEBSITE DESIGNER Summary ...
4,INFORMATION-TECHNOLOGY,INFORMATION TECHNOLOGY SPECIALIST (IN...
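
The per-category draw can also be written with pandas' built-in `groupby(...).sample`. The sketch below runs on a toy frame (the data and the names `toy_df` and `toy_sample` are made up for illustration). It may pick different rows than the list-comprehension version, because the random state is shared across groups instead of being re-seeded for each category.

```python
import pandas as pd

# Toy stand-in for resume_df (hypothetical data, not the Kaggle dataset)
toy_df = pd.DataFrame({
    "Category": ["HR"] * 3 + ["CHEF"] * 4 + ["ARTS"] * 2,
    "resume": [f"resume text {i}" for i in range(9)],
})

# Draw 2 rows from every category in a single call
toy_sample = (
    toy_df.groupby("Category", sort=False)
    .sample(n=2, random_state=42)
    .reset_index(drop=True)
)

# Every category should contribute exactly 2 rows
print(toy_sample["Category"].value_counts().to_dict())
```

Note that `sample(n=2)` raises an error if any category has fewer than 2 rows, which is also true of the loop version.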


In [5]:

# create a subsample of 2 resumes for testing
subsampled_resumes = resume_df.sample(2, random_state=42).reset_index(drop=True)
subsampled_resumes.head(2)

Unnamed: 0,Category,resume
0,TEACHER,Kpandipou Koffi Summary ...
1,DIGITAL-MEDIA,DIRECTOR OF DIGITAL TRANSFORMATION ...


In [6]:
[
        {"resume_id": idx, "resume_text": text}
        for idx, text in subsampled_resumes.resume.items()
]

[{'resume_id': 0,
  'resume_text': '           Kpandipou    Koffi         Summary      Compassionate teaching professional delivering exemplary support and assistance to teachers and students. Display exceptional Communication and problem solving skills.  Experience in office administration and public speaking. Attentive and adaptable, skilled in management of classroom operations. Effective in leveraging student feedback to create dynamic lesson plans that address individual strengths and weaknesses.  Dedicated and responsive team leader with proven skills in classroom management, behavior modification and individualized support.  Personable with experience using relationship-building to cultivate positive client, staff and management connections. Highly-developed communicator with outstanding skills in complex problem-solving and conflict resolution.  High-performing Administrative Assistant offering experience working with diverse client base and delivering exceptional results. Poli

In [7]:
from src.utils.pipeline_utils import batch_process_resumes

processed_resumes = batch_process_resumes(
    sampled_resumes.resume,
    country="Singapore"
)
# add the processed resumes to sampled_resumes dataframe as new columns
# for key in processed_resumes[0].keys():
#     subsampled_resumes[key] = [dct[key] for dct in processed_resumes]
# subsampled_resumes.head(5)

In [8]:
print(processed_resumes)

{0: {'anonymized': "## Anonymized Resume\n\n**ASSISTANT MANAGER - HR**\n\n**Professional Summary**  \nLooking for a challenging position that utilizes my skills, hard work, and provides opportunities to learn and contribute to the organization. I want to see myself as an active contributor to a team of ambitious people and thereby enhance my knowledge and personality. Human Resource Professional with over 4 years of rich experience in Recruitment, Organization Development, Time Management, Training & Development, Performance Management, Employee Engagement, TPM & Audit. Worked as an Assistant Manager - HR (Generalist Profile) with [COMPANY X] at its manufacturing unit and assisted HRM & SAP at the unit. Possess strong communication, interpersonal, problem-solving skills, and analytical skills. Strong communication, collaboration & team-building skills with proficiency at grasping new technical concepts quickly and utilizing the same in a productive manner. Fast Learner (demonstrated ab

In [9]:
# Create empty columns in the dataframe for the new keys
for key in processed_resumes[0].keys():
    if key not in sampled_resumes.columns:
        sampled_resumes[key] = None
# Fill the new columns with the processed resume data
for k in processed_resumes:
    for key in processed_resumes[k].keys():
        sampled_resumes.loc[sampled_resumes.index == k, key] = processed_resumes[k][key]
sampled_resumes.head(5)

Unnamed: 0,Category,resume,anonymized,reformatted,localized
0,HR,ASSISTANT MANAGER - HR www...,## Anonymized Resume\n\n**ASSISTANT MANAGER - ...,```markdown\n## [Candidate Name]\n\n**ASSISTAN...,```markdown\n## [Candidate Name]\n\n**ASSISTAN...
1,HR,HR ASSISTANT Summary Highly ...,## Anonymized Resume\n\n**HR ASSISTANT**\n\n**...,```markdown\n[Candidate Name]\n\n## HR ASSISTA...,```markdown\n[Candidate Name]\n\n## HR ASSISTA...
2,DESIGNER,SENIOR GRAPHIC DESIGNER Summary...,# SENIOR GRAPHIC DESIGNER\n\n## Summary\nDiver...,```\n[CANDIDATE NAME]\n\n# SENIOR GRAPHIC DESI...,```\n[CANDIDATE NAME]\n\n# SENIOR GRAPHIC DESI...
3,DESIGNER,WEBSITE DESIGNER Summary ...,## WEBSITE DESIGNER\n\n**Summary** \nSoftware...,```\n[CANDIDATE NAME]\n\n## WEBSITE DESIGNER\n...,```\n[CANDIDATE NAME]\n\n## WEBSITE DESIGNER\n...
4,INFORMATION-TECHNOLOGY,INFORMATION TECHNOLOGY SPECIALIST (IN...,## Anonymized Resume\n\nINFORMATION TECHNOLOGY...,[Candidate Name] \n\n## INFORMATION TECHNOLOG...,[Candidate Name] \n\n## INFORMATION TECHNOLOG...
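
Attaching a dict of per-row results to a frame can also be done by building a second frame and joining on the index. This is a minimal sketch on made-up data; `frame`, `results`, and their values are placeholders, and it assumes the dict keys match the frame's index labels.

```python
import pandas as pd

# Hypothetical stand-ins: a small frame and a results dict keyed by row index
frame = pd.DataFrame({"Category": ["HR", "CHEF"], "resume": ["raw a", "raw b"]})
results = {
    0: {"anonymized": "anon a", "reformatted": "fmt a", "localized": "loc a"},
    1: {"anonymized": "anon b", "reformatted": "fmt b", "localized": "loc b"},
}

# One row per outer key, one column per inner key, then align on the index
results_frame = pd.DataFrame.from_dict(results, orient="index")
merged = frame.join(results_frame)

print(merged.columns.tolist())
```

If the result keys are strings such as `'0'` and `'1'`, casting them first (for example `results_frame.index = results_frame.index.astype(int)`) keeps the join aligned.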


In [10]:
# Save the processed resumes to a new CSV file
os.makedirs("data/processed", exist_ok=True)
sampled_resumes.to_csv("data/processed/resumes.csv", index=False)

## Post-processing - Resume

In [7]:
import pandas as pd
from pathlib import Path
import re 


def clean_text(text):
    # remove code fences, including an optional "markdown" language tag
    # (a blanket replace of "markdown" would also delete the word from resume content)
    text = re.sub(r'```(?:markdown)?', '', text)
    # collapse runs of newlines into one
    text = re.sub(r'\n+', '\n', text)
    # collapse runs of spaces into one
    text = re.sub(r' +', ' ', text)
    # replace Candidate Name with CANDIDATE NAME
    text = text.replace("Candidate Name", "CANDIDATE NAME")
    # strip leading and trailing whitespace, including newlines left by the fences
    return text.strip()


resume_df = pd.read_csv(Path("data/processed/resumes.csv"))
resume_df['localized'] = resume_df['localized'].apply(clean_text)
# drop resume, anonymized, and reformatted columns, rename localized to resume
resume_df = resume_df.drop(columns=['resume', 'anonymized', 'reformatted']).rename(columns={'localized': 'resume'})
resume_df.head()

Unnamed: 0,Category,resume
0,HR,## [CANDIDATE NAME]\n**ASSISTANT MANAGER - HR*...
1,HR,[CANDIDATE NAME]\n## HR ASSISTANT\n**Summary**...
2,DESIGNER,[CANDIDATE NAME]\n# SENIOR GRAPHIC DESIGNER\n#...
3,DESIGNER,[CANDIDATE NAME]\n## WEBSITE DESIGNER\n**Summa...
4,INFORMATION-TECHNOLOGY,[CANDIDATE NAME] \n## INFORMATION TECHNOLOGY S...
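
Fence stripping is easy to check on a made-up string. The sketch below compares a blanket `str.replace` of the word "markdown" with a regex anchored to the fence. `strip_fences` and the sample text are hypothetical, and the fence is built from single backticks only so that the example can sit inside its own code block.

```python
import re

FENCE = "`" * 3  # three backticks

def strip_fences(text):
    """Remove code fences plus an optional 'markdown' language tag (hypothetical helper)."""
    return re.sub(re.escape(FENCE) + r"(?:markdown)?", "", text).strip()

# Made-up model output: a fenced resume that also mentions markdown as a skill
sample = FENCE + "markdown\n## [Candidate Name]\nSkills: markdown, HTML\n" + FENCE

# Blanket replacement removes the word everywhere, including the skills line
naive = sample.replace(FENCE, "").replace("markdown", "").strip()
# Anchored replacement only removes the tag that directly follows a fence
anchored = strip_fences(sample)

print(repr(naive))     # '## [Candidate Name]\nSkills: , HTML'
print(repr(anchored))  # '## [Candidate Name]\nSkills: markdown, HTML'
```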


## Job Processing

In [12]:
import os
from pathlib import Path

import kagglehub
import pandas as pd

os.environ["KAGGLEHUB_CACHE"] = str(Path.cwd() / "data" / "kagglehub")

# Download latest version
path = kagglehub.dataset_download("marcocavaco/scraped-job-descriptions")

print("Path to dataset files:", path)

df = pd.read_csv(Path(path) / "JD_data.csv", index_col=0)
# trim the first and last three characters of each description
df['description'] = df['description'].apply(lambda x: x[3:-3])
df.head()

Path to dataset files: /Users/socialaistudio/Code/hiring_agent/data/kagglehub/datasets/marcocavaco/scraped-job-descriptions/versions/1


Unnamed: 0,ISCO,major_job,job,position,location,description
0,21,SCIENCE AND ENGINEERING PROFESSIONALS,physicist,Accelerator Physicist id54315,"Villigen PSI, Aargau",You have an academic degree in physics or engi...
1,21,SCIENCE AND ENGINEERING PROFESSIONALS,physicist,Applied Physicist (Computing) (EP-LBC-2021-125...,Geneva,Be in charge of the development of application...
2,21,SCIENCE AND ENGINEERING PROFESSIONALS,physicist,Accelerator Physicist (BE-ABP-LNO-2021-122-LD)...,Geneva,Contribute to the maintenance and development ...
3,21,SCIENCE AND ENGINEERING PROFESSIONALS,physicist,Medical Devices Physicist,"Newton, Cambridgeshire",Agency: Newton Colmore Consulting Reference: M...
4,21,SCIENCE AND ENGINEERING PROFESSIONALS,physicist,Fluidics Physicist,Cambridge,Agency: Newton Colmore Consulting Reference: F...


In [13]:
print(df['major_job'].unique(), len(df['major_job'].unique()))
print(df.ISCO.unique(), len(df.ISCO.unique()))
print(df['job'].value_counts())

# for i in range(9):
    # print(" ".join(df.iloc[i].position.split(" ")[:-1]),'\t', df.iloc[i].job)
    # print(df.iloc[i].description[3:-3].replace(". ", ".\n"))
    # print()

['SCIENCE AND ENGINEERING PROFESSIONALS'
 'BUSINESS AND ADMINISTRATION ASSOCIATE PROFESSIONALS'
 'TEACHING PROFESSIONALS' 'HEALTH ASSOCIATE PROFESSIONALS'
 'SCIENCE AND ENGINEERING ASSOCIATE PROFESSIONALS' 'HEALTH PROFESSIONALS'
 'MARKET-ORIENTED SKILLED AGRICULTURAL WORKERS'
 'DRIVERS AND MOBILE PLANT OPERATORS'
 'BUSINESS AND ADMINISTRATION PROFESSIONALS' 'SALES WORKERS'
 'INFORMATION AND COMMUNICATIONS TECHNICIANS' 'CUSTOMER SERVICES CLERKS'
 'INFORMATION AND COMMUNICATIONS TECHNOLOGY PROFESSIONALS'
 'PROTECTIVE SERVICES WORKERS' 'ADMINISTRATIVE AND COMMERCIAL MANAGERS'
 'CLEANERS AND HELPERS'] 16
[21 33 23 32 31 22 61 83 24 52 35 42 25 54 12 91] 16
job
supervisor             84
inspection engineer    72
veterinary surgeon     57
bus driver             55
pharmacist             54
                       ..
immunologist           10
lender                 10
speech therapist       10
system programmer      10
nutritionist           10
Name: count, Length: 159, dtype: int64


In [14]:
# Sample a random row per unique job
sampled_jobs = df.groupby('job').apply(lambda x: x.sample(1, random_state=42)).reset_index(drop=True)
sampled_jobs.rename(columns={'description': 'job_description', 
                             'job': 'job_type', 
                             'major_job': 'job_classification'}, inplace=True)
sampled_jobs.drop(columns=['location', 'ISCO'], inplace=True)
sampled_jobs.head(5)


Unnamed: 0,job_classification,job_type,position,job_description
0,BUSINESS AND ADMINISTRATION PROFESSIONALS,accountant,Assistant Accountant,Management of large sets of data that contribu...
1,BUSINESS AND ADMINISTRATION ASSOCIATE PROFESSI...,administrative assistant,Administrative Assistant,Supports managers and employees through a vari...
2,ADMINISTRATIVE AND COMMERCIAL MANAGERS,administrative manager,Patents & Trademarks Administrative Manager M/F,You ensure that official filing procedures are...
3,HEALTH PROFESSIONALS,adviser,Customer Service Adviser,Customer Service Advisor Operational Hours: Mo...
4,INFORMATION AND COMMUNICATIONS TECHNOLOGY PROF...,application developer,React Native Application Developer,Work as part of a small team to improve our Re...


In [15]:
d = sampled_jobs.iloc[0]
print(d)

job_classification            BUSINESS AND ADMINISTRATION PROFESSIONALS
job_type                                                     accountant
position                                           Assistant Accountant
job_description       Management of large sets of data that contribu...
Name: 0, dtype: object


In [16]:
# from src.utils.pipeline_utils import job_pipeline

# job_dct = job_pipeline(job_classification=d['job_classification'], 
#                        job_type=d['job_type'], 
#                        position=d['position'], 
#                        job_description=d['job_description'])
# print(job_dct)

In [None]:
subsampled_jobs = sampled_jobs.sample(2, random_state=42).reset_index(drop=True)
subsampled_jobs.head(2)

Unnamed: 0,job_classification,job_type,position,job_description
0,BUSINESS AND ADMINISTRATION PROFESSIONALS,marketer,Digital Marketer,Отправить резюме
1,HEALTH PROFESSIONALS,veterinary surgeon,Veterinary Surgeon,40 hours a week including 1 in 3 weekends and ...
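
The first sampled row has a job description that is only a short placeholder ("Отправить резюме", Russian for "Send resume"). A length check is one simple way to flag such rows before they reach the pipeline. This is a sketch only: `looks_like_placeholder` and the 50-character threshold are assumptions, not part of the notebook's pipeline.

```python
def looks_like_placeholder(description, min_chars=50):
    """Flag missing or very short descriptions; the 50-character threshold is an assumption."""
    if not isinstance(description, str):
        return True
    return len(description.strip()) < min_chars

# First string is the description shown in the sample above; the second is made up
descriptions = [
    "Отправить резюме",
    "Made-up example: provide routine and emergency care for small animals in a busy clinic.",
]

flags = [looks_like_placeholder(d) for d in descriptions]
print(flags)  # [True, False]
```

On a DataFrame the same function could be applied column-wise, for example `sampled_jobs['job_description'].apply(looks_like_placeholder)`.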


In [None]:
from src.utils.pipeline_utils import batch_job_pipeline

batch_results = batch_job_pipeline(sampled_jobs)

In [25]:
len(batch_results), list(batch_results.keys())[:5]

(159, ['0', '1', '2', '3', '4'])

In [37]:
# Add the processed job descriptions to the sampled_jobs dataframe as new columns
first_result = next(iter(batch_results.values()))
for key in first_result:
    if key not in sampled_jobs.columns:
        sampled_jobs[key] = None
# Fill the new columns; batch_results is keyed by string ids, so cast to int to match the index
for job_id, result in batch_results.items():
    for key, value in result.items():
        sampled_jobs.loc[sampled_jobs.index == int(job_id), key] = value
sampled_jobs.drop(columns=['job_id'], inplace=True)
sampled_jobs.head(5)

Unnamed: 0,job_classification,job_type,position,job_description,company_criteria,previous_hires
0,BUSINESS AND ADMINISTRATION PROFESSIONALS,accountant,Assistant Accountant,Management of large sets of data that contribu...,- **Criterion 1: Relevant Educational Backgrou...,### Candidate Profiles for Assistant Accountan...
1,BUSINESS AND ADMINISTRATION ASSOCIATE PROFESSI...,administrative assistant,Administrative Assistant,Supports managers and employees through a vari...,- **Criterion 1: Strong Organizational Skills*...,### Candidate Profiles\n\n- **Candidate 1: Adm...
2,ADMINISTRATIVE AND COMMERCIAL MANAGERS,administrative manager,Patents & Trademarks Administrative Manager M/F,You ensure that official filing procedures are...,- **Criterion 1: Relevant Experience in Intell...,### Candidate Profiles\n\n- **Candidate 1: Pro...
3,HEALTH PROFESSIONALS,adviser,Customer Service Adviser,Customer Service Advisor Operational Hours: Mo...,- **Customer-Centric Approach**: Candidates sh...,### Candidate Profiles for Customer Service Ad...
4,INFORMATION AND COMMUNICATIONS TECHNOLOGY PROF...,application developer,React Native Application Developer,Work as part of a small team to improve our Re...,- **Technical Proficiency in React Native**: C...,### Candidate Profiles\n\n- **Candidate 1**: \...


In [38]:
# save the sampled_jobs to a csv file
import os
import pandas as pd
os.makedirs("data/processed", exist_ok=True)
sampled_jobs.to_csv("data/processed/sampled_jobs.csv", index=False)
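
When a frame is round-tripped through CSV, the index settings on write and read need to agree. The sketch below uses a made-up two-row frame and an in-memory buffer to show what happens when a file written with `index=False` is read back with and without `index_col=0`.

```python
import io

import pandas as pd

# Made-up frame standing in for the saved job table
toy = pd.DataFrame({"job_type": ["accountant", "adviser"], "position": ["A", "B"]})

# Write without the index, as to_csv(..., index=False) does
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# index_col=0 promotes the first data column to the index
promoted = pd.read_csv(io.StringIO(csv_text), index_col=0)
# The defaults keep every column and build a fresh integer index
plain = pd.read_csv(io.StringIO(csv_text))

print(promoted.columns.tolist())  # ['position']
print(plain.columns.tolist())     # ['job_type', 'position']
```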

## Post-processing - Job

In [40]:
import pandas as pd
import re 

def clean_text(text):
    # remove code fences, including an optional "markdown" language tag
    # (a blanket replace of "markdown" would also delete the word from the content)
    text = re.sub(r'```(?:markdown)?', '', text)
    # collapse runs of newlines into one
    text = re.sub(r'\n+', '\n', text)
    # collapse runs of spaces into one
    text = re.sub(r' +', ' ', text)
    # strip leading and trailing whitespace, including newlines left by the fences
    return text.strip()

# the file was saved with index=False, so read it without index_col;
# index_col=0 would turn job_classification into the index
results_df = pd.read_csv("data/processed/sampled_jobs.csv")
results_df['company_criteria'] = results_df['company_criteria'].apply(clean_text)
results_df['previous_hires'] = results_df['previous_hires'].apply(clean_text)
results_df.head()

Unnamed: 0,job_classification,job_type,position,job_description,company_criteria,previous_hires
0,BUSINESS AND ADMINISTRATION PROFESSIONALS,accountant,Assistant Accountant,Management of large sets of data that contribu...,- **Criterion 1: Relevant Educational Backgrou...,### Candidate Profiles for Assistant Accountan...
1,BUSINESS AND ADMINISTRATION ASSOCIATE PROFESSI...,administrative assistant,Administrative Assistant,Supports managers and employees through a vari...,- **Criterion 1: Strong Organizational Skills*...,### Candidate Profiles\n- **Candidate 1: Admin...
2,ADMINISTRATIVE AND COMMERCIAL MANAGERS,administrative manager,Patents & Trademarks Administrative Manager M/F,You ensure that official filing procedures are...,- **Criterion 1: Relevant Experience in Intell...,### Candidate Profiles\n- **Candidate 1: Profi...
3,HEALTH PROFESSIONALS,adviser,Customer Service Adviser,Customer Service Advisor Operational Hours: Mo...,- **Customer-Centric Approach**: Candidates sh...,### Candidate Profiles for Customer Service Ad...
4,INFORMATION AND COMMUNICATIONS TECHNOLOGY PROF...,application developer,React Native Application Developer,Work as part of a small team to improve our Re...,- **Technical Proficiency in React Native**: C...,### Candidate Profiles\n- **Candidate 1**: \n ...
