## Talent Trove Data Generation

To populate the Talent Trove database, we used real job postings from online job aggregators (e.g. Simplify by Michael Yan) for the job posting data, and Mockaroo for the other tables. Additionally, we used an LLM, as shown below, to generate authentic-seeming text for the Review table.

The procedure used to obtain and preprocess each data CSV for the Talent Trove Database relations is described under its respective heading.

### Import Statements

In [205]:
import os
import time
import random
! pip install openai
import openai
import pandas as pd
import numpy as np
import torch


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Job_Posting
- The Job_Posting relation contains all of the job postings in the application, with a Job_ID, Experience, Location, Requirements, and Skills value for each.
- Its foreign keys are Recruiter.Username, Job_Portal.Portal_ID, and Company.Company_ID.

In [236]:
job_ids_df = pd.read_csv('data/Mockaroo/Mockaroo-Job-Postings-IDs.csv')
job_ids_df 

Unnamed: 0,Job_ID
0,01HQY593T458D022KE1X5KW3GH
1,01HQY593T4WCDYXM6WM9A6H0EN
2,01HQY593T5XMCEY5W4CFFR7PDA
3,01HQY593T5SCN9Y2ERYWHDH46Q
4,01HQY593T5WG867X8XF3X3WRFP
...,...
595,01HQY594028NGXXS2M0VX5Q5JS
596,01HQY59402WWT0MVTDPMEFTZ23
597,01HQY59402G3G666SG3MWAHWN1
598,01HQY594035G4V2RSJSC0Q3RW8


In [237]:
# Add 300 rows from full time data and 300 rows from intern data
full_time_df = pd.read_csv('data/Tech_Full_Time_Roles.csv', header=0)[:300]
full_time_df.rename(columns={'Role': 'Job_Title'}, inplace=True)
full_time_df = full_time_df.drop(columns=['Application/Link','Date Posted'])

In [238]:
full_time_requirements = ['Tableau data analyst certification', 'AWS Cloud Practitioner certification', 'Oracle SQL certification required', 'US Citizen', 'Must be willing to relocate', 'Must be willing to travel 20% of time', 'Must have experience leading cross-functional teams', 'Must have at least one year of non-internship relevant experience', 'Must have 3+ years of relevant experience', 'Must have 5+ years of relevant experience']

job_skills = ['Python', 'Scala', 'Pandas', 'AI/ML frameworks', 'PyTorch', 'Keras', 'Tensorflow', 'AWS', 'Azure', 'Google Cloud Platform', 'SQL', 'Oracle Databases', 'Databases', 'Java', 'Julia', 'Pascal', 'Perl', 'LaTeX', 'Scheme', 'Ruby', 'Ruby SQL', 'Flask', 'SQLite', 'PostgreSQL', 'MySQL', 'Tableau', 'PowerBI', 'Software Development', 'C++', 'C#', 'HTML/CSS/JavaScript', 'HTML/CSS', 'JavaScript', 'MongoDB', 'Neo4j', 'Agile', 'Scrum', 'Vue.js', 'React', 'Angular', 'Docker', 'Kubernetes', 'Sphinx', 'Jupyter', 'Git', 'Algorithms', 'Django', 'Problem-Solving', 'Leadership', 'Communication', 'Web Development', 'Frontend Stack', 'Backend Stack', 'Adobe Creative Suite', 'Financial modeling', 'Spanish', 'French', 'Hindi', 'Urdu', 'German', 'Italian', 'American Sign Language', 'Cybersecurity', 'Linux', 'Bash Scripting', 'Customer Support', 'Project Management', 'Networking Fundamentals', 'Algorithms', 'Jenkins', 'Kotlin', 'Swift', 'OOP', 'Ansible', 'Snowflake', 'Databricks', 'Budgeting', 'UX/UI Design', 'Graphic Design', 'Content Marketing', 'Technical Writing', 'SEO/SEM', 'Social Media Marketing', 'Hootsuite']

job_experience = ['0-1 years relevant experience', '1-2 years relevant experience', '3+ years relevant experience', '5+ years relevant experience']

In [239]:
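# Note: get_random_skills() is defined in a later cell (In [243]) and must already be defined when this cell runs
full_time_df['Experience'] = full_time_df.apply(lambda row: random.choice(job_experience), axis=1)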
full_time_df['Requirements'] = full_time_df.apply(lambda row: random.choice(full_time_requirements), axis=1)
full_time_df['Skills'] = full_time_df.apply(lambda row: get_random_skills(job_skills), axis=1)
full_time_df

Unnamed: 0,Company,Job_Title,Location,Experience,Requirements,Skills
0,Squarepoint Capital,Financing Trader,"London, UK",5+ years relevant experience,Tableau data analyst certification,"SQL, Julia, SEO/SEM, Social Media Marketing, R..."
1,Altera Digital Health,Associate Software Engineer - Remote,Remote in USA,0-1 years relevant experience,AWS Cloud Practitioner certification,"Julia, Italian, Leadership, Keras, C++, PowerB..."
2,Altera Digital Health,Associate Software Engineer - Remote,Remote in USA,3+ years relevant experience,Must have at least one year of non-internship ...,"Julia, Social Media Marketing, Snowflake, Dock..."
3,Lucid,Data Analyst,"Raleigh, NC",1-2 years relevant experience,US Citizen,"Flask, JavaScript, Azure, Algorithms, Neo4j, C..."
4,Lucid,Data Analyst,"Salt Lake City, UT",1-2 years relevant experience,US Citizen,"Ruby SQL, Linux, LaTeX, Technical Writing, Ang..."
...,...,...,...,...,...,...
295,Arista Networks,UX Designer - Remote from Turkey - Hungary or...,Remote,0-1 years relevant experience,Must be willing to travel 20% of time,"Flask, JavaScript, Social Media Marketing, Sof..."
296,Arista Networks,Software Engineer - Packet Forwarding Engines,"Vancouver, BC, Canada",0-1 years relevant experience,Must have at least one year of non-internship ...,"PostgreSQL, Azure, Google Cloud Platform, Reac..."
297,Arista Networks,Software Engineer - Network Systems,"Vancouver, BC, Canada",5+ years relevant experience,Tableau data analyst certification,"Backend Stack, Pandas, Software Development, R..."
298,Connectly,Software Engineer - Backend,Remote,0-1 years relevant experience,Must have 5+ years of relevant experience,"Content Marketing, Git, Urdu, Pandas, React, C..."


In [240]:
internship_df = pd.read_csv('data/Tech_Internship.csv', header=0)[:300]
internship_df

Unnamed: 0,Company,Role,Location,Application/Link,Date Posted
0,Chime,Software Engineer Intern - Growth Funding,SF,"<a href=""https://boards.greenhouse.io/chime/j...",Feb 28
1,CACI,Software Development Intern - Summer 2024,Remote in USA,"<a href=""https://caci.wd1.myworkdayjobs.com/E...",Feb 28
2,Western Digital,Summer 2024 Software Engineering Intern,"Longmont, CO","<a href=""https://jobs.smartrecruiters.com/Wes...",Feb 27
3,Veracode,Solutions Architecture and Security Consultin...,"Burlington, MA","<a href=""https://www.veracode.com/career/job?...",Feb 27
4,Roku,Machine Learning Engineer Intern,"San Jose, CA","<a href=""https://www.weareroku.com/jobs/57580...",Feb 27
...,...,...,...,...,...
295,Inari Agriculture,Enterprise Data Quality Intern,"Cambridge, MA","<a href=""https://boards.greenhouse.io/inariag...",Dec 08
296,Quantcast,Software Engineering Intern - Summer 2024,"Seattle, WA","<a href=""https://jobs.lever.co/quantcast/9ab8...",Dec 07
297,Electric Hydrogen,Firmware Intern Summer 2024,"San Jose, CA","<a href=""https://eh2.com/careers?gh_jid=43477...",Dec 07
298,Niantic,Software Engineering Intern,"San Francisco, CA","<a href=""https://app.ripplematch.com/v2/publi...",Dec 06


In [241]:
# Replace 'Ü≥' symbols with the correct company name, i.e. the company name from the row before
arrow_markers = ['Ü≥', '‚Ü≥']
for i in range(1, len(internship_df)):  # row 0 has no preceding row to copy from
    if str(internship_df.loc[i, 'Company']).strip() in arrow_markers:
        internship_df.loc[i, 'Company'] = internship_df.loc[i-1, 'Company']

# Verify that no marker symbols remain (ignoring surrounding whitespace)
if not internship_df['Company'].astype(str).str.strip().isin(arrow_markers).any():
    print("There are no more 'Ü≥' symbols in the 'Company' column!")
internship_df

There are no more 'Ü≥' symbols in the 'Company' column!


Unnamed: 0,Company,Role,Location,Application/Link,Date Posted
0,Chime,Software Engineer Intern - Growth Funding,SF,"<a href=""https://boards.greenhouse.io/chime/j...",Feb 28
1,CACI,Software Development Intern - Summer 2024,Remote in USA,"<a href=""https://caci.wd1.myworkdayjobs.com/E...",Feb 28
2,Western Digital,Summer 2024 Software Engineering Intern,"Longmont, CO","<a href=""https://jobs.smartrecruiters.com/Wes...",Feb 27
3,Veracode,Solutions Architecture and Security Consultin...,"Burlington, MA","<a href=""https://www.veracode.com/career/job?...",Feb 27
4,Roku,Machine Learning Engineer Intern,"San Jose, CA","<a href=""https://www.weareroku.com/jobs/57580...",Feb 27
...,...,...,...,...,...
295,Inari Agriculture,Enterprise Data Quality Intern,"Cambridge, MA","<a href=""https://boards.greenhouse.io/inariag...",Dec 08
296,Quantcast,Software Engineering Intern - Summer 2024,"Seattle, WA","<a href=""https://jobs.lever.co/quantcast/9ab8...",Dec 07
297,Electric Hydrogen,Firmware Intern Summer 2024,"San Jose, CA","<a href=""https://eh2.com/careers?gh_jid=43477...",Dec 07
298,Niantic,Software Engineering Intern,"San Francisco, CA","<a href=""https://app.ripplematch.com/v2/publi...",Dec 06


In [242]:
# Drop Date Posted, Application Link, duplicates
internship_df.drop_duplicates(inplace=True)
internship_df = internship_df.drop(columns=['Application/Link', 'Date Posted'])
internship_df

Unnamed: 0,Company,Role,Location
0,Chime,Software Engineer Intern - Growth Funding,SF
1,CACI,Software Development Intern - Summer 2024,Remote in USA
2,Western Digital,Summer 2024 Software Engineering Intern,"Longmont, CO"
3,Veracode,Solutions Architecture and Security Consultin...,"Burlington, MA"
4,Roku,Machine Learning Engineer Intern,"San Jose, CA"
...,...,...,...
295,Inari Agriculture,Enterprise Data Quality Intern,"Cambridge, MA"
296,Quantcast,Software Engineering Intern - Summer 2024,"Seattle, WA"
297,Electric Hydrogen,Firmware Intern Summer 2024,"San Jose, CA"
298,Niantic,Software Engineering Intern,"San Francisco, CA"


In [243]:
# Randomly impute a value for the Experience, Requirements, Skills, Salaried, and Duration attributes for each tuple
intern_experience_vals = ['None Required', 'Previous industry internship experience required (>=3 months)', 'Previous research/academic experience required (>=3 months)', 'Minimum 1 year previous industry internship experience required', 'Previous research/academic highly desirable']

intern_requirements_vals = ['Willing to relocate', 'Willing to travel up to 20%', 'Must meet base technical criteria', 'Must be proficient in Microsoft Office Suite', 'Good sense of humor', 'Ability to work independently', 'Familiar with Agile methodologies', 'Must have valid license', 'Must tolerate dogs in the workplace']

def get_random_skills(job_skills):
    # Draw between 5 and 12 skills (with replacement); duplicates are removed below, so the final list may be shorter
    num_skills = random.randint(5, 12)
    skills = []
    for i in range(num_skills):
        skills.append(random.choice(job_skills))
    skills = list(set(skills))
    skills_str = ', '.join(skills)
    return skills_str
    
duration_vals = ['4 weeks', '8 weeks', '12 weeks', '16 weeks']

# Make only 3% of values Salaried = False because I don't want to live in a database world where most companies benefit off the backs of innocent, eager interns 
salaried = [True, False]
salary_probs = [0.97, 0.03]
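The `salaried` and `salary_probs` lists above are meant to be used together as a weighted draw. A minimal sketch of how that could look with the standard library's `random.choices` follows; `n_rows`, `salaried_col`, and `duration_col` are hypothetical names for illustration and are not part of the notebook.

```python
import random

# Values copied from the cell above
salaried = [True, False]
salary_probs = [0.97, 0.03]
duration_vals = ['4 weeks', '8 weeks', '12 weeks', '16 weeks']

n_rows = 1000   # hypothetical row count, standing in for len(internship_df)
random.seed(0)  # seeded only so the sketch is repeatable

# Weighted draw: each value is True with probability 0.97
salaried_col = random.choices(salaried, weights=salary_probs, k=n_rows)
# Uniform draw: every duration is equally likely
duration_col = random.choices(duration_vals, k=n_rows)

share_salaried = sum(salaried_col) / n_rows
print(f"Share of salaried internships: {share_salaried:.2%}")
```

Lists produced this way could be assigned directly to DataFrame columns of the same length, for example as `Salaried` and `Duration`.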

In [244]:
internship_df['Experience'] = internship_df.apply(lambda row: random.choice(intern_experience_vals), axis=1)
internship_df['Requirements'] = internship_df.apply(lambda row: random.choice(intern_requirements_vals), axis=1)
internship_df['Skills'] = internship_df.apply(lambda row: get_random_skills(job_skills), axis=1)
internship_df = internship_df.rename(columns={'Role': 'Job_Title'})
internship_df

Unnamed: 0,Company,Job_Title,Location,Experience,Requirements,Skills
0,Chime,Software Engineer Intern - Growth Funding,SF,Minimum 1 year previous industry internship ex...,Must meet base technical criteria,"OOP, Jupyter, Social Media Marketing, Italian,..."
1,CACI,Software Development Intern - Summer 2024,Remote in USA,Minimum 1 year previous industry internship ex...,Must tolerate dogs in the workplace,"Adobe Creative Suite, Pandas, Neo4j, Financial..."
2,Western Digital,Summer 2024 Software Engineering Intern,"Longmont, CO",Previous industry internship experience requir...,Familiar with Agile methodologies,"Backend Stack, OOP, UX/UI Design, C#, Swift, I..."
3,Veracode,Solutions Architecture and Security Consultin...,"Burlington, MA",Previous research/academic highly desirable,Ability to work independently,"Urdu, Spanish, Software Development, Italian, ..."
4,Roku,Machine Learning Engineer Intern,"San Jose, CA",Previous research/academic highly desirable,Familiar with Agile methodologies,"Julia, Frontend Stack, Docker, Python, Databas..."
...,...,...,...,...,...,...
295,Inari Agriculture,Enterprise Data Quality Intern,"Cambridge, MA",Minimum 1 year previous industry internship ex...,Must have valid license,"PyTorch, Graphic Design, Tableau, Italian, Web..."
296,Quantcast,Software Engineering Intern - Summer 2024,"Seattle, WA",None Required,Must have valid license,"Angular, Julia, Git, Backend Stack, Azure, Goo..."
297,Electric Hydrogen,Firmware Intern Summer 2024,"San Jose, CA",Previous industry internship experience requir...,Must have valid license,"Flask, Pandas, Frontend Stack, Agile, React, D..."
298,Niantic,Software Engineering Intern,"San Francisco, CA",None Required,Willing to relocate,"Git, OOP, Graphic Design, Perl, Ansible, AI/ML..."


In [245]:
# Join full time and internship postings into one dataframe 
jobs_df = pd.concat([full_time_df, internship_df], axis=0)
jobs_df

Unnamed: 0,Company,Job_Title,Location,Experience,Requirements,Skills
0,Squarepoint Capital,Financing Trader,"London, UK",5+ years relevant experience,Tableau data analyst certification,"SQL, Julia, SEO/SEM, Social Media Marketing, R..."
1,Altera Digital Health,Associate Software Engineer - Remote,Remote in USA,0-1 years relevant experience,AWS Cloud Practitioner certification,"Julia, Italian, Leadership, Keras, C++, PowerB..."
2,Altera Digital Health,Associate Software Engineer - Remote,Remote in USA,3+ years relevant experience,Must have at least one year of non-internship ...,"Julia, Social Media Marketing, Snowflake, Dock..."
3,Lucid,Data Analyst,"Raleigh, NC",1-2 years relevant experience,US Citizen,"Flask, JavaScript, Azure, Algorithms, Neo4j, C..."
4,Lucid,Data Analyst,"Salt Lake City, UT",1-2 years relevant experience,US Citizen,"Ruby SQL, Linux, LaTeX, Technical Writing, Ang..."
...,...,...,...,...,...,...
295,Inari Agriculture,Enterprise Data Quality Intern,"Cambridge, MA",Minimum 1 year previous industry internship ex...,Must have valid license,"PyTorch, Graphic Design, Tableau, Italian, Web..."
296,Quantcast,Software Engineering Intern - Summer 2024,"Seattle, WA",None Required,Must have valid license,"Angular, Julia, Git, Backend Stack, Azure, Goo..."
297,Electric Hydrogen,Firmware Intern Summer 2024,"San Jose, CA",Previous industry internship experience requir...,Must have valid license,"Flask, Pandas, Frontend Stack, Agile, React, D..."
298,Niantic,Software Engineering Intern,"San Francisco, CA",None Required,Willing to relocate,"Git, OOP, Graphic Design, Perl, Ansible, AI/ML..."


In [246]:
# Join jobs_df with job_ids 
job_posting_df = pd.concat([job_ids_df.reset_index(drop=True), jobs_df.reset_index(drop=True)], axis=1)
job_posting_df

Unnamed: 0,Job_ID,Company,Job_Title,Location,Experience,Requirements,Skills
0,01HQY593T458D022KE1X5KW3GH,Squarepoint Capital,Financing Trader,"London, UK",5+ years relevant experience,Tableau data analyst certification,"SQL, Julia, SEO/SEM, Social Media Marketing, R..."
1,01HQY593T4WCDYXM6WM9A6H0EN,Altera Digital Health,Associate Software Engineer - Remote,Remote in USA,0-1 years relevant experience,AWS Cloud Practitioner certification,"Julia, Italian, Leadership, Keras, C++, PowerB..."
2,01HQY593T5XMCEY5W4CFFR7PDA,Altera Digital Health,Associate Software Engineer - Remote,Remote in USA,3+ years relevant experience,Must have at least one year of non-internship ...,"Julia, Social Media Marketing, Snowflake, Dock..."
3,01HQY593T5SCN9Y2ERYWHDH46Q,Lucid,Data Analyst,"Raleigh, NC",1-2 years relevant experience,US Citizen,"Flask, JavaScript, Azure, Algorithms, Neo4j, C..."
4,01HQY593T5WG867X8XF3X3WRFP,Lucid,Data Analyst,"Salt Lake City, UT",1-2 years relevant experience,US Citizen,"Ruby SQL, Linux, LaTeX, Technical Writing, Ang..."
...,...,...,...,...,...,...,...
595,01HQY594028NGXXS2M0VX5Q5JS,Inari Agriculture,Enterprise Data Quality Intern,"Cambridge, MA",Minimum 1 year previous industry internship ex...,Must have valid license,"PyTorch, Graphic Design, Tableau, Italian, Web..."
596,01HQY59402WWT0MVTDPMEFTZ23,Quantcast,Software Engineering Intern - Summer 2024,"Seattle, WA",None Required,Must have valid license,"Angular, Julia, Git, Backend Stack, Azure, Goo..."
597,01HQY59402G3G666SG3MWAHWN1,Electric Hydrogen,Firmware Intern Summer 2024,"San Jose, CA",Previous industry internship experience requir...,Must have valid license,"Flask, Pandas, Frontend Stack, Agile, React, D..."
598,01HQY594035G4V2RSJSC0Q3RW8,Niantic,Software Engineering Intern,"San Francisco, CA",None Required,Willing to relocate,"Git, OOP, Graphic Design, Perl, Ansible, AI/ML..."


In [247]:
# Replace Company with Company_ID
company_df = pd.read_csv('data/Company.csv', header=0)
company_id_mapping = company_df.set_index('Name')['Company_ID'].to_dict()
job_posting_df['Company'] = job_posting_df['Company'].map(company_id_mapping)
job_posting_df = job_posting_df.rename(columns={'Company': 'Company_ID'})
job_posting_df

Unnamed: 0,Job_ID,Company_ID,Job_Title,Location,Experience,Requirements,Skills
0,01HQY593T458D022KE1X5KW3GH,01HQXAVCMZWP3AR7WDCNQR8KSX,Financing Trader,"London, UK",5+ years relevant experience,Tableau data analyst certification,"SQL, Julia, SEO/SEM, Social Media Marketing, R..."
1,01HQY593T4WCDYXM6WM9A6H0EN,01HQXAVCMWEXEJYR1CVK9QPZQ7,Associate Software Engineer - Remote,Remote in USA,0-1 years relevant experience,AWS Cloud Practitioner certification,"Julia, Italian, Leadership, Keras, C++, PowerB..."
2,01HQY593T5XMCEY5W4CFFR7PDA,01HQXAVCMWEXEJYR1CVK9QPZQ7,Associate Software Engineer - Remote,Remote in USA,3+ years relevant experience,Must have at least one year of non-internship ...,"Julia, Social Media Marketing, Snowflake, Dock..."
3,01HQY593T5SCN9Y2ERYWHDH46Q,01HQXAVCPEM51PZ6JJR54JWJ1H,Data Analyst,"Raleigh, NC",1-2 years relevant experience,US Citizen,"Flask, JavaScript, Azure, Algorithms, Neo4j, C..."
4,01HQY593T5WG867X8XF3X3WRFP,01HQXAVCPEM51PZ6JJR54JWJ1H,Data Analyst,"Salt Lake City, UT",1-2 years relevant experience,US Citizen,"Ruby SQL, Linux, LaTeX, Technical Writing, Ang..."
...,...,...,...,...,...,...,...
595,01HQY594028NGXXS2M0VX5Q5JS,01HQXAVCPHY1RNG3885A23RJYT,Enterprise Data Quality Intern,"Cambridge, MA",Minimum 1 year previous industry internship ex...,Must have valid license,"PyTorch, Graphic Design, Tableau, Italian, Web..."
596,01HQY59402WWT0MVTDPMEFTZ23,01HQXAVCP0DQPTCFMHX4QAZ2SE,Software Engineering Intern - Summer 2024,"Seattle, WA",None Required,Must have valid license,"Angular, Julia, Git, Backend Stack, Azure, Goo..."
597,01HQY59402G3G666SG3MWAHWN1,01HQXAVCPHBN0G8N7AF63NRNFM,Firmware Intern Summer 2024,"San Jose, CA",Previous industry internship experience requir...,Must have valid license,"Flask, Pandas, Frontend Stack, Agile, React, D..."
598,01HQY594035G4V2RSJSC0Q3RW8,01HQXAVCPVEC2FHWBN2QHD4H87,Software Engineering Intern,"San Francisco, CA",None Required,Willing to relocate,"Git, OOP, Graphic Design, Perl, Ansible, AI/ML..."


In [248]:
# Add a random Recruiter_Username from the Recruiter table to each tuple
recruiter_df = pd.read_csv('data/Recruiter.csv', header=0)
all_recruiter_usernames = recruiter_df['Username'].unique().tolist()
job_posting_df['Recruiter_Username'] = job_posting_df.apply(lambda row: random.choice(all_recruiter_usernames), axis=1)
job_posting_df

Unnamed: 0,Job_ID,Company_ID,Job_Title,Location,Experience,Requirements,Skills,Recruiter_Username
0,01HQY593T458D022KE1X5KW3GH,01HQXAVCMZWP3AR7WDCNQR8KSX,Financing Trader,"London, UK",5+ years relevant experience,Tableau data analyst certification,"SQL, Julia, SEO/SEM, Social Media Marketing, R...",icatfordy
1,01HQY593T4WCDYXM6WM9A6H0EN,01HQXAVCMWEXEJYR1CVK9QPZQ7,Associate Software Engineer - Remote,Remote in USA,0-1 years relevant experience,AWS Cloud Practitioner certification,"Julia, Italian, Leadership, Keras, C++, PowerB...",dfarries16
2,01HQY593T5XMCEY5W4CFFR7PDA,01HQXAVCMWEXEJYR1CVK9QPZQ7,Associate Software Engineer - Remote,Remote in USA,3+ years relevant experience,Must have at least one year of non-internship ...,"Julia, Social Media Marketing, Snowflake, Dock...",ryerbym
3,01HQY593T5SCN9Y2ERYWHDH46Q,01HQXAVCPEM51PZ6JJR54JWJ1H,Data Analyst,"Raleigh, NC",1-2 years relevant experience,US Citizen,"Flask, JavaScript, Azure, Algorithms, Neo4j, C...",rasaaft
4,01HQY593T5WG867X8XF3X3WRFP,01HQXAVCPEM51PZ6JJR54JWJ1H,Data Analyst,"Salt Lake City, UT",1-2 years relevant experience,US Citizen,"Ruby SQL, Linux, LaTeX, Technical Writing, Ang...",bgascar5
...,...,...,...,...,...,...,...,...
595,01HQY594028NGXXS2M0VX5Q5JS,01HQXAVCPHY1RNG3885A23RJYT,Enterprise Data Quality Intern,"Cambridge, MA",Minimum 1 year previous industry internship ex...,Must have valid license,"PyTorch, Graphic Design, Tableau, Italian, Web...",ahoutbyb
596,01HQY59402WWT0MVTDPMEFTZ23,01HQXAVCP0DQPTCFMHX4QAZ2SE,Software Engineering Intern - Summer 2024,"Seattle, WA",None Required,Must have valid license,"Angular, Julia, Git, Backend Stack, Azure, Goo...",gjohnu
597,01HQY59402G3G666SG3MWAHWN1,01HQXAVCPHBN0G8N7AF63NRNFM,Firmware Intern Summer 2024,"San Jose, CA",Previous industry internship experience requir...,Must have valid license,"Flask, Pandas, Frontend Stack, Agile, React, D...",bgascar5
598,01HQY594035G4V2RSJSC0Q3RW8,01HQXAVCPVEC2FHWBN2QHD4H87,Software Engineering Intern,"San Francisco, CA",None Required,Willing to relocate,"Git, OOP, Graphic Design, Perl, Ansible, AI/ML...",okiddd


In [250]:
# Add company's job_portal ID to each tuple based on the Company_ID
portal_df = pd.read_csv('data/Job_Portal.csv', header=0)
company_id_to_portal_id_map = portal_df.set_index('Company_ID')['Portal_ID'].to_dict()
job_posting_df['Portal_ID'] = job_posting_df['Company_ID'].map(company_id_to_portal_id_map)
job_posting_df

Unnamed: 0,Job_ID,Company_ID,Job_Title,Location,Experience,Requirements,Skills,Recruiter_Username,Portal_ID
0,01HQY593T458D022KE1X5KW3GH,01HQXAVCMZWP3AR7WDCNQR8KSX,Financing Trader,"London, UK",5+ years relevant experience,Tableau data analyst certification,"SQL, Julia, SEO/SEM, Social Media Marketing, R...",icatfordy,01HQY4F1HFKQA9XKN0VKY865YF
1,01HQY593T4WCDYXM6WM9A6H0EN,01HQXAVCMWEXEJYR1CVK9QPZQ7,Associate Software Engineer - Remote,Remote in USA,0-1 years relevant experience,AWS Cloud Practitioner certification,"Julia, Italian, Leadership, Keras, C++, PowerB...",dfarries16,01HQY4F1HC1XGPARVQC09JMXXB
2,01HQY593T5XMCEY5W4CFFR7PDA,01HQXAVCMWEXEJYR1CVK9QPZQ7,Associate Software Engineer - Remote,Remote in USA,3+ years relevant experience,Must have at least one year of non-internship ...,"Julia, Social Media Marketing, Snowflake, Dock...",ryerbym,01HQY4F1HC1XGPARVQC09JMXXB
3,01HQY593T5SCN9Y2ERYWHDH46Q,01HQXAVCPEM51PZ6JJR54JWJ1H,Data Analyst,"Raleigh, NC",1-2 years relevant experience,US Citizen,"Flask, JavaScript, Azure, Algorithms, Neo4j, C...",rasaaft,01HQY4F1JPS2JFGPVHB6W9AJZZ
4,01HQY593T5WG867X8XF3X3WRFP,01HQXAVCPEM51PZ6JJR54JWJ1H,Data Analyst,"Salt Lake City, UT",1-2 years relevant experience,US Citizen,"Ruby SQL, Linux, LaTeX, Technical Writing, Ang...",bgascar5,01HQY4F1JPS2JFGPVHB6W9AJZZ
...,...,...,...,...,...,...,...,...,...
595,01HQY594028NGXXS2M0VX5Q5JS,01HQXAVCPHY1RNG3885A23RJYT,Enterprise Data Quality Intern,"Cambridge, MA",Minimum 1 year previous industry internship ex...,Must have valid license,"PyTorch, Graphic Design, Tableau, Italian, Web...",ahoutbyb,01HQY4F1JRTGQ41Y87VM181ENJ
596,01HQY59402WWT0MVTDPMEFTZ23,01HQXAVCP0DQPTCFMHX4QAZ2SE,Software Engineering Intern - Summer 2024,"Seattle, WA",None Required,Must have valid license,"Angular, Julia, Git, Backend Stack, Azure, Goo...",gjohnu,01HQY4F1JCPNVC0BQ8TJBYMH09
597,01HQY59402G3G666SG3MWAHWN1,01HQXAVCPHBN0G8N7AF63NRNFM,Firmware Intern Summer 2024,"San Jose, CA",Previous industry internship experience requir...,Must have valid license,"Flask, Pandas, Frontend Stack, Agile, React, D...",bgascar5,01HQY4F1JRHTKSG0WE62KD649W
598,01HQY594035G4V2RSJSC0Q3RW8,01HQXAVCPVEC2FHWBN2QHD4H87,Software Engineering Intern,"San Francisco, CA",None Required,Willing to relocate,"Git, OOP, Graphic Design, Perl, Ansible, AI/ML...",okiddd,01HQY4F1JZYV3P6GKPYQ8NQ2B7


In [251]:
job_posting_df.to_csv('data/Job_Posting.csv', index=False)
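Because `Company_ID` and `Portal_ID` are filled in with `Series.map`, any key missing from the lookup dictionary silently becomes `NaN`. The sketch below shows one way to surface such rows before writing the CSV. The small frames and their values are hypothetical stand-ins for `Company.csv` and the postings data, and the sketch writes the ID to a new column (unlike the cells above, which overwrite `Company`) so the unmatched names stay visible.

```python
import pandas as pd

# Hypothetical miniature versions of Company.csv and the postings frame
company_df = pd.DataFrame({
    'Name': ['Chime', 'Roku'],
    'Company_ID': ['C1', 'C2'],
})
postings_df = pd.DataFrame({
    'Job_ID': ['J1', 'J2', 'J3'],
    'Company': ['Chime', 'Roku', 'Unknown Co'],
})

# Same name-to-ID mapping technique as in the cells above
company_id_mapping = company_df.set_index('Name')['Company_ID'].to_dict()
postings_df['Company_ID'] = postings_df['Company'].map(company_id_mapping)

# Names absent from the mapping become NaN; collect them for inspection
unmapped = postings_df.loc[postings_df['Company_ID'].isna(), 'Company'].unique().tolist()
print(unmapped)
```

An empty list would indicate that every posting received a foreign key value.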

### Full Time Job
- The scraped full-time postings data (Tech_Full_Time_Roles.csv, loaded above) included the Company, Role, Location, Application/Link, and Date Posted for each tuple.
- The Full_Time_Job table uses the Job_Posting.Job_ID foreign key as its primary key and adds an AnnualSalary attribute.

#### Preprocessing: 
- We keep the first 300 rows from our Job_Posting CSV.
- We drop all attributes besides Job_ID.
- We randomly populate AnnualSalary values.

In [256]:
full_time_job_df = pd.read_csv('data/Job_Posting.csv', header=0)
full_time_job_df

Unnamed: 0,Job_ID,Company_ID,Job_Title,Location,Experience,Requirements,Skills,Recruiter_Username,Portal_ID
0,01HQY593T458D022KE1X5KW3GH,01HQXAVCMZWP3AR7WDCNQR8KSX,Financing Trader,"London, UK",5+ years relevant experience,Tableau data analyst certification,"SQL, Julia, SEO/SEM, Social Media Marketing, R...",icatfordy,01HQY4F1HFKQA9XKN0VKY865YF
1,01HQY593T4WCDYXM6WM9A6H0EN,01HQXAVCMWEXEJYR1CVK9QPZQ7,Associate Software Engineer - Remote,Remote in USA,0-1 years relevant experience,AWS Cloud Practitioner certification,"Julia, Italian, Leadership, Keras, C++, PowerB...",dfarries16,01HQY4F1HC1XGPARVQC09JMXXB
2,01HQY593T5XMCEY5W4CFFR7PDA,01HQXAVCMWEXEJYR1CVK9QPZQ7,Associate Software Engineer - Remote,Remote in USA,3+ years relevant experience,Must have at least one year of non-internship ...,"Julia, Social Media Marketing, Snowflake, Dock...",ryerbym,01HQY4F1HC1XGPARVQC09JMXXB
3,01HQY593T5SCN9Y2ERYWHDH46Q,01HQXAVCPEM51PZ6JJR54JWJ1H,Data Analyst,"Raleigh, NC",1-2 years relevant experience,US Citizen,"Flask, JavaScript, Azure, Algorithms, Neo4j, C...",rasaaft,01HQY4F1JPS2JFGPVHB6W9AJZZ
4,01HQY593T5WG867X8XF3X3WRFP,01HQXAVCPEM51PZ6JJR54JWJ1H,Data Analyst,"Salt Lake City, UT",1-2 years relevant experience,US Citizen,"Ruby SQL, Linux, LaTeX, Technical Writing, Ang...",bgascar5,01HQY4F1JPS2JFGPVHB6W9AJZZ
...,...,...,...,...,...,...,...,...,...
595,01HQY594028NGXXS2M0VX5Q5JS,01HQXAVCPHY1RNG3885A23RJYT,Enterprise Data Quality Intern,"Cambridge, MA",Minimum 1 year previous industry internship ex...,Must have valid license,"PyTorch, Graphic Design, Tableau, Italian, Web...",ahoutbyb,01HQY4F1JRTGQ41Y87VM181ENJ
596,01HQY59402WWT0MVTDPMEFTZ23,01HQXAVCP0DQPTCFMHX4QAZ2SE,Software Engineering Intern - Summer 2024,"Seattle, WA",None Required,Must have valid license,"Angular, Julia, Git, Backend Stack, Azure, Goo...",gjohnu,01HQY4F1JCPNVC0BQ8TJBYMH09
597,01HQY59402G3G666SG3MWAHWN1,01HQXAVCPHBN0G8N7AF63NRNFM,Firmware Intern Summer 2024,"San Jose, CA",Previous industry internship experience requir...,Must have valid license,"Flask, Pandas, Frontend Stack, Agile, React, D...",bgascar5,01HQY4F1JRHTKSG0WE62KD649W
598,01HQY594035G4V2RSJSC0Q3RW8,01HQXAVCPVEC2FHWBN2QHD4H87,Software Engineering Intern,"San Francisco, CA",None Required,Willing to relocate,"Git, OOP, Graphic Design, Perl, Ansible, AI/ML...",okiddd,01HQY4F1JZYV3P6GKPYQ8NQ2B7


In [257]:
# First 300 tuples are full time job roles
full_time_job_df = full_time_job_df.iloc[:300]
full_time_job_df

Unnamed: 0,Job_ID,Company_ID,Job_Title,Location,Experience,Requirements,Skills,Recruiter_Username,Portal_ID
0,01HQY593T458D022KE1X5KW3GH,01HQXAVCMZWP3AR7WDCNQR8KSX,Financing Trader,"London, UK",5+ years relevant experience,Tableau data analyst certification,"SQL, Julia, SEO/SEM, Social Media Marketing, R...",icatfordy,01HQY4F1HFKQA9XKN0VKY865YF
1,01HQY593T4WCDYXM6WM9A6H0EN,01HQXAVCMWEXEJYR1CVK9QPZQ7,Associate Software Engineer - Remote,Remote in USA,0-1 years relevant experience,AWS Cloud Practitioner certification,"Julia, Italian, Leadership, Keras, C++, PowerB...",dfarries16,01HQY4F1HC1XGPARVQC09JMXXB
2,01HQY593T5XMCEY5W4CFFR7PDA,01HQXAVCMWEXEJYR1CVK9QPZQ7,Associate Software Engineer - Remote,Remote in USA,3+ years relevant experience,Must have at least one year of non-internship ...,"Julia, Social Media Marketing, Snowflake, Dock...",ryerbym,01HQY4F1HC1XGPARVQC09JMXXB
3,01HQY593T5SCN9Y2ERYWHDH46Q,01HQXAVCPEM51PZ6JJR54JWJ1H,Data Analyst,"Raleigh, NC",1-2 years relevant experience,US Citizen,"Flask, JavaScript, Azure, Algorithms, Neo4j, C...",rasaaft,01HQY4F1JPS2JFGPVHB6W9AJZZ
4,01HQY593T5WG867X8XF3X3WRFP,01HQXAVCPEM51PZ6JJR54JWJ1H,Data Analyst,"Salt Lake City, UT",1-2 years relevant experience,US Citizen,"Ruby SQL, Linux, LaTeX, Technical Writing, Ang...",bgascar5,01HQY4F1JPS2JFGPVHB6W9AJZZ
...,...,...,...,...,...,...,...,...,...
295,01HQY593WZE1AEE8N3EBPWCT8Y,01HQXAVCQS9YRKPJHYH127M12D,UX Designer - Remote from Turkey - Hungary or...,Remote,0-1 years relevant experience,Must be willing to travel 20% of time,"Flask, JavaScript, Social Media Marketing, Sof...",rasaaft,01HQY4F1KJNS9FNCHB8EX4R58Y
296,01HQY593WZ8NQWY5MZCXS6TEGT,01HQXAVCQS9YRKPJHYH127M12D,Software Engineer - Packet Forwarding Engines,"Vancouver, BC, Canada",0-1 years relevant experience,Must have at least one year of non-internship ...,"PostgreSQL, Azure, Google Cloud Platform, Reac...",kvivashq,01HQY4F1KJNS9FNCHB8EX4R58Y
297,01HQY593WZMQYQNW7XBG37NWJW,01HQXAVCQS9YRKPJHYH127M12D,Software Engineer - Network Systems,"Vancouver, BC, Canada",5+ years relevant experience,Tableau data analyst certification,"Backend Stack, Pandas, Software Development, R...",okiddd,01HQY4F1KJNS9FNCHB8EX4R58Y
298,01HQY593X0HFAWF9ST87SA04JH,01HQXAVCNHKWA836HCKT03M1DG,Software Engineer - Backend,Remote,0-1 years relevant experience,Must have 5+ years of relevant experience,"Content Marketing, Git, Urdu, Pandas, React, C...",eworledge17,01HQY4F1J0JSSV1HVEX780ANE5


In [258]:
# Drop everything except Job_ID
full_time_job_df = full_time_job_df.drop(columns=['Company_ID', 'Job_Title', 'Location', 'Experience', 'Requirements', 'Skills', 'Recruiter_Username', 'Portal_ID'])
full_time_job_df

Unnamed: 0,Job_ID
0,01HQY593T458D022KE1X5KW3GH
1,01HQY593T4WCDYXM6WM9A6H0EN
2,01HQY593T5XMCEY5W4CFFR7PDA
3,01HQY593T5SCN9Y2ERYWHDH46Q
4,01HQY593T5WG867X8XF3X3WRFP
...,...
295,01HQY593WZE1AEE8N3EBPWCT8Y
296,01HQY593WZ8NQWY5MZCXS6TEGT
297,01HQY593WZMQYQNW7XBG37NWJW
298,01HQY593X0HFAWF9ST87SA04JH


In [261]:
annual_salaries = [30000, 40000, 45000, 50000, 60000, 70000, 80000, 90000, 100000, 120000, 130000, 140000, 150000, 180000, 200000, 250000, 300000]
full_time_job_df['AnnualSalary'] = full_time_job_df.apply(lambda row: random.choice(annual_salaries), axis=1)
full_time_job_df

Unnamed: 0,Job_ID,AnnualSalary
0,01HQY593T458D022KE1X5KW3GH,100000
1,01HQY593T4WCDYXM6WM9A6H0EN,200000
2,01HQY593T5XMCEY5W4CFFR7PDA,70000
3,01HQY593T5SCN9Y2ERYWHDH46Q,45000
4,01HQY593T5WG867X8XF3X3WRFP,80000
...,...,...
295,01HQY593WZE1AEE8N3EBPWCT8Y,130000
296,01HQY593WZ8NQWY5MZCXS6TEGT,80000
297,01HQY593WZMQYQNW7XBG37NWJW,40000
298,01HQY593X0HFAWF9ST87SA04JH,200000


In [262]:
# Save final job postings df for full time roles in 'Full_Time_Job.csv' for Full_Time_Job relation!
full_time_job_df.to_csv('data/Full_Time_Job.csv', index=False)

### Internship
- Our original internship data, stored in Tech_Internship.csv, was sourced on February 28th from the GitHub repository "Summer 2024 Tech Internships by Pitt CSC & Simplify", owned by Simplify: https://github.com/SimplifyJobs/Summer2024-Internships
- This included Company, Role, Location, Application/Link, and Date Posted for each role. 
- For our Internship_Job table, we need Job_ID, Company, Experience, Location, Requirements, Skills, Salaried (boolean), and Duration attributes.
- Internship also has a Job_Posting.Job_ID foreign key.
 
#### Preprocessing: 
- In the initial data scraped from GitHub, a '↳' marker (displayed as 'Ü≥' in our loaded data) appeared in the 'Company' column of certain rows, denoting that the company name is the same as in the row before it. We replace each occurrence of this symbol with the correct company name.
- We dropped the Application/Link and Date Posted columns.
- We assign each internship posting a unique Job_ID taken from the Mockaroo-generated ID list.
- We randomly impute a value for the Experience, Requirements, Skills, Salaried, and Duration attributes.
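
The carry-forward step in the first bullet can also be written without a row loop. The following is a minimal sketch, assuming the placeholder is the `↳` arrow, possibly padded with whitespace; the toy frame and the `PLACEHOLDER` name are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical placeholder; substitute the exact string found in the loaded CSV.
PLACEHOLDER = "↳"

toy_df = pd.DataFrame({
    "Company": ["Chime", " ↳ ", "Roku", "↳ ", "↳"],
    "Role": ["SWE Intern", "Data Intern", "ML Intern", "QA Intern", "PM Intern"],
})

# Blank out placeholder cells, then forward-fill from the nearest real name above.
is_placeholder = toy_df["Company"].str.strip() == PLACEHOLDER
toy_df["Company"] = toy_df["Company"].mask(is_placeholder).ffill()

print(toy_df["Company"].tolist())
```

Forward-filling also handles several placeholder rows in a row, since each one inherits the last real company name above it.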

In [163]:
internship_df = pd.read_csv('data/Tech_Internship.csv', header=0)
internship_df

Unnamed: 0,Company,Role,Location,Application/Link,Date Posted
0,Chime,Software Engineer Intern - Growth Funding,SF,"<a href=""https://boards.greenhouse.io/chime/j...",Feb 28
1,CACI,Software Development Intern - Summer 2024,Remote in USA,"<a href=""https://caci.wd1.myworkdayjobs.com/E...",Feb 28
2,Western Digital,Summer 2024 Software Engineering Intern,"Longmont, CO","<a href=""https://jobs.smartrecruiters.com/Wes...",Feb 27
3,Veracode,Solutions Architecture and Security Consultin...,"Burlington, MA","<a href=""https://www.veracode.com/career/job?...",Feb 27
4,Roku,Machine Learning Engineer Intern,"San Jose, CA","<a href=""https://www.weareroku.com/jobs/57580...",Feb 27
...,...,...,...,...,...
2480,Zurich,Internship Program,Multiple Locations,üîí,May 2023
2481,BTIG,Software Engineer Intern,Multiple Locations,üîí,May 2023
2482,Internet Brands,Intern Software Engineer,"Los Angeles, California",üîí,May 2023
2483,Panasonic,Software Electrical Engineer Intern,TX,üîí,May 2023


In [164]:
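# Keep only the first 150 rows; .copy() makes an independent frame so that later
# in-place edits do not trigger pandas' SettingWithCopyWarning
internship_df = internship_df[:150].copy()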
print(len(internship_df))

150


In [165]:
# Replace 'Ü≥' symbols with correct company names, i.e. company name before that row
for i in range(1, len(internship_df)):  # row 0 has no previous row to copy from
    # Strip surrounding whitespace so every padded variant of the symbol is matched
    if str(internship_df.loc[i, 'Company']).strip() in ('Ü≥', '‚Ü≥'):
        internship_df.loc[i, 'Company'] = internship_df.loc[i-1, 'Company']
        
# Verify this was done correctly for all tuples (substring match catches padded variants too)
if not internship_df['Company'].str.contains('Ü≥', regex=False, na=False).any():
    print("There are no more 'Ü≥' symbols in the 'Company' column!")
internship_df

There are no more 'Ü≥' symbols in the 'Company' column!


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  internship_df.loc[i, 'Company'] = internship_df.loc[i-1, 'Company']


Unnamed: 0,Company,Role,Location,Application/Link,Date Posted
0,Chime,Software Engineer Intern - Growth Funding,SF,"<a href=""https://boards.greenhouse.io/chime/j...",Feb 28
1,CACI,Software Development Intern - Summer 2024,Remote in USA,"<a href=""https://caci.wd1.myworkdayjobs.com/E...",Feb 28
2,Western Digital,Summer 2024 Software Engineering Intern,"Longmont, CO","<a href=""https://jobs.smartrecruiters.com/Wes...",Feb 27
3,Veracode,Solutions Architecture and Security Consultin...,"Burlington, MA","<a href=""https://www.veracode.com/career/job?...",Feb 27
4,Roku,Machine Learning Engineer Intern,"San Jose, CA","<a href=""https://www.weareroku.com/jobs/57580...",Feb 27
...,...,...,...,...,...
145,Elevance Health,Data Analyst Internship - Summer 2024 - Under...,"Atlanta, GA","<a href=""https://elevancehealth.wd1.myworkday...",Feb 05
146,Elevance Health,Data Analyst Internship - Summer 2024 - Under...,"Atlanta, GA","<a href=""https://elevancehealth.wd1.myworkday...",Feb 05
147,Elevance Health,Data Analytics Internship - Undergrad - Summe...,"Burlington, MA","<a href=""https://elevancehealth.wd1.myworkday...",Feb 05
148,Chamberlain Group,Intern ‚Äì Software Engineer - Summer 2024,"Western Springs, IL","<a href=""https://chamberlain.wd1.myworkdayjob...",Feb 05


In [166]:
# Drop duplicate rows
internship_df.drop_duplicates(inplace=True)
internship_df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  internship_df.drop_duplicates(inplace=True)


Unnamed: 0,Company,Role,Location,Application/Link,Date Posted
0,Chime,Software Engineer Intern - Growth Funding,SF,"<a href=""https://boards.greenhouse.io/chime/j...",Feb 28
1,CACI,Software Development Intern - Summer 2024,Remote in USA,"<a href=""https://caci.wd1.myworkdayjobs.com/E...",Feb 28
2,Western Digital,Summer 2024 Software Engineering Intern,"Longmont, CO","<a href=""https://jobs.smartrecruiters.com/Wes...",Feb 27
3,Veracode,Solutions Architecture and Security Consultin...,"Burlington, MA","<a href=""https://www.veracode.com/career/job?...",Feb 27
4,Roku,Machine Learning Engineer Intern,"San Jose, CA","<a href=""https://www.weareroku.com/jobs/57580...",Feb 27
...,...,...,...,...,...
145,Elevance Health,Data Analyst Internship - Summer 2024 - Under...,"Atlanta, GA","<a href=""https://elevancehealth.wd1.myworkday...",Feb 05
146,Elevance Health,Data Analyst Internship - Summer 2024 - Under...,"Atlanta, GA","<a href=""https://elevancehealth.wd1.myworkday...",Feb 05
147,Elevance Health,Data Analytics Internship - Undergrad - Summe...,"Burlington, MA","<a href=""https://elevancehealth.wd1.myworkday...",Feb 05
148,Chamberlain Group,Intern ‚Äì Software Engineer - Summer 2024,"Western Springs, IL","<a href=""https://chamberlain.wd1.myworkdayjob...",Feb 05


In [167]:
# Drop the Application/Link and Date Posted columns
internship_df = internship_df.drop(columns=['Application/Link', 'Date Posted'])
internship_df

Unnamed: 0,Company,Role,Location
0,Chime,Software Engineer Intern - Growth Funding,SF
1,CACI,Software Development Intern - Summer 2024,Remote in USA
2,Western Digital,Summer 2024 Software Engineering Intern,"Longmont, CO"
3,Veracode,Solutions Architecture and Security Consultin...,"Burlington, MA"
4,Roku,Machine Learning Engineer Intern,"San Jose, CA"
...,...,...,...
145,Elevance Health,Data Analyst Internship - Summer 2024 - Under...,"Atlanta, GA"
146,Elevance Health,Data Analyst Internship - Summer 2024 - Under...,"Atlanta, GA"
147,Elevance Health,Data Analytics Internship - Undergrad - Summe...,"Burlington, MA"
148,Chamberlain Group,Intern ‚Äì Software Engineer - Summer 2024,"Western Springs, IL"


In [169]:
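# Assign a Job_ID to each internship
# Slice into a new variable so that re-running this cell does not shrink job_ids_df,
# and take exactly as many IDs as there are internship rows left after de-duplication.
selected_job_ids = job_ids_df.iloc[301:301 + len(internship_df)].reset_index(drop=True)

# Concatenate selected_job_ids with internship_df
internship_df = pd.concat([selected_job_ids, internship_df.reset_index(drop=True)], axis=1)
internship_df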

Unnamed: 0,Job_ID,Company,Role,Location
0,,Chime,Software Engineer Intern - Growth Funding,SF
1,,CACI,Software Development Intern - Summer 2024,Remote in USA
2,,Western Digital,Summer 2024 Software Engineering Intern,"Longmont, CO"
3,,Veracode,Solutions Architecture and Security Consultin...,"Burlington, MA"
4,,Roku,Machine Learning Engineer Intern,"San Jose, CA"
...,...,...,...,...
145,,Elevance Health,Data Analyst Internship - Summer 2024 - Under...,"Atlanta, GA"
146,,Elevance Health,Data Analyst Internship - Summer 2024 - Under...,"Atlanta, GA"
147,,Elevance Health,Data Analytics Internship - Undergrad - Summe...,"Burlington, MA"
148,,Chamberlain Group,Intern ‚Äì Software Engineer - Summer 2024,"Western Springs, IL"


In [12]:
# Randomly impute a value for the Experience, Requirements, Skills, Salaried, and Duration attributes for each tuple
intern_experience_vals = ['None Required', 'Previous industry internship experience required (>=3 months)', 'Previous research/academic experience required (>=3 months)', 'Minimum 1 year previous industry internship experience required', 'Previous research/academic highly desirable']

In [13]:
intern_requirements_vals = ['Willing to relocate', 'Willing to travel up to 20%', 'Must meet base technical criteria', 'Must be proficient in Microsoft Office Suite', 'Good sense of humor', 'Ability to work independently', 'Familiar with Agile methodologies', 'Must have valid license', 'Must tolerate dogs in the workplace']

In [138]:
def get_random_skills(job_skills):
    """Return a comma-separated string of randomly chosen, de-duplicated skills."""
    # Draw between 5 and 12 skills with replacement; duplicates are removed by the set,
    # so the final list may contain fewer skills than the number drawn.
    num_skills = random.randint(5, 12)
    skills = {random.choice(job_skills) for _ in range(num_skills)}
    return ', '.join(skills)

In [15]:
duration_vals = ['4 weeks', '8 weeks', '12 weeks', '16 weeks']
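
One way the value lists above could be turned into columns is sketched here. The column names follow the Internship attributes listed earlier; the shortened lists, the toy frame, and the seed are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability

# Shortened stand-ins for the value lists defined in the cells above.
experience_vals = ["None Required", "Previous industry internship experience required (>=3 months)"]
requirements_vals = ["Willing to relocate", "Ability to work independently"]
durations = ["4 weeks", "8 weeks", "12 weeks", "16 weeks"]

toy_internships = pd.DataFrame({"Company": ["Chime", "CACI", "Roku"]})
n = len(toy_internships)

# Draw one value per row, uniformly at random, for each attribute.
toy_internships["Experience"] = rng.choice(experience_vals, size=n)
toy_internships["Requirements"] = rng.choice(requirements_vals, size=n)
toy_internships["Duration"] = rng.choice(durations, size=n)

print(toy_internships)
```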

In [16]:
# Make only 3% of values Salaried = False because I don't want to live in a database world where most companies benefit off the backs of innocent, eager interns 
salaried = [True, False]
salary_probs = [0.97, 0.03]
internship_df['Salaried'] = np.random.choice(salaried, size=len(internship_df), p=salary_probs)
internship_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  internship_df['Salaried'] = np.random.choice(salaried, size=len(internship_df), p=salary_probs)


Unnamed: 0,Company,Role,Location,Application/Link,Date Posted,Salaried
0,Chime,Software Engineer Intern - Growth Funding,SF,"<a href=""https://boards.greenhouse.io/chime/j...",Feb 28,True
1,CACI,Software Development Intern - Summer 2024,Remote in USA,"<a href=""https://caci.wd1.myworkdayjobs.com/E...",Feb 28,True
2,Western Digital,Summer 2024 Software Engineering Intern,"Longmont, CO","<a href=""https://jobs.smartrecruiters.com/Wes...",Feb 27,True
3,Veracode,Solutions Architecture and Security Consultin...,"Burlington, MA","<a href=""https://www.veracode.com/career/job?...",Feb 27,True
4,Roku,Machine Learning Engineer Intern,"San Jose, CA","<a href=""https://www.weareroku.com/jobs/57580...",Feb 27,True
...,...,...,...,...,...,...
295,Inari Agriculture,Enterprise Data Quality Intern,"Cambridge, MA","<a href=""https://boards.greenhouse.io/inariag...",Dec 08,True
296,Quantcast,Software Engineering Intern - Summer 2024,"Seattle, WA","<a href=""https://jobs.lever.co/quantcast/9ab8...",Dec 07,True
297,Electric Hydrogen,Firmware Intern Summer 2024,"San Jose, CA","<a href=""https://eh2.com/careers?gh_jid=43477...",Dec 07,True
298,Niantic,Software Engineering Intern,"San Francisco, CA","<a href=""https://app.ripplematch.com/v2/publi...",Dec 06,True


In [ ]:
# Save final internships df in 'Internship_Job.csv' for Internship_Job relation!

### Coop Job
- Coop job also has a Job_Posting.Job_ID foreign key.

### Company
- Mockaroo was used to obtain the Company_ID, Location, and Name column values for N rows, where N was the number of unique companies identified in our job postings. 
- We replace the Mockaroo-generated Name values in the Company relation with the real company names found in our job posting sources (both full-time roles and internship roles). There should be exactly one entry in the Company table for each unique company found in either job postings CSV.
- Because the location option on Mockaroo was only a street address, we impute a random city, state, and zip code as shown below.
- Additionally, we assign a random company type for the Type column. We arbitrarily deem 80% of the companies as private corporations, 5% as non-profit, and 15% as startups. 
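
The name replacement described above only works when the number of Mockaroo rows equals the number of unique company names. A minimal sketch of that step with an explicit length check follows; the toy frames and their values are hypothetical, and sorting the names is an added choice so that reruns pair the same name with the same Company_ID.

```python
import pandas as pd

# Toy stand-ins for the two job-posting tables and the Mockaroo company rows.
full_time_toy = pd.DataFrame({"Company": ["Hudl", "Forbes", "Zynga"]})
internship_toy = pd.DataFrame({"Company": ["Chime", "Forbes", "Roku"]})
company_toy = pd.DataFrame({
    "Company_ID": ["C1", "C2", "C3", "C4", "C5"],
    "Name": ["Dazzlesphere", "Skiba", "Edgeclub", "Gigazoom", "Quire"],
})

# Union of company names, sorted so that reruns assign names to the same IDs.
names = sorted(set(full_time_toy["Company"]) | set(internship_toy["Company"]))

# Direct column assignment requires one placeholder row per unique name.
assert len(names) == len(company_toy), "row count must equal number of unique companies"
company_toy["Name"] = names

print(company_toy)
```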

In [16]:
company_df = pd.read_csv('data/Mockaroo/Mockaroo-Company.csv', header=0)[:311]
company_df

Unnamed: 0,Company_ID,Address,Name,Type
0,01HQXAVCMWEXEJYR1CVK9QPZQ7,414 Havey Hill,Dazzlesphere,
1,01HQXAVCMXRWCNSE1XGC88Z5KR,8989 Swallow Plaza,Skiba,
2,01HQXAVCMXTRC4Y81YV23C1ZQ8,4778 Sage Lane,Edgeclub,
3,01HQXAVCMXCRNPED9KR90QAGJN,42 Corben Road,Gigazoom,
4,01HQXAVCMXJ11TZ0EJQNZST6SP,3 Sycamore Parkway,Quire,
...,...,...,...,...
306,01HQXAVCQZT7RE42HZ1RT90XT3,23 Corscot Road,Meezzy,
307,01HQXAVCQZYF5N9HVP8QNCXW97,566 Cordelia Center,BlogXS,
308,01HQXAVCR0PX1DP6TJNR6FAJ4M,935 Old Gate Parkway,Skyble,
309,01HQXAVCR0113ZBMBE6D7ARSM2,3792 Rutledge Crossing,Mydeo,


In [17]:
# Get names of all unique companies represented in the database
full_time_companies = full_time_job_df['Company'].unique().tolist()
internship_companies = internship_df['Company'].unique().tolist()
all_company_names = list(set(full_time_companies + internship_companies))
print(f"There are {len(all_company_names)} companies in the dataset.")

There are 311 companies in the dataset.


In [18]:
# Replace Company Name values with real names
company_df['Name'] = all_company_names
company_df

Unnamed: 0,Company_ID,Address,Name,Type
0,01HQXAVCMWEXEJYR1CVK9QPZQ7,414 Havey Hill,Hudl,
1,01HQXAVCMXRWCNSE1XGC88Z5KR,8989 Swallow Plaza,Alarm.com,
2,01HQXAVCMXTRC4Y81YV23C1ZQ8,4778 Sage Lane,Forbes,
3,01HQXAVCMXCRNPED9KR90QAGJN,42 Corben Road,Konrad Group,
4,01HQXAVCMXJ11TZ0EJQNZST6SP,3 Sycamore Parkway,Ascend Analytics,
...,...,...,...,...
306,01HQXAVCQZT7RE42HZ1RT90XT3,23 Corscot Road,Zynga,
307,01HQXAVCQZYF5N9HVP8QNCXW97,566 Cordelia Center,Workato,
308,01HQXAVCR0PX1DP6TJNR6FAJ4M,935 Old Gate Parkway,Wisk,
309,01HQXAVCR0113ZBMBE6D7ARSM2,3792 Rutledge Crossing,Moveworks,


In [19]:
# Mapping of 'tech hub' cities, their state, and some example zip codes (making sure the city, state, and zip codes are coherent relative to each other). 
# Zip codes were obtained from Google searches. 
tech_hub_cities_mapping = {
    'New York': {'state': 'New York', 'zip_codes': ['10001', '10002', '10003']},
    'San Francisco': {'state': 'California', 'zip_codes': ['94102', '94103', '94107']},
    'Los Angeles': {'state': 'California', 'zip_codes': ['90001', '90002', '90003']},
    'Austin': {'state': 'Texas', 'zip_codes': ['73301', '73344', '78613']},
    'Dallas': {'state': 'Texas', 'zip_codes': ['75001', '75019', '75032']},
    'Seattle': {'state': 'Washington', 'zip_codes': ['98101', '98102', '98103']},
    'Atlanta': {'state': 'Georgia', 'zip_codes': ['30033', '30301', '30303']},
    'Denver': {'state': 'Colorado', 'zip_codes': ['80014', '80019', '80022']},
    'Chicago': {'state': 'Illinois', 'zip_codes': ['60007', '60018', '60106']},
    'Miami': {'state': 'Florida', 'zip_codes': ['33101', '33109', '33126']},
    'Tampa': {'state': 'Florida', 'zip_codes': ['33592', '33601', '33602']},
    'Boston': {'state': 'Massachusetts', 'zip_codes': ['02108', '02110', '02111']}
}

def generate_address(str_address):
    """Randomly selects a city and corresponding state and zip code from tech hub cities dictionary above."""
    city = random.choice(list(tech_hub_cities_mapping.keys()))
    state = tech_hub_cities_mapping[city]['state']
    zip_code = random.choice(tech_hub_cities_mapping[city]['zip_codes'])
    return str_address + ', ' + city + ', ' + state + ' ' + zip_code

# Assign random city, state, and zip code to each address
company_df['Address'] = company_df['Address'].apply(lambda x: generate_address(x).strip('"'))
company_df['Address'] = company_df['Address'].astype(str)
company_df 

Unnamed: 0,Company_ID,Address,Name,Type
0,01HQXAVCMWEXEJYR1CVK9QPZQ7,"414 Havey Hill, Austin, Texas 73344",Hudl,
1,01HQXAVCMXRWCNSE1XGC88Z5KR,"8989 Swallow Plaza, Los Angeles, California 90003",Alarm.com,
2,01HQXAVCMXTRC4Y81YV23C1ZQ8,"4778 Sage Lane, Dallas, Texas 75032",Forbes,
3,01HQXAVCMXCRNPED9KR90QAGJN,"42 Corben Road, San Francisco, California 94103",Konrad Group,
4,01HQXAVCMXJ11TZ0EJQNZST6SP,"3 Sycamore Parkway, Austin, Texas 73344",Ascend Analytics,
...,...,...,...,...
306,01HQXAVCQZT7RE42HZ1RT90XT3,"23 Corscot Road, Austin, Texas 73301",Zynga,
307,01HQXAVCQZYF5N9HVP8QNCXW97,"566 Cordelia Center, New York, New York 10001",Workato,
308,01HQXAVCR0PX1DP6TJNR6FAJ4M,"935 Old Gate Parkway, Denver, Colorado 80019",Wisk,
309,01HQXAVCR0113ZBMBE6D7ARSM2,"3792 Rutledge Crossing, Los Angeles, Californi...",Moveworks,


In [20]:
# Assign each company a random 'type' from some preset company types
company_types = ['Private Corporation', 'Non-Profit Organization', 'Startup']
probabilities = [0.80, 0.05, 0.15]
company_df['Type'] = np.random.choice(company_types, size=len(company_df), p=probabilities)
company_df

Unnamed: 0,Company_ID,Address,Name,Type
0,01HQXAVCMWEXEJYR1CVK9QPZQ7,"414 Havey Hill, Austin, Texas 73344",Hudl,Private Corporation
1,01HQXAVCMXRWCNSE1XGC88Z5KR,"8989 Swallow Plaza, Los Angeles, California 90003",Alarm.com,Private Corporation
2,01HQXAVCMXTRC4Y81YV23C1ZQ8,"4778 Sage Lane, Dallas, Texas 75032",Forbes,Startup
3,01HQXAVCMXCRNPED9KR90QAGJN,"42 Corben Road, San Francisco, California 94103",Konrad Group,Private Corporation
4,01HQXAVCMXJ11TZ0EJQNZST6SP,"3 Sycamore Parkway, Austin, Texas 73344",Ascend Analytics,Private Corporation
...,...,...,...,...
306,01HQXAVCQZT7RE42HZ1RT90XT3,"23 Corscot Road, Austin, Texas 73301",Zynga,Private Corporation
307,01HQXAVCQZYF5N9HVP8QNCXW97,"566 Cordelia Center, New York, New York 10001",Workato,Private Corporation
308,01HQXAVCR0PX1DP6TJNR6FAJ4M,"935 Old Gate Parkway, Denver, Colorado 80019",Wisk,Private Corporation
309,01HQXAVCR0113ZBMBE6D7ARSM2,"3792 Rutledge Crossing, Los Angeles, Californi...",Moveworks,Private Corporation


In [69]:
# Save final company data
company_df.to_csv('data/Company.csv', index=False)

### Employee
- Mockaroo was used to generate the Employee_ID, Name, Job Title, Department, and Company column values. 
- We modify the Company name column to be Company_ID (foreign key Company.Company_ID) and populate it with random Company_ID values.
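
Because Company_ID acts as a foreign key here, it can be worth confirming that every assigned value really exists in the Company table. The sketch below shows one such check with `isin`; the toy frames, IDs, and seed are hypothetical.

```python
import random
import pandas as pd

random.seed(0)  # arbitrary seed, only so the sketch is repeatable

company_toy = pd.DataFrame({"Company_ID": ["C1", "C2", "C3"]})
employee_toy = pd.DataFrame({
    "Employee_ID": ["E1", "E2", "E3", "E4"],
    "Company": ["Lendbuzz", "Domo", "Tempus", "Skydio"],
})

# Swap the company-name column for a randomly chosen Company_ID.
company_ids = company_toy["Company_ID"].tolist()
employee_toy = employee_toy.rename(columns={"Company": "Company_ID"})
employee_toy["Company_ID"] = [random.choice(company_ids) for _ in range(len(employee_toy))]

# Referential-integrity check: every Employee.Company_ID must exist in Company.
orphans = employee_toy[~employee_toy["Company_ID"].isin(company_toy["Company_ID"])]
assert orphans.empty, "found employees pointing at unknown companies"
```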

In [199]:
employee_df = pd.read_csv('data/Mockaroo/Mockaroo-Employee.csv')
employee_df

Unnamed: 0,Employee_ID,Name,Job Title,Department,Company
0,01HQ9SN558SBTGP3CJ0SFCNX58,Johan Debold,Design Engineer,Accounting,Lendbuzz
1,01HQ9SN559DXNCX8PZGSPX86DQ,Aaron Janicki,Payment Adjustment Coordinator,Accounting,Zeno Group
2,01HQ9SN55ADXDE35WM2X25FRFN,Atalanta Watting,Marketing Manager,Business Development,Second Order Effects
3,01HQ9SN55BJ68WCF69VZ75WN9D,Kippie Caple,Director of Sales,Support,Domo
4,01HQ9SN55B1AQ4A44XRSQYVQ4B,Francklyn Jansey,Administrative Officer,Sales,Comerica Bank
...,...,...,...,...,...
94,01HQ9SN57BECEEC1DEDFDV2HPN,Katheryn Joannidi,Professor,Support,Tempus
95,01HQ9SN57B6SHA3DBMQQWB97AV,Irwin Giffen,Help Desk Operator,Services,Northwestern Mutual
96,01HQ9SN57C19SH5TQ03KMD41XB,Sarette Cheel,Speech Pathologist,Product Management,Skydio
97,01HQ9SN57DA8AY2Y6M4EE0KYVJ,Hillel Pero,Senior Financial Analyst,Business Development,The Walt Disney Company


In [200]:
# Add Company_ID to each tuple corresponding to the company, rather than having Company Name
employee_df.rename(columns={'Company': 'Company_ID'}, inplace=True)
all_company_ids = company_df['Company_ID'].unique().tolist()
employee_df['Company_ID'] = employee_df.apply(lambda row: random.choice(all_company_ids), axis=1)
employee_df

Unnamed: 0,Employee_ID,Name,Job Title,Department,Company_ID
0,01HQ9SN558SBTGP3CJ0SFCNX58,Johan Debold,Design Engineer,Accounting,01HQXAVCQBVHT9XC34FKA4XYXB
1,01HQ9SN559DXNCX8PZGSPX86DQ,Aaron Janicki,Payment Adjustment Coordinator,Accounting,01HQXAVCPJ218ZA3GJA08BR1B6
2,01HQ9SN55ADXDE35WM2X25FRFN,Atalanta Watting,Marketing Manager,Business Development,01HQXAVCQKKYWFF64YWV4K3E8R
3,01HQ9SN55BJ68WCF69VZ75WN9D,Kippie Caple,Director of Sales,Support,01HQXAVCQ02365QPRJ62VXC478
4,01HQ9SN55B1AQ4A44XRSQYVQ4B,Francklyn Jansey,Administrative Officer,Sales,01HQXAVCQNMEQEFMHW307R7PP8
...,...,...,...,...,...
94,01HQ9SN57BECEEC1DEDFDV2HPN,Katheryn Joannidi,Professor,Support,01HQXAVCP216S8X1ZN9GC7KQPJ
95,01HQ9SN57B6SHA3DBMQQWB97AV,Irwin Giffen,Help Desk Operator,Services,01HQXAVCPDFZZ0P10R3CDCYGF7
96,01HQ9SN57C19SH5TQ03KMD41XB,Sarette Cheel,Speech Pathologist,Product Management,01HQXAVCN1YW764GN3WKHBNSME
97,01HQ9SN57DA8AY2Y6M4EE0KYVJ,Hillel Pero,Senior Financial Analyst,Business Development,01HQXAVCQSQ1QSGKGKRFX2Y1B2


In [188]:
employee_df.to_csv('data/Employee.csv', index=False)

### Recruiter
- The initial Recruiter table data with a Username, Name, Address, Company, and Specialization for each row was generated using Mockaroo.
- As with the Employee table above, we replace the Mockaroo-generated Company names with random Company_ID values from the Company table, so that every recruiter belongs to a company that appears in the job postings.
- As with the Company table, Mockaroo could only generate street addresses, so we append a random city and its corresponding state and zip code to each address.
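
If the generated addresses need to be identical across notebook runs, the random draws can come from a seeded generator. This is a sketch only: `make_address`, `CITY_MAPPING`, and the seed value are hypothetical names introduced here, and the mapping is a reduced copy of the city table used for the Company addresses.

```python
import random

# Reduced, illustrative copy of the city mapping; not the full table.
CITY_MAPPING = {
    "Seattle": {"state": "Washington", "zip_codes": ["98101", "98102"]},
    "Boston": {"state": "Massachusetts", "zip_codes": ["02108", "02110"]},
}

def make_address(street, rng):
    """Append a random, mutually consistent city, state, and zip code to a street address."""
    city = rng.choice(sorted(CITY_MAPPING))
    entry = CITY_MAPPING[city]
    return f"{street}, {city}, {entry['state']} {rng.choice(entry['zip_codes'])}"

rng = random.Random(42)  # hypothetical seed; any fixed value gives repeatable output
streets = ["7272 Mesta Drive", "582 Warner Drive"]
addresses = [make_address(street, rng) for street in streets]
print(addresses)
```

Passing the generator in as an argument keeps the function free of global state, so separate tables can use separate seeds.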

In [177]:
recruiter_df = pd.read_csv('data/Mockaroo/Mockaroo-Recruiter.csv')
recruiter_df

Unnamed: 0,Username,Name,Address,Company,Specialization
0,fbrafield0,Fernandina Brafield,7272 Mesta Drive,Comerica Bank,Accounting
1,cmeech1,Corrine Meech,582 Warner Drive,Domo,Research and Development
2,dfeander2,Dalston Feander,12567 Elgar Street,Five9,Support
3,rbloggett3,Rawley Bloggett,533 Orin Street,TS Imagine,Legal
4,ehoulston4,Etheline Houlston,06 Pond Center,Align Technology,Marketing
5,bgascar5,Ban Gascar,1 Surrey Road,Linkedin,Business Development
6,lvose6,Loy Vose,3 Forest Run Road,Hudson River Trading,Marketing
7,ffossitt7,Ferdinanda Fossitt,98 Eliot Junction,Ramp,Accounting
8,abeagin8,Angela Beagin,0 Miller Place,Bodo.ai,Support
9,mcrowcher9,Miltie Crowcher,3824 Messerschmidt Plaza,Artisan Partners,Human Resources


In [179]:
# Replace company names with Company_ID values drawn from the Company table
all_company_ids = company_df['Company_ID'].unique().tolist()
recruiter_df['Company'] = recruiter_df.apply(lambda row: random.choice(all_company_ids), axis=1)
recruiter_df

Unnamed: 0,Username,Name,Address,Company,Specialization
0,fbrafield0,Fernandina Brafield,7272 Mesta Drive,01HQXAVCQQBJ0N2PNV7PV6XR6E,Accounting
1,cmeech1,Corrine Meech,582 Warner Drive,01HQXAVCNTQNZCXJ8GX3F27CTG,Research and Development
2,dfeander2,Dalston Feander,12567 Elgar Street,01HQXAVCN13C9848XTXCNHZCC0,Support
3,rbloggett3,Rawley Bloggett,533 Orin Street,01HQXAVCNX82GGSTVNHY6F6RAG,Legal
4,ehoulston4,Etheline Houlston,06 Pond Center,01HQXAVCP352J4N10V5EPHAY74,Marketing
5,bgascar5,Ban Gascar,1 Surrey Road,01HQXAVCQV8G9VHVP1XP0SK707,Business Development
6,lvose6,Loy Vose,3 Forest Run Road,01HQXAVCNAW7MHYE3VAJZDNKH5,Marketing
7,ffossitt7,Ferdinanda Fossitt,98 Eliot Junction,01HQXAVCPCK9CXB4E8370GN18C,Accounting
8,abeagin8,Angela Beagin,0 Miller Place,01HQXAVCNW34J7WXJ18VDZW89S,Support
9,mcrowcher9,Miltie Crowcher,3824 Messerschmidt Plaza,01HQXAVCPPFYDJFMF8XWMF4SBT,Human Resources


In [180]:
# Adjust street addresses to also have city, state, and zip code
recruiter_df['Address'] = recruiter_df['Address'].apply(lambda x: generate_address(x).strip('"'))
recruiter_df['Address'] = recruiter_df['Address'].astype(str)
recruiter_df 

Unnamed: 0,Username,Name,Address,Company,Specialization
0,fbrafield0,Fernandina Brafield,"7272 Mesta Drive, Seattle, Washington 98103",01HQXAVCQQBJ0N2PNV7PV6XR6E,Accounting
1,cmeech1,Corrine Meech,"582 Warner Drive, Dallas, Texas 75019",01HQXAVCNTQNZCXJ8GX3F27CTG,Research and Development
2,dfeander2,Dalston Feander,"12567 Elgar Street, Miami, Florida 33101",01HQXAVCN13C9848XTXCNHZCC0,Support
3,rbloggett3,Rawley Bloggett,"533 Orin Street, Miami, Florida 33109",01HQXAVCNX82GGSTVNHY6F6RAG,Legal
4,ehoulston4,Etheline Houlston,"06 Pond Center, Dallas, Texas 75001",01HQXAVCP352J4N10V5EPHAY74,Marketing
5,bgascar5,Ban Gascar,"1 Surrey Road, Boston, Massachusetts 02110",01HQXAVCQV8G9VHVP1XP0SK707,Business Development
6,lvose6,Loy Vose,"3 Forest Run Road, Chicago, Illinois 60106",01HQXAVCNAW7MHYE3VAJZDNKH5,Marketing
7,ffossitt7,Ferdinanda Fossitt,"98 Eliot Junction, Los Angeles, California 90003",01HQXAVCPCK9CXB4E8370GN18C,Accounting
8,abeagin8,Angela Beagin,"0 Miller Place, Boston, Massachusetts 02108",01HQXAVCNW34J7WXJ18VDZW89S,Support
9,mcrowcher9,Miltie Crowcher,"3824 Messerschmidt Plaza, Denver, Colorado 80014",01HQXAVCPPFYDJFMF8XWMF4SBT,Human Resources


In [182]:
# Rename 'Company' to 'Company_ID'
recruiter_df = recruiter_df.rename(columns={'Company': 'Company_ID'})
recruiter_df

Unnamed: 0,Username,Name,Address,Company_ID,Specialization
0,fbrafield0,Fernandina Brafield,"7272 Mesta Drive, Seattle, Washington 98103",01HQXAVCQQBJ0N2PNV7PV6XR6E,Accounting
1,cmeech1,Corrine Meech,"582 Warner Drive, Dallas, Texas 75019",01HQXAVCNTQNZCXJ8GX3F27CTG,Research and Development
2,dfeander2,Dalston Feander,"12567 Elgar Street, Miami, Florida 33101",01HQXAVCN13C9848XTXCNHZCC0,Support
3,rbloggett3,Rawley Bloggett,"533 Orin Street, Miami, Florida 33109",01HQXAVCNX82GGSTVNHY6F6RAG,Legal
4,ehoulston4,Etheline Houlston,"06 Pond Center, Dallas, Texas 75001",01HQXAVCP352J4N10V5EPHAY74,Marketing
5,bgascar5,Ban Gascar,"1 Surrey Road, Boston, Massachusetts 02110",01HQXAVCQV8G9VHVP1XP0SK707,Business Development
6,lvose6,Loy Vose,"3 Forest Run Road, Chicago, Illinois 60106",01HQXAVCNAW7MHYE3VAJZDNKH5,Marketing
7,ffossitt7,Ferdinanda Fossitt,"98 Eliot Junction, Los Angeles, California 90003",01HQXAVCPCK9CXB4E8370GN18C,Accounting
8,abeagin8,Angela Beagin,"0 Miller Place, Boston, Massachusetts 02108",01HQXAVCNW34J7WXJ18VDZW89S,Support
9,mcrowcher9,Miltie Crowcher,"3824 Messerschmidt Plaza, Denver, Colorado 80014",01HQXAVCPPFYDJFMF8XWMF4SBT,Human Resources


In [183]:
recruiter_df.to_csv('data/Recruiter.csv', index=False)

### Candidate
- The original candidate data containing a Username, Name, Address, Education, and Email for each candidate was generated using Mockaroo.
- We add a city, state, and zip code to each of the street addresses Mockaroo generated.
- Because Mockaroo was only able to populate university names in the education column, we modify these entries below to include a random degree (followed by the university name).
- Skills are populated randomly based on the list of skills used to populate job postings.
- A plaintext resume for each candidate was generated using an OpenAI assistant as shown below. 
- The PastApplications column, which keeps track of the total number of applications a user has made in the past, is randomly generated.
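
As an illustration of the last bullet, a random PastApplications column could be produced as follows. The upper bound of 50 applications, the seed, and the toy frame are assumptions; the source does not state which range was actually used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # arbitrary seed for repeatability

toy_candidates = pd.DataFrame({"Username": ["pashlin0", "arowly1", "kkumaar2"]})

# Assumed range: 0 to 50 past applications, inclusive of both ends.
MAX_PAST_APPLICATIONS = 50
toy_candidates["PastApplications"] = rng.integers(
    low=0, high=MAX_PAST_APPLICATIONS, size=len(toy_candidates), endpoint=True
)

print(toy_candidates)
```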

In [123]:
candidate_df = pd.read_csv('data/Mockaroo/Mockaroo-Candidate.csv')
candidate_df

Unnamed: 0,Username,Name,Address,Education,Email
0,pashlin0,Phyllis Ashlin,72188 Welch Circle,University of Victoria,pashlin0@skype.com
1,arowly1,Angelle Rowly,5227 Elka Junction,Universidad Capitain General Gerardo Barrios,arowly1@hostgator.com
2,kkumaar2,Kim Kumaar,25 Magdeline Trail,Sri Padmavati Women's University,kkumaar2@ihg.com
3,kmalthouse3,Kakalina Malthouse,884 Donald Drive,State University of New York at Buffalo,kmalthouse3@comsenz.com
4,ldrew4,Lorianna Drew,688 Myrtle Terrace,Beni Suef University,ldrew4@yale.edu
...,...,...,...,...,...
295,pduquesnay87,Pancho Duquesnay,79868 Schurz Place,New Era University,pduquesnay87@europa.eu
296,lgaydon88,Lonnard Gaydon,158 Ridge Oak Avenue,Copenhagen Business School,lgaydon88@php.net
297,dchristie89,Des Christie,80412 Grim Pass,Universitas Brawijaya,dchristie89@discovery.com
298,hmorse8a,Hillary Morse,31 Riverside Alley,Uzbek State World Languages University,hmorse8a@blogspot.com


In [124]:
# Adjust street addresses to also have city, state, and zip code
candidate_df['Address'] = candidate_df['Address'].apply(lambda x: generate_address(x).strip('"'))
candidate_df['Address'] = candidate_df['Address'].astype(str)
candidate_df

Unnamed: 0,Username,Name,Address,Education,Email
0,pashlin0,Phyllis Ashlin,"72188 Welch Circle, Miami, Florida 33109",University of Victoria,pashlin0@skype.com
1,arowly1,Angelle Rowly,"5227 Elka Junction, Denver, Colorado 80022",Universidad Capitain General Gerardo Barrios,arowly1@hostgator.com
2,kkumaar2,Kim Kumaar,"25 Magdeline Trail, Chicago, Illinois 60007",Sri Padmavati Women's University,kkumaar2@ihg.com
3,kmalthouse3,Kakalina Malthouse,"884 Donald Drive, Atlanta, Georgia 30303",State University of New York at Buffalo,kmalthouse3@comsenz.com
4,ldrew4,Lorianna Drew,"688 Myrtle Terrace, Miami, Florida 33101",Beni Suef University,ldrew4@yale.edu
...,...,...,...,...,...
295,pduquesnay87,Pancho Duquesnay,"79868 Schurz Place, Los Angeles, California 90003",New Era University,pduquesnay87@europa.eu
296,lgaydon88,Lonnard Gaydon,"158 Ridge Oak Avenue, Chicago, Illinois 60018",Copenhagen Business School,lgaydon88@php.net
297,dchristie89,Des Christie,"80412 Grim Pass, Los Angeles, California 90003",Universitas Brawijaya,dchristie89@discovery.com
298,hmorse8a,Hillary Morse,"31 Riverside Alley, Chicago, Illinois 60018",Uzbek State World Languages University,hmorse8a@blogspot.com


In [125]:
# Populate Education column with random degrees 
degree_options = ['High School Diploma', 'Bachelor of Science', 'Bachelor of Arts', 'Master of Science', 'Master of Arts', 'PhD in Computer Science']
probabilities_degree = [0.03, 0.60, 0.15, 0.15, 0.05, 0.02]
degrees = np.random.choice(degree_options, size=len(candidate_df), p=probabilities_degree)
# Prefix each university with its sampled degree; candidates who draw
# 'High School Diploma' keep only that value (no university name).
candidate_df['Education'] = [
    'High School Diploma' if 'High School Diploma' in (degree, university) else f"{degree}, {university}"
    for degree, university in zip(degrees, candidate_df['Education'])
]
candidate_df

Unnamed: 0,Username,Name,Address,Education,Email
0,pashlin0,Phyllis Ashlin,"72188 Welch Circle, Miami, Florida 33109","PhD in Computer Science, University of Victoria",pashlin0@skype.com
1,arowly1,Angelle Rowly,"5227 Elka Junction, Denver, Colorado 80022","Bachelor of Arts, Universidad Capitain General...",arowly1@hostgator.com
2,kkumaar2,Kim Kumaar,"25 Magdeline Trail, Chicago, Illinois 60007","Bachelor of Science, Sri Padmavati Women's Uni...",kkumaar2@ihg.com
3,kmalthouse3,Kakalina Malthouse,"884 Donald Drive, Atlanta, Georgia 30303","Bachelor of Arts, State University of New York...",kmalthouse3@comsenz.com
4,ldrew4,Lorianna Drew,"688 Myrtle Terrace, Miami, Florida 33101","Bachelor of Science, Beni Suef University",ldrew4@yale.edu
...,...,...,...,...,...
295,pduquesnay87,Pancho Duquesnay,"79868 Schurz Place, Los Angeles, California 90003","Bachelor of Science, New Era University",pduquesnay87@europa.eu
296,lgaydon88,Lonnard Gaydon,"158 Ridge Oak Avenue, Chicago, Illinois 60018","Bachelor of Arts, Copenhagen Business School",lgaydon88@php.net
297,dchristie89,Des Christie,"80412 Grim Pass, Los Angeles, California 90003","Bachelor of Arts, Universitas Brawijaya",dchristie89@discovery.com
298,hmorse8a,Hillary Morse,"31 Riverside Alley, Chicago, Illinois 60018","Bachelor of Science, Uzbek State World Languag...",hmorse8a@blogspot.com


In [126]:
# Populate Skills column randomly using skills used to populate job postings
skills_vals = [get_random_skills(job_skills) for i in range(len(candidate_df))]
candidate_df = candidate_df.assign(Skills=skills_vals)
candidate_df

Unnamed: 0,Username,Name,Address,Education,Email,Skills
0,pashlin0,Phyllis Ashlin,"72188 Welch Circle, Miami, Florida 33109","PhD in Computer Science, University of Victoria",pashlin0@skype.com,"Urdu, PostgreSQL, Databases, German, Hootsuite..."
1,arowly1,Angelle Rowly,"5227 Elka Junction, Denver, Colorado 80022","Bachelor of Arts, Universidad Capitain General...",arowly1@hostgator.com,"Git, Julia, JavaScript, SEO/SEM, Project Manag..."
2,kkumaar2,Kim Kumaar,"25 Magdeline Trail, Chicago, Illinois 60007","Bachelor of Science, Sri Padmavati Women's Uni...",kkumaar2@ihg.com,"SQL, Azure, Customer Support, Google Cloud Pla..."
3,kmalthouse3,Kakalina Malthouse,"884 Donald Drive, Atlanta, Georgia 30303","Bachelor of Arts, State University of New York...",kmalthouse3@comsenz.com,"Flask, SEO/SEM, UX/UI Design, Python, Kotlin, ..."
4,ldrew4,Lorianna Drew,"688 Myrtle Terrace, Miami, Florida 33101","Bachelor of Science, Beni Suef University",ldrew4@yale.edu,"Julia, PostgreSQL, Google Cloud Platform, Vue...."
...,...,...,...,...,...,...
295,pduquesnay87,Pancho Duquesnay,"79868 Schurz Place, Los Angeles, California 90003","Bachelor of Science, New Era University",pduquesnay87@europa.eu,"Project Management, React, AI/ML frameworks, M..."
296,lgaydon88,Lonnard Gaydon,"158 Ridge Oak Avenue, Chicago, Illinois 60018","Bachelor of Arts, Copenhagen Business School",lgaydon88@php.net,"Backend Stack, Google Cloud Platform, Scrum, K..."
297,dchristie89,Des Christie,"80412 Grim Pass, Los Angeles, California 90003","Bachelor of Arts, Universitas Brawijaya",dchristie89@discovery.com,"SQL, OOP, Adobe Creative Suite, Julia, UX/UI D..."
298,hmorse8a,Hillary Morse,"31 Riverside Alley, Chicago, Illinois 60018","Bachelor of Science, Uzbek State World Languag...",hmorse8a@blogspot.com,"Backend Stack, Adobe Creative Suite, Jupyter, ..."


In [129]:
# Generate plaintext resume for each candidate tuple based on their other tuple attributes.
def generate_unique_resume(row):
    name = row['Name']
    address = row['Address']
    email = row['Email']
    education = row['Education']
    skills = row['Skills']
    skill_list = skills.split(', ')
    work_experience_1 = f"Developed a {skill_list[0]}-based analytics tool that increased data processing efficiency by 25%."
    work_experience_2 = f"Led a team of 25 individuals in a {skill_list[1]}-based project that increased company revenue by 5%."
    work_experience_3 = f"Coordinated a {skill_list[2]}-based study that led to the company-wide adoption of a policy improving workplace productivity by 30%."
    resume = f"""
    {name}
    {address}
    Email: {email}

    Objective:
    Dedicated professional with a {education}. Skilled in {skills}. Eager to contribute to a dynamic team and further develop my expertise.

    Education:
    {education}

    Skills:
    - {skills}

    Experience:
    - {work_experience_1}
    - {work_experience_2}
    - {work_experience_3}

    References:
    Available upon request.
    """
    return resume.strip()

candidate_df['Resume'] = candidate_df.apply(generate_unique_resume, axis=1)
candidate_df

Unnamed: 0,Username,Name,Address,Education,Email,Skills,Resume
0,pashlin0,Phyllis Ashlin,"72188 Welch Circle, Miami, Florida 33109","PhD in Computer Science, University of Victoria",pashlin0@skype.com,"Urdu, PostgreSQL, Databases, German, Hootsuite...","Phyllis Ashlin\n 72188 Welch Circle, Miami,..."
1,arowly1,Angelle Rowly,"5227 Elka Junction, Denver, Colorado 80022","Bachelor of Arts, Universidad Capitain General...",arowly1@hostgator.com,"Git, Julia, JavaScript, SEO/SEM, Project Manag...","Angelle Rowly\n 5227 Elka Junction, Denver,..."
2,kkumaar2,Kim Kumaar,"25 Magdeline Trail, Chicago, Illinois 60007","Bachelor of Science, Sri Padmavati Women's Uni...",kkumaar2@ihg.com,"SQL, Azure, Customer Support, Google Cloud Pla...","Kim Kumaar\n 25 Magdeline Trail, Chicago, I..."
3,kmalthouse3,Kakalina Malthouse,"884 Donald Drive, Atlanta, Georgia 30303","Bachelor of Arts, State University of New York...",kmalthouse3@comsenz.com,"Flask, SEO/SEM, UX/UI Design, Python, Kotlin, ...","Kakalina Malthouse\n 884 Donald Drive, Atla..."
4,ldrew4,Lorianna Drew,"688 Myrtle Terrace, Miami, Florida 33101","Bachelor of Science, Beni Suef University",ldrew4@yale.edu,"Julia, PostgreSQL, Google Cloud Platform, Vue....","Lorianna Drew\n 688 Myrtle Terrace, Miami, ..."
...,...,...,...,...,...,...,...
295,pduquesnay87,Pancho Duquesnay,"79868 Schurz Place, Los Angeles, California 90003","Bachelor of Science, New Era University",pduquesnay87@europa.eu,"Project Management, React, AI/ML frameworks, M...","Pancho Duquesnay\n 79868 Schurz Place, Los ..."
296,lgaydon88,Lonnard Gaydon,"158 Ridge Oak Avenue, Chicago, Illinois 60018","Bachelor of Arts, Copenhagen Business School",lgaydon88@php.net,"Backend Stack, Google Cloud Platform, Scrum, K...","Lonnard Gaydon\n 158 Ridge Oak Avenue, Chic..."
297,dchristie89,Des Christie,"80412 Grim Pass, Los Angeles, California 90003","Bachelor of Arts, Universitas Brawijaya",dchristie89@discovery.com,"SQL, OOP, Adobe Creative Suite, Julia, UX/UI D...","Des Christie\n 80412 Grim Pass, Los Angeles..."
298,hmorse8a,Hillary Morse,"31 Riverside Alley, Chicago, Illinois 60018","Bachelor of Science, Uzbek State World Languag...",hmorse8a@blogspot.com,"Backend Stack, Adobe Creative Suite, Jupyter, ...","Hillary Morse\n 31 Riverside Alley, Chicago..."
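The resume template above indexes the first three skills directly and keeps the function body's indentation inside the generated text (the leading spaces after each `\n` in the Resume column). The following is a minimal sketch of one way to handle both cases. The helper name `build_resume`, the shortened experience lines, and the fallback label `"general"` are illustrative assumptions, not part of the original notebook.

```python
import textwrap

def build_resume(row, fallback="general"):
    """Hypothetical variant of generate_unique_resume (not from the notebook)."""
    skills = [s.strip() for s in row["Skills"].split(",") if s.strip()]
    # Pad the list so that taking the first three skills never raises IndexError.
    first_three = (skills + [fallback] * 3)[:3]
    # dedent removes the indentation that the source code adds to every line.
    body = textwrap.dedent(f"""\
        {row['Name']}
        {row['Address']}
        Email: {row['Email']}

        Skills:
        - {', '.join(skills)}

        Experience:
        - Developed a {first_three[0]}-based analytics tool.
        - Led a {first_three[1]}-based project.
        - Coordinated a {first_three[2]}-based study.
        """)
    return body.strip()

# Toy row with only two skills to exercise the fallback.
sample = {
    "Name": "Phyllis Ashlin",
    "Address": "72188 Welch Circle, Miami, Florida 33109",
    "Email": "pashlin0@skype.com",
    "Skills": "Urdu, PostgreSQL",
}
resume_text = build_resume(sample)
print(resume_text)
```

With a DataFrame, the same function could be applied row-wise via `candidate_df.apply(build_resume, axis=1)`, since each row supports the same key lookups as the dictionary used here.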


In [130]:
# Populate PastApplications by drawing a random integer between 0 and 50 (inclusive) for each candidate
candidate_df = candidate_df.assign(PastApplications=[random.randint(0, 50) for _ in range(len(candidate_df))])
candidate_df

Unnamed: 0,Username,Name,Address,Education,Email,Skills,Resume,PastApplications
0,pashlin0,Phyllis Ashlin,"72188 Welch Circle, Miami, Florida 33109","PhD in Computer Science, University of Victoria",pashlin0@skype.com,"Urdu, PostgreSQL, Databases, German, Hootsuite...","Phyllis Ashlin\n 72188 Welch Circle, Miami,...",7
1,arowly1,Angelle Rowly,"5227 Elka Junction, Denver, Colorado 80022","Bachelor of Arts, Universidad Capitain General...",arowly1@hostgator.com,"Git, Julia, JavaScript, SEO/SEM, Project Manag...","Angelle Rowly\n 5227 Elka Junction, Denver,...",46
2,kkumaar2,Kim Kumaar,"25 Magdeline Trail, Chicago, Illinois 60007","Bachelor of Science, Sri Padmavati Women's Uni...",kkumaar2@ihg.com,"SQL, Azure, Customer Support, Google Cloud Pla...","Kim Kumaar\n 25 Magdeline Trail, Chicago, I...",27
3,kmalthouse3,Kakalina Malthouse,"884 Donald Drive, Atlanta, Georgia 30303","Bachelor of Arts, State University of New York...",kmalthouse3@comsenz.com,"Flask, SEO/SEM, UX/UI Design, Python, Kotlin, ...","Kakalina Malthouse\n 884 Donald Drive, Atla...",48
4,ldrew4,Lorianna Drew,"688 Myrtle Terrace, Miami, Florida 33101","Bachelor of Science, Beni Suef University",ldrew4@yale.edu,"Julia, PostgreSQL, Google Cloud Platform, Vue....","Lorianna Drew\n 688 Myrtle Terrace, Miami, ...",10
...,...,...,...,...,...,...,...,...
295,pduquesnay87,Pancho Duquesnay,"79868 Schurz Place, Los Angeles, California 90003","Bachelor of Science, New Era University",pduquesnay87@europa.eu,"Project Management, React, AI/ML frameworks, M...","Pancho Duquesnay\n 79868 Schurz Place, Los ...",10
296,lgaydon88,Lonnard Gaydon,"158 Ridge Oak Avenue, Chicago, Illinois 60018","Bachelor of Arts, Copenhagen Business School",lgaydon88@php.net,"Backend Stack, Google Cloud Platform, Scrum, K...","Lonnard Gaydon\n 158 Ridge Oak Avenue, Chic...",28
297,dchristie89,Des Christie,"80412 Grim Pass, Los Angeles, California 90003","Bachelor of Arts, Universitas Brawijaya",dchristie89@discovery.com,"SQL, OOP, Adobe Creative Suite, Julia, UX/UI D...","Des Christie\n 80412 Grim Pass, Los Angeles...",38
298,hmorse8a,Hillary Morse,"31 Riverside Alley, Chicago, Illinois 60018","Bachelor of Science, Uzbek State World Languag...",hmorse8a@blogspot.com,"Backend Stack, Adobe Creative Suite, Jupyter, ...","Hillary Morse\n 31 Riverside Alley, Chicago...",7
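Because the PastApplications values are drawn at random, re-running the notebook produces a different Candidate.csv each time. If reproducible output is wanted, one option is a seeded generator, sketched below. The seed value 42 and the stand-in row count are arbitrary assumptions for illustration.

```python
import random

# A dedicated, seeded generator keeps the draws repeatable without
# affecting other uses of the global random module.
rng = random.Random(42)  # 42 is an arbitrary seed chosen for this sketch

n_candidates = 5  # stand-in for len(candidate_df)
past_applications = [rng.randint(0, 50) for _ in range(n_candidates)]

# Re-creating the generator with the same seed yields the same sequence.
rng_again = random.Random(42)
repeat = [rng_again.randint(0, 50) for _ in range(n_candidates)]
print(past_applications == repeat)
```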


In [131]:
candidate_df.to_csv('data/Candidate.csv', index=False)

### Job Portal
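- The Job Portal relation has one portal per Company represented in our data, with a randomly generated Portal_ID and a Name corresponding to the Company name.
- The Mockaroo 'Company' column is overwritten with the real company names and renamed to 'Name'.
- The Company.Company_ID foreign key is added by mapping each Name to its Company_ID.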

In [191]:
portal_df = pd.read_csv('data/Mockaroo/Mockaroo-Job_Portal.csv')
portal_df

Unnamed: 0,Portal_ID,Company
0,01HQY4F1HC1XGPARVQC09JMXXB,Skinte
1,01HQY4F1HCVSC3ECXQQ0181Z9Y,Topicware
2,01HQY4F1HDQFA6E50X5N6AB7P7,Yakidoo
3,01HQY4F1HDDE43MHBEJD2ZG5AB,Innotype
4,01HQY4F1HDZAG3B5F4D146GEXR,Meeveo
...,...,...
306,01HQY4F1KTJGBN9QB2FVECCD3F,Tekfly
307,01HQY4F1KTTVY1YM1ZAMPGP66N,Mita
308,01HQY4F1KTD55SYC08WZ3410WP,Browseblab
309,01HQY4F1KTGG0WCW2SH8FCMM4G,Podcat


In [192]:
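# Replace the Mockaroo placeholder names with the real company names, then rename the column to 'Name'.
# Assumes all_company_names has exactly one entry per row of portal_df.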
portal_df['Company'] = all_company_names
portal_df = portal_df.rename(columns={'Company': 'Name'})
portal_df

Unnamed: 0,Portal_ID,Name
0,01HQY4F1HC1XGPARVQC09JMXXB,Hudl
1,01HQY4F1HCVSC3ECXQQ0181Z9Y,Alarm.com
2,01HQY4F1HDQFA6E50X5N6AB7P7,Forbes
3,01HQY4F1HDDE43MHBEJD2ZG5AB,Konrad Group
4,01HQY4F1HDZAG3B5F4D146GEXR,Ascend Analytics
...,...,...
306,01HQY4F1KTJGBN9QB2FVECCD3F,Zynga
307,01HQY4F1KTTVY1YM1ZAMPGP66N,Workato
308,01HQY4F1KTD55SYC08WZ3410WP,Wisk
309,01HQY4F1KTGG0WCW2SH8FCMM4G,Moveworks


In [193]:
# Add the Company_ID foreign key for each row by looking up the portal's Name
company_id_mapping = company_df.set_index('Name')['Company_ID'].to_dict()
portal_df['Company_ID'] = portal_df['Name'].map(company_id_mapping)
portal_df

Unnamed: 0,Portal_ID,Name,Company_ID
0,01HQY4F1HC1XGPARVQC09JMXXB,Hudl,01HQXAVCMWEXEJYR1CVK9QPZQ7
1,01HQY4F1HCVSC3ECXQQ0181Z9Y,Alarm.com,01HQXAVCMXRWCNSE1XGC88Z5KR
2,01HQY4F1HDQFA6E50X5N6AB7P7,Forbes,01HQXAVCMXTRC4Y81YV23C1ZQ8
3,01HQY4F1HDDE43MHBEJD2ZG5AB,Konrad Group,01HQXAVCMXCRNPED9KR90QAGJN
4,01HQY4F1HDZAG3B5F4D146GEXR,Ascend Analytics,01HQXAVCMXJ11TZ0EJQNZST6SP
...,...,...,...
306,01HQY4F1KTJGBN9QB2FVECCD3F,Zynga,01HQXAVCQZT7RE42HZ1RT90XT3
307,01HQY4F1KTTVY1YM1ZAMPGP66N,Workato,01HQXAVCQZYF5N9HVP8QNCXW97
308,01HQY4F1KTD55SYC08WZ3410WP,Wisk,01HQXAVCR0PX1DP6TJNR6FAJ4M
309,01HQY4F1KTGG0WCW2SH8FCMM4G,Moveworks,01HQXAVCR0113ZBMBE6D7ARSM2
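`Series.map` with a dictionary returns NaN for any name that is missing from the mapping, so a portal whose Name does not exactly match a Company name would silently end up without a foreign key. A minimal sketch of a check for that case follows, using made-up toy frames (`company_toy`, `portal_toy`, and the IDs in them are illustrative only).

```python
import pandas as pd

# Toy stand-ins for company_df and portal_df.
company_toy = pd.DataFrame({"Name": ["Hudl", "Forbes"], "Company_ID": ["C1", "C2"]})
portal_toy = pd.DataFrame({
    "Portal_ID": ["P1", "P2", "P3"],
    "Name": ["Hudl", "Forbes ", "Zynga"],  # trailing space and unknown company
})

mapping = company_toy.set_index("Name")["Company_ID"].to_dict()
portal_toy["Company_ID"] = portal_toy["Name"].map(mapping)

# Names that did not resolve to a Company_ID.
unmatched = portal_toy.loc[portal_toy["Company_ID"].isna(), "Name"].tolist()
print(unmatched)
```

Running the same `isna()` check on the real `portal_df` before writing Job_Portal.csv would surface mismatches such as stray whitespace in a company name.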


In [194]:
portal_df.to_csv('data/Job_Portal.csv', index=False)

### Review

In [195]:
review_mock_df = pd.read_csv('data/Mockaroo/Mockaroo-Review.csv')
review_mock_df

Unnamed: 0,Review_ID
0,01HR06F991G1R0A01DSJP487NP
1,01HR06F992CEC2M2VE3DVXHRNX
2,01HR06F992C63BNDD9K5ANB3QQ
3,01HR06F9929KVQZWKPE26HXDJ2
4,01HR06F992YFEWZX5058TFCHDH
...,...
95,01HR06F99JP13J3AJQT44RBJAM
96,01HR06F99JZJTT02SBT5GW4NC8
97,01HR06F99J11NF9GPHBPD87BHX
98,01HR06F99JEEPWMDGZ3GPWD7X1


In [117]:
# 50 sample employee reviews generated with ChatGPT.
# The prompt is recorded here for reference; this cell does not send it to an API.
prompt = "Generate 50 fictional reviews that are 100-300 words in length written by fictional employees (don't mention their names or any identifiers) for mock data for a platform where employees can post about their experience working at a company. If you want to mention the company name in some of them, just write COMPANY."

employee_reviews = [
    "Working at COMPANY has been a transformative experience. The supportive team and innovative culture have contributed significantly to my professional growth.",
    "I've been with COMPANY for two years, and it's been a fantastic journey. The work-life balance and management's care for employee well-being are commendable.",
    "COMPANY has a dynamic environment that keeps you on your toes. The collaborative atmosphere and supportive colleagues make it a great place to work.",
    "As a recent graduate, COMPANY provided a great start to my career. The mentorship program and exposure to cutting-edge technologies have been invaluable.",
    "The diversity and inclusion efforts at COMPANY are commendable. It's refreshing to work in an environment where everyone's ideas are valued.",
    "I've been with COMPANY for over five years, and it's been rewarding. The company's clear vision and the leadership team's transparency have created a collaborative work environment.",
    "COMPANY offers a unique blend of creativity and technical challenges. Working on a variety of projects has pushed the boundaries of my skills.",
    "One thing I appreciate about COMPANY is the focus on employee development. The ample training programs and workshops have helped me enhance my skills.",
    "The work culture at COMPANY is unlike any other I've experienced. The emphasis on teamwork and collective success has fostered a supportive atmosphere.",
    "COMPANY is a place where innovation thrives. The entrepreneurial spirit is encouraged, and employees are empowered to take ownership of their projects.",
    "I've been part of the COMPANY team for three years, and it's been fulfilling. The company values hard work and dedication, which is reflected in how employees are treated.",
    "The sense of community at COMPANY is strong. Regular team-building events and social activities have helped build a tight-knit and friendly work environment.",
    "Working at COMPANY has been an eye-opening experience. Being part of the company's journey in industry innovations has been incredibly rewarding.",
    "The commitment to sustainability and social responsibility at COMPANY is impressive. Working for a company that focuses on making a positive impact is inspiring.",
    "COMPANY has a vibrant work environment. The open office layout and modern amenities foster creativity and collaboration.",
    "I've had the pleasure of working at COMPANY for four years, and it's been remarkable. The company's growth mindset and focus on innovation have opened up numerous opportunities.",
    "The leadership team at COMPANY is exceptional. They lead by example and are always available to provide guidance and support.",
    "COMPANY's approach to problem-solving and project management is top-notch. The emphasis on data-driven decisions has led to the successful execution of complex projects.",
    "As a software engineer at COMPANY, I've worked with a talented team on cutting-edge projects. The company's commitment to using the latest technologies has allowed me to grow my technical skills.",
    "The inclusive culture at COMPANY has made it a wonderful place to work. The company celebrates diversity and fosters an environment where everyone feels welcome.",
    "COMPANY's focus on customer satisfaction is evident in everything we do. It's rewarding to be part of a team that prioritizes delivering high-quality products and services.",
    "While COMPANY has provided some growth opportunities, I've found the pace of career advancement to be slower than expected. More clarity on promotion paths would be helpful.",
    "The work at COMPANY can be quite demanding, with tight deadlines and high expectations. A better balance between challenging projects and manageable workloads would be appreciated.",
    "While COMPANY has a strong focus on innovation, I've noticed a resistance to change in some departments. Encouraging more openness to new ideas could enhance our adaptability.",
    "I've experienced some communication issues at COMPANY, where important information is not always shared promptly. Improving internal communication channels could increase efficiency.",
    "Working at COMPANY has been an average experience. The work is routine, and there's little room for creativity or innovation.",
    "I've been with COMPANY for a year now, and it's been underwhelming. The lack of clear direction and communication from management is frustrating.",
    "COMPANY's work environment is quite stressful. The constant pressure to meet unrealistic deadlines has taken a toll on my work-life balance.",
    "As an employee at COMPANY, I've found the opportunities for career advancement to be limited. It's disheartening to see little recognition for hard work.",
    "The company culture at COMPANY is not as inclusive as I had hoped. There's a noticeable lack of diversity in leadership positions.",
    "I've experienced a lack of support from my team at COMPANY. Collaboration is rare, and it often feels like everyone is working in silos.",
    "The workload at COMPANY is overwhelming. There's an expectation to be available 24/7, which is unsustainable in the long term.",
    "I've noticed a high turnover rate at COMPANY, which is concerning. It seems like many talented individuals are leaving due to dissatisfaction.",
    "The training and development opportunities at COMPANY are inadequate. There's a clear need for more investment in employee growth.",
    "The office politics at COMPANY can be draining. It often feels like progress is more about who you know rather than what you know.",
    "I've found the feedback culture at COMPANY to be lacking. Constructive criticism is rare, and it's challenging to know where you stand.",
    "The benefits package at COMPANY is subpar. It's disappointing to see minimal effort put into employee well-being and perks.",
    "I've encountered a lack of transparency at COMPANY. Decisions are made behind closed doors, leaving employees in the dark.",
    "The innovation at COMPANY is stagnant. There's a resistance to new ideas, which hinders growth and progress.",
    "I've felt undervalued at COMPANY. Despite putting in extra effort, there's little acknowledgment or reward.",
    "The work-life balance at COMPANY is non-existent. The expectation to prioritize work over personal life is unreasonable.",
    "I've noticed a lack of ethical practices at COMPANY. There are instances where profits are prioritized over integrity.",
    "The communication at COMPANY is poor. Important information is often not disseminated effectively, leading to confusion and errors.",
    "I've experienced a toxic work environment at COMPANY. There's a culture of blame and negativity that's demoralizing.",
    "The leadership at COMPANY is disconnected from the employees. There's a lack of understanding of the challenges faced by the team.",
    "I've observed a lack of accountability at COMPANY. When things go wrong, there's a tendency to pass the buck rather than address the issue.",
    "The focus on short-term gains at COMPANY is frustrating. There's a lack of long-term vision and planning.",
    "I've found the performance evaluation process at COMPANY to be unfair. It seems biased and doesn't accurately reflect contributions.",
    "The work environment at COMPANY is uninspiring. The office is outdated, and there's a lack of resources to do our best work.",
    "I've felt isolated at COMPANY. There's a lack of camaraderie and team spirit, which makes it a lonely place to work."
]

In [118]:
# 50 sample candidate (interview) reviews generated with ChatGPT.
# The prompt is recorded here for reference; this cell does not send it to an API.
prompt = "Generate 50 fictional reviews that are 100-300 words in length written by fictional job candidates (don't mention their names or any identifiers) for mock data for a platform where job candidates can post about their experience interviewing at a company. If you want to mention the company name in some of them, just write COMPANY."

interview_reviews = [
    "The interview process at COMPANY was well-organized and professional. The recruiters were communicative and provided clear instructions at each step.",
    "I had a positive experience interviewing with COMPANY. The interviewers were friendly and asked relevant questions that allowed me to showcase my skills.",
    "My interview at COMPANY was a bit intimidating. The questions were challenging, and I felt underprepared for some of the technical aspects.",
    "COMPANY's interview process was thorough and fair. I appreciated the opportunity to meet with multiple team members and get a sense of the company culture.",
    "I found the interview experience at COMPANY to be quite stressful. The expectations were not clearly communicated, and the process felt disorganized.",
    "Interviewing with COMPANY was a great learning experience. The feedback provided after the interview was constructive and helpful for my future interviews.",
    "The interview process at COMPANY felt very impersonal. It was difficult to connect with the interviewers, and I left feeling unsure about the company.",
    "I was impressed by the efficiency of COMPANY's interview process. The use of practical assessments gave me a good idea of what working there would be like.",
    "My interview experience at COMPANY was mixed. While the initial stages were smooth, the final round felt rushed and left me with unanswered questions.",
    "I appreciated the transparency of COMPANY's interview process. The clear communication of expectations and timelines made the experience less stressful.",
    "The interview at COMPANY was challenging but fair. The questions tested my problem-solving skills and ability to think on my feet.",
    "I found the interviewers at COMPANY to be very approachable and knowledgeable. They made the interview feel more like a conversation than an interrogation.",
    "The virtual interview process with COMPANY had some technical difficulties, which was frustrating. However, the interviewers were understanding and accommodating.",
    "Interviewing at COMPANY was an eye-opening experience. The focus on cultural fit and values alignment was refreshing and made me more interested in the company.",
    "I felt that the interview process at COMPANY lacked diversity. The panel did not seem representative of the company's stated commitment to inclusivity.",
    "The interview experience at COMPANY was enjoyable. The relaxed atmosphere and engaging questions made me feel comfortable and confident.",
    "I was disappointed by the lack of feedback after my interview with COMPANY. It would have been helpful to know where I could improve for future opportunities.",
    "The interview process at COMPANY was lengthy but worthwhile. Each stage was designed to assess different skills, which I found to be a comprehensive approach.",
    "I was surprised by the informal nature of the interview at COMPANY. While it was a pleasant experience, I was unsure of how to gauge my performance.",
    "The interviewers at COMPANY were very professional and provided clear explanations of the role and expectations. I left the interview with a positive impression of the company.",
    "I found the interview process at COMPANY to be somewhat disorganized. The scheduling was inconsistent, and there was a lack of communication between stages.",
    "Interviewing with COMPANY was a confidence-boosting experience. The positive reinforcement and constructive feedback from the interviewers were encouraging.",
    "The group interview at COMPANY was a unique experience. It was interesting to see how different candidates approached the same problems.",
    "I appreciated the focus on work-life balance during my interview with COMPANY. The questions about managing stress and maintaining productivity were insightful.",
    "The interview process at COMPANY was intense. The technical assessments were challenging, and I had to thoroughly prepare to meet their high standards.",
    "The interview process at COMPANY was average. It was a standard procedure with no standout moments or particularly engaging interactions.",
    "I found the interview at COMPANY to be disorganized. The interviewers seemed unprepared, and there were scheduling mix-ups.",
    "My experience interviewing with COMPANY was stressful. The interviewers were cold, and the atmosphere was intimidating.",
    "The interview at COMPANY felt one-sided. It was more of an interrogation than a conversation, and I didn't get a chance to ask questions.",
    "I was unimpressed with the interview process at COMPANY. It lacked structure, and there was no clear explanation of the next steps.",
    "The interviewers at COMPANY seemed disinterested. It felt like they had already made up their minds, which was discouraging.",
    "I had a negative experience interviewing at COMPANY. The questions were irrelevant to the role, and the process felt rushed.",
    "The virtual interview with COMPANY had technical issues. It was frustrating, and the interviewers didn't handle it well.",
    "I felt unwelcome during my interview at COMPANY. The interviewers were dismissive, and there was a lack of warmth or friendliness.",
    "The interview process at COMPANY was excessively long. It dragged on with too many rounds, which felt unnecessary.",
    "I left the interview at COMPANY feeling uncertain. There was a lack of clarity about the role and the company's expectations.",
    "The feedback after my interview with COMPANY was vague. It didn't provide any useful insights or areas for improvement.",
    "I found the interviewers at COMPANY to be arrogant. They talked down to me, which was off-putting and unprofessional.",
    "The interview at COMPANY lacked diversity. All the interviewers were from similar backgrounds, which raised concerns about inclusivity.",
    "My interview experience at COMPANY was forgettable. It was a standard process with nothing that made the company stand out.",
    "I felt rushed during my interview at COMPANY. The interviewers seemed like they were in a hurry, which made it hard to connect.",
    "The questions asked during the interview at COMPANY were generic. They didn't allow me to showcase my unique skills or experiences.",
    "I was disappointed by the lack of follow-up after my interview with COMPANY. I had to reach out multiple times to get any response.",
    "The interviewers at COMPANY seemed distracted. It was as if they had more important things to do, which was disheartening.",
    "I had a negative impression of COMPANY after the interview. The company culture seemed rigid and unwelcoming.",
    "The interview process at COMPANY felt impersonal. It was hard to get a sense of the company's values or what it would be like to work there.",
    "I was put off by the aggressive questioning during my interview at COMPANY. It felt more like an interrogation than an opportunity to discuss my qualifications.",
    "The lack of enthusiasm from the interviewers at COMPANY was noticeable. It made me question whether I would want to work in such an environment.",
    "I found the interview at COMPANY to be uninspiring. The questions were predictable, and there was no opportunity for meaningful dialogue.",
    "My interview experience at COMPANY was disappointing. The overall vibe was cold, and it didn't leave me with a positive impression of the company."
]

In [196]:
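# Substitute a real, randomly chosen company name wherever COMPANY appears in a review.
# The employee review and interview review at the same index receive the same company.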
def replace_company_name_review(review_1, review_2):
    company_name = random.choice(all_company_names).strip(' ')
    review_1 = review_1.replace('COMPANY', company_name)
    review_2 = review_2.replace('COMPANY', company_name)
    return review_1, review_2

for i in range(len(employee_reviews)):
    review_1 = employee_reviews[i]
    review_2 = interview_reviews[i]
    review_1, review_2 = replace_company_name_review(review_1, review_2)
    employee_reviews[i] = review_1
    interview_reviews[i] = review_2
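The loop above rewrites both lists in place and indexes `interview_reviews` by the length of `employee_reviews`. A sketch of an equivalent, non-mutating version is shown below. The function name `fill_company`, the `seed` parameter, and the toy inputs are assumptions made for illustration.

```python
import random

def fill_company(employee_templates, interview_templates, company_names, seed=0):
    """Return new lists with COMPANY replaced; paired reviews share one company."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    filled_employee, filled_interview = [], []
    # zip stops at the shorter list, so unequal lengths cannot raise IndexError.
    for emp, intr in zip(employee_templates, interview_templates):
        company = rng.choice(company_names).strip()
        filled_employee.append(emp.replace("COMPANY", company))
        filled_interview.append(intr.replace("COMPANY", company))
    return filled_employee, filled_interview

emp_out, int_out = fill_company(
    ["Working at COMPANY has been great."],
    ["The interview at COMPANY was fair."],
    ["Hudl", "Forbes "],
)
print(emp_out, int_out)
```

Returning new lists leaves the original templates untouched, so the cell can be re-run and still find the COMPANY placeholder.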

In [201]:
# Build the Review rows. Employee reviews fill Company_Feedback and candidate reviews fill Interview_Feedback.
# Each review is attributed to a random existing Employee_ID or candidate Username (sampled with replacement).
employee_ids = employee_df['Employee_ID'].tolist()
candidate_usernames = candidate_df['Username'].tolist()

employee_reviews_df = pd.DataFrame({
    'Interview_Feedback': [None] * len(employee_reviews),
    'Company_Feedback': employee_reviews,
    'Employee_ID': [random.choice(employee_ids) for _ in range(len(employee_reviews))],
    'Candidate_Username': [None] * len(employee_reviews)
})

interview_reviews_df = pd.DataFrame({
    'Interview_Feedback': interview_reviews,
    'Company_Feedback': [None] * len(interview_reviews),
    'Employee_ID': [None] * len(interview_reviews),
    'Candidate_Username': [random.choice(candidate_usernames) for _ in range(len(interview_reviews))]
})

reviews_df = pd.concat([employee_reviews_df, interview_reviews_df], ignore_index=True)
reviews_df

Unnamed: 0,Interview_Feedback,Company_Feedback,Employee_ID,Candidate_Username
0,,Working at Intel has been a transformative exp...,01HQ9SN572JTH69FXWH71E80Q7,
1,,"I've been with NVIDIA for two years, and it's ...",01HQ9SN566QNWE4P10G3WT89D4,
2,,Chamberlain Group has a dynamic environment th...,01HQ9SN566QNWE4P10G3WT89D4,
3,,"As a recent graduate, Similarweb provided a gr...",01HQ9SN55Q7MXX3X7BBCST4QK7,
4,,The diversity and inclusion efforts at Exiger ...,01HQ9SN560JW8E1Y07P9VRYNPM,
...,...,...,...,...
95,The interview process at Skyways felt imperson...,,,eschonfeld61
96,I was put off by the aggressive questioning du...,,,wloblie3e
97,The lack of enthusiasm from the interviewers a...,,,gribeiro76
98,I found the interview at Hex Technologies to b...,,,jmoncreiffe14


In [202]:
# Attach the Mockaroo Review_IDs column-wise to build the final Review dataframe
review_df = pd.concat([review_mock_df.reset_index(drop=True), reviews_df], axis=1)
review_df

Unnamed: 0,Review_ID,Interview_Feedback,Company_Feedback,Employee_ID,Candidate_Username
0,01HR06F991G1R0A01DSJP487NP,,Working at Intel has been a transformative exp...,01HQ9SN572JTH69FXWH71E80Q7,
1,01HR06F992CEC2M2VE3DVXHRNX,,"I've been with NVIDIA for two years, and it's ...",01HQ9SN566QNWE4P10G3WT89D4,
2,01HR06F992C63BNDD9K5ANB3QQ,,Chamberlain Group has a dynamic environment th...,01HQ9SN566QNWE4P10G3WT89D4,
3,01HR06F9929KVQZWKPE26HXDJ2,,"As a recent graduate, Similarweb provided a gr...",01HQ9SN55Q7MXX3X7BBCST4QK7,
4,01HR06F992YFEWZX5058TFCHDH,,The diversity and inclusion efforts at Exiger ...,01HQ9SN560JW8E1Y07P9VRYNPM,
...,...,...,...,...,...
95,01HR06F99JP13J3AJQT44RBJAM,The interview process at Skyways felt imperson...,,,eschonfeld61
96,01HR06F99JZJTT02SBT5GW4NC8,I was put off by the aggressive questioning du...,,,wloblie3e
97,01HR06F99J11NF9GPHBPD87BHX,The lack of enthusiasm from the interviewers a...,,,gribeiro76
98,01HR06F99JEEPWMDGZ3GPWD7X1,I found the interview at Hex Technologies to b...,,,jmoncreiffe14
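A column-wise `pd.concat` aligns rows by index, so the Review_ID frame and the reviews frame need the same length and index for every review to receive an ID. The sketch below shows a few sanity checks on toy data; the frames `ids_toy` and `reviews_toy` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for review_mock_df and reviews_df.
ids_toy = pd.DataFrame({"Review_ID": ["R1", "R2"]})
reviews_toy = pd.DataFrame({
    "Interview_Feedback": [None, "Fair process."],
    "Company_Feedback": ["Great team.", None],
    "Employee_ID": ["E1", None],
    "Candidate_Username": [None, "pashlin0"],
})

# Row counts must match, otherwise axis=1 concat pads with NaN.
same_length = len(ids_toy) == len(reviews_toy)

combined = pd.concat(
    [ids_toy.reset_index(drop=True), reviews_toy.reset_index(drop=True)],
    axis=1,
)

# Each review should have exactly one author: an employee or a candidate.
one_author = combined["Employee_ID"].notna() ^ combined["Candidate_Username"].notna()
print(same_length, one_author.all(), combined["Review_ID"].is_unique)
```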


In [203]:
review_df.to_csv('data/Review.csv', index=False)