## Task 1. 
Create one pandas dataframe that combines all the data scraped on May 22, 2022. Drop rows with missing job titles and/or job descriptions. Use `spacy` to tokenize all the job titles in the cleaned dataframe. For each job title, find all the nouns and all the adjectives and get their lowercased, lemmatized forms. Use the reformatted nouns to construct a vocabulary set for this dataframe. How many unique nouns are there? Construct another vocabulary set using the reformatted adjectives. How many unique adjectives are there? What different information do the nouns and the adjectives reveal about a specific job?

In [2]:
import os
import pandas as pd
import spacy
import re
import json
from tqdm import tqdm
nlp = spacy.load('en_core_web_sm')
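# Work from the directory that holds the scraped CSV files.
os.chdir('/Users/[editted]/Dropbox/work/compsoc/dataset/indeed_scraped_data/job_info_data')

# Load every file scraped on May 22, 2022 (file names contain "5222022").
all_df = []
for item in os.listdir():
    if "5222022" in item:
        df = pd.read_csv(item)
        all_df.append(df)

# Combine the files and drop rows missing a job title or a job description.
jobinfo = pd.concat(all_df).dropna(subset=['lnks_job_title', 'lnks_job_description']).reset_index(drop=True)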

In [3]:
vocab_noun_freq = {}
vocab_adj_freq = {}
for item in tqdm(jobinfo['lnks_job_title']):
    for token in nlp(item):
        if token.is_alpha:
            if token.pos_ =='NOUN':
                vocab_noun_freq[token.lemma_.lower()] = vocab_noun_freq.get(token.lemma_.lower(), 0) + 1
            elif token.pos_ =='ADJ':
                vocab_adj_freq[token.lemma_.lower()] = vocab_adj_freq.get(token.lemma_.lower(), 0) + 1
                
vocalist_noun = list(vocab_noun_freq.keys())
vocalist_adj = list(vocab_adj_freq.keys())
print('number of unique nouns in the job titles:', len(vocalist_noun))
print('number of unique adjectives in the job titles:', len(vocalist_adj))

100%|██████████████████████████████████████| 4803/4803 [00:10<00:00, 439.45it/s]

number of unique nouns in the job titles: 325
number of unique adjectives in the job titles: 77

In [6]:
print(vocalist_adj[0:10])
print(vocalist_noun[0:10])

# The adjective tokens describe the status or level of the job named in the title.
# For example, 'licensed', 'certified', and 'registered' indicate the qualification the job requires.
# The noun tokens name the specific duty or position, such as manager, consultant, and pilot.
# For instance, the job title "senior benefits specialist" has one adjective and two nouns.

['licensed', 'surgical', 'dairy', 'senior', 'full', 'certified', 'registered', 'new', 'financial', 'radiologic']
['contract', 'business', 'entry', 'purchasing', 'time', 'plant', 'manager', 'consultant', 'pilot', 'meat']
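
The counting step can be looked at separately from spaCy. The following is a minimal sketch, assuming the tagger has already produced (lemma, part-of-speech) pairs; the pairs below are written by hand for illustration and are not taken from the scraped data.

```python
from collections import Counter

# Hypothetical (lemma, POS) pairs for the title "Senior Benefits Specialist",
# hand-written to stand in for spaCy's output.
tagged_title = [("senior", "ADJ"), ("benefit", "NOUN"), ("specialist", "NOUN")]

# Count lemmas per part of speech; the keys of each Counter form the vocabulary.
noun_freq = Counter(lemma for lemma, pos in tagged_title if pos == "NOUN")
adj_freq = Counter(lemma for lemma, pos in tagged_title if pos == "ADJ")

print(sorted(noun_freq))  # noun vocabulary for this title
print(sorted(adj_freq))   # adjective vocabulary for this title
```

A `Counter` behaves like the `dict.get(key, 0) + 1` pattern used in the cell above, and `Counter.most_common()` would additionally list the most frequent lemmas first.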


## Task 2. 
Choose the first job title in your dataframe as the primary string. Use one-hot encoding as the word embedding method and find jobs in your cleaned dataframe that have similar nouns in the title as your primary string. 

In [4]:
import numpy as np
one_hot_encodings = []

for i in tqdm(range(len(jobinfo))):
    job_title = jobinfo.loc[i, 'lnks_job_title']
    token_indices = []

    for token in nlp(job_title):
        if token.is_alpha:
            if token.lemma_.lower() in vocalist_noun:
                token_index_in_vocab = vocalist_noun.index(token.lemma_.lower())
                token_indices.append(token_index_in_vocab)

    one_hot_encoding = np.zeros(len(vocalist_noun))
    for token_index in token_indices:
        one_hot_encoding[token_index] = 1

    one_hot_encodings.append(one_hot_encoding)

100%|██████████████████████████████████████| 4803/4803 [00:10<00:00, 441.82it/s]


In [5]:
np.array(one_hot_encodings).shape

(4803, 325)
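
Each row of this array marks which vocabulary nouns occur in a title, so a title with several nouns has several ones (strictly a multi-hot, bag-of-words vector). The sketch below shows the same encoding on a toy vocabulary, using a precomputed word-to-index dictionary instead of `vocalist_noun.index(...)`, which scans the list on every lookup. The vocabulary and the helper name `encode` are assumptions made for illustration.

```python
# Toy vocabulary (hypothetical); in the notebook this role is played by vocalist_noun.
vocab = ["technician", "controller", "maintenance", "nurse"]
index_of = {word: i for i, word in enumerate(vocab)}

def encode(lemmas):
    """Return a 0/1 vector marking which vocabulary words occur in lemmas."""
    vector = [0] * len(vocab)
    for lemma in lemmas:
        i = index_of.get(lemma)  # None for out-of-vocabulary lemmas
        if i is not None:
            vector[i] = 1
    return vector

print(encode(["maintenance", "controller", "technician"]))
print(encode(["registered", "nurse"]))
```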

In [6]:
from scipy.spatial.distance import cosine
title_a = jobinfo.loc[0, 'lnks_job_title']
similarity_values = []
for i in tqdm(range(1, len(jobinfo))):
    similarity_value = 1 - cosine(one_hot_encodings[0], one_hot_encodings[i])
    similarity_values.append(similarity_value)
similar_df = pd.DataFrame(columns=['job_title', 'similarity_value_with_one_hot'])
similar_df['job_title'] = jobinfo.loc[1:, 'lnks_job_title']
similar_df['similarity_value_with_one_hot'] = similarity_values

print('the primary job title:', title_a)
similar_df.sort_values(by='similarity_value_with_one_hot', ascending=False)[0:20]    

  dist = 1.0 - uv / np.sqrt(uu * vv)
100%|████████████████████████████████████| 4802/4802 [00:00<00:00, 62169.87it/s]

the primary job title: Maintenance Controller (A&P) Technician





| | job_title | similarity_value_with_one_hot |
|---|---|---|
| 4165 | Human Resources Generalist | 1.0 |
| 4471 | Maintenance Controller (A&P) Technician | 1.0 |
| 4379 | Referral Rep | 1.0 |
| 3785 | Coder - Hospital - Inpatient - FT - REMOTE | 1.0 |
| 4383 | Human Resources Generalist | 1.0 |
| 3783 | Respiratory Therapist | 1.0 |
| 1630 | Registered Nurse- LTCIS | 1.0 |
| 2163 | Paramedic - Offshore | 1.0 |
| 1119 | AUDITOR | 1.0 |
| 452 | Head, FCC Investigations, Americas | 1.0 |
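
The `dist = 1.0 - uv / np.sqrt(uu * vv)` line in the output above is the tail of a SciPy runtime warning. It most likely fires for titles whose encoding is all zeros (no vocabulary nouns), where cosine similarity is an undefined 0/0. The sketch below is a plain-Python stand-in for `scipy.spatial.distance.cosine`, not the notebook's code, and makes that case explicit by returning NaN; the toy vectors are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; NaN if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return float("nan")  # undefined: the title has no vocabulary nouns
    return dot / (norm_u * norm_v)

primary = [1, 1, 1, 0]  # toy encoding of a title with three vocabulary nouns
print(cosine_similarity(primary, [1, 1, 1, 0]))  # identical noun sets
print(cosine_similarity(primary, [1, 0, 0, 0]))  # one shared noun
print(cosine_similarity(primary, [0, 0, 0, 0]))  # no vocabulary nouns
```

Rows whose similarity is NaN can then be dropped with `DataFrame.dropna` before ranking, so that they are not mistaken for genuine matches or mismatches.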


## Task 3. 
Use spaCy's word vectors to repeat Task 2, then compare the results.

In [70]:
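title_a = jobinfo.loc[0, 'lnks_job_title']
title_tok = nlp(title_a)

# For each other title, collect the similarity between the primary title (as a Doc)
# and every token whose lemma is in the noun vocabulary.
# Note: en_core_web_sm ships without static word vectors, so similarity() falls back
# to context tensors and warns (likely the warning in the output below);
# a model such as en_core_web_md provides real word vectors.
vec_list = []
for i in tqdm(range(1, len(jobinfo))):
    job_title = jobinfo.loc[i, 'lnks_job_title']
    token_vec = []
    for token in nlp(job_title):
        if token.is_alpha:
            if token.lemma_.lower() in vocalist_noun:
                token_vec.append(title_tok.similarity(token))
    vec_list.append(token_vec)

# Average the per-token similarities. Titles with no vocabulary nouns get NaN
# instead of silently reusing the previous title's value.
vec_list2 = []
for item in vec_list:
    if item:
        vec_list2.append(sum(item) / len(item))
    else:
        vec_list2.append(float('nan'))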


  token_vec.append(title_tok.similarity(token))
100%|██████████████████████████████████████| 4802/4802 [00:11<00:00, 424.34it/s]


In [71]:
similar_df['token_similarity'] = vec_list2
similar_df.sort_values(by=['similarity_value_with_one_hot', 'token_similarity'], ascending=False)

| | job_title | similarity_value_with_one_hot | token_similarity |
|---|---|---|---|
| 1732 | Paramedic - Offshore | 1.0 | 0.573498 |
| 3500 | Human Resources Generalist | 1.0 | 0.566191 |
| 4365 | Licensed Respiratory Therapist (RT) | 1.0 | 0.553277 |
| 1395 | Psychiatrist IV | 1.0 | 0.533875 |
| 1396 | Wellness Coach (Maternity & Pediatrics) | 1.0 | 0.533875 |
| ... | ... | ... | ... |
| 3950 | Patient Financial Counseling Specialist - Full... | 0.0 | 0.112782 |
| 4658 | Patient Financial Counseling Specialist - Full... | 0.0 | 0.112782 |
| 1095 | Line Cook - Full time 1561/KW | 0.0 | 0.102449 |
| 161 | Recent pay increase ~ Cook | 0.0 | 0.101998 |
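
One way to read the comparison: one-hot encoding gives every word its own axis, so two different nouns always score 0 however related they are, while dense vectors can assign graded similarity. The sketch below illustrates this with made-up three-dimensional vectors; they are assumptions chosen for illustration and do not come from any spaCy model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# One-hot: each word has its own axis, so distinct words are orthogonal.
one_hot = {"technician": [1, 0, 0], "mechanic": [0, 1, 0], "cook": [0, 0, 1]}

# Hypothetical dense vectors: related words point in similar directions.
dense = {"technician": [0.9, 0.4, 0.1], "mechanic": [0.8, 0.5, 0.2], "cook": [0.1, 0.2, 0.9]}

for name, table in (("one-hot", one_hot), ("dense", dense)):
    print(name,
          round(cosine(table["technician"], table["mechanic"]), 3),
          round(cosine(table["technician"], table["cook"]), 3))
```

With the one-hot table both pairs score 0, whereas the dense table ranks "mechanic" well above "cook" as a neighbor of "technician". This matches the pattern in the table above, where `token_similarity` varies among rows that share the same one-hot score.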
