# Coding Exercise 13
The following codes uses files in the indeed_scraped_data/job_info_data that contain "5222022" in the filename.

In [2]:
import os
import pandas as pd
import numpy as np
import spacy

## Task 1
The following task creates one pandas dataframe that combines all the data collected on May 22, 2022, dropping rows with missing job titles and/or job descriptions. In addition, it uses 'spacy' to tokenize the job titles and constructions a vocabulary sets (one for nouns and one for adjectives) cotnaining words in the titles in lowercased lemmatized form. The code reports the number of unique nouns and adjectives.

In [4]:
#using code from take home #1
dfs=[]
headers=['URL','Job Title','Company','Company URL','Company Location','Job Description']
for filename in os.scandir('indeed_scraped_data/job_info_data'):
    if filename.is_file() and '5222022' in filename.path:
        if '.csv' in filename.path:
            data=pd.read_csv(filename.path,delimiter=',')
        elif '.json' in filename.path:
            use=[]
            with open(filename.path) as file:
                data=json.load(file)
                for i in data:
                    for item in data[i]:
                        if(len(item)==6):#only include those with all 6 fields
                            use.append(item)
                data=pd.DataFrame(use)
        data.columns=headers #make headers match
        dfs.append(data)
df=pd.concat(dfs) #combine into a single dataframe
df=df[df["Job Title"].notnull()]
df=df[df["Job Description"].notnull()].reset_index(drop=True)

In [9]:
adj=set()
noun=set()
nlp=spacy.load('en_core_web_sm')
for i in df.index:
    for token in nlp(df.iloc[i]['Job Title']):
        if(token.pos_=='ADJ'):
            adj.add(token.lemma_.lower())
        elif(token.pos_=='NOUN'):
            noun.add(token.lemma_.lower())
print(len(adj),'unique adjectives.')
print(len(noun),'unique nouns.')

84 unique adjectives.
364 unique nouns.


In [10]:
print('Adjectives:',adj)

Adjectives: {'western', 'compl', 'skilled', 'technician', 'regulatory', '-', 'senior', 'long', 'financial', '-universal', 'powered', 'top', 'full', 'psychiatrist', 'residential', 'early', 'dietary', 'radiologic', 'practical', 'other', 'radiologist', 'surgical', 'available', 'associate', 'analyst', 'first', 'dietitian', 'preferred', 'personal', 'operational', 'osha', '5th', 'direct', 'pharmacist', 'dairy', 'iosha', 'real', 'analytics', 'archaeological', 'legislative', 'clinical', 'new', 'private', 'legal', 'mid', 'licensed', '3rd', 'remote', 'whole', 'part', 'advisory', 'telemedicine', 'sanitarian', 'microbiologist', 'tiktok', 'public', 'advanced', 'administrative', 'spanish', 'total', 'multi', 'good', '1st', 'commercial', 'paramedic', '-inpatient', 'global', 'regional', 'virtual', 'registered', 'small', 'certified', 'medical', 'payable', 'interventional', 'human', 'respiratory', 'corporate', 'excelsior', 'gfcii', 'us-', 'recent', 'universal', 'environmental'}


It seems that the adjectives in the Job Titles tell us a lot about the job domain; for example, "financial","psychiatric","archeological", among other descriptive words related to specific domains. In addition, it gives us a sense of the "level" of the job which can tell us about the skills and experience necessary; for example, words like "mid","registered", "early","advancanced", etc.

In [11]:
print('Nouns:',noun)

Nouns: {'employee', 'technologist-', 'technician', 'telecommunications', 'duty', 'branch', 'site', 'fund', 'market', 'attendant', 'lead', '-19772', 'security', 'helper', 'sdc/', 'evening', 'increase', 'engineering', 'foreman', 'manufacturing', 'reporting', 'geologist', 'warehouse', 'differential', 'analyst', 'hvac', 'fitter', '-retail', 'lpn', 'licensing', 'relationship', 'pos', 'distribution', 'application', 'performance', 'surgery', 'cycle', 'host', 'es', 'tracking', 'chef', 'inspector', 'food', 'claim', 'enterprise', 'representative', 'weekend', 'level', 'documentation', 'metrology', 'officer', 'speech', 'clerk', 'ad', '-11pm', 'agent', 'manager-', 'crime', 'operator-1', 'cardiology', 'ehs&s', 'clinician', 'house', 'ldar', 'us_sr', 'pay', 'remote/25', 'fmc', 'infection', 'robotic', 'disability', 'systems', 'purchasing', 'construction', 'finance', 'safety', 'hybrid', 'position', 'building', 'team', 'medical', '3p-11p', 'treasurer', 'control', 'business', 'mortgage', 'office', 'traini

The nouns tell us more about specific tasks associated with the job in addition to the domain with words like "search","clerk","analyst", etc. Many of the words also contribute to situating the job in a particular domain like many of the adjectives.

## Task 2
The following code selects the first job title in the dataframe as the primary string and uses one-hot encoding as the word embedding method and finds jobs in the dataframe that have similar nouns in the title.

In [23]:
from scipy.spatial.distance import cosine
noun=list(noun)
primary = df.iloc[0]['Job Title'] #use first job title as primary string

In [24]:
encodings=[]
#essentially do the same for every job title in the 
for i in df.index:
    title=df.iloc[i]["Job Title"]
    indicies=[]
    for token in nlp(title):
         if token.lemma_.lower() in noun:
            indicies.append(noun.index(token.lemma_.lower()))
    one_hot_encoding=np.zeros(len(noun))
    for index in indicies:
        one_hot_encoding[index]=1
    encodings.append(one_hot_encoding)

In [25]:
similar=[]
for i in range(1,len(df)):
    similar.append(1-cosine(encodings[0],encodings[i]))
print(similar)

[0.0, 1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1, 0.0, 0.0, 0.0, 0.0, 1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.40824829046386313, 0.0, 0.0, 0.0, 0.0, 0.0, 1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.28867513459481287, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.40824829046386313, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.40824829046386313, 0.0, 0.0, 0.0, 0.40824829046386313, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 

  dist = 1.0 - uv / np.sqrt(uu * vv)


In [29]:
similar_df=pd.DataFrame(columns=["Job Title","Similarity Value"])
similar_df["Job Title"]=df.loc[1:,"Job Title"]
similar_df["Similarity Value"]=similar
similar_df=similar_df.sort_values(by="Similarity Value",ascending=False)
print("Primary Job Title:",primary)
print(similar_df)

Primary Job Title: Maintenance Controller (A&P) Technician
                                          Job Title  Similarity Value
1944        Maintenance Controller (A&P) Technician               1.0
2150                          Senior VP, Operations               1.0
4373  Procurement Coordinator- Transplant Institute               1.0
3296                  Senior Maintenance Technician               1.0
1254                                     Controller               1.0
...                                             ...               ...
1663                        Health & Safety Manager               0.0
1664                   Quality Assurance Specialist               0.0
1665          Associate Specialist, Clinical Trails               0.0
1666          Prepared Foods Dishwasher - Full Time               0.0
4802                                 Grants Manager               0.0

[4802 rows x 2 columns]


Using the one hot encoding method, the job titles most similar to the first job title of "Mainenance Controller (A&P) Technician" are one of the same title, "Senior VP of Operations", "Procurement Coordinator - Translant Institute", and "Senior Maintenance Technician.

## Task 3
Finally, the following code uses spacy's word vector to complete the same task as that of task 2.

In [35]:
primary=nlp(primary)
sim_score=[]
for i in range(1,len(df)):
    title=[]
    for token in nlp(df.iloc[i]["Job Title"]):
        if token.lemma_.lower() in noun:
            title.append(str(token.lemma_.lower()))
    compare=nlp(' '.join(title))
    sim_score.append(primary.similarity(compare))

similar_df2=pd.DataFrame(columns=["Job Title","Similarity Value"])
similar_df2["Job Title"]=df.loc[1:,"Job Title"]
similar_df2["Similarity Value"]=sim_score
similar_df=similar_df.sort_values(by="Similarity Value",ascending=False)
print("Primary Job Title:",primary)
print(similar_df)

  sim_score.append(primary.similarity(compare))
  sim_score.append(primary.similarity(compare))


Primary Job Title: Maintenance Controller (A&P) Technician
                                              Job Title  Similarity Value
1944            Maintenance Controller (A&P) Technician               1.0
2458                          Human Resource Generalist               1.0
4696                            Mine / Crusher Operator               1.0
4111                                       Mixer Driver               1.0
1381                       Commercial Transact Coord II               1.0
...                                                 ...               ...
4625                               Paramedic - Offshore               0.0
4568                                          Geologist               0.0
4635                  Interim- Executive Director - NHA               0.0
4641  Prepared Foods Associate Team Leader (Culinary...               0.0
4802                                     Grants Manager               0.0

[4802 rows x 2 columns]


According to the word vector method, the job titles most similar to "Mainenance Controller (A&P) Technician" is also the title with the same name, "Human Resource Generalist", and "Mine / Crusher Operator". Based on my overall look, it seems that the one hot encoding method better finds similarities as I don't exactly see stark similarities in HR Generalist than I do to Senior Maintenance Technician as seen in the results from task 2. 