# exercise 13
## 1
Create one pandas dataframe that combines all the data scraped from May 22, 2022 together. Drop rows with missing job titles and/or job descriptions. Use `spacy` to tokenize all the job titles included in the cleaned dataframe. For each job title, find all the nouns and all the adjectives in the title and get their lowercased lemmatized form. Use the reformatted nouns to construct a vocabulary set for this dataframe. How many unique nouns are there? Construct another vocabulary set using the reformatted adjectives. How many unique adjectives are there? What kind of different information do the nouns versus the adjectives reveal about the specific job? 

In [1]:
import json
import pandas as pd
import os
os.chdir(r'C:\Users\[editted]\Documents\Me\BC\Advance\Data\jupyter\exercise 13\indeed_scraped_data\job_info_data')
job_info = pd.DataFrame(columns=['Company', 'Job Title', 'Location', 'Description', 'Company Link', 'Job Link'])
jsonMap = {'link':'Job Link', 'job_title':'Job Title', 'company':'Company', 'company_url':'Company Link',
                      'company_location':'Location', 'job_description':'Description'}
csvMap = {}
missJobs = job_info.copy()
for key in jsonMap:
    csvMap["lnks_"+key] = jsonMap[key]
for filename in os.listdir():
    if "5222022" not in filename:
        continue
    if filename[-3:] == "csv":
        data = pd.read_csv(filename)
        if(len(data.dtypes) != len(csvMap)):
            print("Unusual data, investigate")
            break
        data.rename(columns=csvMap, inplace=True)
        data = data.query('`Job Title`.notna() & `Description`.notna()')
        job_info = pd.concat([job_info, data])
    elif filename[-4:] == "json":
        with open(filename, encoding='UTF8') as jsonfile:
            otherdata = json.load(jsonfile)
            data = pd.DataFrame.from_dict(otherdata['lnks'])
            if(len(data.dtypes) != len(jsonMap)):
                print("Unusual data, investigate")
                break
            data.rename(columns=jsonMap, inplace=True)
            data = data.query('`Job Title`.notna() & `Description`.notna()')
            job_info = pd.concat([job_info, data])

In [2]:
job_info.reset_index(drop=True, inplace=True)

In [3]:
job_info.head()

Unnamed: 0,Company,Job Title,Location,Description,Company Link,Job Link
0,Baton Rouge General,Certified Pharmacy Technician II -Retail Pharmacy,"Baton Rouge, LA 70809",JOB PURPOSE OR MISSION: Assists pharmacists in...,https://www.indeed.com/cmp/Baton-Rouge-General...,https://www.indeed.com/rc/clk?jk=64568c71be4aa...
1,Anthem,Information Security Advisor,"Richmond, VA 23218",Description\nSHIFT: Day Job\nSCHEDULE: Full-ti...,"https://www.indeed.com/cmp/Anthem,-Inc.?campai...",https://www.indeed.com/rc/clk?jk=7b12bce39025f...
2,,Associate Director Learning Management Systems,United States,"Who we are\nWe’re a global, midsize CRO that p...",,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...
3,"AVA Search Group, LLC",Plant Engineer,"Janesville, WI",Plant Engineer with a growing company in South...,"https://www.indeed.com/cmp/Ava-Search-Group,-L...",https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...
4,,Medical Staff Coordinator,"Boston, MA 02114",Company Overview:\nShriners Children’s is a fa...,,https://www.indeed.com/rc/clk?jk=16d3c0f3ed372...


In [4]:
job_info.isna().sum()

Company         401
Job Title         0
Location          0
Description       0
Company Link    401
Job Link          0
dtype: int64

In [5]:
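#titles and descriptions were already filtered while loading; this additionally drops the 401 rows with no company link
job_info.dropna(subset=['Company Link', 'Job Title'], inplace=True)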

In [6]:
job_info['Job Title'] = job_info['Job Title'].str.replace('/', ' ', regex=False)
job_info['Job Title'] = job_info['Job Title'].str.replace(r'[#()]', '', regex=True)
#it wasn't dealing with stuff like "Technician -Retail" very well
job_info['Job Title'] = job_info['Job Title'].str.replace(r'\s-|-\s|^-|-$', ' ', regex=True)
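
The three replacements above can be checked in isolation with the standard `re` module. This is a minimal sketch, not part of the original notebook: `clean_title` is a hypothetical helper that mirrors the pandas calls, and the third test title (`RN/LPN (Nights)`) is made up for illustration.

```python
import re

def clean_title(title: str) -> str:
    """Hypothetical helper mirroring the three pandas replacements."""
    title = title.replace('/', ' ')        # slashes become spaces
    title = re.sub(r'[#()]', '', title)    # drop '#', '(' and ')'
    # A hyphen next to whitespace or at either end of the string becomes a space;
    # a hyphen inside a word (e.g. "II-III") is left alone.
    return re.sub(r'\s-|-\s|^-|-$', ' ', title)

print(clean_title("Certified Pharmacy Technician II -Retail Pharmacy"))
print(clean_title("Lab Technician II-III"))
print(clean_title("RN/LPN (Nights)"))
```

Hyphens that join two words directly, as in "Associate Director-Operations", are not touched by this pattern, so the tokenizer still has to split them.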

In [7]:
import spacy
nlp = spacy.load('en_core_web_md')

In [8]:
job_info['Job Title Token'] = job_info['Job Title'].str.lower().apply(nlp)

In [9]:
job_info['Job Title Token'].iloc[0][0].lemma_

'certify'

In [10]:
nouns = set()
propnouns = set()
adjectives = set()

In [11]:
nouns

set()

In [12]:
for tokens in job_info['Job Title Token']:
    for token in tokens:
        if token.pos_ == 'NOUN':
            nouns.add(token.lemma_)
        if token.pos_ == 'PROPN':
            propnouns.add(token.lemma_)
        if token.pos_ == 'ADJ':
            adjectives.add(token.lemma_)

In [13]:
len(propnouns)

774

In [14]:
'inspector' in propnouns and 'manager' in propnouns

True

Some of the tokens it tags as proper nouns, such as "inspector" and "manager", should be common nouns.

At first I thought this was because I was using `en_core_web_sm`, but I tried `en_core_web_md` and got the same results. The Universal Dependencies page for PROPN (https://universaldependencies.org/u/pos/PROPN.html), which spaCy links to in its documentation, hints that the odd syntax of the titles may be the cause.
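
One possible workaround, sketched here as an assumption rather than something the notebook does, is to move a lemma out of the proper-noun set whenever the tagger also labels it `NOUN` somewhere else in the corpus. The function below works on plain `(lemma, pos)` pairs that stand in for `token.lemma_` and `token.pos_`; `build_vocab` and the sample tags are hypothetical and hand-written, not real tagger output.

```python
def build_vocab(tagged_titles):
    """Collect lemma sets per part of speech from (lemma, pos) pairs."""
    nouns, propnouns, adjectives = set(), set(), set()
    for title in tagged_titles:
        for lemma, pos in title:
            if pos == 'NOUN':
                nouns.add(lemma)
            elif pos == 'PROPN':
                propnouns.add(lemma)
            elif pos == 'ADJ':
                adjectives.add(lemma)
    # Assumed heuristic: a "proper noun" that is also seen as a common noun
    # elsewhere is most likely a common noun that was mis-tagged.
    recovered = propnouns & nouns
    return nouns, propnouns - recovered, adjectives, recovered

# Hand-written tags for illustration only
tagged = [
    [('pharmacy', 'PROPN'), ('technician', 'PROPN')],
    [('senior', 'ADJ'), ('manager', 'NOUN')],
    [('manager', 'PROPN'), ('anthem', 'PROPN')],
]
nouns, propnouns, adjectives, recovered = build_vocab(tagged)
print(sorted(recovered))   # lemmas tagged both NOUN and PROPN
print(sorted(propnouns))   # lemmas only ever tagged PROPN
```

The heuristic cannot help with a lemma that is never tagged `NOUN` (here "technician"), so how much it would recover on the real data is unknown.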

In [15]:
len(nouns)

1139

In [16]:
#wanted to print out some just to see it
list(nouns)[:20]

['division',
 'ferry',
 'facilitator',
 'telecommunication',
 'cob',
 'aid',
 'pest',
 'grc',
 'generation',
 'student',
 'np',
 'bridge',
 'inpatient',
 'cancer',
 'cake',
 'helper',
 'steak',
 'voice',
 'mri',
 'treatment']

In [17]:
print("There are around", len(nouns), "nouns, though an unknown number of nouns were improperly classified as proper nouns")
print("There are around", len(adjectives), "adjectives")

There are around 1139 nouns, though an unknown number of nouns were improperly classified as proper nouns
There are around 276 adjectives


With more time (not the fault of the class, since this assignment allowed more time than most), I would break the jobs down by state, or by quarter of the country, and check whether the nouns and adjectives vary with location.
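
A rough sketch of how that breakdown might start, using only the standard library. It assumes the `Location` strings look like the ones in the table above ("City, ST" or "City, ST 12345"); `state_of` and `nouns_by_state` are hypothetical helpers, and the noun sets in the sample rows are illustrative rather than taken from the real data.

```python
import re
from collections import defaultdict

# Two capital letters after the last comma, optionally followed by a 5-digit ZIP
STATE_RE = re.compile(r',\s*([A-Z]{2})(?:\s+\d{5})?$')

def state_of(location):
    """Return the two-letter state code, or None if the pattern is absent."""
    match = STATE_RE.search(location)
    return match.group(1) if match else None

def nouns_by_state(rows):
    """rows: iterable of (location, set_of_noun_lemmas) pairs."""
    grouped = defaultdict(set)
    for location, lemmas in rows:
        grouped[state_of(location)].update(lemmas)
    return dict(grouped)

rows = [
    ("Baton Rouge, LA 70809", {"technician"}),
    ("Lafayette, LA", {"technician", "service"}),
    ("United States", {"director"}),   # no state: grouped under None
]
print(nouns_by_state(rows))
```

Listings with a nationwide location such as "United States" end up under the `None` key and would need to be handled separately.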

## 2
Choose the first job title in your dataframe as the primary string. Use one-hot encoding as the word embedding method and find jobs in your cleaned dataframe that have similar nouns in the title as your primary string. 

In [18]:
job_info.head(1)

Unnamed: 0,Company,Job Title,Location,Description,Company Link,Job Link,Job Title Token
0,Baton Rouge General,Certified Pharmacy Technician II Retail Pharmacy,"Baton Rouge, LA 70809",JOB PURPOSE OR MISSION: Assists pharmacists in...,https://www.indeed.com/cmp/Baton-Rouge-General...,https://www.indeed.com/rc/clk?jk=64568c71be4aa...,"(certified, pharmacy, technician, ii, retail, ..."


In [19]:
firstNouns = set()
for tokens in job_info.head(1)['Job Title Token']:
    for token in tokens:
        if token.pos_ == 'NOUN':
            firstNouns.add(token.lemma_)
            print("NOUN")
        #flag proper nouns for manual review, since the tagger mislabels many common nouns as PROPN
        if token.pos_ == 'PROPN':
            print('Proper noun of \"', token.lemma_, '\" decide if you want to add', sep='')

Proper noun of "pharmacy" decide if you want to add
Proper noun of "technician" decide if you want to add
Proper noun of "ii" decide if you want to add
Proper noun of "retail" decide if you want to add
Proper noun of "pharmacy" decide if you want to add


In [20]:
#not adding retail or pharmacy since they are attributive nouns (acting like adjectives)
firstNouns = ['technician']

In [21]:
def hasWord(tokens, word):
    for token in tokens:
        if token.lemma_ == word:
            return 1
    return 0

In [22]:
for word in firstNouns:
    job_info['has_' + word] = job_info['Job Title Token'].apply(hasWord, word=word)

In [23]:
job_info['noun_sum'] = job_info[['has_'+word for word in firstNouns]].sum(axis=1)

In [24]:
job_info.loc[job_info['noun_sum'] > 0]

Unnamed: 0,Company,Job Title,Location,Description,Company Link,Job Link,Job Title Token,has_technician,noun_sum
0,Baton Rouge General,Certified Pharmacy Technician II Retail Pharmacy,"Baton Rouge, LA 70809",JOB PURPOSE OR MISSION: Assists pharmacists in...,https://www.indeed.com/cmp/Baton-Rouge-General...,https://www.indeed.com/rc/clk?jk=64568c71be4aa...,"(certified, pharmacy, technician, ii, retail, ...",1,1
159,Global Medical Response,Avionics Technician Depot,"Ogden, UT 84405",Job Description:\nAvionics Technician\nIMMEDIA...,https://www.indeed.com/cmp/Global-Medical-Resp...,https://www.indeed.com/rc/clk?jk=3eaecb8ae5bcd...,"(avionics, technician, , depot)",1,1
189,Aerotek,Refrigeration Technician,"Macon, GA 31201",Equivalent Experience\nDescription:\nMaintaini...,https://www.indeed.com/cmp/Aerotek?campaignid=...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,"(refrigeration, technician)",1,1
235,Eurofins USA Food Testing,Media Production Technician,"Mounds View, MN 55112",Company Description\nEurofins Scientific is an...,https://www.indeed.com/cmp/Eurofins?campaignid...,https://www.indeed.com/rc/clk?jk=93924f9d84075...,"(media, production, technician)",1,1
288,Fresenius Medical Care,Patient Care Technician,"Attalla, AL 35954",PURPOSE AND SCOPE:\nSupports FMCNA's mission v...,https://www.indeed.com/cmp/Fresenius-Medical-C...,https://www.indeed.com/rc/clk?jk=6a3501cb2beb2...,"(patient, care, technician)",1,1
...,...,...,...,...,...,...,...,...,...
4712,Zymergen,Lab Technician II-III,"Emeryville, CA",Zymergen has an exciting opportunity for a Lab...,https://www.indeed.com/cmp/Zymergen?campaignid...,https://www.indeed.com/rc/clk?jk=92fb8e44d2b15...,"(lab, technician, ii, -, iii)",1,1
4729,DES Employment Group,Quality Assurance Technician $14.50-$15,"Omaha, NE",We have an immediate opportunity for Quality T...,https://www.indeed.com/cmp/Des-Employment-Grou...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,"(quality, assurance, technician, $, 14.50-$15)",1,1
4775,"EMCOR Facilities Services, Inc.",HVAC Technician,"Austin, TX","About Us:\nEMCOR Facilities Services, Inc. is ...",https://www.indeed.com/cmp/Emcor-Facilities-Se...,https://www.indeed.com/rc/clk?jk=5391654046800...,"(hvac, technician)",1,1
4783,NTT Ltd,Data Center Technician L2,"Ashburn, VA",At NTT we believe that by using innovative tec...,https://www.indeed.com/cmp/Ntt-6?campaignid=mo...,https://www.indeed.com/rc/clk?jk=5ee58949adef0...,"(data, center, technician, l2)",1,1
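
The `has_<word>` columns above are a one-hot encoding with a one-word vocabulary. The sketch below shows the same idea with a larger vocabulary in plain Python, so the mechanics are visible without pandas or spaCy. It deliberately differs from the notebook by keeping "pharmacy" and "retail" in the vocabulary, to show how a multi-word vocabulary ranks titles; `one_hot`, `overlap`, and the lemma lists are assumptions made for illustration.

```python
def one_hot(lemmas, vocab):
    """Binary vector: 1 where the vocabulary word occurs among the lemmas."""
    present = set(lemmas)
    return [1 if word in present else 0 for word in vocab]

def overlap(a, b):
    """Dot product of two binary vectors = number of shared vocabulary words."""
    return sum(x * y for x, y in zip(a, b))

vocab = ['pharmacy', 'retail', 'technician']
primary = one_hot(
    ['certified', 'pharmacy', 'technician', 'ii', 'retail', 'pharmacy'], vocab)

titles = {
    'Refrigeration Technician': ['refrigeration', 'technician'],
    'Pharmacy Technician': ['pharmacy', 'technician'],
    'Plant Engineer': ['plant', 'engineer'],
}
scores = {title: overlap(primary, one_hot(lemmas, vocab))
          for title, lemmas in titles.items()}
print(scores)
```

With only "technician" in the vocabulary, as in the notebook, every matching title gets the same score of 1, which is why the result above is a filter rather than a ranking.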


## 3
Use spacy's word vector to do Task 2. Compare the results. 

In [25]:
firstNounTokens = set()
for tokens in job_info.head(1)['Job Title Token']:
    for token in tokens:
        if token.pos_ == 'NOUN':
            firstNounTokens.add(token)
            print("NOUN")
        #flag proper nouns for manual review, since the tagger mislabels many common nouns as PROPN
        if token.pos_ == 'PROPN':
            print('Proper noun of \"', token.lemma_, '\" decide if you want to add', sep='')

Proper noun of "pharmacy" decide if you want to add
Proper noun of "technician" decide if you want to add
Proper noun of "ii" decide if you want to add
Proper noun of "retail" decide if you want to add
Proper noun of "pharmacy" decide if you want to add


In [26]:
list(job_info['Job Title Token'].head(1))

[certified pharmacy technician ii retail pharmacy]

In [27]:
#index 2 of the first title is the 'technician' token
firstNounTokens = [job_info['Job Title Token'].iloc[0][2]]

In [28]:
def wordSimilarity(tokens, wordList):
    similarSum = 0
    for token in tokens:
        for word in wordList:
            similarSum += token.similarity(word)
    #normalize to max of 1 at end, so longer sentences aren't prioritized
    return similarSum/(len(tokens)*len(wordList))

In [29]:
job_info['noun_similarity'] = job_info['Job Title Token'].apply(wordSimilarity, wordList=firstNounTokens)



In [30]:
job_info

Unnamed: 0,Company,Job Title,Location,Description,Company Link,Job Link,Job Title Token,has_technician,noun_sum,noun_similarity
0,Baton Rouge General,Certified Pharmacy Technician II Retail Pharmacy,"Baton Rouge, LA 70809",JOB PURPOSE OR MISSION: Assists pharmacists in...,https://www.indeed.com/cmp/Baton-Rouge-General...,https://www.indeed.com/rc/clk?jk=64568c71be4aa...,"(certified, pharmacy, technician, ii, retail, ...",1,1,0.281419
1,Anthem,Information Security Advisor,"Richmond, VA 23218",Description\nSHIFT: Day Job\nSCHEDULE: Full-ti...,"https://www.indeed.com/cmp/Anthem,-Inc.?campai...",https://www.indeed.com/rc/clk?jk=7b12bce39025f...,"(information, security, advisor)",0,0,0.105457
3,"AVA Search Group, LLC",Plant Engineer,"Janesville, WI",Plant Engineer with a growing company in South...,"https://www.indeed.com/cmp/Ava-Search-Group,-L...",https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,"(plant, engineer)",0,0,0.216152
5,1st Jackpot Casino Tunica,Player Services Cashier,"Robinsonville, MS 38664",Overview:\nDon’t just work. Work Happy.\nA car...,https://www.indeed.com/cmp/1st-Jackpot-Casino-...,https://www.indeed.com/rc/clk?jk=8130f0942ee17...,"(player, services, cashier)",0,0,0.269085
6,"Genesis Financial Solutions, Inc.",Audit Director,"Beaverton, OR 97006",Overview:\nJoin the nation’s leader in second-...,https://www.indeed.com/cmp/Genesis-Financial-S...,https://www.indeed.com/rc/clk?jk=fcfd3edebc166...,"(audit, director)",0,0,0.139751
...,...,...,...,...,...,...,...,...,...,...
4796,Hanmi Bank,Loan Document Imaging Specialist,"Houston, TX 77036",SUMMARY\nThe Loan Document Imaging Specialist ...,https://www.indeed.com/cmp/Hanmi-Bank?campaign...,https://www.indeed.com/rc/clk?jk=d3000367bc0b9...,"(loan, document, imaging, specialist)",0,0,0.170080
4798,Convera,"Associate, AML Compliance","Santa Ana, CA","J\nAssociate, Anti Money Laundering Compliance...",https://www.indeed.com/cmp/Convera?campaignid=...,https://www.indeed.com/rc/clk?jk=c540bfbb56a88...,"(associate, ,, aml, compliance)",0,0,0.174346
4800,Royal Building Products,Associate Director-Operations,"Marion, VA 24354",Royal Building Products enables you to broaden...,https://www.indeed.com/cmp/Royal-Building-Prod...,https://www.indeed.com/rc/clk?jk=de74cc8e0427b...,"(associate, director, -, operations)",0,0,0.142786
4801,RockBridge Search & Recruitment,Financial Reporting Analyst Hybrid Work Location,"Charlotte, NC",Financial Reporting Analyst - Bank Regulatory ...,https://www.indeed.com/cmp/Rockbridge-Search-&...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,"(financial, reporting, analyst, hybrid, work, ...",0,0,0.168726


In [31]:
job_info.describe()

Unnamed: 0,has_technician,noun_sum,noun_similarity
count,4402.0,4402.0,4402.0
mean,0.034984,0.034984,0.153787
std,0.18376,0.18376,0.080515
min,0.0,0.0,-0.013787
25%,0.0,0.0,0.107133
50%,0.0,0.0,0.142998
75%,0.0,0.0,0.182121
max,1.0,1.0,1.0


In [32]:
job_info.query('noun_similarity > 0.5')

Unnamed: 0,Company,Job Title,Location,Description,Company Link,Job Link,Job Title Token,has_technician,noun_sum,noun_similarity
20,United Memorial Medical Center - North Street ...,HIM Tech,"Batavia, NY 14020",Maintains the security and integrity of the me...,https://www.indeed.com/cmp/St.-Lawrence-Health...,https://www.indeed.com/rc/clk?jk=cd1ed711a0abe...,"(him, tech)",0,0,0.596655
160,DCP Midstream,Operations Tech,"Lovington, NM",DCP Midstream is a Fortune 500 natural gas com...,https://www.indeed.com/cmp/Dcp-Midstream-Lp?ca...,https://www.indeed.com/rc/clk?jk=ea74ed849e4fd...,"(operations, tech)",0,0,0.556115
189,Aerotek,Refrigeration Technician,"Macon, GA 31201",Equivalent Experience\nDescription:\nMaintaini...,https://www.indeed.com/cmp/Aerotek?campaignid=...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,"(refrigeration, technician)",1,1,0.526821
417,WorkerBee's Pro,Quality Technician,"Lakeland, FL 33811",The Quality Technician will be expected to per...,https://www.indeed.com/cmp/Workerbee's-Pro?cam...,https://www.indeed.com/company/WorkerBee's-Pro...,"(quality, technician)",1,1,0.567625
437,CVS Health,Pharmacy Technician,"Kent, OH 44240",Pharmacy Technician\nPosition Summary\nHealth ...,https://www.indeed.com/cmp/CVS-Health?campaign...,https://www.indeed.com/company/CVS-Health/jobs...,"(pharmacy, technician)",1,1,0.533067
...,...,...,...,...,...,...,...,...,...,...
4061,"Tait & Associates, Inc.",Service Technician,"Lafayette, LA",Join the TAIT Team!\nAbout TAIT\nWe are leader...,"https://www.indeed.com/cmp/Tait-&-Associates,-...",https://www.indeed.com/rc/clk?jk=a934b2659e982...,"(service, technician)",1,1,0.621700
4158,"EMCOR Facilities Services, Inc.",Commercial Maintenance Technician,"McAllen, TX","About Us:\nEMCOR Facilities Services, Inc. is ...",https://www.indeed.com/cmp/Emcor-Facilities-Se...,https://www.indeed.com/rc/clk?jk=d4df1826d96dc...,"(commercial, maintenance, technician)",1,1,0.505022
4257,Horizon Personnel Services,Maintenance Technician,"Troy, MO",Conducts all duties in a manner consistent wit...,https://www.indeed.com/cmp/Horizon-Personnel-S...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,"(maintenance, technician)",1,1,0.646218
4639,"Unicep Packaging, LLC",Facilities Maintenance Technician,"Spokane, WA 99224",The Facilities Maintenance Technician is respo...,"https://www.indeed.com/cmp/Unicep-Packaging,-L...",https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,"(facilities, maintenance, technician)",1,1,0.528291


Overall, the word-vector method gives more granular results and can recognize that two jobs are similar even when they do not share the exact same words: it surfaces "HIM Tech" and "Operations Tech", which the one-hot method misses. However, it is hard to correct for title length. Longer titles offer more chances to match and therefore a higher raw score, but if the score is normalized by title length, short titles that contain little more than the exact match are prioritized.
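
The length trade-off can be seen on toy data. The sketch below assumes the similarity is cosine similarity, which is what spaCy documents as its default, and compares two ways of aggregating per-token scores: the mean (as in `wordSimilarity`) and the maximum. The 2-d vectors are made up for illustration and are not spaCy vectors; `mean_similarity` and `max_similarity` are hypothetical names, and both assume a non-empty title.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_similarity(title_vecs, target):
    """Average over tokens: extra unrelated words pull the score down."""
    return sum(cosine(v, target) for v in title_vecs) / len(title_vecs)

def max_similarity(title_vecs, target):
    """Best single token: title length no longer matters."""
    return max(cosine(v, target) for v in title_vecs)

technician = (1.0, 0.0)   # toy vector for the target noun
unrelated = (0.0, 1.0)    # toy vector orthogonal to it

short_title = [technician]
long_title = [unrelated, unrelated, technician]

print(mean_similarity(short_title, technician), mean_similarity(long_title, technician))
print(max_similarity(short_title, technician), max_similarity(long_title, technician))
```

Taking the maximum removes the length penalty but ignores everything else in the title, so it trades one bias for another rather than solving the problem.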

## Bonus task 1
Do task 3 for both the nouns and the adjectives in the title. Combine the similarity values from comparing the nouns and from comparing the adjectives, and find jobs in your cleaned dataframe that are similar to your primary string in the combined setting. Compare the results with what you get from Task 3.

I plan to do these bonus tasks on my own time, because it's been a busy few weeks for me. I probably won't turn them in, as I'm doing them for the learning and not for the points, but thanks for these tasks; they are definitely helping me further develop my data analytics skills.