# Project 4: Web Scraping Job Postings

### Forecasting salary ranges for data science jobs in San Francisco using Indeed.com's API to gather job descriptions.
# 👨🏻‍💻👩🏻‍💻👨🏼‍💻👩🏼‍💻👨🏽‍💻👩🏽‍💻👨🏾‍💻👩🏾‍💻👨🏿‍💻👩🏿‍💻👨🏻‍💻👩🏻‍💻👨🏼‍💻👩🏼‍💻👨🏽‍💻👩🏽‍💻👨🏾‍💻👩🏾‍💻👨🏿‍💻👩🏿‍💻👨🏻‍💻👩🏻‍💻👨🏼‍💻👩🏼‍💻👨🏽‍💻👩🏽‍💻👨🏾‍💻👩🏾‍💻👨🏿‍💻👩🏿‍💻

In [84]:
import pprint
import requests
import json
import pandas as pd
import urllib2
import time
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression
import matplotlib
import plotly.plotly as py
import cufflinks as cf

# Using the Indeed API to collect data
from indeed import IndeedClient

# Silencing FutureWarnings and pandas' chained-assignment warning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('chained_assignment',None)

In [85]:
# The API requires passing the client's IP.
# So, I'll create a function to collect that.

def get_ip():
    ext_ip = urllib2.urlopen('http://whatismyip.org').read()
    return ext_ip

In [86]:
# Setting my Indeed Developer API Key as a variable

client = IndeedClient(publisher = 1295525004807710)

In [87]:
# The parameter structure for the API request:

params = {
        'q': "data+scientist",
        'l': "any",                                                      
        'start': 0,                                               
        'end': 5000,
        'limit': 5000,
        'userip': get_ip(),
        'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)",
        'sort': 'date',
        'fromage': 'any',
        'co': 'any'
}

In [88]:
indeed = pd.DataFrame()
search_response = client.search(**params)
indeed = indeed.append(search_response['results'], ignore_index=True) 

In [89]:
indeed = pd.DataFrame(search_response['results'])
indeed.shape

(25, 19)

### Uh oh! An API limiting condition!! 😩<br>
The API will only pass 25 listings at a time.<br>
Even though we passed a limit at 5000 and an end at 5000, the API stops at 25.

So, we have to create a range with steps of 24:<br>
<ol>
    <li>Collect the response from Indeed</li>
    <li>Append the response to the DataFrame</li>
    <li>Increment the parameters passed to the Indeed servers so that we can bypass their API restriction of 25 records</li>
</ol>
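
The three steps above can be sketched as a simple paging loop. This is a minimal sketch, not the notebook's actual request code: `fetch_page` is a hypothetical stand-in for `client.search(**params)['results']`, and the loop steps by the full page size of 25 on the assumption that a step of 24 would likely refetch the last record of each page.

```python
def fetch_page(start, limit=25, total=60):
    """Hypothetical stand-in for client.search(**params)['results']."""
    end = min(start + limit, total)
    return [{'jobkey': 'job%03d' % i} for i in range(start, end)]


def collect_all(page_size=25, max_records=2000):
    """Request one page at a time and stop at the first empty page."""
    results = []
    for start in range(0, max_records, page_size):
        page = fetch_page(start, limit=page_size)
        if not page:  # nothing left to collect
            break
        results.extend(page)
    return results


listings = collect_all()
print(len(listings))
```

Stopping on the first empty page also avoids sending requests for offsets that no longer return anything.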
# 🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉

ALSO! I did some prodding of the parameters and the formatting of the advanced search feature on Indeed's website.<br> Although undocumented, it is possible to pass salary ranges into our search parameters.<br> Indeed advises these numbers are approximations:<br>

<i>How much does a Data Scientist in the United States make?</i><br>

The average Data Scientist salary in the United States is approximately 💲130,164.
Salary information comes from 36,404 data points collected directly from employees, users, and past and present job advertisements on Indeed in the past 12 months.
Please note that all salary figures are approximations based upon third party submissions to Indeed.

In [101]:
indeed275k = pd.DataFrame()

for i in np.arange(0, 2000, 24):                                                                                                                                                                                            
    params = {                                                                
                'q': "data+scientist",
                'salary': '$265k-275k',
                'l': "san francisco",                                                      
                'start': 0 + i,                                               
                'end': 24 + i,
                'limit': 25,
                'userip': get_ip(),
                'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)",
                'sort': 'date',
                'fromage': 'any',
                'co': 'any'
             }
    search_response = client.search(**params)
    indeed275k = indeed275k.append(search_response['results'], ignore_index=True)   
                                                                                  
indeed275k.shape

IndexError: list index out of range

#### Looks like we hit a wall for salaries above 💲265,000.
But, realistically... who's getting more than $265k?? 😳
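
The $265k+ query comes back empty, and the IndexError probably comes from trying to append that empty result list. Here's a minimal sketch of generating the band strings and guarding against an empty response. The band edges mirror the DataFrame names used below; `salary_bands`, `band_query`, and `results_or_empty` are hypothetical helpers, and the query format for bands other than '$265k-275k' is an assumption.

```python
def salary_bands(first_upper=65, last_upper=265, step=20):
    """Return (lower, upper) pairs in thousands of dollars: (0, 65), (65, 85), ..."""
    bands = [(0, first_upper)]
    for upper in range(first_upper + step, last_upper + 1, step):
        bands.append((upper - step, upper))
    return bands


def band_query(lower, upper):
    """Format a band like the '$265k-275k' string (format assumed for other bands)."""
    return '$%dk-%dk' % (lower, upper)


def results_or_empty(response):
    """Return the list of listings, or [] when the response holds none."""
    return response.get('results') or []


bands = salary_bands()
queries = [band_query(lower, upper) for lower, upper in bands]
print(queries[0], queries[-1])
```

Checking `results_or_empty(...)` before appending lets a loop skip or stop on an empty band instead of crashing.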

In [102]:
# DataFrame Shapes
print 'Indeed $0k - $65k: ', indeed65k.shape
print 'Indeed $65k - $85k: ', indeed85k.shape
print 'Indeed $85k - $105k: ', indeed105k.shape
print 'Indeed $105k - $125k: ', indeed125k.shape
print 'Indeed $125k - $145k: ', indeed145k.shape
print 'Indeed $145k - $165k: ', indeed165k.shape
print 'Indeed $165k - $185k: ', indeed185k.shape
print 'Indeed $185k - $205k: ', indeed205k.shape
print 'Indeed $205k - $225k: ', indeed225k.shape
print 'Indeed $225k - $245k: ', indeed245k.shape
print 'Indeed $245k - $265k: ', indeed265k.shape

Indeed $0k - $65k:  (644, 19)
Indeed $65k - $85k:  (1370, 19)
Indeed $85k - $105k:  (728, 19)
Indeed $105k - $125k:  (1821, 19)
Indeed $125k - $145k:  (903, 19)
Indeed $145k - $165k:  (1021, 19)
Indeed $165k - $185k:  (720, 19)
Indeed $185k - $205k:  (167, 19)
Indeed $205k - $225k:  (84, 19)
Indeed $225k - $245k:  (84, 19)
Indeed $245k - $265k:  (84, 19)


## Let's take a look at some of these DataFrames. 🔍

In [103]:
indeed185k.company.value_counts()

Workbridge Associates       393
Harnham                     163
Elevate Recruiting Group     83
Jobspring Partners           81
Name: company, dtype: int64

In [104]:
indeed205k.company.value_counts()

Workbridge Associates    167
Name: company, dtype: int64

In [105]:
indeed225k.company.value_counts()

Harnham    84
Name: company, dtype: int64

In [106]:
indeed245k.company.value_counts()

Averity    84
Name: company, dtype: int64

In [107]:
indeed265k.company.value_counts()

Workbridge Associates    84
Name: company, dtype: int64

### There doesn't seem to be much value in any of the $205k+ DataFrames: each one holds listings from a single company.

#### Having these salary ranges allows me to create salary targets within each DataFrame.
# 💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲

In [108]:
indeed65k['salary'] = 65000
indeed85k['salary'] = 85000
indeed105k['salary'] = 105000
indeed125k['salary'] = 125000
indeed145k['salary'] = 145000
indeed165k['salary'] = 165000
indeed185k['salary'] = 185000

Let's put these all together into one DataFrame for cleaning and analysis.

In [109]:
yes_indeedy = [indeed85k, indeed105k, indeed125k, indeed145k, indeed165k, \
               indeed185k]

indeed = indeed65k.append(yes_indeedy)
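
`DataFrame.append` was removed in pandas 2.0, so on a current install the same stacking can be done with `pd.concat`. A minimal sketch with toy frames standing in for the per-band DataFrames (the `band_frames` dict and its contents are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the per-band DataFrames, keyed by the band's upper salary bound.
band_frames = {
    65000: pd.DataFrame({'jobkey': ['a1', 'a2']}),
    85000: pd.DataFrame({'jobkey': ['b1']}),
}

# Label each frame with its salary target, then stack them into one DataFrame.
labelled = [frame.assign(salary=upper) for upper, frame in band_frames.items()]
combined = pd.concat(labelled, ignore_index=True)
print(combined.shape)
```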

In [110]:
indeed.shape

(7207, 20)

Let's check for duplicate job listings and take those out.

In [111]:
indeed[indeed.duplicated('jobkey', keep=False) == False].count()

city                     124
company                  124
country                  124
date                     124
expired                  124
formattedLocation        124
formattedLocationFull    124
formattedRelativeTime    124
indeedApply              124
jobkey                   124
jobtitle                 124
language                 124
onmousedown              124
snippet                  124
source                   124
sponsored                124
state                    124
stations                 124
url                      124
salary                   124
dtype: int64

Rut-roh! Only 124 rows have a jobkey that appears exactly once, so nearly everything else is a repeat!<br>
Let's nuke 'em!
# 💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥

In [112]:
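# NOTE: this mask keeps every row whose jobkey is repeated and drops the 124 unique rows,
# which is why the shape below is (7083, 20). Keeping one row per listing would instead
# call for indeed.drop_duplicates('jobkey').
indeed = indeed[indeed.duplicated('jobkey', keep=False) == True]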
indeed.reset_index(drop=True, inplace=True)
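
To see the difference between selecting the repeated rows and actually removing repeats, here's a minimal sketch on a toy frame. The data is made up, and `keep='first'` is just one possible choice: when a listing shows up in several salary bands, which copy to keep is a modeling decision.

```python
import pandas as pd

toy = pd.DataFrame({
    'jobkey': ['k1', 'k2', 'k1', 'k3', 'k2'],
    'salary': [65000, 65000, 85000, 105000, 85000],
})

# Rows whose jobkey occurs more than once (every copy is selected).
only_repeated = toy[toy.duplicated('jobkey', keep=False)]

# One row per jobkey, keeping the first occurrence.
deduped = toy.drop_duplicates('jobkey', keep='first').reset_index(drop=True)

print(len(only_repeated), len(deduped))
```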

I'm turning this DataFrame into a .CSV file to make a checkpoint of my work.

In [113]:
indeed.to_csv('indeed_collection.csv', sep=',', encoding='utf-8')

In [114]:
# indeed = pd.read_csv('./indeed_collection.csv')

In [115]:
indeed.shape

(7083, 20)

Around 7000 rows isn't very meaty, and my initial impulse is to widen the location net.<br> However, the Bay Area likely pays at a disproportionate rate compared to other regions, which would skew the salary targets.<br> So, for the time being, we'll work from here.

How tidy is the rest of the DataFrame?

In [116]:
indeed.dtypes

city                     object
company                  object
country                  object
date                     object
expired                    bool
formattedLocation        object
formattedLocationFull    object
formattedRelativeTime    object
indeedApply                bool
jobkey                   object
jobtitle                 object
language                 object
onmousedown              object
snippet                  object
source                   object
sponsored                  bool
state                    object
stations                 object
url                      object
salary                    int64
dtype: object

In [117]:
indeed.head(1)

| | city | company | country | date | expired | formattedLocation | formattedLocationFull | formattedRelativeTime | indeedApply | jobkey | jobtitle | language | onmousedown | snippet | source | sponsored | state | stations | url | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | San Francisco | IBM | US | Mon, 15 May 2017 21:28:22 GMT | False | San Francisco, CA | San Francisco, CA | 3 days ago | False | c33fae3f8ce6ba8c | Data Science Community Manager | en | `indeed_clk(this,'4723');` | `<b>Data</b> <b>Scientist</b> Community Manager...` | IBM | False | CA | | http://www.indeed.com/viewjob?jk=c33fae3f8ce6b... | 65000 |


#### Looks like we have a few wonky columns to drop.
We will drop the <i>Country</i>, <i>State</i>, <i>Stations</i>, <i>Sponsored</i>, <i>Language</i>, and <i>Expired</i> columns, because they are uniform.<br>
We also don't need the individual job listing <i>URL</i>s or the <i>On Mouse Down</i> actions column.<br> The <i>Company</i> and <i>Source</i> features have matching data, so we only need one.<br>The <i>Formatted Location</i> and <i>Formatted Location Full</i> columns also tell us little, so they can go.<br>Also, we no longer need the <i>JobKey</i> feature, because we have already used it to check for duplicates.
<br><br> There may be salary differences within the Bay Area, but the cell below drops the <i>City</i> column along with the combined location columns, so no location detail is carried forward.

In [118]:
indeed.drop(['city', 'country', 'state', 'stations', 'sponsored', 'language', 'expired', \
             'formattedLocation', 'source', 'formattedLocationFull', 'url', 'onmousedown', \
             'jobkey'],
             axis=1, inplace=True)

indeed.reset_index(drop=True, inplace=True)

Let's convert the date column to DateTime.

In [119]:
indeed['date'] = pd.to_datetime(indeed['date'], infer_datetime_format=True)

Let's see if we can do something with the <i>Formatted Relative Time</i> column.

In [120]:
indeed.formattedRelativeTime.value_counts()

30+ days ago    5753
2 days ago       170
20 days ago      163
21 days ago      159
23 days ago       96
7 days ago        96
6 days ago        94
17 days ago       93
15 days ago       93
8 days ago        91
3 days ago        87
10 days ago       83
28 days ago       48
27 days ago       21
26 days ago       17
12 days ago       11
5 days ago         4
29 days ago        2
24 days ago        2
Name: formattedRelativeTime, dtype: int64

In [121]:
indeed.dtypes

company                          object
date                     datetime64[ns]
formattedRelativeTime            object
indeedApply                        bool
jobtitle                         object
snippet                          object
salary                            int64
dtype: object

### Woof. That's no fun. 🤢
I'll need to convert any 'hours ago' listings to 1 day, and then extract the integers from the 'days ago' strings (so '30+ days ago' becomes 30).<br> Then, I'll drop the <i>Formatted Relative Time</i> column.

In [122]:
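# Pull the first run of digits out of strings such as '3 days ago' or '30+ days ago'
indeed['daysAgo'] = indeed['formattedRelativeTime'].str.extract(r'(\d+)', expand=False).astype(int)
# Anything posted hours ago counts as 1 day; .loc avoids chained assignment
indeed.loc[indeed['formattedRelativeTime'].str.contains('hours'), 'daysAgo'] = 1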
indeed.drop('formattedRelativeTime', axis=1, inplace=True)
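
The same parsing rule can be written as a plain function, which makes the edge cases easier to see. This is a minimal sketch: `days_ago` is a hypothetical helper, and 'Just posted' is an assumed example of a string with no digits, not a value observed in this data.

```python
import re


def days_ago(relative_time):
    """Turn strings like '3 days ago' or '30+ days ago' into an integer day count."""
    if 'hour' in relative_time:
        return 1  # anything posted hours ago counts as one day
    match = re.search(r'\d+', relative_time)
    return int(match.group()) if match else None


for text in ['30+ days ago', '3 days ago', '5 hours ago', 'Just posted']:
    print(text, '->', days_ago(text))
```

Returning `None` for digit-free strings keeps the gap visible instead of failing on the integer conversion.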

In [123]:
indeed.head(3)

| | company | date | indeedApply | jobtitle | snippet | salary | daysAgo |
|---|---|---|---|---|---|---|---|
| 0 | IBM | 2017-05-15 21:28:22 | False | Data Science Community Manager | `<b>Data</b> <b>Scientist</b> Community Manager...` | 65000 | 3 |
| 1 | LiveRamp | 2017-04-27 18:03:49 | True | Senior Technical Recruiter | `Successful track record and high level of expe...` | 65000 | 21 |
| 2 | Upstart | 2017-03-21 05:52:04 | False | Data Scientist (Internship) | `We're looking for someone to join our <b>data<...` | 65000 | 30 |


### Are they really all data scientist jobs?

In [124]:
indeed.jobtitle.unique()

array([u'Data Science Community Manager', u'Senior Technical Recruiter',
       u'Data Scientist (Internship)',
       u'Data Scientist Intern - Summer 2017', u'Data Scientist',
       u'Market Research Analyst - Analyst Development Program',
       u'Client Support Specialist', u'Analytic Consultant 1',
       u'SAFETY AND INSURANCE DATA SCIENTIST',
       u'Statistician/Predictive Modeler/Data Scientist',
       u'Lead Data Science Instructor', u'BioMedical Data Scientist',
       u'Database Analsyt', u'Statistically Significant Data Scientist',
       u'Data Scientist - Economics & Legal', u'Data Scientist, CDHI',
       u'Connect - Director, Advertising, Media & Technology Research',
       u'Data Scientist - Yammer',
       u'Data Scientist Internship (Summer 2017)',
       u'Firmware Data Scientist', u'Data Scientist - Uber for Business',
       u'Website Designer and Manager',
       u'Clinical Data Scientist (Manager)', u'Data Scientist II',
       u'Sr. Data Scientist (Operati

#### We don't need no stinkin' Technical Recruiters or UX Content Designers muddying up our stats!
<i>(No offense to Technical Recruiters or UX Content Designers.)</i> 😅

In [125]:
ignore_these = ['UX', 'Recruiter', 'Sales', 'Bioinformatics', 'Connect', 'Biologist', \
                'Client']

indeed = indeed[~indeed['jobtitle'].str.contains('|'.join(ignore_these))]

indeed.reset_index(drop=True, inplace=True)

## Let's clean up and consolidate these job types. 🛀🏼

In [126]:
indeed.jobtitle = indeed.jobtitle.str.lower()

# Ordered (keyword, label) rules, applied one after another: any title that
# contains the keyword is replaced by the label.
title_rules = [('mgr', 'manager'), ('sr', 'senior'), ('senior', 'senior'), ('head', 'head'),
               ('manager', 'manager'), ('principle', 'senior'), ('principal', 'senior'),
               ('ii', 'senior'), ('engineer', 'engineer'), ('lead', 'senior'), ('nlp', 'nlp'),
               ('internship', 'intern'), ('intern', 'intern'), ('contract', 'contract'),
               ('analyst', 'analyst'), ('analytic', 'analyst'), ('natural', 'nlp'),
               ('analsyt', 'analyst'), ('data scientist', 'data scientist'),
               ('director', 'director'), ('vice', 'director')]

for keyword, label in title_rules:
    indeed.jobtitle = indeed.jobtitle.map(lambda x, kw=keyword, lab=label: lab if kw in x else x)
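
Plain substring checks can fire on unintended words (an 'ii' rule would also catch 'hawaii', and 'intern' would catch 'international'). Here's a hedged variant using regex word boundaries. It is not the notebook's method: `TITLE_RULES` and `normalize_title` are hypothetical, the rule list is shortened, and the first matching rule wins.

```python
import re

# (pattern, label) pairs checked in order; \b keeps matches to whole words.
TITLE_RULES = [
    (r'\b(mgr|manager)\b', 'manager'),
    (r'\b(sr|senior|principal|lead|ii)\b', 'senior'),
    (r'\bintern(ship)?\b', 'intern'),
    (r'\bdata scientist\b', 'data scientist'),
]


def normalize_title(title):
    """Map a raw job title to a coarse label; unmatched titles come back lowercased."""
    lowered = title.lower()
    for pattern, label in TITLE_RULES:
        if re.search(pattern, lowered):
            return label
    return lowered


for raw in ['Sr. Data Scientist', 'Data Scientist II',
            'Data Scientist, International Markets']:
    print(raw, '->', normalize_title(raw))
```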

In [127]:
indeed.jobtitle.unique()

array(['manager', 'intern', 'data scientist', 'analyst', 'senior', 'nlp',
       'engineer', 'contract', 'director'], dtype=object)

## Lookin' good!
Now, we can dig into the job descriptions in the <i>Snippet</i> column.

#### First step: getting a good idea of the special words in the job snippets.

Cleaning up the <i>Snippet</i> feature.

In [128]:
indeed['snippet'] = indeed['snippet'].str.replace('<b>', '')
indeed['snippet'] = indeed['snippet'].str.replace('</b>', '')
indeed['snippet'] = indeed['snippet'].str.replace('.', '')
indeed['snippet'] = indeed['snippet'].str.replace(',', '')
indeed['snippet'] = indeed['snippet'].str.lower()

In [129]:
snippets = indeed.snippet.to_string()

In [130]:
vect = CountVectorizer(stop_words='english')
vect.fit_transform(indeed.snippet)

<6537x808 sparse matrix of type '<type 'numpy.int64'>'
	with 86587 stored elements in Compressed Sparse Row format>

In [131]:
vocab_dict = vect.vocabulary_
vocab = [[k,v] for k,v in vocab_dict.items()]

In [132]:
indeed_vocab = pd.DataFrame(vocab)
# vocabulary_ maps each word to its column index in the count matrix, not to a frequency
indeed_vocab.columns = ['word', 'index']
print indeed_vocab.sort_values('index', ascending=False).to_string()

                 word  index
666              yume    807
387            yodlee    806
430             years    805
216           writing    804
29            wrangle    803
696             world    802
109           working    801
762              work    800
744              wish    799
467              wide    798
613               web    797
691           weaving    796
727               way    795
775         warehouse    794
516             wants    793
148              want    792
549                vp    791
736            volume    790
22        visualizing    789
586    visualizations    788
471     visualization    787
370         visionary    786
101           visible    785
769           virtual    784
443              view    783
626        vertically    782
157            verify    781
465           various    780
766           variety    779
537             value    778
583          validate    777
204           utilize    776
607             using    775
783           
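
The numbers in that listing are alphabetical column positions from `vocabulary_`, so they don't say how often a word appears. For actual frequencies, one option is the `Counter` already imported at the top of the notebook. A minimal sketch on made-up snippets (no stop-word removal here, unlike the `CountVectorizer` setup):

```python
from collections import Counter

# Made-up snippets standing in for the cleaned, lowercased 'snippet' column.
snippets = [
    'experience with python and sql',
    'python machine learning experience',
    'sql and tableau dashboards',
]

# Count every whitespace-separated token across all snippets.
term_counts = Counter(word for snippet in snippets for word in snippet.split())

print(term_counts['python'], term_counts['tableau'])
```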

#### Splitting up the values of the Snippet feature for comparison.

In [133]:
indeed['snippet'] = indeed['snippet'].str.split()
indeed['split_jobtitle'] = indeed['jobtitle'].str.split()

#### I used the vocabulary to compile three lists of values based on areas of expertise. <br>
I then combine them into a set for comparison with the JobTitle and Snippet features.

In [371]:
computering_skills = ['scala', 'python', 'r', 'hadoop', 'sql', 'nosql', 'mongodb', 'tableau', \
                      'spark', 'go', 'julia', 'd3', 'javascript', 'html', 'css', 'zoomdata', \
                      'insight', 've', 'spotfire', 'sas', 'pagerduty', 'owler', 'meetme', \
                      'mattermark', 'liveramp', 'java', 'hive', 'harnham', 'excel', \
                      'adadyn', 'yume', 'yodlee', 'google']

technical_skills = ['statistics', 'machine learning', 'algorithm', 'algorithms', \
                    'deep learning', 'ai', 'visualize', 'visualization', 'visualizations', \
                    'writing', 'wrangle', 'wrangling', 'triage', 'train', 'training', \
                    'testing', 'teach', 'teaching', 'security', 'parallelization', \
                    'nlp', 'model', 'modeling', 'modelling', 'ml', 'mining', \
                    'mentor', 'mathematician', 'mathematics', 'mathematical', \
                    'marketing', 'forecasting', 'algorithmic', 'aggregation', \
                    'storytelling']

experience_skills = ['manager', 'intern', 'data scientist', 'analyst', 'masters', 
                     'phd', 'senior', 'nlp', 'engineer', 'contract', 
                     'head', 'director']

combined_skills = set(computering_skills +  \
                      technical_skills + \
                      experience_skills)

## Creating a cumulative number for all skills.

In [369]:
indeed['skill_count'] = 0
indeed['title_count'] = 0

for idx in indeed.index:
    intersect = list(set(indeed['snippet'][idx]).intersection(combined_skills))
    indeed['skill_count'][idx] = len(intersect)
    
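# NOTE: indeed['jobtitle'] holds a plain string, so set(...) below is a set of single
# characters and only one-letter skills such as 'r' can match. Comparing whole words
# would mean using indeed['split_jobtitle'] here; the outputs that follow reflect the code as written.
for idx in indeed.index:
    intersect = list(set(indeed['jobtitle'][idx]).intersection(combined_skills))
    indeed['title_count'][idx] = len(intersect)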
    
indeed['total_skills'] = indeed['skill_count'] + indeed['title_count']
indeed.drop('skill_count', axis=1, inplace=True)
indeed.drop('title_count', axis=1, inplace=True)
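
The skill count is a set intersection between a row's tokens and the skill vocabulary. A minimal sketch with a made-up `skills` set shows two things worth knowing: passing a string instead of a token list compares single characters, and multi-word entries such as 'machine learning' can never match single tokens.

```python
# Made-up skill vocabulary for illustration.
skills = {'python', 'r', 'sql', 'analyst', 'machine learning'}


def count_skills(tokens):
    """Number of distinct items in `tokens` that are also in the skill set."""
    return len(set(tokens) & skills)


as_string = count_skills('director')         # set of characters: only 'r' matches
as_words = count_skills('director'.split())  # whole word: nothing matches
snippet_hits = count_skills('machine learning with python and r'.split())

print(as_string, as_words, snippet_hits)
```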

In [370]:
indeed

| | company | date | indeedApply | jobtitle | snippet | salary | daysAgo | split_jobtitle | total_skills |
|---|---|---|---|---|---|---|---|---|---|
| 0 | IBM | 2017-05-15 21:28:22 | False | manager | [data, scientist, community, manager, responsi...] | 65000 | 3 | [manager] | 2 |
| 1 | Upstart | 2017-03-21 05:52:04 | False | intern | [we're, looking, for, someone, to, join, our, ...] | 65000 | 30 | [intern] | 2 |
| 2 | Walmart eCommerce | 2017-03-11 08:46:28 | False | intern | [experience, with, statistical, analysis, data...] | 65000 | 30 | [intern] | 3 |
| 3 | Payette Group | 2017-03-14 06:06:07 | False | data scientist | [having, tens, of, thousands, of, debtors, and...] | 65000 | 30 | [data, scientist] | 0 |
| 4 | Ipsos North America | 2017-02-27 07:10:07 | False | analyst | [love, data, consumer, decision-making, and, p...] | 65000 | 30 | [analyst] | 0 |
| 5 | Gametime United, Inc. | 2017-01-30 23:06:46 | False | data scientist | [process, clean, and, verify, data, used, for,...] | 65000 | 30 | [data, scientist] | 1 |
| 6 | IBM | 2017-05-15 21:28:22 | False | manager | [data, scientist, community, manager, responsi...] | 65000 | 3 | [manager] | 2 |
| 7 | Upstart | 2017-03-21 05:52:04 | False | intern | [we're, looking, for, someone, to, join, our, ...] | 65000 | 30 | [intern] | 2 |
| 8 | Walmart eCommerce | 2017-03-11 08:46:28 | False | intern | [experience, with, statistical, analysis, data...] | 65000 | 30 | [intern] | 3 |
| 9 | Payette Group | 2017-03-14 06:06:07 | False | data scientist | [having, tens, of, thousands, of, debtors, and...] | 65000 | 30 | [data, scientist] | 0 |


# Let's get testing! 🏁

In [305]:
indeed_copy = indeed.copy()
indeed_copy.drop('snippet', axis=1, inplace=True)
indeed_copy.drop('date', axis=1, inplace=True)
indeed_copy.drop('split_jobtitle', axis=1, inplace=True)

In [306]:
dummies = ['indeedApply', 'jobtitle', 'company']
dummy_df = pd.get_dummies(indeed_copy[dummies])

In [307]:
indeed_analysis = pd.concat([indeed_copy, dummy_df], axis=1)
indeed_analysis.drop(dummies, axis=1, inplace=True)

## Train Test 🚂

In [308]:
X = indeed_analysis
y = indeed_analysis['salary'].ravel()
X.drop('salary', axis=1, inplace=True)

In [309]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

In [310]:
print 'X train shape: ', X_train.shape
print 'y train shape: ', y_train.shape
print 'X test shape', X_test.shape
print 'y test shape: ', y_test.shape

X train shape:  (3268, 122)
y train shape:  (3268,)
X test shape (3269, 122)
y test shape:  (3269,)


## Try a Lasso... 🐴

In [311]:
def rmse_cv(model):
    # scikit-learn reports negated MSE per fold, so flip the sign before taking the square root
    rmse = np.sqrt(-cross_val_score(model, X_train, y_train, scoring="neg_mean_squared_error", cv=5))
    return rmse
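
`rmse_cv` returns one root-mean-squared error per fold, in dollars. For reference, the RMSE of a single set of predictions is the square root of the mean squared difference. A tiny worked sketch with made-up salaries (`rmse` here is a hypothetical helper, not part of the notebook):

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two predictions are off by $10,000 and one is exact.
error = rmse([65000, 85000, 105000], [75000, 85000, 95000])
print(round(error, 2))
```

Because the salary targets sit $20,000 apart, an RMSE near $10,000 means a typical miss of about half a band.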

In [312]:
model_lasso = LassoCV(n_alphas=100, selection='random', max_iter=15000).fit(X_train, y_train)
res = rmse_cv(model_lasso)
print("Mean:",res.mean())
print("Min: ",res.min())

('Mean:', 10274.061856624019)
('Min: ', 9437.059456704199)


In [313]:
lasso_df = indeed_analysis
l_coef = pd.Series(model_lasso.coef_, index = lasso_df.columns)
print("Lasso picked " + str(sum(l_coef != 0)) + " variables and eliminated the other " +  str(sum(l_coef == 0)) + " variables")

Lasso picked 78 variables and eliminated the other 44 variables


In [314]:
lasso_coef = pd.concat([l_coef.sort_values().head(10),
                        l_coef.sort_values().tail(10)])

lasso_coef.iplot(kind = "barh", title='Coefficients in the Lasso Model')

In [316]:
lasso_coef.sort_values()

company_IBM                           -84022.673694
company_Ipsos North America           -67540.509145
company_Payette Group                 -55972.149354
company_General Assembly              -53038.581281
jobtitle_intern                       -51345.462695
company_Wells Fargo                   -48517.377445
company_Blackstone Technology Group   -42302.611363
company_Abl Schools                   -38191.830602
company_Gametime United, Inc.         -36817.288700
company_Medal                         -36483.230255
company_Elevate Recruiting Group       27964.973040
company_Credit Karma                   29157.302926
company_Corporate Labs Technology      29331.536197
company_Twitter                        29625.223246
jobtitle_director                      33796.686948
company_DuPont                         38544.695681
company_Kite Staffing                  43941.335895
company_Workbridge Associates          45864.827060
company_Harnham                        50471.202191
company_Jobs

## Try a Ridge...

In [317]:
model_ridge = RidgeCV(alphas=(0.01, 0.1, 1.0, 10, 100)).fit(X_train, y_train)
res = rmse_cv(model_ridge)
print("Mean:",res.mean())
print("Min: ",res.min())

('Mean:', 8919.6810894183545)
('Min: ', 8242.1713350845203)


In [318]:
ridge_df = indeed_analysis
r_coef = pd.Series(model_ridge.coef_, index = ridge_df.columns)
print("Ridge picked " + str(sum(r_coef != 0)) + " variables and eliminated the other " +  str(sum(r_coef == 0)) + " variables")

Ridge picked 121 variables and eliminated the other 1 variables


In [319]:
ridge_coef = pd.concat([r_coef.sort_values().head(10),
                    r_coef.sort_values().tail(10)])

ridge_coef.iplot(kind = "barh", title='Coefficients in the Ridge Model')

In [320]:
ridge_coef.sort_values()

company_IBM                     -106833.242428
company_Pfizer Inc.              -72812.295110
company_Zoomdata                 -72520.106139
company_Payette Group            -56922.355972
company_Ipsos North America      -56863.790561
company_General Assembly         -53155.756053
jobtitle_intern                  -49586.991240
company_Abl Schools              -40302.019542
company_Levi Strauss & Co.       -40274.778668
company_Stride Search            -40110.837913
company_Credit Karma              39407.263919
company_DuPont                    43054.393620
jobtitle_manager                  43517.694599
company_Demandbase                43969.504542
company_Kite Staffing             51228.928988
company_Workbridge Associates     51996.861721
company_Harnham                   54998.455339
company_Ticketfly                 55314.066609
jobtitle_director                 63601.710967
company_Jobspring Partners        87373.901211
dtype: float64

## Try an ElasticNet...

In [321]:
model_en = ElasticNetCV(n_alphas=100, alphas=(0.01, 0.1, 1.0, 10, 100, 500), \
                        max_iter=15000, cv=5, n_jobs=-1).fit(X_train, y_train)
res = rmse_cv(model_en)
print("Mean:",res.mean())
print("Min: ",res.min())

('Mean:', 13350.834906986347)
('Min: ', 12652.25994496274)


In [322]:
en_df = indeed_analysis
e_coef = pd.Series(model_en.coef_, index = en_df.columns)
print("ElasticNet picked " + str(sum(e_coef != 0)) + " variables and eliminated the other " +  str(sum(e_coef == 0)) + " variables")

ElasticNet picked 121 variables and eliminated the other 1 variables


In [323]:
eln_coef = pd.concat([e_coef.sort_values().head(10),
                     e_coef.sort_values().tail(10)])

eln_coef.iplot(kind = "barh", title='Coefficients in the ElasticNet Model')

In [325]:
eln_coef.sort_values()

jobtitle_intern                     -40863.207666
company_IBM                         -39564.969077
company_Payette Group               -38947.121230
company_General Assembly            -35368.776276
company_Ipsos North America         -34783.147622
company_Abl Schools                 -28156.762787
company_Gametime United, Inc.       -27861.136472
company_Microsoft                   -23336.086066
company_Medal                       -22831.511337
company_LiveRamp                    -22520.104033
company_Credit Karma                 21393.799291
company_Twitter                      22826.419270
company_Corporate Labs Technology    24421.894934
company_Elevate Recruiting Group     29172.747501
jobtitle_director                    31531.059964
company_DuPont                       33516.096162
company_Kite Staffing                35063.508566
company_Harnham                      37683.089430
company_Workbridge Associates        40439.143442
company_Jobspring Partners           41121.645385


## 6 items consistently have the most negative coefficients across all three models.
#### (One of which is General Assembly. 🙊)

### LassoCV
<b><i>company_IBM                                      -83680.127033</i></b><br>
<b><i>company_Ipsos North America                      -61283.945882</i></b><br>
<b><i>company_Payette Group                            -55162.924482</i></b><br>
<b><i>jobtitle_intern                                  -52931.827562</i></b><br>
<b><i>company_General Assembly                         -51289.012674</i></b><br>
<b><i>company_Abl Schools                              -37933.885551</i></b><br>
company_Wells Fargo                              -37889.699213<br>
company_Gametime United, Inc.                    -37171.204553<br>
company_Blackstone Technology Group              -35675.025423<br>
company_University of California San Francisco   -34951.945051

### RidgeCV

<b><i>company_IBM                        -95302.293940</i></b><br>
company_Pfizer Inc.                -60019.066497<br>
company_Zoomdata                   -59851.290745<br>
<b><i>company_Payette Group              -56326.740790</i></b><br>
<b><i>company_Ipsos North America        -56002.468068</i></b><br>
<b><i>jobtitle_intern                    -52122.052715</i></b><br>
<b><i>company_General Assembly           -50058.492898</i></b><br>
<b><i>company_Abl Schools                -38392.647131</i></b><br>
company_Levi Strauss & Co.         -38091.714666<br>
jobtitle_contract                  -37356.497330

### ElasticNetCV

<b><i>company_IBM                                      -42188.536367</i></b><br>
<b><i>jobtitle_intern                                  -40680.543325</i></b><br>
<b><i>company_Payette Group                            -36127.867658</i></b><br>
<b><i>company_Ipsos North America                      -35141.452584</i></b><br>
<b><i>company_General Assembly                         -35133.725124</i></b><br>
company_Gametime United, Inc.                    -29467.558302<br>
<b><i>company_Abl Schools                              -29155.321394</i></b><br>
company_University of California San Francisco   -22169.940114<br>
company_LiveRamp                                 -21696.303155<br>
company_Microsoft                                -21495.080653

# 🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫

## Setting up for a regression.

In [835]:
indeed_copy = indeed.copy()
indeed_copy.drop('snippet', axis=1, inplace=True)
indeed_copy.drop('date', axis=1, inplace=True)
indeed_copy.drop('split_jobtitle', axis=1, inplace=True)

In [836]:
dummies = ['indeedApply', 'jobtitle', 'company']
dummy_df = pd.get_dummies(indeed_copy[dummies])

In [837]:
indeed_analysis = pd.concat([indeed_copy, dummy_df], axis=1)
indeed_analysis.drop(dummies, axis=1, inplace=True)

### I'll drop the 44 features that LassoCV shrank to zero.

In [838]:
poor_performers = [u'jobtitle_nlp', u'company_6sense', u'company_Activision',
       u'company_Adadyn', u'company_BOLD', u'company_Big Fish',
       u'company_Blend Labs', u'company_Chariot', u'company_CircleUp',
       u'company_Counsyl', u'company_Dropbox', u'company_Genedata',
       u'company_Glassdoor', u'company_Intelliswift Software, Inc.',
       u'company_LendUp', u'company_Lending Club', u'company_Near',
       u'company_Nomis Solutions', u'company_Nuna', u'company_Oracle',
       u'company_PI Benchmark', u'company_Pfizer Inc.', u'company_Pinterest',
       u'company_Product Madness', u'company_Reddit', u'company_Rocket Lawyer',
       u'company_Salesforce', u'company_Samba TV', u'company_Showpad',
       u'company_Shutterfly', u'company_Stride Search', u'company_Supercell',
       u'company_TellApart', u'company_Thumbtack', u'company_Ticketfly',
       u'company_TrueAccord', u'company_Ultimate Software',
       u'company_Walmart eCommerce', u'company_WePay', u'company_Yodlee',
       u'company_YuMe', u'company_Zoomdata', u'company_art.com',
       u'company_ironSource']

indeed_analysis.drop(poor_performers, axis=1, inplace=True)
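
The list above is typed in by hand. Here is a rough sketch of how a list like it could be pulled straight from a fitted LassoCV, assuming "poor performer" means a feature whose coefficient LassoCV shrank to exactly zero. The data and the column names `signal` and `dead` are made up for illustration and are not from the Indeed data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Hypothetical toy data: one informative feature and one all-zero dummy column.
rng = np.random.RandomState(0)
n = 80
X = pd.DataFrame({
    'signal': rng.normal(size=n),
    'dead': np.zeros(n),
})
y = 100000 + 20000 * X['signal'] + rng.normal(scale=1000, size=n)

# Fit LassoCV and label each coefficient with its column name.
lasso = LassoCV(cv=5).fit(X, y)
coefs = pd.Series(lasso.coef_, index=X.columns)

# Features with a coefficient of exactly zero are candidates to drop.
zeroed = coefs[coefs == 0].index.tolist()
print(zeroed)
```

The resulting list could then be passed to `drop(..., axis=1)` in place of the hard-coded names.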

## Train Test 🚋

In [839]:
X = indeed_analysis
y = indeed_analysis['salary'].ravel()
X.drop('salary', axis=1, inplace=True)

In [840]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

In [841]:
print 'X train shape: ', X_train.shape
print 'y train shape: ', y_train.shape
print 'X test shape', X_test.shape
print 'y test shape: ', y_test.shape

X train shape:  (3268, 78)
y train shape:  (3268,)
X test shape (3269, 78)
y test shape:  (3269,)


## And now... Regression. 🕵🏻

In [842]:
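lr = LinearRegression(n_jobs=-1)

lr.fit(X_train, y_train)

# NOTE: sort_index() reorders the rows of X_train, but y_train is a NumPy array
# that stays in shuffled order, so the two are not row-aligned downstream.
pred_salary = pd.DataFrame()
pred_salary['Predicted Salary'] = lr.predict(X_train.sort_index())

# cross_val_score clones lr and refits it on folds of the test half only.
# For regressors the default score is R^2, not classification accuracy.
scores = cross_val_score(lr, X_test, y_test, cv=5)

print '---\n5 Fold Cross Validation Scores:', \
      'Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2)

# Axis labels for the plot further down; the second string holds 2 * std.
salary_score = '5-Fold CV Mean Average: {percent:.2%}'.format(percent=scores.mean())
salary_std = 'Standard Deviation: {percent:.2%}'.format(percent=scores.std() * 2)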

---
5 Fold Cross Validation Scores: Accuracy: 0.93 (+/- 0.02)
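
For a regressor, `cross_val_score` reports R² by default, so the 0.93 here is a share of variance explained rather than a fraction of correct predictions. A small sketch of what R² measures, next to mean absolute error (which reads in dollars), using made-up salary values:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    """Average size of the miss, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical true and predicted salaries, not taken from the notebook.
y_true = [65000.0, 105000.0, 145000.0, 185000.0]
y_pred = [75000.0, 100000.0, 150000.0, 175000.0]

r2 = r_squared(y_true, y_pred)             # 0.96875
mae = mean_absolute_error(y_true, y_pred)  # 7500.0
print('R^2: %0.5f, MAE: %0.1f' % (r2, mae))
```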


In [843]:
true_salary = pd.DataFrame(y_train.astype(float)).sort_index()
true_salary.columns = ['True Salary']
true_salary.reset_index(inplace=True, drop=True)
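
One thing to watch when pairing predictions with true values: `y_train` is a NumPy array, so it carries no index, while `X_train.sort_index()` reorders the feature rows. A minimal sketch of the difference, using a hypothetical three-row frame and a stand-in `toy_predict` function instead of a fitted model:

```python
import numpy as np
import pandas as pd

# Hypothetical shuffled training rows; the index holds the original row labels.
X_train = pd.DataFrame({'feature': [30.0, 10.0, 20.0]}, index=[2, 0, 1])
y_series = pd.Series([185000.0, 65000.0, 125000.0], index=[2, 0, 1])
y_array = y_series.values  # NumPy array: the row labels are lost

def toy_predict(X):
    # Stand-in for model.predict: a row-wise function that is exactly right.
    return X['feature'].values * 6000.0 + 5000.0

# Sorting X but not y pairs each prediction with some other row's target.
misaligned = pd.DataFrame({
    'Predicted Salary': toy_predict(X_train.sort_index()),
    'True Salary': pd.DataFrame(y_array).sort_index()[0].values,
})

# Keeping the index on both sides lets pandas line the rows up by label.
aligned = pd.DataFrame({
    'Predicted Salary': pd.Series(toy_predict(X_train), index=X_train.index),
    'True Salary': y_series,
}).sort_index()

print(misaligned)
print(aligned)
```

In the aligned frame every difference is zero, while the misaligned frame shows large errors even though the stand-in predictions are perfect.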

In [844]:
predicted_salary = pd.concat([pred_salary, true_salary], axis=1)
predicted_salary['Difference'] = predicted_salary['True Salary'] - predicted_salary['Predicted Salary']
predicted_salary.sort_values('True Salary', inplace=True)
predicted_salary.reset_index(drop=True, inplace=True)

In [845]:
predicted_salary.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3258 | 3259 | 3260 | 3261 | 3262 | 3263 | 3264 | 3265 | 3266 | 3267 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Predicted Salary | 132768.330817 | 165000.0 | 188493.992563 | 65000.0 | 105000.0 | 97626.402182 | 140289.407169 | 180872.386602 | 145000.0 | 182878.612457 | ... | 169585.712418 | 180872.386602 | 81506.117208 | 90227.490057 | 85000.0 | 165000.0 | 169585.712418 | 125097.982954 | 161640.893338 | 102482.258622 |
| True Salary | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | ... | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 |
| Difference | -67768.330817 | -100000.0 | -123493.992563 | 2.182787e-10 | -40000.0 | -32626.402182 | -75289.407169 | -115872.386602 | -80000.0 | -117878.612457 | ... | 15414.287582 | 4127.613398 | 103493.882792 | 94772.509943 | 100000.0 | 20000.0 | 15414.287582 | 59902.017046 | 23359.106662 | 82517.741378 |


In [846]:
predicted_salary['Difference'].mean()

-2.3077787960237725e-10
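
A mean difference this close to zero is expected for a least-squares fit with an intercept when it is scored on its own training rows, so it says little about how far off individual predictions are. A short sketch on made-up data, comparing the mean residual with MAE and RMSE:

```python
import numpy as np

# Hypothetical noisy linear data standing in for salaries.
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=200)
y = 60000 + 9000 * x + rng.normal(scale=20000, size=200)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones_like(x), x])
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
pred = A.dot(coef)
residuals = y - pred

mean_residual = residuals.mean()          # ~0 up to floating-point error
mae = np.abs(residuals).mean()            # typical miss, in dollars
rmse = np.sqrt((residuals ** 2).mean())   # penalizes large misses more

# The mean difference stays ~0 even when the targets are paired with the
# wrong rows, because it only depends on the two column means.
shuffled_mean = (rng.permutation(y) - pred).mean()

print('mean: %.2e, MAE: %.0f, RMSE: %.0f' % (mean_residual, mae, rmse))
```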

In [847]:
predicted_salary.iplot(kind='spread', title='Linear Regression with Salary as Target', \
                       opacity=20, xTitle=salary_score, yTitle=salary_std)

## These scores are only okay...

## Set it up again!

In [848]:
indeed_copy = indeed.copy()

In [849]:
indeed_copy.drop('snippet', axis=1, inplace=True)
indeed_copy.drop('date', axis=1, inplace=True)
indeed_copy.drop('split_jobtitle', axis=1, inplace=True)

In [850]:
dummies = ['indeedApply', 'jobtitle', 'company']
dummy_df = pd.get_dummies(indeed_copy[dummies])

In [851]:
indeed_analysis = pd.concat([indeed_copy, dummy_df], axis=1)
indeed_analysis.drop(dummies, axis=1, inplace=True)

In [852]:
poor_performers = [u'jobtitle_nlp', u'company_6sense', u'company_Activision',
       u'company_Adadyn', u'company_BOLD', u'company_Big Fish',
       u'company_Blend Labs', u'company_Chariot', u'company_CircleUp',
       u'company_Counsyl', u'company_Dropbox', u'company_Genedata',
       u'company_Glassdoor', u'company_Intelliswift Software, Inc.',
       u'company_LendUp', u'company_Lending Club', u'company_Near',
       u'company_Nomis Solutions', u'company_Nuna', u'company_Oracle',
       u'company_PI Benchmark', u'company_Pfizer Inc.', u'company_Pinterest',
       u'company_Product Madness', u'company_Reddit', u'company_Rocket Lawyer',
       u'company_Salesforce', u'company_Samba TV', u'company_Showpad',
       u'company_Shutterfly', u'company_Stride Search', u'company_Supercell',
       u'company_TellApart', u'company_Thumbtack', u'company_Ticketfly',
       u'company_TrueAccord', u'company_Ultimate Software',
       u'company_Walmart eCommerce', u'company_WePay', u'company_Yodlee',
       u'company_YuMe', u'company_Zoomdata', u'company_art.com',
       u'company_ironSource']

indeed_analysis.drop(poor_performers, axis=1, inplace=True)

## Train Test 🚋

In [853]:
X = indeed_analysis
y = indeed_analysis['salary'].ravel()
X.drop('salary', axis=1, inplace=True)

In [854]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

In [855]:
print 'X train shape: ', X_train.shape
print 'y train shape: ', y_train.shape
print 'X test shape', X_test.shape
print 'y test shape: ', y_test.shape

X train shape:  (3268, 78)
y train shape:  (3268,)
X test shape (3269, 78)
y test shape:  (3269,)


## And now... Regression(s).
### Let's shake this up a bit.  
# 🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉

In [856]:
lr = LinearRegression(n_jobs=-1)

lr.fit(X_train, y_train)

pred_lr_salary = pd.DataFrame()
pred_lr_salary['Predicted Salary'] = lr.predict(X_train.sort_index())

scores = cross_val_score(lr, X_test, y_test, cv=5)

print '---\n5 Fold Cross Validation Scores:', \
      'Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2)

lr_salary_score = '5-Fold CV Mean Average: {percent:.2%}'.format(percent=scores.mean())
lr_salary_std = 'Standard Deviation: {percent:.2%}'.format(percent=scores.std() * 2)

---
5 Fold Cross Validation Scores: Accuracy: 0.93 (+/- 0.02)


In [857]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=50)

rfr.fit(X_train, y_train)

pred_rfr_salary = pd.DataFrame()
pred_rfr_salary['Predicted Salary'] = rfr.predict(X_train.sort_index())

scores = cross_val_score(rfr, X_test, y_test, cv=5)

print '---\n5 Fold Cross Validation Scores:', \
      'Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2)

rfr_salary_score = '5-Fold CV Mean Average: {percent:.2%}'.format(percent=scores.mean())
rfr_salary_std = 'Standard Deviation: {percent:.2%}'.format(percent=scores.std() * 2)

---
5 Fold Cross Validation Scores: Accuracy: 0.98 (+/- 0.02)


In [858]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5)

knn.fit(X_train, y_train)

pred_knn_salary = pd.DataFrame()
pred_knn_salary['Predicted Salary'] = knn.predict(X_train.sort_index())

scores = cross_val_score(knn, X_test, y_test, cv=5)

print '---\n5 Fold Cross Validation Scores:', \
      'Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2)

knn_salary_score = '5-Fold CV Mean Average: {percent:.2%}'.format(percent=scores.mean())
knn_salary_std = 'Standard Deviation: {percent:.2%}'.format(percent=scores.std() * 2)

---
5 Fold Cross Validation Scores: Accuracy: 0.96 (+/- 0.02)
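
The three cells above follow the same pattern, so they could also be written as one loop. A minimal sketch on synthetic data (the features and target below are invented, not the Indeed data). Note that `cross_val_score` clones each estimator and refits it on every fold, so a separate `.fit` call beforehand has no effect on the scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical features and salary-like target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = 100000 + X.dot([20000, -5000, 0, 3000]) + rng.normal(scale=5000, size=200)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=0),
    'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5),
}

results = {}
for name, model in models.items():
    # Default scoring for regressors is R^2.
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = (scores.mean(), scores.std() * 2)
    print('%s: %0.2f (+/- %0.2f)' % (name, scores.mean(), scores.std() * 2))
```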


In [859]:
true_salary = pd.DataFrame(y_train.astype(float)).sort_index()
true_salary.columns = ['True Salary']
true_salary.reset_index(inplace=True, drop=True)

In [860]:
predicted_lr_salary = pd.concat([pred_lr_salary, true_salary], axis=1)
predicted_lr_salary['Difference'] = predicted_lr_salary['True Salary'] - predicted_lr_salary['Predicted Salary']
predicted_lr_salary.sort_values('True Salary', inplace=True)
predicted_lr_salary.reset_index(drop=True, inplace=True)

In [861]:
predicted_rfr_salary = pd.concat([pred_rfr_salary, true_salary], axis=1)
predicted_rfr_salary['Difference'] = predicted_rfr_salary['True Salary'] - predicted_rfr_salary['Predicted Salary']
predicted_rfr_salary.sort_values('True Salary', inplace=True)
predicted_rfr_salary.reset_index(drop=True, inplace=True)

In [862]:
predicted_knn_salary = pd.concat([pred_knn_salary, true_salary], axis=1)
predicted_knn_salary['Difference'] = predicted_knn_salary['True Salary'] - predicted_knn_salary['Predicted Salary']
predicted_knn_salary.sort_values('True Salary', inplace=True)
predicted_knn_salary.reset_index(drop=True, inplace=True)

In [863]:
predicted_lr_salary.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3258 | 3259 | 3260 | 3261 | 3262 | 3263 | 3264 | 3265 | 3266 | 3267 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Predicted Salary | 145000.0 | 105000.0 | 125000.0 | 182099.258868 | 138486.212852 | 111909.685011 | 75561.091741 | 105000.0 | 134013.07924 | 92048.631148 | ... | 125707.670189 | 125707.670189 | 125707.670189 | 125707.670189 | 111542.647789 | 125000.0 | 125707.670189 | 123956.306793 | 125707.670189 | 125000.0 |
| True Salary | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | ... | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 |
| Difference | -80000.0 | -40000.0 | -60000.0 | -117099.258868 | -73486.212852 | -46909.685011 | -10561.091741 | -40000.0 | -69013.07924 | -27048.631148 | ... | 59292.329811 | 59292.329811 | 59292.329811 | 59292.329811 | 73457.352211 | 60000.0 | 59292.329811 | 61043.693207 | 59292.329811 | 60000.0 |


In [864]:
predicted_rfr_salary.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3258 | 3259 | 3260 | 3261 | 3262 | 3263 | 3264 | 3265 | 3266 | 3267 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Predicted Salary | 145000.0 | 105000.0 | 125000.0 | 185000.0 | 142713.030598 | 125000.0 | 72719.132171 | 105000.0 | 125000.0 | 85000.0 | ... | 125380.982471 | 125380.982471 | 125380.982471 | 125380.982471 | 125000.0 | 125000.0 | 125380.982471 | 125000.0 | 125380.982471 | 125000.0 |
| True Salary | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | ... | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 |
| Difference | -80000.0 | -40000.0 | -60000.0 | -120000.0 | -77713.030598 | -60000.0 | -7719.132171 | -40000.0 | -60000.0 | -20000.0 | ... | 59619.017529 | 59619.017529 | 59619.017529 | 59619.017529 | 60000.0 | 60000.0 | 59619.017529 | 60000.0 | 59619.017529 | 60000.0 |


In [865]:
predicted_knn_salary.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3258 | 3259 | 3260 | 3261 | 3262 | 3263 | 3264 | 3265 | 3266 | 3267 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Predicted Salary | 145000.0 | 105000.0 | 125000.0 | 185000.0 | 141000.0 | 125000.0 | 65000.0 | 105000.0 | 125000.0 | 85000.0 | ... | 129000.0 | 129000.0 | 129000.0 | 129000.0 | 125000.0 | 125000.0 | 129000.0 | 125000.0 | 129000.0 | 125000.0 |
| True Salary | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | 65000.0 | ... | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 | 185000.0 |
| Difference | -80000.0 | -40000.0 | -60000.0 | -120000.0 | -76000.0 | -60000.0 | 0.0 | -40000.0 | -60000.0 | -20000.0 | ... | 56000.0 | 56000.0 | 56000.0 | 56000.0 | 60000.0 | 60000.0 | 56000.0 | 60000.0 | 56000.0 | 60000.0 |


In [866]:
predicted_lr_salary.iplot(kind='spread', title='Linear Regression with Salary as Target', \
                          opacity=20, xTitle=lr_salary_score, yTitle=lr_salary_std, mean=True)

In [867]:
predicted_rfr_salary.iplot(kind='spread', title='Random Forest Regression with Salary as Target', \
                           opacity=20, xTitle=rfr_salary_score, yTitle=rfr_salary_std, mean=True)

In [868]:
predicted_knn_salary.iplot(kind='spread', title='K-Nearest Neighbors Regression with Salary as Target', \
                           opacity=20, xTitle=knn_salary_score, yTitle=knn_salary_std, mean=True)

### <b>Rerunning the Linear Regression changed its results only minimally.</b> 😿
### <b>BUT! The results from the Random Forest and K-Nearest Neighbors regressions were solid!</b>

# Well, lesson learned.
## Job descriptions pulled from the Indeed API are only okay-ish predictors of salary.
## I expect that this experiment would fare better if the API provided complete job descriptions instead of snippets.