# Project 4: Web Scraping Job Postings

Forecasting salary ranges for data science job postings.
# 👨🏻‍💻👩🏻‍💻👨🏼‍💻👩🏼‍💻👨🏽‍💻👩🏽‍💻👨🏾‍💻👩🏾‍💻👨🏿‍💻👩🏿‍💻👨🏻‍💻👩🏻‍💻👨🏼‍💻👩🏼‍💻👨🏽‍💻👩🏽‍💻👨🏾‍💻👩🏾‍💻👨🏿‍💻👩🏿‍💻

In [1]:
import pprint
import requests
import json
import pandas as pd
import urllib2
import time
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression
import matplotlib
import plotly.plotly as py
import cufflinks as cf

# Using the Indeed API to collect data
from indeed import IndeedClient

# Shushing the compiler
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('chained_assignment',None)

In [2]:
# The API requires passing the client's IP.
# So, I'll create a function to collect that.

def get_ip():
    ext_ip = urllib2.urlopen('http://whatismyip.org').read()
    return ext_ip

In [3]:
# Setting my Indeed Developer API Key as a variable

client = IndeedClient(publisher = 1295525004807710)

In [4]:
# The parameter structure for the API request:

params = {
        'q': "data+scientist",
        'l': "any",                                                      
        'start': 0,                                               
        'end': 5000,
        'limit': 5000,
        'userip': get_ip,
        'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)",
        'sort': 'date',
        'fromage': 'any',
        'co': 'any'
}

In [5]:
indeed = pd.DataFrame()
search_response = client.search(**params)
indeed = indeed.append(search_response['results'], ignore_index=True) 

In [6]:
indeed = pd.DataFrame(search_response['results'])
indeed.shape

(25, 19)

### Uh oh! An API limiting condition!! 😩<br>
The API will only pass 25 listings at a time.<br>
Even though we passed a limit at 5000 and an end at 5000, the API stops at 25.

So, we have to create a range with steps of 24:<br>
<ol>
    <li>Collect the reponse from Indeed</li>
    <li>Append the response to the DataFrame</li>
    <li>Increment the parameters passed to the Indeed servers so that we can bypass their API restriction of 25 records</li>
</ol>
# 🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉

ALSO! I did some prodding of the parameters and the formatting of the advanced search feature on Indeed's website.<br> Although undocumented, it is possible to pass salary ranges into our search parameters.<br> Indeed advises these numbers are approximations:

<i>How much does a Data Scientist in the United States make?<br>

The average Data Scientist salary in the United States is approximately $130,164.
Salary information comes from 36,404 data points collected directly from employees, users, and past and present job advertisements on Indeed in the past 12 months.
Please note that all salary figures are approximations based upon third party submissions to Indeed.</i>

In [20]:
indeed275k = pd.DataFrame()

for i in np.arange(0, 2000, 24):                                                                                                                                                                                            
    params = {                                                                
                'q': "data+scientist",
                'salary': '$265k-275k',
                'l': "san francisco",                                                      
                'start': 0 + i,                                               
                'end': 24 + i,
                'limit': 25,
                'userip': get_ip,
                'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)",
                'sort': 'date',
                'fromage': 'any',
                'co': 'any'
             }
    search_response = client.search(**params)
    indeed275k = indeed275k.append(search_response['results'], ignore_index=True)   
                                                                                  
indeed275k.shape

IndexError: list index out of range

#### Looks like we hit a wall for salaries above $265,000.

Let's run this API call a few times and create multiple DataFrames for the different salary ranges.

In [21]:
# DataFrame Shapes
print 'Indeed $0k - $65k: ', indeed65k.shape
print 'Indeed $65k - $85k: ', indeed85k.shape
print 'Indeed $85k - $105k: ', indeed105k.shape
print 'Indeed $105k - $125k: ', indeed125k.shape
print 'Indeed $125k - $145k: ', indeed145k.shape
print 'Indeed $145k - $165k: ', indeed165k.shape
print 'Indeed $165k - $185k: ', indeed185k.shape
print 'Indeed $185k - $205k: ', indeed205k.shape
print 'Indeed $205k - $225k: ', indeed225k.shape
print 'Indeed $225k - $245k: ', indeed245k.shape
print 'Indeed $245k - $265k: ', indeed265k.shape

Indeed $0k - $65k:  (644, 19)
Indeed $65k - $85k:  (1370, 19)
Indeed $85k - $105k:  (728, 19)
Indeed $105k - $125k:  (1768, 19)
Indeed $125k - $145k:  (826, 19)
Indeed $145k - $165k:  (1097, 19)
Indeed $165k - $185k:  (720, 19)
Indeed $185k - $205k:  (167, 19)
Indeed $205k - $225k:  (84, 19)
Indeed $225k - $245k:  (84, 19)
Indeed $245k - $265k:  (84, 19)


## Let's take a look at some of these DataFrames. 🔍

In [22]:
indeed185k.company.value_counts()

Workbridge Associates       393
Harnham                     163
Elevate Recruiting Group     83
Jobspring Partners           81
Name: company, dtype: int64

In [23]:
indeed205k.company.value_counts()

Workbridge Associates    167
Name: company, dtype: int64

In [24]:
indeed225k.company.value_counts()

Harnham    84
Name: company, dtype: int64

In [25]:
indeed245k.company.value_counts()

Averity    84
Name: company, dtype: int64

In [26]:
indeed265k.company.value_counts()

Workbridge Associates    84
Name: company, dtype: int64

### There doesn't seem to be much value in any of the $205k+ DataFrames.

#### Having these salary ranges allows me to create salary targets within each DataFrame.
# 💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲🤑💲

In [27]:
indeed65k['salary'] = 65000
indeed85k['salary'] = 85000
indeed105k['salary'] = 105000
indeed125k['salary'] = 125000
indeed145k['salary'] = 145000
indeed165k['salary'] = 165000
indeed185k['salary'] = 185000

Let's put these all together into one DataFrame for cleaning and analysis.

In [28]:
yes_indeedy = [indeed85k, indeed105k, indeed125k, indeed145k, indeed165k, \
               indeed185k]

indeed = indeed65k.append(yes_indeedy)

In [29]:
indeed.shape

(7153, 20)

Let's check for duplicate job listings and take those out.

In [30]:
indeed[indeed.duplicated('jobkey', keep=False) == False].count()

city                     124
company                  124
country                  124
date                     124
expired                  124
formattedLocation        124
formattedLocationFull    124
formattedRelativeTime    124
indeedApply              124
jobkey                   124
jobtitle                 124
language                 124
onmousedown              124
snippet                  124
source                   124
sponsored                124
state                    124
stations                 124
url                      124
salary                   124
dtype: int64

Rut-roh! Looks like there are several duplicates!<br>
Let's nuke 'em!
# 💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥💥

In [31]:
indeed = indeed[indeed.duplicated('jobkey', keep=False) == True]
indeed.reset_index(drop=True, inplace=True)

I'm turning this DataFrame into a .CSV file to make a checkpoint of my work.

In [32]:
indeed.to_csv('indeed_collection.csv', sep=',', encoding='utf-8')

In [33]:
indeed.shape

(7029, 20)

Around 7000 rows isn't very meaty, and my intial impulse is to widen the location net.<br> We have to consider that the Bay Area likely pays at a disproportionate rate compared to other regions.<br> So, for the time being, we'll work from here.

How tidy is the rest of the DataFrame?

In [34]:
indeed.dtypes

city                     object
company                  object
country                  object
date                     object
expired                    bool
formattedLocation        object
formattedLocationFull    object
formattedRelativeTime    object
indeedApply                bool
jobkey                   object
jobtitle                 object
language                 object
onmousedown              object
snippet                  object
source                   object
sponsored                  bool
state                    object
stations                 object
url                      object
salary                    int64
dtype: object

In [35]:
indeed.head(1)

Unnamed: 0,city,company,country,date,expired,formattedLocation,formattedLocationFull,formattedRelativeTime,indeedApply,jobkey,jobtitle,language,onmousedown,snippet,source,sponsored,state,stations,url,salary
0,San Francisco,IBM,US,"Mon, 15 May 2017 21:28:22 GMT",False,"San Francisco, CA","San Francisco, CA",3 days ago,False,c33fae3f8ce6ba8c,Data Science Community Manager,en,"indeed_clk(this,'3870');",<b>Data</b> <b>Scientist</b> Community Manager...,IBM,False,CA,,http://www.indeed.com/viewjob?jk=c33fae3f8ce6b...,65000


#### Looks like we have a few wonky columns to drop.
We will drop the <i>Country</i>, <i>State</i>, <i>Stations</i>, <i>Sponsored</i>, <i>Language</i>, and <i>Expired</i> columns, because they are uniform.<br>
We also don't have a need for the individual job listing <i>URL</i>s or the <i>On Mouse Down</i> actions column.<br> The <i>Company</i> and <i>Source</i> features have matching data, so we only need one.<br>The <i>Formatted Location</i> and <i>Formatted Location Full</i> columns also tell us little, so they can go.<br>Also, we no longer need the <i>JobKey</i> feature, because we have so far used it to remove any duplicates.
<br><br> There may be salary differences within the Bay Area, but we don't have a need for this combined detail considering that there is a <i>City</i> column provided.

In [36]:
indeed.drop(['city', 'country', 'state', 'stations', 'sponsored', 'language', 'expired', \
             'formattedLocation', 'source', 'formattedLocationFull', 'url', 'onmousedown', \
             'jobkey'],
             axis=1, inplace=True)

indeed.reset_index(drop=True, inplace=True)

Let's convert the date column to DateTime.

In [37]:
indeed['date'] = pd.to_datetime(indeed['date'], infer_datetime_format=True)

Let's see if we can do something with the <i>Formatted Relative Time</i> column.

In [38]:
indeed.formattedRelativeTime.value_counts()

30+ days ago    5672
2 days ago       175
20 days ago      161
21 days ago      159
7 days ago        98
15 days ago       95
8 days ago        93
6 days ago        90
3 days ago        84
9 days ago        83
17 days ago       80
22 days ago       79
28 days ago       52
26 days ago       42
23 days ago       21
16 days ago       15
12 days ago       13
5 days ago        11
1 day ago          2
29 days ago        2
19 days ago        2
Name: formattedRelativeTime, dtype: int64

In [39]:
indeed.dtypes

company                          object
date                     datetime64[ns]
formattedRelativeTime            object
indeedApply                        bool
jobtitle                         object
snippet                          object
salary                            int64
dtype: object

### Woof. That's no fun. 🤢
I'll need to convert the 'hours ago' listings to 1 day, and then I'll have to extract the integers from the 'day ago' strings.<br> Then, I'll drop the <i>Formatted Relative Time</i> column.

In [40]:
indeed['daysAgo'] = indeed['formattedRelativeTime'].str.extract('(\d+)').astype(int)
indeed['daysAgo'][indeed['formattedRelativeTime'].str.contains('hours')] = 1
indeed.drop('formattedRelativeTime', axis=1, inplace=True)

In [41]:
indeed.head(3)

Unnamed: 0,company,date,indeedApply,jobtitle,snippet,salary,daysAgo
0,IBM,2017-05-15 21:28:22,False,Data Science Community Manager,<b>Data</b> <b>Scientist</b> Community Manager...,65000,3
1,LiveRamp,2017-04-27 18:03:49,True,Senior Technical Recruiter,Successful track record and high level of expe...,65000,21
2,Upstart,2017-03-21 05:52:04,False,Data Scientist (Internship),We're looking for someone to join our <b>data<...,65000,30


### Are they really all data scientist jobs?

In [42]:
indeed.jobtitle.unique()

array([u'Data Science Community Manager', u'Senior Technical Recruiter',
       u'Data Scientist (Internship)',
       u'Data Scientist Intern - Summer 2017', u'Data Scientist',
       u'Market Research Analyst - Analyst Development Program',
       u'Client Support Specialist', u'Analytic Consultant 1',
       u'SAFETY AND INSURANCE DATA SCIENTIST',
       u'Statistician/Predictive Modeler/Data Scientist',
       u'Lead Data Science Instructor', u'BioMedical Data Scientist',
       u'Database Analsyt', u'Statistically Significant Data Scientist',
       u'Data Scientist - Economics & Legal', u'Data Scientist, CDHI',
       u'Connect - Director, Advertising, Media & Technology Research',
       u'Data Scientist - Yammer',
       u'Data Scientist Internship (Summer 2017)',
       u'Firmware Data Scientist', u'Data Scientist - Uber for Business',
       u'Data Scientist - Power Engineering Analytics',
       u'Clinical Data Scientist (Manager)', u'Data Scientist II',
       u'Sr. Data Sc

#### We don't need no stinkin' Technical Recruiters or UX Content Designers muddying up our stats!
<i>(No offense to Technical Recruiters or UX Content Designers.)</i> 😅

In [43]:
ignore_these = ['UX', 'Recruiter', 'Sales', 'Bioinformatics', 'Connect', 'Biologist', \
                'Client']

indeed = indeed[~indeed['jobtitle'].str.contains('|'.join(ignore_these))]

indeed.reset_index(drop=True, inplace=True)

## Let's clean up and cosolidate these job types. 🛀🏼

In [44]:
indeed.jobtitle = indeed.jobtitle.str.lower()
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'manager' if 'mgr' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'senior' if 'sr' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'senior' if 'senior' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'head' if 'head' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'manager' if 'manager' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'senior' if 'principle' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'senior' if 'principal' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'senior' if 'ii' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'engineer' if 'engineer' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'senior' if 'lead' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'nlp' if 'nlp' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'intern' if 'internship' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'intern' if 'intern' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'contract' if 'contract' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'analyst' if 'analyst' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'analyst' if 'analytic' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'nlp' if 'natural' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'analyst' if 'analsyt' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'data scientist' if 'data scientist' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'director' if 'director' in x else x)
indeed.jobtitle = indeed.jobtitle.map(lambda x: 'director' if 'vice' in x else x)

In [45]:
indeed.jobtitle.unique()

array(['manager', 'intern', 'data scientist', 'analyst', 'senior',
       'engineer', 'nlp', 'contract', 'director'], dtype=object)

## Lookin' good!
Now, we can dig into the job descriptions in the <i>Snippet</i> column.

#### First step: getting a good idea of the special words in the job snippets.

Cleaning up the <i>Snippets</i> and <i>JobTitle</i> features.

In [46]:
indeed['snippet'] = indeed['snippet'].str.replace('<b>', '')
indeed['snippet'] = indeed['snippet'].str.replace('</b>', '')
indeed['snippet'] = indeed['snippet'].str.replace('.', '')
indeed['snippet'] = indeed['snippet'].str.replace(',', '')
indeed['snippet'] = indeed['snippet'].str.lower()

In [47]:
snippets = indeed.snippet.to_string()

In [48]:
vect = CountVectorizer(stop_words='english')
vect.fit_transform(indeed.snippet)

<6481x809 sparse matrix of type '<type 'numpy.int64'>'
	with 85726 stored elements in Compressed Sparse Row format>

In [49]:
vocab_dict = vect.vocabulary_
vocab = [[k,v] for k,v in vocab_dict.items()]

In [50]:
indeed_vocab = pd.DataFrame(vocab)
indeed_vocab.columns = ['word', 'count']
print indeed_vocab.sort_values('count', ascending=False).to_string()

                 word  count
671              yume    808
391            yodlee    807
434             years    806
759              year    805
218           writing    804
28            wrangle    803
700             world    802
110           working    801
767              work    800
749              wish    799
619               web    798
695           weaving    797
732               way    796
780         warehouse    795
519             wants    794
148              want    793
554                vp    792
741            volume    791
21        visualizing    790
591    visualizations    789
475     visualization    788
374         visionary    787
102           visible    786
774           virtual    785
448              view    784
632        vertically    783
158            verify    782
470           various    781
771           variety    780
542             value    779
588          validate    778
204           utilize    777
613             using    776
788           

#### Splitting up the the values of the Snippet feature for comparison.

In [51]:
indeed['snippet'] = indeed['snippet'].str.split()
indeed['split_jobtitle'] = indeed['jobtitle'].str.split()

#### I used the vocabulary to compile three lists of values based on areas of expertise. <br>
This I turn into a set for comparison with the JobTitle and Snippet features.

In [52]:
computering_skills = ['scala', 'python', 'r', 'hadoop', 'sql', 'nosql', 'mongodb', 'tableau', \
                      'spark', 'go', 'julia', 'd3', 'javascript', 'html', 'css', 'zoomdata', \
                      'insight', 've', 'spotfire', 'sas', 'pagerduty', 'owler', 'meetme', \
                      'mattermark', 'liveramp', 'java', 'hive', 'harnham', 'excel', \
                      'adadyn', 'yume', 'yodlee']

technical_skills = ['statistics', 'machine learning', 'algorithm', 'algorithms', \
                    'deep learning', 'ai', 'visualize', 'visualization', 'visualizations', \
                    'writing', 'wrangle', 'wrangling', 'triage', 'train', 'training', \
                    'testing', 'teach', 'teaching', 'security', 'parallelization', \
                    'nlp', 'model', 'modeling', 'modelling', 'ml', 'mining', \
                    'mentor', 'mathematician', 'mathematics', 'mathematical', \
                    'marketing', 'forecasting', 'algorithmic', 'aggregation', \
                    'storytelling']

experience_skills = ['manager', 'intern', 'data scientist', 'analyst', 'masters', 'phd'\
                     'senior', 'nlp', 'engineer', 'contract', 'head', 'director']

combined_skills = set(computering_skills + technical_skills + experience_skills)

## Creating a cumulative number for all skills.

In [53]:
indeed['skill_count'] = 0
indeed['title_count'] = 0

for idx in indeed.index:
    intersect = list(set(indeed['snippet'][idx]).intersection(combined_skills))
    indeed['skill_count'][idx] = len(intersect)
    
for idx in indeed.index:
    intersect = list(set(indeed['jobtitle'][idx]).intersection(combined_skills))
    indeed['title_count'][idx] = len(intersect)
    
indeed['total_skills'] = indeed['skill_count'] + indeed['title_count']
indeed.drop('skill_count', axis=1, inplace=True)
indeed.drop('title_count', axis=1, inplace=True)

# Let's get testing! 🏁

In [54]:
indeed_copy = indeed.copy()
indeed_copy.drop('snippet', axis=1, inplace=True)
indeed_copy.drop('date', axis=1, inplace=True)
indeed_copy.drop('split_jobtitle', axis=1, inplace=True)

In [55]:
dummies = ['indeedApply', 'jobtitle', 'company']
dummy_df = pd.get_dummies(indeed_copy[dummies])

In [56]:
indeed_analysis = pd.concat([indeed_copy, dummy_df], axis=1)
indeed_analysis.drop(dummies, axis=1, inplace=True)

## Train Test 🚂

In [57]:
X = indeed_analysis
y = indeed_analysis['salary'].ravel()
X.drop('salary', axis=1, inplace=True)

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

In [59]:
print 'X train shape: ', X_train.shape
print 'y train shape: ', y_train.shape
print 'X test shape', X_test.shape
print 'y test shape: ', y_test.shape

X train shape:  (3240, 121)
y train shape:  (3240,)
X test shape (3241, 121)
y test shape:  (3241,)


## Try a Lasso... 🐴

In [60]:
def rmse_cv(model):
    rmse= np.sqrt(-cross_val_score(model, X_train, y_train, scoring="neg_mean_squared_error", cv=5))
    return(rmse)

In [61]:
model_lasso = LassoCV(n_alphas=100, selection='random', max_iter=15000).fit(X_train, y_train)
res = rmse_cv(model_lasso)
print("Mean:",res.mean())
print("Min: ",res.min())

('Mean:', 10634.733386966163)
('Min: ', 9726.9270032708991)


In [64]:
lasso_df = indeed_analysis
l_coef = pd.Series(model_lasso.coef_, index = lasso_df.columns)
print("Lasso picked " + str(sum(l_coef != 0)) + " variables and eliminated the other " +  str(sum(l_coef == 0)) + " variables")

Lasso picked 77 variables and eliminated the other 44 variables


In [65]:
lasso_coef = pd.concat([l_coef.sort_values().head(10),
                        l_coef.sort_values().tail(10)])

lasso_coef.iplot(kind = "barh", title='Coefficients in the Lasso Model')

In [66]:
lasso_coef.sort_values()

company_IBM                                      -84183.537506
company_Ipsos North America                      -63303.722599
company_Payette Group                            -56577.815255
jobtitle_intern                                  -54897.910472
company_General Assembly                         -50730.458804
company_Abl Schools                              -39307.250798
company_Wells Fargo                              -38046.068271
company_Blackstone Technology Group              -36301.403827
company_University of California San Francisco   -35751.044355
company_Gametime United, Inc.                    -35199.322448
company_PagerDuty                                 24925.774934
company_Corporate Labs Technology                 26569.287240
company_Twitter                                   27390.890191
company_Elevate Recruiting Group                  33688.120104
company_DuPont                                    37991.296685
jobtitle_director                                 38140

## Try a Ridge...

In [67]:
model_ridge = RidgeCV(alphas=(0.01, 0.1, 1.0, 10, 100)).fit(X_train, y_train)
res = rmse_cv(model_ridge)
print("Mean:",res.mean())
print("Min: ",res.min())

('Mean:', 9491.0208750977163)
('Min: ', 8627.3985344530574)


In [68]:
ridge_df = indeed_analysis
r_coef = pd.Series(model_ridge.coef_, index = ridge_df.columns)
print("Ridge picked " + str(sum(r_coef != 0)) + " variables and eliminated the other " +  str(sum(r_coef == 0)) + " variables")

Ridge picked 119 variables and eliminated the other 2 variables


In [69]:
ridge_coef = pd.concat([r_coef.sort_values().head(10),
                    r_coef.sort_values().tail(10)])

ridge_coef.iplot(kind = "barh", title='Coefficients in the Ridge Model')

In [70]:
ridge_coef.sort_values()

company_IBM                                      -95523.041089
company_Pfizer Inc.                              -62511.718747
company_Payette Group                            -57001.293971
company_Ipsos North America                      -56470.710336
jobtitle_intern                                  -54456.716341
company_General Assembly                         -48786.201064
company_Abl Schools                              -38663.456268
company_Levi Strauss & Co.                       -38396.658410
company_University of California San Francisco   -37019.083855
jobtitle_contract                                -36610.592860
company_Twitter                                   33939.102415
company_Elevate Recruiting Group                  40174.561595
jobtitle_manager                                  40572.500011
company_DuPont                                    42765.123678
company_Workbridge Associates                     47238.990842
company_art.com                                   48244

## Try an ElasticNet...

In [71]:
model_en = ElasticNetCV(n_alphas=100, alphas=(0.01, 0.1, 1.0, 10, 100, 500), \
                        max_iter=15000, cv=5, n_jobs=-1).fit(X_train, y_train)
res = rmse_cv(model_en)
print("Mean:",res.mean())
print("Min: ",res.min())

('Mean:', 13475.544710648252)
('Min: ', 12415.825627686161)


In [72]:
en_df = indeed_analysis
e_coef = pd.Series(model_en.coef_, index = en_df.columns)
print("ElasticNet picked " + str(sum(e_coef != 0)) + " variables and eliminated the other " +  str(sum(e_coef == 0)) + " variables")

ElasticNet picked 119 variables and eliminated the other 2 variables


In [73]:
eln_coef = pd.concat([e_coef.sort_values().head(10),
                     e_coef.sort_values().tail(10)])

eln_coef.iplot(kind = "barh", title='Coefficients in the ElasticNet Model')

In [74]:
eln_coef.sort_values()

jobtitle_intern                                  -43644.182612
company_IBM                                      -40920.589957
company_Payette Group                            -39841.566554
company_General Assembly                         -34961.889628
company_Ipsos North America                      -34308.806273
company_Abl Schools                              -29622.983282
company_Gametime United, Inc.                    -27811.570554
company_University of California San Francisco   -23152.790386
company_Medal                                    -22099.845443
company_inVentiv Health Clinical                 -21976.829696
company_Credit Karma                              19609.235476
company_Corporate Labs Technology                 21039.821676
company_Twitter                                   22103.095332
jobtitle_director                                 31922.707706
company_Elevate Recruiting Group                  32217.217456
company_DuPont                                    32608

## Five items consistently rank lowest:

#### LassoCV:
<i>company_IBM                                      -89337.939938</i><br>
<i>company_Ipsos North America                      -62833.218868</i><br>
<i>company_Payette Group                            -56550.574495</i><br>
<i>company_General Assembly                         -53124.101956</i><br>
<i>jobtitle_intern                                  -51035.956142</i><br>
company_Wells Fargo                              -43636.332313<br>
company_Blackstone Technology Group              -38270.816321<br>
company_Gametime United, Inc.                    -37484.111854<br>
company_Medal                                    -36689.889638<br>
company_Smule                                    -36673.128248<br><br>

#### RidgeCV:
<i>company_IBM                                      -98493.634337</i><br>
<i>company_Payette Group                            -57724.671529</i><br>
<i>company_Ipsos North America                      -57042.876115</i><br>
<i>company_General Assembly                         -50163.581091</i><br>
<i>jobtitle_intern                                  -47837.768974</i><br>
company_University of California San Francisco   -37762.147808<br>
company_Medal                                    -37574.076939<br>
company_Smule                                    -37443.054554<br>
jobtitle_contract                                -37001.216705<br>
company_Abl Schools                              -36995.153859<br><br>

#### ElasticNetCV:
<i>company_IBM                                      -41233.522386</i><br>
<i>company_General Assembly                         -39940.058085</i><br>
<i>jobtitle_intern                                  -39117.140263</i><br>
<i>company_Payette Group                            -37952.037798</i><br>
<i>company_Ipsos North America                      -32631.289045</i><br>
company_Gametime United, Inc.                    -28559.314432<br>
company_LiveRamp                                 -22883.754084<br>
company_University of California San Francisco   -22760.442842<br>
company_inVentiv Health Clinical                 -22578.561263<br>
company_Microsoft                                -22250.575857<br>

### Manager, Scientist, Analyst and Intern tank in all areas.
# 🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫🌫

## Setting up for a regression.

In [75]:
indeed_copy = indeed.copy()
indeed_copy.drop('snippet', axis=1, inplace=True)
indeed_copy.drop('date', axis=1, inplace=True)
indeed_copy.drop('split_jobtitle', axis=1, inplace=True)

In [76]:
dummies = ['indeedApply', 'jobtitle', 'company']
dummy_df = pd.get_dummies(indeed_copy[dummies])

In [77]:
indeed_analysis = pd.concat([indeed_copy, dummy_df], axis=1)
indeed_analysis.drop(dummies, axis=1, inplace=True)

In [78]:
poor_performers = ['company_IBM', 'company_General Assembly', 'jobtitle_intern', \
                   'company_Payette Group', 'company_Ipsos North America']

indeed_analysis.drop(poor_performers, axis=1, inplace=True)

## Train Test 🚋

In [79]:
X = indeed_analysis
y = indeed_analysis['salary'].ravel()
X.drop('salary', axis=1, inplace=True)

In [80]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

In [81]:
print 'X train shape: ', X_train.shape
print 'y train shape: ', y_train.shape
print 'X test shape', X_test.shape
print 'y test shape: ', y_test.shape

X train shape:  (3240, 116)
y train shape:  (3240,)
X test shape (3241, 116)
y test shape:  (3241,)


## And now... Regression. 🕵🏻

In [82]:
lr = LinearRegression(n_jobs=-1)

lr.fit(X_train, y_train)

pred_lr_salary = pd.DataFrame()
pred_lr_salary['Predicted Salary'] = lr.predict(X_train.sort_index())

scores = cross_val_score(lr, X_test, y_test, cv=5)

print '---\n5 Fold Cross Validation Scores:', \
      'Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2)

salary_score = '5-Fold CV Mean Average: {percent:.2%}'.format(percent=scores.mean())
salary_std = 'Standard Deviation: {percent:.2%}'.format(percent=scores.std() * 2)


internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.



---
5 Fold Cross Validation Scores: Accuracy: 0.93 (+/- 0.03)


In [83]:
true_salary = pd.DataFrame(y_train.astype(float)).sort_index()
true_salary.columns = ['True Salary']
true_salary.reset_index(inplace=True, drop=True)

In [84]:
predicted_salary = pd.concat([pred_lr_salary, true_salary], axis=1)

In [85]:
predicted_salary['Difference'] = predicted_salary['True Salary'] - predicted_salary['Predicted Salary']

In [86]:
predicted_salary.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3230,3231,3232,3233,3234,3235,3236,3237,3238,3239
Predicted Salary,78939.456595,73259.571205,61535.550375,64490.522114,81100.404872,61535.550375,64490.522114,81100.404872,78939.456595,62675.220403,...,185172.458336,181845.548824,185000.0,185172.458336,181845.548824,165894.283698,185000.0,186043.740804,163562.299397,185172.458336
True Salary,85000.0,105000.0,125000.0,85000.0,105000.0,85000.0,165000.0,125000.0,105000.0,125000.0,...,185000.0,65000.0,105000.0,165000.0,145000.0,165000.0,185000.0,185000.0,125000.0,125000.0
Difference,6060.543405,31740.428795,63464.449625,20509.477886,23899.595128,23464.449625,100509.477886,43899.595128,26060.543405,62324.779597,...,-172.458336,-116845.548824,-80000.0,-20172.458336,-36845.548824,-894.283698,-4.074536e-10,-1043.740804,-38562.299397,-60172.458336


In [87]:
predicted_salary['Difference'].mean()

9.327598008108728e-11

In [88]:
predicted_salary.iplot(kind='spread', title='Linear Regression with Salary as Target', \
                       xTitle=salary_score, yTitle=salary_std)