- Find out what factors most directly impact salaries (title, location, department, etc.). In this case, we do not want to predict mean salary as a regression would. Your boss believes that salary is better represented in categories than as a continuous value.
- Test, validate, and describe your models. What factors predict salary category? How do your models perform?
- Prepare a presentation for your Principal detailing your analysis.
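
One simple way to turn a continuous salary into categories, as the first point asks, is a median split. The sketch below is only an illustration: the function names, the `'high'`/`'low'` labels, and the sample figures are assumptions and do not come from the scraped data.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def to_salary_category(salaries):
    """Label each salary 'high' if it is above the median, otherwise 'low'."""
    cutoff = median(salaries)
    return ['high' if s > cutoff else 'low' for s in salaries]


# Made-up salaries, for illustration only
labels = to_salary_category([60000, 90000, 110000, 150000])
```

A binary label like this is what a logistic regression would then be trained to predict; finer categories (for example quartiles) would follow the same pattern with more cutoffs.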

In [1]:
# Import relevant libraries
# (__future__ imports go first so the cell also works as a plain script)
from __future__ import unicode_literals

import requests
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from spacy import English
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import linear_model
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from IPython.core.display import HTML

%matplotlib inline

In [2]:
path_to_phantom = '//Applications/phantomjs'

# Identify: Problem Statement / Aim

Our aim is to determine the factors that result in higher salaries for a data scientist.

# Acquire: Import Data Using Requests + BeautifulSoup

- Collect data on data science salary trends from a job listings aggregator for your analysis.
- Select and parse data from at least 1,000 job postings, potentially from multiple location searches.

In [3]:
# create a webdriver PhantomJS object
driver = webdriver.PhantomJS(executable_path=path_to_phantom)
driver.set_window_size(1024,768)

In [4]:
# Indeed.com url formats
# http://www.indeed.com/jobs?q=data+scientist&l=New+York&start=10&pp=
# base - http://www.indeed.com/jobs?q=data+scientist&l=
# location - City, separated by +
# page 2 onwards - &start=10&pp=
# full url - base+location+page

url_top = 'http://www.indeed.com/jobs?q=data+scientist&l='
location = ['New+York', 'Seattle', 'San+Francisco', 'Boston']
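
The URL scheme described in the comments above can be wrapped in a small helper. This is a minimal sketch: `build_search_url` and `URL_TOP` are hypothetical names, and the base URL and the `start`/`pp` parameters are copied from this cell rather than taken from any documented API.

```python
# Base search URL, copied from the notebook cell above
URL_TOP = 'http://www.indeed.com/jobs?q=data+scientist&l='


def build_search_url(location, page):
    """Return the results URL for a '+'-separated city name and a 1-based page number."""
    url = URL_TOP + location
    if page > 1:
        # page 2 starts at offset 10, page 3 at offset 20, and so on
        url += '&start=' + str((page - 1) * 10) + '&pp='
    return url


first_page = build_search_url('New+York', 1)
third_page = build_search_url('Boston', 3)
```

Keeping the URL logic in one function makes it easy to check the generated addresses before sending any requests.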

In [5]:
# in the search results, each result is wrapped in a div class=' row result'

# for all results we want to get back
# company, span class='company'
# jobtitle, h2 class='jobtitle'
# location, span class='location'
# summary, span class='summary'
# salary, td class="snip", nobr with a $
# we need to account for missing data

# write a function that can retrieve job title, company, location, summary, and salary for each result
# if there are blanks, return '' for the first 4, and np.nan for salary
def get_details(each_item):
    # .find() returns None when a tag is missing, so .text raises AttributeError
    try:
        job_title = each_item.find('h2', class_='jobtitle').text.strip('\n')
    except AttributeError:
        job_title = ''

    try:
        company = each_item.find('span', class_='company').text.strip()
    except AttributeError:
        company = ''

    try:
        location = each_item.find('span', class_='location').text.strip()
    except AttributeError:
        location = ''

    try:
        summary = each_item.find('span', class_='summary').text.strip()
    except AttributeError:
        summary = ''

    try:
        salary_text = each_item.find('td', class_='snip').find('nobr').text
        salary_text = salary_text.split()
        # keep only the first figure; the pay period (hour, year, ...) is ignored here
        salary = float(salary_text[0].strip('$').replace(',',''))
    except (AttributeError, IndexError, ValueError):
        salary = np.nan
    
    return [job_title, company, location, summary, salary]

In [6]:
entries_required = 1000
entries_per_page = 9
# explicit floor division: whole pages to fetch per location
pages_required_per_loc = (entries_required // entries_per_page) // len(location)
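
Assuming the notebook runs under Python 2, where `/` on integers floors, the expression above evaluates to 27 pages per location. At 9 results per page across 4 locations that caps the scrape at 27 * 4 * 9 = 972 postings, which is the row count shown in the `df.describe` output further down and falls short of the 1000 target. A ceiling-based calculation, sketched here with a hypothetical helper name, would reach the target, assuming every page really returns 9 results:

```python
def pages_per_location(entries_required, entries_per_page, n_locations):
    """Smallest page count per location that reaches entries_required overall."""
    # results gained by fetching one more page in every location
    per_round = entries_per_page * n_locations
    # integer ceiling of entries_required / per_round
    return (entries_required + per_round - 1) // per_round


pages = pages_per_location(1000, 9, 4)
```

With the values from this cell the result is 28 pages per location, or up to 1008 postings.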

In [7]:
data = []
for loc in location:
    for page in range(1, int(pages_required_per_loc) + 1):
        full_url = url_top + loc
        if page != 1:
            # pages after the first are addressed by a start offset in steps of 10
            full_url += '&start=' + str((page - 1) * 10) + '&pp='
        driver.get(full_url)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        # the leading space in the class string is intentional
        for item in soup.find_all('div', class_=' row result'):
            data.append(get_details(item))
print('done')

done


In [8]:
df = pd.DataFrame(data, columns=['job_title', 'company', 'location','summary','salary'])

In [12]:
df.head(10)

,job_title,company,location,summary,salary
0,Digital Data Scientist,JPMorgan Chase,"Manhattan, NY",Digital Data Scientist. 5+ years of combined i...,
1,Data Scientist,Squarespace,"New York, NY 10014 (West Village area)",Impressive skill with a major programming lang...,
2,Data Scientist,Verizon,"New York, NY 10003 (Greenwich Village area)",Familiarity and experience with creating analy...,
3,Data Scientist / Modeler,Morgan Stanley,"New York, NY 10032 (Washington Heights area)",The Modeling team and other MSSM project based...,
4,Data Scientist,Barclays,"New York, NY","Or Master’s Degree in operations research, app...",
5,"Data Scientist, Analytics",SoundCloud,"New York, NY 10003 (Greenwich Village area)",We’re looking for Data Scientists to join our ...,
6,"Data Scientist, Pricing & Yield",NBCUniversal,"New York, NY","Experience in handling large data sets, combin...",
7,Data Scientist,WebMD,"New York, NY","Medscape, a division of WebMD, develops and ho...",
8,Data Scientist - USA,Dataiku,"New York, NY",Familiarity with data visualization in R or Ja...,
9,Data Scientist/Analyst,YouNow,"New York, NY",Conduct in-depth research and analyses that tr...,


In [11]:
df.describe(include='all')

,job_title,company,location,summary,salary
count,972,972,972,972,53.0
unique,625,586,134,897,
top,Data Scientist,Amazon Corporate LLC,"New York, NY",UW Medicine’s mission is to improve the health...,
freq,165,59,156,7,
mean,,,,,102634.339623
std,,,,,55532.974045
min,,,,,45.0
25%,,,,,67500.0
50%,,,,,110000.0
75%,,,,,150000.0


# Parse: Clean & Organize Data
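
The salary summary above has a minimum of 45.0, which suggests that some listings quote an hourly (or other non-yearly) rate and that only the first figure of a range is kept. A possible cleaning step is to annualize the raw salary text. The sketch below is not the notebook's method: `parse_salary` is a hypothetical helper, the text formats (`'$45 an hour'`, `'$80,000 - $120,000 a year'`) are assumed, and the multipliers (2080 working hours, 52 weeks, 12 months per year) are assumptions as well.

```python
import re

# Assumed multipliers for converting a quoted pay period to a yearly figure
PERIODS_PER_YEAR = {'year': 1, 'month': 12, 'week': 52, 'hour': 2080}


def parse_salary(text):
    """Return an annualized salary estimate, or None if the text cannot be parsed."""
    # collect every dollar amount, e.g. both ends of a range
    amounts = [float(a.replace(',', ''))
               for a in re.findall(r'\$(\d[\d,]*(?:\.\d+)?)', text)]
    period = re.search(r'\b(year|month|week|hour)\b', text)
    if not amounts or period is None:
        return None
    # use the midpoint when a range is given
    midpoint = sum(amounts) / len(amounts)
    return midpoint * PERIODS_PER_YEAR[period.group(1)]


hourly = parse_salary('$45 an hour')
ranged = parse_salary('$80,000 - $120,000 a year')
```

Applying a function like this would require keeping the full salary string during scraping rather than only its first token.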

# Model: Perform Logistic Regression

# Evaluate: Logistic Regression

# Bonus: CountVectorizer, Regularization Parameters

# Present: Write a report for your audience addressing findings & recommendations