# Web Scraping for Indeed.com & Predicting Salaries

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary classifier.

We are going to collect salary information on data science jobs in a variety of markets. Then using the location, title, and summary of the job we will attempt to predict the salary of the job. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being to able extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use a random forest classifier, as well as another classifier of your choice; either logistic regression, SVM, or KNN. 

- **Question**: Why would we want this to be a classification problem?
- **Answer**: While more precision may be better, there is a fair amount of natural variance in job salaries - predicting a range be may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10")

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

#### Set up a request (using requests) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)
The URL here has many query parameters
- q for the job search
- This is followed by "+20,000" to return results with salaries (or expected salaries >$20,000)
- l for a location
- start for what result number to start on

In [8]:
URL = "https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York"
base_url = "https://www.indeed.com/jobs?q=data+scientist&l="

In [62]:
import requests
import bs4
from bs4 import BeautifulSoup
import urllib
from urllib import urlopen 
import re 


In [10]:

# visit that url, and grab the html of said page
html = urllib.urlopen(URL).read()

# we need to convert this into a soup object
soup = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")

Let's look at one result more closely. A single result looks like
```JSON
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&campaignid=serp-linkcompanyname&fromjk=2480d203f7e97210&jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
C onduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

In [11]:
# 'find' method returns the first matching Tag (and everything inside of it)
soup.find(name='body')
soup.find(name='h1')

<h1><font size="+1">data scientist $20,000 jobs in New York, NY</font></h1>

In [12]:
# Tags allow you to access the 'inside text'
soup.find(name='h1').text


u'data scientist $20,000 jobs in New York, NY'

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The salary is available in a nobr element inside of a td element with class='snip.
- The title of a job is in a link with class set to jobtitle and a data-tn-element="jobTitle.
- The location is set in a span with class='location'.
- The company is set in a span with class='company'.

## Write 4 functions to extract each item: location, company, job, and salary.¶
Example
```python
def extract_location_from_result(result):
    return result.find ...
```

##### - Make sure these functions are robust and can handle cases where the data/field may not be available.
>- Remember to check if a field is empty or None for attempting to call methods on it
>- Remember to use try/except if you anticipate errors.

- **Test** the functions on the results above and simple examples

In [201]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [202]:
def extract_location_from_result(result):
    try:
        item = result.find(name='span', attrs={'class': 'location'}).text
    except: 
        item= None
    return item

    


In [203]:
def extract_company_from_result(result):
    try:
        item = result.find('span', {'class': 'company'}).text
        item = item.replace('\n','')
        while item[0] == ' ':
            item = item[1:]
    except: 
        item= None
    return item


In [204]:
def extract_job(result):
    item = result.find('a', attrs={"class":"turnstileLink"}).attrs['title']
    return item
    
extract_job(soup)

u'Medical Lab Scientist - On Call'

In [205]:
def extract_salary_from_result(result):
    try:
        item = result.find('td', {'class': 'snip'}).find('nobr').text
    except: 
        item= None
    return item 

extract_salary_from_result(soup)




#nobr 

# class="jobtitle turnstileLink"

In [206]:
#cleans data so that we have less u'/n/n presence 
def clean_one(row, job):
    for data in soup.find_all(name='td', attrs={'id':'resultsCol'}):
        bra = data.find_all(name='div', attrs={'class': row})
        for row in bra:
            company = row.find(class_ = "company").get_text()
            location = row.find(class_ = "location").get_text()
            title = row.find(class_ = job).get_text()
#             salary = row.find_all(name = 'td', attrs={'class':'snip'})
            print "title = "+title
            print "company ="+company
            print "location = "+location
#             print extract_salary_from_result
clean_one(' row result', 'jobtitle')
#clean_one(' row sjlast result', 'jobtitle')

title = 
Medical Lab Scientist - On Call

company =


        The Vancouver Clinic

location = Salmon Creek, WA
title = 
Director of training and clinical operations

company =

    Portland Psychotherapy Clinic, Research, & Trainin...

location = Portland, OR
title = 
Data Scientist Internship - Deep Learning

company =


        Cambia Health

location = Portland, OR
title = 
Data Scientist - Machine Learning, NLP

company =


        WSI Nationwide

location = Portland, OR
title = 
Statistician - Sustainable Business Innovation

company =


        24 Seven

location = Beaverton, OR
title = 
Laboratory Quality Manager

company =

    Cannabis Testing Facility

location = Portland, OR
title = 
Senior Lead Mechanical Engineer

company =

    LightSource Consulting

location = Portland, OR 97205 (Northwest area)
title = 
D2i Development Initiator/Development Coordinator

company =


        FEI Company

location = Hillsboro, OR 97124
title = 
Sr .NET Software Engineer

company =


    

In [207]:

#shows the value of .strip 
results = soup.find_all('div', attrs={'data-tn-component': 'organicJob'})

for x in results:
    company = x.find('span', attrs={"itemprop":"name"})
    print 'company:', company.text.strip()

    job = x.find('a', attrs={'data-tn-element': "jobTitle"})
    print 'job:', job.text.strip()

    salary = x.find('nobr')
    if salary:
        print 'salary:', salary.text.strip()
        

company: The Vancouver Clinic
job: Medical Lab Scientist - On Call
company: Portland Psychotherapy Clinic, Research, & Trainin...
job: Director of training and clinical operations
company: Cambia Health
job: Data Scientist Internship - Deep Learning
company: WSI Nationwide
job: Data Scientist - Machine Learning, NLP
salary: $120,000 - $165,000 a year
company: 24 Seven
job: Statistician - Sustainable Business Innovation
company: Cannabis Testing Facility
job: Laboratory Quality Manager
salary: $18 - $22 an hour
company: LightSource Consulting
job: Senior Lead Mechanical Engineer
company: FEI Company
job: D2i Development Initiator/Development Coordinator
company: VanderHouwen
job: Sr .NET Software Engineer
company: Cambia Health
job: Data Scientist


Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.
- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results, the l=New+York and the start=10. The first controls the location of the results (so we can try a different city). The second controls where in the results to start and gives 10 results (thus, we can keep incrementing by 10 to go further in the list).
##### Complete the following code to collect results from multiple cities and starting points.
- Enter your city below to add it to the search
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different

In [208]:
YOUR_CITY = ''

In [209]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 100 # Set this to a high-value (5000) to generate more results. 
# Crawling more results, will also take much longer. First test your code on a small number of results and then expand.

results = []

for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', YOUR_CITY]):
    for start in range(0, max_results_per_city, 10):
        # Grab the results from the request (as above)
        # Append to the full set of results  
        url = url_template.format(city, start)
        html = urllib.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
        results.append(soup)
        pass
        

In [210]:
data= []

for i in results:
    for o in i.find_all('div', class_=['row','result']): 
        location = extract_location_from_result(o)
        company = extract_company_from_result(o)
        job = extract_job(o)  
        salary = extract_salary_from_result(o)
        data.append({'location' : location, 'company': company, 'job': job, 'salary':salary})

In [211]:
data_df = pd.DataFrame(data)
print len(data_df)

1915


In [212]:
data_df.to_csv('~/desktop/datascientist.csv' ,mode='a', header=False, encoding = 'utf-8')

In [1]:
# data_2 = pd.read_csv('~/desktop/datascientist.csv')
# print data_2.shape

In [213]:
# df = pd.read_csv("http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}")
# df.head()
df_columns = ['company_name', 'title', 'location', 'salary' ]

In [250]:
import pandas as pd
df = pd.read_csv('~/desktop/datascientist.csv', header = 0, names = df_columns)

In [251]:
df.head()

Unnamed: 0,company_name,title,location,salary
1,"Syntelli Solutions, Inc",Data Scientist,"Charlotte, NC 28277",
2,NIT Finance,Senior Data Scientist,"New York, NY",
3,Freddie Mac,Data Science Professional,"McLean, VA 22102",
4,Agile,Data Scientist (USC or GC) (REMOTE),"Melbourne, FL",
5,NPR,Data Scientist / Digital Analyst,"Washington, DC",


#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

In [217]:
print "missing values \n", df.isnull().sum()

missing values 
company_name        7
title               0
location            0
salary          25789
dtype: int64


In [218]:
 print "dataframe types \n", df.dtypes     #look for data types of each column

dataframe types 
company_name    object
title           object
location        object
salary          object
dtype: object


In [252]:
print 'duplicates \n', df.duplicated().sum() # True if a row is identical to a previous row

duplicates 
20543


In [253]:
print 'drop dups \n', df.drop_duplicates(inplace = True)    # Drop the duplicate rows

drop dups 
None


In [254]:
for item in df:
    print item
    print df[item].nunique()  #number of unique values for each column

company_name
2877
title
5317
location
1269
salary
319


In [264]:
df

Unnamed: 0,company_name,title,location,salary
1,"Syntelli Solutions, Inc",Data Scientist,"Charlotte, NC 28277",
2,NIT Finance,Senior Data Scientist,"New York, NY",
3,Freddie Mac,Data Science Professional,"McLean, VA 22102",
4,Agile,Data Scientist (USC or GC) (REMOTE),"Melbourne, FL",
5,NPR,Data Scientist / Digital Analyst,"Washington, DC",
6,Axiologic Solutions,Data Subject Matter Expert,"Washington, DC",
7,BitVoyant,Junior Data Scientist,"Rosslyn, VA",
8,Facebook,"Data Engineer, SocialVR","Menlo Park, CA",
9,DataRobot,Data Science Evangelist,"Washington, DC",
10,Comverge Inc,Jr. Data Scientist,"Denver, CO",


Lastly, we need to clean up salary data. 

1. Only a small number of the scraped results have salary information - only these will be used for modeling.
1. Some of the salaries are not yearly but hourly or weekly, these will not be useful to us for now
1. Some of the entries may be duplicated
1. The salaries are given as text and usually with ranges.

#### Find the entries with annual salary entries, by filtering the entries without salaries or salaries that are not yearly (filter those that refer to hour or week). Also, remove duplicate entries

In [196]:
## YOUR CODE HERE
#create a for loop that goes through each iteration of salary 
# def sal_midpoint(n):
#     sal_range = n.split('-')  # ' - '  #split into lower and higher also remove commas "a year" etc If try and accept statement, look at documentation  
#     numlist = []
# #     not_dash = []
#     for i in sal_range:
#         number = re.search('[\d\,]+',i)
#         number = number.group(0).replace(',','')  #what does this do? why is group here 
#         number = float(number)
#         numlist.append(number)
#     sal_mid = sum(numlist)/len(numlist)
# #     if "-" not in n:
# #         not_dash
#     return sal_mid

In [272]:
salary2 = df[df.salary.notnull()]

In [256]:
# def sal_midpoint(n):
#     sal_range = n.split('-')
#     numlist = []

In [270]:
salary2

Unnamed: 0,company_name,title,location,salary
11,NxT Level,Data Scientist,"Seattle, WA","$125,000 - $160,000 a year"
23,Northeastern University - Network Science Inst...,Data Visualization Research Specialist,"Boston, MA","$50,000 a year"
26,Central Intelligence Agency,Data Scientist,"Washington, DC","$62,338 - $119,794 a year"
36,Keyo,Data Scientist,"Palo Alto, CA","$120,000 - $200,000 a year"
37,"Computer Enterprises, Inc. (CEI)",Data Scientist,"Philadelphia, PA","$160,000 a year"
52,NxT Level,Software Engineer - Machine Learning,"Seattle, WA","$100,000 - $135,000 a year"
54,Workbridge Associates,Data Scientist,"Washington, DC","$100,000 - $150,000 a year"
65,Barrington James,Real World Evidence Statistical Analyst,"New York, NY 10005 (Financial District area)","$110,000 a year"
78,Platinum Solutions,Data Scientist,"Houston, TX","$80,000 - $120,000 a year"
95,Genialis,Bioinformatics Software Developer,"Houston, TX","$50,000 - $80,000 a year"


In [273]:
type(salary2)

pandas.core.frame.DataFrame

In [268]:
# salary2.salary.str.replace('[\,$]', '')

11      $125,000 - $160,000 a year
23                  $50,000 a year
26       $62,338 - $119,794 a year
36      $120,000 - $200,000 a year
37                 $160,000 a year
52      $100,000 - $135,000 a year
54      $100,000 - $150,000 a year
65                 $110,000 a year
78       $80,000 - $120,000 a year
95        $50,000 - $80,000 a year
99                    $120 an hour
104                 $43,794 a year
116                 $46,831 a year
130       $40,800 - $79,100 a year
139                $150,000 a year
169       $39,983 - $55,500 a year
183       $45,000 - $77,000 a year
191       $63,696 - $94,557 a year
201       $70,000 - $90,000 a year
203                $100,000 a year
212     $120,000 - $140,000 a year
220             $85 - $110 an hour
230      $73,300 - $114,900 a year
232              $50 - $70 an hour
244                $150,000 a year
245                $130,000 a year
249                $125,000 a year
255              $50 - $70 an hour
256     $120,000 - $

In [275]:
 salary2.head(15)

Unnamed: 0,company_name,title,location,salary
11,NxT Level,Data Scientist,"Seattle, WA","$125,000 - $160,000 a year"
23,Northeastern University - Network Science Inst...,Data Visualization Research Specialist,"Boston, MA","$50,000 a year"
26,Central Intelligence Agency,Data Scientist,"Washington, DC","$62,338 - $119,794 a year"
36,Keyo,Data Scientist,"Palo Alto, CA","$120,000 - $200,000 a year"
37,"Computer Enterprises, Inc. (CEI)",Data Scientist,"Philadelphia, PA","$160,000 a year"
52,NxT Level,Software Engineer - Machine Learning,"Seattle, WA","$100,000 - $135,000 a year"
54,Workbridge Associates,Data Scientist,"Washington, DC","$100,000 - $150,000 a year"
65,Barrington James,Real World Evidence Statistical Analyst,"New York, NY 10005 (Financial District area)","$110,000 a year"
78,Platinum Solutions,Data Scientist,"Houston, TX","$80,000 - $120,000 a year"
95,Genialis,Bioinformatics Software Developer,"Houston, TX","$50,000 - $80,000 a year"


In [72]:
# for i,v in df.iterrows():
#     try:
#         df.loc[i,'estimate'] = sal_midpoint(v['salary'])
#     except:
#         df.loc[i,'estimate']= np.nan

In [226]:
df

1                            NaN
2                            NaN
3                            NaN
4                            NaN
5                            NaN
6                            NaN
7                            NaN
8                            NaN
9                            NaN
10                           NaN
11      125,000 - 160,000 a year
12                           NaN
13                           NaN
14                           NaN
15                           NaN
16                           NaN
17                           NaN
18                           NaN
19                           NaN
20                           NaN
21                           NaN
22                           NaN
23                 50,000 a year
24                           NaN
25                           NaN
26       62,338 - 119,794 a year
27                           NaN
28                           NaN
29                           NaN
30                           NaN
          

In [None]:
# sal_revised = sal_revised[sal_revised['salary'].str.contains("month", na=False) == False]
# sal_revised = sal_revised[sal_revised['salary'].str.contains("hour", na=False) == False]
# sal_revised = sal_revised[sal_revised['salary'].str.contains("week", na=False) == False]


In [285]:
yearly = salary2[salary2.salary.str.contains('year')]


In [286]:
yearly.head(15)

Unnamed: 0,company_name,title,location,salary
11,NxT Level,Data Scientist,"Seattle, WA","$125,000 - $160,000 a year"
23,Northeastern University - Network Science Inst...,Data Visualization Research Specialist,"Boston, MA","$50,000 a year"
26,Central Intelligence Agency,Data Scientist,"Washington, DC","$62,338 - $119,794 a year"
36,Keyo,Data Scientist,"Palo Alto, CA","$120,000 - $200,000 a year"
37,"Computer Enterprises, Inc. (CEI)",Data Scientist,"Philadelphia, PA","$160,000 a year"
52,NxT Level,Software Engineer - Machine Learning,"Seattle, WA","$100,000 - $135,000 a year"
54,Workbridge Associates,Data Scientist,"Washington, DC","$100,000 - $150,000 a year"
65,Barrington James,Real World Evidence Statistical Analyst,"New York, NY 10005 (Financial District area)","$110,000 a year"
78,Platinum Solutions,Data Scientist,"Houston, TX","$80,000 - $120,000 a year"
95,Genialis,Bioinformatics Software Developer,"Houston, TX","$50,000 - $80,000 a year"


In [287]:
yearly['salary'] = yearly.salary.str.replace('[\,$]', '')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [288]:
yearly['salary'] = yearly.salary.str.replace('a year', '')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [289]:
yearly.head(20)

Unnamed: 0,company_name,title,location,salary
11,NxT Level,Data Scientist,"Seattle, WA",125000 - 160000
23,Northeastern University - Network Science Inst...,Data Visualization Research Specialist,"Boston, MA",50000
26,Central Intelligence Agency,Data Scientist,"Washington, DC",62338 - 119794
36,Keyo,Data Scientist,"Palo Alto, CA",120000 - 200000
37,"Computer Enterprises, Inc. (CEI)",Data Scientist,"Philadelphia, PA",160000
52,NxT Level,Software Engineer - Machine Learning,"Seattle, WA",100000 - 135000
54,Workbridge Associates,Data Scientist,"Washington, DC",100000 - 150000
65,Barrington James,Real World Evidence Statistical Analyst,"New York, NY 10005 (Financial District area)",110000
78,Platinum Solutions,Data Scientist,"Houston, TX",80000 - 120000
95,Genialis,Bioinformatics Software Developer,"Houston, TX",50000 - 80000


In [298]:
def sal_midpoint(n):
    sal_range = n.split('-')  # ' - '  #split into lower and higher also remove commas "a year" etc If try and accept statement, look at documentation  
    numlist = []
    for i in sal_range:
        number = re.search('[\d\,]+',i)
        number = number.group(0).replace(',','')  #what does this do? why is group here 
        number = float(number)
        numlist.append(number)
    sal_mid = sum(numlist)/len(numlist)
    return sal_mid

In [300]:
for i,v in yearly.iterrows():
    try:
        yearly.loc[i,'estimate'] = sal_midpoint(v['salary'])
    except:
        yearly.loc[i,'estimate']= np.nan

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [302]:
yearly.head(20)

Unnamed: 0,company_name,title,location,salary,estimate
11,NxT Level,Data Scientist,"Seattle, WA",125000 - 160000,142500.0
23,Northeastern University - Network Science Inst...,Data Visualization Research Specialist,"Boston, MA",50000,50000.0
26,Central Intelligence Agency,Data Scientist,"Washington, DC",62338 - 119794,91066.0
36,Keyo,Data Scientist,"Palo Alto, CA",120000 - 200000,160000.0
37,"Computer Enterprises, Inc. (CEI)",Data Scientist,"Philadelphia, PA",160000,160000.0
52,NxT Level,Software Engineer - Machine Learning,"Seattle, WA",100000 - 135000,117500.0
54,Workbridge Associates,Data Scientist,"Washington, DC",100000 - 150000,150000.0
65,Barrington James,Real World Evidence Statistical Analyst,"New York, NY 10005 (Financial District area)",110000,110000.0
78,Platinum Solutions,Data Scientist,"Houston, TX",80000 - 120000,100000.0
95,Genialis,Bioinformatics Software Developer,"Houston, TX",50000 - 80000,65000.0


In [306]:
yearly.shape

(353, 5)

In [309]:
yearly['median'] = np.median(estimate)

NameError: name 'estimate' is not defined

#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary

In [None]:
## YOUR CODE HERE


### Save your results as a CSV

In [149]:
Export to csv
sal_revised = df.dropna(subset=['sal_update']

SyntaxError: invalid syntax (<ipython-input-149-a2be7785a427>, line 2)

In [185]:
# #sal_update shows average score of intervals between salary 

# sal_revised = sal_revised[sal_revised['salary'].str.contains("month", na=False) == False]
# sal_revised = sal_revised[sal_revised['salary'].str.contains("hour", na=False) == False]
# sal_revised = sal_revised[sal_revised['salary'].str.contains("week", na=False) == False]


In [304]:
# sal_revised4 = df[df['salary'].notnull()]
# sal_revised3 = sal_revised4[sal_revised4['salary'].str.contains("year")]
# # sal_revised3 = df[df['salary'].str.contains("hour", na=False) == False]
# # sal_revised3 = df[df['salary'].str.contains("week", na=False) == False]
# sal_revised3.drop('estimate', inplace = True)

In [303]:
# sal_revised3

## Predicting salaries using Random Forests + Another Classifier

In [151]:
# sal_revised1 = sal_revised.sal_update.str.contains('year')

#### Load in the the data of scraped salaries

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median)

We could also perform Linear Regression (or any regression) to predict the salary value here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW salary.

While performing regression may be better, performing classification may help remove some of the noise of the extreme salaries. We don't _have_ to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of salaries, 

In [None]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?

In [None]:
## YOUR CODE HERE

#### Create a Random Forest model to predict High/Low salary using Sklearn. Start by ONLY using the location as a feature. 

In [None]:
## YOUR CODE HERE

#### Create a few new variables in your dataframe to represent interesting features of a job title.
- For example, create a feature that represents whether 'Senior' is in the title 
- or whether 'Manager' is in the title. 
- Then build a new Random Forest with these features. Do they add any value? 


In [None]:
## YOUR CODE HERE

#### Rebuild this model with the new variables
- You can either create the dummy features manually or use the `dmatrix` function from `patsy`
- Remember to scale the feature variables as well!


In [None]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model. 

In [None]:
## YOUR CODE HERE

In [None]:
## YOUR CODE HERE

#### Continue to incorporate other text features from the title or summary that you believe will predict the salary and examine their coefficients

#### Take ~100 scraped entries with salaries. Convert them to use with your model and predict the salary - which entries have the highest predicted salaries?

### BONUS 

#### Bonus: Use Count Vectorizer from scikit-learn to create features from the text summaries. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [None]:
## YOUR CODE HERE

In [None]:
## YOUR CODE HERE