# Project 4: Web Scraping Indeed.com & Predicting Salaries

In Project 4, we practice two major skills: collecting data via  web scraping and building a binary predictor with Logistic Regression.

We will collect salary information on data science jobs in a variety of markets. Using location, title, and job summary, we'll predict the salary of the job. For job posting sites, this is extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), extrapolating expected salary can help guide negotiations.

Normally, we can use regression for this task; however, we will convert this problem into classification and use Logistic Regression.

- Q: Why would we want this to be a classification problem?
- A: While more precision may be better, there is a fair amount of natural variance in job salaries - predicting a range be may be useful.

Section one focuses on scraping Indeed.com; then we use listings with salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

Scrape job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries. First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10")

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract.

#### Setup a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)

The URL here has many query parameters

- `q` for the job search
- This is followed by "+20,000" to return results with salaries (or expected salaries >$20,000)
- `l` for a location 
- `start` for what result number to start on

Problem Statement: Enter the statement here

In [242]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd

In [245]:
# #read site in soup
# url= "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=Atlanta%2C+GA&start=10&pp="
# r = requests.get(url)
# soup = BeautifulSoup(r.content, "lxml")

# #Append to the full set of results
# results = soup.find_all('div', { "class" :"result" })
# results

[<div class="row result" data-jk="e49013aeeb6925aa" id="pj_e49013aeeb6925aa">\n<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0ADQeWqYYwuRDGK7FTA-YzU7Qag1HzehP-FB7aWzkBO6FRyVCTNkBclA9FrQyCAXLjn4u9IP3mepDeJnt-BtuBzfosurWUx8dMI4PS2dnzCGDjcb1rqHXIXHxiVvQw9rfF11WZQLdcQkIzgltmPYW2VZQafuPC2qYCDBhg02A2bHP0A24khtXASaA1S4GiUPCi_qCX8fwmpMBl5iYvUTK2k1IfbboiYxA9Jxxc1naN_r5O6LViJSAqWh2ZoGRQXYWg4MTeEFnYozxgl36GzTC6DTKTxwHlr7NcAoFcNZcERVSplTnIRv2EQ4kdq5AQoMJSXjXWa2KXxOAq49Nk5AiWGvIZzCZ5dHw7Y5WNuB_9IO7yZ2w8L3bg156CWyHETNOuUnEy1uj5eBE4F7sesj2-qFetUxMzP8oqj9QtylvrXQkfLzO1KgyvMqhIYpPyzihJdvOhwBqx2JViwpGxSwsNyrDo7dRDmVzc=&amp;p=1&amp;sk=&amp;fvj=1" id="sja1" onclick="setRefineByCookie(['salest']); sjoc('sja1',0); convCtr('SJ', pingUrlsForGA)" onmousedown="sjomd('sja1'); clk('sja1');" rel="nofollow" target="_blank" title="Data Scientist With Predictive Modeling"><b>Data</b> <b>Scientist</b> With Predictive Modeling</a>\n<br/>\n<div class="sjcl">\n<span c

Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
C onduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some of the more verbose elements removed, we can see that there is some structure to the above:
- The salary is available in a `nobr` element inside of a `td` element with `class='snip`.
- The title of a job is in a link with class set to `jobtitle` and a `data-tn-element="jobTitle`.  
- The location is set in a `span` with `class='location'`. 
- The company is set in a `span` with `class='company'`. 

In [246]:
# results = soup.find_all('div', { "class" : "result" })
# df = pd.DataFrame()
# for element in range(len(results)):
#     title=results[element].find("a",{'data-tn-element' : 'jobTitle'}).get_text().strip()
#     company=results[element].find("span",{"class":"company"}).get_text().strip()
#     location=results[element].find("span",{"class":"location"}).get_text().strip()
#     salary=results[element].find("td",{"class":"snip"}).get_text().strip()
#     df=df.append({"Title":title, "Company":company, "Location": location,"Salary": salary},ignore_index=True)
    
# df

Unnamed: 0,Company,Location,Salary,Title
0,MobileDev Power,"Atlanta, GA 30305",Build scalable predictive models based on data...,Data Scientist With Predictive Modeling
1,Cox Automotive,"Atlanta, GA",Description Cox Automotive is hiring a Data Sc...,Data Scientist
2,Home Depot,"Atlanta, GA 30354",Strong communication and data presentation ski...,DATA SCIENTIST
3,ICF,"Atlanta, GA",Evaluate data errors and work to resolve data ...,Scientific Data Analyst
4,Ashton212,"Atlanta, GA",The Important Stuff A leader in the payments i...,Quantitative Analyst | #769
5,Honeywell,"Atlanta, GA","Advanced analytics/modelling projects, includi...",Analytics DevOps Engineer
6,GE Power,"Atlanta, GA",Ability to create and understand data models. ...,IT Project Manager – Sourcing BI
7,Northrop Grumman,"Atlanta, GA","Perform data entry, cleaning, and analysis (bo...",HAI Prevention Data Analyst
8,Lockheed Martin,"Marietta, GA 30063","Experience with full-scale pole model design, ...",Research Scientist Senior
9,"Eagle Medical Services, LLC","Atlanta, GA",Comfortable in collaborating with data scienti...,Data Analyst


### Write 4 functions to extract each item: location, company, job, and salary.

example: 
```python
def extract_location_from_result(result):
    return result.find ...
```

- Make sure these functions are robust and can handle cases where the data/field may not be available
- Test the functions on the results above

In [247]:
# get text
# def extract_text(el):
#     if el:
#         return el.text.strip()
#     else:
#         return ''
        
# # company
# def get_company_from_result(result):
#     return extract_text(result.find('span', {'class' : 'company'}))

# # location
# def get_location_from_result(result):
#     return extract_text(result.find('span', {'class' : 'location'}))
# # summary
# def get_summary_from_result(result):
#     return extract_text(result.find('span', {'class' : 'sumamry'}))
# # title
# def get_title_from_result(result):
#     return extract_text(result.find('a', {'data-tn-element' : 'jobTitle'}))
# # get salary if exists
# def get_salary_from_result(result):
#     salary_table = result.find('td', {'class' : 'snip'})
#     if salary_table:
#         snip = salary_table.find('nobr')
#         if snip:
#             return snip.text.strip()   
#     return None

Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results: the `l=New+York` and the `start=10`. The first controls the location of the results (so we can try different city). The second controls where in the results to start and gives 10 results (so we can keep incrementing this by 10 to move further within the list).

#### Complete the following code to collect results from multiple cities and start points. 
- Enter your city below to add it to the search
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different

#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

In [250]:
# # create template URL and max number of results (pages) to pull
# url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}&pp="

# cities=['atlanta','charlotte','newark','chicago','los+angeles','seattle',
#         'richmond','raleigh','miami','new+york','detroit','san+francisco','boston',
#         'dallas','houston','']

# df=pd.DataFrame()
# # for loop to pull data with bs4
# for city in cities: 
#     for i in range (0,500,10):
#         link=url_template.format(city,i)
#         r=requests.get(link)
#         soup=BeautifulSoup(r.content,'lxml')
#         results = soup.find_all('div', { "class" : "result" })
        
#         for result in results:
#             if result:
#                 company = get_company_from_result(result)
#                 title = get_title_from_result(result)
#                 location = get_location_from_result(result)
#                 salary = get_salary_from_result(result)
#                 df=df.append({"Company Name": company, "Job Title": title, "Location": location,
#                              "Salary": salary}, ignore_index=True)
            
# df

Unnamed: 0,Company Name,Job Title,Location,Salary
0,Predictive Science,Data Scientist,United States,
1,MobileDev Power,Data Scientist With Predictive Modeling,"Atlanta, GA 30305",
2,Home Depot,DATA SCIENTIST,"Atlanta, GA 30354",
3,"Vision3 Solutions, Inc",Data Scientist- Big Data,"Atlanta, GA",$90 an hour
4,"eTek IT Services, Inc.",Big Data Engineer,"Atlanta, GA",$67 an hour
5,Deloitte,Federal Mission Analytics Data Scientist - Sen...,"Atlanta, GA",
6,LexisNexis,Associate Statistical Modeler,"Alpharetta, GA 30005",
7,Google,"Cloud Instructor (Big Data, Machine Learning),...","Atlanta, GA",
8,Reed Elsevier,Associate Statistical Modeler,"Alpharetta, GA",
9,Vesta Corporation,Senior Data Scientist,"Atlanta, GA 30303 (Five Points area)",


In [251]:
# df.to_csv("indeed_file.csv",encoding='utf-8')

In [255]:
df=pd.read_csv("indeed_file.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Company Name,Job Title,Location,Salary
0,0,Predictive Science,Data Scientist,United States,
1,1,MobileDev Power,Data Scientist With Predictive Modeling,"Atlanta, GA 30305",
2,2,Home Depot,DATA SCIENTIST,"Atlanta, GA 30354",
3,3,"Vision3 Solutions, Inc",Data Scientist- Big Data,"Atlanta, GA",$90 an hour
4,4,"eTek IT Services, Inc.",Big Data Engineer,"Atlanta, GA",$67 an hour


In [256]:
len(df) # total of 11840 rows

11840

In [257]:
df.shape

(11840, 5)

Lastly, we need to clean up salary data. 

1. Only a small number of the scraped results have salary information - only these will be used for modeling.
1. Some of the salaries are not yearly but hourly or weekly, these will not be useful to us for now
1. Some of the entries may be duplicated
1. The salaries are given as text and usually with ranges.

#### Find the entries with annual salary entries, by filtering the entries without salaries or salaries that are not yearly (filter those that refer to hour or week). Also, remove duplicate entries

In [287]:
df.set_index(["Unnamed: 0"])

Unnamed: 0_level_0,Company Name,Job Title,Location,Salary
Unnamed: 0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,Predictive Science,Data Scientist,United States,
1,MobileDev Power,Data Scientist With Predictive Modeling,"Atlanta, GA 30305",
2,Home Depot,DATA SCIENTIST,"Atlanta, GA 30354",
3,"Vision3 Solutions, Inc",Data Scientist- Big Data,"Atlanta, GA",$90 an hour
4,"eTek IT Services, Inc.",Big Data Engineer,"Atlanta, GA",$67 an hour
5,Deloitte,Federal Mission Analytics Data Scientist - Sen...,"Atlanta, GA",
6,LexisNexis,Associate Statistical Modeler,"Alpharetta, GA 30005",
7,Google,"Cloud Instructor (Big Data, Machine Learning),...","Atlanta, GA",
8,Reed Elsevier,Associate Statistical Modeler,"Alpharetta, GA",
9,Vesta Corporation,Senior Data Scientist,"Atlanta, GA 30303 (Five Points area)",


In [293]:
df2=df[['Company Name','Job Title','Location','Salary']]
df2 #11840 rows

Unnamed: 0,Company Name,Job Title,Location,Salary
0,Predictive Science,Data Scientist,United States,
1,MobileDev Power,Data Scientist With Predictive Modeling,"Atlanta, GA 30305",
2,Home Depot,DATA SCIENTIST,"Atlanta, GA 30354",
3,"Vision3 Solutions, Inc",Data Scientist- Big Data,"Atlanta, GA",$90 an hour
4,"eTek IT Services, Inc.",Big Data Engineer,"Atlanta, GA",$67 an hour
5,Deloitte,Federal Mission Analytics Data Scientist - Sen...,"Atlanta, GA",
6,LexisNexis,Associate Statistical Modeler,"Alpharetta, GA 30005",
7,Google,"Cloud Instructor (Big Data, Machine Learning),...","Atlanta, GA",
8,Reed Elsevier,Associate Statistical Modeler,"Alpharetta, GA",
9,Vesta Corporation,Senior Data Scientist,"Atlanta, GA 30303 (Five Points area)",


In [304]:
df2.drop_duplicates()

Unnamed: 0,Company Name,Job Title,Location,Salary
0,Predictive Science,Data Scientist,United States,
1,MobileDev Power,Data Scientist With Predictive Modeling,"Atlanta, GA 30305",
2,Home Depot,DATA SCIENTIST,"Atlanta, GA 30354",
3,"Vision3 Solutions, Inc",Data Scientist- Big Data,"Atlanta, GA",$90 an hour
4,"eTek IT Services, Inc.",Big Data Engineer,"Atlanta, GA",$67 an hour
5,Deloitte,Federal Mission Analytics Data Scientist - Sen...,"Atlanta, GA",
6,LexisNexis,Associate Statistical Modeler,"Alpharetta, GA 30005",
7,Google,"Cloud Instructor (Big Data, Machine Learning),...","Atlanta, GA",
8,Reed Elsevier,Associate Statistical Modeler,"Alpharetta, GA",
9,Vesta Corporation,Senior Data Scientist,"Atlanta, GA 30303 (Five Points area)",


In [307]:
# Dropping rows that have NA values
df3=df2.dropna()
len(df3) #533 rows after dropping NA and duplicates

533

In [313]:
# Filter out salary entries referring to week, hour or month
df3 = df3[~(df3.Salary.astype('str').str.contains('hour'))]
df3 = df3[~(df3.Salary.astype('str').str.contains('month'))]
print df3.shape
df3.head()

(367, 4)


Unnamed: 0,Company Name,Job Title,Location,Salary
37,Centers for Disease Control and Prevention,Behavioral Scientist,"Atlanta, GA","$74,260 - $96,538 a year"
48,Analytic Recruiting,Junior Data Scientist,"Alpharetta, GA","$75,000 - $90,000 a year"
50,Smith Hanley Associates,Marketing Statistician,"Atlanta, GA","$75,000 - $100,000 a year"
84,Analytic Recruiting,Senior Data Scientist,"Atlanta, GA","$100,000 - $125,000 a year"
98,Stackfolio,Lead Data Scientist,"Atlanta, GA 30308 (Old Fourth Ward area)","$80,000 a year"


In [235]:
df2['Sal_range'] = [df2['Salary'].replace('a year','') for salry in df2['Salary']]
del df2["Sal_range"]
df2['ayear'] = df2['Salary'].str.extract('(. ....)', expand=True)
del df2['ayear']
df2['new_year1'] = df2['Salary'].str.extract('([0-9]+,[0-9]+)', expand=True)
df2['new_year2'] = df2['Salary'].str.extract('-([0-9]+,[0-9]+)', expand=True)
df2.head()

Unnamed: 0_level_0,Company Name,Job Title,Location,Salary,new_year,new_year2,new_year1
Unnamed: 0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
7,Centers for Disease Control and Prevention,Centers for Disease Control and Prevention,"Atlanta, GA","$88,305 - $114,802 a year",88305,,88305
23,Emory University,Emory University,"Atlanta, GA","$85,500 a year",85500,,85500
26,Centers for Disease Control and Prevention,Centers for Disease Control and Prevention,"Atlanta, GA","$74,260 - $96,538 a year",74260,,74260
36,Analytic Recruiting,Analytic Recruiting,"Alpharetta, GA","$75,000 - $90,000 a year",75000,,75000
71,Stackfolio,Stackfolio,"Atlanta, GA 30308 (Old Fourth Ward area)","$80,000 a year",80000,,80000


#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary

In [209]:
import re
import numpy as np
def extract_salary_average(salary_string):
    regex = r'\$([0-9]+,[0-9]+)'
    matches = re.findall(regex, salary_string)
    return np.mean([float(salary.replace(',', '')) for salary in matches ])

In [217]:
#extract_salary_average
sal = df2[['Salary']]
sal_col = extract_salary_average(sal)

TypeError: expected string or buffer

In [None]:
# use '.map' to transform salary to new feature


In [None]:
# save scraped results as a CSV for Tableau/external viz


## Predicting salaries using Logistic Regression

In [None]:
# load in the the data of scraped salaries


#### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median)

In [None]:
# calculate median and create feature with 1 as high salary


### Q: What is the baseline accuracy for this model?

It is 50% if we guess randomly, half the salaries will be below the median and half will be above.

#### Create a Logistic Regression model to predict High/Low salary using statsmodel. Start by ONLY using the location as a feature. Display the coefficients and write a short summary of what they mean.

In [None]:
# create statsmodel and summary
import statsmodels.formula.api as sm


#### Create a few new variables in your dataframe to represent interesting features of a job title.
- For example, create a feature that represents whether 'Senior' or 'Manager' is in the title 
- Then build a new Logistic Regression model with these features. Do they add any value? 


In [None]:
# create senior, director, and manager dummies
salary_data['is_senior'] = salary_data['title'].str.contains('Senior').astype(int) # example


#### Rebuild this model with scikit-learn.
- You can either create the dummy features manually or use the `dmatrix` function from `patsy`
- Remember to scale the feature variables as well!


In [None]:
# scale, (patsy optional), and fit
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from patsy import dmatrix

scaler = StandardScaler()
model = LogisticRegression(penalty = 'l2', C=0.1)


#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy, AUC, precision and recall of the model. 
- Discuss the differences and explain when you want a high-recall or a high-precision model in this scenario.

In [None]:
from sklearn.cross_validation import cross_val_score

for metric in ['accuracy', 'precision', 'recall', 'roc_auc']: # example
    

### Compare L1 and L2 regularization for this logistic regression model. What effect does this have on the coefficients learned?

In [None]:
model = LogisticRegression(penalty = 'l1', C=1.0)

for metric in ['accuracy', 'precision', 'recall', 'roc_auc']:
    

In [None]:
model.fit(X_scaled, y)

df = pd.DataFrame({'features' : X.design_info.column_names, 'coef': model.coef_[0,:]})
df.sort_values('coef', ascending=False, inplace=True)
df

#### Optional: Continue to incorporate other text features from the title or summary that you believe will predict the salary and examine their coefficients. Take ~100 scraped entries with salaries. Convert them to use with your model and predict the salary. Which entries have the highest predicted salaries?

# Bonus Section: Use Count Vectorizer from scikit-learn to create features from the text summaries. 
- Examine using count or binary features in the model
- Re-evaluate the logistic regression model using these. Does this improve the model performance? 
- What text features are most valuable? 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform()

for metric in ['accuracy', 'precision', 'recall', 'roc_auc']:
    scores = cross_val_score(model, X_scaled, y, cv=3, scoring=metric)
    print(metric, scores.mean(), scores.std())

In [None]:
model.fit(X_scaled, y)

df = pd.DataFrame({'features' : vectorizer.get_feature_names(), 'coef': model.coef_[0,:]})
df.sort_values('coef', ascending=False, inplace=True)

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
# retest L1 and L2 regularization
from sklearn.linear_model import LogisticRegressionCV

model = LogisticRegressionCV()


Score: | /24
------|-------
Identify: Problem Statement and Hypothesis | 
Acquire: Import Data using BeautifulSoup| 
Parse: Clean and Organize Data| 
Model: Perform Logistic Regression| 
Evaluate: Logistic Regression Results	|
Present: Blog Report with Findings and Recommendations		| 
Interactive Tableau visualizations | 
Regularization |
Bonus: Countvectorizer  | 