# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Webscraping Project 4 Lab

Week 4 | Day 4

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary predictor with Logistic Regression.

We are going to collect salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of each job, we will attempt to predict its salary. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use Logistic Regression.

- Question: Why would we want this to be a classification problem?
- Answer: While more precision may be better, there is a fair amount of natural variance in job salaries, so predicting a range may be useful.
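
As a rough illustration of turning a salary into a class label, the sketch below splits a handful of made-up salaries at their median. The values and the choice of the median as the cut-off are assumptions for illustration only, not a requirement of the assignment.

```python
import pandas as pd

# Hypothetical average salaries; real values would come from the scraped listings.
salaries = pd.Series([50000, 107500, 110000, 120000, 147500, 200000])

# Assumption: split at the median so the two classes are roughly balanced.
threshold = salaries.median()

# 1 = above the threshold ("high"), 0 = at or below it ("low").
high_salary = (salaries > threshold).astype(int)
```

Other cut-offs (a fixed dollar amount, or quartiles for more than two classes) would work the same way; the median simply keeps the classes balanced.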

The first part of the assignment therefore focuses on scraping Indeed.com (or other sites at your team's discretion). The second part uses the listings that include salary information to build a model that predicts high or low salaries and to identify which features are predictive of that result.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10

Notice that each job listing sits inside a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those.
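
To make the idea concrete, here is a minimal sketch that runs on a small hand-written HTML snippet rather than the live site. The markup, company names, and the `html.parser` backend are assumptions for illustration; the real page is more complex and its class names may change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a results page.
sample_html = """
<div class="result"><a class="jobtitle">Data Scientist</a>
  <span class="company">Acme</span><span class="location">Boulder, CO</span></div>
<div class="result"><a class="jobtitle">ML Engineer</a>
  <span class="company">Initech</span><span class="location">Austin, TX</span></div>
<div class="footer">Not a job listing</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the div tags whose class is "result".
listings = soup.find_all("div", class_="result")

# Pull the job title out of each listing.
titles = [div.find(class_="jobtitle").text.strip() for div in listings]
```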

#### Set up a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)

In [2]:
import requests
import bs4
from bs4 import BeautifulSoup
from IPython.display import HTML
import numpy as np
import pandas as pd

In [9]:
# Functions to extract location, job title, company, salary and description

# extract location
def extract_location_from_soup(some_soup):
    try:
        return some_soup.find("span", class_="location").text
    except:
        return np.nan

# extract job title
def extract_title_from_soup(some_soup):
    try:
        return some_soup.find(class_="jobtitle").text.strip()
    except:
        return np.nan
    
# extract company
def extract_company_from_soup(some_soup):
    try:
        return some_soup.find("span", class_="company").text.strip()
    except:
        return np.nan

# extract salary, fallback version (called by extract_salary_from_soup when it hits an error)
def extract_salary_from_soup_2(some_soup):
    for i in some_soup.findAll("div", class_="sjcl"):
        try:
            for item in i.find("div"):
                return item.strip()
        except:
            return np.nan

#extract salary
def extract_salary_from_soup(some_soup):
    for i in some_soup.findAll("td", class_="snip"):
        try:
            for item in i.find("nobr"):
                return item.strip()
        except:
            return extract_salary_from_soup_2(some_soup)
        
# extract description
def extract_description_from_soup(some_soup):
    try:
        return some_soup.find("span", class_="summary").text.strip()
    except:
        return np.nan


In [47]:
# Create a list of results from website search

url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 400

my_cities = ['New+York', 'San+Francisco', 'Seattle', 'San+Jose', 'Austin', 'Boston',
             'Denver', 'Boulder', 'Minneapolis-Saint+Paul', 'Atlanta']
results = []
page_count = 0
results_pp = []
for city in set(my_cities):
    for start in range(0,max_results_per_city,10):
        URL = url_template.format(city,start)
        r = requests.get(URL)
        indeed_soup = BeautifulSoup(r.content, "lxml")
        results.append(indeed_soup)
        page_count+=1
        results_pp.append(len(indeed_soup.findAll("div", class_="result")))    
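
The pagination above relies on the `start` query parameter advancing in steps of 10. The sketch below isolates that URL-building step so it can be checked without touching the network; `build_page_urls` is a hypothetical helper, not part of the notebook.

```python
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"

def build_page_urls(city, max_results, page_size=10):
    """Return one search URL per results page for a single city (hypothetical helper)."""
    return [url_template.format(city, start)
            for start in range(0, max_results, page_size)]

# Three pages for New York: start=0, start=10, start=20.
urls = build_page_urls("New+York", max_results=30)
```

When fetching these URLs for real, it is usually worth passing a `timeout` to `requests.get` and pausing briefly between requests so the scraper does not hammer the site.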

In [48]:
# Append search results into lists

location = []
title = []
company = []
salary = []
description = []
for page in results:
        for result in page.findAll("div", class_="result"):
            location.append(extract_location_from_soup(result))
            title.append(extract_title_from_soup(result))
            company.append(extract_company_from_soup(result))
            salary.append(extract_salary_from_soup(result))
            description.append(extract_description_from_soup(result))

In [49]:
print(len(location))
print(len(title))
print(len(company))
print(len(salary))
print(len(description))

6000
6000
6000
6000
6000


In [67]:
# Sometimes the number of results per city is inconsistent
# Create df to calculate actual number of results per city

city_pages = []
for city in set(my_cities):
    for i in range(max_results_per_city // 10):
        city_pages.append(city)
city_count = pd.DataFrame({'city':city_pages, 'result_count':results_pp})
city_count = city_count.groupby(["city"])["result_count"].agg(np.sum).to_frame().reset_index()
city_count

Unnamed: 0,city,result_count
0,Atlanta,600
1,Austin,600
2,Boston,600
3,Boulder,600
4,Denver,600
5,Minneapolis-Saint+Paul,600
6,New+York,600
7,San+Francisco,600
8,San+Jose,600
9,Seattle,600


In [54]:
# Create list of cities (as location in search results do not always match exact search terms)

city_list =[]
for city in set(my_cities):
    for i in range(int(city_count.loc[city_count["city"] == city, "result_count"].iloc[0])):
        city_list.append(city)
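
An alternative to rebuilding the city labels from per-page counts is to attach the search city to each record at extraction time. The following is only a sketch of that idea on made-up HTML, not the approach used in this notebook; the `pages` list and its contents are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped pages: (search city, page HTML) pairs.
pages = [
    ("Boulder", '<div class="result"><a class="jobtitle">Data Scientist</a></div>'),
    ("Austin", '<div class="result"><a class="jobtitle">Analyst</a></div>'
               '<div class="result"><a class="jobtitle">ML Engineer</a></div>'),
]

records = []
for city, html in pages:
    soup = BeautifulSoup(html, "html.parser")
    for result in soup.find_all("div", class_="result"):
        # The city travels with the record, so no later alignment step is needed.
        records.append({"city": city,
                        "title": result.find(class_="jobtitle").text.strip()})
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame`, which keeps each listing and its search city on the same row by construction.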

In [72]:
# Create df of all search results

df = pd.DataFrame(list(zip(city_list, location, company, title, salary, description)))
df.columns = ["city", "location", "company", "title", "salary", "description"]
print(df.shape)
df.tail()

(6000, 6)


Unnamed: 0,city,location,company,title,salary,description
5995,New+York,"Jersey City, NJ",EXL,"Manager, Decision Analytics Services",,"Our global footprint of nearly 2,000 data scie..."
5996,New+York,"New York, NY",RangTech,Data Scientist,,"Web analytics, Data mining techniques applicat..."
5997,New+York,"Jersey City, NJ",ISO,Lead Data Scientist,,"Lead Data Scientist. Lead research, evaluation..."
5998,New+York,"New York, NY",Memorial Sloan Kettering,"Data Scientist, Imaging Informatics",,MSK is seeking an Imaging Informatics Data Sci...
5999,New+York,"New York, NY",Booking.com,Data Scientist - Relocation to China (Chinese ...,,Data Scientist – China Localisation. Data Scie...


In [74]:
# Replace None with np.nan in every cell of the df (this covers the salary column)

def replace_none(x):
    if x is None:
        return np.nan
    else:
        return x

df = df.applymap(replace_none)
df.head()

Unnamed: 0,city,location,company,title,salary,description
0,Boulder,"Broomfield, CO 80021",Return Path,Data Scientist,,The Data Scientist will work closely with othe...
1,Boulder,"Boulder, CO",RefactorU,Data Scientist (Curriculum Developer / Instruc...,,RefactorU is looking for one or more talented ...
2,Boulder,"Centennial, CO",Pearson,Machine Learning Engineer,,"Data structures, data representation, data cle..."
3,Boulder,"Boulder, CO",CaliberMind,Senior Data Scientist,"$120,000 a year",The Chief Data Scientist is a critical role in...
4,Boulder,"Boulder, CO",RefactorU,Data Scientist (Curriculum Developer / Instruc...,,RefactorU is looking for one or more talented ...


In [75]:
# Drop duplicates from df

df.drop_duplicates(inplace=True)
df.shape

(3316, 6)

In [76]:
df["city"].value_counts()

Boston                    405
San+Jose                  404
Seattle                   404
New+York                  404
San+Francisco             401
Denver                    347
Boulder                   312
Atlanta                   298
Austin                    171
Minneapolis-Saint+Paul    170
Name: city, dtype: int64

In [None]:
df['salary'].unique()

In [84]:
# Create a separate salary df for only entries with yearly salary

salary_df = df[~df["salary"].isnull()]
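# .copy() gives an independent frame, so the column assignments below do not trigger SettingWithCopyWarning
salary_df = salary_df[salary_df["salary"].str.contains("year")].copy()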
salary_df.reset_index(inplace=True, drop=True)
salary_df["salary"] = salary_df["salary"].apply(lambda x: x.replace("a year",""))
salary_df["salary"]= salary_df["salary"].apply(lambda x: (str(x)).split('-'))

In [89]:
# Create field for average salary if salary is a range

salary_df["low_range"] = salary_df["salary"].apply(lambda x: x[0].replace("$","").replace(",","")).astype(int)
salary_df["high_range"] = salary_df["salary"].apply(lambda x: x[-1].replace("$","").replace(",","")).astype(int)
salary_df["avg"] = (salary_df["low_range"] + salary_df["high_range"])/2
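
The same parsing steps can be sketched as a single function that works on one raw salary string. `parse_yearly_salary` is a hypothetical helper, and it assumes the raw text looks like `$100,000 - $115,000 a year` or `$120,000 a year`, which matches the listings shown in this notebook.

```python
def parse_yearly_salary(text):
    """Return (low, high, average) for one yearly salary string (hypothetical helper)."""
    # Strip the suffix, the dollar signs, and the thousands separators.
    cleaned = text.replace("a year", "").replace("$", "").replace(",", "")
    # A single figure yields one part; a range yields two. int() ignores the padding spaces.
    parts = [int(p) for p in cleaned.split("-")]
    low, high = parts[0], parts[-1]
    return low, high, (low + high) / 2.0

range_example = parse_yearly_salary("$100,000 - $115,000 a year")
single_example = parse_yearly_salary("$120,000 a year")
```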

In [106]:
print("Job listings with salary info: {}".format(len(salary_df)))
print("Total job listings:  {}".format(len(df)))
print("Salaried listings / Total listings {} %".format(round(float(len(salary_df)) / len(df) * 100, 3)))
salary_df.head(2)

Job listings with salary info: 119
Total job listings:  3316
Salaried listings / Total listings 3.589 %


Unnamed: 0,city,location,company,title,salary,description,low_range,high_range,avg
0,Boulder,"Boulder, CO",CaliberMind,Senior Data Scientist,"[$120,000 ]",The Chief Data Scientist is a critical role in...,120000,120000,120000.0
1,Boulder,"Boulder, CO",University of Colorado,Managing Director for STROBE,"[$100,000 , $115,000 ]","Of visiting scientists; In use of excel, data ...",100000,115000,107500.0


In [111]:
salary_df.tail()

Unnamed: 0,city,location,company,title,salary,description,low_range,high_range,avg
114,New+York,"New York, NY",Beeswax,Senior / Director Data Science,"[$130,000 , $165,000 ]",A minimum of 5 years experience in Machine lea...,130000,165000,147500.0
115,New+York,"New York, NY",Averity,Senior Data Scientist for Hedge Fund,"[$150,000 , $250,000 ]",Are you Senior Data Scientist looking to join ...,150000,250000,200000.0
116,New+York,"New York, NY",Harnham,Big Data Engineer,"[$110,000 ]",You will join the data engineering team and wo...,110000,110000,110000.0
117,New+York,"New York, NY",Columbia University,Senior Research Staff Assistant,"[$50,000 ]",Work with scientists to conduct quality contro...,50000,50000,50000.0
118,New+York,"New York, NY 10018 (Clinton area)",ingenium,Lead Data Engineer,"[$200,000 ]",Lead and manage a team of data engineers and s...,200000,200000,200000.0


In [110]:
# Create new csv with salary_df

salary_df.to_csv("indeed_salary.csv", encoding="utf-8")