# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Web Scraping Job Postings

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting salary for these job postings.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

In [1]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

## A. Test the scraping code to ensure that the correct fields are scraped

In [3]:
URL = 'https://www.indeed.com/jobs?q=data+scientist&l=New+York'
# Request the URL stated above
page = requests.get(URL)
# Parse the response with the HTML parser so Python can navigate the page's elements instead of treating it as one long string
soup = BeautifulSoup(page.text, 'html.parser')
# Print soup as an indented tree for easier reading (commented out to keep the notebook short)
# print(soup.prettify())

In [4]:
# Job Title

def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name='div', attrs={'class':'row'}):
        for a in div.find_all(name='a', attrs={'data-tn-element':'jobTitle'}):
            jobs.append(a['title'])
    return(jobs)
extract_job_title_from_result(soup)

[u'Data Scientist',
 u'Junior Data Scientist',
 u'Data Scientist',
 u'Data Scientist - New York',
 u'Data Scientist',
 u'Data Scientist (Product)',
 u'Data Scientist and Machine Learning Researcher',
 u'Data Scientist - Journey Analytics',
 u'Data Scientist - Marketing',
 u'Data Analyst / Data Scientist']

In [67]:
# Company Name

def extract_company_from_result(soup):
    companies = []
    for div in soup.find_all(name='div', attrs={'class':'row'}):
        company = div.find_all(name='span', attrs={'class':'company'})
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        else:
            # fall back to the alternate tag only when no company span is present
            sec_try = div.find_all(name='span', attrs={'class':'result-link-source'})
            for span in sec_try:
                companies.append(span.text.strip())
    return(companies)
 
extract_company_from_result(soup)

[u'Groupon',
 u'DiMeo Schneider & Associates',
 u'NORC at the University of Chicago',
 u'American College of Surgeons',
 u'NORC at the University of Chicago',
 u'The University of Chicago',
 u'Bobit Business Media',
 u'Natural Resources Defense Council',
 u'Rush University Medical Center',
 u'Nielsen']

In [5]:
# Location

def extract_location_from_result(soup):
    locations =[]
    spans=soup.find_all('span',attrs={'class':'location'})
    for span in spans:
        locations.append(span.text)
    return(locations)

extract_location_from_result(soup)


[u'New York, NY',
 u'New York, NY',
 u'New York, NY',
 u'New York, NY',
 u'New York, NY',
 u'New York, NY 10011 (Chelsea area)',
 u'New York, NY',
 u'New York, NY 10022 (Midtown area)',
 u'New York, NY',
 u'New York, NY']

In [6]:
# Confirming the tags for the salary

span=soup.find_all('span',attrs={'class':'no-wrap'})
span


[<span class="no-wrap"><b>relevance</b> -\n            <a href="/jobs?q=data+scientist&amp;l=New+York&amp;sort=date" rel="nofollow">date</a></span>,
 <span class="no-wrap">\n                $90,000 - $115,000 a year</span>,
 <span class="no-wrap">\n                $155,000 - $165,000 a year</span>,
 <span class="no-wrap">\n                $65 - $75 an hour</span>]
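The salary strings above mix yearly and hourly ranges, and the first `no-wrap` span is the page's sort control rather than a salary. Before salaries can be used as a prediction target they need to become numbers. The following is a minimal sketch of one way to do that, not part of the original scrape: `parse_salary` is a hypothetical helper, and the conversion factors (2,080 working hours per year, 12 months per year) are assumptions you may want to change.

```python
import re

def parse_salary(text):
    """Return the midpoint of a salary string as a yearly figure, or None."""
    # Collect every dollar amount and drop the thousands separators.
    amounts = [float(m.replace(',', ''))
               for m in re.findall(r'\$(\d[\d,]*(?:\.\d+)?)', text)]
    if not amounts:
        return None  # e.g. the "relevance - date" sort control
    midpoint = sum(amounts) / len(amounts)
    lowered = text.lower()
    # Assumed conversion factors to a yearly figure.
    if 'hour' in lowered:
        return midpoint * 2080
    if 'month' in lowered:
        return midpoint * 12
    return midpoint

print(parse_salary('$90,000 - $115,000 a year'))
print(parse_salary('$65 - $75 an hour'))
```

Taking the midpoint of a range is only one possible choice; keeping the lower and upper bounds as separate columns would preserve more information.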

In [70]:
# Job Summary

def extract_summary_from_result(soup):
    summaries=[]
    spans = soup.find_all('span',attrs={'class':'summary'})
    for span in spans:
        summaries.append(span.text.strip())
    return(summaries)

extract_summary_from_result(soup)

[u'We are looking for a creative and innovative data scientist to drive strategic initiatives that will propel the future growth of Groupon....',
 u'DiMeo Schneider & Associates, L.L.C., a Chicago investment management consulting firm, seeks an experienced and ambitious professional for the position of...',
 u'Data mining, data analytics, "Big Data," administrative data linkage; Through our projects and presentations before Congress, federal, state and local...',
 u'Solid understanding of statistical reporting, data quality and data management programming principles (e.g., transposing and summarizing raw data and...',
 u'This position is responsible for performing moderately complex tasks related to research design, data collection and analysis, developing instrumentation for...',
 u'Experience with health and claims data highly appreciated. Ability to work discretely with sensitive and confidential data required....',
 u'Analyze and interpret various data sets including survey results

In [3]:
# Set the search parameters

max_results_per_city = 200
city_set = ['New+York','Chicago','San+Francisco','Austin','Seattle','Los+Angeles','Philadelphia','Atlanta','Dallas','Pittsburgh','Portland','Phoenix','Denver','Houston','Miami','Washington+DC','Boulder']
columns = ['city','job_title','company_name','location','summary','salary']
sample_df = pd.DataFrame(columns=columns)

In [4]:
# Scraping

for city in city_set:
    for start in range(0, max_results_per_city, 10):
        page = requests.get('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=' + str(city) + '&start=' + str(start))
        time.sleep(1)  # pause between requests so the server is not hit too quickly
        # page.text is already a decoded string, so from_encoding is not needed
        soup = BeautifulSoup(page.text, 'lxml')
        for div in soup.find_all(name='div', attrs={'class':'row'}):
            num = len(sample_df) + 1   # row number for index of job posting in df
            job_post = []              # empty list for each posting
            job_post.append(city)      # append city name
            for a in div.find_all(name='a', attrs={'data-tn-element':'jobTitle'}):
                job_post.append(a['title'])    # append job title
            company = div.find_all(name='span', attrs={'class':'company'})
            if len(company) > 0:
                for b in company:
                    job_post.append(b.text.strip())
            else:
                # fall back to the alternate company tag
                sec_try = div.find_all(name='span', attrs={'class':'result-link-source'})
                for span in sec_try:
                    job_post.append(span.text.strip())
            c = div.find_all('span', attrs={'class':'location'})
            for span in c:
                job_post.append(span.text)
            d = div.find_all('span', attrs={'class':'summary'})  # summary text
            for span in d:
                job_post.append(span.text.strip())
            try:                     # salary
                div_two = div.find(name='span', attrs={'class':'no-wrap'})
                job_post.append(div_two.text.strip())
            except AttributeError:   # find() returned None: no salary listed
                job_post.append('nothing_found')
            # keep only complete postings; a row with missing or extra fields
            # cannot be assigned to the six-column DataFrame
            if len(job_post) == len(columns):
                sample_df.loc[num] = job_post
                          
sample_df.to_csv('[scrape2].csv',encoding='utf-8')
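With the postings saved, the second objective (can required skills predict job title?) needs skill features derived from the `summary` column. The following is a minimal sketch of one way to build them, not something the scrape above already does: `SKILLS` and `skill_flags` are hypothetical names, and the skill list is only an illustrative assumption.

```python
import re

# Hypothetical skill list for illustration; extend it to suit the postings collected.
SKILLS = ['python', 'sql', 'r', 'spark', 'excel']

def skill_flags(summary, skills=SKILLS):
    """Map each skill to 1 if it appears as a whole word in the summary, else 0."""
    lowered = summary.lower()
    return {skill: int(re.search(r'\b' + re.escape(skill) + r'\b', lowered) is not None)
            for skill in skills}

print(skill_flags('Experience with Python and SQL required'))
```

Whole-word matching keeps the single-letter skill `r` from matching inside words such as "required". Skill names that end in punctuation (for example `c++`) would need a different pattern, since `\b` only matches next to word characters.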

                          

                          

