## Introduction

Now that you know Beautiful Soup and how to select various elements from a web page, it's time to practice scraping a website. We'll see that scraping is an iterative process that involves investigating the web page(s) at hand and developing scripts tailored to their structure.

## Objectives

In this analysis, we plan to:

* Navigate HTML documents using Beautiful Soup's children and sibling relations

* Select specific elements from HTML using Beautiful Soup

* Use regular expressions to extract items with a certain pattern within Beautiful Soup

* Determine the pagination scheme of a website and scrape multiple pages

## Overview

This analysis builds up to a script that iterates over the result pages of the demo site and extracts the job title, company name, company rating, company location (city included), and job description summary of each job ad posted. Along the way, we'll formalize the concepts by writing functions that each extract a list of one of these features from a web page. We'll then combine these functions into the full script, which will look something like this:

df = pd.DataFrame()

for i in range(0, 11):

    url = "https://www.indeed.com/jobs?q=big+data+developer&l=Denver&start={}".format(i)

    html_page = requests.get(url)

    soup = BeautifulSoup(html_page.content, 'html.parser')

    new_place = retrieve_comLocat(soup)
    
    new_names = retrieve_jobTitle(soup)
    
    new_comps = retrieve_comNames(soup)
    
    new_rates = retrieve_comRates(soup)
    
    new_jobad = retrieve_jobConts(soup)
    
    ...

In [None]:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Grabbing an HTML Page

To start, here's how to retrieve an arbitrary web page and load its content into Beautiful Soup for parsing. You first use the `requests` package to pull the HTML itself and then pass that data to Beautiful Soup.

In [None]:
html_page = requests.get('https://www.indeed.com/jobs?q=big+data+engineer&l=Chicago&start=0') # Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') # Pass the page contents to Beautiful Soup for parsing
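
A request can fail or return an error page, so it can help to check the response before parsing it. The following is a minimal sketch, not part of the original notebook: `fetch_soup`, its `get` parameter, and the 10-second timeout are hypothetical choices made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, get=requests.get, timeout=10):
    """Download `url` and return its parsed soup.

    `get` is injectable so the function can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

With this helper, `soup = fetch_soup(url)` would replace the two lines in the cell above.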

## Previewing the Structure

While it's apt to be too much information to effectively navigate, taking a quick peek into the structure of the HTML page is always a good idea.

In [None]:
print(soup.prettify())
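
Because a full page can run to thousands of lines, it may be easier to look at only the beginning of the indented output. Here is a small sketch under that assumption; `preview`, `max_chars`, and the stand-in markup are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document; in the notebook you would pass the `soup` built earlier
sample_soup = BeautifulSoup(
    "<div class='row'><a title='Data Engineer'>Data Engineer</a></div>",
    "html.parser",
)


def preview(soup, max_chars=500):
    """Return the first `max_chars` characters of the indented HTML."""
    return soup.prettify()[:max_chars]


print(preview(sample_soup))
```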

## Selecting a Container

While we're eventually looking to extract the details of each individual job, it's often easier to start with an encapsulating container. In this case, each job posting is wrapped in a `div` with the class `row`. Once we select these containers, we can make sub-selections within each one to find the information we are searching for.

In [None]:
job_container = soup.find_all(name="div", attrs={"class" : "row"})
job_container # Previewing is optional but can help you verify you are selecting what you think you are
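
To make the container-then-sub-selection idea concrete, here is a minimal sketch that runs on a hand-written snippet rather than the live page. The markup is hypothetical; it only mimics the class and attribute names that the functions in this notebook assume.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure assumed in this notebook
html = """
<div class="row">
  <a data-tn-element="jobTitle" title="Data Engineer">Data Engineer</a>
  <span class="company"> Acme </span>
</div>
<div class="row">
  <a data-tn-element="jobTitle" title="Big Data Developer">Big Data Developer</a>
</div>
"""
sample_soup = BeautifulSoup(html, "html.parser")

pairs = []
# Select every container first, then search only inside each one
for row in sample_soup.find_all(name="div", attrs={"class": "row"}):
    title = row.find(name="a", attrs={"data-tn-element": "jobTitle"})
    company = row.find(name="span", attrs={"class": "company"})
    # find() returns None when nothing matches, so fall back to "N/A"
    pairs.append((title["title"], company.text.strip() if company is not None else "N/A"))

print(pairs)
```

Searching inside each container keeps the fields of one posting together, which matters when a posting is missing a field.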

In [None]:
max_results_per_city = 1
city_set = ['Raleigh', 'Boston', 'Portland', 'San+Diego', 'San+Francisco', 'Dallas', 'Denver', 'Hartford', 'Atlanta',
            'Chicago', 'New+York', 'New+Jersey', 'Washington+DC', 'Tampa', 'Houston', 'Phoenix', 'Philadelphia']
columns = ["job_title", "company_name", "company_rating", "company_location", "job_descriptions"]

## Retrieving Company Locations

Write a function that extracts the locations on a given page. The input for the function should be the `soup` for the HTML of the page.

In [None]:
def retrieve_comLocat(soup):

    locations = []

    job_container = soup.find_all(name="div", attrs={"class" : "row"})

    for div in job_container:
        sjcl = div.find(name="div", attrs={"class" : "sjcl"})
        # find() returns None when a row has no "sjcl" div, so guard before searching inside it
        locs = sjcl.find_all(name="div", attrs={"class" : "location"}) if sjcl is not None else []
        if len(locs) > 0:
            for c in locs:
                locations.append(c.text)
        else:
            locations.append("N/A")

    return locations
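
The functions in this notebook share one pattern: look up a nested element and fall back to `"N/A"` when it is missing. As a sketch of how that pattern could be factored out, consider the helper below. `text_or_na` is a hypothetical name and the markup is invented for the example; unlike the functions here, it returns only the first match.

```python
from bs4 import BeautifulSoup


def text_or_na(parent, name, class_name, default="N/A"):
    """Return the stripped text of the first matching descendant, or `default`."""
    if parent is None:
        return default
    element = parent.find(name=name, attrs={"class": class_name})
    return element.text.strip() if element is not None else default


# Hypothetical markup for illustration
row = BeautifulSoup(
    '<div class="row"><div class="sjcl"><div class="location"> Denver, CO </div></div></div>',
    "html.parser",
)
sjcl = row.find(name="div", attrs={"class": "sjcl"})

print(text_or_na(sjcl, "div", "location"))
print(text_or_na(sjcl, "span", "ratingsContent"))
```

Returning exactly one value per row is a design choice: it keeps every feature list the same length as the list of job containers.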

## Retrieving Titles

Write a function that extracts the job titles on a given page. The input for the function should be the `soup` for the HTML of the page.

In [None]:
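def retrieve_jobTitle(soup):

    titles = []

    job_container = soup.find_all(name="div", attrs={"class" : "row"})

    for div in job_container:
        anchors = div.find_all(name="a", attrs={"data-tn-element" : "jobTitle"})
        if len(anchors) > 0:
            for a in anchors:
                titles.append(a["title"])
        else:
            # Keep one entry per row so this list stays aligned with the other features
            titles.append("N/A")

    return titles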

## Retrieving Company Names

Write a function that extracts the company names on a given page. The input for the function should be the `soup` for the HTML of the page.

In [None]:
def retrieve_comNames(soup):
    
    companies = []
    
    job_container = soup.find_all(name="div", attrs={"class" : "row"})
    
    for div in job_container:
        company = div.find_all(name="span", attrs={"class" : "company"})
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        else:
            companies.append("N/A")
            
    return companies

## Retrieving Company Ratings

Write a function that extracts the company ratings on a given page. The input for the function should be the `soup` for the HTML of the page.

In [None]:
def retrieve_comRates(soup):

    ratings = []

    job_container = soup.find_all(name="div", attrs={"class" : "row"})

    for div in job_container:
        sjcl = div.find(name="div", attrs={"class" : "sjcl"})
        # find() returns None when a row has no "sjcl" div, so guard before searching inside it
        rating = sjcl.find_all(name="span", attrs={"class" : "ratingsContent"}) if sjcl is not None else []
        if len(rating) > 0:
            for d in rating:
                ratings.append(d.text.strip())
        else:
            ratings.append("N/A")

    return ratings
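
The ratings come back as strings, with `"N/A"` standing in for missing values. If numeric ratings are needed later, a regular expression can pull the number out of the text. This is a minimal sketch; `parse_rating` and the sample strings are hypothetical, and the real rating text may be formatted differently.

```python
import re

# One or more digits, optionally followed by a decimal part
RATING_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def parse_rating(text):
    """Return the first number in `text` as a float, or None if there is none."""
    match = RATING_PATTERN.search(text)
    return float(match.group()) if match else None


print(parse_rating("4.2"))
print(parse_rating(" 3 stars"))
print(parse_rating("N/A"))
```

Beautiful Soup also accepts compiled patterns as filters, for example `soup.find_all("div", class_=re.compile("^loc"))`, which would match class names starting with `loc`.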

## Retrieving Job Descriptions

Write a function that extracts the job descriptions on a given page. The input for the function should be the `soup` for the HTML of the page.

In [None]:
def retrieve_jobConts(soup):

    descriptions = []

    job_container = soup.find_all(name="div", attrs={"class" : "row"})

    for div in job_container:
        try:
            # find("summary") matches a <summary> tag; .text raises AttributeError if it is missing
            div_one = div.find("summary").text
            descriptions.append(div_one)
        except AttributeError:
            try:
                # Fall back to the first item of the inline-styled bullet list
                div_two = div.find(name="ul", attrs={"style":"list-style-type:circle;margin-top: 0px;margin-bottom: 0px;padding-left:20px;"})
                div_three = div_two.find("li")
                descriptions.append(div_three.text.strip())
            except AttributeError:
                descriptions.append("N/A")

    return descriptions

In [None]:
def job_ads_by_title(title):
    
    jobTitle = []
    compName = []
    rateNumb = []
    compLoct = []
    jobsPost = []
    
    for city in city_set:
        print(city)
        for i in range(0, max_results_per_city, 1):
            url = "https://www.indeed.com/jobs?q={0}&l={1}&start={2}".format(title, city, str(i))
            html_page = requests.get(url)
            soup = BeautifulSoup(html_page.text, "html.parser")
            
            new_names = retrieve_jobTitle(soup)
            new_comps = retrieve_comNames(soup)
            new_rates = retrieve_comRates(soup)
            new_place = retrieve_comLocat(soup)
            new_jobad = retrieve_jobConts(soup)
            
            jobTitle += new_names
            compName += new_comps
            rateNumb += new_rates
            compLoct += new_place
            jobsPost += new_jobad
    
    df = pd.DataFrame([jobTitle, compName, rateNumb, compLoct, jobsPost]).transpose()
    df.columns = columns
    print(len(df))
    print(df.head(10))
    return df
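
The URL in the loop above is assembled by hand, which is why the search terms and city names contain `+` instead of spaces. As an alternative sketch, `urllib.parse.urlencode` can do the escaping, and the paging step can be made explicit. `page_urls` and `page_size` are hypothetical; the default of 1 mirrors the loop above, and the site's actual step should be confirmed by inspecting its pagination links.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def page_urls(title, city, pages, page_size=1):
    """Yield one search URL per results page.

    `page_size` is how much `start` grows per page (an assumption to verify).
    """
    for page in range(pages):
        # urlencode escapes spaces as "+", so plain-text inputs can be used
        query = urlencode({"q": title, "l": city, "start": page * page_size})
        yield f"{BASE_URL}?{query}"


urls = list(page_urls("big data engineer", "San Diego", pages=2))
print(urls)
```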

In [None]:
if __name__ == "__main__":
    big_data_engineer_df = job_ads_by_title("big+data+engineer")
    big_data_engineer_df.to_csv("big_data_engineer_jobs.csv", encoding='utf-8')

In [None]:
if __name__ == "__main__":
    big_data_developer_df = job_ads_by_title("big+data+developer")
    big_data_developer_df.to_csv("big_data_developer_jobs.csv", encoding='utf-8')

In [None]:
if __name__ == "__main__":
    data_engineer_df = job_ads_by_title("data+engineer")
    data_engineer_df.to_csv("data_engineer_jobs.csv", encoding='utf-8')