## Introduction

Now that you know Beautiful Soup and how to select elements from a web page, it's time to practice scraping a website. Scraping is an iterative process: you investigate the structure of the page(s) at hand and then develop scripts tailored to that structure.

## Objectives

In this analysis, we plan to:

* Navigate HTML documents using Beautiful Soup's children and sibling relations

* Select specific elements from HTML using Beautiful Soup

* Use regular expressions to extract items with a certain pattern within Beautiful Soup

* Determine the pagination scheme of a website and scrape multiple pages

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Grabbing an HTML Page

To start, here's how to retrieve an arbitrary web page and load its content into Beautiful Soup for parsing. You first use the requests package to download the HTML and then pass the response content to Beautiful Soup.

In [2]:
html_page = requests.get('https://www.indeed.com/jobs?q=big+data+developer&l=Denver&start=0') # Make a GET request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') # Pass the page contents to Beautiful Soup for parsing
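
The query string in the URL above is encoded by hand (spaces written as `+`). As a minimal sketch, the same URL can be assembled from plain values with the standard library. The helper name `build_search_url` and the constant `BASE_URL` are illustrative assumptions, not part of the original lesson.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"  # same endpoint as the request above


def build_search_url(query, location, start=0):
    """Return a search URL with the query string encoded (hypothetical helper)."""
    # urlencode applies quote_plus by default, so spaces become '+'
    params = {"q": query, "l": location, "start": start}
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("big data developer", "Denver"))
print(build_search_url("data engineer", "San Diego", start=10))
```

Alternatively, `requests.get` accepts a `params` dictionary and performs the encoding itself, which avoids building the string manually.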

## Previewing the Structure

The full HTML is usually too much to read through line by line, but taking a quick look at the structure of the page is always a good idea.

In [3]:
print(soup.prettify())

<!DOCTYPE html>

<html dir="ltr" lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<script src="//d3fw5vlhllyvee.cloudfront.net/s/bc08208/en_US.js" type="text/javascript"></script>
<link href="//d3fw5vlhllyvee.cloudfront.net/s/105b986/jobsearch_all.css" rel="stylesheet" type="text/css"/>
<link href="https://rss.indeed.com/rss?q=big+data+developer&amp;l=Denver" rel="alternate" title="Big Data Developer Jobs, Employment in Denver, CO" type="application/rss+xml"/>
<link href="/m/jobs?q=big+data+developer&amp;l=Denver" media="only screen and (max-width: 640px)" rel="alternate"/>
<script type="text/javascript">

if (typeof window['closureReadyCallbacks'] == 'undefined') {
window['closureReadyCallbacks'] = [];
}

function call_when_jsall_loaded(cb) {
if (window['closureReady']) {
cb();
} else {
window['closureReadyCallbacks'].push(cb);
}
}
</script>
<meta content="1" name="ppstriptst"/>
<script>
var _scriptDownloadCount = 0;
v
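
Because a full page can run to thousands of lines, one option is to print only the beginning of the indented document. The sketch below assumes a small stand-in document (`sample_html`) and a hypothetical helper `preview`. With the real page you would pass the `soup` object created earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = (
    "<html><head><title>Jobs</title></head>"
    "<body><div class='row'><h2>Data Engineer</h2></div></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")


def preview(soup, max_chars=500):
    """Return the first max_chars characters of the indented document."""
    # prettify() must be called; without parentheses you only get the method object
    return soup.prettify()[:max_chars]


print(preview(sample_soup, max_chars=80))
```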

## Selecting a Container

We eventually want to select each individual job posting, but it's often easier to start with an encapsulating container and then make sub-selections within it to find the information we need. In this case, each posting sits in a div whose class list includes row, so we can select all of those divs at once and treat each one as the container for a single job.

In [4]:
job_container = soup.find_all(name="div", attrs={"class" : "row"})
job_container # Previewing is optional but can help you verify you are selecting what you think you are

[<div class="jobsearch-SerpJobCard unifiedRow row result" data-ci="364183981" data-empn="5076844856522515" data-jk="ae19e6e1ff445359" id="pj_ae19e6e1ff445359">
 <style>
 .jobcard_logo{margin:6px 0}.jobcard_logo img{width:auto;max-width:80px;max-height:30px}.jasxrefreshcombotst .jobcard_logo img{max-height:2rem;max-width:100%}
 </style>
 <h2 class="title">
 <a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0DDHMFXDyHIjPleqKN232duWMk2IjjgKsEx-NnxmAniIGsi94dAlAFLUaenusLbuQSSmNWF3aAyNWEp3P0hBsW7XkSkiG9L6C5rXuqzmRII_JCKyHSieKdBKjupg9YkByyIRLeDtRvFeqwo0m8h5wcwZ-8knTaCiQnohfiEiHWZYxBPq-Wxcu4suPy8JcGOKNiHUsZdkUYMBuVXXW3SyAbkVXrme6aJuiiqOp2vJwP8t8aOIrr-8vhyOvvcN8CB4Nir4b7o5tWysNvbPqjTKTEJMLZGRSEwlZtCiR2fcEtJkaepOZrWp82iXTAx6NLrVqN2cyxA_0FJtJLcvDiW15O1zUSab3jwYQuux-8MzbuTTwNq3MzuRv9Kdgl_FgqJ4WiyPtysAkbAAnO30afwhABK2-3HlS44OpPfYwF4t-1fjfecUcTsxcQMfcGU12dU1OhZhh-Zq_qQmNgPOidscMnVv02DQ-ElS7Y=&amp;p=0&amp;fvj=1&amp;vjs=3" id="sja0" onclick="setRefine
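
Note that the matched div carries several classes (`jobsearch-SerpJobCard unifiedRow row result`) even though we searched only for `row`. Beautiful Soup treats `class` as a multi-valued attribute and matches any one of its values. The sketch below illustrates this on a small made-up document (`sample_html` and its `id` values are assumptions for illustration), and also shows sibling and children navigation from a selected container.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the output above
sample_html = """
<div id="results">
  <div class="jobsearch-SerpJobCard unifiedRow row result" id="job1">
    <h2 class="title"><a data-tn-element="jobTitle" title="Data Engineer">Data Engineer</a></h2>
  </div>
  <div class="row result" id="job2">
    <h2 class="title"><a data-tn-element="jobTitle" title="Big Data Developer">Big Data Developer</a></h2>
  </div>
  <div class="footer" id="footer">Page 1</div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "row" matches every div whose class list contains it
cards = sample_soup.find_all(name="div", attrs={"class": "row"})
print([card["id"] for card in cards])  # ['job1', 'job2']

# Sibling navigation: the next div at the same level as the first card
second_card = cards[0].find_next_sibling("div")
print(second_card["id"])  # job2

# Children navigation: keep only tags, skipping whitespace strings
child_tags = [child.name for child in cards[0].children if child.name is not None]
print(child_tags)  # ['h2']
```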

In [5]:
max_results_per_city = 1
city_set = ['Raleigh', 'Boston', 'Portland', 'San+Diego', 'Dallas', 'Denver', 'Hartford', 'Atlanta']
columns = ["city", "job_title", "company_name", "rating", "location", "summary"]
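
The `start` parameter in the search URL is what drives pagination. As a rough sketch, and assuming that `start` is the offset of the first listing on a page and that each page holds 10 listings (neither is confirmed here, so verify by clicking through the site's page links), the offsets needed to cover a given number of results can be computed as follows. The helper `page_offsets` and its `page_size` parameter are hypothetical names.

```python
def page_offsets(max_results, page_size=10):
    """Return the `start` values needed to cover max_results listings.

    page_size=10 is an assumption about the site, not a documented value.
    """
    return list(range(0, max_results, page_size))


print(page_offsets(1))   # [0]
print(page_offsets(30))  # [0, 10, 20]

# Each offset becomes one request URL
for start in page_offsets(30):
    print(f"https://www.indeed.com/jobs?q=big+data+developer&l=Denver&start={start}")
```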

In [6]:
def job_ads_by_title(title):
    """Scrape job cards for `title` in every city of city_set and return them as a DataFrame."""
    rows = []

    for city in city_set:
        # `start` is the pagination offset; a step of 10 assumes ten results per page
        for start in range(0, max_results_per_city, 10):
            page = requests.get(f'https://www.indeed.com/jobs?q={title}&l={city}&start={start}')
            soup = BeautifulSoup(page.text, "html.parser")
            job_container = soup.find_all(name="div", attrs={"class" : "row"})
            for div in job_container:
                sjcl = div.find('div', attrs={'class': 'sjcl'})
                if sjcl is None:
                    continue  # card has no company/location block
                com = div.find_all(name="span", attrs={"class" : "company"})
                loc = sjcl.find_all('div', attrs={'class' : 'location'})
                rat = sjcl.find_all(name="span", attrs={"class" : "ratingsContent"})
                # Keep only cards that list a company, a location, and a rating
                if not (com and loc and rat):
                    continue
                for a in div.find_all(name="a", attrs={"data-tn-element" : "jobTitle"}):
                    # Prefer a <summary> tag; otherwise fall back to the first bullet of the styled list
                    summary_tag = div.find("summary")
                    if summary_tag is not None:
                        summary = summary_tag.text
                    else:
                        ul = div.find(name="ul", attrs={"style":"list-style-type:circle;margin-top: 0px;margin-bottom: 0px;padding-left:20px;"})
                        li = ul.find("li") if ul is not None else None
                        summary = li.text.strip() if li is not None else "N/A"
                    rows.append({
                        "city": city,
                        "job_title": a["title"],
                        "company_name": com[0].text.strip(),
                        "rating": rat[0].text.strip(),
                        "location": loc[0].text.strip(),
                        "summary": summary,
                    })

    df = pd.DataFrame(rows, columns=columns)
    print(len(df))
    print(df)
    return df
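
The function stores each rating as text. As a hedged sketch of how regular expressions can be used with Beautiful Soup, the example below passes a compiled pattern as an attribute filter and then pulls the first number out of the matched text. The sample card, the company name in it, and the helper `parse_rating` are made up for illustration; only the class names are taken from the function above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical job card modeled on the class names used in the function above
sample_card = """
<div class="row result">
  <h2 class="title"><a data-tn-element="jobTitle" title="Jr. Big Data Developer">Jr. Big Data Developer</a></h2>
  <div class="sjcl">
    <span class="company"> Acme Analytics </span>
    <span class="ratingsContent"> 3.9 </span>
    <div class="location">Denver, CO</div>
  </div>
</div>
"""

RATING_RE = re.compile(r"\d+(?:\.\d+)?")  # first integer or decimal number


def parse_rating(card):
    """Return the card's rating as a float, or None if no number is found."""
    # A compiled pattern as an attribute filter matches any class starting with "ratings"
    span = card.find("span", attrs={"class": re.compile(r"^ratings")})
    if span is None:
        return None
    match = RATING_RE.search(span.get_text())
    return float(match.group()) if match else None


card = BeautifulSoup(sample_card, "html.parser").find("div", attrs={"class": "row"})
print(parse_rating(card))  # 3.9
```

Testing extraction logic against a small inline document like this avoids sending a request to the site every time a selector changes.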

In [7]:
if __name__ == "__main__":
    big_data_df = job_ads_by_title("big+data+developer")
    big_data_df.to_csv("big_data_developer_jobs.csv", encoding='utf-8')

29
         city                                          job_title  \
0     Raleigh                             Jr. Big Data Developer   
1     Raleigh                              Informatica Developer   
2     Raleigh                                 AbInitio Developer   
3      Boston                             Jr. Big Data Developer   
4    Portland                             Jr. Big Data Developer   
5    Portland                          Machine Learning Engineer   
6    Portland                           Data Engineering Manager   
7    Portland                         Big Data Software Engineer   
8    Portland                           Senior Big Data Engineer   
9   San+Diego                         Data Integration Developer   
10  San+Diego                             Jr. Big Data Developer   
11  San+Diego  Intern - Information Security: Application Sec...   
12  San+Diego                      Software Engineer 2, Big Data   
13     Dallas                           Big D
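
The city column in the output above still shows the URL-encoded form `San+Diego`, because the values in `city_set` are inserted into the URL as-is. A minimal sketch for turning such values back into display names with the standard library follows. The helper name `display_city` is an assumption for illustration.

```python
from urllib.parse import unquote_plus


def display_city(encoded_city):
    """Convert a URL-encoded city such as 'San+Diego' to 'San Diego'."""
    return unquote_plus(encoded_city)


encoded = ["Raleigh", "San+Diego"]
print([display_city(city) for city in encoded])  # ['Raleigh', 'San Diego']
```

On the scraped DataFrame, the same function could be applied with `big_data_df["city"].map(display_city)` before writing the CSV.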

In [8]:
if __name__ == "__main__":
    data_engineer_df = job_ads_by_title("data+engineer")
    data_engineer_df.to_csv("data_engineer_jobs.csv", encoding='utf-8')

17
         city                                    job_title  \
0     Raleigh                                Data Engineer   
1     Raleigh                            Big Data Engineer   
2      Boston                                Data Engineer   
3    Portland                         Senior Data Engineer   
4    Portland                                Data Engineer   
5   San+Diego                Software Engineer 2, Big Data   
6   San+Diego  Data and Analytics - Release Train Engineer   
7   San+Diego                                Data Engineer   
8   San+Diego                            Sr. Data Engineer   
9      Dallas                                Data Engineer   
10     Dallas                                Data Engineer   
11     Dallas                         Senior Data Engineer   
12     Dallas                          CW Sr Data Engineer   
13     Denver                         Senior Data Engineer   
14     Denver                         Junior Data Engineer   
15   