This notebook is an introduction to web scraping in Python. <br />
We walk through an end-to-end web scraping pipeline and build a script that fetches job offers and provides links to apply for them.

---

The first step is to inspect the web page we want to scrape. Here, I scrape data from a software developer job search page on Monster.com. <br />
The URL of this page is "https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia" <br />
The query parameters in this URL are "?q=Software-Developer&where=Australia" <br />
<br />
We use Python's requests and BeautifulSoup packages for the scraping: requests downloads the page and BeautifulSoup parses its HTML. <br/>
<br />
To install the packages, run <br />
pip3 install requests <br />
pip3 install beautifulsoup4
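
To make the role of the query parameters concrete, here is a minimal sketch that builds the same search URL from its parts using only the standard library. The helper name `build_search_url` is hypothetical, and the sketch assumes the site accepts `q` and `where` exactly as they appear in the URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://www.monster.com/jobs/search/"


def build_search_url(query: str, location: str) -> str:
    """Return the search URL for a job title and a location (hypothetical helper)."""
    params = {"q": query, "where": location}
    # urlencode joins the pairs with '&' and escapes unsafe characters
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("Software-Developer", "Australia")
print(url)

# Read the parameters back out of the URL to confirm the round trip
print(parse_qs(urlsplit(url).query))
```

Under the same assumption, changing the two arguments would give the URL for a different job title or location.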

In [1]:
import requests
from bs4 import BeautifulSoup

In [10]:
URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)  # retrieve the HTML from the server
page.raise_for_status()  # stop early if the server returned an error status

soup = BeautifulSoup(page.content, 'html.parser')  # parse the HTML content into a BeautifulSoup object
soup

<!DOCTYPE html>

<html lang="en" xml:lang="en" xmlns="https://www.w3.org/1999/xhtml">
<head>
<link href="https://coda.newjobs.com" rel="preconnect"/>
<link href="https://js-seeker.newjobs.com" rel="preconnect"/>
<link href="https://css-seeker.newjobs.com" rel="preconnect"/>
<link href="https://securemedia.newjobs.com" rel="preconnect"/>
<link href="https://logs2.jobs.com" rel="preconnect"/>
<link href="https://job-openings.monster.com" rel="preconnect"/>
<link href="https://apis.google.com" rel="preconnect"/>
<link href="https://www.google.com" rel="preconnect"/>
<link href="https://accounts.google.com" rel="preconnect"/>
<link href="https://content.googleapis.com" rel="preconnect"/>
<link href="https://ssl.gstatic.com" rel="preconnect"/>
<link href="https://www.dropbox.com" rel="preconnect"/>
<link href="http://js.live.net" rel="preconnect"/>
<link href="https://r.turn.com" rel="preconnect"/>
<link href="https://www.facebook.com" rel="preconnect"/>
<link href="https://dpm.demdex.net" 

By inspecting the page with the browser's developer tools, we find that the HTML element containing all of the job postings is a <div> whose id attribute has the value "ResultsContainer". <br />
<br />
We use BeautifulSoup's find method to look up the element with that id.

In [12]:
results = soup.find(id='ResultsContainer')
print(results.prettify()) #For easier viewing, you can .prettify() any Beautiful Soup object

<div class="mux-custom-scroll" data-extend="left" data-mux="customScroll" data-target="html" id="ResultsContainer">
 <div class="scrollable" id="ResultsScrollable">
  <script type="application/ld+json">
   {"@context":"https://schema.org","@type":"ItemList","mainEntityOfPage":{
            "@type":"CollectionPage","@id":"https://www.monster.com/jobs/search/?q=Software-Developer&amp;where=Australia"
            }
            ,"itemListElement":[

                 {"@type":"ListItem","position":1,"url":"https://job-openings.monster.com/senior-lead-software-engineer-browser-sunnyvale-ca-plantation-fl-hq-austin-tx-culver-new-york-city-ca-seattle-wa-toronto-ny-us-magic-leap-inc/36b509cf-114c-48aa-aede-e6574b6cbff5"}
                    ,
                 {"@type":"ListItem","position":2,"url":""}
                    ,
                 {"@type":"ListItem","position":3,"url":"https://job-openings.monster.com/sql-bi-ssrs-ssis-developer-for-blackboard-nyc-new-york-wa-us-lancesoft-inc/5
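
`find` returns `None` when nothing matches, so it can help to check the result before using it. The sketch below illustrates this on a small made-up HTML snippet rather than the live page; the snippet and the `find_by_id` helper are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page (illustration only)
SAMPLE_HTML = """
<html><body>
  <div id="ResultsContainer"><p>3 jobs found</p></div>
</body></html>
"""


def find_by_id(soup, element_id):
    """Return the element with the given id, or raise if it is absent."""
    element = soup.find(id=element_id)
    if element is None:
        raise ValueError(f"No element with id {element_id!r} in the page")
    return element


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = find_by_id(sample_soup, "ResultsContainer")
print(container.p.text)

# Looking up an id that does not exist yields None from find()
print(sample_soup.find(id="NoSuchContainer"))
```

Failing with a clear message at this point is usually easier to debug than an `AttributeError` raised later, for example if the site changes its markup.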

From the results object, we select every job posting; each one is wrapped in a 'section' element with the class 'card-content'.

In [13]:
job_elems = results.find_all('section', class_='card-content')

In [14]:
for job_elem in job_elems: #print them out individually for better viewing
    print(job_elem, end='\n'*2)

<section class="card-content" data-jobid="36b509cf-114c-48aa-aede-e6574b6cbff5" onclick="MKImpressionTrackingMouseDownHijack(this, event)">
<div class="flex-row">
<div class="mux-company-logo thumbnail"></div>
<div class="summary">
<header class="card-header">
<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="660" data-m_impr_j_coc="" data-m_impr_j_jawsid="435655462" data-m_impr_j_jobid="2219341" data-m_impr_j_jpm="2" data-m_impr_j_jpt="3" data-m_impr_j_lat="0" data-m_impr_j_lid="0" data-m_impr_j_long="0" data-m_impr_j_occid="11970" data-m_impr_j_p="1" data-m_impr_j_postingid="36b509cf-114c-48aa-aede-e6574b6cbff5" data-m_impr_j_pvc="ec3a6188-6a80-441a-814d-a9e2c9b76318" data-m_impr_s_t="t" data-m_impr_uuid="cc42d2f9-0c70-4f40-b5bc-34285df097fa" href="https://job-openings.monster.com/senior-lead-software-engineer-browser-sunnyvale-ca-plantation-fl-hq-austin-tx-culver-new-york-city-ca-seattle-wa-toronto-ny-us-magic-leap-inc/36b509cf-114c-48aa-

In [15]:
#Pick out the title, company and location elements by their class names
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    print(title_elem)
    print(company_elem)
    print(location_elem)
    print()

<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="660" data-m_impr_j_coc="" data-m_impr_j_jawsid="435655462" data-m_impr_j_jobid="2219341" data-m_impr_j_jpm="2" data-m_impr_j_jpt="3" data-m_impr_j_lat="0" data-m_impr_j_lid="0" data-m_impr_j_long="0" data-m_impr_j_occid="11970" data-m_impr_j_p="1" data-m_impr_j_postingid="36b509cf-114c-48aa-aede-e6574b6cbff5" data-m_impr_j_pvc="ec3a6188-6a80-441a-814d-a9e2c9b76318" data-m_impr_s_t="t" data-m_impr_uuid="cc42d2f9-0c70-4f40-b5bc-34285df097fa" href="https://job-openings.monster.com/senior-lead-software-engineer-browser-sunnyvale-ca-plantation-fl-hq-austin-tx-culver-new-york-city-ca-seattle-wa-toronto-ny-us-magic-leap-inc/36b509cf-114c-48aa-aede-e6574b6cbff5" onclick="clickJobTitle('plid=0&amp;pcid=660&amp;poccid=11970','Software Developer',''); clickJobTitleSiteCat('{&quot;events.event48&quot;:&quot;true&quot;,&quot;eVar25&quot;:&quot;Senior/Lead Software Engineer, Browser&quot;,&quot;eVar66&quot

Use BeautifulSoup's .text attribute to get only the text content of the HTML elements, and .strip() to remove the surrounding whitespace.

In [16]:
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    if None in (title_elem, company_elem, location_elem): #Skip postings where an element is missing (find returns None)
        continue
    print(title_elem.text.strip())
    print(company_elem.text.strip())
    print(location_elem.text.strip())
    print()

Senior/Lead Software Engineer, Browser
Magic Leap, Inc.
Sunnyvale, CA; Plantation, FL (HQ); Austin, TX; Culver New York City, CA; Seattle, WA; Toronto, NY

SQL BI (SSRS, SSIS) developer for Blackboard - NYC
LanceSoft Inc
New york, WA

SAP BI Developer
Conoco Phillips
Brisbane, QLD

Python Developer
LanceSoft Inc
Woodlands, WA

SAP BI Developer
Conoco Phillips
Brisbane, QLD

Senior Software Engineer, Application Framework - Contractor
Magic Leap, Inc.
Plantation, FL (HQ); Toronto, ON; Sunnyvale, CA; Culver New York City, CA; Seattle, WA; Austin, TX

Agile Engineer
Commonwealth Superannuation Corporation (CSC)
Canberra, ACT

Customer Solutions Architect (Software) Professional Services Cyber Security
Varmour
Sydney, NSW

Software Platform Architect
Magic Leap, Inc.
Plantation, FL; Sunnyvale, CA; Culver New York City, CA; Austin, TX; Seattle, WA; Toronto, NY

Junior QA Analyst - Melbourne, Victoria
Mediaocean
Melbourne, VIC

Sr Software Engineer - C++
Adobe Inc.
Seattle, WA

Payroll Teste
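
The extraction loop above can also be wrapped in a function that returns the postings as a list of dictionaries, which is easier to reuse than printed output. This is a sketch on made-up HTML that mimics the class names seen in the page; `extract_jobs` and the sample data are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure of the job cards (illustration only)
SAMPLE_HTML = """
<div id="ResultsContainer">
  <section class="card-content">
    <h2 class="title"><a href="https://example.com/jobs/1">Python Developer</a></h2>
    <div class="company">Example Pty Ltd</div>
    <div class="location">Brisbane, QLD</div>
  </section>
  <section class="card-content"></section>
</div>
"""


def extract_jobs(container):
    """Return one dict per complete job card found inside the container."""
    jobs = []
    for card in container.find_all("section", class_="card-content"):
        title = card.find("h2", class_="title")
        company = card.find("div", class_="company")
        location = card.find("div", class_="location")
        # Skip cards (such as empty placeholders) that lack any of the fields
        if any(elem is None for elem in (title, company, location)):
            continue
        jobs.append({
            "title": title.get_text(strip=True),
            "company": company.get_text(strip=True),
            "location": location.get_text(strip=True),
        })
    return jobs


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
jobs = extract_jobs(sample_soup.find(id="ResultsContainer"))
print(jobs)
```

The empty second card in the sample plays the role of the placeholder postings that would otherwise cause errors on `None`.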

Search for jobs by specifying text that must appear in the job title

In [29]:
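#Search for jobs whose title contains 'analyst'
#The filter is applied to each tag's .string, which can be None, so check before calling .lower()
analyst_jobs = results.find_all('h2',
                               string=lambda text: text is not None and 'analyst' in text.lower())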

print("Number of analyst jobs: ",len(analyst_jobs))

Number of analyst jobs:  3
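
A `string=` filter is matched against each tag's `.string`, which is `None` when the tag has more than one child. Depending on the Beautiful Soup version, the filter function may then be called with `None`, so a filter that calls `.lower()` without a check can fail. The sketch below shows a guarded filter on made-up HTML, plus an alternative that matches on the full text; `title_contains` and the sample titles are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up headings; the third one has two children, so its .string is None
SAMPLE_HTML = (
    '<h2 class="title"><a href="#1">Junior QA Analyst</a></h2>'
    '<h2 class="title"><a href="#2">Python Developer</a></h2>'
    '<h2 class="title"><a href="#3">Test Analyst</a><span>New</span></h2>'
)
sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def title_contains(keyword):
    """Build a filter usable as find_all(..., string=...) that tolerates None."""
    keyword = keyword.lower()

    def matcher(text):
        return text is not None and keyword in text.lower()

    return matcher


# Matches only headings whose .string is a single piece of text
by_string = sample_soup.find_all("h2", string=title_contains("analyst"))
print([h2.get_text(strip=True) for h2 in by_string])

# Alternative: compare against all text inside the heading
by_full_text = [
    h2 for h2 in sample_soup.find_all("h2")
    if "analyst" in h2.get_text().lower()
]
print(len(by_full_text))
```

The second approach also finds headings with extra child elements, at the cost of matching text from those children as well.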


In [30]:
#Provide links for applying to those jobs
for a_job in analyst_jobs:
    link = a_job.find('a')['href']
    print(a_job.text.strip())
    print(f"Apply here: {link}\n")

Junior QA Analyst - Melbourne, Victoria
Apply here: https://job-openings.monster.com/junior-qa-analyst-melbourne-victoria-melbourne-vic-us-mediaocean/fcb03051-f03f-4d58-8198-09bcd25371db

Test Analyst
Apply here: https://job-openings.monster.com/test-analyst-canberra-act-us-dialog-group/ddcc107c-6d68-4d68-a494-1cef3dad39d2

Senior Test Analyst
Apply here: https://job-openings.monster.com/senior-test-analyst-canberra-act-us-dialog-group/78561a2e-5485-47c8-9441-e79f0e906cec
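
The link lookup above assumes every heading contains an `<a>` tag with an absolute `href`. A slightly more defensive sketch, again on made-up HTML, skips headings without a link and resolves relative links against a base URL. The `job_links` helper, the `example.com` base URL and the sample paths are all hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://example.com/"  # placeholder base URL (assumption)

# Made-up headings: a relative link, an absolute link, and no link at all
SAMPLE_HTML = (
    '<h2 class="title"><a href="/jobs/test-analyst/123">Test Analyst</a></h2>'
    '<h2 class="title"><a href="https://example.com/jobs/qa/456">Junior QA Analyst</a></h2>'
    '<h2 class="title">Senior Test Analyst</h2>'
)


def job_links(soup, base_url):
    """Return (title, absolute_url) pairs for headings that contain a link."""
    links = []
    for heading in soup.find_all("h2", class_="title"):
        anchor = heading.find("a", href=True)  # only anchors that have an href
        if anchor is None:
            continue  # no link to apply through, so skip this heading
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        links.append((heading.get_text(strip=True), urljoin(base_url, anchor["href"])))
    return links


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = job_links(sample_soup, BASE_URL)
for title, link in links:
    print(title)
    print(f"Apply here: {link}\n")
```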



---