## Let's Scrape Job Listings!
We will scrape job listings from Monster.com, specifically Data Scientist roles in the Portland, OR area. We are interested in the job title, the company, and the location.

This notebook will take us step by step through scraping and eventually loading into MongoDB for further use. Please follow along and read the comments and text carefully!

In [1]:
# Dependencies
from bs4 import BeautifulSoup
import requests
import pymongo
from pprint import pprint

In [2]:
# Initialize PyMongo to work with MongoDB
conn = 'mongodb://localhost:27017'
client = pymongo.MongoClient(conn)

In [3]:
# Define database and collection.
# If these don't exist, they will be created. Otherwise we will point to them.
# `monster_db` is the mongo database, `jobs` is the mongo collection 
db = client.monster_db
collection = db.jobs

In [4]:
# You can also see existing collections like this. 
# But note that the collection doesn't display until data is inserted.
names = db.list_collection_names()

for name in names:
    print(name)

In [5]:
# URL of page to be scraped
url = "https://www.monster.com/jobs/search/?q=Data-Scientist&where=Portland__2C-OR"
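
If you want to search for a different role or city, the URL above can be assembled from its parts. The sketch below is only a guess at the encoding, inferred from the single URL in this notebook: spaces appear to become `-` and commas appear to become `__2C`. The helper name `build_search_url` is hypothetical, and Monster may encode other characters differently.

```python
def build_search_url(query: str, where: str) -> str:
    """Build a Monster search URL shaped like the one used in this notebook."""

    def encode(text: str) -> str:
        # Assumption: commas become "__2C" and spaces become "-",
        # as inferred from the one example URL above.
        return text.replace(",", "__2C").replace(" ", "-")

    base = "https://www.monster.com/jobs/search/"
    return f"{base}?q={encode(query)}&where={encode(where)}"


rebuilt_url = build_search_url("Data Scientist", "Portland, OR")
print(rebuilt_url)
```

Comparing `rebuilt_url` with the hand-written `url` is a quick way to check the assumed encoding before trying other searches.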

In [6]:
# Use the requests library to grab the HTML from the site. Then use Beautiful Soup
# to parse it and create the soup object.

# Minimize the number of times this code gets run! Once the data is in memory, we don't need
# to request the page again. You only need to rerun it if you change the URL or if you restart
# the Jupyter notebook.
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")
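
One way to avoid re-requesting the page, even after a kernel restart, is to save the HTML to a local file the first time it is fetched. This is a minimal sketch, not part of the original notebook: `load_html`, the cache file name, and the stand-in `fake_fetch` are all hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_html(url: str, cache_path: Path, fetch: Callable[[str], str]) -> str:
    """Return cached HTML if the file exists; otherwise fetch once and save it."""
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    html = fetch(url)
    cache_path.write_text(html, encoding="utf-8")
    return html


# Demo with a stand-in fetcher so the sketch runs offline.
fetch_calls = []


def fake_fetch(page_url: str) -> str:
    fetch_calls.append(page_url)
    return "<html><body>stub page</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "monster_search.html"
    first = load_html("https://example.invalid/search", cache_file, fake_fetch)
    second = load_html("https://example.invalid/search", cache_file, fake_fetch)

# Only the first call reaches the fetcher; the second is served from the file.
print(len(fetch_calls))
```

In the notebook itself, the fetcher could be something like `lambda u: requests.get(u, timeout=10).text`, and the returned string would be passed to `BeautifulSoup` exactly as before.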

#### Step 1
Navigate to the page in a web browser and observe what it looks like. Also, right-click -> view page source to have the HTML open and available to you.

We want to scrape all the job listings in the left-hand panel and store them in MongoDB. Imagine that we are building an aggregator of job aggregators: we want to scrape jobs from many listing sites, such as monster.com, LinkedIn, and Indeed, and keep them all in one place. This activity covers only monster.com.

<img src="../images/listings.png" alt="Drawing" style="width: 400px;"/>

#### Step 2
We'll want to identify a bit of text that we assume is relatively unique, so we can locate it in the HTML. `Data Scientist` would be way too common. I chose `Lead Engineer - Analytics Architecture`. The added benefit of this is that the company, Eclaro, should be relatively uncommon in the text also.

<img src="../images/html.png" alt="Drawing" style="width: 600px;"/>

When I search for `Lead Engineer - Analytics Architecture` in the HTML, I see that it appears twice, both times on the same line, line 1199. (It's all the way over to the right, so I had to scroll a long way to find it.) I also notice that the other information I'm interested in is nearby in the HTML: the company name, Eclaro, and the location, Happy Valley.

#### Step 3
Identify an HTML element that has all the information I'm interested in nested within it. For example, it looks like `<div class="summary">` has the job title, company, and location all nested under it. I might also be able to use `<div class="flex-row">` or `<section class="card-content">`. It can take a bit of experimentation to find the correct "level" of nesting. However, note that I can't choose something like `<div class="company">`, because the job title and location are not nested within it.
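
The experimentation can itself be scripted. The sketch below checks several candidate containers against a small, hypothetical HTML fragment; the fragment is a simplified guess at the page structure described above, not Monster's actual markup, and `holds_all_fields` is a helper name invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on the structure described above.
sample_html = """
<section class="card-content">
  <div class="flex-row">
    <div class="mux-company-logo"><img src="logo.png"/></div>
    <div class="summary">
      <header class="card-header">
        <h2 class="title"><a href="#">Lead Engineer - Analytics Architecture</a></h2>
      </header>
      <div class="company"><span class="name">Eclaro</span></div>
      <div class="location"><span class="name">Happy Valley, OR</span></div>
    </div>
  </div>
</section>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The tag/class pairs we want to find nested inside one container.
wanted = [("h2", "title"), ("div", "company"), ("div", "location")]


def holds_all_fields(container) -> bool:
    """True if every wanted tag/class pair is a descendant of `container`."""
    return all(container.find(tag, class_=cls) is not None for tag, cls in wanted)


candidates = [
    ("div", "summary"),
    ("div", "flex-row"),
    ("section", "card-content"),
    ("div", "company"),
]
coverage = {}
for tag, cls in candidates:
    container = sample_soup.find(tag, class_=cls)
    coverage[cls] = holds_all_fields(container)

print(coverage)
```

On this fragment, the first three candidates contain all three fields and `company` does not, which matches the reasoning above. Running the same check against the real `soup` would show whether the assumption holds on the live page.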

#### Step 4
Let's start writing code! I'm going to choose the "smallest" HTML container that I think will hold all the information I'm interested in. This is because it will take less navigating through an HTML hierarchy to grab the data I'm interested in.

In [23]:
# grab all HTML containers that hold the job listing information
# use `find_all()` along with the class name to filter correctly
results = soup.find_all("div", class_="summary")

In [24]:
# What type of data structure is the `results` variable?
results

[<div class="summary">
 <header class="card-header">
 <h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="11" data-m_impr_j_coc="xconwaycx" data-m_impr_j_jawsid="459374640" data-m_impr_j_jobid="222108792" data-m_impr_j_jpm="1" data-m_impr_j_jpt="1" data-m_impr_j_lat="45.5303" data-m_impr_j_lid="578" data-m_impr_j_long="-122.6848" data-m_impr_j_occid="11892" data-m_impr_j_p="1" data-m_impr_j_postingid="ff3d7caf-47a3-4a7d-924f-5198583832aa" data-m_impr_j_pvc="monster" data-m_impr_s_t="t" data-m_impr_uuid="d97a4a1f-bd38-486f-82f7-4c355b26c7c8" href="https://job-openings.monster.com/senior-data-scientist-portland-or-us-xpo-logistics/222108792" onclick="clickJobTitle('plid=578&amp;pcid=11&amp;poccid=11892','Data Scientist',''); clickJobTitleSiteCat('{&quot;events.event48&quot;:&quot;true&quot;,&quot;eVar25&quot;:&quot;Senior Data Scientist&quot;,&quot;eVar66&quot;:&quot;Monster&quot;,&quot;eVar67&quot;:&quot;JSR2CW&quot;,&quot;eVar26&quot;:&quot;xc

In [25]:
# As we explore the HTML, it's easier to operate on one element of the results list, rather 
# than immediately need to loop through the entire list and parse through every element.
# So let's slice off the first element and just work with that for now.
# Once we've identified everything, we'll go back and put it into the loop.
result = results[0]

# Now, the `result` variable is still a soup object, so we can apply the BeautifulSoup methods.
# For example, we can pretty print the single element
print(result.prettify())

<div class="summary">
 <header class="card-header">
  <h2 class="title">
   <a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="11" data-m_impr_j_coc="xconwaycx" data-m_impr_j_jawsid="459374640" data-m_impr_j_jobid="222108792" data-m_impr_j_jpm="1" data-m_impr_j_jpt="1" data-m_impr_j_lat="45.5303" data-m_impr_j_lid="578" data-m_impr_j_long="-122.6848" data-m_impr_j_occid="11892" data-m_impr_j_p="1" data-m_impr_j_postingid="ff3d7caf-47a3-4a7d-924f-5198583832aa" data-m_impr_j_pvc="monster" data-m_impr_s_t="t" data-m_impr_uuid="d97a4a1f-bd38-486f-82f7-4c355b26c7c8" href="https://job-openings.monster.com/senior-data-scientist-portland-or-us-xpo-logistics/222108792" onclick="clickJobTitle('plid=578&amp;pcid=11&amp;poccid=11892','Data Scientist',''); clickJobTitleSiteCat('{&quot;events.event48&quot;:&quot;true&quot;,&quot;eVar25&quot;:&quot;Senior Data Scientist&quot;,&quot;eVar66&quot;:&quot;Monster&quot;,&quot;eVar67&quot;:&quot;JSR2CW&quot;,&quot;eVar26&quot;:&quo

#### Step 5 
Note that this is NOT the same job we were examining before. This is because we just took the first element of the results, rather than finding the specific job we were working with before. So I will go back to the actual webpage and confirm this information still makes sense. I see that there appears to be a `Senior Data Scientist` role at `XPO Logistics`.

Yep.

<img src="../images/data-scientist.png" alt="Drawing" style="width: 600px;"/>

#### Step 6
Let's write code to extract all the relevant pieces of data

In [26]:
# Within our `result` soup object, identify HTML elements that hold the job title,
# the company, and the location.
# Select them with the `find()` method and a class name, then use `.text` to grab the
# text content of the elements, then use `strip()` to remove leading and trailing whitespace.

title = result.find("h2", class_="title").text.strip()
company = result.find("div", class_="company").text.strip()
location = result.find("div", class_="location").text.strip()
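
Note that `find()` returns `None` when nothing matches, so chaining `.text` onto it raises an `AttributeError` for any listing that lacks one of these elements. A small helper can make that case explicit. This is a sketch only: `text_or_none` is a hypothetical name, and the fragment below is made-up markup with the location deliberately left out.

```python
from typing import Optional

from bs4 import BeautifulSoup


def text_or_none(parent, tag: str, class_name: str) -> Optional[str]:
    """Return the stripped text of the first match, or None if there is no match."""
    element = parent.find(tag, class_=class_name)
    return element.text.strip() if element is not None else None


# Hypothetical fragment: a listing with no location element.
fragment = BeautifulSoup(
    '<div class="summary">'
    '<h2 class="title"><a href="#"> Senior Data Scientist </a></h2>'
    '<div class="company"><span class="name">XPO Logistics</span></div>'
    "</div>",
    "html.parser",
)
sample_listing = fragment.find("div", class_="summary")

sample_title = text_or_none(sample_listing, "h2", "title")
sample_company = text_or_none(sample_listing, "div", "company")
sample_location = text_or_none(sample_listing, "div", "location")
print(sample_title, sample_company, sample_location)
```

With this approach a missing field shows up as `None` in the data instead of as an exception, which makes it easier to decide per field whether to skip the listing or keep it.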

In [27]:
# Print out these values to ensure they look correct
print(title, company, location)

Senior Data Scientist XPO Logistics Portland, OR


#### Step 7
Let's put this all into a loop! So instead of just extracting from the single `result` variable we sliced off, we're going to go through the entire `results` iterable and apply the parsing we just identified.

In [31]:
# for each element in our `results` iterable
for result in results:
    # Wrap the parsing in a try/except block so that a single malformed listing
    # doesn't raise an error and kick us out of the loop
    try:
        # get the title for each job listing
        title = result.find("h2", class_="title").text.strip()
        
        # get the company for each job listing
        company = result.find("div", class_="company").text.strip()
        
        # get the location for each job listing
        location = result.find("div", class_="location").text.strip()
        
        # print these out and make sure they look ok
        print(title, company, location)
        
    # handle an error here: report it and move on to the next listing
    except Exception as error:
        print("Error:", error)

Senior Data Scientist XPO Logistics Portland, OR
Software Dev Engineer III or IV, DOE Cambia Health Solutions, Inc. Portland, OR
Enterprise Account Executive Jobot Portland, OR
BI Analyst Apex Systems Portland, OR
Lead Engineer - Analytics Architecture Eclaro Happy Valley, OR
Healthcare Quality Data Scientist PeaceHealth Vancouver, WA
Senior Data Scientist, Supply Chain Advanced Analytics Nike Beaverton, OR
Principal Data Scientist (Clinical Data Management) Premier Research Group Portland, OR
Data Scientist comScore Portland, OR
Data Scientist - Portland, OR Conexess Group, LLC Portland, OR
Marketing Data Scientist - Portland, OR Moda Health Portland, OR
Data Scientist I DAT Beaverton, OR
Data Scientist BizTek People, Inc. Beaverton, OR
Senior Data Scientist APR Staffing Portland, OR
Data Scientist 3 Lam Research Corporation Tualatin, OR
Data Scientist 3 (Tualatin, OR, US, 97062) LAM Research Tualatin, OR
Associate, Graduate Program, IT LTL - July 2021 XPO Logistics Portland, OR
Healt

#### Step 8
Great, it looks like it's working! Now the last step is to store this information in MongoDB. We'll copy and paste the last bit of code and change the `print()` section to insert the data instead.

In [32]:
# for each element in our `results` iterable
for result in results:
    # Wrap the parsing in a try/except block so that a single malformed listing
    # doesn't raise an error and kick us out of the loop
    try:
        # get the title for each job listing
        title = result.find("h2", class_="title").text.strip()
        
        # get the company for each job listing
        company = result.find("div", class_="company").text.strip()
        
        # get the location for each job listing
        location = result.find("div", class_="location").text.strip()
        
        # Package this data up and insert it into MongoDB.
        # MongoDB documents map to Python dictionaries, so we build a dict.
        post = {
            "title": title,
            "company": company,
            "location": location
        }
        
        # finally, insert that using the `collection` variable we created way up at the top
        collection.insert_one(post)
        
    # handle an error here: print an error message and move on to the next listing
    except Exception as error:
        print("Error:", error)
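
One thing to watch for: every time the cell above is rerun, `insert_one()` adds the same listings again. A possible alternative is pymongo's `update_one()` with `upsert=True`, which updates a matching document or inserts one if none exists. The sketch below assumes that title, company, and location together identify a listing, which may not hold for real data. `upsert_listing` is a hypothetical helper, and `InMemoryCollection` is only a tiny stand-in so the example runs without a MongoDB server.

```python
def upsert_listing(collection, post: dict) -> None:
    """Insert `post`, or update the matching document if one already exists."""
    # Assumption: title + company + location identify a listing well enough.
    key = {field: post[field] for field in ("title", "company", "location")}
    collection.update_one(key, {"$set": post}, upsert=True)


class InMemoryCollection:
    """Stand-in for a pymongo collection; supports only the call used above."""

    def __init__(self):
        self.docs = []

    def update_one(self, query, update, upsert=False):
        for doc in self.docs:
            if all(doc.get(k) == v for k, v in query.items()):
                doc.update(update["$set"])
                return
        if upsert:
            self.docs.append({**query, **update["$set"]})


demo_collection = InMemoryCollection()
demo_post = {
    "title": "Senior Data Scientist",
    "company": "XPO Logistics",
    "location": "Portland, OR",
}

# Simulate running the scraping cell twice.
upsert_listing(demo_collection, demo_post)
upsert_listing(demo_collection, demo_post)
print(len(demo_collection.docs))
```

In the notebook, the real `collection` object would be passed in place of `demo_collection`, replacing the `collection.insert_one(post)` call.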

In [33]:
# Finally, display everything out of the mongo DB
listings = collection.find()

for listing in listings:
    pprint(listing)

{'_id': ObjectId('5fcfc420c830ee94de5b4cc9'),
 'company': 'XPO Logistics',
 'location': 'Portland, OR',
 'title': 'Senior Data Scientist'}
{'_id': ObjectId('5fcfc420c830ee94de5b4cca'),
 'company': 'Cambia Health Solutions, Inc.',
 'location': 'Portland, OR',
 'title': 'Software Dev Engineer III or IV, DOE'}
{'_id': ObjectId('5fcfc420c830ee94de5b4ccb'),
 'company': 'Jobot',
 'location': 'Portland, OR',
 'title': 'Enterprise Account Executive'}
{'_id': ObjectId('5fcfc420c830ee94de5b4ccc'),
 'company': 'Apex Systems',
 'location': 'Portland, OR',
 'title': 'BI Analyst'}
{'_id': ObjectId('5fcfc420c830ee94de5b4ccd'),
 'company': 'Eclaro',
 'location': 'Happy Valley, OR',
 'title': 'Lead Engineer - Analytics Architecture'}
{'_id': ObjectId('5fcfc420c830ee94de5b4cce'),
 'company': 'PeaceHealth',
 'location': 'Vancouver, WA',
 'title': 'Healthcare Quality Data Scientist'}
{'_id': ObjectId('5fcfc420c830ee94de5b4ccf'),
 'company': 'Nike',
 'location': 'Beaverton, OR',
 'title': 'Senior Data Scie

#### Step 9 / Bonus
Now that we've implemented the solution, the requirements have shifted underneath us, as they always seem to. In this case, a salesperson has come back to us and said that they also want to capture the link to the company's logo that Monster displays. Note that not every job listing has an image!

In [34]:
# clear out the data from your mongo DB
collection.drop()

In [40]:
# Now, we need to identify where this data is stored. Is it nested under the same HTML 
# element as the other data we've already captured?

# No, we need to update our `results` iterable to capture the relevant info
results = soup.find_all("div", class_="flex-row")

# for each element in our `results` iterable
for result in results:
    # Wrap the parsing in a try/except block so that a single malformed listing
    # doesn't raise an error and kick us out of the loop
    try:
        # accessing title, company, and location should still work the same way
        title = result.find("h2", class_="title").text.strip()
        company = result.find("div", class_="company").text.strip()
        location = result.find("div", class_="location").text.strip()
        
        # Package this data up
        post = {
            "title": title,
            "company": company,
            "location": location
        }
        
        # but the image link is stored somewhere else, and it's stored as an attribute.
        # how do we access attributes?
        # The image link is stored in an `img` tag that's nested in a `div`
        img_container = result.find("div", class_="mux-company-logo")
        
        # now we need to access the nested `img` tag and read its `src` attribute,
        # but only if both the container and the image exist!
        if img_container is not None and img_container.img is not None:
            img_link = img_container.img["src"]
            # and remember to add it to the data structure that we insert into mongo!
            post["img_link"] = img_link
        
        # finally, insert that using the `collection` variable we created way up at the top
        collection.insert_one(post)
        
    # handle an error here: print an error message and move on to the next listing
    except Exception as error:
        print("Error:", error)
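
To see attribute access in isolation, here is a small sketch using two made-up fragments, one with a logo and one without. The markup and the `logo_src` helper are hypothetical; only the `mux-company-logo` class name comes from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments: one logo container with an <img>, one without.
with_logo = BeautifulSoup(
    '<div class="mux-company-logo">'
    '<img alt="logo" src="https://example.com/logo.png"/>'
    "</div>",
    "html.parser",
)
without_logo = BeautifulSoup('<div class="mux-company-logo"></div>', "html.parser")


def logo_src(fragment):
    """Return the logo URL, or None when the container or <img> is missing."""
    container = fragment.find("div", class_="mux-company-logo")
    if container is None or container.img is None:
        return None
    # Tag.get() returns None for a missing attribute,
    # whereas container.img["src"] would raise a KeyError.
    return container.img.get("src")


print(logo_src(with_logo))
print(logo_src(without_logo))
```

Tag attributes behave like dictionary entries, so `tag["src"]` and `tag.get("src")` mirror `dict[key]` and `dict.get(key)`. Using `.get()` is a design choice that trades a loud failure for a quiet `None`.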

In [41]:
# Finally, display everything out of the mongo DB, this time with the images, if they exist
listings = collection.find()

for listing in listings:
    pprint(listing)

{'_id': ObjectId('5fcfc78dc830ee94de5b4ce2'),
 'company': 'XPO Logistics',
 'img_link': 'https://media.newjobs.com/clu/xcon/xconwaycx/branding/12271/XPO-Logistics-logo.png',
 'location': 'Portland, OR',
 'title': 'Senior Data Scientist'}
{'_id': ObjectId('5fcfc78dc830ee94de5b4ce3'),
 'company': 'Cambia Health Solutions, Inc.',
 'img_link': 'https://media.newjobs.com/clu/xw25/xw250306941wx/branding/95885/Cambia-Health-Solutions-Inc-logo.jpg',
 'location': 'Portland, OR',
 'title': 'Software Dev Engineer III or IV, DOE'}
{'_id': ObjectId('5fcfc78dc830ee94de5b4ce4'),
 'company': 'Jobot',
 'img_link': 'https://media.newjobs.com/clu/xjob/xjobot2018x/branding/161229/Jobot-logo-637293157573438943.png',
 'location': 'Portland, OR',
 'title': 'Enterprise Account Executive'}
{'_id': ObjectId('5fcfc78dc830ee94de5b4ce5'),
 'company': 'Apex Systems',
 'img_link': 'https://media.newjobs.com/clu/xape/xapexppvx/branding/27017/Apex-Systems-logo-637363659369052192.png',
 'location': 'Portland, OR',
 'ti