# Job Board Scraping Lab

In this lab you will first see a minimal but fully functional code snippet that scrapes the LinkedIn Job Search webpage. You will then build on the example code to complete several challenges.

### Some Resources 

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

In [2]:
# Import the required libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

"""
This function searches job posts from LinkedIn and converts the results into a dataframe.
"""
def scrape_linkedin_job_search(keywords):
    
    # Define the base url to be scraped.
    # An all-uppercase variable name signals a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
    # Assemble the full url with parameters
    scrape_url = ''.join([BASE_URL, 'keywords=', keywords])

    # Create a request to get the data from the server 
    page = requests.get(scrape_url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Create an empty dataframe with the columns consisting of the information you want to capture
    columns = ['Title', 'Company', 'Location']
    data = pd.DataFrame(columns=columns)

    # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
    # Then in each job card, extract the job title, company, and location data.
    titles = []
    companies = []
    locations = []
    for card in soup.select("div.result-card__contents"):
        title = card.findChild("h3", recursive=False)
        company = card.findChild("h4", recursive=False)
        location = card.findChild("span", attrs={"class": "job-result-card__location"}, recursive=True)
        titles.append(title.string)
        companies.append(company.string)
        locations.append(location.string)
    
    # Build the dataframe from the collected job titles, companies, and locations.
    # (DataFrame.append was removed in pandas 2.0, so the rows are passed in one go.)
    data = pd.DataFrame(list(zip(titles, companies, locations)), columns=columns)
    
    # Return dataframe
    return data

In [2]:
# Example to call the function

results = scrape_linkedin_job_search('data%20analysis')
results

| | Title | Company | Location |
|---|---|---|---|
| 0 | Master Data Management Architect /Manager | Synergy Business Consulting, Inc. | Teaneck, New Jersey, United States |
| 1 | SEO Manager | Onward Select | Atlanta, Georgia |
| 2 | Data Governance Analyst | Momentum Consulting Corporation | Miami, Florida |
| 3 | Data Systems Analyst | Cypress HCM | San Francisco, California |
| 4 | Data Analyst w/ AWS | Digipulse Technologies Inc. | Greensboro, North Carolina, United States |
| 5 | Rebate & Data Analyst (Contract) | SolomonEdwards | Philadelphia, Pennsylvania |
| 6 | Tibco Spotfire Data Analyst | GSquared Group | Atlanta, Georgia |
| 7 | Data Modeler | Confiance Tech Solutions | Greater Atlanta Area |
| 8 | HEDIS Quality Analyst | HealthTECH Resources, Inc. | New York, New York |
| 9 | Senior Financial Analyst | Talent 360 Solutions | Greater Atlanta Area |


## Challenge 1

The first challenge for you is to update the `scrape_linkedin_job_search` function by adding a new parameter called `num_pages`. This will allow you to search more than 25 jobs with this function. Suggested steps:

1. Go to https://www.linkedin.com/jobs/search/?keywords=data%20analysis in your browser.
1. Scroll down the left panel and click the page 2 link. Look at how the URL changes and identify the page offset parameter.
1. Add `num_pages` as a new param to the `scrape_linkedin_job_search` function. Update the function code so that it uses a "for" loop to retrieve several pages of search results.
1. Test your new function by scraping 5 pages of the search results.

Hint: Prepare for the case where there are fewer than 5 pages of search results. Your function should be robust enough to **not** trigger errors. If the search has already reached the end, simply skip the remaining requests and return all results collected so far.
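
The paging logic from the steps above can be tried out without making any network requests. This is a minimal sketch, assuming that each results page holds 25 jobs and that the offset parameter is named `start` (verify both in your browser in step 2). `build_page_urls` and `RESULTS_PER_PAGE` are hypothetical names introduced only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.linkedin.com/jobs/search/?'
RESULTS_PER_PAGE = 25  # assumption: one page of search results holds 25 jobs


def build_page_urls(keywords, num_pages):
    """Return one search URL per results page (hypothetical helper)."""
    urls = []
    for page_index in range(num_pages):
        # The offset of the first job on each page: 0, 25, 50, ...
        params = {'keywords': keywords, 'start': page_index * RESULTS_PER_PAGE}
        urls.append(BASE_URL + urlencode(params))
    return urls


for url in build_page_urls('data analysis', 3):
    print(url)
```

Note that `urlencode` escapes the values itself, so here the keywords are passed as plain text (`'data analysis'`) rather than pre-encoded (`'data%20analysis'`).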

In [47]:
# your code here



# Import the required libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

"""
This function searches job posts from LinkedIn and converts the results into a dataframe.
"""
def scrape_linkedin_job_search_25plus(keywords,num_pages):
    
    # Define the base url to be scraped.
    # An all-uppercase variable name signals a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
    
    titles = []
    companies = []
    locations = []
    
    # Note: num_pages is used here as the maximum number of results (25 per page), not as a page count
    index_counter = 0
    while 25*index_counter < num_pages:
         # Assemble the full url with parameters
        scrape_url = ''.join([BASE_URL, 'keywords=', keywords, "&start=",str(25*index_counter)])
        
        try:
            # Create a request to get the data from the server 
            page = requests.get(scrape_url)
        except requests.RequestException:
            break
    
        soup = BeautifulSoup(page.text, 'html.parser')

        # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
        # Then in each job card, extract the job title, company, and location data.

        for card in soup.select("div.result-card__contents"):
            title = card.findChild("h3", recursive=False)
            company = card.findChild("h4", recursive=False)
            location = card.findChild("span", attrs={"class": "job-result-card__location"}, recursive=True)
            titles.append(title.string)
            companies.append(company.string)
            locations.append(location.string)
            
        # Stop paging when a page returns fewer than 25 results, since that means it was the last page
        if len(soup.select("div.result-card__contents")) < 25:
            break
        else:            
            index_counter += 1
    
    # Create an empty dataframe with the columns consisting of the information you want to capture
    columns = ['Title', 'Company', 'Location']
    data = pd.DataFrame(columns=columns)
    
    
    # Build the dataframe from the collected job titles, companies, and locations.
    # (DataFrame.append was removed in pandas 2.0, so the rows are passed in one go.)
    data = pd.DataFrame(list(zip(titles, companies, locations)), columns=columns)
    
    # Return dataframe
    if len(data) > num_pages:
        return data.iloc[0:num_pages]
    else:
        return data


In [49]:
scrape_linkedin_job_search_25plus('data%20analysis',50)

| | Title | Company | Location |
|---|---|---|---|
| 0 | Data Analyst | MakeSpace | New York, New York, United States |
| 1 | Data Analyst | University of Colorado Colorado Springs | Colorado Springs, CO, US |
| 2 | VP Data Transformation | Centria Healthcare | Farmington Hills, Michigan, United States |
| 3 | Data Scientist | BMW of North America, LLC | Woodcliff Lake, NJ, US |
| 4 | Data Analyst | Hollstadt Consulting | Greater Minneapolis-St. Paul Area |
| 5 | Data Ops Analyst | Wish | San Francisco, CA, US |
| 6 | Data Analyst | Quest Financial | Dunwoody, Georgia, United States |
| 7 | Data Strategy Analyst | Payden & Rygel | Los Angeles, California, United States |
| 8 | Associate Data Analyst | ZoomInfo | Vancouver, WA, US |
| 9 | Data Analyst | State of North Carolina | Raleigh, NC, US |


## Challenge 2

Further improve your function so that it can search jobs in a specific country. Add a third parameter called `country` to your function. The steps are identical to those in Challenge 1.
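
One way to add an optional filter is to collect the query parameters in a dictionary and include the extra key only when a value is given. The sketch below assumes the filter is passed as a `location` query parameter, which you should confirm by watching the URL in your browser; `build_search_url` is a hypothetical helper, not part of the lab's starter code.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.linkedin.com/jobs/search/?'


def build_search_url(keywords, start=0, country=None):
    """Assemble a search URL, adding the location filter only when given."""
    params = {'keywords': keywords, 'start': start}
    if country:
        # Assumption: the filter is passed as a `location` query parameter.
        params['location'] = country
    return BASE_URL + urlencode(params)


print(build_search_url('data analysis', country='Germany'))
print(build_search_url('data analysis', start=25))
```

Keeping the parameter optional means existing calls that pass only keywords keep working unchanged.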

In [52]:
# your code here

def scrape_linkedin_job_search_25plus_city(keywords,num_pages,city):
    
    # Define the base url to be scraped.
    # An all-uppercase variable name signals a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
    
    titles = []
    companies = []
    locations = []
    
    index_counter = 0
    while 25*index_counter < num_pages:
         # Assemble the full url with parameters
        scrape_url = ''.join([BASE_URL, 'keywords=', keywords, "&start=",str(25*index_counter),"&location=",city])
        
        try:
            # Create a request to get the data from the server 
            page = requests.get(scrape_url)
        except requests.RequestException:
            break
    
        soup = BeautifulSoup(page.text, 'html.parser')

        # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
        # Then in each job card, extract the job title, company, and location data.

        for card in soup.select("div.result-card__contents"):
            title = card.findChild("h3", recursive=False)
            company = card.findChild("h4", recursive=False)
            location = card.findChild("span", attrs={"class": "job-result-card__location"}, recursive=True)
            titles.append(title.string)
            companies.append(company.string)
            locations.append(location.string)
            
        # Stop paging when a page returns fewer than 25 results, since that means it was the last page
        if len(soup.select("div.result-card__contents")) < 25:
            break
        else:            
            index_counter += 1
    
    # Create an empty dataframe with the columns consisting of the information you want to capture
    columns = ['Title', 'Company', 'Location']
    data = pd.DataFrame(columns=columns)
    
    
    # Build the dataframe from the collected job titles, companies, and locations.
    # (DataFrame.append was removed in pandas 2.0, so the rows are passed in one go.)
    data = pd.DataFrame(list(zip(titles, companies, locations)), columns=columns)
    
    # Return dataframe
    if len(data) > num_pages:
        return data.iloc[0:num_pages]
    else:
        return data




In [53]:
scrape_linkedin_job_search_25plus_city('data%20analysis',40,"munich")

| | Title | Company | Location |
|---|---|---|---|
| 0 | Risk & Finance Analyst | Tesla | Munich, DE |
| 1 | Data Analyst | Amazon | Munich, DE |
| 2 | Data Scientist | Stylight | Munich, DE |
| 3 | Data Analyst | KeenRecruit | Munich, DE |
| 4 | Data Scientist (m/f/d) | Huawei | Munich, DE |
| 5 | Knowledge Analyst - HR / People Strategy | Boston Consulting Group (BCG) | Munich Area, Germany |
| 6 | BI / Product Data Analyst | TrustYou | Munich, Bavaria, Germany |
| 7 | Business Analyst | CVMC LTD | Munich, Bavaria, Germany |
| 8 | Data Scientist (m/f/d) | FlixBus | Munich, DE |
| 9 | Strategy Implementation Intern (d/f/m) | Airbus | Munich, DE |


## Challenge 3

Add the 4th param called `num_days` to your function to allow it to search jobs posted in the past X days. Note that in the LinkedIn job search the searched timespan is specified with the following param:

```
f_TPR=r259200
```

The numeric part of the param value is a number of seconds: 259,200 seconds equals 3 days. You need to convert `num_days` to seconds and supply that value to the LinkedIn job search.
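
The conversion is a single multiplication, since one day has 24 × 60 × 60 = 86,400 seconds. A minimal sketch, where `time_filter` and `SECONDS_PER_DAY` are hypothetical names used only for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in one day


def time_filter(num_days):
    """Return the f_TPR query parameter for jobs posted in the last num_days days."""
    # int() drops any fractional second, so fractions of a day also work
    seconds = int(num_days * SECONDS_PER_DAY)
    return 'f_TPR=r' + str(seconds)


print(time_filter(3))  # f_TPR=r259200
```

The returned string can be joined into the search URL with `&` like any other parameter.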

In [127]:
# your code here
def scrape_linkedin_job_search_25plus_city_numDays(keywords,num_pages,city,num_days):
    
    # Define the base url to be scraped.
    # An all-uppercase variable name signals a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
    time_stamp = "f_TPR=r" + str(int(86400*num_days)) + "&" 
    
    
    titles = []
    companies = []
    locations = []
    
    index_counter = 0
    while 25*index_counter < num_pages:
         # Assemble the full url with parameters
        scrape_url = ''.join([BASE_URL, time_stamp,'keywords=', keywords, "&start=",str(25*index_counter),"&location=",city,"&f_TP=1"])
        
        try:
            # Create a request to get the data from the server 
            page = requests.get(scrape_url)
        except requests.RequestException:
            break
    
        soup = BeautifulSoup(page.text, 'html.parser')
        
        # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
        # Then in each job card, extract the job title, company, and location data.

       
    
        
        
        for card in soup.select("div.result-card__contents"):
            title = card.findChild("h3", recursive=False)
            company = card.findChild("h4", recursive=False)
            location = card.findChild("span", attrs={"class": "job-result-card__location"}, recursive=True)
            
            titles.append(title.string)
            companies.append(company.string)
            locations.append(location.string)
         
        # Move on to the next page of results
        index_counter += 1

    
    # Create an empty dataframe with the columns consisting of the information you want to capture
    columns = ['Title', 'Company', 'Location']
    data = pd.DataFrame(columns=columns)
    
    
    # Build the dataframe from the collected job titles, companies, and locations.
    # (DataFrame.append was removed in pandas 2.0, so the rows are passed in one go.)
    data = pd.DataFrame(list(zip(titles, companies, locations)), columns=columns)
    
    # Return dataframe
    if len(data) > num_pages:
        return data.iloc[0:num_pages]
    else:
        return data


In [128]:
scrape_linkedin_job_search_25plus_city_numDays('data%20analysis',50,"munich",0.1)

https://www.linkedin.com/jobs/search/?f_TPR=r8640&keywords=data%20analysis&start=0&location=munich&f_TP=1
https://www.linkedin.com/jobs/search/?f_TPR=r8640&keywords=data%20analysis&start=25&location=munich&f_TP=1


| | Title | Company | Location |
|---|---|---|---|
| 0 | SENIOR DATA ANALYST (m/f/d) | Harnham | Nürnberg Area, Germany |
| 1 | Manager Global Operational Pricing (m/f/x) | Daiichi Sankyo Europe GmbH | München und Umgebung, Deutschland |
| 2 | Internship Data Science (m/f/d) | Limehome GmbH | München, Bayern, Deutschland |
| 3 | Senior Growth Manager (m/f/d) | FlixBus | Munich, DE |
| 4 | Preclinical Development Manager (m/f/d) | Sandoz | Munich, DE |
| 5 | Program Manager - Alexa Preview Team | Amazon | Munich, DE |
| 6 | Sales Operations Manager (m/f/d) | Personio | München, Bayern, Deutschland |
| 7 | Senior Statistician | PSI CRO AG | Munich, DE |
| 8 | BIG DATA ARCHITECT (M/F/D) | Altran | Munich, DE |
| 9 | Senior IT Audit Consultant | IAC | Munich, Bavaria, Germany |


## Bonus Challenge

Allow your function to also retrieve the "Seniority Level" of each job searched. Note that the Seniority Level info is not in the initial search results. You need to make a separate search request for each job card based on the `currentJobId` value which you can extract from the job card HTML.

After you obtain the Seniority Level info, update the function and add it to a new column of the returned dataframe.
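
Making one extra request per job card multiplies the chances that a lookup comes back empty, for example when a detail page lacks the expected element. Beautiful Soup's `find` returns `None` when nothing matches, so reading `.string` from the result directly would raise an `AttributeError`. Below is a minimal sketch of a guard; `text_or_default` is a hypothetical helper, and a `SimpleNamespace` stands in for a parsed tag so the snippet runs without any network call.

```python
from types import SimpleNamespace


def text_or_default(tag, default=None):
    """Return tag.string stripped, or default when the tag or its string is missing."""
    if tag is None or tag.string is None:
        return default
    return tag.string.strip()


# Stand-ins for parsed tags; with Beautiful Soup you would pass the result of find().
found = SimpleNamespace(string=' Associate ')
missing = None

print(text_or_default(found))
print(text_or_default(missing, default='Unknown'))
```

Appending a default value for every card, even when the lookup fails, keeps the seniority list the same length as the title, company, and location lists.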

In [125]:
# your code here
import numpy as np

def scrape_linkedin_job_search_25plus_city_numDays_level(keywords,num_pages,city,num_days):
    
    # Define the base url to be scraped.
    # An all-uppercase variable name signals a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
    time_stamp = "f_TPR=r" + str(int(86400*num_days)) + "&" 
    
    
    titles = []
    companies = []
    locations = []
    senior_levels = []
    date_times = []
    
    index_counter = 0
    while 25*index_counter < num_pages:
         # Assemble the full url with parameters
        scrape_url = ''.join([BASE_URL, time_stamp,'keywords=', keywords, "&start=",str(25*index_counter),"&location=",city,"&f_TP=1"])
        
        try:
            # Create a request to get the data from the server 
            page = requests.get(scrape_url)
        except requests.RequestException:
            break
    
        soup = BeautifulSoup(page.text, 'html.parser')
        
        # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
        # Then in each job card, extract the job title, company, and location data.

        # list of all li blocks. every block contains one search result
        job_cards = soup.find("ul",attrs={"class": "jobs-search__results-list"}).find_all("li",attrs={"class": "result-card"})
        
        for card in job_cards:
            
            # this is the old div.result-card__contents from the other functions
            card_div = card.find_all("div", attrs = {"class" : "result-card__contents"})[0]
                        
            title = card_div.findChild("h3", recursive=False)
            company = card_div.findChild("h4", recursive=False)
            location = card_div.findChild("span", attrs={"class": "job-result-card__location"}, recursive=True)
            
            titles.append(title.string)
            companies.append(company.string)
            locations.append(location.string)
            
            if len(card.find_all("time", attrs = {"class" : "job-result-card__listdate--new"}))>0:
                date_times.append(card.find_all("time", attrs = {"class" : "job-result-card__listdate--new"})[0].get("datetime"))
            else:
                date_times.append(np.nan)
            
            
            # now lets turn to the senior level part. First, we extract the Job-ID
            card_id = card.get("data-id")  # the job ID is stored in the card's data-id attribute
            
            # get new url   
            scrape_url_id = "https://www.linkedin.com/jobs/view/" + str(card_id)
           
            
            
            
            # load new url and parse
            page_id = requests.get(scrape_url_id)
            soup_id = BeautifulSoup(page_id.text, 'html.parser')
            
            # find the information; fall back to NaN when the element is missing from the page
            criteria = soup_id.find("span", attrs={"class": "job-criteria__text"})
            senior_levels.append(criteria.string if criteria is not None else np.nan)

           
        
        
        # Stop paging when a page returns fewer than 25 results, since that means it was the last page
        if len(soup.select("div.result-card__contents")) < 25:
            break
        else:            
            index_counter += 1
            
    # Define the dataframe columns, including the two extra fields collected above
    columns = ['Title', 'Company', 'Location', 'Level', 'Published']

    # Build the dataframe from the collected lists.
    # (DataFrame.append was removed in pandas 2.0, so the rows are passed in one go.)
    data = pd.DataFrame(list(zip(titles, companies, locations, senior_levels, date_times)), columns=columns)
    
    # Return dataframe
    if len(data) > num_pages:
        return data.iloc[0:num_pages]
    else:
        return data


In [126]:
scrape_linkedin_job_search_25plus_city_numDays_level('data%20analysis',50,"munich",3)

| | Title | Company | Location | Level | Published |
|---|---|---|---|---|---|
| 0 | Manager Global Operational Pricing (m/f/x) | Daiichi Sankyo Europe GmbH | München und Umgebung, Deutschland | Associate | 2020-01-21 |
| 1 | Senior Growth Manager (m/f/d) | FlixBus | Munich, DE | Mid-Senior level | 2020-01-20 |
| 2 | Preclinical Development Manager (m/f/d) | Sandoz | Munich, DE | Mid-Senior level | 2020-01-20 |
| 3 | Program Manager - Alexa Preview Team | Amazon | Munich, DE | Mid-Senior level | NaN |
| 4 | Sales Operations Manager (m/f/d) | Personio | München, Bayern, Deutschland | Mid-Senior level | 2020-01-20 |
| 5 | Senior Statistician | PSI CRO AG | Munich, DE | Mid-Senior level | 2020-01-21 |
| 6 | BIG DATA ARCHITECT (M/F/D) | Altran | Munich, DE | Associate | 2020-01-21 |
| 7 | Senior IT Audit Consultant | IAC | Munich, Bavaria, Germany | Mid-Senior level | 2020-01-20 |
| 8 | Working Student (m/w/d) Group Taxation | Allianz | Munich, DE | Not Applicable | 2020-01-21 |
| 9 | Account Manger - DACH | ClassPass | Munich, Bavaria, Germany | Associate | 2020-01-20 |

