# Job Board Scraping Lab

In this lab you will first see a minimal but fully functional code snippet that scrapes the LinkedIn Job Search webpage. You will then build on the example code to complete several challenges.

### Some Resources 

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

In [None]:
# Import the required libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

# Set a User-Agent header so LinkedIn doesn't block us
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

def scrape_linkedin_job_search(keywords):
    """
    Search job posts on LinkedIn and convert the results into a dataframe.

    LinkedIn's public search page uses these CSS classes:
      - Card container: div.base-search-card
      - Title: h3.base-search-card__title
      - Company: h4.base-search-card__subtitle
      - Location: span.job-search-card__location
    """

    # Define the base url to be scraped
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
    # Assemble the full url with parameters
    scrape_url = ''.join([BASE_URL, 'keywords=', keywords])

    # Create a request to get the data from the server 
    page = requests.get(scrape_url, headers=HEADERS)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
    # Then in each job card, extract the job title, company, and location data.
    titles = []
    companies = []
    locations = []
    for card in soup.select("div.base-search-card__info"):
        title = card.select_one("h3.base-search-card__title")
        company = card.select_one("h4.base-search-card__subtitle")
        location = card.select_one("span.job-search-card__location")
        titles.append(title.get_text(strip=True) if title else None)
        companies.append(company.get_text(strip=True) if company else None)
        locations.append(location.get_text(strip=True) if location else None)
    
    # Build the dataframe from the collected data
    data = pd.DataFrame({'Title': titles, 'Company': companies, 'Location': locations})
    
    # Return dataframe
    return data

In [None]:
# Example to call the function

results = scrape_linkedin_job_search('data%20analyst')
print(f"Total jobs found: {len(results)}")
results.head(10)

## Challenge 1

Your first challenge is to update the `scrape_linkedin_job_search` function by adding a new parameter called `num_pages`. This will let the function retrieve more than 25 jobs per call. Suggested steps:

1. Go to https://www.linkedin.com/jobs/search/?keywords=data%20analysis in your browser.
1. Scroll down the left panel and click the page 2 link. Look at how the URL changes and identify the page offset parameter.
1. Add `num_pages` as a new parameter to the `scrape_linkedin_job_search` function. Update the function code so that it uses a `for` loop to retrieve several pages of search results.
1. Test your new function by scraping 5 pages of the search results.

Hint: Prepare for the case where there are fewer than 5 pages of search results. Your function should be robust enough to **not** trigger errors. If a search has already reached the end of the results, skip the remaining requests and return everything collected so far.
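
The offset arithmetic behind the pagination can be sketched as follows. This is only an illustration, assuming 25 results per page as mentioned above; `page_offsets` and `page_size` are hypothetical names, not part of the lab code.

```python
def page_offsets(num_pages, page_size=25):
    """Return the result offset at which each page starts (hypothetical helper)."""
    return [page * page_size for page in range(num_pages)]


# Offsets for 5 pages of 25 results each
print(page_offsets(5))  # [0, 25, 50, 75, 100]
```

Each offset would be passed as the value of the page offset parameter you identified in step 2.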

In [None]:
# Challenge 1: Add num_pages parameter for pagination
# LinkedIn loads ~25 results per "page". The first page is the main search URL.
# Additional pages use the guest API endpoint with a 'start' offset.

def scrape_linkedin_job_search(keywords, num_pages=1):
    
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    PAGINATION_URL = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?'
    
    titles = []
    companies = []
    locations = []
    
    for page in range(num_pages):
        start = page * 25
        
        if page == 0:
            scrape_url = f"{BASE_URL}keywords={keywords}"
        else:
            scrape_url = f"{PAGINATION_URL}keywords={keywords}&start={start}"
        
        page_response = requests.get(scrape_url, headers=HEADERS)
        soup = BeautifulSoup(page_response.text, 'html.parser')
        
        cards = soup.select("div.base-search-card__info")
        
        # If no cards found, we've run out of results — stop early
        if not cards:
            break
        
        for card in cards:
            title = card.select_one("h3.base-search-card__title")
            company = card.select_one("h4.base-search-card__subtitle")
            location = card.select_one("span.job-search-card__location")
            titles.append(title.get_text(strip=True) if title else None)
            companies.append(company.get_text(strip=True) if company else None)
            locations.append(location.get_text(strip=True) if location else None)
        
        time.sleep(0.5)  # Be polite — small delay between requests
    
    data = pd.DataFrame({'Title': titles, 'Company': companies, 'Location': locations})
    return data

# Test with 5 pages
results_c1 = scrape_linkedin_job_search('data%20analyst', num_pages=5)
print(f"Total jobs found: {len(results_c1)}")
results_c1.head(10)

## Challenge 2

Further improve your function so that it can search for jobs in a specific country. Add a third parameter called `country` to your function. The steps are the same as those in Challenge 1.
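
One possible way to assemble the query string is with `urllib.parse` from the resources list, which takes care of URL-encoding spaces and special characters. The sketch below is a hedged illustration: `build_query` is a hypothetical helper, and the `location` parameter name is an assumption taken from the solution cell that follows.

```python
from urllib.parse import quote, urlencode


def build_query(keywords, country=None):
    """Build an encoded query string (hypothetical helper, not part of the lab code)."""
    params = {'keywords': keywords}
    if country:
        # Assumption: the country filter is passed as a 'location' parameter
        params['location'] = country
    # quote_via=quote encodes spaces as %20 instead of '+'
    return urlencode(params, quote_via=quote)


print(build_query('data analyst', 'United States'))
# keywords=data%20analyst&location=United%20States
```

Note that this helper expects plain text such as `'data analyst'`; passing an already encoded value such as `'data%20analyst'` would encode the `%` sign a second time.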

In [None]:
# Challenge 2: Add country parameter
# LinkedIn accepts a 'location' query param to filter by country

def scrape_linkedin_job_search(keywords, num_pages=1, country=None):
    
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    PAGINATION_URL = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?'
    
    titles = []
    companies = []
    locations = []
    
    for page in range(num_pages):
        start = page * 25
        
        if page == 0:
            scrape_url = f"{BASE_URL}keywords={keywords}"
        else:
            scrape_url = f"{PAGINATION_URL}keywords={keywords}&start={start}"
        
        if country:
            scrape_url += f"&location={country}"
        
        page_response = requests.get(scrape_url, headers=HEADERS)
        soup = BeautifulSoup(page_response.text, 'html.parser')
        
        cards = soup.select("div.base-search-card__info")
        
        if not cards:
            break
        
        for card in cards:
            title = card.select_one("h3.base-search-card__title")
            company = card.select_one("h4.base-search-card__subtitle")
            location = card.select_one("span.job-search-card__location")
            titles.append(title.get_text(strip=True) if title else None)
            companies.append(company.get_text(strip=True) if company else None)
            locations.append(location.get_text(strip=True) if location else None)
        
        time.sleep(0.5)
    
    data = pd.DataFrame({'Title': titles, 'Company': companies, 'Location': locations})
    return data

# Test: search for data analyst jobs in Germany, 3 pages
results_c2 = scrape_linkedin_job_search('data%20analyst', num_pages=3, country='Germany')
print(f"Total jobs found: {len(results_c2)}")
results_c2.head(10)

## Challenge 3

Add a fourth parameter called `num_days` to your function so that it can search for jobs posted in the past X days. In the LinkedIn job search, the timespan is specified with the following parameter:

```
f_TPR=r259200
```

The numeric part of the parameter value is a number of seconds: 259,200 seconds equals 3 days (3 × 24 × 60 × 60). You need to convert `num_days` to seconds and pass that value to the LinkedIn job search.
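
The conversion can be checked with a short sketch. `tpr_param` and `SECONDS_PER_DAY` are hypothetical names used only for illustration; the `f_TPR=r...` format is the one shown above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def tpr_param(num_days):
    """Build the f_TPR query fragment for jobs posted in the past num_days days."""
    return f"f_TPR=r{num_days * SECONDS_PER_DAY}"


print(tpr_param(3))  # f_TPR=r259200
```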

In [None]:
# Challenge 3: Add num_days parameter for time filtering
# LinkedIn uses f_TPR=r{seconds} to filter by recency

def scrape_linkedin_job_search(keywords, num_pages=1, country=None, num_days=None):
    
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    PAGINATION_URL = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?'
    
    titles = []
    companies = []
    locations = []
    
    for page in range(num_pages):
        start = page * 25
        
        if page == 0:
            scrape_url = f"{BASE_URL}keywords={keywords}"
        else:
            scrape_url = f"{PAGINATION_URL}keywords={keywords}&start={start}"
        
        if country:
            scrape_url += f"&location={country}"
        
        if num_days:
            num_seconds = num_days * 24 * 60 * 60  # convert days to seconds
            scrape_url += f"&f_TPR=r{num_seconds}"
        
        page_response = requests.get(scrape_url, headers=HEADERS)
        soup = BeautifulSoup(page_response.text, 'html.parser')
        
        cards = soup.select("div.base-search-card__info")
        
        if not cards:
            break
        
        for card in cards:
            title = card.select_one("h3.base-search-card__title")
            company = card.select_one("h4.base-search-card__subtitle")
            location = card.select_one("span.job-search-card__location")
            titles.append(title.get_text(strip=True) if title else None)
            companies.append(company.get_text(strip=True) if company else None)
            locations.append(location.get_text(strip=True) if location else None)
        
        time.sleep(0.5)
    
    data = pd.DataFrame({'Title': titles, 'Company': companies, 'Location': locations})
    return data

# Test: search for data analyst jobs in Germany, 3 pages, posted in the last 7 days
results_c3 = scrape_linkedin_job_search('data%20analyst', num_pages=3, country='Germany', num_days=7)
print(f"Total jobs found: {len(results_c3)}")
results_c3.head(10)

## Bonus Challenge

Allow your function to also retrieve the "Seniority Level" of each job found. Note that the Seniority Level info is not included in the initial search results. You need to make a separate request for each job, using the `currentJobId` value that you can extract from the job card HTML.

After you obtain the Seniority Level info, update the function and add it to a new column of the returned dataframe.
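
If the job ID is available as a `currentJobId` query parameter in a URL, it can be pulled out with `urllib.parse`. The sketch below assumes that form; `extract_current_job_id` is a hypothetical helper, and the example URL and ID are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def extract_current_job_id(url):
    """Return the currentJobId query value from a URL, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get('currentJobId')
    return values[0] if values else None


# Made-up URL for illustration only
example_url = 'https://www.linkedin.com/jobs/search/?currentJobId=1234567&keywords=data%20analyst'
print(extract_current_job_id(example_url))  # 1234567
```

The ID is returned as a string, which is convenient for building the URL of the follow-up request.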

In [None]:
# Bonus Challenge: Add Seniority Level column
# Each job card has a data-entity-urn with the job ID.
# We fetch each job's detail page and extract "Seniority level" from the criteria list.

def get_seniority_level(job_id):
    """Fetch an individual job page and extract the Seniority Level."""
    url = f"https://www.linkedin.com/jobs/view/{job_id}"
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find all criteria items and look for "Seniority level"
        criteria_items = soup.select("li.description__job-criteria-item")
        for item in criteria_items:
            header = item.select_one("h3.description__job-criteria-subheader")
            if header and "seniority" in header.get_text(strip=True).lower():
                value = item.select_one("span.description__job-criteria-text--criteria")
                return value.get_text(strip=True) if value else None
    except requests.RequestException:
        # Network error or timeout: fall through and report the level as unknown
        pass
    return None


def scrape_linkedin_job_search(keywords, num_pages=1, country=None, num_days=None, include_seniority=False):
    
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    PAGINATION_URL = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?'
    
    titles = []
    companies = []
    locations = []
    job_ids = []
    
    for page in range(num_pages):
        start = page * 25
        
        if page == 0:
            scrape_url = f"{BASE_URL}keywords={keywords}"
        else:
            scrape_url = f"{PAGINATION_URL}keywords={keywords}&start={start}"
        
        if country:
            scrape_url += f"&location={country}"
        
        if num_days:
            num_seconds = num_days * 24 * 60 * 60
            scrape_url += f"&f_TPR=r{num_seconds}"
        
        page_response = requests.get(scrape_url, headers=HEADERS)
        soup = BeautifulSoup(page_response.text, 'html.parser')
        
        # Select the parent card (not just __info) so we can read data-entity-urn
        cards = soup.select("div.base-search-card.job-search-card")
        
        if not cards:
            break
        
        for card in cards:
            info = card.select_one("div.base-search-card__info")
            if not info:
                continue
            
            title = info.select_one("h3.base-search-card__title")
            company = info.select_one("h4.base-search-card__subtitle")
            location = info.select_one("span.job-search-card__location")
            titles.append(title.get_text(strip=True) if title else None)
            companies.append(company.get_text(strip=True) if company else None)
            locations.append(location.get_text(strip=True) if location else None)
            
            # Extract job ID from data-entity-urn="urn:li:jobPosting:1234567"
            urn = card.get("data-entity-urn", "")
            job_id = urn.split(":")[-1] if urn else None
            job_ids.append(job_id)
        
        time.sleep(0.5)
    
    data = pd.DataFrame({'Title': titles, 'Company': companies, 'Location': locations})
    
    # Optionally fetch seniority level for each job
    if include_seniority:
        print(f"Fetching seniority level for {len(job_ids)} jobs (this may take a while)...")
        seniority_levels = []
        for i, jid in enumerate(job_ids):
            if jid:
                seniority_levels.append(get_seniority_level(jid))
                time.sleep(0.3)  # Be polite with rate limiting
            else:
                seniority_levels.append(None)
            if (i + 1) % 10 == 0:
                print(f"  ...processed {i + 1}/{len(job_ids)} jobs")
        data['Seniority Level'] = seniority_levels
    
    return data

# Test: search for data analyst jobs, 1 page, with seniority level
# Using only 1 page to keep runtime reasonable since each job needs a separate request
results_bonus = scrape_linkedin_job_search('data%20analyst', num_pages=1, country='Germany', num_days=7, include_seniority=True)
print(f"\nTotal jobs found: {len(results_bonus)}")
results_bonus.head(10)