## Web Scraping Job Vacancies
### Introduction

In this project, we'll build a web scraper to extract job listings from two popular job search platforms, LinkedIn and Indeed. For each listing we'll collect the job title, company, location, posting date, and a link to the full posting.

Here are the main steps we'll follow in this project:

1. Analyze the website structure of each job search platform
2. Write the Python code to extract job data from each platform
3. Save the data to a CSV file


In [None]:
!pip install beautifulsoup4
!pip install requests
!pip install pandas

In [None]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import re
import json
from typing import List

We'll write the code to query both LinkedIn and Indeed for job results and then write the results to a CSV file.

We'll create a class to hold the state of the collected data to allow for any data processing required before writing the output to a file.

```scrape_jobs()``` cycles through each website and calls the corresponding method to scrape that site for job information.
This method handles everything under the hood to:
1. Iterate through each website
2. Build the url for each website with the required query params and headers
3. Make the GET request for the url
4. Parse the returned html
5. Collect job description info from all the parsed jobs
6. Format and store data in ```self.jobs_df```

Once all the data has been collected and no further processing is needed, call ```write_data()``` to write ```self.jobs_df``` to a CSV file.
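
Step 2 above (building the URL with query params) can be sketched on its own. This is a minimal sketch, not the class's actual method: the names `SITE_PARAMS` and `build_search_url` are hypothetical, and it assumes the same base URLs and parameter names used in this project. It relies on the standard library's `urlencode` so that spaces and commas in a job title or location are percent-encoded.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical lookup table mirroring the query parameter names used in this project
SITE_PARAMS = {
    "linkedin": {"url": "https://www.linkedin.com/jobs/search/", "job": "keywords", "location": "location"},
    "indeed": {"url": "https://www.indeed.com/jobs", "job": "q", "location": "l"},
}


def build_search_url(site: str, job: str, location: Optional[str] = None) -> str:
    """Return a search URL for the given site with encoded query parameters."""
    site_cfg = SITE_PARAMS[site]
    query = {site_cfg["job"]: job}
    if location is not None:
        query[site_cfg["location"]] = location
    # urlencode escapes spaces, commas, and other reserved characters
    return f"{site_cfg['url']}?{urlencode(query)}"


print(build_search_url("indeed", "fire fighter", "Miami, FL"))
# https://www.indeed.com/jobs?q=fire+fighter&l=Miami%2C+FL
```

Encoding matters as soon as a search term contains a space or punctuation; a single-word query such as "firefighter" happens to work without it.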


In [6]:
class Job_Scraper():
    """
    Class designed to scrape website job listings and output results to a csv.

    Use the scrape_jobs() method to gather information then call write_data() to output the results.
    scrape_jobs() writes data to self.jobs_df DataFrame to allow for any data editing before writing to csv.
    """
    def __init__(self):
        self.jobs_df = pd.DataFrame()

    def append_df(self, data):
        # Either initialize DataFrame or append multiple data results together
        if self.jobs_df.empty:
            self.jobs_df = pd.DataFrame(data)
        else:
            self.jobs_df = pd.concat([self.jobs_df, pd.DataFrame(data)], ignore_index=True)
            
    def build_url(self, site, job, location):
        """
        Builds and returns website-specific urls with query params and headers.
        ::params
        site (str): website name e.g. 'indeed', 'linkedin'
        job (str): job title to search for
        location (str): job location to search for

        ::returns
        url (str): formatted website url with query params
        headers (dict): get request headers
        """
        # create dict of website-specific required url structure and headers
        site_params = {
            "linkedin": {
                "url": "https://www.linkedin.com/jobs/search/?",
                "job": "keywords=",
                "location": "location=",
                "headers": {
                },
            },
            "indeed": {
                "url": "https://www.indeed.com/jobs?",
                "job": "q=",
                "location": "l=",
                "headers": {
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
                    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
                    "Accept-Language": "en-US,en;q=0.5",
                    # "Accept-Encoding": "gzip, deflate, br",
                    "DNT": "1",
                    "Sec-GPC": "1",
                    "Connection": "keep-alive",
                    "Upgrade-Insecure-Requests": "1",
                    "Sec-Fetch-Dest": "document",
                    "Sec-Fetch-Mode": "navigate",
                    "Sec-Fetch-Site": "same-origin",
                    "Sec-Fetch-User": "?1",
                    "TE": "trailers",
                }
            }
        }

        url = f"{site_params[site]['url']}{site_params[site]['job']}{job}"

        if location is not None:
            url += f"&{site_params[site]['location']}{location}"

        return url, site_params[site]["headers"]

    def get_html(self, url, **kwargs):
        """
        Returns GET request results from url.
        """
        r = requests.get(url, **kwargs)
        print(r.status_code, r.reason)
        return r.text

    def parse_html(self, html):
        """
        Takes a url html response data and returns a parsed BeautifulSoup object.
        """
        soup = bs(html, 'html.parser')
        return soup

    def scrape_linkedin(self, results):
        data = {
            "title": [],
            "company": [],
            "location": [],
            "posted": [],
            "link": []
        }
        def append_data(field, value, strip=True):
            try:
                if strip:
                    value = value.text.strip()
                data[field].append(value)
            except AttributeError:
                data[field].append("")

        for job in results.find_all("div", class_="job-search-card"):
            append_data("title", job.find("span", "sr-only"))
            append_data("company", job.find("a", "hidden-nested-link"))
            append_data("location", job.find("span", "job-search-card__location"))
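            # Look up the tags first: subscripting a missing tag (None) would raise TypeError
            posted = job.find("time", "job-search-card__listdate")
            link = job.find("a", "base-card__full-link")
            append_data("posted", posted.get("datetime", "") if posted is not None else "", strip=False)
            append_data("link", link.get("href", "") if link is not None else "", strip=False)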

        return data

    def scrape_indeed(self, results):
        data = {
            "title": [],
            "company": [],
            "location": [],
            "posted": [],
            "link": []
        }

        for job in results["metaData"]["mosaicProviderJobCardsModel"]["results"]:
            data["title"].append(job.get("title", ""))
            data["company"].append(job.get("company", ""))
            data["location"].append(job.get("formattedLocation", ""))
            data["posted"].append(job.get("formattedRelativeTime", ""))
            data["link"].append(f'https://www.indeed.com/{job.get("viewJobLink")}' if job.get("viewJobLink") else "")

        return data
    
    def scrape_jobs(self, sites:List, job:str, location:str=None):
        """
        Query and collect website information for the provided job information.
        All data is collected and added to self.jobs_df property
        ::params
        sites (List): list of website names to get data from e.g. ['indeed', 'linkedin']
        job (str): job title to search for
        location (str | None): job location to search for

        ::returns
        None
        """
        for site in sites:
            url, headers = self.build_url(site, job, location)
            html = self.get_html(url, headers=headers)

            if site == "linkedin":
                results = self.parse_html(html)
                data = self.scrape_linkedin(results)
                self.append_df(data)

            if site == "indeed":
                # Escape the dots so they match literal "." characters only
                results = re.findall(r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', html)
                if results:
                    results = json.loads(results[0])
                    data = self.scrape_indeed(results)
                    self.append_df(data)

    def write_data(self, filename):
        """
        Writes self.jobs_df to a given filename.
        Call scrape_jobs() first to collect data.
        """
        self.jobs_df.to_csv(filename, index=False)
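
The Indeed branch of ```scrape_jobs()``` does not parse HTML tags; it pulls a JSON object that is embedded in a script on the page. The sketch below shows that technique in isolation using only the standard library. `SAMPLE_HTML` is a made-up fragment that imitates the structure the scraper expects, and `extract_job_cards` is a hypothetical helper name; the real page layout may differ or change over time.

```python
import json
import re

# Made-up page fragment imitating the embedded JSON the scraper looks for
SAMPLE_HTML = """
<script>
window.mosaic.providerData["mosaic-provider-jobcards"]={"metaData": {"mosaicProviderJobCardsModel": {"results": [{"title": "Firefighter", "company": "Example Fire Dept"}]}}};
</script>
"""

# Capture everything from the opening brace up to the first "};"
PATTERN = re.compile(r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});')


def extract_job_cards(html: str) -> list:
    """Return the list of job card dicts embedded in the page, or [] if none are found."""
    match = PATTERN.search(html)
    if match is None:
        return []
    payload = json.loads(match.group(1))
    return payload["metaData"]["mosaicProviderJobCardsModel"]["results"]


for card in extract_job_cards(SAMPLE_HTML):
    print(card["title"], "-", card["company"])
```

Returning an empty list when the pattern is missing keeps the caller simple, which matters because a blocked request typically returns a page without the embedded data.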

Initialize the ```Job_Scraper``` and call ```scrape_jobs()``` to collect job data.

Each GET request also prints the response status code and reason, for example:

```200 OK```

```403 Forbidden```

These websites use various bot-prevention techniques, and sending too many requests can result in a 4xx response such as ```403 Forbidden```. If that happens, wait a few minutes, verify that the headers are correct, and try again.
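
One way to automate the "wait and try again" advice is a retry loop with an increasing delay. This is a rough sketch, not part of the ```Job_Scraper``` class: `fetch_with_backoff` and its parameters are hypothetical, and the delays are arbitrary starting points. The `fetch` argument is any function that returns a `(status_code, text)` pair, which keeps the sketch testable without hitting a real site.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url) until it returns status 200, doubling the wait after each failure."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        status, body = fetch(url)
        if status == 200:
            return body
        # Wait 2s, 4s, 8s, ... but skip the wait after the final attempt
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"{url} still failing after {retries} attempts (last status {status})")


# With requests, fetch could be written as (assumes a headers dict is in scope):
# def fetch(url):
#     r = requests.get(url, headers=headers, timeout=10)
#     return r.status_code, r.text

# Offline demo: a fake fetch that is blocked once, then succeeds; delays are recorded, not slept
fake_responses = iter([(403, ""), (200, "<html>ok</html>")])
waited = []
print(fetch_with_backoff(lambda u: next(fake_responses), "https://example.com", sleep=waited.append))
print(waited)
```

Retrying cannot get past a block that is based on headers or IP address, so a persistent 403 still calls for checking the request headers by hand.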


In [7]:
jobs = Job_Scraper()
jobs.scrape_jobs(["linkedin", "indeed"], "firefighter", "miami")

200 OK
200 OK


<h3>View the collected DataFrame:</h3>

In [10]:
jobs.jobs_df

| | title | company | location | posted | link |
|---|---|---|---|---|---|
| 0 | Fire Prevention Officer I | City of Hollywood, Florida | Hollywood, FL | 2024-01-03 | https://www.linkedin.com/jobs/view/fire-preven... |
| 1 | United States Capitol Police - Police Officer | United States Capitol Police | Miami, FL | 2024-04-14 | https://www.linkedin.com/jobs/view/united-stat... |
| 2 | RESERVE POLICE OFFICER | City of Opa-locka | Opa-Locka, FL | 2023-04-20 | https://www.linkedin.com/jobs/view/reserve-pol... |
| 3 | Criminal Investigator (Special Agent) | U.S. Secret Service | Miami, FL | 2024-04-26 | https://www.linkedin.com/jobs/view/criminal-in... |
| 4 | Non-Certified Police Officer | City of Hollywood, Florida | Hollywood, FL | 2024-03-05 | https://www.linkedin.com/jobs/view/non-certifi... |
| 5 | Police Officer | University of Miami | Coral Gables, FL | 2024-03-20 | https://www.linkedin.com/jobs/view/police-offi... |
| 6 | CERT POLICE OFFICER | City of Opa-locka | Opa-Locka, FL | 2023-04-20 | https://www.linkedin.com/jobs/view/cert-police... |
| 7 | Assistant Scientist (Firefighter Cancer Initia... | University of Miami | Miami, FL | 2024-04-09 | https://www.linkedin.com/jobs/view/assistant-s... |
| 8 | Certified Police Officer (Florida Only) | City of Hollywood, Florida | Hollywood, FL | 2024-01-08 | https://www.linkedin.com/jobs/view/certified-p... |
| 9 | Miami Firefighter Exam Tutor | Varsity Tutors, a Nerdy Company | Miami, FL | 2024-04-02 | https://www.linkedin.com/jobs/view/miami-firef... |
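
Because the results live in a DataFrame, they can be cleaned up before being written out. The search above returned several police listings for a "firefighter" query, so one possible cleanup step is to drop duplicate links and keep only titles that contain a keyword. This is a minimal sketch: `clean_jobs` is a hypothetical helper, and the small DataFrame with made-up rows stands in for ```jobs.jobs_df```.

```python
import pandas as pd

# Stand-in for jobs.jobs_df, filled with made-up rows
df = pd.DataFrame({
    "title": ["Firefighter", "Police Officer", "Firefighter"],
    "company": ["Example Fire Dept", "Example PD", "Example Fire Dept"],
    "link": ["https://example.com/1", "https://example.com/2", "https://example.com/1"],
})


def clean_jobs(df: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Drop rows with duplicate links, then keep titles containing keyword (case-insensitive)."""
    deduped = df.drop_duplicates(subset="link")
    # regex=False treats keyword as plain text; na=False treats missing titles as non-matches
    mask = deduped["title"].str.contains(keyword, case=False, regex=False, na=False)
    return deduped[mask].reset_index(drop=True)


cleaned = clean_jobs(df, "fire")
print(cleaned)
```

With the scraper, the same step would look like `jobs.jobs_df = clean_jobs(jobs.jobs_df, "fire")`, assuming the helper above is defined in the notebook.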


<h3>Looks good! Now write it to a CSV file.</h3>

In [13]:
jobs.write_data("job_results.csv")