# Web Scraping Job Vacancies

## Introduction

In this project, we'll build a web scraper to extract job listings from a popular job search platform. We'll extract job titles, companies, locations, job descriptions, and other relevant information.

Here are the main steps we'll follow in this project:

1. Set up our development environment
2. Understand the basics of web scraping
3. Analyze the website structure of our job search platform
4. Write the Python code to extract job data from our job search platform
5. Save the data to a CSV file
6. Test our web scraper and refine our code as needed

## Prerequisites

Before starting this project, you should have some basic knowledge of Python programming and HTML structure. You will also use the following packages in your Python environment:

- requests
- BeautifulSoup (installed as `beautifulsoup4`, imported from `bs4`)
- csv (part of the Python standard library)
- datetime (part of the Python standard library)

These packages should already be installed in Coursera's Jupyter Notebook environment. If you need a package that is not included in this environment, or you are working off platform, you can install it with `!pip install packagename` in a notebook cell, for example:

- `!pip install requests`
- `!pip install beautifulsoup4`
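
The name passed to `pip install` is not always the name used in `import`: Beautiful Soup installs as `beautifulsoup4` but is imported as `bs4`. As a minimal sketch, the helper below (`missing_modules` is a hypothetical name, not part of the original notebook) uses the standard library to report which import names cannot be found before the rest of the notebook runs.

```python
from importlib.util import find_spec

def missing_modules(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]

# These are import names, not pip names: Beautiful Soup is imported as `bs4`.
required = ["requests", "bs4", "csv", "datetime"]
print("Missing modules:", missing_modules(required))
```

An empty list means every module can be imported; any name that is listed still needs a `!pip install`.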

## Step 1: Importing Required Libraries

In [8]:
# Install jovian, which saves versions of this notebook
!pip install jovian --upgrade --quiet
# Install requests, used to download the pages
!pip install requests --upgrade --quiet
# Install Beautiful Soup, used to parse the HTML
!pip install beautifulsoup4 --upgrade --quiet

import jovian
# HTTP client for downloading pages
import requests
# Used to pause between requests
import time
# HTML parser
from bs4 import BeautifulSoup

jovian.commit(project="python")

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
selenium 4.11.2 requires urllib3[socks]<3,>=1.26, but you have urllib3 1.25.11 which is incompatible.

[jovian] Please enter your API key ( from https://jovian.com/ ):
API KEY: ········
[jovian] Creating a new project "satyakipal99/python"
[jovian] Committed successfully! https://jovian.com/satyakipal99/python


'https://jovian.com/satyakipal99/python'

In [9]:
# Berlin
url_1 = 'https://www.linkedin.com/jobs/search?keywords=Data%20Analyst&location=Berlin&geoId=&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0'
# Frankfurt
url_2 = 'https://www.linkedin.com/jobs/search?keywords=Data%20Analyst&location=Frankfurt%20Rhine-Main%20Metropolitan%20Area&geoId=90009714&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0'
# Munich
url_3 = 'https://www.linkedin.com/jobs/search?keywords=Data%20Analyst&location=m%C3%BCnich&geoId=&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0'
# Hamburg
url_4 = 'https://www.linkedin.com/jobs/search?keywords=Data%20Analyst&location=hamburg&geoId=&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0'
# Stuttgart
url_5 = 'https://www.linkedin.com/jobs/search?keywords=Data%20Analyst&location=Stuttgart%20Region&geoId=90009750&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0'
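
The five URLs above differ mainly in their `location` value, and the spaces and non-ASCII characters in them are percent-encoded by hand. As a rough sketch, the same kind of URL can be assembled with `urllib.parse` from the standard library. `build_search_url` is a hypothetical helper, and it keeps only the `keywords`, `location`, and `pageNum` parameters; whether the site returns the same results without `geoId`, `trk`, and `position` is an assumption that has not been checked here.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://www.linkedin.com/jobs/search"

def build_search_url(keywords, location, page_num=0):
    """Build a search URL, percent-encoding spaces and non-ASCII characters."""
    params = {"keywords": keywords, "location": location, "pageNum": page_num}
    # quote_via=quote encodes spaces as %20 (the default would use '+')
    return BASE_URL + "?" + urlencode(params, quote_via=quote)

cities = ["Berlin", "Hamburg", "Stuttgart Region"]
example_urls = [build_search_url("Data Analyst", city) for city in cities]
for url in example_urls:
    print(url)
```

Building the URLs from a list of cities makes it easier to add a location without copying and editing a long string.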

In [10]:
urls = [url_1, url_2, url_3, url_4, url_5]

In [11]:
def get_page_contents(urls):
  page_contents = []
  for ind, url in enumerate(urls):
    response = requests.get(url)
    # Pause between requests to avoid sending them in quick succession
    time.sleep(2)
    # Print the status code of each response (200 means success)
    print("Status code of URL {} is {}".format(ind+1, response.status_code))
    page_contents.append(response.text)
  return page_contents
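
`get_page_contents` prints each status code but stores the response text even when the request did not succeed. The sketch below shows one possible variant that keeps only pages with status 200 and reports the rest. The name `get_page_contents_checked` and its `get`, `delay`, and `timeout` parameters are assumptions made for this example; `get` is expected to behave like `requests.get`, and is passed in so the function can be tried without network access.

```python
import time

def get_page_contents_checked(urls, get, delay=2.0, timeout=10):
    """Fetch each URL with `get` and keep only the pages that returned 200."""
    page_contents = []
    failed = []  # (position of the URL, status code) for unsuccessful requests
    for ind, url in enumerate(urls, start=1):
        response = get(url, timeout=timeout)
        if response.status_code == 200:
            page_contents.append(response.text)
        else:
            failed.append((ind, response.status_code))
        # Pause between requests, as in the original function
        time.sleep(delay)
    return page_contents, failed

# In the notebook this could be called as:
# page_contents, failed = get_page_contents_checked(urls, requests.get)
```

Checking `failed` before parsing helps avoid saving an error page as if it were a list of jobs.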

In [12]:
# Download all pages and print the status code of each response
page_contents = get_page_contents(urls)

Status code of URL 1 is 200
Status code of URL 2 is 200
Status code of URL 3 is 200
Status code of URL 4 is 200
Status code of URL 5 is 200


In [13]:
print(*[len(content) for content in page_contents])

169426 172604 171195 172935 171862


In [14]:
def create_htmls(page_contents):
  htmls = []
  for ind, content in enumerate(page_contents):
    name = "linkedin-{}.html".format(ind+1)
    htmls.append(name)
    print(name)
    with open(name, 'w', encoding="utf-8") as file:
      file.write(content)
  return htmls
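
`create_htmls` writes to the same five file names on every run, so a later run replaces the earlier snapshots. Since search results change over time, one option is to put the date in the file name using `datetime`, which is listed in the prerequisites. This is only a sketch: `snapshot_name` is a hypothetical helper, and the date in the example call is arbitrary.

```python
from datetime import date

def snapshot_name(ind, day=None):
    """Return a file name that includes the page number and an ISO date."""
    day = day or date.today()
    return "linkedin-{}-{}.html".format(ind, day.isoformat())

print(snapshot_name(1, date(2023, 8, 15)))  # linkedin-1-2023-08-15.html
```

Inside `create_htmls`, the line that builds `name` could call `snapshot_name(ind+1)` instead.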

In [15]:
html_files = create_htmls(page_contents)

linkedin-1.html
linkedin-2.html
linkedin-3.html
linkedin-4.html
linkedin-5.html


In [16]:
def get_soup_objects(htmls):
  '''
  A function to get the soup objects from html files.
  Arguments:
  htmls<list> : A list of html file names.
  Returns: 
  A list of soup objects
  '''
  soups = []
  for html in htmls:
    with open(html, 'r', encoding="utf-8") as f:
      html_source = f.read()
      doc = BeautifulSoup(html_source, 'html.parser')
      soups.append(doc)
      print(type(doc))
  return soups

In [17]:
all_soups=get_soup_objects(html_files)
all_soups

<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>


[<!DOCTYPE html>
 
 <html lang="en">
 <head>
 <meta content="d_jobs_guest_search" name="pageKey"/>
 <!-- --> <meta content="urlType=jserp_custom;emptyResult=false" name="linkedin:pageTag"/>
 <meta content="en_US" name="locale"/>
 <meta data-app-version="2.0.1456" data-browser-id="af72db05-ee51-47fc-8149-ea65dab61fe6" data-call-tree-id="AAYL7983/gV7aLmm5FZ0/g==" data-disable-jsbeacon-pagekey-suffix="false" data-enable-page-view-heartbeat-tracking="" data-member-id="0" data-multiproduct-name="jobs-guest-frontend" data-page-instance="urn:li:page:d_jobs_guest_search;f+MHBJW9TuCe8juI8uqh6Q==" data-service-name="jobs-guest-frontend" id="config"/>
 <link href="https://de.linkedin.com/jobs/data-analyst-stellen-berlin" rel="canonical"/>
 <!-- --><!-- -->
 <!-- -->
 <!-- -->
 <!-- -->
 <!-- -->
 <link href="https://static.licdn.com/aero-v1/sc/h/al2o9zrvru7aqj8e1x2rzsrca" rel="icon"/>
 <script>
           function getDfd() {let yFn,nFn;const p=new Promise(function(y, n){yFn=y;nFn=n;});p.resolve=y

In [18]:
len(all_soups)

5

In [19]:
all_link_tags = all_soups

In [21]:
def job_titles(soup):
    '''
    A function to get all job titles from a single soup object.
    '''
    titles = []

    # Each job card links to its posting through an <a> tag with this class
    selection_class = "base-card__full-link"

    tags = soup.find_all('a', {'class': selection_class})

    for tag in tags:
        titles.append(tag.text.strip())

    return titles
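
To see what the `find_all` lookup does in isolation, here is a small sketch on a hand-written snippet. The markup and the `example.com` links are hypothetical and much simpler than the real page; only the class name `base-card__full-link` is taken from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written only to illustrate the lookup
sample_html = """
<ul>
  <li><a class="base-card__full-link" href="https://example.com/jobs/1">
        Data Analyst
  </a></li>
  <li><a class="base-card__full-link" href="https://example.com/jobs/2">BI Analyst</a></li>
  <li><a class="other-link" href="https://example.com/about">About</a></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
# Only <a> tags carrying the given class are returned
sample_tags = sample_soup.find_all("a", {"class": "base-card__full-link"})

sample_titles = [tag.text.strip() for tag in sample_tags]  # link text, whitespace removed
sample_links = [tag["href"] for tag in sample_tags]         # value of the href attribute

print(sample_titles)
print(sample_links)
```

The third link is skipped because its class does not match, and `strip()` removes the whitespace around the first title.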

In [22]:
job_titles(all_soups[1])
len(job_titles(all_soups[0]))

25

In [23]:
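def company_name(soup):
    '''
    A function to get all company names from a single soup object.
    '''
    company_names = []

    # Note: in the run below this selector matches no tags and the result is
    # empty, so the tag name and class should be checked against the saved HTML.
    company_class = 'job-search-card__subtitle'

    company_name_tags = soup.find_all('a', {'class': company_class})

    for tag in company_name_tags:
        company_names.append(tag.text.strip())

    return company_names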

In [24]:
print(len(company_name(all_soups[0])))

print(company_name(all_soups[0]))

0
[]
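
The empty result above means that no `<a>` tag on the page has the class `job-search-card__subtitle`. One way to investigate is to list which tag and class combinations actually occur in the parsed page. The sketch below does this with a hypothetical helper, `class_counts`, and runs it on made-up markup; the real class names have to be read from the saved HTML.

```python
from collections import Counter
from bs4 import BeautifulSoup

def class_counts(soup, keyword=""):
    """Count (tag name, class) pairs whose class name contains `keyword`."""
    counts = Counter()
    for tag in soup.find_all(True):  # True matches every tag
        for class_name in tag.get("class", []):
            if keyword in class_name:
                counts[(tag.name, class_name)] += 1
    return counts

# Hypothetical markup; the real class names must be read from the saved page
demo = BeautifulSoup(
    '<div><h4 class="card__subtitle"><a class="link">Acme GmbH</a></h4>'
    '<h4 class="card__subtitle"><a class="link">Example AG</a></h4></div>',
    "html.parser",
)
print(class_counts(demo, "subtitle"))
```

Calling `class_counts(all_soups[0], "subtitle")` in the notebook would show which tags, if any, carry a class containing "subtitle", which can then be used in `company_name`.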


In [25]:
def job_urls(soup):
    '''
    A function to get all job URLs from a single soup object.
    '''
    urls = []

    # The same <a> tags that hold the job titles also hold the links
    selection_class = "base-card__full-link"

    tags = soup.find_all('a', {'class': selection_class})

    for tag in tags:
        urls.append(tag['href'])

    return urls

In [26]:
len(job_urls(all_soups[0]))

25
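
Links taken from a search page can carry query parameters that are not needed to identify the posting. The notebook's output truncates the URLs, so whether these links include such parameters is an assumption. If they do, a sketch like the following could shorten them; `strip_query` is a hypothetical helper and the URL is made up.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Hypothetical URL, used only to show the effect
example = "https://example.com/jobs/view/data-analyst-123?refId=abc&position=1"
print(strip_query(example))
```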

In [27]:
def extract_jobs(all_soups):
    job = []
    for soup in all_soups:
        job.extend(job_titles(soup))
    return job

# Note: each of the three print calls in this cell reports the number of job
# titles. Call extract_company(all_soups) or extract_url(all_soups) instead
# to check the other columns.
print("The length of all columns")
print(len(extract_jobs(all_soups)))


def extract_company(all_soups):
    company = []
    for soup in all_soups:
        company.extend(company_name(soup))
    return company

print(len(extract_jobs(all_soups)))



def extract_url(all_soups):
    url = []
    for soup in all_soups:
        url.extend(job_urls(soup))
    return url

print(len(extract_jobs(all_soups)))


The length of all columns
125
125
125
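
The three `extract_*` functions share the same loop and differ only in the per-page function they call. As a sketch, that loop can be written once and given the per-page function as an argument. `collect` is a hypothetical helper; the example call uses plain strings and `str.split` as stand-ins for soup objects and an extractor.

```python
def collect(all_soups, extractor):
    """Apply `extractor` to each soup and flatten the results into one list."""
    items = []
    for soup in all_soups:
        items.extend(extractor(soup))
    return items

# Stand-in data: plain strings instead of soup objects, str.split as extractor
print(collect(["a b", "c"], str.split))
```

In the notebook, `collect(all_soups, job_titles)` would return the same list as `extract_jobs(all_soups)`.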


In [28]:
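!pip install pandas --quiet
import pandas as pd

# Avoid naming this variable `dict`, which would shadow the built-in type
data = {
    'Job title': extract_jobs(all_soups),
    'Company name': extract_company(all_soups),
    'Link URL': extract_url(all_soups),
}

# Wrapping each list in pd.Series lets columns of unequal length share one
# DataFrame; shorter columns are padded with NaN
df = pd.DataFrame({key: pd.Series(value) for key, value in data.items()})

df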

| | Job title | Company name | Link URL |
|---|---|---|---|
| 0 | Junior Data Analyst (m/f/d) | | https://de.linkedin.com/jobs/view/junior-data-... |
| 1 | Junior Data Analyst | | https://de.linkedin.com/jobs/view/junior-data-... |
| 2 | Business Intelligence Data Analyst | | https://de.linkedin.com/jobs/view/business-int... |
| 3 | Data Analyst (m/f/d) | | https://de.linkedin.com/jobs/view/data-analyst... |
| 4 | Data Analyst (m/f/x) | | https://de.linkedin.com/jobs/view/data-analyst... |
| ... | ... | ... | ... |
| 120 | Analyst Data Analytics (m/w/d) in Stuttgart | | https://de.linkedin.com/jobs/view/analyst-data... |
| 121 | Data Engineer (m/w/d) | | https://de.linkedin.com/jobs/view/data-enginee... |
| 122 | Data Engineer (w/m/div.) | | https://de.linkedin.com/jobs/view/data-enginee... |
| 123 | Data Engineer / Data Lake Developer / IT Data ... | | https://de.linkedin.com/jobs/view/data-enginee... |
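
The Company name column in the table above is empty, yet the DataFrame is built without any error, because wrapping each list in `pd.Series` pads shorter columns instead of failing. A quick check of the column lengths before building the DataFrame makes this kind of gap visible. The two helpers below are hypothetical and the data is a stand-in with the same shape as the notebook's columns.

```python
def column_lengths(columns):
    """Map each column name to the number of values collected for it."""
    return {name: len(values) for name, values in columns.items()}

def unequal_columns(columns):
    """Return True when the columns do not all have the same length."""
    return len(set(column_lengths(columns).values())) > 1

# Stand-in data mirroring the shape of the notebook's columns
sample = {"Job title": ["A", "B"], "Company name": [], "Link URL": ["u1", "u2"]}
print(column_lengths(sample))
print(unequal_columns(sample))
```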


In [29]:
df.to_csv('job_csv_file.csv', index=False)
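
The prerequisites also list the `csv` module, which can write the same kind of file without pandas. The following is a minimal sketch under that assumption: `write_jobs_csv` is a hypothetical helper, the rows are stand-in data, and the output goes to an in-memory buffer so the example does not touch the disk.

```python
import csv
import io
from itertools import zip_longest

def write_jobs_csv(file, titles, companies, urls):
    """Write one row per job, padding shorter columns with empty strings."""
    writer = csv.writer(file)
    writer.writerow(["Job title", "Company name", "Link URL"])
    for row in zip_longest(titles, companies, urls, fillvalue=""):
        writer.writerow(row)

# Stand-in data written to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
write_jobs_csv(buffer, ["Data Analyst"], [], ["https://example.com/jobs/1"])
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open('job_csv_file.csv', 'w', newline='', encoding='utf-8')`.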

In [30]:
jovian.commit(files=['job_csv_file.csv'])

[jovian] Updating notebook "satyakipal99/python" on https://jovian.com/
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.com/satyakipal99/python


'https://jovian.com/satyakipal99/python'