# Web Scraping Job Vacancies

## Introduction

In this project, we'll build a web scraper to extract job listings from a popular job search platform. We'll extract job titles, companies, locations, job descriptions, and other relevant information.

Here are the main steps we'll follow in this project:

1. Set up our development environment
2. Understand the basics of web scraping
3. Analyze the website structure of our job search platform
4. Write the Python code to extract job data from our job search platform
5. Save the data to a CSV file
6. Test our web scraper and refine our code as needed

## Prerequisites

Before starting this project, you should have some basic knowledge of Python programming and HTML structure. The project uses the following packages:

- requests
- beautifulsoup4 (imported as `bs4`)
- csv (part of the Python standard library)
- datetime (part of the Python standard library)

The two third-party packages should already be installed in Coursera's Jupyter Notebook environment. If you are working off platform, or need a package that is not included in this environment, you can install it with `!pip install packagename` in a notebook cell, for example:

- `!pip install requests`
- `!pip install beautifulsoup4`

## Step 1: Importing Required Libraries

In [1]:
# Import required libraries
import requests
from bs4 import BeautifulSoup
import time
import random
import logging
from typing import Dict, Optional
from datetime import datetime
import csv

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

## Step 2: Generating a URL with a function

In [2]:
from urllib.parse import quote

def generate_linkedin_url(position: str, location: str) -> str:
    """Generate LinkedIn URL for job search"""
    # Percent-encode the search terms so spaces and characters such as '&' or '+' are safe in the query string
    position = quote(position, safe='')
    location = quote(location, safe='')
    return f"https://www.linkedin.com/jobs/search?keywords={position}&location={location}"

# Test the function
test_url = generate_linkedin_url("Data Analyst", "New York")
print(f"Generated URL: {test_url}")

Generated URL: https://www.linkedin.com/jobs/search?keywords=Data%20Analyst&location=New%20York


## Step 3: Extract the Job Data from a single job posting card

In [6]:
def extract_linkedin_job_data(job_card: BeautifulSoup, headers: dict) -> Optional[Dict]:
    """Extract job data from a LinkedIn job card"""
    try:
        # Extract basic job information
        title = job_card.find('h3', class_='base-search-card__title').text.strip()
        company = job_card.find('h4', class_='base-search-card__subtitle').text.strip()
        location = job_card.find('span', class_='job-search-card__location').text.strip()
        
        # Extract job link
        job_link = job_card.find('a', class_='base-card__full-link')['href']
        
        # Extract posting date
        date_posted = job_card.find('time')['datetime'] if job_card.find('time') else None
        
        # Get job description from the job detail page
        description = ""
        try:
            # Add delay before requesting job details
            time.sleep(random.uniform(2, 4))
            
            # Make request to job detail page
            job_response = requests.get(job_link, headers=headers, timeout=10)
            
            if job_response.status_code == 200:
                job_soup = BeautifulSoup(job_response.text, 'html.parser')
                
                # Find the job description
                description_div = job_soup.find('div', class_='show-more-less-html__markup')
                if description_div:
                    description = description_div.get_text(strip=True)
                else:
                    # Try alternative class names
                    description_div = job_soup.find('div', class_='description__text')
                    if description_div:
                        description = description_div.get_text(strip=True)
            else:
                logger.warning(f"Could not fetch job description. Status code: {job_response.status_code}")
        
        except Exception as e:
            logger.warning(f"Error fetching job description: {e}")
        
        # Create job data dictionary
        job_data = {
            'title': title,
            'company': company,
            'location': location,
            'job_link': job_link,
            'date_posted': date_posted,
            'description': description
        }
        
        return job_data
    
    except Exception as e:
        logger.error(f"Error extracting job data: {e}")
        return None
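
To check the selectors without sending any requests, the card-level part of this function can be exercised against a hand-written HTML fragment. The following is a minimal sketch under stated assumptions: the markup is an invented sample that only reuses the class names from the function above (it is not copied from LinkedIn), and `parse_card_fields` is a hypothetical helper that skips the detail-page request.

```python
from typing import Dict, Optional

from bs4 import BeautifulSoup

# Invented sample markup: only the tag and class names mirror the selectors above.
SAMPLE_HTML = """
<div class="base-card">
  <a class="base-card__full-link" href="https://example.com/jobs/view/123">Data Analyst</a>
  <h3 class="base-search-card__title">
    Data Analyst
  </h3>
  <h4 class="base-search-card__subtitle">
    Example Corp
  </h4>
  <span class="job-search-card__location">
    New York, NY
  </span>
  <time datetime="2024-12-01">4 days ago</time>
</div>
"""


def parse_card_fields(card) -> Dict[str, Optional[str]]:
    """Hypothetical helper: extract only the fields available on the card itself."""
    time_tag = card.find('time')
    return {
        'title': card.find('h3', class_='base-search-card__title').text.strip(),
        'company': card.find('h4', class_='base-search-card__subtitle').text.strip(),
        'location': card.find('span', class_='job-search-card__location').text.strip(),
        'job_link': card.find('a', class_='base-card__full-link')['href'],
        # The <time> element may be absent, so fall back to None
        'date_posted': time_tag['datetime'] if time_tag else None,
    }


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
card = soup.find('div', class_='base-card')
fields = parse_card_fields(card)
print(fields)
```

If the real page's markup differs from these class names, one of the `find` calls returns `None` and the `.text` access raises `AttributeError`, which is why the original function wraps the extraction in `try`/`except`.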

## Step 4: Define the main function

In [7]:
def scrape_linkedin_jobs(position: str, location: str) -> list:
    """Main function to scrape LinkedIn jobs"""
    
    # Initialize variables
    jobs_list = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    
    try:
        # Generate URL
        url = generate_linkedin_url(position, location)
        
        # Add delay to prevent rate limiting
        time.sleep(random.uniform(2, 5))
        
        # Make request to LinkedIn
        response = requests.get(url, headers=headers, timeout=10)
        
        if response.status_code == 200:
            # Parse the page
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # Find all job cards
            job_cards = soup.find_all('div', class_='base-card')
            
            # Extract data from each job card
            for card in job_cards:
                job_data = extract_linkedin_job_data(card, headers)
                if job_data:
                    jobs_list.append(job_data)
                    logger.info(f"Successfully scraped job: {job_data['title']} at {job_data['company']}")
            
            logger.info(f"Successfully scraped {len(jobs_list)} valid jobs from LinkedIn")
        else:
            logger.error(f"LinkedIn returned status code: {response.status_code}")
    
    except Exception as e:
        logger.error(f"Error occurred while scraping: {e}")
    
    return jobs_list
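
The final cell of this notebook calls `save_to_csv(jobs, filename)`, but no function with that name is defined in any cell, so a successful scrape would end in a `NameError`. The sketch below is one possible definition, not the author's original. It assumes each job is a dictionary with the six keys built in `extract_linkedin_job_data`, and it uses only the standard `csv` module.

```python
import csv
from typing import Dict, List

# Column order assumed from the job_data dictionary built in Step 3
FIELDNAMES = ['title', 'company', 'location', 'job_link', 'date_posted', 'description']


def save_to_csv(jobs: List[Dict], filename: str) -> None:
    """Write a list of job dictionaries to a CSV file with a header row."""
    if not jobs:
        # Nothing to write; avoid creating an empty file
        return
    # newline='' lets the csv module control line endings, as its documentation recommends
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(jobs)
```

A value of `None` (for example a missing `date_posted`) is written as an empty field.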


## Step 5: Test the Scraper and Save the Results

In [8]:
# Test the scraping
position = "Data Analyst"
location = "New York"
jobs = scrape_linkedin_jobs(position, location)

# Save results
if jobs:
    filename = f"job_listings_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
    save_to_csv(jobs, filename)

2024-12-05 00:14:11,054 - ERROR - LinkedIn returned status code: 429
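
Status code 429 means "Too Many Requests": the server refused the search request, so no jobs were scraped in this run. A common way to handle this is to retry with exponential backoff. The sketch below is illustrative rather than part of the original notebook: `get_with_backoff` and its parameters are hypothetical names, and `fetch` stands for any zero-argument callable that returns an object with a `status_code` attribute, such as `lambda: requests.get(url, headers=headers, timeout=10)`.

```python
import random
import time
from typing import Any, Callable


def get_with_backoff(
    fetch: Callable[[], Any],
    max_retries: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch() and retry while it returns HTTP 429, waiting longer each time.

    Returns the last response, which may still be a 429 if all retries fail.
    """
    response = fetch()
    for attempt in range(max_retries):
        if response.status_code != 429:
            return response
        # Exponential backoff plus up to 1 second of random jitter:
        # roughly base_delay, 2 * base_delay, 4 * base_delay, ...
        delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
        sleep(delay)
        response = fetch()
    return response
```

The `sleep` parameter exists only so the waiting can be replaced in tests. Backoff reduces the chance of repeated 429 responses but cannot guarantee that a request will eventually succeed.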


## Conclusions and Notes

1. **Functionality**:
   - When the search request succeeds, the scraper extracts key job information including:
     - Job title
     - Company name
     - Location
     - Job posting URL
     - Date posted
     - Job description (fetched from each job's detail page)

2. **Limitations**:
   - LinkedIn rate-limits requests: the test run above was rejected with HTTP 429 (Too Many Requests), so no jobs were saved
   - Only scrapes the first page of results
   - Requires careful handling of request delays

3. **Best Practices Implemented**:
   - Error handling at each step
   - Logging for debugging
   - Random delays between requests
   - User-Agent header to mimic a browser
   - Type hints for better code documentation

4. **Potential Improvements**:
   - Add pagination to get more results
   - Implement proxy rotation
   - Add more job details from individual job pages
   - Add data cleaning and analysis
   - Implement retry mechanism for failed requests

5. **Usage Notes**:
   - Wait between scraping sessions to avoid rate limiting
   - Consider using LinkedIn's official API for production use
   - Respect robots.txt and website terms of service
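
The last usage note can be partly automated with the standard library's `urllib.robotparser`. The following is a minimal sketch, assuming a made-up robots.txt and a hypothetical helper named `is_allowed`. The rules shown are not LinkedIn's, and a robots.txt check does not replace reading a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; a real site's robots.txt will differ.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/jobs/search"))
print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/data"))
```

To check a live site instead of a string, `RobotFileParser` also provides `set_url()` and `read()`, which download the file before `can_fetch()` is called.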