Enhanced JobSpy ETL Pipeline

A robust, scalable ETL (Extract, Transform, Load) pipeline for scraping and analyzing job postings from multiple job boards, including LinkedIn, Indeed, Glassdoor, and ZipRecruiter.

Enhanced Features

  • Multi-source Job Scraping: Parallel scraping from major job boards
  • Intelligent Rate Limiting: Per-site rate limiting with automatic throttling (see the sketch after this list)
  • Dual Database Storage:
    • SQLite for reliable persistent storage
    • DuckDB for high-performance analytics
  • Parallel Processing: Multi-threaded job scraping with configurable worker count
  • Built-in Analytics: Automated generation of salary and company analytics
  • Robust Error Handling:
    • Automatic retries with exponential backoff
    • Per-site error isolation
    • Comprehensive logging
  • Data Export: Automated CSV exports of analytics results
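
The rate limiter keeps a per-site record of recent request timestamps and pauses when a window fills up. A minimal sketch of that idea, assuming a sliding window keyed by site name (the class and method names below are illustrative, not the pipeline's actual internals):

import time
from collections import defaultdict, deque

class SiteRateLimiter:
    # Illustrative sliding-window limiter; JobScraperETL's real implementation may differ
    def __init__(self, max_requests=45, window_size=60):
        self.max_requests = max_requests
        self.window_size = window_size
        self.request_times = defaultdict(deque)  # site -> recent request timestamps

    def wait_if_needed(self, site):
        window = self.request_times[site]
        now = time.time()
        # Discard timestamps that have aged out of the window
        while window and now - window[0] > self.window_size:
            window.popleft()
        if len(window) >= self.max_requests:
            # Sleep until the oldest request leaves the window
            time.sleep(self.window_size - (now - window[0]))
        window.append(time.time())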

Requirements

Install the Python dependencies with:

pip install -r requirements.txt

Quick Start

from runner import JobScraperETL

# Initialize the ETL pipeline
etl = JobScraperETL()

# Run the complete pipeline with 4 worker threads
etl.run_etl_pipeline(max_workers=4)

Configuration

Key configuration parameters can be adjusted in the JobScraperETL class:

# Rate limiting settings
self.max_requests = 45  # Maximum requests per time window
self.window_size = 60   # Time window in seconds

# Worker threads for parallel processing
max_workers = 4  # Adjust based on your system's capabilities

# Job search parameters
self.search_locations = [
    "New York, NY", "San Francisco, CA", "Seattle, WA",
    "Austin, TX", "Boston, MA", "Chicago, IL"
]

self.job_titles = [
    "software engineer", "software developer", 
    "data scientist", "machine learning engineer",
    "data engineer", "full stack developer"
]

Analytics Output

The pipeline automatically generates two types of analytics:

Salary Analytics (salary_analytics.csv)

  • Average, minimum, and maximum salaries by location and job type
  • Job count per location/job type combination
  • Salary trends analysis

Company Analytics (company_analytics.csv)

  • Total job postings per company
  • Number of unique locations per company
  • Diversity of job types
  • Average salary offerings
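
Both exports are plain CSV files and can be inspected with pandas. A minimal sketch (the exact column names depend on the pipeline's aggregation queries):

import pandas as pd

# Filenames match the pipeline's default export names
salary = pd.read_csv("salary_analytics.csv")
companies = pd.read_csv("company_analytics.csv")

print(salary.head())
print(companies.head())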

Database Schema

The pipeline maintains synchronized SQLite and DuckDB databases with the following schema:

CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id TEXT UNIQUE,
    title TEXT,
    company TEXT,
    company_url TEXT,
    job_url TEXT,
    location_country TEXT,
    location_city TEXT,
    location_state TEXT,
    description TEXT,
    job_type TEXT,
    salary_interval TEXT,
    salary_min_amount REAL,
    salary_max_amount REAL,
    salary_currency TEXT,
    date_posted TIMESTAMP,
    is_remote BOOLEAN,
    job_function TEXT,
    company_industry TEXT,
    source_site TEXT,
    scrape_date TIMESTAMP,
    UNIQUE(job_url, company, title)
)
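
Either database can be queried directly after a run. A minimal sketch, assuming the databases are stored as jobs.db and jobs.duckdb (the actual paths depend on how JobScraperETL is configured):

import sqlite3
import duckdb

# Paths are illustrative; point these at the files your pipeline actually writes
sqlite_conn = sqlite3.connect("jobs.db")
total_jobs = sqlite_conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(f"Jobs stored in SQLite: {total_jobs}")

duck_conn = duckdb.connect("jobs.duckdb")
top_cities = duck_conn.execute("""
    SELECT location_city, COUNT(*) AS postings
    FROM jobs
    GROUP BY location_city
    ORDER BY postings DESC
    LIMIT 10
""").fetchall()
print(top_cities)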

Error Handling

The pipeline implements comprehensive error handling (a retry sketch follows the list below):

  • Rate Limiting: Automatic throttling when approaching API limits
  • Retries: Automatic retries with exponential backoff for transient failures
  • Isolation: Errors in one scraping task don't affect others
  • Logging: Detailed logging of all operations and errors
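
The retry behavior can be pictured as a wrapper around each scraping call. A minimal sketch of exponential backoff with jitter, using a hypothetical scrape_fn callable (the pipeline's own retry code may be structured differently):

import logging
import random
import time

def scrape_with_retries(scrape_fn, max_retries=3, base_delay=2.0):
    # Hypothetical helper: retries a scraping call on transient failures
    for attempt in range(max_retries):
        try:
            return scrape_fn()
        except Exception as exc:
            if attempt == max_retries - 1:
                raise  # Out of retries; surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            logging.warning("Scrape failed (%s); retrying in %.1fs", exc, delay)
            time.sleep(delay)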

Logging

Logs are written to job_scraper.log and include:

  • Scraping progress and results
  • Database operations
  • Rate limiting events
  • Errors and warnings
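
Logging uses the standard library logging module. A minimal configuration sketch that writes to the same file (the format string is an assumption, not the pipeline's exact setup):

import logging

logging.basicConfig(
    filename="job_scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger(__name__).info("ETL run started")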

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
