Job Scraper & Analytics Dashboard

Automated job data pipeline for LinkedIn with intelligent skill extraction and real-time analytics.

Python 3.11+ · Playwright · Streamlit · SQLite


Overview

A production-ready job scraping system that collects job listings from LinkedIn, extracts technical skills using regex-based pattern matching, and provides interactive analytics through a Streamlit dashboard.

Key Capabilities

Feature                    Description
Two-Phase Scraping         Separate URL collection and detail extraction for resilience
3-Layer Skill Extraction   977 skills with regex patterns, minimal false positives
150 Role Categories        Automatic role normalization with pattern matching
Real-Time Analytics        Interactive charts, skill trends, and export capabilities
Adaptive Rate Limiting     Circuit breaker with auto-tuning concurrency (2-10 workers)
Resume Capability          Checkpoint-based recovery from interruptions
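
As a rough illustration of the role-normalization row above, the sketch below maps raw job titles onto categories with regex patterns. The layout assumed for roles_reference_2025.json (a top-level "roles" list with "category" and "patterns" keys) is an assumption made for the example, not the file's documented format.

import json
import re

def load_role_patterns(path="src/config/roles_reference_2025.json"):
    # Assumed JSON shape: {"roles": [{"category": ..., "patterns": [...]}, ...]}
    with open(path, encoding="utf-8") as f:
        roles = json.load(f)["roles"]
    return [(r["category"], [re.compile(p, re.IGNORECASE) for p in r["patterns"]])
            for r in roles]

def normalize_role(raw_title, role_patterns):
    # Return the first category whose pattern matches the scraped title.
    for category, patterns in role_patterns:
        if any(p.search(raw_title) for p in patterns):
            return category
    return "Other"  # fallback when nothing matches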

Project Structure

Job_Scrapper/
├── README.md                     # This file
├── requirements.txt              # Production dependencies
├── requirements-dev.txt          # Development dependencies
├── .gitignore                    # Git ignore rules
│
├── code/                         # All source code
│   ├── streamlit_app.py          # Main dashboard entry point
│   ├── run_scraper.py            # CLI scraper runner
│   ├── save_linkedin_cookies.py  # LinkedIn authentication helper
│   ├── setup_playwright.sh       # Playwright browser installer (WSL/Linux)
│   │
│   ├── data/
│   │   ├── jobs.db               # SQLite database (auto-created)
│   │   └── Analysis_Report/      # Generated analysis reports
│   │       ├── Data_Analyst/
│   │       ├── Data_Engineer/
│   │       └── GenAI_DataScience/
│   │
│   ├── src/
│   │   ├── config/               # Configuration files
│   │   │   ├── skills_reference_2025.json   # 977 skills with regex patterns
│   │   │   ├── roles_reference_2025.json    # 150 role categories
│   │   │   ├── countries.py      # Country/location mappings
│   │   │   └── naukri_locations.py
│   │   │
│   │   ├── db/                   # Database layer
│   │   │   ├── connection.py     # SQLite connection manager
│   │   │   ├── schema.py         # Table schemas
│   │   │   └── operations.py     # CRUD operations
│   │   │
│   │   ├── models/
│   │   │   └── models.py         # Pydantic data models
│   │   │
│   │   ├── scraper/
│   │   │   ├── unified/
│   │   │   │   ├── linkedin/     # LinkedIn scraper components
│   │   │   │   │   ├── concurrent_detail_scraper.py  # Multi-tab scraper (up to 10 tabs)
│   │   │   │   │   ├── sequential_detail_scraper.py  # Single-tab scraper
│   │   │   │   │   ├── playwright_url_scraper.py     # URL collection
│   │   │   │   │   ├── selector_config.py            # CSS selectors
│   │   │   │   │   ├── retry_helper.py               # 404/503 handling
│   │   │   │   │   └── job_validator.py              # Field validation
│   │   │   │   │
│   │   │   │   ├── naukri/       # Naukri scraper components
│   │   │   │   │   ├── url_scraper.py
│   │   │   │   │   ├── detail_scraper.py
│   │   │   │   │   └── selectors.py
│   │   │   │   │
│   │   │   │   ├── scalable/     # Rate limiting & resilience
│   │   │   │   │   ├── adaptive_rate_limiter.py
│   │   │   │   │   ├── checkpoint_manager.py
│   │   │   │   │   └── progress_tracker.py
│   │   │   │   │
│   │   │   │   ├── linkedin_unified.py   # LinkedIn orchestrator
│   │   │   │   └── naukri_unified.py     # Naukri orchestrator
│   │   │   │
│   │   │   └── services/         # External service clients
│   │   │       ├── playwright_browser.py
│   │   │       └── session_manager.py
│   │   │
│   │   ├── analysis/
│   │   │   └── skill_extraction/ # 3-layer skill extraction
│   │   │       ├── extractor.py           # Main AdvancedSkillExtractor class
│   │   │       ├── layer3_direct.py       # Pattern matching from JSON
│   │   │       ├── batch_reextract.py     # Re-process existing jobs
│   │   │       └── deduplicator.py        # Skill normalization
│   │   │
│   │   ├── ui/
│   │   │   └── components/       # Streamlit UI components
│   │   │       ├── kpi_dashboard.py
│   │   │       ├── link_scraper_form.py
│   │   │       ├── detail_scraper_form.py
│   │   │       └── analytics/
│   │   │           ├── skills_charts.py
│   │   │           └── overview_metrics.py
│   │   │
│   │   ├── utils/
│   │   │   └── cleanup_expired_urls.py
│   │   │
│   │   └── validation/
│   │       ├── validation_pipeline.py
│   │       └── single_job_validator.py
│   │
│   ├── scripts/
│   │   ├── extraction/
│   │   │   └── reextract_skills.py
│   │   │
│   │   └── validation/           # Validation suite
│   │       ├── layer1_syntax_check.sh
│   │       ├── layer2_coverage.sh
│   │       ├── layer3_fp_detection.sh
│   │       ├── layer4_fn_detection.sh
│   │       ├── cross_verify_skills.py
│   │       └── run_all_validations.sh
│   │
│   ├── tests/
│   │   ├── test_skill_validation_comprehensive.py
│   │   └── test_linkedin_selectors.py
│   │
│   └── docs/                     # Documentation
│       └── archive/              # Historical docs
│
└── Analysis/                     # Downloaded CSVs and notebooks (gitignored)
    ├── Data Analysis/
    │   ├── data_visualizer.ipynb    # Analysis notebook (update CSV path for charts)
    │   └── csv/                     # Add exported CSVs here
    │
    ├── Data Engineering/
    │   ├── data_visualizer.ipynb
    │   └── csv/
    │
    └── GenAI & DataScience/
        ├── data_visualizer.ipynb
        └── csv/

Installation

Prerequisites

  • Python 3.11 or higher
  • Git

Step 1: Clone & Create Virtual Environment

Windows (PowerShell)

git clone https://github.com/Gaurav-Wankhede/Job-Scrapper.git
cd Job-Scrapper

# Create virtual environment
python -m venv venv-win

# Activate
.\venv-win\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt

Linux / WSL

git clone https://github.com/Gaurav-Wankhede/Job-Scrapper.git
cd Job-Scrapper

# Create virtual environment
python3 -m venv venv-linux

# Activate
source venv-linux/bin/activate

# Install dependencies
python -m pip install -r requirements.txt

Note for dual-boot users: Keep separate venvs (venv-win/ and venv-linux/) as Python virtual environments are not cross-platform compatible.

Step 2: Install Playwright Browsers

# Windows
playwright install chromium

# Linux/WSL (use python -m prefix)
python -m playwright install chromium

Step 3: Launch Dashboard

cd code

# Windows
streamlit run streamlit_app.py

# Linux/WSL (use python -m prefix)
python -m streamlit run streamlit_app.py

The dashboard opens at http://localhost:8501


Architecture

Why Two-Phase Scraping?

Phase 1: URL Collection          Phase 2: Detail Scraping
┌─────────────────────┐         ┌─────────────────────┐
│  Search Results     │         │  Individual Jobs    │
│  ├── Fast scroll    │   ──▶   │  ├── Full desc      │
│  ├── Extract URLs   │         │  ├── Skills parse   │
│  └── Store to DB    │         │  └── Store details  │
└─────────────────────┘         └─────────────────────┘
      job_urls table                  jobs table

Benefits:

  • Resilience: If detail scraping fails, URLs are preserved
  • Efficiency: Batch process up to 10 jobs concurrently in Phase 2
  • Resumable: Pick up exactly where you left off
  • Deduplication: Skip already-scraped URLs automatically
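
A minimal sketch of how the two phases could hand off through SQLite, using the job_urls.scraped flag from the Database Schema section; the helper names below are illustrative, not the project's actual operations.py API.

import sqlite3

def pending_urls(db_path="data/jobs.db", limit=100):
    # Phase 2 only picks up URLs that Phase 1 stored but has not processed yet.
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT job_id, url FROM job_urls WHERE scraped = 0 LIMIT ?", (limit,)
    ).fetchall()
    con.close()
    return rows

def mark_scraped(job_id, db_path="data/jobs.db"):
    # Flagging the URL after its details are stored makes an interrupted run resumable.
    con = sqlite3.connect(db_path)
    con.execute("UPDATE job_urls SET scraped = 1 WHERE job_id = ?", (job_id,))
    con.commit()
    con.close()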

Why Regex-Based Skill Extraction?

Approach         Speed       Accuracy   Maintenance
Regex (chosen)   0.3s/job    85-90%     Pattern file updates
spaCy NER        3-5s/job    75-80%     Model retraining
GPT-based        2-10s/job   90%+       API costs

Our 3-layer approach achieves 85-90% accuracy at roughly 10x the speed of NLP-based extraction:

  1. Layer 1: Multi-word phrase extraction (priority matching)
  2. Layer 2: Context-aware extraction (technical context detection)
  3. Layer 3: Direct pattern matching (977 skill patterns from JSON)
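
The direct-matching layer can be pictured with a short sketch. This is not the actual AdvancedSkillExtractor, just an illustration of regex matching driven by the skills_reference_2025.json format shown in the Configuration section.

import json
import re

def load_skill_patterns(path="src/config/skills_reference_2025.json"):
    # Compile the regex patterns bundled with each skill entry.
    with open(path, encoding="utf-8") as f:
        skills = json.load(f)["skills"]
    return {s["name"]: [re.compile(p) for p in s["patterns"]] for s in skills}

def extract_skills(job_description, skill_patterns):
    # Layer 3-style direct matching: a skill counts once if any of its patterns hits.
    found = {name for name, patterns in skill_patterns.items()
             if any(p.search(job_description) for p in patterns)}
    return sorted(found)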

Usage

Dashboard Workflow

  1. KPI Dashboard - View overall statistics
  2. Link Scraper - Phase 1: Collect job URLs
  3. Detail Scraper - Phase 2: Extract job details & skills
  4. Analytics - Analyze skill trends and export data

Command Line

cd code

# Run validation suite
bash scripts/validation/run_all_validations.sh

# Re-extract skills for existing jobs
python -m src.analysis.skill_extraction.batch_reextract --batch-size 100

LinkedIn Authentication (Optional)

For authenticated scraping with higher limits:

cd code
python save_linkedin_cookies.py

This saves cookies to linkedin_cookies.json for subsequent sessions.
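
The flow is roughly the sketch below (the actual save_linkedin_cookies.py may differ); context.cookies() and context.add_cookies() are Playwright's standard calls for persisting and restoring a logged-in session.

import json
from playwright.sync_api import sync_playwright

COOKIE_FILE = "linkedin_cookies.json"

def save_cookies():
    # Open a visible browser, let the user log in manually, then dump the session cookies.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://www.linkedin.com/login")
        input("Log in in the browser window, then press Enter here...")
        with open(COOKIE_FILE, "w", encoding="utf-8") as f:
            json.dump(context.cookies(), f)
        browser.close()

def load_cookies(context):
    # Later scraping runs attach the saved cookies to a fresh browser context.
    with open(COOKIE_FILE, encoding="utf-8") as f:
        context.add_cookies(json.load(f))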


Configuration

Skills Reference (code/src/config/skills_reference_2025.json)

{
  "total_skills": 977,
  "skills": [
    {
      "name": "Python",
      "patterns": ["\\bPython\\b", "\\bpython\\b", "\\bPython3\\b"]
    }
  ]
}

Environment Variables (Optional)

Create .env file in code/ directory:

# Database path (default: data/jobs.db)
DB_PATH=data/jobs.db

# Playwright browser path (for WSL)
PLAYWRIGHT_BROWSERS_PATH=.playwright-browsers
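
A minimal sketch of reading these variables with the documented defaults; whether the project uses python-dotenv or plain os.environ is an assumption.

import os
from pathlib import Path

# Defaults mirror the values documented above; a loader such as python-dotenv
# could populate os.environ from .env before this runs (assumption).
DB_PATH = Path(os.getenv("DB_PATH", "data/jobs.db"))
PLAYWRIGHT_BROWSERS_PATH = os.getenv("PLAYWRIGHT_BROWSERS_PATH")  # optional, WSL only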

Database Schema

-- Phase 1: URL Collection
CREATE TABLE job_urls (
    job_id TEXT PRIMARY KEY,
    platform TEXT NOT NULL,
    input_role TEXT NOT NULL,
    actual_role TEXT NOT NULL,
    url TEXT NOT NULL UNIQUE,
    scraped INTEGER DEFAULT 0
);

-- Phase 2: Full Details
CREATE TABLE jobs (
    job_id TEXT PRIMARY KEY,
    platform TEXT NOT NULL,
    actual_role TEXT NOT NULL,
    url TEXT NOT NULL UNIQUE,
    job_description TEXT,
    skills TEXT,
    company_name TEXT,
    posted_date TEXT,
    scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
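
One way to write a Phase 2 result against this schema is sketched below; INSERT OR IGNORE leans on the PRIMARY KEY and UNIQUE url constraints so a re-scraped job is not duplicated. The real operations.py likely exposes a different API.

import sqlite3

def insert_job(job, db_path="data/jobs.db"):
    # 'job' is assumed to be a dict with keys matching the columns below.
    con = sqlite3.connect(db_path)
    con.execute(
        """INSERT OR IGNORE INTO jobs
           (job_id, platform, actual_role, url, job_description, skills,
            company_name, posted_date)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
        (job["job_id"], job["platform"], job["actual_role"], job["url"],
         job["job_description"], ", ".join(job["skills"]),
         job["company_name"], job["posted_date"]),
    )
    con.commit()
    con.close()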

Performance

Metric             Value
URL Collection     200-300 URLs/min
Detail Scraping    15-20 jobs/min (10 workers)
Skill Extraction   0.3s/job
Storage per Job    ~2KB

Troubleshooting

Playwright Browser Not Found (WSL/Linux)

cd code
chmod +x setup_playwright.sh
./setup_playwright.sh

"python" command not found (Linux)

Use python3 or the python -m prefix:

python3 -m streamlit run streamlit_app.py
python3 -m pip install package_name

Rate Limited (429 Errors)

The adaptive rate limiter handles this automatically:

  • Concurrency reduces from 10 → 2
  • Circuit breaker triggers 60s pause
  • Gradually recovers when stable
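
The behaviour described above could look roughly like the sketch below; the thresholds match the README's numbers (2-10 workers, 60s pause), but the structure of the real adaptive_rate_limiter.py is an assumption.

import time

class AdaptiveRateLimiterSketch:
    # Illustrative circuit breaker: shrink concurrency on 429s, pause, then recover.
    def __init__(self, min_workers=2, max_workers=10, pause_seconds=60):
        self.min_workers = min_workers
        self.max_workers = max_workers
        self.pause_seconds = pause_seconds
        self.workers = max_workers

    def on_rate_limited(self):
        # Back off hard: drop to the floor and keep the circuit open for a while.
        self.workers = self.min_workers
        time.sleep(self.pause_seconds)

    def on_success(self):
        # Recover gradually, one worker at a time, up to the ceiling.
        if self.workers < self.max_workers:
            self.workers += 1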

Database Locked

pkill -f streamlit
python -m streamlit run streamlit_app.py

Development

Install Dev Dependencies

pip install -r requirements-dev.txt

Run Tests

cd code
python -m pytest tests/ -v

Type Checking

cd code
python -m basedpyright src/

License

MIT License - See LICENSE file for details.
