Automated job data pipeline for LinkedIn with intelligent skill extraction and real-time analytics.
A production-ready job scraping system that collects job listings from LinkedIn, extracts technical skills using regex-based pattern matching, and provides interactive analytics through a Streamlit dashboard.
| Feature | Description |
|---|---|
| Two-Phase Scraping | Separate URL collection and detail extraction for resilience |
| 3-Layer Skill Extraction | 977 skills with regex patterns, minimal false positives |
| 150 Role Categories | Automatic role normalization with pattern matching |
| Real-Time Analytics | Interactive charts, skill trends, and export capabilities |
| Adaptive Rate Limiting | Circuit breaker with auto-tuning concurrency (2-10 workers) |
| Resume Capability | Checkpoint-based recovery from interruptions |
Job_Scrapper/
├── README.md                        # This file
├── requirements.txt                 # Production dependencies
├── requirements-dev.txt             # Development dependencies
├── .gitignore                       # Git ignore rules
│
├── code/                            # All source code
│   ├── streamlit_app.py             # Main dashboard entry point
│   ├── run_scraper.py               # CLI scraper runner
│   ├── save_linkedin_cookies.py     # LinkedIn authentication helper
│   ├── setup_playwright.sh          # Playwright browser installer (WSL/Linux)
│   │
│   ├── data/
│   │   ├── jobs.db                  # SQLite database (auto-created)
│   │   └── Analysis_Report/         # Generated analysis reports
│   │       ├── Data_Analyst/
│   │       ├── Data_Engineer/
│   │       └── GenAI_DataScience/
│   │
│   ├── src/
│   │   ├── config/                  # Configuration files
│   │   │   ├── skills_reference_2025.json   # 977 skills with regex patterns
│   │   │   ├── roles_reference_2025.json    # 150 role categories
│   │   │   ├── countries.py         # Country/location mappings
│   │   │   └── naukri_locations.py
│   │   │
│   │   ├── db/                      # Database layer
│   │   │   ├── connection.py        # SQLite connection manager
│   │   │   ├── schema.py            # Table schemas
│   │   │   └── operations.py        # CRUD operations
│   │   │
│   │   ├── models/
│   │   │   └── models.py            # Pydantic data models
│   │   │
│   │   ├── scraper/
│   │   │   ├── unified/
│   │   │   │   ├── linkedin/        # LinkedIn scraper components
│   │   │   │   │   ├── concurrent_detail_scraper.py   # Multi-tab scraper (up to 10 tabs)
│   │   │   │   │   ├── sequential_detail_scraper.py   # Single-tab scraper
│   │   │   │   │   ├── playwright_url_scraper.py      # URL collection
│   │   │   │   │   ├── selector_config.py             # CSS selectors
│   │   │   │   │   ├── retry_helper.py                # 404/503 handling
│   │   │   │   │   └── job_validator.py               # Field validation
│   │   │   │   │
│   │   │   │   ├── naukri/          # Naukri scraper components
│   │   │   │   │   ├── url_scraper.py
│   │   │   │   │   ├── detail_scraper.py
│   │   │   │   │   └── selectors.py
│   │   │   │   │
│   │   │   │   ├── scalable/        # Rate limiting & resilience
│   │   │   │   │   ├── adaptive_rate_limiter.py
│   │   │   │   │   ├── checkpoint_manager.py
│   │   │   │   │   └── progress_tracker.py
│   │   │   │   │
│   │   │   │   ├── linkedin_unified.py   # LinkedIn orchestrator
│   │   │   │   └── naukri_unified.py     # Naukri orchestrator
│   │   │   │
│   │   │   └── services/            # External service clients
│   │   │       ├── playwright_browser.py
│   │   │       └── session_manager.py
│   │   │
│   │   ├── analysis/
│   │   │   └── skill_extraction/    # 3-layer skill extraction
│   │   │       ├── extractor.py     # Main AdvancedSkillExtractor class
│   │   │       ├── layer3_direct.py # Pattern matching from JSON
│   │   │       ├── batch_reextract.py   # Re-process existing jobs
│   │   │       └── deduplicator.py  # Skill normalization
│   │   │
│   │   ├── ui/
│   │   │   └── components/          # Streamlit UI components
│   │   │       ├── kpi_dashboard.py
│   │   │       ├── link_scraper_form.py
│   │   │       ├── detail_scraper_form.py
│   │   │       └── analytics/
│   │   │           ├── skills_charts.py
│   │   │           └── overview_metrics.py
│   │   │
│   │   ├── utils/
│   │   │   └── cleanup_expired_urls.py
│   │   │
│   │   └── validation/
│   │       ├── validation_pipeline.py
│   │       └── single_job_validator.py
│   │
│   ├── scripts/
│   │   ├── extraction/
│   │   │   └── reextract_skills.py
│   │   │
│   │   └── validation/              # Validation suite
│   │       ├── layer1_syntax_check.sh
│   │       ├── layer2_coverage.sh
│   │       ├── layer3_fp_detection.sh
│   │       ├── layer4_fn_detection.sh
│   │       ├── cross_verify_skills.py
│   │       └── run_all_validations.sh
│   │
│   ├── tests/
│   │   ├── test_skill_validation_comprehensive.py
│   │   └── test_linkedin_selectors.py
│   │
│   └── docs/                        # Documentation
│       └── archive/                 # Historical docs
│
└── Analysis/                        # Downloaded CSVs and notebooks (gitignored)
    ├── Data Analysis/
    │   ├── data_visualizer.ipynb    # Analysis notebook (update CSV path for charts)
    │   └── csv/                     # Add exported CSVs here
    │
    ├── Data Engineering/
    │   ├── data_visualizer.ipynb
    │   └── csv/
    │
    └── GenAI & DataScience/
        ├── data_visualizer.ipynb
        └── csv/
- Python 3.11 or higher
- Git
git clone https://github.com/Gaurav-Wankhede/Job-Scrapper.git
cd Job-Scrapper
# Create virtual environment
python -m venv venv-win
# Activate
.\venv-win\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt
git clone https://github.com/Gaurav-Wankhede/Job-Scrapper.git
cd Job-Scrapper
# Create virtual environment
python3 -m venv venv-linux
# Activate
source venv-linux/bin/activate
# Install dependencies
python -m pip install -r requirements.txt
Note for dual-boot users: keep separate virtual environments (venv-win/ and venv-linux/); Python virtual environments are not portable across operating systems.
# Windows
playwright install chromium
# Linux/WSL (use python -m prefix)
python -m playwright install chromium
cd code
# Windows
streamlit run streamlit_app.py
# Linux/WSL (use python -m prefix)
python -m streamlit run streamlit_app.py
The dashboard opens at http://localhost:8501.
Phase 1: URL Collection              Phase 2: Detail Scraping
┌─────────────────────┐              ┌─────────────────────┐
│ Search Results      │              │ Individual Jobs     │
│ ├── Fast scroll     │     ───▶     │ ├── Full desc       │
│ ├── Extract URLs    │              │ ├── Skills parse    │
│ └── Store to DB     │              │ └── Store details   │
└─────────────────────┘              └─────────────────────┘
      job_urls table                       jobs table
Benefits:
- Resilience: If detail scraping fails, URLs are preserved
- Efficiency: Batch process up to 10 jobs concurrently in Phase 2
- Resumable: Pick up exactly where you left off
- Deduplication: Skip already-scraped URLs automatically (see the sketch below)
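As a minimal sketch of how this flow maps onto the job_urls and jobs tables defined later in this README (the helper names are illustrative; the real orchestrators live under src/scraper/unified/ and add rate limiting, retries, and checkpointing):

```python
import sqlite3

conn = sqlite3.connect("data/jobs.db")  # assumes the schema shown below already exists

def save_urls(rows):
    """Phase 1: store collected URLs. INSERT OR IGNORE deduplicates for free
    because job_urls.url is declared UNIQUE."""
    conn.executemany(
        "INSERT OR IGNORE INTO job_urls "
        "(job_id, platform, input_role, actual_role, url) VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()

def pending_urls(limit=100):
    """Phase 2 input: only URLs not yet detail-scraped, which is what makes the
    pipeline resumable after an interruption."""
    return conn.execute(
        "SELECT job_id, url FROM job_urls WHERE scraped = 0 LIMIT ?", (limit,)
    ).fetchall()

def mark_scraped(job_id):
    """Flip the flag once the full row has been written to the jobs table."""
    conn.execute("UPDATE job_urls SET scraped = 1 WHERE job_id = ?", (job_id,))
    conn.commit()
```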
| Approach | Speed | Accuracy | Maintenance |
|---|---|---|---|
| Regex (chosen) | 0.3s/job | 85-90% | Pattern file updates |
| spaCy NER | 3-5s/job | 75-80% | Model retraining |
| GPT-based | 2-10s/job | 90%+ | API costs |
Our 3-layer approach achieves 85-90% accuracy at roughly 10x the speed of NLP-based extraction (a simplified sketch follows the list below):
- Layer 1: Multi-word phrase extraction (priority matching)
- Layer 2: Context-aware extraction (technical context detection)
- Layer 3: Direct pattern matching (977 skill patterns from JSON)
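A simplified sketch of the Layer 3 idea, using the skills JSON format shown later in this README (skills_reference_2025.json); the real implementation in src/analysis/skill_extraction/ adds the phrase and context layers on top:

```python
import json
import re

# Sketch of Layer 3 only: direct pattern matching against the skills reference.
with open("src/config/skills_reference_2025.json") as f:
    reference = json.load(f)

# Compile every pattern once up front so per-job extraction stays fast.
compiled = [
    (skill["name"], [re.compile(p, re.IGNORECASE) for p in skill["patterns"]])
    for skill in reference["skills"]
]

def extract_skills(job_description: str) -> list[str]:
    """Return the de-duplicated, sorted list of skills whose patterns match."""
    return sorted({
        name
        for name, patterns in compiled
        if any(p.search(job_description) for p in patterns)
    })

print(extract_skills("We need strong Python and SQL skills, plus Airflow."))
```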
- KPI Dashboard - View overall statistics
- Link Scraper - Phase 1: Collect job URLs
- Detail Scraper - Phase 2: Extract job details & skills
- Analytics - Analyze skill trends and export data (a minimal layout sketch follows)
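The snippet below is only a minimal sketch of that four-tab structure; the actual dashboard in streamlit_app.py wires these tabs to the components under src/ui/components/.

```python
import streamlit as st

# Minimal sketch of the four-tab layout with placeholder contents.
kpi_tab, links_tab, details_tab, analytics_tab = st.tabs(
    ["KPI Dashboard", "Link Scraper", "Detail Scraper", "Analytics"]
)

with kpi_tab:
    st.metric("Jobs scraped", 0)              # placeholder value
with links_tab:
    st.button("Start Phase 1: collect URLs")
with details_tab:
    st.button("Start Phase 2: scrape details")
with analytics_tab:
    st.write("Skill trends and CSV export go here")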
cd code
# Run validation suite
bash scripts/validation/run_all_validations.sh
# Re-extract skills for existing jobs
python -m src.analysis.skill_extraction.batch_reextract --batch-size 100
For authenticated scraping with higher limits:
cd code
python save_linkedin_cookies.py
This saves cookies to linkedin_cookies.json for reuse in subsequent sessions.
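As a rough sketch of how a saved session could be reused in a Playwright script (this assumes linkedin_cookies.json stores cookies in the format returned by Playwright's context.cookies(); the project's scrapers handle this internally):

```python
import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()

    # Load the saved cookies into the browser context before navigating.
    with open("linkedin_cookies.json") as f:
        context.add_cookies(json.load(f))

    page = context.new_page()
    page.goto("https://www.linkedin.com/jobs/")
    print(page.title())  # reflects the logged-in session if the cookies are still valid
    browser.close()
```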
{
"total_skills": 977,
"skills": [
{
"name": "Python",
"patterns": ["\\bPython\\b", "\\bpython\\b", "\\bPython3\\b"]
}
]
}
Create a .env file in the code/ directory:
# Database path (default: data/jobs.db)
DB_PATH=data/jobs.db
# Playwright browser path (for WSL)
PLAYWRIGHT_BROWSERS_PATH=.playwright-browsers
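A small sketch of how these settings might be read in code, assuming the .env values have already been loaded into the environment (for example via python-dotenv) or exported in the shell:

```python
import os

# Defaults mirror the documented values above.
DB_PATH = os.getenv("DB_PATH", "data/jobs.db")

# Playwright reads this variable from the environment when launching browsers,
# which is why the WSL setup points it at a local directory.
PLAYWRIGHT_BROWSERS_PATH = os.getenv("PLAYWRIGHT_BROWSERS_PATH", ".playwright-browsers")
```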
-- Phase 1: URL Collection
CREATE TABLE job_urls (
job_id TEXT PRIMARY KEY,
platform TEXT NOT NULL,
input_role TEXT NOT NULL,
actual_role TEXT NOT NULL,
url TEXT NOT NULL UNIQUE,
scraped INTEGER DEFAULT 0
);
-- Phase 2: Full Details
CREATE TABLE jobs (
job_id TEXT PRIMARY KEY,
platform TEXT NOT NULL,
actual_role TEXT NOT NULL,
url TEXT NOT NULL UNIQUE,
job_description TEXT,
skills TEXT,
company_name TEXT,
posted_date TEXT,
scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
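The Analysis/ notebooks work from exported CSVs; below is a hedged sketch of producing one straight from the jobs table (pandas is used here for brevity, and the role filter and output filename are illustrative):

```python
import sqlite3
import pandas as pd

# Table and column names come from the schema above.
conn = sqlite3.connect("data/jobs.db")
df = pd.read_sql_query(
    "SELECT job_id, actual_role, company_name, skills, posted_date, url "
    "FROM jobs WHERE actual_role LIKE ?",
    conn,
    params=("%Data Engineer%",),
)
df.to_csv("../Analysis/Data Engineering/csv/data_engineer_jobs.csv", index=False)
```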
| Metric | Value |
|---|---|
| URL Collection | 200-300 URLs/min |
| Detail Scraping | 15-20 jobs/min (10 workers) |
| Skill Extraction | 0.3s/job |
| Storage per Job | ~2KB |
cd code
chmod +x setup_playwright.sh
./setup_playwright.sh
On Linux/WSL, use python3 or the python -m prefix:
python3 -m streamlit run streamlit_app.py
python3 -m pip install package_name
If the scraper starts getting rate-limited, the adaptive rate limiter recovers automatically (a simplified sketch follows this list):
- Concurrency reduces from 10 → 2
- Circuit breaker triggers 60s pause
- Gradually recovers when stable
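A simplified sketch of that behaviour (thresholds are illustrative; the real logic lives in src/scraper/unified/scalable/adaptive_rate_limiter.py):

```python
import time

class AdaptiveRateLimiterSketch:
    """Shrink concurrency on errors, pause when a burst of failures trips the
    circuit breaker, and grow back slowly once requests succeed again."""

    def __init__(self, min_workers=2, max_workers=10, pause_seconds=60):
        self.min_workers = min_workers
        self.max_workers = max_workers
        self.pause_seconds = pause_seconds
        self.workers = max_workers
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1
        # Back off: halve concurrency, but never drop below the floor.
        self.workers = max(self.min_workers, self.workers // 2)
        if self.consecutive_failures >= 5:   # circuit breaker threshold (illustrative)
            time.sleep(self.pause_seconds)   # 60s pause before trying again
            self.consecutive_failures = 0

    def record_success(self):
        self.consecutive_failures = 0
        # Recover gradually: one extra worker at a time, capped at the ceiling.
        self.workers = min(self.max_workers, self.workers + 1)
```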
pkill -f streamlit
python -m streamlit run streamlit_app.py
pip install -r requirements-dev.txt
cd code
python -m pytest tests/ -v
cd code
python -m basedpyright src/
MIT License - See LICENSE file for details.