
Job Scraper - Node.js Implementation

This folder contains the Node.js implementation of the job scraper.

Requirements

  • Node.js 14+
  • NPM or Yarn package manager

Installation

npm install
# or with yarn
yarn install

Quick Start Guide

Follow these steps to quickly start scraping frontend job listings:

  1. Prepare your input Excel file (e.g., RemoteList_small.xlsx) with company information in the following format:

    • Each row should represent a company
    • The file must contain a Company Name column and a URL column (see Input File Format below)
  2. Run the scraper with your input file:

    node job-scraper.js RemoteList_small.xlsx
  3. The results will be saved to frontend_jobs_results.xlsx in the same directory.

Example:

# Clone the repository
git clone <repository-url>
cd job-scraper/nodejs

# Install dependencies
npm install

# If you don't have an input file, create a simple one:
# | Company Name | URL                  |
# |--------------|----------------------|
# | Example Inc  | https://example.com  |
# | Acme Corp    | https://acmecorp.com |
# | Tech Labs    | https://techlabs.io  |

# Run the scraper with your file
node job-scraper.js RemoteList_small.xlsx

# Check the results
open frontend_jobs_results.xlsx

Input File Format

The input Excel file (e.g., RemoteList_small.xlsx) should be structured as follows:

| Company Name | URL                  |
|--------------|----------------------|
| Company 1    | https://company1.com |
| Company 2    | https://company2.com |
| Company 3    | https://company3.com |
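
If you prefer to generate the input file from a script instead of a spreadsheet editor, a minimal sketch using the xlsx package (already a dependency of this project) could look like this; the script name and sheet name are arbitrary:

// create-input.js - build a small input workbook for the scraper
const XLSX = require('xlsx');

// Rows follow the two-column layout shown above
const rows = [
  ['Company Name', 'URL'],
  ['Company 1', 'https://company1.com'],
  ['Company 2', 'https://company2.com'],
  ['Company 3', 'https://company3.com'],
];

const worksheet = XLSX.utils.aoa_to_sheet(rows);
const workbook = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(workbook, worksheet, 'Companies');

// Writes RemoteList_small.xlsx next to this script
XLSX.writeFile(workbook, 'RemoteList_small.xlsx');

Run it with node create-input.js and pass the resulting file to the scraper as shown above.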

Output File Format

The output file (frontend_jobs_results.xlsx) will contain:

| company   | title                  | url                                | description | location |
|-----------|------------------------|------------------------------------|-------------|----------|
| Company 1 | Frontend Developer     | https://company1.com/careers/job1  |             |          |
| Company 2 | Senior React Developer | https://company2.com/jobs/123      |             |          |
| Company 3 | JavaScript Engineer    | https://company3.com/vacancies/45  |             |          |

The scraper will:

  • Process each company in the list
  • Find all matching frontend jobs for each company
  • Save all found jobs to the output Excel file
  • Cache results to speed up future runs
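
To inspect the results without opening Excel, a short sketch that reads frontend_jobs_results.xlsx with the same xlsx package (the column keys match the output table above):

// read-results.js - print the scraped jobs from the output workbook
const XLSX = require('xlsx');

const workbook = XLSX.readFile('frontend_jobs_results.xlsx');
// Assumes the results are on the first sheet of the workbook
const sheetName = workbook.SheetNames[0];
const jobs = XLSX.utils.sheet_to_json(workbook.Sheets[sheetName]);

for (const job of jobs) {
  console.log(`${job.company}: ${job.title} -> ${job.url}`);
}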

Performance Optimization

The scraper uses several techniques to optimize performance:

  1. Caching: Results for each company are cached for 24 hours (a sketch of the freshness check appears after this list). To reset the cache:

    # Remove all cache files
    rm -rf ./cache/*.json
  2. Worker Threads: By default, the scraper uses multiple worker threads based on your CPU cores. You can adjust this in the code:

    // In job-scraper.js, modify the constructor:
    constructor(excelPath) {
      // ...
      this.maxWorkers = 4; // Set to desired number
      // ...
    }
  3. Timeout Settings: If websites are taking too long to load, you can adjust the timeout setting:

    // In job-scraper.js, modify the runWorker method:
    runWorker(companyInfo) {
      // ...
      const workerData = { 
        company: companyInfo,
        cacheDir: this.cacheDir,
        timeout: 30000 // Increase timeout (in ms)
      };
      // ...
    }
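
The exact cache layout is internal to job-scraper.js, but the 24-hour rule from point 1 can be pictured with a sketch like the following; the one-JSON-file-per-company naming scheme here is only an assumption:

// Hypothetical helper: return cached results only if they are younger than 24 hours
const fs = require('fs');
const path = require('path');

const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

function readCache(cacheDir, companyName) {
  // Assumed naming scheme: one JSON file per company
  const cacheFile = path.join(cacheDir, `${companyName}.json`);
  if (!fs.existsSync(cacheFile)) return null;

  const ageMs = Date.now() - fs.statSync(cacheFile).mtimeMs;
  if (ageMs > CACHE_TTL_MS) return null; // stale: scrape this company again

  return JSON.parse(fs.readFileSync(cacheFile, 'utf8'));
}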

New Features

This version of the scraper includes pagination support:

Pagination Support: The scraper now automatically navigates through all pages of job listings on company career pages. It:

  • Detects "Next" buttons and page numbers
  • Follows pagination to ensure all job listings are checked
  • Supports multiple pagination styles and formats

For example, with RemoteList_small.xlsx containing 10 companies, the scraper will:

  • Process each company
  • Navigate through pagination if available
  • Find all frontend jobs on all pages
  • Complete the full scraping process for comprehensive results

Required Packages

This application relies on the following packages:

  • puppeteer
  • cheerio
  • xlsx
  • axios

Usage

node job-scraper.js RemoteList_small.xlsx

Features

  • Searches for frontend developer jobs on company websites
  • Caches results to avoid unnecessary requests
  • Deduplicates job listings (see the sketch after this list)
  • Saves results to Excel file incrementally
  • Shows progress during execution
  • Uses worker threads for parallel processing
  • Supports pagination on job listing pages
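
The deduplication mentioned above is an internal detail of the scraper; a typical way to implement it, shown purely as an illustration, is to key the listings by their job URL:

// Illustrative only: drop duplicate listings that share the same job URL
function dedupeJobs(jobs) {
  const seen = new Set();
  return jobs.filter((job) => {
    if (seen.has(job.url)) return false;
    seen.add(job.url);
    return true;
  });
}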

Parallelization with Worker Threads

This implementation uses Node.js Worker Threads for parallel processing of companies, which significantly improves performance:

  • Each company is processed in its own worker thread
  • The number of concurrent workers is determined based on available CPU cores
  • Each worker thread launches its own isolated browser instance
  • Results are collected and saved incrementally
  • The main thread orchestrates workers and handles result aggregation

The worker implementation is in worker.js, while the main script in job-scraper.js manages the worker pool.
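
The project's actual pool logic lives in job-scraper.js and worker.js; as a rough sketch of the pattern described above (not the repository's exact code), running one worker per company with Node's built-in worker_threads module looks roughly like this:

// Sketch of the main-thread side: run one worker per company, a few at a time
const { Worker } = require('worker_threads');
const os = require('os');

function runWorker(company) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./worker.js', { workerData: { company } });
    worker.on('message', resolve); // the worker posts its scraped jobs back
    worker.on('error', reject);
    worker.on('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}

async function scrapeAll(companies, maxWorkers = os.cpus().length) {
  const results = [];
  // Process companies in batches no larger than the worker limit
  for (let i = 0; i < companies.length; i += maxWorkers) {
    const batch = companies.slice(i, i + maxWorkers);
    results.push(...await Promise.all(batch.map(runWorker)));
  }
  return results;
}

On the worker side, worker.js would pick up workerData via require('worker_threads') and report its findings back with parentPort.postMessage().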

Advanced Scraping Features

Pagination Support

The scraper now automatically:

  • Detects pagination controls on job listing pages
  • Navigates through all available pages to find jobs
  • Supports various pagination formats (numbered pages, "Next" buttons, etc.)
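
The selectors the scraper actually uses are internal to the project, but the general Puppeteer technique is a loop that keeps following a "Next"-style control until none remains; the selector below is only a placeholder:

// Sketch: collect listing HTML from every page of a paginated careers section
const puppeteer = require('puppeteer');

async function collectAllPages(url, nextSelector = 'a[rel="next"], a.next') {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

  const pages = [];
  while (true) {
    pages.push(await page.content());        // raw HTML of the current page
    const next = await page.$(nextSelector); // look for a "Next" control
    if (!next) break;
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle2' }),
      next.click(),
    ]);
  }

  await browser.close();
  return pages;
}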
