
Job Scraper - Node.js Implementation

This folder contains the Node.js implementation of the job scraper.

Requirements

  • Node.js 14+
  • NPM or Yarn package manager

Installation

npm install
# or with yarn
yarn install

Quick Start Guide

Follow these steps to quickly start scraping frontend job listings:

  1. Prepare your input Excel file (e.g., RemoteList_small.xlsx) with company information in the following format:

    • Each row should represent a company
    • The file must contain a Company Name column and a URL column (see Input File Format below)
  2. Run the scraper with your input file:

    node job-scraper.js RemoteList_small.xlsx
  3. The results will be saved to frontend_jobs_results.xlsx in the same directory.

Example:

# Clone the repository
git clone <repository-url>
cd job-scraper/nodejs

# Install dependencies
npm install

# If you don't have an input file, create a simple one:
# | Company Name | URL                  |
# |--------------|----------------------|
# | Example Inc  | https://example.com  |
# | Acme Corp    | https://acmecorp.com |
# | Tech Labs    | https://techlabs.io  |

# Run the scraper with your file
node job-scraper.js RemoteList_small.xlsx

# Check the results
open frontend_jobs_results.xlsx

Input File Format

The input Excel file (e.g., RemoteList_small.xlsx) should be structured as follows:

| Company Name | URL                  |
|--------------|----------------------|
| Company 1    | https://company1.com |
| Company 2    | https://company2.com |
| Company 3    | https://company3.com |
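
If you prefer to generate the input file from a script instead of a spreadsheet editor, a minimal sketch using the xlsx package (already a dependency of this project) could look like this; the script name and sheet name are arbitrary:

// create-input.js - build a small input workbook for the scraper
const XLSX = require('xlsx');

// Rows follow the two-column layout shown above
const rows = [
  ['Company Name', 'URL'],
  ['Company 1', 'https://company1.com'],
  ['Company 2', 'https://company2.com'],
  ['Company 3', 'https://company3.com'],
];

const worksheet = XLSX.utils.aoa_to_sheet(rows);
const workbook = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(workbook, worksheet, 'Companies');

// Writes RemoteList_small.xlsx next to this script
XLSX.writeFile(workbook, 'RemoteList_small.xlsx');

Run it with node create-input.js and pass the resulting file to the scraper as shown above.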

Output File Format

The output file (frontend_jobs_results.xlsx) will contain:

| company   | title                  | url                                | description | location |
|-----------|------------------------|------------------------------------|-------------|----------|
| Company 1 | Frontend Developer     | https://company1.com/careers/job1  |             |          |
| Company 2 | Senior React Developer | https://company2.com/jobs/123      |             |          |
| Company 3 | JavaScript Engineer    | https://company3.com/vacancies/45  |             |          |

The scraper will:

  • Process each company in the list
  • Find all matching frontend jobs for each company
  • Save all found jobs to the output Excel file
  • Cache results to speed up future runs
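
To inspect the results without opening Excel, a short sketch that reads frontend_jobs_results.xlsx with the same xlsx package (the column keys match the output table above):

// read-results.js - print the scraped jobs from the output workbook
const XLSX = require('xlsx');

const workbook = XLSX.readFile('frontend_jobs_results.xlsx');
// Assumes the results are on the first sheet of the workbook
const sheetName = workbook.SheetNames[0];
const jobs = XLSX.utils.sheet_to_json(workbook.Sheets[sheetName]);

for (const job of jobs) {
  console.log(`${job.company}: ${job.title} -> ${job.url}`);
}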

Performance Optimization

The scraper uses several techniques to optimize performance:

  1. Caching: Results for each company are cached for 24 hours (a sketch of the freshness check appears after this list). To reset the cache:

    # Remove all cache files
    rm -rf ./cache/*.json
  2. Worker Threads: By default, the scraper uses multiple worker threads based on your CPU cores. You can adjust this in the code:

    // In job-scraper.js, modify the constructor:
    constructor(excelPath) {
      // ...
      this.maxWorkers = 4; // Set to desired number
      // ...
    }
  3. Timeout Settings: If websites are taking too long to load, you can adjust the timeout setting:

    // In job-scraper.js, modify the runWorker method:
    runWorker(companyInfo) {
      // ...
      const workerData = { 
        company: companyInfo,
        cacheDir: this.cacheDir,
        timeout: 30000 // Increase timeout (in ms)
      };
      // ...
    }
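
The exact cache layout is internal to job-scraper.js, but the 24-hour rule from point 1 can be pictured with a sketch like the following; the one-JSON-file-per-company naming scheme here is only an assumption:

// Hypothetical helper: return cached results only if they are younger than 24 hours
const fs = require('fs');
const path = require('path');

const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

function readCache(cacheDir, companyName) {
  // Assumed naming scheme: one JSON file per company
  const cacheFile = path.join(cacheDir, `${companyName}.json`);
  if (!fs.existsSync(cacheFile)) return null;

  const ageMs = Date.now() - fs.statSync(cacheFile).mtimeMs;
  if (ageMs > CACHE_TTL_MS) return null; // stale: scrape this company again

  return JSON.parse(fs.readFileSync(cacheFile, 'utf8'));
}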

New Features

This version of the scraper includes pagination support:

Pagination Support: The scraper now automatically navigates through all pages of job listings on company career pages. It:

  • Detects "Next" buttons and page numbers
  • Follows pagination to ensure all job listings are checked
  • Supports multiple pagination styles and formats

For example, with RemoteList_small.xlsx containing 10 companies, the scraper will:

  • Process each company
  • Navigate through pagination if available
  • Find all frontend jobs on all pages
  • Complete the full scraping process for comprehensive results

Required Packages

This application relies on the following packages:

  • puppeteer
  • cheerio
  • xlsx
  • axios

Usage

node job-scraper.js RemoteList_small.xlsx

Features

  • Searches for frontend developer jobs on company websites
  • Caches results to avoid unnecessary requests
  • Deduplicates job listings (see the sketch after this list)
  • Saves results to Excel file incrementally
  • Shows progress during execution
  • Uses worker threads for parallel processing
  • Supports pagination on job listing pages
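
The deduplication mentioned above is an internal detail of the scraper; a typical way to implement it, shown purely as an illustration, is to key the listings by their job URL:

// Illustrative only: drop duplicate listings that share the same job URL
function dedupeJobs(jobs) {
  const seen = new Set();
  return jobs.filter((job) => {
    if (seen.has(job.url)) return false;
    seen.add(job.url);
    return true;
  });
}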

Parallelization with Worker Threads

This implementation uses Node.js Worker Threads for parallel processing of companies, which significantly improves performance:

  • Each company is processed in its own worker thread
  • The number of concurrent workers is determined based on available CPU cores
  • Each worker thread launches its own isolated browser instance
  • Results are collected and saved incrementally
  • The main thread orchestrates workers and handles result aggregation

The worker implementation is in worker.js, while the main script in job-scraper.js manages the worker pool.
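
The project's actual pool logic lives in job-scraper.js and worker.js; as a rough sketch of the pattern described above (not the repository's exact code), running one worker per company with Node's built-in worker_threads module looks roughly like this:

// Sketch of the main-thread side: run one worker per company, a few at a time
const { Worker } = require('worker_threads');
const os = require('os');

function runWorker(company) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./worker.js', { workerData: { company } });
    worker.on('message', resolve); // the worker posts its scraped jobs back
    worker.on('error', reject);
    worker.on('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}

async function scrapeAll(companies, maxWorkers = os.cpus().length) {
  const results = [];
  // Process companies in batches no larger than the worker limit
  for (let i = 0; i < companies.length; i += maxWorkers) {
    const batch = companies.slice(i, i + maxWorkers);
    results.push(...await Promise.all(batch.map(runWorker)));
  }
  return results;
}

On the worker side, worker.js would pick up workerData via require('worker_threads') and report its findings back with parentPort.postMessage().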

Advanced Scraping Features

Pagination Support

The scraper now automatically:

  • Detects pagination controls on job listing pages
  • Navigates through all available pages to find jobs
  • Supports various pagination formats (numbered pages, "Next" buttons, etc.)
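
The selectors the scraper actually uses are internal to the project, but the general Puppeteer technique is a loop that keeps following a "Next"-style control until none remains; the selector below is only a placeholder:

// Sketch: collect listing HTML from every page of a paginated careers section
const puppeteer = require('puppeteer');

async function collectAllPages(url, nextSelector = 'a[rel="next"], a.next') {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

  const pages = [];
  while (true) {
    pages.push(await page.content());        // raw HTML of the current page
    const next = await page.$(nextSelector); // look for a "Next" control
    if (!next) break;
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle2' }),
      next.click(),
    ]);
  }

  await browser.close();
  return pages;
}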
