This folder contains the Node.js implementation of the job scraper.
Requirements:

- Node.js 14+
- npm or Yarn package manager
Install the dependencies:

    npm install
    # or with yarn
    yarn install
Follow these steps to quickly start scraping frontend job listings:
1. Prepare your input Excel file (e.g., `RemoteList_small.xlsx`) with company information in the following format:
   - Each row should represent a company
   - The file must contain these columns:
     - `Company` - the name of the company
     - `URL` - the company website URL (e.g., https://example.com)

2. Run the scraper with your input file:

       node job-scraper.js RemoteList_small.xlsx

3. The results will be saved to `frontend_jobs_results.xlsx` in the same directory.
A complete end-to-end example:

    # Clone the repository
    git clone <repository-url>
    cd job-scraper/nodejs

    # Install dependencies
    npm install

    # If you don't have an input file, create a simple one:
    # | Company      | URL                     |
    # |--------------|-------------------------|
    # | Example Inc  | https://example.com     |
    # | Acme Corp    | https://acmecorp.com    |
    # | Tech Labs    | https://techlabs.io     |

    # Run the scraper with your file
    node job-scraper.js RemoteList_small.xlsx

    # Check the results
    open frontend_jobs_results.xlsx
The input Excel file (e.g., `RemoteList_small.xlsx`) should be structured as follows:

| Company   | URL                  |
|-----------|----------------------|
| Company 1 | https://company1.com |
| Company 2 | https://company2.com |
| Company 3 | https://company3.com |
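If you would rather generate the input file from code than in a spreadsheet editor, a minimal sketch using the `xlsx` package might look like this; the rows and the sheet name are only illustrative:

```javascript
// create-input.js - build a minimal input workbook for the scraper
const XLSX = require('xlsx');

// Example rows; each object becomes one row, keys become the column headers
const companies = [
  { Company: 'Example Inc', URL: 'https://example.com' },
  { Company: 'Acme Corp', URL: 'https://acmecorp.com' },
];

// Convert the rows to a worksheet and wrap it in a new workbook
const worksheet = XLSX.utils.json_to_sheet(companies);
const workbook = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(workbook, worksheet, 'Companies'); // sheet name is illustrative

// Write the file the scraper will read
XLSX.writeFile(workbook, 'RemoteList_small.xlsx');
```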
The output file (`frontend_jobs_results.xlsx`) will contain:

| company   | title                  | url                               | description | location |
|-----------|------------------------|-----------------------------------|-------------|----------|
| Company 1 | Frontend Developer     | https://company1.com/careers/job1 |             |          |
| Company 2 | Senior React Developer | https://company2.com/jobs/123     |             |          |
| Company 3 | JavaScript Engineer    | https://company3.com/vacancies/45 |             |          |
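To inspect the results from Node.js rather than opening the spreadsheet, a small sketch with the same `xlsx` package could be:

```javascript
// read-results.js - print the scraped jobs from the output workbook
const XLSX = require('xlsx');

const workbook = XLSX.readFile('frontend_jobs_results.xlsx');
const sheet = workbook.Sheets[workbook.SheetNames[0]];

// Each row becomes an object keyed by the column headers (company, title, url, ...)
const jobs = XLSX.utils.sheet_to_json(sheet);

for (const job of jobs) {
  console.log(`${job.company}: ${job.title} -> ${job.url}`);
}
```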
The scraper will:
- Process each company in the list
- Find all matching frontend jobs for each company
- Save all found jobs to the output Excel file incrementally (see the sketch after this list)
- Cache results to speed up future runs
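The incremental saving works roughly along these lines; this is a sketch of the general idea (the append-and-rewrite approach and the sheet name are assumptions), not the scraper's exact code:

```javascript
const fs = require('fs');
const XLSX = require('xlsx');

const OUTPUT_FILE = 'frontend_jobs_results.xlsx';

// Append newly found jobs to the output workbook, creating the file on first use
function appendJobs(newJobs) {
  let existing = [];
  if (fs.existsSync(OUTPUT_FILE)) {
    const existingWorkbook = XLSX.readFile(OUTPUT_FILE);
    existing = XLSX.utils.sheet_to_json(
      existingWorkbook.Sheets[existingWorkbook.SheetNames[0]]
    );
  }

  // Rewrite the workbook with the old rows plus the new ones
  const worksheet = XLSX.utils.json_to_sheet([...existing, ...newJobs]);
  const workbook = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(workbook, worksheet, 'Jobs');
  XLSX.writeFile(workbook, OUTPUT_FILE);
}
```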
The scraper uses several techniques to optimize performance:

- Caching: Results for each company are cached for 24 hours (a sketch of the expiry check follows this list). To reset the cache:

      # Remove all cache files
      rm -rf ./cache/*.json

- Worker Threads: By default, the scraper uses multiple worker threads based on your CPU cores. You can adjust this in the code:

      // In job-scraper.js, modify the constructor:
      constructor(excelPath) {
        // ...
        this.maxWorkers = 4; // Set to desired number
        // ...
      }

- Timeout Settings: If websites are taking too long to load, you can adjust the timeout setting:

      // In job-scraper.js, modify the runWorker method:
      runWorker(companyInfo) {
        // ...
        const workerData = {
          company: companyInfo,
          cacheDir: this.cacheDir,
          timeout: 30000 // Increase timeout (in ms)
        };
        // ...
      }
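For reference, a file-based cache with a 24-hour expiry can be checked roughly as follows; the `cacheKey` naming and the JSON layout are illustrative assumptions, not necessarily the scraper's exact cache format:

```javascript
const fs = require('fs');
const path = require('path');

const CACHE_DIR = './cache';
const MAX_AGE_MS = 24 * 60 * 60 * 1000; // 24 hours

// Return cached results for a company, or null if the entry is missing or stale
function readCache(cacheKey) {
  const file = path.join(CACHE_DIR, `${cacheKey}.json`);
  if (!fs.existsSync(file)) return null;

  const ageMs = Date.now() - fs.statSync(file).mtimeMs;
  if (ageMs > MAX_AGE_MS) return null; // older than 24 hours: treat as expired

  return JSON.parse(fs.readFileSync(file, 'utf8'));
}
```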
This version of the scraper includes pagination support. It automatically navigates through all pages of job listings on company career pages:

- Detects "Next" buttons and page numbers (see the sketch after this section)
- Follows pagination to ensure all job listings are checked
- Supports multiple pagination styles and formats

For example, with `RemoteList_small.xlsx` containing 10 companies, the scraper will:

- Process each company
- Navigate through pagination if available
- Find all frontend jobs on all pages
- Complete the full scraping run, so the results cover every page of every company
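A simplified sketch of the "Next" button handling might look like the following; the selectors and the page cap are generic guesses, and the real logic in `worker.js` will differ:

```javascript
const puppeteer = require('puppeteer');

// Collect the HTML of every page of a careers listing by following "Next" links.
// The selectors below are generic guesses; real career sites vary widely.
async function collectAllPages(startUrl, timeout = 30000) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(startUrl, { waitUntil: 'networkidle2', timeout });

  const pages = [];
  for (let i = 0; i < 20; i++) { // hard cap to avoid looping forever on odd sites
    pages.push(await page.content());

    const next = await page.$('a[rel="next"], a.next, button.next');
    if (!next) break;

    await Promise.all([
      // Some sites paginate in place, so a missing navigation is not an error
      page.waitForNavigation({ waitUntil: 'networkidle2', timeout }).catch(() => {}),
      next.click(),
    ]);
  }

  await browser.close();
  return pages;
}
```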
This application relies on the following packages:

- puppeteer - headless Chrome automation
- cheerio - server-side HTML parsing
- xlsx - reading and writing Excel files
- axios - HTTP client for plain requests
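As a rough illustration of how two of these fit together (not the scraper's actual code), axios can fetch a static page and cheerio can parse it:

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

// Illustrative only: fetch a page with axios and pull out its links with cheerio
async function listLinks(url) {
  const { data: html } = await axios.get(url, { timeout: 10000 });
  const $ = cheerio.load(html);
  return $('a')
    .map((_, el) => ({ text: $(el).text().trim(), href: $(el).attr('href') }))
    .get();
}
```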
Run the scraper:

    node job-scraper.js

The scraper:
- Searches for frontend developer jobs on company websites
- Caches results to avoid unnecessary requests
- Deduplicates job listings (see the sketch after this list)
- Saves results to Excel file incrementally
- Shows progress during execution
- Uses worker threads for parallel processing
- Supports pagination on job listing pages
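Deduplication can be done by keying on the job URL; a minimal sketch, not necessarily the exact rule the scraper applies:

```javascript
// Keep only the first listing seen for each job URL (illustrative dedup rule)
function dedupeJobs(jobs) {
  const seen = new Set();
  return jobs.filter((job) => {
    const key = (job.url || '').toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```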
This implementation uses Node.js Worker Threads for parallel processing of companies, which significantly improves performance:
- Each company is processed in its own worker thread
- The number of concurrent workers is determined based on available CPU cores
- Each worker thread launches its own isolated browser instance
- Results are collected and saved incrementally
- The main thread orchestrates workers and handles result aggregation
The worker implementation is in `worker.js`, while the main script in `job-scraper.js` manages the worker pool.
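The orchestration can be pictured roughly like this; the message shape coming back from `worker.js` is an assumption for illustration:

```javascript
const os = require('os');
const { Worker } = require('worker_threads');

// Run one worker per company, with at most `maxWorkers` running at a time
async function processCompanies(companies, maxWorkers = os.cpus().length) {
  const results = [];
  const queue = [...companies];

  async function runNext() {
    const company = queue.shift();
    if (!company) return;

    await new Promise((resolve, reject) => {
      const worker = new Worker('./worker.js', { workerData: { company } });
      worker.on('message', (jobs) => results.push(...jobs)); // assumed message shape
      worker.on('error', reject);
      worker.on('exit', () => resolve());
    });

    return runNext(); // pick up the next company when this slot frees up
  }

  // Start up to maxWorkers slots in parallel and wait for the queue to drain
  await Promise.all(Array.from({ length: maxWorkers }, () => runNext()));
  return results;
}
```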
The scraper now automatically:
- Detects pagination controls on job listing pages
- Navigates through all available pages to find jobs
- Supports various pagination formats (numbered pages, "Next" buttons, etc.)