simddev/web-scraper
Web Crawler

A concurrent web crawler written in TypeScript and Node.js. Point it at any website and it will recursively crawl every internal page up to a configurable limit, extracting headings, paragraphs, links, and images from each page, then write the results to a JSON report.

Features

  • Concurrent crawling with configurable parallelism (powered by p-limit)
  • Configurable page limit to avoid crawling indefinitely
  • Stays on the same domain — never follows external links
  • Extracts from each page: heading, first paragraph, outgoing links, and image URLs
  • Outputs a sorted report.json file
  • HTML parsing via jsdom
  • Fully unit tested with Vitest
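The same-domain rule can be sketched with the WHATWG URL API built into Node.js. The helper below is hypothetical (the real check lives in src/crawl.ts): it resolves relative links against the current page before comparing hostnames.

```typescript
// Hypothetical same-domain check, sketching the "never follow external
// links" rule. Relative URLs like "/contact" are resolved against the
// page they were found on before the hostnames are compared.
function isSameDomain(baseUrl: string, candidateUrl: string): boolean {
  try {
    const base = new URL(baseUrl);
    const candidate = new URL(candidateUrl, baseUrl);
    return candidate.hostname === base.hostname;
  } catch {
    // Unparseable URLs are treated as external and skipped.
    return false;
  }
}
```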

Requirements

  • Node.js 22.15.0 (via nvm)

Setup

nvm use
npm install

Usage

npm run start <URL> <maxConcurrency> <maxPages>
  • URL: the root URL to start crawling from
  • maxConcurrency: number of pages to fetch in parallel (e.g. 3)
  • maxPages: maximum number of unique pages to crawl (e.g. 50)

Example

npm run start https://learnwebscraping.dev/practice/ecommerce/ 3 50

This crawls up to 50 pages starting from the given URL, fetching 3 pages at a time, and writes the results to report.json.
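Under the hood, maxConcurrency caps how many fetches are in flight at once. The project uses the p-limit package for this; the minimal limiter below is only an illustrative sketch of the same idea, not the project's code.

```typescript
// Illustrative concurrency limiter: at most maxConcurrency tasks run at a
// time, and further tasks wait in a FIFO queue until a slot frees up.
// (The crawler itself delegates this to p-limit.)
function createLimiter(maxConcurrency: number) {
  let active = 0;
  const queue: Array<() => void> = [];

  const next = () => {
    active--;
    const run = queue.shift();
    if (run) run();
  };

  return function limit<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        task().then(resolve, reject).finally(next);
      };
      if (active < maxConcurrency) run();
      else queue.push(run);
    });
  };
}
```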

Output

While running, each crawled URL is printed to the console. When finished, a report.json file is written to the current directory containing a sorted array of page records:

[
  {
    "url": "https://example.com/about",
    "heading": "About Us",
    "first_paragraph": "We build things.",
    "outgoing_links": ["https://example.com/", "https://example.com/contact"],
    "image_urls": ["https://example.com/logo.png"]
  }
]
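The record shape can be described with a TypeScript interface inferred from the sample above. PageRecord and sortPages are hypothetical names (the actual types live in src/crawl.ts), and sorting by url is an assumption based on the sample output.

```typescript
// Shape of one entry in report.json, inferred from the sample output.
interface PageRecord {
  url: string;
  heading: string;
  first_paragraph: string;
  outgoing_links: string[];
  image_urls: string[];
}

// Returns a new array sorted by url (assumed sort key), leaving the
// input untouched.
function sortPages(pages: PageRecord[]): PageRecord[] {
  return [...pages].sort((a, b) => a.url.localeCompare(b.url));
}
```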

Testing

npm test

Project Structure

src/
├── index.ts      — CLI entry point, argument parsing
├── crawl.ts      — Core crawler logic and HTML extraction functions
└── report.ts     — JSON report writer
