Web Crawler

A lightweight and configurable web crawler built with Node.js.

Description

This crawler recursively extracts links from websites using Node.js, Axios, and Cheerio. It respects depth limits and avoids duplicate visits for efficient crawling.

Features

  • Asynchronous requests for non-blocking performance
  • Recursive link extraction with configurable depth
  • Deduplication of visited URLs
  • Targeted crawling capability (e.g., specific domain)
  • Extensible codebase for easy customization
  • Error handling and reporting

Technologies

  • Node.js
  • Axios (HTTP requests)
  • Cheerio (HTML parsing)

Implementation

The crawler fetches and parses HTML using Axios and Cheerio, respectively. It maintains a set of visited URLs and recursively follows links within the configured depth limit and target domain. The process continues until all links are crawled or the maximum depth is reached.

Usage

  1. Clone the repository
  2. Install dependencies: npm install
  3. Configure MAX_DEPTH and targetDomain in crawler.js
  4. Run: node crawler.js
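Step 3 above amounts to editing two constants near the top of crawler.js. The names follow the Usage section; the values here are hypothetical examples:

```javascript
// crawler.js — example configuration (values are placeholders)
const MAX_DEPTH = 3;                  // stop recursing beyond this depth
const targetDomain = 'example.com';   // only follow links on this domain
```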

Contributing

Contributions are welcome! Open issues or submit pull requests.

License

MIT License