# TypeScript Web Crawler

A simple, educational web crawler built with TypeScript. This project demonstrates how to build a basic web scraper that can crawl websites, extract links, and respect crawling limits.

## Features

- Recursive link crawling with configurable depth
- Domain restrictions to keep crawls focused
- Configurable rate limiting to be respectful to servers
- Link extraction and page metadata collection
- Built-in statistics tracking
- Clean TypeScript implementation with type safety

## Installation

First, install the dependencies:

```bash
npm install
```

## Usage

### Running the Examples

The project includes example usage in `src/example.ts`. You can run it in two ways:

**Development mode (with ts-node):**

```bash
npm run dev
```

**Production mode (compiled):**

```bash
npm run build
npm start
```

### Using the Crawler in Your Code

```typescript
import { WebCrawler } from './crawler';

const crawler = new WebCrawler({
  maxDepth: 2,        // How deep to crawl from the starting URL
  maxPages: 50,       // Maximum number of pages to crawl
  delay: 1000,        // Delay between requests in milliseconds
  allowedDomains: ['example.com'], // Restrict to specific domains
});

const results = await crawler.crawl('https://example.com');

// Access results
results.forEach(result => {
  console.log(result.title);
  console.log(result.url);
  console.log(result.links);
});

// Get statistics
const stats = crawler.getStats();
console.log(stats);
```
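
Each item in `results` has at least the fields used above. A minimal sketch of that shape, assuming a type name (`CrawlResult`) that may not match what `crawler.ts` actually exports:

```typescript
// Result shape implied by the usage above; the name CrawlResult is
// an assumption, and crawler.ts may expose additional fields.
interface CrawlResult {
  url: string;      // the page that was fetched
  title: string;    // the page's <title> text
  links: string[];  // absolute URLs discovered on the page
}
```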

## Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `maxDepth` | `number` | `2` | Maximum depth to crawl from the starting URL |
| `maxPages` | `number` | `50` | Maximum total pages to crawl |
| `delay` | `number` | `1000` | Delay in milliseconds between requests |
| `allowedDomains` | `string[]` | `[]` | Whitelist of allowed domains (empty = all domains) |
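
Taken together, the options form a shape like the following sketch; the interface name `CrawlerOptions` is assumed for illustration and may differ from the project's actual export:

```typescript
// Options shape implied by the table above; the name CrawlerOptions
// is illustrative, not necessarily the project's exported type.
interface CrawlerOptions {
  maxDepth?: number;         // default: 2
  maxPages?: number;         // default: 50
  delay?: number;            // default: 1000 (milliseconds)
  allowedDomains?: string[]; // default: [] (no restriction)
}
```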

## Project Structure

```
.
├── src/
│   ├── crawler.ts    # Main crawler implementation
│   └── example.ts    # Example usage
├── dist/             # Compiled JavaScript output
├── package.json
├── tsconfig.json
└── README.md
```

## Scripts

- `npm run build` - Compile TypeScript to JavaScript
- `npm start` - Run the compiled code
- `npm run dev` - Run directly with ts-node
- `npm run clean` - Remove compiled files

## How It Works

1. **URL Queue**: Starting from an initial URL, the crawler fetches the page content
2. **HTML Parsing**: Uses Cheerio to parse the HTML and extract links
3. **Link Discovery**: Finds all `<a>` tags and converts relative URLs to absolute ones
4. **Recursive Crawling**: Follows discovered links up to the configured depth
5. **Rate Limiting**: Waits between requests to avoid overwhelming servers
6. **Domain Filtering**: Optionally restricts crawling to specific domains
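
Putting the steps together, here is a condensed sketch of the loop described above. It is not the project's actual implementation in `src/crawler.ts`; it relies only on the `axios` and `cheerio` dependencies listed below, and the function name `crawlFrom` is illustrative:

```typescript
// A condensed sketch of the six steps above, NOT the project's
// actual implementation; crawlFrom is an illustrative name.
import axios from 'axios';
import * as cheerio from 'cheerio';

async function crawlFrom(
  startUrl: string,
  maxDepth = 2,
  maxPages = 50,
  delayMs = 1000,
  allowedDomains: string[] = [],
): Promise<Set<string>> {
  const visited = new Set<string>();

  async function visit(url: string, depth: number): Promise<void> {
    if (depth > maxDepth || visited.size >= maxPages || visited.has(url)) return;

    // Step 6: domain filtering (empty whitelist = all domains allowed)
    const host = new URL(url).hostname;
    if (allowedDomains.length > 0 && !allowedDomains.includes(host)) return;
    visited.add(url);

    // Step 1: fetch the page content
    const { data: html } = await axios.get<string>(url);

    // Step 2: parse the HTML with cheerio
    const $ = cheerio.load(html);

    // Step 3: find <a> tags and convert relative URLs to absolute
    const links: string[] = [];
    $('a[href]').each((_, el) => {
      const href = $(el).attr('href');
      if (!href) return;
      try {
        links.push(new URL(href, url).toString());
      } catch {
        // Skip malformed hrefs
      }
    });

    // Steps 4 and 5: follow links recursively, pausing between requests
    for (const link of links) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      await visit(link, depth + 1);
    }
  }

  await visit(startUrl, 0);
  return visited;
}
```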

## Dependencies

- **axios** - HTTP client for fetching web pages
- **cheerio** - Fast, flexible HTML parsing with a jQuery-like API
- **typescript** - Typed superset of JavaScript
- **ts-node** - Execute TypeScript directly during development

## Notes for Production Use

This is an educational example. For production use, consider:

- Adding robots.txt parsing and compliance
- Implementing better error handling and retry logic (see the sketch after this list)
- Adding request queuing and concurrency control
- Implementing caching to avoid re-crawling pages
- Adding content extraction and storage
- Implementing per-domain politeness policies
- Adding proxy support for large-scale crawling
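
As one example, the retry suggestion above could look like the following minimal sketch; the helper name `fetchWithRetry` and its defaults are illustrative, not part of this project:

```typescript
// A minimal sketch of retry with exponential backoff; fetchWithRetry
// and its parameter defaults are illustrative, not part of this project.
import axios from 'axios';

async function fetchWithRetry(
  url: string,
  retries = 3,
  backoffMs = 500,
): Promise<string> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const { data } = await axios.get<string>(url, { timeout: 10_000 });
      return data;
    } catch (err) {
      if (attempt === retries) throw err;
      // Wait longer after each failure (exponential backoff)
      await new Promise((resolve) => setTimeout(resolve, backoffMs * 2 ** attempt));
    }
  }
  throw new Error('unreachable');
}
```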

## License

MIT
