ts-web-scraper

A powerful, type-safe web scraping library for TypeScript and Bun with zero external dependencies. Built entirely on Bun's native APIs for maximum performance and minimal footprint.

Features

  • πŸš€ Zero Dependencies - Built entirely on Bun native APIs
  • πŸ’ͺ Fully Typed - Complete TypeScript support with type inference
  • ⚑️ High Performance - Optimized for speed with native Bun performance
  • πŸ”„ Rate Limiting - Built-in token bucket rate limiter with burst support
  • πŸ’Ύ Smart Caching - LRU cache with TTL support
  • πŸ” Automatic Retries - Exponential backoff retry logic
  • πŸ“Š Data Extraction - Powerful pipeline-based data extraction and transformation
  • 🎯 Validation - Built-in schema validation for extracted data
  • πŸ“ˆ Monitoring - Performance metrics and analytics
  • πŸ” Change Detection - Track content changes over time with diff algorithms
  • πŸ€– Ethical Scraping - Robots.txt support and user-agent management
  • πŸͺ Session Management - Cookie jar and session persistence
  • πŸ“ Multiple Export Formats - JSON, CSV, XML, YAML, Markdown, HTML
  • 🌐 Pagination - Automatic pagination detection and traversal
  • 🎨 Client-Side Rendering - Support for JavaScript-heavy sites
  • πŸ“š Comprehensive Docs - Full documentation with examples

Installation

bun add ts-web-scraper

Quick Start

import { createScraper } from 'ts-web-scraper'

// Create a scraper instance
const scraper = createScraper({
  rateLimit: { requestsPerSecond: 2 },
  cache: { enabled: true, ttl: 60000 },
  retry: { maxRetries: 3 },
})

// Scrape a website
const result = await scraper.scrape('https://example.com', {
  extract: doc => ({
    title: doc.querySelector('title')?.textContent,
    headings: Array.from(doc.querySelectorAll('h1')).map(h => h.textContent),
  }),
})

console.log(result.data)
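
The scrape call resolves to a result object; the Content Validation example further down also reads success and error from it, so (assuming the same shape applies without a validate schema) a guarded read looks like this:

if (result.success) {
  console.log(result.data.title, result.data.headings)
}
else {
  console.error('Scrape failed:', result.error)
}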

Core Concepts

Scraper

The main scraper class provides a unified API for all scraping operations:

import { createScraper } from 'ts-web-scraper'

const scraper = createScraper({
  // Rate limiting
  rateLimit: {
    requestsPerSecond: 2,
    burstSize: 5
  },

  // Caching
  cache: {
    enabled: true,
    ttl: 60000,
    maxSize: 100
  },

  // Retry logic
  retry: {
    maxRetries: 3,
    initialDelay: 1000
  },

  // Performance monitoring
  monitor: true,

  // Change tracking
  trackChanges: true,

  // Cookies & sessions
  cookies: { enabled: true },
})
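
The requestsPerSecond and burstSize options describe a token bucket: each request spends a token, tokens refill at the configured rate, and the bucket size caps how many requests can fire back-to-back. A minimal standalone sketch of that idea (not the library's internal code) looks like this:

// Illustrative token bucket: refills at ratePerSecond, holds at most burstSize tokens
class TokenBucket {
  private tokens: number
  private lastRefill = Date.now()

  constructor(private ratePerSecond: number, private burstSize: number) {
    this.tokens = burstSize
  }

  // Returns true if a request may proceed right now
  tryAcquire(): boolean {
    const now = Date.now()
    const elapsed = (now - this.lastRefill) / 1000
    this.tokens = Math.min(this.burstSize, this.tokens + elapsed * this.ratePerSecond)
    this.lastRefill = now
    if (this.tokens >= 1) {
      this.tokens -= 1
      return true
    }
    return false
  }
}

const limiter = new TokenBucket(2, 5) // mirrors requestsPerSecond: 2, burstSize: 5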

Data Extraction

Extract and transform data using pipelines:

import { extractors, pipeline } from 'ts-web-scraper'

const extractProducts = pipeline()
  .step(extractors.structured('.product', {
    name: '.product-name',
    price: '.product-price',
    rating: '.rating',
  }))
  .map('parse-price', p => ({
    ...p,
    price: Number.parseFloat(p.price.replace(/[^0-9.]/g, '')),
  }))
  .filter('in-stock', p => p.price > 0)
  .sort('by-price', (a, b) => a.price - b.price)

const result = await extractProducts.execute(document)
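
Assuming execute resolves to the extracted array of products, the result can be handed directly to the export helpers described below:

import { exportData } from 'ts-web-scraper'

// Assumes `result` is the array of extracted products
const csv = exportData(result, { format: 'csv' })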

Change Detection

Track content changes over time:

const scraper = createScraper({ trackChanges: true })

// First scrape
const result1 = await scraper.scrape('https://example.com', {
  extract: doc => ({ price: doc.querySelector('.price')?.textContent }),
})
// result1.changed === undefined (no previous snapshot)

// Second scrape
const result2 = await scraper.scrape('https://example.com', {
  extract: doc => ({ price: doc.querySelector('.price')?.textContent }),
})
// result2.changed === false (if price hasn't changed)
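
With trackChanges enabled, the changed flag supports a simple polling watcher. A sketch (the interval and selector are illustrative, not part of the library):

// Re-scrape every 5 minutes and log whenever the extracted price differs from the last snapshot
setInterval(async () => {
  const result = await scraper.scrape('https://example.com', {
    extract: doc => ({ price: doc.querySelector('.price')?.textContent }),
  })
  if (result.changed) {
    console.log('Price changed:', result.data.price)
  }
}, 5 * 60 * 1000)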

Export Data

Export scraped data to multiple formats:

import { exportData, saveExport } from 'ts-web-scraper'

// Export to JSON
const json = exportData(data, { format: 'json', pretty: true })

// Export to CSV
const csv = exportData(data, { format: 'csv' })

// Save to file (format auto-detected from extension)
await saveExport(data, 'output.csv')
await saveExport(data, 'output.json')
await saveExport(data, 'output.xml')
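
In a typical flow the exporter receives the data field of a scrape result:

const result = await scraper.scrape('https://example.com', {
  extract: doc => ({ title: doc.querySelector('title')?.textContent }),
})

// Persist whatever the extract step produced (format inferred from the .json extension)
await saveExport(result.data, 'page.json')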

Advanced Features

Pagination

Automatically traverse paginated content:

for await (const page of scraper.scrapeAll('https://example.com/posts', {
  extract: doc => ({
    posts: extractors.structured('article', {
      title: 'h2',
      content: '.content',
    }).execute(doc),
  }),
}, { maxPages: 10 })) {
  console.log(`Page ${page.pageNumber}:`, page.data)
}
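
Because scrapeAll is an async iterator, pages can also be collected into a single dataset and exported in one pass, for example with saveExport from the section above (assuming the structured extractor yields an array per page):

const allPosts: any[] = []
for await (const page of scraper.scrapeAll('https://example.com/posts', {
  extract: doc => ({
    posts: extractors.structured('article', { title: 'h2' }).execute(doc),
  }),
}, { maxPages: 10 })) {
  allPosts.push(...page.data.posts)
}
await saveExport(allPosts, 'posts.json')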

Performance Monitoring

Track and analyze scraping performance:

const scraper = createScraper({ monitor: true })

await scraper.scrape('https://example.com')
await scraper.scrape('https://example.com/page2')

const stats = scraper.getStats()
console.log(stats.totalRequests) // 2
console.log(stats.averageDuration) // Average time per request
console.log(stats.cacheHitRate) // Cache effectiveness

const report = scraper.getReport()
console.log(report) // Formatted performance report
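
Assuming getReport returns the formatted report as a string, it can be persisted with Bun's native file API, for example as a CI artifact:

// Write the formatted report to disk with Bun's built-in writer
await Bun.write('scrape-report.txt', report)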

Content Validation

Validate extracted data against schemas:

const result = await scraper.scrape('https://example.com', {
  extract: doc => ({
    title: doc.querySelector('title')?.textContent,
    price: Number.parseFloat(doc.querySelector('.price')?.textContent || '0'),
  }),
  validate: {
    title: { type: 'string', required: true },
    price: { type: 'number', min: 0, required: true },
  },
})

if (result.success) {
  // Data is valid and typed
  console.log(result.data.title, result.data.price)
}
else {
  console.error(result.error)
}

Documentation

For full documentation, visit https://ts-web-scraper.netlify.app

Testing

bun test

All 482 tests pass, with comprehensive coverage of:

  • Core scraping functionality
  • Rate limiting and caching
  • Data extraction pipelines
  • Change detection and monitoring
  • Export formats
  • Error handling and edge cases

Changelog

Please see our releases page for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Community

For help, discussion about best practices, or any other conversation that would benefit from being searchable:

Discussions on GitHub

For casual chit-chat with others using this package:

Join the Stacks Discord Server

Postcardware

"Software that is free, but hopes for a postcard." We love receiving postcards from around the world showing where Stacks is being used! We showcase them on our website too.

Our address: Stacks.js, 12665 Village Ln #2306, Playa Vista, CA 90094, United States 🌎

Sponsors

We would like to extend our thanks to the following sponsors for funding Stacks development. If you are interested in becoming a sponsor, please reach out to us.

License

The MIT License (MIT). Please see LICENSE for more information.

Made with πŸ’™
