
feat: add Cheerio scraper and enhance pattern filtering (v0.1.0-beta.3) #5

Merged
bordoni merged 5 commits into main from feature/cheerio-scraper on Jul 6, 2025

@bordoni bordoni commented Jul 5, 2025

Summary

This PR introduces version 0.1.0-beta.3 with a Cheerio-based content scraper as a fallback option and enhanced pattern filtering capabilities.

Changes

🎯 Cheerio-based Content Scraper

  • Added fallback scraper for unknown URLs when Pure.md is unavailable
  • Automatic fallback when Pure.md API fails or returns errors
  • Extracts title, content, and links from any webpage
  • Converts HTML content to clean markdown format
  • Preserves document structure (headings, lists, quotes, code blocks)

🔍 Enhanced Link Filtering

  • Replaced single followPattern with includePatterns and excludePatterns arrays
  • Support for multiple include patterns (link must match at least one)
  • Support for multiple exclude patterns (link must not match any)
  • Patterns can be combined for sophisticated filtering logic
  • Applied during both link collection and crawling phases

🔧 Crawler Improvements

  • Crawler now attempts Pure.md first, falls back to Cheerio on failure
  • No Pure.md API key required for basic operation
  • Improved error handling with graceful degradation
  • Better support for sites not recognized by Pure.md
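The "try the primary extractor, degrade gracefully on failure" flow can be sketched as follows. This is an illustration under stated assumptions rather than the crawler's actual code: `Extractor`, `PageResult`, and `extractWithFallback` are hypothetical names, and the two stub extractors merely stand in for the Pure.md and Cheerio paths without calling either service or library.

```typescript
// Hypothetical types for illustration; not taken from the project's source.
type Extractor = (url: string) => Promise<string>;

interface PageResult {
  url: string;
  markdown: string;
  source: "primary" | "fallback";
}

// Try the primary extractor first; on any error, use the fallback instead.
async function extractWithFallback(
  url: string,
  primary: Extractor,
  fallback: Extractor,
): Promise<PageResult> {
  try {
    return { url, markdown: await primary(url), source: "primary" };
  } catch {
    // Graceful degradation: the primary failure is swallowed here and the
    // fallback result is returned. A fallback failure still propagates.
    return { url, markdown: await fallback(url), source: "fallback" };
  }
}

// Stub extractors (assumptions): the primary always fails, the fallback
// returns a fixed markdown string.
const failingPrimary: Extractor = async () => {
  throw new Error("primary unavailable");
};
const stubFallback: Extractor = async (url) => `# Fallback content for ${url}`;

extractWithFallback("https://example.com", failingPrimary, stubFallback).then(
  (result) => console.log(result.source, result.markdown),
);
```

Passing the extractors in as parameters keeps the fallback logic testable without network access or an API key.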

Test plan

  • All existing tests pass
  • Added tests for new pattern array syntax
  • Manually tested Cheerio scraper with example.com
  • Verified fallback behavior when Pure.md is unavailable
  • Tested include/exclude pattern filtering

bordoni added 5 commits July 5, 2025 00:21

- Add Cheerio-based content scraper as fallback for unknown URLs
- Replace single followPattern with includePatterns/excludePatterns arrays
- Support multiple include patterns (must match at least one)
- Support multiple exclude patterns (must not match any)
- Automatic fallback when Pure.md fails or is unavailable
- Update tests to use new pattern array syntax
- Bump version to 0.1.0-beta.3

- Add tests for ContentScraper class structure and error handling
- Add tests for pattern filtering logic (include/exclude patterns)
- Add integration tests for new pattern array syntax
- Update existing tests to use includePatterns instead of followPattern
- Maintain 94.85% test coverage with 79 total tests (up from 66)

- Rename ContentScraper to LinkDiscoverer
- Remove content extraction functionality from Cheerio implementation
- Pure.md remains the exclusive tool for content extraction
- LinkDiscoverer only extracts links for crawling and pagination
- Pages without successful Pure.md extraction show placeholder content
- Better separation of concerns between content extraction and link discovery

- Remove config-migration.ts and its tests
- Remove migration logic from CLI
- Update documentation to remove migration references
- Update CLI version to 0.1.0-beta.3
- Since we're still in beta, no need for backward compatibility

- Make pure section optional in config schema
- Update default config to not include empty pure object
- Handle optional pure config in CLI and crawler
- Update all example configs to not require pure section
- Add helpful message in init command about Pure.md API key
- Avoids awkward requirement of empty apiKey string

@bordoni bordoni merged commit f6ec55e into main Jul 6, 2025
3 checks passed
@bordoni bordoni deleted the feature/cheerio-scraper branch July 6, 2025 03:20