feat: add Cheerio scraper and enhance pattern filtering (v0.1.0-beta.3) by bordoni · Pull Request #5 · stellarwp/archivist

bordoni · 2025-07-05T04:33:03Z

Summary

This PR introduces version 0.1.0-beta.3 with a Cheerio-based content scraper as a fallback option and enhanced pattern filtering capabilities.

Changes

🎯 Cheerio-based Content Scraper

- Added fallback scraper for unknown URLs when Pure.md is unavailable
- Automatic fallback when Pure.md API fails or returns errors
- Extracts title, content, and links from any webpage
- Converts HTML content to clean markdown format
- Preserves document structure (headings, lists, quotes, code blocks)

🔍 Enhanced Link Filtering

- Replaced single followPattern with includePatterns and excludePatterns arrays
- Support for multiple include patterns (link must match at least one)
- Support for multiple exclude patterns (link must not match any)
- Patterns can be combined for sophisticated filtering logic
- Applied during both link collection and crawling phases
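
The include/exclude rules above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `shouldFollowLink` is hypothetical, patterns are assumed to be regular-expression strings, and an empty `includePatterns` array is assumed to mean "allow everything".

```typescript
// Hypothetical helper: the name, the regex interpretation of patterns, and the
// "empty include list allows everything" rule are assumptions for illustration.
function shouldFollowLink(
  url: string,
  includePatterns: string[] = [],
  excludePatterns: string[] = [],
): boolean {
  // A link is included when it matches at least one include pattern.
  const included =
    includePatterns.length === 0 ||
    includePatterns.some((pattern) => new RegExp(pattern).test(url));

  // A link is rejected as soon as it matches any exclude pattern.
  const excluded = excludePatterns.some((pattern) => new RegExp(pattern).test(url));

  return included && !excluded;
}

const links = [
  "https://example.com/docs/intro",
  "https://example.com/docs/api/v1",
  "https://example.com/blog/post",
  "https://example.com/docs/guide.pdf",
];

// Keep docs and guides, but skip PDFs and API reference pages.
const followed = links.filter((link) =>
  shouldFollowLink(link, ["/docs/", "/guides/"], ["\\.pdf$", "/api/"]),
);

console.log(followed);
```

Because the same predicate can be called wherever links are gathered and wherever they are dequeued, a single function like this would cover both the collection and crawling phases.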

🔧 Crawler Improvements

- Crawler now attempts Pure.md first, falls back to Cheerio on failure
- No Pure.md API key required for basic operation
- Improved error handling with graceful degradation
- Better support for sites not recognized by Pure.md
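
The "Pure.md first, Cheerio on failure" flow described above might look something like the sketch below. The `Extractor` type, the `extractWithFallback` function, and the stub extractors are all hypothetical names introduced for illustration; the real crawler's interfaces may differ, and no actual Pure.md or Cheerio calls are made here.

```typescript
// Hypothetical types: not taken from the archivist codebase.
interface PageResult {
  source: "pure.md" | "cheerio";
  markdown: string;
}

type Extractor = (url: string) => Promise<string>;

// Try the primary extractor first; on any error, degrade to the fallback.
async function extractWithFallback(
  url: string,
  primary: Extractor,
  fallback: Extractor,
): Promise<PageResult> {
  try {
    return { source: "pure.md", markdown: await primary(url) };
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    console.warn(`Primary extraction failed for ${url} (${reason}); using fallback`);
    return { source: "cheerio", markdown: await fallback(url) };
  }
}

// Stub extractors standing in for the real Pure.md client and Cheerio scraper.
const failingPrimary: Extractor = async () => {
  throw new Error("service unavailable");
};
const stubFallback: Extractor = async (url) => `# Fallback content for ${url}`;

extractWithFallback("https://example.com", failingPrimary, stubFallback).then((result) => {
  console.log(result.source, result.markdown);
});
```

Passing the extractors in as parameters keeps the fallback logic testable without network access, which is one way the "verified fallback behavior" item in the test plan could be automated.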

Test plan

- All existing tests pass
- Added tests for new pattern array syntax
- Manually tested Cheerio scraper with example.com
- Verified fallback behavior when Pure.md is unavailable
- Tested include/exclude pattern filtering

- Add Cheerio-based content scraper as fallback for unknown URLs
- Replace single followPattern with includePatterns/excludePatterns arrays
- Support multiple include patterns (must match at least one)
- Support multiple exclude patterns (must not match any)
- Automatic fallback when Pure.md fails or is unavailable
- Update tests to use new pattern array syntax
- Bump version to 0.1.0-beta.3

- Add tests for ContentScraper class structure and error handling
- Add tests for pattern filtering logic (include/exclude patterns)
- Add integration tests for new pattern array syntax
- Update existing tests to use includePatterns instead of followPattern
- Maintain 94.85% test coverage with 79 total tests (up from 66)

- Rename ContentScraper to LinkDiscoverer
- Remove content extraction functionality from Cheerio implementation
- Pure.md remains the exclusive tool for content extraction
- LinkDiscoverer only extracts links for crawling and pagination
- Pages without successful Pure.md extraction show placeholder content
- Better separation of concerns between content extraction and link discovery

- Remove config-migration.ts and its tests
- Remove migration logic from CLI
- Update documentation to remove migration references
- Update CLI version to 0.1.0-beta.3
- Since we're still in beta, no need for backward compatibility

- Make pure section optional in config schema
- Update default config to not include empty pure object
- Handle optional pure config in CLI and crawler
- Update all example configs to not require pure section
- Add helpful message in init command about Pure.md API key
- Avoids awkward requirement of empty apiKey string
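
As a rough illustration of handling an optional `pure` section, the sketch below treats a missing section and an empty `apiKey` the same way. The `ArchivistConfigSketch` interface and the `resolvePureApiKey` helper are hypothetical and only model the two fields named in this commit; the actual config schema is not shown in this PR.

```typescript
// Hypothetical shape: only the `pure` section and its `apiKey` come from the
// commit message; everything else about the real schema is omitted.
interface ArchivistConfigSketch {
  pure?: {
    apiKey?: string;
  };
}

// Returns the API key when one is configured, otherwise undefined.
function resolvePureApiKey(config: ArchivistConfigSketch): string | undefined {
  const key = config.pure?.apiKey?.trim();
  return key ? key : undefined;
}

// A config without a `pure` section is valid and simply yields no key.
console.log(resolvePureApiKey({}));
console.log(resolvePureApiKey({ pure: { apiKey: "my-key" } }));
```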

bordoni added 5 commits July 5, 2025 00:21

bordoni merged commit f6ec55e into main Jul 6, 2025
3 checks passed

bordoni deleted the feature/cheerio-scraper branch July 6, 2025 03:20

bordoni merged 5 commits into main from feature/cheerio-scraper
