feat: add Cheerio scraper and enhance pattern filtering (v0.1.0-beta.3) #5
Merged
- Add Cheerio-based content scraper as fallback for unknown URLs
- Replace single followPattern with includePatterns/excludePatterns arrays
- Support multiple include patterns (must match at least one)
- Support multiple exclude patterns (must not match any)
- Automatic fallback when Pure.md fails or is unavailable
- Update tests to use new pattern array syntax
- Bump version to 0.1.0-beta.3
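The include/exclude rules in this commit can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `PatternFilter` interface and `shouldFollow` function are hypothetical names, patterns are assumed to be regular-expression strings (the PR does not say whether they are regexes or globs), and an empty include list is assumed to mean "no restriction".

```typescript
// Hypothetical sketch: names and regex semantics are assumptions, not the PR's code.
interface PatternFilter {
  includePatterns?: string[]; // URL must match at least one (when any are given)
  excludePatterns?: string[]; // URL must not match any
}

function shouldFollow(url: string, filter: PatternFilter): boolean {
  const include = filter.includePatterns ?? [];
  const exclude = filter.excludePatterns ?? [];

  // A single matching exclude pattern rejects the URL.
  if (exclude.some((pattern) => new RegExp(pattern).test(url))) {
    return false;
  }

  // Assumption: an empty include list places no restriction on the URL.
  if (include.length === 0) {
    return true;
  }

  // Otherwise the URL must match at least one include pattern.
  return include.some((pattern) => new RegExp(pattern).test(url));
}
```

Checking excludes first means a URL that matches both lists is rejected, which is one reasonable reading of "must not match any".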
- Add tests for ContentScraper class structure and error handling
- Add tests for pattern filtering logic (include/exclude patterns)
- Add integration tests for new pattern array syntax
- Update existing tests to use includePatterns instead of followPattern
- Maintain 94.85% test coverage with 79 total tests (up from 66)
- Rename ContentScraper to LinkDiscoverer
- Remove content extraction functionality from Cheerio implementation
- Pure.md remains the exclusive tool for content extraction
- LinkDiscoverer only extracts links for crawling and pagination
- Pages without successful Pure.md extraction show placeholder content
- Better separation of concerns between content extraction and link discovery
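One way to picture the split described in this commit is a page record that takes its content from the extraction result and its links from a separate source. The sketch below is a guess at the shape, not the project's code: `ExtractionResult`, `PageRecord`, `buildPage`, and the placeholder wording are all hypothetical.

```typescript
// Hypothetical sketch: types, names, and placeholder text are assumptions.
type ExtractionResult =
  | { ok: true; markdown: string }   // content extraction succeeded
  | { ok: false; reason: string };   // content extraction failed or was unavailable

interface PageRecord {
  url: string;
  content: string;
  links: string[];
  hasContent: boolean;
}

const PLACEHOLDER_CONTENT = "[Content could not be extracted for this page]";

function buildPage(
  url: string,
  extraction: ExtractionResult,
  discoveredLinks: string[],
): PageRecord {
  return {
    url,
    // Content comes only from the extraction result; link discovery never fills it in.
    content: extraction.ok ? extraction.markdown : PLACEHOLDER_CONTENT,
    // Links are kept even when extraction failed, so crawling can continue.
    links: discoveredLinks,
    hasContent: extraction.ok,
  };
}
```

Keeping the links independent of the extraction outcome lets pagination and crawling proceed even when a page ends up with placeholder content.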
- Remove config-migration.ts and its tests
- Remove migration logic from CLI
- Update documentation to remove migration references
- Update CLI version to 0.1.0-beta.3
- Since we're still in beta, no need for backward compatibility
- Make pure section optional in config schema
- Update default config to not include empty pure object
- Handle optional pure config in CLI and crawler
- Update all example configs to not require pure section
- Add helpful message in init command about Pure.md API key
- Avoids awkward requirement of empty apiKey string
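A minimal sketch of what an optional pure section might look like on the consuming side. The `pure` and `apiKey` names come from the commit message; the `CrawlerConfig` and `PureConfig` shapes and the `resolvePureApiKey` helper are hypothetical, and the real schema may differ.

```typescript
// Hypothetical sketch: only `pure` and `apiKey` are named in the commit message.
interface PureConfig {
  apiKey?: string;
}

interface CrawlerConfig {
  pure?: PureConfig; // optional: configs no longer need an empty `pure` object
}

// Returns the configured key, or undefined when the section or key is absent or blank.
function resolvePureApiKey(config: CrawlerConfig): string | undefined {
  const key = config.pure?.apiKey?.trim();
  return key ? key : undefined;
}
```

Treating a blank string the same as a missing section keeps older configs with `apiKey: ""` behaving like configs that omit `pure` entirely.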
Summary
This PR introduces version 0.1.0-beta.3 with a Cheerio-based content scraper as a fallback option and enhanced pattern filtering capabilities.
Changes
🎯 Cheerio-based Content Scraper
🔍 Enhanced Link Filtering
- Replace followPattern with includePatterns and excludePatterns arrays
🔧 Crawler Improvements
Test plan