feat: add utility for load and parse Sitemap and SitemapRequestLoader
#1169
Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.
Pull Request Overview
This PR introduces a sitemap utility feature that integrates new routing logic for various sitemap formats and refactors endpoint signatures for consistency.
- Updated request routing in tests/unit/server.py to use a dictionary mapping paths to endpoint handler functions.
- Refactored endpoint functions to include consistent parameters (scope, _receive, send).
- Added a new get_sitemap_endpoint to serve sitemap content and implemented extensive tests in tests/unit/_utils/test_sitemap.py.
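The routing change described above can be sketched roughly as follows. This is a minimal illustration and not the code from tests/unit/server.py: the handler names, the `ROUTES` table, and the `call_app` helper are assumptions made for the example, and only the `(scope, _receive, send)` signature comes from the review summary.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Message = Dict[str, Any]
Receive = Callable[[], Awaitable[Message]]
Send = Callable[[Message], Awaitable[None]]
Endpoint = Callable[[Message, Receive, Send], Awaitable[None]]


async def send_text(send: Send, body: bytes, status: int = 200) -> None:
    """Send a complete plain-text ASGI HTTP response."""
    await send({
        'type': 'http.response.start',
        'status': status,
        'headers': [(b'content-type', b'text/plain')],
    })
    await send({'type': 'http.response.body', 'body': body})


# Hypothetical endpoints; every handler shares the same signature.
async def hello_endpoint(_scope: Message, _receive: Receive, send: Send) -> None:
    await send_text(send, b'hello')


async def not_found_endpoint(_scope: Message, _receive: Receive, send: Send) -> None:
    await send_text(send, b'not found', status=404)


# Path segment -> handler mapping (names are illustrative only).
ROUTES: Dict[str, Endpoint] = {'hello': hello_endpoint}


async def app(scope: Message, receive: Receive, send: Send) -> None:
    """Dispatch on the first non-empty path segment."""
    segments = [part for part in scope['path'].split('/') if part]
    key = segments[0] if segments else ''
    handler = ROUTES.get(key, not_found_endpoint)
    await handler(scope, receive, send)


def call_app(path: str) -> List[Message]:
    """Drive the app once without a server and collect the sent messages."""
    sent: List[Message] = []

    async def send(message: Message) -> None:
        sent.append(message)

    async def receive() -> Message:
        return {'type': 'http.request', 'body': b'', 'more_body': False}

    asyncio.run(app({'type': 'http', 'method': 'GET', 'path': path}, receive, send))
    return sent
```

With this layout, `/hello` and `/hello/nested/path` both reach `hello_endpoint`, because only the first segment is used for the lookup. Whether that is the desired behavior for deeper paths is a design choice of the sketch.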
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
tests/unit/server.py | Refactored endpoint function signatures and routing logic; added new sitemap endpoint. |
tests/unit/_utils/test_sitemap.py | Added comprehensive tests covering XML, gzipped, plain text, and invalid sitemap scenarios. |
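For context, the four scenarios named in the table (XML, gzipped, plain text, invalid) could be handled along these lines. This is a standard-library sketch, not the parser added in this PR: the function name `extract_sitemap_urls` is hypothetical, and returning an empty list for malformed XML is an assumption of the example.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import List

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def extract_sitemap_urls(payload: bytes) -> List[str]:
    """Return the URLs found in a sitemap payload (hypothetical helper)."""
    # Gzipped sitemaps are detected by their magic bytes and decompressed first.
    if payload.startswith(GZIP_MAGIC):
        payload = gzip.decompress(payload)

    stripped = payload.strip()

    # Plain-text sitemaps list one URL per line.
    if not stripped.startswith(b'<'):
        lines = stripped.decode('utf-8').splitlines()
        return [line.strip() for line in lines if line.strip()]

    try:
        root = ET.fromstring(stripped)
    except ET.ParseError:
        # Assumption of this sketch: malformed XML yields no URLs.
        return []

    urls = []
    for element in root.iter():
        # Compare the local tag name so the sitemap namespace does not matter.
        local_name = element.tag.rsplit('}', 1)[-1]
        if local_name == 'loc' and element.text and element.text.strip():
            urls.append(element.text.strip())
    return urls
```

The sketch reads the whole payload into memory and does not follow nested sitemaps referenced from a sitemap index; a streaming parser would process the document incrementally instead.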
SitemapRequestLoader
Pull Request Overview
This PR introduces a new utility for loading and parsing sitemaps and adds the SitemapRequestLoader to facilitate integrating sitemap-based requests into the framework. Key changes include:
- Refactoring the server routing to support dynamic endpoint functions with a unified signature.
- Adding comprehensive tests for sitemap loading, including gzip and plain text variants.
- Implementing the SitemapRequestLoader and integrating it with the existing request loader framework.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
File | Description |
---|---|
tests/unit/server.py | Refactored endpoint routing to use a path-to-handler mapping. |
tests/unit/request_loaders/test_sitemap_request_loader.py | New tests ensuring proper sitemap request loader functionality. |
tests/unit/_utils/test_sitemap.py | Extensive tests for sitemap parsing and various sitemap formats. |
src/crawlee/request_loaders/_sitemap_request_loader.py | New implementation of SitemapRequestLoader with background sitemap loading. |
src/crawlee/request_loaders/`__init__.py` | Updated `__all__` to export SitemapRequestLoader. |
src/crawlee/_utils/robots.py | Extended RobotsTxtFile to support sitemap parsing and URL extraction. |
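The "background sitemap loading" mentioned for `_sitemap_request_loader.py` can be pictured as a producer task feeding a bounded queue that the consumer drains on demand. The sketch below only illustrates that pattern: `BackgroundUrlLoader`, `fetch_next`, and `collect_all` are hypothetical names, and nothing here is the actual SitemapRequestLoader API.

```python
import asyncio
from typing import Iterable, List, Optional


class BackgroundUrlLoader:
    """Hypothetical loader: a background task fills a bounded queue with URLs."""

    def __init__(self, urls: Iterable[str], max_buffer: int = 100) -> None:
        self._urls = urls
        # The bounded queue applies backpressure to the producer.
        self._queue = asyncio.Queue(maxsize=max_buffer)
        self._task = None
        self._finished = False

    async def _produce(self) -> None:
        # Stand-in for streaming sitemap parsing.
        for url in self._urls:
            await self._queue.put(url)
        await self._queue.put(None)  # sentinel: no more URLs

    async def fetch_next(self) -> Optional[str]:
        """Return the next URL, or None once the source is exhausted."""
        if self._task is None:
            # Start loading lazily on first use.
            self._task = asyncio.create_task(self._produce())
        if self._finished:
            return None
        item = await self._queue.get()
        if item is None:
            self._finished = True
        return item


async def collect_all(urls: Iterable[str], max_buffer: int = 100) -> List[str]:
    """Usage example: drain a loader until it reports exhaustion."""
    loader = BackgroundUrlLoader(urls, max_buffer=max_buffer)
    collected = []
    while True:
        url = await loader.fetch_next()
        if url is None:
            return collected
        collected.append(url)
```

Because the queue is bounded, the producer pauses whenever the consumer falls behind, so a very large sitemap never has to be held in memory in full.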
Comments suppressed due to low confidence (2)
tests/unit/server.py:120
- Switching from prefix-based matching to extracting a specific part from the URL may affect routing behavior; please verify that this logic meets all desired routing cases (e.g. deeper nested paths).
path_parts = URL(scope['path']).parts
src/crawlee/_utils/robots.py:89
- The docstring for 'parse_sitemaps' indicates it returns a list of Sitemap instances, but the implementation returns a single Sitemap instance; please update the docstring to accurately reflect the return type.
async def parse_sitemaps(self) -> Sitemap:
Seems promising, thanks 🙂
Co-authored-by: Jan Buchar <Teyras@gmail.com>
### Description
- Add `stream` method for `HttpClient`
- Add an async context manager for cleaning up resources when closing a `HttpClient`

Relates: #1169
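As an illustration of the two points in that change, a client that supports `async with` for cleanup and exposes a streaming method might look like the sketch below. `SketchHttpClient`, `FakeResponse`, and `demo` are invented for this example, no network access is performed, and the real `HttpClient` interface may differ.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List


class FakeResponse:
    """Stand-in response that yields a fixed sequence of body chunks."""

    def __init__(self, chunks: List[bytes]) -> None:
        self._chunks = chunks

    async def iter_chunks(self) -> AsyncIterator[bytes]:
        for chunk in self._chunks:
            yield chunk


class SketchHttpClient:
    """Hypothetical client showing cleanup on exit and a streaming method."""

    def __init__(self) -> None:
        self.closed = False
        self.open_streams = 0

    async def __aenter__(self) -> 'SketchHttpClient':
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Resources are released even when the body of `async with` raises.
        await self.close()

    async def close(self) -> None:
        self.closed = True

    @asynccontextmanager
    async def stream(self, url: str) -> AsyncIterator[FakeResponse]:
        if self.closed:
            raise RuntimeError('client is closed')
        self.open_streams += 1
        try:
            # A real client would open a connection to `url` here.
            yield FakeResponse([b'<urlset>', b'</urlset>'])
        finally:
            self.open_streams -= 1


async def demo() -> tuple:
    """Usage example: stream a body, then confirm everything was cleaned up."""
    body = b''
    async with SketchHttpClient() as client:
        async with client.stream('http://example.com/sitemap.xml') as response:
            async for chunk in response.iter_chunks():
                body += chunk
    return body, client.closed, client.open_streams
```

Streaming fits sitemap loading because the body can be passed to a parser chunk by chunk instead of being buffered in full.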
I'd appreciate if @vdusek or @Pijukatel could also look into this as it's pretty big. I don't see any issues now.
Looks good! Just maybe could you update the Request loaders guide to cover SitemapRequestLoader as well? 🙂
Description
- SitemapRequestLoader for comfortable working with Sitemap and easy integration into the framework
- Sitemap, loads, and stream parsing
Issues
- Sitemap parser utility #1161
Testing
- SitemapRequestLoader