feat: add utility for load and parse Sitemap and SitemapRequestLoader #1169

Open
wants to merge 44 commits into
base: master

Conversation

Mantisus
Collaborator

@Mantisus Mantisus commented Apr 22, 2025

Description

  • Add SitemapRequestLoader for convenient work with sitemaps and easy integration into the framework
  • Add a utility for working with sitemaps: loading and stream parsing
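
To illustrate what "stream parsing" can look like, here is a minimal sketch using only the standard library. It is not the code from this PR: the function names `iter_sitemap_urls` and `_drain_locs` are hypothetical, and the sketch simply yields every `<loc>` value it sees (it does not distinguish a sitemap index from a URL set).

```python
from collections.abc import Iterable, Iterator
from xml.etree.ElementTree import XMLPullParser


def _drain_locs(parser: XMLPullParser) -> Iterator[str]:
    """Yield the text of every completed <loc> element currently queued in the parser."""
    for _event, element in parser.read_events():
        # Tags carry the sitemap namespace, e.g. "{http://...}loc"; compare the local name only.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            yield element.text.strip()
        element.clear()  # Drop the element's text and children once processed.


def iter_sitemap_urls(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield <loc> values from a sitemap as its bytes arrive, chunk by chunk (hypothetical helper)."""
    parser = XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        yield from _drain_locs(parser)
    parser.close()
    # Events may still be queued after close(); read them as well.
    yield from _drain_locs(parser)
```

Because URLs are yielded as soon as their closing tag is parsed, a consumer can start working before the whole document has been downloaded.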

Issues

Testing

  • Add tests for SitemapRequestLoader
  • Add new endpoints to the uvicorn test server for sitemap tests

@Mantisus Mantisus requested a review from Copilot April 22, 2025 23:42
Contributor

@Copilot Copilot AI left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

@Mantisus Mantisus self-assigned this Apr 22, 2025
@Mantisus Mantisus requested a review from Copilot May 30, 2025 12:16
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a sitemap utility feature that integrates new routing logic for various sitemap formats and refactors endpoint signatures for consistency.

  • Updated request routing in tests/unit/server.py to use a dictionary mapping paths to endpoint handler functions.
  • Refactored endpoint functions to include consistent parameters (scope, _receive, send).
  • Added a new get_sitemap_endpoint to serve sitemap content and implemented extensive tests in tests/unit/_utils/test_sitemap.py.
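
The routing change described above can be sketched roughly as follows. This is an assumption-laden illustration, not the contents of tests/unit/server.py: the endpoint names, the `ROUTES` table, and the `send_text` helper are all hypothetical, and the first path segment is extracted with plain string operations.

```python
from collections.abc import Awaitable, Callable
from typing import Any

Receive = Callable[[], Awaitable[dict[str, Any]]]
Send = Callable[[dict[str, Any]], Awaitable[None]]
Endpoint = Callable[[dict[str, Any], Receive, Send], Awaitable[None]]


async def send_text(send: Send, status: int, body: bytes) -> None:
    """Send a complete plain-text HTTP response as two ASGI messages."""
    await send({"type": "http.response.start", "status": status,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": body})


# Every endpoint shares the same (scope, _receive, send) signature.
async def hello_endpoint(_scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    await send_text(send, 200, b"hello")


async def sitemap_stub_endpoint(_scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    await send_text(send, 200, b"<urlset></urlset>")


async def not_found_endpoint(_scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    await send_text(send, 404, b"not found")


# Hypothetical routing table: first path segment -> handler.
ROUTES: dict[str, Endpoint] = {
    "": hello_endpoint,
    "sitemap.xml": sitemap_stub_endpoint,
}


async def app(scope: dict[str, Any], receive: Receive, send: Send) -> None:
    """Dispatch on the first path segment; unknown segments fall back to a 404 handler."""
    first_segment = scope["path"].strip("/").split("/", 1)[0]
    handler = ROUTES.get(first_segment, not_found_endpoint)
    await handler(scope, receive, send)
```

A unified signature is what makes the dictionary lookup possible: the dispatcher can call any handler without knowing which one it picked.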

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `tests/unit/server.py` | Refactored endpoint function signatures and routing logic; added new sitemap endpoint. |
| `tests/unit/_utils/test_sitemap.py` | Added comprehensive tests covering XML, gzipped, plain text, and invalid sitemap scenarios. |
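
For readers unfamiliar with the sitemap variants named in the table, the sketch below shows one way gzipped and plain-text bodies could be normalized before URL extraction. It is an assumption for illustration only (the helper names are hypothetical and the PR's utility may detect formats differently, for example from the URL suffix or response headers).

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # First two bytes of every gzip stream.


def decode_sitemap_body(body: bytes) -> str:
    """Return the body as text, transparently decompressing gzip content (hypothetical helper)."""
    if body.startswith(GZIP_MAGIC):
        body = gzip.decompress(body)
    return body.decode("utf-8")


def extract_plain_text_urls(text: str) -> list[str]:
    """A plain-text sitemap lists one URL per line; blank lines are skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```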

@Mantisus Mantisus changed the title feat: add Sitemap Utility feat: add utility for load and parse Sitemap and SitemapRequestLoader Jun 3, 2025
@Mantisus Mantisus requested a review from Copilot June 3, 2025 18:22
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a new utility for loading and parsing sitemaps and adds the SitemapRequestLoader to facilitate integrating sitemap-based requests into the framework. Key changes include:

  • Refactoring the server routing to support dynamic endpoint functions with a unified signature.
  • Adding comprehensive tests for sitemap loading, including gzip and plain text variants.
  • Implementing the SitemapRequestLoader and integrating it with the existing request loader framework.
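
The idea of a loader that fills a buffer in the background while the consumer pulls items one at a time can be sketched as below. This is a generic asyncio pattern under stated assumptions, not the SitemapRequestLoader implementation: the class `BackgroundUrlLoader` and its methods `start` and `fetch_next` are hypothetical names.

```python
import asyncio
from collections.abc import Iterable
from typing import Optional


class BackgroundUrlLoader:
    """Hypothetical loader: a background task fills a bounded queue that callers drain."""

    def __init__(self, source: Iterable[str], max_buffered: int = 100) -> None:
        self._source = source
        # A bounded queue applies backpressure so the producer cannot run far ahead.
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffered)
        self._task = None  # Reference kept so the task is not garbage collected.
        self._finished = False

    def start(self) -> None:
        """Begin loading in the background; must be called inside a running event loop."""
        self._task = asyncio.create_task(self._produce())

    async def _produce(self) -> None:
        for url in self._source:
            await self._queue.put(url)
        await self._queue.put(None)  # Sentinel marking the end of the source.

    async def fetch_next(self) -> Optional[str]:
        """Return the next URL, or None once the source is exhausted."""
        if self._finished:
            return None
        item = await self._queue.get()
        if item is None:
            self._finished = True
        return item
```

A consumer would typically call `fetch_next()` in a loop until it returns `None`.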

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `tests/unit/server.py` | Refactored endpoint routing to use a path-to-handler mapping. |
| `tests/unit/request_loaders/test_sitemap_request_loader.py` | New tests ensuring proper sitemap request loader functionality. |
| `tests/unit/_utils/test_sitemap.py` | Extensive tests for sitemap parsing and various sitemap formats. |
| `src/crawlee/request_loaders/_sitemap_request_loader.py` | New implementation of SitemapRequestLoader with background sitemap loading. |
| `src/crawlee/request_loaders/__init__.py` | Updated `__all__` to export SitemapRequestLoader. |
| `src/crawlee/_utils/robots.py` | Extended RobotsTxtFile to support sitemap parsing and URL extraction. |
Comments suppressed due to low confidence (2)

tests/unit/server.py:120

  • Switching from prefix-based matching to extracting a specific part from the URL may affect routing behavior; please verify that this logic meets all desired routing cases (e.g. deeper nested paths).
path_parts = URL(scope['path']).parts

src/crawlee/_utils/robots.py:89

  • The docstring for 'parse_sitemaps' indicates it returns a list of Sitemap instances, but the implementation returns a single Sitemap instance; please update the docstring to accurately reflect the return type.
async def parse_sitemaps(self) -> Sitemap:

@Mantisus Mantisus marked this pull request as ready for review June 3, 2025 18:44
@Mantisus Mantisus requested review from janbuchar and vdusek June 3, 2025 18:44
Collaborator

@janbuchar janbuchar left a comment

Seems promising, thanks 🙂

Mantisus added 5 commits June 9, 2025 10:00
Mantisus added 3 commits June 10, 2025 21:48
Mantisus and others added 17 commits June 11, 2025 13:42

Co-authored-by: Jan Buchar <Teyras@gmail.com>
vdusek pushed a commit that referenced this pull request Jun 19, 2025

### Description

- Add `stream` method for `HttpClient`
- Add an async context manager for cleaning up resources when closing a
`HttpClient`

Relates: #1169
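
The two additions in the referenced commit (a streaming method and an async context manager for cleanup) fit a common pattern, sketched below with a stand-in class. `FakeStreamingClient` is hypothetical and serves bytes from memory; it is not the crawlee `HttpClient`, and the real method signatures may differ.

```python
import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager


class FakeStreamingClient:
    """Hypothetical stand-in for an HTTP client; serves a fixed payload in chunks."""

    def __init__(self, payload: bytes, chunk_size: int = 4) -> None:
        self._payload = payload
        self._chunk_size = chunk_size
        self.closed = False
        self.open_streams = 0

    async def __aenter__(self) -> "FakeStreamingClient":
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        # A real client would release connections and sessions here.
        self.closed = True

    @asynccontextmanager
    async def stream(self, url: str) -> AsyncIterator[AsyncIterator[bytes]]:
        """Yield an async iterator of body chunks; the stream is released on exit."""
        self.open_streams += 1
        try:
            yield self._iter_chunks()
        finally:
            self.open_streams -= 1

    async def _iter_chunks(self) -> AsyncIterator[bytes]:
        for start in range(0, len(self._payload), self._chunk_size):
            await asyncio.sleep(0)  # Give other tasks a chance to run between chunks.
            yield self._payload[start:start + self._chunk_size]
```

Nesting the two context managers keeps cleanup explicit: the inner one releases a single response stream, the outer one releases the client itself.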
Mantisus added 2 commits June 19, 2025 12:31
@Mantisus Mantisus requested a review from janbuchar June 19, 2025 12:35
Collaborator

@janbuchar janbuchar left a comment

I'd appreciate it if @vdusek or @Pijukatel could also look into this, as it's pretty big. I don't see any issues now.

Collaborator

@vdusek vdusek left a comment

Looks good! Could you maybe also update the Request loaders guide to cover SitemapRequestLoader? 🙂

Development

Successfully merging this pull request may close these issues.

add Sitemap parser utility
3 participants