Skip to content

fix: fix match check for specified enqueue strategy for requests with redirect #1199

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 20, 2025

Conversation

Mantisus
Copy link
Collaborator

Description

  • Fixes match check for specified enqueue strategy for requests with redirect. Before this PR, the check used the final url after the redirect, after that the original url will be used.

Issues

Testing

  • Added tests for enqueue strategy with redirect simulation.

@Mantisus Mantisus self-assigned this May 16, 2025
@Mantisus Mantisus requested a review from Copilot May 16, 2025 19:27
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the match check for the enqueue strategy by ensuring that the original URL is used for comparison instead of the final URL after a redirect. Key changes include adding a new field (loaded_url) to test inputs, updating tests to assign the loaded URL, and modifying the crawler’s commit handler to use context.request.url for the URL check.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
tests/unit/crawlers/_basic/test_basic_crawler.py Added a new "loaded_url" field in the test input dataclass and updated test cases and request context assignment to simulate original URLs.
src/crawlee/crawlers/_basic/_basic_crawler.py Changed the URL used for the match check by replacing the earlier "origin" variable with context.request.url.

@@ -1129,7 +1129,7 @@ async def _commit_request_handler_result(self, context: BasicCrawlingContext) ->
and self._check_enqueue_strategy(
add_requests_call.get('strategy', 'all'),
target_url=urlparse(dst_request.url),
origin_url=urlparse(origin),
origin_url=urlparse(context.request.url),
Copy link
Preview

Copilot AI May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests assign the original URL to context.request.loaded_url but the production code uses context.request.url for the enqueue strategy check. Ensure that the correct property (either 'url' or 'loaded_url') is used consistently across the system to fix the redirect issue.

Suggested change
origin_url=urlparse(context.request.url),
origin_url=urlparse(context.request.loaded_url or context.request.url),

Copilot uses AI. Check for mistakes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests assign context.request.loaded_url to simulate a redirect. The start url should be used when we check the strategy.

@@ -451,6 +513,7 @@ async def test_enqueue_strategy(test_input: AddRequestsTestInput) -> None:

@crawler.router.handler('start')
async def start_handler(context: BasicCrawlingContext) -> None:
context.request.loaded_url = test_input.loaded_url
Copy link
Preview

Copilot AI May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test assigns the original URL to context.request.loaded_url but if the commit handler is intended to use the original URL for enqueue strategy testing, consider aligning the property used (for example, updating context.request.url) to match the production logic.

Suggested change
context.request.loaded_url = test_input.loaded_url
context.request.url = test_input.loaded_url

Copilot uses AI. Check for mistakes.

Copy link
Collaborator Author

@Mantisus Mantisus May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, we do this to simulate a redirect.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should add a comment here? Copilot has a point, this looks sort of fishy 🙂

Something along the lines of "Assign test value to loaded_url - BasicCrawler does not do any navigation by itself".

@Mantisus Mantisus requested review from janbuchar and Pijukatel May 16, 2025 19:35
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please add that one comment.

@@ -451,6 +513,7 @@ async def test_enqueue_strategy(test_input: AddRequestsTestInput) -> None:

@crawler.router.handler('start')
async def start_handler(context: BasicCrawlingContext) -> None:
context.request.loaded_url = test_input.loaded_url
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should add a comment here? Copilot has a point, this looks sort of fishy 🙂

Something along the lines of "Assign test value to loaded_url - BasicCrawler does not do any navigation by itself".

@janbuchar janbuchar merged commit d84c30c into apify:master May 20, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When the page is redirected, the strategy parameter of context.enqueue_links is no longer effective.
4 participants