fix: fix match check for specified enqueue strategy for requests with redirect #1199

Mantisus · 2025-05-16T19:24:43Z

Description

Fixes the match check for the specified enqueue strategy on requests that are redirected. Before this PR, the check used the final URL after the redirect; with this PR, it uses the original URL.
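To make the behavior change concrete, here is a minimal sketch of a strategy check run against both candidate origins. `matches_strategy` is a hypothetical, simplified helper written for illustration, not Crawlee's `_check_enqueue_strategy`; the URLs are made up, and only a few strategy names are handled.

```python
from urllib.parse import urlparse


def matches_strategy(strategy: str, target: str, origin: str) -> bool:
    """Simplified stand-in for the crawler's strategy check (hypothetical helper)."""
    t, o = urlparse(target), urlparse(origin)
    if strategy == 'all':
        return True
    if strategy == 'same-hostname':
        return t.hostname == o.hostname
    if strategy == 'same-origin':
        return (t.scheme, t.hostname, t.port) == (o.scheme, o.hostname, o.port)
    raise ValueError(f'Unsupported strategy in this sketch: {strategy}')


# Assumed example URLs, not taken from the PR.
original_url = 'https://example.com/start'      # URL the request was created with
final_url = 'https://www.example.org/landing'   # URL reached after the redirect
link = 'https://example.com/page'               # link handed to enqueue_links

# Comparing against the post-redirect URL (behavior described as "before this PR").
before_fix = matches_strategy('same-hostname', link, final_url)

# Comparing against the original request URL (behavior described as "after this PR").
after_fix = matches_strategy('same-hostname', link, original_url)
```

In this sketch the same link is rejected when the redirect target is the origin and accepted when the original URL is the origin, which is the difference the description refers to.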

Issues

Closes: When the page is redirected, the strategy parameter of context.enqueue_links is no longer effective. #1198

Testing

Added tests for enqueue strategy with redirect simulation.

Copilot

Pull Request Overview

This PR fixes the match check for the enqueue strategy by ensuring that the original URL is used for comparison instead of the final URL after a redirect. Key changes include adding a new field (loaded_url) to test inputs, updating tests to assign the loaded URL, and modifying the crawler’s commit handler to use context.request.url for the URL check.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
tests/unit/crawlers/_basic/test_basic_crawler.py	Added a new "loaded_url" field in the test input dataclass and updated test cases and request context assignment to simulate original URLs.
src/crawlee/crawlers/_basic/_basic_crawler.py	Changed the URL used for the match check by replacing the earlier "origin" variable with context.request.url.

Copilot · 2025-05-16T19:28:09Z

src/crawlee/crawlers/_basic/_basic_crawler.py

@@ -1129,7 +1129,7 @@ async def _commit_request_handler_result(self, context: BasicCrawlingContext) ->
                    and self._check_enqueue_strategy(
                        add_requests_call.get('strategy', 'all'),
                        target_url=urlparse(dst_request.url),
-                        origin_url=urlparse(origin),
+                        origin_url=urlparse(context.request.url),


The tests assign the original URL to context.request.loaded_url but the production code uses context.request.url for the enqueue strategy check. Ensure that the correct property (either 'url' or 'loaded_url') is used consistently across the system to fix the redirect issue.

Suggested change

- origin_url=urlparse(context.request.url),
+ origin_url=urlparse(context.request.loaded_url or context.request.url),

The tests assign context.request.loaded_url to simulate a redirect. The start URL should be used when checking the strategy.

Copilot · 2025-05-16T19:28:09Z

tests/unit/crawlers/_basic/test_basic_crawler.py

@@ -451,6 +513,7 @@ async def test_enqueue_strategy(test_input: AddRequestsTestInput) -> None:

    @crawler.router.handler('start')
    async def start_handler(context: BasicCrawlingContext) -> None:
+        context.request.loaded_url = test_input.loaded_url


The test assigns the original URL to context.request.loaded_url but if the commit handler is intended to use the original URL for enqueue strategy testing, consider aligning the property used (for example, updating context.request.url) to match the production logic.

Suggested change

- context.request.loaded_url = test_input.loaded_url
+ context.request.url = test_input.loaded_url

This is correct, we do this to simulate a redirect.

vdusek

LGTM

janbuchar

LGTM, but please add that one comment.

janbuchar · 2025-05-20T11:20:17Z

tests/unit/crawlers/_basic/test_basic_crawler.py

@@ -451,6 +513,7 @@ async def test_enqueue_strategy(test_input: AddRequestsTestInput) -> None:

    @crawler.router.handler('start')
    async def start_handler(context: BasicCrawlingContext) -> None:
+        context.request.loaded_url = test_input.loaded_url


Perhaps we should add a comment here? Copilot has a point, this looks sort of fishy 🙂

Something along the lines of "Assign test value to loaded_url - BasicCrawler does not do any navigation by itself".
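The redirect simulation discussed in this thread can be pictured with a stand-in request object. `FakeRequest` and `simulate_redirect` below are hypothetical names used only for illustration (they are not Crawlee classes or helpers), and the URLs are assumed. The sketch shows why the test writes to `loaded_url` by hand and how the two origin expressions from the review differ once it is set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a crawler request; not Crawlee's Request class."""
    url: str
    loaded_url: Optional[str] = None


def simulate_redirect(request: FakeRequest, redirected_to: str) -> None:
    # Nothing in this sketch performs navigation, so the "redirect" is
    # simulated by assigning the test value to loaded_url directly.
    request.loaded_url = redirected_to


request = FakeRequest(url='https://example.com/start')
simulate_redirect(request, 'https://other.example.org/landing')

# Origin as used in the merged change: the original request URL.
origin_merged = request.url

# Origin as proposed in the review suggestion: prefers the redirect target.
origin_suggested = request.loaded_url or request.url
```

With `loaded_url` set, the suggested expression resolves to the redirect target, while the merged expression keeps the start URL that the author says should drive the strategy check.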

fix enqueue strategy for redirects

fc35507

Mantisus self-assigned this May 16, 2025

Mantisus requested a review from Copilot May 16, 2025 19:27

Copilot AI reviewed May 16, 2025

Mantisus requested review from janbuchar and Pijukatel May 16, 2025 19:35

vdusek approved these changes May 19, 2025

Pijukatel approved these changes May 20, 2025

janbuchar approved these changes May 20, 2025

add comment

a4385b3

janbuchar merged commit d84c30c into apify:master May 20, 2025
23 checks passed
