-
Notifications
You must be signed in to change notification settings - Fork 394
fix: fix match check for specified enqueue strategy for requests with redirect #1199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes the match check for the enqueue strategy by ensuring that the original URL is used for comparison instead of the final URL after a redirect. Key changes include adding a new field (loaded_url) to test inputs, updating tests to assign the loaded URL, and modifying the crawler’s commit handler to use context.request.url for the URL check.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
tests/unit/crawlers/_basic/test_basic_crawler.py | Added a new "loaded_url" field in the test input dataclass and updated test cases and request context assignment to simulate original URLs. |
src/crawlee/crawlers/_basic/_basic_crawler.py | Changed the URL used for the match check by replacing the earlier "origin" variable with context.request.url. |
@@ -1129,7 +1129,7 @@ async def _commit_request_handler_result(self, context: BasicCrawlingContext) -> | |||
and self._check_enqueue_strategy( | |||
add_requests_call.get('strategy', 'all'), | |||
target_url=urlparse(dst_request.url), | |||
origin_url=urlparse(origin), | |||
origin_url=urlparse(context.request.url), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests assign the original URL to context.request.loaded_url but the production code uses context.request.url for the enqueue strategy check. Ensure that the correct property (either 'url' or 'loaded_url') is used consistently across the system to fix the redirect issue.
origin_url=urlparse(context.request.url), | |
origin_url=urlparse(context.request.loaded_url or context.request.url), |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests assign context.request.loaded_url
to simulate a redirect. The start url should be used when we check the strategy.
@@ -451,6 +513,7 @@ async def test_enqueue_strategy(test_input: AddRequestsTestInput) -> None: | |||
|
|||
@crawler.router.handler('start') | |||
async def start_handler(context: BasicCrawlingContext) -> None: | |||
context.request.loaded_url = test_input.loaded_url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test assigns the original URL to context.request.loaded_url but if the commit handler is intended to use the original URL for enqueue strategy testing, consider aligning the property used (for example, updating context.request.url) to match the production logic.
context.request.loaded_url = test_input.loaded_url | |
context.request.url = test_input.loaded_url |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct, we do this to simulate a redirect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should add a comment here? Copilot has a point, this looks sort of fishy 🙂
Something along the lines of "Assign test value to loaded_url
- BasicCrawler does not do any navigation by itself".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but please add that one comment.
@@ -451,6 +513,7 @@ async def test_enqueue_strategy(test_input: AddRequestsTestInput) -> None: | |||
|
|||
@crawler.router.handler('start') | |||
async def start_handler(context: BasicCrawlingContext) -> None: | |||
context.request.loaded_url = test_input.loaded_url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should add a comment here? Copilot has a point, this looks sort of fishy 🙂
Something along the lines of "Assign test value to loaded_url
- BasicCrawler does not do any navigation by itself".
Description
Issues
Testing