
fix: Fix link filtering in enqueueLinks in AdaptivePlaywrightCrawler #3021

Merged
merged 5 commits into from
Jun 25, 2025

janbuchar
Contributor

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Jun 18, 2025
@janbuchar janbuchar requested review from B4nan and barjin June 18, 2025 16:05
@github-actions github-actions bot added this to the 117th sprint - Tooling team milestone Jun 18, 2025
@github-actions github-actions bot added the tested Temporary label used only programmatically for some analytics. label Jun 18, 2025
@janbuchar janbuchar marked this pull request as ready for review June 23, 2025 15:28
Contributor

@barjin barjin left a comment

the changes seem sound to me, thank you 👍

I'm starting to get lost in the AdaptiveCrawler design (partially since it's a subclass of PlaywrightCrawler and the HttpCrawler features are glued on top of it). Not sure what the way out is, refactoring this would be tough for sure.

@@ -590,15 +627,14 @@ export class AdaptivePlaywrightCrawler extends PlaywrightCrawler {

 return $;
 },
-async enqueueLinks(
+enqueueLinks: async (
Contributor

nit: this switch from `function` to an arrow function because of the `this` binding is sneaky and got me confused for a bit (but I don't really see any better way).

fwiw, the new `enqueueLinks` impl doesn't interact with `this` at all (could be static / completely separate).
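
For readers following along, here is a minimal sketch of the `this` binding difference mentioned above. It is not the code from this PR: `FakeCrawler`, `buildContext`, `describeWithMethod` and `describeWithArrow` are hypothetical names used only for illustration. A method on an object literal gets `this` from the object it is called on, while an arrow function property keeps the `this` of the enclosing class method.

```typescript
// Hypothetical stand-in for a crawler class; NOT the real AdaptivePlaywrightCrawler.
class FakeCrawler {
    readonly name = 'crawler';

    // Builds a context-like object literal, similar in shape to a crawling context.
    buildContext() {
        return {
            name: 'context',
            // Method shorthand: `this` is the object the method is called on,
            // so here it resolves to the context object.
            describeWithMethod(): string {
                return this.name;
            },
            // Arrow function property: `this` is captured lexically from
            // buildContext(), so it resolves to the FakeCrawler instance.
            describeWithArrow: (): string => {
                return this.name;
            },
        };
    }
}

const context = new FakeCrawler().buildContext();
const methodResult = context.describeWithMethod(); // 'context'
const arrowResult = context.describeWithArrow(); // 'crawler'
console.log(methodResult, arrowResult);
```

Under these assumptions, switching a context helper from method shorthand to an arrow function changes which object `this` refers to inside it, which is why the one-line change in the diff is easy to misread.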

@janbuchar
Contributor Author

> I'm starting to get lost in the AdaptiveCrawler design (partially since it's a subclass of PlaywrightCrawler and the HttpCrawler features are glued on top of it). Not sure what the way out is, refactoring this would be tough for sure.

The Python counterpart is a bit DRY-er, I hope we can make this implementation more like it in v4.

@janbuchar janbuchar merged commit 8a3b6f8 into master Jun 25, 2025
10 of 11 checks passed
@janbuchar janbuchar deleted the fix-adaptive-crawler-enqueue-strategy-check branch June 25, 2025 11:35
Labels
t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics.
Development

Successfully merging this pull request may close these issues.

Enqueue strategy check after redirects is not working with adaptive crawler
2 participants