-
Notifications
You must be signed in to change notification settings - Fork 879
feat: add TandemRequestProvider
for combined RequestList
and RequestQueue
usage
#2914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Please merge the latest |
fbd4a80
to
7b7260e
Compare
@barjin Everything is done know you can review the PR thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your initiative!
There are some things I'm not too sure about (see my comments). Also, I see that your branch is almost a year (~500 commits) behind our master. Could you please merge the latest apify/crawlee
master into your branch, so we can test the changes later? Cheers!
TandemRequestProvider
for combined RequestList
and RequestQueue
usage
Hii @barjin I am really too much confused about merging the latest master into my branch can you tell me how to do that please |
See the GitHub documentation at https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork?platform=linux . I see you might have already merged our Try going here and find the "Sync fork" button. |
It's done now let me resolve the requested changes Thanks for guide |
Hey @barjin Now the PR is ready to review/merge can you check and let me know if any changes required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this already looks much better.
I still have some questions regarding the background transfer of the RequestList
requests to the RequestQueue
. I'm still not convinced this is the way to go here, especially since we've always done this lazily until now.
I'll request reviews from our other maintainers to make sure I'm not missing anything. Cheers!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Thanks @CodeMan62 for the contribution! I assume you're trying to fix #2499? If so, we should link it to this PR. I haven't reviewed the code yet, but here are some points I want to make:
While issue #2499 mentions that we might want to transfer the request list independently of |
…ed transfers instead of immediate background transfer
Hey @barjin Thanks for the thorough review! I've addressed both concerns:
The new implementation should handle large RequestLists efficiently while being safer to use. Let me know if you'd like any adjustments! |
Hey @janbuchar Thanks for the feedback! I've made the following changes:
And yes, this PR is addressing #2499 . I've looked at your initial draft and implemented a similar approach with batched transfers that should be more efficient while maintaining the same functionality. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have some comments, but they are not as pressing as the previous ones (and reflect rather my ideas than actual issues with the code).
Once we discuss the comments, I don't see any big issues anymore, so this PR is actually a soft go from me. Thanks!
…ain original request order
@janbuchar Regarding the rename to RequestManagerTandem - I've reverted this change for now. I think we should:
Let me know if you agree with this approach or if you'd prefer to handle the rename differently. |
@barjin I have done the changes you have asked for take a look at the changes and let me know if any changes required |
Hi @CodeMan62, i disagree. There is no good reason to introduce another set of names just to change them later. We should keep the current names backwards compatible for now, but where it's applicable, we should use the naming from the python port - that is the target state:
|
@janbuchar No worries, let me change the names as you want |
cf41668
to
2f6a90f
Compare
Hey @janbuchar I have changed the names as you want can you please review it and tell me know if any code changes required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is shaping up nicely!
- please rename the files to match the classes in them
- could you add some tests to verify the newly added functionality? Also, the existing tests must pass
- please run the code formatter (
yarn format
) to get rid of whitespace-only changes
Working on it today, almost done with the commit |
Hey @janbuchar, I have renamed the files to match the classes in them and added tests, and all the tests are passing too, and also |
7e2c210
to
3cada6a
Compare
Hi @CodeMan62, I still see some test failures - could you look into that please? |
Yes, I am trying to fix them :) |
@janbuchar Can you help me fix this failing tests please? |
@CodeMan62 I can try, but I'm stretched kinda thin right now. What did you already try? Is there anything you could tell me about the failing tests? |
almost 26 tests are failing let me try more to fix as much as I can and then let you know my mind is completely messed with these test but no worries let me put some more efforts |
@janbuchar i think i really need your help in it i am not able to fix the tests help me in it please 🙏 |
Overview
This PR introduces two new components to improve request management in Crawlee:
TandemRequestProvider
: A provider that seamlessly combinesRequestList
andRequestQueue
, allowing crawlers to use both sources efficiently.RequestListAdapter
: An adapter that makesRequestList
compatible with theIRequestProvider
interface.Problem Solved
When using both
RequestList
andRequestQueue
in a crawler, users often need to implement custom logic to ensure URLs aren't processed twice. This implementation standardizes and simplifies this pattern by:RequestList
RequestQueue
in the backgroundRequestQueue
after the list is exhaustedImplementation Details
RequestListAdapter
wraps aRequestList
instance and adapts its interface to matchIRequestProvider
.TandemRequestProvider
orchestrates the flow between the list and queue:IRequestProvider
interfaceIntegration
The implementation is already integrated into
BasicCrawler._initializeRequestProviders()
, making it immediately available to all crawler types without requiring additional configuration changes.Backward Compatibility
This change is fully backward compatible:
RequestList
orRequestQueue
alone will continue to work without changesTesting
The implementation was tested to ensure it correctly:
Closes #2499