Skip to content

feat: add TandemRequestProvider for combined RequestList and RequestQueue usage #2914

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

CodeMan62
Copy link

@CodeMan62 CodeMan62 commented Apr 5, 2025

Overview

This PR introduces two new components to improve request management in Crawlee:

  • TandemRequestProvider: A provider that seamlessly combines RequestList and RequestQueue, allowing crawlers to use both sources efficiently.
  • RequestListAdapter: An adapter that makes RequestList compatible with the IRequestProvider interface.

Problem Solved

When using both RequestList and RequestQueue in a crawler, users often need to implement custom logic to ensure URLs aren't processed twice. This implementation standardizes and simplifies this pattern by:

  1. First processing requests from the RequestList
  2. Automatically transferring these requests to the RequestQueue in the background
  3. Ensuring each URL is processed exactly once
  4. Gracefully transitioning to RequestQueue after the list is exhausted

Implementation Details

  • The RequestListAdapter wraps a RequestList instance and adapts its interface to match IRequestProvider.
  • The TandemRequestProvider orchestrates the flow between the list and queue:
    • It implements the IRequestProvider interface
    • Handles request processing through the queue while ensuring list requests are enqueued first
    • Maintains proper request state tracking across both sources
    • Implements a background transfer mechanism to move requests efficiently

Integration

The implementation is already integrated into BasicCrawler._initializeRequestProviders(), making it immediately available to all crawler types without requiring additional configuration changes.

Backward Compatibility

This change is fully backward compatible:

  • Existing code using either RequestList or RequestQueue alone will continue to work without changes
  • This introduces an enhancement without modifying existing behavior

Testing

The implementation was tested to ensure it correctly:

  • Transfers requests from list to queue
  • Processes requests in the correct order
  • Handles errors appropriately
  • Maintains proper request state

Closes #2499

@barjin
Copy link
Contributor

barjin commented Apr 5, 2025

Please merge the latest master into your branch so we only get your changes in the diff. Cheers!

@CodeMan62 CodeMan62 force-pushed the feat/tandem-request-provider branch from fbd4a80 to 7b7260e Compare April 5, 2025 19:42
@CodeMan62
Copy link
Author

@barjin Everything is done know you can review the PR thanks

Copy link
Contributor

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your initiative!

There are some things I'm not too sure about (see my comments). Also, I see that your branch is almost a year (~500 commits) behind our master. Could you please merge the latest apify/crawlee master into your branch, so we can test the changes later? Cheers!

@barjin barjin changed the title Add TandemRequestProvider for combined RequestList and RequestQueue usage feat: add TandemRequestProvider for combined RequestList and RequestQueue usage Apr 7, 2025
@CodeMan62
Copy link
Author

Thank you for your initiative!

There are some things I'm not too sure about (see my comments). Also, I see that your branch is almost a year (~500 commits) behind our master. Could you please merge the latest apify/crawlee master into your branch, so we can test the changes later? Cheers!

Hii @barjin I am really too much confused about merging the latest master into my branch can you tell me how to do that please

@barjin
Copy link
Contributor

barjin commented Apr 7, 2025

Thank you for your initiative!
There are some things I'm not too sure about (see my comments). Also, I see that your branch is almost a year (~500 commits) behind our master. Could you please merge the latest apify/crawlee master into your branch, so we can test the changes later? Cheers!

Hii @barjin I am really too much confused about merging the latest master into my branch can you tell me how to do that please

See the GitHub documentation at https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork?platform=linux . I see you might have already merged our master branch into your master branch - you want to sync our master into your feat/tandem-request-provider.

Try going here and find the "Sync fork" button.

@CodeMan62
Copy link
Author

It's done now let me resolve the requested changes Thanks for guide

@CodeMan62
Copy link
Author

Hey @barjin Now the PR is ready to review/merge can you check and let me know if any changes required

Copy link
Contributor

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this already looks much better.

I still have some questions regarding the background transfer of the RequestList requests to the RequestQueue. I'm still not convinced this is the way to go here, especially since we've always done this lazily until now.

I'll request reviews from our other maintainers to make sure I'm not missing anything. Cheers!

@barjin barjin requested review from janbuchar, B4nan and Copilot April 9, 2025 08:24
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

@janbuchar
Copy link
Contributor

Thanks @CodeMan62 for the contribution! I assume you're trying to fix #2499? If so, we should link it to this PR.

I haven't reviewed the code yet, but here are some points I want to make:

  1. Eventually, we want the tandem class to be called RequestManagerTandem so that it is consistent with the Python version of Crawlee. Can you rename IRequestProvider to IRequestManager as well? The RequestProvider and so on has to stay like that until we start working on a breaking release, so let's not rename that yet.

I still have some questions regarding the background transfer of the RequestList requests to the RequestQueue. I'm still not convinced this is the way to go here, especially since we've always done this lazily until now.

While issue #2499 mentions that we might want to transfer the request list independently of fetchNextRequest calls, I agree that we should tread carefully. Just extracting the tandem behavior as it is now should be sufficient. I made an initial draft a while back here https://github.com/apify/crawlee/pull/2898/files#diff-d631fbcd81c43142e84b9d2d01ead93451a95344001815f83c88f956ba14557a

…ed transfers instead of immediate background transfer
@CodeMan62
Copy link
Author

Hey @barjin Thanks for the thorough review! I've addressed both concerns:

  1. Changed RequestList -> RequestQueue transfer to be lazy and batched (25 requests at a time):

    • Only transfers when fetchNextRequest() needs more requests
    • Prevents memory spikes and unnecessary API calls
    • Allows graceful stopping without wasted operations
  2. Fixed the process hanging issue by removing background Promise:

    • No more silent operations after crawler.stop()
    • Process exits cleanly since transfers are now on-demand

The new implementation should handle large RequestLists efficiently while being safer to use. Let me know if you'd like any adjustments!

@CodeMan62
Copy link
Author

Hey @janbuchar Thanks for the feedback! I've made the following changes:

  1. Renamed TandemRequestProvider to RequestManagerTandem to align with Python Crawlee's naming
  2. Updated the implementation to use lazy loading with batched transfers (25 requests at a time) instead of background transfer
  3. Left IRequestProvider as is for now, waiting for the breaking release to rename it to IRequestManager

And yes, this PR is addressing #2499 . I've looked at your initial draft and implemented a similar approach with batched transfers that should be more efficient while maintaining the same functionality.

Copy link
Contributor

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have some comments, but they are not as pressing as the previous ones (and reflect rather my ideas than actual issues with the code).

Once we discuss the comments, I don't see any big issues anymore, so this PR is actually a soft go from me. Thanks!

@CodeMan62
Copy link
Author

@janbuchar Regarding the rename to RequestManagerTandem - I've reverted this change for now. I think we should:

  1. Discuss the naming convention changes more broadly
  2. Plan this as part of a coordinated update with other renames
  3. Consider it for the next breaking release along with the IRequestProvider -> IRequestManager change

Let me know if you agree with this approach or if you'd prefer to handle the rename differently.

@CodeMan62
Copy link
Author

@barjin I have done the changes you have asked for take a look at the changes and let me know if any changes required

@janbuchar
Copy link
Contributor

@janbuchar Regarding the rename to RequestManagerTandem - I've reverted this change for now. I think we should:

  1. Discuss the naming convention changes more broadly
  2. Plan this as part of a coordinated update with other renames
  3. Consider it for the next breaking release along with the IRequestProvider -> IRequestManager change

Let me know if you agree with this approach or if you'd prefer to handle the rename differently.

Hi @CodeMan62, i disagree. There is no good reason to introduce another set of names just to change them later. We should keep the current names backwards compatible for now, but where it's applicable, we should use the naming from the python port - that is the target state:

  • instead of TandemRequestProvider, there should be RequestManagerTandem
  • instead of IRequestProvider, we should have IRequestManager
  • RequestProvider should implement IRequestManager

@CodeMan62
Copy link
Author

@janbuchar No worries, let me change the names as you want

@CodeMan62 CodeMan62 force-pushed the feat/tandem-request-provider branch from cf41668 to 2f6a90f Compare April 15, 2025 18:56
@CodeMan62
Copy link
Author

Hey @janbuchar I have changed the names as you want can you please review it and tell me know if any code changes required

Copy link
Contributor

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is shaping up nicely!

  • please rename the files to match the classes in them
  • could you add some tests to verify the newly added functionality? Also, the existing tests must pass
  • please run the code formatter (yarn format) to get rid of whitespace-only changes

@CodeMan62
Copy link
Author

This is shaping up nicely!

  • please rename the files to match the classes in them
  • could you add some tests to verify the newly added functionality? Also, the existing tests must pass
  • please run the code formatter (yarn format) to get rid of whitespace-only changes

Working on it today, almost done with the commit

@CodeMan62
Copy link
Author

Hey @janbuchar, I have renamed the files to match the classes in them and added tests, and all the tests are passing too, and also yarn format is done. Let me know if any changes are required! :)

@CodeMan62 CodeMan62 force-pushed the feat/tandem-request-provider branch from 7e2c210 to 3cada6a Compare April 17, 2025 16:20
@janbuchar
Copy link
Contributor

Hi @CodeMan62, I still see some test failures - could you look into that please?

@CodeMan62
Copy link
Author

CodeMan62 commented Apr 25, 2025

Hi @CodeMan62, I still see some test failures - could you look into that please?

Yes, I am trying to fix them :)
EDIT:- Working on it

@CodeMan62
Copy link
Author

@janbuchar Can you help me fix this failing tests please?

@janbuchar
Copy link
Contributor

@CodeMan62 I can try, but I'm stretched kinda thin right now. What did you already try? Is there anything you could tell me about the failing tests?

@CodeMan62
Copy link
Author

@CodeMan62 I can try, but I'm stretched kinda thin right now. What did you already try? Is there anything you could tell me about the failing tests?

almost 26 tests are failing let me try more to fix as much as I can and then let you know my mind is completely messed with these test but no worries let me put some more efforts

@CodeMan62
Copy link
Author

@janbuchar i think i really need your help in it i am not able to fix the tests help me in it please 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unify RequestList and RequestProvider interfaces and extract their "tandem" behavior from BasicCrawler
3 participants