
More flexible management of connections to a domain (HTTPConnectionPool) #5618

PsykotropyK opened this issue Sep 6, 2022 · 1 comment

PsykotropyK commented Sep 6, 2022

Summary

I would like more options on how to manage connections to a domain.

My understanding is that connections are managed through twisted.web.client.HTTPConnectionPool, which, as the name suggests, creates a pool of connections (limited by CONCURRENT_REQUESTS_PER_DOMAIN, which translates into maxPersistentPerHost on twisted.web.client.HTTPConnectionPool) over which requests are balanced according, I guess, to the load.

A few things could be added:

  • At pool creation we currently have: self._pool = HTTPConnectionPool(reactor, persistent=True) --> persistent could be made dependent on a dedicated setting (default = True)
  • self._pool.cachedConnectionTimeout = settings.getint('DOMAIN_IDLE_CONNECTION_TIMEOUT') could be added to lower the time an idle connection stays open (the setting name is based on my understanding of what it does; anything more proper is welcome)
  • an option to call closeCachedConnections() so that all connections in the HTTPConnectionPool are forcibly closed (which is what mostly interests me); a rough sketch of how these could be wired together follows this list
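For illustration, here is a minimal sketch of how these knobs might be exposed from a custom download handler. The class name, the setting names DOMAIN_PERSISTENT_CONNECTIONS and DOMAIN_IDLE_CONNECTION_TIMEOUT, and the exact __init__ signature are assumptions for the sake of example, not existing Scrapy API:

```python
# Hypothetical sketch only: the setting names and the __init__ signature are
# assumptions and may differ between Scrapy versions.
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler


class ConfigurablePoolDownloadHandler(HTTP11DownloadHandler):
    def __init__(self, settings, crawler=None):
        super().__init__(settings, crawler=crawler)
        # Make connection persistence configurable instead of hard-coded to True.
        self._pool.persistent = settings.getbool('DOMAIN_PERSISTENT_CONNECTIONS', True)
        # Drop idle cached connections after this many seconds
        # (Twisted's own default is 240).
        self._pool.cachedConnectionTimeout = settings.getint('DOMAIN_IDLE_CONNECTION_TIMEOUT', 240)
```

Such a handler could then be enabled for the http and https schemes through the existing DOWNLOAD_HANDLERS setting.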

Motivation

Some rotating proxy services are connection-based: as long as the connection stays open, the IP is not rotated. In my case, it means that with the default CONCURRENT_REQUESTS_PER_DOMAIN only 8 exit IPs are used, and they are recycled indefinitely.

Describe alternatives you've considered

I tried tweaking CONCURRENT_REQUESTS_PER_DOMAIN, hoping that a sufficiently big number would give a pool wide enough to handle my requests, but the same connections seem to be reused: setting it to 9999999 still ended up opening about 50 concurrent connections and passing all my requests through those 50 connections (tested on >2000 requests).

The only alternative I found is scrapy-playwright, where I can specify a dedicated context to use for a request, but the crawling speed drops drastically when I do not need a JS interpreter or any of the other fancy things playwright can do (I can scrape https://api.ipify.org/ at over 1500 pages/sec with standard scrapy requests, and at 90 pages/sec with playwright).
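For reference, this is roughly what that workaround looks like: scrapy-playwright lets a request ask for a dedicated browser context via the playwright_context meta key (assuming scrapy-playwright is installed and configured as the download handler); the spider and context names below are just an illustration:

```python
# Illustration of per-request browser contexts with scrapy-playwright.
# A fresh context opens its own connections, so the rotating proxy sees a
# new connection (and can hand out a new exit IP) for each context.
import scrapy


class IpSpider(scrapy.Spider):
    name = "ip"

    def start_requests(self):
        for i in range(10):
            yield scrapy.Request(
                "https://api.ipify.org/",
                meta={
                    "playwright": True,
                    "playwright_context": f"ctx-{i}",  # arbitrary context name
                },
                dont_filter=True,
            )
```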

PsykotropyK added a commit to PsykotropyK/scrapy that referenced this issue Sep 6, 2022
PsykotropyK (Author) commented

So I ran some tests.

  1. Setting persistent = False closes each connection after a request is sent. It answers my need, but it is an all-or-nothing option.
  2. Using closeCachedConnections() (with persistent = True) does not do much (at least nothing that I can see). I also tried to call the close() method within download_request(), but again it does not end in any visible difference (a rough sketch of this attempt follows the list).
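For completeness, here is roughly what the second test looks like, assuming it was wired by subclassing HTTP11DownloadHandler and calling closeCachedConnections() from download_request(); that wiring is my assumption, not an official Scrapy extension point:

```python
# Sketch of the second test: force-close the pool's cached connections
# before each request (assumed wiring, not an official Scrapy API).
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler


class ForceCloseDownloadHandler(HTTP11DownloadHandler):
    def download_request(self, request, spider):
        # closeCachedConnections() only drops idle connections waiting in the
        # pool's cache; connections currently busy with a request are left
        # alone, which may be why no visible difference shows up under load.
        self._pool.closeCachedConnections()
        return super().download_request(request, spider)
```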

Do you have any guess on how to forcibly close the pool of connections?
