
More flexible management of connections to a domain (HTTPConnectionPool) #5618

PsykotropyK opened this issue Sep 6, 2022 · 1 comment

PsykotropyK commented Sep 6, 2022

Summary

I would like more options on how to manage connections to a domain.

My understanding is that connections are managed through twisted.web.client.HTTPConnectionPool, which, as the name suggests, creates a pool of connections (limited by CONCURRENT_REQUESTS_PER_DOMAIN, which translates into maxPersistentPerHost on twisted.web.client.HTTPConnectionPool) over which requests are balanced according, I guess, to the load.

A few things could be added:

  • At pool creation we currently have: self._pool = HTTPConnectionPool(reactor, persistent=True) --> persistent could be made dependent on a dedicated setting (default = True)
  • self._pool.cachedConnectionTimeout = settings.getint('DOMAIN_IDLE_CONNECTION_TIMEOUT') could be added to lower the time an idle connection stays open (the setting name is based on my understanding of what it does; anything more proper is welcome)
  • an option to call closeCachedConnections() so that all connections in the HTTPConnectionPool are forcibly closed (which is what mostly interests me); a rough sketch of how these could be wired together follows this list
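For illustration, here is a minimal sketch of how these knobs might be exposed from a custom download handler. The class name, the setting names DOMAIN_PERSISTENT_CONNECTIONS and DOMAIN_IDLE_CONNECTION_TIMEOUT, and the exact __init__ signature are assumptions for the sake of example, not existing Scrapy API:

```python
# Hypothetical sketch only: the setting names and the __init__ signature are
# assumptions and may differ between Scrapy versions.
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler


class ConfigurablePoolDownloadHandler(HTTP11DownloadHandler):
    def __init__(self, settings, crawler=None):
        super().__init__(settings, crawler=crawler)
        # Make connection persistence configurable instead of hard-coded to True.
        self._pool.persistent = settings.getbool('DOMAIN_PERSISTENT_CONNECTIONS', True)
        # Drop idle cached connections after this many seconds
        # (Twisted's own default is 240).
        self._pool.cachedConnectionTimeout = settings.getint('DOMAIN_IDLE_CONNECTION_TIMEOUT', 240)
```

Such a handler could then be enabled for the http and https schemes through the existing DOWNLOAD_HANDLERS setting.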

Motivation

Some rotating proxy services are connection-based: as long as the connection stays open, the IP is not rotated. In my case, it means that with the default CONCURRENT_REQUESTS_PER_DOMAIN only 8 exit IPs are used, and they are recycled indefinitely.

Describe alternatives you've considered

I tried tweaking CONCURRENT_REQUESTS_PER_DOMAIN, hoping that a sufficiently big number would give a pool wide enough to handle my requests, but the same connections seem to be reused: setting it to 9999999 still ended up opening about 50 concurrent connections and passing all my requests through those 50 connections (tested on >2000 requests).

The only alternative I found is scrapy-playwright, where I can specify a dedicated context to use for a request, but the crawling speed drops drastically when I do not need a JS interpreter or any of the other fancy things playwright can do (I can scrape https://api.ipify.org/ at over 1500 pages/sec with standard scrapy requests, and at 90 pages/sec with playwright).
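For reference, this is roughly what that workaround looks like: scrapy-playwright lets a request ask for a dedicated browser context via the playwright_context meta key (assuming scrapy-playwright is installed and configured as the download handler); the spider and context names below are just an illustration:

```python
# Illustration of per-request browser contexts with scrapy-playwright.
# A fresh context opens its own connections, so the rotating proxy sees a
# new connection (and can hand out a new exit IP) for each context.
import scrapy


class IpSpider(scrapy.Spider):
    name = "ip"

    def start_requests(self):
        for i in range(10):
            yield scrapy.Request(
                "https://api.ipify.org/",
                meta={
                    "playwright": True,
                    "playwright_context": f"ctx-{i}",  # arbitrary context name
                },
                dont_filter=True,
            )
```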

PsykotropyK added a commit to PsykotropyK/scrapy that referenced this issue Sep 6, 2022
PsykotropyK (Author) commented

So I ran some tests.

  1. Setting persistent = False closes each connection after a request is sent. It answers my need, but it is an all-or-nothing option.
  2. Using closeCachedConnections() (with persistent = True) does not do much (at least nothing that I can see). I also tried to call the close() method within download_request(), but again it does not end in any visible difference (a rough sketch of this attempt follows the list).
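For completeness, here is roughly what the second test looks like, assuming it was wired by subclassing HTTP11DownloadHandler and calling closeCachedConnections() from download_request(); that wiring is my assumption, not an official Scrapy extension point:

```python
# Sketch of the second test: force-close the pool's cached connections
# before each request (assumed wiring, not an official Scrapy API).
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler


class ForceCloseDownloadHandler(HTTP11DownloadHandler):
    def download_request(self, request, spider):
        # closeCachedConnections() only drops idle connections waiting in the
        # pool's cache; connections currently busy with a request are left
        # alone, which may be why no visible difference shows up under load.
        self._pool.closeCachedConnections()
        return super().download_request(request, spider)
```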

Do you have any guess on how to forcibly close the pool of connections?
