I would like more options for managing connections to a domain.
My understanding is that connections are managed through `twisted.web.client.HTTPConnectionPool`, which, as the name suggests, creates a pool of connections (capped by `CONCURRENT_REQUESTS_PER_DOMAIN`, which translates into `maxPersistentPerHost` on the pool) over which requests are balanced according, I guess, to the load.
A few things could be added:

- At pool creation we currently have `self._pool = HTTPConnectionPool(reactor, persistent=True)`; `persistent` could depend on a dedicated setting (default: True).
- `self._pool.cachedConnectionTimeout = settings.getint('DOMAIN_IDLE_CONNECTION_TIMEOUT')` could be added to shorten how long an idle connection is kept open (the name is based on my understanding of what the attribute does; anything more proper is welcome).
- An option to call `closeCachedConnections()` so that all connections in the `HTTPConnectionPool` are forcibly closed (which is what interests me most).
Motivation
Some rotating proxy services are connection-based: as long as the connection stays open, the IP is not rotated. In my case, with the default `CONCURRENT_REQUESTS_PER_DOMAIN` of 8, only 8 exit IPs are used, and they are recycled indefinitely.
Describe alternatives you've considered
I tried tweaking `CONCURRENT_REQUESTS_PER_DOMAIN`, hoping a sufficiently big number would ensure a pool wide enough for my requests, but the same connections seem to be reused: setting it to 9999999 ended up opening about 50 concurrent connections and passing all my requests through those 50 connections (tested on >2000 requests).
The only alternative I found is scrapy-playwright, where I can specify a dedicated context per request, but when I don't need a JS interpreter or any of the other fancy things Playwright can do, the crawling speed drops drastically (I can scrape https://api.ipify.org/ at over 1500 pages/sec with a standard Scrapy request, versus 90 pages/sec with Playwright).
Having `persistent = False` closes each connection after a request is sent. It answers my need, but it's an all-or-nothing solution.
Using `closeCachedConnections()` (with `persistent = True`) does not do much, at least nothing I can see. I also tried calling the `close()` method within `download_request()`, but again it makes no visible difference.
Do you have any idea how to forcibly close the pool of connections?