
connection pooling does not work when using a proxy #2743

Closed · jdxin0 opened this issue May 17, 2017 · 13 comments

@jdxin0 commented May 17, 2017

Scrapy creates a new TCP4ClientEndpoint for each request when using a proxy in ScrapyAgent, while Twisted's ProxyAgent uses key = ("http-proxy", self._proxyEndpoint) as the connection pool key.
Because every request gets a fresh endpoint object, the pool key never matches an existing one, so a new connection is opened for each request (see the sketch after the excerpts below),
and you eventually get errno 99: cannot assign requested address once all local ports are exhausted (sockets stuck in TIME_WAIT).

scrapy/core/downloader/handlers/http11.py

class ScrapyAgent(object):
    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            proxyHost = to_unicode(proxyHost)
            omitConnectTunnel = b'noconnect' in proxyParams
            if  scheme == b'https' and not omitConnectTunnel:
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get(b'Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                    contextFactory=self._contextFactory, connectTimeout=timeout,
                    bindAddress=bindaddress, pool=self._pool)
            else:
                endpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                    timeout=timeout, bindAddress=bindaddress)
                return self._ProxyAgent(endpoint)

        return self._Agent(reactor, contextFactory=self._contextFactory,
            connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)

twisted/web/client.py

@implementer(IAgent)
class ProxyAgent(_AgentBase):
    """
    An HTTP agent able to cross HTTP proxies.

    @ivar _proxyEndpoint: The endpoint used to connect to the proxy.

    @since: 11.1
    """

    def __init__(self, endpoint, reactor=None, pool=None):
        if reactor is None:
            from twisted.internet import reactor
        _AgentBase.__init__(self, reactor, pool)
        self._proxyEndpoint = endpoint


    def request(self, method, uri, headers=None, bodyProducer=None):
        """
        Issue a new request via the configured proxy.
        """
        # Cache *all* connections under the same key, since we are only
        # connecting to a single destination, the proxy:
        key = ("http-proxy", self._proxyEndpoint)

        # To support proxying HTTPS via CONNECT, we will use key
        # ("http-proxy-CONNECT", scheme, host, port), and an endpoint that
        # wraps _proxyEndpoint with an additional callback to do the CONNECT.
        return self._requestWithEndpoint(key, self._proxyEndpoint, method,
                                         URI.fromBytes(uri), headers,
                                         bodyProducer, uri)
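
To make the mismatch concrete, here is a minimal sketch (not Scrapy code, and assuming TCP4ClientEndpoint compares by object identity, since it defines no value-based equality): ScrapyAgent builds a fresh endpoint per request, so the pool key never equals any cached key and HTTPConnectionPool never reuses a connection.

# Minimal sketch of the pool-key mismatch (not Scrapy code).
# Assumption: TCP4ClientEndpoint has no value-based __eq__, so two endpoints
# for the same proxy are still distinct objects and distinct dict keys.
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint

endpoint_a = TCP4ClientEndpoint(reactor, "127.0.0.1", 1235)  # built for request 1
endpoint_b = TCP4ClientEndpoint(reactor, "127.0.0.1", 1235)  # built for request 2

key_a = ("http-proxy", endpoint_a)
key_b = ("http-proxy", endpoint_b)

print(key_a == key_b)  # False -> the pool treats every request as a new "host"
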
@redapple (Contributor)

Interesting, @jdxin0. Creating an endpoint for ProxyAgent is actually taken from the Twisted docs.
I looked at endpoint keys for CONNECT in #1912.
I'll see if I can reproduce what you're describing with plain HTTP proxies.

@jdxin0 (Author) commented May 27, 2017

@redapple
Here is my test code. It should open only two connections to the proxy server.

import scrapy
from scrapy.crawler import CrawlerProcess


proxy_url = 'http://127.0.0.1:1235'


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        for i in range(10000):
            yield scrapy.Request(
                'http://httpbin.org',
                dont_filter=True,
                meta={'proxy': proxy_url}
            )

    def parse(self, response):
        pass


if __name__ == '__main__':
    settings = {'CONCURRENT_REQUESTS_PER_IP': 2, 'DOWNLOAD_DELAY': 0.1}
    process = CrawlerProcess(settings)
    process.crawl(TestSpider)
    process.start()

I used 'Activity Monitor -> Network -> Open Files and Ports' in macOS's system utilities to monitor the ports opened by the Scrapy process.

[screenshots: Activity Monitor showing the Scrapy process's open ports]

In the screenshots the process uses two ports to the proxy server at any one time, but the ports change quickly (the old connections are not reused).
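
For anyone reproducing this without Activity Monitor, here is a quick sketch using psutil (assuming psutil is installed; run it in a separate terminal with the spider's PID). It prints how many sockets the Scrapy process has open towards the proxy, plus their local ports, so you can watch them churn:

# watch_proxy_ports.py - quick sketch, assumes `pip install psutil`.
# Usage: python watch_proxy_ports.py <scrapy-pid>
import sys
import time

import psutil

PROXY_PORT = 1235  # port of the local proxy used by the test spider above
pid = int(sys.argv[1])

while True:
    conns = [c for c in psutil.Process(pid).connections(kind="tcp")
             if c.raddr and c.raddr.port == PROXY_PORT]
    # If pooling works, the set of local ports should stay stable over time.
    print(len(conns), sorted(c.laddr.port for c in conns))
    time.sleep(1)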

@redapple (Contributor)

@jdxin0, I had to change this to make it work:

$ git diff
diff --git a/scrapy/core/downloader/handlers/http11.py b/scrapy/core/downloader/handlers/http11.py
index 9bfdd80..3763537 100644
--- a/scrapy/core/downloader/handlers/http11.py
+++ b/scrapy/core/downloader/handlers/http11.py
@@ -14,7 +14,7 @@ from twisted.web.iweb import IBodyProducer, UNKNOWN_LENGTH
 from twisted.internet.error import TimeoutError
 from twisted.web.http import _DataLoss, PotentialDataLoss
 from twisted.web.client import Agent, ProxyAgent, ResponseDone, \
-    HTTPConnectionPool, ResponseFailed
+    HTTPConnectionPool, ResponseFailed, URI
 from twisted.internet.endpoints import TCP4ClientEndpoint
 
 from scrapy.http import Headers
@@ -228,10 +228,24 @@ class TunnelingAgent(Agent):
             headers, bodyProducer, requestPath)
 
 
+class ScrapyProxyAgent(ProxyAgent):
+
+    def request(self, method, uri, headers=None, bodyProducer=None):
+        """
+        Issue a new request via the configured proxy.
+        """
+        # Cache *all* connections under the same key, since we are only
+        # connecting to a single destination, the proxy:
+        key = ("http-proxy", self._proxyEndpoint._host, self._proxyEndpoint._port)
+        return self._requestWithEndpoint(key, self._proxyEndpoint, method,
+                                         URI.fromBytes(uri), headers,
+                                         bodyProducer, uri)
+
+
 class ScrapyAgent(object):
 
     _Agent = Agent
-    _ProxyAgent = ProxyAgent
+    _ProxyAgent = ScrapyProxyAgent
     _TunnelingAgent = TunnelingAgent
 
     def __init__(self, contextFactory=None, connectTimeout=10, bindAddress=None, pool=None,
@@ -262,7 +276,7 @@ class ScrapyAgent(object):
             else:
                 endpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                     timeout=timeout, bindAddress=bindaddress)
-                return self._ProxyAgent(endpoint)
+                return self._ProxyAgent(endpoint, pool=self._pool)
 
         return self._Agent(reactor, contextFactory=self._contextFactory,
             connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)

I'll continue looking since using Agent.usingEndpointFactory may be what we want. I'm not sure yet.
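
For reference, a rough sketch of what an Agent.usingEndpointFactory-based variant could look like. The _ProxyEndpointFactory helper is hypothetical, and Agent would still send origin-form request paths (and pool by target URI), so this alone is probably not enough for a plain HTTP proxy:

# Rough sketch only; _ProxyEndpointFactory is a hypothetical helper.
# Caveats: Agent still keys pooled connections by the *target* URI, not by the
# proxy, and still sends origin-form request paths, which a plain HTTP proxy
# normally does not accept, so more work would be needed on top of this.
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from twisted.web.client import Agent, HTTPConnectionPool
from twisted.web.iweb import IAgentEndpointFactory
from zope.interface import implementer


@implementer(IAgentEndpointFactory)
class _ProxyEndpointFactory(object):
    """Always connect to the proxy, whatever the request URI is."""

    def __init__(self, reactor, proxyHost, proxyPort, timeout=10, bindAddress=None):
        self._reactor = reactor
        self._proxyHost = proxyHost
        self._proxyPort = proxyPort
        self._timeout = timeout
        self._bindAddress = bindAddress

    def endpointForURI(self, uri):
        # Ignore uri.host/uri.port and connect to the proxy instead.
        return TCP4ClientEndpoint(self._reactor, self._proxyHost, self._proxyPort,
                                  timeout=self._timeout,
                                  bindAddress=self._bindAddress)


pool = HTTPConnectionPool(reactor)
agent = Agent.usingEndpointFactory(
    reactor, _ProxyEndpointFactory(reactor, "127.0.0.1", 1235), pool=pool)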

@redapple (Contributor)

Another implementation that seems to work:

$ git diff
diff --git a/scrapy/core/downloader/handlers/http11.py b/scrapy/core/downloader/handlers/http11.py
index 9bfdd80..a4b077b 100644
--- a/scrapy/core/downloader/handlers/http11.py
+++ b/scrapy/core/downloader/handlers/http11.py
@@ -14,7 +14,7 @@ from twisted.web.iweb import IBodyProducer, UNKNOWN_LENGTH
 from twisted.internet.error import TimeoutError
 from twisted.web.http import _DataLoss, PotentialDataLoss
 from twisted.web.client import Agent, ProxyAgent, ResponseDone, \
-    HTTPConnectionPool, ResponseFailed
+    HTTPConnectionPool, ResponseFailed, URI
 from twisted.internet.endpoints import TCP4ClientEndpoint
 
 from scrapy.http import Headers
@@ -228,10 +228,33 @@ class TunnelingAgent(Agent):
             headers, bodyProducer, requestPath)
 
 
+class ScrapyProxyAgent(Agent):
+
+    def __init__(self, reactor, proxyURI,
+                 connectTimeout=None, bindAddress=None, pool=None):
+        super(ScrapyProxyAgent, self).__init__(reactor,
+                                               connectTimeout=connectTimeout,
+                                               bindAddress=bindAddress,
+                                               pool=pool)
+        self._proxyURI = URI.fromBytes(proxyURI)
+
+    def request(self, method, uri, headers=None, bodyProducer=None):
+        """
+        Issue a new request via the configured proxy.
+        """
+        # Cache *all* connections under the same key, since we are only
+        # connecting to a single destination, the proxy:
+        proxyEndpoint = self._getEndpoint(self._proxyURI)
+        key = ("http-proxy", self._proxyURI.host, self._proxyURI.port)
+        return self._requestWithEndpoint(key, proxyEndpoint, method,
+                                         URI.fromBytes(uri), headers,
+                                         bodyProducer, uri)
+
+
 class ScrapyAgent(object):
 
     _Agent = Agent
-    _ProxyAgent = ProxyAgent
+    _ProxyAgent = ScrapyProxyAgent
     _TunnelingAgent = TunnelingAgent
 
     def __init__(self, contextFactory=None, connectTimeout=10, bindAddress=None, pool=None,
@@ -260,9 +283,8 @@ class ScrapyAgent(object):
                     contextFactory=self._contextFactory, connectTimeout=timeout,
                     bindAddress=bindaddress, pool=self._pool)
             else:
-                endpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
-                    timeout=timeout, bindAddress=bindaddress)
-                return self._ProxyAgent(endpoint)
+                return self._ProxyAgent(reactor, proxyURI=proxy,
+                    connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
 
         return self._Agent(reactor, contextFactory=self._contextFactory,
             connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)

@jdxin0 (Author) commented May 30, 2017

@redapple Thanks for fixing it. It's really helpful for my use case.

@jdxin0 (Author) commented Jun 6, 2017

@redapple The implementation above for this problem somehow affects the download timeout calculation for requests.

@redapple (Contributor) commented Jun 6, 2017

@jdxin0,
what do you mean by "somehow affects the download timeout calculation"?
Have you tried #2767?

@jdxin0 (Author) commented Jun 7, 2017

I used a monkey patch to fix the proxy connection pooling problem.
But it caused a lot of timeout (30s) errors on requests that I am sure did not take 30s.
When I removed the monkey patch, things went back to normal.

Here is the patch code.

from twisted.web.client import URI
from scrapy.core.downloader.handlers import http11
from scrapy.core.downloader.handlers.http11 import ProxyAgent, _parse, \
    to_unicode, reactor, TCP4ClientEndpoint, ScrapyAgent as _ScrapyAgent


class ScrapyProxyAgent(ProxyAgent):
    def request(self, method, uri, headers=None, bodyProducer=None):
        """
        Issue a new request via the configured proxy.
        """
        # Cache *all* connections under the same key, since we are only
        # connecting to a single destination, the proxy:
        key = ("http-proxy",
               self._proxyEndpoint._host, self._proxyEndpoint._port)
        return self._requestWithEndpoint(key, self._proxyEndpoint, method,
                                         URI.fromBytes(uri), headers,
                                         bodyProducer, uri)


class ScrapyAgent(_ScrapyAgent):
    _ProxyAgent = ScrapyProxyAgent
    
    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            proxyHost = to_unicode(proxyHost)
            omitConnectTunnel = b'noconnect' in proxyParams
            if scheme == b'https' and not omitConnectTunnel:
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get(b'Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                                            contextFactory=self._contextFactory,
                                            connectTimeout=timeout,
                                            bindAddress=bindaddress,
                                            pool=self._pool)
            else:
                endpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                                              timeout=timeout,
                                              bindAddress=bindaddress)
                return self._ProxyAgent(endpoint, pool=self._pool)
        
        return self._Agent(reactor, contextFactory=self._contextFactory,
                           connectTimeout=timeout, bindAddress=bindaddress,
                           pool=self._pool)
    
    
def patch_proxy():
    http11.ScrapyAgent = ScrapyAgent
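
For context, the patch presumably needs to run before requests are downloaded, since it swaps the ScrapyAgent name that HTTP11DownloadHandler looks up in the http11 module at download time. A hypothetical usage (module path is made up):

# Hypothetical usage: apply the monkey patch before starting the crawl, so
# HTTP11DownloadHandler picks up the patched ScrapyAgent on its next
# download_request() call.
from myproject.monkeypatches import patch_proxy  # hypothetical module holding the code above

patch_proxy()

# ...then build and start the CrawlerProcess as usual.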

@redapple (Contributor) commented Jun 7, 2017

Can I ask you to try the branch with my tentative fix instead?

git checkout -b redapple-http-proxy-endpoint-key master
git pull https://github.com/redapple/scrapy.git http-proxy-endpoint-key

@jdxin0 (Author) commented Jun 7, 2017

I tried your fix (redapple-http-proxy-endpoint-key / http-proxy-endpoint-key).
For a few minutes (1 or 2) after the spider started, everything was fine.
After a while, all requests started to fail with HTTP status 502.

The spider logged around two hundred 502 failures per second; there is no way my spider could get two hundred 502 failures per second from the target host server.

And the proxy log suggests that my spider connected to the proxy port, then aborted the connection immediately.

@redapple (Contributor) commented Jun 7, 2017

"The spider logged around two hundred 502 failures per second; there is no way my spider could get two hundred 502 failures per second from the target host server."

And yet it happens. HTTP 502s are a different issue from the original connection-pooling problem when using an HTTP proxy.

If you have proxy logs, you are probably in a better position to debug this. I have no setup to reproduce your use case, so I don't think I can investigate further.

Since it's plain HTTP, it's easier to debug with something like Wireshark. If you can provide a network capture of what's happening, I can maybe have a look.
From what I tested locally, HTTP connections to the HTTP proxy were correctly being reused.
I have not looked at timeouts, though.

@jdxin0
Copy link
Author

jdxin0 commented Jun 8, 2017

Thanks for your advice; it turned out to be a corner case of our self-implemented proxy server.

@redapple added this to the v1.4.1 milestone on Jul 27, 2017
@jdxin0 (Author) commented Sep 4, 2017

@redapple My problem is solved with your patch, after fixing the self-implemented proxy server problem.
