
connection pooling does not work when using a proxy #2743

Closed · jdxin0 opened this issue May 17, 2017 · 13 comments

@jdxin0 commented May 17, 2017

Scrapy creates a new TCP4ClientEndpoint for each request when using a proxy in ScrapyAgent, while Twisted's ProxyAgent uses key = ("http-proxy", self._proxyEndpoint) as the connection pool key.
Because every request gets a fresh endpoint object, the pool key never matches an existing one, so a new connection is opened for each request (see the sketch after the excerpts below),
and you eventually get errno 99: cannot assign requested address once all local ports are exhausted (sockets stuck in TIME_WAIT).

scrapy/core/downloader/handlers/http11.py

class ScrapyAgent(object):
    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            proxyHost = to_unicode(proxyHost)
            omitConnectTunnel = b'noconnect' in proxyParams
            if  scheme == b'https' and not omitConnectTunnel:
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get(b'Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                    contextFactory=self._contextFactory, connectTimeout=timeout,
                    bindAddress=bindaddress, pool=self._pool)
            else:
                endpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                    timeout=timeout, bindAddress=bindaddress)
                return self._ProxyAgent(endpoint)

        return self._Agent(reactor, contextFactory=self._contextFactory,
            connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)

twisted/web/client.py

@implementer(IAgent)
class ProxyAgent(_AgentBase):
    """
    An HTTP agent able to cross HTTP proxies.

    @ivar _proxyEndpoint: The endpoint used to connect to the proxy.

    @since: 11.1
    """

    def __init__(self, endpoint, reactor=None, pool=None):
        if reactor is None:
            from twisted.internet import reactor
        _AgentBase.__init__(self, reactor, pool)
        self._proxyEndpoint = endpoint


    def request(self, method, uri, headers=None, bodyProducer=None):
        """
        Issue a new request via the configured proxy.
        """
        # Cache *all* connections under the same key, since we are only
        # connecting to a single destination, the proxy:
        key = ("http-proxy", self._proxyEndpoint)

        # To support proxying HTTPS via CONNECT, we will use key
        # ("http-proxy-CONNECT", scheme, host, port), and an endpoint that
        # wraps _proxyEndpoint with an additional callback to do the CONNECT.
        return self._requestWithEndpoint(key, self._proxyEndpoint, method,
                                         URI.fromBytes(uri), headers,
                                         bodyProducer, uri)
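
To make the mismatch concrete, here is a minimal sketch (not Scrapy code, and assuming TCP4ClientEndpoint compares by object identity, since it defines no value-based equality): ScrapyAgent builds a fresh endpoint per request, so the pool key never equals any cached key and HTTPConnectionPool never reuses a connection.

# Minimal sketch of the pool-key mismatch (not Scrapy code).
# Assumption: TCP4ClientEndpoint has no value-based __eq__, so two endpoints
# for the same proxy are still distinct objects and distinct dict keys.
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint

endpoint_a = TCP4ClientEndpoint(reactor, "127.0.0.1", 1235)  # built for request 1
endpoint_b = TCP4ClientEndpoint(reactor, "127.0.0.1", 1235)  # built for request 2

key_a = ("http-proxy", endpoint_a)
key_b = ("http-proxy", endpoint_b)

print(key_a == key_b)  # False -> the pool treats every request as a new "host"
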
@redapple (Contributor)

Interesting, @jdxin0. Creating an endpoint for ProxyAgent is actually taken from the Twisted docs.
I looked at endpoint keys for CONNECT in #1912.
I'll see if I can reproduce what you're describing with plain HTTP proxies.

@jdxin0 (Author) commented May 27, 2017

@redapple
Here is my test code. It should open only two connections to the proxy server.

import scrapy
from scrapy.crawler import CrawlerProcess


proxy_url = 'http://127.0.0.1:1235'


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        for i in range(10000):
            yield scrapy.Request(
                'http://httpbin.org',
                dont_filter=True,
                meta={'proxy': proxy_url}
            )

    def parse(self, response):
        pass


if __name__ == '__main__':
    settings = {'CONCURRENT_REQUESTS_PER_IP': 2, 'DOWNLOAD_DELAY': 0.1}
    process = CrawlerProcess(settings)
    process.crawl(TestSpider)
    process.start()

I used 'Activity Monitor -> Network -> Open Files and Ports' in macOS's system utilities to monitor the ports opened by the Scrapy process.

[screenshots: Activity Monitor showing the Scrapy process's open ports]

In the screenshots the process uses two ports to the proxy server at any one time, but the ports change quickly (the old connections are not reused).
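
For anyone reproducing this without Activity Monitor, here is a quick sketch using psutil (assuming psutil is installed; run it in a separate terminal with the spider's PID). It prints how many sockets the Scrapy process has open towards the proxy, plus their local ports, so you can watch them churn:

# watch_proxy_ports.py - quick sketch, assumes `pip install psutil`.
# Usage: python watch_proxy_ports.py <scrapy-pid>
import sys
import time

import psutil

PROXY_PORT = 1235  # port of the local proxy used by the test spider above
pid = int(sys.argv[1])

while True:
    conns = [c for c in psutil.Process(pid).connections(kind="tcp")
             if c.raddr and c.raddr.port == PROXY_PORT]
    # If pooling works, the set of local ports should stay stable over time.
    print(len(conns), sorted(c.laddr.port for c in conns))
    time.sleep(1)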

@redapple (Contributor)

@jdxin0, I had to change this to make it work:

$ git diff
diff --git a/scrapy/core/downloader/handlers/http11.py b/scrapy/core/downloader/handlers/http11.py
index 9bfdd80..3763537 100644
--- a/scrapy/core/downloader/handlers/http11.py
+++ b/scrapy/core/downloader/handlers/http11.py
@@ -14,7 +14,7 @@ from twisted.web.iweb import IBodyProducer, UNKNOWN_LENGTH
 from twisted.internet.error import TimeoutError
 from twisted.web.http import _DataLoss, PotentialDataLoss
 from twisted.web.client import Agent, ProxyAgent, ResponseDone, \
-    HTTPConnectionPool, ResponseFailed
+    HTTPConnectionPool, ResponseFailed, URI
 from twisted.internet.endpoints import TCP4ClientEndpoint
 
 from scrapy.http import Headers
@@ -228,10 +228,24 @@ class TunnelingAgent(Agent):
             headers, bodyProducer, requestPath)
 
 
+class ScrapyProxyAgent(ProxyAgent):
+
+    def request(self, method, uri, headers=None, bodyProducer=None):
+        """
+        Issue a new request via the configured proxy.
+        """
+        # Cache *all* connections under the same key, since we are only
+        # connecting to a single destination, the proxy:
+        key = ("http-proxy", self._proxyEndpoint._host, self._proxyEndpoint._port)
+        return self._requestWithEndpoint(key, self._proxyEndpoint, method,
+                                         URI.fromBytes(uri), headers,
+                                         bodyProducer, uri)
+
+
 class ScrapyAgent(object):
 
     _Agent = Agent
-    _ProxyAgent = ProxyAgent
+    _ProxyAgent = ScrapyProxyAgent
     _TunnelingAgent = TunnelingAgent
 
     def __init__(self, contextFactory=None, connectTimeout=10, bindAddress=None, pool=None,
@@ -262,7 +276,7 @@ class ScrapyAgent(object):
             else:
                 endpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                     timeout=timeout, bindAddress=bindaddress)
-                return self._ProxyAgent(endpoint)
+                return self._ProxyAgent(endpoint, pool=self._pool)
 
         return self._Agent(reactor, contextFactory=self._contextFactory,
             connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)

I'll continue looking since using Agent.usingEndpointFactory may be what we want. I'm not sure yet.
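
For reference, a rough sketch of what an Agent.usingEndpointFactory-based variant could look like. The _ProxyEndpointFactory helper is hypothetical, and Agent would still send origin-form request paths (and pool by target URI), so this alone is probably not enough for a plain HTTP proxy:

# Rough sketch only; _ProxyEndpointFactory is a hypothetical helper.
# Caveats: Agent still keys pooled connections by the *target* URI, not by the
# proxy, and still sends origin-form request paths, which a plain HTTP proxy
# normally does not accept, so more work would be needed on top of this.
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from twisted.web.client import Agent, HTTPConnectionPool
from twisted.web.iweb import IAgentEndpointFactory
from zope.interface import implementer


@implementer(IAgentEndpointFactory)
class _ProxyEndpointFactory(object):
    """Always connect to the proxy, whatever the request URI is."""

    def __init__(self, reactor, proxyHost, proxyPort, timeout=10, bindAddress=None):
        self._reactor = reactor
        self._proxyHost = proxyHost
        self._proxyPort = proxyPort
        self._timeout = timeout
        self._bindAddress = bindAddress

    def endpointForURI(self, uri):
        # Ignore uri.host/uri.port and connect to the proxy instead.
        return TCP4ClientEndpoint(self._reactor, self._proxyHost, self._proxyPort,
                                  timeout=self._timeout,
                                  bindAddress=self._bindAddress)


pool = HTTPConnectionPool(reactor)
agent = Agent.usingEndpointFactory(
    reactor, _ProxyEndpointFactory(reactor, "127.0.0.1", 1235), pool=pool)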

@redapple (Contributor)

Another implementation that seems to work:

$ git diff
diff --git a/scrapy/core/downloader/handlers/http11.py b/scrapy/core/downloader/handlers/http11.py
index 9bfdd80..a4b077b 100644
--- a/scrapy/core/downloader/handlers/http11.py
+++ b/scrapy/core/downloader/handlers/http11.py
@@ -14,7 +14,7 @@ from twisted.web.iweb import IBodyProducer, UNKNOWN_LENGTH
 from twisted.internet.error import TimeoutError
 from twisted.web.http import _DataLoss, PotentialDataLoss
 from twisted.web.client import Agent, ProxyAgent, ResponseDone, \
-    HTTPConnectionPool, ResponseFailed
+    HTTPConnectionPool, ResponseFailed, URI
 from twisted.internet.endpoints import TCP4ClientEndpoint
 
 from scrapy.http import Headers
@@ -228,10 +228,33 @@ class TunnelingAgent(Agent):
             headers, bodyProducer, requestPath)
 
 
+class ScrapyProxyAgent(Agent):
+
+    def __init__(self, reactor, proxyURI,
+                 connectTimeout=None, bindAddress=None, pool=None):
+        super(ScrapyProxyAgent, self).__init__(reactor,
+                                               connectTimeout=connectTimeout,
+                                               bindAddress=bindAddress,
+                                               pool=pool)
+        self._proxyURI = URI.fromBytes(proxyURI)
+
+    def request(self, method, uri, headers=None, bodyProducer=None):
+        """
+        Issue a new request via the configured proxy.
+        """
+        # Cache *all* connections under the same key, since we are only
+        # connecting to a single destination, the proxy:
+        proxyEndpoint = self._getEndpoint(self._proxyURI)
+        key = ("http-proxy", self._proxyURI.host, self._proxyURI.port)
+        return self._requestWithEndpoint(key, proxyEndpoint, method,
+                                         URI.fromBytes(uri), headers,
+                                         bodyProducer, uri)
+
+
 class ScrapyAgent(object):
 
     _Agent = Agent
-    _ProxyAgent = ProxyAgent
+    _ProxyAgent = ScrapyProxyAgent
     _TunnelingAgent = TunnelingAgent
 
     def __init__(self, contextFactory=None, connectTimeout=10, bindAddress=None, pool=None,
@@ -260,9 +283,8 @@ class ScrapyAgent(object):
                     contextFactory=self._contextFactory, connectTimeout=timeout,
                     bindAddress=bindaddress, pool=self._pool)
             else:
-                endpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
-                    timeout=timeout, bindAddress=bindaddress)
-                return self._ProxyAgent(endpoint)
+                return self._ProxyAgent(reactor, proxyURI=proxy,
+                    connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
 
         return self._Agent(reactor, contextFactory=self._contextFactory,
             connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)

@jdxin0 (Author) commented May 30, 2017

@redapple Thanks for fixing it. It's really helpful for my use case.

@jdxin0 (Author) commented Jun 6, 2017

@redapple The implementation above for this problem somehow affects the download timeout calculation for requests.

@redapple (Contributor) commented Jun 6, 2017

@jdxin0,
what do you mean by "somehow affects the download timeout calculation"?
Have you tried #2767?

@jdxin0 (Author) commented Jun 7, 2017

I used a monkey patch to fix the proxy connection pooling problem.
But it caused a lot of timeout (30s) errors on requests that I am sure did not take 30s.
When I removed the monkey patch, things went back to normal.

Here is the patch code.

from twisted.web.client import URI
from scrapy.core.downloader.handlers import http11
from scrapy.core.downloader.handlers.http11 import ProxyAgent, _parse, \
    to_unicode, reactor, TCP4ClientEndpoint, ScrapyAgent as _ScrapyAgent


class ScrapyProxyAgent(ProxyAgent):
    def request(self, method, uri, headers=None, bodyProducer=None):
        """
        Issue a new request via the configured proxy.
        """
        # Cache *all* connections under the same key, since we are only
        # connecting to a single destination, the proxy:
        key = ("http-proxy",
               self._proxyEndpoint._host, self._proxyEndpoint._port)
        return self._requestWithEndpoint(key, self._proxyEndpoint, method,
                                         URI.fromBytes(uri), headers,
                                         bodyProducer, uri)


class ScrapyAgent(_ScrapyAgent):
    _ProxyAgent = ScrapyProxyAgent
    
    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            proxyHost = to_unicode(proxyHost)
            omitConnectTunnel = b'noconnect' in proxyParams
            if scheme == b'https' and not omitConnectTunnel:
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get(b'Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                                            contextFactory=self._contextFactory,
                                            connectTimeout=timeout,
                                            bindAddress=bindaddress,
                                            pool=self._pool)
            else:
                endpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                                              timeout=timeout,
                                              bindAddress=bindaddress)
                return self._ProxyAgent(endpoint, pool=self._pool)
        
        return self._Agent(reactor, contextFactory=self._contextFactory,
                           connectTimeout=timeout, bindAddress=bindaddress,
                           pool=self._pool)
    
    
def patch_proxy():
    http11.ScrapyAgent = ScrapyAgent
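
For context, the patch presumably needs to run before requests are downloaded, since it swaps the ScrapyAgent name that HTTP11DownloadHandler looks up in the http11 module at download time. A hypothetical usage (module path is made up):

# Hypothetical usage: apply the monkey patch before starting the crawl, so
# HTTP11DownloadHandler picks up the patched ScrapyAgent on its next
# download_request() call.
from myproject.monkeypatches import patch_proxy  # hypothetical module holding the code above

patch_proxy()

# ...then build and start the CrawlerProcess as usual.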

@redapple (Contributor) commented Jun 7, 2017

Can I ask you to try the branch with my tentative fix instead?

git checkout -b redapple-http-proxy-endpoint-key master
git pull https://github.com/redapple/scrapy.git http-proxy-endpoint-key

@jdxin0 (Author) commented Jun 7, 2017

I tried your fix (redapple-http-proxy-endpoint-key / http-proxy-endpoint-key).
For a few minutes (1 or 2) after the spider started, everything was fine.
After a while, all requests started to fail with HTTP status 502.

The spider logged around two hundred 502 failures per second; there is no way my spider could get two hundred 502 failures per second from the target host server.

And the proxy log suggests that my spider connected to the proxy port, then aborted the connection immediately.

@redapple (Contributor) commented Jun 7, 2017

"The spider logged around two hundred 502 failures per second; there is no way my spider could get two hundred 502 failures per second from the target host server."

And yet it happens. HTTP 502s are a different issue from the original connection-pooling problem when using an HTTP proxy.

If you have proxy logs, you are probably in a better position to debug this. I have no setup to reproduce your use case, so I don't think I can investigate further.

Since it's plain HTTP, it's easier to debug with something like Wireshark. If you can provide a network capture of what's happening, I can maybe have a look.
From what I tested locally, HTTP connections to the HTTP proxy were correctly being reused.
I have not looked at timeouts, though.

@jdxin0
Copy link
Author

jdxin0 commented Jun 8, 2017

Thanks for your advice; it turned out to be a corner case of our self-implemented proxy server.

@redapple added this to the v1.4.1 milestone on Jul 27, 2017
@jdxin0 (Author) commented Sep 4, 2017

@redapple My problem is solved with your patch, after fixing the self-implemented proxy server problem.
