Connection pooling does not work when using a proxy #2743
Interesting, @jdxin0. Creating an endpoint for ProxyAgent actually follows the Twisted docs.
@redapple
I used 'Activity Monitor -> Network -> Open Files and Ports' in macOS's system utilities to monitor the ports opened by the Scrapy process. In the screenshot it used two ports to the proxy server, but it changed ports quickly (it didn't reuse the old connections).
@jdxin0, I had to change this to make it work:
I'll continue looking, since using
Another implementation that seems to work:
@redapple Thanks for fixing it. It's really helpful in my use case.
@redapple The implementation above for this problem somehow affects the download timeout calculation for requests.
I used a monkey patch to fix the proxy connection pooling problem. Here is the patch code:

from twisted.web.client import URI

from scrapy.core.downloader.handlers import http11
from scrapy.core.downloader.handlers.http11 import (
    ProxyAgent, _parse, to_unicode, reactor, TCP4ClientEndpoint,
    ScrapyAgent as _ScrapyAgent,
)


class ScrapyProxyAgent(ProxyAgent):

    def request(self, method, uri, headers=None, bodyProducer=None):
        """
        Issue a new request via the configured proxy.
        """
        # Cache *all* connections under the same key, since we are only
        # connecting to a single destination, the proxy:
        key = ("http-proxy",
               self._proxyEndpoint._host, self._proxyEndpoint._port)
        return self._requestWithEndpoint(key, self._proxyEndpoint, method,
                                         URI.fromBytes(uri), headers,
                                         bodyProducer, uri)


class ScrapyAgent(_ScrapyAgent):
    _ProxyAgent = ScrapyProxyAgent

    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            proxyHost = to_unicode(proxyHost)
            omitConnectTunnel = b'noconnect' in proxyParams
            if scheme == b'https' and not omitConnectTunnel:
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get(b'Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                                            contextFactory=self._contextFactory,
                                            connectTimeout=timeout,
                                            bindAddress=bindaddress,
                                            pool=self._pool)
            else:
                endpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                                              timeout=timeout,
                                              bindAddress=bindaddress)
                return self._ProxyAgent(endpoint, pool=self._pool)
        return self._Agent(reactor, contextFactory=self._contextFactory,
                           connectTimeout=timeout, bindAddress=bindaddress,
                           pool=self._pool)


def patch_proxy():
    # Replace the stock agent so the downloader picks up the
    # pooling-aware version.
    http11.ScrapyAgent = ScrapyAgent
Can I ask you to try the branch with my tentative fix instead?
I tried your fix (redapple-http-proxy-endpoint-key http-proxy-endpoint-key). The spider logged two hundred 502 failures per second; there is no way my spider could get two hundred 502 failures per second from the target host server. The proxy log suggests that my spider connected to the proxy port, then aborted the connection immediately.
And yet it happens. HTTP 502s are something different from the original connection pooling issue when using an HTTP proxy. If you have proxy logs, you are probably in a better position to debug this. I have no setup to reproduce your use case, so I don't think I can investigate further. Being HTTP, it's easier to debug with something like Wireshark. If you can provide a network capture of what's happening, I can maybe have a look.
Thanks for your advice; it turned out to be a corner case of our self-implemented proxy server.
@redapple My problem is solved with your patch, after fixing the self-implemented proxy server issue.
Scrapy creates a new TCP4ClientEndpoint for each request when using a proxy in ScrapyAgent, while ProxyAgent (Twisted) uses key = ("http-proxy", self._proxyEndpoint) as the connection pool key. Since each request gets a fresh endpoint object, the pool key never matches an existing entry, so a new connection is created for each request when using a proxy, and you will get errno 99: cannot assign requested address once all local ports have been used (sockets stuck in TIME_WAIT).

Relevant files:
scrapy/core/downloader/handlers/http11.py
twisted/web/client.py
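The pool-key mismatch described above can be illustrated with a toy cache. This is a simplified stand-in, not Twisted code: Endpoint and ToyPool are hypothetical names, and ToyPool only mimics the keyed-reuse behavior of an HTTP connection pool.

```python
class Endpoint:
    """Stands in for TCP4ClientEndpoint; a fresh instance per request."""
    def __init__(self, host, port):
        self.host, self.port = host, port


class ToyPool:
    """Reuses a cached 'connection' only when the key matches exactly."""
    def __init__(self):
        self.connections = {}

    def get(self, key):
        if key not in self.connections:
            self.connections[key] = object()  # pretend to open a connection
        return self.connections[key]


# Keying by the endpoint *object*: every request carries a distinct
# Endpoint instance, so the key never matches and nothing is reused.
pool = ToyPool()
for _ in range(3):
    ep = Endpoint("proxy.local", 8080)
    pool.get(("http-proxy", ep))
print(len(pool.connections))  # prints 3: one new connection per request

# Keying by (host, port): every request produces the same key,
# so a single connection is shared.
pool2 = ToyPool()
for _ in range(3):
    ep = Endpoint("proxy.local", 8080)
    pool2.get(("http-proxy", ep.host, ep.port))
print(len(pool2.connections))  # prints 1: the connection is reused
```

This is exactly the difference between keying the pool on self._proxyEndpoint (a new object each request) and keying it on the proxy's host and port, as the monkey patch above does.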