
bindaddress in meta never works correctly #3565

Open · NewUserHa opened this issue Jan 4, 2019 · 10 comments

@NewUserHa (Contributor)

I have multiple local IPs, each with a different outgoing route to the internet.
Simply:

        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.2', 0)}, dont_filter=True)
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.3', 0)}, dont_filter=True)

Both requests got the same response.

P.S. I have already verified that the two bind addresses work using curl.

Does anyone know how to fix this, and would someone be willing to fix it?

@NewUserHa (Contributor, Author)

#1967 is the same issue.

@elacuesta (Member)

I was able to reproduce the issue, but only from the Scrapy shell, not within a spider, and only when the requests are directed to hosts outside of the local network.

Within the local network

I started a simple Flask app in one server (192.168.1.156) that returns the IP address of the host that makes a request:

from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello():
    return request.remote_addr

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0')

Then, from a different machine with two interfaces (192.168.1.4 and 192.168.1.6) on the same private network:

$ scrapy shell http://192.168.1.156:5000
(...)
In [1]: response.text
Out[1]: '192.168.1.4'

In [2]: fetch(scrapy.Request('http://192.168.1.156:5000', meta={'bindaddress': ('192.168.1.4', 0)}))
2019-01-04 10:17:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://192.168.1.156:5000> (referer: None)

In [3]: response.text
Out[3]: '192.168.1.4'

In [4]: fetch(scrapy.Request('http://192.168.1.156:5000', meta={'bindaddress': ('192.168.1.6', 0)}))
2019-01-04 10:17:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://192.168.1.156:5000> (referer: None)

In [5]: response.text
Out[5]: '192.168.1.6'

Now, consider the following spider:

from scrapy import Spider, Request

class AddressSpider(Spider):
    name = 'address'

    def start_requests(self):
        yield Request('http://192.168.1.156:5000', dont_filter=True, meta={'bindaddress': ('192.168.1.4', 0)})
        yield Request('http://192.168.1.156:5000', dont_filter=True, meta={'bindaddress': ('192.168.1.6', 0)})

    def parse(self, response):
        self.logger.info('%s - %s', response.meta['bindaddress'], response.text)

and the resulting logs:

(...)
2019-01-04 11:15:50 [address] INFO: ('192.168.1.4', 0) - 192.168.1.4
2019-01-04 11:15:50 [address] INFO: ('192.168.1.6', 0) - 192.168.1.6
(...)

Requesting external servers

The same spider, now requesting public servers, running from a host with two interfaces which are connected to the internet through different public IPs:

from scrapy import Spider, Request

class AddressSpider(Spider):
    name = 'address'

    def start_requests(self):
        yield Request('http://ip.jsontest.com/', dont_filter=True, meta={'bindaddress': ('192.168.43.152', 0)})
        yield Request('http://ip.jsontest.com/', dont_filter=True, meta={'bindaddress': ('192.168.1.4', 0)})

    def parse(self, response):
        self.logger.info('%s - %s', response.meta['bindaddress'], response.text)

produces:

(...)
2019-01-04 11:20:40 [address] INFO: ('192.168.1.4', 0) - {"ip": "186.49.X.X"}
2019-01-04 11:20:40 [address] INFO: ('192.168.43.152', 0) - {"ip": "179.28.X.X"}
(...)

However, there are issues when fetching the same requests from the shell:

$ scrapy shell http://ip.jsontest.com/
(...)
In [1]: response.text
Out[1]: '{"ip": "186.49.X.X"}\n'

In [2]: fetch(scrapy.Request('http://ip.jsontest.com/', dont_filter=True, meta={'bindaddress': ('192.168.43.152', 0)}))
2019-01-04 11:24:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://ip.jsontest.com/> (referer: None)

In [3]: response.text
Out[3]: '{"ip": "186.49.X.X"}\n'

In [4]: fetch(scrapy.Request('http://ip.jsontest.com/', dont_filter=True, meta={'bindaddress': ('192.168.1.4', 0)}))
2019-01-04 11:25:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://ip.jsontest.com/> (referer: None)

In [5]: response.text
Out[5]: '{"ip": "186.49.X.X"}\n'

Conclusion

I suspect some connection is reused in the context of the Scrapy shell, which makes all requests go through the same interface.

@NewUserHa (Contributor, Author) commented Jan 4, 2019

I tried both in the shell and in a spider, and the results were the same, so I'm heading toward using a local proxy server with the proxy meta key instead.

My environment was:

The machine running these is a Windows machine with one Ethernet interface that has two IPs (192.168.0.2 and 192.168.0.3), and a physical router that routes outgoing traffic differently for the two IPs.

        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.2', 0)}, dont_filter=True)
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.3', 0)}, dont_filter=True)

The target server says the two requests come from the same source internet IP (the one behind 192.168.0.2).

C:\Users\USER>scrapy version -v
Scrapy       : 1.5.1
lxml         : 4.2.5.0
libxml2      : 2.9.5
cssselect    : 1.0.3
parsel       : 1.5.0
w3lib        : 1.19.0
Twisted      : 18.9.0
Python       : 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)]
pyOpenSSL    : 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018)
cryptography : 2.3.1
Platform     : Windows-10-10.0.17134-SP0

@elacuesta (Member)

Could you provide the complete spider code with logs, along with the curl commands you mentioned at the beginning of the issue?

@alwaysused

Hey, I think I found the reason. Scrapy uses a networking library named Twisted, which keeps a connection pool stored as key-value pairs, and the key depends only on the target server. So if you request the target website twice within a short period, it may reuse the previously opened connection, which explains the result. But I can't reproduce the issue because of my network :(
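
As a rough illustration of that claim, here is a standalone sketch using Twisted's public Agent and HTTPConnectionPool API directly (not Scrapy's actual downloader code; the URL and addresses are just examples). Two agents bound to different local addresses but sharing a pool end up with the same pool key, so the second request may reuse the first one's socket:

from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody

# Pooled connections are keyed by the target (scheme, host, port) only,
# so the bind address is not part of the cache key.
pool = HTTPConnectionPool(reactor, persistent=True)

# Two agents bound to different local interfaces, sharing the same pool.
agent_a = Agent(reactor, pool=pool, bindAddress=('192.168.0.2', 0))
agent_b = Agent(reactor, pool=pool, bindAddress=('192.168.0.3', 0))

def second_request(_):
    # By now the first connection is back in the pool under
    # (b'http', b'httpbin.org', 80), so agent_b may reuse it even though
    # it asked to bind to 192.168.0.3.
    return agent_b.request(b'GET', b'http://httpbin.org/ip')

d = agent_a.request(b'GET', b'http://httpbin.org/ip')
d.addCallback(readBody)  # consume the body so the connection is returned to the pool
d.addCallback(second_request)
d.addCallback(readBody)
d.addCallback(lambda body: print(body.decode()))
d.addBoth(lambda _: reactor.stop())
reactor.run()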

@NewUserHa (Contributor, Author) commented Jan 4, 2019

Hey, I just found something new while running it to get detailed logs.

import scrapy

class try_(scrapy.Spider):
    name = "try_"

    custom_settings = {
        'DOWNLOAD_DELAY': 1.5,
    }

    def start_requests(self):
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.2', 0)}, dont_filter=True)
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.3', 0)}, dont_filter=True)

    def parse3(self, response):
        print(response.text, response.meta)

This does not work; the output is:

2019-01-04 23:50:58 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2019-01-04 23:50:58 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2019-01-04 23:50:58 [scrapy.crawler] INFO: Overridden settings: {'DOWNLOAD_DELAY': 1.5}
2019-01-04 23:50:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-01-04 23:50:58 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-01-04 23:50:58 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-04 23:50:58 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-01-04 23:50:58 [scrapy.core.engine] INFO: Spider opened
2019-01-04 23:50:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-04 23:50:58 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2019-01-04 23:50:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: None)
{
  "origin": "x.x.x.A"
}
 {'bindaddress': ('192.168.0.2', 0), 'download_timeout': 180.0, 'download_slot': 'httpbin.org', 'download_latency': 0.6750264167785645}
2019-01-04 23:51:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: None)
{
  "origin": "x.x.x.A"
}
 {'bindaddress': ('192.168.0.3', 0), 'download_timeout': 180.0, 'download_slot': 'httpbin.org', 'download_latency': 0.30902647972106934}
2019-01-04 23:51:01 [scrapy.core.engine] INFO: Closing spider (finished)
2019-01-04 23:51:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 424,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 466,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 1, 4, 15, 51, 1, 184202),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 1, 4, 15, 50, 58, 756330)}
2019-01-04 23:51:01 [scrapy.core.engine] INFO: Spider closed (finished)

But this does work:

import scrapy

class try_(scrapy.Spider):
    name = "try_"

    custom_settings = {

    }

    def start_requests(self):
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.2', 0)}, dont_filter=True)
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.3', 0)}, dont_filter=True)

    def parse3(self, response):
        print(response.text, response.meta)

output:

2019-01-04 23:51:39 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2019-01-04 23:51:39 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2019-01-04 23:51:39 [scrapy.crawler] INFO: Overridden settings: {}
2019-01-04 23:51:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-01-04 23:51:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-01-04 23:51:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-04 23:51:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-01-04 23:51:39 [scrapy.core.engine] INFO: Spider opened
2019-01-04 23:51:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-04 23:51:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2019-01-04 23:51:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: None)
2019-01-04 23:51:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: None)
{
  "origin": "x.x.x.B"
}
 {'bindaddress': ('192.168.0.3', 0), 'download_timeout': 180.0, 'download_slot': 'httpbin.org', 'download_latency': 0.7331118583679199}
{
  "origin": "x.x.x.A"
}
 {'bindaddress': ('192.168.0.2', 0), 'download_timeout': 180.0, 'download_slot': 'httpbin.org', 'download_latency': 0.744828462600708}
2019-01-04 23:51:40 [scrapy.core.engine] INFO: Closing spider (finished)
2019-01-04 23:51:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 424,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 465,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 1, 4, 15, 51, 40, 431834),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 1, 4, 15, 51, 39, 576838)}
2019-01-04 23:51:40 [scrapy.core.engine] INFO: Spider closed (finished)

The only difference is in custom_settings!

But who doesn't override DOWNLOAD_DELAY?

@alwaysused

In my opinion, the delay may simply give the previously opened connection time to be dropped before the next request. Can you add a breakpoint in your installed copy of Twisted? It's in site-packages, in twisted.web.client.HTTPConnectionPool.getConnection; add it on the first line of the function, then debug to see whether it creates two new connections or not.
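
If editing files in site-packages is inconvenient, a rough alternative sketch (just an illustrative tracing wrapper; getConnection(key, endpoint) is Twisted's public signature, while _connections is a private attribute that may differ between versions) is to wrap the method from your own code, e.g. a module imported by your spider, and log each key it is asked for:

from twisted.web.client import HTTPConnectionPool

_original_getConnection = HTTPConnectionPool.getConnection

def traced_getConnection(self, key, endpoint):
    # Report whether a cached connection exists for this key before delegating.
    cached = bool(getattr(self, '_connections', {}).get(key))
    print('getConnection key=%r cached_connection_available=%s' % (key, cached))
    return _original_getConnection(self, key, endpoint)

HTTPConnectionPool.getConnection = traced_getConnection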

@NewUserHa (Contributor, Author) commented Jan 4, 2019

How do I add a breakpoint in an imported library?
And I think the default delay is 0.5 (less than the overridden 1.5).

@alwaysused

Sorry for replying so late; it was midnight in my country and I went to sleep. About the debugging: are you using an IDE for Python? I'm using PyCharm, and the library can be found in the project sidebar; if you are not using it, you can google this or search the IDE's website. The Twisted networking library is a pure Python project rather than a C extension, so debugging inside it is possible.

@tarnenok commented Jan 8, 2019

@NewUserHa, as @alwaysused mentioned above, Twisted caches connections by a key. That key is composed in a single line of Twisted's code and does not take the bind address into account. So the second request in your example reuses the connection cached by the first one, the one opened with meta={'bindaddress': ('192.168.0.2', 0)}.

Your example with empty custom_settings works properly because Twisted only caches the connection after the HTTP response is received. So if DOWNLOAD_DELAY is not defined, or it is smaller than the server's response time, the second request starts before the connection from the first request has been cached; consequently it creates a new connection.

To fix this problem you can patch the connection key composed in Twisted to the following:

key = (parsedURI.scheme, parsedURI.host, parsedURI.port, endpoint._bindAddress)

But that accesses a private member and violates encapsulation. If this functionality is really needed by the community, it is worth thinking it through properly, and perhaps updating ScrapyAgent to take bindAddress into account.
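
Rather than patching Twisted's private key, another possible workaround sketch (not an official fix: the class and attribute names below are taken from Scrapy 1.5's HTTP/1.1 handler, and the myproject.handlers module path is hypothetical) is to disable connection persistence in a custom download handler, so every request opens a fresh connection and bindaddress always takes effect, at the cost of losing keep-alive:

# myproject/handlers.py -- hedged workaround sketch, not an official fix
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler

class NonPersistentDownloadHandler(HTTP11DownloadHandler):
    def __init__(self, settings):
        super().__init__(settings)
        # Scrapy 1.5 keeps its Twisted HTTPConnectionPool in self._pool;
        # turning persistence off means connections are not cached and reused.
        self._pool.persistent = False

# settings.py (module path is an example)
DOWNLOAD_HANDLERS = {
    'http': 'myproject.handlers.NonPersistentDownloadHandler',
    'https': 'myproject.handlers.NonPersistentDownloadHandler',
}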
