
bindaddress in meta never works correctly #3565

Open · NewUserHa opened this issue Jan 4, 2019 · 10 comments

@NewUserHa (Contributor)

I have multiple local IPs, each with a different outgoing route to the internet.
Simply:

        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.2', 0)}, dont_filter=True)
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.3', 0)}, dont_filter=True)

Both requests got the same response.

P.S. I have already verified that the two bind addresses work using curl.

Does anyone know how to fix this, and would someone be willing to fix it?

@NewUserHa (Contributor, Author)

#1967 is the same issue.

@elacuesta (Member)

I was able to reproduce the issue, but only from the Scrapy shell, not within a spider, and only when the requests are directed to hosts outside of the local network.

Within the local network

I started a simple Flask app in one server (192.168.1.156) that returns the IP address of the host that makes a request:

from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello():
    return request.remote_addr

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0')

Then, from a different machine with two interfaces (192.168.1.4 and 192.168.1.6) on the same private network:

$ scrapy shell http://192.168.1.156:5000
(...)
In [1]: response.text
Out[1]: '192.168.1.4'

In [2]: fetch(scrapy.Request('http://192.168.1.156:5000', meta={'bindaddress': ('192.168.1.4', 0)}))
2019-01-04 10:17:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://192.168.1.156:5000> (referer: None)

In [3]: response.text
Out[3]: '192.168.1.4'

In [4]: fetch(scrapy.Request('http://192.168.1.156:5000', meta={'bindaddress': ('192.168.1.6', 0)}))
2019-01-04 10:17:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://192.168.1.156:5000> (referer: None)

In [5]: response.text
Out[5]: '192.168.1.6'

Now, consider the following spider:

from scrapy import Spider, Request

class AddressSpider(Spider):
    name = 'address'

    def start_requests(self):
        yield Request('http://192.168.1.156:5000', dont_filter=True, meta={'bindaddress': ('192.168.1.4', 0)})
        yield Request('http://192.168.1.156:5000', dont_filter=True, meta={'bindaddress': ('192.168.1.6', 0)})

    def parse(self, response):
        self.logger.info('%s - %s', response.meta['bindaddress'], response.text)

and the resulting logs:

(...)
2019-01-04 11:15:50 [address] INFO: ('192.168.1.4', 0) - 192.168.1.4
2019-01-04 11:15:50 [address] INFO: ('192.168.1.6', 0) - 192.168.1.6
(...)

Requesting external servers

The same spider, now requesting public servers, running from a host with two interfaces which are connected to the internet through different public IPs:

from scrapy import Spider, Request

class AddressSpider(Spider):
    name = 'address'

    def start_requests(self):
        yield Request('http://ip.jsontest.com/', dont_filter=True, meta={'bindaddress': ('192.168.43.152', 0)})
        yield Request('http://ip.jsontest.com/', dont_filter=True, meta={'bindaddress': ('192.168.1.4', 0)})

    def parse(self, response):
        self.logger.info('%s - %s', response.meta['bindaddress'], response.text)

produces:

(...)
2019-01-04 11:20:40 [address] INFO: ('192.168.1.4', 0) - {"ip": "186.49.X.X"}
2019-01-04 11:20:40 [address] INFO: ('192.168.43.152', 0) - {"ip": "179.28.X.X"}
(...)

However, there are issues when fetching the same requests from the shell:

$ scrapy shell http://ip.jsontest.com/
(...)
In [1]: response.text
Out[1]: '{"ip": "186.49.X.X"}\n'

In [2]: fetch(scrapy.Request('http://ip.jsontest.com/', dont_filter=True, meta={'bindaddress': ('192.168.43.152', 0)}))
2019-01-04 11:24:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://ip.jsontest.com/> (referer: None)

In [3]: response.text
Out[3]: '{"ip": "186.49.X.X"}\n'

In [4]: fetch(scrapy.Request('http://ip.jsontest.com/', dont_filter=True, meta={'bindaddress': ('192.168.1.4', 0)}))
2019-01-04 11:25:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://ip.jsontest.com/> (referer: None)

In [5]: response.text
Out[5]: '{"ip": "186.49.X.X"}\n'

Conclusion

I suspect some connection is reused in the context of the Scrapy shell, which makes all requests go through the same interface.

@NewUserHa (Contributor, Author) commented Jan 4, 2019

I tried both in the shell and in a spider, and the results were the same, so I'm heading toward using a local proxy server with the proxy meta key instead.

My environment was:

The machine running these is a Windows machine with one Ethernet interface that has two IPs (192.168.0.2 and 192.168.0.3), and a physical router that routes outgoing traffic differently for the two IPs.

        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.2', 0)}, dont_filter=True)
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.3', 0)}, dont_filter=True)

The target server says the two requests come from the same source internet IP (the one behind 192.168.0.2).

C:\Users\USER>scrapy version -v
Scrapy       : 1.5.1
lxml         : 4.2.5.0
libxml2      : 2.9.5
cssselect    : 1.0.3
parsel       : 1.5.0
w3lib        : 1.19.0
Twisted      : 18.9.0
Python       : 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)]
pyOpenSSL    : 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018)
cryptography : 2.3.1
Platform     : Windows-10-10.0.17134-SP0

@elacuesta (Member)

Could you provide the complete spider code with logs, along with the curl commands you mentioned at the beginning of the issue?

@alwaysused

Hey, I think I found the reason. Scrapy uses a networking library named Twisted, which keeps a connection pool stored as key-value pairs, and the key depends only on the target server. So if you request the target website twice within a short period, it may reuse the previously opened connection, which explains the result. But I can't reproduce the issue because of my network :(
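
As a rough illustration of that claim, here is a standalone sketch using Twisted's public Agent and HTTPConnectionPool API directly (not Scrapy's actual downloader code; the URL and addresses are just examples). Two agents bound to different local addresses but sharing a pool end up with the same pool key, so the second request may reuse the first one's socket:

from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody

# Pooled connections are keyed by the target (scheme, host, port) only,
# so the bind address is not part of the cache key.
pool = HTTPConnectionPool(reactor, persistent=True)

# Two agents bound to different local interfaces, sharing the same pool.
agent_a = Agent(reactor, pool=pool, bindAddress=('192.168.0.2', 0))
agent_b = Agent(reactor, pool=pool, bindAddress=('192.168.0.3', 0))

def second_request(_):
    # By now the first connection is back in the pool under
    # (b'http', b'httpbin.org', 80), so agent_b may reuse it even though
    # it asked to bind to 192.168.0.3.
    return agent_b.request(b'GET', b'http://httpbin.org/ip')

d = agent_a.request(b'GET', b'http://httpbin.org/ip')
d.addCallback(readBody)  # consume the body so the connection is returned to the pool
d.addCallback(second_request)
d.addCallback(readBody)
d.addCallback(lambda body: print(body.decode()))
d.addBoth(lambda _: reactor.stop())
reactor.run()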

@NewUserHa (Contributor, Author) commented Jan 4, 2019

Hey, I just found something new while running it to get detailed logs.

import scrapy

class try_(scrapy.Spider):
    name = "try_"

    custom_settings = {
        'DOWNLOAD_DELAY': 1.5,
    }

    def start_requests(self):
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.2', 0)}, dont_filter=True)
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.3', 0)}, dont_filter=True)

    def parse3(self, response):
        print(response.text, response.meta)

This does not work; the output is:

2019-01-04 23:50:58 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2019-01-04 23:50:58 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2019-01-04 23:50:58 [scrapy.crawler] INFO: Overridden settings: {'DOWNLOAD_DELAY': 1.5}
2019-01-04 23:50:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-01-04 23:50:58 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-01-04 23:50:58 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-04 23:50:58 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-01-04 23:50:58 [scrapy.core.engine] INFO: Spider opened
2019-01-04 23:50:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-04 23:50:58 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2019-01-04 23:50:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: None)
{
  "origin": "x.x.x.A"
}
 {'bindaddress': ('192.168.0.2', 0), 'download_timeout': 180.0, 'download_slot': 'httpbin.org', 'download_latency': 0.6750264167785645}
2019-01-04 23:51:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: None)
{
  "origin": "x.x.x.A"
}
 {'bindaddress': ('192.168.0.3', 0), 'download_timeout': 180.0, 'download_slot': 'httpbin.org', 'download_latency': 0.30902647972106934}
2019-01-04 23:51:01 [scrapy.core.engine] INFO: Closing spider (finished)
2019-01-04 23:51:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 424,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 466,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 1, 4, 15, 51, 1, 184202),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 1, 4, 15, 50, 58, 756330)}
2019-01-04 23:51:01 [scrapy.core.engine] INFO: Spider closed (finished)

But this does work:

import scrapy

class try_(scrapy.Spider):
    name = "try_"

    custom_settings = {

    }

    def start_requests(self):
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.2', 0)}, dont_filter=True)
        yield scrapy.Request('http://httpbin.org/ip', self.parse3, meta={'bindaddress': ('192.168.0.3', 0)}, dont_filter=True)

    def parse3(self, response):
        print(response.text, response.meta)

output:

2019-01-04 23:51:39 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2019-01-04 23:51:39 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2019-01-04 23:51:39 [scrapy.crawler] INFO: Overridden settings: {}
2019-01-04 23:51:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-01-04 23:51:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-01-04 23:51:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-04 23:51:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-01-04 23:51:39 [scrapy.core.engine] INFO: Spider opened
2019-01-04 23:51:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-04 23:51:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2019-01-04 23:51:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: None)
2019-01-04 23:51:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: None)
{
  "origin": "x.x.x.B"
}
 {'bindaddress': ('192.168.0.3', 0), 'download_timeout': 180.0, 'download_slot': 'httpbin.org', 'download_latency': 0.7331118583679199}
{
  "origin": "x.x.x.A"
}
 {'bindaddress': ('192.168.0.2', 0), 'download_timeout': 180.0, 'download_slot': 'httpbin.org', 'download_latency': 0.744828462600708}
2019-01-04 23:51:40 [scrapy.core.engine] INFO: Closing spider (finished)
2019-01-04 23:51:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 424,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 465,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 1, 4, 15, 51, 40, 431834),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 1, 4, 15, 51, 39, 576838)}
2019-01-04 23:51:40 [scrapy.core.engine] INFO: Spider closed (finished)

The only difference is in custom_settings!

But who doesn't override DOWNLOAD_DELAY?

@alwaysused

In my opinion, the delay may simply give the previously opened connection time to be dropped before the next request. Can you add a breakpoint in your installed copy of Twisted? It's in site-packages, in twisted.web.client.HTTPConnectionPool.getConnection; add it on the first line of the function, then debug to see whether it creates two new connections or not.
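
If editing files in site-packages is inconvenient, a rough alternative sketch (just an illustrative tracing wrapper; getConnection(key, endpoint) is Twisted's public signature, while _connections is a private attribute that may differ between versions) is to wrap the method from your own code, e.g. a module imported by your spider, and log each key it is asked for:

from twisted.web.client import HTTPConnectionPool

_original_getConnection = HTTPConnectionPool.getConnection

def traced_getConnection(self, key, endpoint):
    # Report whether a cached connection exists for this key before delegating.
    cached = bool(getattr(self, '_connections', {}).get(key))
    print('getConnection key=%r cached_connection_available=%s' % (key, cached))
    return _original_getConnection(self, key, endpoint)

HTTPConnectionPool.getConnection = traced_getConnection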

@NewUserHa (Contributor, Author) commented Jan 4, 2019

How do I add a breakpoint in an imported library?
And I think the default delay is 0.5 (less than the overridden 1.5).

@alwaysused

Sorry for replying so late; it was midnight in my country and I went to sleep. About the debugging: are you using an IDE for Python? I'm using PyCharm, and the library can be found in the project sidebar; if you are not using it, you can google this or search the IDE's website. The Twisted networking library is a pure Python project rather than a C extension, so debugging inside it is possible.

@tarnenok commented Jan 8, 2019

@NewUserHa, as @alwaysused mentioned above, Twisted caches connections by a key. That key is composed in a single line of Twisted's code and does not take the bind address into account. So the second request in your example reuses the connection cached by the first one, the one opened with meta={'bindaddress': ('192.168.0.2', 0)}.

Your example with empty custom_settings works properly because Twisted only caches the connection after the HTTP response is received. So if DOWNLOAD_DELAY is not defined, or it is smaller than the server's response time, the second request starts before the connection from the first request has been cached; consequently it creates a new connection.

To fix this problem you can patch the connection key composed in Twisted to the following:

key = (parsedURI.scheme, parsedURI.host, parsedURI.port, endpoint._bindAddress)

But that accesses a private member and violates encapsulation. If this functionality is really needed by the community, it is worth thinking it through properly, and perhaps updating ScrapyAgent to take bindAddress into account.
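
Rather than patching Twisted's private key, another possible workaround sketch (not an official fix: the class and attribute names below are taken from Scrapy 1.5's HTTP/1.1 handler, and the myproject.handlers module path is hypothetical) is to disable connection persistence in a custom download handler, so every request opens a fresh connection and bindaddress always takes effect, at the cost of losing keep-alive:

# myproject/handlers.py -- hedged workaround sketch, not an official fix
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler

class NonPersistentDownloadHandler(HTTP11DownloadHandler):
    def __init__(self, settings):
        super().__init__(settings)
        # Scrapy 1.5 keeps its Twisted HTTPConnectionPool in self._pool;
        # turning persistence off means connections are not cached and reused.
        self._pool.persistent = False

# settings.py (module path is an example)
DOWNLOAD_HANDLERS = {
    'http': 'myproject.handlers.NonPersistentDownloadHandler',
    'https': 'myproject.handlers.NonPersistentDownloadHandler',
}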
