Get server IP address for HTTP/1.1 Responses #3940
Codecov Report
@@ Coverage Diff @@
## master #3940 +/- ##
==========================================
- Coverage 84.79% 84.60% -0.20%
==========================================
Files 164 164
Lines 9887 9890 +3
Branches 1468 1469 +1
==========================================
- Hits 8384 8367 -17
- Misses 1248 1266 +18
- Partials 255 257 +2
I have a few scenarios in mind where
I believe
The cache will not always work, as @OmarFarrag points out. Resolving after the response has been received is not the answer either, because the IP address may not match the server that actually sent the response. It seems to me that the right way would be to modify the low-level code that builds the response using the Twisted API, and use something like transport.getHost() to get the actual IP address that sent the response. The code would probably be much more complex, though. Also, if we go this route, we might want to consider making the server IP address part of the
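To illustrate the point about reading the address from the connection itself, here is a minimal standard-library sketch; it is not the PR's implementation. It starts a throwaway loopback listener, connects to it, and takes the peer address from the connected socket instead of doing a separate DNS lookup. The helper names `peer_ip_of_connection` and `demo` are hypothetical. As a side note, in Twisted the remote end of a connection is reported by `transport.getPeer()`, while `transport.getHost()` reports the local end.

```python
import socket


def peer_ip_of_connection(host, port):
    """Connect to (host, port) and return the IP of the peer actually reached.

    Hypothetical helper for illustration only: the address is read from the
    connected socket, so it cannot disagree with the server that answered.
    """
    with socket.create_connection((host, port), timeout=5) as conn:
        return conn.getpeername()[0]


def demo():
    # Throwaway listener bound to an ephemeral port on the loopback interface.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]
        return peer_ip_of_connection("127.0.0.1", port)


if __name__ == "__main__":
    print(demo())
```

A second lookup of the hostname after the fact could return a different address (round-robin DNS, for instance), which is why the sketch never resolves the name twice.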
That's great feedback, thanks both!
1aebd88 to 1ed2c15
Updated to use
tests/test_crawl.py
Outdated
crawler = self.runner.create_crawler(SingleRequestSpider)
url = 'https://example.org'
yield crawler.crawl(seed=url)
I’m not sure about making a request to https://example.org as part of the test suite. @kmike Thoughts?
I don't like it either, and I've been told about this before, precisely by Mikhail.
But I don't know how else to check for an actual DNS resolution, since the MockServer URLs use only IP addresses.
I don’t like this either; I’ll think about what to do about it.
+1 not to use example.com, even if it means we need to figure out how to start a dummy DNS server in tests. It looks doable, since Twisted, for example, has a DNS server implementation (see https://twistedmatrix.com/documents/current/names/howto/custom-server.html).
Indeed, I have a DNS forwarder running on a Raspberry Pi which is based on (should I say "copied from"?) that page. I'll try to come up with a mock server. Thanks for the pointer.
Upstream request removed
Looks great. Let’s now get some reviewers more familiar with Scrapy’s internals.
I do wonder if we should consider a shorter name, though. (I know, I know, I proposed the current one). Would
I think it does, let's do it. Edit: added docs as well.
hahaha fair enough
@kmike After the latest developments in the above SO question, I was able to create a mock DNS server, which resolves everything to
@@ -679,6 +682,10 @@ Response objects
they're shown on the string representation of the Response (`__str__`
method) which is used by the engine for logging.

.. attribute:: Response.ip_address

    The IP address of the server from which the Response originated.
I'm not sure we follow it well for other attributes, but it'd be good to say that it can be None as well. Maybe also mention when this may happen ("not all download handlers may support this attribute" or something like that, ideally in a more user-friendly way, since few users know what a download handler is).
Added a note about this. I intentionally didn't mention the S3 handler, which uses the HTTP handler internally, because those responses are technically also http. Also didn't mention responses with no body (https://github.com/scrapy/scrapy/pull/3940/files#diff-18150b1d259c93bf10bf1d4e5028d753R384-R386), I think that's probably a very specific edge case.
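Since the attribute can be None, calling code probably wants a guard. A minimal sketch of one, assuming only that the response object exposes an `ip_address` attribute; the helper name `server_ip_or_default` is hypothetical, and `SimpleNamespace` objects stand in for real responses so the snippet runs without Scrapy:

```python
from ipaddress import ip_address
from types import SimpleNamespace


def server_ip_or_default(response, default="unknown"):
    """Return the response's server IP as a string, or `default` if unavailable."""
    # getattr also covers objects that lack the attribute entirely.
    ip = getattr(response, "ip_address", None)
    return default if ip is None else str(ip)


# Stand-ins for Response objects (assumption: only the attribute matters here).
with_ip = SimpleNamespace(ip_address=ip_address("127.0.0.1"))
without_ip = SimpleNamespace(ip_address=None)

if __name__ == "__main__":
    print(server_ip_or_default(with_ip))
    print(server_ip_or_default(without_ip))
```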
if __name__ == "__main__":
    with MockServer() as mock_http_server, MockDNSServer() as mock_dns_server:
        port = urlparse(mock_http_server.http_address).port
        url = "http://not.a.real.domain:{port}/echo".format(port=port)

        servers = [(mock_dns_server.host, mock_dns_server.port)]
        reactor.installResolver(createResolver(servers=servers))

        configure_logging()
        runner = CrawlerRunner()
        d = runner.crawl(LocalhostSpider, url=url)
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
Updated this block @kmike, thanks for pointing it out
Great job, thanks @elacuesta and everyone involved!
Fixes #3903
Extract from a scrapy shell session:
For reference, it seems like there is no public API for getting this directly from Twisted: https://twistedmatrix.com/trac/ticket/9030
Tasks:
Update:
Py3-only: Implemented using the ipaddress module, only available in Python 3.3+.
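For readers unfamiliar with the ipaddress module, a short sketch of what it provides. The wrapper name `parse_peer_host` is hypothetical, and the sample addresses come from ranges reserved for documentation; how the host string is obtained from the connection is left out here.

```python
import ipaddress


def parse_peer_host(host):
    """Turn a host string into an IPv4Address or IPv6Address object.

    Hypothetical wrapper for illustration; ipaddress.ip_address raises
    ValueError when the string is not a valid IP address (e.g. a hostname).
    """
    return ipaddress.ip_address(host)


v4 = parse_peer_host("192.0.2.1")
v6 = parse_peer_host("2001:0db8:0000:0000:0000:0000:0000:0001")

if __name__ == "__main__":
    print(type(v4).__name__, v4.version, str(v4))
    print(type(v6).__name__, v6.version, str(v6))
```

Returning an address object rather than a bare string lets callers check the IP version or compare addresses without reparsing.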