
Get server IP address for HTTP/1.1 Responses #3940

Merged: 23 commits into scrapy:master on Apr 16, 2020

Conversation

elacuesta
Member

@elacuesta elacuesta commented Aug 5, 2019

Fixes #3903

Extract from a scrapy shell session:

$ scrapy shell https://example.org
(...)
2019-08-05 16:41:17 [scrapy.core.engine] INFO: Spider opened
2019-08-05 16:41:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fc3e2cb2f98>
[s]   item       {}
[s]   request    <GET https://example.org>
[s]   response   <200 https://example.org>
[s]   settings   <scrapy.settings.Settings object at 0x7fc3deadd748>
[s]   spider     <DefaultSpider 'default' at 0x7fc3de0d6860>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: response.ip_address
Out[1]: IPv4Address('93.184.216.34')

In [2]: fetch('https://docs.scrapy.org')
2019-08-05 16:41:48 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://docs.scrapy.org/en/latest/> from <GET https://docs.scrapy.org>
2019-08-05 16:41:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/> (referer: None)

In [3]: response.ip_address
Out[3]: IPv4Address('104.17.32.82')

For reference, it seems like there is no public API for getting this directly from Twisted: https://twistedmatrix.com/trac/ticket/9030

Tasks:

  • Implementation
  • Tests
  • Docs

Update:
Py3-only: implemented using the ipaddress module, which is only available in Python 3.3+.
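For illustration, a minimal sketch of what the stdlib ipaddress module (the reason for the Py3-only constraint) provides; the example addresses are from this thread, not from the implementation:

```python
import ipaddress

# ipaddress.ip_address() returns an IPv4Address or IPv6Address
# object depending on the input string (stdlib, Python 3.3+).
v4 = ipaddress.ip_address("93.184.216.34")
v6 = ipaddress.ip_address("2606:2800:220:1:248:1893:25c8:1946")

print(type(v4).__name__)  # IPv4Address
print(v4.is_private)      # False
print(type(v6).__name__)  # IPv6Address
```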

@codecov

@codecov codecov bot commented Aug 5, 2019

Codecov Report

Merging #3940 into master will decrease coverage by 0.19%.
The diff coverage is 92.85%.

@@            Coverage Diff             @@
##           master    #3940      +/-   ##
==========================================
- Coverage   84.79%   84.60%   -0.20%     
==========================================
  Files         164      164              
  Lines        9887     9890       +3     
  Branches     1468     1469       +1     
==========================================
- Hits         8384     8367      -17     
- Misses       1248     1266      +18     
- Partials      255      257       +2     
Impacted Files Coverage Δ
scrapy/core/downloader/__init__.py 90.97% <ø> (ø)
scrapy/core/downloader/handlers/http11.py 92.59% <90.90%> (-0.32%) ⬇️
scrapy/http/response/__init__.py 97.36% <100.00%> (+0.03%) ⬆️
scrapy/robotstxt.py 75.30% <0.00%> (-22.23%) ⬇️
scrapy/utils/trackref.py 82.85% <0.00%> (-2.86%) ⬇️

@OmarFarrag
Contributor

@OmarFarrag OmarFarrag commented Aug 8, 2019

I have in mind a few scenarios where dnscache.get(hostname) might not work:
1. If the cached IP was popped out of the cache due to limit constraints (scrapy.utils.datatypes.LocalCache.__setitem__)
2. If DNSCACHE_ENABLED is false, the IP is not cached (scrapy.resolver.getHostByName)
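Scenario 1 can be sketched with a minimal bounded mapping. This is a hypothetical stand-in, not the actual scrapy.utils.datatypes.LocalCache code, but it shows how a size limit can silently drop a cached IP:

```python
from collections import OrderedDict

class BoundedCache(OrderedDict):
    """Hypothetical sketch of a size-limited DNS cache: once the
    limit is reached, the oldest entry is evicted on insert."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit

    def __setitem__(self, key, value):
        while len(self) >= self.limit:
            self.popitem(last=False)  # drop the oldest entry
        super().__setitem__(key, value)

cache = BoundedCache(limit=2)
cache["a.example"] = "10.0.0.1"
cache["b.example"] = "10.0.0.2"
cache["c.example"] = "10.0.0.3"  # evicts "a.example"
print(cache.get("a.example"))    # None: the IP is gone (scenario 1)
```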

@OmarFarrag
Contributor

@OmarFarrag OmarFarrag commented Aug 8, 2019

I believe reactor.resolve would be the right way, provided that the reactor uses the pre-installed resolver, which is the CachingThreadedResolver, but I can't tell from the Twisted documentation whether that is the behavior. Hope this helps you :)

@Gallaecio
Member

@Gallaecio Gallaecio commented Aug 8, 2019

The cache will not always work, as @OmarFarrag points out. And resolving after the response has been received is not the answer either: the IP address may not match the server that actually sent the response.

It seems to me like the right way would be to modify the low-level code that builds the response using the Twisted API, and use something like transport.getHost() to get the actual IP address that sent the response. The code would probably be much more complex, though.

Also, if we go this route, we might want to consider making the server IP address part of the Response class, instead of a Response.meta key.
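For intuition about the transport-level approach: a connected socket already knows both endpoints, so no extra DNS lookup is needed. A stdlib sketch of the same idea, where getpeername() plays the role Twisted's transport plays in the PR:

```python
import socket

# Loopback listener plus a client connection; each connected
# socket can report the address of the peer it is talking to.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
server.listen(1)

client = socket.socket()
client.connect(server.getsockname())
conn, _ = server.accept()

# From the client's point of view, the peer is the server:
peer_host, peer_port = client.getpeername()
print(peer_host)  # 127.0.0.1

for s in (conn, client, server):
    s.close()
```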

@elacuesta
Member Author

@elacuesta elacuesta commented Aug 8, 2019

That's great feedback, thanks both!
I'm on mobile now, so without rechecking the code I suspect the DNSCACHE_ENABLED setting might indeed be the cause of the failed test.
I will try to build something based upon your comments, thanks again @OmarFarrag and @Gallaecio 🤘

@elacuesta elacuesta changed the title Get server IP address in Response.meta Get server IP address in Response Aug 8, 2019
@elacuesta elacuesta force-pushed the response_ip_address branch 3 times, most recently from 1aebd88 to 1ed2c15, Aug 8, 2019
@elacuesta
Member Author

@elacuesta elacuesta commented Aug 8, 2019

Updated to use getPeer() as per this StackOverflow answer. It gets the data by accessing a private attribute, but it's actually not the first private Twisted attribute in use in that file 😅
Also, as Adrián suggested I added a server_ip_address attribute (other naming suggestions are welcome of course), because indeed the server IP belongs to the responses, not to the requests. I just need to update the docs to reflect this.
I hope this makes more sense now, looking forward to reading your feedback!

@elacuesta elacuesta changed the title Get server IP address in Response Get server IP address for HTTP/1.1 Responses Aug 8, 2019

crawler = self.runner.create_crawler(SingleRequestSpider)
url = 'https://example.org'
yield crawler.crawl(seed=url)
Member

@Gallaecio Gallaecio Aug 12, 2019

I’m not sure about making a request to https://example.org as part of the test suite. @kmike Thoughts?

Member Author

@elacuesta elacuesta Aug 12, 2019

I don't like it either, and I've been told about this before, precisely by Mikhail 😄 (#2061 (review))
But I don't know how else to check for an actual DNS resolution; the MockServer URLs use only IP addresses.

Contributor

@wRAR wRAR Nov 23, 2019

I don’t like this either; I’ll think about what to do about it.

Member

@kmike kmike Jan 23, 2020

+1 not to use example.com, even if it means we need to figure out how to start a dummy DNS server in tests. It looks doable; e.g. Twisted has a DNS server implementation (see https://twistedmatrix.com/documents/current/names/howto/custom-server.html)
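As an aside, a lighter-weight illustration of the idea (my own sketch, not what this PR ends up doing — the PR starts a real mock DNS server and installs a Twisted resolver via reactor.installResolver, as the test runner block later in the thread shows): for blocking stdlib code, a test can monkeypatch socket.getaddrinfo so a made-up hostname resolves to loopback, with no DNS server at all:

```python
import socket

_real_getaddrinfo = socket.getaddrinfo

def fake_getaddrinfo(host, *args, **kwargs):
    # Resolve a made-up test domain to loopback; delegate the rest.
    if host == "not.a.real.domain":
        host = "127.0.0.1"
    return _real_getaddrinfo(host, *args, **kwargs)

socket.getaddrinfo = fake_getaddrinfo
try:
    info = socket.getaddrinfo("not.a.real.domain", 80)
    resolved = info[0][4][0]  # sockaddr of the first result
    print(resolved)  # 127.0.0.1
finally:
    socket.getaddrinfo = _real_getaddrinfo  # always restore
```

Note that Twisted resolves names through its own resolver machinery rather than this blocking call, which is why the PR installs a resolver on the reactor instead.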

Member Author

@elacuesta elacuesta Jan 23, 2020

Indeed, I have a DNS forwarder running on a Raspberry Pi which is based on (should I say "copied from"?) that page. I'll try to come up with a mock server. Thanks for the pointer.

Member Author

@elacuesta elacuesta Jan 26, 2020

Upstream request removed 🙌

@Gallaecio
Member

@Gallaecio Gallaecio commented Aug 12, 2019

Looks great. Let’s now get some reviewers more familiar with Scrapy’s internals.

@Gallaecio Gallaecio requested review from kmike and dangra Aug 12, 2019
@Gallaecio
Member

@Gallaecio Gallaecio commented Aug 14, 2019

I do wonder if we should consider a shorter name, though. (I know, I know, I proposed the current one). Would Response.address work?

@elacuesta
Member Author

@elacuesta elacuesta commented Aug 15, 2019

Would Response.address work?

I think it does, let's do it.

Edit: added docs as well.

Member

@Gallaecio Gallaecio left a comment

The changes are fine by me.

Although that test… I can’t help but wonder if having no test at all would be better 😄. Let’s see what @kmike thinks.

@elacuesta
Member Author

@elacuesta elacuesta commented Aug 19, 2019

Although that test… I can’t help but wonder if having no test at all would be better 😄. Let’s see what @kmike thinks.

hahaha fair enough 😄

@elacuesta elacuesta changed the title Get server IP address for HTTP/1.1 Responses [Py3] Get server IP address for HTTP/1.1 Responses Aug 21, 2019
@elacuesta elacuesta added this to the v2.0 milestone Aug 21, 2019
@elacuesta
Member Author

@elacuesta elacuesta commented Feb 4, 2020

@kmike After the latest developments in the above SO question, I was able to create a mock DNS server, which resolves everything to 127.0.0.1.
tests.mockserver.MockDNSServer is similar to tests.mockserver.MockServer; it can be used as a context manager as well.

@@ -679,6 +682,10 @@ Response objects
they're shown on the string representation of the Response (`__str__`
method) which is used by the engine for logging.

.. attribute:: Response.ip_address

The IP address of the server from which the Response originated.
Member

@kmike kmike Feb 7, 2020

I'm not sure we follow it well for other attributes, but it'd be good to say that it can be None as well. Maybe also mention when this may happen ("not all download handlers may support this attribute" or something like that, maybe in a more user-friendly way, as nobody knows what's a download handler).

Member Author

@elacuesta elacuesta Feb 10, 2020

Added a note about this. I intentionally didn't mention the S3 handler, which uses the HTTP handler internally, because those responses are technically also HTTP. I also didn't mention responses with no body (https://github.com/scrapy/scrapy/pull/3940/files#diff-18150b1d259c93bf10bf1d4e5028d753R384-R386), since I think that's probably a very specific edge case.
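From the user side, the upshot of that note is that response.ip_address may be None. A minimal guard sketch; the describe_origin helper is made up for illustration and takes what response.ip_address would hold:

```python
import ipaddress

def describe_origin(ip):
    """Hypothetical helper: ip is an IPv4Address/IPv6Address, or
    None when the download handler does not provide the address."""
    if ip is None:
        return "origin unknown"
    return "served from {}".format(ip)

print(describe_origin(None))
print(describe_origin(ipaddress.ip_address("93.184.216.34")))
```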

@elacuesta elacuesta added this to the v2.1 milestone Mar 23, 2020
@kmike kmike requested a review from wRAR Mar 26, 2020
@Gallaecio Gallaecio removed this from the v2.1 milestone Apr 16, 2020
@elacuesta elacuesta changed the title [Py3] Get server IP address for HTTP/1.1 Responses Get server IP address for HTTP/1.1 Responses Apr 16, 2020
if __name__ == "__main__":
with MockServer() as mock_http_server, MockDNSServer() as mock_dns_server:
port = urlparse(mock_http_server.http_address).port
url = "http://not.a.real.domain:{port}/echo".format(port=port)

servers = [(mock_dns_server.host, mock_dns_server.port)]
reactor.installResolver(createResolver(servers=servers))

configure_logging()
runner = CrawlerRunner()
d = runner.crawl(LocalhostSpider, url=url)
d.addBoth(lambda _: reactor.stop())
reactor.run()
Member Author

@elacuesta elacuesta Apr 16, 2020

Updated this block @kmike, thanks for pointing it out 👍

@kmike
Member

@kmike kmike commented Apr 16, 2020

Great job, thanks @elacuesta and everyone involved!

@kmike kmike merged commit c33fd47 into scrapy:master Apr 16, 2020
2 checks passed
@elacuesta elacuesta deleted the response_ip_address branch Apr 16, 2020
5 participants