Get server IP address for HTTP/1.1 Responses #3940

Merged Apr 16, 2020: 23 commits into master from response_ip_address

Commits
- 80925ab: Get server IP address for HTTP/1.1 responses (elacuesta, Aug 5, 2019)
- e8da7e2: Test DNS resolution using CrawlerProcess (elacuesta, Jan 26, 2020)
- 8529dff: Update docs regarding Response.ip_address and IPv6 (elacuesta, Jan 26, 2020)
- b9e3a62: Merge branch 'master' into response_ip_address (elacuesta, Feb 3, 2020)
- a2ae380: Remove unnecessary commas (elacuesta, Feb 3, 2020)
- bb8f7dc: Mock DNS server (elacuesta, Feb 3, 2020)
- 4851efd: Flake8 adjustments (elacuesta, Feb 3, 2020)
- e0ef8ad: CrawlerRunner test for Response.ip_address (elacuesta, Feb 3, 2020)
- 13670f0: Ignore tests/CrawlerRunner directory (elacuesta, Feb 3, 2020)
- ad70497: Remove unnecessary parentheses in class definition (elacuesta, Feb 4, 2020)
- 6c33349: Merge branch 'master' into response_ip_address (elacuesta, Feb 6, 2020)
- c2f484d: Merge branch 'master' into response_ip_address (Gallaecio, Feb 7, 2020)
- 13ba9bc: Note about Response.ip_address (elacuesta, Feb 10, 2020)
- 037ae5b: Explicitly indicate None as ip_address's default value (Gallaecio, Feb 10, 2020)
- a44942d: Merge branch 'master' into response_ip_address (elacuesta, Feb 23, 2020)
- f85bf77: Restore unrelated change (elacuesta, Feb 23, 2020)
- 889b471: Import changes (elacuesta, Feb 23, 2020)
- 3aa5eab: Merge branch 'master' into response_ip_address (elacuesta, Mar 3, 2020)
- ac73bcc: Merge branch 'master' into response_ip_address (elacuesta, Mar 9, 2020)
- 91a78ee: Pass callback results as dicts instead of tuples (elacuesta, Mar 9, 2020)
- 1785095: Remove single-use variable (elacuesta, Mar 11, 2020)
- 1f2e2a6: Merge branch 'master' into response_ip_address (elacuesta, Apr 16, 2020)
- c922992: Tests: Move code inside __main__ block (elacuesta, Apr 16, 2020)
Files changed (showing changes from 11 commits)
conftest.py (2 additions, 1 deletion)
@@ -11,7 +11,8 @@ def _py_files(folder):
# not a test, but looks like a test
"scrapy/utils/testsite.py",
# contains scripts to be run by tests/test_crawler.py::CrawlerProcessSubprocess
-    *_py_files("tests/CrawlerProcess")
+    *_py_files("tests/CrawlerProcess"),
+    *_py_files("tests/CrawlerRunner"),
]

for line in open('tests/ignores.txt'):
docs/topics/request-response.rst (8 additions, 1 deletion)
@@ -34,7 +34,7 @@ Request objects
:type url: string

:param callback: the function that will be called with the response of this
-request (once its downloaded) as its first parameter. For more information
+request (once it's downloaded) as its first parameter. For more information
see :ref:`topics-request-response-ref-request-callback-arguments` below.
If a Request doesn't specify a callback, the spider's
:meth:`~scrapy.spiders.Spider.parse` method will be used.
@@ -611,6 +611,9 @@ Response objects
This represents the :class:`Request` that generated this response.
:type request: :class:`Request` object

:param ip_address: The IP address of the server from which the Response originated.
:type ip_address: :class:`ipaddress.IPv4Address` or :class:`ipaddress.IPv6Address`

.. attribute:: Response.url

A string containing the URL of the response.
@@ -679,6 +682,10 @@ Response objects
they're shown on the string representation of the Response (`__str__`
method) which is used by the engine for logging.

.. attribute:: Response.ip_address

The IP address of the server from which the Response originated.

[Member] commented:

I'm not sure we follow it well for other attributes, but it'd be good to say that it can be None as well. Maybe also mention when this may happen ("not all download handlers may support this attribute" or something like that, maybe in a more user-friendly way, as nobody knows what a download handler is).

[Member Author] replied:

Added a note about this. I intentionally didn't mention the S3 handler, which uses the HTTP handler internally, because those responses are technically also HTTP. I also didn't mention responses with no body (https://github.com/scrapy/scrapy/pull/3940/files#diff-18150b1d259c93bf10bf1d4e5028d753R384-R386); I think that's probably a very specific edge case.
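
For context, here is a minimal sketch (not part of this diff) of how a spider might consume the new attribute; the spider name and URL are hypothetical, and the None check matters because download handlers other than HTTP/1.1 may not populate the attribute:

```python
import scrapy


class IPAwareSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = "ip_aware"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # ip_address defaults to None and is only populated by download
        # handlers that support it (the HTTP/1.1 handler, as of this PR).
        if response.ip_address is None:
            self.logger.info("No IP address recorded for %s", response.url)
        else:
            self.logger.info("%s was served from %s", response.url, response.ip_address)
```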


.. method:: Response.copy()

Returns a new Response which is a copy of this Response.
scrapy/core/downloader/__init__.py (1 addition, 1 deletion)
@@ -172,7 +172,7 @@ def _downloaded(response):
return response
dfd.addCallback(_downloaded)

# 3. After response arrives, remove the request from transferring
# state to free up the transferring slot so it can be used by the
# following requests (perhaps those which came from the downloader
# middleware itself)
scrapy/core/downloader/handlers/http11.py (12 additions, 6 deletions)
@@ -4,6 +4,7 @@
import re
import warnings
from io import BytesIO
from ipaddress import ip_address
from time import time
from urllib.parse import urldefrag

@@ -382,7 +383,7 @@ def _cb_latency(self, result, request, start_time):
def _cb_bodyready(self, txresponse, request):
# deliverBody hangs for responses without body
if txresponse.length == 0:
-return txresponse, b'', None
+return txresponse, b'', None, None

maxsize = request.meta.get('download_maxsize', self._maxsize)
warnsize = request.meta.get('download_warnsize', self._warnsize)
@@ -418,11 +419,11 @@ def _cancel(_):
return d

def _cb_bodydone(self, result, request, url):
-txresponse, body, flags = result
+txresponse, body, flags, ip_address = result
status = int(txresponse.code)
headers = Headers(txresponse.headers.getAllRawHeaders())
respcls = responsetypes.from_args(headers=headers, url=url, body=body)
-return respcls(url=url, status=status, headers=headers, body=body, flags=flags)
+return respcls(url=url, status=status, headers=headers, body=body, flags=flags, ip_address=ip_address)


@implementer(IBodyProducer)
@@ -456,6 +457,11 @@ def __init__(self, finished, txresponse, request, maxsize, warnsize, fail_on_dat
self._fail_on_dataloss_warned = False
self._reached_warnsize = False
self._bytes_received = 0
self._ip_address = None

def connectionMade(self):
if self._ip_address is None:
self._ip_address = ip_address(self.transport._producer.getPeer().host)

def dataReceived(self, bodyBytes):
# This maybe called several times after cancel was called with buffered data.
@@ -488,16 +494,16 @@ def connectionLost(self, reason):

body = self._bodybuf.getvalue()
if reason.check(ResponseDone):
-self._finished.callback((self._txresponse, body, None))
+self._finished.callback((self._txresponse, body, None, self._ip_address))
return

if reason.check(PotentialDataLoss):
-self._finished.callback((self._txresponse, body, ['partial']))
+self._finished.callback((self._txresponse, body, ['partial'], self._ip_address))
return

if reason.check(ResponseFailed) and any(r.check(_DataLoss) for r in reason.value.reasons):
if not self._fail_on_dataloss:
-self._finished.callback((self._txresponse, body, ['dataloss']))
+self._finished.callback((self._txresponse, body, ['dataloss'], self._ip_address))
return

elif not self._fail_on_dataloss_warned:
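
The handler builds the stored value with the standard library's ipaddress.ip_address() factory, which returns an IPv4Address or an IPv6Address depending on the string received from the transport peer; a quick illustration of that behavior:

```python
from ipaddress import IPv4Address, IPv6Address, ip_address

# The factory picks the concrete type from its input, which is why the
# docs mention both IPv4Address and IPv6Address as possible types.
assert isinstance(ip_address("127.0.0.1"), IPv4Address)
assert isinstance(ip_address("::1"), IPv6Address)

# Invalid input raises ValueError rather than returning None.
try:
    ip_address("not.an.ip")
except ValueError as exc:
    print(exc)
```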
scrapy/http/response/__init__.py (3 additions, 2 deletions)
@@ -17,13 +17,14 @@

class Response(object_ref):

-def __init__(self, url, status=200, headers=None, body=b'', flags=None, request=None):
+def __init__(self, url, status=200, headers=None, body=b'', flags=None, request=None, ip_address=None):
self.headers = Headers(headers or {})
self.status = int(status)
self._set_body(body)
self._set_url(url)
self.request = request
self.flags = [] if flags is None else list(flags)
self.ip_address = ip_address

@property
def meta(self):
@@ -76,7 +77,7 @@ def replace(self, *args, **kwargs):
"""Create a new Response with the same attributes except for those
given new values.
"""
-for x in ['url', 'status', 'headers', 'body', 'request', 'flags']:
+for x in ['url', 'status', 'headers', 'body', 'request', 'flags', 'ip_address']:
kwargs.setdefault(x, getattr(self, x))
cls = kwargs.pop('cls', self.__class__)
return cls(*args, **kwargs)
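
With 'ip_address' added to the attribute list above, Response.replace() now carries the address over to copies unless a new value is passed explicitly; a minimal sketch:

```python
from ipaddress import ip_address

from scrapy.http import Response

original = Response("https://example.com", ip_address=ip_address("127.0.0.1"))
modified = original.replace(status=404)

assert modified.status == 404
# Carried over by replace(), like url, headers, flags and the rest.
assert modified.ip_address == original.ip_address
```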
scrapy/resolver.py (1 addition, 1 deletion)
@@ -29,7 +29,7 @@ def from_crawler(cls, crawler, reactor):
cache_size = 0
return cls(reactor, cache_size, crawler.settings.getfloat('DNS_TIMEOUT'))

-def install_on_reactor(self,):
+def install_on_reactor(self):
self.reactor.installResolver(self)

def getHostByName(self, name, timeout=None):
tests/CrawlerRunner/ip_address.py (new file, 37 additions)
@@ -0,0 +1,37 @@
from urllib.parse import urlparse

from twisted.internet import reactor
from twisted.names.client import createResolver

from scrapy import Spider, Request
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from tests.mockserver import MockServer, MockDNSServer


class LocalhostSpider(Spider):
name = "localhost_spider"

def start_requests(self):
yield Request(self.url)

def parse(self, response):
netloc = urlparse(response.url).netloc
self.logger.info("Host: %s" % netloc.split(":")[0])
self.logger.info("Type: %s" % type(response.ip_address))
self.logger.info("IP address: %s" % response.ip_address)


with MockServer() as mock_http_server, MockDNSServer() as mock_dns_server:
port = urlparse(mock_http_server.http_address).port
url = "http://not.a.real.domain:{port}/echo".format(port=port)

servers = [(mock_dns_server.host, mock_dns_server.port)]
reactor.installResolver(createResolver(servers=servers))

configure_logging()
runner = CrawlerRunner()
d = runner.crawl(LocalhostSpider, url=url)
d.addBoth(lambda _: reactor.stop())
reactor.run()
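
The script above logs the crawled host, the type of response.ip_address, and the address itself, so a driver test can run it in a subprocess and assert on that output. A hedged sketch of such a driver, assuming the script is runnable as a module and that Scrapy's log goes to stderr (neither is shown in this diff):

```python
import subprocess
import sys

# Assumption: tests/CrawlerRunner/ip_address.py is importable as a module
# and, as is usual for Scrapy, writes its log to stderr.
proc = subprocess.run(
    [sys.executable, "-m", "tests.CrawlerRunner.ip_address"],
    capture_output=True,
    text=True,
)

# The mock DNS server answers every A query with 127.0.0.1.
assert "Host: not.a.real.domain" in proc.stderr
assert "Type: <class 'ipaddress.IPv4Address'>" in proc.stderr
assert "IP address: 127.0.0.1" in proc.stderr
```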
tests/mockserver.py (75 additions, 26 deletions)
@@ -1,3 +1,4 @@
import argparse
import json
import os
import random
@@ -6,18 +7,19 @@
from urllib.parse import urlencode

from OpenSSL import SSL
-from twisted.web.server import Site, NOT_DONE_YET
-from twisted.web.resource import Resource
+from twisted.internet import defer, reactor, ssl
+from twisted.internet.task import deferLater
+from twisted.names import dns, error
+from twisted.names.server import DNSServerFactory
+from twisted.web.resource import EncodingResourceWrapper, Resource
+from twisted.web.server import GzipEncoderFactory, NOT_DONE_YET, Site
from twisted.web.static import File
from twisted.web.test.test_webclient import PayloadResource
-from twisted.web.server import GzipEncoderFactory
-from twisted.web.resource import EncodingResourceWrapper
from twisted.web.util import redirectTo
-from twisted.internet import reactor, ssl
-from twisted.internet.task import deferLater

from scrapy.utils.python import to_bytes, to_unicode
from scrapy.utils.ssl import SSL_OP_NO_TLSv1_3
+from scrapy.utils.test import get_testenv


def getarg(request, name, default=None, type=None):
@@ -198,12 +200,10 @@ def render(self, request):
return b'Scrapy mock HTTP server\n'


-class MockServer():
+class MockServer:

def __enter__(self):
-from scrapy.utils.test import get_testenv
-self.proc = Popen([sys.executable, '-u', '-m', 'tests.mockserver'],
+self.proc = Popen([sys.executable, '-u', '-m', 'tests.mockserver', '-t', 'http'],
stdout=PIPE, env=get_testenv())
http_address = self.proc.stdout.readline().strip().decode('ascii')
https_address = self.proc.stdout.readline().strip().decode('ascii')
@@ -224,11 +224,45 @@ def url(self, path, is_secure=False):
return host + path


class MockDNSResolver:
"""
Implements twisted.internet.interfaces.IResolver partially
"""

def _resolve(self, name):
record = dns.Record_A(address=b"127.0.0.1")
answer = dns.RRHeader(name=name, payload=record)
return [answer], [], []

def query(self, query, timeout=None):
if query.type == dns.A:
return defer.succeed(self._resolve(query.name.name))
return defer.fail(error.DomainError())

def lookupAllRecords(self, name, timeout=None):
return defer.succeed(self._resolve(name))


class MockDNSServer:

def __enter__(self):
self.proc = Popen([sys.executable, '-u', '-m', 'tests.mockserver', '-t', 'dns'],
stdout=PIPE, env=get_testenv())
host, port = self.proc.stdout.readline().strip().decode('ascii').split(":")
self.host = host
self.port = int(port)
return self

def __exit__(self, exc_type, exc_value, traceback):
self.proc.kill()
self.proc.communicate()


def ssl_context_factory(keyfile='keys/localhost.key', certfile='keys/localhost.crt', cipher_string=None):
factory = ssl.DefaultOpenSSLContextFactory(
os.path.join(os.path.dirname(__file__), keyfile),
os.path.join(os.path.dirname(__file__), certfile),
)
if cipher_string:
ctx = factory.getContext()
# disabling TLS1.2+ because it unconditionally enables some strong ciphers
@@ -238,19 +272,34 @@ def ssl_context_factory(keyfile='keys/localhost.c


if __name__ == "__main__":
-    root = Root()
-    factory = Site(root)
-    httpPort = reactor.listenTCP(0, factory)
-    contextFactory = ssl_context_factory()
-    httpsPort = reactor.listenSSL(0, factory, contextFactory)
-
-    def print_listening():
-        httpHost = httpPort.getHost()
-        httpsHost = httpsPort.getHost()
-        httpAddress = 'http://%s:%d' % (httpHost.host, httpHost.port)
-        httpsAddress = 'https://%s:%d' % (httpsHost.host, httpsHost.port)
-        print(httpAddress)
-        print(httpsAddress)
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-t", "--type", type=str, choices=("http", "dns"), default="http")
+    args = parser.parse_args()
+
+    if args.type == "http":
+        root = Root()
+        factory = Site(root)
+        httpPort = reactor.listenTCP(0, factory)
+        contextFactory = ssl_context_factory()
+        httpsPort = reactor.listenSSL(0, factory, contextFactory)
+
+        def print_listening():
+            httpHost = httpPort.getHost()
+            httpsHost = httpsPort.getHost()
+            httpAddress = "http://%s:%d" % (httpHost.host, httpHost.port)
+            httpsAddress = "https://%s:%d" % (httpsHost.host, httpsHost.port)
+            print(httpAddress)
+            print(httpsAddress)
+
+    elif args.type == "dns":
+        clients = [MockDNSResolver()]
+        factory = DNSServerFactory(clients=clients)
+        protocol = dns.DNSDatagramProtocol(controller=factory)
+        listener = reactor.listenUDP(0, protocol)
+
+        def print_listening():
+            host = listener.getHost()
+            print("%s:%s" % (host.host, host.port))

reactor.callWhenRunning(print_listening)
reactor.run()
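
To exercise the DNS half in isolation, one could point a Twisted resolver at a running MockDNSServer and resolve an arbitrary name; MockDNSResolver answers every A query with 127.0.0.1, so any hostname should do. A minimal sketch, mirroring what tests/CrawlerRunner/ip_address.py does:

```python
from twisted.internet import reactor
from twisted.names.client import createResolver

from tests.mockserver import MockDNSServer

with MockDNSServer() as dns_server:
    resolver = createResolver(servers=[(dns_server.host, dns_server.port)])
    d = resolver.getHostByName("any.hostname.example")
    d.addCallback(lambda ip: print("Resolved to:", ip))  # expected: 127.0.0.1
    d.addErrback(lambda failure: print("Lookup failed:", failure))
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
```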
tests/test_crawl.py (19 additions)
@@ -1,5 +1,8 @@
import json
import logging
from ipaddress import IPv4Address
from socket import gethostbyname
from urllib.parse import urlparse

from pytest import mark
from testfixtures import LogCapture
@@ -343,3 +346,19 @@ def _on_item_scraped(item):
self.assertIn("Got response 200", str(log))
self.assertIn({'id': 1}, items)
self.assertIn({'id': 2}, items)

@defer.inlineCallbacks
def test_dns_server_ip_address(self):
crawler = self.runner.create_crawler(SingleRequestSpider)
url = self.mockserver.url('/status?n=200')
yield crawler.crawl(seed=url, mockserver=self.mockserver)
ip_address = crawler.spider.meta['responses'][0].ip_address
self.assertIsNone(ip_address)

crawler = self.runner.create_crawler(SingleRequestSpider)
url = self.mockserver.url('/echo?body=test')
expected_netloc, _ = urlparse(url).netloc.split(':')
yield crawler.crawl(seed=url, mockserver=self.mockserver)
ip_address = crawler.spider.meta['responses'][0].ip_address
self.assertIsInstance(ip_address, IPv4Address)
self.assertEqual(str(ip_address), gethostbyname(expected_netloc))