Implement experimental HTTP/2 support #4769

Merged · 101 commits · Mar 18, 2021
Commits
9408c77
feat(http2): IH2EventsHandler, http2 module
adityaa30 May 31, 2020
7912923
chore(http2): Stream class
adityaa30 Jun 2, 2020
9ff9cae
feat(http2): support for GET requests
adityaa30 Jun 7, 2020
d09ccf8
feat(http2): support for POST requests
adityaa30 Jun 13, 2020
d06bb12
refactor: move H2Connection instance to stream
adityaa30 Jun 13, 2020
de4a343
fix: large data chunk not received
adityaa30 Jun 14, 2020
01ad8b3
refactor(http2): clean up
adityaa30 Jun 14, 2020
089dbc7
chore: use deque for pending request pool
adityaa30 Jun 17, 2020
700df3e
test: mockserver with h2 protocol for tests
adityaa30 Jun 17, 2020
303485a
fix(http2): POST request not sending large body
adityaa30 Jun 20, 2020
c74ef66
feat: handle response for different reasons
adityaa30 Jun 21, 2020
a97ac0a
test: GET request for HTTP2Client using mockserver
adityaa30 Jun 24, 2020
69f6d03
feat: TypedDict for Stream._response
adityaa30 Jun 24, 2020
065b315
Merge branch 'master' of https://github.com/scrapy/scrapy into h2-cli…
adityaa30 Jun 24, 2020
690dd7f
test: GET & POST request test for h2 client
adityaa30 Jun 28, 2020
6387445
test(tox.ini): change Twisted -> Twisted[http2]
adityaa30 Jun 28, 2020
23906b6
refactor: move TypedDict types to types.py
adityaa30 Jun 29, 2020
90a7007
test: warnsize logs, no content header, dataloss
adityaa30 Jun 29, 2020
d17417b
Merge branch 'master' of https://github.com/scrapy/scrapy into h2-cli…
adityaa30 Jun 29, 2020
26ab3e4
feat: FIFO policy to handle large no. of requests
adityaa30 Jun 30, 2020
50dd927
fix: disable redundant logs
adityaa30 Jun 30, 2020
7b1ad99
test: query params, certificate & ip_address
adityaa30 Jul 1, 2020
c361fe0
feat: check for invalid hostname
adityaa30 Jul 1, 2020
4acdc2e
refactor: use __qualname__, () for large strings
adityaa30 Jul 1, 2020
a94b303
test: reduce test data size to 1MB
adityaa30 Jul 6, 2020
7f5bb6b
chore: add h2 to setup.py, tox.ini
adityaa30 Jul 6, 2020
54e4228
refactor: use protocol
elacuesta Jul 6, 2020
1c40dfa
fix: handle CONNECTION_LOST & RESET separately
adityaa30 Jul 6, 2020
2ea7d82
feat: H2ClientFactory
adityaa30 Jul 8, 2020
64c6af1
refactor: use str instead of to_unicode
adityaa30 Jul 12, 2020
aeaeb73
feat: assert negotiated protocol as h2
adityaa30 Jul 13, 2020
1dd27a9
feat: Idle Timeout for H2Connection (240s)
adityaa30 Jul 14, 2020
e662762
chore: Handle ConnectionTerminated event
adityaa30 Jul 14, 2020
316620b
chore: pass spider as argument for request method
adityaa30 Jul 22, 2020
3685e99
test: http2 connection timeout
adityaa30 Jul 22, 2020
9fffb80
feat: H2Agent, H2ConnectionPool base implementation
adityaa30 Jul 8, 2020
8252a6f
fix: H2Agent not able to connect via SSL
adityaa30 Jul 14, 2020
62ce842
fix: multiple h2 connections to same uri
adityaa30 Jul 15, 2020
031bfc9
feat(wip): ScrapyH2Agent, ScrapyProxyH2Agent
adityaa30 Jul 22, 2020
92bec38
feat: MethodNotAllowed405, Content-Length header
adityaa30 Jul 29, 2020
e834299
test: H2DownloadHandler
adityaa30 Jul 29, 2020
19f2b4b
refactor: AcceptableProtocolsContextFactory
adityaa30 Jul 29, 2020
a3fecaf
test: fix host-name H2DownloadHandler tests
adityaa30 Jul 30, 2020
d707f8b
docs: mention H2DownloadHandler in settings.rst
adityaa30 Jul 30, 2020
e0c3019
fix: ScrapyProxyH2Agent
adityaa30 Aug 9, 2020
c67d6de
fix: H2 docs, NotImplementedError for H2 Tunnel
adityaa30 Aug 10, 2020
90f85a2
Enable Travis CI
Gallaecio Aug 11, 2020
af73f14
refactor: move all http2 tests in separate files
adityaa30 Aug 16, 2020
d97cf97
Merge branch 'master' of https://github.com/scrapy/scrapy into h2-cli…
adityaa30 Aug 16, 2020
f9f008e
test: add typing-extensions
adityaa30 Aug 16, 2020
38d3617
fix: typing & pylint errors
adityaa30 Aug 16, 2020
75fe3d1
fix: increase timeout to 0.5 seconds
adityaa30 Aug 16, 2020
a87ab71
refactor(http2): metadata for Stream
adityaa30 Aug 17, 2020
a206ac5
tests: disable python 3.5 for travis and azure
adityaa30 Aug 18, 2020
e3233b7
refactor(h2-stream): alphabetical order of imports
adityaa30 Aug 18, 2020
30eb005
fix: InvalidNegotiatedProtocol __str__ method
adityaa30 Aug 19, 2020
2f00666
refactor: move agents & context-factory
adityaa30 Aug 19, 2020
26d344b
Merge branch 'http2' of https://github.com/scrapy/scrapy into h2-clie…
adityaa30 Aug 24, 2020
1432161
fix: bump min typing-extensions version to 3.7.4
adityaa30 Aug 24, 2020
450ba6b
fix(typo): stream -> streams, use isinstance
adityaa30 Aug 26, 2020
5e36f53
chore: remove typing-extensions dependency
adityaa30 Aug 27, 2020
a8aedbe
chore: rearrange imports
adityaa30 Aug 29, 2020
eff33a2
fix(h2): Mockserver test uses H2DownloadHandler
adityaa30 Aug 30, 2020
e6dcfd3
Merge pull request #4610 from adityaa30/h2-client-protocol
Gallaecio Aug 31, 2020
8a3ba34
Merge remote-tracking branch 'upstream/master' into http2
Gallaecio Aug 31, 2020
ddc26f3
Revert Travis CI changes
Gallaecio Sep 1, 2020
4d6359d
Mark HTTP/2 as experimental
Gallaecio Sep 11, 2020
6e8d20a
HTTP/2: add some type hints (#4785)
elacuesta Sep 16, 2020
269fe35
Merge branch 'master' into http2
Gallaecio Oct 6, 2020
bde96a5
Ignore server-initiated events
Gallaecio Nov 18, 2020
08f5ed7
Fix memory issue due to unexpectedly large server frames
Gallaecio Nov 18, 2020
d698b51
Merge branch 'master' into http2
elacuesta Dec 31, 2020
e494a3f
protocol attribute for h2 responses
elacuesta Dec 31, 2020
2ce8e0c
Document the (hard-coded) maximum HTTP/2 frame size accepted from ser…
Gallaecio Feb 3, 2021
d102456
setup.py: Twisted → Twisted[http2]
Gallaecio Feb 3, 2021
536e749
HTTP/2: remove verbose protocol-handling logging
Gallaecio Feb 3, 2021
c8d8b18
Merge remote-tracking branch 'upstream/master' into http2
Gallaecio Feb 3, 2021
1a7bde0
Document that HTTP/2 server pushes are ignored
Gallaecio Feb 3, 2021
4c80155
Document that the bytes_received signal is not yet implemented for HT…
Gallaecio Feb 3, 2021
2488003
Fix test_pinned_twisted_version
Gallaecio Feb 3, 2021
0e4b291
HTTP/2: fix canceling a request before a connection has been established
Gallaecio Feb 3, 2021
7b11b74
Use --use-deprecated=legacy-resolver
Gallaecio Feb 4, 2021
7afcd63
Remove unused import
Gallaecio Feb 5, 2021
8527b53
Revert "Use --use-deprecated=legacy-resolver"
Gallaecio Feb 5, 2021
45345ba
Use constraints.txt to limit pip resolver backtracking
Gallaecio Feb 8, 2021
bb72bba
tox: apply upper constraints to all non-pinned package installations
Gallaecio Feb 8, 2021
9ac5b1d
Adjust test constraints
Gallaecio Feb 8, 2021
15b501c
Do not force string interpolation while logging
Gallaecio Feb 10, 2021
ac82a4a
Merge remote-tracking branch 'upstream/master' into http2
elacuesta Feb 17, 2021
e80f37b
Test http2 agent for unsupported scheme
elacuesta Feb 17, 2021
4418f78
Simplify check for negotiated protocol
elacuesta Feb 17, 2021
6326178
http2: acceptable protocol update, tests (#4994)
elacuesta Feb 22, 2021
7605f19
HTTP/2: test 2 concurrent requests to the same domain
Gallaecio Feb 23, 2021
bd29f32
HTTP/2: do not make conn_lost_deferred optional
Gallaecio Feb 23, 2021
5ba31cd
HTTP/2 stream close reason handling: Use else + assert instead of elif
Gallaecio Feb 23, 2021
5101094
HTTP/2: test a CONNECT request
Gallaecio Feb 24, 2021
12064d7
HTTP/2: improve header handling
Gallaecio Feb 24, 2021
386e2a5
tests/test_downloader_handlers_http2.py: fix style issue
Gallaecio Feb 24, 2021
5b2d3e1
Merge branch 'master' into http2
Gallaecio Mar 9, 2021
3bea5e1
Remove unused _is_data_lost method
Gallaecio Mar 9, 2021
2f61d7c
Remove unnecessary del statement
elacuesta Mar 15, 2021
Files changed
23 changes: 23 additions & 0 deletions docs/topics/settings.rst
@@ -666,6 +666,20 @@ handler (without replacement), place this in your ``settings.py``::
        'ftp': None,
    }

The default HTTPS handler uses HTTP/1.1. To use HTTP/2, update
:setting:`DOWNLOAD_HANDLERS` as follows::

    DOWNLOAD_HANDLERS = {
        'https': 'scrapy.core.downloader.handlers.http2.H2DownloadHandler',
    }

.. note::

    Scrapy currently does not support HTTP/2 Cleartext (h2c), since no major
    browser supports HTTP/2 unencrypted (see the `http2 faq`_).

.. _http2 faq: https://http2.github.io/faq/#does-http2-require-encryption

.. setting:: DOWNLOAD_TIMEOUT

DOWNLOAD_TIMEOUT
@@ -743,6 +757,15 @@ Optionally, this can be set on a per-request basis by using the
If :setting:`RETRY_ENABLED` is ``True`` and this setting is set to ``True``,
the ``ResponseFailed([_DataLoss])`` failure will be retried as usual.

.. warning::

    This setting is ignored by the
    :class:`~scrapy.core.downloader.handlers.http2.H2DownloadHandler`
    download handler (see :setting:`DOWNLOAD_HANDLERS`). In case of a data loss
    error, the corresponding HTTP/2 connection may be corrupted, affecting other
    requests that use the same connection; hence, a ``ResponseFailed([InvalidBodyLengthError])``
    failure is always raised for every request that was using that connection.
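
For illustration only (not part of this diff), a spider errback might surface
this failure. ``ResponseFailed`` and its ``reasons`` attribute are Twisted
APIs; the errback name is hypothetical::

    from twisted.web._newclient import ResponseFailed

    def errback(self, failure):
        # With the HTTP/2 handler, a data-loss error arrives as
        # ResponseFailed for every request that shared the corrupted
        # connection.
        if failure.check(ResponseFailed):
            for reason in failure.value.reasons:
                self.logger.warning('HTTP/2 download failed: %s',
                                    reason.getErrorMessage())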

.. setting:: DUPEFILTER_CLASS

DUPEFILTER_CLASS
55 changes: 53 additions & 2 deletions scrapy/core/downloader/contextfactory.py
@@ -1,10 +1,15 @@
import warnings

from OpenSSL import SSL
from twisted.internet._sslverify import _setAcceptableProtocols
from twisted.internet.ssl import optionsForClientTLS, CertificateOptions, platformTrust, AcceptableCiphers
from twisted.web.client import BrowserLikePolicyForHTTPS
from twisted.web.iweb import IPolicyForHTTPS
from zope.interface.declarations import implementer
from zope.interface.verify import verifyObject

from scrapy.core.downloader.tls import ScrapyClientTLSOptions, DEFAULT_CIPHERS
from scrapy.core.downloader.tls import DEFAULT_CIPHERS, openssl_methods, ScrapyClientTLSOptions
from scrapy.utils.misc import create_instance, load_object


@implementer(IPolicyForHTTPS)
@@ -81,8 +86,8 @@ class BrowserLikeContextFactory(ScrapyClientContextFactory):
    The default OpenSSL method is ``TLS_METHOD`` (also called
    ``SSLv23_METHOD``) which allows TLS protocol negotiation.
    """
    def creatorForNetloc(self, hostname, port):

    def creatorForNetloc(self, hostname, port):
        # trustRoot set to platformTrust() will use the platform's root CAs.
        #
        # This means that a website like https://www.cacert.org will be rejected
@@ -92,3 +97,49 @@ def creatorForNetloc(self, hostname, port):
            trustRoot=platformTrust(),
            extraCertificateOptions={'method': self._ssl_method},
        )


@implementer(IPolicyForHTTPS)
class AcceptableProtocolsContextFactory:
    """Context factory used to override the acceptable protocols of the
    wrapped context factory, setting up the ``OpenSSL.SSL.Context`` for
    NPN and/or ALPN negotiation.
    """

    def __init__(self, context_factory, acceptable_protocols):
        verifyObject(IPolicyForHTTPS, context_factory)
        self._wrapped_context_factory = context_factory
        self._acceptable_protocols = acceptable_protocols

    def creatorForNetloc(self, hostname, port):
        options = self._wrapped_context_factory.creatorForNetloc(hostname, port)
        _setAcceptableProtocols(options._ctx, self._acceptable_protocols)
        return options


def load_context_factory_from_settings(settings, crawler):
    ssl_method = openssl_methods[settings.get('DOWNLOADER_CLIENT_TLS_METHOD')]
    context_factory_cls = load_object(settings['DOWNLOADER_CLIENTCONTEXTFACTORY'])
    # try method-aware context factory
    try:
        context_factory = create_instance(
            objcls=context_factory_cls,
            settings=settings,
            crawler=crawler,
            method=ssl_method,
        )
    except TypeError:
        # use context factory defaults
        context_factory = create_instance(
            objcls=context_factory_cls,
            settings=settings,
            crawler=crawler,
        )
        msg = """
'%s' does not accept `method` argument (type OpenSSL.SSL method, \
e.g. OpenSSL.SSL.SSLv23_METHOD) and/or `tls_verbose_logging` argument and/or `tls_ciphers` argument. \
Please upgrade your context factory class to handle them or ignore them.""" % (
            settings['DOWNLOADER_CLIENTCONTEXTFACTORY'],)
        warnings.warn(msg)

    return context_factory
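
As a usage sketch (not part of this diff), the two helpers above can be
combined so that TLS connections advertise HTTP/2 via ALPN; the ``[b'h2']``
protocol list is an assumption mirroring what the HTTP/2 agent passes::

    from scrapy.core.downloader.contextfactory import (
        AcceptableProtocolsContextFactory,
        load_context_factory_from_settings,
    )
    from scrapy.settings import Settings

    settings = Settings()  # hypothetical standalone use; Scrapy normally supplies this
    base_factory = load_context_factory_from_settings(settings, crawler=None)
    # Wrap the base factory so the OpenSSL context offers "h2" during
    # NPN/ALPN negotiation.
    h2_factory = AcceptableProtocolsContextFactory(base_factory, [b'h2'])
    options = h2_factory.creatorForNetloc(b'example.com', 443)
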
27 changes: 2 additions & 25 deletions scrapy/core/downloader/handlers/http11.py
Original file line number Diff line number Diff line change
@@ -20,12 +20,11 @@
from zope.interface import implementer

from scrapy import signals
from scrapy.core.downloader.tls import openssl_methods
from scrapy.core.downloader.contextfactory import load_context_factory_from_settings
from scrapy.core.downloader.webclient import _parse
from scrapy.exceptions import ScrapyDeprecationWarning, StopDownload
from scrapy.http import Headers
from scrapy.responsetypes import responsetypes
from scrapy.utils.misc import create_instance, load_object
from scrapy.utils.python import to_bytes, to_unicode


@@ -43,29 +42,7 @@ def __init__(self, settings, crawler=None):
        self._pool.maxPersistentPerHost = settings.getint('CONCURRENT_REQUESTS_PER_DOMAIN')
        self._pool._factory.noisy = False

        self._sslMethod = openssl_methods[settings.get('DOWNLOADER_CLIENT_TLS_METHOD')]
        self._contextFactoryClass = load_object(settings['DOWNLOADER_CLIENTCONTEXTFACTORY'])
        # try method-aware context factory
        try:
            self._contextFactory = create_instance(
                objcls=self._contextFactoryClass,
                settings=settings,
                crawler=crawler,
                method=self._sslMethod,
            )
        except TypeError:
            # use context factory defaults
            self._contextFactory = create_instance(
                objcls=self._contextFactoryClass,
                settings=settings,
                crawler=crawler,
            )
            msg = f"""
'{settings["DOWNLOADER_CLIENTCONTEXTFACTORY"]}' does not accept `method` \
argument (type OpenSSL.SSL method, e.g. OpenSSL.SSL.SSLv23_METHOD) and/or \
`tls_verbose_logging` argument and/or `tls_ciphers` argument. \
Please upgrade your context factory class to handle them or ignore them."""
            warnings.warn(msg)
        self._contextFactory = load_context_factory_from_settings(settings, crawler)
        self._default_maxsize = settings.getint('DOWNLOAD_MAXSIZE')
        self._default_warnsize = settings.getint('DOWNLOAD_WARNSIZE')
        self._fail_on_dataloss = settings.getbool('DOWNLOAD_FAIL_ON_DATALOSS')
Expand Down
119 changes: 119 additions & 0 deletions scrapy/core/downloader/handlers/http2.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,119 @@
import warnings
from time import time
from typing import Optional
from urllib.parse import urldefrag

from twisted.internet.defer import Deferred
from twisted.internet.error import TimeoutError
from twisted.web.client import URI

from scrapy.core.downloader.contextfactory import load_context_factory_from_settings
from scrapy.core.downloader.webclient import _parse
from scrapy.core.http2.agent import H2Agent, H2ConnectionPool, ScrapyProxyH2Agent
from scrapy.http import Request, Response
from scrapy.settings import Settings
from scrapy.spiders import Spider
from scrapy.utils.python import to_bytes


class H2DownloadHandler:
    def __init__(self, settings: Settings, crawler=None):
        self._crawler = crawler

        from twisted.internet import reactor
        self._pool = H2ConnectionPool(reactor, settings)
        self._context_factory = load_context_factory_from_settings(settings, crawler)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings, crawler)

    def download_request(self, request: Request, spider: Spider) -> Deferred:
        agent = ScrapyH2Agent(
            context_factory=self._context_factory,
            pool=self._pool,
            crawler=self._crawler
        )
        return agent.download_request(request, spider)

    def close(self) -> None:
        self._pool.close_connections()


class ScrapyH2Agent:
    _Agent = H2Agent
    _ProxyAgent = ScrapyProxyH2Agent

    def __init__(
        self, context_factory,
        pool: H2ConnectionPool,
        connect_timeout=10, bind_address: Optional[bytes] = None,
        crawler=None
    ) -> None:
        self._context_factory = context_factory
        self._connect_timeout = connect_timeout
        self._bind_address = bind_address
        self._pool = pool
        self._crawler = crawler

    def _get_agent(self, request: Request, timeout: Optional[float]) -> H2Agent:
        from twisted.internet import reactor
        bind_address = request.meta.get('bindaddress') or self._bind_address
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxy_host, proxy_port, proxy_params = _parse(proxy)
            scheme = _parse(request.url)[0]
            proxy_host = proxy_host.decode()
            omit_connect_tunnel = b'noconnect' in proxy_params
            if omit_connect_tunnel:
                warnings.warn("Using HTTPS proxies in the noconnect mode is not supported by the "
                              "downloader handler. If you use Crawlera, it doesn't require this "
                              "mode anymore, so you should update scrapy-crawlera to 1.3.0+ "
                              "and remove '?noconnect' from the Crawlera URL.")

            if scheme == b'https' and not omit_connect_tunnel:
                # ToDo
                raise NotImplementedError('Tunneling via CONNECT method using HTTP/2.0 is not yet supported')
            return self._ProxyAgent(
                reactor=reactor,
                context_factory=self._context_factory,
                proxy_uri=URI.fromBytes(to_bytes(proxy, encoding='ascii')),
                connect_timeout=timeout,
                bind_address=bind_address,
                pool=self._pool
            )

        return self._Agent(
            reactor=reactor,
            context_factory=self._context_factory,
            connect_timeout=timeout,
            bind_address=bind_address,
            pool=self._pool
        )

    def download_request(self, request: Request, spider: Spider) -> Deferred:
        from twisted.internet import reactor
        timeout = request.meta.get('download_timeout') or self._connect_timeout
        agent = self._get_agent(request, timeout)

        start_time = time()
        d = agent.request(request, spider)
        d.addCallback(self._cb_latency, request, start_time)

        timeout_cl = reactor.callLater(timeout, d.cancel)
        d.addBoth(self._cb_timeout, request, timeout, timeout_cl)
        return d

    @staticmethod
    def _cb_latency(response: Response, request: Request, start_time: float) -> Response:
        request.meta['download_latency'] = time() - start_time
        return response

    @staticmethod
    def _cb_timeout(response: Response, request: Request, timeout: float, timeout_cl) -> Response:
        if timeout_cl.active():
            timeout_cl.cancel()
            return response

        url = urldefrag(request.url)[0]
        raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
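
A minimal driving sketch (not part of this diff), assuming a crawl is already
running so that ``crawler``, ``request`` and ``spider`` exist; in normal use
Scrapy instantiates the handler through :setting:`DOWNLOAD_HANDLERS` as
documented above::

    from scrapy.core.downloader.handlers.http2 import H2DownloadHandler

    handler = H2DownloadHandler.from_crawler(crawler)
    # download_request() returns a twisted.internet.defer.Deferred that
    # fires with a scrapy.http.Response once the HTTP/2 stream completes.
    d = handler.download_request(request, spider)
    d.addCallback(lambda response: spider.logger.info('got %d', response.status))
    d.addErrback(lambda failure: spider.logger.error(failure.getErrorMessage()))
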
Empty file added: scrapy/core/http2/__init__.py