
[MRG+2] Handle data loss gracefully. #2590

Merged
merged 1 commit into master from handle-data-loss-gracefully on Mar 1, 2017

Conversation


@rmax rmax commented Feb 23, 2017

Websites that return a wrong Content-Length header may cause a data
loss error, as can a chunked response that is not terminated properly.

This change adds a new setting DOWNLOAD_FAIL_ON_DATALOSS (default: True) and request.meta key download_fail_on_dataloss.
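As a quick illustration, here is a minimal sketch of how the setting and the meta key are meant to be used (the spider and URL are hypothetical):

# settings.py -- project-wide default: keep failing on data loss
DOWNLOAD_FAIL_ON_DATALOSS = True

# spider.py -- per-request override via request.meta
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"  # hypothetical spider

    def start_requests(self):
        # Accept a possibly truncated response for this request only.
        yield scrapy.Request(
            "http://www.example.com/",
            meta={"download_fail_on_dataloss": False},
        )

    def parse(self, response):
        self.logger.info("Got %d body bytes", len(response.body))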


Related issues:


@rmax rmax commented Feb 23, 2017

This fixes #2586


@codecov-io codecov-io commented Feb 23, 2017

Codecov Report

Merging #2590 into master will decrease coverage by 0.38%.
The diff coverage is 90%.

@@            Coverage Diff             @@
##           master    #2590      +/-   ##
==========================================
- Coverage   83.93%   83.55%   -0.38%     
==========================================
  Files         161      161              
  Lines        8905     8920      +15     
  Branches     1313     1266      -47     
==========================================
- Hits         7474     7453      -21     
- Misses       1176     1206      +30     
- Partials      255      261       +6
Impacted Files Coverage Δ
scrapy/settings/default_settings.py 98.62% <100%> (ø)
scrapy/core/downloader/handlers/http11.py 90.83% <89.47%> (-0.31%)
scrapy/utils/gz.py 85.29% <0%> (-14.71%)
scrapy/link.py 86.36% <0%> (-13.64%)
scrapy/_monkeypatches.py 50% <0%> (-7.15%)
scrapy/utils/iterators.py 95.45% <0%> (-4.55%)
scrapy/mail.py 72.36% <0%> (-3.95%)
scrapy/downloadermiddlewares/httpproxy.py 96.29% <0%> (-3.71%)
scrapy/downloadermiddlewares/decompression.py 96.61% <0%> (-3.39%)
scrapy/http/response/__init__.py 89.83% <0%> (-3.39%)
... and 9 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rmax rmax force-pushed the handle-data-loss-gracefully branch from c14abe4 to 887e6c5 Feb 24, 2017
        return d

    def test_download_broken_content_allow_data_loss(self):
        request = Request(self.getURL('broken'), meta={'download_allow_dataloss': True})

@redapple redapple Feb 24, 2017

Can you add a similar test when using the DOWNLOAD_ALLOW_DATALOSS setting?
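Something along these lines could work (a sketch only; it reuses the broken endpoint and test harness of the test above, and assumes the handler under test is built with the setting enabled):

def test_download_broken_content_allow_data_loss_via_setting(self):
    # Hypothetical companion test: no meta key; the behavior comes
    # from the (then-named) DOWNLOAD_ALLOW_DATALOSS setting instead.
    request = Request(self.getURL('broken'))
    d = self.download_request(request, Spider('foo'))
    d.addCallback(lambda r: r.flags)
    d.addCallback(self.assertEqual, ['partial'])
    return d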



@redapple redapple commented Feb 24, 2017

Looks good @rolando, thanks.
The new setting needs to be documented.


@redapple redapple left a comment

See my earlier comments.

@rmax rmax force-pushed the handle-data-loss-gracefully branch from 887e6c5 to a35e65e Feb 28, 2017

@redapple redapple commented Feb 28, 2017

Looks good. Thanks @rolando!

@redapple redapple changed the title Handle data loss gracefully. [MRG+1] Handle data loss gracefully. Feb 28, 2017

.. note::

    This feature needs Twisted >= 11.1 and the use of the default HTTP/1.1

@kmike kmike Feb 28, 2017

Scrapy's minimum supported Twisted version is 13.1.


@rmax rmax Feb 28, 2017

@kmike Do you think we can remove the note there? I copied it from the setting above.


@kmike kmike Feb 28, 2017

Yes, it looks safe to remove (here and in the setting above).


@rmax rmax Feb 28, 2017

Removed note.


@rmax rmax commented Feb 28, 2017

@paul I forgot to mention that I added a warning hinting at the use of the setting, which is shown before the error:

scrapy shell "http://www.ebk-gruppe.com/"  --set RETRY_ENABLED=0  --loglevel WARN
2017-02-28 14:08:29 [scrapy.core.downloader.handlers.http11] WARNING: Found data loss in http://www.ebk-gruppe.com/. If you want to process broken responses set the setting DOWNLOAD_ALLOW_DATALOSS = True
Traceback (most recent call last):
  File "/Users/rolando/miniconda3/envs/dev/bin/scrapy", line 11, in <module>
    load_entry_point('Scrapy', 'console_scripts', 'scrapy')()
  File "/Users/rolando/Projects/sh/scrapy/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Users/rolando/Projects/sh/scrapy/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/Users/rolando/Projects/sh/scrapy/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/Users/rolando/Projects/sh/scrapy/scrapy/commands/shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/Users/rolando/Projects/sh/scrapy/scrapy/shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "/Users/rolando/Projects/sh/scrapy/scrapy/shell.py", line 115, in fetch
    reactor, self._schedule, request, spider)
  File "/Users/rolando/miniconda3/envs/dev/lib/python3.5/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "/Users/rolando/miniconda3/envs/dev/lib/python3.5/site-packages/twisted/python/failure.py", line 372, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>, <twisted.python.failure.Failure twisted.web.http._DataLoss: >]

Whether or not to allow broken responses, where the declared ``Content-Length``
does not match the content sent by the server. If ``False``, these responses
trigger a ``ResponseFailed`` error. If ``True``, these responses are allowed
and the flag ``partial`` is added.

@kmike kmike Feb 28, 2017

There are now request flags and response flags; I think it'd be nice to have a concrete example of how to check it - something like

these responses are allowed and the flag partial is added to the response, i.e. 'partial' in response.flags is True.
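For instance, a spider callback could check the flag like this (a sketch; note that further down this thread the flag ends up renamed to 'dataloss'):

def parse(self, response):
    if 'partial' in response.flags:  # later renamed to 'dataloss' in this PR
        self.logger.warning("Possibly truncated response: %s", response.url)
        return
    # ... normal parsing ...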


@rmax rmax Feb 28, 2017

The flag is not exclusive to this condition, though. Legitimate PotentialDataLoss responses that were not dropped also have the partial flag. I was wondering whether this _DataLoss condition should have its own flag, like data-loss.


@kmike kmike Feb 28, 2017

Was there a 'partial' flag for such responses before this PR?


@kmike kmike Feb 28, 2017

I doubt these flags were widely used, but renaming 'partial' to 'data-loss' for this case is backwards incompatible (if the answer to the previous question is "yes"); I think it's fine to make this change, but it must be mentioned in the release notes.


@rmax rmax Feb 28, 2017

Was there a 'partial' flag for such responses before this PR?

Yes, when a PotentialDataLoss error was raised by Twisted, Scrapy added the partial flag and let the response pass through.

I have updated the doc with the 'partial' check example.

@rmax rmax force-pushed the handle-data-loss-gracefully branch from a35e65e to 9f95357 Feb 28, 2017

Whether or not to allow broken responses, where the declared ``Content-Length``
does not match the content sent by the server. If ``False``, these responses
trigger a ``ResponseFailed`` error. If ``True``, these responses are allowed

@kmike kmike Feb 28, 2017

Are ResponseFailed responses retried by default? It'd be helpful to mention that - i.e. what happens when data loss is detected?


@kmike kmike Feb 28, 2017

Would it be useful to have an option to retry these responses, but keep the last one if retry attempts failed? Based on your experience, can retries fix this data loss problem, or is it usually just a server configuration issue?


@rmax rmax Feb 28, 2017

Yes, ResponseFailed is handled by the retry middleware. And I think there are several reasons why a ResponseFailed can be raised, hence the explicit _DataLoss check. I don't know whether there is a legitimate scenario where a response declares Content-Length: X and the connection is suddenly closed after fewer than X bytes have been transmitted (e.g. the server crashed). In that case, retrying the request looks like the correct behavior.
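For example, an errback could single out the data-loss case roughly like this (a sketch; _DataLoss is private Twisted API, and failure.request is how Scrapy exposes the originating request to errbacks):

from twisted.web.client import ResponseFailed
from twisted.web.http import _DataLoss

def errback(self, failure):
    if failure.check(ResponseFailed):
        # ResponseFailed carries a list of underlying reasons;
        # data loss is only one of the possibilities.
        if any(r.check(_DataLoss) for r in failure.value.reasons):
            self.logger.warning("Data loss on %s", failure.request.url)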


@rmax rmax Feb 28, 2017

Would it be useful to have an option to retry these responses, but keep the last one if retry attempts failed?

Seems like this feature needs to be discussed on its own issue.

can retries fix this data loss problem, or is it usually just a server configuration issue?

There are several scenarios where this can happen. For example, the server returns a static file and sets Content-Length correctly, but fails to read part of the file due to data corruption or disk failure. The common misconfiguration case seems to be an application that uses a content-compression filter which is enabled after Content-Length has been set.

We can't tell how often a retry fixed this issue unless we mine past logs.


@kmike kmike Feb 28, 2017

I'm asking because the best default value for the option is not clear to me. False makes sense because it mimics the existing behavior; True makes sense if the problem is usually caused by server misconfiguration; retry+True sounds like a compromise that could work better as a default in practice because it handles both cases, but I'm not really sure.


@kmike kmike Feb 28, 2017

But I agree that it can be a separate issue :)

@@ -604,6 +604,21 @@ If you want to disable it set to 0.

This feature needs Twisted >= 11.1.

.. setting:: DOWNLOAD_ALLOW_DATALOSS

DOWNLOAD_ALLOW_DATALOSS

@kmike kmike Feb 28, 2017

The name is tricky :) Arguably we allow data loss when we don't pass partial content to the callback and drop the whole response instead. I don't have a better naming idea though.


@kmike kmike Feb 28, 2017

Data loss at the network level happens regardless of Scrapy options; this isn't something we can control.


@rmax rmax Feb 28, 2017

I don't have a better naming idea though.

Neither did I :)


@kmike kmike Feb 28, 2017

Some ideas on how to avoid ambiguous "data loss" name:

  • DOWNLOAD_VALIDATE_CONTENT_LENGTH
  • DOWNLOAD_DROP_PARTIAL


@rmax rmax Feb 28, 2017

Maybe DOWNLOAD_FAIL_ON_DATALOSS, which describes the current behavior: throw an error when ResponseFailed(reasons=[_DataLoss]) happens (or any ResponseFailed error for that matter).


@kmike kmike Feb 28, 2017

This _DataLoss reason which Twisted returns doesn't necessarily mean data loss in practice, right? If it can mean something else, then there is no reason to follow this naming convention and call it 'data loss'; we can do better. That's why I tried to avoid it in these name ideas (though 'DOWNLOAD_DROP_PARTIAL' has the same problem). Other than that, DOWNLOAD_FAIL_ON_DATALOSS sounds fine; it is not ambiguous.


@rmax rmax Feb 28, 2017

Yes, a _DataLoss reason in Twisted means we did not get the amount of data we expected, and that's up to the transfer decoder.

The chunked decoder raises a _DataLoss exception when there is no terminating chunk.

The identity decoder (non-chunked) raises a PotentialDataLoss error when no Content-Length is declared, and a _DataLoss error when the received content length does not match the declared Content-Length.
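A tiny demonstration of the identity-decoder case described above (it relies on private Twisted API and is shown purely to illustrate):

from twisted.web.http import _IdentityTransferDecoder, _DataLoss

body = []
decoder = _IdentityTransferDecoder(
    contentLength=10,               # the server declared 10 bytes...
    dataCallback=body.append,
    finishCallback=lambda rest: None,
)
decoder.dataReceived(b"only7by")    # ...but only 7 arrive
try:
    decoder.noMoreData()            # connection closed early
except _DataLoss:
    print("data loss detected")     # the case this PR lets you tolerate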


@kmike kmike Feb 28, 2017

Aha, so DOWNLOAD_VALIDATE_CONTENT_LENGTH is not a good name, because for the chunked decoder it means a different thing.

@rmax rmax force-pushed the handle-data-loss-gracefully branch from 9f95357 to a556ff0 Feb 28, 2017

    elif not self._allow_dataloss_hinted:
        logger.warn("Found data loss in %s. If you want to process broken "
                    "responses set the setting DOWNLOAD_ALLOW_DATALOSS = True",

@kmike kmike Feb 28, 2017

I think it's worth adding a note that the message won't be shown for further requests, like we do in dupefilters.
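I.e. extend the message along these lines (a sketch; url stands for however the reader exposes the request URL):

elif not self._allow_dataloss_hinted:
    logger.warn("Found data loss in %s. If you want to process broken "
                "responses set the setting DOWNLOAD_ALLOW_DATALOSS = True"
                " -- This message won't be shown in further requests",
                url)  # url: assumed accessor for the request URL
    self._allow_dataloss_hinted = True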


@kmike kmike commented Feb 28, 2017

The PR looks good overall 👍 Todo:

  • I still don't quite like 'data loss' term here because it is ambiguous; I think we should find a better option name.
  • It'd be nice to explain in docs how this feature interacts with retries.
  • A nitpick about warn message
  • decide on 'partial' vs 'data-loss' flag; I'm fine with both.


@rmax rmax commented Feb 28, 2017

I still don't quite like 'data loss' term here because it is ambiguous; I think we should find a better option name.

A new candidate name: DOWNLOAD_FAIL_ON_DATALOSS. This describes the current behavior.

It'd be nice to explain in docs how this feature interacts with retries.

Please check the current setting documentation.

A nitpick about warn message

Added a mention that no more messages will be shown.

decide on 'partial' vs 'data-loss' flag; I'm fine with both.

I arbitrarily changed it to dataloss (as most flags are single-word).

@rmax rmax force-pushed the handle-data-loss-gracefully branch 2 times, most recently from 34bb1fa to 96f5e25 Feb 28, 2017

@rmax rmax commented Feb 28, 2017

Updated the setting name to DOWNLOAD_FAIL_ON_DATALOSS (default: True).


@kmike kmike commented Feb 28, 2017

LGTM. Thanks @rolando!

@kmike kmike changed the title [MRG+1] Handle data loss gracefully. [MRG+2] Handle data loss gracefully. Feb 28, 2017

@rmax rmax commented Feb 28, 2017

👍

@@ -118,6 +120,35 @@ def render(self, request):
        return request.requestHeaders.getRawHeaders(b"content-length")[0]


class ChunkedResource(resource.Resource):

@redapple redapple Mar 1, 2017

I understand this acts as a valid server response using chunked encoding. Is that correct?
Would it be possible to add a case for a wrong chunked-encoded response?
This would be to verify that it triggers #2413 and that the downloader can be made robust against it with the new setting.


@rmax rmax Mar 1, 2017

That's a bit tricky, but I added a broken chunked resource with accompanying tests: https://github.com/scrapy/scrapy/pull/2590/files#diff-d6a30363eaf8467303835b3adce0af7dR134

It relies on an internal request.chunked flag; if its purpose changes in future Twisted releases, the tests will break. I verified the behavior using tcpdump (a sketch of such a resource follows the captures below):

  • This is a normal chunked response:
127.000.000.001.60134-127.000.000.001.60217: HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: text/html
Server: TwistedWeb/16.6.0
Date: Wed, 01 Mar 2017 14:25:57 GMT

8
chunked
8
content

0

  • This is a broken chunked response:
127.000.000.001.60720-127.000.000.001.60754: HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: text/html
Date: Wed, 01 Mar 2017 14:30:42 GMT
Server: TwistedWeb/16.6.0

8
chunked
8
content
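
For reference, a sketch of what the broken chunked test resource can look like (modeled on the description above; flipping request.chunked is internal Twisted behavior and may break in future releases):

from twisted.internet import reactor
from twisted.web import resource, server

class BrokenChunkedResource(resource.Resource):

    def render(self, request):
        def response():
            request.write(b"chunked ")
            request.write(b"content")
            # Suppress the terminating zero-length chunk (internal flag).
            request.chunked = False
            request.finish()
            # Drop the connection so an HTTP/1.1 client does not wait
            # forever on the kept-alive socket.
            request.channel.transport.loseConnection()
        reactor.callLater(0, response)
        return server.NOT_DONE_YET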



@redapple redapple Mar 1, 2017

Looks good @rolando. Thanks!

@rmax rmax force-pushed the handle-data-loss-gracefully branch from 96f5e25 to f0ba308 Mar 1, 2017
Websites that return a wrong ``Content-Length`` header may cause a data
loss error, as can a chunked response that is not terminated properly.

This change adds a new setting ``DOWNLOAD_FAIL_ON_DATALOSS`` (default:
``True``) and request.meta key ``download_fail_on_dataloss``.