Response file object (returned by `InfoExtractor._request_webpage`) may be closed for failed requests (matching `expected_status`) on Python 3.4.1+ #17195

Closed
puxlit opened this issue Aug 9, 2018 · 2 comments

Comments

@puxlit
Contributor

@puxlit puxlit commented Aug 9, 2018

Make sure you are using the latest version: run youtube-dl --version and ensure your version is 2018.08.04. If it's not, read this FAQ entry and update. Issues with outdated version will be rejected.

  • I've verified and I assure that I'm running youtube-dl 2018.08.04

Before submitting an issue make sure you have:

  • At least skimmed through the README, most notably the FAQ and BUGS sections
  • Searched the bugtracker for similar issues including closed ones
  • Checked that provided video/audio/playlist URLs (if any) are alive and playable in a browser

What is the purpose of your issue?

  • Bug report (encountered problems with youtube-dl)
  • Site support request (request for adding support for a new site)
  • Feature request (request for a new functionality)
  • Question
  • Other

Since bpo-15002 (introduced in Python 3.4.1), an HTTPError closes its fp when the error object is destroyed. The current implementation of InfoExtractor._request_webpage (used by InfoExtractor._download_webpage_handle and, in turn, by InfoExtractor._download_webpage, _download_xml, and _download_json) accommodates expected_status by catching the HTTPError and returning its fp. Unfortunately, nothing keeps the error alive after that, so subsequent reads against this file object by the caller are unreliable:

  • If fp is an instance of http.client.HTTPResponse, we read out an empty response body.
  • If fp is an instance of urllib.response.addinfourl (for when youtube-dl handles gzip and deflate responses), the attempted read raises a ValueError: I/O operation on closed file exception, as demonstrated in #17447.
  • On Windows, tempfile._TemporaryFileCloser omits an implementation of __del__ that would close fp, so reads return successfully. This platform inconsistency has been reported as bpo-34958.
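The failure mode in the bullets above can be reproduced without any network access. The following is a minimal sketch, not youtube-dl code: `make_http_error`, `body_after_error_is_gone`, and `buffered_body` are hypothetical helper names, and the error is built by hand around an `io.BytesIO` rather than coming from a real request. The second helper shows one possible workaround (draining the body while the error is still alive); it is an assumption for illustration, not necessarily how this should be fixed in `_request_webpage`.

```python
import gc
import io
import urllib.error


def make_http_error(body, code=418):
    """Build an HTTPError by hand so the demo needs no network access."""
    return urllib.error.HTTPError(
        'http://example.invalid/', code, 'expected failure', {}, io.BytesIO(body))


def body_after_error_is_gone(body):
    """Keep only err.fp, drop the error, then try to read (the pattern at issue).

    Returns the bytes read, or the ValueError raised by reading a closed file.
    Which of the two happens depends on the Python version and platform.
    """
    err = make_http_error(body)
    fp = err.fp
    del err
    gc.collect()  # also covers runtimes that do not free objects immediately
    try:
        return fp.read()
    except ValueError as exc:  # "I/O operation on closed file"
        return exc


def buffered_body(body):
    """Workaround sketch: copy the body out while the error is still alive."""
    err = make_http_error(body)
    buffered = io.BytesIO(err.read())
    del err
    gc.collect()
    return buffered.read()
```

Calling `body_after_error_is_gone(b'teapot')` returns either the body or a `ValueError`, depending on where it runs, whereas `buffered_body(b'teapot')` returns the body regardless, at the cost of holding the whole response in memory.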

Fortunately, the number of extractors that make use of expected_status is small; as of 2018.08.04, it's just bbc, lynda, markiza, and twitch.

I encountered this issue whilst debugging reports of problems running @bato3's fix for #17116 on Python 3.7.

@dstftw
Collaborator

@dstftw dstftw commented Aug 9, 2018

I don't see any problem here. By using expected_status you treat potentially failed outcomes as normal, so you should be ready for consequences like a closed connection. It's the responsibility of the client code to check whether the connection was closed in such cases.
Also, if you want an error, then don't use expected_status and catch the exception instead. The whole point of expected_status is to simplify code in cases where success and failure are both expressed in the same way. For example, when _download_json always returns JSON (e.g. for both 404 and 403), you don't need to catch HTTPError, read the output, and parse it in client code in the 403 scenario.

@dstftw dstftw closed this Aug 9, 2018
@puxlit
Contributor Author

@puxlit puxlit commented Aug 9, 2018

@dstftw, we're not talking about a closed connection here; we're talking about the response body being inaccessible, which defeats the point of expected_status. The expectation is that if you call _download_webpage with expected_status, you'll get the response body back if the request was successful or if the response code matches expected_status. But this is not the case: on expected errors, the response file object returned by _request_webpage will already be closed by the time _download_webpage_handle tries to extract its contents with _webpage_read_content.

So unless you're saying that in usages like page = self._download_webpage('https://httpbin.org/status/418', None, expected_status=418), we're to expect an empty string if the request returns a 418, this is definitely a bug.
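That expectation can be sketched outside youtube-dl as follows. This is only an illustration under stated assumptions: `download_text` and its `opener` parameter are hypothetical names (not youtube-dl APIs), and the injectable `opener` exists solely so the behaviour can be exercised without a network connection.

```python
import urllib.error
import urllib.request


def download_text(url, expected_status=(), opener=urllib.request.urlopen):
    """Return the decoded body for successful responses and for expected error codes.

    Hypothetical helper: any HTTP error whose code is not listed in
    expected_status is re-raised unchanged.
    """
    try:
        with opener(url) as resp:
            return resp.read().decode('utf-8', 'replace')
    except urllib.error.HTTPError as err:
        if err.code not in expected_status:
            raise
        # Read inside the except block, while err (and therefore its fp)
        # is still referenced and open.
        return err.read().decode('utf-8', 'replace')
```

With this contract, `download_text('https://httpbin.org/status/418', expected_status=(418,))` would hand back whatever body the server sent with the 418, and the same call without `expected_status` would raise `HTTPError`.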
