Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[utils] Don't trust Content-length when Content-encoding is present #6176

Merged
merged 3 commits into from
Feb 17, 2023

Conversation

felixonmars
Copy link
Contributor

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

Our YoutubeDLHandler is doing auto-decompression, so that we cannot compare Content-length to the size of extracted data. Although HttpFD has specified Youtubedl-no-compression, the web server may still ignore this and return deflated data (as the situation in #3772).

This change attempts to retain the Content-encoding header for later comparison in HttpFD so that we could discard data_len when the web server returns compressed data. I didn't find the reason why this header was previously removed but didn't yet find a counter example either.

Fixes #3772 in my experiments.

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Our YoutubeDLHandler is doing auto-decompression, so that we cannot
compare Content-length to the size of extracted data. Although HttpFD has
specified Youtubedl-no-compression, the web server may still ignore this
and return deflated data (as the situation in yt-dlp#3772).

This change attempts to retain the Content-encoding header for later
comparison in HttpFD so that we could discard data_len when the web
server returns compressed data. I didn't find the reason why this header
was previously removed but didn't yet find a counter example either.

Fixes yt-dlp#3772 in my experiments.
@pukkandan
Copy link
Member

cc @coletdjnz

@pukkandan pukkandan added the bug Bug that is not site-specific label Feb 6, 2023
@coletdjnz coletdjnz self-requested a review February 6, 2023 19:22
@coletdjnz
Copy link
Member

note for self: check against urllib3/requests too

@@ -213,6 +213,11 @@ def close_stream():
def download():
data_len = ctx.data.info().get('Content-length', None)

if ctx.data.info().get('Content-encoding', None):
Copy link
Contributor

@dirkf dirkf Feb 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None is default?

Suggested change
if ctx.data.info().get('Content-encoding', None):
if ctx.data.info().get('Content-encoding'):

Or maybe

Suggested change
if ctx.data.info().get('Content-encoding', None):
if ctx.data.info().get('Content-encoding') is not None:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm just trying to keep the same coding style as data_len is also adding this , None. I'll try to remove both then.

@dirkf
Copy link
Contributor

dirkf commented Feb 7, 2023

... I didn't find the reason why this header was previously removed but didn't yet find a counter example either.

When I worked on this code I inferred that it was trying to handle nested encodings, or perhaps more realistically to avoid decoding an already decoded contents. The old yt-dl code was like this:

old_resp = resp
if resp.headers.get('Content-encoding', '') == method1:
    # expand resp using method1
    # set resp.msg to old_resp.msg
    del resp.headers['Content-encoding']
if resp.headers.get('Content-encoding', '') == method2:
    # expand resp using method2
    # set resp.msg to old_resp.msg
    del resp.headers['Content-encoding']
if resp.headers.get('Content-encoding', '') == method3:
    # expand resp using method3
    # set resp.msg to old_resp.msg
    del resp.headers['Content-encoding']
...

@felixonmars
Copy link
Contributor Author

When I worked on this code I inferred that it was trying to handle nested encodings, or perhaps more realistically to avoid decoding an already decoded contents.

I see. An alternative approach would be moving the info into a custom header like "Youtubedl-decompressed" and check that in HttpFD instead.

@coletdjnz
Copy link
Member

urllib3/requests appear to retain the content-encoding header after decompression too, so this should be fine in that regard (needed to check due to #2861 and #3668).

When I worked on this code I inferred that it was trying to handle nested encodings, or perhaps more realistically to avoid decoding an already decoded contents.

Shouldn't be an issue as the data should only be tried to be decoded once (i.e. content-encoding read once for whole decoding process, including multi decoding). I don't see us needing anyway of checking if the content is decoded or not either (and in the case it is in an unsupported encoding, we can catch that elsewhere), so including the header even when decoded should be fine.

Copy link
Member

@coletdjnz coletdjnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Networking changes look fine to me.

#3772 seems to work for me too now assuming you don't use --test (@pukkandan not sure if that is a concern)

yt_dlp/downloader/http.py Outdated Show resolved Hide resolved
@pukkandan
Copy link
Member

pukkandan commented Feb 8, 2023

assuming you don't use --test (@pukkandan not sure if that is a concern)

Let's not worry about that for now

Make sure to port this to the networking PR too

@pukkandan pukkandan merged commit 65e5c02 into yt-dlp:master Feb 17, 2023
@felixonmars felixonmars deleted the content-encoding branch February 17, 2023 03:28
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug that is not site-specific
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

[JWPlatform] VTT subtitles are found in HLS manifest but skipped [bilibili.com] unable to --write-subtitles
4 participants