Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Fix HTTP redirect handling #7094

Merged

Conversation

coletdjnz
Copy link
Member

@coletdjnz coletdjnz commented May 21, 2023

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

ADD DESCRIPTION HERE

Backport from #2861

afac4ca never properly fixed the issue (and was not tested, which was part of the problem):

This PR fixes the above issues. It aligns the redirect handling with the latest recommended RFC standard/what common browsers tend to do (see #2861 for more full description of this).

I have also back-ported some of the networking tests, including the HTTP redirect test.

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

🤖 Generated by Copilot at f7ce007

Summary

🧹🐛🎨

Improve redirection handling and code style in yt_dlp/utils/_utils.py and remove obsolete code and tests. This pull request enhances the functionality, readability, and maintainability of the codebase.

yt_dlp cleans up
Redirection bug is fixed
Autumn leaves unused code

Walkthrough

  • Simplify and fix redirection logic in YoutubeDLRedirectHandler class ([link](https://github.com/yt-dlp/yt-dlp/pull/7094/files?diff=unified&w=0#diff-feda8f56946dc048e754111850baaef0eec4b9f8bbc2d3f04b1a785626ea5c0eL1682-R1683), [link](https://github.com/yt-dlp/yt-dlp/pull/7094/files?diff=unified&w=0#diff-feda8f56946dc048e754111850baaef0eec4b9f8bbc2d3f04b1a785626ea5c0eL1696-R1717))

@coletdjnz coletdjnz changed the title [core] Backport fixed urllib redirect handling (and tests) [core] Fix HTTP redirect handling May 21, 2023
@coletdjnz coletdjnz marked this pull request as ready for review May 21, 2023 04:58
test/test_networking.py Outdated Show resolved Hide resolved
test/test_networking.py Outdated Show resolved Hide resolved
Copy link
Member

@Grub4K Grub4K left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with FakeYDL(...) as ydl:
    ...

Can be rewritten as ydl = FakeYDL(...), saving one indentation across the file, since we are currently only using it to setup/teardown console title.

test/test_networking.py Outdated Show resolved Hide resolved
test/test_networking.py Outdated Show resolved Hide resolved
test/test_networking.py Outdated Show resolved Hide resolved
test/test_networking.py Outdated Show resolved Hide resolved
test/test_networking.py Outdated Show resolved Hide resolved
test/test_networking.py Outdated Show resolved Hide resolved
test/test_networking.py Outdated Show resolved Hide resolved
test/test_networking.py Outdated Show resolved Hide resolved
test/test_networking.py Outdated Show resolved Hide resolved
test/test_networking.py Outdated Show resolved Hide resolved
coletdjnz and others added 8 commits May 21, 2023 22:08
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
Co-authored-by: Simon Sawicki <accounts@grub4k.xyz>
Copy link
Member

@pukkandan pukkandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't compared the actual logic to the RFC - let me know if I should or you are confident enough to merge without.

While cleaning up legacy code certainly would be nice, I don't want to hold this PR up for that, so I'm not making any suggestions for it.

PS: It's difficult to see what's moved vs new code in test_networking. In future, try to do rename and code change in separate commits.

test/test_networking.py Outdated Show resolved Hide resolved
@pukkandan pukkandan force-pushed the master branch 2 times, most recently from ee280c7 to 7aeda6c Compare May 24, 2023 18:09
@coletdjnz coletdjnz merged commit 08916a4 into yt-dlp:master May 27, 2023
11 checks passed
@coletdjnz coletdjnz deleted the nw-backport/urllib-redirect-handling branch May 27, 2023 07:06
@dirkf
Copy link
Contributor

dirkf commented Jun 7, 2023

In Py3.9, this quashed the resource warnings on shutdown (not seen in 3.5 or 2.7):

  • ensure daemon_threads = False in ThreadingHTTPServer (default)
  • add TestHTTP.tearDown() method
    def tearDown(self):

        def closer(svr):
            def _closer():
                svr.shutdown()
                svr.server_close()
            return _closer

        shutdown_thread = threading.Thread(target=closer(self.http_httpd))
        shutdown_thread.start()
        self.http_server_thread.join(2.0)

        shutdown_thread = threading.Thread(target=closer(self.https_httpd))
        shutdown_thread.start()
        self.https_server_thread.join(2.0)

threading needs a threading.join_all() like JS Promise.all() so that the caller can wait for several threads to finish in one call.

@dirkf
Copy link
Contributor

dirkf commented Jun 7, 2023

In ytdl-org/youtube-dl@c047270c02, yt-dl started to remove each Content-Encoding header after processing the encoding. The rationale was that the handler routine could be called twice with a geo-verification proxy set. If this still needs to be done, the tests TestHTT.test_{encoding} won't work as written. But shouldn't it be done anyway?

RFC9110 s8.4

The "Content-Encoding" header field indicates what content codings have been applied to the representation, beyond those inherent in the media type, and thus what decoding mechanisms have to be applied in order to obtain data in the media type referenced by the Content-Type header field. ...

Obviously RFC9110 applies to what's sent over the network rather than the processing at either end, but it seems reasonable to apply the same semantics in those contexts, so that the header only lists the encodings that haven't yet been decoded. Work-arounds with a custom header or Request attribute seem unnecessarily painful.

@coletdjnz
Copy link
Member Author

In ytdl-org/youtube-dl@c047270c02, yt-dl started to remove each Content-Encoding header after processing the encoding. The rationale was that the handler routine could be called twice with a geo-verification proxy set. If this still needs to be done, the tests TestHTT.test_{encoding} won't work as written. But shouldn't it be done anyway?

RFC9110 s8.4

The "Content-Encoding" header field indicates what content codings have been applied to the representation, beyond those inherent in the media type, and thus what decoding mechanisms have to be applied in order to obtain data in the media type referenced by the Content-Type header field. ...

Obviously RFC9110 applies to what's sent over the network rather than the processing at either end, but it seems reasonable to apply the same semantics in those contexts, so that the header only lists the encodings that haven't yet been decoded. Work-arounds with a custom header or Request attribute seem unnecessarily painful.

This was changed in 65e5c02.

Funny enough, even urllib3/requests don't remove the Content-Encoding header, hence why I gave the go ahead.

I'll have to test the geo verification/per request proxy. I completely changed it in the network rework as the implementation is flawed/buggy.

@dirkf
Copy link
Contributor

dirkf commented Jun 7, 2023

And I actually commented on that PR too. But I hadn't yet tracked down the commit where the removal was added.

Looking at the MDN page for content-encoding (tl;dr for the RFCs), the decoding logic is indeed essentially for encoding in reversed(headers['Content-Encoding'].split()): data = decode(data, encoding). However nested encoding would appear to have little benefit with the standard set of encodings, and presumably would only happen accidentally in some proxy configuration.

@dirkf
Copy link
Contributor

dirkf commented Jun 8, 2023

Reading the HTTP specs again, it seems that a header (like Content-encoding) that can have multiple (comma-separated) values can also appear as multiple lines each containing one or more of the same set of values, so these should be equivalent:

Multi-valued-hdr: val1, val2, val3
Multi-valued-hdr: val1, val2
...
Multi-valued-hdr: val3

So my code for this is (where the decompression methods return io.BytesIO()) and txt_or_none() is lambda x: strip_or_none(x) or None:

        encodings = list(filter(txt_or_none, ', '.join(resp.headers.get_all('Content-encoding', [])).split(',')))
        for enc in encodings[::-1]:
            old_resp = resp
            if enc in ('gzip', 'x-gzip'):
                uncompressed = self.gzip_d(resp.read())
                encodings.pop()
            elif enc == 'deflate':
                uncompressed = self.reflate(resp.read())
                encodings.pop()
            elif enc == 'br' and brotli:
                uncompressed = self.brotli(resp.read())
                encodings.pop()
            else:
                break
            resp = compat_urllib_request.addinfourl(
                uncompressed, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg

        if encodings:
            # remaining encodings
            resp.headers['Content-encoding'] = ' '.join(encodings)
        elif 'Content-encoding' in resp.headers:
            del resp.headers['Content-encoding']

dirkf added a commit to dirkf/youtube-dl that referenced this pull request Jul 18, 2023
* Thx coletdjnz: yt-dlp/yt-dlp#7094
* add test that redirected `POST` loses its `Content-Type`
dirkf added a commit to ytdl-org/youtube-dl that referenced this pull request Jul 18, 2023
* Thx coletdjnz: yt-dlp/yt-dlp#7094
* add test that redirected `POST` loses its `Content-Type`
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Aligns HTTP redirect handling with what browsers commonly do and RFC standards. 

Fixes issues yt-dlp@afac4ca missed.

Authored by: coletdjnz
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

4 participants