[networking] Remove dot segments during URL normalization #7662

Merged
merged 6 commits into from Jul 28, 2023


coletdjnz
Member

@coletdjnz coletdjnz commented Jul 22, 2023

This implements RFC3986 5.2.4 remove_dot_segments during the URL normalization process, particularly for the urllib handler.

Closes #3355, #6526

This is adapted from the remove_dot_segments pseudo-code in the RFC, with some inspiration from the urllib3/rfc3986 libraries (though it came out very close to them).

I have also renamed escape_url to normalize_url to better represent what it is doing, and moved these functions to utils.networking.

Template

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

馃 Generated by Copilot at 258810c

Summary

🚚🔧🧪

This pull request refactors and improves the URL handling functions in yt-dlp. It introduces a new function normalize_url that removes dot segments and escapes non-ASCII characters in a URL, and uses it in various modules. It also adds a new test case for the dot segment removal algorithm.
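As a rough illustration of the two behaviours described in the summary above (dot segment removal plus escaping of non-ASCII characters), here is a minimal sketch. It is not the code merged in this PR: the name `normalize_url_sketch`, the set of characters treated as safe, and the decision to leave the host untouched are all assumptions made for the example.

```python
import urllib.parse


def remove_dot_segments(path):
    """Resolve "." and ".." path segments, in the spirit of RFC 3986 5.2.4."""
    segments = path.split('/')
    output = []
    for segment in segments:
        if segment == '.':
            continue  # "." refers to the current level, so it is dropped
        if segment == '..':
            if output:
                output.pop()  # ".." removes the previous segment, if any
        else:
            output.append(segment)
    # Restore the leading "/" of an absolute path if ".." popped it
    if not segments[0] and (not output or output[0]):
        output.insert(0, '')
    # A trailing "." or ".." leaves a trailing "/"
    if segments[-1] in ('.', '..'):
        output.append('')
    return '/'.join(output)


def normalize_url_sketch(url):
    """Hypothetical helper: remove dot segments and percent-encode non-ASCII."""
    # Assumption: these reserved characters (and "%") are left unescaped
    safe = "%/;:@&=+$,!~*'()?#[]"
    parts = urllib.parse.urlsplit(url)
    return urllib.parse.urlunsplit((
        parts.scheme,
        parts.netloc,  # host handling (e.g. IDNA) is deliberately omitted here
        urllib.parse.quote(remove_dot_segments(parts.path), safe=safe),
        urllib.parse.quote(parts.query, safe=safe),
        urllib.parse.quote(parts.fragment, safe=safe),
    ))


print(normalize_url_sketch('https://example.com/a/b/./../../headers'))
```

The fragment and query are escaped but otherwise passed through unchanged, since dot segments only have meaning in the path component.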

To handle URLs with more care
They moved and renamed escape_url
Now normalize_url is the tool
That removes dot segments and escapes
And makes the networking module cool

Walkthrough

  • Refactor and move the functions for escaping and normalizing URLs to yt_dlp/utils/networking.py and implement the RFC 3986 5.2.4 algorithm for removing dot segments from a path
  • Add a new handler for the path /redirect_dotsegments in the HTTPTestRequestHandler class, which sends a 301 redirect response with a Location header that contains dot segments
  • Add a new test case for the remove_dot_segments and normalize_url functions, which uses the HTTPTestRequestHandler class and the validate_and_send function to send two requests: one to the path /a/b/./../../headers and one to the path /redirect_dotsegments. The test case asserts that both requests result in a 200 status and a final URL of /headers

@dirkf
Contributor

dirkf commented Jul 23, 2023

Does this handle the problem URL https://reoa92d.com/../uploaded/1649416469.mp4#t=0.1 from the original #3355 report? I thought this was against RFCs but maybe there's new logic (ISTR checking the WHATWG spec for this when I made up the patch in that issue).

@coletdjnz
Member Author

coletdjnz commented Jul 24, 2023

Does this handle the problem URL https://reoa92d.com/../uploaded/1649416469.mp4#t=0.1 from the original #3355 report? I thought this was against RFCs but maybe there's new logic (ISTR checking the WHATWG spec for this when I made up the patch in that issue).

Yes, it works (though I think the domain is dead), and I'm pretty sure it is not against the standard (at least RFC 3986, which I believe browsers follow for this):

RFC 3986 5.2.4 Step A:

A. If the input buffer begins with a prefix of "../" or "./",
then remove that prefix from the input buffer; otherwise,

Step C is similar for /../
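To make the steps quoted above concrete, here is a sketch that transcribes the RFC 3986 5.2.4 pseudo-code literally, using an input buffer and an output buffer. It is written for illustration only and is not the implementation from this PR; the function name `remove_dot_segments_rfc` is made up for the example.

```python
def remove_dot_segments_rfc(path):
    """Literal transcription of the RFC 3986 5.2.4 buffer algorithm."""
    inp, out = path, ''
    while inp:
        # Step A: drop a leading "../" or "./"
        if inp.startswith('../'):
            inp = inp[3:]
        elif inp.startswith('./'):
            inp = inp[2:]
        # Step B: a leading "/./", or a final "/.", is replaced with "/"
        elif inp.startswith('/./'):
            inp = inp[2:]
        elif inp == '/.':
            inp = '/'
        # Step C: a leading "/../", or a final "/..", is replaced with "/",
        # and the last segment already in the output is removed
        elif inp.startswith('/../') or inp == '/..':
            inp = '/' + inp[4:]
            out = out[:out.rfind('/')] if '/' in out else ''
        # Step D: a lone "." or ".." is dropped
        elif inp in ('.', '..'):
            inp = ''
        # Step E: move the first segment (with its leading "/", if any)
        # from the input buffer to the output buffer
        else:
            end = inp.find('/', 1)
            if end == -1:
                end = len(inp)
            out += inp[:end]
            inp = inp[end:]
    return out


# The path of the URL from #3355: Step C applies first, with an empty output
print(remove_dot_segments_rfc('/../uploaded/1649416469.mp4'))
```

For the #3355 path, Step C fires while the output buffer is still empty, so the leading "/.." is simply discarded and the rest of the path is copied through unchanged.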

Co-authored-by: pukkandan <pukkandan.ytdlp@gmail.com>
dirkf added a commit to dirkf/youtube-dl that referenced this pull request Jul 28, 2023
* move processing to YoutubeDLHandler
* also process `Location` header for redirect
* use tests from yt-dlp/yt-dlp#7662
@coletdjnz coletdjnz merged commit 4bf9122 into yt-dlp:master Jul 28, 2023
13 checks passed
@coletdjnz coletdjnz deleted the feat/remove_dot_segments branch July 28, 2023 22:40
dirkf added a commit to ytdl-org/youtube-dl that referenced this pull request Jul 29, 2023
* move processing to YoutubeDLHandler
* also process `Location` header for redirect
* use tests from yt-dlp/yt-dlp#7662
dirkf added a commit to ytdl-org/ytdl-nightly that referenced this pull request Sep 4, 2023
* move processing to YoutubeDLHandler
* also process `Location` header for redirect
* use tests from yt-dlp/yt-dlp#7662
dirkf added a commit to ytdl-org/ytdl-nightly that referenced this pull request Sep 5, 2023
* move processing to YoutubeDLHandler
* also process `Location` header for redirect
* use tests from yt-dlp/yt-dlp#7662
dirkf added a commit to ytdl-org/ytdl-nightly that referenced this pull request Sep 24, 2023
* move processing to YoutubeDLHandler
* also process `Location` header for redirect
* use tests from yt-dlp/yt-dlp#7662
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
This implements RFC3986 5.2.4 remove_dot_segments during the URL normalization process.

Closes yt-dlp#3355, yt-dlp#6526

Authored by: coletdjnz