Usage of IDNA encoder in escape_url(url) prevents to get the correct m3u8 link (might incorrect input from urllib.parse.urlparse() ) #5854
Comments
It's happening because the site puts a JWT in the URL path instead of the URL's query:
This results in a path segment (the JWT payload) that is "too long" for idna, as checked in the codec's ASCII fast path:

    # ASCII name: fast path
    labels = result.split(b'.')
    for label in labels[:-1]:
        if not (0 < len(label) < 64):
            raise UnicodeError("label empty or too long")
    if len(labels[-1]) >= 64:
        raise UnicodeError("label too long")
    return result, len(input)
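To make the length check above concrete, here is a minimal sketch using Python's built-in `idna` codec. The helper name `idna_encodes` and the `example.com` hostnames are made up for illustration; only `str.encode('idna')` comes from the standard library.

```python
def idna_encodes(hostname: str) -> bool:
    """Return True if the built-in 'idna' codec accepts the hostname."""
    try:
        hostname.encode('idna')
    except UnicodeError:
        # Raised when a dot-separated label is empty or 64+ characters long.
        return False
    return True


# Hypothetical hostnames: one label at the 63-character limit, one over it.
ok_host = 'a' * 63 + '.example.com'
bad_host = 'a' * 64 + '.example.com'

print(idna_encodes(ok_host))
print(idna_encodes(bad_host))
```

A JWT payload is usually far longer than 63 characters, so if it ends up in the string passed to the codec, one of its dot-separated parts will trip this check.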
Any idea what we can do, especially considering this is being handled by GenericIE?
I slightly misdiagnosed the problem in my original comment. The real cause of the too-long label was that the escaped unicode in the URL wasn't being decoded before urllib parsed the URL. Something like this should work as a fix:

diff --git a/yt_dlp/extractor/generic.py b/yt_dlp/extractor/generic.py
index 2281c71f3..71859fbba 100644
--- a/yt_dlp/extractor/generic.py
+++ b/yt_dlp/extractor/generic.py
@@ -2733,6 +2733,7 @@ def filter_video(urls):
         entries = []
         for video_url in orderedSet(found):
+            video_url = video_url.encode().decode('unicode-escape')
             video_url = unescapeHTML(video_url)
             video_url = video_url.replace('\\/', '/')
             video_url = urllib.parse.urljoin(url, video_url)
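The effect of the added line can be sketched in isolation. The URL below is a made-up stand-in (host `media.example.com`, a token of 80 `A` characters), and it assumes the page embeds the URL with its path slashes written as `\u002F`; the actual markup on the site may differ.

```python
from urllib.parse import urlsplit

# Hypothetical embedded URL: path slashes are still escaped as \u002F,
# so the parser finds no "/" to end the network location.
raw = 'https://media.example.com\\u002Flive\\u002F' + 'A' * 80 + '\\u002Fm.m3u8'

# Same decoding step as in the proposed patch.
decoded = raw.encode().decode('unicode-escape')

raw_netloc = urlsplit(raw).netloc          # host plus the whole escaped path
decoded_netloc = urlsplit(decoded).netloc  # host only


def idna_ok(netloc: str) -> bool:
    """Return True if the 'idna' codec accepts the netloc."""
    try:
        netloc.encode('idna')
    except UnicodeError:
        return False
    return True


print(raw_netloc, idna_ok(raw_netloc))
print(decoded_netloc, idna_ok(decoded_netloc))
```

One caveat with this approach: `unicode-escape` interprets the bytes as Latin-1, so a URL that already contains literal non-ASCII characters would come out garbled by the `.encode().decode('unicode-escape')` round trip.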
DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE
Checklist
Region
anywhere
Provide a description that is worded well enough to be understood
There is an issue when downloading from medici.tv (free content).
Using this URL, which points to free content:
https://www.medici.tv/en/concerts/lahav-shani-mozart-mahler-israel-philharmonic-abu-dhabi-classics
yt-dlp ends with the error "ERROR: encoding with 'idna' codec failed (UnicodeError: label too long)".
Using the direct m3u8 link, everything is fine:
[debug] Command-line config: ['-v', 'https://playout.prod.medicitv.fr/satie/live/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1cmwiOiJodHRwczovL3Nkbi1nbG9iYWwtc3RyZWFtaW5nLWNhY2hlLjNxc2RuLmNvbS85Mzc4L2ZpbGVzLzIyLzEyLzIyLzY5ODQ3ODkvOTM3OC02WThHelB0OWZISmcybVItZHJtLWFlcy5pc20vbWFuaWZlc3QubTN1OCIsImlhdCI6MTY3MTc0MDA4NSwiZXhwIjoxNjcyMzQ0ODg1fQ.kGonYmET-tutepypPWErp5BTGcTZM6I_YVXI9_-S7oo/m.m3u8']
Note: I am not an experienced Python developer, but in my understanding the IDNA codec is meant for domain names. In escape_url(url) in utils.py (line 3111), perhaps urllib.parse.urlparse() assigns an incorrect value to url_parsed.netloc (the full path instead of just the domain name).
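The hypothesis about `netloc` can be checked directly with `urllib.parse.urlparse()`. This is only a sketch: the token below is a made-up stand-in shaped like a JWT (three dot-separated parts), not the real one from the command line above.

```python
from urllib.parse import urlparse

# Hypothetical JWT-shaped token; the middle part is well over 63 characters.
token = 'x' * 30 + '.' + 'y' * 200 + '.' + 'z' * 40
url = f'https://playout.prod.medicitv.fr/satie/live/{token}/m.m3u8'

parsed = urlparse(url)

# For a well-formed URL the netloc is just the host, so IDNA-encoding it works.
encoded_host = parsed.netloc.encode('idna')

print(parsed.netloc)
print(parsed.path.startswith('/satie/live/'))
```

With a well-formed URL, the token stays in `parsed.path` and never reaches the codec, which is consistent with the direct m3u8 link working. The failure therefore points at the URL string handed to the parser rather than at `urlparse()` itself.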