Usage of IDNA encoder in escape_url(url) prevents to get the correct m3u8 link (might incorrect input from urllib.parse.urlparse() ) #5854

Closed
10 tasks done
ztutto opened this issue Dec 22, 2022 · 3 comments · Fixed by #5919
Labels
bug Bug that is not site-specific

Comments

@ztutto

ztutto commented Dec 22, 2022


Region

anywhere

Provide a description that is worded well enough to be understood

Downloading free content from medici.tv fails. The following URL points to a free concert:
https://www.medici.tv/en/concerts/lahav-shani-mozart-mahler-israel-philharmonic-abu-dhabi-classics
yt-dlp ends with the error "ERROR: encoding with 'idna' codec failed (UnicodeError: label too long)".
Passing the direct m3u8 link instead works fine:
[debug] Command-line config: ['-v', 'https://playout.prod.medicitv.fr/satie/live/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1cmwiOiJodHRwczovL3Nkbi1nbG9iYWwtc3RyZWFtaW5nLWNhY2hlLjNxc2RuLmNvbS85Mzc4L2ZpbGVzLzIyLzEyLzIyLzY5ODQ3ODkvOTM3OC02WThHelB0OWZISmcybVItZHJtLWFlcy5pc20vbWFuaWZlc3QubTN1OCIsImlhdCI6MTY3MTc0MDA4NSwiZXhwIjoxNjcyMzQ0ODg1fQ.kGonYmET-tutepypPWErp5BTGcTZM6I_YVXI9_-S7oo/m.m3u8']

Note: I am not an experienced Python developer, but as I understand it the IDNA codec is meant for domain names. In escape_url(url) in utils.py (line 3111), urllib.parse.urlparse() may be assigning an incorrect value to url_parsed.netloc (the full path instead of the domain name).
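As a rough check of that suspicion, the sketch below mirrors the `netloc.encode('idna')` step shown in the traceback on a well-formed URL. The helper name `idna_netloc` and the `x` filler standing in for the long token are hypothetical; this is not yt-dlp's actual `escape_url`.

```python
import urllib.parse


def idna_netloc(url):
    """Return the IDNA-encoded network location of a URL (hypothetical helper)."""
    parsed = urllib.parse.urlparse(url)
    # Only the host part is passed to the IDNA codec; the path is not touched.
    return parsed.netloc.encode('idna').decode('ascii')


# A path segment far longer than 63 characters, similar in shape to the direct link.
direct_url = 'https://playout.prod.medicitv.fr/satie/live/' + 'x' * 200 + '/m.m3u8'
print(idna_netloc(direct_url))
```

For a URL with an explicit scheme and `//`, `urlparse()` keeps the long segment in `path`, so the codec only sees the host name. That matches the observation that the direct m3u8 link works.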

Provide verbose output that clearly demonstrates the problem


Complete Verbose Output

[debug] Command-line config: ['-vU', 'https://www.medici.tv/en/concerts/lahav-shani-mozart-mahler-israel-philharmonic-abu-dhabi-classics']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version 2022.11.11 [8b64402] (pip)
[debug] Python 3.11.0 (CPython x86_64 64bit) - Linux-6.0.13-300.fc37.x86_64-x86_64-with-glibc2.36 (OpenSSL 3.0.5 5 Jul 2022, glibc 2.36)
[debug] exe versions: ffmpeg 5.1.2 (setts), ffprobe 5.1.2
[debug] Optional libraries: Cryptodome-3.16.0, brotli-1.0.9, certifi-2022.12.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-10.4
[debug] Proxy map: {}
[debug] Loaded 1723 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: 2022.11.11, Current version: 2022.11.11
yt-dlp is up to date (2022.11.11)
[debug] [generic] Extracting URL: https://www.medici.tv/en/concerts/lahav-shani-mozart-mahler-israel-philharmonic-abu-dhabi-classics
[generic] lahav-shani-mozart-mahler-israel-philharmonic-abu-dhabi-classics: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] lahav-shani-mozart-mahler-israel-philharmonic-abu-dhabi-classics: Extracting information
[debug] Looking for embeds
[debug] Identified a JW Player JS loader
[generic] \u002F\u002Fplayout.prod.medicitv.fr\u002Fsatie\u002Flive\u002FeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1cmwiOiJodHRwczovL3Nkbi1nbG9iYWwtc3RyZWFtaW5nLWNhY2hlLjNxc2RuLmNvbS85Mzc4L2ZpbGVzLzIyLzEyLzIyLzY5ODQ3ODkvOTM3OC02WThHelB0OWZISmcybVItZHJtLWFlcy5pc20vbWFuaWZlc3QubTN1OD9zdGFydD0wJmVuZD0xODAiLCJpYXQiOjE2NzE3NDI2NTksImV4cCI6MTY3MjM0NzQ1OX0.y3jfUxZ_g0_c5Io3uC2mEHTsqg9azeDv75JBjBUJ_lQ\u002Fm: Downloading m3u8 information
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, filesize, fs_approx, tbr, vbr, abr, asr, proto, vext, aext, hasaud, source, id
ERROR: encoding with 'idna' codec failed (UnicodeError: label too long)
Traceback (most recent call last):
  File "/usr/lib64/python3.11/encodings/idna.py", line 167, in encode
    raise UnicodeError("label too long")
UnicodeError: label too long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 1485, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 1582, in __extract_info
    return self.process_ie_result(ie_result, download, extra_info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 1641, in process_ie_result
    ie_result = self.process_video_result(ie_result, download=download)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 2627, in process_video_result
    format['http_headers'] = self._calc_headers(full_format_info)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 2335, in _calc_headers
    cookies = self._calc_cookies(info_dict['url'])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 2347, in _calc_cookies
    pr = sanitized_Request(url)
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/utils.py", line 773, in sanitized_Request
    url, auth_header = extract_basic_auth(escape_url(sanitize_url(url)))
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/utils.py", line 3097, in escape_url
    netloc=url_parsed.netloc.encode('idna').decode('ascii'),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
@ztutto ztutto added site-bug Issue with a specific website triage Untriaged issue labels Dec 22, 2022
@ztutto ztutto changed the title Usage of IDNA encoder in escape_url(url) prevents to get the correct m3u8 link Usage of IDNA encoder in escape_url(url) prevents to get the correct m3u8 link (might incorrect input from urllib.parse.urlparse() ) Dec 22, 2022
@bashonly
Member

bashonly commented Dec 22, 2022

It's happening because the site puts a JWT in the URL path instead of the URL's query:

//playout.prod.medicitv.fr/satie/live/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1cmwiOiJodHRwczovL3Nkbi1nbG9iYWwtc3RyZWFtaW5nLWNhY2hlLjNxc2RuLmNvbS85Mzc4L2ZpbGVzLzIyLzEyLzIyLzY5ODQ3ODkvOTM3OC02WThHelB0OWZISmcybVItZHJtLWFlcy5pc20vbWFuaWZlc3QubTN1OD9zdGFydD0wJmVuZD0xODAiLCJpYXQiOjE2NzE3NDI2NTksImV4cCI6MTY3MjM0NzQ1OX0.y3jfUxZ_g0_c5Io3uC2mEHTsqg9azeDv75JBjBUJ_lQ/m

This results in a path segment (the JWT payload) that is "too long" for idna.

From /python3.10/encodings/idna.py:

            # ASCII name: fast path
            labels = result.split(b'.')
            for label in labels[:-1]:
                if not (0 < len(label) < 64):
                    raise UnicodeError("label empty or too long")
            if len(labels[-1]) >= 64:
                raise UnicodeError("label too long")
            return result, len(input)
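
The length check quoted above can be exercised directly with Python's built-in `idna` codec. This is a minimal sketch; the helper name `idna_encodes` and the `.example` host names are made up for illustration.

```python
def idna_encodes(host):
    """Return True if Python's built-in 'idna' codec accepts the host name."""
    try:
        host.encode('idna')
    except UnicodeError:
        # Raised for empty labels and for labels of 64 or more characters.
        return False
    return True


ok_host = 'a' * 63 + '.example'    # 63-character label: within the limit
bad_host = 'a' * 64 + '.example'   # 64-character label: rejected by the codec

print(idna_encodes(ok_host))
print(idna_encodes(bad_host))
```

Any dot-separated piece of the string handed to the codec counts as a label, so a long token ending up in the host position would trip this check.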

@pukkandan
Member

Any idea what we can do, especially considering this is being handled by GenericIE?

@pukkandan pukkandan added bug Bug that is not site-specific and removed site-bug Issue with a specific website triage Untriaged issue labels Dec 23, 2022
@bashonly
Member

bashonly commented Dec 23, 2022

I slightly misdiagnosed the problem in my original comment. The real cause of the too-long label was that the Unicode escapes in the URL (the \u002F sequences visible in the verbose output) weren't being decoded before urllib parsed the URL. Something like this should work as a fix:

diff --git a/yt_dlp/extractor/generic.py b/yt_dlp/extractor/generic.py
index 2281c71f3..71859fbba 100644
--- a/yt_dlp/extractor/generic.py
+++ b/yt_dlp/extractor/generic.py
@@ -2733,6 +2733,7 @@ def filter_video(urls):
 
         entries = []
         for video_url in orderedSet(found):
+            video_url = video_url.encode().decode('unicode-escape')
             video_url = unescapeHTML(video_url)
             video_url = video_url.replace('\\/', '/')
             video_url = urllib.parse.urljoin(url, video_url)
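
To illustrate what the added line changes, here is a small standalone sketch of the same steps outside yt-dlp. The function name `resolve_embedded_url`, the shortened page URL and the `TOKEN` placeholder are assumptions made for the example, and yt-dlp's `unescapeHTML` step is left out.

```python
import urllib.parse


def resolve_embedded_url(page_url, video_url, decode_escapes=True):
    """Resolve an embedded URL against its page, optionally decoding \\uXXXX escapes."""
    if decode_escapes:
        # Turns the literal six characters \u002F into a single '/'.
        video_url = video_url.encode().decode('unicode-escape')
    video_url = video_url.replace('\\/', '/')
    return urllib.parse.urljoin(page_url, video_url)


page_url = 'https://www.medici.tv/en/concerts/example'
# Raw string: the backslashes are literal, as in the page source.
escaped = r'\u002F\u002Fplayout.prod.medicitv.fr\u002Fsatie\u002Flive\u002FTOKEN\u002Fm'

fixed = resolve_embedded_url(page_url, escaped)
unfixed = resolve_embedded_url(page_url, escaped, decode_escapes=False)

print(urllib.parse.urlparse(fixed).netloc)
print(urllib.parse.urlparse(unfixed).netloc)
```

With decoding, the string becomes a protocol-relative URL and resolves to the playout host. Without it, `urljoin` treats the whole escaped string as a relative path. One caveat: `.encode().decode('unicode-escape')` reads the bytes as Latin-1, so a URL that already contains non-ASCII characters would be altered by this step.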
