Usage of IDNA encoder in escape_url(url) prevents to get the correct m3u8 link (might incorrect input from urllib.parse.urlparse() ) #5854

ztutto · 2022-12-22T21:31:19Z

DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE

I understand that I will be blocked if I remove or skip any mandatory* field

Checklist

I'm reporting a broken site
I've verified that I'm running yt-dlp version 2022.11.11 (update instructions) or later (specify commit)
I've checked that all provided URLs are playable in a browser with the same IP and same login details
I've checked that all URLs and arguments with special characters are properly quoted or escaped
I've searched the bugtracker for similar issues including closed ones. DO NOT post duplicates
I've read the guidelines for opening an issue
I've read about sharing account credentials and I'm willing to share it if required

Region

anywhere

Provide a description that is worded well enough to be understood

Downloading from medici.tv fails. This URL points to free content:
https://www.medici.tv/en/concerts/lahav-shani-mozart-mahler-israel-philharmonic-abu-dhabi-classics
yt-dlp ends with the error "ERROR: encoding with 'idna' codec failed (UnicodeError: label too long)".
Using the direct m3u8 link, everything works:
[debug] Command-line config: ['-v', 'https://playout.prod.medicitv.fr/satie/live/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1cmwiOiJodHRwczovL3Nkbi1nbG9iYWwtc3RyZWFtaW5nLWNhY2hlLjNxc2RuLmNvbS85Mzc4L2ZpbGVzLzIyLzEyLzIyLzY5ODQ3ODkvOTM3OC02WThHelB0OWZISmcybVItZHJtLWFlcy5pc20vbWFuaWZlc3QubTN1OCIsImlhdCI6MTY3MTc0MDA4NSwiZXhwIjoxNjcyMzQ0ODg1fQ.kGonYmET-tutepypPWErp5BTGcTZM6I_YVXI9_-S7oo/m.m3u8']

Note: I am not an experienced Python developer, but as I understand it, the IDNA codec is meant for domain names. In escape_url(url) in utils.py (line 3111), perhaps urllib.parse.urlparse() assigns an incorrect value to url_parsed.netloc (the full path instead of the domain name).

Provide verbose output that clearly demonstrates the problem

Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

[debug] Command-line config: ['-vU', 'https://www.medici.tv/en/concerts/lahav-shani-mozart-mahler-israel-philharmonic-abu-dhabi-classics']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version 2022.11.11 [8b64402] (pip)
[debug] Python 3.11.0 (CPython x86_64 64bit) - Linux-6.0.13-300.fc37.x86_64-x86_64-with-glibc2.36 (OpenSSL 3.0.5 5 Jul 2022, glibc 2.36)
[debug] exe versions: ffmpeg 5.1.2 (setts), ffprobe 5.1.2
[debug] Optional libraries: Cryptodome-3.16.0, brotli-1.0.9, certifi-2022.12.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-10.4
[debug] Proxy map: {}
[debug] Loaded 1723 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: 2022.11.11, Current version: 2022.11.11
yt-dlp is up to date (2022.11.11)
[debug] [generic] Extracting URL: https://www.medici.tv/en/concerts/lahav-shani-mozart-mahler-israel-philharmonic-abu-dhabi-classics
[generic] lahav-shani-mozart-mahler-israel-philharmonic-abu-dhabi-classics: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] lahav-shani-mozart-mahler-israel-philharmonic-abu-dhabi-classics: Extracting information
[debug] Looking for embeds
[debug] Identified a JW Player JS loader
[generic] \u002F\u002Fplayout.prod.medicitv.fr\u002Fsatie\u002Flive\u002FeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1cmwiOiJodHRwczovL3Nkbi1nbG9iYWwtc3RyZWFtaW5nLWNhY2hlLjNxc2RuLmNvbS85Mzc4L2ZpbGVzLzIyLzEyLzIyLzY5ODQ3ODkvOTM3OC02WThHelB0OWZISmcybVItZHJtLWFlcy5pc20vbWFuaWZlc3QubTN1OD9zdGFydD0wJmVuZD0xODAiLCJpYXQiOjE2NzE3NDI2NTksImV4cCI6MTY3MjM0NzQ1OX0.y3jfUxZ_g0_c5Io3uC2mEHTsqg9azeDv75JBjBUJ_lQ\u002Fm: Downloading m3u8 information
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, filesize, fs_approx, tbr, vbr, abr, asr, proto, vext, aext, hasaud, source, id
ERROR: encoding with 'idna' codec failed (UnicodeError: label too long)
Traceback (most recent call last):
  File "/usr/lib64/python3.11/encodings/idna.py", line 167, in encode
    raise UnicodeError("label too long")
UnicodeError: label too long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 1485, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 1582, in __extract_info
    return self.process_ie_result(ie_result, download, extra_info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 1641, in process_ie_result
    ie_result = self.process_video_result(ie_result, download=download)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 2627, in process_video_result
    format['http_headers'] = self._calc_headers(full_format_info)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 2335, in _calc_headers
    cookies = self._calc_cookies(info_dict['url'])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 2347, in _calc_cookies
    pr = sanitized_Request(url)
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/utils.py", line 773, in sanitized_Request
    url, auth_header = extract_basic_auth(escape_url(sanitize_url(url)))
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/yt_dlp/utils.py", line 3097, in escape_url
    netloc=url_parsed.netloc.encode('idna').decode('ascii'),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)


bashonly · 2022-12-22T23:16:34Z

It's happening because the site puts a JWT in the URL path instead of the URL's query:

//playout.prod.medicitv.fr/satie/live/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1cmwiOiJodHRwczovL3Nkbi1nbG9iYWwtc3RyZWFtaW5nLWNhY2hlLjNxc2RuLmNvbS85Mzc4L2ZpbGVzLzIyLzEyLzIyLzY5ODQ3ODkvOTM3OC02WThHelB0OWZISmcybVItZHJtLWFlcy5pc20vbWFuaWZlc3QubTN1OD9zdGFydD0wJmVuZD0xODAiLCJpYXQiOjE2NzE3NDI2NTksImV4cCI6MTY3MjM0NzQ1OX0.y3jfUxZ_g0_c5Io3uC2mEHTsqg9azeDv75JBjBUJ_lQ/m

This results in a path segment (the JWT payload) that is "too long" for the idna codec.
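To make the segment lengths concrete, here is a minimal sketch that splits the token out of the path quoted above and measures its dot-separated parts. The helper name decode_jwt_segment is made up for illustration and is not part of yt-dlp; the signature is not verified.

```python
import base64
import json
import urllib.parse


def decode_jwt_segment(segment: str) -> dict:
    """Base64url-decode one JWT segment and parse it as JSON (no signature check)."""
    padded = segment + '=' * (-len(segment) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


# The protocol-relative URL quoted in the comment above.
embed_url = (
    '//playout.prod.medicitv.fr/satie/live/'
    'eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.'
    'eyJ1cmwiOiJodHRwczovL3Nkbi1nbG9iYWwtc3RyZWFtaW5nLWNhY2hlLjNxc2RuLmNvbS85Mzc4'
    'L2ZpbGVzLzIyLzEyLzIyLzY5ODQ3ODkvOTM3OC02WThHelB0OWZISmcybVItZHJtLWFlcy5pc20v'
    'bWFuaWZlc3QubTN1OD9zdGFydD0wJmVuZD0xODAiLCJpYXQiOjE2NzE3NDI2NTksImV4cCI6MTY3'
    'MjM0NzQ1OX0.'
    'y3jfUxZ_g0_c5Io3uC2mEHTsqg9azeDv75JBjBUJ_lQ/m'
)

parsed = urllib.parse.urlparse(embed_url)
# Path is /satie/live/<token>/m, so the token is the fourth '/'-separated piece.
token = parsed.path.split('/')[3]
segments = token.split('.')

print(parsed.netloc)                       # host part only
print(decode_jwt_segment(segments[0]))     # JWT header
print([len(s) for s in segments])          # lengths of header, payload, signature
```

With a correctly parsed URL the token stays in the path and never reaches the idna codec; its length only matters if the token ends up in netloc.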

from /python3.10/encodings/idna.py:

            # ASCII name: fast path
            labels = result.split(b'.')
            for label in labels[:-1]:
                if not (0 < len(label) < 64):
                    raise UnicodeError("label empty or too long")
            if len(labels[-1]) >= 64:
                raise UnicodeError("label too long")
            return result, len(input)
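The length check in that excerpt can be reproduced in isolation. The sketch below is only an illustration: the hostnames built from repeated 'a' characters are hypothetical, and idna_encodes is a made-up helper, not a yt-dlp function.

```python
def idna_encodes(hostname: str) -> bool:
    """Return True if Python's built-in 'idna' codec accepts the hostname."""
    try:
        hostname.encode('idna')
    except UnicodeError:
        # Raised for empty labels or labels of 64 characters and more.
        return False
    return True


real_host = 'playout.prod.medicitv.fr'
max_label_host = 'a' * 63 + '.example.com'     # hypothetical, longest allowed label
long_first_label = 'a' * 64 + '.example.com'   # hypothetical, one character too long
long_last_label = 'example.' + 'a' * 64        # hypothetical, long final label

for host in (real_host, max_label_host, long_first_label, long_last_label):
    print(len(max(host.split('.'), key=len)), idna_encodes(host))
```

Any dot-separated piece of 64 characters or more is rejected, which is why a long token that leaks into netloc triggers the error.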

pukkandan · 2022-12-23T00:21:05Z

Any idea what we can do, especially considering this is being handled by GenericIE?

bashonly · 2022-12-23T01:29:47Z

I slightly misdiagnosed the problem in my original comment. The real cause of the too-long label was that the escaped unicode in the URL wasn't being decoded before urllib parsed the URL. Something like this should work as a fix:

diff --git a/yt_dlp/extractor/generic.py b/yt_dlp/extractor/generic.py
index 2281c71f3..71859fbba 100644
--- a/yt_dlp/extractor/generic.py
+++ b/yt_dlp/extractor/generic.py
@@ -2733,6 +2733,7 @@ def filter_video(urls):
 
         entries = []
         for video_url in orderedSet(found):
+            video_url = video_url.encode().decode('unicode-escape')
             video_url = unescapeHTML(video_url)
             video_url = video_url.replace('\\/', '/')
             video_url = urllib.parse.urljoin(url, video_url)
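The effect of the added line can be checked outside yt-dlp. This is a minimal sketch, assuming a shortened stand-in for the embed URL (the path /satie/live/m and the base page URL ending in /example are placeholders); it only uses the standard library and does not reproduce the rest of the extractor.

```python
import urllib.parse

# Placeholder base page URL and a shortened, still-escaped embed URL.
page_url = 'https://www.medici.tv/en/concerts/example'
raw = r'\u002F\u002Fplayout.prod.medicitv.fr\u002Fsatie\u002Flive\u002Fm'

# Same transformation as the line added in the diff above.
decoded = raw.encode().decode('unicode-escape')

joined_raw = urllib.parse.urljoin(page_url, raw)
joined_decoded = urllib.parse.urljoin(page_url, decoded)

print(decoded)
print(urllib.parse.urlparse(joined_raw).netloc)      # escaped text is treated as a relative path
print(urllib.parse.urlparse(joined_decoded).netloc)  # decoded text is a protocol-relative URL
```

One caveat with this approach: encode() followed by decode('unicode-escape') reads the UTF-8 bytes as Latin-1, so a URL that already contains literal non-ASCII characters would be altered by it.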

Authored by: bashonly Closes #5854

ztutto added site-bug Issue with a specific website triage Untriaged issue labels Dec 22, 2022

ztutto changed the title ~~Usage of IDNA encoder in escape_url(url) prevents to get the correct m3u8 link~~ Usage of IDNA encoder in escape_url(url) prevents to get the correct m3u8 link (might incorrect input from urllib.parse.urlparse() ) Dec 22, 2022

pukkandan added bug Bug that is not site-specific and removed site-bug Issue with a specific website triage Untriaged issue labels Dec 23, 2022

bashonly mentioned this issue Jan 2, 2023

[extractor/generic] Decode unicode-escaped embed URLs #5919

Merged

9 tasks

pukkandan closed this as completed in #5919 Jan 2, 2023

pukkandan pushed a commit that referenced this issue Jan 2, 2023

[extractor/generic] Decode unicode-escaped embed URLs (#5919)

05997b6

Authored by: bashonly Closes #5854
