[ZenYandex] Fix downloader #8454
It turned out that channel links were also broken. Example:
$ yt-dlp -v https://dzen.ru/id/5f377551708c8d5df525a586
[debug] Command-line config: ['-v', 'https://dzen.ru/id/5f377551708c8d5df525a586']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version nightly@2023.10.22.230540 [a40e0b37d] (zip)
[debug] Python 3.11.2 (CPython x86_64 64bit) - Linux-6.1.43-1.qubes.fc32.x86_64-x86_64-with-glibc2.36 (OpenSSL 3.0.11 19 Sep 2023, glibc 2.36)
[debug] exe versions: ffmpeg 5.1.3-1 (setts), ffprobe 5.1.3-1
[debug] Optional libraries: Cryptodome-3.11.0, certifi-2022.09.24, mutagen-1.46.0, requests-2.28.1, sqlite3-3.40.1, urllib3-1.26.12
[debug] Proxy map: {}
[debug] Request Handlers: urllib
[debug] Loaded 1890 extractors
[ZenYandexChannel] Extracting URL: https://dzen.ru/id/5f377551708c8d5df525a586
[ZenYandexChannel] 5f377551708c8d5df525a586: Downloading webpage
[ZenYandexChannel] 5f377551708c8d5df525a586: Redirecting
ERROR: [ZenYandexChannel] 5f377551708c8d5df525a586: Unable to extract channel data; please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
File "/home/user/yt-dlp/yt_dlp/extractor/common.py", line 715, in extract
ie_result = self._real_extract(url)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/yt-dlp/yt_dlp/extractor/yandexvideo.py", line 377, in _real_extract
data = self._search_json(
^^^^^^^^^^^^^^^^^^
File "/home/user/yt-dlp/yt_dlp/extractor/common.py", line 1277, in _search_json
json_string = self._search_regex(
^^^^^^^^^^^^^^^^^^^
File "/home/user/yt-dlp/yt_dlp/extractor/common.py", line 1263, in _search_regex
raise RegexNotFoundError('Unable to extract %s' % _name)
The root cause was the same. I pushed another commit fixing the ZenYandexChannel extractor.
Even though the tests were already broken, please fix them. Suggested change for the description field:
'description': video_json.get('description') or self._og_search_description(webpage),
yt_dlp/extractor/yandexvideo.py
@@ -258,7 +258,7 @@ def _real_extract(self, url):
         video_id = self._match_id(redirect)
         webpage = self._download_webpage(redirect, video_id, note='Redirecting')
         data_json = self._search_json(
-            r'data\s*=', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')
+            r'"data"\s*:', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')
Suggested change:
-            r'"data"\s*:', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')
+            r'"data"\s*[=:]', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')
Just in case. Same below.
I get some warnings after the change:
$ yt-dlp -F https://dzen.ru/video/watch/64d21733267b6a31f46608ef
[ZenYandex] Extracting URL: https://dzen.ru/video/watch/64d21733267b6a31f46608ef
[ZenYandex] 64d21733267b6a31f46608ef: Downloading webpage
[ZenYandex] 64d21733267b6a31f46608ef: Redirecting
[ZenYandex] 64d21733267b6a31f46608ef: Downloading MPD manifest
WARNING: [ZenYandex] Ignoring subtitle tracks found in the DASH manifest; if any subtitle tracks are missing, please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
[ZenYandex] 64d21733267b6a31f46608ef: Downloading m3u8 information
WARNING: [ZenYandex] Ignoring subtitle tracks found in the HLS manifest; if any subtitle tracks are missing, please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
[info] Available formats for 64d21733267b6a31f46608ef:
ID EXT RESOLUTION FPS │ TBR PROTO │ VCODEC VBR ACODEC ABR ASR MORE INFO
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
dash-f1-a1-x3 m4a audio only │ 132k dash │ audio only mp4a.40.2 132k 44k [rus] DASH audio, m4a_dash
@pukkandan Nice idea! I updated the regular expressions to capture both the old and the new patterns:
data =
"data" :
The expression is r'("data"\s*:|data\s*=)'
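A minimal sketch of how that alternation behaves, using only Python's re module. The two page fragments are shortened, hypothetical stand-ins for the old and new page layouts, not copies of real dzen.ru pages.

```python
import re

# Pattern quoted in the comment above; it accepts either marker form.
DATA_MARKER = re.compile(r'("data"\s*:|data\s*=)')

# Hypothetical, shortened page fragments (assumptions for illustration only).
old_style = 'var data = {"__serverState__video": {}}'
new_style = '({"data":{"__serverState__video": {}}})'

for label, page in (('old', old_style), ('new', new_style)):
    match = DATA_MARKER.search(page)
    # Show which marker text was found, or None when nothing matched.
    print(label, repr(match.group(0)) if match else None)
```

The first alternative handles the quoted key followed by a colon, the second handles the unquoted assignment, so either layout yields a match.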
@ansdim1 Thank you! The warning exists in the master branch as well. I guess it is the expected behavior. Anyway, it is unrelated to the patch.
The warning comes from the fact that self._extract_mpd_formats and self._extract_m3u8_formats discard the extracted subtitles. Instead, self._extract_mpd_formats_and_subtitles and self._extract_m3u8_formats_and_subtitles should be used.
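As a rough illustration of what keeping the subtitles could involve: the _and_subtitles variants return a (formats, subtitles) pair, so the extractor ends up with one subtitle dict per manifest and has to combine them. The sketch below is a standalone stand-in, assuming the usual {language: [track, ...]} layout. merge_subtitles and the example URLs are hypothetical and are not yt-dlp's own helper or real data.

```python
def merge_subtitles(*subtitle_dicts):
    """Combine {lang: [track, ...]} dicts, skipping tracks with a repeated URL."""
    merged = {}
    for subtitles in subtitle_dicts:
        for lang, tracks in subtitles.items():
            known = merged.setdefault(lang, [])
            seen_urls = {track.get('url') for track in known}
            for track in tracks:
                if track.get('url') not in seen_urls:
                    known.append(track)
                    seen_urls.add(track.get('url'))
    return merged


# Hypothetical subtitle dicts, as if returned alongside the DASH and HLS formats.
dash_subs = {'ru': [{'url': 'https://example.invalid/subs.vtt', 'ext': 'vtt'}]}
hls_subs = {
    'ru': [{'url': 'https://example.invalid/subs.vtt', 'ext': 'vtt'}],
    'en': [{'url': 'https://example.invalid/subs-en.vtt', 'ext': 'vtt'}],
}

subtitles = merge_subtitles(dash_subs, hls_subs)
print(sorted(subtitles))
```

In the extractor itself the merged dict would be returned under the 'subtitles' key of the info dict, next to 'formats'.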
Also, in the future, please don't force-push. All commits will be squashed upon merge anyway.
The video posted by ansdim1 does have a sub track for me.
I pushed another commit, fixing some tests.
@seproDev I used your suggested 'description'.
Also updated hashes for some tests.
Status of tests:
python3.11 test/test_download.py TestDownload.test_ZenYandex # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandex_1 # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandex_2 # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandex_3 # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_1 # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_2 # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_3 # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_4 # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_5 # fails because of captcha
For me none of the tests fail due to captchas. However, these four tests still fail:
TestDownload.test_ZenYandexChannel # fails due to channel no longer existing
TestDownload.test_ZenYandexChannel_1 # fails due to channel no longer existing
TestDownload.test_ZenYandexChannel_3 # fails due to channel only having 18 videos
TestDownload.test_ZenYandexChannel_4 # fails due to more than 46 videos being uploaded
Add 'skip': 'The page does not exist', to the tok_media entries.
Change to 'playlist_count': 18, for jony_me.
Change to 'playlist_mincount': 46, for tatyanareva.
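The difference between the two size keys can be sketched as follows: 'playlist_count' expects an exact number of entries, while 'playlist_mincount' only sets a lower bound, which suits a channel that keeps uploading. check_playlist_size is a hypothetical, simplified stand-in for the test harness, and the dicts hold only the keys named above, not the full test entries.

```python
def check_playlist_size(expected, actual_count):
    """Return True when actual_count satisfies the size keys in expected."""
    if 'playlist_count' in expected:
        # Exact match: breaks as soon as the channel gains or loses a video.
        return actual_count == expected['playlist_count']
    if 'playlist_mincount' in expected:
        # Lower bound: still passes when more videos are uploaded later.
        return actual_count >= expected['playlist_mincount']
    return True


# Only the keys suggested in the review; everything else is omitted.
jony_me = {'playlist_count': 18}
tatyanareva = {'playlist_mincount': 46}

print(check_playlist_size(jony_me, 18))
print(check_playlist_size(tatyanareva, 50))
```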
Done. Now all ZenYandexChannel test cases pass for me.
Fix data JSON extraction. The JSON begins like this in recent pages: a.data.hasOwnProperty(t)&&(n[t]=a.data[t])}({"data":{"__serverState__... New regular expression captures both old and new versions just in case. Fix yt-dlp#8275
8c32c2b to 8ad7e94
Closes yt-dlp#8275 Authored by: starius
Description of your pull request and other information
The version of the ZenYandex downloader in master is broken.
I got the same error as in #8275.
I investigated it by running the command with --write-pages and inspecting the HTML file. I found that the "data" JSON is now passed using the {"data":{... syntax. I guess it was previously passed using the {data={... syntax.
So I changed the regexp, and now it works!
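To make the change concrete, here is a hedged, standalone sketch of locating the marker and decoding the JSON object that follows it. It uses only the standard library and is not how yt-dlp's _search_json is implemented. extract_data_json and the sample pages are hypothetical, and the JSON body after "__serverState__" is invented because the real one is elided in the source.

```python
import json
import re


def extract_data_json(page):
    """Return the JSON object that follows a data marker, or None if absent."""
    marker = re.search(r'("data"\s*:|data\s*=)', page)
    if marker is None:
        return None
    start = marker.end()
    # raw_decode does not skip leading whitespace, so do it by hand.
    while start < len(page) and page[start].isspace():
        start += 1
    # Decode exactly one JSON value and ignore whatever trails it.
    obj, _end = json.JSONDecoder().raw_decode(page, start)
    return obj


# Hypothetical pages; only the prefix of new_page mirrors the reported layout.
new_page = ('a.data.hasOwnProperty(t)&&(n[t]=a.data[t])}'
            '({"data":{"__serverState__abc":{"title":"demo"}}})')
old_page = 'var data = {"__serverState__abc":{"title":"demo"}};'

print(extract_data_json(new_page))
print(extract_data_json(old_page))
```

Both layouts decode to the same object, which is the point of accepting either marker.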
I checked using the command from #8275
This is now working for me!
Fixes #8275
Note: the tests were already broken. I didn't fix them.
Copilot Summary
🤖 Generated by Copilot at 3f5b614
Summary
🕵️♂️🛠️🎞️
Fix Zen Yandex video extraction by updating the data_json regex in yt_dlp/extractor/yandexvideo.py
Walkthrough
data_json
from Zen Yandex videos (link)