[ZenYandex] Fix downloader #8454

starius · 2023-10-27T19:45:21Z

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

The version of ZenYandex downloader in master is broken.

I got the same error as in #8275
I investigated it by running the command with --write-pages and inspecting the dumped HTML file.
I found that the "data" JSON is now passed using {"data":{... syntax. I guess it used to be passed using {data={... syntax.
So I changed the regexp and now it works!
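
To illustrate the diagnosis above, here is a minimal sketch using plain `re` rather than yt-dlp's `_search_json`. The `new_page` string is the fragment quoted later in this PR's commit message; `old_page` is only an assumed shape of the earlier markup, not something captured from the site.

```python
import re

# Fragment of the new page markup, as quoted in this PR's commit message.
new_page = 'a.data.hasOwnProperty(t)&&(n[t]=a.data[t])}({"data":{"__serverState__'
# ASSUMPTION: a guess at what the older markup looked like.
old_page = 'var data = {"__serverState__'

OLD_PATTERN = r'data\s*='    # start pattern used in master
NEW_PATTERN = r'"data"\s*:'  # start pattern proposed in this PR

# The old pattern finds nothing in the new markup, which is why extraction fails.
old_on_new = re.search(OLD_PATTERN, new_page)
# The new pattern locates the start of the JSON object.
new_on_new = re.search(NEW_PATTERN, new_page)
# The old pattern still matches the assumed old markup.
old_on_old = re.search(OLD_PATTERN, old_page)

print(old_on_new is None, new_on_new is not None, old_on_old is not None)
```

In the extractor itself the start pattern is combined with `contains_pattern`, so this only shows the part of the match that changed.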

I checked using the command from #8275

./yt-dlp.sh -vU "https://dzen.ru/video/watch/651c35fa51b4a948e51df6de"

This is now working for me!

Fixes #8275

Template

Before submitting a pull request make sure you have:

At least skimmed through contributing guidelines including yt-dlp coding conventions
Searched the bugtracker for similar pull requests
Checked the code with flake8 and ran relevant tests

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Fix or improvement to an extractor (Make sure to add/update tests)
New extractor (Piracy websites will not be accepted)
Core bug fix/improvement
New feature (It is strongly recommended to open an issue first)

Note: tests were already broken:

$ python3.11 test/test_download.py TestDownload.test_ZenYandex_1     
[debug] Loaded 1890 extractors                                                                 
[ZenYandex] Extracting URL: https://dzen.ru/media/id/606fd806cc13cb3c58c05cf5/vot-eto-focus-dedy-morozy-na-gidrociklah-60c7c443da18892ebfe85ed7                                               
[ZenYandex] 60c7c443da18892ebfe85ed7: Downloading webpage                                                                                                                                     
[ZenYandex] 60c7c443da18892ebfe85ed7: Redirecting                                                                                                                                             
ERROR: [ZenYandex] 60c7c443da18892ebfe85ed7: Unable to extract metadata; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template.
 Confirm you are on the latest version using  yt-dlp -U

I didn't fix them.

Copilot Summary

`🤖 Generated by Copilot at 3f5b614`

Summary

🕵️‍♂️🛠️🎞️

Fix Zen Yandex video extraction by updating data_json regex in yt_dlp/extractor/yandexvideo.py

data_json changed
Webpage uses double quotes
Autumn leaves fall fast

Walkthrough

Update the regex pattern for extracting data_json from Zen Yandex videos (link)

starius · 2023-10-29T14:24:41Z

It turned out that channel links were also broken.

Example

$ yt-dlp -v https://dzen.ru/id/5f377551708c8d5df525a586
[debug] Command-line config: ['-v', 'https://dzen.ru/id/5f377551708c8d5df525a586']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version nightly@2023.10.22.230540 [a40e0b37d] (zip)
[debug] Python 3.11.2 (CPython x86_64 64bit) - Linux-6.1.43-1.qubes.fc32.x86_64-x86_64-with-glibc2.36 (OpenSSL 3.0.11 19 Sep 2023, glibc 2.36)
[debug] exe versions: ffmpeg 5.1.3-1 (setts), ffprobe 5.1.3-1
[debug] Optional libraries: Cryptodome-3.11.0, certifi-2022.09.24, mutagen-1.46.0, requests-2.28.1, sqlite3-3.40.1, urllib3-1.26.12
[debug] Proxy map: {}
[debug] Request Handlers: urllib
[debug] Loaded 1890 extractors
[ZenYandexChannel] Extracting URL: https://dzen.ru/id/5f377551708c8d5df525a586
[ZenYandexChannel] 5f377551708c8d5df525a586: Downloading webpage
[ZenYandexChannel] 5f377551708c8d5df525a586: Redirecting
ERROR: [ZenYandexChannel] 5f377551708c8d5df525a586: Unable to extract channel data; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "/home/user/yt-dlp/yt_dlp/extractor/common.py", line 715, in extract
    ie_result = self._real_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/yt-dlp/yt_dlp/extractor/yandexvideo.py", line 377, in _real_extract
    data = self._search_json(
           ^^^^^^^^^^^^^^^^^^
  File "/home/user/yt-dlp/yt_dlp/extractor/common.py", line 1277, in _search_json
    json_string = self._search_regex(
                  ^^^^^^^^^^^^^^^^^^^
  File "/home/user/yt-dlp/yt_dlp/extractor/common.py", line 1263, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)

The root cause was the same. I pushed another commit, fixing ZenYandexChannel extractor.

seproDev · 2023-11-08T13:54:52Z

Even though the tests were already broken, please fix them.
Most of the breakage seems to come from the site now truncating the descriptions in the og tags and prepending extra text to them. I would suggest changing the description extraction for ZenYandexIE to:

'description': video_json.get('description') or self._og_search_description(webpage),
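
The fallback in that suggestion can be sketched outside the extractor as follows. `pick_description` is a hypothetical helper used only for illustration; in the extractor the second value would come from `self._og_search_description(webpage)`.

```python
def pick_description(video_json, og_description):
    """Hypothetical helper: prefer the JSON description, fall back to the og tag."""
    # `or` also falls back when the JSON value is present but empty.
    return video_json.get('description') or og_description


full = pick_description({'description': 'Full text'}, 'Truncated text...')
fallback = pick_description({}, 'Truncated text...')
print(full, '|', fallback)
```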

pukkandan · 2023-11-08T16:33:50Z

yt_dlp/extractor/yandexvideo.py

@@ -258,7 +258,7 @@ def _real_extract(self, url):
            video_id = self._match_id(redirect)
            webpage = self._download_webpage(redirect, video_id, note='Redirecting')
        data_json = self._search_json(
-            r'data\s*=', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')
+            r'"data"\s*:', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')


Suggested change

r'"data"\s*:', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')

r'"data"\s*[=:]', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')

Just in case. Same below

Some warnings after change

yt-dlp -F https://dzen.ru/video/watch/64d21733267b6a31f46608ef
[ZenYandex] Extracting URL: https://dzen.ru/video/watch/64d21733267b6a31f46608ef
[ZenYandex] 64d21733267b6a31f46608ef: Downloading webpage
[ZenYandex] 64d21733267b6a31f46608ef: Redirecting
[ZenYandex] 64d21733267b6a31f46608ef: Downloading MPD manifest
WARNING: [ZenYandex] Ignoring subtitle tracks found in the DASH manifest; if any subtitle tracks are missing, please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
[ZenYandex] 64d21733267b6a31f46608ef: Downloading m3u8 information
WARNING: [ZenYandex] Ignoring subtitle tracks found in the HLS manifest; if any subtitle tracks are missing, please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
[info] Available formats for 64d21733267b6a31f46608ef:
ID EXT RESOLUTION FPS │ TBR PROTO │ VCODEC VBR ACODEC ABR ASR MORE INFO
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
dash-f1-a1-x3 m4a audio only │ 132k dash │ audio only mp4a.40.2 132k 44k [rus] DASH audio, m4a_dash

@pukkandan Nice idea! I updated the regular expressions to capture both the old and new patterns:

data =
"data" :

The expression is r'("data"\s*:|data\s*=)'
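
As a rough check of the two variants discussed here, the sketch below runs both patterns over two sample strings. The `old` sample is an assumed shape of the previous markup (unquoted `data =`); the `new` sample is shortened from the fragment in the commit message.

```python
import re

COMBINED = r'("data"\s*:|data\s*=)'  # alternation used in this PR
CHAR_CLASS = r'"data"\s*[=:]'        # variant from the review suggestion

samples = {
    'old': 'data = {"__serverState__',    # ASSUMPTION: shape of the old markup
    'new': '({"data":{"__serverState__',  # shortened from the commit message
}

combined_hits = {name: bool(re.search(COMBINED, text)) for name, text in samples.items()}
char_class_hits = {name: bool(re.search(CHAR_CLASS, text)) for name, text in samples.items()}

print(combined_hits)
print(char_class_hits)
```

Under that assumption, the character-class variant only matches the quoted form, while the alternation covers both the quoted and the unquoted one.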

@ansdim1 Thank you! The warning exists in the master branch as well. I guess it is the expected behavior. Anyway, it is unrelated to the patch.

The warning comes from the fact that self._extract_mpd_formats and self._extract_m3u8_formats discard the extracted subtitles. Instead, self._extract_mpd_formats_and_subtitles and self._extract_m3u8_formats_and_subtitles should be used.
Also, in the future, please don't force push. All commits will be squashed upon merge anyway.
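
The idea behind the `*_formats_and_subtitles` variants can be sketched without yt-dlp as follows. Both functions and the sample manifests are hypothetical stand-ins, not the real extractor API; the point is only that each manifest yields a (formats, subtitles) pair and the subtitle dicts are merged instead of being thrown away.

```python
def extract_formats_and_subtitles(manifest):
    """Hypothetical stand-in for a *_formats_and_subtitles extractor method."""
    return manifest['formats'], manifest['subtitles']


def merge_subtitles(*subtitle_dicts):
    """Hypothetical helper: combine per-language track lists from several sources."""
    merged = {}
    for subs in subtitle_dicts:
        for lang, tracks in subs.items():
            merged.setdefault(lang, []).extend(tracks)
    return merged


# Made-up manifests standing in for parsed DASH and HLS data.
mpd = {'formats': [{'format_id': 'dash-1'}], 'subtitles': {'ru': [{'ext': 'vtt', 'src': 'mpd'}]}}
m3u8 = {'formats': [{'format_id': 'hls-1'}], 'subtitles': {'ru': [{'ext': 'vtt', 'src': 'hls'}]}}

formats, subtitles = [], {}
for manifest in (mpd, m3u8):
    fmts, subs = extract_formats_and_subtitles(manifest)
    formats.extend(fmts)
    subtitles = merge_subtitles(subtitles, subs)  # keep the tracks instead of discarding them

print(len(formats), sorted(subtitles))
```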

@seproDev I pushed another commit fixing the warning. However, I can't find a video on dzen with subtitles to check it. The video posted by @ansdim1 doesn't have them. So the warning is gone, but the output is the same.

The video posted by ansdim1 does have a sub track for me.

I pushed another commit, fixing some tests.
@seproDev I used your suggested 'description'.
Also updated hashes for some tests.

Status of tests:

python3.11 test/test_download.py TestDownload.test_ZenYandex # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandex_1 # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandex_2 # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandex_3 # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_1 # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_2 # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_3 # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_4 # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_5 # fails because of captcha

For me none of the tests fail due to captchas. However, these four tests still fail:

TestDownload.test_ZenYandexChannel # fails due to channel no longer existing
TestDownload.test_ZenYandexChannel_1 # fails due to channel no longer existing
TestDownload.test_ZenYandexChannel_3 # fails due to channel only having 18 videos
TestDownload.test_ZenYandexChannel_4 # fails due to more than 46 videos being uploaded

Add 'skip': 'The page does not exist', to the tok_media entries.
Change to 'playlist_count': 18, for jony_me.
Change to 'playlist_mincount': 46, for tatyanareva.
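
For readers unfamiliar with these test keys, the sketch below shows roughly how the three changes read as test entries. The entries are trimmed-down and hypothetical: URLs and other expected fields are omitted, placing the channel names under `info_dict['id']` is an assumption, and `playlist_size_ok` is an illustrative helper, not the test harness's own code.

```python
# Hypothetical, trimmed-down test entries; URLs and other expected fields are omitted.
channel_tests = [
    {'info_dict': {'id': 'tok_media'}, 'skip': 'The page does not exist'},
    {'info_dict': {'id': 'jony_me'}, 'playlist_count': 18},
    {'info_dict': {'id': 'tatyanareva'}, 'playlist_mincount': 46},
]


def playlist_size_ok(test, entry_count):
    """Illustrative check: exact match for playlist_count, lower bound for playlist_mincount."""
    if 'skip' in test:
        return True  # skipped tests are not evaluated at all
    if 'playlist_count' in test:
        return entry_count == test['playlist_count']
    if 'playlist_mincount' in test:
        return entry_count >= test['playlist_mincount']
    return True


by_id = {t['info_dict']['id']: t for t in channel_tests}
print(playlist_size_ok(by_id['jony_me'], 18), playlist_size_ok(by_id['tatyanareva'], 60))
```

Using `playlist_mincount` for a channel that keeps uploading means the test does not break again with the next upload.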

Done. Now all ZenYandexChannel test cases pass for me.

Fix data JSON extraction.
The JSON begins like this in recent pages:
a.data.hasOwnProperty(t)&&(n[t]=a.data[t])}({"data":{"__serverState__...
New regular expression captures both old and new versions just in case.
Fix yt-dlp#8275

Closes yt-dlp#8275
Authored by: starius

starius mentioned this pull request Oct 27, 2023

[ZenYandex] Broken video download. "Unable to extract medatada" error. #8275

Closed

11 tasks

bashonly added the site-bug Issue with a specific website label Oct 27, 2023

bashonly self-requested a review October 27, 2023 19:54

seproDev added the pending-fixes PR has had changes requested label Nov 8, 2023

pukkandan reviewed Nov 8, 2023

[ZenYandex] [ZenYandexChannel] Fix downloaders

8ad7e94

starius force-pushed the fix-8275-fix-data-json-extraction branch from 8c32c2b to 8ad7e94 Compare November 10, 2023 19:22

starius added 3 commits November 11, 2023 00:36

[ZenYandex] fix warning about subtitles

e22f880

[ZenYandex] [ZenYandexChannel] Fix tests

2d096b9

[ZenYandexChannel] fix tests

997c356

seproDev approved these changes Nov 11, 2023

seproDev added pending-review PR needs a review and removed pending-fixes PR has had changes requested labels Nov 11, 2023

bashonly approved these changes Nov 15, 2023

bashonly removed the pending-review PR needs a review label Nov 15, 2023

bashonly self-assigned this Nov 15, 2023

bashonly merged commit 5efe68b into yt-dlp:master Nov 15, 2023
16 checks passed

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024

[ie/ZenYandex] Fix extraction (yt-dlp#8454)

fd9d5e4

Closes yt-dlp#8275 Authored by: starius

pukkandan Nov 8, 2023

ansdim1 Nov 9, 2023

starius Nov 10, 2023

starius Nov 10, 2023

seproDev Nov 10, 2023

starius Nov 10, 2023

seproDev Nov 10, 2023

starius Nov 11, 2023

seproDev Nov 11, 2023

starius Nov 11, 2023
