[ZenYandex] Fix downloader #8454

Merged
merged 4 commits into from
Nov 15, 2023

Conversation

starius
Contributor

@starius starius commented Oct 27, 2023

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

The version of ZenYandex downloader in master is broken.

I got the same error as in #8275
I investigated it by running the command with --write-pages and inspecting the saved HTML file.
I found that the "data" JSON is now passed using {"data":{... syntax. I guess it was previously passed using {data={... syntax.
So I changed the regexp, and now it works!
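
The idea can be tried outside yt-dlp with a minimal sketch. It assumes the page embeds the state as a JSON object directly after a `data =` or `"data":` prefix. The sample strings and the helper name `find_data_json` are made up for illustration, and this is not how yt-dlp's own `_search_json` is implemented.

```python
import json
import re

# Matches both the old `data = {...}` and the new `"data": {...}` prefix.
DATA_PREFIX = re.compile(r'(?:"data"\s*:|data\s*=)\s*')


def find_data_json(webpage):
    """Return the first JSON object that follows a data prefix, or None."""
    decoder = json.JSONDecoder()
    for match in DATA_PREFIX.finditer(webpage):
        try:
            # raw_decode parses one JSON value and ignores trailing text.
            obj, _ = decoder.raw_decode(webpage, match.end())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


# Hypothetical page fragments in the old and the new style.
old_page = 'var data = {"__serverState__video": {"id": 1}};'
new_page = '}({"data":{"__serverState__video": {"id": 1}}})'
print(find_data_json(old_page))
print(find_data_json(new_page))
```

Both calls should return the same dictionary, which is the point of accepting either prefix.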

I checked using the command from #8275

./yt-dlp.sh -vU "https://dzen.ru/video/watch/651c35fa51b4a948e51df6de"

This is now working for me!

Fixes #8275

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Note: tests were already broken:

$ python3.11 test/test_download.py TestDownload.test_ZenYandex_1     
[debug] Loaded 1890 extractors                                                                 
[ZenYandex] Extracting URL: https://dzen.ru/media/id/606fd806cc13cb3c58c05cf5/vot-eto-focus-dedy-morozy-na-gidrociklah-60c7c443da18892ebfe85ed7                                               
[ZenYandex] 60c7c443da18892ebfe85ed7: Downloading webpage                                                                                                                                     
[ZenYandex] 60c7c443da18892ebfe85ed7: Redirecting                                                                                                                                             
ERROR: [ZenYandex] 60c7c443da18892ebfe85ed7: Unable to extract metadata; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template.
 Confirm you are on the latest version using  yt-dlp -U

I didn't fix them.

Copilot Summary

🤖 Generated by Copilot at 3f5b614

Summary

🕵️‍♂️🛠️🎞️

Fix Zen Yandex video extraction by updating data_json regex in yt_dlp/extractor/yandexvideo.py

data_json changed
Webpage uses double quotes
Autumn leaves fall fast

Walkthrough

  • Update the regex pattern for extracting data_json from Zen Yandex videos (link)

@bashonly bashonly added the site-bug Issue with a specific website label Oct 27, 2023
@bashonly bashonly self-requested a review October 27, 2023 19:54
@starius
Contributor Author

starius commented Oct 29, 2023

It turned out that channel links were also broken.

Example

$ yt-dlp -v https://dzen.ru/id/5f377551708c8d5df525a586
[debug] Command-line config: ['-v', 'https://dzen.ru/id/5f377551708c8d5df525a586']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version nightly@2023.10.22.230540 [a40e0b37d] (zip)
[debug] Python 3.11.2 (CPython x86_64 64bit) - Linux-6.1.43-1.qubes.fc32.x86_64-x86_64-with-glibc2.36 (OpenSSL 3.0.11 19 Sep 2023, glibc 2.36)
[debug] exe versions: ffmpeg 5.1.3-1 (setts), ffprobe 5.1.3-1
[debug] Optional libraries: Cryptodome-3.11.0, certifi-2022.09.24, mutagen-1.46.0, requests-2.28.1, sqlite3-3.40.1, urllib3-1.26.12
[debug] Proxy map: {}
[debug] Request Handlers: urllib
[debug] Loaded 1890 extractors
[ZenYandexChannel] Extracting URL: https://dzen.ru/id/5f377551708c8d5df525a586
[ZenYandexChannel] 5f377551708c8d5df525a586: Downloading webpage
[ZenYandexChannel] 5f377551708c8d5df525a586: Redirecting
ERROR: [ZenYandexChannel] 5f377551708c8d5df525a586: Unable to extract channel data; please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
  File "/home/user/yt-dlp/yt_dlp/extractor/common.py", line 715, in extract
    ie_result = self._real_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/yt-dlp/yt_dlp/extractor/yandexvideo.py", line 377, in _real_extract
    data = self._search_json(
           ^^^^^^^^^^^^^^^^^^
  File "/home/user/yt-dlp/yt_dlp/extractor/common.py", line 1277, in _search_json
    json_string = self._search_regex(
                  ^^^^^^^^^^^^^^^^^^^
  File "/home/user/yt-dlp/yt_dlp/extractor/common.py", line 1263, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)

The root cause was the same. I pushed another commit, fixing ZenYandexChannel extractor.

@seproDev
Member

seproDev commented Nov 8, 2023

Even though the tests were already broken, please fix them.
Most of the breakage seems to come from the site now truncating the descriptions in the og tags and prepending extra text to them. I would suggest changing the description extraction for ZenYandexIE to:

'description': video_json.get('description') or self._og_search_description(webpage),
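
A small sketch of the fallback behaviour of that expression follows. The function name `pick_description` and the sample values are hypothetical. In the extractor itself the right-hand side is a method call that only runs when the left-hand side is falsy, whereas this sketch takes the og value as a plain argument.

```python
def pick_description(video_json, og_description):
    """Prefer the description from the page JSON, else the og: tag value."""
    # `or` also falls through on an empty string, not only on a missing key.
    return video_json.get('description') or og_description


# Hypothetical inputs: JSON with, without, and with an empty description.
full = pick_description({'description': 'Full text'}, 'Truncated text')
missing = pick_description({}, 'Truncated text')
empty = pick_description({'description': ''}, 'Truncated text')
print(full, missing, empty, sep=' | ')
```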

@seproDev seproDev added the pending-fixes PR has had changes requested label Nov 8, 2023
@@ -258,7 +258,7 @@ def _real_extract(self, url):
         video_id = self._match_id(redirect)
         webpage = self._download_webpage(redirect, video_id, note='Redirecting')
         data_json = self._search_json(
-            r'data\s*=', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')
+            r'"data"\s*:', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')
Member

Suggested change
- r'"data"\s*:', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')
+ r'"data"\s*[=:]', webpage, 'metadata', video_id, contains_pattern=r'{["\']_*serverState_*video.+}')

Just in case. Same below

Some warnings after change

yt-dlp -F https://dzen.ru/video/watch/64d21733267b6a31f46608ef
[ZenYandex] Extracting URL: https://dzen.ru/video/watch/64d21733267b6a31f46608ef
[ZenYandex] 64d21733267b6a31f46608ef: Downloading webpage
[ZenYandex] 64d21733267b6a31f46608ef: Redirecting
[ZenYandex] 64d21733267b6a31f46608ef: Downloading MPD manifest
WARNING: [ZenYandex] Ignoring subtitle tracks found in the DASH manifest; if any subtitle tracks are missing, please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
[ZenYandex] 64d21733267b6a31f46608ef: Downloading m3u8 information
WARNING: [ZenYandex] Ignoring subtitle tracks found in the HLS manifest; if any subtitle tracks are missing, please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
[info] Available formats for 64d21733267b6a31f46608ef:
ID EXT RESOLUTION FPS │ TBR PROTO │ VCODEC VBR ACODEC ABR ASR MORE INFO
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
dash-f1-a1-x3 m4a audio only │ 132k dash │ audio only mp4a.40.2 132k 44k [rus] DASH audio, m4a_dash

Contributor Author

@pukkandan Nice idea! I updated the regular expressions to capture both the old and the new patterns:

data = 
"data" :

The expression is r'("data"\s*:|data\s*=)'
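
The difference between the patterns discussed in this thread can be checked with a short sketch. The two sample strings are made up to mimic the old and new page syntax; only the three regular expressions come from the conversation.

```python
import re

# Hypothetical page fragments in the old and the new style.
OLD_SYNTAX = 'data = {"__serverState__video": 1}'
NEW_SYNTAX = '{"data":{"__serverState__video": 1}}'

patterns = {
    'original': r'data\s*=',
    'suggested': r'"data"\s*[=:]',
    'combined': r'("data"\s*:|data\s*=)',
}

# For each pattern, record whether it matches the old and the new syntax.
results = {
    name: (bool(re.search(p, OLD_SYNTAX)), bool(re.search(p, NEW_SYNTAX)))
    for name, p in patterns.items()
}
for name, (old_ok, new_ok) in results.items():
    print(f'{name}: old={old_ok} new={new_ok}')
```

On these samples only the combined alternation matches both forms, because the suggested pattern requires the quotes around `data` that the old syntax lacks.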

Contributor Author

@ansdim1 Thank you! The warning exists in the master branch as well, so I guess it is expected behavior. In any case, it is unrelated to the patch.

Member

The warning comes from the fact that self._extract_mpd_formats and self._extract_m3u8_formats discard the extracted subtitles. Instead, self._extract_mpd_formats_and_subtitles and self._extract_m3u8_formats_and_subtitles should be used.
Also in the future, please don't force push. All commits will be squashed upon merge anyways.
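
Since the `_and_subtitles` variants return subtitles alongside the formats, the results from the DASH and HLS manifests have to be combined. The following is a standalone sketch of such a merge, assuming subtitles are dictionaries mapping a language code to a list of track dictionaries with a `url` key. The function `merge_subtitles` and the example URLs are hypothetical stand-ins, not yt-dlp's own helper.

```python
def merge_subtitles(*subtitle_dicts):
    """Merge {lang: [track, ...]} dicts, skipping duplicate track URLs."""
    merged = {}
    for subs in subtitle_dicts:
        for lang, tracks in subs.items():
            known = merged.setdefault(lang, [])
            for track in tracks:
                if all(track['url'] != t['url'] for t in known):
                    known.append(track)
    return merged


# Hypothetical results of the two manifest extractions.
dash_subs = {'ru': [{'url': 'https://example.com/ru.vtt', 'ext': 'vtt'}]}
hls_subs = {
    'ru': [{'url': 'https://example.com/ru.vtt', 'ext': 'vtt'}],
    'en': [{'url': 'https://example.com/en.vtt', 'ext': 'vtt'}],
}
subtitles = merge_subtitles(dash_subs, hls_subs)
print(sorted(subtitles))
```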

Contributor Author

@seproDev I pushed another commit fixing the warning. However, I can't find a video on dzen with subtitles to check it. The video posted by @ansdim1 doesn't have them. So the warning is gone, but the output is the same.

Member

The video posted by ansdim1 does have a sub track for me.

Contributor Author

I pushed another commit, fixing some tests.
@seproDev I used your suggested 'description'.
Also updated hashes for some tests.

Status of tests:

python3.11 test/test_download.py TestDownload.test_ZenYandex  # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandex_1 # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandex_2 # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandex_3 # PASS

python3.11 test/test_download.py TestDownload.test_ZenYandexChannel # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_1 # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_2 # fails because of captcha
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_3 # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_4 # PASS
python3.11 test/test_download.py TestDownload.test_ZenYandexChannel_5 # fails because of captcha

Member

For me none of the tests fail due to captchas. However, these four tests still fail:

TestDownload.test_ZenYandexChannel   # fails due to channel no longer existing
TestDownload.test_ZenYandexChannel_1 # fails due to channel no longer existing
TestDownload.test_ZenYandexChannel_3 # fails due to channel only having 18 videos
TestDownload.test_ZenYandexChannel_4 # fails due to more than 46 videos being uploaded

Add 'skip': 'The page does not exist', to the tok_media entries.
Change to 'playlist_count': 18, for jony_me.
Change to 'playlist_mincount': 46, for tatyanareva.
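
The three keys behave differently: `skip` disables a test, `playlist_count` expects an exact number of entries, and `playlist_mincount` expects at least that many. A simplified model of these checks, assuming nothing about yt-dlp's actual test harness (the function `check_playlist` is hypothetical), could look like this:

```python
def check_playlist(test_case, entry_count):
    """Return 'skip', 'pass', or 'fail' for a simplified test definition."""
    if 'skip' in test_case:
        return 'skip'
    if 'playlist_count' in test_case:
        # Exact match: any upload or deletion breaks the test.
        ok = entry_count == test_case['playlist_count']
    elif 'playlist_mincount' in test_case:
        # Lower bound: the test survives new uploads.
        ok = entry_count >= test_case['playlist_mincount']
    else:
        ok = True
    return 'pass' if ok else 'fail'


print(check_playlist({'skip': 'The page does not exist'}, 0))
print(check_playlist({'playlist_count': 18}, 18))
print(check_playlist({'playlist_mincount': 46}, 50))
```

A minimum count suits channels that keep uploading, while an exact count only fits channels that are no longer updated.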

Contributor Author

Done. Now all ZenYandexChannel test cases pass for me.

Fix data JSON extraction.

The JSON begins like this in recent pages:
a.data.hasOwnProperty(t)&&(n[t]=a.data[t])}({"data":{"__serverState__...

New regular expression captures both old and new versions just in case.

Fix yt-dlp#8275
@starius starius force-pushed the fix-8275-fix-data-json-extraction branch from 8c32c2b to 8ad7e94 Compare November 10, 2023 19:22
@seproDev seproDev added pending-review PR needs a review and removed pending-fixes PR has had changes requested labels Nov 11, 2023
@bashonly bashonly removed the pending-review PR needs a review label Nov 15, 2023
@bashonly bashonly self-assigned this Nov 15, 2023
@bashonly bashonly merged commit 5efe68b into yt-dlp:master Nov 15, 2023
16 checks passed
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Development

Successfully merging this pull request may close these issues.

[ZenYandex] Broken video download. "Unable to extract medatada" error.
5 participants