NHK World vods in Japanese exhibit list index out of range error #8303

Closed
11 tasks done
Contik opened this issue Oct 7, 2023 · 6 comments · Fixed by #8309
Labels
site-bug Issue with a specific website

Comments

@Contik

Contik commented Oct 7, 2023


Region

From anywhere outside of Japan determined by request source IP address

Provide a description that is worded well enough to be understood

Per issue #8242 comment 1751738030:

Attempting to download NhkVod videos in Japanese at https://www3.nhk.or.jp/nhkworld/ja/... currently produces a list index out of range error. English-language videos at https://www3.nhk.or.jp/nhkworld/en/... do not exhibit the same behavior; they have been working since yesterday's merged pull request 8249 and yt-dlp version 2023.10.07.

For example, the daily noon and evening news videos in Japanese at https://www3.nhk.or.jp/nhkworld/ja/ondemand/video produce the verbose output attached below.

For context: these videos are intended for Japanese viewers abroad, so they are downloadable only from outside Japan. The site sends different HTTP response bodies depending on whether it perceives the request's source IP address to be inside or outside Japan. From outside Japan, the page shows:

[screenshot not preserved]

As I understand it, the videos highlighted with a red border are not retained: NHK only ever offers the current day's video for download. The two examples are today's 7 pm news (ニュース7 aka nyusu 7) and noon news (正午のニュース aka shogo no nyusu).

"Within" Japan you'll get:

Please use NHK Plus or NHK World Premium

This basically asks you to use your NHK Plus account to watch a show you missed, or to get yourself an NHK World Premium subscription.

Provide verbose output that clearly demonstrates the problem

  • Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
  • If using API, add 'verbose': True to YoutubeDL params instead
  • Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

[debug] Command-line config: ['-vU', 'https://www3.nhk.or.jp/nhkworld/ja/ondemand/video/0451269387/']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.10.07 [377e85a17]
[debug] Python 3.11.5 (CPython x86_64 64bit) - Linux-6.5.5-arch1-1-x86_64-with-glibc2.38 (OpenSSL 3.1.3 19 Sep 2023, glibc 2.38)
[debug] exe versions: ffmpeg 6.0 (setts), ffprobe 6.0, rtmpdump 2.4
[debug] Optional libraries: Cryptodome-3.12.0, certifi-2023.07.22, sqlite3-3.43.1, websockets-10.4
[debug] Proxy map: {}
[debug] Loaded 1886 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.10.07, Current version: stable@2023.10.07
yt-dlp is up to date (stable@2023.10.07)
[NhkVod] Extracting URL: https://www3.nhk.or.jp/nhkworld/ja/ondemand/video/0451269387/
[NhkVod] 0451-269: Downloading JSON metadata
ERROR: list index out of range
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 1567, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/yt_dlp/YoutubeDL.py", line 1702, in __extract_info
    ie_result = ie.extract(url)
                ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/yt_dlp/extractor/common.py", line 715, in extract
    ie_result = self._real_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/yt_dlp/extractor/nhk.py", line 205, in _real_extract
    return self._extract_episode_info(url)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/yt_dlp/extractor/nhk.py", line 77, in _extract_episode_info
    episode = self._call_api(
              ^^^^^^^^^^^^^^^
IndexError: list index out of range
@Contik Contik added site-bug Issue with a specific website triage Untriaged issue labels Oct 7, 2023
@garret1317
Collaborator

looks like the episode_id is getting cut off
for https://www3.nhk.or.jp/nhkworld/ja/ondemand/video/0046268465/ the extractor grabs

https://nwapi.nhk.jp/nhkworld/vodesdlist/v7b/episode/0046-268/ja/all/all.json (nothing there)
but the site grabs
https://nwapi.nhk.jp/nhkworld/vodesdlist/v7b/episode/0046-268465/ja/all/all.json

seems the assumption is that ids will always be 7 characters, but the japanese news ids are 10 characters
maybe they ran out, idk
the NhkVodIE regex is r'%s%s(?P<id>[0-9a-z]{7}|[^/]+?-\d{8}-[0-9a-z]+)'
if i replace the {7} with a +
and remove the length check in _extract_episode_info

        if len(episode_id) == 7:
            episode_id = episode_id[:4] + '-' + episode_id[4:]

it all starts working beautifully

but the NhkVodIE regex has another alternative, [^/]+?-\d{8}-[0-9a-z]+
should probably see what that's for and whether these changes break it
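A minimal sketch of the truncation described above, not the extractor's actual code. The `PREFIX` pattern is a hypothetical stand-in for the `%s%s` part of the quoted regex, `api_url` is an illustrative helper, and "remove the length check" is read here as "always insert the hyphen after the fourth character". The API URL layout is copied from the two URLs quoted in this comment.

```python
import re

# Hypothetical stand-in for the '%s%s' prefix of the quoted regex (assumption).
PREFIX = r'https?://www3\.nhk\.or\.jp/nhkworld/(?P<lang>[a-z]{2})/ondemand/(?:video|audio)/'

# ID part as quoted in the comment, and the variant with {7} replaced by +.
OLD_ID = r'(?P<id>[0-9a-z]{7}|[^/]+?-\d{8}-[0-9a-z]+)'
NEW_ID = r'(?P<id>[0-9a-z]+|[^/]+?-\d{8}-[0-9a-z]+)'

# Layout taken from the API URLs quoted above.
API_TEMPLATE = 'https://nwapi.nhk.jp/nhkworld/vodesdlist/v7b/episode/{id}/{lang}/all/all.json'


def api_url(url, id_regex, require_len_7):
    """Illustrative helper: extract the ID, hyphenate it, build the API URL."""
    m = re.match(PREFIX + id_regex, url)  # match() does not anchor at the end
    episode_id, lang = m.group('id'), m.group('lang')
    if not require_len_7 or len(episode_id) == 7:
        episode_id = episode_id[:4] + '-' + episode_id[4:]
    return API_TEMPLATE.format(id=episode_id, lang=lang)


url = 'https://www3.nhk.or.jp/nhkworld/ja/ondemand/video/0046268465/'
old = api_url(url, OLD_ID, require_len_7=True)   # ID cut to 7 chars: 0046-268
new = api_url(url, NEW_ID, require_len_7=False)  # full 10-char ID: 0046-268465
print(old)
print(new)
```

Because the pattern has no end anchor, `{7}` silently accepts the first seven characters of a longer ID instead of failing to match, which is why the request goes to an episode that does not exist.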

@garret1317 garret1317 removed the triage Untriaged issue label Oct 8, 2023
@garret1317
Collaborator

garret1317 commented Oct 8, 2023

it could be for radio on demand?
oh well, that's broken already :trollface:

edit:
yes, was added in 061d1cd, updated in b79df1b

@garret1317
Collaborator

wait no, it's only broken because it gets matched by the video regex lmao
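One way this kind of mismatch can arise, sketched under assumptions: in a Python alternation the first alternative that matches wins, so an ID that starts with seven alphanumeric characters is captured by the video branch before the dated branch is tried. The path and the `exampleshow-20231008-1` ID are hypothetical, and the split patterns are simplified illustrations rather than the regexes from the actual fix.

```python
import re

# Combined pattern: the ID part is the one quoted earlier in this thread;
# the '/ondemand/(video|audio)/' prefix is a simplified assumption.
COMBINED = re.compile(
    r'/ondemand/(?:video|audio)/(?P<id>[0-9a-z]{7}|[^/]+?-\d{8}-[0-9a-z]+)')

# Hypothetical radio-style path with a name-date-number ID.
path = '/ondemand/audio/exampleshow-20231008-1/'

# The first alternative matches seven characters and the regex stops there.
truncated = COMBINED.search(path).group('id')

# Illustrative split: one pattern per media type, so neither can shadow the other.
VIDEO_ONLY = re.compile(r'/ondemand/video/(?P<id>[0-9a-z]+)')
AUDIO_ONLY = re.compile(r'/ondemand/audio/(?P<id>[^/]+?-\d{8}-[0-9a-z]+)')

full = AUDIO_ONLY.search(path).group('id')
video_match = VIDEO_ONLY.search(path)  # None: the video pattern ignores audio paths

print(truncated)
print(full)
print(video_match)
```

Simply swapping `{7}` for `+` in the combined pattern would not help here, since `[0-9a-z]+` would still stop at the first hyphen; keeping the two patterns separate avoids the ordering problem entirely.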

garret1317 added a commit to garret1317/yt-dlp that referenced this issue Oct 8, 2023
radio was getting matched by a section of the regex meant for the video
extractor, and japanese-language vods broke because their ids were too
long.

this commit fixes NhkVodIE so it can extract japanese-language vods, by
removing the explicit specification of the length of the ID. It also
splits radio and tv into their own IEs, with separate regexes, so they
don't conflict with each other.

closes yt-dlp#8303 and fixes radio extraction
@garret1317 garret1317 mentioned this issue Oct 8, 2023
9 tasks
garret1317 added a commit to garret1317/yt-dlp that referenced this issue Oct 8, 2023
radio was getting matched by a section of the regex meant for the video
extractor, and japanese-language vods broke because their ids were too
long.

this commit fixes NhkVodIE so it can extract japanese-language vods, by
removing the explicit specification of the length of the ID. It also
splits radio and tv into their own regexes so they don't conflict with
each other.

fixes yt-dlp#8303 and radio extraction, replaces yt-dlp#8305
@garret1317 garret1317 mentioned this issue Oct 8, 2023
9 tasks
bashonly pushed a commit that referenced this issue Oct 9, 2023
@Contik
Author

Contik commented Oct 16, 2023

❤️

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this issue Apr 21, 2024