[niconico] "Unable to extract video title" on some videos with numeric-only ID-s #13840

Closed
Xender opened this issue Aug 6, 2017 · 6 comments

@Xender
Contributor

@Xender Xender commented Aug 6, 2017

  • I've verified and I assure that I'm running youtube-dl 2017.08.06

Before submitting an issue make sure you have:

  • At least skimmed through the README, most notably the FAQ and BUGS sections
  • Searched the bugtracker for similar issues including closed ones

What is the purpose of your issue?

  • Bug report (encountered problems with youtube-dl)
  • Site support request (request for adding support for a new site)
  • Feature request (request for a new functionality)
  • Question
  • Other

If the purpose of this issue is a bug report, site support request or you are not completely sure provide the full verbose output as follows:

$ python -m youtube_dl -F http://www.nicovideo.jp/watch/1311771889 --verbose --write-pages

[debug] System config: []
[debug] User config: ['-o', '%(title)s-%(format)s[v=%(id)s&fmt=%(format_id)s].%(ext)s', '-n']
[debug] Custom config: []
[debug] Command-line args: ['-F', 'http://www.nicovideo.jp/watch/1311771889', '--verbose', '--write-pages']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.08.06
[debug] Git HEAD: 903a183b6
[debug] Python version 3.6.2 - Linux-4.12.3-1-ARCH-x86_64-with-arch
[debug] exe versions: ffmpeg 3.3.2, ffprobe 3.3.2, rtmpdump 2.4
[debug] Proxy map: {}
[niconico] Logging in
[niconico] 1311771889: Downloading webpage
[niconico] Saving request to 1311771889_http_-_www.nicovideo.jp_watch_1311771889.dump
[niconico] 1311771889: Downloading video info page
[niconico] Saving request to 1311771889_http_-_ext.nicovideo.jp_api_getthumbinfo_1311771889.dump
[niconico] 1311771889: Downloading flv info
[niconico] Saving request to 1311771889_http_-_flapi.nicovideo.jp_api_getflv_1311771889as3=1.dump
ERROR: Unable to extract video title; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/home/xender/proj/youtube-dl/youtube_dl/YoutubeDL.py", line 776, in extract_info
    ie_result = ie.extract(url)
  File "/home/xender/proj/youtube-dl/youtube_dl/extractor/common.py", line 433, in extract
    ie_result = self._real_extract(url)
  File "/home/xender/proj/youtube-dl/youtube_dl/extractor/niconico.py", line 165, in _real_extract
    webpage, 'video title')
  File "/home/xender/proj/youtube-dl/youtube_dl/extractor/common.py", line 791, in _html_search_regex
    res = self._search_regex(pattern, string, name, default, fatal, flags, group)
  File "/home/xender/proj/youtube-dl/youtube_dl/extractor/common.py", line 782, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract video title; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Description of your issue, suggested solution and other information

So, there is some pretty funky stuff going on here...
Let's analyze the extractor's control flow:

        # Start extracting information
        title = xpath_text(video_info, './/title')
        if not title:
            title = self._og_search_title(webpage, default=None)
        if not title:
            title = self._html_search_regex(
                r'<span[^>]+class="videoHeaderTitle"[^>]*>([^<]+)</span>',
                webpage, 'video title')

The dumped getthumbinfo API response page:
1311771889_http_-_ext.nicovideo.jp_api_getthumbinfo_1311771889.dump

<?xml version="1.0" encoding="UTF-8"?>
<nicovideo_thumb_response status="fail">
  <error>
    <code>COMMUNITY</code>
    <description>community</description>
  </error>
</nicovideo_thumb_response>
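
A failure response like the one above can be told apart from a missing title before falling through to the other title sources. This is a minimal sketch using only the standard library, not the extractor's actual code; the helper name `thumbinfo_title` is hypothetical, and the only thing taken from the dump is the `status="fail"` attribute and the `<error><code>` element.

```python
import xml.etree.ElementTree as ET

# The getthumbinfo response dumped above, verbatim.
FAIL_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<nicovideo_thumb_response status="fail">
  <error>
    <code>COMMUNITY</code>
    <description>community</description>
  </error>
</nicovideo_thumb_response>"""


def thumbinfo_title(xml_text):
    """Return (title, error_code) for a getthumbinfo response body.

    Hypothetical helper: title is None when the API reports a failure
    or when the response carries no <title> element.
    """
    root = ET.fromstring(xml_text.encode('utf-8'))
    if root.get('status') == 'fail':
        # Surface the API's own error code instead of a silent None.
        return None, root.findtext('./error/code')
    return root.findtext('.//title'), None


print(thumbinfo_title(FAIL_RESPONSE))  # → (None, 'COMMUNITY')
```

Reporting the error code would at least make it visible that the API refused the request, rather than leaving only a regex failure further down the chain.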

So, xpath_text(video_info, './/title') fails on this one.
The same call actually works on some other videos, but not on this one.
I wonder if the different (numeric-only instead of sm

So the next ones (OpenGraph or regex tag search) should do the trick, right?
Enter the funkiness!

Nico serves an entirely different watch page to authenticated users than to unauthenticated ones.
The watch page for an unauthenticated user indeed does contain OG tags and the title tag.
(Oh, actually the title tag now looks like <h1 itemprop="name" class="txt-title">[title]</h1>. Either the regex needs updating, or a new if branch needs to be made).

On the watch page for a logged-in user, however, the OG and title tags are nowhere to be found.
Instead, the page heavily relies on scripts to render properly.
The <h1> title tag is present only in the rendered DOM after scripts on page execute.

To investigate:

  • Why does the getthumbinfo API return an error for this video? It doesn't do that for other ones.
  • I'm also concerned that, with the current script-reliant HTML, this API is the only title extraction mechanism out of these 3 that actually works...

Possible solutions - either:

  • Download the watch page using an unauthenticated request in order to receive HTML that contains all the required data.
  • Look at the contents of the data-api-data attribute of the <div id="js-initial-watch-data"> tag in the authenticated watch page to see if the needed information can be found there.
  • Investigate querying the Flash player page instead of HTML5 player page. The quirk is, it seems that both pages have the same URL and some stateful action is required to switch the player.
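
The data-api-data idea could be sketched roughly as follows. Everything about the payload here is an assumption: `SAMPLE_PAGE` is a made-up stand-in for the authenticated watch page (the real dump is not posted), the `video` and `title` keys are guesses, and the regex assumes the `id` attribute comes before `data-api-data` in the tag. The only grounded part is the tag and attribute names mentioned above.

```python
import html
import json
import re

# Hypothetical stand-in for the authenticated watch page; the real attribute
# payload and its key names ("video", "title") are assumptions.
SAMPLE_PAGE = (
    '<div id="js-initial-watch-data" '
    'data-api-data="{&quot;video&quot;:{&quot;id&quot;:&quot;1311771889&quot;,'
    '&quot;title&quot;:&quot;example title&quot;}}"></div>'
)


def initial_watch_data(webpage):
    """Return the decoded data-api-data JSON, or None if the div is absent."""
    match = re.search(
        r'<div[^>]+id="js-initial-watch-data"[^>]+data-api-data="([^"]+)"',
        webpage)
    if match is None:
        return None
    # The attribute value is HTML-escaped JSON, so unescape before parsing.
    return json.loads(html.unescape(match.group(1)))


data = initial_watch_data(SAMPLE_PAGE)
print(data['video']['title'])  # → example title
```

If the real attribute does hold the title, this would avoid depending on tags that only exist in the rendered DOM.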

Other points:

  • Update/add a second regex for parsing <h1 itemprop="name" class="txt-title">[title]</h1> tag.
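
A rough sketch of what trying both patterns could look like, written with plain `re` rather than the extractor's helpers. The first pattern is the one quoted from the extractor earlier in this issue; the second is my guess at a pattern for the new `<h1>` markup and has only been checked against the one-line example above. `find_title` is a hypothetical name. Note this would only help on the unauthenticated page, since the logged-in page has no such tag in its raw HTML.

```python
import re

TITLE_PATTERNS = (
    # Pattern currently used by the extractor (quoted earlier in this issue).
    r'<span[^>]+class="videoHeaderTitle"[^>]*>([^<]+)</span>',
    # Assumed pattern for the newer <h1 itemprop="name"> markup.
    r'<h1[^>]+itemprop="name"[^>]*>([^<]+)</h1>',
)


def find_title(webpage):
    """Return the first title matched by any pattern, or None."""
    for pattern in TITLE_PATTERNS:
        match = re.search(pattern, webpage)
        if match:
            return match.group(1).strip()
    return None


print(find_title('<h1 itemprop="name" class="txt-title">[title]</h1>'))  # → [title]
```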
@Xender
Contributor Author

@Xender Xender commented Aug 6, 2017

Not posting the content of 1311771889_http_-_www.nicovideo.jp_watch_1311771889.dump because of the sensitive info it contains. Sorry :(

@Xender
Contributor Author

@Xender Xender commented Aug 6, 2017

P.S. does youtube-dl include some methods for extracting information from HTML using a "proper" parser (as it provides such for XML) instead of just regexes?

@dstftw
Collaborator

@dstftw dstftw commented Aug 6, 2017

There are some methods in utils. However, "proper" parsing is discouraged as it's fragile and less future proof since HTML layouts tend to change.

@Xender
Contributor Author

@Xender Xender commented Aug 6, 2017

I see the point of discouraging the use of an HTML parser in the general case, but I think using something like BeautifulSoup (its CSS selectors feature especially) together with html5lib (lxml is too fragile indeed, I know it firsthand...) would feel less tricky than trying some more complicated regexes...

Back to the main point: Is there a way to make this (only this one) request unauthenticated (sent without cookies)?:

webpage, handle = self._download_webpage_handle(
    'http://www.nicovideo.jp/watch/' + video_id, video_id)

I'm currently digging into the code, but I'd appreciate pointing a way through :)
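
To illustrate what I mean by "sent without cookies", here is a sketch using plain urllib. This is not youtube-dl's own mechanism (I don't know yet how its downloader would expose this); `build_openers` is a hypothetical helper, and the point is only that an opener built without a cookie processor never attaches the stored session cookies.

```python
import http.cookiejar
import urllib.request


def build_openers(cookie_jar):
    """Return (authenticated, anonymous) openers (hypothetical helper)."""
    # This opener sends and stores cookies from cookie_jar on every request.
    authenticated = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    # build_opener() installs no cookie handler by default, so requests made
    # through this opener carry none of the stored cookies.
    anonymous = urllib.request.build_opener()
    return authenticated, anonymous


authenticated, anonymous = build_openers(http.cookiejar.CookieJar())
# anonymous.open('http://www.nicovideo.jp/watch/' + video_id) would then
# request the watch page as a logged-out visitor.
```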

In parallel, I'm researching parsing js-initial-watch-data.

@yan12125
Collaborator

@yan12125 yan12125 commented Aug 6, 2017

This is either fixed on NicoNico website or by recent youtube-dl changes. Anyway it's working now:

$ youtube-dl -v "http://www.nicovideo.jp/watch/1311771889" --cookie ~/tmp/cookies.txt 
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', 'http://www.nicovideo.jp/watch/1311771889', '--cookie', '/home/yen/tmp/cookies.txt']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.08.06
[debug] Git HEAD: ee6a61166
[debug] Python version 3.6.2 - Linux-4.12.4-1-ARCH-x86_64-with-arch-Arch-Linux
[debug] exe versions: ffmpeg 3.3.2, ffprobe 3.3.2
[debug] Proxy map: {}
[niconico] 1311771889: Downloading webpage
[debug] Default format spec: bestvideo+bestaudio/best
[debug] Invoking downloader on 'http://smile-cll16.nicovideo.jp/smile?m=15136416.54098low'
[download] Destination: 【踊ってみた】番凩【劇団ブリオッシュ】-1311771889.mp4
[download]   8.4% of 11.91MiB at 580.84KiB/s ETA 00:19^C
ERROR: Interrupted by user

Feel free to leave comments if it's broken again.

@yan12125 yan12125 closed this Aug 6, 2017
@Xender
Contributor Author

@Xender Xender commented Aug 6, 2017

@yan12125 Okay, you totally fixed this and this issue has been a duplicate of #13806.

Sorry for creating a duplicate and a big Thank you. :)

@Xender Xender mentioned this issue Aug 6, 2017