[niconico] "Unable to extract video title" on some videos with numeric-only ID-s #13840
Comments
|
Not posting content of |
|
P.S. Does youtube-dl include any methods for extracting information from HTML using a "proper" parser (as it does for XML) instead of just regexes? |
|
There are some methods in |
|
I see the point of discouraging the use of an HTML parser in the general case, but I think using something like BeautifulSoup (especially its CSS selectors feature) together with html5lib (lxml is too fragile indeed, I know it firsthand...) would feel less tricky than trying more complicated regexes...
Back to the main point: is there a way to make this request (only this one) unauthenticated, i.e. sent without cookies?
webpage, handle = self._download_webpage_handle(
    'http://www.nicovideo.jp/watch/' + video_id, video_id)
I'm currently digging into the code, but I'd appreciate a pointer in the right direction :) In parallel, I'm researching parsing |
|
This was fixed either on the NicoNico website or by recent youtube-dl changes. Either way, it's working now:
Feel free to leave comments if it's broken again. |
Before submitting an issue make sure you have:
What is the purpose of your issue?
If the purpose of this issue is a bug report, site support request or you are not completely sure provide the full verbose output as follows:
Description of your issue, suggested solution and other information
So, there is some pretty funky stuff going on here...
Let's analyze the extractor's control flow:
The dumped getthumbinfo API response page:
1311771889_http_-_ext.nicovideo.jp_api_getthumbinfo_1311771889.dump
So, xpath_text(video_info, './/title') fails on this one.
It actually works on some other videos, but not on this one.
I wonder if the different (numeric-only instead of sm
So the next ones (OpenGraph or regex tag search) should do the trick, right?
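To illustrate why the first lookup can come back empty, here is a minimal sketch using only the standard library. It is not youtube-dl's xpath_text, and both sample responses are made up for illustration (the real dump is not reproduced here); it assumes the failing response simply contains no title element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample responses, for illustration only.
OK_RESPONSE = """<nicovideo_thumb_response status="ok">
  <thumb><video_id>sm0000</video_id><title>Example title</title></thumb>
</nicovideo_thumb_response>"""

FAIL_RESPONSE = """<nicovideo_thumb_response status="fail">
  <error><code>NOT_FOUND</code></error>
</nicovideo_thumb_response>"""


def title_from_thumbinfo(xml_text):
    """Return the text of the first <title> element, or None if it is absent or empty."""
    root = ET.fromstring(xml_text)
    node = root.find('.//title')
    if node is None or not node.text:
        return None
    return node.text.strip()
```

When this returns None, the caller has to fall back to whatever the watch page itself provides.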
Enter the funkiness!
Nico serves an entirely different watch page to authenticated users than to unauthenticated ones.
The watch page for an unauthenticated user indeed does contain OG tags and the title tag.
(Oh, actually the title tag now looks like <h1 itemprop="name" class="txt-title">[title]</h1>. Either the regex needs updating, or a new if branch needs to be added.)
On the watch page for a logged-in user, however, the OG and title tags are nowhere to be found.
Instead, the page heavily relies on scripts to render properly.
The <h1> title tag is present only in the rendered DOM, after the scripts on the page execute.
To investigate:
Possible solutions - either:
contents of the data-api-data attribute of the <div id="js-initial-watch-data"> tag in the authenticated watch page, to see if the needed information can be found there.
Other points:
<h1 itemprop="name" class="txt-title">[title]</h1> tag.
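The fallbacks discussed above could look roughly like the sketch below. It is standalone Python, not the extractor's actual code: the regexes and both function names are hypothetical, the og:title pattern assumes the property attribute comes before content, and it is only an assumption that data-api-data holds HTML-escaped JSON (nothing is assumed about the keys inside it).

```python
import html
import json
import re

# Hypothetical patterns written for this sketch, not taken from youtube-dl.
OG_TITLE_RE = re.compile(
    r'<meta\s[^>]*property=(["\'])og:title\1[^>]*\scontent=(["\'])(.*?)\2',
    re.I | re.S)
H1_TITLE_RE = re.compile(
    r'<h1\s[^>]*itemprop=(["\'])name\1[^>]*>(.*?)</h1>',
    re.I | re.S)
WATCH_DATA_RE = re.compile(
    r'<div\s[^>]*id=(["\'])js-initial-watch-data\1[^>]*\sdata-api-data=(["\'])(.*?)\2',
    re.I | re.S)


def extract_title(page):
    """Try the og:title meta tag first, then the <h1 itemprop="name"> tag."""
    for pattern, group in ((OG_TITLE_RE, 3), (H1_TITLE_RE, 2)):
        match = pattern.search(page)
        if match:
            title = html.unescape(match.group(group)).strip()
            if title:
                return title
    return None


def extract_initial_watch_data(page):
    """Return the decoded data-api-data payload, or None if missing or not valid JSON."""
    match = WATCH_DATA_RE.search(page)
    if not match:
        return None
    try:
        # Assumption: the attribute value is HTML-escaped JSON.
        return json.loads(html.unescape(match.group(3)))
    except ValueError:
        return None
```

On an unauthenticated page, extract_title should be enough. On an authenticated page it is expected to return None, and the dictionary from extract_initial_watch_data would then need to be inspected by hand to see whether it contains the title.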