Fix npo support #31976
base: master
Conversation
Thanks for your work!
I've made a few suggestions, really just conventions, so I'll let the CI test run now.
Co-authored-by: dirkf <fieldhouse@gmx.net>
I have accepted all suggestions here. Sorry I missed some conventions.
* simplify comment * force CI
@dirkf Is it ready to merge now?
I have had great success using this fork to download NPO Start videos. Greatly appreciated @bartbroere. It would be nice to see these changes merged.
I'd like to merge this but it needs valid tests.
All the current test URLs redirect to the home page, but maybe that doesn't happen in NL. Please ensure that there is at least one valid test (with playable content, non-DRM) and mark any invalid URLs with 'skip': 'Content expired', (or ...: 'only available in NL', or whatever is appropriate) after the info_dict. If an invalid test URL doesn't match the extractor's _VALID_URL, it should be deleted (but I don't think that applies here).
If URLs are only valid in NL, or with an account, please post a console log of successful tests.
I've made some additional suggestions that I might improve with the aid of a usable test URL.
Thanks for the suggested improvements. I'll address all the feedback, and re-request review once I feel it's all done.
@dirkf Looking into it I realised this is probably caused by the recently released new NPO app and website. I'll check all tests and maybe add some new ones.
yt-dlp/yt-dlp#9319 addresses the same issues. I think (speaking for the Dutch users a bit) that support for npo.nl is the main feature, and all the other sites often just embed the NPO player in some way. Since almost all of the other sites no longer work in the same way as before, I would propose throwing away the extractors for many of the other domains, since rebuilding them is probably a lot quicker.
Re-implementing these is quicker for the cases where that's even still possible
It's a bit vexing that the yt-dlp PR has done the same job, but by all means plunder any useful stuff from there. I would generally merge/back-port an updated yt-dlp extractor, but generally also only when ours doesn't work.
I'm not convinced that the pull request on the yt-dlp side works for all the sites this one covers. On the other hand, my branch probably doesn't work for some of the sites theirs does. So I think we can do a nice bit of "cross-plundering" between my branch and theirs.
Great. Also, if there are standard patterns for the discarded sites where NPO videos are embedded, we can add a class variable _EMBED_REGEX and a method like:

@classmethod
def _extract_urls(cls, webpage):
    def yield_urls():
        for p in cls._EMBED_REGEX:
            for m in re.finditer(p, webpage):
                yield m.group('url')
    return list(yield_urls())

Eventually the webpage extraction system from yt-dlp should be pulled in, with the method standardised in InfoExtractor.
This indeed seems the best way to do this. I'll rework the code to fit in this pattern.
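As a concrete sketch of that pattern: the class name and the embed regex below are hypothetical (not taken from the PR), but the _extract_urls shape is the one quoted above.

```python
import re


class NPOEmbedExample(object):
    # Hypothetical embed pattern, for illustration only; a real extractor
    # would list the patterns the embedding sites actually use.
    _EMBED_REGEX = [
        r'<iframe[^>]+src=["\'](?P<url>https?://(?:www\.)?npo\.nl/[^"\']+)',
    ]

    @classmethod
    def _extract_urls(cls, webpage):
        # Scan every pattern and collect each named 'url' group.
        def yield_urls():
            for p in cls._EMBED_REGEX:
                for m in re.finditer(p, webpage):
                    yield m.group('url')
        return list(yield_urls())


page = '<iframe src="https://www.npo.nl/embed/abc123"></iframe>'
urls = NPOEmbedExample._extract_urls(page)
```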
# TODO Find out how we could obtain this automatically
# Otherwise this extractor might break each time SchoolTV deploys a new release
build_id = 'b7eHUzAVO7wHXCopYxQhV'
I think the only way is to load a random page, JSON-parse the __NEXT_DATA__ part and get the buildId prop from there. But then we might as well have that 'random page' be the actual video page and skip the /_next/data/ download part, as that object already contains the poms_mid.

It is not great, but I think the only stable option is parsing the __NEXT_DATA__ part for the poms_mid, like we initially did for the NPO Start web UI. I have worked with Next.js for quite a while and they have never changed the __NEXT_DATA__ part as far as I know, so it should be relatively safe.
There is the _search_nextjs_data() method if needed.
Thanks! I'll look into that and use it if I can make it work.
In fact, all the information that you might have got with the JSON API is in the Next.js hydration JSON in the page, including the build ID that is no longer of interest. I have this working but will update once other issues have been cleared.
Thanks, I can see this is coming along well. Here are a few comments and suggestions.
youtube_dl/extractor/npo.py (outdated)
product_id = program_metadata.get('productId')
images = program_metadata.get('images')
thumbnail = None
for image in images:
    thumbnail = image.get('url')
    break
title = program_metadata.get('title')
descriptions = program_metadata.get('description', {})
description = descriptions.get('long') or descriptions.get('short') or descriptions.get('brief')
duration = program_metadata.get('durationInSeconds')
if not product_id:
    raise ExtractorError('No productId found for slug: %s' % slug)
formats = self._download_by_product_id(product_id, slug, url)
return {
    'id': slug,
    'formats': formats,
    'title': title or slug,
    'description': description or title or slug,
    'thumbnail': thumbnail,
    'duration': duration,
}
The JSON may not be as expected, or even None. We now have utils.traverse_obj() to make this easier; also I suggest failing earlier, either for lack of product_id or for missing formats (that yt-dl needs you to sort):
Suggested change, from:

product_id = program_metadata.get('productId')
images = program_metadata.get('images')
thumbnail = None
for image in images:
    thumbnail = image.get('url')
    break
title = program_metadata.get('title')
descriptions = program_metadata.get('description', {})
description = descriptions.get('long') or descriptions.get('short') or descriptions.get('brief')
duration = program_metadata.get('durationInSeconds')
if not product_id:
    raise ExtractorError('No productId found for slug: %s' % slug)
formats = self._download_by_product_id(product_id, slug, url)
return {
    'id': slug,
    'formats': formats,
    'title': title or slug,
    'description': description or title or slug,
    'thumbnail': thumbnail,
    'duration': duration,
}

to:

product_id = traverse_obj(program_metadata, 'productId')
if not product_id:
    raise ExtractorError('No productId found for slug: %s' % (slug,))
formats = self._download_by_product_id(product_id, slug, url)
self._sort_formats(formats)
return merge_dicts(traverse_obj(program_metadata, {
    'title': 'title',
    'description': (('description', ('long', 'short', 'brief')), 'title'),
    'thumbnail': ('images', Ellipsis, 'url', T(url_or_none)),
    'duration': ('durationInSeconds', T(int_or_none)),
}, get_all=False), {
    'id': slug,
    'formats': formats,
    'title': slug,
    'description': slug,
})
However I wonder if it's really useful to default description so much. Generally, it shouldn't repeat the title, let alone the id.
# Skip because of m3u8 download
'skip_download': True
'id': video_id,
'title': metadata.get('title', '') + ' - ' + metadata.get('subtitle', ''),
Or use utils.join_nonempty() for this effect:

'title', 'subtitle' -> 'title - subtitle'
'title', '' -> 'title'
'', 'subtitle' -> 'subtitle'
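The effect can be sketched in a few lines; note this is a stand-in, and the library helper's default delimiter differs ('-' in yt-dlp), so ' - ' would be passed explicitly there.

```python
def join_nonempty(*values, delim=' - '):
    # Skip empty/None values and join the rest, giving exactly the three
    # behaviours listed above.
    return delim.join(str(v) for v in values if v)
```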
class VPROIE(NPOIE):
    IE_NAME = 'vpro'
    IE_DESC = 'vpro.nl'
    _VALID_URL = r'https?://(?:www\.)?vpro.nl/.*'
Is this a tight enough pattern?
formats = []
for result in results:
    formats.extend(self._download_by_product_id(result, video_id))
    break  # TODO find a better solution, VPRO pages can have multiple videos embedded
May this embedding occur in other pages (not vpro.nl)?
Are the second and up videos related (clips, trailers, etc), or is the case more like a series page with various episodes?
In the first case maybe skip the subsidiary videos; in the second, normally return a playlist result whose entries are either the url_result()s of episode URLs constructed for each video, or info_dicts extracted from the page.
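A sketch of the playlist case, assuming the data-media-id attribute discussed for vpro.nl pages; the URL template is invented for illustration. In a real extractor each entry would come from self.url_result() and the wrapper from self.playlist_result().

```python
import re


def build_playlist(webpage, playlist_id):
    # Collect embedded media ids and wrap them as url-type entries inside
    # a playlist-type info dict, mirroring youtube-dl's result shapes.
    entries = [{
        '_type': 'url',
        'url': 'https://npo.example/video/%s' % media_id,  # invented template
        'id': media_id,
    } for media_id in re.findall(r'data-media-id=["\']([^"\']+)', webpage)]
    return {
        '_type': 'playlist',
        'id': playlist_id,
        'entries': entries,
    }


# Invented page fragment, for illustration only
page = '<div data-media-id="VPWON_1"></div><div data-media-id="VPWON_2"></div>'
result = build_playlist(page, 'afleveringen')
```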
Looking at the test video page, there is apparently a content video and a teaser video. The former can be detected because it's inside (preceded by) <div class=grid>.
As far as I can see, the other pages that might list multiple videos are playlist pages like https://www.vpro.nl/programmas/tegenlicht/kijk/afleveringen.html or https://www.vpro.nl/programmas/tegenlicht/categorieen/wereld.html that don't include data-media-ids but just have links to programme episodes. Counterexamples welcome.
Co-authored-by: Roy <git@rvsit.nl>
Co-authored-by: dirkf <fieldhouse@gmx.net>
I'll be mostly offline during April, so this is just a heads up that this PR will not see much progress for a few weeks. I didn't forget about it though, and plan to continue working on it in May.
I may be able to save you much of that work. I've run through the d/l tests locally and adjusted the code somewhat to match. If anyone has an answer to #31976 (comment), I'll be able to settle that extractor and push an update that should be close to mergeable.
Before submitting a pull request make sure you have:
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
Fix support for NPO sites
This fixes two things that have been changed on the NPO websites:
- The current video player no longer returns the token in the JSON body, but instead provides us with an XSRF token in the cookie.
- A second call changed from GET to POST and should include this XSRF token.
This branch started out as a small fix, but in the ~11 months the PR was open the NPO site was updated heavily, so it now changes a lot more than the things above. Many of the broadcaster (NL: omroep) sites' extractors (vpro etc.) no longer worked, so these have been removed entirely. I'm always willing to look into re-implementing support for some of these, if that's still possible. However, typically the broadcaster's website just embeds the npo.nl player, so grabbing the same media from the npo.nl URL is recommended.
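The token handling described above can be sketched as follows. The cookie and header names here ('XSRF-TOKEN' / 'X-XSRF-TOKEN') are the conventional ones and are an assumption, not confirmed from the PR diff.

```python
import re


def xsrf_header_from_cookie(set_cookie):
    # The first GET to the player sets an XSRF cookie; the follow-up POST
    # must echo its value back in a request header. Cookie and header
    # names are assumed, for illustration only.
    m = re.search(r'XSRF-TOKEN=([^;]+)', set_cookie)
    if not m:
        raise ValueError('no XSRF token in Set-Cookie header')
    return {'X-XSRF-TOKEN': m.group(1)}


headers = xsrf_header_from_cookie('XSRF-TOKEN=abc123; Path=/; Secure')
```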