Fix npo support #31976

Open · wants to merge 39 commits into master

Conversation

@bartbroere commented Mar 31, 2023

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense

What is the purpose of your pull request?

  • Bug fix

Fix support for NPO sites

This fixes two things that have been changed on the NPO websites:

  • The current video player no longer returns the token in the JSON body, but instead provides us with an XSRF token in a cookie.
  • A second call changed from GET to POST and should include this XSRF token (see the sketch below).

This branch started out as a small fix, but in the ~11 months the PR was open the NPO site was updated heavily, so now it changes a lot more than the things above. Many of the broadcaster (NL: omroep) sites' extractors (vpro etc.) no longer worked, so these have been removed entirely. I'm always willing to look into re-implementing support for some of these, if that's still possible. However, typically the broadcaster's website just embeds the npo.nl player, so grabbing the same media from the npo.nl URL is recommended.
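
A rough sketch of that token flow as it could look inside the extractor (the cookie name, endpoint and payload below are illustrative assumptions, not the exact values used in this PR):

    import json

    # This would live on the NPO extractor class; the helper name is hypothetical.
    def _get_player_info(self, product_id, url):
        # First request: the player page sets the XSRF token as a cookie
        # instead of returning it in the JSON body (cookie name assumed).
        self._download_webpage(url, product_id, note='Fetching XSRF cookie')
        xsrf = self._get_cookies(url).get('XSRF-TOKEN')

        # Second request: now a POST, carrying the XSRF token as a header.
        # Endpoint and payload are placeholders.
        return self._download_json(
            'https://npo.nl/start/api/player-token', product_id,
            data=json.dumps({'productId': product_id}).encode('utf-8'),
            headers={
                'Content-Type': 'application/json',
                'X-XSRF-TOKEN': xsrf.value if xsrf else '',
            })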

@dirkf (Contributor) left a comment

Thanks for your work!

I've made a few suggestions, really just conventions, so I'll let the CI test run now.

@bartbroere bartbroere requested a review from dirkf April 3, 2023 07:53
@bartbroere (Author) commented

I have accepted all suggestions here. Sorry I missed some conventions.

@bartbroere (Author) commented

@dirkf Is it ready to merge now?

@RickTimmer commented

I have had great success using this fork to download NPO Start videos. Greatly appreciated @bartbroere. It would be nice to see these changes merged.

@dirkf (Contributor) left a comment

I'd like to merge this but it needs valid tests.

All the current test URLs redirect to the home page, but maybe that doesn't happen in NL. Please ensure that there is at least one valid test (with playable content, non-DRM) and mark any invalid URLs with

        'skip': 'Content expired',

(or ...: 'only available in NL', or whatever is appropriate) after the info_dict. If an invalid test URL doesn't match the extractor's _VALID_URL, it should be deleted (but I don't think that applies here).
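
For illustration, a test entry marked like that might look as follows (the URL and metadata here are placeholders, not a verified test case):

    _TESTS = [{
        'url': 'https://npo.nl/start/serie/example-serie/seizoen-1/example-aflevering/afspelen',
        'info_dict': {
            'id': 'example-aflevering',
            'ext': 'mp4',
            'title': 'Example aflevering',
        },
        'params': {
            # Skip because of m3u8 download
            'skip_download': True,
        },
        'skip': 'Only available in NL',
    }]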

If URLs are only valid in NL, or with an account, please post a console log of successful tests.

I've made some additional suggestions that I might improve with the aid of a usable test URL.

@bartbroere (Author) commented

> I'd like to merge this but it needs valid tests.
>
> All the current test URLs redirect to the home page, but maybe that doesn't happen in NL. Please ensure that there is at least one valid test (with playable content, non-DRM) and mark any invalid URLs with
>
>     'skip': 'Content expired',
>
> (or ...: 'only available in NL', or whatever is appropriate) after the info_dict. If an invalid test URL doesn't match the extractor's _VALID_URL, it should be deleted (but I don't think that applies here).
>
> If URLs are only valid in NL, or with an account, please post a console log of successful tests.
>
> I've made some additional suggestions that I might improve with the aid of a usable test URL.

Thanks for the suggested improvements. I'll address all the feedback, and re-request review once I feel it's all done.

@bartbroere (Author) commented

@dirkf Looking into it, I realised this is probably caused by the recently released new NPO app and website. I'll check all the tests and maybe add some new ones.

@bartbroere (Author) commented

yt-dlp/yt-dlp#9319 addresses the same issues. I think (speaking for the Dutch users a bit) that support for npo.nl is the main feature, and all the other sites often just embed the NPO player in some way.

Since almost all of the other sites no longer work in the same way as before, I would propose dropping the extractors for many of the other domains; rebuilding them from scratch is probably a lot quicker.

Re-implementing these is quicker in the cases where that's even still possible.
@dirkf (Contributor) commented Mar 1, 2024

It's a bit vexing that the yt-dlp PR has done the same job, but by all means plunder any useful stuff from there. I would generally merge/back-port an updated yt-dlp extractor but generally also only when ours doesn't work.

@bartbroere (Author) commented Mar 1, 2024

> It's a bit vexing that the yt-dlp PR has done the same job, but by all means plunder any useful stuff from there. I would generally merge/back-port an updated yt-dlp extractor but generally also only when ours doesn't work.

I'm not convinced that the pull request on yt_dlp actually works for the new npo.nl player yet:
https://github.com/rvsit/yt-dlp/blob/c2d2b589588c7dbe44a97b4d29bad0d68922cd07/yt_dlp/extractor/npo.py#L122-L123

On the other hand, my branch probably doesn't work for some of the yt_dlp test cases yet.

So I think we can do a nice bit of "cross-plundering" between my youtube-dl pull request and @rvsit 's yt-dlp pull request.

@bartbroere mentioned this pull request Mar 1, 2024
@dirkf (Contributor) commented Mar 1, 2024

Great.

Also, if there are standard patterns for the discarded sites where NPO videos are embedded, we can add a class variable _EMBED_REGEX that is a list of those patterns with the NPO URL found in regex group url. Add this to provide an entry-point that could be called from the generic extractor:

    @classmethod
    def _extract_urls(cls, webpage):
        def yield_urls():
            for p in cls._EMBED_REGEX:
                for m in re.finditer(p, webpage):
                    yield m.group('url')

        return list(yield_urls())
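
For illustration, the class variable itself might look something like this (the iframe markup matched here is a hypothetical pattern, not taken from an actual broadcaster page):

    # Hypothetical embed pattern: an <iframe> whose src points at the npo.nl
    # player, with the player URL captured in the named group 'url'.
    _EMBED_REGEX = [
        r'<iframe[^>]+\bsrc=["\'](?P<url>https?://(?:www\.)?npo\.nl/[^"\']+)',
    ]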

Eventually the webpage extraction system from yt-dlp should be pulled in, with the method standardised in InfoExtractor, and the generic extractor refactored to blazes.

@bartbroere (Author) commented

My suggestion for a universal approach would be

  • Extract the POMS id from the URL, with various implementations per URL format/domain

  • They all call the same function that takes a POMS id and gets the URLs

  • Optionally enrich it with title and such if not already extracted from the URL

    • We could use the POMS API for this, but the URL signing/API key is a little annoying even though those keys are public: https://rs.poms.omroep.nl/v1/api/media/[id] example_response_POMS_NTR_388772.json

This indeed seems the best way to do this. I'll rework the code to fit in this pattern.
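
A rough sketch of that pattern (the class names, the shared helper and the page markup below are hypothetical, just to show the shape):

    from .common import InfoExtractor

    class NPOBaseIE(InfoExtractor):
        def _extract_by_poms_id(self, poms_id, display_id):
            # Shared path: resolve a POMS id to formats (and optionally
            # title/description); the actual lookup is left abstract here.
            formats = self._formats_for_poms_id(poms_id, display_id)
            self._sort_formats(formats)
            return {
                'id': poms_id,
                'title': display_id,
                'formats': formats,
            }

    class SomeOmroepIE(NPOBaseIE):
        # Per-site part: only the POMS id extraction differs per domain.
        _VALID_URL = r'https?://(?:www\.)?example-omroep\.nl/video/(?P<id>[^/?#]+)'

        def _real_extract(self, url):
            display_id = self._match_id(url)
            webpage = self._download_webpage(url, display_id)
            poms_id = self._html_search_regex(
                r'data-media-id="([^"]+)"', webpage, 'POMS id')
            return self._extract_by_poms_id(poms_id, display_id)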

Comment on lines +205 to +207
    # TODO Find out how we could obtain this automatically
    # Otherwise this extractor might break each time SchoolTV deploys a new release
    build_id = 'b7eHUzAVO7wHXCopYxQhV'
@rvsit commented Mar 6, 2024

I think the only way is to load a random page, JSON-parse the __NEXT_DATA__ part and get the buildId prop from there. But then we might as well have that 'random page' be the actual video page and skip the /_next/data/ download part, as that object already contains the poms_mid.
It is not great, but I think the only stable option is parsing the __NEXT_DATA__ part for the poms_mid, like we initially did for the NPO Start web UI. I have worked with Next.js for quite a while and, as far as I know, they have never changed the __NEXT_DATA__ part, so it should be relatively safe.

@dirkf (Contributor) replied

There is the _search_nextjs_data() method if needed.

@bartbroere (Author) replied

> There is the _search_nextjs_data() method if needed.

Thanks! I'll look into that and use it if I can make it work.
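
A minimal sketch of how that could be used (the path into the Next.js payload is an assumption and would need to be checked against a real page):

    # traverse_obj comes from ..utils, as in other extractor modules
    next_data = self._search_nextjs_data(webpage, video_id)
    poms_mid = traverse_obj(
        next_data, ('props', 'pageProps', 'program', 'pomsMid'))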

@dirkf (Contributor) replied

In fact, all the information that you might have got with the JSON API is in the Next.js hydration JSON in the page, including the build ID that is no longer of interest. I have this working but will update once other issues have been cleared.

@dirkf (Contributor) left a comment

Thanks, I can see this is coming along well. Here are a few comments and suggestions.

Comment on lines 52 to 75
    product_id = program_metadata.get('productId')
    images = program_metadata.get('images')
    thumbnail = None
    for image in images:
        thumbnail = image.get('url')
        break
    title = program_metadata.get('title')
    descriptions = program_metadata.get('description', {})
    description = descriptions.get('long') or descriptions.get('short') or descriptions.get('brief')
    duration = program_metadata.get('durationInSeconds')

    if not product_id:
        raise ExtractorError('No productId found for slug: %s' % slug)

    formats = self._download_by_product_id(product_id, slug, url)

    info = {
        'id': video_id,
        'title': video_id,
    return {
        'id': slug,
        'formats': formats,
        'title': title or slug,
        'description': description or title or slug,
        'thumbnail': thumbnail,
        'duration': duration,
    }
@dirkf (Contributor) commented

The JSON may not be as expected, or even None. We now have utils.traverse_obj() to make this easier; also I suggest failing earlier, either for lack of product_id or for missing formats (that yt-dl needs you to sort):

Suggested change

(original)

    product_id = program_metadata.get('productId')
    images = program_metadata.get('images')
    thumbnail = None
    for image in images:
        thumbnail = image.get('url')
        break
    title = program_metadata.get('title')
    descriptions = program_metadata.get('description', {})
    description = descriptions.get('long') or descriptions.get('short') or descriptions.get('brief')
    duration = program_metadata.get('durationInSeconds')
    if not product_id:
        raise ExtractorError('No productId found for slug: %s' % slug)
    formats = self._download_by_product_id(product_id, slug, url)
    info = {
        'id': video_id,
        'title': video_id,
    return {
        'id': slug,
        'formats': formats,
        'title': title or slug,
        'description': description or title or slug,
        'thumbnail': thumbnail,
        'duration': duration,
    }

(suggested)

    product_id = traverse_obj(program_metadata, 'productId')
    if not product_id:
        raise ExtractorError('No productId found for slug: %s' % (slug,))
    formats = self._download_by_product_id(product_id, slug, url)
    self._sort_formats(formats)
    return merge_dicts(traverse_obj(program_metadata, {
        'title': 'title',
        'description': (('description', ('long', 'short', 'brief')), 'title'),
        'thumbnail': ('images', Ellipsis, 'url', T(url_or_none)),
        'duration': ('durationInSeconds', T(int_or_none)),
    }, get_all=False), {
        'id': slug,
        'formats': formats,
        'title': slug,
        'description': slug,
    })

However I wonder if it's really useful to default description so much. Generally, it shouldn't repeat the title, let alone the id.

    # Skip because of m3u8 download
    'skip_download': True
    'id': video_id,
    'title': metadata.get('title', '') + ' - ' + metadata.get('subtitle', ''),
@dirkf (Contributor) commented

Or use utils.join_nonempty() for this effect:

  • 'title', 'subtitle' -> 'title - subtitle'
  • 'title', '' -> 'title'
  • '', 'subtitle' -> 'subtitle'
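
For instance (a self-contained illustration, assuming a metadata dict like the one in the excerpt above):

    from youtube_dl.utils import join_nonempty

    metadata = {'title': 'Example title', 'subtitle': ''}
    # Only the non-empty parts are joined, so this yields 'Example title'
    # rather than 'Example title - '.
    title = join_nonempty(metadata.get('title'), metadata.get('subtitle'), delim=' - ')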

    class VPROIE(NPOIE):
        IE_NAME = 'vpro'
        IE_DESC = 'vpro.nl'
        _VALID_URL = r'https?://(?:www\.)?vpro.nl/.*'
@dirkf (Contributor) commented

Is this a tight enough pattern?

    formats = []
    for result in results:
        formats.extend(self._download_by_product_id(result, video_id))
        break  # TODO find a better solution, VPRO pages can have multiple videos embedded
@dirkf (Contributor) commented

May this embedding occur in other pages (not vpro.nl)?

Are the second and subsequent videos related (clips, trailers, etc.), or is the case more like a series page with various episodes?

In the first case maybe skip the subsidiary videos; in the second normally return a playlist result whose entries are either the url_result()s of episode URLs constructed for each video, or info_dicts extracted from the page.
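
A sketch of the second case, returning a playlist of url_result() entries (the per-episode URL construction is a placeholder, assuming the embedded media ids have already been collected from the page):

    entries = [
        self.url_result('https://npo.nl/start/video/%s' % media_id, ie=NPOIE.ie_key())
        for media_id in media_ids]
    return self.playlist_result(entries, playlist_id=video_id)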

@dirkf (Contributor) commented

Looking at the test video page, there is apparently a content video and a teaser video. The former can be detected because it's inside (preceded by) <div class=grid>.

As far as I can see, other pages that might list multiple videos are playlist pages like https://www.vpro.nl/programmas/tegenlicht/kijk/afleveringen.html or https://www.vpro.nl/programmas/tegenlicht/categorieen/wereld.html that don't include data-media-ids but just have links to programme episodes. Counterexamples welcome.

@bartbroere (Author) commented

I'll be mostly offline during April, so this is just a heads up that this PR will not see much progress for a few weeks. I didn't forget about it though, and plan to continue working on it in May.

@dirkf (Contributor) commented Apr 3, 2024

I may be able to save you much of that work. I've run through the d/l tests locally and adjusted the code somewhat to match. If anyone has an answer to #31976 (comment), I'll be able to settle that extractor and push an update that should be close to mergeable.
