Fix npo support #31976
base: master
Conversation
Thanks for your work!
I've made a few suggestions, really just conventions, so I'll let the CI test run now.
Co-authored-by: dirkf <fieldhouse@gmx.net>
I have accepted all suggestions here. Sorry I missed some conventions.
* simplify comment * force CI
@dirkf Is it ready to merge now?
I have had great success using this fork to download NPO Start videos. Greatly appreciated @bartbroere. It would be nice to see these changes merged.
I'd like to merge this but it needs valid tests.
All the current test URLs redirect to the home page, but maybe that doesn't happen in NL. Please ensure that there is at least one valid test (with playable content, non-DRM) and mark any invalid URLs with 'skip': 'Content expired', (or ...: 'only available in NL', or whatever is appropriate) after the info_dict. If an invalid test URL doesn't match the extractor's _VALID_URL, it should be deleted (but I don't think that applies here).
If URLs are only valid in NL, or with an account, please post a console log of successful tests.
I've made some additional suggestions that I might improve with the aid of a usable test URL.
Thanks for the suggested improvements. I'll address all the feedback, and re-request review once I feel it's all done.
@dirkf Looking into it I realised this is probably caused by the recently released new NPO app and website. I'll check all tests and maybe add some new ones.
yt-dlp/yt-dlp#9319 addresses the same issues. I think (speaking for the Dutch users a bit) that support for npo.nl is the main feature, and all the other sites often just embed the NPO player in some way. Since almost all of the other sites no longer work in the same way as before, I would propose throwing away the extractors for many of the other domains, since rebuilding them is probably a lot quicker.
Re-implementing these is quicker for the cases where that's even still possible
It's a bit vexing that the yt-dlp PR has done the same job, but by all means plunder any useful stuff from there. I would generally merge/back-port an updated yt-dlp extractor, but generally also only when ours doesn't work.
I'm not convinced that the pull request on the yt-dlp side works for all the sites this one covers. On the other hand, my branch probably doesn't work for some of the sites theirs does. So I think we can do a nice bit of "cross-plundering" between my branch and theirs.
Great. Also, if there are standard patterns for the discarded sites where NPO videos are embedded, we can add a class variable _EMBED_REGEX and a method like:

@classmethod
def _extract_urls(cls, webpage):
    def yield_urls():
        for p in cls._EMBED_REGEX:
            for m in re.finditer(p, webpage):
                yield m.group('url')
    return list(yield_urls())

Eventually the webpage extraction system from yt-dlp should be pulled in, with the method standardised in InfoExtractor.
This indeed seems the best way to do this. I'll rework the code to fit in this pattern.
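As a concrete sketch of that pattern: the class name and the embed regex below are hypothetical (not taken from the PR), but the _extract_urls shape is the one quoted above.

```python
import re


class NPOEmbedExample(object):
    # Hypothetical embed pattern, for illustration only; a real extractor
    # would list the patterns the embedding sites actually use.
    _EMBED_REGEX = [
        r'<iframe[^>]+src=["\'](?P<url>https?://(?:www\.)?npo\.nl/[^"\']+)',
    ]

    @classmethod
    def _extract_urls(cls, webpage):
        # Scan every pattern and collect each named 'url' group.
        def yield_urls():
            for p in cls._EMBED_REGEX:
                for m in re.finditer(p, webpage):
                    yield m.group('url')
        return list(yield_urls())


page = '<iframe src="https://www.npo.nl/embed/abc123"></iframe>'
urls = NPOEmbedExample._extract_urls(page)
```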
# TODO Find out how we could obtain this automatically
# Otherwise this extractor might break each time SchoolTV deploys a new release
build_id = 'b7eHUzAVO7wHXCopYxQhV'
I think the only way is to load a random page, JSON-parse the __NEXT_DATA__ part and get the buildId prop from there. But then we might as well have that 'random page' be the actual video page and skip the /_next/data/ download part, as that object already contains the poms_mid.

It is not great, but I think the only stable option is parsing the __NEXT_DATA__ part for the poms_mid, like we initially did for the NPO Start web UI. I have worked with Next.js for quite a while and they have never changed the __NEXT_DATA__ part as far as I know, so it should be relatively safe.
There is the _search_nextjs_data() method if needed.
Thanks! I'll look into that and use it if I can make it work.
In fact, all the information that you might have got with the JSON API is in the Next.js hydration JSON in the page, including the build ID that is no longer of interest. I have this working but will update once other issues have been cleared.
Thanks, I can see this is coming along well. Here are a few comments and suggestions.
youtube_dl/extractor/npo.py (outdated)
product_id = program_metadata.get('productId')
images = program_metadata.get('images')
thumbnail = None
for image in images:
    thumbnail = image.get('url')
    break
title = program_metadata.get('title')
descriptions = program_metadata.get('description', {})
description = descriptions.get('long') or descriptions.get('short') or descriptions.get('brief')
duration = program_metadata.get('durationInSeconds')
if not product_id:
    raise ExtractorError('No productId found for slug: %s' % slug)
formats = self._download_by_product_id(product_id, slug, url)
return {
    'id': slug,
    'formats': formats,
    'title': title or slug,
    'description': description or title or slug,
    'thumbnail': thumbnail,
    'duration': duration,
}
The JSON may not be as expected, or even None. We now have utils.traverse_obj() to make this easier; also I suggest failing earlier, either for lack of product_id or for missing formats (that yt-dl needs you to sort):
Suggested change, from:

product_id = program_metadata.get('productId')
images = program_metadata.get('images')
thumbnail = None
for image in images:
    thumbnail = image.get('url')
    break
title = program_metadata.get('title')
descriptions = program_metadata.get('description', {})
description = descriptions.get('long') or descriptions.get('short') or descriptions.get('brief')
duration = program_metadata.get('durationInSeconds')
if not product_id:
    raise ExtractorError('No productId found for slug: %s' % slug)
formats = self._download_by_product_id(product_id, slug, url)
return {
    'id': slug,
    'formats': formats,
    'title': title or slug,
    'description': description or title or slug,
    'thumbnail': thumbnail,
    'duration': duration,
}

to:

product_id = traverse_obj(program_metadata, 'productId')
if not product_id:
    raise ExtractorError('No productId found for slug: %s' % (slug,))
formats = self._download_by_product_id(product_id, slug, url)
self._sort_formats(formats)
return merge_dicts(traverse_obj(program_metadata, {
    'title': 'title',
    'description': (('description', ('long', 'short', 'brief')), 'title'),
    'thumbnail': ('images', Ellipsis, 'url', T(url_or_none)),
    'duration': ('durationInSeconds', T(int_or_none)),
}, get_all=False), {
    'id': slug,
    'formats': formats,
    'title': slug,
    'description': slug,
})
However I wonder if it's really useful to default description so much. Generally, it shouldn't repeat the title, let alone the id.
# Skip because of m3u8 download
'skip_download': True
'id': video_id,
'title': metadata.get('title', '') + ' - ' + metadata.get('subtitle', ''),
Or use utils.join_nonempty() for this effect:

'title', 'subtitle' -> 'title - subtitle'
'title', '' -> 'title'
'', 'subtitle' -> 'subtitle'
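The effect can be sketched in a few lines; note this is a stand-in, and the library helper's default delimiter differs ('-' in yt-dlp), so ' - ' would be passed explicitly there.

```python
def join_nonempty(*values, delim=' - '):
    # Skip empty/None values and join the rest, giving exactly the three
    # behaviours listed above.
    return delim.join(str(v) for v in values if v)
```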
class VPROIE(NPOIE):
    IE_NAME = 'vpro'
    IE_DESC = 'vpro.nl'
    _VALID_URL = r'https?://(?:www\.)?vpro.nl/.*'
Is this a tight enough pattern?
formats = []
for result in results:
    formats.extend(self._download_by_product_id(result, video_id))
    break  # TODO find a better solution, VPRO pages can have multiple videos embedded
May this embedding occur in other pages (not vpro.nl)?
Are the second and up videos related (clips, trailers, etc), or is the case more like a series page with various episodes?
In the first case maybe skip the subsidiary videos; in the second, normally return a playlist result whose entries are either the url_result()s of episode URLs constructed for each video, or info_dicts extracted from the page.
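A sketch of the playlist case, assuming the data-media-id attribute discussed for vpro.nl pages; the URL template is invented for illustration. In a real extractor each entry would come from self.url_result() and the wrapper from self.playlist_result().

```python
import re


def build_playlist(webpage, playlist_id):
    # Collect embedded media ids and wrap them as url-type entries inside
    # a playlist-type info dict, mirroring youtube-dl's result shapes.
    entries = [{
        '_type': 'url',
        'url': 'https://npo.example/video/%s' % media_id,  # invented template
        'id': media_id,
    } for media_id in re.findall(r'data-media-id=["\']([^"\']+)', webpage)]
    return {
        '_type': 'playlist',
        'id': playlist_id,
        'entries': entries,
    }


# Invented page fragment, for illustration only
page = '<div data-media-id="VPWON_1"></div><div data-media-id="VPWON_2"></div>'
result = build_playlist(page, 'afleveringen')
```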
Looking at the test video page, there is apparently a content video and a teaser video. The former can be detected because it's inside (preceded by) <div class=grid>.
As far as I can see, the other pages that might list multiple videos are playlist pages like https://www.vpro.nl/programmas/tegenlicht/kijk/afleveringen.html or https://www.vpro.nl/programmas/tegenlicht/categorieen/wereld.html that don't include data-media-ids but just have links to programme episodes. Counterexamples welcome.
Co-authored-by: Roy <git@rvsit.nl>
Co-authored-by: dirkf <fieldhouse@gmx.net>
I'll be mostly offline during April, so this is just a heads up that this PR will not see much progress for a few weeks. I didn't forget about it though, and plan to continue working on it in May.
I may be able to save you much of that work. I've run through the d/l tests locally and adjusted the code somewhat to match. If anyone has an answer to #31976 (comment), I'll be able to settle that extractor and push an update that should be close to mergeable.
Before submitting a pull request make sure you have:
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
Fix support for NPO sites
This fixes two things that have been changed on the NPO websites:
- The current video player no longer returns the token in the JSON body, but instead provides us with an XSRF token in the cookie.
- A second call changed from GET to POST and should include this XSRF token.
This branch started out as a small fix, but in the ~11 months the PR was open the NPO site was updated heavily, so it now changes a lot more than the things above. Many of the broadcaster (NL: omroep) sites' extractors (vpro etc.) no longer worked, so these have been removed entirely. I'm always willing to look into re-implementing support for some of these, if that's still possible. However, typically the broadcaster's website just embeds the npo.nl player, so grabbing the same media from the npo.nl URL is recommended.
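The token handling described above can be sketched as follows. The cookie and header names here ('XSRF-TOKEN' / 'X-XSRF-TOKEN') are the conventional ones and are an assumption, not confirmed from the PR diff.

```python
import re


def xsrf_header_from_cookie(set_cookie):
    # The first GET to the player sets an XSRF cookie; the follow-up POST
    # must echo its value back in a request header. Cookie and header
    # names are assumed, for illustration only.
    m = re.search(r'XSRF-TOKEN=([^;]+)', set_cookie)
    if not m:
        raise ValueError('no XSRF token in Set-Cookie header')
    return {'X-XSRF-TOKEN': m.group(1)}


headers = xsrf_header_from_cookie('XSRF-TOKEN=abc123; Path=/; Secure')
```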