Generalized framework for webpage-based extraction #4307

pukkandan · 2022-07-08T16:03:14Z

Motivation

1. Un-bloating GenericIE

These are the tasks that our current Generic extractor performs (in order):

Fix protocol
default_search
Redirect handling
direct video link
                    <-- The webpage is downloaded here
M3U playlist
RSS feed
ISM manifest
SMIL file
XSPF playlist
DASH manifest
F4M manifest
Camtasia video
Embed detection     <-- Manually added by each extractor
HTML5 media
JW Player playlist
JW Player data
video.js embed
JSON LD
JW Player in SWFObject => flashvars:
JW Player embed => jw_plugins|jwplayer
KWS Player
video file => (file|source)=
JW Player JS loader => (file|video_url):
Flow Player
Cinerama player
Twitter card
Open Graph video info
twitter:player iframe

Having all of these in a single method makes the extractor hard to reason about and maintain. It makes sense for everything after the webpage download to be separated out into its own module. This is especially true for embed detection, since each extractor currently has to implement it, and the related tests, in GenericIE.

2. Pages with multiple videos

Only the first matching result is currently returned from the GenericIE. So if a webpage contains embeds from multiple websites, only one of them will be extracted. Example: #4291

3. Smarter embed detection

Most extractors' embed detection code boils down to looking for an iframe/embed that matches its _VALID_URL. Extractor authors often don't add this code, and so embeds of the respective site are not detected despite us having all the necessary code to do so. Eg: #80, blackjack4494/yt-dlc#204
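As a rough illustration of the idea above, the sketch below scans a page for iframe/embed `src` attributes and keeps those accepted by an extractor's URL pattern. `VALID_URL`, `SRC_RE` and `find_embeds` are hypothetical names made up for this example, and the site pattern is a placeholder; this is not the code used in yt-dlp.

```python
import re

# Hypothetical stand-in for an extractor's _VALID_URL (placeholder site)
VALID_URL = r'https?://(?:www\.)?example\.com/embed/(?P<id>\w+)'

# Generic pattern for the src attribute of <iframe> and <embed> tags
SRC_RE = re.compile(
    r'<(?:iframe|embed)\b[^>]*?\bsrc=(["\'])(?P<url>.+?)\1', re.IGNORECASE)


def find_embeds(webpage, valid_url=VALID_URL):
    """Yield every iframe/embed src that the extractor's URL pattern accepts."""
    for mobj in SRC_RE.finditer(webpage):
        url = mobj.group('url')
        if re.match(valid_url, url):
            yield url
```

Because the check reuses the URL pattern the extractor already has, no per-extractor embed code is needed in this simplified setting.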

4. Webpage based extractor matching

The current method of matching extractors using only the URL has its limitations:

a) There are cases where both the URL and the webpage are needed for detection. These are currently implemented in the same way as embeds (Eg: [Site Request] Update Invidious section of the YT extractor to use new link element on videos. #195), but when we solve (2) we must make sure not to return additional results for these
b) Sometimes additional network requests are needed to check for a match. See: Self-hosted extractors: Mastodon, PeerTube and Misskey (with haruhi-dl merge) #1791. If this is implemented in the normal flow of generic extraction, every unsupported URL will unnecessarily incur this (potentially expensive) processing. So some sort of user-facing option is needed to control this.

5. Extractor selection

Sometimes, it is useful to be able to disable specific extractors. See: #2044, #3234. This feature exists in a limited capacity in the Python API, but there is no CLI option for it at the moment. If we are smart about the implementation, a single option should be able to address both linked requests as well as (4b)

Objectives

Ticked points have been implemented in this PR; the rest are left for future improvements

Embed extraction framework
- a) The framework must ensure all the code and tests for embed-detection can be defined in the extractor
- b) Most extractors only need to match one or more regexes. The framework should make it easy to avoid boilerplate for these
- c) Detect multiple embeds in the same page
- d) Most extractors do not need to be initialized to perform embed checking. But more complex cases like (Motivation 4b) must have access to an instance. The framework should support both cases.
- e) Automatically detect iframe/script matching the VALID_URL without having each extractor define it
- f) Actually implement the framework for all existing extractors
- g) Actually move all embed tests to the relevant extractors
Generalization of all extraction methods
- a) Return all results from multiple extraction methods
- b) Separate each step into its own submodule
General webpage-based extraction
- a) Extractor must be able to "claim" a page. ie, no other embeds should be returned from it
- b) Extractors without _VALID_URL that are only used for webpage-based extraction
- c) Option to enable/disable extractors
- d) Extractors that need to be explicitly enabled
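The embed-related objectives (regex-only extractors, no instance required, results from several extractors on one page) can be pictured with the toy sketch below. `_EMBED_REGEX` is the attribute name used in this PR, but `BaseExtractor`, `extract_embed_urls`, `ExampleIE`, `OtherIE` and `detect_all` are hypothetical names invented for illustration; the real framework lives in yt_dlp/extractor/common.py and differs in detail.

```python
import re


class BaseExtractor:
    """Toy base class; the logic is illustrative only."""
    _EMBED_REGEX = []  # each pattern must define a named group `url`

    @classmethod
    def extract_embed_urls(cls, webpage):
        # A classmethod: no extractor instance is needed for plain regex matching
        for pattern in cls._EMBED_REGEX:
            for mobj in re.finditer(pattern, webpage):
                yield mobj.group('url')


class ExampleIE(BaseExtractor):
    # Hypothetical site; the only per-extractor code is the regex list
    _EMBED_REGEX = [r'<iframe[^>]+src=["\'](?P<url>https?://player\.example\.com/v/\w+)']


class OtherIE(BaseExtractor):
    # Second hypothetical site, detected through a data attribute
    _EMBED_REGEX = [r'data-other-video=["\'](?P<url>https?://other\.example/\w+)']


def detect_all(webpage, extractors=(ExampleIE, OtherIE)):
    """Collect (extractor name, url) pairs from every extractor, not just the first hit."""
    return [(ie.__name__, url)
            for ie in extractors
            for url in ie.extract_embed_urls(webpage)]
```

An extractor with more complex needs could override the method and use instance state instead; the sketch only covers the common regex-only case.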

Remaining Issues

Extractor selection

There are quite a few decisions that need to be made concerning this. For this reason, this PR is not going to attempt to address it at all. Further discussion on this should happen in #3234

Smarter Embeds

When this idea was proposed to the youtube-dl devs in the past (ytdl-org/youtube-dl#6216), there was concern that it would cause too many false positives. The issue needs to be studied further to determine how much of a practical problem this is and what can be done to alleviate it.

The framework has been designed in a way that makes it trivial to actually implement this once a decision is made (set _EMBED_REGEX as a classproperty in common.py)
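One way to read "set _EMBED_REGEX as a classproperty" is sketched below: a small descriptor derives a default embed pattern from each extractor's `_VALID_URL`, while an explicitly defined `_EMBED_REGEX` still takes precedence. The `classproperty` descriptor here is a minimal stand-in written for this example, the iframe/script wrapper pattern is an assumption, and the class names are hypothetical. Since no decision has been made upstream, treat this purely as a sketch.

```python
import re


class classproperty:
    """Minimal descriptor that computes an attribute from the class itself."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        return self.func(owner)


class BaseExtractor:
    _VALID_URL = None

    @classproperty
    def _EMBED_REGEX(cls):
        # Assumption: _VALID_URL is a plain (non-verbose) pattern with no group named `url`
        if not cls._VALID_URL:
            return []
        return [rf'<(?:iframe|script)[^>]+\bsrc=["\'](?P<url>{cls._VALID_URL})']


class ExampleIE(BaseExtractor):
    # Hypothetical extractor that defines only _VALID_URL
    _VALID_URL = r'https?://(?:www\.)?example\.com/embed/(?P<id>\w+)'


class CustomIE(BaseExtractor):
    # Hypothetical extractor whose explicit _EMBED_REGEX overrides the derived default
    _VALID_URL = r'https?://custom\.example/(?P<id>\d+)'
    _EMBED_REGEX = [r'data-custom=["\'](?P<url>[^"\']+)']
```

How permissive the derived pattern should be is exactly the false-positive question raised above, so the wrapper regex is the part most likely to need tuning.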

Pages with multiple videos

I assume the latter half of the GenericIE (everything after embeds) is defined in its current order to avoid false positives. Generalizing and returning all such results may not be desirable. So for the time being, only the extractor embeds and HTML5 media (closes #4291) are returned in this fashion. If these are not detected, further extraction happens in sequence, returning only the first match. This can be trivially generalized to the other methods in future if/when the need arises

Update: Camtasia video and KWS Player have also been migrated now
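The two-stage behaviour described in this section can be sketched as follows: results from the "collect everything" methods are merged, and only if they find nothing are the remaining methods tried in order, stopping at the first match. All function names and the toy detectors are hypothetical; the real GenericIE logic is considerably more involved.

```python
import re


def embeds(webpage):
    # Toy embed detector: any iframe src
    return re.findall(r'<iframe[^>]+src="([^"]+)"', webpage)


def html5(webpage):
    # Toy HTML5 detector: any <video> src
    return re.findall(r'<video[^>]+src="([^"]+)"', webpage)


def open_graph(webpage):
    # Toy fallback: Open Graph video meta tags
    return re.findall(r'<meta property="og:video" content="([^"]+)"', webpage)


def extract_page(webpage, collect_all, fallbacks):
    """Return every result from `collect_all`; otherwise the first hit from `fallbacks`."""
    results = [item for method in collect_all for item in method(webpage)]
    if results:
        return results
    for method in fallbacks:  # order matters: earlier methods take priority
        found = method(webpage)
        if found:
            return found[:1]
    return []
```

Moving a method from `fallbacks` to `collect_all` is all it takes to generalize it in this sketch, which mirrors the "trivially generalized" remark above.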

Test framework

The current testing framework does not have the necessary flexibility to test just the webpage extraction and skip further processing. Until this is improved, we can only test the full extraction as a single unit

Actual migration and Testing

Implementing the framework does not actually produce any real-world benefits till the extractors are actually migrated to use the new scheme. So it is imperative that we migrate most if not all the extractors as soon as possible.

At first glance, even though tedious, this appears to be simple - and this is mostly true. The issue is that all the embed tests are currently in GenericIE, listed in no particular order, and a lot of the links are dead. This makes migrating the tests quite a difficult and time-consuming process. It doesn't help that new embed tests are harder to find than new extractor tests.

For this reason, I have skipped tests migration in this PR. This sadly means that the changes to the embed code are not fully regression tested. Any help to migrate/add embed tests, either fully or partially, is welcome

Update: I have now checked that all of GenericIE's tests pass to the same extent that they did previously. It clearly doesn't account for all the embed detections, but I guess it's better than nothing...

Supersedes #12 which was an earlier attempt to implement a lot of the same ideas

coletdjnz

~~for tests we can probably at least migrate the ones where we set add_ie, even if they don't work~~

yt_dlp/extractor/common.py

coletdjnz · 2022-07-10T21:34:41Z


+        pass
+
+    @classmethod
+    def _extract_url(cls, webpage):  # TODO: Remove


no need, there was never an official API?

Some extractors call this. It can be removed once all these are rewritten

yt_dlp • extractor\breakcom.py:
  44: youtube_url = YoutubeIE._extract_url(webpage)
yt_dlp • extractor\brightcove.py:
  404: def _extract_url(ie, webpage):
yt_dlp • extractor\carambatv.py:
  82: videomore_url = VideomoreIE._extract_url(webpage)
yt_dlp • extractor\cbslocal.py:
  98: sendtonews_url = SendtoNewsIE._extract_url(webpage)
yt_dlp • extractor\chilloutzone.py:
  71: youtube_url = YoutubeIE._extract_url(webpage)
yt_dlp • extractor\cracked.py:
  43: youtube_url = YoutubeIE._extract_url(webpage)
yt_dlp • extractor\cspan.py:
  90: ustream_url = UstreamIE._extract_url(webpage)
  166: senate_isvp_url = SenateISVPIE._extract_url(webpage)
yt_dlp • extractor\ctsnews.py:
  63: youtube_url = YoutubeIE._extract_url(page)
yt_dlp • extractor\footyroom.py:
  48: streamable_url = StreamableIE._extract_url(payload)
yt_dlp • extractor\gameinformer.py:
  13: # normal Brightcove embed code extracted with BrightcoveNewIE._extract_url
  45: brightcove_url = self.BRIGHTCOVE_URL_TEMPLATE % brightcove_id if brightcove_id else BrightcoveNewIE._extract_url(self, webpage)
yt_dlp • extractor\gdcvault.py:
  175: embed_url = KalturaIE._extract_url(start_page)
yt_dlp • extractor\heise.py:
  114: kaltura_url = KalturaIE._extract_url(webpage)
yt_dlp • extractor\meta.py:
  68: pladform_url = PladformIE._extract_url(webpage)
yt_dlp • extractor\nbc.py:
  250: NBCSportsVPlayerIE._extract_url(webpage), 'NBCSportsVPlayer')
yt_dlp • extractor\nexx.py:
  527: return self.url_result(NexxIE._extract_url(webpage), ie=NexxIE.ie_key())
yt_dlp • extractor\normalboots.py:
  39: jwplatform_url = JWPlatformIE._extract_url(webpage)
yt_dlp • extractor\nowness.py:
  28: bc_url = BrightcoveNewIE._extract_url(self, player_code)
yt_dlp • extractor\nzherald.py:
  68: bc_url = BrightcoveNewIE._extract_url(self, webpage)
yt_dlp • extractor\rcs.py:
  243: emb = RCSEmbedsIE._extract_url(page)
yt_dlp • extractor\ukcolumn.py:
  59: ie, video_url = YoutubeIE, YoutubeIE._extract_url(oembed_webpage)
  61: ie, video_url = VimeoIE, VimeoIE._extract_url(url, oembed_webpage)
yt_dlp • extractor\vesti.py:
  114: rutv_url = RUTVIE._extract_url(page)
yt_dlp • extractor\vice.py:
  304: vice_url = ViceIE._extract_url(body)
  314: youtube_url = YoutubeIE._extract_url(body)
yt_dlp • extractor\vimeo.py:
  844: url = self._extract_url(url, self._download_webpage(url, video_id))
yt_dlp • extractor\vk.py:
  390: youtube_url = YoutubeIE._extract_url(info_page)
  394: vimeo_url = VimeoIE._extract_url(url, info_page)
  398: pladform_url = PladformIE._extract_url(info_page)
  413: odnoklassniki_url = OdnoklassnikiIE._extract_url(info_page)

but these are extractor-specific APIs, not an official API?

Yes, which is why I said "It can be removed once all these are rewritten". It is not being kept for backward compatibility, but only to keep everything working until these function calls are updated

yt_dlp/extractor/generic.py

coletdjnz · 2022-07-11T22:29:23Z

test/test_download.py


    return test_template


-for name, num_tests in tests_counter.items():
-    test_method = batch_generator(name, num_tests)
+for name in tests_counter:


should we add a test_[ie]_webpage_all?

Not sure. The current implementation is to have test_(ie)_all run both the normal and the webpage tests. I don't personally use the _all tests much (I use pytest's pattern matching instead), so I'm not sure whether others would find it more helpful to have them unified or split

yt_dlp/extractor/generic.py

pukkandan · 2022-07-13T10:02:05Z

The failing tests are due to Generic returning a playlist. The tests need to be modified

coletdjnz · 2022-07-14T23:08:22Z

Some current issues:

Core test cases need updating for the change to Generic returning a playlist
If an extractor inherits from another extractor that has webpage test cases, the test cases will only run on the last extractor that the test generation calls. _EMBED_REGEX is not inherited, so the test cases may not even work (see VimeoIE).
Different webpage extractors extract the same thing but with different metadata (~~HTML5 and Substack~~, HTML5 and Gfycat)
The HTML5 extractor extracts invalid URLs (https://www.lactv.it/2021/10/03/lac-news24-la-settimana-03-10-2021/)
Webpage tests where the id of the generic playlist is the same as that of one of its entries are broken (video.js, jwplayer extractors)

pukkandan force-pushed the features/generic branch from 7f838db to e2e365a Compare July 8, 2022 16:03

pukkandan added the enhancement New feature or request label Jul 8, 2022

pukkandan force-pushed the features/generic branch from e2e365a to 511cd8c Compare July 8, 2022 16:07

coletdjnz self-requested a review July 8, 2022 18:53

pukkandan mentioned this pull request Jul 9, 2022

MediaWiki HTML5 generic video support #4291

Closed

8 tasks

pukkandan force-pushed the features/generic branch from ec80e5b to d204c9d Compare July 9, 2022 05:30

coletdjnz reviewed Jul 10, 2022

pukkandan force-pushed the features/generic branch 2 times, most recently from 33db985 to eac6aa1 Compare July 10, 2022 23:02

coletdjnz reviewed Jul 11, 2022

coletdjnz reviewed Jul 12, 2022

pukkandan commented Jul 13, 2022

coletdjnz reviewed Jul 15, 2022

pukkandan force-pushed the features/generic branch from e47744d to 6bd3411 Compare August 1, 2022 00:31

pukkandan mentioned this pull request Aug 1, 2022

Webpage-based extraction - Part 2 #4517

Draft

3 tasks

pukkandan force-pushed the features/generic branch 3 times, most recently from 20ffb23 to 2255c7c Compare August 1, 2022 01:24

pukkandan added 6 commits August 2, 2022 00:59

[extractor] Framework for embed detection (yt-dlp#4307)

de3cd2f

[extractor, test] Basic framework for embed tests (yt-dlp#4307)

6cbae36

and split download tests so they can be more easily run in CI Authored by: coletdjnz

[extractor/camtasia] Separate into own extractor (yt-dlp#4307)

cc23816

Authored by: coletdjnz

[extractor/html5] Separate into own extractor (yt-dlp#4307)

34aa9e8

Closes yt-dlp#4291 Authored by: coletdjnz, pukkandan

[extractor] Support multiple archive ids for one video (yt-dlp#4307)

1367e00

Closes yt-dlp#4352

[extractors] Use new framework for existing embeds (yt-dlp#4307)

d96fa83

`Brightcove` is difficult to migrate because it's subclasses may depend on the signature of the current functions. So it is left as-is for now Note: Tests have not been migrated

pukkandan force-pushed the features/generic branch from 2255c7c to d96fa83 Compare August 1, 2022 19:35

pukkandan merged commit bfd973e into yt-dlp:master Aug 1, 2022

pukkandan added a commit that referenced this pull request Aug 1, 2022

[extractor] Framework for embed detection (#4307)

8f97a15

pukkandan added a commit that referenced this pull request Aug 1, 2022

[extractor, test] Basic framework for embed tests (#4307)

f2e8dbc

and split download tests so they can be more easily run in CI Authored by: coletdjnz

pukkandan added a commit that referenced this pull request Aug 1, 2022

[extractor/camtasia] Separate into own extractor (#4307)

5fff2e5

Authored by: coletdjnz

pukkandan added a commit that referenced this pull request Aug 1, 2022

[extractor/html5] Separate into own extractor (#4307)

f14a2d8

Closes #4291 Authored by: coletdjnz, pukkandan

pukkandan added a commit that referenced this pull request Aug 1, 2022

[extractor] Support multiple archive ids for one video (#4307)

1e8fe57

Closes #4352

pukkandan added a commit that referenced this pull request Aug 24, 2022

Add option --use-extractors

fe7866d

Deprecates `--force-generic-extractor` Closes #3234, Closes #2044 Related: #4307, #1791

dirkf mentioned this pull request Jan 23, 2023

[VideoCdn] Add new extractor ytdl-org/youtube-dl#31481

Open

11 tasks
