Generalized framework for webpage-based extraction #4307

Merged
merged 6 commits into from Aug 1, 2022

@pukkandan pukkandan commented Jul 8, 2022

Motivation

1. Un-bloating GenericIE

These are the tasks that our current Generic extractor performs (in order):

Fix protocol
default_search
Redirect handling
direct video link
                    <-- The webpage is downloaded here
M3U playlist
RSS feed
ISM manifest
SMIL file
XSPF playlist
DASH manifest
F4M manifest
Camtasia video
Embed detection     <-- Manually added by each extractor
HTML5 media
JW Player playlist
JW Player data
video.js embed
JSON LD
JW Player in SFWObject => flashvars:
JW Player embed => jw_plugins|jwplayer
KWS Player
video file => (file|source)=
JW Player JS loader => (file|video_url):
Flow Player
Cinerama player
Twitter card
Open Graph video info
twitter:player iframe

Having all these in a single method makes the extractor hard to reason about and maintain. It makes sense for everything after the webpage download to be separated out into its own module. This is especially true for embed detection, since each extractor currently has to implement it and its related tests in GenericIE.

2. Pages with multiple videos

Only the first matching result is currently returned from the GenericIE. So if a webpage contains embeds from multiple websites, only one of them will be extracted. Example: #4291

3. Smarter embed detection

Most extractors' embed detection code boils down to looking for an iframe/embed that matches its _VALID_URL. Often extractor authors don't add this code, and so embeds of the respective site are not detected despite us having all the necessary code to do so. Eg: #80, blackjack4494/yt-dlc#204

4. Webpage based extractor matching

The current method of extractor matching using only the URL has its limitations

5. Extractor selection

Sometimes, it is useful to be able to disable specific extractors. See: #2044, #3234. This feature exists in limited capacity in the Python API, but there is no CLI option for this atm. If we are smart about the implementation, a single option should be able to address both linked requests as well as (4b)

Objectives

Ticked points have been implemented in this PR; the rest are left for future improvements

  1. Embed extraction framework
    • a) The framework must ensure all the code and tests for embed-detection can be defined in the extractor
    • b) Most extractors only need to match one or more regexes. The framework should make it easy to avoid boilerplate for these
    • c) Detect multiple embeds in the same page
    • d) Most extractors do not need to be initialized to perform embed checking. But more complex cases like (Moti-4b) must have access to the instance. The framework should support both cases.
    • e) Automatically detect iframe/script matching the VALID_URL without having each extractor define it
    • f) Actually implement the framework for all existing extractors
    • g) Actually move all embed tests to the relevant extractors
  2. Generalization of all extraction methods
    • a) Return all results from multiple extraction methods
    • b) Separate each step into its own submodule
  3. General webpage-based extraction
    • a) Extractor must be able to "claim" a page. ie, no other embeds should be returned from it
    • b) Extractors without _VALID_URL that are only used for webpage-based extraction
    • c) Option to enable/disable extractors
    • d) Extractors that need to be explicitly enabled
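
As a rough illustration of objectives 1a to 1c, the sketch below shows one way an extractor could declare its embed regexes as class data while a shared helper collects every match on a page. The base class, the `extract_embed_urls` and `find_all_embeds` helpers, and the `exampletube.invalid` / `examplecloud.invalid` sites are all hypothetical; only the `_EMBED_REGEX` name comes from this PR, and this is not the code that was merged.

```python
import re


class BaseExtractor:
    # Hypothetical stand-in for the real base class in common.py.
    # Each regex is assumed to expose the embed URL in a group named "url".
    _EMBED_REGEX = []

    @classmethod
    def extract_embed_urls(cls, webpage):
        """Yield every URL matched by any of this extractor's embed regexes."""
        for pattern in cls._EMBED_REGEX:
            for mobj in re.finditer(pattern, webpage):
                yield mobj.group('url')


class ExampleTubeIE(BaseExtractor):
    # Made-up site: no boilerplate beyond the regex itself
    _EMBED_REGEX = [r'<iframe[^>]+src=["\'](?P<url>https?://exampletube\.invalid/embed/\w+)']


class ExampleCloudIE(BaseExtractor):
    _EMBED_REGEX = [r'<script[^>]+src=["\'](?P<url>https?://examplecloud\.invalid/player/\w+\.js)']


def find_all_embeds(webpage, extractors):
    """Collect (extractor name, url) pairs from every extractor, not just the first hit."""
    return [(ie.__name__, url) for ie in extractors for url in ie.extract_embed_urls(webpage)]
```

Because the helper is a classmethod, no extractor instance is needed for the simple regex case; an extractor that needs instance state would have to override it in some other way, which this sketch does not cover.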

Remaining Issues

Extractor selection

There are quite a few decisions that need to be made concerning this. For this reason, this PR does not attempt to address it at all. Further discussion on this should happen in #3234

Smarter Embeds

When this idea was proposed to youtube-dl devs in the past (ytdl-org/youtube-dl#6216), there was concern that it would cause too many false positives. The issue needs to be studied further to determine how much of a practical problem this is and what can be done to alleviate it.

The framework has been designed in a way that makes it trivial to actually implement this once a decision is made (set _EMBED_REGEX as a classproperty in common.py)
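
To make the idea concrete, here is a minimal sketch of detecting embeds from `_VALID_URL` alone: every iframe/script `src` on the page is tested against each extractor's URL pattern. The extractor classes, the sites, and the `detect_embeds_by_valid_url` helper are invented for illustration, and the approach shown (scanning `src` attributes) is an assumption; it is not necessarily how a `_EMBED_REGEX` classproperty would be derived.

```python
import re


# Hypothetical extractors that define only _VALID_URL and no embed regex
class ExampleTubeIE:
    _VALID_URL = r'https?://exampletube\.invalid/(?:watch|embed)/(?P<id>\w+)'


class ExampleCloudIE:
    _VALID_URL = r'https?://examplecloud\.invalid/tracks/(?P<id>\d+)'


# Matches the quoted src attribute of any <iframe> or <script> tag
_SRC_RE = re.compile(r'<(?:iframe|script)\b[^>]*?\bsrc=(["\'])(?P<url>.+?)\1')


def detect_embeds_by_valid_url(webpage, extractors):
    """Return (extractor name, url) for each src that matches some _VALID_URL."""
    found = []
    for mobj in _SRC_RE.finditer(webpage):
        url = mobj.group('url')
        for ie in extractors:
            if re.match(ie._VALID_URL, url):
                found.append((ie.__name__, url))
    return found
```

A broad `_VALID_URL` would match many unrelated iframes under this scheme, which is the false-positive concern raised above.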

Pages with multiple videos

I assume the latter half of the GenericIE (everything after embeds) is defined in its current order to avoid false positives. Generalizing and returning all such results may not be desirable. So for the time being, only the extractor embeds and HTML5 media (closes #4291) are returned in this fashion. If these are not detected, further extraction happens in sequence, returning only the first match. This can be trivially generalized to the other methods in future if/when the need arises

Update: Camtasia video and KWS Player have also been migrated now
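
The two-stage behaviour described here can be sketched roughly as follows: results from the "return everything" group are merged, and only if that group finds nothing are the remaining methods tried in order, stopping at the first hit. All function names and regexes below are simplified placeholders, not the detection code in generic.py.

```python
import re


def extractor_embeds(webpage):
    # Placeholder for per-extractor embed detection
    return re.findall(r'<iframe[^>]+src="([^"]+)"', webpage)


def html5_media(webpage):
    return re.findall(r'<video[^>]+src="([^"]+)"', webpage)


def open_graph_video(webpage):
    return re.findall(r'<meta[^>]+property="og:video"[^>]+content="([^"]+)"', webpage)


def twitter_player(webpage):
    return re.findall(r'<meta[^>]+name="twitter:player"[^>]+content="([^"]+)"', webpage)


def extract_from_webpage(webpage, collect_all_methods, first_match_methods):
    """Merge results of the first group; otherwise fall back to the ordered
    methods and return only what the first successful one finds."""
    results = [r for method in collect_all_methods for r in method(webpage)]
    if results:
        return results
    for method in first_match_methods:
        found = list(method(webpage))
        if found:
            return found
    return []
```

Moving a method from the second list to the first is all it would take to generalize it in this sketch, which mirrors the "trivially generalized" remark above.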

Test framework

The current testing framework does not have the necessary flexibility to test just the webpage extraction and skip further processing. Until this is improved, we can only test the full extraction as a single unit

Actual migration and Testing

Implementing the framework does not actually produce any real-world benefits till the extractors are actually migrated to use the new scheme. So it is imperative that we migrate most if not all the extractors as soon as possible.

At first glance, even though tedious, this appears to be simple - and this is mostly true. The issue is that all the embed tests are currently in GenericIE, listed in no particular order, and a lot of the links are dead. This makes migrating the tests quite a difficult and time-consuming process. It doesn't help that it is harder to find new embed tests than to find new extractor tests.

For this reason, I have skipped test migration in this PR. This sadly means that the changes to the embed code are not fully regression tested. Any help to migrate/add embed tests, either fully or partially, is welcome

Update: I have now checked that all the GenericIE's tests pass to the same extent that they did previously. It clearly doesn't account for all the embed detections, but ig it's better than nothing...


Supersedes #12, which was an earlier attempt to implement a lot of the same ideas

Member

@coletdjnz coletdjnz left a comment

for tests we can probably at least migrate the ones where we set add_ie, even if they don't work

yt_dlp/extractor/common.py
pass

@classmethod
def _extract_url(cls, webpage): # TODO: Remove
Member

no need, there was never an official API?

Member Author

Some extractors call this. It can be removed once all these are rewritten

yt_dlp • extractor\breakcom.py:
  44:         youtube_url = YoutubeIE._extract_url(webpage)

yt_dlp • extractor\brightcove.py:
  404:     def _extract_url(ie, webpage):

yt_dlp • extractor\carambatv.py:
  82:         videomore_url = VideomoreIE._extract_url(webpage)

yt_dlp • extractor\cbslocal.py:
  98:         sendtonews_url = SendtoNewsIE._extract_url(webpage)

yt_dlp • extractor\chilloutzone.py:
  71:             youtube_url = YoutubeIE._extract_url(webpage)

yt_dlp • extractor\cracked.py:
  43:         youtube_url = YoutubeIE._extract_url(webpage)

yt_dlp • extractor\cspan.py:
   90:         ustream_url = UstreamIE._extract_url(webpage)
  166:                 senate_isvp_url = SenateISVPIE._extract_url(webpage)

yt_dlp • extractor\ctsnews.py:
  63:             youtube_url = YoutubeIE._extract_url(page)

yt_dlp • extractor\footyroom.py:
  48:             streamable_url = StreamableIE._extract_url(payload)

yt_dlp • extractor\gameinformer.py:
  13:         # normal Brightcove embed code extracted with BrightcoveNewIE._extract_url
  45:         brightcove_url = self.BRIGHTCOVE_URL_TEMPLATE % brightcove_id if brightcove_id else BrightcoveNewIE._extract_url(self, webpage)

yt_dlp • extractor\gdcvault.py:
  175:         embed_url = KalturaIE._extract_url(start_page)

yt_dlp • extractor\heise.py:
  114:         kaltura_url = KalturaIE._extract_url(webpage)

yt_dlp • extractor\meta.py:
  68:         pladform_url = PladformIE._extract_url(webpage)

yt_dlp • extractor\nbc.py:
  250:             NBCSportsVPlayerIE._extract_url(webpage), 'NBCSportsVPlayer')

yt_dlp • extractor\nexx.py:
  527:         return self.url_result(NexxIE._extract_url(webpage), ie=NexxIE.ie_key())

yt_dlp • extractor\normalboots.py:
  39:         jwplatform_url = JWPlatformIE._extract_url(webpage)

yt_dlp • extractor\nowness.py:
  28:                         bc_url = BrightcoveNewIE._extract_url(self, player_code)

yt_dlp • extractor\nzherald.py:
  68:         bc_url = BrightcoveNewIE._extract_url(self, webpage)

yt_dlp • extractor\rcs.py:
  243:                 emb = RCSEmbedsIE._extract_url(page)

yt_dlp • extractor\ukcolumn.py:
  59:         ie, video_url = YoutubeIE, YoutubeIE._extract_url(oembed_webpage)
  61:             ie, video_url = VimeoIE, VimeoIE._extract_url(url, oembed_webpage)

yt_dlp • extractor\vesti.py:
  114:         rutv_url = RUTVIE._extract_url(page)

yt_dlp • extractor\vice.py:
  304:         vice_url = ViceIE._extract_url(body)
  314:         youtube_url = YoutubeIE._extract_url(body)

yt_dlp • extractor\vimeo.py:
  844:             url = self._extract_url(url, self._download_webpage(url, video_id))

yt_dlp • extractor\vk.py:
  390:         youtube_url = YoutubeIE._extract_url(info_page)
  394:         vimeo_url = VimeoIE._extract_url(url, info_page)
  398:         pladform_url = PladformIE._extract_url(info_page)
  413:         odnoklassniki_url = OdnoklassnikiIE._extract_url(info_page)

Member

but these are extractor-specific apis, not an official api?

Member Author

Yes, which is why I said "It can be removed once all these are rewritten". It is not being kept for backward compat but only to keep everything working till these function calls are updated
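
A transitional shim of this kind might look like the sketch below, where the old single-URL `_extract_url` simply returns the first result of a newer multi-result method. The `_extract_embed_urls` name and its one-argument signature are assumptions made for this example, as are the base class and the `exampletube.invalid` site; the real signatures in common.py may differ.

```python
import re


class BaseExtractor:
    # Hypothetical base class for illustration only
    _EMBED_REGEX = []

    @classmethod
    def _extract_embed_urls(cls, webpage):
        """New-style API (name assumed): yield every embedded URL found."""
        for pattern in cls._EMBED_REGEX:
            for mobj in re.finditer(pattern, webpage):
                yield mobj.group('url')

    @classmethod
    def _extract_url(cls, webpage):
        """Old-style API kept only until callers are rewritten:
        return the first embedded URL, or None if there is none."""
        return next(cls._extract_embed_urls(webpage), None)


class ExampleTubeIE(BaseExtractor):
    _EMBED_REGEX = [r'<iframe[^>]+src="(?P<url>https?://exampletube\.invalid/embed/\w+)"']
```

Callers such as `youtube_url = YoutubeIE._extract_url(webpage)` would keep working unchanged under this arrangement, and the shim can be deleted once no call sites remain.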

@pukkandan pukkandan force-pushed the features/generic branch 2 times, most recently from 33db985 to eac6aa1 Compare July 10, 2022 23:02

return test_template


for name, num_tests in tests_counter.items():
test_method = batch_generator(name, num_tests)
for name in tests_counter:
Member

should we add a test_[ie]_webpage_all?

Member Author

Not sure. The current implementation is to have test_(ie)_all run both normal and webpage tests. I don't personally use the _all tests much (instead using pytest's pattern matching), so I'm not sure whether others would find it more helpful to have it unified or split

@pukkandan
Member Author

The failing tests are due to Generic returning a playlist. The tests need to be modified

@coletdjnz
Member

coletdjnz commented Jul 14, 2022

Some current issues:

  • core test cases need updating for change in generic returning playlist
  • If an extractor inherits another extractor with webpage test cases, the test cases will only run on the last extractor the test generation calls. _EMBED_REGEX is not inherited so the test cases may not even work (see VimeoIE).
  • Different webpage extractors extracting the same thing but with different metadata (HTML5 and Substack, HTML5 and Gfycat)
  • HTML5 extractor extracting invalid urls (https://www.lactv.it/2021/10/03/lac-news24-la-settimana-03-10-2021/)
  • Webpage tests where the id for the generic playlist is the same as one of its entries are broken (video.js, jwplayer extractors)

and split download tests so they can be more easily run in CI

Authored by: coletdjnz
`Brightcove` is difficult to migrate because its subclasses may depend
on the signature of the current functions. So it is left as-is for now

Note: Tests have not been migrated
@pukkandan pukkandan merged commit bfd973e into yt-dlp:master Aug 1, 2022
pukkandan added a commit that referenced this pull request Aug 1, 2022
and split download tests so they can be more easily run in CI

Authored by: coletdjnz
pukkandan added a commit that referenced this pull request Aug 1, 2022
pukkandan added a commit that referenced this pull request Aug 1, 2022
Closes #4291

Authored by: coletdjnz, pukkandan
pukkandan added a commit that referenced this pull request Aug 24, 2022
Deprecates `--force-generic-extractor`

Closes #3234, Closes #2044

Related: #4307, #1791
Labels
enhancement New feature or request
Projects
Status: Self hosted extraction
Development

Successfully merging this pull request may close these issues.

MediaWiki HTML5 generic video support
2 participants