New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalized framework for webpage-based extraction #4307
Conversation
7f838db
to
e2e365a
Compare
e2e365a
to
511cd8c
Compare
ec80e5b
to
d204c9d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for tests we can probably at least migrate the ones where we set add_ie
, even if they don't work
pass | ||
|
||
@classmethod | ||
def _extract_url(cls, webpage): # TODO: Remove |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need, there was never an official API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some extractors call this. It can be removed once all these are rewritten
yt_dlp • extractor\breakcom.py:
44: youtube_url = YoutubeIE._extract_url(webpage)
yt_dlp • extractor\brightcove.py:
404: def _extract_url(ie, webpage):
yt_dlp • extractor\carambatv.py:
82: videomore_url = VideomoreIE._extract_url(webpage)
yt_dlp • extractor\cbslocal.py:
98: sendtonews_url = SendtoNewsIE._extract_url(webpage)
yt_dlp • extractor\chilloutzone.py:
71: youtube_url = YoutubeIE._extract_url(webpage)
yt_dlp • extractor\cracked.py:
43: youtube_url = YoutubeIE._extract_url(webpage)
yt_dlp • extractor\cspan.py:
90: ustream_url = UstreamIE._extract_url(webpage)
166: senate_isvp_url = SenateISVPIE._extract_url(webpage)
yt_dlp • extractor\ctsnews.py:
63: youtube_url = YoutubeIE._extract_url(page)
yt_dlp • extractor\footyroom.py:
48: streamable_url = StreamableIE._extract_url(payload)
yt_dlp • extractor\gameinformer.py:
13: # normal Brightcove embed code extracted with BrightcoveNewIE._extract_url
45: brightcove_url = self.BRIGHTCOVE_URL_TEMPLATE % brightcove_id if brightcove_id else BrightcoveNewIE._extract_url(self, webpage)
yt_dlp • extractor\gdcvault.py:
175: embed_url = KalturaIE._extract_url(start_page)
yt_dlp • extractor\heise.py:
114: kaltura_url = KalturaIE._extract_url(webpage)
yt_dlp • extractor\meta.py:
68: pladform_url = PladformIE._extract_url(webpage)
yt_dlp • extractor\nbc.py:
250: NBCSportsVPlayerIE._extract_url(webpage), 'NBCSportsVPlayer')
yt_dlp • extractor\nexx.py:
527: return self.url_result(NexxIE._extract_url(webpage), ie=NexxIE.ie_key())
yt_dlp • extractor\normalboots.py:
39: jwplatform_url = JWPlatformIE._extract_url(webpage)
yt_dlp • extractor\nowness.py:
28: bc_url = BrightcoveNewIE._extract_url(self, player_code)
yt_dlp • extractor\nzherald.py:
68: bc_url = BrightcoveNewIE._extract_url(self, webpage)
yt_dlp • extractor\rcs.py:
243: emb = RCSEmbedsIE._extract_url(page)
yt_dlp • extractor\ukcolumn.py:
59: ie, video_url = YoutubeIE, YoutubeIE._extract_url(oembed_webpage)
61: ie, video_url = VimeoIE, VimeoIE._extract_url(url, oembed_webpage)
yt_dlp • extractor\vesti.py:
114: rutv_url = RUTVIE._extract_url(page)
yt_dlp • extractor\vice.py:
304: vice_url = ViceIE._extract_url(body)
314: youtube_url = YoutubeIE._extract_url(body)
yt_dlp • extractor\vimeo.py:
844: url = self._extract_url(url, self._download_webpage(url, video_id))
yt_dlp • extractor\vk.py:
390: youtube_url = YoutubeIE._extract_url(info_page)
394: vimeo_url = VimeoIE._extract_url(url, info_page)
398: pladform_url = PladformIE._extract_url(info_page)
413: odnoklassniki_url = OdnoklassnikiIE._extract_url(info_page)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but these are extractor-specific apis, not an official api?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, which is why I said "It can be removed once all these are rewritten". It is not being kept for backward compat but only to keep everything working till these function calls are updated
33db985
to
eac6aa1
Compare
|
||
return test_template | ||
|
||
|
||
for name, num_tests in tests_counter.items(): | ||
test_method = batch_generator(name, num_tests) | ||
for name in tests_counter: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add a test_[ie]_webpage_all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. The current implementation is to have test_(ie)_all
to run both normal and webpage tests. I don't personally use the _all
tests much (instead using pytest's pattern matching), so not sure whether others would find more helpful to have it unified or split
The failing tests are due to Generic returning a playlist. The tests need to be modified |
Some current issues:
|
e47744d
to
6bd3411
Compare
20ffb23
to
2255c7c
Compare
and split download tests so they can be more easily run in CI Authored by: coletdjnz
Authored by: coletdjnz
Closes yt-dlp#4291 Authored by: coletdjnz, pukkandan
`Brightcove` is difficult to migrate because it's subclasses may depend on the signature of the current functions. So it is left as-is for now Note: Tests have not been migrated
2255c7c
to
d96fa83
Compare
and split download tests so they can be more easily run in CI Authored by: coletdjnz
Closes #4291 Authored by: coletdjnz, pukkandan
Motivation
1. Un-bloating GenericIE
These are the tasks that our current Generic extractor performs (in order):
Having all these in a single method makes it hard to reason about and maintain the extractor. It makes sense for everything after the webpage download to be separated out into it's own module. This is especially true for embed detection since each extractor has to currently implement it and related tests in GenericIE.
2. Pages with multiple videos
Only the first matching result is currently returned from the GenericIE. So if a webpage contains embeds from multiple websites, only one of them will be extracted. Example: #4291
3. Smarter embed detection
Most extractors' embed detection code boils down to looking for an iframe/embed that matches it's _VALID_URL. Often extractor authors don't add this code and so embeds of the respective site is not detected despite us having all the necessary code to do so. Eg: #80, blackjack4494/yt-dlc#204
4. Webpage based extractor matching
The current method of extractor matching using only URL has it's limitations
a) There are cases where url + webpage is needed for detection. These are currently implemented in the same way as embeds (Eg: [Site Request] Update Invidious section of the YT extractor to use new link element on videos. #195), but when we solve (2) we must make sure not to return additional results for this
b) Sometimes additional network requests are needed to check for a match. See: Self-hosted extractors: Mastodon, PeerTube and Misskey (with haruhi-dl merge) #1791. If this is implemented into the normal flow of generic extraction, each unsupported URL will unnecessarily add these (potentially expensive) processing. So some sort of user-facing option is needed to control this.
5. Extractor selection
Sometimes, it is useful to be able to disable specific extractors. See: #2044, #3234. This feature exists in limited capacity in the Python API, but there is no CLI option for this atm. If we are smart about the implementation, a single option should be able to address both linked requests as well as (4b)
Objectives
Ticked points have been implemented in this PR, rest are left for future improvements
Remaining Issues
Extractor selection
There are quite a few decisions that needs to be made concerning this. For this reason, this PR is not going to even attempt to address this at all. Further discussion on this should happen in #3234
Smarter Embeds
When this idea was proposed to youtube-dl devs in the past (ytdl-org/youtube-dl#6216), there was concern of this causing too much false positives. The issue needs to be studied further to determine how much of a practical issue this is and what can be done to alleviate it.
The framework has been designed in a way that makes it trivial to actually implement this once a decision is made (set
_EMBED_REGEX
as aclassproperty
in common.py)Pages with multiple videos
I assume the latter half of the GenericIE (everything after embeds) is defined in its current order to avoid false positives. Generalizing and returning all such results may not be desirable. So for the time-being, only the extractor embeds and HTML5 media (closes #4291) are returned in this fashion. If these are not detected, further extraction happens in sequence, returning only the first match. This can be trivially generalized to the other methods in future if/when need arises
Update: Camtasia video and KWS Player have also been migrated now
Test framework
The current testing framework does not have the necessary flexibility to test just the webpage extraction and skip further processing. Until this is improved, we can only test the full extraction as a single unit
Actual migration and Testing
Implementing the framework does not actually produce any real-world benefits till the extractors are actually migrated to use the new scheme. So it is imperative that we migrate most if not all the extractors as soon as possible.
At first glance, even though tedious, this appears to be simple - and this is mostly true. The issue is that all the embed tests are currently in GenericIE, listed in no particular order, and a lot of the links are dead. This makes migrating the tests quite a difficult and time consuming process. It doesn't help that it is harder to find new embed tests than finding new extractor tests.
For this reason, I have skipped tests migration in this PR. This sadly means that the changes to the embed code are not fully regression tested. Any help to add migrate/add embed tests, either fully or partially, is welcome
Update: I have now checked that all the GenericIE's tests pass to the same extend that they were previously. It clearly doesn't account for all the embed detections, but ig it's better than nothing...
Superseeds #12 which was an earlier attempt to implement a lot of the same ideas