[general options] add --force-extractor #3234
Conversation
Force-pushed from e0e2f8d to 66b6652
This is something that has been requested (unofficially) before, but will need a few changes to this implementation:
I won't say we have to implement all of this now, but the interface must be designed accounting for these requirements
Slight clarification. Extractors are not initialized unless actually used. The regex compilation actually happens when the URL is checked against the extractor's _VALID_URL.
Out of curiosity, is this with or without lazy extractors?
Perhaps I'm misunderstanding, but based on looking at the profiler and some scattered prints, it seems to me that by default the input URL is checked against all extractors' _VALID_URL. That checking is what is expensive.
I believe it's with lazy extractors. The time is spent in
Also, thank you for taking a look. I'm unfamiliar with some of the things you mention (like instance checking). I'll take a look at solving the first few, though.
You can check
That is fine. I put all the related points here more for reference purposes. The last 2-3 points definitely don't need to be addressed in this PR; they are just further enhancements to this feature that would be helpful.
Also, it might be best if I do this part myself. But first we need to decide whether to use the name or the key (will using the name undermine the performance gain?)
Yes, confirming that lazy extractors are enabled. Also, another clarification to what I said:
The checking itself is not what shows up in the profile. It's the compilation of the regexes that is expensive.
Is there a map from name to key currently? If there is such a map, there will be no performance impact. I take back what I said about … Any other kind of processing, like looking up the key from the name, or doing a search through all of the names for your idea for
I'm completely fine with closing this if you want to take it over, just let me know!
No
No! I was talking about the documentation part in specific... (I can push changes directly to the PR if I need to)
The code in the lazy extractors is copied from yt-dlp/yt_dlp/extractor/common.py, lines 492 to 501 (at 8a7f68d).
Yes, but even if the … On a side note, would unpickling the compiled regex be faster than compiling it? If so, that data could be auto-generated at build time, similar to the lazy extractors.
Please revert the changes to the executable bit of the files.
Yes, you're right, I was just trying to be specific about what the profile sees - the compilation is expensive, the subsequent matching is cheap.
Interesting idea, I don't know. This StackOverflow answer suggests it might be too painful to be worth it: https://stackoverflow.com/a/65440/73632
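On the pickling question above, a standard-library sketch may help: compiled regex objects are picklable, but the pickle only stores the pattern string and flags, so unpickling calls the regex compiler again (modulo `re`'s internal cache). This suggests that shipping pre-pickled patterns would not actually skip the compilation cost. The example pattern below is illustrative, not one of yt-dlp's real `_VALID_URL`s.

```python
# Compiled patterns pickle as (pattern string, flags); loading the pickle
# re-runs compilation rather than restoring precompiled state, so caching
# pickled patterns would not avoid the compile cost.
import pickle
import re

pat = re.compile(r'https?://(?:www\.)?youtube\.com/watch\?v=(\w+)')

payload = pickle.dumps(pat)
# The serialized form contains the source pattern text, not compiled state:
assert pat.pattern.encode() in payload

restored = pickle.loads(payload)  # internally compiles the pattern again
assert restored.match('https://www.youtube.com/watch?v=abc123')
```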
I'm fine with any option name, please let me know your pick. Here are some other ideas:
The value of the option needs to be validated in options.py

hm,

It seems like we'd want to apply it to all redirects as well, though I'm not sure where the code for redirects is - is that the invocations in
All invocations of

```diff
diff --git a/yt_dlp/YoutubeDL.py b/yt_dlp/YoutubeDL.py
index b90173508..a3c865014 100755
--- a/yt_dlp/YoutubeDL.py
+++ b/yt_dlp/YoutubeDL.py
@@ -1378,7 +1378,10 @@ class YoutubeDL(object):
         else:
             ies = self._ies
 
+        allowed_extractors = self.params.get('force_extractor')
         for ie_key, ie in ies.items():
+            if allowed_extractors and ie_key not in allowed_extractors:
+                continue
             if not ie.suitable(url):
                 continue
```
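To make the effect of that filter concrete, here is a simplified, hypothetical sketch (the `Extractor` class and `match_url` helper are illustrative, not yt-dlp's real API): extractors compile their URL regex lazily, and an allow-list lets the loop skip an extractor before its regex is ever compiled.

```python
# Hypothetical sketch of the filtering logic in the diff above.
# Skipping an extractor via the allow-list means its regex is never compiled.
import re

class Extractor:
    def __init__(self, key, valid_url):
        self.key = key
        self._valid_url = valid_url
        self._compiled = None  # compiled lazily, only when suitable() runs

    def suitable(self, url):
        if self._compiled is None:
            self._compiled = re.compile(self._valid_url)
        return self._compiled.match(url) is not None

def match_url(url, extractors, allowed=None):
    """Return keys of extractors that accept `url`, honouring an allow-list."""
    matches = []
    for ie in extractors:
        if allowed and ie.key not in allowed:
            continue  # skipped before any regex compilation happens
        if ie.suitable(url):
            matches.append(ie.key)
    return matches

EXTRACTORS = [
    Extractor('Youtube', r'https?://(?:www\.)?youtube\.com/watch\?v=\w+'),
    Extractor('Vimeo', r'https?://(?:www\.)?vimeo\.com/\d+'),
]
```

With `allowed={'Youtube'}`, only the Youtube pattern is ever compiled; the Vimeo extractor stays untouched, which is where the startup saving comes from.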
Force-pushed from 66b6652 to 593f9f4
This commit adds support for a new command-line flag, --force-extractor, that takes as an argument the name of an extractor. This extractor is passed to extract_info as the ie_key, which causes the given extractor to be used directly without having to run __init__ on all supported extractors. This is important because each extractor typically needs to compile a regex, and all of that regex compilation adds up. In a profiler, this extractor initialization occupies 80% of startup time on my test machine (the rest is module import and checking ffmpeg versions). Passing --force-extractor therefore gives the user a faster startup when a given invocation is known to always need a particular, named extractor.
Force-pushed from 593f9f4 to be055cd
It's RFAL, thanks! I've solved 1, 3 and 4. The updated option name is pending your decision; happy to update whenever. It doesn't parse commas yet, but that can easily be added; for now, multiple extractors can be passed with multiple options (argparse's
Since the primary motivation seems to be performance, wouldn't it be better to redo the URL-matching machinery to scale better? In particular, many extractors can only match URLs from a single domain. If we took that into account, extracting the domain name from the URL first, each link would have to be matched not against nearly a thousand patterns, but ideally against just one domain-specific pattern and, failing that, a handful of general-purpose extractors. We could replace regex-based
Force-pushed from a63ff77 to b14d523
But this wouldn't be possible to do without significant changes to ALL the extractors, correct? Or do you have some idea for inferring the domain etc. from the _VALID_URL?
Yes, it might be hard. Though with a careful enough design, the necessary modification to extractors would be just one line invoking a decorator (and maybe another one removing …). The domain might be possible to exfiltrate from the regex with some heuristics, though doing so at runtime might not necessarily be advantageous.

I have no idea what kind of design you have in mind. You'll have to elaborate.
Eventually, I would like to see something like:

```python
# module-private function, not actually exported;
# ctx aggregates objects providing stuff like
# a network client, access to the credential store,
# terminal output, etc.; it does *not* contain any
# methods for parsing any data; that is delegated
# to specialised modules, which may be handed ctx
# or its constituent sub-objects as needed
def _extract_from_id(ctx, id, *, time=None):
    ...

# new-style registration: matched URL parts are
# decoded and passed as keyword arguments
@register_urlmatch(r'//<www>.youtube.com/embed/<:id>', iframe=True)
@register_urlmatch(r'//<www>.youtube-nocookie.com/embed/<:id>', iframe=True)
def __extract_embed(ctx, url, /, *, id):
    return _extract_from_id(ctx, id, time=url.qs.get('t'))

# legacy class registration: disables _VALID_URL matching,
# the extractor is otherwise written the same way as before
@register_urlmatch(r'//<www>.youtube.com/watch?v=<:id>')
@register_urlmatch(r'//youtu.be/<:id>')
class YoutubeIE(InfoExtractor):
    # not used for initial matching because
    # a registration decorator is present
    _VALID_URL = ...

    def _extract_url(self, url):
        return _extract_from_id(self._context, self._match_id(url))
```

where the registration decorators remember the decorated method/class in some kind of a trie of domain names (possibly in a fallback 'catch-all' node at the root). Matching a URL walks this trie and tests only those patterns that appear on the trie path towards the full domain name in the given URL; presumably the most specific pattern (i.e. the most nested domain) first. If no pattern matches, the page should be downloaded and tested by extractors registered against MIME types of a page (to handle e.g. direct manifest links), specific HTML elements, or iframe embed URIs.

The pattern matcher should automatically decode percent-encoding and match query parameters structurally (e.g. …). Heuristic extraction of the domain name may be attempted by scanning the regex until the part where it matches the path, e.g. whatever comes between …

Incremental porting could be achieved simply by importing the decorator into each extractor module and decorating extractor classes with new-style patterns, followed by systematic refactoring to fully take advantage of the new architecture.
Thank you @pukkandan |
cc @dasl- |
Deprecated by
Changed to
It was difficult to validate in options without affecting performance, so validation is implemented in YoutubeDL.
I decided to go with case-insensitive extractor names.
All done
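A hedged sketch of what case-insensitive name validation done inside the downloader (rather than in options.py) could look like; the function and its signature are illustrative, not the actual yt-dlp code:

```python
# Hypothetical sketch: resolve user-supplied extractor names to canonical
# names case-insensitively, rejecting names that match no known extractor.
def validate_extractor_names(requested, known_names):
    canonical = {name.lower(): name for name in known_names}
    resolved = []
    for name in requested:
        try:
            resolved.append(canonical[name.lower()])
        except KeyError:
            raise ValueError(f'No extractor named {name!r}') from None
    return resolved
```

Deferring this check to lookup time keeps option parsing cheap, since the full extractor list never needs to be materialized just to validate the flag.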
Extractors can disable themselves by setting `_ENABLED = False`. Example implementation:

```diff
diff --git a/yt_dlp/extractor/genericembeds.py b/yt_dlp/extractor/genericembeds.py
index 64bd20e3a..022f7db80 100644
--- a/yt_dlp/extractor/genericembeds.py
+++ b/yt_dlp/extractor/genericembeds.py
@@ -5,6 +5,7 @@
 class HTML5MediaEmbedIE(InfoExtractor):
     _VALID_URL = False
     IE_NAME = 'html5'
+    _ENABLED = False
     _WEBPAGE_TESTS = [
         {
             'url': 'https://html.com/media/'
```
cc @sdomi @selfisekai - I remember you guys were asking for this |
Amazing! Thank you @pukkandan! I've just tested the performance improvement on a Raspberry Pi model 3B+ using a fresh clone of this repo (2516caf). Here are the results of running this test: median of 10 trials, with vs. without the option (lower is better). That's a 50% speed increase with this new feature!
Seems to cause a regression for me. Running |
and add test Fixes #3234 (comment)
Sorry for probably asking a dumb question 😊, but README.md has the short form of
Any option name can be shortened as long as there is no other conflicting option |
I just noticed that initialization of a youtube video download is a lot faster, even without using the

If I understand correctly, it seems it's because youtube is now the first extractor loaded to try to match the URL. So if you are downloading a youtube video, using
…#34. By whitelisting an extractor for yt-dlp to use, the video download initialization time can be improved. In a prior version of yt-dlp, whitelisting the 'youtube' extractor could increase video download speeds by about 50% (see: yt-dlp/yt-dlp#3234 (comment)). But ever since this change in yt-dlp, the performance benefit is now more marginal: yt-dlp/yt-dlp@d2c8aad#diff-780b22dc7eb280f5a7b2bbf79aff17826de88ddcbf2fc1116ba19901827aa4e3R3 That is because the above commit improved the performance of youtube video downloads regardless of whether the --use-extractors flag is used. See: yt-dlp/yt-dlp#3234 (comment) In real use on the pifi, I observed an 11.5% speed increase in loading youtube videos on a Raspberry Pi 4. Median load time for my tests was 4.8 seconds. Test results: https://docs.google.com/spreadsheets/d/1Q95L0cJLam7ohi0sBBPtM8bXD7sIpsIxkPwmDt3OwkI/edit#gid=921905349