
[general options] add --force-extractor #3234

Closed

Conversation


@jordanlewis jordanlewis commented Mar 28, 2022

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

This commit adds support for a new command-line flag, --force-extractor,
that takes as an argument the name of an extractor. This extractor is
passed to extract_info as the ie_key, which causes the given extractor to
be used directly without having to run __init__ on all supported
extractors.

This is important because each extractor typically needs to compile a
regex, and all of that regex compilation adds up. In a profiler, this
extractor initialization occupies 80% of startup time on my test
machine (the rest is module import and checking ffmpeg versions).

Passing --force-extractor therefore gives the user a faster startup when
a given invocation is known to always need a particular, named
extractor.

@pukkandan
Member

pukkandan commented Mar 28, 2022

This is something that has been requested (unofficially) before, but will need a few changes to this implementation:

  • 1. --force-generic-extractor can be deprecated in favor of --force-extractor generic (The new implementation does not deprecate this)
  • 2. The name --force-extractor gives the impression that the extractor will be forced irrespective of the URL. This is not the case (and is not possible). So --restrict-to-extractors may be a more suitable name (I'm open to other suggestions)
  • 3. The value of the option needs to be validated in options.py
  • 4. The user only ever sees the extractor name in the logs, not their key. So either this needs to match using the names, or we need to document the ie_keys in supported_sites/--list-extractors. We also need to decide how to deal with case sensitivity of the keys
  • 5. We need the ability to restrict to multiple extractors like --force-extractor youtube,twitch
  • 6. It is useful to be able to block only specific extractors (Add option to prevent/restrict redirection #2044). Eg: --force-extractor -generic/--force-extractor all,-generic (Does this have any use except for generic?)
  • 7. Many sites are split across multiple extractors. So ability to match extractor names with a regex would be a plus. Eg: --force-extractor youtube.* (would this degrade performance?)
  • 8. Some extractors need "instance checking" (Self-hosted extractors: Mastodon, PeerTube and Misskey (with haruhi-dl merge) #1791). I would like to be able to fold that into this option if possible. Eg: --force-extractor mastodon-instances

I won't say we have to implement all of this now, but the interface must be designed accounting for these requirements.

This is important because each extractor typically needs to compile a
regex, and all of that regex compilation adds up.

Slight clarification. Extractors are not initialized unless actually used. The regex compilation actually happens when the URL is checked against the _VALID_URLs (this PR removes the checks)

In a profiler, this
extractor initialization occupies 80% of startup time on my test
machine (the rest is module import and checking ffmpeg versions).

Out of curiosity, is this with or without lazy extractors?

@pukkandan pukkandan added the enhancement New feature or request label Mar 28, 2022
@jordanlewis
Author

Slight clarification. Extractors are not initialized unless actually used. The regex compilation actually happens when the URL is checked against the _VALID_URLs (this PR removes the checks)

Perhaps I'm misunderstanding, but based on looking at the profiler and some scattered prints, it seems to me that by default the input URL is checked against all extractors' _VALID_URL. That checking is what is expensive.

Out of curiosity, is this with or without lazy extractors?

I believe it's with lazy extractors. The time is spent in re.compile(_VALID_URL). I performed the performance analysis with snakeviz as follows.

python3 -m cProfile -o profile_output yt-dlp ...
snakeviz profile_output

@jordanlewis
Author

Also, thank you for taking a look. I'm unfamiliar with some of the things you mention (like instance checking). I'll take a look at solving the first few, though.

@pukkandan
Member

I believe it's with lazy extractors.

You can check yt-dlp -v. If they are disabled, it will show [debug] Lazy loading extractors is disabled

@pukkandan
Member

I'm unfamiliar with some of the things you mention (like instance checking). I'll take a look at solving the first few, though.

That is fine. I put all the related points here more for reference purposes. The last 2-3 points definitely don't need to be addressed in this PR; they are just further enhancements to this feature that would be helpful.

The user only ever sees the extractor name in the logs, not their key. So either this needs to match using the names, or we need to document the ie_keys in supported_sites/--list-extractors

Also, it might be best if I do this part myself. But first we need to decide whether to use the name or the key (will using name undermine the performance gain?)

@jordanlewis
Author

Yes, confirming that lazy extractors are enabled.

Also, another clarification to what I said:

it seems to me that by default the input URL is checked against all extractors' _VALID_URL. That checking is what is expensive.

The checking itself is not what shows up in the profile. It's the compilation of the regexes that is expensive.

But first we need to decide whether to use the name or the key (will using name undermine the performance gain?)

Is there a map from name to key currently? If there is such a map, there will be no performance impact. I take back what I said about __init__; that was incorrect. The expensive part is the implementation of _match_valid_url in LazyLoadExtractor (I can't link this because it's generated code). As you can see, each time you call _match_valid_url on an extractor, it has to run re.compile(cls._VALID_URL). This method call dominates the runtime.

Any other kind of processing like looking up key from name, or doing a search through all of the names for your idea for youtube.*, will be very cheap by comparison.

Also, it might be best if I do this part myself.

I'm completely fine closing this if you want to take it over, just let me know!

@pukkandan
Member

Is there a map from name to key currently? If there is such a map, there will be no performance impact.

No

I'm completely fine closing this if you want to take it over, just let me know!

No! I was talking specifically about the documentation part... (I can push changes directly to the PR if I need to)

(I can't link this because it's generated code)

The code in the lazy extractors is copied from this:

@classmethod
def _match_valid_url(cls, url):
    # This does not use has/getattr intentionally - we want to know whether
    # we have cached the regexp for *this* class, whereas getattr would also
    # match the superclass
    if '_VALID_URL_RE' not in cls.__dict__:
        if '_VALID_URL' not in cls.__dict__:
            cls._VALID_URL = cls._make_valid_url()
        cls._VALID_URL_RE = re.compile(cls._VALID_URL)
    return cls._VALID_URL_RE.match(url)

As you can see, each time you call _match_valid_url on an extractor, it has to run re.compile(cls._VALID_URL)

Yes, but even if the re.compile was not explicitly present, it would take the same time due to re.match. Am I wrong? The explicit compilation only makes future matches much faster for playlists

On a sidenote, would unpickling the compiled regex be faster than compiling it? If so, that data can be auto-generated at build time similar to lazy-extractors.

@pukkandan
Member

Please revert the changes to the executable bit of files

@jordanlewis
Author

Yes, but even if the re.compile was not explicitly present, it would take the same time due to re.match. Am I wrong? The explicit compilation only makes future matches much faster for playlists

Yes, you're right, I was just trying to be specific about what the profile sees - the compilation is expensive, the subsequent matching is cheap.

On a sidenote, would unpickling the compiled regex be faster than compiling it? If so, that data can be auto-generated at build time similar to lazy-extractors.

Interesting idea, I don't know. This StackOverflow answer suggests it might be too painful to be worth it: https://stackoverflow.com/a/65440/73632

@jordanlewis
Author

I'm fine with any option name, please let me know your pick. Here are some other ideas:

--restrict-to-extractors
--extractors
--load-extractors
--use-extractors
--only-extractors

@pukkandan
Member

The value of the option needs to be validated in options.py

❯ yt-dlp test:youtube --force-extractor abcd
Traceback (most recent call last):
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\__main__.py", line 19, in <module>
    yt_dlp.main()
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\__init__.py", line 870, in main
    _real_main(argv)
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\__init__.py", line 860, in _real_main
    retcode = ydl.download(all_urls)
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\YoutubeDL.py", line 3272, in download
    self.__download_wrapper(self.extract_info)(
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\YoutubeDL.py", line 3245, in wrapper
    res = func(*args, **kwargs)
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\YoutubeDL.py", line 1378, in extract_info
    ies = {ie_key: self._get_info_extractor_class(ie_key)}
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\YoutubeDL.py", line 739, in _get_info_extractor_class
    ie = get_info_extractor(ie_key)
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\extractor\__init__.py", line 54, in get_info_extractor
    return globals()[ie_name + 'IE']
KeyError: 'abcdIE'

@pukkandan
Member

pukkandan commented Mar 28, 2022

hm, restrict-extractors is not quite right either. The restriction is applied only to the input URL and not to the redirects. Is this what we actually want? If so, #2044 cannot be implemented with this

❯ yt-dlp https://www.youtube.com/c/ИгорьКлейнер --playlist-end 1 --force-extractor YoutubeTab
[youtube:tab] ИгорьКлейнер: Downloading webpage
[youtube:tab] A channel/user page was given. All the channel's videos will be downloaded. To download only the videos in the home page, add a "/featured" to the URL
[download] Downloading playlist: Igor Kleiner - Videos
[youtube:tab] playlist Igor Kleiner - Videos: Downloading 1 videos
[download] Downloading video 1 of 1
[youtube] ctSSXR4JtIg: Downloading webpage
[youtube] ctSSXR4JtIg: Downloading android player API JSON
[youtube] ctSSXR4JtIg: Downloading MPD manifest
[youtube] ctSSXR4JtIg: Downloading MPD manifest
[info] ctSSXR4JtIg: Downloading 1 format(s): 248+251
[dashsegments] Total fragments: 353
[download] Destination: 4.4 דגימה וקטורים אקראיים [ctSSXR4JtIg].f248.webm
[download]   0.3% of ~698.76KiB at  421.19B/s ETA -1:59:56 (frag 1/353)
ERROR: Interrupted by user

@jordanlewis
Author

It seems like we'd want to apply it to all redirects as well, though I'm not sure where the code for redirects is - is that the invocations in process_ie_result? Maybe it would be easier to implement this as a list that's stored in YoutubeDL rather than passing as args everywhere.

@pukkandan
Member

All invocations of extract_info lead to a _match_valid_url. So something like this should work (untested):

diff --git a/yt_dlp/YoutubeDL.py b/yt_dlp/YoutubeDL.py
index b90173508..a3c865014 100755
--- a/yt_dlp/YoutubeDL.py
+++ b/yt_dlp/YoutubeDL.py
@@ -1378,7 +1378,10 @@ class YoutubeDL(object):
         else:
             ies = self._ies

+        allowed_extractors = self.params.get('force_extractor')
         for ie_key, ie in ies.items():
+            if allowed_extractors and ie_key not in allowed_extractors:
+                continue
             if not ie.suitable(url):
                 continue

@jordanlewis
Author

It's RFAL (ready for another look), thanks! I've solved 1, 3 and 4. The updated option name is pending your decision; happy to update whenever. It doesn't parse commas yet, but that can easily be added; for now, multiple extractors can be passed by repeating the option (argparse's append action).

@pukkandan pukkandan self-requested a review March 30, 2022 01:46
@dirkf
Contributor

dirkf commented Apr 4, 2022

--use-extractors extractor_1_ID [, extractor_n_ID]* could be a good option syntax, as it also allows --no-use-extractors ... for excluding, which doesn't work so well with the other suggestions. I refer to extractor 'ID' because it depends on how the extractor should be specified.

@fstirlitz
Contributor

Since the primary motivation seems to be performance, wouldn’t it be better to redo the URL-matching machinery to scale better?

In particular, many extractors can only match URLs from a single domain. If we took that into account, extracting the domain name from the URL first, each link would have to be matched against not nearly a thousand patterns, but ideally just one domain-specific pattern and, failing that, a handful of general-purpose extractors.

We could replace regex-based _VALID_URL matching with more structured URL matching, similar to what HTTP server frameworks like Flask do: extract the domain name from the pattern, and later look up the patterns to match against a given link by domain.

@pukkandan pukkandan force-pushed the master branch 2 times, most recently from a63ff77 to b14d523 Compare May 18, 2022 03:35
@pukkandan
Member

Since the primary motivation seems to be performance, wouldn’t it be better to redo the URL-matching machinery to scale better?

In particular, many extractors can only match URLs from a single domain. If we took that into account, extracting the domain name from the URL first, each link would have to be matched against not nearly a thousand patterns, but ideally just one domain-specific pattern and, failing that, a handful of general-purpose extractors.

We could replace regex-based _VALID_URL matching with more structured URL matching, similar to what HTTP server frameworks like Flask do: extract the domain name from the pattern, and later look up the patterns to match against a given link by domain.

But this wouldn't be possible to do without significant changes to ALL the extractors, correct? Or do you have some idea for inferring the domain etc. from the _VALID_URL?

@fstirlitz
Contributor

Yes, it might be hard. Though with a careful enough design, the necessary modification to extractors would be just one line invoking a decorator (and maybe another one removing _VALID_URL).

It might be possible to extract the domain from the regex with some heuristics, though doing so at runtime might not necessarily be advantageous.

@pukkandan
Member

Yes, it might be hard. Though with a careful enough design, the necessary modification to extractors would be just one line invoking a decorator (and maybe another one removing _VALID_URL).

I have no idea what kind of design you have in mind. You'll have to elaborate

@fstirlitz
Contributor

Eventually, I would like to see something like:

# module-private function, not actually exported;
# ctx aggregates objects providing stuff like
# a network client, access to the credential store,
# terminal output, etc.; it does *not* contain any
# methods for parsing any data; that is delegated
# to specialised modules, which may be handed ctx
# or its constituent sub-objects as needed
def _extract_from_id(ctx, id, *, time=None):
    ...

# new-style registration: matched URL parts are
# decoded and passed as keyword arguments
@register_urlmatch(r'//<www>.youtube.com/embed/<:id>', iframe=True)
@register_urlmatch(r'//<www>.youtube-nocookie.com/embed/<:id>', iframe=True)
def __extract_embed(ctx, url, /, *, id):
    return _extract_from_id(ctx, id, time=url.qs.get('t'))

# legacy class registration: disables _VALID_URL matching,
# the extractor is otherwise written the same way as before
@register_urlmatch(r'//<www>.youtube.com/watch?v=<:id>')
@register_urlmatch(r'//youtu.be/<:id>')
class YoutubeIE(InfoExtractor):
    # not used for initial matching because
    # a registration decorator is present
    _VALID_URL = ...

    def _extract_url(self, url):
        return _extract_from_id(self._context, self._match_id(url))

where the registration decorators remember the decorated method/class in some kind of a trie of domain names (possibly in a fallback ‘catch-all’ node at the root). Matching a URL walks this trie and tests only those patterns that appear on the trie path towards the full domain name in the given URL; presumably the most specific pattern (i.e. the most nested domain) first. If no pattern matches, the page should be downloaded and tested by extractors registered against MIME types of a page (to handle e.g. direct manifest links), specific HTML elements, or iframe embed URIs.

The pattern matcher should automatically decode percent-encoding and match query parameters structurally (e.g. ?foo=<:foo>&quux=<:quux> should match ?quux=%EF%BC%B1&foo=foo+bar&bar=3 and assign 'foo bar' to foo and 'Q' to quux). Unless, of course, the pattern is specifically written to override that.

Heuristic extraction of the domain name may be attempted by scanning the regex until the part where it matches the path. E.g. whatever comes between (?:https?:)?// (or https?://, or http://, or…) and subsequent / must be a pattern matching the hostname. Though one must carefully pay attention to metacharacter escaping rules. If domain name extraction fails, the extractor is appended to the catch-all list.

Incremental porting could be achieved simply by importing the decorator into each extractor module and decorating extractor classes with new-style patterns, followed by systematic refactoring to fully take advantage of the new architecture.

@jordanlewis
Author

Thank you @pukkandan

@jordanlewis
Author

cc @dasl-

@pukkandan
Member

pukkandan commented Aug 24, 2022

  • 1. --force-generic-extractor can be deprecated in favor of --force-extractor generic

Deprecated by --use-extractor generic,default

  • 2. The name --force-extractor gives the impression that the extractor will be forced irrespective of the URL. This is not the case (and is not possible). So --restrict-to-extractors may be a more suitable name (I'm open to other suggestions)

Changed to --use-extractors as per @dirkf's suggestion

  • 3. The value of the option needs to be validated in options.py

It was difficult to validate in options.py without affecting performance, so validation is implemented in YoutubeDL instead (see the sketch below).

  • 4. The user only ever sees the extractor name in the logs, not their key. So either this needs to match using the names, or we need to document the ie_keys in supported_sites/--list-extractors. We also need to decide how to deal with case sensitivity of the keys

I decided to go by case-insensitive extractor names

  • 5. We need the ability to restrict to multiple extractors like --force-extractor youtube,twitch
  • 6. It is useful to be able to block only specific extractors (Add option to prevent/restrict redirection #2044). Eg: --force-extractor -generic/--force-extractor all,-generic (Does this have any use except for generic?)
  • 7. Many sites are split across multiple extractors. So ability to match extractor names with a regex would be a plus. Eg: --force-extractor youtube.* (would this degrade performance?)

All done

Extractors can now disable themselves by setting _ENABLED = False. Along with the previously implemented _VALID_URL = False, it is now possible to implement instance-checking extractors (cc @Lesmiscore)

Example implementation:

❯ git diff
diff --git a/yt_dlp/extractor/genericembeds.py b/yt_dlp/extractor/genericembeds.py
index 64bd20e3a..022f7db80 100644
--- a/yt_dlp/extractor/genericembeds.py
+++ b/yt_dlp/extractor/genericembeds.py
@@ -5,6 +5,7 @@
 class HTML5MediaEmbedIE(InfoExtractor):
     _VALID_URL = False
     IE_NAME = 'html5'
+    _ENABLED = False
     _WEBPAGE_TESTS = [
         {
             'url': 'https://html.com/media/'
❯ yt-dlp https://html.com/media/ -v
[debug] Command-line config: ['--ignore-config', 'https://html.com/media/', '-v']
[debug] Encodings: locale cp65001, fs utf-8, pref cp65001, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version 2022.08.19 [48c88e088] (source)
[debug] Lazy loading extractors is disabled
[debug] Plugins: ['SamplePluginIE', 'SamplePluginPP']
[debug] Git HEAD: fd404bec7
[debug] Python 3.10.6 (CPython 64bit) - Windows-10-10.0.22000-SP0
[debug] Checking exe version: ffmpeg -bsfs
[debug] Checking exe version: ffprobe -bsfs
[debug] exe versions: ffmpeg N-107787-gc469c3c3b1-20220814 (fdk,setts), ffprobe N-107787-gc469c3c3b1-20220814, phantomjs 2.1.1
[debug] Optional libraries: Cryptodome-3.14.1, brotli-1.0.9, certifi-2021.10.08, mutagen-1.45.1, sqlite3-2.6.0, websockets-10.1
[debug] Proxy map: {}
[debug] Loaded 1661 extractors
[debug] [generic] Extracting URL: https://html.com/media/
[generic] media: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] media: Extracting information
[debug] Looking for Brightcove embeds
[debug] Looking for embeds
ERROR: Unsupported URL: https://html.com/media/
Traceback (most recent call last):
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\YoutubeDL.py", line 1452, in wrapper
    return func(self, *args, **kwargs)
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\YoutubeDL.py", line 1528, in __extract_info
    ie_result = ie.extract(url)
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\extractor\common.py", line 670, in extract
    ie_result = self._real_extract(url)
  File "D:\Programs\Source\yt-dlp\yt-dlp\yt_dlp\extractor\generic.py", line 3077, in _real_extract
    raise UnsupportedError(url)
yt_dlp.utils.UnsupportedError: Unsupported URL: https://html.com/media/
❯ yt-dlp https://html.com/media/ -v --ie default,html5
[debug] Command-line config: ['--ignore-config', 'https://html.com/media/', '-v', '--ie', 'default,html5']
[debug] Encodings: locale cp65001, fs utf-8, pref cp65001, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version 2022.08.19 [48c88e088] (source)
[debug] Lazy loading extractors is disabled
[debug] Plugins: ['SamplePluginIE', 'SamplePluginPP']
[debug] Git HEAD: fd404bec7
[debug] Python 3.10.6 (CPython 64bit) - Windows-10-10.0.22000-SP0
[debug] Checking exe version: ffmpeg -bsfs
[debug] Checking exe version: ffprobe -bsfs
[debug] exe versions: ffmpeg N-107787-gc469c3c3b1-20220814 (fdk,setts), ffprobe N-107787-gc469c3c3b1-20220814, phantomjs 2.1.1
[debug] Optional libraries: Cryptodome-3.14.1, brotli-1.0.9, certifi-2021.10.08, mutagen-1.45.1, sqlite3-2.6.0, websockets-10.1
[debug] Proxy map: {}
[debug] Loaded 1662 extractors
[debug] [generic] Extracting URL: https://html.com/media/
[generic] media: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] media: Extracting information
[debug] Looking for Brightcove embeds
[debug] Looking for embeds
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, filesize, fs_approx, tbr, vbr, abr, asr, proto, vext, aext, hasaud, source, id
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), channels, acodec, filesize, fs_approx, tbr, vbr, abr, asr, proto, vext, aext, hasaud, source, id
[debug] Identified 2 html5 embeds
[download] Downloading playlist: HTML5 Media
[generic] Playlist HTML5 Media: Downloading 2 videos of 2
[download] Downloading video 1 of 2
[debug] Default format spec: bestvideo*+bestaudio/best
[info] media-1: Downloading 1 format(s): 0
[debug] Invoking http downloader on "https://archive.org/download/ReclaimHtml5/ReclaimHtml5.mp3"
[debug] File locking is not supported. Proceeding without locking
[download] Destination: media (1) [media-1].mp3
[download]   6.9% of 28.90MiB at    1.19MiB/s ETA 00:22
ERROR: Interrupted by user

@pukkandan
Member

cc @sdomi @selfisekai - I remember you guys were asking for this

@jordanlewis jordanlewis deleted the force-extractor branch August 24, 2022 02:51
@dasl-

dasl- commented Aug 24, 2022

Amazing! thank you @pukkandan !

I've just tested timing the performance improvement on a raspberry pi model 3B+ using a fresh clone of this repo (2516caf)

Here are the results of running this test:

for i in $(seq 1 10) ; do time ./yt-dlp.sh -o - https://www.youtube.com/watch\?v\=zmr2I8caF0c >/dev/null ; time ./yt-dlp.sh --use-extractors youtube -o - https://www.youtube.com/watch\?v\=zmr2I8caF0c >/dev/null ; done

median of 10 trials w/o --use-extractors youtube: 9.95 seconds
median of 10 trials w/ --use-extractors youtube: 5.01 seconds

Lower is better. That's roughly half the run time (about a 2x speedup) with this new feature!

@samoht0

samoht0 commented Aug 24, 2022

Seems to cause a regression for me. Running
yt-dlp -v
I'm getting
WARNING: Falling back to normal extractor since lazy extractor [...] does not have attribute _ENABLED; please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version [...]
for all lazy extractors when using 2516caf.
5314b52 is fine; lazy extractors are present using
yt-dlp --list-extractors

pukkandan added a commit that referenced this pull request Aug 24, 2022
@Vangelis66

Vangelis66 commented Aug 24, 2022

yt-dlp https://html.com/media/ -v --ie default,html5

Sorry for probably asking a dumb question 😊, but README.md has the short form of --use-extractors as --ies; is it safe to assume --ie = --ies?

@pukkandan
Member

Any option name can be shortened as long as there is no other conflicting option

@pukkandan
Member

@samoht0 Fixed by e5458d1

@dasl-

dasl- commented Sep 4, 2022

I just noticed that initialization of a youtube video download is a lot faster, even without using the --use-extractors flag, ever since this change: d2c8aad#diff-780b22dc7eb280f5a7b2bbf79aff17826de88ddcbf2fc1116ba19901827aa4e3R3

If I understand correctly, it seems like it's because youtube is now the first extractor loaded when trying to match the URL. So if you are downloading a youtube video, using --use-extractors now only gets you an ~8.5% speed boost in my testing just now, using a raspberry pi model 3B+. Seems like a sensible change to make youtube the first extractor loaded by default!

dasl- added a commit to dasl-/pifi that referenced this pull request Sep 4, 2022
…#34.

By whitelisting an extractor for yt-dlp to use, the video download initialization time can be improved.

In a prior version of yt-dlp, whitelisting the 'youtube' extractor could increase video download speeds by about 50% (see: yt-dlp/yt-dlp#3234 (comment))

But ever since this change in yt-dlp, the performance benefit is more marginal now: yt-dlp/yt-dlp@d2c8aad#diff-780b22dc7eb280f5a7b2bbf79aff17826de88ddcbf2fc1116ba19901827aa4e3R3

That is because the above commit improved the performance of youtube video downloads regardless of whether the --use-extractors flag is used. See: yt-dlp/yt-dlp#3234 (comment)

In real use on the pifi, I observed an 11.5% speed increase in loading youtube videos on a raspberry pi 4. Median load time for my tests was 4.8 seconds.

Test results: https://docs.google.com/spreadsheets/d/1Q95L0cJLam7ohi0sBBPtM8bXD7sIpsIxkPwmDt3OwkI/edit#gid=921905349