[general options] add --force-extractor #3234
Conversation
Force-pushed from e0e2f8d to 66b6652
This is something that has been requested (unofficially) before, but will need a few changes to this implementation:
I won't say we have to implement all of this now, but the interface must be designed accounting for these requirements
Slight clarification. Extractors are not initialized unless actually used. The regex compilation actually happens when the URL is checked against the extractor's _VALID_URL.
Out of curiosity, is this with or without lazy extractors?
Perhaps I'm misunderstanding, but based on looking at the profiler and some scattered prints, it seems to me that by default the input URL is checked against all extractors' _VALID_URL. That checking is what is expensive.
I believe it's with lazy extractors. The time is spent in
Also, thank you for taking a look. I'm unfamiliar with some of the things you mention (like instance checking). I'll take a look at solving the first few, though.
You can check
That is fine. I put all the related points here more for reference purposes. The last 2-3 points definitely don't need to be addressed in this PR; they are just further enhancements to this feature that would be helpful.
Also, it might be best if I do this part myself. But first we need to decide whether to use the name or the key (will using the name undermine the performance gain?)
Yes, confirming that lazy extractors are enabled. Also, another clarification to what I said:
The checking itself is not what shows up in the profile. It's the compilation of the regexes that is expensive.
Is there a map from name to key currently? If there is such a map, there will be no performance impact. I take back what I said about … Any other kind of processing, like looking up the key from the name, or doing a search through all of the names for your idea for
I'm completely fine with closing this if you want to take it over, just let me know!
No
No! I was talking about the documentation part in specific... (I can push changes directly to the PR if I need to)
The code in the lazy extractors is copied from yt-dlp/yt_dlp/extractor/common.py, lines 492 to 501 (at 8a7f68d).
Yes, but even if the … On a side note, would unpickling the compiled regex be faster than compiling it? If so, that data could be auto-generated at build time, similar to the lazy extractors.
Please revert the changes to the executable bit of the files.
Yes, you're right, I was just trying to be specific about what the profile sees - the compilation is expensive, the subsequent matching is cheap.
Interesting idea, I don't know. This StackOverflow answer suggests it might be too painful to be worth it: https://stackoverflow.com/a/65440/73632
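On the pickling question above, a standard-library sketch may help: compiled regex objects are picklable, but the pickle only stores the pattern string and flags, so unpickling calls the regex compiler again (modulo `re`'s internal cache). This suggests that shipping pre-pickled patterns would not actually skip the compilation cost. The example pattern below is illustrative, not one of yt-dlp's real `_VALID_URL`s.

```python
# Compiled patterns pickle as (pattern string, flags); loading the pickle
# re-runs compilation rather than restoring precompiled state, so caching
# pickled patterns would not avoid the compile cost.
import pickle
import re

pat = re.compile(r'https?://(?:www\.)?youtube\.com/watch\?v=(\w+)')

payload = pickle.dumps(pat)
# The serialized form contains the source pattern text, not compiled state:
assert pat.pattern.encode() in payload

restored = pickle.loads(payload)  # internally compiles the pattern again
assert restored.match('https://www.youtube.com/watch?v=abc123')
```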
I'm fine with any option name, please let me know your pick. Here are some other ideas:
The value of the option needs to be validated in options.py

hm,

It seems like we'd want to apply it to all redirects as well, though I'm not sure where the code for redirects is - is that the invocations in
All invocations of

```diff
diff --git a/yt_dlp/YoutubeDL.py b/yt_dlp/YoutubeDL.py
index b90173508..a3c865014 100755
--- a/yt_dlp/YoutubeDL.py
+++ b/yt_dlp/YoutubeDL.py
@@ -1378,7 +1378,10 @@ class YoutubeDL(object):
         else:
             ies = self._ies
 
+        allowed_extractors = self.params.get('force_extractor')
         for ie_key, ie in ies.items():
+            if allowed_extractors and ie_key not in allowed_extractors:
+                continue
             if not ie.suitable(url):
                 continue
```
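To make the effect of that filter concrete, here is a simplified, hypothetical sketch (the `Extractor` class and `match_url` helper are illustrative, not yt-dlp's real API): extractors compile their URL regex lazily, and an allow-list lets the loop skip an extractor before its regex is ever compiled.

```python
# Hypothetical sketch of the filtering logic in the diff above.
# Skipping an extractor via the allow-list means its regex is never compiled.
import re

class Extractor:
    def __init__(self, key, valid_url):
        self.key = key
        self._valid_url = valid_url
        self._compiled = None  # compiled lazily, only when suitable() runs

    def suitable(self, url):
        if self._compiled is None:
            self._compiled = re.compile(self._valid_url)
        return self._compiled.match(url) is not None

def match_url(url, extractors, allowed=None):
    """Return keys of extractors that accept `url`, honouring an allow-list."""
    matches = []
    for ie in extractors:
        if allowed and ie.key not in allowed:
            continue  # skipped before any regex compilation happens
        if ie.suitable(url):
            matches.append(ie.key)
    return matches

EXTRACTORS = [
    Extractor('Youtube', r'https?://(?:www\.)?youtube\.com/watch\?v=\w+'),
    Extractor('Vimeo', r'https?://(?:www\.)?vimeo\.com/\d+'),
]
```

With `allowed={'Youtube'}`, only the Youtube pattern is ever compiled; the Vimeo extractor stays untouched, which is where the startup saving comes from.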
Force-pushed from 66b6652 to 593f9f4
This commit adds support for a new command-line flag, --force-extractor, that takes as an argument the name of an extractor. This extractor is passed to extract_info as the ie_key, which causes the given extractor to be used directly without having to run __init__ on all supported extractors. This is important because each extractor typically needs to compile a regex, and all of that regex compilation adds up. In a profiler, this extractor initialization occupies 80% of startup time on my test machine (the rest is module import and checking ffmpeg versions). Passing --force-extractor therefore gives the user a faster startup when a given invocation is known to always need a particular, named extractor.
Force-pushed from 593f9f4 to be055cd
It's RFAL, thanks! I've solved 1, 3 and 4. The updated option name is pending your decision; happy to update whenever. It doesn't parse commas yet, but that can easily be added; for now, multiple extractors can be passed with multiple options (argparse's
Since the primary motivation seems to be performance, wouldn't it be better to redo the URL-matching machinery to scale better? In particular, many extractors can only match URLs from a single domain. If we took that into account, extracting the domain name from the URL first, each link would have to be matched not against nearly a thousand patterns, but ideally against just one domain-specific pattern and, failing that, a handful of general-purpose extractors. We could replace regex-based
Force-pushed from a63ff77 to b14d523
But this wouldn't be possible to do without significant changes to ALL the extractors, correct? Or do you have some idea for inferring the domain etc. from the _VALID_URL?
Yes, it might be hard. Though with a careful enough design, the necessary modification to extractors would be just one line invoking a decorator (and maybe another one removing …). The domain might be possible to exfiltrate from the regex with some heuristics, though doing so at runtime might not necessarily be advantageous.

I have no idea what kind of design you have in mind. You'll have to elaborate.
Eventually, I would like to see something like:

```python
# module-private function, not actually exported;
# ctx aggregates objects providing stuff like
# a network client, access to the credential store,
# terminal output, etc.; it does *not* contain any
# methods for parsing any data; that is delegated
# to specialised modules, which may be handed ctx
# or its constituent sub-objects as needed
def _extract_from_id(ctx, id, *, time=None):
    ...

# new-style registration: matched URL parts are
# decoded and passed as keyword arguments
@register_urlmatch(r'//<www>.youtube.com/embed/<:id>', iframe=True)
@register_urlmatch(r'//<www>.youtube-nocookie.com/embed/<:id>', iframe=True)
def __extract_embed(ctx, url, /, *, id):
    return _extract_from_id(ctx, id, time=url.qs.get('t'))

# legacy class registration: disables _VALID_URL matching,
# the extractor is otherwise written the same way as before
@register_urlmatch(r'//<www>.youtube.com/watch?v=<:id>')
@register_urlmatch(r'//youtu.be/<:id>')
class YoutubeIE(InfoExtractor):
    # not used for initial matching because
    # a registration decorator is present
    _VALID_URL = ...

    def _extract_url(self, url):
        return _extract_from_id(self._context, self._match_id(url))
```

where the registration decorators remember the decorated method/class in some kind of a trie of domain names (possibly in a fallback 'catch-all' node at the root). Matching a URL walks this trie and tests only those patterns that appear on the trie path towards the full domain name in the given URL; presumably the most specific pattern (i.e. the most nested domain) first. If no pattern matches, the page should be downloaded and tested by extractors registered against MIME types of a page (to handle e.g. direct manifest links), specific HTML elements, or iframe embed URIs.

The pattern matcher should automatically decode percent-encoding and match query parameters structurally (e.g. …). Heuristic extraction of the domain name may be attempted by scanning the regex until the part where it matches the path, e.g. whatever comes between …

Incremental porting could be achieved simply by importing the decorator into each extractor module and decorating extractor classes with new-style patterns, followed by systematic refactoring to fully take advantage of the new architecture.
Thank you @pukkandan |
cc @dasl- |
Deprecated by
Changed to
It was difficult to validate in options without affecting performance, so validation is implemented in YoutubeDL.
I decided to go with case-insensitive extractor names.
All done
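A hedged sketch of what case-insensitive name validation done inside the downloader (rather than in options.py) could look like; the function and its signature are illustrative, not the actual yt-dlp code:

```python
# Hypothetical sketch: resolve user-supplied extractor names to canonical
# names case-insensitively, rejecting names that match no known extractor.
def validate_extractor_names(requested, known_names):
    canonical = {name.lower(): name for name in known_names}
    resolved = []
    for name in requested:
        try:
            resolved.append(canonical[name.lower()])
        except KeyError:
            raise ValueError(f'No extractor named {name!r}') from None
    return resolved
```

Deferring this check to lookup time keeps option parsing cheap, since the full extractor list never needs to be materialized just to validate the flag.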
Extractors can disable themselves by setting `_ENABLED = False`. Example implementation:

```diff
diff --git a/yt_dlp/extractor/genericembeds.py b/yt_dlp/extractor/genericembeds.py
index 64bd20e3a..022f7db80 100644
--- a/yt_dlp/extractor/genericembeds.py
+++ b/yt_dlp/extractor/genericembeds.py
@@ -5,6 +5,7 @@
 class HTML5MediaEmbedIE(InfoExtractor):
     _VALID_URL = False
     IE_NAME = 'html5'
+    _ENABLED = False
     _WEBPAGE_TESTS = [
         {
             'url': 'https://html.com/media/'
```
cc @sdomi @selfisekai - I remember you guys were asking for this |
Amazing! Thank you @pukkandan! I've just tested the performance improvement on a Raspberry Pi model 3B+ using a fresh clone of this repo (2516caf). Here are the results of running this test: median of 10 trials, with vs. without the option (lower is better). That's a 50% speed increase with this new feature!
Seems to cause a regression for me. Running |
and add test Fixes #3234 (comment)
Sorry for probably asking a dumb question 😊, but README.md has the short form of
Any option name can be shortened as long as there is no other conflicting option |
I just noticed that initialization of a youtube video download is a lot faster, even without using the

If I understand correctly, it seems it's because youtube is now the first extractor loaded to try to match the URL. So if you are downloading a youtube video, using
…#34. By whitelisting an extractor for yt-dlp to use, the video download initialization time can be improved. In a prior version of yt-dlp, whitelisting the 'youtube' extractor could increase video download speeds by about 50% (see: yt-dlp/yt-dlp#3234 (comment)). But ever since this change in yt-dlp, the performance benefit is now more marginal: yt-dlp/yt-dlp@d2c8aad#diff-780b22dc7eb280f5a7b2bbf79aff17826de88ddcbf2fc1116ba19901827aa4e3R3 That is because the above commit improved the performance of youtube video downloads regardless of whether the --use-extractors flag is used. See: yt-dlp/yt-dlp#3234 (comment) In real use on the pifi, I observed an 11.5% speed increase in loading youtube videos on a Raspberry Pi 4. Median load time for my tests was 4.8 seconds. Test results: https://docs.google.com/spreadsheets/d/1Q95L0cJLam7ohi0sBBPtM8bXD7sIpsIxkPwmDt3OwkI/edit#gid=921905349