[Feature Request] Sites offering alternative simultaneous media streams #389

Closed
fstirlitz opened this issue Jun 9, 2021 · 14 comments
Labels
enhancement New feature or request

Comments

@fstirlitz
Contributor

fstirlitz commented Jun 9, 2021

  • I'm reporting a feature request
  • I've verified that I'm running yt-dlp version 2021.06.08
  • I've searched the bugtracker for similar feature requests including closed ones

Description

(Spun off from #343)

A handful of sites sometimes offer multiple media streams of a given type that are meant to be played simultaneously or as alternatives to each other. The biggest one is probably Mediasite, which often offers a screencast stream (and presentation slides) in addition to the video stream showing the speaker. Other such cases are possible: in the past, GDCVault offered an alternative audio track containing the translation of the speaker’s talk, and a video stream containing the slides; #347 is an issue with a site offering both subtitled and dubbed versions of the same video.

This is not very common, but when it happens, it’s something of a pain to support. Currently, alternative Mediasite streams are offered as separate formats with a negative preference value, which means they are not downloaded by default, even though we can do it just fine. Because it is nowhere even mentioned that they are available for download, they can be hard to discover (see ytdl-org/youtube-dl#20611, ytdl-org/youtube-dl#23003). Some kind of general framework for handling such cases would be useful.

Here’s my design sketch: each format can declare a set of streams it contains by including a 'streams' key in its dict containing a non-empty list of stream identifiers (strings). If two formats declare the same stream identifier, they shall be considered as containing two different quality versions of the same content. If a format doesn’t have a 'streams' key, it will be synthesised based on the 'acodec' and 'vcodec' keys: the list will contain 'audio' unless 'acodec' is 'none' and it will contain 'video' unless 'vcodec' is 'none'.
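
As a rough illustration of the fallback rule above, here is a minimal sketch that assumes a format is a plain dict with optional 'streams', 'acodec' and 'vcodec' keys. The helper name `declared_streams` is hypothetical and not part of yt-dlp.

```python
def declared_streams(fmt):
    """Return the stream identifiers a format dict declares.

    Hypothetical helper sketching the proposal; not an existing yt-dlp function.
    """
    streams = fmt.get('streams')
    if streams:  # an explicit, non-empty list wins
        return list(streams)
    # Otherwise synthesise the list from the codec fields.
    synthesised = []
    if fmt.get('acodec') != 'none':
        synthesised.append('audio')
    if fmt.get('vcodec') != 'none':
        synthesised.append('video')
    return synthesised
```

Under this reading, a format without any codec information is treated as carrying both 'audio' and 'video', since the rule only excludes a stream when the codec is explicitly 'none'.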

The meaning of best would then be modified, and a couple of other selectors added:

  • best: Picks the single best pre-multiplexed format that contains all streams;
  • allbest: For each stream offered by the download, picks the best format containing it, and downloads them separately;
  • mergeallbest: For each stream offered by the download, picks the best format containing it, and merges them all afterwards. If merging is not possible, nothing is selected.

The default value of the -f option would then become mergeallbest/allbest. Analogous selectors for the worst formats could be provided as well.
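
One possible reading of allbest and mergeallbest, sketched under two assumptions that are not stated in the proposal: the format list is ordered from worst to best, and the caller supplies a `can_merge` predicate that decides whether a set of formats can be merged. All function names here are hypothetical.

```python
def streams_of(fmt):
    # Same fallback as the design sketch: explicit 'streams', else codec-derived.
    if fmt.get('streams'):
        return list(fmt['streams'])
    return [name for name, key in (('audio', 'acodec'), ('video', 'vcodec'))
            if fmt.get(key) != 'none']


def allbest(formats):
    """Pick, for every stream identifier, the best format containing it.

    Assumes `formats` is ordered from worst to best, so a later entry
    overrides an earlier one. Returns the picks without duplicates.
    """
    best_for = {}
    for fmt in formats:
        for stream in streams_of(fmt):
            best_for[stream] = fmt
    picks = []
    for fmt in best_for.values():
        if not any(fmt is picked for picked in picks):
            picks.append(fmt)
    return picks


def mergeallbest(formats, can_merge):
    """Like allbest, but select nothing when the picks cannot be merged.

    `can_merge` is a caller-supplied predicate (an assumption of this sketch).
    """
    picks = allbest(formats)
    return picks if can_merge(picks) else None
```

With this shape, the proposed default `mergeallbest/allbest` amounts to trying `mergeallbest` first and falling back to `allbest` when it returns nothing.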


List of extractors that could potentially benefit (with example URLs if possible):

@pukkandan
Member

pukkandan commented Jun 10, 2021

This is a great idea and shouldn't be too difficult to implement.

There are however a few things that need addressing:

  1. Current options should keep working as is. So, bestvideo should always select the one best video stream irrespective of its stream type
  2. At the same time, the user should be able to select the best of a particular stream type (say bestvideo.presentation)
  3. User should also have the option to select the best of each video stream (say bestvideo.all or allbestvideo)
  4. Changing best to mean the multiplexed format with ALL streams could cause compat issues. Say there is a website that provides audio+video merged and we add storyboards for it. Now if someone was using -f best, it will throw an error since there is now no single format with all stream types. So we should keep best to refer to any format with 1 video + 1 audio
  5. allbest/mergeallbest creates an ambiguity. Since best refers to the best multiplexed format, allbest intuitively refers to the best of each stream-type that has multiplexed formats

Due to these, I propose that we keep the video|audio(|image?) selector as-is, and add the stream selector on top of this. So the format selector will look like:

r'''(?x)
                        (?P<merge>merge)?
                        (?P<which>b|w|all|best|worst)
                        (?P<what>v|a|video|audio)?
                        (?P<containing>\*)?
                        (?:\.(?P<stream>all|\w+))?
                        (?:\.(?P<n>[1-9]\d*))?
'''

Here's how it would address the above points

  1. Current options should keep working as is. So, bestvideo should always select the one best video stream irrespective of its stream type
  2. At the same time, the user should be able to select the best of a particular stream type (say bestvideo.presentation)

bestvideo -> best video of any stream type
best*.presentation -> best format (audio/video/pre-merged) of type presentation
bestvideo.screencast -> best video-only format of type screencast

  3. User should also have the option to select the best of each video stream (say bestvideo.all or allbestvideo)

bestvideo.all -> Best of each type of video-only streams, downloaded separately
mergebest*.all -> Best of each type of stream merged into 1 file

  4. Changing best to mean the multiplexed format with ALL streams could cause compat issues. Say there is a website that provides audio+video merged and we add storyboards for it. Now if someone was using -f best, it will throw an error since there is now no single format with all stream types. So we should keep best to refer to any format with 1 video + 1 audio

We haven't changed the meaning of best, but we also now don't have any selector to select "pre-merged format with all stream types". I will need to think about how this can be added, or if it is even needed

  5. allbest/mergeallbest creates an ambiguity. Since best refers to the best multiplexed format, allbest intuitively refers to the best of each stream-type that has multiplexed formats

best.all -> best pre-merged format of each stream type
best*.all -> Best of each type of stream, whether audio, video or pre-merged
all.screencast -> All formats of type screencast
Note that all has different meanings before and after the .
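
To see how the proposed pattern would split these selectors, here is a small sketch. It writes every named group with Python's `(?P<name>...)` syntax, which `re` requires; the `parse` helper is hypothetical and only reports which groups matched.

```python
import re

SELECTOR = re.compile(r'''(?x)
    (?P<merge>merge)?
    (?P<which>b|w|all|best|worst)
    (?P<what>v|a|video|audio)?
    (?P<containing>\*)?
    (?:\.(?P<stream>all|\w+))?
    (?:\.(?P<n>[1-9]\d*))?
''')


def parse(selector):
    """Return the named groups that matched, or None if the selector is invalid."""
    m = SELECTOR.fullmatch(selector)
    if m is None:
        return None
    return {k: v for k, v in m.groupdict().items() if v is not None}


for sel in ('bestvideo', 'best*.presentation', 'mergebest*.all', 'all.screencast', 'bv.2'):
    print(sel, '->', parse(sel))
```

One thing this makes visible: with the pattern as written, `bv.2` fills the `stream` group rather than `n`, because `\w+` also matches digits. An implementation would presumably need to disambiguate the two, for example by requiring stream names to start with a letter.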

@pukkandan added the enhancement (New feature or request) label Jun 11, 2021
@fstirlitz
Contributor Author

Changing best to mean the multiplexed format with ALL streams could cause compat issues. Say there is a website that provides audio+video merged and we add storyboards for it. Now if someone was using -f best, it will throw an error since there is now no single format with all stream types.

That’s a good example, but I also thought about having a mechanism that would exclude certain streams from consideration by these ‘collective’ selectors (storyboards, audio tracks in languages you’re not interested in, etc.), which could mitigate this problem.

pukkandan added a commit that referenced this issue Jun 12, 2021
Necessary for #343.

* They are identified by `vcodec=acodec='none'`
* These formats show as the worst in `-F`
* Any postprocessor that expects audio/video will be skipped
* `b*` and all related selectors will skip such formats
* This commit also does not add any selector for downloading such formats. They have to be explicitly requested by the `format_id`. Implementation of a selector is left for when #389 is resolved
nixxo pushed a commit to nixxo/yt-dlp that referenced this issue Nov 22, 2021
@barkoder

barkoder commented Feb 6, 2022

Related to this issue, YouTube now allows some videos to have 2 audio tracks such as a descriptive audio track for the blind. So yt-dlp defaults to picking 251-1 only.

Both 251-0 and 251-1 should get muxed with 251-1 selected as the default audio track when the muxed output video file gets played.

$ yt-dlp -v -F rs1WF2SkjuY
[debug] Command-line config: ['-v', '-F', 'rs1WF2SkjuY']
[debug] yt-dlp version 2022.02.04 [c1653e9] (py2exe)
[debug] exe versions: ffmpeg 4.4.1-essentials_build-www.gyan.dev (setts), ffprobe 4.4.1-essentials_build-www.gyan.dev
[debug] Optional libraries: mutagen, sqlite, websockets
[debug] Proxy map: {}
[debug] [youtube] Extracting URL: rs1WF2SkjuY
[youtube] rs1WF2SkjuY: Downloading webpage
[youtube] rs1WF2SkjuY: Downloading android player API JSON
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, codec:vp9.2, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), acodec, lang, proto, filesize, fs_approx, tbr, vbr, abr, asr, vext, aext, hasaud, id
[info] Available formats for rs1WF2SkjuY:
ID    EXT   RESOLUTION FPS |   FILESIZE    TBR PROTO | VCODEC           VBR ACODEC      ABR     ASR MORE INFO
------------------------------------------------------------------------------------------------------------------------------------------------------
sb2   mhtml 48x27          |                   mhtml | images                                       storyboard
sb1   mhtml 80x45          |                   mhtml | images                                       storyboard
sb0   mhtml 160x90         |                   mhtml | images                                       storyboard
139-0 m4a   audio only     |    6.71MiB    48k https | audio only           mp4a.40.5   48k 22050Hz [en] English descriptive, low, m4a_dash
139-1 m4a   audio only     |    6.71MiB    48k https | audio only           mp4a.40.5   48k 22050Hz [en] English original (default), low, m4a_dash
249-0 webm  audio only     |    6.77MiB    49k https | audio only           opus        49k 48000Hz [en] English descriptive, low, webm_dash
250-0 webm  audio only     |    7.90MiB    57k https | audio only           opus        57k 48000Hz [en] English descriptive, low, webm_dash
249-1 webm  audio only     |    6.62MiB    48k https | audio only           opus        48k 48000Hz [en] English original (default), low, webm_dash
250-1 webm  audio only     |    7.67MiB    55k https | audio only           opus        55k 48000Hz [en] English original (default), low, webm_dash
140-0 m4a   audio only     |   17.80MiB   129k https | audio only           mp4a.40.2  129k 44100Hz [en] English descriptive, medium, m4a_dash
140-1 m4a   audio only     |   17.80MiB   129k https | audio only           mp4a.40.2  129k 44100Hz [en] English original (default), medium, m4a_dash
251-0 webm  audio only     |   14.32MiB   104k https | audio only           opus       104k 48000Hz [en] English descriptive, medium, webm_dash
251-1 webm  audio only     |   13.90MiB   101k https | audio only           opus       101k 48000Hz [en] English original (default), medium, webm_dash
17    3gp   176x144      6 |   10.72MiB    78k https | mp4v.20.3        78k mp4a.40.2    0k 22050Hz 144p
394   mp4   256x144     25 |    8.47MiB    61k https | av01.0.00M.08    61k video only              144p, mp4_dash
160   mp4   256x144     25 |    5.27MiB    38k https | avc1.4d400c      38k video only              144p, mp4_dash
278   webm  256x144     25 |    9.03MiB    65k https | vp9              65k video only              144p, webm_dash
395   mp4   426x240     25 |   12.92MiB    94k https | av01.0.00M.08    94k video only              240p, mp4_dash
133   mp4   426x240     25 |   11.12MiB    80k https | avc1.4d4015      80k video only              240p, mp4_dash
242   webm  426x240     25 |   12.91MiB    93k https | vp9              93k video only              240p, webm_dash
396   mp4   640x360     25 |   24.98MiB   181k https | av01.0.01M.08   181k video only              360p, mp4_dash
134   mp4   640x360     25 |   21.53MiB   156k https | avc1.4d401e     156k video only              360p, mp4_dash
18    mp4   640x360     25 |   66.35MiB   482k https | avc1.42001E     482k mp4a.40.2    0k 44100Hz 360p
243   webm  640x360     25 |   32.07MiB   233k https | vp9             233k video only              360p, webm_dash
397   mp4   854x480     25 |   45.60MiB   331k https | av01.0.04M.08   331k video only              480p, mp4_dash
135   mp4   854x480     25 |   36.59MiB   266k https | avc1.4d401e     266k video only              480p, mp4_dash
244   webm  854x480     25 |   48.72MiB   354k https | vp9             354k video only              480p, webm_dash
398   mp4   1280x720    25 |   94.03MiB   684k https | av01.0.05M.08   684k video only              720p, mp4_dash
136   mp4   1280x720    25 |   62.20MiB   452k https | avc1.4d401f     452k video only              720p, mp4_dash
22    mp4   1280x720    25 | ~247.92MiB  1761k https | avc1.64001F    1761k mp4a.40.2    0k 44100Hz 720p
247   webm  1280x720    25 |   91.82MiB   668k https | vp9             668k video only              720p, webm_dash
399   mp4   1920x1080   25 |  173.87MiB  1265k https | av01.0.08M.08  1265k video only              1080p, mp4_dash
137   mp4   1920x1080   25 |  247.94MiB  1804k https | avc1.640028    1804k video only              1080p, mp4_dash
248   webm  1920x1080   25 |  169.39MiB  1232k https | vp9            1232k video only              1080p, webm_dash
400   mp4   2560x1440   25 |  608.94MiB  4430k https | av01.0.12M.08  4430k video only              1440p, mp4_dash
271   webm  2560x1440   25 |  618.00MiB  4496k https | vp9            4496k video only              1440p, webm_dash
401   mp4   3840x2160   25 |    1.28GiB  9511k https | av01.0.12M.08  9511k video only              2160p, mp4_dash
313   webm  3840x2160   25 |    1.81GiB 13517k https | vp9           13517k video only              2160p, webm_dash
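
For reference, the requested result (both audio tracks muxed, with one flagged as default) can be produced by hand with ffmpeg's `-map` and `-disposition` options. The sketch below only builds such a command line; it is not what yt-dlp does internally, and the function name and all file paths are placeholders.

```python
def build_mux_args(video_path, audio_paths, default_index, output_path):
    """Build an ffmpeg argument list that muxes one video with several audio tracks.

    All paths are placeholders supplied by the caller; `default_index` is the
    position in `audio_paths` of the track to flag as default.
    """
    args = ['ffmpeg', '-i', video_path]
    for path in audio_paths:
        args += ['-i', path]
    args += ['-map', '0:v:0']
    for i in range(len(audio_paths)):
        # Input 0 is the video, so audio inputs start at index 1.
        args += ['-map', f'{i + 1}:a:0']
    args += ['-c', 'copy']  # remux without re-encoding
    for i in range(len(audio_paths)):
        flag = 'default' if i == default_index else '0'
        args += [f'-disposition:a:{i}', flag]
    args.append(output_path)
    return args


print(' '.join(build_mux_args('video.webm', ['251-0.webm', '251-1.webm'], 1, 'out.mkv')))
```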

@fstirlitz
Contributor Author

Looking back, I think my design sketch was perhaps a tad too simplistic. It would make some sense to attach some metadata as well to the whole stream/feed (what DASH refers to as an ‘adaptation set’) instead of individual formats (‘representations’ in DASH parlance). This way we would be able to attach language and stream kind information like ‘original audio’, ‘dubbed translation’, ‘voice-over translation’, ‘audio description track’, ‘forced subtitle [i.e. to be paired with an audio translation]’, ‘video with burned-in subtitles’ to the feeds themselves, so that those metadata consistently propagate to all formats and can be used in selectors.

@pukkandan
Member

The meaning of best would then be modified, and a couple of other selectors added:

  • best: Picks the single best pre-multiplexed format that contains all streams;
  • allbest: For each stream offered by the download, picks the best format containing it, and downloads them separately;
  • mergeallbest: For each stream offered by the download, picks the best format containing it, and merges them all afterwards. If merging is not possible, nothing is selected.
r'''(?x)
                        (?P<merge>merge)?
                        (?P<which>b|w|all|best|worst)
                        (?P<what>v|a|video|audio)?
                        (?P<containing>\*)?
                        (?:\.(?P<stream>all|\w+))?
                        (?:\.(?P<n>[1-9]\d*))?
'''

In retrospect, neither of our original proposals quite works. There are fundamentally 2 separate features that should be addressed here:

  1. Being able to distinguish different types of streams. Many extractors currently use language for this job (e.g. description tracks set to have a different language). This can be easily addressed by adding a new field (say, type). Filtering by type can be done using the ordinary filter syntax, i.e. best[type=desc] rather than bestdesc or best.desc. So there is no need to give this field any special consideration.
  2. Being able to download the best of each type (or the best of each language). This is what #1176 ("[Feature Request] Add option to download best format for all audio languages") asks for.

So my new proposal is:

  • Single formats (unchanged)
r'''(?x)
    (?P<which>b|w|best|worst)
    (?P<what>v|a|video|audio)?
    (?P<containing>\*)?
    (?:\.(?P<n>[1-9]\d*))?
'''
  • Multiple formats without checking type
r'''(?x)
    (?P<merge>merge)?all
    (?P<what>v|a|video|audio)?  # New
'''
  • Multiple formats grouped by type (entirely new)
r'''(?x)
    (?P<merge>merge)?all
    (?: # This is the same pattern as single formats
        (?P<which>b|w|best|worst)
        (?P<what>v|a|video|audio)?
        (?P<containing>\*)?
        (?:\.(?P<n>[1-9]\d*))?
    )
    (?:\{(?P<field>\w+)\})
'''

Eg:

  1. mergealla - Merge all audio-only streams
  2. mergeallba{language} - Merge best audio-only streams of each language
  3. allwv.2{type} - Download second worst video-only streams of each type
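
A minimal sketch of how the grouped selectors could be evaluated, assuming formats are plain dicts ordered from worst to best and that the grouping field (for example 'language' or 'type') is a key on each dict. The pattern uses Python's `(?P<name>...)` group syntax; the helper names are hypothetical.

```python
import re

GROUPED = re.compile(r'''(?x)
    (?P<merge>merge)?all
    (?P<which>b|w|best|worst)
    (?P<what>v|a|video|audio)?
    (?P<containing>\*)?
    (?:\.(?P<n>[1-9]\d*))?
    \{(?P<field>\w+)\}
''')


def matches_kind(fmt, what, containing):
    has_audio = fmt.get('acodec') != 'none'
    has_video = fmt.get('vcodec') != 'none'
    if what in ('v', 'video'):
        return has_video and (containing or not has_audio)
    if what in ('a', 'audio'):
        return has_audio and (containing or not has_video)
    # No media type given: pre-merged formats only, unless '*' was used.
    return (has_audio or has_video) if containing else (has_audio and has_video)


def select_grouped(selector, formats):
    """Return (merge, picks): the n-th best/worst format of each group.

    Assumes `formats` is ordered worst to best and that every format dict
    may carry the grouping field.
    """
    m = GROUPED.fullmatch(selector)
    if m is None:
        raise ValueError(f'not a grouped selector: {selector!r}')
    n = int(m['n'] or 1)
    groups = {}
    for fmt in formats:
        if matches_kind(fmt, m['what'], bool(m['containing'])):
            groups.setdefault(fmt.get(m['field']), []).append(fmt)
    picks = []
    for candidates in groups.values():
        if m['which'] in ('b', 'best'):
            candidates = candidates[::-1]  # best first
        if n <= len(candidates):  # skip groups with fewer than n candidates
            picks.append(candidates[n - 1])
    return bool(m['merge']), picks
```

Groups that have fewer than n candidates are silently skipped here; whether that or an error is the better behaviour is a design choice the proposal leaves open.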

Random thoughts:

  1. I chose {} as the separator arbitrarily. Anything that doesn't already have meaning can be used. The syntax is fine without any separator as well, but would be hard to read.
  2. We could also allow all{type} => allb*{type}, but that may be too ambiguous
  3. Alternatively, all could be removed when using {}. So mergeba{type} instead of mergeallba{type}, b{type} instead of allb{type} etc. Syntactically, both are equivalent since {} can never be used without all, but I'm on the fence about which is more readable. In this new syntax, the delimiter can simply be read as "of each" to quickly understand the selector, i.e. ba{type} can be read as "best audio of each type"
  4. We may want to allow multiple fields inside {} in the future. Eg: allbestaudio{type,language} => best streams of each (type, language) pair. But I don't have any practical use for this rn. So there is no need to currently implement it.

This still addresses all my points from #389 (comment):

bestvideo -> best video of any stream type
best*.presentation -> best format (audio/video/pre-merged) of type presentation
bestvideo.screencast -> best video-only format of type screencast

becomes bestvideo, best*[type=presentation], bestvideo[type=screencast] respectively

  3. User should also have the option to select the best of each video stream (say bestvideo.all or allbestvideo)

bestvideo.all -> Best of each type of video-only streams, downloaded separately
mergebest*.all -> Best of each type of stream merged into 1 file

becomes allbestvideo{type}, mergeallbest*{type} respectively. The advantage here is that type can be replaced by other fields like language, ext etc

now don't have any selector to select "pre-merged format with all stream types"

yt-dlp currently assumes (wrongly) that each format contains at most one video and one audio stream. The lack of a syntax to select this is a result of this assumption. We cannot really address it in this issue

best.all -> best pre-merged format of each stream type
best*.all -> Best of each type of stream, whether audio, video or pre-merged
all.screencast -> All formats of type screencast
Note that all before and after the . has different meanings

becomes allbest{type}, allbest*{type}, all[type=screencast] respectively. While this syntax is longer than the original proposal, it is more flexible and (hopefully) easier to understand


#3562 implements a subset of this, though its current syntax is not fully compatible with my suggestion
cc @Lesmiscore

@coletdjnz
Member

Panopto extractor would benefit from this too (similar to Mediasite).

I should also note that the streams may not start/end all at the same time.

This can be seen commonly with Panopto (e.g. the audio stream may start a little after the video stream begins). I've also seen cases where video streams end and start throughout, overlapping or not.

Panopto provides timing data which is used in the web browser for syncing the streams.

@pukkandan
Member

I should also note that the streams may not start/end all at the same time.

This can be seen commonly with Panopto (e.g. the audio stream may start a little after the video stream begins). I've also seen cases where video streams end and start throughout, overlapping or not.

Panopto provides timing data which is used in the web browser for syncing the streams.

This seems to be unrelated to this issue. We can create some field similar to downloader_options/_ffmpeg_args in the infodict and make MergerPP use it

@Lesmiscore
Contributor

That looks like a good idea.

@fstirlitz
Contributor Author

I don’t see a reason why a host couldn’t ever serve multiple different feeds of the same ‘type’. I think feeds ought to be distinguished based on their identity, not classification into a few rigid categories. So you need to distinguish between which format belongs to which feed, and only then describe what the feeds contain and how they relate to each other. For example, to be able to warn when the user chooses to download a video feed and a dubbed audio feed without also getting the forced subtitle paired with the dub.

Speaking of subtitles, the current situation with subtitles seems to be very similar, except worse, because subtitles have no format IDs or selectors; you can choose the language or container format of the subtitle, but if there happen to be multiple subtitles with the same language and container, there is no way to distinguish them. For this reason, I think subtitle streams should be folded into format selection, and the existing subtitle command-line options translated into modifying the format selector: --sub-langs and --sub-format set filters for the feed type, --embed-subs is translated into +, --write-subs into ,. On the other hand, if subtitles are simply dumped into formats, this may pose incompatibilities with external software reading the info_dict directly. (Though if we don’t mind that, maybe also change tcodec to scodec before it’s too late.)

I should also note that the streams may not start/end all at the same time.

Ouch. I’m not even sure which container formats support such skew-synchronized streams, if any.

@pukkandan
Member

I don’t see a reason why a host couldn’t ever serve multiple different feeds of the same ‘type’. I think feeds ought to be distinguished based on their identity, not classification into a few rigid categories. So you need to distinguish between which format belongs to which feed, and only then describe what the feeds contain and how they relate to each other.

I don't quite understand this part. Could you elaborate? Especially, what you mean by "feed" and "identity" in this context

@pukkandan
Member

pukkandan commented Apr 29, 2022

(Though if we don’t mind that, maybe also change tcodec to scodec before it’s too late.)

Sure, I can do that. But it's not really important imo. The tcodec field is only meant for internal use. We don't propagate it into user space (other than parse_codecs arguably being part of the API)

@fstirlitz
Contributor Author

The definitions I use here are:

  • feed: specific time-synchronised single-medium content served by the host
  • format: a stream containing some feeds, encoded in a specific codec with specific settings (resolution, sampling/frame rate, psy(choperceptive) op(timization) level), served in a specific container over a specific protocol

The point is that feeds have identity beyond their describing metadata, and that the extractor must be able to declare: these are the available feeds, and those are the formats that serve them, instead of simply classifying formats into fixed-in-advance buckets and hoping no host will ever serve two simultaneous screencasts from different devices, or serve each person attending a conference call as a separate video/audio feed. This is basically already the situation with subtitles, as they cannot be distinguished beyond language and format. Adding more buckets to categorise feeds may ameliorate the issue, but not truly solve it.
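
To make the feed/format distinction concrete, here is one way the two definitions could be modelled. This is only a sketch: the class names, field names and example identifiers are all hypothetical and do not correspond to anything in yt-dlp.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Feed:
    """One time-synchronised, single-medium piece of content served by the host."""
    feed_id: str          # identity: unique per feed, even if metadata coincide
    medium: str           # e.g. 'video', 'audio', 'subtitle'
    kind: str = ''        # descriptive metadata, e.g. 'screencast'
    language: str = ''


@dataclass
class Format:
    """A downloadable stream carrying one or more feeds in a given encoding."""
    format_id: str
    feed_ids: list = field(default_factory=list)
    ext: str = ''


def formats_by_feed(formats):
    """Map each feed id to the formats that serve it, preserving order."""
    index = {}
    for fmt in formats:
        for feed_id in fmt.feed_ids:
            index.setdefault(feed_id, []).append(fmt.format_id)
    return index


feeds = [
    Feed('screen-a', 'video', kind='screencast'),
    Feed('screen-b', 'video', kind='screencast'),  # same kind, different identity
]
formats = [
    Format('a-720', ['screen-a'], 'mp4'),
    Format('a-1080', ['screen-a'], 'mp4'),
    Format('b-1080', ['screen-b'], 'mp4'),
]
print(formats_by_feed(formats))
```

The two screencasts share every descriptive field yet remain distinct feeds, which is the case a purely type-based classification cannot express without inventing a new type per feed.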

@pukkandan
Member

instead of simply classifying formats into fixed-in-advance buckets

I did not mean that types should be fixed in advance. It will have to be extractor-specific

no host will ever serve two simultaneous screencasts from different devices

Isn't that just this? f1: type=screencast1, f2: type=screencast2.

This is basically already the situation with subtitles, as they cannot be distinguished beyond language and format. Adding more buckets to categorise feeds may ameliorate the issue, but not truly solve it.

I agree on the issue with subtitles, but I don't see how the same issue exists here. What am I missing? 🤔

@pukkandan
Member

Moved to #4846

Projects
Status: format selection

5 participants