Downloader for slide streams #343

fstirlitz · 2021-05-23T19:33:22Z

Before submitting a pull request make sure you have:

At least skimmed through adding new extractor tutorial and youtube-dl coding conventions sections
Searched the bugtracker for similar pull requests
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense

What is the purpose of your pull request?

Improvement
New feature

This implements a downloader for slideshow-like streams and uses it to download slides from Mediasite presentations and also thumbnail streams found in DASH manifests.

The hardest part was probably picking the format for the downloaded slides. I needed a format that is:

self-contained (in a single file)
easy to generate
can losslessly accommodate stand-alone images in disparate formats in a single file
can be readily opened by widely available software

Given these constraints, my only option was… MHTML. This is admittedly an unorthodox choice for a media format, and it should be noted that basically the only major software that supports it now are Blink-based browsers. Nevertheless, thanks to the HTML stub included in the file, the slideshow in the file can be readily viewed in those. I would have even included an interactive player, if it weren’t for the fact that Chromium doesn’t run scripts in MHTML files, so it would have been a considerable effort to implement something that wouldn’t even be usable; thus, I limited myself to including some basic CSS to make the slideshow passively viewable in a browser. Should the user want to extract the images, they will have no problem finding software libraries to process MIME multipart messages.

This addresses:

pukkandan · 2021-05-23T20:53:01Z

For the time-being, please add a from __future__ import unicode_literals to the new file so that the rest of the tests can run. That line can be removed later once I remove that test

pukkandan · 2021-05-23T21:27:00Z

I would have even included an interactive player, if it weren’t for the fact that Chromium doesn’t run scripts in MHTML files, so it would have been a considerable effort to implement something that wouldn’t even be usable;

It can be done with just css animations by setting the animation times using duration. Obviously, I'm not saying you should add it

Given these constraints, my only option was… MHTML

MHTML is fine imo. The other options would be:

mjpeg: But I dont think there is any motion-png. So we'll need to convert everything to jpeg which is not the greatest idea
Normal HTML. It would be possible to base64 encode all the images. The only advantage over MHTML would be better compatibility. (Why can't firefox just support mhtml 😢)
zip: Many image viewers, pdf viewers and video players support zip/tar files that contain images. One such relatively popular format is cbz. EPUB is also this idea, but using html instead of just images. But these wont give us the flexibility to add animation in the future
PPTS: OpenXML Presentation can also be used in theory, though it is quite complex to generate

I'm not very familiar with mhtml specs, so expect a full review to take a while

fstirlitz · 2021-05-24T04:01:01Z

mjpeg: But I dont think there is any motion-png. So we'll need to convert everything to jpeg which is not the greatest idea

There is, it’s called MNG. It even has support for JPEG compression too. But it would require re-framing JPEG files into JNG chunks, would be considerably hard to generate and it’s not very widely supported.

Before settling on MHTML, I considered putting JPEGs into an MPEG-4 container (complicated to generate), an AVI container (generating RIFF chunks requires knowing the length beforehand, so it would require a second pass over the file), multipart/x-mixed-replace (doesn’t allow specifying durations, limited to MJPEG in practice) and using the concat FFmpeg muxer to generate the file (would only support a single format).

HTML with base64-encoded images might have worked just as well or even better (at least we’d have scripting back), but the ⁴⁄₃ overhead…

Zirro · 2021-05-24T15:14:43Z

There is, it’s called MNG. It even has support for JPEG compression too. But it would require re-framing JPEG files into JNG chunks, would be considerably hard to generate and it’s not very widely supported.

There's also APNG - an extension to the PNG format proposed as an alternative to MNG - which has (after way too many years) managed to achieve widespread support in modern browsers and FFmpeg. It sounds like it might be just what you're looking for.

fstirlitz · 2021-05-24T16:15:47Z

APNG would require not only re-framing, but also re-encoding, which I specifically wanted to avoid.

Zirro · 2021-05-24T17:29:21Z

From pukkandan's earlier comment - "But I dont think there is any motion-png. So we'll need to convert everything to jpeg which is not the greatest idea" - I was under the impression that ideal format would be lossless (in addition to the other requirements). Whether it makes the conversion more challenging to perform on the technical level, I would've expected FFmpeg to be able to handle it. However, I recognise that there likely are intricacies to this which are unfamiliar to me.

pukkandan

I have concerns regarding implementing slides as a "video format" (ie. acodec=none, vcodec=jpeg)

This interferes with the format selection. If the user gives -f wv (or even worse -f wv+wa), they are definitely not requesting the slides. Similar issue exists for using -S
Merger should not try to merge slides with video/audio. Similarly, other postprocessors should also recognize this as not being a standard video/audio format and not try to add metadata, recode, remux etc.

I can think of 2 solutions this to this:

Add a new key in the infodict for this, similar to thumbnails/subtitles and corresponding cli --write option. This is the easiest solution since we don't need to modify any of the existing code
Add a new category slides(?) similar to video/audio. This would however require changes to the format selector code, and most (if not all) postprocessors

Whichever solution we decide to implement, it would be helpful if it is easily expandable in the future to allow features like ytdl-org/youtube-dl#14951 without adding yet another type of data

I would also like the user to be able to easily download the images separately without them being embedded into a single mhtml file. This is already somewhat accomplished by using --keep-fragments, but without the correct file extension. This feature is not necessary to be implemented within this PR though.

yt_dlp/downloader/mhtml.py

yt_dlp/extractor/common.py

yt_dlp/extractor/mediasite.py

fstirlitz · 2021-06-03T08:01:09Z

Doesn’t a negative preference/quality value mitigate problems with format selection? This is actually part of a more general problem, which Mediasite already hits with some regularity: we don’t handle sites which offer multiple independent, but synchronised streams very well. If I understand correctly, #347 likewise ultimately boils down to this.

Add a new key in the infodict for this, similar to thumbnails/subtitles and corresponding cli --write option. This is the easiest solution since we don't need to modify any of the existing code

I actually thought about doing the opposite and integrating subtitles into the formats list. These are after all synchronised with the other media streams and can be considered a first-class alternative medium of presentation. It’s exactly how FFmpeg treats them, too. (Poster thumbnails, on the other hand, don’t fit into this; they are more akin to metadata tags.)

With regard to postprocessor support, this would not have been a problem had I generated an MJPEG stream encapsulated in a Matroska or MP4 container, as I originally planned. Which I think only emphasises that slide sequences should be considered first-class media streams like the others.

This downloader is intended to be used for streams that consist of a timed sequence of stand-alone images, such as slideshows or thumbnail streams.

pukkandan · 2021-06-07T09:28:32Z

Doesn’t a negative preference/quality value mitigate problems with format selection?

When using best, yes. But not when using worst

This is actually part of a more general problem, which Mediasite already hits with some regularity: we don’t handle sites which offer multiple independent, but synchronised streams very well. If I understand correctly, #347 likewise ultimately boils down to this.

#347 is a different issue. There, the extractor simply ignores /sub or /dub part of the URL and extracts everything. I have a pretty good idea how it can be fixed, but don't have any means of testing it

Back to our issue; As I understand, multi_video is meant precisely to handle completely seperate content that is served together

Add a new key in the infodict for this, similar to thumbnails/subtitles and corresponding cli --write option. This is the easiest solution since we don't need to modify any of the existing code

I actually thought about doing the opposite and integrating subtitles into the formats list. These are after all synchronised with the other media streams and can be considered a first-class alternative medium of presentation. It’s exactly how FFmpeg treats them, too. (Poster thumbnails, on the other hand, don’t fit into this; they are more akin to metadata tags.)

If we were to consider subtitles/slides as "formats", we are effectively expanding the scope of yt-dlp to allow downloading of subs and slides without any associated video/audio. Not that this is necessarily a bad thing, but is something to think about when deciding on which method to use.

Also, if we were to treat slides as normal formats, the mediastream implementation is technically incorrect since the slides and the video are infact totally seperate content (unlike the images from mpd) and should therefore be treated as a seperate video (using multi_video)

I am still leaning towards treating slides seperately than formats if only just for the fact that it is easier to implement and since I don't see any practical benefit to making them first-class citizens.

Btw, I've directly pushed some small changes since I thought it would be easier than reviewing. If you dislike any of these changes, let me know

pukkandan · 2021-06-07T09:30:36Z

Also, I already told you this in another PR, but please try to avoid force-pushing once the review process has started. It makes it harder for me to keep track of the changes you are making

yt_dlp/extractor/mediasite.py

fstirlitz · 2021-06-07T10:36:46Z

From the documentation:

   _type "multi_video" indicates that there are multiple videos that
   form a single show, for examples multiple acts of an opera or TV episode.
   It must have an entries key like a playlist and contain all the keys
   required for a video at the same time.

This implies that multi_video is for streams meant to be played in sequence (i.e. concatenated). Mediasite, however, serves multiple streams and plays them simultaneously in sync with each other (i.e. multiplexed). As such, it is meaningful (note: this is different from ‘technically possible’) to want to merge them into a single file with the + operator, unlike with the use case mentioned above for multi_video. It’s not much different from offering a single video with multiple alternative audio tracks. In fact, this was the initial use case for ytdl-org/youtube-dl#6454. That site has since switched video hosts, and there is no longer a separate slide or alternative audio stream; but when it did, there was a video-only slides stream and an audio-only translation stream in addition to the main video showing the speaker speaking their mother tongue. Merging them allowed the user to switch video and audio streams in the player on the fly. Had I chosen a different container format, it would have been possible here, too.

Lesmiscore · 2021-06-07T13:54:06Z

@pukkandan

Add a new key in the infodict for this, similar to thumbnails/subtitles and corresponding cli --write option. This is the easiest solution since we don't need to modify any of the existing code

I agree to this idea. I think there should be a mode for writing out slides/storyboards, since they are absolutely different.

Edit: probably not absolutely, but different though

fstirlitz · 2021-06-07T15:12:40Z

Another problem with the ‘separate key’ approach is that since storyboards are often embedded in DASH streaming manifests (and I guess they may also appear in other streaming manifests), supporting them uniformly everywhere might require the same kind of API churn that #247 had to involve, with all the _extract_*_formats methods renamed to _extract_*_formats_and_subtitles and warnings when the old API had to drop streams.

pukkandan · 2021-06-08T23:09:09Z

Mediasite, however, serves multiple streams and plays them simultaneously in sync with each other (i.e. multiplexed). As such, it is meaningful (note: this is different from ‘technically possible’) to want to merge them into a single file with the + operator, unlike with the use case mentioned above for multi_video

This is a good point. But I am not sure how to improve this situation (without massive changes ofc). If you have any suggestions, we can discuss it (in a new issue preferably)

Another problem with the ‘separate key’ approach is that since storyboards are often embedded in DASH streaming manifests (and I guess they may also appear in other streaming manifests), supporting them uniformly everywhere might require the same kind of API churn that #247 had to involve, with all the extract_formats methods renamed to extract_formats_and_subtitles and warnings when the old API had to drop streams.

This can be worked around by letting the extractor add the storyboards as just another format and then plucking it out into a new key. I had the same idea for subtitles, but I never mentioned it since you had already replaced the functions

But in that case, it may be cleaner to just directly treat it as a format in that case, although it is a lot more work. Assuming we are going by that approach, how would we handle an option to download the images seperately? With the other approach, I was thinking something like --write-slides --slide-format [mhtml|image].

With the "formats method", we would need 2 separate formats and separate selectors (eg: bestslides, bestimages) for this, no? Or, perhaps we should fold this feature into --keep-fragments and just ensure the fragments have correct thumbnail?

Necessary for #343. * They are identified by `vcodec=acodec='none'` * These formats show as the worst in `-F` * Any postprocessor that expects audio/video will be skipped * `b*` and all related selectors will skip such formats * This commit also does not add any selector for downloading such formats. They have to be explicitly requested by the `format_id`. Implementation of a selector is left for when #389 is resolved

pukkandan · 2021-06-12T22:49:30Z

I have added minimum support for images format in 8326b00. With this, formats with acodec and vcodec = 'none' will be treated as images. The selectors best, worst, best* etc will never select such a format. The user will have to explicitly specify it with the format id. We can make proper selectors for this when we address #389

With that, this should be now ready for merge. I'll wait for your reply however in case you want to make any other changes

yt_dlp/downloader/mhtml.py

yt_dlp/extractor/mediasite.py

fstirlitz · 2021-06-13T07:04:23Z

A priori, acodec and vcodec both being none doesn’t necessarily have to mean an image stream, so that representation is not ideal. And I would still rather think of them as extremely bare-bones video streams. The preprocessor troubles may have been addressed simply by declaring mhtml is not a format supported by FFmpeg, and the selectors could be made ignore negative-preference streams (or maybe some other sentinel value). But if it’s a temporary hack anyway… (famous last words)

pukkandan · 2021-06-13T09:34:40Z

A priori, acodec and vcodec both being none doesn’t necessarily have to mean an image stream

How so?

And I would still rather think of them as extremely bare-bones video streams.

How is it a barebones video stream? It is not a video format and no video player / video processing tool will recognize it as such

yt_dlp/extractor/mediasite.py

fstirlitz · 2021-06-13T12:49:55Z

MHTML is not a video format. But the image stream itself is not too hard to squeeze into one, as mentioned before. The DASH storyboards should be easy to mux into an MJPEG stream in an MP4 container using FFmpeg’s concat pseudo-demuxer; perhaps we may add that some day. (Mediasite slideshows too, but at the cost of some lossiness.) It would be strange to consider storyboards both a video and a non-video stream, depending on the format it’s downloaded as. (And FFmpeg considers even still images to be ‘video’ formats.)

Subtitles were mentioned before as another timed media stream that is neither audio or video. FFmpeg also supports abstract timed data streams (TTML subtitles are currently only parsed as those). Though subtitles are already special-cased and I doubt we might ever deal with more abstract data, the inference ‘if it’s neither audio nor video, then it’s an image sequence’ does not seem particularly, ahem, sound.

pukkandan · 2021-06-13T13:01:31Z

the inference ‘if it’s neither audio nor video, then it’s an image sequence’ does not seem particularly, ahem, sound.

fair enough, but a more generalized implementation will need to be rewritten again when implementing #389. Hence why I don't want to put any more effort into it

fstirlitz · 2021-06-13T13:26:35Z

Okay then. Like I said, it’s not that bad, especially for a stop-gap measure.

This downloader is intended to be used for streams that consist of a timed sequence of stand-alone images, such as slideshows or thumbnail streams This can be used for implementing: ytdl-org/youtube-dl#4974 (comment) ytdl-org/youtube-dl#4540 (comment) ytdl-org/youtube-dl#11185 (comment) ytdl-org/youtube-dl#9868 ytdl-org/youtube-dl#14951 Authored by: fstirlitz

Fixes: ytdl-org/youtube-dl#4974 (comment) ytdl-org/youtube-dl#4540 (comment) ytdl-org/youtube-dl#11185 (comment) Authored by: fstirlitz

pukkandan · 2021-06-13T15:14:21Z

I have merged this as 2 commits - one for the downloader, and another for mediasite

fstirlitz · 2021-06-13T16:23:14Z

Postponing DASH then?

pukkandan · 2021-06-13T17:09:49Z

No, it is in cdb19aa

I only split out mediasite coz it is extractor specific and not really part of the core feature

pukkandan · 2021-06-13T17:20:59Z

I should have mentioned the DASH storyboards in the commit message...

I tried adding a git-notes to it, but github doesn't seem to support displaying it anywhere

Necessary for yt-dlp#343. * They are identified by `vcodec=acodec='none'` * These formats show as the worst in `-F` * Any postprocessor that expects audio/video will be skipped * `b*` and all related selectors will skip such formats * This commit also does not add any selector for downloading such formats. They have to be explicitly requested by the `format_id`. Implementation of a selector is left for when yt-dlp#389 is resolved

This downloader is intended to be used for streams that consist of a timed sequence of stand-alone images, such as slideshows or thumbnail streams This can be used for implementing: ytdl-org/youtube-dl#4974 (comment) ytdl-org/youtube-dl#4540 (comment) ytdl-org/youtube-dl#11185 (comment) ytdl-org/youtube-dl#9868 ytdl-org/youtube-dl#14951 Authored by: fstirlitz

Fixes: ytdl-org/youtube-dl#4974 (comment) ytdl-org/youtube-dl#4540 (comment) ytdl-org/youtube-dl#11185 (comment) Authored by: fstirlitz

pukkandan reviewed Jun 2, 2021

View reviewed changes

fstirlitz force-pushed the slideshows branch 2 times, most recently from 6639111 to 387feda Compare June 3, 2021 08:46

fstirlitz added 3 commits June 5, 2021 21:57

[downloader/mhtml] New downloader

2089ea5

This downloader is intended to be used for streams that consist of a timed sequence of stand-alone images, such as slideshows or thumbnail streams.

[mediasite] Extract slides

c5df915

[common] Extract thumbnail stream from DASH manifests

89124e6

fstirlitz force-pushed the slideshows branch from 387feda to 89124e6 Compare June 5, 2021 19:57

pukkandan added 5 commits June 7, 2021 12:48

Merge branch 'master' into slideshows

2eaa599

Remove "Unknown MIME" warnings from tests

407878b

Remove _download_params

5f323f1

Add msec to duration

bda743f

escapeHTML is a utility

4b87c0e

[cleanup]

891ab61

pukkandan force-pushed the slideshows branch from 8b6fd71 to 891ab61 Compare June 7, 2021 09:29

fstirlitz commented Jun 7, 2021

View reviewed changes

yt_dlp/extractor/mediasite.py Outdated Show resolved Hide resolved

fstirlitz mentioned this pull request Jun 9, 2021

[Feature Request] Sites offering alternative simultaneous media streams #389

Closed

[mhtml] real_download should return True on success

58ca4c9

Remove vcodec

a662714

Merge remote-tracking branch 'origin/master' into slideshows

7e35452

fstirlitz commented Jun 13, 2021

View reviewed changes

yt_dlp/downloader/mhtml.py Show resolved Hide resolved

fstirlitz commented Jun 13, 2021

View reviewed changes

yt_dlp/extractor/mediasite.py Outdated Show resolved Hide resolved

pukkandan reviewed Jun 13, 2021

View reviewed changes

yt_dlp/extractor/mediasite.py Outdated Show resolved Hide resolved

The images are not necessarily jpeg

b4fa676

pukkandan pushed a commit that referenced this pull request Jun 13, 2021

[mediasite] Extract slides (#343)

3fd4c2a

Fixes: ytdl-org/youtube-dl#4974 (comment) ytdl-org/youtube-dl#4540 (comment) ytdl-org/youtube-dl#11185 (comment) Authored by: fstirlitz

pukkandan closed this Jun 13, 2021

fstirlitz deleted the slideshows branch July 11, 2021 14:06

nixxo pushed a commit to nixxo/yt-dlp that referenced this pull request Nov 22, 2021

[mediasite] Extract slides (yt-dlp#343)

e00a43f

Fixes: ytdl-org/youtube-dl#4974 (comment) ytdl-org/youtube-dl#4540 (comment) ytdl-org/youtube-dl#11185 (comment) Authored by: fstirlitz

pukkandan mentioned this pull request Dec 19, 2021

Add option to download youtube storyboard as jpg #2048

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Downloader for slide streams #343

Downloader for slide streams #343

fstirlitz commented May 23, 2021 •

edited

pukkandan commented May 23, 2021

pukkandan commented May 23, 2021 •

edited

fstirlitz commented May 24, 2021

Zirro commented May 24, 2021

fstirlitz commented May 24, 2021

Zirro commented May 24, 2021

pukkandan left a comment

fstirlitz commented Jun 3, 2021 •

edited

pukkandan commented Jun 7, 2021

pukkandan commented Jun 7, 2021

fstirlitz commented Jun 7, 2021

Lesmiscore commented Jun 7, 2021 •

edited

fstirlitz commented Jun 7, 2021

pukkandan commented Jun 8, 2021

pukkandan commented Jun 12, 2021

fstirlitz commented Jun 13, 2021

pukkandan commented Jun 13, 2021

fstirlitz commented Jun 13, 2021

pukkandan commented Jun 13, 2021

fstirlitz commented Jun 13, 2021

pukkandan commented Jun 13, 2021

fstirlitz commented Jun 13, 2021

pukkandan commented Jun 13, 2021

pukkandan commented Jun 13, 2021

Downloader for slide streams #343

Downloader for slide streams #343

Conversation

fstirlitz commented May 23, 2021 • edited

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

What is the purpose of your pull request?

pukkandan commented May 23, 2021

pukkandan commented May 23, 2021 • edited

fstirlitz commented May 24, 2021

Zirro commented May 24, 2021

fstirlitz commented May 24, 2021

Zirro commented May 24, 2021

pukkandan left a comment

Choose a reason for hiding this comment

fstirlitz commented Jun 3, 2021 • edited

pukkandan commented Jun 7, 2021

pukkandan commented Jun 7, 2021

fstirlitz commented Jun 7, 2021

Lesmiscore commented Jun 7, 2021 • edited

fstirlitz commented Jun 7, 2021

pukkandan commented Jun 8, 2021

pukkandan commented Jun 12, 2021

fstirlitz commented Jun 13, 2021

pukkandan commented Jun 13, 2021

fstirlitz commented Jun 13, 2021

pukkandan commented Jun 13, 2021

fstirlitz commented Jun 13, 2021

pukkandan commented Jun 13, 2021

fstirlitz commented Jun 13, 2021

pukkandan commented Jun 13, 2021

pukkandan commented Jun 13, 2021

fstirlitz commented May 23, 2021 •

edited

pukkandan commented May 23, 2021 •

edited

fstirlitz commented Jun 3, 2021 •

edited

Lesmiscore commented Jun 7, 2021 •

edited