Filter YouTube livestream recordings which are not fully processed #26290
Description
Immediately after a sufficiently long livestream on YouTube finishes, the full recording is not universally available. Generally only the final 2 hours can be viewed, though the duration can be longer depending on the settings for the stream, or shorter if the stream goes offline and back online again. Eventually, some processing occurs on YouTube's side and the full video is re-encoded as a single video rather than an m3u8 playlist. Often this processing is done in minutes, but in extreme cases I have seen it take over a week. I have heard that the full livestream can be viewed immediately on mobile, but (not owning a mobile device to test with) I have not found any way to get youtube-dl to grab one of those versions.
If an example is needed for testing, https://www.youtube.com/watch?v=LCt3b5updPQ is not yet processed and it will probably take a while, but due to the nature of the issue any one link won't work forever. I can provide more examples as they are needed or they can easily be found browsing through recent YouTube livestreams. Every YouTube livestream for which the recording is published goes through this process, though only sufficiently long ones will be clipped.
For automated archival purposes, it is obviously desirable to skip downloading versions which are not a full recording of the stream. In particular, if `--download-archive downloaded.txt` is used, the incomplete recordings will still be silently downloaded and added to `downloaded.txt`, preventing future downloads from getting the full version. Hence a method for handling unprocessed streams (especially those for which the full recording is not yet available) is needed. Here are some approaches to this filtering inside youtube-dl that don't quite work:

- `--match-filter "!is_live"` does not work at all (though I think it used to?). It will exclude currently live streams, but not those which are finished but not yet processed.
- `--format "(bestvideo+bestaudio/best)[protocol!*=m3u8]"` does not work because the video and/or audio feeds don't show up as m3u8 protocol for some reason; instead they generally return a null protocol. I am not sure if this is a bug.
- `--format "(best)[protocol!*=m3u8]"` does work, as long as you are fine with the reduced quality that results. In my case this is not a good option. The streams I am getting are primarily audio focused, so to conserve bandwidth and storage I am presently using `--format "(bestvideo[height<=360]+bestaudio/best[height<=360])"` in order to get the highest quality audio with just a passable video. If I switched from `bestvideo[height<=360]+bestaudio` to `best` in order to filter by protocol, the filesizes would be extremely large, and if I used `height<=360` the audio quality would generally be lower.
- `--match-filter "duration>1"` was a hack that previously worked perfectly, as long as you aren't worried about missing actual 1-second-long videos. For whatever reason, unprocessed streams always used to show up as 1s long. Unfortunately, recent changes in YouTube's pipeline seem to have broken this hack. In particular, after a stream ends, youtube-dl now sometimes finds an audio track which returns the proper full duration of the livestream. As above, this hack will work if you use `--format best`, but will sometimes fail if you use `--format (bestvideo+bestaudio)`. To be clear, if you actually do download that track, it will not be the full length, just the clipped (2 hour) length, but if you do a `--get-duration` or `--write-info-json` the full stream duration will be there.

Note: the audio track issue described here is not consistent. It is transient and probably server dependent. Multiple nearly-simultaneous calls to `--get-duration` will return 1s or the full duration seemingly at random. As a result, it is very hard to provide a live example of this, but I have seen it about 10 times in the past week (out of ~100 livestream downloads). It definitely isn't some one-of-a-kind occurrence, but I don't really know how a developer would be able to reproduce it on their end without just trying a bunch of recently finished livestreams.

I tried pretty much everything I could think of (including a bunch of other things not listed), but nothing seems to work flawlessly right now. If there is some way within youtube-dl to implement this filtering and I have stupidly missed it, I will be ecstatic to learn how to do so.
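For anyone investigating the null-protocol behaviour, here is roughly how I inspect what the extracted formats actually report. This is only a sketch based on the `formats`/`protocol`/`fragments` fields I see in current `--write-info-json` output, and the sample dicts below are hypothetical minimal shapes, not real extractor output:

```python
def summarize_formats(info):
    """Return (format_id, protocol, fragmented?) for each extracted format."""
    return [
        (
            fmt.get("format_id"),
            fmt.get("protocol"),          # often None on unprocessed streams
            bool(fmt.get("fragments")),   # a fragments list => still clipped
        )
        for fmt in info.get("formats", [])
    ]

def is_unprocessed(info):
    # Heuristic: any m3u8 or fragmented format suggests the stream has not
    # yet been re-encoded into a single video on YouTube's side.
    return any("m3u8" in (proto or "") or fragmented
               for _, proto, fragmented in summarize_formats(info))

# Hypothetical minimal info dicts, for illustration only:
processed = {"formats": [{"format_id": "22", "protocol": "https"}]}
unprocessed = {"formats": [{"format_id": "95", "protocol": None,
                            "fragments": [{"url": "frag0.ts"}]}]}
```

Checking `fragments` directly sidesteps the null-protocol problem described above, since a format can be fragmented even when its `protocol` field comes back empty.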
It is possible to work around all of this, but only with a fair amount of work. Currently, my approach is to run a `--get-id` check for the playlist/channel in question. Then, looping over the returned IDs, run a `--write-info-json --skip-download` command first and parse that JSON file to check whether the recording is safe to download (i.e. a single recording rather than a collection of fragments). Finally, if the video is identified as safe, it is downloaded and recorded to `downloaded.txt`. (You should do it this way rather than in just two commands, downloading all the JSON files first and then all the video files, because the operation is time-sensitive; if the JSON files are all downloaded first, enough time can pass after the first few video downloads that the data has changed on YouTube's end.) Note that this is still not ideal. In addition to adding code complexity, it can still fail if, for example, the call to download the JSON and the call to download the actual recording go to different servers which are not synchronized. I have not seen it fail yet, but given enough time it surely will. Simply doing the filtering inside youtube-dl is obviously superior and should not be that difficult. This method is also slower, and since it requires more youtube-dl calls it might increase the risk of getting disconnected by YouTube's servers (not sure on that).

Hence I'm requesting that one of these be implemented (ordered based on how easy I suspect they would be to accomplish):
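Sketched in Python, the per-ID loop looks something like this. The channel URL is a placeholder, youtube-dl is assumed to be on `PATH`, and `looks_complete` is just my fragmentation heuristic, not anything official (`-j` is youtube-dl's `--dump-json`, which avoids writing the JSON to disk):

```python
import json
import subprocess

CHANNEL_URL = "https://www.youtube.com/c/EXAMPLE"  # hypothetical channel

def looks_complete(info):
    # Heuristic: a processed recording has no fragmented or m3u8 formats.
    return not any("m3u8" in (fmt.get("protocol") or "") or fmt.get("fragments")
                   for fmt in info.get("formats", []))

def archive_channel():
    # Step 1: collect the video IDs for the playlist/channel.
    ids = subprocess.run(
        ["youtube-dl", "--get-id", CHANNEL_URL],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    for vid in ids:
        url = "https://www.youtube.com/watch?v=" + vid
        # Step 2: fetch this video's metadata immediately before deciding,
        # since the data is time-sensitive.
        info = json.loads(subprocess.run(
            ["youtube-dl", "-j", url],
            capture_output=True, text=True, check=True,
        ).stdout)
        # Step 3: only download (and record in downloaded.txt) if it is safe.
        if looks_complete(info):
            subprocess.run(
                ["youtube-dl", "--download-archive", "downloaded.txt", url],
                check=True,
            )
```

The metadata fetch and the download for each video are kept adjacent on purpose; interleaving them is what makes the race window as small as it can be from outside youtube-dl.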
1. A filter field (something like `audio/video_fragments > 1`) which is true for unprocessed YouTube livestreams and false both for processed ones and for YouTube videos which were not livestreamed, regardless of the download format. As far as I know, simply checking whether the file is fragmented generally works, but there could be edge cases I'm not aware of. Or maybe `(bestvideo+bestaudio)[protocol!*=m3u8]` not working is a bug and that should just be fixed.
2. A filter field (something like `is_complete`) which is true for unprocessed YouTube livestreams if and only if the full video is available, and is always true for processed livestreams and other YouTube videos. One way to do this would be the following: both the true and cut durations are visible on YouTube in a browser (for example, the full duration shows up on thumbnails in the bottom right corner, while the clipped duration shows up on the actual video player). These could be compared, and if the difference is more than a few seconds the filter is triggered. I do not know how to automate getting these numbers, though.

(Aside: While I would be happy with either approach, filtering based on fragmentation and filtering based on completeness each have advantages. If download speed is a concern, the recording will download much faster after it is processed. The resulting filesize is also generally smaller. However, if the goal is to always have an up-to-date archive, then it is better to download complete livestream recordings even if they are fragmented, and perhaps re-encode them locally. I would certainly not complain if both methods were available!)
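For what it's worth, something close to option 1 can already be approximated by embedding youtube-dl, since `YoutubeDL` accepts a `match_filter` callable (return `None` to allow the download, or a reason string to skip). This is only a sketch using my fragmentation heuristic, with a hypothetical URL, and I haven't battle-tested it:

```python
def skip_unprocessed(info):
    """match_filter callable: None = download, string = reason to skip."""
    for fmt in info.get("formats", []):
        if "m3u8" in (fmt.get("protocol") or "") or fmt.get("fragments"):
            return "livestream recording not yet processed (still fragmented)"
    return None

# Passed to youtube-dl's embedded API, e.g.:
#   import youtube_dl
#   opts = {
#       "format": "bestvideo[height<=360]+bestaudio/best[height<=360]",
#       "download_archive": "downloaded.txt",
#       "match_filter": skip_unprocessed,
#   }
#   with youtube_dl.YoutubeDL(opts) as ydl:
#       ydl.download(["https://www.youtube.com/watch?v=EXAMPLE"])  # hypothetical
```

This only helps people willing to write Python, though; exposing the same check through `--match-filter` on the command line is what I'm actually asking for.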