A better SponsorBlock #360

nihil-admirari · 2021-05-29T19:20:24Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

At least skimmed through adding new extractor tutorial and youtube-dl coding conventions sections
Searched the bugtracker for similar pull requests
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

Changes are similar in purpose to SponSkrub, but the implementation is completely different.

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

This is a complete replacement of SponSkrub. New features:

Proper cut implemented using concat FFmpeg demuxer, mentioned in Trim video without re-encoding faissaloo/SponSkrub#32. According to https://sattlers.org/2019/09/30/ffmpeg-split-and-cut-video-segments/, small stutter/frame skipping mentioned in the issue can be fixed by forcing key frames, an option for which is also provided.
Implementation takes videoDuration from SponsorBlock responses into account.
Implementation is compatible with --skip-chapters.
Different chapter merge algorithm, which does not lead to "Chapter end time X before chapter start Y" reported in --keep-metadata sometimes causes 'Chapter end time X before start Y' faissaloo/SponSkrub#27.
SRT, ASS, and VTT subtitles (seems like FFmpeg does not fully support others) are cut along with the video, solving Subtitles are misaligned after cutting faissaloo/SponSkrub#21.
Fully cross-platform (SponSkrub still doesn't have macOS binaries).
Requires no external dependencies.
Written in Python, not in D, which makes contributing much easier.

pukkandan · 2021-05-29T19:47:38Z

I was also planning to rewrite the sponskrub integration. However, my plan was quite different from your implementation. Here is what I was thinking:

Don't touch sponskrub. Implement this independantly
Implement chapter merger as a utility and sponsorblock API to youtube extractor itself. This automatically takes care of the chapter mode. User can just do --add-metadata --use-sponsorblock. This also makes it so that the sponsorblock data is available in the info-json if needed
Write a concat postprocessor since it has other uses as well. See Add concat-playlist flag to merge videos from a playlist into one continuous file ytdl-org/youtube-dl#28153 and related issues
Add an option --remove-chapters that internally uses SplitChaptersPP and the concatPP to remove any chapter from the video. Now, using --remove-chapters sponsorblock.+ will take care of the sponskrub cut mode. This option automatically also allows user to specify the categories to be cut

Doing it this way would provide us with a lot more flexibility imo. Let me know what you think of this appoach

pukkandan · 2021-05-29T20:07:31Z

Add an option --remove-chapters that internally uses SplitChaptersPP and the concatPP to remove any chapter

Looking at your code in more detail, using your method of inpoint/outpoint is indeed a better approach than splitting and then concatting

nihil-admirari · 2021-05-30T13:57:53Z

Should I bring the entire SponSkrub implementation back, disable it by default and hide its options from help, or just create aliases --sponskrub -> --sponsorblock etc.?
Looks like SponsorBlock has plans to support other sites; server-side already accepts service query parameter, so moving SponsorBlock to YouTube extractor might not be a good idea.

I'll look at updating info-json.
Concat is a very raw operation; I'm afraid, a generic FFmpegConcatPostProcessor will have to handle SponsorBlock, handle playlists, and over time maybe other operations, which have only concatenation in common, making it difficult to maintain.

I can, however, move concat code to the base class FFmpegPostProcessor, so that if someone is going to implement ConcatPlaylistPostProcessor, he can reuse the implementation.
--remove-chapters sponsorblock.+ is not exactly what cut does. If there is an already existing Chapter with a sponsor in it, SponsorBlock will create chapters Chapter - 1, Sponsor, Chapter - 2. Cut however will simply decrease Chapter's duration, without splitting it into Chapter - 1, Chapter - 2, which --remove-chapters sponsorblock.+ implies.

Specifying categories to cut will take some time to implement, chapter merge algorithms need to be updated to handle it. Handling categories to cut would make removing arbitrary chapters possible, but to implement --remove-chapters, I would rather suggest to extract ChaptersPP from FFmpegMetadataPP (SponsorBlock already has to pass flags into it), and moving SponsorBlock code into ChaptersPP to have all chapter handling code in one place. What do you think about it?

pukkandan · 2021-05-30T14:34:53Z

Should I bring the entire SponSkrub implementation back, disable it by default and hide its options from help, or just create aliases --sponskrub -> --sponsorblock etc.?

Yes, leave sponskrub as-is for now. We'll worry about how to document it once this new implementation is complete

Looks like SponsorBlock has plans to support other sites; server-side already accepts service query parameter, so moving SponsorBlock to YouTube extractor might not be a good idea.

In that case, we should split the PP into SponsorBlockPP and RemoveChaptersPP. Then call the SponsorBlockPP as a pre-processor (like MetadataFromFieldPP)

Concat is a very raw operation; I'm afraid, a generic FFmpegConcatPostProcessor will have to handle SponsorBlock, handle playlists, and over time maybe other operations, which have only concatenation in common, making it difficult to maintain.

I am not familiar enough with ffmpeg to comment on this. I'll trust your judgement on this for now. If need be, we can generalize it later

I can, however, move concat code to the base class FFmpegPostProcessor, so that if someone is going to implement ConcatPlaylistPostProcessor, he can reuse the implementation.

Yes, please do so if possible

--remove-chapters sponsorblock.+ is not exactly what cut does. If there is an already existing Chapter with a sponsor in it, SponsorBlock will create chapters Chapter - 1, Sponsor, Chapter - 2. Cut however will simply decrease Chapter's duration, without splitting it into Chapter - 1, Chapter - 2, which --remove-chapters sponsorblock.+ implies.

It is though. Take this example: the video has chapters A, B with sponsor sections in the middle of A. The user calls yt-dlp with something like --sponsorblock --remove-chapters "sponsorblock-.+". So SponsorBlock preprocessor will first call the API and merge the chapter data to give A-1, sponsorblock-sponsor, A-2, B. This is saved in the infojson. Now RemoveChaptersPP will run and remove the chapters that match the regex causing the video to end up with the chapters A-1, A-2, B. Ofcourse, unless the user provides --add-metadata, this chapter info is not written to file

Specifying categories to cut will take some time to implement, chapter merge algorithms need to be updated to handle it.

I don't understand. Nothing needs to change in the chapter merger. It is simply that when we cut sections out, we cut off only the chapters matching the regex given by user.

I would rather suggest to extract ChaptersPP from FFmpegMetadataPP (SponsorBlock already has to pass flags into it), and moving SponsorBlock code into ChaptersPP to have all chapter handling code in one place

If needed, we could seperate the chapter code into its own function. However, this is not necessary for this PR. Once we put the modified chapter data in the infodict using SponsorBlock pre-processor, FFmpegMetadataPP will automatically take care of adding these chapters

nihil-admirari · 2021-05-30T20:22:23Z

Yes, leave sponskrub as-is for now.

Ok, I'll bring it back, and move to Deprecated Options.

Yes, please do so if possible

Ok, I'll do.

It is though.

SponSkrub does in fact implement the cut the way you just described, but the current implementation is different. SponSkrub implementation is easier from the developer perspective, but it is not necessarily better from the user perspective.

Please take a look at test_merge_with_sponsor_chapters_SponsorsWithinChapter. expected_no_cut has the original chapter split: ['c - 1', 's1', 'c - 2', 's2', 'c - 3', 's3', 'c - 4'], but expected_cut does not: ['c'], only the duration of the chapter has been reduced from 7 to 4.

I would rather want to have that feature present, I don't like to see chapters split for no reason. Please tell me if you dislike that particular feature of cut.

Nothing needs to change in the chapter merger.

Please take a look at Helper.append() and Helper.handle_intersection() inside of _merge_with_sponsor_chapters. Currently its implementation depends on the global self._cut. To leave some chapters and cut out others, the flag must be made per chapter.

In that case, we should split the PP into SponsorBlockPP and RemoveChaptersPP.
Once we put the modified chapter data in the infodict using SponsorBlock pre-processor, FFmpegMetadataPP will automatically take care of adding these chapters

FFmpegMetadataPP does in fact write chapter metadata even in this implementation. The only changes there were made to support writing chapters when user requested --sponsorblock, but did not request either --sponsorblock-cut or --append-metadata (you need to hide whitespace changes to see it, most changed lines are due to indentation). I think it's better from UX perspective to make --sponsorblock imply --add-chapters, instead of forcing people to specify them both. It also matches SponSkrub's implementation. What do you think about it?

However, it is not the reason why I suggested to extract ChaptersPP. These are the parts of the implementation:

Modifying chapters object by either splicing sponsor chapters in or by cutting them out.
Cut is implemented in a way described above: no unnecessary splits.
A new feature that would allow cutting arbitrary chapters, not only sponsors.
Cutting the video itself and the subtitles.
Writing chapter metadata.

Let's look at ways to combine these pieces together.

SponsorBlockPP only queries the API, modifies chapters taking only sponsors into account, and cuts sponsors out of the video. Then RemoveChaptersPP has to modify chapters the second time, and cut the video the second time. Chapter modification code has to be extracted to make it available to RemoveChaptersPP. FFmpegMetadataPP writes chapters to the file as it currently does.

This design tries to separate concerns. The separation of concern comes at the cost of having to modify chapters twice, and having to cut the video twice. At first, that double cut seems avoidable by having SponsorBlockPP pass some info['__sponsor_segments'] down to RemoveChaptersPP instead of doing the work itself. Now RemoveChaptersPP has to pay attention to SponsorBlock parameters to e.g. implement a cut the way it was described above. That passing of information between post-processors created a tight coupling between them nullifying the separation of concerns that the design tried to achieve in the first place.

Alternative design is to have ChaptersPP handling both --remove-chapters and SponsorBlock. It contains all the code to splice and cut the chapters (which by necessity implies cutting the video and the subtitles). SponsorBlock code now is just a couple of functions that conditionally attach sponsor chapters to the chapters explicitly requested to be removed. Cutting without splitting is possible since all the information is in the same place.

It is still possible to retain chapter writing in FFmpegMetadataPP. But it raises the obvious question: why all chapter handling is in one place, but the actual chapter writing is in another? Moving chapter writing code from FFmpegMetadataPP seems like a natural thing to do. It also makes it possible to have --sponsorblock imply --add-chapters without the necessity to pass an ugly info['__sponsor_blocked'] down the line. --add-metadata of course would still imply running ChaptersPP to match the old behaviour.

All future chapter handling code would have a natural post-processor to tag along to. What do you think about creating ChaptersPP in a way described above? Again, nothing changes if chapter writing stays in FFmpegMetadataPP. But having separate post-processors for SponsorBlock and --remove-chapters doesn't look like a good design due to the reasons described above. And having --remove-chapters handled by SponsorBlockPP looks rather strange.

faissaloo

Nice work, just a few things that might be worth thinking about

yt_dlp/postprocessor/sponsorblock.py

nihil-admirari · 2021-05-31T07:59:23Z

@pukkandan, looks like FFmpegMetadataPP exists in the original fork, so chapter writing code better be left in there to minimize merge conflicts. But as I said previously

nothing changes if chapter writing stays in FFmpegMetadataPP.

On the other hand, are you OK with having ModifyChaptersPP handling both --remove-chapters and SponsorBlock? Creating two tightly coupled post-processors having to communicate through info[__sponsorblock_segments] looks like bad design, they better be in the same place to maximize cohesion. And without that communication, implementation of non-splitting cut is impossible.

nihil-admirari · 2021-06-06T22:16:36Z

Small changes:

Brought SponSkrub back. Its options are hidden from the README, but everything else stays the same. New implementation is disabled by default.
Post-processor updates info-json after modifying chapters.
Concat implementation is moved to the superclass and is now available to every FFmpeg post-processor.
Minimized changes in FFmpegMetadataPP. Still --sponsorblock implies adding chapters even if --add-metadata wasn't explicitly requested.
Renamed plaintext to reveal_video_id.
HTTP now must be explicitly requested.
Rebased on master.
Since Python 2 has been deprecated, the code now uses Python >= 3.6 features: raise ... from ..., nonlocal, format strings, response encoding for JSON data and type annotations. encodeFilename is no longer used, io.open replaced with plain open.

Big change: implementing --remove-chapters and --sponsorblock-cut-categories turned out to be rather difficult. The algorithm was completely rewritten and is now based on a priority queue.

Please review.

README.md update line from Makefile:

COLUMNS=80 $(PYTHON) yt_dlp/__main__.py --help | $(PYTHON) devscripts/make_readme.py

generate an enormous diff. Is that line correct?

pukkandan · 2021-06-07T09:40:38Z

huh, I thought I had already replied to your earlier comment. I must have typed it up and then forgot to submit 😢 Give me some time to collect my thought again on the various points. Then I will review

README.md update line from Makefile
generate an enormous diff. Is that line correct?

Yes. The big diff is caused due to the new option names being too long. Anyway, you should revert that and only create the corresponding options for now. We can make README.md once we have finalized everything else

nihil-admirari · 2021-06-07T10:07:34Z

Reverted changes in README + some small fixes.

Workflow checks failed due to

2021-06-07T09:41:51.6758889Z AssertionError: None is not true : Regexp didn't match: '(?:(?:#.*?|\\s*)\\n)*from __future__ import (?:[a-z_]+,\\s*)*unicode_literals' not found, unicode_literals import  missing in /home/runner/work/yt-dlp/yt-dlp/yt_dlp/postprocessor/modify_chapters.py

I've added unicode_literals import, but I'm not sure that test has any uses now, after all, Python 2 has been deprecated.

pukkandan · 2021-06-07T10:08:48Z

I've added unicode_literals import, but I'm not sure that test has any uses now, after all, Python 2 has been deprecated.

Yes it has no use, I just forgot to disable that test

nihil-admirari · 2021-06-07T14:46:19Z

Fixed lint error.

Python 3.4 on Windows fails with SyntaxErrors, which is expected considering that the code uses 3.6 format strings. Since only Python 3.6 will be supported in the future, can you please just retire Python 3.4?

pukkandan · 2021-06-07T16:16:50Z

You can ignore the 3.4 test. By the time this code is merged, it will be removed

phyzical · 2021-06-12T14:33:07Z

@nihil-admirari hey there, not sure if im jumping the gun on giving this a test, but when i try --sponsorblock --sponsorblock-cut-categories sponsor for https://www.youtube.com/watch?v=Dy312cUHumk i get a ERROR: Postprocessing: Conversion failed! and the video playback ends at the first sponsor block.

When i didnt cut i had this warning thrown Some SponsorBlock segments overlap incase that helps hint in a direction. i will try to debug further if i get some more time :)

also on my quick usage i wonder if another flag of say --sponsorblock-cut should be added which will just cut what is queried so by default that would be all categories, hell id argue we should cut by default if --sponsorblock is provided, but i assume this is due to how sponskrub defaults were.

nihil-admirari · 2021-06-12T20:19:30Z

also on my quick usage i wonder if another flag of say --sponsorblock-cut should be added which will just cut what is queried so by default that would be all categories

I shortened command line options: dropped -categories from --sponsorblock-query/cut-categories, and added a shortcut all to select everything. Is it better now?

... when i try --sponsorblock --sponsorblock-cut-categories sponsor for https://www.youtube.com/watch?v=Dy312cUHumk i get a ERROR: Postprocessing: Conversion failed! and the video playback ends at the first sponsor block.

Please always provide -v logs in such cases, otherwise I'd have to guess what formats you've been trying to download. MP4 format -bv*[ext=mp4]+ba[ext=m4a] worked fine for me.

WEBM format bv*+ba threw:

Application provided invalid, non monotonically increasing dts to muxer in stream 0: 43417 >= 43410

It looks like a rather popular problem. https://ipcamtalk.com/threads/fix-non-monotonous-dts-errors-in-ffmpeg.30465/#post-286226 suggested to add -copytb 1, which I did. Unfortunately it didn't work. An additional -strict experimental didn't work either, and FFmpeg documentation warns against using it anyway:

Note: experimental decoders can pose a security risk, do not use this for decoding untrusted input.

Tested with the latest FFmpeg master from https://johnvansickle.com/ffmpeg/.

Unless someone knows a magic workaround, nothing can be done about it. You can try to open an issue in FFmpeg bug tracker, but it looks like there is a plenty of such bug reports already. I can only suggest you to switch to MP4, for which cutting does work.

phyzical · 2021-06-13T07:10:38Z

@nihil-admirari thanks for the additions all should do the job for what i was suggesting but atm it throws invalid SponsorBlock category: all --sponsorblock --sponsorblock-cut all, assume you will need to skip the category check if all is provided

Sorry! i didnt even realize there was a -v i should have looked harder, will in the future 👍 .

sorry, ive jsut come from the base yt-dl and my format settings were not working with yt-dlp, so i was just using whatever the default detected was. i can confirm it removes the sponsor as expected with the format you suggested now and with almost no overhead to the processing 😄 good job! an amazing speed improvement on the original sponskrub implementation!

once the all category issue is resolved i will give it a test with subtitles also.

Though it should be noted that others may run into similar issues as i did, maybe a note in the readme for sponsor block should be added warning about format issues once that is done? though like you've found it looks more like an ffmpeg issue than an issue with your post processor

pukkandan · 2021-06-13T09:46:45Z

Application provided invalid, non monotonically increasing dts to muxer in stream 0: 43417 >= 43410

This issue does not exist in ffmpeg versions <= 3.0.1. See ytdl-org/youtube-dl#28042

nihil-admirari · 2021-06-13T10:34:01Z

atm it throws invalid SponsorBlock category: all

Have you pulled the latest changes from this branch? Caveat: because of force push (rebase on master), you may need to remove you local branch, then pull, and then checkout this branch again.

When i didnt cut i had this warning thrown Some SponsorBlock segments overlap

Actually, it was supposed to warn even when doing a cut. Fixed.

should be added warning about format issues once that is done

Added a warning.

I tried various FFmpeg options that were mentioned as workarounds in various places.

+discardcorrupt, +genpts, +igndts fail with error.
-bsf setts=dts=PREV_INDTS/PREV_OUTDTS and various other combinations fail too.
-bsf vp9_superframe/vp9_superframe_split/vp9_raw_reorder all fail.
-vsync drop proceeds to completion, but the resulting video is broken: VLC only shows static frames that replace each other infrequently.
+noparse +nofillin doubles the file size, but the resulting video is pure black screen.
Forced re-encoding with --remove-chapters-force-keyframes works (at least on the smallest video -S +size), but is terribly slow.

phyzical · 2021-06-13T12:30:35Z

@nihil-admirari not sure what happened as i had the new flags, in any case all now works, tested one with subtitles looks to keep in sync. from my few test runs all looks good to me 👍

nihil-admirari · 2021-06-13T15:12:04Z

we should cut by default if --sponsorblock is provided, but i assume this is due to how sponskrub defaults were.

Indeed, cutting was disabled by default to match SponSkrub behaviour. It made sense for SponSkrub, which implements cut by re-encoding. In this implementation cutting is fast, and, I presume, the majority of people who use SponsorBlock would like to cut rather than contemplate the chapters. Changed to cut by default.

nihil-admirari · 2021-06-16T16:46:10Z

@phyzical, if you still feel like testing, you can find patched FFmpeg here: https://github.com/nihil-admirari/FFmpeg-With-VP9-Timestamp-Fix/releases. It should handle VP9 videos with corrupted timestamps.

phyzical · 2021-06-17T00:30:19Z

@nihil-admirari no worries ill give it a crack later today after work 👍

Signed-off-by: nihil-admirari <50202386+nihil-admirari@users.noreply.github.com>

…tain useful data

And merge adjacent chapters based on title, not categories

docs

pukkandan · 2021-09-01T02:29:13Z

Added a template for sponsor chapter names

This was the perfect idea. I have modified it to use output template instead. Then, added code to merge adjacent chapters based on title rather than category. This now satisfies both our uses. By default, no data is lost in the chapter names, but by using --sponsorblock-chapter-title "SponsorBlock: %(name)s", I can get the chapters to use only the title of the smallest category

I've also made a few small changes. Have a look at the commit messages

If you are satisfied with how it is now, I believe this can be merged

yt_dlp/postprocessor/modify_chapters.py

nihil-admirari · 2021-09-01T07:59:22Z

yt_dlp/postprocessor/sponsorblock.py

+        'sponsor': 'Sponsor',
+        'intro': 'Intermission/Intro Animation',
+        'outro': 'Endcards/Credits',
+        'selfpromo': 'Unpaid/Self Promotion',


IMHO, old names were better.

I personally don't care either way, but these are the actual names as documented here: https://wiki.sponsor.ajay.app/index.php/Segment_Categories

yt_dlp/postprocessor/modify_chapters.py

nihil-admirari · 2021-09-01T08:05:37Z

If you are satisfied with how it is now, I believe this can be merged

I made some minor fixes. I'm OK with merging this, though I don't quite understand why the smallest sponsor contributing to overlap is the most representative of the overlap itself.

yt_dlp/options.py

pukkandan · 2021-09-02T00:40:50Z

@nihil-admirari Thanks for sticking with me till the end of the PR even though it took some large delays and quite a few detours to get this done 😄

SponsorBlock options: * The fetched sponsor sections are written to infojson * `--sponsorblock-remove` removes specified chapters from file * `--sponsorblock-mark` marks the specified sponsor sections as chapters * `--sponsorblock-chapter-title` to specify sponsor chapter template * `--sponsorblock-api` to use a different API Related improvements: * Split `--embed-chapters` from `--embed-metadata` * Add `--remove-chapters` to remove arbitrary chapters * Add `--force-keyframes-at-cuts` for more accurate cuts when removing and splitting chapters Deprecates all `--sponskrub` options Authored by: nihil-admirari, pukkandan

faissaloo reviewed May 30, 2021

View reviewed changes

yt_dlp/postprocessor/sponsorblock.py Outdated Show resolved Hide resolved

yt_dlp/postprocessor/sponsorblock.py Outdated Show resolved Hide resolved

yt_dlp/postprocessor/sponsorblock.py Outdated Show resolved Hide resolved

This comment has been minimized.

Sign in to view

pukkandan mentioned this pull request Jun 8, 2021

[TwitCasting] Some archive downloads only the first hour of it #382

Closed

5 tasks

This comment has been minimized.

Sign in to view

nihil-admirari and others added 9 commits August 30, 2021 21:33

Fix key error

59f004a

Signed-off-by: nihil-admirari <50202386+nihil-admirari@users.noreply.github.com>

chapters can be None

3f4a9dc

Category names as documented in SponsorBlock

3b00ea8

Better messages

8a00bd9

The output of SponsorBlockPP is written to infojson and so should con…

7dc5376

…tain useful data

use output template for sponsorblock_chapter_title

58cb4cd

And merge adjacent chapters based on title, not categories

Fix tests

daa18a3

raise fatal error when durations mismatch

7aaeb5d

docs

2d97d8f

docs

flake8

9a48ea2

pukkandan reviewed Sep 1, 2021

View reviewed changes

yt_dlp/postprocessor/modify_chapters.py Show resolved Hide resolved