Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WebVTT] Allow spaces before newlines for CueBlock #7681

Merged

Conversation

TSRBerry
Copy link
Contributor

@TSRBerry TSRBerry commented Jul 23, 2023

Description

This PR fixes the CueBlock parser for WebVTT to allow optional spaces or tabs in front of the settings list.

See: https://www.w3.org/TR/webvtt1/#webvtt-cue-block

In particular:

  1. Optionally, one or more U+0020 SPACE characters or U+0009 CHARACTER TABULATION (tab) characters followed by a WebVTT cue settings list.

And this part of the WebVTT cue settings list definition:

A WebVTT cue settings list consist of a sequence of zero or more WebVTT cue settings

This PR came to be when I recently encountered a vtt file that included one space before the newline in the WebVTT cue block, which caused the parser to fail like this:

ERROR: Parse error at position 57 (near '1\n00:00:00.634 --> 0')

The actual error doesn't occur at position 57 (which is the 1), but rather at position 88 (which is right at the end of the last timestamp) because no newline character can be matched.

yt-dlp/yt_dlp/webvtt.py

Lines 289 to 290 in 86aea0d

if not parser.consume(_REGEX_NL):
return None

Returning null here then causes the parse_fragment method to raise the ParserError as seen in the error snippet.

yt-dlp/yt_dlp/webvtt.py

Lines 392 to 397 in 86aea0d

block = CueBlock.parse(parser)
if block:
yield block
continue
raise ParseError(parser)

Fixes #7453

Replaces #7454, as it doesn't adhere to the WebVTT spec.


Template notes

Comments about some template items

I have a few issues with some of the template items, but didn't know where to add my comments, so I'm doing that here:

  • Searched the bugtracker for similar pull requests

There is another open PR for this (see above), but it doesn't adhere to the WebVTT spec and would break vtt files without a space at the end of the cue timings.

Running flake8 results in the following error:

yt_dlp/webvtt.py:80:26: BLK100 Black would make changes.

This is not code that I changed and can't actually be resolved by running black.
Black changes a few more things in that file and also causes another issue for flake8:

yt_dlp/webvtt.py:82:53: E203 whitespace before ':'

The code in question was reformatted like that:

class ParseError(Exception):
    def __init__(self, parser):
        super().__init__(
            "Parse error at position %u (near %r)"
            % (parser._pos, parser._data[parser._pos : parser._pos + 20])
        )

Removing the whitespace after parser._pos then causes this flake8 error:

yt_dlp/webvtt.py:82:53: BLK100 Black would make changes.

No idea what I'm supposed to do here, so I ignored BLK100 instead and completed the flake8 run successfully.

As for the tests, I couldn't find any WebVTT tests to run, so I executed the normal testsuite as explained by the developer instructions.

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

馃 Generated by Copilot at 28ccfe9

Summary

馃洜锔忦煋濔煄烇笍

Improve WebVTT cue parsing by using a regex to handle whitespace. This affects the file yt_dlp/webvtt.py.

Sing, O Muse, of the skillful coder who improved the WebVTT cue parsing
By adding support for optional whitespace, a subtle and cunning task
He wielded the mighty regex, like Hephaestus forging his tools of fire
And matched and consumed the spaces, as a lion devours a hapless stag

Walkthrough

  • Allow optional whitespace in WebVTT cue blocks by defining a regex for zero or more spaces or tabs (link) and using it to consume any whitespace before parsing the cue components in the parse method of the CueBlock class (link) in yt_dlp/webvtt.py

The spec allows spaces even if the following settings list is empty.

See: https://www.w3.org/TR/webvtt1/#webvtt-cue-block
@TSRBerry
Copy link
Contributor Author

I'm not a fan of the usage of Copilot, but it's a tiny PR, so whatever.

@TSRBerry TSRBerry changed the title webvtt: Allow spaces before newlines for CueBlock [WebVTT] Allow spaces before newlines for CueBlock Jul 23, 2023
@bashonly bashonly added the enhancement New feature or request label Jul 23, 2023
@pukkandan
Copy link
Member

Related #7454

@TSRBerry
Copy link
Contributor Author

Is there anything missing from my side for this PR?

@bashonly bashonly self-requested a review November 7, 2023 13:45
@pukkandan pukkandan removed the core-triage triage requested from a core dev label Nov 8, 2023
@pukkandan pukkandan merged commit 15f22b4 into yt-dlp:master Nov 28, 2023
14 checks passed
@pukkandan
Copy link
Member

Sorry it took so long to get to this

bashonly added a commit to bashonly/yt-dlp that referenced this pull request Dec 12, 2023
This reverts "[webvtt] Allow spaces before newlines for CueBlock (yt-dlp#7681)"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ERROR: Parse error at position 8 (near '149\n00:00:00.000 -->') when downloading subtitles from SRGSSR
3 participants