Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NRK] Ambiguous subtitle duration thousands #18399

Closed
forthrin opened this issue Dec 4, 2018 · 4 comments
Closed

[NRK] Ambiguous subtitle duration thousands #18399

forthrin opened this issue Dec 4, 2018 · 4 comments

Comments

@forthrin
Copy link

@forthrin forthrin commented Dec 4, 2018

  • I've verified and I assure that I'm running youtube-dl 2018.12.03
  • At least skimmed through the README, most notably the FAQ and BUGS sections
  • Searched the bugtracker for similar issues including closed ones
  • Checked that provided video/audio/playlist URLs (if any) are alive and playable in a browser
  • Bug report (encountered problems with youtube-dl)
$ youtube-dl --write-sub --sub-lang no --convert-subtitles srt --verbose https://tv.nrk.no/serie/unge-lovende/sesong/4/episode/2/avspiller
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'--write-sub', u'--sub-lang', u'no', u'--convert-subtitles', u'srt', u'--verbose', u'https://tv.nrk.no/serie/unge-lovende/sesong/4/episode/2/avspiller']
[debug] Encodings: locale UTF-8, fs utf-8, out utf8, pref UTF-8
[debug] youtube-dl version 2018.12.03
[debug] Python version 2.7.15 (CPython) - Darwin-16.7.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg 4.0.2, ffprobe 4.0.2, rtmpdump 2.4
[debug] Proxy map: {}
[debug] Default format spec: bestvideo+bestaudio/best
[debug] Invoking downloader on u'http://nordond34c-f.akamaihd.net/i/no/open/ps/km/kmte20006218/kmte20006218ca/kmte20006218ca_,141,316,563,1266,2250,.mp4.csmil/index_4_av.m3u8?null=0'

Please see the subtitles at: https://undertekst.nrk.no/prod/KMTE20/00/KMTE20006218CA/TMP/KMTE20006218CA.ttml

  <p begin="00:01:59.320" dur="00:00:04.60" style="left">-Vil dere ha dette?<br />-Det tar vi etterpå.</p>
  <p begin="00:02:03.680" dur="00:00:04.940" style="left">Jeg tenkte du skulle<br />sitte der og Elise her.</p>

Notice the duration that ends with ".60". When converting to .srt, youtube-dl incorrectly interprets this as ".600" rather than the ".060" evidently intended by the author, causing several subtitles to overrun the succeeding subtitle, which in turn makes VLC annoyingly display the subtitles over each other. The problem applies for all subtitled videos on the entire site, so it's a major nuisance.

@evexoio
Copy link

@evexoio evexoio commented Dec 4, 2018

04.60 is 04.600

I don't get it, why would we convert .60 to .060? isn't this a mistake from the subtitles author end?

@forthrin
Copy link
Author

@forthrin forthrin commented Dec 4, 2018

@evexoio: They probably do something like sprintf("...:%d", thousands) rather than sprintf("...:%03d", thousands).Will fixing it break any other sites?

@forthrin forthrin changed the title TTML subtitle durations are interpreted incorrectly TTML durations conversion quirk (patch included) Dec 10, 2018
@evexoio
Copy link

@evexoio evexoio commented Dec 11, 2018

@forthrin can you open a PR please?

@forthrin forthrin mentioned this issue Apr 13, 2019
5 of 5 tasks complete
@forthrin forthrin changed the title TTML durations conversion quirk (patch included) [NRK] Ambiguous subtitle duration thousands May 9, 2019
@forthrin
Copy link
Author

@forthrin forthrin commented May 9, 2019

I've found a way to get around the issue.

The site provides two subtitle formats, VTT and TTML, but only the latter has the thousands issue.

However, youtube-dl doesn't pick up the VTT file, because the site metadata points to a m3u8 file, which again points to a VTT file. This may be non-standard, but that's how they do it.

Getting the VTT file and adding --sub-format vtt/ttml on the command line, the issue disappears.

I am not proud of this crappy code. I assume it breaks with several of the project's programming conventions, but I couldn't find any other way to do it at the moment, and it works.

Maybe someone can suggest a way to clean up the code, so it can be submitted to the project?

diff --git a/youtube_dl/extractor/nrk.py b/youtube_dl/extractor/nrk.py
index 072f920a9..7a120376a 100644
--- a/youtube_dl/extractor/nrk.py
+++ b/youtube_dl/extractor/nrk.py
@@ -4,2 +4,3 @@ from __future__ import unicode_literals
 import re
+import os
 
@@ -18,2 +19,3 @@ from ..utils import (
     try_get,
+    determine_ext,
 )
@@ -78,5 +80,12 @@ class NRKBaseIE(InfoExtractor):
                     subtitle_url = asset.get('%sSubtitlesUrl' % subtitle)
+                    subtitle_url = compat_urllib_parse_unquote(subtitle_url)
                     if subtitle_url:
+                        ext = determine_ext(subtitle_url)
+                        if ext == "m3u8":
+                            m3u8 = self._download_webpage(subtitle_url, ext)
+                            head = os.path.dirname(subtitle_url)
+                            tail = re.findall(r'(?i)(.*\.vtt)', m3u8)[0]
+                            subtitle_url = '%s/%s' % (head, tail)
                         subtitles.setdefault('no', []).append({
-                            'url': compat_urllib_parse_unquote(subtitle_url)
+                            'url': subtitle_url
                         })
diff --git a/youtube_dl/utils.py b/youtube_dl/utils.py
index 08af57bae..a203a7e19 100644
--- a/youtube_dl/utils.py
+++ b/youtube_dl/utils.py
@@ -2760,2 +2760,5 @@ def parse_dfxp_time_expr(time_expr):
 
+    if re.match(r'^(\d+):(\d\d):(\d\d)[\.:]\d{1,2}$', time_expr):
+        print('NOTICE: Ambiguous timecode in subtitles: ' + time_expr)
+
     mobj = re.match(r'^(\d+):(\d\d):(\d\d(?:(?:\.|:)\d+)?)$', time_expr)
@forthrin forthrin closed this May 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.