[Request] convert npo.nl webvtt-subtitles to srt #4954

Closed
Reino17 opened this issue Feb 15, 2015 · 7 comments

@Reino17 Reino17 commented Feb 15, 2015

Frenzie and dstftw, thanks for the initial support, but npo.nl subtitles are in the WebVTT format, which a lot of players don't support.
For yesterday's 8 o'clock news bulletin, for example, youtube-dl reports:

youtube-dl.exe -s --list-subs http://www.npo.nl/nos-journaal/14-02-2015/POW_00942207
[npo.nl] POW_00942206: Downloading JSON metadata
......
[npo.nl] POW_00942206: Downloading h264_std stream JSON
WARNING: Automatic Captions not supported by this server
[npo.nl] POW_00942206: Available subtitles for video: nl
[npo.nl] POW_00942206: Available automatic captions for video:

When I then download these subtitles:...

youtube-dl.exe --skip-download --write-sub --sub-format srt --sub-lang nl http://www.npo.nl/nos-journaal/14-02-2015/POW_00942207
[npo.nl] POW_00942207: Downloading JSON metadata
......
[npo.nl] POW_00942207: Downloading h264_std stream JSON
[info] Writing video subtitles to: NOS Journaal-POW_00942207.nl.srt

..., the first couple of lines look like this:

WEBVTT

1
00:00:03.000 --> 00:00:05.019
888

2
00:00:05.019 --> 00:00:12.005
Een schietpartij op een bijeenkomst in Kopenhagen met een omstreden Zweedse cartoonist.
  • First of all: If youtube-dl has the ability to detect the subtitle-format, could you perhaps show that in the process, like: [npo.nl] POW_00942206: Available subtitles for video: nl (vtt).
  • Secondly: Whether or not the subtitle-format is detected, --sub-format srt didn't do anything.
    In this case, with npo-subs, you'd only have to remove the first 2 lines to have fully working srt-subtitles.
    I was hoping that if youtube-dl detects vtt-subtitles, --sub-format srt would actually convert these to srt.
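The conversion described above can be sketched in a few lines of Python. This is only a minimal illustration, not how youtube-dl does it: the function name `vtt_to_srt` is hypothetical, and it assumes a bare `WEBVTT` header and `hh:mm:ss.ttt` timestamps as in the sample. Besides dropping the header, it also swaps the decimal point in the timestamps for a comma, which strict SRT expects.

```python
import re

# Matches a cue timing line such as "00:00:03.000 --> 00:00:05.019".
# Assumption: timestamps always include hours, as in the npo.nl sample.
TIMING = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\.(\d{3}) --> (\d{2}:\d{2}:\d{2})\.(\d{3})"
)


def vtt_to_srt(vtt_text):
    """Convert simple WebVTT text to SRT text (hypothetical helper)."""
    lines = vtt_text.replace("\r\n", "\n").split("\n")

    # Drop the "WEBVTT" header line and the blank line(s) after it.
    if lines and lines[0].lstrip("\ufeff").startswith("WEBVTT"):
        lines = lines[1:]
        while lines and not lines[0].strip():
            lines = lines[1:]

    out = []
    for line in lines:
        match = TIMING.match(line)
        if match:
            # Rewrite the timing line with commas; anything after the
            # second timestamp (WebVTT cue settings) is discarded.
            line = "%s,%s --> %s,%s" % match.groups()
        out.append(line)
    return "\n".join(out)
```

Cue numbers and text lines pass through unchanged, so the numbered cues in the sample end up as ordinary SRT entries.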
@jaimeMF jaimeMF added the subtitles label Feb 15, 2015
Collaborator

@jaimeMF jaimeMF commented Feb 15, 2015

Running ffmpeg -i 'NOS Journaal-POW_00942207.nl.srt' something.srt seems to work fine, but I haven't tried with libav.

I'm working on properly handling subtitle formats (for #4019).

Author

@Reino17 Reino17 commented Feb 27, 2015

My findings:

ffmpeg -hide_banner -i http://e.omroep.nl/tt888/POW_00942207 "NOS Journaal-POW_00942207.nl.srt"
...
  Stream #0:0 -> #0:0 (webvtt (native) -> subrip (srt))
Press [q] to stop, [?] for help
[webvtt @ 02f9f500] Invalid UTF-8 in decoded subtitles text; maybe missing -sub_charenc option

So I don't know about other websites, but in the case of npo.nl that command line needs to be:

ffmpeg -hide_banner -sub_charenc CP1252 -i http://e.omroep.nl/tt888/POW_00942207 "NOS Journaal-POW_00942207.nl.srt"
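
For anyone scripting this, the same command can be assembled from Python. This is a rough sketch under assumptions: the helper names `build_convert_cmd` and `convert` are made up for illustration, and only ffmpeg options that appear in this thread are used (`-hide_banner`, `-y`, `-sub_charenc`, `-i`). Note that `-sub_charenc` is placed before `-i`, as in the command above, so that it applies to the input.

```python
import subprocess


def build_convert_cmd(src, dst, sub_charenc=None):
    """Build an ffmpeg argument list for subtitle conversion (hypothetical helper)."""
    cmd = ["ffmpeg", "-hide_banner", "-y"]
    if sub_charenc:
        # Placed before -i, matching the command line used in this thread.
        cmd += ["-sub_charenc", sub_charenc]
    cmd += ["-i", src, dst]
    return cmd


def convert(src, dst, sub_charenc=None):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(build_convert_cmd(src, dst, sub_charenc), check=True)
```

Calling `convert("http://e.omroep.nl/tt888/POW_00942207", "out.srt", "CP1252")` would then run the equivalent of the command above, assuming ffmpeg is on the PATH.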

Then muxing...
The only method that works without any errors is muxing a progressive videostream with subs to matroska:

ffmpeg -hide_banner -i http://content50c1b.omroep.nl/....../POW_00942207/std.20150214.m4v -sub_charenc CP1252 -i http://e.omroep.nl/tt888/POW_00942207 -map 0:0 -map 0:1 -c copy -map 1 -c:s srt -metadata:s:s:0 language=nl "NOS Journaal-POW_00942207.nl.m4v.mkv"

Adaptive videostream with subs to matroska:

ffmpeg -hide_banner -i http://....npostreaming.nl/.../POW_00942207-audio_eng%3D128000-video%3D1001000.m3u8 -sub_charenc CP1252 -i http://e.omroep.nl/tt888/POW_00942207 -c copy -bsf:a aac_adtstoasc -c:s srt -metadata:s:s:0 language=nl "NOS Journaal-POW_00942207.nl.m3u8.mkv"
...
[matroska @ 03bfefe0] Error parsing AAC extradata, unable to determine samplerate.

Here -bsf:a aac_adtstoasc doesn't seem to do anything. Unless I've overlooked something, there's no other option than to mux to a temporary mp4 file first.

Adaptive videostream with subs to mp4:

ffmpeg -hide_banner -fflags genpts -i http://....npostreaming.nl/.../POW_00942207-audio_eng%3D128000-video%3D1001000.m3u8 -sub_charenc CP1252 -i http://e.omroep.nl/tt888/POW_00942207 -c copy -bsf:a aac_adtstoasc -c:s mov_text -metadata:s:s:0 language=nl "NOS Journaal-POW_00942207.nl.m3u8.mp4"
...
[mp4 @ 030420a0] Application provided duration: -1 / timestamp: 5019 is out of range for mov/mp4 format
[mp4 @ 030420a0] pts has no value

Here -fflags genpts doesn't seem to do anything. But despite these warnings (they're not errors), a working mp4 is created.

Progressive videostream with subs to mp4:
Same "pts has no value" as above.

I hope this helps you in your subtitle-quest.

@jaimeMF jaimeMF self-assigned this Feb 28, 2015
@jaimeMF jaimeMF closed this in e9fade7 Feb 28, 2015
Collaborator

@jaimeMF jaimeMF commented Feb 28, 2015

In the next version, you'll be able to run youtube-dl 'http://www.npo.nl/nos-journaal/14-02-2015/POW_00942207' --write-sub --convert-subtitles srt and the vtt subtitles will be converted to srt.
If you want to embed them in an mkv file, please open a new issue.

Thanks for the report.

Author

@Reino17 Reino17 commented Feb 28, 2015

Have you tested it yourself? Because it doesn't work. FFmpeg is detected but doesn't do anything, so there's no conversion at all, and the subs still have the vtt extension.

D:\youtube-dl-master>python -m youtube_dl -v --skip-download --write-sub --convert-subtitles srt http://www.npo.nl/nos-journaal/07-02-2015/POW_00942206
[debug] System config: []
[debug] User config: []
[debug] Command-line args: ['-v', '--skip-download', '--write-sub', '--convert-subtitles', 'srt', 'http://www.npo.nl/nos-journaal/07-02-2015/POW_00942206']
[debug] Encodings: locale cp1252, fs mbcs, out cp437, pref cp1252
[debug] youtube-dl version 2015.02.26.2
[debug] Python version 3.4.2 - Windows-XP-5.1.2600-SP3
[debug] exe versions: ffmpeg N-69779-g2a72b16
[debug] Proxy map: {}
[npo.nl] POW_00942206: Downloading JSON metadata
[npo.nl] POW_00942206: Downloading token
[npo.nl] POW_00942206: Downloading adaptive JSON
[npo.nl] POW_00942206: Downloading adaptive stream JSON
[npo.nl] POW_00942206: Downloading m3u8 information
[npo.nl] POW_00942206: Downloading h264_bb JSON
[npo.nl] POW_00942206: Downloading h264_bb stream JSON
[npo.nl] POW_00942206: Downloading h264_sb JSON
[npo.nl] POW_00942206: Downloading h264_sb stream JSON
[npo.nl] POW_00942206: Downloading h264_std JSON
[npo.nl] POW_00942206: Downloading h264_std stream JSON
[info] Writing video subtitles to: NOS Journaal-POW_00942206.nl.vtt
WEBVTT

1
00:00:03.000 --> 00:00:04.018
888

2
00:00:04.018 --> 00:00:09.006
Op de veiligheidsconferentie in M�unchen klinkt
duidelijke taal over het Oekraiense conflict.
...
Collaborator

@jaimeMF jaimeMF commented Feb 28, 2015

For a moment you made me doubt whether I had tested it :)

You just have to remove the --skip-download option. That's how the postprocessor system works: postprocessors aren't run if the video isn't downloaded, because most of them need it. (If you have already downloaded the video, youtube-dl will detect that and won't download it again.)

Author

@Reino17 Reino17 commented Feb 28, 2015

D:\youtube-dl-master>python -m youtube_dl -v --write-sub --convert-subtitles srt http://www.npo.nl/nos-journaal/07-02-2015/POW_00942206
...
[info] Writing video subtitles to: NOS Journaal-POW_00942206.nl.vtt
[debug] Invoking downloader on 'http://content50c1b.omroep.nl/.../POW_00942206/std.20150207.m4v?odiredirecturl=...'
[download] NOS Journaal-POW_00942206.m4v has already been downloaded
[download] 100% of 177.21MiB
[ffmpeg] Converting subtitles
[debug] ffmpeg command line: ffmpeg -y -i 'NOS Journaal-POW_00942206.nl.vtt' -f srt 'NOS Journaal-POW_00942206.nl.srt'

I see. But could you perhaps load the subtitle URL directly instead of separately downloading the WebVTT subtitles?

NOS Journaal-POW_00942206.nl.srt by youtube-dl (and same as vtt):

2
00:00:04,018 --> 00:00:09,008
Op de veiligheidsconferentie in M�unchen klinkt
duidelijke taal over het Oekraiense conflict.

NOS Journaal-POW_00942206.nl.srt through ffmpeg
(ffmpeg -sub_charenc CP1252 -i 'NOS Journaal-POW_00942206.nl.vtt' 'NOS Journaal-POW_00942206.nl.srt'):

2
00:00:04,018 --> 00:00:09,008
Op de veiligheidsconferentie in M�unchen klinkt
duidelijke taal over het Oekraiense conflict.

NOS Journaal-POW_00942206.nl.srt through ffmpeg
(ffmpeg -sub_charenc CP1252 -i http://e.omroep.nl/tt888/POW_00942206 'NOS Journaal-POW_00942206.nl.srt'):

2
00:00:04,018 --> 00:00:09,008
Op de veiligheidsconferentie in MÈunchen klinkt
duidelijke taal over het Oekraiense conflict.

Obviously MÈunchen is still misspelled, but at least it's now the same as shown on npo.nl (by the flashplayer).
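
The difference between the three outputs can be reproduced in Python. This is a small illustration under one assumption: that the problematic byte in the source file is 0xC8, which is inferred here from the fact that CP1252 decoding shows it as È.

```python
# Assumption: the raw subtitle bytes contain 0xC8 before "unchen",
# inferred from the "È" that appears when decoding as CP1252.
raw = b"M\xc8unchen"

# 0xC8 followed by "u" is not a valid UTF-8 sequence, so a UTF-8
# decoder either substitutes U+FFFD or drops the byte entirely.
as_utf8_replace = raw.decode("utf-8", errors="replace")
as_utf8_ignore = raw.decode("utf-8", errors="ignore")

# CP1252 maps every byte in this sample to a character.
as_cp1252 = raw.decode("cp1252")

print(as_utf8_replace)  # M\ufffdunchen (shown as a replacement character)
print(as_utf8_ignore)   # Munchen
print(as_cp1252)        # MÈunchen
```

Once the byte has been replaced or dropped during a UTF-8 decode, running ffmpeg with -sub_charenc CP1252 on the saved vtt file cannot bring it back, which would explain why only the direct-URL variant shows È.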

Collaborator

@jaimeMF jaimeMF commented Feb 28, 2015

You'll always end up downloading from the URL, and it's easier to always download the subtitles and only run ffmpeg if --convert-subtitles is given.

About the character issue: they don't state in the HTTP headers which encoding they are using:

$ curl 'http://e.omroep.nl/tt888/POW_00942207' -v > /dev/null
* Adding handle: conn: 0x7fd4e0803000
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7fd4e0803000) send_pipe: 1, recv_pipe: 0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* About to connect() to e.omroep.nl port 80 (#0)
*   Trying 145.58.30.39...
* Connected to e.omroep.nl (145.58.30.39) port 80 (#0)
> GET /tt888/POW_00942207 HTTP/1.1
> User-Agent: curl/7.30.0
> Host: e.omroep.nl
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Sat, 28 Feb 2015 22:30:20 GMT
* Server Apache is not blacklisted
< Server: Apache
< Cache-Control: private, max-age=300, must-revalidate
< Expires: Sat, 28 Feb 2015 22:35:20 GMT
< Last-Modified: Sat, 28 Feb 2015 22:30:20 GMT
< Access-Control-Allow-Origin: *
< Connection: close
< Transfer-Encoding: chunked
< Content-Type: text/vtt
<
{ [data not shown]
100 28464    0 28464    0     0  30599      0 --:--:-- --:--:-- --:--:-- 30606
* Closing connection 0

So we can't detect it; we assume that it's UTF-8 and ignore errors. If they are always encoded with CP1252, we could download them inside the extractor and decode them properly.
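
A possible shape for that decoding step is sketched below. This is not youtube-dl code: the function names and the CP1252 fallback are assumptions for illustration. It uses the charset from the Content-Type header when there is one, otherwise tries strict UTF-8, and only then falls back to CP1252.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_subtitles(raw, content_type="text/vtt", fallback="cp1252"):
    """Decode subtitle bytes (hypothetical helper, not part of youtube-dl)."""
    charset = charset_from_content_type(content_type)
    if charset:
        try:
            return raw.decode(charset, errors="replace")
        except LookupError:
            pass  # unknown codec name; fall through to guessing
    try:
        # Strict decode: valid UTF-8 is accepted as-is.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 npo.nl subtitles are CP1252. A few bytes
        # are undefined in CP1252, hence errors="replace".
        return raw.decode(fallback, errors="replace")
```

With the `text/vtt` header shown in the curl output, no charset is found, so the result depends entirely on the UTF-8 attempt and the fallback.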
