How do I extract the subtitles in plain text? #17178

Closed
magician11 opened this Issue Aug 7, 2018 · 4 comments

magician11 commented Aug 7, 2018

I can see how to extract the automatically generated subtitles for a video, e.g.

youtube-dl --write-auto-sub --skip-download https://youtu.be/bQLkDomt59A

This creates a file, in this instance called React Router v4-bQLkDomt59A.en.vtt

The first part of this WEBVTT file looks like this:

WEBVTT
Kind: captions
Language: en
Style:
::cue(c.colorCCCCCC) { color: rgb(204,204,204);
 }
::cue(c.colorE5E5E5) { color: rgb(229,229,229);
 }
##

00:00:00.079 --> 00:00:07.249 align:start position:0%
 
hi<c.colorE5E5E5><00:00:04.100><c> so</c><00:00:05.100><c> in</c><00:00:05.940><c> terms</c><00:00:06.210><c> of</c><00:00:06.390><c> in</c><00:00:06.629><c> terms</c><00:00:06.660><c> of</c><00:00:07.020><c> the</c></c>

00:00:07.249 --> 00:00:07.259 align:start position:0%
hi<c.colorE5E5E5> so in terms of in terms of the
 </c>

00:00:07.259 --> 00:00:11.419 align:start position:0%
hi<c.colorE5E5E5> so in terms of in terms of the
routing<00:00:09.710><c> how</c><00:00:10.710><c> do</c><00:00:10.769><c> I</c><00:00:10.830><c> move</c><00:00:10.950><c> from</c><00:00:11.070><c> one</c><00:00:11.160><c> patient</c></c>

00:00:11.419 --> 00:00:11.429 align:start position:0%
routing<c.colorE5E5E5> how do I move from one patient
 </c>

What is the best way to simply extract the plain text from these subtitles? Notice how the text repeats, so I can't just strip out the tags.

From the YouTube dashboard, I can download the srt and sbv formats. These look far easier to post-process.

However, when I try to grab the srt format using this tool

youtube-dl --write-auto-sub --sub-format=srt --skip-download https://youtu.be/bQLkDomt59A

I get

[youtube] bQLkDomt59A: Downloading video info webpage
[youtube] bQLkDomt59A: Looking for automatic captions
[youtube] bQLkDomt59A: Downloading MPD manifest
[youtube] bQLkDomt59A: Downloading MPD manifest
WARNING: No subtitle format found matching "srt" for language en, using vtt
[info] Writing video subtitles to: React Router v4-bQLkDomt59A.en.vtt

What am I missing here?

Otherwise for the vtt file that does get downloaded, any suggestions for a library to post-process this file?

Thanks.

Collaborator

dstftw commented Aug 7, 2018

There is no feature to "just strip tags". Subtitles are provided as is. You can convert to other formats with --convert-subs but this will preserve the markup whenever possible.

@dstftw dstftw closed this Aug 7, 2018

Repository owner deleted a comment from egyBlind Aug 7, 2018

magician11 commented Aug 7, 2018

I don't fully understand this. From the YouTube dashboard I can download an srt. With this tool I can't. Why's that?

So instead you're saying I need to download the vtt and then use the --convert-subs flag?

When I try that (youtube-dl --write-auto-sub --convert-subs=srt --skip-download https://youtu.be/bQLkDomt59A), it just downloads the vtt and doesn't convert it to an srt.

Also, what's confusing is that when I run --list-subs I get

youtube-dl  --list-subs --skip-download https://youtu.be/bQLkDomt59A
[youtube] bQLkDomt59A: Downloading webpage
[youtube] bQLkDomt59A: Downloading video info webpage
WARNING: video doesn't have subtitles
[youtube] bQLkDomt59A: Looking for automatic captions
[youtube] bQLkDomt59A: Downloading MPD manifest
[youtube] bQLkDomt59A: Downloading MPD manifest
Available automatic captions for bQLkDomt59A:
Language formats
gu       vtt, ttml
zh-Hans  vtt, ttml
zh-Hant  vtt, ttml
gd       vtt, ttml
ga       vtt, ttml
gl       vtt, ttml
lb       vtt, ttml
la       vtt, ttml
lo       vtt, ttml
tr       vtt, ttml
lv       vtt, ttml
lt       vtt, ttml
th       vtt, ttml
tg       vtt, ttml
te       vtt, ttml
fil      vtt, ttml
haw      vtt, ttml
yi       vtt, ttml
ceb      vtt, ttml
yo       vtt, ttml
de       vtt, ttml
da       vtt, ttml
el       vtt, ttml
eo       vtt, ttml
en       vtt, ttml
eu       vtt, ttml
et       vtt, ttml
es       vtt, ttml
ru       vtt, ttml
ro       vtt, ttml
bn       vtt, ttml
be       vtt, ttml
bg       vtt, ttml
uk       vtt, ttml
jv       vtt, ttml
bs       vtt, ttml
ja       vtt, ttml
xh       vtt, ttml
co       vtt, ttml
ca       vtt, ttml
cy       vtt, ttml
cs       vtt, ttml
ps       vtt, ttml
pt       vtt, ttml
pa       vtt, ttml
vi       vtt, ttml
pl       vtt, ttml
hy       vtt, ttml
hr       vtt, ttml
ht       vtt, ttml
hu       vtt, ttml
hmn      vtt, ttml
hi       vtt, ttml
ha       vtt, ttml
mg       vtt, ttml
uz       vtt, ttml
ml       vtt, ttml
mn       vtt, ttml
mi       vtt, ttml
mk       vtt, ttml
ur       vtt, ttml
mt       vtt, ttml
ms       vtt, ttml
mr       vtt, ttml
ta       vtt, ttml
my       vtt, ttml
af       vtt, ttml
sw       vtt, ttml
is       vtt, ttml
am       vtt, ttml
it       vtt, ttml
iw       vtt, ttml
sv       vtt, ttml
ar       vtt, ttml
su       vtt, ttml
zu       vtt, ttml
az       vtt, ttml
id       vtt, ttml
ig       vtt, ttml
nl       vtt, ttml
no       vtt, ttml
ne       vtt, ttml
ny       vtt, ttml
fr       vtt, ttml
ku       vtt, ttml
fy       vtt, ttml
fa       vtt, ttml
fi       vtt, ttml
ka       vtt, ttml
kk       vtt, ttml
sr       vtt, ttml
sq       vtt, ttml
ko       vtt, ttml
kn       vtt, ttml
km       vtt, ttml
st       vtt, ttml
sk       vtt, ttml
si       vtt, ttml
so       vtt, ttml
sn       vtt, ttml
sm       vtt, ttml
sl       vtt, ttml
ky       vtt, ttml
sd       vtt, ttml
bQLkDomt59A has no subtitles

So captions but no subtitles?

jakemcannon commented Aug 14, 2018

Having the same issue. Last night I was able to download srt files with --convert-subs srt, but for whatever reason the same command on the same video does not work today.

magician11 commented Aug 14, 2018

Hi @jakecan13, I've been researching this a bunch, and I finally figured it out using another module.

The working code sample is as follows:

const { getSubtitles } = require('youtube-captions-scraper');
const getYouTubeID = require('get-youtube-id');

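// Fetch the captions for a YouTube URL and return them as one plain-text string.
const getYouTubeSubtitles = async youtubeUrl => {
  try {
    const videoID = getYouTubeID(youtubeUrl);
    const subtitles = await getSubtitles({ videoID });
    // Each caption entry has a `text` field; join them with single spaces
    // (avoids the leading space the reduce-based version produced).
    return subtitles.map(subtitle => subtitle.text).join(' ');
  } catch (error) {
    console.error(`Error getting captions: ${error.message}`);
    // Return an empty string so the caller does not print "undefined".
    return '';
  }
};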

(async () => {
  const consoleArguments = process.argv;
  if (consoleArguments.length !== 3) {
    console.log(
      'usage example: node get-youtube-subtitles.js https://www.youtube.com/watch?v=gypAjPp6eps'
    );
    return;
  }

  const subtitles = await getYouTubeSubtitles(consoleArguments[2]);
  console.log(subtitles);
})();

Here is the gist.
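
Alternatively, the .vtt file that youtube-dl does write can be post-processed directly. The following is only a minimal sketch, not a feature of youtube-dl: vttToPlainText is a hypothetical helper name, and it assumes the rolling-caption layout shown in the sample at the top of this issue, where each cue repeats the previous line before adding a new one. It skips the header, drops the timing lines, strips the inline tags, and ignores any line that merely repeats the one before it.

```javascript
// Sketch: reduce a YouTube auto-generated .vtt file to plain text.
// Assumption: cues follow the rolling layout from the sample in this issue
// (no cue identifiers, each cue repeats the previous caption line).
function vttToPlainText(vtt) {
  const lines = [];
  let inCues = false; // false while still inside the WEBVTT header/style block

  for (const rawLine of vtt.split(/\r?\n/)) {
    if (rawLine.includes('-->')) {
      inCues = true; // timing line: starts a cue and carries no caption text
      continue;
    }
    if (!inCues) continue; // ignore "WEBVTT", "Kind:", "Style:", "::cue(...)" etc.

    const text = rawLine
      .replace(/<[^>]*>/g, '') // drop <c>, </c>, <c.colorE5E5E5>, <00:00:04.100>
      .replace(/\s+/g, ' ')    // collapse runs of whitespace
      .trim();

    // Skip empty lines and lines that only repeat the previous caption line.
    if (text === '' || text === lines[lines.length - 1]) continue;
    lines.push(text);
  }

  return lines.join(' ');
}
```

To try it on the file from the original command, read it with fs.readFileSync('React Router v4-bQLkDomt59A.en.vtt', 'utf8') and pass the resulting string to vttToPlainText. Files that use cue identifiers or HTML entities would need extra handling that this sketch does not attempt.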
