How do I extract the subtitles in plain text? #17178

Closed
magician11 opened this issue Aug 7, 2018 · 11 comments

Comments

@magician11 magician11 commented Aug 7, 2018

I can see how to extract the automatically generated subtitles for a video, e.g.

youtube-dl --write-auto-sub --skip-download https://youtu.be/bQLkDomt59A

In this instance, this creates a file called React Router v4-bQLkDomt59A.en.vtt

The first part of this WEBVTT file looks like this:

WEBVTT
Kind: captions
Language: en
Style:
::cue(c.colorCCCCCC) { color: rgb(204,204,204);
 }
::cue(c.colorE5E5E5) { color: rgb(229,229,229);
 }
##

00:00:00.079 --> 00:00:07.249 align:start position:0%
 
hi<c.colorE5E5E5><00:00:04.100><c> so</c><00:00:05.100><c> in</c><00:00:05.940><c> terms</c><00:00:06.210><c> of</c><00:00:06.390><c> in</c><00:00:06.629><c> terms</c><00:00:06.660><c> of</c><00:00:07.020><c> the</c></c>

00:00:07.249 --> 00:00:07.259 align:start position:0%
hi<c.colorE5E5E5> so in terms of in terms of the
 </c>

00:00:07.259 --> 00:00:11.419 align:start position:0%
hi<c.colorE5E5E5> so in terms of in terms of the
routing<00:00:09.710><c> how</c><00:00:10.710><c> do</c><00:00:10.769><c> I</c><00:00:10.830><c> move</c><00:00:10.950><c> from</c><00:00:11.070><c> one</c><00:00:11.160><c> patient</c></c>

00:00:11.419 --> 00:00:11.429 align:start position:0%
routing<c.colorE5E5E5> how do I move from one patient
 </c>

What is the best way to simply extract the plain text from these subtitles? Notice how the text repeats, so I can't just strip out the tags.

From the YouTube dashboard, I can download the srt and sbv formats. These look far easier to post-process.

However, when I try to grab the srt format using this tool

youtube-dl --write-auto-sub --sub-format=srt --skip-download https://youtu.be/bQLkDomt59A

I get

[youtube] bQLkDomt59A: Downloading video info webpage
[youtube] bQLkDomt59A: Looking for automatic captions
[youtube] bQLkDomt59A: Downloading MPD manifest
[youtube] bQLkDomt59A: Downloading MPD manifest
WARNING: No subtitle format found matching "srt" for language en, using vtt
[info] Writing video subtitles to: React Router v4-bQLkDomt59A.en.vtt

What am I missing here?

Otherwise for the vtt file that does get downloaded, any suggestions for a library to post-process this file?

Thanks.

@dstftw dstftw (Collaborator) commented Aug 7, 2018

There is no feature to "just strip tags". Subtitles are provided as is. You can convert to other formats with --convert-subs but this will preserve the markup whenever possible.
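
Since the markup is left in place, the repetition in the VTT sample above has to be handled in a post-processing step. The following is only a minimal sketch in Node.js, not a youtube-dl feature: `vttToPlainText` is a hypothetical helper name, and it assumes the file looks like the YouTube auto-caption sample quoted in this issue (no cue identifiers, each cue repeating the previous line). It strips the tags, then drops any line identical to the one just emitted.

```javascript
// Hypothetical helper (not part of youtube-dl): convert a YouTube
// auto-caption VTT string into plain text.
function vttToPlainText(vtt) {
  const lines = [];
  let seenCue = false;
  for (const rawLine of vtt.split(/\r?\n/)) {
    // Timing lines mark the start of a cue; everything before the
    // first one is the header (WEBVTT, Kind, Language, Style).
    if (rawLine.includes('-->')) {
      seenCue = true;
      continue;
    }
    if (!seenCue) continue;
    // Remove <c>, </c>, <c.colorXXXXXX> and <00:00:00.000> style tags,
    // then collapse whitespace.
    const text = rawLine.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
    // Skip empty lines and lines that repeat the previously kept line.
    if (text === '' || text === lines[lines.length - 1]) continue;
    lines.push(text);
  }
  return lines.join(' ');
}

const exampleVtt = [
  'WEBVTT',
  '',
  '00:00:00.079 --> 00:00:07.249 align:start position:0%',
  'hi<c.colorE5E5E5><00:00:04.100><c> so</c></c>',
  '',
  '00:00:07.249 --> 00:00:07.259 align:start position:0%',
  'hi<c.colorE5E5E5> so',
  ' </c>',
].join('\n');

console.log(vttToPlainText(exampleVtt)); // prints "hi so"
```

One caveat of this approach: a speaker who really does say the same line twice in a row loses the second occurrence, and HTML entities such as `&amp;` are not decoded.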

@dstftw dstftw closed this Aug 7, 2018
@ytdl-org ytdl-org deleted a comment from IbraamNasif Aug 7, 2018
@magician11 magician11 (Author) commented Aug 7, 2018

I don't fully understand this... from the YouTube dashboard I can download a srt. With this tool I can't. Why's that?

So instead you're saying I need to download the vtt and then use the --convert-subs flag?

When I try that with youtube-dl --write-auto-sub --convert-subs=srt --skip-download https://youtu.be/bQLkDomt59A, it just downloads the vtt and doesn't convert it to an srt.

What's also confusing is that when I run --list-subs I get

youtube-dl  --list-subs --skip-download https://youtu.be/bQLkDomt59A
[youtube] bQLkDomt59A: Downloading webpage
[youtube] bQLkDomt59A: Downloading video info webpage
WARNING: video doesn't have subtitles
[youtube] bQLkDomt59A: Looking for automatic captions
[youtube] bQLkDomt59A: Downloading MPD manifest
[youtube] bQLkDomt59A: Downloading MPD manifest
Available automatic captions for bQLkDomt59A:
Language formats
gu       vtt, ttml
zh-Hans  vtt, ttml
zh-Hant  vtt, ttml
gd       vtt, ttml
ga       vtt, ttml
gl       vtt, ttml
lb       vtt, ttml
la       vtt, ttml
lo       vtt, ttml
tr       vtt, ttml
lv       vtt, ttml
lt       vtt, ttml
th       vtt, ttml
tg       vtt, ttml
te       vtt, ttml
fil      vtt, ttml
haw      vtt, ttml
yi       vtt, ttml
ceb      vtt, ttml
yo       vtt, ttml
de       vtt, ttml
da       vtt, ttml
el       vtt, ttml
eo       vtt, ttml
en       vtt, ttml
eu       vtt, ttml
et       vtt, ttml
es       vtt, ttml
ru       vtt, ttml
ro       vtt, ttml
bn       vtt, ttml
be       vtt, ttml
bg       vtt, ttml
uk       vtt, ttml
jv       vtt, ttml
bs       vtt, ttml
ja       vtt, ttml
xh       vtt, ttml
co       vtt, ttml
ca       vtt, ttml
cy       vtt, ttml
cs       vtt, ttml
ps       vtt, ttml
pt       vtt, ttml
pa       vtt, ttml
vi       vtt, ttml
pl       vtt, ttml
hy       vtt, ttml
hr       vtt, ttml
ht       vtt, ttml
hu       vtt, ttml
hmn      vtt, ttml
hi       vtt, ttml
ha       vtt, ttml
mg       vtt, ttml
uz       vtt, ttml
ml       vtt, ttml
mn       vtt, ttml
mi       vtt, ttml
mk       vtt, ttml
ur       vtt, ttml
mt       vtt, ttml
ms       vtt, ttml
mr       vtt, ttml
ta       vtt, ttml
my       vtt, ttml
af       vtt, ttml
sw       vtt, ttml
is       vtt, ttml
am       vtt, ttml
it       vtt, ttml
iw       vtt, ttml
sv       vtt, ttml
ar       vtt, ttml
su       vtt, ttml
zu       vtt, ttml
az       vtt, ttml
id       vtt, ttml
ig       vtt, ttml
nl       vtt, ttml
no       vtt, ttml
ne       vtt, ttml
ny       vtt, ttml
fr       vtt, ttml
ku       vtt, ttml
fy       vtt, ttml
fa       vtt, ttml
fi       vtt, ttml
ka       vtt, ttml
kk       vtt, ttml
sr       vtt, ttml
sq       vtt, ttml
ko       vtt, ttml
kn       vtt, ttml
km       vtt, ttml
st       vtt, ttml
sk       vtt, ttml
si       vtt, ttml
so       vtt, ttml
sn       vtt, ttml
sm       vtt, ttml
sl       vtt, ttml
ky       vtt, ttml
sd       vtt, ttml
bQLkDomt59A has no subtitles

So captions but no subtitles?

@jakemcannon jakemcannon commented Aug 14, 2018

Having the same issue. Last night I was able to download srt files with --convert-subs srt, but for whatever reason the same command on the same video does not work today.

@magician11 magician11 (Author) commented Aug 14, 2018

Hi @jakecan13, I've been researching this a bunch, and I finally figured it out using another module.

The working code sample is as follows:

const { getSubtitles } = require('youtube-captions-scraper');
const getYouTubeID = require('get-youtube-id');

// Fetch the captions for a YouTube URL and join their text into one string.
const getYouTubeSubtitles = async youtubeUrl => {
  const videoID = getYouTubeID(youtubeUrl);
  const subtitles = await getSubtitles({ videoID });
  // Each caption object carries its text in `text`; join with single spaces
  // (map + join avoids the leading space a string reduce would produce).
  return subtitles.map(subtitle => subtitle.text).join(' ');
};

(async () => {
  const consoleArguments = process.argv;
  if (consoleArguments.length !== 3) {
    console.log(
      'usage example: node get-youtube-subtitles.js https://www.youtube.com/watch?v=gypAjPp6eps'
    );
    return;
  }

  try {
    const subtitles = await getYouTubeSubtitles(consoleArguments[2]);
    console.log(subtitles);
  } catch (error) {
    // Report the failure instead of printing "undefined" as the result.
    console.error(`Error getting captions: ${error.message}`);
    process.exitCode = 1;
  }
})();

Here is the gist.

@dacorsa dacorsa commented Jan 14, 2020

Thanks, but how can I download a video with subtitles using this module?
Best regards

@jakemcannon jakemcannon commented Jan 14, 2020

@dacorsa You want to download both the video and its subtitles in a .txt file?

@dacorsa dacorsa commented Jan 14, 2020

Yes, but I'd like the video and subtitles embedded in a single file.

@jakemcannon jakemcannon commented Jan 14, 2020

@dacorsa Sorry, in that case I'm not sure how to achieve this.

@dacorsa dacorsa commented Jan 14, 2020

OK, and for separate files? Can you help me?

@dacorsa dacorsa commented Jan 16, 2020

Thanks, I solved it with your link:
youtube-dl -ci -f "bestvideo[ext=mp4]"+"bestaudio[ext=m4a]" --sub-lang en,it,fr,pt --write-auto-sub --write-sub --embed-subs --merge-output-format mp4 https:..........
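
For anyone driving this from Node.js like the earlier sample, the same command can be built as an argument list. This is only a sketch under assumptions: `buildYoutubeDlArgs` and `downloadWithEmbeddedSubs` are hypothetical helper names, youtube-dl is assumed to be on the PATH, and ffmpeg is assumed to be installed so the subtitles can be embedded. The flags are exactly the ones from the command above; `-ci` is written as `-c -i`, and no shell quoting is needed because no shell is involved.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: builds the same argument list as the command above.
function buildYoutubeDlArgs(url, languages) {
  return [
    '-c', '-i',                                    // same as -ci
    '-f', 'bestvideo[ext=mp4]+bestaudio[ext=m4a]', // mp4 video plus m4a audio
    '--sub-lang', languages.join(','),             // e.g. en,it,fr,pt
    '--write-auto-sub',                            // automatic captions
    '--write-sub',                                 // uploaded subtitles, if any
    '--embed-subs',                                // put subtitles into the video file
    '--merge-output-format', 'mp4',
    url,
  ];
}

// Hypothetical helper: runs youtube-dl (assumed to be on the PATH) and
// streams its output to the current terminal. Not called here.
function downloadWithEmbeddedSubs(url, languages) {
  return spawn('youtube-dl', buildYoutubeDlArgs(url, languages), {
    stdio: 'inherit',
  });
}
```

Calling `downloadWithEmbeddedSubs('https://youtu.be/bQLkDomt59A', ['en', 'it', 'fr', 'pt'])` would then start the download for the video from the original question.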
