This is a Ruby gem which allows you to retrieve the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles and it does not require a headless browser, like other selenium based solutions do!
This is a Ruby port of the Python youtube-transcript-api by jdepoix.
Add this line to your application's Gemfile:
gem 'youtube-transcript-rb'And then execute:
bundle install
Or install it yourself as:
gem install youtube-transcript-rb
The easiest way to get a transcript for a given video is to execute:
require 'youtube_rb/transcript'
api = YoutubeRb::Transcript::YouTubeTranscriptApi.new
api.fetch(video_id)Note: By default, this will try to access the English transcript of the video. If your video has a different language, or you are interested in fetching a transcript in a different language, please read the section below.
Note: Pass in the video ID, NOT the video URL. For a video with the URL
https://www.youtube.com/watch?v=12345the ID is12345.
This will return a FetchedTranscript object looking somewhat like this:
#<YoutubeRb::Transcript::FetchedTranscript
@video_id="12345",
@language="English",
@language_code="en",
@is_generated=false,
@snippets=[
#<YoutubeRb::Transcript::TranscriptSnippet @text="Hey there", @start=0.0, @duration=1.54>,
#<YoutubeRb::Transcript::TranscriptSnippet @text="how are you", @start=1.54, @duration=4.16>,
# ...
]
>This object implements Enumerable, so you can iterate over it:
api = YoutubeRb::Transcript::YouTubeTranscriptApi.new
fetched_transcript = api.fetch(video_id)
# is iterable
fetched_transcript.each do |snippet|
puts snippet.text
end
# indexable
last_snippet = fetched_transcript[-1]
# provides a length
snippet_count = fetched_transcript.lengthIf you prefer to handle the raw transcript data you can call fetched_transcript.to_raw_data, which will return
an array of hashes:
[
{
'text' => 'Hey there',
'start' => 0.0,
'duration' => 1.54
},
{
'text' => 'how are you',
'start' => 1.54,
'duration' => 4.16
},
# ...
]You can also use the convenience methods on the module directly:
require 'youtube_rb/transcript'
# Fetch a transcript
transcript = YoutubeRb::Transcript.fetch(video_id)
# List available transcripts
transcript_list = YoutubeRb::Transcript.list(video_id)You can add the languages param if you want to make sure the transcripts are retrieved in your desired language
(it defaults to english).
YoutubeRb::Transcript::YouTubeTranscriptApi.new.fetch(video_id, languages: ['de', 'en'])It's an array of language codes in a descending priority. In this example it will first try to fetch the german
transcript ('de') and then fetch the english transcript ('en') if it fails to do so. If you want to find out
which languages are available first, have a look at list.
If you only want one language, you still need to format the languages argument as an array:
YoutubeRb::Transcript::YouTubeTranscriptApi.new.fetch(video_id, languages: ['de'])You can also add preserve_formatting: true if you'd like to keep HTML formatting elements such as <i> (italics)
and <b> (bold).
YoutubeRb::Transcript::YouTubeTranscriptApi.new.fetch(video_id, languages: ['de', 'en'], preserve_formatting: true)If you want to list all transcripts which are available for a given video you can call:
api = YoutubeRb::Transcript::YouTubeTranscriptApi.new
transcript_list = api.list(video_id)This will return a TranscriptList object which is iterable and provides methods to filter the list of transcripts for
specific languages and types, like:
transcript = transcript_list.find_transcript(['de', 'en'])By default this module always chooses manually created transcripts over automatically created ones, if a transcript in
the requested language is available both manually created and generated. The TranscriptList allows you to bypass this
default behaviour by searching for specific transcript types:
# filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
# or automatically generated ones
transcript = transcript_list.find_generated_transcript(['de', 'en'])The methods find_generated_transcript, find_manually_created_transcript, find_transcript return Transcript
objects. They contain metadata regarding the transcript:
puts transcript.video_id
puts transcript.language
puts transcript.language_code
# whether it has been manually created or generated by YouTube
puts transcript.is_generated
# whether this transcript can be translated or not
puts transcript.translatable?
# a list of languages the transcript can be translated to
puts transcript.translation_languagesand provide the method, which allows you to fetch the actual transcript data:
transcript.fetchThis returns a FetchedTranscript object, just like YouTubeTranscriptApi.new.fetch does.
YouTube has a feature which allows you to automatically translate subtitles. This module also makes it possible to
access this feature. To do so Transcript objects provide a translate method, which returns a new translated
Transcript object:
transcript = transcript_list.find_transcript(['en'])
translated_transcript = transcript.translate('de')
puts translated_transcript.fetchrequire 'youtube_rb/transcript'
api = YoutubeRb::Transcript::YouTubeTranscriptApi.new
# retrieve the available transcripts
transcript_list = api.list('video_id')
# iterate over all available transcripts
transcript_list.each do |transcript|
# the Transcript object provides metadata properties
puts transcript.video_id
puts transcript.language
puts transcript.language_code
# whether it has been manually created or generated by YouTube
puts transcript.is_generated
# whether this transcript can be translated or not
puts transcript.translatable?
# a list of languages the transcript can be translated to
puts transcript.translation_languages
# fetch the actual transcript data
puts transcript.fetch
# translating the transcript will return another transcript object
puts transcript.translate('en').fetch if transcript.translatable?
end
# you can also directly filter for the language you are looking for, using the transcript list
transcript = transcript_list.find_transcript(['de', 'en'])
# or just filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
# or automatically generated ones
transcript = transcript_list.find_generated_transcript(['de', 'en'])You can fetch transcripts for multiple videos at once:
api = YoutubeRb::Transcript::YouTubeTranscriptApi.new
# Fetch multiple videos
transcripts = api.fetch_all(['video1', 'video2', 'video3'])
transcripts.each do |video_id, transcript|
puts "#{video_id}: #{transcript.length} snippets"
end
# With error handling - continue even if some videos fail
api.fetch_all(['video1', 'video2'], continue_on_error: true) do |video_id, result|
if result.is_a?(StandardError)
puts "Error for #{video_id}: #{result.message}"
else
puts "Got #{result.length} snippets for #{video_id}"
end
endFormatters are meant to be an additional layer of processing of the transcript you pass it. The goal is to convert a
FetchedTranscript object into a consistent string of a given "format". Such as a basic text (.txt) or even formats
that have a defined specification such as JSON (.json), WebVTT (.vtt), SRT (.srt), etc...
The Formatters module provides a few basic formatters:
JSONFormatterPrettyPrintFormatterTextFormatterWebVTTFormatterSRTFormatter
Here is how to import from the Formatters module:
require 'youtube_rb/transcript'
# Some provided formatter classes, each outputs a different string format.
YoutubeRb::Formatters::JSONFormatter
YoutubeRb::Formatters::TextFormatter
YoutubeRb::Formatters::PrettyPrintFormatter
YoutubeRb::Formatters::WebVTTFormatter
YoutubeRb::Formatters::SRTFormatterLet's say we wanted to retrieve a transcript and store it to a JSON file. That would look something like this:
require 'youtube_rb/transcript'
api = YoutubeRb::Transcript::YouTubeTranscriptApi.new
transcript = api.fetch(video_id)
formatter = YoutubeRb::Formatters::JSONFormatter.new
# .format_transcript(transcript) turns the transcript into a JSON string.
json_formatted = formatter.format_transcript(transcript)
# Now we can write it out to a file.
File.write('your_filename.json', json_formatted)
# Now should have a new JSON file that you can easily read back into Ruby.Passing extra keyword arguments
Since JSONFormatter leverages JSON.generate you can also forward keyword arguments into
.format_transcript(transcript) such as making your file output prettier:
json_formatted = YoutubeRb::Formatters::JSONFormatter.new.format_transcript(
transcript,
indent: ' ',
space: ' '
)You can also use the FormatterLoader to dynamically load formatters by name:
require 'youtube_rb/transcript'
loader = YoutubeRb::Formatters::FormatterLoader.new
# Load by type name: "json", "pretty", "text", "webvtt", "srt"
formatter = loader.load("json")
output = formatter.format_transcript(transcript)
formatter = loader.load("srt")
File.write('transcript.srt', formatter.format_transcript(transcript))You can implement your own formatter class. Just inherit from the Formatter base class and ensure you implement the
format_transcript and format_transcripts methods which should ultimately return a string:
class MyCustomFormatter < YoutubeRb::Formatters::Formatter
def format_transcript(transcript, **options)
# Do your custom work in here, but return a string.
'your processed output data as a string.'
end
def format_transcripts(transcripts, **options)
# Do your custom work in here to format an array of transcripts, but return a string.
'your processed output data as a string.'
end
endThe library provides a comprehensive set of exceptions for different error scenarios:
require 'youtube_rb/transcript'
begin
transcript = YoutubeRb::Transcript.fetch(video_id)
rescue YoutubeRb::Transcript::TranscriptsDisabled => e
puts "Subtitles are disabled for this video"
rescue YoutubeRb::Transcript::NoTranscriptFound => e
puts "No transcript found for the requested languages"
puts e.requested_language_codes
rescue YoutubeRb::Transcript::NoTranscriptAvailable => e
puts "No transcripts are available for this video"
rescue YoutubeRb::Transcript::VideoUnavailable => e
puts "The video is no longer available"
rescue YoutubeRb::Transcript::TooManyRequests => e
puts "Rate limited by YouTube"
rescue YoutubeRb::Transcript::RequestBlocked => e
puts "Request blocked by YouTube"
rescue YoutubeRb::Transcript::IpBlocked => e
puts "Your IP has been blocked by YouTube"
rescue YoutubeRb::Transcript::PoTokenRequired => e
puts "PO token required - this is a YouTube limitation"
rescue YoutubeRb::Transcript::CouldNotRetrieveTranscript => e
puts "Could not retrieve transcript: #{e.message}"
end| Exception | Description |
|---|---|
Error |
Base error class |
CouldNotRetrieveTranscript |
Base class for transcript retrieval errors |
YouTubeDataUnparsable |
YouTube data cannot be parsed |
YouTubeRequestFailed |
HTTP request to YouTube failed |
VideoUnplayable |
Video cannot be played |
VideoUnavailable |
Video is no longer available |
InvalidVideoId |
Invalid video ID provided |
RequestBlocked |
YouTube is blocking requests |
IpBlocked |
IP has been blocked by YouTube |
TooManyRequests |
Rate limited (HTTP 429) |
TranscriptsDisabled |
Subtitles are disabled for the video |
AgeRestricted |
Video is age-restricted |
NotTranslatable |
Transcript cannot be translated |
TranslationLanguageNotAvailable |
Requested translation language not available |
FailedToCreateConsentCookie |
Failed to create consent cookie |
NoTranscriptFound |
No transcript found for requested languages |
NoTranscriptAvailable |
No transcripts available for the video |
PoTokenRequired |
PO token required to fetch transcript |
Unfortunately, YouTube has started blocking most IPs that are known to belong to cloud providers (like AWS, Google Cloud
Platform, Azure, etc.), which means you will most likely run into RequestBlocked or IpBlocked exceptions when
deploying your code to any cloud solutions. Same can happen to the IP of your self-hosted solution, if you are doing
too many requests. You can work around these IP bans using proxies.
Note: Proxy support is planned for a future release.
When initializing a YouTubeTranscriptApi object, it will create a Faraday HTTP client which will be used for all
HTTP(S) requests. However, you can optionally pass a custom Faraday connection into its constructor:
require 'faraday'
http_client = Faraday.new do |conn|
conn.options.timeout = 60
conn.headers['Accept-Encoding'] = 'gzip, deflate'
conn.ssl.verify = true
conn.ssl.ca_file = '/path/to/certfile'
conn.adapter Faraday.default_adapter
end
api = YoutubeRb::Transcript::YouTubeTranscriptApi.new(http_client: http_client)
api.fetch(video_id)
# Share same connection between two instances
api_2 = YoutubeRb::Transcript::YouTubeTranscriptApi.new(http_client: http_client)
api_2.fetch(video_id)This code uses an undocumented part of the YouTube API, which is called by the YouTube web-client. So there is no guarantee that it won't stop working tomorrow, if they change how things work. I will however do my best to make things working again as soon as possible if that happens. So if it stops working, let me know!
After checking out the repo, run bin/setup to install dependencies. Then, run bundle exec rspec to run the tests.
You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install.
bundle exec rspec
Bug reports and pull requests are welcome on GitHub at https://github.com/stadia/youtube-transcript-rb.
This project is available as open source under the terms of the MIT License.
This is a Ruby port of youtube-transcript-api by jdepoix.