# Ways to Fetch YouTube Transcripts

code authored by: Sue Ye

[Blog](https://medium.com/@sueye0425/fetch-youtube-transcripts-for-your-first-llm-project-fa78b8ad8dec)

  - Situation A: you already know the video_id of the video you want
  - Situation B: you don't know which video you want yet, so you search for one first


### Situation A: You know which video you want to download

In [20]:
video_id="zizonToFXDs"
video_url = f'https://www.youtube.com/watch?v={video_id}'
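
If you start from a full link instead of a bare ID, the ID has to be pulled out of the URL first. The sketch below is one possible way to do that with the standard library only. The helper name `extract_video_id` and the set of URL shapes it accepts are assumptions made for illustration, not part of any of the libraries used in this notebook.

```python
from urllib.parse import urlparse, parse_qs

# Hosts treated as YouTube in this sketch (an assumption, not an exhaustive list).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url):
    """Return the video ID from a common YouTube URL form, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS:
        # Standard links: https://www.youtube.com/watch?v=<id>
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        # Embed and Shorts links: /embed/<id> or /shorts/<id>
        parts = parsed.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] in ("embed", "shorts"):
            return parts[1]
    return None


print(extract_video_id("https://www.youtube.com/watch?v=zizonToFXDs"))
```

The function returns `None` for anything it does not recognize, so the caller can decide whether to skip the link or raise an error.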

- **Using YoutubeLoader from langchain_community**

In [6]:
from langchain_community.document_loaders import YoutubeLoader

In [None]:
loader = YoutubeLoader.from_youtube_url(video_url, add_video_info=False)
data = loader.load()
print(data[0].page_content)

JOHN EWALD: Hello, and
welcome to Introduction to Large Language Models. My name is John Ewald, and
I'm a training developer here at Google Cloud. In this course, you learn
to define large language models, or LLMs,
describe LLM use cases, explain prompt tuning, and
describe Google's Gen AI development tools. Large language models, or LLMs,
are a subset of deep learning. To find out more
about deep learning, see our Introduction to
Generative AI course video. LLMs and generative AI
intersect and they are both a part of deep learning. Another area of AI you
may be hearing a lot about is generative AI. This is a type of
artificial intelligence that can produce new content,
including text, images, audio, and synthetic data. So what are large
language models? Large language models refer to
large general-purpose language models that can be
pre-trained and then fine tuned for specific purposes. What do pre-trained
and fine tuned mean? Imagine training a dog. Often, you train your
dog basic co

In [8]:
data

[Document(metadata={'source': 'zizonToFXDs'}, page_content='JOHN EWALD: Hello, and\nwelcome to Introduction to Large Language Models. My name is John Ewald, and\nI\'m a training developer here at Google Cloud. In this course, you learn\nto define large language models, or LLMs,\ndescribe LLM use cases, explain prompt tuning, and\ndescribe Google\'s Gen AI development tools. Large language models, or LLMs,\nare a subset of deep learning. To find out more\nabout deep learning, see our Introduction to\nGenerative AI course video. LLMs and generative AI\nintersect and they are both a part of deep learning. Another area of AI you\nmay be hearing a lot about is generative AI. This is a type of\nartificial intelligence that can produce new content,\nincluding text, images, audio, and synthetic data. So what are large\nlanguage models? Large language models refer to\nlarge general-purpose language models that can be\npre-trained and then fine tuned for specific purposes. What do pre-trained\na

- **Using youtube-transcript-api**

In [9]:
from youtube_transcript_api import YouTubeTranscriptApi

In [13]:
transcript = YouTubeTranscriptApi.get_transcript(video_id)
for line in transcript:
    print(f"{line['text']} ({line['start']}s)")

JOHN EWALD: Hello, and
welcome to Introduction (0.57s)
to Large Language Models. (2.76s)
My name is John Ewald, and
I'm a training developer here (4.387s)
at Google Cloud. (6.72s)
In this course, you learn
to define large language (8.16s)
models, or LLMs,
describe LLM use cases, (10.86s)
explain prompt tuning, and
describe Google's Gen AI (14.67s)
development tools. (18.18s)
Large language models, or LLMs,
are a subset of deep learning. (20.43s)
To find out more
about deep learning, (24.42s)
see our Introduction to
Generative AI course video. (26.22s)
LLMs and generative AI
intersect and they are (29.67s)
both a part of deep learning. (32.43s)
Another area of AI you
may be hearing a lot about (35.4s)
is generative AI. (38.04s)
This is a type of
artificial intelligence that (39.72s)
can produce new content,
including text, images, audio, (41.97s)
and synthetic data. (45.73s)
So what are large
language models? (48.29s)
Large language models refer to
large general-purpose language (50.65s
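
The output above shows that each segment carries a `text` and a `start` field, and that a segment's text can itself contain a line break. For an LLM project you will often want either one clean string or readable timestamps. The following is a minimal sketch of both, assuming segments are dictionaries shaped like the ones printed above. The helper names `format_timestamp` and `segments_to_text` are made up for this example.

```python
def format_timestamp(seconds):
    """Convert a number of seconds to an MM:SS string (fractions are dropped)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def segments_to_text(segments):
    """Join segment texts into one string, flattening line breaks inside segments."""
    return " ".join(seg["text"].replace("\n", " ").strip() for seg in segments)


# Sample data copied from the first two segments of the output above.
sample = [
    {"text": "JOHN EWALD: Hello, and\nwelcome to Introduction", "start": 0.57},
    {"text": "to Large Language Models.", "start": 2.76},
]

for seg in sample:
    print(format_timestamp(seg["start"]), seg["text"].replace("\n", " "))
print(segments_to_text(sample))
```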

- **Using the yt-dlp command line (via subprocess)**

In [25]:
import subprocess

In [26]:
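def download_transcript(video_url):
    """Download English auto-generated subtitles as a .vtt file, without the video."""
    command = [
        "yt-dlp",
        "--write-auto-sub",       # request auto-generated subtitles
        "--skip-download",        # do not download the video itself
        "--sub-lang", "en",
        "--output", "%(title)s",  # name the output file after the video title
        video_url
    ]
    subprocess.run(command)

video_url = f"https://www.youtube.com/watch?v={video_id}"
download_transcript(video_url)
print("Transcript downloaded (if available).")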

[youtube] Extracting URL: https://www.youtube.com/watch?v=zizonToFXDs
[youtube] zizonToFXDs: Downloading webpage
[youtube] zizonToFXDs: Downloading tv client config
[youtube] zizonToFXDs: Downloading player f6e09c70
[youtube] zizonToFXDs: Downloading tv player API JSON
[youtube] zizonToFXDs: Downloading ios player API JSON
[youtube] zizonToFXDs: Downloading m3u8 information
[info] zizonToFXDs: Downloading subtitles: en
[info] zizonToFXDs: Downloading 1 format(s): 18
[info] Writing video subtitles to: Introduction to large language models.en.vtt
[download] Destination: Introduction to large language models.en.vtt
[download] 100% of  124.61KiB in 00:00:00 at 2.22MiB/s
Transcript downloaded (if available).
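
Unlike the two approaches above, this one leaves you with a `.vtt` file on disk rather than text in memory. The sketch below shows one rough way to reduce WebVTT content to plain text. It is not a full WebVTT parser: it assumes the file has no numeric cue identifiers, and the function name `vtt_to_text` and the sample string are invented for illustration.

```python
import re


def vtt_to_text(vtt):
    """Reduce WebVTT content to plain text (a rough sketch, not a full parser)."""
    lines = []
    for raw in vtt.splitlines():
        line = raw.strip()
        # Skip blank lines, the header, and cue timing lines.
        if not line or line.startswith("WEBVTT") or "-->" in line:
            continue
        # Skip common metadata and comment lines.
        if line.startswith(("Kind:", "Language:", "NOTE")):
            continue
        # Drop inline tags such as <c> or <00:00:03.000>.
        line = re.sub(r"<[^>]+>", "", line).strip()
        # Auto-generated captions often repeat a line across cues; keep it once.
        if line and (not lines or lines[-1] != line):
            lines.append(line)
    return " ".join(lines)


# A small hand-written sample, not taken from the downloaded file.
sample_vtt = """WEBVTT
Kind: captions
Language: en

00:00:00.570 --> 00:00:02.760
Hello, and welcome

00:00:02.760 --> 00:00:04.387
Hello, and welcome
to<00:00:03.000><c> Large</c> Language Models.
"""

print(vtt_to_text(sample_vtt))
```

To apply it to the downloaded file, read the file into a string first, for example with `pathlib.Path(...).read_text(encoding="utf-8")`.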




### Situation B: You don't know which video you want to download, so you search for it

In [31]:
query = 'AI tutorials'  # Replace with your search query

- **Using the yt_dlp Python package**

In [29]:
import yt_dlp

In [None]:
ydl_opts = {'extract_flat': True}
search_url = f'ytsearch10:{query}'

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    search_results = ydl.extract_info(search_url, download=False)
    videos = []
    for entry in search_results['entries']:
        videos.append({
            "title": entry.get('title', 'N/A'),
            "url": entry.get('url', 'N/A'),
            "duration": entry.get('duration', 'N/A'),
            "views": entry.get('view_count', 'N/A')
        })

print(videos)

[youtube:search] Extracting URL: ytsearch10:AI tutorials
[download] Downloading playlist: AI tutorials
[youtube:search] query "AI tutorials": Downloading web client config
[youtube:search] query "AI tutorials" page 1: Downloading API JSON
[youtube:search] Playlist AI tutorials: Downloading 10 items of 10
[download] Downloading item 1 of 10
[download] Downloading item 2 of 10
[download] Downloading item 3 of 10
[download] Downloading item 4 of 10
[download] Downloading item 5 of 10
[download] Downloading item 6 of 10
[download] Downloading item 7 of 10
[download] Downloading item 8 of 10
[download] Downloading item 9 of 10
[download] Downloading item 10 of 10
[download] Finished downloading playlist: AI tutorials
[{'title': 'Google’s AI Course for Beginners (in 10 minutes)!', 'url': 'https://www.youtube.com/watch?v=Yq0QkCxoTHM', 'duration': 558.0, 'views': 1530730}, {'title': "99% of Beginners Don't Know the Basics of AI", 'url': 'https://www.youtube.com/watch?v=nVyD6THcvDQ', 'duration'
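
Once the results are in a list of dictionaries, you can narrow them down before fetching any transcripts. The sketch below filters by duration and ranks by view count, assuming the same keys as the `videos` list built above. Because that code falls back to `'N/A'` for missing fields, the helper treats any non-numeric value as unknown. The function names and the two "hypothetical" sample entries are made up for this example; only the first entry comes from the output above.

```python
def numeric(value, default):
    """Return value if it is a number, otherwise default (covers 'N/A' and None)."""
    return value if isinstance(value, (int, float)) else default


def top_videos(videos, max_duration=None, limit=3):
    """Keep videos up to max_duration seconds, ranked by views (highest first)."""
    kept = [
        v for v in videos
        # An unknown duration counts as infinitely long, so it fails the filter.
        if max_duration is None or numeric(v["duration"], float("inf")) <= max_duration
    ]
    # An unknown view count is ranked as 0.
    return sorted(kept, key=lambda v: numeric(v["views"], 0), reverse=True)[:limit]


sample = [
    {"title": "Google’s AI Course for Beginners (in 10 minutes)!",
     "url": "https://www.youtube.com/watch?v=Yq0QkCxoTHM",
     "duration": 558.0, "views": 1530730},
    # The next two entries are hypothetical, added only to exercise the helper.
    {"title": "Hypothetical long video", "url": "N/A",
     "duration": 7200.0, "views": 5000000},
    {"title": "Hypothetical entry without metadata", "url": "N/A",
     "duration": "N/A", "views": "N/A"},
]

for video in top_videos(sample, max_duration=900):
    print(video["title"], video["views"])
```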

- **Using YouTube Data API for Search & Metadata**

In [33]:
import googleapiclient.discovery
from youtube_transcript_api import YouTubeTranscriptApi

In [30]:
# Load the YouTube Data API key
import os
from dotenv import load_dotenv
load_dotenv()  # Load environment variables from .env file
YOUTUBE_API_KEY = os.getenv("YOUTUBE_DATA_API_KEY")  # Fetch the API key
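
`os.getenv` returns `None` when the variable is missing, and the problem then only surfaces later as an API error. A small guard can make that failure easier to read. This is a sketch under the assumption that you want to stop early; the helper name `require_env` is hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


try:
    YOUTUBE_API_KEY = require_env("YOUTUBE_DATA_API_KEY")
except RuntimeError as err:
    # In a notebook you may prefer to let the error propagate instead.
    print(err)
```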

In [34]:
def search_youtube(query, max_results=5):
    """Search YouTube and return a list of (video_id, title)."""
    youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=YOUTUBE_API_KEY)
    response = youtube.search().list(q=query, part="id,snippet", maxResults=max_results, type="video").execute()
    return [(item["id"]["videoId"], item["snippet"]["title"]) for item in response.get("items", [])]

def get_transcript(video_id):
    """Fetch the transcript of a YouTube video."""
    try:
        return "\n".join(t["text"] for t in YouTubeTranscriptApi.get_transcript(video_id))
    except Exception:
        return "Transcript not available."

# Search, then print each result followed by its transcript
videos = search_youtube(query)

for idx, (video_id, title) in enumerate(videos, start=1):
    print(f"{idx}. {title} (https://www.youtube.com/watch?v={video_id})")
    print(f"\nTranscript for: {title}\n{get_transcript(video_id)}")

1. Google’s AI Course for Beginners (in 10 minutes)! (https://www.youtube.com/watch?v=Yq0QkCxoTHM)

Transcript for: Google’s AI Course for Beginners (in 10 minutes)!
if you don't have a technical background
but you still want to learn the basics
of artificial intelligence stick around
because we were distilling Google's
4-Hour AI course for beginners into just
10 minutes I was initially very
skeptical because I thought the course
would be too conceptual we're all about
practical tips on this channel and
knowing Google the course might just
disappear after 1 hour but I found the
underlying Concepts actually made me
better at using tools like Chachi BT and
Google bard and cleared up a bunch of
misconceptions I didn't know I had about
AI machine learning and large language
models so starting with the broadest
possible question what is artificial
intelligence it turns out and I'm so
embarrassed to admit I didn't know this
AI is an entire field of study like
physics and machine learning is 

- **Using youtube_search**

In [35]:
from youtube_search import YoutubeSearch

In [36]:

results = YoutubeSearch(query, max_results=10).to_json()

print(results)

{"videos": [{"id": "Yq0QkCxoTHM", "thumbnails": ["https://i.ytimg.com/vi/Yq0QkCxoTHM/hq720.jpg?sqp=-oaymwEjCOgCEMoBSFryq4qpAxUIARUAAAAAGAElAADIQj0AgKJDeAE=&rs=AOn4CLARELf9KoD6kOLmS9N0V9l_Z6DxnQ", "https://i.ytimg.com/vi/Yq0QkCxoTHM/hq720.jpg?sqp=-oaymwEXCNAFEJQDSFryq4qpAwkIARUAAIhCGAE=&rs=AOn4CLAX3MEkewIR0eKmkG6HvpX9m1yLlw"], "title": "Google\u2019s AI Course for Beginners (in 10 minutes)!", "long_desc": null, "channel": "Jeff Su", "duration": "9:18", "views": "1,530,743 views", "publish_time": "1 year ago", "url_suffix": "/watch?v=Yq0QkCxoTHM&pp=ygUMQUkgdHV0b3JpYWxz"}, {"id": "nVyD6THcvDQ", "thumbnails": ["https://i.ytimg.com/vi/nVyD6THcvDQ/hq720.jpg?sqp=-oaymwEjCOgCEMoBSFryq4qpAxUIARUAAAAAGAElAADIQj0AgKJDeAE=&rs=AOn4CLD_IaBa3inV2k0z3iPrNMI7TScnPA", "https://i.ytimg.com/vi/nVyD6THcvDQ/hq720.jpg?sqp=-oaymwEXCNAFEJQDSFryq4qpAwkIARUAAIhCGAE=&rs=AOn4CLDdYOChgNEE2yDdgsE4gtMYS88AJw"], "title": "99% of Beginners Don't Know the Basics of AI", "long_desc": null, "channel": "Jeff Su", "duration
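
`to_json()` returns a single JSON string, so the results need to be parsed before you can pass video IDs to one of the transcript methods from Situation A. The sketch below assumes the field names visible in the output above (`id`, `title`, `views`). The helper names are invented for this example, and the sample string is a trimmed copy of the first result.

```python
import json
import re


def parse_views(views_text):
    """Turn a string like '1,530,743 views' into an int, or None if no digits."""
    digits = re.sub(r"\D", "", views_text or "")
    return int(digits) if digits else None


def parse_search_results(results_json):
    """Convert the JSON string into a list of simple dictionaries."""
    return [
        {
            "id": v["id"],
            "title": v["title"],
            # Build the link from the ID rather than url_suffix, which carries extra parameters.
            "url": f"https://www.youtube.com/watch?v={v['id']}",
            "views": parse_views(v.get("views")),
        }
        for v in json.loads(results_json)["videos"]
    ]


# Trimmed copy of the first result from the output above.
sample_json = r'''{"videos": [{"id": "Yq0QkCxoTHM",
  "title": "Google\u2019s AI Course for Beginners (in 10 minutes)!",
  "channel": "Jeff Su", "duration": "9:18", "views": "1,530,743 views"}]}'''

for video in parse_search_results(sample_json):
    print(video["id"], video["views"], video["url"])
```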