# YouTube Transcripts RAG

## Structured Output For YouTube Transcripts

In [14]:
from openai import OpenAI

openai_client = OpenAI()

def llm(user_prompt, instructions=None, model="gpt-4o-mini"):
    messages = []

    if instructions:
        messages.append({
            "role": "system",
            "content": instructions
        })

    messages.append({
        "role": "user",
        "content": user_prompt
    })

    response = openai_client.responses.create(
        model=model,
        input=messages
    )

    return response.output_text

In [1]:
!uv add youtube-transcript-api

[2K[2mResolved [1m152 packages[0m [2min 1.09s[0m[0m                                       [0m
[2K[37m⠙[0m [2mPreparing packages...[0m (0/1)                                                   [37m⠋[0m [2mPreparing packages...[0m (0/0)                                                   
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/1)----------------[0m[0m     0 B/473.68 KiB        [1A
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/1)----------------[0m[0m     0 B/473.68 KiB        [1A
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/1)----------------[0m[0m 14.90 KiB/473.68 KiB      [1A
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/1)----------------[0m[0m 30.90 KiB/473.68 KiB      [1A
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/1)----------------[0m[0m 46.90 KiB/473.68 KiB      [1A
[2K[1A[37m⠹[0m [2mPreparing packages...[0m (0/1)----------------[0m[0m 62.90 KiB/473.68 KiB      [1A
[2K[1A[37m⠹[0m [2mPreparing packag

In [2]:
from youtube_transcript_api import YouTubeTranscriptApi

In [3]:
video_id = 'ph1PxZIkz1o'

In [4]:
ytt_api = YouTubeTranscriptApi()
transcript = ytt_api.fetch(video_id)

We can optionally save it for later use:

In [5]:
import pickle

In [6]:
with open(f'{video_id}.bin', 'wb') as f_out:
    pickle.dump(transcript, f_out)

When running in GitHub Codespaces, we might need to use a pre-downloaded copy of the transcript instead:

In [None]:
!wget https://github.com/alexeygrigorev/ai-bootcamp-codespace/raw/refs/heads/main/week1/ph1PxZIkz1o.bin

In [None]:
with open(f'{video_id}.bin', 'rb') as f_in:
    transcript = pickle.load(f_in)
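
The save and load steps above can be combined into one small caching helper. This is only a sketch: `load_or_fetch` and its `fetch_fn` argument are hypothetical names, not part of `youtube-transcript-api`, and it assumes the fetched object can be pickled (as the cells above already do).

```python
import pickle
from pathlib import Path


def load_or_fetch(path, fetch_fn):
    """Return the object cached at `path`; otherwise call `fetch_fn` and cache its result."""
    path = Path(path)

    # Reuse the local copy if we already have one
    if path.exists():
        with path.open('rb') as f_in:
            return pickle.load(f_in)

    # No local copy yet: fetch, then save for later runs
    data = fetch_fn()
    with path.open('wb') as f_out:
        pickle.dump(data, f_out)

    return data
```

In this notebook it could be called as `load_or_fetch(f'{video_id}.bin', lambda: ytt_api.fetch(video_id))`. Only unpickle files from sources you trust, because `pickle.load` can execute arbitrary code.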

In [9]:
transcript[:10]

[FetchedTranscriptSnippet(text='So hi everyone. Uh today we are going to', start=0.0, duration=5.04),
 FetchedTranscriptSnippet(text='talk about our upcoming course. The', start=2.96, duration=3.52),
 FetchedTranscriptSnippet(text='upcoming course is called machine', start=5.04, duration=5.92),
 FetchedTranscriptSnippet(text='learning zoom camp. And um this is', start=6.48, duration=5.92),
 FetchedTranscriptSnippet(text='already I put the link in the', start=10.96, duration=3.599),
 FetchedTranscriptSnippet(text="description. So if you're watching um", start=12.4, duration=4.719),
 FetchedTranscriptSnippet(text="this video in recording or you're", start=14.559, duration=4.88),
 FetchedTranscriptSnippet(text='watching it live, you go here in the', start=17.119, duration=4.561),
 FetchedTranscriptSnippet(text='description after under this video and', start=19.439, duration=5.6),
 FetchedTranscriptSnippet(text='then you see a link course. uh click on', start=21.68, duration=6.24)]

In [10]:
def format_timestamp(seconds: float) -> str:
    """Convert seconds to H:MM:SS if at least one hour, else M:SS"""
    total_seconds = int(seconds)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, secs = divmod(remainder, 60)

    if hours > 0:
        return f"{hours}:{minutes:02}:{secs:02}"
    else:
        return f"{minutes}:{secs:02}"

In [11]:
def make_subtitles(transcript) -> str:
    lines = []

    for entry in transcript:
        ts = format_timestamp(entry.start)
        text = entry.text.replace('\n', ' ')
        lines.append(ts + ' ' + text)

    return '\n'.join(lines)

In [12]:
subtitles = make_subtitles(transcript)

In [13]:
print(subtitles[:500])

0:00 So hi everyone. Uh today we are going to
0:02 talk about our upcoming course. The
0:05 upcoming course is called machine
0:06 learning zoom camp. And um this is
0:10 already I put the link in the
0:12 description. So if you're watching um
0:14 this video in recording or you're
0:17 watching it live, you go here in the
0:19 description after under this video and
0:21 then you see a link course. uh click on
0:25 that link and this bring you will bring
0:27 you to
0:29 this website this GitHub


In [18]:
instructions = """
Summarize the transcript and describe the main purpose of the video
and the main ideas. 

Also output chapters with time. Use usual sentence case, not Title Case for the chapter.

Output format: 

<OUTPUT>
Summary

timestamp chapter 
timestamp chapter
...
timestamp chapter
</OUTPUT>

Don't include <OUTPUT> in the output
"""

In [19]:
answer = llm(subtitles, instructions=instructions)

In [20]:
print(answer)

The video is a presentation about the upcoming "Machine Learning Zoom Camp" course, detailing its structure, content, and answering frequently asked questions. The course, aimed at aspiring machine learning engineers, will commence on September 15th and consists of updated and foundational modules, primarily focusing on practical skills relevant to the industry, such as deploying models and working with Python and command-line tools. 

The speaker clarifies the suitability of the course for individuals with programming backgrounds, mentions that the course does not guarantee job placement but highlights past successes of graduates finding employment. Participants can expect to learn essential skills for entering the machine learning field, though advanced theoretical mathematics is not a primary focus. The course encourages project-based learning and offers a certificate upon completion of certain requirements.

Chapters:
00:00 Introduction to the Machine Learning Zoom Camp course
00:2
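
To get the chapters out of a free-form answer like this one, we would have to parse the text ourselves. Below is a rough sketch of what that might look like with a regular expression. The `parse_chapters` helper and the `sample` text are made up for illustration, and the pattern assumes every chapter line starts with a timestamp.

```python
import re

# Matches lines such as "0:00 Introduction" or "1:02:03 Closing remarks"
CHAPTER_RE = re.compile(r'^(\d{1,2}(?::\d{2}){1,2})\s+(.+)$')


def parse_chapters(answer: str) -> list[dict]:
    """Extract timestamp/title pairs from a free-form answer."""
    chapters = []

    for line in answer.splitlines():
        match = CHAPTER_RE.match(line.strip())
        if match:
            chapters.append({
                'timestamp': match.group(1),
                'title': match.group(2)
            })

    return chapters


# Hypothetical answer text, not real model output
sample = """Some summary text.

Chapters:
00:00 Introduction to the course
2:50 Purpose of the session
1:02:03 Closing remarks
"""

print(parse_chapters(sample))
```

This breaks as soon as the model changes the layout slightly, for example by putting a dash in front of each line, which is the kind of fragility that a fixed output schema avoids.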

### Using Structured Output with Pydantic for Easier Parsing

[Structured model outputs](https://platform.openai.com/docs/guides/structured-outputs)

In [21]:
from pydantic import BaseModel

In [22]:
class Chapter(BaseModel):
    timestamp: str
    title: str

class YTSummaryResponse(BaseModel):
    summary: str
    chapters: list[Chapter]

In [23]:
def llm_structured(instructions, user_prompt, output_type, model="gpt-4o-mini"):
    messages = [
        { "role": "system", "content": instructions },
        { "role": "user", "content": user_prompt }
    ]

    response = openai_client.responses.parse(
        model=model,
        input=messages,
        text_format=output_type
    )

    return response.output_parsed

In [24]:
summary = llm_structured(
    instructions=instructions,
    user_prompt=subtitles,
    output_type=YTSummaryResponse
)

In [26]:
print(summary.summary)
print()
for c in summary.chapters:
    print(c.timestamp, c.title)

The video introduces the upcoming "Machine Learning Zoom Camp" course, outlining its content, schedule, and prerequisites. It encourages viewers to sign up and explains course structure, focusing on machine learning engineering skills, updates to course materials, and the importance of hands-on projects. Additionally, common questions about job placement, prerequisites, and course outcomes are addressed, emphasizing that prior programming knowledge and comfort with command line usage are essential for participants, while also clarifying the nature of the course versus traditional data science.

0:00 Introduction to the machine learning zoom camp
2:50 Purpose of the session and course duration
3:18 Course updates and modules
6:06 Prerequisites for the course
8:06 Using command line and programming languages
11:43 Focus on machine learning engineering
18:51 Learning environment and computer requirements
22:44 Course structure and job readiness
29:59 Deadlines and live interaction
34:26 E

## Chunking YouTube Transcripts

For summarization it's okay to pass the entire transcript, because the LLM needs to see everything at once. For RAG, we split the transcript into chunks so that search can return only the parts relevant to a question.

Problems with large documents:
- Token limits: Most LLMs have maximum input token limits
- Cost: Longer prompts cost more money
- Performance: LLMs tend to perform worse with very long contexts
- Relevance: Not all parts of a long document are relevant to a specific question

Benefits of chunking:
- Focused context: Only relevant information is sent to the LLM
- Better search: Smaller chunks improve search precision
- Cost efficiency: You only pay for tokens that matter
- Improved accuracy: LLMs give better answers with focused, relevant context
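
To make the cost argument concrete, here is a back-of-the-envelope sketch comparing the prompt size for a full document against a few selected chunks. It assumes roughly 4 characters per token, which is only a rule of thumb for English text (use a real tokenizer for exact counts). `estimate_tokens` and `prompt_sizes` are hypothetical helpers, and the text is synthetic.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; assumes about 4 characters per token."""
    return int(len(text) / chars_per_token)


def prompt_sizes(full_text: str, chunk_texts: list[str], top_k: int) -> tuple[int, int]:
    """Return (tokens for the full text, tokens for the first top_k chunks)."""
    full = estimate_tokens(full_text)
    selected = sum(estimate_tokens(t) for t in chunk_texts[:top_k])
    return full, selected


# Synthetic document: 20,000 characters split into 2,000-character chunks
full_text = "word " * 4000
chunk_texts = [full_text[i:i + 2000] for i in range(0, len(full_text), 2000)]

print(prompt_sizes(full_text, chunk_texts, top_k=3))
```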

### Sliding Window Chunking

In [27]:
def sliding_window(seq, size, step):
    """Create overlapping chunks using sliding window approach."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        batch = seq[i:i+size]
        result.append(batch)
        if i + size >= n:
            break

    return result

# Example:
print(sliding_window(list(range(18)), 10, 3))
# Output: [[0,1,2,3,4,5,6,7,8,9], [3,4,5,6,7,8,9,10,11,12], ...]

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [3, 4, 5, 6, 7, 8, 9, 10, 11, 12], [6, 7, 8, 9, 10, 11, 12, 13, 14, 15], [9, 10, 11, 12, 13, 14, 15, 16, 17]]
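
If we want to know in advance how many chunks to expect, we can count them without building them. The sketch below repeats `sliding_window` so it runs on its own; `expected_chunks` is a hypothetical helper, and its formula assumes `step <= size` (overlapping or back-to-back windows).

```python
import math


def sliding_window(seq, size, step):
    """Same function as above, repeated so this example is self-contained."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        result.append(seq[i:i + size])
        if i + size >= n:
            break

    return result


def expected_chunks(n: int, size: int, step: int) -> int:
    """Number of chunks sliding_window produces, assuming step <= size."""
    if n == 0:
        return 0
    if n <= size:
        return 1
    # One chunk at position 0, plus one per step until a window reaches the end
    return math.ceil((n - size) / step) + 1


print(expected_chunks(18, 10, 3))
print(len(sliding_window(list(range(18)), 10, 3)))
```

Both lines print the same number for the example above. The last chunk can be shorter than `size`, as in the 18-element example where it has 9 items.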


In [30]:
def join_lines(transcript) -> str:
    """Join transcript entries into continuous text."""
    lines = []

    for entry in transcript:
        text = entry.text.replace('\n', ' ')
        lines.append(text)

    return ' '.join(lines)

def format_chunk(chunk):
    """Format a chunk with start/end timestamps and text."""
    time_start = format_timestamp(chunk[0].start)
    # Note: this is the start time of the last snippet, not the moment the chunk ends
    time_end = format_timestamp(chunk[-1].start)
    text = join_lines(chunk)

    return {
        'start': time_start,
        'end': time_end,
        'text': text
    }

In [28]:
chunk = transcript[:10]

In [31]:
join_lines(chunk)

"So hi everyone. Uh today we are going to talk about our upcoming course. The upcoming course is called machine learning zoom camp. And um this is already I put the link in the description. So if you're watching um this video in recording or you're watching it live, you go here in the description after under this video and then you see a link course. uh click on"

In [32]:
format_chunk(chunk)

{'start': '0:00',
 'end': '0:21',
 'text': "So hi everyone. Uh today we are going to talk about our upcoming course. The upcoming course is called machine learning zoom camp. And um this is already I put the link in the description. So if you're watching um this video in recording or you're watching it live, you go here in the description after under this video and then you see a link course. uh click on"}

In [None]:
chunks = []

for chunk in sliding_window(transcript, 60, 30):
    processed = format_chunk(chunk)
    chunks.append(processed)

In [34]:
print(f"Created {len(chunks)} chunks")

Created 46 chunks


In [35]:
from minsearch import Index

index = Index(text_fields=["text"])
index.fit(chunks)

<minsearch.minsearch.Index at 0x114c838c0>

In [37]:
index.search("Can I find a job after the course?", num_results=5)

[{'start': '52:34',
  'end': '55:07',
  'text': "project I submitted was a fake course project. So there was nothing that's why I didn't get any points. Uh the reason I got uh nine uh is uh cuz I evaluated other peers. So that's why um like for each evalation I get three points. But this is how it's done. So the we evaluate projects by doing peer review and peer review is mandatory to complete the project. So if you submit a project but you don't do peer reviewing you fail the project and if you fail a project you fail the course. Right? So this very important to do peer reviews. Uh will the course make one job ready? Yes. If you put effort in the the the course and if you make a good project, if you also follow our recommendations to learn in public, this will definitely make you job ready. Uh what's the next path to follow after the completing the course? Uh to step into advanced stuff, find a job. That's the best way. Um cuz you can do courses forever, but I think you need to work o

## Building the RAG System

In [38]:
import json

def search(query):
    """Search for relevant documents."""
    return index.search(
        query=query,
        num_results=15
    )

instructions = """
Answer the QUESTION based on the CONTEXT from the subtitles of a YouTube video.

Use only the facts from the CONTEXT when answering the QUESTION.

When answering the question, 
provide the citation in form of the video URL pointing at the timestamp where
this is discussed. If the question is discussed in multiple documents,
cite all of them.

Don't use markdown or any formatting in the output.
""".strip()

prompt_template = """
<VIDEO_ID>
{video_id}
</VIDEO_ID>

<QUESTION>
{question}
</QUESTION>

<CONTEXT>
{context}
</CONTEXT>
""".strip()

def build_prompt(question, search_results):
    context = json.dumps(search_results)
    return prompt_template.format(
        question=question,
        context=context,
        video_id=video_id
    ).strip()

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    response = llm(prompt, instructions=instructions)
    return response

# Test it:
answer = rag('Can I find a job after the course?')
print(answer)

Yes, there is a chance of finding a job after completing the course. While the course does not provide direct job placement, many participants have successfully found jobs afterwards. The skills taught in the course are essential for a machine learning engineer, which enhances your employability in the field. It is emphasized that gaining real experience through projects and potentially volunteering can further make you job-ready. For best practices, starting on projects as soon as possible is advised rather than solely taking more courses.

You can find more details about this at the following timestamps in the video:
- 1:21 - 3:49
- 51:23 - 53:52
- 53:54 - 56:16
- 34:49 - 37:18

For more information, you can check the video here: https://www.youtube.com/watch?v=ph1PxZIkz1o
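
The instructions ask for citations as video URLs pointing at a timestamp, but the answer above lists plain time ranges. One option is to precompute a link for each chunk so the model only has to copy it. The sketch below assumes the common `t=<seconds>s` query parameter for YouTube watch URLs; `timestamp_to_seconds`, `youtube_url`, and `add_urls` are hypothetical helpers, not part of the notebook.

```python
def timestamp_to_seconds(timestamp: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' back to seconds."""
    seconds = 0
    for part in timestamp.split(':'):
        seconds = seconds * 60 + int(part)
    return seconds


def youtube_url(video_id: str, timestamp: str) -> str:
    """Build a watch URL that starts playback at the given timestamp."""
    seconds = timestamp_to_seconds(timestamp)
    return f"https://www.youtube.com/watch?v={video_id}&t={seconds}s"


def add_urls(chunks: list[dict], video_id: str) -> list[dict]:
    """Return copies of the chunks with a 'url' field pointing at each chunk's start."""
    return [
        {**chunk, 'url': youtube_url(video_id, chunk['start'])}
        for chunk in chunks
    ]


print(youtube_url('ph1PxZIkz1o', '52:34'))
```

Indexing the output of `add_urls(chunks, video_id)` instead of `chunks` would put the `url` field into the JSON context that `build_prompt` sends to the model.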
