# **YouTube Video RAG**

Authored by [Kalyan KS](https://www.linkedin.com/in/kalyanksnlp/). For LLM, RAG and agent updates, you can follow me on [Twitter](https://x.com/kalyan_kpl).

- Step-1 : Extract the YouTube video transcript
- Step-2 : Chunk the extracted transcript text
- Step-3 : Create a vector store with the transcript chunks
- Step-4 : Create a retriever which will return the relevant chunks
- Step-5 : Build context from the relevant chunk texts
- Step-6 : Build the RAG chain from the RAG prompt, the LLM and a string output parser
- Step-7 : Run the RAG chain to get the answer

## **Install and import libraries**

- `YoutubeLoader` uses the `youtube-transcript-api` Python library to extract the transcript.

In [None]:
!pip install -qU langchain langchain-community langchain-text-splitters
!pip install -qU langchain-openai langchain-chroma youtube-transcript-api


In [None]:
from langchain_community.document_loaders import YoutubeLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda

## **Set up LLM API Key**

- Save the `OPENAI_API_KEY` in Google Colab Secrets

In [None]:
from google.colab import userdata
import os
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

## **Extract YouTube video transcript**

In [None]:
from typing import List
from langchain_core.documents import Document  # canonical location of Document

def yt_transcript(video_url: str) -> List[Document]:
    """
    Extracts transcript of given YouTube video using YoutubeLoader.

    Parameters:
    video_url (str): The URL of the YouTube video.

    Returns:
    List[Document]: A list of Document objects containing the transcript.
    """

    print("YouTube video transcript is extracted...")
    loader = YoutubeLoader.from_youtube_url(video_url)
    transcript = loader.load()

    return transcript

In [None]:
video_url = "https://www.youtube.com/watch?v=d4IyR-kl_mY"
transcript = yt_transcript(video_url)

YouTube video transcript is extracted...


In [None]:
print(transcript)

[Document(metadata={'source': 'd4IyR-kl_mY'}, page_content="a passenger fery service from nagapattinam in Tamil Nadu to kesan Tay in jafna is being resumed after several months this has revived the potential for strengthening cultural and economic ties between India and Sri Lanka what is this fairy service all [Music] about welome this is T sures Kumar thank you for joining me in this episode of the Focus Tamil Nadu on August 16th shivaganga a passenger faery service will set sail from the nagap patnam port and it will reach kesay or kks port in about 4 hours this fairy service has a seating occupancy of 150 including 27 Premier seating the two coastal towns have historically shared a very close cultural ties in ancient times nagan Nadu that is the present day nagapattinam referred only to Sri Lanka likewise kesan derived its name after the Hindu dat mugan or ktia Kang is famed for his beaches and temples it has two historic temples the kirim Malai nageswaran Temple and the muram kanda

In [None]:
print(f"Number of documents = {len(transcript)}")

Number of documents = 1
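
The `source` field in the metadata above holds the video ID (`d4IyR-kl_mY`) rather than the full URL. As a rough illustration of how such an ID can be pulled out of common YouTube URL shapes, here is a minimal sketch using only the standard library. The helper name `extract_video_id` and the set of accepted hosts are assumptions made for this example; this is not the loader's actual implementation.

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs

# Hosts accepted by this sketch (an assumption, not an exhaustive list)
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}

def extract_video_id(url: str) -> Optional[str]:
    """Hypothetical helper: return the video ID from a YouTube URL, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None

    if host in YOUTUBE_HOSTS:
        # Watch links carry the ID in the `v` query parameter
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None

    return None

print(extract_video_id("https://www.youtube.com/watch?v=d4IyR-kl_mY"))
```

A helper like this can be handy for deriving a per-video database path, so that different videos do not share one vector store.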


## **Chunk Transcript text**

In [None]:
def yt_chunk(transcript: List[Document]) -> List[Document]:
    """
    Splits extracted transcript text into smaller chunks using RecursiveCharacterTextSplitter.

    Parameters:
    transcript (List[Document]): A list of Document objects containing extracted transcript.

    Returns:
    List[Document]: A list of chunked Document objects.
    """

    print("YouTube video transcript text is chunked....")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = text_splitter.split_documents(transcript)

    return chunks

In [None]:
chunks = yt_chunk(transcript)

YouTube video transcript text is chunked....


In [None]:
print(f"Number of chunks = {len(chunks)}")

Number of chunks = 6


In [None]:
print(chunks[0])

page_content='a passenger fery service from nagapattinam in Tamil Nadu to kesan Tay in jafna is being resumed after several months this has revived the potential for strengthening cultural and economic ties between India and Sri Lanka what is this fairy service all [Music] about welome this is T sures Kumar thank you for joining me in this episode of the Focus Tamil Nadu on August 16th shivaganga a passenger faery service will set sail from the nagap patnam port and it will reach kesay or kks port in about 4 hours this fairy service has a seating occupancy of 150 including 27 Premier seating the two coastal towns have historically shared a very close cultural ties in ancient times nagan Nadu that is the present day nagapattinam referred only to Sri Lanka likewise kesan derived its name after the Hindu dat mugan or ktia Kang is famed for his beaches and temples it has two historic temples the kirim Malai nageswaran Temple and the muram kandaswami temple India and Sri Lanka have historic
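
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of a fixed-window chunker. Note that this is not how `RecursiveCharacterTextSplitter` works internally: that splitter tries a list of separators and only falls back to smaller pieces when needed, so its chunk boundaries will differ. The function name `sliding_chunks` is hypothetical and exists only for this illustration.

```python
from typing import List

def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Hypothetical helper: cut text into fixed-size windows that overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    # Each new window starts (chunk_size - chunk_overlap) characters later
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the overlap. With `chunk_size=1000` and `chunk_overlap=100` the same idea applies at a larger scale: neighbouring chunks share some text so that a sentence cut at a boundary is still partly visible in both.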

## **Create Vector Store**

In [None]:
# Set the chroma DB path
current_dir = "/content/rag"
persistent_directory = os.path.join(current_dir, "db", "chroma_db_yt")

In [None]:
def create_vector_store(chunks: List[Document], db_path: str) -> Chroma:
    """
    Creates a Chroma vector store from chunked documents.

    Parameters:
    chunks (List[Document]): A list of chunked Document objects.
    db_path (str): The directory path to persist the vector store.

    Returns:
    Chroma: A Chroma vector store containing the embedded documents.
    """

    print("Chroma vector store is created...\n")

    embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
    db = Chroma.from_documents(documents=chunks, embedding=embedding_model, persist_directory=db_path)

    return db

In [None]:
db = create_vector_store(chunks, persistent_directory)

Chroma vector store is created...



## **Retrieve relevant chunks**

In [None]:
def retrieve_context(db: Chroma, query: str) -> List[Document]:
    """
    Retrieves relevant document chunks from the Chroma vector store based on a query.

    Parameters:
    db (Chroma): The Chroma vector store containing embedded documents.
    query (str): The query string to search for relevant document chunks.

    Returns:
    List[Document]: A list of retrieved relevant document chunks.
    """

    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 2})
    print("Relevant chunks are retrieved...\n")
    relevant_chunks = retriever.invoke(query)

    return relevant_chunks

In [None]:
query = "passenger ferry service is resumed from which place?"

relevant_chunks = retrieve_context(db, query)

Relevant chunks are retrieved...



In [None]:
print(f"Number of relevant chunks = {len(relevant_chunks)}")

Number of relevant chunks = 2


In [None]:
for i, chunk in enumerate(relevant_chunks):
    print(f"Chunk-{i}")
    print(chunk)
    print("\n")

Chunk-0
page_content='a passenger faery service from caral in the union territory of puducheri to KES and turai however subsequently it was decided to have the fer service from nagapattinam to kks Port last year amid's great Fanfare Cher Yani a highspeed craft had set sail from nagap patam to Kay carrying about 50 passengers on board at that time prime minister Narendra Modi had said that this will be a new chapter in diplomatic and economic ties between the two countries he was optimistic that the fery service would help to strengthen the cultural civilizational and Commercial ties between India and Sri Lanka he believed that this marittime connectivity was the central theme of The Joint vision of Indo Sri Lanka economic ties Modi had even said that India will now take steps to revive the sea route between t Manar and rameshwaram the Sri Lankan president ranil Vikram Ming had also hailed this revival of Maritime ties between the two countries however this fery service was shortlived a
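
The retriever above returns the `k` chunks whose embeddings are closest to the query embedding. The sketch below shows that idea on toy 2-dimensional vectors. Cosine similarity is assumed here purely for illustration; the metric the vector store actually uses depends on its configuration, which this notebook does not set. The names `cosine` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings" standing in for three chunks
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.1]

for index, score in top_k(query_vec, doc_vecs, k=2):
    print(index, round(score, 3))
```

Real embeddings from `text-embedding-3-small` have far more dimensions, and a vector store uses an index rather than a full scan, but the ranking principle is the same.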

## **Build context**

In [None]:
def build_context(relevant_chunks: List[Document]) -> str:
    """
    Builds a context string from retrieved relevant document chunks.

    Parameters:
    relevant_chunks (List[Document]): A list of retrieved relevant document chunks.

    Returns:
    str: A concatenated string containing the content of the relevant chunks.
    """

    print("Context is built from relevant chunks")
    context = "\n\n".join([chunk.page_content for chunk in relevant_chunks])

    return context

In [None]:
context = build_context(relevant_chunks)

Context is built from relevant chunks


In [None]:
print(context)

a passenger faery service from caral in the union territory of puducheri to KES and turai however subsequently it was decided to have the fer service from nagapattinam to kks Port last year amid's great Fanfare Cher Yani a highspeed craft had set sail from nagap patam to Kay carrying about 50 passengers on board at that time prime minister Narendra Modi had said that this will be a new chapter in diplomatic and economic ties between the two countries he was optimistic that the fery service would help to strengthen the cultural civilizational and Commercial ties between India and Sri Lanka he believed that this marittime connectivity was the central theme of The Joint vision of Indo Sri Lanka economic ties Modi had even said that India will now take steps to revive the sea route between t Manar and rameshwaram the Sri Lankan president ranil Vikram Ming had also hailed this revival of Maritime ties between the two countries however this fery service was shortlived after its initial

a pa

## **Combine all the steps into one function**

In [None]:
from typing import Dict

def get_context(inputs: Dict[str, str]) -> Dict[str, str]:
    """
    Creates or loads a vector store for the video transcript and retrieves relevant chunks based on a query.

    Args:
        inputs (Dict[str, str]): A dictionary containing the following keys:
            - 'video_url' (str): The YouTube video URL.
            - 'query' (str): The user query.
            - 'db_path' (str): Path to the vector database.

    Returns:
        Dict[str, str]: A dictionary containing:
            - 'context' (str): Extracted relevant context.
            - 'query' (str): The user query.
    """
    video_url, query, db_path  = inputs['video_url'], inputs['query'], inputs['db_path']

    # Create new vector store if it does not exist
    if not os.path.exists(db_path):
        print("Creating a new vector store...\n")
        transcript = yt_transcript(video_url)
        chunks = yt_chunk(transcript)
        db = create_vector_store(chunks, db_path)

    # Load the existing vector store
    else:
        print("Loading the existing vector store\n")
        embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
        db = Chroma(persist_directory=db_path, embedding_function=embedding_model)

    relevant_chunks = retrieve_context(db, query)
    context = build_context(relevant_chunks)

    return {'context': context, 'query': query}

## **Build RAG chain**

In [None]:
template = """ You are an AI model trained for question answering. You should answer the
  given question based on the given context only.
  Question : {query}
  \n
  Context : {context}
  \n
  If the answer is not present in the given context, respond as: The answer to this question is not available
  in the provided content.
  """

rag_prompt = ChatPromptTemplate.from_template(template)

llm = ChatOpenAI(model='gpt-4o-mini')

str_parser = StrOutputParser()

rag_chain = (
    RunnableLambda(get_context)
    | rag_prompt
    | llm
    | str_parser
)
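
The `|` operator in the chain above feeds the output of each step into the next one: the dictionary from `get_context` fills the prompt, the prompt goes to the LLM, and the parser turns the LLM message into a plain string. The toy sketch below mimics that data flow with plain Python and no API calls. The `Step` class and the `fake_*` steps are hypothetical stand-ins written for this illustration; they are not LangChain's implementation.

```python
from typing import Any, Callable

class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # Compose: run this step first, then pass its result to the next step
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x: Any) -> Any:
        return self.fn(x)

# Fake stages that mirror the shape of the real chain
fake_get_context = Step(lambda inputs: {
    "context": "ferry sails from Nagapattinam",
    "query": inputs["query"],
})
fake_prompt = Step(lambda d: f"Question: {d['query']}\nContext: {d['context']}")
fake_llm = Step(lambda prompt: {"content": prompt.upper()})  # pretend model reply
fake_parser = Step(lambda message: message["content"])       # keep only the text

toy_chain = fake_get_context | fake_prompt | fake_llm | fake_parser
print(toy_chain.invoke({"query": "where from?"}))
```

The point of the sketch is the shape of the data at each boundary: a dict goes in, a dict with `context` and `query` reaches the prompt, and a string comes out at the end.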

## **Run RAG chain**

In [None]:
# Set the chroma DB path
current_dir = "/content/rag"
persistent_directory = os.path.join(current_dir, "db", "chroma_db_yt")

In [None]:
# Video URL
video_url = "https://www.youtube.com/watch?v=d4IyR-kl_mY"

In [None]:
# Write the query
query = 'passenger ferry service is resumed from which place?'

In [None]:
answer = rag_chain.invoke({'video_url':video_url, 'query':query, 'db_path':persistent_directory})

Loading the existing vector store

Relevant chunks are retrieved...

Context is built from relevant chunks


In [None]:
print(f"Query:{query}\n")
print(f"Generated answer:{answer}")

Query:passenger ferry service is resumed from which place?

Generated answer:The passenger ferry service is resumed from Nagapattinam in Tamil Nadu.
