## YOUTUBE VIDEO CHAT USING RAG

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [3]:
from langchain_community.document_loaders import YoutubeLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled

In [4]:
fetcher = YouTubeTranscriptApi()

In [7]:
transcript_list = fetcher.fetch(video_id = "d_qvLDhkg00" )

In [8]:
transcript_list

FetchedTranscript(snippets=[FetchedTranscriptSnippet(text='The basic function underlying a normal distribution,', start=0.0, duration=3.182), FetchedTranscriptSnippet(text='aka a Gaussian, is e to the negative x squared.', start=3.182, duration=2.938), FetchedTranscriptSnippet(text='But you might wonder, why this function?', start=6.64, duration=1.7), FetchedTranscriptSnippet(text='Of all the expressions we could dream up that give you some symmetric smooth graph', start=8.72, duration=3.988), FetchedTranscriptSnippet(text='with mass concentrated towards the middle, why is it that the theory of probability', start=12.708, duration=4.085), FetchedTranscriptSnippet(text='seems to have a special place in its heart for this particular expression?', start=16.793, duration=3.647), FetchedTranscriptSnippet(text="For the last many videos I've been hinting at an answer to this question,", start=21.38, duration=3.239), FetchedTranscriptSnippet(text="and here we'll finally arrive at something lik

In [11]:
type(transcript_list)

youtube_transcript_api._transcripts.FetchedTranscript

In [15]:
print(transcript_list[0])
print(transcript_list[0].text)

FetchedTranscriptSnippet(text='The basic function underlying a normal distribution,', start=0.0, duration=3.182)
The basic function underlying a normal distribution,


In [17]:
transcript = " ".join(chunk.text for chunk in transcript_list)

In [19]:
transcript

"The basic function underlying a normal distribution, aka a Gaussian, is e to the negative x squared. But you might wonder, why this function? Of all the expressions we could dream up that give you some symmetric smooth graph with mass concentrated towards the middle, why is it that the theory of probability seems to have a special place in its heart for this particular expression? For the last many videos I've been hinting at an answer to this question, and here we'll finally arrive at something like a satisfying answer. As a quick refresher on where we are, a couple videos ago we talked about the central limit theorem, which describes how as you add multiple copies of a random variable, for example rolling a weighted die many different times, or letting a ball bounce off of a peg repeatedly, then the distribution describing that sum tends to look approximately like a normal distribution. What the central limit theorem says is as you make that sum bigger and bigger, under appropriate 

In [21]:
# we will use this podcast video for our project -> Gfr50f6ZBvo

## STEP 1 : INDEXING 

## STEP 1a : Transcript loading (external knowledge)

In [25]:
video_id = "Gfr50f6ZBvo"
fetcher = YouTubeTranscriptApi()
try:
    transcript_list = fetcher.fetch(video_id=video_id , languages= ["en"])
    transcript = " ".join(chunk.text for chunk in transcript_list)
    print("Done fetching transcript")
except TranscriptsDisabled:
    transcript = None  # avoid silently reusing the transcript fetched from the earlier video
    print("No captions available for this video")

Done fetching transcript


In [27]:
transcript

"the following is a conversation with demus hasabis ceo and co-founder of deepmind a company that has published and builds some of the most incredible artificial intelligence systems in the history of computing including alfred zero that learned all by itself to play the game of gold better than any human in the world and alpha fold two that solved protein folding both tasks considered nearly impossible for a very long time demus is widely considered to be one of the most brilliant and impactful humans in the history of artificial intelligence and science and engineering in general this was truly an honor and a pleasure for me to finally sit down with him for this conversation and i'm sure we will talk many times again in the future this is the lex friedman podcast to support it please check out our sponsors in the description and now dear friends here's demis hassabis let's start with a bit of a personal question am i an ai program you wrote to interview people until i get good enough

## STEP 1b : TEXT SPLITTING 

In [31]:
# text splitter
splitter = RecursiveCharacterTextSplitter(chunk_size = 1000 , chunk_overlap =200)
chunks = splitter.create_documents([transcript])

In [33]:
chunks[0].page_content

"the following is a conversation with demus hasabis ceo and co-founder of deepmind a company that has published and builds some of the most incredible artificial intelligence systems in the history of computing including alfred zero that learned all by itself to play the game of gold better than any human in the world and alpha fold two that solved protein folding both tasks considered nearly impossible for a very long time demus is widely considered to be one of the most brilliant and impactful humans in the history of artificial intelligence and science and engineering in general this was truly an honor and a pleasure for me to finally sit down with him for this conversation and i'm sure we will talk many times again in the future this is the lex friedman podcast to support it please check out our sponsors in the description and now dear friends here's demis hassabis let's start with a bit of a personal question am i an ai program you wrote to interview people until i get good enough

In [35]:
len(chunks)

168
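To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of an overlapping character window. It is a simplified stand-in and not how `RecursiveCharacterTextSplitter` is implemented: the real splitter also tries to break on separators such as newlines and spaces. The function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Assumption: this ignores separators entirely, unlike the real splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(demo_chunks)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the tail of one chunk is repeated at the head of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.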

## STEP 1c and 1d : EMBEDDING AND VECTOR STORE 

In [71]:
# # testing embedding model 
# embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")
# vector = embeddings.embed_query("hello, world!")
# vector[:5]

[-0.023955047, 0.011876456, -0.0033613679, -0.0584139, 0.0015592978]

In [83]:
!pip install langchain-huggingface

Collecting langchain-huggingface
  Downloading langchain_huggingface-1.2.0-py3-none-any.whl.metadata (2.8 kB)
Collecting huggingface-hub<1.0.0,>=0.33.4 (from langchain-huggingface)
  Downloading huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Downloading langchain_huggingface-1.2.0-py3-none-any.whl (30 kB)
Downloading huggingface_hub-0.36.0-py3-none-any.whl (566 kB)
Installing collected packages: huggingface-hub, langchain-huggingface
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface_hub 0.29.2
    Uninstalling huggingface_hub-0.29.2:
      Successfully uninstalled huggingface_hub-0.29.2
Successfully installed huggingface-hub-0.36.0 langchain-huggingface-1.2.0


In [41]:
from langchain_huggingface import HuggingFaceEmbeddings

In [47]:
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'} 
encode_kwargs = {'normalize_embeddings': False}


embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)



vector_store = FAISS.from_documents(chunks, embeddings)

In [49]:
vector_store.index_to_docstore_id

{0: 'af903914-b4cf-4ee6-81ec-8245d02efd17',
 1: '62050ee7-deec-49c8-9d52-c6f9d89b0103',
 2: 'b3adcbf5-af11-4ad2-8d8f-66fc12f6dcab',
 3: '38bb4935-7fe9-4026-b6e0-4b6e9a1cdd57',
 4: 'a8169ccc-7f5c-477b-a73a-5fc0c2b2bed3',
 5: 'c44b2b70-f310-4041-b35f-a5a1f69f3215',
 6: 'a4b44639-3de3-45a2-b591-abcb6dc87637',
 7: '01792b1e-2068-4088-8bfa-482d9d0a658d',
 8: 'bed74465-029d-4957-8ce2-20adc38f444d',
 9: '8203d9d6-530d-40ff-b3cc-e0e90dbd8210',
 10: '6b52f930-f182-467d-98dd-b1646d677fd5',
 11: '1f7b20aa-7ab8-4eb0-b601-7aacec5fc42e',
 12: '03bf3736-a1fa-4696-b1b4-307a90190244',
 13: 'ac8bc9dc-4953-45c5-8e26-2213b296694e',
 14: '11ce8c20-5864-4ed1-b7ab-b1346f2f80d2',
 15: '54a94dc8-5206-47b9-9567-ca549c47e10d',
 16: '053225d0-e9dc-4d72-8f93-7567a0a682ae',
 17: '8b734335-9183-4696-8aa5-da942f8f7364',
 18: '1e7be535-6c24-4fba-9f39-86078d1df91e',
 19: '391180ad-ad56-4704-9cab-494a8eeb4d62',
 20: 'c05d2b7d-5d6a-40b8-a350-417093232c65',
 21: '159bd9fe-b4c1-47f8-a854-ed4e2302085b',
 22: 'aa8e06c1-2d2b-

In [51]:
vector_store.get_by_ids(['c0397475-d222-4276-9c17-dd906a8b39c6'])

[Document(id='c0397475-d222-4276-9c17-dd906a8b39c6', metadata={}, page_content="from the systems like all right how do i explain to the excuse me exactly all right let me i don't have time to explain uh maybe i'll draw you a picture that it is i mean how do you even begin um to answer that question well i think it would um what would you what would you think the answer could possibly look like i think it could it could start looking like uh uh more fundamental explanations of physics would be the beginning you know more careful specification of that taking you walking us through by the hand as to what one would do to maybe prove those things out maybe giving you glimpses of what things you totally missed in the physics of today exactly just here here's glimpses of no like there's a much uh a much more elaborate world or a much simpler world or something a much deeper maybe simpler explanation yes of things right than the standard model of physics which we know doesn't work but we still

## STEP 2 : RETRIEVER

In [99]:

retriever = vector_store.as_retriever(search_type = "similarity", search_kwargs = {"k":4})


In [55]:
retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000001703E59D480>, search_kwargs={'k': 4})

In [61]:
retriever.invoke("What is Deep Mind? ")

[Document(id='e5b73d45-b64c-4f0c-8380-7c9106c81d73', metadata={}, page_content="something there's something more deeper underlying it maybe computational now if we were in a if we were in a sort of safari park and everything we were seeing was a hologram and it was projected by the aliens or whatever that to me is not much different than thinking we're inside of another universe because we still can't see true reality right i mean there's there's other explanations it could be that the way they're communicating is just fundamentally different that we're too dumb to understand the much better methods of communication they have it could be i mean i mean it's silly to say but our own thoughts could be the methods by which they're communicating like the place from which our ideas writers talk about this like the muse yeah it sounds like very kind of uh wild but it could be thoughts it could be some interactions with our mind that we think are originating from us is actually something that 
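What `search_type = "similarity"` with `k = 4` amounts to can be sketched as a brute-force nearest-neighbour search. This is an illustration under assumptions: it uses L2 distance (which, as far as I know, is the default for the LangChain FAISS wrapper), and the 2-D vectors are invented toy values rather than real MiniLM embeddings. The helper names `l2_distance` and `top_k` are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors closest to query_vec, closest first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: l2_distance(query_vec, doc_vecs[i]))
    return ranked[:k]


# Toy 2-D "embeddings" (made up); one vector per chunk.
toy_vectors = [
    [1.0, 0.0],   # chunk 0
    [0.9, 0.1],   # chunk 1
    [0.0, 1.0],   # chunk 2
    [-1.0, 0.0],  # chunk 3
]
toy_query = [1.0, 0.2]
nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)
```

The retriever always returns the `k` closest chunks, whether or not any of them actually answers the question. Closeness in embedding space is only a proxy for relevance, so it is worth inspecting the returned documents, as the cell above does.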

## STEP 3 : AUGMENTATION 


In [104]:

prompt = PromptTemplate(
    template = """
    You are a helpful assistant.
    Answer only from the provided transcript context.
    If the context is insufficient, just say you don't know.

    Context : {context}

    Question: {question}
    """, 
    input_variables=['context', 'question']
)

In [65]:
question = "is the topic of nuclear discussed in this video ?"

retrieved_docs = retriever.invoke(question)

In [69]:
retrieved_docs

[Document(id='57e6b3c6-a3c3-4fdc-85a4-d08118f66758', metadata={}, page_content="what people have done that within just one year which is a short amount of time in science and uh it's been used by over 500 000 researchers have used it we think that's almost every biologist in the world i think there's roughly 500 000 biologists in the world professional biologists have used it to to look at their proteins of interest we've seen amazing fundamental research done so a couple of weeks ago front cover there was a whole special issue of science including the front cover which had the nuclear pore complex on it which is one of the biggest proteins in the body the nuclear poor complex is a protein that governs all the nutrients going in and out of your cell nucleus so they're like little hole gateways that open and close to let things go in and out of your cell nucleus so they're really important but they're huge because they're massive doughnut rings shaped things and they've been looking to 

In [71]:
# merging the whole context 

context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)
context_text

"what people have done that within just one year which is a short amount of time in science and uh it's been used by over 500 000 researchers have used it we think that's almost every biologist in the world i think there's roughly 500 000 biologists in the world professional biologists have used it to to look at their proteins of interest we've seen amazing fundamental research done so a couple of weeks ago front cover there was a whole special issue of science including the front cover which had the nuclear pore complex on it which is one of the biggest proteins in the body the nuclear poor complex is a protein that governs all the nutrients going in and out of your cell nucleus so they're like little hole gateways that open and close to let things go in and out of your cell nucleus so they're really important but they're huge because they're massive doughnut rings shaped things and they've been looking to try and figure out that structure for decades and they have lots of you know\n\

In [73]:
final_prompt = prompt.invoke({"context":context_text , "question":question})

In [75]:
final_prompt

StringPromptValue(text="\n    You are helpful assistant.\n    Answers only from the provided transcript context. \n    If the context is insufficient, just say you don't know.\n\n    Context : what people have done that within just one year which is a short amount of time in science and uh it's been used by over 500 000 researchers have used it we think that's almost every biologist in the world i think there's roughly 500 000 biologists in the world professional biologists have used it to to look at their proteins of interest we've seen amazing fundamental research done so a couple of weeks ago front cover there was a whole special issue of science including the front cover which had the nuclear pore complex on it which is one of the biggest proteins in the body the nuclear poor complex is a protein that governs all the nutrients going in and out of your cell nucleus so they're like little hole gateways that open and close to let things go in and out of your cell nucleus so they're re
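The augmentation step is, at its core, string assembly: join the retrieved chunks and substitute them into a template together with the question. A minimal sketch with plain `str.format`, assuming nothing beyond the standard library; `TEMPLATE`, `build_prompt`, and the toy chunks are illustrative names and values, not part of LangChain.

```python
TEMPLATE = (
    "You are a helpful assistant.\n"
    "Answer only from the provided transcript context.\n"
    "If the context is insufficient, just say you don't know.\n\n"
    "Context : {context}\n\n"
    "Question: {question}\n"
)


def build_prompt(chunks, question):
    """Join chunk texts with blank lines and fill the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Hypothetical chunk texts standing in for doc.page_content values.
toy_chunks = ["first retrieved chunk", "second retrieved chunk"]
toy_prompt = build_prompt(toy_chunks, "is fusion discussed?")
print(toy_prompt)
```

`PromptTemplate` adds input-variable validation and integrates with chains, but the text that reaches the model is built the same way.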

## STEP 4 : GENERATION 


In [79]:
from langchain_google_genai import ChatGoogleGenerativeAI


In [81]:
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash-lite",
    temperature=0.7,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

In [89]:
ans = llm.invoke(final_prompt)

print(ans.content)

Yes, the topic of the nuclear pore complex is discussed in the video.


## BUILDING THE CHAIN FOR A SMOOTHER WORKFLOW
- The code above wires every step together by hand.
- Chains express the same pipeline in a shorter, composable form.

In [110]:
from langchain_community.document_loaders import YoutubeLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_huggingface import HuggingFaceEmbeddings
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

In [112]:
## initializing embedding model and LLM

model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'} 
encode_kwargs = {'normalize_embeddings': False}


embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)


llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash-lite",
    temperature=0.7,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

In [114]:
def format_docs(retrieved_docs):
    # Join the retrieved chunks into one context string, separated by blank lines
    return "\n\n".join(doc.page_content for doc in retrieved_docs)

In [120]:
def RAG_Pipeline(video_id):

    # 1. indexing
    fetcher = YouTubeTranscriptApi()
    try:
        transcript_list = fetcher.fetch(video_id=video_id, languages=["en"])
        transcript = " ".join(chunk.text for chunk in transcript_list)
        print("Done fetching transcript")
    except TranscriptsDisabled:
        print("No captions available for this video")
        return  # nothing to index, so stop instead of hitting a NameError below

    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.create_documents([transcript])
    print("Done Splitting")
    vector_store = FAISS.from_documents(chunks, embeddings)
    print("vector store stage completed")
    # 2. Retriever 
    retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})
    print("Initialized Retriever")
    prompt = PromptTemplate(
    template= """
    You are a helpful assistant. 
    Answer only from the provided transcript context.
    if context is insufficient, just say you don't know.

    Context : {context}
    Question: {user_input}
    """, 
    input_variables=['context' , 'user_input']
    )
    parallel_chain = RunnableParallel({
        "context":retriever | RunnableLambda(format_docs),
        "user_input": RunnablePassthrough()
        }
    )

    parser = StrOutputParser()

    main_chain = parallel_chain | prompt | llm | parser
    
    while True: 
        print("Enter 0 to exit")
        user_input = input("Enter your query: ")

        if user_input == '0':
            break 

        print(main_chain.invoke(user_input))
        
        

In [122]:
RAG_Pipeline("Gfr50f6ZBvo")

Done fetching transcript
Done Splitting
vector store stage completed
Initialized Retriever
Enter 0 to exit


Enter your query:  Summarize the video


I don't know.
Enter 0 to exit


Enter your query:  does it talk about something fission ?


No, the transcript only mentions fusion.
Enter 0 to exit


Enter your query:  who is the speaker in this ? 


The speaker in this transcript is Lex Friedman.
Enter 0 to exit


Enter your query:  what does it says about AI


I don't know.
Enter 0 to exit


Enter your query:  what does it says about Alien


The provided text discusses several points about aliens:

*   It questions the idea that alien civilizations would be uniform in their communication methods, suggesting a normal distribution of behaviors and levels of development, from primitive to stoical and philosophical.
*   It raises the possibility that some alien civilizations might be more advanced and destructive.
*   It touches upon the "Great Filter" concept, suggesting that if other alien civilizations have reached a certain level, something might have prevented them from becoming multi-planetary or reaching out into the stars.
*   The speaker expresses a personal opinion that, based on current evidence, it is most likely that humans are alone, meaning there are no other alien civilizations.
Enter 0 to exit


Enter your query:  0
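The expression `parallel_chain | prompt | llm | parser` can be read as ordinary left-to-right function composition, where the parallel step feeds the same input to two branches. The sketch below mimics that data flow in plain Python. It is not how LangChain implements runnables, and `toy_retriever`, `fake_llm`, and `pipe` are hypothetical stand-ins so the example runs without a model or an API key.

```python
from functools import reduce


def toy_retriever(query):
    # Hypothetical retriever: keeps chunks that share a word with the query.
    corpus = ["fusion energy is discussed", "protein folding with alphafold"]
    words = set(query.lower().split())
    return [chunk for chunk in corpus if words & set(chunk.split())]


def format_chunks(chunks):
    return "\n\n".join(chunks)


def parallel_step(query):
    # Mirrors the RunnableParallel dict: both branches receive the same input.
    return {
        "context": format_chunks(toy_retriever(query)),  # retriever | format_docs
        "user_input": query,                              # RunnablePassthrough
    }


def prompt_step(values):
    return "Context : {context}\nQuestion: {user_input}".format(**values)


def fake_llm(prompt_text):
    # Stand-in for the chat model: returns the first prompt line as its "answer".
    return {"content": prompt_text.splitlines()[0]}


def parser_step(message):
    # Mirrors StrOutputParser: pull the plain string out of the model message.
    return message["content"]


def pipe(*steps):
    """Compose steps left to right, like chaining with the | operator."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


toy_chain = pipe(parallel_step, prompt_step, fake_llm, parser_step)
toy_answer = toy_chain("is fusion discussed")
print(toy_answer)
```

Seen this way, the single string passed to `main_chain.invoke` is used twice: once as the retrieval query and once as the question in the prompt.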
