<a href="https://colab.research.google.com/github/ronaknavadiya/Youtube-chatbot-RAG/blob/main/YouTube_Chatbot_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [155]:
import os

# Replace "TOKEN" with your own Hugging Face access token; never commit a real token
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "TOKEN"

In [25]:
!pip install -q youtube-transcript-api langchain-community langchain_huggingface faiss-cpu tiktoken python-dotenv

In [43]:
# Imports

from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled
from langchain.text_splitter import RecursiveCharacterTextSplitter
# HuggingFaceEmbeddings is imported from langchain_huggingface; the langchain.embeddings path is deprecated
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace, HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from dotenv import load_dotenv
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser


# 1. Indexing

## a. Document Ingestion
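
The ingestion cell in this section expects a bare video ID. When you only have a full link, a small helper can pull the ID out first. This is a minimal sketch using only the standard library; `extract_video_id` is a hypothetical helper (not part of `youtube-transcript-api`), and it only handles `youtube.com/watch?v=...` and `youtu.be/...` links that include a scheme such as `https://`.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url_or_id: str) -> str:
    """Return the video ID from a YouTube URL, or the input itself if it has no URL scheme."""
    parsed = urlparse(url_or_id)
    if not parsed.scheme:
        # No scheme: assume the caller already passed a bare video ID
        return url_or_id

    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID in the path
        return parsed.path.lstrip("/")
    if host.endswith("youtube.com"):
        # Watch links carry the ID in the "v" query parameter
        video_ids = parse_qs(parsed.query).get("v")
        if video_ids:
            return video_ids[0]

    raise ValueError(f"Could not find a video ID in {url_or_id!r}")
```

For example, `extract_video_id("https://youtu.be/Cyv-dgv80kE")` returns the same ID that the cell below hard-codes.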

In [27]:
youtube_video_id = "Cyv-dgv80kE"

try:
  # fetch() returns the transcript as a sequence of snippets
  transcript_list = YouTubeTranscriptApi().fetch(youtube_video_id)
  # print(len(transcript_list))

  # Flatten the snippets into plain text
  transcript = " ".join(snippet.text for snippet in transcript_list)
  print(transcript)

except TranscriptsDisabled as e:
  # Captions are turned off for this video
  print(f"Error: {e}")

## b. Text Splitting

In [28]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.create_documents([transcript])

print(len(chunks))

305
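
To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter in plain Python. It is an illustration only, not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on separators such as paragraphs and spaces). `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(demo_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one. The overlap serves the same purpose in the real splitter: a sentence that straddles a boundary still appears whole in at least one chunk.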


In [29]:
chunks[0]

Document(metadata={}, page_content="welcome to the AI Engineers guide for the L chain this is a four course that will take you from the assumption that you know nothing about Lang chain to being able to proficiently use the framework either you know within line chain within line graph or even elsewhere uh from the fundamentals that you will learn in this course now this course will be broken up into multiple chapters we're going to start by talking a little bit about what line chain is and when we should really be using it and when maybe we don't want to use it we'll talk about the pros and cons and also about the the why the line chain ecosystem not just about the line chain framework itself from there we'll introduce Lang chain we'll just have a look at a few examples before diving into essentially the basics of the framework now I will just note that all this is for Lang chain 0.3 so that is latest current version although that being said we will cover a little bit of where line cha

## c. Create Embeddings and Store Them in a Vector Store

In [30]:
embeddings = HuggingFaceEmbeddings()



In [31]:
vector_store = FAISS.from_documents(chunks, embeddings)

In [32]:
vector_store.index_to_docstore_id

{0: '41ae70dd-f10d-41f5-bd19-0952556f6d0e',
 1: '62972a21-6e37-4a50-97f2-e66834e241cb',
 2: '108008b1-3cb5-417c-96dd-347141895bb5',
 3: 'f489a4fa-dc8e-4cd1-bc65-0baa871d05f1',
 4: '23856d42-509e-40e9-9208-7138756c8b0f',
 5: '6bdc1dc4-666e-4864-a8e3-5feaffc48610',
 6: '99a4f23b-eb43-4d84-b62a-c0814ccabdb0',
 7: 'a12e137c-118e-46c8-94c4-66e605921c95',
 8: '9d55535b-27f8-43d7-a5df-f86abd70c01d',
 9: '44adb51d-a820-4eea-ad23-ff5975189979',
 10: 'd89ca995-5d52-40bd-ae0d-3c41c4846149',
 11: '96b1a75d-753b-4596-9875-5c7eb5dde1fe',
 12: '041d6605-7d30-4d45-b744-06fa8d95e241',
 13: '7ee01745-43c2-4039-b64c-c8daea6f3a3f',
 14: 'a2bc679c-2fc1-41b3-add0-2ce62b0809a0',
 15: 'b6b041ea-1052-4972-b8bc-1c690c2c4283',
 16: '9f384934-2693-4d3f-b48c-e7f28ac7b73b',
 17: 'b5ef4b0f-be25-4903-bb7c-f3013aa8ad08',
 18: '4da83f3e-ad92-4a80-bcbc-b9d126e16f7b',
 19: 'dd538030-a938-438a-b3f7-1f4334f74739',
 20: 'ba78033b-4a4d-479a-98d8-337f681d834d',
 21: 'c8ee9332-1f91-4fa0-af95-3d7bee0059b1',
 22: '74920602-948d-

In [33]:
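# get_by_ids returns [] when the ID is not in the store; IDs are regenerated every time the index is built
vector_store.get_by_ids(['c6b5b7b1-18b0-4b3f-8b7f-54f5ecad4622'])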

[]

# 2. Retrieval

In [78]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})
retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x7994311315e0>, search_kwargs={'k': 4})

In [79]:
retriever.invoke("What is langchain")

[Document(id='7ee01745-43c2-4039-b64c-c8daea6f3a3f', metadata={}, page_content="because one L chain is a thing helping you learn and two one of the main Frameworks that I recommend a lot of people to move on to is actually line graph which is still within the L chain ecosystem and it still uses a lot of L chain objects and methods and of course Concepts as well so even if you do move on from line chain you may move on to something like L graph which you can no line chain for anyway and let's say you do move on to another framework in set said in that scenario the concepts that you learn from Lang chain are still pretty important so to just finish up this chapter I just want to summarize on that question of should you be using Lang chain what's important to remember is that Lang chain does abstract a lot now this abstraction of L chain is both a strength and a weakness with more experience those abstractions can feel like a limitation and that is why we sort of go with the idea that L c
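
Conceptually, a similarity retriever with `k=4` embeds the query and returns the four stored vectors closest to it. The sketch below shows that idea with toy two-dimensional vectors and Euclidean (L2) distance in plain Python. The vectors, the chunk names, and the `top_k_by_l2` helper are all made up for illustration, and the choice of L2 distance is an assumption about the default index rather than something this notebook confirms.

```python
import math


def top_k_by_l2(query_vec, doc_vecs, k):
    """Return (doc_id, distance) pairs for the k vectors closest to query_vec."""
    scored = [(doc_id, math.dist(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy "embeddings" (hypothetical values, two dimensions for readability)
toy_vectors = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.0, 1.0],
    "chunk-c": [0.9, 0.1],
}

nearest = top_k_by_l2([1.0, 0.0], toy_vectors, k=2)
print([doc_id for doc_id, _ in nearest])
# ['chunk-a', 'chunk-c']
```

A real index performs the same ranking over high-dimensional embedding vectors, with data structures built to make the search fast.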

# 3. Augmentation

In [138]:
llm = HuggingFaceEndpoint(repo_id="deepseek-ai/DeepSeek-V3.1", task="text-generation")

model = ChatHuggingFace(llm=llm)

In [125]:
prompt = PromptTemplate(
    template="""
      You are an helpful assistant.
      Answer ONLY from the provided transcript context.
      If the context is insufficient, just say you don't know.

      {context}

      Question: {question}

    """,
    input_variables=["context", "question"]
)

In [126]:
# User input
question = "What is PromptTemplate?"
retrieved_docs = retriever.invoke(question)

In [127]:
retrieved_docs

[Document(id='260d07b8-7a03-4144-93c0-6966dc23bedc', metadata={}, page_content="the context within that there and we have our user query here okay so we'll run this and if we take a look uh here we haven't specified what our input variables are okay but we can see that we have query and we have context up here right so we can see that okay these are the input variables we just haven't explicitly defined them here so let's just confirm with this that line chain did pick those up and we can see that it did so it has context and query as our input variables for the prompt template that we just defined okay so we can also see the structure of our temp plates let's have a look okay so we can see that within messages here we have a system message prompt template the way that we Define this you can see here that we have from messages and this will consume various uh different structures so you can see here that it has a from messages it is a sequence of message like representation so we could

In [128]:
context = "\n\n".join(doc.page_content for doc in retrieved_docs)

In [129]:
final_prompt = prompt.invoke({"context": context, "question": question})

In [130]:
final_prompt

StringPromptValue(text="\n      You are an helpful assistant.\n      Answer ONLY from the provided transcript context.\n      If the context is insufficient, just say you don't know.\n\n      the context within that there and we have our user query here okay so we'll run this and if we take a look uh here we haven't specified what our input variables are okay but we can see that we have query and we have context up here right so we can see that okay these are the input variables we just haven't explicitly defined them here so let's just confirm with this that line chain did pick those up and we can see that it did so it has context and query as our input variables for the prompt template that we just defined okay so we can also see the structure of our temp plates let's have a look okay so we can see that within messages here we have a system message prompt template the way that we Define this you can see here that we have from messages and this will consume various uh different struct
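
The augmentation step amounts to two string operations: join the retrieved chunks, then substitute them and the question into the template. Here is a plain-Python sketch of that idea. `TEMPLATE` is a shortened stand-in for the prompt above and `build_prompt` is a hypothetical helper; `PromptTemplate` itself returns a prompt value object rather than a bare string.

```python
# Shortened stand-in for the notebook's prompt template (assumption for illustration)
TEMPLATE = (
    "Answer ONLY from the provided transcript context.\n\n"
    "{context}\n\n"
    "Question: {question}"
)


def build_prompt(chunk_texts: list[str], question: str) -> str:
    """Join the retrieved chunk texts and fill both template slots."""
    context = "\n\n".join(chunk_texts)  # blank line between chunks
    return TEMPLATE.format(context=context, question=question)


demo_prompt = build_prompt(["first chunk", "second chunk"], "What is PromptTemplate?")
print(demo_prompt)
```

Separating chunks with a blank line keeps their boundaries visible to the model without adding any extra markup.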

# 4. LLM Generation

In [139]:
answer = model.invoke(final_prompt)
print(answer)

content='Based on the provided context, a PromptTemplate is a structure used to define and format a prompt for a language model. It is shown to have input variables, such as "context" and "query," which are filled in to create the final prompt that is sent to the model.\n\nThe context also describes that these templates can be used to create different types of messages within a conversation, such as a "system message prompt template" and a "human message" (user prompt template). These are then chained together with a language model to perform tasks like generating a response.' additional_kwargs={} response_metadata={'token_usage': {'completion_tokens': 116, 'prompt_tokens': 873, 'total_tokens': 989}, 'model_name': 'deepseek-ai/DeepSeek-V3.1', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run--8421ce25-3fde-4031-83a2-b9914283bbd2-0' usage_metadata={'input_tokens': 873, 'output_tokens': 116, 'total_tokens': 989}


# Building a chain

In [141]:
from langchain_core.runnables import RunnablePassthrough, RunnableLambda, RunnableParallel

In [142]:
parser = StrOutputParser()

In [147]:
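def retrieve_context(question: str) -> str:
  # Fetch the most similar chunks and join them into one context string
  retrieved_docs = retriever.invoke(question)
  context = "\n\n".join(doc.page_content for doc in retrieved_docs)
  return context

In [148]:
# Send the same question down both branches: pass it through unchanged and look up its context
retriever_chain = RunnableParallel({
    'question': RunnablePassthrough(),
    'context' : RunnableLambda(retrieve_context)
})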

In [149]:
retriever_chain.invoke("what is langchain")

{'question': 'what is langchain',
 'context': "because one L chain is a thing helping you learn and two one of the main Frameworks that I recommend a lot of people to move on to is actually line graph which is still within the L chain ecosystem and it still uses a lot of L chain objects and methods and of course Concepts as well so even if you do move on from line chain you may move on to something like L graph which you can no line chain for anyway and let's say you do move on to another framework in set said in that scenario the concepts that you learn from Lang chain are still pretty important so to just finish up this chapter I just want to summarize on that question of should you be using Lang chain what's important to remember is that Lang chain does abstract a lot now this abstraction of L chain is both a strength and a weakness with more experience those abstractions can feel like a limitation and that is why we sort of go with the idea that L chain is really good to get starte
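
The `RunnableParallel` step above feeds one input to every branch and collects the results in a dict, which is exactly the shape the prompt expects. The plain-Python sketch below mimics that fan-out, and also the left-to-right piping that the `|` operator performs when the full chain is assembled. It is a conceptual analogue, not LangChain's implementation; `parallel`, `pipe`, and the `fake_*` functions are hypothetical stand-ins.

```python
def parallel(branches):
    """Return a function that feeds the same input to every branch and collects a dict."""
    def run(value):
        return {name: branch(value) for name, branch in branches.items()}
    return run


def pipe(*steps):
    """Return a function that passes a value through each step in order."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


# Stand-ins for the retriever, prompt, model, and parser (all hypothetical)
def passthrough(question):
    return question

def fake_retrieve(question):
    return f"context for: {question}"

def fake_prompt(inputs):
    return f"{inputs['context']}\nQuestion: {inputs['question']}"

def fake_model(prompt_text):
    return {"content": prompt_text.upper()}

def fake_parser(message):
    return message["content"]


toy_chain = pipe(
    parallel({"question": passthrough, "context": fake_retrieve}),
    fake_prompt,
    fake_model,
    fake_parser,
)
toy_result = toy_chain("what is langchain")
print(toy_result)
```

Each stage only needs to agree with its neighbours on input and output shape, which is why the retriever, prompt, model, and parser can be swapped independently.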

In [152]:
main_chain = retriever_chain | prompt | model | parser

In [153]:
result = main_chain.invoke("what is langchain")

In [154]:
result

"Based on the provided transcript, LangChain is described as:\n\n*   One of, if not the most, popular open-source framework within the Python ecosystem for AI.\n*   A framework that abstracts a lot, which is both a strength and a weakness.\n*   Really good to get started with, but as a project grows in complexity or engineers gain more experience, its abstractions can feel like a limitation.\n*   A core tool in an AI engineer's toolkit, and the concepts learned from it are considered important even if moving to another framework like LangGraph (which is still part of the LangChain ecosystem and uses many LangChain objects, methods, and concepts)."