# **Building a RAG-Based Chatbot with LangChain from a YouTube Playlist**


In this lab, we will explore how to build a **Retrieval-Augmented Generation (RAG)**-based chatbot using **LangChain**, a framework for developing applications with large language models (LLMs). A RAG-based system augments a language model's responses with external knowledge retrieved from a document corpus.

> ⚠️ **Note:** It is highly recommended to run this lab in **Google Colab** for ease of setup, access to required packages, and GPU acceleration if needed. You can upload this notebook to [Google Colab](https://colab.research.google.com/) and run it directly there.

## 🎯 **Objective**

The goal of this lab is to build a chatbot that can:

1. **Retrieve relevant information** from a document store or knowledge base based on the user's question.
2. **Generate coherent answers** using the retrieved context, effectively combining retrieval with generative responses.

This approach makes it possible to create chatbots that are more **context-aware** and **factual**, and that can handle a wide variety of questions by leveraging external data.

---

## **Key Concepts**

1. **Retrieval-Augmented Generation (RAG)**:
   - RAG models combine two stages — first, retrieving relevant information from an external corpus, and second, generating a response using the retrieved information.
   - This allows the model to provide more accurate and up-to-date answers.

2. **LangChain**:
   - LangChain is a Python framework designed to simplify the development of applications using LLMs. It provides modules for **document processing**, **prompt management**, and managing the flow of information between components.
   - LangChain allows developers to easily integrate LLMs with external sources of information, creating powerful applications like RAG-based chatbots.

3. **Document Retriever**:
   - This is a key part of the RAG system that identifies and retrieves relevant documents or pieces of information based on a query.
   - A good document retriever ensures that the chatbot's responses are informed by the most relevant information.

4. **Generative Models**:
   - The generative model (e.g., **GPT-3**, **Mistral**, etc.) takes the retrieved context and produces a natural language response.
   - The generative model allows the system to convert retrieved knowledge into a coherent and natural response.
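
The two RAG stages described above can be sketched without any external library. This is only a toy illustration, assuming a bag-of-words similarity in place of learned embeddings and stopping at the prompt that would be sent to the generative model; the corpus and the helper names (`retrieve`, `build_prompt`) are hypothetical and not part of LangChain.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, corpus, k=1):
    """Stage 1: return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(corpus, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Stage 2 input: combine the retrieved context and the question into one prompt."""
    context = "\n\n".join(passages)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

# Hypothetical three-passage corpus, for illustration only
corpus = [
    "Gradient descent repeatedly nudges parameters along the negative gradient.",
    "A sigmoid squishes the real number line into the range between zero and one.",
    "Backpropagation computes how each weight affects the cost function.",
]

question = "What does gradient descent do?"
top_passages = retrieve(question, corpus, k=1)
prompt_text = build_prompt(question, top_passages)
print(prompt_text)
```

In the lab itself, the bag-of-words scoring is replaced by an embedding model plus a vector database, and the prompt is passed to an LLM.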

---

By the end of this lab, you will have a working **RAG-based chatbot** that can intelligently answer questions based on external knowledge.

# Imports

Import the necessary libraries and tools to get started with our RAG-based chatbot implementation:

In [1]:
from IPython.display import clear_output

# Install the necessary libraries. This might take some time.
# Note: OpenAI's Whisper is published on PyPI as `openai-whisper`; the separate `whisper` package is unrelated.
!pip -q install --upgrade langchain openai chromadb tiktoken sentence_transformers langchainhub langchain-community tqdm yt-dlp openai-whisper

clear_output()


In [2]:
# Adapted from https://python.langchain.com/docs/use_cases/question_answering/
import os
import re
import torch
import yt_dlp
import whisper
from tqdm import tqdm
from langchain import hub
from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain.schema import StrOutputParser
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter

device = "cuda" if torch.cuda.is_available() else "cpu"


Add your Hugging Face Token

In [3]:
# We'll be using Mistral 7B for inference.
# Create a Hugging Face access token and make sure to tick the permissions required for inference.
os.environ['HUGGINGFACE_API_TOKEN'] = ""



# 1 - Gathering the Dataset

In this step, we will gather a dataset by downloading and transcribing audio from YouTube videos in a playlist. We will use the **Whisper** model for transcription and **yt-dlp** to download the audio from YouTube videos. The transcriptions will be saved as text files for further processing.


Your task is to:
1. Download the audio of YouTube videos from a specific playlist.
2. Transcribe the audio using the Whisper model.
3. Save the transcriptions along with video metadata such as the video URL and title.

#### Steps Involved

1. **Playlist URL**: We will define the playlist URL that contains multiple YouTube videos from which we need to extract and transcribe the audio.

2. **Download Audio**: Using **yt-dlp** (a powerful downloader), we will extract the audio from the videos in the playlist. The audio will be saved as MP3 files in a specific folder.

3. **Transcribe Audio**: After downloading the audio, we will use the **Whisper** model to transcribe the speech to text. This will allow us to convert the spoken content of the videos into a text format.

4. **Save Transcriptions**: The transcriptions will be saved as text files, where each file will include the YouTube video URL, title, and the transcribed text.

5. **Organize Files**: The downloaded audio files and their corresponding transcriptions will be saved in separate folders, ensuring proper organization of the data.
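
The file layout from steps 4 and 5 can be sketched in isolation, without downloading or transcribing anything. This is a minimal sketch under stated assumptions: `write_transcript` and `sanitize_title` are hypothetical helper names, and the video ID, URL, title, and transcript text are made-up placeholders.

```python
import re
import tempfile
from pathlib import Path

def sanitize_title(title):
    """Replace characters that are unsafe in file names with underscores."""
    return re.sub(r'[\\/*?:"<>|]', "_", title)

def write_transcript(folder, video_id, url, title, transcript):
    """Write one transcript file: URL, title, a TRANSCRIPT marker, then the text."""
    path = Path(folder) / f"{video_id}.txt"
    path.write_text(
        f"{url}\n{sanitize_title(title)}\nTRANSCRIPT\n{transcript}\n",
        encoding="utf-8",
    )
    return path

# Hypothetical values, for illustration only
with tempfile.TemporaryDirectory() as tmp:
    saved = write_transcript(
        tmp,
        video_id="abc123",
        url="https://www.youtube.com/watch?v=abc123",
        title="What is a tensor? | Example",
        transcript="Hello and welcome.",
    )
    file_lines = saved.read_text(encoding="utf-8").splitlines()

print(file_lines)
```

Keeping the URL and title on fixed lines before a `TRANSCRIPT` marker makes the metadata easy to parse back later.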

### 🎬 **3Blue1Brown Playlist**

For this lab, we will be using the **3Blue1Brown YouTube Neural Networks playlist**, which is renowned for its engaging and educational content on mathematics and machine learning. The playlist contains several videos that break down complex concepts related to neural networks and deep learning using visually intuitive animations. These videos provide clear explanations and are perfect for transcription and knowledge extraction, as they offer valuable insights into the workings of neural networks while being easy to follow and understand.



We will extract the audio from each of these videos, transcribe the speech using the Whisper model, and organize the content for further use.



In [4]:
# Define the 3Blue1Brown Neural Networks playlist URL
playlist_url = "https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi"

# Initialize the base Whisper model and move it to the GPU (CUDA) when one is available
model = whisper.load_model("base").to(device)

# Set up yt-dlp options for downloading audio
ydl_opts = {
    # 'cookiefile': '/content/cookies.txt',  # you might need a cookie file if downloading the audio fails
    'format': 'bestaudio/best',
    'outtmpl': 'downloads/%(id)s.%(ext)s',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',  # Set bitrate/quality to 192 kbps
    }],
}

os.makedirs('downloads', exist_ok=True)  # Create the output directories if they do not exist
os.makedirs('transcriptions', exist_ok=True)

# CODE to complete the function
def process_video(url):
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info_dict = ydl.extract_info(url, download=True)
            video_id = info_dict.get('id', None)
            video_title = info_dict.get('title', None)
            audio_file = f'downloads/{video_id}.mp3'

            video_title = re.sub(r'[\\/*?:"<>|]', "_", video_title)

            # Transcribe the audio file with the Whisper model loaded above;
            # the text is then saved as a .txt file in the transcriptions folder
            result = model.transcribe(audio_file)
            transcript = result["text"]

            # Save transcription to file
            transcript_file = f'transcriptions/{video_id}.txt'
            with open(transcript_file, 'w', encoding='utf-8') as f:
                f.write(f"{url}\n")
                f.write(f"{video_title}\n")
                f.write("TRANSCRIPT\n")
                f.write(f"{transcript}\n")

            print(f"✅ Successfully processed: {video_title}")


    except Exception as e:
        print(f"❌ Error processing video {url}: {e}")


100%|███████████████████████████████████████| 139M/139M [00:06<00:00, 22.4MiB/s]


In [5]:
# Fetch the playlist using yt-dlp
with yt_dlp.YoutubeDL({'quiet': True}) as ydl:
    playlist_info = ydl.extract_info(playlist_url, download=False)

# If you get an error downloading the YouTube videos, watch the video at the following link to see how to
# work with cookies, then use the code below
# https://www.youtube.com/watch?v=DsS1jCDZGek&t=26s

# with yt_dlp.YoutubeDL({'cookiefile': '/content/cookies.txt', 'quiet': True}) as ydl:
#     playlist_info = ydl.extract_info(playlist_url, download=False)

video_urls = []
if 'entries' in playlist_info:
    video_urls = [entry['webpage_url'] for entry in playlist_info['entries'] if 'webpage_url' in entry]

# Process all videos in the playlist to transcribe them
for url in tqdm(video_urls, desc="Processing videos", unit="video"):
    process_video(url)


Processing videos:   0%|          | 0/8 [00:00<?, ?video/s]

[youtube] Extracting URL: https://www.youtube.com/watch?v=aircAruvnKk
[youtube] aircAruvnKk: Downloading webpage
[youtube] aircAruvnKk: Downloading tv client config
[youtube] aircAruvnKk: Downloading player 73381ccc-main
[youtube] aircAruvnKk: Downloading tv player API JSON
[youtube] aircAruvnKk: Downloading ios player API JSON
[youtube] aircAruvnKk: Downloading m3u8 information
[info] aircAruvnKk: Downloading 1 format(s): 251-7
[download] Destination: downloads/aircAruvnKk.webm
[download] 100% of   17.91MiB in 00:00:00 at 34.34MiB/s  
[ExtractAudio] Destination: downloads/aircAruvnKk.mp3
Deleting original file downloads/aircAruvnKk.webm (pass -k to keep)


Processing videos:  12%|█▎        | 1/8 [01:32<10:44, 92.06s/video]

✅ Successfully processed: But what is a neural network_ _ Deep learning chapter 1
[youtube] Extracting URL: https://www.youtube.com/watch?v=IHZwWFHWa-w
[youtube] IHZwWFHWa-w: Downloading webpage
[youtube] IHZwWFHWa-w: Downloading tv client config
[youtube] IHZwWFHWa-w: Downloading player 73381ccc-main
[youtube] IHZwWFHWa-w: Downloading tv player API JSON
[youtube] IHZwWFHWa-w: Downloading ios player API JSON
[youtube] IHZwWFHWa-w: Downloading m3u8 information
[info] IHZwWFHWa-w: Downloading 1 format(s): 251-5
[download] Destination: downloads/IHZwWFHWa-w.webm
[download] 100% of   19.74MiB in 00:00:00 at 33.55MiB/s  
[ExtractAudio] Destination: downloads/IHZwWFHWa-w.mp3
Deleting original file downloads/IHZwWFHWa-w.webm (pass -k to keep)


Processing videos:  25%|██▌       | 2/8 [03:10<09:33, 95.62s/video]

✅ Successfully processed: Gradient descent, how neural networks learn _ DL2
[youtube] Extracting URL: https://www.youtube.com/watch?v=Ilg3gGewQ5U
[youtube] Ilg3gGewQ5U: Downloading webpage
[youtube] Ilg3gGewQ5U: Downloading tv client config
[youtube] Ilg3gGewQ5U: Downloading player 73381ccc-main
[youtube] Ilg3gGewQ5U: Downloading tv player API JSON
[youtube] Ilg3gGewQ5U: Downloading ios player API JSON
[youtube] Ilg3gGewQ5U: Downloading m3u8 information
[info] Ilg3gGewQ5U: Downloading 1 format(s): 251-4


ERROR: unable to download video data: HTTP Error 403: Forbidden
Processing videos:  38%|███▊      | 3/8 [03:14<04:29, 53.82s/video]

❌ Error processing video https://www.youtube.com/watch?v=Ilg3gGewQ5U: ERROR: unable to download video data: HTTP Error 403: Forbidden
[youtube] Extracting URL: https://www.youtube.com/watch?v=tIeHLnjs5U8
[youtube] tIeHLnjs5U8: Downloading webpage
[youtube] tIeHLnjs5U8: Downloading tv client config
[youtube] tIeHLnjs5U8: Downloading player 73381ccc-main
[youtube] tIeHLnjs5U8: Downloading tv player API JSON
[youtube] tIeHLnjs5U8: Downloading ios player API JSON
[youtube] tIeHLnjs5U8: Downloading m3u8 information
[info] tIeHLnjs5U8: Downloading 1 format(s): 251-3
[download] Destination: downloads/tIeHLnjs5U8.webm
[download] 100% of    9.98MiB in 00:00:01 at 8.32MiB/s   
[ExtractAudio] Destination: downloads/tIeHLnjs5U8.mp3
Deleting original file downloads/tIeHLnjs5U8.webm (pass -k to keep)


Processing videos:  50%|█████     | 4/8 [04:08<03:36, 54.15s/video]

✅ Successfully processed: Backpropagation calculus _ DL4
[youtube] Extracting URL: https://www.youtube.com/watch?v=LPZh9BOjkQs
[youtube] LPZh9BOjkQs: Downloading webpage
[youtube] LPZh9BOjkQs: Downloading tv client config
[youtube] LPZh9BOjkQs: Downloading player 73381ccc-main
[youtube] LPZh9BOjkQs: Downloading tv player API JSON
[youtube] LPZh9BOjkQs: Downloading ios player API JSON
[youtube] LPZh9BOjkQs: Downloading m3u8 information
[info] LPZh9BOjkQs: Downloading 1 format(s): 251-10
[download] Destination: downloads/LPZh9BOjkQs.webm
[download] 100% of    7.79MiB in 00:00:00 at 15.78MiB/s  
[ExtractAudio] Destination: downloads/LPZh9BOjkQs.mp3
Deleting original file downloads/LPZh9BOjkQs.webm (pass -k to keep)


Processing videos:  62%|██████▎   | 5/8 [04:52<02:30, 50.27s/video]

✅ Successfully processed: Large Language Models explained briefly
[youtube] Extracting URL: https://www.youtube.com/watch?v=wjZofJX0v4M
[youtube] wjZofJX0v4M: Downloading webpage
[youtube] wjZofJX0v4M: Downloading tv client config
[youtube] wjZofJX0v4M: Downloading player 73381ccc-main
[youtube] wjZofJX0v4M: Downloading tv player API JSON
[youtube] wjZofJX0v4M: Downloading ios player API JSON
[youtube] wjZofJX0v4M: Downloading m3u8 information
[info] wjZofJX0v4M: Downloading 1 format(s): 251-3
[download] Destination: downloads/wjZofJX0v4M.webm
[download] 100% of   27.39MiB in 00:00:03 at 6.91MiB/s   
[ExtractAudio] Destination: downloads/wjZofJX0v4M.mp3
Deleting original file downloads/wjZofJX0v4M.webm (pass -k to keep)


Processing videos:  75%|███████▌  | 6/8 [07:11<02:41, 80.63s/video]

✅ Successfully processed: Transformers (how LLMs work) explained visually _ DL5
[youtube] Extracting URL: https://www.youtube.com/watch?v=eMlx5fFNoYc
[youtube] eMlx5fFNoYc: Downloading webpage
[youtube] eMlx5fFNoYc: Downloading tv client config
[youtube] eMlx5fFNoYc: Downloading player 73381ccc-main
[youtube] eMlx5fFNoYc: Downloading tv player API JSON
[youtube] eMlx5fFNoYc: Downloading ios player API JSON
[youtube] eMlx5fFNoYc: Downloading m3u8 information
[info] eMlx5fFNoYc: Downloading 1 format(s): 251-3
[download] Destination: downloads/eMlx5fFNoYc.webm
[download] 100% of   26.38MiB in 00:00:01 at 21.54MiB/s  
[ExtractAudio] Destination: downloads/eMlx5fFNoYc.mp3
Deleting original file downloads/eMlx5fFNoYc.webm (pass -k to keep)


Processing videos:  88%|████████▊ | 7/8 [09:18<01:35, 95.65s/video]

✅ Successfully processed: Attention in transformers, step-by-step _ DL6
[youtube] Extracting URL: https://www.youtube.com/watch?v=9-Jl0dxWQs8
[youtube] 9-Jl0dxWQs8: Downloading webpage
[youtube] 9-Jl0dxWQs8: Downloading tv client config
[youtube] 9-Jl0dxWQs8: Downloading player 73381ccc-main
[youtube] 9-Jl0dxWQs8: Downloading tv player API JSON
[youtube] 9-Jl0dxWQs8: Downloading ios player API JSON
[youtube] 9-Jl0dxWQs8: Downloading m3u8 information
[info] 9-Jl0dxWQs8: Downloading 1 format(s): 251-3
[download] Destination: downloads/9-Jl0dxWQs8.webm
[download] 100% of   23.48MiB in 00:00:00 at 31.58MiB/s  
[ExtractAudio] Destination: downloads/9-Jl0dxWQs8.mp3
Deleting original file downloads/9-Jl0dxWQs8.webm (pass -k to keep)


Processing videos: 100%|██████████| 8/8 [11:07<00:00, 83.48s/video] 

✅ Successfully processed: How might LLMs store facts _ DL7





In [None]:
# !yt-dlp --cookies /content/cookies.txt "https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi"


# 2 - Process Dataset into LangChain Documents

In this step, we will begin by fetching the dataset from the **3Blue1Brown YouTube Neural Networks playlist**. This dataset consists of transcriptions of the videos, where each transcription is stored as a plaintext file. Each file begins with the YouTube URL and the video title, which we'll parse as metadata. The actual transcript content starts after the "TRANSCRIPT" separator.

We'll use LangChain to convert these transcriptions into documents, allowing us to later utilize them for retrieval-augmented generation (RAG) tasks in building our chatbot.


We'll process each video transcription and load it into a LangChain **Document** object (https://js.langchain.com/docs/modules/data_connection/document_loaders/how_to/creating_documents). This object consists of two key attributes:

- **page_content**: This contains the actual content of the transcript that we want to index and search semantically.
- **metadata**: This includes any associated metadata, such as the video title and YouTube URL, which we can use to identify and retrieve specific documents.


In [6]:
# Function to process individual transcription file
def process_txt_file(file_path):

    # Read the file and Extract URL and Title and page content after "TRANSCRIPT"
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    url = ""
    title = ""
    page_content = []
    transcript_found = False


    for line in lines:
        line = line.strip()
        if not transcript_found:
            if line == "TRANSCRIPT":
                transcript_found = True
            elif not url:
                url = line
            elif not title:
                title = line
        else:

            page_content.append(line)


    page_content = " ".join(page_content).strip()

    # Return a Document object with page content and metadata
    return Document(page_content=page_content, metadata={'source': url, 'title': title})

def create_documents_from_directory(directory_path):
    documents = []
    for filename in os.listdir(directory_path):
        if filename.endswith('.txt'):
            doc = process_txt_file(os.path.join(directory_path, filename))
            documents.append(doc)
    return documents

# Give the path of your directory where you saved the transcriptions

directory_path = "transcriptions"
docs = create_documents_from_directory(directory_path)

print(f"Loaded {len(docs)} documents")
print("Metadata of first document:", docs[0].metadata)
print("Content of first document:", docs[0].page_content[:200])


Loaded 7 documents
Metadata of first document: {'source': 'https://www.youtube.com/watch?v=9-Jl0dxWQs8', 'title': 'How might LLMs store facts _ DL7'}
Content of first document: If you feed a large language model the phrase, Michael Jordan plays the sport of blank, and you have it predict what comes next, and it correctly predicts basketball. This would suggest that somewhere


# 3 - Splitting the Documents into Chunks

In this step, we will split the transcripts into smaller, manageable chunks. This is important because large documents can be inefficient for semantic search and retrieval. By breaking them down, we can increase the accuracy and efficiency of our retrieval-augmented generation (RAG) model.

Each chunk will contain a segment of the transcript, ensuring that the content remains coherent and relevant for further processing in the LangChain framework.


![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/5.3%20-%20RAG/chunks.png)
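
To make the roles of chunk size and overlap concrete, here is a simplified sketch that cuts text into fixed-size character windows. It is an assumption-laden stand-in: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces rather than at fixed offsets, and `sliding_chunks` is a hypothetical helper, not a LangChain function.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk["start_index"], chunk["text"])
```

Each window repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.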

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=200, add_start_index=True)
all_splits = text_splitter.split_documents(docs)

print("Number of chunks:", len(all_splits))
print("Metadata of first chunk:", all_splits[0].metadata)
print("Content of first chunk:", all_splits[0].page_content)


Number of chunks: 274
Metadata of first chunk: {'source': 'https://www.youtube.com/watch?v=9-Jl0dxWQs8', 'title': 'How might LLMs store facts _ DL7', 'start_index': 0}
Content of first chunk: If you feed a large language model the phrase, Michael Jordan plays the sport of blank, and you have it predict what comes next, and it correctly predicts basketball. This would suggest that somewhere, inside its hundreds of billions of parameters, it's baked in knowledge about a specific person and his specific sport. And I think in general anyone who's played around with one of these models has the clear sense that it's memorized tons and tons of facts. So a reasonable question you could ask is, how exactly does that work, and where do those facts live? Last December a few researchers from Google Deep Mind posted about work on this question, and they were using this specific example of


# 4 - Embedding the Chunks and Loading into a Vector Database

This step is crucial for enabling semantic search over the transcript data. The goal is to convert the text chunks into numerical representations (embeddings) that can be indexed and searched efficiently. These embeddings will allow us to retrieve relevant context for generating answers to user queries.

### **BGE Embeddings**

- **BGE (BAAI General Embedding)**: BGE models, published on Hugging Face by the Beijing Academy of Artificial Intelligence (BAAI), are among the best-performing open-source embedding models. They transform text into dense vectors (embeddings) that capture the semantic meaning of the text, which makes them well suited to our use case of semantic search. This lab uses the `BAAI/bge-base-en` model.
  
  For more information about the BGE model, you can visit [HuggingFace BGE](https://huggingface.co/BAAI/bge-large-en).
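
As a small worked example of why embeddings are often normalized before searching: once two vectors are scaled to unit length, their dot product equals their cosine similarity. The vectors below are hypothetical three-dimensional stand-ins; real BGE embeddings have hundreds of dimensions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical "embeddings", for illustration only
u = [3.0, 4.0, 0.0]
v = [4.0, 3.0, 0.0]

cos_uv = cosine_similarity(u, v)
dot_normalized = dot(l2_normalize(u), l2_normalize(v))
print(cos_uv, dot_normalized)
```

Both values agree (up to floating-point rounding), which is why a similarity search over normalized vectors can rely on plain dot products.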

### **Chroma**

- **Chroma**: Chroma is an open-source vector database specifically designed for managing and querying embeddings. Once the text chunks are converted into embeddings, they are stored in Chroma, where they can be quickly retrieved for semantic search tasks. Chroma integrates seamlessly with LangChain, making it a great choice for building AI applications that require fast and accurate retrieval of relevant information.

  Chroma runs directly on your machine, allowing you to easily get started without requiring complex cloud services. It supports features like vector search, similarity queries, and more.

  Check out a more comprehensive list of vector databases [here](https://www.datacamp.com/blog/the-top-5-vector-databases).

### Embedding Process and Visualization

In the diagram below, you can see the process of embedding text chunks into numerical representations, and then storing these embeddings in a vector database (Chroma). This allows us to perform efficient semantic searches later on, improving the quality of our chatbot's responses.

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/5.3%20-%20RAG/vector-store.png)

By the end of this step, we'll have a vector store of embedded text chunks, ready to be used for retrieving relevant context based on a given query.
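
The idea of a vector store can be sketched in a few lines: keep each embedding next to its chunk and return the chunks whose vectors score highest against the query. This toy `TinyVectorStore` class is hypothetical and only illustrates the concept; Chroma's actual storage and indexing are more sophisticated, and the two-dimensional vectors are made up.

```python
import heapq

class TinyVectorStore:
    """Toy in-memory store: keeps (vector, payload) pairs and ranks by dot product."""

    def __init__(self):
        self._items = []

    def add(self, vector, payload):
        """Store one embedding together with the chunk it represents."""
        self._items.append((vector, payload))

    def search(self, query_vector, k=2):
        """Return the k payloads whose vectors score highest against the query."""
        scored = (
            (sum(q * x for q, x in zip(query_vector, vector)), payload)
            for vector, payload in self._items
        )
        return [payload for _, payload in heapq.nlargest(k, scored, key=lambda s: s[0])]

# Hypothetical unit-length 2-D vectors standing in for chunk embeddings
store = TinyVectorStore()
store.add([1.0, 0.0], "chunk about gradient descent")
store.add([0.0, 1.0], "chunk about attention")
store.add([0.8, 0.6], "chunk about cost functions")

top_chunks = store.search([1.0, 0.0], k=2)
print(top_chunks)
```

The `k` argument plays the same role as `search_kwargs={"k": 5}` on a retriever: it caps how many chunks are handed to the model as context.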


In [8]:
model_name = "BAAI/bge-base-en"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': device},
    encode_kwargs=encode_kwargs
)

vectorstore = Chroma.from_documents(documents=all_splits, embedding=bge_embeddings)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [9]:
prompt_question = "What is Gradient Descent?"

# Use the retriever to fetch the chunks from the vector DB that best match the prompt (highest similarity).
# Hint: use get_relevant_documents()
retrieved_docs = retriever.get_relevant_documents(prompt_question)

print("Total docs retrieved",len(retrieved_docs))

for i, doc in enumerate(retrieved_docs):
    print(f"\nDocument {i+1}:")
    print("Metadata:", doc.metadata)
    print("Content:", doc.page_content)


Total docs retrieved 5

Document 1:
Metadata: {'title': 'Gradient descent, how neural networks learn _ DL2', 'source': 'https://www.youtube.com/watch?v=IHZwWFHWa-w', 'start_index': 9980}
Content: about a network learning is that it's just minimizing a cost function. And notice, one consequence of that is that it's important for this cost function to have a nice smooth output so that we can find a local minimum by taking little steps down hill. This is why, by the way, artificial neurons have continuously ranging activations, rather than simply being active or inactive in a binary way, the way that biological neurons are. This process of repeatedly nudging an input of a function by some multiple of the negative gradient is called gradient descent. It's a way to converge toward some local minimum of a cost function, basically a valley in this graph. I'm still showing the picture of a

Document 2:
Metadata: {'source': 'https://www.youtube.com/watch?v=IHZwWFHWa-w', 'title': 'Gradient desce



# 5 - Full RAG Chain

Let's now put everything together to build a fully functional RAG chain using the LangChain Expression Language (LCEL): https://python.langchain.com/docs/expression_language/.

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/5.3%20-%20RAG/retrieval.png)
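
The pipe (`|`) syntax used by the expression language can be demystified with a toy version. This sketch is not LangChain's implementation; it only assumes that each stage exposes an `invoke` method and that `|` builds a new stage running the left side first. The `Step` class and all the stand-in stages are hypothetical.

```python
class Step:
    """Wraps a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # (a | b) yields a new step that runs a first, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        """Run the wrapped function on a single input."""
        return self.func(value)

# Hypothetical stand-ins for the retriever, prompt, model, and output parser stages
fake_retrieve = Step(lambda q: {"context": "gradient descent minimizes a cost function", "question": q})
fake_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}\nAnswer:")
fake_llm = Step(lambda p: p + " It follows the negative gradient downhill.")
keep_answer = Step(lambda text: text.split("Answer:", 1)[1].strip())

toy_chain = fake_retrieve | fake_prompt | fake_llm | keep_answer
toy_answer = toy_chain.invoke("What is gradient descent?")
print(toy_answer)
```

Here the stand-in model echoes the prompt before its completion, so the last stage keeps only the text after `Answer:`.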

In [None]:
from langchain_community.llms import HuggingFaceHub, HuggingFaceEndpoint

prompt = hub.pull("rlm/rag-prompt")

# TODO: Initialize 'llm' using HuggingFaceHub
# use  repo_id="mistralai/Mistral-7B-Instruct-v0.2", top_p = 0.95 and task as text-generation
llm = HuggingFaceHub(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    model_kwargs={"top_p": 0.95},
    task="text-generation",
    huggingfacehub_api_token=os.environ['HUGGINGFACE_API_TOKEN']
)

def format_docs(docs):
    # Combine all document contents into one string (separated by newlines)
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Ask the chain a question and print the generated answer
response = rag_chain.invoke("What is Gradient Descent?")
print("\nRAG Response:")
print(response)


# 6 - Quoting sources

One key benefit of RAG (Retrieval-Augmented Generation) systems is the ability to trace answers back to their original sources. By modifying our chain, we can return not only the generated answer but also the metadata from the retrieved documents—effectively quoting the sources used by the LLM to generate its response.

In this part, you need to:

1. **Retrieve documents**: Modify `rag_chain_with_source` to retrieve relevant documents based on the input question.
2. **Return metadata**: Ensure the chain returns both the generated answer and the metadata (e.g., source, title, author) of the retrieved documents.
3. **Invoke the chain**: Observe the generated answer along with the quoted sources.
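
Because several retrieved chunks often come from the same video, it can help to collapse their metadata into one citation per source before showing it to the user. The sketch below is one possible approach, assuming metadata dictionaries with `source` and `title` keys like those in this lab; `unique_sources` is a hypothetical helper and the titles are shortened placeholders.

```python
def unique_sources(metadatas):
    """Collapse chunk metadata into one entry per video, keeping first-seen order."""
    seen = {}
    for meta in metadatas:
        source = meta.get("source")
        if source and source not in seen:
            seen[source] = meta.get("title", "")
    return [{"source": s, "title": t} for s, t in seen.items()]

# Metadata shaped like the chunks in this lab; the titles are illustrative
retrieved_metadata = [
    {"source": "https://www.youtube.com/watch?v=aircAruvnKk", "title": "Chapter 1", "start_index": 17483},
    {"source": "https://www.youtube.com/watch?v=aircAruvnKk", "title": "Chapter 1", "start_index": 16986},
    {"source": "https://www.youtube.com/watch?v=tIeHLnjs5U8", "title": "DL4", "start_index": 3988},
]

cited = unique_sources(retrieved_metadata)
for entry in cited:
    print(f"- {entry['title']}: {entry['source']}")
```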

<!-- ![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/5.3%20-%20RAG/references.png) -->

In [11]:
from operator import itemgetter
from langchain.schema.runnable import RunnableMap

rag_chain_from_docs = (
    {
        "context": lambda input: format_docs(input["documents"]),
        "question": itemgetter("question"),
    }
    | prompt
    | llm
    | StrOutputParser()
)


rag_chain_with_source = RunnableMap(
    {
        "documents": retriever,  # Use your retriever here
        "question": RunnablePassthrough()   # Pass the input question unchanged
    }
) | {
    "documents": lambda input: [doc.metadata for doc in input["documents"]],
    "answer": rag_chain_from_docs,  # Use the rag_chain_from_docs defined above
}

response = rag_chain_with_source.invoke("When to use Relu or Sigmoid in a Neural Network?")
print(response)




{'documents': [{'start_index': 17483, 'title': 'But what is a neural network_ _ Deep learning chapter 1', 'source': 'https://www.youtube.com/watch?v=aircAruvnKk'}, {'start_index': 16986, 'title': 'But what is a neural network_ _ Deep learning chapter 1', 'source': 'https://www.youtube.com/watch?v=aircAruvnKk'}, {'source': 'https://www.youtube.com/watch?v=aircAruvnKk', 'start_index': 17983, 'title': 'But what is a neural network_ _ Deep learning chapter 1'}, {'source': 'https://www.youtube.com/watch?v=tIeHLnjs5U8', 'title': 'Backpropagation calculus _ DL4', 'start_index': 3988}, {'title': 'Backpropagation calculus _ DL4', 'start_index': 1498, 'source': 'https://www.youtube.com/watch?v=tIeHLnjs5U8'}], 'answer': "Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: When to use Relu or Sigm

In [12]:
import textwrap

def format_string_response(response_string, line_width=80):
    lines = response_string.split("\n")

    # Extract context, question, and answer parts from the input string
    context = []
    question = None
    answer = None
    in_context = False
    in_answer = False

    for line in lines:
        if line.startswith("Context:"):
            in_context, in_answer = True, False
            context.append(line.replace("Context:", "", 1).strip())
        elif line.startswith("Question:"):
            question = line.replace("Question:", "", 1).strip()
        elif line.startswith("Answer:"):
            # Stop collecting context once the answer starts
            in_context, in_answer = False, True
            answer = line.replace("Answer:", "", 1).strip()
        elif in_answer:
            answer += " " + line.strip()
        elif in_context:
            context.append(line.strip())

    # Combine context into a single string and wrap the text if it's too long
    context_str = " ".join(context).strip()
    wrapped_context = textwrap.fill(context_str, width=line_width)

    # Wrap the answer if it's too long (fall back to an empty string if no "Answer:" line was found)
    wrapped_answer = textwrap.fill(answer or "", width=line_width)
    formatted_response = f"Question: {question}\n\nContext: {wrapped_context}\n\nAnswer: {wrapped_answer}"

    return formatted_response

formatted_response = format_string_response(response['answer'], line_width=180)
print(formatted_response)


Question: When to use Relu or Sigmoid in a Neural Network?

Context: the relevant weighted sum into that interval between 0 and 1, you know, kind of motivated by this biological analogy of neurons either being inactive or active. Exactly. But
relatively few modern networks actually use sigmoid anymore. It's kind of old school, right? Yeah, or rather, relu seems to be much easier to train. And relu stands for rectified
linear unit? Yes, it's this kind of function where you're just taking a max of 0 and a, where a is given by what you were explaining in the video, and what this was sort of
motivated from, I think, was a partially biological analogy with how neurons would either be activated or not. And so if it passes a certain threshold, it would be the  series this
summer, but I'm jumping back into it after this project, so patrons, you can look out for updates there. To close things off here, I have with me Lisa Lee, who did her PhD work on
the theoretical side of deep learning, and w

In [13]:
questions = [
    "What is backpropagation in neural networks?",
    "How does a convolutional neural network work?",
    "What is the purpose of activation functions?"
]

for question in questions:
    response = rag_chain_with_source.invoke(question)
    formatted_response = format_string_response(response['answer'], line_width=180)
    print(formatted_response)




Question: What is backpropagation in neural networks?

Context: the heart of how a neural network learns, is called back propagation. And it's what I'm going to be talking about next video. There I really want to take the time to walk through
what exactly happens to each weight and each bias for a given piece of training data. Trying to give an intuitive feel for what's happening beyond the pile of relevant calculus and
formulas. Right here, right now, the main thing I want you to know, independent of implementation details, is that what we mean when we talk about a network learning is that it's
just minimizing a cost function. And notice, one consequence of that is that it's important for this cost function to have a nice smooth output so that we can  repeat the process
for all the weights and biases feeding into that layer. So pat yourself on the back. If all of this makes sense, you have now looked deep into the heart of back propagation, the
workhorse behind how neural networks lea



Question: How does a convolutional neural network work?

Context: is, how exactly does that work, and where do those facts live? Last December a few researchers from Google Deep Mind posted about work on this question, and they were using this
specific example of matching athletes to their sports. And although a full mechanistic understanding of how facts are stored remains unsolved, they had some interesting partial
results, including the very general high-level conclusion that the facts seem to live inside a specific part of these networks, known fancifully as the multi-layer perceptrons, or
MLPs for short. In the last couple of chapters, you and I have been digging into the details behind transformers, the architecture underlying large language models,  model with a
huge number of parameters without it either grossly overfitting the training data, or being completely intractable to train. Deep learning describes a class of models that in the
last couple decades have proven to scale 



Question: What is the purpose of activation functions?

Context: squishes the real number line into the range between zero and one. And a common function that does this is called the sigmoid function, also known as a logistic curve, basically
very negative inputs end up close to zero, very positive inputs end up close to one, and it just steadily increases around the input zero. So the activation of the neuron here is
basically a measure of how positive the relevant weighted sum is. But maybe it's not that you want the neuron to light up when the weighted sum is bigger than zero. Maybe you only
want it to be active when the sum is bigger than say 10. That is, you want some bias for it to be inactive. What we'll do then is just add in some other number,  is going to cause
the most efficient decrease to the cost function. And we're just going to focus on the connection between the last two neurons. Let's label the activation of that last neuron with
a superscript L indicating which layer