## 📘 RAG-Learn: Retrieval-Augmented Chatbot for Lecture Videos



### This notebook demonstrates how to:

1. Extract audio from lecture videos.
2. Transcribe the audio into text using Whisper.
3. Chunk and embed the transcripts.
4. Store them in a FAISS vector database.
5. Build a chatbot that answers questions using RAG.

### Importing Libraries

In [2]:
import os
os.environ["ANONYMIZED_TELEMETRY"] = "false"  # Opt out of anonymized telemetry in libraries that honor this variable; set before the imports below.
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

## Audio Extraction

### Define Paths and Video Extensions
Defines directories where videos and audio outputs are stored.
Lists supported video file extensions.
Uses tqdm.notebook for progress bars in Jupyter.

In [3]:
import os
import subprocess
from tqdm.notebook import tqdm  # nice progress bars in notebooks

# Paths
VIDEOS_DIR = "data/videos"
AUDIO_DIR = "data/audio"

VIDEO_EXTENSIONS = (".mp4", ".mkv", ".mov", ".avi")

### Collect All Video Files
Recursively searches VIDEOS_DIR for video files with supported extensions.
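Stores each path in video_files. Because VIDEOS_DIR is a relative path, the stored paths are relative to the working directory, not absolute.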

In [4]:
video_files = []
for root, dirs, files in os.walk(VIDEOS_DIR):
    for file in files:
        if file.lower().endswith(VIDEO_EXTENSIONS):
            video_files.append(os.path.join(root, file))

print(f"Found {len(video_files)} video files.")

Found 24 video files.
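
The same recursive scan can also be written with `pathlib`. This is a minimal sketch rather than the notebook's own code: the helper name `find_videos` is hypothetical, and the sorting is added only to make the order deterministic.

```python
from pathlib import Path

VIDEO_EXTENSIONS = (".mp4", ".mkv", ".mov", ".avi")

def find_videos(root, extensions=VIDEO_EXTENSIONS):
    """Return sorted paths of files under `root` whose suffix is in `extensions` (case-insensitive)."""
    root = Path(root)
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )
```

Calling `find_videos("data/videos")` would then play the role of the `os.walk` loop above.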


### Define Audio Extraction Function
Defines a helper function using ffmpeg to convert videos → .wav audio files.
Standardizes audio format (mono, 16 kHz) for Whisper compatibility.

In [5]:
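## Extract the audio track of a video as a mono, 16 kHz, 16-bit PCM .wav file
def extract_audio(video_path, audio_path):
    os.makedirs(os.path.dirname(audio_path), exist_ok=True)
    command = [
        "ffmpeg",
        "-y",                    # overwrite an existing output file without prompting
        "-i", video_path,
        "-vn",                   # ignore the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        audio_path,
    ]
    result = subprocess.run(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if result.returncode != 0:
        # ffmpeg's own output is discarded, so report failures explicitly
        print(f"ffmpeg failed for {video_path} (exit code {result.returncode})")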
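
Because ffmpeg's output is discarded, it can be worth confirming that an extracted file really has the format requested above. A minimal sketch using the standard-library `wave` module; the helper name `check_wav_format` is hypothetical and not part of the notebook.

```python
import wave

def check_wav_format(wav_path, sample_rate=16000, channels=1, sample_width=2):
    """Return True if the WAV file is mono, 16 kHz, 16-bit PCM (the ffmpeg settings above)."""
    with wave.open(wav_path, "rb") as wav_file:
        return (
            wav_file.getframerate() == sample_rate
            and wav_file.getnchannels() == channels
            and wav_file.getsampwidth() == sample_width  # 2 bytes = 16 bits
        )
```

Running this over a few extracted files after the loop finishes is a cheap way to catch conversions that failed or produced an unexpected format.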

### Extract Audio from All Videos
Loops through all videos, creates mirrored folder structure in AUDIO_DIR, and saves .wav files.

Displays progress bar during processing.

In [6]:
# Extract audio with progress bar
for video_path in tqdm(video_files, desc="Extracting audio"):
    relative_path = os.path.relpath(video_path, VIDEOS_DIR)
    audio_path = os.path.join(AUDIO_DIR, os.path.splitext(relative_path)[0] + ".wav")
    extract_audio(video_path, audio_path)

print("All audio files extracted!")

Extracting audio:   0%|          | 0/24 [00:00<?, ?it/s]

All audio files extracted!
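
The "mirrored folder structure" comes down to one path transformation, which the transcription step reuses later. A small sketch of that transformation on its own; `mirrored_path` is a hypothetical helper and the example file name is made up.

```python
import os

def mirrored_path(src_path, src_root, dst_root, new_ext):
    """Map src_root/<sub>/<name>.<ext> to dst_root/<sub>/<name><new_ext>."""
    relative_path = os.path.relpath(src_path, src_root)
    stem, _old_ext = os.path.splitext(relative_path)
    return os.path.join(dst_root, stem + new_ext)

print(mirrored_path("data/videos/week1/intro.mp4", "data/videos", "data/audio", ".wav"))
```

`os.path.splitext` only strips the extension of the final path component, so dots in folder names are left alone.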


## Transcribing Audio

### Load Whisper Model
Loads OpenAI’s Whisper speech-to-text model.

Provides directory paths for audio input and transcript output.

Model size can be tuned for speed/accuracy tradeoff.

In [7]:
import whisper

# Paths
AUDIO_DIR = "data/audio"
TRANSCRIPTS_DIR = "data/transcripts"

# Load Whisper model (small or medium for speed, large for accuracy)
model = whisper.load_model("small")  # or "medium", "large"

### Collect Audio Files
Finds all .wav audio files for transcription.

In [8]:
# Collect all audio files
audio_files = []
for root, dirs, files in os.walk(AUDIO_DIR):
    for file in files:
        if file.lower().endswith(".wav"):
            audio_files.append(os.path.join(root, file))

print(f"Found {len(audio_files)} audio files.")

Found 24 audio files.


### Transcribe Audio Files
Iterates through all audio files.

Skips transcription if transcript already exists.

Saves transcriptions as .txt files in mirrored directory structure.

In [9]:
for audio_path in tqdm(audio_files, desc="Transcribing audio"):
    # Create mirrored transcript path
    relative_path = os.path.relpath(audio_path, AUDIO_DIR)
    transcript_path = os.path.join(TRANSCRIPTS_DIR, os.path.splitext(relative_path)[0] + ".txt")
    os.makedirs(os.path.dirname(transcript_path), exist_ok=True)
    
    # Skip if transcript already exists
    if os.path.exists(transcript_path):
        continue
    
    # Transcription
    result = model.transcribe(audio_path)
    text = result["text"]
    
    # Save transcript
    with open(transcript_path, "w", encoding="utf-8") as f:
        f.write(text)

print("All transcripts generated!")

Transcribing audio:   0%|          | 0/24 [00:00<?, ?it/s]

All transcripts generated!
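
Besides `"text"`, the dictionary returned by `model.transcribe` also contains a `"segments"` list whose entries carry `start` and `end` times in seconds. The notebook does not use them, but they could let an answer point back to a moment in the lecture. A sketch with made-up segment data; `format_timestamp` and `segments_to_lines` are hypothetical helpers.

```python
def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def segments_to_lines(segments):
    """Turn Whisper-style segments into '[start - end] text' lines."""
    return [
        f"[{format_timestamp(seg['start'])} - {format_timestamp(seg['end'])}] {seg['text'].strip()}"
        for seg in segments
    ]

# Made-up data in the shape of result["segments"]
example_segments = [
    {"start": 0.0, "end": 4.2, "text": " Let's start with the definition."},
    {"start": 4.2, "end": 65.0, "text": " What is RAG?"},
]
for line in segments_to_lines(example_segments):
    print(line)
```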


### Load Transcripts into Documents
Loads all transcripts into LangChain Document objects for processing.

In [10]:
TRANSCRIPTS_DIR = "data/transcripts"

documents_before_split = []

for root, dirs, files in os.walk(TRANSCRIPTS_DIR):
    for file in files:
        if file.lower().endswith(".txt"):
            file_path = os.path.join(root, file)
            loader = TextLoader(file_path, encoding="utf-8")
            docs = loader.load()
            documents_before_split.extend(docs)

print(f"Loaded {len(documents_before_split)} documents")

Loaded 24 documents


### Verify Document Paths
Confirms all transcripts are loaded and paths are correct.

In [11]:
# Check original document paths
paths = [doc.metadata['source'] for doc in documents_before_split]
print(set(paths))   # unique file paths
print(len(paths))   # total files loaded


{'data/transcripts/ Hands-On AI: Retrieval Augmented Generation (RAG)/[2] Rag Overview/[1] Architecture of a RAG app/[1] Architecture of a RAG app.txt', 'data/transcripts/ Hands-On AI: Retrieval Augmented Generation (RAG)/[2] Rag Overview/[3] Introduction to embedding models /[3] Introduction to embedding models.txt', 'data/transcripts/ Hands-On AI: Retrieval Augmented Generation (RAG)/[3] Beyond the Basics/[1] Understanding your RAG app with observability /[1] Understanding your RAG app with observability.txt', 'data/transcripts/ Hands-On AI: Retrieval Augmented Generation (RAG)/[3] Beyond the Basics/[2] Begin optimizing your data ingestion /[2] Begin optimizing your data ingestion.txt', 'data/transcripts/ Hands-On AI: Retrieval Augmented Generation (RAG)/[2] Rag Overview/[5] Demo: Calling an LLM /[5] Demo: Calling an LLM.txt', 'data/transcripts/ Hands-On AI: Retrieval Augmented Generation (RAG)/[3] Beyond the Basics/[9] Solution: Different embedding models /[9] Solution: Different em

### Check Sample Document Length
Prints length of first transcript to understand text size.

In [12]:
len(documents_before_split[0].page_content)

2192

### Split Documents into Chunks
Splits long transcripts into manageable overlapping chunks for embedding.

In [13]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700, ## maximum length of each chunk, in characters
    chunk_overlap = 50, ## characters shared between consecutive chunks, so context is not cut off at chunk boundaries
)

documents_after_split = text_splitter.split_documents(documents_before_split)
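
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is an illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, so its chunks are usually somewhat shorter than the limit and the overlap can be smaller than 50 characters. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size=700, chunk_overlap=50):
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

text = "".join(str(i % 10) for i in range(2192))  # dummy text, same length as the first transcript
chunks = sliding_window_chunks(text)
print([len(chunk) for chunk in chunks])
print(chunks[0][-50:] == chunks[1][:50])  # consecutive chunks share 50 characters
```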

### Check Chunk Size
Confirms that chunking worked: the first chunk is now at most 700 characters long.

In [14]:
len(documents_after_split[0].page_content)
## old length = 2192, new length = 697
## still within the 700-character limit
## the next chunk may repeat up to the last 50 characters of this one (chunk_overlap = 50)

697

### Compare Average Lengths Before/After Splitting
Computes average document length before vs. after splitting.

Ensures chunking reduced sizes as expected.

In [15]:
def avg_doc_length(docs):
    return sum(len(doc.page_content) for doc in docs) // len(docs)

avg_char_before_split = avg_doc_length(documents_before_split)
avg_char_after_split = avg_doc_length(documents_after_split)

print(f'Before splitting: {avg_char_before_split}')
print(f'After splitting: {avg_char_after_split}')

Before splitting: 1227
After splitting: 551


### Define Embedding Function
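Loads the all-MiniLM-L6-v2 sentence-transformers model from Hugging Face to embed text chunks into dense vectors.

The device is hard-coded to cuda, so this cell needs a GPU; change it to 'cpu' to run without one.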

In [16]:
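## Embedding function
## Note: HuggingFaceBgeEmbeddings targets BGE models and, by default, prepends a BGE-style instruction to queries;
## for a sentence-transformers model, HuggingFaceEmbeddings is the more usual wrapper.
huggingface_embeddings = HuggingFaceBgeEmbeddings(
    model_name = 'sentence-transformers/all-MiniLM-L6-v2', ## compact, widely used sentence-embedding model
    model_kwargs = {'device': 'cuda'}, ## use 'cpu' if no GPU is available
    encode_kwargs = {'normalize_embeddings' : True} ## scale every vector to unit length
)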
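
With `normalize_embeddings` set to `True`, every vector has unit length. The dot product of two embeddings then equals their cosine similarity, and the squared Euclidean distance is `2 - 2 * cosine`, so ranking by L2 distance and ranking by cosine similarity agree. A small pure-Python check with toy 3-dimensional vectors (the real model produces 384-dimensional ones); the helper names are hypothetical.

```python
import math

def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 0.0, 1.0])

cosine = dot(a, b)  # for unit vectors, the dot product is the cosine similarity
print(round(cosine, 4))
print(math.isclose(squared_l2(a, b), 2 - 2 * cosine))
```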



### Build FAISS Vector Store
Creates a FAISS index from all document embeddings.

Allows fast similarity search for retrieval.

In [17]:
## Vector database
vector_store = FAISS.from_documents(documents = documents_after_split, embedding = huggingface_embeddings)
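
Conceptually, a flat (exhaustive) index compares the query vector against every stored vector and keeps the closest ones. The sketch below does this in pure Python with squared L2 distance and toy 2-dimensional vectors. It illustrates the idea only, assuming a flat L2 index like the one LangChain's FAISS wrapper is understood to create by default; it is not how FAISS is implemented. `top_k` and the sample vectors are hypothetical.

```python
import heapq

def top_k(query, vectors, k=3):
    """Return (squared distance, index) pairs for the k stored vectors nearest to `query`."""
    distances = (
        (sum((q - v) ** 2 for q, v in zip(query, vector)), index)
        for index, vector in enumerate(vectors)
    )
    return heapq.nsmallest(k, distances)

stored = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]
for distance, index in top_k([0.8, 0.6], stored, k=3):
    print(index, round(distance, 3))
```

A brute-force scan like this costs one distance computation per stored vector, which is fine for a few hundred transcript chunks.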

### Test Similarity Search
Runs a test query against the FAISS store.

Retrieves the three most similar chunks (k=3) and prints the top match.

In [18]:
query = 'What is RAG?'
relevant_docs = vector_store.similarity_search(query, k=3)
print(relevant_docs[0].page_content)

Let's start with the definition. What is RAG? RAG stands for Retrieval Augmented Generation. The basic principle behind this technique is to give context to your LLMs so they can answer questions better. There are four main pieces of a RAG application. First, the language model. Usually, this refers to a large language model, but people today are now developing smaller language models that may be able to do the same job. Second, the embedding model. This is what transforms your data into vectors. Third, the vector database. This is where you store your vectorized data. Optionally, a framework to make building your RAG app easier. We'll go into these pieces in detail in later videos. The


### Create Retriever
Converts FAISS store into a retriever object for RAG pipelines.

In [19]:
retriever = vector_store.as_retriever(search_type = 'similarity', search_kwargs = {'k' : 3})

### Load HuggingFace LLM
Loads FLAN-T5 base (a lightweight text-to-text LLM) for answer generation.

Uses a low sampling temperature (0.1) and caps each answer at 128 new tokens to keep outputs focused and short.

In [21]:
from langchain_huggingface.llms import HuggingFacePipeline

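hf = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-base",
    task="text2text-generation",
    # Generation settings go in pipeline_kwargs; model_kwargs is forwarded to the model loader instead.
    pipeline_kwargs={
        "max_new_tokens": 128,  # upper bound on the number of generated tokens
        "do_sample": True,
        "temperature": 0.1,     # low temperature keeps sampling close to greedy decoding
    },
)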

Device set to use cuda:0


### Define Custom Prompt Template
Provides instructions to the LLM on how to structure answers.

Limits length and enforces honesty about unknowns.

In [22]:
# Define custom prompt template
prompt_template = """Use the following pieces of context to answer the question at the end. Please follow the following rules:
1. If you don't know the answer, just say "I don't know".
2. If you find the answer, write the answer in a concise way with five sentences maximum.

{context}

Question: {question}

Helpful Answer:
"""

PROMPT = PromptTemplate(
      template=prompt_template,
      input_variables=["context", "question"]
  )

### Build Retrieval-Augmented QA Chain
Creates a RetrievalQA chain that:

1. Retrieves relevant transcript chunks.
2. Passes them, together with the question, to the LLM.
3. Returns the generated answer along with its source documents.

In [23]:
from langchain.chains import RetrievalQA

retrievalQA = RetrievalQA.from_chain_type(
    llm = hf,
    chain_type = 'stuff',
    retriever = retriever,
    return_source_documents = True,
    chain_type_kwargs = {'prompt' : PROMPT}
)
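
With `chain_type = 'stuff'`, the retrieved chunks are concatenated and substituted into the `{context}` slot of the prompt, while the user's query fills `{question}`. The sketch below reproduces that step with plain string formatting. The `"\n\n"` separator is an assumption about LangChain's default, and `build_stuffed_prompt` and the sample chunks are hypothetical.

```python
PROMPT_TEMPLATE = """Use the following pieces of context to answer the question at the end. Please follow the following rules:
1. If you don't know the answer, just say "I don't know".
2. If you find the answer, write the answer in a concise way with five sentences maximum.

{context}

Question: {question}

Helpful Answer:
"""

def build_stuffed_prompt(chunks, question, separator="\n\n"):
    """Join the retrieved chunks and fill the prompt template, as a 'stuff' chain would."""
    context = separator.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

sample_chunks = [
    "RAG stands for Retrieval Augmented Generation.",
    "The embedding model transforms your data into vectors.",
]
print(build_stuffed_prompt(sample_chunks, "What is RAG?"))
```

Because every chunk lands in a single prompt, `k` and `chunk_size` together determine the prompt length, which matters for models with a short input window.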


### Run Query Through RAG System
Runs the test query "What is RAG?" through the QA pipeline.

Returns structured result with answer + sources.

In [24]:
result = retrievalQA.invoke({'query' : query})
print(result)

{'query': 'What is RAG?', 'result': 'RAG stands for Retrieval Augmented Generation. The basic principle behind this technique is to give context to your LLMs so they can answer questions better. There are four main pieces of a RAG application. First, the language model. Usually, this refers to a large language model, but people today are now developing smaller language models that may be able to do the same job. Second, the embedding model. This is what transforms your data into vectors. Third, the vector database. This is where you store your vectorized data. Optionally, a framework to make', 'source_documents': [Document(metadata={'source': 'data/transcripts/ Hands-On AI: Retrieval Augmented Generation (RAG)/[2] Rag Overview/[1] Architecture of a RAG app/[1] Architecture of a RAG app.txt'}, page_content="Let's start with the definition. What is RAG? RAG stands for Retrieval Augmented Generation. The basic principle behind this technique is to give context to your LLMs so they can ans

### Inspect Results
Prints available result fields.

Displays retrieved documents and their original transcript sources for transparency.

In [25]:
print(result.keys())

relevant_docs = result['source_documents']
print(f'There are {len(relevant_docs)} documents retrieved which are relevant to the query.')
print("*" * 100)

for i, doc in enumerate(relevant_docs):
    print(f"Relevant Document #{i+1}:")
    print(f"Source file: {doc.metadata['source']}")
    print(f"Content:\n{doc.page_content}")
    print("-" * 100)


dict_keys(['query', 'result', 'source_documents'])
There are 3 documents retrieved which are relevant to the query.
****************************************************************************************************
Relevant Document #1:
Source file: data/transcripts/ Hands-On AI: Retrieval Augmented Generation (RAG)/[2] Rag Overview/[1] Architecture of a RAG app/[1] Architecture of a RAG app.txt
Content:
Let's start with the definition. What is RAG? RAG stands for Retrieval Augmented Generation. The basic principle behind this technique is to give context to your LLMs so they can answer questions better. There are four main pieces of a RAG application. First, the language model. Usually, this refers to a large language model, but people today are now developing smaller language models that may be able to do the same job. Second, the embedding model. This is what transforms your data into vectors. Third, the vector database. This is where you store your vectorized data. Optionally, a 
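
Since several retrieved chunks can come from the same transcript, it may be handy to collapse them into a list of distinct source files. A sketch, assuming each retrieved document exposes `metadata['source']` as in the output above; `FakeDocument` is a stand-in for LangChain's `Document` so the example runs on its own, and `unique_sources` is a hypothetical helper.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document (illustrative only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def unique_sources(documents):
    """Return source paths in first-seen order, without duplicates."""
    seen = {}
    for doc in documents:
        seen.setdefault(doc.metadata.get("source", "unknown"), None)
    return list(seen)

docs = [
    FakeDocument("chunk 1", {"source": "data/transcripts/a.txt"}),
    FakeDocument("chunk 2", {"source": "data/transcripts/b.txt"}),
    FakeDocument("chunk 3", {"source": "data/transcripts/a.txt"}),
]
print(unique_sources(docs))
```

In the notebook, the same helper could be applied to `result['source_documents']`.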