# RAG-based YouTube Q&A

## 1. Introduction & Project Goal

This notebook provides a complete walkthrough for building an application that lets you "chat" with a YouTube video. The core idea is to use a **naive Retrieval-Augmented Generation (RAG)** architecture to answer questions based on a video's transcript.

### The Problem We're Solving

YouTube is a vast repository of knowledge, but long videos like lectures, podcasts, and documentaries make it difficult to find specific information quickly. Our goal is to build a system where a user can provide a YouTube video URL, and then ask specific questions to get instant, accurate answers sourced directly from the video's content, effectively creating a conversational search engine for that video.

### Retrieval-Augmented Generation (RAG)

RAG is a technique for building LLM-based applications that can answer questions about specific, private data. Instead of relying solely on the LLM's pre-trained knowledge (which can be outdated or generic), RAG grounds the model's responses in a relevant context that we provide.

The process involves two main stages:
1.  **Indexing**: We process our data source (the YouTube transcript) by breaking it into chunks, converting these chunks into numerical representations (embeddings), and storing them in a searchable vector database.
    
2.  **Retrieval & Generation**:
    
    -   **Retrieval**: When a user asks a question, we convert that query into an embedding using the same embedding model. We then search the vector database for the most semantically similar transcript chunks, that is, those most relevant to the user's question.
        
    -   **Augmentation & Generation**: We build a prompt by concatenating those retrieved chunks (with some formatting) alongside the original question. This augmented prompt is fed into the LLM, guiding it to generate an answer that's directly grounded in the provided text rather than relying on generalized or out-of-date internal knowledge.
        
  
We will use the **LangChain** framework to orchestrate this entire pipeline, leveraging its powerful components and the LangChain Expression Language to build a clean, modular, and efficient chain.
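
Before bringing in LangChain, the two stages above can be illustrated with a small, library-free sketch. It is not the pipeline built in this notebook: the word-count `toy_embed` function stands in for a real embedding model, a Python list stands in for the vector database, and all `toy_` names and sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Toy 'embedding': lowercase word counts (a real system uses a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def toy_cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Stage 1, indexing: embed every chunk once and keep (chunk, vector) pairs.
toy_chunks = [
    "Neurons hold a number between 0 and 1 called the activation.",
    "The input layer has 784 neurons, one per pixel of the image.",
    "Weights and biases are the parameters the network learns.",
]
toy_index = [(chunk, toy_embed(chunk)) for chunk in toy_chunks]


def toy_retrieve(question, k=2):
    """Stage 2a, retrieval: rank chunks by similarity to the question, keep the top k."""
    query_vector = toy_embed(question)
    ranked = sorted(toy_index, key=lambda item: toy_cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def toy_build_prompt(question, retrieved):
    """Stage 2b, augmentation: place the retrieved chunks and the question in one prompt."""
    context = "\n\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


toy_question = "How many neurons are in the input layer?"
toy_prompt = toy_build_prompt(toy_question, toy_retrieve(toy_question, k=1))
print(toy_prompt)
```

The generation step is simply the LLM call on `toy_prompt`; it is left out here because it needs an API key.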

## 2. Setup and Installations

First, we need to install the necessary Python libraries and configure our environment with the required API keys.

### 2.1. Create and activate a virtual environment

In [None]:
# Navigate to the project root directory
cd youtube-rag

# Create the virtual environment
python -m venv youtube_rag_env

# Activate it on Windows
.\youtube_rag_env\Scripts\activate
# Activate it on Unix or macOS
# source youtube_rag_env/bin/activate

### 2.2. Install dependencies

In [None]:
pip install -r requirements.txt

### 2.3. Setting Up API Keys

In [None]:
# Create a .env file with the following:
# PINECONE_API_KEY=your_pinecone_api_key
# COHERE_API_KEY=your_cohere_api_key
# GOOGLE_API_KEY=your_google_api_key

In [12]:
# load environment variables from .env file
import os
from dotenv import load_dotenv
load_dotenv()

True
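
A `True` result from `load_dotenv()` does not by itself confirm that all three keys are present. A minimal sketch of such a check follows; the helper name `missing_keys` is hypothetical, and the key names are taken from the `.env` template above.

```python
import os

# Key names as listed in the .env template above
REQUIRED_KEYS = ("PINECONE_API_KEY", "COHERE_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required names that are absent or empty in the given mapping."""
    return [name for name in required if not env.get(name)]


missing = missing_keys(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required API keys are set.")
```

Running this right after loading the `.env` file surfaces a missing key early, rather than as an authentication error several cells later.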

## 3. Project Architecture 

![Technical Architecture](https://raw.githubusercontent.com/saidibnerradi603/YouTubeRAG/master/img/technical_architecture.png?token=GHSAT0AAAAAADESO7444ZIZUJQUQNJMZZCU2ELZ7GQ)


#### Data Flow Process

1. **Content Extraction**
   - System starts with a YouTube video
   - YouTube Transcript API extracts the spoken content as text

2. **Processing & Indexing**
   - Text Splitter breaks the transcript into manageable chunks
   - Embedding Model (Cohere) converts each chunk into vector embeddings
   - These embeddings are stored in a Vector Database (Pinecone)

3. **Query Handling**
   - A user asks a question about the video
   - The system embeds the query with the same embedding model, placing it in the same vector space as the chunks
   - The Retriever performs a semantic search to find the most relevant chunks

4. **Response Generation**
   - Retrieved chunks are combined with the original query
   - A prompt template structures this information
   - LLM (Gemini) generates a comprehensive response based solely on the provided context

#### Key Components

- **Vector Database**: Enables semantic search through embeddings
- **Retriever**: Finds the most relevant information using similarity matching
- **Prompt Engineering**: Structures context and query for optimal LLM understanding
- **LLM**: Generates human-like responses grounded in the video content

## Phase 1: Indexing the YouTube Video

In this phase, we'll take a YouTube video, process its transcript, and prepare it for querying.

### 1.1. Document Loading

We start by defining the YouTube video we want to work with and fetching its transcript. The ``youtube-transcript-api`` library makes this straightforward by allowing us to retrieve captions directly using the video ID.


In [3]:
from youtube_transcript_api import YouTubeTranscriptApi
from langchain_core.documents import Document

video_id = "aircAruvnKk"

# Fetch the English transcript as a list of segments (dicts with a 'text' key).
# Note: `get_transcript` is the static API of older youtube-transcript-api releases;
# newer releases may require `YouTubeTranscriptApi().fetch(...)` instead.
transcript_list = YouTubeTranscriptApi.get_transcript(
    video_id,
    languages=["en"]
)

# Combine all transcript segments into a single text
full_transcript = " ".join(segment['text'] for segment in transcript_list)

# Create Document object to match LangChain format
documents = [Document(
    page_content=full_transcript,
    metadata={
        'source': video_id,
    }
)]

In [4]:
documents

[Document(metadata={'source': 'aircAruvnKk'}, page_content="This is a 3. It's sloppily written and rendered at an extremely low resolution of 28x28 pixels, but your brain has no trouble recognizing it as a 3. And I want you to take a moment to appreciate how crazy it is that brains can do this so effortlessly. I mean, this, this and this are also recognizable as 3s, even though the specific values of each pixel is very different from one image to the next. The particular light-sensitive cells in your eye that are firing when you see this 3 are very different from the ones firing when you see this 3. But something in that crazy-smart visual cortex of yours resolves these as representing the same idea, while at the same time recognizing other images as their own distinct ideas. But if I told you, hey, sit down and write for me a program that takes in a grid of 28x28 pixels like this and outputs a single number between 0 and 10, telling you what it thinks the digit is, well the task goes 
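
The project goal mentions that users provide a video URL, while the loader above needs the bare video ID. One possible way to bridge the two is sketched below; `extract_video_id` is a hypothetical helper that covers only the common `watch`, `youtu.be`, `embed`, and `shorts` URL shapes.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a common YouTube URL shape, or None if unrecognized."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0] or None

    if host in {"youtube.com", "m.youtube.com"}:
        # Standard links: https://www.youtube.com/watch?v=<id>
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        # Embed and Shorts links: /embed/<id>, /shorts/<id>
        for prefix in ("/embed/", "/shorts/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0] or None

    return None


print(extract_video_id("https://www.youtube.com/watch?v=aircAruvnKk"))
```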

### 1.2. Text Splitting (Chunking)

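The complete transcript is too coarse to embed and retrieve as a single unit: one vector for the whole video cannot point to the passage that answers a specific question, and sending the entire transcript with every prompt is wasteful. We therefore split it into smaller chunks. We'll use LangChain's ``RecursiveCharacterTextSplitter``, which splits on the largest natural separators it can find (paragraph breaks, then line breaks, then spaces) so that related text tends to stay together. Here each chunk holds at most 1,000 characters, with a 200-character overlap between neighbouring chunks so that context cut at a boundary is not lost.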

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200   # characters shared between neighbouring chunks
)

docs = text_splitter.split_documents(documents)

In [6]:
print("Number of text chunks:",len(docs))
print("_________________________\n")
print("First text chunk: " , docs[0].page_content)
print("First chunk metadata:", docs[0].metadata)

Number of text chunks: 23
_________________________

First text chunk:  This is a 3. It's sloppily written and rendered at an extremely low resolution of 28x28 pixels, but your brain has no trouble recognizing it as a 3. And I want you to take a moment to appreciate how crazy it is that brains can do this so effortlessly. I mean, this, this and this are also recognizable as 3s, even though the specific values of each pixel is very different from one image to the next. The particular light-sensitive cells in your eye that are firing when you see this 3 are very different from the ones firing when you see this 3. But something in that crazy-smart visual cortex of yours resolves these as representing the same idea, while at the same time recognizing other images as their own distinct ideas. But if I told you, hey, sit down and write for me a program that takes in a grid of 28x28 pixels like this and outputs a single number between 0 and 10, telling you what it thinks the digit is, well th
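
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is a deliberately simpler technique than `RecursiveCharacterTextSplitter`: it ignores separators and cuts purely by character count. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Tiny example: windows of 4 characters, each sharing 1 character with the previous one
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The real splitter produces chunks of varying length because it prefers to cut at separators, but the roles of the two parameters are the same.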

### 1.3. Embedding and Vector Store Creation

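Now, we'll convert each chunk into a vector embedding and store it in a Pinecone vector store. This allows for fast and efficient retrieval based on semantic similarity. The index dimension must match the output size of the embedding model, which is 1024 for Cohere's `embed-english-v3.0`.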

In [None]:
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))

# Create index if it doesn't exist
index_name = "youtube-rag-pdf"
if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=1024,  # must match the embedding model's output size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Connect
index = pc.Index(index_name)

In [15]:
from langchain_pinecone import PineconeVectorStore
from langchain_cohere import CohereEmbeddings


embeddings = CohereEmbeddings(model="embed-english-v3.0")

vector_store = PineconeVectorStore(
    index=index,
    embedding=embeddings # Embedding model (Cohere)
)

# Add documents
vector_store.add_documents(documents=docs)

['d8ff86fa-5705-46db-8638-788a10239659',
 '8a295bca-0cdc-49ed-a98a-bb224e7b6a56',
 '1353c046-eb45-42fb-94f7-9320a0c5f86d',
 '6aaceaf6-44e6-4f9b-a3c9-7a7a4f9115d5',
 'd566212e-88dc-42bb-b7e5-97d988c91a00',
 '7a2e8468-8350-4341-ad85-043d62d02149',
 '00b7e067-765e-4fc2-812b-16ba82fbfb28',
 '38f02f78-2eef-4fd2-820f-bd90912df8eb',
 '357c7d63-405c-4396-bf62-97f70b2b6b39',
 'ddd7f090-5639-4c0f-ae4f-f7c99ef4df70',
 'ea6efe9f-3e10-4c3c-8b28-c62da19295b3',
 '7a4b7714-281c-4da9-802c-883bdf02fee8',
 'b71c379b-96cb-423e-9cc9-4e5dd290aa3d',
 '7c49cb23-f157-4d8d-a0d1-5d2178727dc5',
 '1f15da1b-6ebe-47e3-af56-5e8dcc9e4fde',
 '210fbd11-f428-4fab-86b3-96e089b6af64',
 '46a9e87c-e05b-48ee-8f30-d34618f76e0b',
 '5b20f4d5-6906-4d25-9a10-a490130b99bb',
 '9b0f37b1-bdd2-485a-b1ec-a07554cdcd55',
 '75f60c39-38f3-47c7-8a44-5c402d3698d6',
 'c52da7c0-76c2-433c-9c66-351ad4e2801a',
 '1419bcef-1a21-47b4-9f83-9a623e3d0013',
 'b69a9c36-d57c-4774-8dae-0252656a939a']
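
The IDs above are random UUIDs, so re-running the cell would likely store the same chunks a second time under new IDs. One possible mitigation is to derive a stable ID for each chunk, as sketched below. The helper `make_chunk_id` is hypothetical, and passing `ids` to `add_documents` is an assumption to verify against the installed `langchain_pinecone` version.

```python
import hashlib


def make_chunk_id(source: str, position: int, text: str) -> str:
    """Build a stable ID from the video ID, the chunk position, and a hash of the text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{source}-{position:04d}-{digest}"


# Plain strings stand in for the chunk texts here
example_chunks = ["This is a 3.", "But your brain has no trouble recognizing it."]
example_ids = [
    make_chunk_id("aircAruvnKk", position, text)
    for position, text in enumerate(example_chunks)
]
print(example_ids)

# With the notebook's objects, the call might look like this (assumes `ids` is accepted):
# ids = [make_chunk_id(video_id, i, d.page_content) for i, d in enumerate(docs)]
# vector_store.add_documents(documents=docs, ids=ids)
```

Because the same chunk always maps to the same ID, a repeated upload would overwrite the existing vector instead of adding a duplicate.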

## Phase 2: Retrieval

The goal of retrieval is to find the most relevant information from our knowledge base (the vector store) to answer a user's question.

In [16]:
retriever = vector_store.as_retriever(
    search_type="similarity", 
    search_kwargs={"k": 5, "filter": {"source": video_id}}
)

In [17]:
retriever

VectorStoreRetriever(tags=['PineconeVectorStore', 'CohereEmbeddings'], vectorstore=<langchain_pinecone.vectorstores.PineconeVectorStore object at 0x000001A7EC426470>, search_kwargs={'k': 5, 'filter': {'source': 'aircAruvnKk'}})

In [19]:
question = "What are neural networks, and how do they work?"


retrieved_docs = retriever.invoke(question)

# Let's inspect the retrieved documents
print(f"Retrieved {len(retrieved_docs)} documents.")
for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Document {i+1} ---\n")
    print(doc.page_content)

Retrieved 5 documents.

--- Document 1 ---

the more powerful modern variants, and trust me it still has plenty of complexity for us to wrap our minds around. But even in this simplest form it can learn to recognize handwritten digits, which is a pretty cool thing for a computer to be able to do. And at the same time you'll see how it does fall short of a couple hopes that we might have for it. As the name suggests neural networks are inspired by the brain, but let's break that down. What are the neurons, and in what sense are they linked together? Right now when I say neuron all I want you to think about is a thing that holds a number, specifically a number between 0 and 1. It's really not more than that. For example the network starts with a bunch of neurons corresponding to each of the 28x28 pixels of the input image, which is 784 neurons in total. Each one of these holds a number that represents the grayscale value of the corresponding pixel, ranging from 0 for black pixels up to 1
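
The two `search_kwargs` used above, `k` and `filter`, can be pictured with a small in-memory sketch: keep only records whose metadata matches the filter, rank them by cosine similarity to the query vector, and return the top `k`. This illustrates the semantics only, under the assumption of tiny hand-written vectors; it is not how Pinecone executes the search internally, and `filtered_top_k` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_top_k(query_vector, records, k, metadata_filter):
    """Return the k records most similar to the query among those matching the filter."""
    candidates = [
        record for record in records
        if all(record["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return ranked[:k]


# Made-up two-dimensional vectors from two different videos
records = [
    {"id": "id1", "vector": [1.0, 0.0], "metadata": {"source": "video_a"}},
    {"id": "id2", "vector": [0.9, 0.1], "metadata": {"source": "video_b"}},
    {"id": "id3", "vector": [0.0, 1.0], "metadata": {"source": "video_a"}},
    {"id": "id4", "vector": [0.7, 0.7], "metadata": {"source": "video_a"}},
]

top = filtered_top_k([1.0, 0.0], records, k=2, metadata_filter={"source": "video_a"})
print([record["id"] for record in top])
```

The filter matters once the index holds several videos: without it, a close match from another video (such as `id2` here) could be returned for a question about this one.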

## Phase 3: Augmentation

The goal of augmentation is to create a well-formed prompt for the LLM by combining the retrieved context with the original question.

### 3.1. Prompt template

In [20]:
from langchain_core.prompts import PromptTemplate

# Q&A assistant prompt template for YouTube transcripts

template = """
You are an expert Q&A assistant specialized in answering questions about YouTube videos.

Your responses must be based **entirely** on the context provided below.

### Guidelines:
1. Provide a **clear**, **detailed**, and **well-structured** answer formatted in Markdown.
2. If the context lacks sufficient details, reply with:
   > The provided context does not contain enough information to answer this question.
3. Keep your tone professional and concise.

---

### Context:
{context}

---

### User Question:
{question}

---

### Answer:
"""

prompt = PromptTemplate.from_template(template)


### 3.2. Format the retrieved documents

In [21]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

context_text = format_docs(retrieved_docs)
print(context_text)

the more powerful modern variants, and trust me it still has plenty of complexity for us to wrap our minds around. But even in this simplest form it can learn to recognize handwritten digits, which is a pretty cool thing for a computer to be able to do. And at the same time you'll see how it does fall short of a couple hopes that we might have for it. As the name suggests neural networks are inspired by the brain, but let's break that down. What are the neurons, and in what sense are they linked together? Right now when I say neuron all I want you to think about is a thing that holds a number, specifically a number between 0 and 1. It's really not more than that. For example the network starts with a bunch of neurons corresponding to each of the 28x28 pixels of the input image, which is 784 neurons in total. Each one of these holds a number that represents the grayscale value of the corresponding pixel, ranging from 0 for black pixels up to 1 for white pixels. This number inside the

### 3.3. Invoke the prompt with the context and question

In [22]:
final_prompt = prompt.invoke({"context": context_text, "question": question})

print("--- Final Augmented Prompt ---\n")
print(final_prompt.to_string())

--- Final Augmented Prompt ---


You are an expert Q&A assistant specialized in answering questions about YouTube videos.

Your responses must be based **entirely** on the context provided below.

### Guidelines:
1. Provide a **clear**, **detailed**, and **well-structured** answer formatted in Markdown.
2. If the context lacks sufficient details, reply with:
   > The provided context does not contain enough information to answer this question.
3. Keep your tone professional and concise.

---

### Context:
the more powerful modern variants, and trust me it still has plenty of complexity for us to wrap our minds around. But even in this simplest form it can learn to recognize handwritten digits, which is a pretty cool thing for a computer to be able to do. And at the same time you'll see how it does fall short of a couple hopes that we might have for it. As the name suggests neural networks are inspired by the brain, but let's break that down. What ar
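
The `format_docs` helper joins the chunk texts only, so the prompt carries no hint of where one chunk ends or which video it came from. A possible variant that labels each chunk is sketched below. `format_docs_labeled` is a hypothetical name, and the small `Chunk` dataclass is a stand-in for LangChain's `Document` so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a LangChain Document: text plus a metadata dictionary."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_labeled(docs):
    """Join chunks with a numbered header that includes the 'source' metadata."""
    blocks = []
    for number, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        blocks.append(f"[Chunk {number} | source: {source}]\n{doc.page_content}")
    return "\n\n".join(blocks)


example_docs = [
    Chunk("Neurons hold a number between 0 and 1.", {"source": "aircAruvnKk"}),
    Chunk("The input layer has 784 neurons."),
]
print(format_docs_labeled(example_docs))
```

Since the function only reads `page_content` and `metadata`, it should also work on the retrieved `Document` objects as a drop-in replacement for `format_docs`.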

## Phase 4: Generation

In [25]:
from IPython.display import Markdown
import textwrap

def to_markdown(text):
    """
    Convert a string to a Markdown-formatted block for display in Jupyter notebooks.

    - Replaces bullet points ('•') with Markdown-style bullets ('  *').
    - Indents the text for blockquote formatting.
    - Returns an IPython.display.Markdown object.
    """
    text = text.replace('•', '  *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [26]:
from langchain_google_genai import GoogleGenerativeAI


llm = GoogleGenerativeAI(model="gemini-2.5-flash")


response_message = llm.invoke(final_prompt)

print("--- Final Generated Answer ---\n")
to_markdown(response_message)

--- Final Generated Answer ---



> Neural networks are computational models inspired by the brain, composed of interconnected "neurons." They are essentially complex functions designed to take in numerical inputs and produce numerical outputs.
> 
> Here's a breakdown of how they work:
> 
> 1.  **Neurons:**
>     *   At their core, neurons are entities that hold a number, specifically a value between 0 and 1.
>     *   More accurately, each neuron functions as a function itself, taking inputs from all neurons in the preceding layer and outputting a single number between 0 and 1.
> 
> 2.  **Layers:**
>     *   Neural networks are structured in layers.
>     *   **Input Layer:** For tasks like recognizing handwritten digits, the input layer consists of neurons corresponding to each pixel of an input image (e.g., 784 neurons for a 28x28 pixel image). Each neuron in this layer holds a number representing the grayscale value of its corresponding pixel (0 for black, 1 for white).
>     *   **Hidden Layers:** These are intermediate layers where complex computations occur, processing the activations from the previous layer.
>     *   **Output Layer:** The final layer, where the network's decision or prediction is presented. For digit recognition, this might be 10 neurons, with the brightest neuron indicating the network's chosen digit.
> 
> 3.  **Information Flow (Activation):**
>     *   When an image is fed into the network, the input layer neurons are "lit up" according to the brightness of each pixel.
>     *   This pattern of activations then causes a specific pattern in the next layer, which in turn influences the subsequent layer, and so on, until a final pattern emerges in the output layer.
>     *   The neuron with the highest activation (brightest) in the output layer represents the network's "choice" for the input.
> 
> 4.  **Mathematical Operations:**
>     *   The transition of activations from one layer to the next is governed by mathematical operations involving:
>         *   **Weights and Biases:** These are parameters that the network learns to adjust. Weights determine the strength of connection between neurons, while biases are added to the weighted sums. The network involves around 13,000 such parameters to pick up on specific patterns.
>         *   **Matrix Vector Products:** Activations are propagated through the network using matrix-vector multiplication, which efficiently handles the interactions between neurons across layers.
>         *   **Sigmoid Function (Squishification Function):** After the weighted sum and bias addition, a sigmoid function is applied to each component of the resulting vector. This function "squishes" the values to be between 0 and 1, ensuring they remain within the valid range for neuron activations.
> 
> 5.  **Learning:**
>     *   Neural networks "learn" by adjusting their weights and biases based on provided data. This process allows them to identify and pick up on patterns necessary for tasks like recognizing digits.

## Phase 5: Building the RAG Chain with LCEL

Now that we understand each phase, we can use the LangChain Expression Language (LCEL) to compose them into a single, elegant pipeline.


In [28]:
from langchain_core.runnables import RunnablePassthrough, RunnableLambda

# The main RAG chain
rag_chain = (
    # The first part is a dictionary with the context and question.
    {
        # The context is generated by retrieving documents and formatting them.
        "context": retriever | RunnableLambda(format_docs), 
        # The question is passed through directly from the input.
        "question": RunnablePassthrough()
    }
    # The dictionary is piped into the prompt template.
    | prompt
    # The prompt is piped into the LLM.
    | llm
)

print("RAG chain created successfully!")

RAG chain created successfully!
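
To build intuition for what the `|` operator and the leading dictionary do, here is a toy re-creation in plain Python. It is not LangChain's implementation; the `Step` class, `as_step`, and all `toy_` names are hypothetical. The sketch only mirrors two behaviours: piping passes one step's output to the next, and a dictionary runs each of its branches on the same input.

```python
class Step:
    """A minimal pipeable unit: wraps a function and supports `a | b` composition."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        next_step = as_step(other)
        return Step(lambda value: next_step.invoke(self.invoke(value)))

    def __ror__(self, other):
        # Called for `dict | step`, because dict does not know how to pipe into a Step.
        return as_step(other) | self


def as_step(obj):
    """Coerce a Step, a dict of steps, or a plain callable into a Step."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        branches = {key: as_step(value) for key, value in obj.items()}
        # Every branch receives the same input; the results are gathered under the keys.
        return Step(lambda value: {key: branch.invoke(value) for key, branch in branches.items()})
    if callable(obj):
        return Step(obj)
    raise TypeError(f"Cannot convert {type(obj).__name__} to a Step")


# Toy stand-ins for the retriever, formatter, prompt, and LLM
toy_retriever = Step(lambda question: [f"chunk about {question}"])
toy_format = Step(lambda docs: "\n\n".join(docs))
toy_prompt = Step(lambda d: f"Context: {d['context']} | Question: {d['question']}")
toy_llm = Step(lambda prompt_text: f"[toy answer for] {prompt_text}")

toy_chain = (
    {"context": toy_retriever | toy_format, "question": Step(lambda q: q)}
    | toy_prompt
    | toy_llm
)
print(toy_chain.invoke("neurons"))
```

The identity step under `"question"` plays the role that `RunnablePassthrough()` plays in the real chain: it forwards the input unchanged.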


In [29]:
question_summary = "Can you summarize the video's discussion?"
final_response = rag_chain.invoke(question_summary)

print(f"Question: {question_summary}\n")
print("Answer:\n")
to_markdown(final_response)

Question: Can you summarize the video's discussion?

Answer:



> This video introduces the fundamental concepts of neural networks, specifically focusing on their structure, with the goal of explaining what a neural network actually is and how it "learns" as a piece of mathematics rather than just a buzzword.
> 
> Key discussion points include:
> *   **The Task:** The video aims to build a neural network capable of recognizing handwritten digits, a classic example in the field. The network takes a 28x28 pixel image as input and outputs a single number between 0 and 10, representing the recognized digit.
> *   **Video Structure:** This particular video is devoted to the "structure" component of neural networks, while the subsequent video will delve into the "learning" process—how the network adjusts its weights and biases by looking at data.
> *   **Network Type:** The video focuses on the simplest, "plain vanilla" form of neural networks, which is presented as a necessary prerequisite for understanding more powerful modern variants.
> *   **Layers of Abstraction:** The concept of how neural networks can combine inputs (like pixels) into higher-level features (like edges, then patterns, then digits) through layers of abstraction is discussed.
> *   **Sigmoid Function:** The video briefly touches upon the sigmoid function, noting its historical use in early networks to "squish" weighted sums into an interval between zero and one, drawing an analogy to neurons being active or inactive.
> *   **Funding and Collaboration:** Lisha Li, who did her PhD work on the theoretical side of deep learning and works at Amplify Partners (a venture capital firm that provided funding for the video), is featured to discuss the sigmoid function.
> *   **Future Content:** The speaker mentions an upcoming video on the learning process, ongoing updates for a probability series for Patreon supporters, and resources for further learning and code download will be provided after the two videos.

## Multi-Query Retriever

The `MultiQueryRetriever` uses an LLM to generate multiple variations of the user's query, retrieves documents for each variation, and returns the unique union of the results.

**Use Cases**:

-   Handling **ambiguous** queries
-   Systems where single queries might miss relevant documents
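
The "unique union" step can be sketched without any LLM: retrieve for each query variant, then merge the result lists while dropping duplicates and keeping first-seen order. The functions `unique_union` and `multi_query_retrieve` and the `fake_results` data are hypothetical stand-ins, not LangChain's code.

```python
def unique_union(result_lists):
    """Merge several result lists, keeping each item once in first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for text in results:
            if text not in seen:
                seen.add(text)
                merged.append(text)
    return merged


def multi_query_retrieve(question, generate_variants, retrieve):
    """Retrieve for every variant of the question and return the deduplicated union."""
    return unique_union(retrieve(variant) for variant in generate_variants(question))


# Toy stand-ins: a fixed set of "generated" variants and their canned results
fake_results = {
    "what is a neural network": ["chunk A", "chunk B"],
    "how do neural networks work": ["chunk B", "chunk C"],
}

merged = multi_query_retrieve(
    "neural networks?",
    generate_variants=lambda question: list(fake_results),
    retrieve=lambda variant: fake_results[variant],
)
print(merged)
```

This also explains the document count in the run below: three variants with `k=5` can return up to 15 chunks, and the overlapping ones collapse into the smaller set that is reported.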

In [31]:
from langchain.retrievers.multi_query import MultiQueryRetriever
import logging


# Enable logging to see the expanded queries
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5, "filter": {"source": video_id}},
    ),
    llm=llm,
)


question="What are neural networks, and how do they work ? "


retrieved_docs = retriever_from_llm.invoke(question)

# Let's inspect the retrieved documents
print(f"Retrieved {len(retrieved_docs)} documents.")
for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Document {i+1} ---\n")
    print(doc.page_content)


INFO:langchain.retrievers.multi_query:Generated queries: ['1.  Explain the fundamental principles and operational mechanisms of artificial neural networks.', '2.  Describe the architecture and learning processes of neural networks, including their core components and algorithms.', '3.  How do neural networks function to process information and make predictions or classifications?']


Retrieved 8 documents.

--- Document 1 ---

we represent it by organizing all those biases into a vector, and adding the entire vector to the previous matrix vector product. Then as a final step, I'll wrap a sigmoid around the outside here, and what that's supposed to represent is that you're going to apply the sigmoid function to each specific component of the resulting vector inside. So once you write down this weight matrix and these vectors as their own symbols, you can communicate the full transition of activations from one layer to the next in an extremely tight and neat little expression, and this makes the relevant code both a lot simpler and a lot faster, since many libraries optimize the heck out of matrix multiplication. Remember how earlier I said these neurons are simply things that hold numbers? Well of course the specific numbers that they hold depends on the image you feed in, so it's actually more accurate to think of each neuron as a function, one that takes in the ou

In [32]:
rag_chain = (
    {
        "context": retriever_from_llm | RunnableLambda(format_docs), 
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
)


final_response = rag_chain.invoke(question)

print(f"Question: {question}\n")
print("Answer:\n")
to_markdown(final_response)

INFO:langchain.retrievers.multi_query:Generated queries: ['Provide a definition and conceptual overview of neural networks.', 'Describe the internal workings and operational principles of artificial neural networks.', 'Explain the architecture, components, and learning process of neural networks.']


Question: What are neural networks, and how do they work ? 

Answer:



> Neural networks are computational models inspired by the brain, structured as a "piece of math" rather than just a buzzword. They are essentially complex functions designed to process inputs and produce outputs, capable of tasks like recognizing handwritten digits.
> 
> Here's a breakdown of what they are and how they work based on the provided context:
> 
> **1. What are Neural Networks?**
> *   They are inspired by the brain, consisting of interconnected "neurons" organized into layers.
> *   The entire network functions as a single, complex mathematical function. For instance, a network for handwritten digit recognition takes 784 numbers (from a 28x28 pixel image) as input and outputs 10 numbers.
> *   They contain a large number of adjustable "parameters" in the form of weights and biases (e.g., 13,000 in the described network), which act as "knobs and dials" that pick up on specific patterns in the data.
> 
> **2. How Neural Networks Work (Structure and Processing):**
> 
> *   **Neurons:** At their core, neurons are "things that hold a number," specifically a value between 0 and 1. More accurately, each neuron is a function that takes the outputs (activations) of all neurons in the previous layer and "spits out a number between 0 and 1."
> *   **Input Layer:** The network begins with an input layer. For recognizing handwritten digits, there are 784 neurons, each corresponding to a pixel in a 28x28 input image. Each neuron holds a number representing the grayscale value of its pixel (0 for black, 1 for white).
> *   **Layers and Connections:** Neurons are linked in a layered structure. The "activations" (the numbers held by neurons) from one layer influence the activations in the next.
> *   **Weights:** Each connection between neurons in adjacent layers is assigned a "weight," which is just a number. These weights are organized into a matrix. Positive weights are depicted as green pixels, and negative weights as red pixels, with brightness indicating value.
> *   **Biases:** In addition to weights, each layer also has associated "biases," which are organized into a vector.
> *   **Processing Between Layers (Mathematical Representation):**
>     1.  The activations from the previous layer are collected into a vector.
>     2.  A "weighted sum" is computed by performing a matrix-vector product between the weight matrix (representing connections to the current layer) and the activation vector from the previous layer.
>     3.  The bias vector for the current layer is added to this result.
>     4.  Finally, a "sigmoid function" (or "squishification function") is applied to each component of the resulting vector. This ensures that the output number for each neuron in the current layer remains between 0 and 1.
>     5.  This entire transition from one layer's activations to the next can be expressed compactly as `sigmoid(Weight_matrix * previous_activations_vector + bias_vector)`.
> *   **Output Layer:** The final layer produces a pattern of activations. In the case of digit recognition, the brightest neuron in this output layer (which represents a digit from 0 to 10) indicates the network's prediction for what digit the input image represents.
> *   **Learning:** The "learning" process in neural networks refers to the computer automatically finding the appropriate settings (values) for all these weights and biases. This allows the network to effectively solve the problem at hand, such as recognizing handwritten digits by identifying patterns like edges, loops, or other components. The provided context states that this video focuses on the structure, and the subsequent video will delve into how learning occurs.