# Rebuilding WALL·E's Memories with RAG

One ordinary Earth-cleaning afternoon, WALL·E climbed a pile of old iPhones trying to rescue a Rubik's Cube.

<img src="https://compote.slate.com/images/17bdccdd-d8c9-44e6-b7f8-96f03ca50b33.jpeg?crop=1560%2C1040%2Cx0%2Cy0" width="400"/>

A pigeon startled him. He slipped.

**CRASH.**

When WALL·E woke up, something was wrong...

```
>>> SYSTEM BOOTING...
>>> WALL-E unit #700X
>>> STATUS: 🟥 MEMORY CORRUPTED
```

<img src="https://wp-cdn.fortect.com/uploads/2023/10/20111547/BSOD-Memory-Management-1024x536.webp" width="600"/>

Oh no... The fall seemed pretty bad. Let's try asking him a couple of questions:

> 🧑‍🔧: It's okay, buddy. You took a pretty bad fall. Let's try something simple. Who are you?  
🤖: I... I do not know. Memory blocks missing.  
🧑‍🔧: Hmm. Okay. Let's try this... Do you remember EVE?  
🤖: E...V...E... error. No match found in memory banks. Who... is EVE?

💔 WALL·E has lost all his memories...

But wait! We found an ancient relic on a dusty old USB drive: **the original WALL·E movie script! 📝**

In [None]:
# Obtain the script! We are nice enough to locate and prepare it for you
!curl -L "https://assets.scriptslug.com/live/pdf/scripts/wall-e-2008.pdf?v=1729115058" -o walle_script.pdf

<img src="https://i.etsystatic.com/39233251/r/il/6c8e18/5323736276/il_fullxfull.5323736276_k62y.jpg" width="300"/>

Good news! We can use this script to rebuild WALL·E's memories using **Retrieval Augmented Generation** powered by **LangChain**.

This will allow us to:
- Load the original script
- Break it into memory-safe chunks
- Search relevant fragments when a question is asked
- Use a language model to reconstruct answers

Let's get started!

## What is Retrieval Augmented Generation (RAG)?

Large language models like GPT are powerful, but they don't have access to your custom data (like WALL·E's movie script) unless you give it to them.

**RAG (Retrieval-Augmented Generation)** is a way to augment an LLM with external knowledge dynamically.

It works like this:
1. When you ask a question, we retrieve relevant documents from a knowledge base (like pieces of a movie script).
2. These documents are passed along with your question to the LLM, which then uses both to generate an informed response.
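The two steps above can be sketched in plain Python before any library gets involved. This is only a toy illustration: the knowledge base is three made-up sentences (not quotes from the script), retrieval is a naive word-overlap count rather than a real embedding search, and the names `retrieve` and `build_prompt` are hypothetical helpers, not LangChain functions.

```python
import re

# Toy knowledge base: made-up fragments for illustration, not script quotes.
KNOWLEDGE_BASE = [
    "WALL-E is a small trash-compacting robot left on Earth.",
    "EVE is a sleek probe robot sent to Earth to look for plant life.",
    "The Axiom is a starliner that carries the human passengers.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Step 1: rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context_docs: list[str]) -> str:
    """Step 2: combine the retrieved context and the question into one prompt."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


toy_question = "Who is EVE?"
top_docs = retrieve(toy_question, KNOWLEDGE_BASE)
prompt_text = build_prompt(toy_question, top_docs)
print(prompt_text)
```

In a real pipeline, the string in `prompt_text` is what would be sent to the LLM. Word overlap is a crude stand-in for retrieval; later in this workshop it is replaced by embedding-based similarity search.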


Below is a more technical definition:

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20250210184749053767/What-is-RAG_.webp" width="600"/>

**Why is this useful?**
- You don't have to fine-tune a model.
- You can update data without retraining.
- It keeps answers grounded in known sources.

In this workshop, we'll build a RAG pipeline, step by step, to help WALL·E recall information from his script.

## What is LangChain?

We've talked about what RAG is, but how do we actually *build* a system that can retrieve documents and talk like WALL·E? That's where **LangChain** comes in.

LangChain is an open-source framework that connects the distinct parts of an AI app:
- The **LLM** (e.g., OpenAI's GPT-4)
- The **retriever** (e.g., a vector store to search memory)
- The **embedding model** (to turn text into numeric form)
- The **document loaders** (like PDFs, websites, or APIs)

Instead of making us write the code that integrates each component, LangChain gives us modular tools and pre-built chains that make everything talk to each other.

In this project, LangChain will help us:
- Load the WALL·E script
- Split it into smaller chunks
- Generate vector embeddings
- Store and search those embeddings
- Feed context to the LLM and return answers

## 🚀 Step 0: Fire Up WALL·E's Core Systems (Environment Setup)

Before we can help WALL¬∑E remember anything, we need to prepare the systems that simulate his brain.

**Make sure you have a working [OpenAI API Key](https://platform.openai.com/account/api-keys) for the LLM and embedding model access.**

In [None]:
# Install the main langchain package
!pip install --quiet --upgrade langchain

# Install the LangChain components required for our project
!pip install --quiet --upgrade langchain-core langchain-text-splitters langchain-community langgraph langchain-openai

# Install other dependencies to work with PDFs and transformers
!pip install --quiet --upgrade pypdf sentence_transformers

In [None]:
# Load OpenAI API key
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

In [None]:
# Set up embeddings model
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [None]:
# Set up in-memory vector store
from langchain_core.vectorstores import InMemoryVectorStore
vector_store = InMemoryVectorStore(embeddings)

In [None]:
# Set up chat model
from langchain.chat_models import init_chat_model
llm = init_chat_model("gpt-4o-mini", model_provider="openai")

In [None]:
assert embeddings is not None, "❌ Embeddings model uninitialized."
assert vector_store is not None, "❌ Vector store uninitialized."
assert llm is not None, "❌ Language model uninitialized."

print("✅ You are good to go!")

**🛠️ Want to customize?**

What if I'm not a fan of OpenAI? Am I locked into using their embeddings and chat models?

Of course not! LangChain is really flexible on this. You can easily swap out components to fit your needs or preferences:

- Try different **LLMs** like Anthropic Claude, Cohere, or Mistral
- Use different **vector stores** like FAISS, Pinecone, or Chroma
- Run models **locally** or in the **cloud**

In fact, the only parts of the code you will need to modify are the initial setup cells above, where we define the embedding model, vector store, and LLM. The rest of the pipeline (loading, splitting, retrieving, generating) will work just the same.

This is one thing I really love about LangChain: how *"plug-and-chug"* it is.

📚 Check out the [LangChain docs](https://docs.langchain.com/) for more information.

## 📜 Step 1: Load the Memory Archive (Document Loader)

Now that we have our basic setup, we'll start by loading the WALL·E script from the PDF file that you (hopefully) downloaded by running the earlier cell. This will become the "memory source" from which WALL·E can later reconstruct his thoughts.

LangChain provides a `PyPDFLoader` that extracts text from each page of the script, returning it as a list of documents.

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF script into WALL·E's recovery core
loader = PyPDFLoader("walle_script.pdf")
pages = []
# Top-level `async for` works in Jupyter/IPython; in a plain script, use loader.lazy_load() instead
async for page in loader.alazy_load():
    pages.append(page)

```
[████████░░░░░░░░░░] 35% - Locating movie script...
[██████████████░░░░] 75% - Found 1 source: `walle_script.pdf`
[██████████████████] 100% - Script Loaded ✅
```

Let's inspect a sample page to check that the script actually loaded:

In [None]:
print(f"Total pages loaded: {len(pages)}\n")

# Skipping the title page (pages[0])
page_num = 1
print(f"{'='*40}")
print(f"📄 Page {page_num} Metadata")
print(f"{'-'*40}")
print(pages[page_num].metadata)

print(f"\n{'='*40}")
print(f"📜 Page {page_num} Content")
print(f"{'-'*40}")
print(pages[page_num].page_content)
print(f"{'='*40}\n")

Looks good!

<img src="https://www.iamag.co/wp-content/uploads/2018/02/cover-walle.jpg" width="600"/>

Each page includes both the **text content** and **metadata** like page number, title, author, etc.

This is useful for debugging, understanding file structure, or even filtering specific pages. But for our purposes, we don't have to worry too much about it.

## ✂️ Step 2: Break the Script into Memory-Safe Chunks (Text Splitter)

What happens if we try to give WALL·E the **entire** script all at once? We gently place the documents into his input slot.

> Beep... Whirr... BZZZT...  
> Eeee-...ERR-...💥  

Uh oh... Turns out WALL·E's memory unit has limited space, just like most language models.

Language models can only "see" a fixed number of tokens at a time, known as the **context window**.

For example, even powerful models like GPT-4 have a context limit (e.g., 8k, 32k, or 128k tokens depending on the variant). If you try to input more than that, the request is typically rejected or the excess is truncated.
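To get a feel for the numbers, here is a rough back-of-the-envelope check. It assumes about four characters per token, which is only a common rule of thumb for English text and not a real tokenizer. The function names, the 200,000-character script length, and the 500 tokens reserved for the reply are all hypothetical values chosen for illustration.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate the token count from the character count.

    The 4-characters-per-token ratio is a rule of thumb, not a tokenizer,
    so treat the result as an approximation.
    """
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_window: int, reserved_for_answer: int = 500) -> bool:
    """Check whether the text plausibly fits while leaving room for the reply."""
    return estimate_tokens(text) + reserved_for_answer <= context_window


# Hypothetical numbers: a 200,000-character script and an 8k-token window.
script_text = "x" * 200_000
print(estimate_tokens(script_text))                                 # whole script
print(fits_in_context(script_text, context_window=8_000))           # too big
print(fits_in_context(script_text[:1_000], context_window=8_000))   # one small chunk
```

Under these assumptions the whole script is several times larger than the window, while a 1,000-character piece fits easily. That gap is the motivation for chunking.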

To help him process this massive amount of text in a manageable way, we need to:
- Split the script into smaller **chunks**
- Overlap these chunks to maintain **overall context**
- Track their **original position** in the document

We'll use LangChain's `RecursiveCharacterTextSplitter`, which tries to break documents along their natural structure (paragraph breaks first, then line breaks, then spaces).

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
all_splits = text_splitter.split_documents(pages)

print(f"Split script into {len(all_splits)} sub-documents.")

How this works:

- The text splitter tries to split text at natural boundaries, in order: paragraph breaks, then line breaks, then spaces, then individual characters.
- We define the maximum number of characters (e.g. 1000) as the target chunk size.
- We also define how much context from the previous chunk is retained in the next one (e.g. 200 characters).
- If no good boundaries are found, it falls back to smaller units.
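To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch. It uses fixed-size windows only and does not look for paragraph or sentence boundaries, so it is not how `RecursiveCharacterTextSplitter` is implemented. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Split text into fixed-size windows that overlap their predecessor.

    Returns (start_index, chunk) pairs, similar in spirit to add_start_index.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
overlap_demo = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for start, chunk in overlap_demo:
    print(start, chunk)
```

With a chunk size of 10 and an overlap of 3, each window starts 7 characters after the previous one, so the last 3 characters of one chunk reappear at the start of the next. The same arithmetic applies to the 1000/200 setting used above, where each new chunk can repeat up to 200 characters from the one before it.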

This smart splitting helps each chunk stay semantically coherent instead of being cut off at an arbitrary point.

## 🧲 Step 3: Upload to WALL·E's Memory Module (Vector Store)

WALL·E doesn't store memories like we do. He can't directly understand text. Therefore, we need to convert each chunk of the movie script into an **embedding**: a numerical vector that captures the semantic meaning of the text. These embeddings are created using a powerful transformer model.

<img src="https://www.cs.cmu.edu/~dst/WordEmbeddingDemo/figures/fig5.png" width="600"/>

Once we have these vectors, we store them in a **vector store**: a searchable database optimized for similarity-based retrieval.
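As a rough picture of what "similarity-based retrieval" means, the sketch below compares hand-made vectors using cosine similarity, one common similarity measure. The three-dimensional vectors and the sentences attached to them are invented for illustration. Real embedding models produce vectors with hundreds or thousands of dimensions, and a real vector store may use a different measure or index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings" (hypothetical values, not model output).
toy_store = {
    "EVE scans the ground for plants": [0.9, 0.1, 0.0],
    "WALL-E compacts a cube of trash": [0.1, 0.9, 0.1],
    "The Axiom drifts through space": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend this is the embedding of "Who is EVE?"

# Retrieval picks the stored text whose vector points most nearly the same way.
best_match = max(
    toy_store,
    key=lambda text: cosine_similarity(query_vector, toy_store[text]),
)
print(best_match)
```

The query vector points in nearly the same direction as the EVE sentence's vector, so that sentence is returned as the closest memory.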

LangChain abstracts these complicated ideas away nicely. It handles all the interfacing and heavy lifting, so we only need a single call to store the documents.

In [None]:
document_ids = vector_store.add_documents(documents=all_splits)

Let's print a few document IDs to confirm everything uploaded correctly:

In [None]:
print(document_ids[:3])

These unique IDs can be used to reference, update, or delete specific documents.

Now that WALL·E's memory has been fully indexed, we can start asking him questions.

But before that, we need to create the actual "thought pipeline", basically the RAG chain that:
1. Accepts a question
2. Searches the vector store for relevant memory chunks
3. Feeds those chunks to the language model
4. Returns a contextual, informed answer

We are almost there! We just need to connect the dots.

## 🤖 Step 4: Reconstruct Thoughts (Create the RAG Chain)

WALL·E's memory fragments are now embedded, indexed, and stored. It's time to bring him back to life.

In a RAG pipeline, it's not enough to just retrieve relevant documents. What is the chat model supposed to do with them anyway? We need to *tell* the language model how to use them.

This is where **prompting** comes in.

A **prompt** is the actual input string that gets sent to the language model. It usually includes:
- The user's question
- Retrieved context (from the vector store)
- Optional formatting or instructions ("Answer concisely", "Use markdown", etc.)

Instead of hardcoding the prompt yourself, you can pull one from **LangChain Hub**, a registry of ready-made prompt templates.

We'll use a popular one for Retrieval-Augmented Generation:

In [None]:
# Load a generic RAG-style prompt from LangChain Hub
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

By the way, here is what the actual prompt looks like.

*💡 You can customize this later with your own prompt templates; just make sure they contain placeholders for question and context.*

```
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
```
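If you want to try your own template, a minimal sketch using plain Python string formatting might look like this. The template wording and the `fill_template` helper are hypothetical and are not taken from LangChain Hub. The only requirement carried over from the note above is that both `{question}` and `{context}` appear in the template.

```python
# A hypothetical custom template; the wording is ours, not from LangChain Hub.
CUSTOM_TEMPLATE = (
    "You are WALL-E recovering your memories from the movie script.\n"
    "Answer using only the context below. If the context does not contain "
    "the answer, say you do not remember.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def fill_template(template: str, question: str, context: str) -> str:
    """Fail early if a placeholder is missing, then fill both slots."""
    for placeholder in ("{question}", "{context}"):
        if placeholder not in template:
            raise ValueError(f"Template is missing the {placeholder} placeholder")
    return template.format(question=question, context=context)


filled = fill_template(CUSTOM_TEMPLATE, "Who is EVE?", "EVE is a probe robot.")
print(filled)
```

Within LangChain itself, the same template string could likely be wrapped with `ChatPromptTemplate.from_template` from `langchain_core.prompts` so that it can be invoked the same way as the Hub prompt. Check the docs for your installed version before relying on that.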

Let's try asking a question and see if WALL·E is able to spit out the correct answer:

In [None]:
# Ask WALL·E a question
question = "Who is Eve?"

# Step 1: Retrieve relevant documents using similarity search
retrieved_docs = vector_store.similarity_search(question)

# Step 2: Combine their content into a single context block
docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

# Step 3: Plug question + context into the RAG prompt
final_prompt = prompt.invoke({
    "question": question,
    "context": docs_content
})

# Step 4: Send the composed prompt to the LLM for response generation
answer = llm.invoke(final_prompt)

In [None]:
# llm.invoke returns a message object; .content holds the answer text
print(answer.content)

Hooray!

<img src="https://davidswanson.wordpress.com/files/2009/02/wall-e.jpg" width="800"/>

## 🌟 Mission Complete: WALL·E Remembers

You've just built a working **RAG system**!

Let's recap what we did:
- Used LangChain to load a real-world document
- Split it into reasonably sized chunks
- Turned those into vector embeddings
- Stored them in a searchable vector store
- Queried it via an LLM to simulate memory reconstruction


This same pipeline can be adapted for:
- Document Q&A systems
- Chatbots with memory
- Internal knowledge assistants
- Customer support agents

Now that you've helped WALL·E recover, try loading your own documents and help something else remember. 💾🤖