Implementing RAG based on my HW 

I'm just trying to learn RAG. This project is an implementation based on my COGSCI-132 homework.

RAG can be broken down into 5 actionable steps:

1. Load: We load our data using document loaders.
2. Split: The data is split into chunks small enough to fit within the context window. We will use RecursiveCharacterTextSplitter to recursively split our text until the chunks are within the limit.
3. Store: The chunks are converted into embeddings, which represent text numerically so that computers can compare it. These embeddings are then stored in a vector store.
4. Retrieve: The chunks whose embeddings are most similar to the query are retrieved with a similarity search.
5. Generate: Both the query and the retrieved context are passed to the LLM, which uses them to generate a grounded response.
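
The five steps above can be sketched end to end in plain Python. This is only a toy illustration, not what the notebook runs: the word-count `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and the sample text and all function names are made up for the example. The final step stops at building the prompt, since calling an LLM needs an API key.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split(text, chunk_size):
    """Split: cut the text into pieces of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def retrieve(query, store, k=1):
    """Retrieve: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, context):
    """Generate (input side): combine the question with the retrieved context."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# 1. Load (hypothetical in-memory text instead of a document loader)
text = ("Neurons fire action potentials. The membrane acts as a capacitor. "
        "Ion channels control current flow.")
# 2. Split
chunks = split(text, chunk_size=5)
# 3. Store (chunk, vector) pairs
store = [(chunk, embed(chunk)) for chunk in chunks]
# 4. Retrieve
question = "What does the membrane act as?"
top = retrieve(question, store, k=1)
# 5. Generate: this prompt is what would be sent to the LLM
prompt = build_prompt(question, top)
print(top)
```

Word counts only match exact words, which is why real pipelines use learned embeddings that also capture meaning.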

Setup

We set up LangSmith for tracing, as well as OpenAI. getpass is used so that API keys are not echoed to the screen when they are entered.

In [14]:
# our text splitters and langchain
%pip install --quiet --upgrade langchain-text-splitters langchain_community langgraph 

# our llm 
%pip install -qU "langchain[openai]"

# our vector db 
%pip install -qU langchain-chroma 

# our pdf document loader
%pip install "langchain-unstructured[local]" 
%pip install --upgrade --quiet langchain-unstructured unstructured-client unstructured "unstructured[pdf]" python-magic

Note: you may need to restart the kernel to use updated packages.


In [4]:
import getpass 
import os 

# LangSmith documents the lowercase string "true" for enabling tracing
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass('Enter LangSmith API Key: ')

In [5]:
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass('Enter OpenAI API Key: ')
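
The "prompt only if the key is missing" pattern used for the OpenAI key could be wrapped in a small helper so that every key is handled the same way. This is a sketch, not part of the original notebook: `ensure_env` and its `ask` parameter are names invented here, and `ask` exists only so the prompt can be swapped out (for example in a test).

```python
import getpass
import os


def ensure_env(name, prompt=None, ask=getpass.getpass):
    """Set os.environ[name] from a hidden prompt unless it is already set."""
    if not os.environ.get(name):
        # getpass.getpass does not echo what is typed, so the key stays off-screen
        os.environ[name] = ask(prompt or f"Enter {name}: ")
    return os.environ[name]
```

In the notebook this would be called as `ensure_env("OPENAI_API_KEY")`, and re-running the cell would not prompt again once the key is set.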

Here we instantiate our LLM, our embeddings function, and our vector DB.

In [9]:
from langchain.chat_models import init_chat_model
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma 

llm = init_chat_model("openai:gpt-5-nano") # or init_chat_model("gpt-5-nano", model_provider="openai")

# the OpenAI model name is "text-embedding-3-large" (singular "embedding")
embeddings = OpenAIEmbeddings(model='text-embedding-3-large')

vector_store = Chroma(
    collection_name="homework",
    embedding_function=embeddings
)

Indexing 

We are going to index our data using **Unstructured**. This will allow us to parse not only text but also images within PDF documents.

In [None]:
from langchain_unstructured import UnstructuredLoader
from unstructured.cleaners.core import clean_extra_whitespace

file_path = 'data/ermentrout-and-terman-ch-1.pdf'

loader = UnstructuredLoader(
    file_path,
    post_processors=[clean_extra_whitespace]
)
docs = loader.load()

# join every document's text so we can measure the total length
full_text = "".join(doc.page_content for doc in docs)

print("Number of documents: ", len(docs))
print("Total characters: ", len(full_text))

Number of documents:  796
Total characters:  58532
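
With 796 documents and 58,532 characters, each document averages only about 74 characters, which suggests the loader returned one document per page element (a title, a paragraph, a list item) rather than one per page. A quick way to check is to count documents per element type. The sketch below assumes each document's metadata carries a `category` key, which is what Unstructured usually provides but is not shown in this notebook. The `sample` list is made-up stand-in data so the snippet runs on its own.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_elements(docs):
    """Count documents per element category and total their characters."""
    categories = Counter(doc.metadata.get("category", "unknown") for doc in docs)
    total_chars = sum(len(doc.page_content) for doc in docs)
    return categories, total_chars


# Hypothetical stand-ins for loaded documents (not real output from the PDF)
sample = [
    SimpleNamespace(page_content="Chapter 1", metadata={"category": "Title"}),
    SimpleNamespace(page_content="First paragraph.", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="Second paragraph.", metadata={"category": "NarrativeText"}),
]

categories, total_chars = summarize_elements(sample)
print(categories.most_common())
print(total_chars)
```

In the notebook, `summarize_elements(docs)` would be called on the loaded documents instead of `sample`.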


Splitting Documents

The full text is long (58,532 characters), and it is difficult for models to find information in very long inputs, even when the text fits in the context window.

To handle this, we split the documents into chunks for embedding and vector storage, so that only the most relevant chunks are retrieved.

We will use RecursiveCharacterTextSplitter, which recursively splits the text on progressively finer separators until each chunk is within the chunk size limit.
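
To make the idea of recursive splitting concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation: it ignores overlap and drops the separators it splits on, and the function name and sample text are invented for the example. It tries the coarsest separator first (paragraph breaks) and falls back to finer ones (lines, words, single characters) only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        if len(part) > chunk_size:
            # Too long for this separator: flush, then retry with a finer one
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
        elif current and len(current) + len(sep) + len(part) <= chunk_size:
            # Small neighbors are merged back together while they still fit
            current = current + sep + part
        else:
            if current:
                chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


sample = ("First paragraph here.\n\n"
          "Second paragraph is a bit longer than the first one.\n\n"
          "Third.")
pieces = recursive_split(sample, chunk_size=30)
for piece in pieces:
    print(len(piece), repr(piece))
```

Only the second paragraph is longer than 30 characters, so it is the only piece that gets split at the word level. The other paragraphs stay whole.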

In [49]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize RecursiveCharacterTextSplitter
# params: chunk_size, chunk_overlap, add_start_index
# chunk_overlap repeats text between neighboring chunks so context is not cut off
# add_start_index makes each chunk store its original position in the text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=300,
    add_start_index=True
)

# to actually split the documents we use the split_documents method
all_splits = text_splitter.split_documents(docs)

print(f"split documents into {len(all_splits)} chunks")

split documents into 797 chunks
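
To see what chunk overlap and the start index mean, here is a small sketch that uses a fixed-size sliding window. It is a deliberate simplification, not the splitter's actual algorithm, which cuts at separators so its boundaries vary. `window_chunks` and the alphabet string are made up for the example. Under this simplified scheme, a chunk size of 1200 with an overlap of 300 would start a new chunk every 900 characters.

```python
def window_chunks(text, chunk_size, chunk_overlap):
    """Return (start_index, chunk) pairs whose neighbors share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = window_chunks(text, chunk_size=10, chunk_overlap=3)
for start, chunk in chunks:
    print(start, chunk)
```

Each chunk begins with the last three characters of the previous one, and the stored start index lets a chunk be traced back to its position in the original text.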
