# **Naive RAG**
Naive RAG is the simplest technique in the RAG ecosystem: it retrieves the documents most similar to the user's query and passes them to an LLM as context for generating the response.

Research Paper: [RAG](https://arxiv.org/pdf/2005.11401)

# **📚 Notebook Summary & Step-by-Step Guide** 

## **🎯 What This Notebook Does**
This notebook implements a **Naive RAG (Retrieval Augmented Generation)** system using Azure OpenAI and FAISS. It demonstrates how to:
- Load and process CSV data into a vector database
- Perform semantic search on documents
- Generate AI responses based on retrieved context
- Handle Windows encoding issues and Azure OpenAI rate limits

## **🔧 Key Libraries & Their Roles**

### **Core RAG Libraries**
- **`langchain`** - Main orchestration framework for RAG pipelines
- **`langchain-community`** - Extended components (CSV loader, FAISS integration)
- **`langchain-openai`** - Azure OpenAI integration for embeddings and chat

### **Vector Storage & Search**
- **`faiss-cpu`** - Facebook AI Similarity Search - fast vector similarity search
- **`pandas`** - Data manipulation and CSV handling (fallback option)

### **Azure OpenAI Components**
- **`AzureOpenAIEmbeddings`** - Converts text to 1536-dimensional vectors
- **`AzureChatOpenAI`** - GPT-4o-mini for generating responses

### **Document Processing**
- **`CSVLoader`** - Loads CSV files into LangChain Document format
- **`RecursiveCharacterTextSplitter`** - Splits large documents into chunks

## **📋 Step-by-Step Process**

### **Step 1: Environment Setup**
- Configure Azure OpenAI credentials (API key, endpoint, version)
- Set up local development environment (Windows compatibility)

### **Step 2: Document Loading & Processing**
- Load CSV data using `CSVLoader` with UTF-8 encoding (Windows fix)
- Limit documents to 20 to avoid Azure rate limits
- Split documents into 500-character chunks for better retrieval
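
To make the chunking step concrete, here is a minimal sketch of fixed-size character chunking. It is a simplification and not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line breaks. The function name `split_text` is hypothetical and exists only for this illustration.

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows (simplified: ignores separators)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "x" * 1200
chunks = split_text(sample, chunk_size=500, chunk_overlap=0)
print([len(chunk) for chunk in chunks])  # [500, 500, 200]
```

With `chunk_overlap=0`, as in this notebook, the chunks partition the text exactly. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary.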

### **Step 3: Vector Database Creation**
- Convert document chunks to embeddings using Azure OpenAI `text-embedding-ada-002`
- Store embeddings in FAISS in-memory vector database
- Explore vectorstore structure for educational understanding
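
The idea behind similarity search can be sketched with a brute-force loop over toy 3-dimensional vectors. This is a conceptual illustration only: the vectors are made up, the helper names `cosine_similarity` and `top_k` are hypothetical, and FAISS uses optimized index structures and may rank by a different metric (for example L2 distance, which yields the same ordering as cosine similarity when all vectors have unit length).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(i, score) for score, i in scored[:k]]


# Toy "embeddings"; real ones in this notebook have 1536 dimensions.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.8, 0.6, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
print([i for i, _ in results])  # [1, 0]
```

The query is converted to a vector in the same space as the documents, and the documents whose vectors point in the most similar direction are returned as context.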

### **Step 4: RAG Pipeline Setup**
- Create retriever from FAISS vectorstore
- Set up `AzureChatOpenAI` (gpt-4o-mini) for response generation
- Build RAG chain: Query → Retrieve Context → Generate Response

### **Step 5: Testing & Validation**
- Test with targeted questions about document content
- Demonstrate how similarity search finds relevant context
- Show end-to-end RAG response generation

### **Step 6: Evaluation Preparation**
- Structure data for evaluation frameworks
- Prepare datasets for quality assessment (Athina AI integration commented out)

## **🚀 Key Innovations in This Implementation**
- **Windows Compatibility**: UTF-8 encoding fix for Unicode issues
- **Rate Limit Management**: Document limiting to prevent Azure quota exhaustion  
- **Educational Features**: Vectorstore exploration for learning purposes
- **Azure Integration**: Full Azure OpenAI stack instead of standard OpenAI
- **Error Handling**: File-existence checks and diagnostic output when the CSV cannot be found

## **💡 Learning Outcomes**
Students will understand:
- How text becomes vectors and enables semantic search
- The role of each component in the RAG pipeline
- Practical considerations for production deployment
- Azure OpenAI integration and configuration
- Vector database concepts and similarity search

## **Initial Setup**

In [2]:
# ! pip install --q athina langchain-openai
%pip install langchain langchain-community langchain-openai faiss-cpu pandas datasets

Note: you may need to restart the kernel to use updated packages.



In [1]:
import os
# from google.colab import userdata  # Uncomment if using Google Colab

# Azure OpenAI Configuration
# For local development, set these environment variables directly or use a .env file
os.environ["AZURE_OPENAI_API_KEY"] = os.getenv('AZURE_OPENAI_API_KEY', 'Your_Azure_OpenAI_API_Key_Here')
os.environ["AZURE_OPENAI_ENDPOINT"] = os.getenv('AZURE_OPENAI_ENDPOINT', 'https://your-resource-name.openai.azure.com/')
os.environ["OPENAI_API_VERSION"] = "2024-02-01"  # or your preferred API version

# Other API keys (optional)
os.environ['ATHINA_API_KEY'] = os.getenv('ATHINA_API_KEY', '')
os.environ['PINECONE_API_KEY'] = os.getenv('PINECONE_API_KEY', '')
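
Because the cell above falls back to placeholder strings when the environment variables are missing, a misconfiguration only surfaces later as an authentication error. The following is a minimal sketch of an early check, assuming the placeholder values shown above; the helper `find_config_problems` is hypothetical and not part of any library.

```python
import os

REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_API_VERSION")
# Substrings of the placeholder defaults used in the configuration cell above.
PLACEHOLDER_MARKERS = ("Your_Azure_OpenAI_API_Key_Here", "your-resource-name")


def find_config_problems(env):
    """Return a list of human-readable problems found in a mapping of env vars."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif any(marker in value for marker in PLACEHOLDER_MARKERS):
            problems.append(f"{name} still contains a placeholder value")
    return problems


# Check the live environment; prints nothing when everything looks configured.
for problem in find_config_problems(os.environ):
    print("Config problem:", problem)
```

Passing the environment in as a mapping keeps the check easy to test with a plain dictionary.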

## **Indexing**

In [2]:
# load embedding model
from langchain_openai import AzureOpenAIEmbeddings
embeddings = AzureOpenAIEmbeddings(
    model="text-embedding-ada-002",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["OPENAI_API_VERSION"]
)

In [21]:
# Test CSV file reading and show preview
import pandas as pd
import os

csv_path = "../data/context.csv"
print(f"Checking if file exists: {os.path.exists(csv_path)}")
print(f"Full path: {os.path.abspath(csv_path)}")

if os.path.exists(csv_path):
    # Read and display CSV content
    df_preview = pd.read_csv(csv_path)
    print(f"\nCSV shape: {df_preview.shape}")
    print(f"Columns: {list(df_preview.columns)}")
    print("\nFirst 3 rows:")
    print(df_preview.head(3))
    
    print(f"\n" + "="*80)
    print("FULL CONTENT OF FIRST ROW:")
    print("="*80)
    if len(df_preview) > 0:
        for col in df_preview.columns:
            content = str(df_preview[col].iloc[0])
            print(f"\n📄 Column '{col}':")
            print("-" * 40)
            print(content)
            print("-" * 40)
else:
    print("❌ CSV file not found at the specified path")
    print("Current working directory:", os.getcwd())
    print("Files in current directory:", os.listdir("."))
    if os.path.exists("../data"):
        print("Files in ../data directory:", os.listdir("../data"))

Checking if file exists: True
Full path: f:\LernRAG\rag-cookbooks\data\context.csv

CSV shape: (232, 1)
Columns: ['context']

First 3 rows:
                                             context
0  ['African immigration to the United States ref...
1  ["Discount points, also called mortgage points...
2  ['Interlibrary loan (abbreviated ILL, and some...

FULL CONTENT OF FIRST ROW:

📄 Column 'context':
----------------------------------------
['African immigration to the United States refers to immigrants to the United States who are or were nationals of modern African countries. The term African in the scope of this article refers to geographical or national origins rather than racial affiliation. Between the Immigration and Nationality Act of 1965 and 2017, Sub-Saharan African-born population in the United States grew to 2.1 million people.Sub-Saharan Africans in the United States come from almost all regions in Africa and do not constitute a homogeneous group. They include peoples from d

In [3]:
# load data
from langchain_community.document_loaders import CSVLoader

# Create CSVLoader with UTF-8 encoding (Windows compatibility)
loader = CSVLoader(
    file_path="../data/context.csv",
    encoding="utf-8"
)

# Load documents
documents = loader.load()
print(f"Loaded {len(documents)} documents from CSV")

# Limit documents to prevent Azure OpenAI rate limits
MAX_DOCUMENTS = 20  # Adjust this number based on your rate limits
if len(documents) > MAX_DOCUMENTS:
    documents = documents[:MAX_DOCUMENTS]
    print(f"Limited to {len(documents)} documents to avoid rate limits")
else:
    print(f"Using all {len(documents)} documents")

Loaded 232 documents from CSV
Limited to 20 documents to avoid rate limits
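
Truncating the dataset is the approach this notebook takes. As an alternative sketch (not what the notebook does), a call that hits a rate limit can be retried with exponential backoff. Everything here is an assumption for illustration: `RateLimitError` is a hypothetical stand-in for the client library's rate-limit exception, and `flaky_embed` simulates an embedding call that fails twice before succeeding.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the client library's rate-limit exception."""


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on RateLimitError with delays of base_delay * 2**attempt."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * (2 ** attempt))


# Simulated embedding call that is rate-limited on its first two attempts.
attempts = {"count": 0}


def flaky_embed():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("simulated rate limit")
    return [0.1, 0.2, 0.3]


delays = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_embed, base_delay=0.5, sleep=delays.append)
print(result, delays)  # [0.1, 0.2, 0.3] [0.5, 1.0]
```

The `sleep` parameter is injected so the demo runs instantly; in real use the default `time.sleep` would pause between attempts.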


In [4]:
# split documents
from langchain_text_splitters import RecursiveCharacterTextSplitter  # provided by the langchain-text-splitters package
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
documents = text_splitter.split_documents(documents)

## **Pinecone Vector Database**

In [None]:
# # initialize pinecone client
# from pinecone import Pinecone as PineconeClient, ServerlessSpec
# pc = PineconeClient(
#     api_key=os.environ.get("PINECONE_API_KEY"),
# )

In [None]:
# # create index
# pc.create_index(
#         name='my-index',
#         dimension=1536,
#         metric="cosine",
#         spec=ServerlessSpec(
#             cloud="aws",
#             region="us-east-1"
#         )
#     )

In [None]:
# # load index
# index_name = "my-index"

In [None]:
# # create vectorstore
# from langchain.vectorstores import Pinecone
# vectorstore = Pinecone.from_documents(
#     documents=documents,
#     embedding=embeddings,
#     index_name=index_name
# )

## **FAISS Vector Database**

In [5]:
# FAISS vectorstore (in-memory); requires the faiss-cpu package
#%pip install -q faiss-cpu

# create vectorstore
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(documents, embeddings)

In [11]:
# Explore the FAISS vectorstore for educational purposes
print("🔍 EXPLORING THE FAISS VECTORSTORE")
print("=" * 50)

# Basic vectorstore information
print(f"📊 Vectorstore Type: {type(vectorstore)}")
print(f"📊 Number of vectors in index: {vectorstore.index.ntotal}")
print(f"📊 Vector dimension: {vectorstore.index.d}")

# Show some document texts and their metadata
print(f"\n📄 SAMPLE DOCUMENTS IN VECTORSTORE:")
print("-" * 40)
docstore_dict = vectorstore.docstore._dict
sample_docs = list(docstore_dict.items())[:5]  # Show first 5 documents

for i, (doc_id, doc) in enumerate(sample_docs):
    print(f"\n📄 Document {i+1} (ID: {doc_id}):")
    print(f"   Text preview: {doc.page_content[:150]}...")
    print(f"   Metadata: {doc.metadata}")
    print(f"   Full text length: {len(doc.page_content)} characters")

# Test similarity search to show how retrieval works
print(f"\n🔍 TESTING SIMILARITY SEARCH:")
print("-" * 40)
test_query = "World War"
similar_docs = vectorstore.similarity_search(test_query, k=3)

print(f"Query: '{test_query}'")
print(f"Found {len(similar_docs)} similar documents:")

for i, doc in enumerate(similar_docs):
    print(f"\n📄 Result {i+1}:")
    print(f"   Text: {doc.page_content[:200]}...")
    print(f"   Source: {doc.metadata.get('source', 'Unknown')}")
    print(f"   Row: {doc.metadata.get('row', 'N/A')}")

# Show how embeddings work conceptually
print(f"\n🧠 UNDERSTANDING EMBEDDINGS:")
print("-" * 40)
print("Each document is converted to a 1536-dimensional vector using Azure OpenAI")
print("FAISS stores these vectors and enables fast similarity search")
print("When you query, your question is also converted to a vector")
print("FAISS finds the most similar vectors (closest in high-dimensional space)")
print("\n💡 This is how RAG retrieves relevant context for your questions!")

🔍 EXPLORING THE FAISS VECTORSTORE
📊 Vectorstore Type: <class 'langchain_community.vectorstores.faiss.FAISS'>
📊 Number of vectors in index: 320
📊 Vector dimension: 1536

📄 SAMPLE DOCUMENTS IN VECTORSTORE:
----------------------------------------

📄 Document 1 (ID: e266eaba-c665-47bd-ba75-604b5c1fb820):
   Text preview: context: ['African immigration to the United States refers to immigrants to the United States who are or were nationals of modern African countries. T...
   Metadata: {'source': '../data/context.csv', 'row': 0}
   Full text length: 500 characters

📄 Document 2 (ID: ce540381-e6b6-436b-9630-861c11aa6bcb):
   Text preview: do not constitute a homogeneous group. They include peoples from different national, linguistic, ethnic, racial, cultural and social backgrounds. As s...
   Metadata: {'source': '../data/context.csv', 'row': 0}
   Full text length: 497 characters

📄 Document 3 (ID: c9115f6d-bf91-4784-b1ae-ab19f0cb3a9a):
   Text preview: Immigration legislation ==\n\n\n=== 

## **Retriever**

In [12]:
# create retriever
retriever = vectorstore.as_retriever()

## **RAG Chain**

In [13]:
# load llm
from langchain_openai import AzureChatOpenAI
llm = AzureChatOpenAI(
    model="gpt-4o-mini",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["OPENAI_API_VERSION"],
    temperature=0
)

In [14]:
# create document chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

template = """
You are a helpful assistant that answers questions based on the provided context.
Use the provided context to answer the question.
Question: {input}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

# Setup RAG pipeline
rag_chain = (
    {"context": retriever,  "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
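
In the chain above, the retriever's output (a list of document objects) is placed into the `{context}` slot as-is, so the prompt receives the list's string representation, metadata included. A minimal sketch of joining only the text of each retrieved chunk is shown below. The `Doc` dataclass is a hypothetical stand-in for a LangChain `Document`, and `format_docs` and `build_prompt` are illustrative helpers, not part of the notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of the retrieved chunks, dropping metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question, docs):
    """Fill the same prompt layout the notebook's template uses."""
    return (
        "You are a helpful assistant that answers questions based on the provided context.\n"
        "Use the provided context to answer the question.\n"
        f"Question: {question}\n"
        f"Context: {format_docs(docs)}\n"
        "Answer:"
    )


docs = [
    Doc("First retrieved chunk.", {"source": "../data/context.csv", "row": 0}),
    Doc("Second retrieved chunk.", {"source": "../data/context.csv", "row": 1}),
]
print(build_prompt("What is X?", docs))
```

Assuming LCEL's coercion of plain functions into runnables, a helper like this could be piped after the retriever, for example `{"context": retriever | format_docs, "input": RunnablePassthrough()}`.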

In [27]:
# Let's first examine the first document to create a targeted question
print("📄 FIRST DOCUMENT CONTENT:")
print("=" * 50)
first_doc = documents[0]  # Get the first document after splitting
print(f"Content: {first_doc.page_content}")
print(f"Metadata: {first_doc.metadata}")
print("=" * 50)

# Based on the first document content, ask a targeted question
# This question should retrieve the first document as the most relevant context
response = rag_chain.invoke("What is African immigration to the United States?")
print(f"\n🤖 RAG Response:")
print(response)

📄 FIRST DOCUMENT CONTENT:
Content: context: ['African immigration to the United States refers to immigrants to the United States who are or were nationals of modern African countries. The term African in the scope of this article refers to geographical or national origins rather than racial affiliation. Between the Immigration and Nationality Act of 1965 and 2017, Sub-Saharan African-born population in the United States grew to 2.1 million people.Sub-Saharan Africans in the United States come from almost all regions in Africa and do not constitute a homogeneous group. They include peoples from different national, linguistic, ethnic, racial, cultural and social backgrounds. As such, US and foreign born Sub-Saharan Africans are distinct from native-born African Americans, many of whose ancestors were involuntarily brought from West and Central Africa to the colonial United States by means of the historic Atlantic slave trade. African immigration is now driving the growth of the Black pop

## **Preparing Data for Evaluation**

In [28]:
# create dataset
question = ["What is African immigration to the United States?"]
response = []
contexts = []

# Inference
for query in question:
    response.append(rag_chain.invoke(query))
    # retriever.invoke replaces the deprecated get_relevant_documents
    contexts.append([doc.page_content for doc in retriever.invoke(query)])

# To dict
data = {
    "query": question,
    "response": response,
    "context": contexts,
}

In [29]:
# create dataset
from datasets import Dataset
dataset = Dataset.from_dict(data)

In [30]:
# create dataframe
import pandas as pd
df = pd.DataFrame(dataset)

In [31]:
df

| | query | response | context |
|---|---|---|---|
| 0 | What is African immigration to the United States? | African immigration to the United States refer... | [context: ['African immigration to the United ... |


In [20]:
# Convert to dictionary
df_dict = df.to_dict(orient='records')

# Convert context to list
for record in df_dict:
    if not isinstance(record.get('context'), list):
        if record.get('context') is None:
            record['context'] = []
        else:
            record['context'] = [record['context']]

## **Evaluation in Athina AI**

We will use the **Does Response Answer Query** eval here. It checks whether the response answers the user's query. Please refer to our [documentation](https://docs.athina.ai/api-reference/evals/preset-evals/overview) for further details.

In [None]:
# # set api keys for Athina evals
# from athina.keys import AthinaApiKey, OpenAiApiKey
# OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
# AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))

In [None]:
# # load dataset
# from athina.loaders import Loader
# dataset = Loader().load_dict(df_dict)

In [None]:
# # evaluate
# from athina.evals import DoesResponseAnswerQuery
# DoesResponseAnswerQuery(model="gpt-4o").run_batch(data=dataset).to_df()