# RAG System - End-to-End Retrieval-Augmented Generation

This notebook implements a complete RAG pipeline with the following phases:

## Phase A - Indexing (Offline)
1. **Document Upload** - Load documents (PDF, TXT, DOCX)
2. **Chunking** - Split documents into manageable chunks
3. **Chunk Embedding** - Generate vector embeddings for each chunk
4. **Vector Storage** - Store embeddings in ChromaDB

## Phase B - Inference (Online)
5. **Query Embedding** - Embed the user's query
6. **Similarity Search** - Find similar chunks using cosine similarity
7. **Top-K Selection** - Select the most relevant chunks
8. **Augmented Generation** - Generate response using Gemini LLM

## Setup and Installation

Run the cell below to install required dependencies (if not already installed):

In [None]:
# Uncomment and run to install dependencies
# !pip install -r requirements.txt

## Import Libraries and Initialize

In [None]:
# Standard library imports
import os
import sys
import warnings
warnings.filterwarnings('ignore')

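# Add the project root to the import path. Notebooks do not define __file__,
# so the current working directory is used (launch the notebook from the project root).
sys.path.insert(0, os.getcwd())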

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Core modules
from core import (
    DocumentLoader, Document,
    TextChunker, Chunk,
    EmbeddingGenerator, EmbeddedChunk,
    VectorStore,
    Retriever, RetrievalResult,
    ResponseGenerator, GenerationResult
)

# Configuration
from config import RAGConfig

# Widgets
from widgets import (
    UploadWidget,
    ChunkingWidget,
    EmbeddingWidget,
    QueryWidget,
    create_embedding_visualization,
    create_similarity_chart,
    create_chunk_statistics_dashboard
)

# Visualization
import plotly.io as pio
pio.renderers.default = 'notebook'

print("All imports successful!")
print(f"GEMINI_API_KEY configured: {'Yes' if os.getenv('GEMINI_API_KEY') else 'No - Please set in .env file'}")

## Configuration

Initialize the RAG configuration with default parameters. You can modify these values as needed.
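
As a rough illustration of how such a configuration object might behave, the sketch below uses a small dataclass that validates a few parameters when it is created. `SketchConfig` is a hypothetical stand-in, not the project's `RAGConfig` (whose internals are not shown in this notebook), and the validation rules are assumptions.

```python
from dataclasses import dataclass, asdict


@dataclass
class SketchConfig:
    """Hypothetical stand-in for RAGConfig; field names mirror the cell below."""
    chunk_size: int = 500
    chunk_overlap: int = 50
    top_k: int = 5
    similarity_threshold: float = 0.0
    temperature: float = 0.7

    def __post_init__(self):
        # Assumed rules: reject values that would break later pipeline steps.
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")

    def to_dict(self) -> dict:
        # Return the fields as a plain dictionary, e.g. for printing.
        return asdict(self)


cfg = SketchConfig(chunk_size=800, chunk_overlap=100)
for key, value in cfg.to_dict().items():
    print(f"  {key}: {value}")
```

A check such as `chunk_overlap < chunk_size` matters because an overlap equal to or larger than the chunk size would stop a sliding-window chunker from advancing.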

In [None]:
# Initialize configuration
config = RAGConfig(
    # Chunking
    chunk_size=500,
    chunk_overlap=50,
    chunking_strategy='sentence',
    
    # Embedding
    embedding_model='all-MiniLM-L6-v2',
    embedding_batch_size=32,
    embedding_device='cpu',
    
    # Vector Store
    collection_name='rag_demo',
    persist_directory='./chroma_db',
    
    # Retrieval
    top_k=5,
    similarity_threshold=0.0,
    
    # Generation
    llm_model='gemini-2.0-flash-lite',
    temperature=0.7,
    max_tokens=1024
)

print("Configuration initialized:")
for key, value in config.to_dict().items():
    if key != 'system_prompt':
        print(f"  {key}: {value}")

---
# Phase A: Indexing (Offline)
---

## Step 1: Document Upload

Upload documents with the widget, or load them from the sample data directory (`./data`). Supported formats: PDF, TXT, DOCX.

In [None]:
# Create upload widget
upload_widget = UploadWidget()
upload_widget.display()

In [None]:
# Alternative: Load sample documents programmatically
sample_docs = DocumentLoader.load_directory('./data')

for doc in sample_docs:
    upload_widget.add_document(doc)
    print(f"Loaded: {doc.source} ({len(doc):,} chars)")

print(f"\nTotal documents: {len(upload_widget.get_documents())}")

In [None]:
# Get loaded documents
documents = upload_widget.get_documents()

# Preview first document
if documents:
    print(f"Preview of '{documents[0].source}':\n")
    print(documents[0].preview(500))

## Step 2: Text Chunking

Split documents into smaller chunks for processing. Adjust parameters using the interactive controls.
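
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It splits on characters rather than sentences, so it is simpler than the `'sentence'` strategy configured above. `chunk_text` is an illustrative helper and not part of `TextChunker`.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts (chunk_size - overlap) characters after the previous
    one, so consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

The shared characters at chunk boundaries reduce the chance that a sentence relevant to a query is cut in half and lost to retrieval, at the cost of storing some text twice.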

In [None]:
# Create chunking widget
chunking_widget = ChunkingWidget()
chunking_widget.set_documents(documents)
chunking_widget.display()

In [None]:
# Alternative: Create chunks programmatically
chunker = TextChunker(
    chunk_size=config.chunk_size,
    overlap=config.chunk_overlap,
    strategy=config.chunking_strategy
)

chunks = chunker.chunk_documents(documents)
print(f"Created {len(chunks)} chunks")

# Get statistics
stats = chunker.get_statistics(chunks)
print(f"\nStatistics:")
for key, value in stats.items():
    if key != 'sources':
        print(f"  {key}: {value:.2f}" if isinstance(value, float) else f"  {key}: {value}")

In [None]:
# Visualize chunk statistics
fig = create_chunk_statistics_dashboard(chunks)
fig.show()

## Step 3: Chunk Embedding

Generate vector embeddings for each chunk using a pre-trained model.

In [None]:
# Initialize embedding generator
embedder = EmbeddingGenerator(
    model_name=config.embedding_model,
    device=config.embedding_device
)

print(f"Model: {embedder.model_name}")
print(f"Embedding dimension: {embedder.embedding_dim}")

In [None]:
# Generate embeddings for all chunks
embedded_chunks = embedder.embed_chunks(
    chunks,
    batch_size=config.embedding_batch_size,
    show_progress=True
)

print(f"\nGenerated {len(embedded_chunks)} embeddings")
if embedded_chunks:  # Guard against an empty chunk list
    print(f"Embedding shape: {embedded_chunks[0].embedding.shape}")

## Step 4: Vector Storage

Store the embeddings in a ChromaDB collection. With `reset=True`, existing data in the collection is discarded first.

In [None]:
# Initialize vector store
vector_store = VectorStore(
    collection_name=config.collection_name,
    persist_directory=config.persist_directory,
    reset=True  # Set to False to keep existing data
)

# Add embeddings
count = vector_store.add(embedded_chunks)

print(f"Added {count} vectors to collection '{config.collection_name}'")
print(f"\nVector Store Statistics:")
for key, value in vector_store.get_statistics().items():
    print(f"  {key}: {value}")

## Visualize Embedding Space

Project the high-dimensional embeddings into 2D with UMAP, t-SNE, or PCA to visualize them.

In [None]:
# Get all embeddings for visualization
all_embeddings, all_metadata = vector_store.get_all_embeddings()

print(f"Embeddings shape: {all_embeddings.shape}")
print(f"Metadata count: {len(all_metadata)}")

In [None]:
# Create embedding visualization (UMAP)
fig = create_embedding_visualization(
    all_embeddings,
    all_metadata,
    method='UMAP'  # Options: 'UMAP', 't-SNE', 'PCA'
)
fig.show()

---
# Phase B: Inference (Online)
---

## Steps 5-7: Query and Retrieval

- **Step 5**: Embed the query using the same model
- **Step 6**: Calculate cosine similarity with all chunk embeddings
- **Step 7**: Select top-k most similar chunks
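
The three steps above can be sketched in plain Python once the query and chunks are already embedded. This is an illustration only: the notebook's `Retriever` may implement the search differently (for example by querying the vector store), and keeping scores exactly equal to the threshold is an assumption made here. The helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen for this sketch: zero vectors match nothing.
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=5, threshold=0.0):
    """Return up to k (index, score) pairs, best score first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    # Drop chunks below the similarity threshold (assumed inclusive).
    kept = [(i, s) for i, s in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


query_vec = [1.0, 0.0]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
for index, score in top_k(query_vec, chunk_vecs, k=2):
    print(f"chunk {index}: score {score:.4f}")
```

Cosine similarity ranges from -1 to 1, so a threshold of `0.0` filters out only chunks that point away from the query.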

In [None]:
# Initialize retriever
retriever = Retriever(embedder, vector_store)

In [None]:
# Example query
query = "What is machine learning and what are its main types?"

# Retrieve relevant chunks
results = retriever.retrieve(
    query=query,
    k=config.top_k,
    threshold=config.similarity_threshold
)

print(f"Query: {query}")
print(f"\nRetrieved {len(results)} chunks:")
for r in results:
    print(f"\n[Rank {r.rank}] Score: {r.score:.4f} | Source: {r.source}")
    print(f"  {r.text[:150]}...")

In [None]:
# Visualize similarity scores
fig = create_similarity_chart(results)
fig.show()

In [None]:
# Visualize query in embedding space
query_embedding = retriever.last_query_embedding
retrieved_indices = retriever.get_retrieved_indices(results, all_metadata)

fig = create_embedding_visualization(
    all_embeddings,
    all_metadata,
    method='UMAP',
    query_embedding=query_embedding,
    retrieved_indices=retrieved_indices
)
fig.show()

## Step 8: Augmented Generation

Use the retrieved chunks as context for the Gemini LLM, which generates the final response.
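
The "augmentation" in this step amounts to placing the retrieved text in the prompt ahead of the question. The sketch below shows one plausible way to assemble such a prompt. `build_prompt`, its template, and the dictionary keys are hypothetical; the prompt that `ResponseGenerator` actually builds is not shown in this notebook.

```python
def build_prompt(query, chunks, system_prompt="Answer using only the context provided."):
    """Combine instructions, retrieved chunks, and the question into one prompt.

    `chunks` is a list of dicts with 'source' and 'text' keys (an assumed shape).
    """
    context_lines = []
    for n, chunk in enumerate(chunks, start=1):
        # Number each chunk so the model can refer back to its sources.
        context_lines.append(f"[{n}] (source: {chunk['source']}) {chunk['text']}")
    context = "\n".join(context_lines)
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"


example_chunks = [
    {"source": "ml_intro.txt", "text": "Machine learning learns patterns from data."},
    {"source": "ml_types.txt", "text": "Main types include supervised and unsupervised learning."},
]
print(build_prompt("What is machine learning?", example_chunks))
```

Keeping the source next to each chunk is what makes it possible to report which documents an answer drew on.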

In [None]:
# Initialize generator
generator = ResponseGenerator(
    model=config.llm_model,
    temperature=config.temperature,
    max_tokens=config.max_tokens
)

print(f"LLM Model: {generator.model_name}")
print(f"Temperature: {generator.temperature}")
print(f"Max Tokens: {generator.max_tokens}")

In [None]:
# Generate response
generation_result = generator.generate(
    query=query,
    context_chunks=results,
    system_prompt=config.system_prompt
)

print(f"Query: {query}")
print(f"\n{'='*60}")
print("GENERATED RESPONSE:")
print('='*60)
print(generation_result.response)
print(f"\n{'='*60}")
print(f"Sources: {', '.join(generation_result.sources)}")
print(f"Tokens used: {generation_result.total_tokens}")

---
# Interactive Query Interface
---

Use the interactive widget below to query the RAG system with your own questions.

In [None]:
# Create interactive query widget
query_widget = QueryWidget()
query_widget.set_components(embedder, vector_store)
query_widget.display()

---
# Complete Pipeline Example
---

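Wrap the inference phase (Steps 5-8) in a single function. It relies on the `retriever` and `generator` objects created in the cells above, so run the indexing phase (Steps 1-4) first.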

In [None]:
def run_rag_query(query: str, top_k: int = 5, temperature: float = 0.7):
    """
    Run a complete RAG query.
    
    Args:
        query: The question to ask
        top_k: Number of chunks to retrieve
        temperature: LLM temperature
    
    Returns:
        GenerationResult object
    """
    print(f"Query: {query}")
    print("\n" + "-"*50)
    
    # Step 5-7: Retrieve
    print("\n[Step 5-7] Retrieving relevant chunks...")
    results = retriever.retrieve(query=query, k=top_k)
    print(f"  Found {len(results)} relevant chunks")
    
    # Show retrieved chunks
    for r in results:
        print(f"    [{r.rank}] Score: {r.score:.3f} - {r.source}")
    
    # Step 8: Generate
    print("\n[Step 8] Generating response...")
    generator.update_parameters(temperature=temperature)
    result = generator.generate(query=query, context_chunks=results)
    
    print("\n" + "="*50)
    print("ANSWER:")
    print("="*50)
    print(result.response)
    print("\n" + "-"*50)
    print(f"Sources: {', '.join(result.sources)}")
    print(f"Tokens: {result.total_tokens}")
    
    return result

# Example queries
example_queries = [
    "What is deep learning and how does it work?",
    "How do I create a virtual environment in Python?",
    "What are the ethical considerations in AI?",
]

In [None]:
# Run an example query
result = run_rag_query(example_queries[0], top_k=5, temperature=0.7)

In [None]:
# Try another query
result = run_rag_query(example_queries[1], top_k=5, temperature=0.7)

---
# Custom Query
---

Enter your own query below:

In [None]:
# Enter your custom query here
my_query = "What is RAG and how does it improve LLM responses?"

result = run_rag_query(my_query, top_k=5, temperature=0.7)

---
## Summary

This notebook demonstrated a complete RAG pipeline:

1. **Document Upload** - Loaded documents from files
2. **Chunking** - Split documents into smaller pieces
3. **Embedding** - Generated vector representations
4. **Storage** - Saved to ChromaDB
5. **Query Embedding** - Converted query to vector
6. **Similarity Search** - Found related chunks
7. **Top-K Selection** - Selected best matches
8. **Generation** - Created response with Gemini

Each step can be customized using the interactive widgets or by modifying the configuration.