# RAG Pipeline Optimization – Jupyter Walkthrough
This notebook demonstrates how to:
- Remove trimming
- Chunk documents smartly (~250 tokens)
- Rehydrate surrounding chunks
- Use formatting/metadata for context-aware chunking
- (Optional) Use an LLM for chunk splitting
- Simplify citation output
- Apply load balancing to OpenAI API calls

In [None]:
# Install necessary libraries (uncomment below if not already installed)
# !pip install python-docx tiktoken openai
import openai
import tiktoken
from docx import Document

## 1. Tokenizer Helper
Helper that encodes text into a list of token IDs with `tiktoken`. The length of that list is the token count used for chunk sizing in the sections below.

In [None]:
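# Build the encoder once. "gpt-4" maps to cl100k_base; gpt-4o (section 5)
# uses o200k_base, so counts are approximate for that model.
_ENCODER = tiktoken.encoding_for_model("gpt-4")

def tokenize(text):
    return _ENCODER.encode(text)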
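
A token count is simply the length of the encoded list. The sketch below wraps that in a small budget check. `count_tokens` and `fits_budget` are hypothetical helper names, and the encoder is passed in as a parameter so the example runs without `tiktoken`; the whitespace split used as the default is only a stand-in, not a real tokenizer. In this notebook you would pass `tokenize` instead.

```python
def count_tokens(text, encode=str.split):
    """Return the number of tokens `encode` produces for `text`.

    `encode` is any callable mapping a string to a list of tokens.
    The default (whitespace split) is a rough stand-in for illustration;
    pass the notebook's `tokenize` for real OpenAI token counts.
    """
    return len(encode(text))


def fits_budget(text, max_tokens, encode=str.split):
    """Return True if `text` encodes to at most `max_tokens` tokens."""
    return count_tokens(text, encode) <= max_tokens


print(count_tokens("retrieval augmented generation"))  # 3 with the whitespace stand-in
```

For example, `fits_budget(chunk, 250, encode=tokenize)` would check a chunk against the upper bound used in the next section.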

## 2. Paragraph-Aware Chunking
This function reads a `.docx` file and chunks its content by paragraph, aiming for chunks of 100–250 tokens.
Short paragraphs are merged until the buffer reaches `min_tokens`; a buffer that grows past `max_tokens` is split at sentence boundaries.

In [None]:
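import re

def chunk_paragraphs(doc_path, min_tokens=100, max_tokens=250):
    """Chunk a .docx file by paragraph into pieces of about min_tokens to max_tokens."""
    doc = Document(doc_path)
    chunks, buffer = [], ""

    for para in doc.paragraphs:
        text = para.text.strip()
        if not text:
            continue
        # Merge short paragraphs until the buffer reaches min_tokens.
        buffer = f"{buffer} {text}" if buffer else text
        n_tokens = len(tokenize(buffer))
        if n_tokens < min_tokens:
            continue
        if n_tokens <= max_tokens:
            chunks.append(buffer)
            buffer = ""
            continue
        # Buffer is too long: split after sentence-ending punctuation and emit
        # a chunk *before* adding the sentence that would exceed max_tokens.
        # A single sentence longer than max_tokens still yields an oversize chunk.
        current = []
        for s in re.split(r"(?<=[.!?])\s+", buffer):
            if current and len(tokenize(" ".join(current + [s]))) > max_tokens:
                chunks.append(" ".join(current))
                current = []
            current.append(s)
        # Carry the remainder forward so it can merge with the next paragraph.
        buffer = " ".join(current)
    if buffer:
        chunks.append(buffer)
    return chunks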
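
The overview also lists formatting and metadata as a chunking signal. One way this could look is grouping paragraphs under their nearest heading before token-based chunking. The sketch below is an assumption about how that step might work, not part of the notebook's pipeline: `group_by_heading` is a hypothetical helper, and it assumes heading paragraphs carry a style name starting with `Heading` (the Word default). It takes plain `(style_name, text)` pairs so it runs without a `.docx` file.

```python
def group_by_heading(paragraphs):
    """Group (style_name, text) pairs into sections keyed by the latest heading.

    Assumes heading paragraphs use style names starting with "Heading",
    which is the Word default; custom templates may differ.
    """
    sections = []
    heading, body = None, []
    for style_name, text in paragraphs:
        text = text.strip()
        if not text:
            continue
        if style_name.startswith("Heading"):
            # A new heading closes the previous section, if it has any body text.
            if body:
                sections.append({"heading": heading, "text": " ".join(body)})
            heading, body = text, []
        else:
            body.append(text)
    if body:
        sections.append({"heading": heading, "text": " ".join(body)})
    return sections


# With python-docx, the input could be built as:
#   paragraphs = [(p.style.name, p.text) for p in Document(doc_path).paragraphs]
sample = [
    ("Heading 1", "Setup"),
    ("Normal", "Install the packages."),
    ("Normal", "Set the API key."),
    ("Heading 1", "Usage"),
    ("Normal", "Run the notebook."),
]
for section in group_by_heading(sample):
    print(section["heading"], "->", section["text"])
```

Each section's text could then be chunked by token count, with the heading stored alongside every chunk as metadata.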

## 3. Chunk Rehydration
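For each retrieved chunk index, this function adds the chunk before it and the chunk after it, so the model sees the surrounding context. Each index is expanded independently, so neighbouring hits produce overlapping text.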

In [None]:
def rehydrate(retrieved_ids, all_chunks):
    rehydrated = []
    for i in retrieved_ids:
        parts = [all_chunks[i]]
        if i > 0:
            parts.insert(0, all_chunks[i - 1])
        if i + 1 < len(all_chunks):
            parts.append(all_chunks[i + 1])
        rehydrated.append(" ".join(parts))
    return rehydrated
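
When two retrieved indices are close together, expanding each one separately repeats the shared neighbours in the prompt. A possible variant, sketched below, merges overlapping or adjacent windows first. `rehydrate_merged` and its `window` parameter are hypothetical names introduced for illustration, and the sketch assumes every ID is a valid index into `all_chunks`.

```python
def rehydrate_merged(retrieved_ids, all_chunks, window=1):
    """Expand each hit by `window` chunks per side and merge overlapping ranges."""
    if not all_chunks:
        return []
    # Build an inclusive (start, end) range around every hit, clamped to the list.
    ranges = sorted(
        (max(0, i - window), min(len(all_chunks) - 1, i + window))
        for i in set(retrieved_ids)
    )
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [" ".join(all_chunks[s:e + 1]) for s, e in merged]


chunks = ["c0", "c1", "c2", "c3", "c4", "c5"]
print(rehydrate_merged([1, 2], chunks))  # ['c0 c1 c2 c3']
print(rehydrate_merged([0, 4], chunks))  # ['c0 c1', 'c3 c4 c5']
```

The merged passages come back in document order rather than retrieval-rank order, which is a trade-off to weigh if rank matters for prompt construction.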

## 4. Simplified Citation Format
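Have the model cite sources with a tag of the form `<AtlasSource id=12>`, and use the same tag in your prompts, so that one regular expression can parse every citation. `extract_citations` returns the cited IDs as strings, in order of appearance.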

In [None]:
import re

def extract_citations(text):
    return re.findall(r"<AtlasSource id=(\d+)>", text)
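
Downstream code usually wants each cited source once, as an integer. The sketch below shows one way to post-process the matches; `cited_source_ids` and `CITATION_RE` are hypothetical names, and the sample answer string is made up for illustration.

```python
import re

CITATION_RE = re.compile(r"<AtlasSource id=(\d+)>")


def cited_source_ids(text):
    """Return cited source IDs as integers, de-duplicated in order of first appearance."""
    seen = {}
    for match in CITATION_RE.findall(text):
        # dict keys keep insertion order, so this de-duplicates without reordering.
        seen.setdefault(int(match), None)
    return list(seen)


answer = "Chunks overlap <AtlasSource id=3>. Keys rotate <AtlasSource id=7><AtlasSource id=3>."
print(cited_source_ids(answer))  # [3, 7]
```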

## 5. Load-Balanced OpenAI API Call
Choose an API key at random for each call to spread requests across several quotas. This raises throughput only if the keys draw on separate rate limits.

In [None]:
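import random
from openai import OpenAI  # requires openai>=1.0

api_keys = ["API_KEY_1", "API_KEY_2", "API_KEY_3"]  # placeholders

def query_openai(prompt):
    # openai.ChatCompletion was removed in openai 1.0; use a client object.
    # A per-call client also avoids mutating the global openai.api_key.
    client = OpenAI(api_key=random.choice(api_keys))
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )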
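
Random selection can pick the same key several times in a row. If an even spread matters, a round-robin rotation is one alternative. The sketch below is a swapped-in technique, not the notebook's method: `KeyRotator` is a hypothetical class, and the key strings are the same placeholders used above.

```python
import itertools
import threading


class KeyRotator:
    """Hand out API keys in round-robin order; safe to share across threads."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(list(keys))
        self._lock = threading.Lock()

    def next_key(self):
        # The lock keeps two threads from advancing the iterator at once.
        with self._lock:
            return next(self._cycle)


rotator = KeyRotator(["API_KEY_1", "API_KEY_2", "API_KEY_3"])
print([rotator.next_key() for _ in range(4)])
# ['API_KEY_1', 'API_KEY_2', 'API_KEY_3', 'API_KEY_1']
```

Inside `query_openai`, `rotator.next_key()` could take the place of `random.choice(api_keys)`.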