# 🚀 Creative RAG Pipeline Optimization

This notebook explains a **core performance bottleneck** in RAG pipelines, the trimming step, and walks through a **creative, optimized, and cost-effective solution**.

We will:
- Identify the **issue** with the trimming step
- Propose a **creative fix** using smart chunking and rehydration
- Provide **code examples** with efficiency in mind
- Include **tips to reduce cost and increase throughput**
- Integrate optional **LLM-aware enhancements**

## ❗ The Problem: Trimming Bottleneck

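The original pipeline used a **small model (e.g., o3-mini)** to trim retrieved documents before passing them to GPT-4. This causes three problems:
- Trimming cuts passages away from their surrounding context, which hurts relevance.
- Small models are unreliable at "needle-in-a-haystack" tasks, so the trimmer can drop the passage that actually answers the query.
- The extra model hop becomes a **major infrastructure bottleneck** and adds latency.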

### 🚫 Key Issues:
- Poor chunk boundaries = content chopped mid-thought.
- Redundant compute = all retrieved chunks evaluated by a weak model.
- Increased cost with limited gain.


## ✅ Solution: Trimming-Free, Chunk-Aware Design

**Instead of trimming**, we apply:
- Smarter ~250 token chunking (format-aware)
- Context rehydration (add neighbors at retrieval)
- Optional use of an LLM to guide hard splits
- Efficient batching and lightweight parsing

In [None]:
# Required libraries
# !pip install tiktoken python-docx openai
import random

import openai
import tiktoken
from docx import Document

## 🔧 Step 1: Smart Chunking Function
- Keeps chunks around 250 tokens
- Merges consecutive short paragraphs into one chunk; a single paragraph longer than the limit is kept whole rather than split


In [None]:
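def tokenize(text, model="gpt-4"):
    # Encode with the tokenizer that tiktoken associates with `model`.
    return tiktoken.encoding_for_model(model).encode(text)

def smart_chunk_docx(path, min_tokens=100, max_tokens=250):
    doc = Document(path)
    chunks, buffer, buffer_len = [], [], 0

    for para in doc.paragraphs:
        text = para.text.strip()
        if not text:
            continue
        # Count each paragraph once instead of re-encoding the whole buffer.
        # The per-paragraph sum only approximates the joined chunk's token count.
        n = len(tokenize(text))
        # Close the current chunk first if this paragraph would push it past
        # max_tokens, but only once the chunk already holds min_tokens.
        if buffer and buffer_len >= min_tokens and buffer_len + n > max_tokens:
            chunks.append(" ".join(buffer))
            buffer, buffer_len = [], 0
        buffer.append(text)
        buffer_len += n
        if buffer_len >= max_tokens:
            chunks.append(" ".join(buffer))
            buffer, buffer_len = [], 0
    if buffer:
        chunks.append(" ".join(buffer))
    return chunks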
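
`smart_chunk_docx` never breaks up a single paragraph, so one long paragraph can still exceed the 250-token target. A minimal sketch of a sentence-aware splitter follows. The name `split_oversized` and the `count_tokens` parameter are hypothetical and not part of the original pipeline; the default counter is a whitespace word count so the example runs without `tiktoken`, and a real token counter such as `lambda s: len(tokenize(s))` can be passed instead.

```python
import re

def split_oversized(text, max_tokens=250, count_tokens=None):
    """Split text into pieces of at most max_tokens, breaking only between sentences.

    A single sentence longer than max_tokens is emitted on its own, uncut.
    """
    if count_tokens is None:
        # Assumption: whitespace-separated words stand in for model tokens.
        count_tokens = lambda s: len(s.split())

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    pieces, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new piece when this sentence would overflow the current one.
        if current and current_len + n > max_tokens:
            pieces.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        pieces.append(" ".join(current))
    return pieces
```

Any chunk that comes back from the chunker above the limit could be passed through this function before indexing.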

## 🔁 Step 2: Rehydration Logic
Adds each retrieved chunk's immediate neighbors to restore context. Unlike trimming, this keeps the original wording and order of the passage intact. Note that hits at adjacent indices produce overlapping groups.


In [None]:
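def rehydrate_chunks(indices, all_chunks):
    """Return one text group per retrieved index: previous chunk, hit, next chunk."""
    final = []
    for idx in indices:
        chunk_group = []
        # Previous neighbor, if the hit is not the first chunk.
        if idx > 0:
            chunk_group.append(all_chunks[idx - 1])
        chunk_group.append(all_chunks[idx])
        # Next neighbor, if the hit is not the last chunk.
        if idx + 1 < len(all_chunks):
            chunk_group.append(all_chunks[idx + 1])
        final.append("\n".join(chunk_group))
    return final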
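
`rehydrate_chunks` builds one group per hit, so two hits at adjacent indices repeat the same chunks in the prompt. One possible variant, sketched here under assumptions, merges overlapping or touching windows before joining. The function name `rehydrate_merged` and the `window` parameter are hypothetical, and the sketch assumes every index is a valid position in `all_chunks`.

```python
def rehydrate_merged(indices, all_chunks, window=1):
    """Expand each hit by `window` neighbors on both sides and merge overlaps.

    Groups are returned in document order, not in retrieval-rank order.
    """
    if not all_chunks:
        return []

    last = len(all_chunks) - 1
    # Inclusive (start, end) span for each distinct hit, clamped to the list.
    spans = sorted(
        (max(i - window, 0), min(i + window, last)) for i in set(indices)
    )

    merged = []
    for start, end in spans:
        # Extend the previous span when this one overlaps or touches it.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    return ["\n".join(all_chunks[s:e + 1]) for s, e in merged]
```

The trade-off is that retrieval ranking is lost in the output order; if rank matters downstream, the merged groups would need to be re-sorted by the best-ranked hit they contain.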

## 🏷️ Step 3: Simple Citation Marking
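Have the model mark citations with `<AtlasSource id=X>` tags instead of JSON citation objects; the tags can be pulled out of the answer with a single regular expression.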

In [None]:
import re

def extract_simple_citations(text):
    return re.findall(r"<AtlasSource id=(\d+)>", text)
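
The extracted ids are strings and are not checked against the chunks that were actually sent. A rough sketch of the round trip follows: label each chunk with a number in the prompt, then map cited ids back to chunks. Both helper names and the `[source N]` prompt label are assumptions made for illustration; only the `<AtlasSource id=N>` answer format comes from this notebook.

```python
import re

def build_context(chunks):
    """Number each chunk (starting at 1) so the model can cite it by id.

    The "[source N]" label is a hypothetical prompt format.
    """
    return "\n\n".join(
        f"[source {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )

def resolve_citations(answer, chunks):
    """Map <AtlasSource id=N> tags in the answer to the chunks they cite."""
    cited = {}
    for raw_id in re.findall(r"<AtlasSource id=(\d+)>", answer):
        idx = int(raw_id)
        # Skip ids that do not correspond to any supplied chunk.
        if 1 <= idx <= len(chunks):
            cited.setdefault(idx, chunks[idx - 1])
    return cited
```

Dropping out-of-range ids gives a cheap guard against citations the model invented.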

## ⚖️ Step 4: Load-Balanced OpenAI Calls
Distribute requests across multiple API keys to raise aggregate throughput. This only helps if the keys draw on separate rate limits.

In [None]:
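from openai import OpenAI

api_keys = ["API_KEY_1", "API_KEY_2"]  # replace with real keys

# One client per key. openai.ChatCompletion was removed in openai>=1.0, and
# reassigning the global openai.api_key per call is not safe under concurrency.
clients = [OpenAI(api_key=key) for key in api_keys]

def load_balanced_query(prompt):
    client = random.choice(clients)
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )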
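
`random.choice` spreads load evenly only on average, so short bursts can land on one key. A round-robin rotation is one alternative; the sketch below is an assumption rather than part of the original pipeline, and the `KeyRotator` class is a hypothetical helper. It makes no network calls and can rotate over key strings or over client objects.

```python
import itertools
import threading

class KeyRotator:
    """Hand out items (API keys or clients) in strict round-robin order."""

    def __init__(self, items):
        items = list(items)
        if not items:
            raise ValueError("at least one item is required")
        self._cycle = itertools.cycle(items)
        # The lock keeps the rotation consistent when called from several threads.
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            return next(self._cycle)

rotator = KeyRotator(["API_KEY_1", "API_KEY_2"])
first_three = [rotator.next() for _ in range(3)]
```

In the query function, `rotator.next()` would take the place of `random.choice(...)`.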

## 💡 Optimization Summary
| Dimension       | Optimization                                 |
|----------------|-----------------------------------------------|
| **Cost**       | Remove trimming, fewer model hops             |
| **Throughput** | No o3-mini bottleneck, load balance GPT calls |
| **Efficiency** | Pre-chunking + rehydration avoids redundancy  |
| **Quality**    | Better context preservation and citation      |
| **Creativity** | LLM-aware chunking (future) + markup formats  |
