# RAG System - Exploratory Analysis

**Learning Objectives:**
- Understand the structure of our knowledge base
- Explore different chunking strategies
- Analyze document statistics
- Test embedding quality

**For Interviews:**  
This notebook demonstrates your ability to:
- Analyze data before building ML systems
- Make informed decisions about hyperparameters
- Understand trade-offs in RAG design

In [None]:
import sys
sys.path.append('../src')

import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Imports complete")

## 1. Load and Inspect Knowledge Base

First, let's understand what data we're working with.

In [None]:
# Load knowledge base
with open('../data/knowledge_base.json', 'r', encoding='utf-8') as f:
    documents = json.load(f)

print(f"📚 Loaded {len(documents)} documents\n")

# Show first document
print("Example Document:")
print(json.dumps(documents[0], indent=2))

In [None]:
# Convert to DataFrame for analysis
df = pd.DataFrame(documents)

# Add word counts
df['word_count'] = df['content'].apply(lambda x: len(x.split()))
df['char_count'] = df['content'].apply(len)

print("\n📊 Document Statistics:")
print(df[['word_count', 'char_count']].describe())

df.head()

In [None]:
# Visualize document lengths
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Word count distribution
axes[0].hist(df['word_count'], bins=20, edgecolor='black', alpha=0.7)
axes[0].set_title('Distribution of Document Lengths (Words)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Word Count')
axes[0].set_ylabel('Number of Documents')
axes[0].axvline(df['word_count'].mean(), color='red', linestyle='--', label=f'Mean: {df["word_count"].mean():.0f}')
axes[0].legend()

# Category distribution (if present)
if 'category' in df.columns:
    category_counts = df['category'].value_counts()
    axes[1].barh(category_counts.index, category_counts.values)
    axes[1].set_title('Documents by Category', fontsize=12, fontweight='bold')
    axes[1].set_xlabel('Number of Documents')
else:
    axes[1].text(0.5, 0.5, 'No category information', ha='center', va='center')

plt.tight_layout()
plt.show()

print(f"\n📈 Average document length: {df['word_count'].mean():.1f} words")
print(f"📈 Median document length: {df['word_count'].median():.1f} words")
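One way to connect these length statistics to a chunk-size decision is to check what share of documents would fit in a single chunk at each candidate size. The sketch below uses a hypothetical helper, `single_chunk_share`, which is not part of the project code and works on a plain list of word counts. The example counts are made up for illustration.

```python
def single_chunk_share(word_counts, chunk_sizes):
    """Return, for each chunk size, the fraction of documents that fit in one chunk."""
    if not word_counts:
        return {size: 0.0 for size in chunk_sizes}
    total = len(word_counts)
    return {
        size: sum(1 for n in word_counts if n <= size) / total
        for size in chunk_sizes
    }


# Illustrative word counts only; not taken from the knowledge base.
example_counts = [80, 150, 350, 900]
shares = single_chunk_share(example_counts, [100, 200, 400, 800])
for size, share in shares.items():
    print(f"chunk_size={size}: {share:.0%} of documents fit in one chunk")
```

In this notebook the call would be `single_chunk_share(df['word_count'].tolist(), [100, 200, 400, 800])`. If most documents already fit in one chunk at a given size, larger sizes are unlikely to change retrieval much.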

## 2. Chunking Strategy Analysis

**Interview Key Point:** Chunk size is one of the most important choices in a RAG system. Chunks that are too small lose surrounding context; chunks that are too large pull irrelevant text into the prompt.

Let's test different chunking strategies.
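The implementation of `chunk_text` and `process_documents` lives in `document_processor` and is not shown here. As a rough mental model, a word-based sliding-window chunker might look like the sketch below. `chunk_words` is a hypothetical stand-in, and the real functions may split text differently (for example, on sentences or tokens).

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into chunks of up to `chunk_size` words, sharing `overlap` words
    between consecutive chunks. Hypothetical sketch, not the project's chunk_text."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
```

The early `break` avoids emitting a final chunk that contains only words already covered by the previous one.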

In [None]:
from document_processor import chunk_text, process_documents

# Test different chunk sizes
chunk_sizes = [100, 200, 400, 800]
overlap = 50

chunk_analysis = []

for chunk_size in chunk_sizes:
    chunks = process_documents(documents, chunk_size=chunk_size, overlap=overlap)
    
    chunk_analysis.append({
        'chunk_size': chunk_size,
        'total_chunks': len(chunks),
        'avg_chunks_per_doc': len(chunks) / len(documents),
        'avg_chunk_words': sum(len(c['text'].split()) for c in chunks) / len(chunks)
    })

chunk_df = pd.DataFrame(chunk_analysis)
print("\n🔍 Chunking Strategy Comparison:")
print(chunk_df.to_string(index=False))

In [None]:
# Visualize chunking impact
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Total chunks created
axes[0].plot(chunk_df['chunk_size'], chunk_df['total_chunks'], marker='o', linewidth=2, markersize=8)
axes[0].set_title('Total Chunks vs Chunk Size', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Chunk Size (words)')
axes[0].set_ylabel('Total Number of Chunks')
axes[0].grid(True, alpha=0.3)

# Average chunks per document
axes[1].plot(chunk_df['chunk_size'], chunk_df['avg_chunks_per_doc'], marker='s', linewidth=2, markersize=8, color='green')
axes[1].set_title('Average Chunks per Document', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Chunk Size (words)')
axes[1].set_ylabel('Chunks per Document')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Chunking Insights

**For Interviews - Be ready to discuss:**
1. **Smaller chunks (100-200 words):**
   - ✅ More precise retrieval
   - ❌ May lose context
   - Good for: Factual Q&A

2. **Medium chunks (400 words):**
   - ✅ Balance of precision and context
   - ✅ A common starting point for many use cases
   - Good for: General RAG applications

3. **Large chunks (800+ words):**
   - ✅ More context preserved
   - ❌ May include irrelevant information
   - Good for: Complex topics requiring background

## 3. Overlap Impact Analysis

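**Interview Key Point:** Overlap reduces information loss at chunk boundaries: text near the end of one chunk is repeated at the start of the next, so a passage shorter than the overlap is never cut in half by the boundary.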

In [None]:
# Test different overlap values
chunk_size = 300  # Fixed chunk size
overlaps = [0, 25, 50, 100]

overlap_analysis = []

for overlap in overlaps:
    chunks = process_documents(documents, chunk_size=chunk_size, overlap=overlap)
    
    overlap_analysis.append({
        'overlap': overlap,
        'total_chunks': len(chunks),
        'overlap_percentage': (overlap / chunk_size) * 100
    })

overlap_df = pd.DataFrame(overlap_analysis)
print("\n🔄 Overlap Strategy Comparison (chunk_size=300):")
print(overlap_df.to_string(index=False))

In [None]:
# Visualize overlap impact
fig, ax = plt.subplots(figsize=(10, 6))

ax.bar(overlap_df['overlap'].astype(str), overlap_df['total_chunks'], edgecolor='black', alpha=0.7)
ax.set_title('Impact of Overlap on Chunk Count', fontsize=12, fontweight='bold')
ax.set_xlabel('Overlap (words)')
ax.set_ylabel('Total Number of Chunks')

# Add percentage labels
for i, row in overlap_df.iterrows():
    ax.text(i, row['total_chunks'] + 2, f"{row['overlap_percentage']:.1f}%", ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n💡 Insight: More overlap creates more chunks (and more storage/compute cost)")
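The growth in chunk count can be estimated ahead of time. Assuming a word-based sliding window in which each chunk starts `chunk_size - overlap` words after the previous one (an assumption about `process_documents`, whose implementation is not shown here), a document of `n_words` words yields roughly the count computed below. `expected_chunks` is a hypothetical helper, and the 3,000-word document is an illustrative figure.

```python
import math


def expected_chunks(n_words, chunk_size, overlap):
    """Estimate chunk count for a sliding window with stride chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_words == 0:
        return 0
    stride = chunk_size - overlap
    # Every chunk after the first contributes `stride` new words.
    return max(1, math.ceil((n_words - overlap) / stride))


# Illustrative 3,000-word document with the overlap values tested above.
for overlap in [0, 25, 50, 100]:
    print(f"overlap={overlap}: {expected_chunks(3000, 300, overlap)} chunks")
```

Because the stride shrinks as overlap grows, the count rises faster than linearly once overlap becomes a large fraction of the chunk size.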

## 4. Example: Analyze Chunk Boundaries

Let's look at actual chunks to understand overlap.

In [None]:
# Create chunks with overlap
chunks = process_documents(documents[:1], chunk_size=50, overlap=10)

print("\n📄 Document chunks with overlap=10:")
print(f"Total chunks created: {len(chunks)}\n")

# Show first 3 chunks
for i in range(min(3, len(chunks))):
    chunk = chunks[i]
    print(f"--- Chunk {i+1} ---")
    print(f"ID: {chunk['id']}")
    print(f"Text: {chunk['text'][:200]}...")
    print(f"Words: {len(chunk['text'].split())}")
    print()
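To see the overlap directly, rather than by reading the printed chunks, one option is to find the longest run of words that ends one chunk and begins the next. The sketch below assumes chunks are plain strings that overlap word for word. `boundary_overlap` is a hypothetical helper, not part of `document_processor`.

```python
def boundary_overlap(prev_text, next_text):
    """Return the longest word sequence that is both a suffix of prev_text
    and a prefix of next_text (empty list if the chunks share nothing)."""
    prev_words = prev_text.split()
    next_words = next_text.split()
    max_len = min(len(prev_words), len(next_words))
    # Try the longest candidate first so the first match is the maximum.
    for size in range(max_len, 0, -1):
        if prev_words[-size:] == next_words[:size]:
            return next_words[:size]
    return []


print(boundary_overlap("a b c d e", "d e f g"))
```

With the chunks created above, `boundary_overlap(chunks[0]['text'], chunks[1]['text'])` would be expected to return about 10 words if the chunker overlaps on word boundaries. A much shorter result would suggest it splits on some other unit.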

## 5. Recommended Configuration

Based on our analysis, here are recommended starting points:

In [None]:
print("\n⚙️ RECOMMENDED RAG CONFIGURATIONS\n")

configs = [
    {
        'use_case': 'Factual Q&A',
        'chunk_size': 200,
        'overlap': 30,
        'top_k': 3,
        'reasoning': 'Short chunks for precise fact retrieval'
    },
    {
        'use_case': 'General Purpose',
        'chunk_size': 400,
        'overlap': 50,
        'top_k': 3,
        'reasoning': 'Balanced context and precision'
    },
    {
        'use_case': 'Complex Topics',
        'chunk_size': 600,
        'overlap': 75,
        'top_k': 2,
        'reasoning': 'More context per chunk, fewer chunks needed'
    }
]

config_df = pd.DataFrame(configs)
print(config_df.to_string(index=False))

print("\n💡 Interview Tip: Always explain WHY you chose these parameters!")
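One way to explain these parameters is to work out how much retrieved text each configuration sends to the LLM. The sketch below multiplies `chunk_size` by `top_k` to get an upper bound in words, then converts to tokens. The 1.3 tokens-per-word ratio is an assumption (a rough rule of thumb for English text that varies by tokenizer), and `context_budget` is a hypothetical helper.

```python
TOKENS_PER_WORD = 1.3  # Assumption: rough ratio for English text; depends on the tokenizer.


def context_budget(chunk_size, top_k, tokens_per_word=TOKENS_PER_WORD):
    """Upper bound on retrieved context per query, in words and approximate tokens."""
    words = chunk_size * top_k
    return {"words": words, "approx_tokens": round(words * tokens_per_word)}


# (use_case, chunk_size, top_k) taken from the recommended configurations above.
recommended = [
    ("Factual Q&A", 200, 3),
    ("General Purpose", 400, 3),
    ("Complex Topics", 600, 2),
]
for use_case, chunk_size, top_k in recommended:
    budget = context_budget(chunk_size, top_k)
    print(f"{use_case}: up to {budget['words']} words (~{budget['approx_tokens']} tokens)")
```

Under this estimate, the General Purpose and Complex Topics configurations send the same amount of context. They differ in how it is split: three medium passages versus two longer ones.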

## 6. Text Statistics Analysis

Understanding vocabulary and text complexity helps inform chunk size decisions.

In [None]:
# Analyze vocabulary (whitespace tokens, so punctuation stays attached to words)
all_words = []
for doc in documents:
    all_words.extend(doc['content'].lower().split())

unique_words = set(all_words)

print("\n📚 Vocabulary Statistics:")
print(f"Total words: {len(all_words):,}")
print(f"Unique words: {len(unique_words):,}")
print(f"Vocabulary richness: {len(unique_words)/len(all_words):.2%}")

# Average sentence length (naive split on '.', so abbreviations and decimals also count as breaks)
sentence_lengths = []
for doc in documents:
    sentences = doc['content'].split('.')
    for sent in sentences:
        if sent.strip():
            sentence_lengths.append(len(sent.split()))

print(f"\nAverage sentence length: {sum(sentence_lengths)/len(sentence_lengths):.1f} words")
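Splitting on `'.'` and on whitespace is quick but coarse: decimals such as `2.5` are cut in two, and `"RAG,"` counts as a different word from `"rag"`. A slightly more careful variant using only the standard library is sketched below. Both helper names are hypothetical, and the sentence splitter still breaks on abbreviations such as `e.g.` that are followed by a space.

```python
import re


def sentence_lengths(text):
    """Word counts per sentence, splitting on whitespace that follows '.', '!' or '?'."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [len(s.split()) for s in sentences if s]


def normalized_tokens(text):
    """Lowercase word tokens with surrounding punctuation removed."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


sample = "Version 2.5 is faster. Is it stable? Yes!"
print(sentence_lengths(sample))
print(normalized_tokens("RAG, rag. Don't!"))
```

Swapping these into the loops above would usually lower the unique-word count and give more realistic sentence lengths, at the cost of a slightly slower pass over the text.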

## 7. Key Takeaways for Interviews

**When discussing RAG in interviews, mention:**

1. **Data Analysis First:**
   - Always analyze document statistics before choosing chunk size
   - Consider average document length, vocabulary, domain

2. **Chunking Trade-offs:**
   - Chunk size affects retrieval precision vs context
   - Overlap prevents information loss but increases cost
   - No one-size-fits-all solution

3. **Iterative Approach:**
   - Start with recommended defaults (chunk_size=400, overlap=50)
   - Evaluate retrieval quality
   - Adjust based on results

4. **Cost Considerations:**
   - More chunks = more storage + more embedding cost
   - Larger top_k = more tokens sent to LLM = higher cost
   - Balance quality vs cost
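The "evaluate retrieval quality" step in the iterative approach needs a concrete metric. One simple option is hit rate at k: the share of test queries for which at least one relevant chunk appears in the top k results. The sketch below is a hypothetical helper with made-up query and chunk IDs. It assumes you have a small labeled set mapping each query to its relevant chunk IDs.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one relevant chunk in the top-k results.

    retrieved: dict mapping query ID to a ranked list of chunk IDs
    relevant:  dict mapping query ID to a set of relevant chunk IDs
    """
    if not retrieved:
        return 0.0
    hits = 0
    for query_id, ranked_ids in retrieved.items():
        if set(ranked_ids[:k]) & relevant.get(query_id, set()):
            hits += 1
    return hits / len(retrieved)


# Made-up IDs for illustration only.
retrieved = {"q1": ["c3", "c1", "c7"], "q2": ["c2", "c9", "c4"]}
relevant = {"q1": {"c1"}, "q2": {"c5"}}
print(hit_rate_at_k(retrieved, relevant, k=3))
```

Running the same labeled queries against each chunking configuration gives a like-for-like comparison before adjusting `chunk_size`, `overlap`, or `top_k`.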

In [None]:
print("\n✅ Exploratory Analysis Complete!")
print("\nNext steps:")
print("1. Implement RAG pipeline (see 02_rag_implementation.ipynb)")
print("2. Evaluate retrieval quality (see 03_evaluation_optimization.ipynb)")