# RAG System

**Phase A:** Load Documents → Chunk → Embed → Store in Vector DB

**Phase B:** Query → Retrieve Similar Chunks → Generate Answer
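The two phases can be illustrated without any of the notebook's dependencies. The following is a minimal sketch, assuming a toy bag-of-words embedding and an in-memory list in place of the real embedding model and vector database. The names `chunk_text`, `embed`, `cosine`, and `retrieve` are hypothetical and are not part of `core`. The generation step is omitted; in the notebook the retrieved chunks are passed to the LLM.

```python
import math
import re
from collections import Counter

# Assumption: toy stand-ins for the loader, chunker, embedder and vector store.
# None of these functions come from the notebook's own modules.

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    step = chunk_size - overlap  # requires overlap < chunk_size
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Phase A: load -> chunk -> embed -> store
docs = {"a.txt": "The bridge opened in 2000.", "b.txt": "Cats sleep most of the day."}
index = [(src, chunk, embed(chunk))
         for src, text in docs.items()
         for chunk in chunk_text(text)]

# Phase B: query -> retrieve similar chunks (generation would follow)
def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[2]), reverse=True)
    return [(src, chunk) for src, chunk, _ in ranked[:k]]

top = retrieve("When did the bridge open?")
print(top)
```
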

In [188]:
# CELL 1: Install dependencies (run once)
# !pip install -r requirements.txt

In [189]:
# CELL 2: Imports
import os
import warnings
warnings.filterwarnings('ignore')

from dotenv import load_dotenv
load_dotenv()

from core import DocumentLoader, TextChunker, EmbeddingGenerator, VectorStore, Retriever, ResponseGenerator
from config import RAGConfig
from widgets import create_embedding_visualization, create_similarity_chart, create_chunk_statistics_dashboard

print("Imports OK!")
print(f"GEMINI_API_KEY: {'SET' if os.getenv('GEMINI_API_KEY') else 'NOT SET'}")

Imports OK!
GEMINI_API_KEY: SET


In [190]:
# CELL 3: Configuration - EDIT THESE VALUES
config = RAGConfig(
    chunk_size=500,
    chunk_overlap=50,
    chunking_strategy='sentence',
    embedding_model='all-MiniLM-L6-v2',
    top_k=5,
    llm_model='gemini-2.5-flash-lite',
    temperature=0.7,
    max_tokens=1024
)
print("Config set!")

Config set!


---
## Phase A: Indexing
---

In [191]:
# CELL 4: Load Documents - EDIT PATH HERE
documents = DocumentLoader.load_directory('./data/_testdata')

for doc in documents:
    print(f"{doc.source}: {len(doc):,} chars")
print(f"\nTotal: {len(documents)} docs")

d1.1.txt: 52 chars
d1.txt: 62 chars
d2.txt: 53 chars
d3.txt: 100 chars
d4.txt: 57 chars
d5.txt: 37 chars
d6.txt: 31 chars

Total: 7 docs


In [192]:
# CELL 5: Chunk documents
chunker = TextChunker(chunk_size=config.chunk_size, overlap=config.chunk_overlap, strategy=config.chunking_strategy)
chunks = chunker.chunk_documents(documents)
print(f"Created {len(chunks)} chunks (avg {chunker.get_statistics(chunks)['avg_length']:.0f} chars)")

Created 7 chunks (avg 56 chars)


In [193]:
# CELL 6: Visualize chunks (optional - comment out to skip)
# create_chunk_statistics_dashboard(chunks).show()
print(f"Chunk statistics: {len(chunks)} chunks created")

Chunk statistics: 7 chunks created


In [194]:
# CELL 7: Generate embeddings
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

embedder = EmbeddingGenerator(model_name=config.embedding_model, device=device)
embedded_chunks = embedder.embed_chunks(chunks, batch_size=32, show_progress=True)
print(f"\nGenerated {len(embedded_chunks)} embeddings")

Using device: cuda


Loading embedding model: all-MiniLM-L6-v2...
Model loaded (dim=384)

Generated 7 embeddings


In [195]:
# CELL 8: Store in vector database
vector_store = VectorStore(collection_name='rag_collection', persist_directory='./chroma_db', reset=True)
count = vector_store.add(embedded_chunks)
print(f"Stored {count} vectors")

Deleted existing collection: rag_collection
Stored 7 vectors


In [196]:
# CELL 9: Visualize embedding space (deprecated)
#all_embeddings, all_metadata = vector_store.get_all_embeddings()

# Automatic subsampling to 2000 points + faster UMAP params
# Should take ~5-10s instead of 40s+
#create_embedding_visualization(all_embeddings, all_metadata, method='UMAP').show()

#print(f"Visualized {len(all_embeddings)} embeddings (subsampled to 2000 for speed)")

---
## Phase B: Query
---

In [197]:
# CELL 10: Initialize retriever and generator
retriever = Retriever(embedder, vector_store)
generator = ResponseGenerator(model=config.llm_model, temperature=config.temperature, max_tokens=config.max_tokens)
print(f"Ready! Using {config.llm_model}")

Ready! Using gemini-2.5-flash-lite


In [198]:
# CELL 11: ASK A QUESTION - EDIT YOUR QUERY HERE
#query = "Patient presents with chronic fatigue, joint pain, and occasional low-grade fever. What conditions should be considered?"
query = "How old is Anton?"

# Retrieve
results = retriever.retrieve(query=query, k=config.top_k)
print(f"Query: {query}\n")
print("Retrieved chunks:")
for r in results:
    print(f"\n{'='*80}")
    print(f"[{r.rank}] Score: {r.score:.3f} | Source: {r.source}")
    print(f"{'='*80}")
    print(r.text)

# Generate
response = generator.generate(query=query, context_chunks=results)
print(f"\n{'='*50}\nANSWER:\n{'='*50}")
print(response.response)
print(f"\nSources: {', '.join(response.sources)}")

Query: How old is Anton?

Retrieved chunks:

[1] Score: 0.607 | Source: d1.txt
Anton is a really cool and clever person who lives in Croydon.

[2] Score: 0.573 | Source: d1.1.txt
Anton is a really cool and clever software engineer.

[3] Score: 0.287 | Source: d2.txt
The really cool and clever person from Croydon is 22.

[4] Score: 0.252 | Source: d5.txt
The Millenium Bridge is 25 years old.

[5] Score: 0.134 | Source: d3.txt
The software engineer is rather hard-working, but he ends up with very little free time as a result.

ANSWER:
22

Sources: d2.txt, d1.txt, d5.txt, d1.1.txt, d3.txt


In [199]:
# CELL 12: Visualize query in embedding space
all_embeddings, all_metadata = vector_store.get_all_embeddings()
query_embedding = retriever.last_query_embedding
retrieved_indices = retriever.get_retrieved_indices(results, all_metadata)
scores = [r.score for r in results]

fig = create_embedding_visualization(
    all_embeddings, all_metadata, method='UMAP',
    query_embedding=query_embedding,
    retrieved_indices=retrieved_indices,
    retrieval_scores=scores
)
fig.show()

In [200]:
# CELL 13: Quick ask function - use for more questions
def ask(question, top_k=5):
    results = retriever.retrieve(query=question, k=top_k)
    response = generator.generate(query=question, context_chunks=results)
    print(f"Q: {question}\n\nA: {response.response}\n\nSources: {', '.join(response.sources)}")
    return response

In [201]:
# CELL 14: Ask more questions
# ask("What is deep learning?")

---
## Method Comparison: Top-K vs MMR vs QUBO-RAG
**All three methods answer the same query but use different retrieval strategies.**

---

In [None]:
# Setup: Import and define test query
from core.retrieval_strategies import create_retrieval_strategy
import numpy as np

# Medical diagnosis query with overlapping symptoms  
#test_query = "Patient presents with chronic fatigue, joint pain, and occasional low-grade fever. What conditions should be considered in the differential diagnosis?"
test_query = "How old is Anton?"
k = 3

print(f"Test Query: '{test_query}'")
print(f"Retrieving top {k} chunks with each method...\n")

Test Query: 'How old is Anton?'
Retrieving top 3 chunks with each method...



### Method 1: Top-K (Naive)
Simple relevance-based retrieval: picks the chunks with the highest similarity scores to the query.
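As a rough illustration of what Top-K selection does, here is a minimal NumPy sketch. It assumes the score is cosine similarity between the query and chunk embeddings; how the notebook's `VectorStore` actually computes scores is not shown, and the function `top_k` is hypothetical.

```python
import numpy as np

def top_k(query_vec, chunk_vecs, k):
    """Return indices and cosine scores of the k chunks most similar to the query."""
    # Assumption: relevance is cosine similarity on L2-normalized vectors.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]  # highest score first
    return order, scores[order]

# Three toy 2-D "embeddings"; the first two point almost the same way.
chunk_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query_vec = np.array([1.0, 0.0])

idx, scores = top_k(query_vec, chunk_vecs, k=2)
print(idx, scores)
```

Because the ranking looks only at query similarity, the two near-duplicate vectors are both returned, which is the redundancy the other two methods try to avoid.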

In [212]:
# METHOD 1: TOP-K (Naive)
print("="*70)
print("METHOD 1: TOP-K (NAIVE)")
print("="*70)

# Retrieve with Naive strategy
query_emb = embedder.embed_query(test_query)
candidates = vector_store.search(query_emb, k=k*3)
strategy = create_retrieval_strategy('naive')
results_naive, meta = strategy.retrieve(query_emb, candidates, k=k)

# Show retrieved chunks
print(f"\nRetrieved {len(results_naive)} chunks:")
for r in results_naive:
    print(f"  [{r.rank}] Score: {r.score:.3f} | {r.text[:80]}...")

# Generate answer
response_naive = generator.generate(query=test_query, context_chunks=results_naive)

print("\n" + "-"*70)
print("TOP-K ANSWER:")
print("-"*70)
print(response_naive.response)
print(f"\nSources: {', '.join(response_naive.sources)}")
print("="*70)

METHOD 1: TOP-K (NAIVE)

Retrieved 3 chunks:
  [1] Score: 0.607 | Anton is a really cool and clever person who lives in Croydon....
  [2] Score: 0.573 | Anton is a really cool and clever software engineer....
  [3] Score: 0.287 | The really cool and clever person from Croydon is 22....

----------------------------------------------------------------------
TOP-K ANSWER:
----------------------------------------------------------------------
The really cool and clever person from Croydon is 22.

Sources: d2.txt, d1.1.txt, d1.txt


### Method 2: MMR (Maximal Marginal Relevance)
Balances relevance and diversity: each pick is penalized for its similarity to chunks already selected, which avoids redundant chunks.
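The greedy selection rule behind MMR can be sketched as follows. This is the textbook formulation, in which each step picks the candidate maximizing `lambda * relevance - (1 - lambda) * max similarity to already selected`. It is written under the assumption of cosine similarity and is not taken from `core.retrieval_strategies`; the function `mmr` is hypothetical.

```python
import numpy as np

def mmr(query_vec, chunk_vecs, k, lambda_param=0.5):
    """Greedy Maximal Marginal Relevance; returns selected chunk indices in pick order."""
    if k <= 0 or len(chunk_vecs) == 0:
        return []
    # Assumption: cosine similarity for both relevance and redundancy.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    relevance = c @ q        # similarity of each chunk to the query
    pairwise = c @ c.T       # similarity between chunks

    selected = [int(np.argmax(relevance))]  # first pick is the most relevant chunk
    while len(selected) < min(k, len(c)):
        remaining = [i for i in range(len(c)) if i not in selected]
        # Redundancy = similarity to the closest already-selected chunk
        redundancy = pairwise[np.ix_(remaining, selected)].max(axis=1)
        scores = lambda_param * relevance[remaining] - (1 - lambda_param) * redundancy
        selected.append(remaining[int(np.argmax(scores))])
    return selected

# Chunks 0 and 1 are near-duplicates; chunk 2 is less relevant but different.
chunk_vecs = np.array([[1.0, 0.0], [1.0, 0.05], [0.3, 1.0]])
query_vec = np.array([1.0, 0.1])

balanced = mmr(query_vec, chunk_vecs, k=2, lambda_param=0.5)
relevance_only = mmr(query_vec, chunk_vecs, k=2, lambda_param=1.0)
print(balanced, relevance_only)
```

With `lambda_param=1.0` the rule reduces to plain Top-K, while lower values trade relevance for diversity.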

In [211]:
# METHOD 2: MMR (Maximal Marginal Relevance)
print("="*70)
print("METHOD 2: MMR (MAXIMAL MARGINAL RELEVANCE)")
print("="*70)

# Retrieve with MMR strategy (lambda=0.5 balances relevance/diversity)
strategy = create_retrieval_strategy('mmr', lambda_param=0.5)
results_mmr, meta = strategy.retrieve(query_emb, candidates, k=k)

# Show retrieved chunks
print(f"\nRetrieved {len(results_mmr)} chunks:")
for r in results_mmr:
    print(f"  [{r.rank}] Score: {r.score:.3f} | {r.text[:80]}...")

# Generate answer
response_mmr = generator.generate(query=test_query, context_chunks=results_mmr)

print("\n" + "-"*70)
print("MMR ANSWER:")
print("-"*70)
print(response_mmr.response)
print(f"\nSources: {', '.join(response_mmr.sources)}")
print("="*70)

METHOD 2: MMR (MAXIMAL MARGINAL RELEVANCE)

Retrieved 3 chunks:
  [1] Score: 0.607 | Anton is a really cool and clever person who lives in Croydon....
  [2] Score: 0.252 | The Millenium Bridge is 25 years old....
  [3] Score: 0.134 | The software engineer is rather hard-working, but he ends up with very little fr...

----------------------------------------------------------------------
MMR ANSWER:
----------------------------------------------------------------------
I cannot find this information in the provided documents.

Sources: d5.txt, d3.txt, d1.txt


### Method 3: QUBO-RAG (Quantum-Inspired Optimization)
Uses the ORBIT simulator to optimize the relevance-diversity tradeoff via p-bit computing.
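To make the idea concrete, the sketch below writes chunk selection as a QUBO (quadratic unconstrained binary optimization) problem and solves it by exhaustive search. It is an illustration under stated assumptions, not the notebook's implementation: the exact objective used by the `qubo` strategy and the ORBIT API are not shown here, so the energy function, the `penalty` parameter, and the helper names `build_qubo` and `solve_brute_force` are all hypothetical. The assumed energy is `-alpha * sum(r_i x_i) + (1 - alpha) * sum(S_ij x_i x_j) + penalty * (sum(x_i) - k)^2`, with the constant `penalty * k^2` dropped.

```python
import itertools
import numpy as np

def build_qubo(relevance, similarity, k, alpha=0.3, penalty=2.0):
    """Upper-triangular QUBO matrix for picking k relevant, mutually dissimilar chunks."""
    # Assumption: generic formulation, not the one used by the notebook's strategy.
    n = len(relevance)
    Q = np.zeros((n, n))
    for i in range(n):
        # Linear terms: reward relevance, plus the expanded cardinality penalty
        Q[i, i] = -alpha * relevance[i] + penalty * (1 - 2 * k)
        for j in range(i + 1, n):
            # Pairwise terms: penalize similar pairs, plus the cardinality penalty
            Q[i, j] = (1 - alpha) * similarity[i, j] + 2 * penalty
    return Q

def solve_brute_force(Q):
    """Enumerate all bit vectors; feasible only for small n (stand-in for a real solver)."""
    n = Q.shape[0]
    best_x, best_e = None, float("inf")
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        energy = float(x @ Q @ x)
        if energy < best_e:
            best_x, best_e = x, energy
    return best_x, best_e

relevance = np.array([0.9, 0.85, 0.4])
similarity = np.array([[1.0, 0.95, 0.1],
                       [0.95, 1.0, 0.1],
                       [0.1, 0.1, 1.0]])

x, energy = solve_brute_force(build_qubo(relevance, similarity, k=2, alpha=0.3))
print(x, energy)
```

In this toy case the two most relevant chunks are near-duplicates, so with `alpha=0.3` the minimum-energy selection keeps the first chunk and swaps the second for the dissimilar third. The penalty weight has to be large enough to dominate the other terms, otherwise the solver may return more or fewer than `k` chunks.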

In [208]:
# METHOD 3: QUBO-RAG (Quantum-Inspired with ORBIT)
print("="*70)
print("METHOD 3: QUBO-RAG (QUANTUM-INSPIRED OPTIMIZATION)")
print("="*70)

# Retrieve with QUBO strategy - TUNED FOR DIVERSITY
# alpha=0.3 emphasizes diversity (lower alpha = more diversity weight)
# n_replicas=4 for better optimization
# full_sweeps=10000 for convergence
strategy = create_retrieval_strategy('qubo', alpha=0.3, 
                                    solver_params={'n_replicas': 4, 'full_sweeps': 10000})
results_qubo, meta = strategy.retrieve(query_emb, candidates, k=k)

print(f"ORBIT simulation time: {meta['execution_time']:.2f}s")
print(f"Alpha: {meta['alpha']} (lower = more diversity emphasis)")
print(f"Constraint satisfied: {meta.get('constraint_satisfied', 'N/A')}")

# Show retrieved chunks
print(f"\nRetrieved {len(results_qubo)} chunks:")
for r in results_qubo:
    print(f"  [{r.rank}] Score: {r.score:.3f} | {r.text[:80]}...")

# Generate answer
response_qubo = generator.generate(query=test_query, context_chunks=results_qubo)

print("\n" + "-"*70)
print("QUBO-RAG ANSWER:")
print("-"*70)
print(response_qubo.response)
print(f"\nSources: {', '.join(response_qubo.sources)}")
print("="*70)

METHOD 3: QUBO-RAG (QUANTUM-INSPIRED OPTIMIZATION)
[2025-12-07 09:06:12] INFO - orbit.simulator: Simulation starting...
[2025-12-07 09:06:15] INFO - orbit.simulator: Simulation completed in 2.51 seconds
ORBIT simulation time: 2.51s
Alpha: 0.3 (lower = more diversity emphasis)
Constraint satisfied: True

Retrieved 3 chunks:
  [1] Score: 0.607 | Anton is a really cool and clever person who lives in Croydon....
  [2] Score: 0.573 | Anton is a really cool and clever software engineer....
  [3] Score: 0.287 | The really cool and clever person from Croydon is 22....

----------------------------------------------------------------------
QUBO-RAG ANSWER:
----------------------------------------------------------------------
The really cool and clever person from Croydon is 22.

Sources: d2.txt, d1.1.txt, d1.txt


### Diversity Comparison
Compare how diverse the retrieved chunks are under each method.
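Intra-list similarity is commonly defined as the mean pairwise cosine similarity among the retrieved chunks. The sketch below follows that common definition; whether `core.diversity_metrics` computes it in exactly this way is an assumption, and the function `intra_list_similarity` here is hypothetical.

```python
import numpy as np

def intra_list_similarity(embeddings):
    """Mean cosine similarity over all distinct pairs of embeddings (lower = more diverse)."""
    # Assumption: common definition, not necessarily the notebook's implementation.
    e = np.asarray(embeddings, dtype=float)
    if len(e) < 2:
        return 0.0  # no pairs to compare
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sims = e @ e.T
    upper = np.triu_indices(len(e), k=1)  # each unordered pair once, no diagonal
    return float(sims[upper].mean())

redundant = intra_list_similarity([[1, 0], [1, 0], [0, 1]])  # one duplicate pair
diverse = intra_list_similarity([[1, 0], [0, 1]])            # orthogonal pair
print(redundant, diverse)
```
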

In [206]:
# Compare diversity metrics across all three methods
from core.diversity_metrics import compare_retrieval_methods, print_comparison_table

# Convert results to dict format for metrics
results_dict = {}
for name, method_results in [('Top-K', results_naive), ('MMR', results_mmr), ('QUBO-RAG', results_qubo)]:
    # Use a distinct loop variable so `results` from Cell 11 is not overwritten
    results_dict[name] = [{
        'id': r.id,
        'score': r.score,
        'embedding': next((c['embedding'] for c in candidates if c['id'] == r.id), None)
    } for r in method_results]

print("\n" + "="*70)
print("DIVERSITY METRICS COMPARISON")
print("="*70)
comparison = compare_retrieval_methods(results_dict)
print_comparison_table(comparison)

print("\nKey Insight:")
print(f"  • Top-K intra-list similarity:   {comparison['Top-K']['intra_list_similarity']:.4f}")
print(f"  • MMR intra-list similarity:     {comparison['MMR']['intra_list_similarity']:.4f}")  
print(f"  • QUBO-RAG intra-list similarity: {comparison['QUBO-RAG']['intra_list_similarity']:.4f}")
print("\n  → Lower = more diverse chunks (less redundancy)")
print("="*70)


DIVERSITY METRICS COMPARISON
Metric                              Top-K             MMR        QUBO-RAG
avg_score                          0.4891          0.3311          0.2869
intra_list_similarity              0.5411          0.1347          0.0801
max_score                          0.6075          0.6075          0.6075
min_score                          0.2868          0.1336          0.0010
num_results                             3               3               3
std_score                          0.1438          0.2013          0.2488

Interpretation:
- Intra-list similarity: Lower = more diverse
- Subtopic recall: Higher = better coverage
- Alpha-nDCG: Higher = better relevance + diversity balance

Key Insight:
  • Top-K intra-list similarity:   0.5411
  • MMR intra-list similarity:     0.1347
  • QUBO-RAG intra-list similarity: 0.0801

  → Lower = more diverse chunks (less redundancy)
