# RAG System

**Phase A:** Load Documents → Chunk → Embed → Store in Vector DB

**Phase B:** Query → Retrieve Similar Chunks → Generate Answer

In [2]:
# CELL 1: Install dependencies (run once)
# !pip install -r requirements.txt

In [3]:
# CELL 2: Imports
import os
import warnings
warnings.filterwarnings('ignore')

from dotenv import load_dotenv
load_dotenv()

from core import DocumentLoader, TextChunker, EmbeddingGenerator, VectorStore, Retriever, ResponseGenerator
from config import RAGConfig
from widgets import create_embedding_visualization, create_similarity_chart, create_chunk_statistics_dashboard

print("Imports OK!")
print(f"GEMINI_API_KEY: {'SET' if os.getenv('GEMINI_API_KEY') else 'NOT SET'}")

Imports OK!
GEMINI_API_KEY: SET


In [4]:
# CELL 3: Configuration - EDIT THESE VALUES
config = RAGConfig(
    chunk_size=500,
    chunk_overlap=50,
    chunking_strategy='sentence',
    embedding_model='all-MiniLM-L6-v2',
    top_k=5,
    llm_model='gemini-2.5-flash-lite',
    temperature=0.7,
    max_tokens=1024
)
print("Config set!")

Config set!


---
## Phase A: Indexing
---
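Of the indexing steps, chunking is the one most directly shaped by the configuration above (`chunking_strategy='sentence'`, `chunk_size=500`). The notebook does not show the internals of `TextChunker`, so the following is only a minimal sketch of what a sentence-based strategy might look like. The function name `chunk_by_sentence` and the `overlap_sentences` parameter are hypothetical, and overlap is counted in sentences here rather than in characters as `chunk_overlap=50` suggests.

```python
import re

def chunk_by_sentence(text, chunk_size=500, overlap_sentences=1):
    """Pack whole sentences into chunks that aim to stay within chunk_size characters.

    A single sentence longer than chunk_size is kept whole rather than split.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry the trailing sentence(s) forward as overlap (hypothetical policy).
            carried = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the overlap if it would push the next chunk past the limit.
            if len(" ".join(carried + [sentence])) > chunk_size:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo_chunks = chunk_by_sentence("A one. B two. C three. D four.", chunk_size=15)
print(demo_chunks)
```

With documents as short as the ones loaded below (roughly 160 to 360 characters each), a 500-character limit would leave most documents as a single chunk under this scheme.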

In [5]:
# CELL 4: Load Documents - EDIT PATH HERE
documents = DocumentLoader.load_directory('./data/medical_diagnosis')

for doc in documents:
    print(f"{doc.source}: {len(doc):,} chars")
print(f"\nTotal: {len(documents)} docs")

addisons_disease.txt: 304 chars
anemia_variant_1.txt: 213 chars
anemia_variant_10.txt: 162 chars
anemia_variant_11.txt: 163 chars
anemia_variant_12.txt: 160 chars
anemia_variant_13.txt: 163 chars
anemia_variant_14.txt: 164 chars
anemia_variant_15.txt: 174 chars
anemia_variant_2.txt: 178 chars
anemia_variant_3.txt: 192 chars
anemia_variant_4.txt: 182 chars
anemia_variant_5.txt: 168 chars
anemia_variant_6.txt: 166 chars
anemia_variant_7.txt: 159 chars
anemia_variant_8.txt: 183 chars
anemia_variant_9.txt: 159 chars
celiac_disease.txt: 356 chars
chronic_fatigue_variant_1.txt: 292 chars
chronic_fatigue_variant_10.txt: 205 chars
chronic_fatigue_variant_11.txt: 195 chars
chronic_fatigue_variant_12.txt: 190 chars
chronic_fatigue_variant_13.txt: 166 chars
chronic_fatigue_variant_14.txt: 197 chars
chronic_fatigue_variant_15.txt: 186 chars
chronic_fatigue_variant_16.txt: 183 chars
chronic_fatigue_variant_17.txt: 195 chars
chronic_fatigue_variant_18.txt: 177 chars
chronic_fatigue_variant_19.txt: 1

In [6]:
# CELL 5: Chunk documents
chunker = TextChunker(chunk_size=config.chunk_size, overlap=config.chunk_overlap, strategy=config.chunking_strategy)
chunks = chunker.chunk_documents(documents)
print(f"Created {len(chunks)} chunks (avg {chunker.get_statistics(chunks)['avg_length']:.0f} chars)")

Created 210 chunks (avg 188 chars)


In [7]:
# CELL 6: Visualize chunks (optional - comment out to skip)
# create_chunk_statistics_dashboard(chunks).show()
print(f"Chunk statistics: {len(chunks)} chunks created")

Chunk statistics: 210 chunks created


In [8]:
# CELL 7: Generate embeddings
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

embedder = EmbeddingGenerator(model_name=config.embedding_model, device=device)
embedded_chunks = embedder.embed_chunks(chunks, batch_size=32, show_progress=True)
print(f"\nGenerated {len(embedded_chunks)} embeddings")

Using device: cuda


Loading embedding model: all-MiniLM-L6-v2...
Model loaded (dim=384)

Generated 210 embeddings


In [9]:
# CELL 8: Store in vector database
vector_store = VectorStore(collection_name='rag_collection', persist_directory='./chroma_db', reset=True)
count = vector_store.add(embedded_chunks)
print(f"Stored {count} vectors")

Deleted existing collection: rag_collection
Stored 210 vectors


In [10]:
# CELL 9: Visualize embedding space (deprecated)
#all_embeddings, all_metadata = vector_store.get_all_embeddings()

# Automatic subsampling to 2000 points + faster UMAP params
# Should take ~5-10s instead of 40s+
#create_embedding_visualization(all_embeddings, all_metadata, method='UMAP').show()

#print(f"Visualized {len(all_embeddings)} embeddings (subsampled to 2000 for speed)")

---
## Phase B: Query
---

In [11]:
# CELL 10: Initialize retriever and generator
retriever = Retriever(embedder, vector_store)
generator = ResponseGenerator(model=config.llm_model, temperature=config.temperature, max_tokens=config.max_tokens)
print(f"Ready! Using {config.llm_model}")

Ready! Using gemini-2.5-flash-lite


In [12]:
# CELL 11: ASK A QUESTION - EDIT YOUR QUERY HERE
query = "Patient presents with chronic fatigue, joint pain, and occasional low-grade fever. What conditions should be considered?"

# Retrieve
results = retriever.retrieve(query=query, k=config.top_k)
print(f"Query: {query}\n")
print("Retrieved chunks:")
for r in results:
    print(f"\n{'='*80}")
    print(f"[{r.rank}] Score: {r.score:.3f} | Source: {r.source}")
    print(f"{'='*80}")
    print(r.text)

# Generate
response = generator.generate(query=query, context_chunks=results)
print(f"\n{'='*50}\nANSWER:\n{'='*50}")
print(response.response)
print(f"\nSources: {', '.join(response.sources)}")

Query: Patient presents with chronic fatigue, joint pain, and occasional low-grade fever. What conditions should be considered?

Retrieved chunks:

[1] Score: 0.598 | Source: mononucleosis_variant_2.txt
Mononucleosis presents with crushing fatigue, prolonged fever, and pharyngitis. The tiredness is profound and long-lasting. Lymphadenopathy and hepatosplenomegaly are typical.

[2] Score: 0.591 | Source: mononucleosis_variant_9.txt
Infectious mononucleosis causes debilitating tiredness, intermittent fever, and throat pain. Fatigue dominates symptoms for weeks. Lymphadenopathy and hepatosplenomegaly are typical.

[3] Score: 0.564 | Source: polymyalgia_variant_4.txt
Polymyalgia rheumatica involves severe bilateral muscle aching in shoulder and hip girdles. Morning stiffness is extreme and prolonged. Fever episodes and tiredness occur. Age over 50 typical.

[4] Score: 0.557 | Source: mononucleosis_variant_5.txt
Infectious mononucleosis involves severe tiredness, intermittent fever, and pha

In [13]:
# CELL 12: Visualize query in embedding space
all_embeddings, all_metadata = vector_store.get_all_embeddings()
query_embedding = retriever.last_query_embedding
retrieved_indices = retriever.get_retrieved_indices(results, all_metadata)
scores = [r.score for r in results]

fig = create_embedding_visualization(
    all_embeddings, all_metadata, method='UMAP',
    query_embedding=query_embedding,
    retrieved_indices=retrieved_indices,
    retrieval_scores=scores
)
fig.show()

In [14]:
# CELL 13: Quick ask function - use for more questions
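def ask(question, top_k=5):
    """Retrieve top_k chunks for the question, generate an answer, print it, and return the result."""
    # In a notebook, a bare ask(...) call also echoes the returned object; append `;` to suppress it.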
    results = retriever.retrieve(query=question, k=top_k)
    response = generator.generate(query=question, context_chunks=results)
    print(f"Q: {question}\n\nA: {response.response}\n\nSources: {', '.join(response.sources)}")
    return response

In [15]:
# CELL 14: Ask more questions
ask("What is deep learning?")

Q: What is deep learning?

A: I cannot find this information in the provided documents.

Sources: fibromyalgia_variant_2.txt, parkinsons_disease.txt, multiple_sclerosis.txt, cushings_syndrome.txt, chronic_fatigue_variant_28.txt


GenerationResult(query='What is deep learning?', response='I cannot find this information in the provided documents.', context_chunks=[RetrievalResult(chunk=Chunk(id='parkinsons_disease.txt_0', text="Parkinson's disease involves progressive degeneration of dopamine-producing neurons in substantia nigra. Classic triad includes resting tremor, rigidity, and bradykinesia. Postural instability develops later. Symptoms begin asymmetrically. Levodopa therapy improves motor symptoms but doesn't stop progression.", source='parkinsons_disease.txt', chunk_index=0, start_char=0, end_char=310, metadata={'strategy': 'sentence', 'document_type': 'txt', 'sentence_count': '5'}), score=0.12843894958496094, rank=1), RetrievalResult(chunk=Chunk(id='multiple_sclerosis.txt_0', text='Multiple sclerosis is an autoimmune disease affecting the central nervous system. Immune cells attack myelin sheath protecting nerve fibers. Symptoms vary based on lesion location: vision problems, weakness, numbness, balance i

In [16]:
ask("How do I create a virtual environment in Python?")

Q: How do I create a virtual environment in Python?

A: I cannot find this information in the provided documents.

Sources: mononucleosis_variant_6.txt, mononucleosis_variant_13.txt, polymyalgia_variant_4.txt, polymyalgia_variant_8.txt, mononucleosis_variant_14.txt


GenerationResult(query='How do I create a virtual environment in Python?', response='I cannot find this information in the provided documents.', context_chunks=[RetrievalResult(chunk=Chunk(id='mononucleosis_variant_6.txt_0', text='Mononucleosis presents with profound exhaustion, persistent fever, and sore throat. The fatigue is debilitating and long-term. EBV infection causes lymphadenopathy.', source='mononucleosis_variant_6.txt', chunk_index=0, start_char=0, end_char=164, metadata={'strategy': 'sentence', 'sentence_count': '3', 'document_type': 'txt'}), score=0.08225637674331665, rank=1), RetrievalResult(chunk=Chunk(id='polymyalgia_variant_4.txt_0', text='Polymyalgia rheumatica involves severe bilateral muscle aching in shoulder and hip girdles. Morning stiffness is extreme and prolonged. Fever episodes and tiredness occur. Age over 50 typical.', source='polymyalgia_variant_4.txt', chunk_index=0, start_char=0, end_char=192, metadata={'sentence_count': '4', 'strategy': 'sentence', 'do

---
## Method Comparison: Top-K vs MMR vs QUBO-RAG
**All three methods answer the same query but use different retrieval strategies.**

---

In [17]:
# Setup: Import and define test query
from core.retrieval_strategies import create_retrieval_strategy
import numpy as np

# Medical diagnosis query with overlapping symptoms  
test_query = "Patient presents with chronic fatigue, joint pain, and occasional low-grade fever. What conditions should be considered in the differential diagnosis?"
k = 5

print(f"Test Query: '{test_query}'")
print(f"Retrieving top {k} chunks with each method...\n")

Test Query: 'Patient presents with chronic fatigue, joint pain, and occasional low-grade fever. What conditions should be considered in the differential diagnosis?'
Retrieving top 5 chunks with each method...



### Method 1: Top-K (Naive)
Simple relevance-based retrieval: picks the k chunks with the highest similarity scores, with no check for redundancy among them.
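As a rough illustration of what relevance-only selection does, the sketch below ranks toy vectors by cosine similarity to a query and keeps the best k. It assumes the score is cosine similarity, which the notebook does not state, and `top_k_by_cosine` is a hypothetical helper rather than part of `core.retrieval_strategies`.

```python
import numpy as np

def top_k_by_cosine(query_vec, chunk_vecs, k):
    """Return (indices, scores) of the k chunks most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity of every chunk to the query
    order = np.argsort(-scores)[:k]     # highest scores first
    return order, scores[order]

# Toy 2-D "embeddings": chunks 0 and 1 are near-duplicates.
chunk_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
query_vec = np.array([1.0, 0.0])

top_idx, top_scores = top_k_by_cosine(query_vec, chunk_vecs, k=2)
print(top_idx, top_scores)
```

In this toy case the two near-duplicate chunks take both slots, which is the same pattern as the four mononucleosis variants in the Top-K output below.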

In [18]:
# METHOD 1: TOP-K (Naive)
print("="*70)
print("METHOD 1: TOP-K (NAIVE)")
print("="*70)

# Retrieve with Naive strategy
query_emb = embedder.embed_query(test_query)
candidates = vector_store.search(query_emb, k=k*3)
strategy = create_retrieval_strategy('naive')
results_naive, meta = strategy.retrieve(query_emb, candidates, k=k)

# Show retrieved chunks
print(f"\nRetrieved {len(results_naive)} chunks:")
for r in results_naive:
    print(f"  [{r.rank}] Score: {r.score:.3f} | {r.text[:80]}...")

# Generate answer
response_naive = generator.generate(query=test_query, context_chunks=results_naive)

print("\n" + "-"*70)
print("TOP-K ANSWER:")
print("-"*70)
print(response_naive.response)
print(f"\nSources: {', '.join(response_naive.sources)}")
print("="*70)

METHOD 1: TOP-K (NAIVE)

Retrieved 5 chunks:
  [1] Score: 0.600 | Mononucleosis presents with crushing fatigue, prolonged fever, and pharyngitis. ...
  [2] Score: 0.589 | Infectious mononucleosis causes debilitating tiredness, intermittent fever, and ...
  [3] Score: 0.562 | Infectious mononucleosis involves severe tiredness, intermittent fever, and phar...
  [4] Score: 0.562 | Lyme disease presents with low-grade fever, overwhelming fatigue, and painful jo...
  [5] Score: 0.557 | Mononucleosis presents with debilitating fatigue, persistent low-grade fever, an...

----------------------------------------------------------------------
TOP-K ANSWER:
----------------------------------------------------------------------
Lyme disease and mononucleosis should be considered in the differential diagnosis.

Sources: lyme_variant_6.txt, mononucleosis_variant_2.txt, mononucleosis_variant_9.txt, mononucleosis_variant_5.txt, mononucleosis_variant_14.txt


### Method 2: MMR (Maximal Marginal Relevance)
Balances relevance and diversity: each pick is penalized for its similarity to chunks already selected, which reduces redundancy.
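The greedy selection rule behind MMR can be sketched as follows. This is the standard textbook form with cosine similarity, not necessarily what `create_retrieval_strategy('mmr', ...)` does internally; `mmr_select` is a hypothetical name, and `lambda_param` is assumed to weight relevance (1.0 means pure relevance).

```python
import numpy as np

def mmr_select(query_vec, cand_vecs, k, lambda_param=0.5):
    """Greedy MMR: repeatedly pick the candidate with the best
    lambda * relevance - (1 - lambda) * max similarity to already-picked items."""
    unit = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    relevance = unit @ q          # similarity of each candidate to the query
    pairwise = unit @ unit.T      # similarity between candidates
    selected = []
    remaining = list(range(len(unit)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            # Redundancy is 0 for the first pick, when nothing is selected yet.
            redundancy = max((pairwise[i, j] for j in selected), default=0.0)
            score = lambda_param * relevance[i] - (1 - lambda_param) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected

# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
cand_vecs = np.array([[1.0, 0.0], [0.98, 0.05], [0.6, 0.8]])
query_vec = np.array([1.0, 0.1])

balanced = mmr_select(query_vec, cand_vecs, k=2, lambda_param=0.5)
relevance_only = mmr_select(query_vec, cand_vecs, k=2, lambda_param=1.0)
print(balanced, relevance_only)
```

Because results are returned in selection order, a later pick can have a higher raw score than an earlier one, which would explain why ranks 2 and 3 in the MMR output below are not sorted by score.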

In [19]:
# METHOD 2: MMR (Maximal Marginal Relevance)
print("="*70)
print("METHOD 2: MMR (MAXIMAL MARGINAL RELEVANCE)")
print("="*70)

# Retrieve with MMR strategy (lambda=0.5 balances relevance/diversity)
strategy = create_retrieval_strategy('mmr', lambda_param=0.5)
results_mmr, meta = strategy.retrieve(query_emb, candidates, k=k)

# Show retrieved chunks
print(f"\nRetrieved {len(results_mmr)} chunks:")
for r in results_mmr:
    print(f"  [{r.rank}] Score: {r.score:.3f} | {r.text[:80]}...")

# Generate answer
response_mmr = generator.generate(query=test_query, context_chunks=results_mmr)

print("\n" + "-"*70)
print("MMR ANSWER:")
print("-"*70)
print(response_mmr.response)
print(f"\nSources: {', '.join(response_mmr.sources)}")
print("="*70)

METHOD 2: MMR (MAXIMAL MARGINAL RELEVANCE)

Retrieved 5 chunks:
  [1] Score: 0.600 | Mononucleosis presents with crushing fatigue, prolonged fever, and pharyngitis. ...
  [2] Score: 0.546 | Low vitamin D causes fatigue, generalized muscle aches, and joint pain. Patients...
  [3] Score: 0.562 | Lyme disease presents with low-grade fever, overwhelming fatigue, and painful jo...
  [4] Score: 0.542 | Polymyalgia rheumatica presents with bilateral shoulder and hip pain with severe...
  [5] Score: 0.541 | Hypothyroidism presents with debilitating fatigue, unexplained weight increase, ...

----------------------------------------------------------------------
MMR ANSWER:
----------------------------------------------------------------------
Based on the provided documents, the following conditions should be considered in the differential diagnosis for a patient presenting with chronic fatigue, joint pain, and occasional low-grade fever:

*   **Mononucleosis:** Presents with crushing fatigue, 

### Method 3: QUBO-RAG (Quantum-Inspired Optimization)
Uses the ORBIT simulator to optimize the relevance-diversity tradeoff via p-bit computing.
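The notebook does not show how the QUBO is built or how ORBIT is called, so the following is a sketch of one common formulation, offered as an assumption rather than the project's actual objective. Selecting chunk i is a binary variable x_i; the energy rewards relevance with weight `alpha`, penalizes pairwise similarity with weight `1 - alpha`, and adds a quadratic penalty that pushes the number of selected chunks toward k. An exhaustive search stands in for the p-bit sampler, which is only practical for very small candidate sets. `build_qubo`, `brute_force_minimum`, and `penalty` are hypothetical names.

```python
import itertools
import numpy as np

def build_qubo(relevance, similarity, k, alpha=0.35, penalty=10.0):
    """Upper-triangular QUBO matrix for: minimize x^T Q x, x in {0,1}^n.

    Energy = -alpha * sum_i r_i x_i + (1 - alpha) * sum_{i<j} s_ij x_i x_j
             + penalty * (sum_i x_i - k)^2   (constant penalty * k^2 dropped)
    """
    n = len(relevance)
    Q = np.zeros((n, n))
    for i in range(n):
        # Linear terms sit on the diagonal because x_i^2 == x_i for binary x.
        Q[i, i] = -alpha * relevance[i] + penalty * (1 - 2 * k)
        for j in range(i + 1, n):
            Q[i, j] = (1 - alpha) * similarity[i, j] + 2 * penalty
    return Q

def brute_force_minimum(Q):
    """Check all 2^n assignments; a stand-in for a real QUBO solver."""
    n = Q.shape[0]
    best_x, best_energy = None, np.inf
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        energy = x @ Q @ x
        if energy < best_energy:
            best_x, best_energy = x, energy
    return best_x, best_energy

# Toy data: chunks 0 and 1 are the most relevant but nearly identical.
relevance = np.array([0.60, 0.59, 0.56, 0.54])
similarity = np.array([
    [1.00, 0.95, 0.30, 0.30],
    [0.95, 1.00, 0.30, 0.30],
    [0.30, 0.30, 1.00, 0.30],
    [0.30, 0.30, 0.30, 1.00],
])

Q = build_qubo(relevance, similarity, k=2, alpha=0.35)
best_x, best_energy = brute_force_minimum(Q)
print(best_x)
```

With these toy numbers the minimum selects chunks 0 and 2 rather than the near-duplicate pair 0 and 1, and exactly k = 2 bits are set, which corresponds to the `Constraint satisfied: True` line in the output below.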

In [20]:
# METHOD 3: QUBO-RAG (Quantum-Inspired with ORBIT)
print("="*70)
print("METHOD 3: QUBO-RAG (QUANTUM-INSPIRED OPTIMIZATION)")
print("="*70)

# Retrieve with QUBO strategy - TUNED FOR DIVERSITY
# alpha=0.35 emphasizes diversity (65% diversity weight!)
# n_replicas=4 for better optimization
# full_sweeps=10000 for convergence
strategy = create_retrieval_strategy('qubo', alpha=0.35, 
                                    solver_params={'n_replicas': 4, 'full_sweeps': 10000})
results_qubo, meta = strategy.retrieve(query_emb, candidates, k=k)

print(f"ORBIT simulation time: {meta['execution_time']:.2f}s")
print(f"Alpha: {meta['alpha']} (lower = more diversity emphasis)")
print(f"Constraint satisfied: {meta.get('constraint_satisfied', 'N/A')}")

# Show retrieved chunks
print(f"\nRetrieved {len(results_qubo)} chunks:")
for r in results_qubo:
    print(f"  [{r.rank}] Score: {r.score:.3f} | {r.text[:80]}...")

# Generate answer
response_qubo = generator.generate(query=test_query, context_chunks=results_qubo)

print("\n" + "-"*70)
print("QUBO-RAG ANSWER:")
print("-"*70)
print(response_qubo.response)
print(f"\nSources: {', '.join(response_qubo.sources)}")
print("="*70)

METHOD 3: QUBO-RAG (QUANTUM-INSPIRED OPTIMIZATION)
[2025-12-06 17:07:31] INFO - orbit.simulator: Simulation starting...
[2025-12-06 17:07:36] INFO - orbit.simulator: Simulation completed in 4.53 seconds
ORBIT simulation time: 4.53s
Alpha: 0.35 (lower = more diversity emphasis)
Constraint satisfied: True

Retrieved 5 chunks:
  [1] Score: 0.562 | Lyme disease presents with low-grade fever, overwhelming fatigue, and painful jo...
  [2] Score: 0.547 | Polymyalgia rheumatica involves severe bilateral muscle aching in shoulder and h...
  [3] Score: 0.544 | Infectious mononucleosis involves profound exhaustion, intermittent fever, and t...
  [4] Score: 0.541 | Polymyalgia rheumatica involves bilateral pain and profound stiffness in shoulde...
  [5] Score: 0.540 | Mononucleosis manifests as extreme fatigue, recurrent fever, and throat pain. Th...

----------------------------------------------------------------------
QUBO-RAG ANSWER:
------------------------------------------------------------

### Diversity Comparison
Compares how diverse the retrieved chunks are under each method.
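Intra-list similarity is usually defined as the mean pairwise similarity among the items in one result list. The sketch below follows that usual definition with cosine similarity; the implementation in `core.diversity_metrics` is not shown in this notebook and may differ, and `intra_list_similarity` here is a hypothetical stand-alone function.

```python
import numpy as np

def intra_list_similarity(embeddings):
    """Mean cosine similarity over all distinct pairs in one result list."""
    vecs = np.asarray(embeddings, dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(unit), k=1)   # each unordered pair once, no diagonal
    return float(sims[upper].mean())

redundant = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # three identical chunks
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # chunks pointing in different directions

ils_redundant = intra_list_similarity(redundant)
ils_diverse = intra_list_similarity(diverse)
print(ils_redundant, ils_diverse)
```

A list of identical chunks scores 1.0, and the score falls as the chunks spread apart, so lower values indicate less redundancy.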

In [21]:
# Compare diversity metrics across all three methods
from core.diversity_metrics import compare_retrieval_methods, print_comparison_table

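# Convert results to dict format for metrics
# (the loop variable is method_results so it does not overwrite `results` from CELL 11)
results_dict = {}
for name, method_results in [('Top-K', results_naive), ('MMR', results_mmr), ('QUBO-RAG', results_qubo)]:
    results_dict[name] = [{
        'id': r.id,
        'score': r.score,
        # Look up the embedding of each result among the shared candidate pool
        'embedding': next((c['embedding'] for c in candidates if c['id'] == r.id), None)
    } for r in method_results]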

print("\n" + "="*70)
print("DIVERSITY METRICS COMPARISON")
print("="*70)
comparison = compare_retrieval_methods(results_dict)
print_comparison_table(comparison)

print("\nKey Insight:")
print(f"  • Top-K intra-list similarity:   {comparison['Top-K']['intra_list_similarity']:.4f}")
print(f"  • MMR intra-list similarity:     {comparison['MMR']['intra_list_similarity']:.4f}")  
print(f"  • QUBO-RAG intra-list similarity: {comparison['QUBO-RAG']['intra_list_similarity']:.4f}")
print("\n  → Lower = more diverse chunks (less redundancy)")
print("="*70)


DIVERSITY METRICS COMPARISON
Metric                              Top-K             MMR        QUBO-RAG
avg_score                          0.5743          0.5583          0.5468
intra_list_similarity              0.6704          0.4013          0.4988
max_score                          0.6005          0.6005          0.5622
min_score                          0.5568          0.5410          0.5400
num_results                             5               5               5
std_score                          0.0174          0.0225          0.0081

Interpretation:
- Intra-list similarity: Lower = more diverse
- Subtopic recall: Higher = better coverage
- Alpha-nDCG: Higher = better relevance + diversity balance

Key Insight:
  • Top-K intra-list similarity:   0.6704
  • MMR intra-list similarity:     0.4013
  • QUBO-RAG intra-list similarity: 0.4988

  → Lower = more diverse chunks (less redundancy)
