# 🔍 Homework 7: Build a RAG System
**MIS 769 - Big Data Analytics for Business | Spring 2026**

**Points:** 20 | **Due:** Sunday, March 29, 2026 @ 11pm Pacific

**Author:** Richard Young, Ph.D. | UNLV Lee Business School

**Compute:** CPU (free tier); GPU recommended

---

## What You'll Learn

1. Explain what retrieval-augmented generation (RAG) is and why it matters
2. Build a document retrieval system using embeddings
3. Integrate retrieval with a language model
4. Evaluate RAG output quality

---

## Part 1: Document Chunking (3 points)

In [None]:
!pip install sentence-transformers datasets faiss-cpu transformers -q

from datasets import load_dataset
import pandas as pd

# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train[:1000]")
df = dataset.to_pandas()

print(f"✅ Loaded {len(df):,} documents")

In [None]:
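def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks of at most chunk_size characters."""
    if not 0 <= overlap < chunk_size:
        # Otherwise `start` would never advance and the loop would not terminate
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Final chunk reached; skip a trailing chunk that only repeats the overlap
        start = end - overlap  # Step back so consecutive chunks share `overlap` characters
    return chunks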

# Chunk all documents
all_chunks = []
chunk_to_doc = []  # Track which document each chunk came from

for idx, row in df.iterrows():
    chunks = chunk_text(row['text'])
    all_chunks.extend(chunks)
    chunk_to_doc.extend([idx] * len(chunks))

print("📄 DOCUMENT CHUNKING")
print("=" * 60)
print(f"Original documents: {len(df):,}")
print(f"Total chunks created: {len(all_chunks):,}")
print(f"Average chunks per document: {len(all_chunks)/len(df):.1f}")
print("Chunk size: 500 chars, Overlap: 50 chars")
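
Fixed character windows can cut a word in half at a chunk boundary. The sketch below shows one possible alternative that splits on whitespace and counts words instead of characters. `chunk_words` and its default sizes are hypothetical and not part of the assignment code; it assumes simple whitespace tokenization is good enough for these reviews.

```python
def chunk_words(text, chunk_size=100, overlap=10):
    """Split text into overlapping chunks of whole words (hypothetical helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input so the overlap is easy to see: 25 numbered "words".
sample = " ".join(f"w{i}" for i in range(25))
demo_chunks = chunk_words(sample, chunk_size=10, overlap=2)
for chunk in demo_chunks:
    print(chunk)
```

Comparing retrieval results from both chunkers on the same query is one way to gather evidence for Q1.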

## Part 2: Embedding Index (4 points)

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Encoding chunks (this may take 1-2 minutes)...")
chunk_embeddings = model.encode(all_chunks, show_progress_bar=True)

print(f"\n✅ Encoded {len(chunk_embeddings):,} chunks")
print(f"Embedding dimension: {chunk_embeddings.shape[1]}")

In [None]:
# Build FAISS index
dimension = chunk_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(chunk_embeddings).astype('float32'))

print("📊 VECTOR INDEX BUILT")
print("=" * 60)
print(f"Embeddings shape: {chunk_embeddings.shape}")
print("Index type: FAISS (Flat L2)")
print(f"Index size: {index.ntotal} vectors")
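
`IndexFlatL2` performs an exact, exhaustive search and reports squared Euclidean distances. For unit-length vectors, squared L2 distance equals 2 - 2 × cosine similarity, so both measures rank neighbors the same way. The sketch below checks that identity with NumPy on random toy vectors, which stand in for real embeddings. Whether the notebook's embeddings are unit length is an assumption to verify, for example with `np.linalg.norm(chunk_embeddings, axis=1)`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for embeddings (random, NOT real model output), scaled to unit length.
vectors = rng.normal(size=(6, 4)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# A query close to the first vector, also scaled to unit length.
query_vec = vectors[0] + 0.1 * rng.normal(size=4).astype("float32")
query_vec /= np.linalg.norm(query_vec)

# Brute-force squared L2 distance: the quantity a flat L2 index ranks by.
sq_l2 = ((vectors - query_vec) ** 2).sum(axis=1)
# For unit vectors, cosine similarity is just the dot product.
cosine = vectors @ query_vec

print("Identity holds:", np.allclose(sq_l2, 2 - 2 * cosine, atol=1e-5))
print("Nearest by L2:", int(np.argmin(sq_l2)), "| nearest by cosine:", int(np.argmax(cosine)))
```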

## Part 3: Retrieval (5 points)

In [None]:
def retrieve(query, k=5):
    """Retrieve top-k relevant chunks for a query."""
    query_embedding = model.encode([query])
    distances, indices = index.search(
        np.array(query_embedding).astype('float32'), k
    )
    
    results = []
    for i, idx in enumerate(indices[0]):
        results.append({
            'chunk': all_chunks[idx],
            'distance': distances[0][i],
            'doc_id': chunk_to_doc[idx]
        })
    return results

# Test retrieval
query = "What do people say about the acting?"

print("🔎 RETRIEVAL TEST")
print("=" * 60)
print(f"Query: \"{query}\"\n")

results = retrieve(query)

print("Retrieved Chunks:")
for i, r in enumerate(results, 1):
    score = 1 / (1 + r['distance'])  # Convert distance to similarity
    print(f"\n{i}. [Score: {score:.2f}] (Doc #{r['doc_id']})")
    print(f"   {r['chunk'][:150]}...")
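
Because chunks overlap, the top-k list can contain several chunks from the same review. The sketch below is one way to keep only the closest chunk per document. `best_chunk_per_doc` is a hypothetical helper, and the sample results are made up for illustration; real ones would come from `retrieve(query)`.

```python
def best_chunk_per_doc(results):
    """Keep only the closest chunk for each doc_id (hypothetical helper).

    `results` is a list of dicts shaped like the output of `retrieve`.
    """
    best = {}
    for r in results:
        doc_id = r["doc_id"]
        if doc_id not in best or r["distance"] < best[doc_id]["distance"]:
            best[doc_id] = r
    # Return the survivors ordered from nearest to farthest.
    return sorted(best.values(), key=lambda r: r["distance"])


# Made-up results for illustration only.
fake_results = [
    {"chunk": "The acting was superb...", "distance": 0.82, "doc_id": 12},
    {"chunk": "...superb, though the lead felt flat", "distance": 0.91, "doc_id": 12},
    {"chunk": "Wooden performances all around", "distance": 0.95, "doc_id": 40},
]
unique_results = best_chunk_per_doc(fake_results)
for r in unique_results:
    print(r["doc_id"], r["distance"])
```

Since this can return fewer than k items, one option is to retrieve with a larger k first and deduplicate afterwards.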

## Part 4: Generation with RAG (5 points)

In [None]:
from transformers import pipeline

# Load a small language model (works on CPU)
generator = pipeline(
    "text2text-generation", 
    model="google/flan-t5-small",
    max_new_tokens=150
)

print("✅ Language model loaded (flan-t5-small)")

In [None]:
def rag_answer(query, k=3):
    """Answer a question using RAG."""
    # Retrieve relevant chunks
    chunks = retrieve(query, k=k)
    
    # Build context
    context = "\n\n".join([c['chunk'][:300] for c in chunks])
    
    # Build prompt
    prompt = f"""Based on the following movie reviews, answer the question.

Reviews:
{context}

Question: {query}

Answer:"""
    
    # Generate answer
    response = generator(prompt)[0]['generated_text']
    
    return {
        'answer': response,
        'context': chunks
    }

# Test RAG
query = "What do reviewers think about the movie's plot?"

print("💬 RAG RESPONSE")
print("=" * 60)
print(f"Query: \"{query}\"\n")

result = rag_answer(query)

print(f"Retrieved Context ({len(result['context'])} chunks):")
for i, c in enumerate(result['context'], 1):
    print(f"  {i}. {c['chunk'][:80]}...")

print("\nGenerated Answer:")
print(f"  {result['answer']}")

In [None]:
# Compare RAG vs No-RAG
query = "What are common complaints in movie reviews?"

print("📊 RAG vs NO-RAG COMPARISON")
print("=" * 60)
print(f"Query: \"{query}\"\n")

# Without RAG
print("WITHOUT RAG (model's own knowledge):")
no_rag = generator(f"{query} Answer:")  # Same question, but no retrieved context
print(f"  {no_rag[0]['generated_text']}")

# With RAG
print("\nWITH RAG (grounded in actual reviews):")
rag_result = rag_answer(query)
print(f"  {rag_result['answer']}")

print("\n💡 With RAG, the answer is conditioned on reviews retrieved from your data, not only on what the model learned in training.")

## Part 5: Evaluation (3 points)

In [None]:
# Test multiple queries
test_queries = [
    "What makes a good movie according to reviewers?",
    "What do people say about special effects?",
    "Are there any mentions of famous actors?"
]

print("📊 RAG EVALUATION")
print("=" * 60)

for query in test_queries:
    print(f"\nQ: {query}")
    result = rag_answer(query)
    print(f"A: {result['answer']}")
    print("-" * 40)
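
Reading answers by eye is a reasonable start, but a rough number makes runs easier to compare. The sketch below computes the share of distinct answer words that also appear in the retrieved text. This is a crude heuristic of our own, not a standard RAG metric: `token_overlap` is a hypothetical helper, and a high score does not prove the answer is correct.

```python
import re


def token_overlap(answer, context_chunks):
    """Share of distinct answer words that also appear in the retrieved text."""

    def tokenize(s):
        # Lowercase, then keep runs of letters, digits, and apostrophes.
        return set(re.findall(r"[a-z0-9']+", s.lower()))

    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set()
    for chunk in context_chunks:
        context_tokens |= tokenize(chunk)
    return len(answer_tokens & context_tokens) / len(answer_tokens)


# Made-up context and answers for illustration only.
example_context = ["The special effects were dazzling but the story dragged."]
print(token_overlap("the effects were dazzling", example_context))
print(token_overlap("a heartfelt musical score", example_context))
```

With the notebook's objects, a call might look like `token_overlap(result['answer'], [c['chunk'][:300] for c in result['context']])`, which mirrors the 300-character truncation used when building the prompt.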

---

## Questions to Answer

**Q1:** How did you decide on chunk size and overlap?

*Your answer:*

**Q2:** Show an example where RAG improved the LLM's answer.

*Your answer:*

**Q3:** Show an example where retrieval failed. Why?

*Your answer:*

**Q4:** How would you deploy this for a real business application?

*Your answer:*

---

## Submission Checklist

| Item | Points | Done? |
|------|--------|-------|
| Part 1: Document chunking | 3 | ☐ |
| Part 2: Embedding index | 4 | ☐ |
| Part 3: Retrieval | 5 | ☐ |
| Part 4: RAG generation | 5 | ☐ |
| Part 5: Evaluation | 3 | ☐ |
| **Total** | **20** | |