# [STARTER] Udaplay Project

## Part 01 - Offline RAG

In this part of the project, you'll build your VectorDB using Chroma.

The data is inside the folder `project/starter/games`. Each file becomes a document in the collection you'll create.
Example:
```json
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
```
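
Before indexing, it can help to confirm that every file has the shape shown above. The following is a minimal sketch, not part of the starter code: `REQUIRED_FIELDS` and `validate_game` are hypothetical names, and the scalar-only check assumes the record will be stored as Chroma metadata, where values are expected to be plain strings, numbers, or booleans.

```python
# Field names copied from the example record above.
REQUIRED_FIELDS = ("Name", "Platform", "Genre", "Publisher", "Description", "YearOfRelease")


def validate_game(game: dict) -> list[str]:
    """Return a list of problems found in one game record (empty if none)."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in game]
    for key, value in game.items():
        # Assumption: metadata values should be plain scalars (str, int, float, bool).
        if not isinstance(value, (str, int, float, bool)):
            problems.append(f"non-scalar value for {key}: {type(value).__name__}")
    return problems
```

Calling `validate_game(game)` right after `json.load(f)` and skipping or reporting records with a non-empty result keeps malformed files out of the collection.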


### Setup

In [1]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [2]:
import os
import json
import time
import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

In [3]:
# TODO: Create a .env file with the following variables
# OPENAI_API_KEY="YOUR_KEY"
# CHROMA_OPENAI_API_KEY="YOUR_KEY"
# TAVILY_API_KEY="YOUR_KEY"

In [4]:
# TODO: Load environment variables
load_dotenv()

True
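
A `True` result only indicates that something was loaded from `.env`, not that every key listed in the template is present. As a rough sketch (the helper name `missing_env_vars` is hypothetical), the variables can be checked explicitly:

```python
import os

# Variable names taken from the .env template above.
REQUIRED_VARS = ("OPENAI_API_KEY", "CHROMA_OPENAI_API_KEY", "TAVILY_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or blank in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]
```

Running `missing_env_vars()` after `load_dotenv()` should return an empty list when the `.env` file is complete.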

### VectorDB Instance

In [5]:
# TODO: Instantiate your ChromaDB Client
# Choose any path you want
chroma_client = chromadb.PersistentClient(path="chromadb")

### Collection

In [6]:
# TODO: Pick one embedding function
# If you pick something other than OpenAI,
# make sure you use the same function when loading the collection
embedding_fn = embedding_functions.OpenAIEmbeddingFunction()

In [7]:
# TODO: Create a collection
# Choose any name you want
collection = chroma_client.get_or_create_collection(
    name="udaplay",
    embedding_function=embedding_fn
)

### Add documents

In [8]:
# Make sure the "games" directory (project/starter/games) is reachable from the notebook's working directory
data_dir = "games"

start_time = time.time()
num_docs = 0

for file_name in sorted(os.listdir(data_dir)):
    if not file_name.endswith(".json"):
        continue

    file_path = os.path.join(data_dir, file_name)
    with open(file_path, "r", encoding="utf-8") as f:
        game = json.load(f)

    # You can change what text you want to index
    content = f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"

    # Use file name (like 001) as ID
    doc_id = os.path.splitext(file_name)[0]

    collection.add(
        ids=[doc_id],
        documents=[content],
        metadatas=[game]
    )

    num_docs += 1

elapsed = time.time() - start_time
print(f"Indexed {num_docs} games from '{data_dir}' in {elapsed:.2f} seconds.")


Indexed 15 games from 'games' in 4.85 seconds.
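
The loop above calls `collection.add` once per file, which most likely means one embedding request per game. A possible alternative, sketched here under that assumption, is to collect all records first and send them in a single call. `load_game_batch` is a hypothetical helper; it reuses the same ID and text format as the cell above.

```python
import json
import os


def load_game_batch(data_dir: str) -> dict:
    """Read every .json file in data_dir and build keyword arguments for one batched call."""
    ids, documents, metadatas = [], [], []
    for file_name in sorted(os.listdir(data_dir)):
        if not file_name.endswith(".json"):
            continue
        with open(os.path.join(data_dir, file_name), "r", encoding="utf-8") as f:
            game = json.load(f)
        # Same conventions as the indexing cell: file stem as ID, formatted text as document.
        ids.append(os.path.splitext(file_name)[0])
        documents.append(
            f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"
        )
        metadatas.append(game)
    return {"ids": ids, "documents": documents, "metadatas": metadatas}
```

The result can be passed as `collection.upsert(**load_game_batch(data_dir))`. Using `upsert` instead of `add` is a design choice for notebooks: re-running the cell updates existing IDs rather than attempting to add them again.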


### Retrieval Quality Evaluation Utilities

In [9]:
def debug_retrieval(query: str, n_results: int = 5):
    """
    Helper function to inspect retrieval quality for a given query.

    Args:
        query: Natural language search query.
        n_results: How many nearest neighbors to return.

    Prints:
        - Query
        - Retrieved game titles
        - Platforms and years
        - Distances / similarity scores (if available)
        - Content snippet for each result
    """
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=["documents", "metadatas", "distances"],
    )

    print(f"\n=== Query: {query!r} ===")
    for i in range(len(results["ids"][0])):
        meta = results["metadatas"][0][i]
        doc = results["documents"][0][i]
        distance = results["distances"][0][i] if results.get("distances") else None

        name = meta.get("Name", "Unknown title")
        platform = meta.get("Platform", "Unknown platform")
        year = meta.get("YearOfRelease", "Unknown year")

        print(f"\nResult #{i+1}")
        print(f"  Title:    {name}")
        print(f"  Platform: {platform}")
        print(f"  Year:     {year}")
        if distance is not None:
            print(f"  Distance: {distance:.4f}")
        print(f"  Snippet:  {doc[:200]}{'...' if len(doc) > 200 else ''}")

In [10]:
# Try a few example queries to inspect retrieval quality
test_queries = [
    "3D Mario platformer",
    "fighting game with fatalities",
    "open world fantasy RPG",
    "classic Nintendo kart racing game"
]

for q in test_queries:
    debug_retrieval(q, n_results=3)


=== Query: '3D Mario platformer' ===

Result #1
  Title:    Super Mario 64
  Platform: Nintendo 64
  Year:     1996
  Distance: 0.1035
  Snippet:  [Nintendo 64] Super Mario 64 (1996) - A groundbreaking 3D platformer that set new standards for the genre, featuring Mario's quest to rescue Princess Peach.

Result #2
  Title:    Super Mario World
  Platform: Super Nintendo Entertainment System (SNES)
  Year:     1990
  Distance: 0.1300
  Snippet:  [Super Nintendo Entertainment System (SNES)] Super Mario World (1990) - A classic platformer where Mario embarks on a quest to save Princess Toadstool and Dinosaur Land from Bowser.

Result #3
  Title:    Mario Kart 8 Deluxe
  Platform: Nintendo Switch
  Year:     2017
  Distance: 0.1609
  Snippet:  [Nintendo Switch] Mario Kart 8 Deluxe (2017) - An enhanced version of Mario Kart 8, featuring new characters, tracks, and improved gameplay mechanics.

=== Query: 'fighting game with fatalities' ===

Result #1
  Title:    Super Smash Bros. Melee
  [output truncated]

## Design Rationale for the Retrieval System

This notebook implements the retrieval layer for the UdaPlay Agent.  
Several intentional design decisions were made to ensure that the vector search system behaves predictably and supports agentic reasoning:

### 1. **ChromaDB as a persistent vector store**
Chroma provides a simple API, persistent local storage, and built-in similarity search, which makes it well suited to lightweight agent prototypes and educational RAG systems.

### 2. **Embedding strategy**
Selected game fields (name, platform, year, description) are concatenated into a single indexed document, while the full JSON record is stored as metadata.  
This balances:
- semantic richness (via descriptions)
- factual precision (via metadata fields)
- minimal engineering overhead
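
Because the full record is passed as `metadatas=[game]` during indexing, the factual fields can also act as filters at query time. The sketch below assumes a Chroma collection object and uses its `where` argument for an exact match on `Platform`; the wrapper name `query_by_platform` is hypothetical.

```python
def query_by_platform(collection, query: str, platform: str, n_results: int = 3):
    """Run a similarity query restricted to games whose Platform metadata matches exactly."""
    return collection.query(
        query_texts=[query],
        n_results=n_results,
        where={"Platform": platform},  # exact-match metadata filter
        include=["documents", "metadatas", "distances"],
    )
```

For example, `query_by_platform(collection, "racing game", "PlayStation 1")` would only rank games stored with that platform value, so the filter string has to match the JSON field exactly.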

### 3. **Similarity search**
The agent relies on vector similarity to retrieve top-K candidates.  
This approach supports:
- fuzzy semantic matches  
- genre/category inference  
- cross-platform comparisons  
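
To make "vector similarity" concrete, here is a toy top-K ranking in plain Python. It is only an illustration: the three-dimensional vectors are invented, real embeddings have far more dimensions, and the distance metric Chroma actually uses depends on how the collection is configured, which this notebook does not set explicitly.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Rank documents by ascending distance to the query and keep the k closest."""
    scored = [(doc_id, cosine_distance(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "platformer": [0.9, 0.1, 0.0],
    "racing": [0.1, 0.9, 0.0],
    "rpg": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], toy_vectors, k=2))
```

A smaller distance means a closer match, which is why the results printed by `debug_retrieval()` are listed in ascending distance order.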

### 4. **Retrieval evaluation utilities**
The `debug_retrieval()` helper was added to:
- inspect retrieved neighbors  
- validate that relevant metadata is surfaced  
- expose distances to detect weak retrieval cases  

This mirrors a common way of debugging RAG retrieval: inspect the neighbors and their distances before looking at any generated answer.
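
One way to act on those distances is to flag hits that fall beyond a cutoff. The sketch below works on a single-query result dictionary shaped like the one returned by `collection.query`. The helper name `flag_weak_results` and the default threshold of `0.25` are assumptions for illustration; a sensible cutoff depends on the embedding model and distance metric and should be tuned against queries like the ones above.

```python
def flag_weak_results(results: dict, max_distance: float = 0.25) -> list[dict]:
    """List hits from a single-query, Chroma-style result dict that exceed max_distance."""
    ids = results["ids"][0]
    # Distances and metadatas may be absent if they were not requested via `include`.
    distances = (results.get("distances") or [[]])[0]
    metadatas = (results.get("metadatas") or [[]])[0]
    weak = []
    for i, distance in enumerate(distances):
        if distance > max_distance:
            meta = (metadatas[i] if i < len(metadatas) else None) or {}
            weak.append({"id": ids[i], "name": meta.get("Name", "Unknown title"), "distance": distance})
    return weak
```

An empty list means every returned neighbor was within the cutoff; a non-empty list points at queries where the agent may need a fallback such as a web search.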

### 5. **Performance instrumentation**
Indexing is timed and counted so that scaling behavior is visible and performance regressions are easier to detect.
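
The same idea can be reused for other steps with a small context manager. This is a sketch rather than the notebook's own code: `timed` and the `stats` dictionary are hypothetical names, and `time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring short durations.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, stats: dict):
    """Record the wall-clock duration of the enclosed block under stats[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stats[label] = time.perf_counter() - start


stats = {}
with timed("indexing", stats):
    time.sleep(0.01)  # stand-in for the indexing loop
print(f"indexing took {stats['indexing']:.2f} seconds")
```

Collecting durations in one dictionary makes it easy to compare indexing and query times across runs as the dataset grows.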

---

This design establishes a clean retrieval pipeline that is simple enough for a notebook environment yet representative of the RAG architectures used in real-world agent systems.
