# [STARTER] Udaplay Project

## Part 01 - Offline RAG

In this part of the project, you'll build your VectorDB using Chroma.

The data is in the `project/starter/games` folder. Each file will become a document in the collection you'll create.
Example:
```json
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
```
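
Before indexing, it can help to confirm that a record has the shape shown above. The following is a minimal sketch in plain Python: the helper name `missing_fields` and the `REQUIRED_FIELDS` tuple are hypothetical and not part of the starter code, the field list is copied from the example record, and the description is shortened for brevity.

```python
import json

# Field names copied from the example record above
REQUIRED_FIELDS = ("Name", "Platform", "Genre", "Publisher", "Description", "YearOfRelease")

def missing_fields(record: dict) -> list:
    """Return the required fields that are absent from a game record."""
    return [field for field in REQUIRED_FIELDS if field not in record]

sample = json.loads("""
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator.",
  "YearOfRelease": 1997
}
""")

print(missing_fields(sample))            # complete record: nothing missing
print(missing_fields({"Name": "Pong"}))  # made-up partial record: all other fields missing
```

A check like this could run once per file before documents are built, so incomplete records are caught early instead of silently falling back to default values.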


### Setup

In [None]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [None]:
import sys
import os

sys.path.append(os.path.abspath("./starter/lib"))


In [1]:
import os
import json
import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv
from typing import List, Dict, Any
from pathlib import Path
from lib.documents import Document, Corpus
from lib.vector_db import VectorStore, VectorStoreManager

In [None]:
# TODO: Create a .env file with the following variables
# OPENAI_API_KEY="YOUR_KEY"
# CHROMA_OPENAI_API_KEY="YOUR_KEY"
# TAVILY_API_KEY="YOUR_KEY"

In [2]:
# TODO: Load environment variables
load_dotenv()

# Read the API keys from the environment
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
CHROMA_OPENAI_API_KEY = os.getenv('CHROMA_OPENAI_API_KEY')

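# ✅ Validate keys with more readable error messaging
def require_env(var, name):
    # Raise explicitly: `assert` statements are skipped when Python runs with -O
    if not var:
        raise EnvironmentError(f"[ConfigError] 🔒 {name} is missing! Please set it in your .env file.")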

require_env(OPENAI_API_KEY, "OPENAI_API_KEY")
require_env(CHROMA_OPENAI_API_KEY, "CHROMA_OPENAI_API_KEY")

# ✅ Detect and configure Vocareum keys
if OPENAI_API_KEY.startswith('voc-'):
    print("📡 Vocareum OpenAI key detected — routing requests through Vocareum endpoint.")
    os.environ['OPENAI_API_BASE'] = 'https://openai.vocareum.com/v1'

if CHROMA_OPENAI_API_KEY.startswith('voc-'):
    print("📡 Vocareum ChromaDB key detected — applying Vocareum configuration.")

📡 Vocareum OpenAI key detected — routing requests through Vocareum endpoint.
📡 Vocareum ChromaDB key detected — applying Vocareum configuration.


In [3]:
def load_games() -> List[Dict]:
    """
    Load all JSON game files from the 'games' directory into memory.

    Returns:
        A list of dictionaries, each representing a game.
    """
    games_dir = Path("games")
    game_files = sorted(games_dir.glob("*.json"))

    if not game_files:
        print(f"[Warning] No game files found in '{games_dir}/'.")
        return []

    games: List[Dict] = []
    for file_path in game_files:
        with open(file_path, "r", encoding="utf-8") as fp:
            games.append(json.load(fp))

    print(f"[Info] Loaded {len(games)} game files from '{games_dir}/'")
    print("Example keys from first game file:", list(games[0].keys()))
    return games

# Load game data
games_data = load_games()

# Preview structure of the first game
if games_data:
    print("\nSample game structure:")
    print(json.dumps(games_data[0], indent=2))
else:
    print("No game data available to display.")

[Info] Loaded 15 game files from 'games/'
Example keys from first game file: ['Name', 'Platform', 'Genre', 'Publisher', 'Description', 'YearOfRelease']

Sample game structure:
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
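
`load_games` assumes every file in the folder is valid JSON, so a single malformed file stops the whole load. As a hedged alternative, the sketch below skips files that fail to parse and reports them. The function name `load_json_dir` and the temporary demo files are hypothetical and exist only so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path

def load_json_dir(directory: Path) -> list:
    """Load every *.json file in a directory, skipping files that fail to parse."""
    records = []
    for file_path in sorted(directory.glob("*.json")):
        try:
            with open(file_path, "r", encoding="utf-8") as fp:
                records.append(json.load(fp))
        except json.JSONDecodeError as err:
            print(f"[Warning] Skipping {file_path.name}: {err}")
    return records

# Demo with throwaway files so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "001.json").write_text('{"Name": "Gran Turismo"}', encoding="utf-8")
    (tmp_dir / "002.json").write_text("{not valid json", encoding="utf-8")
    demo_records = load_json_dir(tmp_dir)

print(len(demo_records))  # only the valid file is kept
```

Whether to skip or fail fast is a design choice: skipping keeps the notebook running, while failing fast makes data problems impossible to miss.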


In [4]:
def create_game_document(game_data: Dict, index: int) -> Document:
    """
    Convert a single game's metadata dictionary into a structured Document.

    Args:
        game_data: A dictionary containing information about the game.
        index: An integer used to generate a unique document ID.

    Returns:
        A Document object containing formatted game information and metadata.
    """
    name = game_data.get("Name", "Unknown")
    platform = game_data.get("Platform", "Unknown")
    genre = game_data.get("Genre", "Unknown")
    publisher = game_data.get("Publisher", "Unknown")
    release_year = game_data.get("YearOfRelease", "Unknown")
    description = game_data.get("Description", "No description available")

    content = "\n".join([
        f"Game: {name}",
        f"Platform: {platform}",
        f"Genre: {genre}",
        f"Publisher: {publisher}",
        f"Release Year: {release_year}",
        f"Description: {description}",
    ])

    metadata = {
        "name": name,
        "platform": platform,
        "genre": genre,
        "publisher": publisher,
        "release_year": str(release_year),
        "description": description,
    }

    clean_name = (
        name.lower()
            .replace(" ", "_")
            .replace(":", "")
            .replace("-", "_")
            .replace("'", "")
    )
    doc_id = f"game_{index:03d}_{clean_name}"

    return Document(id=doc_id, content=content, metadata=metadata)


def build_corpus(games: List[Dict]) -> Corpus:
    """
    Transform a list of game dictionaries into a Corpus of Document objects.

    Args:
        games: A list of game metadata dictionaries.

    Returns:
        A Corpus object containing structured documents.
    """
    documents = [create_game_document(game, i) for i, game in enumerate(games)]
    print(f"Corpus built successfully with {len(documents)} document(s).")
    return Corpus(documents)

# Generate document corpus from game data
game_corpus = build_corpus(games_data)

# Preview the first document
print("\nSample Document:")
print(f"ID: {game_corpus[0].id}")
print(f"Content Preview:\n{game_corpus[0].content[:200]}...")
print(f"Metadata:\n{game_corpus[0].metadata}")


Corpus built successfully with 15 document(s).

Sample Document:
ID: game_000_gran_turismo
Content Preview:
Game: Gran Turismo
Platform: PlayStation 1
Genre: Racing
Publisher: Sony Computer Entertainment
Release Year: 1997
Description: A realistic racing simulator featuring a wide array of cars and tracks, ...
Metadata:
{'name': 'Gran Turismo', 'platform': 'PlayStation 1', 'genre': 'Racing', 'publisher': 'Sony Computer Entertainment', 'release_year': '1997', 'description': 'A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.'}
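
The chained `.replace()` calls in `create_game_document` handle spaces, colons, hyphens, and apostrophes, but leave accented letters and other punctuation in the ID. A more general approach, sketched below under the assumption that ASCII-only IDs are wanted, strips accents and collapses every other non-alphanumeric run into an underscore. The helper name `slugify` is hypothetical, and the input titles are illustrative.

```python
import re
import unicodedata

def slugify(name: str) -> str:
    """Lowercase, strip accents, and collapse non-alphanumeric runs into underscores."""
    # Decompose accented characters (e.g. "é" -> "e" + combining accent), then drop non-ASCII
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Drop apostrophes outright, matching the behavior of the original cleanup
    ascii_name = ascii_name.replace("'", "")
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")

print(slugify("Gran Turismo"))
print(slugify("Grand Theft Auto: San Andreas"))
print(slugify("Pokémon Gold and Silver"))
```

For titles without accents or unusual punctuation this yields the same slugs as the original cleanup, so existing IDs such as `game_000_gran_turismo` would be unchanged.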


### VectorDB Instance, Collection, Add documents

In [5]:
vector_manager = VectorStoreManager(CHROMA_OPENAI_API_KEY)

def index_documents(
    corpus: Corpus, 
    manager: VectorStoreManager, 
    store_name: str = "udaplay_games"
) -> VectorStore:
    """
    Index a collection of documents into a named vector store using the provided manager.

    Args:
        corpus: The Corpus object containing documents to index.
        manager: A VectorStoreManager instance to manage store lifecycle.
        store_name: Name of the target vector store.

    Returns:
        A VectorStore instance with the indexed documents.
    """
    print(f"[Init] Creating or replacing vector store '{store_name}'...")
    store = manager.create_store(store_name, force=True)

    print("[Indexing] Storing documents with embeddings...")
    store.add(corpus)

    print(f"[Success] Indexed {len(corpus)} documents into store: '{store_name}'")
    return store

# Build and index the game corpus
vector_store = index_documents(game_corpus, vector_manager)

# Basic verification of indexed content
retrieved = vector_store.get(limit=3)
print(f"\n[Verification] Retrieved {len(retrieved['ids'])} documents:")
for i, doc_id in enumerate(retrieved["ids"], start=1):
    print(f"  {i}. {doc_id}")

[Init] Creating or replacing vector store 'udaplay_games'...
[Indexing] Storing documents with embeddings...
[Success] Indexed 15 documents into store: 'udaplay_games'

[Verification] Retrieved 3 documents:
  1. game_000_gran_turismo
  2. game_001_grand_theft_auto_san_andreas
  3. game_002_gran_turismo_5
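
The search cell that follows reports `1 - distance` as a similarity score. That reading holds when the collection uses cosine distance, which is an assumption here because the configuration inside `lib.vector_db` is not shown. With another metric, such as squared L2, `1 - distance` would not be a cosine similarity. The sketch below computes cosine distance by hand on toy vectors to show the relationship; the function and vector names are illustrative only.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity, for two equal-length non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]   # parallel to the query, different length
orthogonal = [0.0, 3.0]       # unrelated direction

print(1 - cosine_distance(query_vec, same_direction))  # similarity of parallel vectors
print(1 - cosine_distance(query_vec, orthogonal))      # similarity of orthogonal vectors
```

Vector length does not affect the result, only direction does, which is why cosine distance is a common choice for text embeddings.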


In [None]:
# Format and display the results of a semantic search query
def display_search_results(query: str, results: Dict):
    print("=" * 70)
    print(f"Search Query: {query}")
    print("=" * 70)

    documents = results.get("documents", [[]])[0]
    distances = results.get("distances", [[]])[0]
    metadatas = results.get("metadatas", [[]])[0]

    if documents:
        for i, (doc, distance, metadata) in enumerate(zip(documents, distances, metadatas), start=1):
            # Treats distance as cosine distance (assumed collection setting), so 1 - distance is a similarity
            score = 1 - distance
            print(f"[{i}] {metadata['name']} ({metadata['release_year']} • {metadata['platform']})")
            print(f"     Genre: {metadata['genre']} | Publisher: {metadata['publisher']}")
            print(f"     Similarity Score: {score:.3f}")
            print(f"     Description: {metadata['description'][:120]}...\n")
    else:
        print("No matches found for this query.\n")


# Execute several search queries to validate the vector store
def run_demo_searches(store: VectorStore):
    sample_queries = [
        "Pokemon games from the 90s",
        "First 3D Mario platformer", 
        "Mortal Kombat fighting game",
        "RPG games by Nintendo",
        "Games released in 1999",
    ]

    for query in sample_queries:
        results = store.query(query_texts=[query], n_results=3)
        display_search_results(query, results)

    # Retrieve records based on metadata filters
    metadata_results = store.get(where={"publisher": "Nintendo"}, limit=5)
    print("\nGames published by Nintendo:")
    for i, metadata in enumerate(metadata_results["metadatas"], start=1):
        print(f"  {i}. {metadata['name']} ({metadata['release_year']} – {metadata['platform']})")

    # Combine semantic search with metadata constraints
    platform_filtered = store.query(
        query_texts=["adventure game"],
        n_results=3,
        where={"platform": "Nintendo 64"}
    )
    print("\nAdventure titles on Nintendo 64:")
    for metadata in platform_filtered["metadatas"][0]:
        print(f"  - {metadata['name']} ({metadata['genre']})")


# Run full demonstration of the vector search system
run_demo_searches(vector_store)

print("\nDemo complete. Vector store is populated and functioning as expected.")


Search Query: Pokemon games from the 90s
[1] Pokémon Gold and Silver (1999 • Game Boy Color)
     Genre: Role-playing | Publisher: Nintendo
     Similarity Score: 0.732
     Description: Second-generation Pokémon games introducing new regions, Pokémon, and gameplay mechanics....

[2] Pokémon Ruby and Sapphire (2002 • Game Boy Advance)
     Genre: Role-playing | Publisher: Nintendo
     Similarity Score: 0.726
     Description: Third-generation Pokémon games set in the Hoenn region, featuring new Pokémon and double battles....

[3] Super Mario 64 (1996 • Nintendo 64)
     Genre: Platformer | Publisher: Nintendo
     Similarity Score: 0.612
     Description: A groundbreaking 3D platformer that set new standards for the genre, featuring Mario's quest to rescue Princess Peach....

Search Query: First 3D Mario platformer
[1] Super Mario 64 (1996 • Nintendo 64)
     Genre: Platformer | Publisher: Nintendo
     Similarity Score: 0.780
     Description: A groundbreaking 3D platformer that set 