# [STARTER] Udaplay Project

## Part 01 - Offline RAG

In this part of the project, you'll build your VectorDB using Chroma.

The data lives in the `project/starter/games` folder. Each JSON file becomes one document in the collection you'll create.
Example:
```json
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
```
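
Before indexing, it can help to confirm that a file actually contains the six fields shown above. The following is a minimal sketch; the helper name `load_game` and the `REQUIRED_FIELDS` constant are illustrative assumptions, not part of the starter code.

```python
import json
from pathlib import Path

# Field names taken from the example record above.
REQUIRED_FIELDS = (
    "Name", "Platform", "Genre", "Publisher", "Description", "YearOfRelease",
)


def load_game(path):
    """Read one game JSON file and fail early if a field is missing."""
    with open(path, "r", encoding="utf-8") as f:
        game = json.load(f)
    missing = [field for field in REQUIRED_FIELDS if field not in game]
    if missing:
        raise ValueError(f"{Path(path).name} is missing fields: {missing}")
    return game
```

Calling `load_game` on each file before building the document text would surface a malformed record as a clear error rather than a `KeyError` deep inside the ingestion loop.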


### Setup

In [1]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [2]:
import os
import json
import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

In [3]:
# Environment variables are stored in ../../.env file with:
# OPENAI_API_KEY - For LLM API calls
# CHROMA_OPENAI_API_KEY - For embedding generation (same as OPENAI_API_KEY)
# TAVILY_API_KEY - For web search fallback (used in Part 2)

In [4]:
# Load environment variables from .env file
load_dotenv("../../.env")  # Path relative to this notebook (goes up to MASTERS directory)

# Verify keys are loaded
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
CHROMA_OPENAI_API_KEY = os.getenv("CHROMA_OPENAI_API_KEY", OPENAI_API_KEY)

print(f"OpenAI API Key loaded: {'Yes' if OPENAI_API_KEY else 'No'}")
print(f"Chroma OpenAI API Key loaded: {'Yes' if CHROMA_OPENAI_API_KEY else 'No'}")

OpenAI API Key loaded: Yes
Chroma OpenAI API Key loaded: Yes
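
The check above only prints "Yes" or "No", so a missing key would surface later as an API error. As a hedged alternative, the sketch below fails immediately; `require_env` is a hypothetical helper name, and the fallback mirrors how `CHROMA_OPENAI_API_KEY` defaults to `OPENAI_API_KEY` in the cell above.

```python
import os


def require_env(name, fallback=None):
    """Return the environment variable's value, else the fallback, else raise."""
    value = os.getenv(name) or fallback
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Possible usage after load_dotenv(...):
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
# CHROMA_OPENAI_API_KEY = require_env("CHROMA_OPENAI_API_KEY", fallback=OPENAI_API_KEY)
```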


### VectorDB Instance

In [5]:
# Instantiate ChromaDB Client with persistent storage
# This ensures the database persists across notebook restarts
# Data will be stored in ./chromadb folder
chroma_client = chromadb.PersistentClient(path="chromadb")

print(f"ChromaDB client initialized at: ./chromadb")
print(f"Existing collections: {chroma_client.list_collections()}")

ChromaDB client initialized at: ./chromadb
Existing collections: [Collection(name=udaplay)]


### Collection

In [6]:
# Create OpenAI embedding function
# This converts text into vector embeddings (arrays of 1536 numbers)
# The same embedding function must be used for both indexing and querying
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=CHROMA_OPENAI_API_KEY,
    model_name="text-embedding-3-small"  # Efficient and cost-effective model
)

print("OpenAI embedding function created (model: text-embedding-3-small)")

OpenAI embedding function created (model: text-embedding-3-small)


In [7]:
# Create or get the collection
# get_or_create_collection returns the existing collection when one is already
# stored on disk, whereas create_collection raises an error in that case
collection = chroma_client.get_or_create_collection(
    name="udaplay",
    embedding_function=embedding_fn
)

print(f"Collection 'udaplay' ready")
print(f"Current document count: {collection.count()}")

Collection 'udaplay' ready
Current document count: 15


### Add documents

In [8]:
# Process and add game documents to the collection
data_dir = "games"

# Track how many documents we add
added_count = 0
skipped_count = 0

# Get existing IDs to avoid duplicates (important when re-running the notebook)
# include=[] skips documents and metadatas; IDs are always returned
existing_ids = set(collection.get(include=[])["ids"])

for file_name in sorted(os.listdir(data_dir)):
    if not file_name.endswith(".json"):
        continue

    # Use file name (like 001) as ID
    doc_id = os.path.splitext(file_name)[0]
    
    # Skip if document already exists
    if doc_id in existing_ids:
        skipped_count += 1
        continue

    file_path = os.path.join(data_dir, file_name)
    with open(file_path, "r", encoding="utf-8") as f:
        game = json.load(f)

    # Create rich content for semantic search
    # Include all relevant info so queries like "racing game" or "Nintendo publisher" work well
    content = (
        f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - "
        f"{game['Description']} "
        f"Genre: {game['Genre']}. Publisher: {game['Publisher']}."
    )

    # Add document with full game data as metadata
    collection.add(
        ids=[doc_id],
        documents=[content],
        metadatas=[game]
    )
    
    added_count += 1
    print(f"Added: {game['Name']} ({game['Platform']})")

print(f"\n{'='*50}")
print(f"Documents added: {added_count}")
print(f"Documents skipped (already exist): {skipped_count}")
print(f"Total documents in collection: {collection.count()}")


Documents added: 0
Documents skipped (already exist): 15
Total documents in collection: 15
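
The loop above calls `collection.add` once per file, which means one embedding request per document. A possible variation, sketched below under the assumption that the whole dataset fits comfortably in a single request, is to collect everything first and add it in one call. The helper names `build_content` and `build_batch` are illustrative and not part of the starter code.

```python
import json
import os


def build_content(game):
    """Same text layout as the ingestion cell above."""
    return (
        f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - "
        f"{game['Description']} "
        f"Genre: {game['Genre']}. Publisher: {game['Publisher']}."
    )


def build_batch(data_dir, existing_ids=()):
    """Collect ids, documents and metadatas for every new JSON file in data_dir."""
    existing = set(existing_ids)
    ids, documents, metadatas = [], [], []
    for file_name in sorted(os.listdir(data_dir)):
        if not file_name.endswith(".json"):
            continue
        doc_id = os.path.splitext(file_name)[0]  # e.g. "001"
        if doc_id in existing:
            continue  # already indexed, skip to keep re-runs idempotent
        with open(os.path.join(data_dir, file_name), "r", encoding="utf-8") as f:
            game = json.load(f)
        ids.append(doc_id)
        documents.append(build_content(game))
        metadatas.append(game)
    return ids, documents, metadatas


# With the notebook's `collection` and `existing_ids` in scope:
# ids, documents, metadatas = build_batch("games", existing_ids)
# if ids:  # avoid calling add() with empty lists
#     collection.add(ids=ids, documents=documents, metadatas=metadatas)
```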


### Semantic Search Demonstration

Now let's verify our vector database works by performing semantic searches. Notice how the search finds relevant games even when the query doesn't exactly match the document text.

In [9]:
def search_games(query: str, n_results: int = 3):
    """
    Search the vector database for games matching the query.
    
    Args:
        query: Natural language search query
        n_results: Number of results to return
    
    Returns:
        The raw results from ChromaDB
    """
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    
    print(f"\nQuery: '{query}'")
    print("=" * 60)
    
    for i, (doc, metadata, distance) in enumerate(zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    )):
        print(f"\nResult {i+1} (Distance: {distance:.4f}):")
        print(f"  Name: {metadata['Name']}")
        print(f"  Platform: {metadata['Platform']}")
        print(f"  Year: {metadata['YearOfRelease']}")
        print(f"  Genre: {metadata['Genre']}")
        print(f"  Publisher: {metadata['Publisher']}")
    
    return results
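
The distances printed by `search_games` are easier to interpret with a reference point. The sketch below assumes the collection uses Chroma's default distance space (squared L2) and that the embeddings are unit length, which is the documented behaviour for OpenAI embeddings; under those two assumptions, cosine similarity equals `1 - distance / 2`. The function names are illustrative, and the vectors are toy values, not real embeddings.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distance_to_cosine(distance):
    """Valid only for unit-length vectors compared with squared L2 distance."""
    return 1.0 - distance / 2.0


# Toy unit vectors to check the identity ||a - b||^2 = 2 - 2 * cos(a, b).
a = [1.0, 0.0]
b = [0.6, 0.8]
assert math.isclose(distance_to_cosine(squared_l2(a, b)), cosine_similarity(a, b))
```

For instance, a reported distance of 0.3795 would map to a cosine similarity of about 0.81 if both assumptions hold. If the collection were created with a different distance space, this conversion would not apply.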

In [10]:
# Example 1: Search by game type/genre
# Notice: we say "racing simulation" - it finds Gran Turismo even though
# the exact phrase might not appear in the document
search_games("realistic racing simulation games")


Query: 'realistic racing simulation games'

Result 1 (Distance: 0.3795):
  Name: Gran Turismo
  Platform: PlayStation 1
  Year: 1997
  Genre: Racing
  Publisher: Sony Computer Entertainment

Result 2 (Distance: 0.4133):
  Name: Gran Turismo 5
  Platform: PlayStation 3
  Year: 2010
  Genre: Racing
  Publisher: Sony Computer Entertainment

Result 3 (Distance: 0.6403):
  Name: Grand Theft Auto: San Andreas
  Platform: PlayStation 2
  Year: 2004
  Genre: Action-adventure
  Publisher: Rockstar Games


{'ids': [['001', '003', '002']],
 'embeddings': None,
 'documents': [['[PlayStation 1] Gran Turismo (1997) - A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.',
   '[PlayStation 3] Gran Turismo 5 (2010) - A comprehensive racing simulator featuring a vast selection of vehicles and tracks, with realistic driving physics.',
   "[PlayStation 2] Grand Theft Auto: San Andreas (2004) - An expansive open-world game set in the fictional state of San Andreas, following the story of Carl 'CJ' Johnson."]],
 'uris': None,
 'included': ['documents', 'metadatas', 'distances'],
 'data': None,
 'metadatas': [[{'Platform': 'PlayStation 1',
    'Publisher': 'Sony Computer Entertainment',
    'YearOfRelease': 1997,
    'Genre': 'Racing',
    'Description': 'A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.',
    'Name': 'Gran Turismo'},
   {'Description': 'A comprehensive racing simulato

In [11]:
# Example 2: Search by character/franchise
# Semantic search understands "Mario" relates to platformer games
search_games("Mario platformer games")


Query: 'Mario platformer games'

Result 1 (Distance: 0.5055):
  Name: Super Mario World
  Platform: Super Nintendo Entertainment System (SNES)
  Year: 1990
  Genre: Platformer
  Publisher: Nintendo

Result 2 (Distance: 0.5339):
  Name: Super Mario 64
  Platform: Nintendo 64
  Year: 1996
  Genre: Platformer
  Publisher: Nintendo

Result 3 (Distance: 0.6228):
  Name: Super Smash Bros. Melee
  Platform: GameCube
  Year: 2001
  Genre: Fighting
  Publisher: Nintendo


{'ids': [['008', '009', '010']],
 'embeddings': None,
 'documents': [['[Super Nintendo Entertainment System (SNES)] Super Mario World (1990) - A classic platformer where Mario embarks on a quest to save Princess Toadstool and Dinosaur Land from Bowser.',
   "[Nintendo 64] Super Mario 64 (1996) - A groundbreaking 3D platformer that set new standards for the genre, featuring Mario's quest to rescue Princess Peach.",
   '[GameCube] Super Smash Bros. Melee (2001) - A crossover fighting game featuring characters from various Nintendo franchises battling it out in dynamic arenas.']],
 'uris': None,
 'included': ['documents', 'metadatas', 'distances'],
 'data': None,
 'metadatas': [[{'Platform': 'Super Nintendo Entertainment System (SNES)',
    'Genre': 'Platformer',
    'YearOfRelease': 1990,
    'Publisher': 'Nintendo',
    'Name': 'Super Mario World',
    'Description': 'A classic platformer where Mario embarks on a quest to save Princess Toadstool and Dinosaur Land from Bowser.'},
   {'De

In [12]:
# Example 3: Search with a question (like a user would ask)
# This is how the agent will use the database in Part 2
search_games("When was Pokemon Gold and Silver released?")


Query: 'When was Pokemon Gold and Silver released?'

Result 1 (Distance: 0.3638):
  Name: Pokémon Gold and Silver
  Platform: Game Boy Color
  Year: 1999
  Genre: Role-playing
  Publisher: Nintendo

Result 2 (Distance: 0.5084):
  Name: Pokémon Ruby and Sapphire
  Platform: Game Boy Advance
  Year: 2002
  Genre: Role-playing
  Publisher: Nintendo

Result 3 (Distance: 0.7555):
  Name: Super Mario 64
  Platform: Nintendo 64
  Year: 1996
  Genre: Platformer
  Publisher: Nintendo


{'ids': [['006', '007', '009']],
 'embeddings': None,
 'documents': [['[Game Boy Color] Pokémon Gold and Silver (1999) - Second-generation Pokémon games introducing new regions, Pokémon, and gameplay mechanics.',
   '[Game Boy Advance] Pokémon Ruby and Sapphire (2002) - Third-generation Pokémon games set in the Hoenn region, featuring new Pokémon and double battles.',
   "[Nintendo 64] Super Mario 64 (1996) - A groundbreaking 3D platformer that set new standards for the genre, featuring Mario's quest to rescue Princess Peach."]],
 'uris': None,
 'included': ['documents', 'metadatas', 'distances'],
 'data': None,
 'metadatas': [[{'YearOfRelease': 1999,
    'Platform': 'Game Boy Color',
    'Name': 'Pokémon Gold and Silver',
    'Genre': 'Role-playing',
    'Publisher': 'Nintendo',
    'Description': 'Second-generation Pokémon games introducing new regions, Pokémon, and gameplay mechanics.'},
   {'Description': 'Third-generation Pokémon games set in the Hoenn region, featuring new Pokémo

In [13]:
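# Example 4: Search by era/time period
# Each document's text includes the release year, so a query about the 1990s
# tends to rank games from that decade first; this is similarity ranking,
# not a strict filter on YearOfRelease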
search_games("classic video games from the 1990s")


Query: 'classic video games from the 1990s'

Result 1 (Distance: 0.5736):
  Name: Super Mario World
  Platform: Super Nintendo Entertainment System (SNES)
  Year: 1990
  Genre: Platformer
  Publisher: Nintendo

Result 2 (Distance: 0.6005):
  Name: Gran Turismo
  Platform: PlayStation 1
  Year: 1997
  Genre: Racing
  Publisher: Sony Computer Entertainment

Result 3 (Distance: 0.6121):
  Name: Pokémon Gold and Silver
  Platform: Game Boy Color
  Year: 1999
  Genre: Role-playing
  Publisher: Nintendo


{'ids': [['008', '001', '006']],
 'embeddings': None,
 'documents': [['[Super Nintendo Entertainment System (SNES)] Super Mario World (1990) - A classic platformer where Mario embarks on a quest to save Princess Toadstool and Dinosaur Land from Bowser.',
   '[PlayStation 1] Gran Turismo (1997) - A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.',
   '[Game Boy Color] Pokémon Gold and Silver (1999) - Second-generation Pokémon games introducing new regions, Pokémon, and gameplay mechanics.']],
 'uris': None,
 'included': ['documents', 'metadatas', 'distances'],
 'data': None,
 'metadatas': [[{'Description': 'A classic platformer where Mario embarks on a quest to save Princess Toadstool and Dinosaur Land from Bowser.',
    'Name': 'Super Mario World',
    'Platform': 'Super Nintendo Entertainment System (SNES)',
    'Publisher': 'Nintendo',
    'Genre': 'Platformer',
    'YearOfRelease': 1990},
   {'Name': 'Gran Turismo',
    'Ye

### Summary

In this notebook, we successfully:

1. **Set up ChromaDB** with persistent storage (data survives restarts)
2. **Created a collection** with OpenAI embeddings for semantic understanding
3. **Processed 15 game documents** from JSON files with rich metadata
4. **Demonstrated semantic search** - finding relevant games by meaning, not just keywords

The vector database is now ready to be used by our AI agent in Part 02!
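
`collection.query` returns one inner list per query text, which is why the search function indexes with `[0]`. For downstream use, a flat list of hits is often more convenient. The sketch below is one possible shape for that; `flatten_hits` and the `max_distance` cutoff are hypothetical, and the threshold value in the example is chosen only for illustration.

```python
def flatten_hits(results, max_distance=None):
    """Flatten the result of a single-query collection.query call.

    Expects the keys "ids", "documents", "metadatas" and "distances",
    each holding one inner list for the single query text.
    """
    hits = []
    for doc_id, doc, meta, dist in zip(
        results["ids"][0],
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        # Optionally drop weak matches; smaller distance means closer.
        if max_distance is not None and dist > max_distance:
            continue
        hits.append(
            {"id": doc_id, "document": doc, "metadata": meta, "distance": dist}
        )
    return hits


# Sample shaped like the Pokémon query output above (documents shortened).
sample = {
    "ids": [["006", "007", "009"]],
    "documents": [["Pokémon Gold and Silver", "Pokémon Ruby and Sapphire", "Super Mario 64"]],
    "metadatas": [[{"YearOfRelease": 1999}, {"YearOfRelease": 2002}, {"YearOfRelease": 1996}]],
    "distances": [[0.3638, 0.5084, 0.7555]],
}
close_hits = flatten_hits(sample, max_distance=0.6)  # keeps "006" and "007"
```

A cutoff like this would drop the unrelated third result from the Pokémon query, though a suitable value depends on the data and would need to be tuned.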

In [14]:
# Final collection statistics
print("Final Collection Statistics")
print("=" * 40)
print(f"Collection name: udaplay")
print(f"Total documents: {collection.count()}")
print(f"Embedding model: text-embedding-3-small")
print(f"Storage path: ./chromadb")

Final Collection Statistics
Collection name: udaplay
Total documents: 15
Embedding model: text-embedding-3-small
Storage path: ./chromadb
