# [STARTER] Udaplay Project

## Part 01 - Offline RAG

In this part of the project, you'll build your VectorDB using Chroma.

The data is inside folder `project/starter/games`. Each file will become a document in the collection you'll create.
Example:
```json
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
```


### Setup

In [10]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [11]:
import os
import json
import chromadb
from chromadb.utils import embedding_functions
from openai import OpenAI
from dotenv import load_dotenv

In [12]:
# Load environment variables
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY is missing in .env")
os.environ["CHROMA_OPENAI_API_KEY"] = OPENAI_API_KEY

print("OPENAI_API_KEY loaded:", OPENAI_API_KEY is not None)

OPENAI_API_KEY loaded: True


### Vector DB instance and collection with OpenAI 
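The OpenAI-backed collection in this section does not work in this environment because the API key is invalid. The code is included to fulfill the project requirements. The following section creates the same collection with a local embedding function that needs no OpenAI key.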

In [13]:
# Initialize ChromaDB and create collection

CHROMA_PATH = "udaplay_db"   # new path (same as chroma_db)
DATA_DIR = "games"

# PersistentClient on the new path
chroma_client = chromadb.PersistentClient(path=CHROMA_PATH)

# OpenAI embedding function (as required by the assignment)
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=OPENAI_API_KEY,
    model_name="text-embedding-3-small",
)

# Start clean: if a 'udaplay' collection already exists, delete it
try:
    chroma_client.delete_collection("udaplay")
    print("Deleted existing 'udaplay' collection.")
except Exception:
    print("No existing 'udaplay' collection to delete.")

# Create the new collection
collection = chroma_client.create_collection(
    name="udaplay",
    embedding_function=embedding_fn,
)

print("Collections now:", [c.name for c in chroma_client.list_collections()])


Deleted existing 'udaplay' collection.
Collections now: ['udaplay']
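
Because the OpenAI key was rejected here, one option is to pick the embedding backend at runtime instead of keeping two near-identical cells. The sketch below is only an illustration: `pick_embedding_function` is a hypothetical helper, not part of Chroma, and it assumes the embedding function is a callable that accepts a list of texts and raises an exception when the key is invalid.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def pick_embedding_function(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    probe_text: str = "ping",
) -> Tuple[str, T]:
    """Return ("primary", fn) if the primary embedder answers a probe call,
    otherwise ("fallback", fn).

    Hypothetical helper: both arguments are zero-argument factories so that
    nothing is constructed or called until it is actually needed.
    """
    try:
        fn = primary()
        # Assumption: an invalid key surfaces as an exception on the first call.
        fn([probe_text])
        return "primary", fn
    except Exception as exc:
        print(f"Primary embedding backend unavailable ({type(exc).__name__}); using fallback.")
        return "fallback", fallback()


# Possible use with the names from this notebook (not executed here):
# backend, embedding_fn = pick_embedding_function(
#     primary=lambda: embedding_functions.OpenAIEmbeddingFunction(
#         api_key=OPENAI_API_KEY, model_name="text-embedding-3-small"
#     ),
#     fallback=embedding_functions.DefaultEmbeddingFunction,
# )
```

Note that the probe sends one real request to the primary backend, so with a valid key it incurs a small embedding call.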


### Vector DB instance and collection without OpenAI 

In [14]:
# Chroma DB config (local, no OpenAI)
CHROMA_PATH = "udaplay_db"   # writable path
DATA_DIR = "games"

# Create / reuse persistent client
chroma_client = chromadb.PersistentClient(path=CHROMA_PATH)

# Local / built-in embedding function (no external services)
embedding_fn = embedding_functions.DefaultEmbeddingFunction()

# Start fresh: delete any existing 'udaplay' collection
try:
    chroma_client.delete_collection("udaplay")
    print("Deleted existing 'udaplay' collection.")
except Exception:
    print("No existing 'udaplay' collection to delete.")

# Create collection with local embeddings
collection = chroma_client.create_collection(
    name="udaplay",
    embedding_function=embedding_fn,
)

print("Collections now:", [c.name for c in chroma_client.list_collections()])

Deleted existing 'udaplay' collection.
Collections now: ['udaplay']


In [15]:
# Ingest games data from JSON 

num_added = 0

for file_name in sorted(os.listdir(DATA_DIR)):
    if not file_name.endswith(".json"):
        continue

    file_path = os.path.join(DATA_DIR, file_name)
    with open(file_path, "r", encoding="utf-8") as f:
        game = json.load(f)

    # Read the fields, falling back to defaults when a key is missing
    name = game.get("Name", "Unknown")
    platform = game.get("Platform", "Unknown")
    year = game.get("YearOfRelease", "Unknown")
    desc = game.get("Description", "")

    # Text that will be embedded and searched
    content = f"[{platform}] {name} ({year}) - {desc}"

    # File name without .json as ID (e.g. "001")
    doc_id = os.path.splitext(file_name)[0]

    collection.add(
        ids=[doc_id],
        documents=[content],
        metadatas=[game],
    )

    num_added += 1

print(f"✔ Ingestion completed. Number of items added: {num_added}")
print("Collection count():", collection.count())


✔ Ingestion completed. Number of items added: 15
Collection count(): 15
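
The ingestion cell calls `collection.add` once per file. As a possible variation, the ids, documents, and metadata can be collected first and passed in a single `add` call, since `add` accepts lists. The sketch below covers only the collection step so it runs without Chroma; `build_batch` is a hypothetical helper name, and the document format mirrors the one used in the cell above.

```python
import json
import os
from typing import Any, Dict, List, Tuple


def build_batch(data_dir: str) -> Tuple[List[str], List[str], List[Dict[str, Any]]]:
    """Read every .json file in data_dir and return (ids, documents, metadatas)."""
    ids: List[str] = []
    documents: List[str] = []
    metadatas: List[Dict[str, Any]] = []

    for file_name in sorted(os.listdir(data_dir)):
        if not file_name.endswith(".json"):
            continue

        with open(os.path.join(data_dir, file_name), "r", encoding="utf-8") as f:
            game = json.load(f)

        # Same text layout as the ingestion cell: "[Platform] Name (Year) - Description"
        documents.append(
            f"[{game.get('Platform', 'Unknown')}] {game.get('Name', 'Unknown')} "
            f"({game.get('YearOfRelease', 'Unknown')}) - {game.get('Description', '')}"
        )
        ids.append(os.path.splitext(file_name)[0])  # file name without extension
        metadatas.append(game)

    return ids, documents, metadatas


# Possible use with the names from this notebook (not executed here):
# ids, documents, metadatas = build_batch(DATA_DIR)
# collection.add(ids=ids, documents=documents, metadatas=metadatas)
```

With 15 files the difference is negligible; batching mainly matters when the embedding function calls a remote service per request.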


In [16]:
# Peek at the first items in the collection
peek = collection.peek()
print("Peek result:")
print(peek)

Peek result:
{'ids': ['001', '002', '003', '004', '005', '006', '007', '008', '009', '010'], 'embeddings': array([[-0.04768373, -0.01363629, -0.02100486, ..., -0.00430921,
        -0.00090423,  0.0972009 ],
       [-0.01098956,  0.01837821, -0.06527878, ...,  0.07810403,
        -0.07336764,  0.03125126],
       [-0.05645394, -0.04063236,  0.02586169, ..., -0.0154053 ,
         0.00325583,  0.09907703],
       ...,
       [-0.06329967,  0.00596636, -0.00042587, ...,  0.01854568,
        -0.00551006,  0.0540259 ],
       [-0.02731697, -0.00917329, -0.01271181, ..., -0.01593885,
         0.05847818,  0.0173116 ],
       [-0.04662724, -0.04988441, -0.06761332, ...,  0.03523087,
        -0.04189238,  0.04914557]], shape=(10, 384)), 'documents': ['[PlayStation 1] Gran Turismo (1997) - A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.', "[PlayStation 2] Grand Theft Auto: San Andreas (2004) - An expansive open-world game set in the f

In [17]:
# Function for prettification of query output
def pretty_print_results(title: str, results):
    print(f"\n### {title}")
    
    metas_list = results.get("metadatas", [[]])
    if not metas_list or not metas_list[0]:
        print("No matching games found.")
        return

    docs_list = results.get("documents", [[]])
    dists_list = results.get("distances", [[]])

    metas = metas_list[0]
    docs = docs_list[0]
    dists = dists_list[0]

    for meta, doc, dist in zip(metas, docs, dists):
        print(
            f"- {meta.get('Name')} ({meta.get('YearOfRelease')}) "
            f"on {meta.get('Platform')} | Genre: {meta.get('Genre')} "
            f"| distance={dist:.4f}"
        )

In [18]:
# Semantic search through udaplay-collection 

# Query 1 – platform-specific
q1 = "realistic racing games on PlayStation 1"
res1 = collection.query(
    query_texts=[q1],
    n_results=5,
    include=["metadatas", "documents", "distances"],
)
pretty_print_results(f"Query: {q1}", res1)

# Query 2 – genre / device-based
q2 = "turn-based RPGs on handheld consoles"
res2 = collection.query(
    query_texts=[q2],
    n_results=5,
    include=["metadatas", "documents", "distances"],
)
pretty_print_results(f"Query: {q2}", res2)



### Query: realistic racing games on PlayStation 1
- Gran Turismo (1997) on PlayStation 1 | Genre: Racing | distance=0.7439
- Gran Turismo 5 (2010) on PlayStation 3 | Genre: Racing | distance=0.8050
- Grand Theft Auto: San Andreas (2004) on PlayStation 2 | Genre: Action-adventure | distance=1.0872
- Wii Sports (2006) on Wii | Genre: Sports | distance=1.1851
- Kinect Adventures! (2010) on Xbox 360 | Genre: Party | distance=1.2881

### Query: turn-based RPGs on handheld consoles
- Wii Sports (2006) on Wii | Genre: Sports | distance=1.1072
- Super Mario World (1990) on Super Nintendo Entertainment System (SNES) | Genre: Platformer | distance=1.2821
- Kinect Adventures! (2010) on Xbox 360 | Genre: Party | distance=1.3057
- Grand Theft Auto: San Andreas (2004) on PlayStation 2 | Genre: Action-adventure | distance=1.3632
- Super Mario 64 (1996) on Nintendo 64 | Genre: Platformer | distance=1.3662
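
In the second query none of the returned games is a turn-based RPG and every distance is above 1.1, while the two relevant hits for the first query are below 0.81. Since `query` always returns the `n_results` nearest items, one way to suppress weak matches is a distance cutoff. The sketch below is a minimal illustration: `filter_by_distance` is a hypothetical helper, the cutoff of 1.0 is an arbitrary value chosen from the numbers printed above, and the sample dictionary only imitates the shape of a Chroma query result.

```python
from typing import Any, Dict, List


def filter_by_distance(results: Dict[str, Any], max_distance: float) -> List[Dict[str, Any]]:
    """Keep only the hits of the first query whose distance is <= max_distance."""
    metas = (results.get("metadatas") or [[]])[0]
    dists = (results.get("distances") or [[]])[0]
    return [
        {"metadata": meta, "distance": dist}
        for meta, dist in zip(metas, dists)
        if dist <= max_distance
    ]


# Sample shaped like a query result, using three hits printed for Query 1 above.
sample = {
    "metadatas": [[
        {"Name": "Gran Turismo"},
        {"Name": "Gran Turismo 5"},
        {"Name": "Grand Theft Auto: San Andreas"},
    ]],
    "distances": [[0.7439, 0.8050, 1.0872]],
}

kept = filter_by_distance(sample, max_distance=1.0)  # 1.0 is an assumed cutoff
print([hit["metadata"]["Name"] for hit in kept])
# prints ['Gran Turismo', 'Gran Turismo 5']
```

A suitable cutoff depends on the embedding model and distance setting, so it would need to be tuned against queries with known answers.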
