# **Exploring Data**
This notebook explores the games data and standardizes its genre and platform tags.

# Setup
The cells below set up the rest of the notebook.

I'll start by configuring the kernel that's running this notebook:

In [1]:
# Change the cwd
%cd ..

# Enable the autoreload module
%load_ext autoreload
%autoreload 2

# Load the environment variables
from dotenv import load_dotenv
load_dotenv(override=True)

/Users/thubbard/Documents/personal/programming/pax-pal-2025/experiments


True

Next, I'm going to import the necessary modules:

In [2]:
# General imports
import os

# Third-party imports
import pandas as pd
from IPython.display import display, Markdown

# Project-specific imports
from utils.openai import generate_completions_in_parallel

# Loading Data
Below, I'll load in the data:

In [3]:
games_data_df = pd.read_json("data/final_enriched_games_data.json")



# Exploring Data


### Genres
What are the most common genres / tags?

In [4]:
# Print all genre/tag counts, not just the head (default Jupyter truncates output)
pd.set_option('display.max_rows', None)
print(games_data_df["genres_and_tags"].explode().value_counts())
pd.reset_option('display.max_rows')

genres_and_tags
Indie                         83
Adventure                     71
Action                        59
RPG                           42
Simulation                    29
Platformer                    23
Puzzle                        22
Casual                        21
Strategy / Tactics            20
Roguelite                     18
Multiplayer                   15
Horror                        14
Arcade                        12
Shooter                       11
Hack & Slash / Beat 'em Up    10
Survival                       9
Deck-Builder                   8
Open World                     6
Sports                         5
Rhythm / Music                 5
Isometric                      4
First-Person                   4
Point-and-Click                4
Stealth                        4
Bullet Hell                    3
Comedy                         3
Metroidvania                   2
Music / Rhythm                 1
Name: count, dtype: int64
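
The tail of this output contains both `Rhythm / Music` and `Music / Rhythm`, which look like the same tag written two ways. Below is a minimal sketch of how such near-duplicates could be flagged, assuming that tags differing only in case or in the order of their `/`-separated parts should count as one. The helper names `canonical_tag` and `find_near_duplicates` are hypothetical and not part of this project.

```python
from collections import defaultdict


def canonical_tag(tag: str) -> str:
    """Lowercase the tag and sort its '/'-separated parts so their order no longer matters."""
    parts = sorted(part.strip().lower() for part in tag.split("/"))
    return " / ".join(parts)


def find_near_duplicates(tags):
    """Group tags by canonical form and keep only groups with more than one spelling."""
    groups = defaultdict(list)
    for tag in tags:
        groups[canonical_tag(tag)].append(tag)
    return {key: names for key, names in groups.items() if len(names) > 1}


# A few of the tags from the output above
sample_tags = ["Rhythm / Music", "Music / Rhythm", "Strategy / Tactics", "Indie"]
near_duplicates = find_near_duplicates(sample_tags)
print(near_duplicates)
```

On the real data, the same function could be applied to `games_data_df["genres_and_tags"].explode().unique()`.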


### Platforms
What platforms are available?

In [5]:
games_data_df["platforms"].explode().value_counts()

platforms
PC                 115
Xbox                34
PlayStation         29
Nintendo Switch     29
Mobile               6
VR                   3
Name: count, dtype: int64

# Cleaning Up Platforms & Genres
Below, I've created lists of the "standardized" platforms and genres. I'm going to run each game's details through an OpenAI model (`gpt-4.1-mini`) and ask it to tag each game using only these options.

In [6]:
available_genres = [
    "Indie",
    "Adventure",
    "Action",
    "RPG",
    "Simulation",
    "Platformer",
    "Shooter",
    "Horror",
    "Puzzle",
    "Casual",
    "Roguelite",
    "Strategy",
    "Arcade",
    "Survival",
    "Beat 'em Up",
    "Sports",
    "Deck-Builder",
    "Rhythm",
    "Stealth",
    "Open World",
    "First-Person",
    "Comedy",
    "Isometric",
    "Metroidvania",
    "Bullet Hell",
    "Multiplayer",
]

available_platforms = ["PC", "Nintendo Switch", "PlayStation", "Xbox", "VR", "Mobile"]
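
To see what the standardized list changes, it can help to diff it against the raw tags from the value counts above. The sketch below hard-codes both lists so that it runs on its own; `raw_tags` is copied from the earlier output and is otherwise a hypothetical name.

```python
# Raw tags as they appear in the value counts above
raw_tags = {
    "Indie", "Adventure", "Action", "RPG", "Simulation", "Platformer", "Puzzle",
    "Casual", "Strategy / Tactics", "Roguelite", "Multiplayer", "Horror", "Arcade",
    "Shooter", "Hack & Slash / Beat 'em Up", "Survival", "Deck-Builder", "Open World",
    "Sports", "Rhythm / Music", "Isometric", "First-Person", "Point-and-Click",
    "Stealth", "Bullet Hell", "Comedy", "Metroidvania", "Music / Rhythm",
}

# Standardized genres, repeated here so the snippet is self-contained
available_genres = [
    "Indie", "Adventure", "Action", "RPG", "Simulation", "Platformer", "Shooter",
    "Horror", "Puzzle", "Casual", "Roguelite", "Strategy", "Arcade", "Survival",
    "Beat 'em Up", "Sports", "Deck-Builder", "Rhythm", "Stealth", "Open World",
    "First-Person", "Comedy", "Isometric", "Metroidvania", "Bullet Hell", "Multiplayer",
]

# Raw tags with no exact match in the standardized list
unmatched_raw_tags = sorted(raw_tags - set(available_genres))
# Standardized labels that never appear verbatim in the raw data
new_labels = sorted(set(available_genres) - raw_tags)

print(unmatched_raw_tags)
print(new_labels)
```

Four of the five unmatched raw tags are renamed or merged (`Strategy / Tactics`, `Hack & Slash / Beat 'em Up`, and the two rhythm spellings). `Point-and-Click` has no counterpart in the standardized list, so games carrying only that tag may end up with fewer genres after cleaning.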

Now, I'll prepare the developer prompt and the structured output schema:

In [7]:
# Build the bullet lists up front: a backslash inside an f-string expression
# is a SyntaxError before Python 3.12
genre_bullets = "\n".join(f"- {genre}" for genre in available_genres)
platform_bullets = "\n".join(f"- {platform}" for platform in available_platforms)

developer_prompt = f"""# Role
You're a digital assistant helping to categorize video game metadata under a standardized taxonomy.

# Task
Users will paste in some free text containing scraped information about a game.

You will determine a list of genres & platforms from the following options:

### Genres
{genre_bullets}

### Platforms
{platform_bullets}

# Guidelines
- Only use the information available in the text. Do not assume anything. 
- If no information is available for either `genres` or `platforms`, return an empty list for that field.
"""

from typing import List
from pydantic import BaseModel


class GameCategorizationOutput(BaseModel):
    genres: List[str]
    platforms: List[str]

Next, I'll build a Markdown prompt for each game:

In [8]:
game_id_to_markdown_prompt = {}
for row in games_data_df.itertuples():
    prompt = f"""# **{row.name}**

***Summary:** {row.snappy_summary}*

***Description:** {row.description_texts}*

***Genres:** {row.genres_and_tags}*

***Platforms:** {row.platforms}*
"""
    game_id_to_markdown_prompt[row.id] = prompt

Now, I'll categorize the data:

In [9]:
completions = generate_completions_in_parallel(
    message_format_pairs=[
        (
            [
                {"role": "developer", "content": developer_prompt},
                {"role": "user", "content": prompt},
            ],
            GameCategorizationOutput,
        )
        for game_id, prompt in game_id_to_markdown_prompt.items()
    ],
    gpt_model="gpt-4.1-mini",
    show_progress=True
)

Generating Completions: 100%|██████████| 161/161 [00:22<00:00,  7.01it/s]
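
The parsing step that follows pairs each completion with a game ID by position, so it relies on results coming back in the same order as the prompts. `generate_completions_in_parallel` is a project utility whose implementation isn't shown here, so whether it guarantees that order is an assumption. The sketch below shows one way an order-preserving parallel helper could be written with the standard library. `run_in_parallel` and `fake_completion` are hypothetical stand-ins, not the project's code.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(fn, items, max_workers=8):
    """Apply fn to each item on a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Executor.map yields results in the order of `items`,
        # regardless of which call finishes first
        return list(executor.map(fn, items))


def fake_completion(prompt: str) -> str:
    """Stand-in for a network call to a model (hypothetical)."""
    return prompt.upper()


prompts = {"game-1": "first prompt", "game-2": "second prompt"}
results = run_in_parallel(fake_completion, prompts.values())

# Because order is preserved, zipping keys with results is safe
paired = dict(zip(prompts.keys(), results))
print(paired)
```

A helper built on `concurrent.futures.as_completed` would instead yield results in completion order, in which case each result would need to carry its game ID explicitly.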


Next, I'll parse the completions:

In [10]:
# This assumes the completions are in the same order as the prompts,
# so each one can be paired with its game ID by position
cleaned_games_df_records = []
for game_id, completion in zip(game_id_to_markdown_prompt, completions):
    parsed_completion = completion.choices[0].message.parsed
    cleaned_games_df_records.append(
        {
            "id": game_id,
            "genres_and_tags": parsed_completion.genres,
            "platforms": parsed_completion.platforms,
        }
    )
cleaned_games_df = pd.DataFrame(cleaned_games_df_records).merge(
    games_data_df.drop(columns=["genres_and_tags", "platforms"]),
    on="id",
    how="inner",
)
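
`GameCategorizationOutput` types both fields as `List[str]`, so the schema alone does not restrict the model to the standardized options. Below is a minimal sketch of a sanity check that could be run on the parsed labels before saving. The helper `split_allowed` is hypothetical, and the "Steam Deck" label is made up purely for illustration.

```python
def split_allowed(labels, allowed):
    """Split labels into those inside the allowed taxonomy and those outside it."""
    allowed_set = set(allowed)
    kept = [label for label in labels if label in allowed_set]
    dropped = [label for label in labels if label not in allowed_set]
    return kept, dropped


available_platforms = ["PC", "Nintendo Switch", "PlayStation", "Xbox", "VR", "Mobile"]

# "Steam Deck" is an invented out-of-taxonomy label for illustration
kept, dropped = split_allowed(["PC", "Steam Deck", "Xbox"], available_platforms)
print(kept, dropped)
```

Applied per row, for example with `cleaned_games_df["platforms"].apply(...)`, any non-empty `dropped` list would point to a response that strayed from the taxonomy.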

Finally, I'll save the cleaned data, overwriting the original file:

In [12]:
cleaned_games_df.to_json(
    "data/final_enriched_games_data.json",
    orient="records",
)