# **Exploring Data**
This notebook is an exploratory area for understanding my data. 

# Setup
The cells below set up the rest of the notebook.

I'll start by configuring the kernel that's running this notebook:

In [1]:
# Change the cwd
%cd ..

# Enable the autoreload module
%load_ext autoreload
%autoreload 2

# Load the environment variables
from dotenv import load_dotenv
load_dotenv(override=True)

/Users/thubbard/Documents/personal/programming/pax-pal-2025/experiments


True

Next, I'm going to import the necessary modules:

In [12]:
# General imports
import os

# Third-party imports
import pandas as pd
from IPython.display import display, Markdown

# Project-specific imports
from utils.openai import generate_completions_in_parallel

# Loading Data
Below, I'll load in the data:

In [3]:
games_data_df = pd.read_json("data/final_enriched_games_data.json")



# Exploring Data


### Genres
What are the most common genres / tags?

In [4]:
# Print all genre/tag counts, not just the head (default Jupyter truncates output)
pd.set_option('display.max_rows', None)
print(games_data_df["genres_and_tags"].explode().value_counts())
pd.reset_option('display.max_rows')

genres_and_tags
Indie                                 85
Adventure                             68
Action                                55
RPG                                   31
Simulation                            27
Puzzle                                22
Casual                                21
Strategy                              14
Platformer                            13
Horror                                13
Platform                              13
Arcade                                12
Roguelite                             10
Shooter                                9
Early Access                           8
Hack and Slash                         8
Side Scroller                          8
Multiplayer                            7
Co-op                                  6
Singleplayer                           6
Roguelike                              6
Cooperative                            5
Survival Horror                        5
Exploration                            5
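
To make the counting step concrete, here is a minimal sketch of how `explode()` and `value_counts()` interact on list-valued cells. The `toy_df` rows are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for games_data_df; these rows are made up for illustration.
toy_df = pd.DataFrame(
    {
        "name": ["Game A", "Game B", "Game C"],
        "genres_and_tags": [["Indie", "Puzzle"], ["Indie"], []],
    }
)

# explode() gives each list element its own row. An empty list becomes NaN,
# which value_counts() skips by default (dropna=True).
tag_counts = toy_df["genres_and_tags"].explode().value_counts()
print(tag_counts)
```

Because each tag string is counted exactly as written, near-duplicates such as "Platformer" and "Platform" end up as separate entries, which is what motivates the cleanup later in this notebook.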


### Platforms
What platforms are available?

In [6]:
games_data_df["platforms"].explode().value_counts()

platforms
PC                                                          104
Nintendo Switch                                              29
PlayStation 5                                                20
Xbox One                                                     20
PlayStation 4                                                15
Tabletop                                                     15
Xbox Series X|S                                              13
Xbox Series X/S                                              10
Windows                                                       8
Linux                                                         7
Mac                                                           7
macOS                                                         5
Xbox                                                          5
Android                                                       5
PlayStation                                                   5
iOS                           

# Cleaning Up Platforms & Genres
Below, I've created lists of the "standardized" platforms and genres. I'm going to run each game's details through an OpenAI model (`gpt-4.1-mini`) and ask it to tag the game using only those options, based on what's known about it.

In [22]:
available_genres = [
    "Indie",
    "Adventure",
    "Action",
    "RPG",
    "Simulation",
    "Platformer",
    "Shooter",
    "Horror",
    "Puzzle",
    "Casual",
    "Roguelite",
    "Strategy / Tactics",
    "Arcade",
    "Survival",
    "Hack & Slash / Beat 'em Up",
    "Sports",
    "Deck-Builder",
    "Rhythm / Music",
    "Stealth",
    "Open World",
    "First-Person",
    "Comedy",
    "Point-and-Click",
    "Isometric",
    "Metroidvania",
    "Bullet Hell",
    "Multiplayer",
]

available_platforms = ["PC", "Nintendo Switch", "PlayStation", "Xbox", "VR", "Mobile"]
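
For comparison, the most obvious platform aliases could also be collapsed with a plain lookup table before (or instead of) calling a model. The sketch below is only a baseline under stated assumptions: `PLATFORM_ALIASES` and `normalize_platforms` are hypothetical names, and the alias choices are my own guesses rather than a mapping used elsewhere in this project.

```python
# Hypothetical alias table: raw platform label -> standardized label.
# Which raw labels belong under which standard label is a judgment call.
PLATFORM_ALIASES = {
    "Xbox One": "Xbox",
    "Xbox Series X|S": "Xbox",
    "Xbox Series X/S": "Xbox",
    "PlayStation 4": "PlayStation",
    "PlayStation 5": "PlayStation",
    "Windows": "PC",
    "Android": "Mobile",
    "iOS": "Mobile",
}


def normalize_platforms(raw_platforms, allowed):
    """Map raw labels to standardized ones, dropping unrecognized labels and duplicates."""
    normalized = []
    for label in raw_platforms:
        standard = PLATFORM_ALIASES.get(label, label)
        if standard in allowed and standard not in normalized:
            normalized.append(standard)
    return normalized


allowed_platforms = ["PC", "Nintendo Switch", "PlayStation", "Xbox", "VR", "Mobile"]
example = normalize_platforms(
    ["Xbox One", "Xbox Series X|S", "Windows", "Tabletop"], allowed_platforms
)
print(example)
```

A table like this handles spelling variants well but cannot infer genres from a free-text description, which is where the model-based approach is more useful.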

Now, I'll write the developer prompt and define the structured output schema:

In [23]:
# Build the bullet lists up front: a backslash inside an f-string expression
# is a SyntaxError before Python 3.12
genre_bullets = "\n".join(f"- {genre}" for genre in available_genres)
platform_bullets = "\n".join(f"- {platform}" for platform in available_platforms)

developer_prompt = f"""# Role
You're a digital assistant helping to categorize video game metadata under a standardized taxonomy.

# Task
Users will paste in some free text containing scraped information about a game.

You will determine a list of genres & platforms from the following options:

### Genres
{genre_bullets}

### Platforms
{platform_bullets}

# Guidelines
- Only use the information available in the text. Do not assume anything.
- If no information is available for either `genres` or `platforms`, return an empty list for that field.
"""

from typing import List
from pydantic import BaseModel


class GameCategorizationOutput(BaseModel):
    genres: List[str]
    platforms: List[str]
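
Since both fields are typed as plain `List[str]`, nothing in the schema itself stops a response from containing a label outside the taxonomy. One possible safeguard is a post-hoc check like the sketch below. `split_valid_labels` is a hypothetical helper, not part of this project, and "Steam Deck" is just an invented off-taxonomy label.

```python
def split_valid_labels(labels, allowed):
    """Return (kept, rejected) so off-taxonomy labels can be reviewed instead of silently lost."""
    allowed_set = set(allowed)
    kept = [label for label in labels if label in allowed_set]
    rejected = [label for label in labels if label not in allowed_set]
    return kept, rejected


available_platforms = ["PC", "Nintendo Switch", "PlayStation", "Xbox", "VR", "Mobile"]

# "Steam Deck" is an invented example of a label the taxonomy does not contain.
kept, rejected = split_valid_labels(["PC", "Steam Deck", "Xbox"], available_platforms)
print(kept)
print(rejected)
```

Keeping the rejected labels around makes it easier to spot whether the taxonomy is missing a category or the prompt needs tightening.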

Next, I'll build a Markdown prompt for each game:

In [24]:
game_id_to_markdown_prompt = {}
for row in games_data_df.itertuples():
    prompt = f"""# **{row.name}**

***Summary:** {row.snappy_summary}*

***Description:** {row.description_texts}*

***Genres:** {row.genres_and_tags}*

***Platforms:** {row.platforms}*
"""
    game_id_to_markdown_prompt[row.id] = prompt

Now, I'll categorize the data:

In [25]:
completions = generate_completions_in_parallel(
    message_format_pairs=[
        (
            [
                {"role": "developer", "content": developer_prompt},
                {"role": "user", "content": prompt},
            ],
            GameCategorizationOutput,
        )
        for game_id, prompt in game_id_to_markdown_prompt.items()
    ],
    gpt_model="gpt-4.1-mini",
    show_progress=True
)

Generating Completions: 100%|██████████| 161/161 [00:24<00:00,  6.44it/s]
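
The implementation of `generate_completions_in_parallel` lives in `utils.openai` and is not shown in this notebook. As a rough illustration of the general pattern only, the sketch below fans work out over a thread pool while keeping results in input order. `run_in_parallel` and `fake_completion` are hypothetical names, and no network call is made.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(fn, inputs, max_workers=8):
    """Apply fn to each input on a thread pool, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(fn, inputs))


def fake_completion(prompt):
    # Stand-in for a network-bound API call; it only measures the prompt.
    return {"prompt": prompt, "n_chars": len(prompt)}


results = run_in_parallel(fake_completion, ["alpha", "be", "gamma"])
print([result["n_chars"] for result in results])
```

Order preservation matters here because the parsing step pairs each completion with a game ID by position.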


Next, I'll parse the completions:

In [26]:
# Pair each completion with its game ID by position; this assumes the
# completions come back in the same order as the prompts were submitted
game_ids = list(game_id_to_markdown_prompt)
cleaned_games_df_records = []
for game_id, completion in zip(game_ids, completions):
    parsed_completion = completion.choices[0].message.parsed
    cleaned_games_df_records.append(
        {
            "id": game_id,
            "genres_and_tags": parsed_completion.genres,
            "platforms": parsed_completion.platforms,
        }
    )
cleaned_games_df = pd.DataFrame(cleaned_games_df_records).merge(
    games_data_df.drop(columns=["genres_and_tags", "platforms"]),
    on="id",
    how="inner",
)
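
An inner merge silently drops any `id` that appears on only one side. A quick way to check for that is sketched below with toy frames (the rows are invented): an outer merge with `indicator=True` flags unmatched rows, and `validate="one_to_one"` raises an error if an `id` is duplicated on either side.

```python
import pandas as pd

# Toy frames with invented rows: id 3 exists only on the left, id 4 only on the right.
left = pd.DataFrame({"id": [1, 2, 3], "platforms": [["PC"], ["Xbox"], []]})
right = pd.DataFrame({"id": [1, 2, 4], "name": ["A", "B", "D"]})

# indicator=True adds a "_merge" column with values left_only / right_only / both.
checked = left.merge(
    right, on="id", how="outer", validate="one_to_one", indicator=True
)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["id", "_merge"]])

# The inner merge keeps only ids present on both sides.
inner = left.merge(right, on="id", how="inner")
print(len(inner))
```

If `unmatched` is empty for the real frames, the inner merge keeps every game.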

Finally, I'll save the cleaned data, overwriting the original file:

In [27]:
cleaned_games_df.to_json(
    "data/final_enriched_games_data.json",
    orient="records",
)
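
As a sanity check on the file format, the sketch below round-trips a toy frame through `to_json(orient="records")` and `read_json` to show that list-valued columns can be read back. It writes to a temporary directory, and the file name `toy_games.json` and the rows are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame with invented rows, including an empty list.
toy_df = pd.DataFrame({"id": [1, 2], "platforms": [["PC", "Xbox"], []]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_games.json")
    # orient="records" writes a JSON array with one object per row.
    toy_df.to_json(path, orient="records")
    restored = pd.read_json(path)

print(restored["platforms"].tolist())
```

Writing to a separate path first, as done here, is one way to inspect the output before replacing the original file.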