# Simulator Development 
This notebook prepares the final dataset for running the NBA Playoff Game Simulator by combining modeled playoff games, matchup data, and SVM model predictions.

#### Build Simulator Dataset

In this section, we integrate multiple sources (modeled games, playoff games, and SVM predictions) into a single simulator-ready dataset.  
The resulting file `simulator_data.csv` will be used to power the playoff simulator application.

##### Overview of Steps:

1. **Load Data**: Read `all_modeled_playoff_games.csv`, `playoffs_games.csv`, and `svm_predictions.csv`.
2. **Filter Season**: Keep only **2025 NBA playoff games** from the modeled dataset.
3. **Remove Overlap Columns**: From the playoff games data, drop every column that already exists in the modeled dataset except the merge keys (`gameId`, `season`, `homeTeam`, `awayTeam`), so that only the keys and the columns unique to that file remain.
4. **Merge Data**: Left-join the remaining matchup columns onto the modeled playoff games using the four keys, with one-to-one validation, so that no duplicate-suffix columns (`_x`, `_y`) are created.
5. **Add Predictions**: Attach `true_label` and `predicted_label` from the SVM predictions file by row position.
6. **Compute Winners**: Derive the **predicted winner** from `predicted_label` and the **actual winner** from `true_label`; a label of `1` means the home team, and any other value means the away team.
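
Steps 3, 4, and 6 can be illustrated on toy data. This is a minimal sketch rather than the notebook's real data: the two small frames, the team codes, and the `homeScore` and `arena` columns are hypothetical, chosen only to show that the merge produces no `_x`/`_y` columns.

```python
import pandas as pd

KEYS = ["gameId", "season", "homeTeam", "awayTeam"]

# Hypothetical stand-ins for the modeled games and the playoff matchup data
modeled = pd.DataFrame({
    "gameId": [1, 2],
    "season": [2025, 2025],
    "homeTeam": ["AAA", "CCC"],
    "awayTeam": ["BBB", "DDD"],
    "homeScore": [110, 98],
})
matchups = pd.DataFrame({
    "gameId": [1, 2],
    "season": [2025, 2025],
    "homeTeam": ["AAA", "CCC"],
    "awayTeam": ["BBB", "DDD"],
    "homeScore": [110, 98],          # duplicates a modeled column
    "arena": ["Arena 1", "Arena 2"],  # hypothetical column unique to this frame
})

# Step 3: drop every shared column except the merge keys
overlap = (set(modeled.columns) & set(matchups.columns)) - set(KEYS)
meta = matchups.drop(columns=list(overlap))

# Step 4: merge on the keys; validate raises if a key repeats on either side
merged = modeled.merge(meta, on=KEYS, how="left", validate="one_to_one")

# Step 6: label 1 -> home team, anything else -> away team
labels = pd.Series([1, 0])
merged["predicted_winner"] = merged["homeTeam"].where(labels == 1, merged["awayTeam"])

print(merged.columns.tolist())
```

Because `homeScore` is removed from the right-hand frame before the merge, the result keeps a single `homeScore` column instead of `homeScore_x` and `homeScore_y`.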

In [77]:
import pandas as pd

# Load all modeled playoff games, playoff games, and svm predictions files
all_modeled_df = pd.read_csv("../data/processed/all_modeled_playoff_games.csv")
playoffs_games_df = pd.read_csv("../data/processed/playoffs_games.csv")
svm_predictions_df = pd.read_csv("../data/final/svm_predictions.csv")

# Filter only 2025 playoff games from all modeled playoff games
df_2025 = all_modeled_df[all_modeled_df['season'] == 2025].copy()

# Drop columns from playoffs_games_df that already exist in df_2025, except the merge keys
key_cols = ["gameId", "season", "homeTeam", "awayTeam"]
overlap_cols = set(df_2025.columns) & set(playoffs_games_df.columns)
meta_df = playoffs_games_df.drop(columns=list(overlap_cols - set(key_cols)))

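# Left-join on the key columns. Non-key overlaps were dropped above, so no _x/_y
# suffixes are needed; with empty suffixes, pandas raises if any overlap remains.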
df_2025 = df_2025.merge(
    meta_df,
    on=["gameId", "season", "homeTeam", "awayTeam"],
    how="left",
    suffixes=("", ""),
    validate="one_to_one"
)

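# Add SVM predictions by row position. This assumes svm_predictions.csv has one row
# per 2025 game, in the same order as df_2025.
df_2025 = df_2025.reset_index(drop=True)
svm_predictions_df = svm_predictions_df.reset_index(drop=True)
if len(svm_predictions_df) != len(df_2025):
    raise ValueError("svm_predictions.csv and the 2025 games have different row counts")

df_2025["true_label"] = svm_predictions_df["true_label"]
df_2025["predicted_label"] = svm_predictions_df["predicted_label"]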

# Add winners: label 1 means the home team, anything else the away team
df_2025["predicted_winner"] = df_2025["homeTeam"].where(
    df_2025["predicted_label"] == 1, df_2025["awayTeam"]
)
df_2025["actual_winner"] = df_2025["homeTeam"].where(
    df_2025["true_label"] == 1, df_2025["awayTeam"]
)

# Export the final dataset
df_2025.to_csv("../data/final/simulator_data.csv", index=False)
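
The positional assignment of `true_label` and `predicted_label` is only correct if `svm_predictions.csv` lists games in the same order as the filtered 2025 games. If the predictions file also carried a game identifier, a key-based merge would be more robust. The sketch below assumes a `gameId` column in the predictions, which the source does not confirm; `attach_predictions` is a hypothetical helper and the toy frames are made up for illustration.

```python
import pandas as pd

def attach_predictions(games: pd.DataFrame, preds: pd.DataFrame, key: str = "gameId") -> pd.DataFrame:
    """Hypothetical helper: join labels to games by key instead of by row position."""
    cols = [key, "true_label", "predicted_label"]
    out = games.merge(preds[cols], on=key, how="left", validate="one_to_one")
    # Fail loudly if any game did not find a matching prediction
    missing = int(out["predicted_label"].isna().sum())
    if missing:
        raise ValueError(f"{missing} game(s) have no matching prediction")
    return out

# Toy data: predictions arrive in a different order from the games
games = pd.DataFrame({
    "gameId": [10, 11, 12],
    "homeTeam": ["AAA", "CCC", "EEE"],
    "awayTeam": ["BBB", "DDD", "FFF"],
})
preds = pd.DataFrame({
    "gameId": [12, 10, 11],
    "true_label": [1, 0, 1],
    "predicted_label": [1, 1, 0],
})

result = attach_predictions(games, preds)
print(result)
```

With a key-based join, each game receives its own labels regardless of the row order in the predictions file, and a missing prediction raises an error instead of silently producing `NaN`.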