# NBA BetIQ – Data Cleaning & Exploration

This notebook loads raw NBA betting and game data, performs cleaning, feature engineering,
and basic exploratory data analysis (EDA) to prepare inputs for model training.


In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

# Display options
pd.set_option("display.max_columns", 100)


## 1. Load Raw Data

Load CSVs or database tables from the `data/raw/` directory.
Adjust paths and filenames based on your project.


In [None]:
# Example paths – adjust as needed
bets_path = "../data/raw/bets.csv"
games_path = "../data/raw/games.csv"
public_path = "../data/raw/public_bets.csv"

bets_df = pd.read_csv(bets_path)
games_df = pd.read_csv(games_path)
public_df = pd.read_csv(public_path)

bets_df.head()
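
Before going further, it can help to confirm that each file actually contains the columns the later cells rely on. The following is a minimal sketch, assuming the column names used elsewhere in this notebook (`game_id`, `team`, `outcome`). The helper name `missing_columns` and the small demo frame are illustrative only and are not part of the project:

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


# Tiny stand-in for bets_df; swap in the real frame when running the notebook.
demo_bets = pd.DataFrame({"game_id": [1, 2], "outcome": ["win", "loss"]})

absent = missing_columns(demo_bets, ["game_id", "team", "outcome"])
print(absent)
```

An empty list means every expected column is present. A non-empty list is a cue to adjust the paths or the column names before cleaning.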


## 2. Inspect Schema and Basic Summary

Check column names, dtypes, missing values, and general ranges.


In [None]:
# DataFrame.info() prints its report directly and returns None,
# so it is called without wrapping it in print().
print("BETS INFO")
bets_df.info()
print(bets_df.describe(include="all"))

print("\nGAMES INFO")
games_df.info()
print(games_df.describe(include="all"))

print("\nPUBLIC INFO")
public_df.info()
print(public_df.describe(include="all"))
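
The `info()` output lists non-null counts, but a compact per-column table of missing values is often easier to scan. One possible sketch follows; the helper name `missing_summary` and the demo data are assumptions made for illustration:

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Per-column missing count and percentage, worst columns first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    return summary.sort_values("n_missing", ascending=False)


# Made-up rows standing in for one of the raw tables.
demo = pd.DataFrame({
    "game_id": [1, 2, 3, 4],
    "close_odds": [-110, np.nan, 120, np.nan],
})

summary = missing_summary(demo)
print(summary)
```

Running the same helper on `bets_df`, `games_df`, and `public_df` gives a quick basis for choosing between dropping, imputing, and flagging in the next step.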


## 3. Handle Missing Values

Choose a strategy for each column: drop rows, impute values, or flag missingness.
The goal is a consistent, model-ready dataset.


In [None]:
# Example: drop rows with missing outcomes, keep others
bets_df = bets_df.dropna(subset=["outcome"])

# Example: fill missing public bet percentages with median
if "public_pct" in public_df.columns:
    public_df["public_pct"] = public_df["public_pct"].fillna(public_df["public_pct"].median())

# Example: fill missing odds from neighboring rows within each game_id.
# Forward-fill first, then back-fill; this assumes rows are already in
# chronological order within each game.
for col in ["open_odds", "close_odds"]:
    if col in bets_df.columns:
        bets_df[col] = bets_df.groupby("game_id")[col].transform(lambda s: s.ffill().bfill())
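
Flagging missingness is mentioned as a strategy, so here is a hedged sketch of how it could be combined with median imputation. The helper name `impute_with_flag` and the `<col>_missing` naming are assumptions, not an existing project convention:

```python
import numpy as np
import pandas as pd


def impute_with_flag(df, col):
    """Return a copy of df with a `<col>_missing` indicator and NaNs in col filled by the median."""
    out = df.copy()
    # Record which rows were missing before they are overwritten.
    out[f"{col}_missing"] = out[col].isna().astype(int)
    out[col] = out[col].fillna(out[col].median())
    return out


# Made-up values standing in for public_df.
demo_public = pd.DataFrame({"public_pct": [0.40, np.nan, 0.60, 0.80]})

flagged = impute_with_flag(demo_public, "public_pct")
print(flagged)
```

Keeping the indicator lets a model learn whether the absence of a value is itself informative, which plain imputation would hide.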


## 4. Merge Datasets

Merge bets, games, and public betting splits into one modeling table.
Bets join to games on `game_id`; public splits join on `game_id` plus `team`.


In [None]:
# Example merges – adjust keys to match your schema
df = bets_df.merge(games_df, on="game_id", how="left", suffixes=("", "_game"))
df = df.merge(public_df, on=["game_id", "team"], how="left", suffixes=("", "_public"))

print(df.shape)
df.head()
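
Left merges can silently duplicate rows when the right-hand key is not unique, or leave rows unmatched. pandas offers the `validate` and `indicator` arguments of `DataFrame.merge` to surface both cases. The sketch below uses made-up frames; the `home_team` column is hypothetical:

```python
import pandas as pd

# Made-up stand-ins for bets_df and games_df.
demo_bets = pd.DataFrame({
    "game_id": [1, 1, 2, 3],
    "team": ["BOS", "NYK", "LAL", "MIA"],
})
demo_games = pd.DataFrame({
    "game_id": [1, 2],
    "home_team": ["BOS", "LAL"],  # hypothetical column
})

merged = demo_bets.merge(
    demo_games,
    on="game_id",
    how="left",
    validate="many_to_one",  # raises MergeError if game_id repeats in demo_games
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Bets whose game_id has no match in the games table.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

With `validate="many_to_one"` the row count of the result equals that of the left table, and the `_merge` column shows which bets found no matching game.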


## 5. Feature Engineering

Create features such as implied probabilities, odds movement, and public fade indicators.


In [None]:
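def american_to_implied_prob(odds):
    """Convert American odds to implied probability (bookmaker margin included)."""
    odds = np.asarray(odds, dtype=float)
    abs_odds = np.abs(odds)
    # Positive odds: 100 / (odds + 100). Negative odds: |odds| / (|odds| + 100).
    # Using |odds| in the shared denominator avoids a divide-by-zero warning
    # at odds == -100, where np.where would still evaluate the unused branch.
    return np.where(odds > 0, 100.0, abs_odds) / (abs_odds + 100.0)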

if "moneyline" in df.columns:
    df["implied_prob"] = american_to_implied_prob(df["moneyline"])

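if {"open_odds", "close_odds"}.issubset(df.columns):
    # Raw difference in American odds. Note that this is not comparable across
    # the +/- boundary: a move from -110 to +105 shows up as 215.
    df["odds_movement"] = df["close_odds"] - df["open_odds"]

if "public_pct" in df.columns:
    # Assumes public_pct is a 0-1 fraction. The flag marks sides backed by more
    # than 60% of public bets, i.e. the sides a fade strategy would bet against.
    df["is_fading_public"] = (df["public_pct"] > 0.6).astype(int)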

df.head()
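
Because American odds jump from negative to positive values around even money, a difference in raw odds can overstate small line moves. One alternative, sketched here as an assumption rather than as the project's method, is to measure movement in implied-probability space. The column names `open_prob`, `close_prob`, and `prob_movement` are illustrative:

```python
import numpy as np
import pandas as pd


def american_to_implied_prob(odds):
    """American odds to implied probability; repeated here so the sketch runs on its own."""
    odds = np.asarray(odds, dtype=float)
    abs_odds = np.abs(odds)
    return np.where(odds > 0, 100.0, abs_odds) / (abs_odds + 100.0)


# Made-up odds: the first row crosses the +/- boundary, the second does not.
demo = pd.DataFrame({"open_odds": [-110, 150], "close_odds": [105, 120]})

demo["open_prob"] = american_to_implied_prob(demo["open_odds"])
demo["close_prob"] = american_to_implied_prob(demo["close_odds"])
demo["prob_movement"] = demo["close_prob"] - demo["open_prob"]
print(demo)
```

In the first row the raw difference is 215, while the implied probability falls by only a few percentage points, so the two measures can rank line moves very differently.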


## 6. Convert Target Variable

Define the modeling target, e.g., whether the bet won (1) or lost (0).


In [None]:
# Example: assume 'outcome' is "win"/"loss".
# Any value other than "win" (including pushes or voided bets, if present) is coded 0.
df["target"] = (df["outcome"] == "win").astype(int)
df["target"].value_counts(normalize=True)
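
If the data contains outcomes other than wins and losses, an explicit mapping keeps them from being counted as losses. The sketch below assumes a hypothetical `"push"` label; whether such a label exists in the raw data is not confirmed here:

```python
import pandas as pd

# Only graded outcomes are mapped; anything else becomes NaN.
outcome_to_target = {"win": 1, "loss": 0}

# Made-up outcomes; "push" is a hypothetical label.
demo = pd.DataFrame({"outcome": ["win", "loss", "push", "win"]})
demo["target"] = demo["outcome"].map(outcome_to_target)

# Drop ungraded rows, then restore an integer dtype.
graded = demo.dropna(subset=["target"]).copy()
graded["target"] = graded["target"].astype(int)
print(graded)
```

Checking `df["outcome"].unique()` first shows whether any such extra labels are present at all.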


## 7. Exploratory Data Analysis (EDA)

Plot the target balance, the distribution of public betting percentage, and implied probability by outcome.


In [None]:
# Target distribution
sns.countplot(x="target", data=df)
plt.title("Target Distribution (Win vs Loss)")
plt.show()


In [None]:
# Public betting percentage
if "public_pct" in df.columns:
    sns.histplot(df["public_pct"], bins=20, kde=True)
    plt.title("Distribution of Public Betting Percentage")
    plt.show()


In [None]:
# Implied probability vs outcome
if "implied_prob" in df.columns:
    sns.boxplot(x="target", y="implied_prob", data=df)
    plt.title("Implied Probability vs Outcome")
    plt.show()
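
A numeric complement to the boxplot is a small calibration table: bin the implied probability and compare each bin's average with the observed win rate. This is a sketch on made-up values, and the bin edges are an arbitrary assumption:

```python
import pandas as pd

# Made-up rows standing in for df[["implied_prob", "target"]].
demo = pd.DataFrame({
    "implied_prob": [0.35, 0.38, 0.52, 0.55, 0.71, 0.74],
    "target": [0, 1, 1, 0, 1, 1],
})

bins = [0.0, 0.4, 0.6, 1.0]  # arbitrary edges chosen for illustration
demo["prob_bin"] = pd.cut(demo["implied_prob"], bins=bins)

calibration = demo.groupby("prob_bin", observed=True).agg(
    n=("target", "size"),
    mean_implied=("implied_prob", "mean"),
    win_rate=("target", "mean"),
)
print(calibration)
```

Large gaps between `mean_implied` and `win_rate` in well-populated bins would suggest either a data problem or a potential edge worth examining.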


## 8. Train/Test Split and Save Cleaned Data

Split the modeling table into train and test sets, then persist them to `data/processed/` for downstream modeling.


In [None]:
from sklearn.model_selection import train_test_split
import os

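# Exclude identifiers and the label. If the merged table also carries
# post-game columns (anything known only after the game ends), add them
# to this list so they do not leak into the features.
feature_cols = [
    col for col in df.columns
    if col not in ["target", "outcome", "game_id", "team"]
]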

X = df[feature_cols]
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

os.makedirs("../data/processed", exist_ok=True)
X_train.to_csv("../data/processed/X_train.csv", index=False)
X_test.to_csv("../data/processed/X_test.csv", index=False)
y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)
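
A random split mixes earlier and later games, which may be optimistic for betting data where the model is used on future games. A chronological split is one alternative. The sketch below assumes a hypothetical `game_date` column, and the helper name `time_based_split` is illustrative:

```python
import pandas as pd


def time_based_split(df, date_col, test_size=0.2):
    """Split df so the most recent `test_size` share of rows forms the test set."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_test = int(round(len(ordered) * test_size))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Made-up rows; "game_date" is a hypothetical column name.
demo = pd.DataFrame({
    "game_date": pd.to_datetime(
        ["2024-01-05", "2024-01-01", "2024-01-03", "2024-01-02", "2024-01-04"]
    ),
    "target": [1, 0, 1, 0, 1],
})

train, test = time_based_split(demo, "game_date", test_size=0.2)
print(len(train), len(test))
```

Unlike the stratified random split, this variant does not balance the target across the two sets, so the class proportions are worth checking after splitting.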


## 9. Summary

We have cleaned and merged the raw datasets, engineered key features,
and saved train/test splits for model training.
