# Exploration Notebook

## Learning the Dataset

Welcome to my exploration notebook for my basketball predictions project! This is where I take time to learn more about the master datasets I have created. My goal is to understand the data beyond its surface by using summary statistics and exploratory visualizations.

## Methods

In [22]:
# imports
import pandas as pd

### Data

In [23]:
# load data
master_df = pd.read_csv("/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/raw/master-stats/master_df.csv")
print(master_df.head())

   Season                   Team  GP   W   L   WIN%   Min    PTS   FGM   FGA  \
0    2025  Oklahoma City Thunder  82  68  14  0.829  48.1  120.5  44.6  92.7   
1    2025  Oklahoma City Thunder  82  68  14  0.829  48.1  120.5  44.6  92.7   
2    2025  Oklahoma City Thunder  82  68  14  0.829  48.1  120.5  44.6  92.7   
3    2025    Cleveland Cavaliers  82  64  18  0.780  48.2  121.9  44.5  90.8   
4    2025    Cleveland Cavaliers  82  64  18  0.780  48.2  121.9  44.5  90.8   

   ...       SOS            Coach  Yw/Franch  YOverall  CareerW  CareerL  \
0  ...  0.487557  Mark Daigneault          5         5      211      189   
1  ...  0.487557  Mark Daigneault          5         5      211      189   
2  ...  0.487557  Mark Daigneault          5         5      211      189   
3  ...  0.479675   Kenny Atkinson          1         5      182      208   
4  ...  0.479675   Kenny Atkinson          1         5      182      208   

   CareerW%  Pk  Coach_Count      Payroll  
0     0.528  15   

I will now create the training and testing data: the 2016-2023 seasons (8 seasons) for training and the 2024 and 2025 seasons (2 seasons) for testing. Because I am predicting seasons to come, I use a time-based split that approximates the traditional 80/20 split instead of a random one.

In [24]:
# train/test split
master_test = master_df[master_df["Season"].isin([2024, 2025])]
master_train = master_df[~master_df["Season"].isin([2024, 2025])]

print("Train shape:", master_train.shape)
print("Test shape:", master_test.shape)

Train shape: (515, 51)
Test shape: (132, 51)
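
As a sanity check on the time-based split, the sketch below uses a toy frame (made-up values, not the real `master_df`) to confirm that no season lands on both sides and to compute the test share of rows. The names `toy_df`, `toy_train`, and `toy_test` are hypothetical. The last lines repeat the arithmetic on the shapes printed above.

```python
import pandas as pd

# Toy stand-in for master_df: 10 seasons, 3 rows per season (values are made up).
toy_df = pd.DataFrame({
    "Season": [season for season in range(2016, 2026) for _ in range(3)],
    "W": list(range(30)),
})

test_seasons = [2024, 2025]  # held-out seasons, as in the split above
is_test = toy_df["Season"].isin(test_seasons)
toy_train, toy_test = toy_df[~is_test], toy_df[is_test]

# No season should appear on both sides of the split.
assert set(toy_train["Season"]).isdisjoint(toy_test["Season"])
# Every training season should precede every test season.
assert toy_train["Season"].max() < toy_test["Season"].min()

test_share = len(toy_test) / len(toy_df)
print(f"Toy test share of rows: {test_share:.1%}")

# The same arithmetic on the shapes printed above: 132 / (515 + 132).
reported_share = 132 / (515 + 132)
print(f"Reported test share of rows: {reported_share:.1%}")
```

On the reported shapes the test share is about 20.4% of rows, close to the 80/20 target.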


In [25]:
# Numeric features (continuous or counts)
numeric_features = [
    "GP", "W", "L", "WIN%", "Min", "PTS", "FGM", "FGA", "FG%",
    "3PM", "3PA", "3P%", "FTM", "FTA", "FT%", "OREB", "DREB",
    "REB", "AST", "TOV", "STL", "BLK", "BLKA", "PF", "PFD",
    "PLUS_MINUS", "Home_W", "Home_L", "Road_W", "Road_L",
    "E_W", "E_L", "W_W", "W_L", "Pre-ASG_W", "Pre-ASG_L",
    "Post-ASG_W", "Post-ASG_L", "SOS", "Yw/Franch", "YOverall",
    "CareerW", "CareerL", "CareerW%", "Pk", "Coach_Count", "Payroll"
]

# Categorical features (labels, identifiers, strings)
categorical_features = [
    "Season", "Team", "Coach"
]

# Define target column (example: predict wins, change as needed)
target_column = "W"
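
Note that `numeric_features` as defined above still contains the target `W`, along with columns such as `L` and `WIN%` that are direct functions of wins. A minimal sketch of building a model-ready feature list follows. The `leakage_columns` list and the `model_features` name are assumptions for illustration, and the feature list is shortened.

```python
# Hypothetical, shortened copies of the lists defined above.
numeric_features = ["GP", "W", "L", "WIN%", "PTS", "SOS", "Payroll"]
target_column = "W"

# Assumption: columns that are direct functions of wins would leak the answer.
# Extend this list after reviewing the full feature set.
leakage_columns = ["L", "WIN%"]

excluded = {target_column, *leakage_columns}
model_features = [col for col in numeric_features if col not in excluded]
print(model_features)
```

Whether other win/loss breakdowns (home, road, conference) also count as leakage depends on what is known at prediction time, so that choice is left open here.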

### Dataset Description  

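The master dataset consists of NBA team performance and front-office data from the 2016–2025 seasons. Each row describes a single team’s season, including statistical performance, coaching information, draft data, payroll, and strength of schedule. A team’s season can appear on more than one row: in the previews in this notebook, rows for the same team and season repeat the team statistics and differ only in the draft pick (`Pk`). The goal is to analyze team success and build predictive models for future performance.  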
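
The previews in this notebook show the same team and season on several rows that differ only in `Pk`. The sketch below counts rows per team-season and shows one possible way to collapse them to a single row each. It uses a toy frame with the `Season`, `Team`, `W`, and `Pk` values from the 2025 preview. The aggregation choice (earliest pick plus a pick count) and the names `best_pick`, `n_picks`, and `team_season_df` are assumptions, not the notebook's method.

```python
import pandas as pd

# Toy rows mirroring the 2025 preview: team-season stats repeat, only Pk differs.
toy_df = pd.DataFrame({
    "Season": [2025, 2025, 2025, 2025, 2025],
    "Team": ["Oklahoma City Thunder"] * 3 + ["Cleveland Cavaliers"] * 2,
    "W": [68, 68, 68, 64, 64],
    "Pk": [15, 24, 44, 49, 58],
})

key = ["Season", "Team"]

# How many rows does each team-season occupy?
rows_per_team_season = toy_df.groupby(key).size()
print(rows_per_team_season)

# One possible collapse (an assumption): keep the earliest pick and
# the number of picks for each team-season.
picks = (
    toy_df.groupby(key)["Pk"]
    .agg(best_pick="min", n_picks="count")
    .reset_index()
)
team_season_df = (
    toy_df.drop(columns="Pk")
    .drop_duplicates(subset=key)
    .merge(picks, on=key)
)
print(team_season_df)
```

If repeated rows are left in place, teams with more draft picks carry more weight in any summary or model fit, which is worth keeping in mind when reading the statistics.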

---

#### Response  
**W**  
[int64] Number of regular season wins for the team in a given season.  

---

#### Features  

**Season**  
[int64] The NBA season year (e.g., 2016, 2017, …, 2025).  

**Team**  
[string] Full name of the NBA team (e.g., "Boston Celtics").  

**GP**  
[int64] Number of games played in the season.  

**L**  
[int64] Number of regular season losses.  

**WIN%**  
[float64] Win percentage for the season.  

**Min**  
[float64] Average minutes per game.  

**PTS**  
[float64] Points scored per game.  

**FGM / FGA / FG%**  
[float64] Field goals made, attempted, and field goal percentage.  

**3PM / 3PA / 3P%**  
[float64] Three-pointers made, attempted, and percentage.  

**FTM / FTA / FT%**  
[float64] Free throws made, attempted, and percentage.  

**OREB / DREB / REB**  
[int64] Offensive, defensive, and total rebounds per game.  

**AST / TOV / STL / BLK / BLKA**  
[int64] Assists, turnovers, steals, blocks, and blocks against.  

**PF / PFD**  
[int64] Personal fouls committed and fouls drawn.  

**PLUS_MINUS**  
[float64] Average point differential per game.  

**Home_W / Home_L**  
[int64] Wins and losses at home.  

**Road_W / Road_L**  
[int64] Wins and losses on the road.  

**E_W / E_L / W_W / W_L**  
[int64] Wins and losses vs Eastern and Western Conference opponents.  

**Pre-ASG_W / Pre-ASG_L / Post-ASG_W / Post-ASG_L**  
[int64] Wins and losses before and after the All-Star Game.  

**SOS**  
[float64] Strength of schedule, computed as the average win% of opponents.  

**Coach**  
[string] Head coach for the team in the given season.  

**Yw/Franch**  
[int64] Years the coach has been with the franchise.  

**YOverall**  
[int64] Total years of head coaching experience.  

**CareerW / CareerL / CareerW%**  
[int64 / float64] Career wins, losses, and win percentage of the coach.  

**Pk**  
[int64] Draft pick number of one of the team’s selections that year; in the previews, a team with several picks has one row per pick.  

**Coach_Count**  
[int64] Number of different head coaches the team had during the season.  

**Payroll**  
[float64] Total team payroll for the season in USD.  
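
The dtypes listed above are easy to verify against the loaded frame. A minimal sketch, assuming a hypothetical `documented_dtypes` mapping and a toy frame with made-up values, compares documented and actual dtypes and reports any mismatch. Per-game averages such as `REB` are a likely place for an `int64` versus `float64` discrepancy.

```python
import pandas as pd

# Toy frame with made-up values; swap in master_df to run the real check.
toy_df = pd.DataFrame({
    "Season": [2024, 2025],
    "Team": ["Team A", "Team B"],
    "W": [40, 42],
    "REB": [44.1, 45.3],
})

# Dtypes as documented above (a hypothetical subset).
documented_dtypes = {"Season": "int64", "W": "int64", "REB": "int64"}

actual = toy_df.dtypes.astype(str)
mismatches = {
    col: (expected, actual[col])
    for col, expected in documented_dtypes.items()
    if actual[col] != expected
}
print(mismatches)  # maps column -> (documented dtype, actual dtype)
```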


In [26]:
master_train.head()

| | Season | Team | GP | W | L | WIN% | Min | PTS | FGM | FGA | ... | SOS | Coach | Yw/Franch | YOverall | CareerW | CareerL | CareerW% | Pk | Coach_Count | Payroll |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 132 | 2023 | Milwaukee Bucks | 82 | 58 | 24 | 0.707 | 48.4 | 116.9 | 42.7 | 90.4 | ... | 0.493024 | Mike Budenholzer | 5 | 10 | 484 | 317 | 0.604 | 58 | 1 | 182930771.0 |
| 133 | 2023 | Boston Celtics | 82 | 57 | 25 | 0.695 | 48.7 | 117.9 | 42.2 | 88.8 | ... | 0.493012 | Joe Mazzulla | 1 | 1 | 57 | 25 | 0.695 | 35 | 1 | 178633307.0 |
| 134 | 2023 | Philadelphia 76ers | 82 | 54 | 28 | 0.659 | 48.5 | 115.2 | 40.8 | 83.8 | ... | 0.499232 | Doc Rivers | 3 | 24 | 1097 | 763 | 0.59 | 60 | 1 | |
| 135 | 2023 | Denver Nuggets | 82 | 53 | 29 | 0.646 | 48.2 | 115.8 | 43.6 | 86.4 | ... | 0.48939 | Michael Malone | 8 | 10 | 406 | 337 | 0.546 | 40 | 1 | 162338665.0 |
| 136 | 2023 | Cleveland Cavaliers | 82 | 51 | 31 | 0.622 | 48.5 | 112.3 | 41.6 | 85.2 | ... | 0.495671 | J.B. Bickerstaff | 4 | 7 | 207 | 256 | 0.447 | 49 | 1 | 151966241.0 |


In [27]:
master_test.head()

| | Season | Team | GP | W | L | WIN% | Min | PTS | FGM | FGA | ... | SOS | Coach | Yw/Franch | YOverall | CareerW | CareerL | CareerW% | Pk | Coach_Count | Payroll |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025 | Oklahoma City Thunder | 82 | 68 | 14 | 0.829 | 48.1 | 120.5 | 44.6 | 92.7 | ... | 0.487557 | Mark Daigneault | 5 | 5 | 211 | 189 | 0.528 | 15 | 1 | 166418720.0 |
| 1 | 2025 | Oklahoma City Thunder | 82 | 68 | 14 | 0.829 | 48.1 | 120.5 | 44.6 | 92.7 | ... | 0.487557 | Mark Daigneault | 5 | 5 | 211 | 189 | 0.528 | 24 | 1 | 166418720.0 |
| 2 | 2025 | Oklahoma City Thunder | 82 | 68 | 14 | 0.829 | 48.1 | 120.5 | 44.6 | 92.7 | ... | 0.487557 | Mark Daigneault | 5 | 5 | 211 | 189 | 0.528 | 44 | 1 | 166418720.0 |
| 3 | 2025 | Cleveland Cavaliers | 82 | 64 | 18 | 0.78 | 48.2 | 121.9 | 44.5 | 90.8 | ... | 0.479675 | Kenny Atkinson | 1 | 5 | 182 | 208 | 0.467 | 49 | 1 | 165110486.0 |
| 4 | 2025 | Cleveland Cavaliers | 82 | 64 | 18 | 0.78 | 48.2 | 121.9 | 44.5 | 90.8 | ... | 0.479675 | Kenny Atkinson | 1 | 5 | 182 | 208 | 0.467 | 58 | 1 | 165110486.0 |


### Summary Statistics

In [14]:
# summary statistics
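
One possible starting point for this cell is sketched below, using a toy frame built from a few `W`, `PTS`, and `Payroll` values in the 2023 training preview (including the missing payroll). The name `toy_train` is hypothetical and stands in for `master_train`. The sketch combines `describe()` with a per-column count of missing values.

```python
import pandas as pd

# Toy stand-in for master_train, using values from the 2023 preview.
toy_train = pd.DataFrame({
    "W": [58, 57, 54, 53],
    "PTS": [116.9, 117.9, 115.2, 115.8],
    "Payroll": [182930771.0, 178633307.0, None, 162338665.0],
})

# One row per column: count, mean, std, min, quartiles, max.
summary = toy_train.describe().T
print(summary[["count", "mean", "min", "max"]])

# Missing values per column; only columns with gaps are shown.
missing = toy_train.isna().sum()
print(missing[missing > 0])
```

The `count` column in `describe()` excludes missing values, so comparing it with the row count is a quick way to spot gaps such as the missing payroll.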

### Exploratory Visualization

In [15]:
# exploratory visualization
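
A first plot for this cell could relate a single feature to the target. The sketch below is only an illustration: it uses matplotlib (not imported elsewhere in this notebook) and a toy frame built from the `W` and `PTS` values in the 2023 training preview. The name `toy_train` is hypothetical and stands in for `master_train`.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for master_train, using values from the 2023 preview.
toy_train = pd.DataFrame({
    "W": [58, 57, 54, 53, 51],
    "PTS": [116.9, 117.9, 115.2, 115.8, 112.3],
})

# Scatter plot of the target against one candidate feature.
fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(toy_train["PTS"], toy_train["W"])
ax.set_xlabel("PTS (points per game)")
ax.set_ylabel("W (regular season wins)")
ax.set_title("Wins vs. points per game")
fig.tight_layout()

# Pearson correlation as a numeric companion to the plot.
corr = toy_train["PTS"].corr(toy_train["W"])
print(f"Correlation between PTS and W: {corr:.2f}")
```

Five rows are far too few to draw conclusions from; the same code applied to the full training frame would give a more meaningful picture.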