## Data Preparation and Feature Engineering
This notebook prepares the datasets needed to model NBA playoff game outcomes. Beyond cleaning raw stats, I focus on building basketball-specific features that analytics work has long tied to winning.

---

#### Why These Seasons (2015-2025)?
I focused on the 2015-2025 seasons to capture the modern NBA era, when analytics-driven play and the 3-point revolution reshaped the game.

1. **Start of the 3-Point Era** -> The Golden State Warriors, led by Stephen Curry, revolutionized basketball by showing how the 3-point shot could be the central weapon of an offense. Curry's unprecedented shooting range and efficiency reshaped defenses and forced every team to adjust. League-wide, average 3PA per game rose from about 22 in 2014-15 to about 27 by 2016-17 and has continued to climb, with some teams now attempting 40+ threes per game. This shift directly affects Effective FG% (eFG%).
2. **Pace and Spacing Revolution** -> Teams shifted to small-ball, faster pace, and ball movement, elevating the importance of TOV% and ORB%.
3. **Consistency with Advanced Metrics** -> From 2015 onward, the Four Factors (eFG%, TOV%, rebounding via ORB% and DRB%, and FT/FGA) became more stable predictors of success.
4. **Reliable Data** -> Advanced play-by-play and team stats are cleaner and standardized post-2015.
5. **Future-Proofing** -> Including 2025 ensures the model is tested against the most recent playoff trends.

---

#### Why Dean Oliver's Four Factors?

In basketball analytics, **Dean Oliver’s Four Factors** are widely regarded as the core drivers of winning. They are credited with explaining most of the variation in game outcomes and are frequently cited in professional scouting and NBA front office analysis.  

- **Effective Field Goal Percentage (eFG%)** → Adjusts for the higher value of three-pointers  
- **Turnover Rate (TOV%)** → Captures lost scoring opportunities  
- **Offensive Rebound % (ORB%)** and **Defensive Rebound % (DRB%)** → Measure possession control  
- **Free Throws per Field Goal Attempt (FT/FGA)** → Rewards teams that get to the line efficiently  

By engineering these metrics, I align the model with domain knowledge, ensuring it emphasizes what truly matters in basketball success, rather than relying only on raw box score totals.
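
As a minimal sketch of how these metrics can be computed from a single box-score line, the function below uses the same formulas as the cells later in this notebook. The function name `four_factors`, its argument names, and the sample numbers are hypothetical and only for illustration.

```python
def four_factors(fgm, fga, fg3m, fta, ftm, tov, orb, drb, opp_orb, opp_drb):
    """Return the Four Factors for one team in one game (hypothetical helper)."""
    return {
        # Made threes count 1.5x because they are worth an extra point
        "eFG%": (fgm + 0.5 * fg3m) / fga,
        # Turnovers per estimated possession-ending play
        "TOV%": tov / (fga + 0.44 * fta + tov),
        # Share of available rebounds secured on each end of the floor
        "ORB%": orb / (orb + opp_drb),
        "DRB%": drb / (drb + opp_orb),
        # Free throws made per field goal attempt
        "FT/FGA": ftm / fga,
    }


# Illustrative numbers only, not real game data
example = four_factors(
    fgm=40, fga=85, fg3m=12, fta=20, ftm=16,
    tov=13, orb=10, drb=33, opp_orb=9, opp_drb=30,
)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Rebounding is one of the four factors but is split into ORB% and DRB%, which is why five columns appear in the datasets below.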

---

#### Regular Season Team Statistics 
This step produces **season-level team features** that summarize each team’s efficiency during the regular season. These values serve as baseline performance indicators heading into the playoffs.

##### Key Steps:
1. Import raw `Games.csv` and `TeamStatistics.csv`.
2. Assign each game to its correct NBA season.
3. Filter only regular season games.
4. Compute the **Four Factors**.
5. Merge opponent rebounds for ORB% and DRB%.
6. Standardize team names.
7. Aggregate per-season averages by team.
8. Keep seasons 2015-2025.

In [19]:
# Import required libraries
import pandas as pd

# Load games and team statistics data
games_df = pd.read_csv("../data/raw/Games.csv", low_memory=False)
team_stats_df = pd.read_csv("../data/raw/TeamStatistics.csv", low_memory=False)

# Assign a season to each game
def get_season_year(date_str):
    date = pd.to_datetime(date_str)
    return date.year + 1 if date.month >= 10 else date.year

team_stats_df["season"] = team_stats_df["gameDate"].apply(get_season_year)

# Filter for regular season games
# Merge gameType info from games_df
merged = pd.merge(
    team_stats_df,
    games_df[["gameId", "gameType"]],
    on="gameId",
    how="left"
)

# Keep only regular season games
df = merged[merged["gameType"].str.strip() == "Regular Season"].copy()

# Compute the Four Factors team statistics
df["eFG%"] = (df["fieldGoalsMade"] + 0.5 * df["threePointersMade"]) / df["fieldGoalsAttempted"]
df["TOV%"] = df["turnovers"] / (df["fieldGoalsAttempted"] + 0.44 * df["freeThrowsAttempted"] + df["turnovers"])
df["FT/FGA"] = df["freeThrowsMade"] / df["fieldGoalsAttempted"]

# Merge opponent rebounds for ORB% and DRB%
opp_stats = df[["gameId", "teamId", "reboundsOffensive", "reboundsDefensive"]].copy()
opp_stats.columns = ["gameId", "opponentTeamId", "oppORB", "oppDRB"]

df = pd.merge(df, opp_stats, on=["gameId", "opponentTeamId"], how="left")

df["ORB%"] = df["reboundsOffensive"] / (df["reboundsOffensive"] + df["oppDRB"]).clip(lower=1)
df["DRB%"] = df["reboundsDefensive"] / (df["reboundsDefensive"] + df["oppORB"]).clip(lower=1)

# Combine team city and team name
df["teamName"] = df["teamCity"].str.strip() + " " + df["teamName"].str.strip()

# Group by team name and season to compute season averages
regular_season_stats = df.groupby(["teamName", "season"])[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean().reset_index()

# Filter to include only NBA seasons from 2015 to 2025
regular_season_stats = regular_season_stats[(regular_season_stats["season"] >= 2015) & (regular_season_stats["season"] <= 2025)].dropna()

# Export the final dataset
regular_season_stats.to_csv("../data/processed/team_statistics_regular_season.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/team_statistics_regular_season.csv")
print(df.head())

        teamName  season      eFG%      TOV%    FT/FGA      ORB%      DRB%
0  Atlanta Hawks    2015  0.528550  0.134987  0.203871  0.212378  0.735202
1  Atlanta Hawks    2016  0.517698  0.138237  0.187612  0.187279  0.748457
2  Atlanta Hawks    2017  0.505295  0.142289  0.217033  0.235062  0.763283
3  Atlanta Hawks    2018  0.513520  0.141479  0.187188  0.208711  0.763587
4  Atlanta Hawks    2019  0.522684  0.143139  0.193936  0.247981  0.765988
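
The season values above are averages of per-game rates. A possible alternative is to sum the raw counts over the season and compute each rate once from the totals, which weights high-volume games more heavily. The sketch below contrasts the two for eFG% on a made-up two-game sample; the team name and numbers are illustrative, not taken from the data.

```python
import pandas as pd

# Hypothetical two-game sample for one team (illustrative numbers only)
sample = pd.DataFrame({
    "teamName": ["Team A", "Team A"],
    "season": [2015, 2015],
    "fieldGoalsMade": [30, 50],
    "threePointersMade": [5, 15],
    "fieldGoalsAttempted": [70, 100],
})

# Option 1: average of per-game eFG% (the approach used in the cell above)
sample["eFG%"] = (
    sample["fieldGoalsMade"] + 0.5 * sample["threePointersMade"]
) / sample["fieldGoalsAttempted"]
mean_of_games = sample.groupby(["teamName", "season"])["eFG%"].mean()

# Option 2: eFG% computed once from season totals
count_cols = ["fieldGoalsMade", "threePointersMade", "fieldGoalsAttempted"]
totals = sample.groupby(["teamName", "season"])[count_cols].sum()
from_totals = (
    totals["fieldGoalsMade"] + 0.5 * totals["threePointersMade"]
) / totals["fieldGoalsAttempted"]

print(mean_of_games.round(4))
print(from_totals.round(4))
```

The two numbers differ slightly whenever shot volume varies between games. Either choice can be reasonable as long as it is applied consistently across teams and seasons.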


#### Playoff Game Features
This step structures **game-level data for the playoffs** (e.g., seeding, round, matchup context). These features add the playoff-specific elements that regular season stats cannot capture.

##### Key Steps: 
1. Load playoff games only (`gameType = Playoffs`).
2. Assign season and round (e.g., First Round, Conference Finals, NBA Finals).
3. Format team names for consistency.
4. Create a **series ID** to link all games in the same playoff series.
5. Normalize rounds into numeric values (1 = First Round, 4 = Finals).
6. Add seeding information for both teams.
7. Set **home-court advantage** flags.
8. Store metadata (scores, winners, matchup labels).

Playoff basketball is different from the regular season. Matchups repeat, adjustments are made, and seeding/home-court often play decisive roles. Including these contextual features allows the model to account for playoff dynamics beyond pure stats.

In [20]:
# Import required libraries
import pandas as pd

# Load games data
games_df = pd.read_csv("../data/raw/Games.csv", low_memory=False)

# Filter for playoff games only
playoff_games = games_df[games_df["gameType"].str.strip() == "Playoffs"].copy()

# Assign a season to each playoff game
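def get_season_year(date_str):
    date = pd.to_datetime(date_str)
    # The 2019-20 playoffs ran into October 2020, so keep those games in season 2020
    if date.year == 2020 and date.month == 10:
        return 2020
    return date.year + 1 if date.month >= 10 else date.year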

playoff_games["season"] = playoff_games["gameDate"].apply(get_season_year)

# Format home and away team names
playoff_games["homeTeam"] = playoff_games["hometeamCity"].str.strip() + " " + playoff_games["hometeamName"].str.strip()
playoff_games["awayTeam"] = playoff_games["awayteamCity"].str.strip() + " " + playoff_games["awayteamName"].str.strip()

# Create a series ID from home team, away team, and season
# Note: the ID depends on which team is home, so one series produces two IDs
playoff_games["seriesId"] = (
    playoff_games[["homeTeam", "awayTeam", "season"]]
    .astype(str)
    .agg("_vs_".join, axis=1)
)

# Normalize game round into a number (1 = First Round, ..., 4 = NBA Finals)
def get_round(label):
    if not isinstance(label, str):
        return None
    label = label.lower()
    if "first round" in label:
        return 1
    elif "semifinal" in label:
        return 2
    elif "conf. finals" in label or "conference finals" in label:
        return 3
    elif "nba finals" in label and "conf" not in label:
        return 4
    return None

playoff_games["round"] = playoff_games["gameLabel"].apply(get_round)

# Extract game number in series and whether the home team won
playoff_games["gameNumber"] = playoff_games["seriesGameNumber"].fillna(1).astype(int)
playoff_games["homeWin"] = playoff_games["winner"] == playoff_games["hometeamId"]

# Select relevant metadata columns
playoff_metadata = playoff_games[[
    "gameId", "gameDate", "season", "homeTeam", "awayTeam",
    "seriesId", "round", "gameNumber", "homeWin", "gameLabel",
    "gameSubLabel", "homeScore", "awayScore"
]]

# Merge home and away team seeds
games = (
    playoff_metadata
    .merge(
        pd.read_csv("../data/raw/playoff_seeds_2015_2025.csv")
          .rename(columns={"team": "homeTeam", "seed": "homeSeed"}),
        on=["season", "homeTeam"],
        how="left"
    )
    .merge(
        pd.read_csv("../data/raw/playoff_seeds_2015_2025.csv")
          .rename(columns={"team": "awayTeam", "seed": "awaySeed"}),
        on=["season", "awayTeam"],
        how="left"
    )
)

# Create readable matchup name
round_name_map = {
    1: "First Round",
    2: "Conference Semifinals",
    3: "Conference Finals",
    4: "NBA Finals"
}
games["matchupType"] = games["round"].map(round_name_map)

# Rename conference columns (if present)
games["conference_home"] = games.get("conference_x")
games["conference_away"] = games.get("conference_y")
games = games.drop(columns=["conference_x", "conference_y"], errors="ignore")

# Flag games where the home team is the higher seed (homeSeed < awaySeed);
# rows with a missing seed evaluate to False
games["homeCourt"] = games["homeSeed"] < games["awaySeed"]

# Export final dataset
games.to_csv("../data/processed/playoffs_games.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/playoffs_games.csv")
print(df.head())

     gameId             gameDate  season               homeTeam  \
0  42400407  2025-06-22 20:00:00    2025  Oklahoma City Thunder   
1  42400406  2025-06-19 20:30:00    2025         Indiana Pacers   
2  42400405  2025-06-16 20:30:00    2025  Oklahoma City Thunder   
3  42400404  2025-06-13 20:30:00    2025         Indiana Pacers   
4  42400403  2025-06-11 20:30:00    2025         Indiana Pacers   

                awayTeam                                         seriesId  \
0         Indiana Pacers  Oklahoma City Thunder_vs_Indiana Pacers_vs_2025   
1  Oklahoma City Thunder  Indiana Pacers_vs_Oklahoma City Thunder_vs_2025   
2         Indiana Pacers  Oklahoma City Thunder_vs_Indiana Pacers_vs_2025   
3  Oklahoma City Thunder  Indiana Pacers_vs_Oklahoma City Thunder_vs_2025   
4  Oklahoma City Thunder  Indiana Pacers_vs_Oklahoma City Thunder_vs_2025   

   round  gameNumber  homeWin   gameLabel gameSubLabel  homeScore  awayScore  \
0    4.0           7     True  NBA Finals       Game 7
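
In the output above, the same Finals series appears under two `seriesId` values because the ID is built from home and away order. One possible way to get a single key per series is to sort the two team names before joining them. The sketch below assumes the `homeTeam`, `awayTeam`, and `season` columns shown above; the `make_series_key` helper and the `seriesKey` column are hypothetical names.

```python
import pandas as pd

# Two games from the same series with home and away swapped
games = pd.DataFrame({
    "season": [2025, 2025],
    "homeTeam": ["Oklahoma City Thunder", "Indiana Pacers"],
    "awayTeam": ["Indiana Pacers", "Oklahoma City Thunder"],
})


def make_series_key(row):
    """Build a key that does not depend on which team is at home."""
    first, second = sorted([row["homeTeam"], row["awayTeam"]])
    return f"{first}_vs_{second}_{row['season']}"


games["seriesKey"] = games.apply(make_series_key, axis=1)
print(games["seriesKey"].unique())
```

Grouping on a key like this would collect all games of a series together regardless of venue.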

#### Head-to-Head Regular Season Statistics
To capture matchup familiarity and momentum, I computed **head-to-head statistics** for each playoff pairing based on their regular season meetings.

##### Key Steps:
1. Filter regular season games (2015-2025).
2. Track how many times each playoff opponent faced one another.
3. Record how many games were won vs. lost.
4. Calculate **head-to-head win rate**.


Head-to-head records provide insight into how teams match up stylistically. For example, a lower seed may still have an edge if it consistently outperformed its playoff opponent in the regular season.

In [24]:
# Import libraries
import pandas as pd

# Load games and team statistics datasets
games_df = pd.read_csv("../data/raw/Games.csv", low_memory=False)
stats_df = pd.read_csv("../data/raw/TeamStatistics.csv", low_memory=False)

# Assign a season to each game
def get_season_year(date_str):
    date = pd.to_datetime(date_str)
    return date.year + 1 if date.month >= 10 else date.year

stats_df["season"] = stats_df["gameDate"].apply(get_season_year)

# Merge to get gameType info
games_df_small = games_df[["gameId", "gameType"]]
merged_df = stats_df.merge(games_df_small, on="gameId", how="left")

# Filter for Regular Season Games Only (2015-2025)
regular_season = merged_df[merged_df["gameType"].str.strip() == "Regular Season"].copy()
regular_season = regular_season[(regular_season["season"] >= 2015) & (regular_season["season"] <= 2025)]

# Standardize Team and Opponent Names
regular_season["teamName"] = regular_season["teamCity"].str.strip() + " " + regular_season["teamName"].str.strip()
regular_season["opponentName"] = regular_season["opponentTeamCity"].str.strip() + " " + regular_season["opponentTeamName"].str.strip()

# Create Win Flag
regular_season["win_flag"] = regular_season["win"] == 1

# Group and aggregate head-to-head statistics
head_to_head = regular_season.groupby(["season", "teamName", "opponentName"]).agg(
    games_played=("gameId", "count"),
    wins=("win_flag", "sum")
).reset_index()
head_to_head["losses"] = head_to_head["games_played"] - head_to_head["wins"]
head_to_head["win_rate"] = head_to_head["wins"] / head_to_head["games_played"]

# Export final dataset
head_to_head.to_csv("../data/processed/regular_season_head_to_head_stats.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/regular_season_head_to_head_stats.csv")
print(df.head())

   season       teamName         opponentName  games_played  wins  losses  \
0    2015  Atlanta Hawks       Boston Celtics             3     2       1   
1    2015  Atlanta Hawks        Brooklyn Nets             4     4       0   
2    2015  Atlanta Hawks    Charlotte Hornets             4     2       2   
3    2015  Atlanta Hawks        Chicago Bulls             3     2       1   
4    2015  Atlanta Hawks  Cleveland Cavaliers             4     3       1   

   win_rate  
0  0.666667  
1  1.000000  
2  0.500000  
3  0.666667  
4  0.750000  
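
To use these records as a playoff feature, each playoff game needs the home team's regular season win rate against that specific opponent. A minimal sketch of one way to attach it is below, assuming the column names shown above (`season`, `teamName`, `opponentName`, `win_rate`, `homeTeam`, `awayTeam`). The sample rows and the `homeH2HWinRate` column name are hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like the processed files (illustrative values)
head_to_head = pd.DataFrame({
    "season": [2015, 2015],
    "teamName": ["Team A", "Team B"],
    "opponentName": ["Team B", "Team A"],
    "win_rate": [0.75, 0.25],
})
playoff_games = pd.DataFrame({
    "gameId": [1, 2],
    "season": [2015, 2015],
    "homeTeam": ["Team A", "Team B"],
    "awayTeam": ["Team B", "Team A"],
})

# Rename so the head-to-head row is read from the home team's perspective
h2h_home = head_to_head.rename(columns={
    "teamName": "homeTeam",
    "opponentName": "awayTeam",
    "win_rate": "homeH2HWinRate",
})

# Left join keeps every playoff game; unmatched pairings become NaN
features = playoff_games.merge(
    h2h_home, on=["season", "homeTeam", "awayTeam"], how="left"
)
print(features)
```

Because the head-to-head table holds one row per direction of each pairing, the same lookup works no matter which team hosts a given game.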


#### Playoff Team Statistics
Finally, I computed the **Four Factors for each playoff game** to serve as inputs for the predictive models. Unlike the regular season stats (which are seasonal averages), these are **game-by-game playoff stats**.

##### Key Steps:
1. Load `Games.csv` and `TeamStatistics.csv` (playoff only).
2. Assign playoff season.
3. Compute the **Four Factors** per game.
4. Merge in opponent rebounds to calculate ORB% and DRB%.
5. Attach home/away scores.
6. Standardize team names.
7. Keep only 2015-2025 seasons.

These game-level playoff stats give the model real-time context: how teams actually performed under postseason pressure, which can be more informative than regular season form alone.

With these datasets (regular season averages, playoff game stats, and head-to-head records), the model has both historical trends and series-specific context, allowing for more realistic playoff outcome predictions.

In [27]:
# Import libraries
import pandas as pd

# Load team statistics
df = pd.read_csv("../data/raw/TeamStatistics.csv", low_memory=False)

# Add gameType by merging with Games.csv
games_df = pd.read_csv("../data/raw/Games.csv", low_memory=False)
df = pd.merge(df, games_df[["gameId", "gameType"]], on="gameId", how="left")

# Filter only playoff games
df = df[df["gameType"].str.strip() == "Playoffs"].copy()

# Assign each playoffs game to a season
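def get_season_year(date_str):
    date = pd.to_datetime(date_str)
    # The 2019-20 playoffs ran into October 2020, so keep those games in season 2020
    if date.year == 2020 and date.month == 10:
        return 2020
    return date.year + 1 if date.month >= 10 else date.year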

df["season"] = df["gameDate"].apply(get_season_year)

# Compute the Four Factors team statistics
df["eFG%"] = (df["fieldGoalsMade"] + 0.5 * df["threePointersMade"]) / df["fieldGoalsAttempted"]
df["TOV%"] = df["turnovers"] / (df["fieldGoalsAttempted"] + 0.44 * df["freeThrowsAttempted"] + df["turnovers"])
df["FT/FGA"] = df["freeThrowsMade"] / df["fieldGoalsAttempted"]

# Merge opponent rebounds for ORB% and DRB%
opp_stats = df[["gameId", "teamId", "reboundsOffensive", "reboundsDefensive"]].copy()
opp_stats.columns = ["gameId", "opponentTeamId", "oppORB", "oppDRB"]

df = pd.merge(df, opp_stats, on=["gameId", "opponentTeamId"], how="left")

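# Clip the denominator at 1 to avoid division by zero, matching the regular season cell
df["ORB%"] = df["reboundsOffensive"] / (df["reboundsOffensive"] + df["oppDRB"]).clip(lower=1)
df["DRB%"] = df["reboundsDefensive"] / (df["reboundsDefensive"] + df["oppORB"]).clip(lower=1)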

# Combine team city and team name
df["teamName"] = df["teamCity"].str.strip() + " " + df["teamName"].str.strip()

# Identify home and away teams with their scores
score_df = df[["gameId", "teamName", "teamScore", "home"]].copy()

# Separate into home and away team scores
home_scores = score_df[score_df["home"] == True][["gameId", "teamName", "teamScore"]].rename(columns={"teamScore": "homeScore", "teamName": "homeTeam"})
away_scores = score_df[score_df["home"] == False][["gameId", "teamName", "teamScore"]].rename(columns={"teamScore": "awayScore", "teamName": "awayTeam"})

# Merge them back together
game_scores = pd.merge(home_scores, away_scores, on="gameId", how="inner")

# Merge 'game_scores' into playoff team statistics Dataframe
df = df.merge(game_scores[["gameId", "homeScore", "awayScore"]], on="gameId", how="left")

# Filter to include only NBA seasons from 2015 to 2025
df = df[(df["season"] >= 2015) & (df["season"] <= 2025)].copy()

# Keep final playoff statistics per team per game
playoff_ff = df[[
    "gameId", "season", "teamName", "eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%", "homeScore", "awayScore"
]].copy()

# Export the final dataset
playoff_ff.to_csv("../data/processed/team_statistics_playoff_games.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/team_statistics_playoff_games.csv")
print(df.head())

     gameId  season               teamName      eFG%      TOV%    FT/FGA  \
0  42400407    2025         Indiana Pacers  0.492857  0.202390  0.314286   
1  42400407    2025  Oklahoma City Thunder  0.465517  0.065032  0.252874   
2  42400406    2025         Indiana Pacers  0.494565  0.088496  0.184783   
3  42400406    2025  Oklahoma City Thunder  0.472973  0.197294  0.283784   
4  42400405    2025         Indiana Pacers  0.518293  0.187713  0.292683   

       ORB%      DRB%  homeScore  awayScore  
0  0.307692  0.717391        103         91  
1  0.282609  0.692308        103         91  
2  0.229167  0.897436        108         91  
3  0.102564  0.770833        108         91  
4  0.409091  0.627451        120        109  
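
As a rough sketch of how the processed files might be combined for modeling, the example below attaches each team's regular season Four Factors to a playoff game and takes home-minus-away differences. This is one possible layout, not necessarily the one used downstream. The `attach` helper, the `home_`, `away_`, and `diff_` column prefixes, and the sample rows are all hypothetical.

```python
import pandas as pd

factor_cols = ["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]

# Hypothetical rows shaped like the processed files (illustrative values)
season_stats = pd.DataFrame({
    "teamName": ["Team A", "Team B"],
    "season": [2015, 2015],
    "eFG%": [0.53, 0.50],
    "TOV%": [0.13, 0.14],
    "FT/FGA": [0.20, 0.19],
    "ORB%": [0.25, 0.22],
    "DRB%": [0.76, 0.74],
})
playoff_games = pd.DataFrame({
    "gameId": [1],
    "season": [2015],
    "homeTeam": ["Team A"],
    "awayTeam": ["Team B"],
    "homeWin": [True],
})


def attach(games, stats, side):
    """Join season averages onto the home or away team of each game."""
    renamed = stats.rename(columns={
        "teamName": f"{side}Team",
        **{col: f"{side}_{col}" for col in factor_cols},
    })
    return games.merge(renamed, on=["season", f"{side}Team"], how="left")


model_df = attach(attach(playoff_games, season_stats, "home"), season_stats, "away")

# Home-minus-away differences, one column per factor
for col in factor_cols:
    model_df[f"diff_{col}"] = model_df[f"home_{col}"] - model_df[f"away_{col}"]

print(model_df.filter(like="diff_"))
```

Differencing puts both teams on one row per game, so a positive `diff_eFG%` means the home team shot more efficiently during the regular season.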
