## Predictive Modeling by Round and Game
NBA playoff series unfold differently depending on **round, game number, and series context**. To realistically capture these dynamics, I built **custom datasets for each round and game** (2015–2025). 

---

This approach ensures that models:  
- Avoid **data leakage** by only using information available before each game.  
- Reflect **real-world playoff strategy** (e.g., rest days, momentum, series score).  
- Improve **interpretability** by aligning inputs with how teams and analysts think about playoffs.  

---

### Why Separate Models for Rounds and Games?

Playoff basketball is highly contextual.  
A Game 1 in Round 1 looks very different from a Game 6 in the Finals:  

- **Round 1 Game 1** → Teams rely mostly on regular season form, seeding, and head-to-head history.  
- **Round 1 Game 2** → Momentum from Game 1 matters; team adjustments begin.  
- **Games 3–7** → Series score and rolling performance stats become more predictive.  
- **Later Rounds (2–4)** → Fatigue, rest advantage, and prior series length play a larger role.  

By tailoring features for each context, the model better mirrors how **real teams adjust strategies across the playoffs.**

---

#### Round 1 – Game 1
Game 1 is the baseline of the playoffs.
Teams rely primarily on **regular season performance, seeding, and head-to-head history** because no playoff games have been played yet.

**Features used:**
- Regular season Four Factors (eFG%, TOV%, ORB%, DRB%, FT/FGA).
- Team seeding (higher seeds generally favored).
- Head-to-head regular season results.
- Home-court advantage.

In [3]:
# Import libraries
import pandas as pd

# Load Regular Season Game Team Statstics and Playoff Games Statistics Dataset
playoffs_path = "../data/processed/playoffs_games.csv"
playoff_games_df = pd.read_csv(playoffs_path)

regular_season_path = "../data/processed/team_statistics_regular_season.csv"
regular_season_stats_df = pd.read_csv(regular_season_path)

# Filter for Round 1, Game 1
round1_game1_df = playoff_games_df[
    (playoff_games_df["round"] == 1) &
    (playoff_games_df["gameNumber"] == 1) &
    (playoff_games_df["season"].between(2015, 2025))
].copy()

# Merge Home Team Regular Season Statistics
round1_game1_df = round1_game1_df.merge(
    regular_season_stats_df,
    left_on=["season", "homeTeam"],
    right_on=["season", "teamName"],
    how="left",
    suffixes=("", "_home")
)

# Rename Home Team Columns
round1_game1_df = round1_game1_df.rename(columns={
    "eFG%": "eFG%_home_season",
    "TOV%": "TOV%_home_season",
    "ORB%": "ORB%_home_season",
    "DRB%": "DRB%_home_season",
    "FT/FGA": "FT/FGA_home_season"
})

# Merge Away Team Regular Season Statistics
round1_game1_df = round1_game1_df.merge(
    regular_season_stats_df,
    left_on=["season", "awayTeam"],
    right_on=["season", "teamName"],
    how="left",
    suffixes=("", "_away")
)

# Rename Away Team Columns
round1_game1_df = round1_game1_df.rename(columns={
    "eFG%": "eFG%_away_season",
    "TOV%": "TOV%_away_season",
    "ORB%": "ORB%_away_season",
    "DRB%": "DRB%_away_season",
    "FT/FGA": "FT/FGA_away_season"
})

# Drop unnecessary columns
round1_game1_df = round1_game1_df.drop(columns=["teamName", "teamName_away"])

# Select features for modeling
model_df = round1_game1_df[[
    "gameId", "season", "homeTeam", "awayTeam",
    "homeSeed", "awaySeed", "homeCourt", "homeWin", "gameNumber", "round",
    "eFG%_home_season", "TOV%_home_season", "ORB%_home_season", "DRB%_home_season", "FT/FGA_home_season",
    "eFG%_away_season", "TOV%_away_season", "ORB%_away_season", "DRB%_away_season", "FT/FGA_away_season"
]]

# Load Head-to-Head Regular Season Dataset
h2h_path = "../data/processed/regular_season_head_to_head_stats.csv"
h2h_df = pd.read_csv(h2h_path)

# Rename columns for clarity
h2h_df = h2h_df.rename(columns={
    "teamName": "homeTeam",
    "opponentName": "awayTeam",
    "wins": "h2h_home_wins",
    "games_played": "h2h_games_played",
    "win_rate": "h2h_home_winrate"
})

# Merge Head-to-Head Statistics into Round 1 Game 1 Dataset
model_df = model_df.merge(
    h2h_df[["season", "homeTeam", "awayTeam", "h2h_games_played", "h2h_home_wins", "h2h_home_winrate"]],
    on=["season", "homeTeam", "awayTeam"],
    how="left"
)

# Export the final dataset
model_df.to_csv("../data/processed/modeling_round1_game1.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/modeling_round1_game1.csv")
print(df.head())

     gameId  season               homeTeam                awayTeam  homeSeed  \
0  42400151    2025        Houston Rockets   Golden State Warriors       2.0   
1  42400101    2025    Cleveland Cavaliers              Miami Heat       1.0   
2  42400111    2025         Boston Celtics           Orlando Magic       2.0   
3  42400141    2025  Oklahoma City Thunder       Memphis Grizzlies       1.0   
4  42400161    2025     Los Angeles Lakers  Minnesota Timberwolves       3.0   

   awaySeed  homeCourt  homeWin  gameNumber  round  ...  DRB%_home_season  \
0       7.0       True    False           1    1.0  ...          0.763132   
1       8.0       True     True           1    1.0  ...          0.749635   
2       7.0       True     True           1    1.0  ...          0.760160   
3       8.0       True     True           1    1.0  ...          0.747759   
4       6.0       True    False           1    1.0  ...          0.745059   

   FT/FGA_home_season  eFG%_away_season  TOV%_away_seaso

####  Round 1 – Game 2
By Game 2, teams have already played once.
Momentum and adjustments start to matter, so we include **Game 1 outcomes** alongside regular season baselines.

**Features used:**
- Regular season Four Factors.
- Game 1 Four Factors.
- Game 1 box score stats (both teams).
- Game 1 winner/loser (momentum shift).
- Seeding + home court advantage.

In [7]:
# Import libraries
import pandas as pd

# Load Regular Season Games Team Statistics, Playoff Games Statistics, and Playoff Games Team Statistics.
regular_season_stats = pd.read_csv("../data/processed/team_statistics_regular_season.csv")
playoff_games = pd.read_csv("../data/processed/playoffs_games.csv")
playoff_stats = pd.read_csv("../data/processed/team_statistics_playoff_games.csv")

# Filter for seasons 2015-2025
playoff_games = playoff_games[
    (playoff_games["season"] >= 2015) & (playoff_games["season"] <= 2025)
]
playoff_stats = playoff_stats[
    (playoff_stats["season"] >= 2015) & (playoff_stats["season"] <= 2025)
]
regular_season_stats = regular_season_stats[
    (regular_season_stats["season"] >= 2015) & (regular_season_stats["season"] <= 2025)
]

# Filter for Round 1, Game 2
round1_game2_df = playoff_games[
    (playoff_games["round"] == 1) & (playoff_games["gameNumber"] == 2)
].copy()

# Pull Game 1 IDs and Teams
game1_df = playoff_games[
    (playoff_games["round"] == 1) & (playoff_games["gameNumber"] == 1)
][["gameId", "season", "homeTeam", "awayTeam"]].copy()

# Merge Game 1 Home Team Statistics 
home_g1 = game1_df.merge(
    playoff_stats,
    left_on=["gameId", "homeTeam"],
    right_on=["gameId", "teamName"],
    how="left"
).rename(columns={
    "eFG%": "eFG%_home_r1g1",
    "TOV%": "TOV%_home_r1g1",
    "FT/FGA": "FT/FGA_home_r1g1",
    "ORB%": "ORB%_home_r1g1",
    "DRB%": "DRB%_home_r1g1",
    "homeScore": "homeScore_r1g1",
    "awayScore": "awayScore_r1g1"
}).drop(columns=["teamName"])

# Merge Game 1 Away Team Statistics
away_g1 = home_g1.merge(
    playoff_stats,
    left_on=["gameId", "awayTeam"],
    right_on=["gameId", "teamName"],
    how="left"
).rename(columns={
    "eFG%": "eFG%_away_r1g1",
    "TOV%": "TOV%_away_r1g1",
    "FT/FGA": "FT/FGA_away_r1g1",
    "ORB%": "ORB%_away_r1g1",
    "DRB%": "DRB%_away_r1g1"
}).drop(columns=["teamName"])

# Merge Game 1 Statistics into Game 2 Matchups
round1_game2_df = round1_game2_df.merge(
    away_g1[
        ["season", "homeTeam", "awayTeam",
         "eFG%_home_r1g1", "TOV%_home_r1g1", "FT/FGA_home_r1g1", "ORB%_home_r1g1", "DRB%_home_r1g1",
         "eFG%_away_r1g1", "TOV%_away_r1g1", "FT/FGA_away_r1g1", "ORB%_away_r1g1", "DRB%_away_r1g1",
         "homeScore_r1g1", "awayScore_r1g1"
        ]
    ],
    on=["season", "homeTeam", "awayTeam"],
    how="left"
)

# Add Momentum Feature - homeLostRound1Game1
round1_game2_df["homeLostRound1Game1"] = (
    round1_game2_df["homeScore_r1g1"] < round1_game2_df["awayScore_r1g1"]
).astype(int)

# Merge Regular Season Statistics (Home Team and Away Team)
round1_game2_df = round1_game2_df.merge(
    regular_season_stats,
    left_on=["season", "homeTeam"],
    right_on=["season", "teamName"],
    how="left"
).rename(columns={
    "eFG%": "eFG%_home_season",
    "TOV%": "TOV%_home_season",
    "FT/FGA": "FT/FGA_home_season",
    "ORB%": "ORB%_home_season",
    "DRB%": "DRB%_home_season"
}).drop(columns=["teamName"])

round1_game2_df = round1_game2_df.merge(
    regular_season_stats,
    left_on=["season", "awayTeam"],
    right_on=["season", "teamName"],
    how="left"
).rename(columns={
    "eFG%": "eFG%_away_season",
    "TOV%": "TOV%_away_season",
    "FT/FGA": "FT/FGA_away_season",
    "ORB%": "ORB%_away_season",
    "DRB%": "DRB%_away_season"
}).drop(columns=["teamName"])

# Select Final Columns
round1_game2_df["round"] = 1
round1_game2_df["gameNumber"] = 2

final_cols = [
    "gameId", "season", "homeTeam", "awayTeam", "homeSeed", "awaySeed", "homeCourt", "homeWin", "round", "gameNumber",

    # Regular Season Stats
    "eFG%_home_season", "TOV%_home_season", "FT/FGA_home_season", "ORB%_home_season", "DRB%_home_season",
    "eFG%_away_season", "TOV%_away_season", "FT/FGA_away_season", "ORB%_away_season", "DRB%_away_season",

    # Game 1 Stats
    "eFG%_home_r1g1", "TOV%_home_r1g1", "FT/FGA_home_r1g1", "ORB%_home_r1g1", "DRB%_home_r1g1",
    "eFG%_away_r1g1", "TOV%_away_r1g1", "FT/FGA_away_r1g1", "ORB%_away_r1g1", "DRB%_away_r1g1",

    # Momentum Features
    "homeScore_r1g1", "awayScore_r1g1", "homeLostRound1Game1"
]

round1_game2_df = round1_game2_df[final_cols].copy()

# Export the final dataset
round1_game2_df.to_csv("../data/processed/modeling_round1_game2.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/modeling_round1_game2.csv")
print(df.head())

     gameId  season               homeTeam                awayTeam  homeSeed  \
0  42400152    2025        Houston Rockets   Golden State Warriors       2.0   
1  42400102    2025    Cleveland Cavaliers              Miami Heat       1.0   
2  42400112    2025         Boston Celtics           Orlando Magic       2.0   
3  42400162    2025     Los Angeles Lakers  Minnesota Timberwolves       3.0   
4  42400142    2025  Oklahoma City Thunder       Memphis Grizzlies       1.0   

   awaySeed  homeCourt  homeWin  round  gameNumber  ...  ORB%_home_r1g1  \
0       7.0       True     True      1           2  ...        0.423077   
1       8.0       True     True      1           2  ...        0.350000   
2       7.0       True     True      1           2  ...        0.243243   
3       6.0       True     True      1           2  ...        0.282609   
4       8.0       True     True      1           2  ...        0.285714   

   DRB%_home_r1g1  eFG%_away_r1g1  TOV%_away_r1g1  FT/FGA_away_r1g1 

#### Round 1 – Games 3 to 7
As the series progresses, the **series score** and **rolling averages** of playoff performance become more predictive than regular season stats.
Each game builds on prior ones, capturing form and momentum.

**Features used:**
- Rolling averages of Four Factors from earlier games in the series.
- Series score (e.g., Team A leads 2–1).
- Regular season stats (still included, but weighted less).
- Seeding + Home court advantage (switches after Game 2, then alternates).

In [14]:
# Import libraries
import pandas as pd

# Load Regular Season Games Team Statistics, Playoff Games Statistics, and Playoff Games Team Statistics.
regular_season_stats = pd.read_csv("../data/processed/team_statistics_regular_season.csv")
playoff_games = pd.read_csv("../data/processed/playoffs_games.csv")
playoff_stats = pd.read_csv("../data/processed/team_statistics_playoff_games.csv")

# Merge Playoffs Games Statistics with Playoff Games Team Statistics
playoff_stats = playoff_stats.merge(
    playoff_games[["gameId", "season", "round", "gameNumber", "homeTeam", "awayTeam"]],
    on="gameId", how="left"
)

# Add Team Score and Opponent Score Columns
playoff_stats["teamScore"] = playoff_stats.apply(
    lambda row: row["homeScore"] if row["teamName"] == row["homeTeam"] else row["awayScore"],
    axis=1
)
playoff_stats["opponentScore"] = playoff_stats.apply(
    lambda row: row["awayScore"] if row["teamName"] == row["homeTeam"] else row["homeScore"],
    axis=1
)

# Filter for Round 1, Games 3-7 
games3to7_df = playoff_games[
    (playoff_games["round"] == 1) & (playoff_games["gameNumber"] >= 3)
].copy()

# Initialize Dataset
final_rows = []

# Loop Through Games and Build Feature Rows
for idx, row in games3to7_df.iterrows():
    game_id = row["gameId"]
    season = row["season"]
    home_team = row["homeTeam"]
    away_team = row["awayTeam"]
    game_num = row["gameNumber"]
    round_num = row["round"]

    # Previous games in same series
    prev_games = playoff_games[
        (playoff_games["round"] == 1) &
        (playoff_games["season"] == season) &
        (playoff_games["gameNumber"] < game_num) &
        (
            ((playoff_games["homeTeam"] == home_team) & (playoff_games["awayTeam"] == away_team)) |
            ((playoff_games["homeTeam"] == away_team) & (playoff_games["awayTeam"] == home_team))
        )
    ]["gameId"].tolist()

    prior_stats = playoff_stats[
        (playoff_stats["gameId"].isin(prev_games)) &
        (playoff_stats["teamName"].isin([home_team, away_team]))
    ].copy()

    team_rolling = prior_stats.groupby("teamName")[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean().reset_index()
    team_rolling = team_rolling.rename(columns={
        "eFG%": "eFG%_roll", "TOV%": "TOV%_roll", "FT/FGA": "FT/FGA_roll",
        "ORB%": "ORB%_roll", "DRB%": "DRB%_roll"
    })

    home_season = regular_season_stats[
        (regular_season_stats["season"] == season) & (regular_season_stats["teamName"] == home_team)
    ]
    away_season = regular_season_stats[
        (regular_season_stats["season"] == season) & (regular_season_stats["teamName"] == away_team)
    ]

    if home_season.empty or away_season.empty:
        continue

    home_row = team_rolling[team_rolling["teamName"] == home_team].merge(home_season, on="teamName")
    away_row = team_rolling[team_rolling["teamName"] == away_team].merge(away_season, on="teamName")

    if home_row.empty or away_row.empty:
        continue

    home_wins = prior_stats[
        (prior_stats["teamName"] == home_team) & (prior_stats["teamScore"] > prior_stats["opponentScore"])
    ].shape[0]

    away_wins = prior_stats[
        (prior_stats["teamName"] == away_team) & (prior_stats["teamScore"] > prior_stats["opponentScore"])
    ].shape[0]

# Append All Features for Each Game
    final_rows.append({
        "gameId": game_id,
        "season": season,
        "homeTeam": home_team,
        "awayTeam": away_team,
        "homeSeed": row["homeSeed"],
        "awaySeed": row["awaySeed"],
        "homeCourt": row["homeCourt"],
        "homeWin": row["homeWin"],
        "gameNumber": game_num,
        "round": round_num,

        # Rolling Stats
        "eFG%_home_r1_roll": home_row["eFG%_roll"].values[0],
        "TOV%_home_r1_roll": home_row["TOV%_roll"].values[0],
        "FT/FGA_home_r1_roll": home_row["FT/FGA_roll"].values[0],
        "ORB%_home_r1_roll": home_row["ORB%_roll"].values[0],
        "DRB%_home_r1_roll": home_row["DRB%_roll"].values[0],

        "eFG%_away_r1_roll": away_row["eFG%_roll"].values[0],
        "TOV%_away_r1_roll": away_row["TOV%_roll"].values[0],
        "FT/FGA_away_r1_roll": away_row["FT/FGA_roll"].values[0],
        "ORB%_away_r1_roll": away_row["ORB%_roll"].values[0],
        "DRB%_away_r1_roll": away_row["DRB%_roll"].values[0],

        # Regular Season Stats
        "eFG%_home_season": home_row["eFG%"].values[0],
        "TOV%_home_season": home_row["TOV%"].values[0],
        "FT/FGA_home_season": home_row["FT/FGA"].values[0],
        "ORB%_home_season": home_row["ORB%"].values[0],
        "DRB%_home_season": home_row["DRB%"].values[0],

        "eFG%_away_season": away_row["eFG%"].values[0],
        "TOV%_away_season": away_row["TOV%"].values[0],
        "FT/FGA_away_season": away_row["FT/FGA"].values[0],
        "ORB%_away_season": away_row["ORB%"].values[0],
        "DRB%_away_season": away_row["DRB%"].values[0],

        # Series Score
        "homeWins": home_wins,
        "awayWins": away_wins
    })

# Export the final dataset
final_df = pd.DataFrame(final_rows)
final_df.to_csv("../data/processed/modeling_round1_games3to7.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/modeling_round1_games3to7.csv")
print(df.head())

     gameId  season               homeTeam               awayTeam  homeSeed  \
0  42400157    2025        Houston Rockets  Golden State Warriors       2.0   
1  42400177    2025         Denver Nuggets   Los Angeles Clippers       4.0   
2  42400156    2025  Golden State Warriors        Houston Rockets       7.0   
3  42400176    2025   Los Angeles Clippers         Denver Nuggets       5.0   
4  42400126    2025        Detroit Pistons        New York Knicks       6.0   

   awaySeed  homeCourt  homeWin  gameNumber  round  ...  FT/FGA_home_season  \
0       7.0       True    False           7    1.0  ...            0.179743   
1       5.0       True     True           7    1.0  ...            0.202515   
2       2.0      False    False           6    1.0  ...            0.189377   
3       4.0      False     True           6    1.0  ...            0.201672   
4       3.0      False    False           6    1.0  ...            0.194834   

   ORB%_home_season  DRB%_home_season  eFG%_away_s

#### Round 2 – Game 1
Entering the second round, teams bring forward their full Round 1 performance.
We combine regular season stats with **Round 1 playoff averages** to capture current form.

**Features used:**
- Regular season Four Factors.
- Average Four Factors from Round 1 playoff games.
- Rest advantage (days off since Round 1 ended).
- Home court advantage.
- Series length (how many games played in Round 1).

In [17]:
# Import libraries
import pandas as pd

# Load Regular Season Games Team Statistics, Playoff Games Statistics, and Playoff Games Team Statistics.
regular_season_stats = pd.read_csv("../data/processed/team_statistics_regular_season.csv")
playoff_games = pd.read_csv("../data/processed/playoffs_games.csv")
playoff_stats = pd.read_csv("../data/processed/team_statistics_playoff_games.csv")

# Filter Round 2 Game 1 Matchups
round2_game1 = playoff_games[(playoff_games["round"] == 2) & (playoff_games["gameNumber"] == 1)].copy()

# Get all Round 1 games for each team
round1_meta = playoff_games[playoff_games["round"] == 1][["gameId", "season", "round", "gameNumber", "homeTeam", "awayTeam"]].copy()

round1_home = round1_meta[["gameId", "homeTeam"]].rename(columns={"homeTeam": "teamName"})
round1_away = round1_meta[["gameId", "awayTeam"]].rename(columns={"awayTeam": "teamName"})
round1_long = pd.concat([round1_home, round1_away], axis=0).drop_duplicates()

round1_long = round1_long.merge(playoff_games[["gameId", "season"]], on="gameId", how="left")

# Merge team statistics with Round 1 games
round1_stats = playoff_stats.merge(round1_long, on=["gameId", "teamName"], how="inner")

# Ensure 'season' exists in round1_stats
if "season" not in round1_stats.columns:
    round1_stats = round1_stats.merge(round1_long, on=["gameId", "teamName"], how="left")

# Compute average Round 1 stats per team
round1_avg_stats = round1_stats.groupby(["season", "teamName"])[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean().reset_index()

# Merge with Round 2 Game 1 Teams
final_rows = []

for _, row in round2_game1.iterrows():
    season = row["season"]
    game_id = row["gameId"]
    home_team = row["homeTeam"]
    away_team = row["awayTeam"]
    round_num = row["round"]
    game_num = row["gameNumber"]

    # Get Round 1 average stats
    home_r1 = round1_avg_stats[(round1_avg_stats["season"] == season) & (round1_avg_stats["teamName"] == home_team)]
    away_r1 = round1_avg_stats[(round1_avg_stats["season"] == season) & (round1_avg_stats["teamName"] == away_team)]

    # Get regular season stats
    home_reg = regular_season_stats[(regular_season_stats["season"] == season) & (regular_season_stats["teamName"] == home_team)]
    away_reg = regular_season_stats[(regular_season_stats["season"] == season) & (regular_season_stats["teamName"] == away_team)]

    if home_r1.empty or away_r1.empty or home_reg.empty or away_reg.empty:
        continue

    # Compute series lengths
    home_games = round1_long[(round1_long["season"] == season) & (round1_long["teamName"] == home_team)]
    away_games = round1_long[(round1_long["season"] == season) & (round1_long["teamName"] == away_team)]
    home_series_len = home_games["gameId"].nunique()
    away_series_len = away_games["gameId"].nunique()
    rest_diff = away_series_len - home_series_len

    final_rows.append({
        "gameId": game_id,
        "season": season,
        "homeTeam": home_team,
        "awayTeam": away_team,
        "homeCourt": row["homeCourt"],
        "homeWin": row["homeWin"],
        "round": round_num,
        "gameNumber": game_num,

        # Round 1 averages
        "eFG%_home_r1_avg": home_r1["eFG%"].values[0],
        "TOV%_home_r1_avg": home_r1["TOV%"].values[0],
        "FT/FGA_home_r1_avg": home_r1["FT/FGA"].values[0],
        "ORB%_home_r1_avg": home_r1["ORB%"].values[0],
        "DRB%_home_r1_avg": home_r1["DRB%"].values[0],

        "eFG%_away_r1_avg": away_r1["eFG%"].values[0],
        "TOV%_away_r1_avg": away_r1["TOV%"].values[0],
        "FT/FGA_away_r1_avg": away_r1["FT/FGA"].values[0],
        "ORB%_away_r1_avg": away_r1["ORB%"].values[0],
        "DRB%_away_r1_avg": away_r1["DRB%"].values[0],

        # Regular season
        "eFG%_home_season": home_reg["eFG%"].values[0],
        "TOV%_home_season": home_reg["TOV%"].values[0],
        "FT/FGA_home_season": home_reg["FT/FGA"].values[0],
        "ORB%_home_season": home_reg["ORB%"].values[0],
        "DRB%_home_season": home_reg["DRB%"].values[0],

        "eFG%_away_season": away_reg["eFG%"].values[0],
        "TOV%_away_season": away_reg["TOV%"].values[0],
        "FT/FGA_away_season": away_reg["FT/FGA"].values[0],
        "ORB%_away_season": away_reg["ORB%"].values[0],
        "DRB%_away_season": away_reg["DRB%"].values[0],

        # Rest/fatigue
        "homeSeriesLength": home_series_len,
        "awaySeriesLength": away_series_len,
        "restAdvantage": rest_diff
    })

# Export the final dataset
final_df = pd.DataFrame(final_rows)
final_df.to_csv("../data/processed/modeling_round2_game1.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/modeling_round2_game1.csv")
print(df.head())

     gameId  season                homeTeam               awayTeam  homeCourt  \
0  42400231    2025  Minnesota Timberwolves  Golden State Warriors       True   
1  42400221    2025   Oklahoma City Thunder         Denver Nuggets       True   
2  42400211    2025          Boston Celtics        New York Knicks       True   
3  42400201    2025     Cleveland Cavaliers         Indiana Pacers       True   
4  42300221    2024   Oklahoma City Thunder       Dallas Mavericks       True   

   homeWin  round  gameNumber  eFG%_home_r1_avg  TOV%_home_r1_avg  ...  \
0    False    2.0           1          0.508776          0.091551  ...   
1    False    2.0           1          0.524548          0.087615  ...   
2    False    2.0           1          0.545089          0.116482  ...   
3    False    2.0           1          0.629976          0.084265  ...   
4     True    2.0           1          0.564486          0.125863  ...   

   ORB%_home_season  DRB%_home_season  eFG%_away_season  TOV%_away_s

#### Round 2 – Games 2 to 7
By now, **rolling playoff performance within the round** becomes the key predictor.

**Features used:**
- Rolling Four Factors within Round 2.
- Series score context.
- Regular Season Four Factors.
- Carryover playoff averages from Round 1.
- Home court advantage.

In [22]:
# Import libraries
import pandas as pd

# Load Regular Season Games Team Statistics, Playoff Games Statistics, and Playoff Games Team Statistics.
# Regular season statistics can still be included but weighted less because playoff performance is more predictive.
playoff_games = pd.read_csv("../data/processed/playoffs_games.csv")
playoff_stats = pd.read_csv("../data/processed/team_statistics_playoff_games.csv")
regular_season_stats = pd.read_csv("../data/processed/team_statistics_regular_season.csv")

# Filter for 2015–2025 seasons
playoff_games = playoff_games[(playoff_games["season"] >= 2015) & (playoff_games["season"] <= 2025)]
playoff_stats = playoff_stats[(playoff_stats["season"] >= 2015) & (playoff_stats["season"] <= 2025)]

# Filter Round 2 Games 2–7
round2_g2to7 = playoff_games[(playoff_games["round"] == 2) & (playoff_games["gameNumber"] > 1)].copy()

# Compute Round 1 Average Stats
r1_games = playoff_games[playoff_games["round"] == 1][["gameId", "homeTeam", "awayTeam", "season"]].copy()
r1_home = r1_games[["gameId", "homeTeam", "season"]].rename(columns={"homeTeam": "teamName"})
r1_away = r1_games[["gameId", "awayTeam", "season"]].rename(columns={"awayTeam": "teamName"})
r1_long = pd.concat([r1_home, r1_away], ignore_index=True)

r1_stats = r1_long.merge(playoff_stats, on=["gameId", "teamName"], how="inner")
if "season" not in r1_stats.columns:
    r1_stats = r1_stats.merge(r1_long, on=["gameId", "teamName"], how="left")
r1_stats["season"] = r1_stats["season"].astype(int)
r1_avg = r1_stats.groupby(["season", "teamName"])[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean().reset_index()

# Build rolling stats for Round 2 Games 2–7
final_rows = []

for _, row in round2_g2to7.iterrows():
    game_id = row["gameId"]
    season = row["season"]
    game_num = row["gameNumber"]
    home_team = row["homeTeam"]
    away_team = row["awayTeam"]
    round_num = row["round"]

    # Prior games between the same teams in Round 2
    prior_games = playoff_games[
        (playoff_games["season"] == season) &
        (playoff_games["round"] == 2) &
        (playoff_games["gameNumber"] < game_num) &
        (
            ((playoff_games["homeTeam"] == home_team) & (playoff_games["awayTeam"] == away_team)) |
            ((playoff_games["homeTeam"] == away_team) & (playoff_games["awayTeam"] == home_team))
        )
    ]["gameId"].tolist()

    # Stats from those games
    prior_stats = playoff_stats[
        (playoff_stats["season"] == season) &
        (playoff_stats["gameId"].isin(prior_games)) &
        (playoff_stats["teamName"].isin([home_team, away_team]))
    ]

    if prior_stats.empty:
        continue

    # Rolling stats
    rolling = prior_stats.groupby("teamName")[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean().reset_index()

    # Series score
    prior_stats = prior_stats.merge(playoff_games[["gameId", "homeTeam", "awayTeam", "homeWin"]], on="gameId", how="left")
    prior_stats["teamWin"] = prior_stats.apply(
        lambda x: (x["teamName"] == x["homeTeam"] and x["homeWin"] == 1) or
                  (x["teamName"] == x["awayTeam"] and x["homeWin"] == 0),
        axis=1
    )
    home_wins = prior_stats[(prior_stats["teamName"] == home_team) & (prior_stats["teamWin"])].shape[0]
    away_wins = prior_stats[(prior_stats["teamName"] == away_team) & (prior_stats["teamWin"])].shape[0]

    # R1 average
    home_r1 = r1_avg[(r1_avg["season"] == season) & (r1_avg["teamName"] == home_team)]
    away_r1 = r1_avg[(r1_avg["season"] == season) & (r1_avg["teamName"] == away_team)]

    # Regular season
    home_reg = regular_season_stats[
        (regular_season_stats["season"] == season) & (regular_season_stats["teamName"] == home_team)
    ]
    away_reg = regular_season_stats[
        (regular_season_stats["season"] == season) & (regular_season_stats["teamName"] == away_team)
    ]

    # Build final row
    row_data = {
        "gameId": game_id,
        "season": season,
        "gameNumber": game_num,
        "homeTeam": home_team,
        "awayTeam": away_team,
        "homeCourt": row["homeCourt"],
        "homeWin": row["homeWin"],
        "homeWins": home_wins,
        "awayWins": away_wins,
        "round": round_num,
    }

    for team, prefix in zip([home_team, away_team], ["home", "away"]):
        team_stats = rolling[rolling["teamName"] == team]
        if not team_stats.empty:
            for col in ["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]:
                row_data[f"{col}_{prefix}_r2_roll"] = team_stats[col].values[0]

        r1_stats_team = home_r1 if prefix == "home" else away_r1
        if not r1_stats_team.empty:
            for col in ["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]:
                row_data[f"{col}_{prefix}_r1_avg"] = r1_stats_team[col].values[0]

        reg_stats_team = home_reg if prefix == "home" else away_reg
        if not reg_stats_team.empty:
            for col in ["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]:
                row_data[f"{col}_{prefix}_season"] = reg_stats_team[col].values[0]

    final_rows.append(row_data)

# Export the final dataset
final_df = pd.DataFrame(final_rows)
final_df.to_csv("../data/processed/modeling_round2_game2to7.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/modeling_round2_game2to7.csv")
print(df.head())

     gameId  season  gameNumber                homeTeam  \
0  42400227    2025           7   Oklahoma City Thunder   
1  42400216    2025           6         New York Knicks   
2  42400226    2025           6          Denver Nuggets   
3  42400235    2025           5  Minnesota Timberwolves   
4  42400215    2025           5          Boston Celtics   

                awayTeam  homeCourt  homeWin  homeWins  awayWins  round  ...  \
0         Denver Nuggets       True     True         3         3    2.0  ...   
1         Boston Celtics      False     True         3         2    2.0  ...   
2  Oklahoma City Thunder      False     True         2         3    2.0  ...   
3  Golden State Warriors       True     True         3         1    2.0  ...   
4        New York Knicks       True     True         1         3    2.0  ...   

   eFG%_away_r1_avg  TOV%_away_r1_avg  FT/FGA_away_r1_avg  ORB%_away_r1_avg  \
0          0.547985          0.123145            0.165182          0.277041   
1     

#### Round 3 – Game 1 
At this stage, only elite teams remain.
We combine **all prior playoff rounds** and regular season context to capture both consistency and fatigue.

**Features used:**
- Regular season Four Factors.
- Averages from Rounds 1 and 2 playoff games.
- Rest advantage between rounds.
- Series history (sweeps vs. long battles).
- Home court advantage.

In [26]:
# Import libraries
import pandas as pd

# Load Regular Season Games Team Statistics, Playoff Games Statistics, and Playoff Games Team Statistics.
playoff_games = pd.read_csv("../data/processed/playoffs_games.csv")
playoff_stats = pd.read_csv("../data/processed/team_statistics_playoff_games.csv")
regular_season_stats = pd.read_csv("../data/processed/team_statistics_regular_season.csv")

# Filter Round 3 Game 1
playoff_games = playoff_games[(playoff_games["season"] >= 2015) & (playoff_games["season"] <= 2025)]
playoff_stats = playoff_stats[(playoff_stats["season"] >= 2015) & (playoff_stats["season"] <= 2025)]
regular_season_stats = regular_season_stats[(regular_season_stats["season"] >= 2015) & (regular_season_stats["season"] <= 2025)]

round3_g1 = playoff_games[(playoff_games["round"] == 3) & (playoff_games["gameNumber"] == 1)].copy()
round2_long = playoff_games[(playoff_games["round"] == 2)].copy()

# Aggregate playoff stats Rounds 1 and 2
prior_playoffs = playoff_games[playoff_games["round"] < 3]
prior_game_ids = prior_playoffs["gameId"].unique()

team_playoff_stats = playoff_stats[playoff_stats["gameId"].isin(prior_game_ids)].copy()
team_playoff_avg = team_playoff_stats.groupby(["season", "teamName"])[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean().reset_index()

# Compute rest days and build final dataset
final_rows = []

for _, row in round3_g1.iterrows():
    game_id = row["gameId"]
    season = row["season"]
    home = row["homeTeam"]
    away = row["awayTeam"]
    date = row["gameDate"]
    round_num = row["round"]
    game_num = row["gameNumber"]

    def get_last_game(team):
        games = playoff_games[(playoff_games["season"] == season) & 
                              ((playoff_games["homeTeam"] == team) | (playoff_games["awayTeam"] == team)) &
                              (playoff_games["gameDate"] < date)]
        if games.empty:
            return None
        return pd.to_datetime(games["gameDate"].max())

    home_last = get_last_game(home)
    away_last = get_last_game(away)

    # Compute Round 2 series lengthes (how many games each team played in Round 2)
    home_games = round2_long[(round2_long["season"] == season) & 
                              ((round2_long["homeTeam"] == home) | (round2_long["awayTeam"] == home))]
    away_games = round2_long[(round2_long["season"] == season) & 
                              ((round2_long["homeTeam"] == away) | (round2_long["awayTeam"] == away))]
    home_series_len = home_games["gameId"].nunique()
    away_series_len = away_games["gameId"].nunique()

    # Compute Round 2 rest advantage
    home_rest = (pd.to_datetime(date) - home_last).days if home_last else None
    away_rest = (pd.to_datetime(date) - away_last).days if away_last else None
    rest_advantage = home_rest - away_rest if (home_rest is not None and away_rest is not None) else None

    # Pull team stats
    home_playoff = team_playoff_avg[(team_playoff_avg["season"] == season) & (team_playoff_avg["teamName"] == home)]
    away_playoff = team_playoff_avg[(team_playoff_avg["season"] == season) & (team_playoff_avg["teamName"] == away)]

    home_reg = regular_season_stats[(regular_season_stats["season"] == season) & (regular_season_stats["teamName"] == home)]
    away_reg = regular_season_stats[(regular_season_stats["season"] == season) & (regular_season_stats["teamName"] == away)]

    row_data = {
        "gameId": game_id,
        "season": season,
        "homeTeam": home,
        "awayTeam": away,
        "homeCourt": row["homeCourt"],
        "homeWin": row["homeWin"],
        "restAdvantage": rest_advantage,
        "homeSeriesLength": home_series_len,
        "awaySeriesLength": away_series_len,
        "round": round_num,
        "gameNumber": game_num,
    }

    for team, prefix, playoff_df, reg_df in zip(
        [home, away],
        ["home", "away"],
        [home_playoff, away_playoff],
        [home_reg, away_reg]
    ):
        if not playoff_df.empty:
            for col in ["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]:
                row_data[f"{col}_{prefix}_r1r2_avg"] = playoff_df[col].values[0]

        if not reg_df.empty:
            for col in ["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]:
                row_data[f"{col}_{prefix}_season"] = reg_df[col].values[0]

    final_rows.append(row_data)

# Export the final dataset
final_df = pd.DataFrame(final_rows)
final_df.to_csv("../data/processed/modeling_round3_game1.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/modeling_round3_game1.csv")
print(df.head())

     gameId  season                homeTeam                awayTeam  \
0  42400301    2025         New York Knicks          Indiana Pacers   
1  42400311    2025   Oklahoma City Thunder  Minnesota Timberwolves   
2  42300311    2024  Minnesota Timberwolves        Dallas Mavericks   
3  42300301    2024          Boston Celtics          Indiana Pacers   
4  42200301    2023          Boston Celtics              Miami Heat   

   homeCourt  homeWin  restAdvantage  homeSeriesLength  awaySeriesLength  \
0       True    False             -3                 6                 5   
1       True     True             -3                 7                 5   
2       True    False             -1                 7                 6   
3       True     True              4                 5                 7   
4       True    False             -2                 7                 6   

   round  ...  eFG%_away_r1r2_avg  TOV%_away_r1r2_avg  FT/FGA_away_r1r2_avg  \
0    3.0  ...            0.583950    

#### Round 3 – Games 2 to 7
Momentum and rolling playoff performance dominate prediction.
Series score and Game 1 results set the tone for adjustments.

**Features used:**
- Rolling Four Factors within Round 3.
- Current series score.
- Home court advantage.

In [29]:
# Import libraries
import pandas as pd

# Load Playoff Games Statistics and Playoff Games Team Statistics.
playoff_games = pd.read_csv("../data/processed/playoffs_games.csv")
playoff_stats = pd.read_csv("../data/processed/team_statistics_playoff_games.csv")

# Filter Round 3 Games 2 to 7
playoff_games = playoff_games[(playoff_games["season"] >= 2015) & (playoff_games["season"] <= 2025)]
playoff_stats = playoff_stats[(playoff_stats["season"] >= 2015) & (playoff_stats["season"] <= 2025)]

round3 = playoff_games[(playoff_games["round"] == 3) & (playoff_games["gameNumber"] > 1)].copy()

# Build rolling feature dataset
final_rows = []

for _, row in round3.iterrows():
    gid = row["gameId"]
    season = row["season"]
    round_num = row["round"]
    gnum = row["gameNumber"]
    date = row["gameDate"]
    home = row["homeTeam"]
    away = row["awayTeam"]
    

    prior_games = playoff_games[(playoff_games["season"] == season) &
                       (playoff_games["round"] == round_num) &
                       (playoff_games["gameNumber"] < gnum)].copy()

    # Filter Playoffs Games Team Statistics for prior games
    prior_game_ids = prior_games["gameId"].unique()
    prior_stats = playoff_stats[playoff_stats["gameId"].isin(prior_game_ids)].copy()

    # Filter to each team's stats
    home_stats = prior_stats[prior_stats["teamName"] == home]
    away_stats = prior_stats[prior_stats["teamName"] == away]

    # Skip if missing prior stats
    if home_stats.empty or away_stats.empty:
        continue

    home_avg = home_stats[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean()
    away_avg = away_stats[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean()

    # Series score
    prior_home_wins = prior_games[prior_games["homeTeam"] == home]["homeWin"].sum()
    prior_away_wins = prior_games[prior_games["homeTeam"] == away]["homeWin"].apply(lambda x: not x).sum()

    row_data = {
        "gameId": gid,
        "season": season,
        "gameNumber": gnum,
        "homeTeam": home,
        "awayTeam": away,
        "homeCourt": row["homeCourt"],
        "homeWin": row["homeWin"],
        "homeWins": prior_home_wins,
        "awayWins": prior_away_wins,
        "round": round_num
    }

    for col in ["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]:
        row_data[f"{col}_home_r3_roll"] = home_avg[col]
        row_data[f"{col}_away_r3_roll"] = away_avg[col]

    final_rows.append(row_data)

# Export the final dataset
final_df = pd.DataFrame(final_rows)
final_df.to_csv("../data/processed/modeling_round3_games2to7.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/modeling_round3_games2to7.csv")
print(df.head())

     gameId  season  gameNumber                homeTeam  \
0  42400306    2025           6          Indiana Pacers   
1  42400305    2025           5         New York Knicks   
2  42400315    2025           5   Oklahoma City Thunder   
3  42400304    2025           4          Indiana Pacers   
4  42400314    2025           4  Minnesota Timberwolves   

                 awayTeam  homeCourt  homeWin  homeWins  awayWins  round  \
0         New York Knicks      False     True         1         2    3.0   
1          Indiana Pacers       True     True         0         1    3.0   
2  Minnesota Timberwolves       True     True         2         1    3.0   
3         New York Knicks      False     True         0         2    3.0   
4   Oklahoma City Thunder      False    False         1         0    3.0   

   eFG%_home_r3_roll  eFG%_away_r3_roll  TOV%_home_r3_roll  TOV%_away_r3_roll  \
0           0.542722           0.538627           0.111401           0.128778   
1           0.538453      

#### Round 4 – Game 1 
The Finals bring together the best from each conference (East and West).
Regular season and **cumulative playoff averages** drive Game 1 predictions, along with rest advantage from earlier rounds.

**Features used:**
- Regular season Four Factors.
- Cumulative averages across all prior playoff rounds.
- Rest advantage (longer layoff teams often start slow).
- Home court advantage.

In [32]:
# Import libraries
import pandas as pd

# Load Regular Season Games Team Statistics, Playoff Games Statistics, and Playoff Games Team Statistics.
playoff_games = pd.read_csv("../data/processed/playoffs_games.csv")
playoff_stats = pd.read_csv("../data/processed/team_statistics_playoff_games.csv")
regular_season_stats = pd.read_csv("../data/processed/team_statistics_regular_season.csv")

# Filter Round 4, Game 1
playoff_games = playoff_games[(playoff_games["season"] >= 2015) & (playoff_games["season"] <= 2025)]
playoff_stats = playoff_stats[(playoff_stats["season"] >= 2015) & (playoff_stats["season"] <= 2025)]
regular_season_stats = regular_season_stats[(regular_season_stats["season"] >= 2015) & (regular_season_stats["season"] <= 2025)]

round4_g1 = playoff_games[(playoff_games["round"] == 4) & (playoff_games["gameNumber"] == 1)].copy()

# Build feature dataset
final_rows = []

for _, row in round4_g1.iterrows():
    gid = row["gameId"]
    season = row["season"]
    date = row["gameDate"]
    home = row["homeTeam"]
    away = row["awayTeam"]
    round_num = row["round"]
    game_num = row["gameNumber"]


    # All games prior to Round 4
    prior_games = playoff_games[(playoff_games["season"] == season) & (playoff_games["round"] < 4)].copy()
    prior_game_ids = prior_games["gameId"].unique()
    prior_stats = playoff_stats[playoff_stats["gameId"].isin(prior_game_ids)].copy()

    home_stats = prior_stats[prior_stats["teamName"] == home]
    away_stats = prior_stats[prior_stats["teamName"] == away]

    # Skip if missing prior playoff stats
    if home_stats.empty or away_stats.empty:
        continue

    home_avg = home_stats[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean()
    away_avg = away_stats[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean()

    # Regular season stats
    home_reg = regular_season_stats[(regular_season_stats["season"] == season) & (regular_season_stats["teamName"] == home)]
    away_reg = regular_season_stats[(regular_season_stats["season"] == season) & (regular_season_stats["teamName"] == away)]

    # Skip if missing regular season
    if home_reg.empty or away_reg.empty:
        continue

    home_reg_vals = home_reg[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].values[0]
    away_reg_vals = away_reg[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].values[0]

    # Rest advantage (home - away)
    home_rest = (pd.to_datetime(date) - pd.to_datetime(
        playoff_games[((playoff_games["homeTeam"] == home) | (playoff_games["awayTeam"] == home)) &
             (playoff_games["gameDate"] < date)]["gameDate"].max()
    )).days

    away_rest = (pd.to_datetime(date) - pd.to_datetime(
        playoff_games[((playoff_games["homeTeam"] == away) | (playoff_games["awayTeam"] == away)) &
             (playoff_games["gameDate"] < date)]["gameDate"].max()
    )).days

    rest_adv = home_rest - away_rest
    # Compute Round 3 series lengths (number of games each team played in Round 3)
    round3_games = playoff_games[
        (playoff_games["season"] == season) &
        (playoff_games["round"] == 3) &
        (
            (playoff_games["homeTeam"] == home) | (playoff_games["awayTeam"] == home) |
            (playoff_games["homeTeam"] == away) | (playoff_games["awayTeam"] == away)
        )
    ]

    home_series_len = round3_games[
        (round3_games["homeTeam"] == home) | (round3_games["awayTeam"] == home)
    ]["gameId"].nunique()

    away_series_len = round3_games[
        (round3_games["homeTeam"] == away) | (round3_games["awayTeam"] == away)
    ]["gameId"].nunique()


    row_data = {
        "gameId": gid,
        "season": season,
        "gameNumber": 1,
        "homeTeam": home,
        "awayTeam": away,
        "homeCourt": row["homeCourt"],
        "homeWin": row["homeWin"],
        "restAdvantage": rest_adv,
        "homeSeriesLength": home_series_len,
        "awaySeriesLength": away_series_len,
        "round": round_num,
        "gameNumber": game_num

    }

    for col in ["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]:
        row_data[f"{col}_home_r1r2r3_avg"] = home_avg[col]
        row_data[f"{col}_away_r1r2r3_avg"] = away_avg[col]
        row_data[f"{col}_home_season"] = home_reg_vals[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"].index(col)]
        row_data[f"{col}_away_season"] = away_reg_vals[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"].index(col)]

    final_rows.append(row_data)

# Export the final dataset
final_df = pd.DataFrame(final_rows)
final_df.to_csv("../data/processed/modeling_round4_game1.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/modeling_round4_game1.csv")
print(df.head())

     gameId  season  gameNumber               homeTeam          awayTeam  \
0  42400401    2025           1  Oklahoma City Thunder    Indiana Pacers   
1  42300401    2024           1         Boston Celtics  Dallas Mavericks   
2  42200401    2023           1         Denver Nuggets        Miami Heat   
3  42100401    2022           1  Golden State Warriors    Boston Celtics   
4  42000401    2021           1           Phoenix Suns   Milwaukee Bucks   

   homeCourt  homeWin  restAdvantage  homeSeriesLength  awaySeriesLength  ...  \
0       True    False              3                 5                 6  ...   
1       True     True              3                 4                 5  ...   
2       True     True              7                 4                 7  ...   
3      False    False              3                 5                 7  ...   
4       True     True              3                 6                 6  ...   

   FT/FGA_home_season  FT/FGA_away_season  ORB%_home_r1r

#### Round 4 – Games 2 to 7 
Every Finals game carries unique momentum and pressure.

**Features used:**
- Rolling Four Factors from earlier Finals games.
- Current series score (e.g., BOS leads 3–2).
- Home court advantage.

In [35]:
# Import libraries
import pandas as pd

# Load Playoff Games Statistics and Playoff Games Team Statistics.
playoff_games = pd.read_csv("../data/processed/playoffs_games.csv")
playoff_stats = pd.read_csv("../data/processed/team_statistics_playoff_games.csv")

# Filter for Round 4 Games 2 to 7
playoff_games = playoff_games[(playoff_games["season"] >= 2015) & (playoff_games["season"] <= 2025)]
playoff_stats = playoff_stats[(playoff_stats["season"] >= 2015) & (playoff_stats["season"] <= 2025)]
round4_games = playoff_games[(playoff_games["round"] == 4) & (playoff_games["gameNumber"] > 1)].copy()

# Build rolling feature dataset
final_rows = []

for _, row in round4_games.iterrows():
    gid = row["gameId"]
    season = row["season"]
    date = row["gameDate"]
    round_num = row["round"]
    game_num = row["gameNumber"]
    home = row["homeTeam"]
    away = row["awayTeam"]

    # Get all prior games in Round 4 before this game
    prior_games = playoff_games[(playoff_games["season"] == season) & (playoff_games["round"] == round_num) & (playoff_games["gameNumber"] < game_num)]
    prior_game_ids = prior_games["gameId"].tolist()

    prior_ff = playoff_stats[playoff_stats["gameId"].isin(prior_game_ids)]

    home_ff = prior_ff[prior_ff["teamName"] == home]
    away_ff = prior_ff[prior_ff["teamName"] == away]

    if home_ff.empty or away_ff.empty:
        continue

    home_avg = home_ff[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean()
    away_avg = away_ff[["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]].mean()

    # Series score (number of wins)
    prior_home_wins = prior_games[prior_games["homeTeam"] == home]["homeWin"].sum()
    prior_away_wins = prior_games[prior_games["homeTeam"] == away]["homeWin"].apply(lambda x: not x).sum()

    row_data = {
        "gameId": gid,
        "season": season,
        "gameNumber": game_num,
        "homeTeam": home,
        "awayTeam": away,
        "homeCourt": row["homeCourt"],
        "homeWin": row["homeWin"],
        "homeWins": prior_home_wins,
        "awayWins": prior_away_wins,
        "round": round_num
    }

    for col in ["eFG%", "TOV%", "FT/FGA", "ORB%", "DRB%"]:
        row_data[f"{col}_home_r4_roll"] = home_avg[col]
        row_data[f"{col}_away_r4_roll"] = away_avg[col]

    final_rows.append(row_data)

# Export the final dataset
final_df = pd.DataFrame(final_rows)
final_df.to_csv("../data/processed/modeling_round4_games2to7.csv", index=False)

# Print the first 5 rows
df = pd.read_csv("../data/processed/modeling_round4_games2to7.csv")
print(df.head())

     gameId  season  gameNumber               homeTeam               awayTeam  \
0  42400407    2025           7  Oklahoma City Thunder         Indiana Pacers   
1  42400406    2025           6         Indiana Pacers  Oklahoma City Thunder   
2  42400405    2025           5  Oklahoma City Thunder         Indiana Pacers   
3  42400404    2025           4         Indiana Pacers  Oklahoma City Thunder   
4  42400403    2025           3         Indiana Pacers  Oklahoma City Thunder   

   homeCourt  homeWin  homeWins  awayWins  round  eFG%_home_r4_roll  \
0       True     True         2         1    4.0           0.504243   
1      False     True         1         1    4.0           0.540916   
2       True     True         1         1    4.0           0.513122   
3      False    False         1         1    4.0           0.564180   
4      False     True         0         1    4.0           0.560976   

   eFG%_away_r4_roll  TOV%_home_r4_roll  TOV%_away_r4_roll  \
0           0.533191    