# Loading in data

## Data Section 1 - The Basics

**This section provides everything you need to build a simple prediction model and submit predictions.**

- Team IDs and Team Names
- Tournament seeds since the 1984-85 season
- Final scores of all regular season, conference tournament, and NCAA® tournament games since the 1984-85 season
- Season-level details including dates and region names
- Example submission file for stage 1
  
By convention, a season is identified by the year in which it ends, not the year in which it starts; for example, the 2024-25 season is season 2025.

## Data Section 2 - Team Box Scores

**This section provides game-by-game stats at a team level (free throws attempted, defensive rebounds, turnovers, etc.) for all regular season, conference tournament, and NCAA® tournament games since the 2003 season (men) or since the 2010 season (women).**

Team Box Scores are provided in "Detailed Results" files rather than "Compact Results" files. However, the two files are strongly related.

In a Detailed Results file, the first eight columns (**Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, and NumOT**) are exactly the same as in a Compact Results file. A Detailed Results file then adds many more columns. The column names should be self-explanatory to basketball fans (as above, "W" or "L" refers to the winning or losing team):

- WFGM - field goals made (by the winning team)
- WFGA - field goals attempted (by the winning team)
- WFGM3 - three pointers made (by the winning team)
- WFGA3 - three pointers attempted (by the winning team)
- WFTM - free throws made (by the winning team)
- WFTA - free throws attempted (by the winning team)
- WOR - offensive rebounds (pulled by the winning team)
- WDR - defensive rebounds (pulled by the winning team)
- WAst - assists (by the winning team)
- WTO - turnovers committed (by the winning team)
- WStl - steals (accomplished by the winning team)
- WBlk - blocks (accomplished by the winning team)
- WPF - personal fouls committed (by the winning team)

(and then the same set of stats from the perspective of the losing team: **LFGM** is the number of field goals made by the losing team, and so on up to **LPF**).

Note: by convention, "field goals made" (either WFGM or LFGM) refers to the total number of field goals made by a team, combining two-point and three-point field goals. "Three-point field goals made" (either WFGM3 or LFGM3) counts only the three-point field goals. So if you want to know specifically about two-point field goals, you have to subtract one from the other (e.g., WFGM - WFGM3). The total number of points scored is most simply expressed as (2*FGM) + FGM3 + FTM, because every field goal is worth at least two points and each three-pointer adds one more.
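
The arithmetic in the note above can be checked with a short sketch. The box-score values here are invented purely for illustration; only the column names (WFGM, WFGM3, WFTM) come from the Detailed Results files, and `WFGM2` and `WPoints` are hypothetical helper columns.

```python
import pandas as pd

# Hypothetical box-score rows (illustrative values, not real games)
games = pd.DataFrame({
    "WFGM": [30, 25],   # all field goals made, twos and threes together
    "WFGM3": [8, 10],   # three-pointers only
    "WFTM": [12, 15],   # free throws made
})

# Two-point field goals = all field goals minus three-pointers
games["WFGM2"] = games["WFGM"] - games["WFGM3"]

# Every field goal is worth at least 2; each three adds 1 more; each free throw adds 1
games["WPoints"] = 2 * games["WFGM"] + games["WFGM3"] + games["WFTM"]

# Same total computed the long way, as a consistency check
long_way = 2 * games["WFGM2"] + 3 * games["WFGM3"] + games["WFTM"]
assert (games["WPoints"] == long_way).all()
print(games)
```

On a real Detailed Results file, the same expression applied to the W columns would be expected to reproduce `WScore`, which makes it a cheap sanity check after loading.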

## Data Section 3 - Geography

**This section provides city locations of all regular season, conference tournament, and NCAA® tournament games since the 2010 season.**

## Data Section 4 - Public Rankings

**This section provides weekly team rankings (men's teams only) for dozens of top rating systems - Pomeroy, Sagarin, RPI, ESPN, etc., since the 2003 season.**

## Data Section 5 - Supplements

**This section contains additional supporting information, including coaches, conference affiliations, alternative team name spellings, bracket structure, and game results for NIT and other postseason tournaments.**



In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm

def reduce_mem_usage(df):
    """
    Reduce DataFrame memory usage by converting columns to more efficient dtypes.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if pd.api.types.is_numeric_dtype(col_type):
            c_min = df[col].min()
            c_max = df[col].max()
            if pd.api.types.is_integer_dtype(col_type):
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                else:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        # is_categorical_dtype is deprecated in recent pandas; check the dtype class instead
        elif pd.api.types.is_object_dtype(col_type) or isinstance(col_type, pd.CategoricalDtype):
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    return df

def load_optimized_csv(file_path):
    """
    Load a CSV with memory optimization and error handling.
    """
    try:
        df = pd.read_csv(file_path, low_memory=False, encoding='utf-8')
    except UnicodeDecodeError:
        df = pd.read_csv(file_path, low_memory=False, encoding='ISO-8859-1')
    except FileNotFoundError:
        print(f"Error: File {file_path} not found.")
        return None
    df = reduce_mem_usage(df)
    return df

# Load all datasets with progress feedback
dataset_files = {
    "m_teams": "MTeams.csv",
    "w_teams": "WTeams.csv",
    "m_seasons": "MSeasons.csv",
    "w_seasons": "WSeasons.csv",
    "m_tourney_seeds": "MNCAATourneySeeds.csv",
    "w_tourney_seeds": "WNCAATourneySeeds.csv",
    "m_regular_season_compact": "MRegularSeasonCompactResults.csv",
    "w_regular_season_compact": "WRegularSeasonCompactResults.csv",
    "m_tourney_compact": "MNCAATourneyCompactResults.csv",
    "w_tourney_compact": "WNCAATourneyCompactResults.csv",
    "m_regular_season_detailed": "MRegularSeasonDetailedResults.csv",
    "w_regular_season_detailed": "WRegularSeasonDetailedResults.csv",
    "m_tourney_detailed": "MNCAATourneyDetailedResults.csv",
    "w_tourney_detailed": "WNCAATourneyDetailedResults.csv",
    "cities": "Cities.csv",
    "m_game_cities": "MGameCities.csv",
    "w_game_cities": "WGameCities.csv",
    "massey_ordinals": "MMasseyOrdinals.csv",
    "m_team_coaches": "MTeamCoaches.csv",
    "conferences": "Conferences.csv",
    "m_team_conferences": "MTeamConferences.csv",
    "w_team_conferences": "WTeamConferences.csv",
    "m_conf_tourney_games": "MConferenceTourneyGames.csv",
    "w_conf_tourney_games": "WConferenceTourneyGames.csv",
    "m_secondary_tourney_teams": "MSecondaryTourneyTeams.csv",
    "w_secondary_tourney_teams": "WSecondaryTourneyTeams.csv",
    "m_secondary_tourney_results": "MSecondaryTourneyCompactResults.csv",
    "w_secondary_tourney_results": "WSecondaryTourneyCompactResults.csv",
    "m_team_spellings": "MTeamSpellings.csv",
    "w_team_spellings": "WTeamSpellings.csv",
    "m_tourney_slots": "MNCAATourneySlots.csv",
    "w_tourney_slots": "WNCAATourneySlots.csv",
    "m_seed_round_slots": "MNCAATourneySeedRoundSlots.csv"
}

base_path = "/kaggle/input/march-machine-learning-mania-2025"
dataframes = {}

print("🔹 Starting to load all CSV files with memory optimization...")
for key, file_name in tqdm(dataset_files.items(), desc="Loading datasets"):
    full_path = f"{base_path}/{file_name}"
    dataframes[key] = load_optimized_csv(full_path)
print("✅ All files loaded into the `dataframes` dictionary.")

# Unpack into variables
(
    m_teams, w_teams, m_seasons, w_seasons, m_tourney_seeds, w_tourney_seeds,
    m_regular_season_compact, w_regular_season_compact, m_tourney_compact, w_tourney_compact,
    m_regular_season_detailed, w_regular_season_detailed, m_tourney_detailed, w_tourney_detailed,
    cities, m_game_cities, w_game_cities, massey_ordinals, m_team_coaches, conferences,
    m_team_conferences, w_team_conferences, m_conf_tourney_games, w_conf_tourney_games,
    m_secondary_tourney_teams, w_secondary_tourney_teams, m_secondary_tourney_results,
    w_secondary_tourney_results, m_team_spellings, w_team_spellings, m_tourney_slots,
    w_tourney_slots, m_seed_round_slots
) = dataframes.values()

print("✅ Data successfully unpacked into individual DataFrames!")
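
To see what the downcasting in `reduce_mem_usage` amounts to, here is a minimal sketch on a toy DataFrame. It uses `pd.to_numeric(..., downcast="integer")`, a built-in pandas alternative to the hand-written range checks, so it is not the same code path as the function above. The column values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame with the default 64-bit integers and a low-cardinality text column
toy = pd.DataFrame({
    "Season": np.array([2003, 2004, 2025, 2025], dtype=np.int64),
    "NumOT": np.array([0, 1, 2, 0], dtype=np.int64),
    "WLoc": ["H", "A", "N", "H"],
})

small = toy.copy()
for col in ["Season", "NumOT"]:
    # Picks the smallest signed integer type that can hold every value in the column
    small[col] = pd.to_numeric(small[col], downcast="integer")

# Repeated labels are stored once, with small integer codes per row
small["WLoc"] = small["WLoc"].astype("category")

print(toy.dtypes)
print(small.dtypes)
```

The category conversion tends to pay off only when a column has few distinct values repeated over many rows, as with `WLoc`; for near-unique text such as team name spellings it may save little or nothing.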

## Exploring the Data

In [None]:
def summarize_dataframes(df_dict, markdown_file="dataset_summary.txt"):
    """
    Summarize each DataFrame, including rows, columns, missing values, duplicates, column names, dtypes, and sample rows.
    """
    try:
        import tabulate  # Required for to_markdown
    except ImportError:
        print("Warning: 'tabulate' not installed. Falling back to plain text.")
    
    dataset_summary = {}
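    for name, df in df_dict.items():
        if df is None:  # load_optimized_csv returns None when a file is missing
            print(f"Skipping '{name}': not loaded.")
            continue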
        dataset_summary[name] = {
            "Rows": df.shape[0],
            "Columns": df.shape[1],
            "Missing Values": int(df.isnull().sum().sum()),
            "Duplicate Rows": int(df.duplicated().sum()),
            "Column Names": df.columns.tolist(),
            "Column Dtypes": df.dtypes.astype(str).tolist(),
            "Sample Rows": df.head().to_dict(orient='records')
        }
    
    summary_df = pd.DataFrame(dataset_summary).T
    print("🔎 Dataset Summaries:")
    try:
        display(summary_df)
    except NameError:
        print(summary_df)
    
    try:
        summary_md = summary_df.to_markdown()
    except (ImportError, NameError):
        summary_md = summary_df.to_string()
    
    with open(markdown_file, "w", encoding="utf-8") as f:
        f.write(summary_md)
    print(f"✅ Summary saved to '{markdown_file}'")

summarize_dataframes(dataframes)

## Data Preparation

### Compute Per-Season Team Stats

In [None]:
def compute_team_season_stats(detailed_results_df):
    """
    Calculate advanced team statistics per season using detailed results.
    """
    df_win = detailed_results_df[['Season', 'WTeamID', 'WScore', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LScore']].rename(
        columns={'WTeamID': 'TeamID', 'WScore': 'Score', 'WFGM': 'FGM', 'WFGA': 'FGA', 'WFGM3': 'FGM3', 'WFGA3': 'FGA3', 'WFTM': 'FTM', 'WFTA': 'FTA', 'WOR': 'OR', 'WDR': 'DR', 'WAst': 'Ast', 'WTO': 'TO', 'WStl': 'Stl', 'WBlk': 'Blk', 'WPF': 'PF', 'LScore': 'OpponentScore'}
    )
    df_lose = detailed_results_df[['Season', 'LTeamID', 'LScore', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF', 'WScore']].rename(
        columns={'LTeamID': 'TeamID', 'LScore': 'Score', 'LFGM': 'FGM', 'LFGA': 'FGA', 'LFGM3': 'FGM3', 'LFGA3': 'FGA3', 'LFTM': 'FTM', 'LFTA': 'FTA', 'LOR': 'OR', 'LDR': 'DR', 'LAst': 'Ast', 'LTO': 'TO', 'LStl': 'Stl', 'LBlk': 'Blk', 'LPF': 'PF', 'WScore': 'OpponentScore'}
    )
    all_games = pd.concat([df_win, df_lose], ignore_index=True)
    team_stats = all_games.groupby(['Season', 'TeamID']).agg(
        games=('TeamID', 'count'), total_points=('Score', 'sum'), total_fgm=('FGM', 'sum'), total_fga=('FGA', 'sum'),
        total_fgm3=('FGM3', 'sum'), total_fga3=('FGA3', 'sum'), total_ftm=('FTM', 'sum'), total_fta=('FTA', 'sum'),
        total_or=('OR', 'sum'), total_dr=('DR', 'sum'), total_ast=('Ast', 'sum'), total_to=('TO', 'sum'),
        total_stl=('Stl', 'sum'), total_blk=('Blk', 'sum'), total_pf=('PF', 'sum'), total_points_allowed=('OpponentScore', 'sum')
    ).reset_index()
    team_stats['avg_points'] = team_stats['total_points'] / team_stats['games']
    team_stats['avg_points_allowed'] = team_stats['total_points_allowed'] / team_stats['games']
    team_stats['fg_percent'] = team_stats['total_fgm'] / team_stats['total_fga']
    team_stats['three_p_percent'] = team_stats['total_fgm3'] / team_stats['total_fga3']
    team_stats['ft_percent'] = team_stats['total_ftm'] / team_stats['total_fta']
    team_stats['avg_rebounds'] = (team_stats['total_or'] + team_stats['total_dr']) / team_stats['games']
    team_stats['avg_assists'] = team_stats['total_ast'] / team_stats['games']
    team_stats['avg_turnovers'] = team_stats['total_to'] / team_stats['games']
    team_stats['avg_steals'] = team_stats['total_stl'] / team_stats['games']
    team_stats['avg_blocks'] = team_stats['total_blk'] / team_stats['games']
    team_stats['avg_fouls'] = team_stats['total_pf'] / team_stats['games']
    return team_stats[['Season', 'TeamID', 'avg_points', 'avg_points_allowed', 'fg_percent', 'three_p_percent', 'ft_percent', 'avg_rebounds', 'avg_assists', 'avg_turnovers', 'avg_steals', 'avg_blocks', 'avg_fouls']]

print("🔹 Computing Men's Team Stats...")
m_team_stats = compute_team_season_stats(m_regular_season_detailed)
print("✅ Done!")
print("🔹 Computing Women's Team Stats...")
w_team_stats = compute_team_season_stats(w_regular_season_detailed)
print("✅ Done!")
m_team_stats.to_csv("m_team_stats.csv", index=False)
w_team_stats.to_csv("w_team_stats.csv", index=False)
print("✅ Team stats saved!")
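
The winner/loser reshaping inside `compute_team_season_stats` is easier to follow on a toy input. This sketch uses two invented games and only the score columns; the real function applies the same idea to every box-score column.

```python
import pandas as pd

# Two made-up games: team 1101 beats 1102, then 1102 beats 1101
games = pd.DataFrame({
    "Season": [2025, 2025],
    "WTeamID": [1101, 1102], "WScore": [70, 65],
    "LTeamID": [1102, 1101], "LScore": [60, 64],
})

# View each game once from the winner's side and once from the loser's side
winners = games.rename(columns={"WTeamID": "TeamID", "WScore": "Score", "LScore": "OpponentScore"})
losers = games.rename(columns={"LTeamID": "TeamID", "LScore": "Score", "WScore": "OpponentScore"})
cols = ["Season", "TeamID", "Score", "OpponentScore"]
long_games = pd.concat([winners[cols], losers[cols]], ignore_index=True)

# One row per (Season, TeamID), regardless of whether the team won or lost
per_team = (
    long_games.groupby(["Season", "TeamID"])
    .agg(
        games=("Score", "count"),
        avg_points=("Score", "mean"),
        avg_points_allowed=("OpponentScore", "mean"),
    )
    .reset_index()
)
print(per_team)
```

Each game contributes two rows to `long_games`, so a team's averages cover both its wins and its losses.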

### Create Matchup Features for Model Training

In [None]:
def prepare_training_data(tourney_results, team_stats):
    """
    Prepare tournament data with team stats for XGBoost training.
    Team1 is the lower TeamID in each game, Team2 the higher, and Outcome is 1 when Team1 won.
    """
    tourney_results = tourney_results[tourney_results['Season'] < 2025].copy()
    tourney_results['Team1'] = tourney_results[['WTeamID', 'LTeamID']].min(axis=1)
    tourney_results['Team2'] = tourney_results[['WTeamID', 'LTeamID']].max(axis=1)
    tourney_results['Outcome'] = (tourney_results['WTeamID'] == tourney_results['Team1']).astype(int)
    feature_cols = ['avg_points', 'avg_points_allowed', 'fg_percent', 'three_p_percent', 'ft_percent', 'avg_rebounds', 'avg_assists', 'avg_turnovers', 'avg_steals', 'avg_blocks', 'avg_fouls']
    team1_stats = team_stats[['Season', 'TeamID'] + feature_cols].rename(columns={col: f'Team1_{col}' for col in feature_cols})
    team2_stats = team_stats[['Season', 'TeamID'] + feature_cols].rename(columns={col: f'Team2_{col}' for col in feature_cols})
    train_data = tourney_results.merge(team1_stats, left_on=['Season', 'Team1'], right_on=['Season', 'TeamID'], how='left').drop('TeamID', axis=1)
    train_data = train_data.merge(team2_stats, left_on=['Season', 'Team2'], right_on=['Season', 'TeamID'], how='left').drop('TeamID', axis=1)
    return train_data, feature_cols

print("🔹 Preparing Men's Training Data...")
m_train_data, feature_cols = prepare_training_data(m_tourney_compact, m_team_stats)
print("✅ Done!")
print("🔹 Preparing Women's Training Data...")
w_train_data, _ = prepare_training_data(w_tourney_compact, w_team_stats)
print("✅ Done!")
m_train_data.to_csv("m_training_data.csv", index=False)
w_train_data.to_csv("w_training_data.csv", index=False)
print("✅ Training data saved!")
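
The Team1/Team2 orientation used by `prepare_training_data` can be illustrated with a small sketch. The game rows are invented; the `ID` column follows the `Season_Team1_Team2` pattern that the submission code later in this notebook builds.

```python
import pandas as pd

# Invented results: the higher ID wins the first game, the lower ID wins the second
results = pd.DataFrame({
    "Season": [2024, 2024],
    "WTeamID": [1314, 1104],
    "LTeamID": [1163, 1345],
})

# Order each pair by TeamID so the label does not depend on who won
results["Team1"] = results[["WTeamID", "LTeamID"]].min(axis=1)
results["Team2"] = results[["WTeamID", "LTeamID"]].max(axis=1)

# Outcome is 1 when the lower-ID team won, 0 otherwise
results["Outcome"] = (results["WTeamID"] == results["Team1"]).astype(int)

results["ID"] = (
    results["Season"].astype(str) + "_"
    + results["Team1"].astype(str) + "_"
    + results["Team2"].astype(str)
)
print(results[["ID", "Outcome"]])
```

Ordering by ID gives the target both classes; if the winner were always listed first, every label would be 1 and the model would have nothing to learn.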

## Computing ELO Ratings and Submission
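
A single-game walk-through of the rating update may help before reading the full cell. This standalone sketch mirrors the formulas in `compute_elo_ratings` below (expected score with a home bonus, a log2 margin-of-victory multiplier, and a zero-sum update). The helper name `elo_update` and the example numbers are made up for illustration.

```python
import math

def elo_update(w_elo, l_elo, point_diff, wloc="N", k_factor=20, home_advantage=100):
    """Return (new_winner_elo, new_loser_elo) for one game. Hypothetical helper."""
    # Credit the home team with a rating bonus before computing the expected score
    if wloc == "H":
        diff = l_elo - (w_elo + home_advantage)
    elif wloc == "A":
        diff = (l_elo + home_advantage) - w_elo
    else:
        diff = l_elo - w_elo
    exp_w = 1.0 / (1.0 + 10.0 ** (diff / 400.0))  # winner's pre-game win probability

    # Larger margins move ratings more; the second term damps wins by the favorite
    mov = math.log2(point_diff + 1) * (2.2 / ((w_elo - l_elo) * 0.001 + 2.2))

    delta = k_factor * mov * (1.0 - exp_w)
    return w_elo + delta, l_elo - delta

# Two equally rated teams on a neutral court, decided by 7 points
new_w, new_l = elo_update(1500, 1500, point_diff=7, wloc="N")
print(new_w, new_l)  # 1530.0 1470.0

# The same win at home is less surprising, so it earns fewer points
home_w, home_l = elo_update(1500, 1500, point_diff=7, wloc="H")
print(home_w, home_l)
```

In the neutral case the expected score is 0.5 and the multiplier is log2(8) = 3, so the winner gains 20 * 3 * 0.5 = 30 points and the loser gives up the same amount.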

In [None]:
import itertools
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def compute_elo_ratings(regular_season_df, start_elo=1500, k_factor=20, season_scoped=True, home_advantage=100):
    """
    Compute ELO ratings with home-court advantage and margin-of-victory.
    """
    regular_season_df = regular_season_df.sort_values(by=['Season', 'DayNum']).reset_index(drop=True)
    elo_dict = {}
    for _, row in regular_season_df.iterrows():
        season = int(row['Season'])
        w_tid = int(row['WTeamID'])
        l_tid = int(row['LTeamID'])
        w_score = row['WScore']
        l_score = row['LScore']
        wloc = row['WLoc']
        w_key = (season, w_tid) if season_scoped else w_tid
        l_key = (season, l_tid) if season_scoped else l_tid
        if w_key not in elo_dict:
            elo_dict[w_key] = start_elo
        if l_key not in elo_dict:
            elo_dict[l_key] = start_elo
        w_elo = elo_dict[w_key]
        l_elo = elo_dict[l_key]
        if wloc == 'H':
            exp_w = 1.0 / (1.0 + 10.0 ** ((l_elo - (w_elo + home_advantage)) / 400.0))
        elif wloc == 'A':
            exp_w = 1.0 / (1.0 + 10.0 ** (((l_elo + home_advantage) - w_elo) / 400.0))
        else:
            exp_w = 1.0 / (1.0 + 10.0 ** ((l_elo - w_elo) / 400.0))
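        point_diff = w_score - l_score
        # Margin-of-victory multiplier: grows with the point difference and is damped
        # when the higher-rated team wins (w_elo > l_elo).
        mov_factor = np.log2(point_diff + 1) * (2.2 / ((w_elo - l_elo) * 0.001 + 2.2))
        # Zero-sum update: the winner gains exactly what the loser gives up.
        delta = k_factor * mov_factor * (1.0 - exp_w)
        elo_dict[w_key] = w_elo + delta
        elo_dict[l_key] = l_elo - delta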
    return elo_dict

print("🔹 Computing Men's ELO...")
m_elo_dict = compute_elo_ratings(m_regular_season_compact)
print("✅ Done!")
print("🔹 Computing Women's ELO...")
w_elo_dict = compute_elo_ratings(w_regular_season_compact)
print("✅ Done!")

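team1_feats = [f'Team1_{col}' for col in feature_cols]
team2_feats = [f'Team2_{col}' for col in feature_cols]
# Tournament games from seasons without detailed results (before 2003 for men, 2010 for women)
# have no team stats after the left merge; fillna(0) turns those missing features into zeros.
X_train_m = m_train_data[team1_feats + team2_feats].fillna(0)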
y_train_m = m_train_data['Outcome']
xgb_model_m = XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42)
xgb_model_m.fit(X_train_m, y_train_m)
print("✅ Men's XGBoost Trained!")

X_train_w = w_train_data[team1_feats + team2_feats].fillna(0)
y_train_w = w_train_data['Outcome']
xgb_model_w = XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42)
xgb_model_w.fit(X_train_w, y_train_w)
print("✅ Women's XGBoost Trained!")

def add_elo_to_train_data(train_data, elo_dict):
    """
    Add ELO ratings to training data, ensuring no duplicate columns.
    """
    elo_df = pd.DataFrame([(k[0], k[1], v) for k, v in elo_dict.items()], columns=['Season', 'TeamID', 'ELO'])
    elo_df = elo_df.drop_duplicates(subset=['Season', 'TeamID'])
    train_data = train_data.copy()
    for col in ['Team1_ELO', 'Team2_ELO']:
        if col in train_data.columns:
            train_data = train_data.drop(columns=[col])
    train_data = train_data.merge(
        elo_df, left_on=['Season', 'Team1'], right_on=['Season', 'TeamID'], 
        how='left', suffixes=('', '_drop')
    ).rename(columns={'ELO': 'Team1_ELO'}).drop('TeamID', axis=1)
    train_data = train_data.merge(
        elo_df, left_on=['Season', 'Team2'], right_on=['Season', 'TeamID'], 
        how='left', suffixes=('', '_drop')
    ).rename(columns={'ELO': 'Team2_ELO'}).drop('TeamID', axis=1)
    train_data = train_data.loc[:, ~train_data.columns.str.endswith('_drop')]
    return train_data

# After XGBoost training
print("Shape of m_train_data before adding ELO:", m_train_data.shape)
print("Shape of w_train_data before adding ELO:", w_train_data.shape)
print("Length of X_train_m:", len(X_train_m))
print("Length of y_train_m:", len(y_train_m))
print("Length of X_train_w:", len(X_train_w))
print("Length of y_train_w:", len(y_train_w))

m_train_data = add_elo_to_train_data(m_train_data, m_elo_dict)
w_train_data = add_elo_to_train_data(w_train_data, w_elo_dict)

print("Shape of m_train_data after adding ELO:", m_train_data.shape)
print("Columns in m_train_data:", m_train_data.columns.tolist())
print("Missing Team1_ELO in m_train_data:", m_train_data['Team1_ELO'].isna().sum())
print("Missing Team2_ELO in m_train_data:", m_train_data['Team2_ELO'].isna().sum())
print("Shape of w_train_data after adding ELO:", w_train_data.shape)
print("Columns in w_train_data:", w_train_data.columns.tolist())
print("Missing Team1_ELO in w_train_data:", w_train_data['Team1_ELO'].isna().sum())
print("Missing Team2_ELO in w_train_data:", w_train_data['Team2_ELO'].isna().sum())

m_train_data['Team1_ELO'] = m_train_data['Team1_ELO'].fillna(1500)
m_train_data['Team2_ELO'] = m_train_data['Team2_ELO'].fillna(1500)
w_train_data['Team1_ELO'] = w_train_data['Team1_ELO'].fillna(1500)
w_train_data['Team2_ELO'] = w_train_data['Team2_ELO'].fillna(1500)

m_train_data['ELO_prob'] = 1 / (1 + 10 ** ((m_train_data['Team2_ELO'] - m_train_data['Team1_ELO']) / 400))
m_train_data['XGBoost_prob'] = xgb_model_m.predict_proba(X_train_m)[:, 1]
m_train_data['avg_prob'] = (m_train_data['ELO_prob'] + m_train_data['XGBoost_prob']) / 2

w_train_data['ELO_prob'] = 1 / (1 + 10 ** ((w_train_data['Team2_ELO'] - w_train_data['Team1_ELO']) / 400))
w_train_data['XGBoost_prob'] = xgb_model_w.predict_proba(X_train_w)[:, 1]
w_train_data['avg_prob'] = (w_train_data['ELO_prob'] + w_train_data['XGBoost_prob']) / 2

assert len(m_train_data) == len(X_train_m) == len(y_train_m), "Mismatch in m_train_data, X_train_m, y_train_m lengths"
assert len(w_train_data) == len(X_train_w) == len(y_train_w), "Mismatch in w_train_data, X_train_w, y_train_w lengths"

calibrator_m = LogisticRegression(random_state=42).fit(m_train_data['avg_prob'].values.reshape(-1, 1), y_train_m)
calibrator_w = LogisticRegression(random_state=42).fit(w_train_data['avg_prob'].values.reshape(-1, 1), y_train_w)
print("✅ Calibrators Fitted!")

# --- Brier Score Validation on 2024 Tournament ---
print("🔹 Calculating Brier Score on 2024 Validation Set...")

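# Filter 2024 tournament games. Caution: prepare_training_data keeps every season before 2025,
# so these games were also part of the training data and the Brier score below is in-sample
# (optimistic). For a true holdout, train on seasons before 2024 only.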
m_tourney_2024 = m_tourney_compact[m_tourney_compact['Season'] == 2024].copy()
w_tourney_2024 = w_tourney_compact[w_tourney_compact['Season'] == 2024].copy()

# Prepare validation data (reusing prepare_training_data from earlier)
m_val_data, _ = prepare_training_data(m_tourney_2024, m_team_stats)
w_val_data, _ = prepare_training_data(w_tourney_2024, w_team_stats)

# Add ELO ratings to validation data
m_val_data = add_elo_to_train_data(m_val_data, m_elo_dict)
w_val_data = add_elo_to_train_data(w_val_data, w_elo_dict)

# Handle NaNs in validation data
m_val_data['Team1_ELO'] = m_val_data['Team1_ELO'].fillna(1500)
m_val_data['Team2_ELO'] = m_val_data['Team2_ELO'].fillna(1500)
w_val_data['Team1_ELO'] = w_val_data['Team1_ELO'].fillna(1500)
w_val_data['Team2_ELO'] = w_val_data['Team2_ELO'].fillna(1500)

# Compute ELO probabilities
m_val_data['ELO_prob'] = 1 / (1 + 10 ** ((m_val_data['Team2_ELO'] - m_val_data['Team1_ELO']) / 400))
w_val_data['ELO_prob'] = 1 / (1 + 10 ** ((w_val_data['Team2_ELO'] - w_val_data['Team1_ELO']) / 400))

# Compute XGBoost probabilities
X_val_m = m_val_data[team1_feats + team2_feats].fillna(0)
m_val_data['XGBoost_prob'] = xgb_model_m.predict_proba(X_val_m)[:, 1]
X_val_w = w_val_data[team1_feats + team2_feats].fillna(0)
w_val_data['XGBoost_prob'] = xgb_model_w.predict_proba(X_val_w)[:, 1]

# Average probabilities
m_val_data['avg_prob'] = (m_val_data['ELO_prob'] + m_val_data['XGBoost_prob']) / 2
w_val_data['avg_prob'] = (w_val_data['ELO_prob'] + w_val_data['XGBoost_prob']) / 2

# Calibrate probabilities
m_val_data['calibrated_prob'] = calibrator_m.predict_proba(m_val_data['avg_prob'].values.reshape(-1, 1))[:, 1]
w_val_data['calibrated_prob'] = calibrator_w.predict_proba(w_val_data['avg_prob'].values.reshape(-1, 1))[:, 1]

# Calculate Brier Score
m_brier = brier_score_loss(m_val_data['Outcome'], m_val_data['calibrated_prob'])
w_brier = brier_score_loss(w_val_data['Outcome'], w_val_data['calibrated_prob'])
print(f"Men's 2024 Brier Score: {m_brier:.4f}")
print(f"Women's 2024 Brier Score: {w_brier:.4f}")
print(f"Average Brier Score: {(m_brier + w_brier) / 2:.4f}")
print("✅ Validation Complete!")
# --- End of Brier Score Validation ---

def extract_final_season_elo(elo_dict, season=2025):
    """Return a DataFrame with TeamID and Final_ELO for one season of a season-scoped ELO dict."""
    rows = [(tid, elo) for (s, tid), elo in elo_dict.items() if s == season]
    return pd.DataFrame(rows, columns=['TeamID', 'Final_ELO'])

m_teams_2025_elo = extract_final_season_elo(m_elo_dict, 2025)
w_teams_2025_elo = extract_final_season_elo(w_elo_dict, 2025)
m_team_stats_2025 = m_team_stats[m_team_stats['Season'] == 2025]
w_team_stats_2025 = w_team_stats[w_team_stats['Season'] == 2025]

m_teams_2025_ids = m_team_conferences[m_team_conferences['Season'] == 2025]['TeamID'].unique()
w_teams_2025_ids = w_team_conferences[w_team_conferences['Season'] == 2025]['TeamID'].unique()
m_matchups = list(itertools.combinations(sorted(m_teams_2025_ids), 2))
w_matchups = list(itertools.combinations(sorted(w_teams_2025_ids), 2))
print(f"Men's Matchups: {len(m_matchups)}, Women's Matchups: {len(w_matchups)}")

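def compute_submission_rows(matchups, elo_df, team_stats_df, xgb_model, calibrator, season=2025):
    """
    Build (ID, Pred) rows for every matchup, where Pred is the calibrated probability
    that t1 beats t2. Teams with no ELO default to 1500 and teams with no stats default
    to all-zero features. Relies on the global `feature_cols`.
    """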
    rows = []
    for t1, t2 in matchups:
        row_id = f"{season}_{t1}_{t2}"
        elo_t1 = elo_df.loc[elo_df['TeamID'] == t1, 'Final_ELO'].iloc[0] if t1 in elo_df['TeamID'].values else 1500
        elo_t2 = elo_df.loc[elo_df['TeamID'] == t2, 'Final_ELO'].iloc[0] if t2 in elo_df['TeamID'].values else 1500
        elo_prob = 1 / (1 + 10 ** ((elo_t2 - elo_t1) / 400))
        stats_t1 = team_stats_df.loc[team_stats_df['TeamID'] == t1, feature_cols].iloc[0] if t1 in team_stats_df['TeamID'].values else pd.Series(0, index=feature_cols)
        stats_t2 = team_stats_df.loc[team_stats_df['TeamID'] == t2, feature_cols].iloc[0] if t2 in team_stats_df['TeamID'].values else pd.Series(0, index=feature_cols)
        xgb_feats = np.concatenate([stats_t1.values, stats_t2.values])
        xgb_prob = xgb_model.predict_proba([xgb_feats])[0, 1]
        avg_prob = (elo_prob + xgb_prob) / 2
        final_prob = calibrator.predict_proba([[avg_prob]])[0, 1]
        rows.append((row_id, final_prob))
    return rows

print("🔹 Generating Submission...")
m_rows = compute_submission_rows(m_matchups, m_teams_2025_elo, m_team_stats_2025, xgb_model_m, calibrator_m)
w_rows = compute_submission_rows(w_matchups, w_teams_2025_elo, w_team_stats_2025, xgb_model_w, calibrator_w)
submission_data = pd.DataFrame(m_rows + w_rows, columns=["ID", "Pred"])
submission_data.to_csv("submission.csv", index=False)
print(f"✅ Submission Saved! Total Rows: {len(submission_data)}")
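
The per-matchup loop above calls the model once per row, which can be slow for tens of thousands of pairings. One possible alternative, sketched here under assumptions, builds all pairings as a DataFrame and attaches ratings by merge, so that a model could score every row in a single call. The team IDs and ratings are invented, and only the ELO part of the blend is shown; the XGBoost and calibration steps are left out.

```python
import itertools
import pandas as pd

season = 2025
# Invented ratings; in the notebook this role is played by the *_teams_2025_elo frames
elo_df = pd.DataFrame({"TeamID": [1101, 1102, 1103], "Final_ELO": [1600.0, 1500.0, 1400.0]})
team_ids = [1103, 1101, 1102, 1104]  # 1104 has no rating and falls back to 1500

# All unordered pairs with the lower ID first, matching the ID convention
pairs = pd.DataFrame(
    list(itertools.combinations(sorted(team_ids), 2)), columns=["Team1", "Team2"]
)

# Attach each side's rating with a left merge instead of a per-row lookup
for side in ("Team1", "Team2"):
    pairs = pairs.merge(
        elo_df.rename(columns={"TeamID": side, "Final_ELO": f"{side}_ELO"}),
        on=side, how="left",
    )
pairs[["Team1_ELO", "Team2_ELO"]] = pairs[["Team1_ELO", "Team2_ELO"]].fillna(1500)

pairs["ID"] = f"{season}_" + pairs["Team1"].astype(str) + "_" + pairs["Team2"].astype(str)
# Probability that Team1 (the lower ID) wins, from the rating gap alone
pairs["Pred"] = 1 / (1 + 10 ** ((pairs["Team2_ELO"] - pairs["Team1_ELO"]) / 400))

submission = pairs[["ID", "Pred"]]
print(submission)
```

With features laid out as columns like this, `predict_proba` could be called once on the whole frame rather than once per matchup, though that has not been measured here.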