# Loading in data

## Data Section 1 - The Basics

**This section provides everything you need to build a simple prediction model and submit predictions.**

- Team IDs and team names
- Tournament seeds since 1984-85 season
- Final scores of all regular season, conference tournament, and NCAA® tournament games since 1984-85 season
- Season-level details including dates and region names
- Example submission file for stage 1
  
By convention, when we identify a particular season, we will reference the year that the season ends in, not the year that it starts in.
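
As a small illustration of this convention, the sketch below labels a calendar date with its season. The helper name `season_label` and the July cutoff are assumptions made for the example (seasons start in the fall and end in the spring); the dataset itself does not define them.

```python
from datetime import date

def season_label(game_date: date, cutoff_month: int = 7) -> int:
    """Return the season a game belongs to, named after the year the season ends in.

    Assumption: games played from `cutoff_month` onward count toward the
    season that ends in the following calendar year.
    """
    if game_date.month >= cutoff_month:
        return game_date.year + 1
    return game_date.year

# A November 2024 game and a March 2025 game both belong to the 2025 season
print(season_label(date(2024, 11, 15)))
print(season_label(date(2025, 3, 20)))
```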

## Data Section 2 - Team Box Scores

**This section provides game-by-game stats at a team level (free throws attempted, defensive rebounds, turnovers, etc.) for all regular season, conference tournament, and NCAA® tournament games since the 2003 season (men) or since the 2010 season (women).**

Team Box Scores are provided in "Detailed Results" files rather than "Compact Results" files. However, the two files are strongly related.

In a Detailed Results file, the first eight columns (**Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, and NumOT**) are exactly the same as in a Compact Results file, and they are followed by many additional columns. The column names should be self-explanatory to basketball fans ("W" or "L" refers to the winning or losing team):

- WFGM - field goals made (by the winning team)
- WFGA - field goals attempted (by the winning team)
- WFGM3 - three pointers made (by the winning team)
- WFGA3 - three pointers attempted (by the winning team)
- WFTM - free throws made (by the winning team)
- WFTA - free throws attempted (by the winning team)
- WOR - offensive rebounds (pulled by the winning team)
- WDR - defensive rebounds (pulled by the winning team)
- WAst - assists (by the winning team)
- WTO - turnovers committed (by the winning team)
- WStl - steals (accomplished by the winning team)
- WBlk - blocks (accomplished by the winning team)
- WPF - personal fouls committed (by the winning team)

(and then the same set of stats from the perspective of the losing team: **LFGM** is the number of field goals made by the losing team, and so on up to **LPF**).

Note: by convention, "field goals made" (either WFGM or LFGM) refers to the total number of field goals made by a team, combining two-point and three-point field goals. "Three-point field goals made" (either WFGM3 or LFGM3) counts only the three-point field goals. So if you want to know specifically about two-point field goals, you have to subtract one from the other (e.g., WFGM - WFGM3). The total number of points scored is most simply expressed as (2*FGM) + FGM3 + FTM.
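
The arithmetic in this note can be checked on a made-up box score. The two rows below are invented numbers, not real games; the column names follow the Detailed Results layout described above, while `WFGM2`, `WPoints`, and `WPointsCheck` are names introduced only for this sketch.

```python
import pandas as pd

# Invented box-score lines for two winning teams (not real games)
games = pd.DataFrame({
    "WFGM":  [28, 31],   # all field goals made (twos and threes)
    "WFGM3": [8, 5],     # three-pointers made
    "WFTM":  [14, 10],   # free throws made
})

# Two-point field goals are what remains after removing the threes
games["WFGM2"] = games["WFGM"] - games["WFGM3"]

# Points: 2 per field goal, 1 extra per three-pointer, 1 per free throw
games["WPoints"] = 2 * games["WFGM"] + games["WFGM3"] + games["WFTM"]

# Same total, counted shot type by shot type
games["WPointsCheck"] = 2 * games["WFGM2"] + 3 * games["WFGM3"] + games["WFTM"]

print(games)
```

On real data, comparing the computed total against `WScore` is a quick way to spot rows with inconsistent box scores.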

## Data Section 3 - Geography

**This section provides city locations of all regular season, conference tournament, and NCAA® tournament games since the 2010 season.**

## Data Section 4 - Public Rankings

**This section provides weekly team rankings (men's teams only) for dozens of top rating systems - Pomeroy, Sagarin, RPI, ESPN, etc., since the 2003 season.**

## Data Section 5 - Supplements

**This section contains additional supporting information, including coaches, conference affiliations, alternative team name spellings, bracket structure, and game results for NIT and other postseason tournaments.**



In [None]:
import itertools
import pandas as pd
import numpy as np
from sklearn.metrics import brier_score_loss
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


# Function to reduce memory usage
def reduce_mem_usage(df):
    """ Reduces memory usage by optimizing data types. """
    start_mem = df.memory_usage().sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min, c_max = df[col].min(), df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                else:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    return df

# Efficient Data Loading with Memory Optimization
def load_optimized_csv(file_path):
    """Load a CSV file as ISO-8859-1 (to sidestep decoding errors) and shrink its dtypes."""
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
    df = reduce_mem_usage(df)
    return df

# Map short dataset names to the competition CSV files
dataset_files = {
    "m_teams": "MTeams.csv",
    "w_teams": "WTeams.csv",
    "m_seasons": "MSeasons.csv",
    "w_seasons": "WSeasons.csv",
    "m_tourney_seeds": "MNCAATourneySeeds.csv",
    "w_tourney_seeds": "WNCAATourneySeeds.csv",
    "m_regular_season_compact": "MRegularSeasonCompactResults.csv",
    "w_regular_season_compact": "WRegularSeasonCompactResults.csv",
    "m_tourney_compact": "MNCAATourneyCompactResults.csv",
    "w_tourney_compact": "WNCAATourneyCompactResults.csv",
    "m_regular_season_detailed": "MRegularSeasonDetailedResults.csv",
    "w_regular_season_detailed": "WRegularSeasonDetailedResults.csv",
    "m_tourney_detailed": "MNCAATourneyDetailedResults.csv",
    "w_tourney_detailed": "WNCAATourneyDetailedResults.csv",
    "cities": "Cities.csv",
    "m_game_cities": "MGameCities.csv",
    "w_game_cities": "WGameCities.csv",
    "massey_ordinals": "MMasseyOrdinals.csv",
    "m_team_coaches": "MTeamCoaches.csv",
    "conferences": "Conferences.csv",
    "m_team_conferences": "MTeamConferences.csv",
    "w_team_conferences": "WTeamConferences.csv",
    "m_conf_tourney_games": "MConferenceTourneyGames.csv",
    "w_conf_tourney_games": "WConferenceTourneyGames.csv",
    "m_secondary_tourney_teams": "MSecondaryTourneyTeams.csv",
    "w_secondary_tourney_teams": "WSecondaryTourneyTeams.csv",
    "m_secondary_tourney_results": "MSecondaryTourneyCompactResults.csv",
    "w_secondary_tourney_results": "WSecondaryTourneyCompactResults.csv",
    "m_team_spellings": "MTeamSpellings.csv",
    "w_team_spellings": "WTeamSpellings.csv",
    "m_tourney_slots": "MNCAATourneySlots.csv",
    "w_tourney_slots": "WNCAATourneySlots.csv",
    "m_seed_round_slots": "MNCAATourneySeedRoundSlots.csv"
}

# Load all datasets with memory optimization
dataframes = {}

for key, file_name in dataset_files.items():
    dataframes[key] = load_optimized_csv(f"/kaggle/input/march-machine-learning-mania-2025/{file_name}")

# Unpack dataframes into individual variables
(
    m_teams, w_teams, m_seasons, w_seasons, m_tourney_seeds, w_tourney_seeds,
    m_regular_season_compact, w_regular_season_compact, m_tourney_compact, w_tourney_compact,
    m_regular_season_detailed, w_regular_season_detailed, m_tourney_detailed, w_tourney_detailed,
    cities, m_game_cities, w_game_cities, massey_ordinals, m_team_coaches, conferences,
    m_team_conferences, w_team_conferences, m_conf_tourney_games, w_conf_tourney_games,
    m_secondary_tourney_teams, w_secondary_tourney_teams, m_secondary_tourney_results,
    w_secondary_tourney_results, m_team_spellings, w_team_spellings, m_tourney_slots,
    w_tourney_slots, m_seed_round_slots
) = dataframes.values()


## Exploring the Data

In [None]:
# Create summary of each dataset
dataset_summary = {}

for name, df in dataframes.items():
    dataset_summary[name] = {
        "Rows": df.shape[0],
        "Columns": df.shape[1],
        "Missing Values": df.isnull().sum().sum(),
        "Duplicate Rows": df.duplicated().sum(),
        "First 5 Rows": df.head().to_dict(),  # Convert to dict to avoid display issues
        "Column Names": df.columns.tolist()
    }

# Convert summary to DataFrame
summary_df = pd.DataFrame(dataset_summary).T

# Display dataset summaries
display(summary_df)  # Works in notebooks

# Try converting to Markdown format (fallback to .to_string() if .to_markdown() is unavailable)
try:
    summary_md = summary_df.to_markdown()
except ImportError:
    summary_md = summary_df.to_string()

# Save as a text file
with open("dataset_summary.txt", "w", encoding="utf-8") as f:
    f.write(summary_md)

print("✅ Dataset summary saved as 'dataset_summary.txt'")

## Data Preparation

### Compute Team-Level Stats from Regular Season

In [None]:
# Function to compute team-level season stats from regular season results
def compute_team_stats(regular_season_df):
    """Compute season stats for each team from regular season results."""

    # Compute stats for winning teams
    win_stats = regular_season_df.groupby(['Season', 'WTeamID']).agg(
        Wins=('WTeamID', 'count'),
        Points_Scored_Win=('WScore', 'sum'),
        Points_Allowed_Win=('LScore', 'sum')
    ).reset_index().rename(columns={'WTeamID': 'TeamID'})

    # Compute stats for losing teams
    loss_stats = regular_season_df.groupby(['Season', 'LTeamID']).agg(
        Losses=('LTeamID', 'count'),
        Points_Scored_Loss=('LScore', 'sum'),
        Points_Allowed_Loss=('WScore', 'sum')
    ).reset_index().rename(columns={'LTeamID': 'TeamID'})

    # Merge win/loss stats
    team_stats = pd.merge(win_stats, loss_stats, on=['Season', 'TeamID'], how='outer').fillna(0)

    # Compute additional features
    team_stats['TotalGames'] = team_stats['Wins'] + team_stats['Losses']

    # Avoid division by zero
    team_stats['WinRatio'] = team_stats['Wins'] / team_stats['TotalGames'].replace(0, 1)

    team_stats['AvgPointsScored'] = (team_stats['Points_Scored_Win'] + team_stats['Points_Scored_Loss']) / team_stats['TotalGames'].replace(0, 1)
    team_stats['AvgPointsAllowed'] = (team_stats['Points_Allowed_Win'] + team_stats['Points_Allowed_Loss']) / team_stats['TotalGames'].replace(0, 1)
    team_stats['PointMargin'] = team_stats['AvgPointsScored'] - team_stats['AvgPointsAllowed']

    # Drop redundant columns
    team_stats.drop(columns=['Points_Scored_Win', 'Points_Scored_Loss', 'Points_Allowed_Win', 'Points_Allowed_Loss'], inplace=True)

    # Optimize memory usage
    team_stats = team_stats.astype({
        "Season": "int16",
        "TeamID": "int16",
        "Wins": "int16",
        "Losses": "int16",
        "TotalGames": "int16",
        "WinRatio": "float16",
        "AvgPointsScored": "float16",
        "AvgPointsAllowed": "float16",
        "PointMargin": "float16"
    })

    return team_stats

# Compute stats for men's and women's teams
m_team_stats = compute_team_stats(m_regular_season_compact)
w_team_stats = compute_team_stats(w_regular_season_compact)

# Display computed stats
print("Men's Team Stats Sample:")
print(m_team_stats.head())

print("\nWomen's Team Stats Sample:")
print(w_team_stats.head())

# Save to CSV 
m_team_stats.to_csv("m_team_stats.csv", index=False)
w_team_stats.to_csv("w_team_stats.csv", index=False)

print("\n✅ Team stats saved successfully!")

### Create Matchup Features for Model Training

Each matchup will include:
- Team A and Team B stats (from the team stats dataset).
- Feature differences (Team A - Team B).
- Tournament seeding for both teams.
- Target variable (Win = 1, Loss = 0).
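
In the code that follows, the target is 1 when the lower TeamID wins, so the difference features need the same orientation (lower TeamID minus higher TeamID). The sketch below shows one way to do that on an invented two-game table. The helper `orient_to_lower_id` and the stat name `Margin` are hypothetical and stand in for any per-team season stat.

```python
import numpy as np
import pandas as pd

def orient_to_lower_id(games: pd.DataFrame, stat: str) -> pd.DataFrame:
    """Build (Team A - Team B) differences where Team A is the lower TeamID."""
    out = games[["WTeamID", "LTeamID"]].copy()
    lower_won = games["WTeamID"] < games["LTeamID"]
    # Start from the winner-minus-loser difference
    winner_minus_loser = games[f"{stat}_Winner"] - games[f"{stat}_Loser"]
    # Flip the sign when the winner is the higher TeamID
    out[f"{stat}_Diff"] = np.where(lower_won, winner_minus_loser, -winner_minus_loser)
    out["Target"] = lower_won.astype(int)
    return out

# Invented games: in the second one the higher TeamID wins
games = pd.DataFrame({
    "WTeamID": [1101, 1250],
    "LTeamID": [1203, 1104],
    "Margin_Winner": [8.0, 5.0],
    "Margin_Loser": [2.0, 1.0],
})
print(orient_to_lower_id(games, "Margin"))
```

With this orientation, a positive difference means the lower TeamID had the better stat whether or not it won, which is what lets a model learn from both classes.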

In [None]:
# Function to create matchup features
def create_matchup_features(tourney_results, team_stats, seeds):
    """Generate training features for tournament matchups."""
    
    # Merge winning and losing team stats; the second merge suffixes the
    # overlapping stat columns with _Winner and _Loser
    df = tourney_results.merge(team_stats, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID'])
    df = df.merge(team_stats, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID'], suffixes=('_Winner', '_Loser'))

    # Drop redundant columns
    df.drop(columns=['TeamID_Winner', 'TeamID_Loser'], inplace=True)

    # Merge seeding data correctly (avoid overwriting)
    seeds_w = seeds.rename(columns={'TeamID': 'WTeamID', 'Seed': 'Seed_Winner'})
    df = df.merge(seeds_w, on=['Season', 'WTeamID'], how='left')

    seeds_l = seeds.rename(columns={'TeamID': 'LTeamID', 'Seed': 'Seed_Loser'})
    df = df.merge(seeds_l, on=['Season', 'LTeamID'], how='left')

    # Convert seeding to numeric (seeds look like "W01" or "X16a"; keep only the digits)
    df['Seed_Winner'] = df['Seed_Winner'].str.extract(r'(\d+)', expand=False).astype(float).fillna(17)  # Default to 17 if missing
    df['Seed_Loser'] = df['Seed_Loser'].str.extract(r'(\d+)', expand=False).astype(float).fillna(17)

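    # Compute matchup feature differences (winner - loser; seeds reversed so a better seed is positive)
    df['WinRatio_Diff'] = df['WinRatio_Winner'] - df['WinRatio_Loser']
    df['PointMargin_Diff'] = df['PointMargin_Winner'] - df['PointMargin_Loser']
    df['AvgPointsScored_Diff'] = df['AvgPointsScored_Winner'] - df['AvgPointsScored_Loser']
    df['AvgPointsAllowed_Diff'] = df['AvgPointsAllowed_Winner'] - df['AvgPointsAllowed_Loser']
    df['Seed_Diff'] = df['Seed_Loser'] - df['Seed_Winner']

    # Ensure Target reflects the competition format (1 if lower TeamID wins)
    df['Target'] = (df['WTeamID'] < df['LTeamID']).astype(int)

    # Re-orient the differences to the lower TeamID's perspective (Team1 - Team2) so they
    # line up with Target and with the submission features; left as winner - loser,
    # they would say nothing about which side Target describes
    flip = np.where(df['Target'] == 1, 1, -1)
    for col in ['WinRatio_Diff', 'PointMargin_Diff', 'AvgPointsScored_Diff', 'AvgPointsAllowed_Diff', 'Seed_Diff']:
        df[col] = df[col] * flip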

    # Optimize data types
    df = df.astype({
        "Season": "int16",
        "WTeamID": "int16",
        "LTeamID": "int16",
        "Seed_Winner": "int8",
        "Seed_Loser": "int8",
        "WinRatio_Diff": "float16",
        "PointMargin_Diff": "float16",
        "AvgPointsScored_Diff": "float16",
        "AvgPointsAllowed_Diff": "float16",
        "Seed_Diff": "int8",
        "Target": "int8"
    })

    return df

# Generate matchup data for training
m_training_data = create_matchup_features(m_tourney_compact, m_team_stats, m_tourney_seeds)
w_training_data = create_matchup_features(w_tourney_compact, w_team_stats, w_tourney_seeds)

# Display processed training data
print("Men's Training Data Sample:")
print(m_training_data.head())

print("\nWomen's Training Data Sample:")
print(w_training_data.head())

# Save to CSV 
m_training_data.to_csv("m_training_data.csv", index=False)
w_training_data.to_csv("w_training_data.csv", index=False)

print("\n✅ Training data saved successfully!")

## Make predictions

In [None]:
# Load the prepared training datasets
m_training_data = pd.read_csv("m_training_data.csv")
w_training_data = pd.read_csv("w_training_data.csv")

# Select features and target variable
FEATURES = ["WinRatio_Diff", "PointMargin_Diff", "AvgPointsScored_Diff", "AvgPointsAllowed_Diff", "Seed_Diff"]
TARGET = "Target"

# Function to split data into training and validation sets
def split_data(df, test_season=2024):
    """Splits the data into training and validation sets."""
    train_data = df[df["Season"] < test_season].copy()
    test_data = df[df["Season"] == test_season].copy()
    
    X_train, y_train = train_data[FEATURES], train_data[TARGET]
    X_test, y_test = test_data[FEATURES], test_data[TARGET]

    # Handle missing values
    X_train = X_train.fillna(0)
    X_test = X_test.fillna(0)

    return X_train, X_test, y_train, y_test

# Split data for men’s and women’s tournaments
X_train_m, X_test_m, y_train_m, y_test_m = split_data(m_training_data)
X_train_w, X_test_w, y_train_w, y_test_w = split_data(w_training_data)

# Standardize features (important for Logistic Regression)
scaler_m = StandardScaler().fit(X_train_m)
scaler_w = StandardScaler().fit(X_train_w)

X_train_m = scaler_m.transform(X_train_m)
X_test_m = scaler_m.transform(X_test_m)

X_train_w = scaler_w.transform(X_train_w)
X_test_w = scaler_w.transform(X_test_w)

# Logistic regression with L2 regularization (liblinear's default penalty; C=1.0 is the default strength)
model_m = LogisticRegression(C=1.0, solver='liblinear', random_state=42)
model_w = LogisticRegression(C=1.0, solver='liblinear', random_state=42)

# Train models
model_m.fit(X_train_m, y_train_m)
model_w.fit(X_train_w, y_train_w)

# Make predictions (probability of lower TeamID winning)
y_pred_m = model_m.predict_proba(X_test_m)[:, 1]
y_pred_w = model_w.predict_proba(X_test_w)[:, 1]

# Evaluate the model using Brier Score
brier_m = brier_score_loss(y_test_m, y_pred_m)
brier_w = brier_score_loss(y_test_w, y_pred_w)

print(f"✅ Men's Brier Score: {brier_m:.4f}")
print(f"✅ Women's Brier Score: {brier_w:.4f}")
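
For reference, the Brier score is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better and always predicting 0.5 scores 0.25. The sketch below computes it by hand on invented predictions; for binary outcomes this is the same quantity `brier_score_loss` reports. The helper name `brier_score` is introduced only for this example.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

# Invented outcomes and predicted probabilities
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.6, 0.5]

manual = brier_score(y_true, y_prob)
baseline = brier_score(y_true, [0.5] * len(y_true))  # uninformative reference

print(f"model: {manual:.4f}, always-0.5 baseline: {baseline:.4f}")
```

A validation score near or above 0.25 suggests the model is doing no better than a coin flip on that season.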

## Submission

In [None]:
import pandas as pd
import numpy as np
import itertools

# Load necessary data
def load_team_stats(file_path):
    """Load relevant columns from team stats dataset."""
    cols_needed = ["Season", "TeamID", "WinRatio", "PointMargin", "AvgPointsScored", "AvgPointsAllowed"]
    return pd.read_csv(file_path, usecols=cols_needed)

m_team_stats = load_team_stats("m_team_stats.csv")
w_team_stats = load_team_stats("w_team_stats.csv")

m_tourney_seeds = pd.read_csv("/kaggle/input/march-machine-learning-mania-2025/MNCAATourneySeeds.csv")
w_tourney_seeds = pd.read_csv("/kaggle/input/march-machine-learning-mania-2025/WNCAATourneySeeds.csv")

# Filter teams & seeds for the 2025 season
m_teams_2025 = m_team_stats[m_team_stats["Season"] == 2025].copy()
w_teams_2025 = w_team_stats[w_team_stats["Season"] == 2025].copy()

m_seeds_2025 = m_tourney_seeds[m_tourney_seeds["Season"] == 2025].copy()
w_seeds_2025 = w_tourney_seeds[w_tourney_seeds["Season"] == 2025].copy()

# Convert seeds to numeric (raw strings avoid invalid-escape warnings for \d)
m_seeds_2025["Seed"] = m_seeds_2025["Seed"].str.extract(r"(\d+)", expand=False).astype(float)
w_seeds_2025["Seed"] = w_seeds_2025["Seed"].str.extract(r"(\d+)", expand=False).astype(float)

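# Generate all possible matchups
def generate_matchups(team_stats):
    """Return every pair of teams, with TeamID1 < TeamID2 as the ID and Target conventions assume."""
    # Sort first so each combination lists the lower TeamID first
    teams = sorted(team_stats["TeamID"].unique())
    return pd.DataFrame(itertools.combinations(teams, 2), columns=["TeamID1", "TeamID2"])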

m_matchups = generate_matchups(m_teams_2025)
w_matchups = generate_matchups(w_teams_2025)

# Prepare submission dataset
def prepare_submission_data(matchups, team_stats, seeds):
    df = matchups.merge(team_stats, left_on="TeamID1", right_on="TeamID").copy()
    df = df.merge(team_stats, left_on="TeamID2", right_on="TeamID", suffixes=("_T1", "_T2")).copy()

    # Merge seed info
    df = df.merge(seeds.rename(columns={"TeamID": "TeamID1", "Seed": "Seed_T1"}), on="TeamID1", how="left")
    df = df.merge(seeds.rename(columns={"TeamID": "TeamID2", "Seed": "Seed_T2"}), on="TeamID2", how="left")

    # Compute feature differences
    df["WinRatio_Diff"] = df["WinRatio_T1"] - df["WinRatio_T2"]
    df["PointMargin_Diff"] = df["PointMargin_T1"] - df["PointMargin_T2"]
    df["AvgPointsScored_Diff"] = df["AvgPointsScored_T1"] - df["AvgPointsScored_T2"]
    df["AvgPointsAllowed_Diff"] = df["AvgPointsAllowed_T1"] - df["AvgPointsAllowed_T2"]
    df["Seed_Diff"] = df["Seed_T2"] - df["Seed_T1"]

    # Teams without a 2025 seed give a NaN difference; treat that as no seed difference
    df["Seed_Diff"] = df["Seed_Diff"].fillna(0)

    # Create ID column in format "2025_TeamID1_TeamID2"
    df["ID"] = "2025_" + df["TeamID1"].astype(str) + "_" + df["TeamID2"].astype(str)

    return df[["ID", "WinRatio_Diff", "PointMargin_Diff", "AvgPointsScored_Diff", "AvgPointsAllowed_Diff", "Seed_Diff"]]

m_submission_data = prepare_submission_data(m_matchups, m_teams_2025, m_seeds_2025)
w_submission_data = prepare_submission_data(w_matchups, w_teams_2025, w_seeds_2025)

# Define features
FEATURES = ["WinRatio_Diff", "PointMargin_Diff", "AvgPointsScored_Diff", "AvgPointsAllowed_Diff", "Seed_Diff"]
BATCH_SIZE = 5000
submission_chunks = []

def process_batches(submission_data, model, scaler, label):
    """Predict in batches to limit memory use, scaling features as in training."""
    for i in range(0, len(submission_data), BATCH_SIZE):
        print(f"Processing {label} matchups {i} to {min(i + BATCH_SIZE, len(submission_data))}...")

        chunk = submission_data.iloc[i:i + BATCH_SIZE].copy()

        # Apply the scaler fitted on the training data: the model was trained on
        # standardized features, so raw values would give distorted probabilities
        chunk_features = scaler.transform(chunk[FEATURES])
        chunk["Pred"] = model.predict_proba(chunk_features)[:, 1]

        submission_chunks.append(chunk[["ID", "Pred"]])

# Process predictions for both men's and women's tournaments
process_batches(m_submission_data, model_m, scaler_m, "men's")
process_batches(w_submission_data, model_w, scaler_w, "women's")

# Combine and save the submission
submission_data = pd.concat(submission_chunks, axis=0)
submission_data.to_csv("submission.csv", index=False)

print("✅ Submission file 'submission.csv' is ready!")
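
Before uploading, it can be worth checking the file's shape. The sketch below runs a few checks on invented frames. The helper name `check_submission` is hypothetical, and the rules it enforces (IDs of the form `2025_TeamID1_TeamID2` with TeamID1 < TeamID2, no duplicate IDs, probabilities in [0, 1]) are the ones this notebook relies on; it is not an official validator.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, season: int = 2025) -> list:
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    parts = sub["ID"].str.split("_", expand=True)
    if parts.shape[1] != 3:
        return ["ID is not of the form Season_TeamID1_TeamID2"]
    if not (parts[0] == str(season)).all():
        problems.append("unexpected season in ID")
    if not (parts[1].astype(int) < parts[2].astype(int)).all():
        problems.append("TeamID1 is not lower than TeamID2")
    if sub["ID"].duplicated().any():
        problems.append("duplicate IDs")
    if not sub["Pred"].between(0, 1).all():
        problems.append("Pred outside [0, 1]")
    return problems

# Invented examples: one clean frame and one with three problems
good = pd.DataFrame({"ID": ["2025_1101_1102", "2025_1101_1103"], "Pred": [0.5, 0.7]})
bad = pd.DataFrame({"ID": ["2025_1103_1101", "2025_1103_1101"], "Pred": [0.5, 1.2]})

print(check_submission(good))
print(check_submission(bad))
```

The same function could be pointed at the saved file with `check_submission(pd.read_csv("submission.csv"))`.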