# NCAA March Machine Learning Mania 2025 – Competitive Submission Notebook

This notebook implements an ensemble pipeline that:
1. Computes Elo ratings from historical regular season detailed results.
2. Creates training data using an Elo difference (EloDiff) feature and observed game margin.
3. Trains separate LightGBM regression models for men's and women's games to predict margin.
4. Converts predicted margins to win probabilities using a logistic transform.
5. Repeats training and prediction 100 times, each with a different random seed (which changes the
   train/validation split and the LightGBM seed), to generate 100 candidate submission files.

*Note: This code is designed for competitive use and is more elaborate than a simple demonstration.*

In [1]:
# Import Libraries & Setup
import glob, os
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV

import warnings
warnings.filterwarnings("ignore")

sns.set(style="whitegrid", context="notebook", font_scale=1.1)

print("Libraries imported and default styles set.")

import time
start_time = time.time()

Libraries imported and default styles set.


In [2]:
input_folder = r"C:\Users\Hi\My Works\My Py Scripts\Git Repos\48_March-achine-learning-mania-2025\Input"
csv_files = glob.glob(os.path.join(input_folder, "*.csv"))

dataframes = {}
for file in csv_files:
    key = os.path.splitext(os.path.basename(file))[0]
    try:
        # Using 'latin-1' encoding to avoid Unicode issues
        dataframes[key] = pd.read_csv(file, low_memory=False, encoding="latin-1")
        print(f"Loaded {key} with shape {dataframes[key].shape}")
    except Exception as e:
        print(f"Error loading {file}: {e}")
print("\nAll CSV files loaded automatically.")

end_time = time.time()
print(f"\nTotal Execution Time: {end_time - start_time:.2f} seconds")

Loaded Cities with shape (502, 3)
Loaded Conferences with shape (51, 2)
Loaded MConferenceTourneyGames with shape (6491, 5)
Loaded MGameCities with shape (84509, 6)
Loaded MMasseyOrdinals with shape (5435396, 5)
Loaded MNCAATourneyCompactResults with shape (2518, 8)
Loaded MNCAATourneyDetailedResults with shape (1382, 34)
Loaded MNCAATourneySeedRoundSlots with shape (776, 5)
Loaded MNCAATourneySeeds with shape (2558, 3)
Loaded MNCAATourneySlots with shape (2519, 4)
Loaded MRegularSeasonCompactResults with shape (190771, 8)
Loaded MRegularSeasonDetailedResults with shape (116723, 34)
Loaded MSeasons with shape (41, 6)
Loaded MSecondaryTourneyCompactResults with shape (1809, 9)
Loaded MSecondaryTourneyTeams with shape (1836, 3)
Loaded MTeamCoaches with shape (13533, 5)
Loaded MTeamConferences with shape (13388, 3)
Loaded MTeams with shape (380, 4)
Loaded MTeamSpellings with shape (1177, 2)
Loaded SampleSubmissionStage1 with shape (507108, 2)
Loaded SeedBenchmarkStage1 with shape (507108,

## Compute Elo Ratings from Historical Regular Season Detailed Results
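We use the men's and women's regular season detailed results along with the team lists.
Every team starts at 1500, ratings are updated game by game in chronological order with K = 20,
and they carry over between seasons without any reset.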

In [3]:
def initialize_elo(team_ids, start_elo=1500):
    return {tid: start_elo for tid in team_ids}

def update_elo(elo_dict, teamA, teamB, scoreA, scoreB, k=20):
    ra = elo_dict[teamA]
    rb = elo_dict[teamB]
    ea = 1.0 / (1 + 10 ** ((rb - ra) / 400))
    sa = 1 if scoreA > scoreB else 0
    sb = 1 - sa
    elo_dict[teamA] = ra + k * (sa - ea)
    elo_dict[teamB] = rb + k * (sb - (1 - ea))

def compute_elo(df_games, teams_df):
    df_sorted = df_games.sort_values(by=['Season','DayNum'])
    team_ids = teams_df['TeamID'].unique()
    elo_dict = initialize_elo(team_ids)
    for idx, row in df_sorted.iterrows():
        update_elo(elo_dict, row['WTeamID'], row['LTeamID'], row['WScore'], row['LScore'])
    return elo_dict

# Get men's and women's team lists
df_MTeams = dataframes['MTeams']
df_WTeams = dataframes['WTeams']

# Use regular season detailed results for Elo calculation.
df_MReg = dataframes['MRegularSeasonDetailedResults'].copy()
df_WReg = dataframes['WRegularSeasonDetailedResults'].copy()

elo_m = compute_elo(df_MReg, df_MTeams)
elo_w = compute_elo(df_WReg, df_WTeams)
print("Computed Elo ratings for men's and women's data.")

end_time = time.time()
print(f"\nTotal Execution Time: {end_time - start_time:.2f} seconds")

Computed Elo ratings for men's and women's data.

Total Execution Time: 11.33 seconds
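
To make the update rule in `update_elo` concrete, here is a minimal standalone sketch of a single update, using the same formula and the same default K = 20. The ratings 1600 and 1500 and the helper name `elo_step` are illustrative assumptions, not values taken from the data.

```python
def expected_score(r_a, r_b):
    """Expected score of team A against team B on the base-10 Elo curve."""
    return 1.0 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_step(r_a, r_b, a_won, k=20):
    """Return the updated (r_a, r_b) after one game (hypothetical helper)."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    # The update is zero-sum: whatever A gains, B loses.
    return r_a + delta, r_b - delta

# Illustrative ratings: a 100-point favourite against an underdog.
fav, dog = 1600.0, 1500.0
print(round(expected_score(fav, dog), 3))
print(elo_step(fav, dog, a_won=True))    # favourite wins: small adjustment
print(elo_step(fav, dog, a_won=False))   # upset: larger adjustment
```

A 100-point favourite is expected to win about 64% of the time, so an upset moves both ratings further than an expected win does.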


## Prepare Training Data for Margin Modeling
For each game, we compute:
- **EloDiff**: Final Elo rating of the winning team minus that of the losing team.
- **Margin**: The observed score margin (WScore - LScore), which is always positive because each row is recorded from the winner's perspective.

We do this for both men's and women's historical data.

In [4]:
def prepare_training_data(df, elo_dict):
    elo_diffs = []
    margins = []
    for idx, row in df.iterrows():
        # Use final Elo ratings as a proxy (could be improved with dynamic Elo)
        diff = elo_dict.get(row['WTeamID'], 1500) - elo_dict.get(row['LTeamID'], 1500)
        elo_diffs.append(diff)
        margins.append(row['WScore'] - row['LScore'])
    return pd.DataFrame({'EloDiff': elo_diffs, 'Margin': margins})

train_m = prepare_training_data(df_MReg, elo_m)
train_w = prepare_training_data(df_WReg, elo_w)
print("Prepared training data for men's and women's margin models.")
print("Men’s train data shape:", train_m.shape)
print("Women’s train data shape:", train_w.shape)

end_time = time.time()
print(f"\nTotal Execution Time: {end_time - start_time:.2f} seconds")

Prepared training data for men's and women's margin models.
Men’s train data shape: (116723, 2)
Women’s train data shape: (79639, 2)

Total Execution Time: 25.83 seconds
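
Because every row is written from the winner's perspective, the `Margin` target is always positive, while at prediction time the Elo difference is taken as Team1 minus Team2 and can have either sign. One common way to handle this, sketched here as an assumption rather than as this notebook's method, is to append a mirrored copy of each row with both signs flipped. The helper name `symmetrize` and the toy values are hypothetical.

```python
import pandas as pd

def symmetrize(df_train):
    """Append a mirrored copy of each game, seen from the loser's side."""
    original = df_train[['EloDiff', 'Margin']]
    mirrored = -original  # flips the sign of both columns
    return pd.concat([original, mirrored], ignore_index=True)

# Toy rows standing in for train_m / train_w.
toy = pd.DataFrame({'EloDiff': [120.0, -35.0], 'Margin': [9, 4]})
sym = symmetrize(toy)
print(sym)
```

With mirrored rows, the target is centred on zero, so a model fitted on it can in principle return negative margins for negative Elo differences.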


## Train LightGBM Models for Margin Prediction
We use LightGBM regressors to predict the margin from the Elo difference.
We train separate models for men's and women's data.

In [5]:
# Train LightGBM (GPU) + Early Stopping

def train_lgb_model(df_train, seed=42):
    """
    Trains a LightGBM regressor on EloDiff -> Margin using GPU acceleration & early stopping.
    """
    # Prepare dataset
    X = df_train[['EloDiff']]
    y = df_train['Margin']
    
    # Train/val split
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    
    # GPU parameters
    params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': 'mse',
        'learning_rate': 0.01,
        'num_leaves': 31,
        'seed': seed,
        'verbose': -1,
        'device_type': 'gpu',         # 'device' is an accepted alias
        'tree_learner': 'data_parallel',  # alias of 'data'; other values: 'serial', 'feature', 'voting'
        'gpu_platform_id': 0,
        'gpu_device_id': 0
    }
    
    dtrain = lgb.Dataset(X_train, label=y_train)
    dval   = lgb.Dataset(X_val, label=y_val, reference=dtrain)
    
    model = lgb.train(
        params,
        dtrain,
        num_boost_round=2000,
        valid_sets=[dval],
        callbacks=[
            lgb.early_stopping(stopping_rounds=50),
            lgb.log_evaluation(period=0)
        ]
    )
    return model

print("Defined train_lgb_model function for GPU-based LightGBM.")

end_time = time.time()
print(f"\nTotal Execution Time: {end_time - start_time:.2f} seconds")

Defined train_lgb_model function for GPU-based LightGBM.

Total Execution Time: 25.84 seconds


## Define Function to Convert Margin to Win Probability
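We use a base-10 logistic transformation with a scaling factor. With the default `scale=10`, a predicted margin of 10 points maps to a win probability of about 0.91.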

In [6]:
def margin_to_probability(margin, scale=10.0):
    """
    Logistic transform to convert predicted margin -> win probability.
    P = 1 / [1 + 10^(-margin/scale)]
    """
    return 1.0 / (1 + 10 ** (-margin / scale))
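
A quick numeric check of this transform, as a standalone sketch (the function is repeated so the snippet runs on its own, and the sample margins are arbitrary): a zero margin gives 0.5, and margins of equal size and opposite sign give probabilities that sum to 1.

```python
def margin_to_probability(margin, scale=10.0):
    """Base-10 logistic: P = 1 / (1 + 10^(-margin / scale))."""
    return 1.0 / (1 + 10 ** (-margin / scale))

for m in (-10, -3, 0, 3, 10):
    p = margin_to_probability(m)
    q = margin_to_probability(-m)
    # p and q are the two sides of the same matchup, so they should sum to 1.
    print(f"margin={m:+d}  p={p:.3f}  p + p(-margin)={p + q:.3f}")
```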

## Prepare Test Data from the Sample Submission File
We parse the submission ID to extract Season, Team1, and Team2.

In [7]:
df_sub = dataframes['SampleSubmissionStage1'].copy()

def parse_id(match_id):
    season, t1, t2 = match_id.split('_')
    return int(season), int(t1), int(t2)

df_sub['Season'] = df_sub['ID'].apply(lambda x: parse_id(x)[0])
df_sub['Team1'] = df_sub['ID'].apply(lambda x: parse_id(x)[1])
df_sub['Team2'] = df_sub['ID'].apply(lambda x: parse_id(x)[2])
print("Test data prepared from the submission file.")
print(df_sub.head())

Test data prepared from the submission file.
               ID  Pred  Season  Team1  Team2
0  2021_1101_1102   0.5    2021   1101   1102
1  2021_1101_1103   0.5    2021   1101   1103
2  2021_1101_1104   0.5    2021   1101   1104
3  2021_1101_1105   0.5    2021   1101   1105
4  2021_1101_1106   0.5    2021   1101   1106
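
Calling `parse_id` three times per row works, but it repeats the split for every column. As a sketch of an alternative, assuming every ID has the `Season_Team1_Team2` form shown above, pandas can split the whole column in one pass. The toy frame `toy_sub` is a hypothetical stand-in for `df_sub`.

```python
import pandas as pd

# Toy stand-in for the sample submission file.
toy_sub = pd.DataFrame({
    'ID': ['2021_1101_1102', '2021_3101_3102'],
    'Pred': [0.5, 0.5],
})

# Split every ID once, convert the three pieces to integers, and name them.
parts = toy_sub['ID'].str.split('_', expand=True).astype(int)
parts.columns = ['Season', 'Team1', 'Team2']
toy_sub = toy_sub.join(parts)
print(toy_sub)
```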


## Define a Function to Predict the Outcome for a Given Match
For each matchup, we compute the Elo difference using the final Elo ratings,
predict the margin using the corresponding LightGBM model (men's or women's),
and convert that margin to a win probability.

In [8]:
def predict_match(season, team1, team2,
                  elo_m, elo_w,
                  model_m, model_w):
    """
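    Returns the predicted probability that team1 beats team2.
    Both TeamIDs < 2000  => men's model.
    Both TeamIDs >= 3000 => women's model.
    Any other combination falls back to prob=0.5.
    The season argument is accepted but not used.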
    """
    if team1 < 2000 and team2 < 2000:
        e1 = elo_m.get(team1, 1500)
        e2 = elo_m.get(team2, 1500)
        elo_diff = e1 - e2
        margin_pred = model_m.predict(pd.DataFrame({'EloDiff': [elo_diff]}))[0]
        prob = margin_to_probability(margin_pred)
    elif team1 >= 3000 and team2 >= 3000:
        e1 = elo_w.get(team1, 1500)
        e2 = elo_w.get(team2, 1500)
        elo_diff = e1 - e2
        margin_pred = model_w.predict(pd.DataFrame({'EloDiff': [elo_diff]}))[0]
        prob = margin_to_probability(margin_pred)
    else:
        # If "mixed" or out-of-range, fallback to 0.5
        prob = 0.5
    return prob
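
`predict_match` builds a one-row DataFrame and calls `predict` once per matchup, which gets slow over roughly half a million rows. A batch variant can be sketched as follows, under the assumption that the model exposes a `predict` method accepting a DataFrame with an `EloDiff` column. `DummyModel`, `predict_batch`, and the toy ratings are hypothetical stand-ins so the snippet runs without a trained model.

```python
import numpy as np
import pandas as pd

class DummyModel:
    """Stand-in for a trained model: margin = EloDiff / 25 (arbitrary)."""
    def predict(self, X):
        return X['EloDiff'].to_numpy() / 25.0

def predict_batch(df, elo, model, scale=10.0, default_elo=1500):
    """Predict P(Team1 beats Team2) for all rows of one league at once."""
    e1 = df['Team1'].map(elo).fillna(default_elo)   # unknown teams get 1500
    e2 = df['Team2'].map(elo).fillna(default_elo)
    margins = model.predict(pd.DataFrame({'EloDiff': e1 - e2}))
    return 1.0 / (1 + 10 ** (-np.asarray(margins) / scale))

toy_elo = {1101: 1600.0, 1102: 1500.0}
toy = pd.DataFrame({'Team1': [1101, 1102, 1101], 'Team2': [1102, 1101, 1999]})

men = (toy['Team1'] < 2000) & (toy['Team2'] < 2000)   # same rule as predict_match
probs = np.full(len(toy), 0.5)                         # fallback for other rows
probs[men.to_numpy()] = predict_batch(toy[men], toy_elo, DummyModel())
print(probs)
```

The women's rows would be handled the same way with a second mask (`>= 3000`), the women's ratings, and the women's model.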

## Generate Ensemble Predictions & Save 100 Submission Files
We loop over 100 iterations. In each iteration, we retrain both LightGBM models with a different random seed,
which changes the train/validation split and the model's internal seed (the hyperparameters themselves stay fixed).
We then predict every matchup in the test set, add a small random perturbation, clip the result to [0.001, 0.999],
and save the resulting predictions to a separate CSV file.
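
One detail of the loop below: `seed_val` controls the split and the LightGBM seed, but the perturbation drawn with `np.random.normal` uses NumPy's global state, which the notebook never seeds, so the saved files cannot be reproduced from the seed alone. Here is a minimal sketch of a seeded, vectorized version of the same perturb-and-clip step, assuming NumPy's `default_rng` generator. The helper `perturb` and the sample values are hypothetical.

```python
import numpy as np

def perturb(preds, seed, sigma=0.005, lo=0.001, hi=0.999):
    """Add Gaussian noise drawn from a seeded generator, then clip."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(preds, dtype=float) + rng.normal(0.0, sigma, size=len(preds))
    return np.clip(noisy, lo, hi)

base = [0.0005, 0.5, 0.9999]
a = perturb(base, seed=1001)
b = perturb(base, seed=1001)   # same seed: identical output
c = perturb(base, seed=1002)   # different seed: different noise
print(a, np.array_equal(a, b), np.array_equal(a, c))
```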

In [9]:
from datetime import datetime
import random

# We'll generate 100 submissions.
num_submissions = 100

# Create a folder to save submissions (if not exists)
submission_folder = "ensemble_submissions"
os.makedirs(submission_folder, exist_ok=True)

for i in range(1, num_submissions + 1):
    # Use a new seed for each iteration
    seed_val = 1000 + i
    print(f"\n=== Generating submission {i}/{num_submissions} with seed {seed_val} ===")
    
    # 1) Retrain men’s model
    model_m_i = train_lgb_model(train_m, seed=seed_val)
    # 2) Retrain women’s model
    model_w_i = train_lgb_model(train_w, seed=seed_val)
    
    # 3) Predict for each row in sample submission
    preds = []
    for idx, row in df_sub.iterrows():
        s  = row['Season']
        t1 = row['Team1']
        t2 = row['Team2']
        p  = predict_match(s, t1, t2, elo_m, elo_w, model_m_i, model_w_i)
        
        # Optional random perturbation for diversity
        p += np.random.normal(0, 0.005)   # Tiny noise
        p = np.clip(p, 0.001, 0.999)      # Keep in [0.001, 0.999]
        
        preds.append(p)
    
    # 4) Assign predictions & save
    df_sub['Pred'] = preds
    submission_filename = os.path.join(submission_folder, f"submission_gpu_{i}.csv")
    df_sub[['ID','Pred']].to_csv(submission_filename, index=False)
    print(f"Saved => {submission_filename}")
   
    end_time = time.time()
    print(f"\nTotal Execution Time: {end_time - start_time:.2f} seconds")

print("\nEnsemble submission generation complete. 100 submission files created.")


=== Generating submission 1/100 with seed 1001 ===
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[410]	valid_0's l2: 73.3057
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[518]	valid_0's l2: 101.719
Saved => ensemble_submissions\submission_gpu_1.csv

Total Execution Time: 369.80 seconds

=== Generating submission 2/100 with seed 1002 ===
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[425]	valid_0's l2: 73.4069
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[365]	valid_0's l2: 101.967
Saved => ensemble_submissions\submission_gpu_2.csv

Total Execution Time: 708.98 seconds

=== Generating submission 3/100 with seed 1003 ===
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[426]	valid_0's l2: 73.6591
Training until validation scores don't improve