# **Discussion & Key Takeaways**

### **Introduction**
This submission notebook combines insights from exploratory data analysis (EDA) with approaches from previous top solutions. Elo ratings are computed from the historical regular-season detailed results and feed a two-stage model: a KNN regressor predicts the game margin from the Elo difference, and a logistic function then converts that margin into a win probability. Men’s and women’s games get separate models, selected by team ID range, so that each model reflects its own scoring patterns and dynamics.

### **Findings from the EDA**
- **Data Completeness:** Our exploratory analysis confirmed that the dataset is comprehensive with minimal missing values.
- **Score Distributions:** Histograms and boxplots indicated that men’s winning scores are generally higher than women’s, and margins (score differences) are consistent with typical game dynamics.
- **Correlations:** Strong correlations were found between scoring statistics (e.g., field goals made, shooting attempts) and game outcomes, which suggests these features could inform margin prediction. The model in this notebook, however, uses only the Elo difference as its feature.
- **Season Trends:** Time-series plots demonstrated stable trends over the years with some variation that could be linked to rule changes or evolving game pace.
- **Geographic Spread:** USA map visualizations showed games distributed widely across states, emphasizing the broad geographic appeal of NCAA basketball.

### **Approach & Model Strategy**
Based on previous winning solutions and our EDA:
- **Elo Ratings:** We used a simple Elo system to compute team strengths over time.
- **Margin Modeling:** A KNN regressor was trained to predict the margin (difference between winning and losing scores) using the Elo difference as a feature.
- **Probability Conversion:** A logistic function converts the predicted margin to a win probability.
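- **Separate Models:** Men’s and women’s games are handled separately, reflecting differences in scoring patterns. The code selects the model by team ID range: IDs below 2000 are treated as men’s teams and IDs of 3000 or above as women’s teams.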
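
For reference, the Elo update and the margin-to-probability conversion listed above can be written as follows. The symbols are notation introduced here, not taken from the notebook: R_A and R_B are the two teams' ratings before the game, S_A is 1 if team A wins and 0 otherwise, and m-hat is the predicted margin. The constants (k = 20, the divisor 400, and the scale s = 10) are the defaults used in the notebook's code.

```latex
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad
R_A' = R_A + k\,(S_A - E_A), \qquad
R_B' = R_B + k\,\bigl((1 - S_A) - (1 - E_A)\bigr)

p = \frac{1}{1 + 10^{-\hat{m}/s}}
```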

In [1]:
# Import Libraries & Setup
import numpy as np
import pandas as pd
import glob, os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings("ignore")
sns.set(style="whitegrid", context="notebook", font_scale=1.1)
print("Libraries imported and default styles set.")

Libraries imported and default styles set.


In [2]:
input_folder = r"/kaggle/input/march-machine-learning-mania-2025"
csv_files = glob.glob(os.path.join(input_folder, "*.csv"))

dataframes = {}
for file in csv_files:
    key = os.path.splitext(os.path.basename(file))[0]
    try:
        # Using 'latin-1' encoding to avoid Unicode decode errors.
        dataframes[key] = pd.read_csv(file, low_memory=False, encoding="latin-1")
        print(f"Loaded {key} with shape {dataframes[key].shape}")
    except Exception as e:
        print(f"Error loading {file}: {e}")

print("\nAll CSV files loaded automatically.")

Loaded Conferences with shape (51, 2)
Loaded SeedBenchmarkStage1 with shape (507108, 2)
Loaded WNCAATourneyDetailedResults with shape (894, 34)
Loaded WRegularSeasonCompactResults with shape (134961, 8)
Loaded MNCAATourneySeedRoundSlots with shape (776, 5)
Loaded MRegularSeasonDetailedResults with shape (116723, 34)
Loaded MNCAATourneyCompactResults with shape (2518, 8)
Loaded MGameCities with shape (84509, 6)
Loaded WSecondaryTourneyCompactResults with shape (828, 9)
Loaded WGameCities with shape (81342, 6)
Loaded MSeasons with shape (41, 6)
Loaded WNCAATourneySlots with shape (1713, 4)
Loaded MSecondaryTourneyTeams with shape (1836, 3)
Loaded Cities with shape (502, 3)
Loaded MTeamSpellings with shape (1177, 2)
Loaded MRegularSeasonCompactResults with shape (190771, 8)
Loaded MMasseyOrdinals with shape (5435396, 5)
Loaded MSecondaryTourneyCompactResults with shape (1809, 9)
Loaded WTeams with shape (378, 2)
Loaded WConferenceTourneyGames with shape (6113, 5)
Loaded MNCAATourneySlots 

In [3]:
df_sub = dataframes['SampleSubmissionStage1'].copy()

def parse_id(match_id):
    """Split an ID such as '2021_1101_1102' into (season, team1, team2) integers."""
    season, t1, t2 = match_id.split('_')
    return int(season), int(t1), int(t2)

# Parse each ID once and unpack the three fields into columns.
parsed = [parse_id(match_id) for match_id in df_sub['ID']]
df_sub['Season'], df_sub['Team1'], df_sub['Team2'] = zip(*parsed)
print("Sample submission file prepared:")
print(df_sub.head())

Sample submission file prepared:
               ID  Pred  Season  Team1  Team2
0  2021_1101_1102   0.5    2021   1101   1102
1  2021_1101_1103   0.5    2021   1101   1103
2  2021_1101_1104   0.5    2021   1101   1104
3  2021_1101_1105   0.5    2021   1101   1105
4  2021_1101_1106   0.5    2021   1101   1106


In [4]:
def initialize_elo(team_ids, start_elo=1500):
    # Every team starts at the same baseline rating.
    return {tid: start_elo for tid in team_ids}

def update_elo(elo_dict, teamA, teamB, scoreA, scoreB, k=20):
    ra = elo_dict[teamA]
    rb = elo_dict[teamB]
    # Expected score for teamA given the current rating gap.
    ea = 1.0 / (1 + 10 ** ((rb - ra) / 400))
    # Actual result: 1 if teamA wins, 0 otherwise.
    sa = 1 if scoreA > scoreB else 0
    sb = 1 - sa
    elo_dict[teamA] = ra + k * (sa - ea)
    elo_dict[teamB] = rb + k * (sb - (1 - ea))

def compute_elo(df_games, teams_df):
    # Process games chronologically. Ratings carry over between seasons,
    # and the returned dict holds each team's rating after its last game.
    df_sorted = df_games.sort_values(by=['Season', 'DayNum'])
    team_ids = teams_df['TeamID'].unique()
    elo_dict = initialize_elo(team_ids)
    for game in df_sorted.itertuples(index=False):
        update_elo(elo_dict, game.WTeamID, game.LTeamID, game.WScore, game.LScore)
    return elo_dict
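
As a quick sanity check of the update rule, the following minimal sketch restates it as a pure function and applies it to two hypothetical teams. The name `elo_step` and the example ratings are assumptions for illustration and do not appear in the notebook; the formula and `k=20` mirror `update_elo` above.

```python
def elo_step(ra, rb, a_won, k=20):
    """Return updated (ra, rb) after one game, mirroring update_elo's formula."""
    ea = 1.0 / (1 + 10 ** ((rb - ra) / 400))  # expected score for team A
    sa = 1 if a_won else 0                     # actual score for team A
    return ra + k * (sa - ea), rb + k * ((1 - sa) - (1 - ea))

# Two evenly rated teams: the winner gains exactly what the loser drops.
new_a, new_b = elo_step(1500, 1500, a_won=True)
print(new_a, new_b)  # 1510.0 1490.0

# An upset moves the winner's rating further than an expected result does.
upset_gain = elo_step(1400, 1600, a_won=True)[0] - 1400
expected_gain = elo_step(1600, 1400, a_won=True)[0] - 1600
print(round(upset_gain, 2), round(expected_gain, 2))
```

Since each update adds to one team exactly what it takes from the other, the sum of the two ratings is unchanged by a game.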

In [5]:
df_MReg = dataframes['MRegularSeasonDetailedResults'].copy()
df_WReg = dataframes['WRegularSeasonDetailedResults'].copy()
df_MTeams = dataframes['MTeams']
df_WTeams = dataframes['WTeams']

elo_m = compute_elo(df_MReg, df_MTeams)
elo_w = compute_elo(df_WReg, df_WTeams)
print("Elo ratings computed for men's and women's data.")

Elo ratings computed for men's and women's data.
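
Note that `compute_elo` returns only the ratings as they stand after the last game in the data, so any feature built from them reflects results that came after a given game. One possible variation, which this notebook does not implement, records each team's rating just before every game. The sketch below assumes games arrive in chronological order as `(winner_id, loser_id)` pairs; the function name `pregame_elo_diffs` and the toy IDs are hypothetical.

```python
def pregame_elo_diffs(games, start_elo=1500, k=20):
    """Return the winner-minus-loser Elo difference as it stood before each game.

    `games` is an iterable of (winner_id, loser_id) pairs in chronological order.
    Also returns the final rating dict for convenience.
    """
    elo = {}
    diffs = []
    for winner, loser in games:
        rw = elo.get(winner, start_elo)
        rl = elo.get(loser, start_elo)
        diffs.append(rw - rl)                      # snapshot before updating
        ew = 1.0 / (1 + 10 ** ((rl - rw) / 400))   # winner's expected score
        elo[winner] = rw + k * (1 - ew)
        elo[loser] = rl - k * (1 - ew)
    return diffs, elo

# Three toy games; in the last one the lower-rated team wins.
games = [(1101, 1102), (1101, 1103), (1102, 1101)]
diffs, final_elo = pregame_elo_diffs(games)
print(diffs)
```

In this toy run the third difference is negative, because team 1102 was rated below team 1101 when it won.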


In [6]:
def prepare_training_data(df, elo_dict):
    # Rows are stored winner-first, so Margin is always positive here.
    # Teams missing from elo_dict fall back to the 1500 baseline.
    elo_win = df['WTeamID'].map(elo_dict).fillna(1500)
    elo_lose = df['LTeamID'].map(elo_dict).fillna(1500)
    return pd.DataFrame({
        'EloDiff': (elo_win - elo_lose).values,
        'Margin': (df['WScore'] - df['LScore']).values,
    })

train_m = prepare_training_data(df_MReg, elo_m)
train_w = prepare_training_data(df_WReg, elo_w)
print("Training data prepared for men's and women's margin models.")

Training data prepared for men's and women's margin models.
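
Because every training row is oriented winner-first, `Margin` is always positive, so a regressor fitted on these rows alone cannot predict a negative margin, whatever the Elo difference. One common remedy, shown here only as a sketch and not as what the notebook does, is to add a mirrored copy of each game from the loser's point of view. The helper name `mirror_rows` and the toy values are assumptions.

```python
import numpy as np

def mirror_rows(elo_diff, margin):
    """Append the loser's-perspective copy of each game (both signs flipped)."""
    elo_diff = np.asarray(elo_diff, dtype=float)
    margin = np.asarray(margin, dtype=float)
    return (np.concatenate([elo_diff, -elo_diff]),
            np.concatenate([margin, -margin]))

# Toy winner-first rows: the second game is an upset (negative Elo difference).
x, y = mirror_rows([120.0, -35.0, 60.0], [14, 3, 7])
print(x)
print(y)
```

With mirrored rows, a matchup and its reverse carry opposite target margins, so the fitted model sees negative margins as well as positive ones.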


In [7]:
def train_margin_model(df_train):
    X = df_train[['EloDiff']].values
    y = df_train['Margin'].values
    knn = KNeighborsRegressor()
    param_grid = {'n_neighbors': [5, 10, 20, 40]}
    gscv = GridSearchCV(knn, param_grid, cv=3, scoring='neg_mean_squared_error')
    gscv.fit(X, y)
    print("Best n_neighbors:", gscv.best_params_)
    return gscv.best_estimator_

knn_m = train_margin_model(train_m)
knn_w = train_margin_model(train_w)
print("KNN margin models trained.")

Best n_neighbors: {'n_neighbors': 40}
Best n_neighbors: {'n_neighbors': 40}
KNN margin models trained.


In [8]:
def margin_to_probability(margin, scale=10.0):
    # Base-10 logistic: margin 0 maps to 0.5, and a margin equal to
    # `scale` maps to 10/11 (about 0.909).
    return 1.0 / (1 + 10 ** (-margin / scale))

def predict_match(season, team1, team2, elo_m, elo_w, knn_m, knn_w):
    # Pick the men's or women's ratings and model from the team ID ranges.
    # `season` is accepted but not used in the prediction.
    if team1 < 2000 and team2 < 2000:
        elo, knn = elo_m, knn_m
    elif team1 >= 3000 and team2 >= 3000:
        elo, knn = elo_w, knn_w
    else:
        # In case of an unexpected matchup, default to 0.5
        return 0.5
    elo_diff = elo.get(team1, 1500) - elo.get(team2, 1500)
    margin_pred = knn.predict([[elo_diff]])[0]
    return margin_to_probability(margin_pred)

predictions = []
for idx, row in df_sub.iterrows():
    season = row['Season']
    team1 = row['Team1']
    team2 = row['Team2']
    prob = predict_match(season, team1, team2, elo_m, elo_w, knn_m, knn_w)
    predictions.append(prob)

df_sub['Pred'] = predictions
print("Predictions generated for all matchups.")
print(df_sub.head())

Predictions generated for all matchups.
               ID      Pred  Season  Team1  Team2
0  2021_1101_1102  0.938025    2021   1101   1102
1  2021_1101_1103  0.896477    2021   1101   1103
2  2021_1101_1104  0.803819    2021   1101   1104
3  2021_1101_1105  0.977223    2021   1101   1105
4  2021_1101_1106  0.957005    2021   1101   1106
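
To make the conversion step concrete, the sketch below evaluates the same base-10 logistic at a few margins. With `scale=10.0`, a predicted margin of 0 maps to 0.5 and a margin of +10 to about 0.91. The sample margins are arbitrary values chosen for illustration.

```python
def margin_to_probability(margin, scale=10.0):
    # Same formula as in the notebook: base-10 logistic of margin / scale.
    return 1.0 / (1 + 10 ** (-margin / scale))

for m in (-10, -3, 0, 3, 10):
    print(f"margin {m:+3d} -> p = {margin_to_probability(m):.3f}")

# Flipping the sign of the margin gives the complementary probability.
assert abs(margin_to_probability(7) + margin_to_probability(-7) - 1.0) < 1e-12
```

A probability below 0.5 therefore requires a negative predicted margin.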


In [9]:
df_sub[['ID','Pred']].to_csv("submission.csv", index=False)
print("Submission file 'submission.csv' created successfully.")

Submission file 'submission.csv' created successfully.
