# March Madness 2024 Code
### Team: Taylor Last, Ted Woodsides, Jake Hopkins

### Notes
- **Please add docstrings or commments to code so we all know what's going on**
- I can't find how to all work on the same notebook so we might have to copy code and all work on separate pieces
- **Check create_submission and simulate functions to see how to get a bracket in the form the competition asks for**
### Things to keep in mind
- **Days are standardized already**: Dayzero tells you the date corresponding to DayNum=0 during that season. All game dates have been aligned upon a common scale so that (each year) the Monday championship game of the men's tournament is on DayNum=154. Working backward, the men's national semifinals are always on DayNum=152, the "play-in" games are on days 134-135, men's Selection Sunday is on day 132, the final day of the regular season is also day 132, and so on. 
- **Special note about "Season" numbers**: the college basketball season lasts from early November until the national championship tournament that starts in the middle of March. For instance, this year the first regular season games were played in November 2023 and the national championship games will be played in April 2024. Because a basketball season spans two calendar years like this, it can be confusing to refer to the year of the season. By convention, when we identify a particular season, we will reference the year that the season ends in, not the year that it starts in. So for instance, the current season will be identified in our data as the 2024 season, not the 2023 season or the 2023-24 season or the 2023-2024 season, though you may see any of these in everyday use outside of our data.

### Questions to consider
- Do we want to train the model by making prediction for all games (regular season and post), then only make predictions for tournament games, or do we want to only train on tournament games?

### Featues I'd like to add
- Quantify guard play
- Free throw percentage
- Pace
- Time of possession

In [19]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
from typing import List
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import xgboost

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Refer to this if you need to look up documentation
print(f"Pandas Version: {pd.__version__}")

Pandas Version: 2.2.0


In [12]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
data = dict()
for dirname, _, filenames in os.walk('../kaggle/input'):
    for filename in filenames:
        table_name = filename.split('.')[0]
        table_path = os.path.join(dirname, filename)
        try:
            data[table_name] = pd.read_csv(table_path)
        except UnicodeDecodeError:
            data[table_name] = pd.read_csv(table_path, encoding='cp1252')
        except Exception as e:
            print(f"Error with {filename}: {e}")


In [13]:
# Check Table Names
data.keys()

dict_keys(['MNCAATourneyDetailedResults', 'WNCAATourneySlots', 'MNCAATourneyCompactResults', 'MSeasons', 'WTeams', 'MRegularSeasonDetailedResults', 'WNCAATourneyDetailedResults', 'MNCAATourneySlots', 'MGameCities', 'MConferenceTourneyGames', 'WNCAATourneyCompactResults', 'WSeasons', 'Cities', 'WRegularSeasonCompactResults', 'WTeamSpellings', '2024_tourney_seeds', 'WRegularSeasonDetailedResults', 'MRegularSeasonCompactResults', 'WNCAATourneySeeds', 'MNCAATourneySeedRoundSlots', 'WTeamConferences', 'MTeamConferences', 'MTeamCoaches', 'MMasseyOrdinals', 'Conferences', 'MTeams', 'WGameCities', 'MNCAATourneySeeds', 'MSecondaryTourneyTeams', 'MTeamSpellings', 'sample_submission', 'MSecondaryTourneyCompactResults'])

In [14]:
# Split dict of dataframes by gender and other (supplemental) data
mens_data = dict()
womens_data = dict()
supplemental_data = dict()

for k, v in data.items():
    if k.startswith("M"):
        mens_data[k] = v
    elif k.startswith("W"):
        womens_data[k] = v
    else:
        supplemental_data[k] = v
        

In [15]:
# Check men's keys
mens_data.keys()

dict_keys(['MNCAATourneyDetailedResults', 'MNCAATourneyCompactResults', 'MSeasons', 'MRegularSeasonDetailedResults', 'MNCAATourneySlots', 'MGameCities', 'MConferenceTourneyGames', 'MRegularSeasonCompactResults', 'MNCAATourneySeedRoundSlots', 'MTeamConferences', 'MTeamCoaches', 'MMasseyOrdinals', 'MTeams', 'MNCAATourneySeeds', 'MSecondaryTourneyTeams', 'MTeamSpellings', 'MSecondaryTourneyCompactResults'])

In [16]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(
        mens_data["MNCAATourneySlots"][
            (mens_data["MNCAATourneySlots"]["Season"] == 2023)
            & (mens_data["MNCAATourneySlots"]["Slot"].str.startswith("R1"))
        ]
    )
    
# with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
#     display(mens_data["MNCAATourneySeeds"][
#         mens_data["MNCAATourneySeeds"]["Season"] == 2023
#     ])

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed
2385,2023,R1W1,W01,W16
2386,2023,R1W2,W02,W15
2387,2023,R1W3,W03,W14
2388,2023,R1W4,W04,W13
2389,2023,R1W5,W05,W12
2390,2023,R1W6,W06,W11
2391,2023,R1W7,W07,W10
2392,2023,R1W8,W08,W09
2393,2023,R1X1,X01,X16
2394,2023,R1X2,X02,X15


# Baseline Model

Use a simple rating system (SRS) combined with KenPom to fit a model predicting the probability of a given matchup in the tournament. Then randomly sample from the distribution outputed from the model to create multiple submissions.

### Features:
- SRS system from the regular season
- KenPom ranking system
- Coach historical success in post season and regular season

### Model: Simple feedforward Neural Net

In [20]:
def get_season_stats(dataset, detailed=False, post_season=False, year=2024):
    # Gets the first letter in dataset
    gender = list(dataset.keys())[0][0]
    
    if detailed:
        if post_season:
            df = dataset[f"{gender}NCAATourneyDetailedResults"]
        else:
            df = dataset[f"{gender}RegularSeasonDetailedResults"]
        
    else:
        if post_season:
            df = dataset[f"{gender}NCAATourneyCompactResults"]
        else:
            df = dataset[f"{gender}RegularSeasonCompactResults"]
        
    df = df[df["Season"] == year]
    return df, gender

def compute_margins_of_victory(df):
    df["margin"] = df["WScore"] - df["LScore"]
    
    win_df = df[["WTeamID", "margin"]].rename(columns={"WTeamID": "TeamID"})
    lose_df = df[["LTeamID", "margin"]].rename(columns={"LTeamID": "TeamID"})
    lose_df["margin"] = -lose_df["margin"]

    res = pd.concat([win_df, lose_df], axis=0)
    return res.groupby("TeamID")["margin"].mean()

def join_team_names(df, data, gender="M"):
    """
    df: pd.DataFrame
        dataframe appending teams to
    data: dict[str, pd.DataFrame]
        dictionary of all table names and data
    """
    res = pd.merge(df, data[f"{gender}Teams"][["TeamID", "TeamName"]], on="TeamID")
    return res
    

In [21]:
def create_srs(df,gender):

    df["margin"] = df["WScore"] - df["LScore"]
    win_df = df[["WTeamID", "margin", "LTeamID"]].rename(
        columns={"WTeamID": "team_id", "LTeamID": "opp_id"}
    )
    lose_df = df[["WTeamID", "margin", "LTeamID"]].rename(
        columns={"LTeamID": "team_id", "WTeamID": "opp_id"}
    )
    lose_df["margin"] = -lose_df["margin"]

    teams = pd.concat([win_df, lose_df], axis=0)
    spreads = compute_margins_of_victory(df)
    
    terms = []
    solutions = []

    for team_id in spreads.keys():
        row = []
        opps = list(teams[teams["team_id"] == team_id]["opp_id"])

        for opp_id in spreads.keys():
            if opp_id == team_id:
                # coef for the team itself should be 1
                row.append(1)
            elif opp_id in opps:
                # coef for opponents is 1 over num of opps
                row.append(-1.0/len(opps))
            else:
                # teams not faced get a 0 coef
                row.append(0)
        terms.append(row)

        solutions.append(spreads[team_id])

    solutions, _, _, _ = np.linalg.lstsq(np.array(terms), np.array(solutions), rcond=None)
    
    ratings = list(zip( spreads.keys(), solutions ))
    srs = pd.DataFrame(ratings, columns=['team', 'rating'])
    rankings = srs.sort_values('rating', ascending=False).reset_index()[['team', 'rating']]
    rankings = join_team_names(rankings.rename(columns={"team": "TeamID"}), data, gender=gender)
    return rankings

In [22]:
df, gender = get_season_stats(mens_data, detailed=False, post_season=False, year=2024)
create_srs(df, gender)[:25]

Unnamed: 0,TeamID,rating,TeamName
0,1222,23.772046,Houston
1,1112,22.36191,Arizona
2,1345,21.992962,Purdue
3,1397,20.972819,Tennessee
4,1104,20.383226,Alabama
5,1120,20.113872,Auburn
6,1163,19.217603,Connecticut
7,1235,19.140082,Iowa St
8,1140,17.846165,BYU
9,1124,17.736798,Baylor


In [23]:
def get_coach_win_perc(
    dataset: dict,
    regular_season: bool,
    year:int = 2024
) -> pd.DataFrame:
    """
    
    parameters
    ----------
    dataset: dict
        dictionary of datasets to use. it will be
        mens_data or womens_data.
        
    year: int
        year to filter data. it will get coaches stats for everything
        up until this year. (model can't have any look ahead bias). for post
        season games, use a year one less than the year of interest.
        
    returns
    -------
    coaches_stats: pd.DataFrame
        dataframe with count of wins, win percentage, and std dev
        of wins.
    """
    
    # Gets the first letter in dataset
    gender = list(dataset.keys())[0][0]
    
    if regular_season:
        df = dataset[f"{gender}RegularSeasonCompactResults"]
    else:
        df = dataset[f"{gender}NCAATourneyCompactResults"]
        
    #Filter season up until season of interest
    df = df[df["Season"] <= year]
    
    winning_coaches_df = pd.merge(
        df,
        dataset[f"{gender}TeamCoaches"],
        how="left",
        left_on=["Season", "WTeamID"],
        right_on=["Season", "TeamID"]
    )

    winning_coaches_df = winning_coaches_df[
        (winning_coaches_df['DayNum'] >= winning_coaches_df['FirstDayNum']) 
        & (winning_coaches_df['DayNum'] <= winning_coaches_df['LastDayNum'])
    ]
    winning_coaches_df["win"] = 1

    #Make sure the join dind't create dupes
    assert len(winning_coaches_df) == len(df)

    losing_coaches_df = pd.merge(
        df,
        dataset[f"{gender}TeamCoaches"],
        how="left",
        left_on=["Season", "LTeamID"],
        right_on=["Season", "TeamID"]
    )

    losing_coaches_df = losing_coaches_df[
        (losing_coaches_df['DayNum'] >= losing_coaches_df['FirstDayNum']) 
        & (losing_coaches_df['DayNum'] <= losing_coaches_df['LastDayNum'])
    ]
    losing_coaches_df["win"] = 0

    #Make sure the join dind't create dupes
    assert len(losing_coaches_df) == len(df)

    coaches_df = pd.concat(
        [
            losing_coaches_df[["CoachName", "win"]],
            winning_coaches_df[["CoachName", "win"]]
        ],
        axis=0
    )

    coach_stats = (
        coaches_df
        .groupby("CoachName")["win"]
        .describe()
        .sort_values("count", ascending=False)
        [["count", "mean", "std"]]
        .fillna(0)
    )

    return coach_stats


In [24]:
def get_system_ratings(
    mens_dataset, #There are only ratings for men
    systems: List[str],
    year: int=2024,
):
    """
    gets system ratings for each team for specified systems for a specific year.
    
    parameters
    ---------
    mens_dataset: dict
        dictionary of datasets for men
    systems: List[str]
        list of dictionaries we are interested in seeing
    year: int
        year to look for ratings
    moving_average: str
        specifies how to calculate rolling ratings for given systems.
        if None, the system takes the most recent system rating
    
    returns
    -------
    df: pd.DataFrame
        data that reflects ratings for a team
    """
    
    # Filter by season - only take most recent
    df = mens_dataset["MMasseyOrdinals"]
    df = df[df["Season"] == year]
    
    # Filter by system
    df = df[df["SystemName"].isin(systems)]
    
    latest_rank = (
        df
        .sort_values("RankingDayNum")
        .groupby(["TeamID","SystemName"])
        ["OrdinalRank"]
        .last()
        .unstack("SystemName")
        .reset_index().
        rename(columns = {i: i+"_latest" for i in systems})
    )
    
    transformed_df = (
        df
        .sort_values(by="RankingDayNum")
        .groupby(["TeamID", "SystemName"], group_keys=False)
        ["OrdinalRank"]
        .rolling(2) # TODO: Parameterize this (window and moving average method)
        .mean()
        .unstack("SystemName")
        .reset_index()
        .drop("level_1", axis=1)
        .groupby("TeamID")
        [systems]
        .last()
        .reset_index()
        .rename(columns = {i: i+"_rolling" for i in systems})
    )
    
    res = pd.merge(latest_rank, transformed_df, on="TeamID")

    return res

In [25]:
def get_post_season(data, year):
    
    df, gender = get_season_stats(
            data, 
            detailed=False, 
            post_season=True, 
            year=year
    )
    
    # Shuffle teams for positional encoding (model shouldn't have winning teams features as the same)
    df["TeamID"] = np.where(
        np.random.uniform(0,1, size=len(df)) > .5, 
        df["WTeamID"], 
        df["LTeamID"]
    )
    df["team_score"] = np.where(
        df["TeamID"] == df["WTeamID"], 
        df["WScore"], 
        df["LScore"]
    )
    df["OppID"] = np.where(
        df["TeamID"] == df["WTeamID"], 
        df["LTeamID"], 
        df["WTeamID"]
    )
    df["opp_score"] = np.where(
        df["TeamID"] == df["WTeamID"], 
        df["LScore"], 
        df["WScore"]
    )
    df = df.drop(
        ["WTeamID", "LTeamID", "WScore", "LScore", "WLoc", "NumOT"],
        axis=1
    )
    
    return df

In [26]:
def get_features(mens_data, year, systems):
    # Season Stats
    df, gender = get_season_stats(
        mens_data, 
        detailed=False, 
        post_season=False, 
        year=year
    )

    # Rating System
    srs = create_srs(df, gender)

    # System Ratings
    system_ratings = get_system_ratings(
        mens_data, 
        systems=systems
    ) #KenPom, Nolan ELO, EPSN BPI

    # Ratings df
    ratings_df = pd.merge(
                srs,
                system_ratings,
                on="TeamID"
    )

    # Coaches postseason win stats
    coaches_postseason_win_df = get_coach_win_perc(
        dataset=mens_data, 
        regular_season=False, 
        year=year
    ).rename(columns={"count": "count_post", "mean": "mean_post", "std": "std_post"})

    # Coaches regular season win stats
    coaches_regseason_win_df = get_coach_win_perc(
        dataset=mens_data, 
        regular_season=True, 
        year=year
    ).rename(columns={"count": "count_reg", "mean": "mean_reg", "std": "std_reg"})

    coaches_df = pd.merge(
        coaches_regseason_win_df,
        coaches_postseason_win_df,
        on="CoachName",
        how="left"
    ).fillna(0)

    # Get coaches for the year and only grab the most recent coach for a certain team
    curr_coaches = (
        mens_data["MTeamCoaches"][
            mens_data["MTeamCoaches"]["Season"] == year
        ]
        .sort_values("FirstDayNum")
        .groupby("TeamID")["CoachName"]
        .last()
        .reset_index()
    )

    # Get coach stats for current coaches
    coaches_df = pd.merge(
        curr_coaches,
        coaches_df,
        on="CoachName",
        how="left"
    )


    feature_df = (
        pd.merge(
            ratings_df,
            coaches_df
        )
        .drop(["TeamName", "CoachName"], axis=1)
    )
    
    return feature_df


def merge_features_to_games(feature_df, post_season_df, year, training=True):
    
    post_season_merged = pd.merge(
        pd.merge(
            feature_df,
            post_season_df,
            on="TeamID",
        ),
        feature_df,
        left_on="OppID",
        right_on="TeamID",
        suffixes=("_team", "_opp")
    )
    if training:
        post_season_merged["win"] = post_season_merged["team_score"] > post_season_merged["opp_score"]
        post_season_merged = (
            post_season_merged
            .drop(
                ["TeamID_team", "team_score", "OppID", "TeamID_opp", "opp_score", "DayNum"], 
                axis=1
            )
        )
    return post_season_merged

In [27]:
# This function will change a lot based on what we are trying to predict
# Simplest training method is to grab team ids from previous years and pull in reg season stats to make a prediction
# what we should try to get to is running simulations and making predictions based on matchups then have some sort of loss metric for how good or bad a bracket is.
# Also adding stats like if they're on a run or not would be cool (tough to do at inference time)
def create_mens_training_data():
    
    training_data = dict()
    
    for year in tqdm(range(2003, 2023)):
        
        feature_df = get_features(mens_data, year=year, systems=["POM", "NOL", "EBP"])
        post_season_df = get_post_season(mens_data, year)
        post_season_merged = merge_features_to_games(feature_df, post_season_df, year)
        
        training_data[year] = post_season_merged
    
    return training_data

In [28]:
# Need to join this on who is in the tournament every year and create predictions based on matchups 2003-2023
train_dict = create_mens_training_data()

  0%|          | 0/20 [00:00<?, ?it/s]

100%|██████████| 20/20 [00:46<00:00,  2.32s/it]


In [19]:
# TODO: combine training data with post season bracket matchups
# TODO: Create a simulation model that draws from a distribution on who's going to win a matchup then continue to simulate the bracket as if each team won the game.

In [30]:
def save_data(_dir="baseline"):
    if not os.path.isdir(_dir):
        os.mkdir(_dir)
    for year in tqdm(train_dict):
        train_dict[year].to_csv(f"{_dir}/{year}.csv", index=False)
        
def load_data(_dir="baseline"):
    if not os.path.isdir(_dir):
        raise NotADirectoryError(f"{_dir} is not a directory")
    else:
        train_data = dict()
        for dirname, _, filenames in tqdm(os.walk(_dir)):
            for filename in filenames:
                table_name = filename.split('.')[0]
                table_path = os.path.join(dirname, filename)
                try:
                    train_data[table_name] = pd.read_csv(table_path, index_col=False)
                except UnicodeDecodeError:
                    train_data[table_name] = pd.read_csv(table_path, encoding='cp1252')
                except Exception as e:
                    print(f"Error with {filename}: {e}")
    return train_data

In [31]:
save_data("../data/baseline")

100%|██████████| 20/20 [00:00<00:00, 239.92it/s]


In [34]:
train_data=load_data("../data/baseline")
# train_data = train_dict

1it [00:00, 12.28it/s]


In [36]:
all_data = pd.concat(train_data.values(), ignore_index=True)
train = all_data[all_data["Season"] != 2022]
test = all_data[all_data["Season"] == 2022]

X_train = train.drop("win", axis=1)
y_train = train["win"]
X_test = test.drop("win", axis=1)
y_test = test["win"]

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# For torch....
# X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
# y_train_tensor = torch.tensor(y_train.values.astype(np.float64), dtype=torch.float32)
# X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
# y_test_tensor = torch.tensor(y_test.values.astype(np.float64), dtype=torch.float32)

# train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
# test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

In [None]:
train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)

In [None]:
class BaselineModel(torch.nn.Module):
    
    def __init__(self, input_size, hidden_size, dropout_rate=0.5):
        super(BaselineModel, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        ## Init layers
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.batch_norm = nn.BatchNorm1d(hidden_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_rate)
        self.layer2 = nn.Linear(hidden_size, 1)
        
    def forward(self, x):
        
        out = self.layer1(x)
        out = self.batch_norm(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.layer2(out)
        out = torch.sigmoid(out)
        
        return out

In [None]:
INPUT_SIZE = X_train.shape[1]
HIDDEN_LAYER_SIZE = 256

model = BaselineModel(INPUT_SIZE, HIDDEN_LAYER_SIZE)

# Simple Binary Cross-Entropy Loss and SGD
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 1000
for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels.unsqueeze(1))
        loss.backward()
        optimizer.step()
    
    if (epoch+1) % 50 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

Epoch [50/1000], Loss: 0.3760
Epoch [100/1000], Loss: 0.3545
Epoch [150/1000], Loss: 0.2448
Epoch [200/1000], Loss: 0.4135
Epoch [250/1000], Loss: 0.3524
Epoch [300/1000], Loss: 0.2297
Epoch [350/1000], Loss: 0.3635
Epoch [400/1000], Loss: 0.4038
Epoch [450/1000], Loss: 0.2130
Epoch [500/1000], Loss: 0.2838
Epoch [550/1000], Loss: 0.2250
Epoch [600/1000], Loss: 0.3267
Epoch [650/1000], Loss: 0.4145
Epoch [700/1000], Loss: 0.2599
Epoch [750/1000], Loss: 0.2365
Epoch [800/1000], Loss: 0.3393
Epoch [850/1000], Loss: 0.3293
Epoch [900/1000], Loss: 0.1937
Epoch [950/1000], Loss: 0.2288
Epoch [1000/1000], Loss: 0.3330


In [None]:
torch.set_printoptions(sci_mode=False)

model.eval()  # Set the model to evaluation mode
correct = 0
total = 0

train_inputs = X_train_tensor
train_actual_outputs = y_train_tensor

inputs = X_test_tensor
actual_outputs = y_test_tensor

with torch.no_grad():  # Inference mode, gradients not needed
    train_outputs = model(train_inputs)
    train_predicted = (model(X_train_tensor) >= .5).int()
    train_total = train_outputs.size(0)
    train_correct = (train_predicted.squeeze().int() == train_actual_outputs.int()).sum().item()
    
    outputs = model(inputs)
    predicted = (model(X_test_tensor) >= .5).int()
    total = outputs.size(0)
    correct = (predicted.squeeze().int() == actual_outputs.int()).sum().item()

train_accuracy = 100 * train_correct / train_total
print(f'Accuracy on the train set: {train_accuracy:.2f}%')

accuracy = 100 * correct / total
print(f'Accuracy on the test set: {accuracy:.2f}%')

Accuracy on the train set: 96.44%
Accuracy on the test set: 70.15%


### XGBoost

In [39]:
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)

In [55]:

predicted_output = pd.DataFrame(
    {
        # "Predicted": model.predict_proba(X_test)[:, 1], 
        "Predicted": model.predict(X_test),
        "Actual": y_test.astype(int)
    }
)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(predicted_output)

np.mean(np.where(predicted_output["Predicted"] == predicted_output["Actual"], 1, 0))

     Predicted  Actual
193          1       1
194          1       0
195          1       1
196          0       1
197          0       0
198          1       1
199          1       0
200          1       1
201          0       1
202          1       1
203          1       1
204          1       0
205          1       1
206          1       1
207          1       1
208          1       1
209          1       1
210          1       1
211          1       1
212          0       1
213          0       0
214          1       1
215          1       1
216          0       0
217          1       0
218          0       0
219          0       0
220          1       1
221          1       1
222          0       0
223          1       1
224          1       0
225          0       0
226          0       0
227          0       1
228          1       1
229          1       0
230          1       1
231          0       0
232          0       0
233          0       0
234          1       1
235        

0.746268656716418

In [None]:
predicted_output = pd.DataFrame(
    {
        "Predicted": torch.round(outputs, decimals = 2).flatten(), 
         "Actual": y_test_tensor.flatten()
    }
)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(predicted_output)

    Predicted  Actual
0        0.76     1.0
1        1.00     1.0
2        0.10     0.0
3        0.97     1.0
4        0.98     1.0
5        0.60     0.0
6        0.43     1.0
7        0.96     0.0
8        1.00     1.0
9        0.04     1.0
10       0.89     1.0
11       0.51     0.0
12       0.99     1.0
13       0.97     1.0
14       0.98     1.0
15       0.35     1.0
16       0.01     1.0
17       0.98     1.0
18       1.00     1.0
19       0.00     0.0
20       0.00     0.0
21       0.11     1.0
22       0.53     0.0
23       1.00     1.0
24       0.09     1.0
25       0.00     1.0
26       0.74     0.0
27       0.16     0.0
28       0.00     0.0
29       0.00     0.0
30       1.00     1.0
31       0.66     1.0
32       0.71     1.0
33       0.20     1.0
34       0.36     1.0
35       1.00     1.0
36       0.19     0.0
37       0.00     0.0
38       0.00     0.0
39       0.89     0.0
40       1.00     1.0
41       0.64     0.0
42       1.00     1.0
43       0.03     1.0
44       0

In [None]:
seeding_df = mens_data["MNCAATourneySeeds"][
        mens_data["MNCAATourneySeeds"]["Season"] == 2022
    ]

seeding_df

Unnamed: 0,Season,Seed,TeamID
2354,2022,W01,1124
2355,2022,W02,1246
2356,2022,W03,1345
2357,2022,W04,1417
2358,2022,W05,1388
...,...,...,...
2417,2022,Z13,1151
2418,2022,Z14,1255
2419,2022,Z15,1174
2420,2022,Z16a,1136


In [16]:
def create_testing_data(df, seed_lookup, scaler, year):
    """
    matchup_df: pd.DataFrame
    scaler: StandardScaler
    year: int
    """
    
    matchup_df = df.copy()
    
    matchup_df["TeamID"] = matchup_df["StrongSeed"].apply(
        lambda x: seed_lookup[x]
    )
    matchup_df["OppID"] = matchup_df["WeakSeed"].apply(
        lambda x: seed_lookup[x]
    )
    
    feature_df = get_features(mens_data, year=year, systems=["POM", "NOL", "EBP"])
    post_season_merged = merge_features_to_games(feature_df, matchup_df, year, training=False)
    
    post_season_merged = post_season_merged.drop(
        ['TeamID_team', 'OppID', 'TeamID_opp', 'Slot', 'StrongSeed','WeakSeed'], 
        axis=1
    )
    X_test = scaler.transform(post_season_merged)
    X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
    return X_test_tensor

In [17]:
def probabilistic_choice(row):
    return np.random.choice([row['StrongSeed'], row['WeakSeed']], p=[row['prob'], 1-row['prob']])

def get_model_predictions(model, matchup_df, seed_lookup, scaler, year):
    """
    matchup_df: pd.DataFrame
    scaler: StandardScaler
    year: int
    """
    inputs = create_testing_data(matchup_df, seed_lookup, scaler, year)
    model.eval()  # Set the model to evaluation mode

    with torch.no_grad():  # Inference mode, gradients not needed
        
        outputs = model(inputs)
    pred_df = matchup_df.copy()
    pred_df["prob"] = outputs
    pred_df["winner"] = pred_df.apply(probabilistic_choice, axis=1)

    return pred_df
    
    

# Run Simulation Based on Model
Recursively loop through the seeding df to pick teams at each round.

TODO: Swap out function to use baseline model predictions

In [38]:
# Change out function for model prediction
def simulate(_df, seed_lookup, round_num=0, results=None, year=2023):
    
    df = _df.copy()
    
    if results is None:
        results = {}
        
    if round_num > 6:
        return results
    
    if round_num == 0: # Play-IN games
        temp_df = df[~df["Slot"].str.startswith("R")]
    
    else:
        temp_df = df[df["Slot"].str.startswith(f"R{round_num}")]
        temp_df["StrongSeed"] = temp_df["StrongSeed"].apply(
            lambda x: results[x] if x in results else x
        )
        temp_df["WeakSeed"] = temp_df["WeakSeed"].apply(
            lambda x: results[x] if x in results else x
        )
        
    ###################### CHANGE ####################
    ##################################################
    temp_df["winner"] = np.where(
        np.random.uniform(0,1,size=len(temp_df)) <= .70,
        temp_df["StrongSeed"], 
        temp_df["WeakSeed"]
    )
    ###################################################
    # temp_df = get_model_predictions(model, temp_df, seed_lookup, scaler=scaler, year=year)

    for k, v in zip(temp_df["Slot"], temp_df["winner"]):
        results[k] = v
    
    results = simulate(df, seed_lookup, round_num + 1, results, year=year)
    return results

In [39]:
def create_submission(gender, bracket, year):
    df = mens_data["MNCAATourneySlots"][
        (mens_data["MNCAATourneySlots"]["Season"] == year)
    ]
    seeding_df = mens_data["MNCAATourneySeeds"][
        mens_data["MNCAATourneySeeds"]["Season"] == year
    ]

    seed_lookup = {
        k: v for k, v in zip(
            seeding_df["Seed"], 
            seeding_df["TeamID"]
        )
    }
    round_winner = simulate(df, seed_lookup, year=year)
    
    df["Team"] = df["Slot"].apply(lambda x: round_winner[x])
    df["Tournament"] = gender
    df["Bracket"] = bracket
    
    return df[["Tournament", "Bracket", "Slot", "Team"]]

In [40]:
#TODO: Create feature df before recursion to speed this up
#TODO: Step through this to make sure it's working
sub = create_submission(gender="M", bracket=1, year=2023)

In [41]:
sub

Unnamed: 0,Tournament,Bracket,Slot,Team
2385,M,1,R1W1,W01
2386,M,1,R1W2,W02
2387,M,1,R1W3,W03
2388,M,1,R1W4,W04
2389,M,1,R1W5,W05
...,...,...,...,...
2447,M,1,R6CH,Y02
2448,M,1,W16,W16a
2449,M,1,X16,X16a
2450,M,1,Y11,Y11b


In [None]:
seeding_df = mens_data["MNCAATourneySeeds"][
    mens_data["MNCAATourneySeeds"]["Season"] == 2023
]
seed_lookup = {
    k: v for k, v in zip(
        seeding_df["Seed"], 
        seeding_df["TeamID"]
    )
}

In [None]:
sub["winner"] = sub["Team"].apply(lambda x: seed_lookup[x])
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(pd.merge(sub, mens_data["MTeams"].drop(["FirstD1Season","LastD1Season"],axis=1), left_on="winner", right_on="TeamID"))

Unnamed: 0,Tournament,Bracket,Slot,Team,winner,TeamID,TeamName
0,M,1,R1W1,W01,1345,1345,Purdue
1,M,1,R1W2,W02,1266,1266,Marquette
2,M,1,R1W3,W03,1243,1243,Kansas St
3,M,1,R1W4,W04,1397,1397,Tennessee
4,M,1,R1W5,W05,1181,1181,Duke
5,M,1,R1W6,W06,1246,1246,Kentucky
6,M,1,R1W7,W07,1277,1277,Michigan St
7,M,1,R1W8,W09,1194,1194,FL Atlantic
8,M,1,R1X1,X01,1104,1104,Alabama
9,M,1,R1X2,X02,1112,1112,Arizona
