# March Madness 2024 Code
### Team: Taylor Last, Ted Woodsides, Jake Hopkins

### Notes
- **Please add docstrings or commments to code so we all know what's going on**
- I can't find how to all work on the same notebook so we might have to copy code and all work on separate pieces
- **Check create_submission and simulate functions to see how to get a bracket in the form the competition asks for**
### Things to keep in mind
- **Days are standardized already**: Dayzero tells you the date corresponding to DayNum=0 during that season. All game dates have been aligned upon a common scale so that (each year) the Monday championship game of the men's tournament is on DayNum=154. Working backward, the men's national semifinals are always on DayNum=152, the "play-in" games are on days 134-135, men's Selection Sunday is on day 132, the final day of the regular season is also day 132, and so on. 
- **Special note about "Season" numbers**: the college basketball season lasts from early November until the national championship tournament that starts in the middle of March. For instance, this year the first regular season games were played in November 2023 and the national championship games will be played in April 2024. Because a basketball season spans two calendar years like this, it can be confusing to refer to the year of the season. By convention, when we identify a particular season, we will reference the year that the season ends in, not the year that it starts in. So for instance, the current season will be identified in our data as the 2024 season, not the 2023 season or the 2023-24 season or the 2023-2024 season, though you may see any of these in everyday use outside of our data.

### Questions to consider
- Do we want to train the model by making prediction for all games (regular season and post), then only make predictions for tournament games, or do we want to only train on tournament games?

### Featues I'd like to add
- Quantify guard play
- Free throw percentage
- Pace
- Time of possession

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
from typing import List
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import xgboost

import networkx
from networkx.algorithms.traversal.depth_first_search import dfs_edges

import warnings
warnings.filterwarnings('ignore')

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [2]:
# Refer to this if you need to look up documentation
print(f"Pandas Version: {pd.__version__}")

Pandas Version: 2.2.0


In [3]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
data = dict()
for dirname, _, filenames in os.walk('../kaggle/input'):
    for filename in filenames:
        table_name = filename.split('.')[0]
        table_path = os.path.join(dirname, filename)
        try:
            data[table_name] = pd.read_csv(table_path)
        except UnicodeDecodeError:
            data[table_name] = pd.read_csv(table_path, encoding='cp1252')
        except Exception as e:
            print(f"Error with {filename}: {e}")


In [4]:
# Check Table Names
data.keys()

dict_keys(['MNCAATourneyDetailedResults', 'WNCAATourneySlots', 'MNCAATourneyCompactResults', 'MSeasons', 'WTeams', 'MRegularSeasonDetailedResults', 'WNCAATourneyDetailedResults', 'MNCAATourneySlots', 'MGameCities', 'MConferenceTourneyGames', 'WNCAATourneyCompactResults', 'WSeasons', 'Cities', 'WRegularSeasonCompactResults', 'WTeamSpellings', '2024_tourney_seeds', 'WRegularSeasonDetailedResults', 'MRegularSeasonCompactResults', 'WNCAATourneySeeds', 'MNCAATourneySeedRoundSlots', 'WTeamConferences', 'MTeamConferences', 'MTeamCoaches', 'MMasseyOrdinals', 'Conferences', 'MTeams', 'WGameCities', 'MNCAATourneySeeds', 'MSecondaryTourneyTeams', 'MTeamSpellings', 'sample_submission', 'MSecondaryTourneyCompactResults'])

In [12]:
data["MRegularSeasonDetailedResults"]

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
0,2003,10,1104,68,1328,62,N,0,27,58,...,10,16,22,10,22,8,18,9,2,20
1,2003,10,1272,70,1393,63,N,0,26,62,...,24,9,20,20,25,7,12,8,6,16
2,2003,11,1266,73,1437,61,N,0,24,58,...,26,14,23,31,22,9,12,2,5,23
3,2003,11,1296,56,1457,50,N,0,18,38,...,22,8,15,17,20,9,19,4,3,23
4,2003,11,1400,77,1208,71,N,0,30,61,...,16,17,27,21,15,12,10,7,1,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112499,2024,114,1454,75,1237,70,A,0,25,57,...,13,16,23,7,19,15,13,7,2,21
112500,2024,114,1455,74,1412,66,A,0,27,54,...,16,15,22,9,21,15,12,8,3,12
112501,2024,114,1459,91,1359,69,H,0,32,59,...,24,20,28,10,18,10,14,6,0,19
112502,2024,114,1462,91,1177,58,H,0,35,67,...,19,11,14,5,25,11,18,6,4,12


In [5]:
# Split dict of dataframes by gender and other (supplemental) data
mens_data = dict()
womens_data = dict()
supplemental_data = dict()

for k, v in data.items():
    if k.startswith("M"):
        mens_data[k] = v
    elif k.startswith("W"):
        womens_data[k] = v
    else:
        supplemental_data[k] = v
        

In [6]:
# Check men's keys
mens_data.keys()

dict_keys(['MNCAATourneyDetailedResults', 'MNCAATourneyCompactResults', 'MSeasons', 'MRegularSeasonDetailedResults', 'MNCAATourneySlots', 'MGameCities', 'MConferenceTourneyGames', 'MRegularSeasonCompactResults', 'MNCAATourneySeedRoundSlots', 'MTeamConferences', 'MTeamCoaches', 'MMasseyOrdinals', 'MTeams', 'MNCAATourneySeeds', 'MSecondaryTourneyTeams', 'MTeamSpellings', 'MSecondaryTourneyCompactResults'])

In [8]:
kenpom_df = pd.merge(
    pd.read_csv("../data/kenpom/kenpom_all.csv"),
    data["MTeams"][["TeamID", "TeamName"]],
    on ="TeamName",
)
kenpom_df["Season"] = kenpom_df["Season"].astype(int)

In [10]:
# FTRate is ability to get to FT line
kenpom_cols = [
    "TeamID",
    "TeamName",
    "Season",
    "FG2Pct", 
    "FG3Pct", 
    "FTPct", 
    "OppFG2Pct",
    "OppFG3Pct", 
    "StlRate", 
    "OppStlRate", 
    "FTRate_offense",
    "FTRate_defense", 
    "TOPct_offense",
    "ORPct_offense",
    "HgtEff",
    "Bench",
    "AdjTempo",
    "AdjOE",
    "AdjDE",
    # "Size",
    # "Hgt1",
    # "Hgt2",
    # "Hgt3",
    # "Hgt4",
    # "Hgt5",
    # "Exp",
]

kenpom_df = kenpom_df[
    (kenpom_df["Season"] >= 2007)
    & (kenpom_df["Season"] != 2020)
][kenpom_cols]

In [11]:
kenpom_df.to_csv("../data/kenpom/kenpom_filtered.csv", index=False)

In [9]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(
        mens_data["MNCAATourneySlots"][
            (mens_data["MNCAATourneySlots"]["Season"] == 2023)
            & (mens_data["MNCAATourneySlots"]["Slot"].str.startswith("R1"))
        ]
    )
    
# with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
#     display(mens_data["MNCAATourneySeeds"][
#         mens_data["MNCAATourneySeeds"]["Season"] == 2023
#     ])

Unnamed: 0,Season,Slot,StrongSeed,WeakSeed
2385,2023,R1W1,W01,W16
2386,2023,R1W2,W02,W15
2387,2023,R1W3,W03,W14
2388,2023,R1W4,W04,W13
2389,2023,R1W5,W05,W12
2390,2023,R1W6,W06,W11
2391,2023,R1W7,W07,W10
2392,2023,R1W8,W08,W09
2393,2023,R1X1,X01,X16
2394,2023,R1X2,X02,X15


# Baseline Model

Use a simple rating system (SRS) combined with KenPom to fit a model predicting the probability of a given matchup in the tournament. Then randomly sample from the distribution outputed from the model to create multiple submissions.

### Features:
- SRS system from the regular season
- KenPom ranking system
- Coach historical success in post season and regular season

### Model: Simple feedforward Neural Net

In [10]:
def get_season_stats(dataset, detailed=False, post_season=False, year=2024):
    # Gets the first letter in dataset
    gender = list(dataset.keys())[0][0]
    
    if detailed:
        if post_season:
            df = dataset[f"{gender}NCAATourneyDetailedResults"]
        else:
            df = dataset[f"{gender}RegularSeasonDetailedResults"]
        
    else:
        if post_season:
            df = dataset[f"{gender}NCAATourneyCompactResults"]
        else:
            df = dataset[f"{gender}RegularSeasonCompactResults"]
        
    df = df[df["Season"] == year]
    return df, gender

def compute_margins_of_victory(df):
    df["margin"] = df["WScore"] - df["LScore"]
    
    win_df = df[["WTeamID", "margin"]].rename(columns={"WTeamID": "TeamID"})
    lose_df = df[["LTeamID", "margin"]].rename(columns={"LTeamID": "TeamID"})
    lose_df["margin"] = -lose_df["margin"]

    res = pd.concat([win_df, lose_df], axis=0)
    return res.groupby("TeamID")["margin"].mean()

def join_team_names(df, data, gender="M"):
    """
    df: pd.DataFrame
        dataframe appending teams to
    data: dict[str, pd.DataFrame]
        dictionary of all table names and data
    """
    res = pd.merge(df, data[f"{gender}Teams"][["TeamID", "TeamName"]], on="TeamID")
    return res
    

In [11]:
def create_srs(df,gender):

    df["margin"] = df["WScore"] - df["LScore"]
    win_df = df[["WTeamID", "margin", "LTeamID"]].rename(
        columns={"WTeamID": "team_id", "LTeamID": "opp_id"}
    )
    lose_df = df[["WTeamID", "margin", "LTeamID"]].rename(
        columns={"LTeamID": "team_id", "WTeamID": "opp_id"}
    )
    lose_df["margin"] = -lose_df["margin"]

    teams = pd.concat([win_df, lose_df], axis=0)
    spreads = compute_margins_of_victory(df)
    
    terms = []
    solutions = []

    for team_id in spreads.keys():
        row = []
        opps = list(teams[teams["team_id"] == team_id]["opp_id"])

        for opp_id in spreads.keys():
            if opp_id == team_id:
                # coef for the team itself should be 1
                row.append(1)
            elif opp_id in opps:
                # coef for opponents is 1 over num of opps
                row.append(-1.0/len(opps))
            else:
                # teams not faced get a 0 coef
                row.append(0)
        terms.append(row)

        solutions.append(spreads[team_id])

    solutions, _, _, _ = np.linalg.lstsq(np.array(terms), np.array(solutions), rcond=None)
    
    ratings = list(zip( spreads.keys(), solutions ))
    srs = pd.DataFrame(ratings, columns=['team', 'rating'])
    rankings = srs.sort_values('rating', ascending=False).reset_index()[['team', 'rating']]
    rankings = join_team_names(rankings.rename(columns={"team": "TeamID"}), data, gender=gender)
    return rankings

In [12]:
def get_coach_win_perc(
    dataset: dict,
    regular_season: bool,
    year:int = 2024
) -> pd.DataFrame:
    """
    
    parameters
    ----------
    dataset: dict
        dictionary of datasets to use. it will be
        mens_data or womens_data.
        
    year: int
        year to filter data. it will get coaches stats for everything
        up until this year. (model can't have any look ahead bias). for post
        season games, use a year one less than the year of interest.
        
    returns
    -------
    coaches_stats: pd.DataFrame
        dataframe with count of wins, win percentage, and std dev
        of wins.
    """
    
    # Gets the first letter in dataset
    gender = list(dataset.keys())[0][0]
    
    if regular_season:
        df = dataset[f"{gender}RegularSeasonCompactResults"]
        #Filter season up until season of interest
        df = df[df["Season"] <= year]
    else:
        df = dataset[f"{gender}NCAATourneyCompactResults"]
        #Filter season up until season of interest
        df = df[df["Season"] < year]
        
    
    
    winning_coaches_df = pd.merge(
        df,
        dataset[f"{gender}TeamCoaches"],
        how="left",
        left_on=["Season", "WTeamID"],
        right_on=["Season", "TeamID"]
    )

    winning_coaches_df = winning_coaches_df[
        (winning_coaches_df['DayNum'] >= winning_coaches_df['FirstDayNum']) 
        & (winning_coaches_df['DayNum'] <= winning_coaches_df['LastDayNum'])
    ]
    winning_coaches_df["win"] = 1

    #Make sure the join dind't create dupes
    assert len(winning_coaches_df) == len(df)

    losing_coaches_df = pd.merge(
        df,
        dataset[f"{gender}TeamCoaches"],
        how="left",
        left_on=["Season", "LTeamID"],
        right_on=["Season", "TeamID"]
    )

    losing_coaches_df = losing_coaches_df[
        (losing_coaches_df['DayNum'] >= losing_coaches_df['FirstDayNum']) 
        & (losing_coaches_df['DayNum'] <= losing_coaches_df['LastDayNum'])
    ]
    losing_coaches_df["win"] = 0

    #Make sure the join dind't create dupes
    assert len(losing_coaches_df) == len(df)

    coaches_df = pd.concat(
        [
            losing_coaches_df[["CoachName", "win"]],
            winning_coaches_df[["CoachName", "win"]]
        ],
        axis=0
    )

    coach_stats = (
        coaches_df
        .groupby("CoachName")["win"]
        .describe()
        .sort_values("count", ascending=False)
        [["count", "mean", "std"]]
        .fillna(0)
    )

    return coach_stats


In [13]:
def get_system_ratings(
    mens_dataset, #There are only ratings for men
    systems: List[str],
    year: int=2024,
):
    """
    gets system ratings for each team for specified systems for a specific year.
    
    parameters
    ---------
    mens_dataset: dict
        dictionary of datasets for men
    systems: List[str]
        list of dictionaries we are interested in seeing
    year: int
        year to look for ratings
    moving_average: str
        specifies how to calculate rolling ratings for given systems.
        if None, the system takes the most recent system rating
    
    returns
    -------
    df: pd.DataFrame
        data that reflects ratings for a team
    """
    
    # Filter by season - only take most recent
    df = mens_dataset["MMasseyOrdinals"]
    df = df[df["Season"] == year]
    
    # Filter by system
    df = df[df["SystemName"].isin(systems)]
    
    latest_rank = (
        df
        .sort_values("RankingDayNum")
        .groupby(["TeamID","SystemName"])
        ["OrdinalRank"]
        .last()
        .unstack("SystemName")
        .reset_index().
        rename(columns = {i: i+"_latest" for i in systems})
    )
    
    transformed_df = (
        df
        .sort_values(by="RankingDayNum")
        .groupby(["TeamID", "SystemName"], group_keys=False)
        ["OrdinalRank"]
        .rolling(5) # TODO: Parameterize this (window and moving average method)
        .mean()
        .unstack("SystemName")
        .reset_index()
        .drop("level_1", axis=1)
        .groupby("TeamID")
        [systems]
        .last()
        .reset_index()
        .rename(columns = {i: i+"_rolling" for i in systems})
    )
    
    res = pd.merge(latest_rank, transformed_df, on="TeamID")

    return res

In [14]:
def get_post_season(data, year):
    
    df, gender = get_season_stats(
            data, 
            detailed=False, 
            post_season=True, 
            year=year
    )
    
    # Shuffle teams for positional encoding (model shouldn't have winning teams features as the same)
    df["TeamID"] = np.where(
        np.random.uniform(0,1, size=len(df)) > .5, 
        df["WTeamID"], 
        df["LTeamID"]
    )
    df["team_score"] = np.where(
        df["TeamID"] == df["WTeamID"], 
        df["WScore"], 
        df["LScore"]
    )
    df["OppID"] = np.where(
        df["TeamID"] == df["WTeamID"], 
        df["LTeamID"], 
        df["WTeamID"]
    )
    df["opp_score"] = np.where(
        df["TeamID"] == df["WTeamID"], 
        df["LScore"], 
        df["WScore"]
    )
    df = df.drop(
        ["WTeamID", "LTeamID", "WScore", "LScore", "WLoc", "NumOT"],
        axis=1
    )
    
    return df

In [41]:
def get_features(mens_data, year, systems):
    # Season Stats
    df, gender = get_season_stats(
        mens_data, 
        detailed=False, 
        post_season=False, 
        year=year
    )

    # Rating System
    srs = create_srs(df, gender)

    # System Ratings
    system_ratings = get_system_ratings(
        mens_data, 
        systems=systems
    ) #KenPom, Nolan ELO, EPSN BPI

    # Ratings df
    ratings_df = pd.merge(
                srs,
                system_ratings,
                on="TeamID"
    )

    # Coaches postseason win stats
    coaches_postseason_win_df = get_coach_win_perc(
        dataset=mens_data, 
        regular_season=False, 
        year=year
    ).rename(columns={"count": "count_post", "mean": "mean_post", "std": "std_post"})

    # Coaches regular season win stats
    coaches_regseason_win_df = get_coach_win_perc(
        dataset=mens_data, 
        regular_season=True, 
        year=year
    ).rename(columns={"count": "count_reg", "mean": "mean_reg", "std": "std_reg"})

    coaches_df = pd.merge(
        coaches_regseason_win_df,
        coaches_postseason_win_df,
        on="CoachName",
        how="left"
    ).fillna(0)

    # Get coaches for the year and only grab the most recent coach for a certain team
    curr_coaches = (
        mens_data["MTeamCoaches"][
            mens_data["MTeamCoaches"]["Season"] == year
        ]
        .sort_values("FirstDayNum")
        .groupby("TeamID")["CoachName"]
        .last()
        .reset_index()
    )

    # Get coach stats for current coaches
    coaches_df = pd.merge(
        curr_coaches,
        coaches_df,
        on="CoachName",
        how="left"
    )


    feature_df = (
        pd.merge(
            ratings_df,
            coaches_df
        )
        .drop(["TeamName", "CoachName"], axis=1)
    )

    # feature_df = pd.merge(
    #     feature_df,
    #     kenpom_df[kenpom_df["Season"] == str(year)],
    #     on="TeamID"
    # )
    
    return feature_df


def merge_features_to_games(feature_df, post_season_df, year, training=True):
    
    post_season_merged = pd.merge(
        pd.merge(
            feature_df,
            post_season_df,
            on="TeamID",
        ),
        feature_df,
        left_on="OppID",
        right_on="TeamID",
        suffixes=("_team", "_opp")
    )
    if training:
        post_season_merged["win"] = post_season_merged["team_score"] > post_season_merged["opp_score"]
        post_season_merged = (
            post_season_merged
            .drop(
                ["TeamID_team", "team_score", "OppID", "TeamID_opp", "opp_score", "DayNum"], 
                axis=1
            )
        )
    return post_season_merged

In [42]:
# This function will change a lot based on what we are trying to predict
# Simplest training method is to grab team ids from previous years and pull in reg season stats to make a prediction
# what we should try to get to is running simulations and making predictions based on matchups then have some sort of loss metric for how good or bad a bracket is.
# Also adding stats like if they're on a run or not would be cool (tough to do at inference time)
def create_mens_training_data():
    
    training_data = dict()
    
    for year in tqdm(range(2003, 2023)):
        
        feature_df = get_features(mens_data, year=year, systems=["POM", "NOL", "EBP"])
        post_season_df = get_post_season(mens_data, year)
        post_season_merged = merge_features_to_games(feature_df, post_season_df, year)
        
        training_data[year] = post_season_merged
    
    return training_data

In [43]:
# Need to join this on who is in the tournament every year and create predictions based on matchups 2003-2023
train_dict = create_mens_training_data()

100%|██████████| 20/20 [00:42<00:00,  2.12s/it]


In [48]:
train_dict[2021]

Unnamed: 0,rating_team,EBP_latest_team,NOL_latest_team,POM_latest_team,POM_rolling_team,NOL_rolling_team,EBP_rolling_team,count_reg_team,mean_reg_team,std_reg_team,...,POM_rolling_opp,NOL_rolling_opp,EBP_rolling_opp,count_reg_opp,mean_reg_opp,std_reg_opp,count_post_opp,mean_post_opp,std_post_opp,win
0,23.335584,19,14,21,20.6,22.6,19.8,689.0,0.849057,0.358254,...,30.4,46.6,31.2,980.0,0.620408,0.485533,40.0,0.525000,0.505736,True
1,23.335584,19,14,21,20.6,22.6,19.8,689.0,0.849057,0.358254,...,98.8,144.6,95.6,319.0,0.583072,0.493825,7.0,0.571429,0.534522,True
2,23.335584,19,14,21,20.6,22.6,19.8,689.0,0.849057,0.358254,...,94.4,77.2,109.2,566.0,0.683746,0.465425,17.0,0.352941,0.492592,True
3,23.335584,19,14,21,20.6,22.6,19.8,689.0,0.849057,0.358254,...,14.6,22.4,15.0,564.0,0.620567,0.485677,19.0,0.578947,0.507257,False
4,18.684479,14,25,14,14.6,22.4,15.0,564.0,0.620567,0.485677,...,117.0,83.4,121.2,188.0,0.670213,0.471391,5.0,0.400000,0.547723,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60,0.490274,183,195,154,158.0,177.6,194.4,443.0,0.562077,0.496692,...,5.6,6.4,6.0,1045.0,0.660287,0.473838,48.0,0.500000,0.505291,True
61,0.490274,183,195,154,158.0,177.6,194.4,443.0,0.562077,0.496692,...,114.0,115.2,118.4,125.0,0.552000,0.499290,0.0,0.000000,0.000000,True
62,0.490274,183,195,154,158.0,177.6,194.4,443.0,0.562077,0.496692,...,108.2,92.4,114.0,496.0,0.518145,0.500175,5.0,0.800000,0.447214,True
63,0.490274,183,195,154,158.0,177.6,194.4,443.0,0.562077,0.496692,...,1.0,3.0,1.0,825.0,0.671515,0.469947,30.0,0.500000,0.508548,False


In [19]:
# TODO: combine training data with post season bracket matchups
# TODO: Create a simulation model that draws from a distribution on who's going to win a matchup then continue to simulate the bracket as if each team won the game.

In [24]:
def save_data(train_dict, _dir="baseline"):
    if not os.path.isdir(_dir):
        os.mkdir(_dir)
    for year in tqdm(train_dict):
        train_dict[year].to_csv(f"{_dir}/{year}.csv", index=False)
        
def load_data(_dir="baseline"):
    if not os.path.isdir(_dir):
        raise NotADirectoryError(f"{_dir} is not a directory")
    else:
        train_data = dict()
        for dirname, _, filenames in tqdm(os.walk(_dir)):
            for filename in filenames:
                table_name = filename.split('.')[0]
                table_path = os.path.join(dirname, filename)
                try:
                    train_data[table_name] = pd.read_csv(table_path, index_col=False)
                except UnicodeDecodeError:
                    train_data[table_name] = pd.read_csv(table_path, encoding='cp1252')
                except Exception as e:
                    print(f"Error with {filename}: {e}")
    return train_data

In [25]:
# save_data(train_dict, "../data/baseline")

In [26]:
# train_data=load_data("../data/baseline")
train_data = train_dict

In [28]:
all_data = pd.concat(train_data.values(), ignore_index=True)


In [29]:
all_data

Unnamed: 0,rating_team,EBP_latest_team,NOL_latest_team,POM_latest_team,POM_rolling_team,NOL_rolling_team,EBP_rolling_team,count_reg_team,mean_reg_team,std_reg_team,...,RankTOPct_opp,ORPct_opp,RankORPct_opp,FTRate_opp,RankFTRate_opp,NSTRate_opp,RankNSTRate_opp,OppNSTRate_opp,RankOppNSTRate_opp,win
0,19.514120,15,6,15,15.0,5.8,15.8,447.0,0.805369,0.396360,...,,,,,,,,,,True
1,19.514120,15,6,15,15.0,5.8,15.8,447.0,0.805369,0.396360,...,,,,,,,,,,True
2,19.514120,15,6,15,15.0,5.8,15.8,447.0,0.805369,0.396360,...,,,,,,,,,,True
3,19.514120,15,6,15,15.0,5.8,15.8,447.0,0.805369,0.396360,...,,,,,,,,,,False
4,18.203812,4,8,4,4.2,8.8,3.2,557.0,0.795332,0.403821,...,,,,,,,,,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
435,6.187057,121,104,107,110.2,142.2,125.8,721.0,0.690707,0.462523,...,,,,,,,,,,False
436,5.196624,191,165,207,210.2,178.0,197.4,461.0,0.577007,0.494571,...,,,,,,,,,,True
437,5.196624,191,165,207,210.2,178.0,197.4,461.0,0.577007,0.494571,...,,,,,,,,,,False
438,4.247193,205,135,159,154.0,145.2,203.8,202.0,0.633663,0.483000,...,,,,,,,,,,False


In [95]:

train = all_data[all_data["Season"] != 2022]
test = all_data[all_data["Season"] == 2022]

X_train = train.drop("win", axis=1)
y_train = train["win"]
X_test = test.drop("win", axis=1)
y_test = test["win"]

training_cols = X_train.columns
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# For torch....
# X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
# y_train_tensor = torch.tensor(y_train.values.astype(np.float64), dtype=torch.float32)
# X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
# y_test_tensor = torch.tensor(y_test.values.astype(np.float64), dtype=torch.float32)

# train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
# test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

In [83]:
# train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)

In [84]:
class BaselineModel(torch.nn.Module):
    
    def __init__(self, input_size, hidden_size, dropout_rate=0.5):
        super(BaselineModel, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        ## Init layers
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.batch_norm = nn.BatchNorm1d(hidden_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_rate)
        self.layer2 = nn.Linear(hidden_size, 1)
        
    def forward(self, x):
        
        out = self.layer1(x)
        out = self.batch_norm(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.layer2(out)
        out = torch.sigmoid(out)
        
        return out

In [85]:
# INPUT_SIZE = X_train.shape[1]
# HIDDEN_LAYER_SIZE = 256

# model = BaselineModel(INPUT_SIZE, HIDDEN_LAYER_SIZE)

# # Simple Binary Cross-Entropy Loss and SGD
# criterion = nn.BCELoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# # Training loop
# num_epochs = 1000
# for epoch in range(num_epochs):
#     for inputs, labels in train_loader:
#         optimizer.zero_grad()
#         outputs = model(inputs)
#         loss = criterion(outputs, labels.unsqueeze(1))
#         loss.backward()
#         optimizer.step()
    
#     if (epoch+1) % 50 == 0:
#         print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

In [86]:
# torch.set_printoptions(sci_mode=False)

# model.eval()  # Set the model to evaluation mode
# correct = 0
# total = 0

# train_inputs = X_train_tensor
# train_actual_outputs = y_train_tensor

# inputs = X_test_tensor
# actual_outputs = y_test_tensor

# with torch.no_grad():  # Inference mode, gradients not needed
#     train_outputs = model(train_inputs)
#     train_predicted = (model(X_train_tensor) >= .5).int()
#     train_total = train_outputs.size(0)
#     train_correct = (train_predicted.squeeze().int() == train_actual_outputs.int()).sum().item()
    
#     outputs = model(inputs)
#     predicted = (model(X_test_tensor) >= .5).int()
#     total = outputs.size(0)
#     correct = (predicted.squeeze().int() == actual_outputs.int()).sum().item()

# train_accuracy = 100 * train_correct / train_total
# print(f'Accuracy on the train set: {train_accuracy:.2f}%')

# accuracy = 100 * correct / total
# print(f'Accuracy on the test set: {accuracy:.2f}%')

### XGBoost

In [290]:
model = xgboost.XGBClassifier(n_estimators=500, subsample=.9, early_stopping=True, gamma=1)
model.fit(X_train, y_train)

In [291]:
sorted({k: v for k, v in zip(training_cols, model.feature_importances_)}.items(), key=lambda x: -x[1])

[('rating_team', 0.07475173),
 ('rating_opp', 0.07377953),
 ('EBP_rolling_team', 0.048669755),
 ('POM_latest_team', 0.04078311),
 ('std_reg_team', 0.03906178),
 ('std_reg_opp', 0.039040543),
 ('EBP_rolling_opp', 0.03854042),
 ('std_post_opp', 0.03784448),
 ('POM_rolling_opp', 0.037488624),
 ('EBP_latest_opp', 0.036340304),
 ('NOL_rolling_opp', 0.03401407),
 ('count_post_team', 0.033700295),
 ('NOL_latest_team', 0.03338241),
 ('count_post_opp', 0.03321767),
 ('mean_post_team', 0.03316278),
 ('mean_reg_opp', 0.032806795),
 ('NOL_latest_opp', 0.03240204),
 ('mean_reg_team', 0.031859178),
 ('std_post_team', 0.03155847),
 ('mean_post_opp', 0.031354293),
 ('POM_latest_opp', 0.031038709),
 ('NOL_rolling_team', 0.030929849),
 ('count_reg_opp', 0.03005102),
 ('POM_rolling_team', 0.02991567),
 ('Season', 0.02970687),
 ('EBP_latest_team', 0.027967593),
 ('count_reg_team', 0.026631981)]

In [292]:
 #XGB
train_predicted_output = pd.DataFrame(
    {
        # "Predicted": model.predict_proba(X_test)[:, 1], 
        "Predicted": model.predict(X_train),
        "Actual": y_train.astype(int)
    }
)

predicted_output = pd.DataFrame(
    {
        # "Predicted": model.predict_proba(X_test)[:, 1], 
        "Predicted": model.predict(X_test),
        "Actual": y_test.astype(int)
    }
)

print(f"Train Accuracy: {np.mean(np.where(train_predicted_output['Predicted'] == train_predicted_output['Actual'], 1, 0))}")
print(f"Test Accuracy: {np.mean(np.where(predicted_output['Predicted'] == predicted_output['Actual'], 1, 0))}")

Train Accuracy: 0.9957627118644068
Test Accuracy: 0.7164179104477612


In [229]:
# predicted_output = pd.DataFrame(
#     {
#         "Predicted": torch.round(outputs, decimals = 2).flatten(), 
#          "Actual": y_test_tensor.flatten()
#     }
# )
# with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
#     print(predicted_output)

In [230]:
def create_testing_data(df, feature_df, seed_lookup, scaler, year):
    """
    matchup_df: pd.DataFrame
    scaler: StandardScaler
    year: int
    """
    
    matchup_df = df.copy()
    
    matchup_df["TeamID"] = matchup_df["StrongSeed"].apply(
        lambda x: seed_lookup[x]
    )
    matchup_df["OppID"] = matchup_df["WeakSeed"].apply(
        lambda x: seed_lookup[x]
    )
    
    post_season_merged = merge_features_to_games(feature_df, matchup_df, year, training=False)
    
    # post_season_merged = post_season_merged.drop(
    #     ['TeamID_team', 'OppID', 'TeamID_opp', 'Slot', 'StrongSeed', 'WeakSeed'], 
    #     axis=1
    # )

    post_season_merged = post_season_merged[scaler.feature_names_in_]

    X_test = scaler.transform(post_season_merged)
    X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
    return X_test_tensor

In [231]:
def probabilistic_choice(row):
    return np.random.choice([row['StrongSeed'], row['WeakSeed']], p=[row['prob'], 1-row['prob']])

def get_model_predictions(model, inputs, matchup_df, model_type):
    """
    matchup_df: pd.DataFrame
    scaler: StandardScaler
    year: int
    """
    if model_type == "torch":
        model.eval()  # Set the model to evaluation mode
        with torch.no_grad():  # Inference mode, gradients not needed
            outputs = model(inputs)
    elif model_type == "xgb":
        outputs = model.predict_proba(inputs)[:, 1]
    
    pred_df = matchup_df.copy()
    pred_df["prob"] = outputs
    pred_df["winner"] = pred_df.apply(probabilistic_choice, axis=1)

    return pred_df
    
    

In [232]:
def create_feature_df(year, systems=["POM", "NOL", "EBP"]):
    
    seeding_df = mens_data["MNCAATourneySeeds"][
        mens_data["MNCAATourneySeeds"]["Season"] == year
    ]

    feature_df = get_features(mens_data, year=year, systems=systems)
    
    res = pd.merge(seeding_df, feature_df, on="TeamID")

    return res


# Run Simulation Based on Model
Recursively loop through the seeding df to pick teams at each round.

TODO: Swap out function to use baseline model predictions

In [233]:
def create_submission(model, gender, bracket, year, systems=["POM", "NOL", "EBP"]):

    df = mens_data["MNCAATourneySlots"][
        (mens_data["MNCAATourneySlots"]["Season"] == year)
    ]
    seeding_df = mens_data["MNCAATourneySeeds"][
        mens_data["MNCAATourneySeeds"]["Season"] == year
    ]

    seed_lookup = {
        k: v for k, v in zip(
            seeding_df["Seed"], 
            seeding_df["TeamID"]
        )
    }

    feature_df = get_features(mens_data, year=year, systems=systems)

    def simulate(_df, round_num=0, results=None):
        
        df = _df.copy()
        
        if results is None:
            results = {}
            
        if round_num > 6:
            return results
        
        if round_num == 0: # Play-IN games
            temp_df = df[~df["Slot"].str.startswith("R")]
        
        else:
            temp_df = df[df["Slot"].str.startswith(f"R{round_num}")]
            temp_df["StrongSeed"] = temp_df["StrongSeed"].apply(
                lambda x: results[x] if x in results else x
            )
            temp_df["WeakSeed"] = temp_df["WeakSeed"].apply(
                lambda x: results[x] if x in results else x
            )
            
        ###################### CHANGE ####################
        ##################################################
        # temp_df["winner"] = np.where(
        #     np.random.uniform(0,1,size=len(temp_df)) <= .70,
        #     temp_df["StrongSeed"], 
        #     temp_df["WeakSeed"]
        # )
        ###################################################
        inputs = create_testing_data(temp_df, feature_df, seed_lookup, scaler, year)
        temp_df = get_model_predictions(model, inputs, temp_df, model_type="xgb")

        for k, v in zip(temp_df["Slot"], temp_df["winner"]):
            results[k] = v
        
        results = simulate(df, round_num + 1, results)
        return results
    
    round_winner = simulate(df)
    
    df["Team"] = df["Slot"].apply(lambda x: round_winner[x])
    df["Tournament"] = gender
    df["Bracket"] = bracket
    
    return df[["Tournament", "Bracket", "Slot", "Team"]]

In [250]:
brackets = {}
for i in range(100):
    brackets[i] = create_submission(model=model, gender="M", bracket=i, year=2023)

In [251]:
df = pd.concat(brackets.values(), ignore_index=False)

In [1]:
df

NameError: name 'df' is not defined

In [252]:
def get_successors(seed):
    net = networkx.DiGraph()

    slot_df = mens_data["MNCAATourneySlots"][
        (mens_data["MNCAATourneySlots"]["Season"] == 2023)
    ]

    net.add_edges_from([i for i in zip(slot_df["WeakSeed"].values, slot_df["Slot"].values)])
    net.add_edges_from([i for i in zip(slot_df["StrongSeed"].values, slot_df["Slot"].values)])

    successors = [i[1] for i in dfs_edges(net, seed)]

    return successors

In [302]:
def check_bracket_distribution(df, seed):
    """
    df: pd.DataFrame
        This is the submission dataframe for all brackets
    team: str
        a seed representing the team
    """
    successors = get_successors(seed)
    path_df = df[df["Slot"].isin(successors)]
    path_df = path_df.groupby("Slot")["Team"].apply(lambda x: np.mean(x == seed))

    # seeding_teams = pd.merge(mens_data["MNCAATourneySeeds"][
    #     mens_data["MNCAATourneySeeds"]["Season"] == 2023
    # ], mens_data["MTeams"], on="TeamID")

    # try:
    #     team_name = seeding_teams[seeding_teams["Seed"] == seed]["TeamName"].values[0]
    # except:
    #     team_name=""

    # print("*"*50)
    # print(f"{team_name} CHANCE TO WIN IT ALL: {path_df[-1]}")
    # print("*"*50)

    return path_df

In [327]:
tournament_2023 = pd.merge(mens_data["MNCAATourneySeeds"][
            mens_data["MNCAATourneySeeds"]["Season"] == 2023
        ], mens_data["MTeams"], on="TeamID")

In [337]:
tournament_2023["round_win_prob"]= tournament_2023["Seed"].apply(lambda x: check_bracket_distribution(df, x)[:6].values)

tournament_2023[
    [
        "round_of_32",
        "sweet16",
        "elite8",
        "final4",
        "final",
        "champ",
    ]
] = list(tournament_2023["round_win_prob"])

In [338]:
tournament_2023.sort_values(by=["champ", "final", "final4", "elite8", "sweet16", "round_of_32"], ascending=False)

Unnamed: 0,Season,Seed,TeamID,TeamName,FirstD1Season,LastD1Season,round_win_prob,round_of_32,sweet16,elite8,final4,final,champ
35,2023,Y02,1400,Texas,1985,2024,"[1.0, 0.7, 0.43, 0.28, 0.26, 0.23]",1.00,0.70,0.43,0.28,0.26,0.23
0,2023,W01,1345,Purdue,1985,2024,"[0.97, 0.9, 0.67, 0.42, 0.24, 0.13]",0.97,0.90,0.67,0.42,0.24,0.13
3,2023,W04,1397,Tennessee,1985,2024,"[1.0, 0.93, 0.24, 0.2, 0.16, 0.08]",1.00,0.93,0.24,0.20,0.16,0.08
2,2023,W03,1243,Kansas St,1985,2024,"[1.0, 0.94, 0.37, 0.18, 0.14, 0.07]",1.00,0.94,0.37,0.18,0.14,0.07
17,2023,X01,1104,Alabama,1985,2024,"[0.91, 0.85, 0.46, 0.26, 0.11, 0.07]",0.91,0.85,0.46,0.26,0.11,0.07
...,...,...,...,...,...,...,...,...,...,...,...,...,...
45,2023,Y11b,1338,Pittsburgh,1985,2024,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.00,0.00,0.00,0.00,0.00,0.00
49,2023,Y15,1159,Colgate,1985,2024,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.00,0.00,0.00,0.00,0.00,0.00
61,2023,Z11a,1113,Arizona St,1985,2024,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.00,0.00,0.00,0.00,0.00,0.00
62,2023,Z11b,1305,Nevada,1985,2024,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.00,0.00,0.00,0.00,0.00,0.00


In [None]:
# Comment change

In [336]:
check_bracket_distribution(df, "X16b")

Slot
R1X1    0.09
R2X1    0.05
R3X1    0.03
R4X1    0.00
R5WX    0.00
R6CH    0.00
X16     0.98
Name: Team, dtype: float64

In [None]:
# def create_submission(gender, bracket, year):
#     feature_df = create_feature_df(year=year, systems=["POM", "NOL", "EBP"])
#     round_winner = simulate(df, seed_lookup, year=year)
    
#     df["Team"] = df["Slot"].apply(lambda x: round_winner[x])
#     df["Tournament"] = gender
#     df["Bracket"] = bracket
    
#     return df[["Tournament", "Bracket", "Slot", "Team"]]

In [40]:
# #TODO: Create feature df before recursion to speed this up
# #TODO: Step through this to make sure it's working
# sub = create_submission(gender="M", bracket=1, year=2023)

In [41]:
# sub

Unnamed: 0,Tournament,Bracket,Slot,Team
2385,M,1,R1W1,W01
2386,M,1,R1W2,W02
2387,M,1,R1W3,W03
2388,M,1,R1W4,W04
2389,M,1,R1W5,W05
...,...,...,...,...
2447,M,1,R6CH,Y02
2448,M,1,W16,W16a
2449,M,1,X16,X16a
2450,M,1,Y11,Y11b
