# March Madness 2024 Code
### Team: Taylor Last, Ted Woodsides, Jake Hopkins

### Notes
- **Please add docstrings or commments to code so we all know what's going on**
- I can't find how to all work on the same notebook so we might have to copy code and all work on separate pieces
- **Check create_submission and simulate functions to see how to get a bracket in the form the competition asks for**
### Things to keep in mind
- **Days are standardized already**: Dayzero tells you the date corresponding to DayNum=0 during that season. All game dates have been aligned upon a common scale so that (each year) the Monday championship game of the men's tournament is on DayNum=154. Working backward, the men's national semifinals are always on DayNum=152, the "play-in" games are on days 134-135, men's Selection Sunday is on day 132, the final day of the regular season is also day 132, and so on. 
- **Special note about "Season" numbers**: the college basketball season lasts from early November until the national championship tournament that starts in the middle of March. For instance, this year the first regular season games were played in November 2023 and the national championship games will be played in April 2024. Because a basketball season spans two calendar years like this, it can be confusing to refer to the year of the season. By convention, when we identify a particular season, we will reference the year that the season ends in, not the year that it starts in. So for instance, the current season will be identified in our data as the 2024 season, not the 2023 season or the 2023-24 season or the 2023-2024 season, though you may see any of these in everyday use outside of our data.

### Questions to consider
- Do we want to train the model by making prediction for all games (regular season and post), then only make predictions for tournament games, or do we want to only train on tournament games?

### Featues I'd like to add
- Quantify guard play
- Free throw percentage
- Pace
- Time of possession

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
from typing import List
from tqdm import tqdm
import multiprocessing

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import xgboost

import networkx
from networkx.algorithms.traversal.depth_first_search import dfs_edges

import warnings
warnings.filterwarnings('ignore')

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [2]:
# Refer to this if you need to look up documentation
print(f"Pandas Version: {pd.__version__}")

Pandas Version: 2.2.0


In [3]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
data = dict()
for dirname, _, filenames in os.walk('../kaggle/input'):
    for filename in filenames:
        table_name = filename.split('.')[0]
        table_path = os.path.join(dirname, filename)
        try:
            data[table_name] = pd.read_csv(table_path)
        except UnicodeDecodeError:
            data[table_name] = pd.read_csv(table_path, encoding='cp1252')
        except Exception as e:
            print(f"Error with {filename}: {e}")

# Split dict of dataframes by gender and other (supplemental) data
mens_data = dict()
womens_data = dict()
supplemental_data = dict()

for k, v in data.items():
    if k.startswith("M"):
        mens_data[k] = v
    elif k.startswith("W"):
        womens_data[k] = v
    else:
        supplemental_data[k] = v
        

In [15]:
def get_season_stats(dataset, detailed=False, post_season=False, year=None):
    # Gets the first letter in dataset
    gender = list(dataset.keys())[0][0]
    
    if detailed:
        if post_season:
            df = dataset[f"{gender}NCAATourneyDetailedResults"]
        else:
            df = dataset[f"{gender}RegularSeasonDetailedResults"]
        
    else:
        if post_season:
            df = dataset[f"{gender}NCAATourneyCompactResults"]
        else:
            df = dataset[f"{gender}RegularSeasonCompactResults"]
    
    if year is not None:
        df = df[df["Season"] == year]
    return df, gender

def compute_margins_of_victory(df):
    df["margin"] = df["WScore"] - df["LScore"]
    
    win_df = df[["WTeamID", "margin"]].rename(columns={"WTeamID": "TeamID"})
    lose_df = df[["LTeamID", "margin"]].rename(columns={"LTeamID": "TeamID"})
    lose_df["margin"] = -lose_df["margin"]

    res = pd.concat([win_df, lose_df], axis=0)
    return res.groupby("TeamID")["margin"].mean()

def join_team_names(df, data, gender="M"):
    """
    df: pd.DataFrame
        dataframe appending teams to
    data: dict[str, pd.DataFrame]
        dictionary of all table names and data
    """
    res = pd.merge(df, data[f"{gender}Teams"][["TeamID", "TeamName"]], on="TeamID")
    return res

def create_srs(df,gender):

    df["margin"] = df["WScore"] - df["LScore"]
    win_df = df[["WTeamID", "margin", "LTeamID"]].rename(
        columns={"WTeamID": "team_id", "LTeamID": "opp_id"}
    )
    lose_df = df[["WTeamID", "margin", "LTeamID"]].rename(
        columns={"LTeamID": "team_id", "WTeamID": "opp_id"}
    )
    lose_df["margin"] = -lose_df["margin"]

    teams = pd.concat([win_df, lose_df], axis=0)
    spreads = compute_margins_of_victory(df)
    
    terms = []
    solutions = []

    for team_id in spreads.keys():
        row = []
        opps = list(teams[teams["team_id"] == team_id]["opp_id"])

        for opp_id in spreads.keys():
            if opp_id == team_id:
                # coef for the team itself should be 1
                row.append(1)
            elif opp_id in opps:
                # coef for opponents is 1 over num of opps
                row.append(-1.0/len(opps))
            else:
                # teams not faced get a 0 coef
                row.append(0)
        terms.append(row)

        solutions.append(spreads[team_id])

    solutions, _, _, _ = np.linalg.lstsq(np.array(terms), np.array(solutions), rcond=None)
    
    ratings = list(zip( spreads.keys(), solutions ))
    srs = pd.DataFrame(ratings, columns=['team', 'rating'])
    rankings = srs.sort_values('rating', ascending=False).reset_index()[['team', 'rating']]
    rankings = join_team_names(rankings.rename(columns={"team": "TeamID"}), data, gender=gender)
    return rankings

def get_coach_win_perc(
    dataset: dict,
    regular_season: bool,
    year:int = 2024
) -> pd.DataFrame:
    """
    
    parameters
    ----------
    dataset: dict
        dictionary of datasets to use. it will be
        mens_data or womens_data.
        
    year: int
        year to filter data. it will get coaches stats for everything
        up until this year. (model can't have any look ahead bias). for post
        season games, use a year one less than the year of interest.
        
    returns
    -------
    coaches_stats: pd.DataFrame
        dataframe with count of wins, win percentage, and std dev
        of wins.
    """
    
    # Gets the first letter in dataset
    gender = list(dataset.keys())[0][0]
    
    if regular_season:
        df = dataset[f"{gender}RegularSeasonCompactResults"]
        #Filter season up until season of interest
        df = df[df["Season"] <= year]
    else:
        df = dataset[f"{gender}NCAATourneyCompactResults"]
        #Filter season up until season of interest
        df = df[df["Season"] < year]
        
    
    
    winning_coaches_df = pd.merge(
        df,
        dataset[f"{gender}TeamCoaches"],
        how="left",
        left_on=["Season", "WTeamID"],
        right_on=["Season", "TeamID"]
    )

    winning_coaches_df = winning_coaches_df[
        (winning_coaches_df['DayNum'] >= winning_coaches_df['FirstDayNum']) 
        & (winning_coaches_df['DayNum'] <= winning_coaches_df['LastDayNum'])
    ]
    winning_coaches_df["win"] = 1

    #Make sure the join dind't create dupes
    assert len(winning_coaches_df) == len(df)

    losing_coaches_df = pd.merge(
        df,
        dataset[f"{gender}TeamCoaches"],
        how="left",
        left_on=["Season", "LTeamID"],
        right_on=["Season", "TeamID"]
    )

    losing_coaches_df = losing_coaches_df[
        (losing_coaches_df['DayNum'] >= losing_coaches_df['FirstDayNum']) 
        & (losing_coaches_df['DayNum'] <= losing_coaches_df['LastDayNum'])
    ]
    losing_coaches_df["win"] = 0

    #Make sure the join dind't create dupes
    assert len(losing_coaches_df) == len(df)

    coaches_df = pd.concat(
        [
            losing_coaches_df[["CoachName", "win"]],
            winning_coaches_df[["CoachName", "win"]]
        ],
        axis=0
    )

    coach_stats = (
        coaches_df
        .groupby("CoachName")["win"]
        .describe()
        .sort_values("count", ascending=False)
        [["count", "mean", "std"]]
        .fillna(0)
    )

    return coach_stats
def get_system_ratings(
    mens_dataset, #There are only ratings for men
    systems: List[str],
    year: int=2024,
):
    """
    gets system ratings for each team for specified systems for a specific year.
    
    parameters
    ---------
    mens_dataset: dict
        dictionary of datasets for men
    systems: List[str]
        list of dictionaries we are interested in seeing
    year: int
        year to look for ratings
    moving_average: str
        specifies how to calculate rolling ratings for given systems.
        if None, the system takes the most recent system rating
    
    returns
    -------
    df: pd.DataFrame
        data that reflects ratings for a team
    """
    
    # Filter by season - only take most recent
    df = mens_dataset["MMasseyOrdinals"]
    df = df[df["Season"] == year]
    
    # Filter by system
    df = df[df["SystemName"].isin(systems)]
    
    latest_rank = (
        df
        .sort_values("RankingDayNum")
        .groupby(["TeamID","SystemName"])
        ["OrdinalRank"]
        .last()
        .unstack("SystemName")
        .reset_index().
        rename(columns = {i: i+"_latest" for i in systems})
    )
    
    transformed_df = (
        df
        .sort_values(by="RankingDayNum")
        .groupby(["TeamID", "SystemName"], group_keys=False)
        ["OrdinalRank"]
        .rolling(5) # TODO: Parameterize this (window and moving average method)
        .mean()
        .unstack("SystemName")
        .reset_index()
        .drop("level_1", axis=1)
        .groupby("TeamID")
        [systems]
        .last()
        .reset_index()
        .rename(columns = {i: i+"_rolling" for i in systems})
    )
    
    res = pd.merge(latest_rank, transformed_df, on="TeamID")

    return res

def get_post_season(data, year):
    
    df, gender = get_season_stats(
            data, 
            detailed=False, 
            post_season=True, 
            year=year
    )
    
    # Shuffle teams for positional encoding (model shouldn't have winning teams features as the same)
    df["TeamID"] = np.where(
        np.random.uniform(0,1, size=len(df)) > .5, 
        df["WTeamID"], 
        df["LTeamID"]
    )
    df["team_score"] = np.where(
        df["TeamID"] == df["WTeamID"], 
        df["WScore"], 
        df["LScore"]
    )
    df["OppID"] = np.where(
        df["TeamID"] == df["WTeamID"], 
        df["LTeamID"], 
        df["WTeamID"]
    )
    df["opp_score"] = np.where(
        df["TeamID"] == df["WTeamID"], 
        df["LScore"], 
        df["WScore"]
    )
    df = df.drop(
        ["WTeamID", "LTeamID", "WScore", "LScore", "WLoc", "NumOT"],
        axis=1
    )
    
    return df

def get_features(mens_data, year, systems):
    # Season Stats
    df, gender = get_season_stats(
        mens_data, 
        detailed=False, 
        post_season=False, 
        year=year
    )

    # Rating System
    srs = create_srs(df, gender)

    # System Ratings
    system_ratings = get_system_ratings(
        mens_data, 
        systems=systems
    ) #KenPom, Nolan ELO, EPSN BPI

    # Ratings df
    ratings_df = pd.merge(
                srs,
                system_ratings,
                on="TeamID"
    )

    # Coaches postseason win stats
    coaches_postseason_win_df = get_coach_win_perc(
        dataset=mens_data, 
        regular_season=False, 
        year=year
    ).rename(columns={"count": "count_post", "mean": "mean_post", "std": "std_post"})

    # Coaches regular season win stats
    coaches_regseason_win_df = get_coach_win_perc(
        dataset=mens_data, 
        regular_season=True, 
        year=year
    ).rename(columns={"count": "count_reg", "mean": "mean_reg", "std": "std_reg"})

    coaches_df = pd.merge(
        coaches_regseason_win_df,
        coaches_postseason_win_df,
        on="CoachName",
        how="left"
    ).fillna(0)

    # Get coaches for the year and only grab the most recent coach for a certain team
    curr_coaches = (
        mens_data["MTeamCoaches"][
            mens_data["MTeamCoaches"]["Season"] == year
        ]
        .sort_values("FirstDayNum")
        .groupby("TeamID")["CoachName"]
        .last()
        .reset_index()
    )

    # Get coach stats for current coaches
    coaches_df = pd.merge(
        curr_coaches,
        coaches_df,
        on="CoachName",
        how="left"
    )


    feature_df = (
        pd.merge(
            ratings_df,
            coaches_df
        )
        .drop(["TeamName", "CoachName"], axis=1)
    )

    
    return feature_df


def merge_features_to_games(feature_df, post_season_df, year, training=True):
    
    post_season_merged = pd.merge(
        pd.merge(
            feature_df,
            post_season_df,
            on="TeamID",
        ),
        feature_df,
        left_on="OppID",
        right_on="TeamID",
        suffixes=("_team", "_opp")
    )
    if training:
        post_season_merged["win"] = post_season_merged["team_score"] > post_season_merged["opp_score"]
        post_season_merged = (
            post_season_merged
            .drop(
                ["team_score", "OppID", "opp_score", "DayNum"], 
                axis=1
            )
            .rename(columns = {"TeamID_team": "TeamID", "TeamID_opp": "OppID"})
        )

    for col in post_season_merged.columns:
        if col.replace("_team", "_opp") in post_season_merged.columns and "_team" in col:
            post_season_merged[col.replace("_team", "_diff")] = post_season_merged[col] - post_season_merged[col.replace("_team", "_opp")]
            post_season_merged = post_season_merged.drop([col, col.replace("_team", "_opp")], axis=1)

    # post_season_merged = post_season_merged.drop(["Season_x", "Season_y"], axis=1)
    return post_season_merged

In [16]:
#TODO: Add features from box scores

In [17]:
def get_team_stats(df, year=None):

    if year is not None:
        df[df["Season"] == year]
    
    df["margin"] = df["WScore"] - df["LScore"]
    win_df = df.rename(
        columns={"WTeamID": "team_id", "LTeamID": "opp_id", "WLoc": "Loc"}
    )
    win_df = win_df.rename(columns={col: col[1:] + "_opp" for col in win_df.columns if col.startswith("L") and col != "Loc"})
    win_df = win_df.rename(columns={col: col[1:] for col in win_df.columns if col.startswith("W") and not col.endswith("_opp")})
    
    lose_df = df.rename(
        columns={"LTeamID": "team_id", "WTeamID": "opp_id", "WLoc": "Loc"}
    )
    lose_df = lose_df.rename(columns={col: col[1:] for col in lose_df.columns if col.startswith("L") and col != "Loc"})
    lose_df = lose_df.rename(columns={col: col[1:] + "_opp" for col in lose_df.columns if col.startswith("W")})
    lose_df["Loc"] = lose_df["Loc"].apply(lambda x: "H" if x == "A" else "A" if x == "H" else "N")
    lose_df["margin"] = -lose_df["margin"]

    teams = pd.concat([win_df, lose_df], axis=0)

    df = teams.groupby(["Season", "team_id"])[
        ['FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR', 'DR', 'Ast',
        'TO', 'Stl', 'Blk', 'PF', 'FGM_opp', 'FGA_opp', 'FGM3_opp', 'FGA3_opp',
        'FTM_opp', 'FTA_opp', 'OR_opp', 'DR_opp', 'Ast_opp', 'TO_opp',
        'Stl_opp', 'Blk_opp', 'PF_opp', 'margin'
        ]
    ].agg([
            ("mean", "mean"), 
            ("quant25" , lambda x: x.quantile(.25)), 
            ("quant75", lambda x: x.quantile(.75))
        ]
    ).reset_index()
    df.columns = [(col + "_" + agg_func).strip("_") for col, agg_func in zip(df.columns.get_level_values(0), df.columns.get_level_values(1))]

    for col in df.columns:
        if (
            "_opp" in col
            and col.replace("_opp", "") in df.columns 
            and col not in ["Season", "team_id"]
        ):
            new_col = col.replace("_opp", "") + "_diff"
            df[new_col] = df[col.replace("_opp", "")] - df[col]
            df = df.drop([col.replace("_opp", ""), col], axis=1)
    return df

In [18]:
def get_advanced_features(mens_data, year=None, systems=["POM", "NOL", "EBP"]):
    df, _ = get_season_stats(
        mens_data,
        detailed=True,
        post_season=False,
        year=year
    )
    adv_features = get_team_stats(df)
    basic_features = get_features(mens_data=mens_data, year=year, systems=systems)

    res = pd.merge(
        basic_features,
        adv_features.rename(columns={"team_id": "TeamID"}),
        on="TeamID"
    )

    return res

In [19]:
def save_data(train_dict, _dir="baseline"):
    if not os.path.isdir(_dir):
        os.mkdir(_dir)
    for year in tqdm(train_dict):
        train_dict[year].to_csv(f"{_dir}/{year}.csv", index=False)
        
def load_data(_dir="baseline"):
    if not os.path.isdir(_dir):
        raise NotADirectoryError(f"{_dir} is not a directory")
    else:
        train_data = dict()
        for dirname, _, filenames in tqdm(os.walk(_dir)):
            for filename in filenames:
                table_name = filename.split('.')[0]
                table_path = os.path.join(dirname, filename)
                try:
                    train_data[table_name] = pd.read_csv(table_path, index_col=False)
                except UnicodeDecodeError:
                    train_data[table_name] = pd.read_csv(table_path, encoding='cp1252')
                except Exception as e:
                    print(f"Error with {filename}: {e}")
    return train_data

# Model V2
Improvements:
- Build in detailed stats from box scores
- Build in conference stats
- Tune hyperparams
- Add in tempo
- Add in experience (maybe)
- Maybe add round number as a feature

# Baseline Model

Use a simple rating system (SRS) combined with KenPom to fit a model predicting the probability of a given matchup in the tournament. Then randomly sample from the distribution outputed from the model to create multiple submissions.

### Features:
- SRS system from the regular season
- KenPom ranking system
- Coach historical success in post season and regular season

### Model: XGBoost


In [20]:
# This function will change a lot based on what we are trying to predict
# Simplest training method is to grab team ids from previous years and pull in reg season stats to make a prediction
# what we should try to get to is running simulations and making predictions based on matchups then have some sort of loss metric for how good or bad a bracket is.
# Also adding stats like if they're on a run or not would be cool (tough to do at inference time)
def create_mens_training_data(advanced=False):
    
    training_data = dict()
    
    for year in tqdm(range(2003, 2025)):
        
        if advanced: 
            feature_df = get_advanced_features(mens_data, year=year, systems=["POM", "NOL", "EBP"])
        else:
            feature_df = get_features(mens_data, year=year, systems=["POM", "EBP"])
        post_season_df = get_post_season(mens_data, year)
        post_season_merged = merge_features_to_games(feature_df, post_season_df, year)
        
        training_data[year] = post_season_merged
    
    return training_data

In [21]:
df = pd.read_csv("../data/stats/advanced_stats_tournament_teams.csv")

In [22]:
# Need to join this on who is in the tournament every year and create predictions based on matchups 2003-2023
train_dict = create_mens_training_data()

100%|██████████| 22/22 [01:14<00:00,  3.38s/it]


In [23]:
save_data(train_dict, "../data/v3")

100%|██████████| 22/22 [00:00<00:00, 287.39it/s]


In [24]:
train_data=load_data("../data/v3")

1it [00:00, 12.45it/s]


In [26]:
temp_df = pd.concat(train_dict.values(), ignore_index=True)
# for col in temp_df.columns:
#     if col.replace("_team", "_opp") in temp_df.columns and "_team" in col:
#         temp_df[col.replace("_team", "_diff")] = temp_df[col] - temp_df[col.replace("_team", "_opp")]
#         temp_df = temp_df.drop([col, col.replace("_team", "_opp")], axis=1)

In [36]:
temp_df = temp_df[temp_df["Season"]>= 2010]

Unnamed: 0,TeamID,G,W,L,W-L%,SRS,SOS,W.1,L.1,W.2,...,TS%_opp,TRB%_opp,AST%_opp,STL%_opp,BLK%_opp,eFG%_opp,TOV%_opp,ORB%_opp,FT/FGA_opp,Season
0,1101,31,11,20,0.355,-19.60,-4.12,2,12,9,...,0.558,49.8,53.6,9.7,12.0,0.518,17.8,30.2,0.323,2014
1,1101,31,10,21,0.323,-17.20,-6.34,4,14,7,...,0.573,55.4,50.1,8.8,11.6,0.539,19.2,33.4,0.325,2015
2,1101,31,13,18,0.419,-13.93,-7.53,8,10,10,...,0.565,51.0,47.5,7.3,8.6,0.527,18.2,26.9,0.313,2016
3,1101,29,13,16,0.448,-11.86,-7.10,7,11,9,...,0.553,52.9,55.5,8.9,8.8,0.522,18.5,30.1,0.287,2017
4,1101,32,16,16,0.500,-9.14,-6.82,8,10,9,...,0.540,50.3,48.1,8.9,7.5,0.499,19.5,28.5,0.298,2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3503,1463,30,22,8,0.733,5.52,-1.24,10,4,11,...,0.509,47.1,48.9,9.4,7.9,0.471,13.5,24.1,0.223,2019
3504,1463,30,23,7,0.767,6.63,-1.51,11,3,10,...,0.492,45.2,50.9,9.6,8.9,0.459,15.5,21.0,0.183,2020
3505,1463,31,19,12,0.613,-0.10,-1.34,11,3,10,...,0.518,49.8,48.5,10.3,9.5,0.486,16.5,26.3,0.219,2022
3506,1463,30,21,9,0.700,7.81,-1.62,10,4,10,...,0.498,45.8,50.4,8.5,7.3,0.461,16.5,23.1,0.227,2023


In [94]:
merged_df = pd.merge(
    pd.merge(
        temp_df,
        df.drop(["FirstD1Season", "LastD1Season"], axis=1),
        on=["TeamID", "Season"],
        how="left"
    ),
    df.drop([ "FirstD1Season", "LastD1Season"], axis=1),
    how="left",
    left_on=["OppID", "Season"],
    right_on=["TeamID", "Season"]
).rename(columns={"TeamID_x": "TeamID", "TeamName_x": "TeamName", "TeamName_y": "OppName"}).drop("TeamID_y", axis=1)

In [95]:
for col in merged_df.columns:
    if col.endswith("y"):
        merged_df[col.replace("_y", "_diff")] = merged_df[col] - merged_df[col.replace("_y", "_x")]
        merged_df = merged_df.drop([col, col.replace("_y", "_x")], axis=1)

In [96]:
merged_df.columns

Index(['TeamID', 'Season', 'OppID', 'win', 'rating_diff', 'EBP_latest_diff',
       'POM_latest_diff', 'POM_rolling_diff', 'EBP_rolling_diff',
       'count_reg_diff', 'mean_reg_diff', 'std_reg_diff', 'count_post_diff',
       'mean_post_diff', 'std_post_diff', 'TeamName', 'OppName', 'G_diff',
       'W_diff', 'L_diff', 'W-L%_diff', 'SRS_diff', 'SOS_diff', 'W.1_diff',
       'L.1_diff', 'W.2_diff', 'L.2_diff', 'W.3_diff', 'L.3_diff', 'Tm._diff',
       'Opp._diff', 'Pace_diff', 'ORtg_diff', 'FTr_diff', '3PAr_diff',
       'TS%_diff', 'TRB%_diff', 'AST%_diff', 'STL%_diff', 'BLK%_diff',
       'eFG%_diff', 'TOV%_diff', 'ORB%_diff', 'FT/FGA_diff', 'Pace_opp_diff',
       'ORtg_opp_diff', 'FTr_opp_diff', '3PAr_opp_diff', 'TS%_opp_diff',
       'TRB%_opp_diff', 'AST%_opp_diff', 'STL%_opp_diff', 'BLK%_opp_diff',
       'eFG%_opp_diff', 'TOV%_opp_diff', 'ORB%_opp_diff', 'FT/FGA_opp_diff'],
      dtype='object')

In [97]:
# with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
#     display(
#         pd.merge(
#             pd.merge(
#                 temp_df[temp_df["Season"] == 2023], 
#                 mens_data["MTeams"], 
#                 on="TeamID"
#             ),
#             mens_data["MTeams"], 
#             left_on="OppID",
#             right_on="TeamID",
#             suffixes=("_TEAM", "_OPP")
#         )
#     )

In [98]:
all_data = pd.concat(train_data.values(), ignore_index=True)

In [99]:
all_data

Unnamed: 0,TeamID,Season,OppID,win,rating_diff,EBP_latest_diff,POM_latest_diff,POM_rolling_diff,EBP_rolling_diff,count_reg_diff,mean_reg_diff,std_reg_diff,count_post_diff,mean_post_diff,std_post_diff
0,1242,2008,1424,True,13.774373,-53,-60,-76.4,-60.0,-135.0,0.102618,-0.035132,2.0,0.074783,-0.016972
1,1242,2008,1437,True,14.944785,-14,-18,-20.2,-13.0,35.0,0.098419,-0.034341,15.0,0.140000,-0.037148
2,1242,2008,1172,True,7.731931,-109,-89,-104.4,-125.8,-59.0,0.143807,-0.042701,21.0,0.640000,0.489898
3,1242,2008,1314,True,3.053989,4,7,7.0,5.6,-161.0,-0.094959,0.058539,-37.0,-0.085806,0.040149
4,1272,2008,1280,True,11.056161,45,49,42.4,43.6,182.0,0.106466,-0.047866,20.0,0.238095,-0.054138
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1309,1192,2005,1228,False,-18.830900,288,314,317.4,293.8,369.0,-0.179590,0.051916,-4.0,-0.571429,-0.534522
1310,1175,2005,1181,False,-18.057894,281,290,295.6,289.8,-486.0,-0.292857,0.117008,-80.0,-0.800000,-0.402524
1311,1285,2005,1449,False,-15.442183,142,100,94.6,142.4,-236.0,0.022186,0.005438,-2.0,0.000000,0.000000
1312,1105,2005,1324,False,0.494361,209,203,205.8,208.6,-5.0,0.051485,0.001747,0.0,0.000000,0.000000


In [100]:
# import pandas as pd
# from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif, RFE
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.preprocessing import StandardScaler

# def feature_selection(_df, num_features_out=20):

#     # Assuming you have a DataFrame `df` with features and `target` as your target variable
#     df = _df.copy()
#     target = df['win']
#     df = df.drop(["win", "TeamID", "OppID"], axis=1)

#     # 1. Variance Threshold
#     selector = VarianceThreshold(threshold=0.01)  # Adjust the threshold value as needed
#     df_reduced = selector.fit_transform(df)
#     selected_columns = df.columns[selector.get_support()]
#     print(f"Shape after Variance Threshold: {df_reduced.shape}")

#     # 2. Univariate Selection
#     k = 50  # Select the top 100 features based on univariate tests
#     univariate_selector = SelectKBest(f_classif, k=k)
#     df_reduced = univariate_selector.fit_transform(df_reduced, target)
#     unscaled_selected_columns = selected_columns[univariate_selector.get_support()]
#     print(f"Shape after Univariate Selection: {df_reduced.shape}")

#     # 3. Model-based Feature Importance
#     # Scale the features
#     # scaler = StandardScaler()
#     # df_scaled = scaler.fit_transform(df_reduced)

#     # Fit a Random Forest model to get feature importance
#     rf = RandomForestClassifier()
#     rf.fit(df_reduced, target.astype(int))

#     # Get feature importances and select the top features
#     importances = rf.feature_importances_
#     indices = np.argsort(importances)[::-1][:num_features_out]  # Select top 30 features
#     df_final = df_reduced[:, indices]
#     selected_columns = unscaled_selected_columns[indices]
#     print(f"Shape after Model-based Feature Selection: {df_final.shape}")

#     # # 4. Recursive Feature Elimination (Optional, if more reduction is needed)
#     # # Note: RFE can be computationally expensive; consider using RFECV for cross-validated selection
#     # rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=num_features_out)
#     # df_reduced = rfe.fit_transform(df_scaled, target.astype(int))
#     # selected_columns = selected_columns[rfe.get_support()]
#     # print(f"Shape after RFE: {df_reduced.shape}")

#     return df_final, target, selected_columns

In [101]:
all_data.isna().sum().sum()

0

In [102]:
# all_data = all_data[
#     [
#         # "margin_mean_diff", 
#         "Season", 
#         "win",
#         "rating_diff",
#         # 'Blk_quant75_diff_diff', 
#         # 'DR_quant25_diff_diff', 
#         # 'PF_mean_diff_diff', 
#         'FTA_mean_diff_diff', 
#         'FGA_quant25_diff_diff', 
#         'count_post_diff', 
#         'TO_mean_diff_diff',
#         "POM_latest_diff",
#         "POM_rolling_diff"
#     ]
# ]



In [209]:
all_data = merged_df.copy()


In [226]:
first_testing_year = 2019

train = all_data[all_data["Season"] < first_testing_year]
test = all_data[(all_data["Season"] >= first_testing_year) & (all_data["Season"] < 2024)]

leaked_cols = ["G_diff", "W_diff", "L_diff", "Tm._diff", "Opp._diff"]
# leaked_cols = []

X_train = train.drop(["win","Season", "TeamID", "OppID", "TeamName", "OppName"] + leaked_cols, axis=1)
y_train = train["win"]
X_test = test.drop(["win","Season", "TeamID", "OppID", "TeamName", "OppName"] + leaked_cols, axis=1)
y_test = test["win"]


training_cols = X_train.columns
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# X_train, y_train, feature_cols = feature_selection(train)
# X_test = test.drop(["win","Season", "TeamID", "OppID"], axis=1)
# X_test = X_test[feature_cols].values
# y_test = test["win"]
# training_cols = feature_cols



In [227]:
mens_data["MRegularSeasonCompactResults"]

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0
...,...,...,...,...,...,...,...,...
186547,2024,114,1454,75,1237,70,A,0
186548,2024,114,1455,74,1412,66,A,0
186549,2024,114,1459,91,1359,69,H,0
186550,2024,114,1462,91,1177,58,H,0


In [228]:
train

Unnamed: 0,TeamID,Season,OppID,win,rating_diff,EBP_latest_diff,POM_latest_diff,POM_rolling_diff,EBP_rolling_diff,count_reg_diff,...,3PAr_opp_diff,TS%_opp_diff,TRB%_opp_diff,AST%_opp_diff,STL%_opp_diff,BLK%_opp_diff,eFG%_opp_diff,TOV%_opp_diff,ORB%_opp_diff,FT/FGA_opp_diff
0,1242,2010,1250,True,19.960676,-233,-252,-269.8,-243.8,418.0,...,0.026,0.050,3.7,2.2,0.1,-0.4,0.055,-0.1,-2.2,0.004
1,1242,2010,1320,False,12.525166,-112,-106,-107.8,-114.4,382.0,...,0.037,0.028,2.6,-9.9,-1.5,-2.7,0.034,1.1,-4.9,-0.016
2,1181,2010,1345,True,8.532949,7,7,10.2,9.0,608.0,...,0.027,0.017,5.5,2.3,-0.1,-3.1,0.011,2.0,-0.3,0.029
3,1393,2010,1211,True,8.532211,80,64,66.2,80.8,460.0,...,-0.075,0.010,-0.1,-12.5,-2.5,1.6,0.003,-2.5,-5.4,0.056
4,1140,2010,1243,False,2.203785,-37,-51,-57.2,-39.6,64.0,...,-0.021,0.020,0.1,-1.6,1.2,1.0,0.008,0.9,6.1,0.084
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,1267,2018,1455,True,-9.598630,67,65,51.8,55.2,-460.0,...,0.093,0.008,-10.4,-6.9,0.0,1.3,-0.003,-1.3,-9.0,0.047
596,1252,2018,1314,False,-12.689660,141,158,157.4,143.6,-730.0,...,0.044,-0.019,-4.2,6.3,-0.4,-0.5,-0.010,-3.4,1.6,-0.081
597,1420,2018,1438,True,-19.127061,216,221,249.0,250.2,-326.0,...,0.001,-0.056,-2.1,-4.3,-3.3,-1.1,-0.066,-0.3,-1.9,-0.016
598,1254,2018,1347,False,-3.778386,157,135,142.2,166.2,107.0,...,0.057,0.011,-1.3,3.3,-0.6,2.6,0.006,3.4,-1.8,0.010


### XGBoost

In [229]:
# # # model = xgboost.XGBClassifier(n_estimators=500, subsample=.9, early_stopping=True, gamma=1)
# # model = xgboost.XGBClassifier(n_estimators=100, early_stopping=True)
# # # model.fit(X_train, y_train)


# # for year in range(2003, first_testing_year):

# #     train = all_data[all_data["Season"] == year]
# #     if train.empty:
# #         continue

# #     X_train = train.drop("win", axis=1)
# #     y_train = train["win"]

# #     model.fit(X_train, y_train)

# # Define parameters for the XGBoost model
# params = {
#     'objective': 'binary:logistic',
#     'eval_metric': 'logloss',
#     'tree_method': 'gpu_hist',
#     'max_depth': 4,  # Reduced max depth
#     'subsample': 0.8,  # Subsample percentage of the training data
#     'lambda': 2,  # Increased L2 regularization
#     'alpha': 0.5,  # Increased L1 regularization
#     'eta': 0.01,  # Lower learning rate
# }
# validation_years = range(first_testing_year - 2, first_testing_year)

# # Prepare the validation set
# validation_data = all_data[all_data["Season"].isin(validation_years)]
# X_val = validation_data.drop("win", axis=1)
# y_val = validation_data["win"]
# dval = xgboost.DMatrix(X_val, label=y_val)

# model = None

# for year in range(2003, first_testing_year - 2):
#     train = all_data[all_data["Season"] == year]
#     if train.empty:
#         continue

#     X_train = train.drop("win", axis=1)
#     y_train = train["win"]

#     dtrain = xgboost.DMatrix(X_train, label=y_train)

#     # Early stopping requires at least one set to be passed in evals
#     if model is None:
#         model = xgboost.train(params, dtrain, num_boost_round=1000, 
#                           evals=[(dval, 'validation')], early_stopping_rounds=50)
#     else:
#         model = xgboost.train(params, dtrain, num_boost_round=1000, xgb_model=model,
#                           evals=[(dval, 'validation')], early_stopping_rounds=50)

# # After the loop, 'model' will be your trained model


In [230]:
# import xgboost as xgb
# from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
# from sklearn.metrics import roc_auc_score

# # Prepare the validation set outside the objective function
# validation_data = all_data[all_data["Season"].isin(range(first_testing_year - 2, first_testing_year))]
# X_val = validation_data.drop("win", axis=1)
# y_val = validation_data["win"]
# dval = xgb.DMatrix(X_val, label=y_val)

# # Define the parameter space
# space = {
#     'max_depth': hp.choice('max_depth', range(3, 10)),
#     'subsample': hp.uniform('subsample', 0.7, 1),
#     'eta': hp.uniform('eta', 0.01, 0.3),
#     'lambda': hp.uniform('lambda', 0, 2),
#     'alpha': hp.uniform('alpha', 0, 1)
# }

# # Define the objective function to minimize
# def objective(params):
#     global model

#     # Update our parameters with the current set of hyperparameters
#     params['objective'] = 'binary:logistic'
#     params['eval_metric'] = 'logloss'
#     params['tree_method'] = 'gpu_hist'

#     model = None
#     for year in range(2003, first_testing_year - 2):
#         train = all_data[all_data["Season"] == year]
#         if train.empty:
#             continue

#         X_train = train.drop("win", axis=1)
#         y_train = train["win"]

#         dtrain = xgb.DMatrix(X_train, label=y_train)

#         if model is None:
#             model = xgb.train(params, dtrain, num_boost_round=1000, 
#                               evals=[(dval, 'validation')], early_stopping_rounds=50)
#         else:
#             model = xgb.train(params, dtrain, num_boost_round=1000, xgb_model=model,
#                               evals=[(dval, 'validation')], early_stopping_rounds=50)

#     # Evaluate the model on the validation set
#     y_pred = model.predict(dval)

#     # Calculate the loss
#     loss = 1 - roc_auc_score(y_val, y_pred)
#     return {'loss': loss, 'status': STATUS_OK}

# # Run the hyperparameter search using the Tree of Parzen Estimators (TPE) algorithm
# trials = Trials()
# best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30, trials=trials)

# print(best)

In [231]:
# BEST PARAMS: {'alpha': 0.12695801173189394, 'eta': 0.010970710588702692, 'lambda': 1.8788973229927652, 'max_depth': 2, 'subsample': 0.8952525703549468}

validation_years = range(first_testing_year - 2, first_testing_year)


iterative_training = False
if iterative_training:
    # Prepare the validation set
    validation_data = all_data[all_data["Season"].isin(validation_years)]
    X_val = validation_data.drop("win", axis=1)
    y_val = validation_data["win"]
    dval = xgboost.DMatrix(X_val, label=y_val)

    model = None

    for year in range(2003, first_testing_year - 2):
        train = all_data[all_data["Season"] == year]
        if train.empty:
            continue

        X_train = train.drop("win", axis=1)
        y_train = train["win"]

        dtrain = xgboost.DMatrix(X_train, label=y_train)

        # Early stopping requires at least one set to be passed in evals
        if model is None:
            model = xgboost.train(params, dtrain, num_boost_round=1000, 
                            evals=[(dval, 'validation')], early_stopping_rounds=50)
        else:
            model = xgboost.train(params, dtrain, num_boost_round=1000, xgb_model=model,
                            evals=[(dval, 'validation')], early_stopping_rounds=50)

else:
    model = xgboost.XGBClassifier(eta=.5, min_child_weight=25, gradient_method="gradient_based")
    # model = xgboost.XGBClassifier()
    model.fit(X_train, y_train)


In [232]:
print(sorted({k: v for k, v in zip(training_cols, model.feature_importances_)}.items(), key=lambda x: -x[1])
)

# # Get feature importance dictionary
# feature_importances = model.get_score(importance_type='weight')

# # Convert to a more interpretable structure (e.g., a sorted list of tuples)
# sorted_importances = sorted(feature_importances.items(), key=lambda x: x[1], reverse=True)

# # If you want to print the feature importances
# for feature, importance in sorted_importances:
#     print(f"Feature: {feature}, Importance: {importance}")

 #XGB

# dtest = xgboost.DMatrix(X_test, label=y_test)

train_predicted_output = pd.DataFrame(
    {
        # "Predicted": model.predict_proba(X_test)[:, 1], 
        "Predicted": np.round(model.predict(X_train), 0).astype(int),
        "Actual": y_train.astype(int)
    }
)

predicted_output = pd.DataFrame(
    {
        # "Predicted": model.predict_proba(X_test)[:, 1], 
        "Predicted": np.round(model.predict(X_test), 0).astype(int),
        "Actual": y_test.astype(int)
    }
)

print(f"Train Accuracy: {np.mean(np.where(train_predicted_output['Predicted'] == train_predicted_output['Actual'], 1, 0))}")
print(f"Test Accuracy: {np.mean(np.where(predicted_output['Predicted'] == predicted_output['Actual'], 1, 0))}")

[('SRS_diff', 0.26643956), ('SOS_diff', 0.05664105), ('W-L%_diff', 0.046077717), ('ORtg_opp_diff', 0.044838622), ('ORtg_diff', 0.027826749), ('mean_post_diff', 0.027032545), ('rating_diff', 0.025426537), ('POM_latest_diff', 0.023453364), ('FT/FGA_opp_diff', 0.022300575), ('STL%_diff', 0.02209979), ('W.2_diff', 0.021077394), ('3PAr_diff', 0.019971902), ('TS%_opp_diff', 0.01912505), ('TOV%_diff', 0.018919963), ('AST%_diff', 0.01890766), ('STL%_opp_diff', 0.018280178), ('TRB%_diff', 0.018114502), ('ORB%_opp_diff', 0.018105906), ('TS%_diff', 0.01763672), ('FTr_opp_diff', 0.017214565), ('FTr_diff', 0.01656459), ('ORB%_diff', 0.016334116), ('3PAr_opp_diff', 0.016065462), ('AST%_opp_diff', 0.015477447), ('eFG%_opp_diff', 0.015145992), ('count_post_diff', 0.014935987), ('L.3_diff', 0.014320661), ('L.1_diff', 0.014227272), ('TOV%_opp_diff', 0.014066086), ('BLK%_opp_diff', 0.013002967), ('count_reg_diff', 0.012888462), ('std_reg_diff', 0.0127007775), ('mean_reg_diff', 0.012498959), ('Pace_diff',

In [233]:
train_predicted_output["Actual"]

0      1
1      0
2      1
3      1
4      0
      ..
595    1
596    0
597    1
598    0
599    0
Name: Actual, Length: 600, dtype: int64

In [243]:
def create_testing_data(df, feature_df, seed_lookup, scaler, year):
    """
    matchup_df: pd.DataFrame
    scaler: StandardScaler
    year: int
    """
    
    matchup_df = df.copy()
    
    matchup_df["TeamID"] = matchup_df["StrongSeed"].apply(
        lambda x: seed_lookup[x]
    )
    matchup_df["OppID"] = matchup_df["WeakSeed"].apply(
        lambda x: seed_lookup[x]
    )
    
    post_season_merged = merge_features_to_games(feature_df, matchup_df, year, training=False)
    df = pd.read_csv("../data/stats/advanced_stats_tournament_teams.csv")

    merged_df = pd.merge(
        pd.merge(
            post_season_merged,
            df.drop(["FirstD1Season", "LastD1Season"], axis=1),
            on=["TeamID", "Season"],
            how="left"
        ),
        df.drop([ "FirstD1Season", "LastD1Season"], axis=1),
        how="left",
        left_on=["OppID", "Season"],
        right_on=["TeamID", "Season"]
    ).rename(columns={
        "TeamID_x": "TeamID", 
        "TeamName_x": "TeamName", 
        "TeamName_y": "OppName"
        }
    ).drop("TeamID_y", axis=1)
    
    merged_df = merged_df[scaler.feature_names_in_]

    X_test = scaler.transform(merged_df)
    X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
    return X_test_tensor

def probabilistic_choice(row):
    return np.random.choice([row['StrongSeed'], row['WeakSeed']], p=[row['prob'], 1-row['prob']])

def get_model_predictions(model, inputs, matchup_df, model_type):
    """
    matchup_df: pd.DataFrame
    scaler: StandardScaler
    year: int
    """
    if model_type == "torch":
        model.eval()  # Set the model to evaluation mode
        with torch.no_grad():  # Inference mode, gradients not needed
            outputs = model(inputs)
    elif model_type == "xgb":
        outputs = model.predict_proba(inputs)[:, 1]
    
    pred_df = matchup_df.copy()
    pred_df["prob"] = outputs
    pred_df["winner"] = pred_df.apply(probabilistic_choice, axis=1)

    return pred_df


def create_feature_df(year, systems=["POM", "NOL", "EBP"]):
    
    seeding_df = mens_data["MNCAATourneySeeds"][
        mens_data["MNCAATourneySeeds"]["Season"] == year
    ]

    feature_df = get_features(mens_data, year=year, systems=systems)
    
    res = pd.merge(seeding_df, feature_df, on="TeamID")

    return res

# Run Simulation Based on Model
Recursively loop through the seeding df to pick teams at each round.

TODO: Swap out function to use baseline model predictions

In [244]:
def create_submission(model, gender, bracket, year, systems=["POM", "NOL", "EBP"]):

    df = mens_data["MNCAATourneySlots"][
        (mens_data["MNCAATourneySlots"]["Season"] == year)
    ]
    seeding_df = mens_data["MNCAATourneySeeds"][
        mens_data["MNCAATourneySeeds"]["Season"] == year
    ]

    seed_lookup = {
        k: v for k, v in zip(
            seeding_df["Seed"], 
            seeding_df["TeamID"]
        )
    }

    # feature_df = get_features(mens_data, year=year, systems=systems)
    feature_df = get_advanced_features(mens_data=mens_data, year=year)

    def simulate(_df, round_num=0, results=None):
        
        df = _df.copy()
        
        if results is None:
            results = {}
            
        if round_num > 6:
            return results
        
        if round_num == 0: # Play-IN games
            temp_df = df[~df["Slot"].str.startswith("R")]
        
        else:
            temp_df = df[df["Slot"].str.startswith(f"R{round_num}")]
            temp_df["StrongSeed"] = temp_df["StrongSeed"].apply(
                lambda x: results[x] if x in results else x
            )
            temp_df["WeakSeed"] = temp_df["WeakSeed"].apply(
                lambda x: results[x] if x in results else x
            )
            

        inputs = create_testing_data(temp_df, feature_df, seed_lookup, scaler, year)
        temp_df = get_model_predictions(model, inputs, temp_df, model_type="xgb")

        for k, v in zip(temp_df["Slot"], temp_df["winner"]):
            results[k] = v
        
        results = simulate(df, round_num + 1, results)
        return results
    
    round_winner = simulate(df)
    
    df["Team"] = df["Slot"].apply(lambda x: round_winner[x])
    df["Tournament"] = gender
    df["Bracket"] = bracket
    
    return df[["Tournament", "Bracket", "Slot", "Team"]]

In [245]:
create_submission(model=model, gender="M", bracket=1, year=2023)

KeyError: 'TeamID'

In [72]:
# Add multiprocessing to this function or offload to GPU
brackets = {}
for i in tqdm(range(200)):
    brackets[i] = create_submission(model=model, gender="M", bracket=i, year=2023)

100%|██████████| 200/200 [18:40<00:00,  5.60s/it]


In [19]:
# def worker_function(i):
#     return i, create_submission(model=model, gender="M", bracket=i, year=2023)


# from concurrent.futures import ProcessPoolExecutor, as_completed
# from tqdm.notebook import tqdm as tq

# # Assuming your create_submission function is defined elsewhere and accessible

# def worker_function(i):
#     return create_submission(model="YourModel", gender="M", bracket=i, year=2023)


# def run_parallel():
#     brackets = {}
#     pool_size = multiprocessing.cpu_count()  # Number of processes to create

#     with multiprocessing.Pool(pool_size) as pool:
#         # Map the range of inputs to the worker function
#         # The tqdm call is moved here to track progress of the parallel tasks
#         results = list(tqdm(pool.imap(worker_function, range(100)), total=100))

#     # Populate the brackets dictionary with the results
#     for i, result in results:
#         brackets[i] = result

#     return brackets

# if __name__ == "__main__":
#     # Using a smaller number of processes in Jupyter might be more stable
#     pool_size = min(multiprocessing.cpu_count(), 4)  
#     pool = multiprocessing.Pool(pool_size)
    
#     # Use a list to collect the results
#     results = []
#     for _ in tqdm(pool.imap_unordered(worker_function, range(100)), total=100):
#         results.append(_)
    
#     pool.close()
#     pool.join()

#     # Now process the results
#     brackets = {result['bracket']: result for result in results}

In [73]:
df = pd.concat(brackets.values(), ignore_index=False)

In [74]:
# df.to_csv("../data/submissions/model_v2_5000_runs.csv")

In [75]:
df

Unnamed: 0,Tournament,Bracket,Slot,Team
2385,M,0,R1W1,W01
2386,M,0,R1W2,W02
2387,M,0,R1W3,W03
2388,M,0,R1W4,W04
2389,M,0,R1W5,W05
...,...,...,...,...
2447,M,199,R6CH,W01
2448,M,199,W16,W16a
2449,M,199,X16,X16b
2450,M,199,Y11,Y11a


In [82]:
def get_successors(seed):
    net = networkx.DiGraph()

    slot_df = mens_data["MNCAATourneySlots"][
        (mens_data["MNCAATourneySlots"]["Season"] == 2023)
    ]

    net.add_edges_from([i for i in zip(slot_df["WeakSeed"].values, slot_df["Slot"].values)])
    net.add_edges_from([i for i in zip(slot_df["StrongSeed"].values, slot_df["Slot"].values)])

    successors = [i[1] for i in dfs_edges(net, seed)]

    return successors

In [84]:
def check_bracket_distribution(df, seed, verbose=False):
    """
    df: pd.DataFrame
        This is the submission dataframe for all brackets
    team: str
        a seed representing the team
    """
    successors = get_successors(seed)
    path_df = df[df["Slot"].isin(successors)]
    path_df = path_df.groupby("Slot")["Team"].apply(lambda x: np.mean(x == seed))

    if verbose:
        seeding_teams = pd.merge(mens_data["MNCAATourneySeeds"][
            mens_data["MNCAATourneySeeds"]["Season"] == 2023
        ], mens_data["MTeams"], on="TeamID")

        try:
            team_name = seeding_teams[seeding_teams["Seed"] == seed]["TeamName"].values[0]
        except:
            team_name=""

        print("*"*50)
        print(f"{team_name} CHANCE TO WIN IT ALL: {path_df[-1]}")
        print("*"*50)

    return path_df

In [87]:
tournament_2023 = pd.merge(
    mens_data["MNCAATourneySeeds"][
        mens_data["MNCAATourneySeeds"]["Season"] == 2023
    ], 
    mens_data["MTeams"], 
    on="TeamID"
)

In [88]:
tournament_2023["round_win_prob"]= tournament_2023["Seed"].apply(lambda x: check_bracket_distribution(df, x)[:6].values)

tournament_2023[
    [
        "round_of_32",
        "sweet16",
        "elite8",
        "final4",
        "final",
        "champ",
    ]
] = list(tournament_2023["round_win_prob"])

In [89]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also

    display(tournament_2023.sort_values(by=["champ", "final", "final4", "elite8", "sweet16", "round_of_32"], ascending=False))

Unnamed: 0,Season,Seed,TeamID,TeamName,FirstD1Season,LastD1Season,round_win_prob,round_of_32,sweet16,elite8,final4,final,champ
0,2022,W01,1124,Baylor,1985,2024,"[0.965, 0.88, 0.74, 0.64, 0.535, 0.355]",0.965,0.88,0.74,0.64,0.535,0.355
34,2022,Y01,1242,Kansas,1985,2024,"[0.99, 0.57, 0.315, 0.235, 0.215, 0.195]",0.99,0.57,0.315,0.235,0.215,0.195
3,2022,W04,1417,UCLA,1985,2024,"[1.0, 0.93, 0.165, 0.16, 0.14, 0.11]",1.0,0.93,0.165,0.16,0.14,0.11
17,2022,X01,1211,Gonzaga,1985,2024,"[0.985, 0.565, 0.275, 0.245, 0.08, 0.075]",0.985,0.565,0.275,0.245,0.08,0.075
53,2022,Z03,1397,Tennessee,1985,2024,"[0.565, 0.225, 0.095, 0.055, 0.045, 0.04]",0.565,0.225,0.095,0.055,0.045,0.04
54,2022,Z04,1228,Illinois,1985,2024,"[0.9, 0.375, 0.14, 0.095, 0.06, 0.03]",0.9,0.375,0.14,0.095,0.06,0.03
51,2022,Z01,1112,Arizona,1985,2024,"[0.93, 0.45, 0.265, 0.155, 0.07, 0.025]",0.93,0.45,0.265,0.155,0.07,0.025
52,2022,Z02,1437,Villanova,1985,2024,"[0.87, 0.275, 0.165, 0.085, 0.06, 0.02]",0.87,0.275,0.165,0.085,0.06,0.02
21,2022,X05,1163,Connecticut,1985,2024,"[0.96, 0.365, 0.17, 0.13, 0.025, 0.015]",0.96,0.365,0.17,0.13,0.025,0.015
45,2022,Y12,1350,Richmond,1985,2024,"[0.76, 0.265, 0.11, 0.08, 0.02, 0.015]",0.76,0.265,0.11,0.08,0.02,0.015


In [37]:
test

Unnamed: 0,rating_team,EBP_latest_team,NOL_latest_team,POM_latest_team,POM_rolling_team,NOL_rolling_team,EBP_rolling_team,count_reg_team,mean_reg_team,std_reg_team,...,POM_rolling_opp,NOL_rolling_opp,EBP_rolling_opp,count_reg_opp,mean_reg_opp,std_reg_opp,count_post_opp,mean_post_opp,std_post_opp,win
1247,20.255540,1,2,1,1.0,3.0,1.0,893.0,0.687570,0.463744,...,5.8,15.8,6.8,587.0,0.664395,0.472604,27.0,0.592593,0.500712,True
1248,18.432270,6,12,7,6.8,10.8,4.8,256.0,0.691406,0.462818,...,51.6,67.8,55.8,505.0,0.572277,0.495239,6.0,0.166667,0.408248,True
1249,17.149414,5,4,6,5.6,6.4,6.0,1111.0,0.665167,0.472145,...,29.6,18.0,24.6,139.0,0.589928,0.493625,0.0,0.000000,0.000000,False
1250,15.476536,106,81,95,94.4,77.2,109.2,632.0,0.697785,0.459582,...,20.6,22.6,19.8,750.0,0.850667,0.356655,60.0,0.633333,0.485961,False
1251,15.390988,19,14,21,20.6,22.6,19.8,750.0,0.850667,0.356655,...,60.0,31.0,61.8,332.0,0.617470,0.486739,4.0,0.000000,0.000000,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1309,2.134960,218,254,260,250.6,234.2,213.2,108.0,0.324074,0.470210,...,43.4,63.4,43.2,581.0,0.736661,0.440824,30.0,0.633333,0.490133,False
1310,-0.380099,264,262,281,275.8,268.6,259.2,92.0,0.391304,0.490716,...,15.0,5.8,15.8,934.0,0.770878,0.420493,76.0,0.723684,0.450146,False
1311,-1.725279,352,352,354,353.6,348.2,349.4,90.0,0.444444,0.499688,...,195.0,180.2,173.4,59.0,0.644068,0.482905,1.0,0.000000,0.000000,False
1312,-3.448262,300,287,324,327.2,283.2,305.6,32.0,0.531250,0.507007,...,277.2,263.2,275.0,644.0,0.537267,0.498997,7.0,0.285714,0.487950,True
