## Modeling expected log loss from submission
A submission file gives a predicted probability for each possible matchup. We can use those probabilities to simulate hundreds of brackets and calculate a theoretical range of expected log loss values. We can also estimate the likelihood of each team winning the NCAA tournament. The resulting brackets give an interesting suite of information under the direct assumption that **your model is a good estimate of reality**. Comparing some of these results to the actual tournament results can be a useful sanity check that your model seems reasonable.

This simulation runs the same code used for this [Bracket Builder](https://github.com/armstrys/NCAA_BracketBuilder). I **think** it correctly simulates all years of the tournament, but if you need to verify that, the bracket tool can at least give you a visual QC.
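
To make the idea concrete before the full code, here is a minimal sketch of the same procedure on a made-up four-team bracket. The team labels, the probabilities in `p`, and the helper names are all hypothetical and are not part of the notebook's code. Each game is decided by a random draw against the predicted probability, and the bracket's log loss is the mean of `-log` of the probability that was assigned to each actual winner.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical win probabilities: p[(a, b)] is the chance that team a beats team b.
p = {("A", "D"): 0.80, ("B", "C"): 0.55,
     ("A", "B"): 0.60, ("A", "C"): 0.65,
     ("D", "B"): 0.35, ("D", "C"): 0.40}

def win_prob(a, b):
    """Look up P(a beats b), flipping the stored pair when needed."""
    return p[(a, b)] if (a, b) in p else 1.0 - p[(b, a)]

def play(a, b):
    """Simulate one game; return the winner and the probability given to that winner."""
    pa = win_prob(a, b)
    if rng.random() < pa:
        return a, pa
    return b, 1.0 - pa

def simulate_bracket():
    """Play both semifinals and the final; return the champion and the mean log loss."""
    w1, p1 = play("A", "D")
    w2, p2 = play("B", "C")
    champ, p3 = play(w1, w2)
    return champ, float(np.mean(-np.log([p1, p2, p3])))

results = [simulate_bracket() for _ in range(1000)]
champions = [c for c, _ in results]
losses = np.array([loss for _, loss in results])
print(f"mean simulated log loss: {losses.mean():.3f}")
```

The spread of `losses` is the "theoretical range" discussed above: even when the probabilities are exactly right, the score varies from bracket to bracket.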

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import lognorm, kstest
from tqdm import tqdm


## Change inputs here for different data (Men's vs Women's or submission file)#######

mw = 'M'   # 'M' for men's and 'W' for women's
submission_file = '../input/deeplearning-ncaam-embedskill/msubmission.csv' # path to your submission
# true log loss for each test season, listed in the same order as the seasons in the submission
truelosses = [0.484660,
              0.546481,
              0.507846,
              0.543527,
              0.484272]

#############

path = Path('../input/ncaam-march-mania-2021/')  #path for data files
seasons_file = path/(mw+'Seasons.csv')
seeds_file = path/(mw+'NCAATourneySeeds.csv')
slots_file = path/(mw+'NCAATourneySlots.csv')
teams_file = path/(mw+'Teams.csv')

teams_dict = pd.read_csv(teams_file).set_index('TeamID')['TeamName'].to_dict() # Create team dictionary to go from team ID to team name

##loading function
def prep(df,slots,seeds,season_info,season):
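    # Keep only this season's rows. Without this filter, the reversed IDs built below
    # relabel other seasons' matchups with this season and can shadow the correct row.
    df = df.loc[df['Season'].astype(int)==season,
                ['ID','Season','LeftTeamID','RightTeamID','Pred']].reset_index(drop=True)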
    season_info = season_info.loc[season_info['Season']==season].copy()
    region_dict = {'W':season_info['RegionW'].values[0],
                   'X':season_info['RegionX'].values[0],
                   'Y':season_info['RegionY'].values[0],
                   'Z':season_info['RegionZ'].values[0],
                   }

    df_rev = df[['ID','Season','RightTeamID','LeftTeamID','Pred']].copy()
    df_rev.columns = ['ID','Season','LeftTeamID','RightTeamID','Pred']
    df_rev['Pred'] = 1-df_rev['Pred']
    df_rev['ID'] = str(season)+'_'+ df_rev['LeftTeamID'].astype(str)+'_'+df_rev['RightTeamID'].astype(str)
    df = pd.concat([df,df_rev])

    seeds = seeds.loc[seeds['Season']==season,:].copy()
    seeds.drop(columns='Season',inplace=True)
    seeds['Region'] = seeds['Seed'].str.extract(r'([WXYZ]).*')
    seeds['Region'] = seeds['Region'].replace(region_dict)  # assign back; inplace replace on a column is unreliable in newer pandas
    seeds['Number'] = seeds['Seed'].str.extract(r'[WXYZ](.*)')
    seeds['NewSeed'] = seeds['Region']+'-'+seeds['Number']
    
    oldseeds_dict = seeds.set_index('Seed')['NewSeed'].to_dict()
    seeds_dict = seeds.set_index('NewSeed')['TeamID'].to_dict()

    if mw == 'W': # women's slots csv does not have a Season column, so we fake it
        slots['Season']=season
    slots = slots.loc[slots['Season']==season,:].copy()
    slots.drop(columns='Season',inplace=True)
    slots['StrongSeed'] = slots['StrongSeed'].replace(oldseeds_dict)
    slots['WeakSeed'] = slots['WeakSeed'].replace(oldseeds_dict)
    slots['Round'] = slots['Slot'].str.extract(r'(R.)[WXYZC].').fillna('R0')
    slots['Game'] = slots['Slot'].str.extract(r'.*([WXYZC].*)')

    return df, slots, seeds_dict, season


In [None]:
def run_simulation(submission_orig,trialsSeason):
    winners = []
    seasons = []
    loglosses = []
    
    for season in submission_orig['Season'].unique().astype(int):

        print(season)

        submission, slots, seeds_dict, season = prep(submission_orig.copy(),
                                                                    slots_orig.copy(),
                                                                    seeds_orig.copy(),
                                                                    season_info_orig.copy(),
                                                                    season)
        for _ in (range(trialsSeason)):

            games = slots.copy()
            games['WinnerSeed'] = ''
            games['StrongName'] = ''
            games['WeakName'] = ''
            games['WinnerName'] = ''
            games['StrongID'] = ''
            games['WeakID'] = ''
            games['WinnerID'] = ''
            games.loc[:,'Pred'] = np.nan
            game_cols = games.columns.to_list()
            new_cols = [game_cols[8]]+game_cols[6:8]+game_cols[4:6]+[game_cols[12]]+game_cols[9:12]+game_cols[0:4]
            games = games[new_cols]
            games.sort_values('Round',inplace=True)
            games.reset_index(inplace=True,drop=True)

            def update_games(games,round,next_round):

                for idx,row in games[games['Round']==round].iterrows():
                    games.loc[idx,'StrongID'] = seeds_dict[row['StrongSeed']]
                    games.loc[idx,'WeakID'] = seeds_dict[row['WeakSeed']]
    #                 games.loc[idx,'StrongName'] = teams_dict[games.loc[idx,'StrongID']]
    #                 games.loc[idx,'WeakName'] = teams_dict[games.loc[idx,'WeakID']]
    #                 games.sort_values(by=['Round','StrongSeed'],inplace=True)


                for idx,row in games[games['Round']==round].iterrows():

                    winThresh = np.random.rand()

                    game = row['Game']
                    id = (str(season)+'_'+ str(row['StrongID'])+'_'+ str(row['WeakID']))
                    pred = submission.loc[submission['ID']==id,'Pred'].values[0]
                    if pred> winThresh:
                        winslot = row['StrongSeed']
                        winID = row['StrongID']
                        loseslot = row['WeakSeed']
                        loseID = row['WeakID']
                    else:
                        winslot = row['WeakSeed']
                        winID = row['WeakID']
                        loseslot = row['StrongSeed']
                        loseID = row['StrongID']
                        pred = 1 - pred

                    games.loc[idx,'WinnerSeed'] = winslot
                    games.loc[idx,'WinnerID'] = winID
                    games.loc[idx,'Pred'] = pred

                    if round == 'R0':
                        next_slot = game
                        games.loc[games['Round']==next_round,'StrongSeed'] = (games.loc[games['Round']==next_round,'StrongSeed']
                                                                                .replace({next_slot:winslot}))
                        games.loc[games['Round']==next_round,'WeakSeed'] = (games.loc[games['Round']==next_round,'WeakSeed']
                                                                                .replace({next_slot:winslot}))
                    elif round == 'R5':
                        if game == 'X':
                            games.loc[games['Round']==next_round,'StrongSeed'] = winslot
                        else:
                            games.loc[games['Round']==next_round,'WeakSeed'] = winslot

                    else:
                        next_slot = round+game
                        games.loc[games['Round']==next_round,'StrongSeed'] = (games.loc[games['Round']==next_round,'StrongSeed']
                                                                                .replace({next_slot:winslot}))
                        games.loc[games['Round']==next_round,'WeakSeed'] = (games.loc[games['Round']==next_round,'WeakSeed']
                                                                                .replace({next_slot:winslot}))

                return games


            if mw == 'M': # no play-in round for the women's tourney
                games = update_games(games,'R0','R1')

            games = update_games(games,'R1','R2')

            games = update_games(games,'R2','R3')

            games = update_games(games,'R3','R4')

            games = update_games(games,'R4','R5')

            games = update_games(games,'R5','R6')

            games = update_games(games,'R6','')

            winner = teams_dict[games.loc[games['Round']=='R6','WinnerID'].values[0]]
            games['logloss'] = -np.log(games['Pred'])
            logloss = np.mean(games.loc[games['Round']!= 'R0','logloss'])
    
            loglosses.append(logloss)
            winners.append(winner)
            seasons.append(str(season))
            
#     display(loglosses,seasons)
    
    return winners, seasons, loglosses


## Run Simulation for My Model
First we will run the simulation for one of my bracket submissions and see what the resulting **expected** log loss distribution is. Each plot also has 5 red lines on it which are the true log losses from this submission as calculated using [this notebook](https://www.kaggle.com/mmotoki/men-s-march-madness-2021-leaderboard-analyzer?scriptVersionId=55662591). 

In [None]:
###
season_info_orig = pd.read_csv(seasons_file)
seeds_orig = pd.read_csv(seeds_file)
slots_orig = pd.read_csv(slots_file)

submission_orig = pd.read_csv(submission_file)
submission_orig[['Season','LeftTeamID','RightTeamID']] = submission_orig['ID'].str.split('_',expand=True)
# submission_orig.head()

In [None]:

ntrials = 200
winners, seasons, loglosses = run_simulation(submission_orig,ntrials)

with open('winners.dat', 'wb') as f:
    pickle.dump(winners, f)

with open('seasons.dat', 'wb') as f:
    pickle.dump(seasons, f)


## Visualize the results

Here we will look at the simulated tournament winners to see which teams most frequently won the entire tournament. The team with the largest slice of the pie for a given season is the model's favorite to win that tournament.
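
As a toy illustration of how a list of simulated champions turns into win shares and odds, here is a small sketch. The team labels and counts are invented for the example, and `value_counts` is used as a shortcut rather than the `groupby` in the cell below.

```python
import pandas as pd

# Hypothetical simulated champions for one season (10 trials)
winners_demo = pd.Series(["Team A"] * 5 + ["Team B"] * 3 + ["Team C"] * 2)

summary = winners_demo.value_counts().to_frame("Count")
summary["Percent"] = summary["Count"] / len(winners_demo)  # share of simulated titles
summary["Odds"] = 1 / summary["Percent"]                   # "1 in N" chance of winning it all
print(summary)
```

With only a few hundred trials per season, the shares for long shots are noisy, so small slices of the pie should be read loosely.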

In [None]:
with open('winners.dat','rb') as f:
    winners = pickle.load(f)

with open('seasons.dat','rb') as f:
    seasons = pickle.load(f)

In [None]:
df = pd.DataFrame({'Season':seasons,'Winners':winners})

seasongrouped = df.groupby('Season')

for name, group in seasongrouped:
#     print(name)
#     display(group['Winners'])

    regroup = group.groupby('Winners').count()
    regroup['Odds'] = ntrials/regroup['Season']
    regroup['Percent'] = 1/regroup['Odds']
    regroup.columns = ['Count','Odds','Percent']
    regroup['Count'].plot.pie()
    
    plt.title(name)
    plt.show()
    display(regroup.sort_values('Odds',ascending=True).iloc[0:10])
#     group['Winners'].groupby('Winners').count().plot.pie()
#     plt.show()


## Visualize theoretical range of log losses

Here we can see the expected range of log losses generated from our tournament model.

In [None]:
df = pd.DataFrame({'Season':seasons,'losses':loglosses})

ax = sns.histplot(x=loglosses, hue=seasons, kde=False)
x = np.arange(0,1,.01)

probs = []
# truelosses is assumed to be listed in the same order as the seasons in df
for s, tL in zip(df['Season'].unique(), truelosses):

    params = lognorm.fit(df.loc[df['Season']==s,'losses'].values)
    probs.append(lognorm.cdf(tL,*params))  # cumulative probability of the true loss under the fit

    ax = sns.lineplot(x=x, y=lognorm.pdf(x, *params))
    plt.axvline(tL,ls='--',c='r')

plt.xlim([.3,.9])
plt.show()


The above plot shows the theoretical distribution of losses from about 1,000 random brackets (200 per season) generated from the submission file. The red lines are the true losses calculated using [this notebook](https://www.kaggle.com/mmotoki/men-s-march-madness-2021-leaderboard-analyzer). Note that the true losses fall squarely within the distribution of expected losses. When this same exercise is done with a leaky model, the true losses can end up being far lower than the modeled losses. In that case the true outcome looks highly unlikely under the model's own probabilities, because the model is effectively based upon the true outcome rather than being a good predictive model!

It would be more helpful to quantify how likely it is that the true log losses came from the expected distribution. In theory, if we had a large sample of true losses, we would like them to mimic the distribution of the expected losses. In the example above the true losses are confined to a narrow range, which might suggest that our model is still not accurately characterizing the probabilities. Because we only have 5 test seasons in this case, it is hard to tell whether that narrow spread arose by chance or whether the two distributions are actually different.
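
One simple way to put a number on where a true loss sits, without assuming any distribution shape, is its empirical percentile among the simulated losses. The sketch below uses synthetic stand-in data and a made-up true loss, and `empirical_percentile` is a hypothetical helper rather than something from this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for one season's simulated log losses (synthetic, not from the notebook)
simulated = rng.lognormal(mean=np.log(0.55), sigma=0.12, size=200)
true_loss = 0.50  # hypothetical true loss for that season

def empirical_percentile(samples, value):
    """Fraction of simulated losses at or below the observed value."""
    samples = np.asarray(samples)
    return float(np.mean(samples <= value))

pct = empirical_percentile(simulated, true_loss)
print(f"{pct:.0%} of simulated brackets scored at or below {true_loss}")
```

A percentile very close to 0 would mean the model scored better than it "expected" to, which is the pattern described above for leaky models.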

## Quantifying the relationship between true and modeled losses

I had been looking for a valid method to quantify the relationship between the modeled losses and the true losses. Per the suggestion of @goodspellr, below I use a [Kolmogorov–Smirnov](https://en.m.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) test to test the hypothesis that the true losses were drawn from the distribution of the modeled losses (i.e. the modeled losses could be a valid model of the tournament). @goodspellr points out in [this comment](https://www.kaggle.com/c/ncaam-march-mania-2021/discussion/224698#1233172) that each tournament may need a separate cumulative distribution function to describe the modeled losses. In order to compare the 5 tournaments on a common scale, each true loss is converted to a cumulative probability using the lognormal fit to that year's modeled losses (the `probs` list computed above). If the model were a valid description of the tournament, those 5 values would look like draws from a uniform distribution on [0, 1], so they are compared to a uniform distribution for the KS test. In this case the p-value is greater than any reasonable significance threshold we could choose, so we cannot reject that hypothesis.

In [None]:
statistic, pvalue = kstest(rvs=probs, cdf='uniform')

print(f'D-value of KS-test: {statistic}\nP-value: {pvalue}')
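
To get a feel for how this test behaves with only five values, here is a sketch on synthetic data. The lognormal parameters and both sets of "true" losses are invented for illustration, and `ks_pvalue` is a hypothetical helper. It maps losses through a known modeled CDF and compares the result to a uniform distribution, once for losses that really come from the model and once for losses far below it.

```python
import numpy as np
from scipy.stats import lognorm, kstest

rng = np.random.default_rng(2)

# Synthetic "modeled" loss distribution (shape, loc, scale chosen arbitrarily)
shape, loc, scale = 0.12, 0.0, 0.55

def ks_pvalue(true_losses):
    """Map losses through the modeled CDF and compare the result to Uniform(0, 1)."""
    u = lognorm.cdf(true_losses, shape, loc=loc, scale=scale)
    return kstest(u, "uniform").pvalue

# Case 1: five "true" losses really drawn from the modeled distribution
consistent = lognorm.rvs(shape, loc=loc, scale=scale, size=5, random_state=rng)
# Case 2: five losses far below what the model expects (as with a leaky model)
too_low = np.array([0.30, 0.31, 0.32, 0.33, 0.34])

p_consistent = ks_pvalue(consistent)
p_too_low = ks_pvalue(too_low)
print(f"consistent: p = {p_consistent:.3f}, too low: p = {p_too_low:.2e}")
```

In this toy setup the extreme case is rejected easily, but with five samples a milder mismatch would likely go undetected, so a large p-value should not be read as strong evidence that the model is right.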

## Some other theoretical examples
We can also make a couple of theoretical examples to see what we know **isn't** reality. Below we will model the tournament with every game being a 51/49 toss-up or every game being a 90/10 advantage for one team. These both lead to very different log loss distributions than what is common both in my models and on the historic leaderboard. That is good because it means that there is actually some signal to model!
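
For these constant-probability cases the center of the simulated distribution can be worked out by hand. If every game really is won with probability p by the team given probability p, the expected per-game log loss is the binary entropy, -(p ln p + (1 - p) ln(1 - p)). The helper name below is my own, shown only as a quick check on what the simulations should produce.

```python
import numpy as np

def expected_log_loss(p):
    """Expected per-game log loss when every game is truly won with probability p."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

for p in (0.51, 0.90):
    print(f"p = {p:.2f}: expected log loss = {expected_log_loss(p):.3f}")
```

This gives roughly 0.693 for the 51/49 case and 0.325 for the 90/10 case, which sit on either side of the true losses listed at the top of the notebook (about 0.48 to 0.55).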

In [None]:
submission50 = submission_orig.copy()
submission50['Pred'] = .51

winners, seasons, loglosses = run_simulation(submission50,20)

In [None]:
ax = sns.histplot(x=loglosses, hue=seasons, kde=False)
plt.xlabel('log loss')
for tL in truelosses:  # mark the true losses for comparison
    plt.axvline(tL,ls='--',c='r')
plt.show()

In [None]:
submission90 = submission_orig.copy()
submission90['Pred'] = .90  # 90/10 advantage, matching the description above

winners, seasons, loglosses = run_simulation(submission90,20)

In [None]:
ax = sns.histplot(x=loglosses, hue=seasons, kde=False)
plt.xlabel('log loss')
for tL in truelosses:  # mark the true losses for comparison
    plt.axvline(tL,ls='--',c='r')
plt.show()

## Closing thoughts
This started as a fun way for me to simulate some tournament brackets, but I think it turned into an interesting method to sanity check a tournament model. What you see in this notebook is about the extent of the work, so any additional insight, comments, or questions would be appreciated.

Honestly, at this stage, I am unsure if even an overfit model would produce true losses that were statistically distinguishable from the modeled losses. Further testing would be needed to show any real utility here outside of just understanding the inherent uncertainty in Kaggle scores caused by randomness in the NCAA tournament.