# March Madness Modeling

https://www.kaggle.com/c/mens-march-mania-2022

**Not particularly useful for our analysis:**
* pd.read_csv('MTeams.csv')
* pd.read_csv('MSeasons.csv')
* pd.read_csv('Cities.csv')
* pd.read_csv('MGameCities.csv')
* pd.read_csv('Conferences.csv')
* pd.read_csv('MTeamCoaches.csv')
* pd.read_csv('MTeamConferences.csv')
* pd.read_csv('MSecondaryTourneyTeams.csv')
* pd.read_csv('MSecondaryTourneyCompactResults.csv')

## Overall Plan:

The submission needs to contain every possible matchup for the last five years.

Create a dataframe of predictor variables for each team for each year.

The predictor variables will include the following:

* The seed at the end of the regular season.

Use the MRegularSeasonDetailedResults to create the following dataset.

Each feature will simply be the value of the team minus the corresponding value of the other team.

Note that we actually have to make two entries for each sample in the dataset.

**Here is the first entry:**

**Winning Team Entry**
* Season = Season
* DayNum = DayNum
* TeamID = WTeamID
* score = WScore - LScore
* field-goal-pct = (WFGM / WFGA) - (LFGM / LFGA)
* three-point-pct = (WFGM3 / WFGA3) - (LFGM3 / LFGA3)
* free-throw-pct = (WFTM / WFTA) - (LFTM / LFTA)
* off-rebounds = WOR - LOR
* def-rebounds = WDR - LDR
* assists = WAst - LAst
* turnovers = WTO - LTO
* steals = WStl - LStl
* blocks = WBlk - LBlk
* fouls = WPF - LPF
* won = 1

**Losing Team Entry**
* Season = Season
* DayNum = DayNum
* TeamID = LTeamID
* score = LScore - WScore
* field-goal-pct = (LFGM / LFGA) - (WFGM / WFGA)
* three-point-pct = (LFGM3 / LFGA3) - (WFGM3 / WFGA3)
* free-throw-pct = (LFTM / LFTA) - (WFTM / WFTA)
* off-rebounds = LOR - WOR
* def-rebounds = LDR - WDR
* assists = LAst - WAst
* turnovers = LTO - WTO
* steals = LStl - WStl
* blocks = LBlk - WBlk
* fouls = LPF - WPF
* won = 0

At this point we have created two entries for every row in the MRegularSeasonDetailedResults dataset. We aggregate the data for each team for each year. Then we create a weighted average of the results with the most recent results being weighed more highly. We now have just one row per team per year.  

To that resulting row we add the seed for each team from 'MNCAATourneySeeds.csv'.

Then we add the ranking for each team from 'MMasseyOrdinals.csv'. We grab the ranking from the RankingDayNum is equal to 133.

Now we need to bring the tournament results into the mix. There are two teams for each game in the tournament. Each team has a set of variables resulting from the regular season. To lower the number of attributes we subtract one set of variables from the other. Then, we can perform the classification. 

In [7]:
import pandas as pd
import os

In [8]:
files = [ file for file in os.listdir() if file.endswith('.csv') ]
print(files)

['538ratingsMen.csv', 'Cities.csv', 'Conferences.csv', 'MConferenceTourneyGames.csv', 'MGameCities.csv', 'MMasseyOrdinals.csv', 'MNCAATourneyCompactResults.csv', 'MNCAATourneyDetailedResults.csv', 'MNCAATourneySeedRoundSlots.csv', 'MNCAATourneySeeds.csv', 'MNCAATourneySlots.csv', 'MRegularSeasonCompactResults.csv', 'MRegularSeasonDetailedResults.csv', 'MSampleSubmissionStage1.csv', 'MSeasons.csv', 'MSecondaryTourneyCompactResults.csv', 'MSecondaryTourneyTeams.csv', 'MTeamCoaches.csv', 'MTeamConferences.csv', 'MTeams.csv', 'MTeamSpellings.csv', 'Random Predictions.csv', 'submission.csv']


## Creating features from regular season

In [9]:
# Useful for our analysis.
regResults = pd.read_csv('MRegularSeasonDetailedResults.csv')

In [10]:
def divideByZero(n1, n2):
    if n2 == 0:
        return 0
    else:
        return n1 / n2

In [11]:
nestedList = []

for label, row in regResults.iterrows():
    
    # Values corresponding to both winning and losing rows.
    winFieldGoalPct = divideByZero(row['WFGM'], row['WFGA'])
    loseFieldGoalPct = divideByZero(row['LFGM'], row['LFGA'])
    winThreePointPct = divideByZero(row['WFGM3'], row['WFGA3'])
    loseThreePointPct = divideByZero(row['LFGM3'], row['LFGA3'])
    winFreeThrowPct = divideByZero(row['WFTM'], row['WFTA'])
    loseFreeThrowPct = divideByZero(row['LFTM'], row['LFTA'])
    
    # We create a row for the winning team.
    score = row['WScore'] - row['LScore']
    fieldGoalPct = winFieldGoalPct - loseFieldGoalPct
    threePointPct = winThreePointPct - loseThreePointPct
    freeThrowPct = winFreeThrowPct - loseFreeThrowPct
    offRebounds = row['WOR'] - row['LOR']
    defRebounds = row['WDR'] - row['LDR']
    assists = row['WAst'] - row['LAst']
    turnovers = row['WTO'] - row['LTO']
    steals = row['WStl'] - row['LStl']
    blocks = row['WBlk'] - row['LBlk']
    fouls = row['WPF'] - row['LPF']
    won = 1
    
    winValues = [ row['Season'], row['DayNum'], row['WTeamID'], score, fieldGoalPct, 
                 threePointPct, freeThrowPct, offRebounds, defRebounds, assists, turnovers,
                 steals, blocks, fouls, won ]
    
    nestedList.append(winValues)
    
    
    # We create a row for the losing team.
    score = row['LScore'] - row['WScore']
    fieldGoalPct = loseFieldGoalPct - winFieldGoalPct
    threePointPct = loseThreePointPct - winThreePointPct
    freeThrowPct = loseFreeThrowPct - winFreeThrowPct
    offRebounds = row['LOR'] - row['WOR']
    defRebounds = row['LDR'] - row['WDR']
    assists = row['LAst'] - row['WAst']
    turnovers = row['LTO'] - row['WTO']
    steals = row['LStl'] - row['WStl']
    blocks = row['LBlk'] - row['WBlk']
    fouls = row['LPF'] - row['WPF']
    won = 0
    
    loseValues = [ row['Season'], row['DayNum'], row['LTeamID'], score, fieldGoalPct, 
                 threePointPct, freeThrowPct, offRebounds, defRebounds, assists, turnovers,
                 steals, blocks, fouls, won ]
    
    nestedList.append(loseValues)

In [12]:
colNames = ['Season', 'DayNum', 'TeamID', 'score', 'field-goal-pct', 'three-point-pct',
           'free-throw-pct', 'off-rebounds', 'def-rebounds', 'assists', 'turnovers', 
           'steals', 'blocks', 'fouls', 'won']

df = pd.DataFrame(data=nestedList, columns=colNames)

In [13]:
def getExponentialAverage(series, alpha):
    if len(series) == 0:
        print(series)
        raise Exception('The series has length 0')
    return series.mean()
    # return series.ewm(alpha=alpha).mean().iloc[-1]

In [14]:
groupedData = df.groupby(['Season', 'TeamID']).apply(getExponentialAverage, 0.05)

In [15]:
# Useful for our analysis.
tourneyResults = pd.read_csv('MNCAATourneyDetailedResults.csv')

# We create a new dataset with rows for both winners and losers.
nestedList = []

for index, row in tourneyResults[ ['Season', 'WTeamID', 'LTeamID'] ].iterrows():
    
    # Create the winning row with season, team, other team, and won.
    winValues = [ row['Season'], row['WTeamID'], row['LTeamID'], 1 ]
    
    # Create the losing row with the same values.
    loseValues = [ row['Season'], row['LTeamID'], row['WTeamID'], 0 ]
    
    nestedList.append(winValues)
    nestedList.append(loseValues)
    
tourneyLabels = pd.DataFrame(data=nestedList, columns=['Season', 'TeamID', 'OtherTeam', 'Won'])

In [16]:
groupedData.drop(columns=['TeamID', 'Season'], inplace=True)
# groupedData.reset_index(inplace=True)
groupedData.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,DayNum,score,field-goal-pct,three-point-pct,free-throw-pct,off-rebounds,def-rebounds,assists,turnovers,steals,blocks,fouls,won
Season,TeamID,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
2003,1102,72.464286,0.25,0.028111,-0.012117,-0.07706,-5.428571,-3.321429,3.857143,-1.535714,0.535714,0.214286,0.392857,0.428571
2003,1103,76.962963,0.62963,-0.00275,-0.037976,-0.00039,-2.259259,-2.111111,-0.259259,-2.703704,0.851852,-0.518519,-2.592593,0.481481
2003,1104,72.571429,4.285714,-0.002721,-0.005527,-0.007429,2.678571,1.285714,0.428571,-0.571429,1.071429,0.607143,-1.214286,0.607143
2003,1105,78.307692,-4.884615,-0.060278,0.004001,0.039882,0.307692,-3.269231,-1.269231,-0.153846,-0.076923,-2.115385,1.153846,0.269231
2003,1106,74.0,-0.142857,0.016981,0.05134,-0.088576,0.964286,1.5,-0.107143,1.964286,-0.428571,-0.035714,2.035714,0.464286


In [17]:
nestedList = []

# For each row here we need to obtain the two matching rows in the groupedData table.
# One of the rows must subtract the other.
for index, row in tourneyLabels.iterrows():
    
    season = row['Season']
    teamID = row['TeamID']
    otherTeam = row['OtherTeam']
    
    firstTeamData = groupedData.loc[ (season, teamID) ]
    secondTeamData = groupedData.loc[ (season, otherTeam) ]
    
    # Collect the difference of the first team minus the second team.
    data = (firstTeamData - secondTeamData).values[1:]
    nestedList.append(data)

In [18]:
newGroupedDf = pd.DataFrame(nestedList, columns=groupedData.columns[1:])
finalDf = pd.concat([tourneyLabels, newGroupedDf], axis=1)

## Parsing Seeds

In [19]:
seeds = pd.read_csv('MNCAATourneySeeds.csv')
seeds

Unnamed: 0,Season,Seed,TeamID
0,1985,W01,1207
1,1985,W02,1210
2,1985,W03,1228
3,1985,W04,1260
4,1985,W05,1374
...,...,...,...
2349,2021,Z12,1457
2350,2021,Z13,1317
2351,2021,Z14,1159
2352,2021,Z15,1331


In [20]:
# We obtain the seed for each team.
import string

newSeeds = []
for seed in seeds['Seed']:
    
    # Remove the first letter from the seed.
    newSeed = seed[1:]
    
    # Some of them have a letter at the end that we need to remove.
    if newSeed[-1] in string.ascii_lowercase:
        newSeed = newSeed[:-1]
        
    newSeed = int(newSeed)
    newSeeds.append(newSeed)
    
seeds['Seed'] = newSeeds

In [21]:
seeds.head()

Unnamed: 0,Season,Seed,TeamID
0,1985,1,1207
1,1985,2,1210
2,1985,3,1228
3,1985,4,1260
4,1985,5,1374


In [22]:
finalDf = finalDf.merge(seeds, on=['Season', 'TeamID'], how='inner')

In [23]:
finalDf.head()

Unnamed: 0,Season,TeamID,OtherTeam,Won,score,field-goal-pct,three-point-pct,free-throw-pct,off-rebounds,def-rebounds,assists,turnovers,steals,blocks,fouls,won,Seed
0,2003,1421,1411,1,-9.208046,-0.041433,0.005866,0.111625,-2.681609,-1.588506,-3.26092,2.47931,-0.191954,-0.874713,3.747126,-0.151724,16
1,2003,1421,1400,0,-17.419951,-0.059047,-0.005234,0.026873,-4.698276,-3.79803,-6.1133,4.12931,-1.187192,-1.955665,0.520936,-0.337438,16
2,2003,1411,1421,0,9.208046,0.041433,-0.005866,-0.111625,2.681609,1.588506,3.26092,-2.47931,0.191954,0.874713,-3.747126,0.151724,16
3,2003,1112,1436,1,10.309113,0.022374,0.014799,0.101812,-1.307882,0.495074,1.247537,-3.140394,2.741379,2.511084,-2.286946,0.237685,1
4,2003,1112,1211,1,6.093318,0.008007,0.043958,0.059682,0.232719,0.034562,0.178571,-1.748848,2.596774,0.402074,-2.192396,0.150922,1


## Creating features from public rankings

In [24]:
# Useful for our analysis.
rankings = pd.read_csv('MMasseyOrdinals.csv')

In [25]:
lastRanking = rankings [ rankings['RankingDayNum'] == 133 ]

r = lastRanking.groupby(['Season', 'TeamID']).mean().reset_index().drop(columns='RankingDayNum')

In [26]:
finalDf = finalDf.merge(r, on=['Season', 'TeamID'], how='inner')

## Fitting Models

In [56]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_predict
from xgboost import XGBClassifier
from sklearn.metrics import log_loss

In [45]:
log_loss([1,0,1,0], [0.7, 0.1, 0.8, 0.2])

0.22708064055624455

In [34]:
X = finalDf.iloc[:, 4:].values
y = finalDf['Won'].values

In [48]:
y

array([1, 0, 0, ..., 0, 1, 0], dtype=int64)

In [64]:
def runCrossValidationUsingLogLoss(model, X, y):
    accuracyList = cross_val_score(model, X, y, scoring='neg_log_loss', cv=10)
    return accuracyList

In [65]:
def crossValidationLogLossAgain(model, X, y):
    preds = cross_val_predict(model, X, y, cv=10, method='predict_proba')
    preds = vals[:, 0]
    return log_loss(y, preds)

## Logistic Regression

In [66]:
model = LogisticRegression(max_iter=1000)
crossValidationLogLossAgain(model, X, y)

1.140351361294491

## Random Forest Classifier

In [67]:
model = RandomForestClassifier(n_estimators=200)
crossValidationLogLossAgain(model, X, y)

1.140351361294491

## XGB Classifier

In [68]:
model = XGBClassifier(n_estimators=200)
crossValidationLogLossAgain(model, X, y)









































1.140351361294491

In [69]:
from sklearn.neural_network import MLPClassifier

In [71]:
model = MLPClassifier()
crossValidationLogLossAgain(model, X, y)



1.140351361294491