# Final Project - Predicting March Madness Tournament Winner 

**Names:** Lauren Cutler, Hayden Kash, Sydney Smith

**Date:** April 19, 2024

## Background and Motivation

We chose this project because of our shared interest in the March Madness tournament. We all enjoy watching it every year, as it is exciting to see which college team will come out on top and be crowned the best in the country. Every year many people try to predict how the tournament will play out, so we thought, why not try to do it with data science? Because the NCAA maintains a substantial amount of statistics on players and teams, and most of that data is public, the project seemed doable. Another deciding factor was timing: by the time we present, the tournament will have concluded and we will be able to compare our predictions to the actual results.

## Project Objectives

The main objective is to compare multiple machine learning models and see which performs best at predicting the teams that advance from the round of 64 to the round of 32 and from the round of 32 to the Sweet 16. In addition, we will see which model predicts the most correct teams in each round. For example, a model may predict 32 teams to advance from the round of 64 to the round of 32, but not all 32 predicted teams will be correct. Even so, if a portion of the predicted teams continues to be correct in each round of the tournament, that will help improve our bracket predictions.

## Data

For our project, we decided to use three data domains to help predict winners in multiple rounds of the 2024 NCAA men’s March Madness basketball tournament. The categories we determined to be most essential were the March Madness tournament data, the team statistics, and a power rating. Along with the 2024 data, we also used previous basketball seasons to train our prediction models. The seasons we selected were 2009, 2016, and 2017. We chose them by randomly drawing three years between 2008 and 2023, because the website that holds all of the data we need only covers the 2008 season to the present. Each dataset is available as a large data table on https://barttorvik.com/trank.php#, so all we needed to do was paste the data into an Excel file and convert it to a csv file. Therefore, all of the data for our project is read from local csv or Excel files.
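The random draw of training seasons described above could look something like the following. This is only a minimal sketch: the seed is a hypothetical value chosen so the draw is repeatable, so it will not reproduce the 2009, 2016, and 2017 selection, and excluding 2020 (when the tournament was not played) is our assumption rather than a documented step.

```python
import random

# Seasons available on the site before 2024 (2008 through 2023).
# Assumption: 2020 is left out because that year's tournament was not played.
candidate_years = [year for year in range(2008, 2024) if year != 2020]

# Hypothetical seed, fixed only so the draw can be repeated.
rng = random.Random(42)

# Draw three distinct seasons without replacement.
training_years = sorted(rng.sample(candidate_years, 3))
print(training_years)
```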

We started with three individual data domains for each of the four years. The tournament data contains the teams that won in each round of the tournament; these columns are the outcomes we predict. The team data contains statistics such as offensive and defensive efficiencies and turnover percentages, for a total of 16 team performance statistics. The Barthag column is based on points per possession and is meant to estimate the chance of beating a Division I team.


### Team stats, Barthag, and tournament statistics descriptions

**Breakdown of what each metric means:** 

- **RK** : Team Rank 
- **CONF** : Conference
- **ADJ. EFF. OFF.** : Adjusted Offensive Efficiency 
- **ADJ. EFF. DEF.** : Adjusted Defensive Efficiency
- **EFF. FG% OFF.** : Effective Field Goal Percentage Offense 
- **EFF. FG% DEF.** : Effective Field Goal Percentage Defense
- **TURNOVER% OFF.** : Turnover Percentage Offense
- **TURNOVER% DEF.** : Turnover Percentage Defense 
- **REB% OFF.** : Rebound Percentage Offense 
- **REB% DEF.** : Rebound Percentage Defense 
- **FT RATE OFF.** : Free Throw Rate Offense 
- **FT RATE DEF.** : Free Throw Rate Defense 
- **FT% OFF.** : Free Throw Percentage Offense 
- **FT% DEF.** : Free Throw Percentage Defense 
- **2P% OFF.** : 2 Pointer Percentage Offense
- **2P% DEF.** : 2 Pointer Percentage Defense 
- **3P% OFF.** : 3 Pointer Percentage Offense
- **3P% DEF.** : 3 Pointer Percentage Defense
- **Barthag.** : Power rating (chance of beating a D1 team)
- **PAKE** : Performance against Komputer expectations 
- **PASE** : Performance against seed expectations 
- **WINS** : Wins excluding play in games 
- **LOSS** : Losses excluding play in games
- **W%** : Win percentage excluding play in games 
- **R64** : Appearances in the round of 64
- **R32** : Appearances in the round of 32
- **S16** : Appearances in the sweet 16
- **E8** : Appearances in the elite eight
- **F4** : Appearances in the final four
- **F2** : Championship game appearances
- **CHAMP** : National titles
- **TOP2** : Years awarded a 1 or 2 seed
- **F4%** : Likelihood of getting to at least the final 4
- **CHAMP%** : Likelihood of winning at least 1 title per efficiency rating


## Data Processing

The team data needed the most cleaning. Every other row in the team data held a rating of each statistic rather than the statistic itself, and we did not want these rows in our data. Once we loaded the team data into a pandas data frame, we programmatically removed every other row. The table also contained more teams than the 64 in the tournament, so we reduced it to the 64 tournament teams for each year. While doing so, we realized that some team names had numbers or rankings in them, which we had to remove before we could programmatically filter the list down to 64. For the tournament and power rating data, all we had to do was copy and paste the data from the website into Excel and read in the csv file, because the website had already limited those tables to the 64 tournament teams.
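The cleaning steps described above can be sketched in pandas roughly as follows. The toy table, the exact form of the number attached to a team name, and the regular expression are all assumptions made for illustration; the real pasted table may need a different pattern.

```python
import pandas as pd

# Toy stand-in for the pasted team table: each team row is followed by a rating row.
# Assumption: the unwanted number appears at the end of the team name.
raw = pd.DataFrame({
    "TEAM": ["Arizona 12", "", "Akron 13", "", "Rice", ""],
    "ADJ. EFF. OFF.": [118.4, 9.0, 102.9, 206.0, 101.0, 230.0],
})

# Keep rows 0, 2, 4, ... and discard the rating row that follows each team row.
teams = raw.iloc[::2].reset_index(drop=True)

# Strip a trailing number and any surrounding whitespace from the team name.
teams["TEAM"] = teams["TEAM"].str.replace(r"\s+\d+\s*$", "", regex=True).str.strip()

# Keep only the teams that appear in the tournament table (hypothetical set here).
tournament_teams = {"Arizona", "Akron"}
teams = teams[teams["TEAM"].isin(tournament_teams)].reset_index(drop=True)
print(teams)
```

Filtering against the tournament table's team names only works once the names match exactly, which is why the name cleanup comes first.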

Once we had 64 teams in each dataset for each year, we worked on combining the datasets. We combined all of the previous years into one file in Excel. This became our training data and what we first used to explore different machine learning models. Then we combined the three datasets for 2024 into one csv file. For all of the data, we checked that the descriptive statistics made sense and that there were no duplicates or null values.
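We did the combining in Excel, but the same join and sanity checks could be done in pandas. The sketch below is an alternative under assumptions: the helper `build_season`, the team names, and all values are made up for illustration, and the real tables have 64 rows per season.

```python
import pandas as pd

def build_season(team_stats, tournament, power):
    """Join one season's three tables on the team name (hypothetical helper)."""
    # validate="one_to_one" raises a MergeError if a team name repeats in either table.
    merged = team_stats.merge(tournament, on="TEAM", validate="one_to_one")
    return merged.merge(power, on="TEAM", validate="one_to_one")

# Toy stand-ins with made-up values.
season_a = build_season(
    pd.DataFrame({"TEAM": ["Team A", "Team B"], "ADJ. EFF. OFF.": [110.0, 104.5]}),
    pd.DataFrame({"TEAM": ["Team A", "Team B"], "R32": [1, 0]}),
    pd.DataFrame({"TEAM": ["Team A", "Team B"], "BARTHAG": [0.91, 0.62]}),
)
season_b = build_season(
    pd.DataFrame({"TEAM": ["Team C", "Team D"], "ADJ. EFF. OFF.": [115.2, 99.8]}),
    pd.DataFrame({"TEAM": ["Team C", "Team D"], "R32": [1, 0]}),
    pd.DataFrame({"TEAM": ["Team C", "Team D"], "BARTHAG": [0.95, 0.48]}),
)

# Stack the seasons into one training table.
combined = pd.concat([season_a, season_b], ignore_index=True)

# Sanity checks: descriptive statistics, duplicate rows, and missing values.
n_duplicates = int(combined.duplicated().sum())
n_missing = int(combined.isnull().sum().sum())
print(combined.describe())
print(n_duplicates, n_missing)
```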


## Exploratory Analysis

For our exploratory analysis, we started by comparing different machine learning models on the 2009, 2016, and 2017 data: logistic regression, a decision tree, SVM, and KNN. We found that the decision tree did not perform as well as the other three models. Next, we ran KNN, logistic regression, and SVM with the previous years as our training data and the 2024 data as our test data. All three models had similar accuracy, and we moved forward with SVM and KNN.

In [88]:
# imports and setup

import scipy as sc
from scipy.stats import norm

import numpy as np
import pandas as pd
import statsmodels.formula.api as sm  # For regression analysis

from sklearn import linear_model  # For regression analysis
from sklearn import metrics
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import scale

import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import ListedColormap
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)
plt.style.use('ggplot')

### Logistic Regression exploratory analysis 

In [89]:
#2009, 2016, 2017 data
data = pd.read_excel('Complete Combined Files .xlsx')

In [90]:
data.head()

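| | RK | TEAM | CONF | ADJ. EFF. OFF. | ADJ. EFF. DEF. | EFF. FG% OFF. | EFF. FG% DEF. | TURNOVER% OFF. | TURNOVER% DEF. | OFF. REB% OFF. | ... | 3P% OFF. | 3P% DEF. | R64 | R32 | S16 | E8 | F4 | F2 | CHAMP | BARTHAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 206 | Akron | MAC | 102.9 | 95.6 | 48.1 | 45.5 | 20.7 | 26.4 | 34.3 | ... | 33.2 | 29.4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.6871 |
| 1 | 25 | American | Pat | 104.2 | 99.9 | 53.7 | 45.2 | 21.2 | 20.8 | 31.5 | ... | 37.4 | 33.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.411 |
| 2 | 39 | Arizona | P10 | 118.4 | 101.7 | 53.0 | 51.0 | 19.4 | 18.4 | 35.9 | ... | 38.7 | 34.9 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0.6002 |
| 3 | 2 | Arizona St. | P10 | 118.0 | 94.6 | 56.4 | 47.0 | 18.6 | 19.5 | 29.1 | ... | 37.0 | 31.9 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.854 |
| 4 | 155 | Binghamton | AE | 101.6 | 102.3 | 49.4 | 46.6 | 19.7 | 21.6 | 31.7 | ... | 33.5 | 32.9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.9377 |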


In [91]:
#drop the identifier and outcome columns so only the predictor statistics remain
X = data.drop(['RK','TEAM','CONF', 'R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP', 'R64'],axis=1)

In [92]:
#checking correlations because one of the assumptions of logistic regression is no perfect multicollinearity among independent variables.
X.corr(method='pearson', min_periods=1, numeric_only=False)

| | ADJ. EFF. OFF. | ADJ. EFF. DEF. | EFF. FG% OFF. | EFF. FG% DEF. | TURNOVER% OFF. | TURNOVER% DEF. | OFF. REB% OFF. | OFF. REB% DEF. | FT RATE OFF. | FT RATE DEF. | FT% OFF. | FT% DEF. | 2P% OFF. | 2P% DEF. | 3P% OFF. | 3P% DEF. | BARTHAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ADJ. EFF. OFF. | 1.0 | -0.302666 | 0.62817 | 0.016115 | -0.479529 | -0.240984 | 0.161306 | -0.136001 | -0.102214 | -0.286381 | 0.445124 | 0.084193 | 0.529949 | -0.034671 | 0.504958 | 0.092774 | 0.51702 |
| ADJ. EFF. DEF. | -0.302666 | 1.0 | 0.030741 | 0.710293 | 0.005809 | -0.288723 | -0.268007 | 0.086525 | 0.039824 | 0.004432 | 0.031785 | 0.156899 | -0.003702 | 0.640749 | 0.046652 | 0.389342 | -0.519868 |
| EFF. FG% OFF. | 0.62817 | 0.030741 | 1.0 | 0.111205 | -0.202413 | -0.321216 | -0.284749 | -0.281031 | -0.216726 | -0.356844 | 0.345656 | 0.067115 | 0.867997 | 0.041823 | 0.747324 | 0.1579 | 0.260174 |
| EFF. FG% DEF. | 0.016115 | 0.710293 | 0.111205 | 1.0 | -0.124246 | -0.022969 | -0.228701 | 0.075509 | -0.016649 | -0.03531 | 0.146895 | 0.117667 | 0.079702 | 0.882149 | 0.090346 | 0.575494 | -0.279983 |
| TURNOVER% OFF. | -0.479529 | 0.005809 | -0.202413 | -0.124246 | 1.0 | 0.215373 | 0.347886 | 0.262208 | 0.295497 | 0.142056 | -0.294047 | -0.12616 | -0.152853 | -0.148887 | -0.177384 | -0.002125 | -0.137656 |
| TURNOVER% DEF. | -0.240984 | -0.288723 | -0.321216 | -0.022969 | 0.215373 | 1.0 | 0.19209 | 0.489894 | -0.031002 | 0.440002 | -0.22011 | -0.207003 | -0.237096 | 0.022842 | -0.294161 | -0.111957 | -0.041799 |
| OFF. REB% OFF. | 0.161306 | -0.268007 | -0.284749 | -0.228701 | 0.347886 | 0.19209 | 1.0 | 0.162419 | 0.254524 | 0.174179 | -0.229168 | -0.093387 | -0.190445 | -0.229249 | -0.245206 | -0.081738 | 0.154509 |
| OFF. REB% DEF. | -0.136001 | 0.086525 | -0.281031 | 0.075509 | 0.262208 | 0.489894 | 0.162419 | 1.0 | -0.01691 | 0.01856 | -0.088684 | -0.27365 | -0.239859 | 0.063156 | -0.228324 | 0.051924 | -0.15525 |
| FT RATE OFF. | -0.102214 | 0.039824 | -0.216726 | -0.016649 | 0.295497 | -0.031002 | 0.254524 | -0.01691 | 1.0 | 0.155134 | -0.156552 | 0.003917 | -0.075807 | -0.019005 | -0.290404 | 0.008773 | -0.121462 |
| FT RATE DEF. | -0.286381 | 0.004432 | -0.356844 | -0.03531 | 0.142056 | 0.440002 | 0.174179 | 0.01856 | 0.155134 | 1.0 | -0.200624 | 0.057268 | -0.317084 | -0.000367 | -0.243076 | -0.069698 | -0.097462 |


In [93]:
#drop '2P% OFF.', 'EFF. FG% DEF.', '3P% OFF.', '2P% DEF.' because they are highly correlated 
X= data.drop(['RK','TEAM','CONF', 'R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP', 'R64', '2P% OFF.', 'EFF. FG% DEF.', '3P% OFF.', '2P% DEF.'],axis=1)
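The highly correlated columns were picked out by reading the correlation matrix. A programmatic version of that step might look like the sketch below. The helper name `high_correlation_pairs`, the 0.7 threshold, and the synthetic data are assumptions for illustration, not the rule actually used in this notebook.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(frame, threshold=0.7):
    """Return column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr(method="pearson").abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic stand-in: the second column is nearly a copy of the first, the third is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "EFF. FG% OFF.": base,
    "2P% OFF.": base + 0.1 * rng.normal(size=200),
    "FT RATE OFF.": rng.normal(size=200),
})

flagged = high_correlation_pairs(demo)
print(flagged)
```

From each flagged pair, one column would then be dropped by hand, as in the cell above.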

In [70]:
#scaling the data
X= scale(X)

In [71]:
#predicting the round of 32
y = data['R32']

In [72]:
#create a vector of row positions, one per row of Complete Combined Files .xlsx, to track the original index of each team
indices = np.arange(len(data))

#include indices_train and indices_test to capture the original index 
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X, y, indices, random_state=1, test_size=0.3)

In [73]:
#fitting the logistic regression model on previous years combined dataset
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)

In [74]:

print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.79      0.74      0.77        31
           1       0.72      0.78      0.75        27

    accuracy                           0.76        58
   macro avg       0.76      0.76      0.76        58
weighted avg       0.76      0.76      0.76        58



The logistic regression model predicted the round of 32 with 0.76 accuracy on the held-out 30% of the past seasons. We went on to look at the Sweet 16 and Elite 8, and the accuracy continued to be high.

In [75]:
# looking at how well regression does with 2024 data as the test data

#reading in 2024 data
data24 = pd.read_csv('2024 Final Total Data.csv')

In [76]:
#dropping the same columns in the 2024 data
X24= data24.drop(['RK','TEAM','CONF', 'R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP', 'R64', '2P% OFF.', 'EFF. FG% DEF.', '3P% OFF.', '2P% DEF.'],axis=1)

In [77]:
#scaling the 2024 data
X24 = scale(X24)
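Note that `scale` standardizes each array with that array's own mean and standard deviation, so the past seasons and 2024 are standardized independently here. A common alternative is to fit a `StandardScaler` on the training data and reuse it for the test data. The sketch below shows that pattern on synthetic stand-in arrays; we have not checked whether it changes the accuracy reported in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the feature matrices (made-up values).
rng = np.random.default_rng(0)
X_past = rng.normal(loc=100.0, scale=5.0, size=(192, 3))  # past seasons
X_2024 = rng.normal(loc=103.0, scale=5.0, size=(64, 3))   # 2024 season

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_past)

# Apply the same transformation to both sets.
X_past_scaled = scaler.transform(X_past)
X_2024_scaled = scaler.transform(X_2024)

# The training columns are centered at 0; the 2024 columns keep their offset from the past seasons.
print(X_past_scaled.mean(axis=0).round(3), X_2024_scaled.mean(axis=0).round(3))
```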

In [78]:
#What we are predicting 
y = data['R32']
y24 = data24['R32']

In [79]:
#setting up the train (past years) and test (2024 data) sets
X_train = X
X_test = X24

y_train = y
y_test = y24


In [80]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)

In [81]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.67      0.62      0.65        32
           1       0.65      0.69      0.67        32

    accuracy                           0.66        64
   macro avg       0.66      0.66      0.66        64
weighted avg       0.66      0.66      0.66        64



Accuracy dropped from 0.76 to 0.66 when the model was trained on the past seasons and tested on the 2024 data.

### SVM exploratory analysis 


### KNN exploratory analysis 


### Decision tree exploratory analysis 

In [94]:
#reading in 2009, 2016, 2017 data
data = pd.read_excel('Complete Combined Files .xlsx')


In [95]:
#dropping the outcome columns
X = data.drop(columns=['RK','TEAM','CONF', 'R64','R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP'])

#predicting the round of 32 winners
y = data['R32']
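The decision tree cells stop after defining `X` and `y`. A typical next step, fitting and scoring a tree on a 70/30 split, might look like the sketch below. It runs on synthetic stand-in data, and `max_depth=3` is a hypothetical setting, so the printed accuracy says nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 192-row table: 5 made-up features and a 0/1 outcome.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(192, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=192) > 0).astype(int)

# Same split proportions as the logistic regression cells.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# A shallow tree; the depth limit is an assumption to reduce overfitting on a small table.
tree_model = DecisionTreeClassifier(max_depth=3, random_state=1)
tree_model.fit(X_tr, y_tr)

tree_accuracy = accuracy_score(y_te, tree_model.predict(X_te))
print(tree_accuracy)
```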

NEED TO DO: SUMMARIZE EXPLORATORY ANALYSIS AND WHY WE CHOSE TO MOVE FORWARD WITH SVM AND KNN. 

## Analysis Methodology

In our exploratory analysis, KNN, logistic regression, and SVM all performed similarly. We decided to move forward with KNN and SVM in our final analysis to see which is better at predicting the teams that advance from the round of 64 to the round of 32 and from the round of 32 to the Sweet 16. In addition, we will look at which model predicts the most correct teams in each round.

### SVM exploratory analysis 


Be sure to explain how you change C and general method
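One general way to vary C is to try a grid of values and compare cross-validated accuracy. The sketch below illustrates that idea on synthetic stand-in data. The grid, the RBF kernel, and 5-fold cross-validation are all assumptions, not a record of what was run for this project.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the combined past-season table (192 teams, 5 made-up features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(192, 5))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + 0.5 * rng.normal(size=192) > 0).astype(int)

# Hypothetical grid; smaller C means stronger regularization (a softer margin).
C_values = [0.01, 0.1, 1, 10, 100]

svm_cv_accuracy = {}
for C in C_values:
    model = SVC(kernel="rbf", C=C)
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    svm_cv_accuracy[C] = scores.mean()

# Pick the C with the highest mean cross-validated accuracy.
best_C = max(svm_cv_accuracy, key=svm_cv_accuracy.get)
print(svm_cv_accuracy, best_C)
```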

### KNN exploratory analysis 


Be sure to explain how you change K and general method
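The number of neighbors K can be varied in the same general way: loop over candidate values and compare cross-validated accuracy. This is a sketch on synthetic stand-in data; the range of odd K values and the 5-fold setup are assumptions rather than the procedure used in the final analysis.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the combined past-season table (192 teams, 5 made-up features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(192, 5))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + 0.5 * rng.normal(size=192) > 0).astype(int)

# Hypothetical candidates; odd values avoid tied votes between the two classes.
K_values = list(range(1, 22, 2))

knn_cv_accuracy = {}
for k in K_values:
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    knn_cv_accuracy[k] = scores.mean()

# Pick the K with the highest mean cross-validated accuracy.
best_K = max(knn_cv_accuracy, key=knn_cv_accuracy.get)
print(knn_cv_accuracy, best_K)
```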

## Results

Explain the accuracy of each round and the teams that survived the longest, and how that compares to the actual tournament bracket.
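Counting the correctly predicted teams in a round, alongside overall accuracy, could be done as in the sketch below. The team names, the `R32_pred` column, and the values are hypothetical placeholders, not results from our models.

```python
import pandas as pd

# Toy stand-in: actual and predicted round-of-32 flags for four made-up teams.
results = pd.DataFrame({
    "TEAM": ["Team A", "Team B", "Team C", "Team D"],
    "R32": [1, 0, 1, 0],        # actual outcome
    "R32_pred": [1, 1, 0, 0],   # hypothetical model prediction
})

# Teams predicted to advance versus teams that actually advanced.
predicted_to_advance = set(results.loc[results["R32_pred"] == 1, "TEAM"])
actually_advanced = set(results.loc[results["R32"] == 1, "TEAM"])

# Teams the model got right among those it sent to the next round.
correct_teams = sorted(predicted_to_advance & actually_advanced)

# Overall accuracy counts correct 0s as well as correct 1s.
round_accuracy = float((results["R32"] == results["R32_pred"]).mean())
print(correct_teams, round_accuracy)
```

Repeating this for the Sweet 16 column would show which predicted teams stay correct from round to round.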

## Evaluation

Scaling the data did not help.

It seems like many classification models perform well on this type of data. The decision tree did not.

Changing C and the number of neighbors K helped us fine-tune the model each round.



## Ethical Considerations

Our ethical considerations include

1. Impact on the sport itself 
    
Sports are supposed to provide entertainment, and the people who play them often make a career out of them. Reducing the game to a data processing method takes away from players' achievements and turns everything into one big statistic, potentially creating ignorance of athletes' accomplishments and the hard work that has gotten them to this point in their careers.
    
   
2. Inaccessibility to technology 

It is important to recognize that not everyone who enjoys March Madness has access to a computer, which limits their ability to create a predictive model to support their decisions when filling out a bracket.


3. Bracket Competitions

Using the predictive model in a bracket competition may be unethical if the others in the bracket group are not aware of it. This could be especially problematic if money is involved and an individual is not upfront about how their bracket was completed.


## Project Summary

In summary, we started by choosing a topic we were all interested in. We all enjoy the March Madness tournament and were excited to put our data science skills to the test by predicting winners in the tournament. Next, we looked for basketball statistics that we could use in our models. Once we had data, we cleaned it and looked at descriptive statistics to make sure it was accurate. We needed to do some work to get the data into the format required for training and testing. Finally, we ran KNN, SVM, decision tree, and logistic regression models to see which predicted the winners best. Logistic regression, KNN, and SVM all performed similarly, but for us SVM performed the best. We were able to predict ………FINISH THIS PART