# Final Project - Predicting March Madness Tournament Winner 

**Names:** Lauren Cutler, Hayden Kash, Sydney Smith

**Date:** April 19, 2024

## Background and Motivation

We chose this project because of our shared interest in the March Madness tournament. We enjoy watching it every year, since it is exciting to see which college team comes out on top and is crowned the best in the country. Many people try to predict how the tournament will play out, so we decided to attempt it with data science. The NCAA maintains a substantial amount of statistics on players and teams, and most of that data is public, which made the project feasible. Another deciding factor was timing: by the time we present, the tournament will have concluded, so we can compare our predictions with the actual results.

## Project Objectives

The main objective is to compare multiple machine learning models and see which performs best at predicting the teams that advance from the round of 64 to the round of 32, and from the round of 32 to the Sweet 16. We also want to see which model predicts the most correct teams in each round. For example, a model may predict 32 teams to advance from the round of 64, but not all 32 will be correct. Even so, if a portion of the predicted teams continues to be correct in each round, that will improve our bracket predictions.

## Data

For our project, we used three data domains to help predict winners in multiple rounds of the 2024 NCAA men’s March Madness basketball tournament: the March Madness tournament data, the team statistics, and a power rating. Along with the 2024 data, we also used previous basketball seasons to train our prediction model. The previous seasons we selected were 2017, 2016, and 2009. We chose them by randomly generating three years between 2008 and 2023, because the website that holds the data only covers seasons from 2008 to the present. Each dataset is available as a large data table on https://barttorvik.com/trank.php#, so all we needed to do was paste the data into an Excel file and convert it to a csv file. Therefore, all of the data we read for our project comes from csv files.
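
The random selection of training seasons could be reproduced along these lines. This is a minimal sketch: the seed and the variable names are assumptions added for illustration, and it will not necessarily draw 2009, 2016, and 2017.

```python
import random

# Hypothetical reconstruction of the draw; the seed is an assumption added
# only so that the selection can be repeated.
rng = random.Random(2024)

# Seasons available for training: 2008 through 2023 inclusive.
available_seasons = range(2008, 2024)

# sample() draws without replacement, so the three seasons are distinct.
training_seasons = sorted(rng.sample(available_seasons, 3))
print(training_seasons)
```

Fixing the seed makes the choice of seasons repeatable, which helps when the analysis is rerun later.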

We started with three individual data domains for each of the four years. The tournament data contained the winners of each round of the tournament; these columns are the outcomes we want to predict. The teams data contained offensive and defensive efficiencies, turnovers, and related measures, for a total of 16 team performance statistics. The Barthag column is a power rating based on points per possession that estimates a team's chance of beating a Division I team.
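
To give a feel for how a power rating can be derived from points-per-possession efficiencies, the sketch below uses a Pythagorean expectation. Both the functional form and the exponent of 11.5 are assumptions for illustration; the source does not state how the site computes Barthag, and `pythagorean_rating` is a hypothetical helper.

```python
def pythagorean_rating(adj_off: float, adj_def: float, exponent: float = 11.5) -> float:
    """Estimated chance of beating an average opponent.

    The Pythagorean form and the exponent of 11.5 are assumptions for
    illustration, not values taken from the project data.
    """
    off_term = adj_off ** exponent
    def_term = adj_def ** exponent
    return off_term / (off_term + def_term)


# Arizona's 2024 adjusted efficiencies as listed in the data: 121.0 and 92.9.
print(round(pythagorean_rating(121.0, 92.9), 3))
```

A team whose adjusted offense equals its adjusted defense gets a rating of 0.5 under this form, and the rating rises toward 1 as offense outpaces defense.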


#### Team stats, Barthag, Tournament statistics descriptions

**Breakdown of what each metric means:** 

- **RK** : Team Rank 
- **CONF** : Conference
- **ADJ. EFF. OFF.** : Adjusted Offensive Efficiency 
- **ADJ. EFF. DEF.** : Adjusted Defensive Efficiency
- **EFF. FG% OFF.** : Effective Field Goal Percentage Offense 
- **EFF. FG% DEF.** : Effective Field Goal Percentage Defense
- **TURNOVER% OFF.** : Turnover Percentage Offense
- **TURNOVER% DEF.** : Turnover Percentage Defense 
- **REB% OFF.** : Rebound Percentage Offense 
- **REB% DEF.** : Rebound Percentage Defense 
- **FT RATE OFF.** : Free Throw Rate Offense 
- **FT RATE DEF.** : Free Throw Rate Defense 
- **FT% OFF.** : Free Throw Percentage Offense 
- **FT% DEF.** : Free Throw Percentage Defense 
- **2P% OFF.** : 2 Pointer Percentage Offense
- **2P% DEF.** : 2 Pointer Percentage Defense 
- **3P% OFF.** : 3 Pointer Percentage Offense
- **3P% DEF.** : 3 Pointer Percentage Defense
- **Barthag.** : Power rating (chance of beating a D1 team)
- **PAKE** : Performance against Komputer expectations 
- **PASE** : Performance against seed expectations 
- **WINS** : Wins excluding play in games 
- **LOSS** : Losses excluding play in games
- **W%** : Win percentage excluding play in games 
- **R64** : Appearances in the round of 64
- **R32** : Appearances in the round of 32
- **S16** : Appearances in the sweet 16
- **E8** : Appearances in the elite eight
- **F4** : Appearances in the final four
- **F2** : Championship game appearances
- **CHAMP** : National titles
- **TOP2** : Years awarded a 1 or 2 seed
- **F4%** : Likelihood of getting to at least the final 4
- **CHAMP%** : Likelihood of winning at least 1 title per efficiency rating


## Data Processing

The teams data needed the most cleaning. Every other row in the teams table was a rating of each statistic rather than the statistic itself, and we did not want those rows. Once we loaded the teams data into a pandas data frame, we programmatically removed every other row. The table also contained more teams than the 64 in the tournament, so we reduced it to the 64 tournament teams for each year. While doing so, we found that some team names included numbers or rankings, which we had to strip before we could programmatically filter the list down to 64. For the tournament and power rating data, all we had to do was copy the data from the website into Excel and read in the csv file, because the website already limits those tables to the 64 tournament teams.
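
The cleaning steps described above can be sketched as follows. The toy table, the `"<number> seed"` name format, and the two-team tournament list are assumptions made for illustration; the real pasted table may be laid out differently.

```python
import pandas as pd

# Toy stand-in for the pasted table: each team row is followed by a row of
# per-statistic ratings. The name format is an assumption.
raw = pd.DataFrame({
    "TEAM": ["Houston 1 seed", None, "Purdue 1 seed", None, "Akron 14 seed", None],
    "ADJ. EFF. OFF.": [119.2, 17, 126.2, 3, 105.1, 117],
})

# Keep rows 0, 2, 4, ... and discard the interleaved rating rows.
teams = raw.iloc[::2].reset_index(drop=True)

# Strip a trailing "<number> seed" annotation so names match the tournament file.
teams["TEAM"] = (
    teams["TEAM"].str.replace(r"\s+\d+\s+seed.*$", "", regex=True).str.strip()
)

# Restrict to the tournament field (here a hypothetical two-team list).
tournament_teams = ["Houston", "Akron"]
teams = teams[teams["TEAM"].isin(tournament_teams)].reset_index(drop=True)
print(teams)
```

Filtering with `isin` only works once the names match exactly, which is why the annotation has to be stripped first.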

Once we had 64 teams in each dataset for each year, we combined the datasets. We combined all of the previous years into one csv file in Excel. This became our training data and what we first used to explore different machine learning models. Then we combined the three datasets for 2024 into one csv file. For all of the data, we checked that the descriptive statistics made sense and that there were no duplicates or null values.
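
The checks mentioned above could look roughly like this in pandas. `check_table` is a hypothetical helper and the toy rows exist only to trigger each check; this is not the code the project ran.

```python
import pandas as pd


def check_table(df: pd.DataFrame, expected_rows: int) -> dict:
    """Return simple quality checks for a combined team table."""
    return {
        "row_count_ok": len(df) == expected_rows,
        "duplicate_rows": int(df.duplicated().sum()),
        "null_values": int(df.isna().sum().sum()),
    }


# Toy table with one repeated row and one missing value.
toy = pd.DataFrame({
    "TEAM": ["Akron", "Arizona", "Arizona", "Baylor"],
    "BARTHAG": [0.6054, 0.9546, 0.9546, None],
})
report = check_table(toy, expected_rows=4)
print(report)

# Descriptive statistics for a quick sanity check of ranges and means.
print(toy.describe())
```

For the combined training file, `expected_rows` would be 192 (three seasons of 64 teams), and both counts should come back as zero.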


## Exploratory Analysis

For our exploratory analysis, we started by comparing machine learning models on the 2009, 2016, and 2017 data: logistic regression, decision tree, SVM, and KNN. The decision tree did not perform as well as the other three models. Next, we ran KNN, logistic regression, and SVM with the previous years as training data and the 2024 data as test data. All three models had similar accuracy, so we moved forward with SVM and KNN.

In [1]:
# imports and setup

import numpy as np
import pandas as pd
import scipy as sc
from scipy.stats import norm
import statsmodels.formula.api as sm # For regression analysis

from sklearn import linear_model # For regression analysis
from sklearn import metrics, svm, tree
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import scale

import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import ListedColormap
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)
plt.style.use('ggplot')

### Logistic Regression exploratory analysis 

In [2]:
#2009, 2016, 2017 data
data = pd.read_excel('Complete Combined Files .xlsx')

In [3]:
data.head()

Unnamed: 0,RK,TEAM,CONF,ADJ. EFF. OFF.,ADJ. EFF. DEF.,EFF. FG% OFF.,EFF. FG% DEF.,TURNOVER% OFF.,TURNOVER% DEF.,OFF. REB% OFF.,...,3P% OFF.,3P% DEF.,R64,R32,S16,E8,F4,F2,CHAMP,BARTHAG
0,206,Akron,MAC,102.9,95.6,48.1,45.5,20.7,26.4,34.3,...,33.2,29.4,1,0,0,0,0,0,0,0.6871
1,25,American,Pat,104.2,99.9,53.7,45.2,21.2,20.8,31.5,...,37.4,33.0,1,0,0,0,0,0,0,0.411
2,39,Arizona,P10,118.4,101.7,53.0,51.0,19.4,18.4,35.9,...,38.7,34.9,1,1,1,0,0,0,0,0.6002
3,2,Arizona St.,P10,118.0,94.6,56.4,47.0,18.6,19.5,29.1,...,37.0,31.9,1,1,0,0,0,0,0,0.854
4,155,Binghamton,AE,101.6,102.3,49.4,46.6,19.7,21.6,31.7,...,33.5,32.9,1,0,0,0,0,0,0,0.9377


In [4]:
X= data.drop(['RK','TEAM','CONF', 'R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP', 'R64'],axis=1)

In [5]:
#checking correlations because one of the assumptions of logistic regression is no perfect multicollinearity among independent variables.
X.corr(method='pearson', min_periods=1)

Unnamed: 0,ADJ. EFF. OFF.,ADJ. EFF. DEF.,EFF. FG% OFF.,EFF. FG% DEF.,TURNOVER% OFF.,TURNOVER% DEF.,OFF. REB% OFF.,OFF. REB% DEF.,FT RATE OFF.,FT RATE DEF.,FT% OFF.,FT% DEF.,2P% OFF.,2P% DEF.,3P% OFF.,3P% DEF.,BARTHAG
ADJ. EFF. OFF.,1.0,-0.302666,0.62817,0.016115,-0.479529,-0.240984,0.161306,-0.136001,-0.102214,-0.286381,0.445124,0.084193,0.529949,-0.034671,0.504958,0.092774,0.51702
ADJ. EFF. DEF.,-0.302666,1.0,0.030741,0.710293,0.005809,-0.288723,-0.268007,0.086525,0.039824,0.004432,0.031785,0.156899,-0.003702,0.640749,0.046652,0.389342,-0.519868
EFF. FG% OFF.,0.62817,0.030741,1.0,0.111205,-0.202413,-0.321216,-0.284749,-0.281031,-0.216726,-0.356844,0.345656,0.067115,0.867997,0.041823,0.747324,0.1579,0.260174
EFF. FG% DEF.,0.016115,0.710293,0.111205,1.0,-0.124246,-0.022969,-0.228701,0.075509,-0.016649,-0.03531,0.146895,0.117667,0.079702,0.882149,0.090346,0.575494,-0.279983
TURNOVER% OFF.,-0.479529,0.005809,-0.202413,-0.124246,1.0,0.215373,0.347886,0.262208,0.295497,0.142056,-0.294047,-0.12616,-0.152853,-0.148887,-0.177384,-0.002125,-0.137656
TURNOVER% DEF.,-0.240984,-0.288723,-0.321216,-0.022969,0.215373,1.0,0.19209,0.489894,-0.031002,0.440002,-0.22011,-0.207003,-0.237096,0.022842,-0.294161,-0.111957,-0.041799
OFF. REB% OFF.,0.161306,-0.268007,-0.284749,-0.228701,0.347886,0.19209,1.0,0.162419,0.254524,0.174179,-0.229168,-0.093387,-0.190445,-0.229249,-0.245206,-0.081738,0.154509
OFF. REB% DEF.,-0.136001,0.086525,-0.281031,0.075509,0.262208,0.489894,0.162419,1.0,-0.01691,0.01856,-0.088684,-0.27365,-0.239859,0.063156,-0.228324,0.051924,-0.15525
FT RATE OFF.,-0.102214,0.039824,-0.216726,-0.016649,0.295497,-0.031002,0.254524,-0.01691,1.0,0.155134,-0.156552,0.003917,-0.075807,-0.019005,-0.290404,0.008773,-0.121462
FT RATE DEF.,-0.286381,0.004432,-0.356844,-0.03531,0.142056,0.440002,0.174179,0.01856,0.155134,1.0,-0.200624,0.057268,-0.317084,-0.000367,-0.243076,-0.069698,-0.097462


In [6]:
#drop '2P% OFF.', 'EFF. FG% DEF.', '3P% OFF.', '2P% DEF.' because they are highly correlated 
X= data.drop(['RK','TEAM','CONF', 'R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP', 'R64', '2P% OFF.', 'EFF. FG% DEF.', '3P% OFF.', '2P% DEF.'],axis=1)
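
The highly correlated columns could also be listed programmatically instead of being read off the matrix by eye. In this sketch, `highly_correlated_pairs` is a hypothetical helper, the 0.7 threshold is an assumption, and the toy columns only borrow names from the data set.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(features: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = features.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy features: the second column is almost a multiple of the first.
base = np.arange(10.0)
wiggle = np.array([1.0, -1.0] * 5)
toy = pd.DataFrame({
    "EFF. FG% OFF.": base,
    "2P% OFF.": 2 * base + 0.1 * wiggle,
    "FT RATE OFF.": wiggle,
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

From each reported pair, one column can then be dropped, which is the same decision the cell above makes manually.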

In [7]:
#scaling the data
X= scale(X)

In [8]:
#predicting the round of 32
y = data['R32']

In [9]:
#create an array of row positions (0 to len(data) - 1) so each team's original index can be recovered after the split
indices = np.arange(len(data))

#include indices_train and indices_test to capture the original index 
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X, y, indices, random_state=1, test_size=0.3)

In [10]:
#fitting the logistic regression model on previous years combined dataset
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)

In [11]:

print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.79      0.74      0.77        31
           1       0.72      0.78      0.75        27

    accuracy                           0.76        58
   macro avg       0.76      0.76      0.76        58
weighted avg       0.76      0.76      0.76        58



The logistic regression model reached 0.76 accuracy on the held-out 30% of the 2009, 2016, and 2017 data. We repeated this for the Sweet 16 and Elite 8 outcomes, and the accuracy remained high.

In [12]:
# looking at how well regression does with 2024 data as the test data

#reading in 2024 data
data24 = pd.read_csv('2024 Final Total Data.csv')

In [13]:
#dropping the same columns in the 2024 data
X24= data24.drop(['RK','TEAM','CONF', 'R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP', 'R64', '2P% OFF.', 'EFF. FG% DEF.', '3P% OFF.', '2P% DEF.'],axis=1)

In [14]:
#scaling the 2024 data
X24 = scale(X24)

In [15]:
#What we are predicting 
y = data['R32']
y24 = data24['R32']

In [16]:
#setting up the train (past years) and test (2024 data) sets
X_train = X
X_test = X24

y_train = y
y_test = y24


In [17]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)

In [18]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.67      0.62      0.65        32
           1       0.65      0.69      0.67        32

    accuracy                           0.66        64
   macro avg       0.66      0.66      0.66        64
weighted avg       0.66      0.66      0.66        64



Accuracy dropped from 0.76 to 0.66 when the model was trained on the previous seasons and tested on the 2024 data.
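
In the cells above, `scale` is applied separately to the historical data and to the 2024 data, so each set is standardized with its own mean and standard deviation. A common alternative is to learn the scaling from the training seasons only and reuse it on the new season. The sketch below shows that pattern on synthetic stand-in data; whether it would change the 0.66 figure was not tested here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: 192 past teams and 64 current teams with 3 features.
X_past = rng.normal(loc=100.0, scale=5.0, size=(192, 3))
y_past = (X_past[:, 0] - X_past[:, 1] > 0).astype(int)
X_new = rng.normal(loc=102.0, scale=5.0, size=(64, 3))

# The scaler learns its mean and variance from the training seasons only;
# the pipeline then applies those same values to the new season.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_past, y_past)
pred_new = clf.predict(X_new)
print(pred_new.sum(), "of", len(pred_new), "teams predicted to advance")
```

Wrapping the scaler and the classifier in one pipeline also prevents the test data from influencing the scaling during cross-validation.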

### KNN exploratory analysis - Testing to See How Accurate the Model Can be


#### KNN Second Round Prediction 

In [19]:
X = data.drop(columns = ['RK','TEAM','CONF', 'R64','R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP'], axis = 1) #dropping columns that have non-numeric values and that have the things we are trying to predict
y = data['R32'] # y data is the round of 32, or what we are trying to predict 
indicies = np.arange(192) # give an index to each value 
# run the train_test_split function where we use the X & y data defined above as well as the indicies we defined so that we can find the teams
X_Train, X_Test, y_Train, y_Test, indicies_train, indicies_test = train_test_split(X, y, indicies, random_state = 3, test_size = 0.5)

model = KNeighborsClassifier(n_neighbors = 5) # define the model and the number of neighbors; vary n_neighbors to find the highest accuracy without overfitting the data
model.fit(X_Train, y_Train) # fit the model to the training data
y_pred = model.predict(X_Test) # predict the test set

index = list(indicies_test) # original row positions of the test-set teams, in the same order as y_pred

predicted_teams_real_outcome = data.iloc[index]['R32'] # return the actual outcome 

correctly_predicted_teams =  predicted_teams_real_outcome == y_pred # the correctly predicted teams are where the actual outcomes equal what the model predicted 

print('sum of correctly predicted teams (how many there are):' ,sum(correctly_predicted_teams))
print('length of the predicted variable:', len(y_pred))
print('Accuracy on test data = {:.2f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred)*100), '%')

sum of correctly predicted teams (how many there are): 74
length of the predicted variable: 96
Accuracy on test data = 77.08 %
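
The comment in the cell above suggests varying `n_neighbors` by hand. One way to make that comparison less dependent on a single split is k-fold cross-validation, sketched here on synthetic data. The candidate values of k, the five folds, and the scaling step are assumptions; the notebook's KNN cells use unscaled features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for 192 team rows with 4 features and a binary outcome.
X_demo = rng.normal(size=(192, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=192) > 0).astype(int)

# Mean 5-fold accuracy for each candidate number of neighbors.
cv_scores = {}
for k in (1, 3, 5, 7, 9, 11):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best k:", best_k)
```

Because KNN relies on distances, features on larger numeric scales dominate unless they are standardized, which is why the sketch scales inside the pipeline.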


#### KNN Sweet 16 Prediction

In [20]:
y =  data['S16'] # set the y data to be only the sweet 16 data 

indicies = np.arange(192) # give an index to each value 
# run the train_test_split function where we use the X & y data defined above as well as the indicies we defined so that we can find the teams
X_Train, X_Test, y_Train, y_Test, indicies_train, indicies_test = train_test_split(X, y, indicies, random_state = 3, test_size = 0.5)

model = KNeighborsClassifier(n_neighbors = 5) # define the model and the number of neighbors; vary n_neighbors to find the highest accuracy without overfitting the data
model.fit(X_Train, y_Train) # fit the model to the training data
y_pred = model.predict(X_Test) # predict the test set

index = list(indicies_test) # original row positions of the test-set teams, in the same order as y_pred

predicted_teams_real_outcome = data.iloc[index]['S16'] # return the actual outcome that is defined in the data 

correctly_predicted_teams =  predicted_teams_real_outcome == y_pred  # the correctly predicted teams are where the actual outcomes equal what the model predicted 

print('sum of correctly predicted teams (how many there are):' ,sum(correctly_predicted_teams))
print('length of the predicted variable:', len(y_pred))
print('Accuracy on test data = {:.2f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred)*100), '%')

sum of correctly predicted teams (how many there are): 72
length of the predicted variable: 96
Accuracy on test data = 75.00 %


#### KNN Elite 8 Prediction 

In [21]:
y = data['E8']  # set the y data to be only elite 8 data 

indicies = np.arange(192) # give an index to each value 
# run the train_test_split function where we use the X & y data defined above as well as the indicies we defined so that we can find the teams
X_Train, X_Test, y_Train, y_Test, indicies_train, indicies_test = train_test_split(X, y, indicies, random_state = 3, test_size = 0.5)

model = KNeighborsClassifier(n_neighbors = 5) # define the model and the number of neighbors; vary n_neighbors to find the highest accuracy without overfitting the data
model.fit(X_Train, y_Train) # fit the model to the training data
y_pred = model.predict(X_Test) # predict the test set

index = list(indicies_test) # original row positions of the test-set teams, in the same order as y_pred

predicted_teams_real_outcome = data.iloc[index]['E8'] # return the actual outcome that is defined in the data 

correctly_predicted_teams =  predicted_teams_real_outcome == y_pred # the correctly predicted teams are where the actual outcomes equal what the model predicted 

print('sum of correctly predicted teams (how many there are):' ,sum(correctly_predicted_teams))
print('length of the predicted variable:', len(y_pred))
print('Accuracy on test data = {:.2f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred)*100), '%')

sum of correctly predicted teams (how many there are): 77
length of the predicted variable: 96
Accuracy on test data = 80.21 %


#### KNN Final 4 Prediction 

In [22]:
y = data['F4'] # set the y data to be only final 4 data 
indicies = np.arange(192)  # give an index to each value 
# run the train_test_split function where we use the X & y data defined above as well as the indicies we defined so that we can find the teams
X_Train, X_Test, y_Train, y_Test, indicies_train, indicies_test = train_test_split(X, y, indicies, random_state = 3, test_size = 0.5)

model = KNeighborsClassifier(n_neighbors = 5) # define the model and the number of neighbors; vary n_neighbors to find the highest accuracy without overfitting the data
model.fit(X_Train, y_Train) # fit the model to the training data
y_pred = model.predict(X_Test) # predict the test set

index = list(indicies_test) # original row positions of the test-set teams, in the same order as y_pred

predicted_teams_real_outcome = data.iloc[index]['F4'] # return the actual outcome that is defined in the data

correctly_predicted_teams =  predicted_teams_real_outcome == y_pred # the correctly predicted teams are where the actual outcomes equal what the model predicted 

print('sum of correctly predicted teams (how many there are):' ,sum(correctly_predicted_teams))
print('length of the predicted variable:', len(y_pred))
print('Accuracy on test data = {:.2f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred)*100), '%')

sum of correctly predicted teams (how many there are): 89
length of the predicted variable: 96
Accuracy on test data = 92.71 %
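
When reading the accuracies above, it helps to remember that fewer teams advance in each later round, so a model that always predicts "does not advance" already scores well. The sketch below computes that baseline from the bracket structure of a 64-team field; the variable names are illustrative.

```python
# Teams out of a 64-team field that reach each round (bracket structure).
teams_reaching = {"R32": 32, "S16": 16, "E8": 8, "F4": 4}
FIELD_SIZE = 64

# Accuracy of a model that always predicts "does not advance".
baseline_accuracy = {
    rnd: 1 - count / FIELD_SIZE for rnd, count in teams_reaching.items()
}

for rnd, acc in baseline_accuracy.items():
    print(f"{rnd}: {acc:.2%}")
```

Against these reference points, the KNN accuracies reported above (77.08%, 75.00%, 80.21%, 92.71%) show their largest margin in the round of 32. The exact baseline for each test split may differ slightly, because a random split need not preserve the class proportions.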


### Decision tree exploratory analysis 

In [23]:
#reading in 2009, 2016, 2017 data
data = pd.read_excel('Complete Combined Files .xlsx')


In [24]:
#dropping the outcome columns
X = data.drop(columns = ['RK','TEAM','CONF', 'R64','R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP'], axis = 1)

#predicting the round of 32 winners
y = data['R32']

#### Decision Trees Second Round Prediction 

In [25]:
indicies = np.arange(192) # essentially doing the same thing as the before, creating indicies for each variable
# run the train_test_split function where we use the X & y data defined above as well as the indicies we defined so that we can find the teams
X_Train, X_Test, y_Train, y_Test, indicies_train, indicies_test = train_test_split(X, y, indicies, random_state = 3, test_size = 0.9)

decisionTree = tree.DecisionTreeClassifier(max_depth = 0.7, min_samples_split = 10) # define the decision tree classifier and a max depth and minimum samples to use within the tree 
decisionTree = decisionTree.fit(X_Train, y_Train) # apply the model to the training data 

y_pred_test = decisionTree.predict(X_Test) # predict the variables 
y_pred_train = decisionTree.predict(X_Train)

index = list(indicies_test) # original row positions of the test-set teams, in the same order as y_pred_test

predicted_teams_real_outcome = data.iloc[index]['R32'] # get the actual outcome from the original dataset 

correctly_predicted_teams =  predicted_teams_real_outcome == y_pred_test # a prediction is correct when it matches the actual outcome

print('number of correctly predicted teams =', sum(correctly_predicted_teams))
print('number of predicted teams =', len(y_pred_test))
print('Accuracy on test data = {:.3f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred_test)*100), '%')

number of correctly predicted teams = 86
number of predicted teams = 173
Accuracy on test data = 49.711 %


#### Decision Trees Sweet 16 Prediction 

In [26]:
y = data['S16'] # set the new y variable to be the sweet 16 data from the original data set 

indicies = np.arange(192) # essentially doing the same thing as the before, creating indicies for each variable
# run the train_test_split function where we use the X & y data defined above as well as the indicies we defined so that we can find the teams
X_Train, X_Test, y_Train, y_Test, indicies_train, indicies_test = train_test_split(X, y, indicies, random_state = 3, test_size = 0.9)

decisionTree = tree.DecisionTreeClassifier(max_depth = 0.7, min_samples_split = 10)# define the decision tree classifier and a max depth and minimum samples to use within the tree 
decisionTree = decisionTree.fit(X_Train, y_Train) # apply the model to the training data 

y_pred_test = decisionTree.predict(X_Test) # predict the variables 
y_pred_train = decisionTree.predict(X_Train)

index = list(indicies_test) # original row positions of the test-set teams, in the same order as y_pred_test

predicted_teams_real_outcome = data.iloc[index]['S16'] # get the actual outcome from the original dataset 

correctly_predicted_teams =  predicted_teams_real_outcome == y_pred_test # a prediction is correct when it matches the actual outcome

print('number of correctly predicted teams =',sum(correctly_predicted_teams))
print('number of predicted teams =',len(y_pred_test))
print('Accuracy on test data = {:.3f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred_test)*100), '%')

number of correctly predicted teams = 130
number of predicted teams = 173
Accuracy on test data = 75.145 %


#### Decision Trees Elite 8 Prediction 

In [27]:
y = data['E8'] # set the new y variable to be the elite 8 data from the original data set 

indicies = np.arange(192) # essentially doing the same thing as the before, creating indicies for each variable
# run the train_test_split function where we use the X & y data defined above as well as the indicies we defined so that we can find the teams
X_Train, X_Test, y_Train, y_Test, indicies_train, indicies_test = train_test_split(X, y, indicies, random_state = 3, test_size = 0.9)

decisionTree = tree.DecisionTreeClassifier(max_depth = 0.7, min_samples_split = 10) # define the decision tree classifier and a max depth and minimum samples to use within the tree 
decisionTree = decisionTree.fit(X_Train, y_Train) # apply the model to the training data 

y_pred_test = decisionTree.predict(X_Test) # predict the variables
y_pred_train = decisionTree.predict(X_Train)

index = list(indicies_test) # original row positions of the test-set teams, in the same order as y_pred_test

# NOTE: this looks up the 'R32' column rather than 'E8', so the two sums printed below describe round-of-32 outcomes; the accuracy line is computed from y_Test, which holds the elite 8 labels
predicted_teams_real_outcome = data.iloc[index]['R32'] # get the actual outcome from the original dataset 

correctly_predicted_teams =  predicted_teams_real_outcome == y_pred_test # a prediction is correct when it matches the actual outcome

print('sum of the correctly predicted teams', sum(correctly_predicted_teams))

#print(list(predicted_teams_real_outcome) )
print('sum of the predicted teams real outcome:', sum(predicted_teams_real_outcome))
print('number of predicted teams =',len(y_pred_test))
print('Accuracy on test data = {:.2f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred_test)*100), '%')

sum of the correctly predicted teams 86
sum of the predicted teams real outcome: 87
number of predicted teams = 173
Accuracy on test data = 87.86 %


**After analyzing the Logistic Regression, KNN, and Decision Trees classification methods, our group decided to proceed with KNN. The KNN classifier consistently showed the highest accuracy throughout the exploration phase of our project.** 

**Logistic Regression performed similarly to KNN and SVM (explored below) but had a slightly lower accuracy. Because of this, we decided to focus on KNN and SVM.** 

**The Decision Trees classifier performed poorly, and we had difficulty improving its accuracy. For the second-round prediction, the highest accuracy we were able to achieve was 49.711%, which is comparable to random guessing. When we changed the max depth, we were able to reach 100% accuracy, but that was due to overfitting. The highest accuracy of this model occurred with a maximum depth of 0.7.**
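
scikit-learn documents `max_depth` as an integer (or `None` for no limit), so a fractional value such as 0.7 falls outside the documented range, and how it is handled may depend on the installed version. The sketch below is an illustration on synthetic data rather than a rerun of the project's results. It compares training and test accuracy across integer depths, which is one way to see the overfitting described above; the depths, the split, and the data are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)

# Synthetic stand-in: 192 rows, 5 features, and a noisy binary outcome.
X_demo = rng.normal(size=(192, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] + rng.normal(size=192) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3
)

# Record (training accuracy, test accuracy) for each depth; None = unlimited.
depth_scores = {}
for depth in (1, 2, 3, 5, None):
    tree_model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree_model.fit(X_tr, y_tr)
    depth_scores[depth] = (tree_model.score(X_tr, y_tr), tree_model.score(X_te, y_te))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(f"max_depth={depth}: train={train_acc:.2f} test={test_acc:.2f}")
```

An unlimited tree fits the training rows perfectly here, so the gap between its training and test accuracy is the signal to look for when choosing a depth.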


## Analysis Methodology

In our exploratory analysis, KNN, logistic regression, and SVM all performed similarly. We decided to move forward with KNN and SVM in our final analysis to see which is better at predicting the teams that advance from the round of 64 to the round of 32 and from the round of 32 to the Sweet 16. In addition, we will look at which model predicts the most correct teams in each round.

### SVM exploratory analysis 


#### First Round Winners


In [28]:
#load combined data set of 2017,2016, and 2009 tournaments
train_dataset = pd.read_excel("Complete Combined Files .xlsx")
#load data set of 2024 tournament
test_dataset = pd.read_csv("2024 Final Total Data.csv")

# set x_train to the combined data set, but drop all columns that aren't related to a team's statistics
X_train = train_dataset.drop(["RK",'TEAM',"CONF", "R64", 'R32', 'S16', 'E8', 'F4','F2','CHAMP'],axis = 1)
# set x_test to the 2024 data set, but drop all columns that aren't related to a team's statistics
X_test = test_dataset.drop(["RK",'TEAM',"CONF", "R64", 'R32', 'S16', 'E8', 'F4','F2','CHAMP'],axis = 1)
# set y_train to the column of the winners of the first round of the combined data set
y_train = train_dataset['R32']
# set y_test to the column of the winners of the first round of the 2024 data set
y_test = test_dataset['R32']

test_dataset

Unnamed: 0,RK,TEAM,CONF,ADJ. EFF. OFF.,ADJ. EFF. DEF.,EFF. FG% OFF.,EFF. FG% DEF.,TURNOVER% OFF.,TURNOVER% DEF.,OFF. REB% OFF.,...,3P% OFF.,3P% DEF.,R64,R32,S16,E8,F4,F2,CHAMP,BARTHAG
0,117,Akron,MAC,105.1,101.8,51.7,49.1,16.9,16.7,29.2,...,31.7,30.7,1,0,0,0,0,0,0,0.6054
1,10,Alabama,SEC,124.9,101.7,56.3,49.4,15.9,15.6,34.8,...,36.7,31.3,1,1,1,1,1,0,0,0.9213
2,24,Arizona,P12,121.0,92.9,54.9,48.2,16.1,18.1,35.6,...,37.3,33.0,1,1,1,0,0,0,0,0.9546
3,38,Auburn,SEC,120.9,93.0,54.2,43.7,15.1,18.1,32.9,...,35.2,30.2,1,0,0,0,0,0,0,0.9563
4,14,Baylor,B12,122.5,100.8,55.7,51.4,17.7,17.0,35.0,...,39.5,33.4,1,1,0,0,0,0,0,0.9046
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,348,Wagner,NEC,96.1,105.2,45.1,48.8,15.7,16.8,28.8,...,32.2,30.4,1,0,0,0,0,0,0,0.2362
60,112,Washington St.,P12,112.9,96.5,51.8,47.2,16.4,15.7,32.9,...,33.9,32.5,1,1,0,0,0,0,0,0.8488
61,121,Western Kentucky,CUSA,105.4,101.7,51.7,48.6,18.4,18.4,28.7,...,34.3,32.4,1,0,0,0,0,0,0,0.5869
62,92,Wisconsin,B10,118.6,98.9,52.2,52.0,14.9,16.9,30.2,...,34.9,36.9,1,0,0,0,0,0,0,0.9006


In [29]:
# create a svm classification model
model = svm.SVC(kernel = 'rbf',C = 5000, gamma = 'scale')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

#create a confusion matrix and print overall accuracy score
matrix = metrics.confusion_matrix(y_true = y_test, y_pred = y_pred)
print(matrix)
print("Accuracy = ", metrics.accuracy_score(y_true = y_test, y_pred = model.predict(X_test)))

[[17 15]
 [ 4 28]]
Accuracy =  0.703125
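
The notebook sets `C = 5000` for this round and `C = 2500` for the next without showing how those values were chosen. One systematic option is a cross-validated grid search over `C`, sketched below on synthetic data. The grid values, the scaling step, and the step names `scale` and `svc` are assumptions, not the procedure the authors used.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Synthetic stand-in for the 192 training rows.
X_demo = rng.normal(size=(192, 6))
noise = 0.3 * rng.normal(size=192)
y_demo = (X_demo[:, 0] - X_demo[:, 1] + noise > 0).astype(int)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf", gamma="scale")),
])

# Try several orders of magnitude for C and keep the best 5-fold accuracy.
c_grid = [0.1, 1, 10, 100, 1000]
search = GridSearchCV(pipeline, param_grid={"svc__C": c_grid}, cv=5)
search.fit(X_demo, y_demo)
best_C = search.best_params_["svc__C"]
print("best C:", best_C, "cv accuracy:", round(search.best_score_, 3))
```

A larger `C` penalizes training errors more heavily, so very large values tend to fit the training seasons closely at the risk of generalizing less well to a new season.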


In [30]:
# find out which teams the model predicted to win in the first round

# row positions of the teams predicted to advance
predicted_winners = np.flatnonzero(y_pred == 1).tolist()

print(len(predicted_winners))

# look up the names of the predicted winners in the 2024 data set by row position
predicted_team_winners = test_dataset.loc[predicted_winners, "TEAM"].tolist()

print(len(predicted_team_winners))
print(list(predicted_team_winners))

43
43
['Alabama', 'Arizona', 'Auburn', 'Baylor', 'BYU', 'Clemson', 'Colorado', 'Colorado St.', 'Connecticut', 'Creighton', 'Dayton', 'Drake', 'Duke', 'Florida', 'Florida Atlantic', 'Gonzaga', 'Houston', 'Illinois', 'Iowa St.', 'James Madison', 'Kansas', 'Kentucky', 'Marquette', 'Michigan St.', 'Mississippi St.', 'Nebraska', 'Nevada', 'New Mexico', 'North Carolina', 'North Carolina St.', 'Northwestern', 'Oregon', 'Purdue', "Saint Mary's", 'San Diego St.', 'TCU', 'Tennessee', 'Texas', 'Texas A&M', 'Texas Tech', 'Utah St.', 'Washington St.', 'Wisconsin']
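
The model above labels 43 teams as first-round winners, although only 32 teams can advance. If exactly 32 picks are wanted, one option is to rank teams by the SVM's `decision_function` score and keep the top 32. The sketch below shows this on synthetic data; it is an assumption about how the constraint could be imposed, not what the notebook does.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Synthetic stand-ins for the training seasons and the 64-team field.
X_past = rng.normal(size=(192, 4))
y_past = (X_past[:, 0] > 0).astype(int)
X_field = rng.normal(size=(64, 4))

svm_model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_past, y_past)

# Signed distance to the decision boundary; larger means "more likely to advance".
scores = svm_model.decision_function(X_field)

# Row positions of the 32 highest-scoring teams, ignoring the default cutoff at 0.
top32 = np.argsort(scores)[::-1][:32]
forced_pred = np.zeros(len(X_field), dtype=int)
forced_pred[top32] = 1
print(forced_pred.sum())
```

This ranking still ignores the bracket itself: two highly rated teams that play each other in the first round could both be selected, so a matchup-aware approach would be needed for a fully valid bracket.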


#### Second Round Winners

In [31]:
#create a new data frame that keeps only the teams the model predicted to win the first round
second_round_test_dataset = test_dataset.loc[predicted_winners]
# reset the index numbers
second_round_test_dataset = second_round_test_dataset.reset_index()
# set y_train to the column of the actual winners of the second round of the combined data set
y_train = train_dataset['S16']
# set y_test to the column of the winners of the second round of the 2024 data set
y_test = second_round_test_dataset['S16']
# set x_test to the winners of the first round, but drop all columns that aren't related to a team's statistics
X_test = second_round_test_dataset.drop(["RK",'TEAM',"CONF", "R64", 'R32', 'S16', 'E8', 'F4','F2','CHAMP', 'index'],axis = 1)
second_round_test_dataset.head()

Unnamed: 0,index,RK,TEAM,CONF,ADJ. EFF. OFF.,ADJ. EFF. DEF.,EFF. FG% OFF.,EFF. FG% DEF.,TURNOVER% OFF.,TURNOVER% DEF.,...,3P% OFF.,3P% DEF.,R64,R32,S16,E8,F4,F2,CHAMP,BARTHAG
0,1,10,Alabama,SEC,124.9,101.7,56.3,49.4,15.9,15.6,...,36.7,31.3,1,1,1,1,1,0,0,0.9213
1,2,24,Arizona,P12,121.0,92.9,54.9,48.2,16.1,18.1,...,37.3,33.0,1,1,1,0,0,0,0,0.9546
2,3,38,Auburn,SEC,120.9,93.0,54.2,43.7,15.1,18.1,...,35.2,30.2,1,0,0,0,0,0,0,0.9563
3,4,14,Baylor,B12,122.5,100.8,55.7,51.4,17.7,17.0,...,39.5,33.4,1,1,0,0,0,0,0,0.9046
4,5,25,BYU,B12,119.8,99.7,54.8,48.2,15.3,16.0,...,34.8,32.0,1,0,0,0,0,0,0,0.9087
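
The extra `index` column in the table above comes from `reset_index()`, which keeps the old index as a new column, and it has to be dropped again before modeling. A small sketch of the difference, using a toy four-team frame with values taken from the 2024 table:

```python
import pandas as pd

field = pd.DataFrame({
    "TEAM": ["Akron", "Alabama", "Arizona", "Auburn"],
    "BARTHAG": [0.6054, 0.9213, 0.9546, 0.9563],
})
predicted_positions = [1, 2, 3]

# Default behaviour: the old index is kept as an extra "index" column.
kept_old_index = field.loc[predicted_positions].reset_index()

# drop=True discards the old index, so no extra column needs removing later.
dropped_old_index = field.loc[predicted_positions].reset_index(drop=True)

print(list(kept_old_index.columns))
print(list(dropped_old_index.columns))
```

Keeping the old index is still useful when the original row positions are needed later, so which form to use depends on whether those positions are referenced again.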


In [32]:
# create a svm classification model
model = svm.SVC(kernel = 'rbf',C = 2500, gamma = 'scale')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

#create a confusion matrix and print overall accuracy score
matrix = metrics.confusion_matrix(y_true = y_test, y_pred = y_pred)
print(matrix)
print("Accuracy = ", metrics.accuracy_score(y_true = y_test, y_pred = y_pred))

[[23  4]
 [ 8  8]]
Accuracy =  0.7209302325581395
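
To read the matrix above: scikit-learn's `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The short sketch below redoes the arithmetic in plain Python with the numbers printed above. Precision and recall are extra metrics that the notebook itself does not compute; they are included only for illustration.

```python
# Confusion matrix printed above, in scikit-learn's layout:
# rows = true label (0, 1), columns = predicted label (0, 1)
matrix = [[23, 4],
          [8, 8]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all teams classified correctly
precision = tp / (tp + fp)     # share of predicted winners that really advanced
recall = tp / (tp + fn)        # share of actual winners the model found

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

Read this way, only 8 of the 12 predicted winners actually advanced, and the model found 8 of the 16 teams that did, which the single accuracy figure does not show.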


In [33]:
# find out which teams the model predicted to win in the second round

# collect the row positions of the predicted winners (label 1)
predicted_winners = [i for i, label in enumerate(y_pred) if label == 1]
print(len(predicted_winners))

# look up the team names of the predicted winners by their row positions
predicted_team_winners = second_round_test_dataset.loc[predicted_winners, "TEAM"].tolist()

print(len(predicted_team_winners))
print(list(predicted_team_winners))

12
12
['Arizona', 'Baylor', 'Colorado', 'Connecticut', 'Duke', 'Gonzaga', 'Houston', 'Illinois', 'Iowa St.', 'Kansas', 'Purdue', 'Wisconsin']
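
The model above advanced 12 teams, although 16 teams reach the Sweet 16 in the real bracket. One possible way to force a fixed number of winners, sketched here as an assumption and not as the method used in this notebook, is to rank the teams by a continuous score and keep the top ones. For an SVM, `SVC.decision_function` returns such a score. The helper `top_k_teams` and the scores below are hypothetical.

```python
def top_k_teams(scores, k):
    """Return the k team names with the highest scores (ties broken by name)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [team for team, _ in ranked[:k]]


# Hypothetical decision scores; real ones could come from model.decision_function(X_test)
example_scores = {
    "Team A": 1.8,
    "Team B": -0.4,
    "Team C": 0.9,
    "Team D": 0.2,
    "Team E": -1.3,
    "Team F": 0.2,
}

advancing = top_k_teams(example_scores, k=3)
print(advancing)
```

This guarantees the right number of teams per round, but it still ignores who plays whom, so two opponents could both be selected.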


#### Sweet 16 Winners

In [34]:
#create a new data frame that filters out the losers of the second round so we can just have the winners
sweet16_test_dataset = second_round_test_dataset.loc[predicted_winners]
# reset the index numbers
sweet16_test_dataset = sweet16_test_dataset.reset_index()
# set y_train to the column of the actual winners of the sweet 16 of the combined data set
y_train = train_dataset['E8']
# set y_test to the column of the winners of the sweet 16 of the 2024 data set
y_test = sweet16_test_dataset['E8']
# set x_test to the winners of the second round, but drop all columns that aren't related to a team's statistics
X_test = sweet16_test_dataset.drop(["RK",'TEAM',"CONF", "R64", 'R32', 'S16', 'E8', 'F4','F2','CHAMP', 'index', 'level_0'],axis = 1)
sweet16_test_dataset


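| | level_0 | index | RK | TEAM | CONF | ADJ. EFF. OFF. | ADJ. EFF. DEF. | EFF. FG% OFF. | EFF. FG% DEF. | TURNOVER% OFF. | ... | 3P% OFF. | 3P% DEF. | R64 | R32 | S16 | E8 | F4 | F2 | CHAMP | BARTHAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 24 | Arizona | P12 | 121.0 | 92.9 | 54.9 | 48.2 | 16.1 | ... | 37.3 | 33.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0.9546 |
| 1 | 3 | 4 | 14 | Baylor | B12 | 122.5 | 100.8 | 55.7 | 51.4 | 17.7 | ... | 39.5 | 33.4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.9046 |
| 2 | 6 | 9 | 16 | Colorado | P12 | 118.3 | 98.4 | 55.4 | 49.7 | 18.1 | ... | 39.1 | 32.1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.8861 |
| 3 | 8 | 11 | 6 | Connecticut | BE | 127.1 | 92.7 | 57.2 | 44.7 | 14.7 | ... | 36.1 | 31.3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.9686 |
| 4 | 12 | 15 | 18 | Duke | ACC | 121.8 | 95.9 | 55.3 | 48.7 | 14.2 | ... | 38.1 | 32.1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0.9265 |
| 5 | 15 | 19 | 7 | Gonzaga | WCC | 123.0 | 98.3 | 57.1 | 46.8 | 14.1 | ... | 36.3 | 33.8 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0.9132 |
| 6 | 16 | 22 | 186 | Houston | B12 | 120.3 | 86.5 | 50.5 | 43.9 | 13.7 | ... | 34.9 | 30.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0.9778 |
| 7 | 17 | 23 | 31 | Illinois | B10 | 126.9 | 101.8 | 54.4 | 47.9 | 15.0 | ... | 35.2 | 34.3 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0.9187 |
| 8 | 18 | 24 | 93 | Iowa St. | B12 | 114.1 | 87.4 | 52.2 | 47.3 | 15.5 | ... | 35.6 | 31.5 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0.952 |
| 9 | 20 | 26 | 51 | Kansas | B12 | 113.7 | 94.1 | 53.5 | 48.3 | 16.5 | ... | 33.6 | 34.6 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.9078 |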


In [35]:
# create a svm classification model
model = svm.SVC(kernel = 'rbf',C =100000, gamma = 'scale')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

#create a confusion matrix and print overall accuracy score
matrix = metrics.confusion_matrix(y_true = y_test, y_pred = y_pred)
print(matrix)
print("Accuracy = ", metrics.accuracy_score(y_true = y_test, y_pred = y_pred))

[[4 4]
 [0 4]]
Accuracy =  0.6666666666666666


In [36]:
# find out which teams the model predicted to win in the sweet 16

# collect the row positions of the predicted winners (label 1)
predicted_winners = [i for i, label in enumerate(y_pred) if label == 1]
print(len(predicted_winners))

# look up the team names of the predicted winners by their row positions
predicted_team_winners = sweet16_test_dataset.loc[predicted_winners, "TEAM"].tolist()

print(len(predicted_team_winners))
predicted_team_winners

8
8


['Arizona',
 'Connecticut',
 'Duke',
 'Gonzaga',
 'Houston',
 'Illinois',
 'Iowa St.',
 'Purdue']

#### Elite Eight Winners

In [37]:
#create a new data frame that filters out the losers of the sweet 16 round so we can just have the winners
elite8_test_dataset = sweet16_test_dataset.loc[predicted_winners]
# reset the index numbers
elite8_test_dataset = elite8_test_dataset.reset_index(drop = True)
# set y_train to the column of the actual winners of the elite 8 of the combined data set
y_train = train_dataset['F4']
# set y_test to the column of the winners of the elite 8 of the 2024 data set
y_test = elite8_test_dataset['F4']
# set x_test to the winners of the sweet 16, but drop all columns that aren't related to a team's statistics
X_test = elite8_test_dataset.drop(["RK",'TEAM',"CONF", "R64", 'R32', 'S16', 'E8', 'F4','F2','CHAMP', 'index', 'level_0'],axis = 1)
elite8_test_dataset

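| | level_0 | index | RK | TEAM | CONF | ADJ. EFF. OFF. | ADJ. EFF. DEF. | EFF. FG% OFF. | EFF. FG% DEF. | TURNOVER% OFF. | ... | 3P% OFF. | 3P% DEF. | R64 | R32 | S16 | E8 | F4 | F2 | CHAMP | BARTHAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 24 | Arizona | P12 | 121.0 | 92.9 | 54.9 | 48.2 | 16.1 | ... | 37.3 | 33.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0.9546 |
| 1 | 8 | 11 | 6 | Connecticut | BE | 127.1 | 92.7 | 57.2 | 44.7 | 14.7 | ... | 36.1 | 31.3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.9686 |
| 2 | 12 | 15 | 18 | Duke | ACC | 121.8 | 95.9 | 55.3 | 48.7 | 14.2 | ... | 38.1 | 32.1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0.9265 |
| 3 | 15 | 19 | 7 | Gonzaga | WCC | 123.0 | 98.3 | 57.1 | 46.8 | 14.1 | ... | 36.3 | 33.8 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0.9132 |
| 4 | 16 | 22 | 186 | Houston | B12 | 120.3 | 86.5 | 50.5 | 43.9 | 13.7 | ... | 34.9 | 30.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0.9778 |
| 5 | 17 | 23 | 31 | Illinois | B10 | 126.9 | 101.8 | 54.4 | 47.9 | 15.0 | ... | 35.2 | 34.3 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0.9187 |
| 6 | 18 | 24 | 93 | Iowa St. | B12 | 114.1 | 87.4 | 52.2 | 47.3 | 15.5 | ... | 35.6 | 31.5 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0.952 |
| 7 | 32 | 43 | 12 | Purdue | B10 | 126.8 | 94.9 | 56.2 | 47.3 | 16.3 | ... | 40.9 | 31.4 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0.9659 |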


In [38]:
# create a svm classification model
model = svm.SVC(kernel = 'rbf',C =10000, gamma = 'scale')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

#create a confusion matrix and print overall accuracy score
matrix = metrics.confusion_matrix(y_true = y_test, y_pred = y_pred)
print(matrix)
print("Accuracy = ", metrics.accuracy_score(y_true = y_test, y_pred = y_pred))

[[6 0]
 [0 2]]
Accuracy =  1.0


In [39]:
# find out which teams the model predicted to win in the elite 8

# collect the row positions of the predicted winners (label 1)
predicted_winners = [i for i, label in enumerate(y_pred) if label == 1]
print(len(predicted_winners))

# look up the team names of the predicted winners by their row positions
predicted_team_winners = elite8_test_dataset.loc[predicted_winners, "TEAM"].tolist()

print(len(predicted_team_winners))
predicted_team_winners

2
2


['Connecticut', 'Purdue']

#### Final Four Winners

In [40]:
#create a new data frame that filters out the losers of the elite 8 round so we can just have the winners
final4_test_dataset = elite8_test_dataset.loc[predicted_winners]
# reset the index numbers
final4_test_dataset = final4_test_dataset.reset_index(drop = True)
# set y_train to the column of the actual winners of the final four of the combined data set
y_train = train_dataset['F2']
# set y_test to the column of the winners of the final four of the 2024 data set
y_test = final4_test_dataset['F2']
# set x_test to the winners of the elite eight, but drop all columns that aren't related to a team's statistics
X_test = final4_test_dataset.drop(["RK",'TEAM',"CONF", "R64", 'R32', 'S16', 'E8', 'F4','F2','CHAMP', 'index', 'level_0'],axis = 1)
final4_test_dataset

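| | level_0 | index | RK | TEAM | CONF | ADJ. EFF. OFF. | ADJ. EFF. DEF. | EFF. FG% OFF. | EFF. FG% DEF. | TURNOVER% OFF. | ... | 3P% OFF. | 3P% DEF. | R64 | R32 | S16 | E8 | F4 | F2 | CHAMP | BARTHAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 11 | 6 | Connecticut | BE | 127.1 | 92.7 | 57.2 | 44.7 | 14.7 | ... | 36.1 | 31.3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.9686 |
| 1 | 32 | 43 | 12 | Purdue | B10 | 126.8 | 94.9 | 56.2 | 47.3 | 16.3 | ... | 40.9 | 31.4 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0.9659 |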


In [41]:
# create a svm classification model
model = svm.SVC(kernel = 'rbf',C =10000, gamma = 'scale')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

#create a confusion matrix and print overall accuracy score
matrix = metrics.confusion_matrix(y_true = y_test, y_pred = y_pred)
print(matrix)
print("Accuracy = ", metrics.accuracy_score(y_true = y_test, y_pred = y_pred))

[[2]]
Accuracy =  1.0


In [42]:
# find out which teams the model predicted to win in the final four

# collect the row positions of the predicted winners (label 1)
predicted_winners = [i for i, label in enumerate(y_pred) if label == 1]
print(len(predicted_winners))

# look up the team names of the predicted winners by their row positions
predicted_team_winners = final4_test_dataset.loc[predicted_winners, "TEAM"].tolist()

print(len(predicted_team_winners))
predicted_team_winners

2
2


['Connecticut', 'Purdue']

#### Championship Winner

In [43]:
#create a new data frame that filters out the losers of the final four so we can just have the winners
champ_test_dataset = final4_test_dataset.loc[predicted_winners]
# reset the index numbers
champ_test_dataset = champ_test_dataset.reset_index(drop = True)
# set y_train to the column of the actual winner of the championship of the combined data set
y_train = train_dataset['CHAMP']
# set y_test to the column of the winner of the championship of the 2024 data set
y_test = champ_test_dataset['CHAMP']
# set x_test to the winners of the final four, but drop all columns that aren't related to a team's statistics
X_test = champ_test_dataset.drop(["RK",'TEAM',"CONF", "R64", 'R32', 'S16', 'E8', 'F4','F2','CHAMP', 'index', 'level_0'],axis = 1)
champ_test_dataset

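| | level_0 | index | RK | TEAM | CONF | ADJ. EFF. OFF. | ADJ. EFF. DEF. | EFF. FG% OFF. | EFF. FG% DEF. | TURNOVER% OFF. | ... | 3P% OFF. | 3P% DEF. | R64 | R32 | S16 | E8 | F4 | F2 | CHAMP | BARTHAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 11 | 6 | Connecticut | BE | 127.1 | 92.7 | 57.2 | 44.7 | 14.7 | ... | 36.1 | 31.3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.9686 |
| 1 | 32 | 43 | 12 | Purdue | B10 | 126.8 | 94.9 | 56.2 | 47.3 | 16.3 | ... | 40.9 | 31.4 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0.9659 |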


In [44]:
# create a svm classification model
model = svm.SVC(kernel = 'rbf',C =1000, gamma = 'scale')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

#create a confusion matrix and print overall accuracy score
matrix = metrics.confusion_matrix(y_true = y_test, y_pred = y_pred)
print(matrix)
print("Accuracy = ", metrics.accuracy_score(y_true = y_test, y_pred = y_pred))

[[0 1]
 [1 0]]
Accuracy =  0.0


In [45]:
# find out which team the model predicted to win the championship

# collect the row position of the predicted winner (label 1)
predicted_winner = [i for i, label in enumerate(y_pred) if label == 1]
print(len(predicted_winner))

# look up the team name of the predicted winner by its row position
predicted_team_winner = champ_test_dataset.loc[predicted_winner, "TEAM"].tolist()

print(len(predicted_team_winner))
print(predicted_team_winner)

1
1
['Purdue']


### KNN exploratory analysis - Using KNN On 2024 Data 


In [46]:
# read in the 2024 data 
data_2024 = pd.read_csv('2024 Final Total Data.csv')

#### KNN Second Round Prediction On 2024 Data 

In [47]:
# drop the non-numeric columns and the round-outcome columns we are trying to predict
XTrain = data.drop(columns = ['RK','TEAM','CONF', 'R64','R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP'], axis = 1)
XTest = data_2024.drop(columns = ['RK','TEAM','CONF', 'R64','R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP'], axis = 1)
yTrain = data['R32'] # the training data will be the combined data (from 2009, 2016, 2017)
yTest = data_2024['R32'] #the testing data will be the 2024 data 

model = KNeighborsClassifier(n_neighbors = 1) # define the model 
model.fit(XTrain, yTrain) #apply the model to the training data 
y_pred = model.predict(XTest) #make predictions 

print('Accuracy on test data = {:.2f}'.format(metrics.accuracy_score(y_true = yTest, y_pred = y_pred)*100), '%')

Accuracy on test data = 68.75 %


In [48]:
# row positions of the teams the model predicted to advance (label 1)
predicted_winners = [i for i, label in enumerate(y_pred) if label == 1]

# team names of the predicted winners in the 2024 data
predicted_teams = data_2024.loc[predicted_winners, 'TEAM'].tolist()


# do the same thing for the teams that actually won so that we can compare
actual_winners = [i for i, label in enumerate(yTest) if label == 1]
actual_teams = data_2024.loc[actual_winners, 'TEAM'].tolist()

In [49]:
print('Teams that were predicted:\n' , predicted_teams)
print()
print('Teams that actually made it:\n', actual_teams)
print()
print('check to see how many teams it predicted:', len(predicted_teams))
print() 
print('how many teams there actually are:', len(actual_teams))

Teams that were predicted:
 ['Alabama', 'Auburn', 'Baylor', 'BYU', 'Clemson', 'Colorado', 'Connecticut', 'Creighton', 'Dayton', 'Duke', 'Duquesne', 'Florida', 'Florida Atlantic', 'Gonzaga', 'Houston', 'Illinois', 'Iowa St.', 'James Madison', 'Marquette', 'Michigan St.', 'Nebraska', 'New Mexico', 'Northwestern', 'Oakland', "Saint Mary's", 'San Diego St.', 'TCU', 'Tennessee', 'Texas', 'Texas Tech', 'Utah St.', 'Wisconsin']

Teams that actually made it:
 ['Alabama', 'Arizona', 'Baylor', 'Clemson', 'Colorado', 'Connecticut', 'Creighton', 'Dayton', 'Duke', 'Duquesne', 'Gonzaga', 'Grand Canyon', 'Houston', 'Illinois', 'Iowa St.', 'James Madison', 'Kansas', 'Marquette', 'Michigan St.', 'North Carolina', 'North Carolina St.', 'Northwestern', 'Oakland', 'Oregon', 'Purdue', 'San Diego St.', 'Tennessee', 'Texas', 'Texas A&M', 'Utah St.', 'Washington St.', 'Yale']

check to see how many teams it predicted: 32

how many teams there actually are: 32
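
Both lists contain 32 teams, so the counts alone do not show how many predictions were right. A short sketch with Python sets, using the two lists printed above, separates the correct picks from the misses. The names `correct`, `false_picks`, and `missed` are illustrative and do not appear elsewhere in the notebook.

```python
predicted_teams = ['Alabama', 'Auburn', 'Baylor', 'BYU', 'Clemson', 'Colorado', 'Connecticut',
                   'Creighton', 'Dayton', 'Duke', 'Duquesne', 'Florida', 'Florida Atlantic',
                   'Gonzaga', 'Houston', 'Illinois', 'Iowa St.', 'James Madison', 'Marquette',
                   'Michigan St.', 'Nebraska', 'New Mexico', 'Northwestern', 'Oakland',
                   "Saint Mary's", 'San Diego St.', 'TCU', 'Tennessee', 'Texas', 'Texas Tech',
                   'Utah St.', 'Wisconsin']

actual_teams = ['Alabama', 'Arizona', 'Baylor', 'Clemson', 'Colorado', 'Connecticut',
                'Creighton', 'Dayton', 'Duke', 'Duquesne', 'Gonzaga', 'Grand Canyon',
                'Houston', 'Illinois', 'Iowa St.', 'James Madison', 'Kansas', 'Marquette',
                'Michigan St.', 'North Carolina', 'North Carolina St.', 'Northwestern',
                'Oakland', 'Oregon', 'Purdue', 'San Diego St.', 'Tennessee', 'Texas',
                'Texas A&M', 'Utah St.', 'Washington St.', 'Yale']

predicted, actual = set(predicted_teams), set(actual_teams)

correct = sorted(predicted & actual)       # predicted and actually advanced
false_picks = sorted(predicted - actual)   # predicted but eliminated
missed = sorted(actual - predicted)        # advanced but not predicted

print(len(correct), "correct picks")
print("Predicted but eliminated:", false_picks)
print("Advanced but not predicted:", missed)
```

The overlap is 22 of 32 picks, which is consistent with the 68.75% accuracy reported above: 22 correctly predicted winners plus 22 correctly predicted losers out of 64 teams.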


#### KNN Sweet 16 Prediction On 2024 Data 

In [50]:
sweet16 = data_2024.loc[predicted_winners] # edit the data so that the sweet 16 data only contains the winners our model predicted for the second round 
sweet16 = sweet16.reset_index() # have to reset the index so that we can find the index of our variables 
y_Train = data['S16'] # want to train from the ORIGINAL COMBINED DATA SET 
y_Test = sweet16['S16'] # want to test only on the 2024 data 
X_Test = sweet16.drop(columns = ['RK','TEAM','CONF', 'R64','R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP', 'index'], axis = 1)

model = KNeighborsClassifier(n_neighbors = 1) # create the model with 1 neighbor 
model.fit(XTrain, y_Train) # apply the model to the training data 
y_pred = model.predict(X_Test) #predict the winners 

print('Accuracy on test data = {:.2f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred)*100), '%')

Accuracy on test data = 62.50 %


In [51]:
# row positions of the teams the model predicted to advance (label 1)
predicted_winners = [i for i, label in enumerate(y_pred) if label == 1]

# team names of the predicted winners in the sweet 16 data
predicted_teams = sweet16.loc[predicted_winners, 'TEAM'].tolist()

print('teams that were predicted:\n', list(predicted_teams))

teams that were predicted:
 ['Auburn', 'Clemson', 'Colorado', 'Connecticut', 'Duke', 'Florida Atlantic', 'Gonzaga', 'Illinois', 'Iowa St.', 'Nebraska', 'Northwestern', 'Tennessee', 'Texas Tech', 'Wisconsin']


#### KNN Elite 8 Prediction On 2024 Data 

In [52]:
elite8 = sweet16.loc[predicted_winners] # edit the data such that it only contains the winners that our model predicted in the sweet 16
elite8 = elite8.reset_index() #have to reset the index of our dataset 
y_Train = data['E8'] # train off of the ORIGINAL DATA SET OF COMBINED YEARS 
y_Test = elite8['E8'] # predict from the 2024 data 
X_Test = elite8.drop(columns = ['RK','TEAM','CONF', 'R64','R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP', 'index', 'level_0'], axis = 1)

model = KNeighborsClassifier(n_neighbors = 1) #define the model 
model.fit(XTrain, y_Train) #apply the model 
y_pred = model.predict(X_Test) # predict 

print('Accuracy on test data = {:.2f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred)*100), '%')

Accuracy on test data = 64.29 %


In [53]:
# row positions of the teams the model predicted to advance (label 1)
predicted_winners = [i for i, label in enumerate(y_pred) if label == 1]

# team names of the predicted winners in the elite 8 data
predicted_teams = elite8.loc[predicted_winners, 'TEAM'].tolist()

print('confusion matrix: \n', metrics.confusion_matrix(y_Test, y_pred))
print()
print('teams that were predicted:\n', predicted_teams)

confusion matrix: 
 [[5 4]
 [1 4]]

teams that were predicted:
 ['Auburn', 'Connecticut', 'Duke', 'Florida Atlantic', 'Gonzaga', 'Illinois', 'Iowa St.', 'Tennessee']


#### KNN Final Four Prediction On 2024 Data 

In [54]:
finalfour = elite8.loc[predicted_winners] # create a new dataset that only contains the winners of the elite 8 round that our model predicted 
finalfour = finalfour.reset_index(drop = True) 
y_Train = data['F4'] # train off of the ORIGINAL DATA SET OF COMBINED YEARS 
y_Test = finalfour['F4'] # test from the 2024 dataset 
X_Test = finalfour.drop(columns = ['RK','TEAM','CONF', 'R64','R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP', 'index', 'level_0'], axis = 1)

model = KNeighborsClassifier(n_neighbors = 1) # create the model and decide number of neighbors 
model.fit(XTrain, y_Train) # apply the model to the training data 
y_pred = model.predict(X_Test) # predict based off the 2024 data 

print('Accuracy on test data = {:.2f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred)*100), '%')

Accuracy on test data = 75.00 %


In [55]:
# row positions of the teams the model predicted to advance (label 1)
predicted_winners = [i for i, label in enumerate(y_pred) if label == 1]

# team names of the predicted winners in the final 4 data
predicted_teams = finalfour.loc[predicted_winners, 'TEAM'].tolist()

print('confusion matrix: \n', metrics.confusion_matrix(y_Test, y_pred))
print()
print('teams that were predicted:\n', predicted_teams)

confusion matrix: 
 [[5 2]
 [0 1]]

teams that were predicted:
 ['Connecticut', 'Duke', 'Gonzaga']


#### KNN Final 2 Prediction On 2024 Data 

In [56]:
finaltwo = finalfour.loc[predicted_winners] # create a new dataset such that the only data in it are the predicted winners from the final 4 
finaltwo = finaltwo.reset_index(drop = True) # reset the index of the dataset 
y_Train = data['F2'] # train based off of the ORIGINAL DATA OF THE PREVIOUS YEARS 
y_Test = finaltwo['F2'] # test based on the 2024 data 
X_Test = finaltwo.drop(columns = ['RK','TEAM','CONF', 'R64','R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP', 'index', 'level_0'], axis = 1)

model = KNeighborsClassifier(n_neighbors = 1) # define the model 
model.fit(XTrain, y_Train) # apply the model to the training data 
y_pred = model.predict(X_Test) # predict 

print('Accuracy on test data = {:.2f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred)*100), '%')

Accuracy on test data = 100.00 %


In [57]:
# row positions of the teams the model predicted to advance (label 1)
predicted_winners = [i for i, label in enumerate(y_pred) if label == 1]

# team names of the predicted winners in the final 2 data
predicted_teams = finaltwo.loc[predicted_winners, 'TEAM'].tolist()

print('confusion matrix: \n', metrics.confusion_matrix(y_Test, y_pred))
print()
print('teams that were predicted:\n', predicted_teams)

confusion matrix: 
 [[2 0]
 [0 1]]

teams that were predicted:
 ['Connecticut']


#### KNN Championship Prediction On 2024 Data 

In [58]:
champ = finaltwo.loc[predicted_winners] # subset the data such that only the winners from the final two are in the dataset 
champ = champ.reset_index(drop = True) # reset the index 
y_Train = data['CHAMP'] # train off of the ORIGINAL COMBINED DATA 
y_Test = champ['CHAMP'] # test on the 2024 data 
X_Test = champ.drop(columns = ['RK','TEAM','CONF', 'R64','R32', 'S16', 'E8', 'F4', 'F2', 'CHAMP', 'index', 'level_0'], axis = 1)

model = KNeighborsClassifier(n_neighbors = 1) # create the model 
model.fit(XTrain, y_Train) # apply the model to the training data 
y_pred = model.predict(X_Test) # test on the 2024 data

print('Accuracy on test data = {:.2f}'.format(metrics.accuracy_score(y_true = y_Test, y_pred = y_pred)*100), '%')


Accuracy on test data = 0.00 %


In [59]:
# row positions of the teams the model predicted to win (label 1)
predicted_winners = [i for i, label in enumerate(y_pred) if label == 1]

# team names of the predicted winners in the champ data
predicted_teams = champ.loc[predicted_winners, 'TEAM'].tolist()

print('confusion matrix: \n', metrics.confusion_matrix(y_Test, y_pred))
print()
print('teams that were predicted:\n', predicted_teams)

confusion matrix: 
 [[0 0]
 [1 0]]

teams that were predicted:
 []


When applying KNN to the 2024 data, the only hyperparameter we varied to increase accuracy was the number of neighbors. For every round we used k = 1, because it provided the highest accuracy. 

With a single neighbor, the model has low bias and high variance: it tries to capture all of the intricacies and nuances of the training data. A KNN model with k = 1 commonly overfits, but because the model did not return a suspiciously high accuracy on the 2024 data, we judged that k = 1 was appropriate and that it was not overfitting severely. 
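
Choosing k by its accuracy on the 2024 data uses the test set for tuning. A common alternative, shown here only as a sketch on made-up numbers, is to pick k by leave-one-out validation on the training data alone. The functions `knn_predict` and `loo_accuracy` and the toy points are hypothetical; with scikit-learn the same idea is usually expressed with `GridSearchCV` or `cross_val_score`.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, point, k):
    """Predict the majority label among the k nearest training points."""
    by_distance = sorted(
        range(len(train_X)), key=lambda i: math.dist(train_X[i], point)
    )
    votes = Counter(train_y[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)


# Toy training data (made up): two features per team, label 1 = advanced
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.1, 2.9), (2.8, 3.2)]
y = [0, 0, 0, 1, 1, 1]

# Score each candidate k using the training data only
scores = {k: loo_accuracy(X, y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Because every candidate k is scored without touching the 2024 data, the 2024 accuracy remains an honest estimate of how the chosen model generalizes.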

## Final Results

Explain the accuracy of each round and the teams that survived the longest, and how that compares to the actual tournament bracket. 




## Evaluation

Throughout the project, we found that scaling the data did not increase the accuracy of the models we were using. In some instances scaling appeared to help, but we concluded that the improvement most likely came from overfitting. 
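
Both KNN and an RBF-kernel SVM rely on distances between feature vectors, so features on larger numeric scales (such as ADJ. EFF. OFF. around 120 versus BARTHAG below 1) carry more weight unless the data is standardized. As a hedged sketch of how scaling is usually applied, the mean and standard deviation are computed on the training data only and then reused on the test data. The helper names and numbers below are illustrative; scikit-learn's `StandardScaler` does the same job.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and standard deviation from training values only."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Standardize values with previously learned statistics."""
    return [(value - mu) / sigma for value in column]


# Illustrative values for one feature (not taken from the real dataset)
train_off_eff = [110.0, 120.0, 130.0]
test_off_eff = [125.0]

mu, sigma = fit_scaler(train_off_eff)
train_scaled = transform(train_off_eff, mu, sigma)
test_scaled = transform(test_off_eff, mu, sigma)  # reuse the training statistics

print(train_scaled, test_scaled)
```

Fitting the scaler on the training seasons and only transforming the 2024 data keeps information about the test set out of the preprocessing step.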

Given how many variables our dataset contains, it became apparent that many classification models perform well on this kind of data. The decision tree was the exception: it gave the lowest accuracy and had serious problems with overfitting. 

To refine our approach, we tuned the SVM's **C** parameter and adjusted the **K** value for the KNN model, which allowed us to find a balance between maximizing accuracy and mitigating overfitting. 

A notable challenge was that our models frequently designated both teams in a matchup as advancing to the next round, a logical inconsistency we were unable to resolve within our modeling framework. 
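
One way to remove this inconsistency, offered as a sketch and not as something the notebook implements, is to stop classifying teams independently and instead compare the two teams in each game, advancing whichever has the higher model score. The function `pick_matchup_winners`, the matchups, and the scores are hypothetical; real scores could come from `decision_function` or `predict_proba`, and the pairings would have to be taken from the actual bracket.

```python
def pick_matchup_winners(matchups, scores):
    """For each (team_a, team_b) game, advance the team with the higher score."""
    winners = []
    for team_a, team_b in matchups:
        # ties go to the first listed team so that exactly one team advances
        winners.append(team_a if scores[team_a] >= scores[team_b] else team_b)
    return winners


# Hypothetical bracket pairings and model scores
matchups = [("Team A", "Team B"), ("Team C", "Team D")]
scores = {"Team A": 0.91, "Team B": 0.88, "Team C": 0.35, "Team D": 0.42}

winners = pick_matchup_winners(matchups, scores)
print(winners)
```

In this example a 0.5 threshold would label both Team A and Team B as winners and neither Team C nor Team D, while the pairwise rule advances exactly one team per game.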



## Ethical Considerations

**Our ethical considerations include**

**1. Impact on the sport itself** 
    
Sports serve as a source of entertainment, and for many athletes, they constitute a livelihood. Implementing a data processing method like this can potentially overshadow the individual achievements of players, reducing their hard work and effort to mere statistics. This approach risks undermining the recognition of athletes' accomplishments and dedication to their sport. 
    
   
**2. Inaccessibility to technology** 

We must acknowledge that not all March Madness fans have access to technology. This limitation hampers their ability to use predictive models to make informed decisions when crafting their brackets, which can in turn create disparities in participation and engagement. 


**3. Bracket Competitions**


The use of predictive models in bracket competitions raises ethical concerns if participants are unaware of their use. This becomes particularly problematic in scenarios involving monetary stakes, where transparency regarding the methodology employed in the creation of one's bracket is essential for fairness and integrity.

## Project Summary

In summary, we started by choosing a topic that interested all of us. We all enjoy the March Madness tournament and were excited to put our data science skills to the test by predicting winners in the tournament. Next, we looked for basketball statistics that we could use in our models. Once we had the data, we cleaned it, checked that it was accurate, and looked at descriptive statistics. We then reshaped the data into the training and testing sets we needed. Finally, we ran KNN, SVM, decision tree, and logistic regression models to see which predicted the winners best. Logistic regression, KNN, and SVM all performed similarly, but for us SVM performed the best. We were able to predict ………FINISH THIS PART