**Project Notebook**

*This notebook is meant to be run once from start to finish, without skipping or re-running cells, to ensure that everything runs correctly.*

The first step is to run the .py file that builds all of the dataframes used in this notebook. Keeping the "gross" data manipulation in that file keeps it out of the final notebook, which pulls the finished dataframes straight from it.

In [1]:
%run Project_data.py

These imports are needed for the modeling in this notebook. The main library is sklearn, which provides the SVC classifier, the grid search, and the train/test splitting used below; statsmodels is used for the logistic regression near the end.

In [2]:
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.datasets import fetch_lfw_people
import time
from sklearn.datasets import make_circles
from mpl_toolkits.mplot3d import Axes3D
import sklearn
import pandas as pd
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

This is the creation of our training vectors and labels. The dataframe called 'final' is pulled from the .py file and includes all of the training data that we need.

In [3]:
train_vectors = final.drop(columns = ['WIN'])
train_labels = final['WIN']

We then used the SVC training procedure from class, a grid search over C and gamma with an RBF kernel, to find the parameter combination that scores best in cross-validation on the training data.

In [4]:
start = time.time()

tmp_vectors = train_vectors
tmp_labels = train_labels

print("Fitting the classifier to the training set")
param_grid = {'C': [1e1, 5e1, 1e2, 5e2, 1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],
              'kernel': ['rbf']}
clf = GridSearchCV(SVC(class_weight='balanced'), param_grid)

clf = clf.fit(tmp_vectors, tmp_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)
print("Best parameters found by grid search:")
print(clf.best_params_)

end = time.time()
print("Runtime",end - start)

Fitting the classifier to the training set
Best estimator found by grid search:
SVC(C=10.0, class_weight='balanced', gamma=0.005)
Best parameters found by grid search:
{'C': 10.0, 'gamma': 0.005, 'kernel': 'rbf'}
Runtime 4.274148464202881
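
Beyond the single best combination, it can help to look at how every point in the grid scored. The following is a minimal sketch, assuming synthetic data from `make_classification` as a stand-in for `final` (it is not the tournament data) and a much smaller grid. It reads the cross-validation results from `cv_results_` into a table so that close runner-up combinations are visible.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data (NOT the tournament data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative grid; the values are assumptions, not tuned choices.
param_grid = {"C": [1e1, 1e2], "gamma": [0.001, 0.01], "kernel": ["rbf"]}
search = GridSearchCV(SVC(class_weight="balanced"), param_grid, cv=5)
search.fit(X, y)

# One row per parameter combination, best-ranked combination first.
columns = ["param_C", "param_gamma", "mean_test_score",
           "std_test_score", "rank_test_score"]
results = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```

If the top few rows have nearly identical mean scores, the exact "best" parameters matter less than the output of the grid search might suggest.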


Now we use the fitted model to predict the 2021 March Madness bracket.

In [5]:
predict21 = clf.predict(final21)
predict21

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1])

In [6]:
true21 = (predict21 == result21)

In [7]:
sum(true21)/63

0.7142857142857143

We compared the predicted results of the games to the actual results (result21) pulled from the .py file. The number displayed above is the proportion of the 63 games that our model predicted correctly: 45 of 63, or about 71%.

Next we test how the accuracy of our model holds up across many random train/test splits. The following loop runs 100 iterations: each one draws a new 75/25 training/testing split, refits the model, and records its accuracy on the test set. The results are stored in a list that is used to find the average accuracy.

In [8]:
accuracy = []
for _ in range(100):
    # Draw a fresh random 75/25 train/test split on every iteration
    train_vectors, test_vectors, train_labels, test_labels = train_test_split(
        final.drop(columns=['WIN']), final['WIN'], test_size=0.25)

    # Smaller grid than in the full search above
    param_grid = {'C': [1e1, 5e1],
                  'gamma': [0.001, 0.005, 0.01],
                  'kernel': ['rbf']}
    clf = GridSearchCV(SVC(class_weight='balanced'), param_grid)
    clf = clf.fit(train_vectors, train_labels)

    # Accuracy on the held-out 25% for this split
    accuracy.append(clf.score(test_vectors, test_labels))
    

In [9]:
sum(accuracy)/len(accuracy)

0.6849367088607594
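
The average alone does not show how much the accuracy moves from split to split. Below is a minimal sketch of the same repeated-split idea that also reports the spread. It assumes synthetic data rather than `final`, a hypothetical helper name `repeated_split_accuracy`, and fixed SVC parameters instead of a grid search on every repeat, purely to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def repeated_split_accuracy(X, y, n_repeats=20, test_size=0.25):
    """Fit on a fresh random split each repeat and return the test accuracies."""
    scores = []
    for seed in range(n_repeats):
        # Seeding each split makes the whole experiment reproducible.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        # Fixed parameters (an assumption for brevity, not a tuned choice).
        model = SVC(C=10.0, gamma=0.005, kernel="rbf", class_weight="balanced")
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return np.array(scores)


# Synthetic stand-in for the training data (NOT the tournament data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
scores = repeated_split_accuracy(X, y)
print(f"mean accuracy = {scores.mean():.3f}, std = {scores.std(ddof=1):.3f}")
```

Reporting the standard deviation next to the mean gives a rough sense of whether a difference between two averaged accuracies is larger than the split-to-split noise.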

Now, we want to understand how a model would predict the 2021 March Madness bracket if we put both teams' statistics in the same row of a dataframe, as opposed to taking the difference between the statistics. We used the combined dataframe pulled from the .py file to run this prediction.
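
To make the two layouts concrete, here is a small sketch of how a difference row and a combined row could be built for the same matchups. The actual construction lives in the .py file and is not shown here, so the numbers, the `_A`/`_B` suffixes, and the choice of two columns are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical season statistics for the two teams in each of two matchups.
team_a = pd.DataFrame({"ADJOE": [118.2, 110.5], "EFG_O": [55.1, 50.3]})
team_b = pd.DataFrame({"ADJOE": [109.4, 113.0], "EFG_O": [51.7, 52.9]})

# Difference layout: one column per statistic (team A minus team B).
diff_features = team_a - team_b

# Combined layout: both teams' statistics side by side in one row,
# which doubles the number of feature columns.
combined_features = pd.concat(
    [team_a.add_suffix("_A"), team_b.add_suffix("_B")], axis=1
)

print(diff_features)
print(combined_features)
```

The combined layout gives the model twice as many features from the same number of games, so it has more freedom to fit but also more room to overfit.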

In [10]:
train_vectors = combined.drop(columns = ['WIN'])
train_labels = combined['WIN']

In [11]:
start = time.time()

tmp_vectors = train_vectors
tmp_labels = train_labels

print("Fitting the classifier to the training set")
param_grid = {'C': [1e1, 5e1, 1e2, 5e2, 1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],
              'kernel': ['rbf']}
clf = GridSearchCV(SVC(class_weight='balanced'), param_grid)

clf = clf.fit(tmp_vectors, tmp_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)
print("Best parameters found by grid search:")
print(clf.best_params_)

end = time.time()
print("Runtime",end - start)

Fitting the classifier to the training set
Best estimator found by grid search:
SVC(C=10.0, class_weight='balanced', gamma=0.05)
Best parameters found by grid search:
{'C': 10.0, 'gamma': 0.05, 'kernel': 'rbf'}
Runtime 2.83632493019104


In [12]:
predict21 = clf.predict(combined21)
predict21

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [13]:
true21 = (predict21 == result21)

In [14]:
sum(true21)/63

0.6984126984126984

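We can see that using the combined dataframe results in a slightly worse prediction for the 2021 March Madness bracket (44 of 63 games correct, versus 45 of 63 before). Note, however, that this model predicted 1 for every one of the 63 games.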
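
A prediction array made up entirely of 1s, like the one above, scores exactly the share of games whose actual label is 1, no matter what the features say. The sketch below illustrates this with made-up outcomes (not `result21`) and uses a confusion matrix to show that no game is ever predicted as 0.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical outcomes for 10 games (made up for illustration).
actual = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

# A "model" that predicts 1 for every game.
always_one = np.ones_like(actual)

accuracy = (always_one == actual).mean()
base_rate = actual.mean()  # share of games labeled 1
print(accuracy, base_rate)

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
matrix = confusion_matrix(actual, always_one, labels=[0, 1])
print(matrix)
```

Comparing a model's accuracy against this constant-prediction baseline is a quick check of whether the model has learned anything beyond the class balance.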

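We also want to see the accuracy of this model across repeated random splits, and how it stacks up against the previous model that uses a different dataset. Note that the reduced grid in the loop below does not include gamma = 0.05, the value that the full grid search selected for this dataset.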

In [15]:
accuracy = []
for _ in range(100):
    # Draw a fresh random 75/25 train/test split on every iteration
    train_vectors, test_vectors, train_labels, test_labels = train_test_split(
        combined.drop(columns=['WIN']), combined['WIN'], test_size=0.25)

    # Same reduced grid as for the difference-based model
    param_grid = {'C': [1e1, 5e1],
                  'gamma': [0.001, 0.005, 0.01],
                  'kernel': ['rbf']}
    clf = GridSearchCV(SVC(class_weight='balanced'), param_grid)
    clf = clf.fit(train_vectors, train_labels)

    # Accuracy on the held-out 25% for this split
    accuracy.append(clf.score(test_vectors, test_labels))

In [16]:
sum(accuracy)/len(accuracy)

0.6858227848101264

We find that the two models are very similar in average accuracy (about 0.685 and 0.686) when predicting the outcome of March Madness games.

The following code fits a logistic regression to each dataset to see which features are most significant when predicting the outcomes of these games.

In [21]:
train_vectors = final.drop(columns = ['WIN'])
train_labels = final['WIN']

logit_model = sm.Logit(train_labels, sm.add_constant(train_vectors))
result = logit_model.fit()
print(result.summary() )

Optimization terminated successfully.
         Current function value: 0.453724
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                    WIN   No. Observations:                  315
Model:                          Logit   Df Residuals:                      296
Method:                           MLE   Df Model:                           18
Date:                Sun, 04 Dec 2022   Pseudo R-squ.:                  0.2589
Time:                        19:22:18   Log-Likelihood:                -142.92
converged:                       True   LL-Null:                       -192.84
Covariance Type:            nonrobust   LLR p-value:                 2.368e-13
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2434      0.285      0.854      0.393      -0.315       0.802
ADJOE          0.3984      0.

In [22]:
train_vectors = combined.drop(columns = ['WIN'])
train_labels = combined['WIN']

logit_model = sm.Logit(train_labels, sm.add_constant(train_vectors))
result = logit_model.fit()
print(result.summary() )

Optimization terminated successfully.
         Current function value: 0.419954
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                    WIN   No. Observations:                  315
Model:                          Logit   Df Residuals:                      278
Method:                           MLE   Df Model:                           36
Date:                Sun, 04 Dec 2022   Pseudo R-squ.:                  0.3140
Time:                        19:22:44   Log-Likelihood:                -132.29
converged:                       True   LL-Null:                       -192.84
Covariance Type:            nonrobust   LLR p-value:                 3.847e-11
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -10.1279     22.382     -0.453      0.651     -53.996      33.740
ADJOE          0.4299      0.

Across these two models, we see that many features are highly significant, while many others have very little significance. Interestingly, when we refit the SVC model with the insignificant features excluded, the model accuracy becomes worse. Below is a small example of this...

In [19]:
accuracy = []
# Features dropped along with the label because they showed little significance
excluded = ['WIN', 'EFG_O', 'EFG_D', 'TORD', '2P_O', '2P_D', '3P_O', '3P_D']
for _ in range(100):
    # Draw a fresh random 75/25 train/test split on every iteration
    train_vectors, test_vectors, train_labels, test_labels = train_test_split(
        final.drop(columns=excluded), final['WIN'], test_size=0.25)

    param_grid = {'C': [1e1, 5e1],
                  'gamma': [0.001, 0.005, 0.01],
                  'kernel': ['rbf']}
    clf = GridSearchCV(SVC(class_weight='balanced'), param_grid)
    clf = clf.fit(train_vectors, train_labels)

    # Accuracy on the held-out 25% for this split
    accuracy.append(clf.score(test_vectors, test_labels))

In [20]:
sum(accuracy)/len(accuracy)

0.6499999999999999

We see that the average accuracy drops from about 0.685 to 0.650 when some of the insignificant features are excluded. We also tried various other combinations of excluded features, but the best accuracy always seems to come from the model that uses all features.
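
Because each run above draws its own random splits, part of the gap between two averaged accuracies can come from the splits rather than from the features. One way to reduce that noise is to score both feature sets on exactly the same splits. The following is a sketch under stated assumptions: synthetic data, fixed SVC parameters, and a hypothetical reduced feature set given by column indices in `keep`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the training data (NOT the tournament data).
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
keep = [0, 1, 2, 3, 4, 5]  # hypothetical reduced feature set

diffs = []
for seed in range(20):
    # Splitting row indices lets both models train and test on the same rows.
    idx_tr, idx_te = train_test_split(
        np.arange(len(y)), test_size=0.25, random_state=seed
    )
    full = SVC(C=10.0, gamma=0.005).fit(X[idx_tr], y[idx_tr])
    reduced = SVC(C=10.0, gamma=0.005).fit(X[idx_tr][:, keep], y[idx_tr])

    full_score = full.score(X[idx_te], y[idx_te])
    reduced_score = reduced.score(X[idx_te][:, keep], y[idx_te])
    diffs.append(full_score - reduced_score)

diffs = np.array(diffs)
print(f"mean accuracy difference (full - reduced) = {diffs.mean():.3f}")
```

A mean difference that stays clearly above or below zero across the paired splits is stronger evidence than comparing two averages computed on unrelated splits.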

In conclusion, our initial model does a fairly good job of predicting the 2021 March Madness bracket. Its accuracy is slightly higher than if you were to simply pick the team with the better seed, which is promising. What we really want is to pick out the upsets, because that is the main difficulty in predicting March Madness games. These models are fairly basic in the realm of sports prediction, but they are a good display of the skills we have learned in class.

It is important to note that the outcome of a game is not fully determined by the teams' season statistics; many other factors are at play. Humans are very hard to predict, especially in sports. We are streaky individuals, and this really shows in the results of basketball games. How do some 'worse' teams in March Madness make crazy runs in the tournament and beat really good teams? Sometimes we just do not know, and sometimes team statistics do not seem to matter at all. With all of this in mind, our model is a decent and cool prediction of the 2021 March Madness bracket.