## __Final Project: Feature Selection and Model Training__

Name: Drew Zink

Topic: NCAA Division 1 College Basketball Predictive Metrics

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

cbb = pd.read_csv('cbb_new_cleaned.csv')
cbb.head()

Unnamed: 0,W,ADJOE,ADJDE,BARTHAG,EFG_O,EFG_D,TOR,TORD,ORB,DRB,...,FTRD,2P_O,2P_D,3P_O,3P_D,ADJ_T,WAB,SEED,P7,DEEP
0,33.0,123.3,94.9,0.9531,52.6,48.1,15.4,18.2,40.7,30.0,...,30.4,53.9,44.6,32.7,36.2,71.7,8.6,1.0,1.0,1.0
1,36.0,129.1,93.6,0.9758,54.8,47.7,12.4,15.8,32.1,23.7,...,22.4,54.8,44.7,36.5,37.5,59.3,11.3,1.0,1.0,1.0
2,33.0,114.4,90.4,0.9375,53.9,47.7,14.0,19.5,25.5,24.9,...,30.0,54.7,46.8,35.2,33.2,65.9,6.9,3.0,1.0,1.0
3,31.0,115.2,85.2,0.9696,53.5,43.0,17.7,22.8,27.4,28.7,...,36.6,52.8,41.9,36.5,29.7,67.5,7.0,3.0,1.0,1.0
4,37.0,117.8,86.3,0.9728,56.6,41.1,16.2,17.1,30.0,26.2,...,26.9,56.3,40.0,38.2,29.0,71.5,7.7,1.0,0.0,1.0


Before creating the model, I want to load a held-out set that has no postseason results. This set covers the most recent college basketball season. I will clean it the same way I cleaned the data in my EDA, so that its formatting and structure match the previously cleaned set.

In [10]:
cbb24 = pd.read_csv('cbb24.csv')

print(cbb24.head(10))

# Drop teams with no tournament seed (NaN in 'SEED')
cbb24 = cbb24.dropna(subset=['SEED'])

# Flag teams that belong to one of the seven conferences listed here
cbb24['P7'] = cbb24['CONF'].isin(['ACC', 'B10', 'B12', 'SEC', 'BE', 'P12', 'Amer'])

# Remove identifying columns such as TEAM, since these are meant to be viewed as blind resumes
cbb24 = cbb24.drop(['RK','TEAM', 'CONF', 'G'], axis=1)

# Converting all non-floats to floats
cbb24['W'] = cbb24['W'].astype(float)
cbb24['SEED'] = cbb24['SEED'].astype(float)
cbb24['P7'] = cbb24['P7'].astype(float)

cbb24.info()

   RK            TEAM CONF   G   W  ADJOE  ADJDE  BARTHAG  EFG%  EFGD%  ...  \
0   1         Houston  B12  34  30  119.2   85.5   0.9785  49.7   44.0  ...   
1   2     Connecticut   BE  34  31  127.1   93.6   0.9712  57.1   45.1  ...   
2   3          Purdue  B10  33  29  126.2   94.7   0.9644  56.0   47.7  ...   
3   4        Iowa St.  B12  34  27  113.6   86.5   0.9583  51.9   47.1  ...   
4   5          Auburn  SEC  34  27  120.7   92.1   0.9573  54.1   43.4  ...   
5   6         Arizona  P12  33  25  121.5   93.6   0.9526  55.0   48.7  ...   
6   7       Tennessee  SEC  32  24  115.6   91.2   0.9382  51.5   45.4  ...   
7   8       Marquette   BE  34  25  118.9   94.6   0.9328  55.1   49.7  ...   
8   9  North Carolina  ACC  34  27  116.8   93.2   0.9305  51.3   46.4  ...   
9  10       Creighton   BE  32  23  120.6   96.5   0.9289  57.5   46.4  ...   

    DRB   FTR  FTRD  2P_O  2P_D  3P_O  3P_D  ADJ_T   WAB  SEED  
0  30.2  29.9  39.0  48.4  43.4  34.7  30.0   63.3  10.6   1.0  


To construct the final feature set, I am going to look at the correlation of each variable with the target feature 'DEEP'. Because 'DEEP' is binary (it takes only two values), I do not expect many strong correlations.

In [3]:
corr = cbb.corr(method='pearson')

high_corr_features = corr.index[abs(corr['DEEP']) > 0.25]
print(high_corr_features)


Index(['W', 'ADJOE', 'ADJDE', 'BARTHAG', 'WAB', 'SEED', 'DEEP'], dtype='object')


When the correlation threshold is set to 0.25, six features pass: Win Total (W), Adjusted Offensive Efficiency (ADJOE), Adjusted Defensive Efficiency (ADJDE), Barthag Rating (BARTHAG), Wins Above Bubble (WAB), and NCAA Tournament Seed (SEED). I am surprised that no individual team statistics other than the efficiency ratings appear in this set, but I understand why ratings rise to the top: they are composite measures that already account for those underlying statistics.
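One way to see why correlations with a two-valued target tend to be modest: Pearson correlation against a 0/1 variable is the point-biserial correlation, which scales the gap between the two group means by `sqrt(p * (1 - p))`, where `p` is the share of positives. The sketch below checks that identity on synthetic data. The arrays `deep` and `rating` are made-up stand-ins, not the real `cbb` columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (NOT the real cbb data): a 0/1 target and one rating
deep = rng.integers(0, 2, size=200)
rating = rng.normal(size=200) + 0.8 * deep  # rating shifts up when deep == 1

# Pearson correlation computed directly
pearson_r = np.corrcoef(rating, deep)[0, 1]

# Point-biserial correlation built from the two group means
m1 = rating[deep == 1].mean()   # mean rating among positives
m0 = rating[deep == 0].mean()   # mean rating among negatives
p = deep.mean()                 # share of positives
s = rating.std()                # population standard deviation (ddof=0)
point_biserial_r = (m1 - m0) / s * np.sqrt(p * (1 - p))

# The two values agree up to floating-point error
print(round(pearson_r, 4), round(point_biserial_r, 4))
```

The `sqrt(p * (1 - p))` factor peaks at 0.5 when the classes are balanced. In the test fold reported later in this notebook, 27 of 227 teams are positive, which puts that factor at about 0.32, so even a well-separated feature would show a damped correlation.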

In [4]:
cbb_final = cbb[high_corr_features]

# Remove 'DEEP' from high_corr_features (not present in cbb24)
high_corr_features = high_corr_features.drop('DEEP')
cbb24_final = cbb24[high_corr_features]

Finally, we need to split the cbb dataset into data and target sets. 'DEEP' is our target variable, so we can split our set accordingly.

In [5]:
target = cbb_final['DEEP']
data = cbb_final.drop('DEEP', axis=1)

data = data.values
target = target.values

Now, I am going to create the model. We will begin by using stratified splits with a logistic regression classifier.

In [23]:
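from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold

X = data
y = target

# Logistic regression classifier. The hyperparameters are an assumption:
# the original settings are not shown, and max_iter is raised here only to
# help convergence on unscaled features.
model = LogisticRegression(max_iter=1000)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)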

best_split = None
best_metric = -1.0  # Below any possible F1, so a split is always recorded
metrics = []

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Calculate confusion matrix and metrics
    cm = confusion_matrix(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)

    # Store metrics
    metrics.append((i, cm, precision, recall, f1))

    # Select the "best" split based on F1-score (or other metric)
    if f1 > best_metric:
        best_metric = f1
        best_split = (i, cm, precision, recall, f1)

# Print best split results
print(f"Best Split Index: {best_split[0]}")
print(f"Confusion Matrix:\n{best_split[1]}")
print(f"Precision: {best_split[2]:.2f}, Recall: {best_split[3]:.2f}, F1-Score: {best_split[4]:.2f}")

best_split_index = best_split[0]
train_index, test_index = list(skf.split(X, y))[best_split_index]

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

# Retrain on the best split
model.fit(X_train, y_train)
final_predictions = model.predict(X_test)

# Final Confusion Matrix
final_cm = confusion_matrix(y_test, final_predictions)
print(f"Final Confusion Matrix:\n{final_cm}")

Best Split Index: 0
Confusion Matrix:
[[193   7]
 [ 12  15]]
Precision: 0.68, Recall: 0.56, F1-Score: 0.61
Final Confusion Matrix:
[[193   7]
 [ 12  15]]
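The scores printed above can be reproduced by hand from the confusion matrix, which makes it easier to see what each one measures. This is a small arithmetic check, assuming scikit-learn's layout in which rows are actual classes and columns are predicted classes.

```python
# Confusion matrix from the best split, in scikit-learn's layout:
# rows are actual classes (0, 1), columns are predicted classes (0, 1).
tn, fp = 193, 7
fn, tp = 12, 15

precision = tp / (tp + fp)  # share of predicted deep runs that were real
recall = tp / (tp + fn)     # share of real deep runs that were predicted
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1-Score: {f1:.2f}")
# Precision: 0.68, Recall: 0.56, F1-Score: 0.61
```

Precision is the number that tracks false positives: 7 of the 22 positive predictions in this split were wrong.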


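We are trying to minimize the number of incorrect picks, which means we want as few false positives as possible. In the best split, 7 of the 22 teams predicted to go deep were false positives (precision 0.68), and 15 of the 27 teams that actually went deep were identified (recall 0.56). The aim for this project is to err toward underfitting rather than overfitting in order to reduce risk, so missing some deep runs is an acceptable trade. Next, we will apply the model to our cbb24_final set.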
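If fewer false positives is the priority, one option (not used in this notebook) is to require a higher predicted probability before labeling a team as a deep run, rather than the default 0.5 cutoff that `predict` applies. The sketch below illustrates the idea on synthetic, imbalanced data. The dataset, the names `X_demo`, `y_demo`, and `clf`, and the cutoffs 0.7 and 0.9 are all assumptions chosen for illustration; a real cutoff would need to be picked on validation data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (NOT the cbb set)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.88, 0.12], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

# Count false positives at progressively stricter cutoffs
false_positives = {}
for threshold in (0.5, 0.7, 0.9):
    y_hat = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
    false_positives[threshold] = int(fp)
    print(f"threshold={threshold}: FP={fp}, TP={tp}")
```

Raising the cutoff can only shrink the set of positive predictions, so false positives never increase, but true positives usually fall as well. That is the same precision-for-recall trade described above.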

In [24]:
test24 = cbb24_final.values
# use the model to predict the outputs
pred24 = model.predict(test24)

print(pred24.shape)
print(pred24)

# Find the number of 1s present in the outputted array
count = np.count_nonzero(pred24 == 1)
print(count, "teams had the potential to go deep.")

(68,)
[1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
3 teams had the potential to go deep.


The three teams the model predicted to go deep were Houston, UConn, and Purdue. Two of these predictions were correct: UConn and Purdue went on to meet in the National Championship Game last year, with UConn ultimately winning. Houston, on the other hand, lost in the Sweet 16, its third game, which shows the randomness of this tournament. Even so, all three were reasonable predictions, since oddsmakers had these teams among the favorites as the regular season came to a close last year. This gives me confidence that I could use this model to check whether teams are likely to reach the Elite 8 or further.