# Machine Learning

In this notebook, we will create a model to predict the probability of a team winning an NCAA tournament game against another team.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV

Below is the prediction data created in MatchupFeatureEngineering.ipynb.

In [2]:
matchups = pd.read_csv('mydata/matchups.csv')
matchups.head()

Unnamed: 0,Weighted_Rating_x,Weighted_Rating_y,Colley_Rating_x,Teamrank_Rating_x,Teamrank10_Rating_x,Trank_OE_x,Trank_DE_x,Trank_Rating_x,EFG%_x,EFGD%_x,...,yOffxDefAstDiff,xOffyDefPoints3,yOffxDefPoints3,xOffyDefPoints2,yOffxDefPoints2,xOffyDefPoints1,yOffxDefPoints1,xOffyDefPoints,yOffxDefPoints,PointDiff
0,19.603241,9.92806,0.484032,0.446465,0.444444,97.3,99.1,0.445334,49.8,46.5,...,0.021118,0.363127,0.263351,0.319767,0.290067,0.275786,0.267921,0.958679,0.821339,0.13734
1,40.358259,23.232983,0.921698,0.921212,1.0,117.2,88.8,0.978279,54.1,47.6,...,0.152258,0.403022,0.374303,0.328184,0.314259,0.263955,0.230452,0.995161,0.919015,0.076146
2,40.358259,36.520674,0.921698,0.921212,1.0,117.2,88.8,0.978279,54.1,47.6,...,0.109056,0.394202,0.307715,0.31197,0.314546,0.275416,0.234707,0.981588,0.856967,0.12462
3,37.274341,36.520674,0.86592,0.822222,0.848148,115.2,92.3,0.943463,55.0,46.8,...,0.018604,0.387736,0.377983,0.319887,0.311065,0.302387,0.233008,1.01001,0.922057,0.087953
4,42.312421,25.040472,0.938298,1.0,0.961111,121.0,85.6,1.0,56.3,44.8,...,0.078373,0.347188,0.439235,0.358261,0.316658,0.257623,0.231491,0.963072,0.987385,-0.024313


In [3]:
y = matchups.Upset
X = matchups.drop(columns = ['Upset'])

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)


### Logistic Regression

The first model we will be using is logistic regression. Logistic regression outputs class probabilities directly, and L1 regularization drives some coefficients to exactly zero, which helps identify the most important features. We will fit two models, one with L1 regularization and the other with L2 regularization. We will use 9-fold cross validation to choose the regularization parameter `C` that minimizes log loss, which guards against overfitting. Then we will evaluate both models on an unseen testing set and report their log loss.

In [5]:
c = [.0003, .001, .003, .01, .03, .1, .3, 1, 3, 10]
log1 = LogisticRegressionCV(cv = 9, random_state = 0, solver = 'liblinear', penalty = 'l1', Cs = c, scoring = 'neg_log_loss').fit(X_train, y_train)

In [6]:
log2 = LogisticRegressionCV(cv = 9, random_state = 0, solver = 'liblinear', penalty = 'l2', Cs = c, scoring = 'neg_log_loss').fit(X_train, y_train)

In [7]:
# Average the cv scores over the folds for each value of C, then print the best average.
# The scores are negative log loss, so the maximum (closest to zero) is the best.
# scores_[1] holds one row per fold and one column per value of C.
avg_log_loss1 = log1.scores_[1].mean(axis = 0)
avg_log_loss2 = log2.scores_[1].mean(axis = 0)
print(avg_log_loss1.max())  # best avg cv negative log loss for logistic regression model with L1 penalty
print(avg_log_loss2.max())  # best avg cv negative log loss for logistic regression model with L2 penalty

-0.5524384227555681
-0.5523004897146524


In [8]:
predict1 = log1.predict_proba(X_test)
predict2 = log2.predict_proba(X_test)
predict3 = (predict1 + predict2) / 2
actual_log_loss1 = log_loss(y_test, predict1)
actual_log_loss2 = log_loss(y_test, predict2)
actual_log_loss3 = log_loss(y_test, predict3)
print(actual_log_loss1)
print(actual_log_loss2)
print(actual_log_loss3)

0.584843643531568
0.5863985166050165
0.5854963134931546
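To put these log loss values in context, the sketch below computes binary log loss by hand on a small made-up example, checks it against `sklearn.metrics.log_loss`, and computes the score of a constant 0.5 prediction, which is ln 2 ≈ 0.693. The labels and probabilities (`y_demo`, `p_demo`) and the helper `manual_log_loss` are illustrative assumptions, not values or code from the matchup data.

```python
import numpy as np
from sklearn.metrics import log_loss


def manual_log_loss(y_true, p_pos):
    """Binary log loss: mean of -[y*log(p) + (1-y)*log(1-p)]."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) when a probability is exactly 0 or 1.
    p_pos = np.clip(np.asarray(p_pos, dtype=float), 1e-15, 1 - 1e-15)
    return float(-np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos)))


# Hypothetical labels and predicted probabilities of the positive class.
y_demo = [1, 0, 1, 1, 0]
p_demo = [0.9, 0.2, 0.6, 0.4, 0.1]

manual = manual_log_loss(y_demo, p_demo)
library = log_loss(y_demo, p_demo)

# A model that always predicts 0.5 scores ln(2) regardless of the labels.
baseline = manual_log_loss(y_demo, [0.5] * len(y_demo))

print(manual, library, baseline)
```

Any model with a test log loss below ln 2 is doing better than an uninformed coin flip, which is the case for the three values printed above.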


In [9]:
log1.C_  # The inverse regularization strength C chosen by cross validation for L1 penalty logistic regression

array([0.1])

In [10]:
log2.C_  # The inverse regularization strength C chosen by cross validation for L2 penalty logistic regression

array([0.003])

### Random Forest

The second model we will be using is Random Forest. Unlike logistic regression, a random forest can model nonlinear decision boundaries, so it might perform better. We will use cross validation to find the hyperparameters that minimize log loss, which guards against overfitting. Then we will evaluate the model on an unseen testing set and report its log loss.
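As a rough illustration of the nonlinear-boundary point, the sketch below fits both model types on toy data whose label depends on the sign of `x0 * x1`, an XOR-like pattern that no single straight line separates. The data, the variable names, and the choice of accuracy as the metric are assumptions made for this example only; it says nothing about how the two models compare on the matchup data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data (not the matchup data): label is 1 when x0 and x1 share a sign.
rng = np.random.RandomState(0)
X_toy = rng.uniform(-1, 1, size=(1000, 2))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

# A linear model versus a tree ensemble on the same split.
lr_toy = LogisticRegression().fit(Xt_train, yt_train)
rf_toy = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xt_train, yt_train)

lr_acc = lr_toy.score(Xt_test, yt_test)
rf_acc = rf_toy.score(Xt_test, yt_test)
print(lr_acc, rf_acc)
```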

In [11]:
# Hyperparameters for random forest
rf_params = {
    'n_estimators': [int(x) for x in np.linspace(50, 1000, 20)],
    'criterion': ['entropy', 'gini'],
    'max_depth': [10, 20, 30, 40, None],
    'min_samples_split': [int(x) for x in np.linspace(2, 10, 9)],
    'max_features': [4, 5, 'sqrt', 'log2', None],
    'random_state': [0]
}

In [12]:
rf = RandomForestClassifier()
random_search = RandomizedSearchCV(rf, param_distributions = rf_params, cv = 9, n_iter = 100, random_state = 0, 
                                   scoring = 'neg_log_loss').fit(X_train, y_train)

In [13]:
# The best hyperparameters found in the randomized search to minimize average log loss
random_search.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=10, max_features=4, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=8,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [14]:
# The best average cv score (negative log loss) achieved in the randomized search
random_search.best_score_

-0.5510799178506031

In [15]:
# The log loss of the random forest model on the unseen testing set
predict = random_search.predict_proba(X_test)
log_loss(y_test, predict)

0.6199459847111619

### K Nearest Neighbors

The third model we will be using is K Nearest Neighbors. KNN is a simple model that is easy to explain. One problem with our dataset is that it contains 100+ features, some of which are not important, and we do not want these unimportant features to influence similarity. Therefore, we will use our previous logistic regression model with L1 penalty for dimension reduction. The L1 penalty sets the coefficients of unimportant or highly correlated features to zero, so we keep only the features with nonzero coefficients.

In [16]:
# Columns whose L1 coefficient is exactly zero are dropped
zero_cols = X.columns[log1.coef_[0] == 0.0]
nonzero_X = X.drop(columns = zero_cols)
nonzero_X.head()

Unnamed: 0,Weighted_Rating_y,Trank_OE_x,Trank_DE_x,EFGD%_x,TOR_x,TORD_x,ORB_x,DRB_x,seed_x,Trank_DE_y,...,SeedDiff,WeightedRatingDiff,TrankTempoDiff,KenpomTempoAbsDiff,xOffyDefTrankAvg,yOffxDefTrankAvg,yOffxDefTODiff,xOffyDefTOAvg,xOffyDefRebDiff,yOffxDefRebDiff
0,9.92806,97.3,99.1,46.5,20.8,21.8,30.1,32.7,16.0,105.5,...,0.0,9.675181,3.7,2.9896,101.4,94.65,0.4,22.85,-8.2,-1.0
1,23.232983,117.2,88.8,47.6,18.2,24.9,34.3,33.5,2.0,105.0,...,-13.0,17.125276,2.3,2.8618,111.1,97.15,-4.5,20.3,3.9,0.3
2,36.520674,117.2,88.8,47.6,18.2,24.9,34.3,33.5,2.0,92.2,...,-5.0,3.837585,7.9,7.8767,104.7,100.95,-8.6,20.75,2.9,0.6
3,36.520674,115.2,92.3,46.8,20.3,20.0,36.5,29.5,3.0,92.2,...,-4.0,0.753667,0.0,0.1964,103.7,102.7,-3.7,21.8,5.1,4.6
4,25.040472,121.0,85.6,44.8,18.7,22.9,38.0,29.0,1.0,101.8,...,-15.0,17.271949,1.3,1.0755,111.4,95.45,-2.2,20.1,6.7,5.0


The dataset above contains the features we will use in KNN. Notice that it now has only 21 features rather than 100+.
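The zeroing behaviour of the L1 penalty can be seen on a toy problem. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, with 3 informative features out of 20) and compares how many coefficients are exactly zero under L1 and L2 penalties at the same, arbitrarily chosen `C`. All names here are hypothetical and unrelated to the matchup data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 3 informative features, 17 pure-noise features.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0)

# Same solver and C for both models; only the penalty differs.
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X_syn, y_syn)
l2_model = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X_syn, y_syn)

n_zero_l1 = int(np.sum(l1_model.coef_[0] == 0))
n_zero_l2 = int(np.sum(l2_model.coef_[0] == 0))
print(n_zero_l1, n_zero_l2)

# Keep only the columns that the L1 model gave a nonzero coefficient.
kept = np.flatnonzero(l1_model.coef_[0])
X_reduced = X_syn[:, kept]
print(X_reduced.shape)
```

The L2 penalty shrinks coefficients toward zero but rarely makes them exactly zero, which is why only the L1 model is useful for this kind of feature selection.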

In [17]:
# The new training and testing sets, using the same random_state and test size so we have the same test set as before
nX_train, nX_test, y_train, y_test = train_test_split(nonzero_X, y, test_size = 0.1, random_state = 0)


KNN measures similarity by distance, so the features must be on the same scale for each one to have comparable influence. We scale the data below. The scaler must be fit on the training set only, and the same transformation is then applied to the testing set.

In [18]:
scale = MinMaxScaler()
nX_train = pd.DataFrame(scale.fit_transform(nX_train))
nX_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,20
0,0.645269,0.586777,0.34375,0.223214,0.451327,0.02027,0.475336,0.240223,0.2,0.383871,...,0.272727,0.207332,0.333333,0.176162,0.338346,0.528926,0.860465,0.432039,0.605263,0.619497
1,0.82893,0.561983,0.305556,0.446429,0.761062,0.378378,0.838565,0.391061,0.266667,0.212903,...,0.681818,0.01448,0.526515,0.160495,0.182957,0.545455,0.651163,0.660194,0.894737,0.544025
2,0.312378,0.798898,0.309028,0.580357,0.584071,0.5,0.923767,0.407821,0.0,0.493548,...,0.0,0.687881,0.715909,0.472887,0.616541,0.303719,0.64186,0.597087,0.755639,0.5
3,0.746649,0.608815,0.03125,0.089286,0.424779,0.641892,0.717489,0.441341,0.0,0.216129,...,0.363636,0.240405,0.511364,0.132653,0.22807,0.32438,0.595349,0.334951,0.586466,0.641509
4,0.709344,0.705234,0.652778,0.8125,0.20354,0.445946,0.278027,0.513966,0.666667,0.283871,...,0.681818,0.005637,0.75,0.577216,0.368421,0.68595,0.6,0.281553,0.349624,0.58805


In [19]:
nX_test = pd.DataFrame(scale.transform(nX_test))
nX_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,20
0,0.723534,0.46832,0.302083,0.303571,0.336283,0.263514,0.372197,0.480447,0.266667,0.396774,...,0.363636,0.070496,0.628788,0.363253,0.240602,0.56405,0.562791,0.383495,0.609023,0.462264
1,0.762746,0.520661,0.434028,0.625,0.40708,0.337838,0.529148,0.379888,0.4,0.625806,...,0.727273,0.000232,0.367424,0.117439,0.466165,0.816116,0.474419,0.160194,0.477444,0.600629
2,0.984225,0.694215,0.138889,0.580357,0.566372,0.601351,0.780269,0.251397,0.0,0.0,...,0.681818,0.001486,0.261364,0.320396,0.137845,0.533058,0.404651,0.660194,0.763158,0.830189
3,0.689595,0.421488,0.170139,0.196429,1.0,0.567568,0.623318,0.648045,0.133333,0.387097,...,0.181818,0.1523,0.55303,0.257026,0.190476,0.456612,0.627907,0.68932,0.620301,0.559748
4,0.790595,0.793388,0.184028,0.133929,0.539823,0.432432,0.735426,0.575419,0.0,0.177419,...,0.318182,0.265204,0.738636,0.571529,0.365915,0.427686,0.544186,0.514563,0.853383,0.424528
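One way to apply the fit-on-training-data-only rule inside cross validation as well is to wrap the scaler and the classifier in a scikit-learn `Pipeline`, so the scaler is refit on the training portion of every fold. This is a minimal sketch on synthetic data, not the approach used in this notebook; the step names `scale` and `knn`, the small parameter grid, and `cv=3` are assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the reduced feature matrix.
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.1, random_state=0)

# The scaler is a pipeline step, so each cv fold fits it on that fold's training part only.
pipe = Pipeline([('scale', MinMaxScaler()), ('knn', KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    'knn__n_neighbors': [5, 15, 25],
    'knn__weights': ['uniform', 'distance'],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring='neg_log_loss').fit(Xs_train, ys_train)

# predict_proba scales the test set with the scaler fit on the full training set.
test_proba = search.predict_proba(Xs_test)
print(search.best_params_, search.best_score_)
```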


In [20]:
# Hyperparameters for KNN
knn_params = {
    'n_neighbors': list(range(1, 201)),  # every k from 1 to 200, without the duplicate that linspace(1, 200, 201) produces
    'weights': ['uniform', 'distance']
}

In [21]:
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, knn_params, cv = 9, scoring = 'neg_log_loss').fit(nX_train, y_train)

In [22]:
# The best hyperparameters found in the grid search to minimize average log loss
grid_search.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=40, p=2,
           weights='distance')

In [23]:
# The best average log loss achieved through the grid search
grid_search.best_score_

-0.566319894143969

In [24]:
# The log loss of the KNN model on the unseen testing set
predict = grid_search.predict_proba(nX_test)
log_loss(y_test, predict)

0.600555493230301

So, the Random Forest, L1, and L2 logistic regression models had the best CV log loss (about 0.55 each, versus about 0.57 for KNN), but on the testing set only the L1 and L2 logistic regression models performed best (about 0.585, versus 0.620 for Random Forest and 0.601 for KNN). Therefore, when making predictions we will use only the L1 and L2 logistic regression models.