# Rating Linear Regression

To judge how good a team is, we build a single weighted rating from the scaled Colley, Teamrank (full season and last 10 games), Trank, and Kenpom rating metrics. We use ridge regression to determine the weights because we want every metric to contribute, whereas forward selection or LASSO regression could drop some of them. The response variable is the score differential of each NCAA tournament game (winner's score minus loser's score), and the X variables are the differences in each rating metric between the two teams (winner minus loser). The weighted ratings should therefore be interpreted as follows: the difference between the weighted ratings of team 1 and team 2 is the predicted score differential.
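
Written out, this interpretation corresponds to a linear model with no intercept, fit with a ridge penalty. The notation is introduced here only for illustration and is not part of the original notebook: $r_k(t)$ is team $t$'s scaled rating under metric $k$, $w_k$ is that metric's weight, $R(t)$ is the weighted rating, $\Delta_g$ is the observed score differential of game $g$, $d_{g,k}$ is the difference in metric $k$ between the two teams in game $g$, and $\alpha$ is the ridge penalty strength.

```latex
\widehat{\Delta}_{ij}
  = \sum_{k=1}^{5} w_k \bigl( r_k(i) - r_k(j) \bigr)
  = R(i) - R(j),
\qquad
R(t) = \sum_{k=1}^{5} w_k \, r_k(t)

\hat{w} = \arg\min_{w} \;
  \sum_{g} \Bigl( \Delta_g - \sum_{k=1}^{5} w_k \, d_{g,k} \Bigr)^{2}
  + \alpha \sum_{k=1}^{5} w_k^{2}
```

Because there is no intercept term, the predicted differential depends only on $R(i) - R(j)$, which is what lets a single number per team stand in for all five metrics.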

In [1]:
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

The ranking data below was created in the TeamDataCleaning.ipynb notebook and the NCAA tournament data was provided by Kaggle.

In [2]:
ranking_data = pd.read_csv('mydata/ranking_data.csv')
tournament_data = pd.read_csv('DataFiles/NCAATourneyCompactResults.csv')

In [3]:
ranking_data.head()

Unnamed: 0,Season,TeamID,Colley_Rating,Teamrank_Rating,Teamrank10_Rating,Trank_Rating,Kenpom_Rating
0,2008,1314,1.0,0.939394,0.903704,0.974641,0.912825
1,2008,1272,0.967124,0.925253,0.866667,0.987633,0.931297
2,2008,1417,0.959789,0.915152,0.914815,0.984203,0.949233
3,2008,1397,0.956817,0.864646,0.864815,0.951881,0.82584
4,2008,1242,0.938298,1.0,0.961111,1.0,1.0


In [4]:
tournament_data.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,136,1116,63,1234,54,N,0
1,1985,136,1120,59,1345,58,N,0
2,1985,136,1207,68,1250,43,N,0
3,1985,136,1229,58,1425,55,N,0
4,1985,136,1242,49,1325,38,N,0


In [5]:
# Ranking data is only available for 2008 and 2010-2018, so keep just those seasons
tournament_data = tournament_data.query('Season > 2009 | Season == 2008').drop(columns = ['DayNum', 'WLoc', 'NumOT'])

In [6]:
game_data = pd.merge(tournament_data, ranking_data, left_on = ['Season', 'WTeamID'], right_on = ['Season', 'TeamID'])
game_data = pd.merge(game_data, ranking_data, left_on = ['Season', 'LTeamID'], right_on = ['Season', 'TeamID'])
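
The two merges above attach the winner's ratings first and the loser's ratings second. The toy sketch below shows what that does to the columns and rows. It assumes the default pandas behavior of appending `_x` and `_y` to overlapping column names and the default inner join. The two 2008 games use values from this notebook's data, while the 2009 game is hypothetical and is included only to show that games without ratings are dropped.

```python
import pandas as pd

# Toy ratings: three 2008 teams, one rating column for brevity.
ratings = pd.DataFrame({
    "Season": [2008, 2008, 2008],
    "TeamID": [1242, 1340, 1424],
    "Colley_Rating": [0.938298, 0.599374, 0.768816],
})

# Toy games: two real 2008 games plus one hypothetical 2009 game with no ratings.
games = pd.DataFrame({
    "Season": [2008, 2008, 2009],
    "WTeamID": [1242, 1242, 1242],
    "WScore": [85, 75, 70],
    "LTeamID": [1340, 1424, 1340],
    "LScore": [61, 56, 60],
})

# First merge: winner's ratings. Second merge: loser's ratings.
merged = pd.merge(games, ratings, left_on=["Season", "WTeamID"], right_on=["Season", "TeamID"])
merged = pd.merge(merged, ratings, left_on=["Season", "LTeamID"], right_on=["Season", "TeamID"])

print(merged.columns.tolist())  # winner columns end in _x, loser columns in _y
print(len(merged))              # the unrated 2009 game is gone
```

Passing `suffixes=('_W', '_L')` to the second merge would make the winner/loser roles explicit in the column names, at the cost of renaming the columns used in later cells.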

In [7]:
game_data.head()

Unnamed: 0,Season,WTeamID,WScore,LTeamID,LScore,TeamID_x,Colley_Rating_x,Teamrank_Rating_x,Teamrank10_Rating_x,Trank_Rating_x,Kenpom_Rating_x,TeamID_y,Colley_Rating_y,Teamrank_Rating_y,Teamrank10_Rating_y,Trank_Rating_y,Kenpom_Rating_y
0,2008,1291,69,1164,60,1291,0.484032,0.446465,0.444444,0.445334,0.475676,1164,0.351254,0.250505,0.237037,0.127105,0.277968
1,2008,1181,71,1125,70,1181,0.921698,0.921212,1.0,0.978279,0.905094,1125,0.619471,0.531313,0.561111,0.516109,0.529898
2,2008,1242,85,1340,61,1242,0.938298,1.0,0.961111,1.0,1.0,1340,0.599374,0.573737,0.546296,0.598109,0.578689
3,2008,1242,75,1424,56,1242,0.938298,1.0,0.961111,1.0,1.0,1424,0.768816,0.688889,0.672222,0.850239,0.69526
4,2008,1242,72,1437,57,1242,0.938298,1.0,0.961111,1.0,1.0,1437,0.718976,0.670707,0.67037,0.820827,0.685701


In [8]:
game_data['Colley_Diff'] = game_data['Colley_Rating_x'] - game_data['Colley_Rating_y']
game_data['Teamrank_Diff'] = game_data['Teamrank_Rating_x'] - game_data['Teamrank_Rating_y']
game_data['Teamrank10_Diff'] = game_data['Teamrank10_Rating_x'] - game_data['Teamrank10_Rating_y']
game_data['Trank_Diff'] = game_data['Trank_Rating_x'] - game_data['Trank_Rating_y']
game_data['Kenpom_Diff'] = game_data['Kenpom_Rating_x'] - game_data['Kenpom_Rating_y']
game_data['Score_Diff'] = game_data['WScore'] - game_data['LScore']

In [9]:
regression_data = game_data[['Colley_Diff', 'Teamrank_Diff', 'Teamrank10_Diff', 'Trank_Diff', 'Kenpom_Diff', 'Score_Diff']]

In [10]:
y = regression_data.Score_Diff
X = regression_data.drop(columns = ['Score_Diff'])

# Use the same train/test split as the prediction notebook (ML.ipynb) so the weights
# are never fit on games that are later used for testing (avoids leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)
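
Reusing the split relies on `train_test_split` being deterministic for a fixed `random_state`. The sketch below checks that on a made-up frame. The helper `split` and the toy data are hypothetical and stand in for `regression_data`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for regression_data: 20 rows of random numbers.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(20, 3)), columns=["a", "b", "Score_Diff"])

def split(frame):
    """Split a frame the same way the notebook does."""
    y = frame["Score_Diff"]
    X = frame.drop(columns=["Score_Diff"])
    return train_test_split(X, y, test_size=0.1, random_state=0)

X_train_a, X_test_a, _, _ = split(toy)
X_train_b, X_test_b, _, _ = split(toy)

# Identical rows in, identical seed: the held-out rows are the same both times.
same_split = X_test_a.index.equals(X_test_b.index)
print(same_split)
```

The guarantee only holds if the other notebook passes in the same rows in the same order. A frame that has been filtered or sorted differently will produce a different split even with the same seed.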

In [11]:
X_train.head()

Unnamed: 0,Colley_Diff,Teamrank_Diff,Teamrank10_Diff,Trank_Diff,Kenpom_Diff
433,0.153419,0.128352,0.091216,0.042396,0.11172
439,0.345421,0.249042,0.150338,0.274552,0.222965
18,0.59455,0.660606,0.687037,0.858449,0.682153
619,-0.095159,-0.156322,-0.029183,-0.133191,-0.11541
79,0.228468,0.137255,0.037567,0.091252,0.15254


In [12]:
X_train.corr()

Unnamed: 0,Colley_Diff,Teamrank_Diff,Teamrank10_Diff,Trank_Diff,Kenpom_Diff
Colley_Diff,1.0,0.907181,0.81932,0.901616,0.926536
Teamrank_Diff,0.907181,1.0,0.895144,0.937683,0.981465
Teamrank10_Diff,0.81932,0.895144,1.0,0.83737,0.897854
Trank_Diff,0.901616,0.937683,0.83737,1.0,0.944816
Kenpom_Diff,0.926536,0.981465,0.897854,0.944816,1.0


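As the correlation matrix shows, the rating differences are highly correlated with one another (every pairwise correlation is above 0.8). With predictors this collinear, forward/backward selection would most likely drop at least one rating metric, and LASSO would most likely shrink at least one coefficient to exactly 0. Ridge regression shrinks the coefficients without eliminating any of them, so we use it here.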
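
The contrast between the two penalties can be sketched on synthetic data. Everything below is an illustrative assumption rather than the notebook's data: the five features are random columns built to be about as correlated as the rating differences, and the `alpha` values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)  # shared signal that makes the columns collinear

# Five synthetic, highly correlated features (not the real rating differences).
X = np.column_stack([base + 0.2 * rng.normal(size=n) for _ in range(5)])
y = 10 * X.mean(axis=1) + rng.normal(scale=5, size=n)

ridge = Ridge(alpha=3, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=1.0, fit_intercept=False).fit(X, y)

print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))
print("coefficients set to 0 by lasso:", int(np.sum(lasso.coef_ == 0)))
```

Ridge spreads weight across all five columns, while LASSO's L1 penalty often sets some coefficients to exactly zero when the columns carry nearly the same information. How many it drops depends on the data and on `alpha`.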

In [13]:
reg = RidgeCV(cv = 5, fit_intercept = False, alphas = [.01, .03, .1, .3, 1, 3, 10]).fit(X_train, y_train)

In [14]:
reg.alpha_

3

In [15]:
reg.coef_

array([ 6.71236012, 11.19605646,  3.11732104, 13.51627166,  8.3058081 ])

The coefficients above are the weights for the Colley, Teamrank, Teamrank10 (Teamrank over the last 10 games), Trank, and Kenpom ratings, respectively.
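
A minimal sketch of how these weights could be applied, assuming the weighted rating is simply the dot product of a team's five scaled ratings with the coefficients. The column name `Weighted_Rating` is hypothetical. The two teams and their ratings are taken from `ranking_data.head()` above.

```python
import numpy as np
import pandas as pd

rating_cols = ["Colley_Rating", "Teamrank_Rating", "Teamrank10_Rating",
               "Trank_Rating", "Kenpom_Rating"]

# Coefficients from reg.coef_, in the same column order as rating_cols.
weights = np.array([6.71236012, 11.19605646, 3.11732104, 13.51627166, 8.3058081])

# Two 2008 teams from ranking_data.head().
ratings = pd.DataFrame(
    [[1314, 1.0, 0.939394, 0.903704, 0.974641, 0.912825],
     [1242, 0.938298, 1.0, 0.961111, 1.0, 1.0]],
    columns=["TeamID"] + rating_cols,
).set_index("TeamID")

# One number per team: the weighted sum of its five scaled ratings.
ratings["Weighted_Rating"] = ratings[rating_cols].to_numpy() @ weights

# Predicted score differential = difference of the two weighted ratings.
predicted_margin = ratings.loc[1242, "Weighted_Rating"] - ratings.loc[1314, "Weighted_Rating"]
print(ratings["Weighted_Rating"])
print(predicted_margin)
```

Under this reading, team 1242 would be favored over team 1314 by roughly 1.5 points, because the model has no intercept and the prediction depends only on the rating difference.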