## Imports

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import confusion_matrix
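# In recent scikit-learn releases this helper lives in the private module
# sklearn.utils._testing (the public sklearn.utils.testing path was removed).
from sklearn.utils._testing import ignore_warnings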
from sklearn.exceptions import ConvergenceWarning


## Data Wrangling

In [2]:
data = pd.read_csv('clean_data.csv', sep='\t')
data = data[['Date','HGS','HGA','HYC','HRC','HWW','AGS','AGA','AYC','ARC','AWW','watch']]
data = data.reindex(index=data.index[::-1])
data.reset_index(inplace=True)
data['Date'] =  pd.to_datetime(data['Date'])
data[['HGS','HGA','HYC',
      'HRC','HWW','AGS',
      'AGA','AYC','ARC',
      'AWW','watch']] = data[['HGS','HGA',
                              'HYC','HRC',
                              'HWW','AGS',
                              'AGA','AYC',
                              'ARC','AWW',
                              'watch']].apply(pd.to_numeric)
data.drop('index',axis=1,inplace=True)

In [3]:
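# Apply both date bounds in one mask. Filtering `data` a second time
# would overwrite the first filter instead of narrowing it.
recent = data[(data['Date'] > '2010-08-01') & (data['Date'] < '2018-05-13')]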

Unnamed: 0,Date,HGS,HGA,HYC,HRC,HWW,AGS,AGA,AYC,ARC,AWW,watch
0,2018-05-13,1.864865,0.864865,1.297297,0.054054,0.297297,1.405405,1.486486,1.351351,0.135135,0.27027,1
1,2018-05-13,0.72973,1.459459,1.351351,0.027027,0.216216,0.891892,1.810811,1.621622,0.027027,0.216216,0
2,2018-05-13,1.0,1.486486,1.621622,0.054054,0.27027,2.837838,0.72973,1.540541,0.054054,0.513514,0
3,2018-05-13,0.972973,1.27027,1.405405,0.054054,0.216216,1.675676,0.945946,1.081081,0.108108,0.324324,0
4,2018-05-13,1.810811,0.756757,1.621622,0.027027,0.324324,1.189189,1.702703,1.702703,0.108108,0.324324,0
5,2018-05-13,2.162162,1.027027,1.189189,0.027027,0.405405,0.918919,1.351351,1.459459,0.054054,0.243243,1
6,2018-05-13,0.756757,1.540541,1.621622,0.081081,0.189189,1.972973,1.378378,1.540541,0.054054,0.432432,0
7,2018-05-13,1.162162,1.486486,1.864865,0.0,0.324324,0.837838,1.459459,1.864865,0.027027,0.243243,0
8,2018-05-13,0.945946,1.0,1.756757,0.0,0.081081,1.162162,1.621622,1.486486,0.027027,0.324324,0
9,2018-05-10,1.25,1.861111,1.972222,0.055556,0.388889,1.861111,0.777778,1.638889,0.027778,0.333333,0


In [4]:
X = recent[['HGS','HGA','HYC',
      'HRC','HWW','AGS',
      'AGA','AYC','ARC',
      'AWW']]
y = recent['watch']

In the cell below, the data is split into training, validation, and test sets. Applying scikit-learn's `train_test_split` twice gives a random 70/15/15 split: the first call holds out 30% of the rows, and the second call divides that held-out portion in half.

In [5]:
X_train, X_vt, y_train, y_vt = train_test_split(X, y, test_size=0.3, random_state=101)
X_validate, X_test, y_validate, y_test = train_test_split(X_vt, y_vt, test_size=0.5, random_state=101)
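
As a quick sanity check of the two-stage split, the sketch below runs the same two calls on a small synthetic array and reports the resulting sizes. The names `X_demo` and `y_demo` and the 1000-row array are hypothetical stand-ins, not the match data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 2 features, binary labels.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 2

# First split: 70% training, 30% held out.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)

# Second split: halve the held-out 30% into validation and test.
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=101)

split_sizes = (len(X_tr), len(X_val), len(X_te))
print(split_sizes)
```

With 1000 rows the three parts contain 700, 150, and 150 rows, matching the 70/15/15 proportions.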

## Trying Various Support Vector Machine Classifiers

We first try a support vector machine model with polynomial features up to degree 5 and a linear decision boundary. The parameter C is optimized over powers of 10 from 0.00001 to 100000. C can be thought of as the penalty for misclassification: if C is very high, the classifier is punished heavily for wrong classifications and adjusts to classify as many training points as possible correctly. This can, of course, lead to overfitting the data. We will see whether this is the case when using the test set.

The standard scaler is used, which means that each feature is scaled to have mean zero and standard deviation 1. This is very important for SVMs: if the features differ greatly in scale, the solver can fail to converge.
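
To illustrate what the scaling step does, here is a minimal sketch on a made-up array with two features on very different scales. The array `raw` is hypothetical and only serves to show the effect.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
raw = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0]])

scaled = StandardScaler().fit_transform(raw)

# After scaling, every column has mean 0 and (population) standard deviation 1.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means, col_stds)
```

Inside a `Pipeline`, the scaler is fitted on the training data only and then applied unchanged to the validation and test data.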

The result we are interested in is: if the model predicts a game to be worth watching, what is the probability that it actually is interesting? This corresponds to $\frac{TP}{FP + TP}$ from the confusion matrix and is usually referred to as **precision**.

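However, some models can have a good precision while hardly predicting any games to be worth watching, so a minimum number of positive predictions is required. The stated requirement is $TP > 100$. The code below, however, tests `cf[0][1]`, which in scikit-learn's confusion matrix layout is the false-positive count, so the results printed are those with $FP > 100$.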
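
As a small worked example of the precision formula, the sketch below uses the validation confusion matrix reported for c=0.0001 in this section. It assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes.
cf = [[452, 148],
      [238, 98]]

tn, fp = cf[0]  # true class 0: predicted 0, predicted 1
fn, tp = cf[1]  # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)      # share of predicted positives that are correct
predicted_positive = tp + fp    # how many games the model recommends at all
print(round(precision, 4), predicted_positive)
```

Here 98 of the 246 games predicted as worth watching really are, a precision of about 39.8%.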

I have decided to suppress warnings about the lack of convergence.

The best classifier of this type had $c=0.0001$, with a precision of 39.8%.

In [6]:
@ignore_warnings(category=ConvergenceWarning)
def linear_svc():
    for c in [10**i for i in range(-5, 6)]:
        polynomial_svm_clf = Pipeline([
            ('poly_features', PolynomialFeatures(degree=5)),
            ('scaler', StandardScaler()),
            ('svm_clf', LinearSVC(C=c, loss='hinge'))
        ])

        polynomial_svm_clf.fit(X_train, y_train)
        preds = polynomial_svm_clf.predict(X_validate)
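        cf = confusion_matrix(y_validate, preds)
        # Rows are true classes, columns are predicted classes:
        # cf[0][1] = false positives, cf[1][1] = true positives.
        if cf[0][1] > 100:
            print(f'c={c}')
            print(cf)
            # Precision: TP / (FP + TP)
            print(cf[1][1] / (cf[0][1] + cf[1][1]))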
linear_svc()

c=1e-05
[[419 181]
 [222 114]]
0.3864406779661017
c=0.0001
[[452 148]
 [238  98]]
0.3983739837398374
c=10
[[436 164]
 [232 104]]
0.3880597014925373
c=100
[[472 128]
 [263  73]]
0.36318407960199006
c=1000
[[405 195]
 [239  97]]
0.3321917808219178
c=10000
[[418 182]
 [239  97]]
0.34767025089605735
c=100000
[[438 162]
 [240  96]]
0.37209302325581395


Next, a non-linear kernel was used. After some testing, a degree 7 polynomial kernel was found to improve the results without making the computation time completely unreasonable.

You may notice that polynomial features are not being added here. This is due to the "kernel trick": by making the kernel polynomial, we get the same results as adding polynomial features, without ever computing those features explicitly.

The kernel is a function which gives a measure of the similarity between any two datapoints.
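
The kernel trick can be checked numerically on a small case. The sketch below uses a degree 2 kernel on 2-D points rather than the degree 7 kernel of this notebook, because the explicit feature map is short enough to write out. The helpers `phi` and `poly_kernel` are illustrative and not part of scikit-learn.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (illustrative helper)."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, x1 ** 2, r2 * x1 * x2, x2 ** 2])

def poly_kernel(a, b, degree=2, coef0=1.0):
    """Polynomial kernel (a . b + coef0) ** degree, with gamma fixed to 1."""
    return (np.dot(a, b) + coef0) ** degree

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))  # inner product in the expanded feature space
implicit = poly_kernel(a, b)       # same value, without ever building phi
print(explicit, implicit)
```

Both routes give the same similarity value, but the kernel computes it from the original two coordinates alone. For a degree 7 kernel on ten features the explicit map would be far larger, which is why the kernel form is preferred.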

The best result was a precision of 39.7% with c=10000 (c=100000 gave an identical confusion matrix).

In [7]:
for c in [10**i for i in range(0, 6)]:
    poly_kernel_svm_clf = Pipeline([
        ('scaler', StandardScaler()),
        ('svm_clf', SVC(kernel='poly', degree=7, coef0=1, C=c))
    ])

    poly_kernel_svm_clf.fit(X_train, y_train)
    preds = poly_kernel_svm_clf.predict(X_validate)
    cf = confusion_matrix(y_validate, preds)
    if cf[0][1] > 100:
        print(f'c={c}')
        print(cf)
        print(cf[1][1] / (cf[0][1] + cf[1][1]))

c=1
[[494 106]
 [282  54]]
0.3375
c=10
[[444 156]
 [234 102]]
0.3953488372093023
c=100
[[418 182]
 [225 111]]
0.378839590443686
c=1000
[[391 209]
 [214 122]]
0.3685800604229607
c=10000
[[383 217]
 [193 143]]
0.3972222222222222
c=100000
[[383 217]
 [193 143]]
0.3972222222222222


Next, the RBF kernel was used. The Gaussian RBF kernel function is given by:
$$\phi_\gamma (\textbf{x}, \ell) = \exp(-\gamma \| \textbf{x} - \ell \| ^ 2)$$

This kernel can handle very complicated decision boundaries.
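
For reference, the formula above can be evaluated directly and compared with scikit-learn's `rbf_kernel` helper. The points `x` and `landmark` and the value of `gamma` below are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_rbf(x, landmark, gamma):
    """exp(-gamma * ||x - landmark||^2), as in the formula above."""
    return np.exp(-gamma * np.sum((x - landmark) ** 2))

x = np.array([1.0, 2.0])
landmark = np.array([2.0, 0.0])
gamma = 0.5

manual = gaussian_rbf(x, landmark, gamma)
# rbf_kernel expects 2-D arrays of shape (n_samples, n_features).
library = rbf_kernel(x.reshape(1, -1), landmark.reshape(1, -1), gamma=gamma)[0, 0]
same_point = gaussian_rbf(x, x, gamma)  # a point is maximally similar to itself
print(manual, library, same_point)
```

The similarity is 1 when the two points coincide and decays towards 0 as they move apart. A larger `gamma` makes that decay faster, so each training point influences a smaller neighbourhood.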

The best result was a precision of 40.3% with c=100 and gamma=1.

In [8]:
for c in [10**i for i in range(1, 6)]:
    for g in [10**i for i in range(-1, 6)]:
        rbf_kernel_svm_clf = Pipeline([
            ('scaler', StandardScaler()),
            ('svm_clf', SVC(kernel='rbf', gamma=g, C=c))
        ])
        rbf_kernel_svm_clf.fit(X_train, y_train)
        preds = rbf_kernel_svm_clf.predict(X_validate)
        cf = confusion_matrix(y_validate, preds)
        if cf[0][1] > 100:
            print(f'c={c}, g={g}')
            print(cf)
            print(cf[1][1] / (cf[0][1] + cf[1][1]))

c=10, g=1
[[486 114]
 [260  76]]
0.4
c=100, g=1
[[483 117]
 [257  79]]
0.4030612244897959
c=1000, g=0.1
[[452 148]
 [237  99]]
0.4008097165991903
c=1000, g=1
[[483 117]
 [257  79]]
0.4030612244897959
c=10000, g=0.1
[[419 181]
 [219 117]]
0.3926174496644295
c=10000, g=1
[[483 117]
 [257  79]]
0.4030612244897959
c=100000, g=0.1
[[399 201]
 [207 129]]
0.39090909090909093
c=100000, g=1
[[483 117]
 [257  79]]
0.4030612244897959


Finally, I tried using the sigmoid kernel:
$$\phi_{r, \gamma} (\textbf{x}, \ell) = \tanh (\gamma \cdot \textbf{x}^T \ell + r)$$
which has a similar motivation to the sigmoid function from logistic regression.
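
A minimal numeric sketch of this formula follows, using NumPy only. The vectors and parameter values are arbitrary. In `SVC`, the offset $r$ is set through the `coef0` parameter, which the cell below leaves at its default.

```python
import numpy as np

def sigmoid_kernel_value(x, landmark, gamma, r):
    """tanh(gamma * x^T landmark + r), as in the formula above."""
    return np.tanh(gamma * np.dot(x, landmark) + r)

x = np.array([1.0, 2.0])
landmark = np.array([0.5, -1.0])
gamma = 0.1
r = 0.0

value = sigmoid_kernel_value(x, landmark, gamma, r)
# Very large gamma saturates tanh, so the output collapses to about -1 or +1.
saturated = sigmoid_kernel_value(x, landmark, 1e5, r)
print(value, saturated)
```

Because `tanh` is bounded, the output always lies between -1 and 1, and for very large `gamma` it carries little more than the sign of the inner product.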

The best result here was a precision of 39.1% with c=10 and gamma=10.

In [9]:
for c in [10**i for i in range(1, 6)]:
    for g in [10**i for i in range(-1, 6)]:
        sigmoid_kernel_svm_clf = Pipeline([
            ('scaler', StandardScaler()),
            ('svm_clf', SVC(kernel='sigmoid', gamma=g, C=c))
        ])
        sigmoid_kernel_svm_clf.fit(X_train, y_train)
        preds = sigmoid_kernel_svm_clf.predict(X_validate)
        cf = confusion_matrix(y_validate, preds)
        if cf[0][1] > 100:
            print(f'c={c}, g={g}')
            print(cf)
            print(cf[1][1] / (cf[0][1] + cf[1][1]))

c=10, g=0.1
[[403 197]
 [224 112]]
0.36245954692556637
c=10, g=1
[[412 188]
 [222 114]]
0.37748344370860926
c=10, g=10
[[413 187]
 [216 120]]
0.39087947882736157
c=10, g=100
[[407 193]
 [232 104]]
0.3501683501683502
c=10, g=1000
[[419 181]
 [239  97]]
0.3489208633093525
c=10, g=10000
[[414 186]
 [227 109]]
0.3694915254237288
c=10, g=100000
[[417 183]
 [236 100]]
0.35335689045936397
c=100, g=0.1
[[406 194]
 [223 113]]
0.36807817589576547
c=100, g=1
[[412 188]
 [225 111]]
0.3712374581939799
c=100, g=10
[[421 179]
 [226 110]]
0.3806228373702422
c=100, g=100
[[407 193]
 [232 104]]
0.3501683501683502
c=100, g=1000
[[420 180]
 [241  95]]
0.34545454545454546
c=100, g=10000
[[421 179]
 [238  98]]
0.35379061371841153
c=100, g=100000
[[415 185]
 [237  99]]
0.3485915492957746
c=1000, g=0.1
[[411 189]
 [215 121]]
0.3903225806451613
c=1000, g=1
[[411 189]
 [227 109]]
0.36577181208053694
c=1000, g=10
[[419 181]
 [224 112]]
0.3822525597269625
c=1000, g=100
[[408 192]
 [231 105]]
0.35353535353535354

So the best-performing model was the one with the RBF kernel and parameters c=100 and gamma=1. Let's assess its precision on the test set:

In [11]:
test_model = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='rbf', gamma=1, C=100))
])
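# Refit on training + validation data before the final test evaluation.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
test_model.fit(pd.concat([X_train, X_validate]),
               pd.concat([y_train, y_validate]))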
preds = test_model.predict(X_test)
cf = confusion_matrix(y_test, preds)
print(cf)
print(cf[1][1] / (cf[0][1] + cf[1][1]))

[[502 130]
 [239  65]]
0.3333333333333333


Again, a disappointing result: only 33.3% precision on the test data. About a third of the games in the test set are worth watching (304 of 936, or 32.5%), so this is not performing any better than random guessing. Support vector machine classifiers have not performed well on this dataset.

This is a classic case of the model overfitting the training and validation data.
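
The comparison with random guessing can be reproduced from the test confusion matrix above. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes). The `lift` ratio is just a convenient way to express the comparison and is not a quantity computed in the original notebook.

```python
# Test-set confusion matrix reported above.
cf_test = [[502, 130],
           [239, 65]]

(tn, fp), (fn, tp) = cf_test
total = tn + fp + fn + tp

# Share of test games that are worth watching: the precision a model
# would get on average by recommending games at random.
base_rate = (fn + tp) / total

precision = tp / (tp + fp)
lift = precision / base_rate  # 1.0 means no better than random
print(round(base_rate, 4), round(precision, 4), round(lift, 3))
```

A lift this close to 1 means that a game recommended by the model is barely more likely to be worth watching than a game picked at random.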