## Imports

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.utils.testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.kernel_approximation import Nystroem

## Data Wrangling

In [2]:
data = pd.read_csv('clean_data.csv', sep='\t')
data = data[['Date','WHH','WHD','WHA','HWW','AWW','watch']]
data = data.reindex(index=data.index[::-1])
data.reset_index(inplace=True)
data['Date'] =  pd.to_datetime(data['Date'])
data[['WHH','WHD','WHA','HWW','AWW','watch']] = data[['WHH','WHD','WHA','HWW','AWW','watch']].apply(pd.to_numeric)
data.drop('index',axis=1,inplace=True)

In [3]:
recent = data[data['Date'] > '2010-08-01']
recent = data[data['Date'] < '2018-05-13']

In [4]:
X = recent[['WHH','WHD','WHA','HWW','AWW']]
y = recent['watch']

In the cell below the data is split into training, validation and testing. Sklearn's train_test_split can be used twice to partition randomly. We are using the 70-15-15% split.

In [5]:
X_train, X_vt, y_train, y_vt = train_test_split(X, y, test_size=0.3, random_state=101)
X_validate, X_test, y_validate, y_test = train_test_split(X_vt, y_vt, test_size=0.5, random_state=101)

## Trying Various Support Vector Machine Classifiers

We first try a support vector machine model with up to degree 5 features and a linear decision boundary. The parameter C optimized over values from 0.00001 to 10000 in powers of 10. This parameter C can be thought of as the penalization of misclassification. If C is very high then the classifier will be punished greatly for wrong classifications and will adjust to make as many as possible correct. This can of course, lead to overfitting the data. We will see whether this is the case when using the test set.

The standard scaler was used which means that the features were scaled to have mean zero and standard deviation 1. This is very important for SVMs because if there is a great difference in scale then the algorithm to solve can fail.

The result we are interested in is if the model predicts a game to be worth watching, the probability that it actually is interesting. This corresponds to $\frac{TP}{FP + TP}$ from the confusion matrix and is usually referred to as **precision**. 

However, some models can have a good precision but hardly predict any games to be worth watching. Therefore, we require $TP > 100$.

I have decided to suppress warnings about the lack of convergence.

The best classifier of this type had $c=0.0001$ and had precision of 39.8%.

In [6]:
@ignore_warnings(category=ConvergenceWarning)
def linear_svc():
    best_precision = 0
    for d in range(3, 8):
        for c in [10**i for i in range(-5, 6)]:
            polynomial_svm_clf = Pipeline([
                ('poly_features', PolynomialFeatures(degree=d)),
                ('scaler', StandardScaler()),
                ('svm_clf', LinearSVC(C=c, loss='hinge', random_state=101))
            ])

            polynomial_svm_clf.fit(X_train, y_train)
            preds = polynomial_svm_clf.predict(X_validate)
            cf = confusion_matrix(y_validate, preds)
            if cf[1][1] > 100:
                if cf[1][1] / (cf[1][1] + cf[0][1]) > best_precision:
                    best_cf = cf
                    best_precision = cf[1][1] / (cf[1][1] + cf[0][1])
                    best_c = c
                    best_d = d
            print(f'done with d={d}, c={c}')
    print(best_cf)
    print(best_precision)
    print(best_d)
    print(best_c)
linear_svc()

done with d=3, c=1e-05
done with d=3, c=0.0001
done with d=3, c=0.001
done with d=3, c=0.01
done with d=3, c=0.1
done with d=3, c=1
done with d=3, c=10
done with d=3, c=100
done with d=3, c=1000
done with d=3, c=10000
done with d=3, c=100000
done with d=4, c=1e-05
done with d=4, c=0.0001
done with d=4, c=0.001
done with d=4, c=0.01
done with d=4, c=0.1
done with d=4, c=1
done with d=4, c=10
done with d=4, c=100
done with d=4, c=1000
done with d=4, c=10000
done with d=4, c=100000
done with d=5, c=1e-05
done with d=5, c=0.0001
done with d=5, c=0.001
done with d=5, c=0.01
done with d=5, c=0.1
done with d=5, c=1
done with d=5, c=10
done with d=5, c=100
done with d=5, c=1000
done with d=5, c=10000
done with d=5, c=100000
done with d=6, c=1e-05
done with d=6, c=0.0001
done with d=6, c=0.001
done with d=6, c=0.01
done with d=6, c=0.1
done with d=6, c=1
done with d=6, c=10
done with d=6, c=100
done with d=6, c=1000
done with d=6, c=10000
done with d=6, c=100000
done with d=7, c=1e-05
done with

Next I wanted to try an SVM with a polynomial kernel. If I just used an SVC with the kernel set as `poly` the model would take a very long time to fit. This is because fitting the model boils down to a quadratic optimization problem which runs in roughly $O(n_{features} \times n^{3}_{observations})$ time.

Therefore, we avoid this problem by employing what is called the "kernel approximation". Roughly speaking, rather than using a complex kernel, we transform the features in such a way to approximate the desired kernel, then solve using the linear kernel (which is very fast to solve). 

This is done for degrees 3 to 7 polynomials and for C values from 0.01 to 10,000 in powers of 10. The best result was for degree 7 and C value 100. This gave a precision percentage of 39.7%.

In [7]:
@ignore_warnings(category=ConvergenceWarning)
def poly_svc():
    best_precision = 0
    for c in [10**i for i in range(-2, 5)]:
        for d in range(3, 8):
            kernel_approx = Nystroem(kernel='poly',
                                         degree=d,
                                         random_state=101)
            poly_svm_clf = Pipeline([
                    ('scaler', StandardScaler()),
                    ('svm_clf', LinearSVC(C=c, loss='hinge', random_state=101))
                ])

            X_train_transformed = kernel_approx.fit_transform(X_train)
            X_validate_transformed = kernel_approx.fit_transform(X_validate)
            poly_svm_clf.fit(X_train_transformed, y_train)
            preds = poly_svm_clf.predict(X_validate_transformed)
            cf = confusion_matrix(y_validate, preds)
            if cf[1][1] > 100:
                if cf[1][1] / (cf[1][1] + cf[0][1]) > best_precision:
                    best_cf = cf
                    best_precision = cf[1][1] / (cf[1][1] + cf[0][1])
                    best_c = c
                    best_d = d
            print(f'done with c={c}, d={d}')
    print(best_cf)
    print(best_precision)
    print(best_c)
    print(best_d)
poly_svc()

done with c=0.01, d=3
done with c=0.01, d=4
done with c=0.01, d=5
done with c=0.01, d=6
done with c=0.01, d=7
done with c=0.1, d=3
done with c=0.1, d=4
done with c=0.1, d=5
done with c=0.1, d=6
done with c=0.1, d=7
done with c=1, d=3
done with c=1, d=4
done with c=1, d=5
done with c=1, d=6
done with c=1, d=7
done with c=10, d=3
done with c=10, d=4
done with c=10, d=5
done with c=10, d=6
done with c=10, d=7
done with c=100, d=3
done with c=100, d=4
done with c=100, d=5
done with c=100, d=6
done with c=100, d=7
done with c=1000, d=3
done with c=1000, d=4
done with c=1000, d=5
done with c=1000, d=6
done with c=1000, d=7
done with c=10000, d=3
done with c=10000, d=4
done with c=10000, d=5
done with c=10000, d=6
done with c=10000, d=7
[[355 245]
 [175 161]]
0.39655172413793105
100
7


Next the rbf kernel was used with the same kernel approximation method as above. The Gaussian RBF kernel function is given by:
$$\phi_\gamma (\textbf{x}, \ell) = \exp(-\gamma \| \textbf{x} - \ell \| ^ 2)$$

This kernel can handle very complicated decision boundaries.

The best result was: a probability of 37.7% with a c=10 and gamma=0.01.

In [8]:
@ignore_warnings(category=ConvergenceWarning)
def rbf_svc():
    best_precision = 0
    for c in [10**i for i in range(-2, 5)]:
        for g in [10**i for i in range(-2, 6)]:
            kernel_approx = Nystroem(kernel='rbf',
                                     gamma=g,
                                     random_state=101)
            rbf_svm_clf = Pipeline([
                    ('scaler', StandardScaler()),
                    ('svm_clf', LinearSVC(C=c, loss='hinge', random_state=101))
                ])
            X_train_transformed = kernel_approx.fit_transform(X_train)
            X_validate_transformed = kernel_approx.fit_transform(X_validate)
            rbf_svm_clf.fit(X_train_transformed, y_train)
            preds = rbf_svm_clf.predict(X_validate_transformed)
            cf = confusion_matrix(y_validate, preds)
            if cf[1][1] > 100:
                if cf[1][1] / (cf[1][1] + cf[0][1]) > best_precision:
                    best_cf = cf
                    best_precision = cf[1][1] / (cf[1][1] + cf[0][1])
                    best_c = c
                    best_g = g
            print(f'done with c={c}, g={g}')
    print(best_cf)
    print(best_precision)
    print(best_c)
    print(best_g)
rbf_svc()

done with c=0.01, g=0.01
done with c=0.01, g=0.1
done with c=0.01, g=1
done with c=0.01, g=10
done with c=0.01, g=100
done with c=0.01, g=1000
done with c=0.01, g=10000
done with c=0.01, g=100000
done with c=0.1, g=0.01
done with c=0.1, g=0.1
done with c=0.1, g=1
done with c=0.1, g=10
done with c=0.1, g=100
done with c=0.1, g=1000
done with c=0.1, g=10000
done with c=0.1, g=100000
done with c=1, g=0.01
done with c=1, g=0.1
done with c=1, g=1
done with c=1, g=10
done with c=1, g=100
done with c=1, g=1000
done with c=1, g=10000
done with c=1, g=100000
done with c=10, g=0.01
done with c=10, g=0.1
done with c=10, g=1
done with c=10, g=10
done with c=10, g=100
done with c=10, g=1000
done with c=10, g=10000
done with c=10, g=100000
done with c=100, g=0.01
done with c=100, g=0.1
done with c=100, g=1
done with c=100, g=10
done with c=100, g=100
done with c=100, g=1000
done with c=100, g=10000
done with c=100, g=100000
done with c=1000, g=0.01
done with c=1000, g=0.1
done with c=1000, g=1
done 

Finally, I tried using the sigmoid kernel:
$$\phi_{r, \gamma} (\textbf{x}, \ell) = \tanh (\gamma \cdot \textbf{x}^T \ell + r)$$
which has a similar motivation to the sigmoid function from logistic regression.

The best result here was: a probability of 41.5% with c=10 and gamma=1.

In [9]:
best_precision = 0
for c in [10**i for i in range(1, 6)]:
    for g in [10**i for i in range(-1, 6)]:
        rbf_kernel_svm_clf = Pipeline([
            ('scaler', StandardScaler()),
            ('svm_clf', SVC(kernel='sigmoid', gamma=g, C=c, random_state=101))
        ])
        rbf_kernel_svm_clf.fit(X_train, y_train)
        preds = rbf_kernel_svm_clf.predict(X_validate)
        cf = confusion_matrix(y_validate, preds)
        if cf[1][1] > 100:
            if cf[1][1] / (cf[1][1] + cf[0][1]) > best_precision:
                best_cf = cf
                best_precision = cf[1][1] / (cf[1][1] + cf[0][1])
                best_c = c
                best_g = g
        print(f'done with c={c}, g={g}')
print(best_cf)
print(best_precision)
print(best_c)
print(best_g)

done with c=10, g=0.1
done with c=10, g=1
done with c=10, g=10
done with c=10, g=100
done with c=10, g=1000
done with c=10, g=10000
done with c=10, g=100000
done with c=100, g=0.1
done with c=100, g=1
done with c=100, g=10
done with c=100, g=100
done with c=100, g=1000
done with c=100, g=10000
done with c=100, g=100000
done with c=1000, g=0.1
done with c=1000, g=1
done with c=1000, g=10
done with c=1000, g=100
done with c=1000, g=1000
done with c=1000, g=10000
done with c=1000, g=100000
done with c=10000, g=0.1
done with c=10000, g=1
done with c=10000, g=10
done with c=10000, g=100
done with c=10000, g=1000
done with c=10000, g=10000
done with c=10000, g=100000
done with c=100000, g=0.1
done with c=100000, g=1
done with c=100000, g=10
done with c=100000, g=100
done with c=100000, g=1000
done with c=100000, g=10000
done with c=100000, g=100000
[[411 189]
 [202 134]]
0.4148606811145511
10
1


So the best performing model here was with the sigmoid kernel with polynomial features, with parameters c=10 and gamma=1. Let's assess it's test accuracy:

In [14]:
@ignore_warnings(category=ConvergenceWarning)
def test_svc():
    sigmoid_svm_clf = Pipeline([
        ('scaler', StandardScaler()),
        ('svm_clf', SVC(kernel='sigmoid', gamma=1, C=10, random_state=101))
    ])
    sigmoid_svm_clf.fit(X_train.append(X_validate), y_train.append(y_validate))
    preds =sigmoid_svm_clf.predict(X_test)
    cf = confusion_matrix(y_test, preds)
    print(f'test precision: {cf[1][1] / (cf[1][1] + cf[0][1])}')
    print(cf)
test_svc()

test precision: 0.3333333333333333
[[420 212]
 [198 106]]


Again, a dissapointing result. Only 33.3% accuracy on the test data. 30% of games are worth watching so this is not performing any better than random guessing. Support vector machine classifiers have not performed well on the dataset. This is a classic case of the model overfitting the training and validation data.