### Problem 4 (Hinge Loss vs Logistic Loss)

In [319]:
import numpy as np
import pandas as pd

First, we clean up the CSV file. Its first data point was read as the column headers, so we push it back into the dataframe as a row and then rename the columns, with the target stored in 'Feature 0'.

In [320]:
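df = pd.read_csv('breast-cancer.csv')
# The file has no header row, so pandas used the first data point as column names.
# Recover it as a row. Caution: pandas de-duplicates repeated header labels
# ('1', '1.1', '1.2', ...), so repeated values in that first row come back altered.
df.loc[-1] = [float(i) for i in df.columns]
df.index = df.index + 1  # shift the index so the recovered row becomes row 0
df = df.sort_index()
df.columns = ['Feature ' + str(i) for i in range(10)]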
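A headerless CSV can also be read without the first row ever becoming column labels. The following is a minimal sketch, assuming a small in-memory stand-in for the file (the three rows and the five-column width are made up for illustration); `header=None` and `names` are standard `pd.read_csv` parameters.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for a headerless CSV; the values are made up.
raw = io.StringIO("-1,5,1,1,1\n1,8,10,10,8\n-1,3,1,1,1\n")

names = ['Feature ' + str(i) for i in range(5)]

# header=None tells pandas that the first line is data, not column labels,
# so repeated values in that line are left untouched.
demo = pd.read_csv(raw, header=None, names=names)

print(demo.shape)
print(demo.iloc[0].tolist())
```

Read this way, the first line needs no conversion back from column labels, so its repeated values cannot be renamed along the way.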

Use this helper function to print the intercept and the coefficients of the weight vector.

In [321]:
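def print_coefs(results):
    print('Intercept coefficient:\t', results.intercept_)
    # Caution: the names come from df.columns[:-1], while coef_ is ordered by the
    # predictor columns 'Feature 1'..'Feature 9' (the target 'Feature 0' has no
    # coefficient), so the value printed as 'Feature i' belongs to 'Feature i+1'.
    for i in range(len(df.columns[:-1])):
        print('Coefficient of', df.columns[i], ':\t', results.coef_[0][i])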
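The models in this notebook are fit on every column except the target 'Feature 0', so `coef_` holds one weight per predictor column. A sketch of a variant that takes the predictor names explicitly and pairs them with the weights is shown here; `format_coefs` and the stand-in `fake` object are hypothetical names used only for illustration, and the numbers are made up.

```python
from types import SimpleNamespace


def format_coefs(results, feature_names):
    """Return one text line per parameter, pairing each weight with its predictor name."""
    weights = results.coef_[0]
    if len(weights) != len(feature_names):
        raise ValueError('expected exactly one name per coefficient')
    lines = ['Intercept coefficient:\t' + str(results.intercept_[0])]
    for name, weight in zip(feature_names, weights):
        lines.append('Coefficient of ' + name + ':\t' + str(weight))
    return lines


# Hypothetical stand-in for a fitted model with two predictors (made-up numbers).
fake = SimpleNamespace(intercept_=[-1.5], coef_=[[0.25, -0.75]])

for line in format_coefs(fake, ['Feature 1', 'Feature 2']):
    print(line)
```

With the notebook's dataframe, the names could be passed as `list(df.columns[1:])`, assuming the target stays in the first column.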

In [322]:
df

Unnamed: 0,Feature 0,Feature 1,Feature 2,Feature 3,Feature 4,Feature 5,Feature 6,Feature 7,Feature 8,Feature 9
0,-1.0,5.0,1.0,1.1,1.2,2.0,1.3,3.0,1.4,1.5
1,-1.0,5.0,4.0,4.0,5.0,7.0,10.0,3.0,2.0,1.0
2,-1.0,3.0,1.0,1.0,1.0,2.0,2.0,3.0,1.0,1.0
3,-1.0,6.0,8.0,8.0,1.0,3.0,4.0,3.0,7.0,1.0
4,-1.0,4.0,1.0,1.0,3.0,2.0,1.0,3.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...
678,-1.0,3.0,1.0,1.0,1.0,3.0,2.0,1.0,1.0,1.0
679,-1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0
680,1.0,5.0,10.0,10.0,3.0,7.0,3.0,8.0,10.0,2.0
681,1.0,4.0,8.0,6.0,4.0,3.0,4.0,10.0,6.0,1.0


### (a)

Shuffle all of the data points using df.sample() with a random state of 42, then split the data set into a 50/25/25 training/validation/test split. So, the first half of the shuffled data set will be the training set, the 3rd quarter will be the validation set, and the 4th quarter will be the test set.

In [323]:
df = df.sample(df.shape[0], random_state = 42)

In [324]:
tr = df.iloc[:int(df.shape[0]*0.5), :]
va = df.iloc[int(df.shape[0]*0.5):int(df.shape[0]*0.75), :]
te = df.iloc[int(df.shape[0]*0.75):, :]
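The slice boundaries above truncate with `int()`, so the three parts are not always exactly 50/25/25 when the row count is not divisible by 4. A small sketch of the same boundary arithmetic on a plain list makes the resulting sizes easy to check; `split_bounds` is a hypothetical helper and the row count of 10 is made up.

```python
def split_bounds(n, train_frac=0.5, val_frac=0.25):
    """Return (train_end, val_end) row positions for a train/validation/test split."""
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return train_end, val_end


n = 10  # hypothetical row count
train_end, val_end = split_bounds(n)

rows = list(range(n))
train = rows[:train_end]
val = rows[train_end:val_end]
test = rows[val_end:]

# The three slices are disjoint and together cover every row.
print(len(train), len(val), len(test))
```

Because the slices share their end points, every row lands in exactly one part, and any rounding remainder goes to the last slice.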

### (b)

Using validation to determine an appropriate regularization parameter $\lambda$, fit a linear support vector classifier (SVC, `sklearn.svm.SVC`) and logistic regressor (LR, `sklearn.linear_model.LogisticRegression`) for 50 iterations to minimize the regularized empirical risk $$\min_w \frac{1}{n} \sum_{i=1}^{n} \ell(x_i,y_i;w) + \lambda \cdot \|w\|_2^2$$

for a hinge loss and logistic loss, respectively. 

In particular, for each $\lambda \in \Lambda = \{.01,.02,.03,...,1\}$, fit an SVC and an LR with appropriate inputs to (possibly a subset of) the arguments {`penalty`, `loss`, `C`, `max_iter`, `random_state`, `fit_intercept`}, and then compute the misclassification error rate on the validation set. Use `max_iter=100000` and `random_state=42` for all trained models, and be sure to add an offset (`fit_intercept=True`) to your model.

(Hint: Looking at the documentation for these models in sklearn may help.)

__Print out the best regularization parameter $\lambda^* \in \Lambda$ and, using the `print_coefs` helper function, the intercept and weight coefficients of the SVC and LR models corresponding to $\lambda^*$.__

We will call these weight vectors $w^*_\text{hinge}$ and $w^*_\text{logistic}$, respectively.
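The risk above is parameterized by $\lambda$, while sklearn's `SVC` and `LogisticRegression` take `C`. Assuming sklearn's L2-regularized objective has the form $\frac{1}{2}\|w\|_2^2 + C \sum_i \ell_i$ (as its user guide describes), multiplying the risk above by $\frac{1}{2\lambda}$ gives the same minimizer with $C = \frac{1}{2 n \lambda}$. A minimal sketch of that conversion follows; `c_from_lambda` is a hypothetical helper name.

```python
import math


def c_from_lambda(lam, n):
    """C for which C * sum(loss) + 0.5 * ||w||^2 has the same minimizer as
    (1/n) * sum(loss) + lam * ||w||^2 (assumes that form of the sklearn objective)."""
    if lam <= 0 or n <= 0:
        raise ValueError('lam and n must be positive')
    return 1.0 / (2.0 * n * lam)


# Example with made-up values: n = 50 training points and lam = 0.01.
print(c_from_lambda(0.01, 50))
```

Under the same assumption, using `C = 1/lam` still orders the models from weak to strong regularization, but each grid value then corresponds to a different effective $\lambda$ in the stated risk.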

In [325]:
from sklearn.svm import SVC

In [326]:
def err(y_hat, y):
    return np.mean(y_hat != y)

In [327]:
# 1000 values 0.001, 0.002, ..., 1.000 (a finer grid than the Lambda stated in the problem)
lmbdas = np.arange(0.001, 1.0005, 0.001)
lmbdas.shape

(1000,)
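The NumPy documentation warns that `np.arange` with a non-integer step can produce an unexpected number of elements because of floating-point rounding. One way to sidestep that, sketched here for the 100-value set $\Lambda = \{.01, .02, ..., 1\}$ from the problem statement (not the 1000-value grid used in the cell above), is to build integers first and divide; `grid` is a name introduced only for this example.

```python
import numpy as np

# Integers 1..100 divided by 100: the length is exact and the end point 1.0 is included.
grid = np.arange(1, 101) / 100

print(grid.shape)
print(grid[0], grid[-1])
```

`np.linspace(0.01, 1.0, 100)` is another option that fixes the number of points up front.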

In [328]:
errors_svm = np.zeros(lmbdas.shape[0])
for i in range(lmbdas.shape[0]):
    # C is set to the inverse of lmbda, so a larger lmbda means stronger regularization
    # max_iter here is larger than the 100000 given in the problem statement
    clf = SVC(C = 1/lmbdas[i], max_iter = 10000000, kernel = 'linear', random_state = 42)
    clf.fit(tr.drop(['Feature 0'], axis = 1), tr['Feature 0'])
    preds = clf.predict(va.drop(['Feature 0'], axis = 1))
    # validation misclassification rate for this lmbda
    errors_svm[i] = err(preds, va['Feature 0'])

In [329]:
index = np.argmin(errors_svm)

In [330]:
best_lmbda_svm = lmbdas[index]

In [331]:
print('Best regularization parameter via validation:\t', best_lmbda_svm)

Best regularization parameter via validation:	 0.001
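`np.argmin` returns the first index at which the minimum occurs, so when several values of $\lambda$ tie on validation error, the smallest of them is the one reported. A short sketch with made-up error values shows how to list every tied index; the arrays here are hypothetical and not taken from the experiment.

```python
import numpy as np

# Hypothetical validation errors for four candidate lambdas (made-up numbers).
candidate_lmbdas = np.array([0.01, 0.02, 0.03, 0.04])
val_errors = np.array([0.04, 0.02, 0.03, 0.02])

first_best = np.argmin(val_errors)                          # first minimizer only
all_best = np.flatnonzero(val_errors == val_errors.min())   # every tied minimizer

print(candidate_lmbdas[first_best])
print(candidate_lmbdas[all_best])
```

Which tied value to prefer is a judgment call; picking the largest tied $\lambda$ would favor the most strongly regularized of the equally accurate models.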


In [332]:
from sklearn.linear_model import LogisticRegression

In [333]:
errors_lr = np.zeros(lmbdas.shape[0])
for i in range(lmbdas.shape[0]):
    # C is set to the inverse of lmbda, so a larger lmbda means stronger regularization
    clf = LogisticRegression(C = 1/lmbdas[i], fit_intercept = True, max_iter = 100000, random_state = 42)
    clf.fit(tr.drop(['Feature 0'], axis = 1), tr['Feature 0'])
    preds = clf.predict(va.drop(['Feature 0'], axis = 1))
    # validation misclassification rate for this lmbda
    errors_lr[i] = err(preds, va['Feature 0'])

In [334]:
index = np.argmin(errors_lr)

In [335]:
best_lmbda_lr = lmbdas[index]

In [336]:
print('Best regularization parameter via validation:\t', best_lmbda_lr)

Best regularization parameter via validation:	 0.13


In [337]:
svm = SVC(C = 1/best_lmbda_svm, max_iter = 10000000, kernel = 'linear', random_state = 42)
svm.fit(tr.drop(['Feature 0'], axis = 1), tr['Feature 0']) # C is the inverse of lmbda

SVC(C=1000.0, kernel='linear', max_iter=10000000, random_state=42)

In [338]:
print('Intercept and Weights learned for Linear SVM model:\n')
print_coefs(svm)

Intercept and Weights learned for Linear SVM model:

Intercept coefficient:	 [-6.62047887]
Coefficient of Feature 0 :	 0.4326447790905661
Coefficient of Feature 1 :	 -0.11535016980666057
Coefficient of Feature 2 :	 0.5344670470541502
Coefficient of Feature 3 :	 0.2859889518406433
Coefficient of Feature 4 :	 0.14507377988253012
Coefficient of Feature 5 :	 0.30099635389911583
Coefficient of Feature 6 :	 -0.09394456878168
Coefficient of Feature 7 :	 0.20246044847511158
Coefficient of Feature 8 :	 0.43425201176899364


In [339]:
lr = LogisticRegression(C = 1/best_lmbda_lr, fit_intercept = True, max_iter = 1000000, random_state = 42)
lr.fit(tr.drop(['Feature 0'], axis = 1), tr['Feature 0']) # C is the inverse of lmbda

LogisticRegression(C=7.692307692307692, max_iter=1000000, random_state=42)

In [340]:
print('Intercept and Weights learned for Logistic Regression model:\n') 
print_coefs(lr)

Intercept and Weights learned for Logistic Regression model:

Intercept coefficient:	 [-13.24494534]
Coefficient of Feature 0 :	 0.841718885448195
Coefficient of Feature 1 :	 0.21131888783035613
Coefficient of Feature 2 :	 1.1259693843248286
Coefficient of Feature 3 :	 0.3819520202726725
Coefficient of Feature 4 :	 0.03295039194701652
Coefficient of Feature 5 :	 0.4154922872396178
Coefficient of Feature 6 :	 0.1810764173361439
Coefficient of Feature 7 :	 0.2636751379463769
Coefficient of Feature 8 :	 0.2894598197882617


### (c)

Report the misclassification rates of $w^*_\text{hinge}$ and $w^*_\text{logistic}$ on
the test set. Which model performs better?

In [341]:
preds_svm = svm.predict(te.drop(['Feature 0'], axis = 1))
preds_lr = lr.predict(te.drop(['Feature 0'], axis = 1))

print("The error of the SVM, which is built upon the hinge loss, is " + 
      str(round(err(preds_svm, te['Feature 0']), 3)) + ".")
print("The error of the Logistic Regression, which is built upon the logistic loss, is " + 
      str(round(err(preds_lr, te['Feature 0']), 3)) + ".")

The error of the SVM, which is built upon the hinge loss, is 0.029.
The error of the Logistic Regression, which is built upon the logistic loss, is 0.035.


### (d) See pdf.

### (e)

Compute the log-likelihood of these two models using the test data set. Creating some helper functions that compute hinge loss and logistic loss for a given data point and weight vector $(x,y,w)$ may help.

Relate your findings to your test-set misclassification error results in part (c).
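The helper functions suggested above could look like the following sketch. It assumes labels in $\{-1, +1\}$ and reads a linear score $w^\top x + b$ as the log-odds of a logistic model, so the log-likelihood is the negative sum of logistic losses; applying that reading to an SVC's weights is an assumption, since SVC scores are not calibrated probabilities. The function names and the tiny data set are hypothetical.

```python
import numpy as np


def margins(X, y, w, b):
    """Signed margins y_i * (w . x_i + b) for labels y_i in {-1, +1}."""
    return y * (X @ w + b)


def hinge_loss(X, y, w, b):
    """Per-point hinge loss max(0, 1 - margin)."""
    return np.maximum(0.0, 1.0 - margins(X, y, w, b))


def logistic_loss(X, y, w, b):
    """Per-point logistic loss log(1 + exp(-margin)), computed stably."""
    return np.logaddexp(0.0, -margins(X, y, w, b))


def log_likelihood(X, y, w, b):
    """Log-likelihood under P(y | x) = sigmoid(y * (w . x + b))."""
    return -np.sum(logistic_loss(X, y, w, b))


# Made-up data: two points, two features.
X = np.array([[1.0, 2.0], [0.0, -1.0]])
y = np.array([1.0, -1.0])
w = np.array([0.5, 0.5])
b = 0.0

print(hinge_loss(X, y, w, b))
print(log_likelihood(X, y, w, b))
```

For a fitted sklearn linear model, `w` and `b` would typically come from `coef_[0]` and `intercept_[0]`.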

__Response:__