This time we employ the cross validation to figure out the best model for spam filter.

**Remark** The objective functions for logistic regression implemented in `sklearn` are:
<img src="L1.png">
and
<img src="L2.png">

where
- $w$ are the coefficients, which was denoted by $\beta_i$ in the class.
- $c$ is the intercept, which was denoted by $\beta_0$ in the class. We can change the parameter "fit_intercept" to keep or remove it.
- $C$ is the inverse of regularization strength. This is opposite to the $\alpha$ we used in Ridge and Lasso. Smaller values specify stronger regularization.
- Therefore the first objective function is of $L_1$ panelty and the second of $L_2$.

### Problem 1
Use the class <code>GridSearchCV</code> to find out the best combination of parameter for logistic regression. (Set <code>cv=5</code> and <code>scoring='accuracy'</code>). 

In [12]:
#from __future__ import print_function
import pandas as pd
import numpy as np

train = pd.read_csv('data/spam_train.csv')
#x_train = spam_train_df.iloc[:, :57].values
#y_train = spam_train_df.iloc[:, -1].values


test = pd.read_csv('data/spam_test.csv')
#x_test = spam_test_df.iloc[:, :57].values
#y_test = spam_test_df.iloc[:, -1].values

In [13]:
# rheineke: Very good. I assume you are anticipating the roc_auc problem
train[['spam']] = train.spam=='spam'
train[['spam']] = 1*train[['spam']]

test[['spam']] = test.spam=='spam'
test[['spam']] = 1*test[['spam']]

# building test/train x and y:

x_train = train.iloc[:, 0:57]
y_train = train.iloc[:,-1]

x_test = test.iloc[:, 0:57]
y_test = test.iloc[:,-1]

In [16]:
grid_param = [{
    'penalty': ['l1', 'l2'],
    'fit_intercept': [False, True],
    'C': np.logspace(-5, 5, 100)
}]

# Your solution

In [17]:
from sklearn.model_selection import GridSearchCV

from sklearn import linear_model
logit = linear_model.LogisticRegression()

## fit all models
para_search = GridSearchCV(estimator=logit, param_grid=grid_param, scoring='accuracy', cv=5).fit(x_train, y_train)

    - What's the best combination?
    - What's the best score?
    - Refit the best estimator on the whole data set. How many coefficients were reduced to 0?(Hint: the absolute value of coefficients that are smaller than 1e-4.) 
    - What's the corresponding training error and test error? (Training error is the model performance on spam_train, while test error is the performance on spam_test.)

In [18]:
# what's the best estimator?
para_search.best_estimator_


LogisticRegression(C=18.307382802953661, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [19]:
# what's the best score? 
print para_search.best_score_

0.928695652174


In [20]:
# Refit the best estimator on the whole data set. How many coefficients were reduced to 0? 

logit_best = para_search.best_estimator_
logit_best.fit(x_train, y_train)

print logit_best.coef_

abs_coef = abs(logit_best.coef_)
print 'number of coefficients reduced to 0: %i' %len(abs_coef[abs_coef < 1e-4])


[[ -3.92887375e-01  -1.94398114e-01   2.22984685e-01   4.50220775e-01
    9.97035103e-01   1.16420798e+00   1.58293179e+00   9.84975067e-01
    1.24576167e+00   4.36382273e-01  -7.77049963e-01  -4.21781050e-01
    1.65179579e-01  -2.05806810e-01   6.24816253e-01   1.10951892e+00
    1.13303311e+00   1.05774839e-01   5.14678054e-02   8.16206374e-01
    2.44117343e-01   2.49414812e-01   2.91252878e+00   3.14746488e-01
   -2.19315764e+00  -1.13309078e+00  -7.31625867e+00   2.64993629e-01
   -1.46957256e+00  -2.01957325e+00  -1.47811732e-01   2.50578024e+00
   -9.29213616e-01   3.72565060e-01  -1.39231106e+00   1.70858807e+00
    0.00000000e+00   1.24868800e+00  -1.50379455e+00  -1.02265976e+00
   -1.91449464e+01  -2.13393730e+00  -2.90039470e+00  -1.96836258e+00
   -1.02034960e+00  -1.62209977e+00  -3.34756284e-01  -2.15373277e+00
   -1.70630950e+00  -7.52200627e-01  -2.42272121e+00   2.24165423e-01
    3.19432568e+00   1.43999714e+00   3.50526768e-01   8.51506012e-04
    3.84988729e-04]]

In [21]:
# What's the corresponding training error and test error?

training_error = 1-logit_best.score(x_train, y_train)
test_error = 1-logit_best.score(x_test, y_test)

print 'training error for best model: %.4f' %training_error
print 'test error for best model: %.4f' %test_error


training error for best model: 0.0617
test error for best model: 0.0713


### Problem 2

Set *scoring = 'roc_auc'* and search again, what's the best parameters? Fit the best estimator on the spam_train data set. What's the training error and test error?

In [23]:
### your solution

para_search = GridSearchCV(estimator=logit, param_grid=grid_param, scoring='roc_auc', cv=5).fit(x_train, y_train)


In [24]:
# what are the best parameters?
para_search.best_estimator_


LogisticRegression(C=1.4174741629268048, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [25]:
# Fit the best estimator on the spam_train data set. What's the training error and test error?

logit_best = para_search.best_estimator_
logit_best.fit(x_train, y_train)

training_error = 1-logit_best.score(x_train, y_train)
test_error = 1-logit_best.score(x_test, y_test)

print 'training error for best model: %.4f' %training_error
print 'test error for best model: %.4f' %test_error

training error for best model: 0.0639
test error for best model: 0.0717


### Problem 3

In this exercise, we will predict the number of applications received(*Apps*) using the other variables in the College data set.

The features and the target variable are prepared as $x$ and $y$.

In [26]:
import pandas as pd
college = pd.read_csv('data/college.csv')
x = college.iloc[:, 2:]
y = college.iloc[:, 1]
print(college.shape)
college.head()

(777, 18)


Unnamed: 0,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F.Undergrad,P.Undergrad,Outstate,Room.Board,Books,Personal,PhD,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate
0,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56
2,Yes,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54
3,Yes,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59
4,Yes,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15


- (1) Split this data into a training set and a test set with train_size=0.5.(Hint: Use the function **sklearn.cross_validation.train_test_split** , set *random_state=0* and *tran_size=0.5*.)

- (2) Fit a linear model on the training set and report the training error and test error(mean squared error, you can use the function *sklearn.metrics.mean_squared_error*).

- (3) Fit a ridge regression on the training set, with $\alpha$ chosen by the cross validation. Report the training error and test error.

- (4) Fit a lasso on the training set, with $\alpha$ chosen by the cross validation. Report the training error and test error

- (5) Compare the results obtained, what do you find?

In [29]:
### your solution

#(1) Split this data into a training set and a test set with train_size=0.5

import sklearn.model_selection as ms
x_train, x_test, y_train, y_test = ms.train_test_split(x, y, test_size=0.5, random_state=0)

In [38]:
#(2) Fit a linear model on the training set and report the training error and test error

from sklearn import linear_model
lr = linear_model.LinearRegression()

lr.fit(x_train, y_train)

from sklearn.metrics import mean_squared_error

#mean_squared_error(y_true, y_pred)

print 'mean squared error for training data: %.2f' %mean_squared_error(y_train, lr.predict(x_train))
print 'mean squared error for test data: %.2f' %mean_squared_error(y_test, lr.predict(x_test))


mean squared error for training data: 1113145.99
mean squared error for test data: 1244800.27


In [43]:
# (3) Fit a ridge regression on the training set, with α chosen by the cross validation. 
# Report the training error and test error

from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
ridge_model = linear_model.Ridge() 

alphas = np.logspace(0, 8, 100)
grid_param=dict(alpha=alphas)

para_search = GridSearchCV(estimator=ridge_model, param_grid=grid_param, cv=5).fit(x, y)

ridge_best = para_search.best_estimator_

print 'Ridge mean squared error for training data: %.2f' %mean_squared_error(y_train, ridge_best.predict(x_train))
print 'Ridge mean squared error for test data: %.2f' %mean_squared_error(y_test, ridge_best.predict(x_test))


Ridge mean squared error for training data: 1197502.32
Ridge mean squared error for test data: 958010.41


In [60]:
# (4) Fit a lasso on the training set, with  αα  chosen by the cross validation. 
# Report the training error and test error

lasso_model = linear_model.Lasso() 

para_search = GridSearchCV(estimator=lasso_model, param_grid=grid_param, cv=5).fit(x, y)

lasso_best = para_search.best_estimator_

print 'Lasso mean squared error for training data: %.2f' %mean_squared_error(y_train, lasso_best.predict(x_train))
print 'Lasso mean squared error for test data: %.2f' %mean_squared_error(y_test, lasso_best.predict(x_test))
# rheineke: These mean squared error values are suprisingly small

Lasso mean squared error for training data: 0.08
Lasso mean squared error for test data: 0.09


### Problem 4
This time  we will try to predict the variable *Private* using the other variables in the College data set. The features and target variable are prepared for you.

In [66]:
college = pd.read_csv('data/college.csv')
college[['Private']] = college.Private=='Yes'
college[['Private']] = 1*college[['Private']]

x = college.iloc[:, 1:]
y = college.iloc[:, 0]

- (1) Split this data into a training set and a test set with train_size=0.5(Hint: Use the function **sklearn.cross_validation.train_test_split** , set *random_state=1* and *tran_size=0.5*.)]

- (2) Fit a logistic regression with regularizaton. Use the function **GridSearchCV** to fint out the best parameters.

In [67]:
import sklearn.model_selection as ms
x_train, x_test, y_train, y_test = ms.train_test_split(x, y, test_size=0.5, random_state=1)

In [68]:
#grid_para_logit = [{'penality': ['l1', 'l2'], 'alpha': np.logspace(-5, 5, 100)}]
grid_para_logit = [{'C': np.logspace(-5, 5, 100)}]

from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
logit = linear_model.LogisticRegression()

#logit.get_params().keys()
para_search = GridSearchCV(estimator=logit, param_grid=grid_para_logit, cv=5).fit(x, y)

In [69]:
# What's the best parameters?

logit_best = para_search.best_estimator_
logit_best

LogisticRegression(C=0.00020565123083486514, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [70]:
#Refit the model on the training set with best parameters. What's the training error and test error?

training_error = 1-logit_best.score(x_train, y_train)
test_error = 1-logit_best.score(x_test, y_test)

print 'training error for best model: %.4f' %training_error
print 'test error for best model: %.4f' %test_error


training error for best model: 0.0541
test error for best model: 0.0463


    
- (3) Fit a KNN model. Use the function **GridSearchCV** to fint out the appropriate parameter *n_neighbors*. Refit the model on the training set and report the training error and test error.

- (4) Compare the results of logistic regression and KNN.

In [73]:
### your solution

from sklearn import neighbors
knn = neighbors.KNeighborsClassifier()
#knn.get_params().keys()

grid_param = [{'n_neighbors': range(3, 31)}]
para_search = GridSearchCV(estimator=knn, param_grid=grid_param, scoring='accuracy', cv=5).fit(x, y)
knn_best = para_search.best_estimator_

training_error = 1-knn_best.score(x_train, y_train)
test_error = 1-knn_best.score(x_test, y_test)

print 'training error for best model: %.4f' %training_error
print 'test error for best model: %.4f' %test_error

training error for best model: 0.0593
test error for best model: 0.0720


In [None]:
# rheineke: What's the results of comparing these two models?