This time we use cross-validation to find the best model for a spam filter.

**Remark** The objective functions for logistic regression implemented in `sklearn` are:
<img src="L1.png">
and
<img src="L2.png">

where
- $w$ are the coefficients, denoted by $\beta_i$ in class.
- $c$ is the intercept, denoted by $\beta_0$ in class. The parameter `fit_intercept` controls whether it is kept or removed.
- $C$ is the inverse of the regularization strength, the opposite of the $\alpha$ we used in Ridge and Lasso: smaller values specify stronger regularization.
- The first objective function therefore uses an $L_1$ penalty and the second an $L_2$ penalty.
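
To make the inverse relationship between $C$ and regularization strength concrete, here is a minimal sketch on synthetic data. The data set from `make_classification`, the helper `count_zero_coefs`, and the two values of `C` are illustrative assumptions, not part of the exercise. It fits an $L_1$-penalized logistic regression at a small and a large $C$ and counts how many coefficients are shrunk to (essentially) zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the spam data (assumption: 20 features, 5 informative)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

def count_zero_coefs(C):
    """Fit an L1-penalized logistic regression and count near-zero coefficients."""
    # liblinear is one of the solvers that supports the l1 penalty
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    model.fit(X, y)
    return int(np.sum(np.abs(model.coef_) < 1e-4))

zeros_strong = count_zero_coefs(0.01)   # small C: strong regularization
zeros_weak = count_zero_coefs(100.0)    # large C: weak regularization
print(zeros_strong, zeros_weak)
```

The first count should be at least as large as the second, since a smaller $C$ puts more weight on the penalty term.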

### Problem 1
Use the class <code>GridSearchCV</code> to find the best combination of parameters for logistic regression. (Set <code>cv=5</code> and <code>scoring='accuracy'</code>.)

In [18]:
from __future__ import print_function
import pandas as pd
import numpy as np
import sklearn

spam_train_df = pd.read_csv('data/spam_train.csv')
x_train = spam_train_df.iloc[:, :57].values
y_train = spam_train_df.iloc[:, -1].values


spam_test_df = pd.read_csv('data/spam_test.csv')
x_test = spam_test_df.iloc[:, :57].values
y_test = spam_test_df.iloc[:, -1].values

In [34]:
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
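# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# sklearn.model_selection provides GridSearchCV and train_test_split instead
import sklearn.model_selection as gs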
# from sklearn import datasets
# from sklearn import neighbors
# knn = neighbors.KNeighborsClassifier()

# logit = linear_model.LogisticRegression()
# logit.fit(x_train, y_train)   # i fit before the grid search

para_grid = [{
    'penalty': ['l1', 'l2'],
    'fit_intercept': [False, True],
    'C': np.logspace(-5, 5, 100)
}]

# Your solution
# para_search_0 = GridSearchCV(estimator=logit, 
#                              param_grid=para_grid, 
#                              scoring='accuracy', 
#                              cv=5).fit(x_train, y_train)


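# liblinear supports both the l1 and l2 penalties in the grid
logit = linear_model.LogisticRegression(solver='liblinear')
para_search = gs.GridSearchCV(logit, para_grid, cv=5, scoring='accuracy')
para_search.fit(x_train, y_train)      # the grid search fits the estimator itself; no prior fit is needed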

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.26186e-05, ...,   7.92483e+04,   1.00000e+05]), 'fit_intercept': [False, True]}],
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)

    - What is the best combination?
    - What is the best score?
    - Refit the best estimator on the whole data set. How many coefficients were reduced to 0? (Hint: count the coefficients whose absolute value is smaller than 1e-4.)
    - What are the corresponding training error and test error? (Training error is the model's error on spam_train, while test error is its error on spam_test.)

In [38]:
### your solution
print(para_search.best_params_)   # gives best combination

print(para_search.best_score_)   # gives best score

logit_best = para_search.best_estimator_
print(np.sum(np.abs(logit_best.coef_) < 1e-4))  
# count the coefficients that are essentially zero (absolute value below 1e-4);
# there are four

# score() returns accuracy, so error = 1 - score
print("Training error: %.5f" % (1-logit_best.score(x_train, y_train)))
print("Test error: %.5f" % (1-logit_best.score(x_test, y_test)))


# logit_best.coef_  # uncomment to inspect the fitted coefficients themselves



{'penalty': 'l1', 'C': 0.027185882427329403, 'fit_intercept': False}
0.932989690722
4
Training error: 0.05155
Test error: 0.06684


array([[ -3.83822373e-04,  -7.37875172e-04,   1.07448619e-03,
          0.00000000e+00,   0.00000000e+00,   4.59882083e-05,
         -5.32080285e-04,   8.38713896e-04,   3.53231647e-04,
          2.50923051e-03,  -6.99586461e-04,  -8.20526561e-02,
         -3.58114747e-02,   0.00000000e+00,   3.35163207e-03,
          3.19029522e-04,   1.46636961e-03]])

### Problem 2

Set *scoring = 'roc_auc'* and search again. What are the best parameters? Fit the best estimator on the spam_train data set. What are the training error and test error?

In [40]:
### your solution
# liblinear supports both the l1 and l2 penalties in the grid
logit = linear_model.LogisticRegression(solver='liblinear')

# Encode 'spam' as 1 and everything else as 0 for roc_auc scoring.
# y_train and y_test are left untouched: the original cell overwrote them with
# plain lists, so re-running it raised
# AttributeError: 'list' object has no attribute 'size'
y_train_fix = (np.asarray(y_train) == 'spam').astype(int)
y_test_fix = (np.asarray(y_test) == 'spam').astype(int)

para_search = gs.GridSearchCV(estimator=logit, param_grid=para_grid,
                              scoring='roc_auc', cv=5)
para_search.fit(x_train, y_train_fix)
# best_estimator_ is the model refit on the full training set with the best parameters
logit_best = para_search.best_estimator_

In [15]:
print(para_search.best_params_)
print(para_search.best_score_)

{'penalty': 'l1', 'C': 1.4174741629268048, 'fit_intercept': True}
0.971550943131
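
Problem 2 also asks for the training and test error of the best estimator. A minimal sketch of that step follows, run on synthetic data rather than the spam files. The data set, the reduced grid, and the names `search`, `best`, `train_error`, and `test_error` are assumptions made for illustration. Model selection uses AUC, while the reported error is still one minus accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for spam_train / spam_test (assumption)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Smaller grid than the notebook's, only to keep the sketch fast
grid = {'penalty': ['l1', 'l2'], 'C': np.logspace(-3, 3, 7)}
search = GridSearchCV(LogisticRegression(solver='liblinear'), grid,
                      scoring='roc_auc', cv=5)
search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already refit on all of X_tr
best = search.best_estimator_
train_error = 1 - best.score(X_tr, y_tr)   # score() is accuracy for classifiers
test_error = 1 - best.score(X_te, y_te)

print(search.best_params_, search.best_score_)
print("Training error: %.5f" % train_error)
print("Test error: %.5f" % test_error)
```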


### Problem 3

In this exercise, we will predict the number of applications received (*Apps*) using the other variables in the College data set.

The features and the target variable are prepared as $x$ and $y$.

In [16]:
import pandas as pd
college = pd.read_csv('data/college.csv')
x = college.iloc[:, 2:]
y = college.iloc[:, 1]
print(college.shape)
college.head()

(777, 18)


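| | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| 1 | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
| 2 | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
| 3 | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
| 4 | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |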


- (1) Split this data into a training set and a test set with train_size=0.5. (Hint: Use the function **sklearn.model_selection.train_test_split**, and set *random_state=0* and *train_size=0.5*.)

- (2) Fit a linear model on the training set and report the training error and test error (mean squared error; you can use the function *sklearn.metrics.mean_squared_error*).

- (3) Fit a ridge regression on the training set, with $\alpha$ chosen by cross-validation. Report the training error and test error.

- (4) Fit a lasso on the training set, with $\alpha$ chosen by cross-validation. Report the training error and test error.

- (5) Compare the results obtained. What do you find?

In [24]:
### your solution
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# (1) 50/50 split with a fixed seed
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.5, random_state=0)

# (2) Apps is a continuous target, so fit ordinary least squares rather than
# logistic regression, and report the mean squared error
ols = linear_model.LinearRegression()
ols.fit(x_train, y_train)
print("OLS training MSE: %.2f" % mean_squared_error(y_train, ols.predict(x_train)))
print("OLS test MSE: %.2f" % mean_squared_error(y_test, ols.predict(x_test)))

# (3) ridge regression, fit on the training set only;
# alpha = 1 is a placeholder that has not yet been chosen by cross-validation
ridge = linear_model.Ridge(alpha=1)  # create a ridge regression instance
ridge.fit(x_train, y_train)  # fit data
ridge.coef_, ridge.intercept_  # show the coefficients
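
Parts (3) and (4) ask for $\alpha$ chosen by cross-validation. One way to do that is with `RidgeCV` and `LassoCV`, sketched here on synthetic data. The data set from `make_regression`, the candidate grid `alphas`, and the `results` dictionary are assumptions for illustration; the College data would replace `X` and `y`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the College features and Apps (assumption)
X, y = make_regression(n_samples=400, n_features=15, n_informative=6,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5, random_state=0)

alphas = np.logspace(-3, 3, 50)   # candidate regularization strengths (assumption)

# RidgeCV and LassoCV pick alpha by cross-validation on the training set only
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_tr, y_tr)
lasso = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X_tr, y_tr)

results = {}
for name, model in [('ridge', ridge), ('lasso', lasso)]:
    # store (training MSE, test MSE) for each model
    results[name] = (mean_squared_error(y_tr, model.predict(X_tr)),
                     mean_squared_error(y_te, model.predict(X_te)))
    print("%s: alpha=%.4g, train MSE=%.2f, test MSE=%.2f"
          % (name, model.alpha_, results[name][0], results[name][1]))
```

Comparing these numbers with the ordinary least squares errors on the same split gives the material for part (5).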

### Problem 4
This time we will try to predict the variable *Private* using the other variables in the College data set. The features and target variable are prepared for you.

In [25]:
x = college.iloc[:, 1:]
y = college.iloc[:, 0]

- (1) Split this data into a training set and a test set with train_size=0.5. (Hint: Use the function **sklearn.model_selection.train_test_split**, and set *random_state=1* and *train_size=0.5*.)

- (2) Fit a logistic regression with regularization. Use the function **GridSearchCV** to find the best parameters.

In [28]:
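# LogisticRegression takes 'penalty' and 'C' (inverse regularization strength), not 'alpha'
grid_para_logit = [{'penalty': ['l1', 'l2'], 'C': np.logspace(-5, 5, 100)}]

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.5, random_state=1)
# liblinear supports both the l1 and l2 penalties in the grid
logit = linear_model.LogisticRegression(solver='liblinear')
para_search = GridSearchCV(logit, grid_para_logit, scoring='accuracy', cv=5)
para_search.fit(x_train, y_train)

print(para_search.best_params_)
# best_estimator_ has already been refit on the training set (refit=True by default)
logit_best = para_search.best_estimator_
print("Training error: %.5f" % (1 - logit_best.score(x_train, y_train)))   # score is accuracy, so error = 1 - score
print("Test error: %.5f" % (1 - logit_best.score(x_test, y_test)))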


    - What are the best parameters?
    - Refit the model on the training set with best parameters. What's the training error and test error?
    
- (3) Fit a KNN model. Use the function **GridSearchCV** to find the appropriate parameter *n_neighbors*. Refit the model on the training set and report the training error and test error.

- (4) Compare the results of logistic regression and KNN.
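
Part (3) can be sketched as follows, on synthetic data rather than the College set. The data from `make_classification` and the names `knn_search`, `knn_best`, `knn_train_error`, and `knn_test_error` are assumptions for illustration. The search runs on the training split only, and the refit best estimator is then scored on both splits.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the College features and the Private label (assumption)
X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5, random_state=1)

# Same range of n_neighbors as in the exercise
grid = {'n_neighbors': list(range(3, 31))}
knn_search = GridSearchCV(KNeighborsClassifier(), grid, scoring='accuracy', cv=5)
knn_search.fit(X_tr, y_tr)

# best_estimator_ is refit on the whole training split with the best n_neighbors
knn_best = knn_search.best_estimator_
knn_train_error = 1 - knn_best.score(X_tr, y_tr)
knn_test_error = 1 - knn_best.score(X_te, y_te)

print(knn_search.best_params_)
print("Training error: %.5f" % knn_train_error)
print("Test error: %.5f" % knn_test_error)
```

The same two error figures computed for the tuned logistic regression give a like-for-like comparison for part (4).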

In [31]:
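### your solution
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier()   # estimator whose n_neighbors is tuned
grid_param = [{'n_neighbors': range(3, 31)}]
## fit one model per candidate n_neighbors and cross-validation fold
para_search = GridSearchCV(estimator=knn, 
                           param_grid=grid_param, 
                           scoring='accuracy', 
                           cv=5).fit(x,y)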

print(list(para_search.cv_results_.keys()))
para_search.cv_results_

['rank_test_score', 'split4_test_score', 'mean_score_time', 'param_n_neighbors', 'std_test_score', 'std_train_score', 'split1_train_score', 'split0_test_score', 'mean_test_score', 'std_score_time', 'split2_train_score', 'split0_train_score', 'params', 'std_fit_time', 'split4_train_score', 'split2_test_score', 'split3_test_score', 'mean_train_score', 'mean_fit_time', 'split3_train_score', 'split1_test_score']


{'mean_fit_time': array([ 0.01559997,  0.00200009,  0.00219998,  0.00260005,  0.00239992,
         0.00200009,  0.00239997,  0.00220003,  0.00239992,  0.002     ,
         0.00200005,  0.00200005,  0.00219994,  0.00200009,  0.00219998,
         0.00219994,  0.00260005,  0.00220003,  0.00199995,  0.00220003,
         0.00220003,  0.002     ,  0.00220003,  0.00260005,  0.00220003,
         0.00239997,  0.00260005,  0.00219998]),
 'mean_score_time': array([ 0.02359996,  0.00339999,  0.00360003,  0.00320001,  0.00340004,
         0.004     ,  0.00359998,  0.00339999,  0.00380006,  0.004     ,
         0.004     ,  0.00399995,  0.00420008,  0.00399995,  0.00400009,
         0.00439997,  0.00419993,  0.00439997,  0.00440006,  0.00459995,
         0.0046    ,  0.00480003,  0.00479999,  0.00419993,  0.00480003,
         0.00440001,  0.00420003,  0.00479999]),
 'mean_test_score': array([ 0.92921493,  0.92535393,  0.93436293,  0.93178893,  0.93178893,
         0.93178893,  0.93436293,  0.9343629