# CSE 572: Lab 10

In this lab, you will practice implementing techniques for model selection including cross validation and grid search.

To execute and make changes to this notebook, click File > Save a copy to save your own version in your Google Drive or GitHub. Read the step-by-step instructions below carefully. To execute the code, click on each cell below and either press SHIFT-ENTER or click the Play button.

When you finish executing all code/exercises, save your notebook, then download a copy (.ipynb file). Submit the following **three** things on Canvas:
1. a link to your Colab notebook,
2. the .ipynb file, and
3. a PDF of the executed notebook.

To generate a PDF of the notebook, click File > Print > Save as PDF.

In [None]:
# Import libraries
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
seed = 0
np.random.seed(seed)

### Load the iris dataset

In [None]:
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']

data.sample(5, random_state=seed)

,sepal length,sepal width,petal length,petal width,class
114,5.8,2.8,5.1,2.4,Iris-virginica
62,6.0,2.2,4.0,1.0,Iris-versicolor
33,5.5,4.2,1.4,0.2,Iris-setosa
107,7.3,2.9,6.3,1.8,Iris-virginica
7,5.0,3.4,1.5,0.2,Iris-setosa


In [None]:
data.shape

(150, 5)

Standardize the data by subtracting the feature-wise mean and dividing by the feature-wise standard deviation for each sample.

In [None]:
# YOUR CODE HERE
y = data['class']
data = data.drop(['class'],axis=1)
data = (data-np.mean(data,axis=0))/np.std(data,axis=0)
data['class']=y
data.shape

(150, 5)

In [None]:
data.sample(5, random_state=seed)

,sepal length,sepal width,petal length,petal width,class
114,-0.052506,-0.587764,0.762759,1.579429,Iris-virginica
62,0.18983,-1.976181,0.137236,-0.261193,Iris-versicolor
33,-0.41601,2.651878,-1.341272,-1.312977,Iris-setosa
107,1.765012,-0.356361,1.445147,0.790591,Iris-virginica
7,-1.021849,0.800654,-1.284407,-1.312977,Iris-setosa
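
As a sanity check, standardized features should have a mean of about 0 and a standard deviation of about 1. The sketch below illustrates this on a small made-up array (the values in `X` are illustrative and are not taken from the iris data). It assumes the population standard deviation (`ddof=0`), which is NumPy's default for `std`.

```python
import numpy as np

# Toy feature matrix: 4 samples x 2 features (illustrative values only)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# z-score each column: subtract the column mean, divide by the column std.
# ndarray.std uses the population standard deviation (ddof=0) by default.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # approximately 0 for every feature
print(Z.std(axis=0))   # approximately 1 for every feature
```

The same check can be applied to the numeric columns of `data` after the standardization cell above.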


### k-fold Cross validation

We will use 5-fold cross validation to train and evaluate our classifier. Because we will not do any model selection or hyperparameter tuning in this step, each iteration only needs two splits: a training set and a held-out set used for evaluation.

To split the data into 5 folds we will shuffle the rows and then split them into $k$ equal groups.

In [None]:
k = 5

# Note: np.split raises error if indices_or_sections is 
# an integer and doesn't result in equal size splits
folds = np.split(data.sample(frac=1, random_state=seed), indices_or_sections=k)


Use a for loop to print the number of samples and number of samples from each class in each fold.

In [None]:
# YOUR CODE HERE
for i, f in enumerate(folds, start=1):
  # Count the samples of each class once per fold
  counts = f['class'].value_counts()
  print("Number of samples in fold ",i," is ", f.shape[0], "\n",
        "Number of samples of Iris-versicolor are ",counts['Iris-versicolor'],"\n",
        "Number of samples of Iris-virginica are ",counts['Iris-virginica'],"\n",
        "Number of samples of Iris-setosa are ",counts['Iris-setosa'],"\n")



Number of samples in fold  1  is  30 
 Number of samples of Iris-versicolor are  13 
 Number of samples of Iris-virginica are  6 
 Number of samples of Iris-setosa are  11 

Number of samples in fold  2  is  30 
 Number of samples of Iris-versicolor are  10 
 Number of samples of Iris-virginica are  15 
 Number of samples of Iris-setosa are  5 

Number of samples in fold  3  is  30 
 Number of samples of Iris-versicolor are  10 
 Number of samples of Iris-virginica are  10 
 Number of samples of Iris-setosa are  10 

Number of samples in fold  4  is  30 
 Number of samples of Iris-versicolor are  6 
 Number of samples of Iris-virginica are  10 
 Number of samples of Iris-setosa are  14 

Number of samples in fold  5  is  30 
 Number of samples of Iris-versicolor are  11 
 Number of samples of Iris-virginica are  9 
 Number of samples of Iris-setosa are  10 
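
The class counts above vary quite a bit between folds (fold 1 has 6 Iris-virginica samples, fold 2 has 15), because a plain shuffle does not balance the classes. One common alternative, which this lab does not use, is stratified splitting. The sketch below assumes sklearn's `StratifiedKFold` and the copy of iris bundled with sklearn (`load_iris`, which encodes the classes as integers 0 to 2 instead of strings).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

# Bundled iris data: 150 samples, 50 per class, integer labels 0..2
X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_class_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    # Count how many samples of each class landed in the held-out fold
    counts = np.bincount(y[val_idx], minlength=3)
    fold_class_counts.append(counts)
    print(f"Fold {fold}: {len(val_idx)} samples, class counts {counts}")
```

Since iris has 50 samples per class, each of the 5 stratified folds holds 10 samples of every class.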



### Train a k Nearest Neighbors classifier 

We will use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) in sklearn for our classification model. Use cross validation to train and evaluate the model. Set hyperparameters to `n_neighbors=5`, `metric='l2'`, and `weights='uniform'`.

Implement a for loop to iterate through each fold, training a new kNN model each iteration with one fold assigned to validation and the remaining folds assigned to training. Compute the validation accuracy for each iteration and append it to the `accuracies` list.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

accuracies = []
# YOUR CODE HERE

for i in range(k):
  # Hold out fold i for validation and train on the remaining k-1 folds
  val = folds[i]
  train = pd.concat(folds[:i] + folds[i+1:], sort=False)

  # Separate features and labels
  train_y = train['class']
  train_x = train.drop(['class'], axis=1)

  val_y = val['class']
  val_x = val.drop(['class'], axis=1)

  # Train a fresh model for this iteration
  clf = KNeighborsClassifier(n_neighbors=5, metric='l2', weights='uniform')
  clf.fit(train_x, train_y)
  pred = clf.predict(val_x)

  # accuracy_score expects (y_true, y_pred)
  accuracies.append(accuracy_score(val_y, pred))

accuracies

[1.0, 0.8666666666666667, 1.0, 1.0, 0.9]

Print the mean and standard deviation of the accuracy from cross validation (across all $k$ folds).

In [None]:
print('Mean accuracy: {:.2f}'.format(np.mean(accuracies)))
print('Standard deviation of accuracy: {:.2f}'.format(np.std(accuracies)))


Mean accuracy: 0.95
Standard deviation of accuracy: 0.06
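
For comparison, sklearn can run the same kind of fold loop in a single call. The sketch below assumes `cross_val_score` with a shuffled `KFold` splitter on the bundled `load_iris` data rather than the notebook's `data` frame. It also standardizes inside a pipeline (per training fold) instead of standardizing the whole dataset up front, so the accuracies are not expected to match the manual loop exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize inside the pipeline so each fold is scaled using only its training data
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric='l2', weights='uniform'),
)

# 5 shuffled folds; each fold is used once as the held-out set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

print('Mean accuracy: {:.2f}'.format(scores.mean()))
print('Standard deviation of accuracy: {:.2f}'.format(scores.std()))
```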


**Question 1: If you increased the number of folds, do you expect the standard deviation of the accuracy across $k$ folds to increase or decrease? Why?**

**Answer:**

I expect the standard deviation to increase. When we increase the number of folds $k$, the training set in each iteration grows (it contains $k-1$ of the $k$ folds) and the validation set shrinks. The training sets of different iterations then overlap heavily, so the fitted models are very similar to each other, but each model is evaluated on fewer validation samples. An accuracy estimated from a smaller validation set is noisier, so the results vary more from one iteration to the next.
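
One way to probe this empirically is to rerun cross validation with different numbers of folds and compare the spread of the accuracies. This is a rough sketch, assuming the bundled `load_iris` data (unstandardized, integer labels) and a kNN model with `n_neighbors=5`. The exact numbers depend on the shuffle, so the output is illustrative rather than proof.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# Standard deviation of the per-fold accuracy for several fold counts
std_by_folds = {}
for n_folds in [2, 5, 10, 30]:
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = cross_val_score(knn, X, y, cv=cv)
    std_by_folds[n_folds] = scores.std()
    print(f"{n_folds:>2} folds: validation size {len(X) // n_folds:>2}, "
          f"std of accuracy {scores.std():.3f}")
```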

### Hyperparameter selection using cross validation and grid search

In this exercise, we will use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) again but this time we will perform hyperparameter selection using k-fold cross validation and Grid Search. 

We have three model choices (hyperparameters) for our kNN model:
- Number of neighbors ($k$ or `n_neighbors`). We will consider all integer values $k \in [1,10]$.
- Whether to treat all neighbors equally when taking majority vote, or weight them according to their distance from the query point (`weights='uniform'` or `weights='distance'`).
- The distance metric for computing distance between query point and neighbors (`metric` argument). We will consider three options for `metric`: `'l1'`, `'l2'`, and `'cosine'`.

**Question 2: How many total combinations of the above hyperparameter choices are there?**

**Answer:**

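There are $10 \times 2 \times 3 = 60$ combinations: 10 values of `n_neighbors`, 2 choices for `weights`, and 3 choices for `metric`.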
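
The count can be double-checked by enumerating the grid. The sketch below uses `itertools.product` from the standard library; the variable names are illustrative.

```python
from itertools import product

n_neighbors = list(range(1, 11))    # 10 values
weights = ['uniform', 'distance']   # 2 values
metrics = ['l1', 'l2', 'cosine']    # 3 values

# Every (n_neighbors, weights, metric) triple is one candidate model
combinations = list(product(n_neighbors, weights, metrics))
print(len(combinations))  # 10 * 2 * 3 = 60
```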

Instead of implementing cross validation manually as we did in the previous example, we will use the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) class in sklearn to perform grid search and cross validation simultaneously. 

First, we will split the data into a training (70\%) and test (30\%) set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[data.columns[:-1]], 
                                                    data['class'], 
                                                    test_size=0.3, 
                                                    random_state=seed)

We will then use the training set for cross validation and grid search to select the optimal hyperparameter settings.

Next, we define the values for grid search using a dictionary in which the keys are the parameter names to be passed to the model function and each corresponding value is a list of possible values to try in grid search.

In [None]:
param_grid = {'n_neighbors': list(range(1, 11)), 
              'weights': ['uniform', 'distance'],
              'metric': ['l1', 'l2', 'cosine']
             }

Next, we instantiate a KNeighborsClassifier but do not specify the hyperparameter settings yet.

In [None]:
knn = KNeighborsClassifier()

We can then pass this classifier and our parameter grid to a new GridSearchCV object and fit the GridSearchCV using our training data.

In [None]:
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(knn, param_grid)

clf.fit(X_train, y_train)

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'metric': ['l1', 'l2', 'cosine'],
                         'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'weights': ['uniform', 'distance']})
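
The call above relies on GridSearchCV's defaults. In recent sklearn versions, `cv` defaults to 5-fold cross validation, classifiers are scored with accuracy, and `refit=True` retrains the best configuration on all of the data passed to `fit`. The sketch below spells these defaults out. It assumes the bundled `load_iris` data (unstandardized) in place of the notebook's `data` frame, and the name `search` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

param_grid = {'n_neighbors': list(range(1, 11)),
              'weights': ['uniform', 'distance'],
              'metric': ['l1', 'l2', 'cosine']}

# Same kind of search as above, with the relevant defaults written explicitly
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring='accuracy', refit=True)
search.fit(X_train, y_train)

print(search.n_splits_)                   # number of CV folds used
print(len(search.cv_results_['params']))  # number of hyperparameter combinations tried
```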

The cross validation results are stored in the `cv_results_` attribute of the GridSearchCV object as a dictionary with column headers as keys and columns as values, which can be loaded into a pandas DataFrame.

In [None]:
cv_results = pd.DataFrame(clf.cv_results_)

cv_results

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_metric,param_n_neighbors,param_weights,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.006086,0.004478,0.006147,0.002994,l1,1,uniform,"{'metric': 'l1', 'n_neighbors': 1, 'weights': ...",0.857143,0.904762,1.0,0.904762,0.952381,0.92381,0.048562,37
1,0.003158,0.0023,0.00224,0.000181,l1,1,distance,"{'metric': 'l1', 'n_neighbors': 1, 'weights': ...",0.857143,0.904762,1.0,0.904762,0.952381,0.92381,0.048562,37
2,0.00219,0.000311,0.022088,0.037354,l1,2,uniform,"{'metric': 'l1', 'n_neighbors': 2, 'weights': ...",0.857143,0.904762,1.0,0.857143,0.952381,0.914286,0.055533,44
3,0.002428,5.8e-05,0.002836,8.9e-05,l1,2,distance,"{'metric': 'l1', 'n_neighbors': 2, 'weights': ...",0.857143,0.904762,1.0,0.904762,0.952381,0.92381,0.048562,37
4,0.002545,0.000212,0.004372,0.000219,l1,3,uniform,"{'metric': 'l1', 'n_neighbors': 3, 'weights': ...",0.904762,1.0,1.0,0.857143,0.952381,0.942857,0.055533,10
5,0.002666,0.000155,0.003098,0.000188,l1,3,distance,"{'metric': 'l1', 'n_neighbors': 3, 'weights': ...",0.904762,1.0,1.0,0.904762,0.952381,0.952381,0.042592,2
6,0.002897,0.000356,0.004995,0.00043,l1,4,uniform,"{'metric': 'l1', 'n_neighbors': 4, 'weights': ...",0.857143,1.0,1.0,0.904762,1.0,0.952381,0.060234,2
7,0.002456,0.000383,0.002478,0.000283,l1,4,distance,"{'metric': 'l1', 'n_neighbors': 4, 'weights': ...",0.857143,1.0,1.0,0.904762,0.952381,0.942857,0.055533,10
8,0.002238,0.000208,0.003206,0.000648,l1,5,uniform,"{'metric': 'l1', 'n_neighbors': 5, 'weights': ...",0.809524,1.0,1.0,0.904762,0.952381,0.933333,0.07127,27
9,0.002902,0.000579,0.00314,0.000696,l1,5,distance,"{'metric': 'l1', 'n_neighbors': 5, 'weights': ...",0.809524,1.0,1.0,0.904762,0.952381,0.933333,0.07127,27
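
A table like this is easier to read when sorted by `rank_test_score`, where rank 1 is the best mean validation score. The sketch below uses four rows copied from the output above with only a subset of the columns; the name `cv_subset` is illustrative.

```python
import pandas as pd

# Rows 0, 2, 4 and 5 of the cv_results table above (subset of columns)
cv_subset = pd.DataFrame({
    'param_metric': ['l1', 'l1', 'l1', 'l1'],
    'param_n_neighbors': [1, 2, 3, 3],
    'param_weights': ['uniform', 'uniform', 'uniform', 'distance'],
    'mean_test_score': [0.923810, 0.914286, 0.942857, 0.952381],
    'rank_test_score': [37, 44, 10, 2],
})

# Lower rank is better, so sort in ascending order
ranked = cv_subset.sort_values('rank_test_score').reset_index(drop=True)
print(ranked)
```

The same `sort_values('rank_test_score')` call can be applied to the full `cv_results` DataFrame.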


Look at the [GridSearchCV documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) to read about the other attributes stored after fitting. Print the value of the attribute that gives the parameter settings for the best results on the hold out data.

In [None]:
# YOUR CODE HERE
# best_params_ holds the parameter setting with the best mean cross-validation score
clf.best_params_

{'metric': 'l1', 'n_neighbors': 6, 'weights': 'uniform'}

Train a new kNN classifier using the hyperparameter settings that were found to give the best results on the hold out data from GridSearchCV (the values printed in the last cell). Train it on the full training set.

In [None]:
# YOUR CODE HERE
# Rebuild the classifier with the hyperparameters selected by grid search
knn = KNeighborsClassifier(metric='l1', n_neighbors=6, weights='uniform')
knn.fit(X_train, y_train)

KNeighborsClassifier(metric='l1', n_neighbors=6)
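
Because GridSearchCV uses `refit=True` by default, it also stores a model that was already retrained on the full training set with the best settings, in `best_estimator_`. The sketch below compares that stored model with a manual refit. It assumes the bundled `load_iris` data (unstandardized) in place of the notebook's `data` frame, so the selected hyperparameters may differ from those printed above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

param_grid = {'n_neighbors': list(range(1, 11)),
              'weights': ['uniform', 'distance'],
              'metric': ['l1', 'l2', 'cosine']}
search = GridSearchCV(KNeighborsClassifier(), param_grid)  # refit=True by default
search.fit(X_train, y_train)

# Manual refit on the full training set with the selected hyperparameters
manual = KNeighborsClassifier(**search.best_params_).fit(X_train, y_train)

# The refit model stored by GridSearchCV should make the same predictions
same = np.array_equal(manual.predict(X_test),
                      search.best_estimator_.predict(X_test))
print(search.best_params_, same)
```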

Apply the trained classifier to the test dataset and print the test accuracy.

In [None]:
# YOUR CODE HERE
pred = knn.predict(X_test)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, pred)

0.9777777777777777