# Lab 4: Model evaluation

Hello! This lab teaches several ways of evaluating the kind of model we learned to build in Lab 3. In the last lab, we did not use any automatic procedure to find the optimal parameters; we simply tried values by hand and checked whether the model performed better. In this lab, you will learn how to run a grid search over parameter settings. After that, in the following assignment, you will try various ways of evaluating a model.

### What we will cover in Lab 4
- 4-1. Run several validation methods using scikit-learn
  - K-fold
  - Grid-search
  - Nested k-fold
  
- 4-2. Implement manually (for programmers)
  - K-fold

## 4-1. Run several validation methods using scikit-learn

The first validation method we can run is k-fold cross-validation. This method is simple and is one of the most widely used in practice. It divides the dataset into k folds of roughly equal size, trains on k-1 of them, and uses the remaining fold as a validation set. The validation fold changes in each of the k rounds, so every sample is used for validation exactly once, and the k scores are averaged to get a more general estimate of performance.
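To make the partitioning concrete, here is a minimal pure-Python sketch on a toy set of 10 sample indices with k = 5. The sizes and variable names are hypothetical and chosen only for illustration; the scikit-learn classes used later in this lab handle uneven fold sizes and shuffling for you.

```python
# Toy illustration of k-fold partitioning (pure Python, no scikit-learn).
n_samples, k = 10, 5            # hypothetical toy sizes
indices = list(range(n_samples))
fold_size = n_samples // k      # this sketch assumes n_samples is divisible by k

folds = []
for i in range(k):
    # the i-th chunk is the validation set for round i
    val_idx = indices[i * fold_size:(i + 1) * fold_size]
    # the other k-1 chunks together form the training set
    train_idx = [j for j in indices if j not in val_idx]
    folds.append((train_idx, val_idx))
    print(f"round {i}: train={train_idx} validation={val_idx}")
```

Each round trains on 8 samples and validates on 2, which is the (k-1):1 ratio described above.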

#### Load the libraries

Basic libraries used throughout the lab session!

In [None]:
import pandas as pd
import numpy as np
RANDOM_SEED = 12345

#### Load the data

In this lab, we will use the same data as in the previous lab: **Connectionist Bench** from the UCI Machine Learning Repository, which can be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data). We have already placed the dataset in the **datasets** directory, so you can simply load it from there. This dataset has two classes, ***Mines*** and ***Rocks***, with 60 attributes describing each data entity. More information can be found <a href="https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)">here</a>. If you download the file yourself, place it in the **datasets** directory next to this Jupyter notebook, and let's get started!

The first thing you always need to do is load the data and check that it loaded correctly. We will use pandas to load and manipulate it. Since the file has no **header** row, you need to pass an option so that the first row is not used as column names.

In [None]:
data = pd.read_csv("datasets/sonar.all-data", header=None)

You can always check that the data loaded as expected by looking at the first few rows.

In [None]:
data.head()

We'll use scikit-learn, where labels and data attributes are usually managed separately. Let's separate the labels from the attributes.

In [None]:
X = data.drop(60, axis=1)
y = data.iloc[:, -1]

**From this point on, the process differs from Lab 3.** We will no longer hold out a test set. Instead, we will split our dataset into two parts: a training set and a validation set. Here the validation set is used to estimate how well our model generalizes. However, if we also use the validation set during model creation to choose optimal parameters, we need to split the dataset into three parts, including a test set. In that case, the test set is used only to get the final performance measure of the created model.

Since we are not optimizing parameters at this stage, we will just divide our dataset into two parts using the **k-fold cross validation** method.
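For reference, the three-way split mentioned above could be sketched as two consecutive calls to scikit-learn's `train_test_split`. The toy arrays, the 60/20/20 proportions, and the variable names are assumptions made for illustration only; this lab itself does not use such a split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 4 features, two balanced classes.
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 4)
y_toy = np.array([0, 1] * 50)

# First hold out 20% of the data as the final test set...
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)
# ...then take 25% of the remainder (20% of the total) as the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_tr), len(X_va), len(X_te))
```

The validation part would be used to choose parameters, and the test part would be touched only once, at the very end.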

#### K-fold

Scikit-learn offers two types of k-fold methods: k-fold and stratified k-fold. As the name suggests, stratified k-fold keeps the class proportions roughly the same in every fold. We will try both and see which one creates better models on our dataset.

First, let's try the plain **k-fold** method. You can find it in the *model_selection* module.

In [None]:
from sklearn.model_selection import KFold

Next, we will initialize our instance as we did before for classifiers. Here we need to specify the number of splits (n_splits). Let's set it to five.

In [None]:
kf = KFold(n_splits=5)

Now we can pass our dataset to the **split** method of our instance. It yields five pairs of train/test indices, each dividing the dataset in a 4:1 ratio while following the original row order. If we want to shuffle the data, we need to request it when we create the instance by setting **shuffle=True**.

In [None]:
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    #print(type(train_index))
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

Next, let's try the stratified k-fold method. You can also find it in the *model_selection* module.

In [None]:
from sklearn.model_selection import StratifiedKFold

In [None]:
skf = StratifiedKFold(n_splits=5)

The difference in this case is that we also need to pass **y** to the split method so that the algorithm knows the label distribution.

In [None]:
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    print(type(train_index))
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
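To see why stratification can matter, the following sketch compares the class balance of the test folds produced by both splitters. It uses hypothetical toy labels that are sorted by class (12 of class 0 followed by 8 of class 1); if the rows of a real dataset are ordered by class in a similar way, unshuffled **KFold** can behave the same way on it.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy labels, sorted by class; the features are irrelevant here.
y_toy = np.array([0] * 12 + [1] * 8)
X_toy = np.zeros((len(y_toy), 1))

def positive_fraction_per_fold(splitter):
    """Return the fraction of class-1 samples in each test fold."""
    return [float(y_toy[test_idx].mean())
            for _, test_idx in splitter.split(X_toy, y_toy)]

plain = positive_fraction_per_fold(KFold(n_splits=4))
stratified = positive_fraction_per_fold(StratifiedKFold(n_splits=4))
print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With plain k-fold, some test folds contain a single class only, while the stratified folds all keep the overall 40% share of class 1.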

We can use these indices to build the training and validation sets ourselves, but that takes extra work. Scikit-learn offers another option, **cross_val_score**, which automates the whole cross-validation process in a single function call once the classifier instance is created.

In [None]:
from sklearn.model_selection import cross_val_score

Let's make a basic SVC classifier with the RBF kernel.

In [None]:
from sklearn.svm import SVC

In [None]:
clf = SVC()

When **cv** is an integer and the estimator is a classifier, **cross_val_score** uses **StratifiedKFold** internally, so you do not need to worry about the class distribution. If you want to use **KFold** instead of **StratifiedKFold**, create a **KFold** instance and pass it as the **cv** parameter.

In [None]:
scores = cross_val_score(clf, X, y, cv=5) 

In [None]:
kf = KFold(n_splits=5)
scores2 = cross_val_score(clf, X, y, cv=kf)

In [None]:
scores

In [None]:
scores2

In [None]:
np.mean(scores)

The default score is *accuracy*, but you can also compute other scores, such as the F1-score. Let's compute a macro-averaged F1-score instead of accuracy.

In [None]:
scores3 = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')

In [None]:
np.mean(scores3)
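If several metrics are needed at once, scikit-learn's `cross_validate` function accepts a list of scorer names and returns one array of per-fold scores for each. The sketch below runs on synthetic data from `make_classification` (an assumption, so that it is self-contained); in the lab you would pass `clf`, `X`, and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in data, used only to keep this sketch self-contained.
X_toy, y_toy = make_classification(n_samples=100, n_features=10, random_state=0)

results = cross_validate(SVC(), X_toy, y_toy, cv=5,
                         scoring=["accuracy", "f1_macro"])

# Each metric is stored under the key "test_<scorer name>".
for name in ("test_accuracy", "test_f1_macro"):
    print(name, "mean:", results[name].mean(), "std:", results[name].std())
```

Reporting the standard deviation next to the mean gives a rough idea of how much the score varies from fold to fold.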

#### Grid search

In the last lab, we tried to increase our test accuracy by trying various parameter values by hand. However, this is not feasible in practice, since we cannot wait for each model to finish training and then enter the next of many parameter combinations manually. Instead, **grid search** is used to find the best parameters within specified ranges.

In [None]:
from sklearn.model_selection import GridSearchCV

It receives the parameter sets as a list of dictionaries. Each dictionary defines its own grid: every combination of the values listed inside it is tried.

In [None]:
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]
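To get a feel for how many models this grid implies, the combinations can be enumerated by hand. The sketch below uses only the standard library and the same `param_grid` as above; the `candidates` list is a hypothetical name introduced for illustration.

```python
from itertools import product

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

candidates = []
for grid in param_grid:
    names = sorted(grid)
    # Cartesian product of the value lists inside one dictionary
    for values in product(*(grid[name] for name in names)):
        candidates.append(dict(zip(names, values)))

print(len(candidates))   # 4 linear settings + 4 * 2 RBF settings = 12
for params in candidates[:3]:
    print(params)
```

With 5-fold cross-validation, each of these 12 candidates is trained and scored five times, so the cost of a grid search grows quickly as more values are added.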

First, as always, we need to create an instance with all the parameters we have.

In [None]:
search = GridSearchCV(clf, param_grid, cv=5)

Next, we can directly fit this instance on our dataset. Since it runs cross-validation internally, we do not need to pass a pre-split dataset; we just pass the entire dataset.

In [None]:
search.fit(X, y)

Now our first grid-search is done! You can find out the best score and the best estimator.

In [None]:
search.best_estimator_

In [None]:
search.best_score_
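Beyond the best estimator and score, a fitted **GridSearchCV** also exposes `best_params_` and a `cv_results_` dictionary with the scores of every candidate. The sketch below shows one way to view them as a table; it uses synthetic data and a smaller, hypothetical grid (`toy_grid`) so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data and a reduced grid, for illustration only.
X_toy, y_toy = make_classification(n_samples=80, n_features=8, random_state=0)
toy_grid = {'C': [1, 10], 'gamma': [0.01, 0.001], 'kernel': ['rbf']}

toy_search = GridSearchCV(SVC(), toy_grid, cv=4).fit(X_toy, y_toy)
print(toy_search.best_params_)

# One row per candidate: mean and spread of its validation scores, plus rank.
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
table = pd.DataFrame(toy_search.cv_results_)[columns]
print(table.sort_values('rank_test_score'))
```

Looking at the whole table, rather than only the winner, shows whether the best setting is clearly ahead or whether several settings perform about the same.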

#### Nested k-fold

Nested k-fold is used when we want to estimate optimal parameters but do not have enough data to separate the dataset into three parts (training, validation, and test). This method first runs one k-fold to perform the grid search and then runs another k-fold to measure performance. Therefore, the dataset is shuffled differently before each k-fold, since the strategy is to estimate parameters and test using different portions of the same dataset.

Here we are going to use a default SVC classifier again!

In [None]:
clf = SVC(kernel="rbf")

The basic idea of nested k-fold is that we use one cross-validation to **create**, and the other cross-validation to **evaluate** the models and pick the best one. We can say that the second cross-validation works like a test set.

We eventually need a loop, but let's learn about a basic structure first.

First, we need to set two different k-fold cross-validation instances.

In [None]:
model_cv = KFold(n_splits=4, shuffle=True, random_state=RANDOM_SEED)
eval_cv = KFold(n_splits=4, shuffle=True, random_state=RANDOM_SEED+1)

Next, we also need to set one grid-search instance with the first k-fold instance.

In [None]:
search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=model_cv)
search.fit(X, y)

We can get the best model from our first cross-validation set and the parameter grid.

In [None]:
search.best_estimator_

In [None]:
search.best_score_

However, this best score is too optimistic to report, since it was computed on the same folds that were used to choose the parameters. Therefore, we now use our second cross-validation instance to get a more reasonable cross-validation score.

In [None]:
np.mean(cross_val_score(search.best_estimator_, X=X, y=y, cv=eval_cv))
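Note that in the cell above the grid search has already seen the entire dataset before the second cross-validation runs. A commonly used variant of nested cross-validation passes the **GridSearchCV** object itself to **cross_val_score**, so that the parameter search is repeated inside each outer training fold and the outer test fold stays unseen. The sketch below shows that variant on synthetic data with a reduced, hypothetical grid; it is an alternative to the procedure in this lab, not a replacement for it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data and a reduced grid, for illustration only.
X_toy, y_toy = make_classification(n_samples=120, n_features=10, random_state=0)
toy_grid = {'C': [1, 10], 'gamma': [0.01, 0.001], 'kernel': ['rbf']}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)  # chooses parameters
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)  # estimates performance

inner_search = GridSearchCV(SVC(), toy_grid, cv=inner_cv)

# For every outer split, cross_val_score refits the whole grid search on the
# outer training part and scores the selected model on the outer test part.
nested_scores = cross_val_score(inner_search, X_toy, y_toy, cv=outer_cv)
print(nested_scores.mean())
```

The resulting mean estimates how well the whole procedure (search plus fit) generalizes, rather than the score of one particular parameter setting.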

That is just one cycle. We run the process multiple times and compare the mean cross-validation scores to pick the best model.

In [None]:
scores = []
searches = []

COUNT = 4

for i in range(COUNT):

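    # use a different but reproducible shuffle in every cycle
    model_cv = KFold(n_splits=4, shuffle=True, random_state=RANDOM_SEED + 2 * i)
    eval_cv = KFold(n_splits=4, shuffle=True, random_state=RANDOM_SEED + 2 * i + 1)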
    
    search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=model_cv)
    search.fit(X, y)
    searches.append(search)

    scores.append(np.mean(cross_val_score(search.best_estimator_, X=X, y=y, cv=eval_cv)))

Then we can pick the maximum score from the score list we made.

In [None]:
scores

In [None]:
max_score_idx = np.argmax(scores)
max_score_idx

Using that index, we can retrieve the corresponding trained model and its parameter settings.

In [None]:
searches[max_score_idx].best_estimator_

## 4-2. Implement manually

Here, we are going to implement k-fold ourselves. It is a straightforward algorithm with only three steps: 1) divide the data into k chunks, 2) choose one chunk as the test set and all the other chunks as the training set, 3) repeat step 2 k times, choosing a different chunk each time.

We will also mirror the interface of the scikit-learn class so that we can quickly test and compare!

In [None]:
class KFold_Manual():
    def __init__(self, n_splits=5, shuffle=False, random_state=RANDOM_SEED):
        return
        
    def split(self, X):
        return

The answer is as follows:

In [None]:
class KFold_Manual():
    def __init__(self, n_splits=5, shuffle=False, random_state=RANDOM_SEED):
        self.n_splits = n_splits
        self.shuffle = shuffle
        # store the argument, not the global constant
        self.random_state = random_state

    def split(self, X):
        # positional (integer) indices, so the results can be used with .iloc
        indices = np.arange(len(X))

        # shuffle with a seeded generator; np.random.shuffle works in place,
        # returns None, and takes no random_state argument
        if self.shuffle:
            rng = np.random.RandomState(self.random_state)
            indices = rng.permutation(indices)

        # split into n_splits chunks of nearly equal size
        split_indices = np.array_split(indices, self.n_splits)

        # each chunk serves as the test set once; the others form the training set
        results = []
        for i in range(self.n_splits):
            train = np.concatenate(
                [val for idx, val in enumerate(split_indices) if idx != i]
            )
            test = split_indices[i]
            results.append([train, test])

        return results

Now, let's copy and paste the code above and run it here!

In [None]:
kf = KFold_Manual()

In [None]:
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
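As a sanity check, the manual scheme can be compared with scikit-learn's **KFold**. The sketch below is self-contained: `manual_kfold_indices` is a hypothetical helper that repeats the same `np.array_split` logic in function form, and 208 is assumed to be the number of rows in the sonar data. Without shuffling, both approaches are expected to produce identical index arrays.

```python
import numpy as np
from sklearn.model_selection import KFold

def manual_kfold_indices(n_samples, n_splits):
    """Yield (train, test) positional index arrays, one pair per fold."""
    chunks = np.array_split(np.arange(n_samples), n_splits)
    for i in range(n_splits):
        train = np.concatenate([c for j, c in enumerate(chunks) if j != i])
        yield train, chunks[i]

n_samples, n_splits = 208, 5   # assumed row count of the sonar data
reference = KFold(n_splits=n_splits).split(np.zeros((n_samples, 1)))

# Compare fold by fold: both the training and the test indices must agree.
matches = [
    np.array_equal(tr, ref_tr) and np.array_equal(te, ref_te)
    for (tr, te), (ref_tr, ref_te)
    in zip(manual_kfold_indices(n_samples, n_splits), reference)
]
print(matches)
```

When the number of samples is not divisible by k, both `np.array_split` and **KFold** give the extra samples to the first folds, which is why the fold boundaries line up.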

# END OF LAB 4