# Optional: Scikit-learn primer.
In this additional assignment, you will learn to use the scikit-learn library. It is highly recommended to go through this notebook before starting with the final assignment.

$\newcommand{\q}[1]{\rightarrow \textbf{Question #1}}$
$\newcommand{\ex}[1]{\rightarrow \textbf{Exercise #1}}$

## Introduction
All algorithms in scikit-learn, both learning and pre-processing, are implemented with the same `fit`, `predict` and `transform` API. Once you have learned this API, you can use any algorithm without having to implement it yourself, and you can apply all of those algorithms to a given learning problem in the same way. The API also hides the complex optimization choices that have to be made; you can control these by changing the hyper-parameters. The effects of these choices are documented in the scikit-learn API documentation and tutorials.
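
As a rough illustration of this shared API, the sketch below fits one transformer and one classifier on Iris. The choice of `StandardScaler` and `KNeighborsClassifier`, and the hyper-parameter values, are assumptions made only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# A transformer: fit learns the statistics, transform applies them.
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)

# A classifier: hyper-parameters are passed to the constructor...
knn = KNeighborsClassifier(n_neighbors=5)
# ...and can be inspected or changed later through get_params / set_params.
print(knn.get_params()["n_neighbors"])
knn.set_params(n_neighbors=3)

# fit learns from the data, predict returns one label per sample.
knn.fit(X_scaled, y)
predictions = knn.predict(X_scaled)
print(predictions.shape)
```

Swapping in another estimator would only change the import and the constructor call; the `fit` and `predict` calls stay the same.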



## Dataset

In this assignment, we will use the Iris dataset to keep things simple.

In [1]:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

## Using classifiers
Using a classifier in scikit-learn consists of three steps:
1. Initializing the model. During this step, you can already set its hyper-parameters.
2. Fitting the model on the training data.
3. Making predictions and/or evaluating the model.
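
The three steps can be sketched in a few lines. The model and its hyper-parameters (`max_depth=3`, `random_state=0`) are arbitrary choices for illustration, not values prescribed by this assignment.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 1. Initialize the model with its hyper-parameters.
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 2. Fit the model on the training data.
model.fit(X, y)

# 3. Make predictions and evaluate them.
predictions = model.predict(X[:5])
train_accuracy = model.score(X, y)  # for classifiers, score returns the mean accuracy
print(predictions, train_accuracy)
```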

### Create
Creating models is very easy in scikit-learn. All you have to do is create a new instance of the model's class.

$ \ex{1} $ Extend the list of models with the `SVC` and `LogisticRegression` algorithms. Give the SVM a `poly` kernel. Also, give both algorithms a regularization constant `C=0.5`.

In [2]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

models = {
    "GaussianNB": GaussianNB(),
    "DummyClassifier": DummyClassifier(strategy="most_frequent"),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=None, min_samples_leaf=2),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=3, weights="distance"),
    # START ANSWER   
    "SVM": SVC(C=10, kernel="poly"),
    "LogisticRegression": LogisticRegression(penalty="l2", C=1, max_iter=1000),
    # END ANSWER
}

assert "GaussianNB" in models and isinstance(models["GaussianNB"], GaussianNB), "There is no GaussianNB in models"
assert "DecisionTreeClassifier" in models and isinstance(models["DecisionTreeClassifier"], DecisionTreeClassifier), "There is no DecisionTreeClassifier in models"
assert "KNeighborsClassifier" in models and isinstance(models["KNeighborsClassifier"], KNeighborsClassifier), "There is no KNeighborsClassifier in models"
assert "SVM" in models and isinstance(models["SVM"], SVC), "There is no SVC in models"
assert "LogisticRegression" in models and isinstance(models["LogisticRegression"], LogisticRegression), "There is no LogisticRegression in models"

### Fit
$ \ex{2} $ Fit each of your models on the entire training set by calling the `.fit` method of the model.

In [3]:
for name, model in models.items():
    # START ANSWER  
    model.fit(X, y)
    # END ANSWER

In [4]:
from sklearn.utils.validation import check_is_fitted

for model in models.values():
    check_is_fitted(model)

### Evaluate
The `sklearn.metrics` module has lots of metrics that can evaluate a model's predictions. Here is an example of how to calculate a model's F1 and accuracy score.

In [20]:
from sklearn.metrics import f1_score, accuracy_score

for name, model in models.items():
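    prediction = model.predict(X)
    # Note: scikit-learn metrics are defined as metric(y_true, y_pred). Accuracy is
    # symmetric, but weighted F1 is not, so the argument order used here affects its value.
    f1_score_value = f1_score(prediction, y, average='weighted')
    accuracy = accuracy_score(prediction, y)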
    print(name)
    print("- accuracy_score", accuracy)
    print("- f1_score", f1_score_value)

GaussianNB
- accuracy_score 0.9533333333333334
- f1_score 0.9533380004667132
DummyClassifier
- accuracy_score 0.3333333333333333
- f1_score 0.5
DecisionTreeClassifier
- accuracy_score 0.9733333333333334
- f1_score 0.973344004268374
KNeighborsClassifier
- accuracy_score 1.0
- f1_score 1.0
SVM
- accuracy_score 0.9866666666666667
- f1_score 0.9866720021341869
LogisticRegression
- accuracy_score 0.9733333333333334
- f1_score 0.973344004268374
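
Scikit-learn's metric functions take their arguments as `(y_true, y_pred)`. The sketch below uses made-up toy labels (an assumption for illustration) to show that accuracy does not depend on this order, while the weighted F1 does, because it weights each class by its support in the first argument.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (hypothetical): the "model" always predicts class 0.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0]

# Accuracy only counts matches, so the order does not matter.
acc = accuracy_score(y_true, y_pred)
acc_swapped = accuracy_score(y_pred, y_true)

# Weighted F1 weights each class by its support in the first argument,
# so swapping the arguments changes the result.
f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
f1_swapped = f1_score(y_pred, y_true, average="weighted", zero_division=0)

print(acc, acc_swapped)  # both 1/3
print(f1, f1_swapped)    # 1/6 versus 0.5
```

The swapped value of 0.5 is consistent with the F1 score reported for the `DummyClassifier` above, where the predictions are passed as the first argument.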


## Data splitting
Models usually achieve a high evaluation score on the training set. However, this says nothing about how well they generalize to unseen data. We therefore usually evaluate models using either a test/validation split or k-fold validation. Scikit-learn makes our life easier here as well by implementing both for us.

### Test/validation split
We can split a dataset into training and test sets using the `train_test_split` function. The `test_size` parameter indicates the fraction of the data that should go to the test set. The `stratify` parameter tells the function to take the distribution of the target labels `y` into account, which ensures that the training and test sets have the same distribution of target labels.
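
To see the effect of `stratify`, the following sketch counts the labels in each split. The values `test_size=0.2` and `random_state=0` are chosen only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples; stratify=y keeps the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Iris has 50 samples per class, so each split stays perfectly balanced.
print(np.bincount(y_train))  # [40 40 40]
print(np.bincount(y_test))   # [10 10 10]
```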

$ \ex{3} $ The data has already been split into a training and a test set. Fit the models using the training set and evaluate them using the test set.

The result on the test set should roughly be equal to:

| Model                  | Accuracy |    F1 |
|:-----------------------|---------:|------:|
| GaussianNB             |     0.86 |  0.86 |
| DummyClassifier        |     0.33 |   0.5 |
| DecisionTreeClassifier |    0.933 | 0.934 |
| KNeighborsClassifier   |        1 |     1 |
| SVM                    |        1 |     1 |
| LogisticRegression     |    0.933 | 0.934 |

In [22]:
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, shuffle=True, stratify=y)

# START ANSWER 
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    f = f1_score(pred, y_test, average='weighted')
    acc = accuracy_score(pred, y_test)
    print(name, " F1: ", f, " Accuracy: ", acc)
# END ANSWER 

GaussianNB  F1:  0.8666666666666667  Accuracy:  0.8666666666666667
DummyClassifier  F1:  0.5  Accuracy:  0.3333333333333333
DecisionTreeClassifier  F1:  0.9340067340067341  Accuracy:  0.9333333333333333
KNeighborsClassifier  F1:  1.0  Accuracy:  1.0
SVM  F1:  1.0  Accuracy:  1.0
LogisticRegression  F1:  0.9340067340067341  Accuracy:  0.9333333333333333


### K-fold validation
Setting up k-fold validation is a bit more work, but we can do it as follows:

In [8]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate

def k_fold_fit_and_evaluate(X, y, model, scoring_method, n_splits=5):
    # define evaluation procedure
    cv = KFold(n_splits=n_splits, random_state=42, shuffle=True)
    # evaluate model
    scores = cross_validate(model, X, y, scoring=scoring_method, cv=cv, n_jobs=-1)
    return scores["test_score"]

Note: `cross_validate` expects a scorer for its `scoring` argument. We can create one (here called `scoring_method`) using the `make_scorer` function from scikit-learn.
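
As a sketch of how scorers can be built and combined, the example below passes a dict of scorers to `cross_validate`, so that several metrics are evaluated in one pass. The dict keys `"f1"` and `"accuracy"` and the choice of `GaussianNB` are assumptions for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# make_scorer forwards extra keyword arguments to the metric, so no lambda is needed.
f1_weighted = make_scorer(f1_score, average="weighted")

cv = KFold(n_splits=5, shuffle=True, random_state=42)
# A dict of scorers evaluates several metrics in one pass;
# "accuracy" is one of scikit-learn's predefined scorer names.
scores = cross_validate(
    GaussianNB(), X, y, cv=cv,
    scoring={"f1": f1_weighted, "accuracy": "accuracy"},
)

# Results are stored under "test_<name>", with one value per fold.
print(scores["test_f1"].mean(), scores["test_f1"].std())
print(scores["test_accuracy"].mean(), scores["test_accuracy"].std())
```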

$ \ex{4} $ Use the example below to calculate the mean and standard deviation of both the F1 score and the accuracy.

The result using k-fold validation should roughly be equal to:


| Model                  | mean F1 | std F1 | mean Accuracy | std Accuracy |
|:-----------------------|--------:|-------:|--------------:|-------------:|
| GaussianNB             |   0.959 | 0.0249 |        0.9599 |        0.024 |
| DummyClassifier        |   0.107 | 0.0187 |        0.1079 |       0.0186 |
| DecisionTreeClassifier |   0.946 | 0.0338 |       0.94655 |       0.0338 |
| KNeighborsClassifier   |   0.966 | 0.0214 |        0.9663 |      0.02144 |
| SVM                    |   0.959 | 0.0251 |        0.9596 |      0.02516 |
| LogisticRegression     |   0.973 | 0.0249 |        0.9732 |     0.024955 |

In [23]:
from sklearn.metrics import make_scorer
import numpy as np
n_splits = 5


# make_scorer calls the metric as score_func(y_true, y_pred)
scoring_method = make_scorer(lambda y_true, y_pred: f1_score(y_true, y_pred, average="weighted"))
# START ANSWER 
score = []
for name, model in models.items():
    # scoring_method computes the weighted F1, so these are per-fold F1 scores
    tmp_f1 = k_fold_fit_and_evaluate(X_train, y_train, model, scoring_method)
    m_f1 = np.mean(tmp_f1)
    std_f1 = np.std(tmp_f1)
    print(name, " | mean F1: ", m_f1, " | std F1: ", std_f1)
    score.append(tmp_f1)
# END ANSWER

for name, model in models.items():
    metrics = k_fold_fit_and_evaluate(X, y, model, scoring_method, n_splits=n_splits)
    # START ANSWER
    # END ANSWER

GaussianNB  | mean F1:  0.9629234305808094  | std F1:  0.00015740809211831994
DummyClassifier  | mean F1:  0.12452140452140455  | std F1:  0.021856661856661862
DecisionTreeClassifier  | mean F1:  0.9467035109140373  | std F1:  0.051951899191208156
KNeighborsClassifier  | mean F1:  0.9554131425149658  | std F1:  0.01500162276153724
SVM  | mean F1:  0.9628725317287408  | std F1:  0.02358760709406426
LogisticRegression  | mean F1:  0.9629234305808094  | std F1:  0.00015740809211831994


## Grid search
Scikit-learn also makes it easier to tune hyper-parameters using `GridSearchCV`.
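
A minimal sketch of a grid search is shown below. The parameter grid is a hypothetical one chosen for illustration and is not the intended answer to the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical grid: every combination (3 x 2 = 6) is cross-validated.
param_grid = {
    "n_neighbors": [1, 3, 5],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
print(len(search.cv_results_["params"]))  # 6 candidate settings
```

After fitting, `best_estimator_` holds a model refitted on all of the data with the best parameters found.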

$ \ex{5} $ Extend the `model_parameters` dict by specifying a grid search for the `KNeighborsClassifier`, `SVM` and the `LogisticRegression` models.

In [49]:
from sklearn.model_selection import GridSearchCV

random_state = 42
n_splits = 5
# make_scorer calls the metric as score_func(y_true, y_pred)
scoring_method = make_scorer(lambda y_true, y_pred: f1_score(y_true, y_pred, average="weighted"))

model_parameters = {
    "GaussianNB": {
    
    },
    "DummyClassifier": {
        
    },
    "DecisionTreeClassifier": {
        'random_state': [random_state],
        'max_depth': [None, 2, 5, 10]
    },
    # START ANSWER
    'KNeighborsClassifier': {
        'algorithm': ['ball_tree', 'kd_tree', 'brute', 'auto'],
        'leaf_size': [2, 11, 5, 9, 25, 3, 42],
        'p': [0.9, 1, 2, 3, 4, 2.5]
    },
    'SVM': {
        'C': [0.1, 1, 10, 100, 1000],
        'gamma': [1, 0.5, 0.1, 0.05, 0.01, 0.001, 0.0005, 0.0001, 'scale'],
        'kernel': ['linear', 'sigmoid', 'poly', 'rbf']
    },
    'LogisticRegression':{
        
    }
    # END ANSWER
}


for model_name, parameters in model_parameters.items():
    model = models[model_name]
    
    cv = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    grid_search = GridSearchCV(model, parameters, cv=cv, n_jobs=-1, verbose=False, scoring=scoring_method).fit(X, y)
    
    best_model = grid_search.best_estimator_
    best_score = grid_search.best_score_
    best_params = grid_search.best_params_
    
    print(model_name)
    print("- best_score =", best_score)
    print("best parameters:")
    for k,v in best_params.items():
        print("-", k, v)


GaussianNB
- best_score = 0.9599161225248183
best parameters:
DummyClassifier
- best_score = 0.10791990370937739
best parameters:
DecisionTreeClassifier
- best_score = 0.9465598893859765
best parameters:
- max_depth None
- random_state 42
KNeighborsClassifier
- best_score = 0.9663617061759476
best parameters:
- algorithm ball_tree
- leaf_size 2
- p 1
SVM
- best_score = 0.9799785345717235
best parameters:
- C 10
- gamma 0.05
- kernel rbf
LogisticRegression
- best_score = 0.9732912280701754
best parameters:


## Using transformers
Transformers have an API that is similar to, but slightly different from, that of the models. Transformers still have the `fit` method. In the `StandardScaler`, for example, `fit` is used to find the mean and standard deviation of each feature (stored as `mean_` and `scale_`). However, the `predict` method is replaced by the `transform` method. Scikit-learn does this to make it clear to users that this is not a model but a feature transformer.

In [50]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X_train)

scaler.mean_, scaler.scale_

(array([5.82962963, 3.05703704, 3.75111111, 1.20518519]),
 array([0.82210877, 0.44297659, 1.74999965, 0.763842  ]))

After fitting the transformer, you can call the `transform` method, and it will transform the input features based on the parameters it found during the last `fit` call.

In [51]:
X_train_transformed = scaler.transform(X_train)
print("X_train")
print("mean", X_train.mean())
print("std", X_train.std())
print()
print("X_train_transformed")
print("mean", X_train_transformed.mean())
print("std", X_train_transformed.std())

X_train
mean 3.460740740740741
std 1.9662465199534571

X_train_transformed
mean 6.579099405186112e-17
std 1.0
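
The values printed above are taken over all entries of the array. As a complementary sketch, the example below checks the statistics per feature with `axis=0` and applies the scaler fitted on the training set to the test set. The variable names `X_train_scaled` and `X_test_scaled` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

# Fit on the training data only, then reuse the same statistics for the test data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# axis=0 gives one value per feature instead of one value for the whole array.
print(X_train_scaled.mean(axis=0))  # each close to 0
print(X_train_scaled.std(axis=0))   # each close to 1
# The test set was scaled with the training statistics, so it is only roughly standardized.
print(X_test_scaled.mean(axis=0))
```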


$ \ex{6} $ First, transform the dataset using the `Normalizer` transformer. Then fit and evaluate each model using the transformed features.

In [71]:
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, shuffle=True, stratify=y)

scaler = preprocessing.Normalizer()

# START ANSWER
x_tr = scaler.fit_transform(X_train)  # Normalizer is stateless, so fit learns nothing here
x_te = scaler.transform(X_test)

for name, model in models.items():
    model.fit(x_tr, y_train)
    pred = model.predict(x_te)
    acc = accuracy_score(pred, y_test)
    print(name, ": ", acc)
# END ANSWER 

GaussianNB :  1.0
DummyClassifier :  0.3333333333333333
DecisionTreeClassifier :  0.8
KNeighborsClassifier :  1.0
SVM :  1.0
LogisticRegression :  0.9333333333333333
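
Unlike `StandardScaler`, which rescales each feature (column), the `Normalizer` rescales each sample (row) to unit norm. The sketch below shows this on two made-up samples; `X_toy` is a hypothetical array used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two hypothetical samples with very different magnitudes.
X_toy = np.array([[3.0, 4.0],
                  [30.0, 40.0]])

# Normalizer is stateless: fit learns nothing, so fit_transform only rescales rows.
X_norm = Normalizer(norm="l2").fit_transform(X_toy)

print(X_norm)                          # both rows become [0.6, 0.8]
print(np.linalg.norm(X_norm, axis=1))  # [1. 1.]
```

After normalization, only the direction of each sample is kept, so two samples that differ only in overall magnitude become identical.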
