# Session 3

## Overview

At the end of the module, you will use cross-validation to select the right **hyperparameters** for a decision tree classifier.

You need to compare results of your classifier and tune it correctly based on error measurements.

This module assumes you:

* Have a goal
* Have a clean data set (already provided and imported)
* Have a model ([decision tree classifier](https://en.wikipedia.org/wiki/Decision_tree_learning) in this case)

The **goal** of the created model is to classify accurately which human activity is performed depending on the measurements (information from the sensors).

In [None]:
import cross_functions as cf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeClassifier as cart, export_graphviz
from IPython.display import Image

## Recreate the previous data set and selected features

The next two code lines recover the features used for the decision tree classifier (`sel_feat`) and the training dataset (`explore`). Remember to focus on Subject 1 to remove the inter-subject variability:

In [None]:
sel_feat = [9, 457, 448, 56, 77]
explore = cf.Xtrain[cf.Xtrain.subject == '1']

## Decision tree validation

`sel_feat` is the output of an unsupervised learning method ([PCA](https://en.wikipedia.org/wiki/Principal_component_analysis)) applied to the dataset. Validate the resulting descriptive model ([decision tree classifier](https://en.wikipedia.org/wiki/Decision_tree_learning)) to:

- Evaluate whether the model behaves accurately in an unknown environment.
- Generate solutions for overfitting in case it happens.


To do this, you have to:
1. **Sample data:** use random sampling to set aside a sample of **20%** of the data.
2. **Train model:** train the decision tree model with the remaining **80%**.
3. **Test accuracy:** run accuracy tests to observe its performance.
4. **Cross-validate:** to achieve the two previously mentioned goals.

## Data sampling

Create a `validation set` from the data. Remember to remove the validation set (`test_x` and `test_y`) from the rest of the data (`train_x` and `train_y`).


**Random Sampling** is the best sampling approach for obtaining the validation set since this is a classification problem.


`sklearn` is a Python module that already has functions for this task. Use the function `train_test_split` as follows:

```
train_x, test_x, train_y, test_y = train_test_split(DATA_FEATURES, DATA_LABELS, 
                                                    test_size=RULE_OF_THUMB, random_state=11)

```
**Note:** As the data set only covers subject 1, you cannot use the test set provided in the HAR data files zip.

In [None]:
train_x, test_x, train_y, test_y = \
    train_test_split(explore.iloc[:,sel_feat], explore.activity, test_size=0.2, random_state=11)

The test sample should be smaller than the train sample (around one-fourth of its size). Run the following cell to verify this:

In [None]:
print('train sample size - rows: %s, columns: %s' % train_x.shape)
print('test  sample size - rows: %s,  columns: %s' % test_x.shape)
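
The size check above can also be tried on toy data. The following is a minimal sketch, assuming a hypothetical 100-row frame named `toy` that stands in for `explore`; the column names and labels are made up for illustration. It shows the 80/20 split, the resulting one-fourth ratio, and that no row ends up in both samples.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `explore`: 100 rows, 3 features, 2 activities
rng = np.random.default_rng(11)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=['f0', 'f1', 'f2'])
toy['activity'] = rng.choice(['WALKING', 'SITTING'], size=100)

toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy[['f0', 'f1', 'f2']], toy['activity'], test_size=0.2, random_state=11)

print(toy_train_x.shape, toy_test_x.shape)  # (80, 3) (20, 3)

# The test sample is one-fourth the size of the train sample
ratio = len(toy_test_x) / len(toy_train_x)
print(ratio)  # 0.25

# No row index appears in both samples
overlap = toy_train_x.index.intersection(toy_test_x.index)
print(len(overlap))  # 0
```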

## Model training

After separating the dataset, train the model with the remaining 80%.

For the training use the following snippet of code as a base. This code snippet is a modification of the code from the previous session, only now you need to use `train_x` and `train_y`. 

```
cart_tot = cart(min_samples_leaf=39)
cart_tot = cart_tot.fit(TRAIN_DATA, TRAIN_ANSWERS)
pred_tot = cart_tot.predict(TEST_DATA)
```

In [None]:
cart_tot = cart(min_samples_leaf=39)
cart_tot = cart_tot.fit(train_x, train_y)
pred_tot = cart_tot.predict(test_x)

## Accuracy testing
Run the following cell to see the trained model's accuracy. Note that it evaluates the accuracy using the test sets `test_y` and `test_x`. Once this is done, you can say you have **validated** the model.

In [None]:
cf.print_accuracy(test_y, pred_tot)
graph = cf.print_tree_graph(cart_tot, cf.features, cf.activity, sel_feat)
Image(graph.create_png())

This outcome, as expected, is lower than if you evaluated the accuracy with the training datasets. Nevertheless, using the test datasets gives a better approximation of how the model behaves with new observations.
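
The gap between training and test accuracy can be illustrated on synthetic data. This is a sketch under stated assumptions: the data comes from `make_classification` rather than the HAR files, and the tree is left unpruned (no `min_samples_leaf`) so that the gap is easy to see. An unpruned tree typically memorizes the training rows, so its training accuracy is an optimistic estimate.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the sensor data (assumption, not the HAR data set)
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=11)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=11)

# Unpruned tree: free to grow until every training row is classified correctly
tree = DecisionTreeClassifier(random_state=11).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, tree.predict(X_tr))
test_acc = accuracy_score(y_te, tree.predict(X_te))
print('train accuracy: %s' % train_acc)
print('test  accuracy: %s' % test_acc)
```

Only the second number says something about observations the model has not seen.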

By testing the model on all available data, without ever testing on the same observations it was trained on, you can obtain a better estimate of the out-of-sample error.

## Cross-validation
**Cross-validation** is a technique that can accomplish this.

Cross-validation does not allow the random test samples to share observations, so pay extra attention to keeping the random samples disjoint.

### Cross-validation sampling

Use the function `compute_disjunctive_random_splits` to do this. This function is very similar to `train_test_split`, except that it iterates over the number of times you want to split the data frame and returns, for each iteration, the result of `train_test_split` for a different disjoint random sample.

Run the following cell to look at the results.

In [None]:
masks = cf.compute_disjunctive_random_splits(explore.iloc[:,sel_feat], explore.activity, 10)
mask = 0
print('number of `masks`: %s\n' % len(masks))
print('shape of `train_x` inside layer #%s of `masks`: %s' % (mask, masks[mask][0].shape))
print('shape of `test_x`  inside layer #%s of `masks`: %s' % (mask, masks[mask][1].shape))
print('shape of `train_y` inside layer #%s of `masks`: %s'% (mask, masks[mask][2].shape))
print('shape of `test_y`  inside layer #%s of `masks`: %s'% (mask, masks[mask][3].shape))
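
The source of `cf.compute_disjunctive_random_splits` is not shown here, so the following is only a hypothetical sketch of how such a helper could be written. It assumes shuffled `KFold` from `sklearn` as the splitting mechanism, and the name `disjoint_random_splits` is made up for illustration. Each returned tuple has the same `(train_x, test_x, train_y, test_y)` layout as the elements of `masks`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def disjoint_random_splits(features, labels, n_splits, seed=11):
    """Hypothetical sketch, not the actual `cf` implementation."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    splits = []
    for train_idx, test_idx in folds.split(features):
        # Same order as train_test_split: train_x, test_x, train_y, test_y
        splits.append((features.iloc[train_idx], features.iloc[test_idx],
                       labels.iloc[train_idx], labels.iloc[test_idx]))
    return splits


# Toy data: 20 rows, 2 features
toy_x = pd.DataFrame(np.arange(40).reshape(20, 2), columns=['a', 'b'])
toy_y = pd.Series(np.arange(20) % 2)
toy_masks = disjoint_random_splits(toy_x, toy_y, 10)

# Every row lands in exactly one test fold, so the test samples are disjoint
all_test_rows = np.concatenate([m[1].index.to_numpy() for m in toy_masks])
print(len(toy_masks), sorted(all_test_rows.tolist()) == list(range(20)))
```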

### Cross-validation execution

Now build a way to re-sample the data set multiple times with different random orders:

**Constraints**

- Use a different random sample for both training and testing during each iteration.
- Ensure test samples do not contain observations used for training.
- Use 10% for validation testing as a rule of thumb for the test-train ratio.
- Keep random samples disjoint.

```
# declaration of global variables


for iteration in range(0, 10):
    # data sampling
    
    # Model definition, fitting, and prediction
    
    accuracy = np.round(accuracy_score(y_true=test_y, y_pred=pred_tot, normalize=True), 4)
    print('Accuracy for iteration #%s: %s' % (iteration, accuracy))
    global_accuracy.append(accuracy)

print('Global accuracy: %s' % np.mean(global_accuracy))
```

In [None]:
global_accuracy = []

masks = cf.compute_disjunctive_random_splits(explore.iloc[:,sel_feat], explore.activity, 10)

for iteration in range(0, 10):
    train_x, test_x, train_y, test_y = masks[iteration]
    
    cart_tot = cart(min_samples_leaf=39)
    cart_tot = cart_tot.fit(train_x, train_y)
    pred_tot = cart_tot.predict(test_x)
    
    accuracy = np.round(accuracy_score(y_true=test_y, y_pred=pred_tot, normalize=True), 4)
    print('Accuracy for iteration #%s: %s' % (iteration, accuracy))
    global_accuracy.append(accuracy)

print('Global accuracy: %s' % np.mean(global_accuracy))
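
For comparison, `sklearn` can run the same kind of loop in a single call. The sketch below is an alternative to the manual loop, not the method used in this session. It assumes synthetic data from `make_classification` in place of the HAR data, and uses shuffled `KFold` with 10 folds so that each test fold holds about 10% of the rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the sensor data (assumption, not the HAR data set)
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=11)

# 10 disjoint test folds, each about 10% of the data
folds = KFold(n_splits=10, shuffle=True, random_state=11)
model = DecisionTreeClassifier(min_samples_leaf=39, random_state=11)

# One accuracy value per fold; the model is refit from scratch for each fold
scores = cross_val_score(model, X, y, cv=folds, scoring='accuracy')
print('Accuracy per fold: %s' % scores.round(4))
print('Global accuracy: %s' % scores.mean())
```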

## Extending the use of cross-validation

The model, `DecisionTreeClassifier`, has many **hyperparameters** to choose from.

Look at the [DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) documentation. The following are some hyperparameters you are using and can modify:

- **min_samples_leaf**: already being used, for manual pruning.
- **criterion**: The function to measure the quality of a split. Supported methods are “**gini**” for the Gini impurity and “**entropy**” for the information gain.
- **max_features**: maximum number of features to consider in each split.
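
To make the two `criterion` options more concrete, here is a small sketch of the textbook definitions of Gini impurity and entropy (in bits) for the labels that reach a node. The helper names `gini_impurity` and `entropy` are made up for illustration and are not part of `sklearn`. Both measures are zero for a pure node and largest when the classes are evenly mixed.

```python
import numpy as np


def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


pure = ['WALKING', 'WALKING', 'WALKING', 'WALKING']
mixed = ['WALKING', 'WALKING', 'SITTING', 'SITTING']

print(gini_impurity(mixed))  # 0.5
print(entropy(mixed))        # 1.0
print(gini_impurity(pure) == 0, entropy(pure) == 0)  # True True
```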

Since cross-validation helps us to compare subtle differences between models and test which one is better, the next step is to tune and test the best combination of hyperparameters.

`sklearn` once again has made things easier: the class `ParameterGrid` generates every possible combination of the values you provide and gives you a way to iterate through those combinations.
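
As a quick illustration of what `ParameterGrid` produces, the sketch below builds a grid with two values for each of three hyperparameters, which yields 2 x 2 x 2 = 8 combinations. Each combination is a plain dictionary that can be unpacked into the classifier with `**params`.

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({'max_features': [5, 4],
                      'min_samples_leaf': [39, 13],
                      'criterion': ['gini', 'entropy']})

print(len(grid))  # 8

# Each item is a dict with one value per hyperparameter
for params in grid:
    print(params)
```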

Enhance the last experiment using `ParameterGrid` to fit `DecisionTreeClassifier` with a new set of **hyperparameters** in each set of iterations.

**Constraints**

- Use multiple sets of iterations with different random samples for both training and testing.
- Ensure test samples do not contain observations used for training.
- Use 10% for validation testing as a rule of thumb for the test-train ratio.
- Run a separate cross-validation for each combination of hyperparameters **exclusively**.
- Store all the performance outputs of each combination.
- Compare the performance output means for each combination.
- Print the set of hyperparameters with the best performance.
- Set **min_samples_leaf** to: `39, 13`
- Set **criterion** to: `'gini', 'entropy'`
- Set **max_features** to: `5, 4`

```
# declaration of global variables
hyperparameters = [{'max_features': [SET],
                    'min_samples_leaf': [SET],
                    'criterion': [SET]}]


for params in ParameterGrid(hyperparameters): 
    # declaration of local variables

    for iteration in range(0, 10):
        # data sampling

        # Model definition, fitting, and prediction
        cart_tot = cart(**params)
        cart_tot = cart_tot.fit(TRAIN_DATA, TRAIN_ANSWERS)
        pred_tot = cart_tot.predict(TEST_DATA)

        accuracy = np.round(accuracy_score(y_true=test_y, y_pred=pred_tot, normalize=True), 4)
        local_accuracy.append(accuracy)
    
    global_accuracy.append(np.mean(local_accuracy))
    print('local accuracy for combination %s: #%s' % (params, np.mean(local_accuracy)))

print(ParameterGrid(hyperparameters)[global_accuracy.index(max(global_accuracy))])
```

In [None]:
# declaration of global variables
hyperparameters = [{'max_features': [5, 4],
                    'min_samples_leaf': [39, 13],
                    'criterion': ['gini', 'entropy']}]
global_accuracy = []
masks = cf.compute_disjunctive_random_splits(explore.iloc[:,sel_feat], explore.activity, 10)

for params in ParameterGrid(hyperparameters): 
    # declaration of local variables
    local_accuracy = []
    for iteration in range(0, 10):
        # data sampling
        train_x, test_x, train_y, test_y = masks[iteration]
        
        # Model definition, fitting, and prediction
        cart_tot = cart(**params)
        cart_tot = cart_tot.fit(train_x, train_y)
        pred_tot = cart_tot.predict(test_x)

        accuracy = np.round(accuracy_score(y_true=test_y, y_pred=pred_tot, normalize=True), 4)
        local_accuracy.append(accuracy)
    
    global_accuracy.append(np.mean(local_accuracy))
    print('local accuracy for combination %s: #%s' % (params, np.mean(local_accuracy)))

print('\n\nBest combination: %s' % ParameterGrid(hyperparameters)[global_accuracy.index(max(global_accuracy))] + 
      '\nWith an average accuracy of: %s' % max(global_accuracy))
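
The nested loops above can also be expressed with `GridSearchCV`, which combines the parameter grid and the cross-validation in one object. This is an alternative sketch rather than the approach used in this session, and it assumes synthetic data from `make_classification` instead of the HAR data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the sensor data (assumption, not the HAR data set)
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=11)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=11),
    param_grid={'max_features': [5, 4],
                'min_samples_leaf': [39, 13],
                'criterion': ['gini', 'entropy']},
    cv=KFold(n_splits=10, shuffle=True, random_state=11),
    scoring='accuracy')
search.fit(X, y)

# Combination with the highest mean accuracy across the 10 folds
print('Best combination: %s' % search.best_params_)
print('With an average accuracy of: %s' % search.best_score_)
```

Every combination is scored on the same 10 folds, which keeps the comparison between combinations fair.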

## Conclusions

You satisfactorily performed hyperparameter tuning for a model using cross-validation, thus avoiding rules of thumb for those hyperparameters and solving a common, non-trivial problem in day-to-day classification work.

Regardless, you must remember:

1. This is still a descriptive analysis. To test all combinations, you would need a parameter grid of much higher dimensions.
2. You used only subject 1, hence it is not guaranteed that any model built using the restricted data will perform well when classifying activities done by different subjects.