<b><font size="6"><u>Neural Networks</u></font></b>

In this notebook we are going to check some of the most important parameters that can influence the performance of a neural network.

# <font color='#BFD72F'>Contents</font> <a class="anchor" id="toc"></a>

* [1 - Initial Steps](#first-bullet)
    * [1.1 - Connect to Google Colab](#first-bullet)
    * [1.2 - Importing Libraries and Data](#import)
    * [1.3 - Data Understanding](#understand)
    * [1.4 - Split the Data](#split)
    * [1.5 - Preparing Functions](#func)
* [2 - Neural Networks](#nn)
    * [2.1 - The hidden layer size](#hidden)
    * [2.2 - The activation function](#activation)
    * [2.3 - The solver](#solver)
    * [2.4 - The learning rate initialization](#lr_init)
    * [2.5 - The learning rate](#lr)
    * [2.6 - The maximum iterations](#max_iter)
    * [2.7 - Other parameters](#other)
* [3 - RandomizedSearch](#randsearch)
* [4 - Extra: Pipeline](#pipe)

# <font color='#BFD72F'>1. Initial Steps</font> <a class="anchor" id="first-bullet"></a>

## <font color='#BFD72F'>1.1. Connect to Google Colab</font> <a class="anchor" id="first-bullet"></a>
[Back to Contents](#toc)

**Step 1 -** Connect the Google Colab notebook to your Google Drive. Before running the code below, make sure this notebook is stored in the folder given by the variable `path`.<br>

In [None]:
# Connect Google Colab to Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
path = '/content/drive/MyDrive/Colab Notebooks/DM2/LAB05 - Neural Networks/'

## <font color='#BFD72F'>1.2. Importing Libraries and Data</font><a class="anchor" id="import"></a>
[Back to Contents](#toc)

**Step 2 -** Import the needed libraries.

In [None]:
import time
from sklearn.neural_network import MLPClassifier
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV
import numpy as np

import warnings
warnings.filterwarnings('ignore')

**Step 3 -** Import the data that is going to be used into `pandas` dataframes.

**Step 3.1 -** Import and check the Insurance dataset.

In [None]:
insurance = pd.read_csv(path + 'data/insurance.csv')

# drop the unneeded first column
insurance.drop(columns="Unnamed: 0", inplace=True)

#take a look at the data
insurance

<font color='orange'>____GOAL____  : </font> Predict whether a customer will upgrade their health plan, using the features available for that customer.

The original dataset contains 2500 rows of hospitalization data that an insurance company is analysing, where the Insurance charges are given against the following attributes of the insured: *Age*, *Sex*, *BMI*, *Number of Children*, *Smoker*, *Region*, etc.
The categorical variables have already been encoded and converted into dummy variables for you.


`Insured ID`<br>
`Year_Birth` -  Insurance contractor year of birth<br>
`Gender` - (dummies) Insurance contractor gender, female / male.<br>
`Region` - (dummies) The beneficiary's residential area in the US, northeast, southeast, southwest, northwest.<br>
`Marital Status` - (dummies) Insurance contractor marital status.<br>
`Smoker` - (dummies) Smoker / No smoker.<br>
`Income` - Insurance contractor income.<br>
`BMI` - Body mass index<br>
`BSA` - body surface area.<br>
`Insured_Satisfaction` - Insured satisfaction regarding insurance assistance/services covered during hospitalization.<br>
`Expenses` - Individual medical costs billed by health insurance.<br>
`Expenses in percentage (%)` by categories (Treatment / Medication / Medical_Assistance / Exams / Ambulance_Transport).<br>
`Expenses coverage percentage (%)` (Insurance_Coverage / Patient_Coverage).<br>
`Plan_Option` - Type of insurance plan.<br>
`Upgrade_Health_Plan` - Whether the customer upgraded the health insurance plan
<- <font color='orange'> **Dependent Variable / Target** </font>

## <font color='#BFD72F'> 1.3. Data Understanding</font> <a class="anchor" id="understand"></a>
[Back to Contents](#toc)

**Step 4 -** Use the method `info()` to look at datatypes and missing values.

In [None]:
insurance.info()

**Step 5 -** Use the method `describe()` to look at the descriptive statistics of each one of your variables.

In [None]:
insurance.describe().T

**Step 6 -** Use the method `value_counts()` to look at the class distribution of the dependent variable.

In [None]:
insurance['Upgrade_Health_Plan'].value_counts()

## <font color='#BFD72F'> 1.4. Split the Data</font> <a class="anchor" id="split"></a>
[Back to Contents](#toc)

**Step 7** - Define the independent variables as __X__ and the dependent variable as __y__.

In [None]:
X = insurance.loc[:, insurance.columns != 'Upgrade_Health_Plan']
y = insurance['Upgrade_Health_Plan']

**Step 8** - By using the method `train_test_split`, split the data into X_train, X_test, y_train and y_test, defining `test_size` as 30%, `random_state` equal to 150 and `stratify` by the target.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.3,
                                                    random_state = 150,
                                                    shuffle = True,
                                                    stratify = y)

## <font color='#BFD72F'> 1.5. Preparing Functions</font> <a class="anchor" id="func"></a>
[Back to Contents](#toc)

**Step 9** - Create a function named __avg_score__ that will return the average score value for the train and the test set, applying a certain model (in this notebook we will be using neural networks) and using K-fold cross-validation. It should also measure the model's fitting time and the number of iterations needed by the model.<br>
The function will receive as parameters the model and the number of folds to be used.

In [None]:
def avg_score(model, number_splits):

    # create lists to store the results from the different neural networks
    score_train = []
    score_test = []
    timer = []
    n_iter = []

    # apply kfold with the pre-defined number_splits
    kf = KFold(n_splits=number_splits)

    for train_index, test_index in kf.split(X):

        # get the indexes of the observations assigned for each partition
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        #---> start counting time
        begin = time.perf_counter()

        # fit the model to the data
        model.fit(X_train, y_train)

        #---> finish counting time
        end = time.perf_counter()

        # check the mean accuracy for the train
        value_train = model.score(X_train, y_train)
        # check the mean accuracy for the test
        value_test = model.score(X_test,y_test)

        # append the accuracies, the time and the number of iterations in the corresponding list
        score_train.append(value_train)
        score_test.append(value_test)
        timer.append(end-begin)
        n_iter.append(model.n_iter_)

    # calculate the average and the std for each measure (accuracy, time and number of iterations)
    ### AVG
    avg_time = round(np.mean(timer),3)
    avg_train = round(np.mean(score_train),3)
    avg_test = round(np.mean(score_test),3)
    avg_iter = round(np.mean(n_iter),1)
    ### STD
    std_time = round(np.std(timer),2)
    std_train = round(np.std(score_train),2)
    std_test = round(np.std(score_test),2)
    std_iter = round(np.std(n_iter),1)

    return str(avg_time) + '+/-' + str(std_time), str(avg_train) + '+/-' + str(std_train),\
str(avg_test) + '+/-' + str(std_test), str(avg_iter) + '+/-' + str(std_iter)

**Step 10** - Create a function named __show_results__ that will return the average score for the train and test dataset (returned from the function __avg_score__) for several given models.

In [None]:
def show_results(df, *args, number_splits):
    """
    Receive an empty dataframe and the different models, call avg_score for each
    model and store the results in the corresponding row of the dataframe.
    """
    # for each model passed as argument (rows are filled in the same order)
    for row, model in enumerate(args):
        # obtain the results provided by avg_score
        fit_time, avg_train, avg_test, avg_iter = avg_score(model, number_splits)
        # store the results in the right row
        df.iloc[row] = fit_time, avg_train, avg_test, avg_iter
    return df

# <font color='#BFD72F'>2. Neural Networks</font> <a class="anchor" id="nn"></a>
[Back to Contents](#toc)

A neural network is a type of machine learning model loosely inspired by the way the human brain works.<br>
It can be used for a wide range of tasks, such as recognizing images, understanding natural language, playing games, making medical diagnoses, etc.


<font size=6>The Structure of Neural Network</font>

Neural Networks are composed of layers of nodes, or "neurons," each of which performs a simple computation and passes the results to the next layer.<br>

__Input Layer__: This is the initial data for the network. Each node in this layer represents a feature of the input data.<br>

__Hidden Layer__: These are the layers between the input and output layer. They are referred to as "hidden" because we don't see their input-output relationships directly from the data; they are internally computed. The nodes in these layers apply weights to the inputs and pass them through an `activation function`, which decides whether a neuron should be activated based on the weighted sum.<br>

__Output Layer__: The output layer produces the final predictions or classifications made by the network. The structure of the output layer depends on the type of problem. For instance, for a __binary classification__ problem, the output layer would have one neuron with a `sigmoid activation function` to predict two classes.
For a __multi-class problem__, the output layer would have as many neurons as the number of classes with a `softmax activation function` to predict probabilities of different classes.
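
The two output-layer choices described above can be sketched in plain Python. This is only an illustration under assumed values: the scores `0.8` and `[2.0, 1.0, 0.1]` are made up, and the helper names `sigmoid` and `softmax` are ours, not part of sklearn.

```python
import math

def sigmoid(z):
    """Map a single score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Map a list of scores to probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Binary problem: one output neuron gives the probability of the positive class
p_positive = sigmoid(0.8)            # hypothetical score of the output neuron
binary_prediction = int(p_positive >= 0.5)

# Multi-class problem: one output neuron per class
class_probs = softmax([2.0, 1.0, 0.1])   # hypothetical scores for three classes
multiclass_prediction = class_probs.index(max(class_probs))

print(round(p_positive, 3), binary_prediction)
print([round(p, 3) for p in class_probs], multiclass_prediction)
```

In both cases the predicted class is the one with the highest probability; the sigmoid case only needs one neuron because the probability of the other class is `1 - p_positive`.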


A neural network learns by adjusting the weights applied in the hidden layers. During the training process, the network is given a set of inputs and the desired outputs. It makes a prediction based on the current weights, and then adjusts the weights to minimize the difference between the predicted and actual outputs. This process is usually accomplished through a method called __backpropagation__ and an __optimization strategy__ like stochastic gradient descent.
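
To make the predict-then-adjust loop concrete, here is a minimal sketch with a single sigmoid neuron and one training example. It is a strong simplification (no hidden layer, made-up input `x`, target and learning rate), so it shows the idea of a gradient update rather than full backpropagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, target, lr):
    """One gradient-descent update for a single sigmoid neuron trained with log loss."""
    pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    error = pred - target  # gradient of the log loss w.r.t. the weighted sum
    new_w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    new_b = b - lr * error
    return new_w, new_b, pred

# Hypothetical example: two features, desired output 1
w, b = [0.0, 0.0], 0.0
x, target = [1.0, 2.0], 1.0

predictions = []
for _ in range(20):
    w, b, pred = train_step(w, b, x, target, lr=0.5)
    predictions.append(pred)

print(round(predictions[0], 3), round(predictions[-1], 3))
```

The first prediction is 0.5 (all weights start at zero) and each update moves the prediction closer to the target of 1.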

<img src="https://drive.google.com/uc?id=1e14l1uAqWEedSwDVVruCBrsgrtMJdNQC" width="400px"> <img name="neuralnetwork.gif">

**Step 11** - Create an instance of MLPClassifier with the default parameters and name it as __model__. Check the results using the above created functions.

In [None]:
model = MLPClassifier()
df = pd.DataFrame(columns = ['Time','Train','Test', 'Iterations'], index = ['Raw'])
show_results(df, model, number_splits=10)

## <font color='#BFD72F' id="hidden"> 2.1. The hidden layer size </font>
[Back to Contents](#toc)

__Determine the Number of hidden layers and neurons__<br>
By default, `MLPClassifier` uses one hidden layer with 100 neurons, which is represented as:<br>
__(100,)__

__The number of hidden layers__: <br>
-	Increasing the number of hidden layers might or might not improve the accuracy; it depends on the complexity of the problem
-	Using more hidden layers than needed tends to cause overfitting on the training set and a decrease in the accuracy on the test set

__The number of hidden units__: <br>
-	Using too few neurons in the hidden layers will result in underfitting
-	Using too many neurons in the hidden layer may result in overfitting and increases the training time of the neural network

The aim is to keep a good trade-off between the simplicity of the model and the performance accuracy! <br>
<div class="alert alert-success">Different rules of thumb exist (take them with a grain of salt): <br>

-	The number of hidden neurons should be __between the size of the input layer and the size of the output layer__
-	The number of hidden neurons should be __2/3 the size of the input layer, plus the size of the output layer__
-	The number of hidden neurons should be __less than twice the size of the input layer__</div>
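
As a quick illustration, the three rules of thumb above can be turned into numbers. The helper `hidden_neuron_rules` and the example sizes (20 input features, 1 output neuron) are hypothetical; treat the results as starting points, not as recommendations.

```python
def hidden_neuron_rules(n_inputs, n_outputs):
    """Return the values suggested by three common rules of thumb."""
    low, high = sorted((n_inputs, n_outputs))
    return {
        # rule 1: somewhere between the output size and the input size
        "between_input_and_output": (low, high),
        # rule 2: 2/3 of the input size plus the output size
        "two_thirds_input_plus_output": round(2 * n_inputs / 3 + n_outputs),
        # rule 3: upper bound, the largest integer below twice the input size
        "less_than_twice_input": 2 * n_inputs - 1,
    }

# Hypothetical network: 20 input features and 1 output neuron
rules = hidden_neuron_rules(n_inputs=20, n_outputs=1)
print(rules)
```

Any of these values could then be tried as the size of a single hidden layer and compared through cross-validation.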




**Step 12** - Create an `MLPClassifier` with one hidden layer and one neuron and name it __model_simple__.

In [None]:
# (1,) is a one-element tuple: one hidden layer with one neuron
model_simple = MLPClassifier(hidden_layer_sizes=(1,))

**Step 13** - Create an `MLPClassifier` with one hidden layer and 10 neurons and name it __model_medium__.

In [None]:
# (10,) is a one-element tuple: one hidden layer with 10 neurons
model_medium = MLPClassifier(hidden_layer_sizes=(10,))

**Step 14** - Create an `MLPClassifier` with four hidden layers and 100 neurons each and name it __model_complex__.

In [None]:
model_complex = MLPClassifier(hidden_layer_sizes = (100,100,100,100))

**Step 15** - Check the mean accuracy of each model by calling the function _show_results_, passing the results dataframe and the three models as arguments.

In [None]:
df = pd.DataFrame(columns = ['Time','Train','Test', 'Iterations'], index = ['Simple','Medium','Complex'])
show_results(df, model_simple, model_medium, model_complex, number_splits= 10)

While the results may differ between runs, in the end we arrive at the following conclusions:
- The more complex the model, the higher the running time;
- We can boost the performance of our model by correctly adjusting its complexity - too simple and it will underfit, too complex and it might overfit.

## <font color='#BFD72F' id="activation"> 2.2. The activation function</font>
[Back to Contents](#toc)

__(default = 'relu')__

Check this <a href="https://www.analyticsvidhya.com/blog/2021/04/activation-functions-and-their-derivatives-a-quick-complete-guide/">link</a> for more information regarding the advantages and disadvantages of different activation functions.

<img src="https://drive.google.com/uc?id=1IqkVu_b-NLJtBfguWwbuzgn46GG9IlgF" width="400px"> <img name="activation.png">

**Step 16** - Create an instance of `MLPClassifier`, define the activation as _relu_ and name it as __model_relu__.

In [None]:
model_relu = MLPClassifier(activation = 'relu')

 - __Advantages:__
     - Computationally efficient - allows the network to converge very quickly.
 - __Disadvantages:__
     - The dying ReLU problem - When a neuron's input is negative, the gradient of the function is zero, so no update flows back through that neuron during backpropagation. A neuron whose inputs are always negative stops learning.

**Step 16.1** - Use the `.activation` attribute to check the activation function used for hidden layer.

In [None]:
model_relu.activation

**Step 17** - Create an instance of `MLPClassifier`, define the activation as _logistic_ (sigmoid) and name it as __model_logistic__.

In [None]:
model_logistic = MLPClassifier(activation = 'logistic')

 - __Advantages:__
     - Smooth gradient, preventing “jumps” in output values.
     - Output values bound between 0 and 1, normalizing the output of each neuron.
 - __Disadvantages:__
     - Vanishing gradient - for very high or very low values of X, there is almost no change to the prediction, causing a vanishing gradient problem. This can result in the network refusing to learn further, or in slow convergence.
     - Computationally expensive.

**Step 18** - Create an instance of `MLPClassifier`, define the activation as _tanh_ and name it as __model_tanh__.

In [None]:
model_tanh = MLPClassifier(activation = 'tanh')

 - __Advantages:__
     - Zero centered - making it easier to model inputs that have strongly negative, neutral and strongly positive values. Other than that it is similar to the sigmoid function. <br>
 - __Disadvantages:__
     - Same as with the sigmoid function.
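
The advantages and disadvantages listed for the three activation functions come down to how their gradients behave. The sketch below evaluates the derivative of each function at a few hand-picked inputs; the input values are arbitrary and the helper names are ours.

```python
import math

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_grad(x):
    """Derivative of the logistic (sigmoid) function."""
    s = logistic(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh."""
    return 1.0 - math.tanh(x) ** 2

inputs = [-10.0, -1.0, 0.0, 1.0, 10.0]
gradients = {
    "relu": [relu_grad(x) for x in inputs],
    "logistic": [logistic_grad(x) for x in inputs],
    "tanh": [tanh_grad(x) for x in inputs],
}

for name, grads in gradients.items():
    print(name, [round(g, 5) for g in grads])
```

ReLU has a zero gradient for every non-positive input (the dying ReLU problem), while logistic and tanh have gradients close to zero for inputs far from zero (the vanishing gradient problem). Note also that the logistic gradient never exceeds 0.25, whereas the tanh gradient reaches 1 at zero.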

**Step 19** - Check the mean accuracy of each model by calling the function _show_results_, passing the results dataframe and the three models as arguments.

In [None]:
df = pd.DataFrame(columns = ['Time','Train','Test', 'Iterations'], index = ['relu','logistic','tanh'])
show_results(df, model_relu, model_logistic, model_tanh, number_splits= 10)

Checking the results, we can draw some observations:
- ReLU tends to be faster than logistic or tanh.
- Sigmoid functions and their variations (such as tanh) may work well for classification problems, although the best choice depends on the dataset.

**Step 19.1** - Use the `out_activation_` attribute to check the activation function used for the output layer. <br>
Note: this attribute is only available after fitting the model to the data, e.g. `model.fit(X_train, y_train)`.

In [None]:
model_relu.out_activation_

## <font color='#BFD72F' id="solver"> 2.3. The solver </font>
[Back to Contents](#toc)

__default = 'adam'__

For more information check this <a href="http://www.robotics.stanford.edu/~ang/papers/icml11-OptimizationForDeepLearning.pdf">paper</a>.

**Step 20** - Create an instance of `MLPClassifier`, define the solver as _sgd_ and name it as __model_sgd__.

In [None]:
model_sgd = MLPClassifier(solver = 'sgd')

Notes:
- While Gradient Descent (GD) uses the whole training data to do a single update, SGD uses a random data point (or, as in sklearn, a small mini-batch) of the training data to update the parameters, so each SGD update is much cheaper than a GD update.
- It uses a common learning rate for all parameters, contrary to what happens in Adam.
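
The difference between the two update schemes can be seen on a toy problem. The sketch below fits a single weight `w` in `y = w * x` with squared error; the data, the learning rate and the variable names are all made up for illustration, and the per-point update is a simplification of the mini-batch updates used in practice.

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]   # true relationship: y = 3 * x
lr = 0.01

def gradient(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2 * (w * x - y) * x

# Gradient descent: one update per epoch, using the average gradient over all points
w_gd = 0.0
w_gd -= lr * sum(gradient(w_gd, x, y) for x, y in zip(xs, ys)) / len(xs)
gd_updates = 1

# Stochastic gradient descent: one update per data point, visited in random order
w_sgd = 0.0
order = list(range(len(xs)))
random.Random(42).shuffle(order)
sgd_updates = 0
for i in order:
    w_sgd -= lr * gradient(w_sgd, xs[i], ys[i])
    sgd_updates += 1

print(f"GD:  w = {w_gd:.3f} after {gd_updates} update")
print(f"SGD: w = {w_sgd:.3f} after {sgd_updates} updates")
```

After one pass over the same four points, SGD has made four small updates and ends up closer to the true weight of 3 than GD, which made a single update.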

**Step 21** - Create an instance of `MLPClassifier`, define the solver as _adam_ and name it as __model_adam__.
<br>

When to use: It achieves good results fast, so it is a good option for complex models when processing time is an issue.

In [None]:
model_adam = MLPClassifier(solver = 'adam')

Notes:
- It computes individual adaptive learning rates for different parameters.
- Adam combines the advantages of RMSProp and AdaGrad.
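
What "individual adaptive learning rates" means can be illustrated with a single Adam update written in plain Python. This is a sketch of the standard Adam update rule, not sklearn's internal code; the function name `adam_step` and the two example gradients are assumptions made for the illustration.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters; returns new params, m and v."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for early steps
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Two parameters whose gradients differ by four orders of magnitude
params = [1.0, 1.0]
grads = [100.0, 0.01]
new_params, m, v = adam_step(params, grads, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
steps = [old - new for old, new in zip(params, new_params)]

print([round(s, 6) for s in steps])
```

Both parameters move by roughly the same amount (about `lr`), because each gradient is rescaled by its own running magnitude. With plain SGD and the same learning rate, the steps would be `0.1` and `0.00001`.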

**Step 22** - Check the mean accuracy of each model by calling the function _show_results_, passing the results dataframe and the two models as arguments.

In [None]:
df = pd.DataFrame(columns = ['Time','Train','Test', 'Iterations'], index = ['sgd','adam'])
show_results(df, model_sgd, model_adam,number_splits= 10)

In sklearn, the number of iterations for __sgd__ and __adam__ corresponds to the number of epochs (an epoch consists of one full cycle through the training data).
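
Since one iteration is one epoch, each iteration contains several weight updates, one per mini-batch. The sketch below counts them, assuming the documented default `batch_size='auto'`, which sklearn resolves to `min(200, n_samples)`; the helper name `updates_per_epoch` is ours.

```python
import math

def updates_per_epoch(n_samples, batch_size="auto"):
    """Number of mini-batch weight updates in one epoch."""
    if batch_size == "auto":
        batch_size = min(200, n_samples)  # sklearn's documented default rule
    return math.ceil(n_samples / batch_size)

# 2500 rows with 10-fold cross-validation leave 2250 training rows per fold
n_train = 2500 * 9 // 10
n_updates = updates_per_epoch(n_train)

print(n_train, n_updates)
```

Under these assumptions, a model that reports 200 iterations has performed 200 passes over the training fold, with 12 weight updates in each pass.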

<img src="https://drive.google.com/uc?id=1sq7a1PigaAptGEF0lA-UL3cKoGnesYb0" width="500px"> <img name="optimizers.gif">

For more information check this <a href="https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/">webpage</a>.


## <font color='#BFD72F' id="lr_init"> 2.4. The learning rate initialization </font>
[Back to Contents](#toc)

__Only for sgd and adam (default = 0.001)__

The learning rate is one of the most important hyper-parameters to tune for training deep neural networks:

__Small LR__:
- If the learning rate is small, then training is more reliable, but optimization will take a lot of time because steps towards the minimum of the loss function are tiny - a smaller learning rate may allow the model to learn a more optimal or even globally optimal set of weights but may take significantly longer to train.
- A learning rate that is too small may never converge or may get stuck on a suboptimal solution.

__Big LR__:
- If the learning rate is high, training may fail to converge or may even diverge. Weight changes can be so large that the optimizer overshoots the minimum and makes the loss worse. A large learning rate lets the model learn faster, at the cost of arriving at a sub-optimal final set of weights.

The training should start with a relatively large learning rate because, in the beginning, random weights are far from optimal, and then the learning rate should decrease during training to allow for more fine-grained weight updates.
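
The effect of a small, a moderate and a large learning rate can be reproduced on the simplest possible loss, `f(w) = w**2`, whose minimum is at `w = 0`. The function, the starting point and the three learning rates are chosen only for illustration; a real network's loss surface is far more complex.

```python
def gradient_descent(lr, w0=5.0, n_steps=50):
    """Minimize f(w) = w**2 with a fixed learning rate; return the final w."""
    w = w0
    for _ in range(n_steps):
        w -= lr * 2 * w  # the gradient of w**2 is 2 * w
    return w

final_w = {lr: gradient_descent(lr) for lr in (0.000001, 0.1, 1.1)}

for lr, w in final_w.items():
    print(f"lr={lr}: final w = {w:.4f}")
```

With the tiny learning rate the weight has barely moved from 5 after 50 steps, with 0.1 it is very close to the minimum, and with 1.1 each step overshoots so much that the weight grows without bound.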


<img src="https://drive.google.com/uc?id=1Y5wLeNa8pvtCst54Ch5oSpkewSQzWc90" width="800px"> <img name="lr.png">

**Step 23** - Create an instance of `MLPClassifier`, define the solver as _sgd_, the learning_rate_init as _0.5_ and name it as __model_lr_big__.

In [None]:
model_lr_big = MLPClassifier(solver = 'sgd', learning_rate_init = 0.5)

**Step 24** - Create an instance of `MLPClassifier`, define the solver as _sgd_, the learning_rate_init as _0.001_ and name it as __model_lr_medium__.

In [None]:
model_lr_medium = MLPClassifier(solver = 'sgd', learning_rate_init = 0.001)

**Step 25** - Create an instance of `MLPClassifier`, define the solver as _sgd_, the learning_rate_init as _0.000001_ and name it as __model_lr_small__.

In [None]:
model_lr_small = MLPClassifier(solver = 'sgd', learning_rate_init = 0.000001)

**Step 26** - Check the mean accuracy of each model by calling the function _show_results_, passing the results dataframe and the three models as arguments.

In [None]:
df = pd.DataFrame(columns = ['Time','Train','Test', 'Iterations'], index = ['big','medium','small'])
show_results(df, model_lr_big, model_lr_medium, model_lr_small,number_splits= 10)

## <font color='#BFD72F' id="lr"> 2.5. The learning rate</font>
[Back to Contents](#toc)

__Only for sgd (default = 'constant')__

**Step 27** - Create an instance of `MLPClassifier`, define the solver as _sgd_, the learning_rate as _constant_ and name it as __model_constant__.

Definition: If the learning rate is constant, as the name says, the learning rate will always remain equal to the initial learning rate.

In [None]:
model_constant = MLPClassifier(solver = 'sgd', learning_rate = 'constant')

**Step 28** - Create an instance of `MLPClassifier`, define the solver as _sgd_, the learning_rate as _invscaling_ and name it as __model_invscaling__.

Definition: If the learning rate is invscaling, it gradually decreases the learning rate at each time step ‘t’ using an inverse scaling exponent of ‘power_t’.

$$
effective\,learning\,rate = \frac{learning\_rate\_init}{t^{\,power\_t}}
$$

Note: The __power_t__ (default = 0.5) is another parameter that you can change.
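
The formula above is easy to evaluate by hand. The sketch below uses the default values `learning_rate_init = 0.001` and `power_t = 0.5`; the helper name `invscaling_lr` and the chosen time steps are ours.

```python
def invscaling_lr(learning_rate_init, t, power_t=0.5):
    """Effective learning rate at time step t under inverse scaling."""
    return learning_rate_init / (t ** power_t)

schedule = {t: invscaling_lr(0.001, t) for t in (1, 4, 100, 10000)}

for t, lr in schedule.items():
    print(f"t={t}: effective learning rate = {lr:.6f}")
```

With `power_t = 0.5` the learning rate shrinks with the square root of the time step: it is halved at `t = 4` and divided by 100 at `t = 10000`.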

In [None]:
model_invscaling = MLPClassifier(solver = 'sgd', learning_rate = 'invscaling')

**Step 29** - Create an instance of `MLPClassifier`, define the solver as _sgd_, the learning_rate as _adaptive_ and name it as __model_adaptive__.

Definition: <br>
If the learning rate is adaptive, then it keeps the learning rate constant to ‘learning_rate_init’ as long as training loss keeps decreasing. <br>
Each time two consecutive epochs fail to decrease training loss by at least __tol__ (tolerance for the optimization - another parameter that you can change), or fail to increase validation score by at least __tol__ if __early_stopping__ (meaning to terminate training when the validation score is not improving - another parameter that you can change) is on, the current learning rate is divided by 5.
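
The adaptive rule can be simulated on a list of per-epoch training losses. The sketch follows the rule as worded in the definition above (two consecutive epochs without an improvement of at least `tol`), ignores early stopping, and uses a made-up loss sequence; sklearn's internal bookkeeping may differ in its details.

```python
def adaptive_schedule(losses, learning_rate_init=0.001, tol=1e-4):
    """Return the learning rate used in each epoch under a simplified 'adaptive' rule."""
    lr = learning_rate_init
    best_loss = float("inf")
    stalled = 0
    rates = []
    for loss in losses:
        rates.append(lr)
        if loss > best_loss - tol:
            stalled += 1          # this epoch did not improve by at least tol
        else:
            stalled = 0
        best_loss = min(best_loss, loss)
        if stalled >= 2:          # two consecutive epochs without enough improvement
            lr /= 5
            stalled = 0
    return rates

# Hypothetical training losses, one per epoch
losses = [0.90, 0.70, 0.60, 0.60, 0.60, 0.55, 0.55, 0.55]
rates = adaptive_schedule(losses)

print(rates)
```

The learning rate stays at 0.001 while the loss keeps falling, and drops to 0.0002 once the loss has plateaued for two epochs in a row.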

In [None]:
model_adaptive = MLPClassifier(solver = 'sgd', learning_rate = 'adaptive')

**Step 30** - Check the mean accuracy of each model by calling the function _show_results_, passing the results dataframe and the three models as arguments.

In [None]:
df = pd.DataFrame(columns = ['Time','Train','Test', 'Iterations'], index = ['constant','invscaling','adaptive'])
show_results(df, model_constant, model_invscaling, model_adaptive,number_splits= 10)

## <font color='#BFD72F' id="max_iter"> 2.6. The maximum iterations </font>
[Back to Contents](#toc)

By default, sklearn defines the maximum number of iterations as 200. While this could be enough for simple datasets, in complex problems you should try higher values that allow the model to converge.
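
How an iteration budget interacts with convergence can be sketched with a toy optimization loop. The loop below stops either when the loss improves by less than `tol` or when `max_iter` is reached; the loss function, learning rate and single-step stopping rule are simplifications chosen for illustration and do not reproduce sklearn's exact stopping logic.

```python
def fit_with_budget(max_iter, lr=0.01, tol=1e-4, w0=5.0):
    """Gradient descent on f(w) = w**2 that stops on convergence or at max_iter."""
    w = w0
    loss = w ** 2
    for n_iter in range(1, max_iter + 1):
        w -= lr * 2 * w
        new_loss = w ** 2
        if loss - new_loss < tol:   # improvement smaller than tol: converged
            return n_iter, True
        loss = new_loss
    return max_iter, False          # budget exhausted before convergence

results = {budget: fit_with_budget(budget) for budget in (20, 100, 500)}

for budget, (n_iter, converged) in results.items():
    print(f"max_iter={budget}: stopped after {n_iter} iterations, converged={converged}")
```

With budgets of 20 and 100 the loop uses every available iteration without converging, while with 500 it stops early. In the same way, a fitted model whose `n_iter_` equals `max_iter` is a hint that the budget may have been too small.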

**Step 31** - Create an instance of `MLPClassifier`, define the max_iter as _20_ and name it as __model_maxiter_20__.

In [None]:
model_maxiter_20 = MLPClassifier(max_iter = 20)

**Step 32** - Create an instance of `MLPClassifier`, define the max_iter as _100_ and name it as __model_maxiter_100__.

In [None]:
model_maxiter_100 = MLPClassifier(max_iter = 100)

**Step 33** - Create an instance of `MLPClassifier`, define the max_iter as _500_ and name it as __model_maxiter_500__.

In [None]:
model_maxiter_500 = MLPClassifier(max_iter = 500)

**Step 34** - Check the mean accuracy of each model by calling the function _show_results_, passing the results dataframe and the three models as arguments.

In [None]:
df = pd.DataFrame(columns = ['Time','Train','Test', 'Iterations'], index = ['max iter 20','max iter 100','max iter 500'])
show_results(df, model_maxiter_20, model_maxiter_100, model_maxiter_500,number_splits= 10)

## <font color='#BFD72F' id="other"> 2.7. Other parameters</font>

|Parameter| Definition | LBFGS | SGD | ADAM |
|---|---|---|---|---|
|alpha| L2 penalty (regularization term) parameter | yes | yes | yes |
| power_t | The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to ‘invscaling’. | no | yes | no |
| shuffle | Whether to shuffle samples in each iteration. | no | yes | yes |
| tol | Tolerance for the optimization. When the loss or score is not improving by at least tol for n_iter_no_change consecutive iterations, unless learning_rate is set to ‘adaptive’, convergence is considered to be reached and training stops. | yes | yes | yes |
| warm_start | When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. | yes | yes | yes |
| momentum | Momentum for gradient descent update. Should be between 0 and 1. | no | yes | no |
| nesterovs_momentum | Whether to use Nesterov’s momentum.| no | yes | no |
| early_stopping | Whether to use early stopping to terminate training when validation score is not improving. If set to true, it will automatically set aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs. The split is stratified, except in a multilabel setting.  | no | yes | yes |
| validation_fraction | The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True | no | yes | yes|
| beta1 | Exponential decay rate for estimates of first moment vector in adam, should be in [0, 1). | no | no | yes |
| beta2 | Exponential decay rate for estimates of second moment vector in adam, should be in [0, 1).  | no | no | yes |
| epsilon | Value for numerical stability in adam. | no | no | yes |
| n_iter_no_change | Maximum number of epochs to not meet tol improvement. |  no | yes | yes |
| max_fun | Only used when solver=’lbfgs’. Maximum number of loss function calls. The solver iterates until convergence (determined by ‘tol’), number of iterations reaches max_iter, or this number of loss function calls. | yes | no | no |

# <font color='#BFD72F'>3. RandomizedSearch</font> <a class="anchor" id="randsearch"></a>
[Back to Contents](#toc)

**Step 35** - Define a dictionary named as __parameter_space__ and define the following options to be considered during modelling:
- 'hidden_layer_sizes': [(50,50,50), (100,)]
- 'activation': ['tanh', 'relu']
- 'solver': ['sgd', 'adam']
- 'learning_rate_init' : [0.0001, 0.001, 0.01, 0.1]
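
It helps to know how large this search space is before running the search. The sketch below enumerates every combination of the options listed above; the variable names are ours, and the value 10 is the documented default of the `n_iter` parameter of `RandomizedSearchCV`.

```python
from itertools import product

parameter_options = {
    'hidden_layer_sizes': [(50, 50, 50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'learning_rate_init': [0.0001, 0.001, 0.01, 0.1],
}

# every combination an exhaustive grid search would have to evaluate
all_combinations = list(product(*parameter_options.values()))
n_combinations = len(all_combinations)

# a randomized search with the default n_iter only samples 10 of them
n_sampled = min(10, n_combinations)

print(n_combinations, n_sampled)
```

With the defaults, the randomized search therefore evaluates fewer than a third of the 32 possible combinations, which is why its result can change from run to run unless a `random_state` is set.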

In [None]:
parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'learning_rate_init': [0.0001, 0.001, 0.01, 0.1]
}

**Step 36** - Create an instance of `RandomizedSearchCV` named as __clf__ and pass as parameters the __model__ and the __parameter_space__.

In [None]:
clf = RandomizedSearchCV(model, parameter_space)

**Step 37** - Fit your instance to X_train and y_train.

In [None]:
clf.fit(X_train, y_train)

**Step 38** - Call the attribute `.best_params_` to check which is the best combination of parameters.

In [None]:
clf.best_params_

**Step 39** - Create a __final_model__ with the best parameters, as checked in the previous step, using the attribute `.best_estimator_`.

In [None]:
# best_estimator_ is already refit on the whole training set (refit=True by default)
final_model = clf.best_estimator_
print('Train:', final_model.score(X_train, y_train))
print('Test:', final_model.score(X_test, y_test))

**Step 40** - Create a loop to print the mean and the standard deviation of the test score for each parameter combination evaluated by `RandomizedSearchCV`.

In [None]:
# Best parameter set
print('------------------------------------------------------------------------------------------------------------------------')
print('Best parameters found:\n', clf.best_params_)
print('------------------------------------------------------------------------------------------------------------------------')

# All results
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
params = clf.cv_results_['params']

for mean, std, param in zip(means, stds, params):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std , param))

# <font color='#BFD72F'>4. Extra: Pipeline</font> <a class="anchor" id="pipe"></a>
[Back to Contents](#toc)



A pipeline is a sequence of interconnected __transformers__ and an __estimator__ that encompasses the entire workflow, from data preparation and preprocessing to model training, evaluation, and deployment, as represented as follows:

<img src="https://drive.google.com/uc?id=1aHVWQpnjJNfdIaPKfm263VyxxB1UzRhC" width="800px"> <img name="pipeline.png">

As defined in the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">documentation</a>, the __transformers__ must implement `.fit()` and `.transform()` methods while the __estimator__ only needs to implement `.fit()`.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
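
The contract between transformers and the final estimator can be sketched in a few lines of plain Python. The classes below (`MinMax`, `ThresholdClassifier`, `MiniPipeline`) are toy stand-ins invented for this illustration and are not sklearn's implementation; they work on a single numeric feature stored in a list.

```python
class MinMax:
    """Toy transformer: rescale a list of numbers to [0, 1]."""
    def fit(self, X, y=None):
        self.lo, self.hi = min(X), max(X)
        return self
    def transform(self, X):
        return [(x - self.lo) / (self.hi - self.lo) for x in X]

class ThresholdClassifier:
    """Toy estimator: predict 1 above the midpoint between the two class means."""
    def fit(self, X, y):
        mean0 = sum(x for x, t in zip(X, y) if t == 0) / y.count(0)
        mean1 = sum(x for x, t in zip(X, y) if t == 1) / y.count(1)
        self.threshold = (mean0 + mean1) / 2
        return self
    def predict(self, X):
        return [int(x > self.threshold) for x in X]

class MiniPipeline:
    """Chain transformers and a final estimator (a sketch, not sklearn's Pipeline)."""
    def __init__(self, transformers, estimator):
        self.transformers, self.estimator = transformers, estimator
    def fit(self, X, y):
        for step in self.transformers:
            X = step.fit(X, y).transform(X)   # statistics learned on training data only
        self.estimator.fit(X, y)
        return self
    def predict(self, X):
        for step in self.transformers:
            X = step.transform(X)             # reuse the statistics learned in fit
        return self.estimator.predict(X)

pipe_sketch = MiniPipeline([MinMax()], ThresholdClassifier())
pipe_sketch.fit([10, 20, 30, 40], [0, 0, 1, 1])
predictions = pipe_sketch.predict([15, 35])

print(predictions)
```

The key design point is that `fit` learns the scaling from the training data and `predict` only reuses it. When the whole chain is cross-validated as one object, the validation fold never influences the preprocessing.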


**Step 41** - Create a pipeline to perform scaling (with `MinMaxScaler`), feature selection (with `SelectKBest`) and hyperparameter tuning (with `GridSearchCV`) for a `RandomForestClassifier`.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('feature_selection', SelectKBest(f_classif)),
    ('classifier', RandomForestClassifier(random_state=5))
    ])

params = dict(
    feature_selection__k=[2, 3, 4],
    classifier__max_depth=[10, 20, 30]
    )

grid_search = GridSearchCV(pipe, param_grid=params, scoring='f1', cv=6)
gs = grid_search.fit(X_train, y_train)
print(gs.best_params_)
print(gs.best_score_)

**Step 41.1** - You can call the `.best_estimator_` attribute to inspect the full configuration of the best pipeline.

In [None]:
print(gs.best_estimator_)

**Step 42** - Use pipeline result to classify `X_test` and measure the F1 score.

In [None]:
from sklearn.metrics import f1_score

Y_pred = gs.predict(X_test)
f1_score(y_test, Y_pred)

In the example above there was no preprocessing because the dataset was already preprocessed, but you can use a pipeline for the full CRISP-DM process.
Let's now see an example of how to test different scalers.

**Step 43** - Adapt the previous pipeline to test which is the most suitable scaler, `MinMaxScaler` or `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler

scalers = {"minmax": MinMaxScaler(),
           "standard": StandardScaler()
          }

args = []
results = []
clf =[]

for scl in scalers.values():
    pipe = Pipeline([
    ('scaler', scl),
    ('feature_selection', SelectKBest(f_classif)),
    ('classifier', RandomForestClassifier(random_state=5))
    ])

    params = dict(
        feature_selection__k=[2, 3, 4],
        classifier__max_depth=[10, 20, 30]
        )

    grid_search = GridSearchCV(pipe, param_grid=params, scoring='f1', cv=6)
    gs = grid_search.fit(X_train, y_train)
    args.append(gs.best_params_)
    results.append(gs.best_score_)
    clf.append(gs.best_estimator_)

print(args)
print(results)

**Step 44** - Use the best classifier to classify `X_test` and measure the F1 score.

In [None]:
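# pick the pipeline whose scaler achieved the best cross-validated F1 score
best_idx = int(np.argmax(results))
y_pred = clf[best_idx].predict(X_test)
f1_score(y_test, y_pred)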

<b><font size="6"> Don't forget to practice at home  &#128521;

... Questions about the project? </font></b>
You may always look for more information on the internet, here's an [example](https://michael-fuchs-python.netlify.app/2020/08/21/the-data-science-process-crisp-dm/).