## Supervised Machine Learning Models - Part 2

### Table of Contents

* [PART 1](#chapter1)  

    * [Ensemble Learning algorithms](#section_1_1)
        1. [Bagging algorithms](#Section_1_1_1)
            *  [Random Forest](#section_2_1_1)
        2. [Boosting algorithms](#section_3_1_1)
             * [AdaBoost](#section_3_2_1)
             * [Gradient Boosting](#section_3_2_2)
             * [XGBoost](#section_3_2_3)
        3. [Stacking](#section_4_4_1)
<br>

* [PART 2](#chapter2)

    * [Cross Validation](#section_4_1)
    * [HyperParameter Tuning](#section_5_1)
        * [GridsearchCV](#section_5_1_1)
        * [RandomizedSearchCV](#section_5_1_2)
    * [Model / Algorithm Selection](#section_6_1)

## PART 1 <a class="anchor" id="chapter1"></a>

## Ensemble Learning algorithms <a class="anchor" id="section_1_1"></a>

Ensemble learning models are simply combinations of different machine learning models. Instead of training one large or complex model on your dataset, you train multiple small, simpler models (weak learners) and aggregate their outputs (in various ways) to form your prediction.

It combines multiple weak models/learners into one predictive model to **reduce bias, variance and/or improve accuracy.**

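<font color='Brown'>**Bias**</font>: error caused by overly simple assumptions in the model; high bias leads to underfitting.

<font color='Brown'>**Variance**</font>: error caused by sensitivity to small fluctuations in the training data; high variance leads to overfitting.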

<img src="Images/bvsv.png" width="300"/>


<font color='Brown'>**Types of Ensemble Learning (combining N weak learners):**</font>
1. Bagging
2. Boosting
3. Stacking

These methods share the same wisdom-of-the-crowd concept but differ in what each one focuses on, the **type of weak learners** used, and the **type of aggregation** used to form the final output.





### 1. Bagging algorithms <a class="anchor" id="Section_1_1_1"></a>

**Bagging (Bootstrap Aggregating):** Trains N different weak models (usually of the same type – homogeneous) on N bootstrap samples of the input dataset **in parallel.** Because the samples are drawn with replacement, they can overlap. In the test phase, every model makes a prediction, and the label predicted most often is selected as the final prediction. Bagging methods reduce the variance of the prediction.

For each weak learner, the training data is randomly sampled from the original dataset with replacement, so some observations may be repeated in each new training set. Random sampling with replacement creates nearly i.i.d. samples. During inference, the test input is fed to all the weak learners and their outputs are collected. The bagged result is the average (for regression) or the label chosen by the voting scheme (for classification).

In bagging methods, the weak learners are usually of the same type. Because each learner sees a different random sample and the outputs are averaged, bagging leaves the bias of the prediction largely unchanged but **reduces its variance.**

<img src="Images/bag.png" width="300"/>

Each decision tree in the bag is trained on its own random bootstrap sample of the training set. In other words, suppose the training data is a table with n observations on m features. Each component tree receives a sample of k observations on m features to train on, with k ≤ n (scikit-learn's `BaggingRegressor` uses k = n by default). Each observation of a sample is drawn from the full data with replacement.

In [1]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

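# The base model is passed positionally: the keyword was renamed from
# `base_estimator` to `estimator` in newer scikit-learn versions.
bag = BaggingRegressor(DecisionTreeRegressor())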
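
To make the mechanics above concrete, here is a minimal hand-rolled sketch of bagging for regression. The synthetic sine data, the choice of 25 trees and all variable names are assumptions made for illustration; in practice `BaggingRegressor` does this for you.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

n_models = 25
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw n row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Aggregate: average the predictions of all trees
X_new = np.linspace(0, 6, 50).reshape(-1, 1)
all_preds = np.stack([tree.predict(X_new) for tree in trees])  # shape (n_models, 50)
bagged_pred = all_preds.mean(axis=0)

# Individual trees disagree with each other; the average smooths that out
print("Mean std across trees:", all_preds.std(axis=0).mean())
```

For classification, the averaging step would be replaced by a majority vote over the predicted labels.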

#### Random Forest <a class="anchor" id="section_2_1_1"> </a>

- Random forest is very similar to bagging: it also consists of many decision trees, each of the trees is assigned with a bootstrap sample of the training data, and the final result of the meta-model is computed as the average or mode of the outputs from the components.

- The only difference is that, when a random forest splits a node of a component tree, not all of the features are taken as split candidates. Instead, only a random subset of the feature set is considered (a new subset is drawn for each node), and the best feature from this subset is used as the splitting test at that node. If there are m features overall, the size of the subset can be any number from 1 to m-1.

- The random forest algorithm tries to decorrelate the trees so that they learn different things about the data. It does this by selecting **a random subset of variables.** If one or a few independent variables are very strong predictors of the response variable, they would otherwise be selected in many of the trees, causing the trees to become correlated. Random subsampling of the independent variables ensures that the strongest predictors are not always available at every split, so the model has a chance to learn from other features of the data.

- They are simply ensembles of decision trees. Each tree is trained with a different randomly selected part of the data with randomly selected independent variables. The goal of introducing randomness is **to reduce the variance** of the model so it does **not overfit**, at the expense of a **small increase in the bias** and **some loss of interpretability**. This strategy generally boosts the performance of the final model.

<font color='Brown'>**How it works:**</font>

- Individual trees are built independently, using the same procedure as for a normal decision tree but with only a random portion of the data and only considering a random subset of the features at each node. Aside from this, the training procedure is exactly the same as for an individual Decision Tree, repeated N times.

- To make a prediction using a Random Forest, an individual prediction is obtained from each tree. Then, if it is a classification problem, we take the most frequent prediction as the result, and if it is a regression problem we take the average prediction from all the individual trees as the output value. The following figure illustrates how this is done:


<img src="Images/randomforest.png" width="700"/>

**Parameters:**
For random forests, we have two critical arguments.

- Number of features (`max_features`): how many features (columns) to consider as split candidates at each node
- Number of trees (`n_estimators`): how many trees to average




In [2]:
from sklearn.ensemble import RandomForestRegressor
rfr= RandomForestRegressor()
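
As a small illustration of the two arguments discussed above, the sketch below sets them explicitly. The synthetic dataset from `make_regression` and the specific values (50 trees, 3 features per split) are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with 10 features (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

rfr = RandomForestRegressor(
    n_estimators=50,   # number of trees whose predictions are averaged
    max_features=3,    # number of features considered at each split
    random_state=0,
).fit(X, y)

n_trees = len(rfr.estimators_)  # the fitted component trees
print("Number of trees:", n_trees)
print("Prediction for the first 3 rows:", rfr.predict(X[:3]))
```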

### 2. Boosting algorithms <a class="anchor" id="section_3_1_1"></a>

The general idea of boosting also involves building multiple weak learners that contribute to the final result. However, these component trees are built **sequentially**, one after another, and how each one is built depends on the results of the earlier ones. Put another way, the next weak learner is built specifically to improve on what the existing weak learners are bad at.

There are several answers to the question of how the next tree should address the shortcomings of the previous trees, and these answers divide boosting into several styles. The most popular styles are **AdaBoost**, **Gradient Boosting** and **XGBoost**.

At prediction time, the output of each weak model is weighted (in AdaBoost, according to how well that model performed during training) and the weighted outputs are combined by voting. Boosting methods **decrease the bias of the prediction.**

<img src="Images/boost.png" width="300"/>

<font color='Brown'>**How it works:**</font>

Step 1: The first base learner takes all the observations and assigns equal weight, or attention, to each of them.

Step 2: If the first base learner makes prediction errors, we pay more attention to the observations it predicted incorrectly. Then we apply the next base learner.

Step 3: Repeat Step 2 until the maximum number of base learners is reached or the desired accuracy is achieved.

Finally, the outputs of the weak learners are combined into a strong learner, which improves the predictive power of the model. Boosting focuses on examples that were misclassified, or had higher errors, under the preceding weak rules.
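
The three steps above can be sketched by hand. The code below is a minimal version of discrete AdaBoost with decision stumps, assuming labels encoded as -1/+1; the synthetic data, the 20 rounds and all variable names are assumptions for illustration, not the exact procedure used by any particular library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (assumption); relabel classes 0/1 as -1/+1
X, y01 = make_classification(n_samples=300, n_features=5, random_state=0)
y = 2 * y01 - 1

# Step 1: equal weight for every observation
w = np.full(len(X), 1 / len(X))
stumps, alphas = [], []

for _ in range(20):  # Step 3: repeat for a fixed number of rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # Weighted error of this learner (clipped to avoid division by zero)
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # say of this learner in the final vote

    # Step 2: raise the weights of misclassified points, lower the others
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Strong learner: sign of the weighted vote of all stumps
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_acc = np.mean(np.sign(scores) == y)
print("Training accuracy of the combined model:", train_acc)
```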





#### AdaBoost (Adaptive Boosting) <a class="anchor" id="section_3_2_1"></a>

Adaptive Boosting, aka AdaBoost, fits a sequence of weak learners on differently weighted versions of the training data. It starts by fitting the original data set with equal weight on each observation. If the first learner predicts some observations incorrectly, those observations are given a higher weight. Being an iterative process, it continues to add learners until a limit on the number of models or on accuracy is reached.

<font color='Brown'>**How it works:**</font>
<img src="Images/ada.png" width="500"/>

***Box 1:*** We have assigned equal weights to each data point and applied a decision stump to classify them as + (plus) or – (minus). The decision stump (D1) has generated a vertical line on the left side to classify the data points. This vertical line has incorrectly predicted three + (plus) as – (minus). In this case, we’ll assign higher weights to these three + (plus) and apply another decision stump.

***Box 2:*** Here, the three incorrectly predicted + (plus) are drawn bigger than the rest of the data points. The second decision stump (D2) will try to predict them correctly. A vertical line (D2) on the right side of this box now classifies the three previously misclassified + (plus) correctly. But it has again caused misclassification errors, this time with three – (minus). Again, we will assign higher weights to the three – (minus) and apply another decision stump.

***Box 3:*** Here, the three – (minus) are given higher weights. A decision stump (D3) is applied to predict these misclassified observations correctly. This time a horizontal line is generated to classify + (plus) and – (minus), based on the higher weights of the misclassified observations.

***Box 4:*** Here, we have combined D1, D2 and D3 to form a strong predictor with a more complex rule than any individual weak learner. This combined model classifies the observations considerably better than any of the individual weak learners.

We can use AdaBoost algorithms for both classification and regression problems.

***The drawback of AdaBoost*** is that it is easily defeated by noisy data. Its performance is strongly affected by outliers, because the algorithm tries to fit every point perfectly.

In [None]:
from sklearn.ensemble import AdaBoostClassifier #For Classification
from sklearn.ensemble import AdaBoostRegressor #For Regression
from sklearn.tree import DecisionTreeClassifier

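# Pass an estimator instance (a depth-1 tree, i.e. a decision stump), not the class itself.
# x_train, y_train and x_test are assumed to be defined earlier.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)
clf.fit(x_train, y_train)
clf.predict(x_test)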

**Parameters**
- N estimators (`n_estimators`): Controls the number of weak learners.
- Learning Rate (`learning_rate`): Controls the contribution of each weak learner in the final combination. There is a trade-off between learning_rate and n_estimators.
- Base estimator: Specifies the ML algorithm used as the weak learner (a decision stump by default).


#### Gradient Boosting <a class="anchor" id="section_3_2_2"></a>

Gradient boosting is also based on sequential ensemble learning. The base learners are generated sequentially, and each new base learner is fitted to correct the errors of the ensemble built so far, so the overall model improves with each iteration.

However, in this form of boosting the weights of misclassified observations are not increased. The main idea is instead to reduce the residual errors (the prediction errors of the current ensemble of models). It optimizes the loss function by adding a new weak learner that is fitted to what the previous learners got wrong. In gradient boosting, the simple models are typically trees. As in random forests, many trees are grown, but here the trees are grown sequentially and each tree focuses on fixing the shortcomings of the previous trees.

<img src="Images/gb.png" width="500"/>

Gradient boosting is a greedy algorithm and can overfit a training dataset quickly.

The loss function is the quantity that needs to be optimized (i.e. the error to be reduced). Each new model is added so as to reduce the loss left by the previous learners.
Just like adaptive boosting, gradient boosting can be used for both classification and regression.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier #For Classification
from sklearn.ensemble import GradientBoostingRegressor #For Regression
# The default loss is the logistic loss (named 'deviance' in older scikit-learn, 'log_loss' in newer versions)
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1)
clf.fit(X_train, y_train)

**Parameters**
- N estimators (`n_estimators`): Controls the number of weak learners.
- Learning Rate (`learning_rate`): Controls the contribution of each weak learner in the final combination. There is a trade-off between learning_rate and n_estimators.
- Loss (`loss`): The loss function to be optimized.
- Sample size (`subsample`): The proportion of the data exposed to the model during each iteration.
- Max_depth (`max_depth`): Maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree.
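
The idea of fitting each new tree to the residual errors of the current ensemble can be sketched by hand for squared-error regression. This is a simplified illustration under assumed synthetic data and assumed settings (50 rounds, learning rate 0.1, depth-2 trees), not the full algorithm implemented by `GradientBoostingRegressor`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.full(len(y), y.mean())          # start from a constant prediction
mse_history = [np.mean((y - pred) ** 2)]
trees = []

for _ in range(50):
    # For squared error, the negative gradient is simply the residual
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    # Add a shrunken version of the new tree's output to the ensemble
    pred += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - pred) ** 2))

print("Training MSE before boosting:", mse_history[0])
print("Training MSE after 50 rounds:", mse_history[-1])
```

A smaller `learning_rate` makes each tree contribute less, which is why it usually has to be balanced with a larger number of trees.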


#### XGBoost <a class="anchor" id="section_3_2_3"></a>

- One of the most widely used gradient boosting algorithms is XGBoost, which stands for “extreme gradient boosting” and is a more advanced version of the gradient boosting method. XGBoost, like other gradient boosting methods, has many parameters to regularize and control the complexity of the model. This flexibility comes with benefits; methods based on XGBoost have won many machine learning competitions.

- XGBoost was introduced because the standard gradient boosting algorithm is slow: the data set is processed sequentially, which takes a long time.

- The main aim of this algorithm is to increase **speed** and model **performance**.

- XGBoost is also called a regularized boosting technique. The regularization helps to reduce overfitting.

- It supports **parallelization** within the construction of each decision tree (for example, candidate splits on different features can be evaluated in parallel). The trees themselves are still added to the model one after another.

- Around the time it was first published, XGBoost was the algorithm of choice for Kagglers. Many competition winners around 2015 and 2016 won their prizes using XGBoost.


XGBoost is similar to gradient boosting algorithm but it has a few tricks up its sleeve which makes it stand out from the rest.

Features of XGBoost are:

- Clever Penalisation of Trees
- A Proportional shrinking of leaf nodes
- Newton Boosting
- Extra Randomisation Parameter

In XGBoost the trees can have a varying number of terminal nodes, and leaf weights that are calculated with less evidence are shrunk more heavily. Newton boosting uses the Newton-Raphson method of approximation, which provides a more direct route to the minimum than gradient descent. The extra randomisation parameter can be used to reduce the correlation between the trees; the lower the correlation among classifiers, the better our ensemble of classifiers will turn out.
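
To illustrate why leaves with less evidence are shrunk more heavily, the sketch below computes a Newton-style leaf weight for the logistic loss, using the regularized formula w = -G / (H + λ), where G and H are the sums of first and second derivatives of the loss over the observations in the leaf. The function name `newton_leaf_weight` and the toy inputs are hypothetical, and this is a simplified view of one leaf rather than the library's implementation.

```python
import math

def newton_leaf_weight(y_true, raw_scores, reg_lambda=1.0):
    """Newton step for one leaf under logistic loss with L2 penalty reg_lambda."""
    grads, hessians = [], []
    for y, s in zip(y_true, raw_scores):
        p = 1.0 / (1.0 + math.exp(-s))   # current predicted probability
        grads.append(p - y)              # first derivative of the loss
        hessians.append(p * (1.0 - p))   # second derivative of the loss
    return -sum(grads) / (sum(hessians) + reg_lambda)

# A leaf holding a single positive example vs. a leaf holding ten of them,
# all starting from a raw score of 0 (probability 0.5)
few = newton_leaf_weight([1], [0.0])
many = newton_leaf_weight([1] * 10, [0.0] * 10)
print("Leaf weight with 1 observation:  ", few)
print("Leaf weight with 10 observations:", many)
```

With the same λ, the leaf supported by one observation receives a much smaller weight than the leaf supported by ten, because λ dominates the denominator when H is small.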

In [None]:
from xgboost import XGBRegressor
from xgboost import XGBClassifier
clf = XGBClassifier(n_estimators = 100, max_depth = 3)
clf.fit(x_train,y_train)
clf.predict(x_test)

### 3. Stacking <a class="anchor" id="section_4_4_1"></a>

**Stacking:**  Trains N different weak models (usually of different types – heterogeneous) on one of two subsets of the dataset in parallel. Once the weak learners are trained, their predictions on the other subset are used to train a meta learner that combines them into the final prediction. In the test phase, each model predicts a label, and this set of labels is fed to the meta learner, which generates the final prediction.

<img src="Images/stack.png" width="500"/>

In [None]:
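from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Base (level-0) models; X_train and y_train are assumed to be defined earlier
models = [('lr', LinearRegression()), ('dt', DecisionTreeRegressor())]
# The meta learner (final_estimator) combines the base models' predictions
stacking = StackingRegressor(estimators=models, final_estimator=RandomForestRegressor(n_estimators=10, random_state=42))
stacking.fit(X_train, y_train)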

**Which is the best, Bagging or Boosting?**
There’s not an outright winner; it depends on the data, the simulation and the circumstances.

- Bagging and Boosting decrease the variance of your single estimate as they combine several estimates from different models. So the result may be a model with higher stability.

- Both are good at reducing variance and provide higher stability but only Boosting tries to reduce bias. On the other hand, Bagging may solve the over-fitting problem, while Boosting can increase it.

- If the problem is that the single model gets a **very low performance**, Bagging will rarely get a better bias. However, Boosting could generate a combined model with lower errors as it optimises the advantages and reduces pitfalls of the single model.

- By contrast, if the difficulty of the single model is **over-fitting**, then Bagging is the best option. Boosting for its part doesn’t help to avoid over-fitting; in fact, this technique is faced with this problem itself. For this reason, Bagging is effective more often than Boosting.

<img src="Images/bbs.png" width="500"/>

## PART 2 <a class="anchor" id="chapter2"></a>

## Cross Validation <a class="anchor" id="section_4_1"></a>






In [1]:
import numpy as np
import pandas as pd

from sklearn import datasets

df = datasets.load_breast_cancer()
X_df, Y_df = df.data, df.target
print('Dataset Size : ', X_df.shape, Y_df.shape)

Dataset Size :  (569, 30) (569,)


#### Simple Holdout method or Train-test split

In Train-Test split method, the data is first shuffled randomly before splitting. As the model is trained on a different combination of data points, the model can give different results every time we train it, and this can be a cause of instability. 

Also, we can never be sure that the train set we picked is representative of the whole dataset. When the dataset is not very large, there is also a high chance that the test set contains important information that the model never sees, because we do not train on the test set.

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_df, Y_df, train_size=0.80, test_size=0.2, random_state=1)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

Train/Test Sizes :  (455, 30) (114, 30) (455,) (114,)


In [3]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier().fit(X_train, Y_train)
print('Train Accuracy : %.2f'%knn.score(X_train, Y_train))
print('Test Accuracy : %.2f'%knn.score(X_test, Y_test))

Train Accuracy : 0.95
Test Accuracy : 0.94
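
The instability of a single hold-out split can be seen by repeating the split with different seeds. The sketch below is an illustration under assumed settings (five seeds, default `KNeighborsClassifier`); the variable names are not part of the notebook above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

scores = []
for seed in range(5):
    # Same data, same model, only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    scores.append(KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te))

spread = max(scores) - min(scores)
print("Test accuracy per seed:", [round(s, 3) for s in scores])
print("Spread between best and worst split:", round(spread, 3))
```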


### **Cross-validation** 
Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to train a model and the other used to validate the model on different iterations.

<img src="https://i.ytimg.com/vi/kituDjzXwfE/maxresdefault.jpg" width="600"/> <br>


<font color='Brown'>**Why CV:**</font>

1. **More “efficient” use of data**- In this method it matters less how the data gets divided, as every observation is used for both training and testing

2. **Test the ability of a machine learning model to predict new data** - Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times.

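3. **Flag problems like overfitting or selection bias** - a large gap between training and validation scores, or scores that vary widely across folds, shows that the model is not generalizing.

4. **Parameters Fine-Tuning** - Most learning algorithms require some parameter tuning. We do it by trying different values and choosing the best ones. The same data should not be used both to fit the model and to score candidate hyperparameter values; cross-validation provides held-out folds for the scoring.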


#### <font color='Brown'>**Types of CVs:**</font>

1. K-Fold Cross Validation
2. Stratified K-fold Cross-Validation - Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole.
3. Leave One-out Cross Validation (LOOCV)
4. Leave P-out Cross Validation (LPOCV)

### K-Fold Cross Validation

<font color='Brown'>**How it works:**</font>


1. Shuffle the dataset randomly.
2. Split the training dataset into k groups
3. For each unique group:
    - Take 1 group as a hold out or validation data set
    - Take the remaining groups as a training data set
    - Fit a model on the training set and evaluate it on the validation set
    - Retain the evaluation score and discard the model
4. Step 3 is carried out *K* times in total, so that every fold serves as the validation set exactly once.
5. Summarize the evaluation done on each fold

<img src="https://i.stack.imgur.com/0SQJq.png" width="800"/> <br>
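
The steps above can be written out as an explicit loop. This is a minimal sketch that uses `KFold.split` to produce the index sets; the shuffle seed and variable names are assumptions, and `cross_val_score` performs the same loop internally.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Steps 1-2: shuffle the data and split it into k = 5 groups
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Step 3: one group is held out, the remaining groups are used for training
    model = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    # Keep the score, discard the model
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# Step 5: summarize the scores of all folds
mean_score = np.mean(fold_scores)
print("Score per fold:", np.round(fold_scores, 3))
print("Mean score:", round(mean_score, 3))
```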


#### Cross Validation - K fold & Stratified K-fold

In [4]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold

In [5]:
#### K fold Cross Validation

kcv = cross_val_score(KNeighborsClassifier(), X_train, Y_train, cv=KFold(n_splits=5))

print('Classifying With KFold Cross Validation : ', kcv)
print('Final score With KFold Cross Validation : ', kcv.mean())

Classifying With KFold Cross Validation :  [0.89010989 0.87912088 0.91208791 0.96703297 0.91208791]
Final score With KFold Cross Validation :  0.9120879120879121


In [6]:
#### Stratified K fold Cross Validation

skcv = cross_val_score(KNeighborsClassifier(), X_train, Y_train, cv=5)
skcv1 = cross_val_score(KNeighborsClassifier(), X_train, Y_train, cv=StratifiedKFold(n_splits=5))

print('Without mention of Cross Validation technique : ', skcv) ## It uses StratifiedKFold default
print('Classifying With Stratified KFold Cross Validation : ', skcv1)
print('Final score With Stratified KFold Cross Validation : ', skcv1.mean())

Without mention of Cross Validation technique :  [0.87912088 0.89010989 0.91208791 0.96703297 0.91208791]
Classifying With Stratified KFold Cross Validation :  [0.87912088 0.89010989 0.91208791 0.96703297 0.91208791]
Final score With Stratified KFold Cross Validation :  0.9120879120879121
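
The difference between the two splitters is easiest to see on an imbalanced, sorted target. The toy labels below (80 zeros followed by 20 ones) are an assumption chosen to exaggerate the effect; the sketch compares the share of the positive class in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy target: 80 samples of class 0 followed by 20 samples of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

# Share of class 1 in each validation fold
plain = [float(y[val].mean()) for _, val in KFold(n_splits=5).split(X)]
strat = [float(y[val].mean()) for _, val in StratifiedKFold(n_splits=5).split(X, y)]

print("KFold           :", plain)   # class 1 appears only in the last fold
print("StratifiedKFold :", strat)   # every fold keeps the overall 20% share
```

Without shuffling, plain `KFold` cuts the data into contiguous blocks, so a sorted target can leave some folds with a single class; stratification avoids this.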


##### Trade-offs Between Cross-Validation and Train-Test Split

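1. Size of dataset: with a small dataset, cross-validation gives a more reliable estimate because every observation is used for both training and validation; with a very large dataset, a single hold-out split is often sufficient.
2. Computational time: K-fold cross-validation trains the model K times, so it costs roughly K times as much as a single train-test split.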

## HyperParameter Tuning / Optimization <a class="anchor" id="section_5_1"></a>

**Definition:**  Hyperparameter optimization, also called hyperparameter tuning, is the process of searching for a set of hyperparameters that gives the best model results on a given dataset.

Hyperparameters are parameters that are defined before training to specify how we want model training to happen
    Example: In random forest model, n_estimators (number of decision trees we want to have) is a hyperparameter. It can be set to any integer value but of course, setting it to 10 or 1000 changes the learning process significantly.


**Model parameters:** A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data. Model parameters are required by the model when making predictions and are estimated or learnt from data.
    Example: Coefficients in Linear/Logistic regression, Split points in Decision Tree <br>
    
**Model hyperparameters:** These are adjustable parameters that can be tuned in order to obtain a model with optimal performance. These parameters express “higher-level” properties of the model such as its complexity or how fast it should learn. 
    Example: k in k-nearest neighbors, depth of a tree


<img src="https://miro.medium.com/max/875/1*FIIGhzbuTo2vI62mFcbMTg.png" width="350"/> <br>


<font color='Brown'>**Types of Hyperparameter tuning:**</font>  

##### 1. **Manual method:** 
Select hyperparameters based on intuition/experience/guessing, train the model with the hyperparameters, and score on the validation data. Repeat process until you run out of patience or are satisfied with the results.
    Example: Changing lambda values in Ridge Regression, K values in KNN


#### 1. GridSearchCV <a class="anchor" id="section_5_1_1"></a>


In the grid search method, we create a grid of possible values for hyperparameters. Each iteration tries a combination of hyperparameters in a specific order. It fits the model on each combination of hyperparameters possible and records the model performance. Finally, it returns the best model with the best hyperparameters.


P1 = [v1, v2] <br>
P2 = [b1, b2]

Combinations --> v1b1, v1b2, v2b1, v2b2

**GridSearchCV (estimator, param_grid, scoring=None, n_jobs=None, cv=None)**

- *param_grid* -->  Dictionary with parameter names (str) as keys and lists of parameter values to try.

<img src="https://static.wixstatic.com/media/fd32e3_50364e42770b42c28e3d9837487e12d1~mv2.png/v1/fill/w_612,h_236,al_c/fd32e3_50364e42770b42c28e3d9837487e12d1~mv2.png" width="700"/> <br>
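
The way a grid expands into combinations can be sketched with the standard library. The placeholder names `P1`, `P2` and their values mirror the small example above; this only illustrates the enumeration, and `GridSearchCV` additionally fits and cross-validates a model for every combination.

```python
from itertools import product

# Placeholder grid matching the P1/P2 example
param_grid = {"P1": ["v1", "v2"], "P2": ["b1", "b2"]}

names = list(param_grid)
# Cartesian product of all value lists: one dict per candidate setting
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

for combo in combinations:
    print(combo)
print("Number of candidates:", len(combinations))
```

The number of candidates is the product of the list lengths, which is why grid search becomes expensive quickly as more hyperparameters are added.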

#### 2. RandomizedSearchCV <a class="anchor" id="section_5_1_2"></a>

In the random search method, we create a grid of possible values for hyperparameters. Each iteration tries a random combination of hyperparameters from this grid, records the performance, and lastly returns the combination of hyperparameters that provided the best performance.


**RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, random_state=None, return_train_score=False)**

- *param_distributions* -->  Dictionary with parameters names (str) as keys and distributions or lists of parameters to try.
- *cv:*  -->  Determines the cross-validation splitting strategy ((Stratified)KFold - default 5-fold)
- *scoring:* --> Evaluation metrics (default: accuracy_score for Classification & r2_score for Regression)
- *n_iter:* --> Number of parameter settings that are sampled (default=10)
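
The sampling idea behind random search can be sketched with the standard library. The grid values below and the seed are assumptions for illustration, and the independent draws used here are a simplification: the exact sampling procedure inside `RandomizedSearchCV` may differ.

```python
import random

# Hypothetical search space (lists of candidate values)
param_dist = {
    "n_estimators": list(range(50, 200, 10)),
    "max_depth": list(range(2, 8)),
    "criterion": ["gini", "entropy"],
}

rng = random.Random(42)
n_iter = 10  # number of parameter settings to sample

# Each iteration draws one value per hyperparameter at random
sampled = [
    {name: rng.choice(values) for name, values in param_dist.items()}
    for _ in range(n_iter)
]

grid_size = 1
for values in param_dist.values():
    grid_size *= len(values)

print("Full grid size:", grid_size)
print("Settings actually tried:", len(sampled))
```

Only `n_iter` settings are evaluated regardless of how large the full grid is, which is where the time saving over grid search comes from.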

<font color='DarkBlue'>**Steps of HyperParameter Tuning**</font> 

1. Select the right type of estimator.
2. Review the list of parameters of the model and build the HP space
3. Finding the methods for searching the hyperparameter space
4. Applying the cross-validation scheme approach
5. Assess the model score to evaluate the model

<font color='DarkBlue'>**Hyperparameters Nature:**</font> 

    Discrete: Number of estimators in ensemble models. E.g. 'n_estimators' : [50, 100]
    Continuous: Penalization coefficient, learning rate. E.g. 'learning_rate': [0.001, 0.01, 0.1]
    Categorical: Loss (deviance, exponential), Regularization (Lasso, Ridge). E.g. 'criterion': ['gini','entropy']
    

#### <font color='Brown'>**Hyperparameter tuning for Random Forest example**</font> 

In [13]:
## Breast cancer dataset (bundled with scikit-learn) as input dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
#print(data.DESCR)

X = data.data
y = data.target

# Split the data using scikit-learn's train_test_split (default: 75% train / 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y)

##### <font color='Darkblue'>**Basic Random forest model without tuning**</font>

In [14]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(X_train, y_train)

# Predict on the test set and call accuracy
y_pred = rf_model.predict(X_test)

print('Accuracy score without tuning:' , round(accuracy_score(y_test, y_pred), 3))

Accuracy score without tuning: 0.958


In [15]:
rf_model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1,
 'verbose': 0,
 'warm_start': False}

<font color='Brown'>**Parameters:**</font>  

**Important hyperparameters**
    
    n_estimators: Number of decision trees
    max_features: Maximum number of features considered while splitting
    max_depth: Max depth of the tree
    min_samples_leaf: Minimum number of data points in a leaf node
    min_samples_split: min number of data points placed in a node before the node is split
    
 
**Hyperparameters that do not affect model performance**

    verbose: Printing information while training continues (the higher, the more messages)
    n_jobs: Number of jobs to run in parallel (default=None, which means 1)
    random_state: Seed
 
 
<font color='Brown'>**Methods and attributes:**</font>  

- **get_params()** -->  Method that returns the current parameter settings of the estimator/model (the defaults, if none were changed).

- **best_estimator_** -->  Estimator which gave highest score (or smallest loss if specified) on the left out data

- **best_params_** -->  Parameter setting that gave the best results on the hold out data. 

- **cv_results_** --> Returns a dictionary of all the evaluation metrics
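
As a small illustration of these attributes, the sketch below runs a deliberately tiny grid search and inspects the results. The choice of `KNeighborsClassifier`, the three candidate values and `cv=3` are assumptions made to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=3,
).fit(X, y)

# cv_results_ is a dictionary of arrays; a DataFrame makes it easier to read
results = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))

print("Best parameters:", search.best_params_)
# best_estimator_ is a model already refitted with the best parameters
print("Best estimator :", search.best_estimator_)
```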

##### <font color='Darkblue'>**Model tuning with GridSearchCV**</font>

In [31]:
# Define the grid
param_grid = {
'n_estimators': [50, 100, 200],
'min_samples_leaf': [2, 5, 10],
'criterion': ['gini','entropy'],
'max_depth': [2, 3, 5, 8]}

In [32]:
from sklearn.model_selection import GridSearchCV

model_gridsearch = GridSearchCV(estimator=rf_model, param_grid=param_grid, scoring='accuracy',  cv=5, refit = True, return_train_score=True)

In [33]:
# Fit GridSearchCV and time the search
import time
start = time.time()

model_gridsearch.fit(X_train, y_train)

# Print the time spent and the number of candidate settings evaluated
print("GridSearchCV took %.2f seconds for %d candidate parameter settings." % ((time.time() - start), len(model_gridsearch.cv_results_['params'])))

GridSearchCV took 58.09 seconds for 72 candidate parameter settings.


In [34]:
# Predict on the test set and call accuracy
y_pred_grid = model_gridsearch.predict(X_test)
accuracy_score(y_test, y_pred_grid)

0.958041958041958

In [35]:
model_gridsearch.best_params_

{'criterion': 'gini',
 'max_depth': 5,
 'min_samples_leaf': 2,
 'n_estimators': 200}

In [36]:
model_gridsearch.best_estimator_

##### <font color='Darkblue'>**Model tuning with RandomizedSearchCV**</font>

In [37]:
param_dist = {
'n_estimators': list(range(50, 200, 10)),
'min_samples_split': list(range(2, 10)),    
'criterion': ['gini','entropy'],
'max_depth': list(range(2, 8))}

In [38]:
from sklearn.model_selection import RandomizedSearchCV

model_random_search = RandomizedSearchCV(estimator=rf_model, param_distributions=param_dist,  cv = 5, refit = True)

In [39]:
# Fit RandomizedSearchCV and time the search

start = time.time()

model_random_search.fit(X_train, y_train)

# Print the time spent and the number of candidate settings evaluated
print("RandomizedSearchCV took %.2f seconds for %d candidate parameter settings." % ((time.time() - start), len(model_random_search.cv_results_['params'])))

RandomizedSearchCV took 8.81 seconds for 10 candidate parameter settings.


In [40]:
model_random_search.best_params_

{'n_estimators': 120,
 'min_samples_split': 2,
 'max_depth': 6,
 'criterion': 'gini'}

In [41]:
# Predict on the test set and call accuracy
y_pred_random = model_random_search.predict(X_test)
accuracy_score(y_test, y_pred_random)

0.965034965034965

In [42]:
# Parameter settings that were sampled by the random search

model_random_search.cv_results_['params']

[{'n_estimators': 160,
  'min_samples_split': 9,
  'max_depth': 2,
  'criterion': 'gini'},
 {'n_estimators': 180,
  'min_samples_split': 8,
  'max_depth': 4,
  'criterion': 'gini'},
 {'n_estimators': 90,
  'min_samples_split': 9,
  'max_depth': 6,
  'criterion': 'entropy'},
 {'n_estimators': 110,
  'min_samples_split': 5,
  'max_depth': 6,
  'criterion': 'gini'},
 {'n_estimators': 80,
  'min_samples_split': 6,
  'max_depth': 2,
  'criterion': 'gini'},
 {'n_estimators': 120,
  'min_samples_split': 2,
  'max_depth': 6,
  'criterion': 'gini'},
 {'n_estimators': 150,
  'min_samples_split': 7,
  'max_depth': 7,
  'criterion': 'gini'},
 {'n_estimators': 190,
  'min_samples_split': 6,
  'max_depth': 2,
  'criterion': 'gini'},
 {'n_estimators': 80,
  'min_samples_split': 6,
  'max_depth': 5,
  'criterion': 'entropy'},
 {'n_estimators': 60,
  'min_samples_split': 8,
  'max_depth': 6,
  'criterion': 'entropy'}]

##### <font color='Brown'>**Hyperparameters for other algorithm** </font>

**Logistic Regression**

    Regularization :  penalty in [None, 'l1', 'l2', 'elasticnet'] (older scikit-learn versions use the string 'none')
    Penalty strength (inverse of regularization strength) : C in [100, 10, 1.0, 0.1, 0.01]

**K-Nearest Neighbors (KNN)**

    Number of neighbors : n_neighbors in [1 to 21]
    Distance metrics : metric in [‘euclidean’, ‘manhattan’, ‘minkowski’]
    
**Gradient Boosting**

    learning_rate in [0.001, 0.01, 0.1]
    Number of trees in the model : n_estimators [10, 100, 1000]

## Model / Algorithm Selection<a class="anchor" id="section_6_1"></a>

**Learning Algorithm Selection:**
Choosing the right ML algorithm for your task can be overwhelming. There are dozens of options, each with their own advantages and disadvantages.

1. Based on business case and a solid understanding of what you are trying to accomplish.

2. Identify the type of Problem - Regression / Classification

3. Data understanding - Type of relationship, size of data & dimensions, data quality

4. Model & Time complexity - Training and Prediction time against resource available, Amount of parameter tuning needed

5. Model interpretability


<img src="Images/Comparison.PNG" width="1800"/>

**Model Selection:** is the process of choosing one of the models as the final model that addresses the problem.

1. Performance of the model across various evaluation metrics

2. Time & resources used - for training, prediction and hyperparameter tuning