## Supervised Machine Learning Models - Part 2

### Table of Contents

* [PART 1](#chapter1)  

    * [Ensemble Learning algorithms](#section_1_1)
        1. [Bagging algorithms](#Section_1_1_1)
        2. [Random Forest](#section_2_1_1)
        3. [Boosting algorithms](#section_3_1_1)
             * [Gradient Boosting](#section_3_2_1)
             * [XGBoost & AdaBoost](#section_3_2_2)
        4. [Stacking](#section_4_4_1)
<br>

* [PART 2](#chapter2)

    * [Cross Validation](#section_4_1)
    * [HyperParameter Tuning](#section_5_1)
        * [GridSearchCV](#section_5_1_1)
        * [RandomizedSearchCV](#section_5_1_2)
    * [Model Comparison](#section_6_1)

In [8]:
'''
Important Notebook tips

<font color='Brown'>**Binary Classifier:**</font>

<br>

<img src="https://editor.analyticsvidhya.com/uploads/85598tomek.png" width="300"/> <br>

<p float="left">
<img src="Images/ML_3.png" width="300"/>
<img src="Images/ML2.png" width="450"/>
<p>
'''


## PART 1 <a class="anchor" id="chapter1"></a>

## Ensemble Learning algorithms <a class="anchor" id="section_1_1"></a>

Ensemble learning models are simply combinations of different machine learning models.






### 1. Bagging algorithms <a class="anchor" id="Section_1_1_1"></a>

### 2. Random Forest <a class="anchor" id="section_2_1_1"> </a>
- A random forest is an ensemble of models of the same type: decision trees.
- Decision trees are a popular method for various machine learning tasks, mostly because they are highly interpretable. A decision tree is a series of filters on the predictor variables that ends in a class prediction. Each filter is a binary yes/no question, which creates bifurcations in the series of filters and thus a tree-like structure. The filters depend on the type of predictor variable. If the variable is categorical, such as gender, the filter could be a question like “is gender female?”. If the variable is continuous, such as gene expression, the filter could be “is PIGX expression larger than 210?”. Every point where we filter samples based on such a question is called a “decision node”. The tree-fitting algorithm picks the variable for each decision node according to how well it splits the samples into classes once the decision node is applied. Decision trees handle both categorical and numeric predictor variables, they are easy to interpret, and they can deal with missing values. Despite these advantages, decision trees tend to overfit and learn irregular patterns if they are grown very deep. There are many variants of tree-based machine learning algorithms.

- Random forests were devised to counter the shortcomings of decision trees. They are simply ensembles of decision trees. Each tree is trained on a different randomly selected part of the data, with randomly selected predictor variables. The goal of introducing randomness is to reduce the variance of the model so that it does not overfit, at the expense of a small increase in bias and some loss of interpretability. This strategy generally improves the performance of the final model.
- The random forests algorithm tries to decorrelate the trees so that they learn different things about the data. It does this by selecting **a random subset of variables.** If one or a few predictor variables are very strong predictors of the response variable, they would be selected in many of the trees, causing the trees to become correlated. Random subsampling of predictor variables ensures that the overall best predictors are not always selected for every tree, so the model has a chance to learn other features of the data.
- Another sampling method used when building random forest models is **bootstrap resampling** before constructing each tree. This brings the advantage of out-of-bag (OOB) error estimation. For some percentage of the trees, a given training sample is OOB, meaning it was not used to train those trees, so its prediction error can be estimated from exactly those trees. OOB estimates are claimed to be a good alternative to cross-validation error estimates.
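
The bootstrap and OOB ideas above can be tried out with scikit-learn. The following is a minimal sketch, assuming scikit-learn is installed and using a synthetic dataset from `make_classification` as a hypothetical stand-in for real data. It compares the OOB accuracy estimate with a 5-fold cross-validation estimate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (assumption: stands in for a real dataset)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # random subset of variables considered at each split
    bootstrap=True,       # each tree is trained on a bootstrap resample of the rows
    oob_score=True,       # score each sample using only trees that did not see it
    random_state=0,
)
forest.fit(X, y)

oob_accuracy = forest.oob_score_
cv_accuracy = cross_val_score(forest, X, y, cv=5).mean()

print(f"OOB accuracy:       {oob_accuracy:.3f}")
print(f"5-fold CV accuracy: {cv_accuracy:.3f}")
```

The two numbers are usually close, and the OOB estimate comes for free from a single fit, whereas cross-validation refits the forest once per fold.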

<img src="Images/rf.png" width="1000"/>

**Parameters:**
- For random forests, we have two critical arguments. The first is the number of predictor variables to sample at each split of the tree. This parameter controls how independent the trees are from one another and, as explained above, this limits overfitting.
- The second is the minimum size of the terminal nodes in the trees (min.node.size). This controls the depth of the trees grown. Setting it to larger numbers might cost a small loss in accuracy, but the algorithm will run faster.
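
As a rough illustration of the second parameter, the sketch below measures how the minimum terminal node size affects tree depth. It assumes scikit-learn, where the closest counterparts of the two arguments are `max_features` and `min_samples_leaf` (the name `min.node.size` above comes from a different library, so this mapping is approximate). The helper `mean_tree_depth` and the synthetic data are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption: stands in for a real dataset)
X, y = make_classification(
    n_samples=400, n_features=16, n_informative=4, random_state=1
)


def mean_tree_depth(min_samples_leaf, max_features):
    """Fit a small forest and return the average depth of its trees."""
    forest = RandomForestClassifier(
        n_estimators=50,
        max_features=max_features,          # variables sampled at each split
        min_samples_leaf=min_samples_leaf,  # minimum size of terminal nodes
        random_state=1,
    ).fit(X, y)
    depths = [tree.get_depth() for tree in forest.estimators_]
    return sum(depths) / len(depths)


shallow = mean_tree_depth(min_samples_leaf=20, max_features=4)
deep = mean_tree_depth(min_samples_leaf=1, max_features=4)

print(f"mean depth with min_samples_leaf=20: {shallow:.1f}")
print(f"mean depth with min_samples_leaf=1:  {deep:.1f}")
```

Larger terminal nodes stop the trees from growing as deep, which is where the speed-up comes from.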



### 3. Boosting algorithms <a class="anchor" id="section_3_1_1"></a>

#### 3.A Gradient Boosting <a class="anchor" id="section_3_2_1"></a>
- Gradient boosting is a prediction model that uses an ensemble of decision trees, similar to random forest. However, the decision trees are added sequentially, which is why these models are also called “Multiple Additive Regression Trees (MART)” (Friedman and Meulman 2003). You will also see similar methods called “Gradient boosting machines (GBM)” (J. H. Friedman 2001) or “Boosted regression trees (BRT)” (Elith, Leathwick, and Hastie 2008) in the literature.

- Generally, “boosting” refers to an iterative learning approach where each new model focuses on the data points that the previous ensemble of simple models did not predict well. Gradient boosting refines this idea: each new model is fit to the residual errors, that is, the prediction errors of the current ensemble of models. In gradient boosting specifically, the simple models are trees. As in random forests, many trees are grown, but here they are grown sequentially and each tree focuses on fixing the shortcomings of the previous trees.

<img src="Images/gb.png" width="1000"/>

- One of the most widely used algorithms for gradient boosting is XGBoost, which stands for “extreme gradient boosting” (Chen and Guestrin 2016). XGBoost, like other gradient boosting methods, has many parameters to regularize and optimize the complexity of the model. Finding the best parameters for your problem might take some time. However, this flexibility comes with benefits; methods based on XGBoost have won many machine learning competitions.
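
The sequential residual-fitting idea behind gradient boosting can be sketched by hand in a few lines. This is a simplified illustration for regression with squared error, not how XGBoost or any particular library implements it. The data, `learning_rate`, `n_trees`, and the tree depth are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem (assumption): noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_trees = 100

# Start from a constant prediction: the mean of the targets
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)
trees = []

for _ in range(n_trees):
    residuals = y - prediction                      # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                          # the new tree learns those errors
    prediction += learning_rate * tree.predict(X)   # add a shrunken correction
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE before boosting: {initial_mse:.4f}")
print(f"training MSE after boosting:  {final_mse:.4f}")
```

Each tree on its own is a weak model (depth 2), but the sum of many small corrections fits the curve closely.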

**Parameters**

- The most important parameters are the number of trees (nrounds), tree depth (max_depth), and learning rate or shrinkage (eta). Generally, the more trees we have, the better the algorithm will learn, because each tree tries to fix classification errors that the previous tree ensemble could not. However, too many trees might cause overfitting. The learning rate parameter, eta, combats that by shrinking the contribution of each new tree, and it can be set to lower values if you have many trees. You can either set a large number of trees and then tune the learning rate, or set the learning rate low, say to 0.01 or 0.1, and tune the number of trees. Tree depth also controls overfitting: the deeper the trees, the more likely the model is to overfit. This has to be tuned as well; the default is 6, and you can explore a range around it. Apart from these, as in random forests, you can subsample the training data and/or the predictor variables. These strategies can also help you counter overfitting.
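
The sketch below shows one way to tune the number of trees for a fixed learning rate. It uses scikit-learn's `GradientBoostingClassifier` instead of XGBoost so that it has no extra dependency. The parameter names differ from the ones above: `n_estimators` plays the role of nrounds and `learning_rate` the role of eta, and scikit-learn's default `max_depth` is 3 rather than 6. The synthetic data and all parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption: stands in for a real dataset)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

gbm = GradientBoostingClassifier(
    n_estimators=300,   # number of trees
    learning_rate=0.1,  # shrinkage applied to each new tree
    max_depth=3,        # depth of each tree
    subsample=0.8,      # each tree is fit on a random 80% of the training rows
    random_state=0,
)
gbm.fit(X_train, y_train)

# Validation accuracy after 1, 2, ..., 300 trees
val_scores = [accuracy_score(y_val, pred) for pred in gbm.staged_predict(X_val)]
best_n_trees = val_scores.index(max(val_scores)) + 1

print(f"best number of trees: {best_n_trees}")
print(f"validation accuracy:  {max(val_scores):.3f}")
```

`staged_predict` yields the ensemble's predictions after each added tree, so a single fit is enough to see where adding more trees stops helping.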

#### 3.B XGBoost & AdaBoost <a class="anchor" id="section_3_2_2"></a>

### 4. Stacking <a class="anchor" id="section_4_4_1"></a>

## PART 2 <a class="anchor" id="chapter2"></a>

### Cross Validation <a class="anchor" id="section_4_1"></a>






**Cross-validation** is a statistical method used to estimate the performance (or accuracy) of machine learning models. It helps protect against overfitting in a predictive model, particularly when the amount of data is limited. In cross-validation, you split the data into a fixed number of folds (or partitions), run the analysis once per fold with that fold held out, and then average the error estimates.

<font color='Brown'>**Need of CV:**</font>

1. To Avoid Overfitting:
A model trained on a training set tends to overfit, so we use regularisation approaches to limit this and need held-out data to check whether they worked. When only a few training instances are available, we must be cautious about shrinking the training set further by setting samples aside for testing; cross-validation lets every sample be used for both training and testing.

2. Support Model tuning:
Finding the best combination of model parameters is a common step to tune an algorithm toward learning the dataset’s hidden patterns. But doing this step on a simple training-testing split is typically not recommended. The model performance is usually very sensitive to such parameters and adjusting those based on a predefined dataset split should be avoided. It can cause the model to overfit and reduce its ability to generalize.

<font color='Brown'>**Types of CV:**</font>

1. Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that a single train/test split is created to evaluate the model.
2. K-Fold Cross Validation
3. Stratified K-fold Cross-Validation
4. Leave-One-Out Cross-Validation
5. Holdout Method
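
Most of the schemes listed above have a ready-made splitter in scikit-learn. The sketch below is an illustration on a tiny, deliberately imbalanced toy dataset (an assumption made for the example). It counts how many positive samples land in each test fold for each scheme.

```python
import numpy as np
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    StratifiedKFold,
    train_test_split,
)

# Toy data (assumption): 10 samples, only 2 of them in the positive class
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Holdout method: one fixed train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

splitters = {
    "KFold": KFold(n_splits=2, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=2, shuffle=True, random_state=0),
    "LeaveOneOut": LeaveOneOut(),
}

# Number of positive samples in each test fold, per splitting scheme
positives_per_fold = {
    name: [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]
    for name, splitter in splitters.items()
}

print("holdout test size:", len(X_test))
for name, counts in positives_per_fold.items():
    print(f"{name}: {len(counts)} folds, positives per test fold = {counts}")
```

Stratified K-fold keeps the class ratio the same in every fold, which matters for imbalanced data like this. Plain K-fold gives no such guarantee, and leave-one-out produces as many folds as there are samples.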

### K-Fold Cross Validation

<font color='Brown'>**How it works:**</font>


1. Pick a number of folds, k. Usually k is 5 or 10, but you can choose any number from 2 up to the number of samples in the dataset.
2. Split the dataset into k equal (if possible) parts (they are called folds)
3. Choose k – 1 folds as the training set. The remaining fold will be the test set
4. Train the model on the training set. On each iteration of cross-validation, you must train a new model independently of the model trained on the previous iteration
5. Validate on the test set and save the result
6. Repeat steps 3 – 5 *k* times, each time using a different fold as the test set. In the end, you should have validated the model on every fold that you have.
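
The steps above can be written out by hand to make the procedure concrete. This is a minimal sketch, assuming NumPy and scikit-learn. The logistic regression model and the synthetic data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Steps 1-2: pick k, shuffle the row indices, and split them into k nearly equal folds
k = 5
rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(len(X)), k)

base_model = LogisticRegression(max_iter=1000)
scores = []

for i in range(k):
    # Step 3: fold i is the test set, the other k - 1 folds form the training set
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Step 4: train a fresh, unfitted copy of the model
    model = clone(base_model).fit(X[train_idx], y[train_idx])

    # Step 5: validate on the held-out fold and save the result
    scores.append(model.score(X[test_idx], y[test_idx]))

# Step 6 is the loop itself: every fold served as the test set exactly once
print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy:", round(float(np.mean(scores)), 3))
```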


<img src="https://editor.analyticsvidhya.com/uploads/16042grid_search_cross_validation.png" width="500"/> <br>


<font color='Brown'>**Advantages of Cross-Validation :**</font>

1. Use All Your Data
2. Parameters Fine-Tuning

In [19]:
# from sklearn.model_selection import cross_val_score
# print(cross_val_score(model, X_train, y_train, cv=5))
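
The commented-out call above relies on `model`, `X_train`, and `y_train` being defined elsewhere in the notebook. A self-contained version might look like the following sketch, where the logistic regression model and the synthetic data are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (assumption: stands in for the notebook's dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)

# One score per fold; cross-validation only touches the training portion
scores = cross_val_score(model, X_train, y_train, cv=5)

print("per-fold accuracy:", scores.round(3))
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The test split is left untouched here so that it can still serve as a final, independent check after model selection.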

### HyperParameter Tuning <a class="anchor" id="section_5_1"></a>

Hyperparameter tuning is the search for the set of hyperparameter values that gives the lowest loss (or best score) on held-out data.

**Model parameters:** These are the parameters that the model estimates from the given data. <br>
**Model hyperparameters:** These cannot be estimated by the model from the given data. They are set before training and control how the model parameters are estimated.
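
The distinction can be seen in a small example. In this sketch, which assumes scikit-learn's `LogisticRegression` and synthetic data, the regularisation strength `C` is a hyperparameter we choose, while the coefficients and intercept are model parameters learned during `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption: stands in for a real dataset)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Hyperparameter: chosen by us before training
model = LogisticRegression(C=0.5, max_iter=1000)
hyperparameters = model.get_params()

# Model parameters: estimated from the data during fit
model.fit(X, y)

print("hyperparameter C:", hyperparameters["C"])
print("learned coefficients:", model.coef_.round(3))
print("learned intercept:", model.intercept_.round(3))
```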

<font color='Brown'>**How it works:**</font>

Hyperparameter tuning relies on cross-validation to score each candidate setting. Cross-validation has two main steps: splitting the data into subsets (called folds) and rotating the training and validation among them. The splitting technique commonly has the following properties:

- Each fold has approximately the same size.
- Data can be assigned to folds randomly or in a stratified way.
- All folds are used to train the model except one, which is used for validation. The validation fold is rotated until every fold has served as the validation fold once and only once.
- Each example should belong to one and only one fold.

The terms k-fold and CV are often used interchangeably; k simply states how many folds the dataset is split into. A common choice is k=10, which sends 90% of the data to training and 10% to validation in each iteration. The next figure describes the process of iterating over the picked ten folds of the dataset.

<font color='Brown'>**Types of Hyperparameter tuning:**</font>  

1. **Manual:** select hyperparameters based on intuition/experience/guessing, train the model with the hyperparameters, and score on the validation data. Repeat process until you run out of patience or are satisfied with the results.
2. **Grid Search:** set up a grid of hyperparameter values and for each combination, train a model and score on the validation data. In this approach, every single combination of hyperparameters values is tried which can be very inefficient!
3. **Random search:** set up a grid of hyperparameter values and select random combinations to train the model and score. The number of search iterations is set based on time/resources.

<font color='Brown'>**Important Parameters:**</font>  

- **get_params** -->  Get parameters for this estimator.

- **cv** -->  Determines the cross-validation splitting strategy; *None* uses the default 5-fold cross-validation.

- **best_estimator_** -->  Estimator which gave highest score (or smallest loss if specified) on the left out data

- **best_score_** -->  Mean cross-validated score of the best_estimator. 

- **best_params_** -->  Parameter setting that gave the best results on the hold out data. 

#### 1. GridSearchCV <a class="anchor" id="section_5_1_1"></a>


In the grid search method, we create a grid of possible values for hyperparameters. Each iteration tries a combination of hyperparameters in a specific order. It fits the model on each combination of hyperparameters possible and records the model performance. Finally, it returns the best model with the best hyperparameters.

- **param_grid** -->  Dictionary with parameter names (str) as keys and lists of parameter settings to try as values.
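
A minimal grid search might look like the sketch below. The random forest estimator, the grid values, and the synthetic data are assumptions chosen for illustration; only `GridSearchCV` and its attributes come from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stands in for a real dataset)
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

# 3 x 2 = 6 combinations, each evaluated with 5-fold cross-validation
param_grid = {
    "max_features": [2, 4, 6],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

n_combinations = len(search.cv_results_["params"])
best_model = search.best_estimator_  # refit on all of X, y with the best setting

print("combinations tried:", n_combinations)
print("best_params_:", search.best_params_)
print("best_score_:", round(search.best_score_, 3))
```

The cost grows multiplicatively with every hyperparameter added to the grid, which is why grid search becomes inefficient quickly.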

#### 2. RandomizedSearchCV <a class="anchor" id="section_5_1_2"></a>

In the random search method, we create a grid of possible values for hyperparameters. Each iteration tries a random combination of hyperparameters from this grid, records the performance, and lastly returns the combination of hyperparameters that provided the best performance.


- **param_distributions** -->  Dictionary with parameter names (str) as keys and distributions or lists of parameters to try.
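
A random search can be sketched in the same way. Here `n_iter` fixes the search budget regardless of how large the search space is. The estimator, the distributions (taken from `scipy.stats`), and the synthetic data are assumptions chosen for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption: stands in for a real dataset)
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

param_distributions = {
    "max_features": randint(1, 13),      # integers 1..12, upper bound exclusive
    "min_samples_leaf": randint(1, 21),  # integers 1..20
    "max_depth": [None, 3, 5, 10],       # a plain list is sampled uniformly
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_distributions=param_distributions,
    n_iter=8,        # number of random combinations to try
    cv=5,
    random_state=0,  # makes the sampled combinations reproducible
)
search.fit(X, y)

n_sampled = len(search.cv_results_["params"])

print("combinations tried:", n_sampled)
print("best_params_:", search.best_params_)
print("best_score_:", round(search.best_score_, 3))
```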

### Model Comparison <a class="anchor" id="section_6_1"></a>

1. Time complexity

2. Space complexity

3. Sample complexity

4. Bias-variance tradeoff

5. Methodology, Assumptions and Objectives
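
Some of these criteria can be measured empirically. The sketch below uses `cross_validate` to record cross-validated accuracy and average fit time for three candidate models. It covers only the time side of the list, measured in wall-clock seconds rather than as a formal complexity. The choice of models and the synthetic data are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data (assumption: stands in for a real dataset)
X, y = make_classification(
    n_samples=400, n_features=15, n_informative=5, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring="accuracy")
    results[name] = {
        "accuracy": cv["test_score"].mean(),  # mean accuracy over the 5 folds
        "fit_time": cv["fit_time"].mean(),    # mean seconds to fit one fold
    }

for name, res in results.items():
    print(f"{name:20s} accuracy={res['accuracy']:.3f} fit_time={res['fit_time']:.3f}s")
```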