# Random Forest - Training and Feature Selection

In this notebook, we perform two tasks.
- Task 1: Train a Random Forest model optimally by using the **bagging** method.
- Task 2: Perform feature selection using the Random Forest model.

We motivate the bagging method by emphasizing the better generalizability of the Random Forest model.


## How Does the Random Forest Model Improve Generalizability?

The Random Forest achieves better generalizability by reducing the variance of individual Decision Trees.

To reduce variance, the component trees are all designed to be **randomly different from one another**. This leads to **de-correlation between the individual tree predictions** and, in turn, to improved generalization.
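
The link between de-correlation and variance can be made concrete with a standard result for averages. It is stated here under the simplifying assumption that the $B$ tree predictions $T_b(x)$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$ (these symbols are introduced only for this illustration):

$$\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

Adding more trees shrinks only the second term. The first term falls only when the trees become less correlated, which is what the injected randomness aims for.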

Forest randomness also helps achieve high robustness with respect to **noisy data**.


## Randomization Techniques for Training Random Forests

Randomness is injected into the trees during the training phase. Two of the most popular randomness injection techniques are:
- Random training dataset sampling
- Randomized node optimization (RNO)

Unlike bagging, RNO uses all available training data and controls the randomness by varying the number of features used for training.


## Randomization Technique: Random Training Dataset Sampling 


For random sampling from the training dataset, the bootstrapping technique is used. In bootstrapping, we use the same training algorithm for every predictor but **train them on different random subsets** of the training set.
- When sampling is performed with replacement, this method is called bagging. Bagging stands for **bootstrap aggregation**.
- When sampling is performed without replacement, it is called pasting.
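
The difference between the two sampling schemes can be sketched with NumPy index sampling. The training-set size, subset size, and seed below are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility
n_samples = 10                   # hypothetical training-set size
subset_size = 6                  # hypothetical size of each random subset

indices = np.arange(n_samples)

# Bagging: draw WITH replacement, so an index may appear more than once
bagging_idx = rng.choice(indices, size=subset_size, replace=True)

# Pasting: draw WITHOUT replacement, so every drawn index is unique
pasting_idx = rng.choice(indices, size=subset_size, replace=False)

print("Bagging sample:", np.sort(bagging_idx))
print("Pasting sample:", np.sort(pasting_idx))
```

Each predictor in the ensemble would be trained on the rows selected by one such index array.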


<img src="http://engineering.unl.edu/images/uploads/Bagging.png" width=700 height=400>


## Bagging vs Pasting

Sampling with replacement introduces a bit more diversity in the subsets that each predictor is trained on. As a result, bagging ends up with a slightly higher bias than pasting. The extra diversity also means that the predictors end up being less correlated, so the ensemble’s variance is reduced. Overall, bagging often results in better models, so it is generally preferred.


## Training Random Forest using the Bagging Method

We use Scikit-Learn's RandomForestClassifier to implement the bagging method for training the Random Forest model. We induce randomness in the following two ways:

- Sampling from dataset: learn from random subsets of the data
- Sampling from features: use a random subset of features to consider when looking for the best split

## Augmented Bagging

To induce additional randomness, the bagging method can be augmented as follows.
- Learn from random thresholds of each feature (using extremely randomized trees or **Extra-Trees**).

We implement the Extra-Trees method in the next notebook.



# Feature Selection

The Random Forest model can be used to perform feature selection. It enables us to determine the **relative importance of each feature**.

A Random Forest measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). 

More precisely, it is a **weighted average**, where each node’s weight is equal to the number of training samples that are associated with it.

Scikit-Learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1. 

We can access the result using the $feature\_importances\_$ attribute of the model.
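
As a minimal sketch of the two properties described above, the following trains a small forest on Iris and checks that the importances sum to 1 and that they agree with a hand-computed, renormalised average of the per-tree importances. The `demo_` names, `n_estimators=50`, and `random_state=0` are illustrative choices, not settings used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Small forest with a fixed seed (illustrative settings)
demo_forest = RandomForestClassifier(n_estimators=50, random_state=0)
demo_forest.fit(X_demo, y_demo)

importances = demo_forest.feature_importances_
print("Importances:", importances)
print("Sum of importances:", importances.sum())

# Average the per-tree importances by hand, then rescale to sum to 1
per_tree = np.array([tree.feature_importances_ for tree in demo_forest.estimators_])
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()
print("Matches manual average:", np.allclose(importances, manual))
```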

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Dataset


We use the Iris dataset, which is a multivariate data set. 

This is a famous dataset that contains the sepal and petal length and width of 150 iris flowers of three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica.

There are 4 features: 
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)

Total number of samples: 150

The dataset is also known as Fisher's Iris data set, as it was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.


<img src="http://engineering.unl.edu/images/uploads/IrisFlowers.png" width=800 height=400>


## Explore the Dataset

In [2]:
iris = load_iris()

# See the key values
print("\nKey Values: \n", list(iris.keys()))

# The feature names
print("\nFeature Names: \n", list(iris.feature_names))

# The target names
print("\nTarget Names: \n", list(iris.target_names))

# The target values (codes)
#print("\nTarget Values: \n", list(iris.target))


Key Values: 
 ['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']

Feature Names: 
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Target Names: 
 ['setosa', 'versicolor', 'virginica']


## Create Data Matrix (X) and the 1D Target Array (y)


In [3]:
# Feature matrix
X = iris["data"]

# Target Array
y = iris["target"]


print(X.shape)
print(y.shape)

print("\nX data type: ", X.dtype)
print("y data type: ", y.dtype)

(150, 4)
(150,)

X data type:  float64
y data type:  int64


## Split Data Into Training and Test Sets

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

## Task 1: Train a Random Forest Classifier Model Optimally Using the Bagging Method 

A Random Forest is an **ensemble of Decision Trees**. It is generally trained via the bagging method (or sometimes pasting). 

The bagging method induces randomness (to implement tree diversity) by training the same Decision Tree algorithm on different random samples from the data.

In addition to the sample-based randomness, a Random Forest model induces feature-based randomness.

Instead of searching for the very best feature when splitting a node, it searches for the best feature among a **random subset of features**. This results in a greater tree diversity, which trades a higher bias for a lower variance, generally yielding an overall better model. 


### Scikit-Learn for Using the Bagging Method to Train a Random Forest Model

There are two ways to use the bagging method to train a Random Forest classifier.

- Create an object of the BaggingClassifier class and pass it a Decision Tree created from the DecisionTreeClassifier class.

- Create a Random Forest object using the **RandomForestClassifier class**.

The second approach is more convenient and optimized for Decision Trees. Similarly, there is a RandomForestRegressor class for regression tasks. 
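
The two approaches can be sketched side by side as follows. The settings (`n_estimators=50`, `max_features="sqrt"`, `random_state=0`) and the `demo` variable names are assumptions made for illustration. The two ensembles are configured to be roughly equivalent, but they are not expected to produce identical trees.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Approach 1: wrap a Decision Tree in a BaggingClassifier.
# max_features="sqrt" on the tree provides the per-split feature sampling.
bag_of_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=50,
    bootstrap=True,      # sample training instances with replacement
    random_state=0,
)
bag_of_trees.fit(X_demo, y_demo)

# Approach 2: the dedicated Random Forest class
forest = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",
    bootstrap=True,
    random_state=0,
)
forest.fit(X_demo, y_demo)

print("Trees in bagging ensemble:", len(bag_of_trees.estimators_))
print("Trees in random forest:   ", len(forest.estimators_))
```

The base tree is passed positionally because the keyword name of that argument differs between Scikit-Learn versions.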



## Model Selection

We use some of the hyperparameters of the RandomForestClassifier class for model selection. 

For a full list of the hyperparameters visit: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier


- n_estimators : The number of trees in the forest. Default=10 (Scikit-Learn versions before 0.22; the default is 100 from version 0.22 onward)


- criterion : The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific. Default=”gini”


- max_depth : The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Default=None


- min_samples_split : The minimum number of samples required to split an internal node: Default=2

        -- If int, then consider min_samples_split as the minimum number.
        
        -- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
        
        
        
- min_samples_leaf : The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. Default=1


         -- If int, then consider min_samples_leaf as the minimum number.
         
         -- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.


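- max_features : The number of features to consider when looking for the best split. Default="auto" in the Scikit-Learn version used here ("auto" was deprecated in version 1.1 and removed in version 1.3, where the default is "sqrt").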

        -- If int, then consider max_features features at each split.
        
        -- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
        
        -- If “auto”, then max_features=sqrt(n_features).
        
        -- If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
        
        -- If “log2”, then max_features=log2(n_features).
        
        -- If None, then max_features=n_features.


- max_leaf_nodes : Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. Default=None


- class_weight : Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y. Default=None

         -- The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))



- oob_score : Whether to use out-of-bag samples to estimate the generalization accuracy. Default=False

        -- When using a subset of the available samples the generalization accuracy can be estimated with the out-of-bag samples by setting oob_score=True. Then, we can get the score of the training dataset obtained using an out-of-bag estimate (use the $oob\_score\_$ attribute)
        
        
- bootstrap : Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. Default=True
 

- verbose : Controls the verbosity when fitting and predicting. Default=0
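
As a small worked example of the "balanced" mode of class_weight described in the list above, the following applies the quoted formula to a hypothetical imbalanced label array and compares the result with Scikit-Learn's `compute_class_weight` helper. The label array is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 6 of class 0, 3 of class 1, 1 of class 2
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

n_samples = len(y_demo)
n_classes = len(np.unique(y_demo))

# The formula quoted in the class_weight description
manual_weights = n_samples / (n_classes * np.bincount(y_demo))

# Scikit-Learn's helper for the same computation
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y_demo), y=y_demo
)

print("Manual weights: ", manual_weights)
print("Sklearn weights:", sklearn_weights)
```

The rarest class receives the largest weight, so errors on it count more during training.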



### Out-of-Bag (oob) Evaluation

When using the bagging method, some instances may be sampled several times for any given model, while **others may not be sampled at all**. In general, only about 63% of the training instances are sampled on average for each model. The remaining 37% of the training instances are not sampled. These are called the out-of-bag (oob) instances. 

The bagging method-based Random Forest can be evaluated using oob instances. We don't need a separate validation set. The oob evaluation can be used as an estimation for the test accuracy of the model.

- For implementing the oob evaluation, we need to set oob_score=True.
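
The 63% / 37% split quoted above follows from the probability that a given instance is never picked in $n$ draws with replacement, $(1 - 1/n)^n$, which approaches $1/e \approx 0.37$ as $n$ grows. The sketch below checks this for $n = 120$ (the size of the training set in this notebook, assuming each bootstrap sample has the same size as the training set) and compares it with a simple simulation. The seed and number of trials are arbitrary.

```python
import numpy as np

n = 120  # training-set size in this notebook (80% of 150)

# Probability that a given instance is never drawn in n draws with replacement
p_oob_exact = (1 - 1 / n) ** n
print("Exact oob fraction for n=120:", round(p_oob_exact, 4))
print("Limit 1/e:                   ", round(float(np.exp(-1)), 4))

# Empirical check: draw many bootstrap samples and count the unseen instances
rng = np.random.default_rng(0)
n_trials = 2000
oob_fractions = []
for _ in range(n_trials):
    sample = rng.integers(0, n, size=n)            # one bootstrap sample of indices
    oob_fractions.append(1 - len(np.unique(sample)) / n)

print("Simulated mean oob fraction: ", round(float(np.mean(oob_fractions)), 4))
```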


### Note:
We use the default settings for the following **two key hyperparameters**.

- Sampling from data: bootstrap=True (it ensures that the bagging method is used)

- Sampling from features: max_features="auto" (selects a subset of the features, i.e., sqrt(n_features) features to consider when looking for the best split)


In [5]:
%%time

param_grid = {'n_estimators': [10, 20, 50, 100],
              'criterion': ['entropy', 'gini'],
              'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8],
              'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
              'max_leaf_nodes': [2, 5, 10, 15]}


# Create a Random Forest classifier that:
# - uses out-of-bag samples to estimate the generalization accuracy
# - uses the values of y to automatically adjust weights inversely
#   proportional to class frequencies in the input data
dt_clf = RandomForestClassifier(class_weight="balanced", oob_score=True)

dt_clf_cv = GridSearchCV(dt_clf, param_grid, scoring='f1_micro', cv=3, verbose=1, n_jobs=-1)
dt_clf_cv.fit(X_train, y_train)

params_optimal = dt_clf_cv.best_params_

print("Best Score (F1 score): %f" % dt_clf_cv.best_score_)
print("Optimal Hyperparameter Values: ", params_optimal)
print("\n")

Fitting 3 folds for each of 2304 candidates, totalling 6912 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done 782 tasks      | elapsed:   10.5s
[Parallel(n_jobs=-1)]: Done 2282 tasks      | elapsed:   27.8s
[Parallel(n_jobs=-1)]: Done 4382 tasks      | elapsed:   53.2s


Best Score (F1 score): 0.966667
Optimal Hyperparameter Values:  {'criterion': 'entropy', 'max_depth': 3, 'max_leaf_nodes': 15, 'min_samples_leaf': 3, 'n_estimators': 100}


CPU times: user 10.4 s, sys: 369 ms, total: 10.8 s
Wall time: 1min 23s


[Parallel(n_jobs=-1)]: Done 6912 out of 6912 | elapsed:  1.4min finished
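
The grid search above uses scoring='f1_micro'. For a single-label multiclass problem such as Iris, the micro-averaged F1 score equals accuracy, so the reported best score can be read as the mean cross-validated accuracy. The sketch below illustrates this on hypothetical label arrays that are not taken from the notebook's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted labels for a 3-class problem
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0, 2, 2])

micro_f1 = f1_score(y_true_demo, y_pred_demo, average="micro")
accuracy = accuracy_score(y_true_demo, y_pred_demo)

print("Micro-averaged F1:", micro_f1)
print("Accuracy:         ", accuracy)
```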


## Train the Optimal Model

In [6]:
forest_clf = RandomForestClassifier(class_weight="balanced", oob_score=True, **params_optimal)

forest_clf.fit(X_train, y_train)

y_train_predicted = forest_clf.predict(X_train)

print("Train Accuracy: ", accuracy_score(y_train, y_train_predicted))

print("Out of Bag (oob) Score: ", forest_clf.oob_score_)

y_test_predicted = forest_clf.predict(X_test)

print("\nTest Accuracy: ", accuracy_score(y_test, y_test_predicted))

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))

Train Accuracy:  0.9583333333333334
Out of Bag (oob) Score:  0.9416666666666667

Test Accuracy:  1.0

Test Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



## Task 2: Feature Selection

Below we determine the importance of the Iris features using the trained Random Forest model's $feature\_importances\_$ attribute.

In [7]:
for i in range(len(iris.feature_names)):
    print("%10s : %.2f" % (iris.feature_names[i], forest_clf.feature_importances_[i]))

sepal length (cm) : 0.11
sepal width (cm) : 0.01
petal length (cm) : 0.39
petal width (cm) : 0.48


## Observation

We observe that the two most important features are:
- Petal length
- Petal width
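
One possible way to turn these importances into an actual feature subset is Scikit-Learn's `SelectFromModel`. The sketch below is an illustration under stated assumptions rather than part of the original workflow: it trains its own forest (`demo_forest`, with illustrative settings) and keeps the features whose importance is at or above the median importance.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

# Forest trained only for this illustration
demo_forest = RandomForestClassifier(n_estimators=100, random_state=0)
demo_forest.fit(X_demo, y_demo)

# Keep features whose importance is at or above the median importance
# (the "median" threshold is an illustrative choice, not a recommendation)
selector = SelectFromModel(demo_forest, threshold="median", prefit=True)
X_selected = selector.transform(X_demo)

selected_names = [
    name
    for name, keep in zip(iris_demo.feature_names, selector.get_support())
    if keep
]
print("Selected features:", selected_names)
print("Reduced shape:", X_selected.shape)
```

A model can then be retrained on `X_selected` to check whether the reduced feature set preserves accuracy.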