# Random Forest Feature Selection

In this notebook, we perform feature selection using the Random Forest model.

**Feature Selection** is the task of identifying the input features that contribute most to a model's predictions, so that less informative features can be dropped. A Random Forest offers a quick way to see which features actually matter, which makes it a convenient tool for this task.

The Random Forest model enables us to perform feature selection by finding the **relative importance of each feature**. 

A Random Forest measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). 

More precisely, it is a **weighted average**, where each node’s weight is equal to the number of training samples that are associated with it.

Scikit-Learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1. 

We can access the result using the $feature\_importances\_$ attribute of the model.
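
The weighted-average computation described above can be sketched by hand. The snippet below is an illustrative reconstruction rather than Scikit-Learn's own code: the helper name `tree_importances` and the `*_demo` variables are hypothetical, and the sketch assumes the fitted trees expose the usual `tree_` arrays (`children_left`, `children_right`, `feature`, `impurity`, `weighted_n_node_samples`). It sums the sample-weighted impurity decrease per feature in each tree, normalizes per tree, averages over the forest, and compares the result with `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def tree_importances(tree, n_features):
    """Normalized, sample-weighted impurity decrease per feature for one tree."""
    t = tree.tree_
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, so no impurity decrease
            continue
        # Impurity of the node minus the impurity of its children,
        # each weighted by the number of training samples that reach it
        decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                    - t.weighted_n_node_samples[left] * t.impurity[left]
                    - t.weighted_n_node_samples[right] * t.impurity[right])
        importances[t.feature[node]] += decrease
    total = importances.sum()
    return importances / total if total > 0 else importances


X_demo, y_demo = load_iris(return_X_y=True)
demo_forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the per-tree importances, then rescale so they sum to 1
per_tree = [tree_importances(est, X_demo.shape[1])
            for est in demo_forest.estimators_ if est.tree_.node_count > 1]
manual = np.mean(per_tree, axis=0)
manual = manual / manual.sum()

print("Manual:      ", np.round(manual, 3))
print("Scikit-Learn:", np.round(demo_forest.feature_importances_, 3))
```

The two printed vectors should agree up to floating-point error, and each sums to 1.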

We first tune a Random Forest model on the Iris dataset using a grid search, then use the tuned model to determine feature importance.

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Dataset


We use the Iris dataset, a well-known multivariate dataset that contains the sepal and petal length and width of 150 iris flowers from three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica.

There are 4 features: 
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)

Total number of samples: 150

The dataset is also known as Fisher's Iris data set because it was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.


<img src="https://cse.unl.edu/~hasan/IrisFlowers.png" width=800 height=400>


## Explore the Dataset

In [2]:
iris = load_iris()

# See the key values
print("\nKey Values: \n", list(iris.keys()))

# The feature names
print("\nFeature Names: \n", list(iris.feature_names))

# The target names
print("\nTarget Names: \n", list(iris.target_names))

# The target values (codes)
#print("\nTarget Values: \n", list(iris.target))


Key Values: 
 ['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']

Feature Names: 
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Target Names: 
 ['setosa', 'versicolor', 'virginica']


## Create Data Matrix (X) and the 1D Target Array (y)


In [3]:
# Feature matrix
X = iris["data"]

# Target Array
y = iris["target"]


print(X.shape)
print(y.shape)

print("\nX data type: ", X.dtype)
print("y data type: ", y.dtype)

(150, 4)
(150,)

X data type:  float64
y data type:  int64


## Split Data Into Training and Test Sets

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
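
The split above is not stratified, so the three classes need not be equally represented in the test set. As an optional variation (not what this notebook does), `train_test_split` accepts a `stratify` argument that preserves class proportions. The sketch below uses hypothetical `*_demo` names so that it does not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Same split as in the notebook: class counts in the test set may be uneven
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

# Stratified variant: each class keeps its share of the data in both subsets
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

print("Test class counts (plain):     ", np.bincount(y_test_plain, minlength=3))
print("Test class counts (stratified):", np.bincount(y_test_strat, minlength=3))
```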

## Model Selection

We use some of the hyperparameters of the RandomForestClassifier class for model selection. 

For a full list of the hyperparameters visit: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier


- n_estimators : The number of trees in the forest. Default=100 (the default was 10 before scikit-learn 0.22)


- criterion : The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific. Default=”gini”


- max_depth : The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Default=None


- min_samples_split : The minimum number of samples required to split an internal node: Default=2

        -- If int, then consider min_samples_split as the minimum number.
        
        -- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
        
        
        
- min_samples_leaf : The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. Default=1


         -- If int, then consider min_samples_leaf as the minimum number.
         
         -- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.


- max_features : The number of features to consider when looking for the best split. Default="sqrt" in current scikit-learn releases. Older releases used "auto" as the default; that value was deprecated in version 1.1 and removed in version 1.3.

        -- If int, then consider max_features features at each split.
        
        -- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
        
        -- If “auto” (older releases only), then max_features=sqrt(n_features).
        
        -- If “sqrt”, then max_features=sqrt(n_features) (same as “auto” in older releases).
        
        -- If “log2”, then max_features=log2(n_features).
        
        -- If None, then max_features=n_features.


- max_leaf_nodes : Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. Default=None


- class_weight : Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y. Default=None

         -- The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))



- oob_score : Whether to use out-of-bag samples to estimate the generalization accuracy. Default=False

        -- When bootstrap=True, each tree is trained on a bootstrap sample of the training set, so the samples left out of that tree (the out-of-bag samples) can be used to estimate the generalization accuracy. Setting oob_score=True computes this estimate during training; it is then available through the $oob\_score\_$ attribute.
        
        
- bootstrap : Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. Default=True
 

- verbose : Controls the verbosity when fitting and predicting. Default=0
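
Two of the hyperparameters listed above can be illustrated with a short sketch. The first part reproduces the "balanced" class-weight formula by hand on a small, made-up label vector (`y_toy` is hypothetical) and compares it with `sklearn.utils.class_weight.compute_class_weight`. The second part fits a forest with `oob_score=True` and reads the out-of-bag estimate. The `*_demo` and `oob_forest` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# --- class_weight="balanced" ---
# Hypothetical imbalanced labels: 6, 3 and 1 samples of classes 0, 1 and 2
y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y_toy)

# n_samples / (n_classes * np.bincount(y)): rarer classes get larger weights
manual_weights = len(y_toy) / (len(classes) * np.bincount(y_toy))
sklearn_weights = compute_class_weight("balanced", classes=classes, y=y_toy)

print("Manual weights:      ", np.round(manual_weights, 4))
print("Scikit-Learn weights:", np.round(sklearn_weights, 4))

# --- oob_score=True ---
X_demo, y_demo = load_iris(return_X_y=True)
oob_forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0)
oob_forest.fit(X_demo, y_demo)

# Accuracy estimated only from samples each tree did not see during training
print("OOB accuracy estimate:", oob_forest.oob_score_)
```

The out-of-bag estimate requires `bootstrap=True`, since without bootstrapping every tree sees every sample and nothing is left out.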


In [5]:
%%time

param_grid = {'n_estimators': [10, 20, 50, 100],
              'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8],
              'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
              'max_leaf_nodes': [2, 5, 10, 15]}

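# max_features="sqrt" matches the behaviour of the "auto" value accepted by older scikit-learn releases
rf_clf = RandomForestClassifier(criterion="gini", max_features="sqrt", class_weight="balanced",
                                oob_score=True)

# Exhaustive grid search with 3-fold cross-validation, scored by micro-averaged F1
rf_clf_cv = GridSearchCV(rf_clf, param_grid, scoring='f1_micro', cv=3, verbose=1, n_jobs=-1)
rf_clf_cv.fit(X_train, y_train)

params_optimal = rf_clf_cv.best_params_

print("Best Score (F1 score): %f" % rf_clf_cv.best_score_)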
print("Optimal Hyperparameter Values: ", params_optimal)
print("\n")

Fitting 3 folds for each of 1152 candidates, totalling 3456 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 560 tasks      | elapsed:    7.5s
[Parallel(n_jobs=-1)]: Done 1560 tasks      | elapsed:   18.6s
[Parallel(n_jobs=-1)]: Done 2960 tasks      | elapsed:   34.4s


Best Score (F1 score): 0.966667
Optimal Hyperparameter Values:  {'max_depth': 2, 'max_leaf_nodes': 10, 'min_samples_leaf': 7, 'n_estimators': 100}


CPU times: user 5.83 s, sys: 268 ms, total: 6.1 s
Wall time: 40.7 s


[Parallel(n_jobs=-1)]: Done 3456 out of 3456 | elapsed:   40.6s finished
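
Two details of the grid-search output above can be checked with a small sketch. First, the number of candidates is the product of the list lengths in the grid (4 × 8 × 9 × 4), which `ParameterGrid` can count directly. Second, for single-label multiclass problems the micro-averaged F1 score used for `scoring` equals plain accuracy. The `param_grid_demo`, `y_true_toy` and `y_pred_toy` names are hypothetical, and the toy labels are made up for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook, copied here so the sketch is self-contained
param_grid_demo = {'n_estimators': [10, 20, 50, 100],
                   'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8],
                   'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'max_leaf_nodes': [2, 5, 10, 15]}

n_candidates = len(ParameterGrid(param_grid_demo))
n_fits = n_candidates * 3  # cv=3 folds per candidate
print("Candidates:", n_candidates, "| Total fits:", n_fits)

# Made-up labels: micro-averaged F1 coincides with accuracy here
y_true_toy = [0, 1, 2, 2, 1, 0]
y_pred_toy = [0, 2, 2, 2, 1, 1]
micro_f1 = f1_score(y_true_toy, y_pred_toy, average="micro")
accuracy = accuracy_score(y_true_toy, y_pred_toy)
print("Micro F1:", micro_f1, "| Accuracy:", accuracy)
```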


## Train the Optimal Model

In [6]:
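# max_features="sqrt" matches the behaviour of the "auto" value accepted by older scikit-learn releases
forest_clf = RandomForestClassifier(criterion="gini", max_features="sqrt", class_weight="balanced",
                                    oob_score=True, **params_optimal)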

forest_clf.fit(X_train, y_train)

y_train_predicted = forest_clf.predict(X_train)

print("Train Accuracy: ", accuracy_score(y_train, y_train_predicted))

y_test_predicted = forest_clf.predict(X_test)

print("Test Accuracy: ", accuracy_score(y_test, y_test_predicted))

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))

Train Accuracy:  0.9416666666666667
Test Accuracy:  1.0

Test Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
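
As an alternative to constructing a new classifier from the best parameters, a fitted `GridSearchCV` object already holds a model refit on the full training set, because `refit=True` is its default. The sketch below illustrates this on a deliberately small, hypothetical grid (`small_grid`) with hypothetical variable names; it is not the search run in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

# Small illustrative grid so the sketch runs quickly
small_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid,
                      scoring="f1_micro", cv=3)
search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already trained on X_tr
best_forest = search.best_estimator_
test_accuracy = best_forest.score(X_te, y_te)

print("Best parameters:", search.best_params_)
print("Test accuracy:  ", test_accuracy)
```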



## Feature Selection

Below we determine the importance of the Iris features using the Random Forest model's $feature\_importances\_$ attribute.

In [7]:
# Pair each feature name with its importance score
for name, importance in zip(iris.feature_names, forest_clf.feature_importances_):
    print("%10s : %.2f" % (name, importance))

sepal length (cm) : 0.12
sepal width (cm) : 0.01
petal length (cm) : 0.37
petal width (cm) : 0.50


## Observation

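In this run, the two most important features are the petal measurements, which together account for 0.87 of the total importance:
- Petal width (0.50)
- Petal length (0.37)

Sepal length (0.12) and sepal width (0.01) contribute comparatively little.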
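
To turn the importance scores into an actual feature subset, one option is Scikit-Learn's `SelectFromModel`, which keeps the features whose importance meets a threshold. The sketch below is one possible approach, not something this notebook performs. The choice of `threshold="mean"` and all `*_demo` names are assumptions for illustration, and the forest is refit here so that the example is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

demo_forest = RandomForestClassifier(n_estimators=100, random_state=0)
demo_forest.fit(X_demo, y_demo)

# Keep features whose importance is at least the mean importance
selector = SelectFromModel(demo_forest, threshold="mean", prefit=True)
mask = selector.get_support()           # boolean mask over the 4 features
X_selected = selector.transform(X_demo)  # reduced feature matrix

selected_names = [name for name, keep in zip(iris_demo.feature_names, mask) if keep]
print("Selected features:", selected_names)
print("Reduced shape:    ", X_selected.shape)
```

The reduced matrix can then be used to train a smaller model, and its accuracy compared against the model trained on all four features.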