<h1 style="color:blue" align="left"> Random Forest </h1>

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/

---------------

<h1 style="color:green" align="left"> 1. Theory </h1>

![1. Hyper Parameter](image/1.JPG)

## 1. n_estimators (Number of trees)

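- **Does increasing the number of trees lead to overfitting?**


- Increasing the **number of trees** generally **improves model performance** up to a point, after which the gain levels off while training time and memory use keep growing. Because each tree is trained on a bootstrap sample and the predictions are averaged, adding more trees does not by itself make the forest overfit. Overfitting is governed mainly by how deep and unconstrained the individual trees are (see **max_depth**, **min_samples_split** and **min_samples_leaf** below).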

- **n_estimators : int, default=100**. The number of trees in the forest.

  .. versionchanged:: 0.22
  
     The default value of **``n_estimators``** changed from **10** to
     **100** in version 0.22.
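
One way to check how the number of trees affects performance is to compare cross-validated scores for a few forest sizes. The sketch below is only illustrative: the synthetic dataset from `make_classification`, the names `X_demo`/`y_demo` and the chosen tree counts are assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption: stands in for the real training set)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=0)

tree_counts = [5, 25, 100]
mean_scores = {}
for n in tree_counts:
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    # Mean 3-fold cross-validated accuracy for this forest size
    mean_scores[n] = cross_val_score(model, X_demo, y_demo, cv=3).mean()

for n, score in mean_scores.items():
    print(f"n_estimators={n:>3}: mean CV accuracy = {score:.3f}")
```

Plotting such scores against `n_estimators` on your own data is a simple way to pick a value beyond which extra trees only cost time.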

![1. Hyper Parameter](image/2.JPG)

## 2. max_features
- Suppose the total number of features in the dataset is 100.


- With max_features = “sqrt”, sqrt(100) = 10, so 10 randomly selected features are considered at each node.

#### max_features : {“sqrt”, “log2”, None}, int or float, default=”sqrt”
- The number of features to consider when looking for the best split:

  - If int, then consider max_features features at each split.

  - If float, then max_features is a fraction and max(1, int(max_features * n_features)) features are considered at each split.

  - If “auto”, then max_features=sqrt(n_features) for the classifier. This option was the default in older releases; it was deprecated in scikit-learn 1.1 and removed in 1.3, so use “sqrt” instead.

  - If “sqrt”, then max_features=sqrt(n_features).

  - If “log2”, then max_features=log2(n_features).

  - If None, then max_features=n_features.

**Note:** the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
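
The rules above can be sketched as a small helper that converts a `max_features` setting into a feature count. `resolve_max_features` is a hypothetical function written for illustration; it mirrors the documented rules and is not scikit-learn's own code.

```python
import math

def resolve_max_features(max_features, n_features):
    """Return how many features are considered per split (illustrative helper)."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        # Fractions are scaled by the number of features and truncated
        return max(1, int(max_features * n_features))
    raise ValueError(f"Unsupported max_features: {max_features!r}")

# With 100 features, as in the example above
for setting in ["sqrt", "log2", 0.25, 7, None]:
    print(setting, "->", resolve_max_features(setting, 100))
```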

![1. Hyper Parameter](image/3.JPG)

![1. Hyper Parameter](image/4.JPG)

## 3. max_depth

#### max_depth : int, default=None
- The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
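
To see the effect of `max_depth`, one can fit two small forests and compare the depth the trees actually reach. This is a sketch on synthetic data; `X_demo`, `y_demo` and the helper `max_tree_depth` are assumptions introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption: stands in for the real training set)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

def max_tree_depth(forest):
    """Largest depth among the fitted trees of a forest."""
    return max(tree.get_depth() for tree in forest.estimators_)

unlimited = RandomForestClassifier(n_estimators=20, max_depth=None,
                                   random_state=0).fit(X_demo, y_demo)
capped = RandomForestClassifier(n_estimators=20, max_depth=3,
                                random_state=0).fit(X_demo, y_demo)

print("max_depth=None ->", max_tree_depth(unlimited))
print("max_depth=3    ->", max_tree_depth(capped))
```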

![1. Hyper Parameter](image/5.JPG)

## 4. min_samples_split

#### min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:

  - If int, then consider min_samples_split as the minimum number.

  - If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
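
The int/float rule can be sketched as follows. `resolve_min_samples_split` is a hypothetical helper, and the dataset size of 150 is an arbitrary assumption; the lower bound of 2 reflects that a node needs at least two samples to be split.

```python
import math

def resolve_min_samples_split(min_samples_split, n_samples):
    """Turn an int or float setting into an absolute sample count (illustrative helper)."""
    if isinstance(min_samples_split, int):
        return min_samples_split
    # Fractions are scaled by the dataset size and rounded up
    return max(2, math.ceil(min_samples_split * n_samples))

n_samples = 150  # hypothetical training set size
for setting in [2, 10, 0.125]:
    print(setting, "->", resolve_min_samples_split(setting, n_samples))
```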

## 5. min_samples_leaf

#### min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

  - If int, then consider min_samples_leaf as the minimum number.

  - If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
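
A quick way to confirm what `min_samples_leaf` does is to inspect the leaves of the fitted trees. The sketch below uses synthetic data and sets `bootstrap=False` so that node counts refer to plain training samples; both choices, and the helper `smallest_leaf`, are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption: stands in for the real training set)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

def smallest_leaf(forest):
    """Smallest number of training samples found in any leaf of any tree."""
    sizes = []
    for est in forest.estimators_:
        tree = est.tree_
        is_leaf = tree.children_left == -1  # leaves have no children
        sizes.append(tree.n_node_samples[is_leaf].min())
    return int(min(sizes))

for leaf_size in [1, 5, 20]:
    forest = RandomForestClassifier(n_estimators=10, min_samples_leaf=leaf_size,
                                    bootstrap=False, random_state=0).fit(X_demo, y_demo)
    print(f"min_samples_leaf={leaf_size:>2} -> smallest leaf holds "
          f"{smallest_leaf(forest)} samples")
```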

## 6. criterion

#### criterion : {“gini”, “entropy”}, default=”gini”
- The function to measure the quality of a split.


- Supported criteria are **“gini”** for the **Gini impurity** and **“entropy”** for the **information gain.** Newer scikit-learn releases (1.1 and later) also accept **“log_loss”**, which is equivalent to “entropy”. Note: this parameter is tree-specific.
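
Both criteria are impurity measures computed from the class proportions in a node. The functions below are a minimal sketch of the textbook formulas (Gini impurity 1 - Σ p², entropy -Σ p log2 p); the function names are assumptions, not scikit-learn API.

```python
import numpy as np

def gini_impurity(class_probs):
    """Gini impurity: 1 - sum(p_k^2)."""
    p = np.asarray(class_probs, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(class_probs):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    p = np.asarray(class_probs, dtype=float)
    p = p[p > 0]  # classes with zero probability contribute nothing
    return float(np.sum(p * np.log2(1.0 / p)))

for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(probs, "gini =", round(gini_impurity(probs), 3),
          "entropy =", round(entropy(probs), 3))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why they often select similar splits.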

------------------------

<h1 style="color:blue" align="left"> 2. CODE </h1>

## a. Classification

### 1. GridSearchCV

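In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values for each hyperparameter (5 x 2 x 5 = 50 combinations)
param_grid = {'n_estimators': [50, 100, 150, 200, 300],
              'criterion': ['gini', 'entropy'],
              'max_depth': [5, 10, 50, 100, 200]}

# Exhaustive search over the grid with 4-fold cross-validation
grid = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, cv=4)
grid.fit(xx_train, yy_train)

print("Tuned Model Parameters: {}".format(grid.best_params_))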

---------------------------------------------------------------------------------------
Tuned Model Parameters: {'criterion': 'gini', 'max_depth': 200, 'n_estimators': 100}
---------------------------------------------------------------------------------------

In [None]:
# Refit a forest with the tuned parameters reported above
rfc = RandomForestClassifier(criterion='gini', max_depth=200, n_estimators=100)
rfc.fit(xx_train, yy_train)
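
As an alternative to retyping the tuned values, the fitted search object already holds a model refit with the best parameters (`refit=True` is the default). The sketch below shows this on synthetic data with a deliberately small grid; the data, the grid values and the names `demo_grid`, `X_tr`, `X_te` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (assumption: stands in for xx_train / yy_train)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0)

# Small grid so the example runs quickly
small_grid = {'n_estimators': [20, 50], 'max_depth': [3, None]}
demo_grid = GridSearchCV(RandomForestClassifier(random_state=0),
                         param_grid=small_grid, cv=3)
demo_grid.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr with the best parameters
best_model = demo_grid.best_estimator_
test_accuracy = best_model.score(X_te, y_te)

print("Best parameters:", demo_grid.best_params_)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```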

### 2. RandomizedSearchCV

In [None]:
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 50)
classifier.fit(X_train, y_train)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

In [None]:
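est = RandomForestClassifier(n_jobs=-1)
# Search space: lists are sampled uniformly; randint(low, high) excludes high
rf_p_dist = {'max_depth': [3, 5, 10, None],
             'n_estimators': [10, 100, 200, 300, 400, 500],
             'max_features': randint(1, 3),      # 1 or 2 features per split
             'criterion': ['gini', 'entropy'],
             'bootstrap': [True, False],
             'min_samples_leaf': randint(1, 4),  # 1, 2 or 3 samples per leaf
             }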

In [None]:
def hypertuning_rscv(est, p_distr, nbr_iter,X,y):
    rdmsearch = RandomizedSearchCV(est, param_distributions=p_distr,
                                  n_jobs=-1, n_iter=nbr_iter, cv=9)
    # CV = Cross-Validation ( here using Stratified KFold CV)
    rdmsearch.fit(X,y)
    ht_params = rdmsearch.best_params_
    ht_score = rdmsearch.best_score_
    return ht_params, ht_score

In [None]:
rf_parameters, rf_ht_score = hypertuning_rscv(est, rf_p_dist, 40, X, y)

In [None]:
# Build the final classifier from the tuned parameters
classifier = RandomForestClassifier(n_jobs=-1, n_estimators=300, bootstrap=True, criterion='entropy',
                                    max_depth=3, max_features=2, min_samples_leaf=3)
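
Instead of copying the tuned values by hand, the dictionary returned in `best_params_` can be unpacked straight into the constructor. This is a sketch on synthetic data with a reduced search space; the data, the distributions and the names `demo_search` and `final_classifier` are assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption: stands in for X, y)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_dist = {'max_depth': [3, 5, None],
             'n_estimators': [10, 30],
             'min_samples_leaf': randint(1, 4)}

demo_search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                                 param_distributions=demo_dist,
                                 n_iter=5, cv=3, random_state=0)
demo_search.fit(X_demo, y_demo)

# Unpack the tuned parameters rather than retyping them
final_classifier = RandomForestClassifier(random_state=0, **demo_search.best_params_)
final_classifier.fit(X_demo, y_demo)

print("Tuned parameters:", demo_search.best_params_)
```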

------------

In [None]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt')
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000,10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
              'criterion':['entropy','gini']}
print(random_grid)
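
A grid like this grows multiplicatively with every parameter, which is the usual motivation for sampling it with `RandomizedSearchCV` rather than searching it exhaustively. The sketch below counts the combinations in a grid of lists; `grid_size`, the small `demo_grid` and the budget `n_iter` are hypothetical names and values chosen for illustration.

```python
from math import prod

def grid_size(grid):
    """Number of distinct parameter combinations in a grid of lists."""
    return prod(len(values) for values in grid.values())

# Small stand-in grid (assumption: not the grid from the notebook)
demo_grid = {'n_estimators': [200, 600, 1000],
             'max_depth': [10, 50],
             'criterion': ['entropy', 'gini']}

total = grid_size(demo_grid)
n_iter = 5  # hypothetical random-search budget
print(f"{total} combinations in the grid; a random search with "
      f"n_iter={n_iter} tries {n_iter} of them")
```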

-------------

## b. Regression

### 2. RandomizedSearchCV

In [None]:
from sklearn.ensemble import RandomForestRegressor

reg_rf = RandomForestRegressor()

In [None]:
# Randomized Search CV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]

# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for regressors it meant all features, i.e. 1.0)
max_features = [1.0, 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]

In [None]:
# Create the random grid

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

In [None]:
# Random search of parameters, using 5-fold cross-validation,
# sampling 10 different combinations (n_iter = 10)
rf_random = RandomizedSearchCV(estimator = reg_rf, param_distributions = random_grid,
                               scoring='neg_mean_squared_error', n_iter = 10, cv = 5,
                               verbose=2, random_state=42, n_jobs = 1)
rf_random.fit(X_train,y_train)

In [None]:
rf_random.best_params_

--------------------------------------------

{'n_estimators': 700,
 'min_samples_split': 15,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 20}

In [None]:
prediction = rf_random.predict(X_test)
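
Because the search uses `scoring='neg_mean_squared_error'`, `best_score_` is a negated MSE, and the predictions still need to be scored against the held-out targets. The following sketch shows both steps on synthetic regression data; the data, the reduced search space and the names `demo_search`, `cv_mse` and `test_rmse` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data (assumption: stands in for X_train / X_test)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0,
                                 random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0)

demo_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={'n_estimators': [20, 50],
                         'max_depth': [5, 10, None],
                         'min_samples_leaf': [1, 2, 5]},
    scoring='neg_mean_squared_error', n_iter=4, cv=3, random_state=0)
demo_search.fit(X_tr, y_tr)

# best_score_ is the negated mean squared error, so flip the sign
cv_mse = -demo_search.best_score_

# Score the held-out predictions
test_pred = demo_search.predict(X_te)
test_rmse = np.sqrt(mean_squared_error(y_te, test_pred))

print(f"Cross-validated MSE: {cv_mse:.1f}")
print(f"Held-out RMSE:       {test_rmse:.1f}")
```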

------------------

<h1 style="color:blue" align="left"> 3. Default Parameters </h1>

- **n_estimators : int, default=100**. The number of trees in the forest.

  .. versionchanged:: 0.22
  
     The default value of **``n_estimators``** changed from **10** to
     **100** in version 0.22.

------------------------------------------------------------------------------------------------------

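- Parameters currently in use (output recorded with an older scikit-learn release; newer releases report different defaults, for example 'n_estimators': 100 and 'criterion': 'squared_error'):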

- {'bootstrap': True,

  'criterion': 'mse',
  
  'max_depth': None,
  
  'max_features': 'auto',
  
  'max_leaf_nodes': None,
  
  'min_impurity_decrease': 0.0,
  
  'min_impurity_split': None,
  
  'min_samples_leaf': 1,
  
  'min_samples_split': 2,
  
  'min_weight_fraction_leaf': 0.0,
  
  'n_estimators': 10,
  
  'n_jobs': 1,
  
  'oob_score': False,
  
  'random_state': 42,
  
  'verbose': 0,
  
  'warm_start': False}

The **main parameters** used by a **Random Forest Classifier** are:

- **criterion** = the function used to evaluate the quality of a split.


- **max_depth** = maximum number of levels allowed in each decision tree.


- **max_features** = maximum number of features considered when splitting a node.


- **min_samples_leaf** = minimum number of samples which can be stored in a tree leaf.


- **min_samples_split** = minimum number of data points a node must contain before it can be split.


- **n_estimators** = number of trees in the ensemble.


- **bootstrap** = method for sampling data points (with or without replacement).

### Look at parameters used by our current forest

In [None]:
from pprint import pprint
from sklearn.ensemble import RandomForestRegressor

# random_state=42 matches the recorded output above
rf = RandomForestRegressor(random_state=42)

# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())
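
Since the full parameter listing depends on the installed scikit-learn version, it can be more useful to print only the parameters that differ from the defaults. The sketch below compares `get_params()` of a customised forest with that of a default one; the helper `non_default_params` and the example values are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestRegressor

def non_default_params(model):
    """Parameters of `model` that differ from a default-constructed instance."""
    defaults = type(model)().get_params()
    current = model.get_params()
    return {name: value for name, value in current.items()
            if value != defaults[name]}

custom_rf = RandomForestRegressor(n_estimators=200, max_depth=10)
changed = non_default_params(custom_rf)
print(changed)
```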