## [Mastering Random Forest: A Deep Dive with Gradient Boosting Comparison](https://pub.towardsai.net/mastering-random-forest-a-deep-dive-with-gradient-boosting-comparison-2fc50427b508)

> Ensemble methods are common techniques in machine learning.


**Random Forest (RF)** is a versatile supervised learning model that belongs to the bagging family of ensemble methods.

At its core, a Random Forest trains multiple decision trees in parallel, each on a random (bootstrap) sample of the training data, with each split considering only a random subset of the features.

By averaging the trees' predictions (regression) or taking a majority vote (classification), Random Forest reduces variance relative to a single decision tree, which improves the robustness and accuracy of its predictions.
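The bagging idea can be sketched by hand with plain decision trees. This is a minimal illustration, not the library's implementation: the synthetic dataset, the tree count, and all variable names are assumptions made for the example. Note also that scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic binary data (assumption: stands in for the article's dataset)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote cannot tie
trees = []
for _ in range(n_trees):
    # bootstrap: draw len(X_tr) row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(
        max_features='sqrt',  # random feature subset at each split
        random_state=int(rng.integers(1_000_000)),
    )
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# majority vote over the per-tree predictions (labels are 0/1)
votes = np.stack([t.predict(X_te) for t in trees])  # shape: (n_trees, n_test)
y_vote = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean([(t.predict(X_te) == y_te).mean() for t in trees])
print(f"mean single-tree accuracy: {single_acc:.3f}")
print(f"majority-vote accuracy:    {(y_vote == y_te).mean():.3f}")
```

Comparing the two printed numbers gives a rough feel for how much the vote helps over an individual tree on this particular dataset.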

Its key characteristics include:

- **Supervised Learning Model:** Requires labeled data for training.
- **Ensemble Method (Bagging):** Combines predictions from multiple individual models (decision trees) to achieve better performance than any single model.
- **Non-Parametric:** Makes no assumptions about the underlying data distribution and does not have a fixed number of parameters; the trees grow with the data.
- **Non-linearity:** Capable of modeling complex, non-linear relationships within the data.
- **Versatile:** Can handle both regression and classification problems.
- **Training in Parallel:** Individual trees are trained simultaneously, speeding up the training process on appropriate hardware.

### Key Hyperparameters

Random Forest's behavior is heavily influenced by three groups of hyperparameters:

1. **Forest Complexity**

- `n_estimators`: Number of trees in the forest, whose predictions are averaged.
- `oob_score`: Whether to use the out-of-bag (OOB) samples, i.e. the rows a tree did not see during training, to estimate the generalization score.
- `bootstrap`: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
- `max_samples`: When `bootstrap=True`, the number (or fraction) of samples drawn from the training set to train each tree.

2. **Tree Complexity**

- `max_depth`: Maximum depth of each tree (None = no limit).
- `min_samples_split`: Minimum number of samples (absolute count or fraction) a node must contain to be split.
- `min_samples_leaf`: Minimum number of samples (absolute count or fraction) required at a leaf node.
- `max_features`: Number of features considered when searching for the best split (`'sqrt'`, `'log2'`, an absolute count, or a fraction).
- `max_leaf_nodes`: Maximum number of leaf nodes in each tree (None = unlimited).
- `min_impurity_decrease`: A node is split only if the split decreases impurity by at least this value.
- `ccp_alpha`: Complexity parameter used for Minimal Cost-Complexity Pruning.

3. **Training Control**

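- `criterion`: Function used to measure the quality of a split (e.g. Gini impurity or entropy).
- `class_weight`: Weights associated with the classes; `'balanced'` compensates for class imbalance.
- `random_state`: Seed that controls the randomness of the bootstrap sampling and the feature sampling.
- `n_jobs`: Number of jobs to run in parallel during training (-1 = use all processors).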
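To make the forest-level options concrete, here is a small sketch of how `bootstrap`, `oob_score`, and `max_samples` interact. The synthetic dataset and the specific values are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data (assumption: stands in for the article's dataset)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,    # required for OOB scoring
    oob_score=True,    # score each row using only trees that did not see it
    max_samples=0.8,   # each tree draws 80% of the rows, with replacement
    random_state=0,
).fit(X, y)

print(f"OOB accuracy:   {rf.oob_score_:.3f}")
print(f"train accuracy: {rf.score(X, y):.3f}")

# OOB scoring is undefined without bootstrapping, so scikit-learn rejects it
try:
    RandomForestClassifier(bootstrap=False, oob_score=True).fit(X, y)
    oob_without_bootstrap_failed = False
except ValueError as err:
    oob_without_bootstrap_failed = True
    print(f"ValueError: {err}")
```

The OOB score acts as a built-in validation estimate, which is why it is only available when each tree leaves some rows out.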

In [None]:
## Model Tuning: Grid Search

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# sets baseline model
base_rf = RandomForestClassifier(
    # 1. forest complexity
    bootstrap=True,
    oob_score=True,
    # 3. training control
    random_state=42,
    n_jobs=-1,
    class_weight='balanced',
    verbose=0,
    warm_start=False,
)
# `preprocessor`, `X_train`, and `y_train` are assumed to be defined in an earlier cell
pipeline = Pipeline([('scaler', preprocessor), ('model', base_rf)])
 
# defines options to test
param_grid =  {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [5, 10, 20, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
    'model__max_features': ['sqrt', 0.7],
    'model__criterion': ['gini', 'entropy']
}

# grid search
grid_search_rm_bt = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=0)
grid_search_rm_bt.fit(X_train, y_train)
best_params_search_model = grid_search_rm_bt.best_params_
rf_opt = grid_search_rm_bt.best_estimator_

print(f"Best params for bootstrapped Random Forest: {best_params_search_model}")

print(f"Best score: {grid_search_rm_bt.best_score_:.4f}")
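Before launching an exhaustive search it helps to know how many fits it implies. The sketch below counts the candidates for the grid above with `ParameterGrid`; the variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# same grid as in the search above
param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [5, 10, 20, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
    'model__max_features': ['sqrt', 0.7],
    'model__criterion': ['gini', 'entropy'],
}
cv = 5

# every combination of values is one candidate
n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 * 3 * 2 * 2
n_fits = n_candidates * cv                     # one fit per candidate per fold

print(f"candidates: {n_candidates}, cross-validation fits: {n_fits}")
# candidates: 432, cross-validation fits: 2160
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best candidate on the full training set. If this budget is too large, a randomized search over the same space is a common alternative.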

In [None]:
## Random Forests with Different Model Complexity

from sklearn.ensemble import RandomForestClassifier

# common training conditions across the models
training_controls = dict(
    random_state=42,
    n_jobs=-1,
    class_weight='balanced',
    verbose=0,
    warm_start=False, 
)

# complexity = low
rf_s = RandomForestClassifier(
    # 1. forest complexity
    bootstrap=False,
    oob_score=False,
    n_estimators=50,

    # 2. tree complexity
    max_depth=3,
    min_samples_leaf=20,
    max_features=1.0,      # no feature bagging: all features are considered at every split (an int 1 would mean a single feature)

    # 3. training control
    **training_controls,
).fit(X_train_processed, y_train)


# complexity = low, with bootstrapping (counterpart of rf_s)
rf_s_bootstrap = RandomForestClassifier(
    **{ k: v for k, v in rf_s.get_params().items() if k != 'bootstrap' and k != 'oob_score'},
    bootstrap=True,
    oob_score=True,
).fit(X_train_processed, y_train)


# complexity = middle
rf_m = RandomForestClassifier(
    # 1. forest complexity
    bootstrap=True,
    oob_score=True,
    n_estimators=200,

    # 2. tree complexity
    max_depth=10,
    min_samples_leaf=5,
    max_features='sqrt',      # turns on the feature bagging
    
    # 3. training control
    **training_controls,
).fit(X_train_processed, y_train)


# complexity = high
rf_l = RandomForestClassifier(
    # 1. forest complexity
    bootstrap=True,
    oob_score=True,
    n_estimators=500,

    # 2. tree complexity
    max_depth=None,          # no depth limit: nodes expand until the leaves are pure
    min_samples_leaf=1,
    max_features='log2',

    # 3. training control
    **training_controls,
).fit(X_train_processed, y_train)
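The meaning of `max_features` depends on its type, which is easy to get wrong. The following sketch fits tiny forests on synthetic data (an assumption, with 64 features chosen so that the square root and the base-2 logarithm differ) and reads the resolved per-split feature count from the first tree's `max_features_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 64 features so that sqrt (8) and log2 (6) give different counts
X, y = make_classification(
    n_samples=300, n_features=64, n_informative=8, random_state=0
)

resolved = {}
for setting in [1, 1.0, None, 'sqrt', 'log2']:
    rf = RandomForestClassifier(
        n_estimators=5, max_features=setting, random_state=0
    ).fit(X, y)
    # each fitted tree stores the number of features it samples per split
    resolved[repr(setting)] = rf.estimators_[0].max_features_
    print(f"max_features={setting!r:>7} -> {resolved[repr(setting)]} features per split")
```

An integer is an absolute count, a float is a fraction of the features, and `None` means all features, so `1` and `1.0` sit at opposite extremes of feature randomness.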

In [None]:
## GBM Family and Baseline Model

from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier

# complexity hyperparams
learning_rate = 0.01
n_estimators = 500        # same as rf_l
max_depth = None

# regularization hyperparams
validation_fraction = 0.1
n_iter_no_change = 10


# scikit-learn's classic gradient boosting (a GBM implementation, not the XGBoost library)
xgbm = GradientBoostingClassifier(
    loss='log_loss',
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    max_depth=max_depth,
    validation_fraction=validation_fraction, 
    n_iter_no_change=n_iter_no_change,
    tol=1e-4,
).fit(X_train_processed, y_train)


# scikit-learn's histogram-based gradient boosting (inspired by LightGBM, not the LightGBM library)
lgbm = HistGradientBoostingClassifier(
    loss='log_loss',
    learning_rate=learning_rate,
    max_depth=max_depth, 
    max_iter=n_estimators,
    validation_fraction=validation_fraction,
    l2_regularization=0.01,
    early_stopping=True,
    n_iter_no_change=n_iter_no_change,
    max_leaf_nodes=31,
    min_samples_leaf=20,
).fit(X_train_processed, y_train)


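# CatBoost
cat = CatBoostClassifier(
    loss_function='Logloss',
    learning_rate=learning_rate,
    iterations=n_estimators,
    depth=max_depth,                         # note: None falls back to CatBoost's default depth, not unlimited depth
    early_stopping_rounds=n_iter_no_change,  # note: early stopping needs an eval_set passed to fit() to take effect
    eval_metric='Accuracy',
    random_seed=42,
    verbose=0,
).fit(X_train_processed, y_train)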


# Logistic Regression (L2-regularized baseline)
lr = LogisticRegression(
    max_iter=n_estimators,
    penalty='l2',
    tol=1e-4,
).fit(X_train_processed, y_train)

In [None]:
## Evaluation

from sklearn.metrics import accuracy_score

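# fitted models from the cells above; the display names are illustrative
models = [rf_s, rf_s_bootstrap, rf_m, rf_l, xgbm, lgbm, cat, lr]
model_names = ['RF (low)', 'RF (low, bootstrap)', 'RF (middle)', 'RF (high)',
               'GradientBoosting', 'HistGradientBoosting', 'CatBoost', 'Logistic Regression']

for i, model in enumerate(models):
    y_pred_train = model.predict(X_train_processed)
    y_pred_val = model.predict(X_val_processed)   # available for validation metrics
    y_pred_test = model.predict(X_test_processed)
    print(f'\n{model_names[i]}\nTrain: {accuracy_score(y_train, y_pred_train):.4f} Test: {accuracy_score(y_test, y_pred_test):.4f}')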