```{contents}
```

# Workflows

## **1. Problem Understanding & Data Preparation**

1. **Define the task:**

   * Classification → predict class labels
   * Regression → predict continuous values

2. **Collect & clean data:**

   * Handle missing values (some Random Forest implementations accept them natively; otherwise impute first)
   * Encode categorical features as numbers when the implementation requires numeric input

3. **Split dataset:**

   * Usually into **train and test sets**
   * Optional: cross-validation for hyperparameter tuning
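
The split step can be sketched with scikit-learn. This is only an illustration: the bundled iris data stands in for an already cleaned dataset, and the 80/20 ratio and `random_state=0` are arbitrary assumptions, not recommendations from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Toy data standing in for a cleaned dataset (assumption: 150 rows, 4 features).
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing; stratify keeps class
# proportions similar in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```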

---

## **2. Feature Selection (Optional)**

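* Random Forest is fairly robust to irrelevant features: each split **considers a random subset of features** and keeps only the one that reduces impurity most, so uninformative features are rarely chosen and their effect is diluted across many trees.
* Still, removing clearly useless features (constant columns, row IDs) cuts training time and memory, and helps when noise features heavily outnumber informative ones.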
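
One simple case of a "completely useless" feature is a constant column. A minimal sketch, assuming scikit-learn's `VarianceThreshold` is an acceptable filter here (the tiny array is made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical feature matrix: the middle column is constant.
X = np.array([
    [1.0, 0.0, 3.0],
    [2.0, 0.0, 1.0],
    [3.0, 0.0, 2.0],
])

# The default threshold of 0.0 drops columns whose variance is zero.
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)

kept = selector.get_support()  # boolean mask of the columns that were kept
print(kept, X_reduced.shape)
```

This only catches zero-variance columns; features that vary but carry no signal need a different check.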

---

## **3. Bootstrap Sampling (Bagging)**

* For each tree in the forest:

  1. Randomly sample **N observations with replacement** from the training set, where N is the number of training rows.
  2. This sample is called a **bootstrapped dataset**: some rows appear more than once, others not at all.

**Effect:** Each tree sees slightly different data → introduces diversity among trees.
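
Bootstrap sampling can be sketched with NumPy by drawing row indices with replacement. The sample size `n = 1000` and the seed are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # assumed number of training rows

# Draw n row indices with replacement: this defines one bootstrapped dataset.
indices = rng.integers(0, n, size=n)

# Rows that were drawn at least once, and rows that were never drawn.
in_bag = np.unique(indices)
out_of_bag = np.setdiff1d(np.arange(n), indices)

unique_fraction = in_bag.size / n
print(f"distinct rows in the sample: {unique_fraction:.1%}")
print(f"rows left out: {out_of_bag.size}")
```

Repeating the draw with a different seed gives a different sample, which is where the diversity among trees comes from.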

---

## **4. Tree Building (Individual Decision Trees)**

* For each tree:

  1. Start at the root node.
  2. At each split, **select a random subset of features** (its size is set by `max_features`).
  3. Choose the best split based on:

     * **Regression:** Variance reduction / MSE
     * **Classification:** Gini impurity or entropy
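  4. Repeat recursively on each child node until a stopping criterion is met:

     * Maximum depth reached (`max_depth`)
     * A split would leave a child with fewer than `min_samples_leaf` samples
     * The node is pure (all its samples share one class or target value)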

**Note:** Trees are usually grown deep → individual trees may overfit.
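
To make the split step concrete, here is a minimal sketch of choosing one split by Gini impurity among a random subset of features. The helper names `gini` and `best_split` and the toy data are hypothetical; real libraries use faster, more careful implementations.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feature_indices):
    """Return (feature, threshold, weighted impurity) of the best split
    found among the given features, or (None, None, parent impurity)."""
    best = (None, None, gini(y))
    n = len(y)
    for j in feature_indices:
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy data (assumption): 4 rows, 2 features, 2 classes.
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]])
y = np.array([0, 0, 1, 1])

# Random feature subset for this split (here 1 of the 2 features).
rng = np.random.default_rng(0)
subset = rng.choice(X.shape[1], size=1, replace=False)
print(best_split(X, y, subset))
```

A full tree would call `best_split` again on each child until a stopping criterion is met.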

---

## **5. Repeat for All Trees**

* Build `n_estimators` trees independently using their own bootstrapped samples.
* Each tree is slightly different due to **data sampling + random feature selection**.
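
In scikit-learn, the loop over trees is handled inside `RandomForestClassifier`. A small sketch, assuming the iris data and 25 trees purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators sets how many trees are built; each gets its own
# bootstrap sample and its own random feature subsets.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# The fitted trees are exposed as a list of decision trees.
node_counts = [tree.tree_.node_count for tree in forest.estimators_]
print(len(forest.estimators_), "trees")
print("nodes per tree:", node_counts)
```

Because the trees do not depend on each other, the library can also build them in parallel (the `n_jobs` parameter).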

---

## **6. Aggregation of Predictions (Ensembling)**

* After all trees are trained, make predictions on **new/unseen data**:

  * **Regression:** Average predictions from all trees.
  * **Classification:** Majority vote across all trees (some libraries, e.g. scikit-learn, average the trees' predicted class probabilities instead).

**Effect:**

* Reduces variance → predictions are more stable than a single tree's, while bias stays about the same.
* Robust to noise → errors that individual trees do not share tend to cancel out in the average or vote.
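
The aggregation step can be sketched by hand: train a few trees on bootstrap samples, then take a majority vote per test row. This is a simplified illustration (iris data, 25 trees, plain hard voting are all assumptions), not how any particular library implements it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

rng = np.random.default_rng(0)
n_classes = len(np.unique(y_train))
trees = []
for i in range(25):
    # Bootstrap sample: row indices drawn with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" gives random feature subsets at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# One row of predictions per tree: shape (n_trees, n_test_rows).
all_preds = np.array([tree.predict(X_test) for tree in trees])

def majority_vote(votes):
    """Most frequent class label among the votes for one test row."""
    return np.bincount(votes, minlength=n_classes).argmax()

ensemble_pred = np.apply_along_axis(majority_vote, 0, all_preds)
accuracy = (ensemble_pred == y_test).mean()
print(f"ensemble accuracy: {accuracy:.2f}")
```

For regression, the vote would be replaced by `all_preds.mean(axis=0)`.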

---

## **7. Model Evaluation**

* Evaluate performance on **test/validation set**:

  * **Regression:** R², RMSE, MAE
  * **Classification:** Accuracy, Precision, Recall, F1-score, ROC-AUC

* Optional: **Out-of-Bag (OOB) Error**

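  * Each bootstrap sample contains only about 63% of the distinct training rows, since a row is missed by all N draws with probability (1 − 1/N)^N ≈ 1/e ≈ 0.37. Scoring each row with only the trees that never saw it → gives a validation-style error estimate without a separate hold-out set (usually close to cross-validation, though not strictly unbiased).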
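
A short sketch of both kinds of evaluation in scikit-learn, comparing the OOB score with accuracy on a held-out test set. The dataset, split ratio, and 200 trees are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# oob_score=True asks the forest to score each training row using
# only the trees whose bootstrap sample did not include that row.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = accuracy_score(y_test, forest.predict(X_test))
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

The two numbers are usually close but not identical; with very few trees the OOB estimate becomes noisy because some rows are left out of too few trees.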

---

## **8. Hyperparameter Tuning**

* Common parameters to tune:

  * `n_estimators` → number of trees
  * `max_depth` → maximum depth of trees
  * `min_samples_split` / `min_samples_leaf` → control complexity
  * `max_features` → number of features considered at each split

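* Use **GridSearchCV or RandomizedSearchCV** to compare candidate values with cross-validation; both return the best combination among those tried, not a guaranteed optimum.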
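
A minimal grid search over some of these parameters might look as follows. The candidate values are arbitrary assumptions kept small so the example runs quickly; a real search would use ranges suited to the data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hypothetical candidate values: 2 x 2 x 2 = 8 combinations.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 3],
    "max_features": ["sqrt", None],
}

# Each combination is scored with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

`RandomizedSearchCV` takes a similar setup but samples a fixed number of combinations, which scales better when the grid is large.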

---

## **9. Feature Importance (Optional)**

* Random Forest can compute **feature importance scores**:

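  * Measures how much each feature reduces impurity, averaged across all trees
  * Computed on the training data, so it can overstate high-cardinality features; permutation importance on held-out data is a common cross-check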
  * Helps in **interpreting the model**
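
In scikit-learn the impurity-based scores are available as `feature_importances_` after fitting. A small sketch on the iris data (an assumption for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y, names = data.data, data.target, data.feature_names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# One score per feature; the scores are normalized to sum to 1.
importances = forest.feature_importances_

# Rank features from most to least important.
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```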

---

## **10. Deployment**

* Once trained and validated, the Random Forest model can be deployed to **predict unseen data**.
* Predictions are **aggregated outputs** of all the trees.
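
One common way to hand a trained forest to another process is to serialize it with `joblib`. This is a sketch under stated assumptions: the file name `forest.joblib` is hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "forest.joblib")  # hypothetical file name
    joblib.dump(forest, path)    # save the fitted model to disk
    loaded = joblib.load(path)   # load it back, e.g. in a serving process

# The loaded model aggregates its trees exactly like the original.
new_rows = X[:5]  # stand-in for unseen data
same = np.array_equal(forest.predict(new_rows), loaded.predict(new_rows))
print("predictions match:", same)
```

Saved models should be loaded with the same library version they were trained with, and only from trusted sources, since the format is pickle-based.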

---

### **Workflow Summary Diagram (Mental Picture)**

1. **Input Data** → clean & split →
2. **Bootstrap Samples** → multiple trees trained independently →
3. **Random Feature Selection at Each Split** → tree grows deep →
4. **Aggregate Predictions** → final ensemble output →
5. **Evaluate & Tune Hyperparameters** → deploy model

