```{contents}
```         

# Cost Function

* A **cost function** (also called **splitting criterion**) measures **how “good” a split is at a node** in a decision tree.
* Random Forest builds **many decision trees**, and each tree uses a cost function to decide **where to split** the data.
* The goal is to **minimize prediction error** in the leaves.

---

## **2. Cost Functions for Regression**

For **Random Forest Regressor**, the commonly used criteria are:

| Criterion                 | What it Measures         | Formula / Intuition                                                                                                                    |
| ------------------------- | ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------- |
| `squared_error` (default) | Variance reduction       | Measures **how much splitting reduces variance of target values** in child nodes. The split that reduces variance the most is chosen.  |
| `absolute_error`          | Mean Absolute Error (L1) | Measures the **mean absolute deviation** of target values from the **median** of each child node. Less sensitive to outliers than squared error. |
| `poisson`                 | Poisson deviance         | Used for **count data**, assumes a Poisson distribution of targets.                                                                    |
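
To make the outlier claim concrete, here is a minimal sketch in plain Python. The target values are made up for illustration, and the helper names `sse` and `sae` are hypothetical. It compares the two constant predictions a leaf could make: the mean, which minimizes squared error, and the median, which minimizes absolute error (the value scikit-learn's `absolute_error` criterion uses for each terminal node).

```python
from statistics import mean, median

def sse(values, c):
    """Sum of squared errors around a constant prediction c."""
    return sum((v - c) ** 2 for v in values)

def sae(values, c):
    """Sum of absolute errors around a constant prediction c."""
    return sum(abs(v - c) for v in values)

# Hypothetical leaf targets; 100 is an outlier.
targets = [10, 12, 14, 15, 100]

leaf_mean = mean(targets)      # pulled toward the outlier
leaf_median = median(targets)  # stays with the bulk of the data

print("mean:", leaf_mean, "| median:", leaf_median)
print("SSE at mean:", sse(targets, leaf_mean), "| SSE at median:", sse(targets, leaf_median))
print("SAE at mean:", sae(targets, leaf_mean), "| SAE at median:", sae(targets, leaf_median))
```

The mean wins on squared error and the median wins on absolute error, so a single extreme target shifts a `squared_error` leaf far more than an `absolute_error` leaf.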

**Variance Reduction Intuition:**

* Suppose a node contains target values `[10, 12, 15, 14]`.
* Splitting them into `[10, 12]` and `[15, 14]` reduces variance in child nodes.
* The split that **minimizes the size-weighted average variance across children** is chosen.

**Mathematically:**

$$
\text{Variance reduction} = \text{Var(parent)} - \frac{n_\text{left}}{n_\text{parent}} \text{Var(left)} - \frac{n_\text{right}}{n_\text{parent}} \text{Var(right)}
$$

* The **larger the reduction**, the better the split.
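
The formula can be checked by hand on the node `[10, 12, 15, 14]` from the intuition above. The sketch below is plain Python, not scikit-learn's implementation. The function names are hypothetical, and the second split `[10, 15]` / `[12, 14]` is an invented alternative added only for comparison.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Var(parent) minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted_children

parent = [10, 12, 15, 14]

# Split from the text: low values on the left, high values on the right.
good = variance_reduction(parent, [10, 12], [15, 14])

# Hypothetical alternative that mixes low and high values in each child.
poor = variance_reduction(parent, [10, 15], [12, 14])

print(f"Var(parent)          = {variance(parent)}")  # 3.6875
print(f"reduction, good split = {good}")              # 3.0625
print(f"reduction, poor split = {poor}")              # 0.0625
```

The first split removes most of the parent variance, so it would be preferred.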

---

## **3. Cost Functions for Classification**

For **Random Forest Classifier**, the commonly used criteria are:

| Criterion  | What it Measures | Formula / Intuition                                                                                                                            |
| ---------- | ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `gini`     | Gini Impurity    | Measures **how often a randomly chosen sample would be misclassified** if labeled according to the node’s class distribution. Lower is better. |
| `entropy`  | Information Gain | Measures **uncertainty of class labels**. Split that maximizes **information gain** (reduces entropy) is chosen.                               |
| `log_loss` | Logarithmic loss | In scikit-learn this is **equivalent to `entropy`**: both select splits by Shannon information gain.                                           |

**Gini Impurity Formula:**

$$
Gini = 1 - \sum_{i=1}^{C} p_i^2
$$

* $p_i$ = proportion of class $i$ in the node
* Gini = 0 → node is pure (all samples same class)
* Gini reaches its maximum, $1 - 1/C$, when the $C$ classes are evenly mixed
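
As a quick check of the formula, the following sketch computes Gini impurity from a list of labels in plain Python. The function name `gini` and the toy labels are illustrative assumptions, not part of any library API.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["a", "a", "a", "a"]))  # 0.0 -> pure node
print(gini(["a", "a", "b", "b"]))  # 0.5 -> two classes, evenly mixed
print(gini(["a", "b", "c"]))       # 1 - 1/3, about 0.667 -> three classes, evenly mixed
```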

**Entropy Formula:**

$$
Entropy = -\sum_{i=1}^{C} p_i \log_2(p_i)
$$

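* Measures **uncertainty** of the class labels: 0 for a pure node, $\log_2 C$ when the $C$ classes are evenly mixed
* The split that reduces entropy the most (highest information gain) is preferred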
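
Information gain is the parent's entropy minus the size-weighted entropy of the children. The sketch below works this out in plain Python on made-up labels. The function names are hypothetical, and only classes that actually occur in a node are counted, so $\log_2(0)$ never arises.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = ["a", "a", "b", "b"]

print(entropy(parent))                                     # 1.0 bit: two classes, evenly mixed
print(information_gain(parent, ["a", "a"], ["b", "b"]))    # 1.0: both children are pure
print(information_gain(parent, ["a", "b"], ["a", "b"]))    # 0.0: children as mixed as the parent
```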

---

## **4. Intuition Behind Cost Functions**

1. **Regression (Variance Reduction):**

   * Split the node so that children are **as homogeneous as possible** in target values.

2. **Classification (Gini/Entropy):**

   * Split the node so that children are **as pure as possible**, meaning samples in a child node mostly belong to one class.

3. **Random Forest Aggregates Trees:**

   * Even if one tree chooses a suboptimal split (high cost), averaging across many trees **reduces the impact of bad splits**.
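
The effect of averaging can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of a real forest: each "tree" is replaced by a hypothetical noisy predictor whose errors are independent, whereas real trees are correlated, so the actual reduction in a forest is smaller than shown here. All constants are made up.

```python
import random
from statistics import pvariance

random.seed(0)

TRUE_VALUE = 5.0   # hypothetical quantity every "tree" tries to predict
NOISE_STD = 1.0    # assumed spread of a single tree's prediction
N_TREES = 100      # trees averaged per forest prediction
N_TRIALS = 2000    # repetitions used to estimate the variance

def noisy_tree_prediction():
    """Stand-in for one tree: the true value plus independent Gaussian noise."""
    return random.gauss(TRUE_VALUE, NOISE_STD)

def forest_prediction(n_trees):
    """Average of n_trees independent noisy predictions."""
    return sum(noisy_tree_prediction() for _ in range(n_trees)) / n_trees

single = [noisy_tree_prediction() for _ in range(N_TRIALS)]
averaged = [forest_prediction(N_TREES) for _ in range(N_TRIALS)]

print(f"variance of a single tree : {pvariance(single):.4f}")
print(f"variance of the average   : {pvariance(averaged):.4f}")
```

With independent errors the variance of the average shrinks roughly in proportion to the number of trees, which is why one poor split in one tree has little effect on the final prediction.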

---

**Key Points**

* Random Forest **does not have a global cost function**; each tree **optimizes locally at each node**.
* Aggregating predictions across trees mainly reduces **variance**; bias stays close to that of the individual trees. This is what makes RF robust to overfitting.
* Choice of criterion affects:

  * Model performance (sometimes small differences)
  * Sensitivity to outliers (`squared_error` vs `absolute_error`)
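  * Training speed (`gini` is usually slightly faster than `entropy` because it avoids logarithms; `absolute_error` is considerably slower than `squared_error`)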
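
"Optimizes locally at each node" can be sketched as a greedy threshold search on a single feature. The code below is a simplified illustration in plain Python, not scikit-learn's implementation. The function names and data are hypothetical, and a real forest would repeat this search over a random subset of features at every node.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(feature, labels):
    """Return (threshold, weighted_gini) of the best split on one feature."""
    pairs = sorted(zip(feature, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal feature values cannot be separated by a threshold
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical node: one feature, two classes that separate cleanly.
feature = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = ["a", "a", "a", "b", "b", "b"]

print(best_split(feature, labels))  # (4.5, 0.0)
```

The search only looks at the current node's samples; it never revisits earlier splits, which is why there is no single global objective for the whole forest.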

## Classification Example (Gini vs Entropy)

```python
# Import Libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Random Forest with Gini
rf_gini = RandomForestClassifier(n_estimators=100, criterion='gini', random_state=42)
rf_gini.fit(X_train, y_train)
y_pred_gini = rf_gini.predict(X_test)
print("Classification with Gini:", accuracy_score(y_test, y_pred_gini))

# Random Forest with Entropy
rf_entropy = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=42)
rf_entropy.fit(X_train, y_train)
y_pred_entropy = rf_entropy.predict(X_test)
print("Classification with Entropy:", accuracy_score(y_test, y_pred_entropy))
```

Output:

```
Classification with Gini: 1.0
Classification with Entropy: 1.0
```


## Regression Example (Squared Error vs Absolute Error)

```python
# Import Libraries
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Create Sample Regression Data: a noisy sine wave on [0, 10)
np.random.seed(42)
X = np.sort(np.random.rand(100, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.3, X.shape[0])

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Random Forest with Squared Error (Variance Reduction)
rf_squared = RandomForestRegressor(n_estimators=100, criterion='squared_error', random_state=42)
rf_squared.fit(X_train, y_train)
y_pred_squared = rf_squared.predict(X_test)

# Random Forest with Absolute Error (L1)
rf_absolute = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=42)
rf_absolute.fit(X_train, y_train)
y_pred_absolute = rf_absolute.predict(X_test)

# Metrics: R² and root mean squared error on the test set
def print_metrics(name, y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    print(f"{name} -> R²: {r2:.2f}, RMSE: {rmse:.2f}")

print_metrics("Squared Error", y_test, y_pred_squared)
print_metrics("Absolute Error", y_test, y_pred_absolute)
```

Output:

```
Squared Error -> R²: 0.78, RMSE: 0.31
Absolute Error -> R²: 0.79, RMSE: 0.30
```
