# 🌳 Cost Function of a Decision Tree

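Unlike linear regression, a **Decision Tree** is not fit by minimizing a single global cost function in one optimization.
Instead, it is grown through **greedy, local decisions**: each split is the best one available at that node, with no guarantee that the finished tree is globally optimal.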

---

## 1. **At Each Split (Node-Level Cost Function)**

The algorithm tries to find the "best" feature and threshold to split data.
"Best" means **reducing impurity (classification)** or **reducing variance (regression).**

### 🔹 For Classification

* **Entropy:**

  $$
  H(S) = -\sum_{k=1}^K p_k \log_2(p_k)
  $$

  * Where $p_k$ = proportion of class $k$ in node $S$.
  * A pure node (all samples from one class) → $H(S) = 0$.

* **Gini Impurity:**

  $$
  Gini(S) = 1 - \sum_{k=1}^K p_k^2
  $$

* **Information Gain (Cost Reduction):**

  $$
  IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)
  $$

  where $S_v$ is the subset of $S$ in which feature $A$ takes value $v$ (for a binary threshold split, the left and right child nodes).

👉 The split chosen is the one **maximizing Information Gain** (or equivalently, minimizing weighted impurity).
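The formulas above can be checked by hand with a few lines of Python. This is a minimal sketch using only the standard library; the function names (`class_proportions`, `entropy`, `gini`, `information_gain`) and the toy labels are illustrative assumptions, not part of any library.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return p_k for every class that appears in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    """H(S) = -sum_k p_k * log2(p_k)."""
    return sum(-p * log2(p) for p in class_proportions(labels))


def gini(labels):
    """Gini(S) = 1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Toy node with two classes in equal proportion (hypothetical data).
parent = ["a", "a", "b", "b"]
perfect_split = [["a", "a"], ["b", "b"]]   # both children are pure
useless_split = [["a", "b"], ["a", "b"]]   # children look like the parent

print("entropy:", entropy(parent))
print("gini:", gini(parent))
print("gain (perfect):", information_gain(parent, perfect_split))
print("gain (useless):", information_gain(parent, useless_split))
```

Passing `impurity=gini` to `information_gain` gives the corresponding Gini decrease, which is why the two criteria can be swapped without changing the rest of the procedure.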

---

### 🔹 For Regression

* **Mean Squared Error (MSE):**

  $$
  MSE(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y}_S)^2
  $$

  where $\bar{y}_S$ is the mean target value in node $S$, so $MSE(S)$ is the same quantity as the variance $Var(S)$ used below.

* **Variance Reduction (Cost Reduction):**

  $$
  \Delta = Var(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Var(S_v)
  $$

👉 The split chosen is the one that **maximizes variance reduction**, i.e. minimizes the size-weighted variance (or MSE) of the child nodes.
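As a rough illustration of the regression case, the sketch below searches one numeric feature for the threshold with the largest variance reduction. The helper names (`variance`, `best_threshold`), the toy data, and the choice of midpoints between sorted values as candidate thresholds are assumptions made for this example, not a description of any particular library's internals.

```python
def variance(values):
    """Population variance of the targets in a node (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_threshold(x, y):
    """Return (threshold, variance_reduction) for a single numeric feature."""
    pairs = sorted(zip(x, y))          # sort samples by feature value
    n = len(pairs)
    parent_var = variance(list(y))
    best = (None, 0.0)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                   # no threshold separates equal values
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        weighted = len(left) / n * variance(left) + len(right) / n * variance(right)
        reduction = parent_var - weighted
        if reduction > best[1]:
            best = ((lo + hi) / 2, reduction)
    return best


# Hypothetical data: the target jumps between x = 3 and x = 10.
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, reduction = best_threshold(x, y)
print("best threshold:", threshold, "variance reduction:", reduction)
```

A real tree repeats this search over every feature at every node and keeps the best (feature, threshold) pair, which is what makes the procedure greedy.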

---

## 2. **Global Cost Function (Tree-Level Evaluation)**

After training, the finished tree is evaluated with standard metrics (these describe the whole tree, but the splitting procedure does not optimize them directly):

* Classification → Accuracy, Log Loss, F1-score.
* Regression → MSE, MAE, RMSE, $R^2$.

---

# ⚙️ Hyperparameters in Decision Tree

Decision Trees are **prone to overfitting**, so hyperparameters act like regularizers.

---

### 1. **Tree Growth Parameters**

* `max_depth`: Maximum depth of the tree.

  * Shallow tree → high bias, low variance.
  * Deep tree → low bias, high variance (overfits).

* `min_samples_split`: Minimum samples needed to split a node.

  * Prevents splitting on very small subsets.

* `min_samples_leaf`: Minimum samples required at a leaf node.

  * Larger values → smoother predictions, less overfitting.

* `max_leaf_nodes`: Maximum number of leaves in the tree.
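To see how a growth limit changes the fitted tree, the sketch below trains three classifiers that differ only in `max_depth` and compares their size and accuracy. The dataset (`load_breast_cancer`), the split ratio, and the depth values are assumptions chosen for illustration; exact scores depend on the data and the scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for depth in (1, 3, None):  # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    results[depth] = {
        "depth": clf.get_depth(),
        "leaves": clf.get_n_leaves(),
        "train_acc": clf.score(X_train, y_train),
        "test_acc": clf.score(X_test, y_test),
    }

for depth, stats in results.items():
    print(f"max_depth={depth}: {stats}")
```

Comparing `train_acc` with `test_acc` across the three rows is a quick way to spot the bias-variance trade-off described above: a large gap between the two suggests the tree is overfitting.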

---

### 2. **Feature Selection Parameters**

* `max_features`: Number of features considered for splitting at each node.

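  * If fewer than the total number of features → each split searches only a random subset, which adds randomness and can reduce overfitting (the split within that subset is still chosen greedily).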

---

### 3. **Splitting Parameters**

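* `criterion`: Impurity measure used to score candidate splits.

  * `"gini"` or `"entropy"` for classification (`"log_loss"` is an equivalent name for `"entropy"` in recent scikit-learn versions).
  * `"squared_error"` or `"friedman_mse"` for regression (`"squared_error"` replaced the older name `"mse"`, which was deprecated in scikit-learn 1.0 and later removed).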

---

### 4. **Regularization Parameters**

* `ccp_alpha`: Cost-complexity pruning parameter (post-pruning).

  * Higher `ccp_alpha` → simpler tree (removes weak branches).
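One way to pick `ccp_alpha` is to ask scikit-learn for the alpha values at which the fully grown tree would lose a subtree, then refit at each of them. The sketch below uses `cost_complexity_pruning_path` for this; the dataset and split are assumptions for illustration, and in practice the alpha would usually be chosen by cross-validation rather than from a single test split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Candidate alphas, in increasing order, computed from the unpruned tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Guard against tiny negative values caused by floating-point round-off.
alphas = [max(float(a), 0.0) for a in path.ccp_alphas]

leaf_counts = []
for alpha in alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    leaf_counts.append(clf.get_n_leaves())
    print(
        f"ccp_alpha={alpha:.5f}  leaves={clf.get_n_leaves()}  "
        f"test acc={clf.score(X_test, y_test):.3f}"
    )
```

The leaf count shrinks as alpha grows, which is the "simpler tree" effect noted above.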

---

# 🎯 Example in Scikit-Learn

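```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Example data (an assumption for illustration; any labelled dataset works)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

tree = DecisionTreeClassifier(
    criterion="gini",       # or "entropy"
    max_depth=5,            # cap on tree depth
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=5,     # every leaf keeps at least 5 samples
    max_features="sqrt",    # random subset of features per split
    random_state=42,        # reproducible feature subsets
)

tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```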

---

✅ **Summary**

* **Cost function (splitting criterion):**

  * Classification → minimize Gini or Entropy (maximize info gain).
  * Regression → minimize MSE / variance.
* **Hyperparameters:** control tree depth, splits, leaf size, and pruning to balance bias–variance.

