# Introduction to Decision Trees

## 1. What is a Decision Tree?

- A **Decision Tree** is a **supervised learning algorithm** that can be used for both **classification** and **regression** tasks.
- It models decisions as a tree-like structure, where:
  - **Internal nodes** represent a feature or attribute on which the data is split.
  - **Branches** represent the outcome of the decision based on that feature.
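  - **Leaf nodes** represent the final output: a class label (classification) or a numeric value (regression).
- Decision trees are easy to interpret and can, in principle, handle both numerical and categorical data, although some implementations (including scikit-learn's) expect categorical features to be encoded as numbers first.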

---

## 2. How Decision Trees Work

### Steps:
1. **Root Node**: Start with the entire dataset and choose the feature that best splits the data. This feature becomes the root of the tree.
2. **Splitting**: At each internal node, the algorithm selects a feature and threshold to split the data in a way that maximizes the separation between different classes (for classification) or reduces the prediction error (for regression).
3. **Recursion**: The splitting process continues recursively, creating branches and nodes, until one of the stopping conditions is met (e.g., maximum depth, minimum samples at a node).
4. **Leaf Nodes**: When a node meets a stopping condition or cannot be split further, it becomes a leaf. The leaf's prediction is typically the majority class of its training samples (classification) or their mean target value (regression).
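
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how any particular library implements it: the function names (`gini`, `best_split`, `build_tree`, `predict_one`), the dict-based node layout, and the toy data are all made up for this example, and it handles only numeric features and classification.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(X, y):
    """Return the (feature_index, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best


def build_tree(X, y, depth=0, max_depth=3, min_samples_split=2):
    """Recursively grow a tree; each leaf stores the majority class."""
    split = None
    if depth < max_depth and len(y) >= min_samples_split:
        split = best_split(X, y)
    if split is None:  # stopping condition met, or no split improves purity
        return {"leaf": Counter(y).most_common(1)[0][0]}
    j, t = split
    left_idx = [i for i, row in enumerate(X) if row[j] <= t]
    right_idx = [i for i, row in enumerate(X) if row[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([X[i] for i in left_idx], [y[i] for i in left_idx],
                           depth + 1, max_depth, min_samples_split),
        "right": build_tree([X[i] for i in right_idx], [y[i] for i in right_idx],
                            depth + 1, max_depth, min_samples_split),
    }


def predict_one(node, row):
    """Follow the branches from the root down to a leaf."""
    while "leaf" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


# Toy data (hypothetical): two numeric features, three classes
X = [[2.0, 1.0], [3.0, 1.5], [1.0, 4.0], [1.5, 5.0], [6.0, 4.5], [7.0, 5.5]]
y = ["a", "a", "b", "b", "c", "c"]

toy_tree = build_tree(X, y)
print([predict_one(toy_tree, row) for row in X])
```

The exhaustive threshold search keeps the sketch short; production implementations use more efficient split search and support additional stopping rules.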

---

## 3. Decision Criteria

To decide how to split the data at each node, decision trees use criteria like:

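- **Gini Impurity** (for classification): Measures how often a randomly chosen sample at a node would be misclassified if it were labeled at random according to the node's class distribution. Here p_i is the proportion of samples at the node that belong to class i.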
  
  \[
  Gini = 1 - \sum (p_i)^2
  \]

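- **Entropy** (for classification): Measures the impurity (uncertainty) of the class distribution at a node. It is 0 for a pure node and largest when the classes are evenly mixed.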
  
  \[
  Entropy = - \sum p_i \log_2(p_i)
  \]
  
- **Information Gain**: The decrease in entropy after a dataset is split on an attribute, computed as the parent's entropy minus the size-weighted average entropy of its child nodes.
  
  \[
  \text{Information Gain} = Entropy(parent) - \sum_{child} \frac{n_{child}}{n_{parent}} \, Entropy(child)
  \]

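- **Mean Squared Error (MSE)** (for regression): Measures the average squared deviation of the target values at a node from their mean (the node's variance). The split that reduces this value the most is chosen.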
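
To make the criteria concrete, the following sketch evaluates them on a small hand-made split. The helper names (`class_proportions`, `gini`, `entropy`, `information_gain`, `variance`) and the example labels and values are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Proportion p_i of each class among the labels at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    return -sum(p * log2(p) for p in class_proportions(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


def variance(values):
    """MSE of predicting the node mean, as used for regression splits."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Classification: a 50/50 parent split into two mostly pure children
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(gini(parent))                             # 0.5
print(entropy(parent))                          # 1.0
print(information_gain(parent, [left, right]))  # about 0.278

# Regression: variance reduction from separating low and high targets
targets = [1, 2, 3, 10, 11, 12]
low, high = targets[:3], targets[3:]
reduction = variance(targets) - (
    len(low) / len(targets) * variance(low)
    + len(high) / len(targets) * variance(high)
)
print(reduction)                                # about 20.25
```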

---

## 4. Tree Pruning

- **Overfitting**: A decision tree can become overly complex and fit the noise in the training data. This leads to poor generalization to unseen data.
- **Pruning** is a technique used to reduce the size of the tree and prevent overfitting.
  - **Pre-pruning** (early stopping): Stop the tree from growing if certain conditions are met, such as reaching a maximum depth or minimum number of samples per node.
  - **Post-pruning**: Build the entire tree and then remove branches that provide little benefit.
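
Both pruning styles can be tried in scikit-learn, as sketched below. Pre-pruning uses growth limits such as `max_depth`, while post-pruning here relies on scikit-learn's minimal cost-complexity pruning (`ccp_alpha`). The dataset and the choice of the middle alpha from the pruning path are arbitrary assumptions for illustration; in practice the alpha would normally be selected by cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth and leaf-size limits
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute candidate pruning strengths
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Pick a mid-range alpha purely for illustration
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=0
).fit(X_train, y_train)

for name, model in [("full", full_tree),
                    ("pre-pruned", pre_pruned),
                    ("post-pruned", post_pruned)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```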

---

## 5. Important Parameters in Decision Trees

- **max_depth**: The maximum depth of the tree. Limiting depth helps reduce overfitting, at the cost of possibly underfitting if set too low.
- **min_samples_split**: The minimum number of samples required to split an internal node.
- **min_samples_leaf**: The minimum number of samples required to be in a leaf node.
- **criterion**: The function used to measure the quality of a split (e.g., `gini` for Gini impurity or `entropy` for information gain in classification; `squared_error` for regression, which was called `mse` in scikit-learn versions before 1.0).
- **max_features**: The number of features to consider when looking for the best split.
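
One common way to choose values for these parameters is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV`; the dataset and the candidate values in `param_grid` are arbitrary assumptions for illustration, not recommended defaults.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate values are illustrative only
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "criterion": ["gini", "entropy"],
    "max_features": [None, "sqrt"],
}

# Evaluate every combination with 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```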

---

## 6. Example Code

```python
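import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Example data: the Iris dataset stands in for your own features and labels
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Initialize and configure the decision tree model
model = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=0)

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on new data
predictions = model.predict(X_test)

# Visualize the fitted tree
plt.figure(figsize=(10, 8))
plot_tree(model, filled=True)
plt.show()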
```

---

## 7. Advantages of Decision Trees

- **Interpretability**: Decision trees are easy to visualize and understand, even for non-technical stakeholders.
- **No Need for Feature Scaling**: Unlike algorithms such as SVM or k-NN, decision trees do not require features to be scaled or normalized.
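- **Handles Both Numerical and Categorical Data**: The algorithm can work with different types of data without much preprocessing, though some implementations (scikit-learn among them) require categorical features to be numerically encoded first.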
- **Non-parametric**: Does not assume any specific distribution of data.
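
The claim about feature scaling can be checked empirically. The sketch below, which assumes the Wine dataset and `StandardScaler` purely as an example, trains one tree on raw features and one on standardized features. Because splits depend only on the ordering of feature values, the two trees should agree on essentially all predictions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize using statistics from the training set only
scaler = StandardScaler().fit(X_train)

raw_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train
)

raw_pred = raw_tree.predict(X_test)
scaled_pred = scaled_tree.predict(scaler.transform(X_test))

# Fraction of test samples on which both trees predict the same class
agreement = np.mean(raw_pred == scaled_pred)
print(f"Fraction of identical predictions: {agreement:.2f}")
```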

---

## 8. Disadvantages of Decision Trees

- **Overfitting**: Decision trees can easily overfit the training data, especially if they are allowed to grow very deep. This can lead to poor generalization to new data.
- **Unstable**: Small changes in the data can result in a completely different tree structure, which makes decision trees less robust.
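- **Biased Towards Features with More Categories**: Impurity-based criteria such as information gain tend to favor attributes with many distinct values, because such attributes offer more ways to partition the data even when they carry little predictive signal.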

---

## 9. Evaluation Metrics for Decision Trees

For classification tasks:
- **Accuracy**: The ratio of correct predictions to total predictions.
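- **Precision, Recall, and F1 Score**: More informative than accuracy on imbalanced datasets, where a model can reach high accuracy simply by always predicting the majority class.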
  
For regression tasks:
- **Mean Squared Error (MSE)**: Measures the average squared difference between predicted and actual values.
- **R² (Coefficient of Determination)**: The proportion of variance in the target that the model explains. A value of 1 indicates a perfect fit.
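
These metrics are available in `sklearn.metrics`. The sketch below computes them for one classification tree and one regression tree; the datasets, the `max_depth=4` setting, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, r2_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification metrics on a binary problem
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

classification_metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}

# Regression metrics on a continuous target
Xr, yr = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=0)
reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(Xr_train, yr_train)
yr_pred = reg.predict(Xr_test)

regression_metrics = {
    "mse": mean_squared_error(yr_test, yr_pred),
    "r2": r2_score(yr_test, yr_pred),
}

for name, value in {**classification_metrics, **regression_metrics}.items():
    print(f"{name}: {value:.3f}")
```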

---

## 10. Regularization in Decision Trees

To avoid overfitting, you can apply **regularization** techniques:
- **max_depth**: Limits the depth of the tree, reducing overfitting.
- **min_samples_split**: Sets the minimum number of samples required to split a node.
- **min_samples_leaf**: Ensures that leaf nodes have a minimum number of samples.
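
The effect of these settings can be seen by comparing an unrestricted tree with a regularized one on noisy data. The sketch below assumes a synthetic dataset from `make_classification` with 10% label noise; the specific parameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some randomly flipped labels to encourage overfitting
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unrestricted = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
regularized = DecisionTreeClassifier(
    max_depth=4, min_samples_split=20, min_samples_leaf=10, random_state=0
).fit(X_train, y_train)

# A large gap between train and test accuracy is a sign of overfitting
for name, model in [("unrestricted", unrestricted), ("regularized", regularized)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
          f"train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```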

---

## 11. Decision Tree Variants

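- **Classification Trees**: Used for classification tasks where the target is categorical.
- **Regression Trees**: Used for regression tasks where the target is continuous. CART (Classification and Regression Trees) is a widely used algorithm that covers both cases.
- **Random Forests**: An ensemble of decision trees, each trained on a bootstrap sample of the data with random feature subsets, whose predictions are combined by voting or averaging to improve accuracy and robustness.
- **Gradient Boosting**: Builds shallow trees sequentially, each one fitted to the residual errors of the trees before it, and combines them into a strong learner.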
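
The ensemble variants share the same fit/predict interface as a single tree in scikit-learn, so they can be compared directly. The sketch below is one way to do that with 5-fold cross-validation; the dataset and `n_estimators=100` are assumptions for illustration, and the relative ranking of the models will vary by dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100,
                                                    random_state=0),
}

# Mean cross-validated accuracy for each model
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```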

---

## 12. Applications of Decision Trees

- **Credit Scoring**: To predict whether a customer is likely to default on a loan.
- **Medical Diagnosis**: To classify patients based on the presence or absence of a disease.
- **Customer Segmentation**: To assign customers to known segments based on purchasing behavior, given labeled examples of each segment.
- **Fraud Detection**: To classify financial transactions as fraudulent or legitimate based on their characteristics.

---

## 13. Limitations of Decision Trees

- **Overfitting**: Decision trees are prone to overfitting if not pruned or regularized properly.
- **Bias Toward Features with More Categories**: Decision trees can be biased toward features with many categories.
- **Unstable**: A small change in the data can result in a significantly different tree.

---

## 14. Summary

- **Decision Trees** are a powerful and interpretable model for classification and regression tasks.
- They split data recursively based on feature values and output class labels (for classification) or continuous values (for regression).
- Proper regularization (through depth control and pruning) is essential to avoid overfitting.
- Decision trees are widely used across various domains due to their simplicity and interpretability.

---