# Decision Trees

A decision tree is a flowchart-like structure used for making decisions. Each internal node tests a feature (e.g., checking if a coin flip is heads or tails). Each leaf node represents a final decision or class label. The branches show how different features combine to lead to these decisions. The paths from the root to the leaves represent the rules for classification.

![image.png](attachment:image.png)

A decision tree has two main parts: decision nodes and leaves. The leaves are the final outcomes or decisions. The decision nodes are where the data is divided according to the value of a feature. The goal of each split is to produce groups that are as homogeneous as possible with respect to the target variable. The accuracy of the tree depends on how well these splits are made. The criteria for making splits differ between classification and regression trees.

Decision trees are classified based on the type of target variable they predict. There are two types:

1. **Categorical Variable Decision Tree**: This type is used when the target variable is categorical (a classification tree).
2. **Continuous Variable Decision Tree**: This type is used when the target variable is continuous (a regression tree).

### Terminology

1. **Root Node (Top Decision Node)**: Represents the entire population or sample and is the starting point of the tree. It gets divided into two or more homogeneous sets.
2. **Splitting**: The process of dividing a node into two or more sub-nodes.
3. **Decision Node**: A node that splits into further sub-nodes.
4. **Leaf/Terminal Node**: Nodes that do not split further are called leaf or terminal nodes.
5. **Pruning**: The process of reducing the size of the decision tree by removing the sub-nodes of a decision node. It can be seen as the opposite of splitting.
6. **Branch/Sub-Tree**: A subsection of the decision tree.
7. **Parent and Child Node**: A node that is divided into sub-nodes is called a parent node, and the resulting sub-nodes are called child nodes.
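
The vocabulary above maps naturally onto a small data structure. The sketch below is illustrative only: the `Node` class, the `leaves` and `prune` helpers, and the example tree are hypothetical and not taken from any library.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node: a decision node if it has children, a leaf otherwise."""
    name: str
    children: list = field(default_factory=list)  # child nodes of this parent

    def is_leaf(self):
        return not self.children


def leaves(node):
    """Collect every leaf/terminal node in the sub-tree rooted at `node`."""
    if node.is_leaf():
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def prune(node):
    """Pruning: remove a node's sub-nodes so the node itself becomes a leaf."""
    node.children = []


# Root node -> one leaf and one decision node, which is itself the root of a sub-tree.
root = Node("outlook?", [
    Node("play"),
    Node("windy?", [Node("play"), Node("stay")]),
])
```

Here `root` is the root node, `root.children[1]` is both a child of the root and the parent of two leaves, and calling `prune(root.children[1])` would collapse that sub-tree into a single leaf.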

### Algorithm Selection Based on Target Variables

The choice of algorithm for building decision trees depends on the type of target variables. Here are some common algorithms used in decision trees:

1. **ID3 (Iterative Dichotomiser 3)**: An early algorithm by Ross Quinlan that selects splits using information gain.
2. **C4.5**: The successor of ID3.
3. **CART (Classification and Regression Tree)**: Used for both classification and regression tasks.
4. **CHAID (Chi-square Automatic Interaction Detection)**: Performs multiway splits (a node can have more than two branches) when computing classification trees.
5. **MARS (Multivariate Adaptive Regression Splines)**: Used for regression tasks and can model complex relationships.

### Decision Tree Induction Algorithm

At a high level, decision tree induction involves four main steps to build the tree:

1. **Begin with the Root Node**: Start with the entire dataset.
2. **Determine the Best Feature**: Find the best feature to split the data on, aiming to reduce the Gini impurity (or a similar impurity metric) of the derived nodes compared to the parent node. This feature should provide the best separation of the data.
3. **Split the Data**: Divide the data into subsets based on the possible values of the best feature. Each node in the tree represents a split point based on a specific feature, which could be a yes/no question or a comparison.
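4. **Recursively Generate New Nodes**: Use the subsets of data from step 3 to create new nodes. Continue splitting each subset until a stopping condition is met, for example when a node is pure, no features remain, or a further split no longer improves the metric, so that the tree stays accurate without growing more nodes than necessary.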
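
The four steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation of CART or ID3: features are categorical, rows are dictionaries, impurity is Gini, and the names `build_tree`, `weighted_gini`, and `max_depth` as well as the toy data are invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, labels, feature):
    """Size-weighted Gini impurity of the child nodes created by splitting on `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())


def build_tree(rows, labels, features, max_depth=3):
    """Recursively build a tree as nested dicts (hypothetical representation)."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure, no features remain, or the depth limit is reached.
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return {"leaf": majority}

    # Step 2: choose the feature whose split gives the lowest weighted impurity.
    best = min(features, key=lambda f: weighted_gini(rows, labels, f))
    if weighted_gini(rows, labels, best) >= gini(labels):
        return {"leaf": majority}  # no split improves on the parent node

    # Step 3: partition the data by the values of the chosen feature.
    partitions = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = partitions.setdefault(row[best], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)

    # Step 4: recurse on each subset with the remaining features.
    remaining = [f for f in features if f != best]
    return {
        "feature": best,
        "default": majority,
        "children": {
            value: build_tree(r, l, remaining, max_depth - 1)
            for value, (r, l) in partitions.items()
        },
    }


# Step 1: start from the entire (toy) dataset at the root.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["play", "play", "play", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
```

On this toy data both features reduce the impurity equally at the root, so the tie is broken by feature order and the root splits on `outlook`. Only the `rain` branch then needs a second split on `windy`.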

### General Algorithm for a Decision Tree

1. **Pick the Best Attribute/Feature**: Choose the attribute that best splits or separates the data.
2. **Ask the Relevant Question**: Formulate a question based on the chosen feature.
3. **Follow the Answer Path**: Based on the answer, follow the corresponding branch.
4. **Repeat**: Go back to step 1 until a final decision (leaf node) is reached.

This iterative process continues until the tree reaches a point where further splits do not significantly improve the accuracy, or a predefined stopping criterion is met.
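
Once a tree exists, steps 2 to 4 also describe how a prediction is made: ask the question stored at the current node, follow the matching branch, and repeat until a leaf is reached. The sketch below assumes a hand-written tree stored as nested dictionaries; the `predict` function, the dictionary keys, and the example tree are hypothetical.

```python
def predict(node, sample):
    """Walk from the root to a leaf, answering one question per decision node."""
    while "leaf" not in node:
        answer = sample.get(node["feature"])   # ask the question for this node
        child = node["children"].get(answer)   # follow the answer path
        if child is None:
            # Answer not seen when the tree was built: fall back to the node's default.
            return node["default"]
        node = child                           # repeat from the child node
    return node["leaf"]


# A small hand-written tree (illustrative, not learned from data here).
tree = {
    "feature": "outlook",
    "default": "play",
    "children": {
        "sunny": {"leaf": "play"},
        "rain": {
            "feature": "windy",
            "default": "play",
            "children": {"no": {"leaf": "play"}, "yes": {"leaf": "stay"}},
        },
    },
}

decision = predict(tree, {"outlook": "rain", "windy": "yes"})  # "stay"
```

The `default` fallback is one possible design choice for unseen feature values; other implementations may handle that case differently.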

### Feature Selection Criteria for Decision Trees

Feature selection in decision trees involves choosing the best attributes to split the data at each node. Here are some common criteria and approaches:

1. **Gini Impurity**:
   - Used in CART (Classification and Regression Trees).
   - Measures how often a randomly chosen element of the dataset would be mislabeled if it were labeled at random according to the distribution of labels in the dataset.
   - Formula: \( Gini(D) = 1 - \sum_{i=1}^{C} p_i^2 \)
     where \( p_i \) is the proportion of elements in \( D \) that belong to class \( i \) and \( C \) is the number of classes.

2. **Information Gain**:
   - Used in ID3 and C4.5.
   - Measures the reduction in entropy or uncertainty about the dataset.
   - Formula: \( IG(T, X) = Entropy(T) - \sum_{v \in Values(X)} \frac{|T_v|}{|T|} \times Entropy(T_v) \)
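     where \( T \) is the total dataset, \( X \) is a feature, \( T_v \) is the subset of \( T \) where feature \( X \) has value \( v \), and \( Entropy(T) = - \sum_{i=1}^{C} p_i \log_2 p_i \) with \( p_i \) the proportion of elements of \( T \) in class \( i \).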

3. **Gain Ratio**:
   - Used in C4.5.
   - Modifies Information Gain by taking the intrinsic information of a split into account, which penalizes features that split the data into many small subsets.
   - Formula: \( GainRatio(T, X) = \frac{IG(T, X)}{SplitInformation(T, X)} \)
     where \( SplitInformation(T, X) = - \sum_{v \in Values(X)} \frac{|T_v|}{|T|} \log_2 \frac{|T_v|}{|T|} \).

4. **Chi-Square**:
   - Used in CHAID.
   - Measures the statistical significance of the association between the feature and the target variable.
   - Formula: \( \chi^2 = \sum \frac{(O - E)^2}{E} \)
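     where \( O \) is the observed frequency and \( E \) is the expected frequency of each combination of feature value and class, with \( E \) computed under the assumption that feature and target are independent.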

5. **Reduction in Variance**:
   - Used for regression trees.
   - Measures the reduction in variance as a result of a split.
   - Formula: \( \Delta Var = Var(T) - \sum_{v \in Values(X)} \frac{|T_v|}{|T|} \times Var(T_v) \)
     where \( Var \) is the variance of the target variable in dataset \( T \).

6. **Mean Squared Error (MSE)**:
   - Another criterion for regression trees.
   - Measures the average of the squares of the errors or deviations from the predicted values.
   - Formula: \( MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)
     where \( y_i \) is the actual value and \( \hat{y}_i \) is the predicted value.
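
The formulas above translate almost directly into code. The following is a minimal sketch for a single categorical feature, using only the standard library; the function names and the `partition` helper are chosen for this example and are not any library's API.

```python
import math
from collections import Counter, defaultdict


def proportions(values):
    """Proportion p_i of each distinct value in the list."""
    n = len(values)
    return [count / n for count in Counter(values).values()]


def gini(labels):
    return 1.0 - sum(p * p for p in proportions(labels))


def entropy(labels):
    return -sum(p * math.log2(p) for p in proportions(labels))


def partition(feature_values, targets):
    """Group the targets by feature value, giving the subsets T_v."""
    groups = defaultdict(list)
    for v, t in zip(feature_values, targets):
        groups[v].append(t)
    return list(groups.values())


def information_gain(feature_values, labels):
    n = len(labels)
    subsets = partition(feature_values, labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)


def gain_ratio(feature_values, labels):
    split_info = entropy(feature_values)  # SplitInformation: entropy of the feature's own values
    if split_info == 0:
        return 0.0  # the feature has a single value, so it cannot split the data
    return information_gain(feature_values, labels) / split_info


def chi_square(feature_values, labels):
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    total = 0.0
    for v, n_v in Counter(feature_values).items():
        for c, n_c in Counter(labels).items():
            expected = n_v * n_c / n  # expected count if feature and class were independent
            total += (observed[(v, c)] - expected) ** 2 / expected
    return total


def variance(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def variance_reduction(feature_values, ys):
    n = len(ys)
    subsets = partition(feature_values, ys)
    return variance(ys) - sum(len(s) / n * variance(s) for s in subsets)


def mse(actual, predicted):
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)
```

For a feature that separates two balanced classes perfectly, such as `outlook = ["sunny", "sunny", "rain", "rain"]` with `play = ["no", "no", "yes", "yes"]`, `gini(play)` is 0.5 and `information_gain(outlook, play)` is 1.0. Note also that if a node predicts the mean of its targets, `mse` of that prediction equals `variance` of the targets, which is why the two regression criteria lead to the same splits under that assumption.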

### Approaches to Feature Selection

1. **Greedy Approach**:
   - Selects the best feature at each node based on the chosen criterion.
   - Iteratively adds features that contribute most to the model’s performance.

2. **Recursive Feature Elimination (RFE)**:
   - Starts with all features and recursively removes the least important feature at each iteration.
   - Used to select the most relevant features.

3. **Embedded Methods**:
   - Integrates feature selection as part of the model training process.
   - Decision trees naturally perform feature selection as part of the algorithm.

4. **Filter Methods**:
   - Uses statistical techniques to evaluate the relevance of features before the model training process.
   - Examples include correlation coefficients and ANOVA tests.

5. **Wrapper Methods**:
   - Uses a predictive model to evaluate the combination of features and select the best subset.
   - Computationally expensive but often provides better performance.
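
As one concrete illustration of a filter method, features can be ranked by the absolute value of their correlation with the target before any model is trained. The sketch below assumes numeric features and a numeric target; `pearson`, `rank_features`, and the toy columns are hypothetical names and data made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (spread_x * spread_y)


def rank_features(columns, target):
    """Order feature names from most to least correlated with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: one informative feature and one that is nearly unrelated to the target.
columns = {
    "shoe_size": [42, 38, 41, 39, 42],
    "hours_studied": [1, 2, 3, 4, 5],
}
target = [50, 60, 70, 80, 90]
ranking = rank_features(columns, target)  # ["hours_studied", "shoe_size"]
```

Correlation only captures linear relationships, so a ranking like this is a quick screen rather than a substitute for the split criteria a tree applies during training.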

By employing these criteria and approaches, decision trees can effectively select the most relevant features, leading to better model performance and interpretability.