# DECISION TREE

## GENERAL

A Decision Tree is a supervised learning model used for both classification and regression tasks. It works by splitting the data into branches based on feature values, creating a tree-like structure where each internal node represents a test on a feature and each leaf node represents an outcome (class label or predicted value).

In a decision tree, each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label (or, for regression, a numerical value). The paths from the root node to the leaf nodes represent classification rules.
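To make the idea of root-to-leaf paths as rules concrete, here is a minimal sketch of a tiny hand-written tree. The features (`outlook`, `humidity`), the threshold of 70, and the labels are hypothetical and chosen only for illustration.

```python
def classify(outlook: str, humidity: float) -> str:
    """Hand-written decision tree; features, threshold and labels are illustrative assumptions."""
    # Root node: test on the 'outlook' attribute.
    if outlook == "sunny":
        # Internal (decision) node: test on the 'humidity' attribute.
        if humidity > 70.0:
            return "stay in"  # leaf node
        return "play"  # leaf node
    # Every other outlook value follows a branch that ends directly in a leaf.
    return "play"


print(classify("sunny", 85.0))  # stay in
print(classify("rainy", 85.0))  # play
```

Each `return` corresponds to a leaf, and the chain of conditions leading to it is one classification rule, for example "if outlook is sunny and humidity is above 70, then stay in".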

* ***root node***: the starting point, containing all the data.
* ***splitting***: the dataset is split based on the feature that provides the best separation (e.g., using *gini impurity* or *entropy* for classification).
* ***decision nodes***: intermediate nodes that split the data further based on feature values.
* ***leaf/terminal nodes***: the final nodes that provide the prediction (class label for classification or numerical value for regression).
* ***pruning***: the removal of sub-nodes from a decision node, typically to reduce overfitting. It is the opposite of splitting.
* ***branch/sub-tree***: a sub-section of an entire tree is called a branch or sub-tree.
* ***parent and child node***: a node that is divided into sub-nodes is called the parent of those sub-nodes; the sub-nodes are its children.
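The two impurity measures mentioned under *splitting* can be sketched in a few lines. This is a minimal illustration that assumes a non-empty list of class labels; the function names `gini` and `entropy` are chosen here and do not come from any particular library.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "b", "b"]
print(gini(pure), entropy(pure))    # 0.0 0.0
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```

Both measures are zero for a node that contains a single class and reach their maximum when the classes are evenly mixed, which is why a good split is one that lowers them.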

## ADVANTAGES

* ***easy to understand and interpret***: decision trees provide clear visualizations that make them highly interpretable. The structure of the tree allows for easy understanding of how decisions are made at each node, which can be particularly useful for non-experts.

* ***non-linear relationships***: decision trees can model non-linear relationships between features and the target variable, which makes them more flexible than linear models that assume a linear connection between inputs and outputs.

* ***handling of numerical and categorical data***: in principle, both numerical and categorical data can be processed, and no feature scaling is needed because splits depend only on the ordering of values. This distinguishes decision trees from models that require scaling or transformation, although some implementations still require categorical features to be encoded as numbers.

* ***robustness to outliers***: the model is relatively insensitive to outliers in the input features. Splits depend on how a threshold partitions the samples rather than on the magnitude of the values, so extreme values do not heavily influence the model’s decision-making process.

* ***automatic feature selection***: decision trees inherently perform feature selection by choosing the most relevant features for creating splits in the data, which can lead to reduced dimensionality and enhanced model performance.

## DISADVANTAGES

* ***prone to overfitting***: decision trees have a tendency to overfit, especially when the tree depth increases. A deep tree may capture noise and minor patterns, leading to poor generalization on new, unseen data.

* ***instability***: decision trees can exhibit instability since slight changes in the data may result in a completely different structure. This is due to the greedy nature of the algorithm that selects the best split at each step, which can cause drastic changes in the final model.

* ***bias towards dominant features***: the model can exhibit bias towards features with many categories or many distinct numerical values. Such features offer more candidate splits, so they can achieve a large reduction in impurity by chance even if they are not the most informative in predicting the target variable.

* ***difficulty modeling complex relationships***: while decision trees can handle non-linear data, they often struggle with complex interactions between features. They create axis-aligned splits that may not be the best way to model certain types of intricate relationships.

* ***computationally intensive***: training can become expensive because candidate splits are evaluated over the features and samples at every node, so very deep trees or large datasets take longer to build and use more memory. Inference time also grows with tree depth, since each prediction follows one path from the root to a leaf.
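The bias towards features with many distinct values can be illustrated with a small, made-up dataset. The sketch below assumes a multiway split that creates one child per distinct value (as in ID3-style trees, not the binary splits described later); the data and the helper name `weighted_gini_after_split` are hypothetical.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini_after_split(feature_values, labels):
    """Weighted Gini impurity after creating one child per distinct feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    return sum(len(group) / n * gini(group) for group in groups.values())


labels = ["yes", "yes", "no", "no", "yes", "no"]
weather = ["sun", "sun", "rain", "rain", "sun", "sun"]  # partly informative feature
row_id = [0, 1, 2, 3, 4, 5]  # unique per row, carries no predictive signal

print(weighted_gini_after_split(weather, labels))
print(weighted_gini_after_split(row_id, labels))  # 0.0
```

The row identifier produces perfectly pure children only because every child holds a single sample, so a purely impurity-based criterion would prefer it over the `weather` feature even though it cannot generalize to new rows.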

## ALGORITHM STEPS

1. ***select the best feature to split***: the algorithm starts with the entire dataset and selects the best feature to split the data. It chooses the feature that provides the best separation based on impurity measures:
    * *classification*: uses *gini index* or *entropy (information gain)*
    * *regression*: uses *mean squared error (mse)* or *variance reduction*

2. ***split the data***: the dataset is split into two *child nodes (binary split)*. Each split should reduce impurity, making the subgroups more homogeneous.

3. ***recursively repeat***: the process continues recursively for each child node. This forms a tree-like structure until a stopping condition is met:
    * maximum depth is reached.
    * the number of samples in a node falls below a minimum.
    * no further reduction in impurity.

4. ***make predictions***: once the tree is built, it makes predictions by traversing from the root to a leaf node based on the feature values of the input. The class or value in the leaf node is the final prediction.
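The four steps above can be sketched as a small classification tree in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: it handles numerical features only, uses Gini impurity, and takes each observed value as a candidate threshold. The names `Node`, `best_split`, `build`, `predict`, and the toy data are all hypothetical. A regression variant would replace Gini with MSE and the majority class with the mean target.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    prediction: str                    # majority class of the samples in this node
    feature: Optional[int] = None      # index of the tested feature (None for a leaf)
    threshold: Optional[float] = None  # samples with x[feature] <= threshold go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(X, y):
    """Step 1: return (weighted impurity, feature, threshold) of the best binary split, or None."""
    best = None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Every distinct value except the largest is a candidate threshold,
        # which guarantees that both children are non-empty.
        for t in values[:-1]:
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build(X, y, depth=0, max_depth=3, min_samples=2):
    """Steps 2 and 3: split the data and recurse until a stopping condition is met."""
    node = Node(prediction=Counter(y).most_common(1)[0][0])
    if depth >= max_depth or len(y) < min_samples:
        return node  # maximum depth reached or too few samples
    split = best_split(X, y)
    if split is None or split[0] >= gini(y):
        return node  # no further reduction in impurity
    _, node.feature, node.threshold = split
    left = [i for i, row in enumerate(X) if row[node.feature] <= node.threshold]
    right = [i for i, row in enumerate(X) if row[node.feature] > node.threshold]
    node.left = build([X[i] for i in left], [y[i] for i in left],
                      depth + 1, max_depth, min_samples)
    node.right = build([X[i] for i in right], [y[i] for i in right],
                       depth + 1, max_depth, min_samples)
    return node


def predict(node, x):
    """Step 4: walk from the root to a leaf and return its prediction."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Toy data (made up): the first feature separates the two classes.
X = [[2.0, 1.0], [3.0, 1.5], [1.0, 3.0], [6.0, 2.0], [7.0, 3.5], [8.0, 1.0]]
y = ["a", "a", "a", "b", "b", "b"]
tree = build(X, y)
print(tree.feature, tree.threshold)  # 0 3.0
print(predict(tree, [2.5, 2.0]))     # a
print(predict(tree, [6.5, 2.0]))     # b
```

The split search is greedy: it picks the locally best split at each node and never revisits it, which is the behaviour behind the instability noted under DISADVANTAGES.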

## CLASSIFICATION TREE