

# Decision Tree

## Introduction

Decision trees are a type of supervised learning algorithm that can be used for both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

## Decision tree Steps

1. Calculate the entropy of the target at the current node.
2. For each feature, calculate the weighted entropy of the target after splitting on that feature.
3. Calculate the information gain for each feature.
4. Choose the feature with the largest information gain as the split for the node (the root node on the first pass).
5. Repeat steps 1 to 4 for each branch until a stopping criterion is met, for example the desired tree depth.

## Decision tree for classification

Entropy is a measure of the impurity of a set of examples. The entropy of a set $S$ is defined as:

$$
H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)
$$

where $p_i$ is the proportion of examples in $S$ that belong to the $i$th class and $c$ is the number of classes.

The entropy is 0 if all samples at a node belong to the same class, and it is maximal for a uniform class distribution. For example, in a binary class setting, the entropy is 0 if $p_1 = 1$ or $p_1 = 0$. If the classes are distributed uniformly with $p_1 = p_2 = 0.5$, the entropy is 1, which is its maximum value for two classes.

The same equation gives the entropy of the dataset $D$ at a node:

$$
H(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
$$

where
- $p_i$ is the proportion of the $i$th class
- $c$ is the number of classes
- $y$ is the class label.
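As a small illustration of the formula above, the following sketch computes the entropy for any number of classes. The function name `entropy` and the sample label lists are assumptions made for this example and are not part of the original notes; the sum is written as $\sum_i p_i \log_2(1/p_i)$, which is the same quantity as $-\sum_i p_i \log_2(p_i)$.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    # Classes that never occur contribute nothing, so only observed classes are summed.
    return sum((k / n) * math.log2(n / k) for k in counts.values())

print(entropy([1, 1, 1, 1]))            # pure node → 0.0
print(entropy([0, 1, 0, 1]))            # uniform over two classes → 1.0
print(entropy(["a", "b", "c", "d"]))    # uniform over four classes → 2.0
```

With $c$ equally likely classes the entropy is $\log_2(c)$, so the maximum of 1 applies only to the binary case.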

For $y = 0$ and $y = 1$ (binary class setting), we can rewrite the equation as follows:

$$
H(D_1) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)
$$

where
- $p_1$ is the proportion of the positive class
- $p_2 = 1 - p_1$ is the proportion of the negative class
- $D_1$ is the dataset of the left node.

The information gain is the entropy of the dataset before the split minus the weighted entropy after the split by an attribute. The following equation shows how to calculate the information gain $IG$ for a decision tree:

$$
IG(D_p, f) = H(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} H(D_j)
$$

$$
IG(D_p, f) = H(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} \left(-p_{j1} \log_2(p_{j1}) - p_{j2} \log_2(p_{j2})\right)
$$
where 
- $f$ is the feature to perform the split
- $D_p$ and $D_j$ are the dataset of the parent and $j$th child node
- $N_p$ is the total number of samples at the parent node
- $N_j$ is the number of samples in the $j$th child node
- $m$ is the number of child nodes

For the binary class setting ($y = 0$ and $y = 1$) with two child nodes ($m = 2$), we can rewrite the equation as follows:

$$ IG(D_p, f) = H(D_p) - \sum_{j=1}^{2} \frac{N_j}{N_p} H(D_j) $$

$$ IG(D_p, f) = H(D_p) - \left(\frac{N_{left}}{N_p} H(D_{left}) + \frac{N_{right}}{N_p} H(D_{right})\right) $$

$$ IG(D_p, f) = H(D_p) - \left(\frac{N_{left}}{N_p} \left(-p_{left1} \log_2(p_{left1}) - (1 - p_{left1}) \log_2(1 - p_{left1})\right) + \frac{N_{right}}{N_p} \left(-p_{right1} \log_2(p_{right1}) - (1 - p_{right1}) \log_2(1 - p_{right1})\right)\right) $$

where
- $p_{j1}$ is the proportion of the positive class in the $j$th child node
- $p_{j2} = 1 - p_{j1}$ is the proportion of the negative class in the $j$th child node
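
As a worked example of the binary formula, assume a hypothetical parent node with $N_p = 10$ samples, 5 of them positive, that is split into a left child with 5 samples (1 positive) and a right child with 5 samples (4 positive). These numbers are chosen for illustration only.

$$
H(D_p) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1
$$

$$
H(D_{left}) = -0.2 \log_2(0.2) - 0.8 \log_2(0.8) \approx 0.722, \qquad H(D_{right}) = -0.8 \log_2(0.8) - 0.2 \log_2(0.2) \approx 0.722
$$

$$
IG(D_p, f) = 1 - \left(\frac{5}{10} \cdot 0.722 + \frac{5}{10} \cdot 0.722\right) \approx 0.278
$$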

## Gini impurity 
Gini impurity is another criterion that is often used in training decision trees:

$$Gini(p) = \sum_{k=1}^{|\mathcal{Y}|} p_{k} (1 - p_{k}) = \sum_{k=1}^{|\mathcal{Y}|} p_{k} - \sum_{k=1}^{|\mathcal{Y}|} p_{k}^2 = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_{k}^2$$

where $p_{k}$ is the proportion of the $k$th class.

The information gain for the Gini impurity is calculated as follows:

$$IG(D_p, f) = Gini(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} Gini(D_j)$$

where $f$ is the feature to perform the split, $D_p$ and $D_j$ are the dataset of the parent and $j$th child node, $N_p$ is the total number of samples at the parent node, and $N_j$ is the number of samples in the $j$th child node.
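
The two Gini formulas above can be sketched in a few lines. The helper names `gini_impurity` and `gini_information_gain` and the example labels are assumptions made for this illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def gini_information_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * gini_impurity(child) for child in children)
    return gini_impurity(parent) - weighted

# Hypothetical split: 10 samples, 5 positive, divided into two children of 5.
parent = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
left = [1, 0, 0, 0, 0]
right = [1, 1, 1, 1, 0]
print(f"parent impurity: {gini_impurity(parent):.3f}")
print(f"gain of the split: {gini_information_gain(parent, [left, right]):.3f}")
```

For this split the parent impurity is 0.5 and each child has impurity $1 - 0.2^2 - 0.8^2 = 0.32$, so the gain is about 0.18.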

## Classification error
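The classification error is another impurity criterion. It is more useful for pruning than for growing a decision tree, because it is less sensitive to changes in the class probabilities of the nodes: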

$$E = 1 - \max_k p_{k}$$

where $p_{k}$ is the proportion of the $k$th class.
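
To compare the three impurity measures, the sketch below evaluates each of them for a binary node as the proportion $p$ of the positive class varies. The function names are assumptions made for this example.

```python
import math

def binary_entropy(p):
    """Entropy of a two-class node with positive-class proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def binary_gini(p):
    """Gini impurity of a two-class node."""
    return 1.0 - p ** 2 - (1 - p) ** 2

def binary_error(p):
    """Classification error 1 - max_k p_k of a two-class node."""
    return 1.0 - max(p, 1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
    print(f"p={p:.2f}  entropy={binary_entropy(p):.3f}  "
          f"gini={binary_gini(p):.3f}  error={binary_error(p):.3f}")
```

All three measures are 0 for a pure node and peak at $p = 0.5$, where the entropy is 1 and both the Gini impurity and the classification error are 0.5.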

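The information gain ratio is another criterion that is often used in training decision trees. It normalizes the information gain by the split information, which penalizes features that split the data into many small child nodes:

$$IGR(D_p, f) = \frac{IG(D_p, f)}{SplitInfo(D_p, f)}, \qquad SplitInfo(D_p, f) = -\sum_{j=1}^{m} \frac{N_j}{N_p} \log_2\left(\frac{N_j}{N_p}\right)$$

where $f$ is the feature to perform the split, $D_p$ is the dataset of the parent node, $N_p$ is the total number of samples at the parent node, $N_j$ is the number of samples in the $j$th child node, and $m$ is the number of child nodes.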

The following code implements the entropy and information gain equations:






```python
import numpy as np

def compute_entropy(y):
    """
    Compute the binary entropy (in bits) of a set of 0/1 labels
    """
    entropy = 0.
    if len(y) == 0:
        return 0
    # proportion of positive examples
    p1 = np.sum(y == 1) / y.shape[0]
    # a pure node has zero entropy (this also avoids log2(0))
    if p1 == 0 or p1 == 1:
        return 0
    entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
    return entropy

def split_dataset(X, node_indices, feature, threshold):
    """
    Split the samples in node_indices into two subsets:
    left if X[i, feature] <= threshold, right otherwise
    """
    left_indices = []
    right_indices = []
    for i in node_indices:
        if X[i, feature] <= threshold:
            left_indices.append(i)
        else:
            right_indices.append(i)
    return left_indices, right_indices

def information_gain(X_train, y_train, root_indices, feature, threshold=0):
    """
    Compute the information gain of splitting a node on a feature
    """
    left_indices, right_indices = split_dataset(X_train, root_indices, feature, threshold)
    left_entropy = compute_entropy(y_train[left_indices])
    right_entropy = compute_entropy(y_train[right_indices])
    # NOTE: this is the entropy of all labels, which equals the node entropy
    # only at the root. For a non-root node the parent entropy would be
    # compute_entropy(y_train[root_indices]). The outputs recorded in these
    # notes were produced with the line as written.
    entropy = compute_entropy(y_train)
    # a split that leaves one side empty separates nothing
    if len(left_indices) == 0 or len(right_indices) == 0:
        return 0
    ig = entropy - (len(left_indices)/len(root_indices))*left_entropy - (len(right_indices)/len(root_indices))*right_entropy
    return ig

X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],[0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])
y_train = np.array([1,1,0,0,1,0,0,1,1,0])
# dataset shape
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("Total number of samples: ", X_train.shape[0])
print("Number of features: ", X_train.shape[1])
# entropy of root node
print("\nentropy of root node" ,compute_entropy(y_train))

# the root node contains every sample
root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# split the root node on feature 0
feature = 0
left_indices, right_indices = split_dataset(X_train, root_indices, feature, threshold=0)
print("\nCASE 1:")
print("Left indices: ", left_indices)
print("Right indices: ", right_indices)

# split the root node on feature 1
feature = 1
left_indices, right_indices = split_dataset(X_train, root_indices, feature, threshold=0)
print("CASE 2:")
print("Left indices: ", left_indices)
print("Right indices: ", right_indices)

# information gain of each feature at the root node
print("\ninformation gain 0 : ", information_gain(X_train, y_train, root_indices, feature=0, threshold=0))
print("information gain 1 : ", information_gain(X_train, y_train, root_indices, feature=1, threshold=0))
print("information gain 2 : ", information_gain(X_train, y_train, root_indices, feature=2, threshold=0))
```

```text
X_train shape:  (10, 3)
y_train shape:  (10,)
Total number of samples:  10
Number of features:  3

entropy of root node 1.0

CASE 1:
Left indices:  [5, 6, 8]
Right indices:  [0, 1, 2, 3, 4, 7, 9]
CASE 2:
Left indices:  [1, 2, 3, 6, 7, 9]
Right indices:  [0, 4, 5, 8]

information gain 0 :  0.034851554559677145
information gain 1 :  0.12451124978365319
information gain 2 :  0.2780719051126377
```


```python
# find the best split
def get_best_split(X_train, y_train, root_indices):
    """
    Find the feature and threshold with the largest information gain for a node
    """
    best_feature = 0
    best_threshold = 0
    max_ig = 0
    for feature in range(X_train.shape[1]):
        # range(0, 1) yields only threshold 0, which is enough for binary features
        for threshold in range(0, 1):
            ig = information_gain(X_train, y_train, root_indices, feature, threshold)
            if ig > max_ig:
                max_ig = ig
                best_feature = feature
                best_threshold = threshold
    return best_feature, best_threshold

# find the best split for a node
print("CASE 1:")
root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
best_feature, best_threshold = get_best_split(X_train, y_train, root_indices)
print("Best feature: ", best_feature, "Best threshold: ", best_threshold)
print("CASE 2:")
root_indices = [0, 2, 4, 5, 6, 8]
best_feature, best_threshold = get_best_split(X_train, y_train, root_indices)
print("Best feature: ", best_feature, "Best threshold: ", best_threshold)
```


```text
CASE 1:
Best feature:  2 Best threshold:  0
CASE 2:
Best feature:  1 Best threshold:  0
```


```python
# build a decision tree recursively
def build_tree_recursive(X_train, y_train, node_indices, current_depth, max_depth, branch_label=""):
    """
    Build a decision tree recursively
    """
    # 1. compute information gain for all features
    # 2. find the best feature and the best threshold
    # 3. split dataset into two subsets
    # 4. build left and right subtrees recursively
    # 5. return a tree node

    # an empty branch has no node
    if len(node_indices) == 0:
        return None

    if max_depth == 0:
        return None

    # stop at the maximum depth and return a leaf
    if current_depth == max_depth:
        formatting = " " * current_depth + "-" * current_depth
        print(formatting, "%s leaf node with indices" % branch_label, node_indices)
        return {'indices': node_indices}

    best_feature, best_threshold = get_best_split(X_train, y_train, node_indices)
    formatting = "-" * current_depth
    print(formatting, "current_depth: ", current_depth, "best_feature: ", best_feature, "best_threshold: ", best_threshold)
    left_indices, right_indices = split_dataset(X_train, node_indices, best_feature, best_threshold)
    node = {}
    node['feature'] = best_feature
    node['threshold'] = best_threshold
    node['indices'] = node_indices
    node['left'] = build_tree_recursive(X_train, y_train, left_indices, current_depth+1, max_depth, branch_label="left")
    node['right'] = build_tree_recursive(X_train, y_train, right_indices, current_depth+1, max_depth, branch_label="right")
    return node

def print_tree_indices(tree, depth=0):
    """
    Print the sample indices stored at every node, indented by depth
    """
    if tree is None:
        return
    formatting = " " * depth + "-" * depth
    if 'indices' in tree:
        print(formatting, "Node with indices:", tree['indices'])

    if 'left' in tree:
        print_tree_indices(tree['left'], depth + 1)

    if 'right' in tree:
        print_tree_indices(tree['right'], depth + 1)


X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],[0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])
y_train = np.array([1,1,0,0,1,0,0,1,1,0])

root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

node = build_tree_recursive(X_train, y_train, root_indices, current_depth=0, max_depth=3)
print("\nTree with indices:")
# You can call the function to print the tree with indices
print_tree_indices(node)
```



```text
 current_depth:  0 best_feature:  2 best_threshold:  0
- current_depth:  1 best_feature:  1 best_threshold:  0
-- current_depth:  2 best_feature:  0 best_threshold:  0
   --- left leaf node with indices [6]
   --- right leaf node with indices [2, 3, 9]
-- current_depth:  2 best_feature:  0 best_threshold:  0
   --- left leaf node with indices [8]
- current_depth:  1 best_feature:  0 best_threshold:  0
-- current_depth:  2 best_feature:  0 best_threshold:  0
   --- left leaf node with indices [5]
-- current_depth:  2 best_feature:  1 best_threshold:  0
   --- left leaf node with indices [1, 7]
   --- right leaf node with indices [0, 4]

Tree with indices:
 Node with indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
 - Node with indices: [2, 3, 6, 8, 9]
  -- Node with indices: [2, 3, 6, 9]
   --- Node with indices: [6]
   --- Node with indices: [2, 3, 9]
  -- Node with indices: [8]
   --- Node with indices: [8]
 - Node with indices: [0, 1, 4, 5, 7]
  -- Node with indices: [5]
   --- Node with indi
```

# Gini impurity 

```python
# with gini index

def compute_gini(y):
    """
    Compute the Gini impurity of a set of 0/1 labels
    """
    gini = 0.
    if len(y) == 0:
        return 0
    p1 = np.sum(y == 1) / y.shape[0]
    if p1 == 0 or p1 == 1:
        return 0
    gini = 1 - p1**2 - (1 - p1)**2
    return gini

def gini_gain(X_train, y_train, root_indices, feature, threshold=0):
    """
    Compute the reduction in Gini impurity of splitting a node on a feature
    """
    left_indices, right_indices = split_dataset(X_train, root_indices, feature, threshold)
    left_gini = compute_gini(y_train[left_indices])
    right_gini = compute_gini(y_train[right_indices])
    # NOTE: as in information_gain, this is the impurity of all labels, which
    # equals the node impurity only at the root
    gini = compute_gini(y_train)
    if len(left_indices) == 0 or len(right_indices) == 0:
        return 0
    gg = gini - (len(left_indices)/len(root_indices))*left_gini - (len(right_indices)/len(root_indices))*right_gini
    return gg

def get_best_split_gini(X_train, y_train, root_indices):
    """
    Find the feature and threshold with the largest Gini gain for a node
    """
    best_feature = 0
    best_threshold = 0
    max_gg = 0
    for feature in range(X_train.shape[1]):
        for threshold in range(0, 1):
            gg = gini_gain(X_train, y_train, root_indices, feature, threshold)
            if gg > max_gg:
                max_gg = gg
                best_feature = feature
                best_threshold = threshold
    return best_feature, best_threshold

def build_tree_recursive_gini(X_train, y_train, node_indices, current_depth, max_depth, branch_label=""):
    """
    Build a decision tree recursively, choosing splits by Gini gain
    """
    if len(node_indices) == 0:
        return None

    if max_depth == 0:
        return None

    if current_depth == max_depth:
        formatting = " " * current_depth + "-" * current_depth
        print(formatting, "%s leaf node with indices" % branch_label, node_indices)
        return {'indices': node_indices}

    best_feature, best_threshold = get_best_split_gini(X_train, y_train, node_indices)
    formatting = "-" * current_depth
    print(formatting, "current_depth: ", current_depth, "best_feature: ", best_feature, "best_threshold: ", best_threshold)
    left_indices, right_indices = split_dataset(X_train, node_indices, best_feature, best_threshold)
    node = {}
    node['feature'] = best_feature
    node['threshold'] = best_threshold
    node['indices'] = node_indices
    node['left'] = build_tree_recursive_gini(X_train, y_train, left_indices, current_depth+1, max_depth, branch_label="left")
    node['right'] = build_tree_recursive_gini(X_train, y_train, right_indices, current_depth+1, max_depth, branch_label="right")
    return node

# X_train, y_train, root_indices, split_dataset and print_tree_indices are reused from the cells above

node = build_tree_recursive_gini(X_train, y_train, root_indices, current_depth=0, max_depth=3)
print("\nTree with indices:")
# You can call the function to print the tree with indices
print_tree_indices(node)
```



```text
 current_depth:  0 best_feature:  2 best_threshold:  0
- current_depth:  1 best_feature:  1 best_threshold:  0
-- current_depth:  2 best_feature:  0 best_threshold:  0
   --- left leaf node with indices [6]
   --- right leaf node with indices [2, 3, 9]
-- current_depth:  2 best_feature:  0 best_threshold:  0
   --- left leaf node with indices [8]
- current_depth:  1 best_feature:  0 best_threshold:  0
-- current_depth:  2 best_feature:  0 best_threshold:  0
   --- left leaf node with indices [5]
-- current_depth:  2 best_feature:  1 best_threshold:  0
   --- left leaf node with indices [1, 7]
   --- right leaf node with indices [0, 4]

Tree with indices:
 Node with indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
 - Node with indices: [2, 3, 6, 8, 9]
  -- Node with indices: [2, 3, 6, 9]
   --- Node with indices: [6]
   --- Node with indices: [2, 3, 9]
  -- Node with indices: [8]
   --- Node with indices: [8]
 - Node with indices: [0, 1, 4, 5, 7]
  -- Node with indices: [5]
   --- Node with indi
```

# Tree Ensemble

Tree ensembles are a powerful machine learning technique that addresses the limitations of single decision trees. They consist of multiple decision trees, each trained on slightly different versions of the data, and their predictions are combined to improve accuracy and robustness.

- Weakness of Single Decision Trees: Single decision trees can be highly sensitive to minor data variations, leading to different predictions. This sensitivity limits their reliability.

- Tree Ensembles for Robustness: Tree ensembles overcome sensitivity issues by using multiple decision trees. Each tree is trained on a different version of the data, creating diversity in the ensemble.

- Voting in Tree Ensembles: When making predictions, all trees in the ensemble contribute their votes. The final prediction is determined by majority voting among the trees.

- Robustness through Voting: Ensembles reduce sensitivity to individual tree behavior. Errors or variations in one tree's prediction have limited impact as each tree gets one vote.

- Sampling with Replacement: To create diversity, tree ensembles use "sampling with replacement." Randomly selecting examples from the original dataset with duplicates allows for various training subsets.

- Constructing Random Training Sets: Multiple random training sets are generated by repeatedly selecting examples with replacement. Each tree is trained using one of these random sets.

- Diversity in Ensemble: Randomness in training data and tree construction leads to a diverse collection of decision trees within the ensemble, enhancing accuracy.
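
Sampling with replacement, as described in the list above, can be sketched with the standard library alone. The helper name `bootstrap_sample`, the fixed seed, and the dataset size of 10 are assumptions made for this illustration.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices uniformly at random with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_samples = 10
for tree_id in range(3):
    indices = bootstrap_sample(n_samples, rng)
    # Samples that were never drawn are "out-of-bag" for this tree.
    out_of_bag = sorted(set(range(n_samples)) - set(indices))
    print(f"tree {tree_id}: sample={sorted(indices)} out-of-bag={out_of_bag}")
```

Each sample has the same size as the original dataset but contains duplicates. For large datasets, roughly $1/e \approx 36.8\%$ of the examples are left out of any given sample, which is what makes the trees differ from one another.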

**Three Ensemble Techniques:**

- **Bagging (Bootstrap Aggregation):** Bagging involves resampling the training set multiple times, training decision trees on each resampled set, and then aggregating their predictions. It improves stability and reduces overfitting.

- **Boosting:** Boosting trains weak learners sequentially, with each learner focusing on examples previously misclassified by the ensemble. It adapts to errors and progressively enhances the model's accuracy.

- **Stacking:** Stacking combines multiple models, typically using a meta-classifier or meta-regressor. Base-level models are trained independently, and their predictions serve as features for a meta-model, leading to improved predictive performance.

<!-- ![tree_ensemble](images/tree_ensemble.png) -->

--- 

## Bagging

Bagging is an abbreviation for bootstrap aggregation. Bagging is a method that involves manipulating the training set by resampling, and then aggregating the predictions from each resampled training set. The following figure shows the bagging process:

<!-- ![bagging](images/bagging.png) -->

### Random forest

Random forest is an ensemble of decision trees, where each tree is slightly different from the others. The idea behind random forest is to average multiple (deep) decision trees that individually suffer from high variance (overfitting), to build a more robust model that has better generalization performance and is less susceptible to overfitting. The following figure shows a random forest with three decision trees:

Building the Random Forest:

- Multiple Trees: A Random Forest consists of a collection (ensemble) of decision trees. The number of trees is a hyperparameter, typically denoted as "B."
- Bootstrapping: To create each decision tree in the forest, a random sample (typically the same size as the original dataset) is drawn from the original dataset with replacement. This process is called bootstrapping. As a result, each tree is trained on a slightly different version of the data.
- Feature Subsampling: At each node of a decision tree, instead of considering all features to determine the best split, a random subset of features (typically denoted as "K") is considered. This introduces an additional layer of randomness and helps ensure that different trees focus on different subsets of features.

Training the Decision Trees:

- Each decision tree in the Random Forest is trained independently using its bootstrapped subset of data and feature subsampling.
- The decision trees aim to learn patterns and relationships in the data, making splits based on features and their values.
- Splits are determined using criteria like information gain (for classification) or mean squared error reduction (for regression).

Making Predictions:

- For classification tasks, when a new data point needs to be classified, all the decision trees in the forest make predictions.
- Each tree "votes" for a class label, and the class with the majority of votes becomes the final prediction.
- For regression tasks, the predictions of all trees are averaged to produce the final prediction.

Advantages of Random Forest:

- Reduced Overfitting: By using multiple trees trained on different subsets of data, Random Forests are less prone to overfitting compared to a single decision tree.
- High Accuracy: The ensemble nature of Random Forests typically results in higher accuracy and better generalization to new data.
- Robustness: Random Forests are robust to noisy data and outliers due to the ensemble averaging.
- Feature Importance: Random Forests can provide insights into feature importance, helping to identify which features are most influential in making predictions.

Hyperparameters: 
- Random Forests have hyperparameters like the number of trees (B), the number of features to consider at each split (K), and tree-specific hyperparameters (e.g., tree depth). These can be tuned to optimize performance for specific tasks.
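
Feature subsampling can be sketched as follows: for every node, only K randomly chosen features are candidates for the split. The helper name `sample_features`, the values of B and the feature count, and the choice K ≈ √(number of features) are assumptions made for this illustration; the square-root rule is a common heuristic for classification, not something prescribed by these notes.

```python
import math
import random

def sample_features(n_features, k, rng):
    """Pick k distinct candidate feature indices for one node."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)            # fixed seed so the sketch is reproducible
n_trees = 3                       # B, the number of trees (assumed value)
n_features = 9                    # total number of features (assumed value)
k = int(math.sqrt(n_features))    # K, assumed heuristic: square root of n_features

for tree_id in range(n_trees):
    # A real forest would repeat this draw at every node of every tree;
    # here only the root node of each tree is shown.
    candidates = sample_features(n_features, k, rng)
    print(f"tree {tree_id}: root split considers features {candidates}")
```

Because each node sees a different subset, a single dominant feature cannot be chosen at the top of every tree, which reduces the correlation between trees.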

<!-- ![random_forest](images/random_forest.png) -->

--- 

## Boosting

Boosting is an ensemble machine learning method that combines multiple weak learners to create a strong learner. It focuses on training weak learners sequentially, with each learner attempting to correct the mistakes made by its predecessors. Here's a summary of the key points:

- Ensemble Learning:
   - Boosting is a type of ensemble learning, where multiple machine learning models, often referred to as "weak learners," are combined to improve predictive accuracy.

- Sequential Training:
   - Boosting trains the weak learners sequentially, one after another.
   - The training process emphasizes examples that the previous learners have classified incorrectly, allowing subsequent learners to focus on the challenging cases.

- Weighted Examples:
   - Examples in the training dataset are assigned weights, with misclassified examples given higher weights.
   - This weighting mechanism ensures that the next weak learner gives more attention to the examples that the ensemble is struggling to classify correctly.

- Adaptive Learning:
   - As boosting progresses, each weak learner adapts to the errors made by the previous learners.
   - The goal is to iteratively improve the ensemble's performance, primarily by reducing the bias of the model.

- Aggregating Predictions:
   - After training all weak learners, boosting combines their predictions to make the final prediction.
   - For classification tasks, a weighted majority vote is often used, where the weight of each learner's prediction depends on its accuracy.
   - For regression tasks, the learners' predictions are typically combined as a weighted sum.

- Boosting Algorithms:
   - Several boosting algorithms exist, with AdaBoost (Adaptive Boosting) and Gradient Boosting being two of the most well-known.
   - These algorithms differ in how they assign weights to examples and update the model.

- Strength in Weakness:
   - The term "weak learner" refers to models that perform slightly better than random chance.
   - Boosting demonstrates that by combining many weak learners, a strong learner with high predictive accuracy can be achieved.

- Model Generalization:
   - Boosting often leads to models with excellent generalization capabilities, making them suitable for a wide range of applications.

<!-- ![boosting](images/boosting.png) -->

**AdaBoost**

AdaBoost is a boosting algorithm that combines multiple weak classifiers to build a strong classifier. AdaBoost assigns weights to individual training samples, and trains a weak classifier on the weighted training samples. The following figure shows the AdaBoost process:

<!-- ![adaboost](images/adaboost.png) -->
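
One round of the AdaBoost weight update can be sketched as below, using the classic formulation for labels in $\{-1, +1\}$. The function name `adaboost_round` and the example predictions are assumptions made for this illustration, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost weight update for labels in {-1, +1}.

    Assumes the weights sum to 1 and the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak classifier.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Weight of this classifier in the final vote: larger when the error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake, so
    # misclassified samples are scaled up and correct ones are scaled down.
    new_weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # hypothetical weak classifier; the sample at index 1 is wrong
weights = [1 / len(y_true)] * len(y_true)
alpha, weights = adaboost_round(weights, y_true, y_pred)
print(f"alpha = {alpha:.3f}")
print("updated weights:", [round(w, 3) for w in weights])
```

After the update, the single misclassified sample carries half of the total weight (0.5) and each correctly classified sample carries 0.125, so the next weak classifier is pushed to get that sample right.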

**Gradient boosting**

Gradient boosting is a boosting algorithm that builds a strong model by combining multiple weak learners, usually shallow decision trees. Gradient boosting builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function: each new learner is fit to the negative gradient of the loss with respect to the current predictions. The following figure shows the gradient boosting process:

<!-- ![gradient_boosting](images/gradient_boosting.png) -->
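
A minimal sketch of stage-wise gradient boosting for the squared loss is shown below, using one-split regression stumps on a single feature as weak learners. The function names, the toy data, the number of rounds, and the learning rate are all assumptions made for this illustration; for the squared loss, the negative gradient is simply the residual.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals (needs >= 2 distinct x values)."""
    best = None
    # The largest value is skipped so that the right side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Boost stumps on the residuals, i.e. the negative gradient of the squared loss."""
    base = sum(y) / len(y)            # initial constant prediction
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(x, y)
mean_y = sum(y) / len(y)
baseline_mse = sum((yi - mean_y) ** 2 for yi in y) / len(y)
boosted_mse = sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)
print(f"baseline MSE: {baseline_mse:.4f}, boosted MSE: {boosted_mse:.4f}")
```

Swapping in a different differentiable loss only changes how `residuals` is computed, which is the sense in which gradient boosting generalizes other boosting methods.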

**XGBoost**

XGBoost stands for "extreme gradient boosting" and is a widely used open-source implementation of the boosted decision tree algorithm. It is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The following figure shows the XGBoost process:

**Motivation for Boosting:** Boosting is introduced as a technique to improve the performance of decision tree ensembles by focusing on examples that the ensemble of trees is not yet classifying accurately. It is likened to the concept of "deliberate practice" in learning.

**Boosting Procedure:** 
   - After building each decision tree in the ensemble, the predictions of the ensemble are evaluated on the original training set.
   - Examples that were misclassified receive higher attention during the next round of training to correct errors.
   - This process continues for a total of "B" iterations, where each iteration focuses on improving predictions on examples previously misclassified by the ensemble.

**XGBoost Advantages:**
   - XGBoost is highlighted as a highly efficient and competitive implementation of the boosting algorithm.
   - It offers default settings for splitting criteria and stopping criteria for tree construction.
   - XGBoost includes built-in regularization techniques to prevent overfitting.
   
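**XGBoost Working:**

Conceptually, XGBoost concentrates each new tree on the examples that the current ensemble predicts poorly, without the need to generate multiple randomly chosen training sets with replacement. The steps below describe this as explicit reweighting of the training examples, which is the intuition and is literally how AdaBoost works. XGBoost itself is a gradient boosting method, so the emphasis on each example comes from the gradient and second-order statistics of the loss at the current predictions. Here's how the weighting view works:
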
1. **Initial Weights:**
   - At the beginning of the boosting process, all training examples are assigned equal weights, indicating their importance.

2. **First Iteration:**
   - In the first iteration, a decision tree (or weak learner) is trained on the initial weighted dataset.
   - After training, the model makes predictions on the training data, and some examples may be misclassified while others are correctly classified.

3. **Weight Adjustment:**
   - XGBoost assigns higher weights to the misclassified examples and lower weights to the correctly classified ones.
   - The intuition is that the model should pay more attention to examples it struggled with in the previous iteration.

4. **Subsequent Iterations:**
   - For each subsequent iteration (up to a specified number of boosting rounds), XGBoost repeats the process:
     - It trains a new decision tree on the dataset with updated example weights.
     - The weights are adjusted based on the performance of the model in the previous iteration.
     - Examples that were misclassified in previous iterations receive higher weights, making them more influential in training the next tree.
     - The model continues to refine its predictions by focusing on challenging examples.

5. **Ensemble of Trees:**
   - Over multiple boosting rounds, XGBoost builds an ensemble of decision trees, with each tree trained on a dataset where the weights reflect the difficulty of classifying each example.
   - The final prediction is a weighted combination of the predictions from all trees in the ensemble.

6. **Efficiency and Focus:**
   - XGBoost's approach avoids the need to create multiple training datasets with replacement, which can be computationally expensive.
   - Instead, it adapts the weights of examples directly, which makes the algorithm more efficient.
   - By assigning higher weights to challenging examples, XGBoost ensures that the model devotes more effort to improving its performance on those cases.

7. **Regularization:** In addition to focusing on challenging examples, XGBoost regularizes training to prevent the model from overfitting, for example by penalizing the number of leaves and the size of the leaf weights and by shrinking each tree's contribution with a learning rate.

8. **Practical Benefits:** This weighting mechanism contributes to XGBoost's effectiveness in competitions and real-world applications, as it helps the algorithm concentrate on improving predictions where they matter most.

<!-- ![xgboost](images/xgboost.png) -->
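
For readers who want the formulation behind the intuition, the following summarizes the second-order objective used in the XGBoost formulation of tree boosting. The symbols are introduced here and are not used elsewhere in these notes: $g_i$ and $h_i$ are the first and second derivatives of the loss for example $i$ at the current prediction, $f_t$ is the tree added in round $t$, $T$ is its number of leaves, $w_j$ is the weight of leaf $j$, and $\gamma$ and $\lambda$ are regularization strengths.

$$
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
$$

With $G_j$ and $H_j$ denoting the sums of $g_i$ and $h_i$ over the examples that fall into leaf $j$, the optimal leaf weight and the gain of splitting a leaf into a left and a right child are

$$
w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Gain = \frac{1}{2} \left[ \frac{G_{left}^2}{H_{left} + \lambda} + \frac{G_{right}^2}{H_{right} + \lambda} - \frac{(G_{left} + G_{right})^2}{H_{left} + H_{right} + \lambda} \right] - \gamma
$$

Examples with large gradients dominate $G_j$, which is how poorly predicted examples receive more attention, and $\lambda$ and $\gamma$ are the built-in regularization terms.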

**LightGBM**

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:

- Faster training speed and higher efficiency
- Lower memory usage
- Better accuracy
- Support of parallel and GPU learning
- Capable of handling large-scale data

The following figure shows the LightGBM process:

<!-- ![lightgbm](images/lightgbm.png) -->

**CatBoost**

CatBoost is a gradient boosting library that uses tree-based learning algorithms and has built-in support for categorical features.

The following figure shows the CatBoost process:

<!-- ![catboost](images/catboost.png) -->

---

## Stacking

Stacking is an ensemble learning technique that combines multiple classification or regression models using a meta-classifier (for classification) or a meta-regressor (for regression). It involves two levels of model training, where base-level models are trained on the complete training set, and a meta-model is trained on the outputs of the base-level models as features. Here's a summary of the key points:

1. **Ensemble Learning:**
   - Stacking is a form of ensemble learning, where the goal is to improve predictive accuracy by combining the strengths of multiple models.

2. **Two-Level Modeling:**
   - Stacking involves a two-level modeling process.
   - In the first level, multiple base-level models are trained independently on the original training dataset.
   - In the second level, a meta-model is trained using the predictions (outputs) from the base-level models as input features.

3. **Base-Level Models:**
   - Base-level models can be of various types, such as decision trees, random forests, support vector machines, or any other suitable model.
   - Each base-level model is trained to make predictions on the target variable.

4. **Meta-Model:**
   - The meta-model is usually a simple model like logistic regression for classification or linear regression for regression tasks.
   - It takes the predictions made by the base-level models as input features and learns to make the final prediction.

5. **Training Process:**
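   - Stacking involves training the base-level models on the original training set.
   - Once the base-level models are trained, their predictions become the training data for the meta-model. Predicting on the same samples the base models were fitted on leaks the targets into the meta-model, so in practice these predictions are usually produced out-of-fold with cross-validation.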

6. **Feature Engineering:**
   - The predictions from the base-level models serve as new features or input for the meta-model.
   - This process effectively transforms the original dataset into a new feature space.

7. **Combining Predictions:**
   - The meta-model is trained using these transformed features as input and is optimized to make the final prediction.
   - For classification tasks, the meta-model often uses the class probabilities from the base models as input.
   - For regression tasks, the meta-model uses the base model predictions as input.

8. **Performance Boost:**
   - Stacking can significantly improve predictive performance by leveraging the diversity of base-level models.
   - It allows the ensemble to capture complex relationships in the data that individual models might miss.

9. **Hyperparameter Tuning:**
   - Stacking may involve hyperparameter tuning for both base-level models and the meta-model to optimize the ensemble's performance.

10. **Common Usage:**
    - Stacking is used in machine learning competitions and real-world applications where squeezing the last bit of predictive accuracy is essential.

<!-- ![stacking](images/stacking.png) -->
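
The two-level process can be sketched in plain Python for a small regression problem. Everything named here is an assumption made for illustration: the two deliberately simple base models (a mean predictor and a 1-nearest-neighbour predictor), the toy data, the use of out-of-fold predictions, and a meta-model reduced to a single blending weight found by grid search.

```python
def mean_model(train_x, train_y):
    """Base model 1: always predicts the mean of the training targets."""
    m = sum(train_y) / len(train_y)
    return lambda x: m

def nearest_neighbour_model(train_x, train_y):
    """Base model 2: predicts the target of the closest training point."""
    def predict(x):
        i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return train_y[i]
    return predict

def out_of_fold_predictions(fit, x, y, n_folds=3):
    """Predict every sample with a model that was not trained on it."""
    preds = [0.0] * len(x)
    for fold in range(n_folds):
        held_out = [i for i in range(len(x)) if i % n_folds == fold]
        kept = [i for i in range(len(x)) if i % n_folds != fold]
        model = fit([x[i] for i in kept], [y[i] for i in kept])
        for i in held_out:
            preds[i] = model(x[i])
    return preds

def fit_meta_weight(pred_a, pred_b, y, steps=100):
    """Meta-model: weight w in [0, 1] minimising the squared error of w*a + (1-w)*b."""
    def sse(w):
        return sum((w * a + (1 - w) * b - t) ** 2 for a, b, t in zip(pred_a, pred_b, y))
    return min((k / steps for k in range(steps + 1)), key=sse)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0, 5.9, 7.2, 7.9]

# Level 1: base-model predictions become the features of the meta-model.
oof_mean = out_of_fold_predictions(mean_model, x, y)
oof_nn = out_of_fold_predictions(nearest_neighbour_model, x, y)
# Level 2: the meta-model learns how to combine them.
w = fit_meta_weight(oof_mean, oof_nn, y)

# Refit the base models on all data and combine them with the learned weight.
full_mean, full_nn = mean_model(x, y), nearest_neighbour_model(x, y)
stacked = lambda q: w * full_mean(q) + (1 - w) * full_nn(q)
print(f"weight on mean model: {w:.2f}, stacked prediction at 4.5: {stacked(4.5):.2f}")
```

A real meta-model would usually be a logistic or linear regression over the base predictions; the single blending weight is used here only to keep the sketch dependency-free.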

---

## Voting

Voting is an ensemble learning technique that combines multiple classification or regression models via a majority vote or averaging. The following figure shows the voting process:

<!-- ![voting](images/voting.png) -->
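
The difference between a majority vote and averaging can be sketched for a classifier ensemble. The names `hard_vote` and `soft_vote` and the probability values are assumptions made for this illustration; "soft" voting here means averaging the predicted class probabilities and picking the class with the highest average.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over the class labels predicted by each model."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average the class probabilities of all models and pick the best class."""
    n_classes = len(probabilities[0])
    avg = [sum(p[c] for p in probabilities) / len(probabilities) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])

# Hypothetical class probabilities [P(class 0), P(class 1)] from three models.
probs = [[0.45, 0.55], [0.40, 0.60], [0.90, 0.10]]
labels = [max(range(2), key=lambda c: p[c]) for p in probs]   # [1, 1, 0]

print(hard_vote(labels))   # → 1, two of three models predict class 1
print(soft_vote(probs))    # → 0, the confident third model dominates the average
```

The two schemes can disagree, as they do here: a majority vote counts every model equally, while averaging lets a confident model outweigh two uncertain ones.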



