# Contents 
- What are ensemble methods
- Why they are popular
- The different types of ensemble methods
- Voting Classifiers
- Bagging and Pasting
  - Out-of-Bag Evaluation
  - Random Patches and Random Subspaces
- Random Forests
  - Extra Trees
  - Feature Importance
- Boosting
- Stacking
  

# 1. Ensemble Learning and Ensemble Method

- **Ensemble learning** is a machine learning approach in which multiple models (often called **weak learners** or base learners) are trained on the same problem and their predictions are combined, with the aim of improving overall performance compared to a single model.



- **Ensemble methods** are the specific techniques used to implement ensemble learning. The main types are:
   - **Bagging** (also known as Bootstrap Aggregating): creates multiple models by training each one on a different random subset of the data
      - the subsets are created via **bootstrap sampling** (sampling with replacement).
      - In classification, bagging combines predictions via majority or weighted voting.
      - In regression, bagging combines predictions via averaging.
      - Random Forest is a popular example of bagging applied to decision trees, and it uses these aggregation methods for its final predictions.
   
   - **Boosting**: builds models sequentially, with each model attempting to correct the errors of the previous one
   
   - **Stacking**: combines the predictions of multiple models using another model (called a **meta-learner**) to make the final decision
   
   - **Voting**: aggregates predictions by having each model (**base learner**) "vote"; the final decision is based on majority or weighted votes
     - all models are trained on the same dataset (no different subsets)
     - the base learners are often of different types (one can be a logistic regression, another an SVM, another a decision tree)
     - voting focuses on combining the strengths of different models to make better predictions
   
   


# 2. Voting Classifiers

- A voting classifier is an ensemble method that combines the predictions of multiple different models (often called "base learners") to improve the overall accuracy.
- The idea is to "vote" on the final prediction by aggregating the individual predictions from each model.

- **Types of Voting Classifiers**
    - **Hard Voting Classifier**: each model makes a prediction and the final prediction is decided by majority vote
      - the class that gets the most votes is chosen as the final prediction
      - e.g., if 3 models predict (Class A, Class B, Class A), the final prediction is **Class A**
      
    - **Soft Voting Classifier**: instead of predicting just the class, each model outputs the probability of each class
      - the final prediction is based on the **average of the predicted probabilities for each class**
      - the class with the highest average probability is selected
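Hard and soft voting can disagree on the same instance. The sketch below uses made-up probabilities for three hypothetical models (the numbers are illustrative assumptions, not outputs of any classifier in these notes) to show how one very confident model can tip a soft vote.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE instance.
# Columns = [Class A, Class B], one row per model.
probas = np.array([
    [0.45, 0.55],   # model 1 leans slightly towards B
    [0.40, 0.60],   # model 2 leans slightly towards B
    [0.95, 0.05],   # model 3 is very confident in A
])
classes = np.array(["A", "B"])

# Hard voting: each model votes for its most likely class, majority wins.
votes = probas.argmax(axis=1)
vote_counts = np.bincount(votes, minlength=len(classes))
hard_winner = classes[vote_counts.argmax()]

# Soft voting: average the probabilities per class, highest average wins.
mean_probas = probas.mean(axis=0)
soft_winner = classes[mean_probas.argmax()]

print("hard voting picks:", hard_winner)
print("soft voting picks:", soft_winner, "with averages", mean_probas)
```

Two of the three models vote for B, so hard voting picks B, while the averaged probabilities (0.6 for A, 0.4 for B) make soft voting pick A.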


- A voting classifier often achieves a higher accuracy than the best individual classifier in the ensemble.

- In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners in the ensemble and they are sufficiently diverse.

- **Ensemble methods work best when the predictors are as independent from one another as possible.**
- One way to get diverse classifiers is to train them using very different algorithms. 
- This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy.
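The claim about weak learners can be checked with a small simulation. This is only a sketch under an idealized assumption: the classifiers are perfectly independent, which real models trained on the same data are not. The counts and the 51% accuracy are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(42)

n_classifiers = 1000   # hypothetical number of weak learners
p_correct = 0.51       # each one is right only 51% of the time
n_trials = 10_000      # number of simulated predictions

# For each trial, count how many of the independent classifiers are correct.
correct_votes = rng.binomial(n_classifiers, p_correct, size=n_trials)

# The majority vote is correct when more than half of the classifiers are.
ensemble_accuracy = np.mean(correct_votes > n_classifiers / 2)

print(f"single classifier accuracy: {p_correct:.2f}")
print(f"majority-vote accuracy:     {ensemble_accuracy:.2f}")
```

In this idealized setting the majority vote is noticeably more accurate than any single classifier. With correlated errors the gain is smaller, which is why diversity among the classifiers matters.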

In [1]:
# Let's code: voting classifier

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples = 500, noise = 0.30, random_state = 42)
X_train, X_test, y_train,y_test = train_test_split(X,y,random_state = 42)


In [2]:
voting_clf = VotingClassifier(estimators = [('lr', LogisticRegression(random_state = 42)),
                                            ('rf', RandomForestClassifier(random_state = 42)),
                                            ('svc', SVC(random_state = 42))
                                            
                                           ])

voting_clf.fit(X_train, y_train)

In [3]:
# Let's check the accuracy of each fitted classifier on the test set

for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))

lr = 0.864
rf = 0.896
svc = 0.896


In [4]:
# Let's use the voting classifier to predict;
# by default it performs hard voting

voting_clf.predict(X_test[:1])

array([1])

In [5]:
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]

# The voting classifier predicts class 1 for the first instance of X_test
# because two of the three classifiers (shown below) predict class 1

[array([1]), array([1]), array([0])]

In [6]:
# Let's look at the performance of the voting classifier on the test set

voting_clf.score(X_test, y_test)

# The voting classifier outperforms all the individual classifiers

0.912

**For Soft Voting**

 - each classifier must predict class probabilities rather than just a class
 - this is only possible if every classifier has a predict_proba() method
 - the probabilities are averaged per class, and the class with the highest average probability is predicted
 
 - **it often achieves higher performance than hard voting**
   - because it gives more weight to highly confident votes
   
 - we have to set the VotingClassifier's **voting** hyperparameter to "soft"
 - and ensure that all the classifiers can estimate class probabilities
    - in the ensemble above, SVC does not estimate probabilities by default, so we need to set its **probability** hyperparameter to True
    
    

In [7]:
voting_clf.voting = 'soft'
voting_clf.named_estimators['svc'].probability = True
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)

# slightly better than hard voting (0.912)

0.92

# 3. Bagging and Pasting 

- **Bagging** (short for bootstrap aggregating): sampling is performed with replacement
   - because of this, the same row can appear more than once within a single subset
      - Training each predictor on a different bootstrap sample is what lets the aggregated ensemble reduce variance and improve performance.
- **Pasting**: sampling is performed without replacement
- Both bagging and pasting allow training instances to be sampled several times across multiple predictors
   - but only bagging allows training instances to be sampled several times for the same predictor
## ----
**Sampling Without Replacement: When creating subsets in pasting, the method samples rows without replacement. This means that each row selected for a specific subset cannot be repeated within that same subset. However, it can be included in a different subset**   
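The difference between the two sampling schemes can be seen on a toy set of row indices. This is a minimal sketch using NumPy's `Generator.choice`; the set size and subset size are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 10      # toy training set: row indices 0..9
subset_size = 6       # rows drawn for each predictor

# Bagging: sample WITH replacement, so a row may repeat inside one subset.
bagging_subset = rng.choice(n_instances, size=subset_size, replace=True)

# Pasting: sample WITHOUT replacement, so rows are unique inside one subset,
# but the same row can still show up in another predictor's subset.
pasting_subset_1 = rng.choice(n_instances, size=subset_size, replace=False)
pasting_subset_2 = rng.choice(n_instances, size=subset_size, replace=False)

# Two subsets of 6 rows drawn from 10 must share at least 2 rows.
shared = np.intersect1d(pasting_subset_1, pasting_subset_2)

print("bagging subset:  ", np.sort(bagging_subset))
print("pasting subset 1:", np.sort(pasting_subset_1))
print("pasting subset 2:", np.sort(pasting_subset_2))
print("rows shared by the two pasting subsets:", shared)
```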


### <u>Not here</u>
- **Difference between Bootstrap and Bagging**
   - Bootstrap :  It is a sampling technique where random samples are drawn with replacement from a dataset.
      - Purpose: It is used to estimate the distribution of a statistic by creating multiple datasets (called bootstrap samples) from the original dataset. These samples are then used to compute statistics like the mean, variance, or more complex model predictions.
      
   - Bagging: Bagging is an ensemble learning method that uses bootstrapping to improve model performance. It involves creating multiple models by training them on different bootstrapped datasets and then aggregating their predictions (usually by averaging for regression or voting for classification).
   
      - Purpose: The goal of bagging is to reduce model variance and improve overall accuracy by combining multiple weak models into a stronger, more stable model
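To illustrate the bootstrap on its own (without any model), the sketch below estimates the standard error of a sample mean by resampling. The data are synthetic and the sizes are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50.0, scale=10.0, size=200)   # synthetic dataset

# Draw many bootstrap samples (same size as the data, with replacement)
# and record the mean of each one.
n_bootstrap = 2000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_bootstrap)
])

# The spread of the bootstrap means estimates the standard error of the mean.
boot_se = boot_means.std(ddof=1)
analytic_se = data.std(ddof=1) / np.sqrt(len(data))

print(f"bootstrap standard error: {boot_se:.3f}")
print(f"analytic standard error:  {analytic_se:.3f}")
```

The two estimates should be close. Bagging reuses the same resampling idea, but trains a model on each bootstrap sample instead of computing a statistic.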

### <u>Not Here</u>
In decision trees, the objectives differ between classification and regression tasks. Here's a breakdown of each:

 **Decision Tree Objective**

1. **For Classification**:
   - **Objective**: To create the next node with the aim of achieving a **pure node**.
     - A **pure node** is a node where all the observations belong to a single class. 
     - The goal is to minimize impurity (e.g., Gini impurity, entropy) in the nodes.
     - The decision tree algorithm evaluates potential splits based on how well they separate the classes, seeking to maximize the homogeneity of the classes in the child nodes.

2. **For Regression**:
   - **Objective**: To create the next node with the aim of minimizing the **prediction error** (or variance) within the node.
     - In regression trees, the aim is to reduce the variance of the target variable (e.g., the numerical output) in the nodes.
     - The algorithm typically uses metrics like **Mean Squared Error (MSE)** or **Mean Absolute Error (MAE)** to evaluate splits.
     - By minimizing the prediction error, the algorithm aims to make the predictions for the target variable as accurate as possible within each leaf node.

**Summary**:
- **Classification**: Aim for pure nodes (homogeneous classes).
- **Regression**: Aim to minimize prediction error (homogeneity of target values).
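The two node-quality measures mentioned above can be computed by hand. This is a small sketch with hypothetical helper functions (`gini_impurity` and `node_mse` are names introduced here, not Scikit-Learn functions).

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 = pure node)."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def node_mse(values):
    """Mean squared error around the node mean (i.e., the variance of the node)."""
    values = np.asarray(values, dtype=float)
    return np.mean((values - values.mean()) ** 2)

pure_gini = gini_impurity(["yes", "yes", "yes", "yes"])   # all one class
mixed_gini = gini_impurity(["yes", "yes", "no", "no"])    # 50/50 split
tight_mse = node_mse([10.0, 10.0, 10.0])                  # identical targets
spread_mse = node_mse([5.0, 15.0])                        # targets far apart

print(pure_gini, mixed_gini, tight_mse, spread_mse)
```

A pure classification node has Gini impurity 0 and a 50/50 binary node has 0.5; a regression node whose targets are all equal has MSE 0.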

## 3.1 Bagging and Pasting in Scikit-Learn

In [8]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                           n_estimators = 500,
                           max_samples = 100,
                           n_jobs = 1,
                           random_state = 42)

bag_clf.fit(X_train, y_train)




A BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with decision tree classifiers

**BaggingClassifier parameters**

- **estimator** (called base_estimator before Scikit-Learn 1.2): the base estimator object to fit on each random subset, here a DecisionTreeClassifier
- **n_estimators**: the number of base estimators in the ensemble. With n_estimators=500, 500 decision tree classifiers are trained

- **max_samples**: the number of training instances to draw for each base estimator. An int is an absolute count (max_samples=100 draws 100 instances); a float is a fraction of the training set
   - default 1.0

- **max_features**: the number of features to draw to train each base estimator. As with max_samples, a float represents a fraction. For instance, max_features=0.5 means that each base estimator will use 50% of the features.
   - default 1.0 (all features)
- **bootstrap**: default=True
   - This parameter indicates whether samples are drawn with replacement. If True, instances are sampled with replacement (bagging); if False, they are sampled without replacement (pasting).

- **bootstrap_features**: default=False
   - This parameter controls whether features are drawn with replacement. If True, features are sampled with replacement. Feature sampling is often used for high-dimensional data.
   
- **n_jobs**: the number of CPU cores to use for training and predictions; -1 tells Scikit-Learn to use all available cores

- **oob_score**: default=False
  - If True, this parameter enables out-of-bag (OOB) evaluation, which provides an internal validation score by evaluating each predictor on the training instances left out of its bootstrap sample. It estimates model performance without the need for a separate validation set.

- **Bagging introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting** 
- but the extra diversity also means that the predictors end up being less correlated, so the ensemble’s variance is reduced. 

- Overall, **bagging often results in better models, which explains why it’s generally preferred**. 
- But if you have spare time and CPU power, you can use cross-validation to evaluate both bagging and pasting and select the one that works best
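The comparison suggested above could look like the following sketch. It regenerates the moons dataset so that it runs on its own, and uses 100 estimators instead of 500 only to keep it quick; both choices are assumptions made for this example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)

cv_scores = {}
for name, use_bootstrap in [("bagging", True), ("pasting", False)]:
    clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=100,
        max_samples=100,           # 100 instances per tree
        bootstrap=use_bootstrap,   # True = bagging, False = pasting
        random_state=42,
    )
    # 5-fold cross-validation, mean accuracy over the folds
    cv_scores[name] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()

for name, score in cv_scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```

On a small dataset like this the two scores are usually very close, so the choice should be made on the actual data rather than assumed in advance.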

## 3.2 Out-of-Bag Evaluation

-  **Out-of-Bag (OOB) Evaluation** is a method used in ensemble learning, particularly with bagging algorithms like the BaggingClassifier in Scikit-learn. It provides a way to assess the performance of an ensemble model without the need for a separate validation set

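- With bagging, some training instances may be sampled several times for a given predictor, while others may not be sampled at all. By default a BaggingClassifier samples m training instances with replacement (where m is the size of the training set), so on average only about 63% of the training instances are sampled for each predictor

- The remaining 37% of the training instances that are not sampled are called out-of-bag (OOB) instances. **Note that they are not the same 37% for all predictors**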

**OOB Instances**:

- The instances that are not included in the bootstrap sample of a given estimator are its out-of-bag instances.
- Since each estimator has its own bootstrap sample, the OOB instances for Estimator 1 are different from those for Estimator 2. This leads to a different set of OOB instances for each base model.
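The 63% figure comes from the probability that a given instance is picked at least once in m draws with replacement, which is 1 - (1 - 1/m)^m and approaches 1 - 1/e (about 0.632) as m grows. The sketch below checks this empirically; the training set size is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 1000   # hypothetical training set size

# One bootstrap sample: m draws with replacement from m instances.
sample = rng.integers(0, m, size=m)
sampled_fraction = len(np.unique(sample)) / m   # distinct instances that were drawn
oob_fraction = 1 - sampled_fraction             # instances never drawn (out-of-bag)

theory = 1 - (1 - 1 / m) ** m
limit = 1 - np.exp(-1)

print(f"sampled fraction (simulated): {sampled_fraction:.3f}")
print(f"sampled fraction (theory):    {theory:.3f}")
print(f"limit 1 - 1/e:                {limit:.3f}")
print(f"out-of-bag fraction:          {oob_fraction:.3f}")
```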

###  ----

- We can set **oob_score=True** when creating a BaggingClassifier to request an automatic OOB evaluation after training.
- The resulting evaluation score is available in the **oob_score_** attribute:


In [9]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                               n_estimators = 500,
                               oob_score = True,
                               n_jobs = 1,
                                random_state =42
                               )

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.896

In [10]:
# The OOB score above suggests about 89.6% accuracy on unseen data; let's check on the test set

from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.912

In [11]:
# another method to find accuracy score
bag_clf.score(X_test, y_test)

0.912

**Not here** 
- A decision function is a crucial component of many classifiers: it provides a numerical score that is used to determine class membership.

- By understanding the decision function, one can interpret the model’s confidence in its predictions and adjust thresholds as necessary for specific applications

  - For binary classifiers: a common threshold is 0, where scores greater than 0 indicate one class and scores less than 0 indicate the other.
  - For multi-class classifiers: the class with the highest score is typically chosen as the predicted class

In [12]:
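# Let's check oob_decision_function_: since the base estimator has a predict_proba() method,
# it holds the OOB class probabilities for each training instance

bag_clf.oob_decision_function_[:3] # probas for the first 3 instances of the training set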

array([[0.32352941, 0.67647059],
       [0.3375    , 0.6625    ],
       [1.        , 0.        ]])

## 3.3 Random Patches and Random Subspaces

- Random Patches and Random Subspaces are two advanced ensemble techniques that are variations of bagging. They both **aim to enhance the diversity among the individual learners in an ensemble model, improving generalization performance**.

#### ------
- **Random Patches**: this method combines two techniques: sampling training instances and sampling features. Each base learner is **trained on a random subset of both the training instances and the features**
   - diversity in both instances and features helps to reduce overfitting and improves the model's robustness.
   - useful for high-dimensional datasets (such as images), where sampling features can significantly affect model performance
   

- **Random Subspaces** method specifically focuses on the **random selection of features while using the entire training dataset**. Each base learner is trained on the same set of instances but with a different subset of features
   - this method emphasizes increasing diversity among the models by varying the feature sets while maintaining the full dataset.
   - It helps to reduce correlation between individual estimators in the ensemble.
   
   
## -----

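- In Scikit-Learn's BaggingClassifier, the Random Subspaces method corresponds to:
   - Sampling features (by setting bootstrap_features to True and/or max_features to a value smaller than 1.0)
   - Keeping all training instances (by setting bootstrap=False and max_samples=1.0)
- The Random Patches method samples both the training instances and the features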
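Both variants can be configured with BaggingClassifier alone. The sketch below uses a synthetic 20-feature dataset from `make_classification`; the dataset and the specific fractions (0.7 and 0.5) are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_hd, y_hd = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

# Random Patches: sample both the training instances and the features.
random_patches = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42,
)

# Random Subspaces: keep every training instance, sample only the features.
random_subspaces = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=True,
    random_state=42,
)

random_patches.fit(X_hd, y_hd)
random_subspaces.fit(X_hd, y_hd)

# estimators_features_ lists the feature indices drawn for each base estimator.
print("features seen by the first subspace tree:", random_subspaces.estimators_features_[0])
print("instances seen by the first subspace tree:", len(set(random_subspaces.estimators_samples_[0])))
print("instances drawn for the first patches tree:", len(random_patches.estimators_samples_[0]))
```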

# 4. Random Forests

- Random Forest is an ensemble method that can handle both classification and regression.
- It builds multiple decision trees during training and outputs the mode of the trees' predicted classes (for classification) or the average of their predictions (for regression).

**How Random Forest works**

- Bootstrap Sampling: the sampling step used in bagging
   - creates random subsets of the dataset by sampling with replacement
- Random Feature Selection: 
   - at each node, Random Forest considers only a random subset of the features, instead of all features, when searching for the best split
- Multiple Decision Trees:
   - multiple decision trees are trained independently on these random subsets of data and features.
- Aggregation of Predictions:
   - Classification: majority vote
   - Regression: average of the predictions
   


In [13]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators = 500, 
                                max_leaf_nodes = 16,
                                 n_jobs = -1,
                                 random_state = 42)

rnd_clf.fit(X_train, y_train)

y_test_pred_rf = rnd_clf.predict(X_test)

# print score
rnd_clf.score(X_test, y_test)

0.912


- **With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown)**, plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.

- The random forest algorithm introduces extra randomness when growing trees; **instead of searching for the very best feature when splitting a node , it searches for the best feature among a random subset of features**

## 4.1 Extra Trees

**Extra Trees (short for "Extremely Randomized Trees")** 
- it is a variant of the Random Forest algorithm, and it's part of the broader family of ensemble methods. 
- While it shares many similarities with Random Forest, there are a few key differences that make Extra Trees more "random" than Random Forest.

| Feature               | Random Forest                                             | Extra Trees                                               |
|-----------------------|-----------------------------------------------------------|-----------------------------------------------------------|
| **Bootstrap Sampling** | Yes (by default)                                          | No (uses entire dataset unless `bootstrap=True`)           |
| **Splitting Criterion**| Searches for the best threshold for each candidate feature (greedy search) | Draws a random threshold for each candidate feature and keeps the best of these splits |
| **Diversity among Trees** | Diversity comes from bootstrapped samples and random feature selection | Diversity comes from random splits and random feature selection |
| **Training Speed**     | Slower (due to finding optimal thresholds)                | Faster (no search for the optimal threshold)               |
| **Bias-Variance Tradeoff** | Lower bias, higher variance                              | Higher bias, lower variance                                |


- Both methods are very effective, and often the best choice depends on your specific dataset and the trade-offs between speed and performance.

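-  **Compared with a single decision tree, both techniques trade more bias for a lower variance**; Extra Trees pushes this trade-off further than Random Forest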
- We can create an extra-trees classifier using Scikit-Learn’s **ExtraTreesClassifier** class
  -  Its API is identical to the RandomForestClassifier class, except bootstrap defaults to False
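Because the two classes share the same API, swapping one for the other is a one-line change. The sketch below regenerates the moons dataset so it runs on its own; the hyperparameter values are illustrative assumptions, and which model scores higher depends on the data.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X_m, y_m = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, random_state=42)

models = {
    "random forest": RandomForestClassifier(
        n_estimators=200, max_leaf_nodes=16, random_state=42),
    "extra trees": ExtraTreesClassifier(
        n_estimators=200, max_leaf_nodes=16, random_state=42),
}

test_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_scores[name] = model.score(X_te, y_te)
    # The bootstrap attribute shows the differing default of the two classes
    print(f"{name}: bootstrap={model.bootstrap}, test accuracy={test_scores[name]:.3f}")
```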

## 4.2 Feature Importance

- Random Forests have another great quality: they make it easy to measure the relative importance of each feature.


- **Random Forest measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity (Gini impurity or entropy in classification, variance in regression), on average across all trees in the forest**


   - More precisely, the importance is a weighted average of these impurity reductions, where each node’s weight is equal to the number of training samples associated with it
   

- Scikit-Learn computes this score automatically for each feature after training, **then it scales the results so that the sum of all importances is equal to 1**
   - Interpretation: Higher values mean that a feature contributes more to the predictions
   - Model-Specific: Feature importance in Random Forest is specific to the model; other algorithms may give different rankings

- **We can access the result using the feature_importances_ attribute**



In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris


iris = load_iris(as_frame = True)

rnd_clf = RandomForestClassifier(n_estimators = 500, # build 500 trees
                           random_state = 42)

rnd_clf.fit(iris.data, iris.target)

for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score,2), name)


0.11 sepal length (cm)
0.02 sepal width (cm)
0.44 petal length (cm)
0.42 petal width (cm)


**Key Differences Between Bagging and Boosting**

| **Aspect**            | **Bagging**                                               | **Boosting**                                             |
|-----------------------|-----------------------------------------------------------|----------------------------------------------------------|
| **Primary Focus**      | Reducing variance (overfitting)                           | Reducing bias (underfitting)                             |
| **Model Type**         | Suitable for high-variance models (e.g., Decision Trees)  | Suitable for high-bias models (e.g., weak learners)       |
| **Training**           | Models are trained independently                          | Models are trained sequentially, each improving on the previous |
| **Risk of Overfitting**| Lower risk of overfitting (especially with Random Forest) | Higher risk of overfitting, especially if not properly regularized |
| **Speed**              | Faster due to parallel training                           | Slower due to sequential training                        |
| **Interpretability**   | Easier to interpret (due to averaging of models)          | Harder to interpret (due to multiple dependent models)    |


#### Not here

**Can Gini Impurity and Entropy be used for both numerical and categorical input features, as well as for both numerical and categorical target variables?**
 - **For Input Features**:

     - Yes, both Gini Impurity and Entropy can be applied to numerical and categorical input features.
     - For numerical features, the decision tree evaluates thresholds (e.g., feature ≤ threshold) to split the data.
     - For categorical features, it splits based on the categories (e.g., feature == 'category_A').


 - **For Target Variables**:

    - No, Gini Impurity and Entropy are only used for categorical target variables.
    - These metrics measure the "purity" of categorical classes (e.g., classifying into categories like 'yes' or 'no').
    - For numerical target variables, decision trees use different metrics like Mean Squared Error (MSE) or Mean Absolute Error (MAE), not Gini Impurity or Entropy, since these are regression tasks.



# 5. Boosting

**Boosting** is a machine learning technique in which multiple weak learners (often decision trees) are trained sequentially, with each new model focusing on correcting the errors made by the previous models.

- The goal is to combine weak learners to form a strong, accurate model
- There are many boosting methods; the most popular are
   - **AdaBoost**: short for Adaptive Boosting
   - **Gradient Boosting**: (popular implementations include XGBoost, CatBoost and LightGBM)
   
**How Boosting Works**:
- **Initial Model**: A weak model is trained on the dataset.
- **Error Focus**: After training, the model's performance is evaluated. The instances that were predicted incorrectly are given higher importance (weights), so the next model focuses more on them.
- **Sequential Training**: This process repeats, with each new model trying to correct the mistakes of the previous one. Models are trained sequentially, not independently.
- **Final Prediction**: At the end, the predictions of all the models are combined (usually through weighted voting for classification or averaging for regression) to make the final prediction

**Pros of Boosting**:
- Can significantly improve model performance by correcting errors made by weak learners
- Works well for both classification and regression tasks.
- Many variations (like XGBoost, LightGBM) include built-in regularization to prevent overfitting.

**Cons of Boosting**:
- Training can be slow since models are trained sequentially.
- More prone to overfitting if not properly regularized.
- Hyperparameter tuning is crucial for optimal performance and can be time-consuming

#### Types of Boosting:
**AdaBoost (Adaptive Boosting)**:
   - Focuses on adjusting the weights of incorrectly predicted instances. 
   - Each subsequent model pays more attention to the errors of the previous model.

**Gradient Boosting**:
  - Sequentially builds models to minimize a loss function (e.g., mean squared error). 
  - Each new model is trained to correct the residuals (errors) of the previous model's predictions.

  - **XGBoost**:
      - An optimized version of gradient boosting that includes regularization and faster computation (parallelization). 
      - It is popular for structured/tabular data.
  
  - **LightGBM**:
      - A more efficient version of gradient boosting that works particularly well with large datasets by using histogram-based algorithms and leaf-wise splits.

  - **CatBoost**:
     - A gradient boosting algorithm designed to handle categorical features effectively without requiring explicit one-hot encoding.

## 5.1 AdaBoost -----


- One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfit. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.

- This sequential learning technique has some similarities with gradient descent, except that instead of tweaking a single predictor’s parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better

- there is one important drawback to this sequential learning technique: training cannot be parallelized since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting.

## -----

**AdaBoost Algorithm**
  - Initially, each instance weight $w^{(i)}$ is set to $1/m$, where $m$ is the number of training instances
  - A first predictor is trained, and its weighted error rate $r_1$ is computed on the training set
      - Weighted error rate of the $j$-th predictor:
          - $$r_j = \sum_{\substack{i=1 \\ \hat{y}_j^{(i)} \neq y^{(i)}}}^{m} w^{(i)}$$
          - where $\hat{y}_j^{(i)}$ is the $j$-th predictor’s prediction for the $i$-th instance
  - The predictor's weight $\alpha_j$ is then computed using $\alpha_j = \eta \log \dfrac{1 - r_j}{r_j}$
      - where $\eta$ is the learning rate hyperparameter (defaults to 1)
      - the more accurate the predictor is, the higher its weight will be
      - if it is just guessing randomly, its weight will be close to zero
      - if it is most often wrong (i.e., less accurate than random guessing), its weight will be negative
  - Next, the AdaBoost algorithm updates the instance weights, which boosts the weights of the misclassified instances.
  - Then all the instance weights are normalized (i.e., divided by $\sum_{i=1}^{m} w^{(i)}$)
  - Finally, a new predictor is trained using the updated weights, 
  - and the whole process is repeated:
       - the new predictor’s weight is computed, 
       - the instance weights are updated, 
       - then another predictor is trained, and so on. 
  - The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.
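The steps above can be sketched from scratch for a binary problem. This is an illustrative sketch, not Scikit-Learn's implementation. Two details are assumptions because the notes do not spell them out: misclassified weights are multiplied by exp(α_j), and the final prediction is the sign of the α-weighted sum of the stump predictions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_ab, y_ab = make_moons(n_samples=500, noise=0.30, random_state=42)
y_signed = np.where(y_ab == 1, 1, -1)   # labels as -1 / +1 so votes can be summed

m = len(X_ab)
eta = 0.5                                # learning rate
weights = np.full(m, 1 / m)              # every instance weight starts at 1/m
stumps, alphas = [], []

for _ in range(30):                      # at most 30 decision stumps
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_ab, y_signed, sample_weight=weights)
    wrong = stump.predict(X_ab) != y_signed

    # Weighted error rate r_j (weights are kept normalized, so they sum to 1);
    # clipping avoids a division by zero or log(0) in the next line.
    r = np.clip(weights[wrong].sum(), 1e-10, 1 - 1e-10)
    alpha = eta * np.log((1 - r) / r)    # predictor weight alpha_j

    weights[wrong] *= np.exp(alpha)      # assumed update: boost misclassified instances
    weights /= weights.sum()             # normalize the instance weights

    stumps.append(stump)
    alphas.append(alpha)
    if not wrong.any():                  # a perfect predictor was found
        break

def ada_predict(X):
    """Weighted vote: sign of the alpha-weighted sum of stump predictions."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)

train_accuracy = np.mean(ada_predict(X_ab) == y_signed)
print(f"{len(stumps)} stumps, training accuracy = {train_accuracy:.3f}")
```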

  

## -----

- **Scikit-Learn uses a multiclass version of AdaBoost called SAMME**
    - Stagewise Additive Modeling using a Multiclass Exponential loss function
- When there are just two classes, SAMME is equivalent to AdaBoost
- If the predictors can estimate class probabilities (i.e., if they have a predict_proba() method), older Scikit-Learn releases could use a variant of SAMME called SAMME.R (the R stands for "Real")
   - it relies on class probabilities rather than predictions and generally performed better; note that SAMME.R has been deprecated in recent Scikit-Learn releases, which implement only SAMME

In [15]:
# The code below trains an AdaBoost classifier based on 30 decision stumps
# A decision stump is a decision tree with max_depth = 1,
# in other words a tree composed of a single decision node plus two leaf nodes

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 1),
                            n_estimators = 30,
                            learning_rate = 0.5,
                            random_state = 42)

ada_clf.fit(X_train, y_train)

**If the AdaBoost ensemble is overfitting the training set, we can try reducing the number of estimators or regularizing the base estimator more strongly**

## 5.2 Gradient Boosting -----

- Just like AdaBoost, gradient boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. 
- However, instead of tweaking the instance weights at every iteration like AdaBoost does, 
   - **this method tries to fit the new predictor to the residual errors made by the previous predictor**
   
   
- Key Differences:
  - **AdaBoost**:
      - Increases weight on misclassified instances, focusing directly on errors in the training set.
  - **Gradient Boosting**:
      - Fits a new model to the residuals of the entire ensemble's predictions, aiming to minimize the overall error.

In [16]:
# lets go through a regression example, 
# using decision trees as the base predictors; this is called Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT)



import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)

X = np.random.rand(100,1) - 0.5
y = 3*X[:, 0] ** 2 + 0.005 * np.random.rand(100) # y = 3x^2 + small uniform noise (np.random.rand is uniform, not Gaussian)

tree_reg1 = DecisionTreeRegressor(max_depth = 2, random_state = 42)
tree_reg1.fit(X,y)

In [17]:
# Now we train a second DecisionTreeRegressor on the residual errors made by the first predictor

y2 = y - tree_reg1.predict(X)

tree_reg2 = DecisionTreeRegressor(max_depth = 2, random_state = 43)

tree_reg2.fit(X, y2)

In [18]:
# Again, we train a third regressor on the residual errors made by the second predictor

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth = 2, random_state = 44)

tree_reg3.fit(X, y3)

In [19]:
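# The ensemble prediction is the sum of the three trees' predictions:
# tree 1 predicts y, tree 2 predicts tree 1's residuals, and tree 3 predicts the residuals that remain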


X_new = np.array([[-0.4], [0.], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

array([0.44462706, 0.05141457, 0.64251501])

In [20]:
# Combined fit of the first 3 models, measured with R-squared on the training data

from sklearn.metrics import r2_score

y_pred_combined = sum(tree.predict(X) for tree in (tree_reg1, tree_reg2, tree_reg3))
r2 = r2_score(y, y_pred_combined)

r2

0.9556789807796691

- The score of an individual tree isn't very meaningful in this approach, since every tree after the first is trained on residuals rather than on the original targets.
- We should evaluate the combined prediction of all the models instead.
- For regression, use a metric such as R-squared on the summed predictions of all three models. Note that the value above is computed on the training set, so it is optimistic.

#### GBRT 

- we can use Scikit-Learn’s GradientBoostingRegressor class to train GBRT ensembles more easily (there’s also a GradientBoostingClassifier class for classification).

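- Much like the RandomForestRegressor class, it has hyperparameters to control the growth of the trees (e.g., max_depth) as well as hyperparameters to control the ensemble training (e.g., n_estimators)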


In [21]:
from sklearn.ensemble import GradientBoostingRegressor


gbrt = GradientBoostingRegressor(max_depth = 2, n_estimators = 3,
                                learning_rate = 1.0, random_state = 42)

gbrt.fit(X,y)

- The **learning_rate** hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.05, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage.

## -----

- **Shrinkage (also known as learning rate)** is a crucial parameter in gradient boosting ensemble methods like Gradient Boosting Machines (GBMs) and XGBoost. 

- It controls the contribution of each weak learner (or tree) to the final prediction, effectively slowing down the learning process to avoid overfitting and improve generalization.

**What is Shrinkage in Gradient Boosting?**
- In gradient boosting, the model is built iteratively by fitting weak learners (typically decision trees) to the residuals (errors) of the previous iteration. 
- Each iteration's tree corrects the errors of the previous ones. 
- The shrinkage (or learning rate) scales down the influence of each tree by a factor η, which reduces the impact of individual trees on the final model.

The prediction for a new instance after t iterations is given by:

$$\hat{y} = \sum_{i=1}^{t} \eta \cdot f_i(x)$$



**Example of Shrinkage Impact:**
 - Let’s imagine we are using gradient boosting to predict house prices, and we are at the 3rd iteration (or tree).
 - Without shrinkage, the model would simply add the predictions of the 3rd tree directly to the sum of previous trees’ predictions. 
 - However, with shrinkage, the contribution of each tree is scaled down by η, say 0.1:

   - If the prediction from tree 3 is 10,000 (the amount it predicts should be added to the previous predictions), then the contribution with shrinkage will be: **0.1 × 10,000 = 1,000**

- This reduced contribution helps avoid overfitting early in the learning process.
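To see the effect of η in practice, here is a small sketch, assuming a synthetic noisy quadratic dataset (not the notebook's data). It compares the training error of two GradientBoostingRegressor models that differ only in learning_rate, using staged_predict to read the ensemble's prediction after each added tree.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Assumed toy data: a noisy quadratic
rng = np.random.default_rng(42)
X_demo = rng.uniform(-0.5, 0.5, size=(200, 1))
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.normal(size=200)

train_mse = {}
for eta in (1.0, 0.1):
    model = GradientBoostingRegressor(
        max_depth=2, n_estimators=50, learning_rate=eta, random_state=42
    )
    model.fit(X_demo, y_demo)
    # staged_predict yields the ensemble's prediction after each added tree
    train_mse[eta] = [
        mean_squared_error(y_demo, y_pred) for y_pred in model.staged_predict(X_demo)
    ]

for eta, errors in train_mse.items():
    print(f"eta={eta}: MSE after 3 trees={errors[2]:.4f}, after 50 trees={errors[-1]:.4f}")
```

With η = 0.1 the first few trees reduce the training error much more slowly than with η = 1.0, which is why a small learning rate needs more trees. This sketch only measures training error, so it does not show the generalization benefit by itself.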

**Conclusion**
- Shrinkage (learning rate) is a regularization technique in gradient boosting that scales down the contribution of each tree.
- It helps prevent overfitting by slowing down the learning process.
- Smaller learning rates usually require more trees, while larger learning rates may lead to quicker but less generalizable models

- A GBRT ensemble will start to overfit if we keep adding too many trees.
- To find the optimal number of trees, we can perform cross-validation using GridSearchCV or RandomizedSearchCV.
- A simpler way is to set the **n_iter_no_change** hyperparameter to an integer value, such as 10:
   - then GradientBoostingRegressor will automatically stop adding trees, during training, if it sees that **the last 10 trees didn't help**
- This is simply early stopping but with a little bit of **patience**: 
  - it tolerates having no progress for a few iterations before it stops
  
- **If you set n_iter_no_change too low, training may stop too early and the model will underfit**. 
   - But if you set it too high, it will overfit instead
   
- The combination of a small learning rate and a high initial number of estimators allows the model to explore a wide range of potential solutions without quickly converging to a suboptimal one.
- However, since the model will stop adding trees if they do not contribute to improving performance (thanks to early stopping), the final ensemble will consist of only the most beneficial trees. 
- **This helps in creating a well-balanced model that utilizes the strengths of many trees while avoiding the pitfalls of overfitting**
  

In [22]:
gbrt_best = GradientBoostingRegressor(max_depth = 2, 
                                      learning_rate = 0.05,
                                      n_estimators =500,
                                      n_iter_no_change = 10,
                                      tol=1e-4,               # Using the default tolerance, can adjust it if we need to make the model more or less sensitive to small improvements in the loss.
                                      random_state = 42
)

gbrt_best.fit(X,y)

# finding no of estimators/trees after which there is almost no change
gbrt_best.n_estimators_

91

**When n_iter_no_change is set** 
  - the fit() method automatically splits the training set into a smaller training set and a validation set: 
  - this allows it to evaluate the model’s performance each time it adds a new tree. 
  - the size of the validation set is controlled by the validation_fraction hyperparameter, which is 10% by default. 
  - The tol hyperparameter determines the maximum performance improvement that still counts as negligible. It defaults to 0.0001
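The idea behind this validation split can be sketched by hand. The example below is an illustration under assumptions (synthetic data, a manual 10% hold-out) and is not what fit() does internally: it trains all 500 trees, then uses staged_predict to find the number of trees with the lowest validation error. Unlike n_iter_no_change, it picks the best point after the fact instead of stopping during training.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumed toy data: a noisy quadratic
rng = np.random.default_rng(42)
X_demo = rng.uniform(-0.5, 0.5, size=(500, 1))
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.normal(size=500)

# Hold out 10% for validation, mirroring the default validation_fraction
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)

model = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.05, n_estimators=500, random_state=42
)
model.fit(X_tr, y_tr)

# Validation error after each added tree
val_errors = [
    mean_squared_error(y_val, y_pred) for y_pred in model.staged_predict(X_val)
]
best_n_trees = int(np.argmin(val_errors)) + 1
print(best_n_trees, val_errors[best_n_trees - 1])
```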

**The GradientBoostingRegressor class also supports a subsample hyperparameter**, 
   - which specifies the fraction of training instances to be used for training each tree. 
   - For example, if subsample=0.25, 
       - then each tree is trained on 25% of the training instances, selected randomly. 
   - this technique trades a higher bias for a lower variance. 
   - It also speeds up training considerably. This is called **stochastic gradient boosting**.
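A minimal sketch of stochastic gradient boosting, assuming a synthetic dataset (not the notebook's data). The only change from a regular GBRT is subsample=0.25. When subsample is below 1.0, scikit-learn also records the out-of-bag improvement of each tree in the oob_improvement_ attribute.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Assumed toy data: a noisy quadratic
rng = np.random.default_rng(42)
X_demo = rng.uniform(-0.5, 0.5, size=(400, 1))
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.normal(size=400)

sgb = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=100,
    learning_rate=0.05,
    subsample=0.25,  # each tree sees a random 25% of the training instances
    random_state=42,
)
sgb.fit(X_demo, y_demo)

# Improvement in loss on the out-of-bag samples, one value per tree
print(sgb.oob_improvement_[:5])

X_new_demo = np.array([[-0.4], [0.0], [0.5]])
predictions = sgb.predict(X_new_demo)
print(predictions)
```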

## 5.3 Histogram-Based Gradient Boosting

- Scikit-Learn has another GBRT implementation, optimized for large datasets: histogram-based gradient boosting (HGB).
- **It works by binning the input features, replacing them with integers.**
- The number of bins is controlled by the max_bins hyperparameter, which defaults to **255** and **cannot be set any higher than this**.
- Binning can greatly reduce the number of possible thresholds that the training algorithm needs to evaluate.
- Moreover, working with integers makes it possible to use faster and more memory-efficient data structures.
- And the way the bins are built removes the need for sorting the features when training each tree.


### ----
 - As a result, this implementation has a computational complexity of O(b×m) instead of O(n×m×log(m)), 
    - where b is the number of bins, 
    - m is the number of training instances, 
    - and n is the number of features.
 - In practice, this means that **HGB can train hundreds of times faster than regular GBRT on large datasets**. 
 - However, **binning causes a precision loss, which acts as a regularizer**: 
    - **depending on the dataset, this may help reduce overfitting, or it may cause underfitting**
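To make the binning idea concrete, here is an illustrative sketch with plain NumPy. It uses simple quantile-based bin edges as an assumption; scikit-learn's internal binning may differ in its details. The point is only to show how replacing raw values with small integers shrinks the number of candidate split thresholds.

```python
import numpy as np

rng = np.random.default_rng(42)
feature = rng.normal(size=10_000)  # assumed toy feature with 10,000 distinct values
max_bins = 255

# 254 interior edges placed at evenly spaced quantiles give 255 bins
edges = np.quantile(feature, np.linspace(0, 1, max_bins + 1)[1:-1])

# Each raw value is replaced by the integer index of its bin (fits in one byte)
binned = np.searchsorted(edges, feature, side="right").astype(np.uint8)

n_exact_thresholds = np.unique(feature).size - 1   # one candidate between each pair of values
n_binned_thresholds = np.unique(binned).size - 1   # one candidate between each pair of bins
print(n_exact_thresholds, n_binned_thresholds)
```

The tree only has to consider a few hundred thresholds per feature instead of thousands, at the cost of no longer distinguishing values that fall in the same bin.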


- Scikit-Learn provides two classes for HGB:
   - HistGradientBoostingRegressor
   - HistGradientBoostingClassifier
- They are similar to GradientBoostingRegressor and GradientBoostingClassifier, with a few notable differences:
    - Early stopping is automatically activated if the number of instances is greater than 10,000. You can turn early stopping always on or always off by setting the early_stopping hyperparameter to True or False.
    - Subsampling is not supported. 
    - n_estimators is renamed to max_iter.
    - The only decision tree hyperparameters that can be tweaked are max_leaf_nodes, min_samples_leaf, and max_depth.
    
## ----
- The **HGB classes also have two nice features: they support both categorical features and missing values**. 
   - This simplifies preprocessing quite a bit. 
- However, the **categorical features must be represented as integers ranging from 0 to a number lower than max_bins**.
      

In [23]:
# let's use the California housing dataset
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder

# ordinal-encode ocean_proximity (it becomes column 0) and pass the other columns through unchanged
hgb_reg = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ['ocean_proximity']),
                            remainder = 'passthrough'),
    HistGradientBoostingRegressor(categorical_features = [0],
                                  random_state = 42)
)

In [24]:
df = pd.read_csv("/Users/saajanrajak/2024 Projects/2024_05 Machine Learning/Data/california_housing.csv")

# creating Input and output vars data frames

X = df.drop(['median_house_value'], axis = 1)
y = df['median_house_value']

# let's bin the target variable so that the train/test split can be stratified on it

y_binned = pd.cut(y, bins = 6)


# Using sklearn package create train test data sets for the model
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y_binned, random_state = 122, test_size = .20)



In [25]:
X_train.head()

| Unnamed: 0 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 16099 | -122.51 | 37.76 | 43.0 | 2345.0 | 624.0 | 1439.0 | 614.0 | 2.8448 | NEAR OCEAN |
| 10610 | -117.78 | 33.68 | 11.0 | 1994.0 | 477.0 | 849.0 | 411.0 | 4.0187 | <1H OCEAN |
| 16933 | -122.33 | 37.57 | 20.0 | 2126.0 | 643.0 | 1112.0 | 597.0 | 3.625 | NEAR OCEAN |
| 9731 | -121.72 | 36.81 | 18.0 | 1984.0 | 379.0 | 1078.0 | 359.0 | 3.2969 | <1H OCEAN |
| 15982 | -122.47 | 37.76 | 52.0 | 2941.0 | 783.0 | 1545.0 | 726.0 | 2.9899 | NEAR BAY |


In [26]:
hgb_reg.fit(X_train, y_train)



- The whole pipeline is just as short as the imports! 
  - No need for an imputer, scaler, or a one-hot encoder, so it’s really convenient. 
- Note that categorical_features must be set to the categorical column indices (or a Boolean array).
- Without any hyperparameter tuning, this model yields an RMSE of about 47,600, which is not too bad.

- Several other optimized implementations of gradient boosting are available in the Python ML ecosystem: 
  - in particular, XGBoost, CatBoost, and LightGBM. These libraries have been around for several years.
  - They are all specialized for gradient boosting, their APIs are very similar to Scikit-Learn’s, and they provide many additional features, including GPU acceleration. We will check them out in depth later.
- Moreover, the TensorFlow Decision Forests library provides optimized implementations of a variety of random forest algorithms, including plain random forests, extra-trees, GBRT, and several more.

# 6. Stacking ----

- Stacking is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, 
    - **we train a model to perform this aggregation**
       - For example, suppose an ensemble of three models performs a regression task on a new instance.
       - Each of the three models predicts a different value.
       - These predictions are then fed as inputs to a final model (called the **blender** or **meta-learner**), which makes the final prediction.

- Scenario:
   - building a regression model to predict house prices.
   - we have trained three different models:
      - Decision Tree Regressor
      - Random Forest Regressor
      - Linear Regression
   - Each of these models will make their own predictions for a new house based on features like the size of the house, number of rooms, location, etc.

#### Step-by-Step Process:
**Step 1: Predictions from Individual Models**
   - Suppose we input the features of a new house into all three models:

       - Decision Tree Regressor predicts: 200,000
       - Random Forest Regressor predicts: 210,000
       - Linear Regression predicts: 195,000
       - Now, you have three different predictions.

**Step 2: Meta-Learner (or Blender) for Final Prediction**
   - The blender or meta-learner will take these three predictions as input and learn how to combine them to produce a final prediction.

   - The blender could be a simple Linear Regression model trained to combine these outputs.

**Step 3 : Let’s assume the blender was trained and learned that:**

   - 30% weight should be given to the Decision Tree prediction,
   - 50% weight to the Random Forest prediction,
   - 20% weight to the Linear Regression prediction.
   
 
**Step 4: Final Prediction Calculation**
  - The blender (meta-learner) will use these weights to calculate the final prediction:
  - Final Prediction=(0.3×200,000)+(0.5×210,000)+(0.2×195,000)
  - Final Prediction=60,000+105,000+39,000=204,000

So, the blender/meta-learner makes a final prediction of $204,000 for the house price
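The calculation in Step 4 is just a weighted sum, which can be checked in a few lines. The weights are the illustrative ones assumed in Step 3; a real blender would learn its own weights (and possibly an intercept) from data.

```python
import numpy as np

# Predictions from Step 1: decision tree, random forest, linear regression
base_predictions = np.array([200_000, 210_000, 195_000])

# Illustrative blender weights from Step 3
blender_weights = np.array([0.3, 0.5, 0.2])

final_prediction = float(base_predictions @ blender_weights)
print(round(final_prediction))  # 204000
```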

#### For reference: how the blender's training data is produced
Each base model makes out-of-fold predictions using cross-validation. Example (5-fold CV with RandomForestRegressor):
  - Fold 1: Train on Folds 2–5, predict on Fold 1.
  - Fold 2: Train on Folds 1, 3–5, predict on Fold 2.
  - Fold 3: Train on Folds 1, 2, 4, 5, predict on Fold 3.
  - Fold 4: Train on Folds 1–3, 5, predict on Fold 4.
  - Fold 5: Train on Folds 1–4, predict on Fold 5.
  
**At the end of the 5 folds, you will have one prediction for each instance in the training data from this model**.

- This process is repeated for each model in the ensemble, and these predictions become the input features for training the blender.
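The fold-by-fold procedure above can be sketched with cross_val_predict, which returns one out-of-fold prediction per training instance. This is an illustration on a synthetic regression dataset (an assumption, not the notebook's data), with a plain LinearRegression as the blender.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=42
)

base_models = {
    "tree": DecisionTreeRegressor(max_depth=4, random_state=42),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "linear": LinearRegression(),
}

# One column of out-of-fold predictions per base model
blender_inputs = np.column_stack(
    [cross_val_predict(model, X_demo, y_demo, cv=5) for model in base_models.values()]
)

# The blender learns how to combine the base models' predictions
blender = LinearRegression().fit(blender_inputs, y_demo)
print(blender_inputs.shape)  # (300, 3)
print(dict(zip(base_models, blender.coef_.round(2))))
```

Because every prediction comes from a model that never saw that instance during training, the blender learns from realistic predictions instead of overly optimistic ones.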

**Multilayer Stacking Ensemble**

- It is actually possible to train several different blenders this way (e.g., one using linear regression, another using random forest regression) to get a whole layer of blenders, and then add another blender on top of that to produce the final prediction.
- We may be able to squeeze out a few more drops of performance by doing this, but it will cost us in both training time and system complexity.
- Scikit-Learn provides two classes for stacking ensembles: StackingClassifier and StackingRegressor.
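For the house-price style scenario described earlier, a StackingRegressor sketch might look like the following. The dataset is synthetic and the choice of LinearRegression as the blender is an assumption for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data standing in for a house-price dataset
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

stacking_reg = StackingRegressor(
    estimators=[
        ("tree", DecisionTreeRegressor(max_depth=4, random_state=42)),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
        ("linear", LinearRegression()),
    ],
    final_estimator=LinearRegression(),  # the blender
    cv=5,  # folds used to produce the blender's training inputs
)
stacking_reg.fit(X_tr, y_tr)

test_r2 = stacking_reg.score(X_te, y_te)  # R² on the held-out test set
print(test_r2)
```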


In [27]:
X, y = make_moons(n_samples = 500, noise = 0.30, random_state = 42)
X_train, X_test, y_train,y_test = train_test_split(X,y,random_state = 42)

In [28]:
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stacking_clf = StackingClassifier(estimators = [('lr', LogisticRegression(random_state = 42)),
                                                ('rf', RandomForestClassifier(random_state = 42)),
                                                ('svc', SVC(probability = True, random_state = 42))
                                               ],
                                  final_estimator = RandomForestClassifier(random_state = 43),  # the blender
                                  cv = 5  # number of cross-validation folds
                                 )
stacking_clf.fit(X_train, y_train)



# 7. Conclusion

- Ensemble methods are versatile, powerful, and fairly simple to use. 
- Random forests, AdaBoost, and GBRT are among the first models we should test for most machine learning tasks,
  - and they particularly shine with heterogeneous tabular data.
  - Moreover, as they require very little preprocessing, they are great for getting a prototype up and running quickly. 
- Lastly, ensemble methods like voting classifiers and stacking classifiers can help push your system’s performance to its limits.

# Happy Learning

