# Ensemble Learning

-> A group of predictors is called an ensemble. An ensemble learning algorithm is called an ensemble method.

eg: Train a group of decision tree classifiers each on a different random subset of the training set. We can then obtain the predictions from all the individual trees, and the class that gets the most votes is the ensemble's prediction. Such an ensemble of decision trees is called a random forest.

-> To get a diverse set of classifiers (for improved ensemble performance):

1. Use different training algorithms - Voting Classifier
2. Use the same training algorithm, but train each predictor on a different random subset of the training set - Bagging and Pasting
3. Combine several weak learners into a strong learner by training them sequentially - Boosting
4. Train a model to aggregate the predictions of the other predictors - Stacking

## Voting Classifiers

Suppose we've trained a logistic regression classifier, an SVM classifier, a random forest classifier, and a k-nearest neighbors classifier. A simple way to build an even better classifier is to aggregate the predictions of each one: the class that gets the most votes is the ensemble's prediction. This majority-vote classifier is called a hard voting classifier. It often achieves higher accuracy than the best classifier in the ensemble, provided there are enough sufficiently diverse classifiers (an effect of the law of large numbers).

Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make different types of errors, which improves the ensemble's accuracy.
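The law-of-large-numbers effect can be checked with a small simulation. This is only a sketch under strong assumptions: the task is binary, the classifier count, instance count, and 51% accuracy are hypothetical values chosen for illustration, and every vote is drawn independently, which real classifiers trained on the same data rarely satisfy.

```python
import numpy as np

rng = np.random.default_rng(42)

n_classifiers = 1000   # hypothetical number of weak classifiers
n_instances = 2000     # hypothetical number of test instances
p_correct = 0.51       # each classifier is right only 51% of the time

# correct[i, j] is True when classifier j is right on instance i.
# Assumption: all votes are independent of one another.
correct = rng.random((n_instances, n_classifiers)) < p_correct

individual_accuracy = correct.mean()
# In a binary task the majority vote is right when more than half
# of the classifiers are right.
majority_accuracy = (correct.sum(axis=1) > n_classifiers / 2).mean()

print(f"average individual accuracy: {individual_accuracy:.3f}")
print(f"majority-vote accuracy:      {majority_accuracy:.3f}")
```

The majority vote should come out clearly more accurate than any single simulated classifier. With correlated errors the gain would be smaller, which is why diversity matters.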

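When we fit a VotingClassifier, it clones every estimator and fits the clones. The original estimators stay available via the estimators attribute, and the fitted clones via estimators_ (or named_estimators_). By default, the predict() method performs hard voting.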

If all classifiers can estimate class probabilities (i.e., they have a predict_proba() method), Scikit-Learn can predict the class with the highest class probability, averaged over the individual classifiers. This is called soft voting. It often performs better than hard voting because it gives more weight to highly confident votes. To use it, set the voting hyperparameter to "soft" and ensure that all classifiers can estimate class probabilities. SVC does not by default, so its probability hyperparameter must be set to True.
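To see how the two voting schemes can disagree, here is a minimal hand-computed sketch. The probabilities are made-up values for a single instance, not outputs of the classifiers trained in this notebook.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE instance.
# Columns are classes 0 and 1.
probas = np.array([
    [0.55, 0.45],   # classifier A: weakly prefers class 0
    [0.60, 0.40],   # classifier B: weakly prefers class 0
    [0.05, 0.95],   # classifier C: strongly prefers class 1
])

# Hard voting: each classifier votes for its most likely class,
# and the most frequent vote wins.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard voting predicts class", hard_prediction)  # class 0 (two votes to one)
print("soft voting predicts class", soft_prediction)  # class 1 (mean proba 0.4 vs 0.6)
```

The single highly confident classifier outweighs the two lukewarm ones under soft voting, but not under hard voting.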

In [None]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

In [None]:
X,y=make_moons(n_samples=500,noise=0.30,random_state=42)
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)
voting_clf = VotingClassifier(
    estimators= [
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)
voting_clf.fit(X_train,y_train)

In [None]:
for name,clf in voting_clf.named_estimators_.items():
  print(name,"=",clf.score(X_test,y_test))

lr = 0.864
rf = 0.896
svc = 0.896


In [None]:
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]

[array([1]), array([1]), array([0])]

In [None]:
voting_clf.predict(X_test[:1])

array([1])

In [None]:
voting_clf.voting="soft"
voting_clf.named_estimators["svc"].probability=True
voting_clf.fit(X_train,y_train)
voting_clf.score(X_test,y_test)
## 92% accuracy with soft voting

0.92

# Bagging and Pasting (BaggingClassifier/BaggingRegressor)

Every predictor uses the same training algorithm but is trained on a different random subset of the training set. Sampling with replacement is called bagging (short for bootstrap aggregating). Sampling without replacement is called pasting.
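The difference between the two sampling schemes can be sketched with plain NumPy index sampling. The sizes below are hypothetical, and this only mimics the idea; it is not how BaggingClassifier is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 1000     # hypothetical training-set size
n_samples = 600    # hypothetical subset size drawn for one predictor

# Bagging: sample indices WITH replacement, so duplicates can occur.
bagging_idx = rng.choice(n_train, size=n_samples, replace=True)
# Pasting: sample indices WITHOUT replacement, so every index is distinct.
pasting_idx = rng.choice(n_train, size=n_samples, replace=False)

unique_bagging = np.unique(bagging_idx).size
unique_pasting = np.unique(pasting_idx).size

print("distinct instances with bagging:", unique_bagging)
print("distinct instances with pasting:", unique_pasting)
```

With pasting every drawn instance is distinct, while bagging repeats some instances and therefore covers fewer distinct ones for the same subset size.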

Once all predictors are trained, the ensemble makes a prediction for a new instance by aggregating the predictions of all predictors. For classification the aggregation function is typically the statistical mode (the most frequent prediction, as in hard voting), and for regression it is the average. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. The net result is generally an ensemble with similar bias but lower variance than a single predictor trained on the original training set.

Predictors can be trained in parallel, via different CPU cores or even different servers. Similarly, predictions can be made in parallel. This is why bagging and pasting scale very well.

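Bagging introduces a bit more diversity in the subsets each predictor is trained on, so it ends up with a slightly higher bias than pasting. The extra diversity also means the predictors end up less correlated, so the ensemble's variance is reduced. Overall, bagging often results in better models.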

BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., it has a predict_proba() method, as DecisionTreeClassifier does).

With bagging, some training instances may be sampled several times for any given predictor, while others may not be sampled at all. Instances that are not sampled are called out-of-bag (OOB) instances. Since a predictor never sees its OOB instances during training, a bagging ensemble can be evaluated on them without the need for a separate validation set. Set oob_score=True to request this evaluation; the result is stored in oob_score_.
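A minimal sketch of OOB evaluation, assuming the same moons dataset used elsewhere in these notes. The names oob_clf, oob_accuracy, and test_accuracy are introduced here for illustration, and 200 estimators is an arbitrary choice.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_moons, y_moons, random_state=42)

# oob_score=True asks for an automatic evaluation on the out-of-bag instances.
oob_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=200,
    oob_score=True, n_jobs=-1, random_state=42)
oob_clf.fit(X_tr, y_tr)

oob_accuracy = oob_clf.oob_score_          # estimated without touching X_te
test_accuracy = oob_clf.score(X_te, y_te)  # for comparison only

print("OOB accuracy: ", oob_accuracy)
print("test accuracy:", test_accuracy)
# Per-instance OOB class probabilities, one row per training instance.
print(oob_clf.oob_decision_function_[:3])
```

The OOB estimate is usually close to the test accuracy, though the two are not expected to match exactly.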

BaggingClassifier also supports sampling the features. Sampling both training instances and features is called the random patches method. Keeping all training instances (bootstrap=False and max_samples=1.0) but sampling features is called the random subspaces method.

Hyperparameters - max_features, max_samples, bootstrap_features, bootstrap

n_jobs parameter - number of CPU cores to use for training and predictions. -1 means use all available cores.
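The two feature-sampling variants could be configured roughly as follows. The synthetic 20-feature dataset and the 0.5 / 0.7 sampling ratios are assumptions made for illustration, since the moons data has only two features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset with enough features to make feature sampling meaningful.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# Random subspaces: keep every training instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42)

# Random patches: sample both the training instances and the features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42)

subspaces_clf.fit(X_demo, y_demo)
patches_clf.fit(X_demo, y_demo)

# estimators_features_ holds the feature indices drawn for each predictor.
print(subspaces_clf.estimators_features_[0])
print(patches_clf.estimators_features_[0])
```

Sampling features tends to help most on high-dimensional inputs, where it adds predictor diversity at the cost of a bit more bias.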



In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf=BaggingClassifier(DecisionTreeClassifier(),n_estimators=500,max_samples=100,n_jobs=-1,random_state=42)
bag_clf.fit(X_train,y_train)

In [None]:
from sklearn.metrics import accuracy_score
y_pred=bag_clf.predict(X_test)
accuracy_score(y_test,y_pred)

0.904

# Random Forests (RandomForestClassifier/RandomForestRegressor)

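A random forest is an ensemble of decision trees, generally trained via the bagging method, typically with max_samples set to the size of the training set. Instead of searching for the best feature among all features when splitting a node, the algorithm searches among a random subset of features (by default RandomForestClassifier considers sqrt(n) features, where n is the total number of features). This results in greater tree diversity, which trades a higher bias for a lower variance.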

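Extra-trees - Extremely randomized trees (ExtraTreesClassifier). In a random forest, only a random subset of features is considered for splitting at each node. Trees can be made even more random by also using a random threshold for each feature rather than searching for the best possible threshold. For a single tree this is done by setting splitter="random" when creating a DecisionTreeClassifier. This again trades more bias for lower variance, and it makes extra-trees faster to train, since finding the best threshold is one of the most time-consuming parts of growing a tree.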
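It is hard to tell in advance which of the two will do better, so one option is to try both and compare with cross-validation. The sketch below assumes the moons dataset and 100 trees per model; the dictionary names are introduced only for this example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "extra-trees": ExtraTreesClassifier(n_estimators=100, random_state=42),
}

# Mean 5-fold cross-validated accuracy for each model.
cv_scores = {
    name: cross_val_score(model, X_moons, y_moons, cv=5).mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

ExtraTreesClassifier takes the same main hyperparameters as RandomForestClassifier, except that bootstrap defaults to False.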

## Feature importance

Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average, across all trees in the forest. It is a weighted average, where each node's weight is the number of training samples associated with it. Scikit-Learn computes this score automatically for each feature after training, then scales the results so that the sum of all importances equals 1. The result is stored in feature_importances_.

Random forests are useful to get a quick understanding of which features actually matter, in particular when we need to perform feature selection.
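As a quick illustration, the sketch below reads feature_importances_ from a forest trained on the iris dataset. That dataset is an assumption made here because it has named features; it is not used elsewhere in these notes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
iris_clf = RandomForestClassifier(n_estimators=500, random_state=42)
iris_clf.fit(iris.data, iris.target)

# Map each feature name to its importance; the scores sum to 1.
importances = dict(zip(iris.feature_names, iris_clf.feature_importances_))
for name, score in importances.items():
    print(f"{score:.2f}  {name}")
```

On this dataset the petal measurements should dominate the sepal measurements.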

In [None]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf=RandomForestClassifier(n_estimators=500,max_leaf_nodes=16,n_jobs=-1,random_state=42)
rnd_clf.fit(X_train,y_train)
y_pred_clf=rnd_clf.predict(X_test)


# Boosting

Train predictors sequentially, each trying to correct its predecessor. Because each predictor depends on the previous one, training cannot be parallelized, so boosting does not scale as well as bagging or pasting. The two main boosting methods are:

1. AdaBoost (Adaptive Boosting)

The algorithm first trains a base classifier (such as a decision tree) and uses it to make predictions on the training set. It then increases the relative weight of the misclassified training instances, trains a second classifier using the updated weights, and so on.

2. Gradient Boosting

Instead of tweaking the instance weights at every iteration as AdaBoost does, this method fits each new predictor to the residual errors made by the previous predictor.

Popular gradient boosting libraries - XGBoost, CatBoost, LightGBM

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=30,
    learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
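One round of the AdaBoost reweighting step can be sketched by hand. This follows a common textbook formulation for binary classification (weighted error rate, predictor weight, exponential boost of misclassified instances). It is an assumption that this matches the idea described above; it is not claimed to be the exact update AdaBoostClassifier performs internally.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X_moons)
learning_rate = 0.5

# Start with equal weights for all training instances.
weights = np.full(m, 1 / m)

# Train the first weak learner (a decision stump) with those weights.
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_moons, y_moons, sample_weight=weights)
misclassified = stump.predict(X_moons) != y_moons

# Weighted error rate of this predictor, and its resulting weight alpha.
error_rate = weights[misclassified].sum() / weights.sum()
alpha = learning_rate * np.log((1 - error_rate) / error_rate)

# Boost the weights of the misclassified instances, then normalize.
new_weights = weights.copy()
new_weights[misclassified] *= np.exp(alpha)
new_weights /= new_weights.sum()

print("weighted error rate:", error_rate)
print("predictor weight:   ", alpha)
```

The next stump would be trained with new_weights, so it concentrates on the instances the first one got wrong.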

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)  # y = 3x² + Gaussian noise
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, random_state=42)
gbrt.fit(X, y)
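The residual-fitting idea can also be written out by hand with three small regression trees. This is a sketch on a freshly generated noisy quadratic dataset (the names X_quad and y_quad are introduced here). It illustrates the principle and is not claimed to reproduce GradientBoostingRegressor's predictions exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_quad = rng.random((100, 1)) - 0.5
y_quad = 3 * X_quad[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

# First tree fits the targets directly.
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X_quad, y_quad)

# Second tree fits the residual errors left by the first tree.
residuals1 = y_quad - tree_reg1.predict(X_quad)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X_quad, residuals1)

# Third tree fits the residual errors left by the second tree.
residuals2 = residuals1 - tree_reg2.predict(X_quad)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X_quad, residuals2)

# The ensemble prediction is the sum of the predictions of all trees.
trees = (tree_reg1, tree_reg2, tree_reg3)
X_new = np.array([[-0.4], [0.0], [0.5]])
y_pred = sum(tree.predict(X_new) for tree in trees)

# Training error after one tree versus after all three.
mse_one_tree = np.mean(residuals1 ** 2)
mse_three_trees = np.mean((residuals2 - tree_reg3.predict(X_quad)) ** 2)
print(y_pred, mse_one_tree, mse_three_trees)
```

Each added tree reduces the remaining training error, which is the behavior a learning rate below 1.0 would slow down in exchange for better generalization.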

# Stacking (Stacked generalization) - StackingClassifier/StackingRegressor

The final predictor is called a blender or meta learner.

Input to blender - use cross_val_predict() on every predictor to get out-of-sample predictions for each instance in the original training set. These predictions become the blender's input features, one feature per predictor.

Target - the targets from the original training set

By default StackingClassifier uses LogisticRegression as final estimator. StackingRegressor uses RidgeCV.

For each predictor, StackingClassifier calls predict_proba() if available. If not, it falls back to decision_function(), and as a last resort to predict().
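The blender's training set can be built by hand to make the procedure concrete. The sketch below assumes the moons dataset, a binary task (so one probability column per predictor is enough), and a LogisticRegression blender. The names blend_train and blend_test are hypothetical, and this is a simplified illustration rather than StackingClassifier's actual implementation.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_moons, y_moons, random_state=42)

base_models = [
    LogisticRegression(random_state=42),
    RandomForestClassifier(random_state=42),
    SVC(probability=True, random_state=42),
]

# Out-of-sample probability of class 1 from each predictor:
# one input feature per predictor for the blender.
blend_train = np.column_stack([
    cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The blender learns to map base predictions to the original targets.
blender = LogisticRegression(random_state=42)
blender.fit(blend_train, y_tr)

# For inference, refit each base predictor on the full training set,
# then feed their predictions to the blender.
for model in base_models:
    model.fit(X_tr, y_tr)
blend_test = np.column_stack(
    [model.predict_proba(X_te)[:, 1] for model in base_models])
stack_accuracy = blender.score(blend_test, y_te)
print("stacked accuracy:", stack_accuracy)
```

Using out-of-sample predictions matters: if the blender were trained on predictions the base models made for instances they had already seen, it would learn to trust them too much.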

In [None]:
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5  # number of cross-validation folds
)
stacking_clf.fit(X_train, y_train)