In [1]:
#Chapter 7. Ensemble Learning and Random Forests
#Suppose you pose a complex question to thousands of random people, then aggregate their answers. 
#In many cases you will find that this aggregated answer is better than an expert’s answer. This is
#called the wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors
#(such as classifiers or regressors), you will often get better predictions than with the best
#individual predictor. A group of predictors is called an ensemble; thus, this technique is 
#called ensemble learning, and an ensemble learning algorithm is called an ensemble method.

#As an example of an ensemble method, you can train a group of decision tree classifiers, each 
#on a different random subset of the training set. You can then obtain the predictions of all 
#the individual trees, and the class that gets the most votes is the ensemble’s prediction 
#(see the last exercise in Chapter 6). Such an ensemble of decision trees is called a random forest,
#and despite its simplicity, this is one of the most powerful machine learning algorithms available today.

#As discussed in Chapter 2, you will often use ensemble methods near the end of a project, once 
#you have already built a few good predictors, to combine them into an even better predictor. 

#In fact, the winning solutions in machine learning competitions often involve several ensemble
#methods—most famously in the Netflix Prize competition.
#In this chapter we will examine the most popular ensemble methods, including voting classifiers, 
#bagging and pasting ensembles, random forests, boosting, and stacking ensembles.

#Voting Classifiers
#Suppose you have trained a few classifiers. A very simple way to create an even better classifier is to
#aggregate the predictions of each classifier: the class that gets the most votes is the ensemble’s
#prediction. This majority-vote classifier is called a hard voting classifier (see Figure 7-2).
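
#To make the majority vote concrete, here is a minimal sketch in plain NumPy. The predictions array is made up
#for illustration (it does not come from any trained model), and hard_vote is a hypothetical helper, not a
#Scikit-Learn function.

```python
import numpy as np

# Hypothetical predictions from three classifiers for five instances
# (one row per classifier, one column per instance).
predictions = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
])

def hard_vote(predictions):
    """Return the most frequent class label in each column."""
    n_classes = predictions.max() + 1
    # Count the votes each class receives for every instance.
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return counts.argmax(axis=0)

ensemble_prediction = hard_vote(predictions)
print(ensemble_prediction)  # [1 0 1 1 0]
```

#Each instance gets the label that at least two of the three classifiers agree on.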


#Ensemble methods work best when the predictors are as independent from one another as possible.
#One way to get diverse classifiers is to train them using very different algorithms. This increases
#the chance that they will make very different types of errors, improving the ensemble’s accuracy.
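
#The benefit of independence can be illustrated with a short calculation. The sketch below assumes an idealized
#setting that real classifiers never fully satisfy: every classifier is right with the same probability (51% here,
#an arbitrary choice) and their errors are perfectly independent. majority_correct_probability is a hypothetical
#helper written for this example.

```python
from math import comb

def majority_correct_probability(n_classifiers, p_correct):
    """Probability that more than half of n independent classifiers are right."""
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k) * p_correct**k * (1 - p_correct)**(n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )

# The majority vote becomes more reliable as independent voters are added.
for n in (1, 11, 101, 1001):
    print(n, round(majority_correct_probability(n, 0.51), 3))
```

#In practice the classifiers are trained on the same data, so their errors are correlated and the gain is smaller.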


#Scikit-Learn provides a VotingClassifier class that’s quite easy to use: just give it a list of 
#name/predictor pairs, and use it like a normal classifier. Let’s try it on the moons dataset 
#(introduced in Chapter 5). We will load and split the moons dataset into a training set and a 
#test set, then we’ll create and train a voting classifier composed of three diverse classifiers:


from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)




voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)

voting_clf.fit(X_train, y_train)




In [2]:
#When you fit a VotingClassifier, it clones every estimator and fits the clones. The original estimators
#are available via the estimators attribute, while the fitted clones are available via the estimators_ 
#attribute. If you prefer a dict rather than a list, you can use named_estimators or named_estimators_
#instead. To begin, let’s look at each fitted classifier’s accuracy on the test set:

voting_clf.estimators

[('lr', LogisticRegression(random_state=42)),
 ('rf', RandomForestClassifier(random_state=42)),
 ('svc', SVC(random_state=42))]

In [3]:
voting_clf.estimators_

[LogisticRegression(random_state=42),
 RandomForestClassifier(random_state=42),
 SVC(random_state=42)]

In [4]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))

lr = 0.864
rf = 0.896
svc = 0.896


In [26]:
voting_clf.named_estimators

{'lr': LogisticRegression(random_state=42),
 'rf': RandomForestClassifier(random_state=42),
 'svc': SVC(probability=True, random_state=42)}

In [5]:
#When you call the voting classifier’s predict() method, it performs hard voting. 
#For example, the voting classifier predicts class 1 for the first instance of the 
#test set, because two out of three classifiers predict that class:

voting_clf.predict(X_test[:1])



array([1])

In [6]:
voting_clf.estimators_

[LogisticRegression(random_state=42),
 RandomForestClassifier(random_state=42),
 SVC(random_state=42)]

In [7]:
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]

[array([1]), array([1]), array([0])]

In [8]:
#Now let’s look at the performance of the voting classifier on the test set:

voting_clf.score(X_test, y_test) #hard voting

0.912

In [9]:
#There you have it! The voting classifier outperforms all the individual classifiers.


#If all classifiers are able to estimate class probabilities (i.e., if they all have a predict_proba() method),
#then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the
#individual classifiers. This is called soft voting. It often achieves higher performance than hard voting
#because it gives more weight to highly confident votes. All you need to do is set the voting classifier’s
#voting hyperparameter to "soft", and ensure that all classifiers can estimate class probabilities. This
#is not the case for the SVC class by default, so you need to set its probability hyperparameter to True
#(this will make the SVC class use cross-validation to estimate class probabilities, slowing down training,
#and it will add a predict_proba() method). Let’s try that:


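voting_clf.voting = "soft"  # switch from hard voting to soft voting
# Set probability on the original estimator (named_estimators), not on the fitted clone
# (named_estimators_): fit() clones the originals again, so only this change is picked up.
voting_clf.named_estimators["svc"].probability = True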
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)


0.92

In [10]:
#We reach 92% accuracy simply by using soft voting—not bad!
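
#As a sanity check on what soft voting does, the sketch below averages the class probabilities by hand and
#compares the result with the classifier's own predictions. It rebuilds the data and a fresh soft voting
#classifier so that it runs on its own; the name soft_clf and the smaller forest (50 trees) are choices made
#for this example only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
soft_clf.fit(X_train, y_train)

# Average the class probabilities estimated by each fitted clone...
probas = np.mean([clf.predict_proba(X_test) for clf in soft_clf.estimators_], axis=0)
# ...and pick the class with the highest average probability.
manual_pred = soft_clf.classes_[probas.argmax(axis=1)]

print(np.array_equal(manual_pred, soft_clf.predict(X_test)))
```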

In [27]:
#Bagging and Pasting
#One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. 
#Another approach is to use the same training algorithm for every predictor but train them on different random
#subsets of the training set. When sampling is performed with replacement, this method is called bagging
#(short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
#In other words, both bagging and pasting allow training instances to be sampled several times across multiple
#predictors, but only bagging allows training instances to be sampled several times for the same predictor.
#This sampling and training process is represented in Figure 7-4.

#Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating 
#the predictions of all predictors. The aggregation function is typically the statistical mode for classification
#(i.e., the most frequent prediction, just like with a hard voting classifier), or the average for regression.
#Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation
#reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower
#variance than a single predictor trained on the original training set.

#As you can see in Figure 7-4, predictors can all be trained in parallel, via different CPU cores or even
#different servers. Similarly, predictions can be made in parallel. This is one of the reasons bagging
#and pasting are such popular methods: they scale very well.
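
#The difference between the two sampling schemes is easy to see on a toy example. The sketch below only draws
#index samples with NumPy; the training set size (10) and sample size (6) are arbitrary values chosen to keep
#the printout short.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10          # size of a toy training set
n_samples = 6   # number of instances drawn for each predictor

for i in range(3):
    # Bagging: sampling with replacement, so an index may appear more than once.
    bagging_idx = rng.choice(m, size=n_samples, replace=True)
    # Pasting: sampling without replacement, so every index is distinct.
    pasting_idx = rng.choice(m, size=n_samples, replace=False)
    print(f"predictor {i}: bagging={np.sort(bagging_idx).tolist()} "
          f"pasting={np.sort(pasting_idx).tolist()}")
```

#In both cases an instance can show up in the samples of several predictors, but only the bagging samples can
#contain the same instance twice.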


#Bagging and Pasting in Scikit-Learn
#Scikit-Learn offers a simple API for both bagging and pasting: the BaggingClassifier class (or BaggingRegressor
#for regression). The following code trains an ensemble of 500 decision tree classifiers: each is trained
#on 100 training instances randomly sampled from the training set with replacement (this is an example of
#bagging, but if you want to use pasting instead, just set bootstrap=False). The n_jobs parameter tells
#Scikit-Learn the number of CPU cores to use for training and predictions, and -1 tells Scikit-Learn to
#use all available cores:
    
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier


bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)


#A BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can 
#estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with decision 
#tree classifiers.


In [28]:
#Figure 7-5 compares the decision boundary of a single decision tree with the decision boundary of a bagging 
#ensemble of 500 trees (from the preceding code), both trained on the moons dataset. As you can see, the 
#ensemble’s predictions will likely generalize much better than the single decision tree’s predictions: the
#ensemble has a comparable bias but a smaller variance (it makes roughly the same number of errors on the 
#training set, but the decision boundary is less irregular).

#Bagging introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up 
#with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being 
#less correlated, so the ensemble’s variance is reduced. Overall, bagging often results in better models, which
#explains why it’s generally preferred. But if you have spare time and CPU power, you can use cross-validation
#to evaluate both bagging and pasting and select the one that works best.


#Out-of-Bag Evaluation
#With bagging, some training instances may be sampled several times for any given predictor, while others may
#not be sampled at all. By default a BaggingClassifier samples m training instances with replacement 
#(bootstrap=True), where m is the size of the training set. With this process, it can be shown mathematically
#that only about 63% of the training instances are sampled on average for each predictor. The remaining 37%
#of the training instances that are not sampled are called out-of-bag (OOB) instances. Note that they are not
#the same 37% for all predictors.
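
#The 63% figure can be checked numerically. When m instances are drawn with replacement from a training set of
#size m, a given instance is missed by every draw with probability (1 - 1/m)^m, which tends to exp(-1) as m
#grows. The sketch below compares that formula with one simulated bootstrap sample; m = 10,000 is an arbitrary
#choice.

```python
import numpy as np

m = 10_000
# Probability that a given instance is picked at least once in m draws
# with replacement: 1 - (1 - 1/m)^m, which tends to 1 - exp(-1) as m grows.
analytical = 1 - (1 - 1 / m) ** m

rng = np.random.default_rng(42)
sample = rng.integers(0, m, size=m)       # one bootstrap sample of indices
empirical = len(np.unique(sample)) / m    # fraction of distinct instances drawn

print(round(analytical, 3), round(empirical, 3), round(1 - np.exp(-1), 3))
```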


#A bagging ensemble can be evaluated using OOB instances, without the need for a separate validation set: 
#indeed, if there are enough estimators, then each instance in the training set will likely be an OOB 
#instance of several estimators, so these estimators can be used to make a fair ensemble prediction for
#that instance. Once you have a prediction for each instance, you can compute the ensemble’s prediction
#accuracy (or any other metric).


#In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier to request an automatic OOB 
#evaluation after training. The following code demonstrates this. The resulting evaluation score is 
#available in the oob_score_ attribute:


bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                           oob_score=True, n_jobs=-1, random_state=42)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_


0.896

In [29]:
#According to this OOB evaluation, this BaggingClassifier is likely to achieve about 89.6% accuracy on the test set. Let’s verify this:

from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)


0.912

In [31]:
bag_clf.score(X_test, y_test) #quicker than accuracy_score


0.912

In [30]:
#We get 91.2% accuracy on the test set. The OOB evaluation was a bit too pessimistic, about 1.6 percentage points too low.


#The OOB decision function for each training instance is also available through the oob_decision_function_ attribute.
#Since the base estimator has a predict_proba() method, the decision function returns the class probabilities
#for each training instance. For example, the OOB evaluation estimates that the first training instance has
#a 67.6% probability of belonging to the positive class and a 32.4% probability of belonging to the negative class:

bag_clf.oob_decision_function_[:3]



array([[0.32352941, 0.67647059],
       [0.3375    , 0.6625    ],
       [1.        , 0.        ]])
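
#The OOB score and the OOB decision function are two views of the same computation: taking the most probable
#class in each row of oob_decision_function_ and comparing it with the training labels should reproduce
#oob_score_. The sketch below checks this on a fresh ensemble (named oob_clf here, with 200 trees to keep it
#quick); it assumes every training instance is out-of-bag for at least one tree, which is very likely with
#this many estimators.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

oob_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                            oob_score=True, random_state=42)
oob_clf.fit(X_train, y_train)

# Most probable class for each training instance, using only the trees
# that did not see that instance during training.
oob_pred = oob_clf.classes_[oob_clf.oob_decision_function_.argmax(axis=1)]
manual_oob_accuracy = np.mean(oob_pred == y_train)

print(manual_oob_accuracy, oob_clf.oob_score_)
```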

In [32]:
#Random Forests
#As we have discussed, a random forest is an ensemble of decision trees, generally trained via the bagging
#method (or sometimes pasting), typically with max_samples set to the size of the training set. Instead of
#building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier
#class, which is more convenient and optimized for decision trees (similarly, there is a RandomForestRegressor
#class for regression tasks). The following code trains a random forest classifier with 500 trees, each limited
#to a maximum of 16 leaf nodes, using all available CPU cores:



from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)

rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
    

In [34]:
rnd_clf.score(X_test, y_test)

0.912

In [35]:
#With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier 
#(to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the
#ensemble itself.

#The random forest algorithm introduces extra randomness when growing trees: instead
#of searching for the very best feature when splitting a node (see Chapter 6), it searches for the best
#feature among a random subset of features. By default, it samples sqrt(n) features (where n is the total
#number of features). The algorithm results in greater tree diversity, which (again) trades a higher bias
#for a lower variance, generally yielding an overall better model. So, the following BaggingClassifier is
#equivalent to the previous RandomForestClassifier:


bag_clf = BaggingClassifier(DecisionTreeClassifier(max_features="sqrt", 
                                                  max_leaf_nodes=16),
                            n_estimators=500, n_jobs=-1, random_state=42)



In [36]:
#Extra-Trees
#When you are growing a tree in a random forest, at each node only a random subset of
#the features is considered for splitting (as discussed earlier). It is possible to make trees
#even more random by also using random thresholds for each feature rather than
#searching for the best possible thresholds (like regular decision trees do). For this,
#simply set splitter="random" when creating a DecisionTreeClassifier.
#A forest of such extremely random trees is called an extremely randomized trees (or
#extra-trees for short) ensemble. Once again, this technique trades more bias for a lower
#variance. It also makes extra-trees classifiers much faster to train than regular random
#forests, because finding the best possible threshold for each feature at every node is one
#of the most time-consuming tasks of growing a tree.
#You can create an extra-trees classifier using Scikit-Learn’s ExtraTreesClassifier
#class. Its API is identical to the RandomForestClassifier class, except bootstrap
#defaults to False. Similarly, the ExtraTreesRegressor class has the same API as the
#RandomForestRegressor class, except bootstrap defaults to False.

#It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an 
#ExtraTreesClassifier. Generally, the only way to know is to try both and compare them using cross-validation.
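
#A comparison along those lines could look like the sketch below, which scores both classifiers with 5-fold
#cross-validation on the moons dataset. The hyperparameter values (100 trees, 16 leaf nodes) are arbitrary
#choices for the example, not tuned settings.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, max_leaf_nodes=16,
                                            random_state=42),
    "extra-trees": ExtraTreesClassifier(n_estimators=100, max_leaf_nodes=16,
                                        random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    cv_scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(name, round(cv_scores[name], 3))
```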


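#Feature Importance
#A fitted random forest exposes a feature_importances_ attribute: one score per feature, based on how much the
#tree nodes that use that feature reduce impurity on average, scaled so that the scores sum to 1. The following
#code trains a RandomForestClassifier on the iris dataset and prints each feature's importance:

from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score, 2), name)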



0.11 sepal length (cm)
0.02 sepal width (cm)
0.44 petal length (cm)
0.42 petal width (cm)


In [46]:
iris.data.head(3)

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2


In [45]:
iris.target.head(3)

0    0
1    0
2    0
Name: target, dtype: int64

In [47]:
#Random forests are very handy to get a quick understanding of what features actually matter, in particular
#if you need to perform feature selection.
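
#One possible way to turn these importances into a feature selection step is Scikit-Learn's SelectFromModel
#transformer. The sketch below keeps the iris features whose importance is above the mean importance; both
#that threshold and the 100-tree forest are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris(as_frame=True)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="mean",  # keep features scoring above the mean importance
)
selector.fit(iris.data, iris.target)

# get_support() returns a Boolean mask over the input columns.
selected = list(iris.data.columns[selector.get_support()])
print(selected)
```

#Calling selector.transform(iris.data) would then return only the selected columns.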


#Boosting
#Boosting (originally called hypothesis boosting) refers to any ensemble method that can combine several 
#weak learners into a strong learner. The general idea of most boosting methods is to train predictors 
#sequentially, each trying to correct its predecessor. There are many boosting methods available, but
#by far the most popular are AdaBoost (short for adaptive boosting) and gradient boosting. Let’s
#start with AdaBoost.

#AdaBoost
#One way for a new predictor to correct its predecessor is to pay a bit more attention to the training 
#instances that the predecessor underfit. This results in new predictors focusing more and more on the 
#hard cases. This is the technique used by AdaBoost.

#For example, when training an AdaBoost classifier, the algorithm first trains a base classifier
#(such as a decision tree) and uses it to make predictions on the training set. The algorithm then 
#increases the relative weight of misclassified training instances. Then it trains a second classifier,
#using the updated weights, and again makes predictions on the training set, updates the instance weights,
#and so on (see Figure 7-7).
#Figure 7-8 shows the decision boundaries of five consecutive predictors on the moons dataset
#(in this example, each predictor is a highly regularized SVM classifier with an RBF kernel).
#The first classifier gets many instances wrong, so their weights get boosted. The second classifier 
#therefore does a better job on these instances, and so on. The plot on the right represents the same
#sequence of predictors, except that the learning rate is halved (i.e., the misclassified instance 
#weights are boosted much less at every iteration). As you can see, this sequential learning technique 
#has some similarities with gradient descent, except that instead of tweaking a single predictor’s 
#parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

#Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, 
#except that predictors have different weights depending on their overall accuracy on the weighted training set.


#There is one important drawback to this sequential learning technique: training cannot be parallelized
#since each predictor can only be trained after the previous predictor has been trained and evaluated.
#As a result, it does not scale as well as bagging or pasting.
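
#The loop described above can be sketched by hand with decision stumps. This is a simplified two-class
#illustration, not Scikit-Learn's implementation: it assumes one common formulation in which each predictor's
#weight is learning_rate * log((1 - r) / r), where r is its weighted error rate, and the weights of the
#misclassified instances are multiplied by exp(predictor weight) before being normalized. The names weights,
#alphas, and ada_predict are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X)
weights = np.full(m, 1 / m)   # every instance starts with the same weight
learning_rate = 0.5
stumps, alphas = [], []

for _ in range(30):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X, y, sample_weight=weights)
    wrong = stump.predict(X) != y
    # Weighted error rate, clipped to avoid dividing by zero or taking log(0).
    r = np.clip(weights[wrong].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    alpha = learning_rate * np.log((1 - r) / r)   # this predictor's weight
    weights[wrong] *= np.exp(alpha)               # boost the misclassified instances
    weights /= weights.sum()                      # normalize the instance weights
    stumps.append(stump)
    alphas.append(alpha)

def ada_predict(X_new):
    """Weighted vote: each stump adds its weight to the class it predicts."""
    votes = np.zeros((len(X_new), 2))
    for stump, alpha in zip(stumps, alphas):
        votes[np.arange(len(X_new)), stump.predict(X_new)] += alpha
    return votes.argmax(axis=1)

train_accuracy = np.mean(ada_predict(X) == y)
print(train_accuracy)
```

#Each iteration depends on the weights produced by the previous one, which is why the loop cannot be run in parallel.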

#Note: the math behind AdaBoost is explained on pages 290-291.

#The following code trains an AdaBoost classifier based on 30 decision stumps using
#Scikit-Learn’s AdaBoostClassifier class (as you might expect, there is also an
#AdaBoostRegressor class). A decision stump is a decision tree with max_depth=1—in
#other words, a tree composed of a single decision node plus two leaf nodes. This is the
#default base estimator for the AdaBoostClassifier class:



from sklearn.ensemble import AdaBoostClassifier


ada_clf = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=30,
        learning_rate=0.5, random_state=42)

ada_clf.fit(X_train, y_train)


#If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators 
#or more strongly regularizing the base estimator.




In [54]:
#Gradient Boosting


#Another very popular boosting algorithm is gradient boosting. Just like AdaBoost,
#gradient boosting works by sequentially adding predictors to an ensemble, each one
#correcting its predecessor. However, instead of tweaking the instance weights at every
#iteration like AdaBoost does, this method tries to fit the new predictor to the residual
#errors made by the previous predictor.
#Let’s go through a simple regression example, using decision trees as the base
#predictors; this is called gradient tree boosting, or gradient boosted regression trees
#(GBRT). First, let’s generate a noisy quadratic dataset and fit a
#DecisionTreeRegressor to it:
    
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100) # y= 3x**2 + Gaussian noise

tree_reg1 = DecisionTreeRegressor(max_depth=1, random_state=42)
tree_reg1.fit(X, y)

In [56]:
#Next, we’ll train a second DecisionTreeRegressor on the residual errors made by the first predictor:

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(X, y2)

In [57]:
#And then we’ll train a third regressor on the residual errors made by the second predictor:

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(X, y3)

In [61]:
#Now we have an ensemble containing three trees. It can make predictions on a new instance simply by adding 
#up the predictions of all the trees:

X_new = np.array([[-0.4], [0.], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

array([0.62493205, 0.01135118, 0.62505246])

In [62]:
tree_reg1.predict(X_new) #1

array([0.57435458, 0.19765192, 0.19765192])

In [63]:
tree_reg2.predict(X_new) #2

array([ 0.04332508, -0.10506767,  0.42740054])

In [64]:
tree_reg3.predict(X_new) #3

array([ 7.25239814e-03, -8.12330666e-02,  4.27008856e-18])

In [65]:
#Figure 7-9 represents the predictions of these three trees in the left column, and the ensemble’s
#predictions in the right column. In the first row, the ensemble has just one tree, so its predictions
#are exactly the same as the first tree’s predictions. In the second row, a new tree is trained on 
#the residual errors of the first tree. On the right you can see that the ensemble’s predictions are
#equal to the sum of the predictions of the first two trees. Similarly, in the third row another 
#tree is trained on the residual errors of the second tree. You can see that the ensemble’s predictions
#gradually get better as trees are added to the ensemble.
#You can use Scikit-Learn’s GradientBoostingRegressor class to train GBRT ensembles more easily
#(there’s also a GradientBoostingClassifier class for classification). Much like the RandomForestRegressor
#class, it has hyperparameters to control the growth of decision trees (e.g., max_depth, min_samples_leaf),
#as well as hyperparameters to control the ensemble training, such as the number of trees (n_estimators). 
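#The following code trains a comparable three-tree ensemble (here every tree uses max_depth=2, whereas the
#first tree above was trained with max_depth=1, so the predictions will not match exactly):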


from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0,
                                random_state=42)
gbrt.fit(X, y)


In [66]:
#The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, 
#such as 0.05, you will need more trees in the ensemble to fit the training set, but the predictions 
#will usually generalize better. This is a regularization technique called shrinkage. Figure 7-10 shows
#two GBRT ensembles trained with different hyperparameters: the one on the left does not have enough
#trees to fit the training set, while the one on the right has about the right amount. If we added 
#more trees, the GBRT would start to overfit the training set.
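
#One way to see where the validation error stops improving is the staged_predict() method, which yields the
#ensemble's predictions after each added tree. The sketch below uses a freshly generated noisy quadratic
#dataset and a held-out validation split; the variable names (X_quad, gbrt_staged, and so on) and the
#hyperparameter values are choices made for this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_quad = rng.random((200, 1)) - 0.5
y_quad = 3 * X_quad[:, 0] ** 2 + 0.05 * rng.standard_normal(200)
X_tr, X_val, y_tr, y_val = train_test_split(X_quad, y_quad, random_state=42)

gbrt_staged = GradientBoostingRegressor(max_depth=2, n_estimators=300,
                                        learning_rate=0.05, random_state=42)
gbrt_staged.fit(X_tr, y_tr)

# staged_predict() yields the ensemble's predictions after 1, 2, ..., n trees.
val_errors = [mean_squared_error(y_val, y_pred)
              for y_pred in gbrt_staged.staged_predict(X_val)]
best_n_trees = int(np.argmin(val_errors)) + 1

print(best_n_trees, min(val_errors))
```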


#To find the optimal number of trees, you could perform cross-validation using GridSearchCV or 
#RandomizedSearchCV, as usual, but there’s a simpler way: if you set the n_iter_no_change hyperparameter
#to an integer value, say 10, then the GradientBoostingRegressor will automatically stop adding more
#trees during training if it sees that the last 10 trees didn’t help. This is simply early stopping
#(introduced in Chapter 4), but with a little bit of patience: it tolerates having no progress for a
#few iterations before it stops. Let’s train the ensemble using early stopping:

gbrt_best = GradientBoostingRegressor(
            max_depth=2, learning_rate=0.05, 
            n_estimators=500, n_iter_no_change=10, random_state=42)
gbrt_best.fit(X, y)


In [67]:
#If you set n_iter_no_change too low, training may stop too early and the model will underfit. But if
#you set it too high, it will overfit instead. We also set a fairly small learning rate and a high
#number of estimators, but the actual number of estimators in the trained ensemble is much lower,
#thanks to early stopping:

gbrt_best.n_estimators_


92

In [86]:
#When n_iter_no_change is set, the fit() method automatically splits the training set into a smaller
#training set and a validation set: this allows it to evaluate the model’s performance each time it
#adds a new tree. The size of the validation set is controlled by the validation_fraction hyperparameter,
#which is 10% by default. The tol hyperparameter determines the maximum performance improvement that still
#counts as negligible. It defaults to 0.0001.


#The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the
#fraction of training instances to be used for training each tree. For example, if subsample=0.25,
#then each tree is trained on 25% of the training instances, selected randomly. As you can probably
#guess by now, this technique trades a higher bias for a lower variance. It also speeds up training
#considerably. This is called stochastic gradient boosting.


#Histogram-Based Gradient Boosting
#Scikit-Learn also provides another GBRT implementation, optimized for large datasets: histogram-based
#gradient boosting (HGB). It works by binning the input features, replacing them with integers. The number
#of bins is controlled by the max_bins hyperparameter, which defaults to 255 and cannot be set any higher
#than this. Binning can greatly reduce the number of possible thresholds that the training algorithm needs
#to evaluate. Moreover, working with integers makes it possible to use faster and more memory-efficient
#data structures. And the way the bins are built removes the need for sorting the features when training each tree.
    

#As a result, this implementation has a computational complexity of O(b×m) instead of O(n×m×log(m)), where b
#is the number of bins, m is the number of training instances, and n is the number of features. In practice,
#this means that HGB can train hundreds of times faster than regular GBRT on large datasets. However, binning
#causes a precision loss, which acts as a regularizer: depending on the dataset, this may help reduce overfitting,
#or it may cause underfitting.
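
#The effect of binning can be illustrated with a few lines of NumPy. This is only a sketch of the idea (quantile
#bin edges, then one integer code per value), not Scikit-Learn's internal binning code, and max_bins is set to
#16 instead of 255 to keep the numbers small.

```python
import numpy as np

rng = np.random.default_rng(42)
feature = rng.normal(size=1_000)   # one continuous input feature
max_bins = 16                      # small value to keep the example readable

# Interior bin edges at evenly spaced quantiles, so bins hold roughly equal counts.
edges = np.quantile(feature, np.linspace(0, 1, max_bins + 1)[1:-1])
# Replace each value with the integer index of the bin it falls into.
binned = np.searchsorted(edges, feature).astype(np.uint8)

print(len(np.unique(feature)), "distinct values before binning")
print(len(np.unique(binned)), "distinct values after binning")
```

#After binning, a split search only has to consider the boundaries between bins rather than every distinct value.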

#Scikit-Learn provides two classes for HGB: HistGradientBoostingRegressor and HistGradientBoostingClassifier.
#They’re similar to GradientBoostingRegressor and GradientBoostingClassifier, with a few notable differences:

#Early stopping is automatically activated if the number of instances is greater than 10,000. You can
#turn early stopping always on or always off by setting the early_stopping hyperparameter to True or False.
#Subsampling is not supported.
#n_estimators is renamed to max_iter.
#The only decision tree hyperparameters that can be tweaked are max_leaf_nodes, min_samples_leaf, and max_depth.

#The HGB classes also have two nice features: they support both categorical features and missing values.
#This simplifies preprocessing quite a bit. However, the categorical features must be represented as
#integers ranging from 0 to a number lower than max_bins. You can use an OrdinalEncoder for this.
#For example, here’s how to build and train a complete pipeline for the California housing dataset
#introduced in Chapter 2:
        
        
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder

housing = pd.read_csv("/Users/rokosango/PhD/HandsOnML/handson-ml3/datasets/housing/housing.csv")

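hgb_reg = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ["ocean_proximity"]),
                            remainder="passthrough"),
    HistGradientBoostingRegressor(categorical_features=[0], random_state=42)
)

# Train the pipeline. Assumption: the target is the CSV's median_house_value column,
# and the whole file is used for training (no train/test split is made here).
housing_labels = housing["median_house_value"]
hgb_reg.fit(housing.drop(columns="median_house_value"), housing_labels)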


#The whole pipeline is just as short as the imports! No need for an imputer, scaler, or a one-hot encoder, 
#so it’s really convenient. Note that categorical_features must be set to the categorical column indices 
#(or a Boolean array). Without any hyperparameter tuning, this model yields an RMSE of about 47,600, which
#is not too bad.

#Several other optimized implementations of gradient boosting are available in the Python ML ecosystem:
#in particular, XGBoost, CatBoost, and LightGBM. These libraries have been around for several years. 
#They are all specialized for gradient boosting, their APIs are very similar to Scikit-Learn’s, and they
#provide many additional features, including GPU acceleration; you should definitely check them out!
#Moreover, the TensorFlow Decision Forests library provides optimized implementations of a variety of
#random forest algorithms, including plain random forests, extra-trees, GBRT, and several more.





In [None]:
#Stacking
#The last ensemble method we will discuss in this chapter is called stacking (short for stacked generalization).
#It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the
#predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation? Figure 
#7-11 shows such an ensemble performing a regression task on a new instance. Each of the bottom three predictors
#predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner)
#takes these predictions as inputs and makes the final prediction (3.0).
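
#Scikit-Learn offers a StackingClassifier class (and a StackingRegressor) for this. The sketch below trains one
#on the moons dataset as an illustration. The base estimators, the random forest blender, and cv=5 are
#assumptions made for this example, so its score is not necessarily the accuracy quoted elsewhere in this chapter.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    # The blender is trained on the base estimators' predictions.
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=43),
    cv=5,  # those predictions are made out-of-fold, via 5-fold cross-validation
)
stacking_clf.fit(X_train, y_train)

stacking_score = stacking_clf.score(X_test, y_test)
print(stacking_score)
```

#Using out-of-fold predictions means the blender never sees a prediction made on data the base estimator was trained on.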




#If you train such a stacking model on the moons dataset and evaluate it on the test set, you will find 92.8%
#accuracy, which is a bit better than the voting classifier using soft voting, which got 92%.

#In conclusion, ensemble methods are versatile, powerful, and fairly simple to use. Random forests, AdaBoost,
#and GBRT are among the first models you should test for most machine learning tasks, and they particularly
#shine with heterogeneous tabular data. Moreover, as they require very little preprocessing, they’re great
#for getting a prototype up and running quickly. Lastly, ensemble methods like voting classifiers and stacking
#classifiers can help push your system’s performance to its limits.


    
    
    