Welcome to the **6th tutorial** in this tutorial series! In the [last tutorial](https://www.kaggle.com/fengdanye/machine-learning-5-random-forests), we talked about Random Forest. Random Forest consists of an ensemble of decision trees, and is one popular example of **ensemble learning**. 

This tutorial is the first part of the two-part tutorial on ensemble learning, covering basic concepts and techniques such as <font color='blue'> bagging and pasting</font>. The next tutorial is the second part, covering advanced and powerful techniques such as <font color='blue'>boosting and stacking</font>. As one of the most popular boosting techniques on Kaggle, <font color='blue'>XGBoost</font> will also be covered in the next tutorial. However, if you are new to ensemble learning, it is best to start from this tutorial. Understanding the basic concepts and techniques of ensemble learning is the foundation for understanding the advanced techniques.

Here is a list of my previous tutorials, if you are interested:
* [Machine Learning 1 - Regression, Gradient Descent](https://www.kaggle.com/fengdanye/machine-learning-1-regression-gradient-descent)  
* [Machine Learning 2 Regularized LM, Early Stopping](https://www.kaggle.com/fengdanye/machine-learning-2-regularized-lm-early-stopping)  
* [Machine Learning 3 Logistic and Softmax Regression](https://www.kaggle.com/fengdanye/machine-learning-3-logistic-and-softmax-regression)
* [Machine Learning 4 Support Vector Machine](https://www.kaggle.com/fengdanye/machine-learning-4-support-vector-machine)
* [Machine Learning 5 Random Forests](https://www.kaggle.com/fengdanye/machine-learning-5-random-forests)
--------------------------------

**Table of Contents**
* Ensemble Learning Classification
    * Introduction
        * Voting rules
    * Ensemble of different classifiers
    * Ensemble of same classifiers
        * Random sampling of training instances
            * Bagging
                * oob score
            * Pasting
            * Example
        * Random sampling of features
            * Example
        * Random thresholds - Extra Trees
            * Example
    * Summary of performance

* Ensemble Learning Regression
    * Introduction
    * Example - BaggingRegressor
        * oob score
    * Example - RandomForestRegressor
    * Example - ExtraTreesRegressor
    * Summary of performance

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

import os
print(os.listdir("../input"))

In [None]:
plt.rc('axes', lw = 1.5)
plt.rc('xtick', labelsize = 14)
plt.rc('ytick', labelsize = 14)
plt.rc('xtick.major', size = 5, width = 3)
plt.rc('ytick.major', size = 5, width = 3)

# Ensemble Learning Classification
## Introduction
In the previous tutorials, we have introduced several families of classifiers - logistic regression, SVM, decision trees, etc. We mostly focused on using one classifier of one family to classify a sample $\vec{x}$. However, we can also combine the knowledge of multiple classifiers and make a "consensus" decision from each classifier's own decision. Usually this method performs better (e.g. better generalization to test set, less overfitting) than a single classifier, and we saw that in the last tutorial with the example of a Random Forest. The technique of classifying instances based on multiple classifiers' decisions is called **Ensemble learning classification**.

### Voting rules
Now, let's imagine we have five classifiers in the ensemble and we are predicting a class for a single instance $\vec{x}$. Each classifier will make a decision on its own. In this case, the ensemble will predict the class based on **majority of votes**:
<img src="https://imgur.com/eAruUj3.png" width="600px"/>
As you can see, since four classifiers predicted "Class 2", and only one classifier predicted "Class 1", the ensemble decides the final prediction is "Class 2". This voting rule is sometimes called **hard voting**.

Now, remember that many classifiers we have introduced not only predict a class, but also provide prediction probabilities for each class. If all classifiers in the ensemble provide prediction probabilities, we can also use the so-called **soft voting** rule:
<img src="https://imgur.com/ud382N9.png" width="600px"/>
Here, the probabilities for each class are averaged over all classifiers in the ensemble, and the class that has the highest average probability is predicted. **In many cases, soft voting performs better than hard voting**, since it takes in more information and gives higher weight to highly confident predictions (i.e. predictions with high probability). As introduced in the [last tutorial](https://www.kaggle.com/fengdanye/machine-learning-5-random-forests), for example, Scikit-learn's Random Forest classifier uses soft voting by default.
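To make the difference between the two voting rules concrete, here is a minimal sketch in plain Python. The five probability vectors are made-up numbers for a single instance, and the helper names `hard_vote` and `soft_vote` are hypothetical (they are not Scikit-learn functions). The numbers are chosen so that three weakly confident classifiers outvote two highly confident ones under hard voting, while soft voting sides with the confident ones.

```python
from collections import Counter

# Hypothetical probabilities [P(Class 1), P(Class 2)] from five classifiers
# for one instance. The numbers are made up for illustration only.
probas = [
    [0.55, 0.45],
    [0.60, 0.40],
    [0.52, 0.48],
    [0.10, 0.90],
    [0.05, 0.95],
]
class_names = ["Class 1", "Class 2"]


def hard_vote(probas):
    """Each classifier votes for its most probable class; the majority wins."""
    votes = [max(range(len(p)), key=lambda k: p[k]) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas):
    """Average the class probabilities and pick the class with the highest mean."""
    n_classes = len(probas[0])
    means = [sum(p[k] for p in probas) / len(probas) for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: means[k])


print("hard voting:", class_names[hard_vote(probas)])
print("soft voting:", class_names[soft_vote(probas)])
```

In this toy case the two rules disagree: hard voting counts three votes for Class 1, whereas the averaged probability of Class 2 is higher because the last two classifiers are very confident.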

#### Side note on how to obtain prediction probability in SVM (only if you are interested)
SVM is one classifier that, by default, does not produce prediction probabilities. However, you can force a probability calculation in Scikit-learn's SVC by setting "probability" to *True*. You can then call the predict_proba() function to obtain class probabilities. To understand how the probabilities are obtained, you can read the [documentation](https://scikit-learn.org/stable/modules/svm.html#scores-and-probabilities), as well as this [Stack Overflow answer](https://stackoverflow.com/questions/15111408/how-does-sklearn-svm-svcs-function-predict-proba-work-internally). The basic idea is as follows:
* First, train SVM in a cross-validation manner. For each fold, there is a training set and a hold-out set. The $\vec{w}$ and $b$ are obtained through training on the training set, and then $\vec{w}\cdot \vec{x}+b$ is calculated on the hold-out set. If this is a five-fold cross-validation (which is the case for Scikit-learn's SVC), then you will have 5 hold-out sets, the union of which is the whole data set. Note that in the aforementioned links, $\vec{w}\cdot \vec{x}+b$ is represented by $f$.
* The union of the $\vec{w}\cdot \vec{x}+b$ values calculated on the hold-out sets is used to train a logistic sigmoid function: $P(y=1|f)=\frac{1}{1+\exp(Af+B)}$, where $f=\vec{w}\cdot \vec{x}+b$ on the hold-out sets. $P(y=1|f)>0.5$ predicts $y=1$, and $P(y=1|f)\leq0.5$ predicts $y=0$. During training, the parameters $A$ and $B$ are optimized to minimize the cross-entropy loss function. This is basically a logistic regression on the SVM scores.
* Now that we have $A$ and $B$, SVM is re-trained on the entire data set. For a given instance $\vec{x}$, the re-trained SVM produces a value $f=\vec{w}\cdot \vec{x}+b$, and then $P(y=1|f)=\frac{1}{1+\exp(Af+B)}$ produces the probability.  

This method of conducting logistic regression on SVM scores is called **Platt scaling**. If you are interested in learning more about this method, you can read Platt's paper “[Probabilistic outputs for SVMs and comparisons to regularized likelihood methods](http://www.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf)”. The multiclass case is extended by Wu, Lin and Weng, “[Probability estimates for multi-class classification by pairwise coupling](https://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf)”, JMLR 5:975-1005, 2004.

Note that asking SVM to calculate prediction probabilities will significantly slow down the training speed since there is a 5-fold cross validation.
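The sigmoid-fitting step of Platt scaling can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the scores and labels below are made up and stand in for the hold-out values of $f$, the cross-validation step is skipped, and $A$ and $B$ are fitted with plain gradient descent rather than the more careful optimizer used by libsvm. The function names `platt_probability` and `fit_platt` are hypothetical.

```python
import math


def platt_probability(f, A, B):
    """P(y=1 | f) = 1 / (1 + exp(A*f + B)), the sigmoid used in Platt scaling."""
    return 1.0 / (1.0 + math.exp(A * f + B))


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit A and B by gradient descent on the mean cross-entropy loss."""
    A, B = 0.0, 0.0
    m = len(scores)
    for _ in range(n_iter):
        grad_A = 0.0
        grad_B = 0.0
        for f, y in zip(scores, labels):
            p = platt_probability(f, A, B)
            # For one instance: d(loss)/dA = (y - p) * f and d(loss)/dB = (y - p)
            grad_A += (y - p) * f
            grad_B += (y - p)
        A -= lr * grad_A / m
        B -= lr * grad_B / m
    return A, B


# Made-up hold-out SVM scores f and their true labels (not linearly separable)
scores = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.3, 0.4, 1.0, 1.5, 2.2]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

A, B = fit_platt(scores, labels)
print("A =", A, "B =", B)
print("P(y=1 | f=2.0)  =", platt_probability(2.0, A, B))
print("P(y=1 | f=-2.0) =", platt_probability(-2.0, A, B))
```

Because of the sign convention in the formula, a fitted $A$ that is negative means that larger SVM scores map to larger probabilities of $y=1$.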

When we talk about an ensemble of classifiers, there are two possibilities: 
* An ensemble of different classifiers, which, for example, can contain logistic regression classifiers, SVM classifiers, and decision trees, all at the same time.
* An ensemble of the same type of classifiers, such as a Random Forest.

<img src="https://imgur.com/mlRpzUf.png" width="400px"/>

We will first talk about ensemble of different classifiers.

## Ensemble of different classifiers
Let's build a simple ensemble of different classifiers as follows:
<img src="https://imgur.com/3G6054b.png" width="400px"/>
In this ensemble, there are only three classifiers: **a Random Forest classifier, a SVM classifier, and a logistic regression classifier**. Note that random forest itself is an ensemble learning classifier. Again, we will be using the Red Wine Quality dataset:

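In [None]:
wineData = pd.read_csv('../input/winequality-red.csv')

# Binarize quality: True ("good") if quality >= 7, otherwise False ("not good")
wineData['category'] = wineData['quality'] >= 7
X = wineData[wineData.columns[0:11]].values  # the 11 input features
y = wineData['category'].values.astype(int)  # np.int was removed in NumPy 1.24; the built-in int works

In [None]:
wineData.head()

In [None]:
wineData['category'].value_counts()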

The wine quality is binarized into either "good" ($y=1$, quality>=7) or "not good" ($y=0$, quality<7). The input $X$ consists of 11 features such as fixed acidity and pH. We will then split the data set into a training set and a test set:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

print('X train size: ', X_train.shape)
print('y train size: ', y_train.shape)
print('X test size: ', X_test.shape)
print('y test size: ', y_test.shape)

Again, the "random_state" is set to guarantee a repeatable result for this tutorial. In practice, you should remove the random_state argument.  
To construct a voting classifier, we use Scikit-learn's **VotingClassifier**. Note that we set the voting rule to "soft". This is because only soft voting produces prediction probabilities, and we need the probabilities to plot the ROC curve.

In [None]:
# Below, random_state is only used to guarantee repeatable result for the tutorial. 
rfClf = RandomForestClassifier(n_estimators=500, random_state=0) # 500 trees. 
svmClf = SVC(probability=True, random_state=0) # force a probability calculation
logClf = LogisticRegression(random_state=0)

clf = VotingClassifier(estimators = [('rf',rfClf), ('svm',svmClf), ('log', logClf)], voting='soft') # construct the ensemble classifier

**To construct the ensemble classifier, all you need to do is to specify what estimators you want to include in VotingClassifier()**. In this case we have "estimators = [('rf',rfClf), ('svm',svmClf), ('log', logClf)]". Now, we train the classifier on the training set:

In [None]:
clf.fit(X_train, y_train) # train the ensemble classifier

And then examine how the classifier performs on the test set:

In [None]:
from sklearn.metrics import precision_score, accuracy_score
y_true, y_pred = y_test, clf.predict(X_test)
print('precision on the test set: ', precision_score(y_true, y_pred))
print('accuracy on the test set: ', accuracy_score(y_true, y_pred))

We have achieved a precision of 68.2% and an accuracy of 87.7%. Note that in the [last tutorial](https://www.kaggle.com/fengdanye/machine-learning-5-random-forests), we achieved 58.3% precision and 87.7% accuracy with a single Random Forest classifier. Though the accuracy has not improved, precision has improved considerably with the use of an ensemble classifier. Remember that precision = TP/(TP+FP). An improvement in precision indicates that among all predicted positives, the proportion of true positives has increased.
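As a small worked example of how precision can change while accuracy stays the same, consider the following sketch. The confusion-matrix counts are hypothetical (they are not the counts from this tutorial's classifiers), and the helper functions are written out by hand only to show the arithmetic.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): the share of predicted positives that are correct."""
    return tp / (tp + fp)


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / total number of instances."""
    return (tp + tn) / (tp + tn + fp + fn)


# Two hypothetical classifiers evaluated on the same 100 instances
# (13 actual positives, 87 actual negatives).
clf_a = {"tp": 7, "fn": 6, "fp": 5, "tn": 82}
clf_b = {"tp": 5, "fn": 8, "fp": 3, "tn": 84}

for name, c in [("A", clf_a), ("B", clf_b)]:
    print(name,
          "precision:", round(precision(c["tp"], c["fp"]), 3),
          "accuracy:", round(accuracy(c["tp"], c["tn"], c["fp"], c["fn"]), 3))
```

Both hypothetical classifiers get 89 of the 100 instances right, but classifier B raises fewer false alarms relative to its number of positive predictions, so its precision is higher.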

Of course, there are more measures you can calculate than just precision and accuracy. You can find the definition of a full list of performance measures on this [Wikipedia page](https://en.wikipedia.org/wiki/Confusion_matrix).

Now, let's plot the ROC curve and calculate AUC:

In [None]:
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
phat = clf.predict_proba(X_test)[:,1]

In [None]:
plt.subplots(figsize=(8,6))
fpr, tpr, thresholds = roc_curve(y_test, phat)
plt.plot(fpr, tpr)
x = np.linspace(0,1,num=50)
plt.plot(x,x,color='lightgrey',linestyle='--',marker='',lw=2,label='random guess')
plt.legend(fontsize = 14)
plt.xlabel('False positive rate', fontsize = 18)
plt.ylabel('True positive rate', fontsize = 18)
plt.xlim(0,1)
plt.ylim(0,1)
plt.show()

In [None]:
print('AUC is: ', auc(fpr,tpr))

This AUC (0.914) is a slight improvement over the AUC calculated from a single Random Forest classifier (0.909) in the [last tutorial](https://www.kaggle.com/fengdanye/machine-learning-5-random-forests/). For an introduction to the ROC curve and AUC, see my [previous tutorial](https://www.kaggle.com/fengdanye/machine-learning-3-logistic-and-softmax-regression).
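For readers who want to see what the area calculation amounts to, here is a minimal sketch of the trapezoidal rule applied to ROC points, which is the rule that `sklearn.metrics.auc` uses. The ROC points below are made up for illustration, and `trapezoid_auc` is a hypothetical helper name.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a curve given as points (fpr[i], tpr[i]), using the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2.0
        area += width * mean_height
    return area


# Made-up ROC points, ordered by increasing false positive rate
fpr = [0.0, 0.0, 0.5, 1.0]
tpr = [0.0, 0.5, 1.0, 1.0]
print("AUC of the toy curve:", trapezoid_auc(fpr, tpr))

# The diagonal "random guess" line always has an area of 0.5
print("AUC of random guess:", trapezoid_auc([0.0, 1.0], [0.0, 1.0]))
```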

## Ensemble of same classifiers
**Random Forest is one of the most popular examples of an ensemble of same classifiers**. In a random forest, each tree has the same hyperparameters (e.g. max_depth and min_samples_leaf), but is trained on a bootstrap sample of the training set. If max_features is set to a fraction smaller than 1.0, each split in a tree is also searched over a randomly sampled subset of the features. In the [last tutorial](https://www.kaggle.com/fengdanye/machine-learning-5-random-forests), we presented an example of training and using a random forest.

**Now, let's step out of the context of random forests. Let's imagine a more general case, where we simply have an ensemble of the same type of classifiers**. This can be an ensemble of SVMs, an ensemble of logistic regression classifiers, an ensemble of decision trees, or even an ensemble of random forest classifiers. **For the ensemble to perform well, we would want the individual classifiers to be as independent as possible**. In the previous section, this is taken care of by having different types of classifiers. In the case where the classifiers are of the same type, we can achieve this by introducing randomness for each classifier. This tutorial will talk about three ways to do that:
1. Randomly sample training instances for each classifier
2. Randomly sample features that are used to train each classifier
3. Specifically for decision trees - use a random threshold for each feature (as compared to finding the best threshold)  

We will go over them one by one.

### Random sampling of training instances
#### Bagging
Bagging refers to the method of randomly sampling training instances *with replacement*. In statistics, sampling with replacement is also called *bootstrapping*. **The term "with replacement" means that after one instance is drawn at random from the training set, it is put back before the next draw. When the next instance is selected, there is therefore a chance that it is the same as an instance selected previously**. Here is a simple example of bagging:
<img src="https://imgur.com/XA7mf26.png" width="400px"/>
As you can see, the same instance can appear multiple times in the subsample. This is the characteristic of the bagging method.
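Here is a minimal sketch of drawing one bootstrap sample with Python's standard `random` module. The ten integer indices stand in for training instances and are an assumption made for illustration. The sketch also evaluates the well-known expression $1-(1-1/m)^m$ for the expected fraction of distinct instances in a bootstrap sample of size $m$, which approaches $1-1/e\approx 0.632$ for large $m$.

```python
import random

rng = random.Random(0)  # the seed is fixed only to make the sketch repeatable

training_indices = list(range(10))  # stand-ins for 10 training instances

# Sampling WITH replacement: the same index may be drawn more than once
bootstrap = rng.choices(training_indices, k=len(training_indices))
print("bootstrap sample:", sorted(bootstrap))
print("number of distinct instances:", len(set(bootstrap)))

# Expected fraction of distinct instances in a bootstrap sample of size m
m = 1000
expected_fraction = 1.0 - (1.0 - 1.0 / m) ** m
print("expected fraction of distinct instances for m = 1000:", round(expected_fraction, 3))
```

The instances that do not make it into a given bootstrap sample (roughly 37% of the training set for large $m$) are exactly the ones that classifier never sees during training.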

##### oob score
During bagging, each subsample is used to train one classifier. For each classifier, the instances that are *not* seen during training are called **out-of-bag instances**, or **oob instances**:
<img src="https://imgur.com/ssSY5Kj.png" width="500px"/>
These oob instances can be used to evaluate the performance of the classifiers, since they serve the same function as a test set - a dataset that is not seen during training. To evaluate an oob score using the bagging method, we use Scikit-learn's **BaggingClassifier**, and set oob_score=True. From the [source code of BaggingClassifier](https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/ensemble/bagging.py#L430) (lines 583-618), we can see that the oob score is calculated as follows:
* For each trained classifier, locate its corresponding oob instances.
* Use the trained classifier to predict on its oob instances
    * If the classifier has predict_proba(), predict the probabilities of each class for each oob instance
    * If the classifier does not have predict_proba(), predict which class each oob instance belongs to
* Do this for all classifiers in the ensemble
* For each instance in the whole training set, locate the classifiers for which this instance is an oob instance. For convenience, I will call these classifiers the "oob classifiers" for this instance.
    * If the classifier has predict_proba(), predict the class of this instance to be the one that has the highest average probability across all "oob classifiers" for this instance.
    * If the classifier does not have predict_proba(), predict the class of this instance to be the one that has the majority of votes from all "oob classifiers" for this instance.
* Do this for all instances in the whole training set. For convenience, let's name the prediction on the whole training set as y_oob.
* **The final oob score is the accuracy score of the above prediction: accuracy_score(y_true, y_oob)**. Both y_true and y_oob have a size of m, where m is the total number of training instances.

**To summarize, the BaggingClassifier's oob score gives you an estimate of the accuracy of the ensemble classifier on a test set**. At the end of this section, we will show an example of how to use BaggingClassifier and obtain an oob score.
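The steps listed above can be sketched in plain Python. This is a simplified illustration, not Scikit-learn's implementation: it follows the majority-vote branch (as if the base classifier had no predict_proba()), it uses a toy one-dimensional "nearest class mean" classifier as the base estimator, and the data and all function names are hypothetical.

```python
import random
from collections import Counter


def fit_nearest_mean(xs, ys):
    """Toy 1-D base classifier: remember the mean of x for each class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means


def predict_nearest_mean(means, x):
    """Predict the class whose stored mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))


def oob_accuracy(xs, ys, n_estimators=50, seed=0):
    rng = random.Random(seed)
    m = len(xs)
    votes = [Counter() for _ in range(m)]  # oob votes collected per training instance
    for _ in range(n_estimators):
        in_bag = rng.choices(range(m), k=m)  # bootstrap sample of instance indices
        model = fit_nearest_mean([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        for i in set(range(m)) - set(in_bag):  # this estimator's oob instances
            votes[i][predict_nearest_mean(model, xs[i])] += 1
    # Score only the instances that were out-of-bag for at least one estimator
    scored = [i for i in range(m) if votes[i]]
    correct = sum(votes[i].most_common(1)[0][0] == ys[i] for i in scored)
    return correct / len(scored)


# Made-up, well separated 1-D data: class 0 near 0, class 1 near 5
xs = [0.1 * k for k in range(10)] + [5.0 + 0.1 * k for k in range(10)]
ys = [0] * 10 + [1] * 10
oob_score = oob_accuracy(xs, ys)
print("oob accuracy estimate:", oob_score)
```

Each instance is only ever predicted by estimators that did not train on it, which is why the resulting accuracy behaves like a test-set estimate.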

#### Pasting
Pasting refers to the method of randomly sampling training instances *without replacement*. This means that, in a certain subsample, the same instance can only appear at most once:
<img src="https://imgur.com/5ZMvOoL.png" width="400px"/>
Scikit-learn's BaggingClassifier can also perform pasting if the argument *bootstrap* is set to *False*.
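Here is a minimal sketch of pasting with Python's standard `random` module. The ten integer indices stand in for training instances, and the subsample size of 7 is an arbitrary choice for illustration (it would correspond to *max_samples = 0.7*).

```python
import random

rng = random.Random(0)  # the seed is fixed only to make the sketch repeatable

training_indices = list(range(10))  # stand-ins for 10 training instances

# Sampling WITHOUT replacement: every index appears at most once
pasted = rng.sample(training_indices, k=7)
print("pasting subsample:", sorted(pasted))
print("all instances distinct:", len(set(pasted)) == len(pasted))
```

Note that with pasting the subsample has to be smaller than the training set for the classifiers to differ: a sample of size $m$ drawn without replacement from $m$ instances is just the whole training set.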

#### Example
In this example, we will compare the performance of a single logistic regression classifier with that of an ensemble of logistic regression classifiers. Let's first read the data:

In [None]:
wineData = pd.read_csv('../input/winequality-red.csv')

wineData['category'] = wineData['quality'] >= 7

X = wineData[wineData.columns[0:11]].values
y = wineData['category'].values.astype(int)  # np.int was removed in NumPy 1.24; the built-in int works

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

print('X train size: ', X_train.shape)
print('y train size: ', y_train.shape)
print('X test size: ', X_test.shape)
print('y test size: ', y_test.shape)

**Let's first use a single logistic regression classifier and see how it performs in terms of precision, accuracy and AUC**. In [tutorial 3](https://www.kaggle.com/fengdanye/machine-learning-3-logistic-and-softmax-regression), we introduced logistic regression and did very similar training/predicting, but we did not do a train-test split in that tutorial. Here, we will do the same training/predicting, but with a train-test split.

Don't forget, the first step is to standardize the training data:

In [None]:
scaler = StandardScaler()
X_train_stan = scaler.fit_transform(X_train)

Then we train a logistic regression classifier on the standardized training set and evaluate its performance on the test set:

In [None]:
logReg = LogisticRegression(random_state=0, solver='lbfgs') # random_state is only set to guarantee a repeatable result for the tutorial
logReg.fit(X_train_stan, y_train)

X_test_stan = scaler.transform(X_test) # don't forget this step!
y_pred = logReg.predict(X_test_stan)

print('precision on the test set: ', precision_score(y_test, y_pred))
print('accuracy on the test set: ', accuracy_score(y_test, y_pred))

In [None]:
phat = logReg.predict_proba(X_test_stan)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, phat)

plt.subplots(figsize=(8,6))
plt.plot(fpr, tpr)
x = np.linspace(0,1,num=50)
plt.plot(x,x,color='lightgrey',linestyle='--',marker='',lw=2,label='random guess')
plt.legend(fontsize = 14)
plt.xlabel('False positive rate', fontsize = 18)
plt.ylabel('True positive rate', fontsize = 18)
plt.xlim(0,1)
plt.ylim(0,1)
plt.show()

In [None]:
print('AUC is: ', auc(fpr,tpr))

**Now, let's use an ensemble of 500 logistic regression classifiers with the bagging method**. To do this, we will use Scikit-learn's **BaggingClassifier**:

In [None]:
bagClf = BaggingClassifier(LogisticRegression(random_state=0, solver='lbfgs'), n_estimators = 500, oob_score = True, random_state = 90)
# again, random_state is only set to guarantee repeatable result for this tutorial.

Note that by default, BaggingClassifier has arguments *max_samples=1.0* and *bootstrap=True*. **This means that this ensemble classifier is drawing subsamples with replacement (i.e. bagging), and the subsample size is equal to the size of the whole training set**. We have also set the argument *oob_score* to *True* for out-of-bag accuracy estimation.

Now let's train and evaluate the ensemble classifier:

In [None]:
bagClf.fit(X_train_stan, y_train)
print(bagClf.oob_score_) # The oob score is an estimate of the accuracy of the ensemble classifier, as introduced earlier. 

In [None]:
y_pred = bagClf.predict(X_test_stan)
phat = bagClf.predict_proba(X_test_stan)[:,1]

print('precision on the test set: ', precision_score(y_test, y_pred))
print('accuracy on the test set: ', accuracy_score(y_test, y_pred))

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, phat)
plt.subplots(figsize=(8,6))
plt.plot(fpr, tpr)
x = np.linspace(0,1,num=50)
plt.plot(x,x,color='lightgrey',linestyle='--',marker='',lw=2,label='random guess')
plt.legend(fontsize = 14)
plt.xlabel('False positive rate', fontsize = 18)
plt.ylabel('True positive rate', fontsize = 18)
plt.xlim(0,1)
plt.ylim(0,1)
plt.show()

In [None]:
print('AUC is: ', auc(fpr,tpr))

The ensemble classifier shows a slight improvement over a single logistic regression classifier, but it is no match for the simple three-classifier (SVM + random-forest + logistic) ensemble we used earlier. This ensemble of 500 logistic regression classifiers also performs worse than the Random Forest we used in the [last tutorial](https://www.kaggle.com/fengdanye/machine-learning-5-random-forests), which has 500 decision trees. **One possible reason is that this ensemble of 500 logistic regression classifiers only randomly samples the training instances, but not the features, whereas the random forest randomly samples both training instances and features (with the *max_features* argument). This may lead to a lack of independence between individual classifiers in the current ensemble, and thus to the slightly worse performance**. 

In the next section, we will talk about randomly sampling the input features.

### Random sampling of features
To further increase the independence between individual classifiers, we can **train each classifier on a different random subset of features**. In the Wine Quality dataset, we have 11 input features: 

In [None]:
wineData.head(0)

Random sampling *with replacement* of the above features with a sample size of 3 will look like this:
<img src="https://imgur.com/aeRjeR6.png" width="500px"/>
Again, if sampling *without replacement*, then there won't be repeated features in the subsample. **To do random sampling of features, we can again use BaggingClassifier**. The two key arguments are *bootstrap_features* and *max_features*. The *bootstrap_features* argument decides whether to sample with replacement, and *max_features* decides the proportion of features to draw from the input features. You can set, for example, *bootstrap_features = True* and *max_features = 1.0* to draw a bootstrap sample of the features with size equal to the total number of features.

Randomly sampling features without randomly sampling the training instances is called the **Random Subspaces** method. This corresponds to *bootstrap = False, max_samples = 1.0*, and *bootstrap_features = True* and/or *max_features < 1.0*. Of course, features and training instances can be sampled at the same time. In this case, the method is called **Random Patches**. **In the Random Patches method, each classifier is trained on its own subsample of training instances and features**. The oob score is calculated as described earlier. The only additional detail is that when each classifier makes predictions on its oob samples, the classifier only uses its own subset of features.
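The following sketch shows, in plain Python, which rows and columns a single classifier would receive under the different sampling settings. The function `draw_patch` is hypothetical; its parameter names merely mirror the BaggingClassifier arguments discussed above, and the rounding of the sample sizes is an assumption of this sketch rather than a statement about Scikit-learn's internals.

```python
import random


def draw_patch(n_samples, n_features, max_samples=1.0, max_features=1.0,
               bootstrap=True, bootstrap_features=False, rng=None):
    """Return (row indices, column indices) for one classifier in the ensemble."""
    rng = rng or random.Random()

    def draw(n, fraction, with_replacement):
        k = max(1, int(fraction * n))
        if with_replacement:
            return rng.choices(range(n), k=k)  # duplicates are possible
        return rng.sample(range(n), k=k)       # all indices are distinct

    rows = draw(n_samples, max_samples, bootstrap)
    cols = draw(n_features, max_features, bootstrap_features)
    return rows, cols


rng = random.Random(0)  # the seed is fixed only to make the sketch repeatable

# Random Subspaces: all training instances, random subset of the 11 features
rows_rs, cols_rs = draw_patch(20, 11, bootstrap=False, max_samples=1.0,
                              max_features=0.5, rng=rng)

# Random Patches: bootstrap the instances AND bootstrap the features
rows_rp, cols_rp = draw_patch(20, 11, bootstrap=True, max_samples=1.0,
                              bootstrap_features=True, max_features=1.0, rng=rng)

print("Random Subspaces features:", sorted(cols_rs))
print("Random Patches features:  ", sorted(cols_rp))
```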

#### Example
Now, let's use the ensemble of 500 logistic regression classifiers again, but this time randomly **sample both training instances and features**:

In [None]:
bagClf = BaggingClassifier(LogisticRegression(random_state=0, solver='lbfgs'), n_estimators = 500, 
                           bootstrap_features = True, max_features = 1.0, oob_score = True, random_state = 90)
# Notice that bootstrap_features is set to True.

In [None]:
bagClf.fit(X_train_stan, y_train)
print(bagClf.oob_score_) # The oob score is an estimate of the accuracy of the ensemble classifier

In [None]:
y_pred = bagClf.predict(X_test_stan)
phat = bagClf.predict_proba(X_test_stan)[:,1]

print('precision on the test set: ', precision_score(y_test, y_pred))
print('accuracy on the test set: ', accuracy_score(y_test, y_pred))

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, phat)
plt.subplots(figsize=(8,6))
plt.plot(fpr, tpr)
x = np.linspace(0,1,num=50)
plt.plot(x,x,color='lightgrey',linestyle='--',marker='',lw=2,label='random guess')
plt.legend(fontsize = 14)
plt.xlabel('False positive rate', fontsize = 18)
plt.ylabel('True positive rate', fontsize = 18)
plt.xlim(0,1)
plt.ylim(0,1)
plt.show()

In [None]:
print('AUC is: ', auc(fpr,tpr))

With the use of Random Patches method, the precision of the prediction has considerably increased (0.54 -> 0.61), but the accuracy and AUC have stayed more or less the same.

### Random thresholds - Extra Trees
In the specific case of decision trees, we can further randomize individual trees by introducing random thresholds at each node. The trees are called **Extremely Randomized Trees (or Extra Trees)**. The Extra Trees classifier works similarly to a Random Forest classifier: it searches a random subset of the features for the best split at each node. **The difference is that a Random Forest searches for the best (feature, threshold) combination at each split, whereas in Extra Trees each candidate feature's threshold is drawn at random (one threshold per feature) and the best among them is selected (see [documentation](https://scikit-learn.org/stable/modules/ensemble.html#forest))**. You can also take a look at the [source code for random splitter](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_splitter.pyx#L652), which is [used by Extra Trees](https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/tree/tree.py#L1146) (totally optional, only if you are interested). The random splitter basically works by drawing one feature at a time, choosing a random threshold for this feature, and evaluating the split's performance. If the split is better than the previous best, then the split becomes the current best. The loop finishes when *max_features* features have been drawn (without replacement), or when the number of drawn features is below *max_features* but the remaining features are all constant. That's why the random splitter is said to "choose the best random split".
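The idea of "choosing the best random split" can be sketched in plain Python for a single node with binary labels. This is a simplified illustration, not Scikit-learn's splitter: it uses the Gini impurity as the split criterion, it simply skips constant features instead of continuing to draw, and the function names and toy data are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity for binary labels (0 or 1): 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)


def weighted_gini(X, y, feature, threshold):
    """Impurity after splitting on X[:, feature] <= threshold, weighted by child size."""
    left = [yi for row, yi in zip(X, y) if row[feature] <= threshold]
    right = [yi for row, yi in zip(X, y) if row[feature] > threshold]
    m = len(y)
    return len(left) / m * gini(left) + len(right) / m * gini(right)


def best_random_split(X, y, max_features, rng):
    """Draw one random threshold per candidate feature and keep the best split."""
    best = None  # (impurity, feature index, threshold)
    for feature in rng.sample(range(len(X[0])), k=max_features):
        values = [row[feature] for row in X]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # a constant feature cannot split the node
        threshold = rng.uniform(lo, hi)  # random threshold, NOT the optimal one
        score = weighted_gini(X, y, feature, threshold)
        if best is None or score < best[0]:
            best = (score, feature, threshold)
    return best


# Toy data: feature 0 is informative, feature 1 is constant
X = [[0.1, 1.0], [0.2, 1.0], [0.3, 1.0], [0.8, 1.0], [0.9, 1.0], [1.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]
best = best_random_split(X, y, max_features=2, rng=random.Random(0))
print("impurity, feature, threshold:", best)
```

A Random Forest would instead scan the candidate thresholds of each drawn feature and keep the one with the lowest impurity, which costs more computation per node.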

**Also note that, by default, Scikit-learn's RandomForestClassifier has bootstrap = True, whereas ExtraTreesClassifier has bootstrap = False. This means that ExtraTreesClassifier uses the full set of training instances with no random sampling** (both RandomForestClassifier and ExtraTreesClassifier have sample size = full training set size, so bootstrap = False means that every training instance is used exactly once).

#### Example
Now, let's try out the **ExtraTreesClassifier** on the Wine Quality dataset. Just like what we did in the [last tutorial](https://www.kaggle.com/fengdanye/machine-learning-5-random-forests), we will use **GridSearchCV** to search for the best hyperparameters. GridSearchCV is introduced in my [4th tutorial](https://www.kaggle.com/fengdanye/machine-learning-4-support-vector-machine).

In [None]:
tuned_parameters = {'n_estimators':[500],'n_jobs':[-1], 'max_features': [0.5,0.6,0.7,0.8,0.9,1.0], 
                    'max_depth': [10,11,12,13,14],'min_samples_leaf':[1,10,100],'random_state':[0]} 

clf = GridSearchCV(ExtraTreesClassifier(), tuned_parameters, cv=5, scoring='roc_auc')
clf.fit(X_train, y_train)

In [None]:
print('The best model is: ', clf.best_params_)
print('This model produces a mean cross-validated score (auc) of', clf.best_score_)

GridSearchCV has determined the best model to be the one with *max_depth = 13*, *max_features = 0.9*, and *min_samples_leaf = 1*. Now let's see how the Extra Trees classifier performs on the test set:

In [None]:
y_pred = clf.predict(X_test)
print('precision on the evaluation set: ', precision_score(y_test, y_pred))
print('accuracy on the evaluation set: ', accuracy_score(y_test, y_pred))

In [None]:
phat = clf.predict_proba(X_test)[:,1]
plt.subplots(figsize=(8,6))
fpr, tpr, thresholds = roc_curve(y_test, phat)
plt.plot(fpr, tpr)
x = np.linspace(0,1,num=50)
plt.plot(x,x,color='lightgrey',linestyle='--',marker='',lw=2,label='random guess')
plt.legend(fontsize = 14)
plt.xlabel('False positive rate', fontsize = 18)
plt.ylabel('True positive rate', fontsize = 18)
plt.xlim(0,1)
plt.ylim(0,1)
plt.show()

In [None]:
print('AUC is: ', auc(fpr,tpr))

This Extra Trees classifier has by far the best accuracy and AUC score on the test set! 

## Summary of performance
In this tutorial and the [last tutorial](https://www.kaggle.com/fengdanye/machine-learning-5-random-forests), we have trained various ensemble learning classifiers on the Wine Quality dataset. Since we have always split the train/test set with *random_state=42*, all these classifiers were trained on the same data and tested on the same data. This makes it possible for us to compare the performance of different ensemble classifiers. The *random_state* set for the classifiers also affects the performance, of course, and in practice *random_state* is usually left unset (i.e. *random_state = None*). But for the purpose of repeatable results, I have to set a specific *random_state* value in this tutorial.

Here is the summary table of the performance of different ensemble classifiers on the Red Wine Quality dataset:

| Classifier name | Precision on test set   | Accuracy on test set | AUC on test set | Comments |
|------|------|------|------|------|
|   Random Forest with 500 trees (with GridSearchCV)  | 0.58 | 0.88 | 0.91 | from the [last tutorial](https://www.kaggle.com/fengdanye/machine-learning-5-random-forests) |
|RF-SVM-logistic classifier | **0.68** | 0.88 | 0.91| simple ensemble of three different classifiers |
| 500 logistic classifiers with bagging | 0.54 | 0.87 | 0.88| bagging method - randomly sample training instances|
| 500 logistic classifiers with Random Patches | 0.61 | 0.87 | 0.88| Random Patches - randomly sample both training instances and features|
|Extra Trees with 500 trees (with GridSearchCV)| 0.62 | **0.89** | **0.93**| Extra trees use best random split|

The best score in each column is in bold. Some discussion:
* If you care the most about precision, then the RF-SVM-logistic classifier is the best. If you care the most about accuracy or AUC, then the Extra Trees classifier is the best.
* The Random Patches method improves the 500 logistic classifiers in terms of precision.
* The Extra Trees classifier performs better than the Random Forest classifier.
* Adding SVM and logistic classifiers to a Random Forest classifier improves precision considerably. 

# Ensemble Learning Regression
## Introduction
Ensemble learning regression works similarly to classification. However, in regression, it is more common to use an ensemble of the same type of regressors than to use different types of regressors. Older versions of Scikit-learn do not have an equivalent of VotingClassifier() for regression (VotingRegressor was only added in version 0.21), so with those versions you have to put in extra work to build an ensemble of different types of regressors. On the other hand, building an ensemble of the same type of regressors is quick and easy. Scikit-learn provides classes such as **BaggingRegressor, RandomForestRegressor, and ExtraTreesRegressor** for such purposes. 

In the following sections, I will show one example for each type of regressor. **This time, we will view the Red Wine Quality dataset as a regression problem. The inputs will be the 11 features, and the output will be the quality of the wine (0-10)**. And instead of evaluating precision, accuracy and AUC, we will evaluate the **r2 score** of the prediction. The r2 score is also often called the "coefficient of determination", and shows up as "R squared" in many statistical packages. You can find the definition of the r2 score on [this Wikipedia page](https://en.wikipedia.org/wiki/Coefficient_of_determination). The closer the r2 score is to 1.0, the better: a score of 1.0 means a perfect prediction, a score of 0 is what you get by always predicting the mean of the true values, and the score is negative when the prediction is worse than that.
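
As a quick illustration of the definition, r2 = 1 - SS_res / SS_tot, the sketch below computes the score by hand. The function name `r2_manual` and the toy quality values are made up for this example; in the rest of the tutorial we use Scikit-learn's `r2_score` instead.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


# Toy quality scores (made up, not taken from the dataset); their mean is 5.8
y_toy = [5.0, 6.0, 7.0, 5.0, 6.0]

print(r2_manual(y_toy, y_toy))                      # perfect prediction
print(r2_manual(y_toy, [5.8] * len(y_toy)))         # always predicting the mean
print(r2_manual(y_toy, [7.0, 5.0, 5.0, 7.0, 5.0]))  # worse than predicting the mean
```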

Let's first read the data:

In [None]:
wineData = pd.read_csv('../input/winequality-red.csv')
wineData.head()

In [None]:
X = wineData[wineData.columns[0:11]].values
y = wineData['quality'].values.astype(float)  # np.float was removed in newer NumPy versions

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=5)

print('X train size: ', X_train.shape)
print('y train size: ', y_train.shape)
print('X test size: ', X_test.shape)
print('y test size: ', y_test.shape)

## Example - BaggingRegressor
Let's start with Scikit-learn's **BaggingRegressor**. BaggingRegressor works very similarly to BaggingClassifier. By default, BaggingRegressor trains each regressor in the ensemble on a bootstrap sample of the training instances. Additionally, we will set *bootstrap_features = True* to also bootstrap the input features.
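
To make the sampling concrete, here is a small sketch of what one regressor in such an ensemble gets to see. It assumes, as I understand the behaviour, that bootstrapping draws indices with replacement for both the rows and the features; the toy training-set size of 10 and the variable names are my own choices for illustration.

```python
import random

rng = random.Random(0)
n_instances, n_features = 10, 11  # toy number of instances; 11 features as in the wine data

# Bootstrap = draw indices with replacement, as many draws as there are items
row_idx = [rng.randrange(n_instances) for _ in range(n_instances)]
col_idx = [rng.randrange(n_features) for _ in range(n_features)]

# Whatever is never drawn is invisible to this particular regressor
oob_rows = sorted(set(range(n_instances)) - set(row_idx))
unused_cols = sorted(set(range(n_features)) - set(col_idx))

print("sampled rows:    ", sorted(row_idx))
print("oob rows:        ", oob_rows)
print("sampled features:", sorted(col_idx))
print("unused features: ", unused_cols)
```

Because the draws are made with replacement, some rows and features typically appear more than once while others are left out, even though the number of draws equals the number of available rows or features.
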
### oob score
Like BaggingClassifier, BaggingRegressor can calculate an oob score. The oob score is calculated as follows (see [source code](https://github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/ensemble/bagging.py#L991)):
* For each trained regressor, locate its corresponding oob instances.
* Use the trained regressor to predict on its oob instances. If feature sampling is enabled, then only the selected features will be used to make predictions.
* Do this for all regressors in the ensemble.
* For each instance in the whole training set, locate the regressors for which this instance is an oob instance. For convenience, I will call these regressors the "oob regressors" for this instance. Predict the value for this instance as the average of the predicted values across all "oob regressors" for this instance.
* Do this for all instances in the whole training set. For convenience, let's name the prediction on the whole training set y_oob.
* **The final oob score is the r2 score of the above prediction: r2_score(y_true, y_oob)**. Both y_true and y_oob have a size of m, where m is the total number of training instances.
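
The steps above can be sketched in plain Python. This is a toy illustration under my own assumptions (synthetic one-feature data, 50 least-squares lines as the base regressors, a hypothetical helper `fit_line`), not Scikit-learn's actual implementation.

```python
import random


def fit_line(xs, ys):
    """Hypothetical helper: least-squares line, returns (intercept, slope)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(0)
m = 200  # number of training instances
x = [rng.uniform(0, 10) for _ in range(m)]
y_true = [2.0 * xi + 1.0 + rng.gauss(0, 2.0) for xi in x]  # synthetic, noisy linear data

pred_sum = [0.0] * m  # running sum of oob predictions per instance
pred_cnt = [0] * m    # number of "oob regressors" per instance

for _ in range(50):
    # Train one regressor on a bootstrap sample of the training instances
    boot = [rng.randrange(m) for _ in range(m)]
    intercept, slope = fit_line([x[i] for i in boot], [y_true[i] for i in boot])
    # Predict only on the instances this regressor did not see
    in_bag = set(boot)
    for i in range(m):
        if i not in in_bag:
            pred_sum[i] += intercept + slope * x[i]
            pred_cnt[i] += 1

# Average over the "oob regressors"; skip the (very unlikely) instances that were never oob
covered = [i for i in range(m) if pred_cnt[i] > 0]
y_oob = [pred_sum[i] / pred_cnt[i] for i in covered]
y_cov = [y_true[i] for i in covered]

# r2 score of the oob prediction
mean_y = sum(y_cov) / len(y_cov)
ss_res = sum((t - p) ** 2 for t, p in zip(y_cov, y_oob))
ss_tot = sum((t - mean_y) ** 2 for t in y_cov)
oob_score = 1.0 - ss_res / ss_tot
print("oob score (r2):", oob_score)
```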

**The BaggingRegressor's oob score gives you an estimate of the r2 score of the ensemble regressor on a test set**. 

In [None]:
# Standardize the data
scaler = StandardScaler()
X_train_stan = scaler.fit_transform(X_train)
X_test_stan = scaler.transform(X_test)

In [None]:
# Use an ensemble of 500 linear regressors with the Random Patches method.
bagReg = BaggingRegressor(LinearRegression(), n_estimators = 500,
                           bootstrap_features = True, max_features = 1.0, oob_score = True, random_state = 0)

bagReg.fit(X_train_stan, y_train)
print("oob score is: ", bagReg.oob_score_) # The oob score is an estimate of the r2 score of the ensemble regressor

In [None]:
from sklearn.metrics import r2_score
y_pred = bagReg.predict(X_test_stan)
print("The r2 score on the test set is: ", r2_score(y_test, y_pred))

## Example - RandomForestRegressor

Now let's try out **RandomForestRegressor**. Unlike RandomForestClassifier, RandomForestRegressor by default searches the full set of features at each split. But here, we will use GridSearchCV to search for an optimal combination of *max_features*, *max_depth* and *min_samples_leaf*.
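
Before running a grid search, it helps to know how much work it involves. The sketch below counts the parameter combinations and model fits for the grid used in the search that follows. It assumes GridSearchCV's default of refitting the best model once on the whole training set; the variable names are my own.

```python
from itertools import product

# Same grid as in the search that follows
tuned_parameters = {'n_estimators': [500], 'n_jobs': [-1],
                    'max_features': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
                    'max_depth': [16, 20, 24], 'min_samples_leaf': [1, 10, 100],
                    'random_state': [0]}
cv = 5  # number of cross-validation folds

# Every combination of one value per parameter
combos = list(product(*tuned_parameters.values()))
n_cv_fits = len(combos) * cv  # one fit per combination and fold
n_total_fits = n_cv_fits + 1  # plus the final refit of the best model (assumed default)

print("parameter combinations:", len(combos))   # 54
print("cross-validation fits: ", n_cv_fits)     # 270
print("trees grown in total:  ", n_total_fits * tuned_parameters['n_estimators'][0])
```

With 500 trees per forest, this adds up quickly, which is why the search can take a while to finish.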

In [None]:
tuned_parameters = {'n_estimators':[500],'n_jobs':[-1], 'max_features': [0.5,0.6,0.7,0.8,0.9,1.0], 
                    'max_depth': [16,20,24],'min_samples_leaf':[1,10,100],'random_state':[0]} 

reg = GridSearchCV(RandomForestRegressor(), tuned_parameters, cv=5, scoring='r2')
reg.fit(X_train, y_train)

In [None]:
print('The best model is: ', reg.best_params_)
print('This model produces a mean cross-validated score (r2) of', reg.best_score_)

In [None]:
y_pred = reg.predict(X_test)
print("The r2 score on the test set is: ", r2_score(y_test, y_pred))

## Example - ExtraTreesRegressor
Similar to the Random Forest case, **ExtraTreesRegressor** differs from ExtraTreesClassifier in that the regressor by default considers all features to select the best split. But again, we will use GridSearchCV to search for the best combination of *max_features*, *max_depth* and *min_samples_leaf*.

In [None]:
tuned_parameters = {'n_estimators':[500],'n_jobs':[-1], 'max_features': [0.5,0.6,0.7,0.8,0.9,1.0], 
                    'max_depth': [20,24,28],'min_samples_leaf':[1,10,100],'random_state':[0]} 

reg = GridSearchCV(ExtraTreesRegressor(), tuned_parameters, cv=5, scoring='r2')
reg.fit(X_train, y_train)

In [None]:
print('The best model is: ', reg.best_params_)
print('This model produces a mean cross-validated score (r2) of', reg.best_score_)

In [None]:
y_pred = reg.predict(X_test)
print("The r2 score on the test set is: ", r2_score(y_test, y_pred))

## Summary of performance
| Regressor name | r2 score on test set | Comments |
|------|------|------|
| 500 linear regressors with Random Patches | 0.35 | BaggingRegressor() |
| Random Forest with 500 trees (with GridSearchCV)  | 0.48| RandomForestRegressor() |
| Extra Trees with 500 trees (with GridSearchCV)| **0.51** | ExtraTreesRegressor() |

* The ensemble of linear regressors performs the worst. This is expected, since a linear model is probably too simple for this problem.
* Both Random Forest and Extra Trees perform much better than the ensemble of linear regressors, with Extra Trees slightly better than Random Forest.
* However, no regressor achieves a satisfactory r2 score. Let's plot y_test against y_pred from the Extra Trees regressor:

In [None]:
plt.plot(y_test,y_pred, linestyle='',marker='o')
plt.xlabel('true y values', fontsize = 14)
plt.ylabel('predicted y values', fontsize = 14)
plt.show()

In the next tutorial, we will talk about advanced ensemble learning techniques such as boosting and stacking. We will revisit the classification and regression tasks that we have done in this tutorial and see if the advanced techniques can push the performance forward. I hope you find this tutorial useful, and please let me know if you have any questions or comments. See you next time!

-------------------
1st version published: 12/31/2018