
## Chapter 7 - Ensemble Learning and Random Forests
---

* **1. If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?**
> Yes, combining the five models can give better results, provided they are markedly different from one another. If the models are sufficiently independent, they tend to make different kinds of errors, so a voting ensemble of the five is likely to be less error-prone than any single model. The models can be made more independent by using different algorithms or by training each one on a different subset of the training data, which would likely improve the ensemble further.
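>
> As a rough illustration of why independence matters, the sketch below computes how often a majority vote is right under idealised assumptions that are not in the question: a binary task, models whose errors are fully independent, and 95% *accuracy* rather than precision. The helper name `majority_vote_accuracy` is made up for this example. Real models trained on the same data are correlated, so the actual gain will be smaller.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct,
    when each model is correct with probability p (binomial tail sum)."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


ensemble_accuracy = majority_vote_accuracy(5, 0.95)
print(f"{ensemble_accuracy:.4f}")  # 0.9988
```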

* **2. What is the difference between hard and soft voting classifiers?**
> A *hard voting classifier* predicts the class which is the most common prediction among all models in the ensemble (if two models vote '1' and one votes '0', the hard voting classifier predicts '1'). On the other hand, if the models within the ensemble support prediction probabilities (`predict_proba()`), *soft voting classifiers* can be used, which average the probability score for each class across all the models and predict the class with the highest overall probability. This method can often be preferable and give better results as it assigns higher weighting to more confident predictions.
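>
> The two voting schemes can disagree. The sketch below uses made-up `predict_proba()` outputs for a single instance (the numbers are purely illustrative) where two models weakly favour class 0 and one strongly favours class 1.

```python
import numpy as np

# Hypothetical predict_proba() outputs of three classifiers for one instance
# (columns: class 0, class 1).
probas = np.array([
    [0.55, 0.45],  # model A: weakly favours class 0
    [0.60, 0.40],  # model B: weakly favours class 0
    [0.05, 0.95],  # model C: strongly favours class 1
])

# Hard voting: each model casts one vote for its most probable class.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the class probabilities, then pick the largest.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("Hard vote:", hard_prediction)  # 0 (two votes against one)
print("Soft vote:", soft_prediction)  # 1 (mean probabilities are 0.4 vs 0.6)
```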

* **3. Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests, or stacking ensembles?**
> *Bagging and pasting* ensembles (which sample subsets of the training data with and without replacement respectively) can be sped up in this way: each model is trained separately on its own subset of the training data, which lends itself well to being distributed across multiple CPUs or servers. *Random forests* are similarly straightforward to distribute, as they consist of decision trees that can all be trained independently. *Boosting ensembles*, however, cannot be sped up like this: boosting is inherently sequential, since each subsequent model is trained on the errors of its predecessor (for example, with the weights of misclassified training instances increased). With *stacking ensembles*, all models in a given layer are independent of one another and can be trained on different servers, but the layers themselves are sequential, so every model in one layer must finish training before the next layer can be trained.
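>
> A minimal single-machine sketch of this idea, assuming scikit-learn's `BaggingClassifier` on a small synthetic dataset (the dataset and hyperparameter values are arbitrary choices for illustration). Note that `n_jobs` only spreads the work across the CPU cores of one machine; distributing across servers would need additional tooling that is not shown here.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, used only for illustration.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Bagging: each tree sees a sample of 100 instances drawn with replacement.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20,
                            max_samples=100, bootstrap=True,
                            n_jobs=-1, random_state=42)

# Pasting: same idea, but the 100 instances are drawn without replacement.
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20,
                            max_samples=100, bootstrap=False,
                            n_jobs=-1, random_state=42)

for clf in (bagging, pasting):
    clf.fit(X, y)  # the 20 trees are independent, so they can be fitted in parallel
```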

* **4. What is the benefit of out-of-bag evaluation?**
> Out-of-bag evaluation tests each model in a bagging ensemble on the training instances that were not part of the bootstrap sample that model was trained on. This gives a good indication of how well the ensemble generalises to unseen data without having to set aside part of the training set for validation, so more data is available for training, which can improve performance.
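>
> In scikit-learn this is available through the `oob_score=True` option of `BaggingClassifier`. The sketch below uses a small synthetic dataset chosen only for illustration, not the MNIST data used later in these notes.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration; no separate validation set is held out.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True,   # OOB evaluation needs bootstrap sampling
                            oob_score=True,   # score each instance using only the trees that never saw it
                            random_state=42)
bag_clf.fit(X, y)

print("Out-of-bag accuracy estimate:", bag_clf.oob_score_)
```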

* **5. What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?**
> In a *Random Forest*, each decision tree considers only a random subset of the features when splitting a node, and searches for the best threshold among them. Trees can be made even more random by also using random thresholds for each feature instead of searching for the best possible split. A forest of such trees is known as *Extremely Randomized Trees*, or '*Extra-Trees*'. This considerably speeds up training, as searching for the best splitting threshold at each node is one of the most time-consuming parts of growing a tree; at prediction time, Extra-Trees are neither faster nor slower than a Random Forest. The extra randomness acts as a form of regularization, trading a little more bias for lower variance, so if a regular Random Forest is overfitting, *Extra-Trees* may perform better.
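>
> The difference between the two splitting strategies can be sketched for a single feature. This is a simplified illustration rather than scikit-learn's implementation, and the helper names (`gini`, `weighted_gini`, `best_threshold`, `random_threshold`) are invented for the example: the exhaustive search evaluates every candidate threshold, while the random strategy just draws one value between the feature's minimum and maximum.

```python
import numpy as np


def gini(labels):
    """Gini impurity of an array of non-negative integer class labels."""
    if labels.size == 0:
        return 0.0
    p = np.bincount(labels) / labels.size
    return 1.0 - np.sum(p ** 2)


def weighted_gini(x, y, threshold):
    """Impurity of the two child nodes, weighted by their sizes."""
    left, right = y[x <= threshold], y[x > threshold]
    return (left.size * gini(left) + right.size * gini(right)) / y.size


def best_threshold(x, y):
    """Exhaustive search over midpoints between sorted unique values."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    scores = [weighted_gini(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]


def random_threshold(x, rng):
    """Extra-Trees style: draw a threshold uniformly, with no search."""
    return rng.uniform(x.min(), x.max())


rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = (x > 0.3).astype(int)  # labels are perfectly separable at x = 0.3

t_best = best_threshold(x, y)
t_rand = random_threshold(x, rng)
print("searched split impurity:", weighted_gini(x, y, t_best))
print("random split impurity:  ", weighted_gini(x, y, t_rand))
```

> The random split is cheaper to obtain but usually less pure; an Extra-Trees ensemble compensates by averaging over many such trees.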

* **6. If your AdaBoost ensemble underfits the training data, what hyperparameters should you tweak and how?**
> One option is to increase `n_estimators`, which enlarges the ensemble and extends the iterative boosting process. Increasing the learning rate may also help, as the weights of misclassified instances are then boosted more strongly, so they are more readily accounted for in subsequent training rounds. Alternatively, reducing the regularization of the base estimator itself may resolve the underfitting.
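>
> A minimal sketch of these adjustments with scikit-learn's `AdaBoostClassifier`, assuming a small synthetic dataset; the specific hyperparameter values are arbitrary and chosen only to make the contrast visible.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Heavily constrained ensemble: few decision stumps and a small learning rate.
underfit = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=5, learning_rate=0.1,
                              random_state=42)

# More estimators, a higher learning rate and a less regularized base estimator.
adjusted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                              n_estimators=200, learning_rate=0.5,
                              random_state=42)

for name, clf in [("underfit", underfit), ("adjusted", adjusted)]:
    clf.fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))
```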

* **7. If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?**
> *Gradient Boosting* differs from *Adaptive Boosting* in that instead of reassigning higher weights to misclassified instances and retraining on the entire training set, the boosting process involves fitting new models to the residual errors from the previous step's model. In this scenario, the regularization technique of *shrinkage* is likely the best course of action, decreasing the learning rate but increasing the number of estimators in order to give predictions which generalise better (i.e. a model which doesn't overfit). However, it is possible to overfit by having too many estimators so early stopping would be a good strategy here in order to find the sweet spot.
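>
> One way to look for that sweet spot is sketched below, assuming scikit-learn's `GradientBoostingRegressor` and a synthetic noisy quadratic dataset (both are illustrative choices). The validation error is measured after each boosting stage with `staged_predict()`, and the model is then retrained with the number of trees that minimised it.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic noisy quadratic data, used only for illustration.
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(300, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

# Deliberately train more trees than are likely to be needed.
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=300,
                                 learning_rate=0.1, random_state=42)
gbrt.fit(X_tr, y_tr)

# Validation error after each boosting stage (1 tree, 2 trees, ...).
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
best_n_estimators = int(np.argmin(errors)) + 1

# Retrain with the number of trees that minimised the validation error.
gbrt_best = GradientBoostingRegressor(max_depth=2,
                                      n_estimators=best_n_estimators,
                                      learning_rate=0.1, random_state=42)
gbrt_best.fit(X_tr, y_tr)
print("Best number of trees:", best_n_estimators)
```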

* **8. Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use the first 40,000 instances for training, the next 10,000 for validation, and the last 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM. Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?**

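> Again, the first step is loading the MNIST data with the code used in previous exercises. Note that the code below shuffles the first 60,000 images and uses 50,000 of them for training and 10,000 for validation, rather than the 40,000/10,000 split suggested in the question; the last 10,000 images form the test set.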

In [10]:
from sklearn.datasets import fetch_openml
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

np.random.seed(42)                                                               # makes notebook output reproducible across runs

def sort_by_target(mnist):                                                       # sorts the train and test portions by label so the dataset matches the one used in the book
    reorder_train = np.array(sorted([(target, i) for i, target in
                                     enumerate(mnist.target[:60000])]))[:, 1]    # indices of the first 60,000 instances, ordered by label
    reorder_test = np.array(sorted([(target, i) for i, target in
                                    enumerate(mnist.target[60000:])]))[:, 1]     # same for the last 10,000 instances
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]
    
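mnist = fetch_openml('mnist_784', version=1, cache=True, as_frame=False)         # as_frame=False returns NumPy arrays; newer scikit-learn versions return a DataFrame by default, which breaks the slicing above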
mnist.target = mnist.target.astype(np.int8)                                      # fetch_openml() returns targets as strings
sort_by_target(mnist)                                                            # fetch_openml() returns an unsorted dataset

X, y = mnist["data"], mnist["target"]
print('Shape of MNIST data: ', X.shape)
print('Shape of MNIST target data: ', y.shape)

X_train_and_valid, y_train_and_valid = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]
shuffle_index = np.random.permutation(60000)
X_train_and_valid = X_train_and_valid[shuffle_index]
y_train_and_valid = y_train_and_valid[shuffle_index]
X_train = X_train_and_valid[:50000]
y_train = y_train_and_valid[:50000]
X_valid = X_train_and_valid[50000:60000]
y_valid = y_train_and_valid[50000:60000]

Shape of MNIST data:  (70000, 784)
Shape of MNIST target data:  (70000,)


> The next step is training the classifiers: a Random Forest, an Extra-Trees classifier and a Logistic Regression classifier, each evaluated on the validation set. Logistic Regression was chosen in place of the suggested SVM because SVMs scale poorly to large datasets (`LinearSVC`, for example, took 3-4 times as long to train).

In [11]:
rfc = RandomForestClassifier(n_estimators=10, n_jobs=-1)
etc = ExtraTreesClassifier(n_estimators=10, n_jobs=-1)
lrc = LogisticRegression(solver='lbfgs', multi_class='auto', n_jobs=-1,
                        max_iter=20)
models = [rfc, etc, lrc]
model_names = ['Random Forest', 'Extra Trees', 'Logistic Regression']

for i, model in enumerate(models):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    print(model_names[i], 'accuracy on validation set:', 
          accuracy_score(y_valid, y_pred))

Random Forest accuracy on validation set: 0.9436
Extra Trees accuracy on validation set: 0.9456
Logistic Regression accuracy on validation set: 0.9073


> After evaluating those models on the validation set, we combine the three of them into an ensemble using a hard voting classifier. Soft voting relies on each model's probability estimates, which may be unreliable when the models are not well optimised; since no hyperparameters have been tuned here, hard voting was chosen as the more conservative option.

In [12]:
ensemble_clf = VotingClassifier([('rfc', rfc), ('etc', etc), ('lrc', lrc)],
                               voting='hard', n_jobs=-1)
ensemble_clf.fit(X_train, y_train)
y_pred = ensemble_clf.predict(X_valid)
print('Hard voting ensemble classifier accuracy on validation set:', 
     accuracy_score(y_valid, y_pred))

Hard voting ensemble classifier accuracy on validation set: 0.9504


> Finally, we use the models to make predictions on the test set. From the accuracy scores, we can see that the ensemble outperforms all three individual models on the test set, as well as on the validation set, as expected.

In [13]:
for i, model in enumerate(models):
    y_pred = model.predict(X_test)
    print(model_names[i], 'accuracy on test set:', 
          accuracy_score(y_test, y_pred))
    
y_pred = ensemble_clf.predict(X_test)
print('Hard voting ensemble classifier accuracy on test set:', 
     accuracy_score(y_test, y_pred))

Random Forest accuracy on test set: 0.9445
Extra Trees accuracy on test set: 0.947
Logistic Regression accuracy on test set: 0.9139
Hard voting ensemble classifier accuracy on test set: 0.9528


* **9. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble! Now let’s evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier?**

> The first step is creating an array of the individual model predictions, where each instance is a three-element vector containing the predictions for that image from all three models. The target is still the image's class, i.e. `y_valid`.

In [14]:
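valid_preds = [model.predict(X_valid) for model in models]   # one prediction vector per base model

# Note: from here on X_train and y_train hold the blender's training set,
# replacing the original MNIST training data.
X_train = np.column_stack(valid_preds)                        # one row per image, one column per model
y_train = y_valid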
print("Training set shape:", X_train.shape)

Training set shape: (10000, 3)


> We train a Random Forest classifier on these validation-set predictions to act as the blender. After transforming the individual model predictions on the test set into the same format, we can evaluate the blender on the test set. Its accuracy (0.9518) is higher than that of each individual model but slightly lower than that of the hard voting classifier from the previous exercise (0.9528).

In [15]:
blender = RandomForestClassifier(n_estimators=200, random_state=42)
blender.fit(X_train, y_train)

test_preds = [model.predict(X_test) for model in models]     # one prediction vector per base model

X_test_blend = np.column_stack(test_preds)                    # kept separate so X_test is not overwritten and the cell can be re-run

y_pred = blender.predict(X_test_blend)
print("Blender accuracy score:", accuracy_score(y_test, y_pred))

Blender accuracy score: 0.9518
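>
> For comparison, scikit-learn (0.22 and later) also ships a `StackingClassifier` that automates this workflow. The sketch below is not the approach used above: it trains the blender on cross-validated predictions instead of a hold-out set, feeds it class probabilities where available rather than hard predictions, and uses the small `load_digits` dataset as a quick stand-in for MNIST. The hyperparameter values are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 8x8 digit images: a small stand-in for MNIST so the example runs quickly.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rfc", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("etc", ExtraTreesClassifier(n_estimators=50, random_state=42)),
        ("lrc", LogisticRegression(max_iter=2000)),
    ],
    # The blender, trained on out-of-fold predictions of the base models.
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_accuracy = stack.score(X_te, y_te)
print("Stacking accuracy on held-out digits:", stack_accuracy)
```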
