# Introduction

When faced with low accuracy in a classification problem, a general approach is (decreasing order of ease of application):
1. Try different classifiers 
2. Engineer existing features in a better way
3. Extract and engineer obscure features 

At stage 1, when trying different classifiers, the instance when none turn out to be better than the rest. Before moving on to Stage 2 and 3, a reasonable choice is to try ensemble techniques, i.e. keep all the learners and integrate the results from each of them. This tutorial is aimed at converting a set of weak systems (classifiers/learners) into a single strong system. It will try to answer questions like - How can you combine the predictions? What are the different techniques? When to apply the different techniques? 

## What are ensemble methods?

Ensemble is a Machine Learning concept in which the idea is to train multiple models using the same learning algorithm. The ensembles take part in a bigger group of methods, called multiclassifiers, where a set of hundreds or thousands of learners with a common objective are fused together to solve the problem. There are mainly three different ways - 
1. Bagging (also, Bootstrap Aggregating) : An ensemble technique where a set of systems that combine, use the same learning technique. The data samples are picked with replacement to generate N learners whose combined results predict the test sample. Metaphorically, this system resembles a fair democracy where the results are based on the majority voting and each votes independent of the other. 
2. Boosting: Similar to Bagging, it is an ensemble technique where a set of systems that combine use the same learning technique. Unlike bagging, the data samples are picked not at random but every new subset contains the elements that were misclassified by previous models (sequential). For predicting the results, the averaging is done based on the majority voting. The higher the learner accuracy, the higher its say in predicting an unknown sample.
3. Stacking: Stacking (or Stacked Generalization) is one of the main hybrid multiclassifiers similar to Boosting. Takes a set of different learners and combines them using new learning techniques.

The idea behind all the above methods is to combine the predictions of several base estimators in order to improve generalizability or robustness over a single estimator. And the main focus of this tutorial is to analyze the performance of bagging and boosting. We will not be covering Stacking in this one.




## When are these estimators used? 
As a general rule, it is a function of bias - variance tradeoff. 

Bagging is typically used when there is high variance in the model performance i.e. there is high training set accuracy but the validation set accuracy is poor. Assuming validation set is not ill-prepared, we try to beat the model inflexibility by resampling the data, picking non-flexible (overfitted), and averaging them together. This leads to same low bias but cancels out some of the variance. Most common bagging technique is Random forest (refer the code). 

On the other hand, boosting is employed when there is high bias in the model performance i.e. there is low training set performance. In other words, the model is too simple (i.e. biased) to catch the real relationship of the data. In such underfit cases, the model is boosted sequentially by increasing the weights of the misclassified ones emphasizing the most difficult cases. The results can be combined in various ways - weighted average of all predictions or selecting a few better performing learners. Gradient boosted trees sits at the core of the boosting algorithms. AdaBoost, LPBoost, XGBoost are some others.

We shall start by looking at pre curated dataset in scikit learn called 20 newsgroup dataset.The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training and the other one for testing. The split between the train and test set is based upon a messages posted before and after a specific date. 
First, we will try to build the dataset by picking only 5 out of the 20 topics - namely, baseball, hockey, motorcycles, politics, and med.

# Preparing train data

In [57]:
from sklearn.datasets import fetch_20newsgroups
categories = ['rec.sport.baseball', 'rec.sport.hockey','rec.motorcycles', 'sci.med','talk.politics.misc']
newsgroups_train = fetch_20newsgroups(subset='test',remove=('headers', 'footers', 'quotes'), categories=categories) 
list(newsgroups_train.target_names)

['rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.med',
 'talk.politics.misc']

Following the recommendation on the Scikit-learn page, I have removed the metadata that has little to do with topic classification. The recommendation reads as - "When evaluating text classifiers on the 20 Newsgroups data, you should strip newsgroup-related metadata. In scikit-learn, you can do this by setting remove=('headers', 'footers', 'quotes'). The F-score will be lower because it is more realistic."

In [58]:
# Sample data
newsgroups_train.data[:1]

["There is no data to show chromium is effective in promoting weight loss.  The\n few studies that have been done using chromium have been very flawed and inher\nently biased (the investigators were making money from marketing it).\n  Theoretically it really doesnt make sense either. The claim is that chromium\nwill increase muscle mass and decrease fat.  Of course, chromium is also used t\no cure diabetes, high blood pressure and increase muscle mass in athletes(just\nas well as anabolic steroids). Sounds like snake oil for the 1990's :-)\n On the other hand, it really cant hurt you anywhere but your wallet, and place\nbo effects of anything can be pretty dramatic...\n\n                                    -Paul\n     ----------------------------------------------------------\n    |  Paul Sovcik, Pharm.D. U of Illinois College of Pharmacy |\n    |                                                          |\n    |    Email- U18183@UICVM.UIC.EDU                           |\n    |         

In [61]:
#It has 1900 documents 
print (newsgroups_train.target.shape)

(1900,)


# Feature engineering

Just like we did in class, we are converting text to vectors using tfidf vectorizer. We wil not be spending much time on this as this tutorial is not focussed on putting across good feature engineering. The main focus is around teaching you the difference between the performances of bagging and boosting and when to use what?

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

(1900, 23375)

You can use sparse functions for faster computations as the extracted TF-IDF vectors are very sparse, with an average of 89 non-zero components by sample in a more than 20000-dimensional space (less than .5% non-zero features):

In [62]:
vectors.nnz / float(vectors.shape[0])

88.2578947368421

# Preparing test data

In [64]:
newsgroups_test = fetch_20newsgroups(subset='test',categories=categories)
vectors_test = vectorizer.transform(newsgroups_test.data)

## Multiclass Classification using Decision tree Classifier  

### Fitting the model

In [79]:
from sklearn import tree
from sklearn import metrics
clf = tree.DecisionTreeClassifier()
clf = clf.fit(vectors, newsgroups_train.target)

### Train accuracy - check for overfitting

In [80]:
pred_train = clf.predict(vectors)
metrics.f1_score(newsgroups_train.target, pred_train, average='macro') #average = 'macro' is for multiclass targets

0.9782430918834631

The model seems to be fitting the data really well(high variance) as the train accuracy is really high and the test accuracy is low. This typically happens when the learner tries to fit the function on almost all the data points.

You will notice that the scoring metric used is f1_score. By definition, this is the harmonic mean of precision and recall. It reaches its best value at 1 and worst score at 0. This is used here because the summing up AUC under the ROC for one class vs other classes is not very intuitive and there exists no direct implementation in python. Thus, in the multi-class and multi-label case, this accuracy score is just the weighted average of the F1 score of each class.

### Test accuracy

In [82]:
pred = clf.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average='macro')

0.7486636060460113

#### A point to be noted here is that the system is highly unstable. If you keep retraining the model and test it against the same test cases, you will be surprised to see different f1 scores. There is a need for a stable, strong, and robust system. What do we do now? After going through the paragraphs above, you might have rightly guessed! 

## Multiclass Classification using Random Forest Classifier

The sklearn.ensemble module includes an algorithm based on randomized decision trees: the RandomForest algorithm specifically designed for trees. This means keeping the underlying technique as decision tree classifier, a diverse set of classifiers is created by introducing randomness by picking samples with replacement. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.

In [100]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf = clf.fit(vectors, newsgroups_train.target)
pred_train = clf.predict(vectors)
print("Train f1 score:", metrics.f1_score(newsgroups_train.target, pred_train, average='macro')) #average = 'macro' is for multiclass targets
pred = clf.predict(vectors_test)
print("Test f1 score:", metrics.f1_score(newsgroups_test.target, pred, average='macro'))

Train f1 score: 0.9730348084068229
Test f1 score: 0.8776711399931504


### Hyperparameter tuning for model selection - RF

In [105]:
#Caution : This step might take time to finish
from sklearn.model_selection import GridSearchCV
import numpy as np
clf_rf = RandomForestClassifier()
parameters = {'n_estimators':np.array([50,60,70,90,100,110]), 'criterion':('gini','entropy')}
gridsearch = GridSearchCV(clf_rf, parameters)
gridsearch_rf =gridsearch.fit(vectors, newsgroups_train.target)
gridsearch_rf.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [109]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
clf = clf.fit(vectors, newsgroups_train.target)
pred_train = clf.predict(vectors)
print("Train f1 score:", metrics.f1_score(newsgroups_train.target, pred_train, average='macro')) #average = 'macro' is for multiclass targets
pred = clf.predict(vectors_test)
print("Test f1 score:", metrics.f1_score(newsgroups_test.target, pred, average='macro'))

Train f1 score: 0.9782430918834631
Test f1 score: 0.9725533233915294


Do you see the bump in the accuracy just by choosing the right hyperparameters? For the best accuracy, I have tuned the hyperparameters using gridsearch which effectively tries all the combinations to get to the best results. In the above case, it runs the model 6*2 times to select the best model given the input parameters.  

## Multiclass Classification using Gradient Boosted Classifier

In [127]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=150,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False) # gridsearch results of the model from below 
clf = clf.fit(vectors, newsgroups_train.target)
pred_train = clf.predict(vectors)
print("Train f1 score:", metrics.f1_score(newsgroups_train.target, pred_train, average='macro')) #average = 'macro' is for multiclass targets
pred = clf.predict(vectors_test)
print("Test f1 score:", metrics.f1_score(newsgroups_test.target, pred, average='macro'))

Train f1 score: 0.9782249071021802
Test f1 score: 0.9469486406408695


# Subsetting the training dataset

Say, I had very less number of rows in my dataset .i.e only say 100. This is done to show you how low bias systems can be boosted by using gradient boosting classifier.

In [119]:
import random
indices=[]
for x in range(100):
    indices.append(random.randint(1,1900))

## Multiclass Classification using Decision tree Classifier

In [120]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(vectors[indices,], newsgroups_train.target[indices])
pred_train = clf.predict(vectors[indices,])
print("Train f1 score:",metrics.f1_score(newsgroups_train.target[indices,], pred_train, average='macro')) #average = 'macro' is for multiclass targets
pred = clf.predict(vectors_test)
print("Test f1 score:",metrics.f1_score(newsgroups_test.target, pred, average='macro'))

Train f1 score: 0.9712363026530125
Test f1 score: 0.4457757779168527


## Multiclass Classification using Gradient Boosted Classifier

In [121]:
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=3, random_state=0)
clf = clf.fit(vectors[indices,], newsgroups_train.target[indices])
pred_train = clf.predict(vectors[indices,])
print("Train f1 score:",metrics.f1_score(newsgroups_train.target[indices,], pred_train, average='macro')) #average = 'macro' is for multiclass targets
pred = clf.predict(vectors_test)
print("Test f1 score:",metrics.f1_score(newsgroups_test.target, pred, average='macro'))

Train f1 score: 0.9712363026530125
Test f1 score: 0.5408965888836423


### Hyperparameter tuning for model selection - GB

In [123]:
#Caution : This step will take time to finish
from sklearn.model_selection import GridSearchCV
import numpy as np
clf_gb = GradientBoostingClassifier()
parameters = {'n_estimators':np.array([100,110,150]), 'max_depth':[1,3,4], 'learning_rate':[1,0.1]}
gridsearch = GridSearchCV(clf_gb, parameters)
gridsearch_gb =gridsearch.fit(vectors[indices,], newsgroups_train.target[indices])
gridsearch_gb.best_estimator_

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=150,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [125]:
clf =GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=150,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)
clf = clf.fit(vectors[indices,], newsgroups_train.target[indices])
pred_train = clf.predict(vectors[indices,])
print("Train f1 score:",metrics.f1_score(newsgroups_train.target[indices,], pred_train, average='macro')) #average = 'macro' is for multiclass targets
pred = clf.predict(vectors_test)
print("Test f1 score:",metrics.f1_score(newsgroups_test.target, pred, average='macro'))

Train f1 score: 0.9712363026530125
Test f1 score: 0.5619697200050406


## Multiclass Classification using Random Forest Classifier

In [126]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
clf = clf.fit(vectors[indices,], newsgroups_train.target[indices])
pred_train = clf.predict(vectors[indices,])
print("Train f1 score:",metrics.f1_score(newsgroups_train.target[indices,], pred_train, average='macro')) #average = 'macro' is for multiclass targets
pred = clf.predict(vectors_test)
print("Test f1 score:",metrics.f1_score(newsgroups_test.target, pred, average='macro'))

Train f1 score: 0.9712363026530125
Test f1 score: 0.4930040078802798


# Summary

In conclusion, you can apply any of the two depending on the datasets you have. Like in the example above, if you have very less data which is not representative of the complete dataset, then boosting will give better results (in theory). There is no one rule for all, both the techniques will work towards stabilizing the system but you will have to employ one or the other depending on the train and the test accuracy. Compare the test accuracies to see the difference.

PS - Not including a table to compare numbers as they might change if you re-run the notebook. 

### References

1. Scikit Learn: http://scikit-learn.org
2. Dataset : http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
3. GridSearch : http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
4. F1 Score : http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
4. Miscellaneous : https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/
