<center><h1>QBUS6850 - Machine Learning for Business</h1></center>

# Tutorial 8 - Advanced Classification Techniques II

## Random Forest Classification

In sklearn, the trees in a forest are built by bootstrapping the training set, so each tree is built from a slightly different set of data.
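To make "bootstrapping" concrete, here is a minimal sketch using plain numpy rather than sklearn's internals. The sample size and seed are arbitrary choices of mine. It draws one bootstrap sample of row indices and reports how many distinct rows end up in it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # arbitrary size, for illustration only

# One bootstrap sample: draw n_samples row indices WITH replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of the original rows that appear at least once in the sample.
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print("Unique rows in the bootstrap sample: %0.3f" % unique_fraction)
```

Roughly 63% of the rows appear in any one bootstrap sample (the limit is 1 - 1/e), which is why each tree sees a slightly different dataset.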

Sklearn also considers only a random subset of features when deciding each split. Introducing this randomness decreases the variance of the forest at the cost of a small increase in bias. The idea is to achieve a good spread of features across the trees: if every tree always split on the same strongest features, the trees would be highly correlated and averaging them would gain little. Overall this strategy usually yields a better model.

The final classification of each sample is given by averaging the probability outputs of the trees. In other words, we calculate predict_proba() for each tree, average the results, and then pick the most likely class.
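The averaging step can be checked by hand. The sketch below uses a small synthetic dataset (my assumption, not the bank data) and compares the mean of the per-tree probabilities with the forest's own predict_proba(). The names X_demo, y_demo and forest are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the per-tree class probabilities by hand.
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# Pick the most likely class under the averaged probabilities.
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

print(np.allclose(mean_proba, forest.predict_proba(X_demo)))
print((manual_pred == forest.predict(X_demo)).all())
```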

### Bank Customer Example

Let's use a random forest to classify customers in the bank customer dataset.

First load the data and build the train/test sets.

In [1]:
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import time

bank_df = pd.read_csv("bank.csv")

X = bank_df.iloc[:, 0:-1]
y = bank_df['y_yes']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Building a random forest in sklearn is just like using any other classifier: create the object and then call its fit() method.

In [2]:
clf = ensemble.RandomForestClassifier(class_weight = 'balanced')

clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

Finally, let's check the classification performance on the test set.

In [3]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.90      0.98      0.94       993
          1       0.62      0.18      0.28       138

avg / total       0.86      0.89      0.86      1131

[[978  15]
 [113  25]]


### Building an Optimal Forest

Of course we need to pick the best forest, and there are many parameters to optimise. For a full list please refer to the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

The main ones to focus on are:
- n_estimators (the number of trees)
- max_depth

Here I will set up a grid of parameters to search through.

In [4]:
param_grid = {'n_estimators': np.arange(1,200,10),
              'max_depth': np.arange(1,20,1), } 

clf_cv = GridSearchCV(ensemble.RandomForestClassifier(class_weight = 'balanced'), param_grid)

print("Running....")
tic = time.time()
clf_cv.fit(X_train, y_train)

toc = time.time()
print("Training time: {0}s".format(toc - tic))

Running....
Training time: 255.89262914657593s


Let's get the final (optimal) forest classifier.

In [5]:
clf = clf_cv.best_estimator_
print(clf)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=19, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=101, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)


And finally, check the performance on the test set.

In [6]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.90      0.98      0.94       993
          1       0.60      0.23      0.34       138

avg / total       0.87      0.89      0.87      1131

[[972  21]
 [106  32]]


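Here the misclassification rate and f1-score are slightly better than for the default random forest above (127 versus 128 test errors, and a class 1 f1-score of 0.34 versus 0.28). Note that during cross-validation each candidate model is trained on only part of the training set; by default GridSearchCV then refits the best configuration on the full training set.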

## ExtraTreesClassifier

Sklearn provides another class for classifying using forests: ExtraTreesClassifier. This class implements an "Extremely Randomised Forest". As in random forests, a random subset of candidate features is used. **Additionally**, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. This reduces the variance of the model even more, but also increases bias.

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
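ExtraTreesClassifier is used in exactly the same way as RandomForestClassifier. Below is a minimal sketch comparing the two with 5-fold cross-validation. The synthetic dataset and the choice of 100 trees are assumptions of mine, so the scores say nothing about the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

demo_scores = {}
for label, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    demo_scores[label] = scores
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
```

Which of the two does better depends on the dataset, so it is worth trying both.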

## AdaBoost

AdaBoost is available in the sklearn ensemble library. It is used in the same way as every other sklearn class. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

The core principle is to fit a sequence of weak learners via boosting. Boosting is a process of increasing the weights of samples that were misclassified, then building a new classifier. The new classifier is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence.
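To illustrate the reweighting idea, here is a hand-rolled sketch of a single boosting round using the classic discrete AdaBoost update for two classes. This is not sklearn's internal code, and the synthetic data and variable names are my own assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Start with equal weights on every sample.
weights = np.full(len(y_demo), 1.0 / len(y_demo))

# Weak learner: a decision stump fitted with the current weights.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_demo, y_demo, sample_weight=weights)
missed = stump.predict(X_demo) != y_demo

# Weighted error of the stump and its voting power.
error = weights[missed].sum()
alpha = np.log((1 - error) / error)

# Increase the weights of the misclassified samples, then renormalise.
new_weights = weights * np.exp(alpha * missed)
new_weights /= new_weights.sum()

print("Weighted error: %0.3f" % error)
print("Weight on missed samples after update: %0.3f" % new_weights[missed].sum())
```

After the update the misclassified samples carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.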

By default AdaBoostClassifier uses DecisionTreeClassifier objects as the base classifier, however you can use a different classifier if you prefer. Check the docs for compatible classes.
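As a sketch of swapping in a different base classifier, the example below boosts depth-2 trees instead of the default. The base classifier is passed as the first positional argument because, as far as I know, the keyword name for it differs between sklearn releases. The synthetic data and the settings are my own assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# First positional argument: the base classifier to boost.
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                             n_estimators=25, random_state=0)
boosted.fit(X_demo, y_demo)

print("Number of fitted base classifiers:", len(boosted.estimators_))
print("Training accuracy: %0.2f" % boosted.score(X_demo, y_demo))
```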

Some parameters to tune are:
- n_estimators
- learning_rate

Below is an example of how to build an AdaBoost classifier. By default n_estimators = 50.

In [7]:
clf = ensemble.AdaBoostClassifier()

clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

You should notice that the test accuracy is even better than our single tree and even the forest!

In [8]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.92      0.97      0.95       993
          1       0.67      0.38      0.48       138

avg / total       0.89      0.90      0.89      1131

[[967  26]
 [ 86  52]]


Note that there are two available algorithms in AdaBoostClassifier: "SAMME" and "SAMME.R". "SAMME.R" is the default algorithm in the sklearn version used here and always returns estimator weights of 1. "SAMME" will output unequal estimator weights (voting powers) for different estimators. Refer to the sklearn documentation and the following post for more information:

https://stackoverflow.com/questions/31981453/why-estimator-weight-in-samme-r-adaboost-algorithm-is-set-to-1

## Combining Disparate Classifiers

We can go even further with ensemble classification: we can combine disparate classifiers and pool their predictions for a more accurate classification. For example, we could combine multiple random forests, or multiple boosted forests!

This is implemented in sklearn with the VotingClassifier class http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html.

VotingClassifier provides two main parameters to specify:
- voting scheme, how predictions from ensemble are combined
- weights, we can weight each classifier's vote

The voting scheme is either "hard" or "soft". The hard scheme means majority voting, while soft means we sum up the class probabilities from each classifier and then make a decision.
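The two schemes can disagree. The numpy sketch below uses made-up probabilities for a single sample from three hypothetical classifiers: two lean weakly towards class 1, while the third is very confident in class 0.

```python
import numpy as np

# Hypothetical class probabilities for ONE sample
# (rows = classifiers, columns = classes 0 and 1).
probas = np.array([[0.45, 0.55],
                   [0.45, 0.55],
                   [0.90, 0.10]])

# Hard voting: each classifier votes for its most likely class.
votes = probas.argmax(axis=1)
hard_choice = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: sum the class probabilities, then pick the largest.
soft_choice = probas.sum(axis=0).argmax()

print("hard:", hard_choice, "soft:", soft_choice)
```

Hard voting picks class 1 (two votes to one), while soft voting picks class 0 because the summed probabilities are 1.8 versus 1.2.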

### Example

Below I will use the iris dataset to demonstrate how using multiple classifiers together can slightly improve classification accuracy.

In [9]:
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'Naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.90 (+/- 0.05) [Logistic Regression]
Accuracy: 0.93 (+/- 0.05) [Random Forest]
Accuracy: 0.91 (+/- 0.04) [Naive Bayes]
Accuracy: 0.95 (+/- 0.05) [Ensemble]


## Multi-class Classification (and dealing with text, optional)

In this section I want to introduce a more complex case. First the data contains more than two classes. Second the dataset contains only text data.

    The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g misc.forsale / soc.religion.christian). Here is a list of the 20 newsgroups, partitioned (more or less) according to subject matter:
*source: http://qwone.com/~jason/20Newsgroups/, https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups*

**The goal is:** given a text document, correctly assign it to the newsgroup from which it came.

We can directly use decision trees, forests etc for multi-class classification without any modification.

The difficulty is transforming the text data into a numeric representation. Recall that a decision tree operates on features and threshold values for those features. Raw text does not satisfy this requirement.

### Bag of Words (Vectorising Text)

A simple method of dealing with text is to treat each word in a corpus as a feature. For each document we count the number or frequency of each word.

Below is a simple example. Note that X is a sparse matrix, so only the non-zero entries are printed as (row, column) pairs, and they are not necessarily listed in column order.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()

corpus = ['This is the first document.',
          'This is the second second document.']

X = count_vectorizer.fit_transform(corpus)

In [11]:
print(count_vectorizer.get_feature_names())

['document', 'first', 'is', 'second', 'the', 'this']


In [12]:
print(X)

  (0, 0)	1
  (0, 1)	1
  (0, 4)	1
  (0, 2)	1
  (0, 5)	1
  (1, 3)	2
  (1, 0)	1
  (1, 4)	1
  (1, 2)	1
  (1, 5)	1


In large datasets there will be lots of repeated words such as "a", "is" and "the" that don't carry much useful information. These terms should be ignored or given very small weights.

We can use the tf–idf transform to boost the weights of uncommon words (which are likely domain specific) and shrink common words.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

tfidf = tfidf_vectorizer.fit_transform(corpus)

print(tfidf_vectorizer.get_feature_names())

print(tfidf)

['document', 'first', 'is', 'second', 'the', 'this']
  (0, 5)	0.409090103683
  (0, 2)	0.409090103683
  (0, 4)	0.409090103683
  (0, 1)	0.574961866799
  (0, 0)	0.409090103683
  (1, 5)	0.289869335769
  (1, 2)	0.289869335769
  (1, 4)	0.289869335769
  (1, 0)	0.289869335769
  (1, 3)	0.814802474667
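The weights printed above can be reproduced by hand. The sketch below assumes sklearn's default settings for TfidfVectorizer as I understand them: a smoothed idf of ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit length. The tokenising regex and the variable names are my own.

```python
import re
import numpy as np

corpus_demo = ['This is the first document.',
               'This is the second second document.']

# Tokenise: lowercase words of two or more characters.
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus_demo]
vocab = sorted({word for doc in docs for word in doc})

# Raw term counts (the bag-of-words matrix).
counts = np.array([[doc.count(word) for word in vocab] for doc in docs], dtype=float)

# Smoothed inverse document frequency.
n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf and scale each row to unit length.
manual_tfidf = counts * idf
manual_tfidf /= np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

print(vocab)
print(manual_tfidf.round(4))
```

Words that occur in both documents get an idf of 1, while "first" and "second" (one document each) get a larger idf, which is why they end up with the largest weights.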


### N-grams

Unfortunately this simple method has two major drawbacks:
- it cannot handle phrases (multi-word expressions), because it discards word order and dependency information
- it cannot handle typos

So it is suggested that you use n-grams: word n-grams capture short phrases, while character n-grams are more robust to typos. Actually, we have already been using unigrams (n = 1) so far.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))

tfidf = tfidf_vectorizer.fit_transform(corpus)

print(tfidf_vectorizer.get_feature_names())

print(tfidf)

['document', 'first', 'first document', 'is', 'is the', 'second', 'second document', 'second second', 'the', 'the first', 'the second', 'this', 'this is']
  (0, 11)	0.289569396523
  (0, 3)	0.289569396523
  (0, 8)	0.289569396523
  (0, 1)	0.406979683189
  (0, 0)	0.289569396523
  (0, 12)	0.289569396523
  (0, 4)	0.289569396523
  (0, 9)	0.406979683189
  (0, 2)	0.406979683189
  (1, 11)	0.224578375085
  (1, 3)	0.224578375085
  (1, 8)	0.224578375085
  (1, 0)	0.224578375085
  (1, 12)	0.224578375085
  (1, 4)	0.224578375085
  (1, 5)	0.631274140434
  (1, 10)	0.315637070217
  (1, 7)	0.315637070217
  (1, 6)	0.315637070217
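Word n-grams deal with phrases, but not with typos. A minimal sketch of the character n-gram alternative is given below, using CountVectorizer with analyzer='char_wb'. The two strings and the trigram setting are my own choices for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

pair = ['document', 'documnet']  # the second string contains a typo

# Word-level features versus character trigrams within word boundaries.
word_counts = CountVectorizer(analyzer='word').fit_transform(pair)
char_counts = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3)).fit_transform(pair)

# Overlap between the two strings = dot product of their count rows.
shared_words = word_counts[0].multiply(word_counts[1]).sum()
shared_trigrams = char_counts[0].multiply(char_counts[1]).sum()

print("Shared word features:", shared_words)
print("Shared character trigrams:", shared_trigrams)
```

At the word level the two strings have nothing in common, whereas they still share several character trigrams, so a classifier can treat them as similar.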


### Usenet Newsgroups

Now we are armed with the right tools to tackle the problem at hand of classifying text documents.

Let's transform the text documents to an n-gram representation and build a random forest to classify new documents.

For the sake of running time I am going to manually pick the number of trees as 100 in my forest. Normally you should use CV to pick the best number of trees and the tree depth.
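As a rough sketch of what that CV could look like for text, the vectoriser and the forest can be wrapped in a Pipeline and searched with GridSearchCV. The tiny made-up corpus, the grid values and all variable names below are assumptions of mine, chosen so the example runs in seconds.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up corpus with two topics (0 = hardware, 1 = sport).
toy_docs = ['the disk drive failed', 'new graphics card installed',
            'cpu and memory upgrade', 'the motherboard has a fault',
            'keyboard and mouse drivers', 'monitor and graphics drivers',
            'the team won the match', 'a late goal decided the game',
            'the season starts next week', 'players and coach celebrate',
            'the match ended in a draw', 'a great game for the team']
toy_labels = [0] * 6 + [1] * 6

# The vectoriser is refitted inside every CV fold, so no text leaks across folds.
text_forest = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('rf', RandomForestClassifier(random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
toy_grid = {'rf__n_estimators': [10, 50],
            'rf__max_depth': [2, None]}

toy_search = GridSearchCV(text_forest, toy_grid, cv=3)
toy_search.fit(toy_docs, toy_labels)
print(toy_search.best_params_)
```

The cells below skip this search and fix the number of trees at 100.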

In [15]:
from sklearn.datasets import fetch_20newsgroups

categories = None
remove = ('headers', 'footers', 'quotes')

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)


X_train = data_train.data
y_train = data_train.target

X_test = data_test.data
y_test = data_test.target

# tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')

print("Fitting and Transforming")
usenet_tfidf = tfidf_vectorizer.fit_transform(X_train)
print("Done")

Fitting and Transforming
Done


In [16]:
usenet_clf = ensemble.RandomForestClassifier(n_estimators=100)

print("Training....")
usenet_clf.fit(usenet_tfidf, y_train)
print("Training completed.")

Training....
Training completed.


In [17]:
y_pred = usenet_clf.predict(tfidf_vectorizer.transform(X_test))

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.44      0.39      0.41       319
          1       0.58      0.60      0.59       389
          2       0.53      0.65      0.59       394
          3       0.61      0.56      0.58       392
          4       0.64      0.65      0.64       385
          5       0.63      0.68      0.65       395
          6       0.72      0.74      0.73       390
          7       0.41      0.71      0.52       396
          8       0.67      0.68      0.67       398
          9       0.71      0.78      0.74       397
         10       0.82      0.82      0.82       399
         11       0.78      0.67      0.72       396
         12       0.53      0.42      0.47       393
         13       0.78      0.64      0.70       396
         14       0.70      0.67      0.68       394
         15       0.60      0.77      0.67       398
         16       0.51      0.58      0.54       364
         17       0.80      0.70      0.74   