In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC 

# 6.5 Adaboost

## 6.5.1 Some notes on Adaboost

Adaboost is a very popular (and simple) example of an ensemble learning technique. 

The basic principle behind the Adaboost technique is to combine weak classifiers (of the same type) into a good overarching classifier. A typical application is to combine very small decision trees (aka decision stumps) into a larger and more complex classification system. 

A key aspect of this approach is that the models are created in a **sequential** fashion: each new classifier actively tries to improve on the areas where the current ensemble is performing badly. Due to this behaviour the technique is prone to overfitting on problematic samples, hence it is important to avoid the following: 

* **Noise in the dependent variable**: if there are errors in the Y variable, it is highly likely that these will be picked up by one or more weak estimators, so these have to be avoided at all cost. 
* **Outliers**: if the dataset you are using contains very specific outliers, these are best removed prior to using an Adaboost technique. If not, it is very likely that the Adaboost technique will construct far-fetched logic just to get the outliers correct. 
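
The sequential reweighting described above can be illustrated with a single round of the classic (discrete) Adaboost update. This is only a minimal sketch: the labels and the predictions of the weak learner are invented toy values, and the code is not meant to mirror scikit-learn's internal implementation.

```python
import numpy as np

# Toy setup (hypothetical): 10 samples with labels in {-1, +1} and a weak
# learner that misclassifies the last two samples.
y_true = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])
y_pred = y_true.copy()
y_pred[-2:] *= -1

# Every sample starts with the same weight
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the weak learner and its resulting vote (alpha)
miss = y_pred != y_true
error = np.sum(weights[miss])
alpha = 0.5 * np.log((1 - error) / error)

# Misclassified samples are up-weighted, correct ones are down-weighted
new_weights = weights * np.exp(-alpha * y_true * y_pred)
new_weights /= new_weights.sum()

print("weight of a correct sample:      ", new_weights[0])
print("weight of a misclassified sample:", new_weights[-1])
```

After the update the misclassified samples carry more weight, so the next weak learner is pushed to get them right. This is also why a mislabelled sample or an outlier can end up dominating later rounds.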

## 6.5.2 Import dataset

To illustrate Adaboost we are going to try and classify mushrooms. The dataset used for this problem was downloaded from Kaggle (https://www.kaggle.com/uciml/mushroom-classification/data).

<img src="figures/mushroom.jpg" alt="mushroom" style="width: 35%;"/>

The dependent variable is 'class', defined as: 

* **e** for edible
* **p** for poisonous

The following independent variables are available to make predictions: 

* **cap-shape**: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
* **cap-surface**: fibrous=f,grooves=g,scaly=y,smooth=s
* **cap-color**: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
* **bruises**: bruises=t,no=f
* **odor**: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
* **gill-attachment**: attached=a,descending=d,free=f,notched=n
* **gill-spacing**: close=c,crowded=w,distant=d
* **gill-size**: broad=b,narrow=n
* **gill-color**: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
* **stalk-shape**: enlarging=e,tapering=t
* **stalk-root**: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
* **stalk-surface-above-ring**: fibrous=f,scaly=y,silky=k,smooth=s
* **stalk-surface-below-ring**: fibrous=f,scaly=y,silky=k,smooth=s
* **stalk-color-above-ring**: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
* **stalk-color-below-ring**: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
* **veil-type**: partial=p,universal=u
* **veil-color**: brown=n,orange=o,white=w,yellow=y
* **ring-number**: none=n,one=o,two=t
* **ring-type**: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
* **spore-print-color**: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
* **population**: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
* **habitat**: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

In [None]:
mushrooms_df = pd.read_csv('data/mushrooms.csv')

In [None]:
# To make the problem a bit harder we are going to throw away some data (heresy!)
# and keep only the first five columns
mushrooms_df = mushrooms_df[mushrooms_df.columns[:5]].copy()

## 6.5.3 Explore the basic properties of the dataset

We will work with a reduced dataset.

In [None]:
mushrooms_df.describe()

In [None]:
# Check that the classes are not massively imbalanced
# (the column is passed as a keyword, as recent seaborn versions require)
sns.countplot(x='class', data=mushrooms_df)

In [None]:
# The character encoded labels are difficult to work with
# So we are going to convert them to integer labels
factored_mapping = {}
mushrooms_fac_df = pd.DataFrame()
for c in mushrooms_df.columns[1:]:
    labels, levels = pd.factorize(mushrooms_df[c])
    mushrooms_fac_df[c] = pd.Series(labels)
    factored_mapping[c] = levels

In [None]:
# I'm going to explicitly map the dependent variable to make sure
# that my interpretation is correct 
mush_map = {}
mush_map['e'] = 0
mush_map['p'] = 1
# .as_matrix() was removed from pandas; .values returns the same NumPy array
mushrooms_fac_df['class'] = mushrooms_df['class'].map(mush_map).values

In [None]:
mushrooms_fac_df.sample(2)

In [None]:
# Look at the correlations of the different categorical variables
sns.heatmap(mushrooms_fac_df.corr(method='kendall'))

## 6.5.4 Prepare the dataset for learning

We are going to apply one-hot encoding to this dataset to make sure that it can be used with all types of models. This will make all variables, regardless of type, binary. Note that this assumes that we are not going to use techniques where the dummy variable trap (only n-1 dummies needed!) might apply, such as simple linear regression.
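
To make the idea (and the dummy variable trap) concrete, here is a small sketch on an invented toy column. It uses `pd.get_dummies` rather than the `OneHotEncoder` used in this notebook, purely because the resulting column names are easier to read.

```python
import pandas as pd

# Hypothetical toy column with three odor categories
toy = pd.DataFrame({'odor': ['a', 'l', 'n', 'a']})

# Full one-hot encoding: one binary column per category
full = pd.get_dummies(toy, columns=['odor'])

# n-1 encoding: the first category is dropped, which avoids the dummy
# variable trap for models such as linear regression
reduced = pd.get_dummies(toy, columns=['odor'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In the full encoding every row contains exactly one active column, which is why one column is redundant for models that include an intercept.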

In [None]:
ohe = OneHotEncoder()

In [None]:
mush_X = ohe.fit_transform(mushrooms_fac_df.drop(['class'], axis=1))

In [None]:
# This is returned as a sparse matrix since it contains a lot of zeros! 
type(mush_X)

In [None]:
mush_X.shape

In [None]:
# Encode the dependent as a binary variable
# .as_matrix() was removed from pandas; .values returns the same NumPy array
mush_Y = mushrooms_fac_df['class'].values

In [None]:
mush_Y.shape

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(mush_X, mush_Y, test_size=0.15, random_state=42)

In [None]:
Y_train.shape

In [None]:
X_test.shape

## 6.5.5 Find the right parameters for the Adaboost model

Because the Adaboost model is prone to overfitting, it is of paramount importance to use a good test design when fitting this model. Ideally this is done using a K-fold cross-validation test design, as you have been shown in previous notebooks. 
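
As a brief reminder of what K-fold cross-validation does, the sketch below splits 20 invented sample indices into 5 folds. The number of folds and the seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 20 hypothetical sample indices
samples = np.arange(20)

# Each sample ends up in the validation part of exactly one fold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
folds = list(kfold.split(samples))

for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)} validation={len(val_idx)}")
```

`GridSearchCV` performs this kind of splitting internally for every parameter combination; its `cv` parameter controls the number of folds.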

In [None]:
# Prepare an empty parameter grid for the Adaboost model
p_grid = {}

### 6.5.5.1 Parameter: The base estimator

The Adaboost model uses a combination of simple estimators, so the most essential parameter to select for this model is the **type of base estimator** to be used. Ideally the complexity of these base estimators is limited. 

An important notion in this respect is that **these simple estimators have parameters of their own!** This immediately implies that these parameters must also be tuned! (parameter-ception!)
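
One way to tune the parameters of the base estimator itself is scikit-learn's double-underscore syntax for nested parameters. The sketch below runs on invented synthetic data, not on the mushroom dataset. The helper variable `est_name` is only there because the name of the parameter differs between scikit-learn versions (`base_estimator` in older releases, `estimator` in newer ones).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem, used for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Pick whichever parameter name the installed scikit-learn version supports
est_name = ('estimator' if 'estimator' in AdaBoostClassifier().get_params()
            else 'base_estimator')

model = AdaBoostClassifier(**{est_name: DecisionTreeClassifier()},
                           n_estimators=10, random_state=0)

# '<parameter>__<nested parameter>' reaches into the base estimator
nested_grid = {est_name + '__max_depth': [1, 2]}

search = GridSearchCV(model, param_grid=nested_grid, cv=3)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Compared to listing fully configured estimators in the grid, this keeps the grid compact when several parameters of the base estimator have to be varied.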

In [None]:
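# Note: scikit-learn 1.2 renamed this parameter to 'estimator' and
# 'base_estimator' was removed in 1.4; adjust the key to your version
p_grid['base_estimator'] = [DecisionTreeClassifier(max_depth=1), # This is a true 'stump'
                            DecisionTreeClassifier(max_depth=2)] # This allows for a little bit extra complexity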

### 6.5.5.2 Parameter: The number of estimators

Another key parameter for the Adaboost model is the number of estimators. This is the maximum number of 'simple' estimators that can be combined to form the ensemble model. Using a higher number will generally increase performance, but it also increases the risk of overfitting the data. 

In [None]:
p_grid['n_estimators'] = [10, 25, 50, 100]

### 6.5.5.3 Parameter: The learning rate

The learning rate specifies how much the contribution of each additional estimator is shrunk. This parameter can be important to limit the degree of overfitting of a model. 

The learning rate itself is typically chosen in the range $]0, 1]$, where a smaller learning rate means that each additional estimator has less impact. 

There is a trade-off between the value of the learning rate and the number of estimators: when the learning rate is decreased, a larger number of estimators will generally be required to get a model to an identical level of performance. 
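
The trade-off can be explored with `staged_score`, which reports the score of the ensemble after each additional estimator. The sketch below uses invented synthetic data and two arbitrary learning rates; the actual numbers will depend on the dataset and on the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used for illustration only
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy,
                                            test_size=0.25, random_state=0)

staged = {}
for lr in [1.0, 0.1]:
    model = AdaBoostClassifier(n_estimators=50, learning_rate=lr, random_state=0)
    model.fit(X_tr, y_tr)
    # Validation accuracy after 1, 2, ..., n fitted estimators
    staged[lr] = list(model.staged_score(X_val, y_val))
    print(f"learning_rate={lr}: {len(staged[lr])} stages, "
          f"final validation accuracy={staged[lr][-1]:.3f}")
```

Plotting the two lists in `staged` against the number of estimators shows how quickly each setting approaches its final performance.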

In [None]:
p_grid['learning_rate'] = [1, 0.5, 0.1, 0.01]

### 6.5.5.4 Running the grid search

In [None]:
# To ensure reproducibility we are going to use a single random seed
p_grid['random_state'] = [42]

In [None]:
abc = AdaBoostClassifier()
grid_search = GridSearchCV(abc, 
                           param_grid=p_grid, 
                           refit=True,
                           n_jobs=-1) # this makes sure your system uses all threads when fitting

In [None]:
# Training the model using cross-validation - this might take some time depending on your system
grid_search.fit(X_train, Y_train)

In [None]:
grid_search.best_estimator_

In [None]:
grid_search.score(X_test, Y_test)

In [None]:
sum(grid_search.predict(X_test) != Y_test) #This is the number of wrong predictions

In [None]:
# Let's investigate the confusion matrix for this problem
prediction = grid_search.predict(X_test)
conf_matrix = pd.DataFrame(
    confusion_matrix(Y_test, prediction), 
    columns=["Predicted False", "Predicted True"], 
    index=["Actual False", "Actual True"]
)
conf_matrix

## 6.5.6 Task 5

### Minimize the risk of eating poisonous mushrooms

Looking at the confusion matrix you can see that there are still a number of mushrooms that are incorrectly classified. This is of course quite a substantial risk. 

Is it possible to set the decision 'threshold' at such a position that it practically guarantees that all the mushrooms that are classified as edible are not poisonous? 

The following function should come in handy:
```
grid_search.predict_proba
```
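
As a starting point, the mechanics of moving the threshold can be sketched on invented probabilities. The helper functions and the numbers below are hypothetical and only illustrate the idea; they are not results from the mushroom model.

```python
import numpy as np

def predict_with_threshold(proba_poisonous, threshold):
    """Label a mushroom poisonous (1) when P(poisonous) >= threshold, else edible (0)."""
    return (np.asarray(proba_poisonous) >= threshold).astype(int)

def poisonous_called_edible(y_true, y_pred):
    """Count the dangerous mistakes: truly poisonous (1) but predicted edible (0)."""
    return int(np.sum((np.asarray(y_true) == 1) & (np.asarray(y_pred) == 0)))

# Invented example: true labels and predicted probabilities of 'poisonous'
y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.10, 0.30, 0.45, 0.40, 0.70, 0.90])

default_pred = predict_with_threshold(proba, 0.5)
cautious_pred = predict_with_threshold(proba, 0.35)

print("dangerous mistakes at 0.50:", poisonous_called_edible(y_true, default_pred))
print("dangerous mistakes at 0.35:", poisonous_called_edible(y_true, cautious_pred))
```

With the fitted model the probabilities would come from `grid_search.predict_proba(X_test)[:, 1]`, where column 1 corresponds to class 1 (poisonous). Lowering the threshold removes dangerous mistakes at the price of discarding more edible mushrooms.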

### Improve the classifier's performance

The code above only used one kind of base estimator to make the prediction (the estimator parameter), namely decision trees. However, multiple other estimators can be used and could potentially improve performance. Test and see if you can find one that further improves the performance of the model. 

The following estimators can be used: 

* BernoulliNB
* DecisionTreeClassifier
* ExtraTreeClassifier
* ExtraTreesClassifier
* MultinomialNB
* NuSVC
* Perceptron
* RandomForestClassifier
* RidgeClassifierCV
* SGDClassifier
* SVC

To find more information on these estimators you can look up their documentation at http://scikit-learn.org/

HINT. If you run into an error, try algorithm='SAMME' instead of algorithm='SAMME.R'.

HINT2. Each classifier uses different parameters; Google is your friend.