# **Let's classify mushrooms!**

![wow](http://2jl7cd1oy3fddj8w23cb3brh-wpengine.netdna-ssl.com/wp-content/uploads/2017/02/poisonousmushrooms.inpost.jpg)


Have you ever encountered a wild mushroom and wondered if it was safe for consumption? How do you even determine if it's edible? Do you look at its size? Or colour? Or smell?


In this kernel, we try to predict if a mushroom is poisonous or edible using 4 models: **Logistic Regression**, **Adaboosted Decision Trees**, **Random Forest** and **Support Vector Machine**.

## **Import libraries and dataset**

In [None]:
# =============================================================================
# Import libraries and dataset
# =============================================================================
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

data = pd.read_csv("../input/mushrooms.csv")

## **Data background and exploration**

Let's first try to explore and get an idea of what the data is like. The following code shows that there are  **8124 observations** and **23 variables** in the dataset: 

In [None]:
np.shape(data)

It seems that all of the variables are **categorical** in nature:

In [None]:
data.describe()

Taking a look at the first 5 observations of the dataset:

In [None]:
data.head()

Our objective is to predict the variable `class` using the other 22 variables, where "p" stands for poisonous and "e" stands for edible. Here is what the letters stand for, if you are interested:
1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
4. bruises: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

We remove NA values and recode the response variable: e to edible, p to poisonous 
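
One caveat for this step: `dropna()` only removes true `NaN` entries. According to the attribute list above, missing `stalk-root` values are coded as `?`, so they would probably survive it. Here is a minimal sketch of how one might count such placeholders. The `toy` frame and its values are made up for illustration and are not the real data:

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration; not the real mushroom data
toy = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "stalk-root": ["b", "?", "c", "?"],
    "odor": ["a", "f", "n", "f"],
})

# dropna() leaves the frame unchanged because "?" is an ordinary string
n_after_dropna = len(toy.dropna())

# Count "?" placeholders per column
question_marks = (toy == "?").sum()

# Optionally turn the placeholders into real missing values
toy_nan = toy.replace("?", np.nan)
n_missing = int(toy_nan["stalk-root"].isna().sum())

print(n_after_dropna)
print(question_marks)
print(n_missing)
```

Whether to drop such rows, or to keep `?` as its own level as this kernel effectively does, is a modelling choice.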

In [None]:
# Remove NA values
data = data.dropna()

# Recode response variable
data.loc[data.iloc[:,0]=='e','class'] = 'edible'
data.loc[data.iloc[:,0]=='p','class'] = 'poisonous'

We can draw barplots of each variable's level proportions, grouped by class (poisonous or edible), to get an idea of which variables are most informative:


In [None]:
sns.set(style="darkgrid")
names = data.columns

# Plot each variable's proportion by level, according to their class (poisonous or edible)
for k in range(4):
    fig, axe = plt.subplots(2, 3, figsize=(20, 25))
    for i in range(1+k*(6),7+k*(6)): 
        if i == 23:
            break
        prop_df = (data.iloc[:,i].groupby(data.iloc[:,0]).value_counts(normalize=True).rename('proportion').reset_index())
        if i-k*(6)<4:
            sns.barplot(hue=prop_df.iloc[:,1], x=prop_df.iloc[:,0], y=prop_df.iloc[:,2], data=prop_df, ax=axe[0][i-k*(6)-1]).set_title(names[i])
        else:
            sns.barplot(hue=prop_df.iloc[:,1], x=prop_df.iloc[:,0], y=prop_df.iloc[:,2], data=prop_df, ax=axe[1][i-k*(6)-3-1]).set_title(names[i])

From the barplots, we can make some interesting observations:
* Edible mushrooms tend to have bruises, have either no smell or smell like almond or anise, have a broad gill size and have a pendant ring-type
* Poisonous mushrooms tend to have white or chocolate spore prints
* The veil-type of a mushroom is probably not useful in determining if it is poisonous or edible

We can make such observations from barplots by quickly glancing through and looking out for levels of a variable that are highly present in one class and absent in the other class.
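
The same "present in one class, absent in the other" check can also be done numerically. This is a rough sketch using `pandas.crosstab` on a toy frame. The data and the `gap` helper are made up for illustration:

```python
import pandas as pd

# Toy data, made up for illustration; column names mirror the dataset
toy = pd.DataFrame({
    "class": ["edible", "edible", "edible", "poisonous", "poisonous", "poisonous"],
    "odor": ["n", "a", "n", "f", "f", "n"],
})

# Rows: levels of the variable; columns: class; each column sums to 1
props = pd.crosstab(toy["odor"], toy["class"], normalize="columns")

# Absolute gap between the two classes, largest first
gap = (props["edible"] - props["poisonous"]).abs().sort_values(ascending=False)

print(props)
print(gap)
```

Levels with a large gap are the ones that stand out in the barplots.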

## **Data Pre-processing**

We separate the predictor variables and the response variable into `x` and `y`, then split them into a 70% training set and a 30% test set. Wherever possible, we set `random_state = 10` so that the results are reproducible. We convert the categorical variables into dummy variables. We also check whether every level of each variable appears in both the training and test sets.

In [None]:
# =============================================================================
# Data Pre-processing
# =============================================================================
# Separate into predictor variables and response variable
x = (data.iloc[:,1:])
y = (data.iloc[:,0])

# Obtain train and test sets, set seed to 10
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)

# Checking for variables' levels
train_level = x_train.describe().iloc[1,:]
test_level = x_test.describe().iloc[1,:]

# Negate since we want the columns that are different
truth = ~(train_level == test_level)

col = x_train.columns[truth][0]
print(x_train.loc[:,col].value_counts())
print(x_test.loc[:,col].value_counts())

# Convert categorical variables into dummy variables of type int
x_traindummy = pd.get_dummies(x_train, drop_first=True, dtype=int)
x_testdummy = pd.get_dummies(x_test, drop_first=True, dtype=int)

# Encode response variable to 0 and 1
Encoder_y = LabelEncoder()
y_trainencoded = Encoder_y.fit_transform(y_train)
y_testencoded = Encoder_y.transform(y_test)  # reuse the mapping learned from the training labels

We use `pandas.DataFrame.align` since some levels of some variables do not appear in both the training set and the test set. More specifically, the `cap-surface` variable has 4 levels (y, s, f, g) but only 3 (y, s, f) appear in the test set. This creates a problem when we convert to dummy variables: a column is created for each level of a variable, so the test set ends up with one fewer column than the training set.
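
To see the effect in isolation, here is a small sketch on toy frames (made up for illustration). It omits `drop_first=True` to keep the column names easy to follow:

```python
import pandas as pd

# Toy train/test splits, made up for illustration
train = pd.DataFrame({"cap-surface": ["y", "s", "f", "g"]})
test = pd.DataFrame({"cap-surface": ["y", "s", "f"]})

train_d = pd.get_dummies(train, dtype=int)
test_d = pd.get_dummies(test, dtype=int)
print(train_d.shape[1], test_d.shape[1])  # column counts differ before aligning

# Outer join on columns; levels missing from one side are filled with 0
train_a, test_a = train_d.align(test_d, join="outer", axis=1, fill_value=0)
print(list(test_a.columns))
print(test_a["cap-surface_g"].tolist())
```

After aligning, both frames share the same columns in the same order, which is what the models expect at prediction time.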

In [None]:
# Align both train and test sets to ensure columns are the same (Since test-set's cap-surface variable has only 3 levels while train-set has 4)
x_finaltrain, x_finaltest = x_traindummy.align(x_testdummy, fill_value=np.int32(0), axis=1)

In [None]:
data.iloc[:,0].value_counts()

The classes are mostly balanced which is a good thing, since many problems may arise from imbalanced data. Next, we train models and use them for predictions.

For each model, we tune some hyperparameters using 10-fold cross-validation. In each iteration, the training set (`k-1` folds) is standardised, and its mean and standard deviation are then applied to the held-out `k`-th fold. sklearn's built-in `Pipeline` greatly simplifies this process. We also convert the data type of our training and test sets from `int` to `float`; otherwise, `sklearn.preprocessing.StandardScaler` would raise a `DataConversionWarning`.
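
The fold-wise standardisation described above can be sketched by hand and compared with the `Pipeline` version. This is only an illustration on synthetic data from `make_classification` with 5 folds. It is not the mushroom data, and the variable names are made up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=10)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=10)

# Manual version: fit the scaler on the training folds only
manual_scores = []
for train_idx, val_idx in kf_demo.split(X):
    scaler = StandardScaler().fit(X[train_idx])
    clf = LogisticRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
    manual_scores.append(clf.score(scaler.transform(X[val_idx]), y[val_idx]))

# Pipeline version: scaler and model are refitted on each split
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe_scores = cross_val_score(pipe, X, y, cv=kf_demo)

print(np.allclose(manual_scores, pipe_scores))
```

Both versions keep the held-out fold out of the scaler's statistics. The pipeline just does it with less code.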

In [None]:
# Convert data type from int to float otherwise there would be DataConversionWarning from StandardScaler
x_finaltrain = x_finaltrain.astype(float)
x_finaltest = x_finaltest.astype(float)

# Cross-validation to determine parameter for regularisation
# Split into 10 folds
kf = KFold(n_splits=10, shuffle=True, random_state=10)

## **Logistic Regression**

We apply Logistic Regression with the L2-norm as the penalty. This shrinks the coefficients towards 0, which helps to prevent overfitting and reduce variance. I initially used the L1-norm as the penalty, but I do not think it is appropriate since each column is a **level** of a variable, **not** a variable. It does not make sense to select individual levels of a variable, rather than the variable itself, for model fitting.

We cross-validate 20 values of `C` to find the most appropriate value for regularization strength. Large values of `C` correspond to weak regularization while small values of `C` correspond to strong regularization. 
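
As a quick illustration of how `C` controls the regularization strength, the sketch below fits the same model with three values of `C` and compares the size of the coefficient vector. It uses synthetic data and scikit-learn's default solver rather than `liblinear`, so it is only indicative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, made up for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=10)

norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000)  # L2 penalty is the default
    clf.fit(X, y)
    # Euclidean norm of the fitted coefficients
    norms[C] = float(np.linalg.norm(clf.coef_))

print(norms)
```

The smallest `C` should give the smallest coefficient norm, reflecting the strongest shrinkage.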

In [None]:
# =============================================================================
# Logistic Regression with Ridge
# =============================================================================
# Range of parameters to test
c_logreg = np.linspace(0,2,21)
c_logreg = c_logreg[1:] # We don't want 0 since 1/0 will produce an error

# Does a grid search over all parameter values and refits entire dataset using best parameters 
parameterslogreg = {'clf__C':c_logreg}
pipelogreg = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression(random_state=10, penalty='l2', solver='liblinear'))])
logreg = GridSearchCV(pipelogreg, parameterslogreg, cv=kf)
logreg.fit(x_finaltrain, y_trainencoded)

print("The confusion matrix is")
print(confusion_matrix(y_testencoded, logreg.predict(x_finaltest)))
print("Logistic regression with L2 norm accuracy is", logreg.score(x_finaltest, y_testencoded))

## **Adaboost Decision Tree**

Boosting refers to training several weak learners and combining them to form a strong learner by reducing **bias**. Here, we tune 3 hyperparameters. `depth` refers to the depth of each weak learner, with `depth = 1` being a decision stump. `num_est` refers to the number of weak learners to train and `rate` refers to the learning rate which determines the contribution of each weak learner. 

Since there is more than one hyperparameter, `sklearn.model_selection.GridSearchCV` performs a grid search by trying all possible combinations of hyperparameters. With 3 options for each of the 3 hyperparameters, it evaluates 3 × 3 × 3 = 27 candidate models during cross-validation. Naturally, if we increase the number of values to test, the number of models to be evaluated also increases, which leads to longer training time. Thus, I tried to pick a few values for each parameter that span a wide range.
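
The size of a grid is the product of the number of options per hyperparameter. A small sketch with `sklearn.model_selection.ParameterGrid` makes the count explicit. The dictionary keys are simplified here and do not carry the pipeline prefixes:

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as in this section, with simplified key names
grid = {
    "max_depth": [1, 3, 5],
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.0001, 0.01, 1],
}

n_candidates = len(ParameterGrid(grid))
n_folds = 10

# Each candidate is fitted once per fold during cross-validation
print(n_candidates, n_candidates * n_folds)
```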

In [None]:
# =============================================================================
# Adaboost classification tree
# =============================================================================
# Range of parameters to test
depth = [1,3,5]
num_est = [50, 100, 150]
rate = [0.0001, 0.01, 1]

# Does a grid search over all parameter values and refits entire dataset using best parameters 
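# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4);
# on newer versions use `estimator=DTC` and the grid key 'clf__estimator__max_depth' instead
parameterstree = {'clf__base_estimator__max_depth':depth, 'clf__n_estimators':num_est, 'clf__learning_rate':rate}
DTC = DecisionTreeClassifier(random_state=10)
ABC = AdaBoostClassifier(base_estimator = DTC, random_state=10)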
pipeada = Pipeline([('scale', StandardScaler()), ('clf', ABC)]) 
adatree = GridSearchCV(pipeada, parameterstree, cv=kf)
adatree.fit(x_finaltrain, y_trainencoded) 

print("The confusion matrix is")
print(confusion_matrix(y_testencoded, adatree.predict(x_finaltest)))
print("Adaboosted classification tree accuracy is", adatree.score(x_finaltest, y_testencoded))

## **Random Forest**
Random forests build on the idea of bagging, which stands for bootstrap aggregating. In contrast to boosting, bagging works by training several strong learners and averaging their predictions to reduce **variance**. As before, `depth` refers to the depth of each tree, with stronger learners usually having greater depth than weak learners. `trees` refers to the number of strong learners to train. When growing a tree, each time a split is considered, a **random** selection of `m` variables is chosen as candidates for the split, where `m < p` and `p` is the total number of variables. This helps to de-correlate the trees, which reduces variance when averaging, and gives *Random* forest its name.

`m` can be chosen to be anything and here, we try $m = \sqrt{p}$ and $m = \log_2(p)$. Once again, `GridSearchCV` does the heavy lifting for us.
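
For a sense of scale, here is the arithmetic for a hypothetical `p`. The value 95 is chosen purely for illustration, and the truncation to an integer reflects my understanding of how scikit-learn resolves the `'sqrt'` and `'log2'` options:

```python
import math

p = 95  # hypothetical number of dummy columns, chosen for illustration

# Truncate to an integer and always use at least one feature
m_sqrt = max(1, int(math.sqrt(p)))
m_log2 = max(1, int(math.log2(p)))

print(m_sqrt, m_log2)
```

Both options consider only a small fraction of the columns at each split, with `log2` being the more restrictive of the two for this `p`.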

In [None]:
# =============================================================================
# Random Forest
# =============================================================================
# Range of parameters to test
depth = np.linspace(1,10,4).astype(np.int64) # max_depth must be an integer, not a float
trees = np.linspace(100,400,4)
trees = trees.astype(np.int64)
m = ['sqrt','log2']

# Does a grid search over all parameter values and refits entire dataset using best parameters 
parametersforest = {'clf__n_estimators':trees, 'clf__max_depth':depth, 'clf__max_features':m}
pipeforest = Pipeline([('scale', StandardScaler()), ('clf', RandomForestClassifier(random_state=10))]) 
rforest = GridSearchCV(pipeforest, parametersforest, cv=kf)
rforest.fit(x_finaltrain, y_trainencoded)

print("The confusion matrix is")
print(confusion_matrix(y_testencoded, rforest.predict(x_finaltest)))
print("Random forest accuracy is", rforest.score(x_finaltest, y_testencoded))


## **Support Vector Machine**
SVMs are built around the concept of margins. Given data from two classes (assume binary classification), the model looks for a hyperplane that separates them. In 2 dimensions, a hyperplane is a line. If such a hyperplane exists, the data is linearly separable, and the model picks the one that lies furthest from the nearest points of both classes. In other words, it **maximizes** the **margin**. Usually, data contains noise and is not perfectly linearly separable, so no straight line can separate it. Support vector *classifiers* overcome this by allowing **soft** margins: margins that allow some points to be incorrectly classified. How heavily such violations are penalized is controlled by the hyperparameter `C`, which we tune. In scikit-learn, small values of `C` correspond to stronger regularization and tolerate more misclassification, while large values of `C` penalize misclassification more heavily.

Sometimes, a linear boundary will not work regardless of the value of `C` because the data is non-linear. In this case, support vector *machines* are used, and they overcome the problem with kernel methods. The idea is to transform the data into a higher-dimensional space and find a hyperplane there. This may sound computationally expensive, but thanks to kernel methods, only the inner products of the transformed data are needed, and these can be computed without carrying out the transformation explicitly.
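
A tiny worked example of this idea uses the degree-2 polynomial kernel in 2 dimensions. The functions `phi` and `poly_kernel` are written for illustration only. The inner product in the transformed space can be obtained directly from the original vectors:

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(a, b):
    """Same inner product, computed directly in the original space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))  # transform first, then take the inner product
implicit = poly_kernel(a, b)       # no transformation needed

print(explicit, implicit)
```

The two numbers agree, which is why the model never has to build the higher-dimensional features itself.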

Previously, we saw the accuracy of logistic regression on the data. Logistic regression has a linear decision boundary, i.e. it does not perform well on data with a non-linear boundary. Its strong performance here therefore suggests that the data is close to linearly separable, so I chose `kernel = 'linear'` for this model. Cross-validation is again performed, this time over 11 values of `C`.

In [None]:
# =============================================================================
# SVM
# =============================================================================
# Range of parameters to test
c_svc = np.linspace(-5, 5, 11)
c_svc = [10**i for i in c_svc]

# Does a grid search over all parameter values and refits entire dataset using best parameters 
parameterssvc = {'clf__C':c_svc}
pipesvc = Pipeline([('scale', StandardScaler()), ('clf', SVC(random_state=10, kernel='linear'))]) 
svc = GridSearchCV(pipesvc, parameterssvc, cv=kf)
svc.fit(x_finaltrain, y_trainencoded)

print("The confusion matrix is")
print(confusion_matrix(y_testencoded, svc.predict(x_finaltest)))
print("Support vector classifier with linear kernel accuracy is", svc.score(x_finaltest, y_testencoded))


## **Conclusion**
It appears that all 4 models performed perfectly on the test set. It is likely that the classes in this data set are very cleanly separated. This kernel also illustrates that simpler models such as logistic regression may perform as well as more complicated models such as SVMs. It is thus advisable to try simpler models first, as they are easier to understand, more interpretable and potentially faster to train, before moving on to more advanced models such as neural networks or SVMs.