# Introduction

From the Mushroom Classification dataset, we will attempt to create a classifier that determines the odour of a mushroom from its other features.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

In [None]:
mushrooms = pd.read_csv("../input/mushroom-classification/mushrooms.csv")
mushrooms.info()

Thankfully there are no nulls, which will simplify things.

For now, let's just peek at the dataset to see what it looks like.

In [None]:
mushrooms.head()

In [None]:
mushrooms.describe()

According to the attribute information of the dataset:

almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

Counting the examples of each odour type shows that our dataset is heavily imbalanced. Odourless ('n') and foul ('f') mushrooms dominate the dataset, while musty ('m') and creosote ('c') mushrooms are barely represented.

In [None]:
mushrooms['odor'].value_counts()
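
Relative frequencies can be easier to read than raw counts when judging imbalance. The following is a minimal sketch on a small made-up Series (`toy_odours` is hypothetical and only stands in for the real `odor` column):

```python
import pandas as pd

# Hypothetical toy labels standing in for mushrooms['odor']
toy_odours = pd.Series(list("nnnnnnffffam"))

# Fraction of rows belonging to each class, largest first
proportions = toy_odours.value_counts(normalize=True)

# Ratio between the most and the least frequent class
counts = toy_odours.value_counts()
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print("imbalance ratio:", imbalance_ratio)
```

The same two calls should work unchanged on `mushrooms['odor']`.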

# Preprocessing

We're trying to find the odour of a mushroom, given all its other properties. So we must separate the odour column from the rest of the features.

In [None]:
pred_data = mushrooms.drop('odor',axis=1)
odours = mushrooms['odor']

Looking into the CSV file (and the attribute information of the dataset), we see that all the features are categorical and that their values are represented by letters. We will encode the categories into numeric values.

In [None]:
from sklearn.preprocessing import LabelEncoder
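Encoder_pred = LabelEncoder()
# The same encoder object is refit on every column, so afterwards it only
# remembers the mapping of the last column it was fitted on
for col in pred_data.columns:
    pred_data[col] = Encoder_pred.fit_transform(pred_data[col])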
Encoder_odours = LabelEncoder()
odours = Encoder_odours.fit_transform(odours)
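
To make the encoding concrete: `LabelEncoder` assigns integer codes according to the sorted order of the labels, and the fitted encoder can map the codes back to letters. A small sketch on made-up labels (`toy_odours` is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, not the real odour column
toy_odours = np.array(["n", "f", "a", "n", "l"])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_odours)

# classes_ holds the original labels in sorted order;
# the code of a label is its index in classes_
print(encoder.classes_)
print(codes)

# inverse_transform recovers the original letters from the codes
print(encoder.inverse_transform(codes))
```

This is why the fitted odour encoder is kept around: it is needed later to label the confusion matrix axes with letters rather than integers.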

We split the dataset into training and test sets (80% train, 20% test).

In [None]:
from sklearn.model_selection import train_test_split
pred_data_train, pred_data_test, odours_train, odours_test = train_test_split(pred_data, odours, test_size=0.2, random_state=1)
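
With very rare classes, a purely random split can leave only a handful of their examples in the test set. One possible variation, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits. A sketch on made-up data (`X` and `y` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# With stratify=y, both splits keep the 90/10 class ratio
print(np.bincount(y_train))
print(np.bincount(y_test))
```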

Recall that our dataset is heavily imbalanced. We will try to resolve this by oversampling the minority odour classes (in the training set only): their examples are randomly re-sampled until each class has as many examples as the majority class.

In [None]:
print("Before resampling:\n{}".format(np.asarray(np.unique(odours_train, return_counts=True)).T))

from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1)
pred_data_train, odours_train = ros.fit_resample(pred_data_train, odours_train)

print("After resampling:\n{}".format(np.asarray(np.unique(odours_train, return_counts=True)).T))
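
Conceptually, random oversampling draws rows with replacement from each minority class until its count matches the majority class. The sketch below illustrates that idea with NumPy alone on made-up data; it is an assumption-laden illustration, not imblearn's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: class 0 has 6 rows, class 1 has 2 rows
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

classes, counts = np.unique(y, return_counts=True)
target = counts.max()  # every class is grown to the majority size

keep = []
for cls, count in zip(classes, counts):
    idx = np.flatnonzero(y == cls)
    # Draw the missing number of row indices with replacement
    # (zero extra draws for the majority class)
    extra = rng.choice(idx, size=target - count, replace=True)
    keep.append(np.concatenate([idx, extra]))

keep = np.concatenate(keep)
X_resampled, y_resampled = X[keep], y[keep]

print(np.unique(y_resampled, return_counts=True))
```

Because the extra rows are exact duplicates, oversampling is applied to the training set only; duplicating rows before the split would leak training rows into the test set.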

# Using Categorical Naive Bayes Classifier

Let us use a Categorical Naive Bayes classifier to model the data.

In [None]:
from sklearn.naive_bayes import CategoricalNB

clf = CategoricalNB()
clf.fit(pred_data_train, odours_train)

print(clf.score(pred_data_train, odours_train))
print(clf.score(pred_data_test, odours_test))

These accuracy scores for training and test data aren't particularly useful in showing us where the algorithm is struggling. We shall use their confusion matrices to retrieve more information.

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

def visualize_confusion(classifier, pred_data_test, odours_test, encoder):
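    # Pass labels explicitly so the matrix rows/columns always line up with the tick labels,
    # even if some class is missing from the data being evaluated
    conf = confusion_matrix(odours_test, classifier.predict(pred_data_test),
                            labels=classifier.classes_, normalize='true')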
    fig, ax = plt.subplots(figsize=(10,10))
    labels = encoder.inverse_transform(classifier.classes_)
    sns.heatmap(conf, annot=True, fmt='.2f', xticklabels=labels, yticklabels=labels)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show(block=False)

In [None]:
visualize_confusion(clf, pred_data_train, odours_train, Encoder_odours)

The confusion matrix above is for the training set; the one below for the test set.

Interestingly, we see (in both matrices) that the Categorical Naive Bayes classifier is doing a really good job predicting 'c', 'm', 'n', 'p' odours. That is:
1. When the odour is actually 'c', 'm', 'n' or 'p' - the classifier almost always guesses correctly
2. When the classifier guesses 'c', 'm', 'n' or 'p' - the actual odour almost always matches the guess

So the classifier can identify those mushrooms very well (and it isn't just blindly guessing 'c', 'm', 'n' or 'p' all the time either.)
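
The two points above correspond to per-class recall (point 1) and per-class precision (point 2). Note that a confusion matrix normalized with `normalize='true'` shows recall on its diagonal; precision would need column normalization (`normalize='pred'`). Both can also be computed directly, as in this sketch on made-up labels (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels for three classes
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0])

# average=None returns one score per class instead of a single average
recall = recall_score(y_true, y_pred, average=None)
precision = precision_score(y_true, y_pred, average=None)

print("recall per class:   ", recall)
print("precision per class:", precision)
```

In this toy example class 1 has perfect recall but imperfect precision, while class 2 is the other way round, which is the distinction the two points draw.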

The classifier is less accurate at identifying 'f' mushrooms, but the real problem lies in differentiating between:
1. 'l' and 'a' mushrooms
2. 's' and 'y' mushrooms

In [None]:
visualize_confusion(clf, pred_data_test, odours_test, Encoder_odours)

# The struggle with 'l' and 'a' mushrooms (and 's' and 'y')

Let us take a step back - all the way back to the original dataset.

First, let's look at the 'l' and 'a' mushrooms. We select all of them from the original dataset and numerically encode their features. We can then use sklearn's chi2() to test whether the odour is independent of each of the other features. We rely on the p-values here: my understanding is that the chi2 statistics returned by sklearn's chi2() are not the same thing as the conventional chi-squared test statistics.

In [None]:
la_mushrooms = mushrooms[mushrooms.odor.isin(['l', 'a'])]
encoded_la = pd.DataFrame()
for col in la_mushrooms.columns:
    encoded_la[col] = LabelEncoder().fit_transform(la_mushrooms[col])

la_data = encoded_la.drop('odor', axis=1)
la_odours = encoded_la['odor']
from sklearn.feature_selection import chi2
_, pval = chi2(la_data, la_odours)
pval

Each entry of pval is the p-value for one feature among the 'l' and 'a' mushrooms. The entries equal to 1 suggest that those features show no association with the odour, so they do not help us differentiate between 'l' and 'a' mushrooms.

From the code output below, we see that the NaN entries correspond to features with only one observed category among all 'l' and 'a' mushrooms. Hence those features cannot help differentiate between 'l' and 'a' mushrooms either.

In [None]:
for col_name in la_mushrooms.drop('odor', axis=1).columns:
    print("{}: {}\n".format(col_name, np.asarray(np.unique(la_mushrooms[col_name], return_counts=True)).T))
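
As a cross-check on the caveat about sklearn's chi2(), a conventional chi-squared test of independence works on a contingency table of category counts rather than on encoded integer values. The sketch below uses `scipy.stats.chi2_contingency` with `pd.crosstab` on a made-up frame; the data and the column names `cap_shape` and `habitat` are hypothetical and chosen only for illustration:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 40 mushrooms, half 'a' and half 'l'
toy = pd.DataFrame({
    "odor":      list("aaaallll" * 5),
    "cap_shape": list("xbxbxbxb" * 5),  # same mix of values for both odours
    "habitat":   list("ggggmmmm" * 5),  # perfectly aligned with the odour
})

pvals = {}
for feature in ["cap_shape", "habitat"]:
    # Rows: feature categories, columns: odours, cells: counts
    table = pd.crosstab(toy[feature], toy["odor"])
    stat, pval, dof, expected = chi2_contingency(table)
    pvals[feature] = pval

print(pvals)
```

A p-value near 1 (as for the toy `cap_shape`) gives no evidence against independence, whereas a very small one (as for the toy `habitat`) indicates that the feature and the odour are associated.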

Repeating this process for the 's' and 'y' mushrooms, we again see that the odour is very likely independent of the other features.

Once again, the NaNs represent features that have only one observed category among the 's' and 'y' mushrooms.

In [None]:
sy_mushrooms = mushrooms[mushrooms.odor.isin(['s', 'y'])]
encoded_sy = pd.DataFrame()
for col in sy_mushrooms.columns:
    encoded_sy[col] = LabelEncoder().fit_transform(sy_mushrooms[col])

sy_data = encoded_sy.drop('odor', axis=1)
sy_odours = encoded_sy['odor']
_, pval = chi2(sy_data, sy_odours)
pval

In [None]:
for col_name in sy_mushrooms.drop('odor', axis=1).columns:
    print("{}: {}\n".format(col_name, np.asarray(np.unique(sy_mushrooms[col_name], return_counts=True)).T))

# Attempts at Training a Second Classifier for 'l', 'a', 's', 'y' Mushrooms

Below are a couple of attempts to train a second classifier, specifically to differentiate between 'l', 'a', 's', and 'y' mushrooms - as a way to improve upon the accuracy of the Categorical Naive Bayes classifier above. None of them performed particularly well.

## Preprocessing

In [None]:
# Get the 'l', 'a', 's', 'y' mushrooms
lasy_mushrooms = mushrooms[mushrooms.odor.isin(['l', 'a', 's', 'y'])]
lasy_data = lasy_mushrooms.drop('odor',axis=1)
lasy_odours = lasy_mushrooms['odor']

# Encoding categorical values into numerical ones
lasy_encoder_pred = LabelEncoder() 
for col in lasy_data.columns:
    lasy_data[col] = lasy_encoder_pred.fit_transform(lasy_data[col])
lasy_encoder_odours = LabelEncoder()
lasy_odours = lasy_encoder_odours.fit_transform(lasy_odours)

# Need to use one-hot encoding for classifiers that do not interpret categorical features correctly
# This will split all the categorical variables into binary ones - we will use PCA later to reduce dimensionality (while trying to retain variance information)
lasy_data = pd.get_dummies(lasy_data,columns=lasy_data.columns,drop_first=True)

# Split the dataset into training and test sets
lasy_data_train, lasy_data_test, lasy_odours_train, lasy_odours_test = train_test_split(lasy_data, lasy_odours, test_size=0.2, random_state=1)

# Oversample the training data for balance
ros = RandomOverSampler(random_state=1)
lasy_data_train, lasy_odours_train = ros.fit_resample(lasy_data_train, lasy_odours_train)

# PCA Step - Use Cumulative Summation of the Explained Variance to choose a good number of components
from sklearn.decomposition import PCA
pca = PCA().fit(lasy_data_train)
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')  # values are fractions in [0, 1], not percentages
plt.show()

Suppose we reduce the data to 15 components: we lose only a little over 5% of the variance information.

In [None]:
pca = PCA(n_components=15)
lasy_data_train = pca.fit_transform(lasy_data_train)
lasy_data_test = pca.transform(lasy_data_test)
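
Instead of reading the number of components off the plot, it can be derived from the cumulative explained variance; scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then keeps enough components to reach that share of the variance. A sketch on made-up data (the matrix `X` and the 95% threshold are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical toy data: 200 samples, 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 95%
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

# Passing a float asks PCA to choose the number of components itself
pca_95 = PCA(n_components=0.95).fit(X)

print(n_components, pca_95.n_components_)
```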

## SVC

Below is a plot of SVC accuracy scores over several values of C (the regularization parameter).

In [None]:
from sklearn.svm import SVC
train_acc = []
test_acc = []
c_range = [0.05, 0.1, 0.2, 0.3, 0.5, 1, 1.5, 2, 3, 5, 10, 15, 20, 30, 40, 50, 100]

for c in c_range:
    svc = SVC(C=c, kernel='rbf',random_state=1)
    svc.fit(lasy_data_train, lasy_odours_train)
    train_acc.append(svc.score(lasy_data_train, lasy_odours_train))
    test_acc.append(svc.score(lasy_data_test, lasy_odours_test))
    
plt.plot(c_range, train_acc, label="training accuracy")
plt.plot(c_range, test_acc, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("C")
plt.xscale("log")
plt.legend()

Even with the best values of C we tried (say C=0.1), the classifier still performs poorly on both the training and test sets.

In [None]:
svc = SVC(C=0.1, kernel='rbf',random_state=1)
svc.fit(lasy_data_train, lasy_odours_train)
visualize_confusion(svc, lasy_data_train, lasy_odours_train, lasy_encoder_odours)

In [None]:
visualize_confusion(svc, lasy_data_test, lasy_odours_test, lasy_encoder_odours)
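
The sweep over C above scores every candidate on the test set. A possible alternative, not what this notebook does, is a cross-validated search on the training data, which leaves the test set untouched until the final evaluation. A minimal sketch with `GridSearchCV` on synthetic data (the data from `make_classification` and the three C values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the training set
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=2, random_state=1
)

# 5-fold cross-validation over a small grid of C values
search = GridSearchCV(
    SVC(kernel="rbf", random_state=1),
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Note that if oversampling is combined with cross-validation, duplicated rows can end up in both the training and validation folds, so the resampling step would ideally happen inside each fold.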

## Decision Tree Classifier

Below is a plot of Decision Tree accuracy scores over several maximum tree depths, for both the entropy and Gini impurity criteria.

In [None]:
from sklearn.tree import DecisionTreeClassifier
train_acc_ent = []
test_acc_ent = []
train_acc_gini = []
test_acc_gini = []

depth_range = range(1, 31, 1)

for n in depth_range:
    dtc = DecisionTreeClassifier(max_depth=n, criterion='entropy', random_state=1)
    dtc.fit(lasy_data_train, lasy_odours_train)
    train_acc_ent.append(dtc.score(lasy_data_train, lasy_odours_train))
    test_acc_ent.append(dtc.score(lasy_data_test, lasy_odours_test))
    
    dtc = DecisionTreeClassifier(max_depth=n, criterion='gini', random_state=1)
    dtc.fit(lasy_data_train, lasy_odours_train)
    train_acc_gini.append(dtc.score(lasy_data_train, lasy_odours_train))
    test_acc_gini.append(dtc.score(lasy_data_test, lasy_odours_test))

plt.plot(depth_range, train_acc_ent, label="training accuracy ent")
plt.plot(depth_range, test_acc_ent, label="test accuracy ent")
plt.plot(depth_range, train_acc_gini, label="training accuracy gini")
plt.plot(depth_range, test_acc_gini, label="test accuracy gini")
plt.ylabel("Accuracy")
plt.xlabel("Max Depth")
plt.legend()

In [None]:
dtc = DecisionTreeClassifier(max_depth=3, criterion='entropy', random_state=1)
dtc.fit(lasy_data_train, lasy_odours_train)
visualize_confusion(dtc, lasy_data_train, lasy_odours_train, lasy_encoder_odours)

In [None]:
visualize_confusion(dtc, lasy_data_test, lasy_odours_test, lasy_encoder_odours)

## K Nearest Neighbours Classifier

Below is a plot of K Nearest Neighbours accuracy scores over several numbers of neighbours.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
train_acc = []
test_acc = []

k_range = range(1, 51, 1)

for k in k_range:
    knc = KNeighborsClassifier(n_neighbors=k)
    knc.fit(lasy_data_train, lasy_odours_train)
    train_acc.append(knc.score(lasy_data_train, lasy_odours_train))
    test_acc.append(knc.score(lasy_data_test, lasy_odours_test))

plt.plot(k_range, train_acc, label="training accuracy")
plt.plot(k_range, test_acc, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("# of Neighbours")
plt.legend()

In [None]:
knc = KNeighborsClassifier(n_neighbors=45)
knc.fit(lasy_data_train, lasy_odours_train)
visualize_confusion(knc, lasy_data_train, lasy_odours_train, lasy_encoder_odours)

In [None]:
visualize_confusion(knc, lasy_data_test, lasy_odours_test, lasy_encoder_odours)