# Abstract

Mushroom picking is a popular pastime. Wild mushrooms vary in edibility, from edible and pleasant-tasting to deadly poisonous. Correct identification is difficult for the layman, and different mushrooms can easily be confused with each other.

The aim of this coursework is to create a Machine Learning model to classify mushroom specimens as edible or poisonous. The model must be highly accurate and trustworthy, since a misclassification in this task could be fatal.

## Dataset

The [Mushroom](https://archive.ics.uci.edu/ml/datasets/mushroom) dataset is a well-known dataset for classification purposes. It contains more than 8,000 instances of specimens in the *Agaricus* and *Lepiota* genera of mushrooms. Each instance is classified as either *edible* or *poisonous*.

This copy of the Mushroom dataset was retrieved from the [Wolfram Data Repository](https://datarepository.wolframcloud.com/resources/Sample-Data-Mushroom-Classification) on roughly November 1.

# Research & Data Exploration

## Literature Review

The paper [*Data Mining on Mushroom Database*](http://csis.pace.edu/~ctappert/srd2008/b2.pdf) makes use of the Mushroom dataset. The authors set out to find the most effective Machine Learning model for Mushroom, with preliminary results showing that a Decision Tree algorithm (J48) performed best. Test data was split into 2 sets, on which the model reached 99.6% and 100% accuracy, respectively.

The paper's goal is not dissimilar to that of this coursework. The authors built a user-facing web application powered by the Decision Tree model, which returns a mushroom's edibility given some attributes supplied by the user.

In *8. Limitations and Opportunities for Research*, the authors state that the `StalkRoot` attribute, which is known to be about 30% missing, was not removed except in the case of one particular algorithm (PRISM) which was unable to process missing values. It is noted that 'an acceptable level of accuracy was reached regardless of the missing data'.

The parameters of the Decision Tree model are not explained, other than that the tree is unpruned. Unpruned trees are prone to overfitting, a possibility which is not discussed.

The paper provides no exploration of the dataset, very little explanation of how the data was prepared, and no visualisation. Although its primary focus is not the dataset itself but rather the Machine Learning algorithms used, this means little can be known about what preparatory measures were undertaken with the data to produce the high degree of accuracy achieved by the model. Most of Mushroom's features are categorical and therefore uninterpretable by most Machine Learning algorithms, so the data was certainly processed; but this processing is not explained.

## Data Exploration

### Data Shape

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
# Import data
df = pd.read_csv('data/mushroom.csv')
df.iloc[:5,:6]

In [None]:
print(f'Dataset contains {df.shape[0]} rows')
print(f'and {df.shape[1]} columns.')

### Features

All but two features are of `object` type: these are all categorical variables, having discrete values. They will need to be handled in order to be properly interpreted by most Machine Learning algorithms.

In [None]:
df.info()

Some attributes, such as `GillAttachment` and `RingNumber`, apparently show very little relation to the class variable. Others, such as `GillColor` & `Odor`, appear to be important features. For example, a `GillColor` of *buff* and an `Odor` of *foul* are apparently important predictors of poisonous mushrooms.

In [None]:
# 4 subplots
plt.figure(figsize=(14,11))
sns.set_style('darkgrid')

plt.subplot(2,2,1)
plt.title('GillAttachment vs Class')
sns.countplot(data=df,x='GillAttachment',hue='Class')

plt.subplot(2,2,2)
plt.title('RingNumber vs Class')
sns.countplot(data=df,x='RingNumber',hue='Class')

plt.subplot(2,2,3)
plt.title('GillColor vs Class')
sns.countplot(data=df,x='GillColor',hue='Class')

plt.subplot(2,2,4)
plt.title('Odor vs Class')
sns.countplot(data=df,x='Odor',hue='Class')
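
The relationships read off the count plots can also be checked numerically. The following is a minimal sketch on a small made-up frame (the rows are illustrative and are not taken from Mushroom); it uses `pd.crosstab` with `normalize='index'` to give the share of each class within every value of a feature.

```python
import pandas as pd

# Toy data for illustration only; these rows are hypothetical, not real Mushroom records
toy = pd.DataFrame({
    'Odor':  ['foul', 'foul', 'anise', 'anise', 'almond', 'foul'],
    'Class': ['poisonous', 'poisonous', 'edible', 'edible', 'edible', 'poisonous'],
})

# Row-normalised contingency table: proportion of each class per Odor value
odor_by_class = pd.crosstab(toy['Odor'], toy['Class'], normalize='index')
print(odor_by_class)
```

A feature value whose row is close to 1.0 for one class separates the classes well; a row close to the overall class split carries little information.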

The dataset's class distribution is balanced.

In [None]:
# Class label counts
edible = len(df[df.Class=='edible'])
poisonous = len(df[df.Class=='poisonous'])

# Class label proportions
ediblePC = round(edible/df.shape[0]*100,2)
poisonousPC = round(poisonous/df.shape[0]*100,2)

print(f'Edible: {edible} ({ediblePC}%)')
print(f'Poisonous: {poisonous} ({poisonousPC}%)')

In [None]:
# Plot class label proportions
sns.set_style('darkgrid')
sns.set(rc={'figure.figsize':(9,7)})

sns.countplot(x='Class',data=df)
plt.show()

### Missing Values

Mushroom is known to contain missing values. They can be spotted by a knowledgeable eye in the following output: among the listed unique values of the `StalkRoot` feature is a value `Missing[]`. As previously mentioned, the dataset was downloaded from the Wolfram Data Repository, and according to the [Wolfram Language Reference](https://reference.wolfram.com/language/ref/Missing.html), `Missing[]` is the standard notation for missing values in the Wolfram Language.

In [None]:
# Print unique values of each column if they number 5 or fewer
for i in df.columns:
    n_unique = len(df[i].unique())
    if n_unique <= 5:
        print(f'{n_unique} unique values in {i}: {df[i].unique()}')

In [None]:
print(f'Unique values in StalkRoot: {df["StalkRoot"].unique()}')

We can be sure that there are no more missing values in the dataset by converting any `Missing[]` values into a more traditional `None`, and then checking the data frame for null values:

In [None]:
# Function to convert Missing[] values to None
def missingToNone(value):
    if value == 'Missing[]':
        return None
    else:
        return value

In [None]:
# Apply function to data frame
for i in df.columns:
    df[i] = df[i].apply(missingToNone)

In [None]:
# Check again for missing values 
df.isnull().sum()
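
An alternative to converting the placeholder after loading is to declare it when the file is read. The sketch below assumes the placeholder string is exactly `Missing[]` and uses a three-row inline CSV as a hypothetical stand-in for the real file; `na_values` is a standard parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real data file
csv_text = "StalkRoot,Odor\nbulbous,foul\nMissing[],anise\nclub,almond\n"

# na_values tells pandas to read the Wolfram placeholder as NaN
toy = pd.read_csv(io.StringIO(csv_text), na_values=['Missing[]'])

# Count of missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```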

# Data Preparation

## Missing Values

In [None]:
print(f'Percentage of missing values in StalkRoot: {round(((df["StalkRoot"].isnull().sum() / df.shape[0]) * 100), 2)}%')

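Although a substantial share of `StalkRoot` values is missing (about 30%, as noted in the Literature Review), the reviewed paper reached acceptable accuracy without removing the feature, so it is retained here.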

## Zero-Variance Predictors

The following output reveals that the feature `VeilType` contains only 1 unique value. This makes it a *zero-variance predictor* which is of no use for modelling, so it can be dropped from the dataset.

In [None]:
# Number of unique values per feature
for i in df.columns:
    print(i, len(df[i].unique()))

In [None]:
# Drop VeilType feature
df.drop(['VeilType'],axis=1,inplace=True)
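
Zero-variance columns can also be found programmatically rather than by reading the printed counts. This is a minimal sketch on a made-up frame whose column names mirror the dataset; note that `nunique()` ignores missing values by default, whereas `unique()` counts them as a value.

```python
import pandas as pd

# Toy frame; column names mirror the dataset but the rows are made up
toy = pd.DataFrame({
    'VeilType':  ['partial', 'partial', 'partial'],
    'VeilColor': ['white', 'brown', 'white'],
})

# nunique() counts distinct non-null values per column
unique_counts = toy.nunique()
zero_variance = unique_counts[unique_counts <= 1].index.tolist()

# Drop every column that never varies
toy_reduced = toy.drop(columns=zero_variance)
print(zero_variance)
```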

## Categorical Features

The categorical features in the dataset are subject to one-hot encoding with Pandas' `get_dummies` method.

In [None]:
# List of categorical features (a list keeps the column order deterministic)
non_cat_variables = ['Bruises', 'RingNumber', 'Class']
cat_variables = [col for col in df.columns if col not in non_cat_variables]

# New data frame for encoded features
df_encoded = pd.get_dummies(df, columns=cat_variables)
df_encoded.iloc[:5, :6]

This results in the class label being shifted from the end of the data frame to near the beginning, so it is moved back to the end.

In [None]:
# Move class label to end of data frame
target = df['Class']
df_encoded.drop(labels=['Class'], axis=1, inplace=True)
df_encoded.insert(len(df_encoded.columns), 'Class', target)
df_encoded.iloc[:5, -3:]
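
The effect of one-hot encoding on the column layout is easiest to see on a toy frame. In this sketch the column name `CapShape` and its values are assumed for illustration; `get_dummies` leaves the untouched columns in place and appends one indicator column per category, which is why the class label ends up ahead of the encoded features.

```python
import pandas as pd

# Toy frame with one categorical feature and the class label (made-up rows)
toy = pd.DataFrame({
    'CapShape': ['bell', 'convex', 'bell'],
    'Class': ['edible', 'poisonous', 'edible'],
})

# Encode only CapShape; Class is left as it is
encoded = pd.get_dummies(toy, columns=['CapShape'])
print(encoded.columns.tolist())
```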

The dataset contains one Boolean feature. To make this value consistent with the encoded categorical variables, and to preclude any cross-algorithm differences in interpretation of Booleans, the feature is encoded into an integer.

In [None]:
# Encode the Boolean feature as an integer (False -> 0, True -> 1)
df_encoded['Bruises'] = df_encoded['Bruises'].astype(int)
df_encoded[['Bruises']].head()

The class label is also a categorical variable. Many machine learning implementations expect the class label to be encoded as an integer. This can be achieved with scikit-learn's `LabelEncoder`, which also allows the label values to be transformed back into their original string representations.

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

df_encoded.Class = le.fit_transform(df.Class)
df_encoded.iloc[:5, -1:]

In [None]:
labels_original = le.inverse_transform(df_encoded.Class)
labels_original[:5]

In [None]:
# Replace original data frame
df = df_encoded

# Modeling

## Train/Test Split

The data is split into training and testing sets, representing 80% and 20% of the data respectively. Sampling is random.

In [None]:
# Separate class label from input
X = df.drop(columns='Class')
y = df['Class']

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.20)

The class distribution in the training and testing sets is roughly the same as in the dataset overall.

In [None]:
y_train.value_counts(normalize=True)

In [None]:
y_test.value_counts(normalize=True)
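
With purely random sampling the class proportions only match approximately, and the split changes on every run. One option, sketched here on made-up data, is to pass `stratify` and `random_state` to `train_test_split`; the 60/40 toy labels and the 50% test size are arbitrary choices for the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 rows with a 60/40 class balance (values are illustrative)
X_toy = pd.DataFrame({'feature': range(10)})
y_toy = pd.Series([0] * 6 + [1] * 4)

# stratify keeps the class ratio in both splits; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=0
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```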

## SVM

In [None]:
from sklearn.svm import SVC
svc = SVC(kernel='poly',gamma='auto',C=1)

# Train
svc.fit(X_train,y_train)

# Accuracy
accuracy = svc.score(X_test, y_test)
print(f'Accuracy: {round(accuracy*100, 2)}%')

A significant improvement in accuracy can be achieved by tuning the model's C value, with perfect accuracy being achieved with a C of around 50.

In [None]:
# C = 50
svc = SVC(kernel='poly',gamma='auto',C=50)

# Train
svc.fit(X_train,y_train)

# Accuracy
accuracy = svc.score(X_test, y_test)
print(f'Accuracy: {round(accuracy*100, 2)}%')
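
Choosing `C` by looking at test-set accuracy risks tuning the model to the test set. A common alternative is to select it by cross-validation on the training data only. The sketch below is an assumption-laden illustration, not the procedure used above: it runs on synthetic data from `make_classification`, and the candidate `C` values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the encoded Mushroom features
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# Candidate C values are arbitrary choices for this sketch
param_grid = {'C': [1, 10, 50]}
search = GridSearchCV(SVC(kernel='poly', gamma='auto'), param_grid, cv=5)
search.fit(X_tr, y_tr)

# The test set is touched only once, after C has been chosen
held_out_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(f'Held-out accuracy: {held_out_accuracy:.3f}')
```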

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(max_depth=2, random_state=0)

# Train
dtc.fit(X_train, y_train)

# Accuracy
accuracy = dtc.score(X_test,y_test)
print(f'{round(accuracy*100, 2)}%')

An initial run with a `max_depth` of 2 yields an unimpressive accuracy compared with the SVM classifier's first run, but the model's accuracy can be improved by tuning the `max_depth`. The model achieves perfect accuracy at a depth of between 5 and 7.

In [None]:
# Calculate accuracies by max_depth, up to 9
max_depth_vals = list(range(1, 10))
accuracy_list = []
for max_depth in max_depth_vals:
    dtc = DecisionTreeClassifier(max_depth = max_depth, random_state = 0)
    dtc.fit(X_train, y_train)
    accuracy = dtc.score(X_test, y_test)
    accuracy_list.append(accuracy)
print(accuracy_list)

In [None]:
# Accuracies to data frame
depth_accuracies = list(zip(max_depth_vals, accuracy_list))
results = pd.DataFrame(data=depth_accuracies, columns=['max_depth','accuracy'])
results.head(9)

In [None]:
# Visualise accuracies
ax = sns.lineplot(x="max_depth", y="accuracy", data=results, marker="o")
ax.set(xlabel='Tree Depth', ylabel='Accuracy')
plt.show()

In [None]:
# Update original classifier
dtc = DecisionTreeClassifier(max_depth=7, random_state=0)

# Train
dtc.fit(X_train, y_train)
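
Accuracy alone does not distinguish the two kinds of error, and the costly one here is a poisonous specimen predicted as edible. A confusion matrix makes that count explicit. The labels in this sketch are made up; it assumes the encoding 0 = edible and 1 = poisonous, which is what `LabelEncoder` produces because it sorts class names alphabetically.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 0 = edible, 1 = poisonous
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Recall on the poisonous class: share of poisonous specimens that were caught
poisonous_recall = recall_score(y_true, y_pred, pos_label=1)
print(cm)
print(f'Poisonous mushrooms predicted edible: {fn}')
```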

The features identified as most important by the Decision Tree model do not appear to include those judged important by eye in the visualisations in [Data Exploration](#Data_Exploration).

In [None]:
# Features by importance
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(dtc.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False)
importances.head(10)
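
Because the features were one-hot encoded, each importance above belongs to a single category column rather than to an original feature. To compare against the original features, the importances can be summed per source feature. This sketch uses made-up importance values and assumes that neither feature names nor category values contain an underscore, so that the text before the first `_` identifies the source feature.

```python
import pandas as pd

# Hypothetical importances for one-hot columns (numbers are made up)
importances_toy = pd.Series({
    'Odor_foul': 0.40,
    'Odor_almond': 0.10,
    'GillColor_buff': 0.30,
    'Bruises': 0.20,
})

# Recover the source feature by keeping the text before the first underscore
source_feature = importances_toy.index.str.split('_').str[0]

# Sum the importances of all columns derived from the same feature
by_feature = importances_toy.groupby(source_feature).sum().sort_values(ascending=False)
print(by_feature)
```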

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=1)

# Train
rfc.fit(X_train, y_train)

# Accuracy
accuracy = rfc.score(X_test, y_test)
print(f'Accuracy: {round(accuracy*100, 2)}%')

It is somewhat difficult to produce a wrong prediction from the Random Forest model; on some runs it achieves perfect accuracy even with a single estimator. The most important features identified on some runs include those picked out in the visualisations in [Data Exploration](#Data_Exploration).

In [None]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(rfc.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False)
importances.head(10)
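
The run-to-run variation comes from the forest's random sampling and from the unseeded train/test split. One way to get a more stable picture, sketched here on synthetic data rather than on Mushroom, is to fix `random_state` and report cross-validated accuracy instead of a single split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded Mushroom features
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

# One tree per forest, as in the run above; random_state makes each fit repeatable
rfc_fixed = RandomForestClassifier(n_estimators=1, random_state=0)

# Accuracy on each of 5 folds
scores = cross_val_score(rfc_fixed, X_syn, y_syn, cv=5)
print(f'Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})')
```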