# Mushroom Classification
This notebook visualises the features recorded for each mushroom and uses them to predict whether it is poisonous or edible.

If you enjoy this notebook and find it helpful, please upvote it, as it helps me make more of these.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from collections import Counter
from xgboost import XGBClassifier
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/mushroom-classification/mushrooms.csv')

In [None]:
df

## Data visualisation
The first thing we will do is visualise the data.

Here we see five different features plotted out with pie charts.

In [None]:
cols = [[['convex', 'bell', 'sunken', 'flat', 'knobbed', 'conical'], 'cap-shape'], 
        [['smooth', 'scaly', 'fibrous', 'grooves'], 'cap-surface'], 
        [['pungent', 'almond', 'anise', 'none', 'foul', 'creosote', 'fishy', 'spicy', 'musty'], 'odor'], 
        [['scattered', 'numerous', 'abundant', 'several', 'solitary', 'clustered'], 'population'], 
        [['urban', 'grasses', 'meadows', 'woods', 'paths', 'waste', 'leaves'], 'habitat']]

for column in cols:
    fig, ax = plt.subplots(figsize=(5, 5))
    labels = column[0]
    col = column[1]
    
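    # Counter keeps keys in first-seen order, so each label list above must
    # follow the order in which the codes first appear in the column.
    count = Counter(df[col])
    ax.pie(count.values(), labels=labels, shadow=True, autopct=lambda p: f'{p:.2f}%')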
    ax.set_title(col)
    plt.show()
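
The pie charts above only show correct labels if each hand-written label list matches the order in which the codes first appear in the column. A minimal sketch of a less fragile alternative is to look each code up in a dictionary. The `cap_shape_names` mapping and the toy `codes` series are assumptions made for illustration, so check the letter codes against the dataset description before relying on them.

```python
from collections import Counter

import pandas as pd

# Hypothetical code-to-name mapping (assumed; verify against the dataset description).
cap_shape_names = {'x': 'convex', 'b': 'bell', 's': 'sunken',
                   'f': 'flat', 'k': 'knobbed', 'c': 'conical'}

# Toy column standing in for df['cap-shape'].
codes = pd.Series(['x', 'b', 'x', 'f', 'x', 'f'])

count = Counter(codes)
# Build the labels from the counter's own keys, so order can never drift.
labels = [cap_shape_names[code] for code in count]
sizes = list(count.values())

print(labels, sizes)
```

The resulting `labels` and `sizes` could then be passed to `ax.pie` in place of the hard-coded lists.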

## Feature engineering
Next, we will prepare the data so that it can be fed into our models.

The first step is to use a LabelEncoder to turn the categorical columns into numerical ones.

In [None]:
for col in df:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
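
To make the encoding step concrete, here is a small sketch on a toy dataframe (the `toy` frame and the `encoders` dictionary are illustrative names, not part of the notebook). `LabelEncoder` numbers the categories in sorted order, and keeping each fitted encoder makes it possible to map the integers back to the original codes later. Note that the resulting integers carry an arbitrary order that has no meaning for the original categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the mushroom dataframe.
toy = pd.DataFrame({'odor': ['n', 'f', 'a', 'n'],
                    'class': ['e', 'p', 'e', 'e']})

encoders = {}
for col in toy.columns:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    encoders[col] = le  # keep the fitted encoder so the mapping is not lost

odor_codes = toy['odor'].tolist()
odor_classes = list(encoders['odor'].classes_)
recovered = list(encoders['odor'].inverse_transform([0, 1, 2]))

print(odor_codes, odor_classes, recovered)
```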

Then, we split up our dataframe into X and y.

In [None]:
X = df.drop('class', axis=1)
y = df['class']

Afterwards, we graph the distribution of eight columns, showing the original values alongside the log, Box-Cox, standard scaler and min-max scaler transformations.

In [None]:
for col in ['cap-color', 'odor', 'gill-color', 'stalk-color-above-ring', 'stalk-color-below-ring', 
            'spore-print-color', 'population', 'habitat']:
    fig, axes = plt.subplots(1, 5, figsize=(15,  3))
    
    X[col].hist(ax=axes[0], color='skyblue')
    (X[col]+1).transform(np.log).hist(ax=axes[1], color='pink')
    pd.DataFrame(stats.boxcox(X[col]+1)[0]).hist(ax=axes[2], color='lightgreen')
    pd.DataFrame(StandardScaler().fit_transform(np.array(X[col]).reshape(-1, 1))).hist(ax=axes[3], color='yellow')
    pd.DataFrame(MinMaxScaler().fit_transform(np.array(X[col]).reshape(-1, 1))).hist(ax=axes[4], color='orange')
    
    axes[0].set_title('Original')
    axes[1].set_title('Log')
    axes[2].set_title('Box Cox')
    axes[3].set_title('Standard Scaler')
    axes[4].set_title('Min Max Scaler')
    
    for ax in axes:
        ax.set_xlabel(col)
    
    plt.show()
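
As a rough sketch of what the four transformations do numerically, the snippet below applies them to a small made-up array (`values` is an assumption for illustration). The log and Box-Cox transformations both need strictly positive input, which is why 1 is added to the label-encoded values, since those start at 0.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up, label-encoded style values starting at 0.
values = np.array([0, 1, 1, 2, 5, 9], dtype=float)

log_values = np.log(values + 1)
# With no lambda given, boxcox returns the transformed data and the fitted lambda.
boxcox_values, fitted_lambda = stats.boxcox(values + 1)
# The scalers expect a 2D array, hence the reshape; ravel() flattens the result again.
standard_values = StandardScaler().fit_transform(values.reshape(-1, 1)).ravel()
minmax_values = MinMaxScaler().fit_transform(values.reshape(-1, 1)).ravel()

print(standard_values.mean(), standard_values.std())
print(minmax_values.min(), minmax_values.max())
```

The standard scaler output has mean 0 and unit variance, while the min-max scaler output lies between 0 and 1. Neither changes the shape of the histogram, only its scale.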

Subsequently, we check the correlation of the different columns using a heatmap.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.heatmap(X.corr(), annot=True)
plt.show()
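
A heatmap with this many columns can be hard to read, so it may help to list the most strongly correlated pairs as well. The sketch below does this on a toy dataframe (`toy` and its column names are made up for illustration). It also shows that a column with a single value has an undefined correlation, which appears as NaN and is dropped here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [0, 1, 2, 3, 4],
                    'b': [0, 2, 4, 6, 9],
                    'c': [4, 1, 3, 0, 2],
                    'constant': [1, 1, 1, 1, 1]})

corr = toy.corr()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
# dropna() removes the masked cells and the undefined correlations of the constant column.
pairs = corr.where(mask).stack().dropna().abs().sort_values(ascending=False)

print(pairs)
```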

An essential step before modelling is splitting X and y into train and test sets, which is what we do next:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
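
The split above is random on every run. A minimal sketch of a reproducible variant, on made-up arrays, is shown below. It uses `random_state` to fix the shuffle and `stratify` to keep the class proportions the same in both sets. The value 42 and the toy arrays are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, two balanced classes.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print(len(X_tr), len(X_te), y_te.sum())
```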

Lastly, we want to see how much of the variance in the data can be captured in fewer dimensions. I use PCA to project the data onto fifteen principal components, which are combinations of the original features rather than a selection of them, and then use a bar chart to graph each component's explained variance ratio.

In [None]:
pca = PCA(n_components=15)
pca.fit(X_train)  # PCA is unsupervised, so the labels are not needed

X_train = pca.transform(X_train)
X_test = pca.transform(X_test)

evr = pca.explained_variance_ratio_
plt.bar(range(len(evr)), evr, color='blue')
plt.title('Explained variance ratio for 15 principal components')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component')
plt.show()
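
One way to judge whether a given number of components is enough is to look at the cumulative explained variance. The sketch below does this on synthetic data (the random `X_toy` array and the choice of four components are assumptions for illustration only).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
# Make one column nearly a multiple of another so some variance is redundant.
X_toy[:, 1] = X_toy[:, 0] * 2 + rng.normal(scale=0.1, size=200)

pca_toy = PCA(n_components=4).fit(X_toy)
# Running total of the variance explained by the first k components.
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
components = pca_toy.transform(X_toy)

print(cumulative)
print(components.shape)
```

Each row of `pca_toy.components_` holds the weights of the original features in one component, which shows that a component mixes features instead of picking one.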

## Classifying data
Finally, we use the data to train a classifier that predicts whether a mushroom is edible or poisonous. I loop over XGBoost, SVC, SGD, KNN and Random Forest classifiers, fitting each one on the training set and evaluating it on the test set and with cross-validation.

In [None]:
models = [XGBClassifier(), SVC(), SGDClassifier(), KNeighborsClassifier(), RandomForestClassifier()]
model_names = ['XGBoost', 'SVC', 'SGD', 'KNN', 'Random Forest']
scores = []
cross_vals = []

for model_name, model in zip(model_names, models):
    model.fit(X_train, y_train)
    
    # Accuracy on the held-out test set
    score = model.score(X_test, y_test)
    # cross_val_score refits clones of the model on each fold; run it on the
    # training data so that the test set stays untouched
    cross_val = cross_val_score(model, X_train, y_train).mean()
    
    scores.append(score)
    cross_vals.append(cross_val)
    
    print(f'score: {score*100:.2f}% cross val: {cross_val*100:.2f}% {model_name}')
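
`cross_val_score` clones the estimator and refits it on each fold of whatever data it is given, so its result does not depend on the earlier `fit` call. The sketch below illustrates the two metrics on synthetic data from `make_classification`, with cross-validation run on the training portion and the test portion scored once. The dataset, the single model and the parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data standing in for the mushroom features.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# One accuracy value per fold, each from a fresh clone fitted on the other folds.
fold_scores = cross_val_score(model, X_tr, y_tr, cv=5)

# A single accuracy value on data the model has never seen.
model.fit(X_tr, y_tr)
test_score = model.score(X_te, y_te)

print(fold_scores.mean(), test_score)
```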

From these results, the best-performing models on this dataset are XGBoost and Random Forest, both with accuracies near 100%. The two bar charts below compare the test score and the cross-validation score of each model.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

ax1.bar(model_names, scores, color='skyblue')
ax1.set_title('Model scores')
ax1.set_ylabel('Score')
ax1.set_xlabel('Model')

ax2.bar(model_names, cross_vals, color='pink')
ax2.set_title('Cross validation scores')
ax2.set_ylabel('Cross validation score')
ax2.set_xlabel('Model')

plt.show()

### Thank you for reading my notebook.
### If you enjoyed this notebook and found it helpful, please give it an upvote so that I can do more of these in the future.