# Mushroom Classification

Predicting a mushroom's odor from the rest of its properties.

## Imports

We will be using Keras to create a simple feed-forward neural network.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


from sklearn.preprocessing import OneHotEncoder
import tensorflow as tf

from tensorflow.keras import layers, models

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
mushrooms_df = pd.read_csv('/kaggle/input/mushroom-classification/mushrooms.csv')

## Format Dataset

Split the table into labels (odor) and data (everything else).

In [None]:
mushrooms_df.head()

In [None]:
data_c = mushrooms_df.loc[:, mushrooms_df.columns != 'odor'].to_numpy()
labels_c = mushrooms_df[['odor']].to_numpy()

### One-hot encoding

Every column is categorical, so we one-hot encode both the features and the labels with sklearn's OneHotEncoder.

In [None]:
enc_data = OneHotEncoder(handle_unknown='ignore')
enc_data.fit(data_c)

enc_labels = OneHotEncoder(handle_unknown='ignore')
enc_labels.fit(labels_c)

# toarray() returns plain ndarrays; todense() would return the discouraged np.matrix type
data = enc_data.transform(data_c).toarray()
labels = enc_labels.transform(labels_c).toarray()
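To make the encoding concrete, here is a minimal NumPy-only sketch of what one-hot encoding does to a single categorical column. The toy column and its values are made up for illustration; the notebook itself relies on OneHotEncoder, which handles all columns at once.

```python
import numpy as np

# Toy stand-in for one categorical column (values are illustrative only).
column = np.array(['a', 'n', 'a', 'f', 'n'])

# np.unique returns the sorted categories and, for each row, the index of its category.
categories, codes = np.unique(column, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the position of that row's category.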

In [None]:
print(data.shape)
print(labels.shape)

N = data.shape[0]

### Scramble dataset

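Shuffle the rows so that the train/test split differs between training sessions. The data and labels must be shuffled in the same order so that each row keeps its label.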

In [None]:
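# Save the RNG state, shuffle the data, then restore the state so that
# the labels are shuffled with exactly the same permutation.
seed = np.random.get_state()
np.random.shuffle(data)
np.random.set_state(seed)
np.random.shuffle(labels)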
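An alternative way to keep the two arrays aligned is to draw a single permutation of the row indices and index both arrays with it. This is only a sketch on toy arrays (`toy_data` and `toy_labels` are hypothetical stand-ins for `data` and `labels`), not what the notebook runs.

```python
import numpy as np

rng = np.random.default_rng()

# Toy arrays standing in for `data` and `labels`; row i of each belongs together.
toy_data = np.arange(10).reshape(5, 2)
toy_labels = np.arange(5).reshape(5, 1)

# Draw one permutation of the row indices and apply it to both arrays.
perm = rng.permutation(len(toy_data))
shuffled_data = toy_data[perm]
shuffled_labels = toy_labels[perm]
```

Unlike `np.random.shuffle`, this leaves the original arrays untouched and returns shuffled copies.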

### Split into testing and training

Train on the first 80% of the shuffled rows and hold out the remaining 20% as a validation set.

In [None]:
split_idx = int(N * 0.8)
train_data = data[0:split_idx]
train_labels = labels[0:split_idx]
test_data = data[split_idx:N]
test_labels = labels[split_idx:N]

## Define Model

A very simple model: one fully connected ReLU layer as wide as the input, a dropout layer, then an output layer that produces one logit per odor class.

In [None]:
model = models.Sequential([
    layers.Dense(data.shape[1], input_shape=(data.shape[1],), activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(labels.shape[1])
])
model.summary()
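The parameter counts that `model.summary()` reports can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output, and dropout has no parameters. The sketch below uses made-up sizes (`n_features = 4`, `n_classes = 3`), not the real dimensions of this dataset.

```python
def dense_param_count(n_inputs, n_outputs):
    # One weight per (input, output) pair plus one bias per output unit.
    return n_inputs * n_outputs + n_outputs

# Hypothetical sizes for illustration; the notebook uses data.shape[1] and labels.shape[1].
n_features = 4
n_classes = 3

hidden_params = dense_param_count(n_features, n_features)  # hidden layer is as wide as the input
output_params = dense_param_count(n_features, n_classes)
total_params = hidden_params + output_params  # dropout adds no parameters
```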

## Compile and Train Model

We use categorical cross-entropy because the labels are one-hot encoded, with `from_logits=True` because the output layer has no softmax activation. We use the Adam optimizer for simplicity.

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_data, train_labels, epochs=10, 
                    validation_data=(test_data, test_labels))
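For intuition, the loss above can be written out in a few lines of NumPy: apply softmax to the logits, then average the negative log-probability assigned to the true class. This is a sketch of the standard formula on toy values, not Keras's actual implementation, and the function names are my own.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy_from_logits(one_hot_targets, logits):
    # Mean over samples of -log(probability assigned to the true class).
    probs = softmax(logits)
    per_sample = -np.sum(one_hot_targets * np.log(probs), axis=1)
    return per_sample.mean()

# Toy logits and one-hot targets for two samples and three classes.
toy_logits = np.array([[2.0, 0.5, -1.0],
                       [0.0, 0.0, 0.0]])
toy_targets = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0]])
loss = categorical_crossentropy_from_logits(toy_targets, toy_logits)
```

For the second sample all logits are equal, so each class gets probability 1/3 and that sample contributes log(3) to the average.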

## Evaluate Results

We got around 80% validation accuracy using this simple model. That is not bad considering the size of our dataset (about 8k rows) and the number of odor categories (9).
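One way to put an accuracy figure in context is to compare it with the accuracy of always predicting the most frequent class. The helper below is a hypothetical sketch that computes this baseline from a one-hot label matrix; the toy labels are made up and say nothing about the real class balance.

```python
import numpy as np

def majority_class_accuracy(one_hot_labels):
    # Column sums of a one-hot matrix are the per-class counts.
    counts = np.asarray(one_hot_labels).sum(axis=0)
    return counts.max() / counts.sum()

# Toy labels: three rows of class 0, one of class 1, one of class 2.
toy_labels = np.eye(3)[[0, 0, 0, 1, 2]]
baseline = majority_class_accuracy(toy_labels)
```

In this notebook the same function could be called as `majority_class_accuracy(test_labels)` and the result compared with the validation accuracy.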