# Mushroom Classification

### Goal
The goal of this project is to classify mushrooms as either safe to eat or poisonous based on their physical features. The dataset was originally contributed to the UCI Machine Learning Repository.  
In doing so, I attempt to answer the following questions:
- What model works best to classify this data?
- Which features are important when distinguishing a poisonous mushroom from one that is safe to eat?

### Description
This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for poison oak and ivy.

Dataset and information about the dataset can be found here: https://www.kaggle.com/uciml/mushroom-classification

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

In [None]:
data = pd.read_csv('../input/mushroom-classification/mushrooms.csv')
data.head()

## Data Exploration

In [None]:
#Data about information in each column in dataset
for col in data.columns:
    print(col, ':', data[col].unique())

In [None]:
#Finding missing values
data.isna().sum()

-------------------------------------------------------------------------------------------------------------------------

Every column in the dataset consists of categorical variables.

There are no columns with numerical data. Additionally, there are no missing values in the dataset at all. 

This makes the preprocessing stage a breeze. 

However, do note that the column 'veil-type' contains only a single value. This makes the column useless to the dataset because it cannot help distinguish one mushroom from another. Therefore, we shall drop this column.
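
A constant column like this can also be detected programmatically rather than by reading the printed unique values. The following is a minimal sketch on a small made-up frame (the `toy` data and its values are illustrative assumptions, not rows from the real dataset):

```python
import pandas as pd

# Toy frame standing in for the mushroom data (values are illustrative only).
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "veil-type": ["p", "p", "p", "p"],  # constant column
    "odor": ["f", "n", "a", "f"],
})

# A column with a single distinct value cannot help separate the classes.
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_columns)

# errors="ignore" keeps the drop safe to re-run in a notebook cell.
toy = toy.drop(columns=constant_columns, errors="ignore")
print(list(toy.columns))
```

Using `errors="ignore"` is one way to make the cell idempotent without wrapping the call in `try`/`except`.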

All we need to do is bifurcate the columns into two lists. The first will contain all the columns that need to be label encoded, while the second will contain those that need to be one-hot encoded.

Before that, however, I label encode the entire dataset in order to better visualise the trends in the data.
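
It helps to know how the label encoder assigns its integers before reading the plots. As a small sketch, assuming the class column uses the codes `'e'` (edible) and `'p'` (poisonous) as in the Kaggle file, `LabelEncoder` sorts the labels and numbers them in that order:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative class values; 'e' = edible, 'p' = poisonous (assumed coding).
encoder = LabelEncoder()
codes = encoder.fit_transform(["p", "e", "e", "p"])

# classes_ holds the sorted labels; the position of each label is its code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Because poisonous maps to 1 under this ordering, summing the encoded class column counts poisonous mushrooms. The integers for the other features are arbitrary identifiers and carry no ordinal meaning.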

-------------------------------------------------------------------------------------------------------------------------

In [None]:
#Creating labeled dataframe in order to visualize trends in the data
try:
    data.drop(columns = 'veil-type', inplace = True)
    print('veil-type dropped')
except KeyError:
    #Column was already dropped on an earlier run of this cell
    pass

lenc = LabelEncoder()
labeled_data = pd.DataFrame()
for col in data.columns:
    labeled_data[col] = lenc.fit_transform(data[col]).astype('int64')
    print(col, 'done')

In [None]:
#Distribution of data in the dataframe in the form of histograms.
fig = plt.figure(figsize = (30,30))
ax = fig.gca()
labeled_data.hist(ax=ax)
fig.suptitle('Distributions of each Feature', size = 30)
fig.tight_layout()
fig.subplots_adjust(top=0.95)
plt.show()

In [None]:
#Plotting features vs class data to uncover any strong trends between features and target class
fig, ax = plt.subplots(6,4, figsize = (28,40))
columns = list(labeled_data.columns)
n = 1
for i in range(6):
    for j in range(4):
        if n == len(columns):
            #No features left: hide the unused subplot
            ax[i,j].set_visible(False)
            continue
        #Sum of the encoded class (poisonous = 1) gives the number of poisonous mushrooms per category
        poisonous_counts = labeled_data.groupby(columns[n])['class'].sum()
        sns.barplot(x = poisonous_counts.index,
                    y = poisonous_counts.values,
                   palette = 'viridis',
                   ax = ax[i,j]
                   ).set_title(columns[n], size = 20)
        ax[i,j].set_xlabel(" ")
        n += 1
fig.suptitle('Poisonous Mushrooms in each category', size = 30)
fig.tight_layout()
fig.subplots_adjust(top=0.95)
plt.show()

-------------------------------------------------------------------------------------------------------------------
The bar charts above display the number of poisonous mushrooms for each category of each feature.

We notice that for some features, such as 'ring-number' and 'gill-attachment', the poisonous counts are heavily concentrated in a single category. Features whose categories separate the two classes well are strongly associated with our target class, which would allow us to make good predictions based on those features alone. Note, however, that the bars show raw counts rather than proportions: a tall bar for 'ring-number' = 1 tells us that many poisonous mushrooms fall in that category, but it only implies a high probability of being poisonous if that category does not also contain many edible mushrooms.
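
To separate "this category is common" from "this category is mostly poisonous", the counts can be normalized by the category size. A minimal sketch on made-up numbers (the `toy` frame is an assumption for illustration, not the real data):

```python
import pandas as pd

# Illustrative label-encoded data; 1 = poisonous, 0 = edible.
toy = pd.DataFrame({
    "class": [1, 0, 0, 1, 0, 0, 1, 0],
    "ring-number": [1, 1, 1, 1, 1, 1, 2, 2],
})

# Count poisonous mushrooms and all mushrooms per category, then take the ratio.
summary = toy.groupby("ring-number")["class"].agg(poisonous="sum", total="count")
summary["poisonous_rate"] = summary["poisonous"] / summary["total"]
print(summary)
```

In this toy example category 1 has the larger poisonous count but the lower poisonous rate, which is exactly the situation a raw-count bar chart cannot reveal.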


-------------------------------------------------------------------------------------------------------------------

## Data Preprocessing

In [None]:
#Creating lists of columns to be label encoded and onehot encoded.
#Columns to be label encoded have exactly 2 categories, while those to be onehot encoded have more than 2 categories.
label_encoding_columns = [col for col in data.columns if data[col].nunique() == 2]
onehot_encoding_columns = [col for col in data.columns if col not in label_encoding_columns]

In [None]:
#Label Encoding and Onehot Encoding
label_encoder = LabelEncoder()
onehot_encoder = OneHotEncoder(handle_unknown='ignore')

In [None]:
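#Creating a Transformer to smoothly transform columns
#LabelEncoder only accepts a single 1-D target array, so OrdinalEncoder is used for the binary feature columns here
#Note: this preprocessor is not fitted or used in the cells below
from sklearn.preprocessing import OrdinalEncoder

preprocessor = ColumnTransformer(
    transformers=[
    ('label', OrdinalEncoder(), label_encoding_columns),
    ('onehot', onehot_encoder, onehot_encoding_columns)
])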
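
For reference, here is a hedged sketch of how a `ColumnTransformer` could perform both encodings in one step. `LabelEncoder` is designed for target vectors and its `fit_transform` accepts a single 1-D array, so this sketch uses `OrdinalEncoder` for the two-category columns instead. The `toy` frame and the names `binary_columns`, `multi_columns` and `toy_preprocessor` are assumptions made for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Illustrative feature columns only (no target column).
toy = pd.DataFrame({
    "bruises": ["t", "f", "t", "f"],  # two categories
    "odor": ["a", "n", "f", "n"],     # more than two categories
})

binary_columns = [c for c in toy.columns if toy[c].nunique() == 2]
multi_columns = [c for c in toy.columns if c not in binary_columns]

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("ordinal", OrdinalEncoder(), binary_columns),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), multi_columns),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = toy_preprocessor.fit_transform(toy)
print(encoded.shape)
```

One advantage of this route is that the fitted transformer can be reused on unseen data, whereas `pd.get_dummies` derives its columns from whatever categories happen to be present.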

In [None]:
#Create a dataframe with labeled data
data_label = pd.DataFrame()
for col in label_encoding_columns:
    data_label[col] = label_encoder.fit_transform(data[col]).astype('int64')
    print(col, 'label encoded')

In [None]:
#Create dataframe with onehot encoded (dummy) data
data_dummies = pd.get_dummies(data[onehot_encoding_columns], columns = onehot_encoding_columns, drop_first = False, dtype= 'int64')

In [None]:
#Concatenate the labeled and onehot dataframes into a single dataframe in which every column is either label encoded or onehot encoded
data_encoded = pd.concat([data_label, data_dummies], axis = 1)

In [None]:
data_label.shape, data_dummies.shape, data_encoded.shape

## Modelling

In [None]:
#Splitting data into features X and target y
X = data_encoded.drop('class', axis = 1)
y = data_encoded['class']

In [None]:
#Splitting data into train and test 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
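
The split above is random on every run, so the reported score can vary slightly between runs. A possible variant, sketched here on made-up data, fixes `random_state` for reproducibility and uses `stratify` to keep the class balance the same in both parts (the toy names and the seed value 42 are arbitrary assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: 20 samples, perfectly balanced classes.
toy_X = pd.DataFrame({"feature": range(20)})
toy_y = pd.Series([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=42
)

# Both parts keep the 50/50 class ratio of the full data.
print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```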

### Random Forest Classifier

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)
score = accuracy_score(y_test, y_pred)

print('The accuracy score for Random Forest Classifier is %.4f'%(score))
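
One way to approach the second question from the Goal section (which features matter) is to inspect the `feature_importances_` attribute of a fitted random forest. The sketch below uses synthetic data in which one feature fully determines the target; the column names and seeds are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the target is fully determined by the "informative" column.
rng = np.random.default_rng(0)
n_samples = 200
toy_X = pd.DataFrame({
    "informative": rng.integers(0, 2, n_samples),
    "noise_a": rng.integers(0, 2, n_samples),
    "noise_b": rng.integers(0, 2, n_samples),
})
toy_y = toy_X["informative"]

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(toy_X, toy_y)

# Impurity-based importances, one value per input column.
importances = pd.Series(forest.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```

Applied to the notebook's model, the same idea would read `pd.Series(rfc.feature_importances_, index=X_train.columns)`. Since most columns are one-hot encoded, each category gets its own importance, so the values may need to be summed per original feature before comparing features.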

---


In conclusion, we obtained a perfect accuracy score on our test set. Normally, a perfect score would suggest that something is wrong, such as overfitting or data leakage. In this case, however, several features are so heavily skewed and so strongly associated with our target that the model was able to obtain perfect accuracy with minimal effort and no hyperparameter tuning. This result is atypical and should not be expected to carry over to other datasets.
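
A score from a single train/test split can be checked with k-fold cross-validation, which repeats the evaluation on several different splits. This is a minimal sketch on synthetic data rather than the mushroom dataset; the fold count and seeds are arbitrary assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: the target is fully determined by the "informative" column.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "informative": rng.integers(0, 2, 200),
    "noise": rng.integers(0, 2, 200),
})
toy_y = toy_X["informative"]

# Five accuracy scores, one per held-out fold.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    toy_X, toy_y, cv=5, scoring="accuracy",
)
print(scores.mean(), scores.std())
```

If every fold scores close to the single-split result, the result is less likely to be an accident of one particular split.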

Thank you for reading! Please comment down below if you have any questions!

---