# Mushroom Identifier Classification Model

In this repo I use classification models to predict whether mushrooms are edible or poisonous. The data is an old UCI dataset from about 30 years ago, but it is still in good shape today and can be found on Kaggle!

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one.

# Libraries

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import folium
import datetime as dt

from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.model_selection import (train_test_split,
                                     cross_val_score,
                                     GridSearchCV)

from sklearn.metrics import (accuracy_score,
                             f1_score,
                             confusion_matrix,
                             classification_report)

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier

from sklearn.svm import SVC

from sklearn.neighbors import KNeighborsClassifier

from sklearn.linear_model import SGDClassifier

import xgboost as xgb

# Data

In [2]:
# Load the data
df = pd.read_csv("mushrooms.csv")
df.head()

FileNotFoundError: [Errno 2] File mushrooms.csv does not exist: 'mushrooms.csv'

In [None]:
# The data is all categorical!

# EDA

In [None]:
# Do some initial research
df.info()

In [None]:
# Get a look at the statistics
df.describe()

In [None]:
# No missing values
df.isnull().sum()

In [None]:
# Check for duplicates
duplicates = df[df.duplicated()]
print(len(duplicates))

In [None]:
# Number of edible and poisonous mushrooms 
df['class'].value_counts()

In [None]:
# Look at the two classes - edible and poisonous
sns.countplot(data=df, x = 'class', palette = 'gist_earth_r')
plt.title('Mushrooms by Class', fontsize = 20)
plt.ylabel('Number of Mushrooms', fontsize=16)
plt.xlabel('Class', fontsize=16)
plt.xticks(np.arange(0, 2), ['Poisonous', 'Edible'])
plt.show()

In [None]:
# From documentation - Different types of mushroom colors
# cap-color = [brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y]
df['cap-color'].value_counts()

In [None]:
# Colors of mushrooms on a color-coded bar graph
labels = ['brown','gray','red','yellow','white','buff','pink','cinnamon','purple','green']
colors = ['#8B4513','gray','red','yellow','beige','#E0AB76','pink','brown','purple','green']

df['cap-color'].value_counts().plot(kind='bar', color = colors, figsize=(10,5))
plt.title('Colors of Mushrooms', fontsize = 20)
plt.ylabel('Number of Mushrooms', fontsize=16)
plt.xlabel('Cap Colors', fontsize=16)
plt.xticks(np.arange(0, 10), labels, rotation = 45)
plt.show()

In [None]:
# I tried to match mushroom colors with Seaborn but couldn't get the colors correct.
color_labels = ['brown','gray','red','yellow','white','buff','pink','cinnamon','purple','green']

plt.figure(figsize=(10,5))
cap_color = df['cap-color'].value_counts()
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=cap_color.index, y=cap_color.values, palette = 'gist_earth_r')
plt.title('Colors of Mushrooms', fontsize = 20)
plt.ylabel('Number of Mushrooms', fontsize=16)
plt.xlabel('Cap Colors', fontsize=16)
plt.xticks(np.arange(0, 10), color_labels)
plt.show()
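
One way to keep the bar colors aligned with the data, rather than relying on the order of a hand-written list, is to key both the display names and the colors by the single-letter codes from the documentation. The following is a minimal sketch under stated assumptions: `sample` is a small hypothetical stand-in for `df['cap-color']`, and the names `cap_color_names`, `cap_color_hex`, `tick_labels` and `bar_colors` are made up for illustration. The color choices are copied from the matplotlib cell above.

```python
import pandas as pd

# Code-to-name mapping taken from the dataset documentation quoted above
cap_color_names = {'n': 'brown', 'b': 'buff', 'c': 'cinnamon', 'g': 'gray', 'r': 'green',
                   'p': 'pink', 'u': 'purple', 'e': 'red', 'w': 'white', 'y': 'yellow'}

# Display colors reused from the matplotlib bar chart above
cap_color_hex = {'n': '#8B4513', 'g': 'gray', 'e': 'red', 'y': 'yellow', 'w': 'beige',
                 'b': '#E0AB76', 'p': 'pink', 'c': 'brown', 'u': 'purple', 'r': 'green'}

# Hypothetical stand-in for df['cap-color']
sample = pd.Series(['n', 'n', 'n', 'g', 'g', 'e'], name='cap-color')
counts = sample.value_counts()

# Look up labels and colors by code so they follow whatever order value_counts returns
tick_labels = [cap_color_names[code] for code in counts.index]
bar_colors = [cap_color_hex[code] for code in counts.index]
print(tick_labels)
print(bar_colors)
```

These lists could then be passed as `color=bar_colors` to `.plot(kind='bar')` and as the labels in `plt.xticks`. Since seaborn's `palette` argument also accepts a list of colors, `palette=bar_colors` may be enough to get matching colors there as well.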

In [None]:
# This is another visual with an earthy color palette.
color_label_again = ['brown','yellow','white','gray','red','pink','buff','purple','cinnamon','green']
plt.figure(figsize=(10,5))
sns.countplot(data=df, x = 'cap-color', hue = 'class', palette = 'gist_earth_r')
plt.title('Mushrooms by Class and Color', fontsize = 20)
plt.ylabel('Number of Mushrooms', fontsize=16)
plt.xlabel('Cap Colors', fontsize=16)
plt.xticks(np.arange(0, 10), color_label_again)
plt.legend(loc='upper right', labels = ['Poisonous', 'Edible'])
plt.show()

In [None]:
# From documentation - How do mushrooms smell?
# odor = [almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s]
df['odor'].value_counts()

In [None]:
# Visualize mushrooms by odor
odor_labels = ['pungent','almond','anise','none','foul','creosote','fishy','spicy','musty']
plt.figure(figsize=(10,5))
sns.countplot(data=df, x = 'odor', hue = 'class', palette = 'gist_earth_r')
plt.title('Mushrooms by Class and Odor', fontsize = 20)
plt.ylabel('Number of Mushrooms', fontsize=16)
plt.xlabel('Odor', fontsize=16)
plt.xticks(np.arange(0, 9), odor_labels)
plt.legend(loc='upper right', labels = ['Poisonous', 'Edible'])
plt.show()

In [None]:
# From documentation - Where do you find mushrooms?
# habitat = [grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d]
df['habitat'].value_counts()

In [None]:
# Take a look at the different habitat of mushrooms
habitat_labels = ['urban', 'grasses', 'meadows','woods','paths','waste','leaves']
plt.figure(figsize=(10,5))
sns.countplot(data=df, x = 'habitat', hue = 'class', palette = 'gist_earth_r')
plt.title('Mushrooms by Class and Habitat', fontsize = 20)
plt.ylabel('Number of Mushrooms', fontsize=16)
plt.xlabel('Habitat', fontsize=16)
plt.xticks(np.arange(0, 7), habitat_labels)
plt.legend(loc='upper right', labels = ['Poisonous', 'Edible'])
plt.show()

In [None]:
# From Documentation - What does population mean?
# population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
df['population'].value_counts()

# Label Encoding

In [None]:
# Check out the dataframe again
df.head()

In [None]:
# Dependent (target) variable
y = df['class']
y

In [None]:
# Load Label Encoder
labels = LabelEncoder()
# Transform 'p' = 1 and 'e' = 0 
y = labels.fit_transform(df['class'])
y

# Replace target values - This is what I wanted!
# y = [1 if x == 'e' else 0  for x in df['class']]

# Replace target values - This way didn't work for some reason
# y = y.replace({'class': {'e' : 1, 'p' : 0 }})
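
The commented-out `replace` call above uses the nested `{column: {old: new}}` dictionary form, which pandas documents for DataFrames, while `y` here is a single column. That may be why it did not behave as expected. A minimal sketch of the mapping the comment asks for (edible as 1, poisonous as 0) using `Series.map`, where `classes` is a small hypothetical stand-in for `df['class']`:

```python
import pandas as pd

# Hypothetical stand-in for df['class']
classes = pd.Series(['p', 'e', 'e', 'p', 'e'], name='class')

# Series.map takes a flat value-to-value dictionary
y_mapped = classes.map({'e': 1, 'p': 0})
print(y_mapped.tolist())  # [0, 1, 1, 0, 1]
```

Note that this is the reverse of the `LabelEncoder` result, which sorts the labels and therefore assigns 0 to `'e'` and 1 to `'p'`.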

In [None]:
# Drop the target to get the independent variables (features)
X = df.drop('class', axis=1)
X

In [None]:
# Use One Hot Encoding to change all categorical data to 0 and 1's
X = pd.concat([pd.get_dummies(X[col], drop_first = True) for col in X], axis=1, keys=X.columns)
X.head()
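
For comparison, `pd.get_dummies` can also be applied to a whole DataFrame in one call. The sketch below uses a tiny hypothetical frame (`toy`, with made-up rows) to show what `drop_first=True` does: a column with k categories becomes k-1 indicator columns, because the first category in sorted order is dropped.

```python
import pandas as pd

# Hypothetical two-column sample using codes from the documentation
toy = pd.DataFrame({'odor': ['a', 'n', 'f', 'n'],
                    'habitat': ['d', 'g', 'd', 'g']})

# One call encodes every categorical column; column names get a "<column>_<value>" form
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['odor_f', 'odor_n', 'habitat_g']
print(encoded.shape)          # (4, 3)
```

This variant yields flat column names such as `odor_f` instead of the two-level column index produced by `pd.concat(..., keys=X.columns)`; which one is preferable depends on how the columns are used later.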

# Test/Train Split

In [None]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)
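
The split above is purely random. As an optional variation (not something this notebook currently does), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in the training and test sets. A minimal sketch on synthetic arrays; `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 100 rows with a 60/40 class balance
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123, stratify=y_demo)

# Both splits keep the 40% share of class 1
print(y_tr.mean(), y_te.mean())
```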

# Models

## Logistic Regression Model

In [None]:
# Log model
log_clf = LogisticRegression(random_state=123)
log_model = log_clf.fit(X_train, y_train)

log_training_preds = log_clf.predict(X_train)
log_training_accuracy = accuracy_score(y_train, log_training_preds)

log_val_preds = log_clf.predict(X_test)
log_val_accuracy = accuracy_score(y_test, log_val_preds)

print("Log Training Accuracy: {:.4}%".format(log_training_accuracy * 100))
print("Log Validation Accuracy: {:.4}%".format(log_val_accuracy * 100))

In [None]:
# Classification report, precision and recall
print(classification_report(y_test, log_val_preds))

In [None]:
# Confusion matrix 
log_matrix = confusion_matrix(y_test, log_val_preds)
print('Confusion Matrix:\n', log_matrix)

## Random Forest Model

In [None]:
# Forest Model
forest_clf = RandomForestClassifier(random_state=123)  # fixed seed so results are reproducible
forest_model = forest_clf.fit(X_train, y_train)

forest_training_preds = forest_clf.predict(X_train)
forest_training_accuracy = accuracy_score(y_train, forest_training_preds)

forest_val_preds = forest_clf.predict(X_test) # y_hat
forest_val_accuracy = accuracy_score(y_test, forest_val_preds)

print("Forest Training Accuracy: {:.4}%".format(forest_training_accuracy * 100))
print("Forest Validation accuracy: {:.4}%".format(forest_val_accuracy * 100))

In [None]:
# Classification report
print(classification_report(y_test, forest_val_preds))

In [None]:
# Confusion matrix 
forest_matrix = confusion_matrix(y_test, forest_val_preds)
print('Confusion Matrix:\n', forest_matrix)

## XGBoost

In [None]:
# XGB classifier
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train, y_train)

xgb_training_preds = xgb_clf.predict(X_train)
xgb_training_accuracy = accuracy_score(y_train, xgb_training_preds)

xgb_val_preds = xgb_clf.predict(X_test)
xgb_val_accuracy = accuracy_score(y_test, xgb_val_preds)

print("XGB Training Accuracy: {:.4}%".format(xgb_training_accuracy * 100))
print("XGB Validation accuracy: {:.4}%".format(xgb_val_accuracy * 100))

In [None]:
# XGB classification report
print(classification_report(y_test, xgb_val_preds))

In [None]:
# Regular confusion matrix
xgb_matrix = confusion_matrix(y_test, xgb_val_preds)
print('Confusion Matrix:\n', xgb_matrix)

## KNN Model

In [None]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

knn_training_preds = knn_clf.predict(X_train)
knn_training_accuracy = accuracy_score(y_train, knn_training_preds)

knn_val_preds = knn_clf.predict(X_test)
knn_val_accuracy = accuracy_score(y_test, knn_val_preds)

print("KNN Training Accuracy: {:.4}%".format(knn_training_accuracy * 100))
print("KNN Validation accuracy: {:.4}%".format(knn_val_accuracy * 100))

In [None]:
# KNN classification report
print(classification_report(y_test, knn_val_preds))

In [None]:
# Regular confusion matrix
knn_matrix = confusion_matrix(y_test, knn_val_preds)
print('Confusion Matrix:\n', knn_matrix)

## Support Vector Model - SVC

In [None]:
svc_clf = SVC()
svc_clf.fit(X_train, y_train)

svc_training_preds = svc_clf.predict(X_train)
svc_training_accuracy = accuracy_score(y_train, svc_training_preds)

svc_val_preds = svc_clf.predict(X_test)
svc_val_accuracy = accuracy_score(y_test, svc_val_preds)

print("SVC Training Accuracy: {:.4}%".format(svc_training_accuracy * 100))
print("SVC Validation accuracy: {:.4}%".format(svc_val_accuracy * 100))

In [None]:
# SVC classification report
print(classification_report(y_test, svc_val_preds))

In [None]:
# Regular confusion matrix
svc_matrix = confusion_matrix(y_test, svc_val_preds)
print('Confusion Matrix:\n', svc_matrix)

## Stochastic Gradient Descent Model

In [None]:
sgd_clf = SGDClassifier()
sgd_clf.fit(X_train, y_train)

sgd_training_preds = sgd_clf.predict(X_train)
sgd_training_accuracy = accuracy_score(y_train, sgd_training_preds)

sgd_val_preds = sgd_clf.predict(X_test)
sgd_val_accuracy = accuracy_score(y_test, sgd_val_preds)

print("Stochastic GD Training Accuracy: {:.4}%".format(sgd_training_accuracy * 100))
print("Stochastic GD Validation accuracy: {:.4}%".format(sgd_val_accuracy * 100))

In [None]:
# Stochastic classification report
print(classification_report(y_test, sgd_val_preds))

In [None]:
# Regular confusion matrix
sgd_matrix = confusion_matrix(y_test, sgd_val_preds)
print('Confusion Matrix:\n', sgd_matrix)
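
Every model section above repeats the same fit, predict and score steps, so they could be wrapped in one helper. This is only a sketch: `evaluate` is a hypothetical function name, and the data comes from scikit-learn's `make_classification` as a stand-in for the encoded mushroom features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate(clf, X_train, X_test, y_train, y_test):
    """Fit clf and return (training accuracy, validation accuracy)."""
    clf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    val_acc = accuracy_score(y_test, clf.predict(X_test))
    return train_acc, val_acc


# Synthetic data standing in for the encoded mushroom features
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=123)

models = {'Log': LogisticRegression(max_iter=1000), 'KNN': KNeighborsClassifier()}
results = {name: evaluate(clf, X_tr, X_te, y_tr, y_te) for name, clf in models.items()}

for name, (train_acc, val_acc) in results.items():
    print("{} Training Accuracy: {:.4}%".format(name, train_acc * 100))
    print("{} Validation Accuracy: {:.4}%".format(name, val_acc * 100))
```

Keeping the scoring in one place also makes it harder to pass one model's predictions into another model's accuracy calculation by accident.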

# Summary

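Something looks off with my models: every one of them scores 100% accuracy. My first suspicion is over-fitting, although over-fitting usually shows up as a gap between training and validation accuracy rather than a perfect score on both. What can I do to check and correct this?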
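
One way to probe whether a perfect score reflects over-fitting is k-fold cross-validation with `cross_val_score`, which is already imported at the top of the notebook. If the accuracy stays high and stable across folds the model never saw during fitting, over-fitting becomes a less likely explanation. The sketch below runs on synthetic data from `make_classification`; in the notebook, `X` and `y` would take the place of the hypothetical `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the encoded features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=123)

clf = RandomForestClassifier(n_estimators=50, random_state=123)

# Five folds: each fold is held out once while the model trains on the other four
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')

print("Fold accuracies:", scores.round(3))
print("Mean: {:.3f}, std: {:.3f}".format(scores.mean(), scores.std()))
```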

# Future Work

I would love to go find some mushrooms around my house and categorize them in order to determine if they are poisonous or not! For now I just need to work on fixing my models.