# Machine Learning on Mushroom Toxicity

### Description of Dataset

This dataset includes 61,069 hypothetical mushrooms with caps, based on 173 species (353 mushrooms per species). Each mushroom is labeled as definitely edible, definitely poisonous, or of unknown edibility and not recommended; the last class was merged into the poisonous class.

### Variables

The target is a single binary class: edible (e) or poisonous (p), where the poisonous class also contains the mushrooms of unknown edibility.

The twenty remaining variables (n: nominal, m: metrical):

1. cap-diameter (m): float number in cm
2. cap-shape (n): bell=b, conical=c, convex=x, flat=f, sunken=s, spherical=p, others=o
3. cap-surface (n): fibrous=i, grooves=g, scaly=y, smooth=s, shiny=h, leathery=l, silky=k, sticky=t, wrinkled=w, fleshy=e
4. cap-color (n): brown=n, buff=b, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y, blue=l, orange=o, black=k
5. does-bruise-or-bleed (n): bruises-or-bleeding=t, no=f
6. gill-attachment (n): adnate=a, adnexed=x, decurrent=d, free=e, sinuate=s, pores=p, none=f, unknown=?
7. gill-spacing (n): close=c, distant=d, none=f
8. gill-color (n): see cap-color + none=f
9. stem-height (m): float number in cm
10. stem-width (m): float number in mm
11. stem-root (n): bulbous=b, swollen=s, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r
12. stem-surface (n): see cap-surface + none=f
13. stem-color (n): see cap-color + none=f
14. veil-type (n): partial=p, universal=u
15. veil-color (n): see cap-color + none=f
16. has-ring (n): ring=t, none=f
17. ring-type (n): cobwebby=c, evanescent=e, flaring=r, grooved=g, large=l, pendant=p, sheathing=s, zone=z, scaly=y, movable=m, none=f, unknown=?
18. spore-print-color (n): see cap-color
19. habitat (n): grasses=g, leaves=l, meadows=m, paths=p, heaths=h, urban=u, waste=w, woods=d
20. season (n): spring=s, summer=u, autumn=a, winter=w

### Objectives

Identify which features are most commonly associated with poisonous and non-poisonous (edible) mushrooms.
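
One simple way to approach this objective for a nominal feature is to compute the share of poisonous mushrooms within each of its levels. The sketch below is only an illustration: the `toy` frame and the helper `poisonous_rate` are hypothetical and the values are invented, so the printed rates say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "class":   ["p", "p", "e", "e", "p", "e"],
    "habitat": ["d", "d", "d", "m", "m", "g"],
})


def poisonous_rate(df, feature):
    """Share of poisonous rows for each level of a nominal feature."""
    is_poisonous = df["class"].eq("p")
    return is_poisonous.groupby(df[feature]).mean().sort_values(ascending=False)


rates = poisonous_rate(toy, "habitat")
print(rates)
```

Levels with a rate close to 1 or close to 0 are the ones most strongly associated with one class, provided the level has enough rows to be meaningful.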




## Step 1: Data Cleaning & Preprocessing

In [1]:
# Import packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Import data
mushrooms = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/Project-4/Resources/secondary_data_shuffled.csv', sep=';')

In [4]:
# Check the data
mushrooms.head()

,class,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-height,...,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,spore-print-color,habitat,season
0,e,1.26,x,g,y,f,d,,w,5.04,...,,t,y,,,f,f,,d,a
1,e,10.32,f,e,b,f,,c,b,4.68,...,,,w,,,t,f,,d,a
2,p,0.92,x,g,p,f,a,,p,4.59,...,,h,k,,,f,f,,d,u
3,p,4.27,x,,p,f,x,,w,4.55,...,,,w,,,f,f,,d,a
4,e,3.08,f,s,w,f,d,d,w,2.67,...,,,w,,,f,f,,m,a


In [5]:
# The data must be cleaned, normalized, and standardized prior to modeling 
#clean = mushrooms.dropna(axis=1)
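
The commented-out `dropna(axis=1)` would discard every column that contains any missing value at all. A gentler alternative, sketched here on a hypothetical toy frame, is to measure the share of missing values per column and drop only the columns above a chosen threshold. The name `MAX_MISSING` and its value of 0.5 are assumptions for illustration, not a setting used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; NaN marks a missing entry.
toy = pd.DataFrame({
    "cap-diameter": [1.26, 10.32, 0.92, 4.27],
    "cap-surface":  ["g", "e", "g", np.nan],
    "veil-type":    [np.nan, np.nan, np.nan, "u"],
})

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

# Hypothetical threshold: keep columns with at most 50% missing values.
MAX_MISSING = 0.5
keep = missing_share[missing_share <= MAX_MISSING].index
cleaned = toy[keep]

print(missing_share)
print(list(cleaned.columns))
```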

In [6]:
mushrooms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61069 entries, 0 to 61068
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   class                 61069 non-null  object 
 1   cap-diameter          61069 non-null  float64
 2   cap-shape             61069 non-null  object 
 3   cap-surface           46949 non-null  object 
 4   cap-color             61069 non-null  object 
 5   does-bruise-or-bleed  61069 non-null  object 
 6   gill-attachment       51185 non-null  object 
 7   gill-spacing          36006 non-null  object 
 8   gill-color            61069 non-null  object 
 9   stem-height           61069 non-null  float64
 10  stem-width            61069 non-null  float64
 11  stem-root             9531 non-null   object 
 12  stem-surface          22945 non-null  object 
 13  stem-color            61069 non-null  object 
 14  veil-type             3177 non-null   object 
 15  veil-color         

In [7]:
# Import modeling packages (pandas is already imported above)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf


In [8]:
mushrooms.nunique(axis=0)

class                      2
cap-diameter            2564
cap-shape                  7
cap-surface               11
cap-color                 12
does-bruise-or-bleed       2
gill-attachment            7
gill-spacing               3
gill-color                12
stem-height             2198
stem-width              4634
stem-root                  5
stem-surface               8
stem-color                13
veil-type                  1
veil-color                 6
has-ring                   2
ring-type                  8
spore-print-color          7
habitat                    8
season                     4
dtype: int64

In [9]:
mushrooms['class'] = mushrooms['class'].map({'e': 1, 'p': 0})
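
`Series.map` returns NaN for any value that is missing from the dictionary, so it can be worth checking the result before modeling. The following is a minimal sketch on a hypothetical `labels` series with invented values; it also reports the class balance after encoding.

```python
import pandas as pd

# Invented labels standing in for the real `class` column.
labels = pd.Series(["e", "p", "p", "e", "p"])

# Same encoding as above: edible -> 1, poisonous -> 0.
encoded = labels.map({"e": 1, "p": 0})

# Any label outside the dictionary would have become NaN.
assert encoded.notna().all(), "unexpected class label found"

# Share of each encoded class.
balance = encoded.value_counts(normalize=True)
print(balance)
```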

In [10]:
# Convert categorical data to numeric with `pd.get_dummies`
categorical_columns = [
    'cap-shape', 'cap-surface', 'cap-color', 'does-bruise-or-bleed',
    'gill-attachment', 'gill-spacing', 'gill-color', 'stem-root',
    'stem-surface', 'stem-color', 'veil-type', 'veil-color',
    'has-ring', 'ring-type', 'spore-print-color', 'habitat', 'season',
]
mushrooms = pd.get_dummies(mushrooms, columns=categorical_columns)
mushrooms.head()

,class,cap-diameter,stem-height,stem-width,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_o,cap-shape_p,cap-shape_s,...,habitat_h,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w,season_a,season_s,season_u,season_w
0,1,1.26,5.04,1.73,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,1,10.32,4.68,19.44,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0.92,4.59,1.15,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,4.27,4.55,6.52,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,1,3.08,2.67,5.18,0,0,1,0,0,0,...,0,0,1,0,0,0,1,0,0,0
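
Several nominal columns contain missing values. By default `pd.get_dummies` encodes a missing value as a row of zeros across that feature's indicator columns, while `dummy_na=True` adds an extra indicator column for it. The sketch below shows the difference on a hypothetical one-column frame; `dtype=int` is set only so the toy output prints as 0/1.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (invented values).
toy = pd.DataFrame({"stem-root": ["b", np.nan, "s", np.nan]})

# Default behaviour: missing values produce all-zero rows.
default = pd.get_dummies(toy, columns=["stem-root"], dtype=int)

# With dummy_na=True an extra column flags the missing values.
with_na = pd.get_dummies(toy, columns=["stem-root"], dummy_na=True, dtype=int)

print(default)
print(with_na)
```

Whether "missing" should be treated as its own level is a modeling choice; the notebook as written uses the default.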


## Step 3: Predictive Analyses
#### Deep Neural Net

In [11]:
# Split our preprocessed data into our features and target arrays
y = mushrooms['class'].values
X = mushrooms.drop(columns='class').values

# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3, stratify=y)
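
Because no `test_size` is passed, the split relies on scikit-learn's default. The arithmetic below is a sketch that assumes the documented default of 25% for the test set and the Keras default batch size of 32; under those assumptions the test set size matches the 478 evaluation steps reported at the end of this notebook.

```python
import math

n_rows = 61069      # rows reported by mushrooms.info()
test_size = 0.25    # assumed scikit-learn default when test_size is omitted
batch_size = 32     # assumed Keras default batch size

# scikit-learn rounds the test set size up.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

# Number of batches needed to evaluate the whole test set.
eval_steps = math.ceil(n_test / batch_size)

print(n_train, n_test, eval_steps)  # 45801 15268 478
```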

In [12]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
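
The scaler is fitted on the training data only and then applied to both sets, so no information from the test set leaks into the preprocessing. A rough NumPy equivalent of that step, on hypothetical toy arrays, looks like this (`StandardScaler` uses the population standard deviation, which is NumPy's default `ddof=0`):

```python
import numpy as np

# Invented toy data: one numeric column and one 0/1 indicator column.
X_train_toy = np.array([[1.0, 0.0], [3.0, 1.0], [5.0, 1.0], [7.0, 0.0]])
X_test_toy = np.array([[4.0, 1.0]])

# Statistics come from the training set only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

# The same statistics are reused for the test set.
train_scaled = (X_train_toy - mean) / std
test_scaled = (X_test_toy - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```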

In [13]:
# Define the model - deep neural net, i.e., the number of input features and hidden nodes for each layer.
number_input_features = len(X_train_scaled[0])
hidden_nodes_layer1 = 50
hidden_nodes_layer2 = 30

nn = tf.keras.models.Sequential()

# First hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer1, input_dim=number_input_features, activation='relu'))

# Second hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer2, activation='relu'))

# Output layer
nn.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
nn.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 50)                6000      
                                                                 
 dense_1 (Dense)             (None, 30)                1530      
                                                                 
 dense_2 (Dense)             (None, 1)                 31        
                                                                 
Total params: 7,561
Trainable params: 7,561
Non-trainable params: 0
_________________________________________________________________
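
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below rebuilds the input width from the `nunique` output above (one indicator column per observed level, plus the three numeric columns); the helper `dense_params` is a hypothetical name used only here.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units


# Unique levels per nominal column, copied from the nunique() output.
categorical_levels = [7, 11, 12, 2, 7, 3, 12, 5, 8, 13, 1, 6, 2, 8, 7, 8, 4]
n_numeric = 3  # cap-diameter, stem-height, stem-width

n_features = sum(categorical_levels) + n_numeric

# (inputs, units) for the two hidden layers and the output layer.
layers = [(n_features, 50), (50, 30), (30, 1)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layers]
total = sum(counts)

print(n_features, counts, total)  # 119 [6000, 1530, 31] 7561
```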


In [14]:
# Compile the model
nn.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])
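
For reference, binary cross-entropy averages the negative log-probability that the model assigns to the true class. The function below is a minimal NumPy sketch of that formula, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood for 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))


# Invented labels and probabilities for illustration.
loss = binary_crossentropy([1, 0, 1], [0.9, 0.1, 0.8])
print(round(loss, 4))  # 0.1446
```

A confident and correct prediction contributes a loss near zero, while a confident wrong prediction is penalized heavily.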

In [15]:
# Train the model
fit_model = nn.fit(X_train_scaled, y_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [16]:
# Evaluate the model using the test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

478/478 - 1s - loss: 7.3648e-05 - accuracy: 1.0000 - 1s/epoch - 3ms/step
Loss: 7.364802149822935e-05, Accuracy: 1.0
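
Accuracy treats both kinds of error alike, but labeling a poisonous mushroom as edible is the costlier mistake. A sketch of how the two error types could be counted separately is shown below, using the notebook's encoding (edible = 1, poisonous = 0). The helper `confusion_counts` and the toy probabilities are hypothetical; in the notebook the probabilities would come from `nn.predict(X_test_scaled)`.

```python
import numpy as np


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count correct and incorrect predictions per class (edible=1, poisonous=0)."""
    y_true = np.asarray(y_true)
    # Turn sigmoid outputs into hard 0/1 predictions.
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "edible_correct": int(np.sum((y_true == 1) & (y_pred == 1))),
        "poisonous_correct": int(np.sum((y_true == 0) & (y_pred == 0))),
        "edible_called_poisonous": int(np.sum((y_true == 1) & (y_pred == 0))),
        "poisonous_called_edible": int(np.sum((y_true == 0) & (y_pred == 1))),
    }


# Invented labels and probabilities for illustration.
counts = confusion_counts([1, 0, 0, 1, 0], [0.9, 0.2, 0.7, 0.4, 0.1])
print(counts)
```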
