# Predicting Mushroom Edibility

Disclaimer: Do not use any of the data or machine learning techniques found in this notebook to decide whether or not to pick a mushroom in the wild. The dataset used only includes information on 23 different species of mushroom and is not representative of all fungi and their toxicological properties. Predictions made by the machine learning models in this notebook are also not necessarily indicative of whether a mushroom is poisonous or not.

I have never gone foraging for mushrooms and quite frankly, I don't plan to in my lifetime.  There's a very good reason why: if you pick the wrong mushroom and eat it, you could DIE! That's a terrifying prospect. If only there were a way to know if you picked an edible mushroom... 

The intent of this project is to determine the best-performing machine learning model for predicting the edibility of mushrooms. The data used for this project was provided by UCI Machine Learning on Kaggle. It can be downloaded [here](https://www.kaggle.com/uciml/mushroom-classification). Here's a quick overview of what each of the columns represents:

* class: Indicates whether the mushroom is edible or not; edible=e, poisonous=p
* cap-shape: Describes the shape of the mushroom cap; bell=b, conical=c, convex=x, flat=f,  knobbed=k, sunken=s
* cap-surface: Describes the surface texture of the mushroom cap; fibrous=f, grooves=g, scaly=y, smooth=s
* cap-color: Indicates the mushroom's cap colour; brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
* bruises: Identifies whether there are any bruises on the mushroom; bruises=t, no=f
* odor: Describes the odor of the mushroom; almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
* gill-attachment: Describes how the gill of the mushroom is attached; attached=a, descending=d, free=f, notched=n
* gill-spacing: Describes the spacing of the mushroom's gill; close=c, crowded=w, distant=d
* gill-size: Describes the size of the mushroom's gills; broad=b, narrow=n
* gill-color: Indicates the colour of the mushroom's gills; black=k, brown=n, buff=b, chocolate=h, gray=g,  green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
* stalk-shape: Describes the shape of the mushroom's stalk; enlarging=e, tapering=t
* stalk-root: Describes the shape of the "stalk" root of the mushroom;  bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
* stalk-surface-above-ring: Describes the texture of the surface above the mushroom's ring; fibrous=f, scaly=y, silky=k, smooth=s
* stalk-surface-below-ring: Describes the texture of the surface below the mushroom's ring; fibrous=f, scaly=y, silky=k, smooth=s
* stalk-color-above-ring: Indicates the colour of the stalk above the mushroom's ring; brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
* stalk-color-below-ring: Indicates the colour of the stalk below the mushroom's ring; brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
* veil-type: Describes the type of veil the mushroom has; partial=p, universal=u
* veil-color: Indicates the colour of the veil of the mushroom; brown=n, orange=o, white=w, yellow=y
* ring-number: Indicates the number of rings on the stalk of the mushroom; none=n, one=o, two=t
* ring-type: Describes the type of ring on the stalk of the mushroom; cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
* spore-print-color: Indicates the colour of the spore print; black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
* population: Describes population of mushrooms; abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
* habitat: Indicates the type of habitat that the mushroom can be found in; grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

To start, we will import the libraries required for the intended analysis. Then, we can import the data using the `pd.read_csv()` function which returns a Pandas DataFrame.

In [1]:
# Import libraries:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, KFold
import time
import warnings
warnings.filterwarnings('ignore')

# Use a forward slash so the path is not affected by backslash escape sequences:
raw_mushroom = pd.read_csv("mushroom_data/mushrooms.csv")

## Data Exploration
Next, let's take a look at our data. We will show the top 5 rows of the dataset, using the `.head()` method, to get a sense of what kinds of data we have at our disposal. We will then use the `.info()` method to understand what datatypes have been assigned to each column and whether there are any null values in the dataset.

In [2]:
# Explore the data
raw_mushroom.head()

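|   | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |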


In [3]:
raw_mushroom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

According to the documentation, the 'class' column represents whether or not the mushroom is edible ('p' representing poisonous and 'e' representing edible). Given that this is exactly what we are trying to predict, we will use this column as the prediction target.

It is also clear from the documentation that the remaining columns all contain data pertaining to the physical description of each mushroom. Although certain attributes of the fungi may be more indicative of whether or not a mushroom is edible, we will try our luck and see if we can make an accurate prediction model based on all of the features provided. There is also no need for feature engineering in this instance, as the data given provides a thorough description of each fungus.

There are a few other observations that are worth noting:
* It looks like there aren't any null values present in the dataset, which means we won't have to worry about them when processing the data. Note, however, that the documentation encodes missing stalk-root values as '?', which `.info()` counts as a regular non-null value.
* All of the columns are shown as dtype object. Having a quick glance over the `.head()` output and the documentation, it's clear that every column holds string values.
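
As a hedged illustration of the point about null values, the sketch below shows why a placeholder such as '?' is invisible to pandas' null checks. The `toy` DataFrame and its values are made up for this example and only stand in for `raw_mushroom`.

```python
import pandas as pd

# Toy frame standing in for raw_mushroom; the values are illustrative only.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "stalk-root": ["e", "c", "?", "?"],
    "odor": ["p", "a", "l", "n"],
})

# isnull() finds nothing, because '?' is an ordinary string value.
null_counts = toy.isnull().sum()

# Counting the placeholder explicitly reveals the hidden gaps.
placeholder_counts = (toy == "?").sum()

print(null_counts.to_dict())
print(placeholder_counts.to_dict())
```

Running the same two checks on the real DataFrame would show how many stalk-root entries are actually missing. In the one-hot encoding used later, '?' simply becomes its own category.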

## Data Processing
Before we throw our DataFrame into a machine learning algorithm, we need to process the data a bit. Because each column is a categorical variable describing a given attribute of the mushroom, we will need to transform those columns into dummy (one-hot) columns. The code below achieves this.

In [4]:
# Make dummy columns so that data can be used in machine-learning algorithm:

cols = list(raw_mushroom.drop('class', axis=1).columns)
features_df = pd.DataFrame()

for col in cols:
    dummy = pd.get_dummies(raw_mushroom[col], prefix=col)
    features_df = pd.concat([features_df, dummy], axis=1)
    
features = list(features_df.columns)

mushrooms = pd.concat([raw_mushroom['class'], features_df], axis=1)

Now that the data is in a numeric format, it can be fit to a machine learning model.
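
For comparison, here is a minimal sketch of the same encoding done in a single `pd.get_dummies()` call on the whole feature frame, rather than column by column. The `toy` DataFrame is invented for illustration; the check at the end only confirms that the two approaches agree on this small example.

```python
import pandas as pd

# Toy frame standing in for raw_mushroom; the values are illustrative only.
toy = pd.DataFrame({
    "class": ["p", "e", "e"],
    "odor": ["p", "a", "l"],
    "bruises": ["t", "t", "f"],
})

# Column-by-column encoding, mirroring the loop used in the notebook.
looped = pd.concat(
    [pd.get_dummies(toy[col], prefix=col) for col in ["odor", "bruises"]],
    axis=1,
)

# Single call: every string column is encoded and prefixed with its name.
one_shot = pd.get_dummies(toy.drop(columns="class"))

print(list(one_shot.columns))
print(looped.equals(one_shot))
```

The single call is shorter, while the loop makes it easier to treat individual columns differently if that ever becomes necessary.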

## Model Creation and Evaluation
Before we commit to a particular model for making predictions, it would be ideal to create a number of different models and compare how well they perform relative to one another.

In this scenario, the consequence of identifying a poisonous mushroom as edible is far greater than that of labelling an edible mushroom poisonous. Treating "edible" as the positive class, it is therefore appropriate to use the false positive rate (fpr), the share of poisonous mushrooms predicted to be edible, as the primary metric for evaluating the model's performance.

It is also prudent to evaluate the model's performance more generally. Because predictions are a binary classification, ROC AUC is a great metric to look at.

Finally, as a tertiary goal, we want to pick as many mushrooms as possible. The true positive rate (tpr) can be used to evaluate how well we do in this sense.
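
To make the two rates concrete, here is a small worked sketch in plain Python. It assumes "edible" ('e') is the positive class, as in the rest of this notebook; the helper name `rates` and the label lists are invented for the example.

```python
def rates(y_true, y_pred, positive="e"):
    """Return (fpr, tpr) for two equal-length label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fpr = fp / (fp + tn)  # share of poisonous mushrooms predicted edible
    tpr = tp / (tp + fn)  # share of edible mushrooms predicted edible
    return fpr, tpr

# Four edible and four poisonous mushrooms, with one mistake of each kind.
y_true = ["e", "e", "e", "p", "p", "p", "p", "e"]
y_pred = ["e", "e", "p", "p", "e", "p", "p", "e"]

fpr, tpr = rates(y_true, y_pred)
print(fpr, tpr)  # 0.25 0.75
```

Here one of four poisonous mushrooms is wrongly called edible (fpr = 0.25) and three of four edible mushrooms are correctly picked (tpr = 0.75).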

In the event that the performance of two models is approximately the same, the time it takes for the model to run will be taken into consideration.

The functions below take a model and the pertinent data as inputs and return the evaluation metrics of interest using K-fold validation.
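
As a rough sketch of how run time can be measured, the helper below wraps any function call and returns its result together with the elapsed time. The name `timed` is hypothetical, and it uses `time.perf_counter()`, a standard-library clock intended for measuring intervals; the `sum` call is only a stand-in workload.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in this notebook it would be a model evaluation call.
result, seconds = timed(sum, range(1_000_000))
print(result, f"{seconds:.4f} s")
```

A single timing like this is noisy, so small differences between models should not be over-interpreted.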

In [5]:
def positive_rates(k, model, df):
    """
    Estimate the false positive rate and true positive rate of an sklearn model using k-fold validation.
    
    Arguments:
    k -- Number of folds in k-fold validation.
    model -- sklearn model to train.
    df -- Pandas DataFrame containing training data.
    
    Returns: 
    mean_fpr -- The average of the false positive rates calculated from each fold in the kfold validation.
    mean_tpr -- The average of the true positive rates calculated from each fold in the kfold validation.
    """
    # Create 'sklearn.model_selection._split.KFold' object:
    kf = KFold(k, shuffle=True, random_state=1)
    
    # Create empty lists to store fpr and tpr values for each fold:
    fp_rates = []
    tp_rates = []
    
    # Create train and test datasets for different folds in the data :
    for train_index, test_index in kf.split(df):
        train = df.iloc[train_index]
        test = df.iloc[test_index]
        
        # Fit model with train dataset and make predictions on test set (for given fold):
        model.fit(train[features], train['class'])
        predictions = model.predict(test[features])
        
        # Find number of true positives (for given fold):
        tp_filter = (predictions == 'e') & (test['class'] == 'e')
        tp = len(predictions[tp_filter])
        
        # Find number of true negatives (for given fold):
        tn_filter = (predictions == 'p') & (test['class'] == 'p')
        tn = len(predictions[tn_filter])
        
        # Find number of false positives (for given fold):
        fp_filter = (predictions == 'e') & (test['class'] == 'p')
        fp = len(predictions[fp_filter])
        
        # Find number of false negatives (for given fold):
        fn_filter = (predictions == 'p') & (test['class'] == 'e')
        fn = len(predictions[fn_filter])
        
        # Determine false positive rate and true positive rate (for given fold):
        fpr = fp / (fp + tn)
        tpr = tp / (tp + fn)
        
        # Append fpr and tpr to lists of rates:
        fp_rates.append(fpr)
        tp_rates.append(tpr)
    
    # Average the per-fold rates to get the overall fpr and tpr estimates:
    mean_fpr = np.mean(fp_rates)
    mean_tpr = np.mean(tp_rates)
    
    return mean_fpr, mean_tpr

def auc_roc_validation(k, model, df):
    """
    Estimate the ROC AUC of an sklearn model using k-fold validation.
    
    Arguments:
    k -- Number of folds in k-fold validation.
    model -- sklearn model to train.
    df -- Pandas DataFrame containing training data.
    
    Returns: 
    auc -- The average of the ROC AUCs calculated from each fold in the kfold validation.
    """
    # Create 'sklearn.model_selection._split.KFold' object:
    kf = KFold(k, shuffle=True, random_state=1)
    
    # Determine ROC AUC for each fold:
    aucs = cross_val_score(model, df[features], df['class'], scoring='roc_auc', cv=kf)
    
    # Calculate average ROC AUC:
    auc = np.mean(aucs)
    
    return auc
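
As an alternative to counting each outcome by hand, the same four counts can be read off scikit-learn's `confusion_matrix`. The sketch below is only an illustration under stated assumptions: the data is synthetic (a single made-up odor feature that fully determines the class), and it pools out-of-fold predictions with `cross_val_predict`, so its rates are computed over all rows at once rather than averaged per fold as in `positive_rates`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data: 'f' (foul) odor means poisonous, anything else edible.
rng = np.random.default_rng(1)
odor = rng.choice(["a", "f", "n"], size=200)
X = pd.get_dummies(pd.Series(odor), prefix="odor")
y = pd.Series(np.where(odor == "f", "p", "e"), name="class")

# Out-of-fold predictions for every row, using the same KFold settings as above.
kf = KFold(5, shuffle=True, random_state=1)
pred = cross_val_predict(LogisticRegression(), X, y, cv=kf)

# With labels=["p", "e"], 'e' is the positive class and ravel() yields
# the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y, pred, labels=["p", "e"]).ravel()
fpr = fp / (fp + tn)
tpr = tp / (tp + fn)
print(fpr, tpr)
```

Passing `labels` explicitly matters here: it fixes which class is treated as positive instead of relying on the alphabetical order of the label strings.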

Now that the tools to evaluate multiple models have been created, it's time to instantiate some machine learning models. Because we are making binary classification predictions, the three types of models we will explore will be:
* Logistic Regression
* Random Forest Classifier
* Feed Forward Neural Network

The code below instantiates these models.

In [6]:
lr = LogisticRegression()
rfc = RandomForestClassifier(n_estimators=200, random_state=1, min_samples_leaf=2, max_depth=15)
mlp = MLPClassifier(hidden_layer_sizes=(5,4,4), max_iter=1000)

We can now take the models and data and feed them into the evaluation functions to determine which model is best suited for predicting the edibility of mushrooms!

In [7]:
# Evaluate Logistic Regression Performance:
time1 = time.time()
lr_fp, lr_tp = positive_rates(5, lr, mushrooms)
lr_auc = auc_roc_validation(5, lr, mushrooms)
time2 = time.time()
print("False Positive Rate: ", lr_fp, "\n", "True Positive Rate: ", lr_tp, "\n", "ROC AUC: ", lr_auc, "\n", "Time to make predictions: ", str(time2-time1), " s")

False Positive Rate:  0.0007453416149068323 
 True Positive Rate:  1.0 
 ROC AUC:  1.0 
 Time to make predictions:  4.0169517993927  s


In [8]:
# Evaluate Random Forest Performance:
time1 = time.time()
rfc_fp, rfc_tp = positive_rates(5, rfc, mushrooms)
rfc_auc = auc_roc_validation(5, rfc, mushrooms)
time2 = time.time()
print("False Positive Rate: ", rfc_fp, "\n", "True Positive Rate: ", rfc_tp, "\n", "ROC AUC: ", rfc_auc, "\n", "Time to make predictions: ", str(time2-time1), " s")

False Positive Rate:  0.0 
 True Positive Rate:  1.0 
 ROC AUC:  1.0 
 Time to make predictions:  11.94596791267395  s


In [9]:
# Evaluate Feed-forward Neural Network Performance:
time1 = time.time()
mlp_fp, mlp_tp = positive_rates(5, mlp, mushrooms)
mlp_auc = auc_roc_validation(5, mlp, mushrooms)
time2 = time.time()
print("False Positive Rate: ", mlp_fp, "\n", "True Positive Rate: ", mlp_tp, "\n", "ROC AUC: ", mlp_auc, "\n", "Time to make predictions: ", str(time2-time1), " s")

False Positive Rate:  0.0 
 True Positive Rate:  1.0 
 ROC AUC:  1.0 
 Time to make predictions:  74.24195551872253  s


## Discussion / Conclusions / Next Steps
From the evaluation metrics used, it is clear that all models performed almost perfectly. The only model that falls short of perfect is the Logistic Regression model, which falsely predicted that a small number of poisonous mushrooms were edible. Because false positives are the costliest error in this scenario, this excludes it from being the model of choice.

Given that the performance of the RandomForestClassifier and the MLPClassifier is virtually the same, it is preferable to use the RandomForestClassifier as it takes far less time to run. How fitting that the best machine learning model for mushroom edibility classification is a "Forest" classifier!

Unfortunately, the models performed almost too well for the purposes of the model design process. We could still undertake an optimization effort, but there would be little to gain given that our models predicted the edibility of the mushrooms "perfectly".

## References
[1] Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf