# Mushroom Edibility Classification

Author: Harrison Hong

Course Project, UC Irvine, Math 10, Summer 2023

## Introduction

This project analyzes mushrooms and attempts to classify each one as either poisonous or edible based on the 22 categorical attributes recorded for it. Two classification methods are used, logistic regression and naive Bayes, and their results are compared. Data visualization is done with the Altair library.

## Project

In [1]:
import numpy as np
import pandas as pd
import altair as alt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

### Loading the Data

The data is presented as a series of categorical attributes. The `class` attribute indicates whether the given mushroom is poisonous or not, with `e` being edible and `p` being poisonous. This will be our target variable.

The remaining attributes are also encoded with a single character per category. The meaning of each character can be found in the included `agaricus-lepiota.names` file.

In [2]:
# dictionary of attributes as provided in file
# possible categories for each attribute are provided in tuple as value corresponding to key
ATTRIBUTES = {
    "class":                    ("edible=e","poisonous=p"),
    "cap-shape":                ("bell=b","conical=c","convex=x","flat=f",
                                "knobbed=k","sunken=s"),
    "cap-surface":              ("fibrous=f","grooves=g","scaly=y","smooth=s"),
    "cap-color":                ("brown=n","buff=b","cinnamon=c","gray=g","green=r",
                                "pink=p","purple=u","red=e","white=w","yellow=y"),
    "bruises?":                 ("bruised=t","unbruised=f"),
    "odor":                     ("almond=a","anise=l","creosote=c","fishy=y","foul=f",
                                "musty=m","none=n","pungent=p","spicy=s"),
    "gill-attachment":          ("attached=a","descending=d","free=f","notched=n"),
    "gill-spacing":             ("close=c","crowded=w","distant=d"),
    "gill-size":                ("broad=b","narrow=n"),
    "gill-color":               ("black=k","brown=n","buff=b","chocolate=h","gray=g",
                                "green=r","orange=o","pink=p","purple=u","red=e",
                                "white=w","yellow=y"),
    "stalk-shape":              ("enlarging=e","tapering=t"),
    "stalk-root":               ("bulbous=b","club=c","cup=u","equal=e",
                                "rhizomorphs=z","rooted=r"), # removed "missing" category
    "stalk-surface-above-ring": ("fibrous=f","scaly=y","silky=k","smooth=s"),
    "stalk-surface-below-ring": ("fibrous=f","scaly=y","silky=k","smooth=s"),
    "stalk-color-above-ring":   ("brown=n","buff=b","cinnamon=c","gray=g","orange=o",
                                "pink=p","red=e","white=w","yellow=y"),
    "stalk-color-below-ring":   ("brown=n","buff=b","cinnamon=c","gray=g","orange=o",
                                "pink=p","red=e","white=w","yellow=y"),
    "veil-type":                ("partial=p","universal=u"),
    "veil-color":               ("brown=n","orange=o","white=w","yellow=y"),
    "ring-number":              ("none=n","one=o","two=t"),
    "ring-type":                ("cobwebby=c","evanescent=e","flaring=f","large=l",
                                "none=n","pendant=p","sheathing=s","zone=z"),
    "spore-print-color":        ("black=k","brown=n","buff=b","chocolate=h","green=r",
                                "orange=o","purple=u","white=w","yellow=y"),
    "population":               ("abundant=a","clustered=c","numerous=n",
                                "scattered=s","several=v","solitary=y"),
    "habitat":                  ("grasses=g","leaves=l","meadows=m","paths=p",
                                "urban=u","waste=w","woods=d")
}

Define a function to obtain a unique string for each attribute-category pair.

In [3]:
def attr_cat(cat: str, attr: str, full: bool = True) -> str:
    """
    Return the meaning of an attribute and category letter pair. Pass full=False
    to exclude the attribute name.
    Raise ValueError if the category letter is not found.
    """
    for long_cat in ATTRIBUTES[attr]:
        cat_name, cat_char = long_cat.split("=")

        if cat == cat_char:
            return pair_format(attr, cat_name) if full else cat_name

    raise ValueError("invalid category")


def pair_format(attr: str, cat_name: str) -> str:
    return f"{attr}:{cat_name}"

In [4]:
attr_cat("e", "ring-type")

'ring-type:evanescent'
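
For repeated lookups, the linear scan in `attr_cat` could be replaced by a dictionary that is built once. The following is a minimal sketch and not part of the original notebook: `SAMPLE_ATTRIBUTES`, `build_lookup`, and `LOOKUP` are hypothetical names, and the dictionary holds only two entries copied from `ATTRIBUTES` so that the example runs on its own.

```python
# Two entries copied from ATTRIBUTES so the sketch is self-contained.
SAMPLE_ATTRIBUTES = {
    "class": ("edible=e", "poisonous=p"),
    "ring-type": ("cobwebby=c", "evanescent=e", "flaring=f", "large=l",
                  "none=n", "pendant=p", "sheathing=s", "zone=z"),
}


def build_lookup(attributes):
    """Map each attribute to a {letter: name} dictionary."""
    lookup = {}
    for attr, long_cats in attributes.items():
        lookup[attr] = {}
        for long_cat in long_cats:
            cat_name, cat_char = long_cat.split("=")
            lookup[attr][cat_char] = cat_name
    return lookup


LOOKUP = build_lookup(SAMPLE_ATTRIBUTES)
print(LOOKUP["ring-type"]["e"])  # evanescent
```

With a pandas Series, something like `df["ring-type"].map(LOOKUP["ring-type"])` would translate a whole column in one call; note that letters missing from the dictionary would become `NaN` instead of raising an error.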

Read data from the file. Attribute 11, `stalk-root`, contains missing values and will not be used as a feature.

In [5]:
df = pd.read_csv("agaricus-lepiota.data", names = ATTRIBUTES.keys())
df.shape

(8124, 23)
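
To check which columns hold missing values, the placeholder character can be counted per column. The sketch below assumes that missing entries are marked with `?`, as described in the dataset's `agaricus-lepiota.names` file. The three rows and three column names are toy data rather than lines from the real file, and `count_missing` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Toy stand-in for agaricus-lepiota.data (assumption: "?" marks a missing value).
TOY_DATA = "p,x,?\ne,b,c\ne,x,?\n"
TOY_NAMES = ["class", "cap-shape", "stalk-root"]


def count_missing(text, names, marker="?"):
    """Count occurrences of the missing-value marker in each column."""
    missing = Counter({name: 0 for name in names})
    for row in csv.reader(io.StringIO(text)):
        for name, value in zip(names, row):
            if value == marker:
                missing[name] += 1
    return dict(missing)


print(count_missing(TOY_DATA, TOY_NAMES))
# {'class': 0, 'cap-shape': 0, 'stalk-root': 2}
```

On the loaded dataframe, `(df == "?").sum()` should give the same kind of per-column count.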

In [6]:
df.head()

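| | class | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |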


### Initial Visualization

In [7]:
# choosing features (and target)
features = list(ATTRIBUTES.keys())[1:] # class is target
features.remove("stalk-root")

target = "class"

[target] + features

['class',
 'cap-shape',
 'cap-surface',
 'cap-color',
 'bruises?',
 'odor',
 'gill-attachment',
 'gill-spacing',
 'gill-size',
 'gill-color',
 'stalk-shape',
 'stalk-surface-above-ring',
 'stalk-surface-below-ring',
 'stalk-color-above-ring',
 'stalk-color-below-ring',
 'veil-type',
 'veil-color',
 'ring-number',
 'ring-type',
 'spore-print-color',
 'population',
 'habitat']

It would be very difficult to visualize all 21 selected features simultaneously.
As an initial visualization, category counts for the (arbitrarily selected) attribute `cap-shape` are plotted in a bar chart. The bars are further divided by `cap-color`. 1000 data points are sampled from the larger pool.

In [8]:
# constant for seeds
RSTATE = 125

In [9]:
# smaller sample for plotting
sample_df = df.sample(1000, random_state = RSTATE)

Columns are added to provide more complete category names.

In [10]:
sample_df["Class"] = sample_df["class"].apply(attr_cat, args = ["class", False])
sample_df["Cap Shape"] = sample_df["cap-shape"].apply(attr_cat, args = ["cap-shape", False])
sample_df["Cap Color"] = sample_df["cap-color"].apply(attr_cat, args = ["cap-color", False])

In [11]:
# domain and range for colors
cdomain = ["brown", "buff", "cinnamon", "gray", "green", "pink", "purple", "red", "white", "yellow"]
crange_ = ["brown", "tan", "orange", "grey", "green", "pink", "purple", "red", "lightgrey", "yellow"]

alt.Chart(sample_df).mark_bar().encode(
    x = "count()",
    y = "cap-shape",
    color = alt.Color("Cap Color", scale = alt.Scale(domain = cdomain, range = crange_)),
    row = "Class",
    tooltip = ["Cap Shape", "Cap Color", "count()"]
)

The `conical` cap shape is uncommon enough that it does not appear in the sample data and thus is excluded from the chart.
Based on this chart, it can be inferred that mushrooms with `bell` shaped caps are more likely to be edible than not. The same can be inferred about mushrooms with `white` caps, since the upper chart contains many more white-capped mushrooms.
Such an inference cannot be made about mushrooms with `flat` caps due to the somewhat similar counts of flat-capped mushrooms in the edible and poisonous classes. It would also seem that red caps are proportionally more common with knobbed caps in poisonous mushrooms than in edible mushrooms.

In [12]:
# repeat with different attributes
sample_df["Odor"] = sample_df["odor"].apply(attr_cat, args = ["odor", False])
sample_df["Habitat"] = sample_df["habitat"].apply(attr_cat, args = ["habitat", False])

In [13]:
alt.Chart(sample_df).mark_bar().encode(
    x = "count()",
    y = "odor",
    color = "Habitat",
    row = "Class",
    tooltip = ["Odor", "Habitat", "count()"]
)

This chart features much stronger divides between categories than the previous.
From this chart, it seems that having an `almond`, `anise`, or `none` odor is a strong indicator of edibility, while having a `pungent`, `spicy`, or `fishy` odor is a strong indicator of inedibility. Most mushrooms that grow in `leaves` also seem to be poisonous.

### Logistic Regression

To more thoroughly analyze these mushroom attributes and their relation to edibility, we first perform a logistic regression.

#### One-hot encoding

In its current categorical form, the data cannot be used directly for logistic regression. To remedy this, every category is given its own binary dummy variable, which converts the data to numeric form. If a data point falls into a given category, the corresponding variable is set to 1; otherwise, it is set to 0.

In [14]:
columns_ohe = []
features_ohe = []

# get all categories
for attr in features:
    for long_cat in ATTRIBUTES[attr]:
        cat_name, cat_char = long_cat.split("=")

        string = pair_format(attr, cat_name)
        columns_ohe.append(string)
        features_ohe.append(string)

df_ohe = pd.DataFrame(np.zeros((len(df), len(columns_ohe))), columns = columns_ohe)

In [15]:
df_full = pd.DataFrame()
# turn all categories into pair format
for attr in features:
    df_full[attr] = df[attr].apply(attr_cat, args = [attr])

In [16]:
# populate the ohe dataframe
# might take a while
for i in range(len(df_full)):
    df_ohe.loc[i, df_full.loc[i]] = 1

In [17]:
df_ohe[target] = df[target]

In [18]:
df_ohe.head()

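| | cap-shape:bell | cap-shape:conical | cap-shape:convex | cap-shape:flat | cap-shape:knobbed | cap-shape:sunken | cap-surface:fibrous | cap-surface:grooves | cap-surface:scaly | cap-surface:smooth | ... | population:several | population:solitary | habitat:grasses | habitat:leaves | habitat:meadows | habitat:paths | habitat:urban | habitat:waste | habitat:woods | class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | p |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | e |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | e |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | p |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | e |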


The data is now properly formatted for logistic regression.
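
The same kind of encoding can be produced in one call with `pandas.get_dummies`. This is a sketch of an alternative, not the approach used above, and it differs in two ways: the column names use the single-letter codes (for example `odor:a`), and only categories that actually occur in the data get a column, whereas the loop above creates a column for every category listed in `ATTRIBUTES`. The small frame `toy` is made-up data.

```python
import pandas as pd

# Made-up rows with two of the notebook's feature columns.
toy = pd.DataFrame({
    "odor": ["a", "n", "f"],
    "habitat": ["g", "d", "g"],
})

# prefix_sep=":" mirrors the "attribute:category" naming used in this project.
toy_ohe = pd.get_dummies(toy, prefix_sep=":", dtype=float)

print(list(toy_ohe.columns))
# ['odor:a', 'odor:f', 'odor:n', 'habitat:d', 'habitat:g']
```

Each row of `toy_ohe` contains exactly one 1.0 per original attribute, which is the property the logistic regression step relies on.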

#### Train-test split

The data must now be split between training data and testing data. This project will use a training-testing ratio of 70:30.

In [19]:
TRAIN_SIZE = 0.7

In [20]:
X = df_ohe[features_ohe]
y = df_ohe[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = TRAIN_SIZE, random_state = RSTATE + 1)
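
Conceptually, a seeded split shuffles the row indices and cuts them at the chosen ratio. The sketch below illustrates that idea in plain Python. It is not scikit-learn's implementation and will not reproduce the exact rows chosen by `train_test_split`; the function name `simple_split` is hypothetical.

```python
import random


def simple_split(n_rows, train_size, seed):
    """Return (train_indices, test_indices) for a shuffled split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # same seed -> same shuffle
    n_train = int(n_rows * train_size)     # floor of the training share
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = simple_split(n_rows=10, train_size=0.7, seed=126)
print(len(train_idx), len(test_idx))  # 7 3
```

Fixing the seed is what makes the split reproducible between runs, which is the purpose of the `RSTATE` constant in this project.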

#### Classifier and accuracy

The training data can now be used to fit a logistic regression classifier.

In [21]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression()

The accuracy of the model can be verified with both the training and testing data.

In [22]:
clf.score(X_train, y_train)

1.0

In [23]:
clf.score(X_test, y_test)

1.0

Despite being extremely accurate with the training data, the model does not appear to be overfit, as predictions on the testing data are also very accurate.

The model also predicts with very high confidence:

In [24]:
y_confs = pd.DataFrame(clf.predict(X_test), columns = ["Pred"])
y_confs[clf.classes_] = clf.predict_proba(X_test)

# mean confidences for class predictions
y_confs.groupby("Pred").mean()

| Pred | e | p |
| --- | --- | --- |
| e | 0.994625 | 0.005375 |
| p | 0.006809 | 0.993191 |


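The fitted coefficients indicate how strongly each category pushes the prediction toward one class or the other. For a binary problem, scikit-learn reports the coefficients for the second entry of `clf.classes_`, so a positive coefficient here raises the predicted probability of `p` (poisonous) and a negative one lowers it.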

In [25]:
clf.classes_

array(['e', 'p'], dtype=object)

In [26]:
coefs = pd.DataFrame()
coefs["Category"] = clf.feature_names_in_
coefs["Coef"] = clf.coef_[0]

coefs

| | Category | Coef |
| --- | --- | --- |
| 0 | cap-shape:bell | 0.258623 |
| 1 | cap-shape:conical | 0.486899 |
| 2 | cap-shape:convex | -0.183022 |
| 3 | cap-shape:flat | -0.030378 |
| 4 | cap-shape:knobbed | -0.005568 |
| ... | ... | ... |
| 114 | habitat:meadows | 0.480272 |
| 115 | habitat:paths | -0.034307 |
| 116 | habitat:urban | 0.077879 |
| 117 | habitat:waste | -1.532937 |
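
To make the role of the coefficients concrete: logistic regression adds the intercept to the coefficients of the categories that are present (the one-hot columns equal to 1) and passes the sum through the sigmoid function to obtain the probability of `p`. The sketch below uses made-up coefficient values and a made-up intercept, not the fitted ones, and `prob_poisonous` is a hypothetical helper.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up values for illustration; they are NOT the fitted coefficients.
TOY_COEFS = {"odor:foul": 2.0, "odor:none": -2.0, "cap-shape:bell": 0.25}
TOY_INTERCEPT = 0.0


def prob_poisonous(active_categories):
    """Probability of class 'p' given the one-hot columns that equal 1."""
    z = TOY_INTERCEPT + sum(TOY_COEFS[c] for c in active_categories)
    return sigmoid(z)


print(round(prob_poisonous(["odor:foul"]), 3))                    # 0.881
print(round(prob_poisonous(["odor:none", "cap-shape:bell"]), 3))  # 0.148
```

A positive total pushes the probability of `p` above 0.5 and a negative total pushes it below, which is why the sign of a coefficient can be read as "toward poisonous" or "toward edible".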


As expected from the initial visualization, odor is a significant determiner of a mushroom's edibility, though the inferences drawn from cap shape seem to have been misguided. The likely explanation is that the model weighs all variables jointly, whereas the charts show only simple counts.

In [27]:
# charting the coefficients
alt.Chart(coefs).mark_bar().encode(
    x = "Category",
    y = "Coef",
    color = alt.Color("Coef", scale = alt.Scale(scheme = "purples")),
    tooltip = ["Category", "Coef"]
)

It can also be seen that having `green` spore print color is a very significant indicator of inedibility, according to the model.
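
One way to find the most influential categories without reading them off the chart is to rank the coefficients by absolute value. The sketch below does this for five of the values printed in the coefficient table; the remaining rows are omitted, so the ranking applies only to these five entries.

```python
# Five (category, coefficient) pairs copied from the printed coefficient table.
shown = {
    "cap-shape:bell": 0.258623,
    "cap-shape:conical": 0.486899,
    "cap-shape:convex": -0.183022,
    "habitat:meadows": 0.480272,
    "habitat:waste": -1.532937,
}

# Largest magnitude first; the sign tells which class the category favors.
ranked = sorted(shown.items(), key=lambda item: abs(item[1]), reverse=True)
for category, coef in ranked:
    direction = "poisonous" if coef > 0 else "edible"
    print(f"{category:20s} {coef:+.3f} -> {direction}")
```

On the full `coefs` dataframe, `coefs.sort_values("Coef", key=abs, ascending=False)` should produce the equivalent ordering, assuming a pandas version that supports the `key` argument.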

### Categorical Naive Bayes

We move on to a different classification algorithm that utilizes Bayes' theorem. Unlike logistic regression, this algorithm can handle categorical data directly. However, each attribute must still be converted to integers, one for each category. This is achieved with scikit-learn's `OrdinalEncoder`.

#### OrdinalEncoder

Use the encoder to find unique categories for each feature.

In [28]:
enc = OrdinalEncoder()
enc.fit(df[features])

enc.categories_

[array(['b', 'c', 'f', 'k', 's', 'x'], dtype=object),
 array(['f', 'g', 's', 'y'], dtype=object),
 array(['b', 'c', 'e', 'g', 'n', 'p', 'r', 'u', 'w', 'y'], dtype=object),
 array(['f', 't'], dtype=object),
 array(['a', 'c', 'f', 'l', 'm', 'n', 'p', 's', 'y'], dtype=object),
 array(['a', 'f'], dtype=object),
 array(['c', 'w'], dtype=object),
 array(['b', 'n'], dtype=object),
 array(['b', 'e', 'g', 'h', 'k', 'n', 'o', 'p', 'r', 'u', 'w', 'y'],
       dtype=object),
 array(['e', 't'], dtype=object),
 array(['f', 'k', 's', 'y'], dtype=object),
 array(['f', 'k', 's', 'y'], dtype=object),
 array(['b', 'c', 'e', 'g', 'n', 'o', 'p', 'w', 'y'], dtype=object),
 array(['b', 'c', 'e', 'g', 'n', 'o', 'p', 'w', 'y'], dtype=object),
 array(['p'], dtype=object),
 array(['n', 'o', 'w', 'y'], dtype=object),
 array(['n', 'o', 't'], dtype=object),
 array(['e', 'f', 'l', 'n', 'p'], dtype=object),
 array(['b', 'h', 'k', 'n', 'o', 'r', 'u', 'w', 'y'], dtype=object),
 array(['a', 'c', 'n', 's', 'v', 'y'], dty

Transform the dataframe into integers using the encoder.

In [29]:
df_nb = df.copy()

df_nb[features] = enc.transform(df[features])

In [30]:
df_nb.head()

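| | class | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | p | 5.0 | 2.0 | 4.0 | 1.0 | 6.0 | 1.0 | 0.0 | 1.0 | 4.0 | ... | 2.0 | 7.0 | 7.0 | 0.0 | 2.0 | 1.0 | 4.0 | 2.0 | 3.0 | 5.0 |
| 1 | e | 5.0 | 2.0 | 9.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 4.0 | ... | 2.0 | 7.0 | 7.0 | 0.0 | 2.0 | 1.0 | 4.0 | 3.0 | 2.0 | 1.0 |
| 2 | e | 0.0 | 2.0 | 8.0 | 1.0 | 3.0 | 1.0 | 0.0 | 0.0 | 5.0 | ... | 2.0 | 7.0 | 7.0 | 0.0 | 2.0 | 1.0 | 4.0 | 3.0 | 2.0 | 3.0 |
| 3 | p | 5.0 | 3.0 | 8.0 | 1.0 | 6.0 | 1.0 | 0.0 | 1.0 | 5.0 | ... | 2.0 | 7.0 | 7.0 | 0.0 | 2.0 | 1.0 | 4.0 | 2.0 | 3.0 | 5.0 |
| 4 | e | 5.0 | 2.0 | 3.0 | 0.0 | 5.0 | 1.0 | 1.0 | 0.0 | 4.0 | ... | 2.0 | 7.0 | 7.0 | 0.0 | 2.0 | 1.0 | 0.0 | 3.0 | 0.0 | 1.0 |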
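
What the encoder does for a single column can be sketched in plain Python: sort the distinct letters and replace each letter with its position in that sorted list. The example uses the six `cap-shape` letters reported in `enc.categories_`; `encode_column` is a hypothetical helper, not a scikit-learn function.

```python
def encode_column(values):
    """Return (codes, categories) using the sorted distinct values as categories."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [float(index[v]) for v in values], categories


# cap-shape letters of the first five rows, followed by the remaining letters
# so that all six categories are present.
cap_shape = ["x", "x", "b", "x", "x", "c", "f", "k", "s"]
codes, categories = encode_column(cap_shape)

print(categories)  # ['b', 'c', 'f', 'k', 's', 'x']
print(codes[:5])   # [5.0, 5.0, 0.0, 5.0, 5.0]
```

The resulting numbers are labels only: the alphabetical order of the letters carries no meaning, and the categorical naive Bayes classifier treats each code as a separate category.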


#### Train-test split

In [31]:
Xnb = df_nb[features]
ynb = df_nb[target]

Xnb_train, Xnb_test, ynb_train, ynb_test = train_test_split(Xnb, ynb, train_size = TRAIN_SIZE, random_state = RSTATE + 2)

#### Classifier and accuracy

Now that every feature is represented by integer category codes, the classifier can be fit to the data.

In [32]:
nb_clf = CategoricalNB()
nb_clf.fit(Xnb_train, ynb_train)

CategoricalNB()

Testing the accuracy of predictions with both training and testing inputs:

In [33]:
nb_clf.score(Xnb_train, ynb_train)

0.9613084769609568

In [34]:
nb_clf.score(Xnb_test, ynb_test)

0.9581624282198523

The accuracy is lower than with logistic regression (about 96% versus 100%), but the test accuracy is only marginally below the training accuracy, indicating little to no problematic overfitting.
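
The idea behind categorical naive Bayes can be sketched by hand: multiply the class prior by one smoothed frequency per feature, then normalize across classes. The sketch below is a simplified illustration on six made-up rows with two features (odor and habitat letters), not the fitted model. It uses additive smoothing with `ALPHA = 1.0`, which matches the default `alpha` of `CategoricalNB`; the names `TOY_TRAIN`, `fit`, and `predict_proba` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Made-up training rows: ((odor, habitat), class).
TOY_TRAIN = [
    (("a", "g"), "e"), (("n", "g"), "e"), (("n", "d"), "e"),
    (("f", "d"), "p"), (("f", "p"), "p"), (("p", "g"), "p"),
]
ALPHA = 1.0  # additive (Laplace) smoothing


def fit(rows):
    """Count classes, per-feature categories, and per-class value frequencies."""
    class_counts = Counter(label for _, label in rows)
    n_features = len(rows[0][0])
    categories = [sorted({x[i] for x, _ in rows}) for i in range(n_features)]
    counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for x, label in rows:
        for i, value in enumerate(x):
            counts[(label, i)][value] += 1
    return class_counts, categories, counts


def predict_proba(x, model):
    """Return {class: probability} for one row of category letters."""
    class_counts, categories, counts = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n_c in class_counts.items():
        score = math.log(n_c / total)  # class prior
        for i, value in enumerate(x):
            num = counts[(label, i)][value] + ALPHA
            den = n_c + ALPHA * len(categories[i])
            score += math.log(num / den)  # smoothed conditional frequency
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(exp.values())
    return {k: v / norm for k, v in exp.items()}


model = fit(TOY_TRAIN)
probs = predict_proba(("f", "d"), model)
print({k: round(v, 3) for k, v in probs.items()})  # {'e': 0.25, 'p': 0.75}
```

Because every feature contributes an independent factor, a single category rarely drives the probability all the way to 0 or 1, which is one plausible reason for naive Bayes confidences being less extreme than those of logistic regression.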

### Comparison

Overall, logistic regression produced more accurate and higher-confidence results than categorical naive Bayes. However, the latter was far easier to implement.

In [35]:
# comparing confidences
conf_cols = ["e Conf", "p Conf"]

X_test_sample = X_test.sample(1000, random_state = RSTATE - 1)

y_confs_comp = pd.DataFrame(clf.predict(X_test_sample), columns = ["Pred"])
y_confs_comp[conf_cols] = clf.predict_proba(X_test_sample)
y_confs_comp["Algo"] = "LR"

Xnb_test_sample = Xnb_test.sample(1000, random_state = RSTATE - 2)

y_confs_nb = pd.DataFrame(nb_clf.predict(Xnb_test_sample), columns = ["Pred"])
y_confs_nb[conf_cols] = nb_clf.predict_proba(Xnb_test_sample)
y_confs_nb["Algo"] = "NB"

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
y_confs_comp = pd.concat([y_confs_comp, y_confs_nb])

In [36]:
y_confs_comp.head()

| | Pred | e Conf | p Conf | Algo |
| --- | --- | --- | --- | --- |
| 0 | e | 0.997725 | 0.002275 | LR |
| 1 | e | 0.998221 | 0.001779 | LR |
| 2 | e | 0.997125 | 0.002875 | LR |
| 3 | e | 0.998839 | 0.001161 | LR |
| 4 | e | 0.999607 | 0.000393 | LR |


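The following chart displays confidence in `"e"` on the x-axis and confidence in `"p"` on the y-axis. The algorithm is indicated by color and by column, and the predicted class (`"e"` or `"p"`) is indicated by shape. Because the two confidences sum to 1, every point lies on the line from (0, 1) to (1, 0).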

In [37]:
alt.Chart(y_confs_comp).mark_point().encode(
    x = "e Conf",
    y = "p Conf",
    column = "Algo",
    color = "Algo",
    shape = "Pred",
    tooltip = ["e Conf", "p Conf"]
)

The confidence values for logistic regression are all fairly high and cluster above 80% confidence in one class. Meanwhile, the confidence values for naive Bayes are more evenly distributed, with minimal clustering at the extremes and more points near the middle.
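
This visual impression could be backed by a number, for example the share of predictions whose larger class probability exceeds 80%. The sketch below shows one way to compute it. The confidence pairs are made up for illustration, so the printed shares say nothing about the actual models, and `share_confident` is a hypothetical helper.

```python
def share_confident(rows, threshold=0.8):
    """Fraction of predictions whose larger class probability exceeds threshold."""
    confident = [max(e_conf, p_conf) > threshold for e_conf, p_conf in rows]
    return sum(confident) / len(confident)


# Made-up (e Conf, p Conf) pairs for illustration only.
toy_lr = [(0.998, 0.002), (0.004, 0.996), (0.990, 0.010), (0.850, 0.150)]
toy_nb = [(0.999, 0.001), (0.600, 0.400), (0.300, 0.700), (0.050, 0.950)]

print(share_confident(toy_lr))  # 1.0
print(share_confident(toy_nb))  # 0.5
```

Applied to the real `e Conf` and `p Conf` columns, grouped by `Algo`, the same calculation would give one comparable figure per algorithm.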

## Summary

This project used two classification methods to predict the toxicity of a mushroom from its attributes. Odor was found to be the strongest indicator of whether a mushroom is poisonous, with generally unpleasant odors correlating with inedibility. Overall, logistic regression was more accurate, with predictions matching 100% of the training and test data. However, this accuracy came with the added work of one-hot encoding all categorical variables.

## References

* Dataset retrieved from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/73/mushroom

* sklearn documentation:
    * Logistic Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
    * Categorical NB: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB
    * OrdinalEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

* StatQuest on naive Bayes: https://www.youtube.com/watch?v=O2L2Uv9pdDA

* More on naive Bayes variations: https://towardsdatascience.com/why-how-to-use-the-naive-bayes-algorithms-in-a-regulated-industry-with-sklearn-python-code-dbd8304ab2cf

* Inspiration for stacked bar chart: https://chartexpo.com/blog/best-graphs-for-categorical-data#

