# Sabriya Sowers' Mushroom Classification Analysis  
**Author:** Sabriya Sowers  
**Date:** November 10, 2025  

## Introduction  
This project applies **machine learning classification techniques** to predict whether a mushroom is **edible or poisonous** based on its physical characteristics. The dataset used is the **UCI Mushroom Dataset**, which contains 8,124 records describing mushrooms from 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each record includes attributes such as cap shape, color, odor, and gill size, among others.   

## Section 1. Import and Inspect the Data

In [85]:
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
from matplotlib.colors import ListedColormap

# Point directly to the data file
data_file = Path("../../Data/agaricus-lepiota.data")

In [86]:
# Add easier-to-understand feature headers (defined before loading, because read_csv uses them)
column_names = [
    "class", "cap_shape", "cap_surface", "cap_color",
    "bruises", "odor", "gill_attachment", "gill_spacing", "gill_size",
    "gill_color", "stalk_shape", "stalk_root",
    "stalk_surface_above_ring", "stalk_surface_below_ring",
    "stalk_color_above_ring", "stalk_color_below_ring",
    "veil_type", "veil_color", "ring_number", "ring_type",
    "spore_print_color", "population", "habitat"
]

# Load data into a DataFrame named `df`, treating "?" as a missing value
df = pd.read_csv(
    data_file,
    header=None,
    names=column_names,
    na_values=["?"]
)

In [87]:
# Quick summary of the dataset
df.info()

# Display first 10 rows
print(df.head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap_shape                 8124 non-null   object
 2   cap_surface               8124 non-null   object
 3   cap_color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill_attachment           8124 non-null   object
 7   gill_spacing              8124 non-null   object
 8   gill_size                 8124 non-null   object
 9   gill_color                8124 non-null   object
 10  stalk_shape               8124 non-null   object
 11  stalk_root                5644 non-null   object
 12  stalk_surface_above_ring  8124 non-null   object
 13  stalk_surface_below_ring  8124 non-null   object
 14  stalk_color_above_ring  

## Section 2. Data Exploration and Preparation

In [88]:
# Check for missing values
df.isnull().sum()

class                          0
cap_shape                      0
cap_surface                    0
cap_color                      0
bruises                        0
odor                           0
gill_attachment                0
gill_spacing                   0
gill_size                      0
gill_color                     0
stalk_shape                    0
stalk_root                  2480
stalk_surface_above_ring       0
stalk_surface_below_ring       0
stalk_color_above_ring         0
stalk_color_below_ring         0
veil_type                      0
veil_color                     0
ring_number                    0
ring_type                      0
spore_print_color              0
population                     0
habitat                        0
dtype: int64

In [89]:
# Fill missing values in 'stalk_root' with its mode
mode_stalk_root = df['stalk_root'].mode()[0]
df['stalk_root'] = df['stalk_root'].fillna(mode_stalk_root)

# Verify: should print 0
print(int(df['stalk_root'].isna().sum()))


0


**Data Inspection Summary**
- The Mushroom dataset contains 8,124 rows and 23 categorical columns: 22 features describing the physical characteristics of mushrooms, plus the target.
- Each feature is encoded as a single-character value (e.g., x, f, n), and the target variable (class) indicates whether the mushroom is edible (e) or poisonous (p).
- Data inspection showed that only one feature, stalk_root, contained missing values (2,480 of 8,124 rows). These were replaced with the column's mode, so no missing data remained.
- All other features contained complete data.
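
Because roughly 30% of the stalk_root values (2,480 of 8,124) are missing, the choice of imputation can noticeably change that column's distribution. The sketch below uses a small made-up column (the `toy` frame and its values are hypothetical, not rows from the dataset) to compare the mode fill used in this notebook with one possible alternative, keeping "missing" as its own category.

```python
import pandas as pd

# Hypothetical toy column standing in for stalk_root; the values are made up.
toy = pd.DataFrame({"stalk_root": ["b", "b", None, "e", None, "c", "b"]})

missing_before = int(toy["stalk_root"].isna().sum())
missing_share = toy["stalk_root"].isna().mean()

# Option A (used in this notebook): fill with the most frequent value.
mode_value = toy["stalk_root"].mode()[0]
filled_mode = toy["stalk_root"].fillna(mode_value)

# Option B (an alternative, not used here): keep "missing" as its own category.
filled_flag = toy["stalk_root"].fillna("missing")

print("missing before:", missing_before, "share:", round(missing_share, 3))
print("mode fill:", filled_mode.value_counts().to_dict())
print("flag fill:", filled_flag.value_counts().to_dict())
```

Mode filling inflates the count of the most common category, while the flag approach preserves the information that a value was missing. Which one works better for this dataset would need to be checked against model results.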

## 2.2 Feature Engineering

In [90]:
# Convert target 'class' to numeric first to ensure successful modeling
df['class'] = df['class'].map({'e': 0, 'p': 1}) # class: edible or poisonous

# One-hot encode all the other categorical features
X = pd.get_dummies(df.drop('class', axis=1), drop_first=True)

# Target feature
y = df['class']

# Sanity check: confirm the shapes of X and y
print("X shape:", X.shape)
print("y shape:", y.shape)
X.head()

X shape: (8124, 94)
y shape: (8124,)


| | cap_shape_c | cap_shape_f | cap_shape_k | cap_shape_s | cap_shape_x | cap_surface_g | cap_surface_s | cap_surface_y | cap_color_c | cap_color_e | ... | population_n | population_s | population_v | population_y | habitat_g | habitat_l | habitat_m | habitat_p | habitat_u | habitat_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | True | False | True | False | False | False | ... | False | True | False | False | False | False | False | False | True | False |
| 1 | False | False | False | False | True | False | True | False | False | False | ... | True | False | False | False | True | False | False | False | False | False |
| 2 | False | False | False | False | False | False | True | False | False | False | ... | True | False | False | False | False | False | True | False | False | False |
| 3 | False | False | False | False | True | False | False | True | False | False | ... | False | True | False | False | False | False | False | False | True | False |
| 4 | False | False | False | False | True | False | True | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |
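
To make the column count after encoding easier to reason about, here is a minimal sketch of how `pd.get_dummies(..., drop_first=True)` behaves. The `toy` frame and its values are hypothetical and chosen only for illustration. With `drop_first=True`, each feature contributes one column fewer than its number of distinct values, so a feature with a single distinct value contributes no columns at all.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not rows from the dataset.
toy = pd.DataFrame({
    "odor": ["a", "n", "f", "n"],        # 3 distinct values -> 2 columns
    "gill_size": ["b", "n", "b", "n"],   # 2 distinct values -> 1 column
    "veil_type": ["p", "p", "p", "p"],   # 1 distinct value  -> 0 columns
})

encoded = pd.get_dummies(toy, drop_first=True)

# Each feature contributes (number of distinct values - 1) columns.
expected_columns = int((toy.nunique() - 1).sum())

print(list(encoded.columns))
print(encoded.shape[1], expected_columns)
```

Running `(df.drop('class', axis=1).nunique() - 1).sum()` on the real data should reproduce the 94 columns reported above, and `df.nunique()` would show whether any feature is constant and therefore drops out entirely.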


## Section 3. Feature Selection and Justification

## 3.2 Define X (features) and y (target)

**Target (y): whether the mushroom is edible or poisonous**
y = df['class']   # 0 = edible, 1 = poisonous (mapped from 'e'/'p' in Section 2.2)

**Features (X): all remaining mushroom characteristics**
X = pd.get_dummies(df.drop(columns=['class']), drop_first=True)   # one-hot encoded, as in Section 2.2

1. **Why are these features selected?**
All non-target features represent observable physical characteristics of mushrooms, such as cap shape, gill color, odor, and stalk pattern. These biological traits are relevant to determining whether a mushroom is edible or poisonous.

2. **Are there features likely to be highly predictive of class?**
Yes. Some mushroom characteristics are known to be strongly associated with toxicity. For example, the feature “odor” is often highly predictive because many poisonous mushroom species produce distinctive foul or chemical smells, while edible mushrooms typically do not. Other features such as spore print color and gill size may also contribute to distinguishing edible from poisonous mushrooms.
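
One way to check a claim like this is to look at the share of poisonous mushrooms within each category of a feature. The sketch below does so on a small made-up frame (`toy` and its rows are hypothetical and do not come from the dataset). On the real data, the same calls could be applied to `df['odor']` and `df['class']` before one-hot encoding.

```python
import pandas as pd

# Hypothetical toy rows; the values are made up for illustration only.
toy = pd.DataFrame({
    "odor":  ["n", "n", "n", "f", "f", "a", "a", "p"],
    "class": [0,   0,   1,   1,   1,   0,   0,   1],   # 0 = edible, 1 = poisonous
})

# Share of poisonous mushrooms within each odor category.
poison_rate = toy.groupby("odor")["class"].mean().sort_values(ascending=False)

# Row-normalized contingency table: each row sums to 1.
table = pd.crosstab(toy["odor"], toy["class"], normalize="index")

print(poison_rate)
print(table)
```

Categories whose rate sits near 0 or 1 separate the classes almost perfectly on their own, while rates near the overall base rate carry little information. Whether odor actually behaves that way here would have to be confirmed on the full dataset.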