# Assignment 6: Mushroom Classifier
## DTSC 680: Applied Machine Learning

## Name: Tyrone Amos Myers

## Comprehensive Overview

This notebook uses a mushroom dataset "drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf." It has 8124 instances and 22 features. Every feature is categorical with at least two possible categories, resulting in over 100 categories overall. That doesn't mean every possible category is represented in the data. It simply means there are lots of mushroom morphologies that, given certain assumptions about the nature of organisms (mushrooms being organisms), correlate well with the edibility of mushrooms. The data set is imported from a .data file, which is a plain comma-separated text file, similar in almost every way to a CSV file except that it has no header row of feature labels. Thus, it must be read with header=None, or the first row's category values will end up heading the columns of the resulting DataFrame. Ugly. I chose to keep the integer column indices as names for ease of viewing and because renaming the columns would have required more code than I needed. I wrote a rename with the correct feature names from the .names file and commented it out. It is there largely for reference.
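As a small illustration of the header issue described above, here is a minimal sketch on a toy sample. The three rows, the four shortened column names, and the variable names are all hypothetical and only stand in for the real 23-column file.

```python
import io
import pandas as pd

# Three illustrative rows in the same comma-separated, header-less layout
# (shortened to four columns here; the real file has 23).
sample_text = "p,x,s,n\ne,x,s,y\ne,b,s,w\n"

# Hypothetical, shortened column names for this toy sample only.
column_names = ["edible_or_poisonous", "cap_shape", "cap_surface", "cap_color"]

# header=None keeps the first row as data; names= attaches labels at load time.
sample_df = pd.read_csv(io.StringIO(sample_text), header=None, names=column_names)

# Without header=None, pandas promotes the first data row to column labels.
wrong_df = pd.read_csv(io.StringIO(sample_text))

print(sample_df.shape)         # all three rows are kept as data
print(list(wrong_df.columns))  # the first row's values became the headers
```

Passing `names=` at load time is one alternative to renaming afterwards; keeping the integer labels, as this notebook does, works just as well for modeling.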

Then, following the Assignment Directions, Assignment FAQ, and discussion board questions/posts/responses (which were incredibly helpful, honestly), I split the data into two sets. The instances with a missing value formed one set, and the instances with no missing values formed the other. The data without missing values then gets split into features and a target, which is NOT the target of the whole data set. Rather, per assignment instructions, I train on the complete rows to predict a reasonable category for each missing value. I separated the data, one-hot encoded the resulting X and label encoded the resulting y, and then trained a KNeighbors Classifier to predict the missing values. Per assignment instructions, I printed out the count of each imputed value. I make this sound easy, but the imputation step was hard until I remembered to read the Assignment FAQ. As soon as I read it I knew how to do it (don't do train_test_split first, duh!).

With the missing values imputed and the data fully intact, I one-hot encoded the features and label encoded the target again, this time for the poisonous/edible target, which is the target of the full data set. Then, I trained a RandomForest Classifier and a Logistic Regression Classifier. I should mention that after I reviewed the Assignment instructions, I dropped the first category of each one-hot encoded feature because that is best for Logistic Regression. If every category keeps its own column, the columns for a feature always sum to 1, so they are perfectly collinear with the intercept and the coefficients become hard to interpret. This is called the dummy variable trap. One-hot encoding transforms each categorical feature into a set of 0/1 columns; it is called "one-hot" because at most one column in each set is 1 ("hot") for a given instance. I trained the two models, timed each model fitting, and then evaluated them using the assignment specifications for evaluation, which were accuracy, precision, and recall.
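To make the reason for dropping a category more concrete, here is a minimal sketch on a toy feature. It uses `pd.get_dummies` rather than the notebook's `OneHotEncoder` (a swapped-in tool, chosen only for brevity), and the data and the helper `rank_with_intercept` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy categorical feature with three categories (values are illustrative).
toy = pd.DataFrame({"cap_shape": ["x", "b", "x", "f", "b", "f"]})

full = pd.get_dummies(toy, dtype=float)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)  # first category dropped

# Every row of the full encoding sums to 1, which duplicates an intercept column.
print(full.sum(axis=1).unique())

def rank_with_intercept(encoded):
    """Rank of the design matrix [intercept | encoded columns]."""
    design = np.column_stack([np.ones(len(encoded)), encoded.to_numpy()])
    return np.linalg.matrix_rank(design)

# Full encoding: the design matrix has 4 columns but only rank 3 (collinear).
# Reduced encoding: the design matrix has 3 columns and rank 3 (no redundancy).
print(rank_with_intercept(full), rank_with_intercept(reduced))
```

The dropped category is not lost: it is the case where all remaining indicator columns for that feature are 0.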

I could visualize them, but all the curves would be squares given the scores, so I don't think they would provide any meaningful information. Then, I used Principal Component Analysis to reduce the data set to the minimum number of components necessary to account for 95% of the variance (the spread of the data), which was 38 principal components, down from the 94 one-hot encoded columns built from the 22 features. I then trained new RandomForest and Logistic Regression Classifiers on the reduced data set, timing them, too. Finally, I evaluated them using the performance metrics stipulated by the assignment.


- Import Mushroom dataset
- Impute missing values using KNN
- Train Random Forest and Logistic Regression models on full dataset
- Time each training and evaluate performance with accuracy, precision, and recall
- Perform PCA for dimensionality reduction
- Retrain models on reduced dataset and time the training, too
- Evaluate model accuracy, precision, recall
- Analyze model performance and training time

## Preliminaries

Import common packages:

In [1]:
# Common imports
import numpy as np
import pandas as pd
import os

# Import and Explore Data

- Load mushroom dataset
- Inspect features, target variable, missing values

In [2]:
mushroom_data_raw = pd.read_csv('agaricus-lepiota.data', header=None)
#with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
    #print(mushroom_data_raw)

# Right now, this serves no function short of helping me understand the data better, but it might serve a function later
# and I have already written the code anyway, so I will just keep it.
# mushroom_data = mushroom_data_raw.rename(columns={0: 'edible_or_poisonous', 1:'cap_shape', 2: 'cap_surface', 3: 'cap_color', 
                                                  # 4: 'bruises', 5: 'odor', 6: 'gill_attachment', 7: 'gill_spacing', 
                                                  # 8: 'gill_size', 9: 'gill_color', 10: 'stalk_shape', 11: 'stalk_root', 
                                                  # 12: 'stalk_surface_above_ring', 13: 'stalk_surface_below_ring', 
                                                  # 14: 'stalk_color_above_ring', 15: 'stalk_color_below_ring', 16: 'veil_type', 
                                                  # 17: 'veil_color', 18: 'ring_number', 19: 'ring_type', 20: 'spore_print_color',
                                                  # 21: 'population', 22: 'habitat'})

print(mushroom_data_raw)

# There are files that come with this mushroom data set which specify the different classes of each feature. Since I am one
# hot encoding anyway, just knowing that the features reference characteristics of mushrooms and that the aim is to correctly
# classify whether a given mushroom is edible or poisonous is sufficient, I think, to grasp the data. I can also inspect 
# feature importance later, and the above will be more meaningful then, if I choose to do that.

     0  1  2  3  4  5  6  7  8  9   ... 13 14 15 16 17 18 19 20 21 22
0     p  x  s  n  t  p  f  c  n  k  ...  s  w  w  p  w  o  p  k  s  u
1     e  x  s  y  t  a  f  c  b  k  ...  s  w  w  p  w  o  p  n  n  g
2     e  b  s  w  t  l  f  c  b  n  ...  s  w  w  p  w  o  p  n  n  m
3     p  x  y  w  t  p  f  c  n  n  ...  s  w  w  p  w  o  p  k  s  u
4     e  x  s  g  f  n  f  w  b  k  ...  s  w  w  p  w  o  e  n  a  g
...  .. .. .. .. .. .. .. .. .. ..  ... .. .. .. .. .. .. .. .. .. ..
8119  e  k  s  n  f  n  a  c  b  y  ...  s  o  o  p  o  o  p  b  c  l
8120  e  x  s  n  f  n  a  c  b  y  ...  s  o  o  p  n  o  p  b  v  l
8121  e  f  s  n  f  n  a  c  b  n  ...  s  o  o  p  o  o  p  b  c  l
8122  p  k  y  n  f  y  f  c  n  b  ...  k  w  w  p  w  o  e  w  v  l
8123  e  x  s  n  f  n  a  c  b  y  ...  s  o  o  p  o  o  p  o  c  l

[8124 rows x 23 columns]


# Impute missing values with KNN

- Define X (features) and y (target)
- Split the data for imputation
- Split the data for training
- Define target (the missing values, that is)
- One-hot encode features (all features without missing values except column = [0]), label encode target
- Train KNN model on just the instances without missing values
- Predict missing values
- Impute missing values in original dataset

In [3]:
# The files that come with this data set note that the data has: "Missing Attribute Values: 2480 of them (denoted by "?"), 
# all for attribute 11," which is 'stalk_root'. Therefore, all the missing values are in one column across 2480 rows/instances.
# Thus, I can split the data into two dataframes. Following the FAQ, I can use the dataframe without missing values to train a 
# KNNClassifier (KNN won't work with missing values) and then use the trained model to predict for the missing values, which
# have become the first "target". So, I have to split the data first.

# Replace "?" with NaN
mushroom_data_raw.replace("?", np.nan, inplace=True)

# Identify the column with missing values. When imputing, the actual response variable (column 0) is dropped from the
# features to prevent data leakage. If the imputed values came from a classifier trained on the response data, the response
# would leak into a feature through the imputed values, which would cause overfitting were I to model with the data afterwards.
column_with_missing_values = mushroom_data_raw.columns[mushroom_data_raw.isna().any()][0]

In [4]:
# Define X to exclude the column with missing values and define y to be the column with missing values.
# Exclude the first column which is the target for the full data set and the column with missing values, which is the target for
# this step.
X_for_impute = mushroom_data_raw.drop(columns=[0, column_with_missing_values])
y_for_impute = mushroom_data_raw[column_with_missing_values]

# For training data, use rows that don't have missing values
X_train_less_missing = X_for_impute.loc[y_for_impute.notna()]
y_train_less_missing = y_for_impute.dropna()

# For prediction, use rows that have missing values
X_missing_values = X_for_impute.loc[y_for_impute.isna()]

In [5]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-hot encode the features. First, instantiate an object of the OneHotEncoder class (as Dr. Moribitu would say), and drop
# the first category of each feature (as is done for logistic regression later). handle_unknown='ignore' makes the encoder
# encode any category it did not see while fitting as all zeros instead of raising an error, which matters because the rows
# with missing values may contain categories that never appear in the complete rows. We have to impute missing values before
# one hot encoding the full data set because one hot encoding doesn't work properly with missing values (It's like putting a
# mask on thin air).
oh_encoder = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')
X_train_less_missing_encoded = pd.DataFrame(oh_encoder.fit_transform(X_train_less_missing))
X_missing_values_encoded = pd.DataFrame(oh_encoder.transform(X_missing_values))

# Label encode the target data. This is like one hot encoding in that it assigns numbers to the classes, but different since
# one hot encoding creates a separate 0/1 column for each distinct category, while label encoding keeps a single column of
# integers. Categories are conceptual and meaningless to the machine. Only math matters. But math can be used to model
# categories nominally; that is, there is no numerical value to each number. Rather, each number functions like a name
# (again, nominally) or a label in a given context. In this sense, the number is simply a unique representative and has no
# hierarchical or ordinal ranking. It is just a context-relative ID.
stalk_root_encoder = LabelEncoder()
y_train_less_missing_encoded = stalk_root_encoder.fit_transform(y_train_less_missing)



In [6]:
from sklearn.neighbors import KNeighborsClassifier

# Train KNN classifier and fit it to encoded data minus the rows with missing values
knn = KNeighborsClassifier()
knn.fit(X_train_less_missing_encoded, y_train_less_missing_encoded)

# Create an empty list to store the imputed values per instructions
missing_values = []

# Predict missing values and transform the encoded predictions back to the original category letters (e.g. 'b', 'c', 'e')
y_pred_encoded = knn.predict(X_missing_values_encoded)
y_pred = stalk_root_encoder.inverse_transform(y_pred_encoded)

# Add the predicted values to the missing_values list
missing_values.extend(y_pred.tolist())

# Replace the missing values in the original data with the predicted values
mushroom_data_raw.loc[X_missing_values.index, column_with_missing_values] = y_pred

# Verify if all missing values have been filled
# print(mushroom_data_raw.isna().sum())

# Print the unique values in missing_values and their counts per instructions
unique_values, counts = np.unique(missing_values, return_counts=True)
for val, count in zip(unique_values, counts):
    print(f"Value: {val}, Count: {count}")

# Looking at the documentation that attended this file, we see that b, c, and e in stalk_root correspond to bulbous=b, club=c, 
# and equal=e.

Value: b, Count: 2014
Value: c, Count: 28
Value: e, Count: 438
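The imputation steps above can be sketched end to end on a toy frame. This is a minimal sketch under stated assumptions, not the notebook's exact procedure: the data, the column names (`color`, `odor`, `root`), `n_neighbors=1`, and the use of `pd.get_dummies` on all rows at once are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy categorical data; "root" plays the role of stalk_root and has gaps.
toy = pd.DataFrame({
    "color": ["n", "n", "w", "w", "n", "w"],
    "odor":  ["a", "a", "p", "p", "a", "p"],
    "root":  ["b", "b", "e", "e", np.nan, np.nan],
})

known = toy["root"].notna()

# One-hot encode the predictor columns for all rows at once so that the
# training rows and the rows to impute share the same columns.
X_all = pd.get_dummies(toy[["color", "odor"]], dtype=float)

# Fit only on rows where "root" is present; string labels are accepted directly.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_all[known], toy.loc[known, "root"])

# Predict the gaps and write them back by row index.
toy.loc[~known, "root"] = knn.predict(X_all[~known])
print(toy["root"].tolist())
```

The response column of the full problem is deliberately absent from the predictors here, mirroring the leakage concern noted in the comments above.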


# Graded Concept Question #1: 
#### Why don’t we one-hot encode the response data to train the KNN model instead?

Several reasons. First, Ockham's razor: the simplest model that sufficiently captures the data is the best model. One-hot encoding the target introduces complexity, since the single target column would become several indicator columns, one per class, each holding 1 or 0. Those columns are redundant: every row has exactly one 1, so any one column is fully determined by the others (with two classes they are like two sides of a coin, the same information flipped). Further, from a cost-benefit perspective, the computational costs would increase beyond what is necessary. The model would take more time to train and the code would take longer to write (to deal with the redundancy and complexity). Also, complexity is confusing. If something can be modeled sufficiently with less complexity, the model and its meaning become approachable and understandable to a wider range of stakeholders/decision makers. Finally, it is more natural for many classification algorithms, including logistic regression, to predict a single output column of class labels. There are strategies for predicting more than one target, but they are ultimately stacked binary strategies.
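A short sketch of the shape difference may help here. The response values below are illustrative letters in the style of the stalk_root codes, not rows from the data set.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative multi-class response (letters mimic stalk_root codes).
y = np.array(["b", "c", "e", "b", "e"])

# Label encoding: one integer per class, shape (n_samples,).
label_encoder = LabelEncoder()
y_labels = label_encoder.fit_transform(y)
print(y_labels, y_labels.shape)

# One-hot encoding: one indicator column per class, shape (n_samples, n_classes).
# The encoder expects 2-D input; .toarray() densifies the default sparse output.
y_onehot = OneHotEncoder().fit_transform(y.reshape(-1, 1)).toarray()
print(y_onehot.shape)

# The integers map back to the original class names.
print(label_encoder.inverse_transform(y_labels))
```

As far as I understand scikit-learn's conventions, a 2-D indicator matrix passed as `y` is interpreted as a multilabel target, which is a different problem from the single-label one being solved here.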

# Train models on full dataset

- One-hot encode full X, label encode full y
- Using %timeit (in place of %%time), train a Random Forest model
- Evaluate RF model
- Using %timeit, train Logistic Regression model
- Evaluate LR model

In [7]:
# Define X (features) and y(target). X is all of the mushroom feature data minus the target (edible/poisonous).
X = mushroom_data_raw.drop(columns=[0])
y = mushroom_data_raw[0]

# One-hot encode all features and drop first for consistency with logistic regression later.
oh_encoder = OneHotEncoder(sparse_output=False, drop='first')
X_encoded = pd.DataFrame(oh_encoder.fit_transform(X))

# Label encode the edible/poisonous column (the target)
poisonous_edible_encode = LabelEncoder()
y_encoded = poisonous_edible_encode.fit_transform(y)

# Check
# print(y.value_counts())

In [8]:
from sklearn.model_selection import train_test_split

# Split full encoded data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.2, random_state=42)

In [9]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate an object of RandomForestClassifier class, with random_state=42 per instructions
rf_mushroom_model = RandomForestClassifier(random_state=42)

#### The instructions said to use %%time, which times a single run of the cell and reports both CPU and wall time. A GA on the DB recommended comparing CPU time rather than wall time, and another comment on the DB recommended putting only the magic command and .fit() in a single code box for each timing. I used %timeit, which repeats the statement and reports the mean and standard deviation across runs. Since nothing else is in the cell, only the model fit is measured, which makes the most sense for comparing model train times per assignment instructions.
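For readers outside a notebook, the difference between a single timed run and repeated runs can be sketched with the standard library alone. The function `busy_work` is a hypothetical stand-in for a model's `.fit()` call, and the sketch only approximates what the magic commands report.

```python
import time
import timeit

def busy_work(n=200_000):
    """Stand-in for model.fit(); any callable can be timed the same way."""
    return sum(i * i for i in range(n))

# Single run, similar in spirit to %time: wall-clock and CPU time side by side.
wall_start, cpu_start = time.perf_counter(), time.process_time()
busy_work()
wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
print(f"single run: wall {wall_elapsed:.4f} s, CPU {cpu_elapsed:.4f} s")

# Repeated runs, similar in spirit to %timeit: several measurements summarised.
runs = timeit.repeat(busy_work, number=1, repeat=5)
mean = sum(runs) / len(runs)
print(f"{len(runs)} runs: mean {mean:.4f} s, best {min(runs):.4f} s")
```

Repeating the measurement smooths out one-off delays, at the cost of refitting the model on every repetition.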

In [10]:
%timeit rf_mushroom_model.fit(X_train, y_train)

272 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Get RF predictions on test set for model evaluation
y_pred_rf = rf_mushroom_model.predict(X_test)

# Evaluate RF model according to assignment instructions
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)

print("Accuracy:", rf_accuracy)
print("Precision:", rf_precision)
print("Recall:", rf_recall)

Accuracy: 1.0
Precision: 1.0
Recall: 1.0


In [12]:
from sklearn.linear_model import LogisticRegression

# Instantiate an object of the Logistic Regression class, with random_state=42 for continuity.
log_reg_mushroom_model = LogisticRegression(random_state=42)

In [13]:
%timeit log_reg_mushroom_model.fit(X_train, y_train)

33.6 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [14]:
# Get LR predictions on test set for model evaluation
y_pred_log_model = log_reg_mushroom_model.predict(X_test)

# Evaluate LR model according to assignment instructions
log_reg_accuracy = accuracy_score(y_test, y_pred_log_model)
log_reg_precision = precision_score(y_test, y_pred_log_model)
log_reg_recall = recall_score(y_test, y_pred_log_model)

print("Accuracy:", log_reg_accuracy)
print("Precision:", log_reg_precision)
print("Recall:", log_reg_recall)

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
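For reference, the three metrics can be computed by hand from the counts of true/false positives and negatives. The labels below are made up for illustration and are not the notebook's predictions. Since LabelEncoder sorts classes alphabetically, 'e' maps to 0 and 'p' to 1, so the positive class in the cells above should be poisonous.

```python
import numpy as np

# Illustrative labels only (1 = positive class, 0 = negative class).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 1])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # predicted positive, truly positive
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # predicted positive, truly negative
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # predicted negative, truly positive
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # predicted negative, truly negative

accuracy = (tp + tn) / len(y_true)  # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that are found

print(accuracy, precision, recall)  # 0.625 0.6 0.75
```

For a poisonous/edible problem, recall on the poisonous class is arguably the metric to watch most closely, because a false negative means calling a poisonous mushroom edible.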


#### I really owe the other students who posted on the DB for getting the %%timeit magic command right. I also read the documentation, but sometimes I miss obvious things when I am focused on confronting a more complicated problem (like the magic command has to be at the top of the cell, etc...)

# Perform PCA for dimensionality reduction

- Fit PCA model, retain 95% variance
- Reduce data to Principal Components
- Note variance retained and dimensionality reduction

In [15]:
from sklearn.decomposition import PCA

# Create PCA model
pca = PCA(n_components=0.95)

# Fit and transform training data
X_train_pca = pca.fit_transform(X_train)

# Transform test data
X_test_pca = pca.transform(X_test)

# Output the original number of encoded features compared to the reduced principal components.
print("Original number of features:", X_train.shape[1])
print("Reduced number of features:", X_train_pca.shape[1])

Original number of features: 94
Reduced number of features: 38
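To show how a fractional `n_components` picks the number of components, here is a minimal sketch on synthetic data. The data (three latent factors spread over ten features, plus a little noise) and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 10 features driven by 3 latent factors plus noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# A float in (0, 1) asks PCA for the fewest components reaching that variance share.
pca_toy = PCA(n_components=0.95)
X_reduced = pca_toy.fit_transform(X_toy)

# Cumulative share of variance explained by the kept components.
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
print("components kept:", pca_toy.n_components_)
print("variance retained:", round(float(cumulative[-1]), 4))
```

The same two attributes, `n_components_` and `explained_variance_ratio_`, could be printed for the fitted `pca` object above to report the variance actually retained by the 38 components.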


# Retrain models on reduced dataset

- Repeat model training on reduced dataset
- Time training for each model

In [16]:
# Retrain RF model on reduced dataset
rf_mushroom_pca_reduction = RandomForestClassifier(random_state=42)  # 42 to match the other models

In [17]:
%timeit rf_mushroom_pca_reduction.fit(X_train_pca, y_train)

2.97 s ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
# Retrain LR model on reduced dataset
log_reg_pca_reduction = LogisticRegression(random_state=42)  # 42 to match the other models

In [19]:
%timeit log_reg_pca_reduction.fit(X_train_pca, y_train)

19 ms ± 548 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Evaluate models

- Compute accuracy, precision, recall on test set
- Compare model performance between full and reduced dataset

In [20]:
# Random Forest with PCA reduction trained again to calculate performance (my model names above are a little unwieldy for the
# code below.)
rf_pca = RandomForestClassifier(random_state=42)
rf_pca.fit(X_train_pca, y_train)

y_pred_rf_pca = rf_pca.predict(X_test_pca)

rf_pca_accuracy = accuracy_score(y_test, y_pred_rf_pca)
rf_pca_precision = precision_score(y_test, y_pred_rf_pca)
rf_pca_recall = recall_score(y_test, y_pred_rf_pca)

print("RF PCA Accuracy:", rf_pca_accuracy)
print("RF PCA Precision:", rf_pca_precision)
print("RF PCA Recall:", rf_pca_recall)

RF PCA Accuracy: 1.0
RF PCA Precision: 1.0
RF PCA Recall: 1.0


In [21]:
# Logistic Regression with PCA reduction trained again to calculate performance
log_reg_pca = LogisticRegression(random_state=42)
log_reg_pca.fit(X_train_pca, y_train)

y_pred_log_reg_pca = log_reg_pca.predict(X_test_pca)

log_reg_pca_accuracy = accuracy_score(y_test, y_pred_log_reg_pca)
log_reg_pca_precision = precision_score(y_test, y_pred_log_reg_pca) 
log_reg_pca_recall = recall_score(y_test, y_pred_log_reg_pca)

print("LogReg PCA Accuracy: {:.2f}".format(log_reg_pca_accuracy))
print("LogReg PCA Precision: {:.2f}".format(log_reg_pca_precision))
print("LogReg PCA Recall: {:.2f}".format(log_reg_pca_recall))

LogReg PCA Accuracy: 0.99
LogReg PCA Precision: 0.99
LogReg PCA Recall: 0.99


### Compare Performance of models on full vs. reduced data set
There are two principal differences. First, the RandomForest Classifier trained faster on the full mushroom data set than on the reduced one (about 272 ms versus 2.97 s), while the Logistic Regression Classifier trained faster on the reduced data set (about 19 ms versus 33.6 ms). Second, the Logistic Regression classifier scored slightly lower on the reduced data set (0.99 versus 1.0 on all three metrics). One plausible reason for the RandomForest slowdown is that the principal components are continuous, so each tree has many more candidate split thresholds to evaluate than it does with 0/1 columns. Someone on the DB mentioned that values of 1.0 are expected, which suggests these mushrooms are simply easy to classify from their recorded features rather than that something is wrong. I inspected my code as well as I could and I can't find anywhere where I mix the train and test data, so I don't think that is the cause. All in all, the models do not differ meaningfully in performance, and PCA did not reduce the computational cost overall, because it helped one algorithm and hurt the other. Had I trained with Gradient Boosting or XGBoost, the result might have been different; I don't know.
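One simple way to double-check that no row ends up in both the training and test sets is to compare row indices after the split. This is a minimal sketch on a toy frame; the names `X_toy`, `y_toy`, and the split variables are hypothetical stand-ins for the notebook's encoded data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the encoded features and target.
X_toy = pd.DataFrame({"f1": np.arange(20), "f2": np.arange(20) % 2})
y_toy = np.arange(20) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# train_test_split keeps the original row index on DataFrames, so an empty
# intersection means no row landed in both sets.
overlap = X_tr.index.intersection(X_te.index)
print("rows in both sets:", len(overlap))
print("train rows:", len(X_tr), "test rows:", len(X_te))
```

This only checks row identity. It does not detect duplicated rows with different indices, and it says nothing about steps fitted before the split (in this notebook, the encoder and the imputation), which see all rows.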

# Graded Concept Question #2: 
#### Could we train these two models by one-hot encoding the response data instead, being careful to specify that the drop parameter of the OneHotEncoder class is set to ‘first’? Why or why not?

I have answered this question in part above in Graded Concept Question #1, but I can say more here. The target for these two models is binary (edible/poisonous), so one-hot encoding it with drop='first' would leave a single 0/1 column, which carries the same information as the label-encoded target. So it could be made to work, but the encoder returns a 2-D array with one column that would have to be flattened to 1-D before fitting, which is an extra step for no gain. Without drop='first', the target would become two redundant columns, and the Logistic Regression Classifier, which expects a single 1-D target, would not accept it as is. The RandomForest Classifier can handle multiple target columns due to the nature of the algorithm (a bunch of decision trees, as a gross simplification), but it would then treat the problem as a multi-output one. So my other points still stand: label encoding is the simpler, cheaper, and more interpretable choice, and my answer here only expands on my answer for #1.
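To see concretely what OneHotEncoder with drop='first' produces for a two-class response, here is a minimal sketch. The five response values are illustrative and use the notebook's letter codes.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative two-class response using the notebook's letter codes.
y = np.array(["p", "e", "e", "p", "e"])

# Label encoding gives a 1-D array of 0/1.
y_label = LabelEncoder().fit_transform(y)

# One-hot encoding with drop='first' leaves a single indicator column,
# returned as a 2-D array of shape (n_samples, 1).
y_onehot = OneHotEncoder(drop="first").fit_transform(y.reshape(-1, 1)).toarray()

print(y_label.shape, y_onehot.shape)
# Flattening the single column recovers the label-encoded values.
print(np.array_equal(y_onehot.ravel(), y_label))
```

The values match, so the practical difference is the extra reshaping in and out of 2-D, which is why label encoding remains the more direct route for a target column.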