# Mushroom Classification - Safe to eat or deadly poison?

Samuel Opper
Dr. Dillon
CSCI-4455
12/17/18


Mushrooms are low in calories and packed with nutrients like fiber and protein, but there are over 70 deadly species, and many closely resemble their edible counterparts. So, given an unlabeled, tasty-looking mushroom, how do you know if it's safe to eat? Many people have approached this problem by training machine learning models to classify mushrooms based on their physical features, using a dataset originally from the UCI Machine Learning Repository and now on Kaggle. My plan was to compare the performance of logistic regression and random forest on this dataset, since both are very popular for binary classification. I have also compared my results to those of many Kaggle users and to a research paper published by IEEE.

The mushroom dataset consists of 8124 hypothetical samples drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Twenty-three species from two families are represented, with 22 features such as size, shape, color, population, and habitat, and each sample is labeled as poisonous or edible. The data is reasonably balanced overall, and density plots show no perfect separation on any single feature but good separation on spore print color, ring type, population, and habitat.

There are over 520 kernels on Kaggle that use this dataset, and nearly every common technique has been applied to it by someone. Most Kaggle users are successful, reporting accuracies of 95-100%, and many reach 99-100%. The research paper I found compared decision tree, naïve Bayes, and SVM classifiers on this dataset and concluded that the decision tree was the best: it tied SVM for the highest accuracy at 100% but had a faster processing speed. Their results can be seen in the table below.

<img src ="table.png">

I began my attempt by simply importing the dataset and training scikit-learn's random forest model. To my surprise, it had an accuracy of 100% on the testing set without any preprocessing or pruning. A single decision tree was just as good.

I decided to use SelectFromModel to create a subset of important features on which to train logistic regression and another decision tree. To avoid overfitting, I split the dataset multiple times: 20% is held out for the final test, 24% is used for feature selection, and the remaining 56% is used for training and validation.
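The percentages compound, because the second split is taken from the 80% that remains after the test set is removed. As a quick check of that arithmetic, here is a minimal sketch in plain Python. The helper name `split_sizes` is hypothetical, and it assumes the held-out partition is rounded up to a whole sample, which is consistent with the `(4549, 22)` training shape printed later in the notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional hold-out.

    Assumption: the held-out count is rounded up and the kept part
    gets whatever remains.
    """
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

n_total = 8124                                 # samples in the mushroom dataset
n_rest, n_test = split_sizes(n_total, 0.20)    # final test set
n_train, n_select = split_sizes(n_rest, 0.30)  # feature-selection set

for name, n in [("test", n_test), ("feature selection", n_select), ("train", n_train)]:
    print(f"{name}: {n} samples ({n / n_total:.1%} of the data)")
```

Under that rounding assumption the three partitions hold 1625, 1950, and 4549 samples.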

In [317]:
#imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
from sklearn.feature_selection import VarianceThreshold

In [318]:
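# Load the data: column 0 is the class label, the remaining 22 columns are features.
data = pd.read_csv('./mushrooms.csv')
x = data.iloc[:, 1:]
y = data.iloc[:, 0]

# Label-encode every categorical column, keeping one fitted encoder per column.
d = defaultdict(LabelEncoder)
xl = x.apply(lambda col: d[col.name].fit_transform(col))
yl = LabelEncoder().fit_transform(y)

# Hold out 20% for the final test, then reserve 30% of the rest (24% overall) for feature selection.
x1, xTest, y1, yTest = train_test_split(xl, yl, test_size=0.20)
xTrain, xfeatureselect, yTrain, yfeatureselect = train_test_split(x1, y1, test_size=0.30)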
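
The encoding step determines which class counts as "positive" later on, so it is worth spelling out. `LabelEncoder` assigns integer codes to the sorted unique values of a column. The sketch below mimics that behavior in plain Python; `label_encode` is a hypothetical helper rather than part of scikit-learn, and the sample values assume the class column uses `e` for edible and `p` for poisonous, as in the Kaggle CSV.

```python
def label_encode(values):
    """Map each distinct value to its index in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

# Assumed class labels: 'e' = edible, 'p' = poisonous.
codes, classes = label_encode(["p", "e", "e", "p", "e"])
print(classes)
print(codes)
```

Under that assumption, edible maps to 0 and poisonous to 1, so a false negative in the confusion matrix is a poisonous mushroom predicted to be edible. The codes for the feature columns follow the alphabetical order of the category letters and carry no meaningful ordering.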

In [319]:
# Fit a preliminary tree on 90% of the feature-selection split and check it on the other 10%.
xf, x2, yf, y2 = train_test_split(xfeatureselect, yfeatureselect, test_size=0.1)
dt = tree.DecisionTreeClassifier(criterion='entropy').fit(xf, yf)
dtpred = dt.predict(x2)
print('decision tree accuracy on prelim testing set: ', accuracy_score(y2, dtpred))

decision tree accuracy on prelim testing set:  1.0


In [320]:
# Reuse the preliminary tree to pick features; prefit=True skips refitting.
select = SelectFromModel(dt, prefit=True)
xTrain_s = select.transform(xTrain)
xTest_s = select.transform(xTest)
print(xTrain.shape)
print(xTrain_s.shape)

# List the names of the columns that survived selection.
print('selected features: ')
idx = select.get_support(indices=True)
for i in idx:
    print(x.columns[i], end=' ')

(4549, 22)
(4549, 5)
selected features: 
odor gill-size gill-color spore-print-color population 

### The decision tree usually selects from the following features: odor, gill-size, gill-color, stalk-root, spore-print-color, and population.
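
With its default settings and a tree-based estimator, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below reproduces that rule in plain Python. The importance values are made up for illustration and are not taken from the fitted tree, and `select_by_mean_importance` is a hypothetical helper.

```python
def select_by_mean_importance(importances):
    """Return the names of features whose importance is >= the mean importance."""
    threshold = sum(importances.values()) / len(importances)
    return [name for name, score in importances.items() if score >= threshold]

# Hypothetical importances (they sum to 1, as tree importances do).
example_importances = {
    "odor": 0.50,
    "spore-print-color": 0.25,
    "gill-size": 0.15,
    "cap-shape": 0.07,
    "veil-color": 0.03,
}
selected = select_by_mean_importance(example_importances)
print(selected)
```

Because the threshold depends on the fitted tree, and the tree depends on a random split, the set of selected features can change from run to run.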

In [321]:
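# 10-fold cross-validated logistic regression over a grid of regularization strengths,
# plus a fresh decision tree, both trained on the selected features only.
logreg = LogisticRegressionCV(
    Cs=[.001, .01, .1, 1, 10, 100], cv=10, random_state=5, penalty='l2',
    tol=.0001, scoring='precision', multi_class='ovr',
).fit(xTrain_s, yTrain)
dt2 = tree.DecisionTreeClassifier(criterion='entropy').fit(xTrain_s, yTrain)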

lpred = logreg.predict(xTest_s)
dpred = dt2.predict(xTest_s)

print('logistic regression accuracy:',accuracy_score(yTest, lpred))
print('decision tree accuracy:', accuracy_score(yTest, dpred))

logistic regression accuracy: 0.8646153846153846
decision tree accuracy: 0.9956923076923077


### Depending on the random splits, the logistic regression model scores 85% +/- 5% and the decision tree 99% +/- 1%.

In [322]:
# Flatten each 2x2 confusion matrix to [tn, fp, fn, tp]; class 1 (poisonous) is the positive class.
logcm = confusion_matrix(yTest, lpred).ravel()
dcm = confusion_matrix(yTest, dpred).ravel()
print('    tn fp fn tp')
print('d ', dcm)
print('log ', logcm)

# Sensitivity (recall) = tp / (tp + fn)
dsense = dcm[3] / (dcm[3] + dcm[2])
logsense = logcm[3] / (logcm[3] + logcm[2])

# Specificity = tn / (tn + fp)
dspec = dcm[0] / (dcm[0] + dcm[1])
logspec = logcm[0] / (logcm[0] + logcm[1])

print('sensitivity,   specificity')
print(dsense, dspec)
print(logsense, logspec)


    tn fp fn tp
d  [833   0   7 785]
log  [735  98 122 670]
sensitivity,   specificity
0.9911616161616161 1.0
0.8459595959595959 0.8823529411764706
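
The counts above are enough to recompute every headline metric by hand, which is a useful check on the reported accuracies. Here is a minimal sketch in plain Python using the counts printed in this run. The helper name `metrics_from_counts` is hypothetical, and the positive class is assumed to be poisonous (encoded as 1).

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "sensitivity": tp / (tp + fn),  # recall: share of poisonous mushrooms caught
        "specificity": tn / (tn + fp),  # share of edible mushrooms cleared
        "precision": tp / (tp + fp),    # share of "poisonous" predictions that are right
    }

# Counts copied from the confusion matrices printed in this run.
tree_metrics = metrics_from_counts(tn=833, fp=0, fn=7, tp=785)
logreg_metrics = metrics_from_counts(tn=735, fp=98, fn=122, tp=670)

for name, m in [("decision tree", tree_metrics), ("logistic regression", logreg_metrics)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
```

In this run the tree's seven errors are all false negatives, so its precision is perfect while its sensitivity is what falls short of 100%.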


## Conclusion

Overall, the decision tree is the best model for this dataset. It performs with 99.5% +/- 0.5% accuracy using only a small subset of the original 22 features (five in the run shown above), but when the decision tree classifies incorrectly, it is usually a false negative (a poisonous mushroom predicted to be edible), which is the worst kind of error for this purpose. If I had more time, I would try to figure out how to reduce those false negatives, that is, to improve the sensitivity of the tree. The highest accuracy I could get from logistic regression was about 90%, with a specificity of 91%, which is not good enough for classifying something as edible or poisonous.

The samples in this dataset were drawn up from a book about mushrooms and are not actual recordings of real-world samples, so these models won't necessarily predict accurately if you go out and measure the physical features of an unlabeled mushroom you encounter. It does suggest, however, that if you were to build a dataset of real-world measurements, and they followed the same patterns as in the mushroom book, you could train a decision tree to accurately classify real-world mushrooms.

### References 

1. Rabin, R. C. (2018, January 19). What Is the Health and Nutritional Value of Mushrooms? Retrieved December 10, 2018, from https://www.nytimes.com/2018/01/19/well/eat/what-is-the-health-and-nutritional-value-of-mushrooms.htm

2. Petruzello, M. (n.d.). 7 of the World’s Most Poisonous Mushrooms. Retrieved December 10, 2018, from https://www.britannica.com/list/7-of-the-worlds-most-poisonous-mushrooms

3. Mushrooms.csv [CSV]. (2016). Kaggle. Retrieved from https://www.kaggle.com/uciml/mushroom-classification/home

4. Preda, G. (2017, November 22). Model Comparison for Mushrooms Classification. Retrieved December 10, 2018, from https://www.kaggle.com/gpreda/model-comparison-for-mushrooms-classification

5. Aktas, B. (n.d.). Udemy Data Visualization Homework 1. Retrieved December 10, 2018, from https://www.kaggle.com/boheminsan/udemy-data-visualization-homework-1

6. Wibowo, A., Y., A., & T. (2018, April 30). Classification Algorithm for Edible Mushroom Identification [Scholarly project]. In IEEE Xplore. Retrieved November 28, 2018, from https://ieeexplore.ieee.org/document/8350746