# Classify mushrooms on whether they are edible or poisonous

### Data Set Information:

This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981), pp. 500-525. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for poison oak and ivy.

### Attribute Information:

1. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
2. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
3. cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
4. bruises?: bruises=t, no=f
5. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
6. gill-attachment: attached=a, descending=d, free=f, notched=n
7. gill-spacing: close=c, crowded=w, distant=d
8. gill-size: broad=b, narrow=n
9. gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
10. stalk-shape: enlarging=e, tapering=t
11. stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
12. stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
13. stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
14. stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
15. stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
16. veil-type: partial=p, universal=u
17. veil-color: brown=n, orange=o, white=w, yellow=y
18. ring-number: none=n, one=o, two=t
19. ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
20. spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
21. population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
22. habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

https://archive.ics.uci.edu/ml/datasets/mushroom

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression, SGDClassifier, LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, cross_val_score, cross_val_predict ,GridSearchCV
from sklearn.model_selection import validation_curve, learning_curve
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix, precision_recall_curve
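from sklearn.metrics import precision_score, recall_score, roc_curve, roc_auc_score, RocCurveDisplay

# plot_roc_curve was removed in scikit-learn 1.2; alias its replacement so later cells keep working
plot_roc_curve = RocCurveDisplay.from_estimator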

In [None]:
df_mushrooms = pd.read_csv("mushrooms.csv")

In [None]:
df_mushrooms

In [None]:
df_mushrooms.info()

In [None]:
df_mushrooms["class"].value_counts()

**&rarr; Balanced**

In [None]:
# Missing values are encoded as "?" in this data set, so count those placeholders
nans = (df_mushrooms == "?").sum().sum()

In [None]:
df_mushrooms.isna().any()

In [None]:
nans / len(df_mushrooms)

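**&rarr; ~31% of rows contain a missing value (`?`, which the attribute list above only defines for `stalk-root`). Don't drop them; replace `?` with NaN and impute later**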
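
To see which columns actually carry the `?` placeholder, the share can be computed per column rather than as one total. A minimal sketch on a made-up toy frame (the values below are illustrative and not taken from `mushrooms.csv`):

```python
import pandas as pd

# Toy stand-in for df_mushrooms; the values are made up for illustration.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["f", "n", "a", "p"],
    "stalk-root": ["?", "b", "?", "e"],
})

# Fraction of "?" placeholders per column. eq() compares exactly,
# so no regular-expression escaping is needed.
missing_share = toy.eq("?").mean()
print(missing_share)
```

Applied to the real frame, the same call would be `df_mushrooms.eq("?").mean()`, run before the placeholders are replaced with NaN.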

In [None]:
df_mushrooms.replace("?", np.nan, inplace=True)

In [None]:
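# Stratified 90/10 split so that both sets keep the class proportions
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
ix_train, ix_test = next(splitter.split(df_mushrooms, df_mushrooms["class"]))

# split() yields positional indices, so select the rows with iloc
df_train = df_mushrooms.iloc[ix_train]
df_test = df_mushrooms.iloc[ix_test]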

# df_X = df_mushrooms.drop(["class"], axis=1)
# df_y = df_mushrooms["class"]

# X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.1, random_state=42)

In [None]:
for col in df_mushrooms:
    print(col)

In [None]:
fig, axs = plt.subplots(6, 4, figsize=(25, 35))
fig.subplots_adjust(hspace=0.3, wspace=0.3)

# One count plot per feature (every column except "class"), split by class
features = df_train.columns[1:]
for ax, col in zip(axs.flat, features):
    sns.countplot(data=df_train, x=col, hue="class", ax=ax)
    ax.set_xlabel(col, fontsize=22)

    # tick.label is no longer available in recent matplotlib; tick_params sets the sizes
    ax.tick_params(axis="x", labelsize=20)
    ax.tick_params(axis="y", labelsize=15)

# Hide the unused axes of the 6x4 grid
for ax in axs.flat[len(features):]:
    ax.set_visible(False)

**Based on the plots and some research, choose these predictors:**

In [None]:
# include_cols = ["bruises", "odor", "gill-size", "gill-color"]
include_cols = ["bruises", "odor", "gill-size", "gill-color", "stalk-shape", "cap-color", "population", "habitat"]
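
One way to put numbers on what the count plots show is a row-normalised crosstab: for each value of a feature, the share of edible and poisonous samples. This is only a sketch on made-up toy data; it is not the selection procedure used here, which relied on the plots and some research.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from mushrooms.csv).
toy = pd.DataFrame({
    "class": ["p", "p", "e", "e", "e", "p"],
    "odor":  ["f", "f", "n", "n", "a", "n"],
})

# For each odor value, the fraction of samples in each class.
# Values whose rows sit close to 0 or 1 separate the classes well.
class_share = pd.crosstab(toy["odor"], toy["class"], normalize="index")
print(class_share)
```

On the real data the equivalent call would be `pd.crosstab(df_train[col], df_train["class"], normalize="index")` for each candidate column.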

In [None]:
simple_imp = SimpleImputer(strategy="most_frequent")

df_imp = simple_imp.fit_transform(df_train)
df_train = pd.DataFrame(df_imp, columns=df_train.columns)


trans_pips = []

for col in include_cols:
    trans_pips.append( (col, OneHotEncoder(categories=[df_mushrooms[col].unique()]), [col]) )

col_trans = ColumnTransformer(trans_pips)
target_trans = OneHotEncoder(categories=[["p", "e"]], drop="first", sparse=False)

**p = 0, e = 1**
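
The mapping can be checked on a handful of labels. The sketch below uses the same `categories` and `drop` settings as `target_trans`, but leaves out the dense-output flag and converts the default sparse result with `toarray()`; the labels are made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up labels in the same single-column shape that is passed to target_trans.
labels = np.array(["p", "e", "e", "p"]).reshape(-1, 1)

# drop="first" removes the column for the first listed category ("p"),
# so the remaining column is an indicator for "e".
enc = OneHotEncoder(categories=[["p", "e"]], drop="first")
encoded = enc.fit_transform(labels).toarray().ravel()
print(encoded)
```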

In [None]:
y_train = df_train["class"]
X_train = df_train.drop(["class"], axis=1)

y_test = df_test["class"]
X_test = df_test.drop(["class"], axis=1)

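# Impute the test set with the most frequent values of the training features,
# instead of refitting the imputer on the test set itself
simple_imp = SimpleImputer(strategy="most_frequent").fit(X_train)
X_test = pd.DataFrame(simple_imp.transform(X_test), columns=X_test.columns)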


X_trans = col_trans.fit_transform(X_train)
y_trans = target_trans.fit_transform(y_train.to_numpy().reshape(-1,1))
y_trans = y_trans.flatten()

logreg = LogisticRegression()
logreg.fit(X_trans, y_trans)

In [None]:
target_trans.categories_

In [None]:
# Reuse the transformers fitted on the training data; the test set is only transformed
X_trans = col_trans.transform(X_test)
y_trans = target_trans.transform(y_test.to_numpy().reshape(-1,1))
y_trans = y_trans.flatten()

pred = logreg.predict(X_trans)

accuracy_score(y_trans, pred)

In [None]:
confusion_matrix(y_trans, pred)

**A few poisonous mushrooms are falsely classified as edible.**
With edible as the positive class (e = 1), these false positives are the dangerous errors: you don't want to risk eating a poisonous mushroom. False negatives (edible mushrooms flagged as poisonous) are tolerable. **&rarr; increase precision.**
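
To make the link between the confusion matrix and precision concrete, here is a small sketch on made-up labels (1 = edible, 0 = poisonous); the numbers are illustrative, not the results of the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = edible, 0 = poisonous.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])

# For a binary problem, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# precision = tp / (tp + fp): it only reaches 1 when no poisonous
# mushroom is predicted edible (fp == 0).
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(tn, fp, fn, tp, precision, recall)
```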

In [None]:
plot_roc_curve(logreg, X_trans, y_trans)

In [None]:
pred_proba = logreg.predict_proba(X_trans)

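# Raise the decision threshold above the default 0.5 so that fewer mushrooms are predicted edible
thresh = 0.6
pred = (pred_proba[:, 1] > thresh).astype(int)
confusion_matrix(y_trans, pred)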
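
Instead of trying thresholds by hand, `precision_recall_curve` can be used to look up the lowest threshold that reaches a desired precision. The sketch below uses made-up labels and scores, and `target_precision` is a hypothetical choice; ideally such a threshold would be chosen on validation data rather than on the test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels (1 = edible) and predicted probabilities for the edible class.
y_true = np.array([0, 0, 1, 0, 1, 1, 1, 1])
scores = np.array([0.10, 0.35, 0.40, 0.62, 0.65, 0.80, 0.90, 0.95])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision has one more entry than thresholds (the final point with recall 0),
# so drop the last entry before comparing.
target_precision = 0.99  # hypothetical requirement
ok = precision[:-1] >= target_precision

# Lowest threshold that still meets the target keeps recall as high as possible.
chosen = thresholds[ok].min()

# precision_recall_curve counts a sample as positive when score >= threshold.
y_pred = (scores >= chosen).astype(int)
print(chosen, y_pred)
```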