# Mushroom safety dataset
## Introduction

Information from the [UCI dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom)

This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981, pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

### Attribute information
1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
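
The single-letter codes are compact but hard to read in tables and plots. As a minimal sketch, the mapping for one attribute can be written as a dictionary and applied with `Series.map`. The `sample` series below is made up for illustration; only the odor codes come from the list above.

```python
import pandas as pd

# Code-to-label mapping for the odor attribute, copied from the list above.
ODOR_LABELS = {
    'a': 'almond', 'l': 'anise', 'c': 'creosote', 'y': 'fishy', 'f': 'foul',
    'm': 'musty', 'n': 'none', 'p': 'pungent', 's': 'spicy',
}

# Hypothetical toy values standing in for data['odor'].
sample = pd.Series(['n', 'f', 'a', 'p'], name='odor')

# map() replaces each code with its label; a code missing from the
# dictionary would become NaN.
decoded = sample.map(ODOR_LABELS)
print(decoded.tolist())
```

The same pattern would work for any other attribute by swapping in its own dictionary.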

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
sns.set(font_scale = 2)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
data = pd.read_csv('../input/mushrooms.csv')
data.head()

In [None]:
# simple preprocessing of the dataset
target_var = data['class']
data.drop('class', axis = 1, inplace=True)
y = LabelEncoder().fit_transform(target_var) # p == 1, e == 0
data.head()

In [None]:
sns.set_style('whitegrid')
target_dist = target_var.value_counts()
# recent seaborn versions require x and y as keyword arguments
sns.barplot(x=target_dist.index, y=target_dist.values)
plt.title('Target variable distribution');

The target variable is fairly balanced, so I can proceed with the analysis without any resampling.
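
To put a number on "balanced" rather than reading it off the bar plot, the class shares can be computed directly. This is a small sketch on invented labels; `toy_target` is a hypothetical stand-in for `target_var`.

```python
import pandas as pd

# Hypothetical toy labels standing in for target_var ('e' = edible, 'p' = poisonous).
toy_target = pd.Series(['e', 'p', 'e', 'e', 'p', 'p', 'e', 'p', 'e', 'p'])

# normalize=True returns class shares instead of raw counts.
shares = toy_target.value_counts(normalize=True)
print(shares)

# Ratio of the rarest to the most common class; 1.0 means perfectly balanced.
balance_ratio = shares.min() / shares.max()
print(f'balance ratio: {balance_ratio:.2f}')
```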

## Questions to answer with the dataset

**1. Is there a quick and simple parameter that can discriminate between poisonous and non-poisonous mushrooms?**

**2. Is there any observable variation in mushroom parts (cap, gill, stalk, veil, or ring) that can discriminate between poisonous and edible mushrooms?**

**3. How can the information from question 2 be combined to increase the precision of mushroom identification?**

**4. Use the full dataset for classification with a Random Forest and identify the most significant features**

## 1. Is there a quick and simple characteristic that can discriminate between poisonous and non-poisonous mushrooms?

Some wild animals, especially small mammals, eat mushrooms, so they must be able to tell poisonous from edible ones to survive. Compared with humans, most mammals have a more developed sense of smell and less developed sight, especially when it comes to counting features and recognizing colors. As a first exploratory step I will therefore look at how well odor discriminates between edible and poisonous mushrooms, comparing the odor classes with a heatmap.

In [None]:
sns.set(style='whitegrid', context='notebook')
odor = pd.DataFrame({'p': y, 'odor': data['odor']})
# number of poisonous samples (label 1) per odor
odor_poison = odor.groupby('odor')['p'].sum()
# total number of samples per odor
odor_size = odor.groupby('odor').size()
# share of edible (E) and poisonous (P) samples per odor
odor_data = pd.DataFrame({'E': (odor_size - odor_poison) / odor_size,
                          'P': odor_poison / odor_size})
print(odor_data.head())
plt.figure(figsize=(12, 6))
# select the columns explicitly so the rows match the tick labels
# regardless of how pandas orders the dictionary keys
sns.heatmap(odor_data[['E', 'P']].T * 100, fmt='.1f', cmap='plasma', cbar=True, annot=True,
            linewidths=2, yticklabels=('Edible', 'Poisonous'))
plt.yticks(rotation=0)
plt.show()
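
The per-odor shares can also be obtained in one step with a row-normalised contingency table. The sketch below uses `pd.crosstab` on invented data (the `toy` frame is hypothetical and is not the mushroom dataset) and lists the odors whose samples all fall into a single class.

```python
import pandas as pd

# Hypothetical toy data; in the notebook these would be data['odor'] and target_var.
toy = pd.DataFrame({
    'odor':  ['a', 'a', 'f', 'f', 'n', 'n', 'n', 'n'],
    'class': ['e', 'e', 'p', 'p', 'e', 'e', 'e', 'p'],
})

# Row-normalised contingency table: share of each class within every odor.
purity = pd.crosstab(toy['odor'], toy['class'], normalize='index')
print(purity)

# Odors whose samples all belong to one class.
pure_odors = purity.index[purity.max(axis=1) == 1.0].tolist()
print(pure_odors)
```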

Some odor categories discriminate perfectly between edible and poisonous mushrooms. The only class where the split is not exact is mushrooms with no odor (odor == n). I will create a class that turns the odor value into a predicted label and evaluate it with a confusion matrix.

In [None]:
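from sklearn.base import BaseEstimator, ClassifierMixin

class OdorDecision(BaseEstimator, ClassifierMixin):
    """Rule-based classifier: odors in non_poison are predicted edible (0), all others poisonous (1)."""
    # a tuple default avoids sharing one mutable list between instances
    def __init__(self, non_poison=('a', 'l', 'n')):
        self.non_poison = non_poison
    def fit(self, X, y=None):
        # nothing to learn: the rule is fixed by non_poison
        return self
    def predict(self, X):
        return np.array([0 if i in self.non_poison else 1 for i in X])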
    
X = data['odor'].values
od = OdorDecision()
pred = od.predict(X)
print(classification_report(y, pred, target_names = ['edible', 'poisonous']))
print('Accuracy: ', accuracy_score(y, pred))
plt.figure(figsize = (6,6))
sns.heatmap(confusion_matrix(y, pred),
           cmap = 'plasma', annot = True,
            fmt = '.1f', cbar = True,linewidth = 2, 
            yticklabels=('Edible', 'Poisonous'),
            xticklabels=('Edible', 'Poisonous')
           )
plt.title('Odor rule',fontsize=20)
plt.yticks(rotation=0)
plt.xlabel('Predicted', fontsize=20)
plt.ylabel('Actual', fontsize=20)
plt.show()


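Using only the odor feature gives about 99% accuracy. Since odor == n is the only mixed class and the rule predicts it as edible, every error is an odorless poisonous mushroom classified as edible.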

## 2. Is there any observable variation in mushroom parts (cap, gill, stalk, veil, or ring) that can discriminate between poisonous and edible mushrooms?

Since humans are best at recognizing colors and counting features, it would be useful to find some phenotypical feature(s) able to discriminate between edible and poisonous mushrooms. Such features would allow mushrooms to be identified ex situ, without knowing where they were harvested.

For this step I will split the features by mushroom part and cross-validate a Random Forest classifier on each group to identify the most predictive parts.
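
The same procedure is repeated for every part below, so it can be summarised as one helper. This is a sketch under assumptions: `evaluate_part` is a hypothetical name that does not appear in the notebook, and the toy data is synthetic, built so that one column fully determines the label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split


def evaluate_part(features, labels, cv=5):
    """Cross-validated training accuracy for one group of categorical columns."""
    X = pd.get_dummies(features).values
    X_train, _, y_train, _ = train_test_split(
        X, labels, stratify=labels, test_size=0.2, random_state=101)
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    # cross_val_predict clones and refits the model on each fold,
    # so no separate fit() call is needed here.
    y_pred = cross_val_predict(clf, X_train, y_train, cv=cv)
    return accuracy_score(y_train, y_pred)


# Hypothetical toy data: 'shape' determines the label, 'color' is pure noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'shape': rng.choice(['b', 'x', 'f'], size=200),
    'color': rng.choice(['n', 'w', 'y'], size=200),
})
toy_y = (toy['shape'] == 'x').astype(int).values

shape_acc = evaluate_part(toy[['shape']], toy_y)
color_acc = evaluate_part(toy[['color']], toy_y)
print('shape:', shape_acc, 'color:', color_acc)
```

Keeping the split and model settings in one place makes it harder for the per-part scores to drift apart through copy-and-paste differences.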

### 2.1 Caps

In [None]:
from sklearn.model_selection import cross_val_predict
model_accuracy = []

caps = data[['cap-shape', 'cap-surface', 'cap-color']]
X_dum = pd.get_dummies(caps).values

X_train, X_test, y_train, y_test= train_test_split(X_dum, y, stratify=y,
                                                  test_size=0.2, random_state=101)

rfc_caps = RandomForestClassifier(n_estimators=100, random_state=42)
rfc_caps.fit(X_train, y_train)

# out-of-fold predictions on the training split (5-fold CV)

y_train_pred = cross_val_predict(rfc_caps, X_train, y_train, cv = 5)

print(classification_report(y_train, y_train_pred, target_names = ['edible', 'poisonous']))
print('Accuracy: ', accuracy_score(y_train, y_train_pred))
model_accuracy.append(accuracy_score(y_train, y_train_pred))
plt.figure(figsize = (6,6))
sns.heatmap(confusion_matrix(y_train, y_train_pred),
           cmap = 'plasma', annot = True,
            fmt = '.1f', cbar = True,linewidth = 2, 
            yticklabels=('Edible', 'Poisonous'),
            xticklabels=('Edible', 'Poisonous')
           )
plt.title('Mushroom Cap',fontsize=20)
plt.yticks(rotation=0)
plt.xlabel('Predicted', fontsize=20)
plt.ylabel('Actual', fontsize=20)
plt.show()

### 2.2 Gills

In [None]:
gills = data[['gill-attachment', 'gill-spacing', 'gill-size', 'gill-color']]

X_gills = pd.get_dummies(gills).values

X_train, X_test, y_train, y_test = train_test_split(X_gills, y, stratify=y,
                                                   test_size = 0.2, random_state = 101)

rfc_gills = RandomForestClassifier(n_estimators=100, random_state=42)

rfc_gills.fit(X_train, y_train)

# out-of-fold predictions on the training split (5-fold CV)
y_train_pred = cross_val_predict(rfc_gills, X_train, y_train, cv = 5)
print(classification_report(y_train, y_train_pred))
print('Accuracy: ', accuracy_score(y_train, y_train_pred))
model_accuracy.append(accuracy_score(y_train, y_train_pred))
plt.figure(figsize = (6,6))
sns.heatmap(confusion_matrix(y_train, y_train_pred),
           cmap = 'plasma', annot = True,
            fmt = '.1f', cbar = True,linewidth = 2, 
            yticklabels=('Edible', 'Poisonous'),
            xticklabels=('Edible', 'Poisonous')
           )
plt.title('Mushroom Gills',fontsize=20)
plt.yticks(rotation=0)
plt.xlabel('Predicted', fontsize=20)
plt.ylabel('Actual', fontsize=20)
plt.show()

Gills are clearly a better predictor than caps for this problem.

### 2.3 Stalk

In [None]:
stalks = data[['stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
       'stalk-surface-below-ring', 'stalk-color-above-ring',
       'stalk-color-below-ring']]

X_stalks = pd.get_dummies(stalks).values

X_train, X_test, y_train, y_test = train_test_split(X_stalks, y, stratify=y,
                                                   test_size=0.2, random_state=101)

rfc_stalks = RandomForestClassifier(n_estimators=100, random_state=42)
rfc_stalks.fit(X_train, y_train)

# out-of-fold predictions on the training split (5-fold CV)
y_train_pred = cross_val_predict(rfc_stalks, X_train, y_train, cv = 5)
print(classification_report(y_train, y_train_pred))
print('Accuracy: ', accuracy_score(y_train, y_train_pred))
model_accuracy.append(accuracy_score(y_train, y_train_pred))
plt.figure(figsize = (6,6))
sns.heatmap(confusion_matrix(y_train, y_train_pred),
           cmap = 'plasma', annot = True,
            fmt = '.1f', cbar = True,linewidth = 2, 
            yticklabels=('Edible', 'Poisonous'),
            xticklabels=('Edible', 'Poisonous')
           )
plt.title('Mushroom Stalks',fontsize=20)
plt.yticks(rotation=0)
plt.xlabel('Predicted', fontsize=20)
plt.ylabel('Actual', fontsize=20)
plt.show()

Stalks are the most predictive part so far.

### 2.4 Veil

In [None]:
veil = data[['veil-type', 'veil-color']]

X_veil = pd.get_dummies(veil).values

X_train, X_test, y_train, y_test = train_test_split(X_veil, y, stratify=y,
                                                   test_size=0.2, random_state=101)

rfc_veil = RandomForestClassifier(n_estimators=100, random_state=42)
rfc_veil.fit(X_train, y_train)

# out-of-fold predictions on the training split (5-fold CV)
y_train_pred = cross_val_predict(rfc_veil, X_train, y_train, cv = 5)
print(classification_report(y_train, y_train_pred))
print('Accuracy: ', accuracy_score(y_train, y_train_pred))
model_accuracy.append(accuracy_score(y_train, y_train_pred))
plt.figure(figsize = (6,6))
sns.heatmap(confusion_matrix(y_train, y_train_pred),
           cmap = 'plasma', annot = True,
            fmt = '.1f', cbar = True,linewidth = 2, 
            yticklabels=('Edible', 'Poisonous'),
            xticklabels=('Edible', 'Poisonous')
           )
plt.title('Mushroom Veil',fontsize=20)
plt.yticks(rotation=0)
plt.xlabel('Predicted', fontsize=20)
plt.ylabel('Actual', fontsize=20)
plt.show()

The veil features give high precision but low recall.
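
High precision with low recall means the model is rarely wrong when it predicts a class, but it misses most members of that class. A small worked example with made-up counts (not results from this notebook) shows how both values follow from a confusion matrix.

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual, columns = predicted),
# class order [edible, poisonous]. The counts are invented for illustration.
cm = np.array([[50, 0],
               [45, 5]])

# Treat "poisonous" as the positive class.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted-poisonous samples that are poisonous
recall = tp / (tp + fn)      # share of poisonous samples that were found

print(f'precision: {precision:.2f}, recall: {recall:.2f}')
```

For a safety-oriented task such as this one, low recall on the poisonous class is the costlier failure, because missed samples are poisonous mushrooms labelled edible.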
 
### 2.5 Rings

In [None]:
rings = data[['ring-number','ring-type']]

X_ring = pd.get_dummies(rings)

X_train, X_test, y_train, y_test = train_test_split(X_ring, y, stratify=y,
                                                   test_size=0.2, random_state=101)

rfc_ring = RandomForestClassifier(n_estimators=100, random_state=42)
rfc_ring.fit(X_train, y_train)

y_train_pred = cross_val_predict(rfc_ring, X_train, y_train, cv = 5)
print(classification_report(y_train, y_train_pred))
print('Accuracy: ', accuracy_score(y_train, y_train_pred))
model_accuracy.append(accuracy_score(y_train, y_train_pred))
plt.figure(figsize = (6,6))
sns.heatmap(confusion_matrix(y_train, y_train_pred),
           cmap = 'plasma', annot = True,
            fmt = '.1f', cbar = True,linewidth = 2, 
            yticklabels=('Edible', 'Poisonous'),
            xticklabels=('Edible', 'Poisonous')
           )
plt.title('Mushroom Rings',fontsize=20)
plt.yticks(rotation=0)
plt.xlabel('Predicted', fontsize=20)
plt.ylabel('Actual', fontsize=20)
plt.show()

### Overall accuracy for the different mushroom parts



In [None]:
part_names=['Caps', 'Gills','Stalks', 'Veils', 'Rings']

sns.set(font_scale = 2)
sns.set_style('whitegrid')
# recent seaborn versions require x and y as keyword arguments
sns.barplot(x=part_names, y=model_accuracy)
plt.ylabel('Accuracy Score')
plt.title('Mushroom Parts Evaluation', fontsize = 20);


Stalks, gills and rings appear to be the most accurate parts, so I will use them for the combined learner.

## 3. How can the information from question 2 be combined to increase the precision of mushroom identification?

I will use two approaches:

* Combine stalks, gills and rings into a single dataset and fit one model: **combined approach**
* Use 3 different models (one for each mushroom part) and apply a majority vote classifier on the 3 models: **meta ensemble**

### Combined approach



In [None]:
combined_data = data[['gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
                     'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
                      'stalk-surface-below-ring', 'stalk-color-above-ring',
                      'stalk-color-below-ring','ring-number','ring-type']]

X_comb = pd.get_dummies(combined_data)

X_train, X_test, y_train, y_test = train_test_split(X_comb, y, stratify=y,
                                                   test_size=0.2, random_state=101)

rf_comb = RandomForestClassifier(n_estimators=100, random_state=42)
rf_comb.fit(X_train, y_train)

y_pred = rf_comb.predict(X_test)
print(classification_report(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
plt.figure(figsize = (6,6))
sns.heatmap(confusion_matrix(y_test, y_pred),
           cmap = 'plasma', annot = True,
            fmt = '.1f', cbar = True,linewidth = 2, 
            yticklabels=('Edible', 'Poisonous'),
            xticklabels=('Edible', 'Poisonous')
           )
plt.title('Confusion Matrix',fontsize=20)
plt.yticks(rotation=0)
plt.xlabel('Predicted', fontsize=20)
plt.ylabel('Actual', fontsize=20)
plt.show()

In [None]:
combined_names = ['Gills', 'Stalks', 'Rings']
n = [part_names.index(i) for i in combined_names]
model_acc = [model_accuracy[i] for i in n]

model_acc.append(accuracy_score(y_test, y_pred))
combined_names.append('Combined')
# recent seaborn versions require x and y as keyword arguments
sns.barplot(x=combined_names, y=model_acc)
plt.ylabel('Accuracy Score')
plt.title('Mushroom Parts Evaluation', fontsize = 20);

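Combining gills, stalks and rings in a single dataset gives a slight improvement. Note that the combined score is measured on the held-out test split, while the per-part scores are cross-validated on the training split, so the comparison is only approximate.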

### Meta ensemble

I will fit the same model on three different datasets and implement a majority vote classifier.
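
To make the voting step concrete, here is a toy example with invented predictions from three models. It is a sketch rather than the notebook's implementation: instead of `scipy.stats.mode` it counts the votes for class 1, which is equivalent for binary labels and an odd number of models.

```python
import numpy as np

# Hypothetical predictions from three models for five samples
# (0 = edible, 1 = poisonous); one row per model.
preds_per_model = np.array([
    [0, 1, 1, 0, 1],   # model trained on gills
    [0, 1, 0, 0, 1],   # model trained on stalks
    [1, 1, 0, 0, 0],   # model trained on rings
])

# With binary labels and an odd number of voters, the majority class is 1
# exactly when more than half of the models predict 1.
votes_for_one = preds_per_model.sum(axis=0)
majority = (votes_for_one > preds_per_model.shape[0] / 2).astype(int)
print(majority)
```

A single dissenting model is outvoted, so the ensemble only helps when the three models tend to make their mistakes on different samples.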

In [None]:
# create subsets
from scipy.stats import mode
gills = data[['gill-attachment', 'gill-spacing', 'gill-size', 'gill-color']]
stalks = data[['stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
                      'stalk-surface-below-ring', 'stalk-color-above-ring',
                      'stalk-color-below-ring']]
rings = data[['ring-number','ring-type']]

rf_gills = RandomForestClassifier(n_estimators=100, random_state=42)
rf_stalks = RandomForestClassifier(n_estimators=100, random_state=42)
rf_rings = RandomForestClassifier(n_estimators=100, random_state=42)



In [None]:
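# majority vote over models trained on different feature subsets

def majorityClassifier(model, data, y):
    ys = []
    pred_class = []
    for clf, d in zip(model, data): # iterate over models and their datasets in parallel
        X_dum = pd.get_dummies(d)
        # same y, stratification and random_state give identical row splits for every subset
        X_train, X_test, y_train, y_test = train_test_split(X_dum, y, stratify=y,
                                                   test_size=0.2, random_state=101)
        ys.append(y_test)
        clf.fit(X_train, y_train)
        preds = clf.predict(X_test)
        pred_class.append(preds)
    pred_class = np.array(pred_class).T # shape: (n_samples, n_models)
    # most frequent label per row; ravel() gives a 1-D array across SciPy versions
    maj_vote = np.ravel(mode(pred_class, axis = 1)[0])
    return maj_vote,ys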


In [None]:
preds,y_all = majorityClassifier([rf_gills, rf_stalks, rf_rings],
                      [gills, stalks, rings], y)

In [None]:
# use the test labels returned by the function instead of the global y_test
print(classification_report(y_all[0], preds))

The meta ensemble approach actually gives worse results than combining the data.

## To be continued ..