# Mushroom Classification

    a rather short analysis!

Welcome! The motivation for this dataset is already well defined, namely:

1. What types of machine learning models perform best on this dataset?
2. Which features are most indicative of a poisonous mushroom?

So, let's begin!

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [10]:
#Read the data using pandas
data=pd.read_csv("../input/mushrooms.csv")

Let's see what it looks like!

In [11]:
data.head()

To learn more about the data, let's use the *describe* method of pandas.

In [12]:
data.describe().transpose()

There are lots of features! Also, there are 8124 entries, but the good news is that...

In [13]:
total_null_values = sum(data.isnull().sum())
print(total_null_values)

There are no null values! So, we can skip imputing or dropping missing entries. All we now need to do is encode the data in such a way that our algorithms will understand it. The values are categorical labels, so we encode them as integers using Scikit-Learn's LabelEncoder.

In [14]:
from sklearn.preprocessing import LabelEncoder
Enc = LabelEncoder()
data_tf = data.copy()
for i in data.columns:
    data_tf[i]=Enc.fit_transform(data[i])
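
To make the encoding step concrete, here is a minimal sketch on a made-up toy frame (the rows below are hypothetical, not taken from mushrooms.csv). `LabelEncoder` sorts the distinct labels of a column and numbers them from 0. Keeping one encoder per column in a dict (the `encoders` name is my own addition, the cell above reuses a single encoder) is one way to keep the mapping recoverable later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows standing in for the real dataset
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["f", "n", "a", "n"],
})

encoders = {}  # one fitted encoder per column, so codes can be decoded later
toy_tf = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    toy_tf[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ holds the sorted labels; the position of a label is its integer code
class_mapping = {str(label): code for code, label in enumerate(encoders["class"].classes_)}
odor_decoded = [str(v) for v in encoders["odor"].inverse_transform(toy_tf["odor"])]

print(class_mapping)
print(odor_decoded)
```

With a per-column encoder, `inverse_transform` turns the integer codes back into the original letters, which is handy when reading feature importances or plots.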

Every column, including the target, is now encoded as integers (`LabelEncoder` sorts the labels, so in `class` edible `e` becomes 0 and poisonous `p` becomes 1). Let's separate the features from the target.

In [15]:
X = data_tf.drop(['class'], axis=1)
Y = data_tf['class']

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state = 42)

In [17]:
X_train.head(5)

That looks good so far.

## Classifiers

Since there are only two classes, we can use some standard Binary Classifiers. We can start with Logistic Regression, but I'll use that just to contrast with the RandomForestClassifier. Other Kernels have clearly shown that RandomForestClassifiers and Decision Trees are a better fit with this data set. 

In [18]:
from sklearn.linear_model import LogisticRegression

In [19]:
from sklearn.metrics import accuracy_score
log_clf = LogisticRegression()
log_clf.fit(X_train, Y_train)
LR_pred=log_clf.predict(X_test)

accuracy_score(Y_test, LR_pred)

In [20]:
from sklearn.metrics import confusion_matrix
confusion_matrix(Y_test, LR_pred)  # true labels first, then predictions

Whoa! That's a lot of False Positives and False Negatives! Yes, these are just the default settings, but can we improve somewhere else? Let's try to cross validate this.
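
When reading these matrices, it helps to remember that scikit-learn's `confusion_matrix` takes the true labels first and the predictions second, with rows for the true class and columns for the predicted class. Here is a small sketch with made-up labels (0 = edible, 1 = poisonous, both hypothetical) showing how to pull the four counts out, and what happens when the arguments are swapped.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = edible, 1 = poisonous
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))

# Passing the predictions first gives the transposed matrix,
# so False Positives and False Negatives trade places
swapped = confusion_matrix(y_pred, y_true)
print(swapped)
```

For a poisonous-mushroom detector, the False Negatives (poisonous predicted as edible) are the costly ones, so it is worth being sure which cell is which.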

In [21]:
from sklearn.model_selection import cross_val_predict

y_pred_cv = cross_val_predict(log_clf, X_test, Y_test, cv=5)

In [22]:
confusion_matrix(Y_test, y_pred_cv)  # true labels first, then predictions

In [23]:
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(log_clf, X_train, Y_train, cv = 30)
print(cv_score)
print("Accuracy: %0.2f (+/- %0.2f)" % (cv_score.mean(), cv_score.std() * 2))

It's at best 95% accurate. Can Random Forest be better than this?
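
For anyone curious about what `cross_val_score` does under the hood, here is a rough sketch on synthetic stand-in data (`make_classification` is my assumption here, the notebook itself uses the encoded mushroom features). For a classifier and an integer `cv`, scikit-learn splits the data with an unshuffled `StratifiedKFold`, fits a fresh copy of the model on each training part, and scores it on the held-out part.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data, not the mushroom dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns accuracy for classifiers
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

manual_scores = np.array(manual_scores)
auto_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print("Accuracy: %0.2f (+/- %0.2f)" % (manual_scores.mean(), manual_scores.std() * 2))
print(np.allclose(manual_scores, auto_scores))
```

The "+/-" part is just twice the standard deviation of the per-fold accuracies, so it gives a rough feel for how much the score moves between folds.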

### Random Forest Classifier

In [24]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier()
rnd_clf.fit(X_train, Y_train)
Y_pred = rnd_clf.predict(X_test)

In [25]:
accuracy_score(Y_test, Y_pred)

What! 100% accuracy? What does the confusion matrix look like?

In [26]:
confusion_matrix(Y_test, Y_pred)  # true labels first, then predictions

It really works well! All predictions are correct! We should still cross validate this, this time over the entire set.

In [27]:
scores = cross_val_score(rnd_clf, X, Y, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

The Random Forest Classifier (RFC) is really good! Compared to Logistic Regression, it is a clear improvement: it makes all the predictions correctly. The question now is whether the RFC is overfitting. To check, we can tune the model using Grid Search with 10-Fold Cross Validation to get the best parameters over:

1. n_estimators
2. max_features

In [28]:
from sklearn.model_selection import GridSearchCV

In [29]:
param_grid = [
{'n_estimators': [3, 10, 30, 50, 100], 'max_features': [2, 4, 6, 8]},
]

grid_search = GridSearchCV(rnd_clf, param_grid, cv=10, scoring='f1')

In [30]:
grid_search.fit(X, Y)

In [31]:
grid_search.best_params_
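
Besides `best_params_`, it can be useful to look at the whole grid, since several settings may score almost the same. Below is a small sketch on synthetic stand-in data (again `make_classification`, which is my assumption, with a smaller grid and `cv=3` to keep it quick) that turns `cv_results_` into a table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the mushroom dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# A reduced, illustrative grid
param_grid = {"n_estimators": [3, 10], "max_features": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring="f1")
search.fit(X_demo, y_demo)

# One row per parameter combination, with the mean and spread of the fold scores
results = pd.DataFrame(search.cv_results_)[
    ["param_n_estimators", "param_max_features", "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False))
print(search.best_params_, search.best_score_)
```

If the top rows differ by less than their `std_test_score`, the "best" parameters are probably not meaningfully better than their neighbours.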

Let's use these to find out which features are the important ones.

## Feature Importance

In [32]:
feat_score=[]
for name, score in zip(X_train.columns, grid_search.best_estimator_.feature_importances_):
    feat_score.append([name, score])

In [33]:
feat_score.sort(reverse=True, key= lambda x:x[1])
for char in feat_score:
    print(char)

In [34]:
grid_search.best_estimator_.fit(X_train, Y_train)
y_pred = grid_search.best_estimator_.predict(X_test)
confusion_matrix(Y_test, y_pred)  # true labels first, then predictions

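I have seen that the feature importances obtained with f1 scoring (f1 is the harmonic mean of precision and recall, so it depends on the True Positives, False Positives, and False Negatives) are a bit unstable. By unstable, I mean that across different runs I get different feature importances. Part of this is expected, since the forest is built with random sampling and no `random_state` is fixed here. One can try out different scoring systems. Maybe use a single decision tree?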
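
To make the f1 score concrete, here is a quick check on made-up labels (hypothetical, 1 = poisonous) that computes it by hand from precision and recall and compares it with `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 0 = edible, 1 = poisonous
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print(precision, recall)
print(f1_manual, f1_sklearn)
```

Written in raw counts, the same quantity is 2·TP / (2·TP + FP + FN), which is why True Negatives never enter the score.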

In [35]:
param_grid = [
{'n_estimators': [3, 10, 30, 50, 100], 'max_features': [2, 4, 6, 8]},
]

grid_search = GridSearchCV(rnd_clf, param_grid, cv=10, scoring='roc_auc')
#using ROC Area Under Curve to evaluate this time round

In [36]:
grid_search.fit(X, Y)
grid_search.best_params_

In [37]:
feat_score=[]
for name, score in zip(X_train.columns, grid_search.best_estimator_.feature_importances_):
    feat_score.append([name, score])

In [38]:
feat_score.sort(reverse=True, key= lambda x:x[1])
for char in feat_score:
    print(char)
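
If the ranking keeps changing between runs, one thing worth trying is fixing the forest's `random_state`. The sketch below uses synthetic stand-in data (my assumption, not the mushroom features) and a hypothetical helper `importances` to show that the same seed reproduces the same importances, while a different seed generally shifts them a little.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the mushroom dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)

def importances(seed):
    """Fit a small forest with the given seed and return its feature importances."""
    forest = RandomForestClassifier(n_estimators=30, random_state=seed)
    forest.fit(X_demo, y_demo)
    return forest.feature_importances_

same_seed_a = importances(seed=42)
same_seed_b = importances(seed=42)
other_seed = importances(seed=7)

# Same seed: identical importances
print(np.allclose(same_seed_a, same_seed_b))
# Different seed: largest change in any single importance
print(np.abs(same_seed_a - other_seed).max())
```

In the notebook this would mean passing `random_state` when creating `rnd_clf`; more trees also tend to make the importances steadier.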

## Some Important Features

So, how do we identify a poisonous mushroom? Three of the most important characteristics of mushrooms are their odor, gill color, and gill size. Let's look at them in a bit more detail.

### Odor

In [39]:
odor_labels = data['odor'].value_counts().axes[0]
edible_o =[]
poi_o = []
N =0
for odor in odor_labels:
    size = len(data[data['odor'] == odor].index)
    edibles = len(data[(data['odor'] == odor) & (data['class'] == 'e')].index)
    edible_o.append(edibles)
    poi_o.append(size-edibles)
    N=N+1

#Plotting
ind = np.arange(N)
width = 0.35

fig, ax = plt.subplots(figsize=(12,8))

rects1 = ax.bar(ind, poi_o, width, color='r')
rects2 = ax.bar(ind + width, edible_o, width, color='y')

# Labels and Ticks along the axes.
ax.set_ylabel('Instances')
ax.set_title('Poisonous and Edible Mushrooms by their Odors')
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(('None', 'Foul', 'Fishy', 'Spicy', 'Almond', 'Anise', 'Pungent', 'Creosote', 'Musty'))

ax.legend((rects1[0], rects2[0]), ('Poisonous', 'Edible'))


def autolabel(rects):
    # To plot the labels on top of the bars.
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)

plt.show()

Clearly, the odor of a mushroom is a good indicator of whether it will be edible or otherwise. If the mushroom gives off an odor that seems concerning, it probably is poisonous! As one can clearly see from the graph, if a mushroom has a smell and it isn't almond-like or anise-like, it probably is poisonous!
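
The counting loop used for this plot can also be written more compactly with `pd.crosstab`. Here is a sketch on a made-up toy frame (hypothetical rows, and `counts_by_class` is a helper name I'm introducing, not something from the notebook) that returns the labels together with the counts, so the tick labels never have to be typed in by hand.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook this would be the `data` frame
toy = pd.DataFrame({
    "odor":  ["n", "n", "n", "f", "f", "a", "n", "f"],
    "class": ["e", "e", "p", "p", "p", "e", "e", "p"],
})

def counts_by_class(frame, feature):
    """Return (labels, edible counts, poisonous counts), most frequent label first."""
    table = pd.crosstab(frame[feature], frame["class"])
    # Make sure both class columns exist even if one never occurs
    table = table.reindex(columns=["e", "p"], fill_value=0)
    # Same ordering as value_counts(), which the plots rely on
    order = frame[feature].value_counts().index
    table = table.loc[order]
    return list(table.index), table["e"].tolist(), table["p"].tolist()

labels, edible_o, poi_o = counts_by_class(toy, "odor")
print(labels)
print(edible_o)
print(poi_o)
```

Because `labels` comes out in the same order as the counts, it could be mapped to readable names and passed to `ax.set_xticklabels`, which avoids mismatches if two categories tie and `value_counts` orders them differently.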

### Gill Size

Gill size is the next feature of importance. I'll do pretty much the same thing.

In [40]:
gillsize_labels = data['gill-size'].value_counts().axes[0]

edible_o =[]
poi_o = []
N =0
for gs in gillsize_labels:
    size = len(data[data['gill-size'] == gs].index)
    edibles = len(data[(data['gill-size'] == gs) & (data['class'] == 'e')].index)
    edible_o.append(edibles)
    poi_o.append(size-edibles)
    N=N+1
    
#Plotting
ind = np.arange(N)
width = 0.35

fig, ax = plt.subplots(figsize=(12,8))

rects1 = ax.bar(ind, poi_o, width, color='r')
rects2 = ax.bar(ind + width, edible_o, width, color='y')

# Labels and Ticks along the axes.
plt.ylim(0,4500)
ax.set_ylabel('Instances')
ax.set_title('Poisonous and Edible Mushrooms by the size of their Gills')
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(('Broad','Narrow'))

ax.legend((rects1[0], rects2[0]), ('Poisonous', 'Edible'))


def autolabel(rects):
    #To plot the labels on top of the bars.
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)

plt.show()

Similarly, one can see that mushrooms with narrow gills are likely to be poisonous. That's not to say broad-gilled mushrooms are all okay: 30.15% of them in this dataset are poisonous.
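
A percentage like this can be read straight off a row-normalised crosstab. The sketch below uses made-up toy counts (hypothetical, chosen for round numbers, not the real dataset), where `normalize="index"` divides each row by its total.

```python
import pandas as pd

# Hypothetical toy data: 10 broad-gilled and 5 narrow-gilled mushrooms
toy = pd.DataFrame({
    "gill-size": ["b"] * 10 + ["n"] * 5,
    "class": ["e"] * 7 + ["p"] * 3 + ["e"] * 1 + ["p"] * 4,
})

# Each row sums to 1: the share of edible / poisonous within that gill size
share = pd.crosstab(toy["gill-size"], toy["class"], normalize="index")

broad_poisonous_pct = share.loc["b", "p"] * 100
narrow_poisonous_pct = share.loc["n", "p"] * 100
print(share)
print("%.2f%% of broad-gilled toy mushrooms are poisonous" % broad_poisonous_pct)
```

On the real data, the same call with `data['gill-size']` and `data['class']` should give the shares behind the plot.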

### Gill Color

In [41]:
gillcolor_labels = data['gill-color'].value_counts().axes[0]
edible_o =[]
poi_o = []
N =0
for gc in gillcolor_labels:
    size = len(data[data['gill-color'] == gc].index)
    edibles = len(data[(data['gill-color'] == gc) & (data['class'] == 'e')].index)
    edible_o.append(edibles)
    poi_o.append(size-edibles)
    N=N+1
    
#Plotting
ind = np.arange(N)
width = 0.35

fig, ax = plt.subplots(figsize=(12,8))

rects1 = ax.bar(ind, poi_o, width, color='r')
rects2 = ax.bar(ind + width, edible_o, width, color='y')

# Labels and Ticks along the axes.
plt.ylim(0,2000)
ax.set_ylabel('Instances')
ax.set_title('Poisonous and Edible Mushrooms by the color of their Gills')
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(('Buff', 'Pink', 'White', 'Brown', 'Gray', 'Chocolate', 'Purple', 'Black', 'Red', 'Yellow', 'Orange','Green'))

ax.legend((rects1[0], rects2[0]), ('Poisonous', 'Edible'))


def autolabel(rects):
    #To plot the labels on top of the bars.
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)

plt.show()

So, based on these factors, I can tell you that you should stay away from *foul-smelling* mushrooms with *chocolate-colored, narrow gills*. On the other hand, *odorless* mushrooms with *brown, broad gills* do quite well for cooking purposes.

Thank you for reading! Feel free to critique me in the comments. I will try and update this later if possible.