# Problem statement

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled
mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide
to North American Mushrooms (1981). Each species is identified as definitely edible, definitely
poisonous, or of unknown edibility and not recommended. This last class was combined
with the poisonous one.

-  **What types of machine learning models perform best on this dataset?** 
-  **Which features are most indicative of a poisonous mushroom?**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import seaborn as sns
sns.set(color_codes=True)

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

In [None]:
df = pd.read_csv('../input/mushrooms.csv')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
# This dataset is ready for exploration, no data cleaning required 
df.info()

# Class Distribution

Imbalanced data refers to a classification problem in which the classes are not represented
equally. Imagine a case where 99% of the data belongs to one class: a model can ignore the
remaining class entirely and still report very high accuracy.
A small difference between class sizes usually does not matter, and that is the case in the mushrooms dataset.
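
To make the 99% example concrete, here is a minimal sketch with made-up labels (not the mushroom data). A hypothetical baseline that always predicts the majority class reaches 99% accuracy while never detecting the minority class.

```python
# Hypothetical labels: 990 samples of class 0 and 10 of class 1 (made-up data)
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class
y_pred = [0] * len(y_true)

# Fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: how many of the true 1s were found
found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
minority_recall = found / sum(y_true)

print("Accuracy: {:.2%}".format(accuracy))
print("Minority recall: {:.2%}".format(minority_recall))
```

This is why accuracy alone can mislead on heavily skewed data, and why checking the class counts first is worthwhile.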

In [None]:
# Class Distribution
sns.countplot(x="class", data=df, palette="Greens_d")

In [None]:
class_dist = df['class'].value_counts()

print(class_dist)

In [None]:
# Index by label rather than by position: value_counts() is keyed by 'e' and 'p'
prob_e = class_dist['e'] / class_dist.sum()
prob_p = 1 - prob_e
print(prob_e)
print(prob_p)

# Feature Transformation

Now we have to convert all categorical variables to integers using LabelEncoder from the awesome sklearn lib.  
For example, the class p will be mapped to 1 and e to 0.
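
As a rough sketch of what that mapping looks like, the snippet below mimics the behaviour in plain Python: LabelEncoder sorts the distinct labels and numbers them from 0, which is why e becomes 0 and p becomes 1. The helper `encode_labels` and the sample list are hypothetical stand-ins for illustration, not part of sklearn or of the dataset.

```python
def encode_labels(values):
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Made-up sample of the 'class' column
sample = ["p", "e", "e", "p", "e"]
encoded, mapping = encode_labels(sample)

print(mapping)
print(encoded)
```

The integers are arbitrary identifiers: their order follows the alphabet, not any property of the mushrooms.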

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder=LabelEncoder()
for col in df.columns:
    df[col] = labelencoder.fit_transform(df[col])
 
df.head()

# Pearson Correlation Heatmap

Pearson correlation measures the linear relationship between pairs of features, which is a simple way to get an intuition of 
how each feature relates to the target variable. The correlation matrix can be easily plotted using a Seaborn heatmap, where strong relationships stand out in sharper colors.
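
For intuition about what each cell of the heatmap contains, here is a minimal sketch that computes Pearson's r for short made-up columns in plain Python. The function `pearson_r` and the sample values are illustrative assumptions. In the notebook itself, `DataFrame.corr()` does this for every pair of columns.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up columns: one rises with the feature, one falls with it
feature = [0, 1, 2, 3, 4]
same_direction = [1, 3, 5, 7, 9]
opposite = [9, 7, 5, 3, 1]

print(pearson_r(feature, same_direction))
print(pearson_r(feature, opposite))
```

Values near +1 or -1 indicate a strong linear relationship in either direction, and values near 0 indicate little linear relationship.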

In [None]:
colormap = plt.cm.viridis
plt.figure(figsize=(15,15))
plt.title('Pearson Correlation of Features', size=15)

sns.heatmap(df.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

In [None]:
# sns.pairplot(df)

# Classification Models

In [None]:
X = df.drop('class', axis=1)
y = df['class']
RS = 123

# Split dataframe into training and test/validation set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RS)

In [None]:
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from xgboost import XGBClassifier
import xgboost

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    XGBClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis()]

In [None]:
# Logging for Visual Comparison
log_cols=["Classifier", "Accuracy", "Log Loss"]
log = pd.DataFrame(columns=log_cols)

for clf in classifiers:
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__
    
    print("="*30)
    print(name)
    
    print('****Results****')
    test_predictions = clf.predict(X_test)
    acc = accuracy_score(y_test, test_predictions)
    print("Accuracy: {:.4%}".format(acc))
    
    test_probabilities = clf.predict_proba(X_test)
    ll = log_loss(y_test, test_probabilities)
    print("Log Loss: {}".format(ll))
    
    log_entry = pd.DataFrame([[name, acc*100, ll]], columns=log_cols)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    log = pd.concat([log, log_entry], ignore_index=True)
    
print("="*30)

### Multiple classifiers

Let's compare the classifiers evaluated above side by side. After this, we can pick a single model and
tune its parameters.
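
The tuning step is not carried out in this notebook. As a hedged sketch of how it might look, the snippet below runs a small grid search over a decision tree with sklearn's `GridSearchCV`. The synthetic data from `make_classification` and the parameter grid are assumptions chosen for illustration, not values taken from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); swap in X_train, y_train in the notebook
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=123)

# Hypothetical grid: a few tree depths and leaf sizes to compare
param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(
    DecisionTreeClassifier(random_state=123),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV accuracy: {:.4f}".format(search.best_score_))
```

Cross-validating on the training split keeps the test set untouched for a final check of the chosen model.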

In [None]:
sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")

plt.xlabel('Accuracy %')
plt.title('Classifier Accuracy')
plt.show()

sns.set_color_codes("muted")
sns.barplot(x='Log Loss', y='Classifier', data=log, color="g")

plt.xlabel('Log Loss')
plt.title('Classifier Log Loss')
plt.show()

# Conclusion 1: Classification Model
Clearly, tree-based models are winning here, even the simplest one (DecisionTreeClassifier). If I had to pick a classifier, 
I would pick the Decision Tree Classifier, as it is the simplest of the three (Decision Tree, Random Forest and Boosted Trees) and would run well in production environments.
Let me know if you have a different opinion; feel free to share your thoughts or ask any questions. 

In [None]:
# Inspect the learned Decision Tree
# One of the major advantages of Decision Trees is that they can easily be interpreted.  
clf = DecisionTreeClassifier()

# Fit on the full dataset (X, y), not only the training split
clf.fit(X, y)

In [None]:
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = X.columns

print("Feature ranking:")
for f in range(X.shape[1]):
    # Look up the name through the sorted index so it matches its importance
    print("%s : (%f)" % (feature_names[indices[f]], importances[indices[f]]))

In [None]:
f, ax = plt.subplots(figsize=(15, 15))
plt.title("Feature ranking", fontsize = 12)
plt.bar(range(X.shape[1]), importances[indices],
    color="b", 
    align="center")
# Label each bar with the feature it belongs to, in the same sorted order as the bars
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.ylabel("importance", fontsize = 18)
plt.xlabel("feature", fontsize = 18)

# Conclusion 2: Feature Importance
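Wow, there are a lot of features that contribute little or nothing to predicting our target variable.
The names cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing and gill-size are simply the first eight columns of the dataset in their original order, so the most significant features should be read from the importances matched to their own feature names rather than from this list.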
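
One way to keep each name attached to its own importance is to sort the (name, importance) pairs together. The sketch below uses made-up importance values and only a few column names from the dataset, so the numbers are purely illustrative and are not results from the fitted tree.

```python
# Made-up importances for illustration only (not output of the fitted tree)
demo_names = ["cap-shape", "odor", "gill-size", "bruises"]
demo_importances = [0.05, 0.60, 0.25, 0.10]

# Sort the pairs by importance, largest first, so names travel with their values
ranking = sorted(zip(demo_names, demo_importances), key=lambda pair: pair[1], reverse=True)

for name, importance in ranking:
    print("%s : (%f)" % (name, importance))
```

In the notebook, the same pattern would apply to `X.columns` and `clf.feature_importances_`.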