# Comparing machine learning algorithms on predicting edible/poisonous mushrooms

## Let's start with our imports!

In [None]:
# Imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
%matplotlib inline

In [None]:
# Let's import the data and start exploring it
data = pd.read_csv('../input/mushrooms.csv')
data.head()

In [None]:
data.info()

In [None]:
data.describe()

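The output shows that every column is categorical.
We can use scikit-learn's LabelEncoder to turn each category into an integer code. Note that LabelEncoder is intended for target labels, and the integer codes imply an ordering that these nominal features do not really have. Tree-based models are largely insensitive to this, but linear models such as Logistic Regression can be affected by it.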

In [None]:
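# Encode every column (features and target) as integer codes.
# The encoder is refit on each column, so only the last column's mapping is retained.
labelEncoder = preprocessing.LabelEncoder()
for col in data.columns:
    data[col] = labelEncoder.fit_transform(data[col])

# Train/test split: hold out 30% of the rows for evaluation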
X = data.drop('class', axis=1)
y = data['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
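
To make the encoding concrete, here is a minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not rows from mushrooms.csv). Unlike the loop above, it keeps one LabelEncoder per column, which is an assumption added here so that the integer codes can be mapped back to the original letters later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; not rows from mushrooms.csv
toy = pd.DataFrame({'class': ['p', 'e', 'e', 'p'],
                    'odor': ['n', 'a', 'n', 'f']})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, so the codes follow the alphabetical order of the categories
class_mapping = {str(label): code
                 for code, label in enumerate(encoders['class'].classes_)}
print(class_mapping)
print(encoded)

# Map the integer codes back to the original letters
decoded_odor = encoders['odor'].inverse_transform(encoded['odor'])
print(list(decoded_odor))
```

Keeping the fitted encoders around makes later plots easier to read, because an encoded value such as an odor code can be translated back to its category.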

Time to compare some machine learning models.
We use a for loop to loop through the different models. The empty lists collect the model names and accuracy scores for the overview table at the end.

In [None]:
keys = []
scores = []
# Note: this SVC uses an RBF kernel, so it is labelled accordingly
models = {'Logistic Regression': LogisticRegression(),
          'Decision Tree': DecisionTreeClassifier(),
          'Random Forest': RandomForestClassifier(n_estimators=30),
          'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=1),
          'SVM (RBF kernel)': SVC(kernel='rbf', gamma=.10, C=1.0)}

for name, model in models.items():
    # Fit on the training split and evaluate on the held-out test split
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print('Results for: ' + str(name) + '\n')
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))
    acc = accuracy_score(y_test, pred)
    print(acc)
    print('\n' + '\n')
    keys.append(name)
    scores.append(acc)

# Build the overview table once, after all models have been scored
table = pd.DataFrame({'model': keys, 'accuracy score': scores})
print(table)
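
The loop above scores each model on a single 70/30 split. As a complementary check, cross-validation averages the accuracy over several splits. The sketch below is only an illustration under stated assumptions: it uses synthetic stand-in data from make_classification and two of the models, and names such as X_demo, demo_models and cv_summary are hypothetical. To apply it to the mushrooms, the encoded X and y would be passed in instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); swap in the encoded X and y to use it for real
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=101)

demo_models = {'Logistic Regression': LogisticRegression(max_iter=1000),
               'Decision Tree': DecisionTreeClassifier(random_state=101)}

cv_summary = {}
for name, model in demo_models.items():
    # 5-fold cross-validated accuracy: one score per fold
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_summary[name] = (fold_scores.mean(), fold_scores.std())
    print(name, round(fold_scores.mean(), 3), '+/-', round(fold_scores.std(), 3))
```

The standard deviation across folds gives a rough sense of how much a single-split accuracy could vary.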

Logistic Regression has the lowest accuracy of our algorithms. The k-NN classifier comes extremely close to 100% accuracy. The tree-based methods and the RBF-kernel SVM all achieve 100% accuracy. These machine learning algorithms clearly have little trouble with this dataset. Let's explore the important features in predicting poisonous mushrooms next.

In [None]:
# Re-training the Random Forest
rfc = RandomForestClassifier(n_estimators = 30)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)

importances = rfc.feature_importances_
plot = sns.barplot(x=X.columns, y=importances)

for item in plot.get_xticklabels():
    item.set_rotation(90)
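
The bar plot above shows the importances in column order. A sorted table can make the ranking easier to read. The following is a minimal sketch on synthetic stand-in data with hypothetical column names (f0 to f4), not the mushroom features. With the notebook's own objects, the same idea would apply to rfc.feature_importances_ and X.columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=2, random_state=101)
X_demo = pd.DataFrame(X_demo, columns=['f0', 'f1', 'f2', 'f3', 'f4'])

forest = RandomForestClassifier(n_estimators=30, random_state=101)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important
ranked = pd.Series(forest.feature_importances_,
                   index=X_demo.columns).sort_values(ascending=False)
print(ranked)
```

The impurity-based importances of a fitted forest sum to 1, so each value can be read as a share of the total.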

Odor has the highest feature importance in the Random Forest. We can explore the effect of odor on the predicted class a bit further with this next plot.

In [None]:
sns.countplot(x = 'odor', data = data, hue='class', palette='coolwarm')
plt.show()
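
The counts in the plot above can also be tabulated as per-odor class shares. This is a small sketch on a hypothetical toy frame (the rows are made up for illustration). On the notebook's data, the same call would take data['odor'] and data['class'].

```python
import pandas as pd

# Hypothetical toy data; not rows from mushrooms.csv
toy = pd.DataFrame({'odor':  ['n', 'n', 'n', 'f', 'f', 'a'],
                    'class': ['e', 'e', 'p', 'p', 'p', 'e']})

# Each row sums to 1: the share of each class within one odor category
share = pd.crosstab(toy['odor'], toy['class'], normalize='index')
print(share)
```

A row with a share of 1.0 in a single column corresponds to an odor category that is linked to only one class.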

This plot shows why odor is so important for predicting the right class: most odor categories are linked to only one outcome class, and for odor code 5 almost all mushrooms belong to class 0. Because LabelEncoder assigns codes in alphabetical order, class 0 corresponds to 'e' (edible) and odor code 5 to 'n' (no odor).
This is a very clean dataset that most ML algorithms will have no trouble with.