# Decision Tree Classifier for Mushroom classification (poisonous or not?)
Playing around using a decision tree classifier for the Mushroom dataset. It is clear that it is very easy to classify if or if not a mushroom is poisonous from this dataset. This was just a useful exercise for me to play with decision trees, compare them against Weka j48 decision trees (which can handle categorical data much better) and see if we can classify without the ability to smell :)

Note - tree images copied in just in case pydot doesn't work

In [1]:
#import libraries and data and split into Training and Testing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('../input/mushroom-classification/mushrooms.csv')
X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values

In [2]:
TEST_SIZE = 0.3
RANDOM_STATE = 0

In [3]:
# Encoding categorical data
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder
n_rows_X, n_cols_X = X.shape
for cols in range(0,n_cols_X):
    labelencoder_X = LabelEncoder()
    X[:, cols] = labelencoder_X.fit_transform(X[:, cols])

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In [4]:
#Train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)

OneHotEncoder not required for application of a decision tree apparently ... 

In [5]:
# '#create CART classifier'
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion="entropy", splitter="best", max_depth=None, \
                                     min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0., \
                                     max_features=None, random_state=0, max_leaf_nodes=None, min_impurity_decrease=0., \
                                     min_impurity_split=None, class_weight=None, presort=False)
#clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

In [6]:
# Predicting the Test set results
y_pred = clf.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [7]:
cm

# Classification can always be achieved independent of the random state and if or if not 10-fold cross validation is applied ... Therefore may be more useful to look at the trees created

In [8]:
import pydot
from IPython.display import Image
from sklearn.tree import export_graphviz

tree.export_graphviz(clf, out_file='tree_1.dot',feature_names=df.columns[:-1])  
(graph,) = pydot.graph_from_dot_file('tree_1.dot')
graph.write_png('tree_1.png')
Image("tree_1.png", width=700, height=700)

In [9]:
from IPython.display import Image
Image(filename='../input/images-decision-trees-created/tree_1.png')

# Analysis of tree...
It appears that the algorithm has split on 'gill-color' first, in other words it views this as the attribute which gives the maximum information gain on the first split. This appears incorrect, by using Weka the first split is always on 'odor' as 'odor' provides the biggest entropy gain for the first split. This can be checked by running infogain in Weka (below). It seems the decision tree algorithm is treating categorical values as numeric values and therefore not making the appropriate split. One way of handling this may be to use OneHotEncoder or pd.get_dummies so that each catorgoisation has its own column with a Boolean value. This would allow the classifier to 'access' the category that it should be splitting on first i.e. odor.

Output using infogain in Weka with default parameters and Ranker search method:

average merit | average rank  | attribute      
---|---|---
 0.906 +- 0.002  |   1   +- 0    |    5 odor
 0.481 +- 0.004  |   2   +- 0    |   20 spore-print-color
 0.417 +- 0.003  |   3   +- 0    |    9 gill-color
 0.318 +- 0.002  |   4   +- 0    |   19 ring-type
 0.285 +- 0.002  |   5   +- 0    |   12 stalk-surface-above-ring
 0.272 +- 0.003  |   6   +- 0    |   13 stalk-surface-below-ring

In [10]:
#Appears get_dummies does the job of onehotenoder and removes dummy variables we do not need 
#AND automatically gives nice column names ... win :)
df_2 = pd.get_dummies(df,drop_first=True)

In [11]:
X_new = df_2.iloc[:, 1:].values
y = df_2.iloc[:, 0].values

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
clf = tree.DecisionTreeClassifier(criterion="entropy", splitter="best", max_depth=None, \
                                     min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0., \
                                     max_features=None, random_state=0, max_leaf_nodes=None, min_impurity_decrease=0., \
                                     min_impurity_split=None, class_weight=None, presort=False)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

In [13]:
tree.export_graphviz(clf, out_file='tree_2.dot',feature_names=df_2.columns[:-1])
(graph,) = pydot.graph_from_dot_file('tree_2.dot')
graph.write_png('tree_2.png')
Image("tree_2.png", width=700, height=700)

In [14]:
Image(filename='../input/images-decision-trees-created/tree_2.png')

Now we get the initial split on odor, which is what we expect, however it is still not as clean and intuitive as using Weka

# Output from Weka
Using j48 decision tree (10fold CV) with all parameters as per default. This appears so much easier to interpret and it frustrates me that I cannot get Python to provide such an intuitive view of the problem and make catagorical splits instead of binary ... 

In [15]:
Image(filename='../input/images-decision-trees-created/tree_weka.png')

Can we still predict if poisonous if we cannot smell it?

In [16]:
#Remove odor and continue as normal
df_1 = df.drop('odor',1)

In [17]:
df_2 = pd.get_dummies(df_1,drop_first=True)

X_new = df_2.iloc[:, 1:].values
y = df_2.iloc[:, 0].values

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
clf = tree.DecisionTreeClassifier(criterion="entropy", splitter="best", max_depth=None, \
                                     min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0., \
                                     max_features=None, random_state=0, max_leaf_nodes=None, min_impurity_decrease=0., \
                                     min_impurity_split=None, class_weight=None, presort=False)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm



In [18]:
tree.export_graphviz(clf, out_file='tree_3.dot',feature_names=df_2.columns[:-1])
(graph,) = pydot.graph_from_dot_file('tree_3.dot')
graph.write_png('tree_3.png')
Image("tree_3.png", width=700, height=700)

In [19]:
Image(filename='../input/images-decision-trees-created/tree_3.png')

Easy... no need to smell them