# Classifying Mushrooms with a Decision Tree


This is a beginner-friendly introduction to building a decision tree classifier that predicts whether a mushroom is edible or poisonous.

It accompanies my blog article on Decision Trees, which you can find [here](https://madelinecaples.hashnode.dev/if-mushrooms-grew-on-trees). 

### Importing libraries 

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

%matplotlib inline

### Loading Data 

In [None]:
DATA_PATH = '/kaggle/input/mushroom-classification/'

mushrooms = pd.read_csv(DATA_PATH + 'mushrooms.csv')

In [None]:
# Check our pandas dataframe to make sure the data was properly loaded
mushrooms.head()

We want to see what kind of features our data has. From the dataframe above, all of the data appears to be categorical (as opposed to numerical).

Let's also list the columns in our dataframe, so we can see the names of the features we are dealing with.

In [None]:
mushrooms.columns
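If you want to double-check the "everything is categorical" impression, one option is to count the unique values per column and list the non-numeric columns. The sketch below uses a tiny made-up dataframe (`toy`) as a stand-in for `mushrooms`, so the values are purely illustrative.

```python
import pandas as pd

# Toy stand-in for the mushrooms dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
    "gill-size": ["n", "b", "b", "n"],
})

# How many distinct values each column takes
unique_counts = toy.nunique()
print(unique_counts)

# Columns that hold something other than numbers (here: single-letter codes)
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Running the same two calls on `mushrooms` should show a small number of unique letters in every column.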

## What are we predicting? 

We are predicting the class of the mushroom (found in `mushrooms['class']`). Specifically, we want to know whether a mushroom is edible or poisonous (e or p), based on its features: things like its odor, habitat, cap shape, color, etc.

### Overview of what we will be doing: 

1. Check distribution of data
2. Split the data into X and y 
3. Encode data 
4. Get a baseline model: we'll use a decision tree 
5. Evaluate the model: check accuracy, precision, recall, F1 score, and feature importances
6. Remove the features that have a low importance 
7. Create a new model without as many features

### Check distribution of data

We want to check to make sure our data is distributed pretty evenly across the two classes we have. This will tell us if our dataset is **balanced**.

In [None]:
# Count how many mushrooms fall into each class (e = edible, p = poisonous)
ax = sns.countplot(x='class', data=mushrooms)

We have slightly more instances of edible mushrooms than poisonous mushrooms, but the difference isn't so great that we will worry about it. 
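A plot is great for a quick look, but you can also get the class balance as numbers. Here is a minimal sketch using made-up labels (`labels` is a hypothetical stand-in for `mushrooms['class']`):

```python
import pandas as pd

# Made-up labels standing in for mushrooms['class']
labels = pd.Series(["e", "e", "e", "p", "p"], name="class")

# normalize=True returns proportions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions)
```

On the real data, `mushrooms['class'].value_counts(normalize=True)` would give the share of edible and poisonous mushrooms.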

## Split the data

We need to split our data into X (the **features**) and y (the **target**).

In [None]:
# Split into features and labels 
X = mushrooms.drop("class", axis=1)
y = mushrooms["class"]
X.head()

In [None]:
y[:5]

## Encode the data 

Our machine learning model won't understand the categorical data that we have, unless we turn it into numbers. This process is called **encoding**. 

We are going to use pandas' `category` dtype to do this. Converting a column to `category` and reading its `.cat.codes` assigns a different integer to each unique letter in that column.

In [None]:
columns = X.columns

In [None]:
for col in columns: 
    X[col] = X[col].astype('category').cat.codes

In [None]:
X.head()

Now we'll do the same for the label y.

In [None]:
y = y.astype('category').cat.codes

In [None]:
y[:5]
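It helps to know which number ended up meaning which letter. The sketch below, on a few made-up labels, shows that pandas orders the categories alphabetically, so `e` becomes 0 and `p` becomes 1. The names `letters` and `mapping` are just for illustration.

```python
import pandas as pd

# Made-up labels standing in for the 'class' column
letters = pd.Series(["p", "e", "e", "p"])
as_category = letters.astype("category")

# Integer code assigned to each row
codes = as_category.cat.codes
print(codes.tolist())  # [1, 0, 0, 1]

# Lookup table from code back to the original letter
mapping = dict(enumerate(as_category.cat.categories))
print(mapping)  # {0: 'e', 1: 'p'}
```

This is why 0 stands for edible and 1 for poisonous in the rest of the notebook.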

## Baseline Decision Tree

Now that our data is encoded into numbers we are ready to make our decision tree. 

We'll have to import another library, scikit-learn (`sklearn`), to do this. From it we'll import `DecisionTreeClassifier`, along with `train_test_split`, so that we can break our data up into training and test sets.

In [None]:
import sklearn 
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
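To see what `test_size=0.2` does, here is a small sketch on made-up data. It also passes `stratify`, which is an optional extra that this notebook does not use: it asks scikit-learn to keep the class proportions the same in both splits. All the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, 60% class 0 and 40% class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 80% of the rows go to training, 20% to testing
print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)

# With stratify, the share of class 1 is the same in both splits
print(y_tr.mean(), y_te.mean())  # 0.4 0.4
```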

In [None]:
baseline = DecisionTreeClassifier(random_state=42) # setting the random state will ensure that we get the same results each time
baseline.fit(X_train, y_train)

## Evaluate baseline 

Note that in the following section I will be comparing some different examples of the Decision Tree trained on our mushrooms dataset. While I will include scoring for the sake of comparison, I am not really trying to optimize the model. Instead I am playing around with different parameters to give you an idea of the capabilities of the Decision Tree Classifier, and what different trees will look like. 

To evaluate this baseline model we will look at the following: 

* Accuracy 
* Precision 
* Recall
* F1 Score

We can visualize those all at once by using the `metrics.classification_report` functionality that is built into sklearn. 

We'll also visualize: 

* Tree 
* Feature importance
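If the four metrics listed above feel abstract, it can help to compute them by hand once. This sketch uses ten made-up predictions, where 1 is treated as the "positive" class, and derives each metric from the confusion matrix counts.

```python
from sklearn import metrics

# Made-up true labels and predictions (1 = positive class, 0 = negative class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / len(y_true)   # share of all predictions that were right
precision = tp / (tp + fp)           # of the predicted positives, how many were real
recall = tp / (tp + fn)              # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
```

These match what `metrics.accuracy_score`, `metrics.precision_score`, `metrics.recall_score`, and `metrics.f1_score` return for the same inputs.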

### A few functions to make life a little easier: 

In [None]:
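# Okay I give in, let's just turn this piece of code into a function!
from sklearn import tree

def show_tree(model):
    fig = plt.figure(figsize=(30, 25))
    ann = tree.plot_tree(
        model,
        # Use the columns the model was trained on (needs scikit-learn 1.0+).
        # mushrooms.columns also contains "class", which would shift every name by one.
        feature_names=list(model.feature_names_in_),
        # cat.codes sorts alphabetically: 0 = 'e' (edible), 1 = 'p' (poisonous)
        class_names=['edible', 'poisonous'],
        filled=True,
    )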

In [None]:
from sklearn import metrics

def print_classification_report(model, X_test, y_test): 
    y_preds = model.predict(X_test)
    print(metrics.classification_report(y_test, y_preds, target_names=['edible', 'poisonous']))

In [None]:
# Tip: don't forget to add the "print" or it will look weird and the columns won't line up!
print_classification_report(baseline, X_test, y_test)

So if our model scored perfectly on precision, recall, and accuracy... doesn't that mean our work is done? Well, no. Not really. A perfect score is something to be suspicious of, because it can be a sign of overfitting. With no limits set, a decision tree keeps splitting until it fits the training data almost exactly, so it may be memorizing specifics rather than learning rules that generalize.
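One way to probe for overfitting is to compare accuracy on the training data with accuracy on data the tree has not seen, and to repeat the test over several splits with cross-validation. The sketch below uses synthetic data from `make_classification` rather than the mushrooms, so the numbers it prints say nothing about this dataset; all the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y randomly flips 10% of the labels)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# An unconstrained tree keeps splitting until every training row is classified correctly
demo_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
train_acc = demo_tree.score(X_tr, y_tr)
test_acc = demo_tree.score(X_te, y_te)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")

# 5-fold cross-validation: five different train/test splits instead of one
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=5)
print(cv_scores.mean())
```

A large gap between training and test accuracy is the usual symptom of overfitting. You could run the same comparison with `baseline`, `X_train`, and `X_test` from this notebook.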

### Visualize the tree

In [None]:
show_tree(baseline)

In [None]:
# Get feature importances 

# Define a function so we can use it again later 
# (works for any fitted tree-based model that exposes feature_importances_)
def get_feature_importance(model, df):
    return pd.DataFrame({'cols': df.columns, 'imp': model.feature_importances_}).sort_values('imp', ascending=False)

feature_importance = get_feature_importance(baseline, X)
feature_importance[:10]

In [None]:
def plot_feat_imp(feat_importance):
    # Horizontal bar chart: one bar per feature, bar length = importance
    return feat_importance.plot(x='cols', y='imp', kind='barh', figsize=(12,7), legend=False)

plot_feat_imp(feature_importance[:30]);

As you can see with this plot, many of our features are barely doing anything at all for the model. The most important features appear to be gill-color, spore-print-color, population, gill-size, and odor. Bruises and habitat are of almost equal importance, and it goes down from there. 

We are going to try to create a more robust tree by trimming away the less important features. 

Note how the decision tree has been very helpful in discovering the features that we can pretty safely cut out of our model, without the accuracy suffering. We wouldn't have known this before making the decision tree, except perhaps with some expert knowledge of mushrooms. 
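To put a number on "barely doing anything", you can look for features whose importance is exactly zero, meaning the tree never split on them. A fitted tree's importances are normalized so that they add up to 1. This is a sketch on synthetic data with hypothetical names (`feature_0`, `demo_tree`, and so on), not the mushrooms model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 8 features, only 3 of which carry real signal
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# A depth-3 tree has at most 7 splits, so it cannot use all 8 features
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

importances = pd.Series(demo_tree.feature_importances_, index=X_demo.columns)
total = importances.sum()                               # normalized to 1
unused = importances[importances == 0].index.tolist()   # never used in a split

print(total)
print(unused)
```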

## Tinkering with the Decision Tree 

* Get rid of features that aren't serving the model well 


In [None]:
to_keep = feature_importance[:4].cols
len(to_keep)

In [None]:
len(X.columns)

In [None]:
X_train = X_train[to_keep]
X_test = X_test[to_keep]

len(X_train.columns), len(X_test.columns)

In [None]:
model_4features = DecisionTreeClassifier(random_state=42)
model_4features.fit(X_train, y_train)

In [None]:
print_classification_report(model_4features, X_test, y_test)

Our model is still doing quite well, but without as many features, it isn't achieving a perfect score. This is actually a sign that we might not be overfitting anymore. 

In [None]:
show_tree(model_4features)
# Tip: double click to see a larger version of the plot

Although this time we only have 4 features, the tree is still pretty deep, because we still have a lot of samples. Now let's try to limit the depth of the tree. 

In [None]:
model_shallow = DecisionTreeClassifier(max_depth=4, random_state=42)
model_shallow.fit(X_train, y_train)

Setting `max_depth` to 4 means that our tree can be at most 4 levels deep: there are never more than 4 splits between the root and any leaf. Let's see if that significantly harmed the model's accuracy, precision, and recall. 

In [None]:
print_classification_report(model_shallow, X_test, y_test)
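If you'd like to confirm that the depth limit took effect without reading the plot, fitted trees expose `get_depth()` and `get_n_leaves()`. The sketch below compares an unconstrained tree with a depth-limited one on synthetic data; the `_demo`, `full_tree`, and `shallow_tree` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data so the unconstrained tree has a reason to grow
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_demo, y_demo)

# Depth = the largest number of splits between the root and a leaf
print("full:   ", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("shallow:", shallow_tree.get_depth(), "levels,", shallow_tree.get_n_leaves(), "leaves")
```

Calling `model_shallow.get_depth()` in this notebook should return a number no greater than 4.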

In [None]:
show_tree(model_shallow)

Now we will try setting `min_samples_split`. This controls the minimum number of samples that a node must contain before it is allowed to split, so it applies to every *internal node* (aka *branch*). 

In [None]:
model_min_split = DecisionTreeClassifier(min_samples_split=35, random_state=42)
model_min_split.fit(X_train, y_train)
show_tree(model_min_split)

In [None]:
print_classification_report(model_min_split, X_test, y_test)
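To check what `min_samples_split` actually did, you can inspect the fitted tree's internal arrays: every node that was split should contain at least that many samples. This is a sketch on synthetic data, and `demo_tree` and the other names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the mushrooms features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=42
)

demo_tree = DecisionTreeClassifier(min_samples_split=35, random_state=42)
demo_tree.fit(X_demo, y_demo)

# In the low-level tree_ structure, leaves have children_left == -1
fitted = demo_tree.tree_
is_internal = fitted.children_left != -1

# Smallest number of samples found in any node that was split
smallest_split_node = fitted.n_node_samples[is_internal].min()
print(smallest_split_node)
```

The printed value should never drop below 35, because smaller nodes are turned into leaves instead of being split further.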

### That's it for now! 

Please feel free to make a copy of this notebook and play around on your own with the DecisionTreeClassifier. There are a lot of other parameters that you can tinker with to see how they affect the tree. I hope this gives you a little bit of an idea about decision trees and how they can be useful for classifying data. 

Thank you for reading. Please leave me a comment with suggestions for future blog posts, if you'd like. 

### Further Reading

Don't forget you can check out the blog article that accompanies this notebook [here](https://madelinecaples.hashnode.dev/if-mushrooms-grew-on-trees). 

Also check out the [Sklearn Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for `DecisionTreeClassifier`.