Classifying mushrooms as edible or poisonous from 21 categorical features turns out to be a very easy task for random forests: they consistently achieve almost 100% accuracy, even when trained on as little as 10% of the original data and tested on the remaining 90%.

However, when collecting mushrooms on our own, we probably don't have enough time or willpower to check all 21 features of every mushroom we pick. So maybe there is a better way. Maybe there is a small subset of features that we could feed into our model and still get a reasonable prediction of whether a mushroom is edible, possibly supplemented by a probability, i.e. a measure of how certain that judgment is.

This is where feature importances come into play. We can check which features our models rely on most and which are basically useless for determining whether a mushroom is safe to eat. As it turns out, we can restrict ourselves to a small subset of features without losing any accuracy.

In the later part, I run some experiments with constrained random forests to see how reducing their maximum depth and number of leaf nodes affects the accuracy and certainty of predictions.

# 1. Exploratory Data Analysis

First, let's import the libraries and load the data.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

input_dir = '/kaggle/input/mushroom-classification/mushrooms.csv'

data = pd.read_csv(input_dir)

data.shape

We have 8124 rows representing individual mushrooms and 23 columns: the target ('class') and 22 features.

There are no nulls in this data and all the data is categorical in type.

In [None]:
print("Any nulls?", data.isnull().any().any()) # no nulls
data.info() # all good, all categorical data

In [None]:
data.head()

We can collect the unique labels that occur in each feature column, together with their count, in a DataFrame:

In [None]:
labels_data = pd.DataFrame({col: [len(set(data[col])), list(set(data[col]))] for col in data.columns}).T
labels_data.columns = ['# unique labels', 'Labels']
labels_data

Since the veil-type feature has only one value ('p'), it carries no useful information, so we can discard it.

In [None]:
if 'veil-type' in data.columns:
    data.drop('veil-type', axis=1, inplace=True)
data.shape

We can continue our exploration by comparing the distribution of each feature among the edible and poisonous examples:

In [None]:
sns.set_style('darkgrid')

fig, axs = plt.subplots(7,3, figsize=(20,15))

plt.subplots_adjust(top=2, bottom=None)


for i, col in enumerate(data.columns[1:]):
    sns.countplot(
        x=col, hue='class',
        data=data,
        ax=axs[i//3, i%3]
    )
    axs[i//3, i%3].set_title(col)
    axs[i//3, i%3].set_xlabel(None)
    axs[i//3, i%3].set_ylabel(None)
    axs[i//3, i%3].legend(loc='upper right')
plt.show()

**p** (blue color) stands for *poisonous*, whereas **e** (orange) stands for *edible*.

You can look up the dataset description if you would like to know what individual letters below each blue-orange pair of bars mean.

While most individual features can be found both among the edible and poisonous mushrooms, though to varying extents, some can be very clear indicators. For example, look at the spore-print-color chart. Almost all the examples with 'h' label (which stands for 'chocolate') are poisonous.

In [None]:
spc_count = data.query('`spore-print-color`=="h"')['class'].value_counts()
print(spc_count)
print(spc_count['p']/spc_count.sum())

So, if you encounter a mushroom with a chocolate-colored spore print, you can be 97% certain that it's not edible (assuming this dataset is reliable).

Gill-color seems to be an even better indicator. On its chart, we don't see any orange above the 'b' label (standing for 'buff').

In [None]:
gc_count = data.query('`gill-color`=="b"')['class'].value_counts()
print(gc_count)
print(gc_count['p']/gc_count.sum())

You definitely should not eat mushrooms with buff-colored gills.

But these two rules are not infallible. There are many mushrooms with neither chocolate-colored spore prints nor buff-colored gills that are still toxic.

In [None]:
other_toxic_count = data.query('`spore-print-color`!="h" & `gill-color`!="b"')['class'].value_counts()
print(other_toxic_count)
print(other_toxic_count['p']/other_toxic_count.sum())

So you would throw away most of the toxic examples, but I still would not bet my life or health on 88% certainty.
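
The cells above repeat the same pattern: filter with a query, count the classes, divide. A minimal sketch of a reusable helper, assuming a tiny made-up DataFrame in place of the real dataset (the name `poisonous_share` and the toy rows are illustrative and not part of the original analysis):

```python
import pandas as pd

def poisonous_share(df, query=None):
    """Return the fraction of rows labelled 'p' among the rows matching `query`."""
    subset = df.query(query) if query else df
    if len(subset) == 0:
        return float('nan')  # nothing matched, so the share is undefined
    return (subset['class'] == 'p').mean()

# Toy stand-in for the mushroom data (made-up rows, for illustration only)
toy = pd.DataFrame({
    'class':             ['p', 'p', 'e', 'e', 'p', 'e'],
    'spore-print-color': ['h', 'h', 'n', 'k', 'w', 'h'],
    'gill-color':        ['b', 'n', 'n', 'k', 'b', 'w'],
})

# Share of poisonous rows among those with a chocolate spore print
share_h = poisonous_share(toy, '`spore-print-color` == "h"')
# Share of poisonous rows among those that pass both exclusion rules
share_rest = poisonous_share(toy, '`spore-print-color` != "h" & `gill-color` != "b"')
print(share_h, share_rest)
```

On the real data, the same helper could be called with `data` and any of the query strings used in this section.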

But maybe if we exclude these two from our data, a third clear indicator will emerge...?

In [None]:
data_excl_1 = data.query('`spore-print-color`!="h" & `gill-color`!="b"')

sns.set_style('darkgrid')

fig, axs = plt.subplots(7,3, figsize=(20,15))

plt.subplots_adjust(top=2, bottom=None)


for i, col in enumerate(data.columns[1:]):
    sns.countplot(
        x=col, hue='class',
        data=data_excl_1,
        ax=axs[i//3, i%3]
    )
    axs[i//3, i%3].set_title(col)
    axs[i//3, i%3].set_xlabel(None)
    axs[i//3, i%3].set_ylabel(None)
    axs[i//3, i%3].legend(loc='upper right')
plt.show()

Odor labels 'p' (pungent) and 'c' (creosote) seem promising.

In [None]:
data_excl_2 = data_excl_1.query('odor!="c" & odor!="p"')
data_excl_2_counts = data_excl_2['class'].value_counts()

print(data_excl_2_counts)
print(data_excl_2_counts['p']/data_excl_2_counts.sum())

The share of poisonous mushrooms among the remaining ones is down to 3.6%. Let's keep going...

In [None]:
sns.set_style('darkgrid')

fig, axs = plt.subplots(7,3, figsize=(20,15))

plt.subplots_adjust(top=2, bottom=None)


for i, col in enumerate(data.columns[1:]):
    sns.countplot(
        x=col, hue='class',
        data=data_excl_2,
        ax=axs[i//3, i%3]
    )
    axs[i//3, i%3].set_title(col)
    axs[i//3, i%3].set_xlabel(None)
    axs[i//3, i%3].set_ylabel(None)
    axs[i//3, i%3].legend(loc='upper right')
plt.show()

It seems like we could get close to 100% certainty if we just threw away all the mushrooms with enlarging stalks (stalk-shape label 'e'). This way, however, we would also lose a lot of edible mushrooms.

In [None]:
data_excl_3 = data_excl_2.query('`stalk-shape`!="e"')
data_excl_3_counts = data_excl_3['class'].value_counts()
print(data_excl_3_counts)

Alternatively, we could discriminate further based on spore print color. It seems that all the remaining poisonous mushrooms have either white ('w' label) or green ('r' label) spore prints, although this way we also lose many edible mushrooms.

In [None]:
data_excl_4 = data_excl_2.query('`spore-print-color`!="w" & `spore-print-color`!="r"')
data_excl_4_counts = data_excl_4['class'].value_counts()
print(data_excl_4_counts)

That's much better. In this way, we preserved almost 1000 more edible mushrooms.

In [None]:
data_excl_4_counts - data_excl_3_counts

Nevertheless, we lost quite a lot of edible mushrooms from our original mushroom set.

In [None]:
data['class'].value_counts()['e'] - data_excl_4_counts['e']

So maybe there is a better way. Maybe there is a way not to get mushroom-poisoned, while not losing (almost) any edible mushrooms.

Maybe there is a way Machine Learning methods may help us achieve this objective.

# 2. Encoding the features as binary and one-hot labels

Several features contain exactly 2 unique labels. These can be encoded as binary, i.e. 0 for one label and 1 for the other.

I don't think any of the others can be sensibly translated to an ordinal scale, so they will need to be encoded as one-hot sparse matrices.

However, note that one feature, 'stalk-root', contains a question mark ('?') label which stands for "missing". This should not be encoded as another index in the one-hot representation, since it does not stand for anything real but rather our lack of knowledge. Thus examples with this label will have a vector of zeros as their stalk-root one-hot representation.
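
To make the idea concrete, here is a minimal sketch of "missing means all zeros" on a made-up column. It uses `pd.get_dummies` instead of the `OneHotEncoder` used in the next cell, purely because it is shorter; the toy values are an assumption for illustration:

```python
import pandas as pd

# Toy stalk-root column; '?' marks a missing value (made-up rows)
stalk_root = pd.Series(['b', 'c', '?', 'e', 'b'], name='stalk-root')

# One indicator column per label...
one_hot = pd.get_dummies(stalk_root, prefix='stalk-root', prefix_sep='-', dtype=int)
# ...then drop the indicator for '?', so that missing rows become all zeros
one_hot = one_hot.drop(columns='stalk-root-?')

print(one_hot)
```

The third row (the one with '?') ends up with zeros in every remaining column, while every other row has exactly one 1.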

In [None]:
labels_data = pd.DataFrame({col: [len(set(data[col])), list(set(data[col])), '?' in set(data[col])] for col in data.columns}).T
labels_data.columns = ['# unique labels', 'Labels', 'Unknown (\'?\') label']
labels_data

In [None]:
from sklearn.preprocessing import OneHotEncoder

# For convenience, pre-shuffle the data before splitting it into features and targets
data = data.sample(frac=1) 

# Target labels (edible - 1 ; poisonous - 0)
data_y = data['class'].apply(lambda x: 1 if x=='e' else 0)

# Predictive features. An empty DataFrame for now, but soon we will populate it with properly encoded data
data_X = pd.DataFrame()

# Encoder for encoding features with more than 2 labels
encoder = OneHotEncoder()

for col_name in data.columns[1:]: # For each column except the first one, which is 'class'
    # The list of all unique labels in the column
    col_unique = list(set(data[col_name].values))
    # If there are only two unique labels
    if len(col_unique)==2:
        # Encode as binary
        col_encoded = data[col_name].apply(lambda x: 0 if x==col_unique[0] else 1)
        # For better interpretability, contain the meaning of 0s and 1s with respect to the original descriptive labels in the name of the encoded column
        col_encoded_name = f'{col_name}-bin{col_unique[0]}-{col_unique[1]}'
        # Add this encoded column to the data_X DataFrame
        data_X[col_encoded_name] = col_encoded
    # If there are more than two unique labels
    else:
        # Encode as one-hot with previously initialized OneHotEncoder
        # Immediately convert into a numpy array
        col_encoded = encoder.fit_transform(data[col_name].values.reshape(-1,1)).toarray()
        # Labels for each column of the one-hot array, into which this column has just been encoded
        col_encoded_labels = encoder.categories_[0]
        # For each of this column's labels
        for i, label in enumerate(col_encoded_labels):
            if label=='?': # Skip, if this label is '?'
                continue
            # The name of the column encoding this label will contain this label's name (again, for interpretability)
            col_encoded_name = f'{col_name}-{label}'
            # Add this column to the data_X DataFrame 
            data_X[col_encoded_name] = col_encoded[:,i]
            
data_X.shape, data_y.shape

Sanity check:

We have 110 columns in the encoded data. We should have one new column for each original column with 2 unique values, and n new columns for each original column with n>2 unique values, minus one for the '?' stalk-root label.


In [None]:
expected_n_columns = 0
for col in data.columns[1:]:
    L = len(set(data[col])) 
    if L==2:
        expected_n_columns += 1
    else:
        if '?' in set(data[col]):
            L-=1
        expected_n_columns += L
expected_n_columns

All good

# 3. Random Forest Classifier

## 3.1. Finding the most important features

Before testing how a Random Forest Classifier performs on this dataset, let's test a single Decision Tree Classifier, the kind of model of which an RFC is an ensemble.

In [None]:
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.model_selection import cross_val_score

dtc = DTC()

dtc_cvs_acc = cross_val_score(dtc, data_X, data_y, scoring='accuracy', cv=3, n_jobs=-1)

print("\tDecision Tree Classifier:")
print(dtc_cvs_acc.round(5))
print(f"Mean: {dtc_cvs_acc.mean().round(5)}\tStd: {dtc_cvs_acc.std().round(5)}")

This task seems trivial for this algorithm: it achieves almost 100% accuracy right from the start, without any fine-tuning.

No wonder Random Forests do even better.

In [None]:
from sklearn.ensemble import RandomForestClassifier as RFC

rfc = RFC()

rfc_cvs_acc = cross_val_score(rfc, data_X, data_y, scoring='accuracy', cv=3, n_jobs=-1)
print("\tRandom Forest Classifier:")
print(rfc_cvs_acc.round(7))
print(f"Mean: {rfc_cvs_acc.mean().round(5)}\tStd: {rfc_cvs_acc.std().round(5)}")

Let's make it a little harder. Instead of splitting the data into 3 parts, training the classifier on 2 of them and testing on the third, 3 times in all combinations (as cross_val_score does with cv=3), let's train the model on a randomly selected 10% of the dataset and test it on the remaining 90%. We will repeat this process 10 times and calculate the average score.

For that we can use Scikit-Learn's KFold class.

(I don't know if this method has a name. From now on I will call it "reverse cross-validation".)
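
A minimal sketch of the trick on dummy data: `KFold` yields `(train, test)` index pairs in which the first element covers 90% of the rows, so swapping the two roles gives the 10%/90% split described above. The array `X` and the fixed `random_state` are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)  # 100 dummy samples

splits = KFold(n_splits=10, shuffle=True, random_state=0)

sizes = []
for big_idx, small_idx in splits.split(X):
    # KFold returns (90%, 10%); here the small fold is used for training
    train_idx, test_idx = small_idx, big_idx
    sizes.append((len(train_idx), len(test_idx)))

print(sizes[0])
```

Because the ten small folds are disjoint, every sample is used for training exactly once across the ten rounds.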

In [None]:
from sklearn.model_selection import KFold

splits = KFold(n_splits=10, shuffle=True)

rfc_scores = [] # A list to write scores into 
forests = [] # A list to contain the trained models (to be explained in a while)

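# KFold yields (train, test) index pairs; the names are swapped on purpose, so that we train on the 10% fold
for test_idx, train_idx in tqdm(splits.split(data_X)):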
    train_X, train_y = data_X.iloc[train_idx], data_y.iloc[train_idx]
    test_X, test_y = data_X.iloc[test_idx], data_y.iloc[test_idx]
    
    rfc = RFC()
    rfc.fit(train_X, train_y)
    
    rfc_score = rfc.score(test_X, test_y)
    rfc_scores.append(rfc_score)
    forests.append(rfc)
    
print("\tRandom Forest Classifier (reverse cross-validation):")
print(f"Mean: {np.mean(rfc_scores)}\tMin: {np.min(rfc_scores)}")

With scores that high we can be reasonably certain that our model will perform well.

As I said at the end of section 1, we want models that don't need to rely on all the features in our dataset (encoded in 110 columns), but rather on a small subset of them.

We can check to what extent a Random Forest Classifier relies on each feature through its .feature_importances_ attribute.

However, although random forests perform quite well on many different kinds of tasks, they are quite 'chaotic', that is, two models initialized with the exact same set of hyperparameters and then trained on the exact same data may evolve quite differently and thus learn to rely on different features.
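
This run-to-run variation comes from the randomness in bootstrap sampling and in the feature subsets considered at each split, both of which are controlled by `random_state`. A minimal sketch on synthetic data (the `make_classification` data and the helper `importances` are assumptions for illustration, not the mushroom dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

def importances(seed):
    """Fit a forest with the given seed and return its feature importances."""
    forest = RandomForestClassifier(n_estimators=50, random_state=seed)
    forest.fit(X, y)
    return forest.feature_importances_

same_a, same_b = importances(1), importances(1)  # same seed twice
other = importances(2)                           # a different seed

print(np.allclose(same_a, same_b))       # identical importances for identical seeds
print(np.abs(same_a - other).max())      # largest difference between the two seeds
```

Fixing the seed makes a single run reproducible, but it does not make the importances any more trustworthy, which is why comparing several forests is still useful.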

In [None]:
rfc_fi_df = pd.DataFrame()

rfc = RFC()
rfc.fit(data_X, data_y)
rfc_fi_df['RFC 1'] = rfc.feature_importances_
rfc = RFC()
rfc.fit(data_X, data_y)
rfc_fi_df['RFC 2'] = rfc.feature_importances_

rfc_fi_df.index = data_X.columns

rfc_fi_df

This difference would be even more pronounced if we decided to select only the most important features, e.g. those with importance greater than 0.02. We can see this by plotting a histogram for each of these two forests:

In [None]:
fig, axs = plt.subplots(1,2, figsize=(15,9))

rfc_fi_df['RFC 1'].hist(ax=axs[0])
axs[0].set_ylim(top=10) # Most of the features are of very low importance, which shows up as a "skyscraper" on the left, so we "zoom in" a little bit
rfc_fi_df['RFC 2'].hist(ax=axs[1])
axs[1].set_ylim(top=10)

plt.show()

These differences in feature importances make it harder to judge which features we should select.

That's why, when training 10 models on 10 training mini-splits, we saved each model after training. Now we can average their learned feature importances and get a clearer picture:

In [None]:
# Average feature importances
av_fi = np.zeros((rfc.feature_importances_.shape))

for rfc in forests:
    av_fi += rfc.feature_importances_
    
av_fi /= len(forests)

# Plot the data

fig, ax = plt.subplots(figsize=(15,9))

ax.hist(av_fi)
ax.set_ylim(top=10)
plt.show()
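
The same average can be written with `np.stack`, which also makes it easy to look at the spread of each feature's importance across models. A minimal sketch, assuming a handful of forests fitted on synthetic data (`toy_forests` and the data are stand-ins for the `forests` list above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Five forests that differ only in their random seed
toy_forests = [
    RandomForestClassifier(n_estimators=30, random_state=seed).fit(X, y)
    for seed in range(5)
]

# One row per forest, one column per feature
stacked = np.stack([f.feature_importances_ for f in toy_forests])

mean_fi = stacked.mean(axis=0)  # average importance of each feature
std_fi = stacked.std(axis=0)    # how much it varies from forest to forest

print(mean_fi.round(3))
print(std_fi.round(3))
```

A feature whose standard deviation is large relative to its mean is one whose ranking depends heavily on the particular forest.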

We will select three, increasingly smaller and stricter sets of features, with importance thresholds of 0.01, 0.02, and 0.04:

In [None]:
av_fi_df = pd.Series({feature: importance for feature, importance in zip(data_X.columns, av_fi)}).to_frame('importance')
av_fi_df

In [None]:
features_01 = av_fi_df.query('importance >= 0.01').index.values
features_02 = av_fi_df.query('importance >= 0.02').index.values
features_04 = av_fi_df.query('importance >= 0.04').index.values
features_01.shape, features_02.shape, features_04.shape

We will repeat the previous training procedure, but now, for every mini-split, a separate forest will be trained and evaluated on each of the three feature subsets. Scores will be stored in separate lists for each feature subset.

In [None]:
rfc_01_scores, rfc_02_scores, rfc_04_scores = [], [], []

splits = KFold(n_splits=10, shuffle=True)

for test_idx, train_idx in tqdm(splits.split(data_X)):
    train_X, train_y = data_X.iloc[train_idx], data_y.iloc[train_idx]
    test_X, test_y = data_X.iloc[test_idx], data_y.iloc[test_idx]
    
    rfc = RFC()
    rfc.fit(train_X[features_01], train_y)
    rfc_01_score = rfc.score(test_X[features_01], test_y)
    rfc_01_scores.append(rfc_01_score)
    
    rfc = RFC()
    rfc.fit(train_X[features_02], train_y)
    rfc_02_score = rfc.score(test_X[features_02], test_y)
    rfc_02_scores.append(rfc_02_score)
    
    rfc = RFC()
    rfc.fit(train_X[features_04], train_y)
    rfc_04_score = rfc.score(test_X[features_04], test_y)
    rfc_04_scores.append(rfc_04_score)

We can print these scores and compare them to those obtained for forests trained on all available features:

In [None]:
print(f"0.01:\n\tMean:{np.mean(rfc_01_scores).round(5)}\tMin:{np.min(rfc_01_scores).round(5)}\tStd:{np.std(rfc_01_scores).round(5)}")
print(f"0.02:\n\tMean:{np.mean(rfc_02_scores).round(5)}\tMin:{np.min(rfc_02_scores).round(5)}\tStd:{np.std(rfc_02_scores).round(5)}")
print(f"0.04:\n\tMean:{np.mean(rfc_04_scores).round(5)}\tMin:{np.min(rfc_04_scores).round(5)}\tStd:{np.std(rfc_04_scores).round(5)}")
print(f"All:\n\tMean:{np.mean(rfc_scores).round(5)}\tMin:{np.min(rfc_scores).round(5)}\tStd:{np.std(rfc_scores).round(5)}")

Restricting ourselves to only those features whose importance is at least 0.01, we lose neither mean nor minimum accuracy. There are, however, significant losses for higher thresholds.

We can also see how applying each threshold influences the model's accuracy when it is trained on a bigger chunk of data.

In [None]:
from sklearn.model_selection import train_test_split as tts

rfc_01_cv = cross_val_score(RFC(), data_X[features_01], data_y, cv=5, scoring='accuracy', n_jobs=-1,)
rfc_02_cv = cross_val_score(RFC(), data_X[features_02], data_y, cv=5, scoring='accuracy', n_jobs=-1,)
rfc_04_cv = cross_val_score(RFC(), data_X[features_04], data_y, cv=5, scoring='accuracy', n_jobs=-1,)
rfc_all_cv = cross_val_score(RFC(), data_X, data_y, cv=5, scoring='accuracy', n_jobs=-1,)

print(f"0.01:\n\tMean:{np.mean(rfc_01_cv).round(5)}\tMin:{np.min(rfc_01_cv).round(5)}\tStd:{np.std(rfc_01_cv).round(5)}")
print(f"0.02:\n\tMean:{np.mean(rfc_02_cv).round(5)}\tMin:{np.min(rfc_02_cv).round(5)}\tStd:{np.std(rfc_02_cv).round(5)}")
print(f"0.04:\n\tMean:{np.mean(rfc_04_cv).round(5)}\tMin:{np.min(rfc_04_cv).round(5)}\tStd:{np.std(rfc_04_cv).round(5)}")
print(f"All:\n\tMean:{np.mean(rfc_all_cv).round(5)}\tMin:{np.min(rfc_all_cv).round(5)}\tStd:{np.std(rfc_all_cv).round(5)}")

## 3.2. Restricting Random Forests' individual estimators

So what should we look at when we go mushroom hunting?

In [None]:
important_df = av_fi_df.loc[features_01].sort_values(by='importance', ascending=False)
important_df

Random Forest Classifiers consistently ascribe quite high importance to some of the features we included in our back-of-the-envelope selection algorithm from section 1, such as buff-colored gills and chocolate-colored or white spore prints. They didn't, however, find an "enlarging" stalk shape to be a useful indicator, which is quite interesting.

Maybe we can gain some useful insight into the way these algorithms reason if we visualize one of the Decision Tree Classifiers (of which Random Forest Classifiers are ensembles).

Let's re-run a part of the previous loop. This time we will train only on the features with importance>=0.01 and save the best forest.

In [None]:
rfc_01_best = None
rfc_01_best_score = 0

splits = KFold(n_splits=10, shuffle=True)
for test_idx, train_idx in tqdm(splits.split(data_X)):
    train_X, train_y = data_X.iloc[train_idx][features_01], data_y.iloc[train_idx]
    test_X, test_y = data_X.iloc[test_idx][features_01], data_y.iloc[test_idx]
    
    rfc = RFC()
    rfc.fit(train_X, train_y)
    rfc_01_score = rfc.score(test_X, test_y) # test_X is already restricted to features_01
    if rfc_01_score > rfc_01_best_score:
        rfc_01_best_score = rfc_01_score
        rfc_01_best = rfc

Now that we have our best RFC, let's make a sorted table with information about the score of its individual DTCs:

In [None]:
dtc_df = pd.Series({idx: dtc.score(data_X[features_01], data_y) for idx, dtc in enumerate(rfc_01_best.estimators_) }).to_frame('Score').sort_values(by='Score', ascending=False)

dtc_df

We can also plot the distribution of these scores:

In [None]:
dtc_df['Score'].hist()
plt.show()

Let's take one of the best DTCs and visualize it in the form of a graph.

In [None]:
from sklearn.tree import export_graphviz

dtc_best = rfc_01_best.estimators_[dtc_df.index[0]]

export_graphviz(
    dtc_best,
    out_file='dtc_best.dot',
    feature_names=features_01,
    class_names=['p', 'e'],
    rounded=True,
    filled=True
)
    

os.getcwd(), os.listdir() # See, that the saved .dot file is in our /working directory

In [None]:
# Convert this .dot file to .png
! dot -Tpng dtc_best.dot -o dtc_best.png
os.listdir()

In [None]:
from IPython.display import Image

# Display this .png

img = 'dtc_best.png'
Image(url=img, embed=False)

Such an elaborate tree is basically guaranteed to overfit, even though it was trained on only 10% of the data. Evidently, our data is pretty consistent and uniform, which allows this DTC to achieve such high accuracy. However, if we encounter a mushroom quite unlike anything we've seen before, this decision tree with its elaborate branches may give us a dangerously wrong impression of its safety.

We will therefore try to constrain the freedom of the trees grown by our estimators, so that they do not overfit the data (with regard to possible data from beyond that distribution), while trying to minimize the losses in accuracy.

Let's see how "overgrown" these trees really are. We can do that by investigating several of their characteristics:

* Depth - the length of **the longest** possible decision path in a given tree
* Number of leaves (leaf nodes) - how many final decision nodes a given tree has
* Mean probability - how certain the tree is on average, i.e. the mean over all examples of the highest class probability it predicts
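
The three quantities can be sketched on a single tree as follows. The synthetic data is an assumption for illustration; `get_depth`, `get_n_leaves` and `predict_proba` are the same scikit-learn methods used in the next cell:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, without label noise
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           flip_y=0, random_state=0)

# An unconstrained tree keeps splitting until every leaf is pure
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

depth = tree.get_depth()                                # longest root-to-leaf path
n_leaves = tree.get_n_leaves()                          # number of final decision nodes
mean_proba = tree.predict_proba(X).max(axis=1).mean()   # mean certainty over the samples

print(depth, n_leaves, mean_proba)
```

A fully grown tree reports a mean probability of exactly 1.0 on data whose leaves are pure, because each leaf then contains examples of a single class. That is a property of how the tree was grown, not evidence that its predictions are reliable.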

In [None]:
# We need to access the predictor's index in order to access its depth, number of leaves and mean probability
if 'index' not in dtc_df.columns:
    dtc_df.reset_index(inplace=True, drop=False)

dtc_df['Depth'] = dtc_df['index'].apply(lambda idx: rfc_01_best.estimators_[idx].get_depth())
dtc_df['Leaves'] = dtc_df['index'].apply(lambda idx: rfc_01_best.estimators_[idx].get_n_leaves())
dtc_df['Mean proba'] = dtc_df['index'].apply(lambda idx: rfc_01_best.estimators_[idx].predict_proba(data_X[features_01]).max(axis=1).mean())

dtc_df

First things first: all these trees seem absolutely certain in their judgments. That's very dangerous in itself:

In [None]:
dtc_df['Mean proba'].value_counts()

We can plot the depth and number of leaves of all the trees:

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15, 9))

dtc_df['Depth'].hist(bins = dtc_df['Depth'].max() - dtc_df['Depth'].min() +1, ax=ax[0] )
ax[0].set_title('Depth')
dtc_df['Leaves'].hist(bins = dtc_df['Leaves'].max() - dtc_df['Leaves'].min() +1, ax=ax[1] )
ax[1].set_title('Leaves')

plt.show()

These distributions seem to somewhat approximate the normal (Gaussian) distribution.

Let's see what happens to the forest's mean accuracy and mean certainty if we constrain the maximum depth and the maximum number of leaves to various values. We will test each pair of max_depth and max_leaf_nodes hyperparameters and write the results into a matrix.
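
One thing worth keeping in mind when reading such a grid: the two constraints interact. A binary tree of depth d has at most 2^d leaves, so for small depths the leaf limit may never be reached, and several grid cells then describe effectively the same model. A minimal sketch on synthetic data (the data and the `results` dictionary are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

results = {}
for max_depth, max_leaf_nodes in [(2, 5), (2, 30), (10, 5), (10, 30)]:
    tree = DecisionTreeClassifier(
        max_depth=max_depth, max_leaf_nodes=max_leaf_nodes, random_state=0
    ).fit(X, y)
    # Record the depth and number of leaves the tree actually ended up with
    results[(max_depth, max_leaf_nodes)] = (tree.get_depth(), tree.get_n_leaves())

for params, shape in results.items():
    print(params, shape)
```

With `max_depth=2` a tree can have at most 4 leaves, so allowing 5 or 30 leaf nodes makes no difference to how far it can grow.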

In [None]:
DEPTH = np.arange(2, 15)
LEAVES = np.arange(5, 35)

accuracy_matrix = np.zeros((len(DEPTH), len(LEAVES)))
probas_matrix = np.zeros((len(DEPTH), len(LEAVES)))

splits = KFold(n_splits=10, shuffle=True)
for i, depth in tqdm(enumerate(DEPTH)):
    for ii, leaves in enumerate(LEAVES):
        #print(f"{np.round(100*(i/len(DEPTH)+(ii/len(LEAVES))/len(DEPTH)), 2)} %")
        for test_idx, train_idx in splits.split(data_X):
            train_X, train_y = data_X.iloc[train_idx][features_01], data_y.iloc[train_idx]
            test_X, test_y = data_X.iloc[test_idx][features_01], data_y.iloc[test_idx]

            rfc = RFC(max_depth=depth, max_leaf_nodes=leaves)
            rfc.fit(train_X, train_y)
            
            accuracy_matrix[i,ii] += rfc.score(test_X, test_y)
            probas_matrix[i,ii] += rfc.predict_proba(test_X).max(axis=1).mean()

accuracy_matrix /= 10
probas_matrix /= 10

We can display these matrices as heatmaps:

In [None]:
fig, ax = plt.subplots(figsize=(15,9))

accuracy_heatmap = ax.imshow(accuracy_matrix)

ax.set_yticks(np.arange(len(DEPTH)))
ax.set_yticklabels(DEPTH)

ax.set_xticks(np.arange(len(LEAVES)))
ax.set_xticklabels(LEAVES)

ax.set_title("Accuracy")

fig.tight_layout()

plt.colorbar(accuracy_heatmap)

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(15,9))

probas_heatmap = ax.imshow(probas_matrix)

ax.set_yticks(np.arange(len(DEPTH)))
ax.set_yticklabels(DEPTH)

ax.set_xticks(np.arange(len(LEAVES)))
ax.set_xticklabels(LEAVES)

ax.set_title("Probas")

fig.tight_layout()

plt.colorbar(probas_heatmap)

plt.show()

Evidently, reducing either the maximum depth or the number of leaf nodes results in lower accuracy as well as lower certainty ('probas') of predictions. Nevertheless, such smaller models would probably be more adequate, since they would not foster overconfidence in our predictions.

Even the most constrained random forests score pretty high:

In [None]:
print("\tMaximally constrained forest:")
print(f"Accuracy: {accuracy_matrix[0,0].round(3)}")
print(f"Probas: {probas_matrix[0,0].round(3)}")

Even though its individual decision trees score pretty low.

In [None]:
# Find the best constrained forest
rfc_constrained = None
rfc_constrained_score = 0

splits = KFold(n_splits=10, shuffle=True)
for test_idx, train_idx in tqdm(splits.split(data_X)):
    train_X, test_X = data_X.iloc[train_idx][features_01], data_X.iloc[test_idx][features_01]
    train_y, test_y = data_y.iloc[train_idx], data_y.iloc[test_idx]
    
    rfc = RFC(max_depth=2, max_leaf_nodes=5)
    rfc.fit(train_X, train_y)
    rfc_score = rfc.score(test_X, test_y)
    if rfc_score > rfc_constrained_score:
        rfc_constrained_score = rfc_score
        rfc_constrained = rfc


In [None]:
# Plot the data about decision trees

constrained_dtc_df = pd.Series({idx: dtc.score(data_X[features_01], data_y) for idx, dtc in enumerate(rfc_constrained.estimators_)}).to_frame('Score').sort_values(by='Score', ascending=False)
constrained_dtc_df.reset_index(inplace=True, drop=False)
constrained_dtc_df['Probas'] = constrained_dtc_df['index'].apply(lambda idx: rfc_constrained.estimators_[idx].predict_proba(data_X[features_01]).max(axis=1).mean())


# Compare their average individual accuracy and certainty to that of the whole ensemble (tested on the entire dataset)
print(f"Mean DTC accuracy:\t{np.round(constrained_dtc_df['Score'].mean(), 4)}")
print(f"Mean DTC probas:\t{np.round(constrained_dtc_df['Probas'].mean(), 4)}")
print(f"Mean ensemble accuracy:\t{np.round(rfc_constrained.score(data_X[features_01], data_y), 4)}")
print(f"Mean ensemble probas:\t{np.round(rfc_constrained.predict_proba(data_X[features_01]).max(axis=1).mean(), 4)}")

fig, ax = plt.subplots(1,2,figsize=(15,9))

constrained_dtc_df['Score'].hist(ax=ax[0])
ax[0].set_title('Score')
constrained_dtc_df['Probas'].hist(ax=ax[1])
ax[1].set_title('Probas')

plt.show()

By combining many (in this case, 100) weaker models (mean accuracy 0.8527) into an ensemble, we can create a significantly stronger model (mean accuracy 0.9458), as long as its constituent models are sufficiently diverse.

Also, the certainty the ensemble reports should be read alongside its accuracy of about 95%. It has to be printed at four decimals like the other figures: rounded to a whole number it would always show 1.0, because the highest class probability in a binary problem is never below 0.5. Even then, it tells us how confident the forest is on average, not how reliable any particular prediction is. As a complementary measure, we may want to look at the mean certainty of the ensemble's constituent predictors, or maybe the geometric mean of both.
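
The combination suggested above can be sketched per sample as follows. The synthetic data and all variable names are assumptions for illustration; this is one possible reading of "geometric mean of both", not an established reliability measure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
forest = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=0).fit(X, y)

# Certainty of the ensemble: highest class probability for each sample
ensemble_certainty = forest.predict_proba(X).max(axis=1)

# Mean certainty of the individual trees, again for each sample
tree_certainty = np.mean(
    [tree.predict_proba(X).max(axis=1) for tree in forest.estimators_], axis=0
)

# Geometric mean of the two certainties
combined = np.sqrt(ensemble_certainty * tree_certainty)

print(ensemble_certainty.mean(), tree_certainty.mean(), combined.mean())
```

Since scikit-learn's forest averages the class probabilities of its trees, the ensemble's certainty for a sample is never higher than the mean certainty of its trees, and the geometric mean falls between the two.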

Let's see how one of its trees "reasons".

In [None]:
dtc_constrained = rfc_constrained.estimators_[0]

export_graphviz(
    dtc_constrained,
    out_file='dtc_constrained.dot',
    feature_names=features_01,
    class_names=['p', 'e'],
    rounded=True,
    filled=True
)
! dot -Tpng dtc_constrained.dot -o dtc_constrained.png
img = 'dtc_constrained.png'
Image(url=img, embed=False)

Come to think of it, we can also plot how accuracy and certainty ('probas') change as we go from using all available features to only a handful of them.

Let's use a random forest with max_depth set to 10 and max_leaf_nodes set to 30, which does not seem to incur huge accuracy or certainty losses.

In [None]:
feat_accuracies = np.zeros((av_fi_df.shape[0],))
feat_probas = np.zeros((av_fi_df.shape[0],))

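# Rank the features from most to least important, so that the first n are the n most important ones
ranked_features = av_fi_df.sort_values(by='importance', ascending=False).index

splits = KFold(n_splits=10, shuffle=True)
for n_features in tqdm(range(av_fi_df.shape[0])):
    for test_idx, train_idx in splits.split(data_X):
        features = ranked_features[:n_features+1]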
        train_X, test_X = data_X.iloc[train_idx][features], data_X.iloc[test_idx][features]
        train_y, test_y = data_y.iloc[train_idx], data_y.iloc[test_idx]
        rfc = RFC(max_depth=10, max_leaf_nodes=30)
        rfc.fit(train_X, train_y)
        
        feat_accuracies[n_features] += rfc.score(test_X, test_y)
        feat_probas[n_features] += rfc.predict_proba(test_X).max(axis=1).mean()
        
feat_accuracies /= 10
feat_probas /= 10

In [None]:
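feat_df = pd.DataFrame([feat_accuracies, feat_probas]).T
feat_df.columns = ['Accuracies', 'Probas']
feat_df.index = np.arange(1, len(feat_df) + 1) # Index by the number of features used (1-based), to match the x-axis label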

fig, ax = plt.subplots(figsize=(15,9))

sns.lineplot(data=feat_df, ax=ax)

ax.set_xlabel('Number of features')
ax.set_title('Accuracy and certainty')

plt.show()

So we can see that both accuracy and certainty rise steeply, almost hand in hand, as we increase the number of features used (starting from the "most important" ones), until we reach about 23 features, which, by the way, corresponds to our choice of the 0.01 threshold.

# To do (maybe):

* Build and optimize a NN for this task
* Try dimensionality reduction (is it even possible for only categorical variables?)
* Try clustering

In [None]:
# Assumption: Keras as bundled with TensorFlow; adjust the import if you use standalone Keras
from tensorflow.keras import layers, models

# Split first, so that input_shape below matches the data the network is trained on.
# data_y is already binary (edible - 1 ; poisonous - 0).
train_X, test_X, train_y, test_y = tts(data_X, data_y, test_size=.9, random_state=42, shuffle=True)

model = models.Sequential(name='shroom_classifier',layers=[
    layers.Dense(32, activation='relu', kernel_regularizer='l2', input_shape=(train_X.shape[1],)),
    layers.BatchNormalization(),
    layers.Dropout(.1),
    layers.Dense(64, activation='relu', kernel_regularizer='l2'),
    layers.BatchNormalization(),
    layers.Dropout(.1),
    layers.Dense(64, activation='relu', kernel_regularizer='l2'),
    layers.BatchNormalization(),
    layers.Dropout(.1),
    layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='rmsprop',
    loss='binary_crossentropy',
    metrics=['acc']
)

history = model.fit(
    train_X, train_y,
    validation_split=.1, shuffle=True,
    batch_size=32, epochs=5
)