# Mushroom classification using Random Forest Classifier

This notebook describes the development of a classification algorithm based on a Random Forest Classifier. The objective is to distinguish edible from poisonous mushrooms based on their appearance.

The dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. The latter class was combined with the poisonous one.

The features of the dataset are the following:
* classes: edible=e, poisonous=p
* cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
* cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
* cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
* bruises: bruises=t,no=f
* odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
* gill-attachment: attached=a,descending=d,free=f,notched=n
* gill-spacing: close=c,crowded=w,distant=d
* gill-size: broad=b,narrow=n
* gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
* stalk-shape: enlarging=e,tapering=t
* stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
* stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
* stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
* stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
* stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
* veil-type: partial=p,universal=u
* veil-color: brown=n,orange=o,white=w,yellow=y
* ring-number: none=n,one=o,two=t
* ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
* spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
* population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
* habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Importing libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve

In [None]:
#Reading data 
data = pd.read_csv('/kaggle/input/mushroom-classification/mushrooms.csv')

# Data cleaning and preparation

In this section, the data is analysed to evaluate whether there are missing values, and data types are adjusted.

In [None]:
#Inspecting the first rows of the dataset
data.head()

In [None]:
#Checking if there are missing values
data.isnull().sum()

The check above reports no null entries in the dataset. However, the features are stored as strings, so these categorical values need to be transformed into integers. This can be done by applying label encoding to the dataset.
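One caveat to the null check: the feature list marks missing "stalk-root" entries with `?`, a placeholder that `isnull()` does not count as missing. A minimal sketch, on a made-up frame rather than the real data, of how such a placeholder could be counted explicitly:

```python
import pandas as pd

# Toy frame (illustrative values only, not taken from mushrooms.csv)
toy = pd.DataFrame({
    "odor": ["a", "n", "f", "n"],
    "stalk-root": ["b", "?", "e", "?"],
})

# isnull() only detects NaN/None, so the '?' placeholder is not counted
nan_counts = toy.isnull().sum()

# Counting the placeholder explicitly reveals the hidden missing entries
placeholder_counts = (toy == "?").sum()

print(nan_counts.to_dict())
print(placeholder_counts.to_dict())
```

If the real CSV follows the feature list, the same comparison on `data` would show how many "stalk-root" values are actually unknown. Label encoding would then treat `?` as just another category.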

Before applying the encoding, it is worth checking whether "edible" and "poisonous" are the only classes present in the class column.

In [None]:
#Checking unique categories of class feature
print(data['class'].unique())

In [None]:
#Applying encoder to transform string categorical data into integers
encoder = LabelEncoder()
df = data.apply(encoder.fit_transform)

#Printing first rows of the encoded dataset
df.head()
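As a rough illustration of what the encoding does, the sketch below applies `LabelEncoder` to a made-up frame. The toy values and the `encoders` and `mappings` names are assumptions introduced for this example. The point is that codes follow the sorted order of the labels, so in the class column `e` becomes 0 and `p` becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "veil-color": ["w", "w", "n", "y"],
})

# Fit one encoder per column so each mapping can be inspected or inverted later
encoders = {col: LabelEncoder().fit(toy[col]) for col in toy.columns}
encoded = toy.apply(lambda s: encoders[s.name].transform(s))

# Mapping from original label to integer code, per column
mappings = {
    col: {label: int(code) for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mappings)
```

Keeping one encoder per column is a design choice of this sketch: a single encoder that is re-fitted on every column only remembers the mapping of the last column it saw.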

# Data Analysis

In this section, the distribution of the categories within each feature is analysed. This makes it possible to identify any categorical imbalance in the dataset, which could affect the reliability of the proposed classifier.

In [None]:
fig, ax = plt.subplots(nrows =4, ncols = 6, figsize = (30, 20))

sns.countplot(x="class", data=df, ax = ax[0,0])
ax[0,0].set_xlabel('Class', fontsize = 14)

sns.countplot(x="cap-shape", data=df, ax = ax[0,1])
ax[0,1].set_xlabel('Cap shape', fontsize = 14)

sns.countplot(x="cap-surface", data=df, ax = ax[0,2])
ax[0,2].set_xlabel('Cap surface', fontsize = 14)

sns.countplot(x="cap-color", data=df, ax = ax[0,3])
ax[0,3].set_xlabel('Cap color', fontsize = 14)

sns.countplot(x="bruises", data=df, ax = ax[0,4])
ax[0,4].set_xlabel('Bruises', fontsize = 14)

sns.countplot(x="odor", data=df, ax = ax[0,5])
ax[0,5].set_xlabel('Odor', fontsize = 14)

sns.countplot(x="gill-attachment", data=df, ax = ax[1,0])
ax[1,0].set_xlabel('Gill attachment', fontsize = 14)

sns.countplot(x="gill-spacing", data=df, ax = ax[1,1])
ax[1,1].set_xlabel('Gill spacing', fontsize = 14)

sns.countplot(x="gill-size", data=df, ax = ax[1,2])
ax[1,2].set_xlabel('Gill size', fontsize = 14)

sns.countplot(x="gill-color", data=df, ax = ax[1,3])
ax[1,3].set_xlabel('Gill color', fontsize = 14)

sns.countplot(x="stalk-shape", data=df, ax = ax[1,4])
ax[1,4].set_xlabel('Stalk shape', fontsize = 14)

sns.countplot(x="stalk-root", data=df, ax = ax[1,5])
ax[1,5].set_xlabel('Stalk root', fontsize = 14)

sns.countplot(x="stalk-surface-above-ring", data=df, ax = ax[2,0])
ax[2,0].set_xlabel('Stalk surface above ring', fontsize = 14)

sns.countplot(x="stalk-surface-below-ring", data=df, ax = ax[2,1])
ax[2,1].set_xlabel('Stalk surface below ring', fontsize = 14)

sns.countplot(x="stalk-color-above-ring", data=df, ax = ax[2,2])
ax[2,2].set_xlabel('Stalk color above ring', fontsize = 14)

sns.countplot(x="stalk-color-below-ring", data=df, ax = ax[2,3])
ax[2,3].set_xlabel('Stalk color below ring', fontsize = 14)

sns.countplot(x="veil-type", data=df, ax = ax[2,4])
ax[2,4].set_xlabel('Veil type', fontsize = 14)

sns.countplot(x="veil-color", data=df, ax = ax[2,5])
ax[2,5].set_xlabel('Veil color', fontsize = 14)

sns.countplot(x="ring-type", data=df, ax = ax[3,0])
ax[3,0].set_xlabel('Ring type', fontsize = 14)

sns.countplot(x="ring-number", data=df, ax = ax[3,1])
ax[3,1].set_xlabel('Ring number', fontsize = 14)

sns.countplot(x="spore-print-color", data=df, ax = ax[3,2])
ax[3,2].set_xlabel('Spore print color', fontsize = 14)

sns.countplot(x="population", data=df, ax = ax[3,3])
ax[3,3].set_xlabel('Population', fontsize = 14)

sns.countplot(x="habitat", data=df, ax = ax[3,4])
ax[3,4].set_xlabel('Habitat', fontsize = 14)

fig.delaxes(ax[3,5])

fig.suptitle('Number of elements in each category for the available features', fontsize = 40)
fig.tight_layout()
fig.subplots_adjust(top=0.88)

plt.show()

From the plot it is possible to deduce at least 3 pieces of information:
1. The data is almost equally distributed between edible and poisonous mushrooms, and therefore the classifier will not have to account for an uneven distribution of the labels;
2. Only the category 0 ("partial") appears in the column "Veil type". This column can thus be dropped from the dataset, as it does not provide any useful information for the model;
3. In some columns there is a substantial imbalance of the categories (e.g. over 95% of the elements in the column "Veil color" fall in category 2: "white").
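The observations above are read off the plots. They could also be checked numerically. The sketch below is a minimal example on made-up data; `summarise_categories` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def summarise_categories(frame: pd.DataFrame) -> pd.DataFrame:
    """Per column: number of distinct categories and share of the most common one."""
    return pd.DataFrame({
        "n_categories": frame.nunique(),
        "top_share": frame.apply(lambda s: s.value_counts(normalize=True).iloc[0]),
    })

# Toy encoded frame (illustrative values only)
toy = pd.DataFrame({
    "veil-type": [0, 0, 0, 0, 0],
    "veil-color": [2, 2, 2, 2, 0],
    "class": [0, 1, 0, 1, 1],
})

summary = summarise_categories(toy)

# Columns with a single category carry no information and are candidates for dropping
constant_columns = summary.index[summary["n_categories"] == 1].tolist()
print(summary)
print(constant_columns)
```

Applied to `df`, the same helper would list constant columns such as "veil-type" and quantify how dominant the most frequent category is in each feature.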

In [None]:
#Dropping column veil type
df.drop(columns = 'veil-type', inplace = True)

# Classification model using Random Forest

A classification model is now built using a Random Forest Classifier. This classifier is selected because the notebook addresses a task defined for this dataset that specifically requires it.

In order to train the classifier, the dataset is divided into features (X) and labels (y = poisonous/edible) and then split into train and test sets. Splitting the data into train and test sets makes it possible to check the accuracy of the ML model when predicting previously unseen data.
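A minimal sketch of such a split on synthetic data is given below. The `stratify` argument is an optional extra that this notebook does not use. It is shown only because it keeps the class proportions identical in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataset: 100 samples, 4 integer features
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(100, 4))
y_toy = np.array([0, 1] * 50)

# 70/30 split; stratify=y_toy (an assumption of this sketch) preserves the
# 50/50 class balance in both the train and the test subset
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```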

In [None]:
#Splitting dataset into labels and features
X = df.drop(columns = 'class')
y = df['class']

In [None]:
#Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size = 0.3,
                                                   random_state = 42)

Given that a Random Forest classifier is a relatively complex model, some parameters (hyper-parameters) have to be specified by the user before the model can be trained. The proper selection of these parameters is of utmost importance in order to obtain a suitable model.

This selection process is generally called "hyper-parameter tuning" and it is here carried out by means of a random search approach: a list of potential values for the various hyper-parameters is defined and then multiple models are trained as a way to identify the best performing set of hyper-parameters.
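To make the idea of a random search concrete, the sketch below draws a few candidate combinations with scikit-learn's `ParameterSampler`. The search space is a small placeholder, not the one used in this notebook.

```python
from sklearn.model_selection import ParameterSampler

# Small illustrative search space (placeholder values)
space = {
    "n_estimators": [50, 100, 150],
    "max_depth": [2, 4, 6],
    "max_features": ["sqrt", "log2"],
}

# Draw 5 of the 18 possible combinations; random_state makes the draw repeatable
candidates = list(ParameterSampler(space, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```

A randomized search then fits one model per candidate under cross-validation and keeps the candidate with the best mean score.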

The hyper-parameters here considered are the following:
1. n_estimators: the number of trees in the forest;
2. max_depth: the maximum depth of each tree;
3. min_samples_leaf: the minimum number of samples required to be at a leaf node (interpreted as a fraction of the number of samples when a float value is provided);
4. max_features: the number of features to consider when looking for the best split (sqrt: max_features = sqrt(n_features), log2: max_features = log2(n_features)).
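The fractional and rule-based settings in the list above can be turned into concrete numbers. The sketch below assumes that scikit-learn rounds the leaf size up and truncates the feature count to an integer. The helper names and the sample count are hypothetical.

```python
import math

def leaf_size_from_fraction(fraction: float, n_samples: int) -> int:
    """Minimum samples per leaf when min_samples_leaf is given as a float (assumed: rounded up)."""
    return math.ceil(fraction * n_samples)

def features_per_split(rule: str, n_features: int) -> int:
    """Candidate features per split for the 'sqrt' and 'log2' rules (assumed: truncated, at least 1)."""
    if rule == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if rule == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unsupported rule: {rule}")

n_samples = 1000  # hypothetical training-set size
n_features = 21   # 22 listed features minus the dropped veil-type column

print(leaf_size_from_fraction(0.0025, n_samples))  # 3
print(features_per_split("sqrt", n_features))      # 4
print(features_per_split("log2", n_features))      # 4
```

Under these assumptions both rules give the same count for 21 features. The two `max_features` options would then be equivalent for this dataset.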

The scoring method for the random search is the "accuracy" of the classifier, which represents the fraction of labels that are correctly predicted. If the results indicate a large number of false negatives (i.e. poisonous mushrooms predicted to be edible), the scoring method can be changed to penalise false negatives more heavily.
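One possible way to change the scoring is sketched below. It uses an F-beta score with beta = 2, which weights recall on the positive class more than precision. This is an illustrative choice rather than the method used in this notebook. The data is synthetic and the search space is deliberately tiny.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary data standing in for the encoded mushrooms (1 = poisonous)
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# beta=2 penalises missed positives (false negatives) more than false alarms
recall_weighted = make_scorer(fbeta_score, beta=2)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [10, 20], "max_depth": [2, 3, 4]},
    n_iter=3,
    cv=3,
    scoring=recall_weighted,
    random_state=42,
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

Passing `scoring='recall'` would be a simpler alternative when only false negatives matter.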

In [None]:
#Initiating Random Forest Classifier
rf_model = RandomForestClassifier(random_state = 42)

#Define the distribution of hyperparameters
params_rf = {
    'n_estimators': range(50,500,50),
    'max_depth': range(1,10),
    'min_samples_leaf': np.arange(0.0025,0.02,0.0025),
    'max_features': ['log2', 'sqrt']   
}

#Initiate Randomized search
grid_rf = RandomizedSearchCV(estimator = rf_model, 
                             param_distributions = params_rf,
                             cv=3,
                             scoring='accuracy',
                             n_iter=20,
                             n_jobs= -1)

In [None]:
#Fitting the randomized search
grid_rf.fit(X_train, y_train)

In [None]:
#Extracting best hyperparameters
rf_best_hyperparams = grid_rf.best_params_
print('Best hyperparameters for RF: \n', rf_best_hyperparams)

In [None]:
#Extracting best rf model
rf = grid_rf.best_estimator_

Given that Random Forest is a complex model, the chances of overfitting the training set are considerable. A model that overfits the training set fits not only its underlying trends but also its noise, and will therefore perform poorly on unseen data.

A way to check whether the model overfits the training set is to carry out a cross-validation procedure. The training set is split into k folds, and the model is trained k times, each time on k-1 folds and evaluated on the remaining fold. Each run yields a different prediction accuracy. The average of these accuracies is called the cross-validation accuracy.

Two scenarios are possible:

1. The cross-validation accuracy is clearly lower than the training set accuracy -> in this case the model is overfitting the training set;
2. The cross-validation accuracy is similar to the training set accuracy -> it is possible to assume that the model is not overfitting the training set and will perform similarly on unseen data.
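The procedure can be written out by hand to show what a cross-validation helper does internally. The sketch below runs on synthetic data with a small forest. The fold count and model settings are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=42)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in folds.split(X_toy, y_toy):
    # Train on k-1 folds, then evaluate on the held-out fold
    model.fit(X_toy[train_idx], y_toy[train_idx])
    fold_scores.append(model.score(X_toy[val_idx], y_toy[val_idx]))

# Average of the per-fold accuracies: the cross-validation accuracy
cv_accuracy = float(np.mean(fold_scores))

# Accuracy on the same data the model was trained on, for comparison
train_accuracy = model.fit(X_toy, y_toy).score(X_toy, y_toy)
print(f"CV accuracy: {cv_accuracy:.4f}, train accuracy: {train_accuracy:.4f}")
```

A large gap between the two printed values would correspond to the first scenario above.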

In [None]:
#Checking if there is overfitting through the use of Cross validation
rf_accuracy_CV = cross_val_score(rf, X_train, y_train,
                            cv = 10, 
                            scoring = 'accuracy',
                            n_jobs = -1)

In [None]:
#Computing the accuracy in the training set, test set, and cross-validation procedure
print('CV accuracy:{:.4f}'.format(rf_accuracy_CV.mean()))
print('Train set accuracy:{:.4f}'.format(rf.score(X_train,y_train)))
print('Test set accuracy:{:.4f}'.format(rf.score(X_test, y_test)))

The results indicate that the cross-validation accuracy is similar to the training set accuracy. Therefore, it is possible to assume that the selected model is not overfitting the training data.

This is further supported by the fact that the accuracy of the model in the test set is comparable to the previous two.

In [None]:
#Predicting the labels of the test set
y_pred = rf.predict(X_test)

In [None]:
#Printing the confusion matrix
print(confusion_matrix(y_test,y_pred))

The confusion matrix indicates that no poisonous mushrooms are predicted to be edible (false negatives, taking "poisonous" as the positive class), so the risk of classifying a poisonous mushroom as edible appears negligible on this test set.
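To read the matrix unambiguously, it helps to unpack its four cells by name. The sketch below uses made-up labels and assumes the encoding maps edible to 0 and poisonous to 1, since `LabelEncoder` assigns codes in sorted order. Under that assumption, a poisonous mushroom predicted as edible is a false negative.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = edible, 1 = poisonous
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=3 FP=1 FN=1 TP=3
```

The cell to watch for this problem is FN, the bottom-left entry of the printed matrix.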

Another tool to evaluate the performance of a classification model is the classification report, which is shown below.

In [None]:
# Printing classification report
print(classification_report(y_test,y_pred))

As a last step in the evaluation of the performance of the classification model, it is possible to plot its ROC curve.

In [None]:
#Plotting ROC curve from the predicted probability of the positive class
y_pred_proba = rf.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

plt.plot([0,1], [0,1], 'k--')
plt.plot(fpr, tpr, label = "Random Forest Classifier")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest Classifier ROC curve')
plt.legend()
plt.show()
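The ROC curve can also be condensed into a single number, the area under the curve (AUC). The sketch below is a minimal example on four made-up scores. It shows `roc_auc_score` next to `roc_curve`. The AUC is an addition of this sketch and is not computed in the notebook.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per decision threshold
fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_true, scores)

# 3 of the 4 (positive, negative) pairs are ranked correctly, hence 0.75
auc = roc_auc_score(y_true, scores)
print(auc)  # 0.75
```

An AUC of 1.0 corresponds to a curve that passes through the top-left corner, while 0.5 corresponds to the dashed diagonal.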

Finally, it is possible to identify which features are the most relevant for the classifier.

In [None]:
#Plotting feature importances for the classifier

#Creating a pd.Series of feature importances
importances_rf = pd.Series(rf.feature_importances_, index = X.columns)

#Sorting importances
sorted_importances_rf = importances_rf.sort_values()

#Plotting sorted importances
fig, ax = plt.subplots(figsize = (10,7))
sorted_importances_rf.plot(kind = 'barh')
ax.set_title('Random Forest Classifier feature importances')
plt.show()



# Classification with a reduced number of features

The random forest classifier derived using all the possible features resulted in a very high prediction accuracy. In this second step another classification model is built using only the 3 most important features identified at the previous step.

The aim is to understand how accurate a classifier can be when using only "odor", "gill-color", and "gill-size" as input features.
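Rather than typing the feature names by hand, the top features could be selected programmatically from the importances. The sketch below uses hypothetical importance values chosen for illustration only. They are not results of this notebook.

```python
import pandas as pd

# Hypothetical importances (illustration only, not the notebook's output)
importances = pd.Series({
    "odor": 0.30,
    "gill-size": 0.15,
    "gill-color": 0.20,
    "habitat": 0.05,
    "cap-shape": 0.02,
})

# Names of the three largest importances, in descending order
top_features = importances.nlargest(3).index.tolist()
print(top_features)  # ['odor', 'gill-color', 'gill-size']
```

With the real model, the same call on the Series of feature importances would return the names to use for the reduced feature set.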

The same procedure described in the previous section is followed here.

In [None]:
#Selecting the three most important features (the labels y are reused from the previous section)
X_r = df[['odor','gill-color','gill-size']]


In [None]:
#Splitting data into train and test sets
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_r,y,
                                                   test_size = 0.3,
                                                   random_state = 42)

In [None]:
#Define the distribution of hyperparameters
params_rf = {
    'n_estimators': range(50,500,50),
    'max_depth': range(1,10),
    'min_samples_leaf': np.arange(0.0025,0.02,0.0025),
    'max_features': ['log2', 'sqrt']   
}

#Initiate Randomized search
grid_rf_r = RandomizedSearchCV(estimator = rf_model, 
                             param_distributions = params_rf,
                             cv=3,
                             scoring='accuracy',
                             n_iter=20,
                             n_jobs= -1)

In [None]:
#Fitting the randomized search
grid_rf_r.fit(X_train_r, y_train_r)

In [None]:
#Extracting best hyperparameters
rf_best_hyperparams_r = grid_rf_r.best_params_
print('Best hyperparameters for RF: \n', rf_best_hyperparams_r)

In [None]:
#Extracting best rf model
rf_r = grid_rf_r.best_estimator_

In [None]:
#Checking if there is overfitting through the use of Cross validation
rf_accuracy_CV_r = cross_val_score(rf_r, X_train_r, y_train_r,
                            cv = 10, 
                            scoring = 'accuracy',
                            n_jobs = -1)

In [None]:
#Computing the accuracy in the training set, test set, and cross-validation procedure
print('CV accuracy:{:.4f}'.format(rf_accuracy_CV_r.mean()))
print('Train set accuracy:{:.4f}'.format(rf_r.score(X_train_r,y_train_r)))
print('Test set accuracy:{:.4f}'.format(rf_r.score(X_test_r, y_test_r)))

In [None]:
#Predicting labels of the test set
y_pred_r = rf_r.predict(X_test_r)

In [None]:
#Printing the confusion matrix
print(confusion_matrix(y_test_r,y_pred_r))

In [None]:
#Plotting ROC curve
y_pred_proba_r = rf_r.predict_proba(X_test_r)[:,1]
fpr_r, tpr_r, thresholds_r = roc_curve(y_test_r, y_pred_proba_r)

plt.plot([0,1], [0,1], 'k--')
plt.plot(fpr_r, tpr_r, label = "Accuracy optimized")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest Classifier ROC curve')
plt.legend()
plt.show()

# Conclusions

A dataset describing 22 features of edible and poisonous mushrooms (23 columns including the class label) was analysed. The analysis indicated that only one "veil-type" was present in the data, and thus this feature was dropped from the dataset. In addition, the data turned out to be almost equally distributed between edible and poisonous mushrooms, which simplifies the training of the classification model.

The classification model was based on a Random Forest Classifier, whose hyper-parameters were tuned by means of a random search approach. The accuracy of the classifier was then validated by means of a cross-validation procedure.

The main results emerging from the investigation of the classification model are the following:
* The use of a Random Forest classifier is identified as a suitable choice for the considered problem. The estimated classification accuracy (using all the available features) in the test set was found to be equal to 0.992;
* A second classifier was developed using only the 3 top features identified by the first model. In this case the classification accuracy in the test set was found to be equal to 0.981;
* In both cases no false negatives were found in the predictions of the test set. False negatives (poisonous mushrooms predicted to be edible) would be the worst possible outcome for the considered problem. The scoring function used to tune the classifier could be adjusted if the number of false negatives were considerable.