Some ideas in this analysis were sourced from [this notebook](https://www.kaggle.com/origraupera/mushroom-classification-comparing-ml-models), [this notebook](https://www.kaggle.com/vishalyo990/random-forest-classifier-100-accuracy) and [this notebook](https://www.kaggle.com/gpreda/model-comparison-for-mushrooms-classification).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading Data

In [None]:
mushrooms_df = pd.read_csv("/kaggle/input/mushroom-classification/mushrooms.csv")

# Exploratory Data Analysis (EDA)

## Data Check

In [None]:
# rows and columns
print(mushrooms_df.shape)

# check first 5 rows
mushrooms_df.head()

There are 8,124 rows and 23 columns. The first column, `class`, indicates whether each mushroom is edible (e) or poisonous (p).

Next, we check every column's non-null count and data type.

In [None]:
# check each column's non-null count and dtype
mushrooms_df.info()

All features are categorical, and `info()` reports no null values.
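
One caveat: `info()` only counts real `NaN` values. As far as I know, the UCI version of this dataset records missing `stalk-root` entries as the string `?`, so they would not show up as nulls. The sketch below shows one way to look for such a placeholder. It uses a small toy frame (`toy_df`, with made-up values) as a stand-in for `mushrooms_df`, and the `"?"` marker is an assumption worth verifying against the actual file.

```python
import pandas as pd

# Toy stand-in for mushrooms_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "stalk-root": ["b", "?", "c", "?"],
    "odor": ["a", "f", "n", "f"],
})

# isnull() finds nothing, because "?" is an ordinary string
null_counts = toy_df.isnull().sum()

# Count the cells equal to the placeholder in each column
placeholder_counts = (toy_df == "?").sum()

# Show only the columns that contain the placeholder
print(placeholder_counts[placeholder_counts > 0])
```

On the real data, replacing `toy_df` with `mushrooms_df` gives the same check.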

## Label Imbalance

Here we check if there is an imbalance in `class`, our target variable.

In [None]:
mushrooms_df['class'].value_counts(normalize=True)

The classes are roughly balanced (about 52% edible and 48% poisonous).

## Distribution of Each Feature

We plot the distribution of each feature, split by class, to better understand the data.


In [None]:
for column in mushrooms_df.columns[1:]:
    sns.countplot(x=column, hue="class", data=mushrooms_df)
    plt.title('Edible and Poisonous based on ' + column)
    plt.show()
    plt.clf()

Some features are clearly helpful for classification. Take `odor`, for example. Every odor value except none (n) occurs with only one class: almond (a) and anise (l) occur only on edible mushrooms, while creosote (c), fishy (y), foul (f), musty (m), pungent (p), and spicy (s) occur only on poisonous ones.

In [None]:
mushrooms_df.groupby('odor')['class'].unique()
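
`unique()` shows which classes occur for each odor, but not how many mushrooms fall into each group. A contingency table with `pd.crosstab` shows both. The sketch below uses a toy frame (`toy_df`, with made-up rows) in place of `mushrooms_df`, and the name `pure_odors` is only illustrative.

```python
import pandas as pd

# Toy stand-in for mushrooms_df (hypothetical rows, not the real data)
toy_df = pd.DataFrame({
    "class": ["e", "e", "p", "p", "e", "p"],
    "odor":  ["a", "l", "f", "f", "n", "n"],
})

# Rows: odor values, columns: class labels, cells: number of mushrooms
odor_table = pd.crosstab(toy_df["odor"], toy_df["class"])
print(odor_table)

# Odor values that occur with exactly one class
pure_odors = odor_table.index[(odor_table > 0).sum(axis=1) == 1].tolist()
print(pure_odors)
```

An odor value listed in `pure_odors` separates the classes perfectly for the rows that have it.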

# Data Preparation

## Drop Unnecessary Columns

First, columns that take only one value carry no information for classification and can be dropped.

In [None]:
mushrooms_df.nunique()

In [None]:
# veil-type has a single unique value, so it cannot help separate the classes
mushrooms_df.drop(columns=['veil-type'], inplace=True)
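
Instead of reading the `nunique()` output by eye, the constant columns can also be collected programmatically. This is a minimal sketch on a toy frame (`toy_df`, with made-up values); `constant_cols` and `reduced_df` are illustrative names.

```python
import pandas as pd

# Toy stand-in for mushrooms_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "class": ["e", "p", "e"],
    "veil-type": ["p", "p", "p"],
    "odor": ["a", "f", "n"],
})

# Columns with a single distinct value
constant_cols = [col for col in toy_df.columns if toy_df[col].nunique() == 1]
print(constant_cols)

# Drop them without modifying the original frame
reduced_df = toy_df.drop(columns=constant_cols)
print(reduced_df.columns.tolist())
```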

## Convert Values to Integers

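The categorical values have to be converted to numbers before scikit-learn can work with them. To do so, we use ordinal encoding, which replaces each category with an integer code (`OrdinalEncoder` stores the codes as floats by default). The order of the codes is arbitrary, which tree-based models such as Random Forest generally tolerate well.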

In [None]:
from sklearn.preprocessing import OrdinalEncoder

mushrooms_df = pd.DataFrame(OrdinalEncoder().fit_transform(mushrooms_df), columns=mushrooms_df.columns)
mushrooms_df.head()
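
To see which code was assigned to which letter, keep a reference to the fitted encoder and read its `categories_` attribute. The sketch below does this on a toy frame (`toy_df`, with made-up values); the `mapping` dictionary is only an illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for mushrooms_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "odor": ["n", "f", "a", "f"],
})

# Keep the fitted encoder so the mapping can be inspected afterwards
encoder = OrdinalEncoder()
encoded = pd.DataFrame(encoder.fit_transform(toy_df), columns=toy_df.columns)

# categories_ holds one sorted array of labels per column;
# a label's position in that array is its code
mapping = {
    col: {label: code for code, label in enumerate(cats)}
    for col, cats in zip(toy_df.columns, encoder.categories_)
}
print(mapping)
print(encoded.dtypes)
```

The same encoder can map codes back to the original letters with `encoder.inverse_transform(encoded)`.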

## Split Data into Train and Test

In [None]:
X = mushrooms_df.drop(columns=['class'])
y = mushrooms_df['class']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Prediction with Random Forest

## Model Building and Prediction

In [None]:
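from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# fix the seed so the results are reproducible
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

# precision, recall and F1 per class on the held-out test set
print(classification_report(y_test, y_predict))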

## Feature Importances

In [None]:
fi_df = pd.DataFrame({
    "feature_importances" : model.feature_importances_,
    "features" : X.columns
})

fi_df.sort_values(by="feature_importances", ascending=False, inplace=True)

sns.barplot(x="feature_importances", y="features", data=fi_df)

# Conclusion

A Random Forest model works quite well for classifying these mushrooms. As we expected from the EDA, some features such as `odor` seem to be helpful.