<a href="https://www.kaggle.com/code/romanvelichkin/mushroom-classification-rfc-1-0-score?scriptVersionId=142816653" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Mushroom classification
Classify mushrooms as poisonous or edible based on their appearance.

### Result:
Mushrooms are well studied, and people have long known how to distinguish poisonous mushrooms from edible ones.

In the dataset, each mushroom is described by a detailed set of external features.

These features were enough to create a model that predicts with 100% accuracy on the held-out test set whether a mushroom can be eaten.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# enable drawing plots in jupyter
%matplotlib inline

# import models from scikit-learn
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# import model evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
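# plot_roc_curve was removed in scikit-learn 1.2; fall back to its replacement
try:
    from sklearn.metrics import plot_roc_curve
except ImportError:
    from sklearn.metrics import RocCurveDisplay
    plot_roc_curve = RocCurveDisplay.from_estimator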

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Import data

In [None]:
# Read data
df = pd.read_csv("/kaggle/input/mushroom-classification/mushrooms.csv")
df

## Data exploration

In [None]:
# Check how the data is distributed between the two mushroom classes
df["class"].value_counts().plot(kind="bar");

In [None]:
# Check the data types of the columns
df.info()

In [None]:
# Check for missing (NaN) values in each column
df.isna().sum()

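`isna()` only counts real NaN values. Some categorical datasets encode missing entries as a placeholder string instead; as far as I know, the UCI version of this dataset uses `?` for that purpose in the `stalk-root` column. A minimal sketch of such a check, using a small made-up frame in place of `df` (the toy values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the mushroom frame (values are made up for illustration)
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "stalk-root": ["e", "?", "c", "?"],
    "odor": ["p", "a", "l", "p"],
})

# isna() reports no missing values, because "?" is an ordinary string
nan_counts = toy.isna().sum()

# Count the placeholder explicitly, column by column
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)
```

On the real frame the same check would be `(df == "?").sum()`, run before the label encoding step.
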
## Data preparation
All data is non-numerical. Transform it into numerical categories.

In [None]:
# Transform data into numerical categories

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
for i in df.columns:
    df[i] = encoder.fit_transform(df[i])

df

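`LabelEncoder` maps each category to an integer, which implies an order that the categories do not have. Tree-based models tend to cope with this, but linear models and k-nearest neighbours may be affected. One common alternative is one-hot encoding; the sketch below uses `pd.get_dummies` on a toy frame (the frame and its values are assumptions, not the notebook's data):

```python
import pandas as pd

# Toy stand-in for the mushroom frame (values are made up for illustration)
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
    "gill-size": ["n", "b", "b", "n"],
})

# Encode the label as 0/1 (edible = 0, poisonous = 1)
y_toy = (toy["class"] == "p").astype(int)

# One indicator column per (feature, category) pair
X_toy = pd.get_dummies(toy.drop("class", axis=1))
print(X_toy.columns.tolist())
```
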
Cross-validation showed large differences in score between folds.
Shuffling the rows can help with this problem.

In [None]:
# Shuffle data
np.random.seed(99)

df = df.sample(frac=1).reset_index(drop=True)
df

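An alternative to shuffling the frame itself is to let the cross-validation splitter do the shuffling. The sketch below passes a `StratifiedKFold(shuffle=True)` object to `cross_val_score`; it runs on synthetic data from `make_classification`, which is only an assumed stand-in for the mushroom features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=99)

# Shuffle inside the splitter and keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=99)

model = RandomForestClassifier(n_estimators=50, random_state=99)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores, scores.mean())
```

With this approach the original row order of the frame is left untouched, and the fold assignment is reproducible through `random_state`.
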
## Modelling

In [None]:
# Split data into X (features) and y (labels)
X = df.drop("class", axis=1)
y = df["class"]
y

In [None]:
# Split data into train and test sets
np.random.seed(99)

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
# Function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates the given machine learning models.
    models: a dict of model name -> estimator
    X_train: training data (no labels)
    X_test: test data (no labels)
    y_train: training labels
    y_test: test labels
    Returns a dict of model name -> accuracy on the test set.
    """
    np.random.seed(99)
    
    # dict to keep model scores
    model_scores = {}
    
    # loop through models
    for name, model in models.items():
        # fit model to the data
        model.fit(X_train, y_train)
        # evaluate model
        model_scores[name] = model.score(X_test, y_test)
    
    return model_scores

Compare how different models handle this classification problem.

In [None]:
# LinearSVC
# LogisticRegression
# KNeighborsClassifier
# RandomForestClassifier

# Create dictionary with models
models = {"LinearSVC": LinearSVC(),
          "KNeighborsClassifier": KNeighborsClassifier(),
          "RandomForestClassifier": RandomForestClassifier(),
          "LogisticRegression": LogisticRegression()}

model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)

model_scores

## Evaluating model 
I chose RandomForestClassifier because it reaches 100% accuracy on the test set with default hyperparameters, so no tuning is needed.

In [None]:
# Train RFC model
np.random.seed(99)

rfc_model = RandomForestClassifier()
rfc_model.fit(X_train, y_train)
print("RandomForestClassifier score:", rfc_model.score(X_test, y_test))

In [None]:
# Get predictions on test dataset using trained model
y_preds = rfc_model.predict(X_test)
y_preds

### Plot ROC curve, calculate AUC metric and confusion matrix

In [None]:
# ROC curve
plot_roc_curve(rfc_model, X_test, y_test);

In [None]:
# Confusion matrix (rows: true labels, columns: predicted labels)

print(confusion_matrix(y_test, y_preds))

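scikit-learn's `confusion_matrix` takes the true labels first and the predictions second; rows then correspond to true classes and columns to predicted classes. A small sketch with made-up labels shows the layout and how to unpack it for a binary problem:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = edible, 1 = poisonous
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = cm.ravel()
print(cm)
print("false negatives (poisonous predicted as edible):", fn)
```

For this task the false negatives are the costly cell, since they are poisonous mushrooms predicted as edible.
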
In [None]:
# Classification report as precision, recall and f1-score

print(classification_report(y_test, y_preds))

### Cross-validated score

In [None]:
# Cross-validated accuracy
np.random.seed(99)

cv_accuracy = cross_val_score(rfc_model, X, y, scoring="accuracy")
print(cv_accuracy)

cv_accuracy_mean = np.mean(cv_accuracy)
print("mean accuracy:", cv_accuracy_mean)

In [None]:
# Cross-validated precision
np.random.seed(99)

cv_precision = cross_val_score(rfc_model, X, y, scoring="precision")
print(cv_precision)

cv_precision_mean = np.mean(cv_precision)
print("mean precision:", cv_precision_mean)

In [None]:
# Cross-validated recall
np.random.seed(99)

cv_recall = cross_val_score(rfc_model, X, y, scoring="recall")
print(cv_recall)

cv_recall_mean = np.mean(cv_recall)
print("mean recall:", cv_recall_mean)

In [None]:
# Cross-validated f1-score
np.random.seed(99)

cv_f1 = cross_val_score(rfc_model, X, y, scoring="f1")
print(cv_f1)

cv_f1_mean = np.mean(cv_f1)
print("mean f1-score:", cv_f1_mean)

In [None]:
# Visualisation of cross-validated metrics

cv_metrics = pd.DataFrame({"Accuracy": cv_accuracy_mean,
                           "Precision": cv_precision_mean,
                           "Recall": cv_recall_mean,
                           "F1-score": cv_f1_mean},
                          index=[0])
cv_metrics.T.plot.bar(title="Cross-validated classification metrics",
                      legend=False);

## Feature importance

In [None]:
# Get feature importances (a random forest has no coefficients)
rfc_model.feature_importances_

In [None]:
# Match feature importances to feature names
# df.columns[0] is the "class" label, so skip it
feature_dict = dict(zip(df.columns[1:], rfc_model.feature_importances_))
feature_dict

In [None]:
# Visualisation of feature importance
feature_df = pd.DataFrame(feature_dict, index=[0])
feature_df.T.plot.bar(title="Feature importance",
                      legend=False,
                      figsize=(10,10));

A few features are more important than the rest: `odor`, `gill-size`, `gill-color` and `spore-print-color`.
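
Sorting the importances makes such a ranking easier to read than a dictionary or an unsorted bar chart. A sketch, assuming a fitted forest and a feature frame; here both are built from synthetic data with hypothetical column names `f0` to `f5`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X and y (an assumption; not the mushroom data)
X_arr, y_demo = make_classification(n_samples=300, n_features=6,
                                    n_informative=3, random_state=99)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=99)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's own objects this would be `pd.Series(rfc_model.feature_importances_, index=X.columns).sort_values(ascending=False)`.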

## Result
Mushrooms are well studied, and people have long known how to distinguish poisonous mushrooms from edible ones.

The given features were enough to create a model that predicts with 100% accuracy whether a mushroom can be eaten.