# Mushroom Classification

This notebook takes the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation
7. Experimentation / Improvements

# 1. Problem Definition

How can we use various Python-based machine learning models and the given features to predict whether a mushroom is safe to eat or poisonous?

# 2. Data

Data from: https://www.kaggle.com/uciml/mushroom-classification

## Context

Although this dataset was originally contributed to the UCI Machine Learning repository nearly 30 years ago, mushroom hunting (otherwise known as "shrooming") is enjoying new peaks in popularity. Learn which features spell certain death and which are most palatable in this dataset of mushroom characteristics. And how certain can your model be?

## Content

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

    Time period: Donated to UCI ML 27 April 1987


# 3. Evaluation

As this is a classification problem, we will use classification metrics (accuracy, precision, recall and F1) to evaluate the models.
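
As a minimal sketch of what those metrics measure, the block below computes them by hand from the four confusion-matrix counts. The labels are invented toy values (1 = poisonous, 0 = edible), not results from this dataset. For mushrooms, recall on the poisonous class is arguably the number to watch, since a false negative means a poisonous mushroom labelled edible.

```python
# Toy labels invented for illustration only (1 = poisonous, 0 = edible).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # poisonous, predicted poisonous
fp = sum(t == 0 and p == 1 for t, p in pairs)  # edible, predicted poisonous
fn = sum(t == 1 and p == 0 for t, p in pairs)  # poisonous, predicted edible
tn = sum(t == 0 and p == 0 for t, p in pairs)  # edible, predicted edible

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```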

# 4. Features

## Inputs / features

    cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s

    cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s

    cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y

    bruises: bruises=t,no=f

    odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

    gill-attachment: attached=a,descending=d,free=f,notched=n

    gill-spacing: close=c,crowded=w,distant=d

    gill-size: broad=b,narrow=n

    gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y

    stalk-shape: enlarging=e,tapering=t

    stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?

    stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s

    stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s

    stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

    stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

    veil-type: partial=p,universal=u

    veil-color: brown=n,orange=o,white=w,yellow=y

    ring-number: none=n,one=o,two=t

    ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z

    spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y

    population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y

    habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

## Output / labels

    classes: edible=e, poisonous=p
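
The single-letter codes are compact but hard to read in tables and plots. As a sketch, the block below maps them back to names for two columns. The dictionaries are copied from the lists above; `ODOR_NAMES`, `CLASS_NAMES` and the three-row frame are hypothetical names and toy data introduced only for this example.

```python
import pandas as pd

# Code-to-name mappings copied from the feature list above.
ODOR_NAMES = {'a': 'almond', 'l': 'anise', 'c': 'creosote', 'y': 'fishy', 'f': 'foul',
              'm': 'musty', 'n': 'none', 'p': 'pungent', 's': 'spicy'}
CLASS_NAMES = {'e': 'edible', 'p': 'poisonous'}

# Toy rows invented for illustration; the real rows come from mushrooms.csv.
toy = pd.DataFrame({'class': ['e', 'p', 'e'], 'odor': ['a', 'f', 'n']})

# Replace each code with its readable name, column by column.
readable = toy.copy()
readable['class'] = toy['class'].map(CLASS_NAMES)
readable['odor'] = toy['odor'].map(ODOR_NAMES)
print(readable)
```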

## Standard imports

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Reading the dataset

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Local
# df = pd.read_csv('mushrooms.csv')

# Kaggle
df = pd.read_csv('/kaggle/input/mushroom-classification/mushrooms.csv')
df.head()

## Data exploration

In [None]:
df

In [None]:
df.info()

In [None]:
df.describe().transpose()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of class')
sns.countplot(data=df, x='class');

The two classes are roughly balanced.
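
The balance can also be checked numerically rather than read off the plot. This is a sketch on an invented toy label column; on the real data the same two calls would be applied to `df['class']`.

```python
import pandas as pd

# Toy labels invented for illustration (6 edible, 4 poisonous).
toy_class = pd.Series(['e', 'p', 'e', 'p', 'e', 'e', 'p', 'p', 'e', 'e'], name='class')

counts = toy_class.value_counts()                # absolute count per class
shares = toy_class.value_counts(normalize=True)  # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()    # 1.0 means perfectly balanced

print(shares)
print(f'imbalance ratio: {imbalance_ratio:.2f}')
```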

In [None]:
df['veil-type'].unique()

Since `veil-type` has only one unique value, it carries no information for the model, so we will drop it.

In [None]:
df = df.drop('veil-type', axis=1)
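
The same check can be run over every column at once instead of one column at a time. This is a sketch on an invented toy frame; `constant_cols` is a hypothetical helper name.

```python
import pandas as pd

# Toy frame invented for illustration; 'veil-type' holds a single value.
toy = pd.DataFrame({'veil-type': ['p', 'p', 'p', 'p'],
                    'odor': ['a', 'f', 'n', 'f'],
                    'gill-size': ['b', 'n', 'b', 'b']})

# A column with at most one distinct value cannot help separate the classes.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```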

In [None]:
pd.get_dummies(df, drop_first=True).corr()['class_p'].sort_values(ascending=False)[:20]

In [None]:
pd.get_dummies(df, drop_first=True).corr()['class_p'].sort_values(ascending=False)[-20:]

Let's take a closer look at odor, stalk surface and gill size, as these are highly correlated with the class.

In [None]:
plt.figure(figsize=(20,10))
plt.title('count of odor colored by class')
sns.countplot(data=df, x='odor', hue='class');

From the graph we can see that mushrooms with an almond or anise odor are safe to eat, while a few of those with no odor are poisonous.
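
A cross-tabulation puts numbers behind a reading like this. The sketch below uses invented toy rows, so the counts are illustrative only; on the real data the same call would take `df['odor']` and `df['class']`.

```python
import pandas as pd

# Toy rows invented for illustration; not the real odor/class counts.
toy = pd.DataFrame({'odor':  ['a', 'a', 'l', 'f', 'f', 'n', 'n', 'n', 'n'],
                    'class': ['e', 'e', 'e', 'p', 'p', 'e', 'e', 'e', 'p']})

# Raw counts of each class per odor value.
table = pd.crosstab(toy['odor'], toy['class'])

# Share of poisonous samples within each odor value (each row sums to 1).
poison_share = pd.crosstab(toy['odor'], toy['class'], normalize='index')['p']

print(table)
print(poison_share)
```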

In [None]:
plt.figure(figsize=(20,10))
plt.title('count of stalk-surface-above-ring colored by class')
sns.countplot(data=df, x='stalk-surface-above-ring', hue='class');

In [None]:
plt.figure(figsize=(20,10))
plt.title('count of stalk-surface-below-ring by class')
sns.countplot(data=df, x='stalk-surface-below-ring', hue='class');

In [None]:
plt.figure(figsize=(20,10))
plt.title('count of gill-size by class')
sns.countplot(data=df, x='gill-size', hue='class');

In [None]:
plt.figure(figsize=(20,10))
plt.title('count of spore-print-color by class')
sns.countplot(data=df, x='spore-print-color', hue='class');

In [None]:
plt.figure(figsize=(20,10))
plt.title('count of habitat by class')
sns.countplot(data=df, x='habitat', hue='class');

In [None]:
plt.figure(figsize=(20,10))
plt.title('count of ring-type by class')
sns.countplot(data=df, x='ring-type', hue='class');

# 5. Modelling

In [None]:
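# Separate the features from the target column
X = df.drop('class', axis=1)

# One-hot encode the categorical features, dropping the first level of each
X = pd.get_dummies(X, drop_first=True)
# Encode the target as a 1-D Series (1 = poisonous, 0 = edible), the shape scikit-learn expects
y = (df['class'] == 'p').astype(int)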
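
To make the encoding step concrete, here is a small sketch of what `pd.get_dummies(..., drop_first=True)` produces. The two-column frame is an invented toy example, and `dtype=int` is passed only so the result prints as 0/1.

```python
import pandas as pd

# Toy frame invented for illustration.
toy = pd.DataFrame({'gill-size': ['b', 'n', 'b'],
                    'bruises': ['t', 'f', 't']})

# Each column becomes one indicator per level; drop_first removes the first
# level (alphabetically), which is then implied when all indicators are 0.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

With two levels per column, each feature collapses to a single indicator, which avoids carrying two perfectly redundant columns into the linear models.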

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
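
The split above is not stratified. One option, shown here as a sketch on invented toy arrays, is to pass `stratify=` so that both partitions keep the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data invented for illustration: 20 samples, 12 of class 0 and 8 of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# stratify=y_toy keeps the 60/40 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

print(np.bincount(y_tr), np.bincount(y_te))
```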

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
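
One caveat: the scaler here is fitted on all of `X_train`, and the cross-validation later in the notebook runs on that already-scaled data, so each validation fold has influenced the scaling statistics. A possible alternative, sketched below on random toy data, wraps the scaler and the model in a pipeline so the scaler is refit inside every fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy binary features invented for illustration; the label copies column 0.
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 2, size=(200, 5)).astype(float)
y_toy = X_toy[:, 0].astype(int)

# The pipeline is cloned and fitted per fold, so scaling only ever sees
# the training part of that fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring='accuracy')
print(scores.mean())
```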

## Model imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

In [None]:
from warnings import filterwarnings
filterwarnings('ignore')

## Baseline model scores

In [None]:
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score

In [None]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_scores = {}
    model_recall = {}
    model_f1 = {}
    model_precision = {}
    
    for name, model in models.items():
        model.fit(X_train,y_train)
        y_preds = model.predict(X_test)
        print(name)
        print(classification_report(y_test, y_preds))
        print('\n')
        model_scores[name] = model.score(X_test,y_test)
        model_recall[name] = recall_score(y_test, y_preds)
        model_f1[name] = f1_score(y_test, y_preds)
        model_precision[name] = precision_score(y_test, y_preds)

    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')
    model_recall = pd.DataFrame(model_recall, index=['Recall']).transpose()
    model_recall = model_recall.sort_values('Recall')
    model_f1 = pd.DataFrame(model_f1, index=['F1']).transpose()
    model_f1 = model_f1.sort_values('F1')
    model_precision = pd.DataFrame(model_precision, index=['Precision']).transpose()
    model_precision = model_precision.sort_values('Precision')
        
    return model_scores, model_recall, model_f1, model_precision

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'XGBRFClassifier': XGBRFClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'LGBMClassifier':LGBMClassifier(),
         'CatBoostClassifier': CatBoostClassifier(verbose=0)}

In [None]:
model_scores, model_recall, model_f1, model_precision = fit_and_score(models, X_train, X_test, y_train, y_test)

Since all the models score well on a single train/test split, we will run a baseline evaluation using cross-validation.

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
def get_baseline_cv_scores(models, X, y, cv=5):
    """Cross-validate every model in `models` and return the mean
    accuracy, recall, F1 and precision per model (in that order)."""
    
    model_scores = {}
    model_recall = {}
    model_f1 = {}
    model_precision = {}
    
    for name, model in models.items():
        
        print(name)
        cv_accuracy = cross_val_score(model,X,y,cv=cv,
                             scoring='accuracy')
        print(f'Cross Validation accuracy Scores: {cv_accuracy}')
        print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')

        cv_precision = cross_val_score(model,X,y,cv=cv,
                             scoring='precision')
        print(f'Cross Validation precision Scores: {cv_precision}')
        print(f'Cross Validation precision Mean Score: {cv_precision.mean()}')

        cv_recall = cross_val_score(model,X,y,cv=cv,
                             scoring='recall')
        print(f'Cross Validation recall Scores: {cv_recall}')
        print(f'Cross Validation recall Mean Score: {cv_recall.mean()}')

        cv_f1 = cross_val_score(model,X,y,cv=cv,
                             scoring='f1')
        print(f'Cross Validation f1 Scores: {cv_f1}')
        print(f'Cross Validation f1 Mean Score: {cv_f1.mean()}') 
        print('\n')

        # Store each mean under the matching metric name
        model_scores[name] = cv_accuracy.mean()
        model_recall[name] = cv_recall.mean()
        model_f1[name] = cv_f1.mean()
        model_precision[name] = cv_precision.mean()
    
    return model_scores, model_recall, model_f1, model_precision

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'XGBRFClassifier': XGBRFClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'LGBMClassifier':LGBMClassifier(),
         'CatBoostClassifier': CatBoostClassifier(verbose=0)}

In [None]:
model_scores, model_recall, model_f1, model_precision = get_baseline_cv_scores(models, X_train, y_train, cv=5)
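
Calling `cross_val_score` once per metric refits every model four times. As a sketch of an alternative, `cross_validate` accepts a list of scorers and fits each fold only once. The data here comes from `make_classification` and is purely illustrative; `summary` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic toy data, invented for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

scoring = ['accuracy', 'precision', 'recall', 'f1']
results = cross_validate(LogisticRegression(max_iter=10000), X_toy, y_toy,
                         cv=5, scoring=scoring)

# cross_validate returns one array per metric under the key 'test_<metric>'.
summary = {metric: results[f'test_{metric}'].mean() for metric in scoring}
print(summary)
```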

In [None]:
model_scores = pd.DataFrame(model_scores, index=['Accuracy'])

In [None]:
model_scores.transpose().sort_values('Accuracy')

Since LogisticRegression scores well, we will build the final model with it, as it is faster to train and simpler than the alternatives.

# 6. Model Evaluation

In [None]:
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, RocCurveDisplay

In [None]:
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

## Classification Report

In [None]:
print(classification_report(y_test, y_preds))

## Confusion Matrix

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; the Display API replaces it
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

## ROC curve

In [None]:
# plot_roc_curve was removed in scikit-learn 1.2; the Display API replaces it
RocCurveDisplay.from_estimator(model, X_test, y_test)
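
The ROC curve can be summarised as a single number, the area under the curve. The sketch below computes it from predicted probabilities on synthetic toy data from `make_classification`; the resulting value says nothing about the mushroom model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic toy data, invented for illustration.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC analysis needs a score per sample, here the probability of class 1.
proba = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)
print(f'ROC AUC: {auc:.3f}')
```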

## Feature Importance

In [None]:
feat_impt = pd.DataFrame(model.coef_[0], index=X.columns.values)

In [None]:
plt.figure(figsize=(20,10))
plt.title('Feature Importances')
plt.xticks(rotation=90)
sns.barplot(data=feat_impt.sort_values(0).T);
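
Logistic regression coefficients are signed: a positive value pushes the prediction toward class 1 and a negative value toward class 0. When ranking features it can help to sort by absolute value, as sketched below. The frame and its column names are invented toy data, with the label deliberately copied from one column so that column should rank first.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy indicator features invented for illustration.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.integers(0, 2, size=(300, 3)),
                     columns=['odor_f', 'odor_n', 'gill-size_n'])
y_toy = X_toy['odor_f']  # label driven entirely by one column

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Keep the sign, but order the features by the magnitude of their coefficient.
coefs = pd.Series(clf.coef_[0], index=X_toy.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```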

## Evaluation using cross-validation

In [None]:
def get_cv_score(model, X, y, cv=5):
    """Cross-validate one model and return its mean accuracy,
    precision, recall and F1 as a one-row DataFrame."""
    
    cv_accuracy = cross_val_score(model,X,y,cv=cv,
                         scoring='accuracy')
    print(f'Cross Validation accuracy Scores: {cv_accuracy}')
    print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')
    
    cv_precision = cross_val_score(model,X,y,cv=cv,
                         scoring='precision')
    print(f'Cross Validation precision Scores: {cv_precision}')
    print(f'Cross Validation precision Mean Score: {cv_precision.mean()}')
    
    cv_recall = cross_val_score(model,X,y,cv=cv,
                         scoring='recall')
    print(f'Cross Validation recall Scores: {cv_recall}')
    print(f'Cross Validation recall Mean Score: {cv_recall.mean()}')
    
    cv_f1 = cross_val_score(model,X,y,cv=cv,
                         scoring='f1')
    print(f'Cross Validation f1 Scores: {cv_f1}')
    print(f'Cross Validation f1 Mean Score: {cv_f1.mean()}')   
    
    # The f1 column must use cv_f1, not cv_recall
    cv_metrics = pd.DataFrame({'Accuracy': cv_accuracy.mean(),
                         'Precision': cv_precision.mean(),
                         'Recall': cv_recall.mean(),
                         'f1': cv_f1.mean()},index=[0])
    
    return cv_metrics

In [None]:
cv_metrics = get_cv_score(model, X_train, y_train, cv=5)

In [None]:
cv_metrics