In [1]:
# | echo: false
# ---
# title: "matplotlib example"
# format:
#   html:
#     code-fold: true
# ---

In [2]:
#| echo: false
# Suppress warnings.
import warnings
warnings.filterwarnings('ignore')

In [3]:
#| echo: false
# Import necessary libraries
import numpy as np 
import pandas as pd
import janitor
import sklearn 
from sklearn.impute import KNNImputer
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, precision_score, accuracy_score, recall_score, f1_score
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from copy import deepcopy

In [4]:
#| echo: false
data = pd.read_csv('../data/mushrooms.csv')

## Overview of the dataset

description, dictionary

Mushroom anatomy picture

Objective

## Data Munging

1) Clean column names automatically using `pyjanitor`; for this dataset, that means replacing each `-` with `_`

In [5]:
#| echo: true
print("Column names before cleaning:", '\n', data.columns.tolist(), '\n\n')
data = data.clean_names()
print("Column names after cleaning:", '\n', data.columns.tolist()) 

Column names before cleaning: 
 ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'] 


Column names after cleaning: 
 ['class', 'cap_shape', 'cap_surface', 'cap_color', 'bruises', 'odor', 'gill_attachment', 'gill_spacing', 'gill_size', 'gill_color', 'stalk_shape', 'stalk_root', 'stalk_surface_above_ring', 'stalk_surface_below_ring', 'stalk_color_above_ring', 'stalk_color_below_ring', 'veil_type', 'veil_color', 'ring_number', 'ring_type', 'spore_print_color', 'population', 'habitat']
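
For readers without `pyjanitor`, roughly the same renaming can be sketched with plain pandas string methods. The toy DataFrame below is hypothetical and only mimics a few of the column names; `clean_names()` handles more cases than this sketch does.

```python
import pandas as pd

# Hypothetical toy frame with hyphenated names like those in the dataset
toy = pd.DataFrame(columns=['class', 'cap-shape', 'stalk-surface-above-ring'])

# Lower-case the names and replace runs of hyphens or whitespace with underscores
toy.columns = (
    toy.columns.str.strip()
    .str.lower()
    .str.replace(r'[-\s]+', '_', regex=True)
)

print(toy.columns.tolist())
```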


2) See what different values each column contains

From the output below, we can see that `veil_type` has a single value: all the mushrooms in our dataset have partial veils (`p`). The column is therefore not informative, so we can proceed with dropping it.

In [6]:
#| echo: false
# Print the distinct values found in each column
for col in data.columns:
    print(col,':  ',data[col].unique())

class :   ['p' 'e']
cap_shape :   ['x' 'b' 's' 'f' 'k' 'c']
cap_surface :   ['s' 'y' 'f' 'g']
cap_color :   ['n' 'y' 'w' 'g' 'e' 'p' 'b' 'u' 'c' 'r']
bruises :   ['t' 'f']
odor :   ['p' 'a' 'l' 'n' 'f' 'c' 'y' 's' 'm']
gill_attachment :   ['f' 'a']
gill_spacing :   ['c' 'w']
gill_size :   ['n' 'b']
gill_color :   ['k' 'n' 'g' 'p' 'w' 'h' 'u' 'e' 'b' 'r' 'y' 'o']
stalk_shape :   ['e' 't']
stalk_root :   ['e' 'c' 'b' 'r' '?']
stalk_surface_above_ring :   ['s' 'f' 'k' 'y']
stalk_surface_below_ring :   ['s' 'f' 'y' 'k']
stalk_color_above_ring :   ['w' 'g' 'p' 'n' 'b' 'e' 'o' 'c' 'y']
stalk_color_below_ring :   ['w' 'p' 'g' 'b' 'n' 'e' 'y' 'o' 'c']
veil_type :   ['p']
veil_color :   ['w' 'n' 'o' 'y']
ring_number :   ['o' 't' 'n']
ring_type :   ['p' 'e' 'l' 'f' 'n']
spore_print_color :   ['k' 'n' 'u' 'h' 'w' 'r' 'o' 'y' 'b']
population :   ['s' 'n' 'a' 'v' 'y' 'c']
habitat :   ['u' 'g' 'm' 'd' 'p' 'w' 'l']


In [7]:
#| echo: false
data.drop('veil_type', axis = 1, inplace = True)
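
Instead of reading the printed values by eye, single-valued columns can also be found programmatically. The following is a minimal sketch on a hypothetical toy frame (the names `toy`, `constant_cols`, and `toy_reduced` are ours, not part of the analysis above).

```python
import pandas as pd

# Hypothetical toy frame: veil_type is constant, the other columns are not
toy = pd.DataFrame({
    'veil_type': ['p', 'p', 'p', 'p'],
    'veil_color': ['w', 'n', 'w', 'o'],
    'bruises': ['t', 'f', 't', 'f'],
})

# Count distinct values per column and keep the names of columns with at most one
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)

# Drop the uninformative columns without modifying the original frame
toy_reduced = toy.drop(columns=constant_cols)
```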

We can see that the column `stalk_root` contains a non-alphabetic value, `?`, which needs some munging.
According to the dataset's documentation, the value '?' in `stalk_root` marks missing or unknown stalk root data.

Let's count these missing values to decide whether it would be okay to drop the affected rows.

In [8]:
#| echo: true
vals = data['stalk_root'].value_counts().index.values.tolist()

NA_count = data['stalk_root'].value_counts().values

NA_frac = data['stalk_root'].value_counts().to_list()
NA_frac = [i/sum(NA_frac) for i in NA_frac]

pd.DataFrame(zip(NA_count,NA_frac), columns=['Count','Fraction'], index= vals)

|   | Count | Fraction |
|---|-------|----------|
| b | 3776 | 0.464796 |
| ? | 2480 | 0.305268 |
| e | 1120 | 0.137863 |
| c | 556 | 0.068439 |
| r | 192 | 0.023634 |


So, if we dropped the rows with a missing value in this column, we would lose about 30% of our data (2,480 instances).
Dropping the rows is therefore not the best solution in this case.
Instead, we'll try to impute the missing values using KNN.
Before that, the categorical values must be label encoded as integers from 0 to n-1, where n is the number of distinct values in a column.

Let's see the order of values in this column before label encoding

In [9]:
data.stalk_root.unique().tolist()

['e', 'c', 'b', 'r', '?']

Let's find the label that corresponds to '?' so we can impute those values

In [10]:
#| echo: true

le = preprocessing.LabelEncoder()
encoded_data = deepcopy(data)
for i in encoded_data.columns.tolist():
    encoded_data[i]= le.fit_transform(encoded_data[i])

encoded_data['stalk_root'].unique().tolist()

[3, 2, 1, 4, 0]
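
The mapping behind these codes can be inspected directly. `LabelEncoder` assigns codes in sorted order of the distinct values, and '?' sorts before the lowercase letters. A small standalone sketch (the name `le_demo` is ours):

```python
from sklearn.preprocessing import LabelEncoder

# Fit an encoder on the stalk_root values seen above
le_demo = LabelEncoder()
le_demo.fit(['e', 'c', 'b', 'r', '?'])

# classes_ holds the sorted distinct values; their positions are the codes
mapping = {
    str(label): int(code)
    for label, code in zip(le_demo.classes_, le_demo.transform(le_demo.classes_))
}
print(mapping)  # → {'?': 0, 'b': 1, 'c': 2, 'e': 3, 'r': 4}
```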

The label corresponding to '?' is 0. 
However, for the imputer to treat these entries as missing, we should replace each 0 in `stalk_root` with NaN. 

In [11]:
#| echo: true
# Preview the value counts with label 0 ('?') replaced by NaN; encoded_data itself is not modified
print("Stalk root column value counts before imputation: \n", encoded_data.replace({'stalk_root': {0: np.nan}}).stalk_root.value_counts(), '\n')
imputer = KNNImputer(missing_values = np.nan, n_neighbors=5, weights = 'distance')
# Fit and apply the imputer to the stalk_root column; the returned array is not assigned back
imputer.fit_transform(encoded_data[['stalk_root']])
print("Stalk root column value counts after imputation: \n", encoded_data.stalk_root.value_counts())

Stalk root column value counts before imputation: 
 1.0    3776
3.0    1120
2.0     556
4.0     192
Name: stalk_root, dtype: int64 

Stalk root column value counts after imputation: 
 1    3776
0    2480
3    1120
2     556
4     192
Name: stalk_root, dtype: int64


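We can see that the value counts are unchanged: label 0 ('?') still appears 2,480 times, so we're back at square one. In the cell above, the NaN replacement only happens inside the `print` call and the array returned by `fit_transform` is never written back to `encoded_data`, so the imputer does not see any missing values. Imputing from `stalk_root` alone would also give KNN no other features to measure distances on. Rather than pursue this further, we'll just drop the column.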

In [12]:
encoded_data.drop('stalk_root', axis = 1, inplace = True)
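
For reference, here is a minimal sketch of how a KNN imputation could be wired up so that its result is actually stored. It runs on a hypothetical toy frame, not on the mushroom data, and the neighbour count is an arbitrary choice. Note that `KNNImputer` averages the neighbours' values, so for label-encoded categories it can return non-integer codes that would still need rounding or a different strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical label-encoded toy data; 0 in stalk_root stands for '?'
toy = pd.DataFrame({
    'stalk_root': [1, 0, 3, 1, 0, 2],
    'cap_shape':  [2, 2, 5, 2, 5, 3],
    'odor':       [6, 6, 0, 6, 0, 3],
})

# Assign the replacement back so the NaNs really exist in the frame
toy['stalk_root'] = toy['stalk_root'].replace(0, np.nan)
n_missing_before = int(toy['stalk_root'].isna().sum())

# Impute over all columns so neighbours are found via the other features
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

n_missing_after = int(imputed['stalk_root'].isna().sum())
print(n_missing_before, n_missing_after)
```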

Now, let's see if the classes in our dataset are balanced or not.

In [13]:
#| echo: false
vals = data['class'].value_counts().index.values.tolist()

class_count = data['class'].value_counts().values

class_frac = data['class'].value_counts().to_list()
class_frac = [round((i/sum(class_frac))*100, 2) for i in class_frac]

pd.DataFrame(zip(class_count,class_frac), columns=['Count','Fraction'], index= vals)

|   | Count | Fraction |
|---|-------|----------|
| e | 4208 | 51.8 |
| p | 3916 | 48.2 |


The classes are reasonably balanced (51.8% edible vs. 48.2% poisonous), so there's no need for any oversampling techniques. 

## EDA 

## Feature & Target Engineering

The only feature and target engineering step we need to perform is label encoding for our categorical variables and the categorical target. Let's identify which class is 1 and which class is 0. 

In [14]:
#| echo: true
print("Class values before encoding: ", data['class'].unique().tolist())
print("Class values after encoding: ", encoded_data['class'].unique().tolist())

Class values before encoding:  ['p', 'e']
Class values after encoding:  [1, 0]


We can see that the `Poisonous (p)` class is now labelled `1`, while the `Edible (e)` class is `0`.

## Baselining

Baseline models are the stepping stone on which developers base their initial assumptions about the direction development should take, so they tend to be simple and interpretable. Since we're aiming to perform binary classification, we chose `LogisticRegression` as a baseline model against which we'll compare our refined models. 

After label encoding, the second step in preparing our data for modeling is standard scaling (z-score).  

In [15]:
#| echo: true
scaled_data = deepcopy(encoded_data)
scaled_data.drop('class', axis=1, inplace = True)

ss = StandardScaler()

ss.fit(scaled_data)
scaled_data = ss.transform(scaled_data)
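
As a quick check of what standard scaling does, the z-score $z = (x - \mu) / \sigma$ can be computed by hand and compared with `StandardScaler`. The small array below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: three rows, two features
X_small = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 8.0]])

# Manual z-score per column; np.std defaults to the population std (ddof=0)
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

# StandardScaler applies the same per-column transformation
scaled = StandardScaler().fit_transform(X_small)

print(np.allclose(manual, scaled))  # → True
```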

The third step is to split our data into training and testing sets.

75% of the data is used for training, while the remaining 25% is held out for testing.

In [16]:
X = scaled_data
y = encoded_data['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
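
The cells above fit the scaler on the full dataset before splitting, and the split is not seeded. One common alternative, sketched here on synthetic data rather than the mushroom features, is to split first and let a pipeline fit the scaler on the training rows only. The `random_state` and `stratify` arguments are our additions to make the split repeatable and class-balanced.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and target
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)

# Split first; stratify keeps the class ratio, random_state fixes the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training rows, then reuses it on the test rows
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)

print(X_tr.shape, X_te.shape)
```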

In [17]:
# Baseline Logistic Regression Model
lr = LogisticRegression()

# Training 
lr.fit(X_train, y_train)

# Predicting
lr_pred = lr.predict(X_test)

# Evaluating
print('Confusion Matrix :\n', confusion_matrix(y_test, lr_pred))
print()
print('Classification Report :\n', classification_report(y_test, lr_pred))

Confusion Matrix :
 [[1011   43]
 [  62  915]]

Classification Report :
               precision    recall  f1-score   support

           0       0.94      0.96      0.95      1054
           1       0.96      0.94      0.95       977

    accuracy                           0.95      2031
   macro avg       0.95      0.95      0.95      2031
weighted avg       0.95      0.95      0.95      2031
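
To see where the numbers in the classification report come from, the main metrics for the poisonous class (label 1) can be recomputed from the confusion matrix above. This is plain arithmetic on the printed values, not a rerun of the model.

```python
import numpy as np

# Confusion matrix from the baseline output (rows: true class, columns: predicted class)
cm = np.array([[1011, 43],
               [62, 915]])

# For binary labels, ravel() yields TN, FP, FN, TP with class 1 as the positive class
tn, fp, fn, tp = cm.ravel()

accuracy = float((tp + tn) / cm.sum())
precision_poisonous = float(tp / (tp + fp))
recall_poisonous = float(tp / (tp + fn))

print(round(accuracy, 2), round(precision_poisonous, 2), round(recall_poisonous, 2))
# → 0.95 0.96 0.94
```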



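The baseline reaches about 95% accuracy on the 2,031 test mushrooms. Note, however, that 62 of the 977 poisonous mushrooms are predicted to be edible, which is arguably the costlier kind of error for this task.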

## Feature Filtering & Dimensionality Reduction

Our dataset still has 20 features after dropping `veil_type` and `stalk_root`, which is relatively high, so feature filtering and dimensionality reduction are worthwhile. For this purpose, we use three feature filtering/selection algorithms (variance thresholding, chi-square, and LASSO), as well as principal component analysis (PCA) for dimensionality reduction. 
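
The four techniques named above can be sketched with scikit-learn as follows. This runs on synthetic integer-coded data, not on the mushroom features, and every threshold, `k`, and `C` value is an arbitrary placeholder. We also assume that "LASSO" here means an L1-penalised logistic regression wrapped in `SelectFromModel`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel, SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 rows, 20 non-negative integer-coded features
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(200, 20))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 4).astype(int)

# 1) Variance thresholding: drop features whose variance is below the threshold
X_vt = VarianceThreshold(threshold=0.1).fit_transform(X_demo)

# 2) Chi-square: keep the k features most associated with the target (inputs must be non-negative)
X_chi = SelectKBest(chi2, k=10).fit_transform(X_demo, y_demo)

# 3) L1 selection: the penalty shrinks weak coefficients to zero, and those features are dropped
X_std = StandardScaler().fit_transform(X_demo)
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
X_l1 = SelectFromModel(l1_model).fit_transform(X_std, y_demo)

# 4) PCA: keep enough components to explain 95% of the variance
X_pca = PCA(n_components=0.95).fit_transform(X_std)

print(X_vt.shape, X_chi.shape, X_l1.shape, X_pca.shape)
```

Each selector returns an array with the same number of rows and at most the original 20 columns, so the outputs can be passed to the same train/test split and model as before.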

## Modelling

## Conclusions