# 001 Secondary Mushroom Classifier

Our goal is to create a model and then preserve that model using
`joblib`.

In this notebook, we'll look at a simulated mushroom dataset. These
mushrooms can be edible or poisonous, and we'll build a very rough
logistic regression classification model.

We'll reduce the features down to a few (3 continuous features and 2
categorical).

We'll preserve the preprocessing pipeline as well as the logistic
regression model itself. There are a few ways to do this, but we'll store
the things we want to keep in a dictionary and then save that
dictionary to a `.joblib` file using `joblib`.


### Imports


In [1]:
# library for saving python objects
import joblib

In [2]:
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

In [3]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import (
    OneHotEncoder, StandardScaler)
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from sklearn.metrics import (
    f1_score, accuracy_score
)

### UC Irvine Machine Learning Repository

If this is your first time pulling a dataset from UCI, you will need to
install the `ucimlrepo` package.

```bash
pip install ucimlrepo
```


[Secondary Mushroom - UCI Machine Learning
Repository](https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset)

> Wagner,Dennis, Heider,D., and Hattab,Georges. (2023). Secondary
> Mushroom. UCI Machine Learning Repository.
> https://doi.org/10.24432/C5FP5Q.


Important Note

These mushrooms are generated, hypothetical mushrooms. Do not use this
model for actual mushroom identification!


In [4]:
from ucimlrepo import fetch_ucirepo

# fetch mushroom dataset
secondary_mushroom = fetch_ucirepo(id=848)

In [5]:
# data (as pandas DataFrames)
X = secondary_mushroom.data.features
y = secondary_mushroom.data.targets

In [6]:
# mushroom dataset metadata
# secondary_mushroom.metadata

secondary_mushroom.metadata.additional_info

{'summary': 'The given information is about the Secondary Mushroom Dataset, the Primary Mushroom Dataset used for the simulation and the respective metadata can be found in the zip.\n\nThis dataset includes 61069 hypothetical mushrooms with caps based on 173 species (353 mushrooms\nper species). Each mushroom is identified as definitely edible, definitely poisonous, or of\nunknown edibility and not recommended (the latter class was combined with the poisonous class).\n\nThe related Python project contains a Python module secondary_data_generation.py\nused to generate this data based on primary_data_edited.csv also found in the repository.\nBoth nominal and metrical variables are a result of randomization.\nThe simulated and ordered by species version is found in secondary_data_generated.csv.\nThe randomly shuffled version is found in secondary_data_shuffled.csv.',
 'purpose': 'Inspired by the Mushroom Data Set of J. Schlimmer: url:https://archive.ics.uci.edu/ml/datasets/Mushroom.',
 'f

## Initial dataset observations


In [7]:
X.sample(10).T

Unnamed: 0,50749,291,35362,18216,54128,31572,25734,54598,24835,7111
cap-diameter,4.1,13.99,7.58,6.81,6.38,5.6,7.17,7.03,7.46,3.0
cap-shape,x,x,x,f,s,x,x,x,x,x
cap-surface,d,g,y,s,d,y,t,w,s,s
cap-color,n,e,n,n,o,n,e,y,k,g
does-bruise-or-bleed,f,f,f,f,f,f,f,f,f,f
gill-attachment,p,e,a,s,d,e,x,f,a,d
gill-spacing,,,c,c,c,c,d,f,c,c
gill-color,n,w,y,w,o,w,w,f,w,w
stem-height,5.19,15.42,8.49,6.43,3.76,5.0,5.25,4.77,4.74,2.81
stem-width,11.45,15.85,9.52,10.48,6.73,10.33,17.49,13.37,16.54,7.06


### Drop columns containing null values

For the purposes of this example, drop all columns containing NaN values.

This reduces the dataset to 11 features.


In [8]:
X = X.dropna(axis=1)
X.head().T

Unnamed: 0,0,1,2,3,4
cap-diameter,15.26,16.6,14.07,14.17,14.64
cap-shape,x,x,x,f,x
cap-color,o,o,o,e,o
does-bruise-or-bleed,f,f,f,f,f
gill-color,w,w,w,w,w
stem-height,16.95,17.99,17.8,15.77,16.53
stem-width,17.09,18.19,17.74,15.98,17.2
stem-color,w,w,w,w,w
has-ring,t,t,t,t,t
habitat,d,d,d,d,d
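
As a small illustration of what `dropna(axis=1)` does, here is a sketch on a toy frame. The values are made up for the example and only the column names echo the dataset. It counts missing values per column first, then drops every column that has at least one.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two complete columns, one with gaps
toy = pd.DataFrame({
    "cap-diameter": [4.1, 13.99, 7.58],
    "gill-spacing": [np.nan, np.nan, "c"],
    "cap-shape": ["x", "x", "f"],
})

# Count missing values per column before deciding what to drop
null_counts = toy.isna().sum()
print(null_counts)

# axis=1 drops every column that contains at least one NaN
complete_cols = toy.dropna(axis=1)
print(list(complete_cols.columns))
```

Dropping whole columns is a blunt choice that suits a demonstration; in a real project, imputation or row-wise filtering may be worth considering.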


For this demonstration, we'd like one binary categorical variable and
one categorical variable with more than two possible values.


In [9]:
for col in X.columns:

    print(col)

    # if it's not numeric
    if not is_numeric_dtype(X[col]):

        # return value count length
        print(len(X[col].value_counts()))

cap-diameter
cap-shape
7
cap-color
12
does-bruise-or-bleed
2
gill-color
12
stem-height
stem-width
stem-color
13
has-ring
2
habitat
8
season
4
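
The loop above can also be written as a single pandas expression. This is a sketch on a toy frame with made-up values, not a replacement for the cell: `select_dtypes` keeps the non-numeric columns and `nunique` counts the distinct values in each.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and two categorical columns
toy = pd.DataFrame({
    "cap-diameter": [4.1, 13.99, 7.58, 6.81],
    "has-ring": ["t", "f", "f", "t"],
    "cap-shape": ["x", "f", "s", "x"],
})

# Keep only non-numeric columns, then count distinct values in each
category_counts = toy.select_dtypes(exclude="number").nunique()
print(category_counts)
```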


We'll keep only three continuous (ratio-scale) variables and two
categorical variables (one binary, one with more than two categories).

From the [data
dictionary](https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset):

```text
1. cap-diameter (m): float number in cm
2. cap-shape (n): bell=b, conical=c, convex=x, flat=f,
sunken=s, spherical=p, others=o
9. stem-height (m): float number in cm
10. stem-width (m): float number in mm
16. has-ring (n): ring=t, none=f

```

We'll change the order so that the three continuous variables come first.

```python
[
    'cap-diameter', 'stem-height', 'stem-width',
    'has-ring', 'cap-shape'
]

```


In [10]:
X_subset = X[[
    'cap-diameter', 'stem-height', 'stem-width',
    'has-ring', 'cap-shape'
]]

X_subset.sample(5).T

Unnamed: 0,36178,53530,29718,24164,32462
cap-diameter,5.03,9.24,3.32,7.35,3.72
stem-height,5.28,5.27,4.54,7.71,5.78
stem-width,5.82,12.61,5.98,19.18,4.34
has-ring,t,f,f,f,f
cap-shape,x,s,s,x,f


## Observe possible and likely values for retained features


### Raw value min-max limits


In [11]:
print(X_subset.describe().T[['mean', 'std', 'min', 'max']])

                   mean        std   min     max
cap-diameter   6.733854   5.264845  0.38   62.34
stem-height    6.581538   3.370017  0.00   33.92
stem-width    12.149410  10.035955  0.00  103.91


### Possible categorical values


In [12]:
print(
    X_subset['has-ring'].value_counts().index,
    X_subset['cap-shape'].value_counts().index,
    sep='\n')

Index(['f', 't'], dtype='object', name='has-ring')
Index(['x', 'f', 's', 'b', 'o', 'p', 'c'], dtype='object', name='cap-shape')


### Prep for dropping specific values when using OneHotEncoder

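You may remember the need to drop one column per feature when one-hot
encoding so that the encoded columns carry no redundant information: a
feature with k categories is fully described by k-1 indicator columns.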

Since `has-ring` is only `t` or `f`, dropping `f` seems like an easy
choice.


In [13]:
drop_list = ['f']

What about `cap-shape`? There are two common choices:

- drop the most populous column
- drop a column with little definition (the 'other' column)

You may remember from the data dictionary that `cap-shape` is encoded
with these values:

```text
2. cap-shape (n): bell=b, conical=c, convex=x, flat=f,
sunken=s, spherical=p, others=o
```

Observe the `value_counts` for `cap-shape`.


In [14]:
display(
    X_subset['cap-shape'].value_counts(),
    X_subset['cap-shape'].value_counts().shape)

cap-shape
x    26934
f    13404
s     7164
b     5694
o     3460
p     2598
c     1815
Name: count, dtype: int64

(7,)

It's certainly a judgement call, but dropping `o` for 'other' is among
the reasonable choices.


In [15]:
drop_list.append('o')
drop_list

['f', 'o']

### Recode target values

```Text
One binary class divided in
edible=e and poisonous=p
(with the latter one also containing
mushrooms of unknown edibility)
```

Let's recode these values as `edible=0` and `poisonous=1`.


In [16]:
y.sample(5)

Unnamed: 0,class
44728,p
56110,e
50437,p
46719,e
40934,p


In [17]:
# work on a copy so the assignment does not touch the fetched object
y = y.copy()
y['poisonous'] = (y['class'] == 'p').astype(int)

y[['poisonous']].sample(5)

Unnamed: 0,poisonous
49567,0
49149,0
4241,0
31359,0
39105,1


## Split dataset: Read carefully

For the purposes of this demonstration, let's assume that you've already
gone through the process of train test splitting, cross validation and
training and scoring multiple models and have found the very best one
already. You've taken all of your data and trained your model in
preparation for deployment.

We are using `train_test_split` in order to have a few datapoints with
which to predict. This would simulate unseen data provided 'in the wild'.


In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    X_subset,
    y['poisonous'],
    test_size=0.20,
    stratify=y['poisonous'],
    random_state=42)
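
The `stratify` argument keeps the class balance the same in both splits. A small sketch with a hypothetical 60/40 target shows the effect; the toy data and variable names are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 60 poisonous (1), 40 edible (0)
toy_y = pd.Series([1] * 60 + [0] * 40, name="poisonous")
toy_X = pd.DataFrame({"cap-diameter": np.arange(100, dtype=float)})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.20, stratify=toy_y, random_state=42)

# The share of poisonous rows should match across full, train and test sets
print(toy_y.mean(), y_tr.mean(), y_te.mean())
```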

## Create model pipeline


We want to process the categorical variables using sklearn's
`OneHotEncoder`, and we want to scale the quantitative fields using
`StandardScaler`.

We can use `ColumnTransformer` to help us assign the appropriate
behavior to the appropriate columns.

- [ColumnTransformer — scikit-learn 1.5.0
  documentation](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)


In [19]:
# make lists of categorical and continuous columns
continuous_cols = ['cap-diameter', 'stem-height', 'stem-width']
categorical_cols = ['has-ring', 'cap-shape']

In [20]:
# define transformers for categorical and continuous
categorical_transformer = OneHotEncoder(
    handle_unknown='ignore',
    drop=drop_list)  # drop the values decided earlier

continuous_transformer = StandardScaler()
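
For reference, `StandardScaler` replaces each value with its z-score, computed from the column mean and the population standard deviation of the data it was fitted on. The sketch below checks that against a manual calculation on a few made-up numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single hypothetical column of cap diameters
values = np.array([[4.1], [13.99], [7.58], [6.81]])

scaled = StandardScaler().fit_transform(values)

# Manual z-score; NumPy's std defaults to ddof=0, which matches the scaler
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```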

In [21]:
# ColumnTransformer expects a list of tuples
# name_string, transformer_instance, list_of_columns

preprocessor = ColumnTransformer(
    transformers=[
        ('continuous', continuous_transformer, continuous_cols),
        ('categorical', categorical_transformer, categorical_cols)
    ]
)

In [22]:
preprocessor

In [23]:
# pipe = Pipeline([
#     ('preprocessor', preprocessor),
#     # ('logistic_regression', LogisticRegression)
# ])
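
The commented-out cell hints at an alternative this notebook does not pursue: bundling the preprocessor and the model into one `Pipeline`, so a single object handles both transforming and predicting. Below is a minimal sketch on toy data with hypothetical values. Note that the estimator has to be passed as an instance, `LogisticRegression()`, not as the class.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values), one continuous and one categorical column
toy_X = pd.DataFrame({
    "cap-diameter": [4.1, 13.99, 7.58, 6.81, 3.0, 9.2],
    "has-ring": ["t", "f", "f", "t", "f", "t"],
})
toy_y = [1, 0, 0, 1, 0, 1]

toy_preprocessor = ColumnTransformer(transformers=[
    ("continuous", StandardScaler(), ["cap-diameter"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["has-ring"]),
])

# Each step is a (name, instance) tuple; the last step is the estimator
pipe = Pipeline([
    ("preprocessor", toy_preprocessor),
    ("logistic_regression", LogisticRegression()),
])

# fit() runs the preprocessing and the model training in one call
pipe.fit(toy_X, toy_y)
toy_preds = pipe.predict(toy_X)
print(toy_preds)
```

With this layout, only the fitted `pipe` would need to be saved. The rest of the notebook keeps the preprocessor and the model as separate objects instead.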

## Transform training data & fit model


In [24]:
transformed_data = preprocessor.fit_transform(X_train)

### Get Names Post-Transform (side-quest)

This is not strictly necessary, but it's an interesting side-quest that
doesn't take too long.


- [Get column name after fitting the machine learning pipeline | by
  Yannawut Kimnaruk |
  Medium](https://yannawut.medium.com/get-column-name-after-fitting-the-machine-learning-pipeline-145a2a8051cc#:~:text=You%20can%20get%20the%20columns,changed%20columns'%20names%20using%20get_feature_names.)
- [Get Feature Names Out](https://stackoverflow.com/a/56339153)


In [25]:
# transformed data is returned as a numpy array
# it can be viewed as a dataframe
transformed_train_df = pd.DataFrame(transformed_data)
transformed_train_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.195867,-0.104756,-0.639647,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.039043,-0.801696,-0.439222,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.313735,1.253535,0.218893,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,-0.61643,-0.73645,-0.574833,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,-0.654319,-0.81949,-0.613722,0.0,0.0,0.0,0.0,0.0,0.0,1.0


Adding names to the DataFrame takes a few steps.


In [26]:
# pipe.named_steps['preprocessor']
# pipe.named_steps['preprocessor'].named_transformers_['categorical']
# pipe.named_steps['preprocessor'].named_transformers_['categorical'].categories_
post_cat_names = (
    preprocessor
    .named_transformers_['categorical']
    .get_feature_names_out())

post_cat_names

array(['has-ring_t', 'cap-shape_b', 'cap-shape_c', 'cap-shape_f',
       'cap-shape_p', 'cap-shape_s', 'cap-shape_x'], dtype=object)

In [27]:
continuous_col_names = (
    preprocessor
    .named_transformers_['continuous']
    .get_feature_names_out())

display(continuous_col_names)

post_continuous = [f"{x}_z" for x in continuous_col_names]
display(post_continuous)

array(['cap-diameter', 'stem-height', 'stem-width'], dtype=object)

['cap-diameter_z', 'stem-height_z', 'stem-width_z']

In [28]:
post_col_names = post_continuous + list(post_cat_names)
post_col_names

['cap-diameter_z',
 'stem-height_z',
 'stem-width_z',
 'has-ring_t',
 'cap-shape_b',
 'cap-shape_c',
 'cap-shape_f',
 'cap-shape_p',
 'cap-shape_s',
 'cap-shape_x']

In [29]:
new_names_dict = {
    k: v for k, v in zip(transformed_train_df.columns, post_col_names)
}

In [30]:
transformed_train_df = transformed_train_df.rename(
    columns=new_names_dict,
)

transformed_train_df.head(5)

Unnamed: 0,cap-diameter_z,stem-height_z,stem-width_z,has-ring_t,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_p,cap-shape_s,cap-shape_x
0,-0.195867,-0.104756,-0.639647,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.039043,-0.801696,-0.439222,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.313735,1.253535,0.218893,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,-0.61643,-0.73645,-0.574833,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,-0.654319,-0.81949,-0.613722,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [31]:
# observe that the first few values of has-ring_t are the same as below
has_ring_true = X_train['has-ring'].apply(lambda x: 1 if x == 't' else 0)
has_ring_true[:5]

36219    1
54188    0
1609     1
20481    0
20546    0
Name: has-ring, dtype: int64

In [32]:
lr_model = LogisticRegression().fit(transformed_data, y_train)
lr_model

In [33]:
X_test_transformed = preprocessor.transform(X_test)

y_pred = lr_model.predict(X_test_transformed)

print(
    f"accuracy: {accuracy_score(y_test, y_pred=y_pred)}",
    f"f1: {f1_score(y_test, y_pred=y_pred)}",
    sep="\n"
)

accuracy: 0.6129850990666448
f1: 0.6874297427759043
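
To make the two scores concrete, this sketch recomputes them by hand from a confusion matrix on a handful of made-up labels. Accuracy is the share of correct predictions; F1 is the harmonic mean of precision and recall for the positive (poisonous) class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels: 1 = poisonous, 0 = edible
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

print(manual_accuracy, accuracy_score(toy_true, toy_pred))
print(manual_f1, f1_score(toy_true, toy_pred))
```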


## Preserve model with `joblib`

Caveats:

- You need the same environment (versions of Python, scikit-learn,
  etc.) when you load the file as when you saved it
- Files are binary and not human-readable

It may be useful to preserve a few different artifacts from this
notebook, so a Python dictionary seems a good choice.


In [34]:
# choose artifacts
# you don't have to store these as a dictionary
# that's just for this demonstration
artifacts = {
    'preprocessor': preprocessor,
    'lr_model': lr_model,
    'X_test': X_test,
    'y_test': y_test,
    'post_col_names': post_col_names
}
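
One way to soften the environment caveat is to record library versions in the same dictionary, so a mismatch can at least be detected when loading. This is only a sketch: the `environment` key and the `versions_match` helper are hypothetical names, not part of the notebook or of any library.

```python
import platform

import joblib
import sklearn

# Versions of the pieces that matter most when unpickling a model
environment_info = {
    "python": platform.python_version(),
    "scikit-learn": sklearn.__version__,
    "joblib": joblib.__version__,
}

# Hypothetical: store this under an extra key next to the other artifacts
toy_artifacts = {"environment": environment_info}


def versions_match(saved: dict) -> bool:
    """Return True when the running scikit-learn matches the saved one."""
    return saved["scikit-learn"] == sklearn.__version__


print(versions_match(toy_artifacts["environment"]))
```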

In [35]:
# path to where you want artifacts stored
path = '../models/artifacts.joblib'

In [36]:
# write artifacts to file
with open(path, "wb") as f:
    joblib.dump(artifacts, f, protocol=5)
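
The write above assumes the target directory already exists. A small sketch of creating it first is shown below; it uses a temporary directory and a toy dictionary so it can run anywhere. `joblib.dump` and `joblib.load` also accept a path directly, which avoids the explicit `open` call.

```python
import tempfile
from pathlib import Path

import joblib

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for '../models/artifacts.joblib'
    toy_path = Path(tmp) / "models" / "artifacts.joblib"

    # Create the parent directory if it is missing
    toy_path.parent.mkdir(parents=True, exist_ok=True)

    # Passing a path lets joblib open and close the file itself
    joblib.dump({"post_col_names": ["cap-diameter_z"]}, toy_path)
    file_exists = toy_path.exists()
    restored = joblib.load(toy_path)

print(file_exists, restored)
```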

## Confirm model preservation

Read back the stored model. It should work as it did before without any
issues.

In [37]:
with open(path, "rb") as f:
    # load from the open file handle rather than reopening the path
    recovered = joblib.load(f)

In [38]:
print(type(recovered))
print(recovered.keys())

<class 'dict'>
dict_keys(['preprocessor', 'lr_model', 'X_test', 'y_test', 'post_col_names'])


In [39]:
recovered_X = recovered['X_test']
recovered_X.head()

Unnamed: 0,cap-diameter,stem-height,stem-width,has-ring,cap-shape
49474,5.19,7.05,15.79,f,x
22798,6.84,5.03,12.98,f,s
60027,10.44,4.58,25.92,f,o
35232,3.9,7.5,8.21,t,x
42968,10.76,11.26,17.32,t,p


In [40]:
X_transformed = recovered['preprocessor'].transform(recovered_X)

In [41]:
preds = recovered['lr_model'].predict(X_transformed)

In [42]:
human_readable = [
    'poisonous' if x == 1 else 'edible' for x in list(preds)]

human_readable[:5]

['edible', 'poisonous', 'poisonous', 'poisonous', 'edible']

In [43]:
print(
    accuracy_score(
        y_true=recovered['y_test'],
        y_pred=preds),
    f1_score(y_true=recovered['y_test'],
             y_pred=preds),
    sep="\n")

0.6129850990666448
0.6874297427759043
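
Matching scores are good evidence that the round trip worked. A stricter check is to compare predictions element by element, sketched here with a toy model trained on random data (an assumption made so the example is self-contained).

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: three random features, target derived from the first two
rng = np.random.default_rng(42)
toy_X = rng.normal(size=(50, 3))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 0).astype(int)

toy_model = LogisticRegression().fit(toy_X, toy_y)
before = toy_model.predict(toy_X)

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_model.joblib"
    joblib.dump(toy_model, toy_path)
    # Predict with the reloaded model on the same inputs
    after = joblib.load(toy_path).predict(toy_X)

# The reloaded model should reproduce every prediction exactly
print(np.array_equal(before, after))
```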


## Conclusion

You now have a stored version of the model in the `models` directory. The
model and a few other artifacts are stored as a dictionary in a binary
format.

## References

- [9. Model persistence — scikit-learn 1.5.0
  documentation](https://scikit-learn.org/stable/model_persistence.html)
