In [19]:
import pandas as pd
import numpy as np

## Mushroom dataset

In [20]:
SHROOMS_DATASET_PATH = "datasets/mushrooms.csv"

In [21]:
df_shrooms = pd.read_csv(SHROOMS_DATASET_PATH)
target_column = "class"

X = df_shrooms.drop(columns=[target_column])
y = df_shrooms[target_column]

Inspect data frame.

In [22]:
df_shrooms.head()

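| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |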


Check for missing values.

In [23]:
df_shrooms.isna().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

Check how frequently each value occurs within every feature.

In [24]:
for var in df_shrooms.columns: 
    print(df_shrooms[var].value_counts())

class
e    4208
p    3916
Name: count, dtype: int64
cap-shape
x    3656
f    3152
k     828
b     452
s      32
c       4
Name: count, dtype: int64
cap-surface
y    3244
s    2556
f    2320
g       4
Name: count, dtype: int64
cap-color
n    2284
g    1840
e    1500
y    1072
w    1040
b     168
p     144
c      44
u      16
r      16
Name: count, dtype: int64
bruises
f    4748
t    3376
Name: count, dtype: int64
odor
n    3528
f    2160
s     576
y     576
a     400
l     400
p     256
c     192
m      36
Name: count, dtype: int64
gill-attachment
f    7914
a     210
Name: count, dtype: int64
gill-spacing
c    6812
w    1312
Name: count, dtype: int64
gill-size
b    5612
n    2512
Name: count, dtype: int64
gill-color
b    1728
p    1492
w    1202
n    1048
g     752
h     732
u     492
k     408
e      96
y      86
o      64
r      24
Name: count, dtype: int64
stalk-shape
t    4608
e    3516
Name: count, dtype: int64
stalk-root
b    3776
?    2480
e    1120
c     556
r     192
Name: coun

**Conclusion**: `stalk-root` contains missing values. They are encoded as `?` (2480 rows), which is why `isna()` reported zero for this column.
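The mismatch between `isna()` and `value_counts()` can be reproduced on a small example. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the dataset); it shows that a string placeholder such as `?` has to be counted explicitly.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_shrooms; values are illustrative only.
toy = pd.DataFrame({
    "stalk-root": ["b", "?", "e", "?"],
    "odor": ["n", "f", "n", "a"],
})

# isna() finds nothing, because '?' is an ordinary string, not NaN.
na_counts = toy.isna().sum()

# Counting the placeholder directly reveals the hidden gaps per column.
placeholder_counts = (toy == "?").sum()

print(na_counts)
print(placeholder_counts)
```

On the real data the same comparison would be `(df_shrooms == "?").sum()`.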

In [25]:
df_shrooms['stalk-root'].value_counts()

stalk-root
b    3776
?    2480
e    1120
c     556
r     192
Name: count, dtype: int64

Replace the missing values with the most frequent existing value (the mode).

In [26]:
# Treat '?' as missing, then fill stalk-root with its mode (plain assignment avoids the chained inplace fillna).
df_shrooms = df_shrooms.replace('?', pd.NA)
df_shrooms['stalk-root'] = df_shrooms['stalk-root'].fillna(df_shrooms['stalk-root'].mode()[0])


In [27]:
df_shrooms['stalk-root'].value_counts()

stalk-root
b    6256
e    1120
c     556
r     192
Name: count, dtype: int64

Check the share of edible mushrooms among all samples.

In [28]:
value_counts = df_shrooms['class'].value_counts()
count_poisonous = value_counts['p']
count_edible = value_counts['e']
count_edible / (count_edible + count_poisonous)

np.float64(0.517971442639094)

**Conclusion**: the `class` column is roughly balanced (about 52% edible, 48% poisonous), so class frequency alone is unlikely to bias a learning algorithm toward one class.
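The same balance check can be written as normalized shares for both classes at once. A minimal sketch, using the class counts printed earlier in this notebook (4208 edible, 3916 poisonous); the names `counts` and `shares` are introduced here for illustration.

```python
import pandas as pd

# Class counts copied from the value_counts() output earlier in the notebook.
counts = pd.Series({"e": 4208, "p": 3916}, name="count")

# Share of each class among all samples.
shares = counts / counts.sum()

print(shares.round(3))
```

On the data frame itself, `df_shrooms['class'].value_counts(normalize=True)` should give the same shares directly.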

### Features with Dominant Values:
`veil-type`: The value p appears in all samples, meaning this feature provides no informative value and can be dropped.


`gill-attachment`: The value f dominates (~97%), suggesting it has low utility for classification.

### Features with Higher Variability:
`odor`: The distribution is more varied. Odor n (no odor) is the most common (3528 samples), but other values such as f, s, y, or a might be critical for distinguishing between edible and poisonous mushrooms.


`cap-color` and `gill-color`: These have many unique values and could be valuable features for classification.


`population`: The distribution indicates diversity, with values like v, y, and s being particularly prominent.

`odor`, `gill-color`, `cap-color`, `population` are likely to have the highest predictive value.

<sub><sup>This summary was generated by ChatGPT.</sup></sub>
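The claims about dominant values can be checked numerically rather than read off the raw counts. The sketch below uses a hypothetical toy frame (`toy`), and the 0.95 cut-off is an arbitrary assumption for illustration, not a value from this notebook.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are illustrative.
toy = pd.DataFrame({
    "veil-type": ["p", "p", "p", "p"],
    "gill-attachment": ["f", "f", "f", "a"],
    "odor": ["n", "f", "s", "a"],
})

# Share of rows taken by the most frequent value in each column.
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Columns where one value covers at least 95% of rows (assumed threshold).
dominated = top_share[top_share >= 0.95].index.tolist()

print(top_share)
print(dominated)
```

Applied to `df_shrooms`, a column such as `veil-type` would show a top share of 1.0, which matches the observation that it carries no information.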

Check for high cardinality.

In [29]:
for var in df_shrooms.columns:
    print(var, ' contains ', len(df_shrooms[var].unique()), ' labels')

class  contains  2  labels
cap-shape  contains  6  labels
cap-surface  contains  4  labels
cap-color  contains  10  labels
bruises  contains  2  labels
odor  contains  9  labels
gill-attachment  contains  2  labels
gill-spacing  contains  2  labels
gill-size  contains  2  labels
gill-color  contains  12  labels
stalk-shape  contains  2  labels
stalk-root  contains  4  labels
stalk-surface-above-ring  contains  4  labels
stalk-surface-below-ring  contains  4  labels
stalk-color-above-ring  contains  9  labels
stalk-color-below-ring  contains  9  labels
veil-type  contains  1  labels
veil-color  contains  4  labels
ring-number  contains  3  labels
ring-type  contains  5  labels
spore-print-color  contains  9  labels
population  contains  6  labels
habitat  contains  7  labels
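A more compact way to get the same label counts is `DataFrame.nunique()`, which returns one number per column and can be sorted. A minimal sketch on a hypothetical toy frame (values are illustrative); note that `nunique()` skips missing values by default, unlike `unique()`.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "gill-color": ["k", "n", "g", "p"],
    "bruises": ["t", "f", "t", "t"],
    "veil-type": ["p", "p", "p", "p"],
})

# Number of distinct labels per column, highest cardinality first.
cardinality = toy.nunique().sort_values(ascending=False)

print(cardinality)
```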


Let's select a subset of features that would be most valuable: `odor`, `gill-color`, `cap-color`, `population`, `cap-shape`, `cap-surface`, `ring-number`, `habitat`, `bruises`
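Selecting that subset could look like the sketch below. The toy frame reuses the first three rows shown by `head()` for the relevant columns; `selected`, `X_sel`, and `y_sel` are names introduced here for illustration, and the notebook may organise this step differently.

```python
import pandas as pd

# Feature subset named in the text above.
selected = [
    "odor", "gill-color", "cap-color", "population", "cap-shape",
    "cap-surface", "ring-number", "habitat", "bruises",
]

# Toy frame built from the first three rows of the head() output.
toy = pd.DataFrame({
    "class": ["p", "e", "e"],
    "cap-shape": ["x", "x", "b"],
    "cap-surface": ["s", "s", "s"],
    "cap-color": ["n", "y", "w"],
    "bruises": ["t", "t", "t"],
    "odor": ["p", "a", "l"],
    "gill-color": ["k", "k", "n"],
    "veil-type": ["p", "p", "p"],
    "ring-number": ["o", "o", "o"],
    "population": ["s", "n", "n"],
    "habitat": ["u", "g", "m"],
})

# Keep only the selected features; the target stays separate.
X_sel = toy[selected]
y_sel = toy["class"]

print(X_sel.shape)
```

With the real data, `toy` would be replaced by `df_shrooms` after the `stalk-root` imputation.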


## Iris dataset

In [30]:
from sklearn.datasets import load_iris

In [31]:
iris = load_iris(as_frame=True)
iris.frame.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


Check for missing values.

In [32]:
iris.frame.isna().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

The Iris dataset contains only continuous numerical features, so counting occurrences of individual values is not informative. Checking for missing values is enough here.
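For numerical features, summary statistics play the role that value counts played for the categorical mushroom data. A minimal sketch using `DataFrame.describe()` on two columns from the five rows shown by `head()` (the toy frame `toy` is an illustration, not the full dataset):

```python
import pandas as pd

# First five rows of two Iris columns, copied from the head() output.
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0],
    "petal length (cm)": [1.4, 1.4, 1.3, 1.5, 1.4],
})

# count, mean, std, min, quartiles and max for each numeric column.
summary = toy.describe()

print(summary.loc[["min", "mean", "max"]])
```

On the full data this would be `iris.frame.describe()`.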

No columns are dropped, since the dataset has only four features and all of them are informative.