#**Kaggle**
##Mushroom Classification - Which features are most indicative of a poisonous mushroom?
https://www.kaggle.com/datasets/uciml/mushroom-classification/data

About this file

Attribute Information: (classes: edible=e, poisonous=p)

cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s

cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s

cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y

bruises: bruises=t,no=f

odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

gill-attachment: attached=a,descending=d,free=f,notched=n

gill-spacing: close=c,crowded=w,distant=d

gill-size: broad=b,narrow=n

gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y

stalk-shape: enlarging=e,tapering=t

stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?

stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s

stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s

stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

veil-type: partial=p,universal=u

veil-color: brown=n,orange=o,white=w,yellow=y

ring-number: none=n,one=o,two=t

ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z

spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y

population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y

habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

###**Data**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

In [None]:
data = pd.read_csv('mushrooms.csv')
data.head(10)

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
5,e,x,y,y,t,a,f,c,b,n,...,s,w,w,p,w,o,p,k,n,g
6,e,b,s,w,t,a,f,c,b,g,...,s,w,w,p,w,o,p,k,n,m
7,e,b,y,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,s,m
8,p,x,y,w,t,p,f,c,n,p,...,s,w,w,p,w,o,p,k,v,g
9,e,b,s,y,t,a,f,c,b,g,...,s,w,w,p,w,o,p,k,s,m


###**Data Understanding**

In [None]:
#checking the shape of the data
data.shape

(8124, 23)

In [None]:
missing_values = data.isnull().sum()
print("Missing values:\n", missing_values)

Missing values:
 class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64


This cell counts the missing (null) values in each column of `data`. `isnull()` returns a boolean DataFrame of the same shape, with `True` wherever a value is missing, and `sum()` adds up the `True` entries per column. Every count is 0, so no column contains null entries.
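One caveat: the attribute list above encodes missing `stalk-root` values as `?`, and `isnull()` treats that as an ordinary string. Below is a minimal sketch of how such placeholders could be counted. The frame `toy` and its values are made up for illustration and are not rows from `mushrooms.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "stalk-root": ["b", "?", "c", "?"],
})

# isnull() sees "?" as a regular string, so it reports no missing values.
null_counts = toy.isnull().sum()

# Counting the placeholder directly exposes the hidden gaps.
placeholder_counts = (toy == "?").sum()

# Alternatively, convert "?" to NaN so the usual pandas tools apply.
cleaned_counts = toy.replace("?", np.nan).isnull().sum()

print(null_counts["stalk-root"], placeholder_counts["stalk-root"], cleaned_counts["stalk-root"])
```

On the real file, the same check would be `(data == '?').sum()`. Passing `na_values='?'` to `pd.read_csv` is another option, assuming `?` is never a legitimate category code in any column.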

In [None]:
# Check the number of unique values in each column
data.nunique()

Unnamed: 0,0
class,2
cap-shape,6
cap-surface,4
cap-color,10
bruises,2
odor,9
gill-attachment,2
gill-spacing,2
gill-size,2
gill-color,12


In [None]:
# Count unique values for each categorical column
categorical_cols = data.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"Unique values in {col}:")
    print(data[col].value_counts())
    print("\n")

Unique values in class:
class
e    4208
p    3916
Name: count, dtype: int64


Unique values in cap-shape:
cap-shape
x    3656
f    3152
k     828
b     452
s      32
c       4
Name: count, dtype: int64


Unique values in cap-surface:
cap-surface
y    3244
s    2556
f    2320
g       4
Name: count, dtype: int64


Unique values in cap-color:
cap-color
n    2284
g    1840
e    1500
y    1072
w    1040
b     168
p     144
c      44
u      16
r      16
Name: count, dtype: int64


Unique values in bruises:
bruises
f    4748
t    3376
Name: count, dtype: int64


Unique values in odor:
odor
n    3528
f    2160
y     576
s     576
a     400
l     400
p     256
c     192
m      36
Name: count, dtype: int64


Unique values in gill-attachment:
gill-attachment
f    7914
a     210
Name: count, dtype: int64


Unique values in gill-spacing:
gill-spacing
c    6812
w    1312
Name: count, dtype: int64


Unique values in gill-size:
gill-size
b    5612
n    2512
Name: count, dtype: int64


Unique values in

This cell inspects the categorical columns of `data`, that is, those with dtype `object`. `select_dtypes(include=['object'])` selects the string-valued columns, and the loop prints `value_counts()` for each one: every category together with the number of rows in which it appears. This shows how the categories are distributed, including rare ones such as the conical cap shape (`c`), which occurs only 4 times.
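Because the goal is to find features that separate edible from poisonous mushrooms, the per-column counts can also be split by class. Here is a minimal sketch with `pd.crosstab`. The frame `toy` is made up for illustration, so its counts say nothing about the real dataset.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "class": ["p", "p", "e", "e", "e", "p"],
    "odor":  ["f", "f", "n", "n", "a", "n"],
})

# Rows are odor categories, columns are classes, cells are row counts.
counts = pd.crosstab(toy["odor"], toy["class"])

# normalize="index" turns each row into within-category class shares.
shares = pd.crosstab(toy["odor"], toy["class"], normalize="index")

print(counts)
print(shares.round(2))
```

A category whose row is concentrated in a single class is a strong signal for that class. Applied to the notebook's frame, the call would be `pd.crosstab(data['odor'], data['class'])`.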

In [None]:
print('Unique values in each column:')
for i in categorical_cols:
  amount = 0
  for j in data[i].unique():
    amount +=1
  print(f'{i}: {amount}')

Unique values in each column:
class: 2
cap-shape: 6
cap-surface: 4
cap-color: 10
bruises: 2
odor: 9
gill-attachment: 2
gill-spacing: 2
gill-size: 2
gill-color: 12
stalk-shape: 2
stalk-root: 5
stalk-surface-above-ring: 4
stalk-surface-below-ring: 4
stalk-color-above-ring: 9
stalk-color-below-ring: 9
veil-type: 1
veil-color: 4
ring-number: 3
ring-type: 5
spore-print-color: 9
population: 6
habitat: 7


This cell prints the number of unique values in each categorical column of `data`. For each column in `categorical_cols`, it iterates over `data[i].unique()`, increments `amount` once per value, and then prints the column name with the final count. Note that `veil-type` has only one value, so it is constant across the dataset and cannot help distinguish the classes.
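The same counts can be obtained without an explicit loop. Below is a minimal sketch using `nunique()`, which also picks out single-valued columns. The frame `toy` and the variable names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "veil-type": ["p", "p", "p", "p"],
    "ring-number": ["o", "t", "o", "n"],
})

# nunique() returns the number of distinct non-null values per column.
unique_counts = toy.nunique()

# Columns with a single distinct value are constant across all rows.
constant_cols = unique_counts[unique_counts == 1].index.tolist()

print(unique_counts.to_dict())
print(constant_cols)
```

Unlike the manual loop over `unique()`, `nunique()` ignores NaN by default. The two agree here because the frame contains no null entries.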

In [None]:
duplicates = data.duplicated()
print("Duplicate rows:")
print(data[duplicates])

Duplicate rows:
Empty DataFrame
Columns: [class, cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing, gill-size, gill-color, stalk-shape, stalk-root, stalk-surface-above-ring, stalk-surface-below-ring, stalk-color-above-ring, stalk-color-below-ring, veil-type, veil-color, ring-number, ring-type, spore-print-color, population, habitat]
Index: []

[0 rows x 23 columns]


In [None]:
# There are no duplicate rows

This cell finds and displays the duplicate rows in `data`. `duplicated()` returns a boolean Series in which `True` marks a row that repeats an earlier row and `False` marks a first occurrence. Indexing with `data[duplicates]` keeps only the rows flagged `True`. The result is an empty DataFrame, so the dataset contains no repeated entries.
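By default `duplicated()` flags only the later repeats, not the first occurrence. The sketch below contrasts this with `keep=False` and shows a quick count. The frame `toy` is made up so that it contains one repeated row.

```python
import pandas as pd

# Toy frame with one repeated row (index 2 repeats index 0).
toy = pd.DataFrame({
    "class": ["p", "e", "p", "e"],
    "odor":  ["f", "n", "f", "a"],
})

# Default keep="first": only the later repeat is flagged.
later_repeats = toy.duplicated()

# keep=False flags every member of a duplicated group.
all_members = toy.duplicated(keep=False)

# Summing the boolean Series gives the number of redundant rows.
n_duplicates = int(later_repeats.sum())

# drop_duplicates() removes the rows flagged by the default setting.
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```

For a dataset without duplicates, `data.duplicated().sum()` returning 0 is a more compact check than printing an empty DataFrame.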

In [None]:
# Initialize an empty list to store feature information
feature_info = []

# Loop through each column to gather information
for feature in data.columns:
    # Check if the feature is categorical or numerical
    if data[feature].dtype == 'object':
        feature_type = 'Categorical'
        values = ', '.join(data[feature].unique())  # List unique categories
    else:
        feature_type = 'Numerical'
        values = f'{data[feature].min()} to {data[feature].max()}'  # Min and Max range

    # Count missing values
    missing_values = data[feature].isnull().sum()

    # Check for outliers in numerical features (using IQR for example)
    if feature_type == 'Numerical':
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = data[(data[feature] < lower_bound) | (data[feature] > upper_bound)].shape[0]
    else:
        outliers = 'N/A'  # Outliers are not applicable for categorical features

    # Append the feature info to the list
    feature_info.append([feature, feature_type, values, missing_values, outliers])

# Convert the feature_info list into a DataFrame
feature_table = pd.DataFrame(feature_info, columns=['Feature', 'Type', 'Values/Range', 'Missing Values', 'Outliers'])

# Display the feature table
print(feature_table)


                     Feature         Type                        Values/Range  \
0                      class  Categorical                                p, e   
1                  cap-shape  Categorical                    x, b, s, f, k, c   
2                cap-surface  Categorical                          s, y, f, g   
3                  cap-color  Categorical        n, y, w, g, e, p, b, u, c, r   
4                    bruises  Categorical                                t, f   
5                       odor  Categorical           p, a, l, n, f, c, y, s, m   
6            gill-attachment  Categorical                                f, a   
7               gill-spacing  Categorical                                c, w   
8                  gill-size  Categorical                                n, b   
9                 gill-color  Categorical  k, n, g, p, w, h, u, e, b, r, y, o   
10               stalk-shape  Categorical                                e, t   
11                stalk-root

In [None]:
#Check if the dataset is imbalanced in terms of the target variable
data['class'].value_counts()

class,count
e,4208
p,3916


The target classes are distributed as follows:

- **Edible (e)**: 4208 instances (51.8%)
- **Poisonous (p)**: 3916 instances (48.2%)

The two classes differ by only 292 instances, so the imbalance is **not severe** and should not be a major issue for a decision tree model. It is still worth keeping in mind, because large imbalances can distort model performance for some classification algorithms.
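The balance can be quantified directly from the two counts. In the sketch below, the Series `target` is rebuilt from the reported counts rather than read from the file, and `imbalance_ratio` and `majority_baseline` are illustrative names.

```python
import pandas as pd

# Rebuild the target column from the counts reported above.
target = pd.Series(["e"] * 4208 + ["p"] * 3916, name="class")

counts = target.value_counts()
shares = target.value_counts(normalize=True)

# Ratio of the larger class to the smaller one; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

# Accuracy of always predicting the majority class: a floor any model should beat.
majority_baseline = shares.max()

print(shares.round(3).to_dict())
print(round(imbalance_ratio, 3), round(majority_baseline, 3))
```

A ratio this close to 1 means plain accuracy remains a meaningful metric. If the proportions should be preserved when splitting, `train_test_split` accepts a `stratify` argument for that purpose.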