The data comes from https://www.kaggle.com/datasets/uciml/mushroom-classification.
First, the data is imported into a pandas DataFrame so that it can be processed with various Python packages.

In [None]:
import pandas as pd
dataset = pd.read_csv("mushrooms.csv")
print(dataset)

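The column list contains the target column, class, and 22 feature columns covering the cap, odor, gills, stalk, veil, ring, spore print, population and habitat, so each sample is described in detail.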

In [9]:
columns = list(dataset.columns)
print(columns)

['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']


All of the columns use single-letter shorthand for their values. A key explaining each letter is available, but it is hard to tell at a glance what the data means. It is also worth noting that all samples of unknown edibility were put in the poisonous category. Whilst mushrooms of unknown edibility should not be eaten, putting them in the poisonous category rather than in a separate category means that the labels are not fully accurate. It is not known how many such samples there are, or whether they artificially inflate the poisonous class.
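The shorthand can be made easier to read by mapping letters to names. The following is a minimal sketch, not part of the original notebook: the `CODE_NAMES` table and the `decode` helper are hypothetical, the letter meanings are taken from the dataset description and cover only two columns, and the three sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical lookup tables; only two columns are covered here.
CODE_NAMES = {
    "class": {"e": "edible", "p": "poisonous"},
    "odor": {
        "a": "almond", "l": "anise", "c": "creosote", "y": "fishy", "f": "foul",
        "m": "musty", "n": "none", "p": "pungent", "s": "spicy",
    },
}


def decode(frame, code_names=CODE_NAMES):
    """Return a copy of frame with known letter codes replaced by names."""
    readable = frame.copy()
    for column, names in code_names.items():
        if column in readable.columns:
            # Letters without an entry in the table are left unchanged.
            readable[column] = readable[column].map(names).fillna(readable[column])
    return readable


# Made-up rows standing in for the raw CSV
sample = pd.DataFrame({
    "class": ["p", "e", "e"],
    "odor": ["p", "a", "l"],
    "cap-shape": ["x", "x", "b"],
})
decoded = decode(sample)
print(decoded)
```

Columns without a lookup table, such as cap-shape here, pass through untouched, so the table can be filled in gradually.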

In [10]:
for column in columns:
    print(dataset[column].value_counts())

class
e    4208
p    3916
Name: count, dtype: int64
cap-shape
x    3656
f    3152
k     828
b     452
s      32
c       4
Name: count, dtype: int64
cap-surface
y    3244
s    2556
f    2320
g       4
Name: count, dtype: int64
cap-color
n    2284
g    1840
e    1500
y    1072
w    1040
b     168
p     144
c      44
u      16
r      16
Name: count, dtype: int64
bruises
f    4748
t    3376
Name: count, dtype: int64
odor
n    3528
f    2160
s     576
y     576
a     400
l     400
p     256
c     192
m      36
Name: count, dtype: int64
gill-attachment
f    7914
a     210
Name: count, dtype: int64
gill-spacing
c    6812
w    1312
Name: count, dtype: int64
gill-size
b    5612
n    2512
Name: count, dtype: int64
gill-color
b    1728
p    1492
w    1202
n    1048
g     752
h     732
u     492
k     408
e      96
y      86
o      64
r      24
Name: count, dtype: int64
stalk-shape
t    4608
e    3516
Name: count, dtype: int64
stalk-root
b    3776
?    2480
e    1120
c     556
r     192
Name: coun

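To train a classification model, the string values must first be encoded as numbers. A separate LabelEncoder is fitted to each column, so within a column the same letter always maps to the same integer, but the same letter can map to different integers in different columns.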

In [12]:
from sklearn.preprocessing import LabelEncoder

for column in columns:
    dataset[column] = LabelEncoder().fit_transform(dataset[column])
print(dataset)

      class  cap-shape  cap-surface  cap-color  bruises  odor  \
0         1          5            2          4        1     6   
1         0          5            2          9        1     0   
2         0          0            2          8        1     3   
3         1          5            3          8        1     6   
4         0          5            2          3        0     5   
...     ...        ...          ...        ...      ...   ...   
8119      0          3            2          4        0     5   
8120      0          5            2          4        0     5   
8121      0          2            2          4        0     5   
8122      1          3            3          4        0     8   
8123      0          5            2          4        0     5   

      gill-attachment  gill-spacing  gill-size  gill-color  ...  \
0                   1             0          1           4  ...   
1                   1             0          0           4  ...   
2                 
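Because the loop above discards each encoder after use, the letter-to-integer mapping is not kept. The sketch below shows one way to keep and inspect it, assuming a tiny made-up frame rather than the real dataset; the `encoders` and `mappings` dictionaries are names introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up frame, not the real dataset
toy = pd.DataFrame({
    "cap-surface": ["s", "f", "s"],
    "odor": ["s", "n", "f"],
})

encoders = {}
mappings = {}
for column in toy.columns:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder
    # classes_ is sorted, so the letter at position i is encoded as i
    mappings[column] = {str(letter): code for code, letter in enumerate(encoder.classes_)}

print(mappings)
# Encoded values can be turned back into letters with the stored encoder
print(encoders["odor"].inverse_transform([2, 1, 0]))
```

In this toy frame the letter s is encoded as 1 in cap-surface and as 2 in odor, since each column has its own sorted list of letters.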

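No column contains null values. However, the value counts above show 2480 "?" entries in stalk-root; these are placeholders for missing values, which isnull() does not detect.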

In [13]:
print(dataset.isnull().sum())

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
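The stalk-root counts printed earlier include entries coded as "?", which a null check does not count. The sketch below shows one way such placeholders could be counted and converted on the raw, not yet encoded, data. It assumes that "?" marks a missing value and uses a few made-up rows in place of the CSV.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the raw CSV before label encoding
raw = pd.DataFrame({
    "stalk-root": ["b", "?", "e", "?"],
    "odor": ["n", "f", "n", "a"],
})

# A plain null check finds nothing, because "?" is an ordinary string
nulls_before = int(raw.isnull().sum().sum())

# Count the placeholder per column
placeholder_counts = (raw == "?").sum()

# Convert the placeholder to NaN so that isnull() can see it
cleaned = raw.replace("?", np.nan)
missing = cleaned.isnull().sum()

print(nulls_before)
print(placeholder_counts)
print(missing)
```

Whether to drop the column, drop the rows, or keep "?" as its own category is a separate modelling decision that this sketch does not make.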


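The relevant modules are imported.
X holds the feature columns used to predict the target, the edibility class, which is stored as y.
The data is split into training and test sets (80% and 20%) so that the trained model can be evaluated on samples it has not seen.
X is standardised with StandardScaler, which is fitted on the training set only, so that each training feature has a mean of 0 and a standard deviation of 1. The same transformation is then applied to the test set.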

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc

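# Separate the features from the target column
X = dataset.drop("class", axis=1)
y = dataset["class"]
# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)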
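The effect of the scaling step can be checked directly. The following sketch repeats the split and scaling on randomly generated integer features (an assumption made so the example runs without the CSV; the `_demo` names are introduced here) and prints the resulting column means and standard deviations.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random integer codes standing in for the label-encoded features
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(200, 3)).astype(float)
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learns mean and std from the training set
X_te_scaled = demo_scaler.transform(X_te)      # reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))  # training means
print(X_tr_scaled.std(axis=0).round(6))   # training standard deviations
print(X_te_scaled.mean(axis=0).round(3))  # test means
```

Only the training columns are guaranteed to have a mean of 0 and a standard deviation of 1. The test columns are shifted and scaled by the training statistics, so their values are usually close to, but not exactly, 0 and 1.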

The first model tested is logistic regression, which is well suited to binary classification; the target here has two classes, edible and poisonous. After training, the model's predictions match the actual labels for 95.20% of the test samples.

In [27]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Accuracy: 95.20%
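Accuracy alone does not show which kind of mistake the model makes. The confusion_matrix function imported earlier can separate the two error types. The sketch below uses synthetic data from make_classification rather than the mushroom data, and assumes the alphabetical encoding in which class 0 is edible and class 1 is poisonous.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the scaled mushroom features
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Rows are actual classes and columns are predicted classes, ordered [0, 1].
# Assumption: 0 = edible, 1 = poisonous.
tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()

print("edible predicted as poisonous:", fp)
print("poisonous predicted as edible:", fn)
```

For this task the second count matters most, since it is the number of poisonous samples that would be wrongly labelled safe.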


K Nearest Neighbours classifies each sample by a majority vote among its closest training points, here the 3 nearest. On the test set it reaches a much higher accuracy of 100%. As eating a poisonous mushroom can be very harmful, this model would be the better choice of the two for deciding whether a mushroom is safe to eat.

In [29]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Accuracy: 100.00%
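A single train/test split gives one accuracy figure. One way to see how stable that figure is would be k-fold cross-validation. The sketch below is an illustration on synthetic data, not a result for the mushroom dataset; the pipeline and the choice of 5 folds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the encoded mushroom features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Putting the scaler in the pipeline refits it on the training part of each fold
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# One accuracy score per fold
scores = cross_val_score(knn_pipeline, X_demo, y_demo, cv=5)

print("Fold accuracies:", scores.round(3))
print("Mean accuracy: {:.2f}%".format(scores.mean() * 100))
```

With the real data, X and y from the earlier cell would take the place of the synthetic arrays.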
