Classify the mushrooms in the `agaricus-lepiota.data` dataset as poisonous (`p`) or edible (`e`) using [`CategoricalNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html).
- Missing values are stored in the dataset as `?` (pass `na_values='?'` when loading the data). Remove the rows that contain missing values (`dropna(axis='rows')`).
- Encode the input data (`X`) as 0, 1, 2, ... using [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html).
- Split the data into a training set and a test set. Print the confusion matrix, accuracy, and F-measure for the test set.
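
The loading and cleaning step from the first bullet can be tried on a tiny in-memory sample before touching the real file. This is a minimal sketch: the three rows and the reduced column set are hypothetical and only mimic the style of the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same comma-separated style; not taken from the real file
raw = io.StringIO(
    "class,odor,stalk-root\n"
    "p,p,e\n"
    "e,a,?\n"
    "e,l,c\n"
)

# na_values='?' turns every '?' cell into NaN while parsing
toy = pd.read_csv(raw, na_values="?")

# dropna(axis='rows') removes each row that contains at least one NaN
toy_clean = toy.dropna(axis="rows")

print(toy.isna().sum().sum())  # number of missing cells found
print(toy_clean.shape)         # shape after dropping incomplete rows
```

Without `na_values='?'` the question mark would be read as an ordinary category value, and `dropna` would have nothing to remove.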

In [22]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

from sklearn.metrics import classification_report, confusion_matrix

In [23]:
# class - p = poisonous, e = edible
df = pd.read_csv("../data/agaricus-lepiota.data", na_values='?')
print(df.shape)
df.head()

(8124, 23)


,class,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g


In [24]:
df_1 = df.dropna(axis='rows')
print(df_1.shape)
df_1.head()

(5644, 23)


,class,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g


In [25]:
X = df_1.drop(columns="class")
y = df_1["class"]

print(X.shape)

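# Encode every categorical column as integers 0, 1, 2, ... (categories sorted per column).
# The encoder is fitted on all rows before the split, so every category receives a code.
enc = OrdinalEncoder()
X_enc = enc.fit_transform(X)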

print(X_enc[:5])

(5644, 22)
[[5. 2. 4. 1. 6. 1. 0. 1. 2. 0. 2. 2. 2. 5. 5. 0. 0. 1. 3. 1. 3. 5.]
 [5. 2. 7. 1. 0. 1. 0. 0. 2. 0. 1. 2. 2. 5. 5. 0. 0. 1. 3. 2. 2. 1.]
 [0. 2. 6. 1. 3. 1. 0. 0. 3. 0. 1. 2. 2. 5. 5. 0. 0. 1. 3. 2. 2. 3.]
 [5. 3. 6. 1. 6. 1. 0. 1. 3. 0. 2. 2. 2. 5. 5. 0. 0. 1. 3. 1. 3. 5.]
 [5. 2. 3. 0. 5. 1. 1. 0. 2. 1. 2. 2. 2. 5. 5. 0. 0. 1. 0. 2. 0. 1.]]
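
To read the integer codes printed above, it helps to look at the mapping the encoder learned. The sketch below uses a hypothetical two-column toy array rather than the real data; `categories_` and `inverse_transform` are standard `OrdinalEncoder` members.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical two-column toy data using single-letter codes like those in the dataset
toy_X = np.array([["x", "s"], ["b", "y"], ["x", "y"], ["f", "s"]])

toy_enc = OrdinalEncoder()
toy_codes = toy_enc.fit_transform(toy_X)

# categories_[i][k] is the original value that column i encodes as k
for i, cats in enumerate(toy_enc.categories_):
    print(i, dict(enumerate(cats)))

# inverse_transform maps the integer codes back to the original letters
restored = toy_enc.inverse_transform(toy_codes)
print(restored)
```

The same inspection applied to `enc.categories_` in the notebook would show which letter each number in `X_enc` stands for.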


In [None]:
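# Hold out 20% of the rows as the test set
X_train_full, X_test, y_train_full, y_test = train_test_split(X_enc, y, test_size=0.2, random_state=17)
# Split the remaining 80% into training (75%) and validation (25%), i.e. about 60% / 20% of all rows
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=17)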
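
The two-stage split can be checked in isolation to see how many rows land in each part. This sketch assumes only the row count printed earlier (5644) and uses placeholder indices instead of `X_enc`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 5644            # row count after dropna, as printed earlier
idx = np.arange(n_rows)  # placeholder indices standing in for X_enc

# Same parameters as in the notebook cell
full, test = train_test_split(idx, test_size=0.2, random_state=17)
train, val = train_test_split(full, test_size=0.25, random_state=17)

print(len(train), len(val), len(test))
```

Taking 25% of the remaining 80% yields a validation set of the same size as the test set, which gives roughly a 60/20/20 split overall.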

### Validation data

In [27]:
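# Fit the classifier on the training split and predict on the validation split
cat = CategoricalNB()
cat.fit(X_train, y_train)

y_pred = cat.predict(X_val)

# Labels are sorted alphabetically, so 'e' is the negative class and 'p' the positive one
cm = confusion_matrix(y_val, y_pred)
tn, fp, fn, tp = cm.ravel()

print("Confusion Matrix:")
print(cm)
print("\nTrue Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)

print("#" * 20)
# The report includes accuracy and the per-class F-measure (f1-score)
print(classification_report(y_val, y_pred))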

Confusion Matrix:
[[714   2]
 [ 28 385]]

True Negatives: 714
False Positives: 2
False Negatives: 28
True Positives: 385
####################
              precision    recall  f1-score   support

           e       0.96      1.00      0.98       716
           p       0.99      0.93      0.96       413

    accuracy                           0.97      1129
   macro avg       0.98      0.96      0.97      1129
weighted avg       0.97      0.97      0.97      1129
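
The report values for class `p` can be reproduced by hand from the four counts in the validation confusion matrix. The sketch below uses only those printed counts and the standard definitions of accuracy, precision, recall, and F1.

```python
# Counts copied from the validation confusion matrix above ('p' is the positive class)
tn, fp, fn, tp = 714, 2, 28, 385

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted-poisonous mushrooms that really are poisonous
recall = tp / (tp + fn)     # share of poisonous mushrooms that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The 28 false negatives are poisonous mushrooms predicted as edible, which is why recall for `p` is the lowest number in the report.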



### Test data

In [28]:
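# Evaluate the already fitted classifier on the held-out test split
y_pred = cat.predict(X_test)

# Labels are sorted alphabetically, so 'e' is the negative class and 'p' the positive one
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print("Confusion Matrix:")
print(cm)
print("\nTrue Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)

print("#" * 20)
# The report includes accuracy and the per-class F-measure (f1-score)
print(classification_report(y_test, y_pred))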

Confusion Matrix:
[[719   4]
 [ 34 372]]

True Negatives: 719
False Positives: 4
False Negatives: 34
True Positives: 372
####################
              precision    recall  f1-score   support

           e       0.95      0.99      0.97       723
           p       0.99      0.92      0.95       406

    accuracy                           0.97      1129
   macro avg       0.97      0.96      0.96      1129
weighted avg       0.97      0.97      0.97      1129
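
The task asks for accuracy and the F-measure as separate numbers; one possible way to get them is `accuracy_score` and `f1_score` from `sklearn.metrics`. In this sketch the label vectors are rebuilt from the test confusion matrix above purely so that the block runs standalone. In the notebook you would pass `y_test` and `y_pred` directly. Treating `p` as the positive class is an assumption that matches the TN/FP/FN/TP printout.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Label vectors rebuilt from the test confusion matrix above (stand-ins for y_test / y_pred)
y_true = ["e"] * 723 + ["p"] * 406
y_hat = ["e"] * 719 + ["p"] * 4 + ["e"] * 34 + ["p"] * 372

cm = confusion_matrix(y_true, y_hat, labels=["e", "p"])
acc = accuracy_score(y_true, y_hat)
f1_p = f1_score(y_true, y_hat, pos_label="p")  # F-measure with 'p' as the positive class

print(cm)
print(f"accuracy={acc:.3f} f1(p)={f1_p:.3f}")
```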

