## **KNN Model**

We will build a classification model with the k-Nearest Neighbors (k-NN) algorithm on the Mushroom dataset. The dataset describes mushrooms by a set of characteristics, and the goal is to predict from those features whether a mushroom is edible or poisonous. We will load the preprocessed data, train the k-NN model, evaluate it on a held-out test set and with cross-validation, and plot a learning curve to see how accuracy changes with the size of the training set.
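
As background, the idea behind k-NN can be sketched in a few lines: a new sample receives the label that is most common among its k closest training samples. The snippet below is only an illustrative sketch, not the scikit-learn implementation used in this notebook. The function name `knn_predict`, the toy points, and the 0 = edible / 1 = poisonous encoding are assumptions made for the example.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k training points with the smallest distances
    nearest = sorted(range(len(train_points)), key=distances.__getitem__)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up data (assumption): two features per sample, 0 = edible, 1 = poisonous
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.15, 0.15), k=3))  # query near the first cluster
print(knn_predict(points, labels, (0.95, 1.0), k=3))   # query near the second cluster
```

Because every prediction compares the query against the stored training samples, k-NN has no real training phase; the cost is paid at prediction time.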

In [4]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import cross_val_score, learning_curve, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Load the preprocessed DataFrame from the .csv file

In [None]:
df = pd.read_csv("mushrooms_preprocessed.csv", index_col=0)
df.head()

In [6]:
# Hold out 20% of the rows as a test set; "poisonous" is the target column
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["poisonous"]),
    df["poisonous"],
    test_size=0.2,
    random_state=42,
)

In [7]:
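# Single-step pipeline: the data is already preprocessed, so only the classifier is needed
knn_pipeline = make_pipeline(KNeighborsClassifier(n_neighbors=5))

# Fit on the training split
knn_pipeline.fit(X_train, y_train)

# Predict labels for the held-out test split
y_pred = knn_pipeline.predict(X_test)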
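
The value `n_neighbors=5` above is scikit-learn's default rather than a tuned choice. One possible way to compare several values of k is to cross-validate each candidate, as in the sketch below. It runs on synthetic data from `make_classification` so that it is self-contained; the names `X_demo`, `y_demo`, and `candidate_ks` are assumptions for the example, and in the notebook `X_train` and `y_train` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): replace with X_train, y_train from the notebook
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy for this value of k
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[k] = scores.mean()

# Pick the k with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"k={k:2d}  mean CV accuracy={score:.4f}")
print("Best k:", best_k)
```

Odd values of k are a common choice for binary problems because they avoid tied votes. The selection should use only the training split, so that the test set remains an unbiased check.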

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {accuracy:.4f}")

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=knn_pipeline.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=knn_pipeline.classes_)
disp.plot(cmap='Blues')
disp.ax_.set_title("Confusion matrix - KNN")

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
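
The numbers in the classification report can be derived directly from the confusion matrix. The sketch below shows the arithmetic for the binary case, assuming scikit-learn's layout (rows are true labels, columns are predicted labels) and that the positive class, here taken to be "poisonous", is the second row and column. The helper `binary_metrics` and the counts in `example_cm` are made up for illustration and are not results from this notebook.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, and recall for the positive class from a 2x2 confusion matrix.

    Assumes scikit-learn's layout: rows are true labels, columns are predicted
    labels, and the positive class is the second row/column.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up counts for illustration only (assumption, not notebook output)
example_cm = [[80, 5],
              [10, 105]]
accuracy, precision, recall = binary_metrics(example_cm)
print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}")
```

For this task, a false negative (a poisonous mushroom predicted as edible) is arguably the more dangerous error, so recall on the poisonous class deserves attention alongside overall accuracy.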

In [None]:
scores = cross_val_score(knn_pipeline, X_train, y_train, cv=10)

print("Cross-validation scores:", scores)
print("Mean cross-validation accuracy:", scores.mean())

# Learning curve

In [None]:
train_sizes, train_scores, valid_scores = learning_curve(
    knn_pipeline,
    X_train,
    y_train,
    cv=10,
    train_sizes=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    scoring='accuracy',
)

# Mean and standard deviation across the 10 folds for each training size
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
valid_mean = np.mean(valid_scores, axis=1)
valid_std = np.std(valid_scores, axis=1)

# Plot the mean scores with a band of one standard deviation around each curve
plt.figure(figsize=(8, 6))
plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.2, color='blue')
plt.plot(train_sizes, valid_mean, 'o-', color='green', label='Validation score')
plt.fill_between(train_sizes, valid_mean - valid_std, valid_mean + valid_std, alpha=0.2, color='green')
plt.title('Learning Curve (k-NN)')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.grid(True)
plt.tight_layout()
plt.show()
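
One way to read a learning curve numerically is to look at the gap between the mean training score and the mean validation score at each training size. The sketch below assumes score arrays shaped like the output of `learning_curve`, that is `(n_train_sizes, n_cv_folds)`. The helper `generalization_gap` and the `demo_*` arrays are hypothetical and made up for illustration; they are not results from this notebook.

```python
import numpy as np


def generalization_gap(train_scores, valid_scores):
    """Return, for each training size, mean training score minus mean validation score."""
    train_mean = np.mean(train_scores, axis=1)
    valid_mean = np.mean(valid_scores, axis=1)
    return train_mean - valid_mean


# Made-up scores (assumption): 3 training sizes, 2 cross-validation folds
demo_train = np.array([[1.00, 1.00], [0.98, 1.00], [0.97, 0.99]])
demo_valid = np.array([[0.80, 0.84], [0.90, 0.92], [0.95, 0.97]])

gap = generalization_gap(demo_train, demo_valid)
print(gap)
```

A gap that shrinks as the training set grows suggests the model benefits from more data, while a gap that stays large usually points to overfitting.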