
**Objective:**  
Classify edible and poisonous mushrooms. Determine the maximum accuracy achievable and evaluate how many poisonous mushrooms are correctly classified.

**Instructions:**
1. Start by implementing logistic regression and k-nearest neighbors to classify mushrooms based on given features.
2. Compare the accuracy of both models.
3. Examine each model's performance in accurately identifying poisonous mushrooms. 



In [1]:
import pandas as pd

# Load the dataset, then inspect column types and summary statistics
df = pd.read_csv('mushroom.csv') 
df.info()
df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54035 entries, 0 to 54034
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   cap-diameter     54035 non-null  int64  
 1   cap-shape        54035 non-null  int64  
 2   gill-attachment  54035 non-null  int64  
 3   gill-color       54035 non-null  int64  
 4   stem-height      54035 non-null  float64
 5   stem-width       54035 non-null  int64  
 6   stem-color       54035 non-null  int64  
 7   season           54035 non-null  float64
 8   class            54035 non-null  int64  
dtypes: float64(2), int64(7)
memory usage: 3.7 MB


| | cap-diameter | cap-shape | gill-attachment | gill-color | stem-height | stem-width | stem-color | season | class |
|---|---|---|---|---|---|---|---|---|---|
| count | 54035.0 | 54035.0 | 54035.0 | 54035.0 | 54035.0 | 54035.0 | 54035.0 | 54035.0 | 54035.0 |
| mean | 567.257204 | 4.000315 | 2.142056 | 7.329509 | 0.75911 | 1051.081299 | 8.418062 | 0.952163 | 0.549181 |
| std | 359.883763 | 2.160505 | 2.228821 | 3.200266 | 0.650969 | 782.056076 | 3.262078 | 0.305594 | 0.49758 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.000426 | 0.0 | 0.0 | 0.027372 | 0.0 |
| 25% | 289.0 | 2.0 | 0.0 | 5.0 | 0.270997 | 421.0 | 6.0 | 0.88845 | 0.0 |
| 50% | 525.0 | 5.0 | 1.0 | 8.0 | 0.593295 | 923.0 | 11.0 | 0.943195 | 1.0 |
| 75% | 781.0 | 6.0 | 4.0 | 10.0 | 1.054858 | 1523.0 | 11.0 | 0.943195 | 1.0 |
| max | 1891.0 | 6.0 | 6.0 | 11.0 | 3.83532 | 3569.0 | 12.0 | 1.804273 | 1.0 |


In [11]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Use every column except the target ('class') as a feature
features = ['cap-diameter', 'cap-shape', 'gill-attachment', 'gill-color',
            'stem-height', 'stem-width', 'stem-color', 'season']
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['class'], test_size=0.33, random_state=31)

# max_iter is raised from the default of 100 to avoid a ConvergenceWarning
logistic_regr = linear_model.LogisticRegression(max_iter=500)
logistic_regr.fit(X_train, y_train)

# predict() already returns class labels (0 or 1), so no thresholding step is needed
predict = logistic_regr.predict(X_test)

print(classification_report(y_test, predict, target_names=["Edible", "Poisonous"]))
print("Confusion Matrix:\n", confusion_matrix(y_test, predict))


              precision    recall  f1-score   support

      Edible       0.62      0.53      0.57      8084
   Poisonous       0.65      0.72      0.69      9748

    accuracy                           0.64     17832
   macro avg       0.63      0.63      0.63     17832
weighted avg       0.64      0.64      0.63     17832

Confusion Matrix:
 [[4319 3765]
 [2696 7052]]


In [8]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)

knn.fit(X_train, y_train)
# predict() already returns class labels (0 or 1), so no thresholding step is needed
predict = knn.predict(X_test)


print(classification_report(y_test, predict, target_names=["Edible", "Poisonous"]))
print("Confusion Matrix:\n", confusion_matrix(y_test, predict))


              precision    recall  f1-score   support

      Edible       0.69      0.68      0.68      8084
   Poisonous       0.74      0.74      0.74      9748

    accuracy                           0.71     17832
   macro avg       0.71      0.71      0.71     17832
weighted avg       0.71      0.71      0.71     17832

Confusion Matrix:
 [[5505 2579]
 [2516 7232]]


Both models perform poorly (64% and 71% accuracy), so instead of analyzing their errors in more detail, the next step is to try a model better suited to this data.
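
To put these accuracies in context, it helps to compare them with a majority-class baseline, that is, a classifier that always predicts the more common class. The sketch below derives that baseline and both accuracies from the support counts and confusion matrices printed above. The helper name `accuracy_from_cm` is introduced here for illustration only.

```python
# Confusion matrices copied from the outputs above.
# Rows are true classes, columns are predicted classes, in the order [Edible, Poisonous].
cm_logreg = [[4319, 3765], [2696, 7052]]
cm_knn = [[5505, 2579], [2516, 7232]]


def accuracy_from_cm(cm):
    """Fraction of correct predictions: diagonal sum divided by the total count."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total


# Majority-class baseline: always predict the larger class in the test set.
n_edible, n_poisonous = 8084, 9748
baseline = max(n_edible, n_poisonous) / (n_edible + n_poisonous)

print(f"baseline:            {baseline:.3f}")
print(f"logistic regression: {accuracy_from_cm(cm_logreg):.3f}")
print(f"k-nearest neighbors: {accuracy_from_cm(cm_knn):.3f}")
```

On this test split the baseline is about 55%, so logistic regression (64%) improves on it by fewer than ten percentage points.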

In [6]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

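# No random_state is set, so the fitted tree (and the scores below) can vary slightly between runs
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

# Optional: plot the top levels of the tree (requires matplotlib)
# tree.plot_tree(dtree, feature_names=list(X_train.columns), class_names=["Edible", "Poisonous"], max_depth=2)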


predict = dtree.predict(X_test)

print(classification_report(y_test, predict, target_names=["Edible", "Poisonous"]))
print("Confusion Matrix:\n", confusion_matrix(y_test, predict))


              precision    recall  f1-score   support

      Edible       0.97      0.97      0.97      8084
   Poisonous       0.98      0.98      0.98      9748

    accuracy                           0.97     17832
   macro avg       0.97      0.97      0.97     17832
weighted avg       0.97      0.97      0.97     17832

Confusion Matrix:
 [[7840  244]
 [ 206 9542]]


**Model Performance Explanation:**

Neither of the initial models achieved high accuracy: logistic regression reached 64% and k-nearest neighbors 71%. A decision tree classifier trained on the same split improved substantially on both, reaching 97% accuracy.

The decision tree also achieves 98% recall for poisonous mushrooms: it correctly identifies 9,542 of the 9,748 poisonous mushrooms in the test set. Misclassifying an edible mushroom as poisonous (a false positive, 244 cases) is the less serious error, since it errs on the side of caution. The dangerous error is the reverse, and here 206 poisonous mushrooms (about 2%) are still classified as edible. That is a small fraction, so the model meets the goal of high recall for poisonous mushrooms, although it does not eliminate the risk.
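
The recall figures can be checked directly from the three confusion matrices reported in this notebook. The sketch below treats "Poisonous" as the positive class, an assumption consistent with the `target_names` order used above. The function name `poisonous_recall` is illustrative.

```python
# Confusion matrices copied from the outputs above.
# Rows are true classes, columns are predicted classes, in the order [Edible, Poisonous].
cms = {
    "logistic regression": [[4319, 3765], [2696, 7052]],
    "k-nearest neighbors": [[5505, 2579], [2516, 7232]],
    "decision tree": [[7840, 244], [206, 9542]],
}


def poisonous_recall(cm):
    """Recall for the Poisonous class: TP / (TP + FN)."""
    false_negatives, true_positives = cm[1]  # row 1 holds the truly poisonous samples
    return true_positives / (true_positives + false_negatives)


for name, cm in cms.items():
    missed = cm[1][0]  # poisonous mushrooms predicted as edible
    print(f"{name}: recall={poisonous_recall(cm):.3f}, poisonous predicted edible={missed}")
```

This covers the third instruction for all three models: the decision tree misses 206 poisonous mushrooms, compared with 2,696 for logistic regression and 2,516 for k-nearest neighbors.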

This accuracy, together with the interpretability of a tree's decision rules, suggests that the decision tree is well suited to this task. A likely reason is its ability to capture non-linear patterns and interactions between features, several of which (such as cap shape, gill color, and stem color) appear to be categorical values encoded as integers. Logistic regression assumes a linear decision boundary and treats those integer codes as ordered quantities, while k-nearest neighbors relies on distances that are sensitive both to feature scale and to the arbitrary ordering of the codes. A decision tree instead partitions the feature space with a sequence of threshold splits, which lets it separate classes along intricate boundaries and likely explains its superior performance here.
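
One part of this explanation can be illustrated in isolation. k-nearest neighbors is distance-based, and the `describe` output above shows columns on very different scales (`stem-width` reaches 3569 while `season` stays below 2). The sketch below uses synthetic data, not the mushroom dataset, to show how an unscaled wide-range column can dominate the distance computation and how a `StandardScaler` pipeline removes that effect. The data, variable names, and the choice of scaler are all assumptions made for illustration, and the notebook does not test whether scaling would help on the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in: one informative small-range feature
# and one uninformative wide-range feature.
signal = rng.normal(size=n)               # determines the label
noise = rng.normal(scale=1000.0, size=n)  # carries no label information
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=31)

# KNN on raw features: distances are dominated by the wide-range column.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# KNN after standardizing each feature to zero mean and unit variance.
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

raw_acc = raw_knn.score(X_test, y_test)
scaled_acc = scaled_knn.score(X_test, y_test)
print(f"raw features:    {raw_acc:.3f}")
print(f"scaled features: {scaled_acc:.3f}")
```

Decision trees split on one feature at a time, so they are unaffected by this kind of rescaling, which is consistent with the tree doing well on the unscaled columns.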