# Mushrooms, safe to eat or poisonous? #
**Introduction**

In this Kaggle notebook I'll work with different types of models to classify mushrooms as edible (e) or poisonous (p). It is important to know the state of the art on this dataset: other projects report accuracies close to 100%. With this in mind, I'll focus on using the different models correctly and explaining them, rather than on obtaining the best results.

**Libraries used** 

* [Numpy](https://numpy.org/): To treat and work with the data (linear algebra)
* [Pandas](https://pandas.pydata.org/): To work with the dataset
* [Sklearn](https://scikit-learn.org/stable/): To create and work with the models
* [Seaborn](https://seaborn.pydata.org/): To visualize the data with graphs
* [Matplotlib](https://matplotlib.org/): To visualize the data with graphs


## Step 1: Import the libraries ##

In [None]:

import numpy as np 
import pandas as pd 
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import random

#Data Processing

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

#Model creation and hyperparameter search

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline

#Validation and visualization of scores

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, precision_score, recall_score, auc
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.metrics import classification_report, accuracy_score, f1_score

## Step 2: EDA ##
EDA, or Exploratory Data Analysis, helps us better understand the data we are working with, so that we know how to address its problems later.

**We load the dataset and visualize its first elements**

In [None]:
df = pd.read_csv('../input/mushroom-classification/mushrooms.csv')
dataset = df.values
df.head()

**Show Data size**

In [None]:
df.shape

**We check for null values**

In [None]:
df.isnull().sum()
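
Note that `isnull()` only detects real NaN values. In the UCI release of this dataset, missing entries are, as far as we know, written as the string `'?'` (in the `stalk-root` column), so they would not be counted by the check above. A minimal sketch on a toy frame (the values below are invented for illustration, they are not rows of the real dataset):

```python
import pandas as pd

# Toy frame standing in for the mushroom data (values invented for illustration).
toy = pd.DataFrame({
    "stalk-root": ["e", "?", "b", "?"],
    "odor": ["a", "n", "n", "f"],
})

# isnull() reports nothing because '?' is an ordinary string, not NaN.
nan_counts = toy.isnull().sum()

# Count the placeholder explicitly, column by column.
placeholder_counts = (toy == "?").sum()

print(nan_counts.to_dict())
print(placeholder_counts.to_dict())
```

Running the same comparison on `df` would show whether any column of the real data uses this placeholder.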

**Separate Target from the data**

Our target variable is *class*, so we separate the dependent variable from the independent ones.

In [None]:
names = list(df.columns)
x = df[names[1:]] ## Dataset with the independent characteristics
y = df['class'] ## Dataset with the target characteristic


**Target variable distribution**

In [None]:
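colors = ('#EF8787','#9CF29C')
palette = sns.color_palette(colors)  # keep the palette itself; sns.set_palette() returns None
sns.set_palette(palette)

f, ax = plt.subplots(figsize=(15, 10))
labels = ('Poisonous','Edible')
# value_counts() sorts by frequency, so order the counts explicitly as (p, e)
# to make sure the labels and colors match the classes
class_counts = df['class'].value_counts().reindex(['p', 'e'])
class_counts.plot.pie(labels= labels, shadow= True, ax= ax, autopct='%1.1f%%', colors= colors,textprops={'fontsize': 12} )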

ax.set_title('Mushroom Class Distribution', fontsize = 15);

**Feature distribution between poisonous and edible**

In [None]:
features = df.columns
f, axes = plt.subplots(11,2, figsize=(30,150))
axes = axes.flatten()
k = 1

for i in range(0,22):
    s = sns.countplot(x = features[k], data = df, hue = 'class', ax=axes[i], palette = palette)
    axes[i].set_xlabel(features[k], fontsize=30)
    axes[i].set_ylabel("Count", fontsize=30)
    axes[i].tick_params(labelsize=20)
    axes[i].legend(loc=2, prop={'size': 20})
    k = k+1
    for p in s.patches:
        s.annotate(format(p.get_height()), (p.get_x() + p.get_width() / 2, p.get_height()), 
        ha = 'center', va = 'center', 
        xytext = (0, 9), 
        fontsize = 20,
        textcoords = 'offset points')

We can see that some features give us far more information than others.
We can also see that there's only one veil-type value, so we are going to delete that feature.

In [None]:
df=df.drop(["veil-type"],axis=1)
names = list(df.columns)
x = x.drop(["veil-type"],axis=1)
y = df['class'] ## Dataset with the target characteristic
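
Instead of spotting constant features by eye in the plots, they can also be found programmatically. A small sketch on a toy frame (the column values are invented; only the column names mirror the dataset):

```python
import pandas as pd

# Toy frame: 'veil-type' has a single value, the other columns vary.
toy = pd.DataFrame({
    "veil-type": ["p", "p", "p", "p"],
    "odor": ["a", "n", "n", "f"],
    "class": ["e", "e", "p", "p"],
})

# A column with a single distinct value carries no information for a classifier.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(list(reduced.columns))
```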


**Data Correlation**

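Since our data is categorical, we first need some basic label encoding before we can compute correlations and show them in a heatmap. Keep in mind that the integer codes assigned to the categories are arbitrary, so these correlations are only a rough indication.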

In [None]:
labelencoder=LabelEncoder()
df_enc = df.copy()
for column in df.columns:
    df_enc[column] = labelencoder.fit_transform(df[column])

In [None]:
plt.figure(figsize=(14,12))
sns.heatmap(df_enc.corr(),linewidths=.1,annot=True, cmap="magma")
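
Because label codes are arbitrary, an association measure designed for categorical data can complement the heatmap. The sketch below uses Cramér's V, which is not part of the original analysis; `cramers_v` is a hypothetical helper written for this example, and it relies on SciPy's `chi2_contingency`. It assumes both columns have at least two distinct values.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(a, b):
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(a, b)
    # correction=False disables the Yates correction SciPy applies to 2x2 tables.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # assumes every column has at least two categories
    return float(np.sqrt(chi2 / (n * k)))


# Toy data (invented): 'odor' determines the class, 'cap' is unrelated to it.
toy = pd.DataFrame({
    "odor": ["a", "a", "f", "f"],
    "cap": ["x", "b", "x", "b"],
    "class": ["e", "e", "p", "p"],
})

v_odor = cramers_v(toy["odor"], toy["class"])
v_cap = cramers_v(toy["cap"], toy["class"])
print(v_odor, v_cap)
```

Applied to `df`, the same helper could rank the features by how strongly each one is associated with `class`.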

## Step 3: Data Preprocessing ##

**Data Encoding**

Since all our data is categorical, we must encode it before we can work with it. We need to keep in mind that, for some models, directly encoding the values as numbers (as we did to plot the heatmap) can create a bias towards categories with higher codes. To avoid that we'll use One-Hot Encoding.


In [None]:
ohe_x = OneHotEncoder(drop='first').fit(x)
ohe_x = ohe_x.transform(x).toarray()

aux = y.values.reshape(-1, 1)
ohe_y = OneHotEncoder(drop='first').fit(aux)
ohe_y = ohe_y.transform(aux).toarray()
ohe_y = ohe_y.flatten()
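
To make the difference between the two encodings concrete, here is a small sketch on a toy frame (the `color` and `surface` columns are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

toy = pd.DataFrame({
    "color": ["red", "blue", "green"],
    "surface": ["smooth", "smooth", "rough"],
})

# Label encoding: categories become integers in alphabetical order,
# which implies an ordering (blue < green < red) that has no real meaning.
label_codes = LabelEncoder().fit_transform(toy["color"])

# One-hot encoding with drop='first': one binary column per category,
# minus the first category of each feature (it becomes the all-zeros baseline).
encoder = OneHotEncoder(drop="first").fit(toy)
one_hot = encoder.transform(toy).toarray()

print(label_codes)
print(encoder.categories_)
print(one_hot)
```

The same logic applies to the target above: the categories of `class` are sorted as `e`, `p`, so dropping the first one leaves a single column where 1 means poisonous and 0 means edible.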

**Data normalization**

Since all our data is categorical there's no need to normalize the data.

**Data Split**

In order to train and test our models with different data we need to split it.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(ohe_x,ohe_y,test_size=0.2, random_state=1)
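
The split above is purely random. An optional variant, shown here only as a sketch on synthetic data, is to pass `stratify` so that both parts keep the same class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 60 of class 0 and 40 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy keeps the 60/40 class ratio in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```

With a nearly balanced target such as ours the effect is small, but it removes one source of variation between runs.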

## Step 4: Model Selection ##

There are many different models and many variations of each one. To choose the ones that will perform best, we need to consider which of them best fit our dataset.

Our dataset is balanced, both the inputs and the output are categorical, and we have around 8,000 examples. Considering this, we'll use the following models:

* Logistic Regression
* Naive Bayes
* SVM
* Random Forest
* KNN



First we define a function that will help us print the results of each model.

In [None]:
models = ['LogisticRegression','NaiveBayes','KernelSVM'
          ,'RandomForest','KNearestNeighbors']

scores = [None] * len(models)

def show_results(best_model, prediction,model):
    acc = accuracy_score(y_test, prediction)
    
    scores[models.index(model)] = acc
        
    prec = precision_score(y_test, prediction)
    rec = recall_score(y_test, prediction)
    # Plot the row-normalized confusion matrix
    plot_confusion_matrix(best_model, x_test, y_test, normalize='true', cmap="magma")
    # The normalized matrix no longer holds raw counts, so derive the error rate from the accuracy
    error = ("%.4f" % ((1 - acc) * 100))
    print(f'Accuracy:{acc}')
    print(f'Precision:{prec}')
    print(f'Recall:{rec}')
    print(f"Error rate: {error}%")
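
To see what these metrics measure, the sketch below recomputes them by hand from a confusion matrix. The labels are invented toy values; in this notebook's encoding 1 stands for poisonous, so recall is the share of poisonous mushrooms that are detected, and a false negative is a poisonous mushroom predicted as edible.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (invented): 1 = poisonous, 0 = edible.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 1, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the mushrooms flagged as poisonous, how many really are
recall = tp / (tp + fn)      # of the poisonous mushrooms, how many were flagged
error_rate = 1 - accuracy

print(accuracy, precision, recall, error_rate)
```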

**Logistic Regression**

In [None]:
from sklearn.model_selection import cross_val_score

lr = LogisticRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
accuracy = lr.score(x_test, y_test)

show_results(lr, y_pred,"LogisticRegression")


We have obtained an accuracy of 0.997 (99.7%), which could indicate an overfitting issue. To evaluate the models more reliably, from now on we are going to use cross-validation. Doing so, we will probably obtain somewhat lower scores, but they will be more reliable.

In [None]:
score_list = cross_val_score(lr,ohe_x,ohe_y, cv=10)
score = np.mean(score_list)
print (score)

# we swap the score obtained before with the cross_val_score
scores[0] = score


**Naive Bayes**

In [None]:
nb = GaussianNB()
nb.fit(x_train, y_train)
preds= nb.predict(x_test)
show_results(nb,preds,"NaiveBayes")

In [None]:
score_list = cross_val_score(nb,ohe_x,ohe_y, cv=5)
score = np.mean(score_list)  # compute the mean before printing it
print(score)
# we swap the score obtained before with the cross_val_score
scores[1] = score

Again we can see that after using cross-validation our accuracy has decreased, this time from 95% to 82%. To understand what happened, we need to look at the individual fold scores returned by cross-validation.

In [None]:
print(score_list)

It's now clear that the 95% accuracy wasn't really reliable.
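
One possible explanation, offered here as an assumption and not verified on this dataset: with an integer `cv`, `cross_val_score` builds stratified folds without shuffling, so if the rows of the CSV are grouped in some way, individual folds can look very different from the rest. The sketch below uses synthetic data from `make_classification` as a stand-in and shows how to report the per-fold scores with their spread while using a shuffled splitter:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the encoded mushroom features.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# Shuffle before splitting so that the row order does not decide the folds.
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(GaussianNB(), X_syn, y_syn, cv=shuffled_cv)

print(fold_scores)
print(f"mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

A large standard deviation across folds is a hint that the mean alone does not describe the model well.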

**Data separation in Folds**

In [None]:
cv_split = TimeSeriesSplit(n_splits=5)
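
`TimeSeriesSplit` is designed for data with a temporal order: each split trains only on the rows that come before its test block, so the early splits see very little training data. The sketch below compares its training set sizes with those of a plain `KFold` on 12 toy samples. The comparison is an illustration added here; a shuffled K-fold splitter is a common alternative when the rows have no time order.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)

# Training set size of every split produced by each strategy.
ts_train_sizes = [len(train) for train, _ in TimeSeriesSplit(n_splits=5).split(X_toy)]
kf_train_sizes = [len(train) for train, _ in KFold(n_splits=5).split(X_toy)]

print(ts_train_sizes)
print(kf_train_sizes)
```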

**SVM**

In [None]:
svc = svm.SVC(random_state=1, probability=True)
svc_params = {
    'model__C': [0.1, 1, 10, 100],  
    'model__kernel': ['poly', 'rbf', 'sigmoid']
} 
svc_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', svc)
])
gridsearch_svc = GridSearchCV(estimator=svc_pipe,
                          param_grid = svc_params)
gridsearch_svc.fit(x_train, y_train)

In [None]:
svc_best_model = gridsearch_svc.best_estimator_
predictions = svc_best_model.predict(x_test)
show_results(svc_best_model,predictions,"KernelSVM")
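
After a grid search it is useful to look at which combination won and how it scored during cross-validation. A minimal sketch on synthetic data (a smaller grid than the one above, chosen only to keep the example fast):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded mushroom features.
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("model", SVC())])
# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
grid = {"model__C": [0.1, 1, 10], "model__kernel": ["rbf", "sigmoid"]}

search = GridSearchCV(pipe, param_grid=grid, cv=5).fit(X_syn, y_syn)

n_candidates = len(search.cv_results_["params"])  # 3 values of C x 2 kernels
print(search.best_params_)
print(round(search.best_score_, 3))
print(n_candidates)
```

The same attributes are available on `gridsearch_svc`, `gridsearch_rf` and `gridsearch_knn`.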

**Random Forest**

In [None]:
rf = RandomForestClassifier(random_state=1)
rf_params = {
    'model__n_estimators': list(range(25,251,25)),
    'model__max_features': list(np.arange(0.1,0.36,0.05))
}
rf_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', rf)
])
gridsearch_rf = GridSearchCV(estimator=rf_pipe,
                          param_grid = rf_params,
                          cv = cv_split,
                         )
gridsearch_rf.fit(x_train, y_train)

In [None]:
rf_best_model = gridsearch_rf.best_estimator_
predictions = rf_best_model.predict(x_test)
show_results(rf_best_model,predictions,'RandomForest')

**KNN**

In [None]:
knn = KNeighborsClassifier()
knn_params = {
    'n_neighbors': list(range(4,10)),
    'weights': ['uniform','distance']
}

gridsearch_knn = GridSearchCV(knn,
                          param_grid = knn_params,
                          cv = cv_split,
                         )
gridsearch_knn.fit(x_train, y_train)

In [None]:
knn_best_model = gridsearch_knn.best_estimator_
predictions = knn_best_model.predict(x_test)

show_results(knn_best_model,predictions,'KNearestNeighbors')

After the previous results we can already suspect that, regardless of the hyperparameter search and the tuning of the variables, we are likely to obtain very high accuracies. To illustrate that, we can look at the next code, which runs KNN with various parameters.

In [None]:
distances= [1,2,5]
weights = ['distance','uniform']



for dist in distances:
    i = 0
    fig, axs = plt.subplots(1,2,figsize=(20,5))

    for w in weights:
        list1 = []
        for neighbors in range(3,10):
            classifier = KNeighborsClassifier(n_neighbors=neighbors, p=dist, weights = w)
            classifier.fit(x_train, y_train)
            y_pred = classifier.predict(x_test)
            list1.append(accuracy_score(y_test,y_pred))

        axs[i].plot(list(range(3,10)), list1, linewidth=3)
        axs[i].set_title("weights = "+str(w), fontsize=15)
        axs[i].set_xlabel("K neighbors", fontsize=15)
        axs[i].set_ylabel("Accuracy", fontsize=15)
        i +=1
    fig.suptitle("p = "+str(dist), fontsize=20, y = 1.02)   
    plt.show()


In [None]:
plt.rcParams['figure.figsize']=15,8 
ax = sns.barplot(x=models, y=scores, palette = "magma", saturation =1.5)
plt.xlabel("Classifier Models", fontsize = 20 )
plt.ylabel("% of Accuracy", fontsize = 20)
plt.title("Accuracy of different Classifier Models", fontsize = 20)
plt.xticks(fontsize = 13, horizontalalignment = 'center', rotation = 0)
plt.yticks(fontsize = 13)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    bar_x, bar_y = p.get_xy()  # separate names so the feature/target variables x and y are not overwritten
    ax.annotate(f'{height:.2%}', (bar_x + width/2, bar_y + height*1.01), ha='center', fontsize = 13)
plt.show()

## Step 5: 100% Accuracy, too good to be true? ##

As previously explained, this dataset's features make it easy to classify the mushrooms as edible or poisonous.
We have seen that all the tested models reach high accuracies and that the accuracies vary little between different parameters (illustrated with the KNN example).
To finally understand how easily we can achieve high accuracies, we'll plot ROC curves for several random subsets of features.

In [None]:
sns.set_palette(sns.color_palette('Set1')) # just to define the plot palette
feature_names = names[1:] # every column except the target 'class'

for i in range(0,5):
    random.seed(i)
    # sample 5 features; the target itself must not be among them
    randlist = random.sample(feature_names, k=5)
    rand_df = pd.get_dummies(df[randlist])

    # separate names so the earlier y_train / y_test are not overwritten;
    # random_state makes each split reproducible
    x2_train, x2_test, y2_train, y2_test = train_test_split(rand_df, ohe_y, test_size=0.2, random_state=i)

    lr = LogisticRegression(solver="lbfgs").fit(x2_train, y2_train)

    # probability of the positive class (1 = poisonous)
    y_probs = lr.predict_proba(x2_test)[:, 1]

    # compute the AUC from the probabilities, not from the hard predictions
    roc_auc = "%.2f" % roc_auc_score(y2_test, y_probs)
    lr_fpr, lr_tpr, _ = roc_curve(y2_test, y_probs)

    plt.plot(lr_fpr, lr_tpr, marker='.', label=(str(randlist) + " " + str(roc_auc)))

# axis labels and legend only need to be set once
plt.xlabel('False Positive Rate',fontsize = 15)
plt.ylabel('True Positive Rate',fontsize = 15)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.legend(fontsize = 11)

plt.gcf().set_size_inches(15,10)
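
The AUC summarizes the whole ROC curve, so it is normally computed from predicted probabilities; computing it from hard 0/1 predictions evaluates a single threshold only. A toy illustration (the labels and probabilities are invented):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
probs = [0.1, 0.3, 0.4, 0.8]

# Hard predictions at the usual 0.5 threshold: the positive with p=0.4 is missed.
hard_labels = [int(p >= 0.5) for p in probs]

auc_from_probs = roc_auc_score(y_true, probs)         # ranking is perfect
auc_from_labels = roc_auc_score(y_true, hard_labels)  # one threshold only

print(auc_from_probs, auc_from_labels)
```

The probabilities rank every positive above every negative, yet the thresholded labels hide that, which is why the two values differ.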

## Conclusions ##

Either KNN or Random Forest gives us almost perfect accuracy.
Even if the dataset's features make high accuracies easy to reach, it's always important to process the data properly and tune the models correctly, understanding what the program is doing rather than focusing only on better scores. That way we'll know whether our scores are correct or whether we are doing something wrong.