# Mushroom Classification with Categorical Data

### Explicit Goals:
1. Select the model best equipped to classify a mushroom as edible or poisonous from the 22 provided features.
2. Determine the effectiveness of LabelEncoder as a preprocessing method vs. get_dummies.

### Personal Goals:
1. Test various quality-of-life functions that give a first look at how well different classification methods perform.
2. Maintain a strong level of documentation and clarity throughout the kernel.
3. Embrace the Kaggle notebook platform as a way to organize my code as well as display controlled experiments.
4. Continue to get used to the platform and to tackling a project from start to finish.

# Packages Used

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, KFold, ShuffleSplit
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score, recall_score, precision_score, roc_curve, auc, confusion_matrix
from astropy.table import Table, Column
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.base import clone
import matplotlib.pyplot as plt
%matplotlib inline

# Importing and Processing the Data

In [None]:
df = pd.read_csv("../input/mushrooms.csv")
print(df.shape)
df.head()

Every feature in this dataset is categorical, which brings its own set of challenges. To fit models on these features, they must be converted into a usable numeric form. I will convert them both into integers (as used by the most popular kernel for this dataset) and into indicator variables (a more classic statistical approach).
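
As a minimal sketch of the two encodings on a made-up column (the `cap_color` values are illustrative and not rows from the dataset), integer encoding keeps a single column per feature, while indicator encoding creates one column per category:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, not real rows from mushrooms.csv
toy = pd.DataFrame({"cap_color": ["red", "brown", "white", "brown"]})

# Integer encoding: categories are sorted, then numbered 0..n-1
toy_int = LabelEncoder().fit_transform(toy["cap_color"])

# Indicator encoding: one indicator column per category
toy_dum = pd.get_dummies(toy["cap_color"], prefix="cap_color")

print(toy_int)                # one integer per row
print(list(toy_dum.columns))  # one column per category
```

The integer version stays compact, while the indicator version grows with the number of distinct categories.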

In [None]:
df['class'].value_counts()

While the data is technically imbalanced towards edible mushrooms, it isn't by a significant amount. Throughout this analysis I will track not only the accuracy score, but also the recall and precision scores.
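
To make the three scores concrete, here is a small sketch on hypothetical labels, assuming 1 means poisonous and 0 means edible (which is how LabelEncoder's alphabetical ordering would map 'e' and 'p'):

```python
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Hypothetical labels: 1 = poisonous, 0 = edible
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
rec = recall_score(y_true, y_pred)      # share of poisonous mushrooms that were caught
prec = precision_score(y_true, y_pred)  # share of "poisonous" calls that were right
print(acc, rec, prec)
```

Under that labeling, a low recall would mean poisonous mushrooms slipping through as edible, which is arguably the costlier mistake here.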

In [None]:
df.isnull().sum().sum()
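
Note that `isnull()` only counts true NaN values. The UCI description of this dataset reports missing `stalk-root` entries encoded as the string `?`, which a NaN check would not catch. A hedged sketch of a placeholder count on a made-up frame (the `toy` data is hypothetical):

```python
import pandas as pd

# Hypothetical rows that mimic a '?' placeholder for missing values
toy = pd.DataFrame({"stalk-root": ["b", "?", "c", "?"],
                    "odor": ["n", "a", "n", "f"]})

nan_count = toy.isnull().sum().sum()    # NaN-based check finds nothing
placeholder_count = (toy == "?").sum()  # per-column count of '?' entries
print(nan_count)
print(placeholder_count)
```

Both encodings used in this kernel would simply treat `?` as one more category.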

### Conversion of categorical features into indicator variables (get_dummies)

In [None]:
# get_dummies on the whole frame names each column "<feature>_<category>",
# in the same order as building them one feature at a time
df_dum = pd.get_dummies(df)
print(df_dum.shape)

In [None]:
df_dum.head()

### Conversion of categorical features into integer variables

I'm not happy with the practice of using integers as a numerical stand-in for categorical features. This method implies some amount of hierarchy between categories, which is most dangerous in models like logistic regression, where the results depend on how the categories are mapped to integers. It does require fewer columns of data for modeling, which is a distinct advantage for larger datasets.
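
A small sketch of that concern, on hypothetical data where only the middle category is the positive class. With a single integer column, a linear model has to place one threshold along 0, 1, 2 and so cannot isolate the middle value, whereas indicator columns give each category its own coefficient:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: only the middle category "b" is the positive class
cat = pd.Series(["a", "b", "c"] * 30)
y_toy = (cat == "b").astype(int).to_numpy()

# Integer encoding: a=0, b=1, c=2 in a single numeric column
X_int = LabelEncoder().fit_transform(cat).reshape(-1, 1)
# Indicator encoding: one column per category
X_dum = pd.get_dummies(cat).to_numpy(dtype=float)

acc_int = LogisticRegression().fit(X_int, y_toy).score(X_int, y_toy)
acc_dum = LogisticRegression().fit(X_dum, y_toy).score(X_dum, y_toy)
print(acc_int, acc_dum)
```

Relabeling the categories so that "b" gets the largest integer would make the integer version separable, which is exactly the dependence on the mapping described above.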

In [None]:
df_int = pd.DataFrame()
le = LabelEncoder()
for col in df.columns:
    df_int[col] = le.fit_transform( df[col])
df_int.head()

In [None]:
X = df.iloc[:,1:]
y = df.iloc[:,0]
Xd = df_dum.iloc[:,2:]
yd = df_dum.iloc[:,0:2]
Xi = df_int.iloc[:,1:]
yi = df_int.iloc[:,0]
scalerd = StandardScaler()
Xds = scalerd.fit_transform(Xd)
scaleri = StandardScaler()
Xis = scaleri.fit_transform(Xi)

## Summary of Preprocessing:
Dataframes:
* Categorical Features : df
* Dummy Features : df_dum
* Integer Features : df_int

I/O versions:
* Categorical Features : X
* Categorical Response: y
* Dummy Features : Xd
* Dummy Response : yd
* Integer Features: Xi
* Integer Response: yi (1 where y == 'p', 0 where y == 'e')
* Scaled, Dummy Features : Xds, scalerd
* Scaled, Integer Features: Xis, scaleri

# Initial Model Fitting

### Definition of my quality of life functions:
Because I tend to use very similar processes to compare models, not just in this kernel but in other analyses as well, I developed these functions to keep myself organized and to manage cross-validation and shuffling in a systematic way. While they are a bit dense to go through before any analysis is performed, in my own work I would put them in their own .py file and import them (making small edits as necessary depending on the problem).

In [None]:
def d_method( X, y, model, random_state = 0, k = 5 ):
    # Fits a classification model and outputs a cross-validation result of:
    # Accuracy, Recall, Precision, and the model that was fit last.
    # The data is shuffled and train/test split k times (reproducibly, via random_state)
    X = np.asarray(X); y = np.asarray(y)  # positional indexing, even for pandas input
    kf = ShuffleSplit( n_splits = k, random_state = random_state )
    ac = np.zeros( k ); re = np.zeros( k ); pr = np.zeros( k )
    i = 0
    for train_index, test_index in kf.split(X):
        t_model = clone(model)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        t_model.fit( X_train, y_train )
        y_pred = t_model.predict( X_test )
        ac[i] = accuracy_score( y_test, y_pred )
        re[i] = recall_score( y_test, y_pred )
        pr[i] = precision_score( y_test, y_pred )
        i = i+1
    return( ac, re, pr, t_model )
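
For comparison, scikit-learn's built-in `cross_validate` can likely produce the same three scores over a `ShuffleSplit` without a hand-written loop. A minimal sketch on synthetic stand-in data (`X_demo` and `y_demo` are assumptions, not the mushroom features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic stand-in data (an assumption, not the mushroom features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

cv = ShuffleSplit(n_splits=5, random_state=0)
scores = cross_validate(
    LogisticRegression(),
    X_demo,
    y_demo,
    cv=cv,
    scoring=["accuracy", "recall", "precision"],
)

# Each metric is returned as an array with one entry per split
for name in ("accuracy", "recall", "precision"):
    vals = scores["test_" + name]
    print(name, vals, "Mean =", vals.mean())
```

Unlike `d_method`, this does not hand back the last fitted model unless `return_estimator=True` is passed.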

In [None]:
def d_conf( X_test, y_test, t_model, p=0.5 ):
    # Creates a confusion matrix
    # Using a test set of data and a trained model
    ypro = t_model.predict_proba(X_test)
    yp = ypro[:,1] >= p
    cm = confusion_matrix(y_test, yp)
    return(cm)

In [None]:
def d_conf_l( X_test, y_test, t_model):
    # The same as d_conf, only without the threshold for class selection
    yp = t_model.predict(X_test)
    cm = confusion_matrix(y_test, yp)
    return(cm)

In [None]:
def d_roc( X_test, y_test, t_model, points=100):
    # Generates a plot that describes the changes in
    #   Accuracy, recall, precision scores as the threshold
    #   For classification changes.
    #   Using a test set of data and a trained model
    ac_p=np.zeros(points); re_p=np.zeros(points); pr_p=np.zeros(points)
    ypro = t_model.predict_proba(X_test)  # predict once; only the threshold varies below
    for i in np.arange(points):
        yp = ypro[:,1] >= (i/points)
        ac_p[i] = accuracy_score( y_test, yp )
        re_p[i] = recall_score( y_test, yp )
        pr_p[i] = precision_score( y_test, yp )
    t = np.arange(points)/points
    plt.plot(t, ac_p, label='accuracy score')
    plt.plot(t, re_p, label = 'recall score')
    plt.plot(t, pr_p, label = 'precision score')
    plt.legend()
    plt.show()
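
A related built-in is `precision_recall_curve`, which evaluates precision and recall at every distinct predicted probability instead of a fixed grid of thresholds. A minimal sketch, again on synthetic stand-in data rather than the mushroom features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic stand-in data (an assumption, not the mushroom features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)
proba = demo_model.predict_proba(X_demo)[:, 1]  # probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_demo, proba)

# precision and recall carry one extra trailing point (precision 1, recall 0)
print(len(precision), len(recall), len(thresholds))
```

Plotting `precision[:-1]` and `recall[:-1]` against `thresholds` gives a picture similar to the one `d_roc` draws, minus the accuracy line.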

In [None]:
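def d_summary(X, y, model, random_state = 0,k=5,X_test=None,y_test=None,p=0.5,points=100):
    # Runs the d_method, d_conf, and d_roc on a model and dataset.
    # If no test set is given, the confusion matrix and plot use all of X and y.
    ac, re, pr, t_model = d_method( X, y, model, random_state = random_state, k = k )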
    print( "Accuracy Score =  ", ac, " Mean = ", ac.mean() )
    print( "Recall Score =    ", re, " Mean = ", re.mean() )
    print( "Precision Score = ", pr, " Mean = ", pr.mean() )
    if X_test is None:
        X_test = X
    if y_test is None:
        y_test = y
    cm = d_conf( X_test, y_test, t_model, p)
    print("Confusion Matrix:")
    print(cm)
    d_roc( X_test, y_test, t_model, points)

## Logistic Regression

In [None]:
d_summary( Xds, yi, LogisticRegression() )

In [None]:
d_summary(Xis, yi, LogisticRegression() )

Well... this is unexpected, to say the least. To be perfectly clear, the 100% accuracy score does NOT come from an independent validation set that was put aside. It is computed on the full X and y using the last model fit during cross-validation, so most of those rows were also used to train that model and some amount of overfitting is baked in. It is meant only as an at-a-glance estimate.
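
As a hedged illustration of why scoring on training rows is optimistic, the sketch below fits an unpruned decision tree to pure-noise labels (hypothetical data, unrelated to the mushrooms). The tree can memorize its training rows, but it has nothing real to carry over to held-out rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.randint(0, 2, size=1000)  # labels are pure noise

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # scored on rows the tree has already seen
test_acc = tree.score(X_te, y_te)   # scored on rows it has never seen
print(train_acc, test_acc)
```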

Still, this is a fairly strong result for plain logistic regression. To confirm it, I'll run a couple of logistic regressions using a basic train_test_split and output the confusion matrix.

In [None]:
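X_train, X_test, y_train, y_test = train_test_split(Xis, yi, test_size=0.2)
# Cross-validate on the training portion only, so X_test stays unseen by the model
d_summary( X_train, y_train.values, LogisticRegression(), X_test=X_test, y_test=y_test )

In [None]:
X_train, X_test, y_train, y_test = train_test_split(Xds, yi, test_size=0.2)
# Cross-validate on the training portion only, so X_test stays unseen by the model
d_summary( X_train, y_train.values, LogisticRegression(), X_test=X_test, y_test=y_test )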

This confirms the initial results above. While this isn't definitive proof that dummy variables are superior to integer variables for categorical data in general, they certainly appear to be for this dataset.

Since it's a bit awkward to just do logistic regression and be done with it, I'll do some hyper-parameter tuning, since I've been wanting to build a function to handle that for later.

In [None]:
def d_table( X, y, model, A, parameter):
    # Prints mean accuracy, recall, and precision for each value in A
    #   of the specified hyper-parameter.
    #   Take care to pass the parameter name as a string
    print( 'alpha\t\tAccuracy\tRecall\t\tPrecision')
    for alpha in A:
        l_model = clone(model)
        eval("l_model.set_params(" + parameter + "=" + str(alpha) + ")")
        # Use the X and y passed in, not the globals Xds and yi
        ac, re, pr, t_model = d_method( X, y, l_model )
        print( alpha,'\t\t%0.4f\t\t%0.4f\t\t%0.4f' % (ac.mean(), re.mean(), pr.mean() ) )

In [None]:
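reg_strength = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]
# liblinear supports the l1 penalty; the lbfgs default in newer scikit-learn does not
print('Indicator Variables')
d_table( Xds, yi, LogisticRegression(penalty="l1", solver="liblinear"), reg_strength, 'C' )
print('Integer Variables')
d_table( Xis, yi, LogisticRegression(penalty="l1", solver="liblinear"), reg_strength, 'C' )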

Not particularly helpful information, but I felt pretty cool using the eval workaround for setting parameters in the model mid-loop.
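
The same mid-loop parameter setting can likely be done without `eval`, since `set_params` accepts keyword arguments and Python can unpack a dictionary into them. A minimal sketch (`base`, `parameter`, and `tuned` are illustrative names):

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

base = LogisticRegression()
parameter = "C"  # hyper-parameter name as a string

tuned = []
for value in [0.1, 1, 10]:
    m = clone(base)
    m.set_params(**{parameter: value})  # same effect as m.set_params(C=value)
    tuned.append(m.get_params()[parameter])
print(tuned)
```

This avoids building code from strings, and it also works for values that do not survive `str()`, such as string-valued parameters.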

Let's look at some more models, since I still have more Kaggle run-time to kill.

## Decision Tree Classifier

In [None]:
print("Indicator / Dummy")
d_summary( Xds, yi, DecisionTreeClassifier())
print("Integer / LabelEncoder")
d_summary( Xis, yi, DecisionTreeClassifier())

In [None]:
min_samples = np.arange(10) + 3
print("Indicator / Dummy")
d_table( Xds, yi, DecisionTreeClassifier(), min_samples, 'min_samples_split')
print("Integer / LabelEncoder")
d_table( Xis, yi, DecisionTreeClassifier(), min_samples, 'min_samples_split')

In [None]:
print( "Dummy / Indicator")
d_summary( Xds, yi, RandomForestClassifier())
print( "LabelEncoder / Integer")
d_summary( Xis, yi, RandomForestClassifier())

In [None]:
max_features = np.arange(10) + 3
print("Indicator / Dummy")
d_table( Xds, yi, DecisionTreeClassifier(), max_features, 'max_features')
print("Integer / LabelEncoder")
d_table( Xis, yi, DecisionTreeClassifier(), max_features, 'max_features')

In [None]:
print( "Dummy / Indicator")
ac, re, pr, t_model = d_method( Xds, yi, SVC())
cm = d_conf_l(Xds, yi, t_model)
print( "Accuracy Score =  ", ac, " Mean = ", ac.mean() )
print( "Recall Score =    ", re, " Mean = ", re.mean() )
print( "Precision Score = ", pr, " Mean = ", pr.mean() )
print(cm)
print( "LabelEncoder / Integer")
ac, re, pr, t_model = d_method( Xis, yi, SVC())
cm = d_conf_l(Xis, yi, t_model)
print( "Accuracy Score =  ", ac, " Mean = ", ac.mean() )
print( "Recall Score =    ", re, " Mean = ", re.mean() )
print( "Precision Score = ", pr, " Mean = ", pr.mean() )
print(cm)

In [None]:
print( "Dummy / Indicator")
ac, re, pr, t_model = d_method( Xds, yi, KNeighborsClassifier())
cm = d_conf_l(Xds, yi, t_model)
print( "Accuracy Score =  ", ac, " Mean = ", ac.mean() )
print( "Recall Score =    ", re, " Mean = ", re.mean() )
print( "Precision Score = ", pr, " Mean = ", pr.mean() )
print(cm)
print( "LabelEncoder / Integer")
ac, re, pr, t_model = d_method( Xis, yi, KNeighborsClassifier())
cm = d_conf_l(Xis, yi, t_model)
print( "Accuracy Score =  ", ac, " Mean = ", ac.mean() )
print( "Recall Score =    ", re, " Mean = ", re.mean() )
print( "Precision Score = ", pr, " Mean = ", pr.mean() )
print(cm)

In [None]:
print( "Dummy / Indicator")
d_summary( Xds, yi, GaussianNB())
print( "LabelEncoder / Integer")
d_summary( Xis, yi, GaussianNB())

# Summary


While this dataset reached very high accuracy scores with minimal effort, the dummy variables performed slightly better than the integer variables.