# Classification of Mushrooms

This notebook shows how to classify mushrooms as edible or poisonous using several machine learning methods. All the code in this notebook is written in Python 3.6.0. Most of the machine learning methods come from sklearn, an open-source library.

This dataset is taken from Kaggle (https://www.kaggle.com).
The link for the dataset is as follows:
https://www.kaggle.com/uciml/mushroom-classification
This dataset includes descriptions of hypothetical samples corresponding to 23 species of
gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon
Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely
edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was
combined with the poisonous one. All features in the dataset are categorical.
The dataset will be used after being transformed by LabelEncoder or one-hot encoding.
The following list describes the columns, their categorical values, and what those values represent.

**Attribute Information**: (**classes**: edible=e, poisonous=p)
**cap-shape**: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
**cap-surface**: fibrous=f, grooves=g, scaly=y, smooth=s
**cap-color**: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w,
yellow=y
**bruises**: bruises=t, no=f
**odor**: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
**gill-attachment**: attached=a, descending=d, free=f, notched=n
**gill-spacing**: close=c, crowded=w, distant=d
**gill-size**: broad=b, narrow=n
**gill-color**: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p,
purple=u, red=e, white=w, yellow=y
**stalk-shape**: enlarging=e, tapering=t
**stalk-root**: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
**stalk-surface-above-ring**: fibrous=f, scaly=y, silky=k, smooth=s 
**stalk-surface-below-ring**: fibrous=f, scaly=y, silky=k, smooth=s
**stalk-color-above-ring**: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e,
white=w, yellow=y
**stalk-color-below-ring**: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e,
white=w, yellow=y
**veil-type**: partial=p, universal=u
**veil-color**: brown=n, orange=o, white=w, yellow=y
**ring-number**: none=n, one=o, two=t
**ring-type**: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s,
zone=z
**spore-print-color**: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u,
white=w, yellow=y
**population**: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
**habitat**: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import preprocessing # Preprocessing
from sklearn import metrics  # For Evaluation
import matplotlib.pyplot as plt #For Plots
%matplotlib inline

In [None]:
#Importing the dataset 
raw_data=pd.read_csv('../input/mushrooms.csv')

#Display top few rows in the dataset
raw_data.head()

In [None]:
#See the description of different columns
for col in raw_data :
    print()
    print ('Column Name: ',col)
    print(raw_data[col].describe())

In [None]:
# Different values in the columns and its count
for col in raw_data :
    print()
    print ('Column Name: ',col)
    print(raw_data[col].value_counts())

In [None]:
#Split the file into samples and labels
samples = raw_data.drop('class', axis=1) # axis must be passed by keyword in recent pandas
labels = raw_data['class']
print('Samples')
print(samples[:5])
print()
print('Labels')
print(labels[:5])

The column descriptions show that every value is categorical, so the values must be converted to numbers before training. This is done with LabelEncoder, as shown below.

In [None]:

lb=preprocessing.LabelEncoder() #Initializing the encoder

#Encoding the features(columns of samples)
for cols in samples.columns:
    samples[cols] = lb.fit_transform(samples[cols])

print('Samples')
print(samples[:5])
print()

#Encoding the labels
labels=lb.fit_transform(labels)
print('Labels')
print(labels[:5])
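
To make the effect of the encoding concrete, here is a minimal plain-Python sketch of what label encoding and one-hot encoding do to a single column. It is an illustration only, not the sklearn implementation: the helper names `label_encode` and `one_hot_encode` and the sample values are made up for this example. Like LabelEncoder, the sketch numbers the categories in sorted order.

```python
# Plain-Python illustration of label encoding and one-hot encoding.
# The sample values are a made-up cap-shape column, not rows from the dataset.

def label_encode(values):
    """Map each distinct category to an integer, numbering categories in sorted order."""
    classes = sorted(set(values))
    index = {category: i for i, category in enumerate(classes)}
    return [index[v] for v in values], classes

def one_hot_encode(values):
    """Return one 0/1 indicator list per value, with one position per category."""
    codes, classes = label_encode(values)
    rows = [[1 if code == j else 0 for j in range(len(classes))] for code in codes]
    return rows, classes

cap_shape = ['x', 'b', 'x', 'f', 's']
codes, classes = label_encode(cap_shape)
one_hot, _ = one_hot_encode(cap_shape)

print(classes)     # ['b', 'f', 's', 'x']
print(codes)       # [3, 0, 3, 1, 2]
print(one_hot[0])  # [0, 0, 0, 1]
```

Integer codes imply an ordering (here b < f < s < x) that the categories do not really have. One-hot encoding avoids this at the cost of one column per category.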

Now that encoding is done, it is time to split the data into training, validation, and test sets. Holding data out of training makes it possible to detect a model that overfits the training data. The validation set is used to select the models that perform well on unseen data after being trained on the training set. One problem remains: information from the validation set can bleed into training. This happens when the parameters are tweaked repeatedly to maximize performance on the validation set, so that the model is indirectly fitted to the validation set.
A test set that is kept isolated from both training and model selection is therefore required. Performance on this set will be the final deciding factor in selecting the best model.

In [None]:
from sklearn.model_selection import train_test_split # for splitting the data

# Normally data is split into 70%, 15% and 15% for training, validation and testing respectively

X_train, X_valid, Y_train, Y_valid = train_test_split(samples, labels, test_size=0.30, random_state=42)
X_valid, X_test, Y_valid, Y_test = train_test_split(X_valid, Y_valid, test_size=0.50, random_state=42)
print('Training data count : {}'.format(len(X_train)))
print('Validation data count : {}'.format(len(X_valid)))
print('Testing data count : {}'.format(len(X_test)))
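
The two calls to train_test_split above amount to shuffling the rows once and cutting them into three parts. The sketch below shows that idea in plain Python on row indices. The function name `three_way_split` and the row count of 1000 are hypothetical, and the rounding rule is a simplification that may differ by a row or two from what sklearn produces.

```python
import random

def three_way_split(n_rows, valid_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle row indices once and cut them into train/validation/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(n_rows * test_frac)
    n_valid = round(n_rows * valid_frac)
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test

# Hypothetical dataset of 1000 rows
train_idx, valid_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(valid_idx), len(test_idx))  # 700 150 150
```

Because the three index lists never overlap, no test row can influence training or model selection.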

Now let's set a benchmark model against which the other models can be compared. Since this is a binary classification problem, the simplest model is one that classifies every sample as poisonous, i.e. 1. Roughly half of the samples will then be classified correctly.

In [None]:
# Making predictions as 1(poisonous)
Y_pred =np.ones_like(Y_test)
print('Precision: {0:.2f}'.format(metrics.precision_score(Y_test,Y_pred)))
print('Accuracy: {0:.2f}'.format(metrics.accuracy_score(Y_test,Y_pred)))
print('Recall: {0:.2f}'.format(metrics.recall_score(Y_test,Y_pred)))
print('F1 score: {0:.2f}'.format(metrics.f1_score(Y_test,Y_pred)))

The scores above serve as the baseline: a useful model should do better than these values. For this problem it is essential that poisonous mushrooms are classified correctly. More weight will therefore be given to precision and F1 score than to accuracy.
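
To see how the four scores relate for an all-poisonous baseline, the following sketch computes them from confusion-matrix counts in plain Python. It is a hand-rolled illustration rather than the sklearn code; the function name `binary_metrics` and the ten labels are made up for the example, with 1 meaning poisonous as in the encoded labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, accuracy and F1 for labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # poisonous, flagged poisonous
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # edible, flagged poisonous
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # poisonous, flagged edible
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # edible, flagged edible
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'precision': precision, 'recall': recall, 'accuracy': accuracy, 'f1': f1}

# Made-up labels: 4 poisonous (1) and 6 edible (0)
y_true = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1] * len(y_true)  # baseline: predict poisonous for everything

baseline = binary_metrics(y_true, y_pred)
print(baseline)
```

For this kind of baseline, recall is 1.0 by construction, while precision and accuracy both equal the share of poisonous samples (0.4 in this made-up example).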

The following models are considered for training: 1) Logistic Regression 2) Decision Tree Classifier 3) Random Forest 4) Support Vector Machines 5) AdaBoost Classifier 6) xgBoost Classifier 7) Stochastic Gradient Descent

In [None]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier



models ={'Logistic Regression':LogisticRegression(),'Decision Tree Classifier':DecisionTreeClassifier(),
         'Random Forest':RandomForestClassifier(),'Support Vector Machines':SVC(),'AdaBoost Classifier':AdaBoostClassifier(),
         'Stochastic Gradient Descent':SGDClassifier(),'xgBoost Classifier':XGBClassifier()}



#Training models on the training set and evaluating performance on the validation set
scores_precision=[]
scores_acc=[]
scores_recall=[]
scores_f1score=[]
names=[]
for name,model in models.items():
    model.fit(X_train,Y_train)
    names.append(name)
    precision=metrics.precision_score(Y_valid,model.predict(X_valid))
    scores_precision.append(precision)
    acc=metrics.accuracy_score(Y_valid,model.predict(X_valid))
    scores_acc.append(acc)
    recall=metrics.recall_score(Y_valid,model.predict(X_valid))
    scores_recall.append(recall)
    f1score=metrics.f1_score(Y_valid,model.predict(X_valid))
    scores_f1score.append(f1score)
   
    dataframe = pd.DataFrame({'Models':names,'Precision':scores_precision,'F1-score':scores_f1score,
                              'Accuracy':scores_acc,'Recall':scores_recall})
    
cols = list(dataframe)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('Models')))
dataframe = dataframe.loc[:, cols] # .ix was removed from pandas; .loc selects the columns by label
dataframe

From the results above it can be concluded that Stochastic Gradient Descent and Logistic Regression perform worst on this dataset, while the rest of the classifiers perform exceptionally well. As previously specified, the model will be chosen based on precision and F1 score. Thus any of the following models could be used: Decision Tree Classifier, AdaBoost Classifier, Support Vector Machines, Random Forest, xgBoost Classifier. These results were obtained without using cross-validation to optimize the models.

The models are trained on a large number of features, and not all of them may contribute to the classification. This dataset has 22 features. To check whether these features actually contribute to the classification we will use PCA (Principal Component Analysis), available from sklearn.decomposition. We first import PCA, fit it with all 22 components, and plot the explained variance ratios as a bar graph. We then calculate how many components are required to retain 99.5% of the variance in the data.

In [None]:
from sklearn.decomposition import PCA #Importing PCA
pca=PCA(n_components=22) # Initializing PCA
pca.fit(X_train) 
var_ratio=pca.explained_variance_ratio_
print('Explained variance ratio of the first 5 components:',var_ratio[:5])

In [None]:
#Plot showing each principal component's contribution to the variance
with plt.style.context('dark_background'):
    plt.figure(figsize=(10, 8))
    
    plt.bar(range(22), var_ratio, alpha=0.5, align='center',
            label='Individual explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()

In [None]:
from xgboost import plot_importance
plot_importance(models['xgBoost Classifier'])

Thus it can be concluded that odor is the most important feature used for classification.

In [None]:
#Now calculating the number of components required to retain 99.5% of the data's variance
add =0.0
count =0
for i in range(22):
    count+= 1
    add+= var_ratio[i]
    if add*100 > 99.5:
        break
        
print('{} features are required to contain {} % of variance in data'.format(count,add*100))  
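
The loop above is a running sum over the explained variance ratios that stops once a threshold is crossed. A compact way to express the same idea, assuming nothing beyond the standard library, is sketched below. The function name `components_for_variance` and the ratios in `example_ratio` are hypothetical and are not the values computed from the mushroom data.

```python
from itertools import accumulate

def components_for_variance(var_ratio, threshold):
    """Return (count, total) for the fewest leading components whose ratios sum above threshold."""
    for count, total in enumerate(accumulate(var_ratio), start=1):
        if total > threshold:
            return count, total
    # Threshold never exceeded: every component is needed
    return len(var_ratio), sum(var_ratio)

# Hypothetical explained variance ratios, sorted in descending order as PCA returns them
example_ratio = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
n_components, retained = components_for_variance(example_ratio, 0.95)
print(n_components, round(retained, 2))  # 5 0.96
```

With the real data, `pca.explained_variance_ratio_` and a threshold of 0.995 would take the place of the made-up inputs.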

Now that we have determined that 16 components are enough to retain 99.5% of the variance, we will transform the data into 16 principal components and retrain the models on these components.

In [None]:
#Transforming 
pca=PCA(n_components=16) # Initializing PCA
pca.fit(X_train)
X_train_pca= pca.transform(X_train)
X_valid_pca= pca.transform(X_valid)# for validation purposes
print(X_train_pca[:5])

In [None]:
#Retraining the models on X_train_pca features and validating on X_valid_pca
scores_precision=[]
scores_acc=[]
scores_recall=[]
scores_f1score=[]
names=[]
for name,model in models.items():
    model.fit(X_train_pca,Y_train)
    names.append(name)
    precision=metrics.precision_score(Y_valid,model.predict(X_valid_pca))
    scores_precision.append(precision)
    acc=metrics.accuracy_score(Y_valid,model.predict(X_valid_pca))
    scores_acc.append(acc)
    recall=metrics.recall_score(Y_valid,model.predict(X_valid_pca))
    scores_recall.append(recall)
    f1score=metrics.f1_score(Y_valid,model.predict(X_valid_pca))
    scores_f1score.append(f1score)
    dataframe = pd.DataFrame({'Models':names,'Precision':scores_precision,'F1-score':scores_f1score,
                              'Accuracy':scores_acc,'Recall':scores_recall})
    
cols = list(dataframe)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('Models')))
dataframe = dataframe.loc[:, cols] # .ix was removed from pandas; .loc selects the columns by label
dataframe

The results above show that, without any optimization, the best models for this scenario are Random Forest and Support Vector Machines, followed by XGBoost, Decision Tree Classifier, and AdaBoost Classifier. We will therefore select them for final testing on the test set. We will also try to optimize them with the help of GridSearchCV.

In [None]:
#Importing Grid Search CV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer,precision_score

#Initializing few models for Grid Search CV
svr =SVC()
ran=RandomForestClassifier()
xgb=XGBClassifier()
ada=AdaBoostClassifier()
dt=DecisionTreeClassifier()

#parameters for Support Vector Machine
parameters_svr = {'C':[0.1,0.3,0.9,1.0,3.0,9.0,10.0],
                  'kernel':('linear','rbf'),
                  'random_state':[1,42,56,32,15]}
#parameters for Random_forest
parameters_ran = {'n_estimators':[2,5,7,10,12,15,18,20],'random_state':[1,42,56,32,15]}

#parameters for XG Boost Classifier
parameters_xgb = {'learning_rate':[0.01,0.03,0.09,0.1,0.3,0.9,1],'booster':('gbtree','gblinear','dart'),
                  'random_state':[1,15,2,3,48,42]}

#parameters for Decision Tree Classifier
parameters_dt = {'min_samples_split':[2,7,10], 'random_state':[1,42,56,32,15]}

#parameters for AdaBoost Classifier
parameters_ada= {'learning_rate':[0.01,0.03,0.09,0.1,0.3,0.9,1],'n_estimators':[10,30,50,70],
                'random_state':[1,42,56,32,15]}

#Defining precision as the metric for comparison
scorer = make_scorer(precision_score)
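
A grid search tries every combination of the listed values, so the cost grows with the product of the list lengths. The sketch below expands the SVM grid from the cell above in plain Python to show how many candidates that is. The helper `expand_grid` is made up for this illustration, and the fold count of 5 is an assumption: the actual number of folds depends on the `cv` argument and the sklearn version in use.

```python
from itertools import product

def expand_grid(param_grid):
    """List every parameter combination in a grid, one dict per candidate."""
    keys = list(param_grid)
    return [dict(zip(keys, combo)) for combo in product(*(param_grid[k] for k in keys))]

# Same grid as parameters_svr in the cell above
parameters_svr = {'C': [0.1, 0.3, 0.9, 1.0, 3.0, 9.0, 10.0],
                  'kernel': ('linear', 'rbf'),
                  'random_state': [1, 42, 56, 32, 15]}

candidates = expand_grid(parameters_svr)
n_folds = 5  # assumed number of cross-validation folds
print(candidates[0])                 # {'C': 0.1, 'kernel': 'linear', 'random_state': 1}
print(len(candidates))               # 70 candidates (7 * 2 * 5)
print(len(candidates) * n_folds)     # 350 model fits under the 5-fold assumption
```

This count explains why the SVM search is the slowest of the five: each of the candidates is refitted once per fold.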

In [None]:
#Calculating best parameters for SVM
clf = GridSearchCV(svr,parameters_svr,scoring=scorer)
clf.fit(X_train_pca,Y_train)
print('SVM best Parameters: ')
print(clf.best_params_)
print('')

In [None]:
#Calculating best parameters for Random Forest
clf = GridSearchCV(ran,parameters_ran,scoring=scorer)
clf.fit(X_train_pca,Y_train)
print('Random Forest best Parameters: ')
print(clf.best_params_)
print('')

In [None]:
#Calculating best parameters for xgBoost Classifier
clf = GridSearchCV(xgb,parameters_xgb,scoring=scorer)
clf.fit(X_train_pca,Y_train)
print('XG Boost best Parameters: ')
print(clf.best_params_)
print('')

In [None]:
#Calculating best parameters for DecisionTree Classifier
clf = GridSearchCV(dt,parameters_dt,scoring=scorer)
clf.fit(X_train_pca,Y_train)
print('DecisionTree best Parameters: ')
print(clf.best_params_)
print('')

In [None]:
#Calculating best parameters for AdaBoost Classifier
clf = GridSearchCV(ada,parameters_ada,scoring=scorer)
clf.fit(X_train_pca,Y_train)
print('AdaBoost best Parameters: ')
print(clf.best_params_)
print('')

Now that we have the optimal parameters from the grids defined above, let us retrain these models with the optimal parameters and then test them on the test set that was set aside earlier.

In [None]:
model_optimal={'Random Forest':RandomForestClassifier(n_estimators=7,random_state=1),
               'Support Vector Machines':SVC(kernel='rbf',random_state=1,C=3),
               'XG Boost Classifier':XGBClassifier(random_state=1, booster='gbtree', learning_rate= 0.9),
               'Decision Tree Classifier': DecisionTreeClassifier(min_samples_split= 2, random_state= 42),
               'AdaBoost Classifier': AdaBoostClassifier(random_state= 1, n_estimators= 70, learning_rate= 1)}
scores_precision=[]
scores_acc=[]
scores_recall=[]
scores_f1score=[]
names=[]
for name,model in model_optimal.items():
    model.fit(X_train_pca,Y_train)
    names.append(name)
    precision=metrics.precision_score(Y_valid,model.predict(X_valid_pca))
    scores_precision.append(precision)
    acc=metrics.accuracy_score(Y_valid,model.predict(X_valid_pca))
    scores_acc.append(acc)
    recall=metrics.recall_score(Y_valid,model.predict(X_valid_pca))
    scores_recall.append(recall)
    f1score=metrics.f1_score(Y_valid,model.predict(X_valid_pca))
    scores_f1score.append(f1score)
    dataframe = pd.DataFrame({'Models':names,'Precision':scores_precision,'F1-score':scores_f1score,
                              'Accuracy':scores_acc,'Recall':scores_recall})
    
cols = list(dataframe)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('Models')))
dataframe = dataframe.loc[:, cols] # .ix was removed from pandas; .loc selects the columns by label
dataframe

Now let's test the models on the test data.

In [None]:
#Transforming X_test to PCA components
X_test_pca = pca.transform(X_test)


scores_precision=[]
scores_acc=[]
scores_recall=[]
scores_f1score=[]
names=[]
for name,model in model_optimal.items():
    model.fit(X_train_pca,Y_train)
    names.append(name)
    precision=metrics.precision_score(Y_test,model.predict(X_test_pca))
    scores_precision.append(precision)
    acc=metrics.accuracy_score(Y_test,model.predict(X_test_pca))
    scores_acc.append(acc)
    recall=metrics.recall_score(Y_test,model.predict(X_test_pca))
    scores_recall.append(recall)
    f1score=metrics.f1_score(Y_test,model.predict(X_test_pca))
    scores_f1score.append(f1score)
    dataframe = pd.DataFrame({'Models':names,'Precision':scores_precision,'F1-score':scores_f1score,
                              'Accuracy':scores_acc,'Recall':scores_recall})
    
cols = list(dataframe)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('Models')))
dataframe = dataframe.loc[:, cols] # .ix was removed from pandas; .loc selects the columns by label
dataframe


Thus it can be concluded that the best models for classification after applying PCA are Support Vector Machines, Random Forest, and xgBoost. In terms of training time, Decision Tree is the fastest and SVM takes the longest to train. If training time is not a constraint, SVM can be considered the best model for this type of problem: it has consistently given outstanding performance, with perfect results on every metric in both validation and testing. If training time is a constraint, Random Forest is the next best classifier.
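
The notebook does not record training times, so the comparison above rests on observation while running the cells. One way to measure it is sketched below with the standard library only. The helper `time_call` is hypothetical and not part of the original code; the demo times a stand-in function so that the block runs on its own.

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over several calls to fn."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload; in the notebook this could be, for example,
# time_call(model.fit, X_train_pca, Y_train) for each model in model_optimal
def workload(n):
    return sum(i * i for i in range(n))

elapsed = time_call(workload, 100000)
print('Best of 3 runs: {:.6f} s'.format(elapsed))
```

Taking the best of a few runs reduces the effect of other processes on the measurement.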


Code Reference :
https://www.kaggle.com/nirajvermafcb/comparing-various-ml-models-roc-curve-comparison
https://www.kaggle.com/monkeydunkey/a-comparison-of-few-ml-models