**Xintong Li**

**Python Classification Project**

**11/24/2020**

# Mushroom Classification

1. Introduction
2. Libraries
3. Exploratory Analysis
    * 3a. Statistical Summary
    * 3b. Bar Plot
    * 3c. Correlation Heatmap
4. Test & Train Split
5. Machine Learning Models
6. Experiment Results & Next Steps
7. Methods to Improve Models
    * 7a. Parameter Tuning
    * 7b. Comparing results before and after Parameter Tuning
8. Conclusion

## 1. Introduction

As someone who loves mushrooms, I want to explore the key features of a poisonous mushroom. I will use exploratory analysis to look for interesting patterns or relationships between each feature and the target variable (classes: edible=e, poisonous=p). On top of that, I will apply different machine learning algorithms to find the best model for predicting whether a mushroom is edible or poisonous. The models I will use are KNN, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, Stochastic Gradient Descent, and AdaBoost.

The mushroom data set I found is from Kaggle: https://www.kaggle.com/uciml/mushroom-classification. It includes 23 columns (the class label plus 22 features) and 8124 rows. Here is the list of feature descriptions from Kaggle:

**Attribute Information: (classes: edible=e, poisonous=p)**
* cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
* cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
* cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
* bruises: bruises=t,no=f
* odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
* gill-attachment: attached=a,descending=d,free=f,notched=n
* gill-spacing: close=c,crowded=w,distant=d
* gill-size: broad=b,narrow=n
* gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
* stalk-shape: enlarging=e,tapering=t
* stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
* stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
* stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
* stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
* stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
* veil-type: partial=p,universal=u
* veil-color: brown=n,orange=o,white=w,yellow=y
* ring-number: none=n,one=o,two=t
* ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
* spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
* population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
* habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

## 2. Libraries

In [None]:
# General libraries
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error 
from math import sqrt

# Models 
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score 
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

# Visualization
import seaborn as sns 
import matplotlib.pyplot as plt 
import graphviz
from sklearn import tree


%matplotlib inline
sns.set(color_codes=True)

## 3. Exploratory Data Analysis 

In [None]:
# read file
df = pd.read_csv("../input/mushroom-classification/mushrooms.csv")

### 3a. Statistical Summary

In [None]:
# display top 5 rows 
df.head()

In [None]:
# display bottom 5 rows
df.tail()

In [None]:
# checking the data type
df.dtypes

In [None]:
# checking duplicated data
print("shape: ", df.shape)
duplicate_rows = df[df.duplicated()]
# print the number of duplicated rows, not the rows themselves
print("number of duplicated rows: ", duplicate_rows.shape[0])

### no duplicated rows

In [None]:
# checking null values
print(df.isnull().sum())

### no null values
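The attribute list above marks missing stalk-root values with `?`, and `isnull()` does not count that placeholder because it is an ordinary string. A minimal sketch of counting it, using a small toy frame with hypothetical values rather than the real file:

```python
import pandas as pd

# Toy frame standing in for the mushroom data (hypothetical values)
toy = pd.DataFrame({
    "stalk-root": ["b", "?", "e", "?", "c"],
    "odor": ["n", "f", "n", "a", "l"],
})

# isnull() finds nothing here because '?' is a regular string, not NaN
null_counts = toy.isnull().sum()

# Count the '?' placeholder per column instead
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)
```

Running the same comparison on `df` would show how many rows carry the placeholder before deciding whether to keep it as its own category.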

In [None]:
# find unique values for each variable  
df.nunique()

I will remove veil-type since it only has 1 unique value. It will not be very useful for this project. 
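The same check can be done programmatically instead of reading the `nunique()` output by eye. This is only a sketch on a toy frame with made-up values; `constant_cols` and `toy_reduced` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Toy frame: 'veil-type' has a single value, like in the real data set
toy = pd.DataFrame({
    "veil-type": ["p", "p", "p", "p"],
    "odor": ["n", "f", "n", "a"],
    "class": ["e", "p", "e", "e"],
})

# Columns with exactly one unique value carry no information for a model
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Drop them all in one step
toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```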

In [None]:
df = df.drop(['veil-type'], axis=1)

In [None]:
# use labelencoding to convert categorical data to numerical 
from sklearn.preprocessing import LabelEncoder
df_encoded = df.copy()
le=LabelEncoder()
for col in df_encoded.columns:
    df_encoded[col] = le.fit_transform(df_encoded[col])

df_encoded.head()

In [None]:
df_encoded.describe()

### 3b. Bar Plots

In this part, I'm using bar plots to understand how each feature differs across our target variable (hue = 'class').

According to these plots, the counts of edible and poisonous mushrooms are very similar, which helps prevent bias towards one class. Mushrooms with cap-shape = b are more often edible, whereas those with cap-shape = k are more often poisonous. Mushrooms with cap-surface = f are more often edible.

There is a significant difference in gill-color = buff between edible and poisonous mushrooms, which suggests that mushrooms with buff gills are more likely to be poisonous. The same applies when stalk-surface-above-ring = k.

Spore-print-color also differs significantly between edible and poisonous mushrooms: k and n have higher edible counts, whereas h and w have higher poisonous counts.
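Reading proportions off count plots can be imprecise, so a cross-tabulation is one way to put numbers on these observations. The sketch below uses a toy frame with hypothetical values; on the real data the same `pd.crosstab` call would take `df['odor']` (or any other feature) and `df['class']`:

```python
import pandas as pd

# Toy data with hypothetical values, not the real mushroom file
toy = pd.DataFrame({
    "class": ["e", "e", "p", "p", "p", "e"],
    "odor":  ["n", "n", "f", "f", "n", "a"],
})

# Share of edible vs. poisonous within each odor category (rows sum to 1)
share = pd.crosstab(toy["odor"], toy["class"], normalize="index")
print(share.round(2))
```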

In [None]:
# barplots with hue of class (e,p)
sns.catplot(x='class',kind='count',palette='ch:.25',data=df)
sns.catplot(x='cap-shape',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='cap-surface',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='bruises',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='odor',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='gill-attachment',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='gill-spacing',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='gill-size',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='gill-color',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='stalk-shape',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='stalk-root',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='stalk-surface-above-ring',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='stalk-surface-below-ring',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='stalk-color-above-ring',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='stalk-color-below-ring',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='veil-color',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='ring-number',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='ring-type',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='spore-print-color',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='population',kind='count',palette='ch:.25',data=df,hue='class')
sns.catplot(x='habitat',kind='count',palette='ch:.25',data=df,hue='class')

### 3c. Correlation Matrix

In this part, I created a correlation matrix heatmap to get a better understanding of the relationships between the features. Based on the heatmap, veil-color and gill-attachment are highly positively correlated, with a correlation of 0.9. Two pairs are highly negatively correlated, each with a correlation of -0.53: gill-spacing with population, and gill-color with the target variable. Ring-type and bruises also have a high negative correlation with our target variable.
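To read the target-related correlations without scanning the whole heatmap, the `class` column of the correlation matrix can be sorted by absolute value. This is a sketch on a toy encoded frame with made-up values; note that label-encoded codes are arbitrary integers, so these correlations are only a rough guide:

```python
import pandas as pd

# Toy label-encoded frame (hypothetical values)
toy = pd.DataFrame({
    "class":     [0, 0, 1, 1, 0, 1],
    "gill-size": [0, 0, 1, 1, 0, 1],
    "bruises":   [1, 1, 0, 0, 1, 1],
    "habitat":   [0, 1, 2, 0, 1, 2],
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["class"].drop("class")

# Order by strength of the relationship, keeping the sign
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked.round(2))
```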

In [None]:
matrix = np.triu(df_encoded.corr())
plt.subplots(figsize=(20,15))
sns.heatmap(df_encoded.corr(), annot=True, mask=matrix, xticklabels=True, yticklabels=True)

## 4. Train and Test Split
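The split in this section is unseeded, so the reported scores change from run to run. As a sketch of a reproducible alternative, `train_test_split` accepts `random_state` and `stratify`; the toy frame and the seed value 42 are assumptions for illustration only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy encoded frame with a balanced 0/1 target (hypothetical values)
toy = pd.DataFrame({
    "class": [0, 1] * 10,
    "f1": range(20),
    "f2": [i % 3 for i in range(20)],
})
X_toy = toy.drop(columns="class")
y_toy = toy["class"]

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)
print(len(X_tr), len(X_te))
```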

In [None]:
# split data into x and y
x = df_encoded.iloc[:, 1:]
# 1-D target avoids scikit-learn's column-vector warning when fitting
y = df_encoded.iloc[:, 0]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20)

## 5. Machine Learning Models

### KNN Classification

In [None]:
rmse_val = []  # to store RMSE values for different k
for K in range(1, 21):
    model = neighbors.KNeighborsRegressor(n_neighbors=K)

    model.fit(X_train, y_train)  # fit the model
    pred = model.predict(X_test)  # make prediction on test set
    error = sqrt(mean_squared_error(y_test, pred))  # calculate RMSE
    rmse_val.append(error)  # store RMSE values
    print('RMSE value for k =', K, 'is:', error)

In [None]:
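# index by k (1..20) so the x-axis shows k instead of the list position
curve = pd.DataFrame(rmse_val, index=range(1, len(rmse_val) + 1))
curve.plot()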

When k = 2, RMSE has the smallest value of 0.021, so k = 2 should give us the best model.
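The search above scores a KNN regressor by RMSE on the test set. An alternative, sketched here on synthetic data from `make_classification` rather than the mushroom data, is to score the classifier itself with cross-validated accuracy so the test set stays untouched while choosing k:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: any binary classification set works)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Mean 5-fold accuracy for each candidate k
cv_accuracy = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_accuracy[k] = scores.mean()

# Pick the k with the highest mean accuracy
best_k = max(cv_accuracy, key=cv_accuracy.get)
print("best k:", best_k, "mean CV accuracy:", round(cv_accuracy[best_k], 3))
```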

In [None]:
classifier = KNeighborsClassifier(n_neighbors=2)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

In [None]:
print("KNN accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred).round(3))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

### Logistic Regression

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

In [None]:
print("Logistic Regression accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred).round(3))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

### Support Vector Machine

In [None]:
clf_scm = svm.SVC(kernel='linear') # Linear Kernel
clf_scm.fit(X_train, y_train)
y_pred = clf_scm.predict(X_test)

In [None]:
print("SVM accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred).round(3))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

### Decision Tree

In [None]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [None]:
tree_clf = DecisionTreeClassifier()

parameters = {"min_samples_split":[50,100,200],
             "max_depth":[1,5,10,50,100]}

tree_clf.fit(X_train,y_train) # fit the model

In [None]:
# find the best model use scoring = balanced_accuracy
grid_cv = GridSearchCV(estimator=tree_clf, param_grid=parameters, cv=5,scoring="balanced_accuracy") 
grid_cv.fit(X_train,y_train) 

print_results(grid_cv)

In [None]:
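# predict with the best tree found by the grid search
y_pred = grid_cv.best_estimator_.predict(X_test)
print("Decision Tree accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred).round(3))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))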

In [None]:
dot_data = tree.export_graphviz(grid_cv.best_estimator_, out_file=None, filled=True)

graph = graphviz.Source(dot_data, format="png") 
graph

### Random Forest

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)

In [None]:
print("Random Forest accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred).round(3))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

### Stochastic Gradient Descent

In [None]:
clf_SGD = SGDClassifier()
clf_SGD.fit(X_train, y_train)
y_pred = clf_SGD.predict(X_test)

In [None]:
print("SGD accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred).round(3))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

### AdaBoost 

In [None]:
ac = AdaBoostClassifier()
ac.fit(X_train,y_train)
y_pred = ac.predict(X_test)

In [None]:
print("AdaBoost accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred).round(3))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

## 6. Experiment Results

After applying these models, I got some great results. KNN, Random Forest, and AdaBoost predict the testing data set perfectly, with an accuracy and macro average of 1. Stochastic Gradient Descent has the worst performance of all, with an accuracy and macro average of 0.94.

As a next step, I will apply hyperparameter tuning to the models and compare the results between the different methods. My goal is to discover whether these methods help my models perform better.
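As a rough sketch of the tuning workflow, `GridSearchCV` tries every parameter combination with cross-validation and, with its default `refit=True`, retrains the best one on the full training set, so `best_estimator_` can be evaluated directly. The synthetic data, the grid values, and `max_iter=1000` are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, not the mushroom data set
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Candidate regularization strengths (illustrative values)
param_grid = {"C": [0.01, 0.1, 1.0, 10, 100]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="balanced_accuracy",
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr, so it can be scored as-is
test_accuracy = search.best_estimator_.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```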

## 7. Methods to Improve Models

### 7 a. Applying Hyperparameter Tuning

#### 7 a a. Logistic Regression

In [None]:
lr_clf = LogisticRegression()

parameters = {"solver":['newton-cg', 'lbfgs', 'liblinear'],
              "penalty": ['l2'],
              "C": [100, 10, 1.0, 0.1, 0.01]}

lr_clf.fit(X_train,y_train) 

grid_cv1 = GridSearchCV(estimator=lr_clf, param_grid=parameters, cv=5,scoring="balanced_accuracy") 
grid_cv1.fit(X_train,y_train) 

print_results(grid_cv1)

In [None]:
lr_clf = LogisticRegression(C = 100,penalty = 'l2',solver='newton-cg')
lr_clf= lr_clf.fit(X_train,y_train)
y_pred1 = lr_clf.predict(X_test)

#### 7 a b. Support Vector Machine

In [None]:
from sklearn.svm import SVC

In [None]:
svm_clf = SVC()

parameters = {
    'C': [0.1, 0.5, 1, 2, 5, 10, 20],
    'gamma': ['scale', 'auto', 1, 2, 3]
    }

svm_clf.fit(X_train,y_train) 

grid_cv2 = GridSearchCV(estimator=svm_clf, param_grid=parameters, cv=5,scoring="balanced_accuracy") 
grid_cv2.fit(X_train,y_train) 

print_results(grid_cv2)

In [None]:
svm_clf = SVC(kernel ='rbf', C = 2, gamma = 'auto')
svm_clf= svm_clf.fit(X_train,y_train)
y_pred2 = svm_clf.predict(X_test)

#### 7 a c. Decision Tree

In [None]:
tree_clf = DecisionTreeClassifier()

parameters = {"min_samples_split":[50,100,200],
             "max_depth":[1,5,10,50,100]}

tree_clf.fit(X_train,y_train) 

In [None]:
# use a separate name so the Logistic Regression search (grid_cv1) is not overwritten
grid_cv3 = GridSearchCV(estimator=tree_clf, param_grid=parameters, cv=5,scoring="balanced_accuracy") 
grid_cv3.fit(X_train,y_train) 

print_results(grid_cv3)

In [None]:
clf1 = DecisionTreeClassifier(max_depth=10,min_samples_split=50)
clf1= clf1.fit(X_train,y_train)
y_pred3 = clf1.predict(X_test)

#### 7 a d. Random Forest

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train,y_train)

rf_parameters = {"min_samples_split":[50,100,200],
             "max_depth":[1,5,10,50,100],
             "n_estimators":[5, 50, 250, 500]}

grid_cv4 = GridSearchCV(estimator=rf, param_grid=rf_parameters, cv=5,scoring="balanced_accuracy") 
grid_cv4.fit(X_train,y_train) 

print_results(grid_cv4)

In [None]:
rf_clf = RandomForestClassifier(max_depth=10,min_samples_split=50,n_estimators=50)
rf_clf = rf_clf.fit(X_train,y_train)
y_pred4 = rf_clf.predict(X_test)

#### 7 a e. Stochastic Gradient Descent

In [None]:
sgd = SGDClassifier()
sgd.fit(X_train,y_train)

parameters = {'alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3], 
                'penalty': ['l2'],
                'n_jobs': [-1]}

grid_cv5 = GridSearchCV(estimator=sgd, param_grid=parameters, cv=5,scoring="balanced_accuracy") 
grid_cv5.fit(X_train,y_train) 

print_results(grid_cv5)

In [None]:
sgd_clf = SGDClassifier(alpha=0.001, n_jobs=-1, penalty= 'l2')
sgd_clf = sgd_clf.fit(X_train,y_train)
y_pred5 = sgd_clf.predict(X_test)

#### 7 a f. Adaboost

In [None]:
ab = AdaBoostClassifier()
ab.fit(X_train,y_train)

parameters = {'n_estimators':[500,1000,2000],
              'learning_rate':[.001,0.01,.1]}

grid_cv6 = GridSearchCV(estimator=ab, param_grid=parameters, cv=5,scoring="balanced_accuracy") 
grid_cv6.fit(X_train,y_train) 

print_results(grid_cv6)

In [None]:
ab = AdaBoostClassifier(n_estimators=1000,learning_rate=0.1)
ab = ab.fit(X_train,y_train)
y_pred6 = ab.predict(X_test)

### 7 b. Comparing Results: before and after parameter tuning
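One way to make a before/after comparison easier to read is to collect accuracy and macro F1 for both versions of a model in a single table. The sketch below uses synthetic data and a decision tree with hypothetical "tuned" parameters; it does not reproduce the numbers reported in this section:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the mushroom data set
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Default model vs. a model with hypothetical tuned parameters
models = {
    "default": DecisionTreeClassifier(random_state=0),
    "tuned": DecisionTreeClassifier(max_depth=5, min_samples_split=50, random_state=0),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, pred),
        "macro_f1": f1_score(y_te, pred, average="macro"),
    })

# One row per model, one column per metric
comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```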

#### 7 b a. Logistic Regression

After applying parameter tuning, f1-score and accuracy for Logistic Regression increased by 0.02.

In [None]:
print("Logistic Regression accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred1))
print(classification_report(y_test, y_pred1))

#### 7 b b. Support Vector Machine

f1-score and accuracy increased by 0.03. 

In [None]:
print("SVM accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred2))
print(classification_report(y_test, y_pred2))

#### 7 b c. Decision Tree

f1-score and accuracy increased by 0.02. 

In [None]:
print("Decision Tree accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred3))
print(classification_report(y_test, y_pred3))

#### 7 b d. Random Forest

Accuracy score on testing set decreased by 0.03. However, f1-score stayed the same.

In [None]:
# report the Random Forest predictions (y_pred4), not the Logistic Regression ones
print("Random Forest accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred4))
print(classification_report(y_test, y_pred4))

#### 7 b e. SGD

Accuracy and f1-score stayed the same.

In [None]:
print("SGD accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred5))
print(classification_report(y_test, y_pred5))

#### 7 b f. AdaBoost

Accuracy and f1-score stayed the same.

In [None]:
print("Adaboost accuracy on testing set is:",metrics.accuracy_score(y_test, y_pred6))
print(classification_report(y_test, y_pred6))

## 8. Conclusion

The best models I got were AdaBoost and Support Vector Machine after parameter tuning, both with perfect accuracy and f1-score. The features that affect the target variable the most are gill-size and gill-color.
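A claim about which features matter most can be cross-checked against a fitted tree ensemble, which exposes `feature_importances_`. The sketch below runs on synthetic data with placeholder feature names, so it only illustrates the mechanics; on the notebook's data the same pattern would use the fitted `rf_clf` and `x.columns`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; higher means more influence on the splits
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))
```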