# Supervised ML Models - Classification

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn import tree, linear_model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, plot_confusion_matrix
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import mode
from mlxtend.plotting import plot_decision_regions
from mlxtend.plotting import plot_learning_curves

Data Source : https://www.kaggle.com/uciml/mushroom-classification

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

Time period: Donated to UCI ML 27 April 1987

Dependent Variable : (classes: edible=e, poisonous=p)

Information on Independent Variables:
- cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
- cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
- cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
- bruises: bruises=t,no=f
- odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
- gill-attachment: attached=a,descending=d,free=f,notched=n
- gill-spacing: close=c,crowded=w
- gill-size: broad=b,narrow=n
- gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
- stalk-shape: enlarging=e,tapering=t
- stalk-root: bulbous=b,club=c,cup=u,equal=e,rooted=r,missing=?
- stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
- stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
- veil-type: partial=p
- veil-color: brown=n,orange=o,white=w,yellow=y
- ring-number: none=n,one=o,two=t
- ring-type: evanescent=e,flaring=f,large=l,none=n,pendant=p
- spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
- population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
- habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

The purpose of this project is to understand which model performs the best on this dataset.


## Data Extraction

In [None]:
# Import data
df = pd.read_csv('../input/mushroom-classification/mushrooms.csv')

# View each variable and its unique values
col_list = df.columns.values.tolist()
for col_name in col_list:
    print(col_name, ' : ', df[col_name].unique())

Above is a quick view of the variables in the dataset and the unique values in each column.

In [None]:
# View some details for each variable
df.info()

The output of df.info() shows no null values in any of the columns, so all rows and all columns will be used for the analysis. Note that stalk-root marks missing values with '?', which is kept here as its own category. Next, we shall split the data into independent and dependent variables.

The dependent variable is 'class', stating whether the mushroom is edible or poisonous. The independent variables are all the other 22 columns, which describe the features of the mushroom. All of them are used because they represent different features and there is no obvious duplication among them. For example, there is no column for the overall stalk color alongside the two columns for the stalk color above and below the ring.


## Data Cleaning

In [None]:
# Set the values for X (independent variable)
X = df.drop(['class'], axis=1)
# Set the values for y (dependent variable)
y = df['class']

[X.info(), y.shape]

In [None]:
# Set a default colour for all plots below
sns.set_palette(['steelblue','palevioletred', '#ffcc99' , 'mediumaquamarine'])

In [None]:
# Set the figure's size
plt.figure(figsize=(6,6))
# Create a Pie Chart for the class (edible vs poisonous)
plt.pie(x = df['class'].value_counts(), explode = [0, 0.05], autopct='%0.01f%%', labels = [ 'Edible', 'Poisonous'])
plt.show()

From the pie chart, there are more edible mushrooms in the data than poisonous mushrooms.

The next part converts the X and y information from string format into 0s and 1s. For example, the dependent variable values 'e' and 'p' are converted to 0 and 1. From the counts of 0 and 1, we can identify that 0 represents edible whereas 1 represents poisonous. The number of rows for X and y remains the same after converting.

Note that the number of columns for X increases from 22 to 117. This happens because the encoder creates one 0/1 column for every category in every original column. For example, the cap shape includes bell, conical, flat etc., which cannot be represented by 0 and 1 within one column. In such a situation, the data in one column is converted into multiple columns in the new dataset. Bell is represented in one column, with 0 and 1 indicating whether the cap is bell shaped. Conical is then represented in the next column in the same way, and so on.
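As a rough illustration of the expansion described above, the following sketch one-hot encodes a single toy column with NumPy. The sample values and the helper name `one_hot` are made up for illustration; the notebook itself relies on scikit-learn's `OneHotEncoder`.

```python
import numpy as np

def one_hot(column):
    """Return (categories, indicator matrix) for a 1-D array of labels."""
    # np.unique sorts the categories and maps each value to its category index
    categories, codes = np.unique(column, return_inverse=True)
    # Row i of the identity matrix is the indicator vector for category i
    return categories, np.eye(len(categories), dtype=int)[codes]

# Hypothetical cap-shape values: bell=b, conical=c, flat=f
cap_shape = np.array(['b', 'c', 'f', 'b'])
categories, encoded = one_hot(cap_shape)
print(categories)
print(encoded)
```

One original column with three categories becomes three 0/1 columns, and every row has exactly one 1.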

In [None]:
ohe = OneHotEncoder()
X = ohe.fit_transform(X)
print (X.shape)

le = LabelEncoder()
y = le.fit_transform(y)
values, counts = np.unique(y, return_counts=True)
print('Shape of y : ', y.shape)
print(values, counts)

## Principal Component Analysis (PCA)

Both MCA (multiple correspondence analysis) and PCA were considered. The variation explained per principal component is between 7% and 9% for both methods. Since the difference is small, PCA is selected instead of MCA to keep the downstream analysis simple. Dimensionality reduction is needed to condense the information contained in a large number of original variables into a smaller set of new composite dimensions, with a minimum loss of information. Without it, every model reaches a 100% accuracy rate, which is too good to be true.

In [None]:
# Standardizing the variables
X = StandardScaler().fit_transform(X.toarray()) 
# Check if mean = 0, std dev = 1
print('Mean : ', X.mean(), 
      '\nStd Dev : ', X.std())

After standardizing the variables, the data is then split into 80% training data and 20% testing data. The shape of the new data set is provided below.
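The two steps just described can be sketched with plain NumPy. This is only an approximation of what `StandardScaler` and `train_test_split` do (the shuffling will not match scikit-learn's), and the helper names and toy data are assumptions made for the example.

```python
import numpy as np

def standardize(X):
    """Scale each column to mean 0 and standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have std 0; leave them unscaled to avoid dividing by zero
    std[std == 0] = 1.0
    return (X - mean) / std

def split_80_20(X, y, seed=1):
    """Shuffle the rows and return an 80% / 20% train/test split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(np.ceil(0.2 * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical data: 10 rows, 3 numeric columns
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(10, 3))
y_toy = np.array([0, 1] * 5)

X_scaled = standardize(X_toy)
train_x_toy, test_x_toy, train_y_toy, test_y_toy = split_80_20(X_scaled, y_toy)
print(train_x_toy.shape, test_x_toy.shape)
```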

In [None]:
# Create a train and test data with 80% and 20% split
train_x, test_x, train_y, test_y = train_test_split(X,y, test_size = 0.2, random_state = 1)
# Get the shape
[train_x.shape, test_x.shape, train_y.shape, test_y.shape]

After trying n_components equal to 2 and 3, we shall use 2 components to keep the analysis simple and easy to understand.

Note that 'fit_transform' is used on the training data so that PCA learns its parameters (the feature means and the principal component directions) from the training set and projects the training data in one step. These learned parameters are then applied to the test data, which is why only 'transform' is used on the testing data. In addition, the transformation is applied only to the independent variables. The dependent variable is not involved in the PCA process below.
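To make the fit/transform distinction concrete, here is a minimal NumPy sketch of PCA via the singular value decomposition. It is not scikit-learn's implementation; the function names `pca_fit` and `pca_transform` and the random toy data are assumptions for illustration only.

```python
import numpy as np

def pca_fit(train, n_components):
    """Learn the feature means and top principal directions from training data."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(data, mean, components):
    """Project data using parameters learned from the training set."""
    return (data - mean) @ components.T

rng = np.random.default_rng(0)
train_toy = rng.normal(size=(80, 5))
test_toy = rng.normal(size=(20, 5))

# Parameters come from the training data only...
mean, components = pca_fit(train_toy, n_components=2)
train_proj = pca_transform(train_toy, mean, components)
# ...and are reused unchanged on the test data
test_proj = pca_transform(test_toy, mean, components)
print(train_proj.shape, test_proj.shape)
```

Because the test rows never influence `mean` or `components`, the test set stays unseen by the PCA step.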

In [None]:
# PCA model with 2 number of components
pca = PCA(n_components = 2)
# Fit the data
trainx_pca = pca.fit_transform(train_x)
testx_pca = pca.transform(test_x)
# Variation of PC1 and PC2
print('Variation per principal component: ', pca.explained_variance_ratio_)
# Shape of train_x
print("train_x shape: ", train_x.shape)
# Shape of trainx_pca
print("trainx_pca shape: ", trainx_pca.shape)

In [None]:
list_x = list(range(1, 118))
# Compute the covariance matrix of all 117 one-hot encoded variables
cov_22 = np.cov(train_x.T)
# Compute eigenvalues and eigenvectors of the symmetric covariance matrix from above
e_values, e_vectors = np.linalg.eigh(cov_22)
# Sort the components from largest to smallest eigenvalue
order = np.argsort(e_values)[::-1]
e_values, e_vectors = e_values[order], e_vectors[:, order]
# Calculate the percentage of variance explained by each component
var_22 = []
for i in e_values:
     var_22.append((i / sum(e_values)) * 100)
# Cumulative explained variance across the sorted components
cum_var_22 = np.cumsum(var_22)
# Set the plot size
plt.figure(figsize=(10, 6))
# Visualizing the eigenvalues and finding the "elbow" in the graphic
plt.title("Explained Variance vs Number of Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
sns.lineplot(x = list_x, y = cum_var_22.real)
plt.show()

From the graph above, the curve is steepest within the first 10 components. However, a 10D graph is hard to interpret and understand. To make it easier to understand, the number of components chosen is 2. The 2D graphs before and after transformation are provided below.
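For reference, the cumulative explained variance curve can also be used to read off how many components are needed for a given share of the variance. The sketch below does this on made-up data; the 95% threshold and the variable names are assumptions, not values taken from the mushroom analysis.

```python
import numpy as np

def cumulative_explained_variance(X):
    """Cumulative share of variance explained by the sorted components."""
    cov = np.cov(X.T)
    # eigvalsh is meant for symmetric matrices and returns ascending eigenvalues
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]
    return np.cumsum(eigenvalues) / eigenvalues.sum()

rng = np.random.default_rng(0)
# Hypothetical data whose columns have very different variances
X_toy = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

cum_var = cumulative_explained_variance(X_toy)
# Smallest number of components whose cumulative share reaches 95%
n_needed = int(np.searchsorted(cum_var, 0.95) + 1)
print(cum_var, n_needed)
```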

In [None]:
features = ['PC1', 'PC2']
# Make it into a data frame
pca_df = pd.DataFrame(trainx_pca,columns = features)
# Include dependent variable
pca_df['Class'] = train_y
# View data frame
pca_df

# Set the figure's size
plt.figure(figsize=( 24 ,10 ))
# Plot the Original and transformed data
X_new = pca.inverse_transform(trainx_pca)
plt.subplot(1,2,1)
sns.scatterplot(x=train_x[:, 0], y=train_x[:, 1], hue = "Class", data=pca_df, alpha=0.2)
plt.title("Original Mushroom Dataset")
plt.subplot(1,2,2)
sns.scatterplot(x=X_new[:, 0], y=X_new[:, 1], hue = "Class", data=pca_df, alpha=0.8)
plt.title("Inverse Transform of Mushroom Dataset")

In [None]:
# Set the figure's size
plt.figure(figsize=(10,6))
# Scatter Plot for PC1 and PC2
sns.scatterplot(x="PC1", y="PC2", hue = "Class", data=pca_df, legend="full", alpha=0.6)
plt.title("PCA of Mushroom Dataset")
plt.show()

## Model 1 : Logistic Regression

This is a supervised model for a binary dependent variable: edible or poisonous. The transformed training and testing X variables are used in the regression analysis.

In [None]:
# Fit the logistic regression model according to the given training data
lr = linear_model.LogisticRegression(random_state=0).fit(trainx_pca, train_y)
# Print coefficient and intercept
print('Coefficient : ', lr.coef_, '\nIntercept :', lr.intercept_)
# Check accuracy of model
print ('Train : ', lr.score(trainx_pca,train_y) * 100)
print ('Test  : ', lr.score(testx_pca, test_y) * 100)
# Predict for test_x
pred_y1 = lr.predict(testx_pca)
# View the classification report
print(classification_report(test_y, pred_y1))
# Plot the confusion matrix graph
plot_confusion_matrix(lr, testx_pca, test_y, cmap = 'Blues_r')
# Compiling test score of all models for the graph at the end of the project
scores = []
scores.append(lr.score(testx_pca, test_y))

This is a multiple logistic regression, as the data consist of more than one independent variable (PC1 and PC2). The intercept is 0.55 and the coefficients for the two independent variables are 0.96 and -0.16.
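As a small worked example, the intercept and coefficients quoted above can be plugged into the logistic function to get a predicted probability for any point in the PCA plane. The helper name is hypothetical and the values are the rounded ones from the text, so the output only approximates what `lr.predict_proba` would return.

```python
import numpy as np

# Rounded intercept and coefficients quoted in the text above
INTERCEPT = 0.55
COEF = np.array([0.96, -0.16])

def predict_proba_poisonous(pc1, pc2):
    """Probability of class 1 (poisonous) from the fitted linear score."""
    z = INTERCEPT + COEF[0] * pc1 + COEF[1] * pc2
    return 1.0 / (1.0 + np.exp(-z))

# At the origin of the PCA plane only the intercept contributes
p_origin = predict_proba_poisonous(0.0, 0.0)
print(round(float(p_origin), 3))
```

A larger PC1 pushes the probability of the poisonous class up, while a larger PC2 pulls it down slightly, which matches the signs of the two coefficients.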

Comparing the training and testing scores, the model is not overfitted. The accuracy score for logistic regression is 88.41% on the transformed training data and 88.68% on the transformed testing data. If the original train_x and test_x are used, the model gives a 100% accuracy rate on all results.

From the confusion matrix, 803 + 638 = 1441 rows are predicted correctly as edible or poisonous. A sizeable portion of the data (167 rows) are actually poisonous mushrooms that were predicted as edible. This has an impact on the precision and recall results. Overall, the model results are still good, but there may be better models.
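To show how misclassified rows feed into precision and recall, the sketch below computes the three metrics from a handful of invented labels. With poisonous treated as the positive class, poisonous mushrooms predicted as edible are false negatives and lower the recall. The function name and the toy labels are assumptions for illustration.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Hypothetical labels: 0 = edible, 1 = poisonous
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```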


## Model 2 : Random Forest Classification

This is a supervised model where the PCA transformed X data was used in the model. The classification model was used since the output results are binary.

In [None]:
train_scores = []
test_scores = []
error_rates =[]
for i in range(1,25):
    modeli = RandomForestClassifier(max_depth=i, random_state = 0)
    modeli = modeli.fit(trainx_pca, train_y)
    train_score = accuracy_score(modeli.predict(trainx_pca),train_y) * 100
    test_score = accuracy_score(test_y, modeli.predict(testx_pca)) * 100
    error_rate = np.mean(modeli.predict(testx_pca) != test_y)
    train_scores.append(train_score)
    test_scores.append(test_score)
    error_rates.append(error_rate)

# Set the figure's size    
plt.figure(figsize=(20,6))
plt.subplot(1,2,1)
plt.plot(range(1,25),train_scores,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='steelblue', markersize=10)
plt.plot(range(1,25),test_scores,color='palevioletred', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Training Data Score and Testing Data Score')
plt.subplot(1,2,2)
plt.plot(range(1,25),error_rates,color='#ffcc99', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. Max Depth Value')
plt.xlabel('Max Depth')
plt.ylabel('Rates')

The left plot shows the accuracy score for the training and testing data at each max depth. The blue line represents the training score, whereas the red line represents the testing score. The right plot shows the error rate on the testing data. When the depth goes beyond 25, the training accuracy is 100% in most cases, so the graphs only show max depths from 1 to 24.

Based on the steep decreases in error rate, a max depth of 4, 8, 12 or 15 would be a good fit for the model. The overall accuracy on the training data is good, and the accuracy on the testing data is consistently around 94%. At every point on the left plot, the training accuracy is higher than the testing accuracy. To limit overfitting while still keeping a good accuracy score, the selected model uses a max depth of 9.

In [None]:
# Fit the Random Forest model according to the given training data
rfc = RandomForestClassifier(max_depth = 9, random_state=0).fit(trainx_pca, train_y)
# Check accuracy of model
print ('Train : ', rfc.score(trainx_pca,train_y) * 100)
print ('Test  : ', rfc.score(testx_pca, test_y) * 100)
# Predict for test_x
pred_y2 = rfc.predict(testx_pca)
# View the classification report
print(classification_report(test_y, pred_y2))
# Plot the confusion matrix graph
plot_confusion_matrix(rfc, testx_pca, test_y, cmap = 'Blues_r')
# Compiling test score of all models for the graph at the end of the project
scores.append(rfc.score(testx_pca, test_y))

The accuracy scores for the training and testing data are 94.63% and 93.97%. This model seems to be a better fit than the logistic model. However, there is a slight indication of overfitting on the training data. The number of rows predicted inaccurately (98) is small compared to the logistic model above. Below are two plots related to the random forest.

The first plot indicates the importance of each variable, PC1 and PC2. From the earlier PCA section, PC1 explains 8.95% of the variation whereas PC2 explains 8.13%. It is not surprising to see that PC1 is a more important feature in the random forest model compared to PC2. However, the difference in importance between PC1 and PC2 is much larger than expected.

The second plot shows one decision tree (the first estimator) from the random forest model with a max depth of 9. Each box states which independent variable was used for the split, the gini value and the number of samples. The rule for each leaf is the combination, joined with 'AND', of all the split conditions on the path from the root to that leaf.
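The gini value shown in each box measures how mixed the classes are at that node. A minimal sketch of the calculation follows, using hypothetical sample counts rather than numbers from the fitted tree.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for the class counts at one node."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Hypothetical nodes given as [edible, poisonous] sample counts
mixed_node = gini_impurity([50, 50])   # evenly mixed node
pure_node = gini_impurity([100, 0])    # node containing a single class
print(mixed_node, pure_node)
```

For two classes the value ranges from 0 (a pure node) to 0.5 (an even mix), so lower gini values in the leaves indicate cleaner splits.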

In [None]:
# Set the figure's size
plt.figure(figsize=(6,3))
# Set the variables
importances = rfc.feature_importances_
indices = np.argsort(importances)
# Quantify the relative importance of each feature, sorted from most to least important
feature_imp = pd.Series(rfc.feature_importances_,index=features).sort_values(ascending=False)
# Plot barh graph
plt.barh(range(len(indices)), feature_imp, color='b', align='center')
# Label each bar with its own feature name, following the sorted order
plt.yticks(range(len(indices)), feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title('Feature Importances')
plt.show()

In [None]:
# Data Visualization on Random Forest
fig, ax = plt.subplots(figsize=(30, 30))
tree.plot_tree(rfc.estimators_[0], ax=ax, filled=True)
plt.show()

## Model 3 : KNN

This is a supervised model where the PCA transformed X data was used in the model. The classification model was used since the output results are binary.

In [None]:
train_scores = []
test_scores = []
error_rates =[]
for i in range(1,25):
    modeli = KNeighborsClassifier(n_neighbors=i)
    modeli = modeli.fit(trainx_pca, train_y)
    train_score = accuracy_score(modeli.predict(trainx_pca),train_y) * 100
    test_score = accuracy_score(test_y, modeli.predict(testx_pca)) * 100
    error_rate = np.mean(modeli.predict(testx_pca) != test_y)
    train_scores.append(train_score)
    test_scores.append(test_score)
    error_rates.append(error_rate)

#sns.set_palette(['steelblue','palevioletred', '#ffcc99' , 'mediumaquamarine'])
    
plt.figure(figsize=(20,6))
plt.subplot(1,2,1)
plt.plot(range(1,25),train_scores,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='steelblue', markersize=10)
plt.plot(range(1,25),test_scores,color='palevioletred', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Training Data Score and Testing Data Score')
plt.subplot(1,2,2)
plt.plot(range(1,25),error_rates,color='#ffcc99', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Rates')

The accuracy scores of the training and testing data behave differently with the KNN model than with the random forest model. The blue line represents the training score, whereas the red line represents the testing score. On the right, the graph shows the error rate of the testing data. In the random forest model earlier, the accuracy score for both training and testing data increases as the max depth increases. In contrast, for the KNN model the training accuracy decreases as the number of neighbors increases. For the testing data, there's a slight increase in accuracy as the number of neighbors increases.

17 and 21 are the two values of K that look ideal for the model. Based on the accuracy score and error rate, the selected number of neighbors is 17, since it has a low error rate and shows no overfitting between the training and testing data. The error rate at 21 is not much different from that at 17. However, when the number of neighbors is 21, the training score is slightly lower than at 17. Thus, the final selected value is 17 instead of 21, although they have very similar results.
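The sweep over K can be sketched from scratch to show what the error rate measures. The code below is a simplified majority-vote KNN on invented 2-D data, standing in for PC1 and PC2; it is not the scikit-learn classifier, and the data, seed and list of K values are assumptions.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k):
    """Predict 0/1 labels by majority vote among the k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(query_X[:, None, :] - train_X[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    # A mean neighbour label above 0.5 means class 1 wins the vote
    return (train_y[nearest].mean(axis=1) > 0.5).astype(int)

rng = np.random.default_rng(0)
# Hypothetical two-class data in a 2-D plane
X0 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(100, 2))
X_all = np.vstack([X0, X1])
y_all = np.array([0] * 100 + [1] * 100)
order = rng.permutation(200)
train_idx, test_idx = order[:160], order[160:]

# Test error rate for a few odd values of k (odd k avoids tied votes)
error_rates = {}
for k in (1, 5, 17, 21):
    pred = knn_predict(X_all[train_idx], y_all[train_idx], X_all[test_idx], k)
    error_rates[k] = float(np.mean(pred != y_all[test_idx]))
print(error_rates)
```

With k = 1 every training point is its own nearest neighbor, so the training accuracy is perfect. This is one reason the training score tends to fall as the number of neighbors grows.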

In [None]:
# Fit the KNN Classifier model according to the given training data
knc = KNeighborsClassifier(n_neighbors=17).fit(trainx_pca, train_y)
# Check accuracy of model
print ('Train : ', knc.score(trainx_pca,train_y) * 100)
print ('Test  : ', knc.score(testx_pca, test_y) * 100)
# Predict for test_x
pred_y3 = knc.predict(testx_pca)
# View the classification report
print(classification_report(test_y, pred_y3))
# Plot the confusion matrix graph
plot_confusion_matrix(knc, testx_pca, test_y, cmap = 'Blues_r')
# Compiling test score of all models for the graph at the end of the project
scores.append(knc.score(testx_pca, test_y))

KNN is a good fit for this data. There's no overfitting and the accuracy scores are the highest among all three models. There are 81 (65 + 16) cases where the class of mushroom was predicted inaccurately.

The graph below shows the learning curve of the model. When the training set size is below 80%, the test set consistently has a higher error rate than the training set. Once it reaches 80% and above, the error rates for the training set and test set are almost the same and at their lowest.

In [None]:
plot_learning_curves(trainx_pca, train_y, testx_pca, test_y, KNeighborsClassifier(n_neighbors=17))
plt.show()

In [None]:
# Set the variables
#X_plot = np.column_stack((trainx_pca[:,0], trainx_pca[:,1]))
X_plot = np.column_stack((pca_df['PC1'].tolist(), pca_df['PC2'].tolist()))
X = X_plot
y = train_y
h = .02  # step size in the mesh
n_neighbors = 17

# Create color maps
cmap_light = ListedColormap(['orange', 'cornflowerblue'])
cmap_bold = ['darkorange', 'darkblue']

for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=pca_df['Class'],
                    palette=cmap_bold, alpha=1.0, edgecolor="black")
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("2-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
    plt.xlabel('PC1')
    plt.ylabel('PC2')

plt.show()

Above are the KNN plots using uniform and distance weights. The standard model, which is also the model used for the analysis, uses uniform weights. The distance weights were included in the plot here to visualize the difference between uniform and distance. At a quick glance, the two plots are similar. There seems to be no obvious difference between the two.
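Although the two plots look alike on this data, the two weighting schemes can disagree. The sketch below constructs a small hypothetical case where they do: one very close neighbor of class 1 and two farther neighbors of class 0. The function `knn_vote` is an illustrative stand-in, not scikit-learn's code, and it assumes the query point does not coincide with a training point.

```python
import numpy as np

def knn_vote(train_X, train_y, query, k, weights="uniform"):
    """Classify one query point with uniform or inverse-distance voting."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    if weights == "uniform":
        w = np.ones(k)
    else:
        # Inverse-distance weights; assumes the query is not a training point
        w = 1.0 / dists[nearest]
    # Sum the weights of the neighbours voting for each class
    score_0 = w[train_y[nearest] == 0].sum()
    score_1 = w[train_y[nearest] == 1].sum()
    return int(score_1 > score_0)

# Hypothetical points: one class-1 neighbour very close, two class-0 further away
train_X = np.array([[0.1, 0.0], [1.0, 0.0], [0.0, 1.0]])
train_y = np.array([1, 0, 0])
query = np.array([0.0, 0.0])

uniform_label = knn_vote(train_X, train_y, query, k=3, weights="uniform")
distance_label = knn_vote(train_X, train_y, query, k=3, weights="distance")
print(uniform_label, distance_label)
```

With uniform weights the two class-0 neighbors outvote the single class-1 neighbor, whereas inverse-distance weighting lets the much closer class-1 neighbor dominate.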

## Conclusion

In [None]:
models = ['Logistic Regression','Random Forest Classification','K-Nearest Neighbors']

# Visualising the accuracy score of each classification model
plt.rcParams['figure.figsize'] = 10, 8
ax = sns.barplot(x = models, y = scores, saturation =1.5)
plt.xlabel("Supervised Models", fontsize = 16 )
plt.ylabel("Accuracy Score", fontsize = 16)
plt.title("Accuracy of Various Supervised Models", fontsize = 16)
plt.xticks(fontsize = 13, horizontalalignment = 'center', rotation = 0)
plt.yticks(fontsize = 13)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height:.2%}', (x + width/2, y + height*1.02), ha='center', fontsize = 'x-large')
plt.show()

Above are the accuracy scores of all three models on the test data. All models used independent variables that had undergone PCA to condense them. Without applying PCA, the models all gave a 100% accuracy rate. Thus, it was necessary to apply PCA, although it creates a small loss of information.

It is appropriate to compare these three models as they all used the same transformed data. From the accuracy scores above, the KNN model has the highest score. In addition, there was no overfitting on the training data for this model. Therefore, this is the best model among the three.

Note that although the model results are good, there are a few limitations that should be kept in mind. A major limitation of the models is that they are 'idealizations' or 'simplifications' of reality. We may be able to use them to predict reality, but there is always a possibility of error in the prediction. Each prediction comes with a set of assumptions made during modeling, and this causes differences between model and reality. In this project, PCA with 2 components was used. If the number of components changed, or MCA was used instead of PCA, the best model might no longer be KNN. Likewise, changing the max depth or the number of neighbors to other values would have an impact on the model results.