# INTRODUCTION
In this project we explore a mushroom classification problem, represented by a tabular dataset with 23 columns (22 features plus the class label) and 8124 observations, each labeled as an edible or a poisonous mushroom. The goal is to classify mushrooms as either edible or poisonous.

## **ABOUT THE DATASET**:
- Data Set Information:<br>

  This dataset comes from the archive of the University of California, Irvine (the UCI Machine Learning Repository), where it is titled the Mushroom Data Set. The copy used in this notebook is the one published on Kaggle; according to Kaggle.com, this dataset was updated 3 years ago.


- Origin:

  Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981), G. H. Lincoff (Pres.), New York: Alfred A. Knopf.


- Donor:

  Jeff Schlimmer


- Source:<br>

  UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Mushroom



## DATA DESCRIPTION:

- Data dictionary of the used dataset.


- Attribute Information: (classes: edible=e, poisonous=p)



- Number of instances: 8124.
- Number of attributes: 23 (the class label plus 22 categorical features).
- Class distribution: 4208 edible, 3916 poisonous.

Every feature is categorical and stored as a single-character code. The table below lists the codes used in each column.

    
    

| Column                   | Description                                                                                        |
|--------------------------|----------------------------------------------------------------------------------------------------|
| cap-shape                | bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s                                               |
| cap-surface              | fibrous=f,grooves=g,scaly=y,smooth=s                                                               |
| cap-color                | brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y                    |
| bruises                  | bruises=t,no=f                                                                                     |
| odor                     | almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s                        |
| gill-attachment          | attached=a,descending=d,free=f,notched=n                                                           |
| gill-spacing             | close=c,crowded=w,distant=d                                                                        |
| gill-size                | broad=b,narrow=n                                                                                   |
| gill-color               | black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y |
| stalk-shape              | enlarging=e,tapering=t                                                                             |
| stalk-root               | bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?                                    |
| stalk-surface-above-ring | fibrous=f,scaly=y,silky=k,smooth=s                                                                 |
| stalk-surface-below-ring | fibrous=f,scaly=y,silky=k,smooth=s                                                                 |
| stalk-color-above-ring   | brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y                            |
| stalk-color-below-ring   | brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y                            |
| veil-type                | partial=p,universal=u                                                                              |
| veil-color               | brown=n,orange=o,white=w,yellow=y                                                                  |
| ring-number              | none=n,one=o,two=t                                                                                 |
| ring-type                | cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z                      |
| spore-print-color        | black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y                      |
| population               | abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y                                 |
| habitat                  | grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d                                       |


    
    

# IMPORT MODULES
In this section we import the resources and configuration needed for the environment. Pandas is used for data manipulation; numpy for numerical operations such as converting the data into arrays; matplotlib, seaborn and graphviz for visualization; and scikit-learn for preprocessing and machine learning algorithms. Other frameworks are introduced as well: Keras (deep learning) and CatBoost (gradient boosting).

In [None]:
# uncomment the lines below to install any missing modules
# !pip install pandas #data structures
# !pip install numpy #numerical computing
# !pip install matplotlib #matlab-style plotting
# !pip install seaborn #prettier visualization
!pip install dython #data analysis tools for python 3.x
# !pip install catboost #gradient boosting
# !pip install tensorflow #keras backend
!pip install scipy #math operations
# !pip install graphviz #decision tree visualizations
!pip install pydotplus #render graphviz dot data as png
# warnings and math belong to the standard library and need no install
import warnings
warnings.filterwarnings('ignore') #suppress warning messages

In [None]:
#import modules
import pandas as pd #data structures
import numpy as np #numerical computing
import matplotlib.pyplot as plt #matlab-style plotting
import seaborn as sns #prettier visualization
import dython #data analysis tools for python 3.x
import catboost #gradient boosting
import tensorflow #keras backend
import scipy.stats as ss #statistical functions (entropy)
import pydotplus #render graphviz dot data as png
import graphviz #decision tree visualizations
#configurations#
# %autosave 60
%matplotlib inline
# %config InlineBackend.figure_format ='retina'
import datetime
print('Last update on the notebook was: \n', datetime.datetime.now())

# INTRODUCTION TO DATASET
Import the dataset from the system path and preview the first 5 rows of every column. All 23 columns are categorical, and there are 8124 rows in total.

- NOTE: Change the path of the dataset according to your local machine.

In [None]:
# if you read data from local path
# df = pd.read_csv('/content/mushrooms.csv')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.
df = pd.read_csv('/kaggle/input/mushroom-classification/mushrooms.csv')
print('First 5 rows of all columns: \n\n',df.head().T)
print('\nTotal number of columns: \n',df.shape[1])
print('\nTotal number of rows: \n',df.shape[0])

# EXPLORATORY DATA ANALYSIS

In this part we perform exploratory data analysis: we investigate the features to see how the data is distributed, which unique values occur, and whether there are any missing values.

## INITIAL STATISTICAL SUMMARY OF OUR DATASET.
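Here we observe how values are distributed among the columns. Because every column is categorical, `describe()` reports the count, the number of unique values, the most frequent value (top) and its frequency (freq).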

In [None]:
print(df.describe().T)

## CORRELATION BETWEEN CATEGORICAL FEATURES
Here we measure the association between pairs of categorical features, to observe how each feature relates to the target feature ['class'], i.e. whether a mushroom is edible or poisonous. In the original dataset every feature value is denoted by a single character. We measure the association now, before one-hot encoding, because encoding increases the number of feature columns dramatically, from the current number to 97.


**Uncertainty Coefficient:**


This method is called Theil's U, also known as the Uncertainty Coefficient. Formally written U(x|y), the coefficient takes values in the range [0,1], where 0 means that feature y provides no information about feature x, and 1 means that feature y provides full information about the value of feature x.
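
For reference, the coefficient described above is usually written in terms of entropy. The symbols H (entropy) and p (empirical probability) are notation introduced here for illustration; they match the quantities computed in the cell further down.

```latex
H(x \mid y) = -\sum_{x,\,y} p(x, y)\,\log\frac{p(x, y)}{p(y)},
\qquad
U(x \mid y) = \frac{H(x) - H(x \mid y)}{H(x)}
```

When y fully determines x, H(x|y) = 0 and U(x|y) = 1; when y says nothing about x, H(x|y) = H(x) and U(x|y) = 0.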

 - Findings:
 
 >- The feature 'veil-type' carries no information about the target feature 'class' and needs to be removed from the dataset.
 
 >- The feature 'odor' is the most informative one for the target feature.

In [None]:
import math #mathematical functions
from collections import Counter
def conditional_entropy(x,y):
    # entropy of x given y
    y_counter = Counter(y)
    xy_counter = Counter(list(zip(x,y)))
    total_occurrences = sum(y_counter.values())
    entropy = 0
    for xy in xy_counter.keys():
        p_xy = xy_counter[xy] / total_occurrences
        p_y = y_counter[xy[1]] / total_occurrences
        entropy += p_xy * math.log(p_y/p_xy)
    return entropy

def theil_u(x,y):
    s_xy = conditional_entropy(x,y)
    x_counter = Counter(x)
    total_occurrences = sum(x_counter.values())
    p_x = list(map(lambda n: n/total_occurrences, x_counter.values()))
    s_x = ss.entropy(p_x)
    if s_x == 0:
        return 1
    else:
        return (s_x - s_xy) / s_x   
#correlation viz
theilu = pd.DataFrame(index=['class'])
columns = df.columns
for j in range(0,len(columns)):
    u = theil_u(df['class'].tolist(),df[columns[j]].tolist())
    theilu.loc[:,columns[j]] = u
theilu.fillna(value=np.nan,inplace=True)
plt.figure(figsize=(20,1))
sns.heatmap(theilu,annot=True,fmt='.2f')
plt.show()
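
One property of Theil's U that the heatmap does not show is that it is asymmetric: U(x|y) and U(y|x) generally differ. The sketch below illustrates this on a tiny made-up sample (the `odor` and `label` lists are hypothetical, not taken from the dataset) using only the standard library.

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy of a list of categorical values (natural log)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    # entropy of x given y, from joint and marginal counts
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return sum((c / n) * math.log(y_counts[yv] / c)
               for (_, yv), c in xy_counts.items())

def theils_u(x, y):
    # share of the uncertainty in x that is removed by knowing y
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    return (h_x - conditional_entropy(x, y)) / h_x

# hypothetical toy sample: every odor value maps to exactly one label
odor = ['a', 'a', 'f', 'f', 'n', 'n']
label = ['e', 'e', 'p', 'p', 'e', 'e']

u_label_given_odor = theils_u(label, odor)  # odor fully determines label
u_odor_given_label = theils_u(odor, label)  # label only partly determines odor
print(u_label_given_odor, u_odor_given_label)
```

In this toy sample, knowing the odor removes all uncertainty about the label, while knowing the label still leaves two possible odors for edible mushrooms, so the second value is lower than the first.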

## DATA VISUALIZATION
As a final step of exploratory data analysis, we visualize each feature against the target feature 'class' to see how the observations contrast from one column to another.

 - Findings:
 
 >- The 'class' feature is balanced: neither the 'p' nor the 'e' label dominates.
 
 >- Buff ('b') is the most common 'gill-color' value among poisonous mushrooms.

 >- The feature **'veil-type'** has only one unique value, which does not add any value to our analysis, therefore we will eliminate it.

 >- The feature **'stalk-root'** contains the value **'?'**, which denotes a missing value according to the original dataset description.

In [None]:
#viz per class
fig, ax = plt.subplots(1,3, figsize=(15,5))
sns.countplot(x="cap-shape", hue='class', data=df, ax=ax[0])
sns.countplot(x="cap-surface", hue='class', data=df, ax=ax[1])
sns.countplot(x="cap-color", hue='class', data=df, ax=ax[2])
fig, ax = plt.subplots(1,2, figsize=(15,5))
sns.countplot(x="bruises", hue='class', data=df, ax=ax[0])
sns.countplot(x="odor", hue='class', data=df, ax=ax[1])
fig, ax = plt.subplots(1,4, figsize=(20,5))
sns.countplot(x="gill-attachment", hue='class', data=df, ax=ax[0])
sns.countplot(x="gill-spacing", hue='class', data=df, ax=ax[1])
sns.countplot(x="gill-size", hue='class', data=df, ax=ax[2])
sns.countplot(x="gill-color", hue='class', data=df, ax=ax[3])
fig, ax = plt.subplots(2,3, figsize=(20,10))
sns.countplot(x="stalk-shape", hue='class', data=df, ax=ax[0,0])
sns.countplot(x="stalk-root", hue='class', data=df, ax=ax[0,1])
sns.countplot(x="stalk-surface-above-ring", hue='class', data=df, ax=ax[0,2])
sns.countplot(x="stalk-surface-below-ring", hue='class', data=df, ax=ax[1,0])
sns.countplot(x="stalk-color-above-ring", hue='class', data=df, ax=ax[1,1])
sns.countplot(x="stalk-color-below-ring", hue='class', data=df, ax=ax[1,2])
fig, ax = plt.subplots(2,2, figsize=(15,10))
sns.countplot(x="veil-type", hue='class', data=df, ax=ax[0,0])
sns.countplot(x="veil-color", hue='class', data=df, ax=ax[0,1])
sns.countplot(x="ring-number", hue='class', data=df, ax=ax[1,0])
sns.countplot(x="ring-type", hue='class', data=df, ax=ax[1,1])
fig, ax = plt.subplots(1,3, figsize=(20,5))
sns.countplot(x="spore-print-color", hue='class', data=df, ax=ax[0])
sns.countplot(x="population", hue='class', data=df, ax=ax[1])
sns.countplot(x="habitat", hue='class', data=df, ax=ax[2])
fig.tight_layout()
plt.show()

## DATA CLEANING
In this section we remove the feature **'veil-type'** and the rows where **'stalk-root'** is **'?'**, and return the cleaned dataset. We then check the remaining columns and values; both the number of rows and the number of columns decrease as a result.

- FINDING:

>- The total number of rows decreased from 8124 to 5644 after removing the rows with '?' in the 'stalk-root' feature.

>- The number of columns decreased from 23 to 22 after dropping the 'veil-type' feature.

>- After cleaning, there are no 'NA' values in our dataset.

In [None]:
# exclude missing values, which are represented by '?'
df = df[df['stalk-root'] != '?']
# drop column veil-type because it has only one unique value
df = df.drop(['veil-type'],axis=1)
print('Unique columns from all data are: \n\n',np.unique(df.columns))
print('\nUnique values from all columns: \n',np.unique(df.values))
print('\nTotal number of new columns: \n',df.shape[1])
print('\nTotal number of new rows: \n',df.shape[0])
# How many Na's count per column
# df.isnull().sum().sort_values(ascending=False)
print('\nCheck if we have na value in any column:\n',df.isnull().any())
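
A caveat on the last check: pandas only treats real NaN values as missing, so `isnull()` never flags the string '?' on its own. The sketch below, on a small made-up CSV (the three rows are hypothetical), shows one way to let pandas recognise the placeholder at load time through the `na_values` argument of `read_csv`. It is an alternative illustration, not the approach used in the cell above.

```python
import io
import pandas as pd

# hypothetical three-row sample in the same layout as mushrooms.csv
csv_text = "class,stalk-root\ne,b\np,?\ne,c\n"

# default parsing keeps '?' as an ordinary string
raw = pd.read_csv(io.StringIO(csv_text))
raw_missing = int(raw["stalk-root"].isnull().sum())

# na_values tells the parser to convert '?' into NaN
marked = pd.read_csv(io.StringIO(csv_text), na_values="?")
marked_missing = int(marked["stalk-root"].isnull().sum())

# once marked, the usual missing-data tools apply
cleaned = marked.dropna(subset=["stalk-root"])
print(raw_missing, marked_missing, len(cleaned))
```

With the placeholder marked as NaN, `dropna` gives the same result as filtering on `!= '?'`, and `isnull()` becomes a meaningful check.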

## FINAL STATISTICAL SUMMARY OF OUR DATASET.
Here we observe how values are distributed among the columns of our cleaned data.


- **SUMMARY**:

>- All features have at least two unique values, distributed across the labels of our target class.

>- The new dataset has a total of 5644 rows.

In [None]:
print(df.describe().T)

# FEATURE ENGINEERING
In the feature engineering step, the data is transformed into a format that machine learning models accept. Since all our features are categorical, we need to encode them as numerical values.

## ONE-HOT LABEL ENCODING
One-hot encoding is a technique applied mostly to text-based categorical data: each distinct value of a feature becomes its own column, which holds 1 in the rows where that value occurs and 0 otherwise. For example, the feature 'population' takes the values v, y, s, a, n, and c. Representing those values by different integers would impose an artificial order and give some values more weight than others, whereas converting each into a binary column gives all values the same weight across the dataset. One-hot encoding increases the column count to 97, which adds a lot of sparsity to our data.

>- NOTE: one-hot encoding can be implemented by calling (OneHotEncoder) from the "scikit-learn" module or (get_dummies) from the "pandas" module.

In [None]:
#one hot label encoding
features = df.iloc[:,1:]
features = pd.get_dummies(features, dtype=int) # 0/1 integers rather than booleans
target = df.iloc[:,0].map({'p': 0, 'e': 1}) # poisonous = 0, edible = 1
print('First 5 rows of new encoded feature columns:\n',features.head())
print('First 5 rows of new encoded target class of mushroom poisonous = 0 edible = 1:\n',target.head())
X = features.values
y = target.values
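
To make the mechanics of one-hot encoding concrete, here is a minimal pure-Python sketch on a few made-up 'population' values. The helper `one_hot` is hypothetical and written only for illustration; the notebook itself relies on `pd.get_dummies`.

```python
def one_hot(values):
    # one column per distinct category, in sorted order
    categories = sorted(set(values))
    # each row gets a 1 in the column of its own category and 0 elsewhere
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# hypothetical sample of the 'population' feature
population = ['v', 'y', 's', 'v', 'a']
categories, encoded = one_hot(population)

print(categories)
for value, row in zip(population, encoded):
    print(value, row)
```

Each row contains exactly one 1, so no category is treated as larger or smaller than another, at the cost of one extra column per distinct value.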

# BUILDING, TRAINING & TESTING MACHINE LEARNING MODELS
Building a machine learning model starts with setting the train/test data split; we then build and evaluate several models and summarise each model's score.

## DATA TRAIN/TEST SPLIT
The train/test split is the main step for testing machine learning models on unseen data. Here the data is split in a 70:30 ratio, which means that models are trained on 70% of the data and tested on the remaining 30% of unseen data. The `random_state` parameter fixes the seed of the random number generator, so the same split is produced on every run for reproducibility.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    test_size = 0.3,
                                                    random_state=29)
target_names = ['poisonous', 'edible']
print ('X_train Shape:', X_train.shape)
print ('X_test Shape:', X_test.shape)
print ('y_train Shape:', y_train.shape)
print ('y_test Shape:', y_test.shape)
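
The role of the seed can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not scikit-learn's actual algorithm: the helper `split_indices` is hypothetical and simply shuffles row indices with a seeded generator before cutting off 30% for testing.

```python
import random

def split_indices(n_rows, test_size=0.3, seed=29):
    # shuffle the row indices with a seeded generator, then cut off the test part
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# same seed, same split
train_a, test_a = split_indices(5644)
train_b, test_b = split_indices(5644)
print(train_a == train_b, len(train_a), len(test_a))

# a different seed gives a different split of the same size
train_c, test_c = split_indices(5644, seed=30)
print(train_a == train_c, len(test_c))
```

Because the shuffle is driven entirely by the seed, anyone rerunning the notebook with the same seed evaluates the models on the same held-out rows.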

## KMEANS CLUSTERING

The k-means algorithm is very fast, but its results can fall behind on sparse data, and they depend on the random initialization of the cluster centers. That's why it can be useful to restart it several times. In our example, it can be seen how the sparsity of the features affects the performance of the model on every run.


#### **MODEL SUMMARY**:

 
- In our example, we created a model with 2 clusters, one for each target label (0 and 1), and left the other parameters at their default settings.

>>- NOTE:
K-means clustering is unstable here: running it several times gives very different accuracies (13% and 85%). K-means is unsupervised, so the cluster numbers 0 and 1 are assigned arbitrarily and need not match the class labels 0 and 1.

In [None]:
#calling the kmeans clustering model from sklearn
from sklearn.cluster import KMeans
# setting the model parameters
k_means=KMeans(n_clusters=2)
#Fitting kmeans to training set (kmeans is unsupervised, so y_train is ignored)
k_means.fit(X_train, y_train)
#Predicting cluster ids on test set
k_means_predict = k_means.predict(X_test)
#report the results (cluster ids are compared directly with the class labels)
print("\nKmeans confusion matrix: \n",confusion_matrix(y_test, k_means_predict))
print("\nKmeans Classifier report: \n",classification_report(y_test,k_means_predict,target_names=target_names))
#testing
# print("\nAccuracy score of the model is:\n",accuracy_score(y_test, k_means_predict)*100)
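
One plausible reason for the large swing in accuracy is that cluster ids carry no meaning: a run that labels the "poisonous" cluster as 1 instead of 0 scores close to the complement of a run that labels it the other way round. The sketch below uses made-up labels and a hypothetical helper `align_clusters` to show how mapping each cluster to its majority class removes that effect. In practice the mapping would be learned on the training set and then applied to the test predictions.

```python
from collections import Counter

def align_clusters(cluster_ids, true_labels):
    # map every cluster id to the class label that occurs most often inside it
    votes = {}
    for cluster_id, label in zip(cluster_ids, true_labels):
        votes.setdefault(cluster_id, Counter())[label] += 1
    return {cluster_id: counts.most_common(1)[0][0]
            for cluster_id, counts in votes.items()}

def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# hypothetical example: the clustering is good, but the ids are flipped
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]
cluster_ids = [1, 1, 1, 0, 0, 0, 0, 0]

raw_accuracy = accuracy(cluster_ids, true_labels)
mapping = align_clusters(cluster_ids, true_labels)
aligned = [mapping[c] for c in cluster_ids]
aligned_accuracy = accuracy(aligned, true_labels)
print(raw_accuracy, mapping, aligned_accuracy)
```

The same clustering scores 12.5% when the ids are compared directly and 87.5% after alignment, which mirrors the kind of gap reported in the note above.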

## SVM CLASSIFIER
A support vector machine constructs a hyperplane between the classes and can be used for both classification and regression problems. Good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class; in general, the larger the margin, the lower the generalization error of the classifier.



#### **MODEL SUMMARY**:


- In the training model, the 'sigmoid' kernel is used, since the target values are distributed as 0s and 1s.

In [None]:
#calling the svm classifier from sklearn
from sklearn.svm import SVC
# setting the classifier parameters
svm = SVC(kernel= 'sigmoid',gamma='scale',probability=True)
#Fitting SVM to training set
svm.fit(X_train, y_train)
#Predicting values on test set
svm_predict = svm.predict(X_test)
#report the results
print("\nSVM confusion matrix: \n",confusion_matrix(y_test, svm_predict))
print("\nSVM classification report: \n",classification_report(y_test,svm_predict,target_names=target_names))
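
The classification reports in this notebook are derived from the confusion matrix. As a reminder of how the per-class numbers are obtained, the sketch below computes precision and recall by hand. The counts in `matrix` are hypothetical and are not results from this notebook; rows are true classes and columns are predicted classes, the layout scikit-learn uses.

```python
def per_class_metrics(matrix, class_index):
    # rows = true class, columns = predicted class
    true_positive = matrix[class_index][class_index]
    false_negative = sum(matrix[class_index]) - true_positive
    false_positive = sum(row[class_index] for row in matrix) - true_positive
    predicted_total = true_positive + false_positive
    actual_total = true_positive + false_negative
    precision = true_positive / predicted_total if predicted_total else 0.0
    recall = true_positive / actual_total if actual_total else 0.0
    return precision, recall

# hypothetical counts: class 0 = poisonous, class 1 = edible
matrix = [[90, 10],
          [5, 95]]

precision, recall = per_class_metrics(matrix, 0)
print(precision, recall)
```

For this problem the recall of the poisonous class deserves particular attention, since a poisonous mushroom predicted as edible is the costly mistake.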

## DECISION TREE
The decision tree algorithm uses a tree-like model of decisions and their possible consequences, and is used broadly in classification and regression problems. It predicts the value of a target variable by learning simple decision rules from the data features, much like if-else statements in programming. A decision tree also indicates which features matter to the classifier's decisions, by showing how much each encoded feature contributes to the classifier's accuracy.


#### **MODEL SUMMARY**:

- In the training model, the maximum depth of the tree is set to 3, since the features are sparse and have less effect on the tree beyond this depth; the other parameters are left at their default settings.

- Most important features are: [spore-print-color_h, gill-size_n, and ring-number_o].

- The model achieved 97% accuracy with these settings.

In [None]:
#calling the decision tree classifier from sklearn and graphviz for visuals
from sklearn.tree import DecisionTreeClassifier, export_graphviz
# setting the classifier parameters
dtree = DecisionTreeClassifier(max_depth=3)
#Fitting decision tree to training set
dtree.fit(X_train, y_train)
#Predicting values on test set
dtree_predict = dtree.predict(X_test)
#report the results
print("\nDecision tree confusion matrix: \n",confusion_matrix(y_test, dtree_predict))
print("\nDecision tree classification report: \n",classification_report(y_test,dtree_predict,target_names=target_names))
#test
# print(accuracy_score(y_test,dtree_predict)) #raw_score
dtree_viz = export_graphviz(dtree, out_file=None, 
                         feature_names=list(features.columns), # plain list of column names
                         filled=True, rounded=True,  
                         special_characters=True,
                         impurity=True,proportion=True,
                         rotate=True,node_ids=True,
                         class_names=['Poisonous','Edible'])  
import pydotplus #render graphviz dot data as png
# Draw graph
graph = pydotplus.graph_from_dot_data(dtree_viz)  

from IPython.display import Image  
# Show the graph as a png image
Image(graph.create_png())
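
The impurity values shown in the tree nodes come from the split criterion, which defaults to Gini impurity in scikit-learn's `DecisionTreeClassifier`. The sketch below, on made-up labels and two hypothetical binary features, shows how a candidate split is scored: the lower the weighted impurity of the two child nodes, the better the split.

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(feature, labels):
    # weighted impurity of the two child nodes created by a 0/1 feature
    left = [label for value, label in zip(feature, labels) if value == 0]
    right = [label for value, label in zip(feature, labels) if value == 1]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# hypothetical labels and one-hot features
labels = [0, 0, 0, 1, 1, 1]
perfect_feature = [1, 1, 1, 0, 0, 0]   # separates the classes completely
useless_feature = [0, 1, 0, 1, 0, 1]   # barely related to the classes

print(gini(labels))
print(split_gini(perfect_feature, labels))
print(split_gini(useless_feature, labels))
```

A feature such as the first one would be chosen near the root of the tree, which is the sense in which the tree ranks features by importance.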

## CATBOOST
CatBoost is a machine learning algorithm, created by Yandex, that uses gradient boosting on decision trees. It includes options that help guard against overfitting, and Yandex uses it for ranking tasks, forecasting and making recommendations.


#### **MODEL SUMMARY**:


- The model achieved 100% accuracy with the default parameters and without any tuning.

In [None]:
from catboost import CatBoostClassifier, Pool
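#catboost model; use_best_model keeps the iteration that scores best on eval_set, which helps limit overfitting
#note: eval_set here is the test set, so the reported accuracy may be optimistic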
catboost = CatBoostClassifier(eval_metric='Accuracy',use_best_model=True)
catboost.fit(X_train,y_train,use_best_model=True,eval_set=(X_test,y_test),verbose=False)
catboost_predict = catboost.predict(X_test)
#report the results
print("Catboost confusion matrix: \n",confusion_matrix(y_test, catboost_predict))
print("Catboost classification report: \n",classification_report(y_test,catboost_predict,target_names=target_names))

## DEEP LEARNING

In this section two deep learning approaches are used. The first comes from the scikit-learn module and is known as the multilayer perceptron (MLP); the second comes from the Keras module and is a neural network built with the Sequential model of dense layers. The parameters of both models have been set as close to each other as possible.

### SCIKIT-LEARN IMPLEMENTATION<BR>

The MLP classifier is a feed-forward neural network with an input layer, an output layer, and two or more layers of trainable weights (consisting of perceptrons). The scikit-learn MLP implementation is not intended for large-scale applications and offers no GPU support. This model is configured with a single hidden layer of size 50 and trained for 10 iterations. The Adam optimizer was chosen: Adam automatically adapts the size of the parameter updates and is also appropriate for non-stationary problems.



#### **MODEL SUMMARY**:


- MLP classifier achieved 99% accuracy with only 10 iterations.

In [None]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(50,),
                    activation='logistic',
                    max_iter=10,
                    solver='adam',
                    verbose=True)
mlp.fit(X_train, y_train)
mlp_predict = mlp.predict(X_test)
print("\nMLP confusion matrix: \n",confusion_matrix(y_test, mlp_predict))
print("\nMLP classification report: \n",classification_report(y_test,mlp_predict))

### KERAS IMPLEMENTATION<BR>


Keras is a high-level neural network API, written in Python and capable of running on top of TensorFlow. It offers CPU and GPU computation support and is widely used for text and image machine learning applications such as classification and object detection. The following Keras model is set up as similarly as possible to the MLP classifier from the scikit-learn module. As the loss function, 'sparse_categorical_crossentropy' was selected, since the two classes are encoded as integer labels. The Adam optimizer was selected since it requires minimal tuning to work, and the number of iterations was reduced to the minimum.


#### **MODEL SUMMARY**:


- The Keras implementation achieved 100% accuracy with half the number of iterations used by the MLP classifier from the scikit-learn module.

In [None]:
from tensorflow import keras
# input size is taken from the data (97 one-hot columns after encoding)
model = keras.Sequential([keras.layers.Dense(50, input_shape=(X_train.shape[1],)),
                          keras.layers.Dense(2, activation='sigmoid')])
# model.summary()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
model.fit(X_train, y_train,epochs=5, verbose=1)
# predict_classes is not available in recent TensorFlow releases; take the argmax of the class scores instead
keras_pred = np.argmax(model.predict(X_test), axis=1)
# true labels go first, predictions second
print('\nKeras confusion matrix:\n',confusion_matrix(y_test, keras_pred))
print('\nKeras classification Report:\n',classification_report(y_test, keras_pred,target_names=target_names))

# CONCLUSION


Mushroom classification is one of the best-known examples of a categorical machine learning problem, and most of the models introduced here performed very well without any parameter tuning. K-means clustering, however, is not recommended as a classifier for such sparse data, since it misclassified many values and its accuracy is unstable (between 13% and 85%). The support vector machine model achieved 94% accuracy, followed by the decision tree classifier with 97% accuracy. The CatBoost algorithm, which is used mainly in Kaggle competitions for related problems, achieved 100% accuracy on our dataset. Deep learning was represented by two different modules: the MLP classifier from scikit-learn, which achieved 99% accuracy after 10 iterations, and the Keras framework, which used the same model structure as the MLP with half the number of iterations and achieved 100% accuracy. This suggests that gradient boosting and neural networks give promising results in such cases.


# ENDING NOTE
Thanks to the many Kaggle notebooks that made this work possible and made access to information easier. Please do not hesitate to add to this work or comment. If the accuracy results are a little different in the Kaggle environment, please note that the comments were written according to the results on my local machine.