**Mushroom Classification - *UCI Machine Learning Problem***

In this project, we predict whether a mushroom is edible or poisonous. The dataset has 22 features and one target variable named 'class'. We train a model on this data to predict the class to which a mushroom belongs.

In [None]:
# Importing the important libraries.
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
ds = pd.read_csv('../input/mushroom-classification/mushrooms.csv') # reading the dataset
ds

*Dealing with Null values in 'stalk-root' column*

In the 'stalk-root' column of the dataset, 2480 missing values are recorded as a '?' sign. We replace these with the most frequent value (the mode) of that column so that no missing values remain in the dataset.

In [None]:
loc = np.where(ds['stalk-root'] == '?') # finding the locations where '?' is present
np.shape(loc) # shape is (1, n), where n is the number of rows containing '?'

In [None]:
ds['stalk-root'] = ds['stalk-root'].replace('?', np.nan) # replacing the '?' values with NaN values (assigned back rather than using inplace on a column).
ds['stalk-root'].isnull().sum()

As we can see, the '?' values have been replaced with NaN in the 'stalk-root' column. We can now replace these null values with the mode of the column using SimpleImputer.

In [None]:
from sklearn.impute import SimpleImputer
si = SimpleImputer(missing_values = np.nan, strategy = "most_frequent")
ds['stalk-root'] = si.fit_transform(ds['stalk-root'].values.reshape(-1,1)).ravel() # flattening the (n, 1) result back into a single column
ds['stalk-root'].isnull().sum() # checking for the null values again.
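
The same replace-then-fill idea can also be sketched with pandas alone. The snippet below is only an illustration on a made-up toy column (the values and the names `toy`, `most_common` and `filled` are assumptions, not part of the dataset), and it swaps SimpleImputer for `Series.mode` and `Series.fillna`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'stalk-root' column; the values are made up for illustration.
toy = pd.Series(['b', '?', 'c', 'b', '?', 'e'], name='stalk-root')

# Treat '?' as missing, then fill the gaps with the most frequent remaining value.
toy = toy.replace('?', np.nan)
n_missing = int(toy.isna().sum())
most_common = toy.mode()[0]  # mode() ignores NaN by default
filled = toy.fillna(most_common)

print(n_missing, most_common, filled.tolist())
```

Both routes should give the same column here; SimpleImputer is the more convenient choice when the imputation has to be repeated on new data later.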

We now have a classification problem in which every feature is categorical.
We can check the unique values of each feature in the dataset.

In [None]:
clist = ds.columns.values
for i in range(0, len(clist)):
    print(f" {i + 1} The unique values in feature {clist[i]} are {ds[clist[i]].unique()}") 
# Here we have all the unique values in every feature of dataset.

**Encoding the Categorical Values**

Now that we have handled the null values and inspected the unique values of the features, we convert the categorical data into numeric data using LabelEncoder so that the model can learn and predict the target.

In [None]:
from sklearn.preprocessing import LabelEncoder
# Encoding the categorical values to numerical values.
le = LabelEncoder()
for i in range(len(clist)): # this loop will encode values in the features of the dataset one by one 
    ds[clist[i]] = le.fit_transform(ds[clist[i]])
    print(f" {i + 1} The unique values in feature {clist[i]} are {ds[clist[i]].unique()}") 
    # printing the final unique values after encoding
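
To see which integer each category receives, here is a small sketch on a made-up two-column frame (the data and the names `toy`, `encoders` and `encoded` are hypothetical). LabelEncoder sorts the labels it sees, so 'e' becomes 0 and 'p' becomes 1. The sketch keeps one encoder per column, which is one possible way to make the codes reversible later; a single encoder refitted in a loop only remembers the last column it saw.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; only the letters mimic the style of the mushroom data.
toy = pd.DataFrame({'class': ['p', 'e', 'e', 'p'],
                    'odor': ['f', 'n', 'a', 'n']})

encoders = {}          # one fitted encoder per column, so codes can be decoded later
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

print(list(encoders['class'].classes_))   # sorted labels; position = assigned code
print(encoded['class'].tolist())
print(list(encoders['odor'].inverse_transform(encoded['odor'])))
```

The integer codes follow alphabetical order only, so they carry no real ordering between categories.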

All the values are now encoded as integers, so we can proceed with the learning.

**EDA**

Now we perform some visualisations to get insights into the data. First we check whether the data is balanced or imbalanced.

In [None]:
x = ds # alias for the encoded dataset (this does not make a copy)
x0 = x[x['class'] > 0] # Rows where class is 1, i.e. Poisonous
x1 = x[x['class'] == 0] # Rows where class is 0, i.e. Edible
import plotly.graph_objects as go
fig = go.Figure()
fig.add_traces(go.Histogram(x = x0['class'], name='Poisonous', xbins = dict(size=0.5),
                            marker_color='darkred', opacity=0.75))
fig.add_traces(go.Histogram(x = x1['class'], name='Edible', xbins = dict(size=0.5),
                            marker_color='forestgreen', opacity=0.75))
fig.update_layout(title_text="Mushroom's Class", xaxis_title_text='Value', yaxis_title_text='Count')
fig.show()

From the above plot, it is clear that the **dataset is balanced**. The *Poisonous class* has *3916* rows and the *Edible class* has *4208* rows, so resampling the data is not required.
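
The balance can also be checked numerically. The sketch below rebuilds a label column from the two counts quoted above (the Series itself is a stand-in, not the real data) and uses `value_counts` to get the share of each class.

```python
import pandas as pd

# Class counts quoted in the text: 4208 edible (0) and 3916 poisonous (1).
labels = pd.Series([0] * 4208 + [1] * 3916, name='class')

counts = labels.value_counts()                 # absolute counts per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class

print(counts.to_dict())
print(shares.round(3).to_dict())
```

With roughly 52% edible and 48% poisonous rows, neither class dominates.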

We now check the variance of all the features to see whether any column has low variance. A feature that holds a single constant value has zero variance and is useless for training the model.

In [None]:
ds.var() # variance of the dataset

From the above values, we see that the variance of the 'veil-type' column is 0, so all the values in this column are the same. This column can be dropped from the dataset as it will not help the model learn.

In [None]:
ds.drop(['veil-type'], axis = 1, inplace = True) # removing from the dataset.
ds.shape # Checking the dimensions of the dataset after dropping the column.
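
Constant columns can also be found without reading the variance table by eye. This is a minimal sketch on a made-up frame (the values and the names `toy`, `constant_cols` and `reduced` are assumptions); it counts distinct values per column with `nunique` and drops every column that has only one.

```python
import pandas as pd

# Made-up encoded frame in which 'veil-type' never changes.
toy = pd.DataFrame({
    'veil-type': [0, 0, 0, 0],
    'odor': [1, 2, 0, 2],
    'habitat': [3, 3, 1, 0],
})

n_unique = toy.nunique()                              # distinct values per column
constant_cols = n_unique[n_unique == 1].index.tolist()
reduced = toy.drop(columns=constant_cols)

print(constant_cols, reduced.shape)
```

Unlike the variance, this check also works before encoding, while the columns still hold letters.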

**Checking the Correlation - Features and Class**

We check the correlation of all the columns with the target feature.

In [None]:
cor_mat = ds.corr()
cor_mat
plt.figure(figsize = (20, 12)) # changing the figure size so that we can analyse better
sb.heatmap(cor_mat, annot = True)

From the above correlation matrix, we can check the correlation between 'class' and all the other features of the dataset.
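
When only the relation to the target matters, the 'class' column of the correlation matrix can be pulled out and ranked. The sketch below uses a tiny made-up frame (the `feat_*` columns are hypothetical) so the values are easy to verify by hand.

```python
import pandas as pd

# Tiny made-up frame: feat_a copies the class, feat_c mirrors it, feat_b is unrelated.
toy = pd.DataFrame({
    'class':  [0, 0, 1, 1],
    'feat_a': [0, 0, 1, 1],
    'feat_b': [1, 0, 1, 0],
    'feat_c': [1, 1, 0, 0],
})

# Correlation of every feature with the target, strongest (by absolute value) first.
target_corr = toy.corr()['class'].drop('class')
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)

print(ranked)
```

Because the integer codes of nominal categories are arbitrary, these coefficients are best read as a rough guide rather than a precise measure of association.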

Since the *values are all categorical*, **we do not check the features for outliers or skewness**.

We can now proceed with splitting the data and fitting it to various models to compare their performance and select the best model for the dataset.

**Model Fitting and Tuning**

We start by splitting the dataset into input variables (features) and the target variable.

In [None]:
x = ds.loc[:, 'cap-shape':'habitat'] # Features
y = ds.loc[:, 'class'] # Target
print(x.shape, y.shape) # Dimensions of Features and Target

In [None]:
# Importing all the important models, methods and classes.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [None]:
from sklearn.model_selection import train_test_split

We start by searching for the random state that gives the best train/test split. Any model can be used for this search, since we compare the performance of all the models later anyway.

In [None]:
max_accuracy = 0
best_rs = 0
for i in range(1, 200):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state = i)
    lg = LogisticRegression()
    lg.fit(x_train, y_train)
    pred = lg.predict(x_test)
    acc = accuracy_score(y_test, pred)
    if acc > max_accuracy: # keep the best accuracy seen so far and the random state that produced it
        max_accuracy = acc
        best_rs = i
print(f"Best accuracy is {max_accuracy} and best random state is {best_rs}")

From the above, we see that the best random state is 54, so we use it for splitting the data. Now we split the data into training and testing sets with this random state.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state = 54)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape) # The dimensions of training and testing data
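
Since the search above keeps whichever seed scores best on its own test split, the maximum it reports can be optimistic. As a hedged complement, the sketch below looks at the spread of accuracy across several seeds. It runs on synthetic data from `make_classification`, not on the mushroom dataset, and the names `X_demo`, `y_demo` and `scores` are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the mushroom dataset.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

scores = []
for seed in range(10):
    # stratify keeps the class proportions equal in the train and test parts
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.30, random_state=seed, stratify=y_demo)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

scores = np.array(scores)
print(f"mean={scores.mean():.3f} std={scores.std():.3f} best={scores.max():.3f}")
```

The gap between the mean and the best value gives a feel for how much of a single split's score is down to the seed.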

**Finding the best Model** - 

Now, we fit different classification models on the training data and check their respective accuracies on the testing data. A low accuracy can mean that the model is underfitting, while a very high accuracy on a single split may be a sign of overfitting or of a favourable split. We check for underfitting and overfitting using cross validation.

In [None]:
rfc = RandomForestClassifier(n_estimators = 100) # making the instance of RandomForestClassifier class
rfc.fit(x_train, y_train) # fitting the model
pred_rfc = rfc.predict(x_test) # predicting the values
print("Accuracy Score of RFC model is", accuracy_score(y_test, pred_rfc))

dtc = DecisionTreeClassifier() # making the instance of DecisionTreeClassifier class
dtc.fit(x_train, y_train) # fitting the model
pred_dtc = dtc.predict(x_test) # predicting the values
print("Accuracy Score of DTC model is", accuracy_score(y_test, pred_dtc))

nb = MultinomialNB() # making the Multinomial Naive Bayes class
nb.fit(x_train, y_train) # fitting the model
pred_nb = nb.predict(x_test) # predicting the values
print("Accuracy Score of MNB model is", accuracy_score(y_test, pred_nb))

knc = KNeighborsClassifier(n_neighbors = 5) # making the K-Nearest Neighbor Classifier class. By default value of nn is 5
knc.fit(x_train, y_train) # fitting the model
pred_knc = knc.predict(x_test) # predicting the values
print("Accuracy Score of KNN model is", accuracy_score(y_test, pred_knc))

svc = SVC(kernel = 'rbf') # making the Support Vector Machine Classifier class. By default, the kernel is set to RBF.
svc.fit(x_train, y_train) # fitting the model
pred_svc = svc.predict(x_test) # predicting the values
print("Accuracy Score of svc model is", accuracy_score(y_test, pred_svc))

ada= AdaBoostClassifier()
ada.fit(x_train, y_train) # fitting the model
pred_ada = ada.predict(x_test) # predicting the values
print("Accuracy Score of ADA model is", accuracy_score(y_test, pred_ada))

lg = LogisticRegression() # making the instance of LogisticRegression class
lg.fit(x_train, y_train) # fitting the model
pred_lg = lg.predict(x_test) # predicting the values
print("Accuracy Score of LG model is", accuracy_score(y_test, pred_lg))

**Cross Validating - Checking Underfitting or Overfitting**

Now we cross validate the different models and compare each mean cross validation score with the accuracy score from above to find out which model actually gives the best result. This helps us judge whether the model accuracy is real or whether the model is *underfitting or overfitting*.

In [None]:
from sklearn.model_selection import cross_val_score
rfc_scores = cross_val_score(rfc, x, y, cv = 5) # cross validating the model
print(f"Mean of accuracy for RFC is {rfc_scores.mean()}")

dtc_scores = cross_val_score(dtc, x, y, cv = 5) # cross validating the model
print(f"Mean of accuracy for DTC is {dtc_scores.mean()}")

nb_scores = cross_val_score(nb, x, y, cv = 5) # cross validating the model
print(f"Mean of accuracy for NB is {nb_scores.mean()}")

knc_scores = cross_val_score(knc, x, y, cv = 5) # cross validating the model
print(f"Mean of accuracy for KNC is {knc_scores.mean()}")

svc_scores = cross_val_score(svc, x, y, cv = 5) # cross validating the model
print(f"Mean of accuracy for SVC is {svc_scores.mean()}")

ada_scores = cross_val_score(ada, x, y, cv = 5) # cross validating the model
print(f"Mean of accuracy for ADA is {ada_scores.mean()}")

lg_scores = cross_val_score(lg, x, y, cv = 5) # cross validating the model
print(f"Mean of accuracy for LG is {lg_scores.mean()}")

From the above cross validation scores, we find that the mean accuracy for the Decision Tree Classifier model is 0.92, while the accuracy given by fitting is 1. It has the least difference between the original accuracy and the mean cross validation accuracy. Hence, we select the Decision Tree Classifier as our model.
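
The comparison between held-out accuracy and cross validation accuracy can be collected in one loop. The following is a sketch on synthetic data from `make_classification` (not the mushroom dataset), with two of the models used above; `X_demo`, `y_demo`, `models` and `summary` are placeholder names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the mushroom dataset.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

models = {
    'DTC': DecisionTreeClassifier(random_state=0),
    'KNC': KNeighborsClassifier(n_neighbors=5),
}

summary = {}
for name, model in models.items():
    holdout = model.fit(X_tr, y_tr).score(X_te, y_te)        # accuracy on one split
    cv_scores = cross_val_score(model, X_demo, y_demo, cv=5)  # accuracy on 5 folds
    summary[name] = (holdout, cv_scores.mean(), cv_scores.std())
    print(f"{name}: holdout={holdout:.3f} cv_mean={cv_scores.mean():.3f} cv_std={cv_scores.std():.3f}")
```

For classifiers, an integer `cv` uses stratified folds without shuffling, so a dataset whose rows are ordered can give uneven fold scores; the standard deviation helps to spot that.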

**Hyper Parameter Tuning** - 

Now that we have selected the decision tree classifier as our model, we select the parameters that give the best possible metrics. We use *GridSearchCV for the tuning of the model*.

In [None]:
from sklearn.model_selection import GridSearchCV
dtc = DecisionTreeClassifier() # making the instance of class.
# defining the parameters in the dictionary.
parameters = { 'criterion' : ['gini', 'entropy'], 'max_depth': [1,2,3,4,5,6,7,8,9,10,11,12,15,16,17,18,19,20]}
gs = GridSearchCV(estimator = dtc, param_grid = parameters, scoring = 'f1', cv = 5)
gs.fit(x_train, y_train)
print(f"The best possible score after tuning is : {gs.best_score_}")
print(f"The best parameters for the model given after tuning are : {gs.best_params_}")
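
After the search, the tuned model does not have to be rebuilt by hand. With the default `refit=True`, GridSearchCV refits the best parameter combination on the data passed to `fit` and exposes it as `best_estimator_`. The sketch below shows this on synthetic data (not the mushroom dataset) with a smaller, made-up grid; `X_demo`, `y_demo`, `search` and `best_tree` are placeholder names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the mushroom dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

grid = {'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 6]}  # reduced grid for the demo
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid=grid, scoring='f1', cv=5)
search.fit(X_tr, y_tr)

best_tree = search.best_estimator_   # already refit on X_tr with the best parameters
test_accuracy = best_tree.score(X_te, y_te)

print(search.best_params_, round(search.best_score_, 3))
print(test_accuracy)
```

Using `best_estimator_` avoids copying the parameters by hand, which matters if the search is rerun and the best values change.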

Now that we have the *best possible parameters* for the model, we fit the model with the final parameters and compute all the metrics on that model.

**Final Fitting of the model** -

In [None]:
dtc = DecisionTreeClassifier(criterion = 'gini', max_depth = 7)
dtc.fit(x_train, y_train) # fitting the model
print(f"Learning score of model : {dtc.score(x_train, y_train)}") # accuracy of the model on the training data
pred_dtc = dtc.predict(x_test) # predicting the values
# Now performing some metrics to test the fitted model.
print("Accuracy Score of Decision Tree Classifier model is", accuracy_score(y_test, pred_dtc))
print("Confusion matrix for the Decision Tree Classifier model is : ")
print(confusion_matrix(y_test, pred_dtc))
print("Classification Report of the Decision Tree Classifier model is")
print(classification_report(y_test, pred_dtc))

From the above metrics, we find that the **f1-score of our model is 1.0, or 100%**. The **precision and recall of the model are also 1.0**. This means that the *model predicts every value in the test set correctly*.
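
To make the link between the confusion matrix and these scores concrete, here is a worked example with hypothetical counts (they are not the results of this notebook). It assumes scikit-learn's layout, in which rows are the true classes and columns are the predicted classes, and treats class 1 (Poisonous) as the positive class.

```python
# Hypothetical confusion matrix, laid out as [[TN, FP], [FN, TP]]
# for labels 0 (Edible) and 1 (Poisonous).
cm = [[50, 2],
      [3, 45]]
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)   # of the mushrooms predicted poisonous, how many are
recall = tp / (tp + fn)      # of the truly poisonous mushrooms, how many are caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.2f}")
```

In this setting the false negatives (poisonous mushrooms predicted as edible) are the costly errors, so recall for class 1 deserves the closest look.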

*We can now successfully predict, for any new mushroom data, whether the mushroom is Edible (0) or Poisonous (1).*

**Serialisation - Saving the model.**

We save the model so that predictions can be carried out on different types of mushrooms and one can safely find out the class of a mushroom.

In [None]:
import joblib
joblib.dump(dtc, 'mushroom prediction model.obj')
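
A saved model is read back with `joblib.load`. The sketch below shows the round trip on a small tree trained on synthetic data, written to a temporary folder; the file name `demo_model.joblib` and the variables `tree` and `restored` are placeholders, not the objects from this notebook.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the mushroom dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
tree = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.joblib')  # hypothetical file name
    joblib.dump(tree, path)       # write the fitted model to disk
    restored = joblib.load(path)  # read it back as a new object

# The restored model should give exactly the same predictions as the original.
same_predictions = bool((restored.predict(X_demo) == tree.predict(X_demo)).all())
print(same_predictions)
```

New raw data would also need to be encoded to the same integers as the training data before calling `predict`, so the fitted encoders are worth saving alongside the model.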