**In this notebook, mushrooms are classified as either edible or poisonous; edible is denoted by 'e' and poisonous by 'p' in the data set. Each character in the data set has a specific meaning. For example, in the cap-shape column, bell = b, conical = c, flat = f, sunken = s.**
**This checklist shows what each character means:**  
cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
bruises?: bruises=t,no=f
odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
gill-attachment: attached=a,descending=d,free=f,notched=n
gill-spacing: close=c,crowded=w,distant=d
gill-size: broad=b,narrow=n
gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
stalk-shape: enlarging=e,tapering=t
stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
veil-type: partial=p,universal=u
veil-color: brown=n,orange=o,white=w,yellow=y
ring-number: none=n,one=o,two=t
ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d

# Importing libraries

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.express as px

In [None]:
mush = pd.read_csv("../input/mushroom-classification/mushrooms.csv")
mush

# Exploratory Data Analysis

In [None]:
mush.head(10)

In [None]:
mush.tail()

In [None]:
mush.info()

In [None]:
mush["class"].unique()

In [None]:
mush.dtypes

In [None]:
mush.shape

In [None]:
mush.isnull().sum()

In [None]:
mush.isna().sum()

In [None]:
#shows description of the text data
mush.describe()

**Now we will use label encoding, a preprocessing technique that assigns an integer to each distinct value of a feature, using the LabelEncoder class from the scikit-learn library**

In [None]:
from sklearn.preprocessing import LabelEncoder
mush_encoded = mush.copy()
le = LabelEncoder()
for col in mush_encoded.columns:
  mush_encoded[col] = le.fit_transform(mush_encoded[col])

mush_encoded.head(15)
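As a small illustration of what the encoder does, here is a minimal sketch on a toy column (the `toy_` names are hypothetical and not part of the data set). It suggests that LabelEncoder numbers the labels in sorted order, so 'e' becomes 0 and 'p' becomes 1:

```python
from sklearn.preprocessing import LabelEncoder

# Toy column using the same single-letter codes as the 'class' column (assumed example data)
toy_class = ["p", "e", "e", "p", "e"]

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_class)

# classes_ holds the original labels in sorted order; the position is the integer code
toy_mapping = {label: code for code, label in enumerate(toy_le.classes_)}
print(toy_mapping)
print(toy_encoded.tolist())
```

Since the loop above refits the same `le` object on every column, only the mapping of the last column stays available in `le.classes_` afterwards.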

**In this data set, feature scaling is not much of a requirement, as the encoded features already have low variance**

In [None]:
mush_encoded.max()

In [None]:
mush_encoded.describe()

In [None]:
#shows the name of all the columns
mush_encoded.columns

# Visualizations
**Set Visualization Functions and Parameters**

In [None]:
import matplotlib.pylab as pylab
params = {'legend.fontsize': 'x-large',
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
pylab.rcParams.update(params)

In [None]:
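def plot_col(col, hue=None, color=['blue', 'purple'], labels=None):
    fig, ax = plt.subplots(figsize=(15, 7))
    # Pass the column as x= explicitly; recent seaborn versions treat the first positional argument as data
    sns.countplot(x=col, hue=hue, palette=color, saturation=0.6, data=mush_encoded, dodge=True, ax=ax)
    ax.set(title = f"Mushroom {col.title()} Quantity", xlabel=f"{col.title()}", ylabel="Quantity")
    if labels is not None:
        ax.set_xticklabels(labels)
    if hue is not None:
        # LabelEncoder sorts the codes, so class 0 is 'e' (edible) and class 1 is 'p' (poisonous)
        ax.legend(('Edible', 'Poisonous'), loc=0)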

In [None]:
#After label encoding, 0 is 'e' (edible) and 1 is 'p' (poisonous)
class_dict = ('Edible', 'Poisonous')
plot_col(col='class', labels=class_dict)

**Insight 1: Edible mushrooms slightly outnumber poisonous mushrooms in the data set**

In [None]:
#Visualizing the number of mushrooms for each of the available cap shapes
shape_dict = {"bell":"b","conical":"c","convex":"x","flat":"f", "knobbed":"k","sunken":"s"}
#Tick labels must follow the encoded order, i.e. the sorted codes b, c, f, k, s, x
labels = ('bell', 'conical', 'flat', 'knobbed', 'sunken', 'convex')
plot_col(col='cap-shape', hue='class', labels=labels)

**Insight 2: Convex caps are the most common shape, and flat caps are also present in significant quantity; for both shapes, each class has at least 1000 samples**

In [None]:
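#Visualizing the number of mushrooms for each cap color
#Palette color -> cap-color code, listed in encoded (sorted code) order; dictionary keys must be unique
color_dict = {"orange":"b", "black":"c", "red":"e", "violet":"g", "purple":"n",
              "pink":"p", "green":"r", "indigo":"u", "blue":"w", "yellow":"y"}
color_labels = ('buff', 'cinnamon', 'red', 'gray', 'brown', 'pink', 'green', 'purple', 'white', 'yellow')
plot_col(col='cap-color', color=list(color_dict.keys()), labels=color_labels)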

**Insight 3: Brown, gray and red are the most common cap colors, each with more than 1000 samples**

In [None]:
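#Visualizing the Mushroom Cap Surface Quantity
#Keys are listed in encoded order (sorted codes f, g, s, y) so the tick labels match the bars
surface_dict = {"fibrous":"f", "grooves":"g", "smooth":"s", "scaly":"y"}
plot_col(col='cap-surface', hue='class', labels=list(surface_dict.keys()))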

**Insight 4: Mushrooms with a grooved cap surface are very rare in the sample; the other cap-surface categories are present in considerable numbers**

**Number of Mushrooms based on Odor**

In [None]:
def get_labels(order, a_dict):    
    labels = []
    for values in order:
        for key, value in a_dict.items():
            if values == value:
                labels.append(key)
    return labels

In [None]:
odor_dict = {"almond":"a","anise":"l","creosote":"c","fishy":"y",
             "foul":"f","musty":"m","none":"n","pungent":"p","spicy":"s"}
#The encoded bars follow the sorted codes, so the labels are looked up in that order
order = sorted(odor_dict.values())
labels = get_labels(order, odor_dict)      
plot_col(col='odor', color=list(color_dict.keys()), labels=labels)
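Because LabelEncoder assigns integer codes in sorted order of the original letters, the tick labels only line up with the bars when they are listed in sorted-code order. A minimal sketch of deriving that order directly from a name-to-code dictionary (`labels_in_encoded_order` is a hypothetical helper, not part of the notebook):

```python
def labels_in_encoded_order(name_to_code):
    """Return category names ordered by their sorted codes (hypothetical helper)."""
    return [name for name, code in sorted(name_to_code.items(), key=lambda item: item[1])]

# Same name -> code pairs as the cap-surface checklist entry
surface_dict = {"smooth": "s", "scaly": "y", "fibrous": "f", "grooves": "g"}
surface_labels = labels_in_encoded_order(surface_dict)
print(surface_labels)
```

The result can be passed as `labels` to `plot_col` regardless of the order in which the dictionary was written.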

**Visualization using Plotly**

In [None]:
labels = ['Edible', 'Poison']
values = mush_encoded['class'].value_counts()

fig=go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(colors=['#87CEFA', '#7FFF00'],
                              line=dict(color='#FFFFFF',width=3)))
fig.show()

**Insight 5: There are 4208 samples of edible mushrooms and 3916 samples of poisonous ones, i.e., a randomly picked sample has a nearly 50% (about 48%) chance of being poisonous**
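The share of poisonous samples follows directly from the two counts quoted above; a short worked check (the variable names are only illustrative):

```python
# Counts taken from the insight above
edible_count = 4208
poisonous_count = 3916

total_count = edible_count + poisonous_count
poisonous_share = poisonous_count / total_count

print(f"Total samples: {total_count}")
print(f"Poisonous share: {poisonous_share:.1%}")
```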

In [None]:
#Plot to understand the habitat of different mushrooms
labels = ['Woods', 'Grasses', 'Paths', 'Leaves', 'Urban', 'Meadows', 'Waste']
values = mush_encoded['habitat'].value_counts()
colors = ['#DEB887','#778899', '#B22222', '#FFFF00', 
          '#F8F8FF','#FFE4C4','#FF69B4']

fig=go.Figure(data=[go.Pie(labels=labels,
                           values=values,
                           #marker_colors=labels,
                           pull=[0.1, 0, 0, 0, 0.2, 0, 0])])
fig.update_traces(title='Mushrooms Habitat Percentage',
                  hoverinfo='label+value', 
                  textinfo='percent', 
                  opacity=0.9,
                  textfont_size=20,
                  marker=dict(colors=colors,
                             line=dict(color='#000000', width=0.1)),
                 )
fig.show()

In [None]:
colors = ['#DEB887','#f8f8ff','#778899', '#FF69B4','#FFFF00','#B22222','#FFE4C4','#F0DC82','#C000C5', '#228B22']
fig = px.histogram(mush_encoded, x='cap-color',
                   #class is label-encoded, so poisonous ('p') is 1
                   color_discrete_map={1:'#7FFF00'},
                   #opacity=0.8,
                   #pass the flat list of colors, not a list nested inside another list
                   color_discrete_sequence=colors,
                   #barmode='relative',
                   barnorm='percent',
                   color='class'
                  )
fig.update_layout(title='Percentage of Edible or Poisonous mushrooms Based on Cap Color',
                  xaxis_title='Cap Color',
                  yaxis_title='Percentage',
                 )

fig.show()

In [None]:
labels = ['Brown', 'Gray', 'Red', 'Yellow', 'White', 'Buff', 'Pink', 
          'Cinnamon', 'Purple', 'Green']
values = mush_encoded['cap-color'].value_counts()
colors = ['#DEB887','#778899', '#B22222', '#FFFF00', 
          '#F8F8FF','#FFE4C4','#FF69B4','#F0DC82','#C000C5', '#228B22']

fig=go.Figure(data=[go.Pie(labels=labels,
                           values=values,
                           #marker_colors=labels,
                           pull=[0, 0, 0, 0, 0.2, 0, 0, 0, 0, 0])])
fig.update_traces(title='Mushrooms Color Quantity',
                  hoverinfo='label+percent', 
                  textinfo='value',
                  opacity=0.9,
                  textfont_size=20,
                  marker=dict(colors=colors,
                             line=dict(color='#000000', width=0.1)),
                 )
fig.show()

**Now we will split the data set into train data and test data to apply machine learning**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
#class column is taken as a numpy array
y = mush_encoded["class"].values
#All the features are separated from our target value or label and stored in x
x = mush_encoded.drop(["class"],axis=1)
#Finally split the data into train and test set
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42,test_size = 0.25)
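The split above is purely random. As a minimal sketch on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins, not the mushroom data), the `stratify` argument of `train_test_split` can be used to keep the class proportions the same in the train and test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 52 of class 0 and 48 of class 1
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 5, size=(100, 3))
y_demo = np.array([0] * 52 + [1] * 48)

# stratify=y_demo asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("class-1 share overall:", y_demo.mean())
print("class-1 share in test:", y_te.mean())
```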

**Now we will train different classification models on the training data and compare how accurately they predict whether the mushrooms in the test data are edible or poisonous**

# Logistic Regression Classification

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver="newton-cg")
lr.fit(x_train,y_train)
print("Test Accuracy: {}%".format(round(lr.score(x_test,y_test)*100,2)))

# KNN Classification

In [None]:
from sklearn.neighbors import KNeighborsClassifier
best_Kvalue = 0
best_score=0
for i in range(1,10):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train,y_train)
    #score on the test set once and keep that same value, so the reported accuracy is a test accuracy
    score = knn.score(x_test,y_test)
    if score > best_score:
        best_score = score
        best_Kvalue = i
print("""Best KNN Value: {}
Test Accuracy: {}%""".format(best_Kvalue, round(best_score*100,2)))
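Choosing the number of neighbors by its test score means the test set also takes part in model selection. One common alternative, sketched here on synthetic data from `make_classification` (all names are hypothetical and this is not the notebook's method), is to pick k by cross-validation on the training set and to use the test set only once at the end:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the mushroom data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

best_k, best_cv_score = 0, 0.0
for k in range(1, 10):
    # Mean accuracy over 5 folds of the training data only
    cv_score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
    if cv_score > best_cv_score:
        best_k, best_cv_score = k, cv_score

# Refit with the chosen k and evaluate once on the held-out test set
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final_knn.score(X_te, y_te)
print(f"Best k: {best_k}, CV accuracy: {best_cv_score:.3f}, test accuracy: {test_accuracy:.3f}")
```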

# SVM Classification

In [None]:
from sklearn.svm import SVC
svm = SVC(random_state=42, gamma="auto")
svm.fit(x_train,y_train)
print("Test Accuracy: {}%".format(round(svm.score(x_test,y_test)*100,2)))

# Naive Bayes Classification

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train,y_train)
print("Test Accuracy: {}%".format(round(nb.score(x_test,y_test)*100,2)))


# Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)
print("Test Accuracy: {}%".format(round(dt.score(x_test,y_test)*100,2)))

# Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(x_train,y_train)
print("Test Accuracy: {}%".format(round(rf.score(x_test,y_test)*100,2)))

**Only Naive Bayes and Logistic Regression give less than 100% accuracy on the test data. Now we will check their classification results with a confusion matrix to see whether there are any false negatives or false positives**

In [None]:
from sklearn.metrics import confusion_matrix
y_pred_lr = lr.predict(x_test)
y_true_lr = y_test
cm = confusion_matrix(y_true_lr, y_pred_lr)
f, ax = plt.subplots(figsize =(5,5))
sns.heatmap(cm,annot = True,linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
plt.xlabel("y_pred_lr")
plt.ylabel("y_true_lr")
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix
y_pred_nb = nb.predict(x_test)
y_true_nb = y_test
cm = confusion_matrix(y_true_nb, y_pred_nb)
f, ax = plt.subplots(figsize =(5,5))
sns.heatmap(cm,annot = True,linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
plt.xlabel("y_pred_nb")
plt.ylabel("y_true_nb")
plt.show()
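The heatmaps show the four cells of the confusion matrix. As a minimal sketch on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical), the same cells can be unpacked into named counts. With the encoding used here (0 = edible, 1 = poisonous), a false negative is a poisonous mushroom predicted as edible:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration: 0 = edible, 1 = poisonous
y_true_demo = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true_demo, y_pred_demo).ravel())
poisonous_recall = recall_score(y_true_demo, y_pred_demo)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Recall for the poisonous class: {poisonous_recall:.2f}")
```

For this task the false negatives are the costly errors, so recall on the poisonous class is worth reporting next to accuracy.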

# Conclusion
**The confusion matrices show that the two classes are roughly balanced in the test data and that the models predict well; most of the classification methods scored 100% on the test data**