**MUSHROOMS CLASSIFICATION**
A beginner's guide to the steps of classification.
A simple EDA and classification with logistic regression will be performed. The data will be one-hot encoded for classification, and feature importance will be derived from the analysis results.

Importing the necessary libraries. We will use [sklearn](https://scikit-learn.org/stable/), [pandas](https://pandas.pydata.org/), [numpy](http://www.numpy.org/), [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score, accuracy_score, precision_score, f1_score


**Reading raw data from .csv file.**

In [None]:
data = pd.read_csv("../input/mushrooms.csv")

**Glimpse of the data set.**

In [None]:
data.head()

**Let's look at the description of columns along with total number of entries.**
**Here there are 8124 rows with 23 columns.**

In [None]:
# Structure of the data (shows 8124 rows and 23 columns)
data.shape

In [None]:
data.describe().T

In [None]:
# Check for missing values (returns False: no NaN entries are present)
data.isnull().values.any()
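
The single boolean above can be broken down per column. The sketch below uses a small made-up frame rather than the real data, and it also counts a `?` placeholder, which is only an assumed convention here: `isnull()` detects NaN entries but not placeholder strings.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the mushroom data (values are made up).
toy = pd.DataFrame({
    "cap-shape": ["x", "b", np.nan, "x"],
    "stalk-root": ["e", "?", "c", "?"],
})

# Number of true NaN entries per column.
nan_counts = toy.isnull().sum()

# Number of "?" placeholder entries per column (an assumed convention).
placeholder_counts = (toy == "?").sum()

print(nan_counts)
print(placeholder_counts)
```

Running the same two counts on `data` would show whether any column hides missing values behind a placeholder.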

In [None]:
# Check the unique values of the class column

data['class'].unique()


In [None]:
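from sklearn.preprocessing import LabelEncoder

# Encode every column (including the target) as integers.
# The same encoder is refit for each column, so only the last mapping is kept.
labelencoder = LabelEncoder()
for col in data.columns:
    data[col] = labelencoder.fit_transform(data[col])

data.head()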
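
To see what the label encoding does, here is a minimal sketch on a made-up two-column frame. Unlike the cell above, it keeps one encoder per column (the `encoders` dictionary is a name introduced only for this illustration), so the integer codes can be mapped back to the original letters.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns with made-up category codes.
toy = pd.DataFrame({"cap-color": ["n", "y", "w", "n"], "odor": ["p", "a", "a", "p"]})

# One encoder per column, so each mapping can be inverted later.
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the original labels in the order of their integer codes.
print(encoders["cap-color"].classes_)

# inverse_transform maps the integers back to the original letters.
restored = encoders["cap-color"].inverse_transform(encoded["cap-color"])
print(list(restored))
```

The integer codes follow the sorted order of the labels, so they carry no meaning beyond identifying a category.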

In [None]:
# Checking for unique values after labeling to integers
data['stalk-color-above-ring'].unique()

In [None]:
# Count the rows in each target class
print(data.groupby('class').size())

In [None]:
# Exploratory data analysis: distribution of one feature by target class

ax = sns.boxplot(x='class', y='stalk-color-above-ring', data=data)

In [None]:
# Check for correlation (values closer to 1 or -1 indicate stronger correlation)
corr = data.corr()
ax = sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap=sns.diverging_palette(20, 220, n=200), square=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right');

**The data have to be encoded for classification. One-hot encoding with pandas `get_dummies()` will be performed.
Binary labels will be encoded using sklearn `LabelEncoder()`; label encoding simply converts each value in a column to a number.**

In [None]:
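X = data.drop(["class"], axis=1)
y = data["class"]
# Every column was label-encoded to integers above, so get_dummies() leaves X
# unchanged here; pass columns=X.columns to force one-hot encoding.
X = pd.get_dummies(X)

# The target is already integer-coded; re-encoding leaves its values unchanged.
le = LabelEncoder()
y = le.fit_transform(y)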
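
With default arguments, `pd.get_dummies` only expands object, string or categorical columns. The sketch below, on made-up integer codes, shows the difference between the default call and one that names the columns explicitly.

```python
import pandas as pd

# Toy integer-coded features (made-up values), as produced by a label encoder.
toy = pd.DataFrame({"odor": [0, 2, 1, 2], "cap-color": [1, 0, 1, 1]})

# With default arguments, integer columns pass through unchanged.
unchanged = pd.get_dummies(toy)

# Naming the columns explicitly forces one indicator column per category.
one_hot = pd.get_dummies(toy, columns=list(toy.columns))

print(unchanged.columns.tolist())
print(one_hot.columns.tolist())
```

Whether to one-hot encode matters for logistic regression: integer codes impose an arbitrary ordering on the categories, while indicator columns give each category its own coefficient.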

**Classification and metrics for logistic regression classifier**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LogisticRegression(solver="lbfgs").fit(X_train, y_train)
predicted = clf.predict(X_test)              # hard class labels
predicted_proba = clf.predict_proba(X_test)  # per-class probabilities

print("Accuracy is: "+ str(clf.score(X_test,y_test)))
print("Recall score is: " + str(round(recall_score(y_test, predicted),3)))
print("Precision score is: " + str(round(precision_score(y_test, predicted),3)))
print("F1 score is: " + str(round(f1_score(y_test, predicted),3)))
print("\nConfusion matrix:")
print(confusion_matrix(y_test, predicted))
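
The split above is random, so the scores change from run to run. The sketch below shows one way to make it reproducible and illustrates the difference between `predict` and `predict_proba`. It uses synthetic data from `make_classification` instead of the mushroom features, and the `random_state`, `stratify` and `max_iter` settings are assumptions of this example, not part of the original run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded mushroom features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# random_state fixes the split; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)

labels = model.predict(X_te)       # hard class labels, shape (n_samples,)
probs = model.predict_proba(X_te)  # class probabilities, shape (n_samples, 2)

# For a binary model, thresholding the positive-class probability at 0.5
# gives the same labels as predict().
labels_from_probs = (probs[:, 1] > 0.5).astype(int)
print(labels[:5], probs[:5].round(3))
```

The probabilities are useful when a threshold other than 0.5 is wanted, for example to reduce the chance of labelling a poisonous mushroom as edible.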

**The confusion matrix shows 825 true positives, 732 true negatives, 35 false positives and 33 false negatives.**
**Accuracy is not the only criterion for choosing the best model; precision, recall and the F1 score are also important.**
**The F1 score is a good single measure for checking the model because it incorporates both precision and recall: F1 = 2 * (precision * recall) / (precision + recall).**
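
As a worked check of the formula, the sketch below rebuilds label vectors from the four counts quoted above (taken at face value) and compares a manual F1 computation with `f1_score`. The `y_true` and `y_pred` arrays are constructed only for this illustration and are not the actual test split.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Counts quoted in the text, taken at face value.
tp, tn, fp, fn = 825, 732, 35, 33

# Rebuild label vectors that produce exactly these counts.
y_true = np.array([1] * tp + [0] * tn + [0] * fp + [1] * fn)
y_pred = np.array([1] * tp + [0] * tn + [1] * fp + [0] * fn)

# For binary labels {0, 1}, sklearn lays the matrix out as [[TN, FP], [FN, TP]].
tn_, fp_, fn_, tp_ = confusion_matrix(y_true, y_pred).ravel()

precision = tp_ / (tp_ + fp_)
recall = tp_ / (tp_ + fn_)
f1_manual = 2 * (precision * recall) / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1_manual, 3))
print(round(f1_score(y_true, y_pred), 3))
```

Because `confusion_matrix` puts true negatives in the top-left cell, it is worth checking which printed number corresponds to which count before reading the matrix.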

**Thus, an F1 score of 95% is a good indicator for the classification.**
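
The introduction mentions feature importance. One common way to inspect it for logistic regression is to rank the fitted coefficients by absolute size. The sketch below does this on a made-up one-hot frame; the column names `odor_foul` and `cap_brown` are hypothetical and the approach is only one possible reading of "feature importance", not necessarily the one originally intended.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy one-hot features (made-up): "odor_foul" perfectly separates the classes,
# while "cap_brown" is evenly spread over both classes.
X_toy = pd.DataFrame({
    "odor_foul": [1, 1, 1, 0, 0, 0, 1, 0],
    "cap_brown": [1, 0, 1, 0, 1, 0, 0, 1],
})
y_toy = [1, 1, 1, 0, 0, 0, 1, 0]

model = LogisticRegression(solver="lbfgs").fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a binary problem; rank by absolute size.
importance = (
    pd.Series(model.coef_[0], index=X_toy.columns)
    .sort_values(key=abs, ascending=False)
)
print(importance)
```

Coefficients are comparable in this way only when the features share a scale, which holds for 0/1 indicator columns but not for arbitrary integer codes.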
