Hi, I'm going to apply 5 supervised machine learning classification models to the given dataset to classify mushrooms as poisonous or edible.
1. Logistic Regression
2. K-Nearest Neighbours (K-NN)
3. Naive Bayes classifier
4. Decision Tree Classifier
5. Random Forest Classifier

I'll proceed by converting the categorical variables into dummy/indicator variables, then applying 3 feature selection techniques to reduce the 23 categorical variables (which become 95 variables after conversion to dummy variables) to only 20 variables, and choosing the best feature elimination technique for this dataset. I'll then train the different classification models on these 20 features. The goal is to find the feature selection technique that best preserves accuracy on datasets like this one.


### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


### Read dataset

In [None]:
df = pd.read_csv('../input/mushroom-classification/mushrooms.csv')

In [None]:
df.head()

Every variable in this dataset takes discrete (categorical) values, so standardization/normalization should not be applied to it.

### Getting information of the data

In [None]:
df.info()

### Describing the data

In [None]:
df.describe()

### `class` is the dependent variable and the rest are independent variables

### Checking whether the data is equally distributed between poisonous (p) and edible (e)


In [None]:
df['class'].value_counts()

As the counts above show, the two classes are reasonably balanced. To use these discrete features, we'll first encode them as natural numbers with LabelEncoder and then apply one-hot encoding.

## Encoding Categorical Data

### Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
for c in df.columns:
    df[c]=label.fit_transform(df[c])

In [None]:
df.head()

## 1 = p , 0 = e
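To see where this mapping comes from, here is a minimal sketch on a toy list of labels (made up for illustration, not the real column). `LabelEncoder` sorts the distinct labels and numbers them in that order, so `e` comes before `p`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'class' column (illustrative only).
toy_class = ['p', 'e', 'e', 'p']

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_class)

# classes_ holds the original labels; the label at position i is encoded as i.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Note that the loop above refits one encoder on every column, so after the loop its `classes_` attribute only describes the last column processed.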

### Separating dependent and independent variables

In [None]:
x = df.drop('class', axis=1)
y = df['class']

### One Hot Encoding

In [None]:
x = pd.get_dummies(x,columns=x.columns ,drop_first=True)

In [None]:
x.head()

# Applying Feature Selection Techniques

Now, we have 95 features, let's try some feature selection techniques to extract useful features.
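The feature count follows from how `drop_first=True` works: each column contributes one indicator per level, minus one. The sketch below checks this on a small made-up frame (the column names and values are assumptions for illustration); a column with a single level contributes no indicator at all.

```python
import pandas as pd

# Toy categorical frame (hypothetical values, not the mushroom data).
toy = pd.DataFrame({
    'cap_color': ['n', 'y', 'w', 'n'],  # 3 levels -> 2 indicators
    'bruises':   ['t', 'f', 't', 'f'],  # 2 levels -> 1 indicator
    'veil_type': ['p', 'p', 'p', 'p'],  # 1 level  -> 0 indicators
})

dummies = pd.get_dummies(toy, columns=toy.columns, drop_first=True)

# Expected width: sum over columns of (number of levels - 1).
expected_width = int((toy.nunique() - 1).sum())
print(list(dummies.columns))
print(dummies.shape[1], expected_width)
```

Running `(df.drop('class', axis=1).nunique() - 1).sum()` on the real data should give the same number as `x.shape[1]`.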

## 1. Selecting features with highest correlation with dependent variable (y)

### Making list of correlation values

In [None]:
corr = []
for i in range(x.shape[1]):
    c = np.corrcoef(x.iloc[:,i],y)
    corr.append(abs(c[0][1]))

### Making DataFrame of correlation values

In [None]:
corr_data = pd.DataFrame({'correlation': corr}, index=x.columns)

In [None]:
corr_data

### Visualization of corr DataFrame

In [None]:
plt.figure(figsize=(20,9))
sns.barplot(x=corr_data.index, y = corr_data['correlation'])
plt.xticks(rotation=90)

The graph above shows that only a few features have a noticeably higher correlation with the target than the rest.
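The same absolute correlations can probably be computed without an explicit loop. This is a sketch on toy data (the frame and target below are assumptions for illustration) that compares pandas' `corrwith` against the `np.corrcoef` loop used above.

```python
import numpy as np
import pandas as pd

# Toy indicator features and target (illustrative only).
x_toy = pd.DataFrame({
    'same':      [0, 0, 1, 1],
    'opposite':  [1, 1, 0, 0],
    'unrelated': [0, 1, 0, 1],
})
y_toy = pd.Series([0, 0, 1, 1])

# Column-wise Pearson correlation with the target, in absolute value.
vectorized = x_toy.astype(float).corrwith(y_toy).abs()

# Loop version, mirroring the cell above.
looped = [abs(np.corrcoef(x_toy.iloc[:, i], y_toy)[0, 1])
          for i in range(x_toy.shape[1])]

print(vectorized)
```

Taking the absolute value matters here: a feature that is perfectly anti-correlated with the target is just as informative as one that is perfectly correlated.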

### Choosing features with correlation values of at least 0.5

In [None]:
corr_data = corr_data.sort_values(by = 'correlation', ascending=False)

In [None]:
corr_imp = corr_data[corr_data['correlation'] >= 0.5]

In [None]:
corr_imp

In [None]:
corr_X = x[corr_imp.index]

### Defining new DataFrame of selected independent variables (x)

In [None]:
corr_X

## Applying Logistic Regression Model (Independent variable = corr_X)

### Splitting the Dataset into X_train, X_test, y_train and y_test

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(corr_X, y, test_size=0.33, random_state=42)

### Importing Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

### Fitting Logistic Regression Model

In [None]:
classifier = LogisticRegression(n_jobs=-1)
classifier.fit(X_train, y_train)

### Making Predictions

In [None]:
predictions1 = classifier.predict(X_test)

### Making Classification report and Confusion Matrix

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions1))  # true labels first, then predictions

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions1))  # rows = true class, columns = predicted class
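scikit-learn's metric functions take the true labels first and the predictions second, and for the confusion matrix the order changes the result. A small sketch with made-up labels shows how to unpack the four cells and what happens when the arguments are swapped.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = poisonous, 0 = edible.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("false positives:", fp, "false negatives:", fn)

# Swapping the arguments transposes the matrix, so fp and fn trade places.
swapped = confusion_matrix(y_pred, y_true)
print(swapped)
```

For this problem the false negatives (poisonous mushrooms predicted as edible) are arguably the costlier mistake, so it is worth reading the matrix in the right orientation.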

## 2. Univariate feature selection

### Splitting the Dataset into X_train, X_test, y_train and y_test

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [None]:
X_indices = np.arange(x.shape[-1])

### Importing SelectKBest method 

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

### Applying SelectKBest with k = 20

In [None]:
selector = SelectKBest(chi2, k=20)
selector.fit(X_train, y_train)
scores = selector.scores_/1000

plt.figure(figsize=(50,10))
sns.barplot(data=pd.DataFrame({'Feature':x.columns, 'Scores': scores}),x='Feature',y='Scores',ci=None)
plt.xticks(rotation=90)

In [None]:
scores_data = pd.DataFrame({'Feature':x.columns, 'Scores': scores})

### Visualizing scores 

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(scores_data['Scores'], kde=True)  # distplot is deprecated in recent seaborn releases

### Selecting 20 Scores with highest value

In [None]:
scores_data = scores_data.sort_values(by = 'Scores',ascending=False)

In [None]:
scores_x = scores_data.head(20)

In [None]:
scores_x = x[scores_x['Feature']]

### Defining new DataFrame of selected independent variables (x)

In [None]:
scores_x
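Instead of sorting the scores by hand, the fitted selector can report which columns it kept through `get_support()`. The sketch below uses a tiny made-up dataset (an assumption for illustration) where feature `a` copies the target and feature `b` is mostly noise.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative features, as chi2 requires (illustrative only).
X_toy = pd.DataFrame({
    'a': [0, 0, 0, 1, 1, 1],  # identical to the target
    'b': [1, 0, 1, 0, 1, 0],  # weakly related to the target
})
y_toy = [0, 0, 0, 1, 1, 1]

selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

# get_support() returns a boolean mask over the input columns.
selected = list(X_toy.columns[selector.get_support()])
print(selected)
print(selector.scores_)
```

On the real data, `x.columns[selector.get_support()]` should name the same 20 columns as the sorted-scores approach, barring ties.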

## Applying Logistic Regression Model (Independent variable = scores_x)

### Splitting the Dataset into X_train, X_test, y_train and y_test

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scores_x, y, test_size=0.33, random_state=42)

### Importing Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

### Fitting Logistic Regression Model

In [None]:
classifier = LogisticRegression(n_jobs=-1)
classifier.fit(X_train, y_train)

### Making Predictions

In [None]:
predictions = classifier.predict(X_test)

### Making Classification report and Confusion Matrix

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))  # true labels first, then predictions

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))  # rows = true class, columns = predicted class

## 3. Recursive feature elimination (RFE)

### Importing RFE

In [None]:
from sklearn.feature_selection import RFE

In [None]:
estimator = LogisticRegression(n_jobs=-1)

## Let's check how many features to preserve with RFE

In [None]:
from sklearn.metrics import accuracy_score

# Try even feature counts from 2 to 24 and record the test accuracy for each.
d = {}
for k in range(2, 25, 2):
    # Keep the k features that RFE ranks highest (ranking_ == 1).
    selector = RFE(estimator, n_features_to_select=k, step=2)
    selector = selector.fit(x, y)
    sel_fea = [i for i, j in zip(x.columns, selector.ranking_) if j == 1]
    x_new = x[sel_fea]

    # Same split settings as the other experiments, so results are comparable.
    X_train, X_test, y_train, y_test = train_test_split(x_new, y, test_size=0.33, random_state=42)
    classifier = LogisticRegression(n_jobs=-1)
    classifier.fit(X_train, y_train)
    y_pred1 = classifier.predict(X_test)

    acc = accuracy_score(y_test, y_pred1)
    print("features: %s" % k, " Accuracy: %f" % acc)
    d[str(k)] = acc
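scikit-learn also ships `RFECV`, which runs this kind of search with cross-validation instead of a single train/test split. The following is only a sketch on synthetic data from `make_classification` (the data and parameter values are assumptions, not the notebook's), to show the shape of the API.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded features (illustrative only).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=12, n_informative=4,
    n_redundant=0, random_state=42,
)

# Drop 2 features per round and score each candidate size with 5-fold CV.
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=2,
    cv=5,
    scoring='accuracy',
    min_features_to_select=2,
)
rfecv.fit(X_syn, y_syn)

print("features kept:", rfecv.n_features_)
print("mask:", rfecv.support_)
```

Because the selection is scored on held-out folds, this variant is less likely to pick a feature count that only looks good on one particular split.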

### Applying RFE with 20 features

In [None]:
selector = RFE(estimator, n_features_to_select=20, step=2)
selector = selector.fit(x, y)
selector.support_
selector.ranking_


In [None]:
sel_fea  = [i for i,j in zip(x.columns,selector.ranking_) if j==1]

In [None]:
x_new = x[sel_fea]

### Defining new DataFrame of selected independent variables (x)

In [None]:
x_new
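RFE gives every selected feature a rank of 1, so filtering on `ranking_ == 1` and indexing with the boolean mask `support_` should give the same columns. A quick check on synthetic data (the data and column names `f0`..`f7` are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic features with hypothetical column names f0..f7.
X_arr, y_syn = make_classification(
    n_samples=200, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)
X_syn = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(8)])

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=2)
rfe.fit(X_syn, y_syn)

# Two equivalent ways to list the selected columns.
by_ranking = [c for c, r in zip(X_syn.columns, rfe.ranking_) if r == 1]
by_support = list(X_syn.columns[rfe.support_])
print(by_ranking)
```

The mask form is shorter, and `X_syn.loc[:, rfe.support_]` returns the reduced frame directly.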

## Applying Logistic Regression Model (Independent variable = x_new)

### Splitting the Dataset into X_train, X_test, y_train and y_test

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x_new, y, test_size=0.33, random_state=42)

### Importing Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

### Fitting Logistic Regression Model

In [None]:
classifier = LogisticRegression(n_jobs=-1)
classifier.fit(X_train, y_train)

### Making Predictions

In [None]:
y_pred1 = classifier.predict(X_test)

### Getting Accuracy score and confusion matrix

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_pred1,y_test))

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred1))  # rows = true class, columns = predicted class

## Accuracy scores of Logistic Regression with above mentioned feature selection techniques.

In [None]:
from sklearn.metrics import accuracy_score
print('correlation :')
print(accuracy_score(predictions1,y_test))
print('selectKBest :')
print(accuracy_score(predictions,y_test))
print('RFE :' )
print(accuracy_score(y_pred1,y_test))
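The comparison above scores all three prediction arrays against the `y_test` from the most recent split. That is only valid because every split used the same number of rows, the same `test_size`, and the same `random_state`, which should select the same test rows each time. A small sketch with made-up arrays illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row ids play the role of the target so we can see which rows land in the test set.
row_ids = np.arange(100)

# Two different feature matrices with the same number of rows (illustrative only).
features_a = np.random.default_rng(0).random((100, 3))
features_b = np.random.default_rng(1).random((100, 5))

_, _, _, test_a = train_test_split(features_a, row_ids, test_size=0.33, random_state=42)
_, _, _, test_b = train_test_split(features_b, row_ids, test_size=0.33, random_state=42)

# Same seed and same length give the same test rows in the same order.
same_rows = np.array_equal(test_a, test_b)
print(same_rows, len(test_a))
```

If any of the splits had used a different seed or test size, each set of predictions would need to be scored against its own `y_test`.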

# Applying 5 classification models

### Splitting the Dataset into X_train, X_test, y_train and y_test

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x_new, y, test_size=0.33, random_state=42)

## 1. Logistic Regression

### Importing Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
classifier = LogisticRegression(n_jobs=-1)

### Fitting Logistic Regression Model

In [None]:
classifier.fit(X_train, y_train)

### Making Predictions

In [None]:
y_pred1 = classifier.predict(X_test)

### Making Classification report and Confusion Matrix

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred1))  # true labels first, then predictions

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred1))  # rows = true class, columns = predicted class

## 2. KNN Classifier

### Importing KNN Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

### Fitting KNN Classifier

In [None]:
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train,y_train)
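With `metric='minkowski'` and `p=2`, K-NN uses Euclidean distance. On 0/1 indicator features the squared Euclidean distance is simply the number of positions where two rows differ, so neighbours end up ranked by how many indicators they disagree on. A minimal check with two made-up rows:

```python
import numpy as np

# Two hypothetical rows of 0/1 indicator features.
a = np.array([1, 0, 0, 1, 1])
b = np.array([1, 1, 0, 0, 1])

# Euclidean distance, i.e. Minkowski distance with p=2.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Number of positions where the two rows differ.
mismatches = int(np.sum(a != b))

print(euclidean ** 2, mismatches)
```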

### Predicting the test set results

In [None]:
y_pred2 = classifier.predict(X_test)

### Making classification report and confusion matrix

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test,y_pred2))
print(confusion_matrix(y_test,y_pred2))

## 3. Naive Bayes

### Fitting the naive bayes model

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train,y_train)
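`GaussianNB` models each feature as normally distributed within a class, which is an approximation for 0/1 indicators. scikit-learn's `BernoulliNB` is designed for binary features and may be worth trying alongside it. The sketch below fits both on made-up binary data (the data-generating choices are assumptions for illustration only) and makes no claim about which does better on the mushroom data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)

# Made-up binary data: feature 0 matches the label about 90% of the time,
# features 1 and 2 are pure noise.
y_toy = rng.integers(0, 2, size=200)
flip = rng.random(200) < 0.1
informative = np.where(flip, 1 - y_toy, y_toy)
noise = rng.integers(0, 2, size=(200, 2))
X_toy = np.column_stack([informative, noise])

# Fit both variants on the first 150 rows and score on the remaining 50.
scores = {}
for model in (GaussianNB(), BernoulliNB()):
    model.fit(X_toy[:150], y_toy[:150])
    scores[type(model).__name__] = model.score(X_toy[150:], y_toy[150:])

print(scores)
```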

### Predicting the test set results

In [None]:
y_pred3 = classifier.predict(X_test)

### Making classification report and confusion matrix

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test,y_pred3))
print(confusion_matrix(y_test,y_pred3))

## 4. Decision Tree classification

### Fitting the Decision Tree classification

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train,y_train)

### Predicting the test set results

In [None]:
y_pred4 = classifier.predict(X_test)

### Making classification report and confusion matrix

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test,y_pred4))
print(confusion_matrix(y_test,y_pred4))

## 5. Random Forest Classification

### Fitting the Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10,criterion='entropy', random_state=0)
classifier.fit(X_train,y_train)

### Predicting the test set results

In [None]:
y_pred5 = classifier.predict(X_test)

### Making classification report and confusion matrix

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test,y_pred5))
print(confusion_matrix(y_test,y_pred5))

### Making DataFrame of all the predictions made by 5 models with respect to actual target value (y_test)

In [None]:
df = pd.DataFrame({'y_test': y_test,'logistic_reg': y_pred1, 'KNN': y_pred2, 'Naive_Bayes': y_pred3
                  , 'Decision Tree': y_pred4, 'Random Forest': y_pred5})

In [None]:
df

### Calculating Accuracy Scores of above mentioned Classification models.

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
for i in df.columns[1:]:
    print(i+': ',accuracy_score(df['y_test'], df[i]))
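The five fit/predict/score sections above follow the same pattern, so they could also be written as one loop over a dictionary of models. This is a sketch on synthetic data from `make_classification` (an assumption, used so the block runs on its own), with the same hyperparameters as the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 20 selected features (illustrative only).
X_syn, y_syn = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=42)

models = {
    'logistic_reg': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2),
    'Naive_Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0),
}

# Fit each model and record its test accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    accuracies[name] = accuracy_score(y_te, model.predict(X_te))

for name, acc in accuracies.items():
    print(name + ': ', acc)
```

Swapping `X_syn`, `y_syn` for `x_new`, `y` would reproduce the comparison above in a single cell.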

### So, we conclude that on this dataset RFE performed best among the feature selection techniques tried and successfully reduced the number of variables without hampering accuracy.
If you like my work, an upvote will motivate me to pursue this never-ending ML/Data Science journey.
I am new to this field, so if you feel I made mistakes or have any suggestions, please comment. I trust this community will help me hone my skills.