<h3>Introduction</h3>

Feature selection is a common phase of the feature engineering process. It is commonly defined as "selecting a subset of input features that are most relevant to the target variable". When it comes to categorical data (nominal or ordinal), the main statistical techniques for feature selection are **Chi-squared** and **Mutual Information**.

Data scientists usually apply each technique in isolation and compare the results. Now the question is:

> Can an inference based on both techniques *in conjunction* improve the overall accuracy?
 
The present kernel will examine this *mixed inference* hypothesis. Without further ado, let's dive in and encode our categorical dataset:




In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

dataset = pd.read_csv("../input/mushroom-classification/mushrooms.csv")

# The first column is the target (class); the remaining columns are the features.
X = dataset.iloc[:, 1:].astype(str)
y = dataset.iloc[:, 0]

# Encode every categorical feature as integer codes.
oe = OrdinalEncoder()
X_enc = oe.fit_transform(X)

# Encode the target labels as integers.
le = LabelEncoder()
y_enc = le.fit_transform(y)
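
To make the encoding step concrete, here is a minimal pure-Python sketch of what ordinal encoding does to a single column: each distinct category is mapped to an integer according to its sorted order, which is how `OrdinalEncoder` behaves with its default settings as far as I can tell. The column values are made up for illustration, and `ordinal_encode` is a hypothetical helper, not part of scikit-learn.

```python
def ordinal_encode(column):
    """Map each distinct category to an integer code, in sorted order."""
    categories = sorted(set(column))
    code_of = {category: code for code, category in enumerate(categories)}
    return [code_of[value] for value in column], categories


# Hypothetical single-letter values, similar in spirit to the mushroom columns.
odor = ["p", "a", "l", "a", "n", "p"]
codes, categories = ordinal_encode(odor)

print(categories)  # ['a', 'l', 'n', 'p']
print(codes)       # [3, 0, 1, 0, 2, 3]
```

Note that the integer codes imply an order that nominal categories do not really have, which is worth keeping in mind whenever a score or a model treats them as magnitudes.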

<h3>A straightforward classifier</h3>

Here, for a quick proof of concept, we'll use a Support Vector Classifier (SVC) with mostly default values and just a light GridSearchCV over two of SVC's main hyperparameters (C and gamma). For consistency, we'll follow the same procedure for the rest of the models.

The models predict whether a mushroom is poisonous or edible (that is, they use the *class* column as the target). First, let's build the model on *all* the features provided by the dataset:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

svc = SVC()
parameters = {'C': (1, 10, 100), 'gamma': (0.1, 1, 10)}
clf_full_features = GridSearchCV(svc, parameters)
clf_full_features.fit(X_enc, y_enc)
print(clf_full_features.best_score_)

The model reached the above score with either {C:100, gamma:0.1} or {C:10, gamma:0.1}. With this insight into the hyperparameters, let's adjust the values a bit, hoping for a higher accuracy:

In [None]:
svc = SVC()
parameters = {'C': (5, 200, 300), 'gamma': (0.01, 0.05, 0.1)}
clf_full_features = GridSearchCV(svc, parameters)
clf_full_features.fit(X_enc, y_enc)
print(clf_full_features.best_score_)

We improved the accuracy a bit, with either {C:300, gamma:0.05} or {C:200, gamma:0.05}. So it perhaps makes sense to take another shot with higher values of C:

In [None]:
svc = SVC()
parameters = {'C': (200, 300, 400), 'gamma': (0.01, 0.05, 0.1)}
clf_full_features = GridSearchCV(svc, parameters)
clf_full_features.fit(X_enc, y_enc)
print(clf_full_features.best_score_)

No further improvement. So let's take 0.87957696 as the **baseline accuracy** and see if we can improve it through feature selection.
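
The tuning rounds above all follow the same pattern: enumerate every (C, gamma) pair in the grid, score each one, and keep the best. A minimal sketch of that loop is given here, assuming a stand-in scoring function. Both `grid_search` and `toy_score` are hypothetical names, and `toy_score` returns placeholder values rather than real cross-validated accuracies.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Score every parameter combination; return the best score and all ties."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(**params), params))
    best_score = max(score for score, _ in results)
    best_params = [params for score, params in results if score == best_score]
    return best_score, best_params, len(results)


def toy_score(C, gamma):
    """Stand-in for a cross-validated accuracy; NOT a real model score."""
    return 1.0 if (gamma == 0.1 and C >= 10) else 0.5


grid = {"C": (1, 10, 100), "gamma": (0.1, 1, 10)}
best_score, best_params, n_candidates = grid_search(toy_score, grid)

print(n_candidates)  # 9
print(best_params)   # [{'C': 10, 'gamma': 0.1}, {'C': 100, 'gamma': 0.1}]
```

In recent scikit-learn versions, `GridSearchCV` cross-validates each candidate with 5 folds by default, so a 3 x 3 grid costs 45 fits plus a final refit. Tied candidates like the ones reported earlier can be inspected through the `rank_test_score` entry of `cv_results_`.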

<h3> Feature selection based on statistical measures </h3>

As a first step, we plot the Chi-squared and Mutual Information scores of each feature, with the *class* column as the target. We expect the plot to provide insight into the statistical relevance of each feature to the target:

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

fs1 = SelectKBest(score_func=chi2, k='all')
fs1.fit(X_enc, y_enc)
X_fs1 = fs1.transform(X_enc)

from sklearn.feature_selection import mutual_info_classif

fs2 = SelectKBest(score_func=mutual_info_classif, k='all')
fs2.fit(X_enc, y_enc)
X_fs2 = fs2.transform(X_enc)

import matplotlib.pyplot as plt

x = np.arange(len(fs1.scores_))

fig, ax = plt.subplots()
fig.set_size_inches(18.5, 10.5)

width = 0.35  
rects1 = ax.bar(x - width/2, fs1.scores_, width, label='Chi-squared')
rects2 = ax.bar(x + width/2, fs2.scores_*10000, width, label='Mutual Information * 10000')

ax.set_ylabel('Feature Importance',fontsize=20)
ax.set_title('Feature',fontsize=20)
ax.set_xticks(x)
ax2 = ax.twiny()
ax2.set_xlim(ax.get_xlim())
ax2.set_xticks(x)
ax2.set_xticklabels(x, fontsize=20)
ax.set_xticklabels(dataset.columns[1:], rotation='vertical', fontsize=20)

ax.legend(fontsize=20)
plt.gcf().subplots_adjust(bottom=0.35)  
plt.show()
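
As a side note on what the two kinds of bars represent, the sketch below computes the textbook versions of both statistics for a single categorical feature, using only co-occurrence counts. The data are made up. scikit-learn's `chi2` and `mutual_info_classif` apply their own estimators to the encoded matrix, so the numbers they report will generally not match these.

```python
from collections import Counter
from math import log


def contingency(feature, target):
    """Count co-occurrences of (feature value, target value) pairs."""
    return Counter(zip(feature, target))


def chi_squared(feature, target):
    """Pearson chi-squared statistic of independence."""
    n = len(feature)
    observed = contingency(feature, target)
    f_counts, t_counts = Counter(feature), Counter(target)
    stat = 0.0
    for f, f_n in f_counts.items():
        for t, t_n in t_counts.items():
            expected = f_n * t_n / n  # count expected under independence
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat


def mutual_information(feature, target):
    """Mutual information in nats, estimated from empirical frequencies."""
    n = len(feature)
    observed = contingency(feature, target)
    f_counts, t_counts = Counter(feature), Counter(target)
    return sum(
        (count / n) * log(count * n / (f_counts[f] * t_counts[t]))
        for (f, t), count in observed.items()
    )


# Made-up toy data: 8 mushrooms, one feature at a time, binary class.
target = ["p", "p", "p", "p", "e", "e", "e", "e"]
informative = ["x", "x", "x", "x", "y", "y", "y", "y"]    # determines the class
uninformative = ["x", "y", "x", "y", "x", "y", "x", "y"]  # independent of it

print(chi_squared(informative, target))           # 8.0
print(chi_squared(uninformative, target))         # 0.0
print(mutual_information(uninformative, target))  # 0.0
```

Both statistics are zero for a feature that is independent of the class and grow as the feature becomes more predictive, but they live on different scales, which is why the plot multiplies Mutual Information by 10000.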

We have plotted both statistical measures in one figure for convenience. For the moment, disregard the Chi-squared measure and focus on Mutual Information. According to this measure, relevance varies drastically across features. Let's drop the features less relevant than *habitat*. If we then tune the hyperparameters as before, we get:

In [None]:
X_enc_mod = X_enc[:,[3,4,7,8,11,12,13,14,18,19,20,21]]
svc = SVC()
parameters = {'C': (200, 300, 400), 'gamma': (0.01, 0.05, 0.1)}
clf_select_features = GridSearchCV(svc, parameters)
clf_select_features.fit(X_enc_mod, y_enc)
print(clf_select_features.best_score_)

Our selected features yield around a **4 percent improvement** over the baseline accuracy. Now let's focus on Chi-squared. Following the same procedure, let's disregard any feature less relevant than *habitat*:

In [None]:
X_enc_mod = X_enc[:, [3,6,7,8,10,18,21]]
svc = SVC()
parameters = {'C':(100,200,500), 'gamma':(0.1,1,100)}
clf_select_features = GridSearchCV(svc, parameters)
clf_select_features.fit(X_enc_mod, y_enc)
print(clf_select_features.best_score_)

An accuracy of 0.8862263 means that our model, which is inspired by the Chi-squared measure, falls short of our best result by less than 3 percent. That said, the above 7 features seem to contain a lot of useful information for modeling. So let's see if we can refine our Chi-squared-based inference by looking at the Mutual Information bars.
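
The column indices in the two cells above were read off the plot by hand. The same cutoff rule (keep every feature that scores at least as high as a reference feature) can be sketched in a few lines. The names and scores below are placeholders for illustration only, and `select_at_least` is a hypothetical helper.

```python
def select_at_least(scores, names, reference):
    """Return indices of features scoring at least as high as `reference`."""
    threshold = scores[names.index(reference)]
    return [i for i, score in enumerate(scores) if score >= threshold]


# Hypothetical names and scores, for illustration only.
names = ["cap-shape", "bruises", "odor", "veil-color", "habitat"]
scores = [0.03, 0.13, 0.62, 0.02, 0.11]

print(select_at_least(scores, names, "habitat"))  # [1, 2, 4]
```

With the notebook's objects, `scores` would presumably be `fs1.scores_` or `fs2.scores_`, and `names` would be `list(dataset.columns[1:])`. Mutual Information estimates can vary slightly between runs unless `random_state` is fixed, so the selected set may shift for features close to the cutoff.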

<h3> Feature selection based on mixed inference </h3>

Before we go any further, note the pitfall here: comparing the relevance of features across measures based on the raw numbers each measure produces is meaningless, as the two techniques measure somewhat different aspects. That said, comparing the relative importance of features within each measure, side by side, may in fact provide valuable insight. For example, looking at the Mutual Information bars, we observe that the **odor** of a mushroom is extremely important in deciding whether it is poisonous or edible (and that makes sense too!). The Chi-squared measure seems to be blind to this fact. So let's add this feature to our Chi-squared-trimmed feature set. Similarly, one might add the **spore-print-color** and **population** features:

In [None]:
X_enc_mod = X_enc[:, [3,6,7,8,10,18,21,4,19,20]]
svc = SVC()
parameters = {'C':(95,100,105), 'gamma':(0.5,0.6,0.7)}
clf_select_features = GridSearchCV(svc, parameters)
clf_select_features.fit(X_enc_mod, y_enc)
print(clf_select_features.best_score_)

A 9 percent improvement over the baseline accuracy, and a 5 percent improvement over the model based solely on the Mutual Information measure, suggests that feature selection based on mixed inference may in fact be effective.
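
The mixed inference in this kernel was done by eye. One possible way to automate it, sketched here under that assumption, is to keep the features that pass the cutoff on one measure and then add the top-ranked features of the other measure that are still missing. Only ranks within each measure are used, so raw values are never compared across measures. The scores are made-up placeholders, and `top_indices` and `mixed_selection` are hypothetical helpers.

```python
def top_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def mixed_selection(base_scores, other_scores, reference, top_k):
    """Keep features at or above `reference` on the base measure, then add
    the `top_k` features of the other measure that are still missing."""
    threshold = base_scores[reference]
    base = [i for i, score in enumerate(base_scores) if score >= threshold]
    extras = [i for i in top_indices(other_scores, top_k) if i not in base]
    return base + extras


# Hypothetical scores for five features; index 4 plays the role of *habitat*.
chi2_scores = [900.0, 20.0, 75.0, 5.0, 750.0]
mi_scores = [0.13, 0.62, 0.33, 0.02, 0.11]

print(mixed_selection(chi2_scores, mi_scores, reference=4, top_k=2))  # [0, 4, 1, 2]
```

Here features 1 and 2 rank highest on the second measure but fall below the cutoff on the first, so they are appended to the base set, much as **odor** was added to the Chi-squared selection above.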