# Feature Selection: Information Gain (Mutual Information) in Classification Problems
### Mutual Information
scikit-learn's `mutual_info_classif` estimates mutual information for a discrete target variable.

Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

The function relies on nonparametric methods that estimate entropy from k-nearest-neighbor distances.
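As a rough illustration of the behaviour described above, the sketch below scores two synthetic continuous features against a three-class target: one that tracks the class label and one that is pure noise. The data, the variable names, and the noise level are assumptions made up for this example. `n_neighbors` and `random_state` are parameters of `mutual_info_classif`; fixing `random_state` makes the estimate repeatable, because the estimator adds a small amount of random noise to continuous features.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic data (an assumption for illustration, not the wine dataset)
rng = np.random.default_rng(0)
n = 600
y = rng.integers(0, 3, size=n)             # discrete target with 3 classes
informative = y + rng.normal(0, 0.3, n)    # continuous feature that tracks the class
noise = rng.normal(0, 1, n)                # continuous feature unrelated to the class
X = np.column_stack([informative, noise])

# k-NN based estimate; n_neighbors=3 is the library default
mi = mutual_info_classif(X, y, n_neighbors=3, random_state=0)
print(dict(zip(["informative", "noise"], mi.round(3))))
```

The informative feature should receive a clearly higher score than the noise feature, whose estimate should sit close to zero.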

#### In short

Mutual information measures how much information one random variable provides about another.

The mutual information between two random variables X and Y can be stated formally as follows:

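$$I(X; Y) = H(X) - H(X \mid Y)$$

where I(X ; Y) is the mutual information between X and Y, H(X) is the entropy of X, and H(X | Y) is the conditional entropy of X given Y. The result is in bits when the entropies use base-2 logarithms; scikit-learn's `mutual_info_classif` uses natural logarithms and so reports its estimates in nats.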
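To make the formula concrete, here is a minimal sketch that evaluates H(X) – H(X | Y) in bits for small discrete joint distributions. The helper names and the three example tables are assumptions chosen for illustration; this is the exact textbook definition, not the k-nearest-neighbors estimator that scikit-learn applies to continuous features.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y) in bits, where joint[i][j] = P(X=i, Y=j)."""
    p_x = [sum(row) for row in joint]          # marginal of X
    p_y = [sum(col) for col in zip(*joint)]    # marginal of Y
    h_x = entropy(p_x)
    # H(X|Y) = sum over y of P(y) * H(X | Y=y)
    h_x_given_y = sum(
        py * entropy([joint[i][j] / py for i in range(len(joint))])
        for j, py in enumerate(p_y) if py > 0
    )
    return h_x - h_x_given_y

independent = [[0.25, 0.25], [0.25, 0.25]]   # X and Y independent
identical = [[0.5, 0.0], [0.0, 0.5]]         # X always equals Y
noisy = [[0.4, 0.1], [0.1, 0.4]]             # X equals Y 80% of the time

print(round(mutual_information(independent), 3))  # 0.0
print(round(mutual_information(identical), 3))    # 1.0
print(round(mutual_information(noisy), 3))        # 0.278
```

Independent variables give zero, a perfect copy gives the full entropy of X (1 bit here), and a noisy copy falls in between.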

In [1]:
import pandas as pd

In [4]:
df=pd.read_csv('https://gist.githubusercontent.com/tijptjik/9408623/raw/b237fa5848349a14a14e5d4107dc7897c21951f5/wine.csv')
df.head()

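|   | Wine | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |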


In [5]:
## Unique class labels in the target column
df['Wine'].unique()

array([1, 2, 3], dtype=int64)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Wine                  178 non-null    int64  
 1   Alcohol               178 non-null    float64
 2   Malic.acid            178 non-null    float64
 3   Ash                   178 non-null    float64
 4   Acl                   178 non-null    float64
 5   Mg                    178 non-null    int64  
 6   Phenols               178 non-null    float64
 7   Flavanoids            178 non-null    float64
 8   Nonflavanoid.phenols  178 non-null    float64
 9   Proanth               178 non-null    float64
 10  Color.int             178 non-null    float64
 11  Hue                   178 non-null    float64
 12  OD                    178 non-null    float64
 13  Proline               178 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 19.6 KB


In [7]:
# Features (x) and target (y); `columns=` already names the axis, so `axis=1` is redundant
x = df.drop(columns='Wine')
y = df['Wine']

In [8]:
## Train/test split (70% train, 30% test)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

In [9]:
## Mutual Information
from sklearn.feature_selection import mutual_info_classif
mutual_info = mutual_info_classif(x_train,y_train)
mutual_info

array([0.42098684, 0.30425869, 0.18220383, 0.23229126, 0.17380218,
       0.47162497, 0.71799447, 0.16476314, 0.25400558, 0.61364823,
       0.57161139, 0.54420173, 0.52907728])

The higher the score, the more information the feature carries about the target, so the column with the highest score has the strongest dependency with the dependent variable. Columns with low scores can therefore be dropped, or only the top 5 or 10 kept, depending on the use case and the dataset.
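The two strategies mentioned above, keeping the top k features or dropping those below a cutoff, can be sketched in plain Python. The scores are the ones reported in this notebook, paired with their column names; the helper names `top_k` and `above_threshold` and the cutoff of 0.5 are assumptions picked for illustration, not recommended values.

```python
# Mutual information scores from this notebook, keyed by column name
scores = {
    "Flavanoids": 0.717994, "Color.int": 0.613648, "Hue": 0.571611,
    "OD": 0.544202, "Proline": 0.529077, "Phenols": 0.471625,
    "Alcohol": 0.420987, "Malic.acid": 0.304259, "Proanth": 0.254006,
    "Acl": 0.232291, "Ash": 0.182204, "Mg": 0.173802,
    "Nonflavanoid.phenols": 0.164763,
}

def top_k(scores, k):
    """Names of the k highest-scoring features, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def above_threshold(scores, threshold):
    """Names of all features whose score is at least `threshold`."""
    return [name for name, score in scores.items() if score >= threshold]

print(top_k(scores, 5))
# ['Flavanoids', 'Color.int', 'Hue', 'OD', 'Proline']
print(above_threshold(scores, 0.5))
# ['Flavanoids', 'Color.int', 'Hue', 'OD', 'Proline']
```

With these particular scores both rules happen to keep the same five columns; on other data they can disagree, since a threshold does not fix the number of features.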

In [11]:
## Converting into a Series indexed by feature name
mutual_info = pd.Series(mutual_info, index=x_train.columns)
mutual_info.sort_values(ascending=False)

Flavanoids              0.717994
Color.int               0.613648
Hue                     0.571611
OD                      0.544202
Proline                 0.529077
Phenols                 0.471625
Alcohol                 0.420987
Malic.acid              0.304259
Proanth                 0.254006
Acl                     0.232291
Ash                     0.182204
Mg                      0.173802
Nonflavanoid.phenols    0.164763
dtype: float64

In [12]:
## Select the top 5 features with the highest mutual information
from sklearn.feature_selection import SelectKBest

In [21]:
sel_five_cols = SelectKBest(mutual_info_classif,k=5)
sel_five_cols.fit(x_train,y_train)
imp_cols = x_train.columns[sel_five_cols.get_support()]

In [23]:
imp_cols

Index(['Flavanoids', 'Color.int', 'Hue', 'OD', 'Proline'], dtype='object')

In [29]:
# Keep only the selected columns (same result as dropping the others one by one)
x_train = x_train[imp_cols]

In [30]:
x_train.head()

|   | Flavanoids | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|
| 22 | 2.88 | 3.8 | 1.11 | 4.0 | 1035 |
| 108 | 2.04 | 2.7 | 0.86 | 3.02 | 312 |
| 175 | 0.69 | 10.2 | 0.59 | 1.56 | 835 |
| 145 | 0.55 | 4.0 | 0.6 | 1.68 | 830 |
| 71 | 2.86 | 3.38 | 1.36 | 3.16 | 410 |
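Whatever columns are kept for training usually need to be kept for the test set too, so that both frames line up. The sketch below shows one possible way to do that. It assumes scikit-learn's bundled `load_wine` dataset as a stand-in for the CSV used above (its column names differ, for example `flavanoids` instead of `Flavanoids`), and it wraps `mutual_info_classif` in `functools.partial` to fix `random_state`, so that the ranking is repeatable between runs.

```python
from functools import partial

from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

# Bundled copy of the wine data (assumption: stand-in for the CSV above)
X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fix the estimator's random_state so repeated fits give the same scores
score_func = partial(mutual_info_classif, random_state=0)

# Fit the selector on the training data only
selector = SelectKBest(score_func, k=5).fit(X_train, y_train)
selected = X_train.columns[selector.get_support()]

# Apply the same column selection to both splits
X_train_sel = X_train[selected]
X_test_sel = X_test[selected]
print(list(selected))
print(X_train_sel.shape, X_test_sel.shape)
```

Fitting the selector on the training split alone keeps information from the test rows out of the feature ranking.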
