# Feature Selection and Feature Engineering

This notebook discusses feature selection and feature engineering. As a starting exercise, I use k-nearest neighbors, one of the best-known procedures in machine learning, as the main tool for building the notebook.

The notebook covers the following:
- **Environment Initiation** I set up the coding environment by importing all the required modules.
- **Loading Data** The data comes from the [Data Folder](https://github.com/yiqiao-yin/feature-engineering-and-feature-selection/tree/master/data) of the [Feature Engineer Github Repo](https://github.com/yiqiao-yin/feature-engineering-and-feature-selection/).
- **Equal Width Binning** I divide the data into bins of equal width.
- **Equal Frequency Binning** I divide the data into bins that hold the same number of observations.

## Environment Initiation

Let me start the notebook by importing the required Python modules.

In [30]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
#from feature_engineer import discretization as dc

## Loading Data

Load the Titanic data set, keeping only the columns used in this notebook.

In [14]:
use_cols = ['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Survived']
data = pd.read_csv('../data/titanic.csv', usecols=use_cols)
data.head(3)

   Survived  Pclass     Sex   Age  SibSp     Fare
0         0       3    male  22.0      1     7.25
1         1       1  female  38.0      1  71.2833
2         1       3  female  26.0      0    7.925


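Let us define the covariate matrix $X$ and the dependent variable $y$, then split them into training and test sets. Note that the call below passes the full `data` frame as the features, so `X_train` and `X_test` still contain the `Survived` column; it should be dropped before any model is fit.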

In [15]:
X_train, X_test, y_train, y_test = train_test_split(data, data.Survived, test_size=.3, random_state=0)
X_train.shape, X_test.shape

((623, 6), (268, 6))
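For comparison, here is a minimal sketch of a split that keeps the target out of the feature matrix. The `toy` DataFrame and its values are made up for illustration and stand in for the Titanic data; only `train_test_split` and `DataFrame.drop` are used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 3, 1, 2, 3, 1, 2, 3, 1],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 13.00, 8.05, 51.86, 21.08, 11.13, 30.07],
})

# Separate features and target so the label cannot leak into X.
X = toy.drop(columns="Survived")
y = toy["Survived"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)
print(list(X_tr.columns))
```

With this variant the feature matrices hold only the predictor columns, while `y_tr` and `y_te` carry the labels.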

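## Equal Width Binning

Equal width binning divides the range of possible values into N bins of the same width. Because the bin edges depend only on the minimum and maximum, a skewed variable such as `Fare` can leave most observations in a single bin.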

In [32]:
from sklearn.preprocessing import KBinsDiscretizer
enc_equal_width = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit(X_train[['Fare']])

In [17]:
result = enc_equal_width.transform(X_train[['Fare']])
pd.DataFrame(result)[0].value_counts()

0.0    610
1.0     11
2.0      2
Name: 0, dtype: int64
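The lopsided counts follow from how equal-width edges are built. The sketch below reproduces the idea on a small made-up array (not the Titanic fares): the edges are evenly spaced between the minimum and maximum, which I compute by hand with `np.linspace` and compare with what `KBinsDiscretizer` learns.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up, right-skewed fares (not the Titanic values).
fares = np.array([[0.0], [5.0], [8.0], [10.0], [15.0], [20.0], [30.0], [300.0]])

n_bins = 3
# Equal-width edges: split [min, max] into n_bins intervals of equal length.
manual_edges = np.linspace(fares.min(), fares.max(), n_bins + 1)

enc = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform").fit(fares)
print(manual_edges)       # edges computed by hand
print(enc.bin_edges_[0])  # edges learned by scikit-learn

# Count how many observations land in each bin.
codes = enc.transform(fares).ravel().astype(int)
print(np.bincount(codes, minlength=n_bins))
```

A single large value stretches the range, so nearly every observation falls into the first bin and the middle bin stays empty.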

In [18]:
X_train_copy = X_train.copy(deep=True)
X_train_copy['Fare_equal_width'] = enc_equal_width.transform(X_train[['Fare']])
print(X_train_copy.head(10))

     Survived  Pclass     Sex   Age  SibSp      Fare  Fare_equal_width
857         1       1    male  51.0      0   26.5500               0.0
52          1       1  female  49.0      1   76.7292               0.0
386         0       3    male   1.0      5   46.9000               0.0
124         0       1    male  54.0      0   77.2875               0.0
578         0       3  female   NaN      1   14.4583               0.0
549         1       2    male   8.0      1   36.7500               0.0
118         0       1    male  24.0      0  247.5208               1.0
12          0       3    male  20.0      0    8.0500               0.0
157         0       3    male  30.0      0    8.0500               0.0
127         1       3    male  24.0      0    7.1417               0.0


## Equal Frequency Binning

Equal frequency binning divides the range of possible values into N bins so that each bin holds approximately the same number of observations. The bin edges are quantiles of the training data.

In [19]:
enc_equal_freq = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile').fit(X_train[['Fare']])

In [20]:
enc_equal_freq.bin_edges_

array([array([  0.        ,   8.69303333,  26.2875    , 512.3292    ])],
      dtype=object)
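To see where edges like these come from, here is a minimal NumPy-only sketch on made-up values. It assumes the edges are the 0, 1/3, 2/3 and 1 quantiles computed with `np.quantile` and its default linear interpolation; the exact quantile method used by scikit-learn may differ between versions, so this illustrates the idea rather than reproducing the library.

```python
import numpy as np

# Made-up, right-skewed fares (not the Titanic values).
fares = np.array([0.0, 5.0, 8.0, 10.0, 15.0, 20.0, 30.0, 80.0, 300.0])

n_bins = 3
# Equal-frequency edges: evenly spaced quantiles of the data.
manual_edges = np.quantile(fares, np.linspace(0, 1, n_bins + 1))
print(manual_edges)

# Assign each value to a bin using the interior edges only.
codes = np.searchsorted(manual_edges[1:-1], fares, side="right")
counts = np.bincount(codes, minlength=n_bins)
print(counts)
```

The bins have very different widths, but each one receives the same share of the observations.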

In [21]:
result = enc_equal_freq.transform(X_train[['Fare']])
pd.DataFrame(result)[0].value_counts()

2.0    209
0.0    208
1.0    206
Name: 0, dtype: int64

In [22]:
X_train_copy = X_train.copy(deep=True)
X_train_copy['Fare_equal_freq'] = enc_equal_freq.transform(X_train[['Fare']])
print(X_train_copy.head(3))

     Survived  Pclass     Sex   Age  SibSp     Fare  Fare_equal_freq
857         1       1    male  51.0      0  26.5500              2.0
52          1       1  female  49.0      1  76.7292              2.0
386         0       3    male   1.0      5  46.9000              2.0
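The encoders above are fitted on the training set only, so the same fitted object can be reused on held-out data. The sketch below uses made-up training and test fares to show that step, including what happens to test values that fall outside the training range.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up fares for illustration (not the Titanic values).
train_fares = np.array([[5.0], [10.0], [20.0], [35.0], [65.0], [95.0]])
test_fares = np.array([[0.0], [25.0], [50.0], [500.0]])

# Learn the bin edges from the training data only.
enc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit(train_fares)
print(enc.bin_edges_[0])

# Reuse the fitted edges on the test data.
test_codes = enc.transform(test_fares).ravel()
print(test_codes)
```

Test values below the smallest or above the largest training edge are placed in the first or last bin respectively, so no new bin codes appear at prediction time.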
