# Feature Selection / Feature Engineering: Why Does It Work?

Motivation: we want to select the most important (influential, predictive) variables out of a vast number of noisy variables when the sample size is small. We are also interested in mechanically engineering new features that recover the information preserved in a variable module.

### Feature Selection

Given $X$ and $Y$, let us define the influence score (or I-score) as
$$
\text{I}(X,Y) := \sum_{j \in \Pi} n_j^2 (\bar{Y}_j - \bar{Y})^2
$$
where $j$ indexes the $j^\text{th}$ cell of $\Pi$, the partition generated by $X$; $n_j$ is the number of observations in cell $j$, $\bar{Y}_j$ is the mean of $Y$ within cell $j$, and $\bar{Y}$ is the overall mean of $Y$.
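The formula above can be sketched in a few lines of pandas. This is a minimal illustration of the unnormalized score exactly as written, not the `iscore()` function from the package used later, which may normalize differently; the name `i_score` and the toy data are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def i_score(X: pd.DataFrame, y) -> float:
    """Unnormalized I-score: sum over cells j of n_j^2 * (ybar_j - ybar)^2."""
    y = pd.Series(np.asarray(y, dtype=float), index=X.index)
    y_bar = y.mean()
    # Each distinct combination of values in X defines one cell j of the partition.
    cells = y.groupby([X[c] for c in X.columns])
    n_j = cells.size()
    y_bar_j = cells.mean()
    return float((n_j ** 2 * (y_bar_j - y_bar) ** 2).sum())


# Hypothetical toy data: y is the mod-2 sum of "a" and "b"; "c" is pure noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 2, size=(400, 3)), columns=["a", "b", "c"])
toy_y = (toy["a"] + toy["b"]) % 2

score_signal = i_score(toy[["a", "b"]], toy_y)  # the true module
score_noise = i_score(toy[["c"]], toy_y)        # an irrelevant variable
print(score_signal, score_noise)
```

The true module should score far higher than the noise variable, because every cell of its partition has a local mean of 0 or 1 while the overall mean is close to 0.5.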

If the number of variables is extremely large, we also introduce a greedy search algorithm called the Backward Dropping Algorithm (BDA for short). It proceeds as follows:
- Randomly select $k$ variables in each round of the Backward Dropping Algorithm;
- Drop each of the $k$ variables in turn and compute the I-score of the remaining variables;
- Drop the variable whose removal leads to the highest I-score, leaving $k-1$ variables;
- Go back to the first step and start again.

In the end, we report the variable set with the highest I-score and its associated I-score.
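The steps above can be sketched as follows. This is a rough illustration, not the package's `BDA()` function. It assumes that within one round the drop step is repeated on the surviving variables until a single variable remains (the way the algorithm is usually described), and that each new round starts from a fresh random draw. The names `i_score` and `bda_round` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def i_score(X, y):
    """Unnormalized I-score of the partition generated by the columns of X."""
    y = pd.Series(np.asarray(y, dtype=float), index=X.index)
    cells = y.groupby([X[c] for c in X.columns])
    return float((cells.size() ** 2 * (cells.mean() - y.mean()) ** 2).sum())


def bda_round(X, y, k, rng):
    """One round: draw k variables, then drop one at a time, tracking the best subset."""
    idx = rng.choice(len(X.columns), size=k, replace=False)
    current = [X.columns[i] for i in idx]
    best_set, best_score = list(current), i_score(X[current], y)
    while len(current) > 1:
        # Try dropping each variable in turn and score what remains.
        trials = [(i_score(X[[c for c in current if c != d]], y), d) for d in current]
        score, dropped = max(trials, key=lambda t: t[0])
        current = [c for c in current if c != dropped]
        if score > best_score:
            best_set, best_score = list(current), score
    return best_set, best_score


# Hypothetical toy data: y depends only on the module {"a", "b"}.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 2, size=(500, 6)), columns=list("abcdef"))
toy_y = (toy["a"] + toy["b"]) % 2

# Run many rounds and report the variable set with the highest I-score.
best_set, best_score = [], -np.inf
for _ in range(30):
    subset, score = bda_round(toy, toy_y, k=4, rng=rng)
    if score > best_score:
        best_set, best_score = subset, score
print(sorted(best_set), best_score)
```

Because the score weights each cell by $n_j^2$, dropping a noise variable merges cells and raises the score, while dropping a member of the true module collapses it. That is why the greedy search tends to end on the module itself.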

### Feature Engineering

In addition, let us also construct engineered features based on interaction-based variable sets. In other words, given $X$, we can construct
$$
X^{\dagger} := \bar{y}_j, \forall j \in \Pi
$$
where $\Pi$ is the partition generated by the selected variable set $X$ and $j$ indexes the $j^{\text{th}}$ cell of $\Pi$. Each value of $X^{\dagger}$ is replaced with $\bar{y}_j$, the local average of the response variable in cell $j$.
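As a small illustration of this construction, the sketch below replaces every row with the mean response of the cell it falls in. The function name `engineer_feature` and the six-row data set are made up for the example; the package used later may build the feature differently.

```python
import pandas as pd


def engineer_feature(X: pd.DataFrame, y) -> pd.Series:
    """Replace each row by the mean response of the partition cell it falls in."""
    y = pd.Series(list(y), index=X.index, dtype=float)
    # transform("mean") broadcasts each cell's local average back to its rows.
    return y.groupby([X[c] for c in X.columns]).transform("mean")


# Hypothetical module of two binary variables and an observed response.
X_mod = pd.DataFrame({"x1": [0, 0, 1, 1, 0, 1], "x2": [0, 1, 0, 1, 1, 1]})
y_obs = [0, 1, 1, 0, 1, 1]

x_dagger = engineer_feature(X_mod, y_obs)
print(x_dagger.tolist())  # [0.0, 1.0, 1.0, 0.5, 1.0, 0.5]
```

Rows 3 and 5 share the cell `(1, 1)`, whose responses are 0 and 1, so both receive 0.5. In practice the local averages would presumably be estimated on training rows only and then mapped onto test rows, since computing them on the rows being evaluated leaks the response.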

### Artificial Example

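Let us draw random variables from a Bernoulli distribution to create data $X_1, ..., X_p$ and define the underlying model to be
$$y = \left\{
\begin{matrix}
X_1 + X_2 & (\text{mod } 2) & \text{with probability } 0.5 \\
X_3 + X_4 + X_5 & (\text{mod } 2) & \text{with probability } 0.5 \\
\end{matrix}
\right.
$$

Note that the code below builds the first module from the columns labeled 1 and 2 and the second module from the columns labeled 2, 3 and 4, so the reported modules use those labels.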

The goal of this example is correct model specification. We want to capture the important information, which in this case means the two variable modules. If we can successfully capture the important variable modules, we do not even need to worry about which machine learning algorithm to choose.

In [1]:
from scipy.stats import bernoulli
import pandas as pd
import numpy as np

In [8]:
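n = 1000
N = n + 1000             # total number of observations
cutoff = round(n/N, 1)   # fraction later passed as test_size (0.5 here)
p = 30                   # number of binary variables
data_bern = bernoulli.rvs(size=N*p, p=0.5)

# N rows of p independent Bernoulli(0.5) variables, columns labeled '0'..'29'
X = pd.DataFrame(data_bern.reshape([N, p]), columns=np.arange(p).astype(str))
print(X.shape)
print(X.head(2))

# Selector: each observation is generated by module 1 or module 2 with probability 0.5
I = bernoulli.rvs(size=N, p=0.5)
print(np.mean(I))
# Module 1: columns 1 and 2; module 2: columns 2, 3 and 4 (sums taken mod 2)
y1 = np.mod(X.iloc[:, 1] + X.iloc[:, 2], 2)
y2 = np.mod(X.iloc[:, 2] + X.iloc[:, 3] + X.iloc[:, 4], 2)
y = np.where(I == 1, y1, y2)
print(np.mean(y))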

(2000, 30)
   0  1  2  3  4  5  6  7  8  9  ...  20  21  22  23  24  25  26  27  28  29
0  1  0  1  0  0  0  1  1  1  0  ...   0   0   0   0   0   0   0   0   1   0
1  0  0  1  0  1  1  0  0  0  0  ...   0   0   0   0   1   0   0   1   0   1

[2 rows x 30 columns]
0.5175
0.5085


In [36]:
%run "../scripts/InteractionBasedLearning.py"
InteractionBasedLearning.InteractionLearning

---------------------------------------------------------------------

        Yin's Interaction-based Learning Statistical Package 
        Copyright © YINS CAPITAL, 2009 – Present
        For more information, please go to www.YinsCapital.com
        
README:
This script has the following functions:

    (1) iscore(): this function computes the I-score of selected X at predicting Y
    (2) BDA(): this function runs through Backward Dropping Algorithm once
    (3) InteractionLearning(): this function runs many rounds of BDA and 
                               finalize the variables selcted according to I-score
    
---------------------------------------------------------------------


<function __main__.InteractionBasedLearning.InteractionLearning>

In [10]:
tmpResult = InteractionBasedLearning.InteractionLearning(
    newX=X,
    y=y,
    testSize=cutoff,
    num_initial_draw=9,
    total_rounds=200,
    top_how_many=2,
    nameExists=False,
    TYPE=str,
    verbatim=True)

100%|█████████████████████████████████████████████████████████████| 200/200 [04:37<00:00,  1.39s/it]


Time Consumption (in sec): 278.11
Time Consumption (in min): 4.64
Time Consumption (in hr): 0.08


In [11]:
tmpResult['Brief'].head()

           Modules      Score
63        [[1, 2]]  33.628328
163    [[2, 3, 4]]  14.749468
149     [[11, 15]]   1.416835
172   [[7, 9, 20]]   1.341947
145  [[5, 12, 26]]   1.215592


In [12]:
tmpResult['New Data'].head()

   1  2         0  2.1  3  4       0.1
0  0  1  0.753684    1  0  0   0.73251
1  0  1  0.753684    1  0  1  0.232068
2  1  1  0.258964    1  1  1  0.743396
3  1  0  0.773234    0  0  1  0.756654
4  0  1  0.753684    1  0  1  0.232068


### Machine Learning: Building Classifier

To test our idea about the importance of feature selection and feature engineering, we build a classifier with a 3-layer neural network on the original data set. Then we build a classifier with the same 3-layer architecture on the important variable modules and interaction-based features. In the end, we observe that the proposed method delivers test set performance close to the theoretical prediction rate, while the original method does not reach that benchmark.

In [35]:
%run "../scripts/YinsDL.py"

---------------------------------------------------------------------

        Yin's Deep Learning Package 
        Copyright © YINS CAPITAL, 2009 – Present
        For more information, please go to www.YinsCapital.com
        
---------------------------------------------------------------------


In [18]:
YinsDL.NN3_Classifier

<function __main__.YinsDL.NN3_Classifier>

Let us try using the original data first.

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=cutoff, random_state=0)
print(X_train.shape, X_test.shape)
print(y_train)

(1000, 30) (1000, 30)
[0 0 0 1 0 1 0 0 0 1 1 1 1 0 1 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 1 1
 0 0 0 1 1 0 0 1 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1
 0 1 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 1 1 0 1 0 0 0 1 0 0
 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 1 1 0 1 1 1 0 1 0 0 1 1 0 0 1
 0 1 1 0 1 0 1 1 1 0 1 1 1 0 0 1 0 0 1 0 1 1 1 0 1 0 1 1 1 0 0 1 1 0 1 0 0
 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 0 1 0 1 0 1 1 1 0 1
 1 0 1 0 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 0 1 1 0 1 0 0 0 0
 0 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 0 0 0 0 0 1
 0 1 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0
 1 0 0 0 1 1 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0
 1 1 0 0 0 0 0 1 1 1 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0
 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 0
 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 0 0 1 0 1 1 0 1
 0 

In [27]:
testresult = YinsDL.NN3_Classifier(X_train, y_train, X_test, y_test, 
                                 l1_act='relu', l2_act='relu', l3_act='softmax',
                                 layer1size=128, layer2size=64, layer3size=2,
                                 num_of_epochs=50)



2.0.0
Train on 1000 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [28]:
testresult['Performance']

{'confusion':      0    1
 0  320  165
 1  179  336, 'test_acc': 0.656}

Now let us try using important variable modules and interaction-based features as input data.

In [33]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tmpResult['New Data'], y, test_size=cutoff, random_state=0)
print(X_train.shape, X_test.shape)
print(y_train)

testresult = YinsDL.NN3_Classifier(X_train, y_train, X_test, y_test, 
                                 l1_act='relu', l2_act='relu', l3_act='softmax',
                                 layer1size=128, layer2size=64, layer3size=2,
                                 num_of_epochs=50)

(1000, 7) (1000, 7)
[0 0 0 1 0 1 0 0 0 1 1 1 1 0 1 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 1 1
 0 0 0 1 1 0 0 1 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1
 0 1 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 1 1 0 1 0 0 0 1 0 0
 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 1 1 0 1 1 1 0 1 0 0 1 1 0 0 1
 0 1 1 0 1 0 1 1 1 0 1 1 1 0 0 1 0 0 1 0 1 1 1 0 1 0 1 1 1 0 0 1 1 0 1 0 0
 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 0 1 0 1 0 1 1 1 0 1
 1 0 1 0 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 0 1 1 0 1 0 0 0 0
 0 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 0 0 0 0 0 1
 0 1 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0
 1 0 0 0 1 1 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0
 1 1 0 0 0 0 0 1 1 1 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0
 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 0
 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 0 0 1 0 1 1 0 1
 0 0 

In [34]:
testresult['Performance']

{'confusion':      0    1
 0  351  134
 1   94  421, 'test_acc': 0.772}

The test accuracy is 77%. If this experiment is repeated many times, the average prediction rate should be about 75%. Why 75%? This is the exact theoretical prediction rate of the artificial example. The underlying model for $Y$ has two modules, each generating half of the observations. A predictor that follows one module is correct on the 50% of observations generated by that module. Since there is no marginal signal, the same module is correct only 50% of the time on the remaining observations. The theoretical prediction rate (the best one can do) is therefore $75\% = 50\% + 50\% \times 50\%$.
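The 75% figure can be checked by brute-force enumeration rather than simulation. The sketch below assumes the two modules as coded in the notebook (columns 1, 2 and columns 2, 3, 4), fair Bernoulli inputs, and a fair selector. It lists every equally likely combination and counts how often the best possible rule, which predicts the more frequent label for each input, is correct. Variable names are chosen for this example only.

```python
from collections import Counter
from itertools import product

counts = Counter()
# Every combination of the four relevant inputs and the module selector is equally likely.
for x1, x2, x3, x4, selector in product([0, 1], repeat=5):
    y1 = (x1 + x2) % 2        # module 1
    y2 = (x2 + x3 + x4) % 2   # module 2
    y = y1 if selector == 1 else y2
    counts[(x1, x2, x3, x4), y] += 1

# The best rule predicts, for each input, the more frequent label.
best_correct = sum(
    max(counts[(x, 0)], counts[(x, 1)]) for x in product([0, 1], repeat=4)
)
bayes_rate = best_correct / 2 ** 5
print(bayes_rate)  # 0.75
```

For half of the inputs the two modules agree, so the label is certain; for the other half they disagree and any guess is right half the time, which reproduces $50\% + 50\% \times 50\%$.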

Investigation ends here.