# Feature Selection & Feature Engineering: Predictivity by Influence Score

Modern feature selection methodologies such as stepwise backward or forward selection according to AIC/BIC presume a certain underlying model before large dimensions are penalized with an error term. This school of thought is problematic because it leaves data scientists, statisticians, and machine learning practitioners unsure of the source of impurity in the data set: is the error coming from the data or from the underlying model?

Based on this motivation, Professor Shaw-hwa Lo introduced a non-parametric feature selection technique called the Influence Score (I-score). The technique is a function that takes selected covariates $X$ and a dependent variable $y$ and outputs a measure of how much the selected $X$ influences $y$. The formula is derived from the lower bound of Bayes' accuracy.
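
As a rough guide, the apparent intent of the `iscore` function defined later in this notebook can be written as below. The expression is read off that code, not quoted from the original paper, so the normalization should be treated as specific to this notebook. The symbols $K$, $n_j$, $\bar{y}_j$, $\bar{y}$ and $\hat{\sigma}_y$ are notation introduced here for illustration.

$$
I(X, y) = \frac{1}{\hat{\sigma}_y} \cdot \frac{1}{K} \sum_{j=1}^{K} n_j \left(\bar{y}_j - \bar{y}\right)^2
$$

Here $K$ is the number of distinct partitions formed by the joint values of the selected columns of $X$, $n_j$ is the number of samples in partition $j$, $\bar{y}_j$ is the mean of $y$ within partition $j$, $\bar{y}$ is the global mean of $y$, and $\hat{\sigma}_y$ is the standard deviation of $y$ as returned by `np.std(y)`.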

From the previous [notebook](https://github.com/yiqiao-yin/YinsPy/blob/master/scripts/python_DS_Measure_Predictivity.ipynb), we have some working knowledge of how the Influence Score is computed given selected $X$ and target $y$. In this notebook I build on that work and introduce a generalized feature selection method. This notebook covers:

- **Review: Influence Score**

- **Backward Dropping Algorithm**

- **Software Development / Product Management**

## Influence Score

Let us recall the function we coded in the previous notebook:

In [233]:
# Define function
def iscore(X, y):
    # Environment Initiation (only numpy and pandas are used below)
    import numpy as np
    import pandas as pd

    # Create Partition: join the selected columns into one string label per row
    partition = X.iloc[:, 0].astype(str)
    if X.shape[1] >= 2:
        # NOTE: i runs over 0..p-2, so column 0 enters the label twice and the
        #       last column is never read (labels therefore look like '1_1_0').
        for i in range(X.shape[1]-1):
            partition = partition.astype(str) + '_' + X.iloc[:, i].astype(str)

    # Local Information: partition labels and number of samples in each one
    list_of_partitions = pd.DataFrame(partition.value_counts())
    Pi = pd.DataFrame(list_of_partitions.index)
    local_n = pd.DataFrame(list_of_partitions.iloc[:, :])

    # Compute Influence Score:
    Y_bar = y.mean()
    grouped = pd.DataFrame({'y': y, 'X': partition})
    local_mean_vector = pd.DataFrame(grouped.groupby('X').mean())
    # Mean over partitions of n_j * (local mean - global mean)^2, divided by
    # the standard deviation of y.
    # NOTE: value_counts() orders by frequency and groupby() orders by label,
    #       so counts and local means are paired by position, not by label.
    sq_dev = np.array((local_mean_vector['y'] - Y_bar)**2)
    score = np.mean(np.array(local_n).reshape(1, local_n.shape[0]) * sq_dev)/np.std(y)

    # Output
    return {
        'X': X,
        'y': y,
        'Local Mean Vector': local_mean_vector,
        'Global Mean': Y_bar,
        'Partition': Pi,
        'Number of Samples in Partition': local_n,
        'Influence Score': score}
# End of function
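
Two details of the function above are worth a second look. The partition loop reads `X.iloc[:, i]` for `i` in `range(X.shape[1]-1)`, so the first column is joined twice and the last column is never used, and the counts from `value_counts()` are not in the same order as the means from `groupby()`. The following is a minimal sketch of a variant that avoids both, assuming the intended quantity is the partition-weighted squared deviation of local means divided by `np.std(y)`. The name `iscore_sketch` and the toy data are hypothetical and not part of the original notebook, and the numbers this variant returns will generally differ from the outputs recorded below.

```python
import numpy as np
import pandas as pd


def iscore_sketch(X: pd.DataFrame, y: pd.Series) -> float:
    """Hypothetical variant of `iscore` (an assumption, not the author's code).

    Every column of X enters the partition label exactly once, and counts and
    local means come from the same groupby so they stay aligned.
    """
    y_arr = np.asarray(y, dtype=float)
    # One string label per row, built from all columns of X.
    key = X.astype(str).agg("_".join, axis=1)
    # A single groupby returns size and mean per partition in the same order.
    cells = (
        pd.DataFrame({"y": y_arr, "key": key.to_numpy()})
        .groupby("key")["y"]
        .agg(["size", "mean"])
    )
    weighted = cells["size"] * (cells["mean"] - y_arr.mean()) ** 2
    return float(weighted.mean() / np.std(y_arr))


# Toy data (made up for illustration): `a` determines y, `b` is pure noise.
toy_X = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]})
toy_y = pd.Series([0, 0, 1, 1])

print(iscore_sketch(toy_X[["a"]], toy_y))  # informative column alone
print(iscore_sketch(toy_X[["b"]], toy_y))  # noise column alone
print(iscore_sketch(toy_X, toy_y))         # informative column plus noise
```

On this toy data the informative column alone scores higher than the same column combined with the noise column, which is the behavior the Backward Dropping Algorithm later relies on.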

Let us try it on a real data set.

In [234]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random

In [235]:
# Data
data = pd.read_csv('~/OneDrive/Documents/YinsPy/data/startups_invest.csv')
X = data.iloc[:, :-1]
y = data.iloc[:, data.shape[1] - 1]
y = (y > np.mean(y)).astype(int)
State = pd.get_dummies(X.iloc[:, 3], drop_first=True)
X = pd.concat([X.iloc[:, :3], State], axis=1)
print(X.head(2))

   R&D Spend  Administration  Marketing Spend  Florida  New York
0   165349.2       136897.80        471784.10        0         1
1   162597.7       151377.59        443898.53        0         0


In [236]:
newX = pd.DataFrame([])
for j in range(X.shape[1]):
    feature = X.iloc[:, j]
    feature = (feature > feature.mean()).astype(int)
    newX = pd.concat([newX, feature], axis=1)
print(newX.head(2))
print(newX.shape)

   R&D Spend  Administration  Marketing Spend  Florida  New York
0          1               1                1        0         1
1          1               1                1        0         0
(50, 5)
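
The column-by-column loop above compares each feature with its own mean. As a minimal sketch, the same thresholding can be written as one vectorized comparison, since pandas aligns `X.mean()` with the columns of `X`. The helper name `binarize_by_mean` and the toy frame are hypothetical.

```python
import pandas as pd


def binarize_by_mean(X: pd.DataFrame) -> pd.DataFrame:
    """Return 1 where a value exceeds its column mean, else 0."""
    # X.mean() is a Series indexed by column name; the comparison is
    # broadcast column by column.
    return (X > X.mean()).astype(int)


# Toy frame (made up for illustration).
toy = pd.DataFrame({"spend": [1.0, 2.0, 6.0], "flag": [0, 0, 1]})
binary_toy = binarize_by_mean(toy)
print(binary_toy)
```

The result keeps the original column names and row index, so it can stand in for `newX` wherever the loop's output is used.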


In [237]:
# Random Sampling
# Note: this cell overwrites *newX* with the sampled columns. Before running
#       it again, re-run the previous code box to rebuild the full covariate
#       matrix; otherwise the draw is taken from the already reduced *newX*
#       and the columns dropped earlier can no longer be selected.
num_initial_draw = 3
newX = newX.iloc[:, random.sample(range(newX.shape[1]), num_initial_draw)]
print(newX.head(3))

   Marketing Spend  New York  R&D Spend
0                1         1          1
1                1         0          1
2                1         0          1
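
Because the draw above is random and overwrites `newX`, the selected columns change from run to run. A minimal sketch of one way to make the draw repeatable and leave the full matrix untouched is shown below, using `DataFrame.sample` with `axis=1`. The helper `draw_columns`, its `seed` parameter, and the toy frame are hypothetical names introduced for illustration.

```python
import pandas as pd


def draw_columns(X: pd.DataFrame, k: int, seed: int = 0) -> pd.DataFrame:
    """Sample k columns without replacement; X itself is not modified."""
    # axis=1 samples columns instead of rows; random_state fixes the draw.
    return X.sample(n=k, axis=1, random_state=seed)


# Toy frame with five columns (made up for illustration).
toy = pd.DataFrame({name: [0, 1] for name in "abcde"})
drawn = draw_columns(toy, k=3, seed=0)
print(drawn.columns.tolist())
print(toy.shape)  # the full matrix still has all five columns
```

Assigning the draw to a new name, as here, means the cell can be re-run without first rebuilding the covariate matrix.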


In [239]:
# Try
testresult = iscore(X=newX, y=y)
print(testresult['Partition'])
print(testresult['X'].head(3))
print(testresult['Influence Score'])

       0
0  1_1_0
1  0_0_0
2  0_0_1
3  1_1_1
   Marketing Spend  New York  R&D Spend
0                1         1          1
1                1         0          1
2                1         0          1
2.8567550791291914


## The Backward Dropping Algorithm

Let us introduce a greedy backward selection algorithm based on a key property of the Influence Score (I-score). If the selected covariates $X$ carry crucial information about the dependent variable $y$, we expect to observe a high I-score. If the selected $X$ also includes noisy variables that damage its ability to predict $y$, we expect the measure to decrease. Moreover, the more noisy variables the selected covariate matrix carries, the lower the I-score. This property motivates the Backward Dropping Algorithm: starting from a set of covariates, repeatedly drop the variable whose removal yields the highest I-score, and keep the subset that attains the highest score along the way.
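
The idea in the paragraph above can be sketched as a short greedy loop. This is an illustration under stated assumptions, not the notebook's own implementation: `backward_drop` and `cell_score` are hypothetical names, `cell_score` is a stand-in score that uses every selected column once, and the sketch records the score of the starting set as the first entry of the path. All rows are kept at every step.

```python
from typing import Callable, List, Tuple

import numpy as np
import pandas as pd


def cell_score(X: pd.DataFrame, y: pd.Series) -> float:
    """Stand-in score (assumption): partition-weighted squared deviation."""
    key = X.astype(str).agg("_".join, axis=1)
    cells = (
        pd.DataFrame({"y": y.to_numpy(), "key": key.to_numpy()})
        .groupby("key")["y"]
        .agg(["size", "mean"])
    )
    weighted = cells["size"] * (cells["mean"] - y.mean()) ** 2
    return float(weighted.mean() / np.std(y.to_numpy()))


def backward_drop(
    X: pd.DataFrame,
    y: pd.Series,
    score_fn: Callable[[pd.DataFrame, pd.Series], float] = cell_score,
) -> Tuple[List[str], List[float]]:
    """Greedy backward dropping: return the best column subset and the path."""
    current = list(X.columns)
    best_cols, best_score = current[:], score_fn(X[current], y)
    path = [best_score]
    while len(current) > 1:
        # Score every subset obtained by leaving out exactly one column.
        trials = [
            (score_fn(X[[c for c in current if c != d]], y), d) for d in current
        ]
        score, dropped = max(trials, key=lambda t: t[0])
        current.remove(dropped)
        path.append(score)
        if score > best_score:
            best_cols, best_score = current[:], score
    return best_cols, path


# Toy data (made up): `a` determines y, `b` and `c` are noise.
toy_X = pd.DataFrame({
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
    "c": [0, 0, 1, 1, 0, 0, 1, 1],
})
toy_y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
cols, path = backward_drop(toy_X, toy_y)
print(cols, path)
```

On the toy data the score rises each time a noise column is removed, and the sketch ends with only the informative column selected.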

In [303]:
newX = pd.DataFrame([])
for j in range(X.shape[1]):
    feature = X.iloc[:, j]
    feature = (feature > feature.mean()).astype(int)
    newX = pd.concat([newX, feature], axis=1)
print(newX.head(2))
print(newX.shape)

# Random Sampling
# Note: *newX* is rebuilt at the top of this code box, so the draw below
#       always starts from the full covariate matrix.
num_initial_draw = 4
newX = newX.iloc[:, random.sample(range(newX.shape[1]), num_initial_draw)]
print(newX.head(3))

   R&D Spend  Administration  Marketing Spend  Florida  New York
0          1               1                1        0         1
1          1               1                1        0         0
(50, 5)
   Florida  Marketing Spend  R&D Spend  New York
0        0                1          1         1
1        0                1          1         0
2        1                1          1         0


In [304]:
# Compute Influence Score, e.g. I-score
testresult = iscore(X=newX, y=y)
print(testresult['Influence Score'])

1.9387641557937267


In [305]:
# Backward Dropping: at each pass, drop the column whose removal gives the
# highest I-score, then record that score and the remaining columns.
n_passes = newX.shape[1] - 1
iscorePath = []
selectedX = {}
for j in range(n_passes):
    unit_scores = []
    for i in range(newX.shape[1]):
        # I-score of the current columns with column i left out
        unit_scores.append(iscore(X=newX.drop(columns=[newX.columns[i]]), y=y)['Influence Score'])
    iscorePath.append(np.max(unit_scores))
    to_drop = unit_scores.index(max(unit_scores))
    # NOTE: .head(3) also truncates newX to its first three rows, so every
    #       pass after the first partitions only those three rows.
    newX = newX.drop(columns=[newX.columns[to_drop]]).head(3)
    selectedX[str(j)] = newX

In [306]:
print(iscorePath)
print(selectedX[str(iscorePath.index(max(iscorePath)))])

[4.544159627264301, 2.0447423868476524, 2.0447423868476524]
   Florida  R&D Spend  New York
0        0          1         1
1        0          1         0
2        1          1         0


## Software Development / Product Management

Let us soft-code the Backward Dropping Algorithm procedure as a reusable function.

In [314]:
# Define function
def FSFE_BDA(newX, y, num_initial_draw = 4):
    # Relies on `random`, `np`, `pd` and `iscore` from the code boxes above.
    # Random Sampling: draw the initial set of columns
    newX = newX.iloc[:, random.sample(range(newX.shape[1]), num_initial_draw)]

    # BDA: at each pass, drop the column whose removal gives the highest I-score
    n_passes = newX.shape[1] - 1
    iscorePath = []
    selectedX = {}
    for j in range(n_passes):
        unit_scores = []
        for i in range(newX.shape[1]):
            # I-score of the current columns with column i left out
            unit_scores.append(iscore(X=newX.drop(columns=[newX.columns[i]]), y=y)['Influence Score'])
        iscorePath.append(np.max(unit_scores))
        to_drop = unit_scores.index(max(unit_scores))
        # NOTE: .head(3) also truncates newX to its first three rows, so every
        #       pass after the first partitions only those three rows.
        newX = newX.drop(columns=[newX.columns[to_drop]]).head(3)
        selectedX[str(j)] = newX

    # Final Output: the column subset recorded at the pass with the highest score
    finalX = pd.DataFrame(selectedX[str(iscorePath.index(max(iscorePath)))])

    # Output
    return {
        'Path': iscorePath,
        'MaxIscore': np.max(iscorePath),
        'newX': finalX
    }
# End of function

Let us try it!

In [315]:
# Data
data = pd.read_csv('~/OneDrive/Documents/YinsPy/data/startups_invest.csv')
X = data.iloc[:, :-1]
y = data.iloc[:, data.shape[1] - 1]
y = (y > np.mean(y)).astype(int)
State = pd.get_dummies(X.iloc[:, 3], drop_first=True)
X = pd.concat([X.iloc[:, :3], State], axis=1)
print(X.head(2))

newX = pd.DataFrame([])
for j in range(X.shape[1]):
    feature = X.iloc[:, j]
    feature = (feature > feature.mean()).astype(int)
    newX = pd.concat([newX, feature], axis=1)
print(newX.head(2))
print(newX.shape)

   R&D Spend  Administration  Marketing Spend  Florida  New York
0   165349.2       136897.80        471784.10        0         1
1   162597.7       151377.59        443898.53        0         0
   R&D Spend  Administration  Marketing Spend  Florida  New York
0          1               1                1        0         1
1          1               1                1        0         0
(50, 5)


In [316]:
testBDA = FSFE_BDA(newX, y, num_initial_draw = 4)

In [317]:
testBDA['Path']

[3.4715736175130565, 2.0447423868476524, 2.0447423868476524]

In [319]:
testBDA['MaxIscore']

3.4715736175130565

In [318]:
testBDA['newX'].head(3)

   Marketing Spend  Florida  R&D Spend
0                1        0          1
1                1        0          1
2                1        1          1
