# Data Structures: Measure Predictivity of Features

This section we adopt influence measure developed by Lo (2002, 2009, 2012) to investigate the importance of features of start-up investment profits data set.

The notebook has the following steps:
- $\textbf{Environment Initiation}$. I always start with initiating the environment. I import the correct modules, APIs, and libraries that need to be used for this notebook and I set up my data set.

- $\textbf{Data Cleanup}$. Data set cleanup is very important. In real world, not all data are saved properly and it is our duty as a data scientist and machine learning practitioner to ensure that the data is valid and can be processed by machines.

- $\textbf{Measure Predictivity}$ I adopt the influence score developed by Lo et al (2002, 2009, 2012) and it is a function that indicates how predictive a sets of features are on response variable $y$.

- $\textbf{Software Development / Product Management}$. Every data science project has two phases. Phase I is about end-to-end research and select the most optimal machine learning procedure. Phase II is about delivering a software product to consumer and clients so that the python codes can be called and there is no need to redo everything that has already been done.

## Envrionment Initiation

Let us initiate our environment by importing the required packages.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Data Cleanup

Let us get the data and clean up the data

In [2]:
data = pd.read_csv('~/OneDrive/Documents/YinsPy/data/startups_invest.csv')

In [3]:
print(data.head(3))
print(data.shape)

   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
(50, 5)


In [62]:
X = data.iloc[:, :-1]
y = data.iloc[:, data.shape[1] - 1]
y = (y > np.mean(y)).astype(int)
print(X.head(10))
print(y)

   R&D Spend  Administration  Marketing Spend       State
0  165349.20       136897.80        471784.10    New York
1  162597.70       151377.59        443898.53  California
2  153441.51       101145.55        407934.54     Florida
3  144372.41       118671.85        383199.62    New York
4  142107.34        91391.77        366168.42     Florida
5  131876.90        99814.71        362861.36    New York
6  134615.46       147198.87        127716.82  California
7  130298.13       145530.06        323876.68     Florida
8  120542.52       148718.95        311613.29    New York
9  123334.88       108679.17        304981.62  California
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
20    1
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
31    0
32    0
33    0
34    0
35    0
36    0
37    0
38    0
39    0
40    0
41    0
42    0
43    0
44    0
45

## Measure Predictivity

Let me measure the predictivity of each variable and the predictivity of joint variables.

In [63]:
from sklearn.preprocessing import KBinsDiscretizer
newX = pd.DataFrame()
for i in range(X.shape[1]):
    newColResult = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform').fit(np.array(X.iloc[:, 1]).reshape(-1, 1))
    newCol = newColResult.transform(np.array(X.iloc[:, 1]).reshape(-1, 1))
    newCol = pd.DataFrame(newCol)
    newX = pd.concat([newX, newCol], axis=1)

In [64]:
print(newX.shape)
print(newX.iloc[:, :])

(50, 4)
      0    0    0    0
0   1.0  1.0  1.0  1.0
1   1.0  1.0  1.0  1.0
2   0.0  0.0  0.0  0.0
3   1.0  1.0  1.0  1.0
4   0.0  0.0  0.0  0.0
5   0.0  0.0  0.0  0.0
6   1.0  1.0  1.0  1.0
7   1.0  1.0  1.0  1.0
8   1.0  1.0  1.0  1.0
9   0.0  0.0  0.0  0.0
10  0.0  0.0  0.0  0.0
11  0.0  0.0  0.0  0.0
12  1.0  1.0  1.0  1.0
13  1.0  1.0  1.0  1.0
14  1.0  1.0  1.0  1.0
15  1.0  1.0  1.0  1.0
16  1.0  1.0  1.0  1.0
17  1.0  1.0  1.0  1.0
18  0.0  0.0  0.0  0.0
19  1.0  1.0  1.0  1.0
20  0.0  0.0  0.0  0.0
21  1.0  1.0  1.0  1.0
22  1.0  1.0  1.0  1.0
23  0.0  0.0  0.0  0.0
24  0.0  0.0  0.0  0.0
25  1.0  1.0  1.0  1.0
26  1.0  1.0  1.0  1.0
27  1.0  1.0  1.0  1.0
28  1.0  1.0  1.0  1.0
29  1.0  1.0  1.0  1.0
30  0.0  0.0  0.0  0.0
31  1.0  1.0  1.0  1.0
32  1.0  1.0  1.0  1.0
33  0.0  0.0  0.0  0.0
34  1.0  1.0  1.0  1.0
35  0.0  0.0  0.0  0.0
36  1.0  1.0  1.0  1.0
37  0.0  0.0  0.0  0.0
38  0.0  0.0  0.0  0.0
39  0.0  0.0  0.0  0.0
40  1.0  1.0  1.0  1.0
41  0.0  0.0  0.0  0.0
42 

In [65]:
import random
num_initial_draw = 2
initial_set = newX.iloc[:, random.sample(range(newX.shape[1]), num_initial_draw)]
partition = initial_set.iloc[:, 0].astype(str)
if initial_set.shape[1] >= 2:
    for i in range(initial_set.shape[1]-1):
        partition = partition.astype(str) + '_' + initial_set.iloc[:, i].astype(str)
else:
    partition = partition

In [67]:
print(partition.head(3))
print(partition.value_counts())

0    1.0_1.0
1    1.0_1.0
2    0.0_0.0
Name: 0, dtype: object
1.0_1.0    30
0.0_0.0    20
Name: 0, dtype: int64


The above code gives us partition and their counts for each possible combination in the data set. 

Next, based on partition we can compute the local average of response variable $y$ and compare that with global average of the resposne variable $y$. 

In [68]:
list_of_partitions = pd.DataFrame(partition.value_counts())
Pi = pd.DataFrame(list_of_partitions.index)
local_n = pd.DataFrame(list_of_partitions.iloc[:, :])
print(Pi)
print(local_n)

         0
0  1.0_1.0
1  0.0_0.0
          0
1.0_1.0  30
0.0_0.0  20


In [102]:
import collections
list_local_mean = []
Y_bar = y.mean()
local_mean_vector = []
grouped = pd.DataFrame({'y': y, 'X': partition})
local_mean_vector = pd.DataFrame(grouped.groupby('X').mean())
iscore = np.mean((local_mean_vector['y'] - Y_bar)**2)/np.std(y)
print(iscore)

0.0005853184266750218


## Software Development / Product Management

Let us code this into a software product.

In [123]:
# Define function
def iscore(X=newX, y=y, num_initial_draw = 2):
    # Environment Initiation
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    import random
    
    # Create Partition
    initial_set = newX.iloc[:, random.sample(range(newX.shape[1]), num_initial_draw)]
    partition = initial_set.iloc[:, 0].astype(str)
    if initial_set.shape[1] >= 2:
        for i in range(initial_set.shape[1]-1):
            partition = partition.astype(str) + '_' + initial_set.iloc[:, i].astype(str)
    else:
        partition = partition

    # Local Information
    list_of_partitions = pd.DataFrame(partition.value_counts())
    Pi = pd.DataFrame(list_of_partitions.index)
    local_n = pd.DataFrame(list_of_partitions.iloc[:, :])

    # Compute Influence Score:
    import collections
    list_local_mean = []
    Y_bar = y.mean()
    local_mean_vector = []
    grouped = pd.DataFrame({'y': y, 'X': partition})
    local_mean_vector = pd.DataFrame(grouped.groupby('X').mean())
    iscore = np.mean((local_mean_vector['y'] - Y_bar)**2)/np.std(y)
    
    # Output
    return {
        'X': X,
        'Set Drawn': initial_set,
        'y': y,
        'Local Mean Vector': local_mean_vector,
        'Global Mean': Y_bar,
        'Partition': Pi,
        'Number of Samples in Partition': local_n,
        'Influence Score': iscore}
# End of function

The function is done. Let us try it out!

In [104]:
# Data
data = pd.read_csv('~/OneDrive/Documents/YinsPy/data/startups_invest.csv')
X = data.iloc[:, :-1]
y = data.iloc[:, data.shape[1] - 1]
y = (y > np.mean(y)).astype(int)
State = pd.get_dummies(X.iloc[:, 3], drop_first=True)
X = pd.concat([X.iloc[:, :3], State], axis=1)
print(X.head(2))

   R&D Spend  Administration  Marketing Spend  Florida  New York
0   165349.2       136897.80        471784.10        0         1
1   162597.7       151377.59        443898.53        0         0


In [105]:
newX = pd.DataFrame([])
for j in range(X.shape[1]):
    feature = X.iloc[:, j]
    feature = feature > feature.mean()
    newX = pd.concat([newX, feature], axis=1)
print(newX.head(2))

   R&D Spend  Administration  Marketing Spend  Florida  New York
0       True            True             True    False      True
1       True            True             True    False     False


In [162]:
# Try
testresult = iscore(X=newX, y=y, num_initial_draw = 3)
print(testresult['Partition'])
print(testresult['Set Drawn'].head(3))
print(testresult['Influence Score'])

                   0
0  False_False_False
1    True_True_False
2     True_True_True
3   False_False_True
   Administration  New York  R&D Spend
0            True      True       True
1            True     False       True
2           False     False       True
0.004598527569771014


The above code says that from the covariate matrix *newX* and dependent variable $y$ we are able to randomly draw 3 variables out and compute how predictive they are to $y$. Let us repeat this random sampling many times and compare the result. You can go to the code box above and hit "ctrl + enter" and observe different results.