# Data Structures: Measure Predictivity of Features

In this section, we adopt the influence measure developed by Lo (2002, 2009, 2012) to investigate the importance of the features in a start-up investment profits data set.

The notebook has the following steps:
- $\textbf{Environment Initiation}$. I always start by initiating the environment: I import the modules, APIs, and libraries that this notebook needs, and I set up my data set.

- $\textbf{Data Cleanup}$. Data set cleanup is very important. In the real world, not all data are saved properly, and it is our duty as data scientists and machine learning practitioners to ensure that the data are valid and can be processed by machines.

- $\textbf{Measure Predictivity}$. I adopt the influence score developed by Lo et al. (2002, 2009, 2012), a function that indicates how predictive a set of features is of the response variable $y$.

- $\textbf{Software Development / Product Management}$. Every data science project has two phases. Phase I is end-to-end research to select the best machine learning procedure. Phase II is delivering a software product to consumers and clients so that the Python code can be called directly, with no need to redo work that has already been done.

## Environment Initiation

Let us initiate our environment by importing the required packages.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Data Cleanup

Let us load the data and clean it up.

In [2]:
data = pd.read_csv('~/OneDrive/Documents/YinsPy/data/startups_invest.csv')

In [3]:
print(data.head(3))
print(data.shape)

   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
(50, 5)


In [4]:
X = data.iloc[:, :-1]
y = data.iloc[:, data.shape[1] - 1]
print(X.head(3))

   R&D Spend  Administration  Marketing Spend       State
0  165349.20       136897.80        471784.10    New York
1  162597.70       151377.59        443898.53  California
2  153441.51       101145.55        407934.54     Florida


## Measure Predictivity

Let me measure the predictivity of each variable and the predictivity of joint variables.

In [5]:
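# One-hot encode every column of X. Continuous columns get one dummy
# per distinct value, so the result is wide and sparse.
newX = pd.DataFrame()
for i in range(X.shape[1]):
    newX = pd.concat([newX, pd.get_dummies(X.iloc[:, i], drop_first=True)], axis=1)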

In [6]:
print(newX.shape)
print(newX.iloc[:3, :3])

(50, 146)
   542.05   1000.23  1315.46
0        0        0        0
1        0        0        0
2        0        0        0


In [7]:
import random
num_initial_draw = 1
# Randomly pick the columns that define the partition
initial_set = newX.iloc[:, random.sample(range(newX.shape[1]), num_initial_draw)]
# Start from the first drawn column, then append each remaining column
# (the loop starts at 1 so that the first column is not added twice)
partition = initial_set.iloc[:, 0].astype(str)
for i in range(1, initial_set.shape[1]):
    partition = partition + '_' + initial_set.iloc[:, i].astype(str)

In [8]:
print(partition.head(3))
print(partition.values)
print(partition.value_counts())

0    0
1    0
2    0
Name: 122616.84, dtype: object
['0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '1' '0' '0'
 '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
 '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0']
0    49
1     1
Name: 122616.84, dtype: int64


The code above gives us the partition and the number of observations in each cell that occurs in the data set.
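To make the idea of a partition concrete, here is a minimal sketch on a toy data set. The names `toy_X`, `toy_partition`, and `cell_counts` are hypothetical and chosen only for illustration; the row-wise join is an alternative to the loop above, not the code this notebook runs.

```python
import pandas as pd

# Hypothetical toy features: two binary columns for six observations
toy_X = pd.DataFrame({
    "a": [0, 0, 1, 1, 1, 0],
    "b": [1, 0, 1, 1, 0, 0],
})

# One label per row, built by joining that row's values with "_"
toy_partition = toy_X.astype(str).agg("_".join, axis=1)

# Number of observations that fall into each partition cell
cell_counts = toy_partition.value_counts()

print(toy_partition.tolist())
print(cell_counts.sort_index())
```

Each distinct label is one cell of the partition, and the counts always add up to the number of observations.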

Next, based on the partition, we can compute the local average of the response variable $y$ within each cell and compare it with the global average of $y$.
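As a small illustration of this comparison, the sketch below uses `groupby` on a toy response. The data and the names `toy_partition`, `toy_y`, `local_means`, and `squared_gaps` are assumptions made for the example and do not come from the start-up data set.

```python
import pandas as pd

# Hypothetical toy data: a partition label and a response for six observations
toy_partition = pd.Series(["0", "0", "0", "1", "1", "1"])
toy_y = pd.Series([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])

# Global average of the response
global_mean = toy_y.mean()

# Local average and sample size within each partition cell
local_means = toy_y.groupby(toy_partition).mean()
local_counts = toy_y.groupby(toy_partition).size()

# Squared gap between each local average and the global average
squared_gaps = (local_means - global_mean) ** 2

print(local_means)
print(squared_gaps)
```

A cell whose local average sits far from the global average suggests that the features defining the partition carry information about $y$.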

In [9]:
list_of_partitions = pd.DataFrame(partition.value_counts())
Pi = pd.DataFrame(list_of_partitions.index)
local_n = pd.DataFrame(list_of_partitions.iloc[:, :])
print(Pi)
print(local_n)

   0
0  0
1  1
   122616.84
0         49
1          1


In [10]:
Y_bar = y.mean()  # global mean of the response
local_mean_vector = []
for i in range(len(Pi)):
    # Boolean mask of the rows that belong to partition cell i
    chk = (partition == Pi.iloc[i, 0]).to_numpy()
    local_mean_vector.append([np.array(y)[chk].mean()])


## Software Development / Product Management

Let us code this into a software product.

In [11]:
# Define function
def iscore(X, y):
    # Environment Initiation
    import numpy as np
    import pandas as pd

    # Create Partition: one label per row, joining the values of all columns
    # (the loop starts at 1 so that the first column is not added twice)
    partition = X.iloc[:, 0].astype(str)
    for i in range(1, X.shape[1]):
        partition = partition + '_' + X.iloc[:, i].astype(str)
    list_of_partitions = pd.DataFrame(partition.value_counts())
    Pi = pd.DataFrame(list_of_partitions.index)
    local_n = pd.DataFrame(list_of_partitions.iloc[:, :])

    # Compute Influence Score:
    Y_bar = y.mean()
    local_mean_vector = []
    for i in range(len(Pi)):
        # Boolean mask of the rows that belong to partition cell i
        chk = (partition == Pi.iloc[i, 0]).to_numpy()
        local_mean_vector.append([np.array(y)[chk].mean()])
    # Mean squared gap between the local means and the global mean,
    # scaled by the standard deviation of y
    score = np.mean((np.array(local_mean_vector) - Y_bar)**2) / np.std(y)

    # Output
    return {
        'X': X,
        'y': y,
        'Local Mean Vector': local_mean_vector,
        'Global Mean': Y_bar,
        'Partition': Pi,
        'Number of Samples in Partition': local_n,
        'Influence Score': score}
# End of function

The function is done. Let us try it out!

In [12]:
# Data
data = pd.read_csv('~/OneDrive/Documents/YinsPy/data/startups_invest.csv')
X = data.iloc[:, :-1]
y = data.iloc[:, data.shape[1] - 1]
State = pd.get_dummies(X.iloc[:, 3], drop_first=True)
X = pd.concat([X.iloc[:, :3], State], axis=1)
print(X.head(2))

   R&D Spend  Administration  Marketing Spend  Florida  New York
0   165349.2       136897.80        471784.10        0         1
1   162597.7       151377.59        443898.53        0         0


In [13]:
newX = pd.DataFrame([])
for j in range(X.shape[1]):
    feature = X.iloc[:, j]
    feature = feature > feature.mean()
    newX = pd.concat([newX, feature], axis=1)
print(newX.head(2))

   R&D Spend  Administration  Marketing Spend  Florida  New York
0       True            True             True    False      True
1       True            True             True    False     False


In [14]:
# Try
testresult = iscore(newX.iloc[:,:3], y)
print(testresult['X'].head(2))
print(testresult['Influence Score'])

   R&D Spend  Administration  Marketing Spend
0       True            True             True
1       True            True             True
18589.914362732427
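
The same idea extends to scoring each variable on its own and variables taken jointly. The sketch below is a hedged illustration on toy data: `partition_score` is a hypothetical compact helper that mirrors the formula used in `iscore` (mean squared gap between local and global means, divided by the standard deviation of $y$), and `toy_X` and `toy_y` are made up for the example.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def partition_score(X, y):
    """Hypothetical helper mirroring the formula used in iscore."""
    # One partition label per row, joining the values of all columns
    labels = X.astype(str).agg("_".join, axis=1)
    # Local mean of y within each partition cell
    local_means = y.groupby(labels).mean()
    # Mean squared gap to the global mean, scaled by the std of y
    return float(np.mean((local_means - y.mean()) ** 2) / np.std(y.to_numpy()))


# Hypothetical toy data: 'signal' tracks y closely, 'noise' does not
toy_X = pd.DataFrame({
    "signal": [True, True, True, False, False, False],
    "noise": [True, False, True, False, True, False],
})
toy_y = pd.Series([9.0, 8.0, 10.0, 1.0, 2.0, 0.0])

# Score every single column and every pair of columns
scores = {}
for size in (1, 2):
    for cols in combinations(toy_X.columns, size):
        scores[cols] = partition_score(toy_X[list(cols)], toy_y)

# Print the subsets from highest to lowest score
for cols, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(cols, round(score, 3))
```

On this toy data the `signal` column scores higher than `noise`, which is the kind of ranking the influence score is meant to provide. Note that this unweighted form does not account for how many observations fall in each cell.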
