# Feature Selection & Feature Engineering: Predictivity by Influence Score

Modern feature selection methodologies, such as stepwise backward or forward selection according to AIC/BIC, presume a certain underlying model and then penalize large dimensions with an error term. This school of thought is problematic because it leaves data scientists, statisticians, and machine learning practitioners unsure of the source of impurity in the data set: is the error coming from the data or from the underlying model?

Based on this motivation, Professor Shaw-hwa Lo introduced a non-parametric feature selection technique called Influence Score (i.e. I-score). The technique is a function that takes in certain covariates $X$ and a dependent variable $y$ and outputs a measure of how much the selected $X$ influences $y$. The formula is derived from the lower bound of Bayes' accuracy.

From the previous [notebook](https://github.com/yiqiao-yin/YinsPy/blob/master/scripts/python_DS_Measure_Predictivity.ipynb), we have some working knowledge of how the Influence Score is computed given selected $X$ and target $y$. In this notebook I build on what we have done and introduce a generalized feature selection method. I cover:

- **Review: Influence Score** I start by reviewing what I covered in the influence measure notebook in Data Structures. We can produce a fixed set of partitions given a covariate matrix $X$. According to these partitions, we can compute the local average of the target variable $y$ in each partition and calculate how it differs from the global average of $y$, i.e. $\bar{y}$. The discrepancies between the local averages and the global average are weighted by the squared number of observations that fall in each partition. This measure, the Influence Score, tells us how powerful a given (or selected) covariate matrix $X$ is at predicting the target variable $y$.

- **Backward Dropping Algorithm** Based on this understanding, we need a lot of random sampling to come up with a list of variable modules that are powerful at predicting $y$. This is where we design the Backward Dropping Algorithm (BDA). The Backward Dropping Algorithm is a greedy search algorithm that iteratively searches for noisy variables to drop. It works because of a unique property of the Influence Score (i.e. I-score): the Influence Score of a given collection of covariates increases if noisy variables in the collection are deleted, and it decreases if newly added variables do not contribute to the prediction of the target variable.

- **Software Development / Product Management** Last, I turn the procedure into soft-coded software. I pack the soft code into functions using *def* and push all functions into a *class* object in a *.py* script. To reuse these functions in the future, I simply need to run the code **%run "../data/NAME.py"** in a code box of the *ipynb*.

- **Application** To test the product, I reproduce the central idea on a brand-new data set: the housing price data set. A Decision Tree algorithm reaches about 89% accuracy on the out-of-sample test set using the original 18 variables. The performance seems fine, but do we need all 18 variables? I use the Influence Score and the Backward Dropping Algorithm to screen for potentially important variable combinations, and I identified two important variables, ['sqft_living', 'bedrooms']. By using only 8 variables, I am able to reproduce the 89% out-of-sample test set accuracy!

- **Bonus: Why Does It Work?** In this final section I introduce an artificial model and test the proposed package on data simulated from it to illustrate why the algorithm works.

## Influence Score

Let us recall the function we coded in the previous notebook:

In [6]:
# Define function
def iscore(X, y):
    # Environment Initiation
    import numpy as np
    import pandas as pd

    # Create Partition: join the values of all columns into one string
    # label per row (e.g. "1_0_1"); rows with equal labels share a partition
    partition = X.iloc[:, 0].astype(str)
    for i in range(1, X.shape[1]):
        partition = partition.str.cat(X.iloc[:, i].astype(str), sep="_")

    # Local Information: the distinct partition labels, most frequent first
    list_of_partitions = pd.DataFrame(partition.value_counts())
    Pi = pd.DataFrame(list_of_partitions.index)

    # Compute Influence Score:
    n = X.shape[0]
    Y_bar = y.mean()
    grouped = pd.DataFrame({'y': y, 'X': partition})
    local_mean_vector = pd.DataFrame(grouped.groupby('X').mean())  # mean of y per partition
    local_n = grouped.groupby('X').count()['y']  # number of rows per partition
    # Sum over partitions of n_j^2 * (local mean - global mean)^2, scaled by std(y) and n
    iscore = np.sum(np.array(local_n**2).reshape(1, local_n.shape[0]) * np.array([(local_mean_vector['y'] - Y_bar)**2])) / np.std(y) / n

    # Output
    return {
        'X': X,
        'y': y,
        'Local Mean Vector': local_mean_vector,
        'Global Mean': Y_bar,
        'Partition': Pi,
        'Number of Samples in Partition': local_n,
        'Influence Score': iscore}
# End of function
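
Read off the code above, the quantity returned under `'Influence Score'` can be written compactly. This is only a restatement of the implementation in this notebook, and its normalization may differ from other presentations of the I-score:

$$
I = \frac{1}{n \, \sigma_y} \sum_{j \in \Pi} n_j^2 \left( \bar{y}_j - \bar{y} \right)^2
$$

Here $\Pi$ is the set of partitions generated by the selected columns of $X$, $n_j$ is the number of observations in partition $j$, $\bar{y}_j$ is the local average of $y$ in partition $j$, $\bar{y}$ is the global average, $n$ is the total number of observations, and $\sigma_y$ is the standard deviation of $y$ as returned by `np.std(y)`.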

Let us try it on a real data set.

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random

In [8]:
# Data
data = pd.read_csv('~/OneDrive/Documents/YinsPy/data/startups_invest.csv')
X = data.iloc[:, :-1]
y = data.iloc[:, data.shape[1] - 1]
y = (y > np.mean(y)).astype(int)
State = pd.get_dummies(X.iloc[:, 3], drop_first=True)
X = pd.concat([X.iloc[:, :3], State], axis=1)
print(X.head(2))

   R&D Spend  Administration  Marketing Spend  Florida  New York
0   165349.2       136897.80        471784.10        0         1
1   162597.7       151377.59        443898.53        0         0


In [9]:
newX = pd.DataFrame([])
for j in range(X.shape[1]):
    feature = X.iloc[:, j]
    feature = (feature > feature.mean()).astype(int)
    newX = pd.concat([newX, feature], axis=1)
print(newX.head(2))
print(newX.shape)

   R&D Spend  Administration  Marketing Spend  Florida  New York
0          1               1                1        0         1
1          1               1                1        0         0
(50, 5)
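
The column-by-column thresholding above can also be written as one vectorized comparison. The sketch below is only an illustration on a small hypothetical frame, `X_toy`, not the startup data; it checks that comparing every column with its own mean in a single step gives the same 0/1 matrix as a loop.

```python
import pandas as pd

# Hypothetical toy frame standing in for the covariate matrix X
X_toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 10.0],
    "b": [5.0, 5.0, 1.0, 1.0],
})

# Loop version: threshold each column at its own mean
columns = []
for j in range(X_toy.shape[1]):
    feature = X_toy.iloc[:, j]
    columns.append((feature > feature.mean()).astype(int))
looped = pd.concat(columns, axis=1)

# Vectorized version: X_toy.mean() is aligned with the column labels,
# so each column is compared with its own mean
vectorized = (X_toy > X_toy.mean()).astype(int)

print(vectorized)
print(looped.equals(vectorized))
```

Collecting the columns in a list and concatenating once also avoids growing a DataFrame inside the loop.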


In [10]:
# Random Sampling
# Note: this cell overwrites newX with a random subset of its own columns.
#       Before re-running it, re-run the previous code box to rebuild the
#       full covariate matrix; otherwise each new draw samples from the
#       already reduced *newX*.
num_initial_draw = 3
newX = newX.iloc[:, random.sample(range(newX.shape[1]), num_initial_draw)]
print(newX.head(3))

   Administration  Florida  Marketing Spend
0               1        0                1
1               1        0                1
2               0        1                1


In [11]:
# Try
testresult = iscore(X=newX, y=y)
print(testresult['Partition'])
print(testresult['X'].head(3))
print(testresult['Influence Score'])

       0
0  1_0_0
1  0_0_0
2  1_0_1
3  0_1_1
4  0_0_1
5  1_1_1
6  1_1_0
7  0_1_0
   Administration  Florida  Marketing Spend
0               1        0                1
1               1        0                1
2               0        1                1
1.827155214425592
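
To make the number above easier to interpret, here is a hand-checkable toy case. It is a minimal sketch: `iscore_compact`, `X_toy`, and `y_toy` are hypothetical names introduced for illustration, and the function only restates the quantity that `iscore` computes. A column that determines $y$ scores high, a pure-noise column scores zero, and adding the noise column to the informative one lowers the score.

```python
import numpy as np
import pandas as pd

def iscore_compact(X, y):
    """Restates iscore(): sum_j n_j^2 (ybar_j - ybar)^2 / std(y) / n."""
    y_arr = np.asarray(y, dtype=float)
    # One string label per row, e.g. "1_0", built from all selected columns
    cells = X.astype(str).agg("_".join, axis=1).to_numpy()
    stats = (pd.DataFrame({"y": y_arr, "cell": cells})
             .groupby("cell")["y"].agg(["mean", "count"]))
    weighted = (stats["count"] ** 2 * (stats["mean"] - y_arr.mean()) ** 2).sum()
    return float(weighted / y_arr.std() / len(y_arr))

X_toy = pd.DataFrame({"signal": [0, 0, 1, 1], "noise": [0, 1, 0, 1]})
y_toy = pd.Series([0, 0, 1, 1])

print(iscore_compact(X_toy[["signal"]], y_toy))  # 1.0
print(iscore_compact(X_toy[["noise"]], y_toy))   # 0.0
print(iscore_compact(X_toy, y_toy))              # 0.5
```

For the `signal` column alone there are two partitions of two rows each, with local means 0 and 1 against a global mean of 0.5, so the sum is $2 \cdot 4 \cdot 0.25 = 2$; dividing by the standard deviation 0.5 and by $n = 4$ gives 1.0.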


## The Backward Dropping Algorithm

Let us introduce a greedy backward selection algorithm based on the unique property of the Influence Score (i.e. I-score). If the selected covariates $X$ carry crucial information about the dependent variable $y$, we expect to observe a high Influence Score. If the selected covariates $X$ include noisy variables that damage the purity of $X$ for predicting $y$, we expect this measure to decrease. In addition, the more noisy variables the selected covariate matrix carries, the lower the Influence Score. Because of this property, we develop the Backward Dropping Algorithm.
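
The property described above can be checked on synthetic data before turning to the startup data. The following is a minimal sketch under stated assumptions: `score`, `X_sim`, and `y_sim` are hypothetical names, the response is defined as `x0 + x1 (mod 2)`, and the search starts from all five columns instead of a random draw. At each round it drops the variable whose removal gives the highest score and records the path.

```python
import numpy as np
import pandas as pd

def score(X, y):
    # Same quantity as iscore(): sum_j n_j^2 (ybar_j - ybar)^2 / std(y) / n
    cells = X.astype(str).agg("_".join, axis=1).to_numpy()
    stats = (pd.DataFrame({"y": y, "cell": cells})
             .groupby("cell")["y"].agg(["mean", "count"]))
    total = (stats["count"] ** 2 * (stats["mean"] - y.mean()) ** 2).sum()
    return float(total / y.std() / len(y))

rng = np.random.default_rng(0)
n = 1000
X_sim = pd.DataFrame(rng.integers(0, 2, size=(n, 5)),
                     columns=["x0", "x1", "noise0", "noise1", "noise2"])
y_sim = ((X_sim["x0"] + X_sim["x1"]) % 2).to_numpy()

current = list(X_sim.columns)
path = []
while len(current) > 1:
    # Score every subset obtained by dropping exactly one variable
    candidates = {c: score(X_sim[[k for k in current if k != c]], y_sim)
                  for c in current}
    dropped = max(candidates, key=candidates.get)
    current = [k for k in current if k != dropped]
    path.append((list(current), candidates[dropped]))

# The retained subset with the highest score along the path
best_subset, best_score = max(path, key=lambda item: item[1])
print(best_subset, round(best_score, 2))
```

In this setup the score keeps rising while noise columns are dropped and collapses once `x0` or `x1` is removed, so the best subset along the path is the pair that generates the response.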

In [12]:
newX = pd.DataFrame([])
for j in range(X.shape[1]):
    feature = X.iloc[:, j]
    feature = (feature > feature.mean()).astype(int)
    newX = pd.concat([newX, feature], axis=1)
print(newX.head(2))
print(newX.shape)

# Random Sampling
# Note: newX is rebuilt at the top of this code box, so re-running the box
#       always samples from the full covariate matrix.
num_initial_draw = 4
newX = newX.iloc[:, random.sample(range(newX.shape[1]), num_initial_draw)]
print(newX.head(3))

   R&D Spend  Administration  Marketing Spend  Florida  New York
0          1               1                1        0         1
1          1               1                1        0         0
(50, 5)
   Marketing Spend  New York  Florida  Administration
0                1         1        0               1
1                1         0        0               1
2                1         0        1               0


In [13]:
# Compute Influence Score, i.e. I-score
testresult = iscore(X=newX, y=y)
print(testresult['Influence Score'])

1.0334418160969685


In [14]:
newX_copy = newX
iscorePath = []
selectedX = {}
for j in range(newX_copy.shape[1]-1):
    unit_scores = []
    for i in range(newX.shape[1]):
        unit_scores.append(iscore(X=newX.iloc[:, :].drop([str(newX.columns[i])], axis=1), y=y)['Influence Score'])
        #print(i, unit_scores, np.max(unit_scores), unit_scores.index(max(unit_scores)))
    iscorePath.append(np.max(unit_scores))
    to_drop = unit_scores.index(max(unit_scores))
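    # Note: .head(3) keeps only the first three rows, so later rounds build
    #       their partitions from those three rows only. The BDA() function
    #       defined later omits it.
    newX = newX.iloc[:, :].drop([str(newX.columns[to_drop])], axis=1).head(3)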
    selectedX[str(j)] = newX

In [15]:
print(iscorePath)
print(selectedX[str(iscorePath.index(max(iscorePath)))])

[2.025583564007748, 1.1359679926931403, 2.0447423868476524]
   Marketing Spend
0                1
1                1
2                1


## Software Development / Product Management

### Soft Code

Let us soft-code the Backward Dropping Algorithm procedure as a reusable function.

In [16]:
# Define function
def BDA(X, y, num_initial_draw = 4):
    # Environment Initiation
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    import random

    # Random Sampling
    newX = X.iloc[:, sorted(random.sample(range(X.shape[1]), num_initial_draw))]

    # BDA
    newX_copy = newX
    iscorePath = []
    selectedX = {}
    for j in range(newX_copy.shape[1]-1):
        unit_scores = []
        for i in range(newX.shape[1]):
            # Use the standalone iscore() defined earlier; the class version
            # is only defined in a later code box
            unit_scores.append(iscore(
                X=newX.iloc[:, :].drop([str(newX.columns[i])], axis=1), y=y)['Influence Score'])
            #print(i, unit_scores, np.max(unit_scores), unit_scores.index(max(unit_scores)))
        iscorePath.append(np.max(unit_scores))
        to_drop = unit_scores.index(max(unit_scores))
        newX = newX.iloc[:, :].drop([str(newX.columns[to_drop])], axis=1)
        selectedX[str(j)] = newX

    # Final Output
    finalX = pd.DataFrame(selectedX[str(iscorePath.index(max(iscorePath)))])

    # Output
    return {
        'Path': iscorePath,
        'MaxIscore': np.max(iscorePath),
        'newX': finalX,
        'Summary': {
            'Variable Module': np.array(finalX.columns), 
            'Influence Score': np.max(iscorePath) },
        'Brief': [np.array(finalX.columns), [np.max(iscorePath)]]
        }
# End of function

Let us try it!

In [17]:
# Data
data = pd.read_csv('~/OneDrive/Documents/YinsPy/data/startups_invest.csv')
X = data.iloc[:, :-1]
y = data.iloc[:, data.shape[1] - 1]
y = (y > np.mean(y)).astype(int)
State = pd.get_dummies(X.iloc[:, 3], drop_first=True)
X = pd.concat([X.iloc[:, :3], State], axis=1)
print(X.head(2))

newX = pd.DataFrame([])
for j in range(X.shape[1]):
    feature = X.iloc[:, j]
    feature = (feature > feature.mean()).astype(int)
    newX = pd.concat([newX, feature], axis=1)
print(newX.head(2))
print(newX.shape)

   R&D Spend  Administration  Marketing Spend  Florida  New York
0   165349.2       136897.80        471784.10        0         1
1   162597.7       151377.59        443898.53        0         0
   R&D Spend  Administration  Marketing Spend  Florida  New York
0          1               1                1        0         1
1          1               1                1        0         0
(50, 5)


In [18]:
testBDA = BDA(newX, y, num_initial_draw = 4)

In [19]:
testBDA['Path']

[4.258510327480856, 7.470222772857905, 8.935110905666017]

In [20]:
testBDA['MaxIscore']

8.935110905666017

In [21]:
testBDA['newX'].head(3)

   R&D Spend
0          1
1          1
2          1


In [22]:
class InteractionBasedLearning:
    """
    README:
    This script has the following functions:
    (1) iscore(): this function computes the I-score of selected X at predicting Y
    (2) BDA(): this function runs through Backward Dropping Algorithm once
    (3) InteractionLearning(): this function runs many rounds of BDA and finalizes the variables selected according to I-score
    """
    # Define function
    def iscore(X, y):
        # Environment Initiation
        import numpy as np
        import matplotlib.pyplot as plt
        import pandas as pd
        import random

        # Create Partition
        partition = X.iloc[:, 0].astype(str)
        if X.shape[1] >= 2:
            for i in range(1, X.shape[1]):
                partition = partition.astype(str).str.cat(X.iloc[:, i].astype(str), sep ="_")
        else:
            partition = partition

        # Local Information
        list_of_partitions = pd.DataFrame(partition.value_counts())
        Pi = pd.DataFrame(list_of_partitions.index)
        local_n = pd.DataFrame(list_of_partitions.iloc[:, :])

        # Compute Influence Score:
        import collections
        n = X.shape[0]
        Y_bar = y.mean()
        grouped = pd.DataFrame({'y': y, 'X': partition})
        local_mean_vector = pd.DataFrame(grouped.groupby('X').mean())
        local_n = grouped.groupby('X').count()['y']
        iscore = np.sum(np.array(local_n**2).reshape(1, local_n.shape[0]) * np.array([(local_mean_vector['y'] - Y_bar)**2])) / np.std(y) / n

        # Output
        return {
            'X': X,
            'y': y,
            'Local Mean Vector': local_mean_vector,
            'Global Mean': Y_bar,
            'Partition': Pi,
            'Number of Samples in Partition': local_n,
            'Influence Score': iscore}
    # End of function
    
    # Define function
    def BDA(X, y, num_initial_draw = 4):
        # Environment Initiation
        import numpy as np
        import matplotlib.pyplot as plt
        import pandas as pd
        import random
        
        # Random Sampling
        newX = X.iloc[:, sorted(random.sample(range(X.shape[1]), num_initial_draw))]

        # BDA
        newX_copy = newX
        iscorePath = []
        selectedX = {}
        for j in range(newX_copy.shape[1]-1):
            unit_scores = []
            for i in range(newX.shape[1]):
                unit_scores.append(InteractionBasedLearning.iscore(
                    X=newX.iloc[:, :].drop([str(newX.columns[i])], axis=1), y=y)['Influence Score'])
                #print(i, unit_scores, np.max(unit_scores), unit_scores.index(max(unit_scores)))
            iscorePath.append(np.max(unit_scores))
            to_drop = unit_scores.index(max(unit_scores))
            newX = newX.iloc[:, :].drop([str(newX.columns[to_drop])], axis=1)
            selectedX[str(j)] = newX

        # Final Output
        finalX = pd.DataFrame(selectedX[str(iscorePath.index(max(iscorePath)))])

        # Output
        return {
            'Path': iscorePath,
            'MaxIscore': np.max(iscorePath),
            'newX': finalX,
            'Summary': {
                'Variable Module': np.array(finalX.columns), 
                'Influence Score': np.max(iscorePath) },
            'Brief': [np.array(finalX.columns), [np.max(iscorePath)]]
            }
    # End of function
    
    # Define function
    def InteractionLearning(newX, y, 
                            testSize=0.3, 
                            num_initial_draw=7, total_rounds=10, top_how_many=3, 
                            verbatim=True):
        # Environment Initiation
        import numpy as np
        import matplotlib.pyplot as plt
        import pandas as pd
        import random
        import time
        
        # Split Train and Validate
        from sklearn.model_selection import train_test_split
        X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=testSize, random_state = 0)
        
        # Start Learning
        start = time.time()
        listVariableModule = []
        listInfluenceScore = []
        from tqdm import tqdm
        for i in tqdm(range(total_rounds)):
            oneDraw = InteractionBasedLearning.BDA(X=X_train, y=y_train, num_initial_draw=num_initial_draw)
            listVariableModule.append([np.array(oneDraw['newX'].columns)])
            listInfluenceScore.append(oneDraw['MaxIscore'])
        end = time.time()
        
        # Time Check
        if verbatim == True: print('Time Consumption', end - start)
        
        # Update Features
        listVariableModule_copy = listVariableModule
        listInfluenceScore_copy = listInfluenceScore
        selectedNom = listVariableModule[listInfluenceScore.index(np.max(listInfluenceScore))]
        informativeX = pd.DataFrame(newX[selectedNom[0]])
        listVariableModule_copy = np.delete(listVariableModule_copy, listInfluenceScore_copy.index(np.max(listInfluenceScore)))
        listInfluenceScore_copy = np.delete(listInfluenceScore_copy, listInfluenceScore_copy.index(np.max(listInfluenceScore)))

        for j in range(2, top_how_many):
            selectedNom = listVariableModule_copy[listInfluenceScore_copy.tolist().index(np.max(listInfluenceScore_copy))]
            informativeX = pd.concat([informativeX, pd.DataFrame(newX[selectedNom])], axis=1)
            listVariableModule_copy = np.delete(
                listVariableModule_copy, 
                listInfluenceScore_copy.tolist().index(np.max(listInfluenceScore_copy)))
            listInfluenceScore_copy = np.delete(
                listInfluenceScore_copy, 
                listInfluenceScore_copy.tolist().index(np.max(listInfluenceScore_copy)))
        
        briefResult = pd.DataFrame({'Modules': listVariableModule, 'Score': listInfluenceScore})
        briefResult = briefResult.sort_values(by=['Score'], ascending=False)
        briefResult = briefResult.loc[~briefResult['Score'].duplicated()]

        # Output
        return {
            'List of Variable Modules': listVariableModule,
            'List of Influence Measures': listInfluenceScore,
            'Brief': briefResult,
            'New Features': informativeX
        }
    # End of function

In [23]:
# Data
data = pd.read_csv('~/OneDrive/Documents/YinsPy/data/startups_invest.csv')
X = data.iloc[:, :-1]
y = data.iloc[:, data.shape[1] - 1]
y = (y > np.mean(y)).astype(int)
State = pd.get_dummies(X.iloc[:, 3], drop_first=True)
X = pd.concat([X.iloc[:, :3], State], axis=1)
print(X.head(2))

newX = pd.DataFrame([])
for j in range(X.shape[1]):
    feature = X.iloc[:, j]
    feature = (feature > feature.mean()).astype(int)
    newX = pd.concat([newX, feature], axis=1)
print(newX.head(2))
print(newX.shape)

   R&D Spend  Administration  Marketing Spend  Florida  New York
0   165349.2       136897.80        471784.10        0         1
1   162597.7       151377.59        443898.53        0         0
   R&D Spend  Administration  Marketing Spend  Florida  New York
0          1               1                1        0         1
1          1               1                1        0         0
(50, 5)


In [24]:
InteractionBasedLearning.BDA

<function __main__.InteractionBasedLearning.BDA>

In [25]:
testResult = InteractionBasedLearning.BDA(X=newX, y=y, num_initial_draw=5)

In [26]:
testResult['MaxIscore']

8.935110905666017

In [27]:
testResult['newX'].head()

   R&D Spend
0          1
1          1
2          1
3          1
4          1


## Application

We can also save the above *class* object in a *.py* script so that in the future we can load this script.

In [372]:
# Import Modules
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random

In [373]:
# Get Data
house_sales = pd.read_csv('../data/kc_house_data.csv')
house_sales.head(3)

           id             date     price  bedrooms  bathrooms  sqft_living  \
0  7129300520  20141013T000000  221900.0         3        1.0         1180
1  6414100192  20141209T000000  538000.0         3       2.25         2570
2  5631500400  20150225T000000  180000.0         2        1.0          770

   sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  \
0      5650     1.0           0     0  ...      7        1180              0
1      7242     2.0           0     0  ...      7        2170            400
2     10000     1.0           0     0  ...      6         770              0

   yr_built  yr_renovated  zipcode      lat     long  sqft_living15  sqft_lot15
0      1955             0    98178  47.5112 -122.257           1340        5650
1      1951          1991    98125   47.721 -122.319           1690        7639
2      1933             0    98028  47.7379 -122.233           2720        8062


In [374]:
print(house_sales.shape)

(21613, 21)


In [654]:
# Clean Data
data = house_sales
X = data.iloc[:, :].drop(['id', 'date', 'price'], axis=1)
y = (data['price'] > np.mean(data['price'])).astype(int)
print(X.head(2))
print(y.head())

   bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  \
0         3       1.00         1180      5650     1.0           0     0   
1         3       2.25         2570      7242     2.0           0     0   

   condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  \
0          3      7        1180              0      1955             0   
1          3      7        2170            400      1951          1991   

   zipcode      lat     long  sqft_living15  sqft_lot15  
0    98178  47.5112 -122.257           1340        5650  
1    98125  47.7210 -122.319           1690        7639  
0    0
1    0
2    0
3    1
4    0
Name: price, dtype: int32


In [655]:
newX = pd.DataFrame([])
for j in range(X.shape[1]):
    feature = X.iloc[:, j]
    feature = (feature > feature.mean()).astype(int)
    newX = pd.concat([newX, feature], axis=1)
print(newX.head(2))
print(newX.shape)

   bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  \
0         0          0            0         0       0           0     0   
1         0          1            1         0       1           0     0   

   condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  \
0          0      0           0              0         0             0   
1          0      0           1              1         0             1   

   zipcode  lat  long  sqft_living15  sqft_lot15  
0        1    0     0              0           0  
1        1    1     0              0           0  
(21613, 18)


In [688]:
%run "../scripts/InteractionBasedLearning.py"
InteractionBasedLearning.InteractionLearning

<function __main__.InteractionBasedLearning.InteractionLearning>

In [678]:
print(range(0, newX.shape[1]))
NAMES = newX.columns
print(NAMES)

range(0, 18)
RangeIndex(start=0, stop=18, step=1)


In [679]:
newX.columns = range(0, newX.shape[1])

In [680]:
newX.head()

   0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17
0  0  0  0  0  0  0  0  0  0  0   0   0   0   1   0   0   0   0
1  0  1  1  0  1  0  0  0  0  1   1   0   1   1   1   0   0   0
2  0  0  0  0  0  0  0  0  0  0   0   0   0   0   1   0   1   0
3  1  1  0  0  0  0  0  1  0  0   1   0   0   1   0   0   0   0
4  0  0  0  0  0  0  0  0  1  0   0   1   0   0   1   1   0   0


In [681]:
type(newX.iloc[0,0])

numpy.int32

In [682]:
print(type(newX))
print(newX.shape)
print(len(y))

<class 'pandas.core.frame.DataFrame'>
(21613, 18)
21613


In [683]:
newX=newX.astype(int)

In [689]:
oneDraw = InteractionBasedLearning.InteractionLearning(
    newX=newX,
    y=y,
    testSize=0.1,
    num_initial_draw=9,
    total_rounds=10,
    top_how_many=3,
    nameExists=False,
    TYPE=int,
    verbatim=True)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:28<00:00,  2.84s/it]


Time Consumption: 28.358969688415527


In [690]:
oneDraw['Brief'].head()

  Modules        Score
1   [[8]]  1103.120383
2   [[2]]   1086.76751
3  [[16]]   787.852581
0   [[9]]    711.28288


The call above repeats the random sampling many times. It does the following work, and this is how we interpret the results:

- The class is loaded from the *.py* script, and the Backward Dropping Algorithm is repeated $B$ times on random draws of variables, where $B$ is the `total_rounds` argument (10 in the call above).

- The first module, for example, is $[\text{grade}]$ (column index 8 in the table above) and the second variable module is $[\text{sqft_living}]$ (column index 2).
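
Because the columns were renamed to integer positions before the call, the `Brief` table reports modules as column indices. A small sketch of mapping them back to names follows; the list `names` is copied from the column order printed by `X.head(2)` earlier, and `modules` and `named_modules` are hypothetical variable names used only for illustration.

```python
# Column order of the housing covariates, as printed by X.head(2) earlier
names = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
         "waterfront", "view", "condition", "grade", "sqft_above",
         "sqft_basement", "yr_built", "yr_renovated", "zipcode", "lat",
         "long", "sqft_living15", "sqft_lot15"]

# Modules from the 'Brief' table, given as integer column positions
modules = [[8], [2], [16], [9]]

# Translate each position back to its column name
named_modules = [[names[i] for i in module] for module in modules]
print(named_modules)
# [['grade'], ['sqft_living'], ['sqft_living15'], ['sqft_above']]
```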

In [39]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [40]:
%run "../scripts/YinsML.py"

In [41]:
testResult = YinsML.DecisionTree_Classifier(X_train, X_test, y_train, y_test, maxdepth=10)
print(testResult['Data']['X_train'].head(3))
print(testResult['Test Result']['confusion_test'])
print(testResult['Test Result']['test_acc'])

       bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  \
1468          4       1.50         1390      7200     1.0           0     0   
15590         3       1.50         1450      7316     1.0           0     0   
18552         5       2.75         2860      5379     2.0           0     0   

       condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  \
1468           3      7        1140            250      1965             0   
15590          3      7        1450              0      1961             0   
18552          3      9        2860              0      2005             0   

       zipcode      lat     long  sqft_living15  sqft_lot15  
1468     98133  47.7224 -122.332           1630        7702  
15590    98133  47.7725 -122.349           1440        7316  
18552    98052  47.7082 -122.104           2980        6018  
      0     1
0  3767   348
1   367  2002
0.8897285626156693


Let us select important variables and then repeat the Decision Tree algorithm.

In [50]:
oneDraw['List of Variable Modules'][:10]

[[array(['grade'], dtype=object)],
 [array(['grade'], dtype=object)],
 [array(['grade'], dtype=object)],
 [array(['bathrooms'], dtype=object)],
 [array(['sqft_living'], dtype=object)],
 [array(['sqft_living15'], dtype=object)],
 [array(['sqft_living15'], dtype=object)],
 [array(['sqft_living'], dtype=object)],
 [array(['grade'], dtype=object)],
 [array(['sqft_above'], dtype=object)]]

In [56]:
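# axis=1 places the selected columns side by side; the default axis=0 would stack them into one long column
X = pd.concat([X['grade'], X['bathrooms'], X['sqft_living'], X['sqft_living15']], axis=1)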

In [70]:
print(X.shape)
print(len(y))

(21613, 18)
21613


In [71]:
informativeX = X
X_train, X_test, y_train, y_test = train_test_split(informativeX, y, test_size = 0.3, random_state = 0)
testResult = YinsML.DecisionTree_Classifier(X_train, X_test, y_train, y_test, maxdepth=8)
print(testResult['Data']['X_train'].head(3))
print(testResult['Test Result']['confusion_test'])
print(testResult['Test Result']['test_acc'])

       bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  \
1468          4       1.50         1390      7200     1.0           0     0   
15590         3       1.50         1450      7316     1.0           0     0   
18552         5       2.75         2860      5379     2.0           0     0   

       condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  \
1468           3      7        1140            250      1965             0   
15590          3      7        1450              0      1961             0   
18552          3      9        2860              0      2005             0   

       zipcode      lat     long  sqft_living15  sqft_lot15  
1468     98133  47.7224 -122.332           1630        7702  
15590    98133  47.7725 -122.349           1440        7316  
18552    98052  47.7082 -122.104           2980        6018  
      0     1
0  3738   322
1   396  2028
0.8892658852560148
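
The two accuracies printed above can be recovered directly from the confusion matrices, which makes the comparison between the runs easy to verify. The sketch below copies the matrices from the outputs of the first run (`maxdepth=10`) and the second run (`maxdepth=8`); `accuracy`, `cm_first`, and `cm_second` are hypothetical names.

```python
import numpy as np

# Confusion matrices copied from the two outputs above
cm_first = np.array([[3767, 348], [367, 2002]])
cm_second = np.array([[3738, 322], [396, 2028]])

def accuracy(cm):
    # Correct predictions sit on the diagonal of the confusion matrix
    return np.trace(cm) / cm.sum()

print(accuracy(cm_first))
print(accuracy(cm_second))
```

Both test sets contain 6484 houses, and the two runs differ by only three correctly classified houses (5769 versus 5766).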


Now let us check out the continuous version of the same data set. What if, instead of predicting whether a house price is greater than the average house price, we directly predict the price of the house?

In [83]:
# Clean Data
data = house_sales
X = data.iloc[:, :].drop(['id', 'date', 'price'], axis=1)
y = data['price']
print(X.head(2))
print(y.head())

   bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  \
0         3       1.00         1180      5650     1.0           0     0   
1         3       2.25         2570      7242     2.0           0     0   

   condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  \
0          3      7        1180              0      1955             0   
1          3      7        2170            400      1951          1991   

   zipcode      lat     long  sqft_living15  sqft_lot15  
0    98178  47.5112 -122.257           1340        5650  
1    98125  47.7210 -122.319           1690        7639  
0    221900.0
1    538000.0
2    180000.0
3    604000.0
4    510000.0
Name: price, dtype: float64


In [84]:
%run "../scripts/YinsML.py"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
testRegression = YinsML.DecisionTree_Regressor(X_train, X_test, y_train, y_test, maxdepth = 5)
print(testRegression['Test Result']['RMSE_test'])
print(testRegression['Train Result']['RMSE_train'])

241305.20011885505
236787.34293765223


In [95]:
informativeX = X = pd.concat([X['grade'], X['bathrooms'], X['bedrooms'], X['sqft_living']], axis=1)
X_train, X_test, y_train, y_test = train_test_split(informativeX, y, test_size = 0.3, random_state = 0)
print(informativeX.head(3))
testRegression = YinsML.DecisionTree_Regressor(X_train, X_test, y_train, y_test, maxdepth = 5)
print(testRegression['Test Result']['RMSE_test'])
print(testRegression['Train Result']['RMSE_train'])

   grade  bathrooms  bedrooms  sqft_living
0      7       1.00         3         1180
1      7       2.25         3         2570
2      6       1.00         2          770
282092.9366937713
275978.15523456864


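This wraps up the performance comparison: with the four selected variables, the regression tree's test RMSE rises from about 241,305 to about 282,093, an increase of roughly 17%.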

## Bonus: Why Does It Work

Let us draw random variables from a Bernoulli distribution to create data $X_1, ..., X_p$ and define the underlying model to be
$$y = \left\{
\begin{matrix}
X_1 + X_2 & (\text{mod } 2) & \text{with probability } 0.5 \\
X_2 + X_3 + X_4 & (\text{mod } 2) & \text{with probability } 0.5 \\
\end{matrix}
\right.
$$

In addition, let us also construct engineered features based on interaction-based variable sets. In other words, given a selected variable set $X$, we can construct
$$
X^{\dagger} := \bar{y}_j, \forall j \in \Pi
$$
where $\Pi$ is the set of all partitions generated by the selected variable set $X$ and $j$ indicates the $j^{\text{th}}$ partition in $\Pi$. Each observation's value of $X^{\dagger}$ is $\bar{y}_j$, the local average of the response variable in the partition $j$ that the observation falls into.
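
One way to build such an engineered feature in pandas is a group-wise mean broadcast back to the rows. This is a minimal sketch on hypothetical toy data (`X_toy`, `y_toy`, and the chosen `module` are assumptions for illustration), not necessarily how the package constructs it.

```python
import pandas as pd

# Hypothetical toy data: two binary covariates and a binary response
X_toy = pd.DataFrame({"1": [0, 0, 1, 1, 1, 0], "2": [0, 1, 0, 1, 0, 1]})
y_toy = pd.Series([0, 1, 1, 0, 0, 1])

module = ["1", "2"]  # hypothetical selected variable set

# Partition label per row, e.g. "1_0"
cells = X_toy[module].astype(str).agg("_".join, axis=1)

# X-dagger: every row receives the mean of y within its own partition
x_dagger = y_toy.groupby(cells).transform("mean")

print(pd.concat([X_toy, x_dagger.rename("x_dagger")], axis=1))
```

Rows 2 and 4 share the partition `1_0` with responses 1 and 0, so both receive the value 0.5.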

In [1]:
from scipy.stats import bernoulli
import pandas as pd
import numpy as np

In [19]:
n = 2000
p = 50
data_bern = bernoulli.rvs(size=n * p,p=0.5)

X = pd.DataFrame(data_bern.reshape([n, p]), columns=np.arange(p).astype(str))
print(X.shape)
print(X.head(2))

I = bernoulli.rvs(size=n, p=0.5)
print(np.mean(I))
y1 = np.mod(X.iloc[:, 1] + X.iloc[:, 2], 2)
y2 = np.mod(X.iloc[:, 2] + X.iloc[:, 3] + X.iloc[:, 4], 2)
y = np.where(I == 1, y1, y2)
print(np.mean(y))

(2000, 50)
   0  1  2  3  4  5  6  7  8  9  ...  40  41  42  43  44  45  46  47  48  49
0  1  0  0  0  0  0  0  1  0  1  ...   0   1   0   1   1   1   1   1   0   1
1  1  1  1  0  0  1  1  1  1  0  ...   0   1   0   1   1   0   0   0   0   1

[2 rows x 50 columns]
0.4975
0.492


In [20]:
X.head()

   0  1  2  3  4  5  6  7  8  9  ...  40  41  42  43  44  45  46  47  48  49
0  1  0  0  0  0  0  0  1  0  1  ...   0   1   0   1   1   1   1   1   0   1
1  1  1  1  0  0  1  1  1  1  0  ...   0   1   0   1   1   0   0   0   0   1
2  0  1  0  1  1  1  1  0  1  0  ...   0   0   1   0   0   1   0   1   0   0
3  1  1  0  1  0  0  1  1  0  0  ...   1   0   0   1   0   1   1   0   0   0
4  1  1  0  1  1  0  1  0  1  1  ...   0   1   1   1   1   0   1   0   1   0


In [21]:
%run "../scripts/InteractionBasedLearning.py"
InteractionBasedLearning.InteractionLearning

-----------------------------------------------------

        Yin's Money Managmeent Package 
        Copyright © YINS CAPITAL, 2009 – Present
        For more information, please go to www.YinsCapital.com
        
README:
This script has the following functions:

    (1) iscore(): this function computes the I-score of selected X at predicting Y
    (2) BDA(): this function runs through Backward Dropping Algorithm once
    (3) InteractionLearning(): this function runs many rounds of BDA and finalize the variables selcted according to I-score
    
-----------------------------------------------------


<function __main__.InteractionBasedLearning.InteractionLearning(newX, y, testSize=0.3, num_initial_draw=7, total_rounds=10, top_how_many=3, nameExists=True, TYPE=<class 'int'>, verbatim=True)>

In [22]:
tmpResult = InteractionBasedLearning.InteractionLearning(
    newX=X,
    y=y,
    testSize=0.1,
    num_initial_draw=9,
    total_rounds=2000,
    top_how_many=2,
    nameExists=False,
    TYPE=str,
    verbatim=True)

100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [36:38<00:00,  1.10s/it]


Time Consumption: 2198.378620147705


In [17]:
print(f"Time Consumption (in min): {round(800/60, 3)}")

Time Consumption (in min): 13.333


In [18]:
tmpResult['Brief'].head()

           Modules      Score
246       [[1, 2]]   27.02932
940    [[2, 3, 4]]  14.063055
211          [[1]]    2.62122
561          [[3]]   1.978809
547  [[8, 12, 15]]   1.803012


In [9]:
tmpResult['New Data'].head()

   1  2         0  2  3  4         0
0  0  0  0.285714  0  1  1  0.222222
1  1  0  0.705882  0  0  0  0.266667
2  1  0  0.705882  0  0  1  0.782609
3  1  0  0.705882  0  0  1  0.782609
4  1  0  0.705882  0  0  0  0.266667


As we can see, the selected modules are ranked in descending order of I-score. The top two modules are $[X_1, X_2]$ and $[X_2, X_3, X_4]$, which are exactly the two variable sets that generate $y$, so the correct model is recovered.