As other competitors have shown, the raw features correlate only weakly with the asset prices. Hence, the features need to be transformed before predictive models can learn much from them.

One possible approach is to lag the technical indicators so that they line up better with the historical market prices they were calculated from.

First, we load the available data and check if everything works out as planned:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Load Data
with pd.HDFStore('../input/train.h5') as train:
    df = train.get('train')
    
df.head()

Next, we filter for the technical indicators.

In [None]:
technical = df.filter(regex="technical").columns
technical[:5]

As technical indicators are calculated from historical data, we expect them to follow the asset price 'y' with a certain lag. The unweighted moving average is a simple example, but the same holds for more complex indicators such as Bollinger Bands, which incorporate not only the rolling mean but also the rolling standard deviation.
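
The moving-average case can be checked on synthetic data. The sketch below is only an illustration and not part of the analysis of the competition data: it assumes an AR(1) series as a stand-in for an asset price, and the names `price`, `moving_average`, `lagged_corr`, `phi` and `window` are made up for this example. It correlates the moving average at time t with the price at time t - lag.

```python
import numpy as np
import pandas as pd

# Assumption: a synthetic AR(1) series stands in for an asset price.
rng = np.random.default_rng(0)
n, phi, window = 5000, 0.9, 21
noise = rng.normal(size=n)
values = np.zeros(n)
for t in range(1, n):
    values[t] = phi * values[t - 1] + noise[t]
price = pd.Series(values)

# Unweighted moving average over the last `window` observations.
moving_average = price.rolling(window).mean()

# Correlate the indicator at time t with the price at time t - lag.
lagged_corr = pd.Series(
    {lag: moving_average.corr(price.shift(lag)) for lag in range(window)}
)

print(lagged_corr.round(3))
print("lag with the highest correlation:", lagged_corr.idxmax())
```

For a series like this one, the correlation should be clearly higher around half the window length than at lag 0, which is the kind of pattern the lag analysis below looks for in the real features.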

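We define a function that computes the correlation between one series and a lagged copy of another. With the arguments used below, it correlates each feature at time t with the asset price at time t - lag.

In [None]:
def crosscorr(x, y, lag=0):
    # Correlate y with x shifted forward by `lag` rows (row t of the result holds x from row t - lag).
    return y.corr(x.shift(lag))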

We apply this function to two assets (id = 70 and id = 150) and plot the results for all lags from 0 to 99. The blue bars show the correlations at the different lags for the asset with id 70, and the red bars show them for the asset with id 150.

In [None]:
def plot_lagged_correlation(asset_id, lags, column, color):
    # Select one asset, then correlate the feature with the price at each lag.
    asset = df[df.id == asset_id]
    xcov = [crosscorr(asset["y"], asset[column], lag=i) for i in range(lags)]
    plt.bar(np.arange(lags), xcov, color=color)

# One figure per technical indicator: id 70 on top, id 150 below.
for column in technical:
    plt.figure()
    plt.subplot(211)
    plot_lagged_correlation(70, 100, column, "blue")
    plt.title("Feature : " + column)
    plt.subplot(212)
    plot_lagged_correlation(150, 100, column, "red")
    plt.tight_layout()


We see mixed results for the different technical indicators. On the one hand, a few features have their highest absolute correlation without being lagged (e.g. technical_30). On the other hand, most features have a very low correlation in the initial unlagged setting but a higher correlation once they are lagged. Often, the correlations are positive for one asset and negative for the other. However, my guess is that the optimal lag can be determined from the absolute correlation.
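
One way to act on this guess is to pick, for each feature, the lag with the largest absolute correlation. The following is a minimal sketch under that assumption: `best_lag` is a hypothetical helper that is not part of the analysis above, `crosscorr` is repeated so the block runs on its own, and the data is synthetic, with the feature built as a negated, noisy copy of the target from 7 steps earlier.

```python
import numpy as np
import pandas as pd

def crosscorr(x, y, lag=0):
    # Same helper as above: correlate y with x shifted by `lag` rows.
    return y.corr(x.shift(lag))

def best_lag(target, feature, max_lag=100):
    """Hypothetical helper: return the lag in [0, max_lag) with the largest
    absolute correlation, together with the signed correlation at that lag."""
    corrs = pd.Series(
        {lag: crosscorr(target, feature, lag=lag) for lag in range(max_lag)}
    )
    lag = corrs.abs().idxmax()
    return lag, corrs[lag]

# Assumed synthetic data: the feature follows the target with a lag of 7 rows
# and with the opposite sign.
rng = np.random.default_rng(42)
target = pd.Series(rng.normal(size=2000))
feature = -target.shift(7) + 0.1 * pd.Series(rng.normal(size=2000))

lag, corr = best_lag(target, feature, max_lag=30)
print(lag, round(corr, 3))
```

Using the absolute value means a strongly negative correlation counts as much as a strongly positive one, which matters here because the sign differs between assets. On the competition data, the same call could be made per asset, for example with `asset = df[df.id == 70]` and `best_lag(asset["y"], asset["technical_30"])`.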

I hope that this first analysis can be used to identify the time window underlying the different technical features and to create more powerful features for the predictive models. If you have any suggestions to improve this script, or ideas on how to apply it to the complete portfolio, I would be happy to hear them.