## Target Computation

This notebook attempts to compute the target as described in this discussion:

https://www.kaggle.com/c/g-research-crypto-forecasting/discussion/286778

Version 2.0 improves readability by avoiding some unnecessary shift operations.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime
import time
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20, 8)
INPUT = Path("../input/g-research-crypto-forecasting")

In [None]:
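def ResidualizeMarket(df, mktColumn, window):
  """Residualize every column of df against df[mktColumn] using a rolling beta.

  Returns (residualized returns, betas), both without the market column.
  If mktColumn is not a column of df, df is returned unchanged.
  """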
  if mktColumn not in df.columns:
    return df

  mkt = df[mktColumn]

  num = df.multiply(mkt.values, axis=0).rolling(window).mean().values  #numerator of linear regression coefficient
  denom = mkt.multiply(mkt.values, axis=0).rolling(window).mean().values  #denominator of linear regression coefficient
  beta = np.nan_to_num( num.T / denom, nan=0., posinf=0., neginf=0.)  #if regression fell over, use beta of 0

  resultRet = df - (beta * mkt.values).T  #perform residualization
  resultBeta = 0.*df + beta.T  #shape beta

  return resultRet.drop(columns=[mktColumn]), resultBeta.drop(columns=[mktColumn])
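
The function estimates a rolling, intercept-free regression coefficient, beta = mean(R * M) / mean(M * M), and subtracts beta * M from each return column. A minimal sketch on synthetic data can make this concrete. The toy frame, the seed, and the names `toy`, `toy_market`, and `toy_window` are assumptions for illustration only, and the function body is copied from the cell above so that the block runs on its own. An asset that is exactly twice the market should get a beta of 2 and a residual near 0 once the window is full.

```python
import numpy as np
import pandas as pd

def ResidualizeMarket(df, mktColumn, window):
    # Copied from the notebook cell above so that this sketch is self-contained
    if mktColumn not in df.columns:
        return df
    mkt = df[mktColumn]
    num = df.multiply(mkt.values, axis=0).rolling(window).mean().values
    denom = mkt.multiply(mkt.values, axis=0).rolling(window).mean().values
    beta = np.nan_to_num(num.T / denom, nan=0., posinf=0., neginf=0.)
    resultRet = df - (beta * mkt.values).T
    resultBeta = 0. * df + beta.T
    return resultRet.drop(columns=[mktColumn]), resultBeta.drop(columns=[mktColumn])

# Synthetic returns (illustrative data only):
# A0 is exactly twice the market, A1 is independent noise.
rng = np.random.default_rng(0)
toy_market = rng.normal(0.0, 0.01, size=500)
toy = pd.DataFrame({
    "A0": 2.0 * toy_market,
    "A1": rng.normal(0.0, 0.01, size=500),
    "market": toy_market,
})

toy_window = 50
toy_resid, toy_beta = ResidualizeMarket(toy, "market", window=toy_window)

# Once the rolling window is full, the beta of A0 and its residual can be inspected
print(toy_beta["A0"].iloc[toy_window - 1:].describe())
print(toy_resid["A0"].iloc[toy_window - 1:].abs().max())
```

Before the window is full, the rolling means are NaN, so `np.nan_to_num` sets beta to 0 and the returns pass through unchanged for those rows.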

In [None]:
# log_return_ahead computes R_t = log(P_{t+periods+1} / P_{t+1});
# with periods=15 this is R_t = log(P_{t+16} / P_{t+1})
def log_return_ahead(series, periods=1):
    return -np.log(series).diff(periods=-periods).shift(-1)
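
As a sanity check on the indexing, the function can be compared with a direct evaluation of the formula on a toy series. This is a minimal sketch: the series `toy_prices` and the choice `periods=2` are assumptions for illustration, and the function is repeated so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def log_return_ahead(series, periods=1):
    # Same definition as in the notebook: log(P_{t+periods+1} / P_{t+1})
    return -np.log(series).diff(periods=-periods).shift(-1)

# Toy price series that doubles at every step: P_t = 2**t (illustrative data only)
toy_prices = pd.Series(2.0 ** np.arange(6))

toy_returns = log_return_ahead(toy_prices, periods=2)

# Direct evaluation of log(P_{t+3} / P_{t+1}) at t = 0
direct = np.log(toy_prices.iloc[3] / toy_prices.iloc[1])

print(toy_returns)
print(direct)
```

The last `periods + 1` entries of the result are NaN because the prices they would need lie beyond the end of the series.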

In [None]:
%%time
train_df = pd.read_csv(INPUT/"train.csv")
train_df.head()

### Price of assets
$$P^a$$

In [None]:
prices = train_df.pivot(index=["timestamp"], columns=["Asset_ID"], values=["Close"])

In [None]:
prices.columns = [f"A{a}" for a in range(14)]

In [None]:
prices = prices.reindex(range(prices.index[0], prices.index[-1]+60,60), method='pad')
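
The reindexing step rebuilds a complete 60-second grid and carries the last known price forward into any missing minute. A minimal sketch on a toy frame, where the data and the names `toy` and `toy_filled` are assumptions for illustration:

```python
import pandas as pd

# Toy close prices with the row for timestamp 120 missing (illustrative data only)
toy = pd.DataFrame({"A0": [10.0, 11.0, 13.0]}, index=[0, 60, 180])

# Build the full 60-second grid and pad gaps with the previous observation
full_index = range(toy.index[0], toy.index[-1] + 60, 60)
toy_filled = toy.reindex(full_index, method="pad")

print(toy_filled)
```

Padding only fills forward, so timestamps before an asset's first observation remain NaN.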

In [None]:
prices.info()

In [None]:
prices.index = prices.index.map(lambda x: datetime.fromtimestamp(x))

In [None]:
prices.sort_index(inplace=True)

In [None]:
prices.tail()

### Log Returns over 15 Minutes

$$R^a(t) = \log (P^a(t+16)\ /\ P^a(t+1))$$


In [None]:
log_returns_15min = log_return_ahead(prices, periods=15)

In [None]:
log_returns_15min.info()

In [None]:
log_returns_15min.tail()

In [None]:
log_returns_15min[-200:].plot(grid=True)

### Weighted Average Market Returns

$$M(t) = \frac{\sum_a w^a R^a(t)}{\sum_a w^a}  $$
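
A small worked example of this formula, assuming two hypothetical assets with weights 1 and 3. The names `toy_returns`, `toy_weights`, and `toy_market` are illustrative only.

```python
import numpy as np
import pandas as pd

# Two time steps of returns for two hypothetical assets (illustrative data only)
toy_returns = pd.DataFrame({"A0": [0.01, -0.02], "A1": [0.03, 0.00]})
toy_weights = np.array([1.0, 3.0])

# M(t) = sum_a w^a R^a(t) / sum_a w^a
toy_market = toy_returns.mul(toy_weights, axis="columns").sum(axis=1) / toy_weights.sum()

# The same quantity via NumPy's weighted average, as a cross-check
toy_market_np = np.average(toy_returns.values, axis=1, weights=toy_weights)

print(toy_market.values)
print(toy_market_np)
```

For the first row this gives (1 * 0.01 + 3 * 0.03) / 4 = 0.025, and for the second row (1 * (-0.02) + 3 * 0.00) / 4 = -0.005.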

In [None]:
assets_df = pd.read_csv(INPUT/"asset_details.csv", index_col = "Asset_ID")
assets_df.sort_index(inplace=True)
assets_df

In [None]:
weights = assets_df.Weight.values
weights

In [None]:
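# Divide by the sum of the weights (as in the formula above), not by the number of assets
weighted_avg_market_log_returns = log_returns_15min.mul(weights, axis='columns').sum(axis=1, min_count=1) / weights.sum()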

In [None]:
log_returns_15min.mul(weights, axis='columns')[-200:].plot()
weighted_avg_market_log_returns[-200:].plot(style="k8", grid=True)

In [None]:
log_returns_15min["market"] = weighted_avg_market_log_returns
residualized_market_returns, beta = ResidualizeMarket(log_returns_15min, "market", window=3750)

In [None]:
residualized_market_returns[-200:].plot(grid=True)

### Compare computed with provided target

In [None]:
target = train_df.pivot(index=["timestamp"], columns=["Asset_ID"], values=["Target"])

In [None]:
target.columns = [f"A{a}" for a in range(14)]

In [None]:
target = target.reindex(range(target.index[0], target.index[-1]+60,60), method='pad')

In [None]:
target.index = target.index.map(lambda x: datetime.fromtimestamp(x))

In [None]:
target.sort_index(inplace=True)

In [None]:
target[-200:].plot(grid=True)

In [None]:
residualized_market_returns["A0"][-200:].plot(grid=True)
target["A0"][-200:].plot(style='r--', grid=True)

In [None]:
residualized_market_returns["A1"][-200:].plot()
target["A1"][-200:].plot(style='r--',grid=True)

In [None]:
target_diffs = residualized_market_returns - target

In [None]:
target_diffs.dropna(inplace=True)

In [None]:
np.quantile(target_diffs, [0.025, 0.975])
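
The pooled quantiles above can be complemented by a per-asset summary. The following is a minimal sketch on synthetic data, not on the competition data: the frames `toy_target` and `toy_computed`, the noise scale, and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the provided and the recomputed targets (illustrative only)
toy_target = pd.DataFrame(rng.normal(0.0, 0.005, size=(1000, 2)), columns=["A0", "A1"])
toy_computed = toy_target + rng.normal(0.0, 1e-4, size=(1000, 2))

toy_diffs = (toy_computed - toy_target).dropna()

# 95% interval of the differences, one column per asset
toy_bounds = toy_diffs.quantile([0.025, 0.975])

# Column-wise correlation between recomputed and provided targets
toy_corr = toy_computed.corrwith(toy_target)

print(toy_bounds)
print(toy_corr)
```

A per-asset view like this would show whether any remaining discrepancy is concentrated in particular assets rather than spread evenly.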

In [None]:
plt.hist(target_diffs.values.reshape(-1), bins=1000)
plt.xlim((-0.01,0.01))
plt.grid()
plt.show()

### Conclusion

The targets provided in `train.csv` are very close to the values computed with the `ResidualizeMarket` function provided in this discussion: https://www.kaggle.com/c/g-research-crypto-forecasting/discussion/286778
