# Idea #
From the way target is calculated, it is clear that target values for individual assets are not independent:
$$\text{Target}^a(t) = R^a(t) - \beta^a M(t)$$
$$\beta^a = \frac{\langle M \cdot R^a \rangle}{\langle M^2 \rangle}$$
$$M(t) = \frac{\sum_a w^a R^a(t)}{\sum_a w^a}$$
So instead of predicting each asset separately, we need to predict a vector of targets $(Target_1, Target_2, ... Target_{14})$ for each timestamp.

Building a model to predict the full vector $(Target_1, Target_2, ... Target_{14})$ can be hard, so why not start with a simpler setup: suppose there are only two cryptocurrencies in the universe, Bitcoin and Ethereum. We can gain some insights and try out models in this 2-asset setup. If a solution does not work in the 2-asset setup, it will probably not work in the 14-asset setup. It is important to notice that in this new 2-asset system the old target values no longer apply, and we need to construct the targets ourselves by applying the equations above.
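
To make the dependence between assets concrete, here is a minimal sketch on synthetic returns. The helper name `two_asset_targets`, the weights, and the use of full-sample averages in place of the rolling average $\langle \cdot \rangle$ are all assumptions made for illustration; this is not the competition's exact procedure.

```python
import numpy as np

def two_asset_targets(R, w):
    """R: (T, 2) array of returns, w: (2,) weights.
    Full-sample means stand in for the rolling average <.> (an assumption)."""
    M = R @ w / w.sum()                                      # market return M(t), shape (T,)
    beta = (M[:, None] * R).mean(axis=0) / (M ** 2).mean()   # beta^a, shape (2,)
    return R - beta * M[:, None]                             # Target^a(t), shape (T, 2)

rng = np.random.default_rng(0)
R = rng.normal(scale=1e-3, size=(1000, 2))   # synthetic returns, not real data
w = np.array([2.0, 1.0])                     # hypothetical weights

targets = two_asset_targets(R, w)

# Perturb only the second asset's returns and recompute the targets.
R_perturbed = R.copy()
R_perturbed[:, 1] *= 3.0
targets_perturbed = two_asset_targets(R_perturbed, w)

# The first asset's target changes although its own returns did not.
btc_target_changed = not np.allclose(targets[:, 0], targets_perturbed[:, 0])
print(btc_target_changed)
```

Because $M(t)$ and $\beta^a$ both mix all assets, changing one asset's returns moves every asset's target.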

# Converting data from one asset per line to a vector of assets per line #

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [None]:
df = pd.read_csv("../input/g-research-crypto-forecasting/train.csv")

df_BTC = df[df["Asset_ID"] == 1]  # Bitcoin
df_BTC = df_BTC.drop(columns=["Asset_ID", "Target"])  # asset id and old target are no longer useful
df_BTC.set_index("timestamp", inplace=True)
df_BTC = df_BTC.add_suffix('_BTC')

df_ETH = df[df["Asset_ID"] == 6]  # Ethereum
df_ETH = df_ETH.drop(columns=["Asset_ID", "Target"])   # asset id and old target are no longer useful
df_ETH.set_index("timestamp", inplace=True)
df_ETH = df_ETH.add_suffix('_ETH')

df_BTC_ETH = pd.concat([df_BTC, df_ETH], axis=1)

# at some timestamps only one asset is traded - fill the second with previous values
df_BTC_ETH = df_BTC_ETH.ffill()  # fillna(method='ffill') is deprecated in recent pandas

df_BTC_ETH

# Feature engineering #

Time for some feature engineering. The variables Open, High, Low, Close, and VWAP are all strongly correlated with each other, so it makes sense to keep only one of them (I keep Close). Moreover, it makes sense to store the price in log format, as the returns are simply differences between log prices:
$$R^a(t) = \log\left(P^a(t+16)\ /\ P^a(t+1)\right) = \log P^a(t+16) - \log P^a(t+1)$$
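
As a quick sanity check of this identity, the sketch below computes the return both ways on a synthetic price series (the prices are made up; only the `shift(-16)` and `shift(-1)` offsets come from the definition of $R^a(t)$).

```python
import numpy as np
import pandas as pd

# Synthetic minute-by-minute prices (not real data).
prices = pd.Series(np.linspace(100.0, 120.0, 40))
logprice = np.log(prices)

# R(t) as the log of a price ratio ...
r_ratio = np.log(prices.shift(-16) / prices.shift(-1))
# ... and as a difference of log prices.
r_diff = logprice.shift(-16) - logprice.shift(-1)

# Both agree wherever t+16 is still inside the series.
valid = r_diff.notna()
same = np.allclose(r_ratio[valid], r_diff[valid])
print(same, int(valid.sum()))
```

The last 16 rows have no value at $t+16$, which is why those rows come out as NaN and have to be dropped before training.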

The other prices are only useful insofar as we can infer price differences from them:
- The Open-Close difference is probably not very helpful, since it only captures short-term behaviour (1 minute, much less than the needed 15-minute intervals)
- The Low-High difference gives some measure of how volatile the asset is, so I introduce this as a new feature

As far as time is concerned, it is possible that the hour of the day matters (e.g. people who live in different time zones trade assets differently). I introduce this as a new feature, `frac_of_day`. Similarly, I also extract `frac_of_week`, `frac_of_month`, and `frac_of_year`.
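
For the two simplest features, a vectorized sketch is shown here as a possible alternative to a per-row function. The example timestamps are hypothetical, and the values are treated as timezone-naive UTC, which is an assumption about how the `timestamp` column should be read.

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds, in the style of the `timestamp` column.
ts = pd.Index([0, 43200, 86400 * 3 + 21600])
dti = pd.to_datetime(ts, unit="s")   # timezone-naive, interpreted as UTC

# Seconds elapsed since midnight, then scaled to [0, 1).
seconds_into_day = dti.hour * 3600 + dti.minute * 60 + dti.second
frac_of_day = seconds_into_day / 86400

# Monday is dayofweek 0, so the week fraction starts at Monday 00:00.
frac_of_week = (dti.dayofweek + frac_of_day) / 7

print(list(frac_of_day))
```

Working on the `DatetimeIndex` directly avoids a Python-level call per row, and it does not depend on the machine's local time zone the way `time.mktime` does.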


In [None]:
# engineer logprice and volatility
df_features = df_BTC_ETH.copy()
for suffix in ["_BTC", "_ETH"]:
    df_features["logprice" + suffix] = np.log(df_features["Close" + suffix])    
    df_features["Volatility" + suffix] = np.log(df_features["High" + suffix]) - np.log(df_features["Low" + suffix])
    df_features = df_features.drop(columns=["Close"+suffix, "High"+suffix, "Low"+suffix, "Open"+suffix, "VWAP"+suffix])
    
    
    
# engineer meaningful features out of timestamp
from datetime import datetime as dt
import time

def get_time_fractions(date):
        
    def s(date): # returns seconds since epoch
        return time.mktime(date.timetuple())

    year = date.year
    month = date.month
    day = date.day
    dayofweek = date.dayofweek
    
    startOfThisDay = dt(year=year, month=month, day=day, hour=0, minute=0, second=0)
    n_sec_in_day = 86400
    fraction_of_day = (s(date) - s(startOfThisDay)) / n_sec_in_day
    
    fraction_of_week = (dayofweek + fraction_of_day) / 7
    
    startOfThisMonth = dt(year=year, month=month, day=1)
    startOfNextMonth = dt(year=year, month=month+1, day=1) if month < 12 else dt(year=year+1, month=1, day=1)
    fraction_of_month = (s(date) - s(startOfThisMonth)) / (s(startOfNextMonth) - s(startOfThisMonth))
    
    startOfThisYear = dt(year=year, month=1, day=1)
    startOfNextYear = dt(year=year+1, month=1, day=1)
    fraction_of_year = (s(date) - s(startOfThisYear)) / (s(startOfNextYear) - s(startOfThisYear))   

    return fraction_of_day, fraction_of_week, fraction_of_month, fraction_of_year

datetimes = pd.Series(df_features.index).astype('datetime64[s]')
df_features['frac_of_day'], df_features['frac_of_week'], df_features['frac_of_month'], df_features['frac_of_year'] = zip(*datetimes.map(get_time_fractions))

df_features

# Constructing targets #
As mentioned above, the targets need to be recalculated for the 2-asset system as follows:
$$\text{Target}^a(t) = R^a(t) - \beta^a M(t)$$
$$\beta^a = \frac{\langle M \cdot R^a \rangle}{\langle M^2 \rangle}$$
$$M(t) = \frac{\sum_a w^a R^a(t)}{\sum_a w^a}$$

We probably need to add $\beta^a$ to our features, since a model has no way of learning how it is calculated unless we use a very long time series. The $\beta^a$ are averaged over a long time period (3750 minutes) and should remain nearly constant over the 15-minute interval, so it makes sense to just supply the last calculated $\beta^a$ to the model.

The final dataset for model training `df_features_targets` contains the vectorized target `(Target_BTC, Target_ETH)` as the last two columns.
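
The three equations can be sketched compactly with pandas broadcasting. This is an illustration on synthetic returns, with hypothetical weights and a small window of 50 rows instead of 3750 so that it runs instantly; the column names `R_BTC` and `R_ETH` mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n, window = 500, 50   # small window for illustration only
returns = pd.DataFrame(
    rng.normal(scale=1e-3, size=(n, 2)), columns=["R_BTC", "R_ETH"]
)  # synthetic returns, not real data
w = pd.Series({"R_BTC": 2.0, "R_ETH": 1.0})   # hypothetical weights

# M(t): weighted average of the asset returns at each timestamp.
M = returns.mul(w, axis=1).sum(axis=1) / w.sum()

# beta^a(t) = <M * R^a> / <M^2>, with <.> taken as a rolling mean.
num = returns.mul(M, axis=0).rolling(window).mean()
den = (M ** 2).rolling(window).mean()
betas = num.div(den, axis=0)

# Target^a(t) = R^a(t) - beta^a(t) * M(t)
targets = returns - betas.mul(M, axis=0)

print(targets.dropna().shape)
```

The first `window - 1` rows are NaN because the rolling mean is not yet defined there, which is the reason the leading rows are dropped from the final dataset.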

In [None]:
# calculate 2-asset targets (different from 14-asset targets!)

    
df_logprices = df_features[["logprice_BTC", "logprice_ETH"]]

df_returns = df_logprices.shift(-16) - df_logprices.shift(-1)
for suffix in ["_BTC", "_ETH"]:
    df_returns.rename(columns={"logprice"+suffix : "R"+suffix}, inplace = True)

asset_details = pd.read_csv("../input/g-research-crypto-forecasting/asset_details.csv")
asset_details = asset_details[(asset_details["Asset_ID"] == 1) | (asset_details["Asset_ID"] == 6)]
asset_details = asset_details.sort_values(by=["Asset_ID"])
weights = asset_details["Weight"].to_numpy()
weights = weights.reshape(len(weights),1)

R = df_returns.to_numpy()
weights_sum = np.sum(weights)
M = np.dot(R, weights) / weights_sum
df_M = pd.DataFrame(data=M, index=df_returns.index, columns=["M"])

df_R_M = df_returns.copy()
for col in df_R_M.columns:
    df_R_M[col] = df_R_M[col] * df_M["M"]
for suffix in ["_BTC", "_ETH"]:
    df_R_M.rename(columns={"R"+suffix : "R_M"+suffix}, inplace = True)
df_R_M_rolling = df_R_M.rolling(window=3750).mean()

df_M2 = df_M ** 2
df_M2.rename(columns={"M" : "M2"}, inplace = True)
df_M2_rolling = df_M2.rolling(window=3750).mean()

df_betas = df_R_M_rolling.copy()       
for col in df_betas.columns:
    df_betas[col] = df_betas[col] / df_M2_rolling["M2"]
for suffix in ["_BTC", "_ETH"]:
    df_betas.rename(columns={"R_M"+suffix : "beta"+suffix}, inplace = True)

df_targets = df_returns.copy()
for suffix in ["_BTC", "_ETH"]:
    df_targets["R"+suffix] -= df_betas["beta"+suffix] * df_M["M"]
    df_targets.rename(columns={"R"+suffix : "Target"+suffix}, inplace = True)
    

df_features_targets = pd.concat([df_features, df_betas, df_targets], axis=1)
df_features_targets = df_features_targets.iloc[3750:-16]  # drop the NaN rows (rolling warm-up at the start, missing future prices at the end)

df_features_targets

# Looking at the data #

In [None]:
# correlation heatmap
from bokeh.io import output_notebook, show
from bokeh.models import (
    BasicTicker,
    ColorBar,
    ColumnDataSource,
    LinearColorMapper,
    PrintfTickFormatter,
)
from bokeh.plotting import figure
from bokeh.transform import transform

output_notebook()

df_to_viz = df_features_targets

xcorr = abs(df_to_viz.corr())
xcorr.index.name = "Feature1"
xcorr.columns.name = "Feature2"

df = pd.DataFrame(xcorr.stack(), columns=["Corr"]).reset_index()

source = ColumnDataSource(df)

colors = [
    "#75968f",
    "#a5bab7",
    "#c9d9d3",
    "#e2e2e2",
    "#dfccce",
    "#ddb7b1",
    "#cc7878",
    "#933b41",
    "#550b1d",
]

mapper = LinearColorMapper(palette=colors, low=df.Corr.min(), high=df.Corr.max())

f1 = figure(
    width=800,   # plot_width/plot_height were removed in Bokeh 3
    height=800,
    title="Correlation Heat Map",
    x_range=list(sorted(xcorr.index)),
    y_range=list(reversed(sorted(xcorr.columns))),
    toolbar_location=None,
    tools="hover",
    x_axis_location="above",
)

f1.rect(
    x="Feature2",
    y="Feature1",
    width=1,
    height=1,
    source=source,
    line_color=None,
    fill_color=transform("Corr", mapper),
)

color_bar = ColorBar(
    color_mapper=mapper,
    location=(0, 0),
    ticker=BasicTicker(desired_num_ticks=len(colors)),
    formatter=PrintfTickFormatter(format="%.2f"),  # correlations lie in [0, 1], so show decimals
)
f1.add_layout(color_bar, "right")

f1.hover.tooltips = [
    ("Feature1", "@{Feature1}"),
    ("Feature2", "@{Feature2}"),
    ("Corr", "@{Corr}{1.1111}"),
]

f1.axis.axis_line_color = None
f1.axis.major_tick_line_color = None
f1.axis.major_label_text_font_size = "12px"
f1.axis.major_label_standoff = 2
f1.xaxis.major_label_orientation = 1.0

show(f1)

In [None]:
# looking at returns
df_to_plot = df_returns.iloc[::100]
df_to_plot[(np.abs(df_to_plot["R_BTC"]) < 0.01) & (np.abs(df_to_plot["R_ETH"]) < 0.01)].plot.hist(bins=50, alpha=0.5)
plt.xlabel("Return")
plt.title("Return frequency")
plt.show()

plt.plot(df_to_plot["R_BTC"], df_to_plot["R_ETH"], '.b')
plt.xlim([-0.01, 0.01])
plt.ylim([-0.01, 0.01])
plt.xlabel("Return_BTC")
plt.ylabel("Return_ETH")
plt.grid()
plt.title("ETH return vs BTC return")
plt.show()

We can notice a few things:
- for both assets the return distribution resembles a normal distribution but is more sharply peaked (small returns are disproportionately frequent)
- for BTC, this peaked shape is more pronounced than for ETH
- we see significant correlation between the returns of the two assets: mostly, when BTC is up, ETH is up.
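
One way to put a number on the peaked shape is the excess kurtosis, which is 0 for a normal distribution and positive for sharper-peaked, heavier-tailed ones. The sketch below uses synthetic samples (a Laplace distribution stands in for a peaked return distribution, which is an assumption); on the real data the same call could be applied to `df_returns["R_BTC"]` and `df_returns["R_ETH"]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
normal_like = pd.Series(rng.normal(size=100_000))
peaked = pd.Series(rng.laplace(size=100_000))   # sharper peak, heavier tails

# Series.kurt() returns the excess kurtosis (normal distribution -> about 0).
kurt_normal = normal_like.kurt()
kurt_peaked = peaked.kurt()
print(round(kurt_normal, 2), round(kurt_peaked, 2))
```

A clearly larger value for BTC than for ETH would support the second observation above.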

In [None]:
# looking at betas
y1 = df_features_targets["beta_BTC"].iloc[::15000]
y2 = df_features_targets["beta_ETH"].iloc[::15000]
plt.plot(y1, '-b', label="beta_BTC")
plt.plot(y2, '-r', label="beta_ETH")
plt.xlabel("time, s")
plt.ylabel("betas")
plt.legend()
plt.grid()
plt.title("beta vs time")
plt.show()

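The two betas are not independent: by construction their weight-averaged value is exactly 1 (so they sum to exactly 2 only when the two weights are equal; in the plot the sum stays close to 2). Each of them varies considerably over time.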
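
The link between the betas follows from the definitions of $\beta^a$ and $M(t)$, assuming the same averaging window $\langle \cdot \rangle$ is used for every asset (as in the code above). Taking the weighted average of the betas:

$$\frac{\sum_a w^a \beta^a}{\sum_a w^a} = \frac{\left\langle M \cdot \frac{\sum_a w^a R^a}{\sum_a w^a} \right\rangle}{\langle M^2 \rangle} = \frac{\langle M \cdot M \rangle}{\langle M^2 \rangle} = 1$$

For two assets this reads $w^{BTC} \beta^{BTC} + w^{ETH} \beta^{ETH} = w^{BTC} + w^{ETH}$, so whenever one beta rises above 1 the other must fall below 1.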

In [None]:
# looking at targets
df_to_plot = df_features_targets[["Target_BTC", "Target_ETH"]].iloc[::100]
df_to_plot[(np.abs(df_to_plot["Target_BTC"]) < 0.01) & (np.abs(df_to_plot["Target_ETH"]) < 0.01)].plot.hist(bins=50, alpha=0.5)
plt.xlabel("Target")
plt.title("Target frequency")
plt.show()


plt.plot(df_to_plot["Target_BTC"], df_to_plot["Target_ETH"], '.b')
plt.xlim([-0.01, 0.01])
plt.ylim([-0.01, 0.01])
plt.xlabel("Target_BTC")
plt.ylabel("Target_ETH")
plt.grid()
plt.title("ETH target vs BTC target")
plt.show()

It is interesting that residualization transforms the spread-out 2D distribution of returns into a 1D distribution of targets (the target points all line up on a straight line). Similarly, for 14 assets I expect the distribution of targets to be 13-dimensional. If each asset were treated separately, this feature of the target distribution would be missed.
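
The loss of one dimension can be checked numerically. The sketch below is an illustration on synthetic returns with hypothetical weights and full-sample averages instead of rolling ones (all assumptions); it verifies that $\sum_a w^a \, \text{Target}^a(t) = 0$ and measures the rank of the target matrix for 2 and for 5 assets.

```python
import numpy as np

def residualized_targets(R, w):
    """Targets for returns R (T x N) and weights w (N,), using full-sample averages."""
    M = R @ w / w.sum()                                      # market return M(t)
    beta = (M[:, None] * R).mean(axis=0) / (M ** 2).mean()   # one beta per asset
    return R - M[:, None] * beta

rng = np.random.default_rng(2)
ranks = {}
for n_assets in (2, 5):
    R = rng.normal(size=(2000, n_assets))      # synthetic returns, not real data
    w = rng.uniform(1.0, 5.0, size=n_assets)   # hypothetical weights
    T = residualized_targets(R, w)
    # The weighted sum of targets vanishes (up to rounding) at every timestamp ...
    assert np.abs(T @ w).max() < 1e-9
    # ... so the targets span one dimension fewer than the number of assets.
    ranks[n_assets] = int(np.linalg.matrix_rank(T, tol=1e-8))

print(ranks)
```

With $N$ assets the rank comes out as $N - 1$, which is consistent with expecting a 13-dimensional target distribution for 14 assets.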

In [None]:
df_features_targets[["Target_BTC", "Target_ETH"]].corr()

The BTC and ETH targets are perfectly anti-correlated. This means that for model training we can just try to match Target_BTC: any correlation we achieve for one asset automatically carries over to the other.

# Next steps #
We need to train a model and try to get some meaningful correlation. So far I have not had any success with this. It is possible that the current features are insufficient and we need to add data from previous timestamps to the data set (i.e. treat this as a time series problem).
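
One possible way to bring in data from previous timestamps is to add lagged copies of existing columns. The sketch below is only a starting point: the helper `add_lags` is hypothetical, the lag values are arbitrary, and the tiny frame stands in for `df_features_targets`.

```python
import pandas as pd

def add_lags(df, columns, lags):
    """Return a copy of df with shifted copies of `columns` (hypothetical helper)."""
    out = df.copy()
    for col in columns:
        for lag in lags:
            # A positive shift takes the value from `lag` rows earlier, so no future data leaks in.
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out.dropna()   # the first max(lags) rows have no history

demo = pd.DataFrame({"logprice_BTC": [1.0, 1.1, 1.3, 1.2, 1.5]})
lagged = add_lags(demo, ["logprice_BTC"], lags=[1, 2])
print(lagged)
```

Since rows are one minute apart after the forward fill, a lag of `k` rows corresponds to `k` minutes of history.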

Any suggestions are welcome!