# Introduction 
`V1.0.0`
### Who am I
Just a fellow Kaggle learner. I created this notebook as practice and thought it could be useful to others.
### Who is this for
This notebook is for people who learn from examples. Forget the boring lectures and follow along for a fun and instructive time :)
### What can I learn here
You will learn all the basics needed to create a rudimentary parallel RNN/LSTM network. I go over a multitude of steps with explanations. Hopefully, with these building blocks, you can go ahead and build much more complex models.

### Things to remember
+ Please Upvote/Like the Notebook so other people can learn from it
+ Feel free to give any recommendations/changes. 
+ I will be continuously updating the notebook. Look forward to many more upcoming changes in the future.

### You can also refer to these notebooks that have helped me as well:
+ https://www.kaggle.com/yamqwe/g-research-lstm-starter-notebook#Training-%F0%9F%8F%8B%EF%B8%8F

+ https://www.kaggle.com/vmuzhichenko/g-research-parallel-lstm-training



# Dataset Structure 

> **train.csv** - The training set
> 
> 1.  timestamp - A timestamp for the minute covered by the row.
> 2.  Asset_ID - An ID code for the cryptoasset.
> 3.  Count - The number of trades that took place this minute.
> 4.  Open - The USD price at the beginning of the minute.
> 5.  High - The highest USD price during the minute.
> 6.  Low - The lowest USD price during the minute.
> 7.  Close - The USD price at the end of the minute.
> 8.  Volume - The number of cryptoasset units traded during the minute.
> 9.  VWAP - The volume-weighted average price for the minute.
> 10. Target - 15 minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
> 11. Weight - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
> 12. Asset_Name - Human readable Asset name.
> 
>
> **example_test.csv** - An example of the data that will be delivered by the time series API.
> 
> **example_sample_submission.csv** - An example of the data that will be delivered by the time series API. The data is just copied from train.csv.
> 
> **asset_details.csv** - Provides the real name of the cryptoasset for each Asset_ID and the weight each cryptoasset receives in the metric.
> 
> **supplemental_train.csv** - After the submission period is over this file's data will be replaced with cryptoasset prices from the submission period. In the Evaluation phase, the train, train supplement, and test set will be contiguous in time, apart from any missing data. The current copy, which is just filled with approximately the right amount of data from train.csv, is provided as a placeholder.
>
> - There are 14 coins in the dataset
>
> - There are 4 years of data in the [full] dataset

# Imports
First let us start by importing the relevant libraries that we need.

In [None]:
# Ignore Warnings
import warnings
from warnings import simplefilter
warnings.filterwarnings("ignore")

# Computational imports
import numpy as np   # Library for n-dimensional arrays
import pandas as pd  # Library for dataframes (structured data)

# Helper imports
import os 
import re
import time
from tqdm import tqdm
import datetime as dt
from datetime import datetime
import scipy.stats as stats
from pathlib import Path

# ML/DL imports
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, LabelEncoder, RobustScaler
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_probability as tfp
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, RepeatVector, TimeDistributed
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Plotting imports
import matplotlib.pyplot as plt
import matplotlib.dates as dates
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot

%matplotlib inline
init_notebook_mode(connected=True)

# Set seeds to make the experiment more reproducible.
from numpy.random import seed
seed(1)

# Allows us to see more information regarding the DataFrame
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)

# Plotting
Preparing the data and plotting different kinds of plots.

## Reading and Preparing the Data
Let's start by reading our data. We will store it in many dataframes.

In [None]:
data_path = "../input/g-research-crypto-forecasting/"

crypto_df = pd.read_csv(data_path + 'train.csv')
asset_details_df = pd.read_csv(data_path + 'asset_details.csv')

Let us see how many timesteps are in this dataframe.

In [None]:
len(crypto_df[crypto_df["Asset_ID"]==1].set_index("timestamp"))

## Plotting: Candlestick plots
Bitcoin has an Asset_ID of 1. We select only the most recent 200 timesteps of data (`-200:` refers to LAST_TIMESTEP-200:LAST_TIMESTEP). We call it btc_mini since it is a minuscule portion of the whole dataframe.

In [None]:
btc = crypto_df[crypto_df["Asset_ID"]==1].set_index("timestamp") # Asset_ID = 1 for Bitcoin
btc_mini = btc.iloc[-200:] # Select recent data rows

We use the plotly library and its Candlestick graph.

In [None]:
candle_stick_graph = go.Candlestick(x=btc_mini.index, open=btc_mini['Open'], high=btc_mini['High'], low=btc_mini['Low'], close=btc_mini['Close'])
fig = go.Figure(data=[candle_stick_graph])
fig.show()

## Taking care of NaN (missing values) in the dataframe
This allows us to check how many null/nan values each column has.


In [None]:
crypto_df.isnull().sum()

Let's check more precisely on one crypto: Ethereum.

In [None]:
eth = crypto_df[crypto_df["Asset_ID"] == 6].set_index("timestamp") # Asset_ID = 6 for Ethereum
eth.info(show_counts =True)

In [None]:
eth.isna().sum()

There are indeed many missing values here; we will take care of them later.

In [None]:
start_btc = btc.index[0].astype('datetime64[s]')
end_btc = btc.index[-1].astype('datetime64[s]')
start_eth = eth.index[0].astype('datetime64[s]')
end_eth = eth.index[-1].astype('datetime64[s]')

print('BTC data goes from', start_btc, 'to', end_btc)
print('Ethereum data goes from', start_eth, 'to', end_eth)

Let's check the intervals between all timesteps and see if they are all homogeneous. If they are not, we will have to fix them.

In [None]:
(eth.index[1:]-eth.index[:-1]).value_counts().head()

We notice that the timesteps aren't all equally spaced. That is a problem we have to solve, and we solve it by reindexing the dataframe.

In [None]:
eth = eth.reindex(range(eth.index[0],eth.index[-1]+60,60),method='ffill')
eth.head(5)

We solved it by reindexing the dataframe onto a regular 60-second grid and forward filling all the missing values.

In [None]:
(eth.index[1:]-eth.index[:-1]).value_counts().head()

We have equally spaced data now!
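
To make the reindexing step easier to picture, here is a minimal sketch on toy data. The `toy` frame and its values are made up for illustration; only the `reindex(..., method="ffill")` pattern mirrors what we did for ETH.

```python
import pandas as pd

# Toy minute bars (timestamps in seconds); the row for t=120 is missing.
toy = pd.DataFrame({"Close": [10.0, 11.0, 13.0]}, index=[0, 60, 180])

# Build the full 60-second grid and forward-fill the gap.
full_index = range(toy.index[0], toy.index[-1] + 60, 60)
toy_filled = toy.reindex(full_index, method="ffill")

print(toy_filled)
```

The generated row at t=120 simply repeats the last known close, so it carries no new information. That is worth keeping in mind when computing returns on filled data.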

## Plotting: BTC and ETH vs time
We plot both BTC and ETH side by side to analyze and observe if they have any similarities. To do so, we use plotly subplots.

In [None]:
fig  = make_subplots(rows=1, cols=2, 
                    specs=[[{"type": "scatter"}, {"type": "scatter"}]],
                    column_widths=[0.5, 0.5], vertical_spacing=0, horizontal_spacing=0.10,
                    subplot_titles=("BTC vs Time", "ETH vs Time"))

fig.add_trace(go.Scatter(x=btc.index,y=btc.Close,name="BTC"), row=1, col=1)
fig.add_trace(go.Scatter(x=eth.index,y=eth.Close,name="ETH"), row=1, col=2)
fig.update_layout(autosize=True, margin=dict(b=0,r=20,l=20), template="plotly_dark", title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"), 
                  width=900, height=500, title_text="Crypto vs Time", font=dict(color='#8a8d93'),)
fig.update_xaxes(title_text="Time (seconds)")
fig.update_yaxes(title_text="Closing Price ($)")
fig.show()

Let's make an inline function that transforms string dates into POSIX timestamps.

In [None]:
# auxiliary function, from datetime to timestamp
totimestamp = lambda s: np.int32(time.mktime(dt.datetime.strptime(s, "%d/%m/%Y").timetuple()))

# create intervals
btc_mini_2021 = btc.loc[totimestamp('01/06/2021'):totimestamp('01/07/2021')]
eth_mini_2021 = eth.loc[totimestamp('01/06/2021'):totimestamp('01/07/2021')]

Finally, let's build a similar graph, but this time only observing one month of 2021 data (June).

In [None]:
fig  = make_subplots(rows=2, cols=1, 
                    specs=[[{"type": "scatter"}], [{"type": "scatter"}]],
                    column_widths=[0.5], vertical_spacing=0.35, horizontal_spacing=0,
                    subplot_titles=("BTC vs Time (2021)", "ETH vs Time (2021)"))

fig.add_trace(go.Scatter(x=btc_mini_2021.index,y=btc_mini_2021.Close,name="BTC"), row=1, col=1)
fig.add_trace(go.Scatter(x=eth_mini_2021.index,y=eth_mini_2021.Close,name="ETH"), row=2, col=1)
fig.update_layout(autosize=True, margin=dict(b=0,r=20,l=20), template="plotly_dark", title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"), 
                  width=900, height=500, title_text="Crypto vs Time", font=dict(color='#8a8d93'),)
fig.update_xaxes(title_text="Time (seconds)")
fig.update_yaxes(title_text="Closing Price ($)")
fig.show()

## Plotting: Log Returns 
To analyze price changes for an asset we can work with the price difference. However, different assets exhibit different price scales, so their price differences are not readily comparable. We can solve this problem by computing the percentage change in price instead, also known as the return. This return coincides with the percentage change in our invested capital.

Returns are widely used in finance, however log returns are preferred for mathematical modelling of time series, as they are additive across time. Also, while regular returns cannot go below -100%, log returns are not bounded.

To compute the log return, we can simply take the logarithm of the ratio between two consecutive prices. The first row will have an empty return as the previous value is unknown, therefore the empty return data point will be dropped.

In [None]:
# define function to compute log returns
def log_return(series, periods=1):
    return np.log(series).diff(periods=periods)

In [None]:
lret_btc_mini = log_return(btc_mini_2021.Close)[1:]
lret_eth_mini  = log_return(eth_mini_2021.Close)[1:]
lret_btc_mini.rename('lret_btc_mini', inplace=True)
lret_eth_mini.rename('lret_eth_mini', inplace=True)

plt.figure(figsize=(8,4))
plt.plot(lret_btc_mini);
plt.plot(lret_eth_mini);
plt.show()
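
The additivity of log returns mentioned above can be checked on a few made-up prices. This is only a sketch: the `prices` array is hypothetical and unrelated to the competition data.

```python
import numpy as np

prices = np.array([100.0, 110.0, 99.0, 120.0])  # toy prices

simple_returns = prices[1:] / prices[:-1] - 1
log_returns = np.diff(np.log(prices))

# Log returns add up to the log return over the whole span...
total_log_return = np.log(prices[-1] / prices[0])
# ...while simple returns have to be compounded instead.
total_simple_return = np.prod(1 + simple_returns) - 1

print(log_returns.sum(), total_log_return)
print(total_simple_return)
```

Summing the per-step log returns gives the same number as the log of the overall price ratio, whereas simple returns have to be multiplied together.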

## Plotting: Correlations

In [None]:
# join two asset in single DataFrame
lret_btc_long = log_return(btc.Close)[1:]
lret_eth_long = log_return(eth.Close)[1:]
lret_btc_long.rename('lret_btc', inplace=True)
lret_eth_long.rename('lret_eth', inplace=True)
two_assets = pd.concat([lret_btc_long, lret_eth_long], axis=1)

# group consecutive rows and use .corr() for correlation between columns
corr_time = two_assets.groupby(two_assets.index//(10000*60)).corr().loc[:,"lret_btc"].loc[:,"lret_eth"]

corr_time.plot();
plt.xticks([])
plt.ylabel("Correlation")
plt.title("Correlation between BTC and ETH over time");

We do see a strong correlation between BTC and ETH. This is what we expect. The price of ETH is known to be strongly swayed and dependent on the price of BTC.

## Plotting: Heatmaps
We are going to plot the heatmap here with the seaborn library. The heatmap will show us visually the correlation between different assets. The hotter (red) it is, the higher the correlation.

In [None]:
# create dataframe with returns for all assets
all_assets_2021 = pd.DataFrame([])
for asset_id, asset_name in zip(asset_details_df.Asset_ID, asset_details_df.Asset_Name):
  asset = crypto_df[crypto_df["Asset_ID"]==asset_id].set_index("timestamp")
  asset = asset.loc[totimestamp('01/01/2021'):totimestamp('01/05/2021')]
  asset = asset.reindex(range(asset.index[0],asset.index[-1]+60,60),method='pad')
  lret = log_return(asset.Close.fillna(0))[1:]
  all_assets_2021 = all_assets_2021.join(lret, rsuffix=asset_name, how="outer")

In [None]:
plt.figure(figsize = (10,7))
sns.heatmap(all_assets_2021.corr(), annot=True)
plt.show()

## Plotting: Trend
Simple trend plotting with a rolling average. We use a 365-day window (expressed in minutes, hence the `*24*60`).

In [None]:
moving_average = eth.Close.rolling(
    window=365*24*60,       # 365-day window
    center=True,            # puts the average at the center of the window
    min_periods=182*24*60,  # choose about half the window size
).mean()                    # compute the mean (could also do median, std, min, max, ...)

ax = eth.Close.plot(style=".", color="0.5")
moving_average.plot(
    ax=ax, linewidth=3, title="ETH Close - 365-Day Moving Average", legend=False,
);
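
If the `center` and `min_periods` arguments are unfamiliar, the following sketch on a tiny made-up series shows what they change. The window size of 3 is an assumption chosen to keep the numbers readable.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Centered window: each average uses one neighbour on each side.
centered = s.rolling(window=3, center=True, min_periods=2).mean()
# Trailing window (the default): each average only looks backwards.
trailing = s.rolling(window=3, min_periods=2).mean()

print(pd.DataFrame({"value": s, "centered": centered, "trailing": trailing}))
```

With `min_periods=2` the edges still get a value as long as two observations fall inside the window. A centered window looks into the future, which is fine for plotting a trend but should not be used to build model features.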

## Plotting: Seasonal Trend
First we define a couple of functions that will allow us to easily plot the seasonal trend.

In [None]:
simplefilter("ignore")

# Set Matplotlib defaults
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True, figsize=(11, 5))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)
%config InlineBackend.figure_format = 'retina'


# annotations: https://stackoverflow.com/a/49238256/5769929
def seasonal_plot(X, y, period, freq, ax=None):
    if ax is None:
        _, ax = plt.subplots()
    palette = sns.color_palette("husl", n_colors=X[period].nunique(),)
    ax = sns.lineplot(
        x=freq,
        y=y,
        hue=period,
        data=X,
        ci=False,
        ax=ax,
        palette=palette,
        legend=False,
    )
    ax.set_title(f"Seasonal Plot ({period}/{freq})")
    for line, name in zip(ax.lines, X[period].unique()):
        y_ = line.get_ydata()[-1]
        ax.annotate(
            name,
            xy=(1, y_),
            xytext=(6, 0),
            color=line.get_color(),
            xycoords=ax.get_yaxis_transform(),
            textcoords="offset points",
            size=14,
            va="center",
        )
    return ax

Before anything, we have to transform the index into a datetime format to be able to plot the seasonal plot.

In [None]:
eth["temp_date"] = eth.index
eth["temp_date"]

In [None]:
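# store mixed int/str values explicitly, then convert the first 1000 timestamps
eth["temp_date"] = eth["temp_date"].astype(object)
for idx in eth.index[:1000]:
    new_date = datetime.utcfromtimestamp(idx).strftime('%Y-%m-%d %H:%M:%S')
    eth.loc[idx, "temp_date"] = new_date  # .loc avoids chained assignment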

In [None]:
eth.set_index(eth['temp_date'], inplace = True)
eth = eth.drop(columns = 'temp_date')  # assign the result, otherwise the drop has no effect
eth_seas_plot = eth.iloc[:1000, :]
eth_seas_plot.index = pd.to_datetime(eth_seas_plot.index)

Now that we have set a new datetime index, we can proceed with the plotting.

In [None]:
X = pd.DataFrame(eth_seas_plot.Close.copy())

# days within a week
X["day"] = eth_seas_plot.index.dayofweek  # the x-axis (freq)
X["week"] = eth_seas_plot.index.isocalendar().week.astype("int64").to_numpy()  # the seasonal period (period)
X["minute"] = eth_seas_plot.index.minute
X["second"] = eth_seas_plot.index.second
# days within a year
X["dayofyear"] = eth_seas_plot.index.dayofyear
X["year"] = eth_seas_plot.index.year
fig, ax0 = plt.subplots(1, 1, figsize=(11, 6))
seasonal_plot(X, y="Close", period="week", freq="minute", ax=ax0)

# Training and Predicting
This section contains all necessary steps to train and predict your model.

## Helper Functions

### Split Sequences
A key component of a time-series problem is splitting our input data into sequences that we can feed to our LSTM network. These sequences depend on the required timesteps and horizon.

In [None]:
def split_sequences(sequences, timesteps, horizon):
    Sequences, Targets = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + timesteps
        out_end_ix = end_ix + horizon-1
        # check if we are beyond the dataset
        if out_end_ix > len(sequences):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix, :-1], sequences[end_ix-1:out_end_ix, -1]
        Sequences.append(seq_x)
        Targets.append(seq_y)
    # convert once, after the loop, to NumPy arrays
    return np.array(Sequences), np.array(Targets)
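
Here is a small, self-contained sketch of the same windowing idea on a toy array, so the resulting shapes are easy to verify by hand. `make_windows` is a hypothetical stand-in that mirrors the slicing above; the toy array has two feature columns and one target column.

```python
import numpy as np

def make_windows(data, timesteps, horizon):
    """Hypothetical helper mirroring the slicing used in split_sequences."""
    xs, ys = [], []
    for start in range(len(data)):
        end = start + timesteps
        out_end = end + horizon - 1
        if out_end > len(data):
            break
        xs.append(data[start:end, :-1])       # features: all but the last column
        ys.append(data[end - 1:out_end, -1])  # targets: last column
    return np.array(xs), np.array(ys)

# 6 rows, 2 feature columns plus a target column.
toy = np.arange(18, dtype=float).reshape(6, 3)
X_toy, y_toy = make_windows(toy, timesteps=3, horizon=2)

print(X_toy.shape, y_toy.shape)
```

Note that the target slice starts at the last row of the input window (`end - 1`), so the first target value belongs to the same timestep as the final input row.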

### Downcasting
This function downcasts our variables to types that take less memory, which lowers memory usage and can speed up processing.

In [None]:
def downcast_dtypes(df):
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols = [c for c in df if df[c].dtype in ["int64", "int32"]]
    df[float_cols] = df[float_cols].astype(np.float32)
    # Caution: int16 only holds values from -32768 to 32767; larger integers would overflow
    df[int_cols] = df[int_cols].astype(np.int16)
    return df
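
As a hedged alternative for integer columns, `pd.to_numeric` with `downcast="integer"` chooses the smallest integer type that still fits every value, which avoids the overflow risk of a fixed `int16`. The toy frame below is made up for illustration and is not how the notebook processes the data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "timestamp": np.array([1609459200, 1609459260], dtype=np.int64),
    "Count": np.array([5, 7], dtype=np.int64),
    "Close": np.array([29000.5, 29010.25], dtype=np.float64),
})

safe = toy.copy()
for col in ["timestamp", "Count"]:
    # Picks the smallest signed integer type that can hold every value.
    safe[col] = pd.to_numeric(safe[col], downcast="integer")
safe["Close"] = safe["Close"].astype(np.float32)

print(safe.dtypes)
print(toy.memory_usage(index=False).sum(), "->", safe.memory_usage(index=False).sum())
```

Here the timestamps end up as `int32` rather than `int16`, because POSIX timestamps are far too large for 16 bits.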

### Normalization 
These functions are used to normalize our data with min-max scaling and to undo the normalization afterwards. This aids with model performance and speed. You can also use the scikit-learn MinMaxScaler if you prefer.

In [None]:
def Normalize(list):
    list = np.array(list)
    low, high = np.percentile(list, [0, 100])
    delta = high - low
    if delta != 0:
        for i in range(0, len(list)):
            list[i] = (list[i]-low)/delta
    return  list,low,high

def FNoramlize(list,low,high):
    delta = high - low
    if delta != 0:
        for i in range(0, len(list)):
            list[i] = list[i]*delta + low
    return list

def Normalize2(list,low,high):
    list = np.array(list)
    delta = high - low
    if delta != 0:
        for i in range(0, len(list)):
            list[i] = (list[i]-low)/delta
    return  list
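
The same min-max scaling can be written without loops. The sketch below uses a made-up `values` array and checks that scaling followed by the inverse transform returns the original numbers.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 10.0])  # toy data, float on purpose

low, high = values.min(), values.max()
scaled = (values - low) / (high - low)    # maps [low, high] onto [0, 1]
restored = scaled * (high - low) + low    # inverse transform

print(scaled)
print(restored)
```

Working on a float array matters: writing fractional values into an integer NumPy array in place would silently truncate them.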

### Show Shapes
This function is used to quickly check the shapes of our NumPy arrays. This is especially important to ensure we have the right shape for our LSTM network.

In [None]:
def show_shapes(Sequences, Targets): # this'll use inputs; can make yours to use local variable values
    print("Expected: (num_samples, timesteps, channels)")
    print("Sequences: {}".format(Sequences.shape))
    print("Targets:   {}".format(Targets.shape))   

### Exploratory Data Analysis for pandas
This function is used to quickly check the basic attributes of our pandas DataFrame.

In [None]:
def basic_eda(df):
    print("-------------------------------TOP 5 RECORDS-----------------------------")
    print(df.head(5))
    print()
    
    print("-------------------------------INFO--------------------------------------")
    print(df.info())
    print()
    
    print("-------------------------------Describe----------------------------------")
    print(df.describe())
    print()
    
    print("-------------------------------Columns-----------------------------------")
    print(df.columns)
    print()
    
    print("-------------------------------Data Types--------------------------------")
    print(df.dtypes)
    print()
    
    print("----------------------------Missing Values-------------------------------")
    print(df.isnull().sum())
    print()
    
    print("----------------------------NULL values----------------------------------")
    print(df.isna().sum())
    print()
    
    print("--------------------------Shape Of Data---------------------------------")
    print(df.shape)
    print()
    
    print("============================================================================ \n")

This function is specific to this competition. It adds useful features to the DataFrame.

In [None]:
def add_features(df):
    df['Upper_Shadow'] = df['High'] - np.maximum(df['Close'], df['Open'])
    df['Lower_Shadow'] = np.minimum(df['Close'], df['Open']) - df['Low']
    
    df['spread'] = df['High'] - df['Low']
    df['mean_trade'] = df['Volume']/df['Count']
    df['log_price_change'] = np.log(df['Close']/df['Open'])
    return df
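
To see what these features mean, here is a sketch computing them for a single made-up candle. The numbers are invented; the formulas are the same as in `add_features`.

```python
import numpy as np
import pandas as pd

candle = pd.DataFrame({
    "Open": [100.0], "High": [105.0], "Low": [98.0], "Close": [103.0],
    "Volume": [50.0], "Count": [10.0],
})

upper_shadow = candle["High"] - np.maximum(candle["Close"], candle["Open"])  # wick above the body
lower_shadow = np.minimum(candle["Close"], candle["Open"]) - candle["Low"]   # wick below the body
spread = candle["High"] - candle["Low"]                                      # full range of the minute
mean_trade = candle["Volume"] / candle["Count"]                              # average units per trade
log_price_change = np.log(candle["Close"] / candle["Open"])                  # log return within the minute

print(upper_shadow.iloc[0], lower_shadow.iloc[0], spread.iloc[0], mean_trade.iloc[0])
```

The shadows measure how far the price travelled beyond the open/close body, which is a rough indicator of rejected price moves within the minute.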

## Reading and Preparing the Data
Let's start by reading our data. We will store it in many dataframes.

In [None]:
data_path = "../input/g-research-crypto-forecasting/"

train = pd.read_csv(data_path + 'train.csv').set_index("timestamp")
assets = pd.read_csv(data_path + 'asset_details.csv')

Let's create a dict named assets_order, which tracks the order in which assets appear within each timestep of the DataFrame (for example, asset #3 (Cardano) comes first and is followed by #2). This matters because we later create a new feature in the training DataFrame, named asset_order, which goes from 0 to 13 within one timestep and then resets to 0 for the next timestep.

In [None]:
assets_order = pd.read_csv('../input/g-research-crypto-forecasting/supplemental_train.csv').Asset_ID[:14]
assets_order = dict((t,i) for i,t in enumerate(assets_order))
assets_order

I'm going to truncate the training data just for the notebook's sake; otherwise, the training would take an eternity.

In [None]:
train_cut = train[15000000:].copy()  # copy so the assignments below do not write into a slice of train

train_cut[['Count', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP','Target']] = \
train_cut[['Count', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP','Target']].astype(np.float32)

train_cut.head(10)

## Taking care of NaN (missing values) in the dataframe
There are multiple ways of doing this; I have chosen to do it with the following method:

In [None]:
print(np.sum(train_cut.isna()))
train_cut['Target'] = train_cut['Target'].fillna(method = 'ffill')
print('\n', np.sum(train_cut.isna()))

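The VWAP feature has some NaN and infinite values. Let's take care of them: infinities are replaced by VWAP_max and VWAP_min (the largest and smallest finite values), and NaN becomes 0, which is the `np.nan_to_num` default.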

In [None]:
# VWAP column has -inf and inf values. VWAP_max and VWAP_min will be used for replacement
    
VWAP_max = np.max(train_cut[np.isfinite(train_cut.VWAP)].VWAP)
VWAP_min = np.min(train_cut[np.isfinite(train_cut.VWAP)].VWAP)
print(VWAP_max, "\n", VWAP_min)

train_cut['VWAP'] = np.nan_to_num(train_cut.VWAP, posinf=VWAP_max, neginf=VWAP_min)
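
A quick sketch of what `np.nan_to_num` does with each kind of bad value, on a made-up array:

```python
import numpy as np

vwap = np.array([1.0, np.inf, -np.inf, np.nan, 3.0])  # toy values

finite = vwap[np.isfinite(vwap)]  # drops NaN as well as +/-inf
cleaned = np.nan_to_num(vwap, posinf=finite.max(), neginf=finite.min())

print(cleaned)
```

Positive infinity becomes the largest finite value, negative infinity the smallest, and NaN falls back to 0 because the `nan` argument is left at its default.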

Let's make a new series named ids, which stores which timestep and asset each row belongs to. For example, for the first timestep of Cardano (#3), the id for this row would be 0_3 (timestep index 0, since the enumeration starts at 0, and asset ID 3, which is Cardano).

In [None]:
df = train_cut[['Asset_ID', 'Target']].copy()
times = dict((t,i) for i,t in enumerate(df.index.unique()))
df['id'] = df.index.map(times)
df['id'] = df['id'].astype(str) + '_' + df['Asset_ID'].astype(str)
ids = df.id.copy()

del df

Let's add some extra features that will help with the prediction. We will be using the add_features function we defined previously.

In [None]:
train_cut = add_features(train_cut)
train_cut.shape

## Standardizing the training data
Here we standardize our training data with RobustScaler. As the name states, RobustScaler is robust to outliers: it centers on the median and scales by the interquartile range, so stray data points that don't follow the rest of the distribution have little influence on the scaling. This usually helps the model train faster and more stably.

In [None]:
scale_features = train_cut.columns.drop(['Asset_ID','Target'])
RS = RobustScaler()
train_cut[scale_features] = RS.fit_transform(train_cut[scale_features])
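
To illustrate why this kind of scaling copes with outliers, the sketch below reproduces in plain NumPy what RobustScaler computes with its default settings (median centering, 25th to 75th percentile range) and compares it with ordinary mean/std scaling. The data is made up, with one deliberate outlier.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # toy data with one outlier

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - median) / (q3 - q1)      # median / IQR scaling

standard = (x - x.mean()) / x.std()    # mean / std scaling for comparison

print(robust)
print(standard)
```

With median/IQR scaling the four ordinary points stay nicely spread out, while mean/std scaling squeezes them into almost the same value because the outlier inflates both the mean and the standard deviation.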

## Fixing Timesteps
We have to re-index the timestep (like we did for plotting). I chose to do it in a different fashion here just to practice different methods. We will be using the reindex() function to do so.

In [None]:
ind = train_cut.index.unique()
def reindex(df):
    df = df.reindex(range(ind[0],ind[-1]+60,60),method='nearest')
    df = df.fillna(method="ffill").fillna(method="bfill")
    return df

In [None]:
train_cut=train_cut.groupby('Asset_ID').apply(reindex).reset_index(0, drop=True).sort_index()
train_cut.shape

## Removing fake records
Here we find fake records. Fake records are rows generated by the re-indexing step for timesteps where a crypto had no record in the original data. We mark them and set all their feature values to 0.

In [None]:
# Matching records and marking generated rows as 'fake'

train_cut['group_num'] = train_cut.index.map(times)
train_cut = train_cut.dropna(subset=['group_num'])
train_cut['group_num'] = train_cut['group_num'].astype('int')

train_cut['id'] = train_cut['group_num'].astype(str) + '_' + train_cut['Asset_ID'].astype(str)

train_cut['is_real'] = train_cut.id.isin(ids)*1
train_cut = train_cut.drop('id', axis=1)

In [None]:
# Features values for 'non-real' rows are set to zeros

features = train_cut.columns.drop(['Asset_ID','group_num','is_real'])
train_cut.loc[train_cut.is_real == 0, features] = 0.
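
The idea of flagging generated rows can be sketched on toy data. Note that this sketch uses a different technique from the cells above: instead of matching id strings, it reindexes onto a complete (timestamp, asset) grid and treats rows that come back empty as generated. It assumes the original data has no genuine NaN in `Close`; all names and values are made up.

```python
import pandas as pd

# Toy data: asset 1 has no row at timestamp 60.
raw = pd.DataFrame({
    "timestamp": [0, 0, 60, 120, 120],
    "Asset_ID":  [0, 1, 0, 0, 1],
    "Close":     [1.0, 5.0, 1.1, 1.2, 5.2],
})

# Every (timestamp, asset) pair that should exist on a complete grid.
full_grid = pd.MultiIndex.from_product(
    [[0, 60, 120], [0, 1]], names=["timestamp", "Asset_ID"]
)

filled = raw.set_index(["timestamp", "Asset_ID"]).reindex(full_grid).reset_index()
filled["is_real"] = filled["Close"].notna().astype(int)  # 0 marks a generated row
filled["Close"] = filled["Close"].fillna(0.0)            # generated rows get zeros

print(filled)
```

The end state is the same as in the notebook: every timestep has one row per asset, and generated rows carry zeros that a masking layer can later ignore.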

Here we order the dataframe using an asset_order column that we create from the assets_order dict. This ensures that every timestep starts with the same asset (in our case, Asset_ID 3, which is Cardano).

In [None]:
# Sorting assets according to their order in the 'supplemental_train.csv'

train_cut['asset_order'] = train_cut.Asset_ID.map(assets_order) 
train_cut=train_cut.sort_values(by=['group_num', 'asset_order'])
train_cut.head(20)

In [None]:
train_cut['asset_order'] = train_cut['asset_order'].astype('float64')

In [None]:
train_targets = train_cut['Target'].to_numpy().reshape(-1, 14)

features = train_cut.columns.drop(['Asset_ID', 'Target', 'group_num','is_real'])
train_cut = train_cut[features]

train_cut=np.array(train_cut)
train_cut = train_cut.reshape(-1,14,train_cut.shape[-1])
train_cut.shape
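
The reshape above relies on the rows being sorted by timestep first and asset second. A tiny sketch with made-up sizes (2 timesteps, 3 assets, 4 features) shows how the flat table turns into a 3-D array:

```python
import numpy as np

n_timesteps, n_assets, n_features = 2, 3, 4

# Flat table: one row per (timestep, asset), sorted by timestep then asset.
flat = np.arange(n_timesteps * n_assets * n_features).reshape(-1, n_features)

# Group every block of n_assets consecutive rows into one timestep.
cube = flat.reshape(-1, n_assets, n_features)

print(flat.shape, "->", cube.shape)
```

After the reshape, `cube[t, a]` is the feature row of asset slot `a` at timestep `t`. If the rows were not sorted consistently, the assets would end up mixed across slots.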

In [None]:
train_cut = train_cut.astype('float64')
train_cut.shape

## Splitting the training data into sequences
In this section, we split the training data into sequences that we can feed into the LSTM network. Each sequence has many variables/features, making this a multivariate problem. The data is sampled once per minute, so with the window length of 15 used below, each sample covers 15 consecutive minutes and is paired with the target at the last timestep of that window.

In [None]:
# TimeseriesGenerator-like class, except it uses the target from the last timestep instead of last+1
class sample_generator(keras.utils.Sequence):
    def __init__(self, x_set, y_set, batch_size, length):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.length = length
        self.size = len(x_set)

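    def __len__(self):
        # count only windows that fit inside the data, so no batch ends up empty
        n_windows = max(self.size - self.length + 1, 0)
        return int(np.ceil(n_windows / float(self.batch_size)))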

    def __getitem__(self, idx):
        batch_x=[]
        batch_y=[]
        for i in range(self.batch_size):
            start_ind = self.batch_size*idx + i
            end_ind = start_ind + self.length 
            if end_ind <= self.size:
                batch_x.append(self.x[start_ind : end_ind])
                batch_y.append(self.y[end_ind -1])

        return np.asarray(batch_x).astype("float32"), np.asarray(batch_y).astype("float32")

## Split into sequence and target
After sequencing and normalizing the data, we slice the data to create the input sequences and output targets.

In [None]:
# the last 10% of the data is used as the validation set
X_train, X_test = train_cut[:-len(train_cut)//10], train_cut[-len(train_cut)//10:]
y_train, y_test = train_targets[:-len(train_cut)//10], train_targets[-len(train_cut)//10:]
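
One detail of the slicing above is operator precedence: in `-len(x)//10` the unary minus is applied before the floor division. A small sketch with a made-up array of 25 elements:

```python
import numpy as np

data = np.arange(25)

cut = -len(data) // 10  # evaluated as (-25) // 10, which floors to -3
train_part, val_part = data[:cut], data[cut:]

print(cut, len(train_part), len(val_part))
```

Because floor division rounds towards negative infinity, the validation part gets 3 elements here rather than 2. The split is chronological, so the validation set always comes after the training data.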

In [None]:
BATCH_SIZE=2**10
train_generator = sample_generator(X_train, y_train, length=15, batch_size=BATCH_SIZE)
val_generator = sample_generator(X_test, y_test, length=15, batch_size=BATCH_SIZE)

print(f'Sample shape: {train_generator[0][0].shape}')
print(f'Target shape: {train_generator[0][1].shape}')

## Creating the LSTM Network
We are going to create a multivariate parallel LSTM network with dense layers at the end. Dropout is used as a regularisation method to combat overfitting. Notice the Lambda layer, which lets us add our own functional layer; here it slices the input array down to the single crypto we want. We loop over the assets and build the same layer structure for each one. The hidden-layer outputs of all branches are then concatenated, and the prediction is obtained through the dense layers.

* Build the model. Simplified structure:
    - Lambda layer needed for asset separation
    - Masking layer. Generated records (filled gaps) have zeros as feature values, so they are not used in the computations
    - LSTM or GRU layer
    - Dropout as a regularisation method to combat overfitting
    - Concatenate layer
    - Dense layer for prediction (linear activation, which is the default)

In [None]:
#https://github.com/tensorflow/tensorflow/issues/37495
def MaxCorrelation(y_true,y_pred):
    """Goal is to maximize correlation between y_pred, y_true. Same as minimizing the negative."""
    mask = tf.math.not_equal(y_true, 0.)
    y_true_masked = tf.boolean_mask(y_true, mask)
    y_pred_masked = tf.boolean_mask(y_pred, mask)
    return -tf.math.abs(tfp.stats.correlation(y_true_masked,y_pred_masked, sample_axis=None, event_axis=None))

def Correlation(y_true,y_pred):
    return tf.math.abs(tfp.stats.correlation(y_pred,y_true, sample_axis=None, event_axis=None))

def masked_mse(y_true, y_pred):
    mask = tf.math.not_equal(y_true, 0.)
    y_true_masked = tf.boolean_mask(y_true, mask)
    y_pred_masked = tf.boolean_mask(y_pred, mask)
    return tf.keras.losses.mean_squared_error(y_true = y_true_masked, y_pred = y_pred_masked)

def masked_mae(y_true, y_pred):
    mask = tf.math.not_equal(y_true, 0.)
    y_true_masked = tf.boolean_mask(y_true, mask)
    y_pred_masked = tf.boolean_mask(y_pred, mask)
    return tf.keras.losses.mean_absolute_error(y_true = y_true_masked, y_pred = y_pred_masked)

def masked_cosine(y_true, y_pred):
    mask = tf.math.not_equal(y_true, 0.)
    y_true_masked = tf.boolean_mask(y_true, mask)
    y_pred_masked = tf.boolean_mask(y_pred, mask)
    return tf.keras.losses.cosine_similarity(y_true_masked, y_pred_masked)

def get_model(n_assets=14):  
    x_input = keras.Input(shape=(train_generator[0][0].shape[1], n_assets, train_generator[0][0].shape[-1]))

    branch_outputs = []
        
    for i in range(n_assets):
        # Slicing the ith asset (i=i binds the current index to this branch):
        a = layers.Lambda(lambda x, i=i: x[:, :, i])(x_input)
        a = layers.Masking(mask_value=0.,)(a)
        a = layers.LSTM(units=32, return_sequences=True)(a)
        a = layers.Dropout(0.2)(a)
        a = layers.LSTM(units=16)(a)
        a = layers.Dropout(0.2)(a)
        branch_outputs.append(a)
    
    x = layers.Concatenate()(branch_outputs)
    x = layers.Dense(units=128)(x)
    out = layers.Dense(units=n_assets)(x)
    
    model = keras.Model(inputs=x_input, outputs=out)
    model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3), 
                  loss = masked_cosine,
                  metrics=[Correlation]
                 )
    
    return model 
    
model=get_model()
model.summary()
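
The masked losses defined above all follow the same pattern: drop every position where the target is exactly 0 (our generated rows), then compute the metric on what is left. Below is a NumPy analogue of `masked_mse`, written as a sketch on made-up numbers rather than with TensorFlow.

```python
import numpy as np

y_true = np.array([0.5, 0.0, -1.0, 0.0])   # zeros mark generated rows
y_pred = np.array([0.4, 9.0, -0.8, -7.0])  # predictions on generated rows are arbitrary

mask = y_true != 0
masked_mse = np.mean((y_true[mask] - y_pred[mask]) ** 2)
plain_mse = np.mean((y_true - y_pred) ** 2)

print(masked_mse, plain_mse)
```

Without the mask, the meaningless predictions on generated rows dominate the error. One caveat of this pattern is that a real target that happens to be exactly 0 is masked out as well.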

Let us now use the `plot_model` utility to validate our network visually.

In [None]:
#example with 3 assets for visibility
tf.keras.utils.plot_model(get_model(n_assets=3), show_shapes=True)

## Training/Fitting time
We can finally train our model with our training data. Let's see how it does.

In [None]:
tf.random.set_seed(10)

estop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, verbose=0, mode='min',restore_best_weights=True)
scheduler = keras.optimizers.schedules.ExponentialDecay(1e-3, (0.5*len(X_train)/BATCH_SIZE), 1e-3)
lr = keras.callbacks.LearningRateScheduler(scheduler, verbose = 1)
    
history = model.fit(train_generator, validation_data = (val_generator), epochs = 10, callbacks = [lr])
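
For intuition about the schedule, exponential decay (without the staircase option) follows `lr = initial_lr * decay_rate ** (step / decay_steps)`. The sketch below evaluates that formula in plain Python. The value `decay_steps=10.0` is an assumption picked for readability, not the value computed in the cell above, and `exponential_decay` is a hypothetical helper.

```python
def exponential_decay(step, initial_lr=1e-3, decay_steps=10.0, decay_rate=1e-3):
    """Plain-Python version of the exponential decay formula."""
    return initial_lr * decay_rate ** (step / decay_steps)

for step in [0, 5, 10]:
    print(step, exponential_decay(step))
```

With a decay rate of 1e-3, the learning rate has shrunk by a factor of 1000 once `step` reaches `decay_steps`. My understanding is that, because the schedule is wrapped in a LearningRateScheduler callback here, it is evaluated once per epoch with the epoch index as `step`; treat that as an assumption and check the printed learning rates during training.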

## Plotting model loss and correlation
This step is very important since it lets you see whether your model is performing well as you train it. If it isn't, you will have to create new features, tune hyperparameters, modify the RNN network, or cry.

In [None]:
fig, ax = plt.subplots(1,2, figsize=(16,8))

histories = pd.DataFrame(history.history)

epochs = list(range(1,len(histories)+1))
loss = histories['loss']
val_loss = histories['val_loss']
# named train_corr/val_corr so the Correlation metric function is not overwritten
train_corr = histories['Correlation']
val_corr = histories['val_Correlation']

ax[0].plot(epochs, loss, label = 'Train Loss')
ax[0].plot(epochs, val_loss, label = 'Val Loss')
ax[0].set_title('Losses')
ax[0].set_xlabel('Epoch')
ax[0].legend(loc='upper right')

ax[1].plot(epochs, train_corr, label = 'Train Correlation')
ax[1].plot(epochs, val_corr, label = 'Val Correlation')
ax[1].set_title('Correlations')
ax[1].set_xlabel('Epoch')
ax[1].legend(loc='upper right')

fig.show()

In [None]:
sns.set_style('darkgrid') # darkgrid, white grid, dark, white and ticks
colors = sns.color_palette('pastel') # Color palette to use
plt.rc('axes', titlesize=18)     # fontsize of the axes title
plt.rc('axes', labelsize=14)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=13)    # fontsize of the tick labels
plt.rc('ytick', labelsize=13)    # fontsize of the tick labels
plt.rc('legend', fontsize=13)    # legend fontsize
plt.rc('font', size=13)          # controls default text sizes

# # Set Matplotlib defaults
# plt.style.use("seaborn-whitegrid")
# plt.rc("figure", autolayout=True, figsize=(11, 5))
# plt.rc(
#     "axes",
#     labelweight="bold",
#     labelsize="large",
#     titleweight="bold",
#     titlesize=16,
#     titlepad=10,
# )
# plot_params = dict(
#     color="0.75",
#     style=".-",
#     markeredgecolor="0.25",
#     markerfacecolor="0.25",
#     legend=False,
# )
# %config InlineBackend.figure_format = 'retina'


def plot_loss(history):
    fig, ax = plt.subplots(figsize=(10,6), tight_layout=True)
    ax.plot(history.history['loss'], 'o-', color="#004C99", linewidth=2)
    ax.plot(history.history['val_loss'], 'o-', color="#D96552",linewidth=2)
    ax.set_facecolor(colors[-1])
    plt.grid(True, axis = 'y')  # positional flag instead of the renamed b= keyword
    ax.grid(True, axis = 'y')
    plt.ylabel('Loss')
    plt.xlabel('epoch')
    plt.legend(['Train loss', 'Validation loss'], loc='upper right',prop={'size': 15})
    plt.show()
    
def plot_future(prediction, y_test):
    fig, ax = plt.subplots(figsize=(10,6), tight_layout=True)
    range_future = len(prediction)
    ax.plot(np.arange(range_future), np.array(y_test),label='Actual',color="#004C99")
    ax.plot(np.arange(range_future),np.array(prediction),label='Prediction',color="#D96552")
    ax.set_facecolor(colors[-1])
    plt.grid(True, axis = 'y')  # positional flag instead of the renamed b= keyword
    ax.grid(True, axis = 'y')
    plt.ylabel('USD')
    plt.legend(loc='upper left',prop={'size': 15})
    plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)   
    plt.show()

In [None]:
plot_loss(history)