# G-Research Crypto Forecast Introduction

Over $40 billion worth of cryptocurrencies are traded every day. They are among the most popular assets for speculation and investment, yet have proven wildly volatile. Fast-fluctuating prices have made millionaires of a lucky few, and delivered crushing losses to others. Could some of these price movements have been predicted in advance?

In this competition, you'll use your machine learning expertise to forecast short term returns in 14 popular cryptocurrencies. We have amassed a dataset of millions of rows of high-frequency market data dating back to 2018 which you can use to build your model. Once the submission deadline has passed, your final score will be calculated over the following 3 months using live crypto data as it is collected.

The simultaneous activity of thousands of traders ensures that most signals will be transitory, persistent alpha will be exceptionally difficult to find, and the danger of overfitting will be considerable. In addition, since 2018, interest in the cryptomarket has exploded, so the volatility and correlation structure in our data are likely to be highly non-stationary. The successful contestant will pay careful attention to these considerations, and in the process gain valuable insight into the art and science of financial forecasting.

G-Research is Europe's leading quantitative finance research firm. We have long explored the extent of market prediction possibilities, making use of machine learning, big data, and some of the most advanced technology available. Specializing in data science and AI education for workforces, Cambridge Spark is partnering with G-Research for this competition.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Getting the data folder

In [None]:
data_folder = "../input/g-research-crypto-forecasting/"
!ls $data_folder

In [None]:
# read train.csv
crypto_df = pd.read_csv(data_folder + 'train.csv')

In [None]:
crypto_df.head(10)

## Data features
We can see the different features included in the dataset. Specifically, the features included per asset are the following:
*   **timestamp**: All timestamps are Unix timestamps in seconds (the number of seconds elapsed since 1970-01-01 00:00:00.000 UTC). Timestamps in this dataset are multiples of 60, indicating minute-by-minute data.
*   **Asset_ID**: The asset ID corresponding to one of the cryptocurrencies (e.g. `Asset_ID = 1` for Bitcoin). The mapping from `Asset_ID` to crypto asset is contained in `asset_details.csv`.
*   **Count**: Total number of trades in the time interval (last minute).
*   **Open**: Opening price of the time interval (in USD).
*   **High**: Highest price reached during the time interval (in USD).
*   **Low**: Lowest price reached during the time interval (in USD).
*   **Close**: Closing price of the time interval (in USD).
*   **Volume**: Quantity of asset bought or sold, displayed in base currency USD.
*   **VWAP**: The average price of the asset over the time interval, weighted by volume. VWAP is an aggregated form of trade data.
*   **Target**: Residual log-returns for the asset over a 15 minute horizon.

The first two columns define the time and asset indexes for this data row. The seven middle columns (`Count` through `VWAP`) are feature columns with the trading data for this asset and minute in time. The last column is the prediction target, which we will get to later in more detail.
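To make two of the column definitions concrete, here is a minimal sketch on made-up numbers. The three trades and all variable names are illustrative assumptions, not competition data. It computes a volume-weighted average price for one minute and converts a second-resolution Unix timestamp to a datetime.

```python
import numpy as np
import pandas as pd

# Hypothetical trades within a single minute (prices in USD, traded quantities)
trade_prices = np.array([100.0, 101.0, 99.0])
trade_sizes = np.array([2.0, 1.0, 1.0])

# VWAP: sum(price * quantity) / sum(quantity)
vwap = (trade_prices * trade_sizes).sum() / trade_sizes.sum()

# A second-resolution Unix timestamp that is a multiple of 60
ts = 1514764860
as_datetime = pd.to_datetime(ts, unit="s")

print(vwap)         # 100.0
print(as_datetime)  # 2018-01-01 00:01:00
```

The same `pd.to_datetime(..., unit="s")` call should also work on a whole `timestamp` column if human-readable dates are needed for plotting.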

We can also view the asset information, including the list of all assets, the `Asset_ID` to asset mapping, and the weight of each asset, which sets its relative importance in the evaluation metric.

In [None]:
asset_details = pd.read_csv(data_folder + 'asset_details.csv')
asset_details


Creating separate DataFrames for each cryptocurrency:

In [None]:
binance = crypto_df[crypto_df["Asset_ID"]==0].set_index("timestamp") 
btc_cash = crypto_df[crypto_df["Asset_ID"]==2].set_index("timestamp") 
cardano = crypto_df[crypto_df["Asset_ID"]==3].set_index("timestamp") 
dodge = crypto_df[crypto_df["Asset_ID"]==4].set_index("timestamp")       
eos = crypto_df[crypto_df["Asset_ID"]==5].set_index("timestamp")   
eth = crypto_df[crypto_df["Asset_ID"]==6].set_index("timestamp")  
eth_classic = crypto_df[crypto_df["Asset_ID"]==7].set_index("timestamp")  
iota = crypto_df[crypto_df["Asset_ID"]==8].set_index("timestamp")  
lite = crypto_df[crypto_df["Asset_ID"]==9].set_index("timestamp")
maker = crypto_df[crypto_df["Asset_ID"]==10].set_index("timestamp")  
monero = crypto_df[crypto_df["Asset_ID"]==11].set_index("timestamp")
stellar = crypto_df[crypto_df["Asset_ID"]==12].set_index("timestamp")
tron = crypto_df[crypto_df["Asset_ID"]==13].set_index("timestamp")          

## Candlestick charts

The trading data is an aggregated form of market data that includes the Open, High, Low and Close prices. We can visualize this data with the commonly used candlestick chart, which allows traders to perform technical analysis on intraday values. The length of each bar's body represents the price range between the open and the close of that interval. A red bar means the close was lower than the open (a bearish candlestick); a green bar means the opposite (a bullish candlestick). The wicks above and below the bars show the high and low prices reached during the interval.
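The colour rule described above can be sketched on a few made-up bars. The values and the column names `body` and `colour` are assumptions for illustration; the rule follows the convention stated in this notebook (red when the close is below the open, green otherwise).

```python
import pandas as pd

# Hypothetical one-minute bars (values are illustrative only)
bars = pd.DataFrame({
    "Open":  [10.0, 10.5, 10.2],
    "High":  [10.8, 10.6, 10.4],
    "Low":   [9.9, 10.0, 10.1],
    "Close": [10.5, 10.2, 10.2],
})

# Body length: absolute distance between open and close
bars["body"] = (bars["Close"] - bars["Open"]).abs()

# Red (bearish) when the close is below the open, green (bullish) otherwise
bars["colour"] = [
    "red" if close < open_ else "green"
    for open_, close in zip(bars["Open"], bars["Close"])
]

print(bars)
```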

We can visualize a slice of the Bitcoin prices using the `plotly` library. The bottom part of the plot shows a rangeslider, which you can use to zoom in on the plot.

### Visualizing BTC

In [None]:
import plotly.graph_objects as go

btc = crypto_df[crypto_df["Asset_ID"]==1].set_index("timestamp") # Asset_ID = 1 for Bitcoin
btc

In [None]:
btc_mini = btc.iloc[-200:] # Select recent data rows
fig = go.Figure(data=[go.Candlestick(x=btc_mini.index, open=btc_mini['Open'], high=btc_mini['High'], low=btc_mini['Low'], close=btc_mini['Close'])])
fig.show()

### Visualizing Ethereum

In [None]:
eth = crypto_df[crypto_df["Asset_ID"]==6].set_index("timestamp") # Asset_ID = 6 for Ethereum
eth

In [None]:
eth_mini = eth.iloc[-200:] # Select recent data rows
fig = go.Figure(data=[go.Candlestick(x=eth_mini.index, open=eth_mini['Open'], high=eth_mini['High'], low=eth_mini['Low'], close=eth_mini['Close'])])
fig.show()

### Visualizing Tron

In [None]:
tron = crypto_df[crypto_df["Asset_ID"]==13].set_index("timestamp") # Asset_ID = 13 for TRON
tron

In [None]:
tron_mini = tron.iloc[-200:] # Select recent data rows
fig = go.Figure(data=[go.Candlestick(x=tron_mini.index, open=tron_mini['Open'], high=tron_mini['High'], low=tron_mini['Low'], close=tron_mini['Close'])])
fig.show()

### Visualizing Dogecoin

In [None]:
dodge = crypto_df[crypto_df["Asset_ID"]==4].set_index("timestamp")
dodge

In [None]:
dodge_mini = dodge.iloc[-200:] # Select recent data rows
fig = go.Figure(data=[go.Candlestick(x=dodge_mini.index, open=dodge_mini['Open'], high=dodge_mini['High'], low=dodge_mini['Low'], close=dodge_mini['Close'])])
fig.show()

### Data Preprocessing

#### Dealing with missing values 

First, we pick one coin and take a closer look at its data.

In [None]:
# getting the Ethereum data
eth = crypto_df[crypto_df["Asset_ID"]==6].set_index("timestamp") # Asset_ID = 6 for Ethereum
eth.info(show_counts=True)

As the table above shows, every column except `Target` has the same number of non-null records. We can confirm this with:

In [None]:
eth.isna().sum()

The `Target` column has 340 null values.

Next, we check the time range covered by the data.

In [None]:
# getting the first five rows
btc.head()

In [None]:
# getting the last five rows
btc.tail()

In [None]:
# getting start and end date of btc as datetime64 format
start_btc = btc.index[0].astype('datetime64[s]')
end_btc = btc.index[-1].astype('datetime64[s]')

# getting start and end date of eth as datetime64 format
start_eth = eth.index[0].astype('datetime64[s]')
end_eth = eth.index[-1].astype('datetime64[s]')

print('BTC data is from ',start_btc , 'to ', end_btc)
print('Eth data is from ',start_eth , 'to ', end_eth)


When developing a time series model, we need to check whether any timestamps (rows) are missing from the dataset. If rows are missing, we have to fill them using an appropriate method.

In [None]:
(eth.index[1:]-eth.index[:-1]).value_counts().head()

Notice that there are many gaps in the data. To work with most time series models, we should preprocess the data into a format without time gaps. To fill the gaps, we can use the `.reindex()` method with `method='pad'`, which forward-fills each gap with the previous valid value.
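A toy example may help show what the forward fill does. The three-row frame below is made up for illustration, with the rows at 120 s and 180 s deliberately missing.

```python
import pandas as pd

# Hypothetical minute bars with the 120 s and 180 s rows missing
toy = pd.DataFrame({"Close": [1.0, 1.1, 1.4]}, index=[0, 60, 240])

# Differences between consecutive timestamps reveal the gap (60 s and 180 s)
gaps = pd.Series(toy.index[1:] - toy.index[:-1]).value_counts()

# Rebuild a complete 60-second grid and carry the last known row forward
filled = toy.reindex(range(toy.index[0], toy.index[-1] + 60, 60), method="pad")

print(gaps)
print(filled)
```

After the reindex, the rows at 120 s and 180 s repeat the `Close` value from 60 s. Forward filling never uses future values, but it does create flat stretches that have zero returns.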

In [None]:
eth = eth.reindex(range(eth.index[0],eth.index[-1]+60,60),method='pad')


In [None]:
(eth.index[1:]-eth.index[:-1]).value_counts().head()

### Data visualisation
##### We will start by visualising the Close prices for the assets we have selected

In [None]:
# fill missing values for BTC
btc = btc.reindex(range(btc.index[0],btc.index[-1]+60,60),method='pad')

# fill missing values for Tron
tron = tron.reindex(range(tron.index[0],tron.index[-1]+60,60),method='pad')

# fill missing values for Dogecoin
dodge = dodge.reindex(range(dodge.index[0],dodge.index[-1]+60,60),method='pad')



In [None]:
import matplotlib.pyplot as plt

# 2x2 grid of Close prices, one panel per asset
f = plt.figure(figsize=(20,10))

ax1 = f.add_subplot(221)
ax1.plot(btc['Close'], label='BTC')
ax1.legend()
ax1.set_xlabel('Time')
ax1.set_ylabel('Bitcoin')

ax2 = f.add_subplot(222)
ax2.plot(eth['Close'], color='red', label='ETH')
ax2.legend()
ax2.set_xlabel('Time')
ax2.set_ylabel('Ethereum')

ax3 = f.add_subplot(223)
ax3.plot(dodge['Close'], color='green', label='Dogecoin')
ax3.legend()
ax3.set_xlabel('Time')
ax3.set_ylabel('Dogecoin')

ax4 = f.add_subplot(224)
ax4.plot(tron['Close'], color='yellow', label='Tron')
ax4.legend()
ax4.set_xlabel('Time')
ax4.set_ylabel('Tron')

plt.show()

The assets have quite different histories, but we can check whether they correlate in recent times.

In [None]:
import time
from datetime import datetime

# auxiliary function, from datetime to timestamp
totimestamp = lambda s: np.int32(time.mktime(datetime.strptime(s, "%d/%m/%Y").timetuple()))
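
`time.mktime` interprets the parsed date in the machine's local time zone, so the helper above returns UTC timestamps only when the session itself runs in UTC. A time-zone-independent variant could look like the sketch below. The name `to_utc_timestamp` is a hypothetical helper, not part of the original notebook.

```python
import calendar
from datetime import datetime

def to_utc_timestamp(s):
    # Hypothetical helper: parse "dd/mm/YYYY" and treat it as midnight UTC
    return calendar.timegm(datetime.strptime(s, "%d/%m/%Y").timetuple())

print(to_utc_timestamp("01/06/2021"))  # 1622505600
```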


In [None]:
#create intervals
btc_mini_2021 = btc.loc[totimestamp('01/06/2021'):totimestamp('01/07/2021')]
eth_mini_2021 = eth.loc[totimestamp('01/06/2021'):totimestamp('01/07/2021')]

In [None]:
# plot time series for both chosen assets
f = plt.figure(figsize=(7,8))

ax = f.add_subplot(211)
plt.plot(btc_mini_2021['Close'], label='btc')
plt.legend()
plt.xlabel('Time')
plt.ylabel('Bitcoin Close')

ax2 = f.add_subplot(212)
ax2.plot(eth_mini_2021['Close'], color='red', label='eth')
plt.legend()
plt.xlabel('Time')
plt.ylabel('Ethereum Close')

plt.tight_layout()
plt.show()

On shorter intervals we can visually see some potential correlation between both assets, with some simultaneous ups and downs.

Analysing the price changes

In [None]:
import pandas as pd
import numpy as np

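# small demo on random data: reindex_like with method='pad' forward-fills
# df2 (2 rows) onto the longer index of df1 (6 rows)
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])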

print(df1)
print(df2)
print('--------------------------------')
print(df2.reindex_like(df1,method='pad'))

In [None]:
# define function to compute log returns
def log_return(series, periods=1):
    return np.log(series).diff(periods=periods)
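
To see what `log_return` computes, here is a small sketch on made-up prices. The function is repeated so the snippet runs on its own; the price values are assumptions chosen to keep the arithmetic simple. It also shows that log returns add up across consecutive periods, which is one reason they are convenient to work with.

```python
import numpy as np
import pandas as pd

def log_return(series, periods=1):
    # Same definition as in the cell above: log(P_t) - log(P_{t-periods})
    return np.log(series).diff(periods=periods)

# Hypothetical prices: two consecutive 10% increases
prices = pd.Series([100.0, 110.0, 121.0])

one_step = log_return(prices)             # NaN, log(1.1), log(1.1)
two_step = log_return(prices, periods=2)  # NaN, NaN, log(1.21)

# The two one-step log returns sum to the two-step log return
print(np.isclose(one_step.iloc[1:].sum(), two_step.iloc[2]))  # True
```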

In [None]:
import scipy.stats as stats

lret_btc = log_return(btc_mini_2021.Close)[1:]
lret_eth = log_return(eth_mini_2021.Close)[1:]
lret_btc.rename('lret_btc', inplace=True)
lret_eth.rename('lret_eth', inplace=True)

plt.figure(figsize=(8,4))
plt.plot(lret_btc);
plt.plot(lret_eth);
plt.show()

#### Correlation between assets
We hypothesized before that crypto asset returns may exhibit some correlation. Let's check this in more detail now.
We can check how the correlation between Bitcoin and Ethereum changes over time, using the full history of both assets.

In [None]:
# compute log returns over the full history and join both assets in a single DataFrame

lret_btc_long = log_return(btc.Close)[1:]
lret_eth_long = log_return(eth.Close)[1:]
lret_btc_long.rename('lret_btc', inplace=True)
lret_eth_long.rename('lret_eth', inplace=True)
two_assets = pd.concat([lret_btc_long, lret_eth_long], axis=1)

# group consecutive rows and use .corr() for correlation between columns
corr_time = two_assets.groupby(two_assets.index//(10000*60)).corr().loc[:,"lret_btc"].loc[:,"lret_eth"]

corr_time.plot();
plt.xticks([])
plt.ylabel("Correlation")
plt.title("Correlation between BTC and ETH over time");

Note the high but variable correlation between the assets. The dynamics clearly change over time, which is critical for this time series challenge: we have to produce forecasts in a highly non-stationary environment.
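A sliding-window correlation is another way to look at the same effect. The sketch below uses synthetic returns rather than market data, and the window length of 200 is an arbitrary assumption. The second series tracks the first during the first half of the sample and is independent of it afterwards.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic returns (not market data): b tracks a in the first half only
a = pd.Series(rng.normal(size=n))
noise = pd.Series(rng.normal(size=n))
first_half = np.arange(n) < n // 2
b = pd.Series(np.where(first_half, a + 0.1 * noise, noise))

# Correlation over a sliding window of 200 observations
rolling_corr = a.rolling(window=200).corr(b)

print(rolling_corr.iloc[n // 2 - 1])  # window entirely in the correlated half
print(rolling_corr.iloc[-1])          # window entirely in the independent half
```

Unlike the grouped correlation, which yields one value per block of rows, the rolling version yields one value per row, at the cost of overlapping windows.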

A stationary behaviour of a system or a process is characterized by non-changing statistical properties over time such as the mean, variance and autocorrelation. On the other hand, a non-stationary behaviour is characterized by a continuous change of statistical properties over time. Stationarity is important because many useful analytical tools and statistical tests and models rely on it.
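
As a rough illustration of the definition above, the sketch below compares the mean of the first and second half of two synthetic series: white-noise "returns" and the random-walk "price" they accumulate into. Both series and the helper `half_means` are assumptions for illustration. This is an informal check, not a statistical test.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic example: white-noise "returns" and the random-walk "price" built from them
returns = rng.normal(loc=0.0, scale=1.0, size=10_000)
price = returns.cumsum()

def half_means(x):
    # Mean of the first and second half of the sample
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

print(half_means(returns))  # both close to 0
print(half_means(price))    # typically far apart
```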

We can also check the correlation between all assets visualizing the correlation matrix. Note how some assets have much higher pairwise correlation than others.

In [None]:
all_assets_2021 = pd.DataFrame([])
for asset_id, asset_name in zip(asset_details.Asset_ID, asset_details.Asset_Name):
  asset = crypto_df[crypto_df["Asset_ID"]==asset_id].set_index("timestamp")
  asset = asset.loc[totimestamp('01/01/2021'):totimestamp('01/05/2021')]
  asset = asset.reindex(range(asset.index[0],asset.index[-1]+60,60),method='pad')
  lret = log_return(asset.Close.fillna(0))[1:]
  all_assets_2021 = all_assets_2021.join(lret, rsuffix=asset_name, how="outer")

In [None]:
plt.imshow(all_assets_2021.corr());
plt.yticks(asset_details.Asset_ID.values, asset_details.Asset_Name.values);
plt.xticks(asset_details.Asset_ID.values, asset_details.Asset_Name.values, rotation='vertical');
plt.colorbar();

In [None]:
# Select some input features from the trading data: 
# 5 min log return, abs(1 min log return), upper shadow, and lower shadow.
upper_shadow = lambda asset: asset.High - np.maximum(asset.Close,asset.Open)
lower_shadow = lambda asset: np.minimum(asset.Close,asset.Open)- asset.Low

X_btc = pd.concat([log_return(btc.VWAP,periods=5), log_return(btc.VWAP,periods=1).abs(), 
               upper_shadow(btc), lower_shadow(btc)], axis=1)
y_btc = btc.Target

X_eth = pd.concat([log_return(eth.VWAP,periods=5), log_return(eth.VWAP,periods=1).abs(), 
               upper_shadow(eth), lower_shadow(eth)], axis=1)
y_eth = eth.Target
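
The two shadow features are easiest to check on a couple of made-up bars. The values below are assumptions for illustration: the first bar closes above its open, the second below.

```python
import numpy as np
import pandas as pd

# Hypothetical bars: the first closes up, the second closes down
bars = pd.DataFrame({
    "Open":  [10.0, 12.0],
    "High":  [11.5, 12.5],
    "Low":   [9.5, 10.0],
    "Close": [11.0, 10.5],
})

# Upper shadow: distance from the top of the body to the high
upper = bars.High - np.maximum(bars.Close, bars.Open)
# Lower shadow: distance from the low to the bottom of the body
lower = np.minimum(bars.Close, bars.Open) - bars.Low

print(upper.tolist())  # [0.5, 0.5]
print(lower.tolist())  # [0.5, 0.5]
```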

In [None]:
# select training and test periods
train_window = [totimestamp("01/05/2021"), totimestamp("30/05/2021")]
test_window = [totimestamp("01/06/2021"), totimestamp("30/06/2021")]

# divide data into train and test, compute X and y
# we aim to build simple regression models using a window_size of 1
X_btc_train = X_btc.loc[train_window[0]:train_window[1]].fillna(0).to_numpy()  # filling NaN's with zeros
y_btc_train = y_btc.loc[train_window[0]:train_window[1]].fillna(0).to_numpy()  

X_btc_test = X_btc.loc[test_window[0]:test_window[1]].fillna(0).to_numpy() 
y_btc_test = y_btc.loc[test_window[0]:test_window[1]].fillna(0).to_numpy() 

X_eth_train = X_eth.loc[train_window[0]:train_window[1]].fillna(0).to_numpy()  
y_eth_train = y_eth.loc[train_window[0]:train_window[1]].fillna(0).to_numpy()  

X_eth_test = X_eth.loc[test_window[0]:test_window[1]].fillna(0).to_numpy() 
y_eth_test = y_eth.loc[test_window[0]:test_window[1]].fillna(0).to_numpy() 

In [None]:
from sklearn.preprocessing import StandardScaler
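# simple preprocessing of the data: standardise each feature,
# fitting on the training window only and reusing those statistics on the test window
scaler = StandardScaler()

X_btc_train_scaled = scaler.fit_transform(X_btc_train)
X_btc_test_scaled = scaler.transform(X_btc_test)

# note: the same scaler object is refit here, so from now on it holds the ETH statistics
X_eth_train_scaled = scaler.fit_transform(X_eth_train)
X_eth_test_scaled = scaler.transform(X_eth_test)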
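
For readers who want to see what the scaling step does numerically, here is a NumPy-only sketch on made-up matrices. It mirrors the usual standardisation formula (subtract the training mean, divide by the training population standard deviation) and is an illustration, not a replacement for `StandardScaler`.

```python
import numpy as np

# Hypothetical feature matrices (rows = minutes, columns = features)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

# Statistics come from the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0)

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately [0, 0]
print(X_test_scaled)
```

Using training statistics for the test window keeps information from the test period out of the preprocessing step.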