# G-Research Crypto Forecasting
![](https://storage.googleapis.com/kaggle-competitions/kaggle/30894/logos/header.png?t=2021-09-14-17-32-48)

# Data Description
This dataset contains information on historic trades for several cryptoassets, such as Bitcoin and Ethereum. Your challenge is to predict their future returns.

As historic cryptocurrency prices are not confidential, this will be a forecasting competition using the time series API. Furthermore, the public leaderboard targets are publicly available and are provided as part of the competition dataset. Expect to see many people submitting perfect submissions for fun. Accordingly, THE PUBLIC LEADERBOARD FOR THIS COMPETITION IS NOT MEANINGFUL and is only provided as a convenience for anyone who wants to test their code. The final private leaderboard will be determined using real market data gathered after the submission period closes.

## Files
train.csv - The training set

>- timestamp - A timestamp for the minute covered by the row.
>- Asset_ID - An ID code for the cryptoasset.
>- Count - The number of trades that took place this minute.
>- Open - The USD price at the beginning of the minute.
>- High - The highest USD price during the minute.
>- Low - The lowest USD price during the minute.
>- Close - The USD price at the end of the minute.
>- Volume - The number of cryptoasset units traded during the minute.
>- VWAP - The volume weighted average price for the minute.
>- Target - 15 minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
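
The raw return behind the target can be sketched as follows. This is a minimal illustration, not the official calculation. It assumes the 15-minute return is the log ratio of the close 16 minutes ahead to the close 1 minute ahead, as described in the competition tutorial. The residualization against the market is omitted, and `raw_15min_log_return` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def raw_15min_log_return(close: pd.Series) -> pd.Series:
    """Log return from minute t+1 to minute t+16 (assumed definition).

    Assumes `close` holds one asset's prices on a gap-free 1-minute grid.
    """
    return np.log(close.shift(-16) / close.shift(-1))

# Toy prices growing 1% per minute (made-up data, not taken from train.csv)
close = pd.Series(100.0 * 1.01 ** np.arange(20))
raw_return = raw_15min_log_return(close)
print(raw_return.head())
```

The last 16 rows have no value because the prices they would need lie beyond the end of the series. This may be one reason a Target column ends up with nulls.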

example_test.csv - An example of the data that will be delivered by the time series API.

example_sample_submission.csv - An example of the data that will be delivered by the time series API. The data is just copied from train.csv.

asset_details.csv - Provides the real name of the cryptoasset for each Asset_ID and the weight each cryptoasset receives in the metric.

gresearch_crypto - An unoptimized version of the time series API files for offline work. You may need Python 3.7 and a Linux environment to run it without errors.

supplemental_train.csv - After the submission period is over, this file's data will be replaced with cryptoasset prices from the submission period. The current copy, which is just filled with approximately the right amount of data from train.csv, is provided as a placeholder.

## Time-series API Details
Refer to the time series introduction notebook for an example of how to complete a submission. The time-series API has changed somewhat from previous competitions!

Expect to see roughly three months' worth of data in the test set.

The API will require 0.5 GB of memory after initialization. The initialization step (env.iter_test()) will require meaningfully more memory than that; we recommend you do not load your model until after making that call. The API will also consume less than 30 minutes of runtime for loading and serving the data.
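
The ordering recommended above (initialize the iterator first, load the model afterwards) might look like the sketch below. The real API is only available inside the competition environment, so `FakeEnv` is a hypothetical stand-in that copies just the `iter_test()` / `predict()` pattern. `load_model` and the toy batches are also made up for illustration.

```python
import pandas as pd

class FakeEnv:
    """Hypothetical offline stand-in for the iter_test()/predict() pattern."""

    def __init__(self, batches):
        self._batches = batches
        self.submitted = []

    def iter_test(self):
        for test_batch in self._batches:
            # Each step yields the new rows plus a prediction template to fill in
            template = pd.DataFrame({"row_id": test_batch["row_id"], "Target": 0.0})
            yield test_batch, template

    def predict(self, prediction_df):
        self.submitted.append(prediction_df.copy())

def load_model():
    """Placeholder for an expensive model load; it always predicts zero."""
    return lambda batch: [0.0] * len(batch)

batches = [
    pd.DataFrame({"row_id": [0, 1], "Asset_ID": [0, 1], "Close": [10.0, 20.0]}),
    pd.DataFrame({"row_id": [2, 3], "Asset_ID": [0, 1], "Close": [10.5, 19.5]}),
]

env = FakeEnv(batches)
iter_test = env.iter_test()  # create the iterator first...
model = load_model()         # ...and only then load the model

for test_batch, sample_prediction_df in iter_test:
    sample_prediction_df["Target"] = model(test_batch)
    env.predict(sample_prediction_df)

print(len(env.submitted), "batches submitted")
```

In the competition environment, the `env` object would come from the `gresearch_crypto` module rather than from a stand-in class.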

The API loads the data using the following types: Asset_ID: int8, Count: int32, row_id: int32, Open: float64, High: float64, Low: float64, Close: float64, Volume: float64, VWAP: float64
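
To keep offline data consistent with those types, the columns could be cast after loading. The sketch below uses two made-up rows shaped like an API batch. `API_DTYPES` is a hypothetical name, and the cast assumes the integer columns contain no missing values.

```python
import io
import pandas as pd

# Types listed above, keyed by column name
API_DTYPES = {
    "Asset_ID": "int8",
    "Count": "int32",
    "row_id": "int32",
    "Open": "float64",
    "High": "float64",
    "Low": "float64",
    "Close": "float64",
    "Volume": "float64",
    "VWAP": "float64",
}

# Two made-up rows (toy values, not real market data)
toy_csv = io.StringIO(
    "timestamp,Asset_ID,Count,Open,High,Low,Close,Volume,VWAP,row_id\n"
    "1623542400,1,229.0,35555.1,35600.0,35500.2,35580.3,31.55,35570.06,0\n"
    "1623542400,6,173.0,2380.3,2386.0,2372.5,2378.5,335.99,2379.84,1\n"
)

toy_df = pd.read_csv(toy_csv)
toy_df = toy_df.astype(API_DTYPES)  # columns not in the mapping keep their dtype
print(toy_df.dtypes)
```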

# Notebooks referenced:
>- <https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition>
>- <https://www.kaggle.com/sohier/detailed-api-introduction>

# Import packages

In [None]:
# Install packages if required
!pip install "notebook>=5.3" "ipywidgets>=7.5"

In [None]:
import os
from datetime import datetime, timedelta

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from IPython.display import display # explicit import for the display() calls below
from ipywidgets import interactive

# Load input files

In [None]:
crypto_df = pd.read_csv('../input/g-research-crypto-forecasting/train.csv')
asset_details = pd.read_csv('../input/g-research-crypto-forecasting/asset_details.csv')
sample_submission = pd.read_csv('../input/g-research-crypto-forecasting/example_sample_submission.csv')
test_df = pd.read_csv('../input/g-research-crypto-forecasting/example_test.csv')
supplemental_train_df = pd.read_csv('../input/g-research-crypto-forecasting/supplemental_train.csv')

# EDA

## Looking at the data

In [None]:
print(crypto_df.shape)
crypto_df.head()

In [None]:
# Check the grain of the data: True means every (timestamp, Asset_ID) pair is unique, i.e. one row per asset per minute
crypto_df[['timestamp','Asset_ID']].drop_duplicates().shape[0]==crypto_df.shape[0]
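
If a check like the one above returned False, the offending rows could be located with `DataFrame.duplicated`. The sketch below runs on a made-up frame in which one (timestamp, Asset_ID) pair is repeated on purpose.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "timestamp": [60, 60, 120, 120, 120],
    "Asset_ID":  [0, 1, 0, 1, 1],   # the last row repeats (120, 1)
    "Close":     [1.0, 2.0, 1.1, 2.1, 2.2],
})

key = ["timestamp", "Asset_ID"]
is_unique = not toy_df.duplicated(subset=key).any()

# keep=False marks every member of a duplicated group, not just the repeats
duplicate_rows = toy_df[toy_df.duplicated(subset=key, keep=False)]

print(is_unique)
print(duplicate_rows)
```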

In [None]:
print(asset_details.shape)
asset_details.head()

In [None]:
print(crypto_df.shape)
crypto_df = crypto_df.merge(asset_details, how = 'left', on = 'Asset_ID')
print(crypto_df.shape)
crypto_df.head()
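
Printing the shape before and after the merge guards against accidental row duplication. The same check can be made explicit with the `validate` and `indicator` arguments of `merge`, as in this sketch on made-up frames (`toy_prices` and `toy_details` are illustrative names).

```python
import pandas as pd

toy_prices = pd.DataFrame({
    "timestamp": [60, 60, 120],
    "Asset_ID": [0, 1, 0],
    "Close": [1.0, 2.0, 1.1],
})
toy_details = pd.DataFrame({
    "Asset_ID": [0, 1],
    "Weight": [2.0, 1.0],
    "Asset_Name": ["Coin A", "Coin B"],
})

# validate raises if Asset_ID is not unique in the right-hand table;
# indicator adds a _merge column showing which side each row came from
merged = toy_prices.merge(
    toy_details, how="left", on="Asset_ID",
    validate="many_to_one", indicator=True,
)

unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "rows,", len(unmatched), "without asset details")
```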

In [None]:
# Checking nulls
crypto_df.isnull().sum()

Target is the only column with nulls

In [None]:
crypto_df[crypto_df['Target'].isnull()].head()
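
To see whether missing targets are concentrated in particular assets, the null counts could be broken down per asset. This is a sketch on made-up data, and the `null_count` and `null_share` column names are illustrative.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Asset_Name": ["Coin A", "Coin A", "Coin B", "Coin B", "Coin B"],
    "Target": [0.01, np.nan, np.nan, np.nan, -0.02],
})

# 1 where Target is missing, 0 otherwise, then aggregate per asset
null_summary = (
    toy_df["Target"].isnull().astype(int)
    .groupby(toy_df["Asset_Name"])
    .agg(null_count="sum", null_share="mean")
)
print(null_summary)
```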

In [None]:
print('# Assets:',crypto_df['Asset_ID'].nunique(), crypto_df['Asset_Name'].unique().tolist())

In [None]:
crypto_df.describe()

In [None]:
print(sample_submission.shape)
sample_submission.head()

In [None]:
sample_submission['group_num'].unique()

In [None]:
print(test_df.shape)
test_df.head()

In [None]:
test_df[test_df['group_num']==1].head()

In [None]:
# Number of rows per asset
test_df.groupby(['Asset_ID']).size()

In [None]:
# Number of assets per group
test_df.groupby(['group_num']).agg({'Asset_ID':['unique', 'nunique']}).reset_index()

In [None]:
print(supplemental_train_df.shape)
supplemental_train_df.head()

In [None]:
supplemental_train_df[supplemental_train_df['timestamp']==1623542400]

In [None]:
crypto_df[crypto_df['timestamp']==1623542400]

## Candlestick chart

In [None]:
crypto_df['time'] = crypto_df['timestamp'].astype('datetime64[s]')
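
The timestamps are Unix epoch seconds, so `pd.to_datetime` with `unit="s"` is an equivalent and slightly more explicit way to convert them. A small sketch using two timestamp values, one of which is the value looked up earlier:

```python
import pandas as pd

timestamps = pd.Series([1514764860, 1623542400])

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC
times = pd.to_datetime(timestamps, unit="s")
print(times.tolist())
```

The second value, 1623542400, corresponds to 2021-06-13 00:00:00 UTC.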

In [None]:
def candlestick_chart(Asset_Name):
    temp_data = crypto_df[crypto_df['Asset_Name'] == Asset_Name].reset_index(drop = True)
    temp_data = temp_data.iloc[-500:]
    fig = go.Figure(data=[go.Candlestick(x=temp_data.time, open=temp_data['Open'], high=temp_data['High'], low=temp_data['Low'], close=temp_data['Close'])])
    fig.show()

w = interactive(candlestick_chart, Asset_Name = crypto_df['Asset_Name'].unique())
display(w)

## Comparison of assets

In [None]:
asset_id_list = crypto_df["Asset_ID"].unique().tolist()

f = plt.figure(figsize=(15,30))

for i in range(0,14):
    asset_id = asset_id_list[i]
    btc = crypto_df[crypto_df["Asset_ID"]==asset_id].set_index("timestamp") # rows for the current asset (despite the name, not only Bitcoin)

    beg_btc = btc.index[0].astype('datetime64[s]')
    end_btc = btc.index[-1].astype('datetime64[s]')

    print(btc['Asset_Name'].unique().tolist()[0] + ' data goes from ', beg_btc, 'to ', end_btc)

    print('\n--------------Checking timegaps start')
    display((btc.index[1:]-btc.index[:-1]).value_counts().head())

    btc = btc.reindex(range(btc.index[0],btc.index[-1]+60,60),method='pad') # ffill

    display((btc.index[1:]-btc.index[:-1]).value_counts().head())
    print('\n--------------Checking timegaps end')

    ax = f.add_subplot(7,2,i+1)
    
    ax.plot(btc['time'], btc['Close'], label=btc['Asset_Name'].unique().tolist()[0])
    plt.legend()
    plt.xlabel('Time')
    plt.ylabel(btc['Asset_Name'].unique().tolist()[0])

plt.tight_layout()
plt.show()
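
The gap check and the forward fill used in the loop above can be seen in isolation on a tiny made-up series. The toy frame below is missing the rows for 180 s and 240 s.

```python
import pandas as pd

# Made-up closes on a 60-second grid with two minutes missing
toy = pd.DataFrame({"Close": [10.0, 10.5, 11.0]}, index=[60, 120, 300])
toy.index.name = "timestamp"

# Spacing between consecutive rows: anything other than 60 is a gap
gaps_before = pd.Series(toy.index[1:] - toy.index[:-1]).value_counts()

# Rebuild the complete 60-second grid and forward-fill the missing minutes
filled = toy.reindex(range(toy.index[0], toy.index[-1] + 60, 60), method="pad")
gaps_after = pd.Series(filled.index[1:] - filled.index[:-1]).value_counts()

print(gaps_before)
print(filled)
print(gaps_after)
```

Forward filling repeats the last known row, so padded minutes carry stale prices. The same applies to any other column that is padded along with them, which is worth keeping in mind before computing returns on the filled data.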

# Modelling

# Predictions