# Introduction

This notebook is a first look at the tasks we will have to undertake to participate in this competition. I will go through the data files provided to us, specifically *asset_details.csv* and *train.csv*, and note my thoughts as I go.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import os
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
cmap = sns.color_palette()
import warnings
warnings.filterwarnings("ignore")

In [None]:
data_path = '../input/g-research-crypto-forecasting'

In [None]:
!ls $data_path

# Data

In [None]:
assets = pd.read_csv(os.path.join(data_path, 'asset_details.csv'))
assets.sort_values('Weight', ascending = False)

This file contains the names of all 14 coins along with the ID each one is mapped to. It also contains a weight for each coin, presumably based on how important that coin will be in our predictions.

In [None]:
train_df = pd.read_csv(os.path.join(data_path, 'train.csv'))
train_df['asset_name'] = train_df.Asset_ID.map(assets.set_index('Asset_ID').Asset_Name)
print(f'There are {len(train_df)} rows in the dataset')
train_df.head()

This is what we'll mainly be working with. There are almost 25 million rows of data. Each row represents a minute's worth of data for a particular coin. Here's what the hosts say about the variables:
* timestamp: All timestamps are returned as second Unix timestamps (the number of seconds elapsed since 1970-01-01 00:00:00.000 UTC). Timestamps in this dataset are multiple of 60, indicating minute-by-minute data.
* Asset_ID: The asset ID corresponding to one of the cryptocurrencies (e.g. Asset_ID = 1 for Bitcoin). The mapping from Asset_ID to crypto asset is contained in asset_details.csv.
* Count: Total number of trades in the time interval (last minute).
* Open: Opening price of the time interval (in USD).
* High: Highest price reached during time interval (in USD).
* Low: Lowest price reached during time interval (in USD).
* Close: Closing price of the time interval (in USD).
* Volume: Quantity of asset bought or sold, displayed in base currency USD.
* VWAP: The average price of the asset over the time interval, weighted by volume. VWAP is an aggregated form of trade data.
* Target: Residual log-returns for the asset over a 15 minute horizon.
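
To make the timestamp convention above concrete, here is a minimal sketch on three made-up timestamps (the `ts` series is a toy stand-in, not read from *train.csv*). It converts second-resolution Unix timestamps to datetimes and checks that they are multiples of 60.

```python
import pandas as pd

# Toy timestamps: three consecutive minutes, in seconds since the Unix epoch
ts = pd.Series([1514764860, 1514764920, 1514764980])

# Interpret the integers as seconds (not milliseconds or nanoseconds)
as_datetime = pd.to_datetime(ts, unit='s')
print(as_datetime)

# Minute-by-minute data means every timestamp is a multiple of 60
all_whole_minutes = bool((ts % 60 == 0).all())
print(all_whole_minutes)
```

The same `pd.to_datetime(..., unit='s')` call should work on the real `timestamp` column if a readable time axis is wanted.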

Let's see if all the coins are equally represented in our dataset.

In [None]:
asset_counts = []
for asset in assets.Asset_ID:
    x = assets[assets.Asset_ID == asset].Asset_Name.values[0]
    y = train_df[train_df.Asset_ID == asset].Asset_ID.value_counts().values[0]
    asset_counts.append([x,y])
asset_count_df = pd.DataFrame(asset_counts)
asset_count_df.columns = ['coin', 'count']
asset_count_df
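
The loop above can probably be replaced by a single group-by. The sketch below shows the idea on toy frames; `toy_assets`, `toy_train` and the coin called "Toy coin" are made up for illustration and only mimic the column names of the real files.

```python
import pandas as pd

# Toy stand-ins for asset_details.csv and train.csv (values are made up)
toy_assets = pd.DataFrame({'Asset_ID': [1, 99],
                           'Asset_Name': ['Bitcoin', 'Toy coin']})
toy_train = pd.DataFrame({'Asset_ID': [1, 1, 1, 99, 99]})

# Count rows per Asset_ID, then attach the readable coin names
toy_counts = (toy_train.groupby('Asset_ID').size()
              .reset_index(name='count')
              .merge(toy_assets, on='Asset_ID'))
print(toy_counts[['Asset_Name', 'count']])
```

On the full dataset this avoids filtering `train_df` once per coin, which may be noticeably faster.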

In [None]:
fig_1 = px.pie(asset_count_df, values = 'count', names = 'coin',
              title = 'Distribution of data in the dataset')
fig_1.show()

As you can see, not all coins are equally represented, and in some cases the disparity is large. **Maker**, for example, has almost 1.3 million fewer samples than **Bitcoin**.

This could be caused by data not being available for the same time period for all the different coins. We verify this by finding the start and end dates for all the different coins.

In [None]:
dates = []
for ind,coin in enumerate(assets.Asset_Name.values):
    crypto = train_df[train_df.asset_name == coin].set_index('timestamp')
    start_time = crypto.index[0].astype('datetime64[s]')
    end_time = crypto.index[-1].astype('datetime64[s]')
    dates.append([coin, start_time, end_time])
dates_df = pd.DataFrame(dates)
dates_df.columns = ['coin','start', 'end']
dates_df
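
An alternative sketch for the same check, using a group-by with `min` and `max` so it does not rely on the rows being sorted by time. The `toy` frame and its coin names "A" and "B" are made up for illustration.

```python
import pandas as pd

# Toy data: coin B starts one minute later than coin A
toy = pd.DataFrame({
    'asset_name': ['A', 'A', 'A', 'B', 'B'],
    'timestamp': [1514764860, 1514764920, 1514764980,
                  1514764920, 1514764980],
})

# Earliest and latest timestamp per coin
span = toy.groupby('asset_name').timestamp.agg(start='min', end='max')

# Convert both columns from Unix seconds to datetimes
span['start'] = pd.to_datetime(span['start'], unit='s')
span['end'] = pd.to_datetime(span['end'], unit='s')
print(span)
```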

From this we can see that almost all coins share the same start and end dates. All of them, with the exception of Dogecoin, start in 2018, and all of them end at the same time. While this does account for some of the disparity, it does not explain the magnitude we see in our data distribution.

We have been informed by the host that:

> Missing asset data, for a given minute, is not represented by NaN's, but instead by the absence of those rows.

This could explain the missing rows. We can find these by looking for instances where the time gap between consecutive rows is greater than 60 seconds.

In [None]:
missing_rows = []
for coin in assets.Asset_Name.values:
    crypto = train_df[train_df.asset_name == coin].set_index('timestamp')
    # Unique gaps (in seconds) between consecutive rows, and how often each occurs
    v, c = np.unique(np.diff(crypto.index), return_counts = True)
    vc_df = pd.DataFrame({'interval': v, 'counts': c})
    # Keep only gaps longer than one minute; copy so new columns can be added safely
    missing_data = vc_df[vc_df.interval > 60].copy()
    missing_data['interval(mins)'] = missing_data.interval/60
    # A gap of k minutes means k-1 rows are absent
    missing_data['rows'] = (missing_data['interval(mins)'] - 1)*missing_data.counts
    gaps = missing_data.counts.sum()
    rows = missing_data.rows.sum()
    missing_rows.append([coin, gaps, rows])
missing_rows_df = pd.DataFrame(missing_rows)
missing_rows_df.columns = ['coin', 'gaps','rows_missing']
missing_rows_df

This goes some way to explaining the disparity in our dataset. Some of these coins have hundreds of thousands of rows of missing data. We're going to have to find a way to fill these gaps, or at least account for them, when we build our model.
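
One possible way to both fill the gaps and keep track of them is sketched below on a toy series. The `toy` frame, its prices and the `was_missing` column name are assumptions made for illustration; forward-filling is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Toy minute bars where timestamps 180 and 240 are absent (values are made up)
toy = pd.DataFrame({'Close': [10.0, 11.0, 12.0, 13.0]},
                   index=[60, 120, 300, 360])

# A gap of k minutes between consecutive rows hides k - 1 rows
gaps = np.diff(toy.index) // 60
n_missing = int((gaps - 1).sum())

# Reindex onto a complete one-minute grid; absent minutes become NaN
full_index = range(toy.index[0], toy.index[-1] + 60, 60)
filled = toy.reindex(full_index)

# Remember which rows were absent before forward-filling the price
filled['was_missing'] = filled['Close'].isna()
filled['Close'] = filled['Close'].ffill()

print(n_missing)
print(filled)
```

Keeping a flag like `was_missing` lets a model, or a later filtering step, treat padded rows differently from observed ones.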

In [None]:
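import calendar
# Convert a dd/mm/YYYY date string to a Unix timestamp in seconds, treating the date as UTC
# (the dataset's timestamps are UTC, so the conversion should not depend on the local timezone)
totimestamp = lambda s: calendar.timegm(datetime.strptime(s, "%d/%m/%Y").timetuple())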

# Plots

Let's see how the closing price has behaved for bitcoin over the last 3 years.

In [None]:
# Bitcoin has Asset_ID 1
bit = train_df[train_df.Asset_ID == 1].set_index('timestamp')

f = plt.figure(figsize = (10,6))

ax = f.add_subplot(111)
ax.set_facecolor('azure')
plt.plot(bit['Close'], c = 'darkviolet')
plt.axvline(x = 1546300800, label = 'Start of 2019', c = 'orange')
plt.axvline(x = 1577836800, label = 'Start of 2020', c = 'forestgreen')
plt.axvline(x = 1609459200, label = 'Start of 2021', c = 'crimson')
plt.legend()
plt.xlabel('Time')
plt.ylabel('Bitcoin Close')
plt.grid()
plt.title('Close price of bitcoin')

plt.show()

Something clearly changed in 2021, where we see extremely volatile behaviour. The earlier period looks relatively stable compared with what has happened since the end of 2020. Let's see if this behaviour is replicated by the other coins.

In [None]:
f = plt.figure(figsize = (15,30))

for ind,coin in enumerate(assets.Asset_Name.values):
    crypto = train_df[train_df.asset_name == coin].set_index('timestamp')
    ax = f.add_subplot(7,2,ind+1)
    ax.set_facecolor('azure')
    plt.plot(crypto['Close'], c = cmap[ind%10])
    plt.axvline(x = 1546300800, label = 'Start of 2019', c = 'orange')
    plt.axvline(x = 1577836800, label = 'Start of 2020', c = 'forestgreen')
    plt.axvline(x = 1609459200, label = 'Start of 2021', c = 'crimson')
    plt.legend()
    plt.xlabel('Time')
    plt.ylabel('Close price')
    plt.grid()
    plt.title(coin)
    
plt.tight_layout()
plt.show()

A lot of coins behave similarly to one another, and most show the same pattern in 2021. There is probably an exogenous reason for it, perhaps increased speculation as cryptocurrencies went mainstream and people had more time on their hands while at home during the pandemic. This is just a hypothesis, and a speculative one at that. We'll explore it later.

Let's take a deeper look at one of the coins to see if we can expect more missing data from this dataset, such as NaN's.

In [None]:
bit = train_df[train_df.Asset_ID == 1].set_index('timestamp')
bit.head()

In [None]:
# Padding (forward-filling) all the missing rows onto a complete one-minute grid
bit = bit.reindex(range(bit.index[0],bit.index[-1]+60,60),method='pad')

In [None]:
bit.isna().sum()

In [None]:
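# Taking bit for 2021 only, plus the last day of 2020 so that lagged features can be computed
# .copy() so that new columns can be added without modifying a view of bit
bit_21 = bit.loc[totimestamp('31/12/2020'):totimestamp('01/09/2021')].copy()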

In [None]:
bit_21.isna().sum()

In [None]:
bit_21 = bit_21.fillna(0)
bit_21.isna().sum()

In [None]:
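# Checking for correlations between the numeric columns
plt.figure(figsize=(10,5))
# asset_name is a string column, so restrict the correlation to numeric columns
cor = bit_21.select_dtypes('number').corr()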
sns.heatmap(cor, annot=True, cmap=plt.cm.CMRmap_r)
plt.show()

In [None]:
#creating a column for close prices from an hour before 
bit_21['Close_hour'] = bit_21.Close.shift(60)

In [None]:
# define function to compute log returns
def log_return(series, periods=1):
    return np.log(series).diff(periods=periods)
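
A quick toy check of what `log_return` produces, using three made-up prices that each rise by 10%. The function is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def log_return(series, periods=1):
    return np.log(series).diff(periods=periods)

# Toy prices: two consecutive 10% increases
prices = pd.Series([100.0, 110.0, 121.0])
lret = log_return(prices)
print(lret)

# Log returns add up across periods: the sum equals the log of the total growth
total = lret.sum()
print(total, np.log(121.0 / 100.0))
```

The first value is NaN because there is no earlier price to compare with, which is why one extra day of history is useful before building features.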

In [None]:
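# Creating features based off of literature

# PROC: price rate of change (%) relative to the close one hour earlier
bit_21['PROC'] = ((bit_21.Close - bit_21.Close_hour)/bit_21.Close_hour)*100

# SMA: simple moving average of the close over the last 10 minutes
bit_21['SMA'] = bit_21['Close'].rolling(window=10).mean()

# EMA: exponential moving average of the close with a 60-minute span
bit_21['EMA'] = bit_21['Close'].ewm(span=60,adjust=False).mean()

# lret: one-minute log return of the close (the first row is NaN)
bit_21['lret'] = log_return(bit_21.Close)

# Mom: VWAP (already a column in the data) is a widely used momentum indicator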

In [None]:
bit_21 = bit_21.loc[totimestamp('01/01/2021'):totimestamp('01/09/2021')]

In [None]:
bit_21.head()
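
To show what the moving-average features compute, here is a small sketch on four made-up closing prices, with a shorter window and span than the notebook uses so the numbers are easy to follow. With `adjust=False`, pandas applies the recursion `ema[t] = alpha * close[t] + (1 - alpha) * ema[t-1]` with `alpha = 2 / (span + 1)`, which the loop reproduces by hand.

```python
import numpy as np
import pandas as pd

# Toy closing prices (made up)
close = pd.Series([10.0, 12.0, 11.0, 13.0])

# Simple moving average over 2 rows; the first value has no full window
sma = close.rolling(window=2).mean()

# Exponential moving average with span 3, i.e. alpha = 2 / (3 + 1) = 0.5
span = 3
ema = close.ewm(span=span, adjust=False).mean()

# The same EMA computed by hand from the recursion
alpha = 2 / (span + 1)
manual = [close.iloc[0]]
for price in close.iloc[1:]:
    manual.append(alpha * price + (1 - alpha) * manual[-1])

print(sma.tolist())
print(ema.tolist())
print(manual)
```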

In [None]:
features = ['Count', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP',
            'PROC', 'SMA', 'EMA', 'lret']

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = bit_21[features]
y = bit_21.Target


In [None]:
from sklearn.preprocessing import StandardScaler
# simple preprocessing of the data 
scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

In [None]:
train_X, val_X, train_y, val_y = train_test_split(X_scaled, y, random_state=15)
model = RandomForestRegressor(n_estimators=50, random_state=15).fit(train_X, train_y)
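
One caveat worth keeping in mind: `train_test_split` shuffles rows by default, so validation minutes end up interleaved with training minutes, and with overlapping 15-minute targets this may flatter the validation score. A time-ordered split is a common alternative. The sketch below uses random toy arrays (`X_toy`, `y_toy` are made up) and assumes the rows are already in chronological order. The 75/25 ratio mirrors the default of `train_test_split`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))   # toy features, rows assumed to be in time order
y_toy = rng.normal(size=100)        # toy target

# First 75% of the timeline for training, the remaining 25% for validation
split = int(len(X_toy) * 0.75)
train_X_toy, val_X_toy = X_toy[:split], X_toy[split:]
train_y_toy, val_y_toy = y_toy[:split], y_toy[split:]

# Standardise using statistics from the training period only,
# so nothing from the validation period leaks into preprocessing
mean = train_X_toy.mean(axis=0)
std = train_X_toy.std(axis=0)
train_X_toy = (train_X_toy - mean) / std
val_X_toy = (val_X_toy - mean) / std

print(train_X_toy.shape, val_X_toy.shape)
```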

In [None]:
pred_y = model.predict(val_X) 

In [None]:
# Calculate the correlation coefficient between pred_y and val_y
np.corrcoef(pred_y, val_y)[0,1]
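
For reference, the number returned by `np.corrcoef(...)[0,1]` is the Pearson correlation. The sketch below computes it by hand on made-up predictions and targets (`pred_toy`, `true_toy` and the `pearson` helper are illustrative names) and checks that it agrees with NumPy.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

# Toy predictions and targets (made up)
pred_toy = np.array([0.1, -0.2, 0.3, 0.0])
true_toy = np.array([0.2, -0.1, 0.1, -0.3])

r = pearson(pred_toy, true_toy)
r_scaled = pearson(10 * pred_toy, true_toy)
print(r, r_scaled)
```

Because the correlation ignores scale, multiplying the predictions by a positive constant leaves the score unchanged. Only the ordering and relative size of the predictions matter.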

In [None]:
# Model explainability: permutation importance
import eli5
from eli5.sklearn import PermutationImportance

# Measure how much the validation score changes when each feature is shuffled
perm = PermutationImportance(model, random_state=1).fit(val_X, val_y)

eli5.show_weights(perm, feature_names = features)
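
The idea behind permutation importance can be sketched without any extra library: shuffle one feature at a time and see how much the error grows. The example below is an illustration only. It uses made-up data and a least-squares fit as a stand-in model rather than the random forest, and for brevity it scores on the same rows it was fitted on, whereas the cell above uses the validation set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the target depends on feature 0 only; feature 1 is pure noise
X_toy = rng.normal(size=(500, 2))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=500)

# Stand-in model: ordinary least squares without an intercept
coef, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

def mse(X):
    """Mean squared error of the fitted linear model on features X."""
    return np.mean((X @ coef - y_toy) ** 2)

baseline = mse(X_toy)

# Importance of feature j = increase in error after shuffling column j
importance = []
for j in range(X_toy.shape[1]):
    X_perm = X_toy.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importance.append(mse(X_perm) - baseline)

print(importance)
```

Shuffling the informative feature should hurt the error much more than shuffling the noise feature, which is the signal the weights table above is meant to surface.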

In [None]:
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(val_X)

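# For a regressor, shap_values is a single (n_samples, n_features) array, so it is passed whole
shap.summary_plot(shap_values, val_X, feature_names=features)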

In [None]:
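# val_X is a plain array after scaling, so feature names are supplied to look up 'EMA'
shap.dependence_plot('EMA', shap_values, val_X, feature_names=features)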

# Upcoming
I intend to continue working on this notebook and pick up from where I left off. Do return if you have found this helpful.

**References**
* [Tutorial to the G-Research Crypto Competition](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
* [Simple Exploration Notebook - Crypto Forecasting](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-crypto-forecasting)

I have learned a great deal from these notebooks and I hope you will as well.