In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from datetime import datetime as dt
import os
import plotly.express as px
import plotly.graph_objects as go
sns.set_style("whitegrid")
palette = sns.color_palette("tab10")
plt.rcParams['font.size']=13



#data path
data_path = '../input/g-research-crypto-forecasting/'

#### Reading and Inspecting our Data

In [None]:
#Reading in data
crypto_trade_df = pd.read_csv(os.path.join(data_path, "train.csv"))
assets_df = pd.read_csv(os.path.join(data_path, "asset_details.csv"))
print('training data has {:,} rows and {} columns'.format(crypto_trade_df.shape[0],crypto_trade_df.shape[1]))


In [None]:
crypto_trade_df.head(5)

In [None]:
#Sorting the assets by Asset_ID so that we can access the data for each asset through a dictionary
assets_df.sort_values(by='Asset_ID',inplace=True)
assets_df

### Understanding The Data

**Training & Test Data**
1. ***timestamp*** is the Unix timestamp. This will be converted to datetime.
2. ***Asset_ID*** refers to the particular crypto asset under consideration.
3. ***Count*** refers to the number of trades that took place in a particular minute.
4. ***Open*** refers to the USD price at the beginning of the minute.
5. ***High*** is the highest USD price reached during the minute.
6. ***Low*** is the lowest USD price during the minute.
7. ***Close*** refers to the USD price at the end of the minute.
8. ***Volume*** refers to the amount of the crypto asset traded during the minute.
9. ***VWAP*** is the volume-weighted average price of the crypto asset in the minute.
10. ***Target***, according to the [G-Research tutorial](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition/notebook), is the 15-minute residualized log return for the asset.
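
To make the idea of a 15-minute log return concrete, here is a minimal sketch on made-up prices. It computes only a plain forward-looking log return; the competition's Target additionally removes a market-wide component (the "residualized" part), which this sketch does not attempt. The `prices` series and `horizon` name are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy minute-by-minute closing prices (hypothetical values, not from train.csv)
prices = pd.Series(
    100.0 * np.exp(np.linspace(0.0, 0.02, 20)),
    index=pd.date_range("2021-01-01", periods=20, freq="min"),
)

# Plain log return over a 15-minute horizon starting at each minute:
# log(price 15 minutes ahead / current price)
horizon = 15
log_ret_15 = np.log(prices.shift(-horizon) / prices)

# The last `horizon` rows have no future price, so their return is NaN
print(log_ret_15.isna().sum())
```

Because the return looks into the future, any gap in future prices leaves the value undefined, which is one way missing targets can arise.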

**Asset Data**
1. **Weight** refers to the relative importance of each asset in the evaluation metric. **Asset_ID** and **Asset_Name** are self-explanatory.


**Converting Unix timestamp to datetime**

We convert the Unix timestamps to datetime here, and set the result as the index.

In [None]:
# Interpret the integer timestamps explicitly as seconds since the Unix epoch
crypto_trade_df.timestamp = pd.to_datetime(crypto_trade_df.timestamp, unit='s')
crypto_trade_df.set_index('timestamp', inplace=True)
crypto_trade_df.head(2)
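
As a quick sanity check of what the conversion does, the sketch below converts two hand-picked epoch values with `pd.to_datetime`. The values are chosen for illustration and are only assumed to resemble the timestamps in the dataset.

```python
import pandas as pd

# Two illustrative Unix timestamps in seconds, one minute apart
raw = pd.Series([1514764860, 1514764920])

# unit="s" tells pandas the integers count seconds since 1970-01-01 00:00:00 UTC
converted = pd.to_datetime(raw, unit="s")
print(converted)
```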

**Assets Dictionary**

We create a dictionary mapping each *Asset_Name* to its corresponding trade data. This allows for cleaner manipulation and the use of functions down the line.

In [None]:
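assets = {}
asset_names = assets_df.Asset_Name
# Pair each Asset_ID with its name explicitly instead of relying on row position
for asset_id, asset in zip(assets_df.Asset_ID, asset_names):
    assets[asset] = crypto_trade_df[crypto_trade_df.Asset_ID == asset_id]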

### Preprocessing

**Missing Timestamps**

From the [G-Research tutorial](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition/notebook), we know that some coins have missing timestamps. We can check this by subtracting each index value from the one that follows it: with no missing timestamps, every difference would be exactly one minute. As seen below for the *Bitcoin* asset, this is not the case.

In [None]:
(assets['Bitcoin'].index[1:]-assets['Bitcoin'].index[:-1]).value_counts().head(5)
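
The same check can also locate where the holes are, not just count them. Below is a minimal sketch on a hand-made index; the timestamps and the `gaps` and `holes` names are illustrative assumptions.

```python
import pandas as pd

# Toy minute index with the 00:03 and 00:04 rows missing (hypothetical data)
idx = pd.to_datetime([
    "2021-01-01 00:00", "2021-01-01 00:01", "2021-01-01 00:02",
    "2021-01-01 00:05", "2021-01-01 00:06",
])

# Gap between each timestamp and the one before it
gaps = idx.to_series().diff().dropna()

# Anything longer than one minute marks a hole that ends at that timestamp
holes = gaps[gaps > pd.Timedelta(minutes=1)]
print(holes)
```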

We fix this by reindexing each asset on a complete minute-by-minute range, and filling the new rows by propagating the last valid observation forward.

In [None]:
for asset_name in assets:
    df = assets[asset_name]
    # 'min' is the minute frequency alias (the older 'T' alias is deprecated in recent pandas)
    full_index = pd.date_range(start=df.index[0], end=df.index[-1], freq='min')
    assets[asset_name] = df.reindex(full_index, method='pad')
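
To see what reindexing with `method='pad'` does to a single gap, here is a small sketch on a made-up frame. The `toy` values are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Toy frame with the 00:02 row missing (hypothetical values)
toy = pd.DataFrame(
    {"Close": [10.0, 10.5, 11.0], "Volume": [5.0, 7.0, 3.0]},
    index=pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 00:01", "2021-01-01 00:03"]
    ),
)

# Build the complete minute range and copy the previous row into each new slot
full_index = pd.date_range(toy.index[0], toy.index[-1], freq="min")
filled = toy.reindex(full_index, method="pad")
print(filled)
```

Note that padding copies every column, so the new 00:02 row repeats the previous minute's *Volume* as well as its price. Sums over padded data therefore count that volume again, which is worth keeping in mind when reading aggregated volume figures.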

Having done this step for all assets, we can see below that missing time stamps have been taken care of for *Bitcoin*, by running the earlier code again.

In [None]:
(assets['Bitcoin'].index[1:]-assets['Bitcoin'].index[:-1]).value_counts().head(5)

**Missing Target Variables**

Many rows in the crypto trades dataset have no Target value. According to the tutorial, these cases are due to missing values in future prices, and they can be ignored for scoring purposes in the test set ground truth.

In [None]:
missing_targets=np.sum([assets[asset].Target.isna().sum() for asset in assets])
print('Total number of missing Targets:{:,}'.format(missing_targets))

### Understanding Each Asset
Let's explore when trade data collection for each asset began as well as the number of days of data and missing targets we have for each asset.

In [None]:
start_times = []
end_times = []
missing_targets = []

for asset in assets:
    start_times.append(assets[asset].index[0])
    end_times.append(assets[asset].index[-1])
    missing_targets.append(assets[asset].Target.isna().sum()) 
    
    
asset_description_df = pd.concat([pd.Series([asset for asset in assets]), pd.Series(start_times), pd.Series(end_times), pd.Series(missing_targets)], axis=1)
asset_description_df.columns = ['asset','start_time','end_time','missing_targets']
asset_description_df['days_of_data'] = (asset_description_df.end_time - asset_description_df.start_time)
asset_description_df['%missing_target'] = asset_description_df['missing_targets']/((asset_description_df.end_time - asset_description_df.start_time).dt.total_seconds()/60)
asset_description_df.set_index('asset',inplace=True)
asset_description_df = asset_description_df.iloc[:,[0,1,3,2,4]]
asset_description_df.sort_values(by='start_time')
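
A more compact way to get the same kind of per-asset summary is sketched below on a stand-in dictionary. The `toy_assets` names and values are hypothetical. The share is computed here as missing rows over total rows, which differs slightly from dividing by elapsed minutes as in the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `assets` dictionary (hypothetical names and values)
toy_assets = {
    "CoinA": pd.DataFrame({"Target": [0.1, np.nan, 0.3, 0.2]}),
    "CoinB": pd.DataFrame({"Target": [np.nan, np.nan, 0.5, np.nan]}),
}

# Outer keys become columns, inner keys (asset names) become the index
summary = pd.DataFrame({
    "missing_targets": {
        name: df["Target"].isna().sum() for name, df in toy_assets.items()
    },
    "share_missing": {
        name: df["Target"].isna().mean() for name, df in toy_assets.items()
    },
})
print(summary)
```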

Data collection stopped on the same day for all assets. Not all assets have the same length of data collection, with Dogecoin being the most notable at ~892 days. This is expected, as there are thousands of crypto assets, and some of them may not have been prominent enough for [CryptoCompare](https://www.cryptocompare.com) (the likely data source for G-Research) to store minute-by-minute trade data on. More research is needed here.

Furthermore, Bitcoin has the lowest number of missing targets, while Maker has the highest, with more than 1 million. Below is a plot of this.

In [None]:
asset_description_df.sort_values(by='%missing_target',inplace=True)

fig = px.bar(asset_description_df, y=asset_description_df.index, x='%missing_target', labels = {'asset':'Assets','%missing_target':'Missing Target Percentage'},
            color_discrete_sequence=['#3366CC'], title= 'Asset Target Missingness')
fig.update_layout(xaxis=dict(tickformat=".1%"))

Four assets have more than 20% of targets missing. Assets with a high number of missing targets will no doubt affect the predictions we're able to make. In a later notebook, I will compare the accuracy of predictions for assets with more complete data against those with a high percentage of targets missing.

### Trends Analysis
Let's explore the trends in our data, starting with traded volume.


In [None]:
#Monthly Volume Trends
    
asset_names= [asset for asset in assets]
index=0
rows=round(len(assets)/2)
columns = 2
row_indices = range(rows)
column_indices = range(columns)
fig, axes = plt.subplots(rows, columns, figsize=(20, 30))
for i in row_indices:
    for j in column_indices:
        axes[i,j].plot( assets[asset_names[index]]['Volume'].resample('MS').sum(),
                       linewidth=2, color=palette[index%10],label=asset_names[index])
        axes[i,j].legend()
        axes[i,j].set(ylabel='Total Traded Volume', title =asset_names[index] )
        index+=1
        
plt.suptitle('Monthly Volume Trends For Each Asset',fontsize='x-large')             
fig.tight_layout(rect=[0, 0.03, 1, 0.98])        
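
The monthly totals in the cell above come from `resample('MS').sum()`, where `'MS'` groups rows by calendar month and labels each group with the month's first day. A minimal sketch on made-up minute data that straddles a month boundary:

```python
import pandas as pd

# Toy minute-level volume spanning two calendar months (hypothetical values)
idx = pd.date_range("2021-01-31 23:58", periods=4, freq="min")
volume = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# The first two rows fall in January, the last two in February
monthly = volume.resample("MS").sum()
print(monthly)
```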



**Learnings**

We observe the following:
1. *Binance Coin*, *Bitcoin*, *Bitcoin Cash*, *Ethereum*, and *Ethereum Classic* share similar traded volume trends, as they all reached a new high around the beginning of 2020, and also peaked when the traded volume of other coins peaked in January and May 2021.
2. *Dogecoin*, *EOS.IO*, *IOTA*, *Maker*, *Litecoin*, and *Stellar* also share similar trends in that they all reached a new peak in January 2021, far higher than any peak before then.
3. The above observations imply that Q1 2020, January 2021, and May 2021 are important dates we should pay attention to.

Next, we study minute-wise closing prices.

In [None]:
#Minute-wise Closing Price
    
asset_names= [asset for asset in assets]
index=0
rows=round(len(assets)/2)
columns = 2
row_indices = range(rows)
column_indices = range(columns)
fig, axes = plt.subplots(rows, columns, figsize=(20, 30))
for i in row_indices:
    for j in column_indices:
        temp = assets[asset_names[index]]
        axes[i,j].plot( temp['Close'] ,
                       linewidth=2, color=palette[index%10],label=asset_names[index])
        axes[i,j].legend()
        axes[i,j].set(ylabel='Closing Price($)', title =asset_names[index] )
        index+=1
        
plt.suptitle('Minute-wise Asset Closing Price($)',fontsize='x-large')             
fig.tight_layout(rect=[0, 0.03, 1, 0.98])        



**Learnings**

We observe the following:
1. The first set of coins from the volume chart have a relatively smooth trend at the beginning of 2020. However, in **January 2021 (Q1 2021)**, the price of all coins begins to take off, with a similar trend across coins.
2. In Q2 2021, particularly in May, we have a strong bearish trend across crypto assets, which also corresponds to a spike in traded volume in the volume chart.
3. There is a **cyclic** component, as expected in a volatile time series. We zoom in on the final month of the Bitcoin time series to investigate further.


In [None]:
btc = assets['Bitcoin'].iloc[-50000:,:]


fig = px.line(btc,x=btc.index, y='Close', color_discrete_sequence = ['#3366CC'], labels={'index':''},
              title='Bitcoin Time Series (Aug 17 - Sep 19 2021)')
fig.update_layout(yaxis_tickprefix = '$', yaxis_tickformat = ',.0f')

On zooming in, the price movement appears very random, which is why predicting it will be a challenge 😉

In [None]:
#Monthly Average Value per Trade
    
asset_names= [asset for asset in assets]
index=0
rows=round(len(assets)/2)
columns = 2
row_indices = range(rows)
column_indices = range(columns)
fig, axes = plt.subplots(rows, columns, figsize=(20, 30), sharey=True)
for i in row_indices:
    for j in column_indices:
        # assign() returns a new frame, so the stored asset data is not modified
        data = assets[asset_names[index]].assign(Value=lambda d: d.Close*d.Volume)
        data = data[['Value','Count']].resample('MS').sum()
        data['value_per_trade'] = data.Value/data.Count
        axes[i,j].plot( data['value_per_trade'],linewidth=2, 
                       color=palette[index%10],label=asset_names[index])
        axes[i,j].legend()
        axes[i,j].set(ylabel='Avg Value / Trade ($)', title =asset_names[index] )
        index+=1
        
plt.suptitle('Monthly Average Trade Value For Each Asset',fontsize='x-large')             
fig.tight_layout(rect=[0, 0.03, 1, 0.98])        



**Learnings**

We observe the following:
1. Even for the really expensive coins like *Bitcoin*, the highest average value/trade was just over $2000 in 2018, putting all assets in a similar range in terms of value/trade.
2. Compared to the fluctuations in *Volume* and *Price*, fluctuations in value/trade are a lot more gentle, which implies a direct relationship between the value (product of *Price* and *Volume*) and the number of trades, creating some stability.
3. As the *Volume* and *Price* of the assets increase, we also have more trades happening (a higher *Count* representing more people hitting the exchanges). We can flip this around and say that more demand for an asset results in higher volumes being traded, and consequently creates an increase in price.
4. *Ethereum*, *Cardano*, and *Dogecoin* are on an upward trajectory, which may imply that the price of those assets is not purely due to increased demand (demand plays a role, but to a lesser extent than for the others).
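
The value/trade figure discussed above is the monthly sum of `Close * Volume` divided by the monthly sum of `Count`. The sketch below reproduces that arithmetic on made-up minute bars; the `bars` frame and its values are hypothetical.

```python
import pandas as pd

# Toy minute bars spanning two months (hypothetical values)
idx = pd.to_datetime(
    ["2021-01-15 10:00", "2021-01-15 10:01", "2021-02-03 09:00"]
)
bars = pd.DataFrame(
    {"Close": [100.0, 110.0, 120.0], "Volume": [2.0, 1.0, 5.0], "Count": [4, 1, 10]},
    index=idx,
)

# assign() returns a new frame, leaving `bars` unchanged
monthly = (
    bars.assign(Value=bars["Close"] * bars["Volume"])[["Value", "Count"]]
    .resample("MS")
    .sum()
)

# Average traded value per trade within each month
monthly["value_per_trade"] = monthly["Value"] / monthly["Count"]
print(monthly)
```

Summing first and dividing afterwards weights each minute by its trade count, which is not the same as averaging per-minute ratios.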





In [None]:
#Minute-wise Target
    
asset_names= [asset for asset in assets]
index=0
rows=round(len(assets)/2)
columns = 2
row_indices = range(rows)
column_indices = range(columns)
fig, axes = plt.subplots(rows, columns, figsize=(20, 30))
for i in row_indices:
    for j in column_indices:
        temp = assets[asset_names[index]]
        axes[i,j].plot( temp['Target'] ,
                       linewidth=2, color=palette[index%10],label=asset_names[index])
        axes[i,j].legend()
        axes[i,j].set(ylabel='Target', title =asset_names[index] )
        index+=1
        
plt.suptitle('Minute-wise Asset Targets',fontsize='x-large')             
fig.tight_layout(rect=[0, 0.03, 1, 0.98])        



**Learnings**

We observe the following:

1. Contrary to what was mentioned in the tutorial about empty target observations, *Maker* already had some empty target observations, and these were amplified when we added in the missing timestamps.
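
The amplification can be reproduced on a small made-up example: when the row just before a gap already has an empty Target, padding copies that empty value into every new row of the gap. The `toy` frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: Target is already NaN at 00:01, and 00:02-00:03 are missing rows
toy = pd.DataFrame(
    {"Target": [0.1, np.nan, 0.2]},
    index=pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 00:01", "2021-01-01 00:04"]
    ),
)
missing_before = toy["Target"].isna().sum()

# Pad-reindex onto the full minute range, copying the previous row into new slots
full_index = pd.date_range(toy.index[0], toy.index[-1], freq="min")
padded = toy.reindex(full_index, method="pad")
missing_after = padded["Target"].isna().sum()

print(missing_before, missing_after)
```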


This notebook is still in progress; please check back for more updates.