---
# G-Research Crypto Forecasting - Prediction

---

**Problem Statement:**

* Over $40 billion worth of cryptocurrencies are traded every day.
* They are among the most popular assets for speculation and investment, yet have proven wildly volatile.
* Fast-fluctuating prices have made millionaires of a lucky few and delivered crushing losses to others.
* Could some of these price movements have been predicted in advance?

* Use machine learning expertise to forecast short-term returns in 14 popular cryptocurrencies.
* The dataset contains millions of rows of high-frequency market data dating back to 2018, which can be used to build an ML model.
* Once the submission deadline has passed, the final score will be calculated over the following 3 months using live crypto data as it is collected.
 
---
 
**Main Dataset:**

* **train.csv** file contains the training set.
* **example_test.csv** file includes an example of the data that will be delivered by the time series API.
* **example_sample_submission.csv** file includes an example of the data that will be delivered by the time series API. The data is just copied from train.csv.
* **asset_details.csv** file provides the real name of the cryptoasset for each Asset_ID and the weight each cryptoasset receives in the metric.
* **supplemental_train.csv** file: after the submission period is over, this file's data will be replaced with cryptoasset prices from the submission period.
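
The weights in asset_details.csv decide how much each cryptoasset counts in the metric. As a rough illustration of how a per-row weight can enter a correlation-style score, here is a minimal sketch. It assumes the metric is a weighted Pearson correlation, which is not stated in this notebook and should be checked against the competition's evaluation page. The function name `weighted_correlation` and all toy values are hypothetical.

```python
import numpy as np

def weighted_correlation(y_true, y_pred, weights):
    """Pearson correlation in which each observation carries a weight."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to 1

    mean_true = np.sum(w * y_true)
    mean_pred = np.sum(w * y_pred)
    cov = np.sum(w * (y_true - mean_true) * (y_pred - mean_pred))
    var_true = np.sum(w * (y_true - mean_true) ** 2)
    var_pred = np.sum(w * (y_pred - mean_pred) ** 2)
    return cov / np.sqrt(var_true * var_pred)

# Hypothetical toy values: one target, one prediction and one weight per row
targets = np.array([0.001, -0.002, 0.003, 0.0005])
predictions = np.array([0.0008, -0.001, 0.002, 0.001])
asset_weights = np.array([6.8, 5.9, 1.1, 2.4])

score = weighted_correlation(targets, predictions, asset_weights)
print(f"weighted correlation: {score:.4f}")
```

Rows with a larger weight pull the score more strongly, so under this assumption errors on heavily weighted assets cost more than errors on lightly weighted ones.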

**gresearch_crypto** - An unoptimized version of the time series API files for offline work. You may need Python 3.7 and a Linux environment to run it without errors.

---

---
**Importing Libraries:**

* We use Python for data pre-processing and data analysis.

* The libraries needed for data loading, visualization and time series modeling are imported up front.

---

In [None]:
# Custom G-Research Python module (time series API)
import gresearch_crypto

# pandas for data manipulation, NumPy for numerical operations
import pandas as pd
import numpy as np

# Portable access to operating-system functionality (used to walk the input folder)
import os

# Matplotlib: top-level package and the pyplot interface
import matplotlib as mpl
import matplotlib.pyplot as plt

# seaborn for statistical visualization
import seaborn as sns

# WordCloud for text data visualization
from wordcloud import WordCloud

# datetime for working with dates and times
from datetime import datetime

# Plotly Express for interactive visualization
import plotly.express as px

# statsmodels for statistical methods (SARIMAX, ADF test, decomposition)
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

# rcParams for customizing Matplotlib defaults such as figure size
from matplotlib import rcParams

# itertools for building parameter grids
import itertools

In [None]:
# Custom G-Research python module requires this step
#env = gresearch_crypto.make_env()

In [None]:
# Display files available in this competition folder
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

---
# Data Definition/Description

---

---
**Data Definition:**

---

* **train.csv:** This dataset contains information on historic trades for several cryptoassets.

In [None]:
# load training data
train_df = pd.read_csv('../input/g-research-crypto-forecasting/train.csv', low_memory=False)

# load asset details data
asset_details_df = pd.read_csv('../input/g-research-crypto-forecasting/asset_details.csv', low_memory=False)

# load example test data
test_df = pd.read_csv('../input/g-research-crypto-forecasting/example_test.csv', low_memory=False)

---
**Q: What is the structure of the train dataset?**

---

| No. | Feature Name | Description of the feature |
| :-- | :--| :--| 
|01| **timestamp**   | A timestamp for the minute covered by the row |
|02| **Asset_ID** | An ID code for the cryptoasset                 |
|03| **Count**   | The number of trades that took place this minute                 |
|04| **Open**   | The USD price at the beginning of the minute  |
|05| **High**   | The highest USD price during the minute|
|06| **Low**   | The lowest USD price during the minute  |
|07| **Close**   | The USD price at the end of the minute|
|08| **Volume**   | The number of cryptoasset units traded during the minute|
|09| **VWAP**   | The volume weighted average price for the minute  |
|10| **Target**   | 15 minute residualized returns|
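
A residualized return starts from a plain forward return and then removes the part explained by the overall market. The sketch below covers only the first step on toy prices: a 15-minute forward log return computed from a `Close` series. It is not the competition's exact Target definition. The helper name `forward_log_return` and the toy prices are hypothetical.

```python
import numpy as np
import pandas as pd

def forward_log_return(close: pd.Series, horizon: int = 15) -> pd.Series:
    """Log return from each row to the row `horizon` steps later."""
    return np.log(close.shift(-horizon) / close)

# Toy minute-level prices that grow by 0.1% every minute
minutes = pd.date_range("2021-01-01", periods=20, freq="min")
close = pd.Series(100.0 * 1.001 ** np.arange(20), index=minutes)

returns = forward_log_return(close)
print(returns.head())
# The last `horizon` rows have no future price, so their return is NaN
print("rows without a forward return:", returns.isna().sum())
```

The trailing NaN values show one reason why a forward-looking column such as Target can have missing entries at the end of a series.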

In [None]:
# get shape of dataframe
print('Shape of train dataset is:', train_df.shape)

# print summary of dataframe
train_df.info(show_counts=True)

**train dataset information:**

* There are 24,236,806 data points (rows) and 10 features (columns) in the train dataset.
* Seven columns are of numerical float type and three columns are of numerical int type.
* VWAP and Target have missing values (their non-null counts are lower than the number of rows).
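
The non-null counts from `info()` can be turned into explicit missing-value counts. A minimal sketch on a toy frame, assuming the same column names as train.csv; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as train.csv (values are illustrative)
toy_df = pd.DataFrame({
    "timestamp": [1514764860, 1514764860, 1514764920, 1514764920],
    "Asset_ID": [1, 6, 1, 6],
    "Close": [13850.2, 738.5, 13828.1, 738.3],
    "VWAP": [13827.1, np.nan, 13840.4, 738.8],
    "Target": [-0.0147, -0.0049, np.nan, -0.0047],
})

# Count missing values per column and show only the affected columns
missing_per_column = toy_df.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Keep only rows where both VWAP and Target are present
complete_rows = toy_df.dropna(subset=["VWAP", "Target"])
print(len(toy_df), "rows ->", len(complete_rows), "complete rows")
```

Restricting `dropna` with `subset` makes it explicit which columns decide whether a row is kept.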

---
**Q: What does the data look like for the train dataset?**

---

In [None]:
# print first 10 rows of dataframe
train_df.head(10)

---
**Q: What does the data look like for the asset_details dataset?**

---

In [None]:
# show asset details sorted by weight, highest first
asset_details_df.sort_values(by=['Weight'], ascending=False)

---
**Q: What are the descriptive statistics for the train dataset?**

---

In [None]:
# print descriptive statistics for train dataset
train_df.describe(include='all').round(1)

**train dataset data description:**

* timestamp has no missing values, and its summary statistics suggest a roughly symmetric spread of data points.
* Asset_ID has no missing values. It is an ID code, so its statistics only indicate how rows are spread across assets.
* Count, Open, High, Low, Close and Volume have no missing values, but their distributions appear skewed.
* VWAP and Target have missing values. For the values that are present, the distributions appear roughly symmetric.

The asset names and weights are stored in asset_details_df, so we merge it into train_df on Asset_ID.

In [None]:
# merge dataframe using Asset_ID as key
gresearch_crypto_df = pd.merge(train_df,asset_details_df,on=['Asset_ID'])
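
A merge like this is expected to keep one output row per training row, because each Asset_ID should appear only once in the asset details. A minimal sketch of making that expectation explicit with the `how` and `validate` arguments of `pd.merge`, on toy frames with illustrative values:

```python
import pandas as pd

# Toy stand-ins for train_df and asset_details_df (values are illustrative)
toy_train = pd.DataFrame({
    "Asset_ID": [1, 6, 1],
    "Close": [13850.2, 738.5, 13828.1],
})
toy_assets = pd.DataFrame({
    "Asset_ID": [1, 6],
    "Weight": [6.78, 5.89],
    "Asset_Name": ["Bitcoin", "Ethereum"],
})

# how="left" keeps every training row; validate raises an error if an
# Asset_ID appears more than once in the asset details
merged = pd.merge(toy_train, toy_assets, on="Asset_ID",
                  how="left", validate="many_to_one")
print(merged)
```

If the row count of the merged frame differs from the training frame, the key column is worth inspecting before going further.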

In [None]:
# print first 10 rows of merged dataframe
gresearch_crypto_df.head(10)

In [None]:
# print last 10 rows of merged dataframe
gresearch_crypto_df.tail(10)

Handle missing values: rows that contain any missing value are dropped.

In [None]:
# drop missing value from dataframe
gresearch_crypto_df.dropna(axis=0, inplace=True)

In [None]:
# print summary of dataframe
gresearch_crypto_df.info(show_counts=True)

---
# Model Building/Prediction

---

In [None]:
# convert timestamp to Datetime for further analysis
gresearch_crypto_df['Datetime'] = pd.to_datetime(gresearch_crypto_df['timestamp'], unit='s')
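
The `timestamp` column holds seconds since the Unix epoch, which is why `unit='s'` is passed. A small check on two toy timestamps one minute apart:

```python
import pandas as pd

# Two toy epoch timestamps (in seconds), one minute apart
timestamps = pd.Series([1514764860, 1514764920])

# unit="s" tells pandas the integers are seconds since 1970-01-01 (UTC)
datetimes = pd.to_datetime(timestamps, unit="s")
print(datetimes)
print("spacing:", datetimes.iloc[1] - datetimes.iloc[0])
```

Without `unit="s"`, pandas would read the integers as nanoseconds and place every row within a few seconds of 1970-01-01.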

In [None]:
# Look up an asset by name and return its Target as a business-day series
def asset_lookup(asset_name):
    # keep only the rows for the requested Asset_Name
    crypto_asset = gresearch_crypto_df.loc[gresearch_crypto_df['Asset_Name'] == asset_name]
    # keep only the Datetime and Target columns for now
    crypto_asset_df = crypto_asset[['Datetime', 'Target']]
    # sort by Datetime, aggregate duplicate timestamps (if any) and set the index
    crypto_asset_df = crypto_asset_df.sort_values('Datetime')
    crypto_asset_df = crypto_asset_df.groupby('Datetime')['Target'].sum().reset_index()
    crypto_asset_df = crypto_asset_df.set_index('Datetime')
    # resample to business-day frequency, averaging Target within each bin
    return crypto_asset_df['Target'].resample('B').mean()
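
To see what the business-day resampling step does, here is a minimal sketch on three weekdays of toy minute-level data. The values are made up so that each day has a known mean.

```python
import numpy as np
import pandas as pd

# Three weekdays of minute data (Mon 4 Jan to Wed 6 Jan 2021), toy values only
index = pd.date_range("2021-01-04", periods=3 * 1440, freq="min")
values = np.repeat([1.0, 2.0, 3.0], 1440)  # constant value within each day
minute_series = pd.Series(values, index=index, name="Target")

# 'B' creates one bin per business day; mean() averages the minutes in each bin
daily = minute_series.resample("B").mean()
print(daily)
```

Each day of 1,440 minute rows collapses to a single value. Crypto markets also trade on weekends, so a calendar-day frequency (`'D'`) is an alternative worth considering if weekend observations should keep their own rows.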

In [None]:
# check for stationarity using ADF Test for a given asset series
def asset_stationarity(asset_series):
    result = adfuller(asset_series)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))

In [None]:
# visualization for trend,seasonality and residual for a given asset series
def asset_decompose (asset_series):
    rcParams['figure.figsize'] = 18, 8
    decomposition = sm.tsa.seasonal_decompose(asset_series, model='additive')
    fig = decomposition.plot()
    plt.show()

**Let's explore a seasonal ARIMA model for Bitcoin**

In [None]:
# create series for Asset_Name Bitcoin
bitcoin = asset_lookup('Bitcoin')
bitcoin.isna().sum()

In [None]:
# check asset stationarity
asset_stationarity (bitcoin)

In [None]:
# visualize trend, seasonality and residual
asset_decompose(bitcoin)

In [None]:
# ARIMA orders (p, d, q): p = autoregressive order, d = degree of differencing, q = moving-average order
# the seasonal order adds (P, D, Q, s), here with a seasonal period s = 12
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))

In [None]:
# search for the parameter combination with the lowest AIC
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(bitcoin,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit(disp=False)
            print('SARIMAX{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))
        except Exception:
            # skip combinations that fail to fit
            continue

In [None]:
# fit a seasonal ARIMA model with order (1, 1, 1) and seasonal order (1, 1, 1, 12)
model = sm.tsa.statespace.SARIMAX(bitcoin,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = model.fit()
print(results.summary().tables[1])

---
**Q: What do the model diagnostics look like for Bitcoin?**

---

In [None]:
# model diagnostics to check for errors
results.plot_diagnostics(figsize=(16, 8))
plt.show()

---
**Q: What does the one-step-ahead forecast look like for Bitcoin?**

---

In [None]:
# validate forecast
pred = results.get_prediction(start=pd.to_datetime('2021-01-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = bitcoin['2018':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Target')
plt.legend()
plt.show()

In [None]:
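# compare the one-step-ahead forecasts with the observed values from 2021 onwards
y_forecasted = pred.predicted_mean
y_truth = bitcoin['2021-01-01':]
mse = ((y_forecasted - y_truth) ** 2).mean()
# Target values are small, so print in scientific notation instead of rounding to 2 decimals
print('The Mean Squared Error of our forecasts is {:.3e}'.format(mse))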

In [None]:
print('The Root Mean Squared Error of our forecasts is {:.3e}'.format(np.sqrt(mse)))
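
Because return-style targets are small numbers, their squared errors are smaller still, and the way the result is formatted decides whether anything useful is displayed. A minimal sketch with toy values chosen so the arithmetic can be followed by hand:

```python
import numpy as np
import pandas as pd

# Toy observed values and forecasts on a shared business-day index
index = pd.date_range("2021-01-04", periods=3, freq="B")
y_truth = pd.Series([0.001, -0.002, 0.003], index=index)
y_forecasted = pd.Series([0.0, 0.0, 0.0], index=index)

# Squared errors are 1e-6, 4e-6 and 9e-6, so the mean is 14e-6 / 3
mse = ((y_forecasted - y_truth) ** 2).mean()
rmse = np.sqrt(mse)

print("rounded to 2 decimals:", round(mse, 2))
print(f"scientific notation: MSE={mse:.3e}, RMSE={rmse:.3e}")
```

The subtraction aligns the two series on their index, so both need to cover the same dates for every row to contribute to the mean.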

---
**Q: What does the forecast look like for Bitcoin?**

---

In [None]:
# visualization of forecast
pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()
ax = bitcoin.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Target')
plt.legend()
plt.show()

**Let's explore a seasonal ARIMA model for Ethereum**

In [None]:
# create series for Asset_Name Ethereum
ethereum = asset_lookup('Ethereum')
ethereum.isna().sum()

In [None]:
# check asset stationarity
asset_stationarity (ethereum)

In [None]:
# visualize trend, seasonality and residual
asset_decompose(ethereum)

In [None]:
# search for the parameter combination with the lowest AIC
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(ethereum,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit(disp=False)
            print('SARIMAX{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))
        except Exception:
            # skip combinations that fail to fit
            continue

In [None]:
# fit a seasonal ARIMA model with order (1, 1, 1) and seasonal order (1, 1, 1, 12)
model = sm.tsa.statespace.SARIMAX(ethereum,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = model.fit()
print(results.summary().tables[1])

---
**Q: What do the model diagnostics look like for Ethereum?**

---

In [None]:
# model diagnostics to check for errors
results.plot_diagnostics(figsize=(16, 8))
plt.show()

---
**Q: What does the one-step-ahead forecast look like for Ethereum?**

---

In [None]:
# validate forecast
pred = results.get_prediction(start=pd.to_datetime('2021-01-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = ethereum['2018':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Target')
plt.legend()
plt.show()

---
**Q: What does the forecast look like for Ethereum?**

---

In [None]:
# visualization of forecast
pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()
ax = ethereum.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Target')
plt.legend()
plt.show()

---
**Thank you and Happy Learning.**

---

In [None]:
thank_you_str="Thanks,Happy Learning,Collaboration,Thankyou,Keep Learning"
# create WordCloud with converted string
wordcloud = WordCloud(width = 1000, height = 500, random_state=1, background_color='white', collocations=True).generate(thank_you_str)
plt.figure(figsize=(20, 20))
plt.imshow(wordcloud) 
plt.axis("off")
plt.show()