---
# G-Research Crypto Forecasting - EDA

---

**Problem Statement:**

* Over $40 billion worth of cryptocurrencies are traded every day.
* They are among the most popular assets for speculation and investment, yet have proven wildly volatile.
* Fast-fluctuating prices have made millionaires of a lucky few and delivered crushing losses to others.
* Could some of these price movements have been predicted in advance?

* The task is to use machine learning to forecast short-term returns in 14 popular cryptocurrencies.
* The dataset contains millions of rows of high-frequency market data dating back to 2018, which can be used to build the ML model.
* Once the submission deadline has passed, the final score will be calculated over the following 3 months using live crypto data as it is collected.
 
---
 
**Main Dataset:**

* **train.csv** file contains the training set.
* **example_test.csv** file is an example of the data that will be delivered by the time series API.
* **example_sample_submission.csv** file is an example of the data that will be delivered by the time series API. The data is just copied from train.csv.
* **asset_details.csv** file provides the real name of the cryptoasset for each Asset_ID and the weight each cryptoasset receives in the metric.
* **supplemental_train.csv** file: after the submission period is over, this file's data will be replaced with cryptoasset prices from the submission period.

**gresearch_crypto** - An unoptimized version of the time series API files for offline work. You may need Python 3.7 and a Linux environment to run it without errors.

---

---
**Importing Libraries:**

* Python is used for data pre-processing and data analysis.

* The libraries needed for data loading, analysis, and visualization are imported below.

---

In [None]:
# Custom G-Research Python module for the time series API (not used in this notebook)
import gresearch_crypto

# pandas and NumPy for data manipulation
import pandas as pd
import numpy as np

# os provides a portable way of using operating-system-dependent functionality
import os

# pyplot interface of matplotlib
import matplotlib.pyplot as plt

# seaborn for statistical visualization
import seaborn as sns

# WordCloud for text data visualization
from wordcloud import WordCloud

# matplotlib for plots
import matplotlib as mpl

# datetime for working with dates and times
from datetime import datetime

# Plotly Express for interactive visualization
import plotly.express as px

# statsmodels for statistical methods
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

# rcParams for customizing matplotlib properties
from matplotlib import rcParams

In [None]:
# Custom G-Research python module requires this step
#env = gresearch_crypto.make_env()

In [None]:
# Display files available in this competition folder
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

---
# Data Definition/Description

---

---
**Data Definition:**

---

* **train.csv:** This dataset contains information on historic trades for several cryptoassets.

In [None]:
# load training data
train_df = pd.read_csv('../input/g-research-crypto-forecasting/train.csv', low_memory=False)
# load asset details data
asset_details_df = pd.read_csv('../input/g-research-crypto-forecasting/asset_details.csv', low_memory=False)

---
**Q: What is the structure of the train dataset?**

---

| No. | Feature Name | Description of the feature |
| :-- | :--| :--| 
|01| **timestamp**   | A timestamp for the minute covered by the row |
|02| **Asset_ID** | An ID code for the cryptoasset                 |
|03| **Count**   | The number of trades that took place this minute                 |
|04| **Open**   | The USD price at the beginning of the minute  |
|05| **High**   | The highest USD price during the minute|
|06| **Low**   | The lowest USD price during the minute  |
|07| **Close**   | The USD price at the end of the minute|
|08| **Volume**   | The number of cryptoasset units traded during the minute|
|09| **VWAP**   | The volume weighted average price for the minute  |
|10| **Target**   | 15 minute residualized returns|
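
To make the VWAP column concrete, the sketch below computes a volume weighted average price from a few made-up trades within one minute. The trade prices and sizes are hypothetical, and the dataset's VWAP is assumed to follow the standard definition: total traded value divided by total traded volume.

```python
import pandas as pd

# Hypothetical trades within a single minute (illustrative values only)
trades = pd.DataFrame({
    'price':  [100.0, 101.0, 99.0],
    'volume': [1.0, 3.0, 1.0],
})

# Standard VWAP definition: total traded value divided by total traded volume
vwap = (trades['price'] * trades['volume']).sum() / trades['volume'].sum()

# Simple (unweighted) mean price for comparison
mean_price = trades['price'].mean()

print('VWAP:', vwap)
print('Mean price:', mean_price)
```

Because most of the volume traded at the higher price, the VWAP sits above the unweighted mean price.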

In [None]:
# get shape of dataframe
print('Shape of train dataset is:', train_df.shape)

# print summary of dataframe
train_df.info(show_counts=True)

**train dataset information:**

* There are 24236806 data points (rows) and 10 features (columns) in the train dataset.
* Seven columns are of numerical float type and three columns are of numerical int type.
* VWAP and Target have missing values (their non-null counts are lower than the row count).
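
The non-null counts from `info()` can be turned into explicit missing-value counts. The sketch below does this on a small made-up frame; `toy_df` and its values are hypothetical stand-ins for `train_df`.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few train.csv columns (values are made up)
toy_df = pd.DataFrame({
    'timestamp': [1514764860, 1514764860, 1514764920, 1514764920],
    'Asset_ID':  [1, 6, 1, 6],
    'VWAP':      [13827.1, np.nan, 13841.0, 738.5],
    'Target':    [-0.0146, np.nan, np.nan, -0.0047],
})

# Number and percentage of missing values per column
missing_count = toy_df.isna().sum()
missing_pct = (toy_df.isna().mean() * 100).round(1)

missing_summary = pd.DataFrame({'missing': missing_count, 'percent': missing_pct})
print(missing_summary)
```

The same two lines should work unchanged on `train_df`, although on 24 million rows they take noticeably longer.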

---
**Q: What does the data look like for the train dataset?**

---

In [None]:
# print first 10 rows of dataframe
train_df.head(10)

---
**Q: What does the data look like for the asset_details dataset?**

---

In [None]:
# show asset details sorted by Weight in descending order
asset_details_df.sort_values(by=['Weight'], ascending=False)

---
**Q: What are the descriptive statistics for the train dataset?**

---

In [None]:
# print descriptive statistics for train dataset
train_df.describe(include='all').round(1)

**train dataset data description:**

* timestamp has no missing values, and its summary statistics suggest a roughly symmetric spread.
* Asset_ID has no missing values; it is an identifier code, so the shape of its distribution carries little meaning.
* Count, Open, High, Low, Close and Volume have no missing values, but their distributions appear skewed.
* VWAP and Target have missing values; their remaining values appear roughly symmetric.
* These impressions come from summary statistics only; `describe()` does not test for a normal distribution.
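
Impressions of symmetry or skew from `describe()` can be backed up with the sample skewness. The sketch below uses synthetic data only; the `demo` frame and its column names are hypothetical and simply contrast a symmetric column with a right-skewed one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns: one symmetric, one with a heavy right tail
demo = pd.DataFrame({
    'symmetric': rng.normal(loc=0.0, scale=1.0, size=10_000),
    'right_skewed': rng.lognormal(mean=0.0, sigma=1.0, size=10_000),
})

# Sample skewness: near 0 for symmetric data, clearly positive for a right tail
skewness = demo.skew()
print(skewness.round(2))

# For right-skewed columns the mean sits well above the median
print(demo.agg(['mean', 'median']).round(2))
```

Applied to the train dataset, a call such as `train_df[['Count', 'Volume']].skew()` would give a comparable check for the columns described as skewed.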

The next step is to merge train_df and asset_details_df so that each row carries its asset name and weight.
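
A merge like this can be guarded against accidental row duplication. The sketch below is one possible approach on made-up data: `toy_train` and `toy_assets` are hypothetical stand-ins, and `how='left'` with `validate='many_to_one'` is a suggested variation rather than what this notebook runs (the notebook uses the default inner join).

```python
import pandas as pd

# Toy stand-ins for train_df and asset_details_df (values are made up)
toy_train = pd.DataFrame({
    'timestamp': [1514764860, 1514764860, 1514764920],
    'Asset_ID':  [1, 6, 1],
    'Close':     [13850.2, 738.5, 13828.1],
})
toy_assets = pd.DataFrame({
    'Asset_ID':   [1, 6],
    'Weight':     [6.78, 5.89],
    'Asset_Name': ['Bitcoin', 'Ethereum'],
})

# validate='many_to_one' raises if Asset_ID is duplicated in the right-hand table;
# how='left' keeps every training row even when an asset has no details entry
toy_merged = pd.merge(toy_train, toy_assets, on='Asset_ID',
                      how='left', validate='many_to_one')

# A left merge against unique keys should never change the number of rows
assert len(toy_merged) == len(toy_train)
print(toy_merged)
```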

In [None]:
# merge dataframe using Asset_ID as key
gresearch_crypto_df = pd.merge(train_df,asset_details_df,on=['Asset_ID'])

In [None]:
# print first 10 rows of merged dataframe
gresearch_crypto_df.head(10)

In [None]:
# print last 10 rows of merged dataframe
gresearch_crypto_df.tail(10)

Rows with missing values are handled by dropping them.
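
Before dropping rows, it can help to check whether the loss is spread evenly across assets. The sketch below measures the share of rows removed per asset on a made-up frame; `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy merged frame (values are made up)
toy = pd.DataFrame({
    'Asset_Name': ['Bitcoin', 'Bitcoin', 'Dogecoin', 'Dogecoin', 'Dogecoin'],
    'VWAP':       [13827.1, 13841.0, np.nan, 0.0089, 0.0090],
    'Target':     [-0.0146, np.nan, 0.0012, np.nan, 0.0031],
})

# Row counts per asset before and after dropping any row with a missing value
rows_before = toy.groupby('Asset_Name').size()
rows_after = toy.dropna(axis=0).groupby('Asset_Name').size()

# Share of rows lost per asset, to see whether dropping hits some assets harder
lost_share = (1 - rows_after / rows_before).round(2)
print(lost_share)
```

If one asset loses far more rows than the others, the sample drawn afterwards will under-represent it.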

In [None]:
# drop missing value from dataframe
gresearch_crypto_df.dropna(axis=0, inplace=True)

In [None]:
# print summary of dataframe
gresearch_crypto_df.info(show_counts=True)

In [None]:
# create sample of dataset for EDA (For faster execution and to avoid out of memory issue for notebook)
gresearch_crypto_df_sample = gresearch_crypto_df.sample(n=1000000, random_state=1)

In [None]:
# print summary of sample dataframe
gresearch_crypto_df_sample.info()

In [None]:
# convert timestamp to Datetime for EDA purpose
gresearch_crypto_df_sample['Datetime'] = pd.to_datetime(gresearch_crypto_df_sample['timestamp'], unit='s')

In [None]:
# print first 10 rows of sample dataframe
gresearch_crypto_df_sample.head(10)

---
# Data Analysis/EDA

---

---
**Q: What is the distribution percentage for Asset_Name in sample dataset?**

---

In [None]:
# distribution percentage for Asset_Name in sample dataset
fig = px.pie(gresearch_crypto_df_sample, names='Asset_Name')
fig.show()

The top 5 cryptoassets (Bitcoin, Ethereum, Cardano, Binance Coin and Dogecoin) seem to have similar shares of the sample dataset.
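
The shares shown in the pie chart can also be read off as numbers. The sketch below computes the percentage of rows per asset on a made-up sample; `toy_sample` is a hypothetical stand-in for `gresearch_crypto_df_sample`.

```python
import pandas as pd

# Toy sample of asset names (made up)
toy_sample = pd.DataFrame({
    'Asset_Name': ['Bitcoin', 'Ethereum', 'Bitcoin', 'Dogecoin',
                   'Ethereum', 'Bitcoin', 'Cardano', 'Dogecoin'],
})

# Share of rows per asset, in percent, sorted from largest to smallest
asset_share = (toy_sample['Asset_Name']
               .value_counts(normalize=True)
               .mul(100)
               .round(1))
print(asset_share)
```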

---
**Q: What is the distribution for Asset_Name with respect to Weight?**

---

In [None]:
# setting the dimensions of the plot
fig, ax = plt.subplots(figsize=(17,8))
# distribution for Asset_Name with respect to Weight
sns.barplot(data=gresearch_crypto_df_sample, x='Asset_Name', y='Weight',ax=ax)
plt.show()

Bitcoin, Ethereum, Cardano, Binance Coin and Dogecoin appear to be the top five cryptoassets by Weight.

---
**Q: What is the distribution for Asset_Name with respect to Count?**

---

In [None]:
# setting the dimensions of the plot
fig, ax = plt.subplots(figsize=(17,8))
# distribution for Asset_Name with respect to Count
sns.barplot(data=gresearch_crypto_df_sample, x='Asset_Name', y='Count',ax=ax)
plt.show()

Count: The number of trades that took place this minute

Bitcoin, Ethereum, Cardano, Binance Coin and Dogecoin appear to have the highest counts.

---
**Q: What is the trend for Asset_Name with respect to Count?**

---

In [None]:
# setting the dimensions of the plot
fig, ax = plt.subplots(figsize=(17,8))
# trend for Asset_Name with respect to Count
sns.lineplot(data=gresearch_crypto_df_sample, x='Datetime', y='Count',hue='Asset_Name', ax=ax)
plt.show()

---
**Q: What is the distribution for Asset_Name with respect to Volume?**

---

In [None]:
# setting the dimensions of the plot
fig, ax = plt.subplots(figsize=(17,8))
# distribution for Asset_Name with respect to Volume
sns.barplot(data=gresearch_crypto_df_sample, x='Asset_Name', y='Volume',ax=ax)
plt.show()

Volume: The number of cryptoasset units traded during the minute

Dogecoin appears to be the leader by number of units traded.
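
Because Volume counts units rather than USD, a low-priced asset can lead on units while trading less value. The sketch below approximates USD value traded as Volume times VWAP on made-up bars; `toy_bars` and the `Dollar_Volume` column are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy minute bars (values are made up)
toy_bars = pd.DataFrame({
    'Asset_Name': ['Bitcoin', 'Bitcoin', 'Dogecoin', 'Dogecoin'],
    'Volume':     [30.0, 50.0, 1_000_000.0, 3_000_000.0],  # units traded
    'VWAP':       [40_000.0, 40_000.0, 0.25, 0.25],         # USD per unit
})

# Approximate USD value traded per minute: units traded times average price
toy_bars['Dollar_Volume'] = toy_bars['Volume'] * toy_bars['VWAP']

# Compare mean units traded with mean USD value traded per asset
comparison = toy_bars.groupby('Asset_Name')[['Volume', 'Dollar_Volume']].mean()
print(comparison)
```

In this toy data Dogecoin leads on units while Bitcoin leads on USD value, which is why the two rankings are worth checking separately.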

---
**Q: What is the trend for Asset_Name with respect to Volume?**

---

In [None]:
# setting the dimensions of the plot
fig, ax = plt.subplots(figsize=(17,8))
# trend for Asset_Name with respect to Volume
sns.lineplot(data=gresearch_crypto_df_sample, x='Datetime', y='Volume',hue='Asset_Name', ax=ax)
plt.show()

---
**Q: What is the distribution for Asset_Name with respect to High?**

---

In [None]:
# setting the dimensions of the plot
fig, ax = plt.subplots(figsize=(17,8))
# distribution for Asset_Name with respect to High
sns.barplot(data=gresearch_crypto_df_sample, x='Asset_Name', y='High',ax=ax)
plt.show()

High: The highest USD price during the minute

Bitcoin appears to have the highest price by a wide margin.

---
**Q: What is the trend for Asset_Name with respect to High?**

---

In [None]:
# setting the dimensions of the plot
fig, ax = plt.subplots(figsize=(17,8))
# trend for Asset_Name with respect to High
sns.lineplot(data=gresearch_crypto_df_sample, x='Datetime', y='High',hue='Asset_Name', ax=ax)
plt.show()

---
**Q: What is the distribution for Asset_Name with respect to Target?**

---

In [None]:
# setting the dimensions of the plot
fig, ax = plt.subplots(figsize=(17,8))
# distribution for Asset_Name with respect to Target
sns.boxplot(data=gresearch_crypto_df_sample, x='Asset_Name', y='Target',ax=ax)
plt.show()

Target: 15 minute residualized returns

Bitcoin appears to have the most fluctuations, followed by Dogecoin and Ethereum Classic.
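
This notebook does not spell out how Target is constructed. As a rough illustration of what "residualized" can mean, the sketch below computes 15-minute log returns for two synthetic assets, builds a weighted market return, and removes each asset's market component using a single regression coefficient. The prices, the weights, and the use of one full-sample beta are all assumptions for illustration; the competition's actual formula may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500  # number of synthetic minutes

# Synthetic minute-close prices for two hypothetical assets
prices = pd.DataFrame({
    'A': 100 * np.exp(np.cumsum(rng.normal(0, 0.001, n))),
    'B': 50 * np.exp(np.cumsum(rng.normal(0, 0.002, n))),
})
weights = pd.Series({'A': 2.0, 'B': 1.0})  # hypothetical asset weights

# 15-minute forward log return: log(P[t+15] / P[t])
log_ret = np.log(prices.shift(-15) / prices).dropna()

# Weighted average "market" return across assets
market = (log_ret * weights).sum(axis=1) / weights.sum()

# Beta of each asset to the market (no intercept), then remove that component
beta = log_ret.mul(market, axis=0).mean() / (market ** 2).mean()
residual = log_ret - np.outer(market, beta)

print(beta.round(3))
print(residual.head())
```

By construction the residual of each asset is uncorrelated (in the no-intercept sense) with the market return, which is the property a residualized target is meant to have.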

Let us further explore the trend for Bitcoin, Ethereum, Cardano, Binance Coin and Dogecoin with respect to Target.
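
The trend analysis that follows aggregates minute-level Target values to monthly means. The sketch below shows how `resample('MS')` buckets timestamps on a tiny made-up series; `toy_target` and its values are hypothetical.

```python
import pandas as pd

# Synthetic minute-level Target values spanning a month boundary (made up)
idx = pd.date_range('2021-01-31 23:58', periods=4, freq='min')
toy_target = pd.Series([0.01, 0.03, -0.02, 0.04], index=idx, name='Target')

# 'MS' labels each bucket by the first day of its month
monthly_mean = toy_target.resample('MS').mean()
print(monthly_mean)
```

The first two minutes fall in January and the last two in February, so the result has two rows labeled 2021-01-01 and 2021-02-01.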

In [None]:
# convert timestamp to Datetime for further analysis
gresearch_crypto_df['Datetime'] = pd.to_datetime(gresearch_crypto_df['timestamp'], unit='s')

In [None]:
# Look up an asset by name and return its monthly mean Target series
def asset_lookup(asset_name):
    # keep only rows for the requested Asset_Name
    crypto_asset = gresearch_crypto_df.loc[gresearch_crypto_df['Asset_Name'] == asset_name]
    # keep only the Datetime and Target columns
    crypto_asset_df = crypto_asset[['Datetime', 'Target']]
    # sort by Datetime, aggregate duplicate timestamps and set Datetime as index
    crypto_asset_df = crypto_asset_df.sort_values('Datetime')
    crypto_asset_df = crypto_asset_df.groupby('Datetime')['Target'].sum().reset_index()
    crypto_asset_df = crypto_asset_df.set_index('Datetime')
    # resample to month-start frequency and take the monthly mean
    return crypto_asset_df['Target'].resample('MS').mean()

In [None]:
# check for stationarity using the ADF test for a given asset series
def asset_stationarity(asset_series):
    result = adfuller(asset_series)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))

In [None]:
# visualization for trend, seasonality and residual for a given asset series
def asset_decompose(asset_series):
    rcParams['figure.figsize'] = 18, 8
    decomposition = sm.tsa.seasonal_decompose(asset_series, model='additive')
    decomposition.plot()
    plt.show()

---
**Q: What is the trend for Bitcoin with respect to Target?**

---

In [None]:
# visualization for Bitcoin trend
bitcoin = asset_lookup('Bitcoin')
bitcoin.plot(figsize=(15, 6))
plt.show()

---
**Q: What is the trend, seasonality and residuals for Bitcoin with respect to Target?**

---

In [None]:
# visualization for trend,seasonality and residual
asset_decompose(bitcoin)

In [None]:
# check for stationarity using ADF Test
asset_stationarity(bitcoin)

---
**Q: What is the trend for Ethereum with respect to Target?**

---

In [None]:
# visualization for trend
ethereum = asset_lookup('Ethereum')
ethereum.plot(figsize=(15, 6))
plt.show()

---
**Q: What is the trend, seasonality and residuals for Ethereum with respect to Target?**

---

In [None]:
# visualization for trend,seasonality and residual
asset_decompose(ethereum)

In [None]:
# check for stationarity using ADF Test
asset_stationarity(ethereum)

---
**Q: What is the trend for Cardano with respect to Target?**

---

In [None]:
# visualization for trend
cardano = asset_lookup('Cardano')
cardano.plot(figsize=(15, 6))
plt.show()

---
**Q: What is the trend, seasonality and residuals for Cardano with respect to Target?**

---

In [None]:
# visualization for trend,seasonality and residual
asset_decompose(cardano)

In [None]:
# check for stationarity using ADF Test
asset_stationarity(cardano)

---
**Q: What is the trend for Binance Coin with respect to Target?**

---

In [None]:
# visualization for trend
binancecoin = asset_lookup('Binance Coin')
binancecoin.plot(figsize=(15, 6))
plt.show()

---
**Q: What is the trend, seasonality and residuals for Binance Coin with respect to Target?**

---

In [None]:
# visualization for trend,seasonality and residual
asset_decompose(binancecoin)

In [None]:
# check for stationarity using ADF Test
asset_stationarity(binancecoin)

---
**Q: What is the trend for Dogecoin with respect to Target?**

---

In [None]:
# visualization for trend
dogecoin = asset_lookup('Dogecoin')
dogecoin.plot(figsize=(15, 6))
plt.show()

---
**Q: What is the trend, seasonality and residuals for Dogecoin with respect to Target?**

---

In [None]:
# visualization for trend,seasonality and residual
asset_decompose(dogecoin)

In [None]:
# check for stationarity using ADF Test
asset_stationarity(dogecoin)

---
# Summary

---

* Based on the EDA and the ADF tests for the top 5 cryptoassets, most of the monthly Target series appear to be stationary.

* Most of the top 5 assets appear to be on a relative rise and having a good run.

* Bitcoin appears to be something of an outlier among the assets examined.

* New cryptoassets may arrive in 2022, given the rising popularity of cryptocurrency.

* The cryptocurrency sector may continue its transition toward becoming a mainstream asset class in 2022.

---
**Thank you and Happy Learning.**

---

In [None]:
thank_you_str="Thanks,Happy Learning,Collaboration,Thankyou,Keep Learning"
# create WordCloud with converted string
wordcloud = WordCloud(width = 1000, height = 500, random_state=1, background_color='white', collocations=True).generate(thank_you_str)
plt.figure(figsize=(20, 20))
plt.imshow(wordcloud) 
plt.axis("off")
plt.show()