# G-Research Crypto Forecasting
## Use your ML expertise to predict real crypto market data

## 1. Problem Understanding

Much market commentary offers little concrete guidance on whether to buy or sell. Many people also regard the stock market as too risky to invest or trade in, so they stay away from it altogether. Understanding the seasonal variation and underlying trend of an index can help both experienced and new investors decide whether to invest.

Time series analysis is a natural tool for this kind of problem: it lets us forecast the trend, and a trend chart gives investors useful guidance.

So let us look at this concept in detail and use a machine learning technique to forecast prices.

Over $40 billion worth of cryptocurrencies are traded every day. They are among the most popular assets for speculation and investment, yet have proven wildly volatile. Fast-fluctuating prices have made millionaires of a lucky few, and delivered crushing losses to others. Could some of these price movements have been predicted in advance?

## 2. Our Goal
Forecast short-term returns on the most popular cryptocurrencies.

## 3. Data Understanding
#### 3.1. Data volume (size, number of records)
- Size of the dataset: 3.06 GB
- Number of CSV files: 5
    - train.csv (2.82 GB)
    - example_test.csv (5.92 kB)
    - asset_details.csv (444 B)
    - example_sample_submission.csv (406 B)
    - supplemental_train.csv (243.24 MB)
#### 3.2. Data attributes and their description (variables, data types)
**train.csv** - The training set, with the following columns:

* **timestamp** - A timestamp for the minute covered by the row.
* **Asset_ID** - An ID code for the cryptoasset.
* **Count** - The number of trades that took place this minute.
* **Open** - The USD price at the beginning of the minute.
* **High** - The highest USD price during the minute.
* **Low** - The lowest USD price during the minute.
* **Close** - The USD price at the end of the minute.
* **Volume** - The number of cryptoasset units traded during the minute.
* **VWAP** - The volume weighted average price for the minute.
* **Target** - 15 minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
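The exact residualization behind `Target` is not reproduced in this notebook. As rough intuition only, the sketch below computes a plain 15-minute forward log return from a synthetic `Close` series. The synthetic prices, the `forward_log_return` helper, and the decision to skip any market-residualization step are all assumptions made for illustration, so the values are not comparable to the competition's `Target` column.

```python
import numpy as np
import pandas as pd

def forward_log_return(close: pd.Series, horizon: int = 15) -> pd.Series:
    """Log return from each minute to `horizon` minutes later (hypothetical helper)."""
    return np.log(close.shift(-horizon) / close)

# Synthetic minute-level closing prices, for illustration only.
idx = pd.date_range("2021-01-01", periods=30, freq="min")
close = pd.Series(np.linspace(100.0, 129.0, 30), index=idx)

ret_15 = forward_log_return(close)
print(ret_15.head(3))
```

The last 15 rows are NaN because no price 15 minutes ahead is available for them.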


**example_test.csv** - An example of the data that will be delivered by the time series API.

**example_sample_submission.csv** - An example of the data that will be delivered by the time series API. The data is just copied from train.csv.

**asset_details.csv** - Provides the real name of the cryptoasset for each Asset_ID and the weight each cryptoasset receives in the metric.

**gresearch_crypto** - An unoptimized version of the time series API files for offline work. You may need Python 3.7 and a Linux environment to run it without errors.

**supplemental_train.csv** - After the submission period is over, this file's data will be replaced with cryptoasset prices from the submission period. In the Evaluation phase, the train, train supplement, and test set will be contiguous in time, apart from any missing data. The current copy, which is just filled with approximately the right amount of data from train.csv, is provided as a placeholder.
#### 3.3. Relationship and mapping schemes (understand attribute representations)
These financial charts are called candlestick charts because the rectangular body and the lines at either end resemble a candle with wicks. Each candlestick represents one minute's worth of price data for an asset. Over time, the candlesticks form recognizable patterns that investors use to make buying and selling decisions.
![How to easily understand the candlestick chart in the stock market](http://qph.fs.quoracdn.net/main-qimg-3543e6c07485ced5fa0c7ff35f54081e)
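Because every candle is just an (Open, High, Low, Close) summary of a time window, one-minute candles can be combined into longer ones. The sketch below builds hourly candles from synthetic one-minute data with pandas `resample`; the synthetic prices and the 60-minute window are assumptions chosen for illustration, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic one-minute candles covering two hours (illustrative values only).
idx = pd.date_range("2021-01-01 00:00", periods=120, freq="min")
rng = np.random.default_rng(0)
close = 100 + rng.normal(0, 0.5, size=120).cumsum()
minute = pd.DataFrame({"Open": np.r_[100.0, close[:-1]], "Close": close}, index=idx)
minute["High"] = minute[["Open", "Close"]].max(axis=1) + 0.1
minute["Low"] = minute[["Open", "Close"]].min(axis=1) - 0.1
minute["Volume"] = 1.0

# Combine each block of 60 one-minute candles into a single hourly candle.
hourly = minute.resample("60min").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
)
print(hourly)
```

The hourly Open is the first minute's Open, the hourly Close is the last minute's Close, and High, Low and Volume are the maximum, minimum and sum over the hour.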
#### 3.4. Basic descriptive statistics (mean, median, variance)
#### 3.5. Focus on which attributes are important for the business

In [None]:
!pip install pmdarima

In [None]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
plt.style.use('fivethirtyeight')
import plotly.express as px
from datetime import datetime
import plotly.graph_objects as go
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from statsmodels.tsa.stattools import adfuller
import warnings
warnings.filterwarnings('ignore', 'statsmodels.tsa.arima_model.ARMA',
                        FutureWarning)
warnings.filterwarnings('ignore', 'statsmodels.tsa.arima_model.ARIMA',
                        FutureWarning)
from pmdarima.arima import auto_arima
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.metrics import mean_squared_error, mean_absolute_error

### Load the training set

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
path = "../input/g-research-crypto-forecasting/"
train = pd.read_csv(path + "train.csv")
# Convert the Unix timestamp (in seconds) to datetime and use it as the index
train['timestamp'] = pd.to_datetime(train['timestamp'], unit='s')
train = train.set_index('timestamp')

asset_details = pd.read_csv(path + "asset_details.csv")

supplemental_train = pd.read_csv(path + "supplemental_train.csv")
# Same conversion for the supplemental training data
supplemental_train['timestamp'] = pd.to_datetime(supplemental_train['timestamp'], unit='s')
supplemental_train = supplemental_train.set_index('timestamp')
example_test = pd.read_csv(path + "example_test.csv")
train.head(8)

Each row of the dataset holds the trading data for one asset at a given minute timestamp, as the first 8 rows above show.

In [None]:
train.tail(3)

In [None]:
train.describe()

## 4. Exploratory Data Analysis

Printing the DataFrame's info shows its columns, data types and memory usage:

In [None]:
train.info()

**asset_details.csv** - Provides the real name of the cryptoasset for each Asset_ID and the weight each cryptoasset receives in the metric.

In [None]:
asset_details

In [None]:
train.shape

In [None]:
train.isnull().sum()

In [None]:
import plotly.express as px
asset_details = asset_details.sort_values(by='Weight', ascending=False)
fig = px.bar(asset_details, x='Asset_Name', y='Weight', title="Cryptocurrencies and their metric weights")
fig.show()

Let's visualize the closing price of several cryptocurrencies over the training period.

In [None]:
btc = train[train["Asset_ID"]==1] # Asset_ID = 1 for Bitcoin
btc_mini = btc.iloc[-200:] # Select the most recent 200 rows

eth = train[train["Asset_ID"]==6] # Asset_ID = 6 for Ethereum
eth_mini = eth.iloc[-200:] # Select the most recent 200 rows

crd = train[train["Asset_ID"]==3] # Asset_ID = 3 for Cardano
crd_mini = crd.iloc[-400:] # Select the most recent 400 rows

bc = train[train["Asset_ID"]==0] # Asset_ID = 0 for Binance Coin
bc_mini = bc.iloc[-400:] # Select the most recent 400 rows

dg = train[train["Asset_ID"]==4] # Asset_ID = 4 for Dogecoin
dg_mini = dg.iloc[-400:] # Select the most recent 400 rows

ltc = train[train["Asset_ID"]==9] # Asset_ID = 9 for Litecoin

In [None]:
import matplotlib.pyplot as plt

# Closing price over the full training period, one panel per asset:
# (DataFrame, legend label, line color; None keeps the default color)
panels = [
    (btc, 'Bitcoin', None),
    (eth, 'Ethereum', 'red'),
    (crd, 'Cardano', 'red'),
    (bc, 'Binance Coin', None),
    (dg, 'Dogecoin', None),
    (ltc, 'Litecoin', 'red'),
]

fig, axes = plt.subplots(3, 2, figsize=(15, 7))
for ax, (df, label, color) in zip(axes.flat, panels):
    ax.plot(df['Close'], label=label, color=color)
    ax.legend()
plt.tight_layout()
plt.show()

In [None]:
import plotly.graph_objects as go

def plotcandlcharts(asset_df, title):
    """Draw a candlestick chart from a DataFrame with Open/High/Low/Close columns."""
    fig = go.Figure(data=[go.Candlestick(x=asset_df.index, open=asset_df['Open'], high=asset_df['High'], low=asset_df['Low'], close=asset_df['Close'])])
    fig.update_layout(
        title=title,
        xaxis_title="Time",
        yaxis_title="Price (USD)")

    fig.show()
plotcandlcharts(btc_mini, 'Candlestick chart for Bitcoin (last 200 one-minute candles)')
plotcandlcharts(eth_mini, 'Candlestick chart for Ethereum (last 200 one-minute candles)')
plotcandlcharts(crd_mini, 'Candlestick chart for Cardano (last 400 one-minute candles)')

In [None]:
train1 = pd.read_csv("../input/g-research-crypto-forecasting/train.csv")

asset_details = pd.read_csv("../input/g-research-crypto-forecasting/asset_details.csv")
mapping = dict(asset_details[['Asset_ID', 'Asset_Name']].values)
train1["Asset name"] = train1["Asset_ID"].map(mapping)

def asset_frame(asset_id):
    """Return the rows of one asset, indexed by datetime."""
    df = train1[train1["Asset_ID"] == asset_id].reset_index(drop=True)
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
    return df.set_index('timestamp')

bitcoin = asset_frame(1)    # Bitcoin
ethereum = asset_frame(6)   # Ethereum
cardano = asset_frame(3)    # Cardano
maker = asset_frame(10)     # Maker

In [None]:
import seaborn as sns
plt.figure(figsize=(10,6))
sns.heatmap(train[['Count','Open','High','Low','Close','Volume','VWAP','Target']].corr(), vmin=-1.0, vmax=1.0, annot=True, linewidths=0.1)
plt.show()

## Is our dataset stationary?

### What is a stationary series?

1. The mean of the series should not be a function of time. The red graph below is not stationary because its mean increases over time.
![](https://assets.moncoachdata.com/v7/moncoachdata.com/wp-content/uploads/2020/01/moyenne-series-temporelles-stationnaires.png?w=555)

2. The variance of the series must not be a function of time. In the red graph below, the variance of the data changes over time.
![](https://assets.moncoachdata.com/v7/moncoachdata.com/wp-content/uploads/2020/01/variance-series-temporelles-stationnaires.png?w=555)

3. Finally, the covariance between the i-th term and the (i + m)-th term must not be a function of time. In the following graph, the spread narrows as time increases, so the covariance is not constant over time for the “red series”.
![](https://assets.moncoachdata.com/v7/moncoachdata.com/wp-content/uploads/2020/01/covariance-series-temporelles-stationnaires.png?w=555)

If a time series is stationary and exhibits a particular behavior during a given time interval, it is safe to assume that it will exhibit the same behavior at a later time. Most statistical modeling methods assume or require the time series to be stationary.

Before we can build a model, we therefore need to check whether the time series is stationary. There are two main ways to do so:

* **Rolling statistics**: Plot the moving average and the moving standard deviation. The time series is stationary if both remain roughly constant over time (by eye, check whether the lines are flat and parallel to the x axis).
* **Augmented Dickey-Fuller (ADF) test**: The null hypothesis is that the series has a unit root, i.e. that it is non-stationary. The series is considered stationary if the p-value is below the chosen significance level (for example 0.05), or equivalently if the ADF statistic is more negative than the critical value at the 1%, 5% or 10% level.

For readers unfamiliar with moving averages: unlike a single overall average, a 10-day moving average takes the average of the first 10 closing prices as its first data point, then slides the window forward by one day for each subsequent data point. The code below uses a window of 12 rows, i.e. 12 one-minute observations.
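A tiny worked example may make the sliding window concrete. It uses a 3-point window on six made-up prices (both are assumptions chosen to keep the arithmetic easy to check): the third output value is the mean of 10, 11 and 12, the fourth is the mean of 11, 12 and 13, and so on.

```python
import pandas as pd

# Six made-up prices, for illustration only.
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# The first two entries are NaN because a 3-point window is not yet full.
rolling_mean = prices.rolling(window=3).mean()
print(rolling_mean)
```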

In [None]:
df_close_bitcoin = bitcoin['Close']
df_close_ethereum = ethereum['Close']
df_close_cardano = cardano['Close']
df_close_maker = maker['Close']

In [None]:
def test_stationarity(df_close):
    """Plot a series together with its rolling mean and rolling standard deviation."""
    plt.figure(figsize=(10,6))
    rolling_mean = df_close.rolling(window = 12).mean()
    rolling_std = df_close.rolling(window = 12).std()
    plt.plot(df_close, color = 'blue', label = 'Original')
    plt.plot(rolling_mean, color = 'red', label = 'Rolling mean')
    plt.plot(rolling_std, color = 'black', label = 'Rolling standard deviation')
    plt.legend(loc = 'best')
    plt.title('Rolling mean and standard deviation')
    plt.show()

test_stationarity(df_close_bitcoin)
test_stationarity(df_close_ethereum)
test_stationarity(df_close_cardano)

As you can see, the moving average and the moving standard deviation change over time instead of staying flat. We can therefore conclude that the time series is not stationary.

In [None]:
df_log = np.log(df_close_bitcoin)
plt.plot(df_log)

In [None]:
# Decompose the BTC close price into trend, seasonal and residual components
result = seasonal_decompose(df_close_bitcoin, model='multiplicative', period = 30)
fig = result.plot()
fig.set_size_inches(16, 9)

In [None]:
# Split the data into training and test sets (first 90% / last 10%, in time order)
train_data, test_data = df_log[3:int(len(df_log)*0.9)], df_log[int(len(df_log)*0.9):]
plt.figure(figsize=(10,6))
plt.grid(True)
plt.xlabel('Dates')
plt.ylabel('Log closing price')
plt.plot(train_data, 'green', label='Train data')
plt.plot(test_data, 'blue', label='Test data')
plt.legend()

In [None]:
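bitcoin['timestamp'] = bitcoin.index
# Fit ARIMA(2,1,0) on the minute-to-minute change in the BTC close price.
# The input is already differenced once, and d=1 differences it again inside the model.
# ARIMA here is the legacy statsmodels.tsa.arima_model class imported above,
# which statsmodels 0.13 and later no longer support.
rcParams['figure.figsize'] = 16, 6
modelARIMA = ARIMA(bitcoin["Close"].diff().iloc[1:].values, order=(2,1,0))
result = modelARIMA.fit()
print(result.summary())
result.plot_predict(start=700, end=1000)
_ = plt.title('ARIMA(2,1,0): predicted vs. actual change in BTC close price')
plt.show()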

In [None]:
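import math
# Evaluate on the same differenced series the model was fitted on, so the indices line up
diff_values = bitcoin["Close"].diff().iloc[1:].values
actual = diff_values[1000:1101]
# typ='levels' returns predictions on the scale of the model input (price changes)
predicted = result.predict(start=1000, end=999 + len(actual), typ='levels')
rmse = math.sqrt(mean_squared_error(actual, predicted))
print("The root mean squared error is {}.".format(rmse))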

In [None]:
predicted_result = result.predict()

In [None]:
import gresearch_crypto
env = gresearch_crypto.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test set and sample submission
for (df_test, sample_prediction_df) in iter_test:
    sample_prediction_df['Target'] = 0  # make your predictions here
    env.predict(sample_prediction_df)   # register your predictions