In [None]:
import os
import pickle

import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from matplotlib import pyplot as plt

The following definition is adapted from Wikipedia.

Cointegration is a statistical property of a collection (X1, X2, ..., Xk) of time series variables. First, all of the series must be integrated of order d (see Order of integration). Next, if a linear combination of this collection is integrated of order less than d, then the collection is said to be co-integrated. Formally, if (X,Y,Z) are each integrated of order d, and there exist coefficients a,b,c such that aX + bY + cZ is integrated of order less than d, then X, Y, and Z are cointegrated. 
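To make "integrated of order d" concrete, here is a minimal sketch on synthetic data (the series and seed are made up for illustration, not taken from the dataset). A random walk is integrated of order 1: differencing it once gives back a stationary series.

```python
import numpy as np

rng = np.random.default_rng(0)

# White noise is stationary, i.e. integrated of order 0.
eps = rng.normal(size=5_000)

# Its running sum is a random walk, which is integrated of order 1.
walk = np.cumsum(eps)

# Differencing the walk once recovers the stationary shocks.
recovered = np.diff(walk)
print(np.allclose(recovered, eps[1:]))

# Compare the spread over the first 500 points with the full sample.
print("walk:", walk[:500].std(), walk.std())
print("shocks:", eps[:500].std(), eps.std())
```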

A common example is where the individual series are first-order integrated but some (cointegrating) vector of coefficients exists to form a stationary linear combination of them. For instance, a stock market index and the price of its associated futures contract move through time, each roughly following a random walk. 

If the prices of two assets are cointegrated, they can be written in the linear form y = ax + b, where a is a constant and b (an intercept plus a residual) is stationary over time.
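A small synthetic sketch of this linear form, assuming made-up values for the slope and intercept: y is built to share the stochastic trend of x, so subtracting a*x leaves only the stationary part.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# x is a random walk (integrated of order 1).
x = np.cumsum(rng.normal(size=n))

# y shares x's stochastic trend: y = a*x + b, with b = intercept + stationary noise.
a_true, intercept = 0.9, -6.0            # hypothetical values
noise = rng.normal(scale=0.5, size=n)    # stationary component
y = a_true * x + intercept + noise

# The combination y - a*x removes the shared trend and leaves only b.
b = y - a_true * x
print("std of y:", y.std())
print("std of b:", b.std())
```

In practice a is unknown and has to be estimated, which is what the Engle–Granger procedure later in this notebook does.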

Here we run a cointegration test on each pair of assets in the dataset to check whether the pair is cointegrated.

# Read Data

In [None]:
df_ad = pd.read_csv('../input/g-research-crypto-forecasting/asset_details.csv').sort_values('Asset_ID')
id2name = {}

for row in df_ad.itertuples():
    id2name[row.Asset_ID] = row.Asset_Name
    
id2name

In [None]:
if not os.path.exists('df.p'):
    df = pd.read_csv('../input/g-research-crypto-forecasting/train.csv')

    df['trade_date'] = pd.to_datetime(df['timestamp'], unit='s')
    df.drop('timestamp', axis=1, inplace=True)
    df = df.set_index(['Asset_ID', 'trade_date']).sort_index().astype(np.float32)
    df.to_pickle('df.p')
else:
    df = pd.read_pickle('df.p')

df

Here we use only the data from 2020 onward.

In [None]:
df2y = df.query('trade_date > "2020-01-01"')
df2y.head()

Close prices of all assets on a log scale

In [None]:
for k, v in id2name.items():
    _df = df2y.loc[k]
    _df['Close'].plot(logy=True, figsize=(12, 9))

plt.legend([v for k, v in id2name.items()])

# Cointegration test of BTC and ETH classic

In [None]:
from sklearn.linear_model import LinearRegression
import statsmodels.tsa.stattools as ts

Close prices of BTC and ETH classic on a log scale

In [None]:
k1 = 1 # btc
k2 = 7 # eth classic
df_price = pd.merge(df2y.loc[k1][['Close']], df2y.loc[k2][['Close']], on='trade_date', how='inner')
df_price.plot(logy=True, figsize=(12, 9))

## Engle–Granger two-step test
If x and y are both non-stationary with order of integration d=1 and are cointegrated, then some linear combination of them is stationary. In other words, for some coefficient a:

y - ax = b

where b is stationary.

If we knew a, we could compute b directly and test it for stationarity with, for example, a Dickey–Fuller or Phillips–Perron test. Because we don't know a, we first estimate it, usually by ordinary least squares, and then run the stationarity test on the estimated b series.
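The two steps can be sketched with NumPy alone on synthetic data. This is an illustration under stated assumptions, not the notebook's method: `df_tstat` and `engle_granger_tstat` are hypothetical helper names, the unit-root step is a plain Dickey–Fuller regression without lagged differences, and only the test statistic is returned (no p-value).

```python
import numpy as np

def df_tstat(e):
    """t-statistic of rho in: diff(e)[t] = c + rho * e[t-1] + u[t]."""
    de = np.diff(e)
    X = np.column_stack([np.ones(len(de)), e[:-1]])
    beta, *_ = np.linalg.lstsq(X, de, rcond=None)
    resid = de - X @ beta
    sigma2 = resid @ resid / (len(de) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

def engle_granger_tstat(x, y):
    # Step 1: estimate the slope a and the intercept by OLS of y on x.
    a_hat, c_hat = np.polyfit(x, y, 1)
    # Step 2: test the estimated residual for a unit root.
    return a_hat, df_tstat(y - (a_hat * x + c_hat))

rng = np.random.default_rng(2)
n = 2_000
x = np.cumsum(rng.normal(size=n))
y_coint = 0.9 * x - 6.0 + rng.normal(size=n)   # shares x's trend
y_indep = np.cumsum(rng.normal(size=n))        # unrelated random walk

a_hat, t_coint = engle_granger_tstat(x, y_coint)
_, t_indep = engle_granger_tstat(x, y_indep)
print(a_hat, t_coint, t_indep)
```

A strongly negative statistic points towards a stationary residual. The cointegrated pair should give a far more negative value than the unrelated random walks.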

First we fit a linear regression of the log price of ETH classic on the log price of BTC.

In [None]:
x = np.log(df_price['Close_x'].values)
y = np.log(df_price['Close_y'].values)

lm_model = LinearRegression(fit_intercept=True, n_jobs=1)
lm_model.fit(x.reshape(-1, 1), y)        # fit() expects 2D array

lm_model.coef_, lm_model.intercept_

The fit gives approximately log(ETC) = 0.92 * log(BTC) - 6.6 + c, where ETC is ETH classic and c is the residual. If the two assets are cointegrated, c is stationary.

Then we compute the residual.

In [None]:
yfit = lm_model.coef_ * x + lm_model.intercept_
y_residual = y - yfit
df_res = df_price[[]].copy()
df_res['res'] = y_residual
df_res.plot(figsize=(12, 9))

If the residual is indeed stationary, there may be a pairs-trading opportunity between the two assets: when the residual moves far from its mean, we could go long the relatively cheap asset and short the relatively expensive one, and possibly earn an excess return as the spread reverts.
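As a rough sketch of such a rule, the residual can be turned into a z-score and thresholded. Everything here is an assumption for illustration: `spread_signals` and the `entry_z` threshold are hypothetical, the residual is synthetic noise standing in for `y_residual`, and the mean and standard deviation are taken over the full sample, which a real backtest could not do. Trading costs are ignored.

```python
import numpy as np

def spread_signals(residual, entry_z=2.0):
    """Return the z-score and a signal: +1 long y / short x, -1 short y / long x, 0 flat."""
    z = (residual - residual.mean()) / residual.std()
    signal = np.zeros(len(residual), dtype=int)
    signal[z > entry_z] = -1   # y is rich relative to x: short y, long x
    signal[z < -entry_z] = 1   # y is cheap relative to x: long y, short x
    return z, signal

rng = np.random.default_rng(3)
residual = rng.normal(scale=0.1, size=1_000)  # stand-in for y_residual
z, signal = spread_signals(residual)
print(np.unique(signal, return_counts=True))
```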

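Below is the Dickey–Fuller test on the residual. The test statistic is about -3.8 and the p-value about 0.002, so the null hypothesis of a unit root in the residual is rejected at the usual significance levels. This is evidence that the two assets are cointegrated. Note that the p-value comes from the standard Dickey–Fuller distribution; because the residual is built from an estimated coefficient, it tends to overstate the significance.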

In [None]:
rst = ts.adfuller(y_residual, maxlag=1)  # (statistic, p-value, used lag, nobs, critical values, icbest)
rst

# Pairwise cointegration test

Here we perform cointegration tests for all asset pairs and obtain their p-values.

In [None]:
def CADF(x, y):
    """Regress y on x, then return the ADF p-value of the residual."""
    lm_model = LinearRegression(fit_intercept=True, n_jobs=1)
    lm_model.fit(x.reshape(-1, 1), y)        # fit() expects a 2D array
    yfit = lm_model.coef_ * x + lm_model.intercept_
    y_residual = y - yfit
    rst = ts.adfuller(y_residual, maxlag=1)  # at most 1 lagged difference

    return rst[1]  # p-value

In [None]:
tmp = []
for k1, v1 in tqdm(id2name.items()):
    for k2, v2 in id2name.items():
        if k1 != k2:
            df_price = pd.merge(df2y.loc[k1][['Close']], df2y.loc[k2][['Close']], on='trade_date', how='inner')

            x = np.log(df_price['Close_x'].values)
            y = np.log(df_price['Close_y'].values)
            p = CADF(x, y)
            
            tmp.append((k1, v1, k2, v2, p))
            
dfp = pd.DataFrame(tmp, columns=['k1', 'v1', 'k2', 'v2', 'p'])
dfp.sort_values('p')

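Then we visualize the p-values as a heat map, borrowing scikit-learn's ConfusionMatrixDisplay for the plotting. Row k1 is the regressor x and column k2 is the dependent variable y.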

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
m = np.zeros((len(id2name), len(id2name)))

for k1, v1, k2, v2, p in tmp:
    m[k1, k2] = p

f = plt.figure(figsize=(16, 16))
ConfusionMatrixDisplay(m, display_labels=[v for k, v in id2name.items()], ).plot(include_values=False, ax=f.gca())
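The fill loop above indexes `m` directly by asset ID, which works when the IDs run contiguously from 0. A minimal sketch that does not rely on this maps each ID to a matrix position first; `pvalue_matrix` is a hypothetical helper, and the example rows (including asset 30) are invented purely to show a non-contiguous ID.

```python
import numpy as np

def pvalue_matrix(results, ids):
    """Build a square matrix of p-values; rows are k1, columns are k2."""
    pos = {asset_id: i for i, asset_id in enumerate(ids)}
    m = np.full((len(ids), len(ids)), np.nan)   # NaN marks untested cells
    for k1, _v1, k2, _v2, p in results:
        m[pos[k1], pos[k2]] = p
    return m

# Hypothetical results in the same (k1, v1, k2, v2, p) layout as `tmp`.
results = [
    (1, "Bitcoin", 7, "Ethereum Classic", 0.002),
    (7, "Ethereum Classic", 1, "Bitcoin", 0.010),
    (1, "Bitcoin", 30, "Example", 0.400),
]
m = pvalue_matrix(results, ids=[1, 7, 30])
print(m)
```

Using NaN rather than 0 for the diagonal keeps untested cells from looking like highly significant pairs.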

The figure above suggests that ETH classic has small p-values, and hence evidence of cointegration, with all other assets, whereas BTC shows this with only a few.