# In this notebook, we compute the transfer entropy (TE) and test its statistical significance

$TE_{B \rightarrow A} = H(A^+|A)-H(A^+|A,B)$

where $H(X|Y) = -E(\log P(X|Y)) = -\sum\limits_{x,y} P(x,y)\log P(x|y)$ is the conditional entropy of $X$ given $Y$; $A$ and $B$ are time series, and $A^+$ is the "future" of $A$ (here, its next observation).

Here $TE_{XBT \rightarrow ETH}$ represents how much knowing XBT’s past helps predict ETH, beyond what ETH’s own past can tell us.
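
To make the definition concrete, here is a minimal sketch of a plug-in (empirical frequency) estimate of $TE_{B \rightarrow A}$ for discrete series with a one-step history. The helper names `entropy` and `transfer_entropy` and the synthetic data are illustrative assumptions; this is not the estimator used by PV-TE.

```python
import numpy as np

def entropy(*cols):
    """Plug-in joint entropy (in nats) of one or more discrete sequences."""
    stacked = np.column_stack(cols)
    _, counts = np.unique(stacked, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def transfer_entropy(source, target):
    """TE_{source -> target} = H(A+|A) - H(A+|A,B), using H(X|Y) = H(X,Y) - H(Y)."""
    a_past, a_next, b_past = target[:-1], target[1:], source[:-1]
    h_next_given_a = entropy(a_next, a_past) - entropy(a_past)
    h_next_given_ab = entropy(a_next, a_past, b_past) - entropy(a_past, b_past)
    return h_next_given_a - h_next_given_ab

# Synthetic example (assumption): A copies B with a one-step delay.
rng = np.random.default_rng(0)
b = rng.integers(0, 2, size=5000)
a = np.roll(b, 1)  # a[t + 1] = b[t]
a[0] = 0

te_b_to_a = transfer_entropy(b, a)  # close to log(2): B's past determines A's future
te_a_to_b = transfer_entropy(a, b)  # close to 0: A's past adds nothing about B's future
print(te_b_to_a, te_a_to_b)
```

In this toy case, knowing $B$ removes all uncertainty about $A^+$, so the estimate approaches $\log 2$ nats, while the reverse direction stays near zero.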

We first synchronize the asynchronous time series of XBT's features and ETH's prices. Then we compute the transfer entropy and test the statistical significance of each feature with a $\chi^2$ test (cf. [arXiv:2206.10173v1](https://arxiv.org/abs/2206.10173#) by Christian Bongiorno & Damien Challet), using the [PV-TE repository](https://github.com/bongiornoc/PV-TE) on Christian Bongiorno's GitHub.
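
As a rough illustration of how a $\chi^2$ test can be attached to a transfer entropy estimate, the sketch below uses the classical likelihood-ratio (G-test) result for discrete data: under the null hypothesis of no transfer, $2N \cdot TE$ (in nats) is asymptotically $\chi^2$ distributed with $|A|(|A^+|-1)(|B|-1)$ degrees of freedom. The function names and the synthetic series are assumptions made for this example; the PV-TE code may compute its statistic differently.

```python
import numpy as np
from scipy.stats import chi2

def joint_entropy(*cols):
    """Plug-in joint entropy (in nats) of discrete sequences."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def te_chi2_test(source, target):
    """Return (TE, statistic, dof, p-value) for TE_{source -> target}, one-step history."""
    a_past, a_next, b_past = target[:-1], target[1:], source[:-1]
    n = len(a_next)
    te = (joint_entropy(a_next, a_past) - joint_entropy(a_past)
          - joint_entropy(a_next, a_past, b_past) + joint_entropy(a_past, b_past))
    te = max(te, 0.0)  # guard against tiny negative rounding errors
    stat = 2.0 * n * te  # likelihood-ratio statistic
    dof = (len(np.unique(a_past))
           * (len(np.unique(a_next)) - 1)
           * (len(np.unique(b_past)) - 1))
    return te, stat, dof, chi2.sf(stat, dof)

# Synthetic example (assumption): A follows B's previous value 80% of the time.
rng = np.random.default_rng(1)
n = 4000
b = rng.integers(0, 2, size=n)
noise = rng.integers(0, 2, size=n)
follow = rng.random(n) < 0.8
a = np.zeros(n, dtype=int)
a[1:] = np.where(follow[1:], b[:-1], noise[1:])

te_fwd, stat_fwd, dof_fwd, p_fwd = te_chi2_test(b, a)  # B -> A: real coupling
te_rev, stat_rev, dof_rev, p_rev = te_chi2_test(a, b)  # A -> B: no coupling
print(p_fwd, p_rev)
```

A very small p-value in the coupled direction and an unremarkable one in the reverse direction is the pattern such a test is meant to detect.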

In [6]:
import numpy as np
import pandas as pd
import scipy
import requests
from typing import Tuple
import regex as re

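url_tepv = "https://raw.githubusercontent.com/bongiornoc/PV-TE/refs/heads/main/TEpv.py"
response = requests.get(url_tepv, timeout=30)
if response.status_code == 200:
    code = response.text
    # Execute the fetched code in this namespace; it is expected to define
    # transfer_entropy_analysis, which is used further down
    exec(code)
else:
    # Stop here: the later cells cannot run without the fetched code
    raise RuntimeError(f"Failed to fetch the file: {response.status_code}")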



In [7]:
features = pd.read_parquet("../data/features/DATA_0/XBT_EUR.parquet")
features.index = pd.to_datetime(features.index, unit='ms', utc=True)
eth = pd.read_parquet("../data/features/DATA_0/ETH_EUR.parquet")
eth.index = pd.to_datetime(eth.index, unit='ms', utc=True)
eth = eth[["level-1-bid-price"]]

In [13]:
def backward_matching(A: pd.DataFrame, B: pd.DataFrame, timeshift=pd.Timedelta('0s')) -> pd.DataFrame:
    """
    Synchronize asynchronous time series by backward matching.
    For each timestamp t of A, takes the latest row of B with timestamp <= t - timeshift.

    Args:
        A (pd.DataFrame): target time series with datetime index
        B (pd.DataFrame): base time series with datetime index (the one that will be synced)
        timeshift (pd.Timedelta): delay added to B's timestamps before matching

    Returns:
        pd.DataFrame: Synchronized time series B_sync (with respect to A)
    """

    # Shift B by the specified timeshift
    B_shifted = B.shift(freq=timeshift)

    # For each row of A, take the latest row of B_shifted at or before A's timestamp.
    # Note: the filter pattern matches every column name, so A's overlapping
    # columns (suffixed "_A") remain in the result.
    B_sync = pd.merge_asof(A, B_shifted, on='timestamp', direction="backward", suffixes=("_A", "")).filter(regex='.*(?<!_A$)')

    return B_sync
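
The backward matching step can be checked on a small toy example. The sketch below uses made-up data and column names (`price`, `volume`) purely for illustration. It also shows that the pattern `.*(?<!_A$)` keeps every column, because `.*` can backtrack by one character to satisfy the lookbehind, and that a plain suffix test is one way to drop the `_A` columns.

```python
import pandas as pd

# Toy data (hypothetical values): A is the target clock, B is the series to align.
idx_a = pd.to_datetime([2, 5, 9], unit="ms", utc=True)
idx_b = pd.to_datetime([1, 5, 8], unit="ms", utc=True)
A = pd.DataFrame({"price": [10.0, 11.0, 12.0]}, index=idx_a).rename_axis("timestamp")
B = pd.DataFrame({"price": [1.0, 2.0, 3.0], "volume": [100, 200, 300]},
                 index=idx_b).rename_axis("timestamp")

# Delay B by 1 ms, so its rows are stamped 2, 6 and 9 ms.
B_shifted = B.copy()
B_shifted.index = B.index + pd.Timedelta("1ms")
B_shifted = B_shifted.rename_axis("timestamp")

# For each timestamp of A, take the latest shifted row of B at or before it.
merged = pd.merge_asof(
    A.reset_index(), B_shifted.reset_index(),
    on="timestamp", direction="backward", suffixes=("_A", ""),
)

# The lookbehind pattern matches every column name, so nothing is dropped.
kept_by_regex = merged.filter(regex=r".*(?<!_A$)").columns.tolist()

# Dropping by suffix removes A's columns and keeps the timestamp and B's columns.
only_b = merged.loc[:, ~merged.columns.str.endswith("_A")]

print(kept_by_regex)
print(only_b)
```

At 5 ms the match is B's first row rather than its second, because the 1 ms shift moves the second row to 6 ms, which is later than A's timestamp.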

In [19]:
sample_size = 1000
sample_features = features.iloc[:sample_size].copy()
sample_eth = eth.iloc[:sample_size].copy()
# Align XBT features to ETH timestamps: each ETH row gets the latest feature row at least 1 ms older
synced_features = backward_matching(sample_eth, sample_features, timeshift=pd.Timedelta('1ms'))
print(synced_features.shape, sample_eth.shape)
# One test per feature: feature at t, ETH price at t, ETH price at t+1
TE_test_result = pd.DataFrame({
    feat: transfer_entropy_analysis(synced_features[feat].iloc[:-1], sample_eth.iloc[:-1], sample_eth.iloc[1:])
    for feat in synced_features.columns
    if feat != 'timestamp'
})
print(TE_test_result)

(1000, 58) (1000, 1)




  level-1-bid-price_A level-1-bid-price level-1-bid-volume level-2-bid-price  \
0                 0.0          0.031158           0.039247          0.115221   
1                 1.0               1.0                1.0               1.0   
2                 0.0          0.062315           0.078494          0.230441   
3              362952             15336              46008            194256   
4                  lr                lr                 lr                lr   

  level-2-bid-volume level-3-bid-price level-3-bid-volume level-4-bid-price  \
0           0.108314          0.112987            0.11071          0.133416   
1                1.0               1.0                1.0               1.0   
2           0.216629          0.225974           0.221421          0.266832   
3             209592            224928             214704            316944   
4                 lr                lr                 lr                lr   

  level-4-bid-volume level-5-bid-price  ... 

