# Exploration of Missing Data in G-Research Crypto Forecasting

This notebook presents results from my exploration of missing data in the G-Research Crypto Forecasting dataset. 

I look at the missing data from two angles:

1. Missing timepoints, referred to below as gaps.
2. Missing values in `Target`, which typically stem from those time gaps (point 1).

Much of the code lives in the `gresearch_crypto_utils` module, which is available if you are interested in the computational details.

In [1]:
import matplotlib.pyplot as plt
import gresearch_crypto_utils as utils

In [2]:
MAKE_LARGE_PLOTS = True
dataset = utils.GResearchDataset(
    base_path = "/kaggle/input/g-research-crypto-forecasting"
)

For some cryptoassets, data is only available from 2019 onward, so we subsample the training set to a period in which every cryptoasset has data. More precisely, this analysis uses the period from 2019-04-12 14:34:00 to 2021-09-21 00:00:00.

The preprocessing of the raw training data is done in `dataset.get_processed_dfs_dict_by_asset`. Among other steps, it adds rows (filled with NaNs) so that every minute in the target period has a row.
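
The row-filling step can be sketched as below. This is a minimal sketch under assumptions: `add_missing_minutes` is a hypothetical helper (not the function in `gresearch_crypto_utils`), and the toy frame is indexed by a `DatetimeIndex`, which may differ from how the utility handles timestamps.

```python
import pandas as pd

def add_missing_minutes(df, start, end):
    """Reindex a single-asset frame onto a complete 1-minute grid.

    Minutes absent from `df` become rows filled with NaN.
    Hypothetical helper, not the code in gresearch_crypto_utils.
    """
    full_index = pd.date_range(
        start=start, end=end, freq=pd.Timedelta(minutes=1), name=df.index.name
    )
    return df.reindex(full_index)

# Toy data: the 00:02 row is missing.
toy = pd.DataFrame(
    {"Close": [10.0, 10.5, 11.0]},
    index=pd.DatetimeIndex(
        ["2021-01-01 00:00", "2021-01-01 00:01", "2021-01-01 00:03"],
        name="timestamp",
    ),
)
toy_full = add_missing_minutes(toy, toy.index[0], toy.index[-1])
print(toy_full)
```

After reindexing, a NaN in any original column (for example `Asset_ID`) marks a minute that was absent from the raw data.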

In [3]:
assets_train = dataset.get_processed_dfs_dict_by_asset()
assets_train[0].head(3)

## NAs in prices

In the training data, there is no row in which all four prices (High, Low, Close, Open; HLCO) are missing. Moreover, whenever one of the prices is present, the other three are present too.

In [4]:
na_counts = dataset.train[['Close', 'Open', 'Low', 'High']].isna().sum(axis="columns")
(na_counts == 4).sum(), ((1 <= na_counts) & (na_counts <= 3)).sum()

## Gaps in time-series for each asset

We compute a dataframe where each row corresponds to a time gap in a particular asset. 

For example, row number 1 below is a time gap of length 1 min in asset with id==2, starting at 2019-06-01 00:00:00.
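
To make the structure of such a table concrete, here is a minimal sketch of how gaps could be extracted from one asset's timestamps. `compute_gap_stats` is a hypothetical helper: the column names mirror the ones used in this notebook, but the definitions (a gap starts at the first missing minute, and `mins_from_last_gap` counts the minutes between the end of the previous gap and the start of the current one) are assumptions, and the logic in `gresearch_crypto_utils` may differ.

```python
import pandas as pd

def compute_gap_stats(timestamps):
    """List the gaps in a series of minute-level timestamps (hypothetical helper)."""
    ts = pd.Series(pd.to_datetime(timestamps)).sort_values().reset_index(drop=True)
    step_mins = ts.diff().dt.total_seconds() / 60
    is_gap = step_mins > 1  # more than one minute between consecutive rows

    gaps = pd.DataFrame({
        # first missing minute, right after the last observed row
        "gap_start": ts.shift(1)[is_gap] + pd.Timedelta(minutes=1),
        # number of missing minutes in the gap
        "gap_duration_mins": (step_mins[is_gap] - 1).astype(int),
    }).reset_index(drop=True)

    # first observed minute after each gap
    gap_end = gaps["gap_start"] + pd.to_timedelta(gaps["gap_duration_mins"], unit="min")
    # minutes between the end of the previous gap and the start of this one
    gaps["mins_from_last_gap"] = (
        (gaps["gap_start"] - gap_end.shift(1)).dt.total_seconds() / 60
    )
    return gaps

# Toy timestamps: 00:02 is missing, and so are 00:05 through 00:07.
toy_timestamps = [
    "2021-01-01 00:00", "2021-01-01 00:01", "2021-01-01 00:03",
    "2021-01-01 00:04", "2021-01-01 00:08",
]
toy_gaps = compute_gap_stats(toy_timestamps)
print(toy_gaps)
```

On the toy input this yields two gaps, of 1 and 3 minutes, with 2 observed minutes between them.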

In [5]:
gap_stats = utils.compute_gap_stats_all_assets(dataset.assets_train, dataset.asset_details)
gap_stats.head()

Given this table, let's look at the following for each asset:
- distribution of gap durations
- distribution of intergap intervals
- gap durations over time

In [6]:
if MAKE_LARGE_PLOTS:
    for i, row in dataset.get_assets_iter():
        fig, ax = plt.subplots(figsize=(12, 50 // 14))
        (gap_stats
            .loc[(gap_stats['Asset_ID'] == row['Asset_ID'])]
            ['gap_duration_mins']
            .plot.hist(bins=50, title=f"Gap duration (mins): {row['Asset_Name']}", ax=ax)
        )

In [11]:
if MAKE_LARGE_PLOTS:
    for i, row in dataset.get_assets_iter():
        fig, ax = plt.subplots(figsize=(12,50 // 14))
        (gap_stats
            .loc[(gap_stats['Asset_ID'] == row['Asset_ID'])]
            ['mins_from_last_gap']
            .plot.hist(bins=50, title=f"Intergap duration (mins): {row['Asset_Name']}", ax=ax)
        )

In [12]:
if MAKE_LARGE_PLOTS:
    for i, row in dataset.get_assets_iter():
        fig, ax = plt.subplots(figsize=(15,80 // 14))
        (gap_stats
            .loc[(gap_stats['Asset_ID'] == row['Asset_ID'])] 
            .plot.scatter(
                x="gap_start", y="gap_duration_mins", 
                title=f"Gap lengths in time: {row['Asset_Name']}", ax=ax)
        )

## Missing data in `Target`

We'll explore:
- Evolution of `Target` over time (not strictly related to missing data exploration)
- Distribution of `Target` values (not strictly related to missing data exploration)
- Missing data in `Target` evolution in time:
  - We'll compare missing data cumulative plots of `Target` with those computed from time gaps. The similarity of these 2 graphs, for each cryptoasset, indicates that they are associated.

In [13]:
if MAKE_LARGE_PLOTS:
    for i, row in dataset.get_assets_iter():
        fig, ax = plt.subplots(figsize=(15,100 // 14))
        asset_id = row['Asset_ID']
        df_asset = assets_train[asset_id]
        df_asset["Target"].plot(
            ax=ax, title=f"Target: {row['Asset_Name']}")

In [14]:
if MAKE_LARGE_PLOTS:
    fig, axs = plt.subplots(nrows=dataset.asset_details.shape[0], figsize=(10,70), sharex=True)

    for i, row in dataset.get_assets_iter():
        asset_id = row['Asset_ID']
        df_asset = assets_train[asset_id]
        df_asset["Target"].plot.hist(
            bins=50,
            ax=axs[i], 
            title=f"Target histogram: {row['Asset_Name']}")

In [15]:
if MAKE_LARGE_PLOTS:
    for i, row in dataset.get_assets_iter():
        fig, axs = plt.subplots(ncols=2, figsize=(20,80 // 14))
        df_asset = assets_train[row['Asset_ID']]
        df_asset['Asset_ID'].isna().cumsum().plot(
            ax=axs[0], 
            title=f"No time data cumsum: {row['Asset_Name']}"
        )
        df_asset['Target_original_na'].fillna(False).cumsum().plot(
            ax=axs[1], 
            title=f"No target data (original train dataset): {row['Asset_Name']}"
        )

### Explaining the Target NaNs

From the cumulative plots above, it looks like `Target` NaNs generally stem from the time gaps.

We'll count how many `Target` NaNs occur even though the prices at times t+1 and t+16 are available. The code relies on the fact, verified above, that if one of the HLCO prices is available at a timepoint, the others are available too.

In [27]:
for _, row in dataset.get_assets_iter():
    df_asset = assets_train[row['Asset_ID']]
    
    df_asset['Close_t+1'] = df_asset['Close'].shift(-1)
    df_asset['Close_t+16'] = df_asset['Close'].shift(-16)
    
    mask_na_with_available_prices = (
        (df_asset['Target_original_na'] == True)
        & (
            (~df_asset['Close_t+1'].isna()) 
            & (~df_asset['Close_t+16'].isna())
        )
    )

    mask_na_no_price = (
        (df_asset['Target_original_na'] == True)
        & (
            (df_asset['Close_t+1'].isna()) 
            | (df_asset['Close_t+16'].isna())
        )
    )

    # mask_na_no_t1 = (
    #     (df_asset['Target_original_na'] == True)
    #     & (df_asset['Close_t+1'].isna())
    # )

    # mask_na_no_t16 = (
    #     (df_asset['Target_original_na'] == True)
    #     & (df_asset['Close_t+16'].isna())
    # )

    mask_target_na = df_asset['Target_original_na'] == True
    
    print(
        f"Asset: {row['Asset_Name']:<16s} | ",
        f"P(t+1) && P(t+16) exist: {mask_na_with_available_prices.sum():<2d} | ",
        f"not P(t+1) || not P(t+16): {mask_na_no_price.sum():<6d} | ",
        # f"not P(t+1): {mask_na_no_t1.sum():<7d}",
        # f" | not P(t+16): {mask_na_no_t16.sum():<7d}"
        f"Target NaNs: {mask_target_na.sum():<7d}"
    )

It looks like NaNs in `Target` come from the time gaps: log returns cannot be computed when the price at t+1 or t+16 is unavailable. There are negligible exceptions (<20 per cryptoasset), which might come from boundary cases.
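
A toy illustration of how a gap propagates into the return, assuming the 15-minute log return is computed from `Close` as log(P(t+16) / P(t+1)), which matches the t+1 and t+16 offsets used above. The price series and the gap position are made up.

```python
import numpy as np
import pandas as pd

# Toy minute-level Close series of 40 points; minutes 20, 21 and 22 are a gap.
close = pd.Series(np.linspace(100.0, 110.0, 40))
close.iloc[20:23] = np.nan

# Assumed 15-minute log return: log(P(t+16) / P(t+1)).
log_ret = np.log(close.shift(-16) / close.shift(-1))

# Ignore the last 16 rows, where t+16 falls beyond the end of the series.
inside = log_ret.iloc[:-16]
nan_rows = inside[inside.isna()].index.tolist()
print(nan_rows)  # [4, 5, 6, 19, 20, 21]
```

Each missing minute invalidates two rows: the one for which it is the t+16 price and the one for which it is the t+1 price. The rows dropped at the end of the series are an example of a boundary case.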

It also means that the hosts likely interpolated the values at the missing time points and used them to compute the rolling averages required for the weighted average market return $M(t)$ and $\beta^a$.
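
For reference, the quantities mentioned above are defined roughly as follows. This is written from my recollection of the competition description, so treat the exact form as an assumption and check it against the official text. Here $P^a(t)$ is the price of asset $a$, $w^a$ is its weight, and $\langle \cdot \rangle$ denotes a rolling average over a fixed time window.

$$
R^a(t) = \log\frac{P^a(t+16)}{P^a(t+1)}, \qquad
M(t) = \frac{\sum_a w^a R^a(t)}{\sum_a w^a}, \qquad
\beta^a = \frac{\langle M \cdot R^a \rangle}{\langle M^2 \rangle}, \qquad
\text{Target}^a(t) = R^a(t) - \beta^a M(t)
$$

Under this definition, $M(t)$ and the rolling averages in $\beta^a$ depend on the returns of every asset, so a missing price in one asset would affect them unless the missing values are filled in somehow.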