# Target reconstruction: what's up with 2018-01-03 14:29:00?
# UPDATE:
There's nothing special about that date except that it is 3749 steps after the train set starts! Adding a timestamp at 2018-01-01 00:00 when calculating the Target completely changes the values of the four rows mentioned below. See the [updated notebook](https://www.kaggle.com/jagofc/target-reconstruction-quick-summary-of-sota) and the [renewed discussion](https://www.kaggle.com/c/g-research-crypto-forecasting/discussion/286778#1626395).
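
As a quick sanity check of that arithmetic, the offset between the two timestamps can be computed directly. This is only a minimal sketch: `ROLLING_WINDOW` is a name introduced here for illustration and simply mirrors the `rolling(3750)` call in the old notebook below.

```python
import pandas as pd

ROLLING_WINDOW = 3750  # illustrative name; same value as rolling(3750) below

offending = pd.Timestamp("2018-01-03 14:29:00")
year_start = pd.Timestamp("2018-01-01 00:00:00")

# Whole minutes between the added 00:00 timestamp and the offending minute.
minutes_since_start = int((offending - year_start) / pd.Timedelta(minutes=1))
print(minutes_since_start)  # 3749

# Counting the 00:00 row itself, the offending minute is row number 3750,
# i.e. the first row at which a 3750-row window is full.
rows_up_to_offending = minutes_since_start + 1
print(rows_up_to_offending == ROLLING_WINDOW)  # True

# The epoch value printed in the old notebook is the same minute.
print(pd.Timestamp(1514989740, unit="s") == offending)  # True
```

So a grid that includes 2018-01-01 00:00 has exactly one full rolling window ending at 14:29:00, which fits the observation that only that minute is affected.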

# Old notebook below:

This is a quick note to draw attention to one timestamp that the heroic target-reconstruction work collected in alexfir's [notebook](https://www.kaggle.com/alexfir/recreating-target/) fails to reproduce.
On the plus side, once just 4 rows from 2018-01-03 14:29:00 are excluded, the reconstruction is pretty much perfect.

## Summary statistics:

**Calculated on all rows:**

* Mean absolute error: 6.0e-10
* Max absolute error: 3.8e-03
* Std absolute error: 1.4e-06


**Calculated on all rows except [3748, 3899904, 9762453, 11718653]:**

* Mean absolute error: 8.2e-16
* Max absolute error: 2.8e-15
* Std absolute error: 8.5e-14

**The question is: why does the reconstruction have a large error for those 4 rows while being perfect (up to floating-point precision) everywhere else?**


# Preliminaries

In [None]:
import numpy as np
import pandas as pd

data_path = '../input/g-research-crypto-forecasting/train.csv'
dtypes = {
    'timestamp': np.int64,
    'Asset_ID': np.int8,
    'Close': np.float64,
    'Target': np.float64,
}
crypto_df = pd.read_csv(data_path, dtype=dtypes, usecols=list(dtypes.keys()))

asset_details_path = '../input/g-research-crypto-forecasting/asset_details.csv'
asset_details = pd.read_csv(asset_details_path)

crypto_df = crypto_df.merge(asset_details, on='Asset_ID')

In [None]:
# from https://www.kaggle.com/alexfir/recreating-target/
def calculate_target(data: pd.DataFrame, details: pd.DataFrame, price_column: str) -> pd.DataFrame:
    ids = list(details.Asset_ID)
    asset_names = list(details.Asset_Name)
    weights = np.array(list(details.Weight))

    # Common time grid: every timestamp that appears for any asset
    all_timestamps = np.sort(data['timestamp'].unique())
    targets = pd.DataFrame(index=all_timestamps)

    # Per-asset return from t+1 to t+16, aligned on the common grid
    for asset_id, asset_name in zip(ids, asset_names):
        asset = data[data.Asset_ID == asset_id].set_index(keys='timestamp')
        price = pd.Series(index=all_timestamps, data=asset[price_column])
        targets[asset_name] = (
            price.shift(periods=-16) /
            price.shift(periods=-1)
        ) - 1

    # Weighted market return; missing asset returns count as 0
    targets['m'] = np.average(targets.fillna(0), axis=1, weights=weights)

    m = targets['m']

    # Rolling 3750-row beta of each asset against the market.
    # NaN/inf values (including rows before the first full window) become 0.
    num = targets.multiply(m.values, axis=0).rolling(3750).mean().values
    denom = m.multiply(m.values, axis=0).rolling(3750).mean().values
    beta = np.nan_to_num(num.T / denom, nan=0., posinf=0., neginf=0.)

    # Residual return after removing the market component
    targets = targets - (beta * m.values).T
    targets.drop('m', axis=1, inplace=True)

    return targets
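
The rolling step above has a warm-up period: `rolling(3750).mean()` returns NaN until a full window is available, and `np.nan_to_num` then turns the resulting beta into 0. The toy sketch below uses a window of 5 instead of 3750 to show how one extra leading row moves the first row with a real beta. The helper `rolling_beta` and the random returns are assumptions made up for illustration; they are not part of alexfir's notebook.

```python
import numpy as np
import pandas as pd

WINDOW = 5  # toy stand-in for the 3750-row window


def rolling_beta(asset_ret: pd.Series, market_ret: pd.Series, window: int) -> pd.Series:
    """Rolling beta in the style of calculate_target: mean(M*R) / mean(M*M), NaN/inf -> 0."""
    num = (asset_ret * market_ret).rolling(window).mean()
    denom = (market_ret * market_ret).rolling(window).mean()
    beta = np.nan_to_num((num / denom).to_numpy(), nan=0.0, posinf=0.0, neginf=0.0)
    return pd.Series(beta, index=asset_ret.index)


# Synthetic per-minute returns for minutes 0..8 (made-up data).
rng = np.random.default_rng(0)
minutes = pd.RangeIndex(0, 9)
market = pd.Series(rng.normal(size=len(minutes)), index=minutes)
asset = pd.Series(rng.normal(size=len(minutes)), index=minutes)

# "plain": the grid starts at minute 1; "padded": one extra leading row at minute 0.
beta_plain = rolling_beta(asset.iloc[1:], market.iloc[1:], WINDOW)
beta_padded = rolling_beta(asset, market, WINDOW)

print(pd.DataFrame({"plain": beta_plain, "padded": beta_padded}))
```

In this toy setup the two versions differ only at minute `WINDOW - 1`: the plain grid still has beta forced to 0 there, while the padded grid already has a full window. From minute `WINDOW` onwards both windows cover the same rows and agree. This is consistent with a single disagreeing timestamp, though it is only an illustration and not a proof of how the official Target was computed.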

In [None]:
recon_targets = calculate_target(data=crypto_df, details=asset_details, price_column='Close')
recon_targets = pd.melt(recon_targets.reset_index(), id_vars='index')
recon_targets = recon_targets.rename(columns={'index':'timestamp', 'variable':'Asset_Name', 'value':'recon_Target'})

In [None]:
crypto_df = crypto_df.merge(recon_targets, on=['Asset_Name', 'timestamp'])

# Checks

Check that all NaNs match up:

In [None]:
all(crypto_df.Target.isna() == crypto_df.recon_Target.isna()) 

Reproduce absolute error claims:

In [None]:
crypto_df['abs_error'] = abs(crypto_df['Target'] - crypto_df['recon_Target'])
print("abs_error statistics for the entirety of train data:")
crypto_df['abs_error'].describe()

The distribution seems to be highly skewed. Check out the rows with a large `abs_error`:

In [None]:
crypto_df[crypto_df.abs_error > 1e-13]

They're all in the same minute!

In [None]:
print(f"Datetime of offence: {pd.to_datetime(1514989740, unit='s')}")

In [None]:
indices = crypto_df[crypto_df.abs_error > 1e-13].index
print("abs_error statistics after dropping offenders:")
crypto_df.drop(indices).abs_error.describe()

After dropping the four rows above, target reconstruction is pretty much perfect.

**So, what's happening in this minute...?**