# Tabularization Refactoring Experiments

In [1]:
from tab_experiment_utils import test_correctness, perform_profiling, perform_benchmarks

## Correctness Checks

### With Unequal Frequency Series (Uses Date Intersection method)

In [None]:
# Using series with `pd.RangeIndex`-typed `time_index`s:
test_correctness(equal_freq=False, use_range_idx=True)

Passed 1000/10240


In [None]:
# Using series with `pd.DatetimeIndex`-typed `time_index`s:
test_correctness(equal_freq=False, use_range_idx=False)

### With Equal Frequency Series (Uses Sliding Window method)

In [None]:
# Using series with `pd.RangeIndex`-typed `time_index`s:
test_correctness(equal_freq=True, use_range_idx=True)

In [None]:
# Using series with `pd.DatetimeIndex`-typed `time_index`s:
test_correctness(equal_freq=True, use_range_idx=False)

## Speed Benchmarks

### Small Number of Lags, No `max_samples_per_ts`

In [None]:
perform_benchmarks(
    num_repeats=1000,
    use_range_idx=False,
    multi_models=True,
    lags=[-1],
    lags_past_covariates=[-2],
    lags_future_covariates=[-3],
    output_chunk_length=10,
    max_samples_per_ts=None,
)

### Small Number of Lags, `max_samples_per_ts` = 10

In [None]:
perform_benchmarks(
    num_repeats=1000,
    use_range_idx=False,
    multi_models=True,
    lags=[-1],
    lags_past_covariates=[-2],
    lags_future_covariates=[-3],
    output_chunk_length=10,
    max_samples_per_ts=10,
)

### Large Number of Lags, No `max_samples_per_ts`

In [None]:
# Use fewer repeats here for the sake of brevity (these benchmarks take longer):
perform_benchmarks(
    num_repeats=100,
    use_range_idx=False,
    multi_models=True,
    lags=range(-30, 0, 1),
    lags_past_covariates=range(-62, 0, 2),
    lags_future_covariates=range(-100, 0, 3),
    output_chunk_length=10,
    max_samples_per_ts=None,
)

### Large Number of Lags, `max_samples_per_ts` = 10

In [None]:
# Use fewer repeats here for the sake of brevity (these benchmarks take longer):
perform_benchmarks(
    num_repeats=100,
    use_range_idx=False,
    multi_models=True,
    lags=range(-30, 0, 1),
    lags_past_covariates=range(-62, 0, 2),
    lags_future_covariates=range(-100, 0, 3),
    output_chunk_length=10,
    max_samples_per_ts=10,
)

## Unexpected Behaviour of `_create_lagged_data` when `is_training=False`

This section walks through a small example in which `_create_lagged_data` behaves unexpectedly when `is_training = False`.

Let's first create a single-component `target_series` and `past_series`. Both series have `5` timesteps, but `target_series` is sampled every `1` timestep whilst `past_series` is sampled every `2` timesteps:

In [1]:
from darts.utils.data.tabularization import _create_lagged_data, create_lagged_data
from tab_experiment_utils import create_index_series
import numpy as np

num_timesteps = 5
# Have `target_series` values starting from 10 and `past_series` starting from 20:
target_series = create_index_series(num_timesteps, num_components=1, offset=10, freq=1)
past_series = create_index_series(num_timesteps, num_components=1, offset=20, freq=2)

Let's visualise the values of these series alongside their `time_index`; the values are shown in the top row, whilst the `time_index` is shown in the bottom row:

In [2]:
# target_series
print(np.stack([target_series.all_values().squeeze(), target_series.time_index]))

[[10. 11. 12. 13. 14.]
 [ 0.  1.  2.  3.  4.]]


In [3]:
# past_series
print(np.stack([past_series.all_values().squeeze(), past_series.time_index]))

[[20. 21. 22. 23. 24.]
 [ 0.  2.  4.  6.  8.]]
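
To make the relationship between the two time indices concrete, here is a minimal pandas-only sketch. It does not call `create_index_series` or any Darts code; the index bounds are copied from the printed output above, and the variable names are illustrative. It only shows which timestamps the two series have in common:

```python
import pandas as pd

# Time indices copied from the printed series above (not built with Darts).
target_index = pd.RangeIndex(start=0, stop=5, step=1)   # 0, 1, 2, 3, 4
past_index = pd.RangeIndex(start=0, stop=10, step=2)    # 0, 2, 4, 6, 8

# Timestamps present in both series.
shared_times = target_index.intersection(past_index)
print(list(shared_times))
```

Only `t = 0`, `t = 2`, and `t = 4` appear in both indices, so `t = 4` is the latest time at which both series have an observation.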


Let's now suppose we want to create lagged data using `lags = [-1]`, `lags_past_covariates = [-2]`, `output_chunk_length = 1`, `multi_models = False`, and `max_samples_per_ts = 1`. When `is_training = True`, we observe that:

In [4]:
(X, y, Ts) = _create_lagged_data(
    target_series,
    lags=[-1],
    past_covariates=past_series,
    lags_past_covariates=[-2],
    output_chunk_length=1,
    multi_models=False,
    is_training=True,
    max_samples_per_ts=1,
)
print(Ts[0])
print(X)

Int64Index([4], dtype='int64', name='time')
[[13. 20.]]


This result is as expected:
- `Ts[0]` contains the single time `4`, since this is the latest `time_index` value for which we can create both features and labels.
- `X[0, 0]` is `13`, since this value is `-1` lag away from `t = 4` in `target_series`.
- `X[0, 1]` is `20`, since this value is `-2` lags away from `t = 4` in `past_series`.
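
The expected feature values can be checked by hand with a small pandas-only sketch. This is not how Darts computes lagged features; `lagged_value` is a hypothetical helper written for this illustration, and it assumes that a lag is counted in steps of each series' own index (so one lag in `past_series` spans `2` time units):

```python
import pandas as pd

# Values and time indices copied from the printed series above.
target = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], index=pd.RangeIndex(0, 5, 1))
past = pd.Series([20.0, 21.0, 22.0, 23.0, 24.0], index=pd.RangeIndex(0, 10, 2))


def lagged_value(series: pd.Series, t: int, lag: int) -> float:
    """Hypothetical helper: value `lag` steps (lag < 0) before time `t`,
    counted in steps of the series' own index."""
    pos = series.index.get_loc(t) + lag
    if pos < 0:
        raise IndexError(f"lag {lag} reaches before the start of the series")
    return float(series.iloc[pos])


# Features for the sample at t = 4: lag -1 on the target, lag -2 on the past covariates.
expected_features = [lagged_value(target, 4, -1), lagged_value(past, 4, -2)]
print(expected_features)  # [13.0, 20.0]
```

Under that assumption, the lookup lands on `t = 3` in `target_series` and on `t = 0` in `past_series`.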

Let's now repeat this, but with `is_training = False`: 

In [5]:
(X, y, Ts) = _create_lagged_data(
    target_series,
    lags=[-1],
    past_covariates=past_series,
    lags_past_covariates=[-2],
    output_chunk_length=1,
    multi_models=False,
    is_training=False,
    max_samples_per_ts=1,
)
print(Ts[0])
print(X)

Int64Index([4], dtype='int64', name='time')
[[13. 21.]]


What's unexpected about this result is that `X[0, 1]` is `21`: this value is **not** `-2` lags away from `t = 4` in `past_series`; it is only `-1` lag away from `t = 4`.

For what it's worth, the reimplemented `create_lagged_data` produces the 'intuitively correct' value of `20` for the past-covariate feature (note that its `X` has an extra trailing axis):

In [6]:
(X, y, Ts) = create_lagged_data(
    target_series,
    lags=[-1],
    past_covariates=past_series,
    lags_past_covariates=[-2],
    output_chunk_length=1,
    multi_models=False,
    is_training=False,
    max_samples_per_ts=1,
)
print(Ts[0])
print(X)

4
[[[13.]
  [20.]]]
