# Tabularization Refactoring Experiments

In [1]:
from tab_experiment_utils import test_correctness, perform_profiling, perform_benchmarks

## Correctness Checks

### With Unequal Frequency Series (Uses Date Intersection method)

In [2]:
# Using series with `pd.RangeIndex`-typed `time_index`s:
test_correctness(equal_freq=False, use_range_idx=True)

Passed 1000/10240
Passed 2000/10240
Passed 3000/10240
Passed 4000/10240
Passed 5000/10240
Passed 6000/10240
Passed 7000/10240
Passed 8000/10240
Passed 9000/10240
Passed 10000/10240


In [3]:
# Using series with `pd.DatetimeIndex`-typed `time_index`s:
test_correctness(equal_freq=False, use_range_idx=False)

Passed 1000/10240
Passed 2000/10240
Passed 3000/10240
Passed 4000/10240
Passed 5000/10240
Passed 6000/10240
Passed 7000/10240
Passed 8000/10240
Passed 9000/10240
Passed 10000/10240


### With Equal Frequency Series (Uses Sliding Window method)

In [4]:
# Using series with `pd.RangeIndex`-typed `time_index`s:
test_correctness(equal_freq=True, use_range_idx=True)

Passed 1000/10240
Passed 2000/10240
Passed 3000/10240
Passed 4000/10240
Passed 5000/10240
Passed 6000/10240
Passed 7000/10240
Passed 8000/10240
Passed 9000/10240
Passed 10000/10240


In [5]:
# Using series with `pd.DatetimeIndex`-typed `time_index`s:
test_correctness(equal_freq=True, use_range_idx=False)

Passed 1000/10240
Passed 2000/10240
Passed 3000/10240
Passed 4000/10240
Passed 5000/10240
Passed 6000/10240
Passed 7000/10240
Passed 8000/10240
Passed 9000/10240
Passed 10000/10240


## Speed Benchmarks

### Small Number of Lags, No `max_samples_per_ts`

In [6]:
perform_benchmarks(
    num_repeats=1000,
    use_range_idx=False,
    multi_models=True,
    lags=[-1],
    lags_past_covariates=[-2],
    lags_future_covariates=[-3],
    output_chunk_length=10,
    max_samples_per_ts=None,
)

With Unequal Frequency Timeseries:
Current implementation: 21.079285621643066 secs for 1000 repetitions
New implementation: 1.2758994102478027 secs for 1000 repetitions
Speed up = 16.521118712288683 fold

With Equal Frequency Timeseries, Using Moving Windows:
Current implementation: 10.484981775283813 secs for 1000 repetitions
New implementation: 0.8654282093048096 secs for 1000 repetitions
Speed up = 12.1153686262507 fold

With Equal Frequency Timeseries, Using Time Intersections:
Current implementation: 10.055682897567749 secs for 1000 repetitions
New implementation: 3.3570518493652344 secs for 1000 repetitions
Speed up = 2.9953910004306668 fold
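
The `Speed up` figures printed by `perform_benchmarks` are consistent with a plain ratio of the two wall-clock times. The following is a minimal sketch of that arithmetic, assuming this is how the figure is derived; `speed_up` is a hypothetical helper introduced here for illustration and is not part of `tab_experiment_utils`:

```python
def speed_up(current_secs: float, new_secs: float) -> float:
    """Fold speed-up of the new implementation over the current one (hypothetical helper)."""
    if new_secs <= 0:
        raise ValueError("new_secs must be positive")
    return current_secs / new_secs


# Timings copied from the 'Unequal Frequency Timeseries' output above:
fold = speed_up(21.079285621643066, 1.2758994102478027)
print(f"Speed up = {fold:.2f} fold")  # prints "Speed up = 16.52 fold"
```

Because both timings cover the same number of repetitions, the ratio does not depend on `num_repeats`.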



### Small Number of Lags, `max_samples_per_ts` = 10

In [7]:
perform_benchmarks(
    num_repeats=1000,
    use_range_idx=False,
    multi_models=True,
    lags=[-1],
    lags_past_covariates=[-2],
    lags_future_covariates=[-3],
    output_chunk_length=10,
    max_samples_per_ts=10,
)

With Unequal Frequency Timeseries:
Current implementation: 19.58695125579834 secs for 1000 repetitions
New implementation: 0.7003414630889893 secs for 1000 repetitions
Speed up = 27.96771616149409 fold

With Equal Frequency Timeseries, Using Moving Windows:
Current implementation: 9.660383462905884 secs for 1000 repetitions
New implementation: 0.5997965335845947 secs for 1000 repetitions
Speed up = 16.1061008558553 fold

With Equal Frequency Timeseries, Using Time Intersections:
Current implementation: 10.26090693473816 secs for 1000 repetitions
New implementation: 0.7066986560821533 secs for 1000 repetitions
Speed up = 14.519494053693707 fold



### Large Number of Lags, No `max_samples_per_ts`

In [8]:
# Use fewer repeats here for the sake of brevity (these benchmarks take longer):
perform_benchmarks(
    num_repeats=100,
    use_range_idx=False,
    multi_models=True,
    lags=range(-30, 0, 1),
    lags_past_covariates=range(-62, 0, 2),
    lags_future_covariates=range(-100, 0, 3),
    output_chunk_length=10,
    max_samples_per_ts=None,
)

With Unequal Frequency Timeseries:
Current implementation: 30.294672966003418 secs for 100 repetitions
New implementation: 0.5056872367858887 secs for 100 repetitions
Speed up = 59.907924824352214 fold

With Equal Frequency Timeseries, Using Moving Windows:
Current implementation: 8.417512655258179 secs for 100 repetitions
New implementation: 2.1929337978363037 secs for 100 repetitions
Speed up = 3.838470939507369 fold

With Equal Frequency Timeseries, Using Time Intersections:
Current implementation: 8.822782516479492 secs for 100 repetitions
New implementation: 2.626560926437378 secs for 100 repetitions
Speed up = 3.3590625778654837 fold



### Large Number of Lags, `max_samples_per_ts` = 10

In [9]:
# Use fewer repeats here for the sake of brevity (these benchmarks take longer):
perform_benchmarks(
    num_repeats=100,
    use_range_idx=False,
    multi_models=True,
    lags=range(-30, 0, 1),
    lags_past_covariates=range(-62, 0, 2),
    lags_future_covariates=range(-100, 0, 3),
    output_chunk_length=10,
    max_samples_per_ts=10,
)

With Unequal Frequency Timeseries:
Current implementation: 27.349496841430664 secs for 100 repetitions
New implementation: 0.07856512069702148 secs for 100 repetitions
Speed up = 348.11245243167457 fold

With Equal Frequency Timeseries, Using Moving Windows:
Current implementation: 9.392585277557373 secs for 100 repetitions
New implementation: 0.05825614929199219 secs for 100 repetitions
Speed up = 161.22907867596504 fold

With Equal Frequency Timeseries, Using Time Intersections:
Current implementation: 9.368183135986328 secs for 100 repetitions
New implementation: 0.09797930717468262 secs for 100 repetitions
Speed up = 95.61389446533076 fold



## Unexpected Behaviour of `_create_lagged_data` when `is_training=False`

This section describes a simple test example in which `_create_lagged_data` behaves unexpectedly when `is_training = False`. Fortunately, `_create_lagged_data` doesn't appear to be called with `is_training = False` anywhere in the codebase, so this potential bug currently has no practical consequences.

Let's first create a single-component `target_series` and `past_series`. Both series have `5` timesteps, but `target_series` is sampled every `1` timestep, whereas `past_series` is sampled every `2` timesteps:

In [10]:
from darts.utils.data.tabularization import _create_lagged_data
from tab_experiment_utils import create_index_series
import numpy as np

num_timesteps = 5
# Have `target_series` values starting from 10 and `past_series` starting from 20:
target_series = create_index_series(num_timesteps, num_components=1, offset=10, freq=1)
past_series = create_index_series(num_timesteps, num_components=1, offset=20, freq=2)

Let's visualise the values of these series alongside their `time_index`; the values are shown in the top row, whilst the `time_index` is shown in the bottom row:

In [11]:
# target_series
print(np.stack([target_series.all_values().squeeze(), target_series.time_index]))

[[10. 11. 12. 13. 14.]
 [ 0.  1.  2.  3.  4.]]


In [12]:
# past_series
print(np.stack([past_series.all_values().squeeze(), past_series.time_index]))

[[20. 21. 22. 23. 24.]
 [ 0.  2.  4.  6.  8.]]


Let's now suppose we want to create lagged data using `lags = [-1]`, `lags_past_covariates = [-2]`, `output_chunk_length = 1`, `multi_models = False`, and `max_samples_per_ts = 1`. When `is_training = True`, we observe that:

In [13]:
(X, y, Ts) = _create_lagged_data(
    target_series,
    lags=[-1],
    past_covariates=past_series,
    lags_past_covariates=[-2],
    output_chunk_length=1,
    multi_models=False,
    is_training=True,
    max_samples_per_ts=1,
)
print(Ts[0])
print(X)

Int64Index([4], dtype='int64', name='time')
[[13. 20.]]


This result is as expected:
- `Ts[0]` contains `4`, since this is the latest time for which both features and labels can be created.
- `X[0, 0]` is `13`, since this value is `-1` lag away from `t = 4` in `target_series`.
- `X[0, 1]` is `20`, since this value is `-2` lags away from `t = 4` in `past_series`.
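
The reasoning above can be checked without calling `_create_lagged_data` at all. The following is a minimal sketch using plain NumPy arrays that mirror the values and `time_index` printed earlier; `lagged_value` is a hypothetical helper written for this illustration, and it assumes that a lag counts positions in the series' own `time_index`:

```python
import numpy as np


def lagged_value(values, time_index, t, lag):
    """Return the value `lag` positions away from time `t` (hypothetical helper)."""
    pos = int(np.flatnonzero(time_index == t)[0])  # position of `t` in the index
    if pos + lag < 0:
        raise IndexError("lag reaches before the start of the series")
    return values[pos + lag]


# Arrays mirroring the printed `target_series` and `past_series`:
target_values, target_times = np.arange(10.0, 15.0), np.arange(0, 5, 1)
past_values, past_times = np.arange(20.0, 25.0), np.arange(0, 10, 2)

expected_row = np.array([
    lagged_value(target_values, target_times, t=4, lag=-1),  # target feature
    lagged_value(past_values, past_times, t=4, lag=-2),  # past covariate feature
])
print(expected_row)  # prints [13. 20.]
```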

Let's now repeat this, but with `is_training = False`: 

In [14]:
(X, y, Ts) = _create_lagged_data(
    target_series,
    lags=[-1],
    past_covariates=past_series,
    lags_past_covariates=[-2],
    output_chunk_length=1,
    multi_models=False,
    is_training=False,
    max_samples_per_ts=1,
)
print(Ts[0])
print(X)

Int64Index([4], dtype='int64', name='time')
[[13. 21.]]


What's unexpected about this result is that `X[0, 1]` is `21`: this value is **not** `-2` lags away from `t = 4` in `past_series` but is, instead, only `-1` lag away from the `t = 4` value in `past_series`.
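
To make the discrepancy concrete, the sketch below works backwards from the observed value `21` to the lag it corresponds to. It uses plain NumPy arrays that mirror the printed `past_series`; the variable names are illustrative only. It also reports the offset in raw time units, which is an observation rather than a confirmed diagnosis of the cause:

```python
import numpy as np

# Arrays mirroring the printed `past_series` values and `time_index`:
past_values = np.arange(20.0, 25.0)  # 20, 21, 22, 23, 24
past_times = np.arange(0, 10, 2)  # 0, 2, 4, 6, 8

t = 4  # prediction time reported in `Ts`
observed = 21.0  # covariate feature returned when `is_training=False`

pos_t = int(np.flatnonzero(past_times == t)[0])
pos_obs = int(np.flatnonzero(past_values == observed)[0])

effective_lag = pos_obs - pos_t  # lag measured in `past_series` steps
time_offset = int(past_times[pos_obs]) - t  # offset measured in raw time units

print(effective_lag)  # prints -1
print(time_offset)  # prints -2
```

The observed value sits one `past_series` step before `t = 4`, although it is two raw time units earlier.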