**Table of contents**<a id='toc0_'></a>    
- [Introduction](#toc1_)    
- [Working with Data](#toc2_)    
    - [TSDataset](#toc2_1_1_)    
    - [Pipeline and Transformers](#toc2_1_2_)    
      - [What kind of transformers are there?](#toc2_1_2_1_)    
      - [Transformers must be assembled in order!](#toc2_1_2_2_)    
      - [How to build a Pipeline?](#toc2_1_2_3_)    
      - [Can I use exogenous variables in the pipeline?](#toc2_1_2_4_)    
      - [Model, Validator and Trainer](#toc2_1_2_5_)    
      - [Strategy](#toc2_1_2_6_)    
  - [Backtest validation of pipeline](#toc2_2_)    
  - [Sliding Window Validation](#toc2_3_)    
  - [Working with raw time series' granularity](#toc2_4_)    


# <a id='toc1_'></a>[Introduction](#toc0_)

In this tutorial, we will explore a basic example of forecasting multiple time series and go over the key components of the forecasting pipeline provided by the `Tsururu` library.

Let's import everything we will need.

In [1]:
import warnings

warnings.filterwarnings("ignore")

from typing import List, Optional, Union

import numpy as np
import pandas as pd

from tsururu.dataset import Pipeline, TSDataset
from tsururu.model_training import MLTrainer
from tsururu.model_training import KFoldCrossValidator
from tsururu.models.boost import CatBoost
from tsururu.strategies import RecursiveStrategy

In [2]:
def get_results(
    cv: int,
    regime: str,
    y_true: Optional[List[np.ndarray]] = None,
    y_pred: Optional[List[np.ndarray]] = None,
    ids: Optional[List[Union[float, str]]] = None,
) -> pd.DataFrame:
    def _get_fold_value(
        value: Optional[List[Union[float, np.ndarray]]], idx: int
    ) -> Union[List[None], float, np.ndarray]:
        if value is None:
            return [None]
        if isinstance(value[idx], float):
            return value[idx]
        if isinstance(value[idx], np.ndarray):
            return value[idx].reshape(-1)
        raise TypeError(f"Unexpected value type. Value: {value}")

    df_res_dict = {}

    for idx_fold in range(cv):
        # Fill df_res_dict
        for name, value in [("y_true", y_true), ("y_pred", y_pred)]:
            df_res_dict[f"{name}_{idx_fold+1}"] = _get_fold_value(
                value, idx_fold
            )
        if regime != "local":
            df_res_dict[f"id_{idx_fold+1}"] = _get_fold_value(ids, idx_fold)

    # Assemble the per-fold columns into a single DataFrame
    df_res = pd.DataFrame(df_res_dict)
    return df_res

There are several main objects to be aware of when working with the library:
1) `TSDataset`
2) `Pipeline` and `Transformers`
3) `Strategy`
4) `Model`

# <a id='toc2_'></a>[Working with Data](#toc0_)

### <a id='toc2_1_1_'></a>[TSDataset](#toc0_)

This class stores the data together with meta-information about it.

To initialise it, pass the data as a `pd.DataFrame` and describe the column roles required for time series forecasting: `id`, `date` and `target`.
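As a rough illustration of the expected input, the sketch below builds a small long-format `pd.DataFrame` with one row per series and timestamp. The column names match the `dataset_params` used in this tutorial; the values and the name `toy_df` are made up for the example.

```python
import numpy as np
import pandas as pd

# Two toy series (id 0 and id 1) with five daily observations each.
dates = pd.date_range("2020-01-01", periods=5, freq="D")

frames = []
for series_id, base in [(0, 1000.0), (1, 2000.0)]:
    frames.append(
        pd.DataFrame(
            {
                "id": series_id,               # which series the row belongs to
                "date": dates,                 # timestamp of the observation
                "value": base + np.arange(5),  # made-up target values
            }
        )
    )

# Long format: all series stacked on top of each other.
toy_df = pd.concat(frames, ignore_index=True)
print(toy_df)
```

A frame shaped like this, together with a roles dictionary such as `dataset_params`, is what the next cells pass to `TSDataset`.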

In [3]:
df_path = "../datasets/global/simulated_data_to_check.csv"

dataset_params = {
    "target": {
        "columns": ["value"],
        "type": "continuous",
    },
    "date": {
        "columns": ["date"],
        "type": "datetime",
    },
    "id": {
        "columns": ["id"],
        "type": "categorical",
    }
}

In [4]:
dataset = TSDataset(
    data=pd.read_csv(df_path),
    columns_params=dataset_params,
    print_freq_period_info=True,
)

freq: Day; period: 1


### <a id='toc2_1_2_'></a>[Pipeline and Transformers](#toc0_)

The Pipeline class is designed to create and apply a sequence of transformations (transformers) to time series data. It is used for data preprocessing, feature and target generation, as well as performing transformations required for forecasting models.

In this tutorial, we will cover a simplified approach to initializing the Pipeline. For detailed information on the available transformers and methods for building a pipeline, refer to Tutorial 3 (Tutorial_3_Pipeline.ipynb).

#### <a id='toc2_1_2_1_'></a>[What kind of transformers are there?](#toc0_)

Special attention should be paid to the `Transformer` class: transformers are the elements of the pipeline responsible for transforming the values of a series and generating features. The `Pipeline` class is a wrapper around the transformers that provides additional methods and functions on top of them.

There are two types of transformers that are used to assemble pipelines:
- `Union` transformers;
- `Sequential` transformers.

Below is a list of available transformers:
- **StandardScalerTransformer** *(Series2Series)*.
- **DifferenceNormalizer** *(Series2Series)*: subtracts the previous value or divides by it.
- **TimeToNumGenerator** and **DateSeasonsGenerator** *(Series2Series)*: generators of seasonal features from dates.
- **LabelEncodingTransformer** and **OneHotEncodingTransformer** *(Series2Series)*: encoders for categorical features.
- **MissingValuesImputer** *(Series2Series)*.
- **LagTransformer** *(Series2Features)*: generator of lags.
- **LastKnownNormalizer** *(Features2Features)*: normalizes all lags by the last known one, either dividing by it or subtracting it.

**The LagTransformer must be present in the sequential transformer; otherwise no features will be generated.**

Finally, to generate targets, you need to use **TargetGenerator**.
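To make lag features and targets concrete, here is a minimal pandas sketch. It is not the library's implementation: it assumes 3 lags and a one-step target, and the column names `lag_1` to `lag_3` and `target` are illustrative only.

```python
import pandas as pd

series = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], name="value")
n_lags = 3

# Each lag column holds the value observed k steps before the current row.
frame = pd.DataFrame(
    {f"lag_{k}": series.shift(k) for k in range(n_lags, 0, -1)}
)

# The target is the value at the current row, i.e. one step after lag_1.
frame["target"] = series

# Rows without a full history contain NaN lags and are dropped.
frame = frame.dropna().reset_index(drop=True)
print(frame)
```

Each remaining row pairs a window of past values with the value that follows it, which is the tabular form a model such as a gradient boosting machine can be trained on.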

#### <a id='toc2_1_2_2_'></a>[Transformers must be assembled in order!](#toc0_)

The __SeriesToSeries__ transformers should come first, followed by the LagTransformer and TargetGenerator (__SeriesToFeatures__), and then the __FeaturesToFeatures__ transformers.

**Thus, StandardScalerTransformer and DifferenceNormalizer should come before LagTransformer, and LastKnownNormalizer should come after it.**
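The ordering can be illustrated with plain NumPy. This is a sketch under simplifying assumptions, not the library's transformers: scaling works on the whole series before windows exist, while last-known normalization needs the lag windows to be built first. Names such as `scaled_lags` and `delta_lags` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.array([100.0, 102.0, 104.0, 106.0, 108.0, 110.0])
n_lags = 3

# Path A: series-to-series step first (standard scaling), then lags.
scaled = (series - series.mean()) / series.std()
scaled_lags = sliding_window_view(scaled, n_lags)

# Path B: lags first, then a features-to-features step that subtracts
# the last known value of each window (the "delta" flavour).
raw_lags = sliding_window_view(series, n_lags)
delta_lags = raw_lags - raw_lags[:, -1:]

print(scaled_lags.shape)  # one row per window
print(delta_lags)
```

In path B every window ends in zero after normalization, which only makes sense once the windows exist; that is why a last-known normalizer cannot be placed before the lag step.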

#### <a id='toc2_1_2_3_'></a>[How to build a Pipeline?](#toc0_)

In [39]:
pipeline_easy_params = {
    "target_lags": 3,
    "date_lags": 1,
    # "exog_lags": 1,  # Uncomment this line if you have exogenous features
    # One of ["none", "standard_scaler", "difference_normalizer", "last_known_normalizer"]
    "target_normalizer": "standard_scaler",
    # One of ["none", "delta", "ratio"] (must be "none" for "standard_scaler" and not "none" for the other normalizers)
    "target_normalizer_regime": "none",
}

In [40]:
pipeline = Pipeline.easy_setup(dataset_params, pipeline_easy_params, multivariate=False)

Alternatively, build the pipeline from an explicit dictionary of transformers:

In [41]:
pipeline_params = {
    "target": {
        "columns": ["value"],
        "features": {
            "StandardScalerTransformer":
                {
                    "transform_target": True, 
                    "transform_features": True
                },
            "LagTransformer": {"lags": 7},
        },
    },
    "date": {
        "columns": ["date"],
        "features": {
            "DateSeasonsGenerator": {
                # Use seasonality features from the date column as 
                # features with datetime lags
                # Possible values: [
                #    "y": year, "m": month, "d": day, 
                #    "wd": weekday, "doy": dayofyear,
                #    "hour": hour, "min": minute, "sec": second, 
                #    "ms": microsecond,  "ns": nanosecond
                # ]
                "seasonalities": ['doy', 'm', 'wd'], 
                # Use date from target point to make datetime features
                "from_target_date": True,
            },
            "LagTransformer": {"lags": 3}
        },
    },
    "id": {
        "columns": ["id"],
        "features": {
            "LagTransformer": {"lags": 1},
        },
    }
}

In [62]:
pipeline = Pipeline.from_dict(pipeline_params, multivariate=False)

#### <a id='toc2_1_2_4_'></a>[Can I use exogenous variables in the pipeline?](#toc0_)

Yes! Exogenous variables can be specified here as well: just add them to the pipeline parameters.

However, they have so far been tested only with `MIMOStrategy` in global modelling. Support for exogenous variables in other strategies is under development.

In [63]:
# pipeline_params["exog_group_1"] = {
#     "columns": ["value"],
#     "features": {
#         "StandardScalerTransformer":
#             {
#                 "transform_target": False, 
#                 "transform_features": True
#             },
#         "LagTransformer": {"lags": 7},
#     },
# }

__Make sure that `transform_target` is set to `False` for exogenous features!__

#### <a id='toc2_1_2_5_'></a>[Model, Validator and Trainer](#toc0_)

- `Model`:
  - The model is separate from the strategy. Any model can be used with any strategy as long as it supports the strategy's input and output format.
  - You can use one of the implemented ML models (for instance, a GBM, i.e. a Gradient Boosting Machine).
- `Validator`:
  - The validator is responsible for setting up the validation process, which includes creating training and validation folds. It ensures that the data is split correctly so that the model’s performance can be accurately assessed.
- `Trainer`:
  - The trainer is the component that trains the model with the provided validator.
  - Choose a trainer that matches the type of model (ML, DL, stats).
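For intuition about what a validator produces, the sketch below generates training and validation indices for a plain k-fold split. It is a generic illustration, assuming contiguous, unshuffled folds; how `KFoldCrossValidator` actually orders or shuffles rows is not covered here, and `k_fold_indices` is a hypothetical helper.

```python
import numpy as np


def k_fold_indices(n_samples: int, n_splits: int):
    """Yield (train_idx, val_idx) pairs for an unshuffled k-fold split."""
    folds = np.array_split(np.arange(n_samples), n_splits)
    for i, val_idx in enumerate(folds):
        # Everything outside the current validation fold is used for training.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=6, n_splits=2))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"Fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

With `n_splits=2`, as in the configuration below, each half of the samples serves once as the validation set, which is why the training log reports two folds.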

In [64]:
# Configure the model parameters
model = CatBoost
model_params = {
    "loss_function": "MultiRMSE",
    "early_stopping_rounds": 100,
    "verbose": 500,
}

# Configure the validation parameters
validation = KFoldCrossValidator
validation_params = {
    "n_splits": 2,
}

trainer_params = {}

trainer = MLTrainer(
    model,
    model_params,
    validation,
    validation_params,
)

#### <a id='toc2_1_2_6_'></a>[Strategy](#toc0_)

- _Recursive:_ 
    - one model for all points of the forecast horizon;
    - *training*: the model is trained to predict one point ahead;
    - *prediction*: a prediction is iteratively made one point ahead, and then this prediction is used to further shape the features in the test data. 
- _Recursive-reduced:_
    - one model for all points in the prediction horizon;
    - *training*: the model is trained to predict one point ahead;
    - *prediction*: features are generated for all test observations at once, unavailable values are replaced by NaN.
- _Direct:_ 
    - individual models for each point in the prediction horizon. 
- _MultiOutput (MIMO - Multi-input-multi-output):_
    - one model that learns to predict the entire prediction horizon at once.
    - __This strategy also supports `exogenous features` (only for local or global modelling).__
- _FlatWideMIMO:_
    - a mixture of Direct and MIMO: a single model is fitted, but it uses Direct's features unrolled over the horizon.
    - __The number of `lags for datetime features` should be equal to `horizon` when using this strategy.__
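The difference between the Recursive, MIMO and Direct strategies can be sketched with a toy linear model fitted by least squares. This is an illustration of the general ideas only, not of how the library's strategy classes work; `make_windows`, `fit_linear` and `predict_linear` are hypothetical helpers.

```python
import numpy as np


def make_windows(series, history, horizon):
    """Return X with `history` past values and Y with the next `horizon` values."""
    X, Y = [], []
    for start in range(len(series) - history - horizon + 1):
        X.append(series[start : start + history])
        Y.append(series[start + history : start + history + horizon])
    return np.array(X), np.array(Y)


def fit_linear(X, Y):
    """Least-squares linear model with an intercept; Y may have several columns."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef


def predict_linear(coef, x):
    return np.append(x, 1.0) @ coef


series = np.arange(20, dtype=float)  # toy series with a linear trend
history, horizon = 3, 3
last_window = series[-history:]

# Recursive: one one-step model; each prediction is fed back as a feature.
X1, Y1 = make_windows(series, history, 1)
one_step = fit_linear(X1, Y1)
window = last_window.copy()
recursive_pred = []
for _ in range(horizon):
    next_value = predict_linear(one_step, window)[0]
    recursive_pred.append(next_value)
    window = np.append(window[1:], next_value)

# MIMO: one model predicts the whole horizon at once.
Xh, Yh = make_windows(series, history, horizon)
mimo_pred = predict_linear(fit_linear(Xh, Yh), last_window)

# Direct: a separate model for each point of the horizon.
direct_pred = [
    predict_linear(fit_linear(Xh, Yh[:, [h]]), last_window)[0]
    for h in range(horizon)
]

print(np.round(recursive_pred, 3), np.round(mimo_pred, 3), np.round(direct_pred, 3))
```

On this noiseless trend all three approaches agree; on real data they differ in how errors accumulate (Recursive) and in how many models have to be trained (Direct).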

In [65]:
horizon = 3
history = 7

In [66]:
strategy = RecursiveStrategy(horizon, history, trainer, pipeline)

In [67]:
fit_time, _ = strategy.fit(dataset)

0:	learn: 0.9606080	test: 0.9667407	best: 0.9667407 (0)	total: 2.64ms	remaining: 2.64s
500:	learn: 0.0051947	test: 0.0053699	best: 0.0053699 (500)	total: 458ms	remaining: 456ms
999:	learn: 0.0031608	test: 0.0033676	best: 0.0033676 (999)	total: 973ms	remaining: 0us

bestTest = 0.003367620955
bestIteration = 999

Fold 0. Score: 0.0033676209549416128
0:	learn: 0.9659554	test: 0.9614093	best: 0.9614093 (0)	total: 1.56ms	remaining: 1.56s
500:	learn: 0.0052698	test: 0.0054766	best: 0.0054766 (500)	total: 445ms	remaining: 443ms
999:	learn: 0.0031515	test: 0.0033391	best: 0.0033391 (999)	total: 912ms	remaining: 0us

bestTest = 0.003339095317
bestIteration = 999

Fold 1. Score: 0.0033390953168007786
Mean score: 0.0034
Std: 0.0


In [68]:
forecast_time, current_pred = strategy.predict(dataset)

freq: Day; period: 1


In [69]:
current_pred

Unnamed: 0,id,date,value
0,0,2022-09-27,1992.837874
1,0,2022-09-28,1993.026917
2,0,2022-09-29,1981.524299
3,1,2022-09-27,2993.408144
4,1,2022-09-28,2993.582053
5,1,2022-09-29,2982.066608
6,2,2022-09-27,3993.473395
7,2,2022-09-28,3993.639452
8,2,2022-09-29,3982.13657
9,3,2022-09-27,4993.459173


## <a id='toc2_2_'></a>[Backtest validation of pipeline](#toc0_)

Backtesting evaluates models on the most recent horizons, retraining iteratively as newer data becomes available.
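A simple way to picture the folds is shown below. The sketch assumes that each fold holds out the last `horizon` points and that every further fold steps back by one horizon, which is consistent with the `y_true` values in this tutorial's output; `backtest_splits` is a hypothetical helper, not part of the library.

```python
def backtest_splits(n_points: int, horizon: int, cv: int):
    """Yield (train_end, test_indices) per fold, most recent fold first."""
    for fold in range(cv):
        test_end = n_points - fold * horizon
        test_start = test_end - horizon
        # The model for this fold may only see points before `test_start`.
        yield test_start, list(range(test_start, test_end))


# 1000 points per series, as in the dataset used in this tutorial.
splits = list(backtest_splits(n_points=1000, horizon=3, cv=3))
for fold, (train_end, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train on [0, {train_end}), test on {test_idx}")
```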

In [70]:
ids, test, pred, fit_time, forecast_time = strategy.back_test(dataset, cv=3)

freq: Day; period: 1
0:	learn: 0.9618043	test: 0.9656878	best: 0.9656878 (0)	total: 2.18ms	remaining: 2.18s
500:	learn: 0.0051787	test: 0.0052308	best: 0.0052308 (500)	total: 455ms	remaining: 453ms
999:	learn: 0.0030887	test: 0.0032962	best: 0.0032962 (999)	total: 956ms	remaining: 0us

bestTest = 0.00329621851
bestIteration = 999

Fold 0. Score: 0.0032962185095815415
0:	learn: 0.9647141	test: 0.9623340	best: 0.9623340 (0)	total: 1.5ms	remaining: 1.5s
500:	learn: 0.0055104	test: 0.0057187	best: 0.0057187 (500)	total: 440ms	remaining: 438ms
999:	learn: 0.0033140	test: 0.0035696	best: 0.0035696 (999)	total: 866ms	remaining: 0us

bestTest = 0.003569589613
bestIteration = 999

Fold 1. Score: 0.003569589613068342
Mean score: 0.0034
Std: 0.0001
freq: Day; period: 1
freq: Day; period: 1
0:	learn: 0.9635630	test: 0.9636501	best: 0.9636501 (0)	total: 959us	remaining: 958ms
500:	learn: 0.0051732	test: 0.0053000	best: 0.0053000 (500)	total: 446ms	remaining: 444ms
999:	learn: 0.0031204	test: 0.0032

In [71]:
get_results(cv=3, regime="global", y_true=test, y_pred=pred, ids=ids)

Unnamed: 0,y_true_1,y_pred_1,id_1,y_true_2,y_pred_2,id_2,y_true_3,y_pred_3,id_3
0,1997.0,1993.466612,0,1994.0,1989.660865,0,1991.0,1985.570009,0
1,1998.0,1993.89183,0,1995.0,1989.906115,0,1992.0,1986.309039,0
2,1999.0,1994.653188,0,1996.0,1990.068356,0,1993.0,1984.264202,0
3,2997.0,2994.040063,1,2994.0,2990.189639,1,2991.0,2986.108931,1
4,2998.0,2994.45766,1,2995.0,2990.460632,1,2992.0,2986.8504,1
5,2999.0,2995.221471,1,2996.0,2990.630855,1,2993.0,2984.845248,1
6,3997.0,3994.116753,2,3994.0,3990.252623,2,3991.0,3986.186164,2
7,3998.0,3994.53536,2,3995.0,3990.536104,2,3992.0,3986.91842,2
8,3999.0,3995.289903,2,3996.0,3990.707172,2,3993.0,3984.909912,2
9,4997.0,4994.105357,3,4994.0,4990.238486,3,4991.0,4986.174836,3


## <a id='toc2_3_'></a>[Sliding Window Validation](#toc0_)

Sliding window validation is a technique often used in research papers dedicated to time series forecasting. 

The testing part is further subdivided using rolling windows, where a “history” window and a “horizon” window are repeatedly created with a fixed step size.
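The window layout can be sketched in a few lines of plain Python. This is an illustration of the general technique with hypothetical names (`sliding_windows`, `step`), not the library's internal logic.

```python
def sliding_windows(n_points: int, history: int, horizon: int, step: int = 1):
    """Return a list of (history_indices, horizon_indices) pairs."""
    windows = []
    start = 0
    # Keep sliding while a full history + horizon window still fits.
    while start + history + horizon <= n_points:
        hist = list(range(start, start + history))
        hor = list(range(start + history, start + history + horizon))
        windows.append((hist, hor))
        start += step
    return windows


windows = sliding_windows(n_points=12, history=7, horizon=3, step=1)
for hist, hor in windows:
    print(f"history={hist[0]}..{hist[-1]} -> horizon={hor}")
```

With a step of 1, consecutive horizon windows overlap: the second window starts one point after the first rather than after its end.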

In [72]:
full_df = pd.read_csv(df_path)

train_df = full_df[full_df["date"] < "2022-01-01"]
test_df = full_df[full_df["date"] >= "2022-01-01"]

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")

train_dataset = TSDataset(
    data=train_df,
    columns_params=dataset_params,
    print_freq_period_info=True,
)

test_dataset = TSDataset(
    data=test_df,
    columns_params=dataset_params,
    print_freq_period_info=True,
)

Train shape: (7310, 3)
Test shape: (2690, 3)
freq: Day; period: 1
freq: Day; period: 1


In [73]:
fit_time, _ = strategy.fit(dataset)

0:	learn: 0.9606080	test: 0.9667407	best: 0.9667407 (0)	total: 1.78ms	remaining: 1.78s
500:	learn: 0.0051947	test: 0.0053699	best: 0.0053699 (500)	total: 511ms	remaining: 509ms
999:	learn: 0.0031608	test: 0.0033676	best: 0.0033676 (999)	total: 1.03s	remaining: 0us

bestTest = 0.003367620955
bestIteration = 999

Fold 0. Score: 0.0033676209549416128
0:	learn: 0.9659554	test: 0.9614093	best: 0.9614093 (0)	total: 1.47ms	remaining: 1.47s
500:	learn: 0.0052698	test: 0.0054766	best: 0.0054766 (500)	total: 431ms	remaining: 429ms
999:	learn: 0.0031515	test: 0.0033391	best: 0.0033391 (999)	total: 857ms	remaining: 0us

bestTest = 0.003339095317
bestIteration = 999

Fold 1. Score: 0.0033390953168007786
Mean score: 0.0034
Std: 0.0


In [74]:
forecast_time, current_pred = strategy.predict(dataset, test_all=True)

freq: Day; period: 1

                It seems that the data is not regular. Please, check the data and the frequency info.                
                For multivariate regime it is critical to have regular data.
                For global regime each regular part of time series will be processed as separate time series.           
                


It is normal to see this warning when using the sliding window validation.

In [75]:
current_pred

Unnamed: 0,id,date,value
0,0,2020-01-08,1008.4361
1,0,2020-01-09,1008.769569
2,0,2020-01-10,1009.748942
3,1,2020-01-08,2008.084646
4,1,2020-01-09,2008.425281
5,1,2020-01-10,2009.385426
6,2,2020-01-08,3007.931338
7,2,2020-01-09,3008.294974
8,2,2020-01-10,3009.266806
9,3,2020-01-08,4007.738118


In the full output, once one window's forecasts end at 2020-01-10, the next window starts at 2020-01-09: consecutive windows overlap, which is what sliding window validation is expected to produce.

## <a id='toc2_4_'></a>[Working with raw time series' granularity](#toc0_)

Time series come in different granularities, from hourly and daily time series to more complex ones such as the end of each quarter.

If the series do not contain segments that are too short (shorter than history + horizon), `tsururu` will try to infer the granularity of the series on its own. The following types are currently supported:

- Yearly (and YearlyEnd)
- Quarterly (and QuarterlyEnd)
- Monthly (and MonthlyEnd)
- Weekly
- Daily
- Hourly
- Minutely
- Secondly
- Microsecondly

There is also support for compound granularities (10 days, 15 minutes, 32 seconds, etc.). You can check that the granularity was detected correctly in the output printed after the `TSDataset` object is created.

However, there are tricky cases (e.g. a 28-day step) where the granularity may be incorrectly guessed as monthly. For such cases you can set the granularity yourself using the `pd.DateOffset` class or related classes from `pandas.tseries.offsets`, passed as the `delta` parameter to `TSDataset`. The time column will then be processed according to your settings.
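The pandas side of this can be checked independently of the library. The sketch below uses made-up dates to show that a 28-day step has a constant gap between points and that `pd.DateOffset(days=28)` reproduces it exactly.

```python
import pandas as pd

# Six timestamps spaced exactly 28 days apart (made-up example dates).
dates = pd.date_range("2022-01-01", periods=6, freq="28D")

# The gap between consecutive points is constant ...
gaps = dates.to_series().diff().dropna().unique()
print(gaps)

# ... and a fixed 28-day offset continues the series correctly.
delta = pd.DateOffset(days=28)
next_date = dates[-1] + delta
print(next_date)
```

Because most of these dates fall in consecutive calendar months, a heuristic can plausibly mistake the step for a monthly one, which is the situation the explicit `delta` is meant to resolve.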

Consider a time series whose consecutive points are exactly __28 days apart__.

In [22]:
df_path_2 = "../datasets/global/simulated_data_to_check_28D.csv"

# Configure the features settings
dataset_params_2 = {
    "target": {
        "columns": ["value"],
        "type": "continuous",
    },
    "date": {
        "columns": ["date"],
        "type": "datetime",
    },
    "id": {
        "columns": ["id"],
        "type": "categorical",
    }
}

In [23]:
dataset_2 = TSDataset(
    data=pd.read_csv(df_path_2),
    columns_params=dataset_params_2,
    print_freq_period_info=True,
)

freq: Month; period: 1.0

                It seems that the data is not regular. Please, check the data and the frequency info.                
                For multivariate regime it is critical to have regular data.
                For global regime each regular part of time series will be processed as separate time series.           
                


We see that the frequency of the series is incorrectly detected as monthly. Let's pass the `delta` parameter explicitly.

In [24]:
dataset_2 = TSDataset(
    data=pd.read_csv(df_path_2),
    columns_params=dataset_params_2,
    delta=pd.DateOffset(days=28),
    print_freq_period_info=True,
)

Custom OffSet: <DateOffset: days=28>


Now the granularity is detected correctly.