In [1]:
import warnings

warnings.filterwarnings("ignore")

from typing import List, Optional, Union

import numpy as np
import pandas as pd

from tsururu.dataset import IndexSlicer, Pipeline, TSDataset
from tsururu.models import CatBoostRegressor_CV, MeanBlender, BestModel, ClassicBlender
from tsururu.strategies import (
    DirectStrategy,
    FlatWideMIMOStrategy,
    MIMOStrategy,
    RecursiveStrategy,
)
from tsururu.transformers import (
    DateSeasonsGenerator,
    DifferenceNormalizer,
    LagTransformer,
    LastKnownNormalizer,
    SequentialTransformer,
    StandardScalerTransformer,
    TargetGenerator,
    UnionTransformer,
)

In [2]:
def get_results(
    cv: int,
    regime: str,
    y_true: Optional[List[np.ndarray]] = None,
    y_pred: Optional[List[np.ndarray]] = None,
    ids: Optional[List[Union[float, str]]] = None,
) -> pd.DataFrame:
    def _get_fold_value(
        value: Optional[Union[float, np.ndarray]], idx: int
    ) -> List[Optional[Union[float, np.ndarray]]]:
        if value is None:
            return [None]
        if isinstance(value[idx], float):
            return value[idx]
        if isinstance(value[idx], np.ndarray):
            return value[idx].reshape(-1)
        raise TypeError(f"Unexpected value type. Value: {value}")

    df_res_dict = {}

    for idx_fold in range(cv):
        # Fill df_res_dict
        for name, value in [("y_true", y_true), ("y_pred", y_pred)]:
            df_res_dict[f"{name}_{idx_fold+1}"] = _get_fold_value(
                value, idx_fold
            )
        if regime != "local":
            df_res_dict[f"id_{idx_fold+1}"] = _get_fold_value(ids, idx_fold)

    # Assemble the per-fold columns into a single DataFrame
    df_res = pd.DataFrame(df_res_dict)
    return df_res

There are several main objects to be aware of when working with the library:
1) `TSDataset`.
2) `Pipeline` and `Transformers`.
3) `Strategy`.
4) `Model`.

### TSDataset

This class stores the data together with meta-information about it.

To initialise it, pass the data as a `pd.DataFrame` and define the meta-information about the column roles that are necessary for solving the time series forecasting task: `id`, `date`, `target`.
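
As a rough illustration of the expected layout, the sketch below uses plain Python records in long format (one record per `id` and `date` pair) and checks that every column named in a role config is present. The helper `missing_role_columns` is hypothetical and not part of `tsururu`; the records are only illustrative.

```python
# Hypothetical helper (not part of tsururu): report role columns that are
# absent from long-format records.
def missing_role_columns(records, columns_params):
    available = set(records[0].keys())
    missing = {}
    for role, spec in columns_params.items():
        absent = [col for col in spec["columns"] if col not in available]
        if absent:
            missing[role] = absent
    return missing


# Long format: one record per (id, date) pair, carrying the target value.
records = [
    {"id": 0, "date": "2016-07-01 00:00:00", "value": 5.827},
    {"id": 0, "date": "2016-07-01 00:15:00", "value": 5.760},
    {"id": 6, "date": "2016-11-23 02:45:00", "value": 2.884},
]

roles = {
    "target": {"columns": ["value"]},
    "date": {"columns": ["date"]},
    "id": {"columns": ["id"]},
}

print(missing_role_columns(records, roles))  # {}
```

An empty result means every role points at an existing column.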

In [3]:
df_path = "datasets/global/ettm1.csv"

dataset_params = {
    "target": {
        "columns": ["value"],
        "type": "continious",
    },
    "date": {
        "columns": ["date"],
        "type": "datetime",
    },
    "id": {
        "columns": ["id"],
        "type": "categorical",
    }
}

In [4]:
dataset = TSDataset(
    data=pd.read_csv(df_path),
    columns_params=dataset_params,
)

freq: less then Day (Hour, Min, Sec, etc); period: 900.0 seconds


### Pipeline and Transformers

#### What kind of transformers are there?

Special attention should be paid to the `Transformer` classes: these are the pipeline elements responsible for transforming the values of a series and generating features. The `Pipeline` class is a wrapper around the transformers that provides some additional methods on top of them.

There are two types of transformers used to assemble pipelines:
- `Union` transformers;
- `Sequential` transformers.

Below is a list of available Transformers:
- `StandardScalerTransformer` *(Series4Series)*.
- `DifferenceNormalizer` *(Series4Series)*: subtract the previous value or divide by it.
- `LastKnownNormalizer` *(Features4Features)*: normalize all lags by the last known one: divide by it or subtract.

These three transformers provide the flags `transform_features` / `transform_target`, which let you transform features and targets separately and obtain different results for each.

In addition, __DifferenceNormalizer__ and __LastKnownNormalizer__ can be applied in two regimes, `delta` and `ratio`: in the first, normalisation means subtracting the reference value (the previous value or the last known lag, respectively) from the current value; in the second, dividing the current value by it.
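
The two regimes can be sketched with plain Python lists. This is only an illustration of the arithmetic described above, not the library's implementation; the function names are hypothetical.

```python
def difference_normalize(series, regime="delta"):
    """Normalize each point by the previous one (the first point is dropped)."""
    out = []
    for prev, cur in zip(series[:-1], series[1:]):
        out.append(cur - prev if regime == "delta" else cur / prev)
    return out


def last_known_normalize(lags, regime="delta"):
    """Normalize every lag by the last known (most recent) lag."""
    last = lags[-1]
    return [x - last if regime == "delta" else x / last for x in lags]


series = [2.0, 4.0, 8.0]
print(difference_normalize(series, "delta"))  # [2.0, 4.0]
print(difference_normalize(series, "ratio"))  # [2.0, 2.0]
print(last_known_normalize(series, "delta"))  # [-6.0, -4.0, 0.0]
print(last_known_normalize(series, "ratio"))  # [0.25, 0.5, 1.0]
```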

- `LabelEncodingTransformer` and `OneHotEncodingTransformer` *(Series4Series)*: encoders for categorical features.
- `TimeToNumGenerator` and `DateSeasonsGenerator` *(Series4Series)*: generators of seasonal features from dates.
- `LagTransformer` *(Series4Features)*: generator of lags.

__The lag transformer must be present in the sequential transformer; otherwise, the features will not be generated.__

Finally, to generate targets, you need to use `TargetGenerator`.
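
Conceptually, lag and target generation amounts to sliding a window over the series: the last `lags` values become features and the next `horizon` values become targets. A minimal sketch in plain Python, assuming this windowing scheme (the helper name is hypothetical and the library's internals may differ):

```python
def make_lags_and_targets(series, lags, horizon):
    """Return (features, targets) pairs from a sliding window over the series."""
    rows = []
    for start in range(len(series) - lags - horizon + 1):
        features = series[start : start + lags]
        targets = series[start + lags : start + lags + horizon]
        rows.append((features, targets))
    return rows


rows = make_lags_and_targets([1, 2, 3, 4, 5, 6], lags=3, horizon=2)
for features, targets in rows:
    print(features, "->", targets)
# [1, 2, 3] -> [4, 5]
# [2, 3, 4] -> [5, 6]
```

Without a lag step there are no feature windows at all, which is why the lag transformer is mandatory.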

#### Transformers must be assembled in order!

The __SeriesToSeries__ transformers (tagged *Series4Series* above) should come first, followed by `LagTransformer` and `TargetGenerator` (__SeriesToFeatures__, tagged *Series4Features*), and then the __FeaturesToFeatures__ transformers (tagged *Features4Features*).

#### How to build a Pipeline?

So, there are two ways to build a pipeline from transformers: initialise the transformers of interest by hand, or use a config in the form of a dictionary. Let's look at both.

In [5]:
standard_scaler = StandardScalerTransformer(
    transform_features=True,
    transform_target=True
)

lag = LagTransformer(lags=3)
date_lag = LagTransformer(lags=3)
id_lag = LagTransformer(lags=1)

target_generator = TargetGenerator()

date_seasons = DateSeasonsGenerator(
    seasonalities=["doy", "m", "wd"],
    from_target_date=True,
)

In [6]:
union_1 = UnionTransformer(transformers_list=[lag, target_generator])

seq_1 = SequentialTransformer(transformers_list=[standard_scaler, union_1], input_features=["value"])
seq_2 = SequentialTransformer(transformers_list=[date_seasons, date_lag], input_features=["date"])
seq_3 = SequentialTransformer(transformers_list=[id_lag], input_features=["id"])

union = UnionTransformer(transformers_list=[seq_1, seq_2, seq_3])

In [7]:
pipeline_1 = Pipeline(union, multivariate=False)

In [8]:
pipeline_1.__dict__

{'transformers': <tsururu.transformers.base.UnionTransformer at 0x1ae6deba750>,
 'multivariate': False,
 'is_fitted': False,
 'strategy_name': None,
 'output_features': None,
 'y_original_shape': None}

Or:

In [9]:
pipeline_params = {
    "target": {
        "columns": ["value"],
        "features": {
            "StandardScalerTransformer":
                {
                    "transform_target": True, 
                    "transform_features": True
                },
            "LagTransformer": {"lags": 7},
        },
    },
    "date": {
        "columns": ["date"],
        "features": {
            "DateSeasonsGenerator": {
                # Use seasonality features from the date column as 
                # features with datetime lags
                # Possible values: [
                #    "y": year, "m": month, "d": day, 
                #    "wd": weekday, "doy": dayofyear,
                #    "hour": hour, "min": minute, "sec": second, 
                #    "ms": microsecond,  "ns": nanosecond
                # ]
                "seasonalities": ['doy', 'm', 'wd'], 
                # Use date from target point to make datetime features
                "from_target_date": True,
            },
            "LagTransformer": {"lags": 3}
        },
    },
    "id": {
        "columns": ["id"],
        "features": {
            "LagTransformer": {"lags": 1},
        },
    }
}

In [10]:
pipeline = Pipeline.from_dict(pipeline_params, multivariate=False)

#### Can I use exogenous variables in the pipeline?

Yes! Exogenous variables can also be specified here. Just include them in your pipeline.

However, they are currently tested only with the `MIMOStrategy` in global modelling. Support for additional variables in other strategies is under development.

In [11]:
pipeline_params["exog_group_1"] = {
     "columns": ["value"],
     "features": {
         "StandardScalerTransformer":
             {
                 "transform_target": False, 
                 "transform_features": True
             },
         "LagTransformer": {"lags": 7},
     },
 }

__Make sure you set the `transform_target=False` flag for exogenous features!__

### Model

The model is separate from the strategy. Any model can be run with any strategy, provided it supports the required input and output format.

One of the easiest options is to use gradient boosting (GBM).

In [12]:
# Configure the model parameters
model_params = {
    "loss_function": "MultiRMSE",
    "early_stopping_rounds": 100,
    "verbose": 500,
}

# Configure the validation parameters
validation_params = {
    "type": 'KFold',
    "n_splits": 2,
}

In [13]:
model = CatBoostRegressor_CV(validation_params, model_params)
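
As a rough illustration of what a `KFold` scheme with `n_splits=2` means, the sketch below partitions sample indices into contiguous folds using only the standard library. The helper `kfold_indices` is hypothetical, and whether `CatBoostRegressor_CV` shuffles the data or splits it contiguously is not stated here.

```python
def kfold_indices(n_samples, n_splits):
    """Split range(n_samples) into n_splits contiguous (train, validation) folds."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds get one additional sample
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds


for fold, (train, val) in enumerate(kfold_indices(5, 2)):
    print(f"Fold {fold}: train={train} val={val}")
# Fold 0: train=[3, 4] val=[0, 1, 2]
# Fold 1: train=[0, 1, 2] val=[3, 4]
```

Each sample lands in the validation part of exactly one fold, so one model is trained per fold.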

### Strategy

- _Recursive:_ 
    - one model for all points of the forecast horizon;
    - *training*: the model is trained to predict one point ahead;
    - *prediction*: a prediction is iteratively made one point ahead, and then this prediction is used to further shape the features in the test data. 
- _Recursive-reduced:_
    - one model for all points in the prediction horizon;
    - *training*: the model is trained to predict one point ahead;
    - *prediction*: features are generated for all test observations at once, unavailable values are replaced by NaN.
- _Direct:_ 
    - individual models for each point in the prediction horizon. 
- _MultiOutput (MIMO - Multi-input-multi-output):_
    - one model that learns to predict the entire prediction horizon. 
- _FlatWideMIMO:_
    - a mixture of Direct and MIMO: a single model is fitted, but it uses Direct's features unrolled over the horizon.
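
The difference between the Recursive, Direct and MIMO strategies can be sketched with a toy "model" that linearly extrapolates the last two points. This is only an illustration of the control flow under that assumption; none of the function names below come from `tsururu`.

```python
def one_step_model(window):
    # Toy stand-in for a fitted one-step model: linear extrapolation
    return window[-1] + (window[-1] - window[-2])


def recursive_forecast(series, horizon, history):
    """One model; each prediction is fed back into the feature window."""
    window = list(series[-history:])
    preds = []
    for _ in range(horizon):
        nxt = one_step_model(window)
        preds.append(nxt)
        window = window[1:] + [nxt]
    return preds


def make_direct_model(h):
    # Toy stand-in for a model trained to predict exactly h steps ahead
    return lambda window: window[-1] + h * (window[-1] - window[-2])


def direct_forecast(series, horizon, history):
    """A separate model per horizon step, all fed the same window."""
    window = series[-history:]
    models = [make_direct_model(h) for h in range(1, horizon + 1)]
    return [model(window) for model in models]


def mimo_forecast(series, horizon, history):
    """One model that outputs the whole horizon in a single call."""
    window = series[-history:]
    slope = window[-1] - window[-2]
    return [window[-1] + h * slope for h in range(1, horizon + 1)]


series = [1.0, 2.0, 3.0]
print(recursive_forecast(series, horizon=3, history=2))  # [4.0, 5.0, 6.0]
print(direct_forecast(series, horizon=3, history=2))     # [4.0, 5.0, 6.0]
print(mimo_forecast(series, horizon=3, history=2))       # [4.0, 5.0, 6.0]
```

The three variants agree here only because the toy model is exact on a linear series; with a real model, the recursive variant accumulates its own prediction errors while the other two do not reuse predictions as features.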

In [14]:
horizon = 3
history = 7
step = 1

strategy = RecursiveStrategy(horizon, history, step, model, pipeline)
strategy2 = DirectStrategy(horizon, history, step, model, pipeline)

strtgs = [strategy, strategy2]

In [15]:
# Blender test
from sklearn.ensemble import RandomForestRegressor


blender = ClassicBlender(strtgs, RandomForestRegressor(max_depth=30), 0.2)

In [16]:
fitted_blender = blender.fit(dataset)

freq: less then Day (Hour, Min, Sec, etc); period: 900.0 seconds
                      date  id  value
0      2016-07-01 00:00:00   0  5.827
1      2016-07-01 00:15:00   0  5.760
2      2016-07-01 00:30:00   0  5.760
3      2016-07-01 00:45:00   0  5.760
4      2016-07-01 01:00:00   0  5.693
...                    ...  ..    ...
432011 2016-11-23 02:45:00   6  2.884
432012 2016-11-23 03:00:00   6  2.955
432013 2016-11-23 03:15:00   6  3.236
432014 2016-11-23 03:30:00   6  3.447
432015 2016-11-23 03:45:00   6  3.447

[97552 rows x 3 columns]
freq: less then Day (Hour, Min, Sec, etc); period: 900.0 seconds
                      date  id   value
13936  2016-11-23 04:00:00   0  10.985
13937  2016-11-23 04:15:00   0  12.659
13938  2016-11-23 04:30:00   0  11.788
13939  2016-11-23 04:45:00   0  11.052
13940  2016-11-23 05:00:00   0  12.525
...                    ...  ..     ...
487755 2018-06-26 18:45:00   6   9.567
487756 2018-06-26 19:00:00   6   9.567
487757 2018-06-26 19:15:00   6   9.42

In [17]:
train_data = []
val_data = []

# Split each series chronologically: first 80% for training, the rest for validation
for _, group in dataset.data.groupby('id'):
    split_index = int(len(group) * 0.8)
    train_data.append(group.iloc[:split_index])
    val_data.append(group.iloc[split_index:])

train_data = pd.concat(train_data)
val_data = pd.concat(val_data)

train_dataset = TSDataset(train_data, dataset.columns_params, dataset.delta)
val_dataset = TSDataset(val_data, dataset.columns_params, dataset.delta)

freq: less then Day (Hour, Min, Sec, etc); period: 900.0 seconds
freq: less then Day (Hour, Min, Sec, etc); period: 900.0 seconds


In [18]:
train_dataset.data[train_dataset.data.id==0]

| | date | id | value |
|---|---|---|---|
| 0 | 2016-07-01 00:00:00 | 0 | 5.827 |
| 1 | 2016-07-01 00:15:00 | 0 | 5.760 |
| 2 | 2016-07-01 00:30:00 | 0 | 5.760 |
| 3 | 2016-07-01 00:45:00 | 0 | 5.760 |
| 4 | 2016-07-01 01:00:00 | 0 | 5.693 |
| ... | ... | ... | ... |
| 55739 | 2018-02-01 14:45:00 | 0 | 4.957 |
| 55740 | 2018-02-01 15:00:00 | 0 | 5.961 |
| 55741 | 2018-02-01 15:15:00 | 0 | -1.474 |
| 55742 | 2018-02-01 15:30:00 | 0 | -2.143 |


In [19]:
val_dataset.data[val_dataset.data.id==0]

| | date | id | value |
|---|---|---|---|
| 55744 | 2018-02-01 16:00:00 | 0 | 3.684 |
| 55745 | 2018-02-01 16:15:00 | 0 | 5.693 |
| 55746 | 2018-02-01 16:30:00 | 0 | 11.721 |
| 55747 | 2018-02-01 16:45:00 | 0 | 11.922 |
| 55748 | 2018-02-01 17:00:00 | 0 | 13.329 |
| ... | ... | ... | ... |
| 69675 | 2018-06-26 18:45:00 | 0 | 9.310 |
| 69676 | 2018-06-26 19:00:00 | 0 | 10.114 |
| 69677 | 2018-06-26 19:15:00 | 0 | 10.784 |
| 69678 | 2018-06-26 19:30:00 | 0 | 11.655 |


In [20]:
fit_time, _ = strategy.fit(dataset)

0:	learn: 0.9745768	test: 0.9727270	best: 0.9727270 (0)	total: 9.51ms	remaining: 9.5s
500:	learn: 0.2140805	test: 0.2164315	best: 0.2164315 (500)	total: 3.23s	remaining: 3.22s
999:	learn: 0.2104164	test: 0.2147511	best: 0.2147511 (999)	total: 6.43s	remaining: 0us

bestTest = 0.2147510735
bestIteration = 999

Fold 0:
MultiRMSE: 0.21475107347532177
0:	learn: 0.9727297	test: 0.9745896	best: 0.9745896 (0)	total: 8.82ms	remaining: 8.82s
500:	learn: 0.2141088	test: 0.2163332	best: 0.2163332 (500)	total: 3.22s	remaining: 3.21s
999:	learn: 0.2101859	test: 0.2146711	best: 0.2146711 (999)	total: 6.4s	remaining: 0us

bestTest = 0.2146710724
bestIteration = 999

Fold 1:
MultiRMSE: 0.21467107242976863
Mean MultiRMSE: 0.2147
Std: 0.0


In [21]:
forecast_time, current_pred = strategy.predict(dataset)
current_pred

freq: less then Day (Hour, Min, Sec, etc); period: 900.0 seconds


| | id | date | value |
|---|---|---|---|
| 0 | 0 | 2018-06-26 20:00:00 | 12.333083 |
| 1 | 0 | 2018-06-26 20:15:00 | 12.13246 |
| 2 | 0 | 2018-06-26 20:30:00 | 12.207673 |
| 3 | 1 | 2018-06-26 20:00:00 | 3.671136 |
| 4 | 1 | 2018-06-26 20:15:00 | 3.653514 |
| 5 | 1 | 2018-06-26 20:30:00 | 3.679937 |
| 6 | 2 | 2018-06-26 20:00:00 | 4.554753 |
| 7 | 2 | 2018-06-26 20:15:00 | 4.518773 |
| 8 | 2 | 2018-06-26 20:30:00 | 4.512365 |
| 9 | 3 | 2018-06-26 20:00:00 | 1.500343 |


In [22]:
#best_preds, best_strat = blender.fit_predict(strtgs, dataset)

In [23]:
#np.mean(strategy.models[0].scores)

In [24]:
#scores = []
#for model in strategy2.models:
#    score = np.mean(model.scores)
#    scores.append(score)

In [25]:
#np.mean(scores)

In [26]:
#best_strat

In [27]:
#best_preds

In [28]:
#fit_time, _ = strategy.fit(dataset)
#fit_time2, _ = strategy2.fit(dataset)

In [29]:
#forecast_time, current_pred = strategy.predict(dataset)

# Blender test
#forecast_time2, current_pred2 = strategy2.predict(dataset)

In [30]:
#current_pred.shape

In [31]:
#current_pred2

In [32]:
#final_preds = blender.fit_predict([current_pred.value, current_pred2.value])

## Backtest validation of pipeline

In [33]:
#ids, test, pred, fit_time, forecast_time = strategy.back_test(dataset, cv=1)

# Blender test
#ids2, test2, pred2, fit_time2, forecast_time2 = strategy2.back_test(dataset, cv=1)

In [34]:
#pred

In [35]:
#test

In [36]:
#pred2

In [37]:
# Synthetic preds
#pred3 = [el/2 for el in pred2]
#pred3

In [38]:
# Mean blender
#final_preds = blender.fit_predict([pred, pred2, pred3])
#final_preds

In [39]:
#(pred[0]+pred2[0]+pred3[0])/3 == final_preds[0]

In [40]:
#get_results(cv=1, regime="global", y_true=test, y_pred=pred, ids=ids)

## Working with raw time series' granularity

Time series come in different granularities, from hourly and daily time series to more complex ones such as the end of each quarter.

If the series do not contain segments that are too short (shorter than history + horizon), `tsururu` will try to infer the granularity of the series on its own. We currently support the following types:

- Yearly (and YearlyEnd)
- Quarterly (and QuarterlyEnd)
- Monthly (and MonthlyEnd)
- Weekly
- Daily
- Hourly
- Minutely
- Secondly
- Microsecondly

There is also support for compound granularities (10 days, 15 minutes, 32 seconds, etc.). You can check that the granularity was detected correctly from the output printed after the `TSDataset` object has been created.

However, there are tricky situations (e.g. 28 days) where a monthly granularity may be guessed incorrectly. Therefore, you can set your own granularity using the `pd.DateOffset` class or related classes from `pandas.tseries.offsets`, passed as the `delta` parameter to `TSDataset`. The time column will then be processed according to the user's settings.
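
To see why a 28-day step can resemble a monthly one, the sketch below generates a few points 28 days apart with the standard library. The start date is an arbitrary choice made for illustration.

```python
from datetime import date, timedelta

# Four points spaced exactly 28 days apart, starting on 1 February 2021
start = date(2021, 2, 1)
points = [start + timedelta(days=28 * i) for i in range(4)]
gaps = [(b - a).days for a, b in zip(points[:-1], points[1:])]

print([p.isoformat() for p in points])
# ['2021-02-01', '2021-03-01', '2021-03-29', '2021-04-26']
print(gaps)  # [28, 28, 28]
```

The first two points fall on the first day of consecutive months, so a heuristic that looks at calendar months can mistake the series for a monthly one even though every gap is 28 days.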

Consider a time series in which consecutive points are exactly __28 days__ apart.

In [41]:
df_path_2 = "datasets/global/simulated_data_to_check_28D.csv"

# Configure the features settings
dataset_params_2 = {
    "target": {
        "columns": ["value"],
        "type": "continious",
    },
    "date": {
        "columns": ["date"],
        "type": "datetime",
    },
    "id": {
        "columns": ["id"],
        "type": "categorical",
    }
}

In [42]:
dataset_2 = TSDataset(
    data=pd.read_csv(df_path_2),
    columns_params=dataset_params_2,
)

freq: Month; period: 1.0


We see that the frequency of the series is incorrectly defined as monthly. Let's try to pass the `delta` parameter.

In [43]:
dataset_2 = TSDataset(
    data=pd.read_csv(df_path_2),
    columns_params=dataset_params_2,
    delta=pd.DateOffset(days=28),
)

Custom OffSet: <DateOffset: days=28>


Now it's all detected correctly.