# Explaining Lag/Lead Operations and Feature Origin Times

## Introduction
This notebook illustrates a couple of concepts that usually only make sense within a time series forecasting scenario:
1. lag/lead operations
1. `origin_time` for features

We start by loading all the necessary components from the AML Package for Forecasting (AMLPF).

In [1]:
import pandas as pd
import numpy as np
from ftk import TimeSeriesDataFrame
from ftk.transforms import LagLeadOperator
print('import done')

import done


We will make a tiny data frame and convert it into a `TimeSeriesDataFrame` to enable AMLPF operations.

In [2]:
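# Three daily observations for a single brand.
small_df = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03']),
    'brand': ['A'] * 3,
    'sales': [1.0, 2.0, 3.0],
})
# 'date' is the time index, 'brand' identifies the series (grain), 'sales' is the target.
tsdf = TimeSeriesDataFrame(small_df, time_colname='date', grain_colnames='brand', ts_value_colname='sales')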
tsdf

date,brand,sales
2018-01-01,A,1.0
2018-01-02,A,2.0
2018-01-03,A,3.0


## How Lags Work
In time series, you use the information about the past to predict the future. Lags are previous values of a time series that are often highly predictive of what will happen in the future. 

In AMLPF we use the following definition of `Lag1`:
* For a time series `x`, the value of `x_lag1` is the most recent value of `x` that is known to us at the time of constructing the forecast.

Mathematically, $L(x_t) = x_{t-1}$ for a time series $x_t$.

Let's work through an example. It implicitly assumes we are interested in a one-step-ahead forecast, an assumption we will relax in a bit.
1. Consider forecasting `sales` on `2018-01-01`. In a one-step-ahead scenario, the forecast would have to be constructed on `2017-12-31`. The most recent value of `sales` available on that date is not known to us - lag features almost always have this "initial condition" problem. Hence the value of `sales_lag1` for `2018-01-01` should be `NaN`.
1. Next, consider forecasting `sales` on `2018-01-02`. In a one-step-ahead scenario, the forecast would have to be constructed on `2018-01-01`. The most recent value of `sales` available on that date is the value of `sales` from `2018-01-01`, which is `1.0`. Hence the value of `sales_lag1` for `2018-01-02` should be `1.0`.
1. By the same line of reasoning, the value of `sales_lag1` for `2018-01-03` should be `2.0`.
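
The three steps above can be reproduced with plain `pandas`. This is a minimal sketch that assumes a regular daily series with no gaps; it uses `Series.shift` for illustration and is not the AMLPF implementation.

```python
import pandas as pd

# Same toy series as in the notebook: daily sales for one brand.
sales = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03']),
    name='sales',
)

# One-step-ahead Lag1: each date sees the value from the previous row.
# The first date has no predecessor, so it gets NaN
# (the "initial condition" problem described above).
sales_lag1 = sales.shift(1).rename('sales_lag1')

print(pd.concat([sales, sales_lag1], axis=1))
```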

Closely related to lags is the concept of "leads". Simply put, the lead operator is the inverse of the lag operator, so $F(x_t)=x_{t+1}$ for a time series $x_t$. As one would imagine, $F(L(x_t))=x_t$. 
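
As a quick check of the identity $F(L(x_t))=x_t$, the sketch below uses `pandas` `shift` as a stand-in for both operators. That is an assumption made for illustration, not a description of how AMLPF implements them. On a finite sample the identity holds only where both shifts are defined, so the last point is lost.

```python
import pandas as pd

x = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03']),
)

lag = x.shift(1)              # L(x_t) = x_{t-1}
lead_of_lag = lag.shift(-1)   # F(L(x_t)) = x_t, wherever both shifts are defined

# The first two values match x; the last one is NaN because
# the lead has nothing to pull from beyond the end of the sample.
print(lead_of_lag)
```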

With a clear understanding of `Lag1`, understanding all other lags and leads is straightforward:
* Lag 2 is the value immediately preceding the Lag 1 value.
* Lead 0 is the value immediately following the Lag 1 value.
* All other lags and leads will work similarly with the corresponding offset.

Let's now demonstrate this calculation using the `LagLeadOperator()` from the AMLPF. 
* Notice that AMLPF uses "lead" to signal that the feature uses values that are generally not known at the time of forecast creation. 
* Most of the time we would not use such features, but sometimes we may. 
  * An example would be using future prices to forecast sales of products for which we have the ability to set prices in advance.

In [3]:
lags_dict = {'sales': list(range(0,3))}
lag_transform = LagLeadOperator(lags_to_construct=lags_dict)
lag_transform.fit(tsdf)
lag_transform.transform(tsdf)

date,brand,origin,sales,sales_lead0,sales_lag1,sales_lag2
2018-01-01,A,2017-12-31,1.0,1.0,,
2018-01-02,A,2018-01-01,2.0,2.0,1.0,
2018-01-03,A,2018-01-02,3.0,3.0,2.0,1.0


## Explaining `origin_time`
The above output contains a new index column, `origin`. This column is added by the `LagLeadOperator` to capture the concept of `origin_time`, an important forecasting concept that is not always articulated clearly. 
Here is the definition used in AMLPF:
* `origin_time` represents the latest date from which actual values of all features are assumed to be known with certainty.


In the first output from `LagLeadOperator()`, we saw the `origin` column added automatically. It uniformly lags the `date` index column by one day. The temporal frequency of the data was inferred automatically by the `LagLeadOperator()`, and the shift by one period is applied because the default value of the `max_horizon` input argument is `1`. 
* When the same date is forecasted from more than one origin, you can construct the difference between origins and dates and use it as a feature. It is commonly referred to as `horizon`, and the `GrainIndexFeaturizer()` module in AMLPF can be used to construct it automatically.
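
The horizon feature mentioned above is just the difference between `date` and `origin`. The following sketch computes it with plain `pandas` on a hand-written set of date/origin pairs, assuming daily data. The `pairs` frame is hypothetical, and this is not the `GrainIndexFeaturizer()` implementation.

```python
import pandas as pd

# Hypothetical (date, origin) pairs in which each date is forecast from two origins.
pairs = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-02', '2018-01-02', '2018-01-03', '2018-01-03']),
    'origin': pd.to_datetime(['2017-12-31', '2018-01-01', '2018-01-01', '2018-01-02']),
})

# Horizon = number of days between the forecast origin and the forecasted date.
pairs['horizon'] = (pairs['date'] - pairs['origin']).dt.days

print(pairs)
```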

We will now illustrate what happens when we are interested in the multi-step forecast. We will first rerun the above code with the `max_horizon` set to `2` and then discuss the results.

In [4]:
lag_transform_horizon2 = LagLeadOperator(lags_to_construct=lags_dict, max_horizon=2)
lag_transform_horizon2.fit(tsdf)
lag_transform_horizon2.transform(tsdf)

date,brand,origin,sales,sales_lead0,sales_lag1,sales_lag2
2018-01-01,A,2017-12-30,1.0,,,
2018-01-01,A,2017-12-31,1.0,1.0,,
2018-01-02,A,2017-12-31,2.0,1.0,,
2018-01-02,A,2018-01-01,2.0,2.0,1.0,
2018-01-03,A,2018-01-01,3.0,2.0,1.0,
2018-01-03,A,2018-01-02,3.0,3.0,2.0,1.0


Now the `transform()` call returns two rows for each `date`, because we set `max_horizon=2`. Each date in the `origin` column is either `1` or `2` days behind the corresponding `date` value. 
Notice how the value of `sales_lag1` differs for the same `date` depending on `origin`. Consider making forecasts for `2018-01-02` and examine the values of `sales_lag1`:
1. When the forecast is made on `2017-12-31`, i.e. a two-steps-ahead forecast, the value of `sales_lag1` is not available, because no `sales` values from `2017` are known to us.
1. But when the forecast is made on `2018-01-01`, i.e. a one-step-ahead forecast, the value of `sales_lag1` _is_ available, because `sales` on `2018-01-01` is known and equals `1.0`. 

The exact same logic applies to lags of higher order, and to leads. 
  * Notice, for example, that `sales_lead0` values are equal to `sales` values only in the one-step-ahead subset of the data. In the two-steps-ahead subset, the values differ. 

Pay particular attention to row 5, where the `date` is `2018-01-03` and `origin` is `2018-01-01`. This row illustrates how AMLPF's definition of `Lag1` handles multi-step forecasts: 
  * A naive application of the lag operation would suggest that the value of `sales_lag1` for that row should be the value of `sales` from `2018-01-02`. This is, after all, the value from one day prior to the date being forecasted.
  * That naive interpretation would force us to use `NaN` instead of `1.0`, because on `2018-01-01` the value of `sales` on `2018-01-02` is not yet known.
  * Instead, we fall back to our definition: _the value of `lag1` is the most recent value that is known to us at the time of constructing the forecast_. The most recent value of `sales` known on `2018-01-01` is `1.0`, which is what we use.
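
To make the origin-relative definition concrete, the sketch below rebuilds the same table with plain `pandas`. It assumes a single daily series without gaps. The helper `lags_by_origin` is a hypothetical name introduced here, and the code is only a re-derivation of the output above, not the AMLPF implementation.

```python
import pandas as pd

dates = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])
sales = pd.Series([1.0, 2.0, 3.0], index=dates, name='sales')


def lags_by_origin(series, lags, max_horizon):
    """Hypothetical helper: lag/lead features for every (date, origin) pair.

    Only non-negative offsets are handled; offset 0 is named 'lead0'.
    """
    frames = []
    for h in range(1, max_horizon + 1):
        frame = pd.DataFrame({
            'date': series.index,
            'origin': series.index - pd.Timedelta(days=h),
            series.name: series.to_numpy(),
        })
        for k in lags:
            name = f'{series.name}_lead0' if k == 0 else f'{series.name}_lag{k}'
            # Lag k counts back from the origin, not from the date:
            # an h-step-ahead row sees the value k + h - 1 rows back.
            frame[name] = series.shift(k + h - 1).to_numpy()
        frames.append(frame)
    return (pd.concat(frames)
              .sort_values(['date', 'origin'])
              .reset_index(drop=True))


features = lags_by_origin(sales, lags=[0, 1, 2], max_horizon=2)
print(features)
```

With a zero-based index, the row discussed above is `features.loc[4]`: its `origin` is `2018-01-01` and its `sales_lag1` is `1.0`, because the shift is `1 + 2 - 1 = 2` rows rather than one.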

## Caching example with train/test split
In a typical data science scenario, the training-testing data split happens randomly, because all records are interchangeable. In time series, this is a recipe for overfitting, because future observations leak into the training set. Splits have to be done by time, with older data in the training set and newer data in the test set.
This creates a subtle issue with featurizers such as lags: you need to use some values from the training data to "patch the holes" in the testing data. AMLPF handles this automatically for you.
The following example illustrates the problem:


In [5]:
train_tsdf = tsdf.iloc[0:2, :]
train_tsdf

date,brand,sales
2018-01-01,A,1.0
2018-01-02,A,2.0


In [6]:
test_tsdf  = tsdf.iloc[2:3, :]
test_tsdf

date,brand,sales
2018-01-03,A,3.0
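
The positional `iloc` split above works because the rows are already sorted by time. On larger data it is often more convenient to split on a cutoff date. Here is a minimal sketch with a plain `pandas` data frame; the `cutoff` value is a hypothetical choice for this toy data.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03']),
    'brand': ['A'] * 3,
    'sales': [1.0, 2.0, 3.0],
}).sort_values('date')

cutoff = pd.Timestamp('2018-01-02')  # hypothetical last training date

# Older data goes to training, newer data to testing; no shuffling.
train = df[df['date'] <= cutoff]
test = df[df['date'] > cutoff]

print(len(train), len(test))
```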


We will compute `sales_lag1` via the built-in `pandas` method `shift()` first:

In [7]:
naive_lag_train = train_tsdf.shift(1)
naive_lag_train.columns = ['sales_lag1']
naive_lag_train

date,brand,sales_lag1
2018-01-01,A,
2018-01-02,A,1.0


In [8]:
naive_lag_test = test_tsdf.shift(1)
naive_lag_test.columns = ['sales_lag1']
naive_lag_test

date,brand,sales_lag1
2018-01-03,A,


This is not what we wanted, but `pandas` did exactly what we asked it to do. Operations on the test data frame have no knowledge of the training data and the holes cannot be "patched". 
Now let us do the same thing with `LagLeadOperator()`.

In [9]:
lag_with_split = LagLeadOperator(lags_to_construct={'sales': 1})
lag_with_split.fit(train_tsdf)
lag_with_split.transform(train_tsdf)

date,brand,origin,sales,sales_lag1
2018-01-01,A,2017-12-31,1.0,
2018-01-02,A,2018-01-01,2.0,1.0


In [10]:
lag_with_split.transform(test_tsdf)

date,brand,origin,sales,sales_lag1
2018-01-03,A,2018-01-02,3.0,2.0


Now this is the result we wanted! And the training data frame was also handled correctly!
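
For intuition, the "patching" can be imitated in plain `pandas` by keeping the tail of the training data and prepending it to the test data before shifting. This is only a sketch of the general idea, under the assumption of a single gap-free daily series. It does not describe AMLPF internals, and the names `cache` and `max_lag` are introduced here for illustration.

```python
import pandas as pd

dates = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])
sales = pd.Series([1.0, 2.0, 3.0], index=dates, name='sales')
train, test = sales.iloc[:2], sales.iloc[2:]

max_lag = 1
# "Cache" the last max_lag training rows so the test lags have history to draw on.
cache = train.iloc[-max_lag:]

# Shift over cache + test, then keep only the test dates.
patched = pd.concat([cache, test]).shift(max_lag).loc[test.index]
sales_lag1_test = patched.rename('sales_lag1')

print(sales_lag1_test)
```

The value for `2018-01-03` is now `2.0`, matching the `LagLeadOperator()` output above, whereas shifting the test frame alone produced `NaN`.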