In [2]:
#skip
%run 00_basic.ipynb

# Time Series Generators in Keras

Time-series analysis requires transforming the input data into a different structure before it can be fed to an LSTM.

The Keras DNN library provides the TimeseriesGenerator to facilitate one such transformation: arranging the input into batches of windows.

As a motivating example, suppose we have the time series
$[0,1,2,3,4,5,6,7,8,9]$ and we use the past 2 readings to predict the current value. We then need to transform the data as:

```
     X,     y
    [0,1]   2
    [1,2]   3
    ...
    [6,7]   8
    [7,8]   9
```

There are many ways to do this. However, TimeseriesGenerator makes this transformation a breeze. In addition, TimeseriesGenerator does this automatically during training, which saves a large amount of memory and preprocessing overhead.
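As a minimal sketch of the transformation in the table above, the windows can be built with plain NumPy. This assumes NumPy 1.20 or later for `sliding_window_view`; the variable names `series`, `history`, `X` and `y` are illustrative and do not come from Keras.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10)   # the example series [0, 1, ..., 9]
history = 2              # number of past readings per sample

# All windows of length `history`; the last window is dropped
# because no reading follows it to serve as a target.
X = sliding_window_view(series, history)[:-1]
y = series[history:]

for window, target in zip(X, y):
    print(window, "=>", target)
```

Here `X` is a read-only view on `series` rather than a copy, so this particular sketch does not duplicate the data.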

## Manually Converting Data

The code below shows one form of a function that performs this transformation, along with its usage.
We use this dataframe in all our examples to demonstrate the various functions.

In [291]:
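#--------------------------------------------------------------------------------
# Manual way to transform data for LSTM
#--------------------------------------------------------------------------------
import numpy as np
import pandas as pd

df = pd.DataFrame({0: range(1, 16), 1: range(10, 160, 10)})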

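def get_data(dataset, target, start, end, history, target_size, skip=1, oneStep=True):
    """Build (window, label) pairs from `dataset` and `target`.

    history:     number of past rows spanned by each window
    target_size: offset of the label from the first row after the window
                 (oneStep=True), or the number of consecutive values in
                 the label (oneStep=False)
    skip:        step between the rows sampled inside a window
    """
    data   = []
    labels = []

    start = start + history
    if end is None:
        end = len(dataset) - target_size

    for i in range(start, end):
        indices = range(i-history, i, skip)
        if len(dataset.shape) <= 1 or dataset.shape[-1] == 1:
            dt = dataset[indices]
        else:
            # Keep one row per time step and one column per feature
            dt = np.reshape(dataset[indices], (-1, dataset.shape[-1]))
        data.append(dt)

        if oneStep:
            labels.append(target[i+target_size])    # single value
        else:
            labels.append(target[i:i+target_size])  # next target_size values

    return np.array(data), np.array(labels)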

# Example Data Frame 
df

Unnamed: 0,0,1
0,1,10
1,2,20
2,3,30
3,4,40
4,5,50
5,6,60
6,7,70
7,8,80
8,9,90
9,10,100


In [323]:
dv1=df.values[:,0]#.reshape((15,1))
#dv1=df.values
data, labels=get_data(dv1, dv1,0,None,history=2, target_size=0)
data.shape, data, labels
for i in range(len(data)):
    #ll= [list(g) for g in data[i]] # for clean formatted o/p
    ll= data[i] # for clean formatted o/p
    print(f'\t{ll} => {labels[i]}')
    if (i > 3): break;

	[1 2] => 3
	[2 3] => 4
	[3 4] => 5
	[4 5] => 6
	[5 6] => 7


In [324]:
dv1=df.values[:,0].reshape((15,1))
#dv1=df.values
data, labels=get_data(dv1, dv1,0,None,history=3, target_size=0)
data.shape, data, labels
for i in range(len(data)):
    ll= [list(g) for g in data[i]] # for clean formatted o/p
    print(f'\t{ll} => {labels[i]}')
    if (i > 3): break;    

	[[1], [2], [3]] => [4]
	[[2], [3], [4]] => [5]
	[[3], [4], [5]] => [6]
	[[4], [5], [6]] => [7]
	[[5], [6], [7]] => [8]


In [332]:
dv1=df.values
# target_size indicates how far ahead the predicted value lies
# skip indicates the increment of the range used to sample values
data, labels=get_data(dv1, dv1[:,0],0,None,history=2, target_size=0, skip=1, oneStep=True)
data.shape, data, labels
for i in range(len(data)):
    ll= [list(g) for g in data[i]] # for clean formatted o/p
    print(f'\t{ll} => {labels[i]}')
    if (i > 3): break;    

	[[1, 10], [2, 20]] => 3
	[[2, 20], [3, 30]] => 4
	[[3, 30], [4, 40]] => 5
	[[4, 40], [5, 50]] => 6
	[[5, 50], [6, 60]] => 7


In [305]:
dv1=df.values
# oneStep is False, which means we are doing multi-step predictions
# target_size doubles as the number of predictions here; this need not always be the case
#
data, labels=get_data(dv1, dv1[:,0],0,None,history=3, target_size=2, skip=2, oneStep=False)
data.shape, data, labels
for i in range(len(data)):
    ll= [list(g) for g in data[i]] # for clean formatted o/p
    print(f'\t{ll} => {labels[i]}')
    if (i > 3): break;    

	[[1, 10], [3, 30]] => [4 5]
	[[2, 20], [4, 40]] => [5 6]
	[[3, 30], [5, 50]] => [6 7]
	[[4, 40], [6, 60]] => [7 8]
	[[5, 50], [7, 70]] => [8 9]


# Using Time Series Generator

As the example above shows, the dataset can be prepared ahead of time, but this has disadvantages: the windowed copy uses memory proportional to $history$ times the length of the series.

In addition, managing and maintaining this code can become cumbersome.

For a data set of length $n$, the windowed copy holds $(n - history) \times history$ values. Preparing it beforehand also takes time, not to mention the complexity of getting the dimensions correct. Instead, one can use TimeseriesGenerator, as shown below, which feeds the LSTM each batch just when it is needed.
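To make the memory estimate concrete, here is a small arithmetic sketch. The series length, the window size and the `float64` storage are hypothetical values chosen for illustration only.

```python
import numpy as np

n, history = 1_000_000, 100                # hypothetical series length and window size
itemsize = np.dtype(np.float64).itemsize   # bytes per stored value

raw_bytes = n * itemsize                               # the series itself
windowed_bytes = (n - history) * history * itemsize    # every window materialised

print(f"raw series:      {raw_bytes / 1e6:.0f} MB")
print(f"windowed copies: {windowed_bytes / 1e6:.0f} MB")
```

Under these assumptions the windowed copy is roughly `history` times larger than the raw series, which is the overhead a generator avoids.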

In addition, TimeseriesGenerator can arrange the dataset into batches dynamically. This is especially suitable when the data arrives incrementally.
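The idea of producing batches on demand can be sketched with an ordinary Python generator. The helper `window_batches` below is hypothetical and is not how Keras implements TimeseriesGenerator; it only illustrates that each batch is built when it is requested, so the full windowed array never exists in memory.

```python
import numpy as np

def window_batches(data, targets, length, batch_size):
    """Yield (X, y) batches lazily. Hypothetical helper, not part of Keras."""
    target_idx = range(length, len(data))      # index of each target
    for b in range(0, len(target_idx), batch_size):
        idx = target_idx[b:b + batch_size]
        # Each window holds the `length` rows that precede its target
        X = np.stack([data[i - length:i] for i in idx])
        y = np.array([targets[i] for i in idx])
        yield X, y

series = np.arange(1, 16)
batches = list(window_batches(series, series, length=2, batch_size=4))
first_X, first_y = batches[0]
print(first_X, first_y)
```

Calling `list(...)` here is only for inspection; during training one would iterate over the generator directly.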

In [326]:
import tensorflow as tf
from tensorflow import keras
from numpy import array
from keras.preprocessing.sequence import TimeseriesGenerator

ts = df.values[:,0]
print("---> length=2, batch_size=1")
gn1 = TimeseriesGenerator(ts, ts, length=2, batch_size=1)
for i in range(len(gn1)):
    x, y = gn1[i]
    print(f'\t{x} => {y}')
    if i > 3: break

print("---> length=3, batch_size=1")
gn2 = TimeseriesGenerator(ts, ts, length=3, batch_size=1)
for i in range(len(gn2)):
    x, y = gn2[i]
    ll = [list(g) for g in x]  # nested lists print more compactly
    print(f'\t{ll} => {y}')
    if i > 3: break

---> length=2, batch_size=1
	[[1 2]] => [3]
	[[2 3]] => [4]
	[[3 4]] => [5]
	[[4 5]] => [6]
	[[5 6]] => [7]
---> length=3, batch_size=1
	[[1, 2, 3]] => [4]
	[[2, 3, 4]] => [5]
	[[3, 4, 5]] => [6]
	[[4, 5, 6]] => [7]
	[[5, 6, 7]] => [8]


In the above example, used as-is, only one column of the input would be used to make predictions. Besides, the data needs to be normalized, with the normalization statistics computed on the training set only.
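A minimal sketch of normalizing with training-set statistics only is shown here. The split point `split` and the choice of mean and standard deviation scaling are assumptions made for illustration; the array mirrors the two columns of the sample dataframe.

```python
import numpy as np

# Same values as the sample dataframe: column 0 is 1..15, column 1 is 10..150
data = np.column_stack([np.arange(1, 16), np.arange(10, 160, 10)]).astype(float)
split = 10                        # hypothetical train/test boundary

train, test = data[:split], data[split:]
mean = train.mean(axis=0)         # statistics come from the training rows only
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics on the test rows
```

Computing `mean` and `std` on the whole array instead would leak information about the test rows into training.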

In the next example, we will use all the columns from the sample data set, set a batch size, normalize the data, and see a complete example.

In [448]:

data = df.values
label= df.values[:,0]
gn3 = TimeseriesGenerator(data, label, length=2, batch_size=1)

for i in range(len(gn3)):
    x, y = gn3[i]    
    ll= [[list(g2) for g2 in g1] for g1 in x] # for clean formatted o/p
    print(f'\t{ll} => {y:}')
    if (i > 3): break;            

	[[[1, 10], [2, 20]]] => [3]
	[[[2, 20], [3, 30]]] => [4]
	[[[3, 30], [4, 40]]] => [5]
	[[[4, 40], [5, 50]]] => [6]
	[[[5, 50], [6, 60]]] => [7]


# Complicated example:

To reproduce the multi-step prediction example shown earlier with a different **sampling rate**, you must construct the labels carefully. Hopefully you won't need multi-step prediction, because of the following conjecture.

**Conjecture**: An LSTM trained to predict one step will have an overall loss less than or equal to the loss of an LSTM trained for multi-step prediction.\
**Intuition**: This is not proven here; it is offered as a generally accepted rule of thumb. The intuition is that an LSTM uses one set of weights to predict its output. In the multi-step case, the weights are adjusted to reduce the overall error across all predicted steps, so backpropagation may not reduce the loss for any single step as precisely.


In [525]:
data = df.values
# Construct the labels carefully; it may help to write the layout down first:
l1= df[0].values
l2= pd.DataFrame({0:l1[:-1], 1: l1[1:]})
label=np.append([0,0], l2.values).reshape(15,2)
label

gn3 = TimeseriesGenerator(data, label, length=4, batch_size=1, stride=1, sampling_rate=2)

for i in range(len(gn3)):
    x, y = gn3[i]    
    ll= [[list(g2) for g2 in g1] for g1 in x] # for clean formatted o/p
    print(f'\t{ll} => {y:}')
    #if (i > 3): break;                

	[[[1, 10], [3, 30]]] => [[4 5]]
	[[[2, 20], [4, 40]]] => [[5 6]]
	[[[3, 30], [5, 50]]] => [[6 7]]
	[[[4, 40], [6, 60]]] => [[7 8]]
	[[[5, 50], [7, 70]]] => [[8 9]]
	[[[6, 60], [8, 80]]] => [[ 9 10]]
	[[[7, 70], [9, 90]]] => [[10 11]]
	[[[8, 80], [10, 100]]] => [[11 12]]
	[[[9, 90], [11, 110]]] => [[12 13]]
	[[[10, 100], [12, 120]]] => [[13 14]]
	[[[11, 110], [13, 130]]] => [[14 15]]
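The label layout used above can be generalized to any number of predicted steps. The helper `multistep_labels` below is a hypothetical sketch, assuming NumPy 1.20 or later: row `j` holds the last `steps` readings up to and including index `j`, and the first `steps - 1` rows are zero padding, matching the `[0, 0]` row prepended in the cell above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def multistep_labels(series, steps):
    """Row j holds series[j-steps+1 .. j]; earlier rows are zero padding.

    Hypothetical helper that generalises the manual label construction.
    """
    series = np.asarray(series)
    labels = np.zeros((len(series), steps), dtype=series.dtype)
    labels[steps - 1:] = sliding_window_view(series, steps)
    return labels

label = multistep_labels(np.arange(1, 16), steps=2)
print(label[:5])
```

For `steps=2` this produces the same array as the manual construction, so it could be passed to TimeseriesGenerator as the targets in the same way.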
