
# Preparing Time Series Data for CNNs and LSTMs

Time series data must be transformed into samples with input and output components before it can be used to fit a supervised learning model. In this two-dimensional form, the data can be used directly to fit a supervised machine learning algorithm, or even a Multilayer Perceptron neural network.

One further transformation is required to prepare the data for fitting a Convolutional Neural Network (CNN) or Long Short-Term Memory (LSTM) neural network. Specifically, the two-dimensional structure of the supervised learning data must be transformed into a three-dimensional structure.

This is perhaps the largest sticking point for practitioners looking to implement deep learning methods for time series forecasting. 

This tutorial shows exactly how to transform a time series dataset into a three-dimensional structure ready for fitting a CNN or LSTM model. It covers:

* How to transform a time series dataset into a two-dimensional supervised learning format.
* How to transform a two-dimensional time series dataset into a three-dimensional structure suitable for CNNs and LSTMs.
* How to step through a worked example of splitting a very long time series into subsequences ready for training a CNN or LSTM model.

## Setup

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

## Time Series to Supervised

Time series data requires preparation before it can be used to train a supervised learning model, such as an LSTM neural network. 

For example, a univariate time series is represented as a vector of observations:

```python
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

A supervised learning algorithm requires that data is provided as a collection of samples, where each sample has an input component (X) and an output component (y).

```python
X,                y
sample input,    sample output
sample input,    sample output
sample input,    sample output
...
```

The model will learn how to map inputs to outputs from the provided examples.

$$y = f(X)$$

A time series must be transformed into samples with input and output components. The transform determines both what the model will learn and how you will use the model when making predictions in the future, e.g. what is required to make a prediction (X) and what prediction is made (y).

For a univariate time series problem where we are interested in one-step
predictions, the observations at prior time steps, so-called lag observations, are used as input and the output is the observation at the current time step. 

For example, the above 10-step univariate series can be expressed as a supervised learning problem with three time steps for input and one step as output, as follows:

```python
X,          y
[1, 2, 3], [4]
[2, 3, 4], [5]
[3, 4, 5], [6]
...
```
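The same framing can also be sketched with lagged columns in pandas. This is only an illustrative alternative that assumes the 10-step series above and three lag observations; the column names `x1`, `x2`, `x3` and `y` are chosen here for readability and are not required by any library.

```python
import pandas as pd

# the 10-step univariate series from the example above
series = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

n_steps = 3  # number of lag observations used as input (assumption for this sketch)

# one column per lag: shift(3) is the oldest observation, shift(1) the most recent
frame = pd.concat(
    [series.shift(lag) for lag in range(n_steps, 0, -1)] + [series],
    axis=1,
)
frame.columns = ["x1", "x2", "x3", "y"]

# the first n_steps rows have no complete history, so drop them
frame = frame.dropna().astype(int)

X = frame[["x1", "x2", "x3"]].to_numpy()
y = frame["y"].to_numpy()

print(frame.head(3))
print(X.shape, y.shape)
```

Each row of `frame` is one sample: the three lag columns form the input and the unshifted column is the output.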

You can write code to perform this transform yourself. That is the general approach, and it is recommended because it gives you a better understanding of your data and more control over the transformation process.

The `split_sequence()` function below implements this behavior. It splits a given univariate sequence into multiple samples, where each sample has a specified number of input time steps and a single time step as output.


In [0]:
# split a univariate sequence into samples
def split_sequence(sequence, n_steps):
  """Return (X, y): windows of n_steps inputs and the value that follows each window."""
  X, y = list(), list()
  # stop early enough that every window still has a following observation
  for i in range(len(sequence) - n_steps):
    # find the end of this pattern
    end_ix = i + n_steps
    # gather input and output parts of the pattern
    seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
    X.append(seq_x)
    y.append(seq_y)
  return np.array(X), np.array(y)
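For long series, the Python loop can be replaced by a vectorized windowing call. The sketch below is a hypothetical variant, not part of the original notebook; the name `split_sequence_windowed` is made up for illustration, and it assumes NumPy 1.20 or later, where `sliding_window_view` is available.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def split_sequence_windowed(sequence, n_steps):
    """Hypothetical vectorized variant of split_sequence()."""
    sequence = np.asarray(sequence)
    # every window of length n_steps: shape (len(sequence) - n_steps + 1, n_steps)
    windows = sliding_window_view(sequence, n_steps)
    # the last window has no following observation to use as output, so drop it;
    # copy() turns the read-only view into an ordinary writable array
    X = windows[:-1].copy()
    # the output for each window is the observation that follows it
    y = sequence[n_steps:]
    return X, y


X, y = split_sequence_windowed([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3)
print(X.shape, y.shape)
```

Note that `sliding_window_view` raises an error if the sequence is shorter than `n_steps`, so very short inputs would need to be handled separately.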

After you have transformed your data into a form suitable for training a supervised learning model, it will be represented as rows and columns. Each column represents a feature to the model and may correspond to a separate lag observation. Each row represents a sample and corresponds to a new example with input and output components.

* **Feature**: A column in a dataset, such as a lag observation for a time series dataset.
* **Sample**: A row in a dataset, such as an input and output sequence for a time series dataset.

For example, our univariate time series may look as follows:

```python
x1,  x2,  x3,  y
1,   2,   3,   4
2,   3,   4,   5
3,   4,   5,   6
...
```

The dataset will be represented in Python using a NumPy array. The array will have two dimensions. The length of each dimension is referred to as the shape of the array. 

For example, a time series framed with 3 inputs and 1 output will be transformed into a supervised learning problem with 4 columns: 3 columns for the input data and 1 for the output data.

If we have 7 rows and 3 columns for the input data then the shape of the dataset would be [7, 3], or 7 samples and 3 features. We can make this concrete by transforming our small contrived dataset.

In [7]:
# define univariate time series
series = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(series)

# transform to a supervised learning problem
X, y = split_sequence(series, 3)
print(X.shape, y.shape)

# show each sample
for i in range(len(X)):
  print(X[i], y[i])

[ 1  2  3  4  5  6  7  8  9 10]
(7, 3) (7,)
[1 2 3] 4
[2 3 4] 5
[3 4 5] 6
[4 5 6] 7
[5 6 7] 8
[6 7 8] 9
[7 8 9] 10


We can see that the chosen representation gives 7 samples, each with 3 input features and one output value. The shape of the output is (7,), which indicates a one-dimensional array of 7 values. It could also be represented as a two-dimensional array with 7 rows and 1 column, [7, 1].
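As a small illustration of the two output representations, the sketch below reshapes the one-dimensional output into a single column. The variable name `y_2d` is introduced here only for the example.

```python
import numpy as np

# the 7 output values from the worked example
y = np.array([4, 5, 6, 7, 8, 9, 10])
print(y.shape)  # (7,)

# reshape into 7 rows and 1 column
y_2d = y.reshape((len(y), 1))
print(y_2d.shape)  # (7, 1)
```

The values are unchanged; only the way the array is indexed differs.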

Data in this form can be used directly to train a simple neural network, such as a Multilayer Perceptron. The difficulty for beginners comes when preparing this data for CNNs and LSTMs, which require a three-dimensional structure instead of the two-dimensional structure described so far.
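To illustrate that the two-dimensional form is usable as-is by an ordinary supervised learner, here is a minimal sketch that fits a linear least-squares model to the samples from the worked example. The linear model is a stand-in chosen to keep the example short; it is not the Multilayer Perceptron mentioned above, and the names `X_design`, `coef` and `y_hat` are introduced only for this illustration.

```python
import numpy as np

# the supervised samples from the worked example (7 samples, 3 lag features)
X = np.array([[i, i + 1, i + 2] for i in range(1, 8)], dtype=float)
y = np.arange(4, 11, dtype=float)

# append a constant column so the model can learn an intercept
X_design = np.hstack([X, np.ones((len(X), 1))])

# ordinary least squares; lstsq returns a minimum-norm solution,
# so it copes with the collinear lag columns in this contrived series
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# one-step forecast from the last three observations of the series
x_new = np.array([8.0, 9.0, 10.0, 1.0])
y_hat = float(x_new @ coef)
print(round(y_hat, 3))
```

Because the contrived series increases by exactly one at every step, the fitted model should forecast a value very close to 11 for the step after 10.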

## 3D Data Preparation Basics