<a href="https://colab.research.google.com/github/rahiakela/deep-learning-for-time-series-forecasting/blob/part-3-deep-learning-methods/1_preparing_time_series_data_for_cnn_and_lstm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing Time Series Data for CNNs and LSTMs

Time series data must be transformed into samples with input and output components before it can be used to fit a supervised learning model. In this form, the data can be used immediately to fit a supervised machine learning algorithm, and even a Multilayer Perceptron neural network.

One further transformation is required in order to ready the data for fitting a Convolutional Neural Network (CNN) or Long Short-Term Memory (LSTM) Neural Network. Specifically, the two-dimensional structure of the supervised
learning data must be transformed to a three-dimensional structure. 

This is perhaps the largest sticking point for practitioners looking to implement deep learning methods for time series forecasting. 

We will discover exactly how to transform a time series dataset into a three-dimensional structure ready for fitting a CNN or LSTM model. Specifically, we will cover:

* How to transform a time series dataset into a two-dimensional supervised learning format.
* How to transform a two-dimensional time series dataset into a three-dimensional structure suitable for CNNs and LSTMs.
* How to step through a worked example of splitting a very long time series into subsequences ready for training a CNN or LSTM model.

## Setup

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

## Time Series to Supervised

Time series data requires preparation before it can be used to train a supervised learning model, such as an LSTM neural network. 

For example, a univariate time series is represented as a vector of observations:

```python
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

A supervised learning algorithm requires that data is provided as a collection of samples, where each sample has an input component (X) and an output component (y).

```python
X,                y
sample input,    sample output
sample input,    sample output
sample input,    sample output
...
```

The model will learn how to map inputs to outputs from the provided examples.

$$y = f(X)$$

A time series must be transformed into samples with input and output components. The transform determines both what the model will learn and how you can use the model in the future when making predictions, e.g. what is required to make a prediction (X) and what prediction is made (y). 

For a univariate time series problem where we are interested in one-step
predictions, the observations at prior time steps, so-called lag observations, are used as input and the output is the observation at the current time step. 

For example, the above 10-step univariate series can be expressed as a supervised learning problem with three time steps for input and one step as output, as follows:

```python
X,          y
[1, 2, 3], [4]
[2, 3, 4], [5]
[3, 4, 5], [6]
...
```

You can write code to perform this transform yourself. That is the general
approach, and it is recommended because it gives you a greater understanding of your data and more control over the transformation process. 

The split_sequence() function below implements this behavior. It splits a given univariate sequence into multiple samples, where each sample has a specified number of input time steps and the output is a single time step.


In [0]:
# split a univariate sequence into samples
def split_sequence(sequence, n_steps):
  X, y = list(), list()
  for i in range(len(sequence)):
    # find the end of this pattern
    end_ix = i + n_steps
    # check if we are beyond the sequence
    if end_ix > len(sequence) - 1:
      break
    # gather input and output parts of the pattern
    seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
    X.append(seq_x)
    y.append(seq_y)
  return np.array(X), np.array(y)
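
The same split can also be expressed without an explicit loop. The following is a minimal sketch, assuming NumPy 1.20 or later (where `sliding_window_view` is available); the helper name `split_sequence_windowed` is hypothetical and not part of the original tutorial.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def split_sequence_windowed(sequence, n_steps):
    # hypothetical helper: each window holds n_steps inputs followed by 1 output
    windows = sliding_window_view(np.asarray(sequence), n_steps + 1)
    # all but the last column are the lag inputs, the last column is the target
    X_part = windows[:, :-1].copy()
    y_part = windows[:, -1].copy()
    return X_part, y_part

X_demo, y_demo = split_sequence_windowed([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3)
print(X_demo.shape, y_demo.shape)
print(X_demo[0], y_demo[0])
```

A series of length `n` yields `n - n_steps` samples with this approach, which matches the loop-based function above.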

After you have transformed your data into a form suitable for training a supervised learning model, it will be represented as rows and columns. Each column will represent a feature to the model and may correspond to a separate lag observation. Each row will represent a sample and will correspond to a new example with input and output components.

* **Feature**: A column in a dataset, such as a lag observation for a time series dataset.
* **Sample**: A row in a dataset, such as an input and output sequence for a time series dataset.

For example, our univariate time series may look as follows:

```python
x1,  x2,  x3,  y
1,   2,   3,   4
2,   3,   4,   5
3,   4,   5,   6
...
```

The dataset will be represented in Python using a NumPy array. The array will have two dimensions. The length of each dimension is referred to as the shape of the array. 

For example, a time series with 3 inputs, 1 output will be transformed into a supervised learning problem with 4 columns, or really 3 columns for the input data and 1 for the output data. 

If we have 7 rows and 3 columns for the input data then the shape of the dataset would be [7, 3], or 7 samples and 3 features. We can make this concrete by transforming our small contrived dataset.

In [3]:
# define univariate time series
series = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(series)

# transform to a supervised learning problem
X, y = split_sequence(series, 3)
print(X.shape, y.shape)

# show each sample
for i in range(len(X)):
  print(X[i], y[i])

[ 1  2  3  4  5  6  7  8  9 10]
(7, 3) (7,)
[1 2 3] 4
[2 3 4] 5
[3 4 5] 6
[4 5 6] 7
[5 6 7] 8
[6 7 8] 9
[7 8 9] 10


We can see that, for the chosen representation, we have 7 samples for the
input and output and 3 input features. The output has the shape (7,), indicating a one-dimensional array of 7 values, one per sample. It could also be represented as a two-dimensional array with 7 rows and 1 column [7, 1].
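
As a small illustration of the two output representations, the sketch below converts a one-dimensional target array into a single-column two-dimensional array. The variable names `y_demo` and `y_col` are chosen here for illustration only.

```python
import numpy as np

# the seven target values from the contrived example
y_demo = np.array([4, 5, 6, 7, 8, 9, 10])
print(y_demo.shape)

# -1 lets NumPy infer the number of rows; 1 fixes a single column
y_col = y_demo.reshape((-1, 1))
print(y_col.shape)
```

Both arrays hold the same values; only the number of dimensions differs.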

Data in this form can be used directly to train a simple neural network, such as a Multilayer Perceptron. The difficulty for beginners comes when trying to prepare this data for CNNs and LSTMs that require data to have a three-dimensional structure instead of the two-dimensional structure described so far.

## 3D Data Preparation Basics

Preparing time series data for CNNs and LSTMs requires one additional step beyond transforming the data into a supervised learning problem. This one additional step causes the most confusion for beginners. 

The input layer for CNN and LSTM models is specified by the input_shape argument on the first hidden layer of the network. This too can be confusing for beginners, as intuitively we may expect the first layer defined in the model to be the input layer, not the first hidden layer.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32))
model.add(Dense(1))
```

The input to every CNN and LSTM layer must be three-dimensional. The three dimensions of this input are:
* **Samples**. One sequence is one sample. A batch is comprised of one or more samples.
* **Time Steps**. One time step is one point of observation in the sample. One sample is comprised of multiple time steps.
* **Features**. One feature is one observation at a time step. One time step is comprised of one or more features.

This expected three-dimensional structure of input data is often summarized using the array shape notation of: $[samples, timesteps, features]$.

Remember that the two-dimensional shape of a dataset that we are familiar with from the previous section has the array shape of: $[samples, features]$. This suggests that we are adding the new dimension of time steps. Except, in time series forecasting problems, our features are observations at time steps. So, really, we are adding the dimension of features, where a univariate time series has only one feature.
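
To make the three axes tangible, here is a minimal sketch that indexes into a small three-dimensional array. The data is hypothetical: 2 samples, 3 time steps, and 2 features per time step (a multivariate case, chosen so that the feature axis is visible).

```python
import numpy as np

# hypothetical data laid out as [samples, timesteps, features]
X_demo = np.array([
    [[1, 10], [2, 20], [3, 30]],   # sample 0: three time steps, two features each
    [[2, 20], [3, 30], [4, 40]],   # sample 1
])
print(X_demo.shape)

# first axis selects a sample (a whole sequence)
print(X_demo[0])
# second axis selects a time step within that sample
print(X_demo[0, 1])
# third axis selects one feature at that time step
print(X_demo[0, 1, 0])
```

For a univariate series, the last axis simply has length 1.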

When defining the input layer of your LSTM network, the network assumes you have one or more samples and requires that you specify the number of time steps and the number of features. You can do this by passing a tuple to the input_shape argument.

For example, the model below defines an input layer that expects 1 or more samples, 3 time steps, and 1 feature.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, input_shape=(3, 1)))
model.add(Dense(1))
```

Remember, the first layer in the network is actually the first hidden layer, so in this example 32 refers to the number of units in the first hidden layer. The number of units in the first hidden layer is completely unrelated to the number of samples, time steps or features in your input data.


This example maps onto our univariate time series that we split into having 3 input time steps and 1 feature. 

We may have loaded our time series dataset from CSV or transformed it to a supervised learning problem in memory. It will have a two-dimensional
shape and we must convert it to a three-dimensional shape with some number of samples, 3 time steps per sample and 1 feature per time step, or $[?, 3, 1]$.

For example, if we have 7 samples and 3 time steps per sample for the input element of our time series, we can reshape it into $[7, 3, 1]$ by providing a tuple to the reshape() function specifying the desired new shape of (7, 3, 1). The array must have enough data to support the new shape, which in this case it does, as $[7, 3]$ and $[7, 3, 1]$ both hold the same 21 values.

```python
# transform input from [samples, features] to [samples, timesteps, features]
X = X.reshape((7, 3, 1))
```
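
The requirement that the array has enough data for the new shape can be checked directly. The sketch below uses a hypothetical 7-by-3 array (`X2d`) to show a reshape that succeeds and one that fails because the element counts differ.

```python
import numpy as np

# hypothetical 2D input: 7 samples, 3 lag features (21 values in total)
X2d = np.arange(21).reshape((7, 3))

# 7 * 3 * 1 = 21, so the element count matches and the reshape succeeds
X3d = X2d.reshape((7, 3, 1))
print(X3d.shape)
# the values are unchanged; only a trailing axis of length 1 was added
print(np.array_equal(X2d, X3d[:, :, 0]))

# 7 * 4 * 1 = 28 does not match 21, so NumPy raises a ValueError
reshape_failed = False
try:
    X2d.reshape((7, 4, 1))
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```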

A shortcut when reshaping the array is to use the known sizes, such as the number of samples and the number of time steps, taken from the X.shape property of the array.

* X.shape[0] refers to the number of rows in a 2D array, in this case the
number of samples.
* X.shape[1] refers to the number of columns in a 2D array, in this case
the number of features that we will use as the number of time steps. 

The reshape can therefore be written as:

```python
# transform input from [samples, features] to [samples, timesteps, features]
X = X.reshape((X.shape[0], X.shape[1], 1))
```

We can make this concept concrete with a worked example.


In [5]:
# split a univariate sequence into samples
def split_sequence(sequence, n_steps):
  X, y = list(), list()
  for i in range(len(sequence)):
    # find the end of this pattern
    end_ix = i + n_steps
    # check if we are beyond the sequence
    if end_ix > len(sequence) - 1:
      break
    # gather input and output parts of the pattern
    seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
    X.append(seq_x)
    y.append(seq_y)
  return np.array(X), np.array(y)

# define univariate time series
series = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(series)
print(series.shape)

# transform to a supervised learning problem
X, y = split_sequence(series, 3)
print(X.shape, y.shape)

# transform input from [samples, features] to [samples, timesteps, features]
X = X.reshape((X.shape[0], X.shape[1], 1))
print(X.shape)

[ 1  2  3  4  5  6  7  8  9 10]
(10,)
(7, 3) (7,)
(7, 3, 1)


Finally, the input element of each sample is reshaped to be three-dimensional, suitable for fitting an LSTM or CNN. It now has the shape $[7, 3, 1]$, or 7
samples, 3 time steps, 1 feature.
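
NumPy offers several equivalent ways to add the trailing feature axis. The sketch below compares them on a hypothetical 7-by-3 array; which one to use is largely a matter of taste.

```python
import numpy as np

# hypothetical 2D input with 7 samples and 3 time steps
X2d = np.arange(21).reshape((7, 3))

# explicit shape, as used in the worked example
a = X2d.reshape((X2d.shape[0], X2d.shape[1], 1))
# let NumPy infer the number of samples with -1
b = X2d.reshape((-1, X2d.shape[1], 1))
# append a new axis of length 1 at the end
c = np.expand_dims(X2d, axis=-1)
# the same, using indexing syntax
d = X2d[..., np.newaxis]

for arr in (a, b, c, d):
    print(arr.shape)
```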

## Data Preparation Example

Consider the following situation:

``
I have two columns in my data file with 5,000 rows, column 1 is time (with 1 hour interval) and column 2 is the number of sales and I am trying to forecast the number of sales for future time steps. Help me to set the number of samples, time steps and features in this data for an LSTM?
``

There are a few problems here:

* **Data Shape**: LSTMs expect 3D input, and it can be challenging to get your head around this the first time.
* **Sequence Length**: LSTMs tend to work poorly with sequences of more than 200-400 time steps, so the data will need to be split into subsamples.

We will work through this example, broken down into the following 4 steps:

1. Load the Data
2. Drop the Time Column
3. Split Into Samples
4. Reshape Subsequences



### Load the Data

In [6]:
# load time series dataset
# series = pd.read_csv('filename.csv', header=0, index_col=0)

# We will mock loading by defining a new dataset in memory with 5,000 time steps.
# define the dataset
data = list()
n = 5000
for i in range(n):
  data.append([i+1, (i+1) * 10])
data = np.array(data)
print(data[:5, :])
print(data.shape)

[[ 1 10]
 [ 2 20]
 [ 3 30]
 [ 4 40]
 [ 5 50]]
(5000, 2)


We can see we have 5,000 rows and 2 columns: a standard univariate time series dataset.

### Drop the Time Column

If your time series data is uniform over time and there are no missing values, you can drop the time column. If not, you may want to look at imputing the missing values, resampling the data to a new time scale, or developing a model that can handle missing values. 
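
Before dropping the time column, it may be worth confirming that the observations really are evenly spaced. The following is a minimal sketch; the helper name `is_evenly_spaced` and the sample time stamps are hypothetical, and the check assumes at least two numeric time stamps.

```python
import numpy as np

def is_evenly_spaced(times):
    # hypothetical helper: True when every gap between consecutive time stamps is equal
    # (assumes at least two time stamps)
    gaps = np.diff(np.asarray(times))
    return bool(np.all(gaps == gaps[0]))

# hourly observations with no gaps
hourly = np.array([1, 2, 3, 4, 5])
# the observation at hour 3 is missing
with_gap = np.array([1, 2, 4, 5])

print(is_evenly_spaced(hourly))
print(is_evenly_spaced(with_gap))
```

If the check fails, the imputation or resampling options mentioned above become relevant.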

Here, we just drop the first column:

In [7]:
# define the dataset
data = list()
n = 5000
for i in range(n):
  data.append([i+1, (i+1) * 10])
data = np.array(data)

# drop time
data = data[:, 1]
print(data.shape)

(5000,)


### Split Into Samples

LSTMs need to process samples where each sample is a single sequence of observations. In this case, 5,000 time steps is too long; LSTMs work better with 200-to-400 time steps. Therefore, we need to split the 5,000 time steps into multiple shorter sub-sequences.

How best to split the data depends on your problem. For example, perhaps you need overlapping sequences, or perhaps non-overlapping sequences are fine but your model needs to carry state across the sub-sequences, and so on. 
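
As an illustration of the overlapping option, the sketch below splits a series with a configurable step between the starts of consecutive sub-sequences. The helper name `split_with_overlap` and the choice of a 100-step stride are assumptions made for this example.

```python
import numpy as np

def split_with_overlap(data, length, step):
    # hypothetical helper: take windows of `length` values, starting every `step` values
    # only full-length windows are kept
    starts = range(0, len(data) - length + 1, step)
    return np.array([data[i:i + length] for i in starts])

# the same mock sales series used in this example: 10, 20, ..., 50000
data_demo = np.arange(1, 5001) * 10

# step equal to length gives non-overlapping sub-sequences
non_overlapping = split_with_overlap(data_demo, 200, 200)
print(non_overlapping.shape)

# step of half the length gives 50% overlap between neighbours
overlapping = split_with_overlap(data_demo, 200, 100)
print(overlapping.shape)
```

Overlapping windows produce more training samples from the same series, at the cost of correlated samples.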

In this example, we will split the 5,000 time steps into 25 sub-sequences of 200 time steps each.

In [8]:
# define the dataset
data = list()
n = 5000
for i in range(n):
  data.append([i+1, (i+1) * 10])
data = np.array(data)

# drop time
data = data[:, 1]
print(data.shape)

# split into samples (e.g. 5000/200 = 25)
samples = list()
length = 200

# step over the 5,000 in jumps of 200
for i in range(0, n , length):
  sample = data[i: i + length]  # grab from i to i + 200
  samples.append(sample)
print(len(samples))

(5000,)
25


In [12]:
len(samples[:5][0])

200

In [14]:
samples[:2]

[array([  10,   20,   30,   40,   50,   60,   70,   80,   90,  100,  110,
         120,  130,  140,  150,  160,  170,  180,  190,  200,  210,  220,
         230,  240,  250,  260,  270,  280,  290,  300,  310,  320,  330,
         340,  350,  360,  370,  380,  390,  400,  410,  420,  430,  440,
         450,  460,  470,  480,  490,  500,  510,  520,  530,  540,  550,
         560,  570,  580,  590,  600,  610,  620,  630,  640,  650,  660,
         670,  680,  690,  700,  710,  720,  730,  740,  750,  760,  770,
         780,  790,  800,  810,  820,  830,  840,  850,  860,  870,  880,
         890,  900,  910,  920,  930,  940,  950,  960,  970,  980,  990,
        1000, 1010, 1020, 1030, 1040, 1050, 1060, 1070, 1080, 1090, 1100,
        1110, 1120, 1130, 1140, 1150, 1160, 1170, 1180, 1190, 1200, 1210,
        1220, 1230, 1240, 1250, 1260, 1270, 1280, 1290, 1300, 1310, 1320,
        1330, 1340, 1350, 1360, 1370, 1380, 1390, 1400, 1410, 1420, 1430,
        1440, 1450, 1460, 1470, 1480, 

We now have 25 subsequences of 200 time steps each.
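
The split above works cleanly because 5,000 is an exact multiple of 200. When it is not, the final slice would be shorter than the others and the list could not be converted into a regular array. One possible way to handle this, sketched below with a hypothetical 5,100-step series, is to trim the series to a whole number of sub-sequences first.

```python
import numpy as np

sub_len = 200
# hypothetical series with 5,100 observations (not a multiple of 200)
data_demo = np.arange(1, 5101) * 10

# number of complete sub-sequences that fit
n_full = len(data_demo) // sub_len
# keep only the first n_full * sub_len values; the trailing 100 are dropped here
# (dropping the oldest values instead is another option)
trimmed = data_demo[:n_full * sub_len]

samples_demo = [trimmed[i:i + sub_len] for i in range(0, len(trimmed), sub_len)]
print(len(samples_demo))
print(all(len(s) == sub_len for s in samples_demo))
```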

### Reshape Subsequences

The LSTM needs data with the format of $[samples, timesteps, features]$. We have 25 samples, 200 time steps per sample, and 1 feature. 

First, we need to convert our list of arrays into a 2D NumPy array with the shape $[25, 200]$.

In [15]:
# define the dataset
data = list()
n = 5000
for i in range(n):
  data.append([i+1, (i+1) * 10])
data = np.array(data)

# drop time
data = data[:, 1]
print(data.shape)

# split into samples (e.g. 5000/200 = 25)
samples = list()
length = 200

# step over the 5,000 in jumps of 200
for i in range(0, n , length):
  sample = data[i: i + length]  # grab from i to i + 200
  samples.append(sample)
print(len(samples))

# convert list of arrays into 2d array
data = np.array(samples)
print(data.shape)

(5000,)
25
(25, 200)


Now we have 25 rows and 200 columns. Interpreted in a machine learning context, this dataset has 25 samples and 200 features per sample.

Next, we can use the reshape() function to add one additional dimension for our single feature and use the existing columns as time steps instead.

In [16]:
# reshape into [samples, timesteps, features]
data = data.reshape((len(samples), length, 1))
print(data.shape)

(25, 200, 1)


And that is it. The data can now be used as an input (X) to an LSTM model, or even a CNN model.
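
For non-overlapping sub-sequences, the split and the reshape can also be collapsed into a single call. This is a minimal sketch, assuming the series length is an exact multiple of the sub-sequence length; the variable names are chosen for illustration.

```python
import numpy as np

# the same mock sales series: 10, 20, ..., 50000
series_demo = np.arange(1, 5001) * 10

# -1 lets NumPy infer the number of samples (5000 / 200 = 25)
X_demo = series_demo.reshape((-1, 200, 1))
print(X_demo.shape)

# the second sample starts where the first one ended
print(X_demo[0, -1, 0], X_demo[1, 0, 0])
```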