# A brief introduction to Deep Learning 

ChatGPT: 

In a nutshell, deep learning is a subset of machine learning that involves the use of neural networks with multiple layers (deep neural networks) to model and solve complex problems. 

These neural networks are inspired by the structure and function of the human brain. Deep learning algorithms learn to perform tasks by processing and analyzing large amounts of data, extracting relevant features, and making predictions or decisions without explicit programming. 

It has been particularly successful in tasks such as image and speech recognition, natural language processing, and playing games. The term "deep" refers to the multiple layers through which the data is transformed and processed in these neural networks.

In a sense, deep learning is about learning a function that captures the underlying relationships in the data.

## Why non-linearity? 

Activation functions introduce non-linearity to the network, allowing it to learn and approximate complex patterns in data.

Without non-linearity, a deep neural network is equivalent to a single linear transformation, because a composition of linear maps is itself linear.

With non-linearity, networks with more layers can approximate increasingly complex functions.
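As a minimal sketch of this point, the NumPy snippet below uses small hand-picked weight matrices (`W1`, `W2`, and the input `x` are illustrative values, not taken from any model in these notes). It checks that two stacked linear layers give the same output as one layer with the product matrix, and that inserting a ReLU between them changes the result.

```python
import numpy as np

# Hand-picked illustrative weights and input (assumptions for this sketch).
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])   # first layer: 2 inputs -> 2 hidden units
W2 = np.array([[1.0, 1.0]])    # second layer: 2 hidden units -> 1 output
x = np.array([1.0, 0.0])

# Two linear layers applied in sequence ...
two_linear_layers = W2 @ (W1 @ x)
# ... equal a single linear layer whose weight matrix is W2 @ W1.
one_linear_layer = (W2 @ W1) @ x
linear_collapses = np.allclose(two_linear_layers, one_linear_layer)

# With a ReLU between the layers the output differs from the linear one.
def relu(z):
    return np.maximum(z, 0.0)

with_relu = W2 @ relu(W1 @ x)
relu_changes_output = not np.allclose(with_relu, one_linear_layer)
```

Here the purely linear stack maps `x` to 0, while the version with ReLU maps it to 1, so the non-linear network computes something no single matrix `W2 @ W1` reproduces for this input.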


![](Pictures/activation_fcn.jpeg)

## How to define good functions?

A loss function quantifies the disparity between the model's predicted output and the actual target values. It produces a single scalar value that represents the error, and the objective during training is to minimize this error.

Common loss functions:

- Square loss (Mean Squared Error):

    This is commonly used for regression tasks.

- Hinge loss:

    This is often used with Support Vector Machines (SVMs) and is suitable for binary classification problems.

- Logistic loss (Binary Cross Entropy):

    This is commonly used in logistic regression and binary classification tasks.

- Cross entropy loss (Categorical Cross Entropy):

    This is often used for multiclass classification problems.
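The four losses above can be sketched in plain Python as follows. The function names and label conventions are assumptions made for this illustration (labels of -1/+1 for the hinge loss, 0/1 for the logistic loss, and a class index for cross entropy); libraries may use different conventions.

```python
import math

def squared_loss(y_true, y_pred):
    """Mean squared error over paired targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def hinge_loss(y_true, scores):
    """Mean hinge loss; labels are assumed to be -1 or +1."""
    return sum(max(0.0, 1.0 - t * s) for t, s in zip(y_true, scores)) / len(y_true)

def logistic_loss(y_true, probs, eps=1e-12):
    """Binary cross entropy; labels are 0 or 1, probs are P(y = 1)."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def cross_entropy_loss(true_class, probs, eps=1e-12):
    """Categorical cross entropy for one sample: -log P(true class)."""
    return -math.log(max(probs[true_class], eps))
```

Each function returns a single scalar, which is the quantity training tries to minimize.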

## How to pick the best functions

Backpropagation Training Algorithm

Backpropagation refers to two things:

- The mathematical method used to calculate derivatives and an application of the derivative chain rule.
- The training algorithm for updating network weights to minimize error.

The goal of the backpropagation training algorithm is to modify the weights of a neural network in order to minimize the error of the network outputs compared to some expected output in response to corresponding inputs.

The general algorithm is as follows:

1. Present a training input pattern and propagate it through the network to get an output.
2. Compare the predicted outputs to the expected outputs and calculate the error.
3. Calculate the derivatives of the error with respect to the network weights.
4. Adjust the weights to minimize the error.
5. Repeat.
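The five steps above can be sketched with NumPy for a tiny network with one hidden layer. Everything specific here is an assumption chosen for illustration: the XOR toy data, the layer sizes, the tanh activation, the mean squared error loss, and the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (XOR), chosen only for illustration.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# One hidden layer with 4 tanh units, then a linear output.
W1 = rng.normal(scale=0.5, size=(2, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros(1)
lr = 0.1
losses = []

for step in range(200):
    # 1. Propagate the inputs forward through the network.
    h = np.tanh(X @ W1 + b1)
    y_hat = h @ W2 + b2
    # 2. Compare predictions with targets (mean squared error).
    err = y_hat - y
    losses.append(float(np.mean(err ** 2)))
    # 3. Derivatives of the error w.r.t. each weight, via the chain rule.
    d_yhat = 2.0 * err / len(X)
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_h = d_yhat @ W2.T
    d_pre = d_h * (1.0 - h ** 2)   # derivative of tanh
    dW1 = X.T @ d_pre
    db1 = d_pre.sum(axis=0)
    # 4. Adjust the weights in the direction that reduces the error.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
    # 5. Repeat.
```

The list `losses` records the error at every step, so plotting it is a quick way to check that training is making progress.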

## Hyperparameter tuning

### Batch size and Epoch

The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.

A training dataset can be divided into one or more batches.

- Batch Gradient Descent. Batch Size = Size of Training Set
- Stochastic Gradient Descent. Batch Size = 1
- Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set

In the case of mini-batch gradient descent, popular batch sizes include 32, 64, and 128 samples. You may see these values used in models in the literature and in tutorials.

The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.

One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters. An epoch consists of one or more batches. For example, as noted above, training with a single batch per epoch is batch gradient descent.

In a nutshell:

- The batch size is the number of samples processed before the model is updated.

- The number of epochs is the number of complete passes through the training dataset.
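To make the relationship between batch size, epochs, and parameter updates concrete, here is a small plain-Python sketch. The helper names `updates_per_epoch` and `iterate_minibatches` and the dataset of 200 samples are hypothetical, introduced only for this example.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)

def iterate_minibatches(samples, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

dataset = list(range(200))                 # 200 dummy samples
batches = list(iterate_minibatches(dataset, batch_size=64))
batch_sizes = [len(b) for b in batches]    # three full batches, one partial

num_epochs = 10
total_updates = num_epochs * updates_per_epoch(len(dataset), 64)
```

With 200 samples and a batch size of 64, one epoch contains four updates, and the final batch holds the remaining 8 samples.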

![](Pictures/batchsize.png)

### Momentum

- Gradient Descent + Momentum: each update combines the current gradient with the direction of the previous step.
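A minimal sketch of gradient descent with momentum on a one-dimensional problem is shown below. The objective $f(\theta)=\theta^2$, the coefficient `beta`, and the learning rate are assumptions picked for illustration; this is one common form of the update, not the only one.

```python
def gradient_descent_momentum(grad_fn, theta, lr=0.1, beta=0.9, steps=200):
    """Minimize a 1-D function given its gradient, using momentum."""
    velocity = 0.0
    for _ in range(steps):
        # New direction = previous direction (scaled) + current gradient step.
        velocity = beta * velocity - lr * grad_fn(theta)
        theta = theta + velocity
    return theta

# f(theta) = theta**2 has gradient 2 * theta and its minimum at 0.
theta_final = gradient_descent_momentum(lambda t: 2.0 * t, theta=5.0)
```

The `velocity` term carries the previous step forward, which is what lets the optimizer keep moving through flat regions where the current gradient alone is small.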

### Adaptive Learning Rate


$$\theta^{t+1}_i=\theta^t_i-\frac{\eta}{\sigma^t_i}g^t_i$$

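**Basic principles:**

- If the gradient along a direction is small (the loss surface is very flat) ⇒ use a larger learning rate
- If a direction is very steep (the slope is large) ⇒ use a smaller learning rate

1. Adagrad: takes the magnitudes of all previous gradients into account

2. RMSProp: adjusts the relative importance of the current gradient and past gradients

3. Adam = RMSProp + Momentum (the most commonly used strategy)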
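The update rule $\theta^{t+1}_i=\theta^t_i-\frac{\eta}{\sigma^t_i}g^t_i$ can be sketched for a single parameter as follows. The two `sigma` functions use one common formulation (a root mean square over all past gradients for Adagrad, and an exponentially weighted version with an assumed decay `alpha` for RMSProp); the small `eps` is added only to avoid division by zero.

```python
import math

def adagrad_sigma(grads):
    """Root mean square of all gradients seen so far."""
    return math.sqrt(sum(g * g for g in grads) / len(grads))

def rmsprop_sigma(grads, alpha=0.9):
    """Exponentially weighted RMS: recent gradients count more."""
    sigma_sq = grads[0] ** 2
    for g in grads[1:]:
        sigma_sq = alpha * sigma_sq + (1.0 - alpha) * g * g
    return math.sqrt(sigma_sq)

def adaptive_step(theta, grad, sigma, eta=0.1, eps=1e-8):
    """One update: theta - (eta / sigma) * grad."""
    return theta - eta / (sigma + eps) * grad

# Flat direction (small gradients) vs. steep direction (large gradients).
flat_rate = 0.1 / adagrad_sigma([0.01, 0.02, 0.01])
steep_rate = 0.1 / adagrad_sigma([10.0, 20.0, 10.0])
```

The effective learning rate `eta / sigma` comes out larger in the flat direction and smaller in the steep one, which matches the basic principles listed above.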

There are many different ways to potentially improve a neural network. Some of the most common include: 

- increasing the number of layers (making the network deeper), 
- increasing the number of hidden units (making the network wider) and 
- changing the learning rate. 

Because these values are all set by a human rather than learned, they're referred to as hyperparameters, and the practice of trying to find the best hyperparameters is referred to as hyperparameter tuning.

# RNN Basics

## Introduction

Recurrent Neural Networks (RNNs) are a family of networks for processing sequential data (time series, text, audio, video).


RNNs can be used for a number of sequence-based problems:

- One to one: one input, one output, such as image classification.
- One to many: one input, many outputs, such as image captioning (image input, a sequence of text as caption output).
- Many to one: many inputs, one output, such as text classification (classifying a Tweet as a real disaster or not a real disaster).
- Many to many: many inputs, many outputs, such as machine translation (translating English to Spanish) or speech to text (audio wave as input, text as output).

![Input/Output sequence](Pictures/IO.png)


RNNs have memory: prior inputs influence how the current input is processed, so the output depends on prior elements of the sequence.

How can an RNN do this? It applies a recurrence relation at every time step to process a sequence:

\begin{align}
h_t = f_W(h_{t-1},x_t)
\end{align}

where $h_t$ is the new hidden state, $f_W$ is a function parameterized by $W$, $h_{t-1}$ is the old state, and $x_t$ is the input vector at time step $t$.


![RNN have loops](./Pictures/RNN_schematics.png)


In the above diagram, a chunk of neural network (purple chunk), looks at some input $x_t$ and outputs a value $h_t$. A loop allows information to be passed from one step of the network to the next.

Once you unfold it, a recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.

![The repeating module in a standard RNN contains a single layer](./Pictures/LSTM3-SimpleRNN.png)

In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.


N.B.: The same function $f$ and set of parameters $\theta=\{U, V, W\}$ are used at every time step.


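In [None]:
# RNN intuition pseudo code (RNN() is a placeholder, not a real class)

my_rnn = RNN()

# initial hidden state
hidden_state = [0, 0, 0, 0]

sentence = ["I", "love", "RNN"]

# feed one word at a time, carrying the hidden state forward
for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

# the prediction after the last word is the predicted next word
next_word_prediction = prediction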

At each time step, given the input vector $x_t$, the RNN applies a function to update its hidden state:

$h_t=\sigma(Wh_{t-1}+Ux_t) = \sigma(W_{hh}h_{t-1}+W_{hx}x_t)$, where $\sigma(\cdot)$ is an activation function

$\hat y_t=o_t=Vh_t = W_{yh}h_t$ 

Then, we can compute the loss $L_t$ at each time step.

The total loss is simply the sum of the individual losses over all time steps, i.e., $L = \sum_{t}L_t$
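The update equations above can be sketched as a runnable NumPy forward pass. The dimensions, random weights, tanh as the activation $\sigma$, and the squared error used for $L_t$ are all assumptions made for this illustration.

```python
import numpy as np

def rnn_forward(xs, h0, W, U, V):
    """Run a vanilla RNN over a sequence; return all states and outputs."""
    h = h0
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(W @ h + U @ x_t)   # h_t = sigma(W h_{t-1} + U x_t)
        ys.append(V @ h)               # y_t = V h_t
        hs.append(h)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
n_hidden, n_in, n_out, T = 4, 3, 2, 5
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
U = rng.normal(scale=0.5, size=(n_hidden, n_in))
V = rng.normal(scale=0.5, size=(n_out, n_hidden))
xs = rng.normal(size=(T, n_in))          # input sequence
targets = rng.normal(size=(T, n_out))    # dummy targets

hs, ys = rnn_forward(xs, np.zeros(n_hidden), W, U, V)
step_losses = ((ys - targets) ** 2).sum(axis=1)   # L_t, one value per time step
total_loss = step_losses.sum()                    # L = sum_t L_t
```

Note that the same `W`, `U`, and `V` are reused at every time step; only the hidden state `h` changes as the sequence is consumed.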



## Model Training Issue

In an RNN, the loss at each individual time step is backpropagated, and the gradients are accumulated across all time steps, all the way from the end of the sequence to the beginning. This is called **Backpropagation Through Time (BPTT)**.

BPTT is the training algorithm used to update weights in recurrent neural networks like LSTMs.

Conceptually, BPTT works by unrolling all input timesteps. Each timestep has one input, one copy of the network, and one output. Errors are then calculated and accumulated for each timestep. The network is rolled back up and the weights are updated.

We can summarize the algorithm as follows:

1. Present a sequence of timesteps of input and output pairs to the network.
2. Unroll the network then calculate and accumulate errors across each timestep.
3. Roll up the network and update weights.
4. Repeat.

All model parameters $\theta$ can be updated by 

\begin{align}
\theta^{i+1}=\theta^{i}-\eta \nabla_{\theta}L(\theta^i)
\end{align}

where $\eta$ is the learning rate.

Let $L(\theta)$ be the total loss function. Since the total loss is the sum of the individual losses at each time step, we have $L = L_0+L_1+\dots+L_T$.

In particular, let's focus on the parameter $W$:

$\frac{\partial L}{\partial W} = \frac{\partial L_0}{\partial W}+\frac{\partial L_1}{\partial W}+...+\frac{\partial L_T}{\partial W}$

$\frac{\partial L_0}{\partial W} = \frac{\partial L_0}{\partial y_0}\frac{\partial y_0}{\partial h_0}\frac{\partial h_0}{\partial W}$

$\frac{\partial L_1}{\partial W} = \frac{\partial L_1}{\partial y_1}\frac{\partial y_1}{\partial h_1}\left(\frac{\partial h_1}{\partial W} + \frac{\partial h_1}{\partial h_0}\frac{\partial h_0}{\partial W}\right)$, where $\frac{\partial h_k}{\partial W}$ inside the parentheses denotes the direct dependence of $h_k$ on $W$, with $h_{k-1}$ held fixed

$\vdots$

$\frac{\partial L_T}{\partial W} = \sum_{k=0}^{T}\frac{\partial L_T}{\partial y_T} \frac{\partial y_T}{\partial h_T}\frac{\partial h_T}{\partial h_k} \frac{\partial h_k}{\partial W}$

where $\frac{\partial h_T}{\partial h_k}=\prod_{j=k+1}^{T}\frac{\partial h_j}{\partial h_{j-1}}$

Notice that each partial is a Jacobian matrix, and thus the gradient is a product of Jacobian matrices, each associated with a step in the forward computation.

As the time horizon gets longer, this product gets longer and longer.
If many of the factors in this product have magnitude smaller than 1, we are multiplying many small numbers together. This leads to tiny gradients for early time steps, so the parameter updates are dominated by short-term effects and the network is unable to capture long-term dependencies.

Computing the gradient w.r.t. $h_0$ involves many factors of $W$ and repeated gradient computation. This is problematic!

Why? 

Ans: 

- The gradient is a product of Jacobian matrices, each associated with a step in the forward computation.

- The **same** matrix is multiplied in at each time step during backpropagation.

Consider two cases:

1. Exploding gradients: when many of the values involved in these computations are > 1. In this case, the gradients become extremely large, and optimization becomes unstable.

2. Vanishing gradients: when many of the values involved in these computations are < 1. In this case, the gradients become extremely small, and the parameters barely update.

The exploding gradients problem is relatively easy to solve (e.g., by clipping); however, the vanishing gradients problem is much more troublesome (a popular solution is gating).
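The two cases can be illustrated numerically with NumPy. In this sketch the Jacobian at every step is assumed to be a scaled identity matrix, which is the simplest possible case and not what a trained RNN would have; `clip_by_norm` shows one common way to clip a gradient by its norm.

```python
import numpy as np

def repeated_product_norm(scale, steps=50, size=4):
    """Norm of a gradient after multiplying by the same matrix `steps` times."""
    W = scale * np.eye(size)      # simplest case: Jacobian = scale * identity
    grad = np.ones(size)
    for _ in range(steps):
        grad = W.T @ grad
    return np.linalg.norm(grad)

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

vanishing = repeated_product_norm(0.5)   # factors < 1: shrinks towards 0
exploding = repeated_product_norm(1.5)   # factors > 1: grows very large
clipped = clip_by_norm(np.array([30.0, 40.0]), max_norm=5.0)
```

Clipping keeps the direction of the gradient and only shrinks its length, which is why it helps with exploding gradients but cannot recover a gradient that has already vanished.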

How to solve the vanishing gradient problem?

1. Use an activation function that prevents fast shrinkage of the gradient.
2. Use weight initialization techniques that ensure that the initial weights are not too small.
3. Use gradient clipping, which caps the magnitude of the gradients. Note that clipping only guards against gradients that are too large (exploding gradients); it does not make small gradients larger.
4. Use batch normalization, which normalizes the input to each layer and helps to reduce the range of activation values and thus the likelihood of vanishing gradients.
5. Use a different optimization algorithm that is more resilient to vanishing gradients, such as Adam or RMSprop.
6. Gated cells: use some sort of skip connections, which allow gradients to bypass some of the layers in the network and thus prevent them from becoming too small.

## The Problem of Long-Term Dependencies

Sometimes, we only need to look at recent information to perform the present task. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by **Hochreiter (1991) [German]** and **Bengio, et al. (1994)**, who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don’t have this problem!



# LSTM Networks

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by **Hochreiter & Schmidhuber (1997)**, and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem (by **Gating**). Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

Before we get started, let's introduce some notation.

![notations](Pictures/LSTM2-notation.png)

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

![chain](Pictures/LSTM3-chain.png)

Instead of having a single neural network layer as in a standard RNN, there are four, interacting in a very special way to control the information flow.


## The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt (an express lane). It runs straight down the entire chain, with only some minor linear interactions (no non-linear interactions). It’s very easy for information to just flow along it unchanged.

![](Pictures/LSTM3-C-line.png)

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

![](Pictures/LSTM3-gate.png)

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.

## Step-by-Step LSTM Walk Through



Step 1. Forget gate gets rid of irrelevant information

The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$, and outputs a number between 0 and 1 (this is what the sigmoid function does) for each number in the cell state $C_{t-1}$. A 1 ($f_t=1$) represents “completely keep this” while a 0 ($f_t=0$) represents “completely get rid of this.”

![](Pictures/LSTM3-focus-f.png)

Step 2. Store relevant information from current input

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde C_t$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

Note that $\tilde C_t$ is computed the same way as the hidden state update in a vanilla RNN; in an LSTM, however, the term $i_t$ adjusts how much of $\tilde C_t$ is actually written to the cell state.

![](Pictures/LSTM3-focus-i.png)

Step 3. Selectively update cell state

It’s now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do; we just need to actually do it.

We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t*\tilde C_t$. These are the new candidate values, scaled by how much we decided to update each state value.

![](Pictures/LSTM3-focus-C.png)

Step 4. Output gate returns a filtered version of the cell state

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

![](Pictures/LSTM3-focus-o.png)


In a nutshell, LSTM contains 3 gates (sigmoid layers): forget gate, input gate, and output gate.
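The four steps can be sketched as a single NumPy function. This follows the standard LSTM formulation with bias terms; the parameter names (`W_f`, `b_f`, and so on), the dimensions, and the random initialization are assumptions for this illustration rather than values taken from the diagrams.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM time step; p is a dict of weight matrices and biases."""
    z = np.concatenate([h_prev, x_t])               # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])          # step 1: forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])          # step 2: input gate
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])      # step 2: candidate values
    C_t = f_t * C_prev + i_t * C_tilde              # step 3: update cell state
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])          # step 4: output gate
    h_t = o_t * np.tanh(C_t)                        # step 4: filtered output
    return h_t, C_t

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
    params[f"b_{name}"] = np.zeros(n_hidden)

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):   # a sequence of 5 time steps
    h, C = lstm_step(x_t, h, C, params)
```

The cell state `C` is only ever scaled by `f_t` and added to, which is the "conveyor belt" behaviour: no squashing non-linearity is applied to it on its way from one step to the next.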


## Variants on LSTM

What we’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

*LSTM with Peephole Connections*

One popular LSTM variant, introduced by **Gers & Schmidhuber (2000)**, is adding “peephole connections.” This means that we let the gate layers look at the cell state (peeking at the information on the express lane).

![](Pictures/LSTM3-var-peepholes-1.png)

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.

*LSTM with Coupled Forget/Input Gates*

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together (i.e., combining the forget gate and input gate together, so $i_t=1-f_t$). We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.

![](Pictures/LSTM3-var-tied-1.png)


*GRU*

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by **Cho, et al. (2014)**. It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models (fewer parameters and only two gates: update gate and reset gate), and has been growing increasingly popular.

Notice that when $r_t=0$, the unit ignores the previous memory and stores only the new information.


![](Pictures/LSTM3-var-GRU-1.png)


These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by **Yao, et al. (2015)**. There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by **Koutnik, et al. (2014)**.

Which of these variants is best? Do the differences matter? **Greff, et al. (2015)** do a nice comparison of popular variants, finding that they’re all about the same. **Jozefowicz, et al. (2015)** tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

# GRU

GRU (Gated Recurrent Unit) aims to solve the vanishing gradient problem which comes with a standard recurrent neural network. GRU can also be considered as a variation on the LSTM because both are designed similarly and, in some cases, produce equally excellent results.



## How do GRUs work?

To solve the vanishing gradient problem of a standard RNN, the GRU uses a so-called **update gate** and **reset gate**. Basically, these are two vectors which decide what information should be passed to the output. The special thing about them is that they can be trained to keep information from long ago without it being washed out over time, or to remove information which is irrelevant to the prediction.

![](Pictures/LSTM3-var-GRU-1.png)









## Update gate and Reset gate

*Update gate*

We start with calculating the update gate $z_t$ for time step $t$ using the formula:

$$z_t=\sigma(W_z\cdot[h_{t-1}, x_t])$$

The update gate helps the model to determine how much of the past information (from previous time steps) needs to be passed along to the future. That is really powerful because the model can decide to copy all the information from the past and eliminate the risk of vanishing gradient problem.

*Reset gate*

Essentially, this gate is used by the model to decide how much of the past information to forget. To calculate it, we use:

$$r_t=\sigma(W_r\cdot[h_{t-1}, x_t])$$

*Current memory content*


Let’s see how exactly the gates will affect the final output. First, we start with the usage of the reset gate. We introduce a new memory content which will use the reset gate to store the relevant information from the past. It is calculated as follows:

$$\tilde h_t = \tanh(W\cdot [r_t*h_{t-1},x_t])$$

Notice the element-wise product in $r_t*h_{t-1}$. This determines what to remove from the previous time steps. 

*Final memory at current time step*

As the last step, the network needs to calculate $h_t$, the vector which holds information for the current unit and passes it down the network. To do that, the update gate is needed. It determines what to collect from the current memory content $\tilde h_t$ and what to keep from the previous steps $h_{t-1}$. That is done as follows:

$$h_t = (1-z_t)* h_{t-1}+z_t*\tilde h_t$$


Following through, you can see how $z_t$ is used to calculate $1-z_t$, which is combined with $h_{t-1}$ in an element-wise multiplication. $z_t$ is also multiplied element-wise with $\tilde h_t$. Finally, $h_t$ is the sum of these two terms.
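The four GRU equations can be sketched as one NumPy function. Like the formulas in this section, the sketch omits bias terms; the dimensions and random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU time step following the equations in this section."""
    concat = np.concatenate([h_prev, x_t])                  # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                             # update gate
    r_t = sigmoid(W_r @ concat)                             # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # current memory
    return (1.0 - z_t) * h_prev + z_t * h_tilde             # final memory h_t

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W_z = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
W_r = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):   # a sequence of 5 time steps
    h = gru_step(x_t, h, W_z, W_r, W)
```

With this convention, an update gate close to 0 copies the previous state almost unchanged, while a value close to 1 replaces it with the new memory content.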

# Stacked LSTM

Stacking LSTM hidden layers makes the model deeper, more accurately earning the description as a deep learning technique.

Stacked LSTMs or Deep LSTMs were introduced by Graves, et al. in **Speech Recognition with Deep Recurrent Neural Networks**, in their application of LSTMs to speech recognition, beating a benchmark on a challenging standard problem.


In the same work, they found that the depth of the network was more important than the number of memory cells in a given layer to model skill.

Stacked LSTMs are now a stable technique for challenging sequence prediction problems. A Stacked LSTM architecture can be defined as an LSTM model comprised of multiple LSTM layers. An LSTM layer above provides a sequence output rather than a single value output to the LSTM layer below. Specifically, one output per input time step, rather than one output time step for all input time steps.

![](Pictures/stacked_lstm.png)


In practice, notice that each LSTM layer requires a 3D input. When an LSTM processes one input sequence of time steps, each memory cell outputs a single value for the whole sequence, so the layer's output is a 2D array.

To stack LSTM layers, we need to change the configuration of the prior LSTM layer to output a 3D array as input for the subsequent layer.

We can do this by setting the `return_sequences` argument on the layer to `True` (defaults to False). This will return one output for each input time step and provide a 3D array.
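To illustrate the shapes involved, the NumPy sketch below mimics the semantics of `return_sequences` with a toy tanh recurrent layer standing in for an LSTM layer. The function `simple_recurrent_layer`, its random weights, and the sizes are hypothetical and serve only to show the 3D versus 2D outputs.

```python
import numpy as np

def simple_recurrent_layer(inputs, units, return_sequences, rng):
    """Toy recurrent layer. inputs has shape (samples, timesteps, features)."""
    samples, timesteps, features = inputs.shape
    U = rng.normal(scale=0.1, size=(features, units))
    W = rng.normal(scale=0.1, size=(units, units))
    h = np.zeros((samples, units))
    outputs = []
    for t in range(timesteps):
        h = np.tanh(inputs[:, t, :] @ U + h @ W)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)   # 3D: (samples, timesteps, units)
    return h                               # 2D: (samples, units)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 10, 1))            # 8 samples, 10 time steps, 1 feature

# The first layer returns the full sequence so the second layer gets 3D input.
first = simple_recurrent_layer(x, units=16, return_sequences=True, rng=rng)
second = simple_recurrent_layer(first, units=4, return_sequences=False, rng=rng)
```

In Keras the same stacking would be expressed with `LSTM(16, return_sequences=True)` followed by `LSTM(4)`; without the flag on the first layer, the second layer would receive a 2D array and fail.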

# Overfitting Issue and Recurrent Dropout

Recurrent dropout: use dropout to fight overfitting in the recurrent layers (in addition to dropout for the dense layers).

The **same recurrent dropout pattern** should be applied at every timestep.
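A minimal NumPy sketch of this idea on a vanilla tanh RNN is shown below: the dropout mask is sampled once per sequence and reused at every time step. The function name, the weights, and the use of inverted dropout scaling are assumptions for this illustration; in Keras, the corresponding knob is the `recurrent_dropout` argument of the recurrent layers.

```python
import numpy as np

def rnn_with_recurrent_dropout(xs, W, U, rate, rng):
    """Vanilla RNN forward pass with one dropout mask shared by all steps."""
    hidden = W.shape[0]
    keep = 1.0 - rate
    # Sample the mask ONCE per sequence (scaled so the expected value is kept).
    mask = (rng.random(hidden) < keep) / keep
    h = np.zeros(hidden)
    for x_t in xs:
        # The same mask is applied to the recurrent connection at every step.
        h = np.tanh(W @ (h * mask) + U @ x_t)
    return h, mask

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 4))
U = rng.normal(scale=0.5, size=(4, 3))
xs = rng.normal(size=(6, 3))               # 6 time steps, 3 features

h_last, mask = rnn_with_recurrent_dropout(xs, W, U, rate=0.5, rng=rng)
```

Sampling a fresh mask at every step would instead inject new noise into the hidden state each time, which tends to disrupt what the network carries across time steps.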


# Conclusions

LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, **Xu, et al. (2015)** do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by **Kalchbrenner, et al. (2015)** seem extremely promising. Work using RNNs in generative models – such as **Gregor, et al. (2015)**, **Chung, et al. (2015)**, or **Bayer & Osendorfer (2015)** – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

# Time Series

## Making Predictions with Sequences

The sequence imposes an explicit order on the observations.

The order is important. It must be respected in the formulation of prediction problems that use the sequence data as input or output for the model.

### Sequence Prediction

Sequence prediction involves predicting the next value for a given input sequence.

For example:

- Given: 1, 2, 3, 4, 5
- Predict: 6

![](Pictures/sequence_pred.png)

Some examples of sequence prediction problems include:


- Stock Market Prediction. Given a sequence of movements of a security over time, predict the next movement of the security.
- Product Recommendation. Given a sequence of past purchases of a customer, predict the next purchase of a customer.

### Sequence-to-Sequence Prediction

Sequence-to-sequence prediction involves predicting an output sequence given an input sequence.

For example:

- Given: 1, 2, 3, 4, 5
- Predict: 6, 7, 8, 9, 10

![](Pictures/seq2seq.png)

If the input and output sequences are a time series, then the problem may be referred to as multi-step time series forecasting.

- Multi-Step Time Series Forecasting. Given a time series of observations, predict a sequence of observations for a range of future time steps.
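Both framings can be produced from a raw series with a sliding window. The helper `split_sequence` below is a hypothetical name introduced for this sketch; it turns a list of observations into (input, output) pairs for one-step or multi-step prediction.

```python
def split_sequence(sequence, n_in, n_out=1):
    """Slide a window over the sequence to build (input, output) pairs."""
    samples = []
    for start in range(len(sequence) - n_in - n_out + 1):
        x = sequence[start:start + n_in]                    # input time steps
        y = sequence[start + n_in:start + n_in + n_out]     # steps to predict
        samples.append((x, y))
    return samples

series = list(range(1, 11))   # 1, 2, ..., 10

# Sequence prediction: 5 inputs -> the next value.
one_step = split_sequence(series, n_in=5, n_out=1)

# Sequence-to-sequence prediction: 5 inputs -> the next 5 values.
multi_step = split_sequence(series, n_in=5, n_out=5)
```

The first pair in `one_step` is `([1, 2, 3, 4, 5], [6])`, and the single pair in `multi_step` is `([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])`, matching the two examples above.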

### Cardinality from Timesteps (NOT features!!)

The cardinality of the sequence prediction models defined above refers to time steps, not features (e.g. univariate or multivariate sequences).

A sequence may be comprised of single values, one for each time step.

Alternately, a sequence could just as easily represent a vector of multiple observations at the time step. Each item in the vector for a time step may be thought of as its own separate time series. 

For example, a model that takes as **input one time step** of temperature and pressure and **predicts one time step** of temperature and pressure is a **one-to-one model**, not a many-to-many model.
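The distinction shows up directly in array shapes. In the NumPy sketch below, the temperature and pressure readings are made-up values, and the `(samples, timesteps, features)` layout is the convention assumed here for LSTM input.

```python
import numpy as np

# Made-up readings for two features observed at 4 time steps.
temperature = np.array([20.1, 20.5, 21.0, 21.4])
pressure = np.array([1012.0, 1011.5, 1011.0, 1010.2])

# One row per time step, one column per feature: shape (4, 2).
observations = np.stack([temperature, pressure], axis=1)

# One-to-one framing: each sample is ONE time step with TWO features,
# and the target is the next time step of both features.
X = observations[:-1].reshape(-1, 1, 2)   # (samples, timesteps, features)
y = observations[1:]                      # (samples, features)
```

The timesteps axis of `X` has length 1, so this is a one-to-one model even though each step carries two features.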

![](Pictures/Multiple-Feature.png)

# Data Preparation for LSTMs

## Stationary Time Series

Trends can result in a varying mean over time, whereas seasonality can result in a changing variance over time, both of which define a time series as being non-stationary. Stationary datasets are those that have a stable mean and variance, and are in turn much easier to model.

Differencing is a popular and widely used data transform for making time series data stationary.

In [2]:
# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return diff

# invert differenced forecast
def inverse_difference(last_ob, value):
	return value + last_ob


In [3]:
# define a dataset with a linear trend
data = [i+1 for i in range(20)]
print(data)
# difference the dataset
diff = difference(data)
print(diff)
# invert the difference
inverted = [inverse_difference(data[i], diff[i]) for i in range(len(diff))]
print(inverted)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]


## Scaling Data

When a network is fit on unscaled data that has a range of values (e.g. quantities in the 10s to 100s) it is possible for large inputs to slow down the learning and convergence of your network and in some cases prevent the network from effectively learning your problem.

### Normalize Series Data

Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1.

In [None]:
# pseudo code
y = (x - min) / (max - min)

You can normalize your dataset using the scikit-learn object `MinMaxScaler`.

Good practice usage with the `MinMaxScaler` and other scaling techniques is as follows:

- Fit the scaler using available **training data**. For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the `fit()` function.
- Apply the scale to **training data**. This means you can use the normalized data to train your model. This is done by calling the `transform()` function.
- Apply the scale to data going forward. This means you can prepare new data in the future on which you want to make predictions.

If needed, the transform can be inverted. This is useful for converting predictions back into their original scale for reporting or plotting. This can be done by calling the `inverse_transform()` function.

In [7]:
from pandas import Series
from sklearn.preprocessing import MinMaxScaler
# define contrived series
data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
series = Series(data)
print(series)
# prepare data for normalization
values = series.values
values = values.reshape((len(values), 1))
print(values)
# train the normalization
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(values)
print('Min: %f, Max: %f' % (scaler.data_min_[0], scaler.data_max_[0]))
# normalize the dataset and print
normalized = scaler.transform(values)
print(normalized)
# inverse transform and print
inversed = scaler.inverse_transform(normalized)
print(inversed)

0     10.0
1     20.0
2     30.0
3     40.0
4     50.0
5     60.0
6     70.0
7     80.0
8     90.0
9    100.0
dtype: float64
[[ 10.]
 [ 20.]
 [ 30.]
 [ 40.]
 [ 50.]
 [ 60.]
 [ 70.]
 [ 80.]
 [ 90.]
 [100.]]
Min: 10.000000, Max: 100.000000
[[0.        ]
 [0.11111111]
 [0.22222222]
 [0.33333333]
 [0.44444444]
 [0.55555556]
 [0.66666667]
 [0.77777778]
 [0.88888889]
 [1.        ]]
[[ 10.]
 [ 20.]
 [ 30.]
 [ 40.]
 [ 50.]
 [ 60.]
 [ 70.]
 [ 80.]
 [ 90.]
 [100.]]


### Standardize Series Data

Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1.

Standardization assumes that your observations fit a Gaussian distribution (bell curve) with a well behaved mean and standard deviation. You can still standardize your time series data if this expectation is not met, but you may not get reliable results.



In [None]:
# pseudo code
y = (x - mean) / standard_deviation

You can standardize your dataset using the scikit-learn object `StandardScaler`.

In [8]:
from pandas import Series
from sklearn.preprocessing import StandardScaler
from math import sqrt
# define contrived series
data = [1.0, 5.5, 9.0, 2.6, 8.8, 3.0, 4.1, 7.9, 6.3]
series = Series(data)
print(series)
# prepare data for standardization
values = series.values
values = values.reshape((len(values), 1))
# train the standardization
scaler = StandardScaler()
scaler = scaler.fit(values)
print('Mean: %f, StandardDeviation: %f' % (scaler.mean_[0], sqrt(scaler.var_[0])))
# standardize the dataset and print
standardized = scaler.transform(values)
print(standardized)
# inverse transform and print
inversed = scaler.inverse_transform(standardized)
print(inversed)

0    1.0
1    5.5
2    9.0
3    2.6
4    8.8
5    3.0
6    4.1
7    7.9
8    6.3
dtype: float64
Mean: 5.355556, StandardDeviation: 2.712568
[[-1.60569456]
 [ 0.05325007]
 [ 1.34354035]
 [-1.01584758]
 [ 1.26980948]
 [-0.86838584]
 [-0.46286604]
 [ 0.93802055]
 [ 0.34817357]]
[[1. ]
 [5.5]
 [9. ]
 [2.6]
 [8.8]
 [3. ]
 [4.1]
 [7.9]
 [6.3]]


### Real-Valued Inputs

You may have a sequence of quantities as inputs, such as **prices** or temperatures.

If the distribution of the quantity is normal, then it should be standardized; otherwise, the series should be normalized. This applies if the range of quantity values is large (10s, 100s, etc.) or small (0.01, 0.0001).

If the quantity values are small (near 0-1) and the distribution is limited (e.g. standard deviation near 1) then perhaps you can get away with no scaling of the series.

### Scaling Output Variable

You must ensure that the scale of your output variable matches the scale of the activation function (transfer function) on the output layer of your network.

If the output is a real value, this is best modeled with a linear activation function. If the distribution of the value is normal, then you can standardize the output variable. Otherwise, the output variable can be normalized.

### Considerations

- Estimate descriptive statistics. You can estimate descriptive statistics (min and max values for normalization or mean and standard deviation for standardization) from the training data. Inspect these first-cut estimates and use domain knowledge or domain experts to help improve these estimates so that they will be usefully correct on all data in the future.
- Save descriptive statistics. You will need to normalize new data in the future in exactly the same way as the data used to train your model. Save the descriptive statistics used to file and load them later when you need to scale new data when making predictions.
- Data Analysis. Use data analysis to help you better understand your data. For example, a simple histogram can help you quickly get a feeling for the distribution of quantities to see if standardization would make sense.
- Scale Each Series. If your problem has multiple series, treat each as a separate variable and in turn scale them separately.
- Scale At The Right Time. It is important to apply any scaling transforms at the right time. For example, if you have a series of quantities that is non-stationary, it may be appropriate to **scale after first making your data stationary**. It would not be appropriate to scale the series after it has been transformed into a supervised learning problem as each column would be handled differently, which would be incorrect.
- Scale if in Doubt. You probably do need to rescale your input and output variables. If in doubt, **at least normalize** your data.
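The advice above to save descriptive statistics and reuse them on new data can be sketched as follows. This is a minimal illustration only, assuming min-max normalization and a JSON file; the file name `scaler_stats.json` and the helpers `fit_stats` and `normalize` are hypothetical names chosen for this example.

```python
import json
import os
import tempfile

def fit_stats(train_values):
    """Estimate min and max from the training data only."""
    return {"min": min(train_values), "max": max(train_values)}

def normalize(values, stats):
    """Apply y = (x - min) / (max - min) using previously estimated statistics."""
    span = stats["max"] - stats["min"]
    return [(v - stats["min"]) / span for v in values]

train = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
stats = fit_stats(train)

# save the statistics to file (hypothetical file name)
path = os.path.join(tempfile.gettempdir(), "scaler_stats.json")
with open(path, "w") as f:
    json.dump(stats, f)

# later, when making predictions: load the statistics and scale new data the same way
with open(path) as f:
    loaded = json.load(f)
new_scaled = normalize([25.0, 100.0], loaded)
print(new_scaled)
```

When using scikit-learn, an alternative is to serialize the fitted scaler object itself, for example with `pickle`.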


## Handle Missing Timesteps

Sequences must be framed as a supervised learning problem when using neural networks.

That means the sequence needs to be divided into input and output pairs.

e.g. $y_t = f(X_t, X_{t-1})$

In [1]:
from random import random
from numpy import array
from pandas import concat
from pandas import DataFrame

# generate a sequence of random values
def generate_sequence(n_timesteps):
	return [random() for _ in range(n_timesteps)]

# generate data for the lstm
def generate_data(n_timesteps):
	# generate sequence
	sequence = generate_sequence(n_timesteps)
	sequence = array(sequence)
	# create lag
	df = DataFrame(sequence)
	df = concat([df.shift(1), df], axis=1)
	values = df.values
	# specify input and output data
	X, y = values, values[:, 0]
	return X, y

# generate sequence
n_timesteps = 10
X, y = generate_data(n_timesteps)
# print sequence
for i in range(n_timesteps):
	print(X[i], '=>', y[i])

[       nan 0.68627597] => nan
[0.68627597 0.74802302] => 0.686275974635361
[0.74802302 0.66601443] => 0.7480230160283416
[0.66601443 0.21218973] => 0.6660144337919112
[0.21218973 0.32765495] => 0.21218972806821557
[0.32765495 0.57127038] => 0.32765494520847505
[0.57127038 0.61242966] => 0.571270382249948
[0.61242966 0.38275331] => 0.6124296573244923
[0.38275331 0.3765882 ] => 0.38275330826352305
[0.3765882 0.9972823] => 0.3765882027159263


We can generate sequences of random values between 0 and 1 using the `random()` function in the random module.




The Pandas `shift()` function can be used to create a shifted version of the sequence that can be used to represent the observations at the prior timestep. This can be concatenated with the raw sequence to provide the $X_{t-1}$ and $X_{t}$ input values.

### Remove all rows that contain a NaN value

This can be done with the Pandas `dropna()` function.

In [2]:
# generate data for the lstm
def generate_data(n_timesteps):
    # generate sequence
    sequence = generate_sequence(n_timesteps)
    sequence = array(sequence)
    # create lag
    df = DataFrame(sequence)
    df = concat([df.shift(1), df], axis=1)
    # remove rows with missing values
    df.dropna(inplace=True)
    values = df.values
    # specify input and output data
    X, y = values, values[:, 0]
    return X, y

### Replace Missing Values

We can replace all NaN values with a specific value that does not appear naturally in the input, such as -1. To do this, we can use the Pandas `fillna()` function.

In [None]:
# generate data for the lstm
def generate_data(n_timesteps):
    # generate sequence
    sequence = generate_sequence(n_timesteps)
    sequence = array(sequence)
    # create lag
    df = DataFrame(sequence)
    df = concat([df.shift(1), df], axis=1)
    # replace missing values with -1
    df.fillna(-1, inplace=True)
    values = df.values
    # specify input and output data
    X, y = values, values[:, 1]
    return X, y

### Masking Missing Values

The marked missing input values can be masked from all calculations in the network.

We can do this by using a `Masking` layer as the first layer of the network.

When defining the layer, we can specify which value in the input to mask. If all features for a timestep contain the masked value, then the whole timestep will be excluded from calculations.

This provides a middle ground between excluding the row completely and forcing the network to learn the impact of marked missing values.

Because the Masking layer is the first in the network, it must specify the expected shape of the input.
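A minimal sketch of the masking approach described above, assuming TensorFlow's bundled Keras API and -1 as the marker for missing values. The tiny dataset, layer sizes, and training settings are illustrative assumptions, not values from the original notebook. The input shape is declared here with an `Input` object placed ahead of the `Masking` layer.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

# three samples, each with 2 timesteps and 1 feature; -1 marks a missing value
X = np.array([[-1.0, 0.69], [0.69, 0.75], [0.75, 0.67]]).reshape((3, 2, 1))
y = np.array([0.69, 0.75, 0.67])

model = Sequential([
    Input(shape=(2, 1)),       # expected input shape: (timesteps, features)
    Masking(mask_value=-1.0),  # timesteps whose features all equal -1 are skipped
    LSTM(5),
    Dense(1),
])
model.compile(loss="mean_squared_error", optimizer="adam")
model.fit(X, y, epochs=2, batch_size=1, verbose=0)

yhat = model.predict(X, verbose=0)
print(yhat.shape)
```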

# LSTM with TensorFlow

Please check how I implement LSTM with TensorFlow on Google Colab for a better understanding.

## Define Network

The first step is to define your network.

Neural networks are defined in Keras as a sequence of layers. The container for these layers is the Sequential class.

The first step is to create an instance of the Sequential class. Then you can create your layers and add them in the order that they should be connected. The LSTM recurrent layer comprised of memory units is called LSTM(). A fully connected layer that often follows LSTM layers and is used for outputting a prediction is called Dense().

### Reshape Input layer

The first layer in the network must define the number of inputs to expect. Input must be three-dimensional, comprised of samples, timesteps, and features.

- Samples. These are the rows in your data. One sequence is one sample. A batch is comprised of one or more samples.
- Timesteps. These are the past observations for a feature, such as lag variables. One time step is one point of observation in the sample.
- Features. These are columns in your data. One feature is one observation at a time step.

![](Pictures/input_shape.png)


When defining the input layer of your LSTM network, the network assumes you have 1 or more samples and requires that you specify the number of time steps and the number of features.
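As a rough sketch of the two ideas above, the snippet below reshapes a 2D array into the three-dimensional `(samples, timesteps, features)` layout and then defines a small network with `Sequential`, `LSTM()`, and `Dense()`. The data values and the choice of 5 memory units are assumptions made for illustration.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# 2D data: 4 samples (rows), each holding 3 lagged observations of one variable
data = np.array([[0.1, 0.2, 0.3],
                 [0.2, 0.3, 0.4],
                 [0.3, 0.4, 0.5],
                 [0.4, 0.5, 0.6]])

# reshape to 3D: (samples, timesteps, features)
n_samples, n_timesteps, n_features = data.shape[0], data.shape[1], 1
X = data.reshape((n_samples, n_timesteps, n_features))
print(X.shape)

# define the network: only timesteps and features are specified, not samples
model = Sequential()
model.add(Input(shape=(n_timesteps, n_features)))
model.add(LSTM(5))   # 5 memory units (arbitrary choice)
model.add(Dense(1))  # one output value per sample
model.summary()
```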

### Windowing Data set (Rolling window)

Windowing is a method to turn a time series dataset into a supervised learning problem.

In other words, we want to use windows of the past to predict the future.

For example, for a univariate time series, windowing for one week (window=7) to predict the next single value (horizon=1) might look like:

In [None]:
# pseudo
Window for one week (univariate time series)

[0, 1, 2, 3, 4, 5, 6] -> [7]
[1, 2, 3, 4, 5, 6, 7] -> [8]
[2, 3, 4, 5, 6, 7, 8] -> [9]

- Window size (input): number of time steps of historical data used to predict horizon
- Horizon (output): number of time steps to predict into the future
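The windowing scheme above can be sketched with a small helper. The function name `make_windows` is hypothetical and the implementation is one simple way to do it with NumPy, assuming a univariate series.

```python
import numpy as np

def make_windows(series, window=7, horizon=1):
    """Split a 1D series into (window, horizon) input/output pairs."""
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        X.append(series[start:start + window])
        y.append(series[start + window:start + window + horizon])
    return np.array(X), np.array(y)

series = np.arange(10)
X, y = make_windows(series, window=7, horizon=1)
for xi, yi in zip(X, y):
    print(xi, "->", yi)
```

For use with an LSTM, `X` would still need to be reshaped to `(samples, timesteps, features)`, for example with `X.reshape((X.shape[0], X.shape[1], 1))`.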

### Choice of Activation functions

The choice of activation function is most important for the output layer as it will define the format that predictions will take.

For a common predictive modeling problem type such as **regression**, the standard structure and activation function for the output layer are:

**Regression**: Linear activation function, or `'linear'`, and the number of neurons matching the number of outputs.

## Compile Network

Once we have defined our network, we must compile it.

Compilation is an efficiency step. It transforms the simple sequence of layers that we defined into a highly efficient series of matrix transforms in a format intended to be executed on your GPU or CPU, depending on how Keras is configured.

Think of compilation as a precompute step for your network. It is always required after defining a model.


Compilation requires a number of parameters to be specified, tailored to training your network: the optimization algorithm used to train the network and the loss function that the optimization algorithm minimizes.

For example, a defined model for a **regression** type problem can be compiled with the **stochastic gradient descent (sgd)** optimization algorithm and the **mean squared error** loss function.

Perhaps the most commonly used optimization algorithms because of their generally better performance are:

- **Stochastic Gradient Descent**, which requires the tuning of a learning rate and momentum.
- **Adam**, which requires the tuning of a learning rate.
- **RMSprop**, which requires the tuning of a learning rate.

Finally, you can also specify **metrics** to collect while fitting your model in addition to the loss function. 
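The compile step described above might look like the following sketch. The model shape, the `mae` metric, and the learning rate and momentum values are assumptions chosen for illustration.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import SGD

model = Sequential([Input(shape=(7, 1)), LSTM(5), Dense(1, activation="linear")])

# refer to the optimizer by name to use its default settings
model.compile(optimizer="sgd", loss="mean_squared_error", metrics=["mae"])

# or pass an optimizer object when the learning rate and momentum need tuning
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
              loss="mean_squared_error", metrics=["mae"])
```

Calling `compile()` again simply replaces the earlier configuration, so only the second call is in effect here.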

## Fit Network

Once the network is compiled, it can be fit, which means adapting the weights on a training dataset.

Fitting the network requires the training data to be specified, both a matrix of input patterns, X, and an array of matching output patterns, y.

The network is trained using the backpropagation algorithm and optimized according to the optimization algorithm and loss function specified when compiling the model.

The backpropagation algorithm requires that the network be trained for a specified number of **epochs** or exposures to the training dataset.

Each epoch can be partitioned into groups of input-output pattern pairs called **batches**. This defines the number of patterns that the network is exposed to before the weights are updated within an epoch. It is also an efficiency optimization, ensuring that not too many input patterns are loaded into memory at a time.

Once fit, a **history** object is returned that provides a summary of the performance of the model during training. This includes both the loss and any additional metrics specified when compiling the model, recorded each epoch.

Training can take a long time, from seconds to hours to days depending on the size of the network and the size of the training data.

By default, a progress bar is displayed on the command line for each epoch. This may create too much noise for you, or may cause problems for your environment, such as if you are in an interactive notebook or IDE.

You can **reduce the amount of information displayed** to just the loss each epoch by setting `verbose=2`. You can turn off all output by setting `verbose=0`.
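A minimal sketch of fitting, assuming random toy data and arbitrary values for `epochs` and `batch_size`. It shows where epochs, batches, the `verbose` argument, and the returned history object appear in the call.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# toy data: 20 samples, 7 timesteps, 1 feature; target is the sum of each window
rng = np.random.default_rng(1)
X = rng.random((20, 7, 1))
y = X.sum(axis=1)

model = Sequential([Input(shape=(7, 1)), LSTM(5), Dense(1)])
model.compile(optimizer="adam", loss="mean_squared_error")

# 3 passes over the data; weights are updated after every batch of 5 samples
history = model.fit(X, y, epochs=3, batch_size=5, verbose=0)

# one loss value is recorded per epoch
print(history.history["loss"])
```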



## Evaluate Network

Once the network is trained, it can be evaluated.

The network can be evaluated on the training data, but this will not provide a useful indication of the performance of the network as a predictive model, as it has seen all of this data before.

We can evaluate the performance of the network on a separate dataset, unseen during training. This will provide an estimate of the performance of the network at making predictions for unseen data in the future.

The model evaluates the loss across all of the test patterns, as well as any other metrics specified when the model was compiled. A list of evaluation metrics is returned.

As with fitting the network, verbose output is provided to give an idea of the progress of evaluating the model. We can turn this off by setting `verbose=0`.
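Evaluation on held-out data can be sketched as below. The toy data, the 20/10 train/test split, and the `mae` metric are assumptions for illustration. Because one metric is given at compile time, `evaluate()` returns the loss followed by that metric.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# toy data: the target is the last value of each 7-step window
rng = np.random.default_rng(7)
X = rng.random((30, 7, 1))
y = X[:, -1, :]

# hold out the last 10 samples; they are never used for fitting
X_train, y_train = X[:20], y[:20]
X_test, y_test = X[20:], y[20:]

model = Sequential([Input(shape=(7, 1)), LSTM(5), Dense(1)])
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["mae"])
model.fit(X_train, y_train, epochs=2, batch_size=5, verbose=0)

loss, mae = model.evaluate(X_test, y_test, verbose=0)
print("loss: %f, mae: %f" % (loss, mae))
```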

## Make Predictions

Once we are satisfied with the performance of our fit model, we can use it to make predictions on new data.

This is as easy as calling the `predict()` function on the model with an array of new input patterns.

The predictions will be returned in the format provided by the output layer of the network.

In the case of a **regression** problem, these predictions may be in the format of the problem directly, provided by a linear activation function.

As with fitting and evaluating the network, verbose output is provided to give an idea of the progress of the model making predictions. We can turn this off by setting `verbose=0`.
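A short sketch of making predictions, again on assumed toy data. New inputs must have the same `(samples, timesteps, features)` layout used for training, and the output has one row per input sample.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

rng = np.random.default_rng(3)
X = rng.random((20, 7, 1))
y = X[:, -1, :]

model = Sequential([Input(shape=(7, 1)), LSTM(5), Dense(1, activation="linear")])
model.compile(optimizer="adam", loss="mean_squared_error")
model.fit(X, y, epochs=2, batch_size=5, verbose=0)

# two new input patterns with the same timesteps and features as the training data
X_new = rng.random((2, 7, 1))
yhat = model.predict(X_new, verbose=0)
print(yhat.shape)
```

If the output variable was scaled before training, the predictions would need the matching inverse transform to return to the original units.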