# Long Short Term Memory (LSTM) Network

<!-- TOC START min:2 max:4 link:true asterisk:true update:true -->
* [What you'll learn in this class](#what-you-will-learn-in-this-class)
* [Introduction](#introduction)
* [Recurrent neural network](#recurrent-neural-networks)
  * [Recurrent Neurons](#recurrent-neuron)
  * [Long-term network problems](#long---term-network-problems)
* [LSTM networks](#lstm-networks)
  * [Central idea of LSTMs](#central-idea-of-lstms)
  * [Step-by-step explanation of LSTMs](#step---by---step-explanation-of-lstms)
* [Conclusion](#conclusion)
<!-- TOC END -->

## What you'll learn in this class

In this course, we will study a deep learning model suited to any situation where the order of the data carries important information. This is obviously the case for time series, but also for language analysis, where word order matters a great deal.

The general idea behind LSTM networks is that earlier information can keep influencing what follows, whether immediately or much later. Here we will apply this model to the construction of an automatic translator from English to French.


## Introduction

Human beings do not reset their thinking with every passing second, even though some people are said to have goldfish memories! As you read this course, what you have read previously is used by your brain to understand what is being explained next. You do not forget each word as soon as you read the next one. Your thoughts have a certain lifespan.

The problem is that simple (feedforward) neural networks do not have this memory capacity: they process each piece of information (image, word, explanatory variable) separately for each observation, regardless of the observation they processed just before. Imagine that you are trying to translate a text from English to French: a classical neural network can translate the words literally, one by one, but cannot use the context to help translate the next word.

For example, the sentence:


<table>
  <tr>
   <td>I
   </td>
   <td>rose
   </td>
   <td>from
   </td>
   <td>the
   </td>
   <td>ashes
   </td>
  </tr>
</table>


is translated as “je me levais de mes cendres”. Without the context, however, some words have several possible translations:


<table>
  <tr>
   <td>I
   </td>
   <td>rose
   </td>
   <td>from
   </td>
   <td>the
   </td>
   <td>ashes
   </td>
  </tr>
  <tr>
   <td>je
   </td>
   <td>rose
   </td>
   <td>de
   </td>
   <td>les
   </td>
   <td>cendres
   </td>
  </tr>
  <tr>
   <td>moi
   </td>
   <td>Me levais
   </td>
   <td>À partir de
   </td>
   <td>
   </td>
   <td>
   </td>
  </tr>
</table>


It is only by taking the whole sentence into account that one realizes that the word "rose" is a verb, which should be translated as "me levais", and not a noun, which would be translated as "rose". Simple neural networks are unable to take this context into account.
The networks that solve this problem are composed of neurons containing loops, which allow them to take previous information into account.






## Recurrent neural network

### Recurrent Neurons

We will explain the principle of a recurrent neuron using the following diagram:




![](https://drive.google.com/uc?export=view&id=13XkLQCR0eZMA5QMGh7y6WrEC9kpLw61E)




The neuron, represented in the diagram above by box A, takes as input $x_t$, the information entering the neuron (usually a vector containing the explanatory variables or the outputs of other neurons in the network), and generates an output $h_t$ by applying its weighted combination and activation function.

A loop allows this information to be reused when the next observation is processed. These loops make recurrent networks seem somewhat mysterious, but when you think about it, they are not that different from simple neural networks. They can be thought of as several copies of the same network, each passing a message to the next. Graphically, this can be visualized as follows:




![](https://drive.google.com/uc?export=view&id=13eq5WJVh2cjf811z0FPZ5U2b2EZMxlQg)





Recurrent networks create a chain of information that persists from observation to observation. This chain structure is intimately linked to the notions of lists and sequences. This is why these networks are particularly suitable for analyzing time series or text data.
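The loop can be sketched in a few lines of NumPy. This is a minimal illustration rather than code from the course: it assumes a `tanh` activation (the one used by standard recurrent networks, as discussed later), and the names `rnn_step`, `W_x`, `W_h` and `b`, the sizes, and the random untrained weights are all chosen for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One pass through box A: combine the input with the previous output."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5   # illustrative sizes

# Untrained, randomly initialised parameters (hypothetical values).
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

xs = rng.normal(size=(seq_len, input_size))  # a sequence of 5 observations
h = np.zeros(hidden_size)                    # nothing remembered before the first step
outputs = []
for x_t in xs:
    # The same function and the same weights are reused at every step;
    # only h changes, carrying information from one observation to the next.
    h = rnn_step(x_t, h, W_x, W_h, b)
    outputs.append(h)

outputs = np.stack(outputs)  # shape (seq_len, hidden_size)
```

The `for` loop is the "unrolled" view of the diagram: each iteration plays the role of one copy of the network, and `h` is the message passed to the next copy.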

And they are widely used! In recent years, recurrent neural networks have proven effective on many problems: speech recognition, language modeling, translation, image captioning, and more. If you want to explore the many possible applications of recurrent neural networks, I encourage you to read this blog post, which gives a detailed presentation: [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

An essential part of this success comes from the use of LSTM networks, a particular variety of recurrent model that in many cases performs much better than the standard version.



### Long-term network problems

The main advantage of recurrent neural networks is their ability to use previous information to solve the present task, just as knowledge of the last word translated in a sentence can help translate the next word. If recurrent networks could do this in a meaningful way, they would be extremely useful, but can they?

Sometimes it is enough to take into account very recent information to solve the present task. For example, imagine a model that tries to predict the next word from the previous words. If we are trying to predict the last word in the sentence "the clouds are in the _sky_", we don't need more information: it is fairly obvious that the last word will be "sky". In this kind of situation, where the task at hand and the information needed for it are not too far apart, recurrent neural networks can learn to use the previous information to solve the current problem.






![](https://drive.google.com/uc?export=view&id=1ztVRKSPdL0PuqvbFBTP29xvKSId-0e9a)





But there are also situations in which a richer or older context is needed to carry out the present task. Imagine a problem where you have to guess the last word of the following text: "I grew up in France, near Paris, in a small town called Fontenay-sous-Bois, where I spent eleven years of my life, that's why I am fluent in _French_." In this example, recent information suggests that the last word is a language, but to guess which one exactly, you need the context from the beginning of the sentence, which mentions France. The distance between the relevant information and the point at which it is needed can become very large.

Unfortunately, as this distance increases, recurrent neural networks become less able to make the connection between the current task and the relevant information.





![](https://drive.google.com/uc?export=view&id=1b_jQnCNZpfvQC3nmAqLGswNe_VwHl3FD)




In theory, recurrent neural networks are capable of managing such "long-term relationships". A human being could carefully select the right parameters to solve these kinds of problems. Unfortunately, in practice, classical recurrent neural networks do not seem to be able to find these parameters themselves during training. This is why a slightly more complex and much more useful model was developed: the Long Short-Term Memory network!
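One way to see the difficulty numerically is to measure how much the latest output still depends on an early one. The sketch below is an illustration under assumptions that are not part of the course: a `tanh` recurrent neuron whose recurrent weight matrix is an orthogonal matrix scaled by `0.9`, with made-up sizes and random inputs. It multiplies the step-to-step Jacobians and tracks the norm of the product, which bounds how strongly a change in the initial output can still affect the output at step $t$.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 8, 3, 50   # illustrative sizes

# Assumption: recurrent weights are 0.9 times an orthogonal matrix,
# so every recurrent step can only shrink a perturbation.
Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))
W_h = 0.9 * Q
W_x = 0.1 * rng.normal(size=(hidden_size, input_size))

h = np.zeros(hidden_size)
jacobian = np.eye(hidden_size)   # d h_t / d h_0, built up step by step
norms = []
for _ in range(seq_len):
    x_t = rng.normal(size=input_size)
    h = np.tanh(W_h @ h + W_x @ x_t)
    # Chain rule for h_t = tanh(W_h h_{t-1} + W_x x_t):
    # d h_t / d h_{t-1} = diag(1 - h_t^2) @ W_h
    step_jacobian = (1.0 - h**2)[:, None] * W_h
    jacobian = step_jacobian @ jacobian
    norms.append(np.linalg.norm(jacobian, 2))   # spectral norm

print(f"influence after 1 step: {norms[0]:.3f}")
print(f"influence after {seq_len} steps: {norms[-1]:.2e}")
```

Under these assumptions the influence shrinks at least as fast as $0.9^t$, so after 50 steps an early observation barely affects the output. The same product of Jacobians appears in the gradient during training, which is a commonly cited reason why long-range relationships are hard for a standard recurrent network to learn.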



## LSTM networks

LSTM networks are recurrent neural networks capable of managing and understanding these famous long-term relationships while maintaining the ability to understand short-term connections. They were invented by [Hochreiter & Schmidhuber (1997)](http://www.bioinf.jku.at/publications/older/2604.pdf) and have been improved and popularized by many other researchers since then. They work well to solve many problems and are now widely used.

LSTMs were developed expressly to avoid the difficulties associated with long-term relationships. Retaining information over long periods is practically their default behavior, not something they struggle to learn.

All recurrent networks have a very similar chain structure: several identical modules, each passing information to the next. In standard recurrent networks, these repeated modules have a very simple structure, such as a single "tanh" layer.









![](https://drive.google.com/uc?export=view&id=1uQUEUTGNPSExO_x8tENy7tjmoysTSrnZ)








Here, the current information and the information given by the previous module (which processed the previous observation) are combined and passed through a "tanh" layer, which produces the output $h_t$ of the neuron in question.
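Written as a formula, and borrowing the notation of the LSTM equations later in this course, the repeated module of a standard recurrent network can be expressed as follows. The weight matrix $W$ and the bias $b$ are not named in the diagram and are introduced here as assumptions; $[h_{t-1}, x_t]$ denotes the concatenation of the previous output and the current observation.

$$
h_t = \tanh\left(W \cdot [h_{t-1}, x_t] + b\right)
$$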

LSTM networks have a similar chain structure, but the repeated module is different: instead of a single neural network layer, there are four layers that interact in a precise manner:









![](https://drive.google.com/uc?export=view&id=1BZHPdgevIHRWK4qi7Os4XPIZ3r1-WlM5)









Don't be alarmed by the apparent complexity of this diagram: we will explain how this model works in detail in what follows. For now, let's introduce the notation used:








![](https://drive.google.com/uc?export=view&id=1tlofNiMnCGZnAjzSauTcqum2TQKskEHy)








In the diagram above, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent operations applied to the vectors element by element, such as the addition of two vectors, while the yellow rectangles are neural network layers whose parameters (weights and biases) are learned. Lines that merge denote the concatenation of two vectors, while a line that forks means that the vector is copied and each identical copy goes to a different destination in the network.
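These conventions map directly onto array operations. The short sketch below uses NumPy with made-up vectors; every value and variable name is an illustrative assumption.

```python
import numpy as np

h_prev = np.array([0.5, -0.2])        # output of the previous step
x_t = np.array([1.0, 0.0, -1.0])      # current observation

# Lines that merge: the two vectors are concatenated into one.
merged = np.concatenate([h_prev, x_t])          # shape (5,)

# A line that forks: the same vector is sent to two destinations.
to_first_layer, to_second_layer = merged, merged.copy()

# Pink circles: element-by-element operations on vectors of the same size.
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, 0.0, 2.0])
added = a + b          # element-by-element addition
multiplied = a * b     # element-by-element multiplication
```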



### Central idea of LSTMs

The key to the LSTM is the cell state, the horizontal line that runs across the diagram from left to right.

This state can be seen as the conveyor belt of an assembly line. It runs along the entire chain with only minor linear transformations: the information generally passes through the cell without being drastically modified.






![](https://drive.google.com/uc?export=view&id=1C38xfW-AJiwjEZg7erl-Luif5xdBAr3G)





LSTMs have the ability to change the cell state by removing or adding information. They do this in a carefully controlled manner through structures called _gates_.

Gates are a way of optionally letting information through. They consist of a sigmoid neural layer followed by an element-by-element multiplication.










![](https://drive.google.com/uc?export=view&id=1fdrNDGLEp9NIpDLwkflTJKuwdGYCQO76)















The sigmoid layer produces a vector of numbers between $0$ and $1$, which describes what proportion of each element of the state is retained by the element-by-element multiplication. A value of zero at position $i$ means removing all information from element $i$ of the state vector; a value of one at position $i$ means letting all of it through. This operation can be illustrated by the following table:


<table>
  <tr>
   <td>Position in the state vector
   </td>
   <td>State value
   </td>
   <td>
   </td>
   <td>State value after the gate
   </td>
  </tr>
  <tr>
   <td>1
   </td>
   <td>1.05
   </td>
   <td rowspan="6" >






![](https://drive.google.com/uc?export=view&id=1fdrNDGLEp9NIpDLwkflTJKuwdGYCQO76)







   </td>
   <td>0.21
   </td>
  </tr>
  <tr>
   <td>...
   </td>
   <td>...
   </td>
   <td>...
   </td>
  </tr>
  <tr>
   <td>i
   </td>
   <td>3.76
   </td>
   <td>3.76
   </td>
  </tr>
  <tr>
   <td>i+1
   </td>
   <td>-2.37
   </td>
   <td>0
   </td>
  </tr>
  <tr>
   <td>...
   </td>
   <td>...
   </td>
   <td>...
   </td>
  </tr>
  <tr>
   <td>n
   </td>
   <td>0.44
   </td>
   <td>0.22
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>Position in sigmoid layer
   </td>
   <td>Value in sigmoid layer
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>1
   </td>
   <td>0.2
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>...
   </td>
   <td>...
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>i
   </td>
   <td>1
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>i+1
   </td>
   <td>0
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>...
   </td>
   <td>...
   </td>
   <td>
   </td>
  </tr>
  <tr>
   <td>
   </td>
   <td>n
   </td>
   <td>0.5
   </td>
   <td>
   </td>
  </tr>
</table>
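The table can be reproduced with a single element-by-element multiplication. The sketch below uses NumPy and only the four rows whose values are given in the table (positions $1$, $i$, $i+1$ and $n$); the variable names are assumptions.

```python
import numpy as np

state = np.array([1.05, 3.76, -2.37, 0.44])   # state values from the table
gate = np.array([0.2, 1.0, 0.0, 0.5])         # sigmoid outputs from the table

# The gate scales each element of the state independently.
filtered = gate * state
print(filtered)
```

Up to floating-point rounding, the result matches the last column of the table: the first element is reduced to a fifth of its value, the second passes through unchanged, the third is erased, and the last is halved.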


An LSTM has three of these gates to control and protect the cell state.



### Step-by-step explanation of LSTMs

The first step in an LSTM is to decide what information from the cell state will be retained or discarded. This decision is made by the sigmoid layer we just discussed, called the "forget gate layer". It looks at $h_{t-1}$, the output of the same neuron for the previous observation, and at the current observation $x_t$ (or, where applicable, the output of the previous layer, if the neuron in question is on an intermediate layer of the network that takes as input the outputs of other neurons of the same network). This gate outputs a vector of numbers between $0$ and $1$ of the same size as the cell state $C_{t-1}$, where a $1$ means "keep this information completely" and a $0$ means "remove this information completely".

Let's go back to our example where we try to predict the next word in the sentence based on the previous words. In such a problem, the state of the cell may for example contain the gender of the subject of the sentence, so that the correct pronouns are used. When you encounter a new subject, you want to forget the gender of the previous subject and replace it with the gender of the current subject.







![](https://drive.google.com/uc?export=view&id=1Vyoa4hWSOY2e7FQE91X21dhCx_yoOsR9)






Where $W_f$ is the weight matrix used by the sigmoid layer, $\sigma$ is the sigmoid function, $h_{t-1}$ is the output of the same neuron for the previous observation, $x_t$ is the current observation or the current output of a previous layer of the network, and $b_f$ is the bias of the sigmoid layer.
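In code, the forget gate could look like the sketch below. It implements $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ with the symbols defined above; the function names, the sizes, and the random untrained weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f)"""
    concat = np.concatenate([h_prev, x_t])
    return sigmoid(W_f @ concat + b_f)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3                 # illustrative sizes
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)          # previous output h_{t-1}
x_t = rng.normal(size=input_size)              # current observation
C_prev = rng.normal(size=hidden_size)          # previous cell state C_{t-1}

f_t = forget_gate(h_prev, x_t, W_f, b_f)       # one value between 0 and 1 per state element
kept = f_t * C_prev                            # the part of the old state that survives
```

Note that `W_f` has one row per element of the cell state and one column per element of the concatenated vector, which is why the gate output has the same size as $C_{t-1}$.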

The next step is to decide what new information will be added to the cell state. First, a sigmoid layer called the "input gate layer" decides what information to add (i.e. which elements of the candidate vector to keep and add to the state; it acts as a kind of filter). Then, a hyperbolic tangent layer (`tanh`) creates from $[h_{t-1}, x_t]$ a vector of new candidate values, denoted $\tilde{C}_t$, which could be added to the state.

In the example of our language model, we would like to add the gender of the new subject to the cell state, to replace the old one we are forgetting. In the next step we will combine the filter and the candidate vector to update the state.








![](https://drive.google.com/uc?export=view&id=1Ttoq0Ajixov3GJT8Fa-QclYQm3s3LlEH)














Where $W_i$ and $b_i$ are the weights and bias of the sigmoid layer (the input gate layer), and $W_c$ and $b_c$ are the weights and bias of the hyperbolic tangent layer that produces the candidate values.
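The two layers of this step can be sketched as follows. The code assumes the same form as the forget gate, namely a sigmoid layer for the filter and a `tanh` layer for the candidates, both applied to the concatenation $[h_{t-1}, x_t]$. The function name, the name `C_tilde` for the candidate vector, the sizes, and the random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_and_candidates(h_prev, x_t, W_i, b_i, W_c, b_c):
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)        # filter: how much of each candidate to add
    C_tilde = np.tanh(W_c @ concat + b_c)    # candidate values, each between -1 and 1
    return i_t, C_tilde

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3               # illustrative sizes
shape = (hidden_size, hidden_size + input_size)
W_i, W_c = rng.normal(size=shape), rng.normal(size=shape)
b_i, b_c = np.zeros(hidden_size), np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)

i_t, C_tilde = input_gate_and_candidates(h_prev, x_t, W_i, b_i, W_c, b_c)
to_add = i_t * C_tilde                       # what will actually be added to the state
```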

It is now time to update the old cell state, $C_{t-1}$, to produce the new cell state, $C_t$. The previous steps have already decided what to do, so now we just need to do it.

We multiply $C_{t-1}$ by $f_t$ element by element to forget the elements we no longer need (this is the effect of the forget gate), and then we add the element-by-element product $i_t \odot \tilde{C}_t$, that is, the candidate values filtered by the input gate. The result is $C_t$.

In the case of our language model, this is where we forget the information about the gender of the old subject and add the information about the gender of the new subject of the sentence.





![](https://drive.google.com/uc?export=view&id=1rlC-M1NgPTLssL55AqqG_rV6zhtob9fo)
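A small numerical example may help. The sketch below applies the update "old state times forget gate, plus input gate times candidates" to hand-picked vectors; all the values are made up for illustration and `C_tilde` is simply the name used here for the candidate vector.

```python
import numpy as np

C_prev = np.array([0.8, -1.2, 0.5])     # old cell state C_{t-1}
f_t = np.array([1.0, 0.0, 0.5])         # forget gate: keep, erase, halve
i_t = np.array([0.0, 1.0, 0.5])         # input gate: ignore, accept, half-accept
C_tilde = np.array([0.9, 0.3, -0.4])    # candidate values

# Element-by-element: forget part of the old state, then add the filtered candidates.
C_t = f_t * C_prev + i_t * C_tilde
```

The first element is carried over unchanged ($0.8$), the second is entirely replaced by its candidate ($0.3$), and the third is a blend of the two ($0.5 \times 0.5 + 0.5 \times (-0.4) = 0.05$).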





Finally, we have to decide what the neuron outputs (so far we have only dealt with the cell state, that is, the information that travels along the chain of modules processing the successive observations of our text). This output is based on the cell state, but in a filtered version. First, a new sigmoid layer decides which components of the cell state will be output. Then we pass the cell state through a hyperbolic tangent function to obtain values between $-1$ and $1$, and multiply the result element by element with the output of the sigmoid layer, so that only the relevant elements appear in the output of the neuron.

In the example of the language model, since we have just seen a new subject appear in the sentence, we will probably want the output to carry information useful for verbs or adjectives, in case such words come next. For example, the output might indicate whether the subject is singular or plural, so that the next verb can be conjugated or the next adjective made to agree.







![](https://drive.google.com/uc?export=view&id=1puGXdAvImcegSZIbiUFEjLc8TpmfLPVQ)








Where $W_o$ and $b_o$ are the weights and the bias of the output gate layer.
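This last step can be sketched in the same style as the previous ones. The code assumes the output gate is a sigmoid layer applied to $[h_{t-1}, x_t]$, and that the output is this gate multiplied element by element with the `tanh` of the new cell state, as described above. The function name, the sizes, and the random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_step(h_prev, x_t, C_t, W_o, b_o):
    concat = np.concatenate([h_prev, x_t])
    o_t = sigmoid(W_o @ concat + b_o)   # which components of the state to expose
    h_t = o_t * np.tanh(C_t)            # filtered, squashed version of the cell state
    return o_t, h_t

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3          # illustrative sizes
W_o = rng.normal(size=(hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
C_t = rng.normal(size=hidden_size)      # new cell state from the previous step

o_t, h_t = output_step(h_prev, x_t, C_t, W_o, b_o)
```

The cell state `C_t` itself is not modified here: it continues along the chain as it is, and only a filtered view of it is emitted as `h_t`.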



## Conclusion

The impressive results of recurrent neural networks were mentioned earlier. These are largely due to the development of LSTMs, which really do work better for many tasks.

Written as a system of equations, LSTMs seem very intimidating, but taken step by step, the underlying ideas turn out to be quite intuitive. Most importantly, you now understand why LSTMs have this long-term memory capacity! To put it simply, the essential element of each neuron in an LSTM network is the cell state, a vector that travels along the chain of modules processing the successive observations. This state is changed only when new, more relevant information replaces information stored in memory. Thus, to return to our language model example, if the sentence contains only one subject that is described at length, the information about that subject remains stored in the state until the end of the sentence or text. In other words, when nothing disturbs the state, the default behaviour of the network is to keep it as it is, and this is where the long-term memory comes from!
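The four steps can be assembled into a single function, and the "keep the state by default" behaviour can be checked on a toy example. The sketch below is a minimal NumPy illustration, not a trained model: the dictionary layout of the parameters, the sizes, and the random weights are assumptions, and the biases are hand-picked (forget gate almost fully open, input gate almost fully closed) to mimic the situation where nothing disturbs the state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step; p is a dict of weights and biases (hypothetical layout)."""
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(p["W_f"] @ concat + p["b_f"])      # forget gate
    i_t = sigmoid(p["W_i"] @ concat + p["b_i"])      # input gate
    C_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde               # new cell state
    o_t = sigmoid(p["W_o"] @ concat + p["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                         # new output
    return h_t, C_t

rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 4, 3, 100         # illustrative sizes
shape = (hidden_size, hidden_size + input_size)
p = {f"W_{g}": rng.normal(scale=0.1, size=shape) for g in "fico"}
p.update({f"b_{g}": np.zeros(hidden_size) for g in "fico"})

# Hand-picked biases (an assumption, not learned values):
# forget gate close to 1 everywhere, input gate close to 0 everywhere.
p["b_f"] += 20.0
p["b_i"] -= 20.0

C_start = np.array([1.0, -1.0, 0.5, 2.0])
h, C = np.zeros(hidden_size), C_start.copy()
for _ in range(seq_len):
    h, C = lstm_step(rng.normal(size=input_size), h, C, p)

drift = np.max(np.abs(C - C_start))
print(drift)  # remains tiny: the state survives 100 observations
```

In a trained network the gates are of course not fixed in this way: they open and close depending on $h_{t-1}$ and $x_t$, which is what lets the network decide when to hold on to the state and when to overwrite it.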

LSTMs represent a great step forward in the development of recurrent networks. Is there a next step to be anticipated? Researchers think so, and call the next new capability of these networks "**attention**". The idea would be to enable recurrent networks to process a selected portion of the information observed in the problem under consideration. For example, if one wanted to use a recurrent network to caption an image, the network could focus its attention on a different area of the image for each word or group of words that the network produces.

This course is largely inspired by the following blog post: [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/), from which the explanatory diagrams are taken.
