# Understanding the LSTM cell


What makes LSTM cells so special? How do they capture long-term
dependencies? How does an LSTM cell know what information to keep in memory and
what to discard?

This is all achieved by special structures called gates. As shown in the following figure, a
typical LSTM cell consists of three gates called the input gate, the output gate, and the forget
gate:

![image](images/1.png)

These three gates are responsible for deciding what information to add to, output from, and
forget from the memory. With these gates, an LSTM cell keeps information in memory
only as long as it is required.

In an RNN cell, we used the hidden state $h_t$ for two purposes: storing information
and making predictions. Unlike an RNN, the LSTM cell splits the hidden
state into two states called the cell state and the hidden state:

* The cell state, also called internal memory, is where all the information is stored.
* The hidden state is used for computing the output.

Both the cell state and the hidden state are passed from one time step to the next. Now we will
take a deep dive into the LSTM cell and see exactly how these gates are used and how the hidden
state is computed.


## Forget Gate 

The forget gate $f_t$ is responsible for deciding what information should be removed from
the cell state (memory). 


Consider the following sentences: Harry is a good singer. He lives in
New York. Zayn is also a good singer.

As soon as we start talking about Zayn, the network will understand that the subject has
changed from Harry to Zayn and that the information about Harry is no longer required.
Now, the forget gate will remove (forget) the information about Harry from the cell state.

The forget gate is controlled by a sigmoid function. At time step $t$, we pass the
input $x_t$ and the previous hidden state $h_{t-1}$ to the forget gate. For each piece of
information in the cell state, it returns a value between 0 and 1: a value close to 0 means the
information should be removed, and a value close to 1 means it should be kept.
The forget gate $f$ at time step $t$ is expressed as follows:

$$f_{t}=\sigma\left(U_{f} x_{t}+W_{f} h_{t-1}+b_{f}\right)$$

Where:

* $U_f$ is the input to hidden weights of the forget gate
* $W_f$ is the hidden to hidden weights of the forget gate
* $b_f$ is the bias of the forget gate

The following figure shows the forget gate. As you can see, the input $x_t$ is multiplied
by $U_f$ and the previous hidden state $h_{t-1}$ is multiplied by $W_f$; the two results are
added together and sent to the sigmoid function, which returns values between 0 and 1.

![image](images/2.png)
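
As a rough illustration of the forget gate equation, the computation can be sketched in NumPy. The layer sizes, the random initialization, and the variable names (for example, `h_prev` for $h_{t-1}$) are assumptions made for this example only:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes and random values (assumptions, not from the text).
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

U_f = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_f = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b_f = np.zeros(hidden_size)                        # bias

x_t = rng.normal(size=input_size)      # input at time step t
h_prev = rng.normal(size=hidden_size)  # previous hidden state h_{t-1}

# f_t = sigmoid(U_f x_t + W_f h_{t-1} + b_f)
f_t = sigmoid(U_f @ x_t + W_f @ h_prev + b_f)
print(f_t)  # one value between 0 and 1 per cell-state entry
```

The result has one entry per cell-state dimension, so each part of the memory gets its own "keep" factor.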

## Input Gate


The input gate is responsible for deciding what information should be stored in the cell
state.

Let's consider the same example: Harry is a good singer. He lives in New York. Zayn is
also a good singer.


After the forget gate removes information from the cell state, the input gate decides what
information has to be kept in the memory. Here, since the information about Harry has been
removed from the cell state by the forget gate, the input gate decides to update the cell state
with the information about Zayn.

Similar to the forget gate, the input gate is controlled by a sigmoid function, which returns
values between 0 and 1. If it returns a value close to 1, the particular information will be
stored in the cell state, and if it returns a value close to 0, the information will not be
stored in the cell state. The input gate $i$ at time step $t$ is expressed as follows:

$$ i_{t}=\sigma\left(U_{i} x_{t}+W_{i} h_{t-1}+b_{i}\right)$$


Where:
* $U_i$ is the input to hidden weights of the input gate
* $W_i$ is the hidden to hidden weights of the input gate
* $b_i$ is the bias of the input gate


The following figure shows the input gate:

![image](images/3.png)
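
In practice the sigmoid acts as a soft switch that produces values between 0 and 1. The small NumPy sketch below uses hand-picked, hypothetical weights, chosen only so that the pre-activations are -10, 0, and 10, to show how the input gate behaves at the extremes and in the middle:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked, hypothetical parameters: one input feature, three hidden units.
U_i = np.array([[-10.0], [0.0], [10.0]])  # input-to-hidden weights
W_i = np.zeros((3, 3))                    # hidden-to-hidden weights (zeroed for clarity)
b_i = np.zeros(3)                         # bias

x_t = np.array([1.0])  # input at time step t
h_prev = np.zeros(3)   # previous hidden state h_{t-1}

# i_t = sigmoid(U_i x_t + W_i h_{t-1} + b_i)
i_t = sigmoid(U_i @ x_t + W_i @ h_prev + b_i)
print(i_t)  # close to 0, exactly 0.5, and close to 1
```

A strongly negative pre-activation effectively blocks the information, a strongly positive one lets it through, and values in between let part of it through.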









## Output Gate

We will have a lot of information in the cell state (memory). The output gate is responsible
for deciding what information should be taken from the cell state to give as an
output. 

Consider the following sentences. Zayn's debut album was a huge success. Congrats
____.


The output gate will look at all the information in the cell state and select the correct
information to fill the blank. Here, congrats is addressed to someone, so the blank is expected
to be a noun. So the output gate will predict Zayn (noun) to fill the blank. Similar to the other
gates, it is also controlled by a sigmoid function. The output gate $o$ at time step $t$ is
expressed as follows:

$$o_{t}=\sigma\left(U_{o} x_{t}+W_{o} h_{t-1}+b_{o}\right)$$

Where:
* $U_o$ is the input to hidden weights of the output gate
* $W_o$ is the hidden to hidden weights of the output gate
* $b_o$ is the bias of the output gate

The output gate is shown in the following figure:

![image](images/4.png)
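
The forget, input, and output gates share the same functional form and differ only in their parameters. The following sketch makes that explicit with a single helper. The function name `gate`, the dictionary layout, and the random values are assumptions made for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(U, W, b, x_t, h_prev):
    """Generic gate: sigmoid(U x_t + W h_{t-1} + b)."""
    return sigmoid(U @ x_t + W @ h_prev + b)

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 4  # illustrative sizes

# One (U, W, b) triple per gate: forget (f), input (i), and output (o).
params = {
    name: (
        rng.normal(size=(hidden_size, input_size)),
        rng.normal(size=(hidden_size, hidden_size)),
        np.zeros(hidden_size),
    )
    for name in ("f", "i", "o")
}

x_t = rng.normal(size=input_size)      # input at time step t
h_prev = rng.normal(size=hidden_size)  # previous hidden state h_{t-1}

# All three gates see the same inputs but use their own parameters.
gates = {name: gate(U, W, b, x_t, h_prev) for name, (U, W, b) in params.items()}
for name, values in gates.items():
    print(name, np.round(values, 3))
```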

## Updating the cell state


We just learned how all three gates in the LSTM cell work. But the question is, how can we
actually update the cell state, adding relevant new information and deleting
information that is no longer required, with the help of the gates?

__First, we will see how to add new relevant information to the cell state:__


To hold all the new
information that can be added to the cell state, we create a new vector called $g_t$. It is called
the candidate state or internal state vector. Unlike the gates, which are regulated by the sigmoid
function, the candidate state is regulated by the tanh function. But why? The sigmoid function
returns values between 0 and 1, that is, it is always positive. We need to allow the values of
$g_t$ to be either positive or negative, so we use the tanh function, which returns values
between -1 and +1.

The candidate state $g$ at time step $t$ is expressed as follows:


$$g_{t}=\tanh \left(U_{g} x_{t}+W_{g} h_{t-1}+b_{g}\right)$$


Where:
* $U_g$ is the input to hidden weights of the candidate state
* $W_g$ is the hidden to hidden weights of the candidate state
* $b_g$ is the bias of the candidate state

Thus, the candidate state holds all the new information that can be added to the memory
and it is shown in the following figure:

![image](images/5.png)
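
The difference between the two activation functions, and the candidate state itself, can be sketched as follows. The sizes, random parameters, and the three sample pre-activations are assumptions chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Range comparison on a few hand-picked pre-activations.
z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # all entries are positive
print(np.tanh(z))  # negative, zero, and positive entries

# Candidate state with illustrative random parameters.
rng = np.random.default_rng(2)
input_size, hidden_size = 3, 4

U_g = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_g = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b_g = np.zeros(hidden_size)                        # bias

x_t = rng.normal(size=input_size)      # input at time step t
h_prev = rng.normal(size=hidden_size)  # previous hidden state h_{t-1}

# g_t = tanh(U_g x_t + W_g h_{t-1} + b_g)
g_t = np.tanh(U_g @ x_t + W_g @ h_prev + b_g)
print(g_t)  # entries lie between -1 and +1
```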

But how do we decide whether the information in the candidate state is relevant? How do
we decide whether or not to add the new information in the candidate state to the cell
state? We learned that the input gate is responsible for deciding whether or not to add new
information to the cell state. So if we multiply $g_t$ and $i_t$ element-wise, we get only the
relevant information that should be added to the memory.


That is, as we know, the input gate returns a value close to 0 if the information is not
required and a value close to 1 if it is required. Say $i_t=0$; then multiplying $g_t$ and $i_t$
gives 0, which means the information in $g_t$ is not required and we don't want to update the
cell state with $g_t$. When $i_t=1$, multiplying $g_t$ and $i_t$ gives $g_t$, which implies
that we can add the information in $g_t$ to the cell state. For values in between, only a
fraction of $g_t$ is added.
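
A tiny worked example may help here. The numbers below are hand-picked, hypothetical values for three entries of the vectors, not the output of a trained network:

```python
import numpy as np

# Hand-picked, hypothetical values for three cell-state entries.
i_t = np.array([0.0, 1.0, 0.5])   # input gate: block, pass fully, pass half
g_t = np.array([0.8, -0.6, 0.4])  # candidate state

new_info = i_t * g_t  # element-wise product
print(new_info)
```

The first entry is blocked entirely, the second passes through unchanged, and the third is scaled down by half.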


Adding the new information to the cell state with the input gate $i_t$ and the candidate
state $g_t$ is shown in the following figure:

![image](images/6.png)

__Now, we will see how to remove information from the previous cell state which is not
required anymore:__


We learned that the forget gate is used for removing information that is no longer required
from the cell state. So if we multiply the previous cell state $c_{t-1}$ and the forget gate
$f_t$, we retain only the relevant information in the cell state.

Say $f_t = 0$; then multiplying $c_{t-1}$ and $f_t$ gives 0, which means the information in the
cell state $c_{t-1}$ is not required and should be removed (forgotten). When $f_t=1$,
multiplying $c_{t-1}$ and $f_t$ gives $c_{t-1}$, which implies that the information in the
previous cell state is required and should not be removed.

Removing information from the previous cell state $c_{t-1}$ with the forget gate $f_t$ is shown in
the following figure:

![image](images/7.png)

Thus, in a nutshell, we update our cell state by multiplying $g_t$ and $i_t$ to add new
information and multiplying $c_{t-1}$ and $f_t$ to remove information, where both
multiplications are element-wise. We can express the cell state equation as follows:

$$c_{t}=f_{t} c_{t-1}+i_{t} g_{t} $$
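
To see the cell state equation at work, consider the following sketch with hand-picked, hypothetical values for a three-dimensional cell state. The variable name `c_prev` stands for $c_{t-1}$:

```python
import numpy as np

# Hand-picked, hypothetical values (not from a trained network).
c_prev = np.array([2.0, -1.0, 0.5])  # previous cell state c_{t-1}
f_t = np.array([0.0, 1.0, 0.5])      # forget gate
i_t = np.array([1.0, 0.0, 0.5])      # input gate
g_t = np.array([0.8, -0.6, 0.4])     # candidate state

# c_t = f_t * c_{t-1} + i_t * g_t, with element-wise products
c_t = f_t * c_prev + i_t * g_t
print(c_t)
```

In the first entry the old memory is dropped and replaced by the candidate value, in the second the old memory is kept and the candidate is ignored, and in the third the two are blended.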


## Updating hidden state 

We just learned how the information in the cell state is updated. Now we will see
how the information in the hidden state $h_t$ is updated. We learned that the hidden state
is used for computing the output. But how can we compute the output?
 
We know that the output gate is responsible for deciding what information should be taken
from the cell state to give as an output. Thus, multiplying $o_t$ by the tanh of the cell state,
$\tanh(c_t)$ (which squashes the values between -1 and +1), returns the output.

Thus, the hidden state $h_t$ is expressed as follows:

$$h_{t}=o_{t} \tanh \left(c_{t}\right)$$ 


The following figure shows how the hidden state $h_t$ is computed by multiplying $o_t$ and
$\tanh(c_t)$:

![image](images/8.png)
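
The hidden state equation can be illustrated with a short sketch. The cell state and output gate values below are hand-picked, hypothetical numbers:

```python
import numpy as np

# Hand-picked, hypothetical values (not from a trained network).
c_t = np.array([0.8, -1.0, 0.45])  # updated cell state
o_t = np.array([1.0, 0.5, 0.0])    # output gate

# h_t = o_t * tanh(c_t), with an element-wise product
h_t = o_t * np.tanh(c_t)
print(np.round(h_t, 4))
```

The first entry exposes the squashed cell state fully, the second exposes half of it, and the third is hidden from the output even though the cell state still stores it.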

And finally, once we have the hidden state value, we can apply the softmax function and
compute $\hat{y}_t$ as shown:

$$\hat{y}_{t}=\operatorname{softmax}\left(V h_{t}\right)$$

Here, $V$ is the hidden-to-output layer weight matrix.
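
Putting the equations of this section together, a single time step could be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `lstm_step`, the parameter dictionary `p`, the sizes, and the random initialization are all hypothetical choices made for this example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the equations in this section."""
    f_t = sigmoid(p["U_f"] @ x_t + p["W_f"] @ h_prev + p["b_f"])  # forget gate
    i_t = sigmoid(p["U_i"] @ x_t + p["W_i"] @ h_prev + p["b_i"])  # input gate
    o_t = sigmoid(p["U_o"] @ x_t + p["W_o"] @ h_prev + p["b_o"])  # output gate
    g_t = np.tanh(p["U_g"] @ x_t + p["W_g"] @ h_prev + p["b_g"])  # candidate state
    c_t = f_t * c_prev + i_t * g_t   # new cell state
    h_t = o_t * np.tanh(c_t)         # new hidden state
    y_hat = softmax(p["V"] @ h_t)    # output distribution
    return h_t, c_t, y_hat

# Illustrative sizes and random parameters (assumptions).
rng = np.random.default_rng(3)
input_size, hidden_size, output_size = 3, 4, 5

p = {}
for name in ("f", "i", "o", "g"):
    p[f"U_{name}"] = rng.normal(size=(hidden_size, input_size))
    p[f"W_{name}"] = rng.normal(size=(hidden_size, hidden_size))
    p[f"b_{name}"] = np.zeros(hidden_size)
p["V"] = rng.normal(size=(output_size, hidden_size))

x_t = rng.normal(size=input_size)
h_prev = np.zeros(hidden_size)  # initial hidden state
c_prev = np.zeros(hidden_size)  # initial cell state

h_t, c_t, y_hat = lstm_step(x_t, h_prev, c_prev, p)
print(y_hat, y_hat.sum())  # probabilities that sum to 1 (up to rounding)
```

Returning both `h_t` and `c_t` reflects the fact that the two states are carried forward to the next time step.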


In the next section, we will see exactly how forward propagation is performed in the LSTM cell.