![](pics/header.png)

# Deep Learning: Long Short Term Memory (LSTM)

Kevin Walchko

---

## Overview

<table> 
    <tr>
        <td><img src="pics/lstm/rnn-neuron.png" width="70%"></td>
        <td><img src="pics/lstm/lstm.png"></td>
    </tr>
</table>

A normal RNN neuron is shown on the left and an LSTM cell is shown on the right.

- In general, LSTMs and GRUs perform better than vanilla RNNs
- They mitigate the vanishing gradient problem
- They learn over many time steps with backpropagation
    - Fully differentiable
    - Contains: sigmoid, hyperbolic tangent, multiplications and addition
    
Limitations of RNNs

- Encoding bottleneck
- Slow, no parallelization
- No long term memory
    
## References

- CS231n: [Andrej Karpathy's lecture on RNN and LSTMs](https://www.youtube.com/watch?v=iX5V1WpxxkY)
- Chris Olah's Understanding LSTM Networks [post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- Edwin Chen's LSTM [post](http://blog.echen.me/2017/05/30/exploring-lstms/)
- Visualizing Machine Learning [blog](https://jalammar.github.io/)
- PyTorch docs: [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM)

## LSTM

![](pics/lstm/lstm-eq.jpg)

where $x_t$ is the event (the input at time step $t$), $C_t$ is the cell state (long term memory) and $h_t$ is the hidden state (short term memory).
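To make the roles of $x_t$, $C_t$ and $h_t$ concrete, here is a minimal sketch of a single LSTM time step written out by hand and compared against `nn.LSTMCell`. It assumes the standard formulation that PyTorch documents (gates stacked in the order input, forget, cell candidate, output); the notation in the figure above may differ slightly. The sizes and the `lstm_step` helper are illustrative choices, not part of the original notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

input_size, hidden_size, batch_size = 10, 20, 5  # illustrative sizes
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(batch_size, input_size)       # event at time t
h_prev = torch.randn(batch_size, hidden_size)   # short term memory h_{t-1}
c_prev = torch.randn(batch_size, hidden_size)   # long term memory C_{t-1}


def lstm_step(x_t, h_prev, c_prev, cell):
    """One LSTM step by hand, reusing the weights stored in `cell`."""
    # All four gate pre-activations are computed in a single affine map.
    gates = (x_t @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i, f, g, o = gates.chunk(4, dim=1)  # PyTorch order: input, forget, cell, output
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_t = f * c_prev + i * g        # forget part of the old memory, add new content
    h_t = o * torch.tanh(c_t)       # expose a gated view of the memory
    return h_t, c_t


with torch.no_grad():
    h_manual, c_manual = lstm_step(x_t, h_prev, c_prev, cell)
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

matches = (torch.allclose(h_manual, h_ref, atol=1e-5)
           and torch.allclose(c_manual, c_ref, atol=1e-5))
print(matches)
```

Only sigmoid, tanh, multiplication and addition appear in the step, which is why the whole cell is differentiable end to end.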

## Gated Recurrent Units (GRUs)

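A gated recurrent unit (GRU) is basically an LSTM without an output gate, so it fully writes the contents of its memory to the larger net at each time step. It keeps a single hidden state $h_t$ instead of a separate $C_t$ and $h_t$, and uses two gates (reset and update) in place of the LSTM's three.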
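One way to see the difference is to compare `nn.GRU` and `nn.LSTM` at the same size. The sketch below assumes single-layer, unidirectional modules with the same illustrative sizes used elsewhere in this notebook; `count_params` is a small helper written for this example.

```python
import torch
from torch import nn


def count_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


lstm_1 = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, batch_first=True)
gru_1 = nn.GRU(input_size=10, hidden_size=20, num_layers=1, batch_first=True)

lstm_params = count_params(lstm_1)  # four weight blocks per layer
gru_params = count_params(gru_1)    # three weight blocks per layer
print(lstm_params, gru_params)

x = torch.randn(5, 3, 10)           # (batch_size, sequence_len, input_size)
with torch.no_grad():
    gru_out, gru_hn = gru_1(x)      # a GRU returns only a hidden state, no cell state
print(gru_out.shape, gru_hn.shape)
```

With these settings the GRU has three quarters of the LSTM's parameters, and its forward pass returns `(output, h_n)` rather than `(output, (h_n, c_n))`.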

## Embeddings

Embedding allows NNs to learn from text by converting words to numbers. Compared with one hot encoding, it also reduces the dimensionality of the input.

- Code words to numbers: "heart" = 981
- Input the word code to the embedding layer, which behaves like a fully connected layer applied to the one hot encoding of the word
    - When dealing with millions of words, one hot encoding is inefficient since all but one of the values are 0 and only one is 1
    - The embedding layer is basically a lookup table that skips the multiplications: all you do is (in the heart example above) grab row 981 of the weight table
        - This works because every other row is multiplied by 0 and that row is the only one multiplied by 1 ... no real multiplication ... faster!
- The embedding layer then feeds its output to the rest of the network
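The lookup-table view can be checked directly. This sketch assumes a hypothetical vocabulary of 1,000 words and an 8-dimensional embedding (both numbers are made up for illustration) and reuses the "heart" = 981 code from the list above. It compares `nn.Embedding` with an explicit one hot vector multiplied by the same weight table.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embed_dim = 1000, 8            # hypothetical sizes
embedding = nn.Embedding(vocab_size, embed_dim)

word_id = torch.tensor([981])              # "heart" = 981

with torch.no_grad():
    # Fast path: index straight into the weight table.
    looked_up = embedding(word_id)                                # (1, embed_dim)

    # Slow path: build the one hot vector and do the full matrix multiply.
    one_hot = F.one_hot(word_id, num_classes=vocab_size).float()  # (1, vocab_size)
    multiplied = one_hot @ embedding.weight                       # (1, embed_dim)

same = torch.allclose(looked_up, multiplied)
print(same)
```

Both paths give the same vector, which is row 981 (zero-based) of `embedding.weight`; the lookup simply avoids multiplying by all the zeros.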

### Reference

- Efficient Estimation of Word Representations in Vector Space [pdf](https://video.udacity-data.com/topher/2018/October/5bc56d28_word2vec-mikolov/word2vec-mikolov.pdf)
- Distributed Representations of Words and Phrases and their Compositionality [pdf](https://video.udacity-data.com/topher/2018/October/5bc56da8_distributed-representations-mikolov2/distributed-representations-mikolov2.pdf)

In [7]:
import torch
from torch import nn
from helper import summary

In [11]:
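lstm = nn.LSTM(
    input_size=10,     # features per time step
    hidden_size=20,    # size of h_t and C_t
    num_layers=2,      # stacked LSTM layers
    dropout=0.2,       # applied between layers, not after the last one
    batch_first=True,  # tensors are (batch, seq, feature)
)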

summary(lstm)

Layer (type (var_name))                  Kernel Shape              Param #
LSTM (LSTM)                              --                        5,920
Total params: 5,920
Trainable params: 5,920
Non-trainable params: 0
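
The 5,920 figure can be checked by hand. Each LSTM layer holds four gate blocks, each with an input-to-hidden weight matrix, a hidden-to-hidden weight matrix and two bias vectors (PyTorch keeps `bias_ih` and `bias_hh` separately). The function below is a small sketch written for this check, assuming a unidirectional LSTM with biases enabled and no projection; `lstm_param_count` is not a PyTorch function.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional LSTM with biases and no projection."""
    total = 0
    for layer in range(num_layers):
        # The first layer sees the input; deeper layers see the previous layer's h_t.
        layer_input = input_size if layer == 0 else hidden_size
        weights = 4 * hidden_size * (layer_input + hidden_size)  # W_ih and W_hh
        biases = 2 * 4 * hidden_size                             # b_ih and b_hh
        total += weights + biases
    return total


total = lstm_param_count(input_size=10, hidden_size=20, num_layers=2)
print(total)  # 5920
```

The first layer contributes 2,560 parameters and the second 3,360, matching the summary.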


In [17]:
# h0 - initial hidden state
# c0 - initial cell state
x = torch.randn(5, 3, 10)     # (batch_size, sequence_len, input_size)
h0 = torch.randn(2, 5, 20)    # (num_layers, batch_size, hidden_size)
c0 = torch.randn(2, 5, 20)    # (num_layers, batch_size, hidden_size)
# output holds the top layer's h_t for every time step: (batch_size, sequence_len, hidden_size)
output, (hn, cn) = lstm(x, (h0, c0))
output.shape

torch.Size([5, 3, 20])
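
The returned tensors are related: `output` contains the top layer's hidden state at every time step, while `hn` contains the final hidden state of every layer. The sketch below rebuilds a similar LSTM (without dropout, and with the initial states left at their default of zeros, both simplifications made for this example) to show that the last time step of `output` equals the last layer of `hn`.

```python
import torch
from torch import nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
x = torch.randn(5, 3, 10)          # (batch_size, sequence_len, input_size)

with torch.no_grad():
    output, (hn, cn) = lstm(x)     # h0 and c0 default to zeros when omitted

last_step = output[:, -1, :]       # top layer, final time step: (5, 20)
top_layer_hidden = hn[-1]          # final hidden state of the last layer: (5, 20)

same = torch.allclose(last_step, top_layer_hidden)
print(output.shape, hn.shape, cn.shape, same)
```

For sequence classification it is common to feed either of these two equivalent tensors to the next layer.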