In [1]:
# RNN Intuition

my_rnn = RNN()  # Pseudocode: stands in for any RNN cell mapping (input, state) -> (prediction, new state)
hidden_state = [0,0,0,0]

sentence = ["I", "love", "recurrent", "neural"]

for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

next_word_prediction = prediction
# >>> "networks!"

In [None]:
import tensorflow as tf
print(tf.__version__)

In [3]:
class MyRNNCell(tf.keras.layers.Layer):
  def __init__(self, rnn_units, input_dim, output_dim):
    super(MyRNNCell, self).__init__()

    # Initialize weight matrices
    self.W_xh = self.add_weight(shape=[rnn_units, input_dim])   # input-to-hidden
    self.W_hh = self.add_weight(shape=[rnn_units, rnn_units])   # hidden-to-hidden (recurrent)
    self.W_hy = self.add_weight(shape=[output_dim, rnn_units])  # hidden-to-output

    # Initialize hidden state to zeros
    self.h = tf.zeros([rnn_units, 1])

  def call(self, x):
    # Update the hidden state; x is a column vector of shape [input_dim, 1]
    self.h = tf.math.tanh(tf.matmul(self.W_hh, self.h) + tf.matmul(self.W_xh, x))

    # Compute the output
    output = tf.matmul(self.W_hy, self.h)

    # Return the current output and hidden state
    return output, self.h

In [None]:
# Or just use the TensorFlow implementation of the above

rnn_units = 64  # example value (assumed); number of hidden units
tf.keras.layers.SimpleRNN(rnn_units)

# RNNs for Sequence Modeling

Design Criteria

To model sequences, we need to:
1. Handle variable-length sequences
2. Track long-term dependencies
3. Maintain information about order
4. Share parameters across the sequence (the same weights are reused at every time step and must still produce meaningful predictions)
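
The criteria above can be illustrated with a minimal scalar sketch. The function `run_rnn` and its weight values are hypothetical, chosen only for illustration; a real RNN would use weight matrices and learned values.

```python
import math

def run_rnn(inputs, w_xh=0.5, w_hh=0.9, h0=0.0):
    """Apply h_t = tanh(w_hh * h_{t-1} + w_xh * x_t) over a sequence of numbers."""
    h = h0
    for x in inputs:
        # The same two weights are reused at every time step (parameter sharing)
        h = math.tanh(w_hh * h + w_xh * x)
    return h

# Variable-length sequences: the same function handles 2 or 5 inputs
short_state = run_rnn([1.0, 2.0])
long_state = run_rnn([1.0, 2.0, 3.0, 4.0, 5.0])

# Order matters: the same inputs in reverse give a different final state
forward_state = run_rnn([1.0, 2.0, 3.0])
reversed_state = run_rnn([3.0, 2.0, 1.0])

print(short_state, long_state)
print(forward_state, reversed_state)
```

This sketch does not show long-term dependency tracking; with `tanh` and a single weight, early inputs fade quickly from the state.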



# Modeling problem
"Predict the next word"

e.g. "This morning I took my cat for a **walk**."

The goal is to predict the word "walk".

- How do we represent language to a neural network?
- Encoding language for a neural network: words must be turned into numbers first

Concept: Embedding

Transform indices into vectors of a fixed size.

1. Vocabulary: the corpus of words
2. Indexing: word to index
      e.g. a --> 1
           cat --> 2

3. Embedding: index to fixed-size vector
   e.g. one-hot embedding
   "cat" = [0, 1, 0, 0, 0, 0, 0]
   e.g. learned embedding (a neural network learns the embedding)
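
The three steps above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: the vocabulary is built from a single sentence, the helper `one_hot` is a hypothetical name, and indices are zero-based here, so they differ from the 1-based numbering in the notes.

```python
sentence = "this morning i took my cat for a walk"

# 1. Vocabulary: the distinct words in the corpus
vocabulary = sorted(set(sentence.split()))

# 2. Indexing: map each word to an integer index (zero-based here)
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# 3. Embedding: map each index to a fixed-size vector (one-hot in this sketch)
def one_hot(index, size):
    vector = [0] * size
    vector[index] = 1
    return vector

cat_vector = one_hot(word_to_index["cat"], len(vocabulary))
print(word_to_index["cat"], cat_vector)
```

A learned embedding would replace `one_hot` with a lookup into a trainable matrix, for example `tf.keras.layers.Embedding` in TensorFlow.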

---
Backpropagation Through Time (BPTT)


Recall: Backpropagation in feed-forward models
Backpropagation algorithm
1. Take the derivative (gradient) of the loss w.r.t. each parameter
2. Shift the parameters in the direction that reduces the loss


With RNNs: backpropagation through the temporally unrolled network
- The backward pass also runs through the model state, from the last time step back to the first

Standard RNN Gradient Flow
- Computing the gradient w.r.t. h0 involves many factors of W_hh, plus repeated gradient computation, one per time step!
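
To make the repeated factors concrete, here is a minimal scalar sketch. It assumes a one-dimensional RNN, h_t = tanh(w_hh * h_{t-1} + x_t), so that W_hh is a single number; the function name `grad_wrt_h0` is hypothetical.

```python
import math

def grad_wrt_h0(w_hh, inputs, h0=0.0):
    """Return d h_T / d h_0 for h_t = tanh(w_hh * h_{t-1} + x_t), via the chain rule."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w_hh * h + x)
        # Each time step contributes one factor of w_hh times tanh'(.) = 1 - h^2
        grad *= w_hh * (1.0 - h * h)
    return grad

inputs = [0.0] * 50                      # 50 time steps
vanishing = grad_wrt_h0(0.5, inputs)     # |w_hh| < 1: the product shrinks toward 0
exploding = grad_wrt_h0(1.5, inputs)     # |w_hh| > 1: the product blows up
print(vanishing, exploding)
```

With zero inputs and a zero initial state, the hidden state stays at 0 and tanh' equals 1, so the gradient reduces to w_hh raised to the number of time steps.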

Issues:

Weight Matrices
- Many values > 1: exploding gradients
  - Mitigation: gradient clipping to scale down large gradients
- Many values < 1: vanishing gradients
  [Mitigation methods]
  - Activation function
  - Weight initialization
  - Network architecture
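
Gradient clipping can be sketched as follows. This is one common variant (clipping by L2 norm), written in plain Python for illustration; the function name `clip_by_norm` and the threshold are assumptions, and TensorFlow offers its own `tf.clip_by_norm` for tensors.

```python
import math

def clip_by_norm(gradient, max_norm):
    """Rescale gradient so its L2 norm is at most max_norm; the direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]

exploding = [30.0, -40.0]                       # L2 norm is 50
clipped = clip_by_norm(exploding, max_norm=5.0)
print(clipped)
```

Gradients already within the threshold pass through unchanged, so clipping only intervenes when a gradient is unusually large.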



---
The Problem of long-term dependencies

Why are vanishing gradients a problem?
Multiplying many small numbers together --> errors from time steps further back have smaller and smaller gradients --> the parameters become biased toward capturing short-term dependencies

The RNN becomes unable to predict well when the required information is far back in the sequence.

- Trick #1: Activation Functions
  Using ReLU prevents f' from shrinking the gradients when x > 0
- Trick #2: Parameter Initialization
  Initialize weights to the identity matrix, ini
- Trick #3: Gated Cells (the most robust way to handle long-term dependencies)
  Idea: use gates to selectively add or remove information within each recurrent unit
  - LSTM networks rely on a gated cell to track information across many time steps

  Gated LSTM Key Concepts
  - Maintain a cell state (analogous to the hidden state of a standard RNN)
  - Use gates to control the flow of information
    - Forget gate gets rid of irrelevant information
    - Store relevant information from the current input
    - Selectively update the cell state
    - Output gate returns a filtered version of the cell state
  - Backpropagation through time with partially uninterrupted gradient flow
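
The gate operations listed above can be sketched for a single scalar LSTM step. This is a simplified illustration under stated assumptions: every quantity is one number rather than a vector, and the function `lstm_step`, the parameter layout, and the weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate names to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # forget gate: how much old cell state to keep
    i = sigmoid(pre("input"))         # input gate: how much new information to store
    g = math.tanh(pre("candidate"))   # candidate values from the current input
    o = sigmoid(pre("output"))        # output gate: how much cell state to expose

    c = f * c_prev + i * g            # selectively update the cell state
    h = o * math.tanh(c)              # filtered version of the cell state
    return h, c

params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The cell state update is additive (`f * c_prev + i * g`), so when the forget gate stays close to 1 the gradient can flow back through `c` largely uninterrupted.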

---
RNN Applications & Limitations

e.g. Music Generation
Input: Sheet music
Output: Next character in sheet music

e.g. Sentiment Classification
A sequence of input text maps to a single output
- Tweet sentiment classification



[Limitations of RNNs]
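1. Encoding bottleneck
  - how to encode a sequence all the way to its end without losing much information
2. Slow, no parallelization (time steps are processed one after another)
3. Limited memory
  - the capacity of RNNs and LSTMs is limited; they struggle with very long sequences (e.g. millions of words)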


[Desired Capabilities]
- Continuous stream
- Parallelization
- Long memory

---
Hence leading to

**Attention is All You Need**
  - Foundation of the Transformer Architecture
  - Intuition behind self-attention
    - Attending to the most important parts of an input
      1. Identify which parts to attend to (similar to a search problem!)
        - Understanding attention with search
        - Find overlaps between the **Query** and each **Key**: how similar are they?
          - Compute attention mask
      2. Extract the features with high attention
        

Learning self-attention with Neural networks

e.g. "He tossed the tennis ball to serve"
1. Encode position information
  - Position-aware encoding
  - Data is fed in all at once, so position information must be encoded for the model to understand order
2. Extract query, key, value for search
  - Positional Embedding * Linear Layer for Q, K, V
3. Compute attention weighting
  - Attention score: compute the pairwise similarity between each query and key
  - How to compute similarity between 2 sets of features?
    - Usually the dot product (cosine similarity is the same dot product after normalizing both vectors to unit length)
4. Extract features with high attention
  - Attention Matrix * Value = Output


Goal: Identify and attend to most important features in input

softmax( Q * K^T / scaling ) * V, where scaling is typically sqrt(d_k) and d_k is the dimension of the keys
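
The expression above can be sketched in plain Python on small lists of row vectors. This is a minimal illustration, not an optimized implementation: the scaling is assumed to be sqrt(d_k) (the key dimension), and the toy `Q`, `K`, `V` values are made up. In self-attention all three would come from linear layers applied to the same position-aware input.

```python
import math

def softmax(scores):
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, on lists of rows."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        row = softmax(scores)                    # attention weights sum to 1
        weights.append(row)
        # Weighted sum of the value vectors, one output column at a time
        outputs.append([sum(w * v[j] for w, v in zip(row, values))
                        for j in range(len(values[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, attn = attention(Q, K, V)
print(attn)
print(out)
```

Each row of `attn` shows how strongly one query attends to each key, and the matching row of `out` is the corresponding blend of values.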

These operations form a self-attention head that can plug into a larger network.
Each head attends to a different part of the input.

--> Multiple heads together form a rich network


**Self-attention applied**
1. Language Processing
  - BERT, GPT-3 (and more)
2. Biological sequences
  - AlphaFold2
3. Computer vision
  - Vision Transformers


---
**Summary**
1. RNNs are well suited for sequence modeling tasks
2. They model sequences via a recurrence relation
3. RNNs are trained with backpropagation through time
4. They can be applied to tasks such as music generation and sentiment classification
5. Self-attention models sequences without recurrence