<a href="https://colab.research.google.com/github/rida-manzoor/DL/blob/main/20_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Agenda

- Why RNNs Are Needed
- RNN vs ANN
- Data in RNN
- How RNNs Work
- Forward Propagation in RNN
- Vectors
- Embeddings
- Types of RNN
- Backpropagation in RNN
- Problems with RNN

An RNN (recurrent neural network) is a type of neural network designed to work on sequential data.

## Why RNNs Are Needed

**1. Modeling sequential information:**

* **Remember past inputs:** Unlike standard neural networks, RNNs have internal "memory" that allows them to retain information from previous inputs. This is crucial for tasks like:
    * **Natural Language Processing (NLP):** Analyzing text requires understanding the context of words based on what came before. RNNs are used in machine translation, sentiment analysis, and text generation.
    * **Speech Recognition:** Understanding spoken language depends on the sequence of sounds and their context within the sentence.
    * **Time Series Forecasting:** Predicting future values in a sequence (stock prices, weather patterns) relies on understanding past trends.

**2. Handling variable-length sequences:**

* **Flexible input lengths:** Unlike traditional methods that require fixed-length inputs, RNNs can process sequences of any length. This is beneficial for:
    * **Video captioning:** Generating descriptions for videos of varying lengths.
    * **Music generation:** Creating musical pieces with different durations.
    * **Chatbots:** Responding to user queries of varying lengths in a conversation.

**3. Dealing with complex relationships:**

* **Capturing long-term dependencies:** In principle, RNNs can learn relationships between elements in a sequence even if they are far apart, although plain RNNs struggle with this in practice (see the training challenges below). This is useful for:
    * **Machine translation:** Capturing long-range grammatical dependencies between words in different languages.
    * **Sentiment analysis:** Understanding the overall sentiment of a text, even if positive and negative words are separated by other content.

**However, it's important to note:**

* **Training challenges:** Standard RNNs can suffer from vanishing gradients, making them difficult to train on long sequences. Variants like LSTMs and GRUs address this issue.
* **Computational cost:** RNNs can be computationally expensive to train and run due to their sequential nature.

**In conclusion, RNNs are powerful tools for tasks involving sequential data due to their ability to remember past information, handle variable lengths, and learn complex relationships. However, their training challenges and computational cost need to be considered.**

## ANN vs. RNN for NLP tasks

**Difficulties of ANNs for NLP:**

* **No memory:** A feed-forward ANN carries no internal state from one word to the next, making it difficult to understand the context of words in a sequence. It struggles with:
    * **Sentence structure:** With an order-insensitive input such as bag-of-words, "The dog chased the cat" and "The cat chased the dog" look identical.
    * **Sentiment:** They might misinterpret sarcasm or irony due to lacking context.

**Difficulties with specific NLP tasks:**

* **Machine translation:** They wouldn't capture long-range dependencies in grammar and word order between languages.
* **Text summarization:** They might miss important points or generate summaries lacking coherence.
* **Question answering:** They might miss relevant information spread across the text due to no context memory.

**Reasons not to use ANNs for NLP:**

* **Limited accuracy:** They often underperform compared to RNNs on NLP tasks.
* **Limited flexibility:** They struggle with variable-length input and capturing long-term dependencies.

**RNNs to the rescue!**

* **Internal memory:** They remember previous words, enabling them to understand context and relationships.
* **Flexible input:** They handle different sentence lengths naturally.
* **Long-term dependencies:** They can capture connections between distant elements in the sequence, although plain RNNs lose this ability as sequences get long (see "Problems with RNN").





## Data in RNN

Input to an RNN is organized along two axes: **timesteps** and **input features**. Here's a breakdown:

**Timesteps:**

* Imagine an RNN processing a sentence. Each word in the sentence becomes a **timestep**. The network processes information at each timestep, incorporating it into its internal memory.
* Think of it like stepping through the sentence word by word. Each step represents a new timestep with new information.
* The number of timesteps depends on the length of the sequence. A short sentence might have 5-10 timesteps, while a long document might have hundreds or even thousands.

**Input Features:**

* Each timestep doesn't just hold a single word. It also contains information about that word, like its **embedding vector**. This vector represents the word's meaning in a high-dimensional space.
* Think of it like capturing various aspects of the word, such as its meaning, part of speech, and relationships with other words.
* The number of features depends on the chosen representation (e.g., word embeddings can have hundreds of dimensions).

**Combining Timesteps and Features:**

* At each timestep, the RNN processes the input features (e.g., embedding vector) and its internal memory (information from previous timesteps).
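* In a simple RNN, this processing multiplies the input and the previous hidden state by weight matrices and applies an activation function (typically tanh). Variants such as LSTMs and GRUs add gates on top of this.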
* The result of this processing updates the internal memory and can also generate an output (e.g., predict the next word in a sentence).




## How RNNs Work

Let's assume we are working with the following data:

| Review | Sentiment | RNN input shape |
|--------|-----------|-----------------|
| Movie was good | 1 | (3,5) |
| Movie was bad | 0 | (3,5) |
| Movie was not good | 0 | (4,5) |


The vocabulary of this data has 5 distinct words, so each word is represented by a one-hot vector of length 5:

movie → [1,0,0,0,0]

was → [0,1,0,0,0]

good → [0,0,1,0,0]

bad → [0,0,0,1,0]

not → [0,0,0,0,1]


In this data, (3,5) means 3 timesteps (words) with 5 features each. The third review has 4 words, so its shape is (4,5).
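As a rough sketch of how those shapes come about, the snippet below one-hot encodes the three reviews using the vocabulary order listed above. The helper name `one_hot_encode` is made up for this illustration and is not part of Keras.

```python
# Vocabulary order follows the one-hot vectors listed above.
vocab = ["movie", "was", "good", "bad", "not"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_encode(review):
    """Return one one-hot row per word, i.e. a (timesteps, features) matrix."""
    rows = []
    for word in review.lower().split():
        row = [0] * len(vocab)
        row[word_to_index[word]] = 1
        rows.append(row)
    return rows

reviews = ["Movie was good", "Movie was bad", "Movie was not good"]
encoded = [one_hot_encode(r) for r in reviews]

for review, matrix in zip(reviews, encoded):
    # (number of words, vocabulary size)
    print(review, "->", (len(matrix), len(matrix[0])))
```

The first two reviews give (3, 5) and the last one gives (4, 5), matching the table.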


<hr>

If we call our input 'x', then
- Row 1 will be 'x1'
- Row 2 will be 'x2'
- Row 3 will be 'x3'

Each word (timestep) within a row gets a second index, so 'x11' is the first word of row 1:
- For row 1 (x1)
  - movie → x_11
  - was → x_12
  - good → x_13

- For row 2 (x2)
  - movie → x_21
  - was → x_22
  - bad → x_23

- For row 3 (x3)
  - movie → x_31
  - was → x_32
  - not → x_33
  - good → x_34

In [None]:
from keras import Sequential
from keras.layers import Dense, SimpleRNN

In [None]:
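# RNN architecture: one SimpleRNN hidden layer followed by a Dense output layer.
# input_shape=(4, 5): up to 4 timesteps (the longest review), 5 one-hot features each.

model = Sequential()
model.add(SimpleRNN(3, input_shape=(4, 5)))     # hidden layer with 3 units
model.add(Dense(1, activation='sigmoid'))       # sentiment probability
model.summary()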

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 3)                 27        
                                                                 
 dense (Dense)               (None, 1)                 4         
                                                                 
Total params: 31 (124.00 Byte)
Trainable params: 31 (124.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
print(model.get_weights()[0].shape)
model.get_weights()[0]

(5, 3)


array([[ 0.48703045,  0.3188302 , -0.52045697],
       [ 0.388331  , -0.39352322, -0.696949  ],
       [-0.42381108, -0.02778387, -0.66490006],
       [-0.5465789 ,  0.12045324,  0.08881193],
       [-0.46420035,  0.80267555, -0.7627485 ]], dtype=float32)

In [None]:
print(model.get_weights()[1].shape)
model.get_weights()[1]

(3, 3)


array([[ 0.66310775,  0.39399698,  0.6364389 ],
       [ 0.05589984,  0.8218182 , -0.567001  ],
       [ 0.7464337 , -0.41155964, -0.52292967]], dtype=float32)

In [None]:
print(model.get_weights()[2].shape)
model.get_weights()[2]

(3,)


array([0., 0., 0.], dtype=float32)

In [None]:
print(model.get_weights()[3].shape)
model.get_weights()[3]

(3, 1)


array([[-1.2209935 ],
       [-0.31076336],
       [-1.1767027 ]], dtype=float32)

In [None]:
print(model.get_weights()[4].shape)
model.get_weights()[4]

(1,)


array([0.], dtype=float32)
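The five arrays returned by `get_weights()` above line up with the parameter counts in the model summary. The sketch below recomputes those counts from the shapes alone; the two helper functions are hypothetical and only restate the arithmetic.

```python
def simple_rnn_param_count(input_features, units):
    """Parameters of a simple recurrent layer: input kernel + recurrent kernel + bias."""
    input_kernel = input_features * units      # shape (5, 3) above
    recurrent_kernel = units * units           # shape (3, 3) above
    bias = units                               # shape (3,) above
    return input_kernel + recurrent_kernel + bias

def dense_param_count(inputs, units):
    """Parameters of a dense layer: kernel + bias."""
    return inputs * units + units              # shapes (3, 1) and (1,) above

rnn_params = simple_rnn_param_count(input_features=5, units=3)
dense_params = dense_param_count(inputs=3, units=1)
print(rnn_params, dense_params, rnn_params + dense_params)   # 27 4 31
```

Note that the number of timesteps does not appear anywhere: the same weights are reused at every step.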

# Forward Propagation

Forward propagation in a recurrent neural network (RNN) is the process of passing information from the first input to the last output, one step at a time. Here's a breakdown:

**1. Setting the Stage:**

* Before starting, the RNN's internal memory (hidden state) is initialized with a specific value, usually zeros.
* The input sequence is divided into individual elements (words, characters, etc.), each representing a timestep.

**2. Processing Each Timestep:**

* At each timestep:
    * **Combine inputs:** The current input element and the previous hidden state are combined.
    * **Activate:** This combined information is passed through an activation function, transforming it into a new value.
    * **Update memory:** The resulting value is used to update the RNN's hidden state (internal memory). This captures information from previous steps.
    * **Generate output (optional):** Depending on the task, an output might be generated at each timestep from the current hidden state.

**3. Moving Forward:**

* The updated hidden state becomes the starting point for processing the next timestep.
* This process repeats for all timesteps in the sequence.
* The final output, if not generated at each step, is usually based on the final hidden state of the sequence.

**Here's an analogy:**

Imagine you're reading a sentence word by word. Each word is a timestep. You keep track of the previous words' meaning in your mind (similar to the hidden state). As you read each new word, you combine its meaning with what you remember from before and update your understanding of the sentence (similar to updating the hidden state). Finally, you reach the end of the sentence and have a complete understanding (similar to the final output).

**Key points to remember:**

* Information flows from left to right (or forward) in the network.
* The hidden state captures information from previous timesteps, allowing for context-aware processing.
* RNNs can be used for tasks requiring sequential data analysis, like text generation, translation, and speech recognition.
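The steps above can be sketched in NumPy. This is a minimal illustration, assuming a tanh hidden activation (the Keras `SimpleRNN` default) and reusing the weights printed earlier in this notebook; it ignores padding, so it runs on a 3-word review even though the Keras model was declared with 4 timesteps. The names `rnn_forward`, `W_x`, `W_h`, `W_y` are chosen here for readability.

```python
import numpy as np

# Weights copied from the get_weights() output shown earlier in this notebook.
W_x = np.array([[ 0.48703045,  0.3188302 , -0.52045697],
                [ 0.388331  , -0.39352322, -0.696949  ],
                [-0.42381108, -0.02778387, -0.66490006],
                [-0.5465789 ,  0.12045324,  0.08881193],
                [-0.46420035,  0.80267555, -0.7627485 ]])    # input kernel (5, 3)
W_h = np.array([[ 0.66310775,  0.39399698,  0.6364389 ],
                [ 0.05589984,  0.8218182 , -0.567001  ],
                [ 0.7464337 , -0.41155964, -0.52292967]])    # recurrent kernel (3, 3)
b_h = np.zeros(3)                                            # hidden bias (3,)
W_y = np.array([[-1.2209935], [-0.31076336], [-1.1767027]])  # dense kernel (3, 1)
b_y = np.zeros(1)                                            # dense bias (1,)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(x_seq):
    """x_seq has shape (timesteps, features); returns all hidden states and the prediction."""
    h = np.zeros(W_h.shape[0])          # 1. hidden state starts at zeros
    states = []
    for x_t in x_seq:                   # 2. process one timestep at a time
        h = np.tanh(x_t @ W_x + h @ W_h + b_h)   # combine, activate, update memory
        states.append(h)                # 3. the new state feeds the next step
    y_hat = sigmoid(h @ W_y + b_y)      # many-to-one: use only the final state
    return np.array(states), y_hat

# "movie was good" as one-hot rows over the vocabulary [movie, was, good, bad, not]
x_seq = np.eye(5)[[0, 1, 2]]
states, y_hat = rnn_forward(x_seq)
print(states.shape, y_hat.shape)        # (3, 3) (1,)
```

Because the initial hidden state is zero and the first input is one-hot, the first hidden state is simply tanh of the first row of `W_x`.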


In [None]:
import numpy as np

In [None]:
docs = ['go Pak',
        'pak pak',
        'hip hip hurray',
        'jeety ga bhai koi to jeety ga',
        'babar azam',
        'deep learning',
        'recurrent neural network',
        'stucked',
        'viral and trendy']

In [None]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token='bad')

In [None]:
tokenizer.fit_on_texts(docs)

In [None]:
tokenizer.word_index

{'bad': 1,
 'pak': 2,
 'hip': 3,
 'jeety': 4,
 'ga': 5,
 'go': 6,
 'hurray': 7,
 'bhai': 8,
 'koi': 9,
 'to': 10,
 'babar': 11,
 'azam': 12,
 'deep': 13,
 'learning': 14,
 'recurrent': 15,
 'neural': 16,
 'network': 17,
 'stucked': 18,
 'viral': 19,
 'and': 20,
 'trendy': 21}

In [None]:
tokenizer.word_counts

OrderedDict([('go', 1),
             ('pak', 3),
             ('hip', 2),
             ('hurray', 1),
             ('jeety', 2),
             ('ga', 2),
             ('bhai', 1),
             ('koi', 1),
             ('to', 1),
             ('babar', 1),
             ('azam', 1),
             ('deep', 1),
             ('learning', 1),
             ('recurrent', 1),
             ('neural', 1),
             ('network', 1),
             ('stucked', 1),
             ('viral', 1),
             ('and', 1),
             ('trendy', 1)])

In [None]:
sequence = tokenizer.texts_to_sequences(docs)
sequence

[[6, 2],
 [2, 2],
 [3, 3, 7],
 [4, 5, 8, 9, 10, 4, 5],
 [11, 12],
 [13, 14],
 [15, 16, 17],
 [18],
 [19, 20, 21]]

In [None]:
# to keep all sequences of same size we will pad sequences

from keras.utils import pad_sequences

In [None]:
sequence = pad_sequences(sequence, padding='post')

In [None]:
sequence

array([[ 6,  2,  0,  0,  0,  0,  0],
       [ 2,  2,  0,  0,  0,  0,  0],
       [ 3,  3,  7,  0,  0,  0,  0],
       [ 4,  5,  8,  9, 10,  4,  5],
       [11, 12,  0,  0,  0,  0,  0],
       [13, 14,  0,  0,  0,  0,  0],
       [15, 16, 17,  0,  0,  0,  0],
       [18,  0,  0,  0,  0,  0,  0],
       [19, 20, 21,  0,  0,  0,  0]], dtype=int32)
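To make the padding step less of a black box, here is a plain-Python sketch of what `padding='post'` does: every sequence is right-padded with zeros up to the longest length. The helper `pad_post` is hypothetical and is not how Keras implements it; it only mirrors the result shown above for a few of the rows.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [value] * (max_len - len(seq)) for seq in sequences]

# A few of the token sequences produced by texts_to_sequences above.
token_ids = [[6, 2], [2, 2], [3, 3, 7], [4, 5, 8, 9, 10, 4, 5], [18]]
padded = pad_post(token_ids)
for row in padded:
    print(row)
```

Index 0 is safe to use as the padding value because the tokenizer's `word_index` starts at 1.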

# Embeddings

In NLP, a word embedding is a representation of a word for text analysis, typically a real-valued vector that encodes the word's meaning so that words that are close together in the vector space are expected to be similar in meaning.

Advantages
- Dense representation
- Can capture semantic meaning
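Mechanically, an embedding is a lookup table: each token index selects one row of a matrix. The sketch below shows that lookup on one padded sequence from the previous section. The embedding size of 4 and the random values are assumptions for illustration only; in a real model the rows are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 22          # indices 0..21: 0 is padding, 1..21 come from word_index
embedding_dim = 4        # assumed size, chosen only for illustration

# Random placeholders standing in for learned embedding vectors.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

padded_row = np.array([3, 3, 7, 0, 0, 0, 0])   # 'hip hip hurray' after padding
embedded = embedding_matrix[padded_row]        # one row lookup per token id

print(embedded.shape)    # (7, 4): 7 timesteps, 4 features per timestep
```

Compared with a one-hot vector of length 22, each word is now a dense vector of length 4, and the repeated word 'hip' maps to the same vector both times.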

# Types of RNN

- Many to One
- One to Many
- Many to Many
- One to One

## Many to One
- Input is sequential (sentences, characters, time series)
- Output is non-sequential (a scalar value)
- **Applications**
    - Sentiment Analysis
    - Rating Prediction



## One to Many
- Input is non-sequential data (image, tabular)
- Output is sequential
- **Applications**
    - Image Captioning
    - Music Generation

## Many to Many
- Both input and output data are sequential
- Two types
    - Same-length many to many
        - POS tagging
        - Named Entity Recognition
    - Variable-length many to many
        - Machine Translation


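## One to One
- Input and output are both non-sequential, so this is not really an RNN; it is a standard non-recurrent network
- **Applications**
    - Image classification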
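The difference between many to one and same-length many to many comes down to which hidden states are kept. Here is a small NumPy sketch under assumed random weights: the flag name `return_sequences` mirrors the Keras `SimpleRNN` argument of the same name, but `run_rnn` itself is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
features, units = 5, 3
W_x = rng.normal(scale=0.5, size=(features, units))   # assumed random weights
W_h = rng.normal(scale=0.5, size=(units, units))

def run_rnn(x_seq, return_sequences=False):
    """Return every hidden state (many to many) or only the last one (many to one)."""
    h = np.zeros(units)
    states = []
    for x_t in x_seq:
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states) if return_sequences else h

x_seq = rng.normal(size=(4, features))                # 4 timesteps, 5 features
print(run_rnn(x_seq).shape)                           # (3,)   one vector for the sequence
print(run_rnn(x_seq, return_sequences=True).shape)    # (4, 3) one vector per timestep
```

A many-to-one model such as sentiment analysis feeds the single final vector to the output layer, while a tagger such as POS tagging applies the output layer to each of the per-timestep vectors.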

# Backpropagation through time

Suppose we are working on a many-to-one RNN problem in which sentences are the input and sentiment is the output.


| Sentences | Sentiment |
|-----------|-----------|
| cat mat rat | 1 |
| rat mat cat | 0 |
| rat cat mat | 1 |

- Step 1: Generate vocabulary
    - cat [1 0 0]
    - mat [0 1 0]
    - rat [0 0 1]

- Step 2: Vectorize data

|  | X | Y |
|---|---|---|
| x_1 | [1 0 0] [0 1 0] [0 0 1] | 1 |
| x_2 | [0 0 1] [0 1 0] [1 0 0] | 0 |
| x_3 | [0 0 1] [1 0 0] [0 1 0] | 1 |


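Let's suppose we have only one hidden layer with 3 neurons. The weight matrices have these shapes:

$w_i$: $(3 \times 3)$ (input to hidden)

$w_h$: $(3 \times 3)$ (hidden to hidden)

$w_0$: $(3 \times 1)$ (hidden to output)

With $O_0$ as the initial hidden state (zeros) and $f$ as the hidden activation function, the forward pass for the first sentence $x_1$ is:

$O_1 = f(x_{11} w_i + O_0 w_h)$

$O_2 = f(x_{12} w_i + O_1 w_h)$

$O_3 = f(x_{13} w_i + O_2 w_h)$

$\hat{y} = \sigma(O_3 w_0)$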

Now we will calculate loss:

$L = -y_i \log \hat{y}_i - (1-y_i)\log(1-\hat{y}_i)$
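As a quick numeric check of this loss (binary cross-entropy), the sketch below evaluates it for a confident correct prediction and a confident wrong one. The prediction values 0.9 and 0.1 are made up for illustration.

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """L = -y log(y_hat) - (1 - y) log(1 - y_hat) for a single example."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

print(binary_cross_entropy(1, 0.9))   # about 0.105: label 1, prediction close to 1
print(binary_cross_entropy(1, 0.1))   # about 2.303: label 1, prediction close to 0
```

The loss grows quickly as the prediction moves away from the true label, which is what gradient descent then tries to reduce.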

To minimize the loss, we use gradient descent with learning rate $\eta$:

$w_0 = w_0 - \eta\frac{\partial L}{\partial w_0}$

$w_i = w_i - \eta\frac{\partial L}{\partial w_i}$

$w_h = w_h - \eta\frac{\partial L}{\partial w_h}$


To apply these updates we already have every value (current weights, learning rate) except the partial derivatives, so the next step is to calculate them.


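$\frac{\partial L}{\partial w_0}=\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w_0}$

$\frac{\partial L}{\partial w_i}=\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial O_3}\frac{\partial O_3}{\partial w_i} + \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial O_3}\frac{\partial O_3}{\partial O_2}\frac{\partial O_2}{\partial w_i} + \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial O_3}\frac{\partial O_3}{\partial O_2}\frac{\partial O_2}{\partial O_1}\frac{\partial O_1}{\partial w_i}$

$\frac{\partial L}{\partial w_i}=\sum_{j=1}^{3}\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial O_j}\frac{\partial O_j}{\partial w_i}$

In the compact form, $\frac{\partial \hat{y}}{\partial O_j}$ includes the chain through the later hidden states, and $\frac{\partial O_j}{\partial w_i}$ is the direct dependence of $O_j$ on $w_i$ with $O_{j-1}$ held fixed. The same reading applies to $w_h$:

$\frac{\partial L}{\partial w_h}=\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial O_3}\frac{\partial O_3}{\partial w_h} + \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial O_3}\frac{\partial O_3}{\partial O_2}\frac{\partial O_2}{\partial w_h} + \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial O_3}\frac{\partial O_3}{\partial O_2}\frac{\partial O_2}{\partial O_1}\frac{\partial O_1}{\partial w_h}$

$\frac{\partial L}{\partial w_h}=\sum_{j=1}^{3}\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial O_j}\frac{\partial O_j}{\partial w_h}$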
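These sums can be accumulated by walking backwards through the timesteps. The NumPy sketch below does this for the sentence "cat mat rat" with label 1. It assumes tanh for $f$, no bias terms (as in the equations above), random initial weights, and a learning rate of 0.01; none of these values come from the notebook, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x_seq, y, W_i, W_h, w_0):
    """Return the loss, the prediction and every hidden state O_0..O_T."""
    O = [np.zeros(W_h.shape[0])]                    # O_0 = 0
    for x_t in x_seq:
        O.append(np.tanh(x_t @ W_i + O[-1] @ W_h))  # O_t = f(x_t W_i + O_{t-1} W_h)
    y_hat = sigmoid(O[-1] @ w_0)[0]
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return loss, y_hat, O

def bptt(x_seq, y, W_i, W_h, w_0):
    """Gradients of the loss with respect to W_i, W_h and w_0."""
    _, y_hat, O = forward(x_seq, y, W_i, W_h, w_0)
    d_out = y_hat - y                     # dL/d(O_T w_0) for sigmoid + cross-entropy
    grad_w0 = O[-1][:, None] * d_out
    grad_Wi = np.zeros_like(W_i)
    grad_Wh = np.zeros_like(W_h)
    d_O = d_out * w_0[:, 0]               # dL/dO_T
    for t in range(len(x_seq), 0, -1):    # walk back through time: T, ..., 1
        d_pre = d_O * (1.0 - O[t] ** 2)   # tanh'(a) = 1 - tanh(a)^2
        grad_Wi += np.outer(x_seq[t - 1], d_pre)   # term j = t of the sum for w_i
        grad_Wh += np.outer(O[t - 1], d_pre)       # term j = t of the sum for w_h
        d_O = d_pre @ W_h.T               # pass the gradient on to O_{t-1}
    return grad_Wi, grad_Wh, grad_w0

rng = np.random.default_rng(0)
W_i = rng.normal(scale=0.5, size=(3, 3))
W_h = rng.normal(scale=0.5, size=(3, 3))
w_0 = rng.normal(scale=0.5, size=(3, 1))

x_1 = np.eye(3)[[0, 1, 2]]                # "cat mat rat" as one-hot rows
grad_Wi, grad_Wh, grad_w0 = bptt(x_1, 1, W_i, W_h, w_0)

# One gradient-descent step with an assumed learning rate.
eta = 0.01
loss_before = forward(x_1, 1, W_i, W_h, w_0)[0]
loss_after = forward(x_1, 1, W_i - eta * grad_Wi,
                     W_h - eta * grad_Wh, w_0 - eta * grad_w0)[0]
print(loss_before, loss_after)
```

Each pass through the loop adds one term of the sums above, and the repeated multiplication by `W_h.T` is the product of $\frac{\partial O_t}{\partial O_{t-1}}$ factors that appears in the expanded form.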




# Problems with RNN

Plain RNNs are rarely used on long sequences because of these two problems:
- The problem of long-term dependency (vanishing gradient)
- The problem of unstable gradients (exploding gradient)
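Both problems come from the same repeated multiplication during backpropagation through time. Here is a deliberately simplified sketch, assuming a scalar hidden state and an identity activation, so that the gradient passed back over `steps` timesteps is scaled by `w_h ** steps`. With tanh the factor can only get smaller, since its derivative is at most 1. The function name and the weight values are made up for illustration.

```python
def gradient_factor(w_h, steps):
    """Size of dO_T/dO_0 for a scalar RNN with identity activation: |w_h| ** steps."""
    return abs(w_h) ** steps

for w_h in (0.5, 1.0, 1.5):
    # Factor after 1, 10 and 50 timesteps.
    print(w_h, [gradient_factor(w_h, t) for t in (1, 10, 50)])
```

With a recurrent weight below 1 the factor shrinks towards zero (vanishing gradient), and with a weight above 1 it blows up (exploding gradient); only a weight of exactly 1 keeps it stable in this toy setting.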