<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/06.rnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/06.rnn.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

# RNNs and LSTMs

📝 SALP chapter 8

## Recurrent Neural Network (RNN)
- **What is an RNN?**
  - A neural network designed for sequential data.
  - Ability to retain information across time steps.
- **Application Areas:**
  - Time Series Prediction, Natural Language Processing, Speech Recognition.

---

### **RNN Structures**
![rnn loop](./images/rnn/loop.png)

- RNN units consist of inputs, hidden layers, and outputs.
- The `looping mechanism (recurrent connection)` in the hidden layer feeds the hidden state of one step back in as an input to the next step.
  - Offers a new way to represent `the prior context`, which can in principle reach back hundreds of words.
    - Removes the fixed-length limit (a sliding window) that feedforward networks place on this prior context.

---

### **Working Principle of RNNs**
- RNN unfolded in time steps

![temporal unfold](./images/rnn/unfold.png)

- Takes an input $x_t$ at each time step $t$ of the sequence.
- The hidden state $h_t$ carries information forward from the previous state $h_{t-1}$ through the recurrent weights $W_{hh}$.
  - Retains information from previous steps (memory) to infer the next output.
  - $h_t = f(W_{xh}x_t + W_{hh}h_{t-1} + b_h)$
  - $W_{xh}, W_{hh}, W_{hy}$: Weight matrices, shared across all time steps
  - Activation functions $f$ and $g$ (typically $\tanh$ or $\text{ReLU}$ for hidden states, and softmax for output)
- Outputs $y_t$ are calculated at time step $t$:
  - $y_t = g(W_{hy}h_t + b_y)$
  - Predictions at each time step depend on the current input and the hidden state.
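
The update equations above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the toy dimensions, the random weights, and the helper name `rnn_forward` are assumptions made for the example, with $f = \tanh$ and $g = $ softmax.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a simple RNN over a sequence; the same weights are reused at every step."""
    h = np.zeros(W_hh.shape[0])  # h_0: initial hidden state
    hs, ys = [], []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # h_t = f(W_xh x_t + W_hh h_{t-1} + b_h)
        y = softmax(W_hy @ h + b_y)              # y_t = g(W_hy h_t + b_y)
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

# Assumed toy sizes: 3 input features, 4 hidden units, 2 output classes, 5 time steps
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 4, 2, 5
W_xh = rng.normal(scale=0.5, size=(d_h, d_in))
W_hh = rng.normal(scale=0.5, size=(d_h, d_h))
W_hy = rng.normal(scale=0.5, size=(d_out, d_h))
b_h, b_y = np.zeros(d_h), np.zeros(d_out)

xs = rng.normal(size=(T, d_in))
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape, ys.shape)   # (5, 4) (5, 2)
```

Each row of `ys` is a probability distribution, and each row of `hs` depends on all inputs seen so far.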

---

### **Training RNNs**
- Backpropagation Through Time (BPTT)
- Loss calculated at each step, summed across the sequence.
- Updates weights using gradient descent.
  - $\nabla L = \sum_{t=1}^{T} \nabla L_t$
- **Challenges:**
  - Vanishing/Exploding gradients.
  - Long-term dependency issues.
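
To see where these challenges come from, the gradient of one step's loss with respect to the recurrent weights can be expanded with the chain rule. This is the standard textbook expansion, written with $a_j$ as added shorthand for the pre-activation $W_{xh}x_j + W_{hh}h_{j-1} + b_h$:

$$
\frac{\partial L_t}{\partial W_{hh}}
= \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t}
\left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right)
\frac{\partial^{+} h_k}{\partial W_{hh}},
\qquad
\frac{\partial h_j}{\partial h_{j-1}} = \operatorname{diag}\big(f'(a_j)\big)\, W_{hh}
$$

Here $\partial^{+} h_k / \partial W_{hh}$ is the immediate derivative that treats $h_{k-1}$ as a constant. The product contains $t-k$ Jacobians, so it tends to shrink toward zero when their norms are below 1 and to blow up when they are above 1.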

---

### **Stacked RNNs**
![stacked rnn](./images/rnn/stack.png)

- Multiple RNN layers stacked on top of each other.
- Output of one RNN layer serves as input to the next.
- **Advantages:**
  - Increases the capacity of the network.
  - Handles more complex patterns and long-term dependencies.

---

### **Bidirectional RNNs**
![bidir rnn](./images/rnn/bidrnn.png)

- Combines two RNNs, each with its own hidden state: one reads the sequence forward, the other backward.
- Each position therefore sees context from both the past and the future.
- Useful for tasks where context from both directions is important (e.g., sequence labeling in NLP).

### 🍎 **Create RNNs**
- **simple RNN**

```python
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

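# Illustrative sizes (assumed values; replace with the dimensions of your data)
timesteps, features, output_units = 10, 8, 1

# Define a simple RNN model
model = Sequential()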
model.add(SimpleRNN(50, input_shape=(timesteps, features)))
model.add(Dense(output_units))
model.compile(loss='mse', optimizer='adam')
model.summary()
```
---

- **Stacked RNN**:
```python
model = Sequential()
model.add(SimpleRNN(50, return_sequences=True, input_shape=(timesteps, features)))
model.add(SimpleRNN(50))
model.add(Dense(output_units))
```
- **Bidirectional RNN**:
```python
from keras.layers import Bidirectional

model = Sequential()
model.add(Bidirectional(SimpleRNN(50), input_shape=(timesteps, features)))
model.add(Dense(output_units))
```

## RNNs as Language Models

**Language Models Overview**:
- **Goal**: Predict the next word in a sequence given its preceding context.
  - Example: Given the context *“Thanks for all the”*, how likely is the next word *“fish”*?
  - $P(\text{fish} | \text{Thanks for all the})$
  
**Conditional Probability for Entire Sequences**:
- Language models assign probabilities to entire sequences using the chain rule:
  - $\displaystyle P(w_{1:n}) = \prod_{i=1}^{n} P(w_i | w_{<i})$
- Each word’s probability depends on the context of the preceding words.

---

### **Comparing Language Models**
- **N-Gram Models**:
  - **Fixed Context Size**: Probability depends on $n - 1$ previous words.
  - Context limited to **short history**.

![ffnn vs rnn](./images/rnn/fnnvrnn.png)

- **a) Feedforward Models**:
  - Context depends on a **window size**.
  - **Fixed-length context** that is independent of the sequence length.

- **b) RNN Language Models**:
  - **Dynamic Context**: Uses all preceding words via hidden state.
  - **No Fixed Context Limit**: The hidden state $h_{t-1}$ summarizes the entire past.
  - Overcomes **limited context** of n-gram models and **fixed context** of feedforward models.

### Forward Inference in an RNN Language Model
- **Input Sequence $X$:**
  - The input sequence $X = [x_1, x_2, \dots, x_t, \dots, x_N]$ consists of words represented as **one-hot vectors** of size $|V| \times 1$, where $|V|$ is the vocabulary size.
  - Assume the embedding dimension $d_e$ and hidden dimension $d_h$ are the same for simplicity, denoted as model dimension $d$.
- **Output prediction $y$:** 
   - A vector representing a probability distribution over the vocabulary.
- **Step-by-Step Process**:

1. **Word Embedding**:
   - Retrieve the word embedding $e_t$ for the current word $x_t$ using the embedding matrix $E$:
     - $e_t = E x_t$
     where:
     - $E \in \mathbb{R}^{d \times |V|}$ is the embedding matrix.
     - $x_t \in \mathbb{R}^{|V| \times 1}$ is the one-hot encoded word.
     - $e_t \in \mathbb{R}^{d \times 1}$ is the embedding vector.

2. **Hidden State Update**:
   - Compute the new hidden state $h_t$ using the previous hidden state $h_{t-1}$ and the current embedding $e_t$:
     - $h_t = g(U h_{t-1} + W e_t)$
     where:
     - $U \in \mathbb{R}^{d \times d}$ is the recurrent weight matrix.
     - $W \in \mathbb{R}^{d \times d}$ is the input weight matrix.
     - $g$ is the activation function (commonly $\tanh$).

3. **Output Layer**:
   - Compute the raw scores for each word in the vocabulary using the hidden state $h_t$:
     - $z_t = V h_t$
     where:
     - $V \in \mathbb{R}^{|V| \times d}$ is the output weight matrix.
     - $z_t \in \mathbb{R}^{|V| \times 1}$ contains the unnormalized scores (logits) over the vocabulary.

4. **Softmax Layer**:
   - Apply the **softmax** function to convert the scores into a probability distribution over the vocabulary:
     - $\hat{y}_t = \text{softmax}(V h_t)$
     - The probability that the next word is $w_{t+1} = k$ is given by:
       - $P(w_{t+1} = k | w_1, \dots, w_t) = \hat{y}_t[k]$
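
The four steps can be sketched in NumPy as follows. The toy vocabulary, the model dimension `d = 8`, and the random (untrained) weights are assumptions for illustration only, so the printed probability is meaningless; the point is the shapes and the order of operations.

```python
import numpy as np

vocab = ["<s>", "thanks", "for", "all", "the", "fish"]   # assumed toy vocabulary
V_size, d = len(vocab), 8                                 # |V| and model dimension d

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(d, V_size))   # embedding matrix, d x |V|
U = rng.normal(scale=0.1, size=(d, d))        # recurrent weights, d x d
W = rng.normal(scale=0.1, size=(d, d))        # input weights, d x d
V = rng.normal(scale=0.1, size=(V_size, d))   # output weights, |V| x d

def softmax(z):
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

def lm_step(word_id, h_prev):
    x = np.zeros(V_size)
    x[word_id] = 1.0                   # one-hot vector x_t
    e = E @ x                          # 1. e_t = E x_t (selects one column of E)
    h = np.tanh(U @ h_prev + W @ e)    # 2. h_t = g(U h_{t-1} + W e_t)
    y_hat = softmax(V @ h)             # 3-4. scores V h_t, then softmax
    return y_hat, h

h = np.zeros(d)                        # h_0
for word in ["thanks", "for", "all", "the"]:
    y_hat, h = lm_step(vocab.index(word), h)

print(f"P(fish | thanks for all the) = {y_hat[vocab.index('fish')]:.3f}")
```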
---

### **Sequence Probability and Chain Rule**

- The probability of an entire sequence $P(w_1, w_2, \dots, w_n)=P(w_{1:n})$ is the product of the probabilities of each word in the sequence:
  - $P(w_{1:n}) = \prod_{i=1}^{n} P(w_i | w_{1:i-1}) = \prod_{i=1}^{n} \hat{y}_i[w_i]$
  - $\hat{y}_i[w_i]$ is the probability assigned to the true word $w_i$ at time step $i$.
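
As a small worked example of the product above, suppose a model assigned the following probabilities to the true words of a four-word sequence. The numbers are made up for illustration. Summing log-probabilities gives the same result and avoids numerical underflow on long sequences.

```python
import math

# Assumed per-step probabilities y_hat_i[w_i] assigned to the true words
step_probs = [0.2, 0.5, 0.1, 0.4]

# Chain rule: multiply the per-step probabilities
p_sequence = math.prod(step_probs)

# Equivalent in log space: sum the log-probabilities
log_p_sequence = sum(math.log(p) for p in step_probs)

print(f"P(w_1:n)     = {p_sequence:.6f}")
print(f"log P(w_1:n) = {log_p_sequence:.4f}")
```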

## Training an RNN Language Model

**Self-supervision**:
- RNN language models are trained by **self-supervision**. 
  - At each time step $t$, the model predicts the next word in the sequence based on the preceding context.
  - The sequence of words itself provides the **supervision** without needing labeled data.

![training rnn as lm](./images/rnn/training.png)

**Training Process**:
- The model’s weights are adjusted using **Gradient Descent** to minimize the **average cross-entropy loss** across the sequence.
  
**Loss Function**:
- We use **Cross-Entropy Loss (CE)** to measure how well the model’s predicted probability distribution $\hat{y}_t$ matches the true next word’s distribution $y_t$.
  - $\displaystyle L_{CE} = - \sum_{w \in V} y_t[w] \log \hat{y}_t[w]$
  - Here $V$ denotes the vocabulary (not the output weight matrix).
  - ∵ $y_t$ is a one-hot vector over the vocabulary: $y_t[w]$ is 1 for the true next word $w_{t+1}$ and 0 otherwise.
- ∴ $L_{CE}(\hat{y}_t, y_t) = - \log \hat{y}_t[w_{t+1}]$
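
The simplification can be checked numerically. In this sketch the four-word vocabulary and the predicted distribution are assumed values; both forms of the loss give the same number because every term except the true word's is multiplied by zero.

```python
import math

vocab = ["the", "fish", "cat", "dog"]   # assumed toy vocabulary
y_hat = [0.1, 0.6, 0.2, 0.1]            # assumed model prediction at step t
true_next = "fish"                      # true next word w_{t+1}

# One-hot target distribution y_t
y = [1.0 if w == true_next else 0.0 for w in vocab]

# Full definition: sum over the whole vocabulary
loss_full = -sum(y_w * math.log(p_w) for y_w, p_w in zip(y, y_hat))

# Simplified form: only the true next word contributes
loss_short = -math.log(y_hat[vocab.index(true_next)])

print(f"{loss_full:.4f} {loss_short:.4f}")
```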

---

### **Training Process Steps**
1. At time step $t$, input the correct word $w_t$ and the hidden state $h_{t-1}$ (which encodes the preceding context $w_1, w_2, \dots, w_{t-1}$).
2. Compute the predicted probability distribution $\hat{y}_t$ for the next word.
3. Calculate the **cross-entropy loss** $L_{CE}(\hat{y}_t, y_t)$ based on the `true next word` $w_{t+1}$.
4. Use **Gradient Descent** to adjust the weights in the RNN to minimize the total cross-entropy loss over the entire training sequence.
5. Move to the next word in the sequence and repeat, always feeding the true word into the RNN.

- Giving the model the `true history sequence` to predict the next word (rather than feeding the model its best prediction from the previous time step) is called **teacher forcing**.
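
In practice, teacher forcing amounts to shifting the training sequence by one position. A minimal sketch, assuming a toy token sequence that starts with `<s>`:

```python
tokens = ["<s>", "thanks", "for", "all", "the", "fish"]   # assumed training sequence

# Teacher forcing: the input at step t is always the true word w_t,
# and the target is the true next word w_{t+1}
inputs = tokens[:-1]
targets = tokens[1:]

for w_t, w_next in zip(inputs, targets):
    print(f"input: {w_t:<7} target: {w_next}")
```

The model's own predictions never appear in `inputs`, which is what distinguishes training with teacher forcing from autoregressive generation.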

### 🍎 RNN as a language model
- Predict the next word given the previous words

In [None]:
import numpy as np
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# Sample text data (sentences) for illustration
sentences = [
    "I love deep learning",
    "I love machine learning",
    "deep learning is fun",
    "machine learning is powerful"
]

# Parameters
vocab_size = 1000  # Adjust vocabulary size
timesteps = 4      # Number of previous words to consider (sequence length)
output_units = vocab_size  # Output units should match vocabulary size (for word prediction)

# Tokenize the sentences
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Prepare input-output pairs for word prediction
input_sequences = []
output_words = []

for seq in sequences:
    for i in range(1, len(seq)):
        input_seq = seq[:i]
        output_word = seq[i]
        input_sequences.append(input_seq)
        output_words.append(output_word)

# Pad sequences to have the same length
input_sequences = pad_sequences(input_sequences, maxlen=timesteps, padding='pre')

# Convert output words to categorical format (one-hot encoding)
output_words = tf.keras.utils.to_categorical(output_words, num_classes=vocab_size)

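# Reshape input to 3D for RNN [samples, timesteps, features]
# Note: each word ID is fed as a single numeric feature to keep the example short;
# an Embedding layer is the usual way to turn word IDs into vectors
input_sequences = np.expand_dims(input_sequences, axis=-1)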

# Define a simple RNN model
model = Sequential()
model.add(SimpleRNN(50, input_shape=(timesteps, 1)))  # input_shape=(timesteps, features)
model.add(Dense(output_units, activation='softmax'))  # Softmax for word prediction
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Show model summary and architecture
model.summary()
# plot_model needs the optional pydot and graphviz dependencies
tf.keras.utils.plot_model(model, show_shapes=True)

# Train the model (example)
# Assuming you have a larger dataset, adjust the number of epochs, batch size, etc.
model.fit(input_sequences, output_words, epochs=50, batch_size=16)

# Example of predicting the next word given previous words
def predict_next_word(model, tokenizer, input_text, max_length):
    input_seq = tokenizer.texts_to_sequences([input_text])
    input_seq = pad_sequences(input_seq, maxlen=max_length, padding='pre')
    input_seq = np.expand_dims(input_seq, axis=-1)
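    predicted = int(np.argmax(model.predict(input_seq, verbose=0), axis=-1)[0])
    # Index 0 (padding) and unused indices have no word; fall back to a placeholder
    predicted_word = tokenizer.index_word.get(predicted, "<unk>")
    return predicted_word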

# Predict the next word
input_text = "I love"
next_word = predict_next_word(model, tokenizer, input_text, timesteps)
print(f"Next word after '{input_text}': {next_word}")

## Weight Tying
- In RNN language models, both the input embedding matrix $E$ and the output matrix $V$ (for softmax) map between words and the same $d$-dimensional vector space.
  - $V$ has the same shape as the transpose of $E$.
  - $E$ is of shape $[d \times |V|]$ and $V$ of shape $[|V| \times d]$.

- **Weight Tying**: Instead of learning two separate matrices $E$ and $V$, we **share** the same weights for both:
  - Use $E$ for input embedding and $E^\top$ for output prediction.
  - This **halves** the number of parameters for embedding and softmax layers.

- ∴ weight-tied equations for an RNN language model
  - $e_t = E x_t$
  - $h_t = g(U h_{t-1} + W e_t)$
  - $\hat{y}_t = \text{softmax}(E^\top h_t)$
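
The weight-tied output layer can be sketched in NumPy. The sizes and random values are assumptions, and `np.tanh(e_t)` is only a stand-in for the recurrent update so that the example stays short.

```python
import numpy as np

V_size, d = 1000, 64                            # assumed vocabulary size and model dimension
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(d, V_size))     # the single shared matrix, d x |V|

def softmax(z):
    e = np.exp(z - z.max())                     # subtract the max for numerical stability
    return e / e.sum()

word_id = 42
e_t = E[:, word_id]                 # e_t = E x_t for a one-hot x_t
h_t = np.tanh(e_t)                  # stand-in for h_t = g(U h_{t-1} + W e_t)
y_hat = softmax(E.T @ h_t)          # output layer reuses E: softmax(E^T h_t)

untied_params = 2 * d * V_size      # separate E and V
tied_params = d * V_size            # E only
print(untied_params, tied_params)
```

In PyTorch the same effect is commonly obtained by assigning the `weight` of an `nn.Embedding(V_size, d)` to the `weight` of an `nn.Linear(d, V_size, bias=False)`, since both tensors have shape `(V_size, d)`.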

## RNNs in NLP Tasks
**RNNs** are widely used in various NLP applications:

- **Sequence Labeling**: e.g., part-of-speech (POS) tagging
- **Sequence Classification**: e.g., sentiment analysis
- **Text Generation**: e.g., machine translation, summarization

---

### **Sequence Labeling**

- Assign a label to each token in a sequence
  - e.g., POS tagging, named entity recognition.
- Inputs: Pretrained **word embeddings** for each token.
- Outputs: **Tag probabilities** for each token from a softmax layer.

![pos](./images/rnn/pos.png)

- **RNN Forward Inference**:
  - Pass input tokens one-by-one through the RNN.
  - Generate tag probabilities at each step using a softmax layer.
  - Select the most likely tag.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# Sample data: words and their corresponding labels (e.g., POS tags)
data = [
    (['The', 'dog', 'barked'], ['DET', 'NOUN', 'VERB']),
    (['A', 'cat', 'meowed'], ['DET', 'NOUN', 'VERB']),
    (['The', 'bird', 'sings'], ['DET', 'NOUN', 'VERB'])
]

# Mapping words and labels to integers
word_to_idx = {'<PAD>': 0}
label_to_idx = {}
idx = 1

for sentence, labels in data:
    for word in sentence:
        if word not in word_to_idx:
            word_to_idx[word] = idx
            idx += 1

    for label in labels:
        if label not in label_to_idx:
            label_to_idx[label] = len(label_to_idx)

idx_to_label = {v: k for k, v in label_to_idx.items()}

# Dataset class for sequence labeling
class SequenceDataset(Dataset):
    def __init__(self, data, word_to_idx, label_to_idx, max_len=5):
        self.data = data
        self.word_to_idx = word_to_idx
        self.label_to_idx = label_to_idx
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sentence, labels = self.data[idx]
        sentence_idx = [self.word_to_idx[word] for word in sentence]
        label_idx = [self.label_to_idx[label] for label in labels]

        # Padding sequences to max length
        sentence_idx += [self.word_to_idx['<PAD>']] * (self.max_len - len(sentence_idx))
        # Pad labels with -100, the default ignore_index of nn.CrossEntropyLoss,
        # so padded positions do not contribute to the loss
        label_idx += [-100] * (self.max_len - len(label_idx))

        return torch.tensor(sentence_idx), torch.tensor(label_idx)

# Hyperparameters
embedding_dim = 10
hidden_dim = 20
num_labels = len(label_to_idx)
max_len = 5
batch_size = 2
learning_rate = 0.01
num_epochs = 10

# Create dataset and dataloader
dataset = SequenceDataset(data, word_to_idx, label_to_idx, max_len=max_len)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# RNN-based model for sequence labeling
class RNNSequenceLabelingModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels):
        super(RNNSequenceLabelingModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_labels)

    def forward(self, x):
        x = self.embedding(x)  # Convert words to embeddings
        rnn_out, _ = self.rnn(x)  # Pass embeddings through RNN
        logits = self.fc(rnn_out)  # Pass RNN output through the fully connected layer
        return logits

# Model, loss function, and optimizer
model = RNNSequenceLabelingModel(len(word_to_idx), embedding_dim, hidden_dim, num_labels)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    for sentences, labels in dataloader:
        optimizer.zero_grad()

        # Forward pass
        logits = model(sentences)

        # Reshape logits and labels to calculate loss
        logits = logits.view(-1, num_labels)  # Flatten logits to (batch_size * max_len, num_labels)
        labels = labels.view(-1)  # Flatten labels to (batch_size * max_len)

        # Compute loss
        loss = criterion(logits, labels)
        total_loss += loss.item()

        # Backpropagation and optimization
        loss.backward()
        optimizer.step()

    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss/len(dataloader):.4f}')

# Inference on a new sentence
def predict(model, sentence, word_to_idx, idx_to_label):
    model.eval()
    sentence_idx = [word_to_idx.get(word, 0) for word in sentence]
    sentence_idx += [word_to_idx['<PAD>']] * (max_len - len(sentence_idx))  # Pad sequence

    with torch.no_grad():
        inputs = torch.tensor(sentence_idx).unsqueeze(0)  # Add batch dimension
        logits = model(inputs)
        predictions = torch.argmax(logits, dim=2).squeeze(0)  # Get label predictions

    # Keep only the predictions for real tokens, not for the padding
    return [idx_to_label[pred.item()] for pred in predictions[:len(sentence)]]

# Example inference
sentence = ['A', 'bird', 'sings']
predicted_labels = predict(model, sentence, word_to_idx, idx_to_label)
print(f'Input: {sentence}')
print(f'Predicted Labels: {predicted_labels}')

### **Sequence Classification**
- **Task**: Assign a label to an entire sequence.
  - e.g., Sentiment analysis, document classification.
- **Process**:
  - Pass the sequence through an RNN, word-by-word.
  - Use **final hidden state** $h_n$ as a compressed representation of the entire sequence.

![text classification](./images/rnn/classification.png)

**Training for Sequence Classification**:
- Use a **feedforward network** with a softmax layer for classification.
- **End-to-End Training**:
  - Loss from the classification task is **backpropagated** through the entire network.
  - Cross-entropy loss drives training.

**Alternative Approach**:
- Instead of using the final hidden state $h_n$, use a **pooling function** (mean or max) over all hidden states $h_1, h_2, ..., h_n$.
  - $\displaystyle\bar{h} = \frac{1}{n}\sum_{i=1}^n h_i$
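
The pooling alternatives can be compared side by side. In this sketch `H` is an assumed matrix of hidden states with one row per time step; in a real model it would be the RNN output for one sentence.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))   # assumed: n = 6 hidden states of dimension 4

h_last = H[-1]                # final hidden state h_n
h_mean = H.mean(axis=0)       # mean pooling: (1/n) * sum_i h_i
h_max = H.max(axis=0)         # element-wise max pooling

print(h_last.shape, h_mean.shape, h_max.shape)   # (4,) (4,) (4,)
```

All three produce a vector of the hidden dimension, so the classifier that follows does not need to change.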

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# Sample data: sentences and their corresponding labels (e.g., sentiment: 0 for negative, 1 for positive)
data = [
    ("The movie was fantastic", 1),
    ("I did not like the film", 0),
    ("It was an amazing experience", 1),
    ("The plot was very boring", 0),
    ("The acting was great", 1),
    ("I would not recommend this movie", 0),
]

# Mapping words to indices
word_to_idx = {'<PAD>': 0}
idx = 1

for sentence, _ in data:
    for word in sentence.split():
        if word.lower() not in word_to_idx:
            word_to_idx[word.lower()] = idx
            idx += 1

# Dataset class for sentence classification
class SentenceDataset(Dataset):
    def __init__(self, data, word_to_idx, max_len=10):
        self.data = data
        self.word_to_idx = word_to_idx
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sentence, label = self.data[idx]
        sentence_idx = [self.word_to_idx[word.lower()] for word in sentence.split()]

        # Padding sequences to max length
        sentence_idx += [self.word_to_idx['<PAD>']] * (self.max_len - len(sentence_idx))
        sentence_idx = sentence_idx[:self.max_len]  # Truncate to max length if longer

        return torch.tensor(sentence_idx), torch.tensor(label)

# Hyperparameters
embedding_dim = 10
hidden_dim = 20
output_dim = 2  # Binary classification (positive, negative)
max_len = 10
batch_size = 2
learning_rate = 0.001
num_epochs = 10

# Create dataset and dataloader
dataset = SentenceDataset(data, word_to_idx, max_len=max_len)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# RNN-based model for sentence classification
class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(RNNClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

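    def forward(self, x):
        # Position of the last non-<PAD> token in each sentence (index 0 is <PAD>)
        positions = torch.arange(x.size(1), device=x.device)
        last_idx = ((x != 0) * positions).max(dim=1).values
        x = self.embedding(x)  # Convert words to embeddings
        rnn_out, _ = self.rnn(x)  # Pass embeddings through RNN
        # Take the hidden state at the last real token, not the one after the padding
        batch_idx = torch.arange(rnn_out.size(0), device=rnn_out.device)
        hidden_state = rnn_out[batch_idx, last_idx, :]
        logits = self.fc(hidden_state)  # Pass that hidden state through the fully connected layer
        return logits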

# Model, loss function, and optimizer
model = RNNClassifier(len(word_to_idx), embedding_dim, hidden_dim, output_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    correct = 0
    total = 0
    
    for sentences, labels in dataloader:
        optimizer.zero_grad()

        # Forward pass
        logits = model(sentences)
        
        # Compute loss
        loss = criterion(logits, labels)
        total_loss += loss.item()

        # Backpropagation and optimization
        loss.backward()
        optimizer.step()

        # Accuracy
        predicted = torch.argmax(logits, dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss/len(dataloader):.4f}, Accuracy: {accuracy:.2f}%')

# Inference on a new sentence
def predict(model, sentence, word_to_idx):
    model.eval()
    sentence_idx = [word_to_idx.get(word.lower(), 0) for word in sentence.split()]
    sentence_idx += [word_to_idx['<PAD>']] * (max_len - len(sentence_idx))  # Pad sequence
    sentence_idx = sentence_idx[:max_len]  # Truncate to max length if longer

    with torch.no_grad():
        inputs = torch.tensor(sentence_idx).unsqueeze(0)  # Add batch dimension
        logits = model(inputs)
        predicted_class = torch.argmax(logits, dim=1).item()

    return predicted_class

# Example inference
test_sentence = "The film was boring"
predicted_label = predict(model, test_sentence, word_to_idx)
label_map = {0: "Negative", 1: "Positive"}

print(f'Sentence: "{test_sentence}"')
print(f'Predicted Sentiment: {label_map[predicted_label]}')

### **Text Generation**
- RNN-based models can generate text word-by-word conditioned on some other text.
  - Applications: Machine translation, summarization, dialogue generation.
  
- **Autoregressive Generation**: Generates words by repeatedly sampling the next word conditioned on the model's previous choices.
  1. Start with a seed (e.g., `<s>`) or a richer task-appropriate context
     - e.g., questions in QA, documents to be summarized, etc.
  2. Generate the first word by sampling from the softmax output.
  3. Feed the generated word back into the RNN for the next time step.
  4. Continue until the end-of-sequence marker is generated or a fixed length is reached.
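
The loop in steps 1 to 4 can be sketched without a trained network. Here a hand-written table of next-word distributions stands in for the RNN's softmax output; the table, the helper name `generate`, and the `</s>` end marker are assumptions for illustration. A real RNN conditions on its full hidden state rather than only on the last word.

```python
import random

# Assumed stand-in for the model: previous word -> next-word distribution
NEXT_WORD_PROBS = {
    "<s>":    {"the": 0.6, "a": 0.4},
    "the":    {"cat": 0.5, "dog": 0.5},
    "a":      {"cat": 0.5, "dog": 0.5},
    "cat":    {"sleeps": 0.7, "</s>": 0.3},
    "dog":    {"sleeps": 0.7, "</s>": 0.3},
    "sleeps": {"</s>": 1.0},
}

def generate(max_len=10, seed=0):
    rng = random.Random(seed)
    word, output = "<s>", []                 # 1. start from the seed token
    for _ in range(max_len):                 # 4. stop at a fixed maximum length
        dist = NEXT_WORD_PROBS[word]
        # 2. sample the next word from the distribution (not the argmax)
        word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if word == "</s>":                   # 4. or stop at the end-of-sequence marker
            break
        output.append(word)                  # 3. the sampled word becomes the next input
    return output

print(" ".join(generate()))
```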

![autoregressive generation](./images/rnn/gen.png)

In [None]:
import torch
import torch.nn as nn
import numpy as np
import torch.optim as optim

# Load the text data
text_data = """
Your text data goes here. This is an example of text that will be used to train the RNN. 
You can replace this with any large text dataset to improve results.
"""

# Step 1: Data Preprocessing
# Tokenization and creating sequences
class TextPreprocessor:
    def __init__(self, text):
        self.text = text.lower().split()
        self.word2idx = {}
        self.idx2word = {}
        self.build_vocab()

    def build_vocab(self):
        words = sorted(set(self.text))  # Sort so that word indices are reproducible across runs
        for i, word in enumerate(words):
            self.word2idx[word] = i
            self.idx2word[i] = word

    def text_to_sequences(self):
        return [self.word2idx[word] for word in self.text]

    def get_vocab_size(self):
        return len(self.word2idx)

text_preprocessor = TextPreprocessor(text_data)
sequences = text_preprocessor.text_to_sequences()
vocab_size = text_preprocessor.get_vocab_size()

# Create input-output pairs
def create_sequences(data, seq_length):
    sequences = []
    targets = []
    for i in range(len(data) - seq_length):
        seq = data[i:i + seq_length]
        target = data[i + seq_length]
        sequences.append(seq)
        targets.append(target)
    return torch.tensor(sequences), torch.tensor(targets)

seq_length = 5  # Number of words to consider as input
X, y = create_sequences(sequences, seq_length)

# Step 2: Define the RNN Model
class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(RNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        x = self.embedding(x)
        output, hidden = self.rnn(x)
        output = self.fc(output[:, -1, :])  # Take the last output state
        return output

# Define model parameters
embedding_dim = 128
hidden_dim = 256
output_dim = vocab_size

# Create the model
model = RNN(vocab_size, embedding_dim, hidden_dim, output_dim)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Step 3: Train the Model
def train_model(model, X, y, epochs):
    model.train()
    for epoch in range(epochs):
        optimizer.zero_grad()
        output = model(X)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f'Epoch [{epoch}/{epochs}], Loss: {loss.item():.4f}')

# Train the model for 100 epochs
train_model(model, X, y, epochs=100)

# Step 4: Text Generation
def generate_text(seed_text, next_words, model, seq_length, vocab_size):
    model.eval()
    for _ in range(next_words):
        # Convert seed text to sequences
        seed_sequence = [text_preprocessor.word2idx[word] for word in seed_text.split()[-seq_length:]]
        seed_sequence = torch.tensor(seed_sequence).unsqueeze(0)

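        with torch.no_grad():
            output = model(seed_sequence)
            # Greedy decoding: always pick the highest-scoring word.
            # To sample instead, draw from torch.softmax(output, dim=1) with torch.multinomial.
            predicted_idx = torch.argmax(output, dim=1).item()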

        predicted_word = text_preprocessor.idx2word[predicted_idx]
        seed_text += " " + predicted_word
    return seed_text

# Generate text using seed text
seed_text = "your text text"
generated_text = generate_text(seed_text, next_words=20, model=model, seq_length=seq_length, vocab_size=vocab_size)
print(generated_text)

### **Problems with RNNs**
- **Distant Dependencies:**
  - The hidden state is updated at every step and mostly reflects the most recent inputs.
  - Hard to retain and recall distant information accurately.
  - 🍎 Given *The flights the airline was canceling were full.*
    - "was" agrees with the nearby singular "airline", but "were" must agree with the distant plural "flights", so the model has to carry that information across the intervening clause.
- **Vanishing Gradient**
  - RNNs struggle to carry forward critical information over long sequences.
  - During backpropagation through time, the error signal is multiplied by the recurrent weights once per time step.
  - After many such multiplications the gradients shrink toward zero: the `vanishing gradient problem`.
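
The shrinking can be observed directly in a small NumPy experiment. The setup is an assumption chosen to make the effect easy to see: a $\tanh$ RNN with zero input whose recurrent matrix is an orthogonal matrix scaled by 0.5, so each backward step can at most halve the gradient norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 50                                   # assumed hidden size and sequence length

# Recurrent matrix with all singular values equal to 0.5 (scaled orthogonal matrix)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
W_hh = 0.5 * Q

# Forward pass with zero input, keeping every hidden state
hs = [rng.normal(size=d)]
for _ in range(T):
    hs.append(np.tanh(W_hh @ hs[-1]))

# Backward pass: push a unit-norm gradient from step T back to step 0
grad = np.ones(d) / np.sqrt(d)
norms = [np.linalg.norm(grad)]
for t in range(T, 0, -1):
    # dh_t/dh_{t-1} = diag(1 - h_t^2) W_hh, so the gradient is multiplied by its transpose
    grad = W_hh.T @ ((1.0 - hs[t] ** 2) * grad)
    norms.append(np.linalg.norm(grad))

print(f"gradient norm at step T: {norms[0]:.3e}")
print(f"gradient norm at step 0: {norms[-1]:.3e}")
```

Scaling the matrix by a factor above 1 instead of 0.5 can produce the opposite behaviour, the exploding gradient, as long as the $\tanh$ units do not saturate.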

---

## Long Short-Term Memory (LSTM) Networks

- **Solution to RNN Issues**:
  - LSTMs manage context more effectively.
  - Learn to remember important information and forget irrelevant data.
- **Architecture**:
  - Adds an `explicit context layer` (the context $c_t$) alongside the hidden state.
  - Uses gates to control the flow of information.
    - Each gate consists of a feedforward layer,
      - followed by a sigmoid activation function (controls how far the gate opens)
      - followed by a pointwise multiplication $\odot$ with the layer being gated
- The standard unit for modern systems that use recurrent networks for sequence processing, as in NLP.

![A single LSTM unit displayed as a computation graph](./images/rnn/lstmunit.png)

### **LSTM Gate Functionalities**
- **Forget Gate**: Discards information from the context that is no longer needed.
  - $f_t = \sigma(U_f h_{t-1} + W_f x_t)$
  - $k_t = c_{t-1} \odot f_t$
- **Candidate Information**: Computed from the previous hidden state and the current input.
  - $g_t = \tanh(U_g h_{t-1} + W_g x_t)$
- **Add Gate (Input Gate)**: Adds the selected information to the context.
  - $i_t = \sigma(U_i h_{t-1} + W_i x_t)$
  - $j_t = g_t \odot i_t$
  - $c_t = j_t + k_t$
- **Output Gate**: Determines what information is used for the current output.
  - $o_t = \sigma(U_o h_{t-1} + W_o x_t)$
  - $h_t = o_t \odot \tanh(c_t)$
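
The gate equations translate line by line into a single LSTM step. The sketch below uses NumPy with random weights and assumed toy dimensions; the helper name `lstm_step` and the parameter dictionary are illustrative choices, and bias terms are omitted to match the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p is a dict holding the U_* and W_* weight matrices."""
    f_t = sigmoid(p["U_f"] @ h_prev + p["W_f"] @ x_t)   # forget gate
    k_t = c_prev * f_t                                  # context kept from the past
    g_t = np.tanh(p["U_g"] @ h_prev + p["W_g"] @ x_t)   # candidate information
    i_t = sigmoid(p["U_i"] @ h_prev + p["W_i"] @ x_t)   # add (input) gate
    j_t = g_t * i_t                                     # selected new information
    c_t = j_t + k_t                                     # updated context
    o_t = sigmoid(p["U_o"] @ h_prev + p["W_o"] @ x_t)   # output gate
    h_t = o_t * np.tanh(c_t)                            # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 3, 5                                        # assumed input and hidden sizes
params = {}
for gate in "fgio":
    params[f"U_{gate}"] = rng.normal(scale=0.5, size=(d_h, d_h))
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(d_h, d_in))

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(4, d_in)):                  # run over 4 time steps
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)   # (5,) (5,)
```

Note that the context `c` is updated only by element-wise scaling and addition, which is what lets information persist across many steps.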

### 🍎 Text generation using LSTM

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import random

# 1. Load and preprocess text data
def load_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read().lower()
    return text

# Sample text file
text = """The flights the airline was canceling were full. The sky was blue and clear. 
          The world of artificial intelligence is advancing rapidly."""

# 2. Tokenizing text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1

# Converting text to sequences
input_sequences = []
for line in text.split('.'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# 3. Padding sequences to ensure uniform length
max_sequence_len = max([len(seq) for seq in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# Split input and label
X = input_sequences[:, :-1]  # Input sequence
y = input_sequences[:, -1]   # Label (next word)
y = to_categorical(y, num_classes=total_words)  # Convert to one-hot encoding

# 4. Building the LSTM Model
model = Sequential()
model.add(Embedding(total_words, 100))  # Embedding layer; the input length is inferred from the data
model.add(LSTM(150, return_sequences=True))  # LSTM layer
model.add(LSTM(100))  # Another LSTM layer
model.add(Dense(total_words, activation='softmax'))  # Output layer

# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# 5. Train the model
history = model.fit(X, y, epochs=100, verbose=1)

# 6. Text generation function
def generate_text(seed_text, next_words, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        # Look up the word directly; index 0 is reserved for padding and has no word
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text

# 7. Generate new text
seed_text = "The airline"
next_words = 10
print(generate_text(seed_text, next_words, max_sequence_len))

## Encoder-Decoder Model with RNNs
- **Sequence Labeling:** Input and output sequences have the same length (e.g., POS tagging).  
- **Sequence-to-Sequence (Encoder-Decoder):** Input and output sequences differ in length and structure (e.g., machine translation).
  - **Encoder RNN $e(.)$:** Converts the input sequence $x_{1:n}$ into a contextualized representation $h^e_{1:n}$.
    - $h^e_{1:n} = e(x_{1:n})$
  - **Context Vector $c(.)$:** A function of $h^e_{1:n}$ that conveys the essence of the input sequence to the decoder.
    - $c = c(h^e_{1:n})$, for example $c = h^e_n$
  - **Decoder RNN $d(.)$:** Takes $c$ as input and generates the decoder hidden states $h^d_{1:m}$, from which the output sequence $y_{1:m}$ is obtained.
    - $h^d_{1:m} = d(c)$

![Four architectures for NLP tasks](./images/rnn/fourarch.png)

### **RNN Language Model**
- Models the probability $p(y)$ of a sequence $y$.  
  - $p(y) = p(y_1)p(y_2|y_1)p(y_3|y_1, y_2)...p(y_m|y_1,...,y_{m-1})$  
- **Autoregressive Generation:** Use hidden states $h_t$ to predict next token $\hat{y}_t$.
  - $h_t = g(h_{t-1}, x_t)$
  - $\hat{y}_t = \text{softmax} (h_t)$
- Transition to Encoder-Decoder Model 
  - **Modification:** Add a sentence separator (e.g., `<s>`) token to the input sequence $x$.  
  - **Sequence Translation:** Feed source text $x$, then `<s>`, then generate target text $y$ one token at a time.
    - $p(y|x) = p(y_1|x)p(y_2|y_1, x)p(y_3|y_1, y_2, x)...$
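
The chain-rule factorization above can be illustrated with a toy calculation. The sketch below assumes a hypothetical list of conditional probabilities (`cond_probs`, invented for illustration and not produced by any trained model). It sums log-probabilities instead of multiplying raw probabilities, a common way to avoid numerical underflow on long sequences.

```python
import math

# Hypothetical values of p(y_t | y_1, ..., y_{t-1}) that a language model
# might assign to the four tokens of one sentence (invented numbers).
cond_probs = [0.20, 0.50, 0.10, 0.80]

def sequence_log_prob(probs):
    """Return log p(y) = sum over t of log p(y_t | y_<t)."""
    return sum(math.log(p) for p in probs)

log_p = sequence_log_prob(cond_probs)
p = math.exp(log_p)  # equals the product of the individual factors

print(f"log p(y) = {log_p:.4f}, p(y) = {p:.4f}")
```

For translation, the same computation applies to $p(y|x)$: each factor is simply conditioned on the source text $x$ as well.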

![Encoder-decoder for translation](./images/rnn/edtran.png)
 
- **Process:** The encoder processes the English sentence, the context vector is passed to the decoder, and the Spanish sentence is generated step by step.

---

### **Encoder-Decoder Architecture with RNNs**
- **Encoder:** Generates hidden states for each token in the input.  
- **Decoder:** Autoregressively generates output using context $c$ and previously predicted tokens.  
- **Stacked Architectures:** Often use stacked biLSTMs for more complex representations.

![Formal enc-dec translator](./images/rnn/formed.png)

---

- **Decoder Process**
  - **Decoder Initialization:** 
    - $h_0^d = c = h_n^e$  
  - **Decoder State Update:**  
    - The context $c$ is made available as input at every decoding time step,
      - so its influence does not fade as the output sequence grows longer.
    - $h_t^d = g(\hat{y}_{t-1}, h_{t-1}^d, c)$  
  - **Output:**  
    - $\hat{y}_t = \text{softmax}(h_t^d)$ 
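
The three decoder equations above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a trained model: the weights are random, $g$ is taken to be a $\tanh$ of a linear combination, the toy sizes are arbitrary, and an output projection `W_o` (hypothetical, not shown in the simplified equations) maps the hidden state to vocabulary logits before the softmax. Decoding is greedy.

```python
import math
import random

random.seed(0)

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

vocab_size, hidden = 5, 4              # toy sizes (assumptions)
E = rand_matrix(vocab_size, hidden)    # embedding of the previous output token
W_h = rand_matrix(hidden, hidden)      # recurrent weights
W_c = rand_matrix(hidden, hidden)      # weights applied to the context c
W_o = rand_matrix(vocab_size, hidden)  # hypothetical output projection

def decode(c, max_len, start_token=0):
    h = list(c)                        # h_0^d = c = h_n^e
    y_prev = start_token
    outputs = []
    for _ in range(max_len):
        # h_t^d = g(y_{t-1}, h_{t-1}^d, c), with g = tanh of a linear combination
        pre = [e + a + b for e, a, b in
               zip(E[y_prev], matvec(W_h, h), matvec(W_c, c))]
        h = [math.tanh(x) for x in pre]
        probs = softmax(matvec(W_o, h))
        # Greedy choice: the most probable token becomes the next input
        y_prev = max(range(vocab_size), key=lambda i: probs[i])
        outputs.append(y_prev)
    return outputs

c = [0.1, -0.3, 0.7, 0.2]              # stand-in for the final encoder state h_n^e
tokens = decode(c, max_len=6)
print(tokens)
```

Note that `c` enters the update at every step, not only at initialization, which is the point of the state-update equation.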

## Training the Encoder-Decoder Model
![Training RNN encoder-decoder of machine translation](./images/rnn/trantrain.png)

- Encoder-decoder architectures are trained `end-to-end`.
- **Training Data:** Pairs of source and target sequences
  - e.g., machine translation datasets with aligned sentence pairs.  
- **Input to Encoder:** Source text with a separator token.  
- **Decoder Training:** Autoregressively predicts the next token starting from the separator.  
- **Inference vs. Training:**  
  - **Inference:** The decoder feeds its own previous prediction back in to generate the next token.
    - Early mistakes can compound, so the output may drift away from the correct translation.
  - **Training:** The decoder is forced to use the `correct target token` from the training data instead of its own previous prediction (`teacher forcing`).
    - **Benefit:** Speeds up training and reduces error accumulation.
- **Loss Computation**
  - **Softmax Output:** The decoder produces probabilities for each possible word.  
  - **Token-Level Loss:** Compare predicted vs. correct token.  
  - **Sentence-Level Loss:** Average loss over all tokens in the sentence.
- **Backpropagation Process:**  
  1. Compute token-level loss.  
  2. Backpropagate through the decoder.  
  3. Backpropagate through the encoder.  
- **Parameter Update:** Use gradient descent to optimize both encoder and decoder parameters, minimizing overall loss.
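
The loss computation and the use of gold tokens as decoder inputs (often called teacher forcing) can be sketched as follows. The vocabulary, target sentence, and softmax outputs are all hypothetical values chosen for illustration; a real system would obtain `step_probs` from the decoder and backpropagate through the resulting loss.

```python
import math

SEP = "<s>"                          # separator token, as in the notes above
vocab = ["la", "bruja", "verde"]     # toy target vocabulary (hypothetical)
gold = ["la", "bruja", "verde"]      # reference target sentence (hypothetical)

# Teacher forcing: the decoder input at step t is the gold token from
# step t-1, not the decoder's own previous prediction.
decoder_inputs = [SEP] + gold[:-1]

# Hypothetical softmax outputs over the vocabulary, one row per time step
step_probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.25, 0.25, 0.50],
]

# Token-level loss: negative log-probability assigned to the correct token
token_losses = [-math.log(probs[vocab.index(tok)])
                for probs, tok in zip(step_probs, gold)]

# Sentence-level loss: average of the token-level losses
sentence_loss = sum(token_losses) / len(token_losses)

print(decoder_inputs)
print([round(l, 4) for l in token_losses], round(sentence_loss, 4))
```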

## Attention
- **Bottleneck Problem**
  - In a simple encoder-decoder model, the decoder relies solely on the `final hidden state` $h_n^e$ of the encoder, creating a `bottleneck`.
  - Information at the beginning of the sentence may not be equally well represented in the context vector.

![encoder-decoder bottleneck](./images/rnn/bottleneck.png)

- **Attention Solution:** Attention removes this bottleneck by allowing the decoder to access `all encoder hidden states`, not just the final one.
  - Instead of a static context vector, attention creates a `dynamically generated context` vector $c$ that takes into account relevant parts of the input at each decoding step.
    - $c = f(h_1^e, h_2^e, ⋯, h_n^e)$

---

### **Attention Mechanism**
- The context vector $c_i$ is computed for each decoding step, based on a weighted sum of the encoder's hidden states.
- Weights reflect how relevant each encoder state is to the current decoding step $h_i^d$.
  - $h_i^d = g(\hat{y}_{i-1}, h^d_{i-1}, c_i)$
- The relevance score between each decoder hidden state $h^d_{i−1}$ and encoder hidden state $h^e_j$ can be calculated via **dot-product attention**:
  - $\text{score}(h^d_{i−1} , h^e_j ) = h^d_{i−1} · h^e_j$
- Apply a **softmax** to these scores to normalize the relevance weights $α_{ij}$, 
  - $α_{ij} = \text{softmax}(\text{score}(h^d_{i−1} , h^e_j ))$
    - $\displaystyle =\dfrac{\exp(\text{score}(h^d_{i−1}, h^e_j))}{∑_k \exp(\text{score}(h^d_{i−1}, h^e_k))}$
  - which are used to compute a fixed-length $c_i$ as a weighted sum of all encoder hidden states for the current decoder state:
    - $\displaystyle c_i=∑_j α_{ij}h_j^e$
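
The three steps above (dot-product scores, softmax normalization, weighted sum) can be sketched in plain Python. The encoder states and the previous decoder state are hand-picked toy vectors, assumed here only to make the arithmetic easy to follow.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(h_dec_prev, enc_states):
    """Return (alphas, context) for one decoding step."""
    # score(h^d_{i-1}, h^e_j) = h^d_{i-1} . h^e_j for every encoder state j
    scores = [dot(h_dec_prev, h_e) for h_e in enc_states]
    # alpha_ij: softmax over the encoder positions j
    alphas = softmax(scores)
    # c_i = sum_j alpha_ij * h^e_j (same length as one encoder state)
    dim = len(enc_states[0])
    context = [sum(a * h_e[k] for a, h_e in zip(alphas, enc_states))
               for k in range(dim)]
    return alphas, context

# Toy encoder hidden states h^e_1..h^e_3 and previous decoder state (assumptions)
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_dec_prev = [2.0, 0.0]

alphas, c_i = dot_product_attention(h_dec_prev, enc_states)
print([round(a, 4) for a in alphas], [round(x, 4) for x in c_i])
```

Here the first and third encoder states have the same dot product with the decoder state, so they receive equal weight, and the second state, which is orthogonal to it, receives the least.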

![encoder-decoder network with attention](./images/rnn/withattension.png)

---

### **Sophisticated Attention**
- More complex scoring functions can be used, like the **bilinear attention model**, where the score is parameterized by trainable weights $W_s$:
  - $\text{score}(h^d_{i−1} , h^e_j) = h^d_{i−1} W_s h^e_j$
- This allows for different dimensionalities between encoder and decoder states, unlike dot-product attention.
- The concept of attention is key to more advanced architectures, such as **self-attention** in transformers, which we’ll explore further.
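
The bilinear score can be sketched in a few lines. In practice $W_s$ is learned during training; the matrix below is fixed by hand (an assumption for illustration) so that the result is easy to check. The example uses a 2-dimensional decoder state and a 3-dimensional encoder state to show that the two sizes need not match.

```python
def bilinear_score(h_dec, W_s, h_enc):
    """Compute h_dec^T W_s h_enc, where W_s has shape (len(h_dec), len(h_enc))."""
    # First project the decoder state into the encoder's space: h_dec^T W_s
    projected = [sum(h_dec[r] * W_s[r][c] for r in range(len(h_dec)))
                 for c in range(len(h_enc))]
    # Then take an ordinary dot product with the encoder state
    return sum(p * e for p, e in zip(projected, h_enc))

h_dec = [1.0, 2.0]                 # decoder state, dimension 2 (toy values)
h_enc = [3.0, 4.0, 5.0]            # encoder state, dimension 3 (toy values)
W_s = [[1.0, 0.0, 0.0],            # hand-picked 2 x 3 weight matrix;
       [0.0, 1.0, 0.0]]            # a real W_s would be trained

score = bilinear_score(h_dec, W_s, h_enc)
print(score)

# With equal dimensions and W_s set to the identity, the bilinear score
# reduces to plain dot-product attention.
identity = [[1.0, 0.0], [0.0, 1.0]]
same_as_dot = bilinear_score([1.0, 2.0], identity, [3.0, 4.0])
print(same_as_dot)
```

As with dot-product attention, these scores would then be passed through a softmax to obtain the weights $α_{ij}$.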

