# 🗓️ June 19 – Sequence Modeling Foundations

---

## 🔹 1. Why Traditional Models Fail on Sequential Data

Traditional machine learning models (like Logistic Regression, Naive Bayes, and SVMs) process **text as a bag of words** or fixed-length feature vectors.

They ignore:
- **Word order**
- **Contextual meaning**
- **Temporal dependencies**

### ❌ Problem Example:
Consider:
- Sentence A: "I did not enjoy the movie."
- Sentence B: "I enjoyed the movie."

Both sentences share most of their words, but their **meanings are opposite**. A bag-of-words model only records which words occur, so it cannot tell that "not" negates "enjoy" and may assign the two sentences similar sentiment.
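To make the word-order point concrete, here is a minimal sketch in plain Python. The two sentences and the `bag_of_words` helper are illustrative assumptions, not part of the original notes; a real pipeline would normally use a proper vectorizer.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, split on whitespace, and count tokens (word order is discarded)."""
    return Counter(sentence.lower().split())

# Hypothetical sentences: identical words, opposite meaning.
a = "the movie was good not bad"
b = "the movie was bad not good"

bow_a = bag_of_words(a)
bow_b = bag_of_words(b)

print(bow_a == bow_b)  # True: both sentences get the same representation
```

Any classifier that only sees these counts has to give both sentences the same prediction.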

---

## 🔹 2. Recurrent Neural Networks (RNNs)

RNNs are designed to handle sequences by **remembering previous inputs** using a hidden state that gets updated at each step.

### 🔁 RNN Working:
At each time step `t`, it uses:
- Current input `xₜ`
- Previous hidden state `hₜ₋₁`

To compute:
```math
h_t = \tanh(W_x x_t + W_h h_{t-1} + b)
```

```python
# Simple RNN from scratch

import torch

# Simulated input: 3 time steps, 4 input features
x_seq = torch.randn(3, 4)
hidden_dim = 5
h_prev = torch.zeros(hidden_dim)

# Weights
Wx = torch.randn(4, hidden_dim)
Wh = torch.randn(hidden_dim, hidden_dim)
b = torch.randn(hidden_dim)

# RNN loop: the same weights are reused at every time step
print("Simple RNN Computation:\n")
for t in range(3):
    x_t = x_seq[t]
    h_prev = torch.tanh(x_t @ Wx + h_prev @ Wh + b)
    print(f"Time Step {t+1}: Hidden State = {h_prev.detach().numpy()}")
```

Output:

```text
Simple RNN Computation:

Time Step 1: Hidden State = [-0.49749747 -0.0036857  -0.39356345 -0.94412524  0.6471663 ]
Time Step 2: Hidden State = [-0.93307126  0.5306133   0.95593184 -0.9828039   0.99930495]
Time Step 3: Hidden State = [ 0.9712605   0.9459313   0.40194666 -0.99999994  0.97436684]
```


```python
# Built-in RNN layer in PyTorch

import torch
import torch.nn as nn

# Vanilla RNN layer:
#   input_size=10    -> each time step has 10 features
#   hidden_size=20   -> the hidden state at each step has 20 features
#   batch_first=True -> input/output tensors have shape (batch, seq_len, features)
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)

# Random input of shape (batch_size, seq_len, input_size) = (5, 3, 10):
#   5 sequences (like 5 sentences), 3 time steps each (like 3 words),
#   each step a 10-dimensional vector (like a word embedding)
x = torch.randn(5, 3, 10)

# Initial hidden state of shape (num_layers, batch_size, hidden_size) = (1, 5, 20).
# This layout is used even when batch_first=True.
h0 = torch.zeros(1, 5, 20)

# out: hidden state at every time step -> (batch_size, seq_len, hidden_size) = (5, 3, 20)
# hn:  hidden state at the final step  -> (num_layers, batch_size, hidden_size) = (1, 5, 20)
out, hn = rnn(x, h0)

print("Output shape:", out.shape)  # (batch_size, seq_len, hidden_size)
```

Output:

```text
Output shape: torch.Size([5, 3, 20])
```
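The relationship between `out` and `hn` is easy to check directly. The sketch below assumes a single-layer, unidirectional `nn.RNN` like the one above (the seed is only there for repeatability): the last time step of `out` should match `hn[0]`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only for repeatable random inputs

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(5, 3, 10)      # (batch, seq_len, input_size)

out, hn = rnn(x)               # h0 defaults to zeros when it is omitted

last_step = out[:, -1, :]      # hidden state at the final time step, shape (5, 20)
final_hidden = hn[0]           # the only layer's final hidden state, shape (5, 20)

print(torch.allclose(last_step, final_hidden))  # True
```

For a classifier that needs one vector per sequence, either tensor can be used in this single-layer, unidirectional case.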


### Advantages of RNNs
- ✅ Learns from arbitrary-length sequences
- ✅ Shared weights across time steps
- ✅ Can model sequential dependencies (to some extent)

### Limitations of RNNs
- ❌ Sequential computation → slow to train
- ❌ Short-term memory → loses info over long sequences
- ❌ Suffers from vanishing gradients when backpropagating over time

These limitations led to improved models: LSTMs and GRUs.

## 🔹 3. LSTM – Long Short-Term Memory (Preview)
LSTMs improve upon RNNs by using:

- A **cell state** to preserve long-term information
- **Gates** to decide what to keep, forget, and output

They are designed to mitigate vanishing gradients and capture long-range dependencies.


## Extra: From Neural Dependency Parsing Lecture
Neural parsers use dense, learned representations instead of symbolic grammar rules.

RNNs (and later LSTMs) help in modeling dependencies in sentences efficiently and accurately.

These architectures form the base for sequence models in modern NLP.

## 📌 Recap

| Concept            | Summary                            |
| ------------------ | ---------------------------------- |
| Traditional Models | Lose order/context                 |
| RNNs               | Learn from sequences               |
| PyTorch RNN        | Easy implementation with `nn.RNN`  |
| Limitations        | Vanishing gradients, slow training |
| Solution Preview   | LSTMs, GRUs                        |

# 🧠 June 20–21 – LSTMs, Bidirectional LSTMs, and Attention Mechanisms

---

## 🔹 1. Language Modeling and RNNs in NLP

### 🧠 Language Modeling Task:
Given a sequence of words, predict the next word:
> Input: "I am going to the" → Predict: "store"

### ❌ Traditional models (e.g., n-grams) are limited by:
- Fixed context window
- Inability to generalize across similar patterns

### ✅ RNNs improve by:
- Maintaining a **hidden state** that captures prior context
- Learning dependencies from previous words
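The fixed-context limitation of n-grams can be seen in a tiny bigram model. This is a sketch in plain Python; the toy corpus and the `predict_next` helper are hypothetical and only meant to show that the prediction depends on the last word alone.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus
corpus = [
    "i am going to the store",
    "i am going to the park",
    "she went to the store",
]

# Count how often each word follows each previous word
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

def predict_next(context):
    """Predict the next word from the LAST word of the context only."""
    last_word = context.split()[-1]
    candidates = bigram_counts[last_word]
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("i am going to the"))             # store
print(predict_next("yesterday the cat sat on the"))  # store (earlier words are ignored)
```

An RNN, by contrast, feeds every previous word through its hidden state, so in principle the whole prefix can influence the prediction.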

---

## 🔧 RNN in Language Modeling – Code Recap

```python
import torch
import torch.nn as nn

# RNN layer
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(5, 3, 10)        # batch_size=5, seq_len=3, input_dim=10
h0 = torch.zeros(1, 5, 20)       # num_layers=1, batch_size=5, hidden_dim=20

out, hn = rnn(x, h0)
print(out.shape)  # (5, 3, 20)
```


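## ⚠️ 2. Challenges in RNNs
### ❌ Vanishing Gradients
During backpropagation through time, the gradient is multiplied by one factor per time step. When those factors are smaller than 1, the gradient that reaches early time steps shrinks roughly exponentially with sequence length.

As a result, the model "forgets" earlier parts of long sequences.

🧪 Example:
- Sentence: "The movie was not good."
- An RNN might remember only "good" and forget the earlier "not".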
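The shrinking gradient can be measured on a toy recurrence. This sketch rests on simplifying assumptions that are not in the notes: the recurrent matrix `W_h` is fixed to `0.5 * I`, and the inputs are random vectors already of hidden size. It reports the norm of the gradient of the final hidden state with respect to the initial one.

```python
import torch

torch.manual_seed(0)
hidden_dim = 8
W_h = 0.5 * torch.eye(hidden_dim)     # assumed recurrent weights (illustrative)
x_seq = torch.randn(20, hidden_dim)   # assumed inputs, already of hidden size

def grad_norm_wrt_h0(seq_len):
    """Norm of d(sum of h_T)/d(h_0) after unrolling seq_len tanh steps."""
    h0 = torch.zeros(hidden_dim, requires_grad=True)
    h = h0
    for t in range(seq_len):
        h = torch.tanh(x_seq[t] + h @ W_h)
    h.sum().backward()
    return h0.grad.norm().item()

norms = {T: grad_norm_wrt_h0(T) for T in (1, 5, 10, 20)}
for T, n in norms.items():
    print(f"seq_len={T:2d}  grad norm wrt h0 = {n:.3e}")
```

Each step multiplies the gradient by at most 0.5 here, so the signal from the first step is already tiny after 20 steps. Trained RNNs are less extreme, but the same mechanism applies.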

## 🔹 3. Long Short-Term Memory (LSTM) Networks
LSTMs mitigate vanishing gradients by using gates that control how information flows through memory:

- **Forget Gate** – what to discard from the cell state
- **Input Gate** – what new information to store
- **Output Gate** – what to expose as the hidden state

### 🧠 Example:
Input: "I lived in France for two years, so I speak French fluently."

An LSTM can retain "France" until "fluently" thanks to its gated memory control.
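The three gates can be written out as a single LSTM step. This is a minimal sketch of the standard LSTM update, not the internals of `nn.LSTM`: the packed weight matrix `W`, the gate ordering, and the small random initialization are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 4, 3

# One packed weight matrix for all four blocks (assumed layout: f, i, o, g)
W = 0.1 * torch.randn(input_dim + hidden_dim, 4 * hidden_dim)
b = torch.zeros(4 * hidden_dim)

def lstm_step(x_t, h_prev, c_prev):
    z = torch.cat([x_t, h_prev]) @ W + b
    f, i, o, g = z.chunk(4)
    f = torch.sigmoid(f)           # forget gate: how much of c_prev to keep
    i = torch.sigmoid(i)           # input gate: how much new content to store
    o = torch.sigmoid(o)           # output gate: how much of the cell to expose
    g = torch.tanh(g)              # candidate content
    c_t = f * c_prev + i * g       # additive cell-state update
    h_t = o * torch.tanh(c_t)      # hidden state passed to the next step
    return h_t, c_t

h = torch.zeros(hidden_dim)
c = torch.zeros(hidden_dim)
for x_t in torch.randn(6, input_dim):   # 6 time steps of random input
    h, c = lstm_step(x_t, h, c)

print(h.shape, c.shape)
```

The line `c_t = f * c_prev + i * g` is the key design choice: when the forget gate stays close to 1, the cell state is carried forward almost unchanged, which gives gradients a more direct path through time.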

### 🔧 Code: Simple LSTM Model in PyTorch

```python
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(5, 3, 10)         # batch_size=5, seq_len=3
h0 = torch.zeros(1, 5, 20)        # (num_layers, batch_size, hidden_dim)
c0 = torch.zeros(1, 5, 20)        # cell state

out, (hn, cn) = lstm(x, (h0, c0))

print(out.shape)   # Shape: (5, 3, 20)
```

Output:

```text
torch.Size([5, 3, 20])
```


## 🔹 4. Bidirectional LSTM
A BiLSTM reads the sequence both forward and backward, capturing past and future context at every position.

### ✅ Useful for:
- Named Entity Recognition
- Sentiment Analysis
- Any task needing context from both sides

### 🔧 Code: BiLSTM in PyTorch

```python
bilstm = nn.LSTM(input_size=10, hidden_size=20, bidirectional=True, batch_first=True)

x = torch.randn(5, 3, 10)
h0 = torch.zeros(2, 5, 20)  # 2 for bidirectional
c0 = torch.zeros(2, 5, 20)

out, (hn, cn) = bilstm(x, (h0, c0))
print(out.shape)  # Shape: (5, 3, 40) → 20 for forward + 20 for backward
```

Output:

```text
torch.Size([5, 3, 40])
```
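It helps to see where each direction's final state sits in `out`. The sketch below assumes a single-layer bidirectional `nn.LSTM`: the forward direction finishes at the last time step, while the backward direction finishes at the first one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 20
bilstm = nn.LSTM(input_size=10, hidden_size=hidden, bidirectional=True, batch_first=True)

x = torch.randn(5, 3, 10)
out, (hn, cn) = bilstm(x)             # initial states default to zeros

forward_final = out[:, -1, :hidden]   # forward half at the last time step
backward_final = out[:, 0, hidden:]   # backward half at the first time step

print(torch.allclose(forward_final, hn[0]))   # True
print(torch.allclose(backward_final, hn[1]))  # True

# One vector per sequence that summarizes both directions
sentence_vec = torch.cat([hn[0], hn[1]], dim=1)
print(sentence_vec.shape)  # torch.Size([5, 40])
```

Taking `out[:, -1, :]` alone would pair the forward direction's full summary with a backward state that has only seen the last token, so concatenating `hn[0]` and `hn[1]` is usually the safer choice for classification.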


## 🔹 5. Attention Mechanism (June 21)
RNNs (even LSTMs) may still struggle with very long sequences, because everything the model has read must be compressed into a single fixed-size hidden state.

### 🧠 Attention helps by:
- Learning which parts of the sequence are important
- Computing a weighted average of all hidden states

### 🔧 Custom Attention Layer in PyTorch

```python
class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.attn = nn.Linear(hidden_dim, 1)

    def forward(self, lstm_out):
        # lstm_out: [batch, seq_len, hidden_dim]
        scores = self.attn(lstm_out).squeeze(-1)         # [batch, seq_len]
        weights = torch.softmax(scores, dim=1)           # [batch, seq_len]
        context = torch.sum(lstm_out * weights.unsqueeze(-1), dim=1)  # [batch, hidden_dim]
        return context, weights
```
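The same computation can be traced on a small tensor without the module wrapper. In this sketch the random `score_vec` stands in for the learned weights of `nn.Linear(hidden_dim, 1)`, and the tensor sizes are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
lstm_out = torch.randn(2, 4, 6)   # [batch=2, seq_len=4, hidden_dim=6]
score_vec = torch.randn(6)        # stand-in for the learned scoring weights

scores = lstm_out @ score_vec                             # [2, 4] one score per time step
weights = torch.softmax(scores, dim=1)                    # [2, 4] each row sums to 1
context = (lstm_out * weights.unsqueeze(-1)).sum(dim=1)   # [2, 6] weighted average

# Equivalent formulation as a batched matrix product
context_bmm = torch.bmm(weights.unsqueeze(1), lstm_out).squeeze(1)

print(weights.sum(dim=1))                     # each entry is 1 (up to rounding)
print(torch.allclose(context, context_bmm))   # True
```

Because the softmax runs over `dim=1` (the time axis), the weights form a probability distribution over positions, and they can be inspected to see which tokens the model relied on.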


### Combine: BiLSTM + Attention Classifier

```python
class BiLSTM_Attention(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.attn = Attention(hidden_dim * 2)   # BiLSTM output has 2 * hidden_dim features
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)              # [batch, seq_len, hidden_dim * 2]
        context, _ = self.attn(lstm_out)        # [batch, hidden_dim * 2]
        return self.fc(context)                 # [batch, output_dim]
```


## 📊 6. Applications of LSTM in NLP
| Task                | Example                               |
| ------------------- | ------------------------------------- |
| Language Modeling   | Predict next word                     |
| Sentiment Analysis  | Positive or negative sentence         |
| Text Generation     | "Once upon a" → "time there was a..." |
| Machine Translation | English → French                      |
| Speech Recognition  | Audio to text                         |


## 🔄 7. RNNs vs. LSTMs vs. Transformers

| Feature                          | RNN   | LSTM | Transformer |
| -------------------------------- | ----- | ---- | ----------- |
| Memory of past                   | Short | Long | Global      |
| Parallelizable across time steps | ❌    | ❌   | ✅          |
| Handles long sequences           | ❌    | ✅   | ✅          |
| SOTA Performance                 | ❌    | ✅   | ✅✅        |


## 📌 Recap

| Concept        | Summary                                          |
| -------------- | ------------------------------------------------ |
| RNN            | Sequence model with short-term memory            |
| LSTM           | Handles long-term dependencies via gates         |
| BiLSTM         | Reads input forwards and backwards               |
| Attention      | Learns to focus on important parts of the input  |
| Real-world Use | Sentiment analysis, text generation, translation |


# 🧠 June 22–23 – GRUs, Custom Data Handling, and Text Classification

---

## 🔹 June 22 – GRUs and Data Handling

---

## 1. Gated Recurrent Units (GRUs)

GRUs are a simplified variant of LSTMs with **fewer gates**, which makes them cheaper to compute and often faster to train while still mitigating the vanishing gradient problem.

### 🎯 Key Differences (vs LSTM):

| Feature     | LSTM                           | GRU                                                     |
|-------------|--------------------------------|---------------------------------------------------------|
| Gates       | 3 (Forget, Input, Output)      | 2 (Update, Reset)                                       |
| Cell State  | Separate cell and hidden state | Combined into a single hidden state                     |
| Complexity  | Higher                         | Lower                                                   |
| Performance | Similar to GRU on many tasks   | Similar to LSTM; may perform better on smaller datasets |
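The "Complexity" row can be checked by counting parameters. This sketch uses the same layer sizes as the rest of these notes (`input_size=10`, `hidden_size=20`); `count_params` is a small helper introduced here for illustration.

```python
import torch.nn as nn

def count_params(module):
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
gru = nn.GRU(input_size=10, hidden_size=20, batch_first=True)
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)

print("RNN :", count_params(rnn))    # 640
print("GRU :", count_params(gru))    # 1920
print("LSTM:", count_params(lstm))   # 2560
```

The vanilla RNN has one weight block of 20×10 + 20×20 weights plus two bias vectors of size 20, which gives 640 parameters. The GRU needs three such blocks (two gates plus the candidate state) and the LSTM four (three gates plus the candidate), which gives the 3:4 ratio between them.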

---

### 🧠 GRU Equations

```math
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\big(W x_t + U (r_t \odot h_{t-1})\big) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(5, 3, 10)      # batch_size = 5, seq_len = 3, input_size = 10
h0 = torch.zeros(1, 5, 20)     # num_layers = 1, batch_size = 5, hidden_size = 20

out, hn = gru(x, h0)
print(out.shape)  # torch.Size([5, 3, 20])
```

Output:

```text
torch.Size([5, 3, 20])
```
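The GRU equations above translate almost line by line into code. This is a sketch under stated assumptions: the weight names mirror the equations, biases are omitted as in the equations, and the small random initialization is arbitrary. `nn.GRU` uses a slightly different convention (the roles of `z` and `1 - z` are swapped, and the reset gate is applied after the hidden-state projection), so this will not reproduce its numbers exactly.

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 4, 3

def init(rows, cols):
    return 0.1 * torch.randn(rows, cols)   # arbitrary small initialization

W_z, U_z = init(input_dim, hidden_dim), init(hidden_dim, hidden_dim)
W_r, U_r = init(input_dim, hidden_dim), init(hidden_dim, hidden_dim)
W_h, U_h = init(input_dim, hidden_dim), init(hidden_dim, hidden_dim)

def gru_step(x_t, h_prev):
    z = torch.sigmoid(x_t @ W_z + h_prev @ U_z)            # update gate
    r = torch.sigmoid(x_t @ W_r + h_prev @ U_r)            # reset gate
    h_tilde = torch.tanh(x_t @ W_h + (r * h_prev) @ U_h)   # candidate state
    return (1 - z) * h_prev + z * h_tilde                  # blend old and new

h = torch.zeros(hidden_dim)
for x_t in torch.randn(5, input_dim):   # 5 time steps of random input
    h = gru_step(x_t, h)

print(h.shape)  # torch.Size([3])
```

When `z` is close to 0 the previous state is copied forward almost unchanged, which plays the same role as the LSTM's cell state without needing a second state vector.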


## 2. Custom Data Handling with torchtext
We'll use torchtext to load and process the IMDb sentiment dataset. Note that torchtext is no longer actively developed, so this code assumes an older torchtext release that is compatible with the installed PyTorch version.

```python
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")

# Load training dataset
train_iter = IMDB(split='train')

# Tokenize + build vocab
def yield_tokens(data_iter):
    for label, line in data_iter:
        yield tokenizer(line)

# "<unk>" gets index 0 and is used for any token missing from the vocab
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
```


### 🔁 Encode Sentence Example

```python
text = "The movie was great!"
tokens = tokenizer(text)
encoded = vocab(tokens)
print("Tokens:", tokens)
print("Encoded:", encoded)
```

Output:

```text
Tokens: ['the', 'movie', 'was', 'great', '!']
Encoded: [1, 20, 16, 92, 35]
```
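What the vocabulary does can also be sketched without torchtext. The code below is a plain-Python approximation under clear assumptions: `simple_tokenize` is a crude whitespace tokenizer (not `basic_english`), the corpus is a made-up toy, and indices are assigned by descending frequency with `<unk>` at index 0.

```python
from collections import Counter

def simple_tokenize(text):
    """Crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

# Hypothetical toy corpus
corpus = ["the movie was great", "the movie was bad", "great acting"]

counts = Counter(tok for line in corpus for tok in simple_tokenize(line))

# Index 0 is reserved for unknown tokens; the rest are ordered by frequency
itos = ["<unk>"] + [tok for tok, _ in counts.most_common()]
stoi = {tok: idx for idx, tok in enumerate(itos)}

def encode(text):
    return [stoi.get(tok, stoi["<unk>"]) for tok in simple_tokenize(text)]

print(encode("the movie was terrible"))  # [1, 2, 3, 0]
```

"terrible" never appears in the toy corpus, so it maps to index 0, which is the same job `vocab.set_default_index(vocab["<unk>"])` does above.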


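### 🧩 Prepare Batches (basic version)

```python
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    labels, texts = [], []
    for label, text in batch:
        # Assumption: older torchtext releases yield "pos"/"neg" strings,
        # newer ones yield the integers 2 (pos) and 1 (neg); accept both.
        labels.append(1 if label in ("pos", 2) else 0)
        processed = torch.tensor(vocab(tokenizer(text)), dtype=torch.int64)
        texts.append(processed)
    # Pad with zeros so all sequences in the batch share one length
    texts = pad_sequence(texts, batch_first=True)
    return torch.tensor(labels), texts
```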
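The padding step is easiest to understand on a few short sequences. This sketch uses made-up token ids; the `lengths` tensor and the boolean `mask` are additions for illustration that `collate_batch` does not return.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three encoded "sentences" of different lengths (made-up token ids)
seqs = [
    torch.tensor([1, 20, 16]),
    torch.tensor([7, 3, 9, 4, 2]),
    torch.tensor([5]),
]

padded = pad_sequence(seqs, batch_first=True, padding_value=0)
lengths = torch.tensor([len(s) for s in seqs])

# True where a position holds a real token, False where it is padding
mask = torch.arange(padded.size(1)).unsqueeze(0) < lengths.unsqueeze(1)

print(padded)
# tensor([[ 1, 20, 16,  0,  0],
#         [ 7,  3,  9,  4,  2],
#         [ 5,  0,  0,  0,  0]])
print(mask.sum(dim=1))  # tensor([3, 5, 1])
```

One caveat: the padding value 0 is also the index of `<unk>` in the vocabulary built above, so the model cannot distinguish padding from unknown words. A common remedy is to add a dedicated `<pad>` special token and pass its index as `padding_value`.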


## 🔹 June 23 – Apply Models to Text Classification
## 1. Sentiment Analysis Task: IMDb
Binary classification task:

- pos → 1
- neg → 0

## 🧠 Model Architecture (BiLSTM or GRU + FC)

```python
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Bidirectional GRU output has 2 * hidden_dim features per time step
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, x):
        x = self.embedding(x)                # [batch, seq_len, embed_dim]
        gru_out, _ = self.gru(x)             # [batch, seq_len, hidden_dim * 2]
        pooled = torch.mean(gru_out, dim=1)  # mean over time steps (padding included)
        return self.fc(pooled)               # [batch, output_dim] logits
```
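The mean pooling in `TextClassifier` averages over padded positions too, which can dilute short reviews in a batch with long ones. One possible refinement is a masked mean, sketched below. It assumes the true sequence lengths are available; `masked_mean` and the `lengths` argument are hypothetical additions, since `collate_batch` above does not return lengths.

```python
import torch

def masked_mean(outputs, lengths):
    """Average over the real (non-padded) time steps of each sequence.

    outputs: [batch, seq_len, dim]
    lengths: [batch] true lengths, each assumed to be >= 1
    """
    positions = torch.arange(outputs.size(1), device=outputs.device)
    mask = positions.unsqueeze(0) < lengths.unsqueeze(1)    # [batch, seq_len]
    mask = mask.unsqueeze(-1).to(outputs.dtype)             # [batch, seq_len, 1]
    summed = (outputs * mask).sum(dim=1)                    # padded steps contribute 0
    return summed / lengths.unsqueeze(1).to(outputs.dtype)

# Toy "GRU outputs": the 9.0 rows stand in for values at padded positions
outputs = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [[5.0, 6.0], [9.0, 9.0], [9.0, 9.0]],
])
lengths = torch.tensor([2, 1])

print(masked_mean(outputs, lengths))
# tensor([[2., 3.],
#         [5., 6.]])
```

A plain `torch.mean(outputs, dim=1)` on the same tensor would mix the 9.0 rows into both averages.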


## 2. Evaluation Metrics

```python
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

def evaluate(model, dataloader):
    model.eval()                      # disable dropout etc. for evaluation
    preds, labels = [], []
    with torch.no_grad():             # no gradients needed at evaluation time
        for label, text in dataloader:
            output = model(text)                # [batch, num_classes] logits
            pred = torch.argmax(output, dim=1)  # predicted class per example
            preds.extend(pred.tolist())
            labels.extend(label.tolist())

    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds)
    cm = confusion_matrix(labels, preds)
    return acc, f1, cm
```
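To see what these three metrics measure, here is a from-scratch computation on a toy example in plain Python. The labels and predictions are made up; the confusion matrix follows the layout scikit-learn uses for binary labels (rows are true classes, columns are predicted classes).

```python
# Made-up ground truth and predictions for six reviews (1 = pos, 0 = neg)
labels = [1, 0, 1, 1, 0, 0]
preds  = [1, 0, 0, 1, 1, 0]

pairs = list(zip(labels, preds))
tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

confusion = [[tn, fp],
             [fn, tp]]

print(f"accuracy={accuracy:.3f}  f1={f1:.3f}")  # accuracy=0.667  f1=0.667
print(confusion)                                # [[2, 1], [1, 2]]
```

Accuracy and F1 coincide here only because the toy data is balanced and the two error types are equally frequent. On imbalanced data they can differ substantially, which is why `evaluate` reports both.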


## ✅ Summary
| Concept               | Summary                              |
| --------------------- | ------------------------------------ |
| GRU                   | Efficient memory-based RNN           |
| torchtext             | Handles tokenization, vocab, batches |
| BiLSTM/GRU Classifier | Predicts sentiment                   |
| Evaluation            | Accuracy, F1, Confusion Matrix       |
