# **Sequence Modeling with Recurrent Neural Networks (RNN)**


## 1. Introduction to Sequence Modeling

**Sequence modeling** involves predicting or understanding data where the **order of elements matters**. Examples include:

* Time series forecasting: stock prices, weather
* Natural language processing: text, speech
* Signal processing: audio, sensor readings
* DNA sequence analysis: genomics

**Challenge:** A traditional feedforward network maps a fixed-size input to an output and carries no state from one input to the next, so on its own it cannot capture **temporal dependencies**.



## 2. Recurrent Neural Networks (RNN)

RNNs are designed for **sequence data**:

* They have **hidden states** that are updated at each time step.
* This allows RNNs to **remember information** from previous steps.
* Formally:

$$
h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
$$

$$
y_t = g(W_{hy} h_t + b_y)
$$

Where:

* $x_t$ = input at time $t$
* $h_t$ = hidden state at time $t$
* $y_t$ = output at time $t$
* $W_{xh}, W_{hh}, W_{hy}$ = weight matrices
* $b_h, b_y$ = bias vectors
* $f, g$ = activation functions (e.g., tanh for $f$, softmax for $g$)

**Key idea:** The hidden state $h_t$ acts as a **memory** carrying information from previous inputs.
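
The two update equations can be traced step by step in NumPy. This is a minimal sketch, assuming $f = \tanh$, $g = \text{softmax}$, a zero initial hidden state, and small random weights; the sizes and the helper name `rnn_forward` are illustrative choices, not part of the original notes.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over xs of shape (T, input_size)."""
    h = np.zeros(W_hh.shape[0])                     # h_0, assumed to be all zeros
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # h_t = f(W_xh x_t + W_hh h_{t-1} + b_h)
        logits = W_hy @ h + b_y
        y = np.exp(logits - logits.max())           # numerically stable softmax
        y /= y.sum()                                # y_t = g(W_hy h_t + b_y)
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
T, input_size, hidden_size, output_size = 5, 3, 4, 2   # illustrative sizes
xs = rng.normal(size=(T, input_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape, ys.shape)   # (5, 4) (5, 2)
```

Note that the same three weight matrices are reused at every time step; only `h` changes as the sequence is read.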



## 3. Sequence Modeling Problem Example

**Problem:** Predict the next word in a sentence (language modeling).

**Input Sequence:**

```
"The cat sat on the"
```

**Target Output:**

```
"mat"
```

* Each word is encoded as a vector (one-hot or embedding).
* The RNN reads the words **one at a time** and updates its hidden state after each word.
* At the last step, the output layer predicts the next word from the final hidden state.



## 4. Steps to Solve Sequence Modeling with RNN

### Step 1: Prepare the Data

* Convert sequence to numerical representation (tokenization or embedding).
* Split into input-output pairs:

| Input Sequence | Target |
| -------------- | ------ |
| The            | cat    |
| The cat        | sat    |
| The cat sat    | on     |
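
The table above can be generated programmatically. A minimal sketch, assuming naive lowercase whitespace tokenization and a vocabulary built from the sentence itself; `make_next_word_pairs` is a hypothetical helper name.

```python
def make_next_word_pairs(sentence):
    """Return (token ids, vocabulary, (prefix, next-token) pairs) for one sentence."""
    tokens = sentence.lower().split()          # naive whitespace tokenization
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))      # assign ids in order of first appearance
    ids = [vocab[t] for t in tokens]
    # Every prefix of the sentence becomes an input; the following token is its target.
    pairs = [(ids[:i], ids[i]) for i in range(1, len(ids))]
    return ids, vocab, pairs

ids, vocab, pairs = make_next_word_pairs("The cat sat on the mat")
print(vocab)   # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
for prefix, target in pairs:
    print(prefix, "->", target)
# [0] -> 1
# [0, 1] -> 2
# [0, 1, 2] -> 3
# [0, 1, 2, 3] -> 0
# [0, 1, 2, 3, 0] -> 4
```

The prefixes have different lengths, so batching them in practice needs padding or fixed-length windows.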

### Step 2: Build RNN Model

```python
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        
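    def forward(self, x):
        # x: (batch, seq_len, input_size) because batch_first=True
        # Initial hidden state: (num_layers, batch, hidden_size), created on x's device and dtype
        h0 = torch.zeros(1, x.size(0), self.hidden_size, device=x.device, dtype=x.dtype)
        out, _ = self.rnn(x, h0)      # out: (batch, seq_len, hidden_size)
        out = self.fc(out[:, -1, :])  # take last time step
        return out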
```

### Step 3: Train the Model

* Loss function: Cross-Entropy Loss (for classification)
* Optimizer: Adam / SGD
* Feed input sequences → predict next element → compute loss → backpropagate
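
The training recipe above might look like the following in PyTorch. This is a sketch under stated assumptions: the `TinyRNN` model, the toy token ids, the window length of 3, and the hyperparameters are all illustrative and not taken from the original notes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyRNN(nn.Module):
    """Hypothetical next-token model: embedding -> RNN -> linear layer."""
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids):
        out, _ = self.rnn(self.embed(token_ids))   # initial hidden state defaults to zeros
        return self.fc(out[:, -1, :])              # logits for the next token

# Toy data: ids for "the cat sat on the mat", cut into windows of 3 tokens
ids = [0, 1, 2, 3, 0, 4]
window = 3
inputs = torch.tensor([ids[i:i + window] for i in range(len(ids) - window)])
targets = torch.tensor([ids[i + window] for i in range(len(ids) - window)])

model = TinyRNN(vocab_size=5, embed_size=8, hidden_size=16)
criterion = nn.CrossEntropyLoss()                  # expects raw logits and class indices
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for epoch in range(200):
    optimizer.zero_grad()                          # clear gradients from the previous step
    logits = model(inputs)                         # feed input sequences
    loss = criterion(logits, targets)              # compare predictions with next tokens
    loss.backward()                                # backpropagation through time
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

`nn.CrossEntropyLoss` applies the softmax internally, so the model returns logits rather than probabilities.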



## 5. Example Applications of RNN in Sequence Modeling

| Application             | Input               | Output          |
| ----------------------- | ------------------- | --------------- |
| Language Modeling       | Words in a sentence | Next word       |
| Time Series Forecasting | Stock prices        | Future price    |
| Speech Recognition      | Audio frames        | Text transcript |
| Machine Translation     | Source sentence     | Target sentence |
| Music Generation        | Note sequence       | Next note       |



## 6. Advantages of RNNs

* Can process **arbitrary-length sequences**
* Maintains **temporal dependencies** via hidden states
* Suitable for sequential predictions



## 7. Limitations

* **Vanishing / exploding gradients** for long sequences
* Hard to capture **long-range dependencies**
* Training can be slow

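**Solution:** Use **LSTM** or **GRU** networks, RNN variants whose gating mitigates vanishing gradients and makes long-range dependencies easier to learn. Exploding gradients are usually handled separately with gradient clipping.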
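
The vanishing-gradient limitation can be illustrated numerically. In the sketch below, the recurrent matrix is rescaled so that its largest singular value is 0.9 (an assumption chosen to force the effect), inputs are omitted for brevity, and a gradient is pushed backwards through 50 tanh steps.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, T = 8, 50

W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)      # largest singular value becomes 0.9

# Forward pass without inputs: h_t = tanh(W_hh h_{t-1})
h = rng.normal(size=hidden_size)
hs = []
for _ in range(T):
    h = np.tanh(W_hh @ h)
    hs.append(h)

# Backward pass: the Jacobian dh_t/dh_{t-1} is diag(1 - h_t^2) @ W_hh,
# so each step multiplies the gradient by the transpose of that Jacobian.
grad = np.ones(hidden_size)
norms = [np.linalg.norm(grad)]
for h_t in reversed(hs):
    grad = W_hh.T @ ((1.0 - h_t ** 2) * grad)
    norms.append(np.linalg.norm(grad))

print(f"gradient norm at step T: {norms[0]:.3e}")
print(f"gradient norm after {T} steps back: {norms[-1]:.3e}")
```

Because each step shrinks the norm by a factor of at most 0.9, the signal reaching early time steps is at most $0.9^{50} \approx 0.005$ of its original size. With a largest singular value above 1 the same product can grow instead, which is the exploding case.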



## 8. Visualization of RNN Unrolling

```
Time:   t=1      t=2      t=3      t=4
        x1       x2       x3       x4
        ↓        ↓        ↓        ↓
h0  →   h1   →   h2   →   h3   →   h4    (hidden states carry memory)
        ↓        ↓        ↓        ↓
        y1       y2       y3       y4

* The hidden state $h_t$ connects the network across time steps.



## 9. Summary

* **Sequence modeling** = predicting/understanding ordered data
* **RNN** = neural network with memory (hidden state)
* **Applications**: text, time series, speech, genomics
* **Limitations**: vanishing gradient, long-range dependencies
* **Improved variants**: LSTM, GRU

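## Major problems of ANN

A plain feedforward ANN is a poor fit for sequence data for three reasons:

- **Fixed input/output size:** the number of input and output neurons is fixed, but sequences vary in length.
- **Too much computation:** giving every position of a long sequence its own input neurons makes the weight matrices very large.
- **No parameter sharing:** a pattern learned at one position is not reused at other positions, whereas an RNN applies the same weights at every time step.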

# **Types of Recurrent Neural Networks (RNNs)**

RNNs come in **different architectures** depending on how they handle input, output, and sequence direction. Choosing the right type depends on the problem (e.g., prediction, classification, sequence generation).



## 1. **Vanilla RNN (Simple RNN)**

* **Structure:** Standard RNN with one hidden state per time step.
* **Equations:**

  $$
  h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b_h)
  $$

  $$
  y_t = W_{hy}h_t + b_y
  $$
* **Use case:** Simple sequence modeling tasks.
* **Limitations:**

  * Vanishing/exploding gradients
  * Difficulty capturing long-term dependencies



## 2. **Long Short-Term Memory (LSTM)**

* **Structure:** RNN with **gates** to control information flow:

  * **Forget gate** $f_t$ – decides what to forget
  * **Input gate** $i_t$ – decides what to add to cell state
  * **Output gate** $o_t$ – decides what to output
* **Cell state $C_t$:** Maintains long-term memory
* **Advantages:**

  * Handles long-range dependencies
  * Reduces vanishing gradient problem
* **Use case:** Language modeling, machine translation, speech recognition
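
One LSTM step can be sketched in NumPy to show how the three gates and the cell state interact. This follows the commonly used LSTM formulation; stacking all four weight blocks into one matrix `W` and the order of the blocks are layout assumptions made here for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*H, H + D), b: (4*H,), blocks ordered f, i, o, candidate."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:H])                 # forget gate: what to drop from the cell state
    i_t = sigmoid(z[H:2 * H])             # input gate: what to add to the cell state
    o_t = sigmoid(z[2 * H:3 * H])         # output gate: what to expose as hidden state
    c_tilde = np.tanh(z[3 * H:4 * H])     # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde    # additive update of the long-term memory
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H, T = 3, 4, 6                         # illustrative sizes
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(T, D)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)   # (4,) (4,)
```

The additive form of the `c_t` update is what lets gradients flow across many steps when the forget gate stays close to 1.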



## 3. **Gated Recurrent Unit (GRU)**

* **Structure:** Simplified version of LSTM:

  * Combines forget and input gates into a single **update gate**
  * Uses a **reset gate** to control how much of the previous hidden state feeds into the new candidate state
* **Advantages:**

  * Fewer parameters than LSTM → faster training
  * Often performs similarly to LSTM
* **Use case:** Similar to LSTM, especially when computation efficiency is important
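
The "fewer parameters" claim can be checked with simple arithmetic. The sketch below assumes each gate or candidate block has one weight matrix over the concatenated input and hidden state plus one bias vector; actual libraries may differ slightly (PyTorch, for example, keeps two bias vectors per block), so the counts are illustrative.

```python
def recurrent_param_count(input_size, hidden_size, num_blocks):
    """Parameters of one recurrent layer built from num_blocks gate/candidate blocks."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size   # weights + bias
    return num_blocks * per_block

D, H = 100, 128   # illustrative input and hidden sizes
counts = {
    name: recurrent_param_count(D, H, blocks)
    for name, blocks in [("RNN", 1), ("GRU", 3), ("LSTM", 4)]
}
print(counts)   # {'RNN': 29312, 'GRU': 87936, 'LSTM': 117248}
```

Under this assumption a GRU layer has three quarters of the parameters of an LSTM layer of the same size.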



## 4. **Bidirectional RNN (BiRNN)**

* **Structure:** Processes sequence in **both directions**:

  * Forward RNN: left → right
  * Backward RNN: right → left
* **Output:** Concatenates hidden states from both directions
* **Advantages:** Captures **past and future context** for each time step
* **Use case:** Part-of-speech tagging, named entity recognition, speech processing
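
A bidirectional pass can be sketched as two independent vanilla RNNs whose hidden states are concatenated per time step. The helper names `run_rnn`, `init_params`, and `birnn` are hypothetical, and tanh units with zero initial states are assumed.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh, b_h):
    """Vanilla tanh RNN over xs of shape (T, D); returns hidden states of shape (T, H)."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    return np.stack(hs)

def init_params(rng, D, H):
    return (rng.normal(scale=0.1, size=(H, D)),
            rng.normal(scale=0.1, size=(H, H)),
            np.zeros(H))

def birnn(xs, fwd_params, bwd_params):
    h_fwd = run_rnn(xs, *fwd_params)                # reads left -> right
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]    # reads right -> left, then re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)   # (T, 2H)

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4
xs = rng.normal(size=(T, D))
fwd, bwd = init_params(rng, D, H), init_params(rng, D, H)

out = birnn(xs, fwd, bwd)
print(out.shape)   # (5, 8)
```

At position `t`, the first `H` values summarize `x_1..x_t` and the last `H` values summarize `x_t..x_T`, which is why the whole sequence must be available before any output is produced.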



## 5. **Deep RNN**

* **Structure:** Multiple RNN layers stacked on top of each other
* **Advantages:** Can capture more **complex patterns**
* **Use case:** Complex sequence modeling tasks, such as multi-layer language models



## 6. **Echo State Networks (ESN)**

* **Structure:** Sparse, randomly connected hidden layer (the reservoir) whose weights stay fixed
* **Characteristic:** Only the output (readout) weights are trained
* **Use case:** Time series prediction when a computationally cheap recurrent model is needed
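
The idea of training only the readout can be sketched in NumPy. Everything specific here is an assumption for illustration: the reservoir size, roughly 10% connectivity, a spectral radius of 0.9, a sine-wave task, and a ridge-regression readout solved in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, washout, ridge = 100, 50, 1e-4

# Fixed random input and reservoir weights (never trained)
W_in = rng.uniform(-0.5, 0.5, size=n_res)
W_res = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
W_res[rng.random((n_res, n_res)) > 0.1] = 0.0                # keep roughly 10% of connections
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))      # rescale spectral radius to 0.9

# Task: predict the next value of a sine wave
series = np.sin(0.2 * np.arange(500))
u, target = series[:-1], series[1:]

# Drive the reservoir with the input and collect its states
h = np.zeros(n_res)
states = []
for u_t in u:
    h = np.tanh(W_in * u_t + W_res @ h)
    states.append(h)
X = np.stack(states)[washout:]        # discard the initial transient
y = target[washout:]

# Train only the readout: ridge regression in closed form
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)
mse = np.mean((X @ W_out - y) ** 2)
print(f"training MSE: {mse:.2e}")
```

No backpropagation through time is needed, which is where the training speed comes from.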



## 7. **Recursive RNN (Tree-structured RNN)**

* **Structure:** Works on **tree-structured data** instead of linear sequences
* **Use case:** Natural language parsing, sentiment analysis

---

## 8. Summary Table

| Type          | Key Feature                  | Pros                           | Cons                                    | Use Cases                   |
| ------------- | ---------------------------- | ------------------------------ | --------------------------------------- | --------------------------- |
| Vanilla RNN   | Simple hidden state          | Simple, easy to implement      | Vanishing gradient, short memory        | Basic sequences             |
| LSTM          | Gates + cell state           | Long-term dependencies, stable | More parameters                         | NLP, speech, time series    |
| GRU           | Update/reset gates           | Fewer parameters, fast         | Less expressive than LSTM in some tasks | NLP, time series            |
| BiRNN         | Processes forward & backward | Uses past & future context     | Double computation                      | POS tagging, NER            |
| Deep RNN      | Stacked layers               | Captures complex patterns      | Harder to train                         | Complex sequences           |
| ESN           | Fixed random reservoir       | Fast training                  | Limited flexibility                     | Time series                 |
| Recursive RNN | Tree-structured              | Works on hierarchies           | Complex                                 | Parsing, sentiment analysis |

---

## 9. Visualization Idea

```
Vanilla RNN:   x1 → h1 → y1
               x2 → h2 → y2
               x3 → h3 → y3

LSTM:          x1 → [Cell + Gates] → h1 → y1
               x2 → [Cell + Gates] → h2 → y2

GRU:           x1 → [Update/Reset Gates] → h1 → y1
               x2 → [Update/Reset Gates] → h2 → y2

BiRNN:         x1 → h1_forward →  
                        ↘
                         → y1
               x1 → h1_backward →  
```

# **RNN in Language Translation and Named Entity Recognition**


## 1. RNN in Language Translation (Sequence-to-Sequence)

**Problem:** Translate a sentence from **source language** (e.g., English) to **target language** (e.g., French).

### a) Overview

* Translation is a **sequence-to-sequence (Seq2Seq)** problem.
* The input and output sequences are **not necessarily the same length**.
* RNNs (often LSTM or GRU) are used in an **encoder-decoder architecture**.


### b) Encoder-Decoder Architecture

**Encoder:**

* Reads the **input sentence** word by word.
* Updates hidden state $h_t$ at each time step.
* Final hidden state $h_T$ is a **context vector** summarizing the entire input sentence.

**Decoder:**

* Initializes its hidden state with the encoder’s final hidden state.
* Generates the **output sentence word by word**.
* During training, takes the **ground-truth** previous target word as input at each step (teacher forcing); at inference time, it feeds back its own previous prediction instead.

```
Input sentence: "I am happy"
Encoder RNN:    x1 -> h1
                x2 -> h2
                x3 -> h3
Context vector: h3

Decoder RNN:    s0 = h3 (decoder state initialized from the context vector)
                s0 + <START> -> s1 -> y1: "Je"
                s1 + y1      -> s2 -> y2: "suis"
                s2 + y2      -> s3 -> y3: "heureux"
Output sentence: "Je suis heureux"
```

---

### c) Attention Mechanism (Optional but Common)

* Instead of encoding everything into **a single context vector**, attention allows the decoder to **focus on specific encoder states** at each time step.
* This improves translation quality, especially for **long sentences**.
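
A single attention step can be sketched in NumPy. Dot-product scoring is assumed here as one common choice; the notes do not specify a scoring function, and the function name and sizes are illustrative.

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """decoder_state: (H,), encoder_states: (T, H) -> (context (H,), weights (T,))."""
    scores = encoder_states @ decoder_state     # one relevance score per source position
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoder_states          # weighted average of encoder states
    return context, weights

rng = np.random.default_rng(0)
T, H = 3, 4                                     # e.g. the 3 source words of "I am happy"
encoder_states = rng.normal(size=(T, H))
decoder_state = rng.normal(size=H)

context, weights = dot_product_attention(decoder_state, encoder_states)
print(weights.round(3), context.shape)
```

The decoder recomputes `weights` at every output step, so each target word can draw on a different part of the source sentence.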



### d) Key Steps in RNN Translation

1. Tokenize input and output sentences → embeddings
2. Pass input through **encoder RNN** → get hidden states
3. Initialize **decoder RNN** with encoder’s hidden state
4. Generate output sequence → predict each word
5. Optionally apply **attention** over encoder hidden states
6. Compute **loss** (cross-entropy between predicted and true words) → backpropagation
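
Steps 1 to 4 and 6 can be sketched in PyTorch as follows (attention is left out). The `Seq2Seq` class, the token ids, the vocabulary sizes, and the choice of GRU layers are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Seq2Seq(nn.Module):
    """Hypothetical encoder-decoder without attention."""
    def __init__(self, src_vocab, tgt_vocab, embed_size=16, hidden_size=32):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_size)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_size)
        self.encoder = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, tgt_vocab)

    def forward(self, src_ids, tgt_in_ids):
        # Encoder: keep only the final hidden state as the context vector
        _, context = self.encoder(self.src_embed(src_ids))           # (1, batch, hidden)
        # Decoder: start from the context vector, read the shifted target sequence
        dec_out, _ = self.decoder(self.tgt_embed(tgt_in_ids), context)
        return self.out(dec_out)                                     # (batch, tgt_len, tgt_vocab)

# Made-up ids. Source: "I am happy". Target: "<START> Je suis heureux <END>"
src = torch.tensor([[0, 1, 2]])
tgt = torch.tensor([[0, 2, 3, 4, 1]])            # 0 = <START>, 1 = <END>
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]        # teacher forcing: inputs are targets shifted by one

model = Seq2Seq(src_vocab=3, tgt_vocab=5)
logits = model(src, tgt_in)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
loss.backward()                                  # gradients flow through decoder and encoder
print(logits.shape, round(loss.item(), 3))
```

Source and target lengths differ here (3 versus 4 predicted tokens), which the architecture handles because the only link between the two RNNs is the context vector.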



### e) Advantages

* Captures sequential dependencies in source language
* Handles variable-length sequences
* Can be extended with LSTM/GRU and attention



## 2. RNN in Named Entity Recognition (NER)

**Problem:** Identify entities (person, location, organization, etc.) in a text.

**Example:**

```
Input sentence: "Sanjan Acharya works at OpenAI."
Output: ["Sanjan Acharya" → PERSON, "OpenAI" → ORG]
```

---

### a) How RNN Handles NER

* NER is a **sequence labeling problem**.
* Input: sequence of words
* Output: sequence of labels (one per word)

**RNN Approach:**

1. Convert words to embeddings: `x1, x2, x3, ...`
2. Pass embeddings through RNN:

   ```
   x1 -> h1 -> y1
   x2 -> h2 -> y2
   x3 -> h3 -> y3
   ```
3. Output `y_t` is the **predicted entity label** for word `t`:

   * `B-PER` = beginning of a person name
   * `I-PER` = inside a person name
   * `B-ORG`, `I-ORG` = beginning of / inside an organization name
   * `O` = outside any entity
4. Use **softmax** on each `y_t` to get probabilities over entity classes
5. Train with **cross-entropy loss** for sequence labeling
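
Steps 4 and 5 amount to a softmax and a cross-entropy term per token, averaged over the sentence. A minimal NumPy sketch, with random numbers standing in for the RNN outputs and a made-up label order:

```python
import numpy as np

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]   # assumed label order

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sequence_cross_entropy(logits, gold_ids):
    """Mean negative log-probability of the gold label at each position."""
    probs = softmax(logits)
    picked = probs[np.arange(len(gold_ids)), gold_ids]
    return -np.mean(np.log(picked))

rng = np.random.default_rng(0)
# One row of scores per token of "Sanjan Acharya works at OpenAI"
logits = rng.normal(size=(5, len(labels)))
gold = np.array([1, 2, 0, 0, 3])                     # B-PER, I-PER, O, O, B-ORG

probs = softmax(logits)
loss = sequence_cross_entropy(logits, gold)
print(probs.shape, round(float(loss), 3))
```

An untrained model with uniform predictions would score $\ln 5 \approx 1.609$ on this five-label setup, which is a useful baseline when watching the loss during training.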



### b) Bi-directional RNN in NER

* Standard RNN reads left → right, but context on the **right** is also useful.
* **BiRNN** or **BiLSTM** reads sequence both ways:

  ```
  Forward RNN: x1 -> h1f, x2 -> h2f, ...
  Backward RNN: x3 -> h3b, x2 -> h2b, ...
  Hidden state at t: h_t = concat(h_tf, h_tb)
  ```
* Each word prediction uses **past and future context**, improving entity recognition accuracy.



### c) Example Workflow in NER

```
Sentence: "John lives in London"

Word embeddings: x1("John"), x2("lives"), x3("in"), x4("London")

BiRNN hidden states: h1, h2, h3, h4

Softmax output:
y1: B-PER
y2: O
y3: O
y4: B-LOC
```

* Predicted entities: **John → PERSON**, **London → LOCATION**
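
Turning the per-word tags into entity spans is a small post-processing step. The sketch below assumes the BIO scheme described earlier; `bio_to_entities` is a hypothetical helper, and it simply drops an `I-` tag that does not continue an open entity of the same type.

```python
def bio_to_entities(words, tags):
    """Group BIO-tagged words into (entity text, entity type) pairs."""
    entities, current_words, current_type = [], [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current_words:                              # close the previous entity
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(word)                     # continue the open entity
        else:                                              # "O" or an inconsistent I- tag
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:                                      # entity ending at the last word
        entities.append((" ".join(current_words), current_type))
    return entities

print(bio_to_entities(["John", "lives", "in", "London"],
                      ["B-PER", "O", "O", "B-LOC"]))
# [('John', 'PER'), ('London', 'LOC')]
```

Multi-word entities work the same way: the tags `B-PER, I-PER, O, O, B-ORG` for "Sanjan Acharya works at OpenAI" yield `('Sanjan Acharya', 'PER')` and `('OpenAI', 'ORG')`.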

---

## 3. Key Differences Between Translation and NER

| Aspect       | Translation                         | NER                                |
| ------------ | ----------------------------------- | ---------------------------------- |
| Input        | Sequence of words                   | Sequence of words                  |
| Output       | Sequence of words (target language) | Sequence of labels (entities)      |
| Architecture | Encoder-Decoder RNN (+Attention)    | BiRNN / RNN with sequence labeling |
| Goal         | Generate new sequence               | Tag each word                      |
| Context used | Context vector or attention         | Hidden state per word              |

---

## 4. Summary

* **RNN in Translation:** Encoder-decoder architecture, generates variable-length target sequence, optionally uses attention.
* **RNN in NER:** Sequence labeling task, BiRNN recommended, predicts labels per word using contextual hidden states.
* **Commonality:** Both rely on **temporal dependencies** captured by RNN hidden states.


# **Types of RNN Architectures**

RNNs can be categorized based on the **relationship between input and output sequences**.



## 1. **One-to-One RNN**

* **Description:** A single input maps to a single output, as in a standard neural network; there is no sequence to unroll.
* **Input:** Single data point
* **Output:** Single output
* **Example:** Image classification

```
Input x → [RNN] → Output y
```

* **Use case:** When sequence modeling is **not required**; behaves like a feedforward network.



## 2. **One-to-Many RNN**

* **Description:** Single input produces a **sequence of outputs**.

* **Example:** Image captioning

* **Workflow:**

  ```
  Input x (image) → RNN → y1, y2, y3, ... (caption words)
  ```

* **Use case:** Generating sequences from a single input.



## 3. **Many-to-One RNN**

* **Description:** Sequence of inputs produces a **single output**.

* **Example:** Sentiment analysis (predict sentiment from a sentence)

* **Workflow:**

  ```
  x1 → h1
  x2 → h2
  x3 → h3
          ↓
          y (sentiment)
  ```

* **Use case:** Classification or regression of sequential data.



## 4. **Many-to-Many RNN**

* **Description:** Sequence of inputs produces a sequence of outputs.

### a) **Equal Length Input-Output**

* Input and output sequences have the **same length**
* **Example:** Part-of-speech tagging, named entity recognition

```
x1 → y1
x2 → y2
x3 → y3
```

### b) **Variable Length Input-Output**

* Input and output sequences can have **different lengths**
* **Example:** Machine translation

```
Input: "I am happy"
Output: "Je suis heureux"
```

* Implemented using **Encoder-Decoder RNNs**, often with attention mechanism.

---

## 5. Summary Table

| RNN Type                       | Input    | Output   | Example              |
| ------------------------------ | -------- | -------- | -------------------- |
| One-to-One                     | Single   | Single   | Image classification |
| One-to-Many                    | Single   | Sequence | Image captioning     |
| Many-to-One                    | Sequence | Single   | Sentiment analysis   |
| Many-to-Many (equal length)    | Sequence | Sequence | NER, POS tagging     |
| Many-to-Many (variable length) | Sequence | Sequence | Machine translation  |
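
In code, the many-to-one and equal-length many-to-many rows differ only in which hidden states are passed to the output layer. A NumPy sketch with illustrative sizes and random weights (bias terms omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H, K = 6, 3, 4, 2                  # sequence length, input, hidden, output sizes

W_xh = rng.normal(scale=0.1, size=(H, D))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hy = rng.normal(scale=0.1, size=(K, H))

xs = rng.normal(size=(T, D))
h = np.zeros(H)
hs = []
for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h)
    hs.append(h)
hs = np.stack(hs)                        # (T, H): one hidden state per time step

many_to_one = W_hy @ hs[-1]              # (K,): only the last state, e.g. sentiment
many_to_many = hs @ W_hy.T               # (T, K): every state, e.g. one tag per word

print(many_to_one.shape, many_to_many.shape)   # (2,) (6, 2)
```

The variable-length many-to-many case does not fit this pattern, because the output length is decided by a separate decoder rather than by the number of input steps.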