# Session 2 – NLP Models and Training Basics (RNN)
  
In this notebook, we will dive into fundamental sequence models such as **RNNs** and **LSTMs**. We’ll also cover basic neural embeddings, training objectives, and see how to implement and train a **simple text generator** using an LSTM.


## Table of Contents

1. [Overview and Setup](#overview)
2. [Recurrent Neural Networks (RNNs)](#rnns)
   - [The RNN Cell](#rnn-cell)
   - [Vanishing and Exploding Gradients](#vanishing)
3. [Long Short-Term Memory (LSTM)](#lstm)
   - [Key Intuition Behind LSTM Gates](#lstm-gates)
4. [Embeddings](#embeddings)
5. [Basic Training Objectives in Language Modeling](#training-objectives)
   - [Next Token Prediction](#next-token-pred)
   - [Perplexity](#perplexity)
6. [Implementing a Simple LSTM Text Generator in PyTorch](#implementation)
   - [Data Preparation](#data-prep)
   - [Model Definition](#model-def)
   - [Training Loop](#training-loop)
   - [Generating Text](#generate-text)
7. [Conclusion](#conclusion)

Each section will be followed by one or more **Exercises** to help you practice.

# <a id="overview"></a>1. Overview and Setup

This tutorial assumes you have:

- **Basic Python** knowledge.
- A local or cloud environment (e.g., Jupyter, Colab) with **PyTorch** installed.
  - If needed, install PyTorch via `pip install torch` or follow instructions at [pytorch.org](https://pytorch.org/get-started/locally/).

No prior reading of other sessions is required; we’ll present all the essentials here.

### Quick Setup Check
```python
import torch
print("PyTorch version:", torch.__version__)
```

Ensure you see a version number (e.g., `2.0.0` or similar) printed. If you get an error, please install or update PyTorch before continuing.

In this session:

- We’ll focus on **RNNs** and **LSTMs**.  
- We’ll learn **why** they are powerful for sequential data.  
- We’ll cover **basic training objectives** (like next-token prediction) for language modeling.  
- Finally, we’ll implement a small **LSTM-based text generator**.

**By the end of this session**, you should be able to:
1. Understand how an RNN cell and LSTM cell process sequential data.  
2. Implement an **LSTM** in a deep learning framework (here, PyTorch).  
3. Train and evaluate a **text-generation** model.  



# 2. Recurrent Neural Networks (RNNs)<a id="rnns"></a>

Recurrent Neural Networks are designed to handle **sequential data** by maintaining a hidden state that captures information about previous time steps. 

## Key Idea
At each time step $t$:
1. The RNN takes an input $x_t$ and the hidden state from the previous time step $h_{t-1}$.
2. It produces a new hidden state $h_t$.

Mathematically, a very **basic** RNN can be written as:
$$
\begin{aligned}
h_t &= \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) \\
y_t &= W_{hy} h_t + b_y
\end{aligned}
$$

- $h_t$ is the updated hidden state.
- $y_t$ is the output at time step $t$ (used for tasks like classification or next-token prediction).
- $W_{hh}, W_{xh}, W_{hy}$ are learned weight matrices, and $b_h, b_y$ are learned bias vectors.

**Rearrangement of Terms**

Notice that the term $W_{hh} h_{t-1} + W_{xh} x_t$ requires two matrix multiplications and an addition.
Unless the framework fuses them (for example, through compilation), the two multiplications run one after the other.
We can gain a small speedup by concatenating $h_{t-1}$ and $x_t$ and performing a single matrix multiplication with a larger weight matrix:

$$
\begin{aligned}
h_t &= \tanh(W_h H_t + b_h) \\
y_t &= W_{hy} h_t + b_y
\end{aligned}
$$

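- $H_t = [h_{t-1}||x_t]$ is the concatenation of $h_{t-1}$ and $x_t$ into a single, longer vector.
- $W_{h} = [W_{hh}||W_{xh}]$ is the concatenation of $W_{hh}$ and $W_{xh}$ side by side, so that $W_h H_t = W_{hh} h_{t-1} + W_{xh} x_t$.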
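
The equivalence of the two forms can be checked numerically. The following is a minimal sketch, assuming small arbitrary dimensions and random weights (none of these values come from the session itself). It only compares the two ways of computing $h_t$ and does not measure speed.

```python
import torch

torch.manual_seed(0)
batch_size, input_dim, hidden_dim = 4, 5, 3  # illustrative sizes (assumption)

# Random parameters and inputs; weights are stored so that W @ vector matches the equations.
W_hh = torch.randn(hidden_dim, hidden_dim)
W_xh = torch.randn(hidden_dim, input_dim)
b_h = torch.randn(hidden_dim)
h_prev = torch.randn(batch_size, hidden_dim)
x_t = torch.randn(batch_size, input_dim)

# Two separate matrix multiplications (batches are row vectors, hence the transposes).
h_two = torch.tanh(h_prev @ W_hh.T + x_t @ W_xh.T + b_h)

# One multiplication: join h and x along the feature axis,
# and place the two weight matrices side by side.
H_t = torch.cat([h_prev, x_t], dim=1)   # (batch_size, hidden_dim + input_dim)
W_h = torch.cat([W_hh, W_xh], dim=1)    # (hidden_dim, hidden_dim + input_dim)
h_one = torch.tanh(H_t @ W_h.T + b_h)

# The results agree up to floating-point rounding.
print(torch.allclose(h_two, h_one, atol=1e-5))
```

The order of the blocks matters: because $h_{t-1}$ comes first in $H_t$, $W_{hh}$ must come first in $W_h$.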


<img src="img/RNNs.png"/>


## <a id="rnn-cell"></a>The RNN Cell

The **RNN cell** is the fundamental computational unit. At time step $t$:
1. **Input**: current token (often embedded) + previous hidden state.
2. **Output**: updated hidden state + optional output vector.

If you unroll this cell over time for $T$ steps, you get a **computation graph** that looks like a chain, where each link is an RNN cell.
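
Unrolling can be sketched with PyTorch's built-in `torch.nn.RNNCell`, which implements a single tanh RNN step. The sequence length, batch size, and dimensions below are arbitrary choices for illustration, and the input is random noise rather than real token embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, batch_size, input_dim, hidden_dim = 6, 2, 5, 3  # illustrative sizes (assumption)

cell = nn.RNNCell(input_dim, hidden_dim)      # tanh nonlinearity by default
x = torch.randn(T, batch_size, input_dim)     # one input per time step
h = torch.zeros(batch_size, hidden_dim)       # initial hidden state h_0

hidden_states = []
for t in range(T):
    h = cell(x[t], h)          # the same cell (same weights) is reused at every step
    hidden_states.append(h)

hidden_states = torch.stack(hidden_states)    # (T, batch_size, hidden_dim)
print(hidden_states.shape)                    # torch.Size([6, 2, 3])
```

Each loop iteration corresponds to one link of the chain, and the final `h` is the hidden state after the last time step.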

<img src="img/RNN-folded.png"/>
<img src="img/RNN-unfolded.png"/>

## <a id="vanishing"></a>Vanishing and Exploding Gradients

**Problem**: Simple RNNs often struggle with **long-term dependencies** due to **vanishing** or **exploding gradients**. That means:
- When sequences are long, the gradient that flows backward through time either becomes extremely small (**vanishes**) or extremely large (**explodes**).
- This makes training unstable or ineffective for capturing long-range context.

**Solution**: Specialized RNN variants like **LSTM** or **GRU** mitigate these issues by incorporating gating mechanisms.
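
To see the effect numerically, here is a minimal sketch under a simplifying assumption: the recurrence is linear, $h_t = W h_{t-1}$, with no inputs and no `tanh`. Since the derivative of `tanh` is at most 1, adding it back would only shrink gradients further. The helper name `grad_norm_through_time` and the choice of $W$ as a scaled orthogonal matrix are illustrative, not part of the session.

```python
import torch

def grad_norm_through_time(scale, T=50, hidden_dim=8, seed=0):
    """Norm of d(sum(h_T)) / d(h_0) for the linear recurrence h_t = W h_{t-1}."""
    torch.manual_seed(seed)
    # W = scale * Q with Q orthogonal, so every singular value of W equals `scale`.
    Q, _ = torch.linalg.qr(torch.randn(hidden_dim, hidden_dim))
    W = scale * Q
    h0 = torch.randn(hidden_dim, requires_grad=True)
    h = h0
    for _ in range(T):
        h = W @ h              # one "time step" of the simplified recurrence
    h.sum().backward()         # backpropagate through all T steps
    return h0.grad.norm().item()

small = grad_norm_through_time(scale=0.5)   # factors below 1: gradient vanishes
large = grad_norm_through_time(scale=1.5)   # factors above 1: gradient explodes
print(f"scale=0.5: {small:.3e}   scale=1.5: {large:.3e}")
```

In this setup the gradient norm scales like `scale ** T`, so after 50 steps the two cases differ by many orders of magnitude.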


### Research Note: let's invent a GRU

**RNN** : $h_t = \phi(W_hh_{t-1} + W_xx_{t})$

* **Problem:** The gradient with respect to $h_1$ (or any early hidden state) is obtained by repeatedly multiplying by $W_h$. If the values in $W_h$ are small, the gradient shrinks toward zero, thus **vanishing**.
* **Solution:** Intelligently choose between computing a new state and keeping the previous memory: $h_t = \phi(W_hh_{t-1} + W_xx_{t})$ or $h_t = h_{t-1}$

**RNN with no vanishing** : $h_t = \alpha\odot\hat{h}_t + (1-\alpha)\odot h_{t-1}$, where $\hat{h}_t=\phi(W_hh_{t-1} + W_xx_{t})$

* **Problem:** The gradient with respect to $h_1$ (or any early hidden state) is obtained by repeatedly multiplying by $W_h$. If the values in $W_h$ are large, the gradient grows without bound, thus **exploding**.
* **Solution:** Intelligently choose whether to set the previous memory to zero before multiplying it by the weights: $h_t = \phi(W_hh_{t-1} + W_xx_{t})$ or $h_t = \phi(W_xx_{t})$

**RNN with no explosion** : $h_t = \phi(W_h(\beta \odot h_{t-1}) + W_xx_{t})$

* **Problem:** How do we decide on the values of $\alpha$ and $\beta$?
* **Solution:** Don't! Let the data decide (learning)

$$
\begin{aligned}
h_t &= \overbrace{\alpha\odot\underbrace{\phi\left(W_h(\beta \odot h_{t-1}) + W_xx_{t}\right)}_{\text{no explosion}} + (1-\alpha)\odot h_{t-1}}^\text{no vanishing} \\
\text{where}\\
\alpha &= \sigma\left(Ah_{t-1} + Bx_t\right) &&\text{Update Gate}\\
\beta &= \sigma\left(Ch_{t-1} + Dx_t\right) &&\text{Reset Gate}
\end{aligned}
$$

**Congratulations**, you have just invented the **Gated Recurrent Unit (GRU)**!
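
The final equation can be written as a single step function. This is a minimal sketch, assuming $\phi = \tanh$ and omitting biases as in the derivation. The function name `gated_step`, the dimensions, and the random weights are illustrative choices, and this is not PyTorch's own `nn.GRU` implementation.

```python
import torch

def gated_step(x_t, h_prev, W_h, W_x, A, B, C, D):
    """One step of the gated recurrence derived above (biases omitted, as in the text).

    x_t: (batch_size, input_dim), h_prev: (batch_size, hidden_dim)
    W_h, A, C: (hidden_dim, hidden_dim); W_x, B, D: (hidden_dim, input_dim)
    """
    alpha = torch.sigmoid(h_prev @ A.T + x_t @ B.T)   # mixes candidate and old state
    beta = torch.sigmoid(h_prev @ C.T + x_t @ D.T)    # scales old state before W_h
    h_hat = torch.tanh((beta * h_prev) @ W_h.T + x_t @ W_x.T)   # phi = tanh (assumption)
    return alpha * h_hat + (1 - alpha) * h_prev

torch.manual_seed(0)
batch_size, input_dim, hidden_dim = 2, 5, 3  # illustrative sizes (assumption)
x_t = torch.ones(batch_size, input_dim)
h_prev = torch.randn(batch_size, hidden_dim)
W_h, A, C = (torch.randn(hidden_dim, hidden_dim) for _ in range(3))
W_x, B, D = (torch.randn(hidden_dim, input_dim) for _ in range(3))

h_t = gated_step(x_t, h_prev, W_h, W_x, A, B, C, D)
print(h_t.shape)

# Forcing alpha to (numerically) zero makes the step copy the previous state.
A_off = torch.zeros(hidden_dim, hidden_dim)
B_off = torch.full((hidden_dim, input_dim), -100.0)
h_copy = gated_step(x_t, h_prev, W_h, W_x, A_off, B_off, C, D)
print(torch.allclose(h_copy, h_prev))
```

The second call illustrates the "no vanishing" path: when $\alpha$ is close to zero, $h_t = h_{t-1}$, so information (and gradient) passes through the step unchanged.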


---

## Exercise 2: Implement a Toy RNN Cell
**Goal**:  
1. Write a Python function that computes a single time-step of an RNN.  
2. Use NumPy or PyTorch (in NumPy style) to do the matrix multiplication and a `tanh` activation.  
3. Test it on a small input (e.g., input dimension of 5, hidden dimension of 3).  

**Hint**:  
```python
import torch

def rnn_step(x_t, h_prev, Wxh, Whh, bh):
    # x_t: shape (batch_size, input_dim)
    # h_prev: shape (batch_size, hidden_dim)
    # Wxh: shape (input_dim, hidden_dim)
    # Whh: shape (hidden_dim, hidden_dim)
    # bh: shape (hidden_dim,)
    # return h_t: shape (batch_size, hidden_dim)
    pass
```
*(Keep it simple—focus on the concept, not a full RNN unrolled over time.)*  
