<img src="./image/labai.png" width="200px">

# Text Generation with GRU

In this exercise, you will build a text generation model with a GRU by completing the code below. You may add to or change the code as needed.


**Objective**:  
In this exercise, your goal is to build a text generation model using a Gated Recurrent Unit (GRU). You will complete all the provided code segments and are encouraged to add or modify code to improve the model. The key steps involve:

1. Preprocessing the text data.
2. Implementing the GRU-based neural network.
3. Training the model on the provided dataset.
4. Generating new text based on a seed sequence.

**Instructions**:
- Follow the code structure provided and complete the missing sections.
- Experiment with different hyperparameters to improve performance.
- You are free to adjust the code as needed to enhance results.

**Please use Google Colab to get access to a free GPU.**


In [None]:
!pip install torch==2.2.0 torchtext==0.17.0


In [None]:
!pip install lightning


In [None]:
# import some packages
import torch
torch.cuda.empty_cache()

import re
import torchtext
import torch.nn as nn
from pathlib import Path
from typing import List,Dict

import lightning as L
import torch.nn.functional as F
import torch.optim as optim
import unicodedata
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torchtext.vocab  import build_vocab_from_iterator
from lightning.pytorch.loggers import TensorBoardLogger

# Attempt GPU; if not, stay on CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

cuda:0


## I - Load dataset

In [None]:
# load dataset
text = Path('./data/tiny-shakespeare.txt').read_text()

In [None]:
# print total number of characters:
print(f'Number of characters in text file: {len(text):,}')

Number of characters in text file: 1,115,394


In [None]:
print(text[0:500])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


## II - Word-Based Text Generation

The first model you'll build for **text generation** will use Word-based tokens. Each token will be a single word from the text and the model will learn to predict the next word (a token).

To generate text, the model takes in a new string word by word and then generates a likely next word based on that input. It then takes the new word into account to generate the following word, and so on, until it has produced a set number of words.
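
The generation loop described above can be illustrated without a neural network. The following minimal sketch is not part of the exercise: it swaps the GRU for a simple bigram count table and always picks the most frequent follower. The names `build_bigram_table` and `generate_greedy` are hypothetical and chosen only for this illustration.

```python
from collections import Counter, defaultdict


def build_bigram_table(words):
    """Count how often each word follows each other word."""
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table


def generate_greedy(table, seed, max_tokens=5):
    """Repeatedly append the most frequent follower of the last word."""
    generated = list(seed)
    for _ in range(max_tokens):
        followers = table.get(generated[-1])
        if not followers:  # the last word was never followed by anything
            break
        next_word, _count = followers.most_common(1)[0]
        generated.append(next_word)
    return generated


corpus = "we know't we know't we are accounted poor".split()
table = build_bigram_table(corpus)
print(generate_greedy(table, ["we"], max_tokens=3))
```

The GRU plays the role of the count table here: it scores every word in the vocabulary given the words seen so far, and the loop appends one chosen word per step.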

### II.1 Tokenization
Create a tokenizer that splits the text into word tokens.
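
As a rough illustration of what the vocabulary behind a word tokenizer contains, the sketch below builds a word-to-id mapping with only the standard library. It is a stand-in for `build_vocab_from_iterator`, not its implementation: the helper name `build_word_vocab` is hypothetical, and the id ordering (special token first, then words by descending frequency, ties broken alphabetically) is an assumption made for this example.

```python
from collections import Counter


def build_word_vocab(text, specials=("<unk>",)):
    """Map each lowercased, whitespace-separated word to an integer id."""
    counts = Counter(text.lower().split())
    # Most frequent words first; ties broken alphabetically for determinism
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {token: idx for idx, token in enumerate(list(specials) + ordered)}


vocab = build_word_vocab("Speak, speak. First Citizen: speak.")
unk_id = vocab["<unk>"]

# Words that are not in the vocabulary fall back to the <unk> id
ids = [vocab.get(word, unk_id) for word in "speak. now".split()]
print(vocab)
print(ids)
```

Note that splitting on whitespace keeps punctuation attached, so `speak,` and `speak.` become two different tokens. This matches the tokenizer output shown later in this notebook.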

In [None]:

class  WordTokenizer(nn.Module):

    def __init__(self, vocab: torchtext.vocab.Vocab|Dict[str,int])-> None:
        super().__init__()

        if isinstance(vocab, torchtext.vocab.Vocab):
            self.token2id=vocab.get_stoi()
            self.id2token={id:ch for ch,id in vocab.get_stoi().items()}
            self.vocab_size=len(self.token2id)

        elif isinstance(vocab, dict):
            self.token2id=vocab
            self.id2token={id:ch for ch,id in vocab.items()}
            self.vocab_size=len(self.token2id)

        else:
            raise TypeError("Please load the vocabulary as a Dict[str, int] "
                            "or a torchtext.vocab.Vocab")

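    def encode(self, text:List[str]|str):
        # Accept a raw string or an already-tokenized list of words
        text_list=self.tokenize(text) if isinstance(text, str) else text

        # Fall back to <unk> for out-of-vocabulary words when the vocab defines it
        unk_id=self.token2id.get("<unk>")
        tokenid=[]
        for token in text_list:
            tokenid.append(self.token2id[token] if unk_id is None else self.token2id.get(token, unk_id))
        return  torch.tensor(tokenid,  dtype=torch.long)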


    def decode(self, idx:torch.Tensor):
        #idx: torch.Tensor containing integers
        token=[]
        for id in idx.tolist():
            token.append(self.id2token[id])
        return ' '.join(token)

    @staticmethod
    def tokenize(text: str) -> List[str]:

        # Normalize incoming text; can be multiple actions
        text= text.lower().strip() ## Your code Here ##

        # split text into tokens
        tokens= text.split() ## Your code Here ##

        return tokens

    @staticmethod
    def _tokenizer_corpus(corpus:List[str]):
        for text in corpus:
            yield WordTokenizer.tokenize(text)

    @staticmethod
    def train_from_text(text: str) -> "WordTokenizer":
        """Build a vocabulary from a single text corpus and return a tokenizer."""
        vocab=build_vocab_from_iterator(WordTokenizer._tokenizer_corpus(WordTokenizer.tokenize(text)),
                                        specials=["<unk>"]
                                       )
        vocab.set_default_index(vocab["<unk>"])

        return WordTokenizer(vocab)


In [None]:
# create tokenizer from text
tokenizer = WordTokenizer.train_from_text(text)## Your code Here ##

In [None]:
# show example of word-based tokens
print(tokenizer.tokenize(text[0:300]))

['first', 'citizen:', 'before', 'we', 'proceed', 'any', 'further,', 'hear', 'me', 'speak.', 'all:', 'speak,', 'speak.', 'first', 'citizen:', 'you', 'are', 'all', 'resolved', 'rather', 'to', 'die', 'than', 'to', 'famish?', 'all:', 'resolved.', 'resolved.', 'first', 'citizen:', 'first,', 'you', 'know', 'caius', 'marcius', 'is', 'chief', 'enemy', 'to', 'the', 'people.', 'all:', 'we', "know't,", 'we', "know't.", 'first', 'citizen:', 'let', 'us']


In [None]:
# tokenization
encode_text=tokenizer.encode("Welcome to the deep learning course.")
encode_text

tensor([ 533,    3,    1,  592, 4449, 4180])

In [None]:
decode_text=tokenizer.decode(encode_text)
decode_text

'welcome to the deep learning course.'

## III - Prepare dataset for training

In [None]:
class shakespeareDataset(Dataset):
    def __init__(self, encode_text, max_seq_length: int):
        self.encode_text     = encode_text
        self.max_seq_length  = max_seq_length

    def __len__(self):
        return len(self.encode_text)-self.max_seq_length

    def __getitem__(self, idx):
        assert idx < len(self.encode_text)-self.max_seq_length

        x_train= self.encode_text[idx:idx+self.max_seq_length]

        # Target is shifted by one character/token
        y_target= self.encode_text[idx+1:idx+1+self.max_seq_length]

        return x_train, y_target


In [None]:
dataset=shakespeareDataset(encode_text=tokenizer.encode(text),max_seq_length=100)

In [None]:
# check
dataset[0]

(tensor([   82,   225,   147,    31,  1650,   128,  4313,   124,    25,   561,
           650,   547,   561,    82,   225,     8,    36,    35,  1954,   319,
             3,   332,    52,     3, 14444,   650,  9129,  9129,    82,   225,
           569,     8,    87,  1273,   675,    11,  2936,  1091,     3,     1,
          1645,   650,    31, 16696,    31,  5774,    82,   225,    57,    97,
           494,   140,     2,   306,    21,  2562,    46,    34,   170, 19178,
           639,     7, 22816,   650,    40,    54,  4835,  8748,    57,    22,
            16,  1716,   712,   849,   142,   225,    69,   912,    43,  7216,
            82,   225,    31,    36,  6708,   159,  2940,     1,  4599,  1499,
            28,  2519, 21556,    44,    48,  6135,  1445,    33,    49,    48]),
 tensor([  225,   147,    31,  1650,   128,  4313,   124,    25,   561,   650,
           547,   561,    82,   225,     8,    36,    35,  1954,   319,     3,
           332,    52,     3, 14444,   650,  9129,

In [None]:
# check
tokenizer.decode(dataset[0][0])

"first citizen: before we proceed any further, hear me speak. all: speak, speak. first citizen: you are all resolved rather to die than to famish? all: resolved. resolved. first citizen: first, you know caius marcius is chief enemy to the people. all: we know't, we know't. first citizen: let us kill him, and we'll have corn at our own price. is't a verdict? all: no more talking on't; let it be done: away, away! second citizen: one word, good citizens. first citizen: we are accounted poor citizens, the patricians good. what authority surfeits on would relieve us: if they would"

In [None]:
# check
tokenizer.decode(dataset[1][0])

"citizen: before we proceed any further, hear me speak. all: speak, speak. first citizen: you are all resolved rather to die than to famish? all: resolved. resolved. first citizen: first, you know caius marcius is chief enemy to the people. all: we know't, we know't. first citizen: let us kill him, and we'll have corn at our own price. is't a verdict? all: no more talking on't; let it be done: away, away! second citizen: one word, good citizens. first citizen: we are accounted poor citizens, the patricians good. what authority surfeits on would relieve us: if they would yield"
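
The two decoded samples above differ by a single word because consecutive dataset items are overlapping windows. A minimal standard-library sketch of the same input/target construction follows, using a hypothetical helper `make_windows` on a plain list instead of a tensor:

```python
def make_windows(token_ids, seq_length):
    """Return (input, target) pairs where the target is the input shifted by one."""
    pairs = []
    for start in range(len(token_ids) - seq_length):
        x = token_ids[start:start + seq_length]
        y = token_ids[start + 1:start + 1 + seq_length]
        pairs.append((x, y))
    return pairs


pairs = make_windows([10, 11, 12, 13, 14], seq_length=3)
for x, y in pairs:
    print(x, "->", y)
```

Each target position holds the word that follows the corresponding input position, so the model receives one prediction target per time step.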

In [None]:
# batch dataset
train_dataloader = DataLoader(dataset, batch_size=64, shuffle=False)

### Build GRU model


In [None]:
class GRUTextGen(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers, dropout):

        super(GRUTextGen, self).__init__()


        assert 0 <= dropout <=1 , "dropout value must be between [0,1]"

        self.vocab_size = vocab_size

        self.embedding=nn.Embedding(vocab_size,embed_size)


        self.gru=nn.GRU(
            input_size=embed_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,  # Only apply dropout if more than 1 layer
            batch_first=True
        )

        self.fc=nn.Linear(hidden_size, vocab_size)


    def forward(self, x: torch.Tensor):
        assert x.ndim==2, "x must be a 2D tensor with shape (B, S): B=batch size, S=sequence length"
        x = self.embedding(x)
        output, h = self.gru(x)
        logits    = self.fc(output)
        return logits

In [None]:
GRU_model = GRUTextGen(
    vocab_size= 24000 ,  # must be at least tokenizer.vocab_size
    embed_size= 100,
    hidden_size= 128,
    num_layers=2,
    dropout = 0.3
)


GRU_model.to(device)
GRU_model

GRUTextGen(
  (embedding): Embedding(24000, 100)
  (gru): GRU(100, 128, num_layers=2, batch_first=True, dropout=0.3)
  (fc): Linear(in_features=128, out_features=24000, bias=True)
)
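
The size of this model can be checked by hand. The sketch below assumes PyTorch's standard GRU parameter layout (per layer: input-to-hidden and hidden-to-hidden weights for three gates, plus two bias vectors). The helper name `gru_layer_params` is hypothetical.

```python
def gru_layer_params(input_size, hidden_size):
    """Parameters in one GRU layer: 3 gates, two weight matrices, two biases."""
    weights = 3 * hidden_size * (input_size + hidden_size)
    biases = 2 * 3 * hidden_size
    return weights + biases


vocab_size, embed_size, hidden_size = 24000, 100, 128

embedding = vocab_size * embed_size
# Layer 1 reads the embeddings; layer 2 reads layer 1's hidden states
gru = gru_layer_params(embed_size, hidden_size) + gru_layer_params(hidden_size, hidden_size)
fc = hidden_size * vocab_size + vocab_size  # weights plus bias
total = embedding + gru + fc

print(f"embedding: {embedding:,}")
print(f"gru:       {gru:,}")
print(f"fc:        {fc:,}")
print(f"total:     {total:,}")
```

Most parameters sit in the embedding and output layers, which scale with the vocabulary size. With the same settings, `sum(p.numel() for p in GRU_model.parameters())` should report the same total.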

## Inference mode: Define Text Generation
Generate text with a word-based model

The `generate_text_by_word` function will use your tokenizer and GRU model to generate new text token by token, taking the input text and the token sampling parameters. We can use temperature and top-k sampling to adjust the "creativeness" of the generated text.

We also pass in the `max_tokens` parameter to tell the function how many tokens to generate.
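
To make the effect of temperature and top-k concrete, here is a small standard-library sketch that works on a plain list of scores. It is an illustration under stated assumptions, not the function used in this notebook: `sample_next` is a hypothetical helper, and the four example scores are made up.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one index from raw scores after temperature scaling and top-k filtering."""
    scaled = [value / temperature for value in logits]

    if top_k is not None:
        # Keep only the top_k highest scores; everything else gets probability 0
        threshold = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [value if value >= threshold else -math.inf for value in scaled]

    # Numerically stable softmax
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


logits = [2.0, 1.0, 0.5, -1.0]
_, sharp = sample_next(logits, temperature=0.5)   # low temperature: peaked
_, flat = sample_next(logits, temperature=2.0)    # high temperature: flatter
index, filtered = sample_next(logits, top_k=2)    # only the two best remain
print(sharp, flat, filtered, index)
```

A temperature below 1 concentrates probability on the highest-scoring word, a temperature above 1 spreads it out, and top-k removes all but the k best candidates before sampling.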

In [None]:
@torch.no_grad()
def generate_text_by_word(input_text:str, max_tokens:int=15,
                          temperature:float=1.0, top_k:int|None=None,
                          do_sample:bool=False,
                          tokenizer=tokenizer):

    """Inference: Define Text Generation"""
    # Switch to evaluation mode so dropout is disabled while generating
    GRU_model.eval()

    idx=tokenizer.encode(input_text).unsqueeze(dim=0).to(device)

    max_sequence_length=31

    assert idx.ndim==2, "input tokens must be 2D with shape (B, S): B batch, S sequence length"

    for _ in range(max_tokens): # The maximum number of tokens that can be generated
        # if the sequence context grows too long, crop it to the last max_sequence_length tokens
        idx_cond=idx if idx.size(1)<=max_sequence_length else idx[:,-max_sequence_length:]

        # forward the model to get the logits for the index in the sequence
        logits=GRU_model(idx_cond)

        # pluck the logits at the final step and scale by desired temperature
        logits = logits[:, -1, :] / temperature

        if top_k is not None:
            values= torch.topk(logits, top_k).values
            logits[logits < values[:,[-1]]]=-torch.inf

        # apply softmax to convert logits to (normalized) probabilities
        probs =F.softmax(logits, dim=-1)

        if do_sample:
            idx_next=torch.multinomial(probs, num_samples=1)
        else:
            idx_next=torch.topk(probs, k=1, dim=-1).indices  # greedy decoding

        # append sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=1)

    return tokenizer.decode(idx.squeeze())

In [None]:
# check text generation without training model
TEST_PHRASE = 'To be or not to be'
generate_text_by_word(TEST_PHRASE)

'to be or not to be foes hearty greatest issue; deputy discredits. proceedings powers, proceedings hate, hate, cloud, commanded somerset? wretches'

## Train GRU


In [None]:
GRU_model = GRU_model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(GRU_model.parameters(), lr=0.001)

# Use more epochs if a GPU is available
epochs = 5

for epoch in range(epochs):
    # Set model into "training mode"
    GRU_model.train()
    total_loss = 0

    for X_batch, y_batch in train_dataloader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()
        output = GRU_model(X_batch)
        loss   = criterion(output.view(-1, output.size(-1)), y_batch.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f'Epoch {epoch + 1}/{epochs}, Loss: {total_loss / len(train_dataloader)}')
    print('-'*72)

    gen_output = generate_text_by_word(
        input_text=TEST_PHRASE,
        temperature=0.8,
        max_tokens=30,
        top_k=None,
        do_sample=False,
        tokenizer=tokenizer
    )
    print(gen_output)

Epoch 1/5, Loss: 7.2576527661810175
------------------------------------------------------------------------
to be or not to be a sebastian: as i am a month i am not a month gonzalo: to be a dukedom. gonzalo: i am not so a brother. gonzalo: i am not to be
Epoch 2/5, Loss: 6.240594867206109
------------------------------------------------------------------------
to be or not to be a kind of milan, antonio: and i have no more than a match! to the rest of the king's ship; and the rest of the earth of the island. of
Epoch 3/5, Loss: 5.729409746197162
------------------------------------------------------------------------
to be or not to be a hair of the sea, of the isle, and all the earth of the earth of the island. gonzalo: i have no more than this island of the sea, of
Epoch 4/5, Loss: 5.346152390424297
------------------------------------------------------------------------
to be or not to be inclined and not a kind of tunis. sebastian: i am not to be desert,-- gonzalo: i have no more 
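
Cross-entropy loss is easier to interpret as perplexity, the exponential of the loss. The following sketch uses the epoch losses printed above (rounded) and compares them with the loss of a uniform guess over the 24,000-entry output layer, which is ln(24000). The printed values are averages over training batches, so this is a training perplexity, not a held-out measure.

```python
import math

vocab_size = 24000
epoch_losses = [7.2577, 6.2406, 5.7294, 5.3462]  # rounded from the log above

# Loss of a model that assigns equal probability to every word
uniform_loss = math.log(vocab_size)
print(f"uniform baseline: loss={uniform_loss:.3f}, perplexity={vocab_size}")

perplexities = [math.exp(loss) for loss in epoch_losses]
for epoch, (loss, ppl) in enumerate(zip(epoch_losses, perplexities), start=1):
    print(f"epoch {epoch}: loss={loss:.4f}, perplexity={ppl:.1f}")
```

A perplexity of a few hundred means the model is, on average, about as uncertain as if it were choosing uniformly among that many words, which is far below the 24,000 of the uniform baseline.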

## Generate Text

Now that the model has been trained, go ahead and observe how it performs!

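Try adjusting the sampling behavior with the `temperature`, `top_k`, and `do_sample` parameters on the same input string to see the differences. Note that `temperature` and `top_k` only change the result when `do_sample=True`; with greedy decoding (`do_sample=False`) the most likely word is always picked.

You might also try different phrases and different numbers of tokens to generate, and observe how the model does.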

In [None]:
output = generate_text_by_word(
    input_text='To be or ',
    max_tokens=20,
    do_sample=False,
    tokenizer=tokenizer,
    temperature=1.0,
    top_k=None,
)
print(output)

to be or to the rest sebastian: a dollar. gonzalo: and the mariners of the curl'd nook, and the son owes. antonio: antonio:


Great Job 👏