# Character level text generation with RNNs using PyTorch Lightning

- toc: true 
- badges: true
- comments: true
- categories: [RNN, tutorial]

In this article, we will show how to generate text using Recurrent Neural Networks. We will generate surnames of people, taking into account the country they come from.

As the recurrent network, we will use an LSTM. For training, we will use PyTorch Lightning. We will also show how to use `collate_fn` so that a batch can contain sequences of different lengths.

The article was inspired by https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html, and the training data was also taken from there.

In [1]:
# !pip install pytorch-lightning==0.9.1rc3

# Imports

In [2]:
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import lr_scheduler, Adam

import pytorch_lightning as pl
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

import pandas as pd
import string

# Dataset

The dataset consists of 20072 examples of surnames from different countries. The data can be found [here](https://).

We will use those examples to generate new names using an LSTM.

In [3]:
df = pd.read_csv("text_generation/names.csv")
df

Unnamed: 0,Category,Name
0,English,Abbas
1,English,Abbey
2,English,Abbott
3,English,Abdi
4,English,Abel
...,...,...
20069,Russian,Zolotnitsky
20070,Russian,Zolotnitzky
20071,Russian,Zozrov
20072,Russian,Zozulya


In [4]:
df['Category'].value_counts()

Russian       9408
English       3668
Arabic        2000
Japanese       991
German         724
Italian        709
Czech          519
Spanish        298
Dutch          297
French         277
Chinese        268
Irish          232
Greek          203
Polish         139
Scottish       100
Korean          94
Portuguese      74
Vietnamese      73
Name: Category, dtype: int64

# String to int and int to string mappers

We will treat each letter as a separate element of a sequence. Thanks to that, we don't need to perform any sophisticated tokenization.

Besides letters, we will add two additional tokens: `<pad>` and `<eos>`. The first is needed for padding (we will pad the names with 0 so that all sequences in a single batch have the same size), while the second will be used to "announce" to the model that the name generation process has ended.

In [5]:
all_letters = ["<pad>"] + list(string.ascii_letters + " .,;'-") + ["<eos>"]
n_letters = len(all_letters)
n_letters

60

In [6]:
stoi = {letter : idx for idx, letter in enumerate(all_letters)}
itos = [letter for idx, letter in enumerate(all_letters)]

In [7]:
stoi["<eos>"], itos[59]

(59, '<eos>')

In [8]:
len(stoi)

60
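
To make the two mappers concrete, here is a minimal round-trip sketch. The helpers `encode` and `decode` are hypothetical names introduced only for this illustration; they are not part of the notebook.

```python
import string

# Same vocabulary as above: <pad> at index 0, <eos> at the last index.
all_letters = ["<pad>"] + list(string.ascii_letters + " .,;'-") + ["<eos>"]
stoi = {letter: idx for idx, letter in enumerate(all_letters)}
itos = list(all_letters)


def encode(name):
    """Hypothetical helper: map each character of a name to its integer index."""
    return [stoi[char] for char in name]


def decode(indices):
    """Hypothetical helper: map indices back to characters, skipping special tokens."""
    return "".join(itos[i] for i in indices if itos[i] not in ("<pad>", "<eos>"))


encoded = encode("Abbas")
print(encoded)          # [27, 2, 2, 1, 19]
print(decode(encoded))  # Abbas
```

Lowercase letters occupy indices 1 to 26 and uppercase letters 27 to 52, which is why `A` maps to 27 and `b` to 2.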

# Dataset

To provide the data to the model we need a `Dataset`. The one defined by us will return a dictionary of five elements. 

The three most important are: 

`category_tensor` - one-hot representation of one of the 18 categories.

E.g. English might be represented as [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]]. The position of the 1 depends on the order of `all_categories`.

`input_tensor` - integer representation of the letter tokens of the name.

`target_tensor` - integer representation of the letter tokens of the target name. The target is the input shifted by one position: it skips the first letter and appends `<eos>` at the end.

We use 0 for the `<pad>` token and 59 for `<eos>`.
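
The relation between input and target can be sketched in plain Python, without tensors. The function name `make_input_and_target` is an assumption made for this example only.

```python
import string

all_letters = ["<pad>"] + list(string.ascii_letters + " .,;'-") + ["<eos>"]
stoi = {letter: idx for idx, letter in enumerate(all_letters)}


def make_input_and_target(name, eos_token="<eos>"):
    """Hypothetical helper: return (input, target) index lists for one name."""
    inputs = [stoi[char] for char in name]
    # The target is the input shifted left by one, with <eos> appended, so
    # position t of the target is the letter that follows position t of the input.
    targets = [stoi[char] for char in list(name[1:]) + [eos_token]]
    return inputs, targets


inputs, targets = make_input_and_target("Lawa")
print(inputs)   # [38, 1, 23, 1]
print(targets)  # [1, 23, 1, 59]
```

Both lists always have the same length, so the model gets exactly one prediction target per input letter.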

In [9]:
class NamesDataset(Dataset):
    def __init__(self, df, stoi, eos_token="<eos>"):
        self.stoi = stoi
        self.eos_token = eos_token
        self.n_tokens = len(self.stoi)
        
        self.categories = df["Category"].tolist()
        self.names = df["Name"].tolist()
        
        
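        # Note: set order can differ between Python runs, so category indices are
        # not stable across sessions; sorted(set(...)) would make them reproducible.
        self.all_categories = list(set(self.categories))
        self.n_categories = len(self.all_categories)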

    def __getitem__(self, item):
        category = self.categories[item]
        name = self.names[item]
        
        category_tensor = self.get_category_tensor(category)
        
        # Use the mapping passed to the constructor, not the global `stoi`
        input_tensor = torch.tensor([self.stoi[char] for char in name])
        # The target is the input shifted by one letter, with <eos> appended
        target_tensor = torch.tensor([self.stoi[char] for char in list(name[1:]) + [self.eos_token]])
        
        item_dict = {"category": category,
                     "name": name,
                     "category_tensor": category_tensor,
                     "input_tensor": input_tensor,
                     "target_tensor": target_tensor}
        
        return item_dict

    def __len__(self):
        return len(self.categories)
    
    
    def get_category_tensor(self, category):
        li = self.all_categories.index(category)
        tensor = torch.zeros(1, self.n_categories)
        tensor[0][li] = 1
        return tensor

In [10]:
ds = NamesDataset(df, stoi)

ds[0]

{'category': 'English',
 'category_tensor': tensor([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
 'input_tensor': tensor([27,  2,  2,  1, 19]),
 'name': 'Abbas',
 'target_tensor': tensor([ 2,  2,  1, 19, 59])}

## DataLoader

The Dataset returns sequences of different lengths, which cannot be stacked into a single batch tensor directly. To handle this, we define a `collate_fn` function that pads (with 0) the end of every sequence that is shorter than the longest sequence in the batch. Thanks to that, we can work with batches of size other than one.

In [11]:
def collate_fn(data):
    def merge(sequences):
        """Pad sequences with 0 to a common length.

        Adapted from https://github.com/yunjey/seq2seq-dataloader/blob/master/data_loader.py
        """
        
        lengths = [len(seq) for seq in sequences]
        padded_seqs = torch.zeros(len(sequences), max(lengths)).long()
        for i, seq in enumerate(sequences):
            end = lengths[i]
            padded_seqs[i, :end] = seq[:end]
        return padded_seqs, lengths

    categories = [x["category"] for x in data]          
    names = [x["name"] for x in data]          
    category_tensors = torch.cat([x["category_tensor"] for x in data])
    
    input_tensors = [x["input_tensor"] for x in data]
    input_tensors, _ = merge(input_tensors)
    
    target_tensors = [x["target_tensor"] for x in data]
    target_tensors, _ = merge(target_tensors)
    
    return categories, names, category_tensors, input_tensors, target_tensors

In [12]:
dl = DataLoader(ds, batch_size=1, collate_fn=collate_fn, shuffle=True)

In [13]:
next(iter(dl))

(['Czech'],
 ['Lawa'],
 tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
 tensor([[38,  1, 23,  1]]),
 tensor([[ 1, 23,  1, 59]]))
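
As an aside, PyTorch ships a utility that performs the same padding as the hand-written `merge` helper. The sketch below is an alternative, not what the notebook uses, and it runs on two hard-coded index sequences ("Abbas" and "Lawa") instead of the real dataset.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two already-encoded names of different lengths ("Abbas" and "Lawa").
sequences = [torch.tensor([27, 2, 2, 1, 19]), torch.tensor([38, 1, 23, 1])]

# batch_first=True gives shape (batch, max_len); padding_value=0 is the <pad> index.
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
lengths = [len(seq) for seq in sequences]

print(padded)
# tensor([[27,  2,  2,  1, 19],
#         [38,  1, 23,  1,  0]])
print(lengths)  # [5, 4]
```

The shorter sequence gets a trailing 0, which is the same result `merge` produces for a batch of two.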

# Lightning Datamodule

To pass everything into the Lightning training loop, we combine all of the previous steps into a `LightningDataModule` object.

In [14]:
class NamesDatamodule(pl.LightningDataModule):
    def __init__(self, batch_size):
        super().__init__()
        self.batch_size = batch_size
        self.df = pd.read_csv("text_generation/names.csv")
        
        self.all_letters = ["<pad>"] + list(string.ascii_letters + " .,;'-") + ["<eos>"]
        self.stoi = {letter : idx for idx, letter in enumerate(self.all_letters)}

    def setup(self, stage=None):
        # Use the DataFrame loaded in __init__ rather than the global `df`
        self.train_set = NamesDataset(self.df, self.stoi)

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True, collate_fn=self.collate_fn)
    
    def collate_fn(self, data):
        def merge(sequences):
            """Pad sequences with 0 to a common length.

            Adapted from https://github.com/yunjey/seq2seq-dataloader/blob/master/data_loader.py
            """

            lengths = [len(seq) for seq in sequences]
            padded_seqs = torch.zeros(len(sequences), max(lengths)).long()
            for i, seq in enumerate(sequences):
                end = lengths[i]
                padded_seqs[i, :end] = seq[:end]
            return padded_seqs, lengths

        categories = [x["category"] for x in data]          
        names = [x["name"] for x in data]          
        category_tensors = torch.cat([x["category_tensor"] for x in data])

        input_tensors = [x["input_tensor"] for x in data]
        input_tensors, _ = merge(input_tensors)

        target_tensors = [x["target_tensor"] for x in data]
        target_tensors, _ = merge(target_tensors)
        
        item_dict = {"categories": categories, 
                     "names": names, 
                     "category_tensors": category_tensors,
                     "input_tensors": input_tensors,
                     "target_tensors": target_tensors}

        return item_dict

## RNN Lightning

We define an RNN. As a loss function, we will use `CrossEntropyLoss`.

In [15]:
class RNN(pl.LightningModule):
    lr = 5e-4

    def __init__(self, input_size, hidden_size, embeding_size, n_categories, n_layers, output_size, p):
        super().__init__()

        self.criterion = nn.CrossEntropyLoss()
        
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        
        
        self.embeding = nn.Embedding(input_size+n_categories, embeding_size)
        self.lstm = nn.LSTM(embeding_size+n_categories, hidden_size, n_layers, dropout=p)
        self.out_fc = nn.Linear(hidden_size, output_size)
        
        self.dropout = nn.Dropout(p)
        

    def forward(self, batch_of_category, batch_of_letter, hidden, cell):
        ## letter level operations
        
        embeding = self.dropout(self.embeding(batch_of_letter))
        category_plus_letter = torch.cat((batch_of_category, embeding), 1)

        # Insert a singleton dimension: (batch, features) -> (batch, 1, features)
        category_plus_letter = category_plus_letter.unsqueeze(1)
        
        out, (hidden, cell) = self.lstm(category_plus_letter, (hidden, cell))
        out = self.out_fc(out)
        out = out.squeeze(1)
        
        return out, (hidden, cell)
        

    def configure_optimizers(self):
        optimizer = Adam(self.parameters(), self.lr)
        scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

        return [optimizer], [scheduler]

    def training_step(self, batch, batch_idx):
        item_dict = batch
        loss = 0
        batch_of_category = item_dict["category_tensors"]

        # Initial hidden and cell states for a batch of 1. torch.zeros creates them
        # on the CPU, so we move them to the module's device explicitly.
        hidden = torch.zeros(self.n_layers, 1, self.hidden_size).to(self.device)
        cell = torch.zeros(self.n_layers, 1, self.hidden_size).to(self.device)

        # Loop over the letter positions, one time step at a time
        for t in range(item_dict["input_tensors"].size(1)):
            batch_of_letter = item_dict["input_tensors"][:, t]
            
            output, (hidden, cell) = self(batch_of_category, batch_of_letter, hidden, cell)
            
            loss += self.criterion(output, item_dict["target_tensors"][:, t])

        loss = loss/(t+1)

        tensorboard_logs = {'train_loss': loss}

        return {'loss': loss, 'log': tensorboard_logs}
    
    
    def init_hidden(self, batch_size):
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_size)
        cell = torch.zeros(self.n_layers, batch_size, self.hidden_size)
        
        return hidden, cell
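
It is worth checking what the `StepLR` schedule in `configure_optimizers` does over a short run. The sketch below reproduces its closed form in plain Python, assuming the scheduler is stepped once per epoch; `step_lr` is a hypothetical helper, not a PyTorch function.

```python
def step_lr(initial_lr, epoch, step_size=7, gamma=0.1):
    """Closed-form StepLR value: multiply by gamma once every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)


for epoch in (0, 2, 6, 7, 14):
    print(epoch, step_lr(5e-4, epoch))
```

Under that assumption the learning rate stays at 5e-4 for epochs 0 to 6 and first drops at epoch 7, so a run of only a few epochs never sees a decay.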

# Train the model

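Finally, after defining the model and the datamodule, we can start training. The model performed best for us when trained with a batch size of 1. The reason was not entirely clear to us, but a likely cause is visible in the code: `nn.LSTM` is created without `batch_first=True`, so it reads the `(batch, 1, features)` input as `(seq_len, batch, features)`, and `training_step` creates the hidden state for a batch of one. With a larger batch, the examples in the batch would therefore be processed as consecutive time steps of a single sequence, and the padded positions would also count towards the loss. After only 2 or 3 epochs the model should be capable of generating names.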
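
When moving beyond a batch size of 1, the tensor layout expected by `nn.LSTM` matters. The sketch below is not part of the original notebook. It uses dummy zero tensors with the sizes from this article (146 input features = 128 embedding + 18 categories) to compare the default layout with `batch_first=True`, and shows `ignore_index` as one possible way to keep `<pad>` targets out of the loss.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, n_features, hidden_size, n_layers = 4, 146, 256, 2

# One time step for a batch of 4 identical inputs, shaped (batch, 1, features).
step_input = torch.zeros(batch_size, 1, n_features)

# Default layout: nn.LSTM reads the input as (seq_len, batch, features),
# so this tensor is treated as ONE sequence of 4 time steps.
lstm_default = nn.LSTM(n_features, hidden_size, n_layers)
h0 = torch.zeros(n_layers, 1, hidden_size)
c0 = torch.zeros(n_layers, 1, hidden_size)
out_default, _ = lstm_default(step_input, (h0, c0))
# Identical inputs give different rows, because state carries from row to row.
rows_equal_default = torch.allclose(out_default[0], out_default[1])

# batch_first=True: the same tensor is read as 4 independent sequences of
# length 1, and the hidden state needs one slot per batch element.
lstm_batched = nn.LSTM(n_features, hidden_size, n_layers, batch_first=True)
h0 = torch.zeros(n_layers, batch_size, hidden_size)
c0 = torch.zeros(n_layers, batch_size, hidden_size)
out_batched, (hn, cn) = lstm_batched(step_input, (h0, c0))
rows_equal_batched = torch.allclose(out_batched[0], out_batched[1])

print(out_default.shape, rows_equal_default)
print(out_batched.shape, hn.shape, rows_equal_batched)

# Padded positions can be excluded from the loss via ignore_index (0 is <pad>).
criterion = nn.CrossEntropyLoss(ignore_index=0)
logits = torch.zeros(batch_size, 60)
targets = torch.tensor([5, 0, 7, 0])  # two real letters, two padding positions
loss = criterion(logits, targets)     # averaged over the two non-pad targets only
print(loss)
```

Both calls return an output of the same shape, which is why the default layout does not raise an error with larger batches even though the examples are no longer processed independently.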

In [16]:
dm = NamesDatamodule(1)

rnn_model = RNN(input_size=ds.n_tokens,
            hidden_size=256,
            embeding_size = 128, 
            n_layers=2,    
            n_categories=ds.n_categories,
            output_size=ds.n_tokens,
            p=0.3)


trainer = Trainer(max_epochs=3, 
                  logger=None,
                  gpus=1,
                  early_stop_callback=False,
                  checkpoint_callback=False,
                  )

trainer.fit(rnn_model, dm)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type             | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0     
1 | embeding  | Embedding        | 9 K   
2 | lstm      | LSTM             | 940 K 
3 | out_fc    | Linear           | 15 K  
4 | dropout   | Dropout          | 0     


Saving latest checkpoint..





1
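
The parameter counts in the summary table can be reproduced by hand, which is a useful sanity check on the layer sizes. The arithmetic below uses the hyperparameters passed to `RNN` above; `lstm_layer_params` is a hypothetical helper written for this check.

```python
n_tokens, n_categories = 60, 18
embedding_size, hidden_size, n_layers = 128, 256, 2

# nn.Embedding(input_size + n_categories, embedding_size): one row per index.
embedding_params = (n_tokens + n_categories) * embedding_size


def lstm_layer_params(input_size, hidden_size):
    """Four gates, each with input weights, recurrent weights and two bias vectors."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


# The first layer sees the embedding concatenated with the one-hot category;
# each further layer sees the hidden state of the layer below it.
lstm_params = (lstm_layer_params(embedding_size + n_categories, hidden_size)
               + lstm_layer_params(hidden_size, hidden_size) * (n_layers - 1))

# nn.Linear(hidden_size, n_tokens): weight matrix plus bias.
linear_params = hidden_size * n_tokens + n_tokens

print(embedding_params, lstm_params, linear_params)  # 9984 940032 15420
```

These values agree with the rounded 9 K, 940 K and 15 K shown in the table.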

# Generate names using the trained model

Finally, we can check how our model handles the generation of names. It works quite well: the Russian names "sound" Russian and the Japanese names "sound" Japanese.

In [17]:
max_length = 20

# Sample from a category and starting letter
def sample(category, start_letter, model):
    with torch.no_grad():  # no need to track history in sampling
        category_tensor = ds.get_category_tensor(category)
        
        input = torch.tensor(ds.stoi[start_letter]).unsqueeze(0)
        hidden, cell = model.init_hidden(1)

        output_name = start_letter

        for i in range(max_length):
            output, (hidden, cell) = model(category_tensor, input, hidden, cell)
            topv, topi = output.topk(1)
            topi = topi[0][0]
            if topi == ds.stoi["<eos>"]:
                break
            else:
                letter = itos[topi]
                output_name += letter
                
            input = torch.tensor(ds.stoi[letter]).unsqueeze(0)

        return output_name

# Get multiple samples from one category and multiple starting letters
def samples(category, start_letters, model):
    for start_letter in start_letters:
        print(sample(category, start_letter, model))

In [18]:
samples('English', 'ABC', rnn_model)
print()
samples('Russian', 'ABC', rnn_model)
print()
samples('Japanese', 'ABC', rnn_model)

Allan
Bowe
Corris

Abrakoff
Babakov
Cheparov

Abata
Bakata
Chigawa


# Summary

We showed how to define an LSTM network that generates text at the character level. We also showed how to define a dataloader for sequences of different lengths using `collate_fn`.

Of course, the proposed network is very simple, but the goal of this article is to familiarize both the author and the reader with the concept of Recurrent Neural Networks.