# LSTM Language Models

You guys probably very excited about ChatGPT.  In today class, we will be implementing a very simple language model, which is basically what ChatGPT is, but with a simple LSTM.  You will be surprised that it is not so difficult at all.

Paper that we base on is *Regularizing and Optimizing LSTM Language Models*, https://arxiv.org/abs/1708.02182

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

import torchtext, math
from tqdm import tqdm

import nltk

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [3]:
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## 1. Load data - Gutenberg

The data imported below is from [Project Gutenberg](https://gutenberg.org/) provided in [NLTK corpora](https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/). 

In [4]:
from nltk.corpus import gutenberg

nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\sung2_8l7o06c\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [5]:
gutenberg.sents()

[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]

## 2. Preprocessing

### Tokenizing

Simply tokenize the given text to tokens.

In [6]:
from nltk import tokenize

fileids = gutenberg.fileids()

# split dataset
train = fileids[:int(len(fileids) * 0.6)]
# join word in each file
train = [' '.join(gutenberg.words(f)) for f in train]
# concatenate files
train = ' '.join(train)
train_sents = tokenize.sent_tokenize(train)

valid = fileids[int(len(fileids) * 0.6):int(len(fileids) * 0.8)] 
valid = [' '.join(gutenberg.words(f)) for f in valid]
valid = ' '.join(valid)
valid_sents = tokenize.sent_tokenize(valid)

test = fileids[int(len(fileids) * 0.8):] 
test = [' '.join(gutenberg.words(f)) for f in test]
test = ' '.join(test)
test_sents = tokenize.sent_tokenize(test)

In [7]:
train_sents[1000:1010]

['You could not have visited me !"',
 'she cried , looking aghast . "',
 'No , to be sure you could not ; but I never thought of that before .',
 'That would have been too dreadful !-- What an escape !-- Dear Miss Woodhouse , I would not give up the pleasure and honour of being intimate with you for any thing in the world ."',
 '" Indeed , Harriet , it would have been a severe pang to lose you ; but it must have been .',
 'You would have thrown yourself out of all good society .',
 'I must have given you up ."',
 '" Dear me !-- How should I ever have borne it !',
 'It would have killed me never to come to Hartfield any more !"',
 '" Dear affectionate creature !-- _You_ banished to Abbey - Mill Farm !-- _You_ confined to the society of the illiterate and vulgar all your life !']

In [8]:
print(f'Sentences in train: {len(train_sents)}')
print(f'Sentences in valid: {len(valid_sents)}')
print(f'Sentences in test: {len(test_sents)}')

Sentences in train: 62547
Sentences in valid: 26549
Sentences in test: 9302


In [9]:
from nltk import word_tokenize

tokenized_train_data = [word_tokenize(sent) for sent in train_sents]
tokenized_valid_data = [word_tokenize(sent) for sent in valid_sents]
tokenized_test_data = [word_tokenize(sent) for sent in test_sents]

### Numericalizing

We will tell torchtext to add any word that has occurred at least three times in the dataset to the vocabulary because otherwise it would be too big.  Also we shall make sure to add `unk` and `eos`.

In [10]:
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_train_data, min_freq=3)
vocab.insert_token('<unk>', 0)
vocab.insert_token('<eos>', 1)
vocab.set_default_index(vocab['<unk>'])

In [11]:
print(len(vocab))

14204


In [12]:
print(vocab.get_itos()[:10])

['<unk>', '<eos>', ',', 'the', 'and', '.', 'of', ':', 'to', 'in']


In [13]:
import pickle

pickle.dump(vocab, open('./vocab/vocab.pkl', 'wb'))

## 3. Prepare the batch loader

### Prepare data

Given "Chaky loves eating at AIT", and "I really love deep learning", and given batch size = 3, we will get three batches of data "Chaky loves eating at", "AIT `<eos>` I really", "love deep learning `<eos>`".  

In [14]:
def get_data(dataset, vocab, batch_size):
    data = []
    for example in dataset:
        tokens = example+ ['<eos>']
        tokens = [vocab[token] for token in tokens]
        data.extend(tokens)
    data = torch.LongTensor(data)
    num_batches = data.shape[0] // batch_size
    data = data[:num_batches * batch_size]
    data = data.view(batch_size, num_batches) #view vs. reshape (whether data is contiguous)
    return data #[batch size, seq len]

In [15]:
batch_size = 128
train_data = get_data(tokenized_train_data, vocab, batch_size)
valid_data = get_data(tokenized_valid_data, vocab, batch_size)
test_data  = get_data(tokenized_test_data,  vocab, batch_size)

In [16]:
train_data.shape

torch.Size([128, 14212])

## 4. Modeling 

In [17]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.num_layers = num_layers
        self.hid_dim    = hid_dim
        self.emb_dim    = emb_dim
        
        self.embedding  = nn.Embedding(vocab_size, emb_dim)
        self.lstm       = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout    = nn.Dropout(dropout_rate)
        self.fc         = nn.Linear(hid_dim, vocab_size)
        
        self.init_weights()
    
    def init_weights(self):
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hid_dim)
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_other)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        for i in range(self.num_layers):
            self.lstm.all_weights[i][0] = torch.FloatTensor(self.emb_dim,
                self.hid_dim).uniform_(-init_range_other, init_range_other) #We
            self.lstm.all_weights[i][1] = torch.FloatTensor(self.hid_dim,   
                self.hid_dim).uniform_(-init_range_other, init_range_other) #Wh
    
    def init_hidden(self, batch_size, device):
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell   = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell
        
    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden = hidden.detach() #not to be used for gradient computation
        cell   = cell.detach()
        return hidden, cell
        
    def forward(self, src, hidden):
        #src: [batch_size, seq len]
        embedding = self.dropout(self.embedding(src)) #harry potter is
        #embedding: [batch-size, seq len, emb dim]
        output, hidden = self.lstm(embedding, hidden)
        #ouput: [batch size, seq len, hid dim]
        #hidden: [num_layers * direction, seq len, hid_dim]
        output = self.dropout(output)
        prediction =self.fc(output)
        #prediction: [batch_size, seq_len, vocab_size]
        return prediction, hidden

## 5. Training 

Follows very basic procedure.  One note is that some of the sequences that will be fed to the model may involve parts from different sequences in the original dataset or be a subset of one (depending on the decoding length). For this reason we will reset the hidden state every epoch, this is like assuming that the next batch of sequences is probably always a follow up on the previous in the original dataset.

In [18]:
vocab_size = len(vocab)
emb_dim = 1024              # 400 in the paper
hid_dim = 1024             # 1150 in the paper
num_layers = 3                # 3 in the paper
dropout_rate = 0.65              
lr = 1e-3                     

In [19]:
model      = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)
optimizer  = optim.Adam(model.parameters(), lr=lr)
criterion  = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 54,294,396 trainable parameters


In [20]:
def get_batch(data, seq_len, idx):
    #data #[batch size, bunch of tokens]
    src    = data[:, idx:idx+seq_len]                   
    target = data[:, idx+1:idx+seq_len+1]  #target simply is ahead of src by 1            
    return src, target

In [21]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    
    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    # data #[batch size, seq len]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    num_batches = data.shape[-1]
    
    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)
    
    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()
        
        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)               

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]  
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

In [22]:
def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

Here we will be using a `ReduceLROnPlateau` learning scheduler which decreases the learning rate by a factor, if the loss don't improve by a certain epoch.

In [23]:
n_epochs = 50
seq_len  = 50 #<----decoding length
clip    = 0.25

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

best_valid_loss = float('inf')

file_dir = './model/lstm_prog.pt'
best_dir = './model/lstm_best.pt'

# Check if to resume training or start anew
try:
    checkpoint = torch.load(file_dir)

    model = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)
    model.load_state_dict(checkpoint['state_dict'])

    optimizer = optim.Adam(model.parameters(), lr=lr)
    optimizer.load_state_dict(checkpoint['optimizer'])

    epoch = checkpoint['epoch']

    print(f'Prog found (epoch #{epoch})')
except:
    epoch = 0
    print('Start from zero')

# training
while epoch < n_epochs:

    print(f'\n\tEpoch >>>> {epoch+1}')
    train_loss = train(model, train_data, optimizer, criterion, 
                batch_size, seq_len, clip, device)
    valid_loss = evaluate(model, valid_data, criterion, batch_size, 
                seq_len, device)

    lr_scheduler.step(valid_loss)

    state = {'epoch': epoch + 1,
        'state_dict': model.state_dict(),
        'optimizer': optimizer.state_dict()
        }
    
    torch.save(state, file_dir)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), best_dir)

    print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
    print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')

    epoch += 1

Start from zero

	Epoch >>>> 1


Training:   0%|          | 0/284 [00:00<?, ?it/s]

                                                           

	Train Perplexity: 561.887
	Valid Perplexity: 559.872

	Epoch >>>> 2


                                                           

	Train Perplexity: 510.125
	Valid Perplexity: 571.714

	Epoch >>>> 3


                                                           

	Train Perplexity: 394.412
	Valid Perplexity: 265.023

	Epoch >>>> 4


                                                           

	Train Perplexity: 128.041
	Valid Perplexity: 178.524

	Epoch >>>> 5


                                                           

	Train Perplexity: 93.864
	Valid Perplexity: 149.796

	Epoch >>>> 6


                                                           

	Train Perplexity: 78.162
	Valid Perplexity: 136.807

	Epoch >>>> 7


                                                           

	Train Perplexity: 68.956
	Valid Perplexity: 130.644

	Epoch >>>> 8


                                                           

	Train Perplexity: 62.472
	Valid Perplexity: 125.770

	Epoch >>>> 9


                                                           

	Train Perplexity: 57.674
	Valid Perplexity: 122.620

	Epoch >>>> 10


                                                           

	Train Perplexity: 53.831
	Valid Perplexity: 119.672

	Epoch >>>> 11


                                                           

	Train Perplexity: 50.679
	Valid Perplexity: 117.799

	Epoch >>>> 12


                                                           

	Train Perplexity: 48.148
	Valid Perplexity: 117.545

	Epoch >>>> 13


                                                           

	Train Perplexity: 45.826
	Valid Perplexity: 116.493

	Epoch >>>> 14


                                                           

	Train Perplexity: 43.941
	Valid Perplexity: 116.642

	Epoch >>>> 15


                                                           

	Train Perplexity: 41.739
	Valid Perplexity: 113.987

	Epoch >>>> 16


                                                           

	Train Perplexity: 40.645
	Valid Perplexity: 113.397

	Epoch >>>> 17


                                                           

	Train Perplexity: 39.767
	Valid Perplexity: 113.639

	Epoch >>>> 18


                                                           

	Train Perplexity: 38.783
	Valid Perplexity: 113.147

	Epoch >>>> 19


                                                           

	Train Perplexity: 38.299
	Valid Perplexity: 112.796

	Epoch >>>> 20


                                                           

	Train Perplexity: 37.881
	Valid Perplexity: 112.539

	Epoch >>>> 21


                                                           

	Train Perplexity: 37.497
	Valid Perplexity: 112.755

	Epoch >>>> 22


                                                           

	Train Perplexity: 37.083
	Valid Perplexity: 112.070

	Epoch >>>> 23


                                                           

	Train Perplexity: 36.855
	Valid Perplexity: 112.148

	Epoch >>>> 24


                                                           

	Train Perplexity: 36.681
	Valid Perplexity: 111.979

	Epoch >>>> 25


                                                           

	Train Perplexity: 36.549
	Valid Perplexity: 111.847

	Epoch >>>> 26


                                                           

	Train Perplexity: 36.448
	Valid Perplexity: 111.825

	Epoch >>>> 27


                                                           

	Train Perplexity: 36.326
	Valid Perplexity: 111.439

	Epoch >>>> 28


                                                           

	Train Perplexity: 36.321
	Valid Perplexity: 111.354

	Epoch >>>> 29


                                                           

	Train Perplexity: 36.210
	Valid Perplexity: 111.436

	Epoch >>>> 30


                                                           

	Train Perplexity: 36.217
	Valid Perplexity: 111.186

	Epoch >>>> 31


                                                           

	Train Perplexity: 36.187
	Valid Perplexity: 111.239

	Epoch >>>> 32


                                                           

	Train Perplexity: 36.160
	Valid Perplexity: 111.163

	Epoch >>>> 33


                                                           

	Train Perplexity: 36.103
	Valid Perplexity: 111.120

	Epoch >>>> 34


                                                           

	Train Perplexity: 36.083
	Valid Perplexity: 111.110

	Epoch >>>> 35


                                                           

	Train Perplexity: 36.122
	Valid Perplexity: 111.088

	Epoch >>>> 36


                                                           

	Train Perplexity: 36.104
	Valid Perplexity: 111.074

	Epoch >>>> 37


                                                           

	Train Perplexity: 36.094
	Valid Perplexity: 111.067

	Epoch >>>> 38


                                                           

	Train Perplexity: 36.089
	Valid Perplexity: 111.071

	Epoch >>>> 39


                                                           

	Train Perplexity: 36.055
	Valid Perplexity: 111.068

	Epoch >>>> 40


                                                           

	Train Perplexity: 36.105
	Valid Perplexity: 111.067

	Epoch >>>> 41


                                                           

	Train Perplexity: 36.080
	Valid Perplexity: 111.067

	Epoch >>>> 42


                                                           

	Train Perplexity: 36.122
	Valid Perplexity: 111.066

	Epoch >>>> 43


                                                           

	Train Perplexity: 36.095
	Valid Perplexity: 111.066

	Epoch >>>> 44


                                                           

	Train Perplexity: 36.050
	Valid Perplexity: 111.066

	Epoch >>>> 45


                                                           

	Train Perplexity: 36.070
	Valid Perplexity: 111.066

	Epoch >>>> 46


                                                           

	Train Perplexity: 36.062
	Valid Perplexity: 111.066

	Epoch >>>> 47


                                                           

	Train Perplexity: 36.092
	Valid Perplexity: 111.066

	Epoch >>>> 48


                                                           

	Train Perplexity: 36.105
	Valid Perplexity: 111.065

	Epoch >>>> 49


                                                           

	Train Perplexity: 36.088
	Valid Perplexity: 111.065

	Epoch >>>> 50


                                                           

	Train Perplexity: 36.069
	Valid Perplexity: 111.065


## 6. Testing

In [24]:
model.load_state_dict(torch.load('./model/lstm_best.pt',  map_location=device))
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Test Perplexity: 139.100


## 7. Real-world inference

Here we take the prompt, tokenize, encode and feed it into the model to get the predictions.  We then apply softmax while specifying that we want the output due to the last word in the sequence which represents the prediction for the next word.  We divide the logits by a temperature value to alter the model’s confidence by adjusting the softmax probability distribution.

Once we have the Softmax distribution, we randomly sample it to make our prediction on the next word. If we get <unk> then we give that another try.  Once we get <eos> we stop predicting.
    
We decode the prediction back to strings last lines.

In [25]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)
            
            #prediction: [batch size, seq len, vocab size]
            #prediction[:, -1]: [batch size, vocab size] #probability of last vocab
            
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  
            prediction = torch.multinomial(probs, num_samples=1).item()    
            
            while prediction == vocab['<unk>']: #if it is unk, we sample again
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:    #if it is eos, we stop
                break

            indices.append(prediction) #autoregressive, thus output becomes input

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

In [26]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

In [28]:
prompt = 'Vengeance '
max_seq_len = 30
seed = 0

#smaller the temperature, more diverse tokens but comes 
#with a tradeoff of less-make-sense sentence
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, 
                          vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
vengeance ; but the head of the house of the Lord was not the name of the LORD .

0.7
vengeance ; but another of them ' s son , in the feast , the same , wanting -- the name of the one was always to be made in the

0.75
vengeance ; but another of them ' s son , in the feast , the same , wanting -- the man our name , the one and the little ; but

0.8
vengeance ; but another of them ' s son , in the feast , the same , wanting -- the Emperor , our name , the one and the little .

1.0
vengeance ; but another of them ' s level , not the feast , And if , wanting -- though your plaster and fresh things tonight speaketh or divorced , had

