<p style="text-align: center; font-size:50px;">Language Model with LSTM</p>

#### The previous notebook covered the LSTM and its inner workings, so this notebook puts an LSTM to use.
#### I will build an LSTM language model that generates words from a text prompt.

<p style="text-align: center; font-size:30px;">Data</p>

In [1]:
# Important Libraries

import torch
import torch.nn as nn
import torch.optim as optim
import torchinfo 
from torchinfo import summary
import math
import torchtext
import datasets
from tqdm import tqdm

In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"The device is set to {device}")
torch.manual_seed(42)

The device is set to cuda:0


<torch._C.Generator at 0x7a7e45f177f0>

In [3]:
dataset = datasets.load_dataset('wikitext', 'wikitext-2-raw-v1')
print(dataset)

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


#### Tokenizing our data using torchtext's basic_english tokenizer.

In [4]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example['text'])}  
tokenized_dataset = dataset.map(tokenize_data, remove_columns=['text'], 
fn_kwargs={'tokenizer': tokenizer})

In [5]:
print(tokenized_dataset['train'][4]['tokens'])

['the', 'game', 'began', 'development', 'in', '2010', ',', 'carrying', 'over', 'a', 'large', 'portion', 'of', 'the', 'work', 'done', 'on', 'valkyria', 'chronicles', 'ii', '.', 'while', 'it', 'retained', 'the', 'standard', 'features', 'of', 'the', 'series', ',', 'it', 'also', 'underwent', 'multiple', 'adjustments', ',', 'such', 'as', 'making', 'the', 'game', 'more', 'forgiving', 'for', 'series', 'newcomers', '.', 'character', 'designer', 'raita', 'honjou', 'and', 'composer', 'hitoshi', 'sakimoto', 'both', 'returned', 'from', 'previous', 'entries', ',', 'along', 'with', 'valkyria', 'chronicles', 'ii', 'director', 'takeshi', 'ozawa', '.', 'a', 'large', 'team', 'of', 'writers', 'handled', 'the', 'script', '.', 'the', 'game', "'", 's', 'opening', 'theme', 'was', 'sung', 'by', 'may', "'", 'n', '.']
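
To give a feel for what this tokenization step does, here is a rough approximation in plain Python. It is not torchtext's actual implementation: `simple_tokenize` is a hypothetical helper that only lowercases the text and splits off punctuation, and it will differ from `basic_english` on some inputs (for example tokens such as `@-@`).

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches runs of letters/digits; [^\w\s] matches each punctuation mark on its own.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("The game's opening theme was sung by May'n.")
print(tokens)
```

On this sentence the sketch splits the apostrophes into separate tokens, the same way the tokenized output above does.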


#### Constructing the vocabulary.

In [6]:
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_dataset['train']['tokens'], 
min_freq=3) # Hyperparameter to only include words that have been recorded to appear at least 3 times. 
vocab.insert_token('<unk>', 0) # Unknown string            
vocab.insert_token('<eos>', 1) # End of string so that our model knows when to stop the sentence
vocab.set_default_index(vocab['<unk>'])   
print(f"The length of the vocab is {len(vocab)}")                         
print(vocab.get_itos()[:10])    

The length of the vocab is 29473
['<unk>', '<eos>', 'the', ',', '.', 'of', 'and', 'in', 'to', 'a']
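
The idea behind the vocabulary can be sketched without torchtext. The snippet below is a simplified stand-in, not the library's code: `build_vocab` and `encode` are hypothetical helpers, and breaking frequency ties alphabetically is my own assumption. It keeps only tokens seen at least `min_freq` times, puts the special tokens first, and maps anything else to `<unk>`.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=3, specials=("<unk>", "<eos>")):
    """Return (itos, stoi) for tokens that appear at least min_freq times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent tokens first; ties are broken alphabetically (an arbitrary choice for this sketch).
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def encode(tokens, stoi):
    """Map tokens to indices, falling back to <unk> for unseen or rare tokens."""
    unk = stoi["<unk>"]
    return [stoi.get(t, unk) for t in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"],
          ["the", "cat", "ran"], ["the", "cat", "sat"]]
itos, stoi = build_vocab(corpus, min_freq=3)
print(itos)
print(encode(["the", "dog", "sat", "<eos>"], stoi))
```

Here `dog` occurs only once, so it falls below `min_freq` and is encoded as index 0, the `<unk>` token.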


#### DataLoader

In [7]:
def get_data(dataset, vocab, batch_size):
    data = []                                                   
    for example in dataset:
        if example['tokens']:                                      
            tokens = example['tokens'] + ['<eos>'] # Append the <eos> (end of sentence) token so the model can learn where each sentence ends.
            tokens = [vocab[token] for token in tokens] # Encode each token as its integer index in the vocabulary
            data.extend(tokens)                                    
    data = torch.LongTensor(data)                                 
    num_batches = data.shape[0] // batch_size 
    data = data[:num_batches * batch_size]                       
    data = data.view(batch_size, num_batches)          
    return data

batch_size = 128
train_dataloader = get_data(tokenized_dataset['train'], vocab, batch_size)
validation_dataloader = get_data(tokenized_dataset['validation'], vocab, batch_size)
test_dataloader = get_data(tokenized_dataset['test'], vocab, batch_size)
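
To make the reshaping step in `get_data` concrete, here is a small sketch using plain Python lists instead of tensors. `batchify` is a hypothetical helper that mirrors the trim-then-`view(batch_size, num_batches)` logic: the token stream is cut to a multiple of `batch_size` and laid out as `batch_size` contiguous rows.

```python
def batchify(stream, batch_size):
    """Split a flat token stream into batch_size contiguous rows of equal length."""
    num_columns = len(stream) // batch_size
    # Drop the leftover tokens that do not fill a complete row.
    trimmed = stream[: num_columns * batch_size]
    return [trimmed[r * num_columns:(r + 1) * num_columns] for r in range(batch_size)]

stream = list(range(10))        # stand-in for 10 encoded tokens
rows = batchify(stream, batch_size=3)
for row in rows:
    print(row)
```

With 10 tokens and 3 rows, the last token is dropped and each row holds 3 consecutive tokens, so every row is an independent slice of the corpus that the model reads left to right.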

#### Modelling

#### The model architecture is as follows:
##### 1) Embedding Layer with E (Embedding Length) 
##### 2) LSTM Layer with H (Hidden Length) 
##### 3) Dense Layer for classification with V (Vocab Size) 

In [8]:
class LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_length, hidden_length, num_layers, dropout, tie_weights):
        super().__init__()
        self.embedding_length = embedding_length
        self.hidden_length = hidden_length
        self.tie_weights = tie_weights
        self.num_layers = num_layers

        self.embedding_layer = nn.Embedding(vocab_size, embedding_length)
        self.LSTM_layer = nn.LSTM(input_size = embedding_length, hidden_size = hidden_length, num_layers = num_layers, dropout = dropout, batch_first = True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_length, vocab_size)

        if tie_weights:
            assert embedding_length == hidden_length, 'cannot tie, check dims'
            self.embedding_layer.weight = self.fc.weight # Share one weight matrix between the embedding and output layers
        self.init_weights() # Initializing the weights 

    def forward(self, x, hidden):
        embedding = self.dropout(self.embedding_layer(x))
        output, hidden = self.LSTM_layer(embedding, hidden)          
        output = self.dropout(output) 
        prediction = self.fc(output)
        return prediction, hidden


    def init_weights(self):
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hidden_length)
        self.embedding_layer.weight.data.uniform_(-init_range_emb, init_range_emb)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        # Assigning into `all_weights` does not change the LSTM's parameters,
        # so initialize the weight matrices in place instead.
        for name, param in self.LSTM_layer.named_parameters():
            if 'weight' in name:
                nn.init.uniform_(param, -init_range_other, init_range_other)

    def init_hidden(self, batch_size, device):
        hidden_state = torch.zeros(self.num_layers, batch_size, self.hidden_length).to(device)
        cell_state = torch.zeros(self.num_layers, batch_size, self.hidden_length).to(device)
        return hidden_state, cell_state

    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden = hidden.detach()
        cell = cell.detach()
        return hidden, cell

In [9]:
vocab_size = len(vocab)
embedding_dim = 500           
hidden_dim = 1024               
num_layers = 3                  
dropout_rate = 0.4             
tie_weights = False                  
lr = 1e-3  

model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, tie_weights).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

summary(model)

Layer (type:depth-idx)                   Param #
LSTM                                     --
├─Embedding: 1-1                         14,736,500
├─LSTM: 1-2                              23,044,096
├─Dropout: 1-3                           --
├─Linear: 1-4                            30,209,825
Total params: 67,990,421
Trainable params: 67,990,421
Non-trainable params: 0
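
As a sanity check, the numbers in this summary can be reproduced by hand. The sketch below assumes the standard PyTorch `nn.LSTM` parameter layout (four gates per layer, each with input weights, recurrent weights, and two bias vectors); `lstm_param_count` is a helper I made up for the arithmetic.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Count the parameters of a stacked, unidirectional LSTM with biases."""
    total = 0
    for layer in range(num_layers):
        # Only the first layer sees the embeddings; deeper layers see hidden states.
        layer_input = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights, and two bias vectors.
        total += 4 * (layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total

vocab_size, embedding_dim, hidden_dim, num_layers = 29473, 500, 1024, 3

embedding_params = vocab_size * embedding_dim
lstm_params = lstm_param_count(embedding_dim, hidden_dim, num_layers)
linear_params = hidden_dim * vocab_size + vocab_size   # weights plus biases

print(embedding_params, lstm_params, linear_params)
print(embedding_params + lstm_params + linear_params)
```

The three per-layer counts and their sum match the table above. Note that the output layer alone holds about 30 million parameters, which is why weight tying (when the embedding and hidden sizes match) can shrink the model considerably.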

In [10]:
def get_batch(data, seq_len, num_batches, idx):
    x = data[:, idx:idx+seq_len]                   
    target = data[:, idx+1:idx+seq_len+1]             
    return x, target
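
The key point of `get_batch` is that the target is the input shifted one token to the right, so at every position the model is asked to predict the next token. Here is the same slicing on plain Python lists (`get_batch_rows` is a hypothetical list-based stand-in for the tensor version):

```python
def get_batch_rows(rows, seq_len, idx):
    """Return (input, target) windows; the target is the input shifted by one token."""
    x = [row[idx:idx + seq_len] for row in rows]
    target = [row[idx + 1:idx + seq_len + 1] for row in rows]
    return x, target

rows = [[10, 11, 12, 13, 14, 15, 16],
        [20, 21, 22, 23, 24, 25, 26]]

x, target = get_batch_rows(rows, seq_len=3, idx=0)
print(x)
print(target)
```

Each target row starts one position later than its input row, which is also why the very last column of the data can only ever serve as a target.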

<p style="text-align: center; font-size:30px;">Training</p>

In [11]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    
    epoch_loss = 0
    model.train()
    # Trim trailing columns so that (num_batches - 1) is a multiple of seq_len
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)
    
    for idx in tqdm(range(0, num_batches - 1, seq_len)):  # The last batch can't be a src
        optimizer.zero_grad()
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, num_batches, idx)
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)               

        prediction = prediction.reshape(batch_size * seq_len, -1)   
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, num_batches, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

n_epochs = 50
seq_len = 50
clip = 0.25
saved = False

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

if saved:
    model.load_state_dict(torch.load('best-val-lstm_lm.pt',  map_location=device))
    test_loss = evaluate(model, test_dataloader, criterion, batch_size, seq_len, device)
    print(f'Test Perplexity: {math.exp(test_loss):.3f}')
else:
    best_valid_loss = float('inf')

    for epoch in range(n_epochs):
        print(f"======Epoch {epoch+1}======")
        train_loss = train(model, train_dataloader, optimizer, criterion, 
                    batch_size, seq_len, clip, device)
        valid_loss = evaluate(model, validation_dataloader, criterion, batch_size, 
                    seq_len, device)
        
        lr_scheduler.step(valid_loss)

        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'best-val-lstm_lm.pt')

        print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
        print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')



100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 1213.781
	Valid Perplexity: 896.809


100%|██████████| 324/324 [01:38<00:00,  3.29it/s]


	Train Perplexity: 1158.066
	Valid Perplexity: 878.153


100%|██████████| 324/324 [01:38<00:00,  3.28it/s]


	Train Perplexity: 1160.220
	Valid Perplexity: 888.603


100%|██████████| 324/324 [01:38<00:00,  3.28it/s]


	Train Perplexity: 1141.997
	Valid Perplexity: 864.661


100%|██████████| 324/324 [01:38<00:00,  3.28it/s]


	Train Perplexity: 1117.261
	Valid Perplexity: 858.625


100%|██████████| 324/324 [01:38<00:00,  3.28it/s]


	Train Perplexity: 1119.864
	Valid Perplexity: 862.287


100%|██████████| 324/324 [01:38<00:00,  3.28it/s]


	Train Perplexity: 1115.668
	Valid Perplexity: 851.669


100%|██████████| 324/324 [01:38<00:00,  3.27it/s]


	Train Perplexity: 1102.258
	Valid Perplexity: 851.166


100%|██████████| 324/324 [01:38<00:00,  3.28it/s]


	Train Perplexity: 1088.205
	Valid Perplexity: 844.091


100%|██████████| 324/324 [01:38<00:00,  3.28it/s]


	Train Perplexity: 1080.856
	Valid Perplexity: 843.298


100%|██████████| 324/324 [01:39<00:00,  3.27it/s]


	Train Perplexity: 959.814
	Valid Perplexity: 622.714


100%|██████████| 324/324 [01:39<00:00,  3.27it/s]


	Train Perplexity: 688.391
	Valid Perplexity: 525.406


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 577.821
	Valid Perplexity: 472.945


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 530.332
	Valid Perplexity: 440.938


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 485.691
	Valid Perplexity: 416.889


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 444.472
	Valid Perplexity: 391.856


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 412.965
	Valid Perplexity: 372.361


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 387.044
	Valid Perplexity: 355.871


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 365.191
	Valid Perplexity: 342.793


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 348.253
	Valid Perplexity: 331.485


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 329.209
	Valid Perplexity: 321.557


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 314.230
	Valid Perplexity: 313.152


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 300.481
	Valid Perplexity: 305.241


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 287.860
	Valid Perplexity: 297.518


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 276.885
	Valid Perplexity: 291.250


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 266.246
	Valid Perplexity: 285.927


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 256.688
	Valid Perplexity: 280.797


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 247.557
	Valid Perplexity: 275.824


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 238.901
	Valid Perplexity: 272.721


100%|██████████| 324/324 [01:39<00:00,  3.26it/s]


	Train Perplexity: 231.019
	Valid Perplexity: 268.392


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 224.137
	Valid Perplexity: 262.616


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 216.994
	Valid Perplexity: 260.741


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 210.426
	Valid Perplexity: 257.038


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 204.251
	Valid Perplexity: 257.405


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 199.760
	Valid Perplexity: 254.185


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 195.506
	Valid Perplexity: 252.541


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 192.407
	Valid Perplexity: 250.304


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 189.359
	Valid Perplexity: 247.229


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 186.543
	Valid Perplexity: 246.006


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 183.780
	Valid Perplexity: 244.227


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 181.098
	Valid Perplexity: 247.215


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 179.313
	Valid Perplexity: 237.937


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 177.965
	Valid Perplexity: 236.884


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 176.487
	Valid Perplexity: 236.541


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 175.372
	Valid Perplexity: 235.661


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 174.105
	Valid Perplexity: 234.049


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 172.769
	Valid Perplexity: 233.062


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 171.296
	Valid Perplexity: 232.815


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 170.020
	Valid Perplexity: 232.615


100%|██████████| 324/324 [01:39<00:00,  3.25it/s]


	Train Perplexity: 168.659
	Valid Perplexity: 231.258
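
The perplexities printed above are just the exponential of the average cross-entropy loss (in nats), which is what `math.exp(train_loss)` computes. A minimal sketch, where `perplexity` is a hypothetical helper that takes the probability the model assigned to each correct token:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the correct token probability 1/4 has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# A model guessing uniformly over the 29,473-word vocabulary has perplexity equal to the vocabulary size.
vocab_size = 29473
uniform = perplexity([1 / vocab_size] * 5)
print(uniform)
```

One way to read the number: a validation perplexity of about 231 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 231 words, compared with 29,473 for a model that has learned nothing.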


<p style="text-align: center; font-size:30px;">Evaluation</p>

In [12]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  
            prediction = torch.multinomial(probs, num_samples=1).item()    
            
            while prediction == vocab['<unk>']:
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:
                break

            indices.append(prediction)

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

prompt = 'Think about'
max_seq_len = 30
seed = 0

temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, 
                          vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
think about a result of the earth , it was a good , and the death of the hnc ' s most @-@ time and the album , which had the only

0.7
think about his time , he was able to continue to agree the role , and the jin ' s tom sample , and the team ' s political ever added to

0.75
think about his time , he was able to examine the movement of a theme of the jin . the tom sample of the crab is not a political ever added to

0.8
think about his time , though i were arranged by the movement of a male friend , as the death condom of the hnc , and did not not tried to be

1.0
think about his conduct canterbury play .
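
To see why the temperature changes the samples, here is a small sketch of the `softmax(prediction / temperature)` step from `generate`, written in plain Python. The logits are made-up numbers for a three-word vocabulary, and `softmax_with_temperature` is a hypothetical helper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract the max to avoid overflow in exp
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                  # made-up scores for three candidate words
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A low temperature concentrates probability on the top-scoring word, giving safer and more repetitive text, while a high temperature flattens the distribution and makes unusual words more likely.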



#### As we can see, the generated text is far from the quality of the language models you see online, but I hope it gives you a rough idea of how language models operate.
#### Of course, the LSTM is no longer the state of the art; newer architectures such as transformers capture context and meaning far better.