<a href="https://colab.research.google.com/github/rhiosutoyo/Teaching-Deep-Learning-and-Its-Applications/blob/main/9_3_generative_networks_language_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generative Networks: Language Modeling

This notebook trains a word-level language model with an LSTM in PyTorch. It begins by downloading a text dataset (the Cinderella story) and preprocessing it: the text is split on whitespace and each distinct word is mapped to an integer index. The `TextDataset` class then creates input-target pairs, where each target is the input window shifted one word ahead.
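
To make the input-target pairing concrete, here is a minimal sketch on a toy sentence. The sentence and the helper name `make_pairs` are illustrative assumptions, not part of the notebook; the slicing mirrors what `TextDataset.__getitem__` does.

```python
from collections import Counter

# Toy corpus, assumed for illustration only.
toy_text = "the prince danced with the girl"
toy_words = toy_text.split()

# Same frequency-sorted vocabulary construction as in the notebook.
counts = Counter(toy_words)
toy_vocab = sorted(counts, key=counts.get, reverse=True)
toy_word_to_idx = {word: i for i, word in enumerate(toy_vocab)}
toy_encoded = [toy_word_to_idx[w] for w in toy_words]

def make_pairs(encoded, sequence_length):
    """Return (input, target) windows; the target is the input shifted by one."""
    pairs = []
    for i in range(len(encoded) - sequence_length):
        pairs.append((encoded[i:i + sequence_length],
                      encoded[i + 1:i + sequence_length + 1]))
    return pairs

pairs = make_pairs(toy_encoded, sequence_length=4)
for inp, tgt in pairs:
    print(inp, "->", tgt)
# [0, 1, 2, 3] -> [1, 2, 3, 0]
# [1, 2, 3, 0] -> [2, 3, 0, 4]
```

With six words and a window of four, only two pairs exist, which matches `len(text) - sequence_length` in `TextDataset.__len__`.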

Then, the LSTM model is defined with embedding, LSTM, and fully connected layers. The training loop optimizes the model with Adam over 200 epochs using cross-entropy loss. A text generation function then samples new word sequences from the trained model; the starting words must be in the vocabulary.
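
As a rough illustration of how tensors flow through the three layers and into the loss, the sketch below pushes one random batch through stand-alone layers. The sizes are small placeholder values chosen for the example (assumptions), not the notebook's hyperparameters, and the inputs are random indices rather than real text.

```python
import torch
import torch.nn as nn

# Placeholder sizes for illustration only.
vocab_size, embedding_dim, hidden_dim, num_layers = 30, 8, 16, 2
batch_size, sequence_length = 2, 4

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)

# Random word indices standing in for a batch of inputs and targets.
inputs = torch.randint(0, vocab_size, (batch_size, sequence_length))
targets = torch.randint(0, vocab_size, (batch_size, sequence_length))

embedded = embedding(inputs)            # (batch, seq, embedding_dim)
lstm_out, (h_n, c_n) = lstm(embedded)   # (batch, seq, hidden_dim); state defaults to zeros
logits = fc(lstm_out)                   # (batch, seq, vocab_size)

# CrossEntropyLoss expects (N, C) logits and (N,) class indices,
# so the batch and time dimensions are flattened together.
criterion = nn.CrossEntropyLoss()
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

print(tuple(embedded.shape), tuple(lstm_out.shape), tuple(logits.shape), tuple(h_n.shape))
```

The flattening step is why the training loop calls `outputs.view(-1, vocab_size)` and `targets.view(-1)` before computing the loss.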

Lastly, the vocabulary is printed for verification.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
from torch.utils.data import Dataset, DataLoader
import requests

# Download the text data
url = "https://raw.githubusercontent.com/rhiosutoyo/Teaching-Deep-Learning-and-Its-Applications/main/dataset/cinderella-story.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
text = response.text

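# Preprocessing: split on whitespace, so punctuation stays attached to words
words = text.split()
# Build the vocabulary, most frequent words first
vocab = Counter(words)
vocab = sorted(vocab, key=vocab.get, reverse=True)
# Lookup tables between words and integer indices
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for i, word in enumerate(vocab)}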

encoded_text = [word_to_idx[word] for word in words]

# Parameters
sequence_length = 4
batch_size = 2

# Prepare the dataset
class TextDataset(Dataset):
    def __init__(self, text, sequence_length):
        self.text = text
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.text) - self.sequence_length

    def __getitem__(self, idx):
        return (
            torch.tensor(self.text[idx:idx + self.sequence_length]),
            torch.tensor(self.text[idx + 1:idx + self.sequence_length + 1]),
        )

dataset = TextDataset(encoded_text, sequence_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Define the LSTM-based language model
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden):
        x = self.embedding(x)
        out, hidden = self.lstm(x, hidden)
        out = self.fc(out)
        return out, hidden

    def init_hidden(self, batch_size):
        # Read the sizes from the LSTM itself instead of relying on global variables
        weight = next(self.parameters()).data
        shape = (self.lstm.num_layers, batch_size, self.lstm.hidden_size)
        return (weight.new_zeros(shape), weight.new_zeros(shape))

# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 10
hidden_dim = 50
num_layers = 2

model = LSTMModel(vocab_size, embedding_dim, hidden_dim, num_layers)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
num_epochs = 200
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        # Initialize hidden state with batch size
        hidden = model.init_hidden(inputs.size(0))

        # Zero the gradients
        optimizer.zero_grad()

        # The hidden state is freshly zero-initialized for every batch, so this
        # detach is a no-op here; it only matters if state is carried across batches
        hidden = tuple(each.detach() for each in hidden)

        # Forward pass
        outputs, hidden = model(inputs, hidden)
        loss = criterion(outputs.view(-1, vocab_size), targets.view(-1))

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

    # Every 10 epochs, log the loss of the last mini-batch seen in that epoch
    if epoch % 10 == 0:
        print(f'Epoch [{epoch}/{num_epochs}], Loss: {loss.item():.4f}')



Epoch [0/200], Loss: 3.4668
Epoch [10/200], Loss: 1.3209
Epoch [20/200], Loss: 0.5407
Epoch [30/200], Loss: 1.4255
Epoch [40/200], Loss: 0.4448
Epoch [50/200], Loss: 0.6190
Epoch [60/200], Loss: 0.4712
Epoch [70/200], Loss: 0.6780
Epoch [80/200], Loss: 2.1376
Epoch [90/200], Loss: 0.0535
Epoch [100/200], Loss: 0.9925
Epoch [110/200], Loss: 1.2489
Epoch [120/200], Loss: 0.6730
Epoch [130/200], Loss: 0.0051
Epoch [140/200], Loss: 0.7394
Epoch [150/200], Loss: 0.9580
Epoch [160/200], Loss: 0.0171
Epoch [170/200], Loss: 0.0275
Epoch [180/200], Loss: 0.9480
Epoch [190/200], Loss: 1.0516
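
The logged values jump around (for example 0.0051 at epoch 130 and 0.7394 at epoch 140) because each printed loss comes from only the last mini-batch of that epoch, which holds at most two sequences. One possible way to get a smoother signal is to average the per-batch losses over the whole epoch. The sketch below uses a hypothetical `RunningMean` helper and made-up loss values; neither is part of the notebook.

```python
class RunningMean:
    """Accumulates values and reports their weighted mean (hypothetical helper)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # Weight each batch loss by the number of samples it covers.
        self.total += value * n
        self.count += n

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")

# Stand-in (loss, batch size) values; in the training loop these would be
# loss.item() and inputs.size(0).
batch_losses = [(1.2, 2), (0.8, 2), (0.4, 1)]

meter = RunningMean()
for loss_value, n in batch_losses:
    meter.update(loss_value, n)

epoch_loss = meter.mean
print(f"mean loss over the epoch: {epoch_loss:.4f}")
```

In the training loop, the meter would be created at the start of each epoch and updated once per batch, with the mean printed in place of the last batch's loss.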


In [2]:
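# Text generation function
def generate_text(model, start_text, length=20):
    model.eval()
    words = start_text.split()

    # Fail early with a clear message if a starting word is not in the vocabulary
    unknown = [w for w in words if w not in word_to_idx]
    if unknown:
        raise ValueError(f"Words not in vocabulary: {unknown}")

    state_h, state_c = model.init_hidden(1)

    with torch.no_grad():  # no gradients are needed for inference
        for _ in range(length):
            # Feed the most recent words (at most sequence_length of them)
            x = torch.tensor([[word_to_idx[w] for w in words[-sequence_length:]]])
            y_pred, (state_h, state_c) = model(x, (state_h, state_c))
            last_word_logits = y_pred[0][-1]
            # Sample the next word from the predicted distribution
            p = torch.nn.functional.softmax(last_word_logits, dim=0).numpy()
            word_idx = np.random.choice(len(last_word_logits), p=p)
            words.append(idx_to_word[word_idx])

    return ' '.join(words)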

# Generate three samples from the same start text (its words are in the vocabulary)
start_text = "Cinderella was"
for _ in range(3):
    print(generate_text(model, start_text))

Cinderella was the happiest she had ever been. But as the hours would the filled the room. It was a time, in
Cinderella was reunited with the prince, and they lived happily ever after, proving that kindness was the most important thing in the
Cinderella was the happiest she had ever been. But as the hours would the filled the room. It was a kind, gentle,
