# Text Generation In Style

This project trains a recurrent neural network (here, an LSTM) on the text of "The Mysterious Island" by Jules Verne and uses it to generate new text in a similar style. The same architecture generalizes to other texts with little modification.

In [1]:
import numpy as np

import torch
import torch.nn as nn

from torch.utils.data import Dataset, DataLoader

from torch.distributions.categorical import Categorical

In [2]:
device = "cuda" if torch.cuda.is_available() else 'cpu'
print(device)

cuda


In [3]:
# Read the raw text from disk (the integer encoding with numpy happens further down)
with open('The_Mysterious_Island_Jules_Verne.txt', 'r', encoding='utf8') as fp:
    text = fp.read()

start_idx = text.find('THE MYSTERIOUS ISLAND')
end_idx = text.find('End of the Project Gutenberg')

text = text[start_idx: end_idx]
char_set = set(text)

print(f'Total Length: {len(text)}')
print(f'Unique Characters: {len(char_set)}')

# Sort characters and create a map to integers
sorted_chars = sorted(char_set)
char_arr = np.array(sorted_chars)
char2int = {ch: i for i, ch in enumerate(sorted_chars)}

# Encode the text using the map
text_encoded = np.array([char2int[ch] for ch in text], dtype=np.int32)
# Demonstration
print(f'Shape of Encoding: {text_encoded.shape}')
print(f'Example encoding: {text[1080: 1090]} -> {text_encoded[1080: 1090]}')
# Decode through char_arr so the reverse mapping is actually exercised
print(f'Reverse process: {text_encoded[15: 21]} -> {"".join(char_arr[text_encoded[15: 21]])}')

Total Length: 1130779
Unique Characters: 86
Shape of Encoding: (1130779,)
Example encoding: Towns were -> [46 67 75 66 71  1 75 57 70 57]
Reverse process: [35 45 38 27 40 30] -> ISLAND


## Sequencing Strategy

The strategy is to use contiguous chunks of text as inputs: each input is a sequence of $n$ characters, and its target is the same sequence shifted by one position, covering indices 1 to $n$ instead of 0 to $n-1$. This requires the network to predict the next character at every position, based on the preceding context.
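As a small illustration of the shift, the sketch below builds input/target pairs from a short string. The names `toy_text`, `toy_seq_length`, and `make_pairs` are hypothetical and exist only for this example; the notebook itself works on the integer-encoded text with a much larger $n$.

```python
# Toy stand-in for the full text (hypothetical; illustration only)
toy_text = "THE MYSTERIOUS ISLAND"
toy_seq_length = 5                    # plays the role of n
toy_chunk_size = toy_seq_length + 1   # each chunk holds n + 1 characters


def make_pairs(text, chunk_size):
    """Return (input, target) pairs built from overlapping chunks of `text`."""
    pairs = []
    for i in range(len(text) - chunk_size + 1):
        chunk = text[i: i + chunk_size]
        # Input drops the last character, target drops the first
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


pairs = make_pairs(toy_text, toy_chunk_size)
for inp, tgt in pairs[:2]:
    print(repr(inp), '->', repr(tgt))
```

Each target character is the character that follows the input character at the same position, so a sequence of length $n$ provides $n$ next-character predictions.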

In [4]:
# This is "n"
seq_length = 40
chunk_size = seq_length + 1

# Form overlapping chunks from the text
text_chunks = [text_encoded[i: i + chunk_size] for i in range(len(text_encoded) - chunk_size + 1)]

class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks
    
    def __len__(self):
        return len(self.text_chunks)
    
    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()

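# Stack the chunks into a single array first; building a tensor from a list of arrays is slow
seq_data = TextDataset(text_chunks=torch.tensor(np.array(text_chunks)))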

# Show the first two (input, target) pairs
for i, (seq, tgt) in enumerate(seq_data):
    print(f"Input: {repr(''.join(char_arr[seq]))}")
    print(f"Target: {repr(''.join(char_arr[tgt]))}\n")
    if i == 1:
        break

batch_size = 64
torch.manual_seed(42)
seq_dl = DataLoader(seq_data, batch_size=batch_size, shuffle=True, drop_last=True, pin_memory=True)

Input: 'THE MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOU'
Target: 'HE MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS'

Input: 'HE MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS'
Target: 'E MYSTERIOUS ISLAND ***\n\nTHE MYSTERIOUS '





In [5]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(input_size=embed_dim, hidden_size=rnn_hidden_size, batch_first=True)
        self.fc = nn.Linear(in_features=rnn_hidden_size, out_features=vocab_size)
    
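    def forward(self, x, hidden, cell):
        # x: (batch,) character indices for a single time step
        out = self.embedding(x).unsqueeze(1)                  # (batch, 1, embed_dim)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))   # (batch, 1, rnn_hidden_size)
        out = self.fc(out).reshape(out.size(0), -1)           # (batch, vocab_size) logits
        return out, hidden, cell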
    
    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden, cell

vocab_size = len(char_arr)
embed_dim = 256
rnn_hidden_size = 512

torch.manual_seed(42)
model = RNN(vocab_size=vocab_size, embed_dim=embed_dim, rnn_hidden_size=rnn_hidden_size)
model = model.to(device)
print(model)

RNN(
  (embedding): Embedding(86, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=86, bias=True)
)


In [6]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3)

torch.manual_seed(42)

num_epochs = 10000

for epoch in range(num_epochs):
    hidden, cell = model.init_hidden(batch_size=batch_size)
    hidden, cell = hidden.to(device), cell.to(device)
    
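    # Each "epoch" here is one randomly drawn mini-batch, not a full pass over the data;
    # a fresh shuffled iterator is created on every step
    seq_batch, target_batch = next(iter(seq_dl))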
    seq_batch, target_batch = seq_batch.to(device), target_batch.to(device)    

    optimizer.zero_grad()
    loss = 0

    # Feed the true characters one step at a time and accumulate the per-step loss
    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
        loss += loss_fn(pred, target_batch[:, c])
    
    loss.backward()
    optimizer.step()

    loss = loss.item() / seq_length

    if epoch % 500 == 0:
        print(f'Epoch {epoch} loss: {loss:.4f}')

Epoch 0 loss: 4.4663
Epoch 500 loss: 1.4206
Epoch 1000 loss: 1.2193
Epoch 1500 loss: 1.2246
Epoch 2000 loss: 1.2146
Epoch 2500 loss: 1.2477
Epoch 3000 loss: 1.1983
Epoch 3500 loss: 1.1828
Epoch 4000 loss: 1.2234
Epoch 4500 loss: 1.1991
Epoch 5000 loss: 1.2375
Epoch 5500 loss: 1.1971
Epoch 6000 loss: 1.2376
Epoch 6500 loss: 1.1642
Epoch 7000 loss: 1.1411
Epoch 7500 loss: 1.1703
Epoch 8000 loss: 1.2136
Epoch 8500 loss: 1.1398
Epoch 9000 loss: 1.1769
Epoch 9500 loss: 1.1582


## Generating New Text

The trained RNN returns one logit per character in the vocabulary (86 here), and a softmax converts these logits to probabilities. Taking the character with the highest logit always yields the single most likely next character. To introduce variation in the generated text, we can instead sample the next character from the distribution that these probabilities define. This can be achieved with the `Categorical` class from `torch.distributions.categorical`.
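The difference between always picking the largest logit and sampling can be sketched on a tiny example. The four-element `toy_logits` vector below is made up for illustration and does not come from the trained model.

```python
import torch
from torch.distributions.categorical import Categorical

# Hypothetical logits over a four-character vocabulary (illustrative values only)
toy_logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

# Softmax turns logits into probabilities that sum to 1
probs = torch.softmax(toy_logits, dim=0)

# Greedy decoding: always pick the most likely character
greedy_idx = int(torch.argmax(toy_logits))

# Stochastic decoding: draw indices from the distribution defined by the logits
torch.manual_seed(0)
dist = Categorical(logits=toy_logits)
samples = dist.sample((1000,))
counts = torch.bincount(samples, minlength=len(toy_logits))

print(f'Probabilities: {probs}')
print(f'Greedy choice: {greedy_idx}')
print(f'Sample counts: {counts}')
```

Greedy decoding returns the same index every time, while the sampled counts should roughly follow the softmax probabilities, which is what lets less likely characters appear in the generated text.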

In [23]:
@torch.no_grad()  # generation needs no gradients
def sample(model, starting_str, len_generated_text=500, scale_factor=1.0):
    encoded_input = torch.tensor([char2int[s] for s in starting_str])
    encoded_input = torch.reshape(encoded_input, (1, -1)).to(device)
    generated_str = starting_str

    model.eval()

    hidden, cell = model.init_hidden(1)
    hidden, cell = hidden.to(device), cell.to(device)

    for ctr in range(len(starting_str) - 1):
        _, hidden, cell = model(encoded_input[:, ctr].view(1), hidden, cell)
    
    last_char = encoded_input[:, -1]

    for i in range(len_generated_text):
        logits, hidden, cell = model(last_char.view(1), hidden, cell)
        logits = torch.squeeze(logits, 0)

        # The higher the scale factor, the more the largest probabilities dominate
        # High scale factor = sampling more of the large logits
        scaled_logits = logits * scale_factor

        m = Categorical(logits=scaled_logits)
        last_char = m.sample()
        # .item() yields a plain Python int, which indexes the numpy array on any device
        generated_str += str(char_arr[last_char.item()])
    
    return generated_str
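
To see what `scale_factor` does to the sampling distribution, the sketch below applies the same multiplication to a made-up logit vector. `toy_logits` and `scaled_probs` are hypothetical names used only for this illustration.

```python
import torch

# Hypothetical logits over a four-character vocabulary (illustrative values only)
toy_logits = torch.tensor([2.0, 1.0, 0.5, -1.0])


def scaled_probs(logits, scale_factor):
    """Probabilities obtained after multiplying the logits by scale_factor."""
    return torch.softmax(logits * scale_factor, dim=-1)


for sf in (0.5, 1.0, 3.0):
    print(f'scale_factor={sf}: {scaled_probs(toy_logits, sf)}')
```

A factor above 1 concentrates probability on the largest logit, giving more predictable text, while a factor below 1 flattens the distribution and gives more varied, more error-prone text.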

In [28]:
torch.manual_seed(42)

print("Scale Factor of 1.0")
print(sample(model, starting_str='The island'))
print()
print('Higher Scale Factor: 3.0')
print(sample(model, starting_str='The island', scale_factor=3.0))

print()
print('Lower Scale Factor: 0.5')
print(sample(model, starting_str='The island', scale_factor=0.5))

Scale Factor of 1.0
The island.

The stranger had not orden to get his bottle glass, to a cinde bub. At enant sea.”

“The brig, for the world kyut in the work.”

“‘others to hight back of cisadly sharponed, measured in this aspen to time by latitudes large infligence, through the subman. Lastly. In the west ever of the wingtones of the inhabited cord, or
bounding to use the new longipal to beat.

The unfortunate disappearance.”

“Alas. The attackmen performation an islet, you at each said,--

“He tooly they led the little.”


Higher Scale Factor: 3.0
The island was accompanied him to the sea had been so many very slippered to Granite House the sea had been provided a complish to the sea had been transport. The next day, the 29th of October to the sea which had been easy to form the sun would be seen that the
colonists were all reached the coast was then took the wind beam to the northeast was about to make a few minutes were to be feared the corral, and the settlers were considered the