# This notebook contains an implementation of RNN-based language modelling.
At its core, a language model is a sequence classifier that takes all tokens produced so far as input and outputs a probability distribution over all possible next tokens (a token can be a word, a character, or something in between). We can then either take the classifier's most likely token as the next token, or sample from the candidates according to the distribution the classifier specifies. The distribution is also often reshaped before sampling by dividing the classifier's raw scores (logits) by a *model temperature* before the softmax is applied. We do not implement this here.
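
To make the difference between greedy choice, sampling, and temperature concrete, here is a minimal sketch in plain Python (no PyTorch needed). It is not part of the model in this notebook; the names `softmax_with_temperature` and `pick_next` and the example scores are assumptions made up for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next(logits, temperature=1.0, greedy=False, rng=random):
    """Return the index of the next token, either greedily or by sampling."""
    probs = softmax_with_temperature(logits, temperature)
    if greedy:
        # "best possible guess": the index with the highest probability
        return max(range(len(probs)), key=probs.__getitem__)
    # proportional sampling: draw one index according to the probabilities
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.5)  # sharper distribution
warm = softmax_with_temperature(logits, temperature=2.0)  # flatter distribution
print(cold, warm, pick_next(logits, greedy=True))
```

A temperature below 1 concentrates probability mass on the most likely token, while a temperature above 1 flattens the distribution and makes sampling more varied.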

In fact, the probability distribution comes for free when we build a neural classifier with a softmax output activation. Nothing therefore changes compared to "before", when we simply built classifiers. Once the model is trained, we can repeatedly ask it for the next token and append that token to the context (the state of the model). This is called "autoregressive sequence generation".
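
The generation loop itself can be sketched without any neural network. The toy "model" below is a hand-written table of next-token probabilities that looks only at the previous token; `TOY_MODEL` and `generate_greedy` are hypothetical names for this illustration, and a real RNN would condition on the whole context through its hidden state.

```python
START, END = "<s>", "</s>"

# Hypothetical next-token distributions, keyed by the previous token only.
TOY_MODEL = {
    START: {"h": 1.0},
    "h": {"e": 0.9, "l": 0.1},
    "e": {"l": 1.0},
    "l": {"l": 0.4, "o": 0.6},
    "o": {END: 1.0},
}

def generate_greedy(model, max_length=20):
    """Repeatedly pick the most likely next token and append it to the context."""
    context = [START]
    while len(context) - 1 < max_length:
        distribution = model[context[-1]]
        next_token = max(distribution, key=distribution.get)
        if next_token == END:
            break  # the model decided that the sequence is complete
        context.append(next_token)
    return "".join(context[1:])  # drop the start symbol

print(generate_greedy(TOY_MODEL))
```

Because this table only sees one token of context, it cannot count the two "l"s in "hello" and produces "helo". That limitation is what the recurrent state of the RNN is meant to overcome.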

However, training as before is quite inefficient for an RNN-based language model: we do not want to re-encode the full input for every next-token prediction. Instead, we process each full sequence in one pass, predicting the next token at every position while feeding in the correct next token regardless of what was predicted (this is often called teacher forcing). This makes the training more efficient, but the code more difficult to read. (And it is still very slow...)
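
The trick amounts to shifting the sequence by one position: the input at each step is the correct previous token, and the target is the token that follows it. A minimal sketch in plain Python, where `teacher_forcing_pairs` is a hypothetical helper used only for illustration:

```python
START, END = "<s>", "</s>"

def teacher_forcing_pairs(sentence):
    """Pair every input token with the token the model should predict next."""
    tokens = [START] + list(sentence) + [END]
    inputs = tokens[:-1]   # everything except the end symbol
    targets = tokens[1:]   # the same sequence, shifted left by one
    return list(zip(inputs, targets))

for given, expected in teacher_forcing_pairs("hello"):
    print(f"{given!r:>6} -> {expected!r}")
```

A sentence of n characters thus yields n + 1 training predictions from a single pass over the sequence, instead of n + 1 separate encodings of ever longer prefixes.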

In [1]:
import torch
import torch.nn as nn
import ipywidgets as widgets
import random
import matplotlib.pyplot as plt
from collections import defaultdict

In [19]:
# load some text data; we'll try to model that below (or simpler alternatives)

START_SYMBOL = "<s>"
END_SYMBOL = "</s>"

with open('data/merkel-de.txt', 'r') as f: # should be a simple plain text file
    data = f.read()
characters = set(data)
characters = list(sorted(characters))
characters.append(START_SYMBOL)
characters.append(END_SYMBOL)
characters.remove('\n')
NUM_CHARACTERS = len(characters)
sentences = data.splitlines()
int2char = list(characters)
char2int = {c:i for i,c in enumerate(characters)}
print(characters)
print(sentences[0:4])

[' ', '!', '"', '#', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xad', '½', 'Ä', 'É', 'Ö', 'Ü', 'ß', 'á', 'ä', 'ç', 'è', 'é', 'ê', 'ï', 'ò', 'ó', 'ô', 'ö', 'ú', 'ü', 'ă', 'ć', 'ę', 'ğ', 'ł', 'ń', 'ō', 'ř', 'ś', 'ž', '‐', '–', '‘', '’', '‚', '“', '”', '„', '…', '<s>', '</s>']
['Liebe Mitbürgerinnen und Mitbürger, jetzt geht es los.', 'Der Anstoß zur Fußball-Weltmeisterschaft steht unmittelbar bevor.', 'Millionen haben auf diesen Augenblick gewartet - nicht nur in Deutschland, sondern in der ganzen Welt.', 'Vor dem Eröffnungsspiel gegen Costa Rica bin ich noch einmal mit Jürgen Klinsmann und unserer Nationalmannschaft zusammengetroffen.']


In [9]:
INPUT_SIZE = NUM_CHARACTERS
EMBED_SIZE = 32
HIDDEN_SIZE = 64
LAYERS = 2
MAX_GENERATION_LENGTH = 400
# okay, what's a recurrent neural network anyway? see https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/recurrent-neural-network/recurrent_neural_networks
NUM_CLASSES = NUM_CHARACTERS

class LM(nn.Module):
    def __init__(self):
        super(LM, self).__init__()
        self.embed = torch.nn.Embedding(INPUT_SIZE, EMBED_SIZE)
        self.rnn = nn.LSTM(EMBED_SIZE, HIDDEN_SIZE, LAYERS)
        self.final_layer = nn.Linear(HIDDEN_SIZE, NUM_CLASSES)

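    def forward(self, xs : torch.Tensor):
        """Process a whole (unbatched) sequence of token indices at once."""
        xs = self.embed(xs)
        rnn_outputs, _ = self.rnn(xs)
        # rows are sequence positions, columns are classes: normalize over dim=1
        results = nn.functional.log_softmax(self.final_layer(rnn_outputs), dim=1)
        return results

    def forwardx(self, xs : torch.Tensor):
        """Same result as forward(), but stepping through the sequence one token at a time."""
        xs = self.embed(xs)
        # an LSTM state is a tuple (h_n, c_n); each has shape (LAYERS, HIDDEN_SIZE) for unbatched input
        state = (torch.zeros(LAYERS, HIDDEN_SIZE), torch.zeros(LAYERS, HIDDEN_SIZE))
        rnn_outputs = []
        for x in xs:
            x = x[None,:]  # add a sequence dimension of length 1
            rnn_output, state = self.rnn(x, state)
            rnn_outputs.append(rnn_output)
        rnn_outputs = torch.cat(rnn_outputs)
        results = nn.functional.log_softmax(self.final_layer(rnn_outputs), dim=1)
        return results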

    @torch.no_grad()  # no gradients are needed while generating
    def generate(self, xs=torch.tensor([char2int[START_SYMBOL]]), sample="max") -> torch.Tensor:
        """sample can be "max" or "prop" for max likelihood or proportional sampling"""
        classification = None
        h_n = None  # None lets the LSTM start from an all-zero state
        output = []
        xs = self.embed(xs)
        # stop at the end symbol or when the length limit is reached
        while ((classification is None) or (classification.item() != char2int[END_SYMBOL])) and (len(output) < MAX_GENERATION_LENGTH):
            rnn_outputs, h_n = self.rnn(xs, h_n)
            logits = self.final_layer(rnn_outputs[-1])
            if sample == "max":
                classification = torch.argmax(logits)
            elif sample == "prop":
                classification = torch.multinomial(nn.functional.softmax(logits, dim=0), 1)[0]
            else:
                raise ValueError("only max and prop are possible values for sample!")
            output.append(classification)
            # feed the chosen token back in as the next input
            xs = self.embed(classification)[None,:]
        # strip the end symbol, but keep the last token if we stopped at the length limit
        if output and output[-1].item() == char2int[END_SYMBOL]:
            output = output[:-1]
        output = torch.stack(output) if output else torch.tensor([], dtype=torch.long)
        return output

In [None]:
training_data = ["hello"] * 10
#training_data = ["abcdefghijklmnopqrstuvwxyz"] * 20
#training_data = ["Möglicherweise haben Sie bei einem Fußballspiel schon einmal etwas von einer Bananenflanke gehört."] * 50
#training_data = sentences[0:100] * 50
MAX_EPOCHS = 30

def to_vector(sentence : str, noend=False) -> torch.Tensor:
    sentence = [START_SYMBOL] + list(sentence)
    if not noend:
        sentence.append(END_SYMBOL)
    return torch.tensor([char2int[c] for c in sentence])

lm = LM()
optimizer = torch.optim.Adam(lm.parameters())

def training(training_data, validation_data=()):
    training_data = [to_vector(s) for s in training_data]
    validation_data = [to_vector(s) for s in validation_data]  # currently unused
    for epoch in range(MAX_EPOCHS):
        print("Epoch {} starting".format(epoch+1))
        random.shuffle(training_data)
        for s in training_data:
            optimizer.zero_grad()
            all_input = s[:-1]
            all_targets = s[1:]
            outputs = lm(all_input)
            losses = nn.functional.nll_loss(outputs, all_targets)
            losses.backward()
            optimizer.step()
        print("forced: " + "".join([int2char[x] for x in torch.argmax(lm(training_data[0][:-1]), dim=1)]))
        print("freemax:" + "".join([int2char[x] for x in lm.generate()]))
        print("fresamp:" + "".join([int2char[x] for x in lm.generate(sample="prop")]))
    return lm


lm = training(training_data)

Epoch 1 starting
forced: Dch wirl  dass iiele Grfeitseren dendeceeisann  ianen  der dinder distnnt  dissich nn drgendstneenttittn   dht egieilienssansce dnd drail gee tidadnd dst disit din ueseegedür dis snternehren 
freemax:ensend sie der einen wir die Gesundheitsscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitscheitschei
fresamp:vachze welter kchzeinivent und Miblangenigeg beweben den undeslürger EsUndwerhafter Pürgen aben Malt eg E-underm licht est flat Verung zu Ech den Erarmarn gangen
Epoch 2 starting
forced: Das wlterngeld  sas wir terane Veschlossen haben</s> dst ein uichtigen peit agszur besceren keraingarkeit aon derun dnd samitie 
freemax:empt.
fresamp:pozk
Epoch 3 starting
forced: