# Neural Language Model - Basic (Word Prediction Example)

This example shows a simple language model.<br>
In general, language models are used for a variety of NLP tasks, such as translation, transcription, summarization, and question answering.

To get started, here we just train a language model for text generation (i.e., next-word prediction) with a primitive neural network.

Unlike the previous examples (exercises 01 to 04), the language model has to recognize the order of words in the sequence. (In this example, you don't need another special architecture, such as 1D convolution, to detect the sequence of words.)<br>
RNN-based architectures (such as LSTM and GRU) can also be used to train more advanced language models. Furthermore, transformer-based models are widely used in today's SOTA language models.<br>
You will see these advanced language models in the later exercises. (See exercises 06 - 09.)<br>
In this example, I'll simply apply a primitive feed-forward network.

See the following diagram for the entire network in this primitive example.<br>
First, the sequence of the last 5 words is embedded into a list of vectors. The embedded vectors are then concatenated into a single vector, and this vector is used to predict the next word.

![Model in this exercise](images/language_model_beginning.png)
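
As a rough illustration of the embed-and-concatenate step, here is a minimal pure-Python sketch. The toy sizes (`vocab_size = 10`, `embedding_dim = 4`), the random embedding table, and the helper name `embed_and_concat` are assumptions for illustration only; the actual model later in this notebook uses PyTorch layers.

```python
import random

random.seed(0)

vocab_size = 10      # toy value (assumption)
embedding_dim = 4    # toy value (assumption)
context_size = 5     # the model looks at the last 5 words

# toy embedding table: one random vector per word index
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed_and_concat(word_ids):
    """Look up each word's vector and join them into one flat vector."""
    embedded = [embedding_table[i] for i in word_ids]   # 5 vectors of size 4
    return [value for vector in embedded for value in vector]

context = [2, 7, 3, 1, 9]
flat = embed_and_concat(context)
print(len(flat))  # context_size * embedding_dim = 20

# The same words in a different order produce a different input vector,
# which is how this network can take word order into account.
print(flat == embed_and_concat(list(reversed(context))))
```

Because each position in the context occupies its own slice of the concatenated vector, the following linear layer can learn position-specific weights.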

Note that, as a result, this model ignores any context older than the last 5 words.<br>
For example, even when the following sentence is given,

"In the United States, the president has now been"

it ignores the context "In the United States", because the network only looks at the last 5 words. (It might therefore predict an incorrect word in this context, and the accuracy in this example won't be very high. We will address this problem in later examples.)

Nevertheless, neural language models generalize to unseen data better than traditional statistical models. For instance, if "red shirt" and "blue shirt" occur in the training set, "green shirt" (which is not seen in the training set) can also be predicted by the trained neural model, because the model learns that "red", "blue", and "green" occur in similar contexts.

As you can see in this example, a language model can be trained on large unlabeled data (no labeled data is needed), and this approach is very important for the growth of today's neural language models. This learning method is called **self-supervised learning**.<br>
Many of today's SOTA models (such as BERT, T5, and GPT-2) learn many language properties from a large corpus in this self-supervised way (for example, by masked-word prediction or next-word prediction), and can then be fine-tuned for specific downstream tasks with a small amount of labeled data by transfer learning.

As you saw in the [custom embedding example](./02_custom_embedding.ipynb), word embeddings are also a byproduct of this example.

> Note : In the examples of this repository, I'll apply **word-level (word-to-word)** tokenization, but you can also use a **character-level (character-to-character)** model, which can handle unseen words by using signals such as prefixes (e.g., "un...", "dis..."), suffixes (e.g., "...ed", "...ing"), capitalization, or the presence of certain characters (e.g., hyphens, digits).<br>
> Subword tokenization is the popular method used in today's architectures (such as Byte Pair Encoding in GPT-2), in which a set of commonly occurring word segments (like "cious", "ing", "pre", etc) is included in the vocabulary list.<br>
> See [here](https://tsmatz.wordpress.com/2022/10/24/huggingface-japanese-ner-named-entity-recognition/) for SentencePiece tokenization in non-English languages.
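
To make the difference between word-level and character-level tokens concrete, here is a small sketch in plain Python. The training text and the new word are made-up toy values, and subword tokenization (such as BPE) is not shown.

```python
training_text = "the sun is new"   # toy training text (assumption)
new_word = "unseen"                # a word that never appears in training

# word-level: one token per space-separated word
word_vocab = set(training_text.split(" "))

# character-level: one token per character (spaces included)
char_vocab = set(training_text)

# the new word is unknown at word level ...
print(new_word in word_vocab)                   # False

# ... but every one of its characters is already in the character vocabulary
print(all(c in char_vocab for c in new_word))   # True
```

The trade-off is sequence length: a character-level model has to process many more tokens for the same text.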

*back to [index](https://github.com/tsmatz/nlp-tutorials/)*

## Install required packages

In [None]:
!pip install torch pandas numpy nltk

## Prepare data

As in [this example](./03_word2vec.ipynb), here I also use the short description text in the news category dataset.<br>
Before starting, please download [News_Category_Dataset_v2.json](https://www.kaggle.com/datasets/rmisra/news-category-dataset/versions/2) (collected from HuffPost) from Kaggle.

In [2]:
import pandas as pd

df = pd.read_json("News_Category_Dataset_v2.json",lines=True)
train_data = df["short_description"]
train_data

0         She left her husband. He killed their children...
1                                  Of course it has a song.
2         The actor and his longtime girlfriend Anna Ebe...
3         The actor gives Dems an ass-kicking for not fi...
4         The "Dietland" actress said using the bags is ...
                                ...                        
200848    Verizon Wireless and AT&T are already promotin...
200849    Afterward, Azarenka, more effusive with the pr...
200850    Leading up to Super Bowl XLVI, the most talked...
200851    CORRECTION: An earlier version of this story i...
200852    The five-time all-star center tore into his te...
Name: short_description, Length: 200853, dtype: object

To get better performance (accuracy), we standardize the input text as follows.
- Convert all words to lowercase in order to reduce the number of distinct words
- Replace "-" (hyphen) with a space
- Remove all punctuation except " ' " (e.g., Ken's bag) and "&" (e.g., AT&T)

In [3]:
train_data = train_data.str.lower()
train_data = train_data.str.replace("-", " ", regex=True)
train_data = train_data.str.replace(r"[^'\&\w\s]", "", regex=True)
train_data = train_data.str.strip()
train_data

0         she left her husband he killed their children ...
1                                   of course it has a song
2         the actor and his longtime girlfriend anna ebe...
3         the actor gives dems an ass kicking for not fi...
4         the dietland actress said using the bags is a ...
                                ...                        
200848    verizon wireless and at&t are already promotin...
200849    afterward azarenka more effusive with the pres...
200850    leading up to super bowl xlvi the most talked ...
200851    correction an earlier version of this story in...
200852    the five time all star center tore into his te...
Name: short_description, Length: 200853, dtype: object
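
To see the effect of these standardization rules on a single string, the following sketch applies the same steps with Python's `re` module instead of pandas. The function name `standardize` and the sample sentence are assumptions for illustration.

```python
import re

def standardize(text):
    text = text.lower()                      # lowercase everything
    text = text.replace("-", " ")            # hyphen -> space
    text = re.sub(r"[^'\&\w\s]", "", text)   # drop punctuation except ' and &
    return text.strip()

print(standardize("AT&T's Five-Time All-Star, Ken's bag!"))
# at&t's five time all star ken's bag
```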

Finally we add ```<start>``` and ```<end>``` tokens to each sequence as follows, because they carry important information for learning the ordered sequence.

```this is a pen``` --> ```<start> this is a pen <end>```

In [4]:
train_data = [" ".join(["<start>", x, "<end>"]) for x in train_data]
# print first row
train_data[0]

'<start> she left her husband he killed their children just another day in america <end>'

## Generate sequence inputs

As in the previous examples, we generate a sequence of word indices (i.e., tokenize) from the text.

![Index vectorize](images/index_vectorize.png)

First we create a vocabulary list (```vocab```).

In [5]:
from nltk.tokenize import SpaceTokenizer

###
# define Vocab
###
class Vocab:
    def __init__(self, list_of_sentence, tokenization, special_token, max_tokens=None):
        # count vocab frequency
        vocab_freq = {}
        tokens = tokenization(list_of_sentence)
        for t in tokens:
            for vocab in t:
                if vocab not in vocab_freq:
                    vocab_freq[vocab] = 0 
                vocab_freq[vocab] += 1
        # sort by frequency
        vocab_freq = {k: v for k, v in sorted(vocab_freq.items(), key=lambda i: i[1], reverse=True)}
        # create vocab list
        self.vocabs = [special_token] + list(vocab_freq.keys())
        if max_tokens:
            self.vocabs = self.vocabs[:max_tokens]
        self.stoi = {v: i for i, v in enumerate(self.vocabs)}

    def _get_tokens(self, list_of_sentence):
        for sentence in list_of_sentence:
            tokens = tokenizer.tokenize(sentence)
            yield tokens

    def get_itos(self):
        return self.vocabs

    def get_stoi(self):
        return self.stoi

    def append_token(self, token):
        self.vocabs.append(token)
        self.stoi = {v: i for i, v in enumerate(self.vocabs)}

    def __call__(self, list_of_tokens):
        def get_token_index(token):
            if token in self.stoi:
                return self.stoi[token]
            else:
                return 0
        return [get_token_index(t) for t in list_of_tokens]

    def __len__(self):
        return len(self.vocabs)

###
# generate Vocab
###
max_word = 50000

# create tokenizer
tokenizer = SpaceTokenizer()

# define tokenization function
def yield_tokens(data):
    for text in data:
        tokens = tokenizer.tokenize(text)
        yield tokens

# build vocabulary list
vocab = Vocab(
    train_data,
    tokenization=yield_tokens,
    special_token="<unk>",
    max_tokens=max_word,
)

# get list for index-to-word, and word-to-index.
itos = vocab.get_itos()
stoi = vocab.get_stoi()
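
The idea behind this vocabulary can be sketched on a toy corpus with only the standard library. This is a simplified stand-in for the `Vocab` class, not a replacement; the names `toy_itos`, `toy_stoi`, and `encode` and the two sentences are assumptions for illustration.

```python
from collections import Counter

corpus = [
    "<start> the cat sat <end>",
    "<start> the dog sat <end>",
]

# count how often each token occurs
counts = Counter(token for sentence in corpus for token in sentence.split(" "))

# index 0 is reserved for unknown words; the rest are ordered by frequency
toy_itos = ["<unk>"] + [token for token, _ in counts.most_common()]
toy_stoi = {token: index for index, token in enumerate(toy_itos)}

def encode(sentence):
    """Map each token to its index, falling back to 0 (<unk>) for unseen words."""
    return [toy_stoi.get(token, 0) for token in sentence.split(" ")]

print(toy_itos)
print(encode("the bird sat"))  # [2, 0, 3]
```

Here "bird" never appears in the toy corpus, so it is mapped to index 0, the same fallback that `Vocab.__call__` uses.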

In this example, we split each sentence into windows of 5 preceding words plus 1 label word (6 words in total) as follows.

![Separate words](images/separate_sequence_for_next_words.png)

In [6]:
import numpy as np

seq_len = 5 + 1
input_seq = []
for s in train_data:
    token_list = vocab(tokenizer.tokenize(s))
    for i in range(seq_len, len(token_list) + 1):
        seq_list = token_list[i-seq_len:i]
        input_seq.append(seq_list)
print("The number of training input sequence :{}".format(len(input_seq)))
input_seq = np.array(input_seq)

The number of training input sequence :3609552
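
The sliding window is easier to follow on raw tokens than on indices. The sketch below applies the same slicing to a single toy sentence; the helper name `make_pairs` is an assumption for illustration.

```python
def make_pairs(tokens, window=6):
    """Split a token list into (5 context words, 1 label word) pairs."""
    pairs = []
    for end in range(window, len(tokens) + 1):
        chunk = tokens[end - window:end]
        pairs.append((chunk[:-1], chunk[-1]))
    return pairs

tokens = "<start> she left her husband he killed <end>".split(" ")
for context, label in make_pairs(tokens):
    print(context, "->", label)

# a sentence shorter than 6 tokens yields no training pair at all
print(make_pairs("<start> of course <end>".split(" ")))  # []
```

A sentence of n tokens (including `<start>` and `<end>`) therefore contributes n - 5 training pairs when n is at least 6, and none otherwise.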


Separate into inputs and labels.

In [7]:
X, y = input_seq[:,:-1], input_seq[:,-1]

In [8]:
X

array([[    2,    70,   375,    63,   504],
       [   70,   375,    63,   504,    49],
       [  375,    63,   504,    49,   685],
       ...,
       [ 2209,  2150, 43436,  6752,  3496],
       [ 2150, 43436,  6752,  3496,     4],
       [43436,  6752,  3496,     4,  1354]])

In [9]:
y

array([  49,  685,   46, ...,    4, 1354,    1])

## Build network

Now we build the network for our primitive language model. (See above for details about this model.)

![Model in this exercise](images/language_model_beginning.png)

In [10]:
import torch
import torch.nn as nn

embedding_dim = 64

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class SimpleLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim=256):
        super().__init__()

        self.embedding = nn.Embedding(
            vocab_size,
            embedding_dim,
        )
        self.hidden = nn.Linear(embedding_dim*(seq_len - 1), hidden_dim)
        self.classify = nn.Linear(hidden_dim, vocab_size)
        self.relu = nn.ReLU()

    def forward(self, inputs):
        outs = self.embedding(inputs)
        outs = torch.flatten(outs, start_dim=1)
        outs = self.hidden(outs)
        outs = self.relu(outs)
        logits = self.classify(outs)
        return logits

model = SimpleLM(len(vocab), embedding_dim).to(device)
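
To get a feel for the size of this network, the parameter count can be worked out by hand. The sketch below assumes that the vocabulary reaches the `max_word` cap of 50000 entries; if the actual vocabulary is smaller, the numbers shrink accordingly.

```python
vocab_size = 50000     # assumption: vocabulary reaches the max_word cap
embedding_dim = 64
context_size = 5
hidden_dim = 256

# embedding table: one vector per word
embedding_params = vocab_size * embedding_dim
# hidden layer: weights + biases
hidden_params = embedding_dim * context_size * hidden_dim + hidden_dim
# output layer: weights + biases
classify_params = hidden_dim * vocab_size + vocab_size

total_params = embedding_params + hidden_params + classify_params
print(embedding_params, hidden_params, classify_params, total_params)
# 3200000 82176 12850000 16132176
```

Under this assumption, almost all parameters sit in the embedding table and the output layer, both of which scale with the vocabulary size.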

Now let's generate text with this model.<br>
The generated result is messy, because the model has not been trained yet.

In [11]:
start_index = stoi["<start>"]
end_index = stoi["<end>"]
max_output = 128

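def pred_output(sentence, progressive_output=True):
    # the prompt needs at least 4 words so that, together with <start>,
    # 5 tokens are available for the model's input
    test_seq = vocab(tokenizer.tokenize(sentence))
    test_seq.insert(0, start_index)
    for loop in range(max_output):
        # feed only the last 5 tokens (the model's context window)
        input_tensor = torch.tensor([test_seq[-5:]], dtype=torch.int64).to(device)
        # no gradients are needed for inference
        with torch.no_grad():
            pred_logits = model(input_tensor)
        # pick the token with the highest score
        pred_index = pred_logits.argmax()
        test_seq.append(pred_index.item())
        if progressive_output:
            for i in test_seq:
                print(itos[i], end=" ")
            print("\n")
        # stop when the end-of-sequence token is predicted
        if pred_index.item() == end_index:
            break
    return test_seq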

generated_seq = pred_output("in the united states president", progressive_output=False)
for i in generated_seq:
    print(itos[i], end=" ")
print("\n")

<start> in the united states president squelch evidence nemo motivator bahadur hunger bandstand bushwick innocently dixie characteristics reflecting malmö fonder bahari sketchy ladd ecuador fanciest alan snacking sheath changeorg poured barricades energised cornwall desiree auerbach fleming tbi anchorwoman lille bytesize safeguarding aflutter lollipops barman brant tim pitchers lasted uninteresting medley tj javits tim pflag indecency unfortunately chide indecency fiasco refreshment wilton postage stoudemire wwwmariasfarmcountrykitchencom ozzy wastes intagliata tablecloth conformity unhappily loop downsize seducing bagging guidelines mazes libby swap rabia bledel martyrs harkens researcher wednesday bonde ministers dixie fragile gobbler kremlin emphasizes voyager syphilis bledel skater bytesize innocently architecture cappadocia pudding temptations windy flavorwire decreases mil foolishly contours 9400 sport malmö pressed wireless eurostar gyrocopter flavorwire roundup asus aaaaaaaaaaa

## Train

Now let's train our network.

Here I just use loss and accuracy for evaluation, but evaluating a text generation task is not so easy. (Simply checking for an exact match against a reference text is not optimal.)<br>
In practice, use common metrics for language models, such as **BLEU** or **ROUGE**. (See [here](https://tsmatz.wordpress.com/2022/11/25/huggingface-japanese-summarization/) for these metrics.)
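
To give a rough idea of how such overlap metrics work, the sketch below computes the clipped n-gram precision, which is one ingredient of BLEU. It is not a full BLEU implementation (BLEU also combines several n-gram orders and applies a brevity penalty); the function names and the two toy sentences are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, where each
    n-gram is credited at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())

reference = "the cat is on the mat".split(" ")
candidate = "the the the cat".split(" ")

print(clipped_precision(candidate, reference, 1))  # 0.75
print(clipped_precision(candidate, reference, 2))
```

The clipping is what prevents a candidate from scoring well by simply repeating a common word such as "the".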

In [12]:
from torch.utils.data import DataLoader
from torch.nn import functional as F

num_epochs = 30

dataloader = DataLoader(
    list(zip(y, X)),
    batch_size=512,
    shuffle=True,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
for epoch in range(num_epochs):
    for labels, seqs in dataloader:
        # optimize
        optimizer.zero_grad()
        logits = model(seqs.to(device))
        loss = F.cross_entropy(logits, labels.to(device))
        loss.backward()
        optimizer.step()
        # calculate accuracy
        pred_labels = logits.argmax(dim=1)
        num_correct = (pred_labels == labels.to(device)).float().sum()
        accuracy = num_correct / len(labels)
        print("Epoch {} - loss: {:2.4f} - accuracy: {:2.4f}".format(epoch+1, loss.item(), accuracy), end="\r")
    print("")

Epoch 1 - loss: 6.1527 - accuracy: 0.15303
Epoch 2 - loss: 5.8232 - accuracy: 0.1746
Epoch 3 - loss: 5.4844 - accuracy: 0.1616
Epoch 4 - loss: 5.3518 - accuracy: 0.1810
Epoch 5 - loss: 5.2448 - accuracy: 0.2069
Epoch 6 - loss: 5.0907 - accuracy: 0.1789
Epoch 7 - loss: 5.2215 - accuracy: 0.1853
Epoch 8 - loss: 4.9841 - accuracy: 0.1746
Epoch 9 - loss: 4.9382 - accuracy: 0.1918
Epoch 10 - loss: 4.8539 - accuracy: 0.1918
Epoch 11 - loss: 4.8543 - accuracy: 0.1789
Epoch 12 - loss: 4.6516 - accuracy: 0.2004
Epoch 13 - loss: 4.6640 - accuracy: 0.2134
Epoch 14 - loss: 4.9017 - accuracy: 0.1875
Epoch 15 - loss: 4.5978 - accuracy: 0.2198
Epoch 16 - loss: 4.5636 - accuracy: 0.2241
Epoch 17 - loss: 4.5713 - accuracy: 0.2091
Epoch 18 - loss: 4.6641 - accuracy: 0.2263
Epoch 19 - loss: 4.6990 - accuracy: 0.1810
Epoch 20 - loss: 4.5859 - accuracy: 0.2328
Epoch 21 - loss: 4.4696 - accuracy: 0.2672
Epoch 22 - loss: 4.6469 - accuracy: 0.1832
Epoch 23 - loss: 4.4517 - accuracy: 0.2328
Epoch 24 - loss: 4.
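
Note that the training loop prints the loss and accuracy of the current mini-batch only, so the values left on each line come from the last batch of the epoch and fluctuate from epoch to epoch. One possible refinement, sketched here with a hypothetical helper `RunningMean` and made-up numbers, is to keep a sample-weighted average over the whole epoch.

```python
class RunningMean:
    """Sample-weighted running mean (hypothetical helper, not part of this notebook)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n):
        # value is a per-batch mean; n is the number of samples in that batch
        self.total += value * n
        self.count += n

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

# made-up per-batch losses and batch sizes, for illustration only
epoch_loss = RunningMean()
for batch_loss, batch_size in [(5.0, 512), (4.0, 512), (3.0, 256)]:
    epoch_loss.update(batch_loss, batch_size)

print(round(epoch_loss.mean, 4))  # 4.2
```

In the training loop, `update(loss.item(), len(labels))` could be called after each step and the mean printed once per epoch.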

## Generate text

In this example, I'll just show how the model generates a sentence by repeatedly predicting the probabilities over the vocabulary from the most recent 5 words, until it predicts the end-of-sequence token.<br>
As mentioned above, this model doesn't recognize the earlier context, because it refers to only the last 5 words.

> Note : This approach - which repeatedly picks the next word with the maximum probability at each timestep to generate a sentence - is called **greedy search**. For instance, suppose it picks the next word with probability 0.6 and then the second word with probability 0.4; the joint probability is 0.6 x 0.4 = 0.24. On the other hand, if it had picked a next word with the smaller probability 0.4, followed by a second word with a much higher probability 0.9, the joint probability would be 0.4 x 0.9 = 0.36, which is larger than the former. This example shows that the greedy search algorithm may sometimes lead to sub-optimal solutions (i.e., label-bias problems). It's known that this algorithm also tends to produce repetitive outputs.<br>
> For this reason, the greedy search algorithm is rarely used for practical inference in language models, and a popular method known as **beam search** is often used to get better solutions in production.<br>
> For simplification, **here I use the greedy search algorithm for all examples in this repository**.
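
The gap between greedy search and the best sequence can be reproduced with a toy two-step example. All probabilities below are made-up numbers for illustration, and the exhaustive search stands in for what beam search approximates by keeping only a few candidates per step.

```python
# toy conditional probabilities (made-up numbers, for illustration only)
first_step = {"a": 0.6, "b": 0.4}
second_step = {
    "a": {"x": 0.4, "y": 0.3, "z": 0.3},
    "b": {"x": 0.9, "y": 0.1},
}

def greedy():
    """Pick the most probable word at each step independently."""
    w1 = max(first_step, key=first_step.get)
    w2 = max(second_step[w1], key=second_step[w1].get)
    return (w1, w2), first_step[w1] * second_step[w1][w2]

def exhaustive():
    """Score every two-word sequence and return the most probable one."""
    best = None
    for w1, p1 in first_step.items():
        for w2, p2 in second_step[w1].items():
            if best is None or p1 * p2 > best[1]:
                best = ((w1, w2), p1 * p2)
    return best

print(greedy())      # picks ('a', 'x')
print(exhaustive())  # picks ('b', 'x'), which has a higher joint probability
```

With a beam width of 2, beam search would keep both "a" and "b" after the first step and would therefore also find the better sequence in this toy case.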

In [13]:
_ = pred_output("in the united states president", progressive_output=True)
_ = pred_output("the man has accused by", progressive_output=True)
_ = pred_output("now he was expected to", progressive_output=True)

<start> in the united states president obama 

<start> in the united states president obama ' 

<start> in the united states president obama ' s 

<start> in the united states president obama ' s inauguration 

<start> in the united states president obama ' s inauguration <end> 

<start> the man has accused by islamist 

<start> the man has accused by islamist radicals 

<start> the man has accused by islamist radicals of 

<start> the man has accused by islamist radicals of the 

<start> the man has accused by islamist radicals of the year 

<start> the man has accused by islamist radicals of the year <end> 

<start> now he was expected to be 

<start> now he was expected to be a 

<start> now he was expected to be a little 

<start> now he was expected to be a little bit 

<start> now he was expected to be a little bit of 

<start> now he was expected to be a little bit of the 

<start> now he was expected to be a little bit of the world 

<start> now he was expected to be a little b

In the following exercises, I'll refine the language model step by step.