# Generating Bulgarian text with KerasNLP and transformers

The notebook follows [this](https://keras.io/examples/generative/text_generation_gpt/) Keras tutorial, but switches the language to Bulgarian.

### Introduction

Natural language processing (NLP) has advanced significantly in recent years, driven by the development of large-scale pre-trained Transformer language models (LMs) such as BERT and GPT. Yet the NLP research community still faces challenges related to the scarcity of comprehensive and diverse datasets for pre-training Transformer models in less-resourced languages, including Bulgarian.

This project uses KerasNLP to build and train a scaled-down Generative Pre-Trained (GPT) model for Bulgarian text generation. GPT is a Transformer-based model that generates text as a continuation of a prompt.

This example demonstrates how KerasNLP tokenization, layers and metrics simplify the training process, and then shows how to generate output text using the KerasNLP sampling utilities.

### Building a Transformer for Bulgarian

#### Setup dependencies

In [7]:
import keras
import keras_nlp

import tensorflow.data as tf_data
import tensorflow.strings as tf_strings

Currently working with Keras 2.10.0 and TensorFlow 2.10.1.

#### Settings & hyperparameters

Define constants that will be used later in the notebook.

In [8]:
# Data
BATCH_SIZE = 64
MIN_STRING_LEN = 512  # Strings shorter than this will be discarded
SEQ_LEN = 128  # Length of training sequences, in tokens

# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 128
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000  # Limits parameters in model.

# Training
EPOCHS = 5

# Inference
NUM_TOKENS_TO_GENERATE = 80


#### Load data

The training and validation datasets are available [here](https://github.com/radev2711/Learning-Deep-Lerning/tree/main/data).

The dataset consists of text documents in Bulgarian on a variety of themes, collected from Wikipedia.bg and from media sites (Investor.bg, DW.bg, Sportal.bg), with the majority coming from Chitanka's Bulgarian literature category.

Currently the Bulgarian training set contains 29_970_431 characters, with 960_944 characters in the validation set. For comparison, the English training set of the original tutorial has 407_403_786 characters, with 867_476 characters in its validation set.

In [9]:
# Load the train set, filter out short lines, split it into batches, and shuffle the batches.
raw_train_ds = (
    tf_data.TextLineDataset("data/bg_train.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)

In [10]:
# Load the validation set, filter out short lines, and split it into batches.
raw_val_ds = (
    tf_data.TextLineDataset("data/bg_valid.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
)
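
One detail worth checking for Cyrillic text: `tf.strings.length` counts bytes by default (`unit="BYTE"`), and Cyrillic letters take two bytes each in UTF-8. The sketch below uses plain Python rather than TensorFlow to illustrate the difference; the sample word is a made-up example, not a line from the dataset.

```python
# Plain-Python illustration of byte length versus character length.
# The sample string is hypothetical and only serves the illustration.
sample = "здравей"

char_count = len(sample)                  # number of Unicode characters
byte_count = len(sample.encode("utf-8"))  # number of bytes in UTF-8

print(char_count, byte_count)
```

Under the byte-based default, a threshold of 512 corresponds to roughly 256 Cyrillic characters. Passing `unit="UTF8_CHAR"` to `tf_strings.length` would count characters instead.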

#### Train the tokenizer

The tokenizer is trained on the training dataset with a vocabulary size of VOCAB_SIZE, which is a tuned hyperparameter.

Limiting the vocabulary as much as possible has a large effect on the number of model parameters, but including too few vocabulary terms can lead to too many out-of-vocabulary (OOV) sub-words.

In [12]:
# Train tokenizer vocabulary
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=VOCAB_SIZE,
    lowercase=True,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)

The vocabulary data is used to initialize keras_nlp.tokenizers.WordPieceTokenizer. 

WordPieceTokenizer is an efficient implementation of the WordPiece algorithm used by BERT and other models. It will strip, lower-case and do other irreversible preprocessing operations.

In [13]:
# Load tokenizer
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)
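
To give an intuition for what a WordPiece tokenizer does with a single word, here is a minimal plain-Python sketch of greedy longest-match-first splitting. It is not the KerasNLP implementation; the function name `wordpiece_tokenize` and the tiny vocabulary are assumptions made up for the illustration.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"книг", "##а", "##и", "##та"}


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_tokenize("книгата", TOY_VOCAB))
print(wordpiece_tokenize("xyz", TOY_VOCAB))
```

A word that cannot be covered by vocabulary pieces collapses to `[UNK]`, which is why a vocabulary that is too small hurts the model.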

The datasets are preprocessed by tokenizing each line and splitting it into features and labels.

In [14]:
# packer adds a start token
start_packer = keras_nlp.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)


def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels


# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
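
The effect of `preprocess` is that the features are the labels shifted right by one position, so at every step the model sees the tokens so far and is asked for the next one. The sketch below imitates this in plain Python; `pack_with_start` is a hypothetical stand-in for the packer layer, and the token ids are made up.

```python
BOS_ID = 2  # id of "[BOS]", given the reserved token order used above
PAD_ID = 0  # id of "[PAD]"


def pack_with_start(token_ids, sequence_length):
    """Prepend the start token, then truncate or pad to a fixed length."""
    packed = [BOS_ID] + list(token_ids)
    packed = packed[:sequence_length]
    return packed + [PAD_ID] * (sequence_length - len(packed))


labels = [17, 42, 8, 99]  # hypothetical token ids for one line of text
features = pack_with_start(labels, sequence_length=4)

for seen, target in zip(features, labels):
    print(f"input token {seen:>2} -> target token {target}")
```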

#### Build the model

Create a scaled down GPT model with the following layers:

* One keras_nlp.layers.TokenAndPositionEmbedding layer, which combines the embedding for the token and its position.
* Multiple keras_nlp.layers.TransformerDecoder layers, with the default causal masking. The layer has no cross-attention when run with decoder sequence only.
* One final dense linear layer
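
Causal masking means that position i may attend only to positions up to and including i, so the model cannot look at future tokens. A minimal plain-Python sketch of such a mask follows; `causal_mask` is a hypothetical helper for illustration, not a KerasNLP function.

```python
def causal_mask(seq_len):
    """Row = query position, column = key position; True means attention is allowed."""
    return [[key <= query for key in range(seq_len)] for query in range(seq_len)]


for row in causal_mask(4):
    print(" ".join("1" if allowed else "0" for allowed in row))
```

The printed matrix is lower triangular: the first token sees only itself, the last token sees the whole sequence.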


In [15]:
inputs = keras.layers.Input(shape=(None,), dtype="int32")
# Embedding.
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.
# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

Let's take a look at our model summary. A large majority of the parameters are in the token_and_position_embedding and the output dense layer. This means that the vocabulary size (VOCAB_SIZE) has a large effect on the size of the model, while the number of Transformer decoder layers (NUM_LAYERS) doesn't affect it as much.

In [16]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddin  (None, None, 256)        1312768   
 g (TokenAndPositionEmbeddin                                     
 g)                                                              
                                                                 
 transformer_decoder (Transf  (None, None, 256)        329085    
 ormerDecoder)                                                   
                                                                 
 transformer_decoder_1 (Tran  (None, None, 256)        329085    
 sformerDecoder)                                                 
                                                                 
 dense (Dense)               (None, None, 5000)        1285000   
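
The parameter counts can be checked by hand from the hyperparameters. The sketch below is plain arithmetic; the decoder breakdown assumes a per-head size of `EMBED_DIM // NUM_HEADS`, two layer normalizations, and a two-layer feed-forward block, which is an assumption about the layer internals that happens to reproduce the count in the summary.

```python
VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 5000, 128, 256
FEED_FORWARD_DIM, NUM_HEADS, NUM_LAYERS = 128, 3, 2

# Token embedding table plus position embedding table.
embedding_params = VOCAB_SIZE * EMBED_DIM + SEQ_LEN * EMBED_DIM
# Output projection: weights plus biases.
dense_params = EMBED_DIM * VOCAB_SIZE + VOCAB_SIZE

# One decoder layer (assumed structure, see the note above).
key_dim = EMBED_DIM // NUM_HEADS
qkv = 3 * (EMBED_DIM * NUM_HEADS * key_dim + NUM_HEADS * key_dim)
attention_out = NUM_HEADS * key_dim * EMBED_DIM + EMBED_DIM
layer_norms = 2 * (2 * EMBED_DIM)
feed_forward = (EMBED_DIM * FEED_FORWARD_DIM + FEED_FORWARD_DIM) + (
    FEED_FORWARD_DIM * EMBED_DIM + EMBED_DIM
)
decoder_params = qkv + attention_out + layer_norms + feed_forward

total_params = embedding_params + dense_params + NUM_LAYERS * decoder_params
embedding_and_output_share = (embedding_params + dense_params) / total_params

print(embedding_params, decoder_params, dense_params)
print(f"embedding + output share: {embedding_and_output_share:.1%}")
```

The embedding and output layers together hold roughly 80% of the parameters, which is why VOCAB_SIZE dominates the model size.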

#### Training

Now that we have our model, let's train it with the fit() method.

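In NLP, perplexity measures how well the model predicts the input text sequence. It is the exponential of the average per-token cross-entropy, so lower values mean the model assigns a higher probability to the text.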
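
As a rough illustration of the metric, the sketch below computes perplexity in plain Python from the probabilities a model assigned to the correct tokens, skipping padding positions. The function and the numbers are made up for the example and are not the KerasNLP implementation.

```python
import math


def perplexity(correct_token_probs, is_real_token):
    """exp(mean negative log-likelihood) over the non-padding positions."""
    losses = [
        -math.log(prob)
        for prob, keep in zip(correct_token_probs, is_real_token)
        if keep
    ]
    return math.exp(sum(losses) / len(losses))


# Hypothetical probabilities of the correct token at five positions;
# the last position is padding and is ignored.
probs = [0.25, 0.25, 0.25, 0.25, 0.9]
mask = [True, True, True, True, False]

value = perplexity(probs, mask)
print(round(value, 3))
```

A model that gives every correct token a probability of 1/4 has a perplexity of 4: it is as uncertain as a uniform choice among four tokens.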

In [17]:
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1e0a02a64d0>

#### Saving and loading the model

TensorFlow and Keras allow model progress to be saved during and after training. This means a model can resume where it left off, avoiding long training times, and can be easily shared or published.

Saving the model

In [None]:
#model.save('model/model_name.keras')

Loading the model

In [None]:
#loaded_model = keras.models.load_model('model/model_name.keras')

#### Inference

With the model trained, it is time to test it out and gauge its performance. To do this, the model is seeded with an input sequence starting with the "[BOS]" token and then sampled progressively, predicting each subsequent token in a loop.

To start, let's build a prompt with the same shape as our model inputs, containing only the "[BOS]" token.

In [18]:
# The "packer" layer adds the [BOS] token for us.
prompt_tokens = start_packer(tokenizer([""]))
prompt_tokens

<tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>

For inference, the keras_nlp.samplers module can be used. Its samplers require a callback function wrapping the model we just trained. This wrapper calls the model and returns the logit predictions for the token currently being generated.

In [19]:
def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache


Creating the wrapper function is the most complex part of using these functions. With that done, let's test out the different utilities, starting with greedy search.

##### Greedy search

This method greedily picks the most probable token at each timestep. In other words, we take the argmax of the model output.
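
To see why always taking the argmax tends to loop, consider the toy sketch below. The table `TOY_NEXT` is a hypothetical stand-in for a model and conditions only on the last token, whereas the real model sees the whole sequence; the mechanism is the same, though.

```python
# Hypothetical next-token scores, keyed by the previous token only.
TOY_NEXT = {
    "[BOS]": {"не": 0.6, "да": 0.4},
    "не": {"е": 0.7, "да": 0.3},
    "е": {"ли": 0.8, "не": 0.2},
    "ли": {"не": 0.9, "да": 0.1},
    "да": {"не": 0.5, "е": 0.5},
}


def greedy_generate(start, steps):
    """Always append the highest-scoring next token."""
    tokens = [start]
    for _ in range(steps):
        scores = TOY_NEXT[tokens[-1]]
        tokens.append(max(scores, key=scores.get))
    return tokens


generated = greedy_generate("[BOS]", 6)
print(" ".join(generated))
```

Once the most probable path revisits a token, the deterministic choice repeats the same cycle forever.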

In [20]:
sampler = keras_nlp.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
)
txt = tokenizer.detokenize(output_tokens)
txt = txt.numpy()[0].decode('utf-8')  # Decode the output bytes into Cyrillic characters.
print(f"Greedy search generated text: \n{txt}\n")

Greedy search generated text: 
[BOS] — не е ли ? — каза той , — каза той . — не е ли ? — не е ли ? — каза той . — не е ли ? — каза той . — не е ли ? — каза той . — не е ли ? — не ! — отвърна той . — не е ли ? — не ! — отвърна той . — не е ли ? — не ! — отвърна той . — не е ли ? — не ! — отвърна той . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .



The greedy search starts out making some sense, but quickly starts repeating itself. This is a common problem with text generation that can be fixed by some of the probabilistic text generation utilities.

##### Beam search

At a high level, beam search keeps track of the num_beams most probable sequences at each timestep, and predicts the best next token from all sequences. It is an improvement over greedy search since it stores more possibilities. However, it is less efficient than greedy search since it has to compute and store multiple potential sequences.

Note: beam search with num_beams=1 is identical to greedy search.
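
The difference between the two methods shows up when the best first token does not lead to the best overall sequence. The sketch below is a minimal plain-Python beam search over a hypothetical two-step toy model (`TOY_MODEL` and its probabilities are made up); it is not the KerasNLP implementation.

```python
import math

# Hypothetical next-token probabilities, keyed by the sequence so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def beam_search(num_beams, steps=2):
    """Keep the num_beams highest-scoring sequences after every step."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for sequence, score in beams:
            for token, prob in TOY_MODEL[sequence].items():
                candidates.append((sequence + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]


print(beam_search(num_beams=1))  # behaves like greedy search
print(beam_search(num_beams=2))
```

With one beam the search commits to "a" (0.6) and ends with a sequence of probability 0.33. With two beams it keeps "b" alive and finds the sequence "b", "x" with probability 0.36.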

In [21]:
sampler = keras_nlp.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
txt = txt.numpy()[0].decode('utf-8')
print(f"Beam search generated text: \n{txt}\n")

Beam search generated text: 
[BOS] вълчицата продължаваше да се съобразява , че се е случвало да се съобразява , че се е случвало да се съобразява , че се е случвало да се съобразява , че се е случвало да се осъществява и да се съобразява , че трябва да се съобразява , че трябва да се осъществява и да се съобрази . . . . . . . [PAD] ! [PAD] ! [PAD] ! . . . . . [PAD] ! [PAD] ! . . . .



Similar to greedy search, beam search quickly starts repeating itself, since it is still a deterministic method.

##### Random search

Random search is a probabilistic method. At each time step, it samples the next token using the softmax probabilities provided by the model.
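
As a rough sketch of what sampling from the softmax distribution means, the plain-Python example below converts logits into probabilities and draws token ids from them. The logits and helper names are made up for the illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sample_token(logits, rng):
    """Draw one token id with probability proportional to its softmax weight."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
logits = [2.0, 1.0, 0.1, -3.0]  # hypothetical scores for a 4-token vocabulary
draws = [sample_token(logits, rng) for _ in range(1000)]
counts = [draws.count(token_id) for token_id in range(len(logits))]
print(counts)
```

Likely tokens are drawn most often, but even the least likely token keeps a non-zero chance of being picked.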

In [26]:
# !!! Sometimes breaks the kernel. !!!

sampler = keras_nlp.samplers.RandomSampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
txt = txt.numpy()[0].decode('utf-8')
print(f"Random search generated text: \n{txt}\n")

Random search generated text: 
[BOS] — исках да узнаете тази европа , със знанието ми е нужен . даия аз наистина ви казах и трудно мога да се изляза от мен . седим ли на алиомана поне затуй пък ти го направих толкова инициаторителни едновременно подеживни . трябва да ми експертите и да ти спре . . тръгна ли прекалено близкото вътре , в черноокия тя . бива ли е наистина обикновена земя , а моята смешнашко глътка апепеля . [PAD]х я гледах , вина — приятелката дойде на тия , дето все не бях ще се с нас



The random approach eliminates repetitions, but may result in some nonsensical words appearing, since any token in the vocabulary has a chance of being sampled with this method. This is addressed by the next search utility, top-k search.

##### Top-K search

Similar to random search, we sample the next token from the probability distribution provided by the model. The only difference is that here we select the top k most probable tokens and redistribute the probability mass over them before sampling. This way, we won't be sampling from low-probability tokens, and hence we should get fewer nonsensical words.
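
The filtering step can be sketched in a few lines of plain Python. The helper `top_k_probs` and the probabilities are hypothetical and only illustrate the idea; they are not the KerasNLP implementation.

```python
def top_k_probs(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in top_ids)
    return {i: probs[i] / kept_mass for i in top_ids}


probs = [0.5, 0.3, 0.1, 0.05, 0.05]  # hypothetical model output
filtered = top_k_probs(probs, k=2)
print(filtered)
```

Only token ids 0 and 1 remain, and their probabilities are rescaled so that they sum to 1 before sampling.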

In [22]:
sampler = keras_nlp.samplers.TopKSampler(k=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
txt = txt.numpy()[0].decode('utf-8')
print(f"Top-K search generated text: \n{txt}\n")

Top-K search generated text: 
[BOS] андро се разсладиха войнства на гьонг се чувствуваше , че и той се преизъртили в килим , за да замине на косъм . и когато стреля там с водопроводи , въоди селищата на коприваци , и които бяха разкъсали и изминални , а валутни , които се радваха . той се разшириха , а се изпълнява , че стреля , водородопанските им ор



##### Top-P search

Even with the top-k search, there is something to improve upon. With top-k search, the number k is fixed, which means it selects the same number of tokens for any probability distribution.

Let's consider two scenarios: one where the probability mass is concentrated over 2 words, and another where the probability mass is spread evenly across 10. Should we choose k=2 or k=10? There is no one-size-fits-all k here.

This is where top-p search comes in. Instead of choosing a k, we choose a probability p that we want the probabilities of the top tokens to sum up to. This way, k is adjusted dynamically based on the probability distribution. With p=0.9, if 90% of the probability mass is concentrated on the top 2 tokens, we keep only those 2 tokens to sample from. If instead the 90% is distributed over 10 tokens, we similarly keep the top 10 tokens to sample from. The cell below uses a stricter value, p=0.5.
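
The two scenarios can be reproduced with a small plain-Python sketch. The helper `top_p_indices` and both distributions are made up for the illustration and are not the KerasNLP implementation.

```python
def top_p_indices(probs, p):
    """Return the smallest set of most probable token ids whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= p:
            break
    return kept


# Hypothetical distributions: mass on 2 tokens versus mass spread over 10.
concentrated = [0.6, 0.35] + [0.005] * 10
spread = [0.095] * 10 + [0.005] * 10

print(len(top_p_indices(concentrated, p=0.9)))
print(len(top_p_indices(spread, p=0.9)))
```

The same p keeps 2 tokens in the first case and 10 in the second, which is the adaptive behavior a fixed k cannot provide.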

In [23]:
sampler = keras_nlp.samplers.TopPSampler(p=0.5)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
txt = txt.numpy()[0].decode('utf-8')
print(f"Top-P search generated text: \n{txt}\n")

Top-P search generated text: 
[BOS] когато стана , че елементаре ! . . . . но сега , когато се вървеят не може да не се промени да изобилие . да се усмихне , и той се припомни от всички тия думи , които се ухили , и началниците на свободата да се утвърди , че ще го успокояват с тия страшни очи . . [PAD]ът ще го вземе , че е открай ? . . . [PAD] не може да стане , защото те не е безсилно , но все пак не се е случило . [PAD] да стане , да



#### Using callbacks for text generation

These utilities can also be wrapped in a callback, which prints out a predicted sequence after every training epoch. Here is an example of a callback for top-k search:

In [24]:
class TopKTextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model using top-k."""

    def __init__(self, k):
        super().__init__()  # Let the base Callback set up its own state.
        self.sampler = keras_nlp.samplers.TopKSampler(k)

    def on_epoch_end(self, epoch, logs=None):
        output_tokens = self.sampler(
            next=next,
            prompt=prompt_tokens,
            index=1,
        )
        txt = tokenizer.detokenize(output_tokens)
        txt = txt.numpy()[0].decode('utf-8')
        print(f"Top-K search generated text: \n{txt}\n")


text_generation_callback = TopKTextGenerator(k=10)
# Dummy training loop to demonstrate callback.
model.fit(train_ds.take(1), verbose=2, epochs=2, callbacks=[text_generation_callback])

Epoch 1/2
Top-K search generated text: 
[BOS] — виж , че не си поприказват , защото няма . . не знам какво да правя . аз ще се повъртат , ако не си говоря за да разбере дали няма да си представим , а ти , а може би . . . [PAD] ! аз се опитвам да си представя как ще я навреме с теб и аз ще се свърме ! [PAD] ! [PAD] ! и като че ли ще побереш ли е ? [PAD] ! ще ти отида ! [PAD] ! — отвързаш ! и аз , като да не знам какво правиш с тебе ! и знаеш !

1/1 - 5s - loss: 3.8140 - perplexity: 67.5890 - 5s/epoch - 5s/step
Epoch 2/2
Top-K search generated text: 
[BOS] когато засущните османогнизираха се вълнуваха . той се чувствуваше как да се събират в корита . не се изгоряло от бунтовникът се разсеянова на магията в инстирумент . страдаците не можеха да направят , а несъмнено от разтърсимост от коператичната кооприна . [PAD] гишето на постройка . по - малко се бяха врушени стъкналите на благодарности , ск

1/1 - 5s - loss: 3.6728 - perplexity: 61.7761 - 5s/epoch - 5s/step


<keras.callbacks.History at 0x1e0a0a18460>

### Conclusion

This example utilizes KerasNLP layers to train a sub-word vocabulary, tokenize training data, create a miniature GPT model, and perform inference with the text generation library.

The trained model is extremely small compared to the newest GPT models. If, as reported, a dataset of 50 GB is needed to train a comprehensive language model for Bulgarian, then a model trained on a 52 MB dataset can be compared to an infant trying to speak its first words.

### References

[Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)

[GPT-3](https://arxiv.org/pdf/2005.14165.pdf)

[Transformers for Bulgarian](https://acl-bg.org/proceedings/2023/RANLP%202023/pdf/2023.ranlp-1.77.pdf)