# The GPT Language Model

## Imports

Here are the packages we need to import.

In [1]:
from nlpmodels.models import gpt
from nlpmodels.utils import train,utils,gpt_dataset,gpt_sampler
from argparse import Namespace
utils.set_seed_everywhere()

## Language Model: WikiText2

We will train our transformer model to predict the next word in the WikiText2 dataset from torchtext.
I took the first 300k from the training set to reduce computation time.

### Hyper-parameters

These are the data processing and model training hyper-parameters for this run. Note that we are running a much smaller model
than the one described in the paper, for fewer iterations, and on a CPU. This run is only meant to demonstrate that the code works.

In [2]:
args = Namespace(
        # Model hyper-parameters
        num_layers_per_stack=2,  # original value = 12
        dim_model=12,  # original value = 768
        dim_ffn=48,  # original value = 3072
        num_heads=2,  # original value = 12
        block_size=64,  # original value = 512, context window
        dropout=0.1,
        # Training hyper-parameters
        num_epochs=2,  # deliberately very short
        learning_rate=0.0,
        batch_size=32,  # original value = 64
    )
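
To give a feel for how much smaller this configuration is, here is a rough back-of-the-envelope count of the weights in the transformer blocks. This is a sketch under assumptions: it assumes a standard decoder block (four `dim_model x dim_model` attention projections plus a two-layer feed-forward network) and ignores biases, layer norms, and embeddings. The helper `approx_block_params` is hypothetical and not part of `nlpmodels`, so the library's real parameter count will differ somewhat.

```python
def approx_block_params(num_layers, dim_model, dim_ffn):
    """Approximate weight count of the transformer blocks only.

    Assumes a standard block layout; biases, layer norms and
    embeddings are deliberately left out.
    """
    attention = 4 * dim_model * dim_model   # Q, K, V and output projections
    feed_forward = 2 * dim_model * dim_ffn  # two linear layers
    return num_layers * (attention + feed_forward)


small = approx_block_params(num_layers=2, dim_model=12, dim_ffn=48)
original = approx_block_params(num_layers=12, dim_model=768, dim_ffn=3072)
print(f"this run: {small:,} weights; original values: {original:,} weights")
```

Under these assumptions the blocks in this run hold a few thousand weights, against tens of millions for the original values, which is why a CPU is enough for a demonstration.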

In [3]:
train_loader, vocab = gpt_dataset.GPTDataset.get_training_dataloader(args)
model = gpt.GPT(vocab_size = len(vocab),
            num_layers_per_stack= args.num_layers_per_stack,
            dim_model = args.dim_model,
            dim_ffn = args.dim_ffn,
            num_heads = args.num_heads,
            block_size = args.block_size,
            dropout = args.dropout)
trainer = train.GPTTrainer(args,vocab.mask_index,model,train_loader,vocab)

1lines [00:00,  3.40lines/s]


Now we will run the first step in the GPT training process, where we train the model to
maximize the objective

```
max p(x[k] | x[k-1], x[k-2], ..., x[k-block_size])
```

This is an unsupervised (more aptly described as "self-supervised") loss. After this model is trained,
we can continue training it on another problem (optionally freezing the lower layers so that only the top layers keep training).
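
To make the self-supervised setup concrete, here is a minimal sketch of how next-token training pairs can be cut from a token stream: the labels come from the text itself, so no annotation is needed. The helper `make_lm_examples` is hypothetical and not part of `nlpmodels`; the library's own dataset class may build its windows differently.

```python
def make_lm_examples(tokens, block_size):
    """Slice a token list into (context, target) pairs for next-token prediction.

    Each target is the context shifted by one position, so the label at
    position i is the token that follows position i in the original text.
    """
    examples = []
    for start in range(len(tokens) - block_size):
        context = tokens[start:start + block_size]
        target = tokens[start + 1:start + block_size + 1]
        examples.append((context, target))
    return examples


tokens = "the player advances mario to the end".split()
pairs = make_lm_examples(tokens, block_size=4)
for context, target in pairs:
    print(context, "->", target)
```

Every position in a window contributes one prediction, which is why a single window of `block_size` tokens yields `block_size` training signals.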

In [4]:
trainer.run()

[Epoch 0]:   0%|          | 10/9375 [00:12<3:08:29,  1.21s/it, loss=10.4]


KeyboardInterrupt: 
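
As a sanity check on the starting loss in the progress bar, we can compare it with what a model that guesses uniformly over the vocabulary would score. This sketch assumes the reported loss is a mean cross-entropy in nats (the usual PyTorch convention); the trainer's actual loss may differ, and the vocabulary size used below is an illustrative assumption, since `len(vocab)` is not printed in this notebook.

```python
import math


def uniform_guess_loss(vocab_size):
    """Cross-entropy (in nats) of a model that spreads probability evenly over the vocabulary."""
    return math.log(vocab_size)


def implied_vocab_size(loss):
    """Vocabulary size at which a uniform guess would produce the given loss."""
    return math.exp(loss)


reported_loss = 10.4  # starting loss shown in the progress bar
assumed_vocab_size = 33_000  # assumption for illustration, not read from the notebook

print(f"implied vocabulary size: {implied_vocab_size(reported_loss):,.0f}")
print(f"uniform-guess loss for {assumed_vocab_size:,} tokens: {uniform_guess_loss(assumed_vocab_size):.2f}")
```

If the two numbers line up with the real vocabulary size, the untrained model is starting roughly where a uniform guess would, which is the expected behaviour before any learning has happened.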

## GPT Completes A Sequence

In the spirit of Karpathy's minGPT play_char notebook, we can use greedy_sampler to see how the model
continues a sequence.

In [18]:
import torch



prompt = "Super Mario Land is a 1989 side @-@ scrolling platform video game , " \
         "the first in the Super Mario Land series " \
         ", developed and published by Nintendo as a launch title for their Game Boy " \
         "handheld game console . In gameplay similar to that of the 1985 Super Mario Bros." \
         " , but resized for the smaller device 's screen , the player advances Mario to the " \
         "end of 12 levels by moving to the right and jumping across platforms to avoid enemies" \
         " and pitfalls . "
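# Map each whitespace-separated token to its vocabulary index.
# split() without arguments avoids an empty trailing token from the prompt's final space.
prompt_tensor = torch.LongTensor([[vocab.lookup_token(s) for s in prompt.split()]])
prompt_tensor_batch = trainer._reformat_data((prompt_tensor,None))
steps = 64  # number of sampling steps
yhat_indices = gpt_sampler.greedy_sampler(model=model, x=prompt_tensor_batch, steps=steps, block_size=args.block_size, do_sample=True).src
# Convert the predicted indices back into tokens
yhat_tokens = ' '.join([vocab.lookup_index(int(idx)) for idx in yhat_indices[0]])
print(yhat_tokens)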

<unk> stringent iguana diaries stylistic erzherzog stroke erase designations fireballs clean hamilton charms ain check publicized minginish reiterating roamed township microscopically grocery nefer minarsih vidal itself laz multitude masters nut subsistence respite indisputably ancient usace 1879 imp kirkus sebastian army maturation beirut technical middle mccombe aluminum 122 fangs protracted consul maryada portion proprietors slipped angles documenting speeches whether treknation harm an carson purchased nearest persons
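
For readers curious about what a sampler of this kind does, here is a minimal pure-Python sketch of an autoregressive decoding loop. It is not the `gpt_sampler` implementation: `continue_sequence` and `toy_model` are hypothetical names, and the meaning of `do_sample` (draw from the distribution instead of taking the most likely token) is assumed to follow the minGPT convention.

```python
import random


def continue_sequence(next_token_probs, prompt, steps, block_size, do_sample=False, rng=None):
    """Extend `prompt` by `steps` token ids, one token at a time."""
    rng = rng or random.Random(0)
    sequence = list(prompt)
    for _ in range(steps):
        # The model only ever sees the most recent block_size tokens.
        context = sequence[-block_size:]
        probs = next_token_probs(context)
        if do_sample:
            # Draw the next token in proportion to its probability.
            next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        else:
            # Greedy decoding: always take the most likely token.
            next_id = max(range(len(probs)), key=probs.__getitem__)
        sequence.append(next_id)
    return sequence


def toy_model(context):
    """Hypothetical stand-in for a trained network over a 5-token vocabulary.

    It favours the token id that follows the last one, wrapping around at 5.
    """
    vocab_size = 5
    favourite = (context[-1] + 1) % vocab_size
    return [0.6 if i == favourite else 0.1 for i in range(vocab_size)]


print(continue_sequence(toy_model, prompt=[0, 1], steps=6, block_size=4))
print(continue_sequence(toy_model, prompt=[0, 1], steps=6, block_size=4, do_sample=True))
```

Greedy decoding is deterministic, while sampling trades some likelihood for variety; with a barely trained model the distribution is close to uniform, so sampled text looks like random vocabulary entries.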


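The sampled continuation of the prompt reads as random words, which is expected given that the training run shown earlier was stopped after only a few iterations. Once the model has been trained in the self-supervised phase, it can be applied to a different problem.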