In [1]:
!git clone https://github.com/karpathy/nanoGPT.git

Cloning into 'nanoGPT'...
remote: Enumerating objects: 649, done.
remote: Total 649 (delta 0), reused 0 (delta 0), pack-reused 649
Receiving objects: 100% (649/649), 935.29 KiB | 5.70 MiB/s, done.
Resolving deltas: 100% (374/374), done.


In [3]:
# Download the nanoGPT source files from Andrej Karpathy's GitHub
import urllib.request
# No trailing slash here: the f-strings below add the separator themselves
base_url = "https://github.com/karpathy/nanoGPT/raw/master"
urllib.request.urlretrieve(f"{base_url}/model.py", "model.py")
urllib.request.urlretrieve(f"{base_url}/train.py", "train.py")
urllib.request.urlretrieve(f"{base_url}/configurator.py", "configurator.py")

('configurator.py', <http.client.HTTPMessage at 0x7fcfb1b66740>)

In [3]:
!/bin/python3 train.py configs/azure_docs_training.py

Overriding config with configs/azure_docs_training.py:
out_dir = 'out-test'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10

always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'

dataset = 'azure_docs'
# gradient_accumulation_steps = 1
batch_size = 16
block_size = 256 # context of up to 256 previous characters

# # baby GPT model :)
# n_layer = 6
# n_head = 6
# n_embd = 384
# dropout = 0.2

# learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 2000
# lr_decay_iters = 5000 # make equal to max_iters usually
# min_lr = 1e-4 # learning_rate / 10 usually
# beta2 = 0.99 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
# device = 'cpu'  # run on cpu only
# compile = False # do not torch compile the model

tokens per iteration wil

# Train GPT2 from scratch

In [10]:
import os
import urllib.request
import torch

Let's use minGPT, a simple and very readable Transformer implementation by Andrej Karpathy. Have a look at the model.py code to understand how the layers are implemented. For more serious work you can use his nanoGPT (still lightweight, yet a little more complex and optimized) or look at the PyTorch nn.Transformer implementation and docs.

For this demo we will use the vectorized text we created in training_data_preparation.ipynb. It is a full snapshot of the Azure documentation in Markdown.

In [12]:
# Load vectors
train_data = torch.load("azure-docs-training.pt")
val_data = torch.load("azure-docs-validation.pt")

# Create a dataset class
from torch.utils.data import Dataset

class AzureDocsDataset(Dataset):
    def __init__(self, data, block_size):
        self.block_size = block_size
        self.data = data

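    def __len__(self):
        # Count only start positions that leave room for a full input block plus one target token
        return len(self.data) - self.block_size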

    def __getitem__(self, idx):
        input_seq = self.data[idx:idx + self.block_size]
        target_seq = self.data[idx + 1:idx + self.block_size + 1]
        return input_seq, target_seq

# Instantiate the datasets    
train_dataset = AzureDocsDataset(data=train_data, block_size=128)
val_dataset = AzureDocsDataset(data=val_data, block_size=128)

Here is an example of what it does. Say the input is 5 tokens: the target is the same sequence shifted by one position, so the model has to predict one new token (49 in the example here).

In [13]:
AzureDocsDataset(data=train_data, block_size=5).__getitem__(3)

(tensor([   25, 22134,  5984,   311,  4303]),
 tensor([22134,  5984,   311,  4303,    49]))
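To make the shift-by-one idea easier to follow without the real data, here is a minimal plain-Python sketch of the same windowing. The token ids and the helper names `make_example` and `num_full_examples` are made up for illustration and are not part of minGPT or of this notebook.

```python
# Plain-Python sketch of the sliding window used by the dataset class above.
def make_example(tokens, idx, block_size):
    """Return (input, target) where target is the input shifted by one token."""
    input_seq = tokens[idx:idx + block_size]
    target_seq = tokens[idx + 1:idx + block_size + 1]
    return input_seq, target_seq


def num_full_examples(tokens, block_size):
    """Count start positions that leave room for block_size inputs plus one target."""
    return max(len(tokens) - block_size, 0)


tokens = [10, 11, 12, 13, 14, 15, 16, 17]  # made-up token ids
x, y = make_example(tokens, idx=3, block_size=4)

# At every position t, y[t] is the token that follows the prefix x[:t + 1],
# so one window yields block_size next-token training signals.
pairs = [(x[:t + 1], y[t]) for t in range(len(x))]
```

Each window therefore teaches the model several predictions at once, one per prefix length, not only the final new token.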

Let's configure the transformer. A very small model can be trained even on a CPU. On an NVIDIA T4 with 16 GB of memory I was able to fit a model of 51M parameters (8 layers, 8 heads, embedding size 512). With an A100 I could go to the full basic GPT-2 size (12 layers, 12 heads, embedding size 768).

In [14]:
from mingpt.model import GPT

model_config = GPT.get_default_config()
model_config.model_type = None    # We will define hyperparameters explicitly
model_config.n_layer = 12         # 12 for gpt2, 36 for gpt2-large, 3 or 6 for playing
model_config.n_head = 12          # 12 for gpt2, 20 for gpt2-large, 3 or 6 for playing
model_config.n_embd = 768         # 768 for gpt2, 1280 for gpt2-large, 48 or 192 for playing
model_config.vocab_size = 50257   # the gpt2 tokenizer has 50257 tokens
model_config.block_size = 128     # same block_size as the datasets above
model = GPT(model_config)

number of parameters: 123.75M
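The parameter count can be checked by hand. The sketch below assumes the usual GPT-2 layout (a fused query/key/value projection, a 4x wide MLP, biases on all linear layers, two LayerNorms per block plus a final one) and counts the token and position embeddings but not the output head. The function name `gpt_param_count` is made up for this example.

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size):
    """Estimate transformer parameters, excluding the output projection head."""
    token_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    layer_norm = 2 * n_embd  # weight and bias
    # Attention: fused q/k/v projection plus the output projection, both with bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back, both with bias
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attn + mlp
    return token_emb + pos_emb + n_layer * block + layer_norm


gpt2_size = gpt_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=128)
t4_size = gpt_param_count(n_layer=8, n_embd=512, vocab_size=50257, block_size=128)
print(f"{gpt2_size / 1e6:.2f}M, {t4_size / 1e6:.2f}M")
```

Under these assumptions the two configurations mentioned in the text come out at 123.75M and 51.02M parameters. The number of heads does not appear in the formula, because heads only split the existing embedding dimension.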


In [17]:
from mingpt.trainer import Trainer

train_config = Trainer.get_default_config()
train_config.max_iters = 2
train_config.num_workers = 0
train_config.batch_size = 48  # Default 64 did not fit my NVIDIA T4 GPU with 16GB RAM for GPT2 size model
train_config.ckpt_path = "."
train_config.save_steps = 1
trainer = Trainer(train_config, model, train_dataset)

running on device cuda


In [18]:
def batch_end_callback(trainer):
    if trainer.iter_num % 100 == 0:
        print(f"iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}")
trainer.set_callback('on_batch_end', batch_end_callback)

trainer.run()

iter_dt 0.00ms; iter 0: train loss 9.69587


In [9]:
prompt = "To configure Azure Virtual Network "

# Tokenize the prompt
import tiktoken
enc = tiktoken.get_encoding("gpt2")
prompt_tokens = torch.tensor(enc.encode(prompt)).to(trainer.device)

num_samples = 1
steps = 250
do_sample = True

x = prompt_tokens.expand(num_samples, -1)
y = model.generate(x, max_new_tokens=steps, do_sample=do_sample, top_k=40)

from IPython.display import Markdown

out = enc.decode(y[0].tolist())
Markdown(out)

To configure Azure Virtual Network 

For more information, see [Create managed identity for Azure VNet](https://f2.apache.noc.io/f3b.net/current/kafka).

### Create a firewall

1. Create a virtual network with the Azure network configuration.

      To update the VNET IP address:

       dscenario: "1.0.0.0.0.0.net
                          (<IP name="passwordless)  
          * OpenShiftPort> 
                  * If you haven't set to `<i>`
                   <pGUID="my-demoserver" /> 
                       <tutorial name="Azure portal" target="true" />
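The `top_k=40` argument used for generation restricts sampling to the 40 most likely next tokens. The following plain-Python sketch illustrates the idea on a toy vocabulary. It is not minGPT's implementation, and the function name `sample_top_k` and the logits are invented for the example.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample a token index from the k highest logits, softmax-weighted."""
    # Keep only the indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the survivors (subtract the max for numerical stability)
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # made-up scores for a 5-token vocabulary
draws = [sample_top_k(logits, k=2, rng=rng) for _ in range(200)]
```

With `k=2` only tokens 0 and 3 can ever be drawn, and with `k=1` the procedure reduces to greedy decoding. A smaller `k` makes the output more conservative, while a larger `k` allows more variety and more nonsense.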

In [11]:
# Save the model
torch.save(model.state_dict(), "xazure-docs-gpt2.pth")

In [8]:
model = torch.load("xazure-docs-gpt2.pth")


AttributeError: 'collections.OrderedDict' object has no attribute 'to'
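The error above is consistent with the file containing only a state dict (it was written with `torch.save(model.state_dict(), ...)`), so `torch.load` returns an `OrderedDict` of tensors and not a model object. A minimal sketch of the usual pattern follows: rebuild the architecture first, then load the weights into it. The tiny `nn.Linear` and the `build_model` helper are stand-ins for `GPT(model_config)`, and the file name is made up.

```python
import os
import tempfile

import torch
from torch import nn


def build_model():
    # Hypothetical stand-in for GPT(model_config); any nn.Module behaves the same way
    return nn.Linear(4, 2)


# Save only the weights, as the notebook does
trained = build_model()
path = os.path.join(tempfile.mkdtemp(), "demo-weights.pth")
torch.save(trained.state_dict(), path)

# torch.load gives back the dict of tensors, not a model
state_dict = torch.load(path, map_location="cpu")

# Recreate the same architecture, then copy the weights into it
restored = build_model()
restored.load_state_dict(state_dict)
restored.eval()  # switch off dropout for inference
```

For the notebook this would mean creating `GPT(model_config)` with the same hyperparameters used for training and then calling `load_state_dict` on it.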

In [7]:
prompt = "To configure Azure Virtual Network "

# Tokenize the prompt
import tiktoken
enc = tiktoken.get_encoding("gpt2")
prompt_tokens = torch.tensor(enc.encode(prompt)).to("cuda")

num_samples = 1
steps = 250
do_sample = True

x = prompt_tokens.expand(num_samples, -1)
y = model.generate(x, max_new_tokens=steps, do_sample=do_sample, top_k=40)

from IPython.display import Markdown

out = enc.decode(y[0].tolist())
Markdown(out)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
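This error typically means the model weights live on the CPU while the prompt tensor was moved to the GPU. A minimal sketch of one way to avoid it is to pick a single device and move both the model and the inputs there. The small `nn.Embedding` is a stand-in for the GPT model, and the token ids are made up.

```python
import torch
from torch import nn

# Choose one device and use it for everything
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Embedding(10, 4)  # stand-in for the GPT model
model.to(device)
model.eval()

prompt_tokens = torch.tensor([1, 2, 3]).to(device)  # made-up token ids
x = prompt_tokens.expand(1, -1)  # add a batch dimension, as in the notebook

with torch.no_grad():
    out = model(x)
```

Hard-coding `"cuda"` for the input only works if the model was moved to the GPU as well, which is easy to forget after reloading weights.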