## Train a character-level GPT on some text data

The inputs here are plain text files, which we split into individual characters and then train a GPT on. You could call this a char-transformer instead of a char-rnn, though it doesn't quite roll off the tongue as well. In this example we feed it some Shakespeare and train it to predict the next character.

In [8]:
# set up logging
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
)

In [9]:
# make deterministic
from mingpt.utils import set_seed
set_seed(42)

In [1]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [10]:
import math
from torch.utils.data import Dataset

class CharDataset(Dataset):

    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
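
The input/target shift described in the docstring above can be sketched in plain Python, without PyTorch. The `make_pairs` helper is hypothetical and exists only for illustration; it is not part of minGPT.

```python
def make_pairs(chunk):
    """Split a chunk of block_size + 1 characters into inputs, targets and per-position examples."""
    x, y = chunk[:-1], chunk[1:]
    # position i of y is the target for the first i + 1 characters of x
    pairs = [(x[: i + 1], y[i]) for i in range(len(x))]
    return x, y, pairs

x, y, pairs = make_pairs("hello")  # block_size = 4, so the chunk has 5 characters
for context, target in pairs:
    print(f"given {context!r}, predict {target!r}")
```

One chunk of `block_size + 1` characters therefore yields `block_size` training examples from a single forward pass.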


In [11]:
block_size = 128 # spatial extent of the model for its context

In [7]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2020-09-12 23:14:38--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.192.133, 151.101.128.133, 151.101.0.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.192.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2020-09-12 23:14:38 (15.7 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [12]:
# you can download this file at https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r') as f:  # the context manager closes the file handle for us
    text = f.read()
train_dataset = CharDataset(text, block_size) # one line of poem is roughly 50 characters

data has 1115394 characters, 65 unique.
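
As a rough illustration of what the `stoi` and `itos` tables built in `CharDataset.__init__` do, here is the same construction on a short string. The variable names `text_sample`, `encoded` and `decoded` are made up for this sketch.

```python
text_sample = "O God, O God!"

# same vocabulary construction as CharDataset.__init__
chars = sorted(set(text_sample))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character

encoded = [stoi[ch] for ch in text_sample]
decoded = "".join(itos[i] for i in encoded)

print(len(chars), "unique characters")
print(encoded)
print(decoded)
```

Encoding followed by decoding returns the original string, which is what the sampling cell at the end of the notebook relies on.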


In [13]:
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  n_layer=8, n_head=8, n_embd=512)
model = GPT(mconf)

09/13/2020 03:19:28 - INFO - mingpt.model -   number of parameters: 2.535219e+07
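
The logged parameter count can be checked by hand. The sketch below assumes a standard GPT block layout: four `n_embd x n_embd` attention projections with biases, a 4x MLP with biases, two LayerNorms per block, a final LayerNorm, and an output head with no bias that is not tied to the token embedding. That layout is an assumption, not something this notebook states, and `gpt_param_count` is a hypothetical helper.

```python
def gpt_param_count(vocab_size, block_size, n_layer, n_embd):
    """Count parameters under the assumed block layout described above."""
    attn = 4 * (n_embd * n_embd + n_embd)            # query, key, value, output projections
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) \
        + (4 * n_embd * n_embd + n_embd)             # expand to 4*n_embd, project back
    norms = 2 * (2 * n_embd)                         # two LayerNorms, weight + bias each
    per_block = attn + mlp + norms

    tok_emb = vocab_size * n_embd                    # token embedding table
    pos_emb = block_size * n_embd                    # learned position embedding
    final_norm = 2 * n_embd                          # LayerNorm before the head
    head = n_embd * vocab_size                       # assumed: no bias, untied weights
    return n_layer * per_block + tok_emb + pos_emb + final_norm + head

total = gpt_param_count(vocab_size=65, block_size=128, n_layer=8, n_embd=512)
print(f"{total:e}")
```

Under these assumptions the total is 25,352,192, which agrees with the `2.535219e+07` in the log line above. `n_head` does not appear because the heads split `n_embd` rather than adding parameters.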


In [None]:
torch.cuda.max_memory_cached(0)
# 2 GPUs max_memory_cached(0) is 7637827584

In [3]:
torch.cuda.max_memory_allocated(0)
# 2 GPUs max_memory_allocated(0) = 7621877760

0

In [4]:
torch.cuda.memory_allocated(0)
# 2 GPUs cuda.memory_allocated(0)= 417400320

0

In [5]:
torch.cuda.memory_cached(0)
# 2 GPUs cuda.memory_cached(0) = 7054819328

0

In [24]:
# cached minus allocated: memory held by PyTorch's caching allocator but not currently occupied by tensors
torch.cuda.memory_cached(0) - torch.cuda.memory_allocated(0)

267155968

In [36]:
# look up the memory usage of every GPU that nvidia-smi can see on this machine

# courtesy of mjstevens777 Matt

import subprocess

def get_gpu_memory_map():
    """Get the current gpu usage.

    Returns
    -------
    usage: dict
        Keys are device ids as integers.
        Values are memory usage as integers in MiB (the unit nvidia-smi reports).
    """
    result = subprocess.check_output(
        [
            'nvidia-smi', '--query-gpu=memory.used',
            '--format=csv,nounits,noheader'
        ], encoding='utf-8')
    # Convert lines into a dictionary
    gpu_memory = [int(x) for x in result.strip().split('\n')]
    gpu_memory_map = dict(zip(range(len(gpu_memory)), gpu_memory))
    return gpu_memory_map
get_gpu_memory_map()
#{0: 7326, 1: 6992, 2: 6992, 3: 6992}

{0: 7134, 1: 6800, 2: 6800, 3: 6800}
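
The parsing step inside `get_gpu_memory_map` can be tried without a GPU by feeding it text shaped like the `nvidia-smi` output. The `parse_memory_used` helper and the `sample_output` string are illustrative; the numbers are copied from the result above.

```python
def parse_memory_used(csv_text):
    """Turn one-number-per-line nvidia-smi output into {gpu_index: memory_used}."""
    lines = csv_text.strip().splitlines()
    return {i: int(line) for i, line in enumerate(lines)}

# what `--query-gpu=memory.used --format=csv,nounits,noheader` would print for four GPUs
sample_output = "7134\n6800\n6800\n6800\n"
usage = parse_memory_used(sample_output)
print(usage)
```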

In [35]:
# print the Tensors and Parameters that are currently alive,
# using the garbage collector's book-keeping
# (snippet courtesy of Smth, PyTorch dev, Facebook AI Research)

import torch
import gc
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:  # some objects raise on attribute access; skip them
        pass

<class 'torch.Tensor'> torch.Size([4, 3])
<class 'torch.nn.parameter.Parameter'> torch.Size([1, 128, 512])
<class 'torch.nn.parameter.Parameter'> torch.Size([65, 512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512])
<class 'torch.nn.parameter.Parameter'> torch.Size([65, 512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512, 512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512])
<class 'torch.nn.parameter.Parameter'> torch.Size([512, 512])
<class 'torch.nn.par



In [37]:
# requires pynvml: pip install pynvml

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)
info = nvmlDeviceGetMemoryInfo(h)
print(f'total    : {info.total}')
print(f'free     : {info.free}')
print(f'used     : {info.used}')

# total    : 8370061312
# free     : 687865856
# used     : 7682195456
    

total    : 8370061312
free     : 889192448
used     : 7480868864
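
The raw byte counts are easier to read as GiB and as a fraction of the card. A small sketch using the numbers printed above; `summarize` is a hypothetical helper and does not call NVML itself.

```python
def summarize(total, free, used):
    """Convert NVML-style byte counts into GiB and a used fraction."""
    gib = 1024 ** 3
    return {
        "total_gib": total / gib,
        "free_gib": free / gib,
        "used_fraction": used / total,
    }

summary = summarize(total=8370061312, free=889192448, used=7480868864)
print(f"{summary['total_gib']:.2f} GiB total, "
      f"{summary['free_gib']:.2f} GiB free, "
      f"{summary['used_fraction']:.0%} used")
```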


In [34]:
import torch
from GPUtil import showUtilization as gpu_usage

print("Initial GPU Usage")
gpu_usage()                             

tensorList = []
#for x in range(10):
#    tensorList.append(torch.randn(10000000,10).cuda())   # reduce the size of tensor if you are getting OOM

print("GPU Usage after allocating a bunch of Tensors")
gpu_usage()

del tensorList

print("GPU Usage after deleting the Tensors")
gpu_usage()

print("GPU Usage after emptying the cache")
torch.cuda.empty_cache()
gpu_usage()

Initial GPU Usage
| ID | GPU | MEM |
------------------
|  0 |  0% | 99% |
|  1 |  0% | 88% |
|  2 |  0% | 88% |
|  3 |  0% | 88% |
GPU Usage after allocating a bunch of Tensors
| ID | GPU | MEM |
------------------
|  0 |  0% | 99% |
|  1 |  0% | 88% |
|  2 |  0% | 88% |
|  3 |  0% | 88% |
GPU Usage after deleting the Tensors
| ID | GPU | MEM |
------------------
|  0 |  0% | 99% |
|  1 |  0% | 88% |
|  2 |  0% | 88% |
|  3 |  0% | 88% |
GPU Usage after emptying the cache
| ID | GPU | MEM |
------------------
|  0 |  0% | 89% |
|  1 |  0% | 85% |
|  2 |  1% | 85% |
|  3 |  0% | 85% |


In [None]:

!watch -n 1 free -m


In [None]:
!watch -n 0.5 nvidia-smi

In [30]:
#Tracking Memory Usage with GPUtil


!pip install GPUtil
import GPUtil
GPUtil.showUtilization()

| ID | GPU | MEM |
------------------
|  0 |  0% | 92% |
|  1 |  0% | 88% |
|  2 |  0% | 88% |
|  3 |  0% | 88% |


In [15]:
from mingpt.trainer import Trainer, TrainerConfig

# initialize a trainer instance and kick off training
tconf = TrainerConfig(max_epochs=1, batch_size=512, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=512*20, final_tokens=2*len(train_dataset)*block_size,
                      num_workers=4)
trainer = Trainer(model, train_dataset, None, tconf)
trainer.train()

epoch 1 iter 188: train loss 2.22687. lr 5.972223e-04:   9%|▊         | 189/2179 [01:55<19:21,  1.71it/s]

KeyboardInterrupt: 
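
The numbers in the progress bar can be reproduced with a little arithmetic. The sketch below assumes the trainer uses a linear warmup followed by a cosine decay floored at 10% of the base learning rate, counted in tokens processed; that schedule is an assumption about the trainer, and `lr_at` is a hypothetical helper.

```python
import math

data_len, block_size, batch_size = 1115394, 128, 512

n_examples = data_len - block_size                  # CharDataset.__len__
steps_per_epoch = math.ceil(n_examples / batch_size)
tokens_per_step = batch_size * block_size           # B*T predictions per forward pass

def lr_at(tokens, base_lr=6e-4, warmup_tokens=512 * 20,
          final_tokens=2 * n_examples * block_size):
    """Assumed schedule: linear warmup, then cosine decay with a 10% floor."""
    if tokens < warmup_tokens:
        mult = tokens / max(1, warmup_tokens)
    else:
        progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return base_lr * mult

print(steps_per_epoch, "iterations per epoch")
print(tokens_per_step, "tokens per iteration")
print(f"lr after 189 iterations: {lr_at(189 * tokens_per_step):e}")
```

This gives 2179 iterations per epoch, matching the progress bar, and a learning rate of about 5.97e-4 after 189 batches, in line with the logged value. Note that `warmup_tokens=512*20` is smaller than a single batch of 65,536 tokens, so under this reading the warmup is over after the first step.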

In [10]:
# alright, let's sample some character-level Shakespeare
from mingpt.utils import sample

context = "O God, O God!"
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 2000, temperature=1.0, sample=True, top_k=10)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

O God, O God! that e'er this tongue of mine,
That laid the sentence of dread banishment
On yon proud man, should take it off again
With words of sooth! O that I were as great
As is my grief, or lesser than my name!
Or that I could forget
With Richmond, I'll tell you what I am,
The Lord Aumerle, .

CLAUDIO:
The prenzie Angelo!

ISABELLA:
O, 'tis the cunning livery of hell,
The damned'st body to invest and cover
In prenzie guards! Dost thou think, Claudio?
If I would yield him my virginity,
Thou mightst be freed.

CLAUDIO:
O heavens! it cannot be.

ISABELLA:
Yes, he would give't thee, from this rank offence,
So to offend him still. This night's the time
That I should do what I abhor to name,
Or else thou diest to-morrow.

CLAUDIO:
Thou shalt not do't.

ISABELLA:
O, were it but my life,
I'ld throw it down for your deliverance
As frankly as a pin.

CLAUDIO:
Thanks, dear Isabel.

ISABELLA:
Be ready, Claudio, for your death tomorrow.

CLAUDIO:
Yes. Has he affections
That profit us.

DUKE VIN
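
The `sample` call above takes `temperature` and `top_k` arguments. Here is a minimal pure-Python sketch of one sampling step with temperature scaling and top-k filtering, assuming that is roughly what those arguments control. `sample_top_k` is hypothetical and is not minGPT's implementation.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one index from logits after temperature scaling and optional top-k filtering."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # keep the top_k largest logits (ties at the cutoff are also kept); mask the rest out
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    # numerically stable softmax weights; masked entries get weight 0
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_top_k([1.0, 2.0, 3.0, 4.0], top_k=2, rng=rng) for _ in range(200)]
print(sorted(set(draws)))
```

With `top_k=2` only the two highest-scoring indices can ever be drawn. Lowering `temperature` sharpens the distribution toward the largest logit, and raising it flattens the distribution.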

In [None]:
# well that was fun