<a href="https://colab.research.google.com/github/sanzgiri/minGPT/blob/master/play_char_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Adapted from https://github.com/williamFalcon/minGPT


## Train a character-level GPT on some text data

The inputs here are plain text files, which we chop up into individual characters and then train a GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. In this example we will feed it some Shakespeare and train it to predict the text one character at a time.
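
As a minimal sketch of what "character-level" means here (the toy string and variable names are illustrative, not taken from the notebook): each distinct character gets an integer id, and the target sequence is the input shifted by one position, so the model learns to predict the next character at every position.

```python
# Toy illustration of character-level next-character targets (hypothetical data).
text = "hello world"

# Build a vocabulary of the distinct characters; sorting keeps the ids stable.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

ids = [stoi[ch] for ch in text]

# Input is everything but the last character; target is everything but the first.
x, y = ids[:-1], ids[1:]

# Show the first few (input character -> character to predict) pairs.
for xi, yi in list(zip(x, y))[:3]:
    print(repr(itos[xi]), "->", repr(itos[yi]))
```

The `CharDataset` class in this notebook builds the same kind of `x`, `y` pair, only on chunks of `block_size + 1` characters taken from the full text.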

In [16]:
!pip install pytorch_lightning==0.9.0rc16

Collecting pytorch_lightning==0.9.0rc16

In [1]:
!git clone https://github.com/williamFalcon/minGPT

fatal: destination path 'minGPT' already exists and is not an empty directory.


In [2]:
%cd minGPT

/content/minGPT


In [3]:
# make deterministic
from pytorch_lightning import seed_everything
seed_everything(42)

42

In [4]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [5]:
import math
from torch.utils.data import Dataset, DataLoader

class CharDataset(Dataset):

    def __init__(self, data, block_size):
        chars = sorted(set(data))  # sorted so the character ids are the same on every run
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))

        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data

    def __len__(self):
        return math.ceil(len(self.data) / (self.block_size + 1))

    def __getitem__(self, idx):
        # we're actually going to "cheat" and pick a spot in the dataset at random
        i = np.random.randint(0, len(self.data) - (self.block_size + 1))
        chunk = self.data[i:i+self.block_size+1]
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y

In [6]:
block_size = 128 # context length: how many characters the model sees at once

In [7]:
# download tiny shakespeare input text
! wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2020-08-20 06:18:27--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2020-08-20 06:18:27 (9.54 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [8]:
text = open('input.txt', 'r').read() # don't worry we won't run out of file handles
train_dataset = CharDataset(text, block_size) # one line of poem is roughly 50 characters
train_loader = DataLoader(train_dataset, batch_size=256, num_workers=4)

data has 1115394 characters, 65 unique.
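
The printed statistics pin down how much one "epoch" covers. A quick check using the numbers from this notebook (1,115,394 characters, a `block_size` of 128, a `batch_size` of 256); the variable names are illustrative:

```python
import math

data_size = 1_115_394   # characters in input.txt, as printed by CharDataset
block_size = 128
batch_size = 256

# CharDataset.__len__ counts chunks of block_size + 1 characters.
samples_per_epoch = math.ceil(data_size / (block_size + 1))

# DataLoader keeps the final partial batch by default (drop_last=False).
batches_per_epoch = math.ceil(samples_per_epoch / batch_size)

print(samples_per_epoch, batches_per_epoch)
```

Because `__getitem__` ignores `idx` and picks a random offset, an epoch here is a fixed number of random draws from the text, not one ordered pass over it.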


In [9]:
from mingpt.model import GPT
model = GPT(vocab_size=train_dataset.vocab_size, 
            block_size=train_dataset.block_size,
            n_layer=8, 
            n_head=8, 
            n_embd=512, 
            learning_rate=6e-4)
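
A back-of-the-envelope parameter count for this configuration is sketched below. It assumes a standard GPT-style block (two LayerNorms, four `n_embd` x `n_embd` attention projections with biases, and a 4x-wide MLP with biases) and an output head without a bias; those layout details are assumptions, not read from `mingpt.model`.

```python
vocab_size, n_layer, n_embd = 65, 8, 512

tok_emb = vocab_size * n_embd                  # token embedding table

# One transformer block under the assumed layout:
attn = 4 * (n_embd * n_embd + n_embd)          # query, key, value, output projection
mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
norms = 2 * (2 * n_embd)                       # two LayerNorms, weight + bias each
blocks = n_layer * (attn + mlp + norms)

ln_f = 2 * n_embd                              # final LayerNorm, weight + bias
head = n_embd * vocab_size                     # output projection, assumed bias-free

print(tok_emb, blocks, ln_f, head)
```

These estimates are consistent with the rounded sizes that the training summary in this notebook reports (33 K, 25 M, 1 K, 33 K). Almost all of the parameters sit in the transformer blocks.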

In [10]:
from pytorch_lightning import Trainer
from mingpt.lr_decay import LearningRateDecayCallback

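# scheduler: warm up, then decay the learning rate over the whole run.
# decay horizon in tokens = epochs * samples per epoch * block_size
# (500 is an assumption, chosen to match max_epochs below)
lr_decay = LearningRateDecayCallback(learning_rate=6e-4, warmup_tokens=512*20,
                                    final_tokens=500*len(train_dataset)*block_size)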

trainer = Trainer(gpus=1, precision=16, max_epochs=500,
                  gradient_clip_val=1.0, 
                  callbacks=[lr_decay], 
                  progress_bar_refresh_rate=1, 
                  row_log_interval=1)

trainer.fit(model, train_loader)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.

  | Name    | Type       | Params
---------------------------------------
0 | tok_emb | Embedding  | 33 K  
1 | drop    | Dropout    | 0     
2 | blocks  | Sequential | 25 M  
3 | ln_f    | LayerNorm  | 1 K   
4 | head    | Linear     | 33 K  



                    When using EvalResult(early_stop_on=X) or TrainResult(early_stop_on=X) the
                    'monitor' key of ModelCheckpoint has no effect.
                    Remove ModelCheckpoint(monitor='loss) to fix')
                





1
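
The `LearningRateDecayCallback` arguments suggest a schedule measured in tokens processed: a warmup phase, then decay toward `final_tokens`. The sketch below shows one common form of such a schedule (linear warmup, then cosine decay with a 10% floor). It is an assumption about the general shape, not a copy of the callback's code, and `lr_multiplier` is a hypothetical helper.

```python
import math

def lr_multiplier(tokens, warmup_tokens, final_tokens, floor=0.1):
    """Fraction of the base learning rate after `tokens` tokens (hypothetical helper)."""
    if tokens < warmup_tokens:
        # Linear warmup from 0 to 1.
        return tokens / max(1, warmup_tokens)
    # Cosine decay from 1 down to `floor` between warmup_tokens and final_tokens.
    progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
    progress = min(1.0, progress)
    return max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))

base_lr = 6e-4
warmup_tokens = 512 * 20
final_tokens = 1_000_000  # illustrative value, not taken from the notebook

for t in (0, warmup_tokens // 2, warmup_tokens, final_tokens // 2, final_tokens):
    print(t, base_lr * lr_multiplier(t, warmup_tokens, final_tokens))
```

For the cosine phase to be meaningful, the decay horizon has to be larger than the warmup length; otherwise the multiplier drops to the floor as soon as warmup ends.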

In [11]:
# alright, let's sample some character-level Shakespeare
from mingpt.utils import sample

context = "O God, I code but"
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(model.device)
y = sample(model, x, 1000, temperature=0.9, sample=True, top_k=5)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)
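
The `temperature` and `top_k` arguments control how the next character is drawn from the model's predicted distribution. A minimal pure-Python sketch of the general technique for a single step follows; the logits are made up, and `sample_top_k` is a hypothetical helper, not the `mingpt.utils.sample` implementation.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one index from `logits` with temperature scaling and optional top-k filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [l / temperature for l in logits]
    # Keep only the indices of the k largest scaled logits.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = order[:top_k] if top_k is not None else order
    # Softmax weights over the kept logits (subtract the max for numerical stability).
    m = max(scaled[i] for i in keep)
    weights = [math.exp(scaled[i] - m) for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]

random.seed(0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0, 0.0]  # made-up scores for 6 characters
draws = [sample_top_k(logits, temperature=0.9, top_k=3) for _ in range(1000)]
print(sorted(set(draws)))
```

With `top_k=3`, only the three highest-scoring indices can ever be drawn, which is why a small `top_k` keeps generated text from wandering into very unlikely characters.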

O God, I code but me and do impor From the profund.

WARWICK:
Now, Who not obe?

BIONDELLO:
I will not hear the state, my heart, lords,
That she is at Volscience, to state teat,
To getters to the Vold and ender at holy some to
Rathut have I wish even your lander's likeng a talkwing?

First Seenator:
I begging to you, and ye'e to more,
And show to ever pass.

LUCIO:
Is she is too sea, which of we the hat bows
ISABELLA:
Under like to she may in this dand count-sellor,
The bears of with even, we may so pirit,
Read the world Cominius even, let me she'r not for't.

BRUTUS:
Verkener for the for this, of the devil:
What would be solr, in this lawul form.

Messenger:
Pet now, my heart, but shall is liefe ames:
And therefore, in thou given liest, wheret for his larks.

LARTIUS:
Wherein she lieutenand or shorless on my son,
In a lready me to bed so, of infaict, I would tpose thee
May with thou have little wof to statury little him.

FRIAR LAURENCE:
Why, 'tis thou like: what's to 'thou livest!

L