## Dataset Preparation
As our dataset we use `tiny_shakespeare`, which consists of 40,000 lines taken from a variety of Shakespeare's plays. (It was featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/)


### Reviewing Data

In [None]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget -P ./dataset https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [None]:
# read it in to inspect it
with open("./dataset/input.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [None]:
print(f"length of dataset in characters: {len(text)}")

In [None]:
# let's look at the first 500 characters
print(text[:500])

In [None]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print(f"Vocabulary size: {vocab_size}")

### Encoder/Decoder

We want an `encode` function that maps characters to integers, and a `decode` function that maps those integers back to characters.

In [None]:
# create lookup tables in both directions: character -> integer and integer -> character
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]

# decoder: take a list of integers, output a string
decode = lambda l: "".join([itos[i] for i in l])  # ⚠️⚠️⚠️⚠️⚠️ Implement the counter part

print(encode("hii there"))
print(decode(encode("hii there")))
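The round trip between characters and integers can also be checked on a toy corpus. The following is a minimal sketch, not the notebook's own code: the names `sample_text`, `sample_stoi`, `sample_itos`, `sample_encode` and `sample_decode` are made up for illustration so they do not overwrite the variables above. It also shows that a character missing from the vocabulary cannot be encoded.

```python
# Toy corpus standing in for the Shakespeare text (assumption for illustration only).
sample_text = "hello there"
sample_chars = sorted(set(sample_text))

# Lookup tables in both directions, built the same way as stoi/itos above.
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for ch, i in sample_stoi.items()}


def sample_encode(s):
    """Map each character of s to its integer id."""
    return [sample_stoi[c] for c in s]


def sample_decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(sample_itos[i] for i in ids)


ids = sample_encode("hello")
print(ids)
print(sample_decode(ids))

# "!" never occurs in the toy corpus, so the lookup raises a KeyError.
try:
    sample_encode("hello!")
except KeyError as err:
    print(f"character not in vocabulary: {err}")
```

Because the vocabulary is built from the training text itself, every character of the dataset can be encoded; only text from elsewhere can trigger the error.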

#### Bonus: Looking at TikToken

Although we will use a simple character-level implementation, let's have a look at an existing tokenizer such as tiktoken. tiktoken is a fast [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokeniser for use with OpenAI's models.

In [None]:
import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt-2")
print(tokenizer.encode("hii there"))
print(tokenizer.decode(tokenizer.encode("hii there")))

Let's now encode the entire text dataset and store it into a `torch.Tensor`

In [None]:
import torch  # we use PyTorch: https://pytorch.org

data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:500])  # the 500 characters we looked at earlier will look like this to the GPT

<div style="text-align:center"><img align="center" src="https://media5.datahacker.rs/2021/11/Picture3.jpg" width="500"/></div><br>
Let's now split the data into training and validation sets, so that we can detect overfitting.

In [None]:
ratio = 0.9
n = int(ratio * len(data))  # first 90% will be train, rest val
train_data = data[:n]  # ⚠️⚠️⚠️⚠️⚠️ use python slicing to split the data
val_data = data[n:]  # ⚠️⚠️⚠️⚠️⚠️

We do not feed the transformer the whole text at once! That would be computationally very expensive and prohibitive. Instead, we sample random chunks from the dataset and train the transformer on a few of those at a time. The length of those chunks is called `block_size`.

In [None]:
block_size = 8
train_data[: block_size + 1]  # +1 because we need one extra token as the last target to get 8 training samples. Visualization below

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")
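The same idea can be seen with readable characters instead of integer ids. The following is a minimal sketch using a hand-picked toy string and the hypothetical names `toy_block_size`, `chunk` and `examples`; it only illustrates that a chunk of `block_size + 1` tokens yields `block_size` (context, target) pairs.

```python
toy_block_size = 8
chunk = "First Cit"  # toy string with exactly toy_block_size + 1 characters

inputs = chunk[:toy_block_size]        # everything except the last character
targets = chunk[1:toy_block_size + 1]  # the same window shifted by one position

# Each prefix of the inputs is a context; the target is the character that follows it.
examples = [(inputs[: t + 1], targets[t]) for t in range(toy_block_size)]

for context, target in examples:
    print(f"when input is {context!r} the target is {target!r}")
```

Training on all prefixes at once means the model sees contexts of every length from 1 up to `block_size`.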

We do not put just a single chunk into the transformer but a `batch` of chunks, stored in one tensor, to keep the GPUs busy. Those chunks are processed at the same time, but they are independent and don't _talk_ to each other.

In [None]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")
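The upper bound `len(data) - block_size` passed to `torch.randint` keeps the shifted target window inside the data. The following is a minimal sketch of that sampling logic in plain Python, with `random.randrange` standing in for `torch.randint` (both exclude the upper bound) and with the hypothetical names `toy_data`, `toy_block_size`, `toy_batch_size`, `starts`, `xs` and `ys`.

```python
import random

random.seed(0)

toy_data = list(range(20))  # stand-in for the encoded dataset
toy_block_size = 8
toy_batch_size = 4

# Valid start indices are 0 .. len(toy_data) - toy_block_size - 1, so that
# the target slice i+1 .. i+toy_block_size never runs past the end.
starts = [
    random.randrange(len(toy_data) - toy_block_size) for _ in range(toy_batch_size)
]

xs = [toy_data[i : i + toy_block_size] for i in starts]
ys = [toy_data[i + 1 : i + toy_block_size + 1] for i in starts]

for x, y in zip(xs, ys):
    print(x, "->", y)
```

Each row of `ys` is the corresponding row of `xs` shifted one position to the left, which is exactly the relationship between `xb` and `yb` above.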

We will now export our dataset as well as the metadata.

In [None]:
import pickle
import numpy as np

train_ids = np.array(train_data, dtype=np.uint16)
val_ids = np.array(val_data, dtype=np.uint16)

train_ids.tofile('./dataset/train.bin')
val_ids.tofile('./dataset/val.bin')

# save the meta information as well, to help us encode/decode later
meta = {
    'vocab_size': vocab_size,
    'itos': itos,
    'stoi': stoi,
}
with open('./dataset/meta.pkl', 'wb') as f:
    pickle.dump(meta, f) # ⚠️⚠️⚠️⚠️⚠️ Store meta as pickle https://docs.python.org/3/library/pickle.html#pickle.dump
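To use the exported files later, they have to be read back in. The following is a minimal sketch of such a round trip, assuming the same file layout as above (raw `uint16` ids plus a pickled metadata dictionary). It writes toy data to a temporary directory instead of `./dataset`, and the names `toy_stoi`, `toy_itos`, `toy_ids`, `loaded_ids` and `loaded_meta` are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Toy vocabulary and token ids standing in for the real dataset.
    toy_stoi = {"a": 0, "b": 1, "c": 2}
    toy_itos = {i: ch for ch, i in toy_stoi.items()}
    toy_ids = np.array([0, 1, 2, 1, 0], dtype=np.uint16)

    bin_path = os.path.join(tmp_dir, "train.bin")
    meta_path = os.path.join(tmp_dir, "meta.pkl")

    # Export, mirroring the cell above.
    toy_ids.tofile(bin_path)
    with open(meta_path, "wb") as f:
        pickle.dump({"vocab_size": 3, "itos": toy_itos, "stoi": toy_stoi}, f)

    # tofile() writes raw bytes without a header, so the dtype must be
    # specified again when reading.
    loaded_ids = np.fromfile(bin_path, dtype=np.uint16)
    with open(meta_path, "rb") as f:
        loaded_meta = pickle.load(f)

# Decode the ids with the restored lookup table.
restored = "".join(loaded_meta["itos"][int(i)] for i in loaded_ids)
print(restored)
```

`uint16` can represent ids up to 65,535, which is ample for a character-level vocabulary and halves the file size compared to a 32-bit integer type.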
