<p style="text-align:center;"> <span style="font-size:30px;"> <b> OpenAI GPT2 implementation: using Pytorch  </b> </span> </p>

![GPT2](https://www.vyrazu.com/wp-content/uploads/2021/01/Gpt2.png)

# Step 1: About GPT-2

The OpenAI GPT-2 model was proposed in *Language Models are Unsupervised Multitask Learners* by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional) transformer pretrained with a language-modeling objective on a very large corpus of ~40 GB of text data.

OpenAI's announcement of the model summarizes it as follows:

_GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data._


## Understanding the architecture

GPT-2 is based on the Transformer architecture, which was first proposed by a team of researchers at Google in their paper ['Attention Is All You Need'](https://arxiv.org/abs/1706.03762). The paper described an encoder-decoder architecture built on concepts such as self-attention and multi-head attention.

The Transformer architecture improves on RNN-based architectures such as LSTM and GRU for several reasons:
- Transformers process all tokens of an input (tokens are basically pieces of a text) in parallel rather than one at a time.
- Self-attention connects any two tokens in a constant, O(1), number of sequential operations, regardless of how far apart they are in the sequence. This makes transformers better at capturing long-range dependencies.
- With multi-head attention, the model can attend to several aspects of the input at once, which improves its expressive ability.

GPT-2 is essentially a decoder-only transformer. The model is built by stacking up the transformer decoder blocks.
<img src="https://images.contentstack.io/v3/assets/blt71da4c740e00faaa/blt1c4150bcaddeae45/601c8d573e70bb4c12c6feea/Vision-Transformer-Model-Architecture-1024x746.jpg" width="600px">

Unlike the unmasked self-attention used in a transformer encoder, GPT-2 uses **masked self-attention**. A normal self-attention block allows a position to peek at tokens to its right. Masked self-attention prevents that, so each position uses only the left context to predict the next word.
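
To make the masking concrete, here is a minimal single-head sketch in PyTorch. It is an illustration only, not GPT-2's actual implementation: the function name `causal_self_attention` and the toy shapes are assumptions, and there are no learned projections or multiple heads.

```python
import torch

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal (left-to-right) mask."""
    seq_len, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (seq_len, seq_len)
    # True above the diagonal marks "future" positions that must stay hidden.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(4, 8)                 # 4 tokens, 8-dimensional toy embeddings
out, weights = causal_self_attention(x, x, x)
print(weights)                        # entries above the diagonal are zero
```

Row `i` of `weights` has non-zero entries only in columns `0..i`, which is exactly the "left context only" behavior described above.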

Let's now try to understand the process step-by-step.

# Step 2: Importing useful libraries

In [None]:
import numpy as np
import pandas as pd
import torch
import re
import random
from numba import cuda
import gc
gc.enable()

import torch.nn as nn                    
import torch.nn.functional as F          
import torch.optim as optim               
from torch.optim import lr_scheduler
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

import transformers
import tokenizers

from sklearn.metrics import classification_report, accuracy_score
from transformers import (GPT2Config,
                          GPT2Tokenizer,
                          GPT2Model,
                          AdamW, 
                          get_cosine_schedule_with_warmup,
                          logging)

from torch.cuda.amp import GradScaler, autocast
from torch.utils.data import Dataset, DataLoader
scaler = torch.cuda.amp.GradScaler()

logging.set_verbosity_warning()
import warnings
warnings.simplefilter('ignore')

%matplotlib inline

### Some configurations for the model

In [None]:
# Look for gpu to use. Will use `cpu` by default if no gpu found.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
MAX_LENGTH = 315
TRAIN_BATCH_SIZE = 4
TEST_BATCH_SIZE = 2
EPOCHS = 4

Batch size: the right value depends on the maximum sequence length and the available GPU memory. For a sequence length of 256, a batch of 4 or 8 works without CUDA memory issues. For shorter sequences you can try a batch of 32 or higher.

Increasing the batch size directly increases the GPU memory required, and running out of GPU memory is often what prevents us from increasing it further. For that reason, a batch size below 16 is recommended for this notebook.


### Loading the data

In [None]:
train_df = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
test_df = pd.read_csv('../input/commonlitreadabilityprize/test.csv')

train_df['excerpt'] = train_df['excerpt'].astype(str)
test_df['excerpt'] = test_df['excerpt'].astype(str)

# Step 3: Preparing the data

The transformer has already been trained with a specific vocabulary, which means we need to fine-tune with the exact same vocabulary and tokenize our data the same way it was tokenized when the model was initially trained.


### What is GPT2Tokenizer?
The GPT-2 tokenizer, based on byte-level Byte-Pair Encoding (BPE), has been trained to treat spaces as part of the tokens (a bit like SentencePiece), so a word is encoded differently depending on whether it is at the beginning of the sentence (without a preceding space) or not.

#### What is BPE?
BPE is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. BPE is a middle ground between character and word-level encodings which helps it in managing the vocabulary of large corpora. This behavior also enables the encoding of any rare words in the vocabulary with appropriate subword tokens without introducing any "unknown" tokens.
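
The merge idea can be sketched in a few lines of plain Python. This is a toy version of the compression scheme described above, working on the characters of a single string. It is not the GPT-2 tokenizer's training code, and the helper names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_bpe(text, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    tokens, merges = list(text), []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges

text = "aaabdaaabac"
tokens, merges = learn_bpe(text, num_merges=3)
print(merges)   # the learned merge rules, in order
print(tokens)   # the text re-expressed with the merged symbols
```

Each merge shortens the sequence while the concatenation of the tokens still reproduces the original text, which is why frequent substrings end up as single tokens and rare words fall back to smaller pieces.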

#### Choice of the GPT2 tokenizer: 
We could have chosen another pre-trained byte-level BPE (BBPE) tokenizer for this study. The key point is to use a BBPE tokenizer trained on a huge corpus, because it can then tokenize any word of any language without using the unknown token.


### Initializing the Tokenizer
We first initialize the tokenizer from the two pre-trained files, i.e., the vocab and the merges files, by passing them directly to the GPT2Tokenizer constructor.

In [None]:
PATH = "../input/hugging-face-gpt2/gpt2"

tokenizer = GPT2Tokenizer(
            vocab_file = f'{PATH}/vocab.json', 
            merges_file= f'{PATH}/merges.txt', 
            add_prefix_space = True,
            lowercase=True,
            bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>')

# default to right padding
tokenizer.padding_side = "right"

We pad on the right because GPT-2 is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather than the left.

The special tokens (BOS, EOS and PAD) are passed to the tokenizer constructor above.

Here is what the GPT2 special tokens look like.

In [None]:
pd.DataFrame({'Token meaning': ['Beginning of sequence (BOS) token',
                                'End of sequence (EOS) token', 'Padding token']},
             index=['<|startoftext|>', '<|endoftext|>', '<|pad|>'])

But we want to define the PAD token as the EOS token.

**Why?** GPT-2 was pretrained without a padding token, so to batch padded inputs we set the tokenizer's pad token to the EOS token. This also avoids warnings about a missing pad token.

In [None]:
# Define PAD Token = EOS Token = 50256
tokenizer.pad_token = tokenizer.eos_token
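
To illustrate what right padding with the EOS id produces, here is a small stand-alone sketch. The helper `pad_right` is hypothetical (the tokenizer does this for us when called with `padding="max_length"`), and the three token ids are arbitrary example values.

```python
def pad_right(token_ids, max_length, pad_id):
    """Truncate or right-pad a list of token ids and build the matching attention mask."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

EOS_ID = 50256   # id of <|endoftext|> in the standard GPT-2 vocabulary
ids, mask = pad_right([464, 3290, 318], max_length=6, pad_id=EOS_ID)
print(ids)    # [464, 3290, 318, 50256, 50256, 50256]
print(mask)   # [1, 1, 1, 0, 0, 0]
```

Because the padded positions carry the same id as a real EOS token, the attention mask is the only thing that tells the model which positions to ignore.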

Just printing out the special tokens.

In [None]:
print(tokenizer.eos_token)

In [None]:
print(tokenizer.pad_token)

In [None]:
print(tokenizer.bos_token)

# Step 4: Creating the Input Pipeline
The input pipeline is the most complex part of the entire training process. It consists of taking our raw training data, transforming it, and loading it into a DataLoader ready for training.

## 4.1: Data Preprocessing
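GPT-2 itself requires almost no preprocessing. Steps such as lower-casing, custom tokenization and out-of-vocabulary handling are skipped in the original work because they restrict what the model can represent; without them the model can be evaluated on any language-model benchmark. In this notebook we only normalize whitespace and lower-case the excerpt before tokenizing.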

There are three main parts of this PyTorch Dataset class:

- _**init()**_ where we store the dataframe, the tokenizer and the maximum sequence length.
- _**len()**_ where we need to return the number of examples we read in. This is used when calling len(TextDataset(df)).
- _**getitem()**_ always takes as an input an int value that represents which example from our examples to return from our dataset. If a value of 3 is passed, we will return the example from our dataset at position 3, with its text transformed into token IDs.

In [None]:
class TextDataset(Dataset):
    def __init__(self, df, tokenizer = tokenizer, max_length = MAX_LENGTH):
        '''PyTorch Dataset class for loading data.
          This is where the data parsing happens.
          This class is built with reusability in mind: it can be used as is.
        '''
        
        self.df = df
        self.target = "target" in df  
        self.tokenizer = tokenizer
        self.max_length = self.tokenizer.model_max_length if max_length is None else max_length
        
    def __len__(self):
        '''When used `len` return the number of examples. '''
        return len(self.df)
    
    def get_data(self, row):    
        excerpt = " ".join(row.excerpt.lower().split())                
        encoded_input = self.tokenizer('<|startoftext|>'+ excerpt + '<|endoftext|>', truncation=True, 
                       max_length=self.max_length, padding="max_length")
        input_ids = torch.tensor(encoded_input['input_ids'], dtype=torch.long)
        attention_mask = torch.tensor(encoded_input['attention_mask'], dtype=torch.long)
        
        return input_ids, attention_mask
    
    def __getitem__(self, index):
        '''Given an index return an example from the position. '''
        
        data = {}
        row = self.df.iloc[index]        
        input_ids, attention_mask = self.get_data(row)
        data['input_ids'] = input_ids
        data['attention_mask'] = attention_mask
        if self.target:
            data['target'] = row.target
        return data
            

And now we can see the tensors we created. We are fine-tuning GPT-2 for regression, so each example provides three items:
- `input_ids`: the token IDs of the excerpt, truncated or right-padded to `MAX_LENGTH`.
- `attention_mask`: a tensor of 1s and 0s marking the positions of 'real' tokens versus padding tokens, used in attention calculations.
- `target`: the readability score to predict (present only in the training data).

In [None]:
e = TextDataset(train_df)
e[1]

In the output above, `input_ids` holds the encoded excerpt, followed by padding when the excerpt is shorter than `MAX_LENGTH`. The `attention_mask` is 1 for the excerpt tokens and 0 for the padded positions, and `target` is the score for that excerpt. Because the PAD token was set to the EOS token, the padded positions contain the `<|endoftext|>` token ID.


## 4.2: Building the DataLoader
Next, we wrap our Dataset in PyTorch torch.utils.data.DataLoader objects: one for training and one for validation in each fold, plus one for the test set.

In [None]:
def train_val_dataloaders(df, train_idx, val_idx, batch_size):
        
    train_df = df.iloc[train_idx]
    val_df = df.iloc[val_idx]
    train_loader = torch.utils.data.DataLoader(
        TextDataset(train_df), 
        batch_size=batch_size, 
        shuffle=True, #Should give better results particularly when you are running for more epochs like in training
        num_workers=2,  
        drop_last=True, pin_memory=True)
    val_loader = torch.utils.data.DataLoader(
        TextDataset(val_df),
        batch_size=batch_size, 
        shuffle=False,  # the validation order does not affect the metrics
        num_workers=2, pin_memory=False) 
    #The num_workers attribute tells the data loader instance how many sub-processes to use for data loading
    dataloaders_dict = {"train": train_loader, "val": val_loader}
    return dataloaders_dict

def test_loader(df, batch_size=TEST_BATCH_SIZE):
    loader = torch.utils.data.DataLoader(
        TextDataset(df),  # use the dataframe passed in, not the global test_df
        batch_size=batch_size, 
        shuffle=False, 
        num_workers=2, pin_memory=False)    
    return loader

Finally, our dataset is loaded into a PyTorch DataLoader object, which we use to load our data into our model during training.
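
As a quick illustration of what a DataLoader hands to the training loop, the sketch below uses a made-up `ToyDataset` whose items are dictionaries, like those returned by `TextDataset`. The sizes and values are arbitrary assumptions; the point is only that the default collate function stacks each dictionary entry into one batched tensor.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in: item i is a dict of fixed-size tensors plus a float target."""
    def __len__(self):
        return 10

    def __getitem__(self, index):
        return {
            "input_ids": torch.full((5,), index, dtype=torch.long),
            "attention_mask": torch.ones(5, dtype=torch.long),
            "target": float(index),
        }

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
batch = next(iter(loader))
print(batch["input_ids"].shape)   # one row per example in the batch
print(batch["target"])            # Python floats are collated into a 1-D tensor
print(len(loader))                # number of batches, not number of examples
```

Note that `len(loader)` counts batches while `len(loader.dataset)` counts examples, a distinction that matters when averaging losses or counting optimizer steps.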



# Step 5: Training the Model
We need two things for training: our DataLoader and a model. We have the DataLoader, but no model yet.

## 5.1: Initializing the Model
For training, we need a pre-trained GPT2Model. To create it, we first create a GPT2 config object that describes the parameters we'd like to initialize GPT2 with.

_**Config (GPT2Config)**: Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights._

Then, we import and initialize our GPT2 model, which is a transformer outputting raw hidden-states without any specific head on top.

We load the three essential parts of the pretrained GPT2 transformer: `configuration`, `tokenizer` and `model`.

While creating the model config (`config`), we specify the number of labels for our regression task, which needs only one label, so `num_labels=1`.

In [None]:
class TextModel(nn.Module):
     
    def __init__(self):
        super(TextModel, self).__init__()
        
        # Get model configuration
        config = GPT2Config.from_pretrained(
            '../input/hugging-face-gpt2/gpt2/config.json', 
            n_layer=12, n_head=12, num_labels=1, layer_norm_epsilon = 1e-7, 
            output_hidden_states=True, output_attentions = False)    
        self.gpt = GPT2Model.from_pretrained(
            '../input/hugging-face-gpt2/gpt2/pytorch_model.bin', config=config)
        
        # resize model embedding to match new tokenizer
        self.gpt.resize_token_embeddings(len(tokenizer)) #resize the dictionary size of the embedding layer        
        # fix model padding token id
        self.gpt.config.pad_token_id = self.gpt.config.eos_token_id
        for param in self.gpt.parameters():
            param.requires_grad = True    
            
        self.drop = nn.Dropout(config.attn_pdrop)
         
        self.attention = nn.Sequential(  
            nn.LayerNorm(config.hidden_size),
            nn.Linear(config.n_embd, config.n_layer),            
            nn.ReLU(),                       
            nn.Linear(config.n_layer, 1),
            nn.Softmax(dim=1)
            )         
        
        self.regressor = nn.Sequential(     
            nn.LayerNorm(config.n_embd),
            nn.Linear(config.n_embd, 1)                        
            )
        
    def forward(self, input_ids, attention_mask, past_key_values=None): 
        # Type: torch tensor
        outputs = self.gpt(input_ids =input_ids,past_key_values =past_key_values,
                           attention_mask= attention_mask) 
        
        # GPT-2 small has 12 layers, a hidden size of 768 and 12 heads (117M parameters).
        # We take the hidden states from the last GPT-2 layer.
        last_layer_hidden_state = outputs.last_hidden_state
        
        # Using the dropout
        last_layer_hidden_state = self.drop(last_layer_hidden_state)
        
        # The number of positions is MAX_LENGTH.
        # The size of the hidden state at each position is 768 (for gpt2).
        # In order to condense the hidden states of all positions into a context vector,
        # we compute a weighted average of the hidden states of all positions.
        # We compute the weight of each position using the attention neural network.
        weights = self.attention(last_layer_hidden_state)
        
        # weights.shape is BATCH_SIZE x MAX_LENGTH x 1
        # last_layer_hidden_state.shape is BATCH_SIZE x MAX_LENGTH x 768
        # Now we compute context_vector as the weighted average.
        # context_vector.shape is BATCH_SIZE x 768
        context_vector = torch.sum(weights * last_layer_hidden_state, dim=1) 
        out = self.regressor(context_vector)
        
        return out
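
The pooling step at the end of `forward` can be tried in isolation. The sketch below is a simplified, hypothetical `AttentionPooling` module on random data with toy sizes. The `attention_mask` argument is an optional variant that is not part of the model above: it sets the scores of padded positions to minus infinity so that they receive zero weight.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Collapse (batch, seq_len, hidden) token states into (batch, hidden)."""
    def __init__(self, hidden_size, inner_size):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_size, inner_size),
            nn.ReLU(),
            nn.Linear(inner_size, 1),
        )

    def forward(self, hidden_states, attention_mask=None):
        scores = self.score(hidden_states)                    # (batch, seq_len, 1)
        if attention_mask is not None:
            # Optional variant (an assumption, not in the model above): ignore padding.
            scores = scores.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1)                # normalize over seq_len
        return (weights * hidden_states).sum(dim=1), weights

torch.manual_seed(0)
pool = AttentionPooling(hidden_size=16, inner_size=4)
states = torch.randn(2, 5, 16)                                # toy "hidden states"
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
context, weights = pool(states, mask)
print(context.shape)       # one vector per example
print(weights[0, :, 0])    # the last two (padded) positions get zero weight
```

Without the mask, padded positions also take part in the softmax, so whether to mask them is a design choice worth testing on the validation folds.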

## 5.2: Training Preparation
We now write the training loop, which switches the model between training and evaluation mode, and finally initialize our optimizer.

#### Set seed for reproducibility
A random seed is used to make the results of our model reproducible. In other words, fixing the seed means that anyone who reruns the code in the same environment should get the same outputs, although some GPU operations can still be non-deterministic. In data science, reproducibility is extremely important.

In [None]:
def set_seed(seed):
    # Set random seed for reproducibility
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        # benchmark mode picks algorithms at run time, which can vary between runs
        torch.backends.cudnn.benchmark = False
        
seed = 9999
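
The effect of seeding is easy to check with Python's own generator. This is a minimal sketch in which the helper `draw` is made up for illustration; the same principle applies to the NumPy and PyTorch generators seeded in `set_seed`.

```python
import random

def draw(seed, n=3):
    """Seed the generator, then draw n pseudo-random numbers."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(9999)
second = draw(9999)
other = draw(10000)
print(first == second)   # True: same seed, same sequence
print(first == other)    # False: a different seed gives a different sequence
```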

In [None]:
train_loss = []
val_loss = []

In [None]:
def train(model, dataloaders_dict, optimizer, num_epochs, scheduler, device, filename): 
    # Load model to defined device
    model.to(device) 
    
    for epoch in range(num_epochs):
        for key in ['train', 'val']:
            if key == 'train':
                model.train()
                dataloaders = dataloaders_dict['train']
            else:
                model.eval()
                dataloaders = dataloaders_dict['val']

            #total loss for this epoch.
            epoch_loss = 0.0
            
            # Set tqdm to add loading screen and set the length
            # Evaluate data for one epoch
            loader = tqdm(dataloaders, total=len(dataloaders))
            
            # Tracking variables
            all_targets = []
            all_predictions = []  

            rmse = 0.0           
            # loop over the data iterator, and feed the inputs to the network
            # Train the model on each batch
            
            for (idx, data) in enumerate(loader):
                input_ids = data['input_ids']
                attention_mask = data['attention_mask']
                # Add original target labels - use later for evaluation
                target = data['target']

                # Always clear any previously calculated gradients before performing a backward pass
                model.zero_grad() # Reset gradients tensors
                optimizer.zero_grad()

                # move batch values to device
                input_ids = input_ids.to(device, dtype=torch.long, non_blocking=True) # Overlapping transfer if pinned memory
                attention_mask = attention_mask.to(device, dtype=torch.long, non_blocking=True)
                target = target.to(device, dtype=torch.float, non_blocking=True)

                with torch.set_grad_enabled(key == 'train'): 
                    
                    with torch.cuda.amp.autocast():
                        output = model(input_ids, attention_mask).flatten()
                        loss = nn.MSELoss()(output, target)  # defining loss function
                        
                        # Move targets and outputs to CPU
                        all_predictions.append(output.detach().cpu().numpy().tolist())
                        all_targets.append(target.detach().cpu().numpy().tolist())

                    # Accumulate the training loss over all of the batches so that we can
                    # calculate the average loss at the end. `loss` is a Tensor containing a
                    # single value; the `.item()` function just returns the Python value from the tensor.
                    epoch_loss += loss.item()
                    
                    if key == 'train':
                        scaler.scale(loss).backward() # backward pass on the scaled loss
                        # Unscale the gradients, then clip their norm to 1.0 before the
                        # optimizer step. This helps prevent the "exploding gradients" problem.
                        scaler.unscale_(optimizer)
                        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                        scaler.step(optimizer) # Update the weights
                        scaler.update() # Update the loss scale
                        scheduler.step() # Update learning rate schedule

                    del target, output, loss  # delete these variables to free the GPU memory
                    torch.cuda.empty_cache() # Releases all unoccupied cached memory currently held by the caching allocator
            
            # Average the per-batch mean losses: divide by the number of batches,
            # not by the number of examples.
            epoch_loss = epoch_loss / len(dataloaders) 
            
            # Collect all true labels and predictions for evaluation
            all_targets = np.concatenate(all_targets) 
            all_predictions = np.concatenate(all_predictions)
                        
            # Score with rmse
            rmse = np.sqrt(mean_squared_error(all_targets,all_predictions)) 

            if key == 'train':
                train_loss.append(epoch_loss)
            else:
                val_loss.append(epoch_loss)
            # Report this epoch's own loss rather than a running mean over all epochs
            print('Epoch {}/{} | {:^5} | Loss: {:.4f} | RMSE: {:.4f}'.format(
                epoch + 1, num_epochs, key, epoch_loss, rmse))
                

    torch.save(model.state_dict(), filename)
    
    del model, optimizer, scheduler
    gc.collect()   # performs a blocking garbage collection of all generations 
    torch.cuda.empty_cache()
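
Gradient clipping only has an effect if it runs between the backward pass and the optimizer step. The sketch below shows this on a tiny linear model with deliberately large inputs. The model, data and learning rate are arbitrary assumptions, and mixed precision is left out to keep it runnable on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4) * 100          # deliberately large inputs -> large gradients
y = torch.randn(8)

optimizer.zero_grad()
loss = nn.MSELoss()(model(x).flatten(), y)
loss.backward()

# clip_grad_norm_ rescales the gradients in place and returns the norm before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()                     # the update now uses the clipped gradients

print(float(norm_before), float(norm_after))
```

When a `GradScaler` is in use, the gradients are still multiplied by the loss scale after `backward()`, so `scaler.unscale_(optimizer)` should be called before clipping.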

#### Using k-Fold Cross-Validation
K-fold cross-validation is a popular and easy-to-understand technique that generally gives a less biased estimate of model performance than a single train/validation split, because every observation from the original dataset appears in both the training and the validation set across the folds. It is one of the best approaches when the input data is limited, as in this dataset.

In [None]:
kfold = KFold(n_splits=5,shuffle=True,random_state=seed)
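
To see what a k-fold split looks like, here is a plain-Python sketch of the idea. The function `kfold_indices` is a hypothetical helper and will not reproduce the exact splits of scikit-learn's `KFold`, which uses its own shuffling. It only illustrates that every sample is used for validation exactly once.

```python
import random

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, val_idx) lists; every sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=11, n_splits=5, seed=9999))
for i, (train_idx, val_idx) in enumerate(folds):
    print(i, len(train_idx), sorted(val_idx))
```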

## 5.3: Setting parameters for the training

We create the optimizer and the learning rate scheduler used in training, with parameters commonly used for transformer models. Then, for each fold, we run the training and validation loop for the defined number of epochs.

In [None]:
%%time

for fold, (idxTrain, idxVal) in enumerate(kfold.split(train_df)):
    
    print('#'*10)
    print('### FOLD %i'%(fold + 1))
    print('#'*10)
    
    set_seed(seed + fold)
    model = TextModel()
    param_optimizer = list(model.named_parameters())
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    optimizer_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
    ]
    # Note: AdamW is a class from the huggingface library (as opposed to pytorch) 
    # The 'W' stands for the 'Weight Decay fix'
    optimizer = AdamW(optimizer_parameters, lr=3e-7, eps=1e-8, correct_bias=False) 
    dataloaders_dict = train_val_dataloaders(train_df, idxTrain, idxVal, batch_size=TRAIN_BATCH_SIZE)
    
    # Total number of training steps is number of batches * number of epochs.
    # `dataloaders_dict['train']` contains batched data so its `len` gives 
    # us the number of batches per epoch.
    num_training_steps = len(dataloaders_dict['train']) * EPOCHS
    scheduler = get_cosine_schedule_with_warmup(
                  optimizer,
                  num_warmup_steps=0, 
                  num_training_steps=num_training_steps
                )    # learning rate scheduler
    train(
        model, 
        dataloaders_dict, 
        optimizer, 
        EPOCHS,
        scheduler,
        device,
        f'gpt_fold{fold}.pth')
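
For intuition about the scheduler, the sketch below computes a learning-rate multiplier with linear warm-up followed by half a cosine cycle. As far as I understand, this is the shape that `get_cosine_schedule_with_warmup` follows with its default settings, but treat it as an approximation rather than the library's code. The step counts are hypothetical.

```python
import math

def cosine_with_warmup(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier in [0, 1] for a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)          # linear warm-up
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

steps_per_epoch, epochs = 500, 4        # hypothetical numbers
total = steps_per_epoch * epochs
for step in (0, total // 2, total):
    print(step, round(cosine_with_warmup(step, 0, total), 4))
```

The multiplier reaches zero exactly at `num_training_steps`, so if that value is set too small the learning rate collapses long before training ends.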

# Step 6: Evaluate
This is the last step before submission. We load the model saved for each fold, predict on the test set with each of them, and then average the predictions across folds.

In [None]:
%%time

t_loader = test_loader(test_df)
predictions = []
models = []
for fold in range(kfold.n_splits):
    model = TextModel()
    model.to(device)
    model.load_state_dict(torch.load(f'./gpt_fold{fold}.pth', map_location=device))
    model.eval()
    models.append(model)
    
for model in models:
    preds = []
    # Build a fresh progress bar for every model
    for (idx, data) in enumerate(tqdm(t_loader, total=len(t_loader))):
        input_ids = data['input_ids']
        attention_mask = data['attention_mask']

        input_ids = input_ids.to(device, dtype=torch.long) 
        attention_mask = attention_mask.to(device, dtype=torch.long)

        with torch.no_grad():
            output = model(input_ids, attention_mask).flatten() 
            output = output.detach().cpu().numpy()
            preds.append(output)
    preds = np.concatenate(preds)
    predictions.append(preds)
    torch.cuda.empty_cache()

In [None]:
predictions = pd.DataFrame(predictions) 
predictions = predictions.T 
predictions = predictions.mean(axis=1) 
predictions = predictions.values
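
The averaging above can also be written directly with NumPy. The sketch below uses made-up predictions for three folds and four excerpts to show the shape of the computation.

```python
import numpy as np

# Hypothetical predictions: 3 folds x 4 test excerpts.
fold_predictions = [
    np.array([-1.0, 0.2, 0.5, -2.0]),
    np.array([-0.8, 0.0, 0.7, -1.8]),
    np.array([-1.2, 0.1, 0.6, -2.2]),
]
ensemble = np.mean(np.stack(fold_predictions), axis=0)   # average over the fold axis
print(ensemble)   # one averaged prediction per excerpt
```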

Write the predictions to submission.csv

In [None]:
submission_df = pd.read_csv("../input/commonlitreadabilityprize/sample_submission.csv")
submission_df.target = predictions
submission_df.to_csv("submission.csv", index=False)

In [None]:
submission_df

## **Thank you so much for reading! Please do upvote if you liked it. 🙂**