### Basics: XLM-RoBERTa - PyTorch


In [None]:
# Please switch on the TPU before running these lines.
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev

In [None]:
# Imports required to use TPUs with PyTorch.
# https://pytorch.org/xla/release/1.5/index.html

import torch_xla
import torch_xla.core.xla_model as xm

In [None]:
import pandas as pd
import numpy as np
import os
import gc
import random

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from sklearn.utils import shuffle
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

import transformers
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification
from transformers import AdamW

import warnings
warnings.filterwarnings("ignore")


In [None]:
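# Load the data and replace the dots in each URL with spaces.
data = pd.read_csv("../input/porn-recognition/data.csv", index_col=0)
# regex=False treats '.' as a literal dot, not as the regex wildcard.
data['url'] = data['url'].str.replace('.', ' ', regex=False)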
data.head()

# Section 3

In this section we will train a BERT Model on three folds and train an XLM-RoBERTa model on one fold. We will use PyTorch with a single TPU. For each model we will also make a prediction on the competition test set and create a submission csv file.

### A few notes on using PyTorch with a TPU

- Setting up PyTorch code to use a single XLA device (TPU) is easier than setting it up to use all 8 TPU cores. Just a few lines of code need to be changed to switch from a GPU to a single TPU. The speed is not as fast as using all 8 TPU cores, but the model does train faster than on a GPU and there's more RAM available.

- PyTorch XLA does not use memory as efficiently as TensorFlow. Therefore, my code tends to consistently crash when I try to use PyTorch with an 8 core TPU setup.

- There is 4.9GB of disk space available in Kaggle notebooks. What I've found is that models trained on a TPU are larger than models trained on a GPU. For example, a BERT model trained on a GPU is 600MB. However, a BERT model trained on a TPU is approx. 1GB. Therefore, when running 5-fold cross validation, trying to save all 5 fold models (1GB each) will cause the Kaggle notebook to crash because the available disk space will be exceeded. For that reason, here we will be training on three folds only.

- A TPU may take a few seconds to start running. Therefore, if you run your code and you see that nothing is happening, wait a little while. Don't cancel the run because you think that something is wrong.
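
The first note says that only a few lines change when switching from a GPU to a single XLA device. Here is a minimal sketch of that switch. The helper names `pick_backend` and `get_device` are hypothetical and are not part of this notebook; they only check which packages are installed before building a device.

```python
import importlib.util


def pick_backend():
    """Return 'xla', 'cuda', or 'cpu' based on what is installed and available."""
    # torch_xla is only importable after the TPU setup script has been run.
    if importlib.util.find_spec("torch_xla") is not None:
        return "xla"
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    return "cpu"


def get_device(backend):
    """Build the device object for the chosen backend (hypothetical helper)."""
    if backend == "xla":
        import torch_xla.core.xla_model as xm
        return xm.xla_device()      # a single TPU core
    import torch
    return torch.device(backend)    # 'cuda' or 'cpu'


backend = pick_backend()
print("Selected backend:", backend)
```

The other change is the weight update: on a TPU this notebook calls `xm.optimizer_step(optimizer, barrier=True)` where GPU code would call `optimizer.step()`.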

## 3.2. Train an XLM-RoBERTa Model

In [None]:
MODEL_TYPE = 'xlm-roberta-large'


L_RATE = 1e-5
MAX_LEN = 256

NUM_EPOCHS = 3
BATCH_SIZE = 32
NUM_CORES = os.cpu_count()
RANDOM_STATE = 42

torch.manual_seed(RANDOM_STATE)


NUM_CORES

## Define the device

In [None]:
# USING TPU ON KAGGLE   
device = xm.xla_device()

print(device)

## Load the Data

In [None]:
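# Rows with target -1 form the test set; rows with target -2 are excluded as well.
df = data.loc[(data['target'] != -1) & (data['target'] != -2)]

# Row position of the 70/30 split point.
split_idx = round(len(df) * 0.7)

# Use the first 70% for training and the remaining 30% for validation,
# so that the validation rows are not seen during training.
df_train = df.iloc[:split_idx, :]
df_val = df.iloc[split_idx:, :]

df_test = data.loc[data['target'] == -1]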

df_train.head()

In [None]:
df_val.head()

In [None]:
df_test.head()

## Instantiate the Tokenizer

In [None]:
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification

# xlm-roberta-large
print('Loading XLMRoberta tokenizer...')
tokenizer = XLMRobertaTokenizer.from_pretrained(MODEL_TYPE)

## Create the Dataloader

In [None]:
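# The Dataset classes look rows up with .loc[index, ...],
# so every dataframe needs a clean 0..n-1 index (including the test set).
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)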

In [None]:
class CompDataset(Dataset):

    def __init__(self, df):
        self.df_data = df

    def __getitem__(self, index):

        # get the sentence from the dataframe
        sentence1 = self.df_data.loc[index, 'url']
        sentence2 = self.df_data.loc[index, 'title']
        sentence3 = self.df_data.loc[index, 'title_re']

        # Process the sentence
        # ---------------------

        encoded_dict = tokenizer.encode_plus(
                    sentence1, sentence2,           # Sentence pair to encode.
                    add_special_tokens = True,      # Add the special tokens.
                    max_length = MAX_LEN,           # Target length for padding and truncation.
                    padding = 'max_length',         # Pad every sample to MAX_LEN.
                    truncation = True,              # Truncate samples longer than MAX_LEN.
                    return_attention_mask = True,   # Construct attn. masks.
                    return_tensors = 'pt',          # Return pytorch tensors.
        )
        
        # These are torch tensors.
        padded_token_list = encoded_dict['input_ids'][0]
        att_mask = encoded_dict['attention_mask'][0]
        
        # Convert the target to a torch tensor
        target = torch.tensor(self.df_data.loc[index, 'target'])
        sample = (padded_token_list, att_mask, target)
        return sample

    def __len__(self):
        return len(self.df_data)
    
class TestDataset(Dataset):

    def __init__(self, df):
        self.df_data = df

    def __getitem__(self, index):

        # get the sentence from the dataframe
        sentence1 = self.df_data.loc[index, 'url']
        sentence2 = self.df_data.loc[index, 'title']
        sentence3 = self.df_data.loc[index, 'title_re']

        # Process the sentence
        # ---------------------

        encoded_dict = tokenizer.encode_plus(
                    sentence1, sentence2,           # Sentence pair to encode.
                    add_special_tokens = True,      # Add the special tokens.
                    max_length = MAX_LEN,           # Target length for padding and truncation.
                    padding = 'max_length',         # Pad every sample to MAX_LEN.
                    truncation = True,              # Truncate samples longer than MAX_LEN.
                    return_attention_mask = True,   # Construct attn. masks.
                    return_tensors = 'pt',          # Return pytorch tensors.
        )
        
        # These are torch tensors.
        padded_token_list = encoded_dict['input_ids'][0]
        att_mask = encoded_dict['attention_mask'][0]
        sample = (padded_token_list, att_mask)
        return sample


    def __len__(self):
        return len(self.df_data)
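
To make the padding and attention-mask step in the two Dataset classes more concrete, here is a small pure-Python sketch. It is a simplification, not the tokenizer's actual implementation: the function name `pad_and_mask` is hypothetical, the token ids are made up, and `pad_id=1` is assumed to be the XLM-RoBERTa padding id. The real tokenizer also keeps the closing special token when it truncates, which this sketch does not do.

```python
def pad_and_mask(token_ids, max_len, pad_id=1):
    """Truncate or right-pad a list of token ids and build its attention mask."""
    ids = list(token_ids)[:max_len]     # truncate to max_len
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad             # pad on the right
    mask += [0] * n_pad                 # 0 marks padding the model should ignore
    return ids, mask


# Made-up token ids for a short input, padded to length 6.
ids, mask = pad_and_mask([0, 581, 7986, 2], max_len=6)
print(ids)   # [0, 581, 7986, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Every sample ends up with the same length (`MAX_LEN` in this notebook), which is what lets the DataLoader stack samples into a batch.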

## Define the Model

In [None]:
model = XLMRobertaForSequenceClassification.from_pretrained(
    MODEL_TYPE, 
    num_labels = 2, # The number of output labels. 2 for binary classification.
)

# Send the model to the device.
model.to(device)

## Define the Optimizer

In [None]:
# Define the optimizer
optimizer = AdamW(model.parameters(),
              lr = L_RATE, 
              eps = 1e-8 
            )

## Train the Model

In [None]:
# Create the dataloaders.

train_data = CompDataset(df_train)
val_data = CompDataset(df_val)
test_data = TestDataset(df_test)

train_dataloader = torch.utils.data.DataLoader(train_data,
                                        batch_size=BATCH_SIZE,
                                        shuffle=True,
                                       num_workers=NUM_CORES)

val_dataloader = torch.utils.data.DataLoader(val_data,
                                        batch_size=BATCH_SIZE,
                                        shuffle=True,
                                       num_workers=NUM_CORES)

test_dataloader = torch.utils.data.DataLoader(test_data,
                                        batch_size=BATCH_SIZE,
                                        shuffle=False,
                                       num_workers=NUM_CORES)



print(len(train_dataloader))
print(len(val_dataloader))
print(len(test_dataloader))

In [None]:
%%time


# Set the seed.
seed_val = 101

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store the average loss after each epoch so we can plot them.
loss_values = []


# For each epoch...
for epoch in range(0, NUM_EPOCHS):
    
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch + 1, NUM_EPOCHS))
    

    stacked_val_labels = []
    targets_list = []

    # ========================================
    #               Training
    # ========================================
    
    print('Training...')
    
    # put the model into train mode
    model.train()
    
    # This turns gradient calculations on and off.
    torch.set_grad_enabled(True)


    # Reset the total loss for this epoch.
    total_train_loss = 0

    for i, batch in enumerate(train_dataloader):
        
        train_status = 'Batch ' + str(i) + ' of ' + str(len(train_dataloader))
        
        print(train_status, end='\r')


        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        model.zero_grad()        


        outputs = model(b_input_ids, 
                    attention_mask=b_input_mask,
                    labels=b_labels)
        
        # Get the loss from the outputs tuple: (loss, logits)
        loss = outputs[0]
        
        # Convert the loss from a torch tensor to a number.
        # Calculate the total loss.
        total_train_loss = total_train_loss + loss.item()
        
        # Zero the gradients
        optimizer.zero_grad()
        
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        
        
        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        
        
        # Use the optimizer to update the weights.
        
        # Optimizer for GPU
        # optimizer.step() 
        
        # Optimizer for TPU
        # https://pytorch.org/xla/
        xm.optimizer_step(optimizer, barrier=True)

    
    print('Train loss:' ,total_train_loss)


    # ========================================
    #               Validation
    # ========================================
    
    print('\nValidation...')

    # Put the model in evaluation mode.
    model.eval()

    # Turn off the gradient calculations.
    # This tells the model not to compute or store gradients.
    # This step saves memory and speeds up validation.
    torch.set_grad_enabled(False)
    
    
    # Reset the total loss for this epoch.
    total_val_loss = 0
    

    for j, batch in enumerate(val_dataloader):
        
        val_status = 'Batch ' + str(j) + ' of ' + str(len(val_dataloader))
        
        print(val_status, end='\r')

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)      


        outputs = model(b_input_ids, 
                attention_mask=b_input_mask, 
                labels=b_labels)
        
        # Get the loss from the outputs tuple: (loss, logits)
        loss = outputs[0]
        
        # Convert the loss from a torch tensor to a number.
        # Calculate the total loss.
        total_val_loss = total_val_loss + loss.item()
        

        # Get the preds
        preds = outputs[1]


        # Move preds to the CPU
        val_preds = preds.detach().cpu().numpy()
        
        # Move the labels to the cpu
        targets_np = b_labels.to('cpu').numpy()

        # Append the labels to a numpy list
        targets_list.extend(targets_np)

        if j == 0:  # first batch
            stacked_val_preds = val_preds

        else:
            stacked_val_preds = np.vstack((stacked_val_preds, val_preds))

    
    # Calculate the validation F1 score
    y_true = targets_list
    y_pred = np.argmax(stacked_val_preds, axis=1)
    
    val_f1 = f1_score(y_true, y_pred)
    
    
    print('Val loss:' ,total_val_loss)
    print('Val F1: ', val_f1)


    # Save the Model
    torch.save(model.state_dict(), 'model.pt')
    
    # Use the garbage collector to save memory.
    gc.collect()
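
The validation metric computed above comes from `f1_score`, not from an accuracy function. The sketch below, written in plain Python with made-up labels, shows how the two can differ on imbalanced data. The helper names `binary_f1` and `accuracy` are hypothetical and only illustrate the formulas.

```python
def binary_f1(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example: two positives, one of which is missed.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print('Accuracy:', accuracy(y_true, y_pred))
print('F1:', binary_f1(y_true, y_pred))
```

Here the accuracy is high because most rows are negatives, while the F1 score is noticeably lower because half of the positives were missed.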

## Make a prediction on the test set

In [None]:
# Put the model in evaluation mode and turn off gradient calculations.
model.eval()
torch.set_grad_enabled(False)

for j, batch in enumerate(test_dataloader):

    inference_status = 'Batch ' + str(j+1) + ' of ' + str(len(test_dataloader))

    print(inference_status, end='\r')

    b_input_ids = batch[0].to(device)
    b_input_mask = batch[1].to(device)

    outputs = model(b_input_ids,
                    attention_mask=b_input_mask)

    # Get the preds. No labels are passed in, so the logits are the first output.
    preds = outputs[0]

    # Move preds to the CPU
    preds = preds.detach().cpu().numpy()

    # Stack the predictions. The test set has no labels, so there is nothing else to collect.
    if j == 0:  # first batch
        stacked_preds = preds
    else:
        stacked_preds = np.vstack((stacked_preds, preds))

In [None]:
stacked_preds

## Process the Predictions

In [None]:
# Take the argmax. This returns the column index of the max value in each row.

preds = np.argmax(stacked_preds, axis=1)

preds
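
The model outputs raw logits, and the argmax turns them into hard 0/1 labels. If class probabilities are wanted instead (for example, for a metric such as ROC AUC), a softmax over each row gives them. The sketch below uses plain Python and made-up logits; the `softmax` helper is hypothetical and not part of this notebook.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)                              # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three samples and two classes.
logits = [[2.0, -1.0], [0.0, 0.0], [-0.5, 1.5]]

probs = [softmax(r) for r in logits]

# Index of the largest logit in each row (ties go to the first index).
labels = [max(range(len(r)), key=r.__getitem__) for r in logits]

print(labels)   # [0, 0, 1]
print([round(p[1], 3) for p in probs])   # probability of class 1 per sample
```

Softmax does not change the ordering within a row, so taking the argmax of the probabilities gives the same labels as taking the argmax of the logits.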

## Create a submission csv file

In [None]:
# Check the shape of the predictions.
# There should be one prediction per row of the test set, in the same row order.

preds.shape

In [None]:
# Give every row an id, then assign the preds to the target column.
data['id'] = np.arange(data.shape[0])

df_test = data.loc[ data['target'] == -1 ]

df_sub = pd.DataFrame()
df_sub['id'] = df_test['id']
df_sub['target'] = preds 

df_sub.head()

In [None]:
df_sub.to_csv("sub.csv", index=False)

## A1 - Acronyms

- NLP - Natural Language Processing
- NLU - Natural Language Understanding
- NLI - Natural Language Inference
- NER - Named Entity Recognition
- NSP - Next Sentence Prediction
- MLM - Masked Language Model
- PoS - Part of Speech
- POST - Part of Speech Tagging
- GLUE - The General Language Understanding Evaluation benchmark
- SQuAD - Stanford Question Answering Dataset
- SWAG - Situations With Adversarial Generations (Dataset)
- XNLI - Cross Lingual Natural Language Inference (Dataset)
- XLU - Cross-lingual Language Understanding



<a id='GLUE_Datasets'></a>

## A2 - GLUE Datasets

GLUE (General Language Understanding Evaluation) is a performance benchmark that's used to compare the language understanding capability of machine learning models. A model's performance on 9 datasets is reduced to a single number. These are the datasets that are part of GLUE.

1. MNLI - Multi-Genre Natural Language Inference
2. QQP - Quora Question Pairs
3. QNLI - Question Natural Language Inference
4. SST-2 - Stanford Sentiment Treebank
5. CoLA - Corpus of Linguistic Acceptability
6. STS-B - Semantic Textual Similarity Benchmark
7. MRPC - Microsoft Research Paraphrase Corpus
8. RTE - Recognizing Textual Entailment
9. WNLI - Winograd NLI

More Info:<br>
GLUE Explained: Understanding BERT Through Benchmarks<br>
https://mccormickml.com/2019/11/05/GLUE/
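
As a rough illustration of how nine per-dataset results are reduced to a single number, the sketch below takes an unweighted average over the tasks. This is my understanding of how the overall GLUE score is formed. The scores are invented round numbers for illustration only and are not results from any real model, and `glue_average` is a hypothetical helper name.

```python
def glue_average(task_scores):
    """Unweighted mean of the per-task scores."""
    return sum(task_scores.values()) / len(task_scores)


# Hypothetical per-task scores (NOT real results).
scores = {
    "MNLI": 80.0,
    "QQP": 70.0,
    "QNLI": 90.0,
    "SST-2": 90.0,
    "CoLA": 50.0,
    "STS-B": 85.0,
    "MRPC": 85.0,
    "RTE": 65.0,
    "WNLI": 65.0,
}

print('Overall score: {:.1f}'.format(glue_average(scores)))
```

Because every task has the same weight, a small dataset such as RTE influences the overall number as much as a large one such as MNLI.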


<a id='Datasets_Separated_by_Task'></a>

## A3 - Datasets Separated by Task

a) Sentence Pair Classification Tasks<br>
MNLI, QQP, QNLI, STS-B, MRPC, RTE, SWAG

b) Single Sentence Classification Tasks<br>
SST-2, CoLA

c) Question Answering Tasks<br>
SQuAD (v1.1 and v2.0)

d) Single Sentence Tagging Tasks<br>
CoNLL-2003 NER

<a id='Papers'></a>

## A4 - Papers

- Attention is all you need<br>
https://arxiv.org/pdf/1706.03762.pdf

- BERT Paper<br>
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding<br>
https://arxiv.org/pdf/1810.04805.pdf

- XLM-RoBERTa Paper<br>
Unsupervised Cross-lingual Representation Learning at Scale<br>
https://arxiv.org/pdf/1911.02116.pdf

- GLUE Paper<br>
https://arxiv.org/abs/1804.07461<br>
Website: https://gluebenchmark.com/

- MultiNLI Paper<br>
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference<br>
https://cims.nyu.edu/~sbowman/multinli/paper.pdf<br>
Website: https://cims.nyu.edu/~sbowman/multinli/

- XNLI Paper<br>
https://arxiv.org/pdf/1809.05053.pdf<br>
Website: https://cims.nyu.edu/~sbowman/xnli/

- SentencePiece Paper<br>
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing<br>
https://arxiv.org/abs/1808.06226

<a id='NLP_Applications'></a>

## A5 - What is NLP used for?

- Text Classification
- Translation
- Named Entity Recognition
- Part of Speech Tagging
- Question Answering
- Text Generation
- Language Modeling
- Text Summarization

<a id='Helpful_Resources'></a>

## A6 - Helpful Resources

- GLUE Explained: Understanding BERT Through Benchmarks<br>
https://mccormickml.com/2019/11/05/GLUE/

- Improving Language Understanding with Unsupervised Learning<br>
https://openai.com/blog/language-unsupervised/

- Hugging Face Transformers Github<br>
https://github.com/huggingface/transformers

- Hugging Face Summary of Models<br>
https://huggingface.co/transformers/model_summary.html

- Hugging Face - Searchable model listing<br>
https://huggingface.co/models

- BERT Video Series by ChrisMcCormickAI<br>
Part 1<br>
https://www.youtube.com/watch?v=FKlPCK1uFrc<br>
Part 2<br>
https://www.youtube.com/watch?v=zJW57aCBCTk<br>
Part 3<br>
https://www.youtube.com/watch?v=x66kkDnbzi4<br>
Part 4<br>
https://www.youtube.com/watch?v=Hnvb9b7a_Ps<br>

- Data Processing For Question & Answering Systems: BERT vs. RoBERTa by Abhishek Thakur<br>
https://www.youtube.com/watch?v=6a6L_9USZxg

- Sentencepiece Tokenizer With Offsets For T5, ALBERT, XLM-RoBERTa And Many More by Abhishek Thakur<br>
https://youtu.be/U51ranzJBpY

- PyTorch on XLA Devices - docs<br>
https://pytorch.org/xla/release/1.5/index.html


**Thank you for reading.**