### Notes on this notebook

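- Important input handling for Question-Answering models: the sentiment and the tweet are packed into one sequence as `<s> sentiment </s></s> tweet </s>` (token ids 0 and 2 are `<s>` and `</s>`), and the offsets of the non-tweet tokens are set to `(0, 0)`:
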
        ids = [0] + sentiment_id + [2, 2] + encoding.ids + [2]
        offsets = [(0, 0)] * 4 + encoding.offsets + [(0, 0)]
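
The layout above can be sketched in plain Python with made-up token ids and offsets in place of real tokenizer output. The helper name `build_inputs` and the sample values are hypothetical; the padding id 1 and the mask rule mirror what `get_input_data` does in this notebook. The sketch counts the prefix as `3 + len(sentiment_ids)`, which equals the notebook's `4` when the sentiment is a single token.

```python
def build_inputs(sentiment_ids, tweet_ids, tweet_offsets, max_len=96):
    """Lay out ids as <s> sentiment </s></s> tweet </s>, then pad with <pad> (id 1)."""
    ids = [0] + sentiment_ids + [2, 2] + tweet_ids + [2]
    # One (0, 0) offset for each non-tweet token: <s>, sentiment, </s>, </s>.
    offsets = [(0, 0)] * (3 + len(sentiment_ids)) + tweet_offsets + [(0, 0)]

    pad_len = max_len - len(ids)
    if pad_len > 0:
        ids += [1] * pad_len
        offsets += [(0, 0)] * pad_len

    # Attention mask: 1 for real tokens, 0 for padding.
    mask = [int(i != 1) for i in ids]
    return ids, mask, offsets


# Made-up ids: one sentiment token and three tweet tokens for " so happy today".
ids, mask, offsets = build_inputs(
    sentiment_ids=[1313],
    tweet_ids=[98, 1372, 452],
    tweet_offsets=[(0, 3), (3, 9), (9, 15)],
    max_len=12,
)
print(ids)
print(mask)
```

Because every non-tweet position carries the offset `(0, 0)`, a character-level target span can only ever map onto tweet tokens.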

<br>

- Text is lowercased in two places:
    - The tweet and the selected text are passed through `lower()`
    - `ByteLevelBPETokenizer` is created with `lowercase=True`

<br>

- Although Hugging Face provides a ready-to-use Question-Answering model (class `RobertaForQuestionAnswering`), we do not use it. Instead, we build our own Question-Answering model by taking the raw output of a `RobertaModel` and implementing the Question-Answering logic in `forward()`. One reason for doing so is that the head becomes easy to customize (e.g. averaging the last four hidden layers).
  
  The `forward()` of a Roberta or Bert question-answering model consists of the following steps:
  - The Roberta model outputs the hidden states of every layer
  - Take the last layer, or the last few layers, of the hidden states
  - Use a fully connected layer to map `hidden_size` to 2
  - Split the result into `start_logits` and `end_logits`
  
  The loss is then calculated as the sum of the cross-entropy losses for the start and end positions.
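
The steps above can be sketched with NumPy arrays standing in for the PyTorch tensors. This is only a shape-level illustration under stated assumptions: the random `hidden_states`, the toy sizes, and the `cross_entropy` helper are hypothetical and do not come from the notebook's `TweetModel`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden_size, num_layers = 2, 6, 8, 5  # toy sizes, not RoBERTa's

# Stand-in for the per-layer hidden states returned by the encoder.
hidden_states = [rng.normal(size=(batch, seq_len, hidden_size))
                 for _ in range(num_layers)]

# Take the last four layers and average them.
x = np.mean(np.stack(hidden_states[-4:]), axis=0)  # (batch, seq_len, hidden_size)

# Fully connected layer mapping hidden_size -> 2.
W = rng.normal(scale=0.02, size=(hidden_size, 2))
b = np.zeros(2)
logits = x @ W + b                                  # (batch, seq_len, 2)

# Split into one start logit and one end logit per token.
start_logits, end_logits = logits[..., 0], logits[..., 1]  # each (batch, seq_len)


def cross_entropy(logits, targets):
    """Mean cross-entropy over the batch; each row is a distribution over tokens."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


# Hypothetical target token positions for the two examples in the batch.
start_positions = np.array([1, 2])
end_positions = np.array([3, 4])
loss = (cross_entropy(start_logits, start_positions)
        + cross_entropy(end_logits, end_positions))
print(start_logits.shape, end_logits.shape, float(loss))
```

Each token position is treated as a class, so predicting the start (or end) of the span is a classification over the `seq_len` positions.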

<br>

- The model output goes through `softmax` before `argmax`. `softmax` puts the output values of each model in the range [0, 1] with a sum of 1, so it makes sense to add the outputs of 10 different models (one per fold of the 10-fold split) element-wise and average them before taking `argmax`.
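
A small NumPy sketch of why the normalization matters, assuming three models instead of ten and made-up logits for a four-token input (the `softmax` helper and all values are hypothetical). One model produces logits on a much larger scale than the others; averaging raw logits lets it dominate, whereas averaging probabilities gives each model an equally weighted vote.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up start logits from three fold models for the same 4-token input.
fold_logits = np.array([
    [3.0, 0.0, 0.0, 0.0],   # prefers token 0
    [0.0, 10.0, 0.0, 0.0],  # prefers token 1, with much larger logits
    [3.0, 0.0, 0.0, 0.0],   # prefers token 0
])

probs = softmax(fold_logits)          # every row lies in [0, 1] and sums to 1
avg_probs = probs.mean(axis=0)        # element-wise average across models

pred_from_probs = int(np.argmax(avg_probs))
pred_from_logits = int(np.argmax(fold_logits.mean(axis=0)))
print(pred_from_probs, pred_from_logits)
```

In this toy case the averaged probabilities select token 0 (two of three models agree), while the averaged raw logits select token 1.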

<br>

- Code structure of this notebook:

        class TweetDataset
            def __init__
            def __getitem__
                process one tweet
                return a dict of information about this tweet
        class TweetModel
            def __init__
            def forward
                return start_logits, end_logits
        def loss_fn
        def train
            for each epoch
                for [train, val]
                    if train
                        model.train()
                    else
                        model.eval() # disable dropout
                    epoch_loss = 0
                    epoch_jaccard = 0
                    for each batch from train or val DataLoader
                        with torch.no_grad() if eval
                            optimizer.zero_grad()
                            model predict
                            calculate and accumulate loss
                            if train
                                loss.backward()
                                optimizer.step()
                                scheduler.step()
                            calculate jaccard_score for each tweet and accumulate them
                    calculate the average loss as epoch_loss
                    calculate the average jaccard_score as epoch_jaccard
                    if val
                        update early stopping, and save model if improved
                do early stopping if out of patience

        for each train-val split
            train

<br>

### Changes compared to the [original notebook](https://www.kaggle.com/shoheiazuma/tweet-sentiment-roberta-pytorch):
- Split into two notebooks (train and inference)
- Added early stopping, and applied one of the following manipulations to the Roberta output in `forward()`
    - Averaged the last four layers (Private 0.71557, Public 0.71521)
        - Further added a learning rate scheduler, fine-tuned the settings, and added memory cleaning (THIS RUN) (Private 0.71620, Public 0.71286)
    - Concatenated the last hidden layer and the third-to-last layer, instead of averaging the last four layers (Private 0.71647, Public 0.71240)
    - Used only the last hidden layer, instead of averaging the last four layers (Private 0.71439, Public 0.71240)
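
The three manipulations listed above differ only in how the per-layer hidden states are combined before the fully connected layer. A shape-level NumPy sketch, with random arrays standing in for the Roberta hidden states and toy sizes (all names and values here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden_size = 2, 6, 8  # toy sizes, not RoBERTa's

# Stand-in for the tuple of hidden states, one array per layer.
hs = [rng.normal(size=(batch, seq_len, hidden_size)) for _ in range(5)]

# Variant 1: average the last four layers.
mean_last4 = np.mean(np.stack([hs[-1], hs[-2], hs[-3], hs[-4]]), axis=0)

# Variant 2: concatenate the last layer and the third-to-last layer
# along the feature axis.
concat_1_3 = np.concatenate([hs[-1], hs[-3]], axis=-1)

# Variant 3: use the last layer only.
last_only = hs[-1]

print(mean_last4.shape, concat_1_3.shape, last_only.shape)
```

Averaging and last-layer-only keep the feature size at `hidden_size`, while concatenation doubles it, so the fully connected layer in that variant would need `2 * hidden_size` input features.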
    
### View the [inference notebook here](https://www.kaggle.com/kanruwang/tweet-sentiment-roberta-pytorch-inference)

<br>

# Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import warnings
import random
import torch 
from torch import nn
import torch.optim as optim
from sklearn.model_selection import StratifiedKFold
import tokenizers
from transformers import RobertaModel, RobertaConfig

from early_stopping import EarlyStopping ####
!pip install GPUtil ####
from GPUtil import showUtilization ####
import gc ####

warnings.filterwarnings('ignore')

# Seed

In [None]:
def seed_everything(seed_value):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        # benchmark mode autotunes kernels and may pick different ones between runs
        torch.backends.cudnn.benchmark = False

seed = 42
seed_everything(seed)

# GPU Memory Reporting Function

In [None]:
def print_gpu_cache(detailed=False):
    print("GPU Usage")
    showUtilization()
    if detailed:
        print("GPU memory summary")
        print(torch.cuda.memory_summary(device=None, abbreviated=False))

# Data Loader

In [None]:
class TweetDataset(torch.utils.data.Dataset):
    def __init__(self, df, max_len=96):
        self.df = df
        self.max_len = max_len
        self.labeled = 'selected_text' in df
        self.tokenizer = tokenizers.ByteLevelBPETokenizer(
            vocab_file='../input/roberta-base/vocab.json', 
            merges_file='../input/roberta-base/merges.txt', 
            lowercase=True,
            add_prefix_space=True)

    def __getitem__(self, index):
        data = {}
        row = self.df.iloc[index]
        
        ids, masks, tweet, offsets = self.get_input_data(row)
        data['ids'] = ids
        data['masks'] = masks
        data['tweet'] = tweet
        data['offsets'] = offsets
        
        if self.labeled:
            start_idx, end_idx = self.get_target_idx(row, tweet, offsets)
            data['start_idx'] = start_idx
            data['end_idx'] = end_idx
        
        return data

    def __len__(self):
        return len(self.df)
    
    def get_input_data(self, row):
        tweet = " " + " ".join(row.text.lower().split())
        encoding = self.tokenizer.encode(tweet)
        sentiment_id = self.tokenizer.encode(row.sentiment).ids
        ids = [0] + sentiment_id + [2, 2] + encoding.ids + [2]
        offsets = [(0, 0)] * 4 + encoding.offsets + [(0, 0)]
                
        pad_len = self.max_len - len(ids)
        if pad_len > 0:
            ids += [1] * pad_len
            offsets += [(0, 0)] * pad_len
        
        ids = torch.tensor(ids)
        masks = torch.where(ids != 1, torch.tensor(1), torch.tensor(0))
        offsets = torch.tensor(offsets)
        
        return ids, masks, tweet, offsets
        
    def get_target_idx(self, row, tweet, offsets):
        selected_text = " " +  " ".join(row.selected_text.lower().split())

        len_st = len(selected_text) - 1
        idx0 = None
        idx1 = None

        for ind in (i for i, e in enumerate(tweet) if e == selected_text[1]):
            if " " + tweet[ind: ind+len_st] == selected_text:
                idx0 = ind
                idx1 = ind + len_st - 1
                break

        char_targets = [0] * len(tweet)
        if idx0 is not None and idx1 is not None:
            for ct in range(idx0, idx1 + 1):
                char_targets[ct] = 1

        target_idx = []
        for j, (offset1, offset2) in enumerate(offsets):
            if sum(char_targets[offset1: offset2]) > 0:
                target_idx.append(j)

        start_idx = target_idx[0]
        end_idx = target_idx[-1]
        
        return start_idx, end_idx
        
def get_train_val_loaders(df, train_idx, val_idx, batch_size=8):
    train_df = df.iloc[train_idx]
    val_df = df.iloc[val_idx]

    train_loader = torch.utils.data.DataLoader(
        TweetDataset(train_df), 
        batch_size=batch_size, 
        shuffle=True, 
        num_workers=2,
        drop_last=True)

    val_loader = torch.utils.data.DataLoader(
        TweetDataset(val_df), 
        batch_size=batch_size, 
        shuffle=False, 
        num_workers=2)

    dataloaders_dict = {"train": train_loader, "val": val_loader}

    return dataloaders_dict

# Model

In [None]:
class TweetModel(nn.Module):
    def __init__(self):
        super(TweetModel, self).__init__()
        
        config = RobertaConfig.from_pretrained(
            '../input/roberta-base/config.json', output_hidden_states=True)    
        self.roberta = RobertaModel.from_pretrained(
            '../input/roberta-base/pytorch_model.bin', config=config)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(config.hidden_size, 2)
        nn.init.normal_(self.fc.weight, std=0.02)
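        # Note: the second argument is the mean, so the bias is drawn from N(0, 1)
        # (default std=1); nn.init.zeros_ would give a zero bias instead.
        nn.init.normal_(self.fc.bias, 0)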

    def forward(self, input_ids, attention_mask):
        _, _, hs = self.roberta(input_ids, attention_mask)
         
        x = torch.stack([hs[-1], hs[-2], hs[-3], hs[-4]])
        x = torch.mean(x, 0)
        x = self.dropout(x)
        x = self.fc(x)
        start_logits, end_logits = x.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)
                
        return start_logits, end_logits

# Loss Function

In [None]:
def loss_fn(start_logits, end_logits, start_positions, end_positions):
    ce_loss = nn.CrossEntropyLoss()
    start_loss = ce_loss(start_logits, start_positions)
    end_loss = ce_loss(end_logits, end_positions)    
    total_loss = start_loss + end_loss
    return total_loss

# Evaluation Function
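
The metric tracked during training is the word-level Jaccard score between the true and the predicted selected text. A small worked example, using a function that mirrors the notebook's `jaccard` (the sample strings are made up):

```python
def jaccard(str1, str2):
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercased word sets."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))


# 3 shared words out of 5 distinct words overall.
score = jaccard("I really love this song", "love this song")
print(score)
```

Note that the function divides by zero when both strings contain no words, so it assumes at least one of them is non-empty.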

In [None]:
def get_selected_text(text, start_idx, end_idx, offsets):
    selected_text = ""
    for ix in range(start_idx, end_idx + 1):
        selected_text += text[offsets[ix][0]: offsets[ix][1]]
        if (ix + 1) < len(offsets) and offsets[ix][1] < offsets[ix + 1][0]:
            selected_text += " "
    return selected_text

def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

def compute_jaccard_score(text, start_idx, end_idx, start_logits, end_logits, offsets):
    start_pred = np.argmax(start_logits)
    end_pred = np.argmax(end_logits)
    if start_pred > end_pred:
        pred = text
    else:
        pred = get_selected_text(text, start_pred, end_pred, offsets)
        
    true = get_selected_text(text, start_idx, end_idx, offsets)
    
    return jaccard(true, pred)

# Training Function

In [None]:
#def train_model(model, dataloaders_dict, criterion, optimizer, num_epochs, filename): ####
def train_model(model, dataloaders_dict, criterion, optimizer, scheduler, num_epochs, filename): ####
    model.cuda()

    es = EarlyStopping(patience=2, mode="max") ####
    
    for epoch in range(num_epochs):
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()
            else:
                model.eval()

            epoch_loss = 0.0
            epoch_jaccard = 0.0
            
            #for data in (dataloaders_dict[phase]): ####
            for batch_idx, data in enumerate(dataloaders_dict[phase]): ####
                ids = data['ids'].cuda()
                masks = data['masks'].cuda()
                tweet = data['tweet']
                offsets = data['offsets'].numpy()
                start_idx = data['start_idx'].cuda()
                end_idx = data['end_idx'].cuda()
                
                #optimizer.zero_grad() ####
                model.zero_grad() ####

                with torch.set_grad_enabled(phase == 'train'):

                    start_logits, end_logits = model(ids, masks)

                    loss = criterion(start_logits, end_logits, start_idx, end_idx)
                    
                    if phase == 'train':
                        loss.backward()
                        if batch_idx % 50 == 0: ####
                            print( ####
                                "batch_idx " + str(batch_idx), ####
                                ", opt lr " + str(round(optimizer.param_groups[0]['lr'], 6)), ####
                                ", scheduler lr " + str(round(scheduler.get_last_lr()[0], 6)) ####
                            ) ####
                        optimizer.step()
                        scheduler.step() ####

                    epoch_loss += loss.item() * len(ids)
                    
                    start_idx = start_idx.cpu().detach().numpy()
                    end_idx = end_idx.cpu().detach().numpy()
                    start_logits = torch.softmax(start_logits, dim=1).cpu().detach().numpy()
                    end_logits = torch.softmax(end_logits, dim=1).cpu().detach().numpy()
                    
                    for i in range(len(ids)):                        
                        jaccard_score = compute_jaccard_score(
                            tweet[i],
                            start_idx[i],
                            end_idx[i],
                            start_logits[i], 
                            end_logits[i], 
                            offsets[i])
                        epoch_jaccard += jaccard_score
                    
            epoch_loss = epoch_loss / len(dataloaders_dict[phase].dataset)
            epoch_jaccard = epoch_jaccard / len(dataloaders_dict[phase].dataset)
            
            print('Epoch {}/{} | {:^5} | Loss: {:.4f} | Jaccard: {:.4f}'.format(
                epoch + 1, num_epochs, phase, epoch_loss, epoch_jaccard))
            
            if phase == 'val': ####
                es(epoch_jaccard, model, model_path=filename) ####
                
        if es.early_stop: ####
            print("Early stopping") ####
            break ####
    
    #torch.save(model.state_dict(), filename) ####

# Training

In [None]:
#num_epochs = 3 ####
num_epochs = 5 ####
batch_size = 32
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

In [None]:
%%time

train_df = pd.read_csv('../input/tweet-sentiment-extraction/train.csv')
train_df['text'] = train_df['text'].astype(str)
train_df['selected_text'] = train_df['selected_text'].astype(str)

for fold, (train_idx, val_idx) in enumerate(skf.split(train_df, train_df.sentiment), start=1): 
    print(f'Fold: {fold}')

    model = TweetModel()
    optimizer = optim.AdamW(model.parameters(), lr=3e-5, betas=(0.9, 0.999))
    criterion = loss_fn    
    dataloaders_dict = get_train_val_loaders(train_df, train_idx, val_idx, batch_size)
    scheduler = optim.lr_scheduler.OneCycleLR( ####
        optimizer, max_lr=3e-5, steps_per_epoch=len(dataloaders_dict["train"]), epochs=num_epochs ####
    ) ####

    train_model(
        model, 
        dataloaders_dict,
        criterion, 
        optimizer,
        scheduler, ####
        num_epochs,
        f'roberta_fold{fold}.pth')
    
    print_gpu_cache() ####
    # Drop the references first; empty_cache() can only release memory that is no longer in use.
    del model, optimizer, dataloaders_dict, scheduler ####
    gc.collect() ####
    torch.cuda.empty_cache() ####
    print_gpu_cache() ####