# [Commented] RoBERTa using PyTorch
## This is a RoBERTa version of @abhishek's [BERT Base Uncased using PyTorch](https://www.kaggle.com/abhishek/bert-base-uncased-using-pytorch)

*Note: Some comments still describe the original BERT implementation. I will incrementally update them with RoBERTa-specific notes.*

I know what it feels like to look at Kaggle kernels and be completely daunted by the complexity, so I hope these comments & explanations help you read through and understand this wonderful solution better! :)

**To understand solutions better, I slowly read through them and comment at each line.**

Later, I realized that this is a good way to:

1. Increase my mastery of the solution, including both the concepts and tools used (e.g. BERT & PyTorch)!
2. Review solutions that I've previously understood and implemented
3. Help other people understand solutions better and faster as well

Without a doubt, I believe that reading and understanding the work of experienced data scientists is one of the best ways to accelerate your growth as a data scientist!

Please do *upvote* this notebook if you find it useful - this will indicate to me that there's value in providing these *commented* versions of solutions. This is my first notebook of this type, and I plan to do this for more solutions and Kaggle competitions in the future.

Happy learning and best of luck in your data science journey! 😄

# All the important imports

In [1]:
import os
import torch
import pandas as pd
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
from torch.optim import lr_scheduler

from sklearn import model_selection
from sklearn import metrics
import transformers
import tokenizers
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup
from tqdm.autonotebook import tqdm
import utils

In [2]:
ROBERTA_PATH = "../input/roberta-base"
TOKENIZER = tokenizers.ByteLevelBPETokenizer(
    vocab_file=f"{ROBERTA_PATH}/vocab.json", 
    merges_file=f"{ROBERTA_PATH}/merges.txt", 
    lowercase=True,
    add_prefix_space=True
)
TRAINING_FILE = "../input/tweet-train-folds/train_folds.csv"
MAX_LEN = 192
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 8
EPOCHS = 5


# Data Processing

In [3]:
class Spliter:
    """Remove extraneous whitespace characters. A single half-width space is prepended to the text.
    """
    
    @staticmethod
    def split(text):
        return " " + " ".join(str(text).split()) # Split on spaces, newlines, and tabs; consecutive whitespace is collapsed

class MakeLabel:
    """Flag the character positions of `text` that are covered by `selected_text`.
    text: thanks Todd.
    selected_text: thanks
    label: [1,1,1,1,1,1,0,0,0,0,0]
    """
    
    @classmethod
    def make(cls, text, selected_text):
        idx0, idx1 = cls.search_index(text, selected_text)
        label = cls.make_label(len(text), idx0, idx1)
        return label
        
    
    @staticmethod
    def search_index(text, selected_text):
        len_st = len(selected_text) - 1 # Exclude the one leading half-width space
        idx0 = None
        idx1 = None
        for ind in (i for i, e in enumerate(text) if e == selected_text[1]):
            if text[ind: ind+len_st] == selected_text[1:]:
                idx0 = ind
                idx1 = ind + len_st - 1
                break
        return idx0, idx1
    
    @staticmethod
    def make_label(text_length, idx0, idx1):
        label = [0] * text_length
        if idx0 is not None and idx1 is not None:
            for idx in range(idx0, idx1 + 1):
                label[idx] = 1
        return label
    
class TextToIds:
    """Tokenize the text and return the list of token ids together with the offsets.
    text: thanks Todd.
    token_id: [34, 76, 7]
    offset: [(0, 5), (6, 9), (10, 10)]
    """
    
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        
    def convert(self, text):
        enc_text = self.tokenizer.encode(text)
        ids, offsets = enc_text.ids, enc_text.offsets
        return ids, offsets

class SearchLabelIdPos:
    """Find which entries of `offsets` the label falls in (to align it with the tokenized ids).
    label: [1,1,1,1,1,1,0,0,0,0,0]
    offset: [(0, 5), (6, 9), (10, 10)]
    label_pos: [0]  (indices of the tokens that overlap the flagged characters)
    """
    
    def __init__(self):
        pass
    
    @staticmethod
    def search(label, offsets):
        label_pos = []
        for i, (offset1, offset2) in enumerate(offsets):
            if sum(label[offset1: offset2]) > 0:
                label_pos.append(i)
        return label_pos


class SentimentInputData:
    
    def __init__(self, tokenizer, max_len):
        self._tokenizer = tokenizer
        self._max_len = max_len
        self._fitted = False
    
    def fit(self, text, selected_text, sentiment):
        self.sentiment = sentiment
        self._sentiment_id = self._tokenizer.encode(self.sentiment).ids # positive: 1313, negative: 2430, neutral: 7974
        self.text = Spliter.split(text)
        self.selected_text = Spliter.split(selected_text)
        label = MakeLabel.make(self.text, self.selected_text)
        
        self._ids, self._offsets = TextToIds(self._tokenizer).convert(self.text)
        self._label_pos = SearchLabelIdPos.search(label, self._offsets)
        self._padding_length = max(0, self._max_len - len(self._ids) -5) # Subtract 5 for the extra tokens added in `self.ids` (<s>, sentiment, </s>, </s>, </s>)
        self._fitted = True
        
        return self
        
    def is_fitted(self):
        if not self._fitted:
            raise AttributeError('must be fit before get attribute')
    
    @property
    def ids(self):
        # https://huggingface.co/transformers/model_doc/roberta.html
        # 0: cls_token, 2: sep_token
        self.is_fitted()
        ids = [0] + self._sentiment_id + [2] + [2] + self._ids + [2]
        return ids + ([1] * self._padding_length)
    
    @property
    def token_type_ids(self):
        self.is_fitted()
        token_type_ids = [0, 0, 0, 0] + [0] * (len(self._ids) + 1)
        return token_type_ids + ([0] * self._padding_length)
    
    @property
    def attention_mask(self):
        self.is_fitted()
        attention_mask = [1] * (len(self._ids) + 5)
        return attention_mask + ([0] * self._padding_length)
    @property
    def offsets(self):
        self.is_fitted()
        offsets = [(0, 0)] * 4 + self._offsets + [(0, 0)]
        return offsets + ([(0, 0)] * self._padding_length)
    
    @property
    def targets_start(self):
        self.is_fitted()
        try: 
            output = self._label_pos[0] + 4
        except IndexError:
            print(self.text)
            print(self.selected_text)
            print(self._label_pos)
            raise IndexError('error')
        return output
#         return self._label_pos[0] + 4
    
    @property
    def targets_end(self):
        self.is_fitted()
        return self._label_pos[-1] + 4
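
To make the character-to-token mapping above more concrete, here is a minimal sketch that does not need the real tokenizer. The offsets are made up for illustration, and `char_mask` / `tokens_touching` are hypothetical helper names of mine that mirror what `MakeLabel` and `SearchLabelIdPos` do (using `str.find` instead of the manual search).

```python
def char_mask(text, selected):
    """Mark with 1 every character of `text` covered by the first occurrence of `selected`."""
    start = text.find(selected)
    mask = [0] * len(text)
    if start >= 0:
        for i in range(start, start + len(selected)):
            mask[i] = 1
    return mask


def tokens_touching(mask, offsets):
    """Return the indices of tokens whose character span overlaps the marked region."""
    return [i for i, (a, b) in enumerate(offsets) if sum(mask[a:b]) > 0]


text = " thanks Todd."
# Hypothetical offsets for three tokens: " thanks", " Todd", "."
offsets = [(0, 7), (7, 12), (12, 13)]

mask = char_mask(text, "Todd")
positions = tokens_touching(mask, offsets)
print(positions)  # [1]
```

The first and last entries of `positions` play the role of `_label_pos[0]` and `_label_pos[-1]`, i.e. the start and end tokens of the target span before the `+ 4` shift.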

In [4]:
pd.read_csv(TRAINING_FILE).head()

Unnamed: 0,textID,text,selected_text,sentiment,kfold
0,171b4de425,graduates college on saturday,graduates college on saturday,neutral,0
1,ed37eaaf83,"thanks i have to finish schoolwork today, no...",thanks,positive,0
2,3a2c407f74,how come when i straighten my hair it has to s...,how come when i straighten my hair it has to s...,neutral,0
3,cf966eb7b8,thanks Todd. Enjoyed reading your blog too - ...,Enjoyed,positive,0
4,2a2b5d0558,"Oh my Lord, I have no idea if any of this ****...",****,negative,0


In [6]:
sentiment_data = SentimentInputData(TOKENIZER, 192)
sentiment_data = sentiment_data.fit(text='thanks i have to finish schoolwork today',
                                    selected_text='ave',
                                    sentiment='neutral')
sentiment_data.text

' thanks i have to finish schoolwork today'

In [7]:

sentiment_data = SentimentInputData(TOKENIZER, 192)
sentiment_data = sentiment_data.fit(text='aw its okay tht happened wid me too..am so glad thts OVER now!am not helpin here am i?!?lol thnx for postin WMIAD loved it',
                                    selected_text='?lol thnx for postin WMIAD loved it',
                                    sentiment='neutral')
sentiment_data.text

' aw its okay tht happened wid me too..am so glad thts OVER now!am not helpin here am i?!?lol thnx for postin WMIAD loved it'

In [8]:
sentiment_data.targets_start

30
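
Why do `targets_start` and `targets_end` add 4? The sketch below rebuilds the input layout from the `ids` property with made-up tweet token ids. It assumes the sentiment word is encoded as a single token (as the comment in `fit` suggests); `build_inputs` is a hypothetical helper, not part of the notebook.

```python
CLS, SEP, PAD = 0, 2, 1  # special-token ids used by the `ids` property


def build_inputs(sentiment_ids, tweet_ids, max_len):
    """Lay out <s> sentiment </s></s> tweet </s> and pad up to max_len."""
    ids = [CLS] + sentiment_ids + [SEP, SEP] + tweet_ids + [SEP]
    pad = max(0, max_len - len(ids))
    mask = [1] * len(ids) + [0] * pad
    return ids + [PAD] * pad, mask


sentiment_ids = [7974]        # "neutral", id taken from the comment in `fit`
tweet_ids = [101, 102, 103]   # made-up ids for a three-token tweet
ids, mask = build_inputs(sentiment_ids, tweet_ids, max_len=12)

# Number of tokens placed before the tweet: <s>, sentiment, </s>, </s>
shift = 1 + len(sentiment_ids) + 2
print(ids)
print(mask)
print(shift, ids[shift])
```

The first tweet token lands at index 4, so a label position found in the tweet-only offsets has to be shifted by 4 to index into the full input.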

# Data loader

In [9]:
class TweetDataset:
    """
    Dataset which stores the tweets and returns them as processed features
    """
    def __init__(self, tweet, sentiment, selected_text):
        self.tweet = tweet
        self.sentiment = sentiment
        self.selected_text = selected_text
        self.input_data = SentimentInputData(TOKENIZER, MAX_LEN)
        self.tokenizer = TOKENIZER
        self.max_len = MAX_LEN
    
    def __len__(self):
        return len(self.tweet)

    def __getitem__(self, item):
        data = self.input_data.fit(
            self.tweet[item], 
            self.selected_text[item], 
            self.sentiment[item]
        )

        # Return the processed data where the lists are converted to `torch.tensor`s
        return {
            'ids': torch.tensor(data.ids, dtype=torch.long),
            'mask': torch.tensor(data.attention_mask, dtype=torch.long),
            'token_type_ids': torch.tensor(data.token_type_ids, dtype=torch.long),
            'targets_start': torch.tensor(data.targets_start, dtype=torch.long),
            'targets_end': torch.tensor(data.targets_end, dtype=torch.long),
            'orig_tweet': data.text,
            'orig_selected': data.selected_text,
            'sentiment': data.sentiment,
            'offsets': torch.tensor(data.offsets, dtype=torch.long)
        }

# The Model

In [10]:
class TweetModel(transformers.BertPreTrainedModel):
    """
    Model class that combines a pretrained RoBERTa model with a linear layer
    """
    def __init__(self, conf):
        super(TweetModel, self).__init__(conf)
        # Load the pretrained RoBERTa model
        self.roberta = transformers.RobertaModel.from_pretrained(ROBERTA_PATH, config=conf)
        # Set 10% dropout to be applied to the RoBERTa backbone's output
        self.drop_out = nn.Dropout(0.1)
        # 768 is the dimensionality of roberta-base's hidden representations
        # Multiplied by 2 since the forward pass concatenates the last two hidden representation layers
        # The output will have two dimensions ("start_logits", and "end_logits")
        self.l0 = nn.Linear(768 * 2, 2)
        torch.nn.init.normal_(self.l0.weight, std=0.02)
    
    def forward(self, ids, mask, token_type_ids):
        # Return the hidden states from the RoBERTa backbone
        _, _, out = self.roberta(
            ids,
            attention_mask=mask,
            token_type_ids=token_type_ids
        ) # (embeddings + layers) x bs x SL x 768

        # Concatenate the last two hidden states
        # This is done since experiments have shown that just getting the last layer
        # gives out vectors that may be too tailored to the original pretraining objectives (MLM + NSP for BERT; RoBERTa uses MLM only)
        # Sample explanation: https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-the-last-hidden-layer-why-second-to-last
        out = torch.cat((out[-1], out[-2]), dim=-1) # bs x SL x (768 * 2)
        # Apply 10% dropout to the last 2 hidden states
        out = self.drop_out(out) # bs x SL x (768 * 2)
        # The "dropped out" hidden vectors are now fed into the linear layer to output two scores
        logits = self.l0(out) # bs x SL x 2

        # Splits the tensor into start_logits and end_logits
        # (bs x SL x 2) -> (bs x SL x 1), (bs x SL x 1)
        start_logits, end_logits = logits.split(1, dim=-1)

        start_logits = start_logits.squeeze(-1) # (bs x SL)
        end_logits = end_logits.squeeze(-1) # (bs x SL)

        return start_logits, end_logits
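
If the shape comments in `forward` are hard to follow, the sketch below replays the same steps on random tensors. It is only a shape check under my own assumptions: random values stand in for the real hidden states, and I assume 13 hidden-state tensors (the embedding output plus 12 layers for roberta-base).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bs, seq_len, hidden = 2, 6, 768

# Stand-in for the tuple of hidden states returned when output_hidden_states=True
hidden_states = tuple(torch.randn(bs, seq_len, hidden) for _ in range(13))

head = nn.Linear(hidden * 2, 2)

# Concatenate the last two layers along the feature dimension
features = torch.cat((hidden_states[-1], hidden_states[-2]), dim=-1)  # bs x SL x 1536
logits = head(features)                                              # bs x SL x 2

# One score per token for "is this the start?" and one for "is this the end?"
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1)  # bs x SL
end_logits = end_logits.squeeze(-1)      # bs x SL

print(features.shape, logits.shape, start_logits.shape, end_logits.shape)
```

Each row of `start_logits` holds one score per token position, which is why the loss later treats the sequence length as the number of classes.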

# Loss Function

In [11]:
def loss_fn(start_logits, end_logits, start_positions, end_positions):
    """
    Return the sum of the cross entropy losses for both the start and end logits
    """
    loss_fct = nn.CrossEntropyLoss()
    start_loss = loss_fct(start_logits, start_positions)
    end_loss = loss_fct(end_logits, end_positions)
    total_loss = (start_loss + end_loss)
    return total_loss
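
To see what the loss measures for a single tweet, here is a hand-rolled cross entropy in plain Python. It is a sketch with made-up logits; `cross_entropy` is my own helper, and note that `nn.CrossEntropyLoss` additionally averages over the batch by default.

```python
import math


def cross_entropy(logits, target):
    """-log(softmax(logits)[target]), computed with the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# One score per token position (4 tokens here); values are made up
start_logits = [0.1, 2.0, 0.3, -1.0]
end_logits = [0.0, 0.2, 1.5, -0.5]

# The true span starts at token 1 and ends at token 2
loss = cross_entropy(start_logits, 1) + cross_entropy(end_logits, 2)

# With uniform logits the loss per head is log(number of positions)
uniform = cross_entropy([0.0] * 4, 0)
print(loss, uniform)
```

The loss shrinks as the logit at the true start (or end) position grows relative to the others, so the model is pushed to put its highest score on the correct token.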

# Training Function

In [12]:
def train_fn(data_loader, model, optimizer, device, scheduler=None):
    """
    Trains the RoBERTa model on the Twitter data
    """
    # Set model to training mode (dropout is active; batch norm layers, if any, use batch statistics)
    model.train()
    losses = utils.AverageMeter()
    jaccards = utils.AverageMeter()

    # Set tqdm to add loading screen and set the length
    tk0 = tqdm(data_loader, total=len(data_loader))
    
    # Train the model on each batch
    for bi, d in enumerate(tk0):

        ids = d["ids"]
        token_type_ids = d["token_type_ids"]
        mask = d["mask"]
        targets_start = d["targets_start"]
        targets_end = d["targets_end"]
        sentiment = d["sentiment"]
        orig_selected = d["orig_selected"]
        orig_tweet = d["orig_tweet"]
        offsets = d["offsets"]
        

        # Move ids, masks, and targets to gpu while setting as torch.long
        ids = ids.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        mask = mask.to(device, dtype=torch.long)
        targets_start = targets_start.to(device, dtype=torch.long)
        targets_end = targets_end.to(device, dtype=torch.long)

        # Reset gradients
        model.zero_grad()
        # Use ids, masks, and token types as input to the model
        # Predict logits for each of the input tokens for each batch
        outputs_start, outputs_end = model(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids,
        ) # (bs x SL), (bs x SL)
        # Calculate batch loss based on CrossEntropy
        loss = loss_fn(outputs_start, outputs_end, targets_start, targets_end)
        # Calculate gradients based on loss
        loss.backward()
        # Adjust weights based on calculated gradients
        optimizer.step()
        # Update scheduler (skipped when no scheduler is passed in)
        if scheduler is not None:
            scheduler.step()
        
        # Apply softmax to the start and end logits
        # This squeezes each of the logits in a sequence to a value between 0 and 1, while ensuring that they sum to 1
        # This is similar to the characteristics of "probabilities"
        outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
        outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()
        
        # Calculate the jaccard score based on the predictions for this batch
        jaccard_scores = []
        for px, tweet in enumerate(orig_tweet):
            selected_tweet = orig_selected[px]
            tweet_sentiment = sentiment[px]
            jaccard_score, _ = calculate_jaccard_score(
                original_tweet=tweet, # Full text of the px'th tweet in the batch
                target_string=selected_tweet, # Span containing the specified sentiment for the px'th tweet in the batch
                sentiment_val=tweet_sentiment, # Sentiment of the px'th tweet in the batch
                idx_start=np.argmax(outputs_start[px, :]), # Predicted start index for the px'th tweet in the batch
                idx_end=np.argmax(outputs_end[px, :]), # Predicted end index for the px'th tweet in the batch
                offsets=offsets[px] # Offsets for each of the tokens for the px'th tweet in the batch
            )
            jaccard_scores.append(jaccard_score)
        # Update the jaccard score and loss
        # For details, refer to `AverageMeter` in https://www.kaggle.com/abhishek/utils
        jaccards.update(np.mean(jaccard_scores), ids.size(0))
        losses.update(loss.item(), ids.size(0))
        # Print the average loss and jaccard score at the end of each batch
        tk0.set_postfix(loss=losses.avg, jaccard=jaccards.avg)
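
The `update(value, n)` calls above pass the batch size as a weight. The source of `utils.AverageMeter` is not shown in this notebook, so the class below is a hypothetical stand-in that assumes it keeps a count-weighted running mean.

```python
class RunningAverage:
    """Hypothetical stand-in for utils.AverageMeter: a count-weighted running mean."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, value, n=1):
        # `value` is a per-batch mean, `n` is the number of samples in that batch
        self.sum += value * n
        self.count += n
        self.avg = self.sum / self.count


meter = RunningAverage()
meter.update(0.60, n=32)  # a full batch
meter.update(0.90, n=16)  # a smaller final batch counts for less
print(round(meter.avg, 4))  # 0.7
```

Weighting by batch size keeps a smaller final batch from skewing the reported average.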

# Evaluation Functions

In [13]:
def calculate_jaccard_score(
    original_tweet, 
    target_string, 
    sentiment_val, 
    idx_start, 
    idx_end, 
    offsets,
    verbose=False):
    """
    Calculate the jaccard score from the predicted span and the actual span for a single tweet
    """
    
    # A span's end index has to be greater than or equal to its start index
    # If this doesn't hold, the end index is set to equal the start index (the span is a single token)
    if idx_end < idx_start:
        idx_end = idx_start
    
    # Combine into a string the tokens that belong to the predicted span
    filtered_output  = ""
    for ix in range(idx_start, idx_end + 1):
        filtered_output += original_tweet[offsets[ix][0]: offsets[ix][1]]
        # If the token is not the last token in the tweet, and the ending offset of the current token is less
        # than the beginning offset of the following token, add a space.
        # Basically, add a space when the next token (word piece) corresponds to a new word
        if (ix+1) < len(offsets) and offsets[ix][1] < offsets[ix+1][0]:
            filtered_output += " "

    # Set the predicted output as the original tweet when the tweet's sentiment is "neutral", or the tweet only contains one word
    if sentiment_val == "neutral" or len(original_tweet.split()) < 2:
        filtered_output = original_tweet

    # Calculate the jaccard score between the predicted span, and the actual span
    # The IOU (intersection over union) approach is detailed in the utils module's `jaccard` function:
    # https://www.kaggle.com/abhishek/utils
    jac = utils.jaccard(target_string.strip(), filtered_output.strip())
    return jac, filtered_output


def eval_fn(data_loader, model, device):
    """
    Evaluation function to predict on the validation set
    """
    # Set model to evaluation mode
    # I.e., turn off dropout and set batchnorm to use overall mean and variance (from training), rather than batch level mean and variance
    # Reference: https://github.com/pytorch/pytorch/issues/5406
    model.eval()
    losses = utils.AverageMeter()
    jaccards = utils.AverageMeter()
    
    # Turns off gradient calculations (https://datascience.stackexchange.com/questions/32651/what-is-the-use-of-torch-no-grad-in-pytorch)
    with torch.no_grad():
        tk0 = tqdm(data_loader, total=len(data_loader))
        # Make predictions and calculate loss / jaccard score for each batch
        for bi, d in enumerate(tk0):
            ids = d["ids"]
            token_type_ids = d["token_type_ids"]
            mask = d["mask"]
            sentiment = d["sentiment"]
            orig_selected = d["orig_selected"]
            orig_tweet = d["orig_tweet"]
            targets_start = d["targets_start"]
            targets_end = d["targets_end"]
            offsets = d["offsets"].numpy()
            
            # Move ids, masks, and targets to gpu while setting as torch.long
            ids = ids.to(device, dtype=torch.long)
            token_type_ids = token_type_ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)
            targets_start = targets_start.to(device, dtype=torch.long)
            targets_end = targets_end.to(device, dtype=torch.long)

            # Predict logits for start and end indexes
            outputs_start, outputs_end = model(
                ids=ids,
                mask=mask,
                token_type_ids=token_type_ids
            )
            # Calculate loss for the batch
            loss = loss_fn(outputs_start, outputs_end, targets_start, targets_end)
            # Apply softmax to the predicted logits for the start and end indexes
            # This converts the "logits" to "probability-like" scores
            outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
            outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()
            # Calculate jaccard scores for each tweet in the batch
            jaccard_scores = []
            for px, tweet in enumerate(orig_tweet):
                selected_tweet = orig_selected[px]
                tweet_sentiment = sentiment[px]
                jaccard_score, _ = calculate_jaccard_score(
                    original_tweet=tweet,
                    target_string=selected_tweet,
                    sentiment_val=tweet_sentiment,
                    idx_start=np.argmax(outputs_start[px, :]),
                    idx_end=np.argmax(outputs_end[px, :]),
                    offsets=offsets[px]
                )
                jaccard_scores.append(jaccard_score)

            # Update running jaccard score and loss
            jaccards.update(np.mean(jaccard_scores), ids.size(0))
            losses.update(loss.item(), ids.size(0))
            # Print the running average loss and jaccard score
            tk0.set_postfix(loss=losses.avg, jaccard=jaccards.avg)
    
    print(f"Jaccard = {jaccards.avg}")
    return jaccards.avg
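
Here is a small worked example of what `calculate_jaccard_score` does for one tweet: rebuild the predicted text from token offsets, then compare word sets. I am assuming `utils.jaccard` is the usual word-level Jaccard (its source is not shown here); `jaccard` and `decode_span` below are my own illustrative helpers, and the offsets are made up.

```python
def jaccard(str1, str2):
    """Word-level Jaccard: shared words / all distinct words (assumed to match utils.jaccard)."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a & b
    union = len(a) + len(b) - len(c)
    return len(c) / union if union else 0.0


def decode_span(text, offsets, idx_start, idx_end):
    """Join the text of tokens idx_start..idx_end, adding a space where tokens are not adjacent."""
    if idx_end < idx_start:
        idx_end = idx_start
    pieces = []
    for ix in range(idx_start, idx_end + 1):
        a, b = offsets[ix]
        pieces.append(text[a:b])
        if ix + 1 < len(offsets) and b < offsets[ix + 1][0]:
            pieces.append(" ")
    return "".join(pieces).strip()


text = "so happy today"
offsets = [(0, 2), (3, 8), (9, 14)]  # hypothetical one-token-per-word offsets

pred = decode_span(text, offsets, 1, 2)
score = jaccard("happy", pred)
print(pred, score)  # happy today 0.5
```

The prediction shares one word with the target out of two distinct words overall, hence 0.5. If the predicted end came before the start (say `idx_start=2, idx_end=1`), the span would collapse to the single token at the start index.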

# Training

In [14]:
def run(fold):
    """
    Train model for a specified fold
    """
    # Read training csv
    dfx = pd.read_csv(TRAINING_FILE)

    # Set train validation set split
    df_train = dfx[dfx.kfold != fold].reset_index(drop=True)
    df_valid = dfx[dfx.kfold == fold].reset_index(drop=True)
    
    # Instantiate TweetDataset with training data
    train_dataset = TweetDataset(
        tweet=df_train.text.values,
        sentiment=df_train.sentiment.values,
        selected_text=df_train.selected_text.values
    )

    # Instantiate DataLoader with `train_dataset`
    # This is a generator that yields the dataset in batches
    train_data_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=TRAIN_BATCH_SIZE,
        num_workers=4
    )

    # Instantiate TweetDataset with validation data
    valid_dataset = TweetDataset(
        tweet=df_valid.text.values,
        sentiment=df_valid.sentiment.values,
        selected_text=df_valid.selected_text.values
    )

    # Instantiate DataLoader with `valid_dataset`
    valid_data_loader = torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=VALID_BATCH_SIZE,
        num_workers=2
    )

    # Set device as `cuda` (GPU)
    device = torch.device("cuda")
    # Load pretrained RoBERTa
    model_config = transformers.RobertaConfig.from_pretrained(ROBERTA_PATH)
    # Output hidden states
    # This is important to set since we want to concatenate the hidden states from the last 2 RoBERTa layers
    model_config.output_hidden_states = True
    # Instantiate our model with `model_config`
    model = TweetModel(conf=model_config)
    # Move the model to the GPU
    model.to(device)

    # Calculate the number of training steps
    num_train_steps = int(len(df_train) / TRAIN_BATCH_SIZE * EPOCHS)
    # Get the list of named parameters
    param_optimizer = list(model.named_parameters())
    # Specify parameters where weight decay shouldn't be applied
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    # Define two sets of parameters: those with weight decay, and those without
    optimizer_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
    ]
    # Instantiate AdamW optimizer with our two sets of parameters, and a learning rate of 3e-5
    optimizer = AdamW(optimizer_parameters, lr=3e-5)
    # Create a scheduler to set the learning rate at each training step
    # "Create a schedule with a learning rate that decreases linearly after linearly increasing during a warmup period." (https://pytorch.org/docs/stable/optim.html)
    # Since num_warmup_steps = 0, the learning rate starts at 3e-5, and then linearly decreases at each training step
    scheduler = get_linear_schedule_with_warmup(
        optimizer, 
        num_warmup_steps=0, 
        num_training_steps=num_train_steps
    )

    # Apply early stopping with patience of 2
    # This means to stop training new epochs when 2 rounds have passed without any improvement
    es = utils.EarlyStopping(patience=2, mode="max")
    print(f"Training is Starting for fold={fold}")
    
    # Note: only 3 epochs are run here even though EPOCHS = 5 (num_train_steps above is still computed from EPOCHS)
    for epoch in range(3):
        train_fn(train_data_loader, model, optimizer, device, scheduler=scheduler)
        jaccard = eval_fn(valid_data_loader, model, device)
        print(f"Jaccard Score = {jaccard}")
        es(jaccard, model, model_path=f"model_{fold}.bin")
        if es.early_stop:
            print("Early stopping")
            break
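
To get a feel for the learning-rate schedule used in `run`, here is a plain-Python sketch of a linear warmup/decay multiplier. It reflects my understanding of how `get_linear_schedule_with_warmup` scales the base learning rate; `linear_schedule_factor` is a hypothetical name and the step counts are illustrative.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5
total = 1000  # illustrative number of training steps

lrs = [base_lr * linear_schedule_factor(s, 0, total) for s in (0, 500, 1000)]
print(lrs)

# Stopping after 3 of 5 planned epochs means the decay only reaches 40% of the base rate
factor_after_3_of_5_epochs = linear_schedule_factor(600, 0, total)
print(factor_after_3_of_5_epochs)  # 0.4
```

With no warmup the rate starts at the full 3e-5 and falls in a straight line. Since the step count is derived from `EPOCHS = 5` but only 3 epochs are run, the rate never decays all the way to zero in this setup.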

In [15]:
fold=0
# Read training csv
dfx = pd.read_csv(TRAINING_FILE)

# Set train validation set split
df_train = dfx[dfx.kfold != fold].reset_index(drop=True)
df_valid = dfx[dfx.kfold == fold].reset_index(drop=True)

# Instantiate TweetDataset with training data
train_dataset = TweetDataset(
    tweet=df_train.text.values,
    sentiment=df_train.sentiment.values,
    selected_text=df_train.selected_text.values
)

# Instantiate DataLoader with `train_dataset`
# This is a generator that yields the dataset in batches
train_data_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=TRAIN_BATCH_SIZE,
    num_workers=4
)

# Instantiate TweetDataset with validation data
valid_dataset = TweetDataset(
    tweet=df_valid.text.values,
    sentiment=df_valid.sentiment.values,
    selected_text=df_valid.selected_text.values
)

# Instantiate DataLoader with `valid_dataset`
valid_data_loader = torch.utils.data.DataLoader(
    valid_dataset,
    batch_size=VALID_BATCH_SIZE,
    num_workers=2
)

In [16]:
run(fold=0)

Training is Starting for fold=0

Jaccard = 0.6909789212442821
Jaccard Score = 0.6909789212442821
Validation score improved (-inf --> 0.6909789212442821). Saving model!

Jaccard = 0.6953626589997153
Jaccard Score = 0.6953626589997153
Validation score improved (0.6909789212442821 --> 0.6953626589997153). Saving model!

Jaccard = 0.6992251107549516
Jaccard Score = 0.6992251107549516
Validation score improved (0.6953626589997153 --> 0.6992251107549516). Saving model!


In [17]:
run(fold=1)

Training is Starting for fold=1

Jaccard = 0.6938759908226657
Jaccard Score = 0.6938759908226657
Validation score improved (-inf --> 0.6938759908226657). Saving model!

Jaccard = 0.6930737407043585
Jaccard Score = 0.6930737407043585
EarlyStopping counter: 1 out of 2

Jaccard = 0.6952339712935108
Jaccard Score = 0.6952339712935108
Validation score improved (0.6938759908226657 --> 0.6952339712935108). Saving model!


In [18]:
run(fold=2)

Training is Starting for fold=2

Jaccard = 0.6995753713833153
Jaccard Score = 0.6995753713833153
Validation score improved (-inf --> 0.6995753713833153). Saving model!

Jaccard = 0.700698734080668
Jaccard Score = 0.700698734080668
Validation score improved (0.6995753713833153 --> 0.700698734080668). Saving model!

Jaccard = 0.7064809285709213
Jaccard Score = 0.7064809285709213
Validation score improved (0.700698734080668 --> 0.7064809285709213). Saving model!


In [19]:
run(fold=3)

Training is Starting for fold=3

Jaccard = 0.6961423060880071
Jaccard Score = 0.6961423060880071
Validation score improved (-inf --> 0.6961423060880071). Saving model!

Jaccard = 0.7016664085710863
Jaccard Score = 0.7016664085710863
Validation score improved (0.6961423060880071 --> 0.7016664085710863). Saving model!

Jaccard = 0.7001043914782975
Jaccard Score = 0.7001043914782975
EarlyStopping counter: 1 out of 2


In [20]:
run(fold=4)

Training is Starting for fold=4

Jaccard = 0.6956843404159215
Jaccard Score = 0.6956843404159215
Validation score improved (-inf --> 0.6956843404159215). Saving model!

Jaccard = 0.70154201897117
Jaccard Score = 0.70154201897117
Validation score improved (0.6956843404159215 --> 0.70154201897117). Saving model!

Jaccard = 0.7107033768354419
Jaccard Score = 0.7107033768354419
Validation score improved (0.70154201897117 --> 0.7107033768354419). Saving model!


# Do the evaluation on test data

In [21]:
df_test = pd.read_csv("../input/tweet-sentiment-extraction/test.csv")
df_test.loc[:, "selected_text"] = df_test.text.values

In [22]:
device = torch.device("cuda")
model_config = transformers.RobertaConfig.from_pretrained(ROBERTA_PATH)
model_config.output_hidden_states = True

In [23]:
# Load each of the five trained models and move to GPU
model1 = TweetModel(conf=model_config)
model1.to(device)
model1.load_state_dict(torch.load("model_0.bin"))
model1.eval()

model2 = TweetModel(conf=model_config)
model2.to(device)
model2.load_state_dict(torch.load("model_1.bin"))
model2.eval()

model3 = TweetModel(conf=model_config)
model3.to(device)
model3.load_state_dict(torch.load("model_2.bin"))
model3.eval()

model4 = TweetModel(conf=model_config)
model4.to(device)
model4.load_state_dict(torch.load("model_3.bin"))
model4.eval()

model5 = TweetModel(conf=model_config)
model5.to(device)
model5.load_state_dict(torch.load("model_4.bin"))
model5.eval()

TweetModel(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, el

In [24]:
final_output = []

# Instantiate TweetDataset with the test data
test_dataset = TweetDataset(
        tweet=df_test.text.values,
        sentiment=df_test.sentiment.values,
        selected_text=df_test.selected_text.values
)

# Instantiate DataLoader with `test_dataset`
data_loader = torch.utils.data.DataLoader(
    test_dataset,
    shuffle=False,
    batch_size=VALID_BATCH_SIZE,
    num_workers=1
)

# Turn off gradient calculations
with torch.no_grad():
    tk0 = tqdm(data_loader, total=len(data_loader))
    # Predict the span containing the sentiment for each batch
    for bi, d in enumerate(tk0):
        ids = d["ids"]
        token_type_ids = d["token_type_ids"]
        mask = d["mask"]
        sentiment = d["sentiment"]
        orig_selected = d["orig_selected"]
        orig_tweet = d["orig_tweet"]
        targets_start = d["targets_start"]
        targets_end = d["targets_end"]
        offsets = d["offsets"].numpy()

        ids = ids.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        mask = mask.to(device, dtype=torch.long)
        targets_start = targets_start.to(device, dtype=torch.long)
        targets_end = targets_end.to(device, dtype=torch.long)

        # Predict start and end logits for each of the five models
        outputs_start1, outputs_end1 = model1(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids
        )
        
        outputs_start2, outputs_end2 = model2(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids
        )
        
        outputs_start3, outputs_end3 = model3(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids
        )
        
        outputs_start4, outputs_end4 = model4(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids
        )
        
        outputs_start5, outputs_end5 = model5(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids
        )
        
        # Get the average start and end logits across the five models and use these as predictions
        # This is a form of "ensembling"
        outputs_start = (
            outputs_start1 
            + outputs_start2 
            + outputs_start3 
            + outputs_start4 
            + outputs_start5
        ) / 5
        outputs_end = (
            outputs_end1 
            + outputs_end2 
            + outputs_end3 
            + outputs_end4 
            + outputs_end5
        ) / 5
        
        # Apply softmax to the predicted start and end logits
        outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
        outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()

        # Convert the start and end scores to actual predicted spans (in string form)
        for px, tweet in enumerate(orig_tweet):
            selected_tweet = orig_selected[px]
            tweet_sentiment = sentiment[px]
            _, output_sentence = calculate_jaccard_score(
                original_tweet=tweet,
                target_string=selected_tweet,
                sentiment_val=tweet_sentiment,
                idx_start=np.argmax(outputs_start[px, :]),
                idx_end=np.argmax(outputs_end[px, :]),
                offsets=offsets[px]
            )
            final_output.append(output_sentence)
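
To make the two steps in the cell above concrete (averaging logits across models, then turning token indices into a string via character offsets), here is a small dependency-free sketch. It uses plain Python lists instead of tensors, three made-up "models" instead of five, and a toy tweet with hand-written offsets. `decode_span` is a simplified stand-in for illustration, not the notebook's `calculate_jaccard_score`, and the fallback for `idx_end < idx_start` is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_logits(per_model_logits):
    """Element-wise mean of several equally long logit lists."""
    n_models = len(per_model_logits)
    return [sum(column) / n_models for column in zip(*per_model_logits)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def decode_span(tweet, offsets, idx_start, idx_end):
    """Slice the tweet from the first character of the start token
    to the last character of the end token (simplified decoding)."""
    if idx_end < idx_start:  # assumption: fall back to a single token
        idx_end = idx_start
    return tweet[offsets[idx_start][0]:offsets[idx_end][1]]


# Toy inputs: one tweet, five tokens, made-up (start_char, end_char) offsets
tweet = "such a shame about today"
offsets = [(0, 4), (5, 6), (7, 12), (13, 18), (19, 24)]

# Made-up start/end logits from three models (one list per model)
start_logits = [
    [2.0, 0.1, 0.3, -1.0, -2.0],
    [1.5, 0.2, 1.6, -0.5, -1.0],
    [2.2, 0.0, 0.4, -1.2, -2.1],
]
end_logits = [
    [-1.0, 0.2, 2.5, 0.1, -0.5],
    [-0.5, 0.1, 1.0, 1.2, 0.0],
    [-1.2, 0.3, 2.1, 0.2, -0.4],
]

# Ensemble by averaging logits, then convert to probabilities
avg_start = average_logits(start_logits)
avg_end = average_logits(end_logits)
start_probs = softmax(avg_start)
end_probs = softmax(avg_end)

idx_start = argmax(start_probs)
idx_end = argmax(end_probs)
predicted = decode_span(tweet, offsets, idx_start, idx_end)
print(predicted)
```

Note that the second model on its own would pick a different start token (index 2) and end token (index 3); averaging lets the other two outvote it. Softmax is monotonic, so taking the argmax of the probabilities selects the same index as taking the argmax of the averaged logits.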

In [25]:
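# Post-processing trick: lowercase the prediction and drop duplicate words
# (going through `set` also discards the original word order)
# Note: This trick comes from: https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/140942
# When the LB resets, this trick won't help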
def post_process(selected):
    return " ".join(set(selected.lower().split()))
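
As a rough illustration of what `post_process` does to a prediction, the sketch below applies it to a made-up string and compares word-level Jaccard scores before and after. `word_jaccard` is a hypothetical name; it assumes a Jaccard computed on lowercased, whitespace-split word sets, so treat it as an assumption rather than the exact scorer behind the leaderboard. The sketch only shows what the function does to a string and makes no claim about why the trick affected the leaderboard.

```python
def word_jaccard(str1, str2):
    """Word-level Jaccard on lowercased, whitespace-split word sets (assumed metric).
    Raises ZeroDivisionError if both strings contain no words."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))


def post_process(selected):
    # Same one-liner as in the cell above
    return " ".join(set(selected.lower().split()))


prediction = "Happy bday! happy"  # made-up prediction with a repeated word
target = "happy bday!"            # made-up ground truth

processed = post_process(prediction)

score_before = word_jaccard(prediction, target)
score_after = word_jaccard(processed, target)
```

Because a Python `set` has no defined order, `processed` may come out as either `happy bday!` or `bday! happy`. That is why rows in the submission preview, such as `bday! happy`, can appear reordered. Under a set-based word Jaccard, case, duplicates and word order do not matter, so `score_before` and `score_after` are equal.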

In [26]:
sample = pd.read_csv("../input/tweet-sentiment-extraction/sample_submission.csv")
sample.loc[:, 'selected_text'] = final_output
sample.selected_text = sample.selected_text.map(post_process)
sample.to_csv("submission.csv", index=False)

In [27]:
sample.head()

Unnamed: 0,textID,selected_text
0,f87dea47db,session last of day http://twitpic.com/67ezh the
1,96d74cb729,exciting
2,eee518ae67,such a shame!
3,01082688c6,bday! happy
4,33987a8ee5,like it!! i
