# About this Notebook

To begin with, I would like to say that **this notebook is not aimed at high scores in the competition**. It was created to help me personally understand **how transformer models work with short texts** and to assess whether such giant models can (or cannot!) outperform simpler ones.

I have experience with transformers for Russian-language tasks, where the choice of models is extremely limited: there is only a pre-trained [RuBERT from DeepPavlov](http://docs.deeppavlov.ai/en/master/features/models/bert.html). I therefore wanted to try other models that are luckily available for English. So, in this notebook, I'll use RoBERTa to analyze tweets.

In this kernel, I will briefly explain the structure of the dataset, then generate and analyze meta features. After that I will explore the tokenizer for transformer models and train RoBERTa on the given corpus.

This kernel includes code and ideas from the kernels below:
1. [NLP with Disaster Tweets - EDA, Cleaning and BERT](https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert) by @Gunes Evitan: meta feature extraction patterns and test answers.
2. [Basic EDA,Cleaning and GloVe](https://www.kaggle.com/shahules/basic-eda-cleaning-and-glove#Data-Cleaning) by Shahules: meta feature extraction patterns.


**This kernel is a work in progress, and I will keep updating it as the competition progresses and I learn more about the data.**

<font color='red'>**If you find this kernel useful, please upvote it; it motivates me to write more quality content.**</font>



# 0. Setting up our environment

In [None]:
!pip install transformers

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
# General
import numpy as np
import pandas as pd

import os
from os import listdir
from os.path import isfile, join

import re
from typing import Dict, Any, List, NoReturn, Optional
from pathlib import Path
import subprocess as sp
import nvidia_smi
import random
import traceback
import string
from pprint import pprint
from collections import defaultdict, Counter

# Time and loading
from tqdm import tqdm
import time
# from datetime import datetime
import datetime

# Sklearns
from sklearn.model_selection import train_test_split
from sklearn.metrics import (balanced_accuracy_score, accuracy_score, 
                             classification_report, confusion_matrix)

# Torch and NN libs
import torch

# Save and load models
import joblib
import pickle

import warnings
warnings.filterwarnings('ignore')

In [None]:
# System adjustments - for all columns to fit into output (default width is 80)
pd.options.display.width = 2500
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', None)
pd.options.display.max_colwidth = 120

## 0.1 Mounting paths

In [None]:
ROOT_DIRECTORY = "/kaggle"
INPUT_DATA_DIRECTORY = Path(ROOT_DIRECTORY) / 'input' / "nlp-getting-started"
OUTPUT_DATA_DIRECTORY = Path(ROOT_DIRECTORY) / 'output' / "kaggle" 
MODEL_DIRECTORY = Path(ROOT_DIRECTORY) / 'input' / "roberta-base"
SAVE_MODEL_DIRECTORY = (OUTPUT_DATA_DIRECTORY / 'roberta-tweet')
SAVE_MODEL_DIRECTORY.mkdir(parents=True, exist_ok=True)

## 0.2 Checking for the hardware backend

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
print(device, n_gpu)
# Only query the device name when a GPU is actually present
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

In [None]:
!nvidia-smi

In [None]:
def seed_everything(seed_value):
    random.seed(seed_value) # Python
    np.random.seed(seed_value) # cpu vars
    torch.manual_seed(seed_value) # cpu  vars
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)  # for using CUDA backend
        torch.cuda.manual_seed_all(seed_value) # gpu vars
        torch.backends.cudnn.deterministic = True  # get rid of nondeterminism
        torch.backends.cudnn.benchmark = False
        
seed_everything(11)

# 1. Loading datasets

In [None]:
# Read datasets
df_train = pd.read_csv(str(INPUT_DATA_DIRECTORY / 'train.csv'), dtype={'id': np.int16, 'target': np.int8})
df_test = pd.read_csv(str(INPUT_DATA_DIRECTORY / 'test.csv'), dtype={'id': np.int16})

print('Training Set Shape = {}'.format(df_train.shape))
print('Training Set Memory Usage = {:.2f} MB'.format(df_train.memory_usage().sum() / 1024**2))
print('Test Set Shape = {}'.format(df_test.shape))
print('Test Set Memory Usage = {:.2f} MB'.format(df_test.memory_usage().sum() / 1024**2))

df_train.head()

In [None]:
# Check the proportion of targets in train dataset
targets_counts = df_train.target.value_counts()
print("Train target distribution:\n", targets_counts)

# Keyword arguments are used because recent seaborn versions reject positional x/y
sns.barplot(x=targets_counts.index, y=targets_counts.values)
_ = plt.gca().set_ylabel('Number of tweets')

## 1.1 Keyword & Location

Both the training and test sets have the Location and Keyword fields missing in many cases. 
At the same time, the ratios of missing values in keyword and location are the same in both sets.

* 66.7% of location is filled in both the training and test sets;
* 99.2% of keyword is filled in both the training and test sets;

Since the missing-value ratios of the training and test sets are so close, they were most probably taken from the same sample. 

### Location
Let's take a closer look at the location field. <br>
It is filled in about 66% of cases and contains about **3341 unique values**. That's a lot! 

This happens because locations on Twitter are not generated automatically; they are user inputs. 
However, if the user was on Twitter for Android or Twitter for iOS, the Tweet may also include the precise location (i.e., the GPS coordinates from which it was sent), which can be found through the Twitter API, in addition to the location label the user selects. 

Although this is useful for users, the free-text field is unsuitable as a feature for the model, as can be seen from the correlation matrix below.

In [None]:
print(f"In train dataset Location is filled in {sum(df_train.location.notnull())} of cases from {df_train.shape[0]}, so in {100*sum(df_train.location.notnull())/df_train.shape[0]}%.") 
print(f"In test dataset Location is filled in {sum(df_test.location.notnull())} of cases from {df_test.shape[0]}, so in {100*sum(df_test.location.notnull())/df_test.shape[0]}%.")

print(f"Unique locations found: {df_train.location.nunique()}.")

loc_data = df_train.groupby(by='location').agg({'id': "count", 'target': 'sum'}).reset_index().sort_values(by='id', ascending=False)
fig = px.bar(loc_data.head(100), x='id', y="location", color='target',
             title="Top Locations", width=800, height=1000)
fig.show()

In [None]:
# Any simple correlation between target and label encoded location (not one-hot)?

df_train['location_index'] = df_train.location.astype('category').cat.codes


ax = sns.heatmap(
    df_train.loc[df_train.location.notnull()][['location_index', 'target']].corr(), 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)

### Keyword
Fortunately, we have exactly the opposite picture with the Keyword field. It is filled in 99.2% of cases!

Keywords have far fewer distinct values, and a keyword can be used as a feature by itself or as a word added to the text. Every single keyword in the training set also exists in the test set.
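The claim that every training keyword also appears in the test set can be checked with a simple set difference. The snippet below is a minimal sketch on made-up keyword lists (the lists and the helper name `keywords_missing_from_test` are illustrative assumptions, not part of the competition data); on the real data the same function could be fed `df_train.keyword` and `df_test.keyword`.

```python
def keywords_missing_from_test(train_keywords, test_keywords):
    """Return the training keywords that never occur in the test set."""
    # Missing values (None) are ignored on both sides
    train_set = {k for k in train_keywords if k is not None}
    test_set = {k for k in test_keywords if k is not None}
    return train_set - test_set


# Toy data, assumed for illustration only
toy_train = ["ablaze", "flood", None, "wildfire", "flood"]
toy_test = ["flood", "ablaze", "wildfire", None]

print(keywords_missing_from_test(toy_train, toy_test))
```

An empty set means that the model will not meet an unseen keyword at inference time.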

In [None]:
print(f"In train dataset Keyword is filled in {sum(df_train.keyword.notnull())} of cases from {df_train.shape[0]}, so in {100*sum(df_train.keyword.notnull())/df_train.shape[0]}%.") 
print(f"In test dataset Keyword is filled in {sum(df_test.keyword.notnull())} of cases from {df_test.shape[0]}, so in {100*sum(df_test.keyword.notnull())/df_test.shape[0]}%.")

print(f"Unique keywords found: {df_train.keyword.nunique()}.")

kw_data = df_train.groupby(by='keyword').agg({'id': "count", 'target': 'sum'}).reset_index().sort_values(by='id', ascending=False)
fig = px.bar(kw_data.head(200), x='id', y="keyword", color='target',
             title="Top Keywords", width=800, height=1000)
fig.show()

## Statistics of Text Length

Distributions of meta features in classes and datasets can be helpful to identify disaster tweets.

However, attributes related to the length of messages (the length of the text in words, in characters, and the number of unique, non-repeating words) are bounded by the maximum length of a tweet: 280 characters.

Even so, we can try to identify some patterns, for example: are disaster tweets written in a more formal way, with longer words, than non-disaster tweets?

* **word_count** number of words in text
* **unique_word_count** number of unique words in text
* **char_count** number of characters in text
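To make the three definitions concrete, here is a minimal sketch that computes them for a single string, using the same splitting rule as the notebook (split on whitespace and non-word characters). The function name `length_features` and the sample tweet are assumptions made for this illustration.

```python
import re


def length_features(text: str) -> dict:
    """Word, unique-word and non-whitespace character counts for one tweet."""
    # Split on whitespace / non-word characters and drop empty pieces
    tokens = [tok for tok in re.split(r"[\s\W]", text) if tok]
    return {
        "word_count": len(tokens),
        "unique_word_count": len(set(tokens)),
        # Characters are counted after removing all whitespace
        "char_count": len(re.sub(r"\s+", "", text)),
    }


features = length_features("fire fire near the old mill!")
print(features)
```

Note that the comparison of unique words is case-sensitive, so "Fire" and "fire" would count as two different words.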

In [None]:
for df in (df_train, df_test):
    # Raw strings keep "\s" and "\W" from being treated as (invalid) string escapes
    df['word_count'] = df.text.apply(lambda x: len([tok for tok in re.split(r"[\s\W]", x) if tok != '']))
    df['unique_word_count'] = df.text.apply(lambda x: len(set([tok for tok in re.split(r"[\s\W]", x) if tok != ''])))
    df['char_count'] = df.text.apply(lambda x: len(re.sub(r"[\s]+", "", x)))

In [None]:
fig = make_subplots(rows=3, cols=1, subplot_titles=("Number of words in text", "Number of unique words in text", 
                                                    "Number of characters in text"))

trace0 = go.Histogram(x=df_train['word_count'], name='train data', nbinsx=30)
trace1 = go.Histogram(x=df_test['word_count'], name='test data', nbinsx = 30)

trace2 = go.Histogram(x=df_train['unique_word_count'], name='train data', nbinsx = 30)
trace3 = go.Histogram(x=df_test['unique_word_count'], name='test data', nbinsx = 30)

trace4 = go.Histogram(x=df_train['char_count'], name='train data', nbinsx = 30)
trace5 = go.Histogram(x=df_test['char_count'], name='test data', nbinsx = 30)

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 2, 1)
fig.append_trace(trace3, 2, 1)
fig.append_trace(trace4, 3, 1)
fig.append_trace(trace5, 3, 1)

fig.update_layout(barmode='overlay')
fig.update_layout(
    autosize=False,
    width=800,
    height=1200,
)

fig.show()

In [None]:
fig = make_subplots(rows=3, cols=1, subplot_titles=("Number of words in text", "Number of unique words in text", 
                                                    "Number of characters in text"))

trace0 = go.Histogram(x=df_train.loc[df_train.target == 0]['word_count'], name='normal', nbinsx=30)
trace1 = go.Histogram(x=df_train.loc[df_train.target == 1]['word_count'], name='disaster', nbinsx = 30)

trace2 = go.Histogram(x=df_train.loc[df_train.target == 0]['unique_word_count'], name='normal', nbinsx = 30)
trace3 = go.Histogram(x=df_train.loc[df_train.target == 1]['unique_word_count'], name='disaster', nbinsx = 30)

trace4 = go.Histogram(x=df_train.loc[df_train.target == 0]['char_count'], name='normal', nbinsx = 30)
trace5 = go.Histogram(x=df_train.loc[df_train.target == 1]['char_count'], name='disaster', nbinsx = 30)

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 2, 1)
fig.append_trace(trace3, 2, 1)
fig.append_trace(trace4, 3, 1)
fig.append_trace(trace5, 3, 1)

fig.update_layout(barmode='overlay')
fig.update_layout(
    autosize=False,
    width=800,
    height=1000,
)

fig.show()

Sadly, these features have very similar distributions in disaster and non-disaster tweets.

## Statistical Meta Features from Texts

We can get a lot of additional features from the way the tweets are written: the number of hashtags, links to resources or users, even the number of punctuation marks in the texts.

* **url_count** number of urls in text
* **mean_word_length** average character count in words
* **punctuation_count** number of punctuations in text
* **hashtag_count** number of hashtags (#) in text
* **mention_count** number of mentions (@) in text
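As a quick illustration of these definitions, the sketch below computes all five features for one string. It is a simplified stand-alone variant: the function name `style_features` and the sample tweet are assumptions, and punctuation is counted per character here rather than per run of punctuation marks.

```python
import re
import string


def style_features(text: str) -> dict:
    """URL, word-length, punctuation, hashtag and mention statistics for one tweet."""
    tokens = [tok for tok in re.split(r"[\s\W]", text) if tok]
    return {
        "url_count": len(re.findall(r"https?://\S+", text)),
        # Guard against texts without any word characters
        "mean_word_length": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
        # Every single punctuation character is counted
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "hashtag_count": len(re.findall(r"#\w+", text)),
        "mention_count": len(re.findall(r"@\w+", text)),
    }


sample = style_features("@user Flood in #Texas! http://t.co/abc")
print(sample)
```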

In [None]:
for df in (df_train, df_test):
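    # "https?://" matches both schemes; the original "http|https" alternation never reached its second branch
    df['url_count'] = df.text.apply(lambda x: len(re.findall(r"https?://", x)))
    df['mean_word_length'] = df.text.apply(lambda x: np.mean([len(tok) for tok in re.split(r"[\s\W]", x) if tok != '']))
    # re.escape keeps characters such as "\", "]" and "-" literal inside the class; runs of punctuation count once
    df['punctuation_count'] = df.text.apply(lambda x: len(re.findall("[" + re.escape(string.punctuation) + "]+", x)))
    df['hashtag_count'] = df.text.apply(lambda x: len(re.findall(r"#[\w]+", x)))
    df['mention_count'] = df.text.apply(lambda x: len(re.findall(r"@[\w_]+", x)))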

In [None]:
fig = make_subplots(rows=3, cols=2, subplot_titles=("Number of urls", "Average character count in words", 
                                                    "Number of punctuations", "Number of hashtags",
                                                   "Number of mentions"))

trace0 = go.Histogram(x=df_train['url_count'], name='train data', nbinsx=6)
trace1 = go.Histogram(x=df_test['url_count'], name='test data', nbinsx = 6)

trace2 = go.Histogram(x=df_train['mean_word_length'], name='train data', nbinsx = 30)
trace3 = go.Histogram(x=df_test['mean_word_length'], name='test data', nbinsx = 30)

trace4 = go.Histogram(x=df_train['punctuation_count'], name='train data', nbinsx = 30)
trace5 = go.Histogram(x=df_test['punctuation_count'], name='test data', nbinsx = 30)

trace6 = go.Histogram(x=df_train['hashtag_count'], name='train data', nbinsx = 10)
trace7 = go.Histogram(x=df_test['hashtag_count'], name='test data', nbinsx = 10)

trace8 = go.Histogram(x=df_train['mention_count'], name='train data', nbinsx = 10)
trace9 = go.Histogram(x=df_test['mention_count'], name='test data', nbinsx = 10)

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 2, 1)
fig.append_trace(trace5, 2, 1)
fig.append_trace(trace6, 2, 2)
fig.append_trace(trace7, 2, 2)
fig.append_trace(trace8, 3, 1)
fig.append_trace(trace9, 3, 1)

fig.update_layout(barmode='overlay')
fig.update_layout(
    autosize=False,
    width=800,
    height=1000,
)

fig.show()

Here again, the features have very similar distributions; note that the plot above compares the training and test sets rather than disaster and non-disaster tweets.

# 3. Text Preprocessing

As we know, tweets require lots of cleaning: people often use emoticons and punctuation marks to express their emotions, make typos in a hurry (which can greatly affect the vocabulary), and sometimes the Twitter parser leaves HTML tags in the text.

The training dataset is too large to scan by eye for the symbols and signs we are looking for, 
so let's use some regular expression magic for this task 😉.

## 3.1 Exploration of common patterns in tweets

### Special symbols

In [None]:
# Search for types of special characters
def spec_symbols_searcher(text: str) -> bool:
    """
    Based on idea of: https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert
    """
    spec_symb_pattern = re.compile(r"\x89[\w]+")
    return bool(spec_symb_pattern.search(text))
    
def spec_chars_searcher(text: str) -> bool:
    # Inside a character class "|" is a literal pipe, so it is not used as a separator here
    spec_chars_pattern = re.compile(r"[åÛÒª¢ÊÌ¨©]+")
    return bool(spec_chars_pattern.search(text))
    
def currency_searcher(text: str) -> bool:
    currency_pattern = re.compile(r"[Ç£$€¥￥₴₽¢¤]+")
    return bool(currency_pattern.search(text))

def xml_chars_searcher(text: str) -> bool:
    xml_pattern = re.compile(r"&quot;|&gt;|&lt;|&amp;|&apos;")
    return bool(xml_pattern.search(text))
    

print("Example contains special symbols: ", spec_symbols_searcher("\x89ÛÒ Two cars set ablaze: SANTA CRUZ \x89ÛÓ"))
print("Example contains special characters: ", spec_chars_searcher("don\x89Ûªt"))
print("Example contains currency characters: ", currency_searcher("£3 million"))
# The XML example must be checked with the XML searcher, not the currency one
print("Example contains entities in XML: ", xml_chars_searcher("&quot;New York Times&quot;"))
print("\n")

print("Number of tweets containing special symbols: ", df_train['text'].apply(spec_symbols_searcher).sum())
print("Number of tweets containing special characters: ", df_train['text'].apply(spec_chars_searcher).sum())
print("Number of tweets containing currency characters: ", df_train['text'].apply(currency_searcher).sum())
print("Number of tweets containing entities in XML: ", df_train['text'].apply(xml_chars_searcher).sum())

### Emojis - good, but not in NLP 😔

In [None]:
# Search for types of emojies

def emoji_searcher(text: str) -> bool:
    """
    Based on: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
    """
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    
    return True if len(emoji_pattern.findall(text)) > 0 else False

print("Example contains emojis: ", emoji_searcher("Omg another Earthquake 😔😔"))

print("Number of tweets contains emojies: ", df_train.loc[df_train['text'].apply(lambda x: emoji_searcher(x))].shape[0])

Good news! Tweets are clean of emoticons and similar symbols. There is no need to clean them.

### Date&Time - can it help?

Here we can see that the whole dataset contains only about 50 date or time stamps. That is not enough to use them as a feature; in addition, we do not know the date or time of the disasters, so there is no time basis to compare against.

In [None]:
def datetime_searcher(text: str) -> bool:
    date_pattern = re.compile(r"[\d]+[\\./][\d]+[\\./][\d]+")
    time_pattern_partly = re.compile(r"([\d]+[\.:][\d]+)( pm| am)")
    time_pattern_full = re.compile(r"[\d]+[\.:][\d]+[\.:][\d]+")
    return True if ((len(date_pattern.findall(text)) > 0) or 
                    (len(time_pattern_partly.findall(text)) > 0) or 
                    (len(time_pattern_full.findall(text)) > 0)) else False

print("Example contains datetime: ", datetime_searcher('Traffic accident N CABRILLO HWY/MAGELLAN AV MIR (08/06/15 11:03:58)'))

print("Number of tweets contains datetime: ", df_train.loc[df_train['text'].apply(lambda x: datetime_searcher(x))].shape[0])

### Numbers - are they helpful?

In [None]:
def numbers_searcher(text: str) -> List[str]:
    numbers_pattern = re.compile(r"[\d]+[,\.:;]?[\d]+")
    return numbers_pattern.findall(text)

print("Example contains any numbers: ", numbers_searcher("13,000 people receive #wildfires evacuation orders in california"))

print("Number of tweets contains any numbers: ", df_train.loc[df_train['text'].apply(lambda x: True if len(numbers_searcher(x)) 
                                                              else False)].shape[0])

# Look at some of them
df_train.loc[df_train['text'].apply(lambda x: True if len(numbers_searcher(x)) else False)].text.head(10)                                                      

### Typos - real evil for language models

For this, we use the Python spell checker from **TextBlob**.

It uses a Levenshtein distance algorithm to find candidate words within an edit distance of 2 from the original word. It then compares all candidates (produced by insertions, deletions, replacements, and transpositions) to known words in a word frequency list. Words that occur more often in the frequency list are more likely to be the correct result.
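The idea described above can be sketched in a few lines, in the spirit of Peter Norvig's well-known spelling corrector. This is an illustration of the technique only, not TextBlob's actual implementation; the tiny `WORD_FREQ` dictionary and the function names are assumptions made for the example.

```python
import string


def edits1(word: str) -> set:
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, word_freq: dict) -> str:
    """Pick the most frequent known word within edit distance 2, else keep the word."""
    if word in word_freq:
        return word
    candidates = [w for w in edits1(word) if w in word_freq]
    if not candidates:
        # Fall back to candidates two edits away
        candidates = [w2 for w1 in edits1(word) for w2 in edits1(w1) if w2 in word_freq]
    return max(candidates, key=word_freq.get) if candidates else word


# Toy frequency list, assumed for illustration only
WORD_FREQ = {"have": 120, "hive": 3, "good": 90, "spelling": 10}

print([correct(w, WORD_FREQ) for w in ["havv", "goood", "speling"]])
```

The number of candidates grows quickly with word length, which is one reason why running a spell checker over the whole dataset is slow.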

In [None]:
!pip install textblob

In [None]:
from textblob import TextBlob

def spell_checker(text: str) -> bool:
    """Return True if TextBlob's corrected text differs from the input."""
    corrected = str(TextBlob(text).correct())
    return corrected != text

print("Example contains misspellings: ", spell_checker("I havv goood speling!"))

# Too long!
# print("Number of tweets contains mispellings: ", df_train.loc[df_train['text'].apply(lambda x: spell_checker(x))].shape[0])

## 3.2 Clean tweets

Let's finally apply the patterns and functions created above to the datasets.

In [None]:
class Tweeter_cleaner:
    """
    Class for cleaning the Twitter dataset. 
    Applies a chain of regexp filters to remove punctuation, special symbols, etc.
    """
    
    def __init__(self, make_lower: bool):
        self._filters_chain = self.compile_patterns()
        self._make_lower = make_lower
    
    def compile_patterns(self):
        return [re.compile(r'<.*?>'),  # html tags
                re.compile(r"\.\.\."), # dots at the end of tweet
                re.compile(r'https?://\S+|www\.\S+'), # urls
                re.compile(r"\x89[\w]+"),  # special symbols
                re.compile(r"[Ç£$€¥￥₴₽¢¤]+"), # currency
                re.compile(r"[|åÛÒª¢ÊÌ¨©]+"),  # special chars (plus the literal pipe the original class also stripped)
                re.compile(r"&quot;|&gt;|&lt;|&amp;|&apos;"), # xml tags
                re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE),
                re.compile(r"[\d]+[\\./][\d]+[\\./][\d]+"), # date
                re.compile(r"([\d]+[\.:][\d]+)( pm| am)"), # time, short variant
                re.compile(r"[\d]+[\.:][\d]+[\.:][\d]+"),  # time, long variant
                re.compile(r"[!?+*\[\]\-\.&%/()$={}^'`:;<>+\.]+"),  # punctuations, but without cleaning: @, # and _
               ]
        
        
    def clean_tweet(self, text: str) -> str:
        # The original asserted on a non-empty tuple, which is always true
        assert isinstance(text, str), "Text to clean should be string!"
        
        # Make lower case
        if self._make_lower:
            text = text.lower()
        
        # Apply filters
        for f in self._filters_chain:
            text = f.sub(" ", text)
        
        # Join numbers separated by commas (e.g. 13,000 -> 13000)
        text = re.sub(r",", "", text)
            
        # Clean double whitespaces
        text = re.sub(r"\s+", ' ', text)
            
        return text

In [None]:
cleaner = Tweeter_cleaner(make_lower=True)
df_train['clean_text'] = df_train['text'].apply(lambda x: cleaner.clean_tweet(x))
df_test['clean_text'] = df_test['text'].apply(lambda x: cleaner.clean_tweet(x))

df_train[['text', 'clean_text']].head(10)

### Add Keyword to text

In [None]:
df_train['clean_text'] = df_train['keyword'].fillna("no_keyword") + ' ' + df_train['clean_text']
df_test['clean_text'] = df_test['keyword'].fillna("no_keyword") + ' ' + df_test['clean_text']

# 4. Implementing RoBERTa with HuggingFace 🤗 Transformers

## RoBERTa

Based on [RoBERTa: A Robustly Optimized BERT Pretraining Approach, Yinhan Liu et al.](https://arxiv.org/pdf/1907.11692.pdf)
It is based on Google's BERT model released in 2018.
It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

The abstract from the paper is the following:

> Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

**Tips from pre-training RoBERTa:**
* dynamic masking: tokens are masked differently at each epoch whereas BERT does it once and for all
* no NSP (next sentence prediction) loss; instead of putting just two sentences together, a chunk of contiguous text is packed to reach 512 tokens (so the sentences are in an order that may span several documents)
* train with larger batches
* use BPE with bytes as a subunit and not characters (because of unicode characters)
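The first tip, dynamic masking, is easy to picture with a toy example: the mask is re-sampled every time a sequence is served, so each epoch sees a different corrupted copy of the same text. The sketch below is a simplification and rests on stated assumptions: it always substitutes the `<mask>` token, whereas the BERT-style recipe also leaves some selected tokens unchanged or replaces them with random tokens, and the function name and token list are made up for the illustration.

```python
import random

MASK = "<mask>"


def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Return a freshly masked copy of tokens; call it again for every epoch."""
    rng = rng if rng is not None else random
    # Each position is masked independently with probability mask_prob
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


tokens = ["13000", "people", "receive", "wildfires", "evacuation", "orders"]
for epoch in range(3):
    # A different generator state per epoch yields a different mask pattern
    print(epoch, dynamic_mask(tokens, mask_prob=0.3, rng=random.Random(epoch)))
```

With static masking, by contrast, the masked copy would be computed once during preprocessing and reused in every epoch.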

In [None]:
# torch utils
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler

# transformers
import transformers
from transformers import RobertaForSequenceClassification, RobertaTokenizer, RobertaConfig
from transformers import AutoTokenizer, AutoModelWithLMHead
from transformers import AdamW, get_linear_schedule_with_warmup

import tokenizers
from tokenizers.processors import RobertaProcessing

When we use a pre-trained model, our data needs to be pre-processed and presented in the same way as the data the model was trained on. In transformers, each model architecture is associated with 3 main types of classes:

* A **model** class to load/store a particular pre-trained model.
* A **tokenizer** class to pre-process the data and make it compatible with a particular model.
* A **configuration** class to load/store the configuration of a particular model.

For the RoBERTa architecture, we use **RobertaForSequenceClassification** for the model class, **RobertaTokenizer** for the tokenizer class, and **RobertaConfig** for the configuration class. 

## 4.1 Pretrained RoBERTa models

Hugging Face has a large library of pre-trained models for different languages, tasks and architectures. Here is the list of the pretrained RoBERTa models provided at the time of writing, together with a short description of each:

* **roberta-base** - 12-layer, 768-hidden, 12-heads, 125M parameters RoBERTa using the BERT-base architecture
* **roberta-large** - 24-layer, 1024-hidden, 16-heads, 355M parameters RoBERTa using the BERT-large architecture
* **roberta-large-mnli** - 24-layer, 1024-hidden, 16-heads, 355M parameters roberta-large fine-tuned on MNLI.
* **distilroberta-base** - 6-layer, 768-hidden, 12-heads, 82M parameters The DistilRoBERTa model distilled from the RoBERTa model roberta-base checkpoint.
* **roberta-base-openai-detector** - 12-layer, 768-hidden, 12-heads, 125M parameters roberta-base fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model.
* **roberta-large-openai-detector** - 24-layer, 1024-hidden, 16-heads, 355M parameters roberta-large fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model.

To classify tweets, the simplest and most obvious option is to use the basic version of pre-trained RoBERTa: **roberta-base**.

### Load pre-trained model

In [None]:
tokenizer = tokenizers.ByteLevelBPETokenizer(
            vocab_file=str(MODEL_DIRECTORY / 'vocab.json'), 
            merges_file=str(MODEL_DIRECTORY / 'merges.txt'), 
            lowercase=True, add_prefix_space=True)

config = RobertaConfig.from_pretrained(str(MODEL_DIRECTORY / 'config.json'), output_hidden_states=True) 
               
roberta = RobertaForSequenceClassification.from_pretrained(str(MODEL_DIRECTORY / 'pytorch_model.bin'), config=config)    

print(roberta)

## 4.2 Input data
RoBERTa has an imposing vocabulary of about 50,000 tokens, so the simple language of tweets should rarely produce unknown words; any that do appear are handled by a special tokenization technique, the **byte version of Byte-Pair Encoding (BPE)**.

As input, RoBERTa takes tokenized text. During pre-training the model received pieces of 512 contiguous tokens, obtained with BPE, that may span several documents.

In this case, because tweets are very short (if you look at the graph above, no more than 15-20 words), we can try to **make the input sequence shorter** and at the same time **increase the batch size** even more, which is one of the things that distinguishes RoBERTa from BERT.

### Research byte version of Byte-Pair Encoder (BPE)

This implementation of a BPE tokenizer consists of the following pipeline of processes, each applying different transformations to the textual information:
<img src=https://miro.medium.com/max/875/1*7uy9X3eE1rVmqV08yKrDgg.png width="500">

The **Normalizer** first normalizes the text, the result of which is fed into the **PreTokenizer** which is in charge of applying simple tokenization by splitting the text into its different word elements using whitespaces. 

The **Model** corresponds to the actual algorithm, such as BPE, WordPiece or SentencePiece, that performs the tokenization itself. 

The **PostProcessing** then takes care of incorporating any additional useful information that needs to be added to the final **output Encoding**, which is then ready to be used and fed into, say, a language model for training.

*Read more in small but informative article [Hugging Face Introduces Tokenizers](https://medium.com/dair-ai/hugging-face-introduces-tokenizers-d792482db360) by Elvis on Medium.*
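To get a feeling for what the **Model** step does in the byte-level case, here is a toy sketch of how BPE merges can be learned: every word starts as a sequence of single bytes, and the most frequent adjacent pair is merged repeatedly. It illustrates the general algorithm only, not the Hugging Face implementation, and the function name and the tiny corpus are assumptions made for the example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges byte-pair merges from a list of words."""
    # Every word starts as a tuple of single-byte symbols
    corpus = Counter(tuple(bytes([b]) for b in w.encode("utf-8")) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe_merges(["fire", "fire", "fire", "fired"], num_merges=3)
print(merges)
print(dict(corpus))
```

Because the base symbols are bytes, any Unicode character can be represented (a character such as "é" simply starts as two byte symbols), which is why a byte-level vocabulary needs no unknown token for unseen characters.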

## 4.3 Output of byte version of Byte-Pair Encoder (BPE)

As mentioned before, the BPE tokenizer outputs an Encoding object, which ....


Also, if truncation or padding is needed, *enable_truncation* and *enable_padding* should be called on the tokenizer before calling the encoding methods. 

The output of the tokenizer should have the following pattern:

> roberta: `<s>` + prefix_space + tokens + `</s>` + padding `<pad>`

**`<s>`, `</s>`, `<pad>` and `<unk>`** are **special tokens** that are stored in the tokenizer vocabulary at the following positions:

* `<s>` - 0
* `</s>` - 2
* `<pad>` - 1
* `<unk>` - 3

Then we need to adjust the offsets to account for the special tokens added at the beginning and end of the sentence by giving them zero offsets: (0,0).
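The pattern described above (start token, tokens, end token, padding, with zero offsets for every special token) can be written out by hand as a small sketch. The special-token ids 0, 1, 2 and 3 are the positions listed above; the function name, the content token ids and the offsets in the example are made up for illustration.

```python
BOS_ID, PAD_ID, EOS_ID, UNK_ID = 0, 1, 2, 3  # <s>, <pad>, </s>, <unk>


def build_roberta_input(token_ids, offsets, max_length):
    """Wrap token ids in <s> ... </s>, then truncate and right-pad to max_length."""
    # Leave room for the two special tokens when truncating
    token_ids = token_ids[: max_length - 2]
    offsets = offsets[: max_length - 2]

    ids = [BOS_ID] + token_ids + [EOS_ID]
    # Special tokens do not correspond to any characters, hence (0, 0)
    offs = [(0, 0)] + offsets + [(0, 0)]
    attention_mask = [1] * len(ids)

    n_pad = max_length - len(ids)
    ids += [PAD_ID] * n_pad
    offs += [(0, 0)] * n_pad
    attention_mask += [0] * n_pad  # padding positions are ignored by attention
    return ids, offs, attention_mask


# Hypothetical token ids and character offsets for a three-token text
ids, offs, mask = build_roberta_input([100, 200, 300], [(0, 4), (5, 9), (10, 15)], max_length=8)
print(ids)
print(offs)
print(mask)
```

The tokenizer performs the equivalent steps itself once the post-processor, truncation and padding are configured, as in the next cell.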

In [None]:
# Change Post-Processor to RoBERTa type
tokenizer._tokenizer.post_processor = RobertaProcessing(
            sep=("</s>", tokenizer.token_to_id("</s>")),
            cls=("<s>", tokenizer.token_to_id("<s>")),
            add_prefix_space=True,
            trim_offsets=True,
        )

example_seq_length = 25
tokenizer.enable_truncation(max_length=example_seq_length)
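# Arguments: direction: str, pad_id: int, pad_type_id: int, pad_token: str, length: int
# (passed by keyword, since recent tokenizers releases no longer accept them positionally)
tokenizer.enable_padding(direction="right", pad_id=1, pad_type_id=0, pad_token="<pad>", length=example_seq_length)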

# Look at example of first text in dataset
seq_tokenized = tokenizer.encode(df_train['clean_text'].values[0], add_special_tokens=True)

print("Example of ids: ", seq_tokenized.ids, '\n')
print("Example of type ids: ", seq_tokenized.type_ids, '\n')
print("Example of tokens: ", seq_tokenized.tokens, '\n')
print("Example of offsets: ", seq_tokenized.offsets, '\n')
print("Example of their attention masks: ", seq_tokenized.attention_mask, '\n')
print("Example of special tokens mask: ", seq_tokenized.special_tokens_mask, '\n')
print("Example of overflowing: ", seq_tokenized.overflowing, '\n')

### Analyse BPE encoding text length

To form the input, we first need to find out the average length of a text in the BPE encoding, since it can differ greatly from the word-by-word tokenization performed earlier.

In [None]:
# Disable truncation and padding to get real BPE length of each text
tokenizer.no_truncation()
tokenizer.no_padding()

tokenized_train = df_train['clean_text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
# The test texts must come from df_test (the original re-used df_train here)
tokenized_test = df_test['clean_text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
tokenized = pd.concat([tokenized_train, tokenized_test], ignore_index=True)
len_tokenized = tokenized.apply(lambda x: len(x))

print(f"Tokenized texts: {len(tokenized)}")

fig = px.histogram([len(t.ids) for t in tokenized], nbins=30)
fig.show()

print("Maximum length: ", np.max([len(t.ids) for t in tokenized]))
print("Minimum length: ", np.min([len(t.ids) for t in tokenized]))

As expected, the tokenizer split words into subword tokens, which increased the average length of the texts by almost one and a half times.

Still, these texts are extremely short compared to the 512-token maximum sequence length on which the model was pre-trained. This may reduce quality in the classification task, since RoBERTa may struggle to pick up the linguistic structure of such short sequences.
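
One practical use of this length distribution is choosing the maximum sequence length for padding and truncation. A minimal sketch, assuming a list of per-text token counts: the numbers below are made up, and `coverage` is a hypothetical helper that is not part of this notebook.

```python
from typing import List

def coverage(lengths: List[int], max_length: int) -> float:
    """Share of texts that fit into max_length tokens without truncation."""
    return sum(1 for n in lengths if n <= max_length) / len(lengths)

# Made-up BPE lengths; in the notebook they would come from len(t.ids)
lengths = [12, 18, 25, 31, 40, 22, 55, 17, 29, 34]

for candidate in (25, 40, 55):
    print(candidate, coverage(lengths, candidate))
```

A limit that covers nearly all texts keeps truncation rare, while a smaller limit saves memory at the cost of cutting some tweets.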

## 4.4 Create DataLoader

To use the RoBERTa tokenizer, we will create a function **tweeter_dataloader_pipeline** that takes a tokenizer, builds a Dataset and returns a DataLoader.
It tokenizes and encodes the tweets, organizes them into a dataset and feeds them to the model in batches.



In [None]:
def tweeter_dataloader_pipeline(df: pd.DataFrame, tokenizer, labeled: bool,
                                features_col_name: str, label_col_name: Optional[str],
                                make_padding: bool, make_truncation: bool,
                                max_text_length: int, batch_size: int):
    """
    Create dataloader object from given dataset, labeled or not.
    param: df - input dataframe (labeled or not);
    param: tokenizer - tokenizer of model;
    param: labeled - True if data is labeled, if False - unlabeled inference case,
        DataLoader object then does not contain labels either;
    param: features_col_name - column in df, where texts are stored;
    param: label_col_name - column in df, where labels are stored; 
    param: make_padding - True if need to pad short sequences to max_text_length, 
        False otherwise;
    param: make_truncation - True if need to truncate long sequences to max_text_length, 
        False otherwise;
    param: max_text_length - maximum length of tokens sequence;
    param: batch_size - number of samples that will be stored on GPU simultaneously;
    return: DataLoader object.
    """
    texts = df[features_col_name].tolist()
    
    if labeled:
        labels = df[label_col_name].tolist()
        labels = torch.tensor(labels)
    
    if make_truncation:
        tokenizer.enable_truncation(max_length=max_text_length)
    if make_padding:
        tokenizer.enable_padding(direction="right", pad_id=1, pad_type_id=0,
                                 pad_token="<pad>", length=max_text_length)
    
    texts_encoded = [tokenizer.encode(text, add_special_tokens=True) for text in tqdm(texts)]
    ids = torch.LongTensor([enc.ids for enc in texts_encoded])
    type_ids = torch.LongTensor([enc.type_ids for enc in texts_encoded])
    attention_masks = torch.LongTensor([enc.attention_mask for enc in texts_encoded])
    # Offsets are (start, end) character spans, so this tensor has shape [N, length, 2]
    offsets = torch.LongTensor([enc.offsets for enc in texts_encoded])
    
    # Create the DataSets instances
    if labeled:
        ds = TensorDataset(ids, type_ids, attention_masks, offsets, labels)
    else:
        ds = TensorDataset(ids, type_ids, attention_masks, offsets)
    
    sampler = SequentialSampler(ds)

    return DataLoader(ds, sampler=sampler, batch_size=batch_size)
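
What padding and truncation do to a single sequence can be sketched in plain Python. This is an illustration under assumptions, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the ids 0 and 2 for `<s>` and `</s>` are assumed here. Only the pad id 1 is taken from the `enable_padding` call above.

```python
from typing import List, Tuple

def pad_or_truncate(body_ids: List[int], max_length: int,
                    bos_id: int = 0, eos_id: int = 2,
                    pad_id: int = 1) -> Tuple[List[int], List[int]]:
    """Wrap ids in <s> ... </s>, then cut or right-pad to exactly max_length."""
    body = body_ids[:max_length - 2]  # leave room for the two special tokens
    ids = [bos_id] + body + [eos_id]
    attention_mask = [1] * len(ids)   # 1 marks real tokens
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad  # 0 marks padding

short_ids, short_mask = pad_or_truncate([100, 101, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(100, 120)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every sequence ends up with the same length, the encoded texts can be stacked into a single tensor, and the attention mask tells the model which positions to ignore.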

Before creating the dataloaders, the labeled dataset needs to be split into training and validation subsets. The validation subset is much smaller, but it is needed to track the value of the loss function during training and to decide when to stop early.

In [None]:
data_train, data_val = train_test_split(df_train, train_size=0.8, random_state=11, 
                                        shuffle=True, stratify=df_train.target.values)

print('Length train data:', len(data_train))
print('Length validation data:', len(data_val))

In [None]:
train_dataloader = tweeter_dataloader_pipeline(data_train, tokenizer, labeled=True, 
                                               features_col_name='clean_text', label_col_name='target',
                                               make_padding=True, make_truncation=True,
                                               max_text_length=55, batch_size=128)

val_dataloader = tweeter_dataloader_pipeline(data_val, tokenizer, labeled=True, 
                                             features_col_name='clean_text', label_col_name='target',
                                             make_padding=True, make_truncation=True,
                                             max_text_length=55, batch_size=128)

test_dataloader = tweeter_dataloader_pipeline(df_test, tokenizer, labeled=False, 
                                             features_col_name='clean_text', label_col_name='',
                                             make_padding=True, make_truncation=True,
                                             max_text_length=55, batch_size=128)

# 5. Training RoBERTa Model

## 5.1 Define Training helper functions

In [None]:
nvidia_smi.nvmlInit()

def get_gpu_memory() -> None:
    
    # card id 0 hardcoded here, there is also a call to get all available card ids, so we could iterate
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)

    print(f"\nTotal GPU memory: {info.total}")
    print(f"Free GPU memory: {info.free}")
    print(f"Used GPU memory: {info.used}")
    # nvidia_smi.nvmlShutdown()

def get_gpu_free_memory() -> None:
    
    # card id 0 hardcoded here, there is also a call to get all available card ids, so we could iterate
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)

    # NVML reports memory in bytes; 1 MB = 1024 ** 2 bytes
    print(f"Free GPU memory: {info.free / 1024 ** 2} MB.")

In [None]:
def clean_GPU_memory() -> None:
    torch.cuda.empty_cache()
    get_gpu_memory()
    
get_gpu_memory()
get_gpu_free_memory()
# Initial free memory size: 16270.875 MB = 17061249024 bytes

In [None]:
def copy_data_to_device(data, device):
    if torch.is_tensor(data):
        return data.to(device)
    elif isinstance(data, (list, tuple)):
        return [copy_data_to_device(elem, device) for elem in data]
    raise ValueError('Invalid data type {}'.format(type(data)))

In [None]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [None]:
# Service function
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [None]:
def save_model(model, dir: str):
    """
    Save the trained model and its configuration to the given directory;
    they can then be reloaded using `from_pretrained()` if using default names.
    """
    print("Saving model to {0}...".format(dir))
    
    # Take care of distributed/parallel training
    model_to_save = model.module if hasattr(model, 'module') else model 
    model_to_save.save_pretrained(dir)
    print("Model successfully saved.")
    
    
def load(filename: str, has_info: bool):
    model = torch.load(filename)
    if has_info:
        info = torch.load('model.info')
        pprint(info)
    return model

In [None]:
def train_eval_loop(model, dataloaders_dict, device=None,
                    epoch_n=10, lr=1e-5, optim_eps=1e-8, 
                    num_warmup_steps=0, criterion=None,
                    optimizer=None, lr_scheduler=None,
                    model_dir_to_save=".",  model_filename="model.pt"
                    ):
    """
    Loop for training the model. After each epoch, 
    the quality of the model is evaluated by a validation set.
    :param model: torch.nn.Module - model to learn;
    :param dataloaders_dict: dictionary of torch.utils.data.DataLoaders with keys 'train' and 'val';
    :param device: cuda/cpu - device to perform calculations on;
    :param lr: learning rate;
    :param epoch_n: maximum number of epochs;
    :param optim_eps: small constant added to the denominator in Adam for numerical stability;
    :param num_warmup_steps: the number of steps for the warmup phase;
    :return: tuple of two parts:
        - list of per-epoch statistics (losses, validation accuracy, timings);
        - the model after the last completed epoch;
    """
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    device = torch.device(device)
    model.to(device)
    print("Moved model to device.")
    get_gpu_memory()
    
    # Unpack dataloaders
    train_dataloader = dataloaders_dict.get('train', None)
    val_dataloader = dataloaders_dict.get('val', None)
    
    # Set hyperparameters
    if optimizer is None:
        optimizer = AdamW(model.parameters(), lr=lr, eps=optim_eps)
                                                                            
    else:
        optimizer = optimizer(model.parameters(), lr=lr, eps=optim_eps)

    # Total number of training steps is [number of batches] x [number of epochs]. 
    num_training_steps = len(train_dataloader) * epoch_n
    if lr_scheduler is not None:
        lr_scheduler = lr_scheduler(optimizer, 
                                         num_warmup_steps=num_warmup_steps,
                                         num_training_steps=num_training_steps)
    else:
        lr_scheduler = None
        
    # Storing training and validation loss, validation accuracy, and timings.
    training_stats = []

    # Measure the total training time for the whole run.
    total_train_time = time.time()

    for epoch_i in range(epoch_n):
        try:
            # Measure how long the training epoch takes.
            epoch_start = time.time() 
            print("")
            print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epoch_n))
            
            # ========================================
            #               Training
            # ========================================

            model.train()
            mean_train_loss = 0
            train_batches_n = 0

            # Unpacking this training batch from our dataloader. 
            for batch_i, (b_ids, b_type_ids, b_attn_mask, b_offsets, b_labels) in enumerate(train_dataloader):
                
                # Printing progress update every 50 batches.
                if (batch_i % 50 == 0) and not (batch_i == 0):
                    elapsed = format_time(time.time() - epoch_start)
                    # Monitor memory usage
                    get_gpu_free_memory()
                    print("\tBatch {:>5,}  of  {:>5,}.    Elapsed: {:}.".format(batch_i, len(train_dataloader), elapsed))

                b_ids = copy_data_to_device(b_ids, device)
                b_attn_mask = copy_data_to_device(b_attn_mask, device)
                b_labels = copy_data_to_device(b_labels, device)

                # Clearing any previously calculated gradients before performing a backward pass. 
                model.zero_grad()

                # Perform a forward pass - evaluate the model on this training batch.
                # "logits" are the raw (pre-softmax) class scores from the classification head
                outputs = model(b_ids, token_type_ids=None, attention_mask=b_attn_mask, labels=b_labels)

                (loss, logits) = outputs[:2]
                # Perform a backward pass to calculate the gradients.
                loss.backward()

                # Clip the norm of the gradients to 1.0. to help prevent the "exploding gradients" problem.
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

                # Update parameters with optimizer's learning rate and take a step using the computed gradient.
                optimizer.step()
                
                # Update the learning rate (only if a scheduler was provided).
                if lr_scheduler is not None:
                    lr_scheduler.step()

                # Accumulating the training loss over all of the batches so that we can calculate the average loss at the end. 
                mean_train_loss += loss.item()
                train_batches_n += 1

            mean_train_loss /= train_batches_n
            epoch_train_time = (time.time() - epoch_start)
            print('\nEpoch: {} iters, {:0.2f} sec'.format(train_batches_n,
                                                        epoch_train_time))
            print('Mean value of loss function during training: ', mean_train_loss)

            # ========================================
            #               Validation
            # ========================================
            valid_start = time.time() 

            # Put the model in evaluation mode
            model.eval()

            mean_val_loss = 0
            val_accuracy = 0
            val_batches_n = 0
            with torch.no_grad():
                for batch_i, (b_ids, b_type_ids, b_attn_mask, b_offsets, b_labels) in enumerate(val_dataloader):

                    b_ids = copy_data_to_device(b_ids, device)
                    b_attn_mask = copy_data_to_device(b_attn_mask, device)
                    b_labels = copy_data_to_device(b_labels, device)

                    # Forward pass, calculate logit predictions.
                    # Get the "logits" output by the model.
                    outputs  = model(b_ids, token_type_ids=None,
                                            attention_mask=b_attn_mask,
                                            labels=b_labels)
                    (loss, logits) = outputs[:2]

                    mean_val_loss += loss.item()
                    val_batches_n += 1
                    
                    # Move logits and labels to CPU
                    logits = copy_data_to_device(logits.detach(), 'cpu').numpy()
                    label_ids = copy_data_to_device(b_labels, 'cpu').numpy()

                    # Calculate the accuracy for this validation batch and accumulate it over all batches.
                    val_accuracy += flat_accuracy(logits, label_ids)

            # Measure how long the validation run took.
            validation_time = format_time(time.time() - valid_start)

            mean_val_loss /= val_batches_n
            print("Mean value of loss function during validation: {0:.2f}".format(mean_val_loss))

            mean_val_accuracy = val_accuracy / val_batches_n
            print("Mean validation accuracy: {0:.2f}".format(mean_val_accuracy))

            # Record all statistics from this epoch.
            training_stats.append(
                {
                    'epoch': epoch_i + 1,
                    'Training Loss': mean_train_loss,
                    'Valid. Loss': mean_val_loss,
                    'Valid. Accur.': mean_val_accuracy,
                    'Training Time': epoch_train_time,
                    'Validation Time': validation_time
                }
            )
        except KeyboardInterrupt:
            print('Early stopped by the user.')
            break
        except Exception as ex:
            print('Error while training: {}\n{}'.format(ex, traceback.format_exc()))
            break
            
    print('\n', "Training complete")
    print("Total training took {:} (h:mm:ss)".format(format_time(time.time() - total_train_time)))

    # Monitor memory usage
    get_gpu_free_memory()

    # Saving model
    save_model(model, model_dir_to_save) 
    clean_GPU_memory()
    return training_stats, model
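
One caveat about the validation accuracy reported by this loop: it is the mean of per-batch accuracies, so a smaller final batch gets the same weight as a full one. The difference is usually tiny, but a small worked example with made-up counts shows the effect:

```python
# (correct predictions, batch size) for three hypothetical validation batches;
# the last batch is smaller, as happens when len(dataset) % batch_size != 0
batches = [(100, 128), (96, 128), (1, 4)]

mean_of_batch_accuracies = sum(c / n for c, n in batches) / len(batches)
overall_accuracy = sum(c for c, _ in batches) / sum(n for _, n in batches)

print(round(mean_of_batch_accuracies, 4))
print(round(overall_accuracy, 4))
```

Accumulating the number of correct predictions and dividing by the total number of samples would avoid this bias.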

## 5.3 Train model

In [None]:
training_stats, model = train_eval_loop(roberta, {"train": train_dataloader, "val": val_dataloader}, 
                                        device=device, epoch_n=8,
                                        lr=1e-5, optim_eps=1e-5,
                                        optimizer=AdamW, 
                                        lr_scheduler=get_linear_schedule_with_warmup,
                                        model_dir_to_save=str(OUTPUT_DATA_DIRECTORY))
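
The shape of the learning rate schedule used here can be sketched in plain Python. This is a hypothetical re-implementation for illustration, based on my understanding of how a linear schedule with warmup behaves (linear growth during warmup, then linear decay to zero); `linear_schedule_factor` is not a library function, and the number of batches per epoch is made up.

```python
def linear_schedule_factor(step: int, num_warmup_steps: int,
                           num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 8 * 48  # 8 epochs times a hypothetical 48 batches per epoch
for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With `num_warmup_steps=0`, as in the call above, the learning rate starts at its full value and decays linearly to zero by the last step.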

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 11), dpi=60)
plt.tight_layout()

# Create a DataFrame from our training statistics and use the 'epoch' as the row index.
df_stats = pd.DataFrame(data=training_stats).set_index('epoch')

axes[0].plot(df_stats['Training Loss'], 'b-o', label="Training")
axes[0].plot(df_stats['Valid. Loss'], 'g-o', label="Validation")
axes[0].set_title("Loss")
axes[0].legend()

axes[1].plot(df_stats['Valid. Accur.'], 'g-o', label="Validation")
axes[1].set_title("Accuracy")
axes[1].legend()

In [None]:
clean_GPU_memory()

# 6. Evaluation on test data

The test set labels for this competition are publicly available (they can be found on this [website](https://appen.com/resources/datasets/)), so the goal of this notebook was a personal attempt to explore and learn the core concepts of transformer models and the BPE tokenizer.

Getting the highest score is not the goal in itself, but it's still interesting to see how RoBERTa will perform on the test data!

**I deliberately downloaded the perfect submission in order to evaluate the quality of the model's predictions.**

In [None]:
def compute_metrics(true_labels: List[int], 
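                    pred_labels: List[int]) -> None:
    # Note: no parentheses around the condition and message, otherwise
    # the assert receives a non-empty tuple and always passes
    assert len(true_labels) == len(pred_labels), \
        "Labels lists must have the same length."
    
    print("***** Eval results *****")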
    
    ac = accuracy_score(true_labels, pred_labels)
    bac = balanced_accuracy_score(true_labels, pred_labels)

    print('Accuracy score:', ac)
    print('Balanced_accuracy_score:', bac)
    print(classification_report(true_labels, pred_labels))
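
Balanced accuracy is the mean of per-class recall, which is why it can differ from plain accuracy when the classes are imbalanced. A pure-Python sketch with made-up labels follows; `balanced_accuracy` is a hypothetical re-implementation for illustration, while the notebook itself relies on `balanced_accuracy_score`.

```python
from typing import List

def balanced_accuracy(true_labels: List[int], pred_labels: List[int]) -> float:
    """Average the recall of every class that occurs in true_labels."""
    recalls = []
    for cls in sorted(set(true_labels)):
        idx = [i for i, t in enumerate(true_labels) if t == cls]
        hits = sum(1 for i in idx if pred_labels[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Made-up example: six tweets of class 0 and four of class 1
true_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
pred_labels = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy = sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)
print(accuracy, balanced_accuracy(true_labels, pred_labels))
```

Here the classifier is right on 7 of 10 tweets, but it finds only half of class 1, so the balanced accuracy is lower than the plain accuracy.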

In [None]:
def plot_confusion_matrix(true_labels: List[int], 
                        pred_labels: List[int]) -> None:
    CM = confusion_matrix(true_labels, pred_labels)
    df_cm = pd.DataFrame(CM, range(CM.shape[0]), range(CM.shape[1]))
    plt.figure(figsize=(5,5))
    sns.set(font_scale=1.4) 
    sns.heatmap(df_cm, annot=True, annot_kws={"size": 10}, fmt = ".0f") 
    plt.show()

In [None]:
# Inference function to get logits from model for each text in prediction_dataloader
def evaluate(model, dataloader, labeled: bool, device=None, return_logits=False,
             print_metrics=True, verbose=True) -> Dict[str, list]:
    """
    :param model: torch.nn.Module - trained model;
    :param dataloader: torch.utils.data.DataLoader - data (and labels) for evaluation;
    :param labeled: if True, then make comparison with true labels of classes;
    :param device: cuda/cpu - device to perform calculations on;
    :param return_logits: if True, then return models last layer outputs;
    :param print_metrics: if True, then compute and print classification report;
    :return: dict with results.
    """
    print('\n'*2, 'Predicting labels for {:,} test sentences...'.format(len(dataloader.dataset)))

    predictions, pred_labels = [], []
    if labeled:
        true_labels = []

    # Make sure dropout is disabled during inference
    model.eval()

    # Inference loop for each batch
    for batch_i, batch in tqdm(enumerate(dataloader)):

        if labeled:
            (b_ids, b_type_ids, b_attn_mask, b_offsets, b_labels) = batch
        else:
            (b_ids, b_type_ids, b_attn_mask, b_offsets) = batch

        # Add batch to GPU
        b_ids = copy_data_to_device(b_ids, device)
        b_attn_mask = copy_data_to_device(b_attn_mask, device)
        
        # Telling the model not to compute or store gradients, saving memory and speeding up prediction
        with torch.no_grad():
            # logits shape: [batch_size, num_labels]
            logits = model(b_ids, token_type_ids=None, attention_mask=b_attn_mask)[0]
        
        # Move logits and labels to CPU
        logits = copy_data_to_device(logits, 'cpu').numpy()
        
        if labeled:
            b_labels = copy_data_to_device(b_labels, 'cpu').numpy()
            true_labels.extend(b_labels)  

        # Store predictions 
        for x in logits:
            predictions.append(x)
            pred_labels.append(np.argmax(x))
        
    # Printing classification results and prediction examples for correctness check
    if verbose:
        print('\n'*2, 'Classification done.', '\n')
        print('Predictions:', len(predictions))
        print('Pred_labels:', len(pred_labels))
        if labeled:
            print('True_labels:', len(true_labels))

    if print_metrics and labeled:
        compute_metrics(true_labels, pred_labels)
        plot_confusion_matrix(true_labels, pred_labels)

    returning = defaultdict(list)
    returning["predicted_labels"] = pred_labels
    if labeled:
        returning["true_labels"] = true_labels
    if return_logits:
        returning["predicted_logits"] = predictions
    
    return returning
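
When `return_logits=True`, the returned logits can be turned into class probabilities with a softmax. Taking the argmax over the logits or over the probabilities gives the same label. A minimal sketch with made-up logits for a single tweet (the `softmax` helper below is written out by hand for illustration):

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 2.3]  # made-up scores for classes 0 and 1
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

The probabilities are convenient for inspecting how confident the model is on individual tweets.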

In [None]:
# Load dataset with test labels - from website

df_leak = pd.read_csv(Path(ROOT_DIRECTORY) / 'input' / 'disasters-on-social-media-perfect-submission'/ 'perfect_submission.csv', 
                      encoding ='ISO-8859-1')
print('Leaked Data Set Shape = {}'.format(df_leak.shape))
print('Leaked Data Set Memory Usage = {:.2f} MB'.format(df_leak.memory_usage().sum() / 1024**2))

# Append answers to test dataset
df_test = df_test.merge(df_leak, on='id')

In [None]:
# Re-create dataloader with answers
test_dataloader = tweeter_dataloader_pipeline(df_test, tokenizer, labeled=True, 
                                             features_col_name='clean_text', label_col_name='target',
                                             make_padding=True, make_truncation=True,
                                             max_text_length=55, batch_size=128)

In [None]:
# Run evaluation
models_predictions = evaluate(model, test_dataloader, labeled=True,  
                             return_logits=True, device=device, 
                            print_metrics=True, verbose=True)
                                   

In [None]:
df_test['target'] = models_predictions.get('predicted_labels', 0)
df_test['target'].value_counts()

In [None]:
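# index=False keeps the DataFrame index out of the file, so it contains only id and target
df_test[['id', 'target']].to_csv("models_submission.csv", sep=',', encoding='utf-8', index=False)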