In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import time

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

BERT became the foundation for a series of stronger models, including models for cross-lingual tasks. Researchers from Facebook proposed XLM-RoBERTa, where XLM stands for cross-lingual language model and RoBERTa stands for Robustly Optimized BERT Pretraining Approach.

Here is the abstract:
This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.

The main idea is that this model is trained on a much larger amount of data, especially for low-resource languages, for a longer time and with a higher learning rate and batch size,
which improves its ability to generalize. The training process also differs; the details are in the original papers. One idea from this family of models is especially relevant for us: in the earlier XLM model, a training example could span more than one language, for example an English sentence concatenated with its Spanish translation, which helps the model learn how the same words and sentences in different languages are connected. XLM-R itself is trained with masked language modeling on monolingual text in each language.

The original paper reports that XLM-R outperforms multilingual BERT, which is why we will use it later in this project.

Before we train the model, we need to encode the data into the format the model expects. This format is largely similar to BERT's,
but because the training data is much broader, the underlying vocabulary differs.

The encoding algorithm is as follows. First, texts are split into tokens; a token may be a whole word or a part of one. Then every token in a sentence is converted to its unique index in the underlying vocabulary, which later links it to its embedding in the model.
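
The two steps can be sketched with a toy example. This is only an illustration under stated assumptions: `TOY_VOCAB`, `split_into_pieces`, and `toy_encode` are hypothetical names, the vocabulary has nine made-up entries, and the greedy longest-match split stands in for the SentencePiece model that the real XLM-R tokenizer uses.

```python
# Toy illustration only: a hypothetical nine-entry vocabulary, not the real XLM-R one
TOY_VOCAB = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
             "token": 4, "ization": 5, "is": 6, "fun": 7, "s": 8}

def split_into_pieces(word, vocab):
    """Greedy longest-match split of one word into known sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts at this position: treat the whole word as unknown
            return ["<unk>"]
    return pieces

def toy_encode(text, vocab=TOY_VOCAB):
    """Split text into pieces, add start/end markers, and map pieces to indices."""
    tokens = ["<s>"]
    for word in text.lower().split():
        tokens.extend(split_into_pieces(word, vocab))
    tokens.append("</s>")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = toy_encode("Tokenization is fun")
print(tokens)  # ['<s>', 'token', 'ization', 'is', 'fun', '</s>']
print(ids)     # [0, 4, 5, 6, 7, 2]
```

The word "tokenization" is not in the toy vocabulary as a whole, so it is split into two pieces, each with its own index.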

This notebook is short but important. Kaggle offers TPUs for training, but usage is limited to 30 hours per week.
Data cleaning and encoding are time-consuming steps. It is therefore more efficient to do the tokenization in a separate notebook and reuse its outputs when testing the model in a different one.

Our goal in tokenization is to:
1. Clean the texts of unimportant data
2. Encode the cleaned text with the custom tokenizer and SEQUENCE_LENGTH

In [None]:
DATA_PATH = "../input/jigsaw-multilingual-toxic-comment-classification"
small_path = "jigsaw-toxic-comment-train.csv"
large_path = "jigsaw-unintended-bias-train.csv"
val_path = "validation.csv"
test_path = "test.csv"
SEQUENCE_LENGTH = 192

I also added augmented toxic comments to the data; they are the output of my previous notebook:

https://www.kaggle.com/vgodie/class-balancing

In [None]:
small_ds = pd.read_csv(os.path.join(DATA_PATH, small_path), usecols=["comment_text", "toxic"])
large_ds = pd.read_csv(os.path.join(DATA_PATH, large_path), usecols=["comment_text", "toxic"])
aug_ds = pd.read_csv("../input/class-balancing/aug.csv")
val_ds = pd.read_csv(os.path.join(DATA_PATH, val_path), usecols=["comment_text", "toxic"])
test_ds = pd.read_csv(os.path.join(DATA_PATH, test_path))

In [None]:
aug_ds = aug_ds.sample(300000)

The large training dataset has about 2M samples, but RAM on Kaggle is limited, which is why we subsample
only part of the non-toxic examples.
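
As a rough sketch of this subsampling on a toy frame (the data, the sample size, and `random_state` are assumptions made for illustration; the real cell works on `large_ds`):

```python
import pandas as pd

# Hypothetical toy frame standing in for the large training set
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(10)],
    "toxic": [0.0, 0.0, 0.8, 0.0, 0.3, 0.0, 1.0, 0.0, 0.0, 0.6],
})

# Keep every clearly toxic row and round its soft label to 1.0
toxic_rows = toy[toy["toxic"] > 0.5].copy()
toxic_rows["toxic"] = toxic_rows["toxic"].round()

# Keep only a random subset of the rows whose label is exactly 0
# (random_state is an added assumption, used here for reproducibility)
nontoxic_rows = toy[toy["toxic"] == 0].sample(n=3, random_state=0)

balanced = pd.concat((toxic_rows, nontoxic_rows), ignore_index=True)
print(balanced["toxic"].value_counts())
```

Rows with a label strictly between 0 and 0.5, such as the 0.3 row here, end up in neither group and are dropped.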

In [None]:
large_toxic = large_ds[large_ds["toxic"] > 0.5].round()
large_nontoxic = large_ds[large_ds["toxic"] == 0].sample(600000)

ds = pd.concat((small_ds,
              large_toxic,
              large_nontoxic,
              aug_ds))

In [None]:
len(ds)

Before the text is encoded into indices, it must be cleaned. First, punctuation is removed, along with numbers, emails,
links, and usernames, so that only words remain.

In [None]:
puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£',
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', '\xa0', '\t',
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '\u3000', '\u202f',
 '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', '«',
 '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]

In [None]:
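def clean_text(text, lang='en'):
    # lang is currently unused; it is kept so existing calls keep working
    text = str(text)
    text = re.sub(r'[0-9"]', '', text)       # digits and double quotes
    text = re.sub(r'#[\S]+\b', '', text)     # hashtags
    text = re.sub(r'@[\S]+\b', '', text)     # usernames and the domain part of emails
    text = re.sub(r'https?\S+', '', text)    # links
    text = re.sub(r'\s+', ' ', text)         # collapse whitespace
    text = re.sub(r'\[\[User.*', '', text)   # wiki user signatures (raw string avoids an invalid-escape warning)
    for punct in puncts:
        text = text.replace(punct, "")
    return text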

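def clean_data(df, text_label="comment_text", train=True, chunk_size=10000):
    # Work on a copy with a fresh index so the caller's frame is left untouched
    df = df.reset_index(drop=True).copy()
    col = df.columns.get_loc(text_label)
    for pos in range(0, len(df), chunk_size):
        # Positional iloc assignment writes into df directly (no chained assignment)
        df.iloc[pos:pos + chunk_size, col] = (
            df.iloc[pos:pos + chunk_size, col].apply(clean_text).values
        )
        print("Processed", min(pos + chunk_size, len(df)), "texts")
    if train:
        # Drop comments that contain no words after cleaning
        n_words = df[text_label].str.split().apply(len)
        df = df[n_words > 0].reset_index(drop=True)
    return df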
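
To see what the cleaning rules do, here is a small self-contained check. It is a sketch, not the notebook's function: `demo_clean` and `DEMO_PUNCTS` are hypothetical names, the punctuation list is cut down to a few characters, and the sample strings are made up.

```python
import re

# Reduced punctuation set for the demo; the notebook's list is much longer
DEMO_PUNCTS = [",", ".", "!", "?", ":", "(", ")", "'"]

def demo_clean(text):
    """Apply the same rules, in the same order, with a shorter punctuation list."""
    text = re.sub(r'[0-9"]', "", text)      # digits and double quotes
    text = re.sub(r"#\S+\b", "", text)      # hashtags
    text = re.sub(r"@\S+\b", "", text)      # user mentions
    text = re.sub(r"https?\S+", "", text)   # links
    text = re.sub(r"\s+", " ", text)        # collapse whitespace
    text = re.sub(r"\[\[User.*", "", text)  # wiki user signatures
    for punct in DEMO_PUNCTS:
        text = text.replace(punct, "")
    return text

sample = "Hey @bob, look at https://example.com in 2020! #wow"
words = demo_clean(sample).split()
print(words)  # ['Hey', 'look', 'at', 'in']

# A comment made only of digits and punctuation has no words left after cleaning
print(demo_clean("!!! 123 ...").split())  # []
```

Removing punctuation after the whitespace step can leave stray spaces, which is why the word count is taken from `split()`. Comments like the second one are the rows that get dropped from the training data.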

In [None]:
cleaned_ds = clean_data(ds)
cleaned_val = clean_data(val_ds)
cleaned_test = clean_data(test_ds,text_label="content", train=False)

In [None]:
# Sanity check: no test rows may be dropped during cleaning
len(cleaned_test) == len(test_ds)

In [None]:
from transformers import AutoTokenizer

MODEL = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

After downloading the tokenizer and its configuration for xlm-roberta-large, we split the texts into tokens,
convert them to indices, and pad the sentences that are shorter than the maximum length.
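
The fixed-length step can be sketched on its own. This is a simplified illustration: `pad_or_truncate` and `PAD_ID` are hypothetical, the index values are made up, and a real tokenizer also keeps the end-of-sequence marker when it truncates.

```python
import numpy as np

PAD_ID = 1  # assumption: placeholder padding index for this toy example

def pad_or_truncate(ids, max_len, pad_id=PAD_ID):
    """Return exactly max_len indices: cut long inputs, right-pad short ones."""
    if len(ids) >= max_len:
        return ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

short = [0, 4, 5, 2]
long = [0, 4, 5, 6, 7, 8, 9, 2]
batch = np.array([pad_or_truncate(seq, max_len=6) for seq in (short, long)])
print(batch)
# [[0 4 5 2 1 1]
#  [0 4 5 6 7 8]]
```

Because every row has the same length, the encoded comments can be stacked into one rectangular array, which is the layout saved at the end of this notebook.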

In [None]:
def encode_comments(dataframe, tokenizer=tokenizer, max_len=SEQUENCE_LENGTH, chunk_size=10000):
    # Encode in chunks to bound peak memory and to report progress
    all_ids = []
    for pos in range(0, len(dataframe), chunk_size):
        chunk = dataframe.iloc[pos:pos + chunk_size]
        # Argument names follow transformers 3.x and later; older releases used
        # pad_to_max_length=True and return_attention_masks=False instead
        res = tokenizer.batch_encode_plus(chunk.comment_text.tolist(),
                                          padding="max_length",
                                          truncation=True,
                                          max_length=max_len,  # honor the max_len argument
                                          return_attention_mask=False)
        all_ids.append(np.array(res["input_ids"]))
        print("Processed", min(pos + chunk_size, len(dataframe)), "elements")
    # Concatenate once at the end instead of growing the array on every iteration
    ids = np.concatenate(all_ids)
    labels = dataframe.toxic.values
    return ids, labels

In [None]:
ids,labels = encode_comments(cleaned_ds)
val_ids,val_labels = encode_comments(cleaned_val)

In [None]:
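# Argument names follow transformers 3.x and later; older releases used
# pad_to_max_length=True and return_attention_masks=False instead
test_ids = tokenizer.batch_encode_plus(cleaned_test.content.tolist(),
                                       padding="max_length",
                                       truncation=True,
                                       max_length=SEQUENCE_LENGTH,
                                       return_attention_mask=False)["input_ids"]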

In [None]:
#save the results of tokenization for future use in the next notebooks

np.save("ids.npy", ids)
np.save("labels.npy", labels)
np.save("val_ids.npy", val_ids)
np.save("val_labels.npy", val_labels)
np.save("test_ids.npy", test_ids)
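
A later notebook can read these files back with `np.load`. The sketch below uses stand-in arrays and a temporary directory so that it runs on its own; in practice the files would come from this notebook's output attached as an input dataset, and the folder name under `../input/` depends on how the notebook is named.

```python
import os
import tempfile
import numpy as np

# Stand-in arrays with the same layout as the saved outputs:
# one row of SEQUENCE_LENGTH token indices per comment, one label per comment
SEQUENCE_LENGTH = 192
demo_ids = np.zeros((4, SEQUENCE_LENGTH), dtype=np.int64)
demo_labels = np.array([0.0, 1.0, 0.0, 1.0])

with tempfile.TemporaryDirectory() as out_dir:
    np.save(os.path.join(out_dir, "ids.npy"), demo_ids)
    np.save(os.path.join(out_dir, "labels.npy"), demo_labels)

    # out_dir plays the role of the attached notebook output here
    loaded_ids = np.load(os.path.join(out_dir, "ids.npy"))
    loaded_labels = np.load(os.path.join(out_dir, "labels.npy"))

print(loaded_ids.shape, loaded_labels.shape)  # (4, 192) (4,)
```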

The next step is to build the XLM-RoBERTa model. To see the results, see my next notebook.

