# About this notebook

This notebook is part of my experiments with a cross-lingual text classifier. The classifier uses metric learning with deep [Transformer](https://arxiv.org/abs/1706.03762)-based networks to transform the features, and a separate model (for example, a Bayesian neural network) for the final classification in the new feature space. The [Jigsaw Multilingual Toxic Comment Classification challenge](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification) is the source of labeled cross-lingual texts. This task has two nuances:

1. I want to build a semantic space that is independent of any particular language.

2. I have a large labeled text corpus (hundreds of thousands of labeled texts) for a single language only; the training datasets for other languages are very small or empty. Yet these are exactly the languages in which I have to classify texts.

Transfer learning addresses the first of these nuances. I use a pre-trained [XLM-RoBERTa](https://arxiv.org/abs/1911.02116) model as the initial state for a [Siamese neural network](https://link.springer.com/protocol/10.1007/978-1-0716-0826-5_3), which learns to transform the common semantic space into a specialized one where the distance between tonally opposed texts is larger and the distance between texts with the same sentiment is smaller. To train the Siamese NN, I apply a special loss function known as the [Distance Based Logistic loss (DBL loss)](https://arxiv.org/abs/1608.00161). I prefer it to the usual cross-entropy loss because it is contrastive, and a contrastive loss drives the Siamese network towards a compact space with the required semantic properties. Compared with the "classical" contrastive loss, which is popular for Siamese neural networks, the DBL loss is more effective because it converges faster. Triplet loss is also commonly used with Siamese networks, but in my experiments with textual data it did not yield a well-separable semantic space, so I did not include it in this notebook.
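
To make the behaviour of the DBL loss concrete, here is a minimal pure-Python sketch of the same formula that `distance_based_log_loss` implements further down, with the margin fixed at 1.0. The helper names `dbl_probability` and `dbl_loss` and the clipping constant `eps` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import math


def dbl_probability(distance: float, margin: float = 1.0) -> float:
    """Map a non-negative distance to a similarity probability in (0, 1]."""
    return (1.0 + math.exp(-margin)) / (1.0 + math.exp(distance - margin))


def dbl_loss(is_similar: int, distance: float, margin: float = 1.0) -> float:
    """Binary cross-entropy between the pair label and the DBL probability."""
    eps = 1e-7  # clipping constant (an assumption) that avoids log(0)
    p = min(max(dbl_probability(distance, margin), eps), 1.0 - eps)
    return -(is_similar * math.log(p) + (1 - is_similar) * math.log(1.0 - p))


close_pair, far_pair = 0.1, 3.0
# A similar pair (label 1) is penalized more when its texts are far apart.
print(dbl_loss(1, close_pair), dbl_loss(1, far_pair))
# A dissimilar pair (label 0) is penalized more when its texts are close.
print(dbl_loss(0, close_pair), dbl_loss(0, far_pair))
```

At zero distance the probability equals 1, and it decays towards 0 as the distance grows, so the label 1 means "same class" and the label 0 means "different classes".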

In [None]:
import codecs
import copy
import csv
import gc
import os
import pickle
import random
import time
from typing import Dict, List, Tuple, Union

In [None]:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

In [None]:
import matplotlib.pyplot as plt
from nltk import wordpunct_tokenize
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow.python.framework import ops, tensor_util
from tensorflow.python.keras.utils import losses_utils, tf_utils
from tensorflow.python.ops import math_ops
from tensorflow.python.ops.losses import util as tf_losses_util
from tqdm.notebook import tqdm
from transformers import AutoTokenizer, XLMRobertaTokenizer
from transformers import TFXLMRobertaModel, XLMRobertaConfig

In [None]:
class LossFunctionWrapper(tf.keras.losses.Loss):
    def __init__(self,
                 fn,
                 reduction=losses_utils.ReductionV2.AUTO,
                 name=None,
                 **kwargs):
        super(LossFunctionWrapper, self).__init__(reduction=reduction,
                                                  name=name)
        self.fn = fn
        self._fn_kwargs = kwargs

    def call(self, y_true, y_pred):
        if tensor_util.is_tensor(y_pred) and tensor_util.is_tensor(y_true):
            y_pred, y_true = tf_losses_util.squeeze_or_expand_dimensions(
                y_pred, y_true
            )
        return self.fn(y_true, y_pred, **self._fn_kwargs)

    def get_config(self):
        config = {}
        for k, v in self._fn_kwargs.items():
            config[k] = tf.keras.backend.eval(v) \
                if tf_utils.is_tensor_or_variable(v) \
                else v
        base_config = super(LossFunctionWrapper, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

In [None]:
def distance_based_log_loss(y_true, y_pred):
    y_pred = ops.convert_to_tensor(y_pred)
    y_true = math_ops.cast(y_true, y_pred.dtype)
    margin = 1.0
    p = (1.0 + tf.math.exp(-margin)) / (1.0 + tf.math.exp(y_pred - margin))
    return tf.keras.backend.binary_crossentropy(target=y_true, output=p)

In [None]:
class DBLLogLoss(LossFunctionWrapper):
    def __init__(self, reduction=losses_utils.ReductionV2.AUTO,
                 name='distance_based_log_loss'):
        super(DBLLogLoss, self).__init__(distance_based_log_loss, name=name,
                                         reduction=reduction)

In [None]:
class AttentionMaskLayer(tf.keras.layers.Layer):
    def __init__(self, pad_token_id: int, **kwargs):
        self.pad_token_id = pad_token_id
        super(AttentionMaskLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        super(AttentionMaskLayer, self).build(input_shape)

    def call(self, inputs, **kwargs):
        return tf.keras.backend.cast(
            x=tf.math.not_equal(
                x=inputs,
                y=self.pad_token_id
            ),
            dtype='int32'
        )

    def compute_output_shape(self, input_shape):
        return input_shape

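    def get_config(self):
        # Merge with the base layer config so that name, dtype, etc. are kept
        config = super(AttentionMaskLayer, self).get_config()
        config["pad_token_id"] = self.pad_token_id
        return config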
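
The effect of `AttentionMaskLayer` can be sketched with plain NumPy: every position whose token ID differs from the padding ID gets 1, and every padding position gets 0. The token IDs and the padding ID below are made-up values for illustration only; in practice the padding ID should be read from the tokenizer.

```python
import numpy as np

PAD_TOKEN_ID = 1  # assumed padding ID, for illustration only

# Two made-up token ID sequences of the same length
token_ids = np.array([
    [0, 581, 7515, 2, 1, 1],    # short text followed by two padding positions
    [0, 581, 7515, 83, 10, 2],  # text that fills the whole window
], dtype=np.int32)

# 1 for real tokens, 0 for padding, as the layer does with tf.math.not_equal
attention_mask = (token_ids != PAD_TOKEN_ID).astype(np.int32)
print(attention_mask)
```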

**def tokenize_all(...)**

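This function transforms input texts into token IDs for XLM-RoBERTa. It returns a list of integer lists, one per text, each padded or truncated to exactly *maxlen* items, so the result can be converted to a 2-d numpy array.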

In [None]:
def tokenize_all(texts: List[str], tokenizer: XLMRobertaTokenizer,
                 maxlen: int) -> List[List[int]]:
    if not isinstance(texts, list):
        err_msg = '"{0}" is wrong type for the text list!'.format(type(texts))
        raise ValueError(err_msg)
    n_texts = len(texts)
    all_tokenized_texts = []
    for cur_text in tqdm(texts):
        full_words = wordpunct_tokenize(cur_text)
        sub_words = []
        for cur_word in filter(lambda it2: len(it2) > 0,
                               map(lambda it1: it1.strip(), full_words)):
            bpe = tokenizer.tokenize(cur_word)
            if tokenizer.unk_token in bpe:
                sub_words.append(tokenizer.unk_token)
            else:
                sub_words += bpe
        sub_words = [tokenizer.bos_token] + sub_words + \
                    [tokenizer.eos_token]
        if len(sub_words) > maxlen:
            sub_words = sub_words[:maxlen]
        elif len(sub_words) < maxlen:
            ndiff = maxlen - len(sub_words)
            for _ in range(ndiff):
                sub_words.append(tokenizer.pad_token)
        all_tokenized_texts.append(
            tokenizer.convert_tokens_to_ids(sub_words)
        )
        del sub_words, full_words
    return all_tokenized_texts
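
The padding and truncation step of `tokenize_all` can be sketched in isolation. The helper `pad_or_truncate` below is hypothetical and works on plain strings; the default special tokens are assumptions for illustration, and the real function takes them from the tokenizer.

```python
from typing import List


def pad_or_truncate(tokens: List[str], maxlen: int,
                    bos: str = '<s>', eos: str = '</s>',
                    pad: str = '<pad>') -> List[str]:
    """Wrap tokens in BOS/EOS, then force the result to exactly maxlen items."""
    wrapped = [bos] + tokens + [eos]
    if len(wrapped) > maxlen:
        # As in tokenize_all, a plain slice is used, so EOS is cut off here
        return wrapped[:maxlen]
    return wrapped + [pad] * (maxlen - len(wrapped))


print(pad_or_truncate(['a', 'b'], maxlen=6))
print(pad_or_truncate(['a', 'b', 'c', 'd'], maxlen=4))
```

Note that a text longer than *maxlen* loses its end-of-sequence marker, because truncation happens after the marker is appended.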

**def load_train_set(...)**

This function loads a multilingual labeled text corpus for training from a CSV file. If the CSV file has no language field, I assume that all texts are in English.

It returns a dictionary that groups texts and their binary toxicity labels (integer values) by language: the language is the key, and a list of (text, label) pairs is the value.

In [None]:
def load_train_set(file_name: str, text_field: str, sentiment_fields: List[str],
                   lang_field: str) -> Dict[str, List[Tuple[str, int]]]:
    assert len(sentiment_fields) > 0, 'List of sentiment fields is empty!'
    header = []
    line_idx = 1
    data_by_lang = dict()
    with codecs.open(file_name, mode='r', encoding='utf-8', errors='ignore') as fp:
        data_reader = csv.reader(fp, quotechar='"', delimiter=',')
        for row in data_reader:
            if len(row) > 0:
                err_msg = 'File "{0}": line {1} is wrong!'.format(file_name, line_idx)
                if len(header) == 0:
                    header = copy.copy(row)
                    err_msg2 = err_msg + ' Field "{0}" is not found!'.format(text_field)
                    assert text_field in header, err_msg2
                    for cur_field in sentiment_fields:
                        err_msg2 = err_msg + ' Field "{0}" is not found!'.format(
                            cur_field)
                        assert cur_field in header, err_msg2
                    text_field_index = header.index(text_field)
                    try:
                        lang_field_index = header.index(lang_field)
                    except ValueError:
                        lang_field_index = -1
                    indices_of_sentiment_fields = []
                    for cur_field in sentiment_fields:
                        indices_of_sentiment_fields.append(header.index(cur_field))
                else:
                    if len(row) == len(header):
                        text = row[text_field_index].strip()
                        assert len(text) > 0, err_msg + ' Text is empty!'
                        if lang_field_index >= 0:
                            cur_lang = row[lang_field_index].strip()
                            assert len(cur_lang) > 0, err_msg + ' Language is empty!'
                        else:
                            cur_lang = 'en'
                        max_proba = 0.0
                        for cur_field_idx in indices_of_sentiment_fields:
                            try:
                                cur_proba = float(row[cur_field_idx])
                            except ValueError:
                                cur_proba = -1.0
                            err_msg2 = err_msg + ' Value {0} is wrong!'.format(
                                row[cur_field_idx]
                            )
                            assert (cur_proba >= 0.0) and (cur_proba <= 1.0), err_msg2
                            if cur_proba > max_proba:
                                max_proba = cur_proba
                        new_label = 1 if max_proba >= 0.5 else 0
                        if cur_lang not in data_by_lang:
                            data_by_lang[cur_lang] = []
                        data_by_lang[cur_lang].append((text, new_label))
            if line_idx % 10000 == 0:
                print('{0} lines of the "{1}" have been processed...'.format(
                    line_idx, file_name
                ))
            line_idx += 1
    if line_idx > 0:
        if (line_idx - 1) % 10000 != 0:
            print('{0} lines of the "{1}" have been processed...'.format(
                line_idx - 1, file_name
            ))
    return data_by_lang
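
The way `load_train_set` derives one binary label from several sentiment fields can be sketched as follows: the maximum of the field values is compared with 0.5. The helper name `binarize_toxicity` is an assumption made for this example.

```python
from typing import Sequence


def binarize_toxicity(scores: Sequence[float], threshold: float = 0.5) -> int:
    """Return 1 if any sentiment score reaches the threshold, else 0."""
    if len(scores) == 0:
        raise ValueError('List of sentiment scores is empty!')
    for cur in scores:
        if not (0.0 <= cur <= 1.0):
            raise ValueError('Value {0} is wrong!'.format(cur))
    return 1 if max(scores) >= threshold else 0


print(binarize_toxicity([0.1, 0.7, 0.0]))  # one field is above the threshold
print(binarize_toxicity([0.2, 0.49]))      # all fields are below the threshold
```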

**def load_test_set(...)**

This function loads a multilingual unlabeled text corpus for submission from a CSV file. The language field is required here: the function asserts that it is present in the header, so the English fallback in the code is never reached.

It returns a dictionary that groups texts and their integer identifiers by language: the language is the key, and a list of (text, identifier) pairs is the value.

The function differs from the previous *load_train_set* only in the expected structure (field set) of the parsed CSV file. The returned objects have the same structure, except that submission identifiers are used instead of toxicity labels.

In [None]:
def load_test_set(file_name: str, id_field: str, text_field: str,
                  lang_field: str) -> Dict[str, List[Tuple[str, int]]]:
    header = []
    line_idx = 1
    data_by_lang = dict()
    with codecs.open(file_name, mode='r', encoding='utf-8', errors='ignore') as fp:
        data_reader = csv.reader(fp, quotechar='"', delimiter=',')
        for row in data_reader:
            if len(row) > 0:
                err_msg = 'File "{0}": line {1} is wrong!'.format(file_name, line_idx)
                if len(header) == 0:
                    header = copy.copy(row)
                    err_msg2 = err_msg + ' Field "{0}" is not found!'.format(text_field)
                    assert text_field in header, err_msg2
                    err_msg2 = err_msg + ' Field "{0}" is not found!'.format(id_field)
                    assert id_field in header, err_msg2
                    err_msg2 = err_msg + ' Field "{0}" is not found!'.format(lang_field)
                    assert lang_field in header, err_msg2
                    id_field_index = header.index(id_field)
                    text_field_index = header.index(text_field)
                    lang_field_index = header.index(lang_field)
                else:
                    if len(row) == len(header):
                        try:
                            id_value = int(row[id_field_index])
                        except ValueError:
                            id_value = -1
                        err_msg2 = err_msg + ' {0} is wrong ID!'.format(
                            row[id_field_index])
                        assert id_value >= 0, err_msg2
                        text = row[text_field_index].strip()
                        assert len(text) > 0, err_msg + ' Text is empty!'
                        if lang_field_index >= 0:
                            cur_lang = row[lang_field_index].strip()
                            assert len(cur_lang) > 0, err_msg + ' Language is empty!'
                        else:
                            cur_lang = 'en'
                        if cur_lang not in data_by_lang:
                            data_by_lang[cur_lang] = []
                        data_by_lang[cur_lang].append((text, id_value))
            if line_idx % 10000 == 0:
                print('{0} lines of the "{1}" have been processed...'.format(
                    line_idx, file_name
                ))
            line_idx += 1
    if line_idx > 0:
        if (line_idx - 1) % 10000 != 0:
            print('{0} lines of the "{1}" have been processed...'.format(
                line_idx - 1, file_name
            ))
    return data_by_lang

**def generate_text_probabilities(...)**

This function generates, for each text, the probability of randomly selecting it from the list of all texts. The shorter the text, the greater its selection probability.

In [None]:
def generate_text_probabilities(source_texts: List[Tuple[str, int]],
                                interesting_indices: List[int]) -> np.ndarray:
    lengths_of_texts = []
    max_chars_number = 0
    for idx in interesting_indices:
        cur_chars_number = len(source_texts[idx][0])
        lengths_of_texts.append(cur_chars_number)
        if cur_chars_number > max_chars_number:
            max_chars_number = cur_chars_number
    assert max_chars_number > 100
    probabilities = np.zeros((len(lengths_of_texts),), dtype=np.float64)
    counter = 0
    for idx, val in enumerate(lengths_of_texts):
        if val > 10:
            probabilities[idx] = max_chars_number * 10 - val
        else:
            counter += 1
    assert counter == 0
    probabilities /= np.sum(probabilities)
    min_proba = 0.5 / float(probabilities.shape[0])
    for idx in range(probabilities.shape[0]):
        if probabilities[idx] > 0.0:
            if probabilities[idx] < min_proba:
                probabilities[idx] = min_proba
    return probabilities / np.sum(probabilities)
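
A vectorized sketch of the same weighting rule, working directly on a list of text lengths, shows how the probabilities behave. The function name `length_based_probabilities` is an assumption, and the sanity checks of the original function (minimal and maximal text length) are omitted here.

```python
import numpy as np


def length_based_probabilities(lengths) -> np.ndarray:
    """Give shorter texts a higher selection probability."""
    lengths = np.asarray(lengths, dtype=np.float64)
    # Weight of a text is 10 * (maximal length) minus its own length
    weights = lengths.max() * 10.0 - lengths
    probabilities = weights / weights.sum()
    # No text may fall below half of the uniform probability
    min_proba = 0.5 / probabilities.shape[0]
    probabilities = np.maximum(probabilities, min_proba)
    return probabilities / probabilities.sum()


probas = length_based_probabilities([20, 200, 2000])
print(probas)
```

Because the maximal length is multiplied by 10, the preference for short texts is mild: in this example even the longest text keeps a probability close to the uniform 1/3.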

**def build_siamese_dataset(...)**

This function transforms a labeled text corpus, in the format returned by the *load_train_set* function, into a TensorFlow dataset of text pairs batched with the specified mini-batch size. It also returns the total number of mini-batches in this dataset.

In [None]:
def build_siamese_dataset(texts: Dict[str, List[Tuple[str, int]]],
                          dataset_size: int, tokenizer: XLMRobertaTokenizer,
                          maxlen: int, batch_size: int,
                          shuffle: bool) -> Tuple[tf.data.Dataset, int]:
    language_pairs = set()
    for language in texts.keys():
        for other_language in texts:
            if other_language == language:
                language_pairs.add((language, other_language))
            else:
                pair_1 = (language, other_language)
                pair_2 = (other_language, language)
                if (pair_1 not in language_pairs) and (pair_2 not in language_pairs):
                    language_pairs.add(pair_1)
    language_pairs = sorted(list(language_pairs))
    print('Possible language pairs are: {0}.'.format(language_pairs))
    err_msg = '{0} is too small size of the data set!'.format(dataset_size)
    assert dataset_size >= (len(language_pairs) * 10), err_msg
    n_samples_for_lang_pair = int(np.ceil(dataset_size / float(len(language_pairs))))
    text_pairs_and_labels = []
    for left_lang, right_lang in language_pairs:
        print('{0}-{1}:'.format(left_lang, right_lang))
        left_positive_indices = list(filter(
            lambda idx: ((texts[left_lang][idx][1] > 0) and \
                         (len(texts[left_lang][idx][0]) > 10)),
            range(len(texts[left_lang]))
        ))
        left_positive_probas = generate_text_probabilities(
            source_texts=texts[left_lang],
            interesting_indices=left_positive_indices
        )
        left_negative_indices = list(filter(
            lambda idx: ((texts[left_lang][idx][1] == 0) and \
                         (len(texts[left_lang][idx][0]) > 10)),
            range(len(texts[left_lang]))
        ))
        left_negative_probas = generate_text_probabilities(
            source_texts=texts[left_lang],
            interesting_indices=left_negative_indices
        )
        right_positive_indices = list(filter(
            lambda idx: ((texts[right_lang][idx][1] > 0) and \
                         (len(texts[right_lang][idx][0]) > 10)),
            range(len(texts[right_lang]))
        ))
        right_positive_probas = generate_text_probabilities(
            source_texts=texts[right_lang],
            interesting_indices=right_positive_indices
        )
        right_negative_indices = list(filter(
            lambda idx: ((texts[right_lang][idx][1] == 0) and \
                         (len(texts[right_lang][idx][0]) > 10)),
            range(len(texts[right_lang]))
        ))
        right_negative_probas = generate_text_probabilities(
            source_texts=texts[right_lang],
            interesting_indices=right_negative_indices
        )
        used_pairs = set()
        number_of_samples = 0
        iterations = n_samples_for_lang_pair // 4
        if len(left_positive_indices) > iterations:
            left_indices = np.random.choice(
                left_positive_indices,
                min(iterations * 2, len(left_positive_indices)),
                p=left_positive_probas, replace=False
            ).tolist()
        else:
            left_indices = left_positive_indices
        if len(right_positive_indices) > iterations:
            right_indices = np.random.choice(
                right_positive_indices,
                min(iterations * 2, len(right_positive_indices)),
                p=right_positive_probas, replace=False
            ).tolist()
        else:
            right_indices = right_positive_indices
        if len(left_indices) < len(right_indices):
            right_indices = right_indices[:len(left_indices)]
        elif len(left_indices) > len(right_indices):
            left_indices = left_indices[:len(right_indices)]
        random.shuffle(left_indices)
        random.shuffle(right_indices)
        for left_idx, right_idx in zip(left_indices, right_indices):
            if (right_idx == left_idx) and (left_lang == right_lang):
                continue
            if (left_idx, right_idx) in used_pairs:
                continue
            used_pairs.add((left_idx, right_idx))
            used_pairs.add((right_idx, left_idx))
            text_pairs_and_labels.append(
                (
                    texts[left_lang][left_idx][0],
                    texts[right_lang][right_idx][0],
                    1
                )
            )
            number_of_samples += 1
            if number_of_samples >= iterations:
                break
        del left_indices, right_indices
        print('  number of "1-1" pairs is {0};'.format(number_of_samples))
        number_of_samples = 0
        iterations = (2 * n_samples_for_lang_pair) // 4
        iterations -= n_samples_for_lang_pair // 4
        if len(left_negative_indices) > iterations:
            left_indices = np.random.choice(
                left_negative_indices,
                min(iterations * 2, len(left_negative_indices)),
                p=left_negative_probas, replace=False
            ).tolist()
        else:
            left_indices = left_negative_indices
        if len(right_negative_indices) > iterations:
            right_indices = np.random.choice(
                right_negative_indices,
                min(iterations * 2, len(right_negative_indices)),
                p=right_negative_probas, replace=False
            ).tolist()
        else:
            right_indices = right_negative_indices
        if len(left_indices) < len(right_indices):
            right_indices = right_indices[:len(left_indices)]
        elif len(left_indices) > len(right_indices):
            left_indices = left_indices[:len(right_indices)]
        random.shuffle(left_indices)
        random.shuffle(right_indices)
        for left_idx, right_idx in zip(left_indices, right_indices):
            if (right_idx == left_idx) and (left_lang == right_lang):
                continue
            if (left_idx, right_idx) in used_pairs:
                continue
            used_pairs.add((left_idx, right_idx))
            used_pairs.add((right_idx, left_idx))
            text_pairs_and_labels.append(
                (
                    texts[left_lang][left_idx][0],
                    texts[right_lang][right_idx][0],
                    1
                )
            )
            number_of_samples += 1
            if number_of_samples >= iterations:
                break
        del left_indices, right_indices
        print('  number of "0-0" pairs is {0};'.format(number_of_samples))
        number_of_samples = 0
        iterations = n_samples_for_lang_pair
        iterations -= (2 * n_samples_for_lang_pair) // 4
        if len(left_negative_indices) > iterations:
            left_indices = np.random.choice(
                left_negative_indices,
                min(iterations * 2, len(left_negative_indices)),
                p=left_negative_probas, replace=False
            ).tolist()
        else:
            left_indices = left_negative_indices
        if len(right_positive_indices) > iterations:
            right_indices = np.random.choice(
                right_positive_indices,
                min(iterations * 2, len(right_positive_indices)),
                p=right_positive_probas, replace=False
            ).tolist()
        else:
            right_indices = right_positive_indices
        if len(left_indices) < len(right_indices):
            right_indices = right_indices[:len(left_indices)]
        elif len(left_indices) > len(right_indices):
            left_indices = left_indices[:len(right_indices)]
        random.shuffle(left_indices)
        random.shuffle(right_indices)
        for left_idx, right_idx in zip(left_indices, right_indices):
            if (right_idx == left_idx) and (left_lang == right_lang):
                continue
            if (left_idx, right_idx) in used_pairs:
                continue
            used_pairs.add((left_idx, right_idx))
            used_pairs.add((right_idx, left_idx))
            if random.random() >= 0.5:
                text_pairs_and_labels.append(
                    (
                        texts[left_lang][left_idx][0],
                        texts[right_lang][right_idx][0],
                        0
                    )
                )
            else:
                text_pairs_and_labels.append(
                    (
                        texts[right_lang][right_idx][0],
                        texts[left_lang][left_idx][0],
                        0
                    )
                )
            number_of_samples += 1
            if number_of_samples >= iterations:
                break
        del left_indices, right_indices
        print('  number of "0-1" or "1-0" pairs is {0}.'.format(
            number_of_samples
        ))
    random.shuffle(text_pairs_and_labels)
    n_steps = len(text_pairs_and_labels) // batch_size
    print('Samples number of the data set is {0}.'.format(
        len(text_pairs_and_labels)
    ))
    print('Samples number per each language pair is {0}.'.format(
        n_samples_for_lang_pair
    ))
    tokens_of_left_texts = tokenize_all(
        texts=[cur[0] for cur in text_pairs_and_labels],
        tokenizer=tokenizer, maxlen=maxlen
    )
    tokens_of_left_texts = np.array(tokens_of_left_texts, dtype=np.int32)
    print('')
    print('3 examples of left texts after tokenization:')
    for _ in range(3):
        idx = random.randint(0, len(text_pairs_and_labels) - 1)
        print('  {0}'.format(text_pairs_and_labels[idx][0]))
        print('  {0}'.format(tokens_of_left_texts[idx].tolist()))
        print('')
    tokens_of_right_texts = tokenize_all(
        texts=[cur[1] for cur in text_pairs_and_labels],
        tokenizer=tokenizer, maxlen=maxlen
    )
    tokens_of_right_texts = np.array(tokens_of_right_texts, dtype=np.int32)
    print('3 examples of right texts after tokenization:')
    for _ in range(3):
        idx = random.randint(0, len(text_pairs_and_labels) - 1)
        print('  {0}'.format(text_pairs_and_labels[idx][1]))
        print('  {0}'.format(tokens_of_right_texts[idx].tolist()))
        print('')
    siamese_labels = np.array([cur[2] for cur in text_pairs_and_labels],
                              dtype=np.int32)
    print('Number of positive siamese samples is {0} from {1}.'.format(
        int(sum(siamese_labels)), siamese_labels.shape[0]))
    err_msg = '{0} != 2'.format(len(tokens_of_left_texts.shape))
    assert len(tokens_of_left_texts.shape) == 2, err_msg
    err_msg = '{0} != 1'.format(len(siamese_labels.shape))
    assert len(siamese_labels.shape) == 1, err_msg
    err_msg = '{0} != {1}'.format(tokens_of_left_texts.shape, tokens_of_right_texts.shape)
    assert tokens_of_left_texts.shape == tokens_of_right_texts.shape, err_msg
    err_msg = '{0} != {1}'.format(tokens_of_left_texts.shape[0], siamese_labels.shape[0])
    assert tokens_of_left_texts.shape[0] == siamese_labels.shape[0], err_msg
    if shuffle:
        err_msg = '{0} is too small number of samples for the data set!'.format(
            len(text_pairs_and_labels))
        assert n_steps >= 50, err_msg
        dataset = tf.data.Dataset.from_tensor_slices(
            (
                (
                    tokens_of_left_texts,
                    tokens_of_right_texts
                ),
                siamese_labels
            )
        ).repeat().batch(batch_size)
    else:
        dataset = tf.data.Dataset.from_tensor_slices(
            (
                (
                    tokens_of_left_texts,
                    tokens_of_right_texts
                ),
                siamese_labels
            )
        ).batch(batch_size)
    del text_pairs_and_labels
    return dataset, n_steps
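
Two bookkeeping steps of `build_siamese_dataset` can be sketched separately: enumerating the language pairs and splitting the sample budget of one pair among the three kinds of text pairs. Both helper names are assumptions. The sketch yields the same unordered language pairs as the nested loops above, although the order of the two languages inside a pair may differ.

```python
from itertools import combinations_with_replacement
from typing import Iterable, List, Tuple


def enumerate_language_pairs(languages: Iterable[str]) -> List[Tuple[str, str]]:
    """All unordered language pairs, including each language with itself."""
    return list(combinations_with_replacement(sorted(languages), 2))


def split_budget(n_samples_for_lang_pair: int) -> Tuple[int, int, int]:
    """Budget for "1-1", "0-0" and mixed ("0-1" or "1-0") text pairs."""
    n_pos_pos = n_samples_for_lang_pair // 4
    n_neg_neg = (2 * n_samples_for_lang_pair) // 4 - n_pos_pos
    n_mixed = n_samples_for_lang_pair - (2 * n_samples_for_lang_pair) // 4
    return n_pos_pos, n_neg_neg, n_mixed


print(enumerate_language_pairs(['tr', 'en', 'es']))
print(split_budget(1000))
```

Roughly half of the budget goes to pairs of the same class (Siamese label 1) and half to pairs of different classes (Siamese label 0). The real function may produce fewer pairs than budgeted when a language has too few suitable texts.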

**def build_feature_extractor(...)**

This function builds a sentence embedder, based on XLM-RoBERTa, as a Keras model.

In [None]:
def build_feature_extractor(transformer_name: str, padding: int,
                            max_len: int) -> tf.keras.Model:
    xlmroberta_config = XLMRobertaConfig.from_pretrained(transformer_name)
    max_position_embeddings = xlmroberta_config.max_position_embeddings
    if max_len > (max_position_embeddings - 2):
        err_msg = 'max_len = {0} is too large! It must be less ' \
                  'than {1}.'.format(max_len, max_position_embeddings - 1)
        raise ValueError(err_msg)
    output_embedding_size = xlmroberta_config.hidden_size
    word_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32,
                                     name="base_word_ids_FE")
    attention_mask = AttentionMaskLayer(
        pad_token_id=padding, name='base_attention_mask_FE',
        trainable=False
    )(word_ids)
    del xlmroberta_config
    transformer_layer = TFXLMRobertaModel.from_pretrained(
        pretrained_model_name_or_path=transformer_name,
        name='Transformer'
    )
    sequence_output = transformer_layer(
        [word_ids, attention_mask]
    )[0]
    output_mask = tf.cast(attention_mask, dtype=tf.bool)
    pooled_output = tf.keras.layers.GlobalAvgPool1D(
        name='AvePool_FE'
    )(sequence_output, mask=output_mask)
    text_embedding = tf.keras.layers.LayerNormalization(
        name='Emdedding_FE'
    )(pooled_output)
    fe_model = tf.keras.Model(
        inputs=word_ids,
        outputs=text_embedding,
        name='FeatureExtractor'
    )
    fe_model.build(input_shape=(None, max_len))
    return fe_model
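
The pooling step of the feature extractor averages the token vectors of real tokens only, ignoring padding. Below is a NumPy sketch of such masked average pooling; the function name `masked_mean_pool` and the guard against an all-padding row are assumptions, and the layer normalization that follows in the model is left out.

```python
import numpy as np


def masked_mean_pool(sequence_output: np.ndarray,
                     attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over positions where the mask equals 1.

    sequence_output has shape (batch, tokens, features),
    attention_mask has shape (batch, tokens).
    """
    mask = attention_mask.astype(sequence_output.dtype)[:, :, np.newaxis]
    summed = (sequence_output * mask).sum(axis=1)
    # Guard against division by zero for rows that consist of padding only
    counts = np.maximum(mask.sum(axis=1), 1.0)
    return summed / counts


hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])  # the third position is padding
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded position does not affect the result, so texts of different lengths get comparable sentence embeddings.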

In [None]:
def euclidean_distance(vects):
    x, y = vects
    sum_square = tf.keras.backend.sum(tf.keras.backend.square(x - y),
                                      axis=1, keepdims=True)
    return tf.keras.backend.sqrt(
        tf.keras.backend.maximum(sum_square, tf.keras.backend.epsilon())
    )

In [None]:
def eucl_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)

**def build_siamese_nn(...)**

This function builds a Siamese neural network, which consists of two XLM-RoBERTa sentence embedders with shared weights (the XLM-RoBERTa sentence embedder is created using the *build_feature_extractor* function), a Euclidean distance layer, and a distance-based logistic loss.

When I compile the *tf.keras.Model* object, I use [Adam](https://arxiv.org/abs/1711.05101) as the optimizer with a cyclical learning rate (the "triangular2" policy), which oscillates between 1e-6 and 5e-5.

In [None]:
def build_siamese_nn(transformer_name: str, max_len: int, padding: int,
                     stepsize: int) -> Tuple[tf.keras.Model, tf.keras.Model]:
    left_word_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32,
                                          name="left_word_ids")
    right_word_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32,
                                           name="right_word_ids")
    fe_ = build_feature_extractor(
        transformer_name=transformer_name,
        padding=padding,
        max_len=max_len
    )
    left_text_embedding = fe_(left_word_ids)
    right_text_embedding = fe_(right_word_ids)
    distance_layer = tf.keras.layers.Lambda(
        function=euclidean_distance,
        output_shape=eucl_dist_output_shape,
        name='L2DistLayer'
    )([left_text_embedding, right_text_embedding])
    nn = tf.keras.Model(
        inputs=[left_word_ids, right_word_ids],
        outputs=distance_layer,
        name='SiameseXLMR'
    )
    lr_schedule = tfa.optimizers.Triangular2CyclicalLearningRate(
        initial_learning_rate=1e-6,
        maximal_learning_rate=5e-5,
        step_size=3 * stepsize
    )
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
    nn.compile(
        optimizer=optimizer,
        loss=DBLLogLoss()
    )
    fe_.summary()
    print('')
    nn.summary()
    return nn, fe_
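
The shape of the "triangular2" learning rate schedule can be sketched in pure Python. This follows the formula as I understand it from the cyclical learning rate policy: the rate rises linearly from the initial to the maximal value and back within one cycle of `2 * step_size` steps, and the amplitude is halved after each cycle. The function name `triangular2_lr` is an assumption; the sketch is not the library implementation.

```python
import math


def triangular2_lr(step: int, initial_lr: float, maximal_lr: float,
                   step_size: int) -> float:
    """Learning rate at a given step for the triangular2 cyclical policy."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    scale = 1.0 / (2.0 ** (cycle - 1))  # amplitude is halved every cycle
    return initial_lr + (maximal_lr - initial_lr) * max(0.0, 1.0 - x) * scale


for cur_step in [0, 300, 600, 900]:
    print(cur_step, triangular2_lr(cur_step, 1e-6, 5e-5, step_size=300))
```

With `step_size=300` the first peak is reached at step 300 and the second, lower peak at step 900.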

**def show_training_process(...)**

This function shows a training metric curve and a validation one using a Tensorflow log (i.e. the *tf.keras.callbacks.History* object). A kind of metric (loss, accuracy, or any other measure) is specified by the additional parameter *metric_name*.

In [None]:
def show_training_process(history: tf.keras.callbacks.History, metric_name: str,
                          figure_id: int=1):
    val_metric_name = 'val_' + metric_name
    err_msg = 'The metric "{0}" is not found! Available metrics are: {1}'.format(
        metric_name, list(history.history.keys()))
    assert metric_name in history.history, err_msg
    plt.figure(figure_id, figsize=(5, 5))
    plt.plot(list(range(len(history.history[metric_name]))),
             history.history[metric_name], label='Training {0}'.format(metric_name))
    if val_metric_name in history.history:
        assert len(history.history[metric_name]) == len(history.history[val_metric_name])
        plt.plot(list(range(len(history.history[val_metric_name]))),
                 history.history[val_metric_name], label='Validation {0}'.format(metric_name))
    plt.xlabel('Epochs')
    plt.ylabel(metric_name)
    plt.title('Training process')
    plt.legend(loc='best')
    plt.show()
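
The log that this function reads is a plain dictionary that maps a metric name to a list with one value per epoch. As a small illustration, the sketch below finds the best epoch in such a dictionary. The helper `best_epoch` and the toy values are hypothetical, and the sketch assumes that a smaller value is better, which holds for a loss but not for an accuracy.

```python
from typing import Dict, List, Tuple


def best_epoch(log: Dict[str, List[float]], metric_name: str) -> Tuple[int, float]:
    # Use the validation curve if it exists, otherwise the training curve.
    val_metric_name = 'val_' + metric_name
    key = val_metric_name if val_metric_name in log else metric_name
    values = log[key]
    # Index of the epoch with the minimal value (smaller is assumed to be better).
    best_idx = min(range(len(values)), key=lambda idx: values[idx])
    return best_idx, values[best_idx]


toy_log = {'loss': [0.9, 0.6, 0.5, 0.45], 'val_loss': [0.8, 0.7, 0.75, 0.72]}
print(best_epoch(toy_log, 'loss'))
```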

**def train_siamese_nn(...)**

This function trains a specified Siamese neural network. Two stopping criteria are used at the same time: 1) the "classical" early stopping criterion; 2) stopping when the maximal training duration (in seconds) is exceeded. The best weights of the neural network, i.e. the weights at the epoch with the minimal validation loss, are saved in the binary file *model_weights_path*.

In [None]:
def train_siamese_nn(nn: tf.keras.Model, trainset: tf.data.Dataset, steps_per_trainset: int,
                     steps_per_epoch: int, validset: tf.data.Dataset, max_duration: int,
                     model_weights_path: str):
    assert steps_per_trainset >= steps_per_epoch
    n_epochs = max(30, int(round(10.0 * steps_per_trainset / float(steps_per_epoch))))
    print('Maximal duration of the Siamese XLM-R training is {0} '\
          'seconds.'.format(max_duration))
    callbacks = [
        tf.keras.callbacks.EarlyStopping(patience=9, monitor='val_loss', mode='min',
                                         restore_best_weights=False, verbose=True),
        tf.keras.callbacks.ModelCheckpoint(model_weights_path, monitor='val_loss',
                                           mode='min', save_best_only=True,
                                           save_weights_only=True, verbose=True),
        tfa.callbacks.TimeStopping(seconds=max_duration, verbose=True)
    ]
    history = nn.fit(
        trainset,
        steps_per_epoch=steps_per_epoch,
        validation_data=validset,
        epochs=n_epochs,
        callbacks=callbacks
    )
    show_training_process(history, 'loss')
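
To make the interplay of the two stopping criteria clearer, here is a simplified simulation in plain Python. It is a sketch under my own assumptions (a constant epoch duration, a check of both criteria at the end of every epoch) and not the actual code of the Keras or TensorFlow Addons callbacks. The function name `epochs_until_stop` is hypothetical.

```python
from typing import List


def epochs_until_stop(val_losses: List[float], patience: int,
                      epoch_seconds: float, max_duration: float) -> int:
    """Return the number of epochs completed before training stops."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    elapsed = 0.0
    for epoch_idx, cur_loss in enumerate(val_losses):
        elapsed += epoch_seconds
        if cur_loss < best_loss:
            best_loss = cur_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Criterion 1: no improvement of the validation loss for too long.
        if epochs_without_improvement >= patience:
            return epoch_idx + 1
        # Criterion 2: the time budget is exhausted.
        if elapsed >= max_duration:
            return epoch_idx + 1
    return len(val_losses)


print(epochs_until_stop([1.0, 0.8, 0.9, 0.85, 0.95], patience=2,
                        epoch_seconds=10.0, max_duration=1000.0))
print(epochs_until_stop([5.0, 4.0, 3.0, 2.0, 1.0], patience=9,
                        epoch_seconds=10.0, max_duration=25.0))
```

In the first call the training is stopped by the early stopping criterion, and in the second call it is stopped by the time limit.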

**def calculate_features_of_texts(...)**

This function calculates sentence embeddings (i.e. fixed-size semantic vectors for sentences) using a trained XLM-RoBERTa-based feature extractor. Input texts with their toxicity labels (or integer identifiers instead of labels) are specified in the same format as the results of the *load_train_set* and the *load_test_set* functions.

The returned object is a dictionary, where a key is a language and a value is a tuple that consists of two NumPy arrays (the first of them is a matrix of sentence vectors, and the second of them is a 1D array of integer labels or identifiers).

In [None]:
def calculate_features_of_texts(texts: Dict[str, List[Tuple[str, int]]],
                                tokenizer: XLMRobertaTokenizer, maxlen: int,
                                fe: tf.keras.Model, batch_size: int,
                                max_dataset_size: int = 0) -> \
        Dict[str, Tuple[np.ndarray, np.ndarray]]:
    languages = sorted(list(texts.keys()))
    datasets_by_languages = dict()
    if max_dataset_size > 0:
        max_size_per_lang = max_dataset_size // len(languages)
        err_msg = '{0} is too small number of dataset samples!'.format(max_dataset_size)
        assert max_size_per_lang > 0, err_msg
    else:
        max_size_per_lang = 0
    print('Number of languages is {0}.'.format(len(texts)))
    for cur_lang in languages:
        print('')
        print('Language "{0}": featurizing is started.'.format(cur_lang))
        selected_indices = list(range(len(texts[cur_lang])))
        print('Number of texts is {0}.'.format(len(selected_indices)))
        if max_size_per_lang > 0:
            if len(selected_indices) > max_size_per_lang:
                selected_indices = random.sample(
                    population=selected_indices,
                    k=max_size_per_lang
                )
        tokens_of_texts = tokenize_all(
            texts=[texts[cur_lang][idx][0] for idx in selected_indices],
            tokenizer=tokenizer, maxlen=maxlen
        )
        tokens_of_texts = np.array(tokens_of_texts, dtype=np.int32)
        print('')
        print('3 examples of texts after tokenization:')
        for _ in range(3):
            idx = random.randint(0, len(selected_indices) - 1)
            print('  {0}'.format(texts[cur_lang][selected_indices[idx]][0]))
            print('  {0}'.format(tokens_of_texts[idx].tolist()))
            print('')
        X = []
        n_batches = int(np.ceil(len(selected_indices) / float(batch_size)))
        if n_batches >= 10:
            n_data_parts = 10
            data_part_size = int(np.ceil(n_batches / float(n_data_parts)))
        else:
            n_data_parts = 0
        data_part_counter = 0
        for batch_idx in range(n_batches):
            batch_start = batch_idx * batch_size
            batch_end = min(len(selected_indices), batch_start + batch_size)
            res = fe.predict_on_batch(tokens_of_texts[batch_start:batch_end])
            if not isinstance(res, np.ndarray):
                res = res.numpy()
            X.append(res)
            del res
            if n_data_parts > 0:
                if (batch_idx + 1) % data_part_size == 0:
                    data_part_counter += 1
                    print('  {0}% of texts are featured.'.format(data_part_counter * 10))
        if (n_data_parts > 0) and (data_part_counter < n_data_parts):
            print('  100% of texts are featured.')
        X = np.vstack(X)
        y = np.array([texts[cur_lang][idx][1] for idx in selected_indices], dtype=np.int32)
        datasets_by_languages[cur_lang] = (X, y)
        del X, y, selected_indices
        print('Language "{0}": featurizing is finished.'.format(cur_lang))
    return datasets_by_languages
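
The batching logic inside this function can be shown separately. The sketch below only reproduces the index arithmetic (how many mini-batches there are and where each of them starts and ends). The helper `batch_bounds` is hypothetical and is not used anywhere in this notebook.

```python
import math
from typing import List, Tuple


def batch_bounds(n_samples: int, batch_size: int) -> List[Tuple[int, int]]:
    # The last mini-batch may be smaller than the others.
    n_batches = math.ceil(n_samples / batch_size)
    return [
        (batch_idx * batch_size, min(n_samples, (batch_idx + 1) * batch_size))
        for batch_idx in range(n_batches)
    ]


print(batch_bounds(10, 4))
```

Each pair is a half-open interval that can be used directly as a slice of the token matrix.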

**def generate_featured_data(...)**

This function prepares all generated feature vectors for training and submission in a form that is more convenient for future experiments with various classifiers in the new feature space. The result of the function is a 3-element tuple (or a 2-element tuple without the last item if no data for submission are specified):

1. training data as two NumPy arrays: a matrix of feature vectors for training the final classifier and a vector of the corresponding toxicity labels;
2. a split of the training data by languages as a dictionary that maps a language name to train/test indices (so, one cross-validation fold corresponds to one language);
3. data for submission as two NumPy arrays: a matrix of feature vectors as inputs for the final classifier and a vector of the corresponding identifiers of submission samples.

In [None]:
def generate_featured_data(
    features_by_lang: Dict[str, Tuple[np.ndarray, np.ndarray]],
    features_for_submission: Union[Dict[str, Tuple[np.ndarray, np.ndarray]], None] = None,
) -> Union[
    Tuple[Tuple[np.ndarray, np.ndarray], Dict[str, Tuple[np.ndarray, np.ndarray]]],
    Tuple[Tuple[np.ndarray, np.ndarray], Dict[str, Tuple[np.ndarray, np.ndarray]],
          Tuple[np.ndarray, np.ndarray]]
]:
    X_embedded = []
    y_embedded = []
    split_by_languages = dict()
    start_pos = 0
    for cur_lang in features_by_lang:
        X_embedded.append(features_by_lang[cur_lang][0])
        y_embedded.append(features_by_lang[cur_lang][1])
        split_by_languages[cur_lang] = (
            set(),
            set(range(start_pos, start_pos + features_by_lang[cur_lang][1].shape[0]))
        )
        start_pos = start_pos + features_by_lang[cur_lang][1].shape[0]
    featured_data_for_training = (
        np.vstack(X_embedded),
        np.concatenate(y_embedded)
    )
    del X_embedded, y_embedded
    err_msg = '{0} != {1}'.format(featured_data_for_training[0].shape[0],
                                  featured_data_for_training[1].shape[0])
    assert featured_data_for_training[0].shape[0] == featured_data_for_training[1].shape[0], err_msg
    for cur_lang in features_by_lang:
        indices_for_testing = split_by_languages[cur_lang][1]
        indices_for_training = set(range(featured_data_for_training[0].shape[0]))
        indices_for_training -= indices_for_testing
        split_by_languages[cur_lang] = (
            np.array(sorted(list(indices_for_training)), dtype=np.int32),
            np.array(sorted(list(indices_for_testing)), dtype=np.int32)
        )
        del indices_for_training, indices_for_testing
    all_languages = sorted(list(split_by_languages.keys()))
    prev_lang = all_languages[0]
    assert len(set(split_by_languages[prev_lang][1].tolist()) & \
               set(split_by_languages[prev_lang][0].tolist())) == 0
    for cur_lang in all_languages[1:]:
        assert len(set(split_by_languages[cur_lang][1].tolist()) & \
                   set(split_by_languages[cur_lang][0].tolist())) == 0
        assert len(set(split_by_languages[cur_lang][1].tolist()) & \
                   set(split_by_languages[prev_lang][1].tolist())) == 0
        prev_lang = cur_lang
    if features_for_submission is None:
        return featured_data_for_training, split_by_languages
    featured_inputs_for_submission = []
    identifies_for_submission = []
    for cur_lang in features_for_submission:
        featured_inputs_for_submission.append(features_for_submission[cur_lang][0])
        identifies_for_submission.append(features_for_submission[cur_lang][1])
    featured_data_for_submission = (
        np.vstack(featured_inputs_for_submission),
        np.concatenate(identifies_for_submission)
    )
    del featured_inputs_for_submission, identifies_for_submission
    n_samples_for_submission = featured_data_for_submission[0].shape[0]
    n_IDs_for_submission = featured_data_for_submission[1].shape[0]
    err_msg = '{0} != {1}'.format(n_samples_for_submission, n_IDs_for_submission)
    assert n_samples_for_submission == n_IDs_for_submission, err_msg
    return (featured_data_for_training, split_by_languages, featured_data_for_submission)
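
The "one fold per language" splitting can be illustrated on a toy example without NumPy. This is a simplified sketch of the idea and not the function above. The helper `split_by_language` is hypothetical, and the language codes and sizes are made up.

```python
from typing import Dict, List, Tuple


def split_by_language(sizes: Dict[str, int]) -> Dict[str, Tuple[List[int], List[int]]]:
    # Samples of all languages are stacked one after another, so each
    # language occupies a contiguous range of row indices.
    n_total = sum(sizes.values())
    splits = dict()
    start_pos = 0
    for cur_lang, cur_size in sizes.items():
        end_pos = start_pos + cur_size
        test_indices = list(range(start_pos, end_pos))
        train_indices = [idx for idx in range(n_total)
                         if not (start_pos <= idx < end_pos)]
        splits[cur_lang] = (train_indices, test_indices)
        start_pos = end_pos
    return splits


toy_splits = split_by_language({'es': 2, 'it': 3, 'tr': 1})
for cur_lang in toy_splits:
    print(cur_lang, toy_splits[cur_lang])
```

For every language, the test part contains only the samples of this language, and the training part contains the samples of all other languages.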

**def calculate_projections(...)**

This function calculates t-SNE projections of labeled data and visualizes them in the 2D space. If there are more than 3000 samples, then only a stratified subsample of 1500 samples is projected.

In [None]:
def calculate_projections(labeled_data: Tuple[np.ndarray, np.ndarray],
                          additional_title: str):
    X_prj = labeled_data[0]
    y_prj = labeled_data[1]
    assert len(X_prj.shape) == 2
    assert len(y_prj.shape) == 1
    n_samples = X_prj.shape[0]
    err_msg = '{0} != {1}'.format(n_samples, y_prj.shape[0])
    assert n_samples == y_prj.shape[0], err_msg
    if n_samples > 3000:
        test_size = 1500.0 / float(n_samples)
        _, X_prj, _, y_prj = train_test_split(X_prj, y_prj, test_size=test_size,
                                              random_state=42, stratify=y_prj)
    X_prj = TSNE(n_components=2, n_jobs=-1).fit_transform(X_prj)
    plt.figure(figsize=(10, 10))
    indices_of_negative_classes = list(filter(
        lambda sample_idx: y_prj[sample_idx] == 0,
        range(y_prj.shape[0])
    ))
    xy = X_prj[indices_of_negative_classes]
    plt.plot(xy[:, 0], xy[:, 1], 'o', color='g', markersize=4,
             label='Normal texts')
    indices_of_positive_classes = list(filter(
        lambda sample_idx: y_prj[sample_idx] > 0,
        range(y_prj.shape[0])
    ))
    xy = X_prj[indices_of_positive_classes]
    plt.plot(xy[:, 0], xy[:, 1], 'o', color='r', markersize=6,
             label='Toxic texts')
    if len(additional_title) > 0:
        if additional_title[0].isalnum():
            plt.title('Toxic and normal texts {0}'.format(additional_title))
        else:
            plt.title('Toxic and normal texts{0}'.format(additional_title))
    else:
        plt.title('Toxic and normal texts')
    plt.legend(loc='best')
    plt.show()
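
The stratified subsampling, which is done above with *train_test_split*, keeps the ratio of toxic and normal texts. The sketch below shows the same idea in plain Python. It is a simplified stand-in and not the scikit-learn algorithm: the helper `stratified_subsample` is hypothetical, and because of rounding the total size of the result can differ slightly from the requested one.

```python
import random
from collections import Counter, defaultdict
from typing import List


def stratified_subsample(labels: List[int], n_target: int, seed: int = 42) -> List[int]:
    rng = random.Random(seed)
    # Group sample indices by their class label.
    indices_by_class = defaultdict(list)
    for idx, cur_label in enumerate(labels):
        indices_by_class[cur_label].append(idx)
    selected = []
    for cur_label in sorted(indices_by_class):
        # Each class gets a quota that is proportional to its frequency.
        class_indices = indices_by_class[cur_label]
        quota = round(n_target * len(class_indices) / len(labels))
        selected += rng.sample(class_indices, quota)
    return sorted(selected)


toy_labels = [0] * 900 + [1] * 100
subset = stratified_subsample(toy_labels, n_target=100)
print(Counter(toy_labels[idx] for idx in subset))
```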

# Declaration of all functions is finished, and now I start to write the main code

I record the start time of the experiments.

In [None]:
experiment_start_time = time.time()

I detect the hardware for my experiments (TPU or GPU) and create a corresponding [distribution strategy](https://www.tensorflow.org/guide/distributed_training) as a special TensorFlow object.

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None
if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()
    physical_devices = tf.config.list_physical_devices('GPU')
    # Iterate over the detected GPUs themselves: the list is empty on a CPU-only machine.
    for cur_device in physical_devices:
        tf.config.experimental.set_memory_growth(cur_device, True)
model_name = 'jplu/tf-xlm-roberta-base'
max_seq_len = 128
batch_size_for_siamese = 32 * strategy.num_replicas_in_sync
print("REPLICAS: ", strategy.num_replicas_in_sync)
print('Model name: {0}'.format(model_name))
print('Maximal length of sequence is {0}'.format(max_seq_len))
print('Batch size for the Siamese XLM-RoBERTa is {0}'.format(
    batch_size_for_siamese))

I initialize a seed for all pseudo-random generators. This is very important for the reproducibility of experiments!

In [None]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
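
The effect of a fixed seed is easy to check on the standard Python generator. In this small sketch I use a local `random.Random` object instead of the global generator, and the helper `draw_sample` is hypothetical.

```python
import random
from typing import List


def draw_sample(seed: int, population_size: int = 1000, k: int = 5) -> List[int]:
    # A generator with the same seed always produces the same sequence.
    rng = random.Random(seed)
    return rng.sample(range(population_size), k)


first_run = draw_sample(seed=42)
second_run = draw_sample(seed=42)
print(first_run == second_run)
```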

I set all file paths to the input data and to all generated results, i.e. the trained Siamese XLM-R and the featured data for the competition, which are calculated using this model.

In [None]:
dataset_dir = '/kaggle/input/jigsaw-multilingual-toxic-comment-classification'
tmp_roberta_name = '/kaggle/working/siamese_xlmr.h5'
feature_extractor_name = '/kaggle/working/xlmr_fe.h5'
tmp_features_name = '/kaggle/working/features_by_siamese_xlmr.pkl'

I download meta-information about the selected XLM-R from Hugging Face Transformers, and I prepare the configuration and the tokenizer according to this meta-information.

In [None]:
xlmroberta_tokenizer = AutoTokenizer.from_pretrained(model_name)
xlmroberta_config = XLMRobertaConfig.from_pretrained(model_name)
print(xlmroberta_config)
print('xlmroberta_tokenizer.pad_token_id',
      xlmroberta_tokenizer.pad_token_id)

I get the sentence embedding size from the XLM-R configuration, and I check that the maximal length of a token sequence does not exceed the limit specified in this configuration.

In [None]:
sentence_embedding_size = xlmroberta_config.hidden_size
print('Sentence embedding size is {0}'.format(sentence_embedding_size))
assert max_seq_len <= xlmroberta_config.max_position_embeddings

I load data for training. These data must contain labeled English texts.

In [None]:
corpus_for_training = load_train_set(
    os.path.join(dataset_dir, "jigsaw-unintended-bias-train.csv"),
    text_field="comment_text", lang_field="lang",
    sentiment_fields=["toxic", "severe_toxicity", "obscene", "identity_attack",
                      "insult", "threat"]
)
assert 'en' in corpus_for_training

In addition to the abovementioned English data, I load multilingual data for training. These data must contain texts in three non-English languages.

In [None]:
multilingual_corpus = load_train_set(
    os.path.join(dataset_dir, "validation.csv"),
    text_field="comment_text", lang_field="lang", sentiment_fields=["toxic", ]
)
assert 'en' not in multilingual_corpus
max_size = 0
print('Multilingual data:')
for language in sorted(list(multilingual_corpus.keys())):
    print('  {0}\t\t{1} samples'.format(language, len(multilingual_corpus[language])))
    assert set(map(lambda cur: cur[1], multilingual_corpus[language])) == {0, 1}
    if len(multilingual_corpus[language]) > max_size:
        max_size = len(multilingual_corpus[language])

I split the English text corpus into three parts. The first part will enrich the multilingual corpus for training (as the fourth language). The second part will be used as a data source for validating the Siamese network in order to implement early stopping. Finally, the third part, which is the largest, will be used as a data source for the training set of the Siamese network.

In [None]:
err_msg = 'Size of English corpus = {0} is too small!'.format(
    len(corpus_for_training['en'])
)
assert len(corpus_for_training['en']) >= (max_size * 10), err_msg
random.shuffle(corpus_for_training['en'])
multilingual_corpus['en'] = corpus_for_training['en'][:max_size]
n_validation = int(round(0.1 * (len(corpus_for_training['en']) - max_size)))
corpus_for_validation = {'en': corpus_for_training['en'][max_size:(max_size + n_validation)]}
corpus_for_training = {'en': corpus_for_training['en'][(n_validation + max_size):]}
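
The sizes of the three parts are easier to see on a toy corpus. The sketch below repeats the slicing logic from the cell above. The helper `split_english_corpus`, the toy corpus of 1000 texts, and the value `max_size=50` are hypothetical.

```python
import random
from typing import List, Tuple


def split_english_corpus(corpus: List[Tuple[str, int]], max_size: int, seed: int = 42):
    assert len(corpus) >= max_size * 10
    shuffled = list(corpus)
    random.Random(seed).shuffle(shuffled)
    # 10% of the texts that remain after the first part go to validation.
    n_validation = int(round(0.1 * (len(shuffled) - max_size)))
    extra_part = shuffled[:max_size]
    validation_part = shuffled[max_size:(max_size + n_validation)]
    training_part = shuffled[(max_size + n_validation):]
    return extra_part, validation_part, training_part


toy_corpus = [('text {0}'.format(idx), idx % 2) for idx in range(1000)]
parts = split_english_corpus(toy_corpus, max_size=50)
print([len(cur_part) for cur_part in parts])
```

The three parts do not intersect, and together they cover the whole corpus.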

I load multilingual data for the final toxicity classification and for submission to the competition.

In [None]:
texts_for_submission = load_test_set(
    os.path.join(dataset_dir, "test.csv"),
    text_field="content", lang_field="lang", id_field="id"
)
print('Data for submission:')
for language in sorted(list(texts_for_submission.keys())):
    print('  {0}\t\t{1} samples'.format(language, len(texts_for_submission[language])))

I prepare a dataset for the Siamese XLM-R training (the number of labeled pairs in this dataset is equal to 500000).

In [None]:
dataset_for_training, n_batches_per_data = build_siamese_dataset(
    texts=corpus_for_training, dataset_size=500000,
    tokenizer=xlmroberta_tokenizer, maxlen=max_seq_len,
    batch_size=batch_size_for_siamese, shuffle=True
)

I prepare a dataset for the Siamese XLM-R validation during training (the number of labeled pairs in this dataset is equal to 1000).

In [None]:
dataset_for_validation, n_batches_per_epoch = build_siamese_dataset(
    texts=corpus_for_validation, dataset_size=1000,
    tokenizer=xlmroberta_tokenizer, maxlen=max_seq_len,
    batch_size=batch_size_for_siamese, shuffle=False
)

I set the number of steps (mini-batches) per single training epoch to a value that does not exceed the full number of mini-batches in the training data, because waiting for the whole training set to be processed in every epoch is not practical. Specifically, one epoch contains at most 20 times as many mini-batches as the validation set.

In [None]:
steps_per_single_epoch = min(20 * n_batches_per_epoch, n_batches_per_data)
print('Number of steps (mini-batches) per single epoch is {0}.'.format(
    steps_per_single_epoch
))
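
A worked example of this arithmetic is given below. It is only a sketch under two assumptions: a single replica (so the batch size is 32), and the number of mini-batches returned by *build_siamese_dataset* being the ceiling of the number of pairs divided by the batch size. The real values depend on the hardware and on that function.

```python
import math

batch_size = 32  # assumption: a single replica
n_training_pairs = 500000
n_validation_pairs = 1000

# Assumption: the number of mini-batches is a ceiling of pairs / batch size.
n_batches_per_data = math.ceil(n_training_pairs / batch_size)
n_batches_per_epoch = math.ceil(n_validation_pairs / batch_size)

# The same expressions as in this notebook.
steps_per_single_epoch = min(20 * n_batches_per_epoch, n_batches_per_data)
n_epochs = max(30, int(round(10.0 * n_batches_per_data / float(steps_per_single_epoch))))
print(n_batches_per_data, n_batches_per_epoch, steps_per_single_epoch, n_epochs)
```

Under these assumptions, one epoch takes 640 steps instead of 15625, and the maximal number of epochs corresponds to about ten passes over the training data.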

I delete all data which becomes unnecessary and call the garbage collector.

In [None]:
del corpus_for_training, corpus_for_validation
gc.collect()

I record the duration of all preparation procedures implemented in the code above.

In [None]:
preparing_end_time = time.time()
preparing_duration = int(round(preparing_end_time - experiment_start_time))
print("Duration of data loading and preparing to the Siamese NN training is "
      "{0} seconds.".format(preparing_duration))

I build the Siamese XLM-R.

In [None]:
with strategy.scope():
    siamese_network, feature_extractor = build_siamese_nn(
        transformer_name=model_name,
        max_len=max_seq_len,
        padding=xlmroberta_tokenizer.pad_token_id,
        stepsize=steps_per_single_epoch
    )

I calculate the featured training set for the final classifier using an untrained XLM-R-based feature extractor (i.e. pre-trained, but not yet fine-tuned as a part of the Siamese network). In fact, these data are not needed for training any classifier; they are needed only to show their 2D projections.

In [None]:
dataset_for_training_ = calculate_features_of_texts(
    texts=multilingual_corpus,
    tokenizer=xlmroberta_tokenizer, maxlen=max_seq_len,
    fe=feature_extractor,
    batch_size=batch_size_for_siamese
)
assert len(dataset_for_training_) == 4

I simplify the abovementioned data by transforming them from a dictionary by languages to a "normal" pair of NumPy arrays (*X* and *y*).

In [None]:
data_for_cls_training, _ = generate_featured_data(dataset_for_training_)

I delete all data which becomes unnecessary and call the garbage collector.

In [None]:
del dataset_for_training_
gc.collect()

I show projections of data, which are featured by an untrained XLM-R.

In [None]:
calculate_projections(data_for_cls_training, 'before XLM-R training')

I delete all data which becomes unnecessary and call the garbage collector.

In [None]:
del data_for_cls_training
gc.collect()

I record the duration of all data projection procedures.

In [None]:
projecting_duration = int(round(time.time() - preparing_end_time))
print("Duration of data projection is {0} seconds.".format(projecting_duration))

Lastly, I train my Siamese XLM-R.

In [None]:
train_siamese_nn(nn=siamese_network, trainset=dataset_for_training,
                 steps_per_trainset=n_batches_per_data,
                 steps_per_epoch=steps_per_single_epoch,
                 validset=dataset_for_validation,
                 max_duration=int(round(
                     3600 * 2.5 - preparing_duration - 3.0 * projecting_duration
                 )),
                 model_weights_path=tmp_roberta_name)
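
The time limit for training is what remains of a 2.5-hour budget after the preparation time and the tripled projecting duration are subtracted. The tripled projecting duration can be read as a reserve for the featurizing and projection steps that follow the training. The sketch below repeats this arithmetic. The helper `training_time_budget` is hypothetical, and the sample durations are made up.

```python
def training_time_budget(preparing_duration: int, projecting_duration: int,
                         total_hours: float = 2.5) -> int:
    # All durations are in seconds, the total budget is in hours.
    return int(round(
        3600 * total_hours - preparing_duration - 3.0 * projecting_duration
    ))


print(training_time_budget(preparing_duration=600, projecting_duration=200))
```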

I delete all data which becomes unnecessary and call the garbage collector.

In [None]:
del dataset_for_training
del dataset_for_validation
gc.collect()

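I load the best weights of the Siamese XLM-R, found during the training process, from the HDF5 binary data file. The feature extractor shares its layers with the Siamese network, so it gets these weights too, and I save them to a separate file.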

In [None]:
siamese_network.load_weights(tmp_roberta_name)
os.remove(tmp_roberta_name)
feature_extractor.save_weights(feature_extractor_name)

I delete the Siamese XLM-R (but not the XLM-R-based feature extractor!), and after that, I call the garbage collector.

In [None]:
del siamese_network
gc.collect()

I calculate the featured training set for the final classifier using a trained XLM-R-based feature extractor.

In [None]:
dataset_for_training_ = calculate_features_of_texts(
    texts=multilingual_corpus,
    tokenizer=xlmroberta_tokenizer, maxlen=max_seq_len,
    fe=feature_extractor,
    batch_size=batch_size_for_siamese
)
assert len(dataset_for_training_) == 4

Also, I calculate the featured data set, which will be used to submit predictions to the competition using some final classifier, trained on the abovementioned featured training set.

In [None]:
dataset_for_submission_ = calculate_features_of_texts(
    texts=texts_for_submission,
    tokenizer=xlmroberta_tokenizer, maxlen=max_seq_len,
    fe=feature_extractor,
    batch_size=batch_size_for_siamese
)

I delete the XLM-R-based feature extractor and the corresponding tokenizer. They are no longer needed.

In [None]:
del feature_extractor, xlmroberta_tokenizer

I simplify the abovementioned data for training the final classifier and for submission with this classifier. To do this, I transform these data from dictionaries by languages to "normal" pairs of NumPy arrays (X and y). In this way, I also get a CV split of the training data.

In [None]:
data_for_cls_training, cv_splitting, submission_data_for_cls = generate_featured_data(
    dataset_for_training_,
    dataset_for_submission_
)

I delete all data which becomes unnecessary and call the garbage collector.

In [None]:
del dataset_for_training_, dataset_for_submission_
gc.collect()

I show projections of data, which are featured by a trained XLM-R.

In [None]:
calculate_projections(data_for_cls_training, 'after training of XLM-R as a Siamese network')

Finally, I save all featured data in a binary file. After that, I can re-use these data in any experiment with the final classifier. The Siamese XLM-R is no longer needed because it has done its work!

In [None]:
with open(tmp_features_name, 'wb') as fp:
    pickle.dump(
        obj=(data_for_cls_training, cv_splitting, submission_data_for_cls),
        file=fp,
        protocol=pickle.HIGHEST_PROTOCOL
    )
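
To re-use the saved data later, it is enough to unpickle the same 3-element tuple. The sketch below shows such a round trip on toy data in a temporary directory. Plain Python lists stand in for the NumPy arrays, and all names and values here are hypothetical.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real featured data (lists instead of NumPy arrays).
toy_training = ([[0.1, 0.2], [0.3, 0.4]], [0, 1])
toy_splitting = {'es': ([1], [0]), 'it': ([0], [1])}
toy_submission = ([[0.5, 0.6]], [123])

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_features.pkl')
    with open(toy_path, 'wb') as fp:
        pickle.dump(
            obj=(toy_training, toy_splitting, toy_submission),
            file=fp,
            protocol=pickle.HIGHEST_PROTOCOL
        )
    # The tuple is restored in the same order in which it was saved.
    with open(toy_path, 'rb') as fp:
        restored_training, restored_splitting, restored_submission = pickle.load(fp)

print(restored_splitting['es'])
```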