In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
import transformers
from transformers import TFBertModel
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.optimizers import Adam
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Jigsaw Multilingual Comment Classification is an NLP competition. Among other things, NLP tasks raise two fundamental questions:

1. How to represent words and sentences numerically, which is a problem specific to NLP
2. Which model to apply on top of the sentence representation, which is a regular ML task

As for word representations, the data science community has done a great deal of research in this field. Several approaches exist, starting with sentence embeddings built from word counts.

a) Bag of words: the sentence embedding is a vector of shape (vocab_size, 1), where each cell holds the number of times the corresponding vocabulary word appears in the sentence. (The vocabulary is a dictionary that maps every unique word or token to a unique index.)
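
As a minimal sketch of the idea, the snippet below builds bag-of-words vectors for a toy corpus. The corpus, the whitespace tokenization, and the `bag_of_words` helper are assumptions made for illustration; real pipelines normally use a proper tokenizer.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Return a count vector of length len(vocabulary) for one sentence."""
    counts = Counter(sentence.lower().split())
    # Position i of the vector holds the count of the word with index i.
    return [counts[word] for word in vocabulary]

# Toy corpus, made up for illustration only.
corpus = ["the cat sat on the mat", "the dog sat"]
# Vocabulary: every unique word, sorted so that the indexes are reproducible.
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})

vectors = [bag_of_words(sentence, vocabulary) for sentence in corpus]
print(vocabulary)
print(vectors)
```

Every vector has as many entries as the vocabulary has words, so the vectors get long and sparse as the vocabulary grows.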

b) TF-IDF: a more sophisticated method, still based on word counts, that uses the frequency of a word in a particular sentence and in the whole dataset rather than raw counts. TF stands for term frequency. For a word or token with count $W_t$ in a sentence it is measured as $\frac{W_{t}}{\sum_{k=1}^{K}W_{k}}$, where $K$ is the number of unique words in the sentence; in other words, it is the frequency of the word in the sentence. IDF stands for inverse document frequency and is measured for a word as $\log\frac{\mathrm{len}(D)}{\mathrm{len}(D_i \in D \mid W_t \in D_i)}$, where $\mathrm{len}(D)$ is the number of sentences in the dataset and the denominator is the number of sentences in the dataset that contain the word. Then tf-idf = tf * idf. As a result, a word embedding has shape (number of sentences in dataset, 1) and a sentence embedding has shape (vocab_size, 1).
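
The formulas above can be sketched directly in plain Python. The `tf_idf` helper and the two-sentence corpus are hypothetical and exist only for illustration. Library implementations (scikit-learn, for example) typically apply smoothing and normalization, so their numbers will differ from this unsmoothed version.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: tf-idf} dict per sentence, using the plain formulas."""
    tokenized = [sentence.lower().split() for sentence in corpus]
    n_sentences = len(tokenized)
    # Document frequency: the number of sentences in which each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    result = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)  # equals the sum of counts over the unique words
        result.append({
            word: (count / total) * math.log(n_sentences / doc_freq[word])
            for word, count in counts.items()
        })
    return result

# Toy corpus, made up for illustration only.
corpus = ["the cat sat", "the dog barked"]
scores = tf_idf(corpus)
print(scores[0])
```

A word that occurs in every sentence ("the" here) gets idf = log(1) = 0, so its tf-idf score is zero no matter how often it appears.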

c) Word2Vec, GloVe, FastText: these methods build word representations so that words appearing in the same contexts get similar vectors, which helps a model pick up their semantics. The embedding dimension is chosen by the user and does not depend on the vocabulary size, which makes these embeddings memory friendly. The training procedure is hard to describe in a few words, so here is a nice guide to what word2vec does under the hood: https://medium.com/analytics-vidhya/maths-behind-word2vec-explained-38d74f32726b

For FastText: https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3
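
"Similar representations" is usually measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors that are invented for illustration and are not taken from any trained model. Real Word2Vec, GloVe, or FastText vectors often have a few hundred dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(round(sim_cat_dog, 3), round(sim_cat_car, 3))
```

With trained embeddings, words used in similar contexts are expected to score closer to 1 than unrelated words, which is the pattern these toy vectors were chosen to mimic.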

d) RNN-based embeddings. RNNs are a type of neural network built to handle sequential data such as text and time series. This link explains the architecture in more detail: https://towardsdatascience.com/recurrent-neural-networks-rnn-explained-the-eli5-way-3956887e8b75 

Vanilla recurrent neural networks have problems with the gradient signal: because of the architecture, it can explode or vanish. That is why modifications such as LSTM and GRU were proposed. This video is a great way to understand them: https://www.youtube.com/watch?v=8HyCNIVRbSU
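
A rough intuition for the exploding and vanishing behaviour: backpropagation through time multiplies the gradient by a similar factor at every time step. The sketch below is a deliberate simplification that replaces the per-step Jacobian matrix of a real RNN with a single scalar; `gradient_after` is a hypothetical helper.

```python
def gradient_after(steps, factor):
    """Scale an initial gradient of 1.0 by `factor` once per time step."""
    gradient = 1.0
    for _ in range(steps):
        gradient *= factor
    return gradient

vanishing = gradient_after(50, 0.9)  # factor below 1: the signal shrinks
exploding = gradient_after(50, 1.1)  # factor above 1: the signal blows up
print(vanishing, exploding)
```

Even factors this close to 1 change the gradient by orders of magnitude over 50 steps, which is why long sequences are hard for vanilla RNNs.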

However, LSTM and GRU still have a major drawback: they struggle to capture connections between words that are far apart in the text.

e) That is why a new generation of language models was born: transformers, especially those with a BERT-based architecture. These models reach state-of-the-art results on many language tasks, so I will use them in this work. BERT is a stack of encoder layers, each containing a multi-head attention mechanism, which lets BERT capture the semantics of words and sentences as well as long-range dependencies between words in a sequence. Below are links that I strongly recommend for understanding BERT and its underlying mechanisms.

Read these links first:


https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77


https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1


These are great visualizations of the attention mechanism used in BERT: 
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

This is a visual guide to the transformer architecture: 
http://jalammar.github.io/illustrated-transformer/

This is a BERT guide
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

Before being fed to BERT, the data is prepared in a special way. The text is split into tokens, and the [CLS] and [SEP] tokens are added to the beginning and the end of the sequence. Every token is then converted to its unique index in the model's vocabulary. The SEQUENCE_LENGTH parameter makes all sequences the same length: if a sequence is shorter than SEQUENCE_LENGTH, it is padded to that length with a special pad token, and if it is longer, it is truncated.
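
The padding and truncation step can be sketched without loading a tokenizer. The special-token ids below and the `prepare` helper are assumptions for illustration: in practice the ids should be read from the tokenizer (for example `tokenizer.cls_token_id`), and the competition's preprocessed files may have been truncated in a different way.

```python
# Special-token ids assumed for illustration; read the real ones from the
# tokenizer rather than hard-coding them.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def prepare(token_ids, sequence_length):
    """Add [CLS]/[SEP], then truncate or pad to exactly sequence_length."""
    # Reserve two positions for the special tokens before truncating.
    body = token_ids[: sequence_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    return ids + [PAD_ID] * (sequence_length - len(ids))

short = prepare([7, 8, 9], sequence_length=8)             # gets padded
long = prepare(list(range(10, 30)), sequence_length=8)    # gets truncated
print(short)
print(long)
```

Both results have exactly `sequence_length` entries, which is what lets the sequences be stacked into one rectangular array.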

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

sentence = "Hello what's up"
tokenized = tokenizer.encode(sentence)
print(tokenized)
print(tokenizer.convert_ids_to_tokens(tokenized))

In [None]:
DATA_PATH = "../input/jigsaw-multilingual-toxic-comment-classification/"
small_ds_processed_path = "jigsaw-toxic-comment-train-processed-seqlen128.csv"
val_path = "validation-processed-seqlen128.csv"

TPUs are now available on Kaggle to speed up training. To use one, we need to turn on the TPU in the notebook settings and write a few lines of code to enable it.

In [None]:
AUTO = tf.data.experimental.AUTOTUNE

tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
print(tpu_strategy.num_replicas_in_sync)

In [None]:
SEQUENCE_LENGTH = 128
BATCH_SIZE = 16 * tpu_strategy.num_replicas_in_sync

In [None]:
train = pd.read_csv(os.path.join(DATA_PATH, small_ds_processed_path))
val = pd.read_csv(os.path.join(DATA_PATH, val_path))
test = pd.read_csv(os.path.join(DATA_PATH, "test-processed-seqlen128.csv"))

The train dataset has many columns we don't need, so we'll drop them. The competition authors have already preprocessed and encoded the comments into the format BERT needs, so this notebook uses that version. Overall, we'll keep input_word_ids (the encoded token ids of the input comments) and the toxic column (the target variable).

In [None]:
train.head()

In [None]:
val.head()

The same applies to the validation dataset.

In [None]:
train = train[["input_word_ids", "toxic"]]
val = val[["input_word_ids", "toxic"]]

In [None]:
print("train")
print(train.dtypes)
print("validation")
print(val.dtypes)

Because of how the encoded comments were saved, input_word_ids has dtype object, while we need it to be an array of ints.
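
To see what the conversion has to do, here is the same idea applied to a single string. The value of `raw` is hypothetical: it only imitates the shape of a stored cell and is not copied from the dataset.

```python
# Hypothetical cell value, shaped like the strings stored in input_word_ids.
raw = "(101, 31178, 11356, 102, 0, 0)"

# Drop the surrounding parentheses, split on commas, convert each piece.
# int() ignores the whitespace left around each number.
ids = [int(piece) for piece in raw.strip("()").split(",")]
print(ids)
```

The pandas `.str` accessor applies the same strip and split steps to a whole column at once.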

In [None]:
train_comments = train["input_word_ids"]
val_comments = val["input_word_ids"]
test_comments = test["input_word_ids"]

In [None]:
train_comments = train_comments.str.strip("()").str.split(",",expand=True).astype(int).values
val_comments = val_comments.str.strip("()").str.split(",",expand=True).astype(int).values
test_comments = test_comments.str.strip("()").str.split(",",expand=True).astype(int).values

In [None]:
train_labels = train["toxic"]
val_labels = val["toxic"]

Now it's time to create tf.data.Dataset objects, an efficient way to store data and pass it to the model in batches during fitting. You can find more about this class and its methods here: https://www.tensorflow.org/api_docs/python/tf/data/Dataset

In [None]:
BUFFER_SIZE = len(train_comments)
train_ds = (tf.data.Dataset.from_tensor_slices((train_comments, train_labels))
            .shuffle(BUFFER_SIZE)
            .repeat()
            .batch(BATCH_SIZE)
            .prefetch(AUTO)
           )

val_ds = (tf.data.Dataset.from_tensor_slices((val_comments, val_labels))
          .shuffle(BUFFER_SIZE)
          .batch(BATCH_SIZE)
          .prefetch(AUTO)
         )
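
For intuition, the shuffle, repeat, and batch chain used for the training dataset can be imitated in plain Python. This is only an analogy under simplifying assumptions (a full-size shuffle buffer, no prefetching), not how tf.data is implemented; `element_stream` and `batched` are hypothetical helpers.

```python
import itertools
import random

def element_stream(examples, seed=0):
    """Yield examples forever, reshuffling before every pass over the data."""
    rng = random.Random(seed)
    while True:                 # plays the role of .repeat()
        order = list(examples)
        rng.shuffle(order)      # plays the role of .shuffle(len(examples))
        yield from order

def batched(stream, batch_size):
    """Group an endless stream into lists of batch_size elements."""
    while True:                 # plays the role of .batch(batch_size)
        yield list(itertools.islice(stream, batch_size))

examples = list(range(10))
batch_size = 4
steps_per_epoch = len(examples) // batch_size  # same arithmetic as N_STEPS

stream = batched(element_stream(examples), batch_size)
# The stream never ends, so the consumer decides where an epoch stops.
first_epoch = list(itertools.islice(stream, steps_per_epoch))
print(first_epoch)
```

Because a repeated dataset is endless, training has to be told how many batches make up one epoch, which is the role of `steps_per_epoch` in `model.fit`.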


So we are ready to define the model. Our data is preprocessed for BERT, so we'll use it as the fundamental layer. 
There is a great library for transformers that includes most state-of-the-art models along with pretrained weights for them,
which is very useful when you have limited data and processing power.

Here is the link: https://huggingface.co/transformers/

This is the model we need: https://huggingface.co/transformers/model_doc/bert.html#tfbertmodel

In [None]:
from transformers import TFBertModel

The BERT model returns several outputs. The first is a tensor of shape batch_size x sequence_length x embedding_size.

For a single sentence it is a matrix of size sequence_length x embedding_size, where row i is the embedding vector of the token at position i. It is the output of the last hidden layer of the model. This raises the question of what to use as the embedding of the whole sentence. Here is a link to learn more about it: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/

The default way is to use the first row of the sequence_length x embedding_size matrix for every sequence, as this row corresponds to the [CLS] token, which is expected to carry the semantics of the whole sentence. However, there are many ways to extract sentence semantics from BERT; check the link above to learn about them.
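
The slicing involved is easy to check on a random stand-in for the BERT output. No model is loaded here. The shapes are made up, and mean pooling is shown only as one common alternative (a real implementation would also mask out padding positions).

```python
import numpy as np

batch_size, sequence_length, embedding_size = 2, 4, 3
# Random stand-in for BERT's last hidden state; values are meaningless.
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, sequence_length, embedding_size))

# [CLS] pooling: keep position 0 of every sequence.
cls_embedding = last_hidden_state[:, 0, :]
# One common alternative: average over all token positions.
mean_embedding = last_hidden_state.mean(axis=1)

print(cls_embedding.shape, mean_embedding.shape)
```

Both approaches reduce the sequence_length x embedding_size matrix to a single vector of size embedding_size per sentence, which can then be passed to a classification layer.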

In [None]:
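def make_model(transformer):
    """Build a binary classifier on top of a pretrained transformer."""
    input_ids = Input(shape=(SEQUENCE_LENGTH,), name='input_token', dtype='int32')

    # First output: last hidden state, shape (batch, SEQUENCE_LENGTH, embedding_size)
    embed_layer = transformer(input_ids)[0]
    # Embedding at position 0, i.e. the [CLS] token
    cls_token = embed_layer[:, 0, :]
    X = Dropout(0.3)(cls_token)
    X = Dense(1, activation="sigmoid")(X)
    model = tf.keras.Model(inputs=input_ids, outputs=X)
    return model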

To fit our model on a TPU, we need to create and compile it inside the TPU strategy scope.

Here is the list of pretrained models that can be downloaded as in the example below:

https://huggingface.co/transformers/pretrained_models.html

In [None]:
with tpu_strategy.scope():
    bert = TFBertModel.from_pretrained("bert-base-multilingual-cased")
    model = make_model(bert)
    model.compile(optimizer=Adam(3e-5), loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC()])
    model.summary()

In [None]:
N_STEPS = train_comments.shape[0] // BATCH_SIZE
VAL_STEPS = val_comments.shape[0] // BATCH_SIZE

In [None]:
EPOCHS = 2

In [None]:
history = model.fit(train_ds,
                    validation_data= val_ds,
                    epochs=EPOCHS,
                    steps_per_epoch=N_STEPS
                   )

In [None]:
# Continue training on the validation data
history_plus = model.fit(val_ds,
                         epochs=EPOCHS,
                         steps_per_epoch=VAL_STEPS
                        )

In [None]:
sub = pd.read_csv(os.path.join('../input/jigsaw-multilingual-toxic-comment-classification/','sample_submission.csv'))
sub['toxic'] = model.predict(test_comments, verbose=1)
sub.to_csv('submission.csv', index=False)

The ideas behind BERT have been taken further and improved to get better results on multilingual tasks. The next notebooks therefore cover custom text data preparation, handling the class imbalance problem, and building an XLM-RoBERTa model, a state-of-the-art BERT-based cross-lingual model.

https://www.kaggle.com/vgodie/class-balancing

https://www.kaggle.com/vgodie/data-encoding

https://www.kaggle.com/vgodie/xlm-roberta