## Acknowledgements:
- https://www.kaggle.com/tarunpaparaju/jigsaw-multilingual-toxicity-eda-models#Introduction
- https://github.com/dipanjanS/deep_transfer_learning_nlp_dhs2019

Run it on [Kaggle Kernels](https://www.kaggle.com/spsayakpaul/jigsaw-multilingual-toxic-comment-classification). 

In this notebook, I am going to build a baseline model based on [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5) for the Jigsaw Multilingual Toxic Comment Classification (Kaggle challenge [link](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification)). 

**What am I predicting?** (comes from the challenge homepage)

You are predicting the probability that a comment is toxic. A toxic comment would receive a 1.0. A benign, non-toxic comment would receive a 0.0. In the test set, all comments are classified as either a 1.0 or a 0.0.

In [1]:
import tensorflow as tf
print(tf.__version__)

2.1.0


An amazing EDA on the dataset in available here: https://www.kaggle.com/tarunpaparaju/jigsaw-multilingual-toxicity-eda-models. 

## Load and prepare data

In [2]:
!ls /kaggle/input/jigsaw-multilingual-toxic-comment-classification/

jigsaw-toxic-comment-train-processed-seqlen128.csv
jigsaw-toxic-comment-train.csv
jigsaw-unintended-bias-train-processed-seqlen128.csv
jigsaw-unintended-bias-train.csv
sample_submission.csv
test-processed-seqlen128.csv
test.csv
validation-processed-seqlen128.csv
validation.csv


Data description is available [here](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/data). 

In [3]:
# Load datasets
import pandas as pd
import os

DATA_PATH = "/kaggle/input/jigsaw-multilingual-toxic-comment-classification/"

TEST_PATH = os.path.join(DATA_PATH, "test.csv")
VAL_PATH = os.path.join(DATA_PATH, "validation.csv")
TRAIN_PATH = os.path.join(DATA_PATH, "jigsaw-toxic-comment-train.csv")

val_data = pd.read_csv(VAL_PATH)
test_data = pd.read_csv(TEST_PATH)
train_data = pd.read_csv(TRAIN_PATH)

In [4]:
# Preview train set
train_data.sample(5)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
202380,aaa1d08c528cb0d8,""" \n :::::Fine, if you want to add an explanat...",0,0,0,0,0,0
84575,e248e1bda4d2998d,"""\n(1) I'm using """"public"""" in the broadest se...",0,0,0,0,0,0
38370,6673b3878fcd6b40,Notice about your edits \n\nPlease do not add ...,0,0,0,0,0,0
105019,31cd6d6bc4df7f2e,Attacks on editors \n\nI strongly suggest you ...,0,0,0,0,0,0
147377,3b76cc9b5ee3f91f,Could you also ask Panonian? You have higher s...,0,0,0,0,0,0


Columns (comes from [here](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/data)): 
- id - identifier within each file.
- comment_text - the text of the comment to be classified.
- toxic:identity_hate - whether or not the comment is classified as toxic. 

In [5]:
val_data.sample(5)

Unnamed: 0,id,comment_text,lang,toxic
5738,5738,hz.isa ile ilgili bir takıntısı olduğunu düşün...,tr,1
4489,4489,E vero ma in questo caso si tratta di un blog...,it,0
1908,1908,Şu konuyla ilgili: Kullanıcı gece saatlerinde ...,tr,0
6689,6689,bilgisayarda çalışırken canım sıkıldı ve biraz...,tr,0
6105,6105,Devriye olması gerekmiyor muydu? Engeli bittiğ...,tr,0


In [6]:
test_data.sample(5)

Unnamed: 0,id,content,lang
21587,21587,Comme il n y a aucune mention du nom Ayumi Ha...,fr
4374,4374,surtout obligé les preuves que tu racontes n i...,fr
21699,21699,", io sono ignorante come un pigna su queste co...",it
2822,2822,"Да, всё, руки не доходят, создать Проект. В те...",ru
57145,57145,¿Acaso no se dan cuenta que cualquier mención ...,es


It's a multilingual dataset as you can see. 

I am going to borrow the helper functions as shown here: https://www.kaggle.com/tarunpaparaju/jigsaw-multilingual-toxicity-eda-models. 

In [7]:
# Remove usernames and links
import re

val = val_data
train = train_data

def clean(text):
    # fill the missing entries and convert them to lower case
    text = text.fillna("fillna").str.lower()
    # replace the newline characters with space 
    text = text.map(lambda x: re.sub('\\n',' ',str(x)))
    text = text.map(lambda x: re.sub("\[\[User.*",'',str(x)))
    # remove usernames and links
    text = text.map(lambda x: re.sub("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",'',str(x)))
    text = text.map(lambda x: re.sub("\(http://.*?\s\(http://.*\)",'',str(x)))
    return text

val["comment_text"] = clean(val["comment_text"])
test_data["content"] = clean(test_data["content"])
train["comment_text"] = clean(train["comment_text"])

In [8]:
# Load DistilBERT tokenizer
import transformers

tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=995526.0, style=ProgressStyle(descripti…




The following function comes from [here](https://github.com/dipanjanS/deep_transfer_learning_nlp_dhs2019/blob/master/notebooks/6%20-%20Transformers%20-%20DistilBERT.ipynb).

In [9]:
import numpy as np
import tqdm

def create_bert_input_features(tokenizer, docs, max_seq_length):
    
    all_ids, all_masks = [], []
    for doc in tqdm.tqdm(docs, desc="Converting docs to features"):
        tokens = tokenizer.tokenize(doc)
        if len(tokens) > max_seq_length-2:
            tokens = tokens[0 : (max_seq_length-2)]
        tokens = ['[CLS]'] + tokens + ['[SEP]']
        ids = tokenizer.convert_tokens_to_ids(tokens)
        masks = [1] * len(ids)
        # Zero-pad up to the sequence length.
        while len(ids) < max_seq_length:
            ids.append(0)
            masks.append(0)
        all_ids.append(ids)
        all_masks.append(masks)
    encoded = np.array([all_ids, all_masks])
    return encoded

In [10]:
# Segregate the comments and their labels (not applicable for test set)
train_comments = train.comment_text.astype(str).values
val_comments = val_data.comment_text.astype(str).values
test_comments = test_data.content.astype(str).values

y_valid = val.toxic.values
y_train = train.toxic.values

In [11]:
import gc
gc.collect()

0

In [12]:
# Encode the comments
MAX_SEQ_LENGTH = 500

train_features_ids, train_features_masks = create_bert_input_features(tokenizer, train_comments, 
                                                                      max_seq_length=MAX_SEQ_LENGTH)
val_features_ids, val_features_masks = create_bert_input_features(tokenizer, val_comments, 
                                                                  max_seq_length=MAX_SEQ_LENGTH)
# test_features = create_bert_input_features(tokenizer, test_comments, 
#                                            max_seq_length=MAX_SEQ_LENGTH)

Converting docs to features: 100%|██████████| 223549/223549 [11:01<00:00, 337.90it/s]
Converting docs to features: 100%|██████████| 8000/8000 [00:23<00:00, 334.54it/s]


In [13]:
# Verify the shapes
print(train_features_ids.shape, train_features_masks.shape, y_train.shape)
print(val_features_ids.shape, val_features_masks.shape, y_valid.shape)

(223549, 500) (223549, 500) (223549,)
(8000, 500) (8000, 500) (8000,)


In [15]:
# Configure TPU
from kaggle_datasets import KaggleDatasets

tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)

GCS_DS_PATH = KaggleDatasets().get_gcs_path('jigsaw-multilingual-toxic-comment-classification')

EPOCHS = 2
BATCH_SIZE = 32 * strategy.num_replicas_in_sync

In [18]:
# Create TensorFlow datasets for better performance
train_ds = (
    tf.data.Dataset
    .from_tensor_slices(((train_features_ids, train_features_masks), y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
    
valid_ds = (
    tf.data.Dataset
    .from_tensor_slices(((val_features_ids, val_features_masks), y_valid))
    .repeat()
    .batch(BATCH_SIZE)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

## Model building and training

In [19]:
# Create utility function to get a training ready model on demand
def get_training_model():
    inp_id = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype=tf.int64, name="bert_input_ids")
    inp_mask = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype=tf.int64, name="bert_input_masks")
    inputs = [inp_id, inp_mask]

    hidden_state = transformers.TFDistilBertModel.from_pretrained('distilbert-base-multilingual-cased')(inputs)[0]
    pooled_output = hidden_state[:, 0]    
    dense1 = tf.keras.layers.Dense(128, activation='relu')(pooled_output)
    output = tf.keras.layers.Dense(1, activation='sigmoid')(dense1)

    model = tf.keras.Model(inputs=inputs, outputs=output)
    model.compile(optimizer=tf.optimizers.Adam(learning_rate=2e-5, 
                                            epsilon=1e-08), 
                loss='binary_crossentropy', metrics=['accuracy'])

    return model

In [20]:
# Authorize wandb
import wandb

wandb.login()
from wandb.keras import WandbCallback

[34m[1mwandb[0m: [32m[41mERROR[0m Not authenticated.  Copy a key from https://app.wandb.ai/authorize


API Key: ········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [21]:
# Initialize wandb
wandb.init(project="jigsaw-toxic", id="distilbert-tpu-kaggle-weighted")

W&B Run: https://app.wandb.ai/sayakpaul/jigsaw-toxic/runs/distilbert-tpu-kaggle-weighted

In [22]:
# Create 32 random indices from the English only test comments
RANDOM_INDICES = np.random.choice(test_comments.shape[0], 32)
RANDOM_INDICES

array([45335, 59335, 26736, 16088, 17969, 40881, 11877, 37071, 63772,
       22713, 22262, 50775,  2368, 34291, 47547, 41171, 28948,  4920,
       45737,  3029,  1213, 18222, 56638, 41219, 37336, 63313, 36946,
       26508, 48649,  2890,  7353,  2832])

We will be logging some sample predictions on the test dataset to see how our model is doing as it is getting trained. Now, as this is a mulitlingual dataset, we may need to convert a given comment to a language of our choice to make sense of the model's prediction. We will be using the `googletrans` library. 

In [23]:
!pip install -q googletrans

In [25]:
# Demo examples of translations
from googletrans import Translator

sample_comment = test_comments[48649]
print("Original comment:", sample_comment)
translated_comment = Translator().translate(sample_comment)
print("\n")
print("Translated comment:", translated_comment.text)

Original comment:  ¡ah! sí, ya lo sé... pero como que no puedo sacarme ciertos argentinismos de encima a la hora de escribir. —   kved    (discusión)    pd: aunque no sé si lo correcto no es escribir  bloqueé  en lugar de  bloquee . para solucionar ese tema, es más fácil decir   bloquié   y que la rae se vaya a tomar por culo.   ;)


Translated comment: Ah! Yes, I know ... but I can not get me out certain argentinismos off when writing. - kved (discussion) pd: I do not know if right not write blocked instead of blocking. to solve this issue, it is easier to say rae bloquié and that is to take the ass. ;)


In [26]:
# Create a sample prediction logger
# A custom callback to view predictions on the above samples in real-time
class TextLogger(tf.keras.callbacks.Callback):
    def __init__(self):
        super(TextLogger, self).__init__()

    def on_epoch_end(self, logs, epoch):
        samples = []
        for index in RANDOM_INDICES:
            # Grab the comment and translate it
            comment = test_comments[index]
            translated_comment = Translator().translate(comment).text
            # Create BERT features
            comment_feature_ids, comment_features_masks = create_bert_input_features(tokenizer,  
                                    comment, max_seq_length=MAX_SEQ_LENGTH)
            # Employ the model to get the prediction and parse it
            predicted_label = self.model.predict([comment_feature_ids, comment_features_masks])
            predicted_label = np.argmax(predicted_label[0])
            if predicted_label==0: predicted_label="Non-Toxic"
            else: predicted_label="Toxic"
            
            sample = [comment, translated_comment, predicted_label]
            
            samples.append(sample)
        wandb.log({"text": wandb.Table(data=samples, 
                                       columns=["Comment", "Translated Comment", "Predicted Label"])})

In [27]:
# Garbage collection
gc.collect()

1767

In [34]:
# Account for the class imbalance
from sklearn.utils import class_weight

class_weights = class_weight.compute_class_weight('balanced',
                                                 np.unique(y_train),
                                                 y_train)
class_weights

array([0.55288749, 5.22701553])

In [35]:
# Train the model
import time

start = time.time()

# Compile the model with TPU Strategy
with strategy.scope():
    model = get_training_model()
    
model.fit(train_ds, 
          steps_per_epoch=train_data.shape[0] // BATCH_SIZE,
          validation_data=valid_ds,
          validation_steps=val_data.shape[0] // BATCH_SIZE,
          epochs=EPOCHS,
          class_weight=class_weights,
          callbacks=[WandbCallback(), TextLogger()],
          verbose=1)
end = time.time() - start
print("Time taken ",end)
wandb.log({"training_time":end})

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=618.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=910749124.0, style=ProgressStyle(descri…


Train for 873 steps, validate for 31 steps
Epoch 1/2

[34m[1mwandb[0m: [32m[41mERROR[0m Can't save model, h5py returned error: 
Converting docs to features: 100%|██████████| 99/99 [00:00<00:00, 3474.63it/s]
Converting docs to features: 100%|██████████| 345/345 [00:00<00:00, 3355.53it/s]
Converting docs to features: 100%|██████████| 169/169 [00:00<00:00, 3387.00it/s]
Converting docs to features: 100%|██████████| 506/506 [00:00<00:00, 3463.48it/s]
Converting docs to features: 100%|██████████| 139/139 [00:00<00:00, 3366.29it/s]
Converting docs to features: 100%|██████████| 312/312 [00:00<00:00, 3323.52it/s]
Converting docs to features: 100%|██████████| 247/247 [00:00<00:00, 3511.72it/s]
Converting docs to features: 100%|██████████| 174/174 [00:00<00:00, 3402.15it/s]
Converting docs to features: 100%|██████████| 612/612 [00:00<00:00, 3367.64it/s]
Converting docs to features: 100%|██████████| 616/616 [00:00<00:00, 3381.52it/s]
Converting docs to features: 100%|██████████| 111/111 [00:00<00:00, 3160.89it/s]
Converting docs to features: 10

Epoch 2/2

Converting docs to features: 100%|██████████| 99/99 [00:00<00:00, 3105.99it/s]
Converting docs to features: 100%|██████████| 345/345 [00:00<00:00, 3337.37it/s]
Converting docs to features: 100%|██████████| 169/169 [00:00<00:00, 3222.91it/s]
Converting docs to features: 100%|██████████| 506/506 [00:00<00:00, 3274.03it/s]
Converting docs to features: 100%|██████████| 139/139 [00:00<00:00, 3311.02it/s]
Converting docs to features: 100%|██████████| 312/312 [00:00<00:00, 3270.40it/s]
Converting docs to features: 100%|██████████| 247/247 [00:00<00:00, 3448.82it/s]
Converting docs to features: 100%|██████████| 174/174 [00:00<00:00, 3021.22it/s]
Converting docs to features: 100%|██████████| 612/612 [00:00<00:00, 3367.01it/s]
Converting docs to features: 100%|██████████| 616/616 [00:00<00:00, 3368.86it/s]
Converting docs to features: 100%|██████████| 111/111 [00:00<00:00, 3196.04it/s]
Converting docs to features: 100%|██████████| 122/122 [00:00<00:00, 3186.03it/s]
Converting docs to features: 1

Time taken  1160.0268676280975


**As I am logging some demo predictions in between this training time should not be used for any benchmarks. **

Let's try a CNN (with 1D convolutions) now. 

In [41]:
# Create utility function to get a training ready model on demand
def get_training_model_cnn():
    inp_id = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype=tf.int64, name="bert_input_ids")
    inp_mask = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype=tf.int64, name="bert_input_masks")
    inputs = [inp_id, inp_mask]

    hidden_state = transformers.TFDistilBertModel.from_pretrained('distilbert-base-multilingual-cased')(inputs)[0]
    pooled_output = hidden_state[:, 0]    
    reshaped_pooled = tf.keras.layers.Reshape((768,1), input_shape=(768,))(pooled_output)
    conv_1 = tf.keras.layers.Conv1D(64, 2, activation='relu')(reshaped_pooled)
    pooled_2 = tf.keras.layers.GlobalAveragePooling1D()(conv_1)
    dense_1 = tf.keras.layers.Dense(128, activation='relu')(pooled_2)
    output = tf.keras.layers.Dense(1, activation='sigmoid')(dense_1)

    model = tf.keras.Model(inputs=inputs, outputs=output)
    model.compile(optimizer=tf.optimizers.Adam(learning_rate=2e-5, 
                                            epsilon=1e-08), 
                loss='binary_crossentropy', metrics=['accuracy'])

    return model

In [37]:
# Garbage collection
gc.collect()

# Reinitialize wandb
wandb.init(project="jigsaw-toxic", id="distilbert-tpu-kaggle-weighted-cnn")

W&B Run: https://app.wandb.ai/sayakpaul/jigsaw-toxic/runs/distilbert-tpu-kaggle-weighted-cnn

In [42]:
# Train the CNN-based model
start = time.time()

# Compile the model with TPU Strategy
with strategy.scope():
    model = get_training_model_cnn()
    
model.fit(train_ds, 
          steps_per_epoch=train_data.shape[0] // BATCH_SIZE,
          validation_data=valid_ds,
          validation_steps=val_data.shape[0] // BATCH_SIZE,
          epochs=EPOCHS,
          class_weight=class_weights,
          callbacks=[WandbCallback(), TextLogger()],
          verbose=1)
end = time.time() - start
print("Time taken ",end)
wandb.log({"training_time":end})

Train for 873 steps, validate for 31 steps
Epoch 1/2

[34m[1mwandb[0m: [32m[41mERROR[0m Can't save model, h5py returned error: 
Converting docs to features: 100%|██████████| 99/99 [00:00<00:00, 3036.11it/s]
Converting docs to features: 100%|██████████| 345/345 [00:00<00:00, 3261.09it/s]
Converting docs to features: 100%|██████████| 169/169 [00:00<00:00, 3239.05it/s]
Converting docs to features: 100%|██████████| 506/506 [00:00<00:00, 3346.99it/s]
Converting docs to features: 100%|██████████| 139/139 [00:00<00:00, 3184.11it/s]
Converting docs to features: 100%|██████████| 312/312 [00:00<00:00, 3242.44it/s]
Converting docs to features: 100%|██████████| 247/247 [00:00<00:00, 3304.00it/s]
Converting docs to features: 100%|██████████| 174/174 [00:00<00:00, 3211.55it/s]
Converting docs to features: 100%|██████████| 612/612 [00:00<00:00, 3484.40it/s]
Converting docs to features: 100%|██████████| 616/616 [00:00<00:00, 3404.12it/s]
Converting docs to features: 100%|██████████| 111/111 [00:00<00:00, 3290.72it/s]
Converting docs to features: 10

Epoch 2/2

Converting docs to features: 100%|██████████| 99/99 [00:00<00:00, 3549.48it/s]
Converting docs to features: 100%|██████████| 345/345 [00:00<00:00, 3274.91it/s]
Converting docs to features: 100%|██████████| 169/169 [00:00<00:00, 3362.30it/s]
Converting docs to features: 100%|██████████| 506/506 [00:00<00:00, 3263.48it/s]
Converting docs to features: 100%|██████████| 139/139 [00:00<00:00, 3437.29it/s]
Converting docs to features: 100%|██████████| 312/312 [00:00<00:00, 3502.22it/s]
Converting docs to features: 100%|██████████| 247/247 [00:00<00:00, 3570.63it/s]
Converting docs to features: 100%|██████████| 174/174 [00:00<00:00, 3517.39it/s]
Converting docs to features: 100%|██████████| 612/612 [00:00<00:00, 3291.56it/s]
Converting docs to features: 100%|██████████| 616/616 [00:00<00:00, 3402.21it/s]
Converting docs to features: 100%|██████████| 111/111 [00:00<00:00, 3236.14it/s]
Converting docs to features: 100%|██████████| 122/122 [00:00<00:00, 3500.25it/s]
Converting docs to features: 1

Time taken  1096.077669620514


The model generalizes better. 