# Jigsaw Multilingual Toxic Comment Classification

Use TPUs to identify toxic comments across multiple languages

[*Link to the competition*](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification)


### Let's go!!!

![head image](https://i.imgur.com/P9G09Ck.jpg)

# Content

1. [Competition task description](#Competition-task-description)
2. [What is Tensor Processing Units?](#What-is-Tensor-Processing-Units?)
3. [Import](#Import)
4. [Functions](#Functions)
5. [TPU Configs](#TPU-Configs)
6. [Create fast tokenizer](#Create-fast-tokenizer)
7. [Load text data](#Load-text-data)
8. [Build datasets objects](#Build-datasets-objects)
9. [Load model into the TPU](#Load-model-into-the-TPU)
10. [Train Model](#Train-Model)
11. [Submission](#Submission)
12. [Model Tuning](#Model-Tuning)
13. [Bland different submissions](#Bland-different-submissions)

# Competition task description

As our computing resources and modeling capabilities grow, so does our potential to support healthy conversations across the globe. **Develop strategies to build effective multilingual models** and you'll help Conversation AI and the entire industry realize that potential. [More...](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification)

![task gif](https://preen.ph/files/2018/05/FunnyIGAccounts.gif)

# What is Tensor Processing Units?

A **tensor processing unit (TPU)** is an AI accelerator application-specific integrated circuit (ASIC) developed by Google specifically for neural network machine learning, particularly using Google's own TensorFlow software. Google began using TPUs internally in 2015, and in 2018 made them available for third-party use, both as part of its cloud infrastructure and by offering a smaller version of the chip for sale.

- **Wikipedia** - [Tensor processing unit (TPU)](https://en.wikipedia.org/wiki/Tensor_processing_unit)
- **Kaggle** - [Tensor Processing Units (TPUs)](https://www.kaggle.com/docs/tpu)

![tpu image](https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Tensor_Processing_Unit_3.0.jpg/1280px-Tensor_Processing_Unit_3.0.jpg)

### References
* Original Author: [@xhlulu](https://www.kaggle.com/xhlulu/)
* Original notebook: [Link](https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras)

# Model Tuning

| Version | Changes                         | Score  |
|---------|---------------------------------|--------|
| 1       | default                         | 0.8697 |
| 2       | EPOCHS = 3+1; BATCH_SIZE = 16x2 | 0.8587 |
| 3       | Blending                        | 0.9406 |
| 4       | lowercase=True; add Dense layer | ?      |

# Import

In [None]:
import os

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint

In [None]:
import transformers

from tokenizers import BertWordPieceTokenizer

from tqdm.notebook import tqdm

from kaggle_datasets import KaggleDatasets

# Functions

In [None]:
def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    Tokenize text
    Source: https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras
    """
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []
    
    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])
    
    return np.array(all_ids)
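
To see why `np.array(all_ids)` comes out as a clean rectangular matrix, here is a minimal pure-Python sketch of what truncation and padding to a fixed length do. It is not the `tokenizers` implementation: the helper `pad_or_truncate`, the pad id `0`, and the toy token ids are all assumptions made for illustration.

```python
def pad_or_truncate(ids, maxlen, pad_id=0):
    """Return a copy of `ids` that is exactly `maxlen` items long."""
    clipped = ids[:maxlen]                                # truncation: drop ids past maxlen
    return clipped + [pad_id] * (maxlen - len(clipped))   # padding: fill the tail


# Toy token ids of different lengths (made-up values, not real vocabulary ids).
toy_batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 102]]
fixed = [pad_or_truncate(ids, maxlen=5) for ids in toy_batch]

for row in fixed:
    print(row)
```

Every row now has the same length, so the rows stack into a 2-D array. A real tokenizer is more careful than this sketch when it truncates, for example by keeping the closing special token.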

In [None]:
def build_model(transformer, max_len=512):
    """
    Model initialization
    Source: https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras
    """
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    dense_layer = Dense(224, activation='relu')(cls_token)
    out = Dense(1, activation='sigmoid')(dense_layer)
    
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model
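
The output layer uses a sigmoid activation and the model is compiled with `binary_crossentropy`. As a rough sketch of what those two pieces compute for a single comment, here is a plain-Python version. The function names, the example logits, and the `eps` clipping value are illustrative assumptions, not the Keras internals.

```python
import math


def sigmoid(z):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example; clipping with `eps` avoids log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


# Hypothetical logits for two comments: one scored as toxic, one as clean.
p_toxic = sigmoid(2.0)
p_clean = sigmoid(-2.0)

print(round(p_toxic, 3), round(p_clean, 3))
# A confident correct prediction costs little, a confident wrong one costs a lot.
print(binary_crossentropy(1, p_toxic), binary_crossentropy(0, p_toxic))
```

The loss grows quickly when the predicted probability is far from the label, which is what pushes the classifier head towards well-separated scores.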


# TPU Configs

In [None]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in TensorFlow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

In [None]:
AUTO = tf.data.experimental.AUTOTUNE

# Data access
# GCS_DS_PATH = KaggleDatasets().get_gcs_path()

In [None]:
# Configuration
EPOCHS = 3
BATCH_SIZE = 32 * strategy.num_replicas_in_sync
MAX_LEN = 192

# Create fast tokenizer

In [None]:
# First load the real tokenizer
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# Save the loaded tokenizer locally
tokenizer.save_pretrained('.')
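# Reload it with the Hugging Face tokenizers library.
# Note: the vocabulary comes from a cased checkpoint; lowercase=True is the
# change tried in version 4 of the Model Tuning table.
fast_tokenizer = BertWordPieceTokenizer('vocab.txt', lowercase=True)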
fast_tokenizer

# Load text data

In [None]:
DATA_PATH = "/kaggle/input/jigsaw-multilingual-toxic-comment-classification/"

In [None]:
train1 = pd.read_csv(os.path.join(DATA_PATH, "jigsaw-toxic-comment-train.csv"))
train2 = pd.read_csv(os.path.join(DATA_PATH, "jigsaw-unintended-bias-train.csv"))
train2.toxic = train2.toxic.round().astype(int)

valid = pd.read_csv(os.path.join(DATA_PATH, 'validation.csv'))
test = pd.read_csv(os.path.join(DATA_PATH, 'test.csv'))
sub = pd.read_csv(os.path.join(DATA_PATH, 'sample_submission.csv'))
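
The `.round().astype(int)` step suggests that `toxic` in `train2` holds fractional scores that are turned into 0/1 labels. One detail worth knowing: Python's built-in `round`, like NumPy's rounding, rounds halves to the nearest even number, so a score of exactly 0.5 becomes 0. The scores below are made up for illustration.

```python
# Made-up fractional scores standing in for train2.toxic.
scores = [0.0, 0.2, 0.5, 0.51, 0.83, 1.0]

# Built-in round() uses round-half-to-even, so 0.5 maps to 0, not 1.
labels = [int(round(s)) for s in scores]

for s, y in zip(scores, labels):
    print(f"{s:.2f} -> {y}")
```

If rows at exactly 0.5 should count as toxic, an explicit comparison such as `score >= 0.5` would be a clearer way to say so.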

In [None]:
# Combine train1 with a subset of train2
train = pd.concat([
    train1[['comment_text', 'toxic']],
    train2[['comment_text', 'toxic']].query('toxic==1'),
    train2[['comment_text', 'toxic']].query('toxic==0').sample(n=150000, random_state=3982)
])

# Note: changed random_state from 0 to 3982
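
The cell above keeps every toxic row from `train2` but only a fixed-size random sample of the non-toxic rows, which limits how far the classes drift out of balance. Here is a small standard-library sketch of the same idea. The toy rows, the 10% toxic rate, and the sample size of 150 are assumptions for illustration only.

```python
import random

# Toy stand-in for train2: (comment_text, toxic) pairs with made-up content.
rows = [(f"comment {i}", 1 if i % 10 == 0 else 0) for i in range(1000)]

positives = [r for r in rows if r[1] == 1]
negatives = [r for r in rows if r[1] == 0]

# Keep every toxic row, but only a fixed-size random sample of non-toxic rows.
rng = random.Random(3982)   # fixed seed so the sample is reproducible
sampled_negatives = rng.sample(negatives, 150)

subset = positives + sampled_negatives
toxic_share = sum(label for _, label in subset) / len(subset)
print(len(positives), len(sampled_negatives), round(toxic_share, 2))
```

Changing the seed, as noted above, changes which non-toxic rows are kept but not how many.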

In [None]:
x_train = fast_encode(train.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=MAX_LEN)

y_train = train.toxic.values
y_valid = valid.toxic.values

# Build datasets objects

In [None]:
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(BATCH_SIZE)
)
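
`valid_dataset` and `test_dataset` are batched without dropping the remainder, so the last batch can be smaller than `BATCH_SIZE` and every row is still seen once. A minimal sketch of that behaviour in plain Python, with a hypothetical `batch` helper and made-up sizes:

```python
import math


def batch(items, batch_size):
    """Yield consecutive chunks; the last chunk may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
rows = list(range(1000))
batches = list(batch(rows, batch_size=256))
expected_batches = math.ceil(len(rows) / 256)

print(len(batches), expected_batches, [len(b) for b in batches])
```

This is why the prediction step can return one score per test row even when the row count is not a multiple of the batch size.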

# Load model into the TPU

In [None]:
%%time
with strategy.scope():
    transformer_layer = (
        transformers.TFDistilBertModel
        .from_pretrained('distilbert-base-multilingual-cased')
    )
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()

# Train Model

First, we train on the training subset built above, which is entirely in English.

In [None]:
n_steps = x_train.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS
)
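
Because `train_dataset` calls `.repeat()`, it never runs out of data, so `steps_per_epoch` is what tells `fit` where one epoch ends. The sketch below mimics this with `itertools`. The row count and batch size are hypothetical; the real values come from `x_train.shape[0]` and `BATCH_SIZE`.

```python
from itertools import cycle, islice

# Hypothetical sizes, for illustration only.
n_rows = 1000
batch_size = 256

n_steps = n_rows // batch_size           # floor division, as in the cell above
rows_per_epoch = n_steps * batch_size

# cycle() plays the role of .repeat(): the stream never ends on its own,
# so the consumer has to decide how many items make up one "epoch".
stream = cycle(range(n_rows))
one_epoch = list(islice(stream, rows_per_epoch))

print(n_steps, rows_per_epoch, n_rows - rows_per_epoch)
```

With floor division, an "epoch" can be slightly shorter than one full pass over the data. The leftover rows are not lost, since the repeated stream simply carries on into the next epoch.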

In [None]:
# Plot training & validation accuracy values
plt.plot(train_history.history['accuracy'])
plt.plot(train_history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

Now that we have largely saturated what the model can learn from English-only data, we continue training on the `validation` set for `EPOCHS*2` more epochs. This set is significantly smaller but contains a mixture of different languages.

In [None]:
n_steps = x_valid.shape[0] // BATCH_SIZE
train_history_2 = model.fit(
    valid_dataset.repeat(),
    steps_per_epoch=n_steps,
    epochs=EPOCHS*2
)

In [None]:
# Plot training accuracy values
plt.plot(train_history_2.history['accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')
plt.show()

# Submission

In [None]:
sub['toxic'] = model.predict(test_dataset, verbose=1)
sub.to_csv('submission.csv', index=False)

# Bland different submissions

- [Ensemble](https://www.kaggle.com/hamditarek/ensemble)
- [Super Fast XLMRoberta](https://www.kaggle.com/shonenkov/tpu-inference-super-fast-xlmroberta)
- [Jigsaw TPU: BERT with Huggingface and Keras](https://www.kaggle.com/miklgr500/jigsaw-tpu-bert-with-huggingface-and-keras)
- [inference of bert tpu model ml w/ validation](https://www.kaggle.com/abhishek/inference-of-bert-tpu-model-ml-w-validation)

P.S. I want to test how to blend different submissions. After that, I will build an ensemble of the different models.

In [None]:
EXTERNAL_DATA_PATH = '../input/external/'
os.listdir(EXTERNAL_DATA_PATH)

### Read data

In [None]:
submission1 = pd.read_csv(os.path.join(EXTERNAL_DATA_PATH, 'submission-1.csv'))
submission2 = pd.read_csv(os.path.join(EXTERNAL_DATA_PATH, 'submission-2.csv'))
submission3 = pd.read_csv(os.path.join(EXTERNAL_DATA_PATH, 'submission-3.csv'))

### Blend 1

In [None]:
submission1['toxic'] = submission1['toxic']*0.05 + submission2['toxic']*0.15 + submission3['toxic']*0.8
submission1.to_csv('submission-1.csv', index=False)
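
The blend above is a weighted average of three prediction columns, with weights that sum to 1. As a small sketch of the same arithmetic outside pandas, here is a hypothetical `blend` helper applied to made-up scores. Only the weights are taken from the cell above.

```python
def blend(predictions, weights):
    """Weighted average of several prediction lists, row by row."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1 to stay on the probability scale")
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all prediction lists must have the same length")
    return [
        sum(w * p[i] for w, p in zip(weights, predictions))
        for i in range(len(predictions[0]))
    ]


# Made-up scores for three rows from three hypothetical submissions.
sub_a = [0.10, 0.90, 0.50]
sub_b = [0.20, 0.80, 0.40]
sub_c = [0.05, 0.95, 0.60]

blended = blend([sub_a, sub_b, sub_c], weights=[0.05, 0.15, 0.80])
print([round(x, 4) for x in blended])
```

Both this sketch and the pandas version assume that all submissions list the same rows in the same order, which is worth checking on the `id` column before blending real files.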