
<font size='5' color='red'>About this notebook</font>

![](https://media.giphy.com/media/Olms1m3vZPURO/giphy.gif)

Jigsaw Multilingual Toxic Comment Classification is the 3rd annual competition organized by the Jigsaw team. It follows the Toxic Comment Classification Challenge, the original 2018 competition, and Jigsaw Unintended Bias in Toxicity Classification, which required competitors to account for biased ML predictions in their models. This year, the goal is to use English-only training data to predict toxicity in many different languages, which can be done with multilingual models and sped up using TPUs.



**The focus of this notebook is to:**

- Experiment with different model architectures for fine-tuning BERT.
- Add translated data to the validation set.
- Do k-fold cross-validation on the data. I didn't see any public kernel doing k-fold cross-validation with this data, but I'm pretty sure the LB toppers do it.
- Look at how to improve CV vs LB correlation.


### <font size='3' color='red' >If you like this work, please leave an upvote ⬆️</font>

## <font size='4' color='blue'>Loading Required Libraries</font>

In [None]:
!pip install tensorflow_addons

In [None]:
from tensorflow.keras.layers import Input, Dense, Embedding, SpatialDropout1D, Dropout, add, concatenate
from tensorflow.keras.layers import  Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D,Average
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors
from tensorflow.keras.callbacks import ModelCheckpoint
from transformers import TFAutoModel, AutoTokenizer
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from kaggle_datasets import KaggleDatasets
from tensorflow.keras.models import Model
import plotly.graph_objects as go
from sklearn.utils import shuffle
import tensorflow_addons as tfa
from tqdm.notebook import tqdm
from textblob import TextBlob
import plotly.offline as py
import tensorflow as tf
import pandas as pd
import transformers
import numpy as np
import warnings
import os


warnings.filterwarnings("ignore")
py.init_notebook_mode(connected=True)


## <font size='4' color='blue'> Helper function</font>


In [None]:

def regular_encode(texts, tokenizer, maxlen=512):
    enc_di = tokenizer.batch_encode_plus(
        texts, 
        return_attention_masks=False, 
        return_token_type_ids=False,
        pad_to_max_length=True,
        max_length=maxlen
    )
    
    return np.array(enc_di['input_ids'])



## <font size='4' color='blue'>Multi Sample dropout Network</font>
![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F448347%2Fe7da45582176dd2a4dbdc7e7bd37ca8b%2F2019-07-22%2019.11.42.png?generation=1563790324431533&alt=media)


Multi-Sample Dropout, introduced in the paper [Multi-Sample Dropout for Accelerated Training and Better Generalization](https://arxiv.org/abs/1905.09788), extends traditional dropout by applying multiple dropout masks to the same mini-batch.

In [None]:
def build_model(transformer, max_len=220):
   
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    seq_out = transformer(input_word_ids)[0]
    pool= GlobalAveragePooling1D()(seq_out)
    

    dense=[]
    FC = Dense(32,activation='relu')
    for p in np.linspace(0.2,0.5,3):
        x=Dropout(p)(pool)
        x=FC(x)
        x=Dense(1,activation='sigmoid')(x)
        dense.append(x)
    
    out = Average()(dense)
    
    model = Model(inputs=input_word_ids, outputs=out)
    optimizer=tfa.optimizers.RectifiedAdam(learning_rate=2e-5,min_lr=1e-6,total_steps=6000)
    model.compile(optimizer, loss=focal_loss(), metrics=[tf.keras.metrics.AUC()])
    
    return model
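
The idea can be illustrated without TensorFlow. The following is a minimal sketch, assuming a single shared linear head (simpler than the `Dense(32)` + `Dense(1)` layers in `build_model`); the feature and weight values are made up for illustration. Each dropout rate produces a different mask over the same features, and the resulting predictions are averaged.

```python
import math
import random


def dropout(features, rate, rng):
    """Inverted dropout: zero each feature with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multi_sample_dropout(features, weights, bias, rates, rng):
    """Average the predictions of one shared linear head over several dropout masks."""
    preds = []
    for rate in rates:
        masked = dropout(features, rate, rng)
        logit = sum(w * x for w, x in zip(weights, masked)) + bias
        preds.append(sigmoid(logit))
    return sum(preds) / len(preds)


rng = random.Random(0)
features = [0.5, -1.2, 0.3, 2.0]   # hypothetical pooled transformer features
weights = [0.1, 0.4, -0.3, 0.2]    # hypothetical shared head weights
rates = [0.2, 0.35, 0.5]           # same values as np.linspace(0.2, 0.5, 3)

prob = multi_sample_dropout(features, weights, 0.0, rates, rng)
print(prob)
```

With a single rate of `0.0` the function reduces to an ordinary sigmoid head, which is a quick way to sanity-check it.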


## <font size='4' color='red'>Using [CLS] token </font>
![](https://gluon-nlp.mxnet.io/_images/bert-sentence-pair.png)

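For sentence classification, we're only interested in BERT's output for the [CLS] token. This is the easiest way to get an output from BERT for classification. Note that this cell redefines `build_model`, so the [CLS] version replaces the multi-sample dropout version above and is the one used for training below.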

In [None]:
def build_model(transformer, max_len=220):
   
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    seq_out = transformer(input_word_ids)[0]
    cls_out = seq_out[:,0,:]
    out=Dense(1,activation='sigmoid')(cls_out)
    
    model = Model(inputs=input_word_ids, outputs=out)
    optimizer=tfa.optimizers.RectifiedAdam(learning_rate=2e-5,min_lr=1e-6,total_steps=6000)
    model.compile(optimizer, loss=focal_loss(), metrics=[tf.keras.metrics.AUC()])
    
    return model
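
To make the difference between the two heads concrete, here is a small sketch on a toy encoder output for a single comment. The numbers are hypothetical; only the indexing mirrors `seq_out[:, 0, :]` and `GlobalAveragePooling1D` from the two `build_model` versions.

```python
# Toy encoder output for one comment: 3 tokens x 2 hidden units (hypothetical values)
seq_out = [
    [0.9, 0.1],   # position 0: the [CLS] token (<s> for XLM-RoBERTa)
    [0.2, 0.4],
    [0.4, 0.7],
]

# [CLS] pooling: keep only the first position, like seq_out[:, 0, :]
cls_out = seq_out[0]

# Average pooling: mean over all positions, like GlobalAveragePooling1D
hidden = len(seq_out[0])
avg_out = [sum(tok[h] for tok in seq_out) / len(seq_out) for h in range(hidden)]

print(cls_out)
print(avg_out)
```

The [CLS] head relies on the first position summarizing the whole comment, while average pooling gives every token an equal say.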
    
    
    

## <font size='4' color='red'> TPU Settings</font>

In [None]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

In [None]:
AUTO = tf.data.experimental.AUTOTUNE

GCS_DS_PATH = KaggleDatasets().get_gcs_path("jigsaw-multilingual-toxic-comment-classification")
MY_GCS_DS_PATH = KaggleDatasets().get_gcs_path("jigsaw-train-multilingual-coments-google-api")

# Configuration
EPOCHS = 3
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 192
MODEL = '../input/jplu-tf-xlm-roberta-large'


In [None]:
# First load the real tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL)



## <font size='4' color='red'>Load Train,Validation and Test data</font>

In [None]:
train1 = pd.read_csv("/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv")
train2 = pd.read_csv("/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train.csv")
train2.toxic = train2.toxic.round().astype(int)

# Combine train1 with a subset of train2
train = pd.concat([
    train1[['comment_text', 'toxic']],
    train2[['comment_text', 'toxic']].query('toxic==1'),
    train2[['comment_text', 'toxic']].query('toxic==0').sample(n=90000, random_state=0)
])


valid = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
sub = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv')


- Check the target class distribution (plotted here from the `validation` labels).

In [None]:
values = valid.toxic.value_counts()
# Take the labels from the value_counts index so they line up with the counts
labels = values.index

# Use `hole` to create a donut-like pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_layout(title="Target Class distribution")
fig.show()

The dataset clearly has a class imbalance problem.
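
A minimal sketch of how the imbalance could be quantified, using a made-up label list (the counts are illustrative and are not taken from the competition data):

```python
from collections import Counter

# Hypothetical labels: 1 = toxic, 0 = non-toxic (illustrative counts only)
toy_labels = [0] * 85 + [1] * 15

counts = Counter(toy_labels)
total = sum(counts.values())

# Fraction of each class in the toy sample
fractions = {label: n / total for label, n in counts.items()}

# Ratio of the majority class to the minority class
imbalance_ratio = max(counts.values()) / min(counts.values())

print(fractions)
print(round(imbalance_ratio, 2))
```

On the real data, `valid.toxic.value_counts(normalize=True)` would give the same kind of per-class fractions directly.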

- Prepare `validation` data by adding some translated examples.

In [None]:

toxic=train1[train1.toxic==1].id.unique()
non_toxic=train1[train1.toxic==0].id.unique()

In [None]:
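def get_valid(valid):
    """Extend the validation set with translated training comments."""
    cols = ['comment_text', 'lang', 'toxic', 'id']
    i = 0
    parts = []
    path = "../input/jigsaw-train-multilingual-coments-google-api/"
    for file in os.listdir(path):
        if file.endswith("cleaned.csv"):

            df = pd.read_csv(path + file, usecols=['comment_text', 'toxic', 'id'])
            # The language code is the second-to-last "-"-separated part of the file name
            df['lang'] = file.split("-")[-2]

            # Take a different slice of train ids for each file so the
            # languages do not share the same source comments
            parts.append(df[df.id.isin(toxic[i*(len(toxic)//6):(i+1)*(len(toxic)//6)])][cols])
            parts.append(df[df.id.isin(non_toxic[i*len(non_toxic)//6:(i+1)*len(non_toxic)//6])][cols])
            i += 1

    # DataFrame.append was removed in pandas 2.0, so collect the parts and concat once
    valid = pd.concat([valid[['comment_text', 'lang', 'toxic']]] + parts)
    return shuffle(valid).reset_index(drop=True)


valid = get_valid(valid)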
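
The slicing inside `get_valid` gives each translated file its own block of training ids. A minimal sketch of that step, with a hypothetical helper name (`disjoint_slices`) and toy ids standing in for `train1.id`:

```python
def disjoint_slices(ids, n_chunks):
    """Split ids into n_chunks consecutive, non-overlapping slices of equal size.

    Leftover ids (when len(ids) is not divisible by n_chunks) are dropped,
    mirroring the `len(toxic)//6` step used in get_valid.
    """
    step = len(ids) // n_chunks
    return [ids[i * step:(i + 1) * step] for i in range(n_chunks)]


# Hypothetical comment ids; the real ones come from train1.id
toy_ids = list(range(20))
chunks = disjoint_slices(toy_ids, 6)

for lang_index, chunk in enumerate(chunks):
    print(lang_index, chunk)
```

Because the slices never overlap, a comment that is added in one language is not added again in another.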



Let's check the language distribution in the updated validation set.

In [None]:

values = valid.lang.value_counts()
# Take the labels from the value_counts index so they line up with the counts
labels = values.index
# Use `hole` to create a donut-like pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_layout(title="Updated validation set language distribution")
fig.show()

In [None]:
import gc
# train1 and train2 are no longer needed (trans and df are local to get_valid)
del train1, train2
gc.collect()

## <font size='4' color='red'>Data preparation</font>

In [None]:
%%time 

# x_train and y_train are used by get_fold_data below, so they must be encoded here
x_train = regular_encode(train.comment_text.values, tokenizer, maxlen=MAX_LEN)
x_valid = regular_encode(valid.comment_text.values, tokenizer, maxlen=MAX_LEN)
x_test = regular_encode(test.content.values, tokenizer, maxlen=MAX_LEN)

y_train = train.toxic.values
y_valid = valid.toxic.values



- Prepare the `test_dataset` as usual.

In [None]:


test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(BATCH_SIZE)
)

- The function below prepares and returns the `tf.data.Dataset` objects for each fold of cross-validation.

In [None]:
def get_fold_data(train_ind,valid_ind):
    
    print("Getting fold data...")
    
    train_x=np.vstack([x_train,x_valid[train_ind]])
    train_y=np.hstack([y_train,y_valid[train_ind]])
    
    valid_x= x_valid[valid_ind]
    valid_y =y_valid[valid_ind]
    
    
    
    train_data = (
        tf.data.Dataset
        .from_tensor_slices((train_x, train_y))
        .repeat()
        .shuffle(2048)
        .batch(BATCH_SIZE)
        .prefetch(AUTO)
    )

    valid_data = (
        tf.data.Dataset
        .from_tensor_slices((valid_x, valid_y))
        .shuffle(2048)
        .batch(BATCH_SIZE)
        .cache()
        .prefetch(AUTO)
    )
    return train_data,valid_data

### <font size='3' color='blue'>Focal loss</font>

- We have a very unbalanced dataset, so I decided to use focal loss.


In [None]:
from tensorflow.keras import backend as K

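def focal_loss(gamma=1.5, alpha=.25):
    def focal_loss_fixed(y_true, y_pred):
        # Clip predictions away from 0 and 1 so the logs below stay finite
        y_pred = K.clip(y_pred, K.epsilon(), 1. - K.epsilon())
        # pt_1: predicted probability on positive examples (1 elsewhere, so those terms vanish)
        pt_1 = tf.where(tf.equal(y_true, 1), y_pred, tf.ones_like(y_pred))
        # pt_0: predicted probability on negative examples (0 elsewhere, so those terms vanish)
        pt_0 = tf.where(tf.equal(y_true, 0), y_pred, tf.zeros_like(y_pred))
        return -K.mean(alpha * K.pow(1. - pt_1, gamma) * K.log(pt_1)) - K.mean((1 - alpha) * K.pow(pt_0, gamma) * K.log(1. - pt_0))
    return focal_loss_fixed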
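
To see why focal loss helps with imbalance, it can be compared with plain binary cross-entropy on single predictions. This is a per-example sketch in pure Python using the same `gamma` and `alpha` defaults as above; the helper names are illustrative, and the batch averaging done by the Keras version is left out.

```python
import math


def binary_focal_loss(y_true, p, gamma=1.5, alpha=0.25):
    """Per-example focal loss for a single predicted probability p."""
    if y_true == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def binary_cross_entropy(y_true, p):
    """Per-example binary cross-entropy for a single predicted probability p."""
    return -math.log(p) if y_true == 1 else -math.log(1.0 - p)


# An easy negative (p close to 0) versus a hard negative (p close to 1)
for p in (0.1, 0.9):
    fl = binary_focal_loss(0, p)
    bce = binary_cross_entropy(0, p)
    print(p, round(fl, 5), round(bce, 5), round(fl / bce, 5))
```

For a negative example the ratio of focal loss to cross-entropy is `(1 - alpha) * p ** gamma`, so well-classified negatives are scaled down far more than hard ones.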


## <font size='4' color='red'>K-fold training and cross validation</font>

- Now we do 3-fold cross-validation.
- We train on the full English train set + 2 of the 3 folds of the validation set.
- We then evaluate model performance on the remaining fold of the validation set.
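
The fold scheme can be sketched in pure Python. The function below is a simplified, deterministic stand-in for sklearn's `StratifiedKFold` (no shuffling), and the language tags are hypothetical. It only shows how the validation indices are split so that every fold contains each language.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign each index to a fold so every label is spread evenly across folds.

    Simplified illustration only: round-robin assignment per label.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_label.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Hypothetical language tags for nine validation comments
langs = ["es", "es", "es", "it", "it", "it", "tr", "tr", "tr"]
folds = stratified_folds(langs, n_splits=3)

for k, valid_idx in enumerate(folds):
    # Everything outside fold k is used for training in that round
    train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
    print("fold", k + 1, "train on", train_idx, "validate on", sorted(valid_idx))
```

In the notebook, the training part of each round additionally includes the whole English train set, and the stratification label is `valid.lang`.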

Recently I found out that you can clear the TPU memory between folds of cross-validation to prevent the `OOM Resource exhausted` error.

In [None]:


pred_test=np.zeros((test.shape[0],1))
# 3 folds, as described above
skf = StratifiedKFold(n_splits=3,shuffle=True,random_state=777)
val_score=[]


for fold,(train_ind,valid_ind) in enumerate(skf.split(x_valid,valid.lang.values)):
    
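    # NOTE: `fold < 0` is never true, so training is skipped as written.
    # Change the condition (e.g. `fold < skf.get_n_splits()`) to train the folds;
    # this also needs x_train and y_train to be defined.
    if fold < 0: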
    
        print("fold",fold+1)
        
        K.clear_session()
        # Re-initializing the TPU frees its memory between folds; skip when no TPU is attached
        if tpu:
            tf.tpu.experimental.initialize_tpu_system(tpu)
        
        train_data,valid_data=get_fold_data(train_ind,valid_ind)
    
        Checkpoint=tf.keras.callbacks.ModelCheckpoint(f"roberta_base_{fold+1}.h5", monitor='val_loss', verbose=1, save_best_only=True,
        save_weights_only=True, mode='min')
        
        with strategy.scope():
            transformer_layer = TFAutoModel.from_pretrained(MODEL)
            model = build_model(transformer_layer, max_len=MAX_LEN)
            
        
        


        n_steps=(x_train.shape[0]+len(train_ind))//BATCH_SIZE
        print(n_steps)

        print("training model {} ".format(fold+1))

        train_history = model.fit(
        train_data,
        steps_per_epoch=n_steps,
        validation_data=valid_data,
        epochs=EPOCHS,callbacks=[Checkpoint],verbose=1)
        
        print("Loading model...")
        model.load_weights(f"roberta_base_{fold+1}.h5")
        
        

        print("fold {} validation auc {}".format(fold+1,np.mean(train_history.history['val_auc'])))
        print("fold {} validation loss {}".format(fold+1,np.mean(train_history.history['val_loss'])))

        val_score.append(np.mean(train_history.history['val_auc']))
        
        print('predict on test....')
        preds=model.predict(test_dataset,verbose=1)

        # Average the test predictions over all folds
        pred_test+=preds/skf.get_n_splits()
        
#print("Mean cross-validation AUC",np.mean(val_score))



## <font size='4' color='blue'>Making our Submission</font>

In [None]:
# pred_test has shape (n, 1); flatten it to a 1-D column
sub['toxic'] = pred_test.ravel()
sub.to_csv('submission.csv', index=False)
sub.head()

## Work in progress...
- Improve CV vs LB Correlation
