# Jigsaw Toxicity: Word2Vec+TFIDF Training
## Table of Contents
* [1. Overview](#1.)
* [2. Configuration](#2.)
* [3. Setup](#3.)
* [4. Import datasets](#4.)
* [5. EDA & Preprocessing](#5.)
    * [5.1 Select Training Data](#5.1)
    * [5.2 Text Preprocessing Function](#5.2)
    * [5.3 TF-IDF Vectorization](#5.3)
    * [5.4 Word2Vec Vectorization](#5.4)
    * [5.5 Train Validation Split](#5.5)
    * [5.6 Create TensorFlow Dataset](#5.6)
    * [5.7 Calculate Class weight](#5.7)
* [6. Model Development](#6.)
    * [6.1 FNet Encoder](#6.1)
    * [6.2 Positional Embedding](#6.2)
    * [6.3 Word2Vec FNet Model](#6.3)
    * [6.4 TFIDF DNN Model](#6.4)
    * [6.5 The Whole Model](#6.5)
    * [6.6 Model Training](#6.6)
    * [6.7 Evaluation](#6.7)
* [7. Submission](#7.)
* [8. References](#8.)


<a id="1."></a>
## 1. Overview
In my previous notebooks, I built Jigsaw toxicity models with [FNet](https://www.kaggle.com/lonnieqin/jigsaw-toxicity-training-with-fnet) using Word2Vec vectorization and with a [DNN](https://www.kaggle.com/lonnieqin/tf-idf-vectorization-with-keras) using TF-IDF vectorization. How about combining these two models? I will try it in this notebook.

This model is a binary classification model; the toxicity ranking can be calculated from the predicted probability of the positive (toxic) class.
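
To make the ranking idea concrete, here is a minimal plain-Python sketch: the sigmoid output is used directly as a toxicity score, and a pair of comments is ordered by comparing the two scores. The probabilities are made up and `pairwise_agreement` is a hypothetical helper for illustration, not part of this notebook's pipeline.

```python
def pairwise_agreement(less_toxic_scores, more_toxic_scores):
    """Fraction of pairs where the 'more toxic' comment gets the higher score."""
    pairs = list(zip(less_toxic_scores, more_toxic_scores))
    hits = sum(1 for less, more in pairs if less < more)
    return hits / len(pairs)

# Made-up sigmoid outputs for four (less_toxic, more_toxic) pairs.
less = [0.10, 0.40, 0.75, 0.20]
more = [0.80, 0.35, 0.90, 0.60]
agreement = pairwise_agreement(less, more)
print(agreement)  # 0.75
```

The same comparison could be applied to the `less_toxic` / `more_toxic` columns of the validation data once the model has scored both columns.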

I use the dataset from the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and combine it with the `more_toxic` comments from this competition's validation data for training.

I will keep track of three models: the model with the best accuracy, the model with the best AUC, and the latest model, so that I can use them for inference and try different ensemble methods to get a better score.

<a id="2."></a>
## 2. Configuration

In [None]:
class Config:
    vocab_size = 15000 # Vocabulary Size
    sequence_length = 100 # Length of sequence
    batch_size = 1024
    validation_split = 0.15
    embed_dim = 256
    latent_dim = 256
    epochs = 50 # Number of Epochs to train
    best_auc_model_path = "model_best_auc.tf"
    best_acc_model_path = "model_best_acc.tf"
    lastest_model_path = "model_latest.tf"
config = Config()

<a id="3."></a>
## 3. Setup

In [None]:
import pandas as pd
import tensorflow as tf
import pathlib
import random
import string
import re
import sys
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
import os
import sklearn
import seaborn as sns
from sklearn.model_selection import train_test_split
from nltk.tokenize import TweetTokenizer 
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from scipy.stats import rankdata
import json

<a id="4."></a>
## 4. Import datasets

In [None]:
validation_data = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv")
validation_data.head()

In [None]:
train = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv")
train.head()

<a id="5."></a>
## 5. EDA & Preprocessing

<a id="5.1"></a>
### 5.1 Select Training Data

One way is to label `less_toxic` comments as 0 and `more_toxic` comments as 1; with this labeling FNet gets a score of 0.749. I tried grouping duplicated comments together and replacing the label with the average value, but got a worse score of 0.49 instead. I also tried converting the average value to a class value, but the model still couldn't learn anything useful from it. So I keep every variable we may use in the future in a data table.

Another way is to use the external dataset from the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge). Since there is a class imbalance problem, I also add the `more_toxic` comments from this competition's validation data and label them as 1.

In [None]:
use_external_dataset = True
more_toxic = pd.DataFrame({"text": validation_data["more_toxic"], "label": 1})
if use_external_dataset:
    train = train[["comment_text", "toxic"]]
    train.columns = ["text", "label"]
    # Add more_toxic comments to mitigate the class imbalance problem.
    # DataFrame.append was removed in pandas 2.0, so use pd.concat.
    train = pd.concat([train, more_toxic], ignore_index=True)
else:
    less_toxic = pd.DataFrame({"text": validation_data["less_toxic"], "label": 0})
    data = pd.concat([less_toxic, more_toxic], ignore_index=True)
    # Average label per distinct comment. The mapping is taken from the
    # groupby result itself, so each text stays aligned with its own mean
    # (groupby sorts its keys, while unique() keeps order of appearance).
    mean_label = data.groupby("text")["label"].mean()
    index_label = sorted(mean_label.unique())
    data["average_value"] = data["text"].map(mean_label)
    data["class"] = data["average_value"].apply(lambda value: index_label.index(value))
    classes = sorted(data["class"].unique())
    print("Classes:", classes)
    train = data[["text", "label"]]
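
For the label-averaging experiment, the key step is mapping each distinct comment to the mean of all labels it received. A minimal plain-Python sketch with toy comments follows; `average_labels` is a hypothetical helper, and keying the result by the text itself keeps every comment paired with its own average regardless of ordering.

```python
from collections import defaultdict

def average_labels(texts, labels):
    """Map each distinct text to the mean of all labels it received."""
    totals = defaultdict(lambda: [0.0, 0])  # text -> [label sum, count]
    for text, label in zip(texts, labels):
        totals[text][0] += label
        totals[text][1] += 1
    return {text: total / count for text, (total, count) in totals.items()}

texts = ["b comment", "a comment", "b comment", "c comment"]
labels = [0, 1, 1, 0]
averages = average_labels(texts, labels)
print(averages)
```

A comment that appears once as `less_toxic` and once as `more_toxic` ends up with 0.5, which gives the model a much weaker signal than a clean 0 or 1.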

<a id="5.2"></a>
### 5.2 Text Preprocessing Function 

In [None]:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    text = tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )
    text = tf.strings.regex_replace(text, "[0-9]+", " ")
    text = tf.strings.regex_replace(text, "[ ]+", " ")
    text = tf.strings.strip(text)
    return text
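
To see what this standardization does to a single comment, here is a rough plain-Python counterpart built on the `re` module. It is only a sketch for inspection: `standardize` is a hypothetical name, and Python's `lower()` handles Unicode while `tf.strings.lower` defaults to ASCII, so results may differ on non-ASCII text.

```python
import re
import string

def standardize(text):
    """Lowercase, drop <br /> tags and punctuation, blank out digits, squeeze spaces."""
    text = text.lower()
    text = text.replace("<br />", " ")
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r"[0-9]+", " ", text)
    text = re.sub(r"[ ]+", " ", text)
    return text.strip()

cleaned = standardize("Hello <br /> World!!! 123 times")
print(cleaned)  # hello world times
```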

<a id="5.3"></a>
### 5.3 TF-IDF Vectorization

In [None]:
tfidf_vectozier = layers.TextVectorization(
    standardize=custom_standardization, 
    max_tokens=config.vocab_size, 
    output_mode="tf-idf", 
    ngrams=2
)
with tf.device("CPU"):
    # Run adapt() on CPU to work around a bug that currently prevents it from running on GPU.
    tfidf_vectozier.adapt(list(train["text"]))
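
With `ngrams=2` the layer builds its vocabulary from both unigrams and bigrams and weights each term by TF-IDF. The plain-Python sketch below illustrates the idea on three toy documents. The smoothing formula `log(1 + N / (1 + df))` is an assumption chosen for illustration, and `ngrams` and `tfidf` are hypothetical helpers; the Keras layer's exact weighting may differ.

```python
import math
from collections import Counter

def ngrams(tokens, max_n=2):
    """Return all 1..max_n-grams of a token list, joined by spaces."""
    out = []
    for n in range(1, max_n + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def tfidf(docs, max_n=2):
    """Term count times a smoothed inverse document frequency, per document."""
    tokenized = [ngrams(doc.split(), max_n) for doc in docs]
    doc_freq = Counter(term for terms in tokenized for term in set(terms))
    n_docs = len(docs)
    # Assumed smoothing: log(1 + N / (1 + df)); other variants exist.
    idf = {term: math.log(1 + n_docs / (1 + df)) for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in Counter(terms).items()}
            for terms in tokenized]

docs = ["you are nice", "you are rude", "nice day"]
weights = tfidf(docs)
print(sorted(weights[0]))  # ['are', 'are nice', 'nice', 'you', 'you are']
```

Terms that occur in fewer documents (such as "rude") get a larger weight than terms shared by many documents (such as "you").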

<a id="5.4"></a>
### 5.4 Word2Vec Vectorization

In [None]:
word2vec_vectozier = layers.TextVectorization(
    standardize=custom_standardization, 
    max_tokens=config.vocab_size, 
    output_sequence_length=config.sequence_length
)
with tf.device("CPU"):
    word2vec_vectozier.adapt(train["text"])
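
In its default integer mode, the layer turns each comment into a fixed-length sequence of token ids, where 0 is reserved for padding and 1 for out-of-vocabulary tokens. Here is a minimal plain-Python sketch of that encoding on toy data; `build_vocab` and `encode` are hypothetical helpers and tokenization is reduced to a whitespace split.

```python
from collections import Counter

def build_vocab(docs, max_tokens):
    """Most frequent tokens first; id 0 is padding and id 1 is out-of-vocabulary."""
    counts = Counter(token for doc in docs for token in doc.split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index + 2 for index, token in enumerate(most_common)}

def encode(doc, vocab, sequence_length):
    """Map tokens to ids, then truncate or right-pad with 0 to a fixed length."""
    ids = [vocab.get(token, 1) for token in doc.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

vocab = build_vocab(["you are nice", "you are rude"], max_tokens=4)
encoded = encode("you are very rude indeed", vocab, sequence_length=6)
print(encoded)  # [2, 3, 1, 1, 1, 0]
```

With `max_tokens=4` only "you" and "are" fit into the vocabulary, so every other word maps to the out-of-vocabulary id 1.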

<a id="5.5"></a>
### 5.5 Train Validation Split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(train["text"], train["label"], test_size=config.validation_split)

In [None]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

<a id="5.6"></a>
### 5.6 Create TensorFlow Dataset

In [None]:
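def make_dataset(X, y, batch_size, mode):
    dataset = tf.data.Dataset.from_tensor_slices((X, y))
    # Cache before shuffling so that every epoch gets a fresh shuffle order
    # instead of replaying the cached order of the first epoch.
    dataset = dataset.cache()
    if mode == "train":
        dataset = dataset.shuffle(256)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(16)
    return dataset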

In [None]:
train_ds = make_dataset(X_train, y_train, batch_size=config.batch_size, mode="train")
valid_ds = make_dataset(X_val, y_val, batch_size=config.batch_size, mode="valid")

Let's see what this data looks like.

In [None]:
for batch in train_ds.take(1):
    print(batch)

<a id="5.7"></a>
### 5.7 Calculate Class weight

In [None]:
class_weight =  1 / train["label"].value_counts(normalize=True)
class_weight = dict(class_weight / class_weight.sum())
class_weight
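
The cell above weights each class by the inverse of its frequency and then normalizes the weights so they sum to 1. The same arithmetic in plain Python, on a made-up 90/10 label split (`inverse_frequency_weights` is a hypothetical helper):

```python
def inverse_frequency_weights(labels):
    """Weight each class by 1 / frequency, then normalize the weights to sum to 1."""
    n = len(labels)
    inverse = {c: n / labels.count(c) for c in set(labels)}
    total = sum(inverse.values())
    return {c: w / total for c, w in inverse.items()}

labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)
print(weights)  # approximately {0: 0.1, 1: 0.9}
```

In the binary case each class simply ends up weighted by the frequency of the other class, so the rare toxic class contributes more to the loss per sample.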

<a id="6."></a>
## 6. Model Development

<a id="6.1"></a>
### 6.1 FNet Encoder

In [None]:
class FNetEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, dropout_rate=0.1, **kwargs):
        super(FNetEncoder, self).__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(dense_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs):
        # Casting the inputs to complex64
        inp_complex = tf.cast(inputs, tf.complex64)
        # Projecting the inputs to the frequency domain using FFT2D and
        # extracting the real part of the output
        fft = tf.math.real(tf.signal.fft2d(inp_complex))
        proj_input = self.layernorm_1(inputs + fft)
        proj_output = self.dense_proj(proj_input)
       
        layer_norm = self.layernorm_2(proj_input + proj_output)
        return layer_norm
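
The token-mixing step replaces self-attention with a 2D Fourier transform: `tf.signal.fft2d` transforms the two innermost axes, which here are the sequence axis and the embedding axis, and only the real part is kept. The sketch below reproduces that mixing for a single tiny example with a naive DFT in plain Python. `dft` and `fourier_mix` are hypothetical names, and the O(n²) DFT is for illustration only.

```python
import cmath

def dft(values):
    """Naive 1D discrete Fourier transform."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * i * k / n) for k in range(n))
        for i in range(n)
    ]

def fourier_mix(x):
    """Real part of the 2D DFT of x, given as a list of rows (sequence x embedding)."""
    rows = [dft(row) for row in x]                  # transform along the embedding axis
    cols = [dft(list(col)) for col in zip(*rows)]   # then along the sequence axis
    # cols is indexed [embedding][sequence]; transpose back and keep the real part.
    return [[cols[j][i].real for j in range(len(cols))] for i in range(len(x))]

mixed = fourier_mix([[1.0, 2.0], [3.0, 4.0]])
print(mixed)  # approximately [[10.0, -2.0], [-4.0, 0.0]]
```

Every output element depends on every input element, which is how the layer lets tokens exchange information without any learned attention weights.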

<a id="6.2"></a>
### 6.2 Positional Embedding

In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super(PositionalEmbedding, self).__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)
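
The layer looks up one vector per token id and one vector per position and adds them, while `compute_mask` marks every non-zero id as a real token. A minimal plain-Python sketch with tiny hand-written lookup tables (all values and helper names are made up for illustration):

```python
def positional_embedding(token_ids, token_table, position_table):
    """Add the embedding of each token id to the embedding of its position."""
    return [
        [t + p for t, p in zip(token_table[token_id], position_table[position])]
        for position, token_id in enumerate(token_ids)
    ]

def padding_mask(token_ids):
    """True for real tokens, False for padding (id 0)."""
    return [token_id != 0 for token_id in token_ids]

token_table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]     # one row per token id
position_table = [[0.1, 0.0], [0.2, 0.0], [0.3, 0.0]]  # one row per position
embedded = positional_embedding([2, 1, 0], token_table, position_table)
mask = padding_mask([2, 1, 0])
print(embedded, mask)
```

The same token id gets a different vector at each position, which restores the word-order information that a plain embedding lookup would lose.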


<a id="6.3"></a>
### 6.3 Word2Vec FNet Model

In [None]:
def get_word2vec_model(config, inputs):
    x = word2vec_vectozier(inputs)
    x = PositionalEmbedding(config.sequence_length, config.vocab_size, config.embed_dim)(x)
    x = FNetEncoder(config.embed_dim, config.latent_dim)(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.5)(x)
    for i in range(3):
        x = layers.Dense(100, activation="relu")(x)
        x = layers.Dropout(0.3)(x)
    return x

<a id="6.4"></a>
### 6.4 TFIDF DNN Model

In [None]:
def get_tfidf_model(config, inputs):
    x = tfidf_vectozier(inputs)
    x = layers.Dense(256, activation="relu", kernel_regularizer="l2")(x)
    x = layers.Dense(100, activation="relu", kernel_regularizer="l2")(x)
    return x

<a id="6.5"></a>
### 6.5 The Whole Model

In [None]:
def get_model(config):
    inputs = keras.Input(shape=(None, ), dtype="string", name="inputs")
    word2vec_x = get_word2vec_model(config, inputs)
    tfidf_x = get_tfidf_model(config, inputs)
    x = layers.Concatenate()([word2vec_x, tfidf_x])
    output = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, output, name="model")
    return model
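
The fusion step is simple: the feature vectors from the two branches are concatenated and passed through a single sigmoid unit. A plain-Python sketch with made-up features and weights (`fused_probability` is a hypothetical helper; in the real model the weights are learned):

```python
import math

def fused_probability(word2vec_features, tfidf_features, weights, bias):
    """Concatenate both feature vectors and apply one sigmoid unit."""
    features = word2vec_features + tfidf_features  # list concatenation
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

prob = fused_probability([0.5, 1.0], [2.0], weights=[1.0, -1.0, 0.25], bias=0.0)
print(prob)  # 0.5
```

Because the final layer sees both branches at once, it can learn how much to trust the sequence features versus the bag-of-n-grams features.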

In [None]:
model = get_model(config)

In [None]:
model.summary()

Let's visualize the Model.

In [None]:
keras.utils.plot_model(model, show_shapes=True)

<a id="6.6"></a>
### 6.6 Model Training

In [None]:
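model.compile(
    # Name the AUC metric explicitly so the logged key is always "val_auc",
    # even if this cell or the model cell is re-run in the same session.
    "adam", loss="binary_crossentropy", metrics=["accuracy", tf.keras.metrics.AUC(name="auc")]
)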

In [None]:
# Accuracy and AUC should be maximized, so set mode="max" explicitly instead of relying on mode inference.
acc_checkpoint = keras.callbacks.ModelCheckpoint(config.best_acc_model_path, monitor="val_accuracy", mode="max", save_weights_only=True, save_best_only=True)
auc_checkpoint = keras.callbacks.ModelCheckpoint(config.best_auc_model_path, monitor="val_auc", mode="max", save_weights_only=True, save_best_only=True)
early_stopping = keras.callbacks.EarlyStopping(patience=10)
reduce_lr = keras.callbacks.ReduceLROnPlateau(patience=5, min_delta=1e-4, min_lr=1e-6)
model.fit(train_ds, epochs=config.epochs, validation_data=valid_ds, callbacks=[acc_checkpoint, auc_checkpoint, early_stopping, reduce_lr], class_weight=class_weight)
model.save_weights(config.lastest_model_path)
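
A checkpoint with `save_best_only=True` keeps the weights of the epoch with the best monitored value, and whether "best" means highest or lowest depends on the metric. The plain-Python sketch below shows only that selection logic on made-up validation curves; `best_epoch` is a hypothetical helper, not a Keras API.

```python
def best_epoch(history, mode):
    """Index of the epoch a save-best-only checkpoint would keep."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    return pick(range(len(history)), key=lambda epoch: history[epoch])

val_auc = [0.91, 0.95, 0.93]
val_loss = [0.40, 0.31, 0.35]
print(best_epoch(val_auc, "max"), best_epoch(val_loss, "min"))  # 1 1
print(best_epoch(val_auc, "min"))  # 0
```

The last line shows why the direction matters: treating AUC as a quantity to minimize would keep the weakest epoch instead of the strongest one.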

<a id="6.7"></a>
### 6.7 Evaluation

#### Classification Report

In [None]:
from sklearn.metrics import classification_report
# Flatten the (n, 1) probabilities to shape (n,) and threshold them at 0.5.
y_pred = (model.predict(valid_ds).reshape(-1) > 0.5).astype(int)
cls_report = classification_report(y_val, y_pred)
print(cls_report)
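
The classification report is built from thresholded predictions. As a reminder of what its precision and recall columns mean for the toxic class, here is a minimal plain-Python sketch on made-up labels and probabilities (`precision_recall` is a hypothetical helper):

```python
def precision_recall(y_true, probabilities, threshold=0.5):
    """Precision and recall of the positive class after thresholding."""
    y_pred = [int(p > threshold) for p in probabilities]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 0, 1, 1, 0]
probs = [0.9, 0.6, 0.4, 0.7, 0.1]
precision, recall = precision_recall(y_true, probs)
print(precision, recall)
```

Here two of the three predicted positives are correct and two of the three true positives are found, so both values are 2/3.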

<a id="7."></a>
## 7. Submission

In [None]:
test = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv")
sample_submission = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/sample_submission.csv")
test_ds = tf.data.Dataset.from_tensor_slices((test["text"])).batch(config.batch_size).cache().prefetch(1)
scores = []
for path in [config.best_acc_model_path, config.best_auc_model_path, config.lastest_model_path]:
    model.load_weights(path)
    score = model.predict(test_ds).reshape(-1)
    scores.append(score)
score = np.mean(scores, axis=0)
print(score.shape)
sample_submission["score"] = rankdata(score, method='ordinal')
sample_submission.to_csv("submission.csv", index=False)
sample_submission.head()
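
The submission cell averages the scores of the three saved models and then converts the averages to ordinal ranks. The plain-Python sketch below mirrors those two steps on made-up scores. `mean_scores` and `ordinal_ranks` are hypothetical helpers; `ordinal_ranks` is meant to behave like `rankdata(..., method='ordinal')`, assigning ranks 1..n with ties broken by order of appearance.

```python
def mean_scores(per_model_scores):
    """Average the scores of several models comment by comment."""
    return [sum(column) / len(column) for column in zip(*per_model_scores)]

def ordinal_ranks(scores):
    """Ranks 1..n in ascending score order; ties keep their original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

per_model = [[0.2, 0.9, 0.5], [0.4, 0.7, 0.5], [0.3, 0.8, 0.8]]
averaged = mean_scores(per_model)
ranks = ordinal_ranks(averaged)
print(ranks)  # [1, 3, 2]
```

Only the order of the scores matters for the submission, so ranking removes any difference in scale between the averaged probabilities.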


<a id="8."></a>
## 8. References
- [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824v3)
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762v5)
- [Text Generation using FNet](https://keras.io/examples/nlp/text_generation_fnet/)
- [English-Spanish Translation: FNet](https://www.kaggle.com/lonnieqin/english-spanish-translation-fnet)