# Jigsaw Toxicity Training with FNet
## Table of Contents
* [1. Overview](#1.)
* [2. Configuration](#2.)
* [3. Setup](#3.)
* [4. Tools](#4.)
* [5. Import datasets](#5.)
* [6. EDA & Preprocessing](#6.)
    * [6.1 Learn about Stemming](#6.1)
    * [6.2 Learn about Lemmatisation](#6.2)
    * [6.3 Select training data](#6.3)
    * [6.4 Token length statistics](#6.4)
    * [6.5 Build a Tokenizer](#6.5)
    * [6.6 Train Validation Split](#6.6)
    * [6.7 Create TensorFlow Dataset](#6.7)
    * [6.8 Calculate Class weight](#6.8)
* [7. Model Development](#7.)
    * [7.1 FNet Encoder](#7.1)
    * [7.2 Positional Embedding](#7.2)
    * [7.3 FNet Classification Model](#7.3)
    * [7.4 Model Training](#7.4)
    * [7.5 Evaluation](#7.5)
* [8. Submission](#8.)
* [9. References](#9.)

<font color="red" size="3">If you found this notebook useful and would like to support me, please upvote.</font>

<a id="1."></a>
## 1. Overview
In this notebook, I develop a Jigsaw toxicity prediction model using FNet from scratch.
According to the FNet paper, the model achieves 92-97% of BERT's accuracy on the GLUE benchmark while training 80% faster on GPUs and almost 70% faster on TPUs, so we can use it for quick experiments.

I built this sample with reference to [Text Generation using FNet](https://keras.io/examples/nlp/text_generation_fnet/). The toxicity ranking can be calculated from the probability output by a binary classifier.
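
As a minimal sketch of that idea, the snippet below turns a few made-up sigmoid outputs into ranks. `ordinal_rank` is a hypothetical helper written for illustration; it is meant to behave like `scipy.stats.rankdata(..., method="ordinal")`, which the submission cell uses.

```python
import numpy as np

def ordinal_rank(scores):
    """Rank scores from 1 (lowest) to n (highest); ties are broken by position."""
    scores = np.asarray(scores)
    order = np.argsort(scores, kind="stable")      # indices that sort the scores
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)   # position in sorted order = rank
    return ranks

# Hypothetical toxicity probabilities for four comments.
probabilities = [0.10, 0.85, 0.40, 0.85]
print(ordinal_rank(probabilities))  # [1 3 2 4]
```

Only the order of the probabilities matters for the ranking, so the classifier does not need to be well calibrated.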

I use the dataset from [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and combine it with the `more_toxic` comments from this competition's validation data for training.

Apart from building the model with FNet, I also build a custom tokenizer to vectorize the texts.

Currently this notebook scores 0.758 on the LB, which is not very good. A pretrained model and better text preprocessing could improve the score.

<a id="2."></a>
## 2. Configuration

In [None]:
class Config:
    vocab_size = 15000 # Vocabulary Size
    sequence_length = 100 # Length of sequence
    batch_size = 1024
    validation_split = 0.15
    embed_dim = 256
    latent_dim = 256
    oov_token = "<OOV>" # Out-of-vocabulary token
    bos_token = "<BOS>" # Begin of sequence token
    eos_token = "<EOS>" # End of Sequence token
    epochs = 50 # Number of Epochs to train
    model_path = "model.h5"
config = Config()

<a id="3."></a>
## 3. Setup

In [None]:
import pandas as pd
import tensorflow as tf
import pathlib
import random
import string
import re
import sys
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
import os
import sklearn
import seaborn as sns
from sklearn.model_selection import train_test_split
from nltk.tokenize import TweetTokenizer 
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from scipy.stats import rankdata
import json

<a id="4."></a>
## 4. Tools

### Tokenizer class
This class builds a vocabulary by fitting on a sequence of texts. It is similar to the Tokenizer in TensorFlow, but it also pads sequences and adds the Begin-of-Sentence and End-of-Sentence tokens in the same step. I built this class for fun, and it is more flexible to customize in the future. It accepts 5 parameters: vocabulary size, out-of-vocabulary token, Begin-of-Sentence token (can be null), End-of-Sentence token (can be null), and max sequence length.

`fit_transform` builds a vocabulary from a list of token lists like:
```python
[
    ["1", "2", "3", "4", "5"],
    ["1", "2", "3", "4", "5"]
]
```
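and returns vectors like the following (schematic: the actual ids are shifted by the reserved special tokens, and each sequence is padded with zeros to the max sequence length):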
```python
[
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 5]
]
```

The `transform` method works like `fit_transform` but reuses the existing vocabulary instead of building a new one.
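
To make the id layout concrete, here is a simplified, standalone sketch of the same idea. It is not the class below: `build_word_index` and `encode` are hypothetical helpers, and for brevity ids are assigned by frequency, whereas the class assigns them in order of first appearance.

```python
from collections import Counter

def build_word_index(token_lists, vocab_size, oov="<OOV>", bos="<BOS>", eos="<EOS>"):
    """Reserve ids for the special tokens, then add the most frequent words."""
    word_index = {oov: 1, bos: 2, eos: 3}          # id 0 is kept for padding
    counts = Counter(tok for toks in token_lists for tok in toks)
    for word, _ in counts.most_common(vocab_size - len(word_index)):
        word_index[word] = len(word_index) + 1
    return word_index

def encode(tokens, word_index, max_length, oov="<OOV>", bos="<BOS>", eos="<EOS>"):
    """Map tokens to ids, wrap them in BOS/EOS, then truncate or zero-pad."""
    ids = [word_index[bos]] + [word_index.get(t, word_index[oov]) for t in tokens]
    ids = ids[: max_length - 1] + [word_index[eos]]   # always leave room for EOS
    return ids + [0] * (max_length - len(ids))

corpus = [["you", "are", "great"], ["you", "are", "rude", "!"]]
word_index = build_word_index(corpus, vocab_size=6)
print(encode(["you", "are", "unknown"], word_index, max_length=8))
# [2, 4, 5, 1, 3, 0, 0, 0]
```

Note that the largest id equals `vocab_size` because ids start at 1, so any embedding table indexed by these ids needs `vocab_size + 1` rows.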


In [None]:
class Tokenizer:
    
    def __init__(self, vocab_size = None, oov_token = None, bos_token = None, eos_token = None, max_length = 10000):
        self.vocab_size = vocab_size
        self.oov_token = oov_token
        self.max_length = max_length
        self.bos_token = bos_token
        self.eos_token = eos_token
        
    stopwords = set(["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ])
    
    tweet_tokenizer = TweetTokenizer() 
    
    stemmer = PorterStemmer()
    
    lemmatizer = WordNetLemmatizer()
    
    @staticmethod
    def preprocess_string(text):
        # Convert sentences to lowercase.
        text = text.lower()
        # Remove punctuation, but keep ? and ! because they usually carry emotion.
        # Inside a character class "|" is a literal character, not a separator.
        text = re.sub(r'[\n .",:(){}#*/$—~;=\[\]|｜\-]+', " ", text)
        # Remove Digits
        text = re.sub("[0-9]+", " ", text)
        text = re.sub("[ ]+", " ", text)
        text = text.strip(" ")
        # Convert sentences to tokens
        items = Tokenizer.tweet_tokenizer.tokenize(text)
        # Remove stop words
        new_items = []
        for item in items:
            if item not in Tokenizer.stopwords:
                new_item = Tokenizer.lemmatizer.lemmatize(item)
                new_item = Tokenizer.stemmer.stem(new_item)
                new_items.append(new_item)
        return new_items
        
    def fit_transform(self, texts):
        current_index = 1
        word_index = {self.oov_token: current_index}
        if self.bos_token != None:
            current_index += 1
            word_index[self.bos_token] = current_index
        if self.eos_token != None:
            current_index += 1
            word_index[self.eos_token] = current_index

        word_count = {}
        for i in range(len(texts)):
            text = texts[i]
            for item in text:
                if item in word_count:
                    word_count[item] += 1
                else:
                    word_count[item] = 1
        word_count_df = pd.DataFrame({"key": word_count.keys(), "count": word_count.values()})
        word_count_df.sort_values(by="count", ascending=False, inplace=True)
        self.word_count_df = word_count_df
        vocab = list(word_index.keys())
        vocab += list(word_count_df["key"][0: self.vocab_size - len(word_index)])
        vocab = set(vocab)
        self.vocab = vocab
        
        sentences = []
        offset = 1 if self.eos_token != None else 0
        for i in range(len(texts)):
            text = texts[i]
            sentence = []
            if self.bos_token != None:
                sentence.append(word_index[self.bos_token])
            for item in text:
                if item in self.vocab:
                    if item in word_index:
                        sentence.append(word_index[item])
                    else:
                        current_index += 1
                        word_index[item] = current_index
                        sentence.append(word_index[item])
                else:
                    sentence.append(word_index[self.oov_token])
            if len(sentence) <= self.max_length - offset:
                if self.eos_token != None:
                    sentence.append(word_index[self.eos_token])
                sentence += [0] * (self.max_length - len(sentence))
            elif len(sentence) > self.max_length - offset:
                sentence = sentence[:self.max_length - offset]
                if self.eos_token != None:
                    sentence.append(word_index[self.eos_token])
            sentences.append(sentence)
        self.word_index = word_index
        self.index_word = dict({word_index[key]: key for key in word_index.keys()})
        return sentences
    
    def save(self, path):
        dic = {
            "vocab_size": self.vocab_size,
            "oov_token": self.oov_token,
            "max_length":  self.max_length,
            "vocab": list(self.vocab),
            "index_word": self.index_word,
            "word_index": self.word_index
        }
        if self.bos_token is not None:
            dic["bos_token"] = self.bos_token
        if self.eos_token is not None:
            dic["eos_token"] = self.eos_token
        res = json.dumps(dic)
        with open(path, "w+") as f:
            f.write(res)
            
    def load(self, path):
        with open(path, "r") as f:
            dic = json.load(f)
        self.vocab_size = dic["vocab_size"]
        self.oov_token = dic["oov_token"]
        self.max_length = dic["max_length"]
        self.vocab = set(dic["vocab"])
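        # JSON turns integer keys into strings, so convert them back.
        self.index_word = {int(key): value for key, value in dic["index_word"].items()}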
        self.word_index = dic["word_index"]
        if "bos_token" in dic:
            self.bos_token = dic["bos_token"]
        if "eos_token" in dic:
            self.eos_token = dic["eos_token"]
            
    def transform(self, texts):
        sentences = []
        offset = 1 if self.eos_token != None else 0
        for i in range(len(texts)):
            text = texts[i]
            sentence = []
            if self.bos_token != None:
                sentence.append(self.word_index[self.bos_token])
            for item in text:
                if item in self.vocab:
                    sentence.append(self.word_index[item])
                else:
                    sentence.append(self.word_index[self.oov_token])
            if len(sentence) == self.max_length - offset:
                if self.eos_token != None:
                    sentence.append(self.word_index[self.eos_token])
            elif len(sentence) < self.max_length - offset:
                if self.eos_token != None:
                    sentence.append(self.word_index[self.eos_token])
                sentence += [0] * (self.max_length - len(sentence))
            elif len(sentence) > self.max_length - offset:
                sentence = sentence[:self.max_length - offset]
                if self.eos_token != None:
                    sentence.append(self.word_index[self.eos_token])
            sentences.append(sentence)
        return sentences
            

<a id="5."></a>
## 5. Import datasets

In [None]:
validation_data = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv")
validation_data.head()

In [None]:
train = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv")
train.head()

<a id="6."></a>
## 6. EDA & Preprocessing

<a id="6.1"></a>
### 6.1 Learn about Stemming

Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). The stem is not necessarily a dictionary word, which is what distinguishes it from a lemma.

In [None]:
stemmer = PorterStemmer()
print(stemmer.stem("going"))
print(stemmer.stem("dogs"))
print(stemmer.stem("leaves"))

<a id="6.2"></a>
### 6.2 Learn about Lemmatisation

Lemmatisation in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

In [None]:
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("going"))
print(lemmatizer.lemmatize("dogs"))
print(lemmatizer.lemmatize("leaves"))
print(stemmer.stem("leaf"))
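
The difference between the two can be sketched without NLTK. The helpers below are toy stand-ins written for illustration, not the Porter algorithm or WordNet: `toy_stem` blindly strips a suffix, while `toy_lemmatize` looks the word up in a small hypothetical dictionary.

```python
def toy_stem(word, suffixes=("ing", "es", "s")):
    """Strip the first matching suffix; no dictionary is consulted."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
TOY_LEMMAS = {"leaves": "leaf", "dogs": "dog", "went": "go"}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["going", "dogs", "leaves", "went"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based stemmer produces the non-word `leav` for `leaves`, while the lookup returns `leaf`; on the other hand the lookup only helps for words it knows about.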

<a id="6.3"></a>
### 6.3 Select Training Data

One way is to label `less_toxic` as 0 and `more_toxic` as 1; with this, FNet gets a score of 0.749. I tried grouping duplicated comments together and replacing the label with the average value, but got a worse score of 0.49 instead. I also tried converting the average value to a class value, but the model still could not learn anything useful from it. So I keep every variable we may use in the future in a data table.


Another way is to use the external dataset from [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge). Since it has a class imbalance problem, I also add the `more_toxic` comments from this competition's validation data and label them as 1.
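
The first labelling scheme and the duplicate averaging can be sketched on a few made-up pairs. The comments and variable names here are hypothetical; the point is that the average is keyed by the comment text itself, so labels cannot drift out of alignment with their comments.

```python
from collections import defaultdict

# Hypothetical annotator pairs: (less_toxic, more_toxic).
pairs = [("comment a", "comment b"), ("comment b", "comment c"), ("comment a", "comment c")]

# Label the less toxic side 0 and the more toxic side 1.
rows = [(less, 0) for less, _ in pairs] + [(more, 1) for _, more in pairs]

# Average the labels of duplicated comments, keyed by the text itself.
labels_by_text = defaultdict(list)
for text, label in rows:
    labels_by_text[text].append(label)
average_label = {text: sum(v) / len(v) for text, v in labels_by_text.items()}

print(average_label)  # {'comment a': 0.0, 'comment b': 0.5, 'comment c': 1.0}
```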

In [None]:
sys.setrecursionlimit(100000)
import time
begin = time.time()
use_external_dataset = True
if use_external_dataset:
    train = train[["comment_text", "toxic"]]
    train.columns = ["text", "label"]
    # Add More toxic data to mitigate class imbalance problem
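    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
    train = pd.concat(
        [train, pd.DataFrame({"text": validation_data["more_toxic"], "label": [1] * len(validation_data)})],
        ignore_index=True,
    )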
else:
    # DataFrame.append was removed in pandas 2.0, so build both halves and concatenate.
    data = pd.concat(
        [
            pd.DataFrame({"text": validation_data["less_toxic"], "label": [0] * len(validation_data)}),
            pd.DataFrame({"text": validation_data["more_toxic"], "label": [1] * len(validation_data)}),
        ],
        ignore_index=True,
    )
    grouped = data.groupby("text")
    # Build the mapping from the grouped result so texts and mean labels stay aligned
    # (groupby sorts its keys, while unique() keeps the order of appearance).
    text_label_dict = grouped["label"].mean().to_dict()
    index_label = sorted(grouped.mean()["label"].unique())
    data["average_value"] = data["text"].apply(lambda text: text_label_dict[text])
    data["class"] = data["average_value"].apply(lambda value: index_label.index(value))
    classes = sorted(data["class"].unique())
    print("Classes:", classes)
    train = data[["text", "label"]]
tokens = []
last_index = len(train) - 1
for i in range(len(train)):
    tokens.append(Tokenizer.preprocess_string(train.iloc[i]["text"]))
    if (i + 1) % 10000 == 0 or i == last_index:
        current = time.time() - begin
        done = i + 1  # Number of rows processed so far; avoids dividing by zero
        print("%.2fs-%.2fs: %.2f%%" % (current, current * len(train) / done, done / len(train) * 100))
train["token"] = tokens
train["token_length"] = train["token"].apply(len)
train = sklearn.utils.shuffle(train)

<a id="6.4"></a>
### 6.4 Token Length Statistics
The average token length is 39 and most comments are under 100 tokens, so a sequence length of 100 is enough.
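
One way to check such a choice is to measure how many comments fit within a candidate length. This is a minimal sketch on made-up token counts; `coverage` is a hypothetical helper, and the numbers are not taken from the real dataset.

```python
def coverage(lengths, max_length):
    """Fraction of sequences whose token count fits within max_length."""
    return sum(1 for n in lengths if n <= max_length) / len(lengths)

# Hypothetical token counts for eight comments.
token_lengths = [5, 12, 39, 40, 75, 98, 130, 310]
for candidate in (50, 100, 200):
    print(candidate, f"{coverage(token_lengths, candidate):.2%}")
```

On the real data the same function could be applied to `train["token_length"]`. Keep in mind that the tokenizer in this notebook spends two positions on the BOS and EOS tokens, so a sequence length of 100 leaves room for 98 real tokens.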

In [None]:
train[["token_length"]].describe()

In [None]:
train["token_length"][train["token_length"] <= 100].hist()

In [None]:
train["label"].hist()

<a id="6.5"></a>
### 6.5 Build a Tokenizer

In [None]:
tokenizer = Tokenizer(
    vocab_size=config.vocab_size, 
    oov_token=config.oov_token, 
    bos_token=config.bos_token,
    eos_token=config.eos_token,
    max_length=config.sequence_length
)
sequences = tokenizer.fit_transform(list(train["token"]))
train["sequence"] = sequences
train.head()

Number of words:

In [None]:
len(tokenizer.index_word)

Save the Tokenizer:

In [None]:
tokenizer.save("tokenizer.json")

Load the Tokenizer:

In [None]:
new_tokenizer = Tokenizer()
new_tokenizer.load("tokenizer.json")

Number of Words that seldom appear:

In [None]:
word_count_seldom_appear = {"word_count": [], "num_words": []}
for i in range(1, 10):
    word_count_seldom_appear["word_count"].append(i)
    word_count_seldom_appear["num_words"].append(len(tokenizer.word_count_df[tokenizer.word_count_df["count"] <= i]))
sns.barplot(x="word_count", y="num_words", data=pd.DataFrame(word_count_seldom_appear))

<a id="6.6"></a>
### 6.6 Train Validation Split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(train["sequence"], train["label"], test_size=config.validation_split)

In [None]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

<a id="6.7"></a>
### 6.7 Create TensorFlow Dataset

In [None]:
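def make_dataset(X, y, batch_size, mode):
    dataset = tf.data.Dataset.from_tensor_slices((X, y))
    # Cache before shuffling so that every epoch gets a fresh shuffle order;
    # caching after shuffle would freeze the order of the first epoch.
    dataset = dataset.cache()
    if mode == "train":
        dataset = dataset.shuffle(256)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(16)
    return dataset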

In [None]:
train_ds = make_dataset(list(X_train), list(y_train), batch_size=config.batch_size, mode="train")
valid_ds = make_dataset(list(X_val), list(y_val), batch_size=config.batch_size, mode="valid")

Let's see what this data looks like.

In [None]:
for batch in train_ds.take(1):
    print(batch)

<a id="6.8"></a>
### 6.8 Calculate Class weight

In [None]:
class_weight =  dict(len(train) / train["label"].value_counts())
class_weight
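
The cell above weights each class by the total number of samples divided by the class count. A small standalone sketch of the same formula, using made-up labels and a hypothetical helper name:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total_samples / class_count."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / n for cls, n in counts.items()}

# Hypothetical labels: 8 non-toxic comments and 2 toxic ones.
labels = [0] * 8 + [1] * 2
print(inverse_frequency_weights(labels))  # {0: 1.25, 1: 5.0}
```

With these weights a mistake on the rare class costs four times as much as one on the common class, which is the effect `class_weight` has on the loss during `fit`.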

<a id="7."></a>
## 7. Model Development

<a id="7.1"></a>
### 7.1 FNet Encoder

In [None]:
class FNetEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, dropout_rate=0.1, **kwargs):
        super(FNetEncoder, self).__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(dense_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs):
        # Casting the inputs to complex64
        inp_complex = tf.cast(inputs, tf.complex64)
        # Projecting the inputs to the frequency domain using FFT2D and
        # extracting the real part of the output
        fft = tf.math.real(tf.signal.fft2d(inp_complex))
        proj_input = self.layernorm_1(inputs + fft)
        proj_output = self.dense_proj(proj_input)
       
        layer_norm = self.layernorm_2(proj_input + proj_output)
        output = self.dropout(layer_norm)
        return output
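
The Fourier mixing step in `call` can be illustrated with NumPy alone. This is a sketch with made-up tensor sizes, assuming `np.fft.fft2` as a stand-in for `tf.signal.fft2d` (both transform the two innermost axes). It shows that the 2D transform equals two 1D transforms and that changing a single token affects every position in its sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))   # (batch, sequence, embedding), made-up sizes

# 2D FFT over the sequence and embedding axes, keeping only the real part.
mixed = np.fft.fft2(x).real

# Equivalent: a 1D FFT over the embedding axis followed by one over the sequence axis.
two_step = np.fft.fft(np.fft.fft(x, axis=-1), axis=-2).real
print(np.allclose(mixed, two_step))  # True

# Perturb only the first token of the first sequence.
x2 = x.copy()
x2[0, 0] += 1.0
changed = np.any(~np.isclose(np.fft.fft2(x2).real, mixed), axis=-1)
print(changed[0])  # [ True  True  True  True]: every position in that sequence changes
print(changed[1])  # [False False False False]: the other sequence is untouched
```

This mixing has no trainable parameters, which is why the encoder only learns the dense projection and the layer normalizations.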

<a id="7.2"></a>
### 7.2 Positional Embedding

In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super(PositionalEmbedding, self).__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)
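
What the layer computes can be sketched with plain NumPy lookups. The tables below are random stand-ins for the learned embeddings and all sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sequence_length, embed_dim = 10, 5, 4          # made-up sizes
token_table = rng.normal(size=(vocab_size, embed_dim))     # stand-in for token_embeddings
position_table = rng.normal(size=(sequence_length, embed_dim))  # stand-in for position_embeddings

# Two padded sequences of token ids; 0 is the padding id.
token_ids = np.array([[2, 7, 3, 0, 0],
                      [2, 5, 0, 0, 0]])

positions = np.arange(token_ids.shape[-1])
# The position vectors have shape (sequence, embed) and broadcast over the batch.
embedded = token_table[token_ids] + position_table[positions]
mask = token_ids != 0                                      # same rule as compute_mask

print(embedded.shape)  # (2, 5, 4)
print(mask)
```

The same token id gets a different vector at each position, which gives the model access to word order.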


<a id="7.3"></a>
### 7.3 FNet Classification Model

In [None]:
def get_fnet_classifier(config):
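    # shape must be a tuple, hence the trailing comma.
    inputs = keras.Input(shape=(config.sequence_length,), dtype="int64", name="encoder_inputs")
    # Token ids run from 0 (padding) to vocab_size inclusive, so the table needs vocab_size + 1 rows.
    x = PositionalEmbedding(config.sequence_length, config.vocab_size + 1, config.embed_dim)(inputs)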
    x = FNetEncoder(config.embed_dim, config.latent_dim)(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.3)(x)
    for i in range(3):
        x = layers.Dense(100, activation="relu")(x)
        x = layers.Dropout(0.3)(x)
    output = layers.Dense(1, activation="sigmoid")(x)
    fnet = keras.Model(inputs, output, name="fnet")
    return fnet

In [None]:
fnet = get_fnet_classifier(config)

In [None]:
fnet.summary()

Let's visualize the Model.

In [None]:
keras.utils.plot_model(fnet, show_shapes=True)


<a id="7.4"></a>
### 7.4 Model Training

In [None]:
fnet.compile(
    "adam", loss="binary_crossentropy", metrics=["accuracy", tf.keras.metrics.AUC()]
)

In [None]:
checkpoint = keras.callbacks.ModelCheckpoint(config.model_path, monitor="val_accuracy",save_weights_only=True, save_best_only=True)
early_stopping = keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10)
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_accuracy", patience=5, min_delta=1e-4, min_lr=1e-6)
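# early_stopping was defined above but never passed to fit; include it so training can stop early.
fnet.fit(train_ds, epochs=config.epochs, validation_data=valid_ds, callbacks=[checkpoint, early_stopping, reduce_lr], class_weight=class_weight)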
fnet.save_weights("model_latest.h5")

In [None]:
fnet.load_weights(config.model_path)

<a id="7.5"></a>
### 7.5 Evaluation

#### Classification Report

In [None]:
from sklearn.metrics import classification_report
# Flatten the (N, 1) predictions to shape (N,) before thresholding at 0.5.
y_pred = (fnet.predict(valid_ds).reshape(-1) > 0.5).astype(int)
cls_report = classification_report(y_val, y_pred)
print(cls_report)
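
For readers who want to see where the numbers in the report come from, here is a minimal sketch that computes precision, recall and F1 for the positive class by hand. `binary_report` is a hypothetical helper and the labels and probabilities are made up.

```python
def binary_report(y_true, y_prob, threshold=0.5):
    """Precision, recall and F1 for the positive class at a given threshold."""
    y_pred = [int(p > threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical validation labels and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.2, 0.7, 0.9, 0.4, 0.8, 0.1]
print(binary_report(y_true, y_prob))
```

Here two of the three predicted positives are correct and two of the three true positives are found, so precision, recall and F1 all come out to about 0.67.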

<a id="8."></a>
## 8. Submission

In [None]:
test = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv")
sample_submission = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/sample_submission.csv")
test["text_preprocessed"] = test["text"].apply(Tokenizer.preprocess_string)
test_sequences = tokenizer.transform(list(test["text_preprocessed"]))
print(test_sequences[0])
test_ds = tf.data.Dataset.from_tensor_slices((test_sequences)).batch(config.batch_size).prefetch(1)
score = fnet.predict(test_ds).reshape(-1)
sample_submission["score"] = rankdata(score, method='ordinal')
sample_submission.to_csv("submission.csv", index=False)
sample_submission.head()


<a id="9."></a>
## 9. References
- [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824v3)
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762v5)
- [Text Generation using FNet](https://keras.io/examples/nlp/text_generation_fnet/)
- [English-Spanish Translation: FNet](https://www.kaggle.com/lonnieqin/english-spanish-translation-fnet)