# Data Mining Final Project.
# Toxic Comment Classification.
## Authors : Team - 7 
## Akhil Thakur
## Spriha Awasthi
## Ajay Sadananda
<br>
<br>

# Project Title : Toxic Comment Classification.

In this notebook, we use machine learning to classify toxic comments. We first perform exploratory data analysis (EDA) on the dataset, which comes from a Kaggle competition ([Link](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/data)). We then use the following models to classify the comments:
* Simple RNNs
* LSTMs
* Bidirectional LSTMs
* GRUs
* BERT

# Loading the libraries
Let's first load all the necessary packages.

In [None]:
## Required packages to run the code
#!pip install tensorflow
#!pip install pyicu
#!pip install pycld2
# !pip install polyglot
# !pip install textstat
# !pip install googletrans
# !pip install plotly==5.4.0

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import Sequential
#from tensorflow.keras.layers import LSTM, GRU,SimpleRNN
#from tensorflow.keras.layers import Dense, Activation, Dropout
#from tensorflow.keras.layers import Embedding
#from tensorflow.keras.layers import BatchNormalization
#from tensorflow.keras.utils import np_utils
from tensorflow.keras.utils import to_categorical

from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
#from tensorflow.keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.callbacks import EarlyStopping

from tensorflow.keras.callbacks import Callback
from sklearn.metrics import accuracy_score, roc_auc_score
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, CSVLogger

from tensorflow.keras.models import Model
from kaggle_datasets import KaggleDatasets
from tensorflow.keras.optimizers import Adam
from tokenizers import BertWordPieceTokenizer
from tensorflow.keras.layers import Dense, Input, Dropout, Embedding
from tensorflow.keras.layers import LSTM, GRU, SimpleRNN, Bidirectional, Conv1D, SpatialDropout1D

from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras import activations
from tensorflow.keras import constraints
from tensorflow.keras import initializers
from tensorflow.keras import regularizers

import tensorflow.keras.backend as K
from tensorflow.keras.layers import *
from tensorflow.keras.optimizers import *
from tensorflow.keras.activations import *
from tensorflow.keras.constraints import *
from tensorflow.keras.initializers import *
from tensorflow.keras.regularizers import *

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
os.environ['TF_XLA_FLAGS'] = '--tf_xla_enable_xla_devices'
import warnings
warnings.filterwarnings("ignore")

import gc
import re
import folium
import textstat
from scipy import stats
from colorama import Fore, Back, Style, init

import math
import scipy as sp

import random
import networkx as nx
from pandas import Timestamp

from PIL import Image
from IPython.display import SVG
from tensorflow.keras.utils import  model_to_dot

import requests
from IPython.display import HTML

import matplotlib.cm as cm

tqdm.pandas()

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

import transformers


from sklearn import metrics
from sklearn.utils import shuffle
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer,\
                                            CountVectorizer,\
                                            HashingVectorizer

from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer  

import nltk
from textblob import TextBlob

from nltk.corpus import wordnet
from nltk.corpus import stopwords
from googletrans import Translator
from polyglot.detect import Detector
from wordcloud import WordCloud, STOPWORDS
from nltk.sentiment.vader import SentimentIntensityAnalyzer

stopword=set(STOPWORDS)

lem = WordNetLemmatizer()
tokenizer=TweetTokenizer()

np.random.seed(0)

# Configuring TPUs

We use TPUs to accelerate training. For this project, we use the TPUs provided by Kaggle.

In [None]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    #tpu = None
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

In [None]:
train_data = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
val_data = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test_data = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')

Now, let's have a look at the data.

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
val_data.head()

## Sentiment and polarity <a id="1.4"></a>

Sentiment and polarity are quantities that reflect the emotion and intention behind a sentence. Let's look at the sentiment of the comments using the VADER analyzer from the NLTK (Natural Language Toolkit) library.

In [None]:
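SIA = SentimentIntensityAnalyzer()

def polarity(x):
    # Score string comments with VADER; anything else (e.g. NaN) gets an
    # all-neutral score dict so the dict lookups in later cells still work.
    if isinstance(x, str):
        return SIA.polarity_scores(x)
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

train_data["polarity"] = train_data["comment_text"].progress_apply(polarity)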

### Negative sentiment

Negative sentiment refers to negative or pessimistic emotions. It is a score between 0 and 1; the greater the score, the more negative the subject is.

In [None]:
fig = go.Figure(go.Histogram(x=[pols["neg"] for pols in train_data["polarity"] if pols["neg"] != 0], marker=dict(
            color='seagreen')
    ))

fig.update_layout(xaxis_title="Negativity sentiment", title_text="Negativity sentiment", template="simple_white")
fig.show()

From the above plot, we can see that negative sentiment has a positive skew, indicating that negativity is usually on the lower side. This suggests that most comments are not toxic or negative.
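
To make "positive skew" concrete, here is a minimal sketch that computes the skewness of a handful of scores by hand. The function name `sample_skewness` and the toy scores are made up for illustration; they are not taken from the dataset.

```python
import math

def sample_skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    return sum(((v - mean) / std) ** 3 for v in values) / n

# Hypothetical negativity scores: most are small, a few are large.
toy_neg_scores = [0.05, 0.05, 0.08, 0.10, 0.12, 0.15, 0.60, 0.85]
skew = sample_skewness(toy_neg_scores)
print(skew > 0)  # True: the long tail is on the right
```

A positive value means the long tail lies to the right of the bulk of the data, which is the shape we see in the histogram.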

### Negativity vs. Toxicity

In [None]:
nums_1 = train_data.sample(frac=0.1).query("toxic == 1")
nums_1 = [pols["neg"] for pols in nums_1["polarity"]]
nums_2 = train_data.sample(frac=0.1).query("toxic == 0")
nums_2 = [pols["neg"] for pols in nums_2["polarity"]]


fig = ff.create_distplot(hist_data=[nums_1, nums_2],
                         group_labels=["Toxic", "Non-toxic"],
                         colors=["darkorange", "dodgerblue"], show_hist=False)

fig.update_layout(title_text="Negativity vs. Toxicity", xaxis_title="Negativity", template="simple_white")
fig.show()

We can clearly see that toxic comments have a significantly greater negative sentiment than non-toxic comments (on average). The probability density of negativity peaks at around 0 for non-toxic comments, while the density for toxic comments is at its minimum at this point. This suggests that a comment is very likely to be non-toxic if it has a negativity of 0.

### Positive sentiment

Positive sentiment refers to positive or optimistic emotions. It is a score between 0 and 1; the greater the score, the more positive the subject is.

In [None]:
fig = go.Figure(go.Histogram(x=[pols["pos"] for pols in train_data["polarity"] if pols["pos"] != 0], marker=dict(
            color='indianred')))

fig.update_layout(xaxis_title="Positivity sentiment", title_text="Positivity sentiment", template="simple_white")
fig.show()

From the above plot, we can see that positive sentiment has a positive skew, indicating that positivity is usually on the lower side. This suggests that most comments do not express positivity explicitly.

### Positivity vs. Toxicity

In [None]:
nums_1 = train_data.sample(frac=0.1).query("toxic == 1")
nums_1 = [pols["pos"] for pols in nums_1["polarity"]]
nums_2 = train_data.sample(frac=0.1).query("toxic == 0")
nums_2 = [pols["pos"] for pols in nums_2["polarity"]]


fig = ff.create_distplot(hist_data=[nums_1, nums_2],
                         group_labels=["Toxic", "Non-toxic"],
                         colors=["darkorange", "dodgerblue"], show_hist=False)

fig.update_layout(title_text="Positivity vs. Toxicity", xaxis_title="Positivity", template="simple_white")
fig.show()

Here we have plotted the distribution of positivity for toxic and non-toxic comments above. We can see that both the distributions are very similar, indicating that positivity is not an accurate indicator of toxicity in comments.

### Neutrality sentiment

Neutrality sentiment refers to the level of bias or opinion in the text. It is a score between 0 and 1; the greater the score, the more neutral/unbiased the subject is.

In [None]:
fig = go.Figure(go.Histogram(x=[pols["neu"] for pols in train_data["polarity"] if pols["neu"] != 1], marker=dict(
            color='dodgerblue')
    ))

fig.update_layout(xaxis_title="Neutrality sentiment", title_text="Neutrality sentiment", template="simple_white")
fig.show()

From the above plot, we can see that the neutrality sentiment distribution has a negative skew, which is in contrast to the negativity and positivity sentiment distributions. This indicates that the comments tend to be very neutral and unbiased in general.

### Neutrality vs. Toxicity

In [None]:
nums_1 = train_data.sample(frac=0.1).query("toxic == 1")
nums_1 = [pols["neu"] for pols in nums_1["polarity"]]
nums_2 = train_data.sample(frac=0.1).query("toxic == 0")
nums_2 = [pols["neu"] for pols in nums_2["polarity"]]

fig = ff.create_distplot(hist_data=[nums_1, nums_2],
                         group_labels=["Toxic", "Non-toxic"],
                         colors=["darkorange", "dodgerblue"], show_hist=False)

fig.update_layout(title_text="Neutrality vs. Toxicity", xaxis_title="Neutrality", template="simple_white")
fig.show()

We can see that non-toxic comments tend to have a higher neutrality value than toxic comments on average. The probability density of the non-toxic distribution experiences a sudden jump at 1, and the probability density of the toxic distribution is significantly lower at the same point. This suggests that a comment with neutrality close to 1 is more likely to be non-toxic than toxic.

### Compound sentiment

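Compound sentiment is a single normalized score that summarizes the overall sentiment of a sentence. It lies between -1 and 1: values near -1 indicate strongly negative text, values near 1 indicate strongly positive text, and values near 0 indicate neutral or mixed text.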
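
For intuition about how a single score ends up in the range from -1 to 1, here is a small sketch of the kind of normalization VADER applies to the summed word valences. To our understanding, the reference implementation uses x / sqrt(x^2 + alpha) with alpha = 15; treat both the formula and the constant as assumptions, and note that the name `normalize_compound` is ours.

```python
import math

def normalize_compound(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

# Larger absolute sums move the score closer to -1 or 1.
for total in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(total, round(normalize_compound(total), 3))
```

The sign of the score follows the sign of the summed valences, so strongly negative comments land near -1 and strongly positive ones near 1.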

In [None]:
fig = go.Figure(go.Histogram(x=[pols["compound"] for pols in train_data["polarity"] if pols["compound"] != 0], marker=dict(
            color='orchid')
    ))

fig.update_layout(xaxis_title="Compound sentiment", title_text="Compound sentiment", template="simple_white")
fig.show()

From the distribution above, we can see that compound sentiment is spread across the whole spectrum (from -1 to 1), with very high variance and several peaks throughout the range.

### Compound sentiment vs. Toxicity

In [None]:
nums_1 = train_data.sample(frac=0.1).query("toxic == 1")
nums_1 = [pols["compound"] for pols in nums_1["polarity"]]
nums_2 = train_data.sample(frac=0.1).query("toxic == 0")
nums_2 = [pols["compound"] for pols in nums_2["polarity"]]


fig = ff.create_distplot(hist_data=[nums_1, nums_2],
                         group_labels=["Toxic", "Non-toxic"],
                         colors=["darkorange", "dodgerblue"], show_hist=False)

fig.update_layout(title_text="Compound vs. Toxicity", xaxis_title="Compound", template="simple_white")
fig.show()

We can see that compound sentiment tends to be higher for non-toxic comments as compared to toxic comments. The non-toxic distribution has a negative skew, while the toxic distribution has a positive skew. This indicates that non-toxic comments tend to have a higher compound sentiment than toxic comments on average.

## Targets <a id="1.6"></a>

Targets are the outputs of our classification. The dataset has several toxicity classes: toxic, severe_toxic, obscene, threat, insult, and identity_hate. Let's visualize the targets in the dataset.

In [None]:
fig = go.Figure(data=[
    go.Pie(labels=train_data.columns[2:7],
           values=train_data.iloc[:, 2:7].sum().values, marker=dict(colors=px.colors.qualitative.Plotly))
])
fig.update_traces(textposition='outside', textfont=dict(color="black"))
fig.update_layout(title_text="Pie chart of labels")
fig.show()

From the pie chart above, we can see that the most common target is toxic, and the other targets, such as insult and threat, are relatively uncommon.

# Toxic comment classification as a binary classification problem.

* We drop the other label columns and approach this as a binary classification problem. We also work on a smaller subset of the dataset (only 12,000 data points) to make the models easier to train.

In [None]:
train_data.drop(['severe_toxic','obscene','threat','insult','identity_hate'],axis=1,inplace=True)

In [None]:
train_data = train_data.iloc[:12000]  # first 12,000 rows (.loc[:12000] is inclusive and would keep 12,001)
train_data.shape

We check the maximum number of words in a comment; this will help us choose a padding length later.

In [None]:
train_data['comment_text'].apply(lambda x:len(str(x).split())).max()

We write a helper function that computes the AUC score on the validation set.

In [None]:
def roc_auc(predictions, target):
    '''
    Return the area under the ROC curve (AUC) for the given
    predicted scores and true labels.
    '''
    fpr, tpr, thresholds = metrics.roc_curve(target, predictions)
    return metrics.auc(fpr, tpr)
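
One way to read the AUC score: it is the probability that a randomly chosen toxic comment receives a higher score than a randomly chosen non-toxic one. The sketch below computes it by brute force over all positive/negative pairs. The helper `pairwise_auc` and the toy data are illustrative only, and the quadratic loop is meant for intuition, not for real datasets.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))

toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(pairwise_auc(toy_scores, toy_labels))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, hence 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.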

### Data Preparation

In [None]:
xtrain, xvalid, ytrain, yvalid = train_test_split(train_data.comment_text.values, train_data.toxic.values, 
                                                  stratify=train_data.toxic.values, 
                                                  random_state=42, 
                                                  test_size=0.2, shuffle=True)

# Simple RNN

A Recurrent Neural Network (RNN) is a type of neural network in which the output of the previous step is fed as input to the current step. In a traditional feed-forward network, all inputs and outputs are independent of each other. But for tasks such as predicting the next word of a sentence, the network needs to remember the previous words. RNNs address this with a hidden state that carries information from one step to the next.
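
The recurrence can be sketched in a few lines of plain Python. This is only an illustration of the update h_t = tanh(W_x x_t + W_h h_{t-1} + b), not the Keras implementation; the weights below are made-up numbers and the names `rnn_step` and `run_rnn` are ours.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, bias):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for i in range(len(bias)):
        total = bias[i]
        total += sum(w_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(w_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(inputs, w_x, w_h, bias):
    """Feed a sequence through the cell and return the final hidden state."""
    h = [0.0] * len(bias)
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, bias)
    return h

# Hypothetical weights: 2 input features, 2 hidden units.
W_X = [[0.5, -0.3], [0.1, 0.8]]
W_H = [[0.2, 0.0], [-0.4, 0.6]]
B = [0.0, 0.1]

sequence_a = [[1.0, 0.0], [0.0, 1.0]]
sequence_b = [[0.0, 1.0], [1.0, 0.0]]  # same inputs, reversed order
print(run_rnn(sequence_a, W_X, W_H, B))
print(run_rnn(sequence_b, W_X, W_H, B))
```

The two sequences contain the same inputs in a different order, yet they produce different final states. That order sensitivity is what the hidden state buys us.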

In [None]:
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 1500

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

#zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index

In [None]:
%%time
with strategy.scope():
    # A simpleRNN without any pretrained embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                     300,
                     input_length=max_len))
    model.add(SimpleRNN(100))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'], steps_per_execution=32)
    
model.summary()

In [None]:
#early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy',patience=2, restore_best_weights=True)
model.fit(xtrain_pad, ytrain,epochs=5, batch_size=16*strategy.num_replicas_in_sync) #Multiplying by Strategy to run on TPU's

In [None]:
scores = model.predict(xvalid_pad)
print("AUC: %.4f" % roc_auc(scores, yvalid))

In [None]:
scores_model = []
scores_model.append({'Model': 'SimpleRNN','AUC_Score': roc_auc(scores,yvalid)})

## Summary
* Tokenization<br><br>
 In an RNN, we input a sentence word by word. Conceptually, every word is represented as a one-hot vector of dimension (number of words in the vocabulary + 1). <br>
  The Keras Tokenizer takes all the unique words in the corpus and forms a dictionary with words as keys and their number of occurrences as values. It then sorts the dictionary in descending order of counts and assigns index 1 to the first word, 2 to the second, and so on. So if the word 'the' occurs most often in the corpus, it is assigned index 1, and the vector representing 'the' is a one-hot vector with value 1 at position 1 and zeros elsewhere.<br>

* Comments on the model<br><br>
Our model achieves a training accuracy of almost 1, which is a clear sign of overfitting. This was the simplest model of all, though: we could tune hyperparameters such as the number of RNN units, or add batch normalization and dropout, to get better results. The point is that we got an AUC score of 0.82 without much effort.
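
The tokenization step can be mimicked in plain Python. This is a simplified sketch, not the Keras Tokenizer itself: it only lowercases and splits on whitespace (no punctuation filtering), and the helper names `fit_word_index` and `texts_to_padded` are ours. The left padding and left truncation follow what we understand to be the defaults of `pad_sequences`.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index, max_len):
    """Map words to indices, then left-pad with zeros to max_len."""
    padded = []
    for t in texts:
        ids = [word_index[w] for w in t.lower().split() if w in word_index]
        ids = ids[-max_len:]                              # keep the last max_len tokens
        padded.append([0] * (max_len - len(ids)) + ids)   # index 0 is reserved for padding
    return padded

toy_texts = ["the cat sat", "the dog"]
toy_index = fit_word_index(toy_texts)
print(toy_index)  # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4}
print(texts_to_padded(toy_texts, toy_index, max_len=4))  # [[0, 1, 2, 3], [0, 0, 1, 4]]
```

Index 0 never refers to a real word, which is why the embedding layers in this notebook are sized `len(word_index) + 1`.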

# Word Embeddings
<br>
One approach to obtaining word embeddings is to use pretrained GloVe vectors, which is what we do in this notebook. You can download the GloVe vectors from http://www-nlp.stanford.edu/data/glove.840B.300d.zip, or search for GloVe in the datasets on Kaggle and add the file.

In [None]:
# Load the GloVe vectors into a dictionary: word -> 300-d vector.
embeddings_index = {}
with open('/kaggle/input/glove840b300dtxt/glove.840B.300d.txt', 'r', encoding='utf-8') as f:
    for line in tqdm(f):
        values = line.rstrip().split(' ')
        # Some tokens in this file may contain spaces, so take the last 300
        # fields as the vector and everything before them as the word.
        word = ' '.join(values[:-300])
        coefs = np.asarray(values[-300:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

# LSTM's

Simple RNNs improved on classical ML algorithms for sequence data and gave state-of-the-art results at the time, but they fail to capture the long-term dependencies present in sentences. LSTMs (Long Short-Term Memory networks), introduced in 1997, were designed to counter this drawback. We have already tokenized and padded our text for input to the LSTM.
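
To see how gating helps with long-term dependencies, here is a scalar sketch of one LSTM step using the standard forget/input/output gate equations. It is an illustration only: the parameter names and values are made up, and a real layer uses weight matrices rather than single numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step on scalars; p holds the (hypothetical) weights."""
    f = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x_t + p["ug"] * h_prev + p["bg"])  # candidate value
    c_t = f * c_prev + i * g       # keep part of the old memory, add some new
    h_t = o * math.tanh(c_t)       # expose part of the memory as output
    return h_t, c_t

# Forget gate pushed towards 1 and input gate towards 0: the cell should
# hold on to its initial memory even after many steps.
params = dict(wf=0.0, uf=0.0, bf=10.0,
              wi=0.0, ui=0.0, bi=-10.0,
              wo=0.0, uo=0.0, bo=10.0,
              wg=1.0, ug=0.0, bg=0.0)
h, c = 0.0, 0.9
for x_t in [0.5, -0.5, 0.3] * 10:   # 30 time steps
    h, c = lstm_step(x_t, h, c, params)
print(round(c, 3))
```

With these gate settings the cell state stays close to its starting value of 0.9 after 30 steps, which is the kind of long-range memory a plain RNN struggles to keep.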

In [None]:
# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [None]:
%%time
with strategy.scope():
    
    # A simple LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))

    model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
    
model.summary()

In [None]:
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=16*strategy.num_replicas_in_sync)

In [None]:
scores = model.predict(xvalid_pad)
print("AUC: %.4f" % roc_auc(scores, yvalid))

In [None]:
scores_model.append({'Model': 'LSTM','AUC_Score': roc_auc(scores,yvalid)})

## Summary

As a first step, we compute the embedding matrix for our vocabulary from the pretrained GloVe vectors. Then, while building the embedding layer, we pass the embedding matrix as the layer's weights instead of training it on our vocabulary, and therefore set trainable=False.
The rest of the model is the same as before, except that we have replaced the SimpleRNN with LSTM units.
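
The embedding-matrix step can be illustrated on a toy scale. The sketch below assumes a GloVe-style text format (a word followed by its vector on each line) with 3-dimensional vectors instead of 300; the helper names, the toy file, and the out-of-vocabulary word are all made up.

```python
import io

def load_vectors(lines, dim):
    """Parse GloVe-style lines into a dict: word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word = " ".join(parts[:-dim])           # everything before the vector
        vectors[word] = [float(v) for v in parts[-dim:]]
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix

toy_file = io.StringIO("the 0.1 0.2 0.3\ncat -0.4 0.5 0.6\n")
toy_vectors = load_vectors(toy_file, dim=3)
toy_word_index = {"the": 1, "cat": 2, "zorblax": 3}   # "zorblax" has no pretrained vector
toy_matrix = build_embedding_matrix(toy_word_index, toy_vectors, dim=3)
print(toy_matrix)
```

Row 0 (padding) and the row of the unknown word remain all zeros, so the frozen embedding layer treats them identically.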

* Comments on the Model

We now see that the model is not overfitting and achieves an AUC score of 0.96, which is quite good; the gap between accuracy and AUC also narrows.
In this case, dropout helped prevent overfitting.

# GRU's

Introduced by Cho et al. in 2014, the GRU (Gated Recurrent Unit) aims to solve the vanishing gradient problem that comes with a standard recurrent neural network. GRUs are a variation on the LSTM: both rely on gating, but GRUs were designed to be simpler and faster. In many cases they produce equally good results, so there is no clear winner.
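
A scalar sketch of one GRU step shows why it is simpler than an LSTM: there are only two gates and no separate cell state. Sign conventions for the update gate differ between references; the version below keeps the old state when the update gate is close to 1, which is one common convention. All parameter names and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step on scalars; p holds the (hypothetical) weights."""
    z = sigmoid(p["wz"] * x_t + p["uz"] * h_prev + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x_t + p["ur"] * h_prev + p["br"])   # reset gate
    candidate = math.tanh(p["wh"] * x_t + p["uh"] * (r * h_prev) + p["bh"])
    return z * h_prev + (1.0 - z) * candidate                 # blend old and new

base = dict(wz=0.0, uz=0.0, wr=0.0, ur=0.0, br=0.0, wh=1.0, uh=0.0, bh=0.0)
keep_params = dict(base, bz=10.0)      # update gate near 1: keep the old state
replace_params = dict(base, bz=-10.0)  # update gate near 0: take the candidate

h_keep = gru_step(-0.2, 0.7, keep_params)
h_replace = gru_step(-0.2, 0.7, replace_params)
print(round(h_keep, 3), round(h_replace, 3))
```

The same input either leaves the state almost untouched or overwrites it, depending only on the update gate.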


In [None]:
%%time
with strategy.scope():
    # GRU with GloVe embeddings, spatial dropout, and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
    model.add(SpatialDropout1D(0.3))
    model.add(GRU(300))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
    
model.summary()

In [None]:
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=16*strategy.num_replicas_in_sync)

In [None]:
scores = model.predict(xvalid_pad)
print("AUC: %.4f" % roc_auc(scores, yvalid))

In [None]:
scores_model.append({'Model': 'GRU','AUC_Score': roc_auc(scores,yvalid)})

In [None]:
scores_model

## Summary

* Comments on the Model

The model now achieves a higher AUC score, about 0.97, than the previous two models. With almost the same accuracy as the LSTM, the GRU reaches a higher AUC.

# Bi-Directional LSTM
<br>
A Bidirectional LSTM, or biLSTM, is a sequence processing model that consists of two LSTMs: one taking the input in the forward direction, and the other in the backward direction. BiLSTMs effectively increase the amount of information available to the network, improving the context available to the algorithm.
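
The idea of reading the sequence in both directions can be sketched with a toy recurrent function. This is a simplification: a real bidirectional layer uses two LSTMs with separate weights, whereas the sketch below reuses one tiny tanh recurrence for both directions. The names `run_direction` and `bidirectional_encode` are ours.

```python
import math

def run_direction(inputs, weight=0.5):
    """A toy recurrent pass that returns the final hidden state."""
    h = 0.0
    for x_t in inputs:
        h = math.tanh(x_t + weight * h)
    return h

def bidirectional_encode(inputs):
    """Concatenate the final states of a forward and a backward pass."""
    forward = run_direction(inputs)
    backward = run_direction(list(reversed(inputs)))
    return [forward, backward]

toy_inputs = [0.9, -0.1, 0.2]
print(bidirectional_encode(toy_inputs))
```

The output is twice as wide as a single-direction state, which is why a bidirectional layer doubles the number of features passed on to the next layer.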

In [None]:
%%time
with strategy.scope():
    # A simple bidirectional LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
    model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))

    model.add(Dense(1,activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
    
    
model.summary()

In [None]:
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=16*strategy.num_replicas_in_sync)

In [None]:
scores = model.predict(xvalid_pad)
print("AUC: %.4f" % roc_auc(scores, yvalid))

In [None]:
scores_model.append({'Model': 'Bi-directional LSTM','AUC_Score': roc_auc(scores,yvalid)})

We can see that the GRU achieved the highest AUC score among the models we have used so far. Note that these results might not translate to multi-class classification models.

## Summary

Here, we have only made the LSTM layer we used before bidirectional, which is largely self-explanatory. We achieve similar accuracy and AUC score as before.

# BERT

In [None]:
train1 = pd.read_csv("/jigsaw-data/jigsaw-toxic-comment-train.csv")
valid = pd.read_csv('/jigsaw-data/validation.csv')
test = pd.read_csv('/jigsaw-data/test.csv')
sub = pd.read_csv('/jigsaw-data/sample_submission.csv')

In [None]:
def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    Encode texts into fixed-length sequences of token ids for BERT input.
    """
    tokenizer.enable_truncation(max_length=maxlen)
    # Pad every sequence to exactly maxlen so that all chunks share one shape
    # (the `length` argument assumes a recent version of the tokenizers library).
    tokenizer.enable_padding(pad_id=3, pad_token="[PAD]", length=maxlen)
    all_ids = []

    # Encode in chunks to keep memory usage bounded.
    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])

    return np.array(all_ids)

In [None]:
#IMP DATA FOR CONFIG

AUTO = tf.data.experimental.AUTOTUNE


# Configuration
EPOCHS = 5
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 192

In [None]:
# First load the real tokenizer
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# tokenizer = AutoTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# Save the loaded tokenizer locally
tokenizer.save_pretrained('.')
# Reload it with the huggingface tokenizers library
fast_tokenizer = BertWordPieceTokenizer('vocab.txt', lowercase=False)
fast_tokenizer

In [None]:
x_train = fast_encode(train1.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=MAX_LEN)

y_train = train1.toxic.values
y_valid = valid.toxic.values

In [None]:
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(BATCH_SIZE)
)

In [None]:
def build_model(transformer, max_len=512):
    """
    Build and compile a binary classifier on top of the given transformer.
    """
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)
    
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

In [None]:
%%time
with strategy.scope():
    transformer_layer = (
        transformers.TFDistilBertModel
        .from_pretrained('distilbert-base-multilingual-cased')
    )
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()

In [None]:
n_steps = x_train.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS
)

In [None]:
scores = model.predict(x_valid, verbose=1)
# scores
# print("Auc: %.2f%%" % (roc_auc(scores, y_valid)))

In [None]:
n_steps = x_valid.shape[0] // BATCH_SIZE
train_history_2 = model.fit(
    valid_dataset.repeat(),
    steps_per_epoch=n_steps,
    epochs=EPOCHS*2
)
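# Note: the model has just been trained on the validation set, so any score
# computed on x_valid from here on is optimistic.
scores = model.predict(x_valid, verbose=1)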


In [None]:
scores_model = [{'Model': 'SimpleRNN', 'AUC_Score': 0.8262},
 {'Model': 'LSTM', 'AUC_Score': 0.9616},
 {'Model': 'GRU', 'AUC_Score': 0.9736},
 {'Model': 'Bi-directional LSTM', 'AUC_Score': 0.9662},
 {'Model': 'BERT', 'AUC_Score': 0.9779}]
scores_model

In [None]:
# Visualization of Results obtained from various Deep learning models
results = pd.DataFrame(scores_model).sort_values(by='AUC_Score',ascending=False)
results.style.background_gradient(cmap='Blues')
fig = go.Figure(go.Funnelarea(
    text =results.Model,
    values = results.AUC_Score,
    title = {"position": "top center", "text": "Funnel-Chart of AUC Score Distribution"}
    ))
fig.show()

## Code Explanation

Here, we can see that the BERT architecture yields the best results: an AUC score of 0.9779, the highest among all the models.

# Summary of the project.
<br>
<br>

We used the Jigsaw multilingual toxic comments dataset from Kaggle and performed exploratory data analysis on it. After exploring the data, we calculated sentiment scores for the comments and checked how they relate to the toxicity of the comments. To make training easier, we reduced the multi-class classification problem to a binary one, in which a comment is either toxic or non-toxic. We then used several common deep learning models, namely RNN, LSTM, GRU, bidirectional LSTM, and BERT, to classify the comments. We observed that the simple RNN overfits, while the other four models perform fairly similarly.
