# Importing Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from tensorflow.keras.layers import Bidirectional, GlobalMaxPool1D
from tensorflow.keras.models import Model
from tensorflow.keras import initializers, regularizers, constraints, optimizers, layers

# Importing Dataset

In [None]:
train = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv.zip")
test = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/test.csv.zip")

In [None]:
train.head()

# Check for null values

We check both sets for null values. If any are found, we need to decide how to handle them before tokenizing.

In [None]:
print(train.isnull().sum())
print(test.isnull().sum())

No null values were found. We are good to go!

# Separate x and y components from the training set

Now we separate the training set into:
* Independent component - The comment text.
* Dependent component - The six label columns, one per category of toxicity.

In [None]:
x_train = train["comment_text"]

y_train = train[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]]

x_test = test["comment_text"]

# Tokenize the input comments

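Here, we first split each comment into individual words (tokens).

Ex. ["Hello World", "I am here", "How are you doing"]

After tokenization: ["Hello", "World", "I", "am", "here", "How", "are", "you", "doing"]

After this, each word is assigned an index number, which gives us a dictionary:
{"Hello":1, "World":2, "I":3, "am":4, "here":5, "How":6, "are":7, "you":8, "doing":9}

The text in the first example now converts to: [[1, 2], [3, 4, 5], [6, 7, 8, 9]]

Note: the Keras Tokenizer lowercases the text and strips punctuation by default, and it assigns the lowest indices to the most frequent words. With num_words = max_feature, only the most frequent words are kept when converting text to sequences.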
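
The indexing idea above can be sketched in plain Python. This is a simplified illustration and not the Keras implementation: the helper names `build_word_index` and `texts_to_ids` are hypothetical, and the sketch splits on whitespace only instead of filtering punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each lowercased word to an integer id; ids start at 1, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by count; words with equal counts keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace every known word with its id; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

texts = ["Hello World", "I am here", "How are you doing"]
word_index = build_word_index(texts)
sequences = texts_to_ids(texts, word_index)
print(word_index)
print(sequences)
```

Index 0 is deliberately left unused here, which keeps it free to serve as the padding value later on.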

In [None]:
max_feature = 20000

tokenizer = Tokenizer(num_words = max_feature)
tokenizer.fit_on_texts(list(x_train))
tokenized_train = tokenizer.texts_to_sequences(x_train)
tokenized_test = tokenizer.texts_to_sequences(x_test)

In [None]:
tokenized_train[:1]

# Apply Padding

There are a total of 159571 training examples. The comments (strings) are not all the same size: some might be long, e.g. 300 words, and some might be short, e.g. 30 words. We cannot feed the model inputs of varying length.

To deal with this, we need to bring every comment to the same length before providing it as input to the model. To do this we use padding.

By default, pad_sequences adds zeros at the beginning of every comment shorter than the given length (here, maxlen = 50). Comments longer than maxlen are truncated, also from the beginning, so every sequence ends up with exactly maxlen entries.

Ex. Let's consider the previous example. After tokenization we were left with: [[1, 2], [3, 4, 5], [6, 7, 8, 9]]
After padding (with maxlen = 4): [[0, 0, 1, 2], [0, 3, 4, 5], [6, 7, 8, 9]], so that all training examples have equal length.
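
As a rough illustration of what padding does, here is a plain-Python sketch that mimics the default behaviour of `pad_sequences` (`padding="pre"`, `truncating="pre"`): zeros go in front of short sequences, and long sequences keep only their last `maxlen` entries. The function name `pad_pre` is hypothetical, and the sketch assumes `maxlen` is a positive integer.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                      # drop ids from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

sequences = [[1, 2], [3, 4, 5], [6, 7, 8, 9], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(sequences, maxlen=4)
print(padded)
```

To pad or truncate at the end instead, `pad_sequences` accepts `padding="post"` and `truncating="post"`.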

In [None]:
maxlen = 50
x_train = pad_sequences(tokenized_train, maxlen = maxlen)
x_test = pad_sequences(tokenized_test, maxlen = maxlen)

We are done with all the preprocessing of the text. Now we move on to building the model.

# Building the Model

Here we have used LSTM for the classification of comments.

Step-by-step layers: 

(Input)-->(Embedding)-->(LSTM layer)-->(Max Pooling layer)-->(First Dense layer)-->(Second Dense layer)

1. INPUT

We take input in batches of (maxlen)-dimensional vectors.

In [None]:
input = Input(shape = (maxlen,))

2. EMBEDDING

In this layer we represent each word in the training example with a 128-dimensional vector.

In [None]:
embed_size = 128
x = Embedding(max_feature, embed_size)(input)

3. LSTM LAYER

This is the LSTM layer, which can be considered the main processing layer.
It reads the sequence of 128-dimensional embedded vectors and, because return_sequences = True, outputs a 60-dimensional vector at every one of the maxlen positions. Each output vector summarizes the comment up to that position.

In [None]:
x = LSTM(60, return_sequences = True, name = "lstm_layer")(x)

4. MAX POOLING 

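This layer reduces the per-position outputs of the LSTM to a single 60-dimensional vector by taking, for each of the 60 features, its maximum value across all positions. The result is a fixed-size vector that the dense layers can consume.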

In [None]:
x = GlobalMaxPool1D()(x)
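
To make the pooling step concrete, here is a small plain-Python sketch of global max pooling for a single example. The function name `global_max_pool_1d` and the numbers are made up for illustration: 4 positions with 3 features each stand in for the real (maxlen, 60) output.

```python
def global_max_pool_1d(sequence):
    """Collapse a (timesteps, features) list of lists into one (features,) list."""
    # zip(*sequence) groups the values of each feature across all timesteps
    return [max(column) for column in zip(*sequence)]

lstm_output = [
    [0.1, -0.5, 0.3],
    [0.7, 0.2, -0.1],
    [0.4, 0.9, 0.0],
    [-0.2, 0.1, 0.6],
]
pooled = global_max_pool_1d(lstm_output)
print(pooled)
```

Each feature keeps its strongest activation, regardless of where in the comment it occurred.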

Next we apply Dropout(), which randomly sets a fraction of its inputs (here 10%) to zero during training. This helps prevent the model from overfitting the training data.

In [None]:
x = Dropout(0.1)(x)
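
The following plain-Python sketch shows the idea behind dropout during training. It is an illustration and not the Keras code: the function name `dropout` and the seeded generator are assumptions made for the example. Like the Keras layer, it rescales the surviving values by 1 / (1 - rate) so the expected sum stays unchanged; at prediction time the Keras layer passes its inputs through untouched.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
activations = [1.0] * 1000
dropped = dropout(activations, rate=0.1, rng=rng)

zeroed = sum(1 for v in dropped if v == 0.0)
print(f"zeroed {zeroed} of {len(dropped)} activations")
print(f"mean after dropout: {sum(dropped) / len(dropped):.3f}")
```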

5. FIRST DENSE LAYER

We have implemented our first Dense layer with 50 neurons and the Rectified Linear Unit (ReLU) as its activation function.

In [None]:
x = Dense(50, activation = "relu")(x)

In [None]:
#Once again we have implemented dropout to prevent overfitting

x = Dropout(0.1)(x)

6. SECOND DENSE LAYER

Here, we have implemented our second Dense layer, which is also the final layer of the model, with 6 neurons and sigmoid as its activation function.

Note: We use sigmoid rather than softmax because this is a multi-label problem. A comment can fall into several of the six categories at once, so each output neuron independently produces a value between 0 and 1 for its own category.

In [None]:
x = Dense(6, activation = "sigmoid")(x)
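
The sketch below illustrates how six independent sigmoid outputs can be read as multi-label predictions. The logits are made-up values, and the 0.5 decision threshold is an assumption for illustration only; the notebook itself submits the raw probabilities.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, -3.0, 1.0, -4.0, 0.5, -2.5]    # hypothetical pre-activation values
probs = [sigmoid(z) for z in logits]

# Each label is decided on its own, so several labels can be active at once
predicted = [label for label, p in zip(LABELS, probs) if p >= 0.5]
print([round(p, 3) for p in probs])
print(predicted)
```

Unlike softmax outputs, these probabilities do not need to sum to 1.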

7. COMPILING LAYERS

We have created all the layers. Now it's time to assemble them into a model and specify the loss function, optimizer, and metric to make the model fully functional.

In [None]:
model = Model(inputs = input, outputs = x)
model.compile(loss = "binary_crossentropy",
             optimizer = "adam",
             metrics = ["accuracy"])
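
For intuition about the loss chosen above, here is a minimal plain-Python sketch of binary cross-entropy for one comment, averaged over its six labels. The function is written from the standard formula and is not the Keras code; the clipping constant `eps` and the example predictions are assumptions for illustration.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over all labels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)       # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0, 0, 0]                   # toxic and obscene
good = [0.9, 0.1, 0.8, 0.1, 0.2, 0.1]         # confident and mostly right
bad = [0.2, 0.7, 0.3, 0.6, 0.5, 0.4]          # mostly wrong

loss_good = binary_crossentropy(y_true, good)
loss_bad = binary_crossentropy(y_true, bad)
print(f"good predictions: {loss_good:.4f}")
print(f"bad predictions:  {loss_bad:.4f}")
```

Because each label contributes its own term, the loss fits the multi-label setup where every output is an independent yes/no decision.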

# Fit Model to the Data

We provide our model with the training data and hold out 10% of it for validation.

In [None]:
batch_size = 32
epochs = 2
model.fit(x_train, y_train, batch_size = batch_size, epochs = epochs, validation_split = 0.1)
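
As a quick sanity check on what to expect in the training log, the arithmetic below estimates how many examples are used for training and how many batches make up one epoch. It assumes the 159571 training examples mentioned earlier and that the split point is rounded down; Keras takes the validation slice from the end of the data, before shuffling.

```python
import math

n_examples = 159571
validation_split = 0.1
batch_size = 32

# Assumption: the number of training examples is rounded down
n_train = math.floor(n_examples * (1.0 - validation_split))
n_val = n_examples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch may be smaller

print(f"train: {n_train}, validation: {n_val}, steps per epoch: {steps_per_epoch}")
```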

# Predicting output of test set

In [None]:
y_pred = model.predict(x_test,batch_size=32)

In [None]:
submission = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/sample_submission.csv.zip')
submission[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]] = y_pred
submission.to_csv('submission.csv', index=False)