
# ***Toxic Comment Classification :***

In [None]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd
from statistics import mean
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, GRU, SimpleRNN, TimeDistributed, ConvLSTM2D
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.utils import plot_model

# ***Team Members :***
* Dusan MAKSIMOVIC
* Dejan LUTOVAC
* Loïc CHAUDY

# *Introduction :*


Why is this project useful for society?
>  There are some concrete applications of this project in our society:

> It could first of all limit cases of cyber harassment. We live in a world that is more and more virtualized, and the virtual world plays an important role in our lives. 
> For example, access to social networks is easier, so many young people have the opportunity to create an account. Some children are victims of cyber harassment. In France, 22% of teenagers admit to having been a victim of cyber harassment on social networks in 2019.

> An application of this project could filter the messages a person can send, and thus limit forms of harm such as insults.


Why are the team members interested in this project?
> Loïc : I did tutoring at a high school a few years ago. I realized that a lot of people suffer from cyber stalking and don't talk about it. This project can provide a solution that reduces the number of cases by detecting and deleting messages of this type.

> Dusan : I chose this project because I think that a lot of racism begins on the Internet, where there are a lot of toxic and racist comments.

> Dejan : I'm interested in this project because I hate toxic comments, for example racist comments, and I have also heard about a lot of cases that ended fatally.




# *Dataset Information :*

*Loading the data*

In [None]:
train = pd.read_csv('/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv.zip')
test = pd.read_csv('/kaggle/input/jigsaw-toxic-comment-classification-challenge/test.csv.zip')

print(train.shape)
print(test.shape)

**dataset explanation :**


> We retrieved some comments from the internet. We are going to use these comments to determine their nature.
> For this we will classify them into different categories: toxic, severe toxic, obscene, threat, insult, identity hate.

In [None]:
train.head()

In [None]:
test.head()

Here are some examples of comments :

In [None]:
train['comment_text'][0]

In [None]:
train['comment_text'][1689]


**Let's have a look at some examples**


*Some of them are very short and cannot be misinterpreted: there is nothing toxic.*

In [None]:
test['comment_text'][4]

*But some of them are not so easy.*


*At first, we might think that this comment is easy to classify. Indeed, it contains insults, which puts it in the class toxic or even insult. But one may wonder whether it also belongs in identity_hate because of the last sentence. This comment is not easy to classify.*

In [None]:
test['comment_text'][0]


# *Preprocessing Tasks:*

The first preprocessing step is to check whether there are any NULL values. If there are, we need to fill them with something.


In [None]:
train.isnull().any().sum()


In [None]:
test.isnull().any().sum()

We are lucky to work with a 'clean' dataset: we find no null values. Checking every case by hand would take too much time, so we prefer to use the functions that are already available. 

Then we split our training data: on one side we have the comments (list_sentences_train) and on the other side the categories (y). We also retrieve the comments from the test data (list_sentences_test). 

In [None]:
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_train = train["comment_text"]
list_sentences_test = test["comment_text"]

Next, we need to convert our text into numbers. One way to do this is to use the Tokenizer class in Keras.
The *fit_on_texts* method updates the internal vocabulary based on the sentences and builds the word index based on word frequency.
Then, the *texts_to_sequences* method transforms each text into the corresponding sequence of numbers from that index.
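
To illustrate what these two methods do, here is a minimal pure-Python sketch of a frequency-ranked word index. It is a simplification, not the Keras implementation: the helper names `build_word_index` and `texts_to_sequences` and the toy sentences are our own, and punctuation filtering is left out.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index, dropping words ranked >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

toy_texts = ["the cat sat", "the cat ran", "the dog"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index, num_words=4)
print(toy_index)      # 'the' is the most frequent word, so it gets index 1
print(toy_sequences)  # rare words (index >= 4) are dropped
```

In the same way, with `max_features = 20000` only the most frequent words of the training comments are kept, and rarer words are simply skipped.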

In [None]:
max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

We can see here which number is linked to which word :

In [None]:
index = tokenizer.word_index
print(index['the'])
print(index['cat'])
print(index['cars'])


But we are not finished: there is still one problem. A neural network needs inputs that all have the same shape, but the comments do not all contain the same number of words. 
For example, these two comments do not have the same length:

In [None]:
len(list_tokenized_train[1])

In [None]:
len(list_tokenized_train[250])

Therefore, we need to add padding. We decided on a maximum length of 200. The histogram below shows that we won't lose a lot of data with this choice. 
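
As a rough sketch of what padding to a fixed length means, the following pure-Python code pads short sequences with zeros on the left and keeps only the last tokens of long ones. We assume this matches the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`); the helpers `pad_pre` and `share_kept_whole` are hypothetical names used only for this illustration.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` tokens."""
    if len(sequence) >= maxlen:
        return list(sequence[-maxlen:])
    return [value] * (maxlen - len(sequence)) + list(sequence)

def share_kept_whole(sequences, maxlen):
    """Fraction of sequences that fit in `maxlen` without being truncated."""
    return sum(len(s) <= maxlen for s in sequences) / len(sequences)

toy = [[5, 8], [1, 2, 3, 4, 5, 6]]
padded = [pad_pre(s, maxlen=4) for s in toy]
print(padded)                            # every sequence now has length 4
print(share_kept_whole(toy, maxlen=4))   # share of sequences not truncated
```

Calling `share_kept_whole(list_tokenized_train, 200)` on the real data would give a number to go with the histogram.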

In [None]:
totalNumWords = [len(one_comment) for one_comment in list_tokenized_train]
plt.hist(totalNumWords,bins = np.arange(0,410,10))
plt.axvline(x=200, color='gray')
plt.show()

In [None]:
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)

In [None]:
print(X_t[0])

In [None]:
print(len(X_t[0]))

We now have our two arrays ready for the model : **X_t = train** & **X_te = test**

# *Deep Learning Models :*

**Baseline model performance (simple model): 0.91761**

For our first model, we decided to use 2 Dense layers. 
In the code below, both layers have 6 neurons and a tanh activation. 
Our last layer needs to have 6 neurons because we have 6 classes to predict.

In [None]:
# For the Input 
inp = Input(shape=(maxlen, )) #maxlen=200 as defined earlier
embed_size = 128
x = Embedding(max_features, embed_size)(inp)

# != layers 

x = Dense(6, activation="tanh")(x)
x = Dense(6, activation="tanh")(x)


# For the Output
x = GlobalMaxPool1D()(x)

We use an embedding layer. It transforms each word into a vector, which makes it easier for the network to process. It also gives a more compact representation of words (compared to a sparse one-hot vector model, for example). 


Here is a way to illustrate it (only the Embedding part, the rest does not concern us) 

![](https://www.researchgate.net/profile/Xingsheng_Yuan/publication/332810604/figure/fig2/AS:754128875683841@1556809743129/Simple-word-embedding-based-model-with-modified-hierarchical-pooling-strategy.png)

This makes learning easier for the network. 
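
To make the idea concrete, here is a small sketch comparing a one-hot vector with an embedding lookup. The toy sizes (`toy_vocab_size`, `toy_embed_size`) and the random table are assumptions for illustration only; in the real model the table has `max_features` rows of `embed_size` values and its weights are learned during training.

```python
import random

toy_vocab_size = 10   # hypothetical; the notebook uses max_features = 20000
toy_embed_size = 4    # hypothetical; the notebook uses embed_size = 128

def one_hot(index, size):
    """Sparse representation: one position per word of the vocabulary."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector

random.seed(0)
# One dense row per word; random here, learned by the network in practice
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(toy_embed_size)]
    for _ in range(toy_vocab_size)
]

def embed(sequence, table):
    """An embedding layer is a lookup: word index -> row of the table."""
    return [table[index] for index in sequence]

vectors = embed([0, 3, 3, 7], embedding_table)
print(len(one_hot(3, toy_vocab_size)))   # length equals the vocabulary size
print(len(vectors), len(vectors[0]))     # one short dense vector per token
```

The same word always maps to the same vector, and the vector length no longer depends on the size of the vocabulary.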

In [None]:
model = Model(inputs=inp, outputs=x)
model.compile(loss='mean_squared_error',
                  optimizer='adam',
                  metrics=['accuracy'])

In [None]:
plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True, dpi=96)

In [None]:
batch_size = 64
epochs = 5
history = model.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

In [None]:
history_dict = history.history
history_dict.keys()

acc_basic = history.history['accuracy']
val_loss_basic = history.history['val_loss']
loss_basic = history.history['loss']
val_acc_basic = history.history['val_accuracy']

epochs = range(1, len(acc_basic) + 1)

plt.subplot(221)
plt.plot(epochs, loss_basic, 'b', label='Training loss')
plt.plot(epochs, val_loss_basic, 'g', label='Validation loss')

plt.title('Training vs validation (loss)')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()


plt.subplot(224)

plt.plot(epochs, acc_basic, 'b', label='Training accuracy')
plt.plot(epochs, val_acc_basic, 'red', label='Validation accuracy')

plt.title('Training vs Validation (accuracy)')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

In [None]:
# Note: this evaluates on the training data (X_t, y); no labels are loaded for the test set
score, test_acc_basic = model.evaluate(X_t,y,verbose =1)

In [None]:
print(test_acc_basic)

To create the Kaggle submission :

We need to use the predict function provided by Keras. We make the prediction on the test data.

In [None]:
y_pred = model.predict(X_te, verbose = 1)

Once we have the predictions, we need to put these values into a csv file for the submission on Kaggle. We use Pandas to create such a file. 


In [None]:
submission = pd.DataFrame(columns=['id'] + list_classes)
submission['id'] = test['id'].values 
submission[list_classes] = y_pred
submission.to_csv("./submission_basicmodel.csv", index=False)

**Complex Models**

First, we made a model with basic layers. We will now use a type of layer that is better suited to language: Recurrent Neural Networks. 

*The Keras API already implements such layers, so we are going to use the 3 following ones: LSTM, GRU and SimpleRNN*

![](https://miro.medium.com/max/1400/1*xTKE0g6XNMLM8IQ4aFdP0w.png)

X(t) is the input, h(t) is the output and A is the neural network, which receives information from the previous step in a loop. The output of one step goes into the next one and carries the information forward.

First we will use the simplest RNN of the Keras API: the SimpleRNN.
This is a fully connected RNN where the output of the previous time step is fed to the next time step.


One simpleRNN cell : 

![](https://miro.medium.com/max/415/1*28XR1ajfW1WuTOkjpOc9xA.png)

The cell combines the current input (Xt) and the previous output (Ht-1): each is multiplied by its own weights and the two results are added together. To get the current output, the activation function (tanh in the diagram) is applied to this sum. 
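
One step of such a cell can be sketched in plain Python. This assumes the usual formulation h_t = tanh(W·x_t + U·h_(t-1) + b); the weight values and the helper names `matvec`, `simple_rnn_step` and `run_simple_rnn` are made up for the example and are not taken from Keras.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def simple_rnn_step(x_t, h_prev, W, U, b):
    """h_t = tanh(W.x_t + U.h_prev + b), computed unit by unit."""
    wx = matvec(W, x_t)
    uh = matvec(U, h_prev)
    return [math.tanh(a + c + d) for a, c, d in zip(wx, uh, b)]

def run_simple_rnn(inputs, W, U, b):
    """Apply the cell to each time step, feeding the output back in."""
    h = [0.0] * len(b)
    outputs = []
    for x_t in inputs:
        h = simple_rnn_step(x_t, h, W, U, b)
        outputs.append(h)   # one output per step, like return_sequences=True
    return outputs

W = [[0.5, -0.2], [0.1, 0.3]]   # input weights (2 units, 2 input features)
U = [[0.0, 0.4], [-0.3, 0.0]]   # recurrent weights
b = [0.0, 0.0]
rnn_outputs = run_simple_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W, U, b)
print(rnn_outputs)
```

Because the previous output is fed back at every step, the last output depends on the whole sequence and not only on the last word.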

You can find more information on how to use this network layer at the following link: [SimpleRNN](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN)

In [None]:
# For the Input 
inp = Input(shape=(maxlen, )) #maxlen=200 as defined earlier
embed_size = 128
x = Embedding(max_features, embed_size)(inp)

# != layers 


x = SimpleRNN(6,return_sequences=True)(x)
x = Dense(6, activation="sigmoid")(x)


# For the Output
x = GlobalMaxPool1D()(x)

In [None]:
model_rnn = Model(inputs=inp, outputs=x)
model_rnn.compile(loss='mean_squared_error',
                  optimizer='adam',
                  metrics=['accuracy'])

In [None]:
plot_model(model_rnn, to_file='model_rnn.png', show_shapes=True, show_layer_names=True, dpi=96)

In [None]:
batch_size = 64
epochs = 5
history = model_rnn.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

In [None]:
history_dict = history.history
history_dict.keys()

acc_rnn = history.history['accuracy']
loss_rnn = history.history['loss']
val_acc_rnn = history.history['val_accuracy']
val_loss_rnn = history.history['val_loss']

epochs = range(1, len(acc_rnn) + 1)

plt.subplot(221)
plt.plot(epochs, loss_rnn, 'b', label='Training loss')
plt.plot(epochs, val_loss_rnn, 'g', label='Validation loss')

plt.title('Training vs validation (loss)')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()


plt.subplot(224)

plt.plot(epochs, acc_rnn, 'b', label='Training accuracy')
plt.plot(epochs, val_acc_rnn, 'red', label='Validation accuracy')

plt.title('Training vs Validation (accuracy)')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

For the prediction :

In [None]:
# Note: this evaluates on the training data (X_t, y); no labels are loaded for the test set
score, test_acc_rnn = model_rnn.evaluate(X_t,y,verbose =1)

In [None]:
y_pred = model_rnn.predict(X_te, verbose = 1)

In [None]:
submission = pd.DataFrame(columns=['id'] + list_classes)
submission['id'] = test['id'].values 
submission[list_classes] = y_pred
submission.to_csv("/kaggle/working/submission_model_RNN.csv", index=False)

The second layer that we will use is the Long Short-Term Memory (LSTM).

One LSTM cell :

![](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)

Each cell works with 3 inputs: the memory (cell state) of the previous LSTM block (the top line), the output of the previous block (Ht-1) and finally the current input (Xt). The new memory and the new output are computed from these following the operations shown in the diagram above.  
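
The calculations in the diagram can be sketched for a single unit with scalar weights. This is the standard textbook formulation (forget, input and output gates plus a candidate memory), written by us for illustration; the parameter dictionary `p` and its key names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for a single unit; p holds the scalar weights."""
    f = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])   # forget gate
    i = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])   # input gate
    o = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])   # output gate
    c_candidate = math.tanh(p["wc"] * x_t + p["uc"] * h_prev + p["bc"])
    c_t = f * c_prev + i * c_candidate   # new memory (top line of the diagram)
    h_t = o * math.tanh(c_t)             # new output
    return h_t, c_t

# With all weights at zero every gate is 0.5 and the candidate is 0
zero_params = {key: 0.0 for key in
               ["wf", "uf", "bf", "wi", "ui", "bi",
                "wo", "uo", "bo", "wc", "uc", "bc"]}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
print(h_t, c_t)
```

The forget gate decides how much of the old memory is kept, which is what lets an LSTM carry information over many time steps.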

You can find more information on how to use this network layer at the following link: [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)

See [here](https://en.wikipedia.org/wiki/Recurrent_neural_network#Long_short-term_memory) for more explanation about the LSTM cell.

In [None]:
# For the Input 
inp = Input(shape=(maxlen, )) #maxlen=200 as defined earlier
embed_size = 128
x = Embedding(max_features, embed_size)(inp)

# != layers 

x = LSTM(6, return_sequences=True)(x)
x = Dense(6, activation="sigmoid")(x)


# For the Output
x = GlobalMaxPool1D()(x)

In [None]:
model_lstm = Model(inputs=inp, outputs=x)
model_lstm.compile(loss='mean_squared_error',
                  optimizer='adam',
                  metrics=['accuracy'])

In [None]:
plot_model(model_lstm, to_file='model_lstm.png', show_shapes=True, show_layer_names=True, dpi=96)

In [None]:
batch_size = 64
epochs = 5
history = model_lstm.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

In [None]:
history_dict = history.history
history_dict.keys()

acc_lstm = history.history['accuracy']
val_acc_lstm = history.history['val_accuracy']
val_loss_lstm = history.history['val_loss']
loss_lstm = history.history['loss']

epochs = range(1, len(acc_lstm) + 1)

plt.subplot(221)
plt.plot(epochs, loss_lstm, 'b', label='Training loss')
plt.plot(epochs, val_loss_lstm, 'g', label='Validation loss')

plt.title('Training vs validation (loss)')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()


plt.subplot(224)

plt.plot(epochs, acc_lstm, 'b', label='Training accuracy')
plt.plot(epochs, val_acc_lstm, 'red', label='Validation accuracy')

plt.title('Training vs Validation (accuracy)')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

In [None]:
# Note: this evaluates on the training data (X_t, y); no labels are loaded for the test set
score, test_acc_lstm = model_lstm.evaluate(X_t,y,verbose =1)

In [None]:
test_acc_lstm

In [None]:
y_pred = model_lstm.predict(X_te, verbose = 1)

In [None]:
submission = pd.DataFrame(columns=['id'] + list_classes)
submission['id'] = test['id'].values 
submission[list_classes] = y_pred
submission.to_csv("./submission_model_LSTM.csv", index=False)

The last one is called the Gated Recurrent Unit (GRU). This is a variant of the LSTM.

One GRU cell : 

![](https://miro.medium.com/max/552/1*GSZ0ZQZPvcWmTVatAeOiIw.png)

A GRU cell depends on only two inputs: the current input value and the output of the previous block. 
The advantage of the GRU compared to the LSTM is the execution time, which is shorter for the same number of units since fewer parameters have to be computed.
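
The difference in parameter count can be estimated with a short sketch. It assumes the classic formulation with one bias vector per gate (4 weight blocks for the LSTM, 3 for the GRU); a given library may add extra bias terms, so the numbers it reports can differ slightly. The function names are ours.

```python
def lstm_params(units, input_dim):
    """4 blocks (3 gates + candidate), each with input, recurrent and bias weights."""
    return 4 * (units * (input_dim + units) + units)

def gru_params(units, input_dim):
    """3 blocks (2 gates + candidate), same structure as above."""
    return 3 * (units * (input_dim + units) + units)

embed_size = 128  # input dimension of the recurrent layer in this notebook
print(lstm_params(6, embed_size))    # LSTM with 6 units
print(gru_params(6, embed_size))     # GRU with the same 6 units
print(gru_params(10, embed_size))    # GRU with 10 units
```

For the same number of units, the GRU has three quarters of the LSTM's parameters. The GRU model in this notebook uses 10 units while the LSTM uses 6, so under these assumptions it ends up with more parameters, which may explain part of the difference in training time between the two.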

You can find more information on how to use this network layer at the following link: [GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU)

See [here](https://en.wikipedia.org/wiki/Recurrent_neural_network#Gated_recurrent_unit) for more explanation about the GRU cell.


In [None]:
# For the Input 
inp = Input(shape=(maxlen, )) #maxlen=200 as defined earlier
embed_size = 128
x = Embedding(max_features, embed_size)(inp)

# != layers 

x = GRU(10,return_sequences=True, activation="sigmoid")(x)
x = Dense(6, activation="sigmoid")(x)


# For the Output
x = GlobalMaxPool1D()(x)

In [None]:
model_gru = Model(inputs=inp, outputs=x)
model_gru.compile(loss='mean_squared_error',
                  optimizer='adam',
                  metrics=['accuracy'])

In [None]:
plot_model(model_gru, to_file='model_gru.png', show_shapes=True, show_layer_names=True, dpi=96)

In [None]:
batch_size = 64
epochs = 5
history = model_gru.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

In [None]:
history_dict = history.history
history_dict.keys()

acc_gru = history.history['accuracy']
loss_gru = history.history['loss']
val_acc_gru = history.history['val_accuracy']
val_loss_gru = history.history['val_loss']

epochs = range(1, len(acc_gru) + 1)

plt.subplot(221)
plt.plot(epochs, loss_gru, 'b', label='Training loss')
plt.plot(epochs, val_loss_gru, 'g', label='Validation loss')

plt.title('Training vs validation (loss)')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()


plt.subplot(224)

plt.plot(epochs, acc_gru, 'b', label='Training accuracy')
plt.plot(epochs, val_acc_gru, 'red', label='Validation accuracy')

plt.title('Training vs Validation (accuracy)')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

In [None]:
# Note: this evaluates on the training data (X_t, y); no labels are loaded for the test set
score, test_acc_gru = model_gru.evaluate(X_t,y,verbose =1)

In [None]:
test_acc_gru

In [None]:
y_pred = model_gru.predict(X_te, verbose = 1)

In [None]:
submission = pd.DataFrame(columns=['id'] + list_classes)
submission['id'] = test['id'].values 
submission[list_classes] = y_pred
submission.to_csv("./submission_model_GRU.csv", index=False)

# *Table Comparison of Models:*

In [None]:
comp = pd.DataFrame(columns = ['Models'] + ['Training Accuracy (avg)'] + ['Validation Accuracy (avg)'] + ['Test Accuracy'] + ['Time /epochs (avg)']+ ['Kaggle Score'])
comp['Models'] = ['Baseline',"LSTM",'GRU','SimpleRNN']
comp['Training Accuracy (avg)'] = [mean(acc_basic),mean(acc_lstm),mean(acc_gru),mean(acc_rnn)]
comp['Validation Accuracy (avg)'] = [mean(val_acc_basic), mean(val_acc_lstm), mean(val_acc_gru), mean(val_acc_rnn)]
comp['Test Accuracy'] = [test_acc_basic, test_acc_lstm, test_acc_gru, test_acc_rnn]
comp['Time /epochs (avg)'] = ["54s", "208s", "246.2s", "151.4s"]
comp['Kaggle Score'] = ["0.89238","0.92040","0.95050","0.91707"]
comp.to_csv("./Models_comp.csv",index = False)

In [None]:
displaycomp = pd.read_csv('/kaggle/working/Models_comp.csv')
print(displaycomp)

We now check what we said about the difficulty of predicting some comments:

In [None]:
predict = pd.read_csv('/kaggle/working/submission_model_GRU.csv')
predict_val = predict[list_classes].values

In [None]:
test['comment_text'][4]

In [None]:
predict_val[4]

In [None]:
test['comment_text'][0]

In [None]:
predict_val[0]

# *Difficulties encountered:*

As on all projects, we had to face several problems to complete this one :

The first problem was the coronavirus. The arrival of this virus changed the way the project was carried out because of the introduction of online courses. As we are all Erasmus students (two from Luxembourg and one from France), we all went home. So setting up time slots to work on this project was a bit more difficult than it would have been without the COVID situation.

Two of us had not really practiced Python before, so it was difficult to start working directly on the project. Luckily, attending Python classes in parallel to this course gave us a better understanding of this scripting language.   

It also took a long time to understand the existing notebook before starting to work on this project. Indeed, we did not start this project from scratch, we started from an existing notebook. None of the members of the group had worked with neural networks before, so it would have been complicated to start from scratch without any basis. 

# *Conclusion:*

We decided to train for 5 epochs. Training for more epochs might give more interesting values, but it would take too much time. Indeed, the slowest model (GRU) takes about 246s per epoch on average: 50 epochs would take about 205 min in total and 100 epochs about 410 min, which is too much waiting time.  
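
The estimate is simple arithmetic, shown here as a short sketch using the rounded average of 246s per epoch (the helper name `total_minutes` is ours):

```python
def total_minutes(seconds_per_epoch, n_epochs):
    """Total training time in minutes for a given number of epochs."""
    return seconds_per_epoch * n_epochs / 60

gru_seconds_per_epoch = 246  # rounded average measured for the GRU
for n_epochs in (5, 50, 100):
    print(n_epochs, "epochs:", total_minutes(gru_seconds_per_epoch, n_epochs), "min")
```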

We also noticed that the average time per epoch is logical. The simplest model (baseline) is the one that takes the shortest time, 54s on average. The more complex the model, the longer an epoch takes: the SimpleRNN has an average time of 151.4s per epoch, the LSTM 208s and the GRU 246.2s.

To get a better idea of the performance of each model, we mainly look at the test accuracy, but also at the other metrics (training accuracy, validation accuracy). We notice that the LSTM and the GRU have the two best results in all categories. 

To choose between the two, we look at the Kaggle score: it is 0.95050 for the GRU against 0.92040 for the LSTM. The GRU also performs better than the LSTM on all the metrics. It tends to converge to the optimal solution.
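
As far as we understand the competition rules, the Kaggle score is the mean column-wise ROC AUC, which is not the same thing as the accuracy used above. The following pure-Python sketch shows how such a score could be computed on a tiny made-up example; the function names and the toy labels and predictions are ours, and this is not the code Kaggle runs.

```python
def roc_auc(labels, scores):
    """Probability that a positive example is scored above a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_columnwise_auc(y_true, y_pred):
    """Average the AUC computed separately for each class column."""
    n_cols = len(y_true[0])
    aucs = [
        roc_auc([row[c] for row in y_true], [row[c] for row in y_pred])
        for c in range(n_cols)
    ]
    return sum(aucs) / n_cols

toy_true = [[1, 0], [0, 0], [1, 1], [0, 1]]                   # 4 comments, 2 classes
toy_pred = [[0.9, 0.2], [0.1, 0.4], [0.8, 0.7], [0.3, 0.3]]
toy_score = mean_columnwise_auc(toy_true, toy_pred)
print(toy_score)
```

Unlike accuracy, this score only depends on how the comments are ranked within each class, so two models with similar accuracy can still obtain different Kaggle scores.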

We would therefore say that, to answer the problem in this case, we prefer to use our GRU model. 