# Toxic Comments Classification

In this program, we are going to classify comments under six different labels: *toxic, severe_toxic, obscene, threat, insult,* and *identity_hate*.

## Importing libraries

In [None]:
import pandas as pd
import numpy as np
import keras
from keras.models import Sequential
# Only the layers actually used by the models below are imported.
from keras.layers import Dense, Embedding, Flatten, Conv1D, MaxPooling1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm
import random
import matplotlib.pyplot as plt

## Getting the dataset

In [None]:
training_set = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv")

In [None]:
training_set = training_set.drop(['id'], axis=1)

## Analyzing the dataset

In [None]:
print("Number of training records :",len(training_set))
print("Columns :")
for i in training_set:
    print("\t"+i)

The training set consists of 159571 records, and the original file has 8 columns. The columns are largely self-explanatory.<br>
<br>
>The **id** column contains the id of each training record and is irrelevant for training, so it has already been dropped above.<br>
>Then we have **comment_text**, which holds the text of the comment.<br>
>The remaining columns have the value 0 or 1, depending on whether the comment qualifies for that label.
<br>


**Now, let's take a look at how many training examples we have for each label.**

In [None]:
# Number of positive (1) records per label
columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']  
count_ones = []
for i in columns:
    count_ones.append(training_set[training_set[i]==1][i].count())
y_pos = np.arange(len(columns))
plt.bar(y_pos, count_ones, align="center", alpha=0.5)
plt.xticks(y_pos, columns)
plt.ylabel("Number of Ones")
plt.title("Number of Ones")
plt.show()

# Number of negative (0) records per label
count_zeros = []
for i in columns:
    count_zeros.append(training_set[training_set[i]==0][i].count())
y_pos = np.arange(len(columns))
plt.bar(y_pos, count_zeros, align="center", alpha=0.5)
plt.xticks(y_pos, columns)
plt.ylabel("Number of Zeros")
plt.title("Number of Zeros")
plt.show()

From the above plots, we can see that most records in our training set are negative (have the value 0). Around 15000 records are labelled as toxic, while around 140000 are not. The most extreme case is the threat class, with only around 500-700 positive records against roughly 160000 negative ones. So, for most of the classes, a model trained on this data could be strongly biased towards predicting a comment as negative.
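
To make the imbalance concrete, the sketch below computes the share of positive records and the accuracy a classifier would reach by always predicting 0. The function name is illustrative, and the counts are the rounded figures quoted above, not exact values from the dataset.

```python
def always_zero_accuracy(n_positive, n_total):
    """Accuracy of a classifier that predicts 0 for every record."""
    return (n_total - n_positive) / n_total

# Rounded counts taken from the text above (assumed, not exact).
n_total = 159571
approx_positives = {"toxic": 15000, "threat": 500}

for label, n_pos in approx_positives.items():
    share = n_pos / n_total
    baseline = always_zero_accuracy(n_pos, n_total)
    print(f"{label}: {share:.2%} positive, always-0 accuracy {baseline:.2%}")
```

Since such a trivial baseline already scores above 90% on accuracy, the accuracy metric alone says little about how well the positive class is detected.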

**Let's have a look at some of the data examples.**

In [None]:
for i in range(1):
    j = random.randint(0, 10000)
    print(training_set.values[j])
    

**Okay, so enough of analyzing the data. Now, let's preprocess our data for training.**

Since we have text data, and the semantics of the text are important for correctly classifying it as toxic, severe_toxic, and so on, we will use pre-trained word embeddings as inputs.

## Getting Word Embeddings

In [None]:
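# GloVe files are UTF-8 text; state the encoding explicitly so the platform default is not used.
f = open("../input/glove-embeddings/glove.6B.300d.txt", encoding="utf-8")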

In [None]:
embedding_matrix = {}
for line in tqdm(f):
    temp = line.split(" ")
    word = temp[0]
    embeds = np.array(temp[1:], dtype='float32')
    embedding_matrix[word] = embeds
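
Each line of a GloVe text file holds a word followed by its vector components, separated by single spaces. A minimal sketch of the same parsing on two made-up lines (the words and the 3-dimensional vectors are illustrative only; the real file has 300 values per line):

```python
import io

# Two made-up lines in GloVe's "word v1 v2 ..." text layout.
fake_file = io.StringIO("hello 0.1 -0.2 0.3\nworld 0.4 0.5 -0.6\n")

vectors = {}
for line in fake_file:
    parts = line.rstrip().split(" ")   # drop the trailing newline, then split
    word = parts[0]                    # first field is the word itself
    vectors[word] = [float(v) for v in parts[1:]]

print(vectors["hello"])
```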

For words that are not present in the GloVe embeddings, we will use zero vectors.

**Let's now create x and y datasets where 'x' will be the values we will use for making predictions and 'y', the values to predict.**

In [None]:
x = training_set['comment_text']
y = training_set[columns]

Now, we will tokenize our texts and convert them to sequences.

In [None]:
token = Tokenizer(num_words=20000)
token.fit_on_texts(x)
seq = token.texts_to_sequences(x)
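
As a rough picture of what these calls do, here is a simplified pure-Python stand-in for the Keras `Tokenizer`. The helper names and sentences are made up, and the real class also strips punctuation. Words are ranked by frequency, index 0 is left free for padding, and words whose rank is not below `num_words` are dropped when converting to sequences.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank, 1 = most frequent."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: -counts[w])  # stable: ties keep first-seen order
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by their index, dropping unknown words and ranks >= num_words."""
    out = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        out.append(ids)
    return out

texts = ["you are nice", "you are not nice", "you rock"]
word_index = fit_word_index(texts)
print(word_index)
print(to_sequences(texts, word_index, num_words=4))
```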

We also need to pad our sequences so that every comment has the same length. Comments longer than `maxlen` are truncated.

In [None]:
padded_seq = pad_sequences(seq, maxlen=40)
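
With its default settings, `pad_sequences` pads and truncates at the start of each sequence, so with `maxlen=40` only the last 40 tokens of a long comment are kept. A pure-Python sketch of that behaviour (the helper `pad_pre` is hypothetical, written only for illustration):

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]                                   # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)   # pad at the front
    return padded

print(pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
```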

In [None]:
vocab_size = len(token.word_index)+1
print(vocab_size)

We will now build the embedding matrix for the words in our vocabulary.

In [None]:
embeddings = np.zeros((vocab_size, 300))
for word, i in tqdm(token.word_index.items(), position=0):
    embeds = embedding_matrix.get(word)
    if embeds is not None:
        embeddings[i] = embeds
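
Rows for words missing from GloVe stay all-zero, so it can be worth checking how much of the vocabulary is actually covered. A small sketch under toy assumptions: `toy_word_index` and `toy_pretrained` stand in for `token.word_index` and the GloVe dictionary, and the function name is made up.

```python
def embedding_coverage(word_index, pretrained):
    """Fraction of vocabulary words that have a pre-trained vector."""
    if not word_index:
        return 0.0
    found = sum(1 for word in word_index if word in pretrained)
    return found / len(word_index)

# Toy stand-ins for token.word_index and the GloVe dictionary.
toy_word_index = {"the": 1, "cat": 2, "lolwut": 3, "sat": 4}
toy_pretrained = {"the": [0.1], "cat": [0.2], "sat": [0.3]}

print(embedding_coverage(toy_word_index, toy_pretrained))
```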

## Defining our models

Since we have to make predictions for six classes, let's have a separate classifier for each of them.

**Model for TOXIC**

In [None]:
model1 = Sequential()
model1.add(Embedding(vocab_size, 300, weights = [embeddings],
                     input_length=40, trainable=False))
model1.add(Conv1D(128, 5, activation='relu'))
model1.add(MaxPooling1D(5))
model1.add(Conv1D(128, 5, activation='relu'))
model1.add(MaxPooling1D(3))
model1.add(Flatten())
model1.add(Dense(128, activation='relu'))
model1.add(Dense(1, activation='sigmoid'))

model1.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

model1.summary()
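
The sequence length shrinks at every convolution and pooling layer. A short sketch of that arithmetic, assuming the Keras defaults of 'valid' padding, stride 1 for `Conv1D`, and a pooling stride equal to the pool size (the helper names are illustrative):

```python
def conv1d_out(length, kernel_size):
    """Output length of a 'valid' 1D convolution with stride 1."""
    return length - kernel_size + 1

def maxpool1d_out(length, pool_size):
    """Output length of 'valid' max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1

length = 40                         # input_length of the Embedding layer
length = conv1d_out(length, 5)      # Conv1D(128, 5)
length = maxpool1d_out(length, 5)   # MaxPooling1D(5)
length = conv1d_out(length, 5)      # Conv1D(128, 5)
length = maxpool1d_out(length, 3)   # MaxPooling1D(3)
print(length, "time step(s) x 128 filters feed the Flatten layer")
```

Under these assumptions the stack leaves little room to spare, so shortening `maxlen` noticeably would likely require smaller kernels or pools.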

In [None]:
model1.fit(padded_seq, training_set['toxic'], epochs=3, batch_size=32, validation_split=0.2)

**Model for SEVERE_TOXIC**

In [None]:
model2 = Sequential()
model2.add(Embedding(vocab_size, 300, weights = [embeddings],
                     input_length=40, trainable=False))
model2.add(Conv1D(128, 5, activation='relu'))
model2.add(MaxPooling1D(5))
model2.add(Conv1D(128, 5, activation='relu'))
model2.add(MaxPooling1D(3))
model2.add(Flatten())
model2.add(Dense(128, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))

model2.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

model2.summary()

In [None]:
model2.fit(padded_seq, training_set['severe_toxic'], epochs=2, batch_size=32, validation_split=0.2)

**Model for OBSCENE**

In [None]:
model3 = Sequential()
model3.add(Embedding(vocab_size, 300, weights = [embeddings],
                     input_length=40, trainable=False))
model3.add(Conv1D(128, 5, activation='relu'))
model3.add(MaxPooling1D(5))
model3.add(Conv1D(128, 5, activation='relu'))
model3.add(MaxPooling1D(3))
model3.add(Flatten())
model3.add(Dense(128, activation='relu'))
model3.add(Dense(1, activation='sigmoid'))

model3.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

model3.summary()

In [None]:
model3.fit(padded_seq, training_set['obscene'], epochs=2, batch_size=32, validation_split=0.2)

**Model for THREAT**

In [None]:
model4 = Sequential()
model4.add(Embedding(vocab_size, 300, weights = [embeddings],
                     input_length=40, trainable=False))
model4.add(Conv1D(128, 5, activation='relu'))
model4.add(MaxPooling1D(5))
model4.add(Conv1D(128, 5, activation='relu'))
model4.add(MaxPooling1D(3))
model4.add(Flatten())
model4.add(Dense(128, activation='relu'))
model4.add(Dense(1, activation='sigmoid'))

model4.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

model4.summary()

In [None]:
model4.fit(padded_seq, training_set['threat'], epochs=1, batch_size=32, validation_split=0.2)

**Model for INSULT**

In [None]:
model5 = Sequential()
model5.add(Embedding(vocab_size, 300, weights = [embeddings],
                     input_length=40, trainable=False))
model5.add(Conv1D(128, 5, activation='relu'))
model5.add(MaxPooling1D(5))
model5.add(Conv1D(128, 5, activation='relu'))
model5.add(MaxPooling1D(3))
model5.add(Flatten())
model5.add(Dense(128, activation='relu'))
model5.add(Dense(1, activation='sigmoid'))

model5.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

model5.summary()

In [None]:
model5.fit(padded_seq, training_set['insult'], epochs=2, batch_size=32, validation_split=0.2)

**Model for IDENTITY_HATE**

In [None]:
model6 = Sequential()
model6.add(Embedding(vocab_size, 300, weights = [embeddings],
                     input_length=40, trainable=False))
model6.add(Conv1D(128, 5, activation='relu'))
model6.add(MaxPooling1D(5))
model6.add(Conv1D(128, 5, activation='relu'))
model6.add(MaxPooling1D(3))
model6.add(Flatten())
model6.add(Dense(128, activation='relu'))
model6.add(Dense(1, activation='sigmoid'))

model6.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

model6.summary()

In [None]:
model6.fit(padded_seq, training_set['identity_hate'], epochs=1, batch_size=32, validation_split=0.2)
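
The six models above share one architecture and differ only in the target column and the number of epochs. One possible way to express that without repetition is sketched below. `build_model`, `train_all`, `LABELS` and `EPOCHS` are hypothetical names, and the sketch only rearranges the Keras calls already used above; it has not been checked against any particular Keras version.

```python
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
# Epoch counts copied from the individual fit calls above.
EPOCHS = {'toxic': 3, 'severe_toxic': 2, 'obscene': 2,
          'threat': 1, 'insult': 2, 'identity_hate': 1}

def build_model(vocab_size, embeddings, seq_len=40):
    """Return one compiled binary CNN classifier with the same layers as above."""
    # Imported inside the function so the sketch can be read without Keras installed.
    from keras.models import Sequential
    from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

    model = Sequential()
    model.add(Embedding(vocab_size, 300, weights=[embeddings],
                        input_length=seq_len, trainable=False))
    model.add(Conv1D(128, 5, activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Conv1D(128, 5, activation='relu'))
    model.add(MaxPooling1D(3))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

def train_all(padded_seq, training_set, vocab_size, embeddings):
    """Train one model per label and return them keyed by label name."""
    models = {}
    for label in LABELS:
        model = build_model(vocab_size, embeddings)
        model.fit(padded_seq, training_set[label], epochs=EPOCHS[label],
                  batch_size=32, validation_split=0.2)
        models[label] = model
    return models
```

Calling `train_all(padded_seq, training_set, vocab_size, embeddings)` would then return a dictionary of trained models keyed by label.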

## Let's make some predictions now

In [None]:
test_set = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test.csv')

In [None]:
x_test = test_set['comment_text']
# Reuse the tokenizer fitted on the training comments so that the word indices
# match the embedding matrix and the trained models.
seq = token.texts_to_sequences(x_test)

In [None]:
test_padded_seq = pad_sequences(seq, maxlen=40)
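
The embedding rows and the trained weights are tied to the word indices produced when the tokenizer was fitted on the training comments, so the test comments need to be encoded with that same vocabulary. A toy sketch (a simplified stand-in for the Keras `Tokenizer`; the helper and sentences are made up) shows how fitting on different text gives the same word a different index:

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency, 1 = most frequent (ties keep first-seen order)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {w: i for i, w in enumerate(ranked, start=1)}

train_index = fit_word_index(["bad bad comment", "good comment"])
test_index = fit_word_index(["good good good comment", "bad"])

print(train_index)
print(test_index)
print("index of 'bad':", train_index["bad"], "vs", test_index["bad"])
```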

In [None]:
toxic = model1.predict(test_padded_seq)
severe_toxic = model2.predict(test_padded_seq)
obscene = model3.predict(test_padded_seq)
threat = model4.predict(test_padded_seq)
insult = model5.predict(test_padded_seq)
identity_hate = model6.predict(test_padded_seq)

In [None]:
toxic = [1 if i>=0.5 else 0 for i in toxic]
severe_toxic = [1 if i>=0.5 else 0 for i in severe_toxic]
obscene = [1 if i>=0.5 else 0 for i in obscene]
threat = [1 if i>=0.5 else 0 for i in threat]
insult = [1 if i>=0.5 else 0 for i in insult]
identity_hate = [1 if i>=0.5 else 0 for i in identity_hate]

In [None]:
# Named test_ids to avoid shadowing Python's built-in id().
test_ids = test_set['id']

In [None]:
df = pd.DataFrame({'id':test_ids,
                   'toxic':toxic,
                   'severe_toxic':severe_toxic,
                   'obscene':obscene,
                   'threat':threat,
                   'insult':insult,
                   'identity_hate':identity_hate})

In [None]:
df.to_csv("submission.csv", index=False)