# Toxic Comment Classification

This notebook classifies comments into six toxicity categories using **Keras**. We use a Sequential model on features extracted with the Keras preprocessing utilities: the **Tokenizer** converts each comment into a fixed-length vector of term frequency-inverse document frequency (**TF-IDF**) weights.

In [None]:
# Importing important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

We begin by importing the essential elements from Keras.
A Sequential model stacks layers one after another. We will use Dense layers followed by Dropout, which randomly drops units during training to reduce overfitting.

In [None]:
import keras
from keras.models import Sequential
from keras.preprocessing import text
from keras.layers import Dense,Dropout
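As a rough illustration of what a dropout layer does during training, the sketch below applies "inverted" dropout to an array with NumPy. The function name `dropout_sketch`, the toy input, and the fixed seed are assumptions made for this example; the Keras `Dropout` layer handles all of this internally and is switched off at prediction time.

```python
import numpy as np

def dropout_sketch(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep_prob = 1.0 - rate
    # Boolean mask: True where the unit is kept
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected value of each unit unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility of the demo
activations = np.ones((2, 5))    # hypothetical layer output
dropped = dropout_sketch(activations, rate=0.4, rng=rng)
print(dropped)
```

Each entry of `dropped` is either zero (the unit was dropped) or scaled up by `1 / (1 - rate)`.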

We can use the pandas `read_csv` function to read the training and test data from CSV files. The result is a DataFrame, which behaves differently from a NumPy matrix. You can read more about DataFrames here: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.html

In [None]:
# Reading datasets from csv file
train_set = pd.read_csv('../input/train.csv')

In [None]:
train_set.head()

We will now separate the **comments** and their **labels** from the DataFrame (`train_set`). A DataFrame behaves somewhat like a dictionary of columns, so we can select columns by their headings.

In [None]:
# Separating Comments and Labels
comments = train_set['comment_text']
labels = train_set[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

In [None]:
comments.head()
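The column selection used above can be tried on a tiny, made-up DataFrame. The frame `toy` and its contents are invented for this sketch; only the selection syntax mirrors the notebook.

```python
import pandas as pd

# Tiny made-up frame with the same kind of layout as the training data
toy = pd.DataFrame({
    "comment_text": ["you are great", "go away"],
    "toxic": [0, 1],
    "insult": [0, 1],
})

one_column = toy["comment_text"]         # single heading -> Series
two_columns = toy[["toxic", "insult"]]   # list of headings -> DataFrame

print(type(one_column).__name__, one_column.shape)
print(type(two_columns).__name__, two_columns.shape)
```

A single heading returns a one-dimensional Series, while a list of headings returns a two-dimensional DataFrame, which is why `labels` ends up with one column per category.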

### Preprocessing Comments

We will use the built-in Keras text preprocessing class `Tokenizer`.
This class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector where the coefficient for each token can be binary, based on word count, or based on tf-idf. In short, it converts each comment into a numeric representation that the model can use as features. We will be using TF-IDF; you can use word count as well.

##### A brief introduction to TF-IDF

Typically, the tf-idf weight is composed of two terms: the first is the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in short ones. The term frequency is therefore often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
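The two formulas above can be checked on a small example. The sketch below uses plain Python and a made-up three-document corpus; the helper names `tf`, `idf`, and `tfidf` are chosen for this illustration and are not part of Keras.

```python
import math

# Made-up corpus: each document is a list of already tokenized words
docs = [
    "this comment is nice".split(),
    "this comment is rude and rude".split(),
    "nice weather today".split(),
]

def tf(term, doc):
    # Occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Assumes the term occurs in at least one document (otherwise division by zero)
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("rude", docs[1], docs))  # term found in one document only
print(tfidf("this", docs[1], docs))  # term found in two of three documents
```

In the second document, "rude" receives a higher weight than "this" because it is both more frequent in that document and rarer across the corpus.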

In [None]:
# Create a tokenizer that lowercases the text and limits the vocabulary to the most frequent words
num_words = 3000
tokenizer = text.Tokenizer(lower=True, num_words=num_words)

In [None]:
tokenizer.fit_on_texts(comments)

In [None]:
encoded_text = tokenizer.texts_to_matrix(comments,mode='tfidf')

In [None]:
encoded_text.shape
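To our knowledge, the `tfidf` mode of the Keras `Tokenizer` does not apply the textbook formulas described earlier exactly. It uses a smoothed variant, with tf = 1 + ln(count) and idf = ln(1 + N / (1 + df)), where N is the number of fitted documents and df is the number of documents containing the term. The sketch below reproduces that weighting in plain Python. The function name `keras_style_tfidf` is an assumption for this illustration, and the formula is worth verifying against the installed Keras version.

```python
import math

def keras_style_tfidf(count, n_docs, doc_freq):
    """Smoothed tf-idf weight for a term seen `count` times in one document.

    Assumed formula (check against your Keras version):
    tf = 1 + ln(count), idf = ln(1 + n_docs / (1 + doc_freq)).
    """
    if count == 0:
        return 0.0  # absent terms keep a zero entry in the matrix
    term_freq = 1 + math.log(count)
    inv_doc_freq = math.log(1 + n_docs / (1 + doc_freq))
    return term_freq * inv_doc_freq

# Term appearing twice in a document, and in 1 of 3 documents overall
print(keras_style_tfidf(count=2, n_docs=3, doc_freq=1))
```

Under this variant the weight grows only logarithmically with the raw count and is not divided by the document length.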

## Defining model

Our model will consist of three fully connected hidden layers followed by a sigmoid output layer with one unit per label.

In [None]:
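def Model():
    # Fully connected network on top of the TF-IDF features
    model = Sequential()
    model.add(Dense(1024, input_shape=(num_words,), activation='relu'))
    model.add(Dropout(0.4))

    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))

    model.add(Dense(64, activation='relu'))
    # One sigmoid unit per label, since a comment can carry several labels at once
    model.add(Dense(6, activation='sigmoid'))

    return model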

In [None]:
model = Model()
model.compile(loss=keras.losses.binary_crossentropy,optimizer='adam',metrics=['accuracy'])
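The model ends in six sigmoid units and is compiled with binary cross-entropy, which scores each label as its own yes/no decision. A minimal NumPy sketch of that loss for a single comment is shown below. The function name `binary_crossentropy_sketch`, the clipping constant, and the example predictions are assumptions for illustration; the Keras implementation may differ in numerical details.

```python
import numpy as np

def binary_crossentropy_sketch(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the labels of each sample."""
    # Clip predictions so that log(0) never occurs
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_label = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_label.mean(axis=-1)

# One made-up comment labelled as 'toxic' and 'obscene'
y_true = np.array([[1, 0, 1, 0, 0, 0]], dtype=float)
y_pred = np.array([[0.9, 0.1, 0.8, 0.2, 0.1, 0.1]])
loss = binary_crossentropy_sketch(y_true, y_pred)
print(loss)
```

Because each label is scored independently, a comment can be predicted as belonging to several categories at once, which a softmax output would not allow.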

##### Training on dataset

We hold out part of the dataset (30%, via `validation_split=0.3`) as validation data to check how well the model generalizes. We will train the model for 2 epochs.

In [None]:
history = model.fit(encoded_text,labels,verbose=1,epochs=2,validation_split=0.3)
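Note that `validation_split` takes the validation samples from the end of the arrays passed to `fit`, before any shuffling. The sketch below mimics that slicing with NumPy on made-up data; the helper name `tail_split` is an assumption, and the exact rounding Keras uses internally may differ slightly.

```python
import numpy as np

def tail_split(x, y, validation_split):
    """Keep the leading rows for training and the trailing rows for validation."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = np.arange(10).reshape(10, 1)   # ten made-up samples
y = np.arange(10)
(train_x, train_y), (val_x, val_y) = tail_split(x, y, validation_split=0.3)
print(train_y, val_y)
```

If the rows of the training file were ordered in some meaningful way, it would be worth shuffling the data before calling `fit`, so that the validation tail is representative.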

To evaluate the model on the test set, first convert the test comments using the same tokenizer and the same `tfidf` mode.
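A minimal sketch of that step is given below. The helper name `predict_on_texts` is hypothetical, and the commented usage assumes a `../input/test.csv` file with a `comment_text` column, which this notebook does not load.

```python
def predict_on_texts(tokenizer, model, texts):
    """Encode raw texts with an already fitted tokenizer, then predict."""
    # Reuse the fitted tokenizer: do not call fit_on_texts on the test data,
    # otherwise the word indices would no longer match the training features.
    encoded = tokenizer.texts_to_matrix(texts, mode='tfidf')
    return model.predict(encoded)

# Hypothetical usage (file name and column are assumptions):
# test_set = pd.read_csv('../input/test.csv')
# predictions = predict_on_texts(tokenizer, model, test_set['comment_text'])
```

The returned array would have one row per comment and one column per label, each value being a sigmoid score between 0 and 1.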

### Visualising the results

In [None]:
# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()