# Introduction

This notebook guide through the simple pipeline to solve the [Toxic comment classification problem](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) hosted on kaggle in year 2018.

In this competition we are given the dataset of 160k  comments wikipedia comments which are manually labeled as `toxic`, `severe_toxic`,  `obscene`, `threat`, `insult`, `identity_hate`. 

This is multi label classification problem where we have to classify comments among all the labels.


# Methodology used

I will use two bi-directional GRU (type of recurrent neural network) as base of my model in order to solve this problem. I will be using [keras API](https://www.tensorflow.org/api_docs/python/tf/keras) supported by Tensorflow for training the model because of its easy implementation.


In order to understand basic pipeline of a NLP project you can read [this](https://t.co/dVO5ky1pGi?amp=1) blog post.

## Download data 

We will download the data using kaggle API keys. 
In order to get yours go to your account page which is kaggle.com/`your_user_id`/account, and click "Create new API token" button. save the file on gdrive under 'kaggle/kaggle.json'. After doing that execute the following code.

In [1]:
!pip3 install kaggle



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
! mkdir ~/.kaggle/

In [0]:
!cp 'drive/My Drive/kaggle/kaggle.json' ~/.kaggle/kaggle.json

In [5]:
!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge

Downloading sample_submission.csv.zip to /content
  0% 0.00/1.39M [00:00<?, ?B/s]
100% 1.39M/1.39M [00:00<00:00, 46.3MB/s]
Downloading test.csv.zip to /content
 38% 9.00M/23.4M [00:00<00:00, 20.9MB/s]
100% 23.4M/23.4M [00:00<00:00, 43.4MB/s]
Downloading train.csv.zip to /content
 34% 9.00M/26.3M [00:00<00:00, 23.7MB/s]
100% 26.3M/26.3M [00:00<00:00, 53.7MB/s]
Downloading test_labels.csv.zip to /content
  0% 0.00/1.46M [00:00<?, ?B/s]
100% 1.46M/1.46M [00:00<00:00, 97.6MB/s]


In [6]:
!unzip test.csv.zip
!unzip train.csv.zip
!unzip test_labels.csv.zip
!unzip sample_submission.csv.zip

Archive:  test.csv.zip
  inflating: test.csv                
Archive:  train.csv.zip
  inflating: train.csv               
Archive:  test_labels.csv.zip
  inflating: test_labels.csv         
Archive:  sample_submission.csv.zip
  inflating: sample_submission.csv   


In [7]:
!ls

drive			   test.csv		train.csv
sample_data		   test.csv.zip		train.csv.zip
sample_submission.csv	   test_labels.csv
sample_submission.csv.zip  test_labels.csv.zip


# Imports and initialization


In [0]:
import os
import numpy as np
import pandas as pd
import tqdm
import nltk
from nltk.corpus import stopwords
import re
import string
from gensim.models import KeyedVectors

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split


In [0]:
from sklearn import metrics
from tensorflow.keras import backend as K
import tensorflow as tf

In [0]:
TEST_FILE = './test.csv'
TRAIN_FILE = './train.csv'
SAMPLE_SUB = './sample_submission.csv'
TEST_LABEL = './test_labels.csv'

In [0]:
embed_size = 300
maxlen = 300
max_features =  2000000
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

In [0]:
UNKNOWN_WORD = "_UNK_"
END_WORD = "_END_"
NAN_WORD = "_NAN_"

In [0]:
MODEL_DIR = "models/"
os.makedirs(MODEL_DIR, exist_ok=True)

In [16]:
%%time

train = pd.read_csv(TRAIN_FILE)
test = pd.read_csv(TEST_FILE)
sample = pd.read_csv(SAMPLE_SUB)

CPU times: user 1.56 s, sys: 185 ms, total: 1.75 s
Wall time: 1.75 s


In [17]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


# Preprocessing text


For our current problem we are just removing the newline characters `\n` and numbers from the text. We will separate the word and special characters. Doing so will help us down the line (to be exact it will be helpful in tokenization).

In [0]:
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def clean_text(s):
    return re_tok.sub(r' \1 ', s).lower()

def clean_numbers(x):
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x

def remove_nl(x):
  x = re.sub('\n', ' ', x)
  return x


In [19]:
%%time

print("Cleaning train comments")
train["comment_text"] = train["comment_text"].apply(clean_text)
print("Cleaning test comments")
test["comment_text"] = test["comment_text"].apply(clean_text)
print("Removing numbers from train comments")
train["comment_text"] = train["comment_text"].apply(clean_numbers)
print("Removing numbers from test comments")
test["comment_text"] = test["comment_text"].apply(clean_numbers)
print("Removing next line from train comments")
train["comment_text"] = train["comment_text"].apply(remove_nl)
print("Removing next line from test comments")
test["comment_text"] = test["comment_text"].apply(remove_nl)

Cleaning train comments
Cleaning test comments
Removing numbers from train comments
Removing numbers from test comments
Removing next line from train comments
Removing next line from test comments
CPU times: user 21.9 s, sys: 83.7 ms, total: 22 s
Wall time: 22.1 s


In [20]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,explanation why the edits made under my userna...,0,0,0,0,0,0
1,000103f0d9cfb60f,d ' aww ! he matches this background colour i...,0,0,0,0,0,0
2,000113f07ec002fd,"hey man , i ' m really not trying to edit war...",0,0,0,0,0,0
3,0001b41b1c6bb37e,""" more i can ' t make any real suggestions o...",0,0,0,0,0,0
4,0001d958c54c6e35,"you , sir , are my hero . any chance you re...",0,0,0,0,0,0


In [21]:
test.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,yo bitch ja rule is more succesful then you ' ...
1,0000247867823ef7,= = from rfc = = the title is fine as...
2,00013b17ad220c46,""" = = sources = = * zawe ashto..."
3,00017563c3f7919a,": if you have a look back at the source , th..."
4,00017695ad8997eb,i don ' t anonymously edit articles at all .


# Downloading and preparing word embeddings

For this experiment we are using [fasttext](https://fasttext.cc/docs/en/english-vectors.html) embedding vectors of size 300, trained with subword information on Common Crawl (600B tokens). After downloading we will extract and read the `crawl-300d-2M-subword.vec` file

In [23]:
!wget 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip'

--2019-09-14 10:18:36--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.20.22.166, 104.20.6.166, 2606:4700:10::6814:6a6, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.20.22.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5828358084 (5.4G) [application/zip]
Saving to: ‘crawl-300d-2M-subword.zip’


2019-09-14 10:26:31 (11.7 MB/s) - ‘crawl-300d-2M-subword.zip’ saved [5828358084/5828358084]



In [24]:
%%time 
!unzip crawl-300d-2M-subword.zip

Archive:  crawl-300d-2M-subword.zip
  inflating: crawl-300d-2M-subword.vec  
  inflating: crawl-300d-2M-subword.bin  
CPU times: user 596 ms, sys: 80.4 ms, total: 676 ms
Wall time: 2min 5s


In [0]:
!rm crawl-300d-2M-subword.bin

In [0]:
EMBEDDING_FILE = "/content/crawl-300d-2M-subword.vec"

In [27]:
%%time

def load_embed(file):
    def get_coefs(word,*arr): 
        return word, np.asarray(arr, dtype='float32')
    
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(file) if len(o)>100)
    words = embeddings_index.keys()
    embedding_list = [embeddings_index[k] for k in words]
    embedding_word_dict = {item : index for index, item in enumerate(words)}
    return embedding_list, embedding_word_dict

embedding_list, embedding_word_dict = load_embed(EMBEDDING_FILE)

CPU times: user 2min 29s, sys: 4.48 s, total: 2min 34s
Wall time: 2min 34s


In order to handle for the words that are not present in fasttext embedding we will calculate mean and standard deviation of all the embedding vectors to generate random embedding for unknown words.

In [0]:
all_embs = np.stack(embedding_list)
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

In [0]:
embedding_word_dict[UNKNOWN_WORD] = len(embedding_list)
embedding_list.append([0.] * embed_size)

# Tokenize comments and rebuild embedding matrix



In [0]:
all_comments = list(train["comment_text"]) +  list(test["comment_text"])

In [32]:
%%time
tokenizer = Tokenizer(num_words=max_features, oov_token=UNKNOWN_WORD)
tokenizer.fit_on_texts(list(all_comments))

CPU times: user 21.4 s, sys: 82.9 ms, total: 21.5 s
Wall time: 21.5 s


In [0]:
nb_words = min(max_features, len(tokenizer.word_index))
not_found = []
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

for word, i in tokenizer.word_index.items():
    if i >= max_features:
        continue
    try:
      index = embedding_word_dict[word]
      embedding_vector = embedding_list[index]
      embedding_matrix[i] = embedding_vector
    except:
      not_found.append(word)


Deleting the unrequired variables from memory as the memory available to us is always limited. More available memory means bigger batch size of our training model.

In [0]:
del all_comments 
del embedding_list
del embedding_word_dict


# Prepare inputs and output

In [37]:
%%time

train_X = tokenizer.texts_to_sequences(train["comment_text"])
test_X = tokenizer.texts_to_sequences(test["comment_text"])

CPU times: user 17 s, sys: 60 ms, total: 17.1 s
Wall time: 17.1 s


In [0]:
train_X = pad_sequences(train_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

In [0]:
train_y = train[list_classes].values

# Define model and prepare for training

This compition uses ROC

In [0]:
def auc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred, curve='ROC')[1]
    K.get_session().run(tf.local_variables_initializer())
    return auc

In [0]:
X_train, X_eval, y_train, y_eval = train_test_split(train_X, train_y, test_size=0.1, random_state=101)


In [0]:

def get_model(embedding_matrix, sequence_length, dropout_rate, recurrent_units, dense_size, embed_train= False):
  input_layer = tf.keras.layers.Input(shape=(sequence_length,))
  embedding_layer = tf.keras.layers.Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1], weights=[embedding_matrix], trainable=embed_train)(input_layer)
  x = tf.keras.layers.Dropout(dropout_rate)(embedding_layer)
  x = tf.keras.layers.Bidirectional(tf.keras.layers.CuDNNGRU(recurrent_units, return_sequences=True))(x)
  x = tf.keras.layers.Dropout(dropout_rate)(x)
  x = tf.keras.layers.Bidirectional(tf.keras.layers.CuDNNGRU(recurrent_units, return_sequences=True))(x)
  avg_pool = tf.keras.layers.GlobalAveragePooling1D()(x)
  max_pool = tf.keras.layers.GlobalMaxPooling1D()(x)
  x =  tf.compat.v1.keras.layers.concatenate([avg_pool, max_pool])
  x = tf.keras.layers.Dropout(dropout_rate)(x)
  x = tf.keras.layers.Dense(dense_size, activation="relu")(x)
  x = tf.keras.layers.Dropout(dropout_rate)(x)
  output_layer = tf.keras.layers.Dense(6, activation="sigmoid")(x)
  model = tf.keras.Model(inputs=input_layer, outputs=output_layer)
  adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
  model.compile(optimizer=adam, loss=tf.keras.losses.binary_crossentropy, metrics=['accuracy', auc])
  return model
   

In [86]:
get_model_func = lambda: get_model(embedding_matrix, maxlen, 0.3, 120, 512, embed_train=False )
model = get_model_func()
model.summary()

Model: "model_5"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_9 (InputLayer)            [(None, 300)]        0                                            
__________________________________________________________________________________________________
embedding_8 (Embedding)         (None, 300, 300)     102254100   input_9[0][0]                    
__________________________________________________________________________________________________
dropout_29 (Dropout)            (None, 300, 300)     0           embedding_8[0][0]                
__________________________________________________________________________________________________
bidirectional_16 (Bidirectional (None, 300, 240)     303840      dropout_29[0][0]                 
____________________________________________________________________________________________

In [0]:
model.load_weights('/content/saved-dmodel-0.991368.h5')

In [0]:
checkpoint = tf.keras.callbacks.ModelCheckpoint('models/model-{acc:03f}.h5', verbose=1, monitor='val_auc',save_best_only=True, mode='max')

# Training the model

In [0]:
model.fit(X_train, y_train, batch_size=512, validation_split=0.1, callbacks=[checkpoint], epochs=40)

Train on 129251 samples, validate on 14362 samples
Epoch 1/40

# Predict using model

In [80]:
preds = model.predict([test_X], batch_size=1024, verbose=1)



In [81]:
sample['id'] = test['id']
sample[list_classes] = preds
sample.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.9999627,0.531956,0.999861,0.007352,0.999299,0.940022
1,0000247867823ef7,1.233816e-05,0.0,0.0,0.0,0.0,0.0
2,00013b17ad220c46,0.0,0.0,0.0,0.0,0.0,0.0
3,00017563c3f7919a,1.788139e-07,0.0,0.0,0.0,0.0,0.0
4,00017695ad8997eb,9.834766e-06,0.0,0.0,0.0,0.0,0.0


In [0]:
sample.to_csv("submission.csv", index=False)

In [83]:
!kaggle competitions submit -c jigsaw-toxic-comment-classification-challenge -f submission.csv -m "Submission through colab"

100% 13.8M/13.8M [00:03<00:00, 3.95MB/s]
Successfully submitted to Toxic Comment Classification Challenge