# Introduction

This notebook guide through the simple pipeline to solve the [Toxic comment classification problem](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) hosted on kaggle in year 2018.

In this competition we are given the dataset of 160k  comments wikipedia comments which are manually labeled as `toxic`, `severe_toxic`,  `obscene`, `threat`, `insult`, `identity_hate`. 

This is multi label classification problem where we have to classify comments among all the labels.


# Methodology used

I will use two bi-directional GRU (type of recurrent neural network) as base of my model in order to solve this problem. I will be using [keras API](https://www.tensorflow.org/api_docs/python/tf/keras) supported by Tensorflow for training the model because of its easy implementation.


In order to understand basic pipeline of a NLP project you can read [this](https://) blog post.

## Download data 

We will download the data using kaggle API keys. 
In order to get yours go to your account page which is kaggle.com/`your_user_id`/account, and click "Create new API token" button. save the file on gdrive under 'kaggle/kaggle.json'. After doing that execute the following code.

In [2]:
!pip3 -q install kaggle
!pip3 -q install fasttext
!pip3 -q install sentencepiece

[?25l[K     |█████▊                          | 10kB 20.5MB/s eta 0:00:01[K     |███████████▍                    | 20kB 1.8MB/s eta 0:00:01[K     |█████████████████               | 30kB 2.7MB/s eta 0:00:01[K     |██████████████████████▊         | 40kB 1.8MB/s eta 0:00:01[K     |████████████████████████████▍   | 51kB 2.2MB/s eta 0:00:01[K     |████████████████████████████████| 61kB 2.5MB/s 
[?25h  Building wheel for fasttext (setup.py) ... [?25l[?25hdone
[K     |████████████████████████████████| 1.0MB 2.8MB/s 
[?25h

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
! mkdir ~/.kaggle/

In [0]:
!cp 'drive/My Drive/kaggle/kaggle.json' ~/.kaggle/kaggle.json

In [6]:
!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge

Downloading sample_submission.csv.zip to /content
  0% 0.00/1.39M [00:00<?, ?B/s]
100% 1.39M/1.39M [00:00<00:00, 46.3MB/s]
Downloading test.csv.zip to /content
 43% 10.0M/23.4M [00:00<00:00, 53.3MB/s]
100% 23.4M/23.4M [00:00<00:00, 78.2MB/s]
Downloading train.csv.zip to /content
 99% 26.0M/26.3M [00:00<00:00, 54.5MB/s]
100% 26.3M/26.3M [00:00<00:00, 104MB/s] 
Downloading test_labels.csv.zip to /content
  0% 0.00/1.46M [00:00<?, ?B/s]
100% 1.46M/1.46M [00:00<00:00, 96.5MB/s]


In [0]:
!unzip -q test.csv.zip
!unzip -q train.csv.zip
!unzip -q test_labels.csv.zip
!unzip -q sample_submission.csv.zip

In [0]:
!rm *zip

In [9]:
!ls 

drive  sample_data  sample_submission.csv  test.csv  test_labels.csv  train.csv


# Imports and initialization


In [0]:
import os
import numpy as np
import pandas as pd
import tqdm
import nltk
from nltk.corpus import stopwords
import re
import string
from gensim.models import KeyedVectors

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split


In [0]:
from sklearn import metrics
from tensorflow.keras import backend as K
import tensorflow as tf

In [0]:
import fasttext
import sentencepiece as spm

In [0]:
TEST_FILE = './test.csv'
TRAIN_FILE = './train.csv'
SAMPLE_SUB = './sample_submission.csv'
TEST_LABEL = './test_labels.csv'

In [0]:
maxlen = 300
max_features =  2000000
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

In [0]:
UNKNOWN_WORD = "<unk>"

In [0]:
MODEL_DIR = "models/"
os.makedirs(MODEL_DIR, exist_ok=True)

In [18]:
%%time

train = pd.read_csv(TRAIN_FILE)
test = pd.read_csv(TEST_FILE)
sample = pd.read_csv(SAMPLE_SUB)

CPU times: user 1.56 s, sys: 180 ms, total: 1.74 s
Wall time: 1.75 s


In [19]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [20]:
test.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


# Preprocessing text


For our current problem we are just removing the newline characters `\n` and numbers from the text. We will separate the word and special characters. Doing so will help us down the line (to be exact it will be helpful in tokenization).

In [0]:
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def clean_text(s):
    return re_tok.sub(r' \1 ', s).lower()

def clean_numbers(x):
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x

def remove_nl(x):
  x = re.sub('\n', ' ', x)
  return x


def add_nl(x):
  re_nl = re.compile(f'([?.!;:])')
  x = re_nl.sub(r'\1 \n', x).lower()
  return x


In [22]:
%%time

print("Adding new line for sentencepiece in train comments")
train["comment_spm"] = train["comment_text"].apply(add_nl)
print("Adding new line for sentencepiece in test comments")
test["comment_spm"] = test["comment_text"].apply(add_nl)
print("Removing numbers from train comments")

Adding new line for sentencepiece in train comments
Adding new line for sentencepiece in test comments
Removing numbers from train comments
CPU times: user 5.06 s, sys: 74.9 ms, total: 5.14 s
Wall time: 5.15 s


In [23]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,comment_spm
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,explanation\nwhy the edits made under my usern...
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,d'aww! \n he matches this background colour i'...
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"hey man, i'm really not trying to edit war. \n..."
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"""\nmore\ni can't make any real suggestions on ..."
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"you, sir, are my hero. \n any chance you remem..."


In [24]:
test.head()

Unnamed: 0,id,comment_text,comment_spm
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,yo bitch ja rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...,== from rfc == \n\n the title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap...",""" \n\n == sources == \n\n * zawe ashton on lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in...",": \nif you have a look back at the source, the..."
4,00017695ad8997eb,I don't anonymously edit articles at all.,i don't anonymously edit articles at all. \n


# Training BPE model from scratch using sentencepiece


Sentencepiece is a package developed by google in order to create optimal tokenization technique and make vocabulary of fixed size and reduce the number of unknown tokens.

You can use [this](https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb) code example to check out the library.

In [0]:
SENT_PIECE = 'pre_sentpc.txt'

def createFile(name, data):
  with open(name, 'w') as f:
    for item in data:
        f.write("%s\n" % item)

all_comments = train['comment_spm'] +  test['comment_spm'] 

createFile(SENT_PIECE,all_comments)

In [0]:
!ls

drive	pre_sentpc.txt	sample_submission.csv  test_labels.csv
models	sample_data	test.csv	       train.csv


In [26]:
%%time

spm.SentencePieceTrainer.Train("--input={} --model_prefix=sentence_pc --model_type=bpe --vocab_size=90000".format(SENT_PIECE))

CPU times: user 5min 47s, sys: 1.92 s, total: 5min 49s
Wall time: 5min 40s


True

In [27]:
sp_tokenizer = spm.SentencePieceProcessor()
sp_tokenizer.load('sentence_pc.model')

True

In [28]:
%%time

def tokenize_comment(comment):
  return " ".join(sp_tokenizer.EncodeAsPieces(comment)).lower()

print("tokenize train comments")
train["comment_token"] = train["comment_text"].apply(tokenize_comment)
print("tokenize test comments")
test["comment_token"] = test["comment_text"].apply(tokenize_comment)


tokenize train comments
tokenize test comments
CPU times: user 1min 25s, sys: 247 ms, total: 1min 25s
Wall time: 1min 25s


In [29]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,comment_spm,comment_token
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,explanation\nwhy the edits made under my usern...,▁ e x plan ation ▁ w hy ▁the ▁edits ▁made ▁und...
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,d'aww! \n he matches this background colour i'...,▁ d ' aww ! ▁ h e ▁matches ▁this ▁background ▁...
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"hey man, i'm really not trying to edit war. \n...","▁ h ey ▁man , ▁ i ' m ▁really ▁not ▁trying ▁to..."
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"""\nmore\ni can't make any real suggestions on ...","▁"" ▁m ore ▁ i ▁can ' t ▁make ▁any ▁real ▁sugge..."
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"you, sir, are my hero. \n any chance you remem...","▁ y ou , ▁sir , ▁are ▁my ▁hero . ▁ a ny ▁chanc..."


In [30]:
test.head()

Unnamed: 0,id,comment_text,comment_spm,comment_token
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,yo bitch ja rule is more succesful then you'll...,▁ y o ▁bitch ▁ j a ▁ r ule ▁is ▁more ▁succesfu...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...,== from rfc == \n\n the title is fine as it is...,▁== ▁ f rom ▁ r f c ▁== ▁ t he ▁title ▁is ▁fin...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap...",""" \n\n == sources == \n\n * zawe ashton on lap...","▁"" ▁== ▁ s ources ▁== ▁* ▁ z a we ▁ a s hton ▁..."
3,00017563c3f7919a,":If you have a look back at the source, the in...",": \nif you have a look back at the source, the...",▁: i f ▁you ▁have ▁a ▁look ▁back ▁at ▁the ▁sou...
4,00017695ad8997eb,I don't anonymously edit articles at all.,i don't anonymously edit articles at all. \n,▁ i ▁don ' t ▁anonymously ▁edit ▁articles ▁at ...


In [0]:
createFile('pre_fasttext.txt',train['comment_token'] +  test['comment_token'] )

# Training a FastText embedding model from scrach

Creating custom embedding for our models using FastText library.



In [0]:
import fasttext

### Installing Fasttext from source

In [33]:
! wget https://github.com/facebookresearch/fastText/archive/v0.9.1.zip
! unzip -q v0.9.1.zip

--2019-09-21 07:47:01--  https://github.com/facebookresearch/fastText/archive/v0.9.1.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/fastText/zip/v0.9.1 [following]
--2019-09-21 07:47:01--  https://codeload.github.com/facebookresearch/fastText/zip/v0.9.1
Resolving codeload.github.com (codeload.github.com)... 140.82.114.10
Connecting to codeload.github.com (codeload.github.com)|140.82.114.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘v0.9.1.zip’

v0.9.1.zip              [<=>                 ]       0  --.-KB/s               v0.9.1.zip              [ <=>                ]   1.52M  6.87MB/s               v0.9.1.zip              [  <=>               ]   4.13M  14.9MB/s    in 0.3s    

2019-09-21 07:47:01 (14.9 MB/s) - ‘v0.9.1.zip’ saved [

In [34]:
! cd fastText-0.9.1 &&  make 

c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c src/args.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c src/matrix.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c src/dictionary.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c src/loss.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c src/productquantizer.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c src/densematrix.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c src/quantmatrix.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c src/vector.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c src/model.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c src/utils.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c src/meter.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -DNDEBUG -c s

### Training the fasttext model

In [35]:
%%time

! cd fastText-0.9.1 && ./fasttext skipgram -input ../pre_fasttext.txt -output ../embed_model -epoch 6 -lr 0.05 -dim 300
# 8206, 1.166972

Read 31M words
Number of words:  74667
Number of labels: 0
tcmalloc: large alloc 2489606144 bytes == 0x55ba4d50a000 @  0x7fcae86ca887 0x55ba42979b7d 0x55ba42984028 0x55ba4298b2e4 0x55ba42991092 0x55ba4295bcc7 0x7fcae7767b97 0x55ba4295bf8a
Progress: 100.0% words/sec/thread:   10520 lr:  0.000000 loss:  1.575268 ETA:   0h 0m
CPU times: user 23.5 s, sys: 4.26 s, total: 27.8 s
Wall time: 25min 45s


In [0]:
EMBEDDING_FILE = "embed_model.bin"

In [0]:
vocabs = [sp_tokenizer.id_to_piece(id) for id in range(sp_tokenizer.get_piece_size())]

In [39]:
%%time
embed_model = fasttext.load_model(EMBEDDING_FILE)

CPU times: user 502 ms, sys: 1.78 s, total: 2.28 s
Wall time: 2.28 s





### Creating Embedding matrix

In [0]:
not_present = []

In [0]:
%%time 

nb_words = len(vocabs)
not_found = []
embedding_matrix = np.zeros((nb_words, 300), dtype=np.float64)

for word in vocabs:
    try:
      index = sp_tokenizer.encode_as_ids(word)[0]
      embedding_matrix[index] = embed_model[word]
    except:
      not_found.append(word)

In [56]:
len(not_found)

1

Deleting the unrequired variables from memory as the memory available to us is always limited. More available memory means bigger batch size of our training model.

In [0]:
del all_comments 


# Prepare inputs and output

In [0]:
def text_to_ids(comment):
  return sp_tokenizer.encode_as_ids(comment)

In [67]:
%%time
train_X =  train["comment_spm"].apply(text_to_ids)
test_X = test["comment_spm"].apply(text_to_ids)

CPU times: user 1min 24s, sys: 591 ms, total: 1min 24s
Wall time: 1min 24s


In [0]:
train_X = pad_sequences(train_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

In [0]:
train_y = train[list_classes].values

# Define model and prepare for training

This compition evaluates a model based on the mean column-wise ROC AUC. 


In [0]:
def auc(y_true, y_pred):
    auc = tf.metrics.mean(tf.metrics.auc(y_true, y_pred, curve='ROC'))
    K.get_session().run(tf.local_variables_initializer())
    return auc

In [0]:
X_train, X_eval, y_train, y_eval = train_test_split(train_X, train_y, test_size=0.1, random_state=101)


In [0]:

def get_model(embedding_matrix, sequence_length, dropout_rate, recurrent_units, dense_size, embed_train= False):
  input_layer = tf.keras.layers.Input(shape=(sequence_length,))
  embedding_layer = tf.keras.layers.Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1], weights=[embedding_matrix], trainable=embed_train)(input_layer)
  x = tf.keras.layers.Dropout(dropout_rate)(embedding_layer)
  x = tf.keras.layers.Bidirectional(tf.keras.layers.CuDNNGRU(recurrent_units, return_sequences=True))(x)
  x = tf.keras.layers.Dropout(dropout_rate)(x)
  x = tf.keras.layers.Bidirectional(tf.keras.layers.CuDNNGRU(recurrent_units, return_sequences=True))(x)
  avg_pool = tf.keras.layers.GlobalAveragePooling1D()(x)
  max_pool = tf.keras.layers.GlobalMaxPooling1D()(x)
  x =  tf.compat.v1.keras.layers.concatenate([avg_pool, max_pool])
  x = tf.keras.layers.Dropout(dropout_rate)(x)
  x = tf.keras.layers.Dense(dense_size, activation="relu")(x)
  x = tf.keras.layers.Dropout(dropout_rate)(x)
  output_layer = tf.keras.layers.Dense(6, activation="sigmoid")(x)
  model = tf.keras.Model(inputs=input_layer, outputs=output_layer)
  adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
  model.compile(optimizer=adam, loss=tf.keras.losses.binary_crossentropy, metrics=['accuracy', auc])
  return model
   

In [74]:
get_model_func = lambda: get_model(embedding_matrix, maxlen, 0.4, 120, 512, embed_train=False )
model = get_model_func()
model.summary()

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)     

In [0]:
checkpoint = tf.keras.callbacks.ModelCheckpoint('models/model-best.h5', verbose=1, monitor='val_loss',save_best_only=True, mode='min')

In [0]:
stop_loss = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=30)

# Training the model

In [77]:
model.fit(X_train, y_train, batch_size=512, validation_split=0.1, callbacks=[checkpoint, stop_loss ], epochs=200)



KeyboardInterrupt: ignored

# Evaluate Model

In [0]:
from sklearn.metrics import roc_auc_score

In [92]:
!ls -al models

total 115096
drwxr-xr-x 2 root root      4096 Sep 21 08:45 .
drwxr-xr-x 1 root root      4096 Sep 21 10:48 ..
-rw-r--r-- 1 root root 117846296 Sep 21 09:38 model-best.h5


In [0]:
model.load_weights('/content/models/model-best.h5')

In [110]:
eval_p = model.predict([X_eval], batch_size=1024, verbose=1)

print(roc_auc_score(y_eval, eval_p))

0.987919586565639


# Predict using model

In [105]:
preds = model.predict([test_X], batch_size=1024, verbose=1)



In [106]:
sample['id'] = test['id']
sample[list_classes] = preds
sample.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.999649,0.2806593,0.980301,0.1414118,0.925292,0.374861
1,0000247867823ef7,0.000734,1.192093e-07,0.000169,2.086163e-07,9.5e-05,1.907349e-06
2,00013b17ad220c46,0.0003,0.0,9e-05,5.960464e-08,2.5e-05,5.364418e-07
3,00017563c3f7919a,9e-05,0.0,1.3e-05,0.0,8e-06,8.940697e-08
4,00017695ad8997eb,0.004475,1.877546e-06,0.000628,1.302361e-05,0.000508,1.531839e-05


In [0]:
sample.to_csv("submission.csv", index=False)

In [108]:
!kaggle competitions submit -c jigsaw-toxic-comment-classification-challenge -f submission.csv -m "Submission through colab: BPE test 3"

100% 19.8M/19.8M [00:00<00:00, 24.0MB/s]
Successfully submitted to Toxic Comment Classification Challenge