# MultiLabel Text Classifier using BERT embeddings as input features

Start the BERT server first:

```
./start-bert.sh
```

In [1]:
! python3 warmup.py

Example of how BERT encoding works with input text:
Type and shape of returned embedding <class 'numpy.ndarray'> (1, 768)
[['[CLS]', 'welcome', 'to', 'the', 'dev', '##con', '##f', '##20', '##19', 'conference', '!', 'bangalore', 'has', 'awesome', 'weather', 'today', '.', '[SEP]']]
Running it 100 times more to warmup the server...
100%|████████████████████████████████████████| 100/100 [00:00<00:00, 153.22it/s]


In [2]:
import pandas as pd
from bert_serving.client import BertClient
import matplotlib.pyplot as plt
import datetime
import os
import matplotlib.gridspec as gridspec
from train import CLASS_LABELS
%matplotlib inline

Using TensorFlow backend.


In [3]:
bc = BertClient(check_length=False)

## What does the data look like?

The data comes from the Kaggle Jigsaw Toxic Comment Classification Challenge: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

In [4]:
data_path = "data"
train_csv_file = os.path.join(data_path, "jigsaw-toxic-comment-classification-challenge/train.csv")

In [5]:
df_train = pd.read_csv(train_csv_file)

In [6]:
df_train.head()

index,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [7]:
# Print the full text of the first comment, which is not labeled identity_hate
for i, r in df_train.head(1).iterrows():
    print(r["comment_text"])

Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27


In [8]:
# First 10 comments labeled identity_hate
df_train[df_train.identity_hate == 1].head(10)

index,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
42,001810bf8c45bf5f,You are gay or antisemmitian? \n\nArchangel WH...,1,0,1,0,1,1
105,00472b8e2d38d1ea,A pair of jew-hating weiner nazi schmucks.,1,0,1,0,1,1
176,006b94add72ed61c,I think that your a Fagget get a oife and burn...,1,0,1,1,1,1
218,008e0818dde894fb,"Kill all niggers. \n\nI have hard, that others...",1,0,1,0,1,1
238,0097dd5c29bf7a15,u r a tw@ fuck off u gay boy.U r smelly.Fuck u...,1,0,1,0,1,1
429,01166f26ee280e56,Gay \n\nThe existence of CDVF is further proof...,1,0,1,0,1,1
521,015d1b0bb4cc744d,Dictionaries\n\nHow dare you call my contribut...,1,0,1,0,1,1
887,026bd33490542b2e,"you gay motherfucker i know where you live,i a...",1,1,1,0,1,1
952,029dceed3519e371,you studid cock sucker u stop callin me ok its...,1,0,1,0,1,1
1017,02c6e41e4b317ac3,WOULDN'T BE THE FIRST TIME BITCH. FUCK YOU I'L...,1,1,1,1,1,1
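
Every row carries six independent binary labels, so a single comment can belong to several classes at once. That is what makes this a multi-label problem rather than a multi-class one. The following is a minimal sketch of how one row maps to a multi-hot target vector. The `LABELS` list is an assumption taken from the column names above (the contents of `CLASS_LABELS` in `train.py` are not shown here), and `to_multi_hot` and `active_labels` are hypothetical helpers, not part of this project.

```python
# Assumed label order, copied from the CSV column names shown above.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def to_multi_hot(row, labels=LABELS):
    """Return the 0/1 target vector for one row (hypothetical helper)."""
    return [int(row[name]) for name in labels]


def active_labels(row, labels=LABELS):
    """Return the names of all labels that are set for one row."""
    return [name for name in labels if int(row[name]) == 1]


# Label values of the row with index 42 in the table above.
row_42 = {
    "toxic": 1,
    "severe_toxic": 0,
    "obscene": 1,
    "threat": 0,
    "insult": 1,
    "identity_hate": 1,
}

print(to_multi_hot(row_42))   # [1, 0, 1, 0, 1, 1]
print(active_labels(row_42))  # ['toxic', 'obscene', 'insult', 'identity_hate']
```

A pandas `Series` such as `df_train.iloc[42]` supports the same `row[name]` lookup, so the helpers should work on it unchanged.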


In [9]:
df_train.iloc[521].comment_text

'Dictionaries\n\nHow dare you call my contribution spam!!! I am a Kurd and I made a lsit of kurdish dictionaries. you bloody turkish nationalist and atoricity commiting bone breaking Nazi. watch out folk this slimy Turk is trying to censor the internet this is not undemocratic Turkey here, no prison cells in wikipedia you stupid Turk! And you buggers want membership to the EEC'

In [10]:
encodings = bc.encode([df_train.iloc[521].comment_text], show_tokens=True)

In [11]:
embedding = encodings[0][0]
embedding.shape, embedding.dtype

((768,), dtype('float16'))

In [12]:
" ".join(encodings[1][0])

'[CLS] di ##ction ##aries how dare you call my contribution spa ##m ! ! ! i am a ku ##rd and i made a l ##sit of kurdish di ##ction ##aries . you bloody turkish nationalist and at ##oric ##ity commit ##ing bone breaking nazi . watch out folk this slim ##y turk is trying to ce ##nsor the internet this is not und ##em ##oc ##ratic turkey here , no prison cells in wikipedia you stupid turk ! and you bug ##gers want membership to the ee ##c [SEP]'

In [13]:
len(encodings[1][0])

90
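
The token list includes the special `[CLS]` and `[SEP]` markers, and BERT's WordPiece tokenizer splits out-of-vocabulary words into pieces, with a `##` prefix marking each continuation piece (for example `di ##ction ##aries`). The sketch below shows how such pieces can be merged back into whole words for inspection. `merge_wordpieces` is a hypothetical helper written for this notebook, not a function from `bert_serving`.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into whole words, dropping [CLS] and [SEP]."""
    words = []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]"):
            continue  # special markers are not part of the text
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation piece: glue to previous word
        else:
            words.append(tok)
    return words


# First few tokens from the output above.
tokens = ["[CLS]", "di", "##ction", "##aries", "how", "dare", "you", "[SEP]"]
print(merge_wordpieces(tokens))  # ['dictionaries', 'how', 'dare', 'you']
```

In the notebook, the same function could be applied to `encodings[1][0]`.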

In [14]:
df_train.shape

(159571, 8)

## Estimate the time it takes to fetch embeddings from the BERT server

In [15]:
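# Time N_samples single-sentence requests against the BERT server
N_samples = 100
start = datetime.datetime.now()
for i, r in df_train.sample(N_samples).iterrows():
    txt = r.comment_text
    bc.encode([txt])

end = datetime.datetime.now()
elapsed = end - start
average = elapsed / N_samples  # timedelta per request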


In [16]:
# total_seconds() covers the whole duration; .microseconds is only the sub-second component
print("BERT server takes %.3f ms on average on this machine" % (average.total_seconds() * 1000))

BERT server takes 13.999 ms on average on this machine
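
The same measurement can also be taken with `time.perf_counter`, a monotonic clock intended for timing short durations. The sketch below wraps the loop in a reusable function that accepts any encoder callable. `average_latency_ms` and `fake_encode` are hypothetical names introduced here; `fake_encode` is only a stand-in so the example runs without a server.

```python
import time


def average_latency_ms(encode_fn, texts):
    """Return the mean wall-clock time per encode_fn call, in milliseconds."""
    start = time.perf_counter()
    for txt in texts:
        encode_fn([txt])  # one request per text, as in the cell above
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / len(texts)


def fake_encode(batch):
    """Stand-in encoder: returns one 768-dimensional zero vector per input."""
    return [[0.0] * 768 for _ in batch]


latency = average_latency_ms(fake_encode, ["hello world"] * 100)
print("%.3f ms per request" % latency)
```

In the notebook you would pass `bc.encode` and a list of sampled comments instead of the stand-in.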


In [17]:
# ! python3 train.py data/

In [18]:
# ! python3 down_parse_subtitle.py data/

In [19]:
# ! python3 evaluate.py data/