**Multi-Class Classification with BERT Using Keras and TensorFlow**



In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Load the Text data 

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# The first line of the input file contains "5485" rather than a document, so skiprows=1 skips it.
df = pd.read_csv('/content/gdrive/My Drive/doc_classification.txt', header=None, skiprows=1)
print(df.head(3))

                                                   0
0  1 champion products ch approves stock split ch...
1  2 computer terminal systems cpml completes sal...
2  1 cobanco inc cbco year net shr cts vs dlrs ne...


Extract labels and Sentences and create a DataFrame

In [None]:
sentences=[]
labels=[]
# Iterate over every row of column 0 and extract the label and the sentence.
# With maxsplit=1 the result has at most 2 items: index 0 is the label, index 1 is the rest of the line (the sentence).
for line in df.iloc[:,0]:
    line=line.split(" ", maxsplit=1)
    labels.append(line[0])
    sentences.append(line[1])

sentences=pd.DataFrame(sentences)
labels=pd.DataFrame(labels)
df2=pd.concat([sentences,labels],axis=1) # Both columns are named 0 after the concat, so they are renamed on the next line
df2.columns=["Sentence","Label"]

#Drop duplicates except for first one
df3=df2.drop_duplicates(keep='first')
print("Before: Class compositions percentage: \n",df3["Label"].value_counts(normalize=True))

Before: Class compositions percentage: 
 1    0.522388
2    0.293164
6    0.045145
3    0.044039
8    0.035931
7    0.032430
4    0.019348
5    0.007555
Name: Label, dtype: float64
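As a small illustration of the parsing step above, the sketch below applies the same `split(" ", maxsplit=1)` call to two made-up lines in the "label text" layout. The sample strings are hypothetical and only modeled on the rows printed earlier.

```python
# Hypothetical sample lines in the same "<label> <text>" layout as the input file.
sample_lines = [
    "1 champion products approves stock split",
    "2 computer terminal systems completes sale",
]

labels, sentences = [], []
for line in sample_lines:
    # maxsplit=1 splits only at the first space, so the text keeps its own spaces.
    label, sentence = line.split(" ", maxsplit=1)
    labels.append(label)
    sentences.append(sentence)

print(labels)     # ['1', '2']
print(sentences)
```

Note that the labels come out as strings, which is why they are converted to integers later on.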


Data Pre-processing, cleaning

In [None]:
import re

# Data pre-processing: convert to lower case and remove special characters.
def clean_corpus(sentence):
    # Convert to lower case
    sentence = sentence.lower()
    # Remove special characters
    pattern1 = r'[\,+\:\?\!\"\(\)!\'\.\%\[\]]+'
    # Remove words that are 1 or 2 characters long
    pattern2 = r'\b\w{1,2}\b'
    # Collapse runs of spaces into a single space
    pattern3 = r' +'

    replacements = [(pattern1, " "), (pattern2, " "), (pattern3, " ")]
    for pat, repl in replacements:
        sentence = re.sub(pat, repl, sentence)
    return sentence


df3["Sentence"] = df3.Sentence.apply(clean_corpus)

print("We have %d lines/Sentences in the corpus." %len(df3.Sentence))
df3.head()

We have 5427 lines/Sentences in the corpus.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,Sentence,Label
0,champion products approves stock split champio...,1
1,computer terminal systems cpml completes sale ...,2
2,cobanco inc cbco year net shr cts dlrs net ass...,1
3,international inc qtr jan oper shr loss two c...,1
4,brown forman inc bfd qtr net shr one dlr cts n...,1
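To make the effect of the three regular expressions easier to see, here is a minimal sketch that runs the same substitutions on a single invented string. The function name `clean_text` and the sample sentence are assumptions made for this example.

```python
import re

def clean_text(sentence):
    """Apply the same three substitutions as clean_corpus, in the same order."""
    sentence = sentence.lower()
    sentence = re.sub(r'[\,+\:\?\!\"\(\)!\'\.\%\[\]]+', " ", sentence)  # special characters
    sentence = re.sub(r'\b\w{1,2}\b', " ", sentence)                   # 1- and 2-character words
    sentence = re.sub(r' +', " ", sentence)                            # repeated spaces
    return sentence

# Hypothetical input in the style of the corpus.
sample = "Net shr 12 cts vs 3.5 dlrs (Q1)!"
cleaned = clean_text(sample)
print(repr(cleaned))  # 'net shr cts dlrs '
```

Short tokens such as "12", "vs" and "q1" are dropped along with the punctuation. A single leading or trailing space can survive, so calling `.strip()` on the result may be worthwhile.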


BERT, through the sparse categorical cross-entropy loss used for training, expects class labels to be integer indices starting from 0, so the labels 1 to 8 must be mapped to 0 to 7.

The label column must also be numeric (int or float). If it has object or string dtype, it has to be converted first.

If either condition is not met, training will not work as expected or may raise errors.
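One way to build such a mapping without typing it out by hand is sketched below. The sample labels are hypothetical, and deriving the mapping from the sorted distinct labels is an assumption of this sketch, not the approach the notebook itself takes.

```python
# Hypothetical raw labels, stored as strings like the "Label" column.
raw_labels = ['1', '2', '8', '3', '1']

# Sort the distinct labels numerically and give each a consecutive index from 0.
classes = sorted(set(raw_labels), key=int)
label_to_id = {label: idx for idx, label in enumerate(classes)}

encoded = [label_to_id[label] for label in raw_labels]
print(label_to_id)  # {'1': 0, '2': 1, '3': 2, '8': 3}
print(encoded)      # [0, 1, 3, 2, 0]
```

When the raw labels are exactly 1 to 8, this produces the same result as subtracting 1. When there are gaps, as in the sample, it still yields consecutive indices.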

In [None]:
df3["sentenceLength"] = df3.Sentence.apply(len)  # length in characters, not tokens
# Each sentence must be classified into one of 8 categories.
print(df3.Label.unique())
# Class labels must start from 0 instead of 1, otherwise training will not work as expected.
df3['label_encode'] = df3['Label'].map({'1':0,'2':1,'3':2,'4':3,'5':4,'6':5,'7':6,'8':7})
print(df3.label_encode.unique())
# The median sentence length is 337 characters, so half of the sentences are no longer than that.
print(df3.sentenceLength.describe())

['1' '2' '3' '4' '5' '6' '7' '8']
[0 1 2 3 4 5 6 7]
count    5427.000000
mean      546.466188
std       646.976398
min        23.000000
25%       147.000000
50%       337.000000
75%       619.000000
max      4772.000000
Name: sentenceLength, dtype: float64



In [None]:
df_bert = df3.copy()
df_bert.drop(columns=['sentenceLength'], inplace=True)
# The label column must be numeric: an object/str column has to be converted to int or float, otherwise training will not work.
df_bert['label_encode'] = df_bert.label_encode.astype(int)
#df_bert.reset_index(drop=True) #remove index col
df_bert.head()

Unnamed: 0,Sentence,Label,label_encode
0,champion products approves stock split champio...,1,0
1,computer terminal systems cpml completes sale ...,2,1
2,cobanco inc cbco year net shr cts dlrs net ass...,1,0
3,international inc qtr jan oper shr loss two c...,1,0
4,brown forman inc bfd qtr net shr one dlr cts n...,1,0


In [None]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [None]:
!pip install transformers
from transformers import BertTokenizer

# num_labels is a model setting, not a tokenizer setting, so it is only passed when the model is created.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/98/87/ef312eef26f5cecd8b17ae9654cdd8d1fae1eb6dbd87257d6d73c128a4d0/transformers-4.3.2-py3-none-any.whl (1.8MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/fd/5b/44baae602e0a30bcc53fbdbc60bd940c15e143d252d658dfdefce736ece5/tokenizers-0.10.1-cp36-cp36m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=e488




Because the labels are imbalanced, we split the data set in a stratified fashion into a training set and a combined validation/test ("valtest") set. The valtest set is then split further into a validation set and a test set.

In [None]:
# can be up to 512 for BERT
max_length = 200
batch_size = 6

X_train, X_valtest, y_train, y_valtest = train_test_split(df_bert.Sentence.values, 
                                                  df_bert.label_encode.values, 
                                                  test_size=0.2, 
                                                  random_state=42, 
                                                  stratify=df_bert.label_encode.values)

print('Type of X_train: ', type(X_train))
print(X_train[:3])
print(X_valtest[:3])
print(y_train[:3])
print(y_valtest[:3])

print('length of X_train: ', len(X_train))
print('length of X_valtest: ', len(X_valtest))
print('length of y_train: ', len(y_train))
print('length of y_valtest: ', len(y_valtest))

Type of X_train:  <class 'numpy.ndarray'>
['analysis and technology inc aati hikes payout annual div cts cts prior pay april record march reuter '
 'yankee cos ynk unit sell asssets yankee cos inc eskey inc esk subsidiary said reached agreement principle sell its eskey yale key inc subsidiary new concern formed key management and private investor for about mln dlrs part sale eskey said the buyers will assume the mln dlrs publicly held eskey pct debentures due said the debentures will continue converted into yankee preferred the remainder the price will one mln dlr note eskey yankee said the sale will result loss mln dlrs reuter '
 ' urges surplus nations boost growth leading industrial nations will reviewing the paris agreement stabilize exchange rates foster increased worldwide growth and reduce trade imbalances but the thinks the accord has been successful far senior treasury official said the paris accord will reviewed this meeting has been successful and continues succesfull senior

In [None]:
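# Split the valtest set into a validation set and a test set.
# Note: this second split is not stratified, so a rare class can end up with no rows in the test set.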
X_val, X_test, y_val, y_test = train_test_split(X_valtest, 
                                                  y_valtest, 
                                                  test_size=0.20, 
                                                  random_state=43)

print('length of X_val: ', len(X_val))
print('length of X_test: ', len(X_test))
print('length of y_val: ', len(y_val))
print('length of y_test: ', len(y_test))


length of X_val:  868
length of X_test:  218
length of y_val:  868
length of y_test:  218
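To show what the `stratify` argument is doing conceptually, the sketch below implements a simple per-class split in plain Python. The helper `stratified_split` and the toy label counts are assumptions made for this illustration; it is not how scikit-learn implements `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=42):
    """Return (train_idx, test_idx), sampling test_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 80 of class 0, 15 of class 1, 5 of class 2.
labels = [0] * 80 + [1] * 15 + [2] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Every class contributes the same fraction of its rows to the held-out set, so even the smallest class is represented. A purely random split gives no such guarantee for rare classes.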


BertTokenizer and Encoding the Data


In [None]:
def convert_example_to_feature(sentence):

  # One call handles tokenization, WordPiece ID mapping, adding special tokens, and truncating/padding to max_length.

  return tokenizer.encode_plus(sentence,
                add_special_tokens = True, # add [CLS], [SEP]
                max_length = max_length, # maximum number of tokens passed to BERT
                padding = 'max_length', # add [PAD] tokens up to max_length (replaces the deprecated pad_to_max_length=True)
                truncation = True, # explicitly truncate sentences longer than max_length
                return_attention_mask = True, # attention mask so that [PAD] tokens are ignored
              )

In [None]:
# map to the expected input to TFBertForSequenceClassification
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label=None):
  return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
  }, label

def encode_examples(s, l, isTestSet=0):

  # Collect the encoded fields in lists so the final TensorFlow dataset can be built from slices.
  input_ids_list = []
  token_type_ids_list = []
  attention_mask_list = []
  label_list = []

  for sentence, label in zip(s, l):
    bert_input = convert_example_to_feature(sentence)

    input_ids_list.append(bert_input['input_ids'])
    token_type_ids_list.append(bert_input['token_type_ids'])
    attention_mask_list.append(bert_input['attention_mask'])
    label_list.append([label])

  if isTestSet:
    # Prediction: leave the labels out, because target values must not be passed to the model.
    tensors = (input_ids_list, attention_mask_list, token_type_ids_list)
  else:
    # Training and validation: labels are required.
    tensors = (input_ids_list, attention_mask_list, token_type_ids_list, label_list)
  return tf.data.Dataset.from_tensor_slices(tensors).map(map_example_to_dict)



In [None]:
# train dataset (the shuffle buffer covers the whole training set, i.e. 4341 rows)
ds_train_encoded = encode_examples(X_train,y_train).shuffle(len(X_train)).batch(batch_size)

#Validation dataset
ds_val_encoded = encode_examples(X_val, y_val).batch(batch_size)

# test dataset
ds_test_encoded = encode_examples(X_test, y_test,1).batch(batch_size)
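As a quick worked calculation, the sketch below estimates how many batches each dataset yields with `batch_size = 6`. It assumes the training set has 4341 rows, which follows from the 5427 rows in the corpus minus the 1086 held out (868 validation plus 218 test).

```python
import math

batch_size = 6
# Split sizes derived from the outputs above (the training size is inferred).
split_sizes = {"train": 4341, "validation": 868, "test": 218}

batches = {}
for name, n_rows in split_sizes.items():
    # The last batch is smaller when n_rows is not a multiple of batch_size.
    batches[name] = math.ceil(n_rows / batch_size)
    last_batch_rows = n_rows - (batches[name] - 1) * batch_size
    print(f"{name}: {batches[name]} batches, last batch has {last_batch_rows} rows")
```

Because none of the sizes is a multiple of 6, the final batch of each dataset is smaller than the rest, so any code that inspects a batch should not assume a fixed first dimension.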



Below you can see the three BERT inputs for one batch: the token IDs (input_ids), the segment IDs (token_type_ids), and the attention mask. Each has shape (6, 200), where 6 is the batch size and 200 is the maximum length we specified.

If a sentence has fewer than 200 tokens it is padded with 0s. If it has more, it is truncated.
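The padding and truncation rule can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the tokenizer's actual implementation. The helper name `encode_ids` and the sample token IDs are hypothetical, while 101, 102 and 0 are the [CLS], [SEP] and [PAD] IDs of bert-base-uncased.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special token IDs of bert-base-uncased

def encode_ids(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, pad with PAD_ID, build the mask."""
    body = token_ids[:max_length - 2]           # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(ids)             # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad  # 0 = padding

short_ids, short_mask = encode_ids([2900, 23439], max_length=6)
long_ids, long_mask = encode_ids([11, 12, 13, 14, 15, 16, 17], max_length=6)
print(short_ids, short_mask)  # [101, 2900, 23439, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 11, 12, 13, 14, 102] [1, 1, 1, 1, 1, 1]
```

This matches the pattern in the batch printed below: padded rows end in 0s with a mask that ends in 0s, while a truncated row ends in 102 with a mask of all 1s.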

In [None]:
print(type(ds_test_encoded))
for sentence, label in ds_train_encoded.take(1):
    print('sentence', sentence, label)

<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
sentence {'input_ids': <tf.Tensor: shape=(6, 200), dtype=int32, numpy=
array([[  101,  2900, 23439, ...,     0,     0,     0],
       [  101, 22714, 17180, ...,     0,     0,     0],
       [  101,  9587,  3406, ...,     0,     0,     0],
       [  101, 23876, 15726, ...,  4495,  6848,   102],
       [  101, 23060, 24163, ...,     0,     0,     0],
       [  101,  8174,  2015, ...,     0,     0,     0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(6, 200), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(6, 200), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1

**BERT Model Initialization**

We will use the pretrained TensorFlow BERT models that ship with the transformers library. Import the model class and call from_pretrained to load the pretrained weights.

In [None]:
from transformers import TFBertForSequenceClassification
import tensorflow as tf

# recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 2e-5

# train for just 1 epoch for illustration; more epochs may help as long as the model does not overfit
number_of_epochs = 1

# model initialization. For multi-class classification we must specify number of categories to classify the sentences/documents, 8 in our case.
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=8)

# choosing Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)

# sparse categorical cross-entropy works with integer labels; from_logits=True because the model outputs raw logits
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
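To clarify what a sparse categorical cross-entropy loss with `from_logits=True` computes for a single example, here is a minimal sketch in plain Python. The function name `sparse_ce_from_logits` and the logit values are assumptions for this illustration. Keras applies the same formula per example and then averages over the batch.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]

# With 8 equal logits every class has probability 1/8, so the loss is log(8).
uniform_loss = sparse_ce_from_logits([0.0] * 8, label=3)
print(uniform_loss)

# Hypothetical logits that strongly favour class 1.
confident = [0.10, 5.66, -1.04, -0.42, -1.07, -1.15, -0.58, -0.95]
loss_correct = sparse_ce_from_logits(confident, label=1)
loss_wrong = sparse_ce_from_logits(confident, label=0)
print(loss_correct, loss_wrong)
```

The label is used directly as an index into the logits, which is why it has to be an integer between 0 and `num_labels - 1`.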


**Train BERT Model**

In [None]:
model.fit(ds_train_encoded, epochs=number_of_epochs, validation_data=ds_val_encoded)



<tensorflow.python.keras.callbacks.History at 0x7fc5495f8240>

**Classify/Predict unseen test sentences**


In [None]:
test_pred = model.predict(ds_test_encoded, verbose=True)



In [None]:
test_pred

TFSequenceClassifierOutput([('logits',
                             array([[ 0.10074364,  5.658406  , -1.0375811 , ..., -1.1489806 ,
                                     -0.5750521 , -0.95126194],
                                    [ 6.435535  , -0.03667761, -1.3546342 , ..., -1.2319204 ,
                                     -0.7304846 , -1.1815776 ],
                                    [ 6.4013224 , -0.0300498 , -1.380059  , ..., -1.242933  ,
                                     -0.7095235 , -1.166215  ],
                                    ...,
                                    [ 6.3813424 ,  0.06080941, -1.3787117 , ..., -1.2683374 ,
                                     -0.6816839 , -1.1932096 ],
                                    [-1.2148149 , -1.3147076 , -0.33345002, ..., -0.8928549 ,
                                      0.80356324,  3.9285421 ],
                                    [ 6.319223  ,  0.11768577, -1.4491475 , ..., -1.4584801 ,
                                    

In [None]:
op = pd.DataFrame(test_pred[0])
print(op.shape)
op.head()

(218, 8)


Unnamed: 0,0,1,2,3,4,5,6,7
0,0.100744,5.658406,-1.037581,-0.421502,-1.073307,-1.148981,-0.575052,-0.951262
1,6.435535,-0.036678,-1.354634,-1.188579,-0.985788,-1.23192,-0.730485,-1.181578
2,6.401322,-0.03005,-1.380059,-1.206783,-1.003658,-1.242933,-0.709523,-1.166215
3,6.377688,0.04281,-1.38533,-1.274339,-1.01731,-1.256177,-0.688236,-1.156304
4,6.398369,0.099404,-1.407508,-1.250317,-1.019872,-1.237434,-0.670231,-1.159171


Extract Predicted Label for each sentence in the test data set.

In [None]:
#Get predicted class for each sentence in the test set
predicted_label = op.idxmax(axis=1)
predicted_label

0      1
1      0
2      0
3      0
4      0
      ..
213    0
214    1
215    0
216    7
217    0
Length: 218, dtype: int64
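The logits are unnormalized scores, so taking the index of the largest value is enough to get the predicted class. The sketch below also converts the first row of logits shown above into probabilities with a softmax and maps the class index back to the original label. The `softmax` helper and the `id_to_label` dictionary are written for this example and simply invert the 1 to 8 encoding used earlier.

```python
import math

# First row of logits from the output above (class indices 0..7).
logits = [0.100744, 5.658406, -1.037581, -0.421502,
          -1.073307, -1.148981, -0.575052, -0.951262]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    max_value = max(values)
    shifted = [math.exp(v - max_value) for v in values]
    total = sum(shifted)
    return [s / total for s in shifted]

probs = softmax(logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)

# Undo the earlier encoding ('1'->0, ..., '8'->7) to recover the original label.
id_to_label = {idx: str(idx + 1) for idx in range(8)}
print(predicted_id, id_to_label[predicted_id])  # 1 2
```

Softmax does not change which class wins, so it is only needed when a confidence value is wanted alongside the prediction.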

Create a DataFrame that shows the sentence, actual label, and predicted label for the test data set.

In [None]:
predicted = pd.DataFrame(predicted_label)
actual = pd.DataFrame(y_test)
sentence = pd.DataFrame(X_test)
testset_results = pd.concat([sentence, actual, predicted],axis=1)
testset_results.columns = ['sentence', 'actual', 'predicted']
testset_results.tail()

Unnamed: 0,sentence,actual,predicted
213,quest medical inc qmed qtr loss shr loss six c...,0,0
214,usair has comment twa twa offer usair group in...,1,1
215,peps boys manny moe and jack pby set payout qt...,0,0
216,money market given mln stg late assistance th...,7,7
217,primebank pmbk sets pct stock dividend primeba...,0,0


**Performance Metrics**

In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix, classification_report

ascore = accuracy_score(testset_results.actual, testset_results.predicted)
print("Bert Accuracy Score: ", ascore)
print("Confusion Matrix of BERT Classifier output: ")
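# A bare expression in the middle of a cell is not displayed, so the matrix has to be printed explicitly.
print(confusion_matrix(testset_results.actual, testset_results.predicted))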
print("Classification Metrics: ")
print(classification_report(testset_results.actual, testset_results.predicted))

Bert Accuracy Score:  0.9724770642201835
Confusion Matrix of BERT Classifier output: 
Classification Metrics: 
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       116
           1       0.97      0.95      0.96        59
           2       1.00      1.00      1.00        15
           3       1.00      0.75      0.86         4
           5       1.00      0.71      0.83         7
           6       1.00      1.00      1.00         7
           7       0.91      1.00      0.95        10

    accuracy                           0.97       218
   macro avg       0.98      0.92      0.94       218
weighted avg       0.97      0.97      0.97       218
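For readers who want to see how the numbers in such a report are derived, the sketch below builds a confusion matrix and per-class recall by hand. The labels are hypothetical and the helper `confusion_matrix_counts` is written for this example. It follows the same convention as scikit-learn, with rows for actual classes and columns for predicted classes.

```python
def confusion_matrix_counts(actual, predicted, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Hypothetical labels for a three-class problem.
actual    = [0, 0, 0, 1, 1, 2, 2, 2]
predicted = [0, 0, 1, 1, 1, 2, 0, 2]

matrix = confusion_matrix_counts(actual, predicted, n_classes=3)
for class_id, row in enumerate(matrix):
    support = sum(row)  # number of actual rows of this class
    recall = row[class_id] / support if support else float("nan")
    print(class_id, row, f"recall={recall:.2f}")

accuracy = sum(matrix[i][i] for i in range(3)) / len(actual)
print("accuracy:", accuracy)  # 0.75
```

The support column in the report above is the row sum of this matrix. A class with no rows in the test set has no support, which would explain why the report lists only seven of the eight classes.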

