# AGENDA

-- Implement BERT as a tokenizer and quick embedding algorithm for Named Entity Recognition

-- Build a baseline model for sentiment analysis

-- Implement BERT as an embedding layer in a model for sentiment analysis

-- Compare the results of BERT as a word embedder with BERT as an embedding layer in a model

# INTRODUCTION

BERT (Bidirectional Encoder Representations from Transformers) is a Natural Language Processing model created by Google's AI team. It is best suited to tasks that only require encoding text into vectors, since it does not come with a decoder module. BERT is built by stacking Transformer encoder blocks. It comes pretrained for several days on TPUs, so in this notebook we will simply harness transfer learning to carry out simple tasks such as Named Entity Recognition and Sentiment Analysis on the Entity Annotated Corpus and Stock Market Sentiment datasets.

Named Entity Recognition - the process of finding named entities (mostly proper nouns) in a dataset. For this purpose we will use the entity-annotated-corpus to train an NER classifier, using a Random Forest or Gradient Boosted Trees classifier together with BERT's inbuilt tokenizer. Then we will try to implement BERT as an embedding layer and compare the results.

Sentiment Classification - the process of estimating whether a sentence or paragraph is positive or negative overall and, in this case, classifying it as having a positive or negative impact on the company's stock price. For this task we will implement a baseline XGBoost model using the BERT tokenizer, then a custom Convolutional Neural Network architecture using BERT as an embedding layer, and compare the performance of the two.



We will be using the bert-for-tf2 module and the pretrained weights of the BERT variant bert_en_uncased_L-12_H-768_A-12 from TensorFlow Hub, an open-source collection of pretrained models for TensorFlow.

# 1. NAMED ENTITY RECOGNITION

Here BERT is used as a tokenizer and quick word embedding algorithm. To make the classifier we will use the Machine Learning algorithms Random Forest and Support Vector Machine from scikit-learn, and the gradient boosted trees library XGBoost.

# 1.1 Importing Dependencies

In [None]:
# downloading BERT module 
!pip install bert-for-tf2
!pip install sentencepiece

In [None]:
# Visualization libraries

import seaborn as sns
import matplotlib as plt

%matplotlib inline
plt.style.use('ggplot')

# Analysis libraries

import os
import re
import random
import numpy as np 
import pandas as pd 

import nltk
import string
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# ML Modelling Libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold
from sklearn.metrics import f1_score, precision_score,recall_score,roc_auc_score
from sklearn.metrics import accuracy_score, plot_precision_recall_curve

import xgboost as xgb
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier


# BERT and Deep Learning Libs
import bert
import tensorflow_hub as hub
import tensorflow as tf
from keras import layers
from keras.callbacks import ModelCheckpoint,EarlyStopping
from keras.preprocessing.sequence import pad_sequences
import tensorflow_datasets as tfds

import warnings
warnings.filterwarnings("ignore")

# 1.2 - Exploratory Data Analysis and Visualizations

In [None]:
ner = pd.read_csv('/kaggle/input/entity-annotated-corpus/ner.csv', 
                  encoding= 'ISO-8859-1',
                  error_bad_lines=False)
ner_dataset = pd.read_csv('/kaggle/input/entity-annotated-corpus/ner_dataset.csv',
                  encoding='ISO-8859-1',
                  error_bad_lines=False)
stock_data = pd.read_csv('/kaggle/input/stockmarket-sentiment-dataset/stock_data.csv',
                  encoding='ISO-8859-1',
                  error_bad_lines=False)

In [None]:
ner.columns

In [None]:
ner = ner[['prev-word','prev-pos','word','pos','next-word','next-pos','tag']]
ner.head(10)

In [None]:
plt.pyplot.figure(figsize=(30,15))
plt.pyplot.title('Parts of Speech vs Frequency')
sns.set(font_scale=1.5)
sns.countplot(x='pos', data = ner, palette = 'magma');

In [None]:
ner.pos.unique()

In [None]:
# Top 50 most common proper nouns (POS tag NNP)
data = ner.query('pos == "NNP"').word.value_counts().reset_index().head(50)

# proper nouns plot

plt.pyplot.figure(figsize=(75,35))
plt.pyplot.title('Proper Nouns vs Frequency')
sns.set(font_scale=1.6)
sns.barplot(x='index',y='word', data =data, palette = 'magma');

del data

In [None]:
ner.fillna("None",inplace=True)
ner.tag.unique()

In [None]:
# Filtering out organisations
df = ner[ner.tag.map(lambda x : x[-3:] == "org")]
plt.pyplot.title("Number and types of organisations")
df.tag.value_counts().plot.bar();

# 1.3 - Data Pre-processing and Splitting

In [None]:
# no cleaning function is required to be passed since we are training model on entire corpus

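# Encoding the target tags
lb_tag = LabelEncoder()
ner.tag = lb_tag.fit_transform(ner.tag)

# Fitting one POS encoder on all three POS columns so that a given tag
# gets the same integer id in 'pos', 'prev-pos' and 'next-pos'
lb_pos = LabelEncoder().fit(pd.concat([ner['pos'], ner['prev-pos'], ner['next-pos']]))
ner['pos'] = lb_pos.transform(ner['pos'])
ner['prev-pos'] = lb_pos.transform(ner['prev-pos'])
ner['next-pos'] = lb_pos.transform(ner['next-pos'])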

Let's import the pretrained BERT layer from TensorFlow Hub to use its tokenizer.

In [None]:
# import BERT layer from Tensorflow hub URL
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()

# Using BERT's inbuilt tokenizer

tokenizer = bert.bert_tokenization.FullTokenizer(vocab_file, do_lower_case)

In [None]:
def vectorize(series):
    # Mapping each word to an integer token id with the BERT tokenizer
    # (these are vocabulary ids, not dense embedding vectors)
    series = series.fillna("None")
    series = series.apply(lambda word : tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word)))
    # Tokenizer returns a list of word-piece ids; keeping only the first one (0 if the list is empty)
    return series.map(lambda x: 0 if len(x) == 0 else x[0])

ner.word = vectorize(ner.word)
ner['prev-word'] = vectorize(ner['prev-word'])
ner['next-word'] = vectorize(ner['next-word'])
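
To make the effect of keeping only the first id concrete, here is a small sketch of greedy word-piece splitting. It is an illustration only: `TOY_VOCAB`, `wordpiece` and `first_piece_id` are hypothetical names, the vocabulary is invented, and the real BERT tokenizer does additional text normalisation before this step.

```python
# Toy vocabulary (hypothetical); the real BERT vocabulary is far larger.
TOY_VOCAB = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3, "london": 4, "##er": 5}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lower-cased word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def first_piece_id(word, vocab):
    """Same idea as the vectorize step: keep only the id of the first piece."""
    ids = [vocab[p] for p in wordpiece(word.lower(), vocab)]
    return ids[0] if ids else 0


print(wordpiece("playing", TOY_VOCAB))       # ['play', '##ing']
print(first_piece_id("playing", TOY_VOCAB))  # 1
print(first_piece_id("played", TOY_VOCAB))   # 1, the same id as "playing"
print(first_piece_id("Paris", TOY_VOCAB))    # 0, the [UNK] id in this toy vocabulary
```

Under these assumptions, words that share a first piece collapse to the same feature value, which is a trade-off of using a single integer per word.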

In [None]:
num_classes = ner.tag.nunique()
num_classes

In [None]:
X = ner.drop(columns='tag').values
y = ner['tag'].values

print(X.shape)
y = y.reshape(y.shape[0])
print(y.shape)

In [None]:
# Splitting into Train/Test sets

X_train, X_test,y_train, y_test = train_test_split(X,y,test_size = 0.20)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

# 1.4 - Model using XGBoost

Model Making and Training

In [None]:
# Candidate classifiers -- XGBoost, SVM, RandomForest

#clf = RandomForestClassifier() # 96.4, high precision but very low recall
#clf = SVC(C=1.6)
# XGBoost = 93.5, but with better balanced precision and recall

params_grid = {
    'validate_parameters':True,
    'lambda':1.6,
    'num_class': num_classes,
    'objective': 'multi:softmax',
    'eval_metric': 'merror'}

In [None]:
# Training over 4 stratified folds
# (random_state only takes effect, and is only accepted, together with shuffle=True)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
x=1
for train_idx, valid_idx in skf.split(X_train,y_train):
    
    Xtrain, Xvalid = X_train[train_idx], X_train[valid_idx]
    ytrain, yvalid = y_train[train_idx], y_train[valid_idx]
    
   # clf.fit(Xtrain,ytrain)
    
    dtrain = xgb.DMatrix(Xtrain,label=ytrain)
    dtest = xgb.DMatrix(Xvalid, label=yvalid)
    
    evallist = [(dtest, 'eval'), (dtrain, 'train')]
    
    bst = xgb.train(params_grid,dtrain,evals = evallist)
    
    print("fold: ",x)
    x=x+1
    print('f1_score: ', f1_score(yvalid,bst.predict(xgb.DMatrix(Xvalid)),average="micro"))
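
As a rough illustration of what "stratified" means in the loop above, the sketch below deals the indices of each class round-robin into folds so every fold keeps roughly the same class mix. This is a simplified stand-in, not scikit-learn's actual algorithm; `stratified_folds` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits):
    """Assign sample indices to folds so each class is spread evenly across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)  # deal indices out like cards
    return folds


labels = ["O"] * 8 + ["B-org"] * 4   # toy, imbalanced tag column
folds = stratified_folds(labels, n_splits=4)
for fold in folds:
    # every fold ends up with 2 "O" samples and 1 "B-org" sample
    print(Counter(labels[i] for i in fold))
```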

# 1.5 - Prediction and Evaluation

In [None]:
d_test = xgb.DMatrix(X_test)

preds=bst.predict(d_test).astype('int32')

print("f1 score : ", f1_score(y_test,preds, average = "micro"))
print("precision score : ", precision_score(y_test,preds, average = "macro"))
print("recall score : ", recall_score(y_test,preds, average = "macro"))
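
The cell above mixes micro and macro averaging, which can give very different numbers on an imbalanced tag set. The following sketch computes both kinds of precision by hand on toy labels (the data and the helper `micro_macro_precision` are hypothetical). For single-label multiclass problems, micro-averaged precision, recall and F1 all equal plain accuracy.

```python
from collections import Counter


def micro_macro_precision(y_true, y_pred):
    """Return (micro, macro) precision for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1
    # micro: pool all decisions together before dividing
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    # macro: compute precision per class (0 when a class is never predicted), then average
    per_class = [tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0 for c in labels]
    macro = sum(per_class) / len(per_class)
    return micro, macro


# A classifier that always predicts the majority class 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
micro, macro = micro_macro_precision(y_true, y_pred)
print(round(micro, 3))  # 0.8
print(round(macro, 3))  # 0.267
```

The micro score looks healthy because the majority class dominates, while the macro score exposes the two classes that are never predicted.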

In [None]:
#fig = plot_precision_recall_curve(clf,X_test,y_test)
#plt.pyplot.title("Precision vs Recall")
xgb.plot_importance(bst);

In [None]:
preds=lb_tag.inverse_transform(preds)
preds

# 2. SENTIMENT ANALYSIS

Here BERT is used as an embedding layer, followed by a custom convolutional neural network trained for classification tasks such as sentiment analysis. As an embedding layer, BERT expects its input in a specific format. Each sentence must start with a classification token [CLS] and end with a separator token [SEP], and all sentences in a batch must be padded to the same length. We create 3 types of inputs for each sentence:

(1) Token ids of the tokens in the sentence
(2) A mask marking which positions hold real tokens and which hold padding
(3) Segment ids marking which sentence each token belongs to (the id flips after every [SEP] token)

For this purpose we will create 3 functions

NOTE: we have not used the PAD token anywhere
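
A minimal sketch of the three inputs for a single padded sentence is shown below. The vocabulary ids are invented for illustration and `build_bert_inputs` is a hypothetical helper, not part of the bert-for-tf2 module.

```python
def build_bert_inputs(tokens, vocab, max_len):
    """Return (token ids, input mask, segment ids) for one sentence, padded to max_len."""
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    mask = [1] * len(ids)       # 1 marks a real token
    segments = [0] * len(ids)   # a single sentence sits entirely in segment 0
    pad = max_len - len(ids)
    if pad < 0:
        raise ValueError("sentence is longer than max_len")
    return (ids + [vocab["[PAD]"]] * pad,
            mask + [0] * pad,       # 0 marks padding
            segments + [0] * pad)


# Toy vocabulary with made-up ids
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "stock": 4, "rises": 5}
ids, mask, segments = build_bert_inputs(["stock", "rises"], toy_vocab, max_len=6)
print(ids)       # [2, 4, 5, 3, 0, 0]
print(mask)      # [1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0]
```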

# 2.1 - EDA and Pre-Processing

In [None]:
# changing labels of stock data to 0 and 1
stock_data['Sentiment'] = stock_data['Sentiment'].apply(lambda x: 0 if x == -1 else 1)
stock_data.head(10)

In [None]:
stock_data.Sentiment.value_counts().plot.bar();
plt.pyplot.title("Sentiment counts");

In [None]:
# creating new column containing sentence lengths (in characters) to plot
stock_data['Sentence_length'] = stock_data.Text.str.len()

plt.pyplot.figure(figsize =(25,10))
sns.lineplot(data=stock_data['Sentence_length'],color ='r');
plt.pyplot.title("Distribution of sentence lengths");
plt.pyplot.savefig('lineplot.png')

In [None]:
plt.pyplot.figure(figsize = (20,10));
plt.pyplot.title("Distribution of sentence lengths");
sns.distplot(stock_data.Sentence_length,kde=True,color ='r',bins=70);

Preprocessing functions

In [None]:
# clean text - removing stop-words, URLs, punctuation and special chars

stop_words = set(stopwords.words('english')) # from nltk module
punctuation = set(string.punctuation)

def preprocess(sentence):
    
    result = []
    
    # Stripping any HTML markup
    s = BeautifulSoup(sentence, "lxml").get_text()
    
    # Removing the URL links
    s = re.sub(r"https?://[A-Za-z0-9./]+", ' ', s)
    
    # Keeping only letters and basic punctuation (. ! ? ')
    s = re.sub(r"[^a-zA-Z.!?']", ' ', s)
    
    # Removing additional whitespaces
    s = re.sub(r" +", ' ', s)
    
    # Splitting into word pieces with the BERT tokenizer
    token_list = tokenizer.tokenize(s)
    
    # Dropping punctuation tokens and stop-words
    for token in token_list:
        if token not in punctuation and token not in stop_words:
            result.append(token)
    
    return result

# Adding Classification and Separator token for each sentence -- BERT input format

def add_std_tokens(token_list):
    return ["[CLS]"] + token_list + ["[SEP]"]
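
To see what the three regular expressions do on their own, here is a standalone sketch that applies them to a made-up message. `strip_noise` is a hypothetical helper; it leaves out the BeautifulSoup, tokenizer and stop-word steps and adds a final `strip()` for readability.

```python
import re


def strip_noise(text):
    """Apply the same three substitutions as preprocess, in the same order."""
    text = re.sub(r"https?://[A-Za-z0-9./]+", " ", text)  # drop URLs
    text = re.sub(r"[^a-zA-Z.!?']", " ", text)            # keep letters and . ! ? '
    text = re.sub(r" +", " ", text)                       # collapse runs of spaces
    return text.strip()


sample = "AAPL up 3% today!! see https://example.com/news"
print(strip_noise(sample))                 # AAPL up today!! see

# The URL pattern stops at the first character outside [A-Za-z0-9./],
# so URLs containing hyphens or query strings leave a fragment behind.
print(strip_noise("x https://t.co/a-b"))   # x b
```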

In [None]:
# FUNCTION 1: TO GET TOKEN IDS FROM A LIST OF TOKENS

def get_ids(tokens):
    return tokenizer.convert_tokens_to_ids(tokens)


# FUNCTION 2: TO GET THE INPUT MASK (1 FOR REAL TOKENS, 0 FOR [PAD] PADDING)
# NOTE: In this case it is not important but we will use it to maintain general norm

def get_masks(tokens):
    return np.char.not_equal(tokens, '[PAD]').astype(int)


# FUNCTION 3 : TO GET SEGMENT IDS OF TOKENS

def get_segs(tokens):
    curr_seg_id=0
    seg_ids =[]
    for tok in tokens:
        seg_ids.append(curr_seg_id)
        if tok=="[SEP]":
            # a [SEP] token closes the current sentence, so the segment id
            # flips (0 becomes 1 and 1 becomes 0) for the tokens that follow
            curr_seg_id = 1- curr_seg_id
            
    return seg_ids
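
The segment logic only becomes visible with more than one sentence. The sketch below restates the same loop under a different name (`segment_ids`, a standalone copy for illustration) and runs it on a hand-written sentence pair.

```python
def segment_ids(tokens):
    """Standalone copy of the segment-id loop: flip the id after every [SEP]."""
    current, out = 0, []
    for tok in tokens:
        out.append(current)      # the [SEP] itself still belongs to the sentence it closes
        if tok == "[SEP]":
            current = 1 - current
    return out


single = ["[CLS]", "profits", "fell", "[SEP]"]
pair = ["[CLS]", "profits", "fell", "[SEP]", "shares", "dropped", "[SEP]"]
print(segment_ids(single))  # [0, 0, 0, 0]
print(segment_ids(pair))    # [0, 0, 0, 0, 1, 1, 1]
```

Since every sentence in this notebook is a single segment, the segment ids here are all zeros.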

Creating a dataset in the appropriate format for the BERT layer.
Applying the three functions to the shuffled and length-sorted data gives items in the format : - 

( [token ids, mask, segment ids] , label )

NOTE : keras.preprocessing.sequence.pad_sequences cannot be used here because it doesn't support string and int inputs together.

In [None]:
# Applying text cleaning and merging labels and lengths for sorting

labels = stock_data.Sentiment.values
cleaned_data = [add_std_tokens(preprocess(sent)) for sent in stock_data.Text]


data_with_len = [[sent, labels[i], len(sent)]
                 for i, sent in enumerate(cleaned_data)]

# Shuffle and Sort the dataset

random.shuffle(data_with_len)

data_with_len.sort(key=lambda x: x[2])

# Applying the 3 functions to get input in appropriate format

compiled_data = [([ get_ids(sent_idx[0]), list(get_masks(sent_idx[0])), get_segs(sent_idx[0])],
                    sent_idx[1]) for sent_idx in data_with_len]


Using tf.data.Dataset module to make a padded Dataset for BERT Layer

In [None]:
batch_size = 32
num_batches = len(compiled_data) // batch_size #180
num_test_batches = num_batches // 15           #12

In [None]:
# making tf.data.Dataset generator objects
dataset_gen = tf.data.Dataset.from_generator(lambda : compiled_data, 
                                             output_types=(tf.int32,tf.int32))

# using the padded batch function to make a batch generator
batch_gen = dataset_gen.padded_batch(batch_size, 
                                     padded_shapes=((3,None),()), 
                                     padding_values = (0,0))

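# shuffle returns a new dataset, so the result has to be assigned;
# the order is kept fixed across iterations so that take/skip below stay disjoint
batch_gen = batch_gen.shuffle(num_batches, reshuffle_each_iteration=False)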

# getting batched tensor datasets from generator
train_data = batch_gen.skip(num_test_batches)
test_data = batch_gen.take(num_test_batches)
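
Sorting by length before padded batching matters because each batch is padded to the length of its longest member. The pure-Python sketch below mimics that behaviour on toy sequences; `padded_batches` and `padding_cells` are hypothetical helpers, not tf.data functions.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its own longest sequence."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        width = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_value] * (width - len(seq)) for seq in chunk])
    return batches


def padding_cells(sequences, batch_size):
    """Count how many padded positions the batching introduces."""
    batches = padded_batches(sequences, batch_size)
    total_cells = sum(len(row) for batch in batches for row in batch)
    return total_cells - sum(len(seq) for seq in sequences)


lengths = [2, 9, 3, 8, 2, 9, 3, 8]
sequences = [[1] * n for n in lengths]
print(padding_cells(sequences, batch_size=4))                   # 28
print(padding_cells(sorted(sequences, key=len), batch_size=4))  # 4
```

With similar lengths grouped together, far fewer padded positions are fed to the model.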

# 2.2 - Baseline Model - XGBClassifier

Accuracy of 63 % was achieved

In [None]:
y = stock_data.Sentiment.values
X = [get_ids(preprocess(sent)) for sent in stock_data.Text]
X = pad_sequences(X)

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.1,shuffle=True)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# parameters for XGBClassifier
param_dist = {'objective':'binary:logistic', 'n_estimators':2}

In [None]:
x=1

clf = xgb.XGBClassifier(**param_dist)

kf = StratifiedKFold(n_splits = 50, shuffle= True, random_state=42)

for train_idx, valid_idx in kf.split(X_train,y_train):
    
    Xtrain, Xvalid = X_train[train_idx], X_train[valid_idx]
    ytrain, yvalid = y_train[train_idx], y_train[valid_idx]
    
    clf.fit(Xtrain, ytrain,
        eval_set=[(Xtrain, ytrain), (Xvalid, yvalid)],
        eval_metric='logloss',
        verbose = False)

    evals_result = clf.evals_result()
    
    print("fold: ",x)
    x=x+1
    print(evals_result)

Prediction and Evaluation

In [None]:
preds = clf.predict(X_test)

print("f1_score : ", f1_score(y_test,preds,average="micro"))
print("precision: ", precision_score(y_test,preds, average="macro"))
print("recall: ", recall_score(y_test,preds, average="macro"))

# 2.3 - CNN with BERT Embedding Layer

The model is built with the Keras subclassing API. The idea behind the architecture is to use the pretrained BERT layer as an embedding layer. The embedded inputs are passed to three one-dimensional convolutional layers (bigram, trigram and fourgram, with kernel_size = 2, 3 and 4 respectively). The three outputs are pooled and concatenated, and dense layers are applied to obtain the output.
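
The sketch below shows, in plain Python, how the three kernel sizes produce feature maps of different lengths and how global max pooling reduces each to one value before concatenation. It is heavily simplified: one embedding dimension and one filter per branch, with made-up weights, whereas the real layers work on 768-dimensional BERT outputs with `num_filters` filters each.

```python
def conv1d_valid(seq, kernel):
    """Slide a kernel over a 1-D sequence without padding (like padding='valid')."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]


def relu(values):
    return [max(0.0, v) for v in values]


seq = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]  # one embedding dimension for six tokens
kernels = {2: [1.0, 1.0],               # hypothetical bigram filter
           3: [1.0, 1.0, 1.0],          # hypothetical trigram filter
           4: [1.0, 1.0, 1.0, 1.0]}     # hypothetical fourgram filter

features = []
for k, kernel in kernels.items():
    fmap = relu(conv1d_valid(seq, kernel))
    print(k, len(fmap))          # output length is len(seq) - k + 1: 5, 4, 3
    features.append(max(fmap))   # global max pooling keeps the strongest window

print(features)                  # [2.0, 3.5, 3.0]
```

With `num_filters` filters per branch, the concatenated vector has 3 * num_filters entries regardless of sentence length.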

In [None]:
class DCNN(tf.keras.Model):
    
    # making a constructor for default params
    def __init__(self,
                 FC_units =512,
                 num_filters=32,
                 num_classes=2,
                 droupout = 0.2,
                 name = "DCNN"):
        
        # calling superclass constructor
        super(DCNN,self).__init__(name=name)
        
        # adding layers to DCNN model object
        
        self.bert_layer = hub.KerasLayer(
                            "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
        
        self.bigram_layer = layers.Conv1D(
                                filters=num_filters,
                                kernel_size=2,
                                padding='valid',
                                activation ='relu')
        
        self.trigram_layer = layers.Conv1D(
                                filters=num_filters,
                                kernel_size=3,
                                padding='valid',
                                activation ='relu')
        
        self.fourgram_layer = layers.Conv1D(
                                filters=num_filters,
                                kernel_size=4,
                                padding='valid',
                                activation ='relu')
        
        self.batchnorm = layers.BatchNormalization()
        self.layernorm = layers.LayerNormalization()
        self.pool_layer = layers.GlobalMaxPool1D()
        self.dense_layer = layers.Dense(FC_units,activation='relu')
        # using the constructor argument (spelled "droupout" in the signature)
        self.dropout_layer = layers.Dropout(rate=droupout)
        
        if num_classes == 2:
            self.output_layer = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.output_layer = layers.Dense(units=num_classes,
                                           activation="softmax")
            
    # Pass token ids, mask and segment ids through BERT; embs is the per-token sequence output
    def embed_with_bert(self, all_tokens):
            
        _, embs = self.bert_layer([all_tokens[:, 0, :],
                                   all_tokens[:, 1, :],
                                   all_tokens[:, 2, :]])
        return embs

        
    # Implement the Architecture in the call function
    def call(self, inputs, training):
            
        x = self.embed_with_bert(inputs)
            
        bigram = self.bigram_layer(x)
        bigram = self.layernorm(bigram)
        bigram = self.batchnorm(bigram)
        bigram = self.pool_layer(bigram)
            
        trigram = self.trigram_layer(x)
        trigram = self.layernorm(trigram)
        trigram = self.batchnorm(trigram)
        trigram = self.pool_layer(trigram)
        
        fourgram = self.fourgram_layer(x)
        fourgram = self.layernorm(fourgram)
        fourgram = self.batchnorm(fourgram)
        fourgram = self.pool_layer(fourgram)
        
        merged = tf.concat([bigram, trigram, fourgram],axis=-1) 
        # (batch_size, 3 * num_filters)
        merged = self.dense_layer(merged)
        merged = self.dropout_layer(merged)
        output = self.output_layer(merged)
            
        return output
        

Implementing model architecture

In [None]:
from keras.optimizers import Adam
from keras.metrics import BinaryAccuracy
opt = Adam(learning_rate =0.001)

In [None]:
# Callback to prevent overfit

early_stopping_callback =  EarlyStopping(monitor = 'val_accuracy',
                                         min_delta = 0.01,
                                         patience = 6,
                                         restore_best_weights=True)

FC_units = 64
num_filters = 4
num_classes = 2
dropout_rate = 0.2
batch_size = 32
num_epochs = 16

# Making model
model = DCNN(FC_units = FC_units,
             num_filters=num_filters,
             num_classes=num_classes,
             droupout = dropout_rate)

# Compiling
model.compile(loss="binary_crossentropy",
              optimizer=opt,
              metrics=["accuracy"])
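
The interaction of `min_delta` and `patience` can be sketched in a few lines. This is a simplified model of the stopping rule for a metric that should increase, not the Keras source; `early_stop_epoch` and the validation scores are hypothetical.

```python
def early_stop_epoch(val_scores, min_delta=0.01, patience=6):
    """Return (stop_epoch, best_epoch), 0-indexed; stop_epoch is None if no stop occurs."""
    best, best_epoch, wait = float("-inf"), None, 0
    for epoch, score in enumerate(val_scores):
        if score - min_delta > best:   # only gains larger than min_delta count
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


scores = [0.60, 0.70, 0.705, 0.70, 0.69, 0.70, 0.705, 0.70, 0.71, 0.72]
print(early_stop_epoch(scores))           # (7, 1)
print(early_stop_epoch([0.5, 0.6, 0.7]))  # (None, 2)
```

In the first run the small gains after epoch 1 never exceed `min_delta`, so training would stop at epoch 7 and, with `restore_best_weights=True`, fall back to the weights from epoch 1.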

Training model 

In [None]:
num_train_batches = num_batches - num_test_batches
num_valid_batches = num_train_batches // 5


# validation batches come from the training pool, not from the held-out test set
x_valid=train_data.take(num_valid_batches) 
x_train=train_data.skip(num_valid_batches) 
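
One way to keep the test, validation and training subsets disjoint is to carve the validation batches out of the training pool. The sketch below uses plain lists as stand-ins for the batched dataset; `take` and `skip` are hypothetical list helpers that mirror the dataset methods of the same name, and 180 and 12 are the batch counts noted in the comments earlier in the notebook.

```python
def take(items, n):
    """First n items, like Dataset.take."""
    return items[:n]


def skip(items, n):
    """Everything after the first n items, like Dataset.skip."""
    return items[n:]


batches = list(range(180))            # stand-ins for the 180 batches
test = take(batches, 12)              # held-out test batches
train_pool = skip(batches, 12)        # everything else
num_valid = len(train_pool) // 5      # 33
valid = take(train_pool, num_valid)   # validation comes out of the training pool
fit = skip(train_pool, num_valid)     # what the model is actually fitted on

print(len(test), len(valid), len(fit))   # 12 33 135
print(set(test) & set(valid), set(valid) & set(fit), set(test) & set(fit))
```

The three intersections are empty, so no batch is used for more than one purpose.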

In [None]:
# Fitting the model, monitoring performance on a hold-out validation set
history = model.fit(x_train, 
                    epochs=num_epochs,
                    validation_data = x_valid,
                    callbacks =[early_stopping_callback])

# 2.4 - Prediction and Evaluation 

In [None]:
print(history.history.keys())

In [None]:
plt.pyplot.plot(history.history['accuracy'])
plt.pyplot.plot(history.history['loss'])
plt.pyplot.gcf().set_size_inches(20,10)
plt.pyplot.legend(['accuracy','loss'])
plt.pyplot.title('Training Performance')
plt.pyplot.ylabel('Accuracy and Loss')
plt.pyplot.xlabel('Epoch')
plt.pyplot.savefig('trainperf.png')
plt.pyplot.show()

In [None]:
plt.pyplot.plot(history.history['val_accuracy'])
plt.pyplot.plot(history.history['val_loss'])
plt.pyplot.gcf().set_size_inches(20,10)
plt.pyplot.legend(['val_accuracy','val_loss'])
plt.pyplot.title('Validation Performance')
plt.pyplot.ylabel('Accuracy and Loss')
plt.pyplot.xlabel('Epoch')
plt.pyplot.show()

In [None]:
results = model.evaluate(test_data)
results

In [None]:
y_pred = model.predict(test_data)
y_pred.shape

Extracting values from nested Tensors in tf.data.Dataset objects

In [None]:
lis=[]
y = tfds.as_numpy(test_data)
for i,j in enumerate(y):
    tensors,labels = j
    lis.extend(labels)
    
y_true = np.array(lis,dtype='float32')
y_true = y_true.reshape(y_pred.shape[0],)
y_pred = y_pred.reshape(y_pred.shape[0],)

In [None]:
print("ROC AUC score : ", roc_auc_score(y_true,y_pred,average="micro"))

# making predictions discrete
y_pred=y_pred>0.5
y_pred=y_pred.astype(int)

print("f1_score: ", f1_score(y_true,y_pred,average="micro"))
print("precision_score: ", precision_score(y_true,y_pred,average="macro"))
print("recall_score: ", recall_score(y_true,y_pred,average="macro"))
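
ROC AUC is computed above on the raw sigmoid outputs, before they are thresholded at 0.5. The sketch below illustrates why the order matters, using a pairwise-ranking definition of AUC on toy values; `roc_auc` here is a hypothetical pure-Python helper, not the scikit-learn function.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.2, 0.6, 0.7, 0.9]              # toy sigmoid outputs
labels = [int(s > 0.5) for s in scores]    # thresholded predictions

print(roc_auc(y_true, scores))   # 1.0
print(labels)                    # [0, 1, 1, 1]
print(roc_auc(y_true, labels))   # 0.75
```

Thresholding discards the ranking information, so the AUC of the hard labels understates how well the scores separate the classes.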

We can see that using BERT as an embedding layer improves model performance. This model overfits, but it still gives better results than the Machine Learning models that use BERT only as a tokenizer.