# What is BERT?

  - BERT stands for Bidirectional Encoder Representations from Transformers. 
  - A pre-trained BERT model can be fine-tuned to create state-of-the-art (SOTA) models for a wide range of NLP tasks such as:
    - Question Answering
    - Sentiment Analysis 
    - Named Entity Recognition (NER). 
  - BERT Pretrained Models 
    - BERT BASE has 110M parameters (L=12, H=768, A=12).
    - BERT LARGE has 340M parameters (L=24, H=1024, A=16), where L is the number of layers, H the hidden size and A the number of self-attention heads ([Devlin et al., 2019](https://arxiv.org/abs/1810.04805)).
  - Model Architecture
    - BERT's model architecture is a multi-layer bidirectional Transformer encoder (see Figure 1).
      <img src = 'https://miro.medium.com/max/736/1*IN-Z-o-9m9jAB57_xZX47A.png'>
      
      - The authors of the BERT paper pre-trained the model on a corpus of about 3.3 billion words using two NLP tasks:
        - Task #1: Masked LM (MLM)
        - Task #2: Next Sentence Prediction (NSP)
      
      - The BERT model has a distinctive input representation (see Figure 2).
        - For each token, the input is the sum of the token embedding, the segment embedding and the position embedding.

          <img src='https://miro.medium.com/max/3600/1*YgAWrY8PnFkncyDZW7ycPg.png'>
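
The summed input representation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the table sizes, the random initialisation and the token ids are made up for the example and are not BERT's real values (in BERT the three tables are learned during pre-training).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for readability; these are NOT BERT's real dimensions
VOCAB_SIZE, MAX_POSITIONS, NUM_SEGMENTS, HIDDEN = 10, 8, 2, 4

# Three lookup tables (random here, learned during pre-training in BERT)
token_table = rng.normal(size=(VOCAB_SIZE, HIDDEN))
segment_table = rng.normal(size=(NUM_SEGMENTS, HIDDEN))
position_table = rng.normal(size=(MAX_POSITIONS, HIDDEN))

# A made-up sentence pair laid out as: [CLS] a b [SEP] c [SEP]
token_ids = np.array([2, 5, 6, 3, 7, 3])
segment_ids = np.array([0, 0, 0, 0, 1, 1])   # 0 = first sentence, 1 = second
position_ids = np.arange(len(token_ids))     # 0, 1, 2, ...

# The input representation is the element-wise sum of the three lookups
input_repr = (token_table[token_ids]
              + segment_table[segment_ids]
              + position_table[position_ids])
print(input_repr.shape)  # (6, 4): one HIDDEN-sized vector per token
```

Each row of `input_repr` is the vector the encoder would receive for one token; the same token id at a different position or in a different segment gets a different vector.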

### [Loading models from TensorFlow Hub](https://www.tensorflow.org/text/tutorials/classify_text_with_bert#loading_models_from_tensorflow_hub)

Here you can choose which BERT model you will load from TensorFlow Hub and fine-tune. There are multiple BERT models available.

  - BERT-Base, Uncased and seven more models with trained weights released by the original BERT authors.
  - Small BERTs have the same general architecture but fewer and/or smaller Transformer blocks, which lets you explore tradeoffs between speed, size and quality.
  - ALBERT: four different sizes of "A Lite BERT" that reduces model size (but not computation time) by sharing parameters between layers.
  - BERT Experts: eight models that all have the BERT-base architecture but offer a choice between different pre-training domains, to align more closely with the target task.
  - Electra has the same architecture as BERT (in three different sizes), but gets pre-trained as a discriminator in a set-up that resembles a Generative Adversarial Network (GAN).
  - BERT with Talking-Heads Attention and Gated GELU [base, large] has two improvements to the core of the Transformer architecture.

# Coding Practice

In [14]:
# Install the required package
!pip install bert-for-tf2



In [15]:
# Import modules
import pandas as pd
import numpy as np
import bert
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import  Model
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tqdm import tqdm
import matplotlib.pyplot as plt

print("TensorFlow Version:",tf.__version__)
print("Hub version: ",hub.__version__)
pd.set_option('display.max_colwidth',1000)

TensorFlow Version: 2.6.0
Hub version:  0.12.0


## Data Preparation

In [16]:
# Read the IMDB Dataset.csv into Pandas dataframe
url = 'https://raw.githubusercontent.com/bestvater/misc/master/IMDB%20Dataset.csv'
df=pd.read_csv(url)

In [34]:
# Take a peek at the dataset
df.head(5)
print(df.shape)




(50000, 2)


In [18]:
print("The number of rows and columns in the dataset is: {}".format(df.shape))

The number of rows and columns in the dataset is: (50000, 2)


In [19]:
# Identify missing values
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [20]:
# Check the target class balance
df["sentiment"].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

## Modelling

In [21]:
# Functions for constructing BERT Embeddings: input_ids, input_masks, input_segments and Inputs

MAX_SEQ_LEN=500 # max sequence length

def get_masks(tokens):
    """Masks: 1 for real tokens and 0 for paddings"""
    return [1]*len(tokens) + [0] * (MAX_SEQ_LEN - len(tokens))
 
def get_segments(tokens):
    """Segments: 0 for the first sequence, 1 for the second"""  
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (MAX_SEQ_LEN - len(tokens))

def get_ids(tokens, tokenizer):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (MAX_SEQ_LEN - len(token_ids))
    return input_ids

def create_single_input(sentence, tokenizer, max_len):
    """Create an input from a sentence"""
    stokens = tokenizer.tokenize(sentence)
    stokens = stokens[:max_len]
    stokens = ["[CLS]"] + stokens + ["[SEP]"]
 
    ids = get_ids(stokens, tokenizer)
    masks = get_masks(stokens)
    segments = get_segments(stokens)

    return ids, masks, segments
 
def convert_sentences_to_features(sentences, tokenizer):
    """Convert sentences to features: input_ids, input_masks and input_segments"""
    input_ids, input_masks, input_segments = [], [], []
 
    for sentence in tqdm(sentences, position=0, leave=True):
        # Reserve two positions for the [CLS] and [SEP] tokens
        ids, masks, segments = create_single_input(sentence, tokenizer, MAX_SEQ_LEN - 2)
        assert len(ids) == MAX_SEQ_LEN
        assert len(masks) == MAX_SEQ_LEN
        assert len(segments) == MAX_SEQ_LEN
        input_ids.append(ids)
        input_masks.append(masks)
        input_segments.append(segments)

    return [np.asarray(input_ids, dtype=np.int32),
            np.asarray(input_masks, dtype=np.int32),
            np.asarray(input_segments, dtype=np.int32)]

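def create_tokenizer(bert_layer):
    """Instantiate the tokenizer with the vocab file bundled in the BERT layer"""
    vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
    do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
    tokenizer = bert.bert_tokenization.FullTokenizer(vocab_file, do_lower_case)
    return tokenizer

# Alias for the original (misspelled) name, so existing calls keep working
create_tonkenizer = create_tokenizer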
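
To make the three feature arrays concrete, here is a small self-contained sketch of the same padding logic. It is an illustration only: `ToyTokenizer`, its vocabulary and the `encode` helper are hypothetical stand-ins (the real tokenizer uses WordPiece and the vocab file shipped with the model), and the sequence length is shortened to 8 so the output fits on one line.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for BERT's FullTokenizer."""

    def __init__(self, vocab):
        self.vocab = vocab

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        # Unknown words map to the [UNK] id
        return [self.vocab.get(t, self.vocab["[UNK]"]) for t in tokens]


def encode(sentence, tokenizer, max_seq_len):
    """Return (ids, masks, segments) for one sentence, padded to max_seq_len."""
    tokens = tokenizer.tokenize(sentence)[: max_seq_len - 2]  # room for [CLS]/[SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    pad = max_seq_len - len(tokens)
    ids = tokenizer.convert_tokens_to_ids(tokens) + [0] * pad
    masks = [1] * len(tokens) + [0] * pad    # 1 = real token, 0 = padding
    segments = [0] * max_seq_len             # single sentence: all segment 0
    return ids, masks, segments


# Made-up vocabulary for the example
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "a": 4, "great": 5, "movie": 6}
ids, masks, segments = encode("A great movie", ToyTokenizer(vocab), max_seq_len=8)
print(ids)       # [2, 4, 5, 6, 3, 0, 0, 0]
print(masks)     # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

For a single review all segment ids are 0; the value 1 would only appear for tokens that follow a first `[SEP]` in a sentence pair.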

In [23]:
def nlp_model(callable_object):
    # Load the pre-trained BERT base model
    bert_layer = hub.KerasLayer(handle=callable_object, trainable=True)  
   
    # The BERT layer takes three inputs: ids, masks and segments
    input_ids = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_ids")           
    input_masks = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_masks")       
    input_segments = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="segment_ids")
    
    inputs = [input_ids, input_masks, input_segments] # BERT inputs
    pooled_output, sequence_output = bert_layer(inputs) # BERT outputs
    
    # Add a hidden layer
    x = Dense(units=768, activation='relu')(pooled_output)
    x = Dropout(0.1)(x)
 
    # Add output layer
    outputs = Dense(2, activation="softmax")(x)

    # Construct a new model
    model = Model(inputs=inputs, outputs=outputs)
    return model

model = nlp_model("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1")
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 500)]        0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 500)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 500)]        0                                            
__________________________________________________________________________________________________
keras_layer_4 (KerasLayer)      [(None, 768), (None, 109482241   input_ids[0][0]                  
                                                                 input_masks[0][0]          
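
The parameter count reported for the BERT layer in the summary above can be approximately reproduced from L and H alone. The sketch below assumes the standard BERT layout (feed-forward size of 4H, a LayerNorm after the embeddings and after each sub-layer, and a pooler on top). The vocabulary size of 30,522 and the 512 position embeddings are assumptions about the uncased English checkpoint; they are not stated in this notebook. The function name is hypothetical.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Estimate the number of weights in a BERT encoder (assumed standard layout)."""
    intermediate = 4 * hidden          # feed-forward inner size
    layer_norm = 2 * hidden            # scale + bias

    # Token, position and segment embedding tables, plus one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections (weights + biases), plus LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (weights + biases), plus LayerNorm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Dense layer applied to the [CLS] vector to produce pooled_output
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * (attention + feed_forward) + pooler


print(bert_param_count(12, 768))    # 109482240 (BERT BASE)
print(bert_param_count(24, 1024))   # 335141888 (BERT LARGE)
```

The BASE estimate is within one of the 109,482,241 shown in the summary; this sketch does not account for that remaining difference. Note that the number of attention heads A does not appear in the formula, because the heads split the hidden size rather than adding weights.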

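**Note: the pretrained BERT model on TensorFlow Hub continues to be updated.** The code above loads version 1; a newer version is available at the link below.

Link: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3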

### Model Training

In [24]:
# Create examples for training and testing
df = df.sample(frac=1) # Shuffle the dataset
tokenizer = create_tonkenizer(model.layers[3])
X_train = convert_sentences_to_features(df['review'][:40000], tokenizer)
X_test = convert_sentences_to_features(df['review'][40000:], tokenizer)

# Encode labels as integers (positive -> 1, negative -> 0), then one-hot encode
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
one_hot_encoded = to_categorical(df['sentiment'].values)
y_train = one_hot_encoded[:40000]
y_test =  one_hot_encoded[40000:]

100%|██████████| 40000/40000 [03:19<00:00, 200.06it/s]
100%|██████████| 10000/10000 [00:50<00:00, 199.92it/s]
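
For readers unfamiliar with one-hot encoding, the label preparation above can be illustrated with plain NumPy. This is a sketch with made-up labels, not the Keras implementation: indexing an identity matrix by the integer labels gives one row per example with a 1 in the column of its class.

```python
import numpy as np

# Made-up integer labels: 1 = positive, 0 = negative
labels = np.array([1, 0, 0, 1])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(2, dtype=np.float32)[labels]
print(one_hot)
# [[0. 1.]
#  [1. 0.]
#  [1. 0.]
#  [0. 1.]]

# argmax along the class axis recovers the original integer labels
print(np.argmax(one_hot, axis=1))  # [1 0 0 1]
```

The second column therefore corresponds to the positive class, which is why `np.argmax(..., axis=1)` is used later to turn softmax outputs back into 0/1 predictions.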


In [26]:
# Train the model
BATCH_SIZE = 8
EPOCHS = 1

# Use Adam optimizer to minimize the categorical_crossentropy loss
opt = Adam(learning_rate=2e-5)
model.compile(optimizer=opt, 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])

# Fit the data to the model
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE,
                    verbose = 1)

# Save the trained model
model.save('nlp_model.h5')
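
As a rough illustration of what the `categorical_crossentropy` loss measures, the sketch below computes it by hand for two made-up predictions. The function and the clipping constant `eps` are assumptions for this example (clipping only guards against `log(0)`); it is not the Keras implementation.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-example cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=1)


# Made-up one-hot targets and softmax outputs for two reviews
y_true = np.array([[0.0, 1.0],    # positive
                   [1.0, 0.0]])   # negative
y_pred = np.array([[0.1, 0.9],    # confident and correct
                   [0.4, 0.6]])   # leaning towards the wrong class

losses = categorical_crossentropy(y_true, y_pred)
print(losses.round(4))   # [0.1054 0.9163]
print(losses.mean())     # batch loss: the average over examples
```

The loss is small when the probability assigned to the true class is close to 1 and grows as that probability shrinks, which is the quantity the optimizer reduces during fine-tuning.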



### Analysis of model performance

In [27]:
# Load the fine-tuned model saved in the previous step
from tensorflow.keras.models import load_model
new_model = load_model('nlp_model.h5',custom_objects={'KerasLayer':hub.KerasLayer})

In [28]:
# Predict on test dataset
from sklearn.metrics import classification_report
pred_test = np.argmax(new_model.predict(X_test), axis=1)

In [29]:
print(classification_report(np.argmax(y_test,axis=1), pred_test))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94      5058
           1       0.94      0.93      0.94      4942

    accuracy                           0.94     10000
   macro avg       0.94      0.94      0.94     10000
weighted avg       0.94      0.94      0.94     10000
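
The precision, recall and F1 columns in the report can be computed by hand from counts of true positives, false positives and false negatives. The sketch below does this for a small made-up set of labels; `binary_report` is a hypothetical helper, not part of scikit-learn, and it does not handle the edge case where a denominator is zero.

```python
import numpy as np


def binary_report(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class (no zero-division handling)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # correctly flagged
    fp = np.sum((y_pred == positive) & (y_true != positive))  # wrongly flagged
    fn = np.sum((y_pred != positive) & (y_true == positive))  # missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels and predictions for eight reviews
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

precision, recall, f1 = binary_report(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Calling `binary_report(y_true, y_pred, positive=0)` gives the row for the negative class; the `support` column in the report is simply the number of true examples of each class.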



In [30]:
pred_test[:10]

array([1, 1, 0, 0, 1, 0, 0, 1, 1, 1])

In [31]:
# Predicted 1 for the first review in the test dataset
df['review'][40000:40001]

38136    The greatest compliments to the other commentator here at IMDb who asked himself why this series didn't "get stuck" in its time to last a lot longer like many other series in the 80s did.<br /><br />It is not true the series would have gotten worse if further continued.<br /><br />I will at the end of this my comment post some thoughts about the other movie realizations, rather: attempts of the Robin Hood legend.<br /><br />First of All, Robert Addie (Gisburne), you are among us all, you live forever.<br /><br />Nothing is as fun as the entire two, if one wants, three seasons of this absolutely unique series. And at the same time absolutely agreeing with the mostly new and revolutionary findings of Terry Jones' history documentations about Egypt, Greece, Rome, Konstantinopel, the Goths and Barbarians, and the middle ages and crusades (...yes, THE Monthy Python-Terry Jones):<br /><br />If you have seen those brilliant and funny Jones-Docs you will better, much better understand

In [32]:
# Predicted 1 for the second review in the test dataset
df['review'][40001:40002]

3203    Some people say the pace of this film is a little slow, but how is this different from any other Hitchcock movie? They all move very deliberately and, as a point, have spurts of suspense and brilliant montages injected through it. This movie gives us just the right amount of comic relief which make the suspense scenes seem all the more suspenseful. The Albert Hall scene is one of the best examples of Pure Cinema that exists in Hitchcock's collection (the best probably being almost all of "Rear Window"). Pure Cinema for Hitchcock meant a series of usually small pieces of film fit together without dialogue, in order to tell the story visually. This is, of course the basic definition of the Albert Hall sequence, as well as the shorter staircase sequence at the end of the picture. <br /><br />Not many slip-ups by Hitchcock here, and the acting is superb especially by Doris Day in a rather surprising serious role.
Name: review, dtype: object

In [33]:
# Predicted 0 for the third review in the test dataset
df['review'][40002:40003]

14277    I'm not surprised that this film did well at the Hamptons Film Festival. It is a shallow film that would appeal well to shallow people. Two actors pretending to be actors in a relationship who fight and look for a lost dog. The film is allegedly exploring the dynamics of the relationship, however, the relationship is far too petty to merit any such exploration. This couple has one dimension: they fight, they tease, then they make love and fight some more. There brief moment of tenderness does not reveal any possible reason that these two would be involved with each other given their venomous and volatile relationship. Beautifully shot, excellent score, but without anything of merit in the script or characters, this short is just that.
Name: review, dtype: object