The purpose of this notebook is to show how to prepare data for the TF Hub BERT module. I explain the data structures it uses and how to convert your dataset into a format the module can consume. For this I used the ```bert-tensorflow``` package, whose source is hosted [here](https://github.com/google-research/bert).

This tutorial [page](https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb) from Google shows how to do this, but it uses tf.estimator. I am not very comfortable with estimators and prefer to build models with custom layers and tf.keras.

#### Start by importing some libraries

In [0]:
# example followed: 
# https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb
# https://github.com/SunYanCN/bert-text


from sklearn.model_selection import train_test_split
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras
import os
import re
import numpy as np

#### We need the BERT libraries provided by Google
They make it easy to *tokenize* and *featurize* our data.

In [37]:
!pip install bert-tensorflow



In [0]:
# import bert related packages
import bert
from bert import modeling
from bert import run_classifier
from bert import tokenization

### Dataset
This is a small dataset to validate our idea.

In [39]:
# prepare dataset
data = ['he is happy because he got a new car'
        ,'He is a lovable person'
        ,'john is a cheerful guy'
        ,'he was in a merry mood'
        ,'the whole crowd was joyful'
        ,'she is a loving person'
        ,'he was delighted to see me'
        ,'he was smiling at me when i got a new mobile'
        ,'he is in a jovial mood'
        ,'he was sad because his friend died'
        ,'he was unhappy with his dog'
        ,'the company has miserable status'
        ,'he was sorrowful as he lost all his money'
        ,'The bird was sorrowful as it had no food'
        ,'the dog was glum as his master was not at home'
        ,'he was in gloomy mood'
        ,'she was depressed as she lost her job'
        ,'do not be downhearted']
label = [1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0]

zip_list = list(zip(data,label))
df = pd.DataFrame(zip_list, columns = ['sentence','polarity'])
train_df, test_df = train_test_split(df, test_size=0.1)

train_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)

DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'polarity'
# label_list is the list of labels, i.e. True, False or 0, 1 or 'dog', 'cat'
label_list = [0, 1]


print('Train data shape:', train_df.shape)
print('Test data shape:', test_df.shape)

Train data shape: (16, 2)
Test data shape: (2, 2)


### Prepare dataset

#### **InputExample**

To use our data with BERT, we first need to wrap each row in a specific data structure.

**Details about the ```InputExample```.**

See the definition of ```InputExample``` below, taken from the BERT source code [here](https://github.com/google-research/bert/blob/master/run_classifier.py).
```
class InputExample(object):
  """A single training/test example for simple sequence classification."""

  def __init__(self, guid, text_a, text_b=None, label=None):
    """Constructs a InputExample.
    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Only must be specified for sequence pair tasks.
      label: (Optional) string. The label of the example. This should be
        specified for train and dev examples, but not for test examples.
    """
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label
```

Details of the members:
- ```text_a``` is the text we want to classify. In our case, it is the ```sentence``` column of the DataFrame.
- ```text_b``` is used if we're training a model to understand the relationship between sentences ```text_a``` and ```text_b``` (e.g. is ```text_b``` a translation of ```text_a```? Can ```text_b``` be the next sentence after ```text_a```?). This doesn't apply to our task, so we can leave ```text_b``` as ```None```.
- ```label``` is the class label of the example.

So, one can think of ```InputExample``` as a data structure that holds a sentence (or a sentence pair, when a pair is required) and the target label.
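
To make the structure concrete, here is a minimal sketch that builds examples from plain Python lists. The class below is only a stand-in that mirrors the definition quoted above, so the snippet runs without ```bert-tensorflow``` installed. The helper name ```make_examples``` is my own and is not part of the BERT code.

```python
# Stand-in that mirrors bert.run_classifier.InputExample (see the definition above),
# so this sketch runs without the bert-tensorflow package.
class InputExample(object):
    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


def make_examples(sentences, labels):
    """Hypothetical helper: pair each sentence with its label.

    text_b stays None because this is a single-sentence task.
    """
    return [
        InputExample(guid=i, text_a=sentence, text_b=None, label=label)
        for i, (sentence, label) in enumerate(zip(sentences, labels))
    ]


# Two rows taken from the dataset above.
examples = make_examples(
    ["he is in a jovial mood", "he was in gloomy mood"],
    [1, 0],
)
print(examples[0].text_a, examples[0].label)
```

In the notebook itself the same thing is done with ```DataFrame.apply```, which is shown in the next cell.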

In [40]:
# Use the InputExample class from BERT's run_classifier code to create examples from the data
train_InputExamples = train_df.apply(lambda x: bert.run_classifier.InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
                                                                   text_a = x[DATA_COLUMN], 
                                                                   text_b = None, 
                                                                   label = x[LABEL_COLUMN]), axis = 1)

test_InputExamples = test_df.apply(lambda x: bert.run_classifier.InputExample(guid=None, 
                                                                   text_a = x[DATA_COLUMN], 
                                                                   text_b = None, 
                                                                   label = x[LABEL_COLUMN]), axis = 1)

print(train_InputExamples[0].text_a)
print(train_InputExamples[0].label)

he was smiling at me when i got a new mobile
1


#### Tokenize

**Now we must tokenize our data using the same tokenizer used to train the actual BERT model.**

In [41]:
# preprocess our data so that it matches the data BERT was trained on

# This is a path to an uncased (all lowercase) version of BERT
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

def create_tokenizer_from_hub_module():
  """Get the vocab file and casing info from the Hub module."""
  with tf.Graph().as_default():
    bert_module = hub.Module(BERT_MODEL_HUB)
    tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
    with tf.Session() as sess:
      vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                            tokenization_info["do_lower_case"]])
      
  return bert.tokenization.FullTokenizer(
      vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()

# see what tokenizer does
print(tokenizer.tokenize("This here's an example of using the BERT tokenizer"))
print(tokenizer.tokenize(train_df['sentence'][0]))

['this', 'here', "'", 's', 'an', 'example', 'of', 'using', 'the', 'bert', 'token', '##izer']
['he', 'was', 'smiling', 'at', 'me', 'when', 'i', 'got', 'a', 'new', 'mobile']
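
The ```'token', '##izer'``` split in the output above comes from WordPiece tokenization. As far as I understand it, each word is split greedily, always taking the longest piece that exists in the vocabulary, and continuation pieces are prefixed with ```##```. The sketch below illustrates only that idea. The function ```wordpiece``` and the three-entry ```toy_vocab``` are my own assumptions for illustration; the real tokenizer also does lowercasing, punctuation splitting and more, and the real vocabulary has about 30,000 entries.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # nothing matched, so the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary, just enough for the example.
toy_vocab = {"token", "##izer", "bert"}

print(wordpiece("tokenizer", toy_vocab))  # ['token', '##izer']
print(wordpiece("bert", toy_vocab))       # ['bert']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```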


#### InputFeatures

But it's not over yet! Our data is still not in a form that BERT understands.
See the snippet below, taken from the BERT source code [here](https://github.com/google-research/bert/blob/master/run_classifier.py). We have to bring our data into this form, and ```bert-tensorflow``` provides the method to do so.
The good news is that this is the last step needed to transform our data into a BERT-consumable form.

```
# The convention in BERT is:
# (a) For sequence pairs:
#  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
#  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
# (b) For single sequences:
#  tokens:   [CLS] the dog is hairy . [SEP]
#  type_ids: 0     0   0   0  0     0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
```



In [0]:
# Using our tokenizer, we'll call run_classifier.convert_examples_to_features on our InputExamples to convert them into features BERT understands
MAX_SEQ_LENGTH = 20 # at most these many tokens long
# Convert our train and test features to InputFeatures that BERT understands.
train_features = bert.run_classifier.convert_examples_to_features(train_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)
test_features = bert.run_classifier.convert_examples_to_features(test_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)

So, what are ```InputFeatures```?

It is another data structure. Besides the ```label_id```, it holds three pieces of information that become the inputs of the model:
- ```input_ids```: Index of the tokens in the vocabulary.
- ```input_mask```: The mask value is 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
- ```segment_ids```: It is the ```type_ids```. It is 0 for the tokens of ```text_a``` and 1 for the tokens of ```text_b```.
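
The three fields can be rebuilt by hand, which I find helps to understand them. The sketch below is my own simplified version and not the library code: ```build_features``` is a hypothetical name, the vocabulary is a tiny dictionary using ids that appear in this notebook's output, and unlike the real ```convert_examples_to_features``` it does not truncate long inputs.

```python
def build_features(tokens_a, vocab, max_seq_length, tokens_b=None):
    """Return (input_ids, input_mask, segment_ids), each padded to max_seq_length."""
    # [CLS] text_a [SEP] belongs to segment 0.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    # The optional text_b [SEP] belongs to segment 1.
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    if len(tokens) > max_seq_length:
        raise ValueError("this sketch does not truncate; shorten the input")

    input_ids = [vocab[token] for token in tokens]
    input_mask = [1] * len(input_ids)  # 1 marks a real token

    # Pad all three lists with zeros up to max_seq_length.
    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding, segment_ids + padding


# Tiny vocabulary with ids copied from the output printed in this notebook.
toy_vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
             "he": 2002, "was": 2001, "smiling": 5629}

ids, mask, segments = build_features(["he", "was", "smiling"], toy_vocab, 8)
print(ids)       # [101, 2002, 2001, 5629, 102, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```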

In [43]:
# Let's understand the InputFeatures data structure by printing the values for a single data row.
print('Text:', train_InputExamples[0].text_a)
print('Tokens:', tokenizer.tokenize(train_InputExamples[0].text_a))
print('Vocab index (input_ids):', train_features[0].input_ids)
print('Get back the tokens from vocab index:', tokenizer.convert_ids_to_tokens(train_features[0].input_ids))
print('input_mask:', train_features[0].input_mask)
print('segment_ids:', train_features[0].segment_ids)

Text: he was smiling at me when i got a new mobile
Tokens: ['he', 'was', 'smiling', 'at', 'me', 'when', 'i', 'got', 'a', 'new', 'mobile']
Vocab index (input_ids): [101, 2002, 2001, 5629, 2012, 2033, 2043, 1045, 2288, 1037, 2047, 4684, 102, 0, 0, 0, 0, 0, 0, 0]
Get back the tokens from vocab index: ['[CLS]', 'he', 'was', 'smiling', 'at', 'me', 'when', 'i', 'got', 'a', 'new', 'mobile', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
input_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
segment_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


### Model building

We are all set to start with the model!
As a first step, we create a custom layer that wraps the TensorFlow Hub BERT module.<br>
Once that is done, the BERT layer can be used like any other layer inside a Keras model, which hides the details of the underlying TF Hub module.

#### Custom layer to wrap tfhub BERT module

In [0]:
from tensorflow.keras import backend as K


class BertLayer(tf.keras.layers.Layer):
    def __init__(
        self,
        n_fine_tune_layers=10,
        pooling="first",
        bert_path=None,
        max_len = None,
        return_sequences=False,
        **kwargs,
    ):
        self.n_fine_tune_layers = n_fine_tune_layers
        self.trainable = True
        self.output_size = 768
        self.pooling = pooling
        self.bert_path = bert_path
        self.return_sequences = return_sequences
        self.output_key = 'sequence_output' if return_sequences else 'pooled_output'
        self.max_len = max_len
        if self.pooling not in ["first", "mean"]:
            raise NameError(
                f"Undefined pooling type (must be either first or mean, but is {self.pooling})"
            )

        super(BertLayer, self).__init__(**kwargs)

    def build(self, input_shape):

        self.trainable = self.n_fine_tune_layers > 0
        self.bert = hub.Module(
            self.bert_path, trainable=self.trainable, name=f"{self.name}_module"
        )
        # Remove unused layers
        trainable_vars = self.bert.variables

        if self.pooling == "first":
            trainable_vars = [var for var in trainable_vars if not "/cls/" in var.name]
            trainable_layers = ["pooler/dense"]

        elif self.pooling == "mean":
            trainable_vars = [
                var
                for var in trainable_vars
                if not "/cls/" in var.name and not "/pooler/" in var.name
            ]
            trainable_layers = []
        else:
            raise NameError(
                f"Undefined pooling type (must be either first or mean, but is {self.pooling})"
            )

        # Select how many layers to fine tune
        for i in range(self.n_fine_tune_layers):
            trainable_layers.append(f"encoder/layer_{str(11 - i)}")

        # Update trainable vars to contain only the specified layers
        trainable_vars = [
            var
            for var in trainable_vars
            if any([l in var.name for l in trainable_layers])
        ]

        # Add to trainable weights
        for var in trainable_vars:
            self._trainable_weights.append(var)

        for var in self.bert.variables:
            if var not in self._trainable_weights:
                self._non_trainable_weights.append(var)

        super(BertLayer, self).build(input_shape)

    def call(self, inputs, **kwargs):
        inputs = [K.cast(x, dtype="int32") for x in inputs]
        input_ids, input_mask, segment_ids = inputs
        bert_inputs = dict(
            input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids
        )

        if self.pooling == "first":
            pooled = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
                "pooled_output"
            ]
        elif self.pooling == "mean":
            result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
                "sequence_output"
            ]

            mul_mask = lambda x, m: x * tf.expand_dims(m, axis=-1)
            masked_reduce_mean = lambda x, m: tf.reduce_sum(mul_mask(x, m), axis=1) / (
                    tf.reduce_sum(m, axis=1, keepdims=True) + 1e-10)
            input_mask = tf.cast(input_mask, tf.float32)
            pooled = masked_reduce_mean(result, input_mask)
        else:
            raise NameError(f"Undefined pooling type (must be either first or mean, but is {self.pooling})")

        sequence_output = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)["sequence_output"]
        sequence_output = tf.reshape(sequence_output, (-1, self.max_len, self.output_size))
        return [sequence_output, pooled]

    def compute_output_shape(self, input_shape):
        return [(input_shape[0], self.max_len, self.output_size),(input_shape[0], self.output_size)]
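
The ```"mean"``` pooling branch in the layer is the least obvious part, so here is a plain-Python sketch of what I understand it to compute for a single sentence: the average of the token vectors, counting only positions where ```input_mask``` is 1. The function name ```masked_mean``` and the toy numbers are made up for illustration. The layer does the same thing on whole batches with TensorFlow ops.

```python
def masked_mean(sequence_output, input_mask):
    """Average the token vectors of one sentence, ignoring padded positions."""
    hidden_size = len(sequence_output[0])
    total = [0.0] * hidden_size
    for vector, keep in zip(sequence_output, input_mask):
        if keep:  # mask is 1 for real tokens, 0 for padding
            for j, value in enumerate(vector):
                total[j] += value
    count = sum(input_mask)
    # The small epsilon avoids division by zero for an all-padding row,
    # as in the layer code above.
    return [t / (count + 1e-10) for t in total]


# Three positions with hidden size 2; the last position is padding.
toy_output = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]

pooled = masked_mean(toy_output, toy_mask)
print(pooled)  # approximately [2.0, 3.0]; the padded vector is ignored
```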

#### Keras Model

Build the model, which uses BERT as an internal layer.

In [0]:
def build_model(bert_path, max_seq_length):
    in_id = tf.keras.layers.Input(shape=(max_seq_length,), name="input_ids")
    in_mask = tf.keras.layers.Input(shape=(max_seq_length,), name="input_masks")
    in_segment = tf.keras.layers.Input(shape=(max_seq_length,), name="segment_ids")
    bert_inputs = [in_id, in_mask, in_segment]

    # use the function argument rather than the global MAX_SEQ_LENGTH
    bert_output = BertLayer(bert_path=bert_path, pooling='first', n_fine_tune_layers=0, max_len=max_seq_length)(bert_inputs)

    dense = tf.keras.layers.Dense(128, activation="relu")(bert_output[1])
    pred = tf.keras.layers.Dense(1, activation="sigmoid")(dense)

    model = tf.keras.models.Model(inputs=bert_inputs, outputs=pred)
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.summary()

    return model
  
def initialize_vars(allow_growth=True):
    gpu_options = tf.GPUOptions(allow_growth=allow_growth)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
    sess.run(tf.local_variables_initializer())
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    K.set_session(sess)

#### Unpack InputFeatures to feed the training data into the model

Unpack the ```InputFeatures``` to get the inputs to our model.

In [0]:
# our model needs the following inputs
train_input_ids = []
train_input_mask = []
train_segment_ids = []
train_label_ids = []
for feature in train_features:
    train_input_ids.append(feature.input_ids)
    train_input_mask.append(feature.input_mask)
    train_segment_ids.append(feature.segment_ids)
    train_label_ids.append(feature.label_id)

test_input_ids = []
test_input_mask = []
test_segment_ids = []
test_label_ids = []
for feature in test_features:
    test_input_ids.append(feature.input_ids)
    test_input_mask.append(feature.input_mask)
    test_segment_ids.append(feature.segment_ids)
    test_label_ids.append(feature.label_id)

#### Train the model

Now, train the model.

In [47]:
model = build_model(BERT_MODEL_HUB, MAX_SEQ_LENGTH)

initialize_vars()

train_inputs = [train_input_ids, train_input_mask, train_segment_ids]
train_labels = train_label_ids

model.fit(train_inputs, train_labels,
          validation_data=None,
          epochs=10,batch_size=2,shuffle=True )

Model: "model_7"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 20)]         0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 20)]         0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 20)]         0                                            
__________________________________________________________________________________________________
bert_layer_7 (BertLayer)        [(None, 20, 768), (N 110104890   input_ids[0][0]                  
                                                                 input_masks[0][0]          

<tensorflow.python.keras.callbacks.History at 0x7f9c7e9735c0>

#### Get prediction with the model

In [0]:
def getPredictionFromSentence(in_sentences):
  labels = ["Negative", "Positive"]
  input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # guid="" and label=0 are dummy values; the label only has to be in label_list
  input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)

  test_input_ids = []
  test_input_mask = []
  test_segment_ids = []
  test_label_ids = []
  for feature in input_features:
      test_input_ids.append(feature.input_ids)
      test_input_mask.append(feature.input_mask)
      test_segment_ids.append(feature.segment_ids)
      test_label_ids.append(feature.label_id)
  
  probabilities = getPredictionFromFeatures(test_input_ids, test_input_mask, test_segment_ids)
  # print(probabilities)
  predictions = (probabilities > 0.5).astype(int)  # np.int is no longer available in recent NumPy versions
  # print(predictions)
  
  return [(sentence, proba, labels[prediction[0]]) for sentence, proba, prediction in zip(in_sentences, probabilities, predictions)]

def getPredictionFromFeatures(test_input_ids, test_input_mask, test_segment_ids):
  return model.predict([test_input_ids, test_input_mask, test_segment_ids])
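
The step from probabilities to labels is just a threshold on the sigmoid output. Here is a minimal sketch of that step on its own, using a flat list of floats instead of the ```(n, 1)``` array that ```model.predict``` returns. The name ```to_labels``` and the second probability are made up for illustration.

```python
def to_labels(probabilities, threshold=0.5, names=("Negative", "Positive")):
    """Map each sigmoid output to a class name: above the threshold means Positive."""
    return [names[int(p > threshold)] for p in probabilities]


# 0.7265612 comes from the notebook output; 0.31 is a made-up value.
print(to_labels([0.7265612, 0.31]))  # ['Positive', 'Negative']
```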

#### Test data and prediction on test data

In [49]:
# get a little more test data
pred_sentences = [
  "That movie was absolutely awful",
  "The acting was a bit lacking",
  "The film was creative and surprising",
  "Absolutely fantastic!"
]

predictions = getPredictionFromSentence(pred_sentences)
print(predictions)

# test with our original test data
predictions = getPredictionFromSentence(test_df['sentence'].tolist())
print(predictions)

[('That movie was absolutely awful', array([0.7265612], dtype=float32), 'Positive'), ('The acting was a bit lacking', array([0.5497797], dtype=float32), 'Positive'), ('The film was creative and surprising', array([0.5154184], dtype=float32), 'Positive'), ('Absolutely fantastic!', array([0.561795], dtype=float32), 'Positive')]
[('he was sorrowful as he lost all his money', array([0.699868], dtype=float32), 'Positive'), ('she was depressed as she lost her job', array([0.5356732], dtype=float32), 'Positive')]


### Feature extraction

Now, suppose you want to pull out the activations of some of the layers and use them as features for your downstream work. But which layers provide the best results? The image below, from [Jay Alammar](http://jalammar.github.io/illustrated-bert/)'s blog, summarizes the findings of the BERT paper.
![Bert-feature_extraction](http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png)

#### Extract features from a specific layer

Here we extract the activations of an arbitrarily chosen layer; the same approach works for any other layer.

In [53]:
# collect all the layers of the tfhub bert model
# layers_of_bert = [i.values() for i in tf.get_default_graph().get_operations()]
# check their names
# print(layers_of_bert[-15:]) # there are just too many layers

# picked some random layer
some_layer_name = 'bert_layer_module/bert/encoder/layer_3/attention/output/LayerNorm/moments/SquaredDifference:0'

# this helped: https://stackoverflow.com/questions/55333558/how-to-access-bert-intermediate-layer-outputs-in-tf-hub-module
some_layer_output = K.get_session().run(tf.get_default_graph().get_tensor_by_name(some_layer_name)
      , feed_dict={'bert_layer_module/input_ids:0': test_input_ids, 'bert_layer_module/input_mask:0': test_input_mask, 'bert_layer_module/segment_ids:0': test_segment_ids})
print('Input data shape:', len(test_input_ids))
print('Feature shape: {}, remember MAX_SEQ_LENGTH value is: {}'.format(some_layer_output.shape, MAX_SEQ_LENGTH))
print('Feature matrix:')
print(some_layer_output)

Input data shape: 2
Feature shape: (40, 768), remember MAX_SEQ_LENGTH value is: 20
Feature matrix:
[[0.02290059 0.03542009 0.08184339 ... 0.03956602 0.0046551  0.09658112]
 [0.00689361 0.00321472 0.6773902  ... 0.00224064 0.00283264 0.03928791]
 [0.00213189 0.32839078 0.1467535  ... 0.00231929 0.0064002  0.08923845]
 ...
 [0.08666013 0.13227108 0.27587062 ... 0.797492   0.18822151 0.42659256]
 [0.15041502 0.18689519 0.23036228 ... 0.8800924  0.1415598  0.34996518]
 [0.27268362 0.23202768 0.06867391 ... 0.9788965  0.1378572  0.24066505]]
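
Note the feature shape ```(40, 768)``` for 2 input sentences. My reading is that this tensor is flattened inside the encoder, so the 40 rows are 2 sentences times ```MAX_SEQ_LENGTH``` = 20 token positions. Assuming the rows are ordered sentence by sentence (I have not verified this for every layer), they can be regrouped per sentence as sketched below. The helper name ```regroup``` and the toy data are my own.

```python
def regroup(rows, max_seq_length):
    """Split a flat list of token rows into one chunk of max_seq_length rows per sentence."""
    if len(rows) % max_seq_length != 0:
        raise ValueError("row count is not a multiple of max_seq_length")
    return [rows[i:i + max_seq_length] for i in range(0, len(rows), max_seq_length)]


# Toy stand-in for the (40, 768) matrix: 40 rows, hidden size 3 instead of 768.
flat = [[float(i)] * 3 for i in range(40)]

per_sentence = regroup(flat, 20)
print(len(per_sentence), len(per_sentence[0]), len(per_sentence[0][0]))  # 2 20 3
```

With a NumPy array the equivalent would be a reshape to ```(-1, MAX_SEQ_LENGTH, 768)```.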


## Conclusion

There might be a better way to do this. I will update the notebook if I find a more elegant approach.