# BERT

### Installation

TensorFlow 2.0 supports the Keras API (in fact, it is recommended for most engineering-level work unless you are doing research). You can install both TF 2.0 and a pre-trained BERT for Keras using the following:

In [None]:
!pip install tensorflow

In [None]:
!pip install keras-bert

### Initialize a pre-trained model
Transfer learning requires the initialization of a pre-trained model. Let's start by importing the libraries and functions we need.

In [1]:
import tensorflow as tf
from keras_bert import get_base_dict, get_model, compile_model, gen_batch_inputs

Using TensorFlow backend.


Let's initialize a pre-trained model. Before we do that, let's decide on the size of our vocabulary, which is the number of words that we will recognize in our corpus. To keep things simple, let's work with 1000 unique words. Let's call it `NUM_TOKENS` because words are normally tokenized.

In [2]:
NUM_TOKENS = 1000

Now that we have a vocabulary size, let's go ahead and initialize a model.

In [3]:
model = get_model(
    token_num=NUM_TOKENS
)

Now that we have a model, let's look at what advantage we got from using a pre-trained model.

In [4]:
model.summary()  # summary() prints the table itself; wrapping it in print() also prints "None"

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, 512)          0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, 512)          0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 512, 768), ( 768000      Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 512, 768)     1536        Input-Segment[0][0]              
____________________________________________________________________________________________

Looks like there is something wrong. If we make use of this model, we will have to train 87 million parameters. That is a lot of parameters. What is going on?

BERT gives us a pre-trained model, and we can fine-tune it to suit our needs. Perhaps that explains why we are training (or fine-tuning) so many parameters.

But when we make use of pre-trained models for computer vision, we normally don't need to train so many parameters (unless we go crazy and add many dense layers with lots of neurons and a `Flatten` layer somewhere).

But then, this is a small vocabulary. In English, we often work with a vocabulary of 20,000 words. Let's see what happens if we instantiate a model with such a dictionary.

In [7]:
_m = get_model(
    token_num=20000
)
_m.summary()  # summary() prints the table itself

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, 512)          0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, 512)          0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 512, 768), ( 15360000    Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 512, 768)     1536        Input-Segment[0][0]              
____________________________________________________________________________________________

Okay, a pre-trained model for a 20,000-word vocabulary would require us to train 102 million parameters, which is an extra 15 million.

Let's see if we can get any insights from the model architecture.

Looking at the architecture, we encounter the following layers:
1. InputLayer called Input-Token with an output shape (None, 512) and 0 trainable parameters
2. InputLayer called Input-Segment with an output shape (None, 512) and 0 trainable parameters
3. TokenEmbedding called Embedding-Token with shape [(None, 512, 768), (...)] and 15360000 trainable parameters
4. Embedding called Embedding-Segment with shape (None, 512, 768) and 1536 trainable parameters
5. Add called Embedding-Token-Segment with shape (None, 512, 768) and 0 trainable parameters

The following questions come to mind:
* What does the size of 512 represent?
* If we reduce it, can we reduce the number of trainable parameters?

In [8]:
model = get_model(
    token_num=NUM_TOKENS,
    seq_len=36
)
model.summary()  # summary() prints the table itself

Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, 36)           0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, 36)           0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 36, 768), (1 768000      Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 36, 768)      1536        Input-Segment[0][0]              
____________________________________________________________________________________________

In the code above, we introduced a parameter called `seq_len`, which is the number of tokens in each sequence that we pass into our input layer. The default value is 512; we have set it to 36. This doesn't reduce the total number of trainable parameters, so let's consider something else.

The third and fourth layers (actually, second layers of two parallel networks before the addition) are embeddings. What is the dimension of the embedding? Previously the output shape was (None, 512, 768) but now it is (None, 36, 768) because we set our sequence length to 36. That implies that our default dimension size is 768.
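The embedding parameter counts in the summaries are consistent with the number of rows multiplied by the embedding dimension. Here is a minimal sketch of that arithmetic, assuming each embedding is a plain lookup table with one vector per row and no bias, and that the segment embedding has 2 rows. The helper name `embedding_params` is hypothetical and not part of keras-bert:

```python
def embedding_params(num_rows, embed_dim):
    """Parameters in a lookup-table embedding: one vector of size embed_dim per row."""
    return num_rows * embed_dim

EMBED_DIM = 768  # embedding dimension read off the output shapes in the summaries

small_vocab = embedding_params(1000, EMBED_DIM)    # Embedding-Token with 1,000 tokens
large_vocab = embedding_params(20000, EMBED_DIM)   # Embedding-Token with 20,000 tokens
segment = embedding_params(2, EMBED_DIM)           # Embedding-Segment, assuming 2 segment ids

print(small_vocab, large_vocab, segment)
print(large_vocab - small_vocab)  # extra parameters caused by the larger vocabulary
```

Because these counts scale linearly with the embedding dimension, shrinking that dimension is one lever for reducing the number of parameters.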

Let's try something. Let's take the fourth root of our dictionary size. Our dictionary has 1000 entries, and the fourth root of 1000 is about 5.62. Let's round it up to 6 and use that as our embedding size.
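The rule of thumb above is easy to check. This is only a sketch of the arithmetic, and the variable names are illustrative:

```python
import math

NUM_TOKENS = 1000

fourth_root = NUM_TOKENS ** 0.25       # fourth root of the vocabulary size
embed_dim = math.ceil(fourth_root)     # round up to the next whole number

print(round(fourth_root, 2), embed_dim)
```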

In [9]:
model = get_model(
    token_num=NUM_TOKENS,
    seq_len=36,
    embed_dim=6
)
model.summary()  # summary() prints the table itself

IndexError: Invalid head number 12 with the given input dim 6

When we try that, we get an error about a head number of 12 being invalid for a given input dimension of 6. Let's try to fix that.

In [11]:
model = get_model(
    token_num=NUM_TOKENS,
    head_num=1,
    seq_len=36,
    embed_dim=6
)
model.summary()  # summary() prints the table itself

Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, 36)           0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, 36)           0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 36, 6), (100 6000        Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 36, 6)        12          Input-Segment[0][0]              
____________________________________________________________________________________________

By setting our head number to 1, we are able to work with an embedding dimension of 6. Most importantly, the total number of trainable parameters is now down to 491 thousand.
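The error message suggests that the embedding dimension has to be split evenly across the attention heads. Assuming that is the constraint (the embedding dimension must be divisible by the head number, which is how multi-head attention is commonly implemented), a small sketch can list the head counts that would fit a given dimension. The helper `valid_head_nums` is hypothetical and not part of keras-bert:

```python
def valid_head_nums(embed_dim):
    """Head counts that divide embed_dim evenly (the assumed requirement)."""
    return [h for h in range(1, embed_dim + 1) if embed_dim % h == 0]

heads_for_6 = valid_head_nums(6)
print(heads_for_6)                    # candidate head counts for embed_dim=6
print(12 in heads_for_6)              # whether the default of 12 heads fits embed_dim=6
print(12 in valid_head_nums(768))     # whether 12 heads fits the default dimension of 768
```

Under this assumption, 1 is simply the smallest head count that works for an embedding dimension of 6; any other divisor of 6 should be accepted as well.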

If we are happy with this, we can proceed to train the model.

In [16]:
x = model.output
print(x)

[<tf.Tensor 'MLM_3/Identity:0' shape=(None, 36, 1000) dtype=float32>, <tf.Tensor 'NSP_3/Softmax:0' shape=(None, 2) dtype=float32>]


In [14]:

x = tf.keras.layers.Dense(64, activation='relu')(x)
x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

classifier = tf.keras.models.Model(inputs=model.input, outputs=x)

AttributeError: 'tuple' object has no attribute 'layer'