# BERT

### Installation

TensorFlow 2.0 ships with the Keras API, which is the recommended interface for most engineering work unless you are doing research. You can install TensorFlow 2.0 and the `keras-bert` package, which provides BERT for Keras, with the following commands:

In [None]:
!pip install tensorflow

In [None]:
!pip install keras-bert

### Initialize a pre-trained model
Transfer learning requires the initialization of a pre-trained model. Let's start by importing the libraries and functions we need:

In [1]:
import tensorflow as tf
from keras_bert import get_base_dict, get_model, compile_model, gen_batch_inputs

Using TensorFlow backend.


Before we initialize the model, we need to decide on the size of our vocabulary, that is, the number of unique words we will recognize in our corpus. To keep things simple, let's work with 1000 unique words. We'll call it `NUM_TOKENS` because words are normally tokenized.

In [2]:
NUM_TOKENS = 1000
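To make the idea of a capped vocabulary concrete, here is a minimal sketch in plain Python of how a token dictionary might be built from a corpus. The toy corpus, the `build_vocab` helper and the list of special tokens are all assumptions made for illustration; they are not part of `keras-bert`.

```python
from collections import Counter

# Toy corpus, purely illustrative.
corpus = [
    "all work and no play",
    "makes jack a dull boy",
    "all play and no work",
]

# Assumption: special tokens modelled on the usual BERT vocabulary.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(sentences, max_tokens):
    """Map the most frequent words to integer ids, capped at max_tokens entries."""
    counts = Counter(word for s in sentences for word in s.split())
    token_dict = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    # Reserve room for the special tokens, then keep the most frequent words.
    for word, _ in counts.most_common(max_tokens - len(SPECIAL_TOKENS)):
        token_dict[word] = len(token_dict)
    return token_dict


token_dict = build_vocab(corpus, max_tokens=1000)
print(len(token_dict))  # 15: 10 unique words plus 5 special tokens
```

Any word that falls outside the cap would be mapped to `[UNK]` at lookup time, which is why the vocabulary size is a trade-off between coverage and model size.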

Now that we have a vocabulary size, let's go ahead and initialize a model:

In [3]:
model = get_model(
    token_num=NUM_TOKENS
)

Now that we have a model, let's look at what advantage we got from using a pre-trained model.

In [4]:
model.summary()  # summary() prints the table itself and returns None

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, 512)          0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, 512)          0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 512, 768), ( 768000      Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 512, 768)     1536        Input-Segment[0][0]              
____________________________________________________________________________________________
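The `Param #` values of the two embedding layers in this output can be reproduced by hand, since an embedding layer is a lookup table with one row per id. The sketch below assumes two segment ids, which is what the 1536 figure implies at a width of 768; the helper name `embedding_params` is made up for this example.

```python
EMBED_DIM = 768      # last dimension of the embedding output shapes
NUM_TOKENS = 1000    # vocabulary size chosen earlier
NUM_SEGMENTS = 2     # assumption: two segment ids, inferred from 1536 / 768


def embedding_params(num_rows, dim):
    """An embedding layer stores one dim-wide vector per row."""
    return num_rows * dim


print(embedding_params(NUM_TOKENS, EMBED_DIM))    # 768000
print(embedding_params(NUM_SEGMENTS, EMBED_DIM))  # 1536
```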

Looks like there is something wrong. If we make use of this model, we will have to train 87 million parameters. That is a lot of parameters. What is going on?

BERT gives us a pre-trained model, and we can fine-tune it to suit our needs. Perhaps that explains why we are training (or fine-tuning) so many parameters.

But when we use pre-trained models for computer vision, we normally don't need to train so many parameters (unless we go overboard and add many dense layers with lots of neurons and a `Flatten` layer somewhere).

But then, this is a small vocabulary. In English, we often work with a vocabulary of 20,000 words. Let's see what happens if we instantiate a model with such a dictionary ...

In [7]:
_m = get_model(
    token_num=20000
)
_m.summary()  # summary() prints the table itself and returns None

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, 512)          0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, 512)          0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 512, 768), ( 15360000    Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 512, 768)     1536        Input-Segment[0][0]              
____________________________________________________________________________________________

Okay, a pre-trained model for a 20,000-word vocabulary would require us to train 102 million parameters, which is an extra 15 million compared with the 87 million of the 1000-word model.
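The gap between the two models can be estimated from the `Embedding-Token` rows of the two summaries. The short calculation below uses only the numbers printed above. It suggests that the token embedding accounts for most of the difference, although the truncated summaries do not show whether other vocabulary-sized layers contribute as well.

```python
# Param # values copied from the Embedding-Token rows of the two summaries.
token_embedding_1k = 768_000
token_embedding_20k = 15_360_000

# Extra weights introduced by growing the vocabulary from 1000 to 20000.
extra = token_embedding_20k - token_embedding_1k
print(f"{extra:,}")  # 14,592,000

# Cost per additional vocabulary entry: one 768-wide embedding row.
per_token = extra // (20_000 - 1_000)
print(per_token)  # 768
```

Every extra word in the vocabulary therefore costs 768 trainable weights in the token embedding alone.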

Let's see if we can get any insights from the model architecture.

Looking at the architecture, we encounter the following layers:
1. InputLayer called Input-Token with output shape (None, 512)
2. InputLayer called