## How-to Guide: Using a pip package to fine-tune a BERT model

Authors: [Chen Chen](https://github.com/chenGitHuber), [Claire Yao](https://github.com/claireyao-fen)

In this example, we work through fine-tuning a BERT model using the TensorFlow Model Garden pip package (`tf-models-nightly`).

## License

Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

## Learning objectives

In this Colab notebook, you will learn how to fine-tune a BERT model using the TensorFlow Model Garden pip package.

## Enable GPU acceleration

Enable a GPU runtime for better performance:
*   Navigate to Edit.
*   Select Notebook settings.
*   Select GPU from the "Hardware accelerator" drop-down list, then save.

## Install and import

### Install the TensorFlow Model Garden pip package

*  `tf-models-nightly` is the nightly Model Garden package, built automatically every day.
*  pip installs the Model Garden models and their dependencies automatically.

In [0]:
!pip install tf-models-nightly

### Import TensorFlow and other libraries

In [0]:
import os

import numpy as np
import tensorflow as tf

from official.modeling import tf_utils
from official.nlp import optimization
from official.nlp.bert import configs as bert_configs
from official.nlp.bert import tokenization
from official.nlp.data import classifier_data_lib
from official.nlp.modeling import losses
from official.nlp.modeling import models
from official.nlp.modeling import networks

## Preprocess the raw data and output TFRecord files

### Introduction to the dataset

The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

*   Number of labels: 2.
*   Size of training dataset: 3668.
*   Size of evaluation dataset: 408.
*   Maximum sequence length of training and evaluation dataset: 128.
*   For details, see https://www.tensorflow.org/datasets/catalog/glue#gluemrpc

### Get dataset from TensorFlow Datasets (TFDS)

This example uses the GLUE MRPC dataset from TFDS: https://www.tensorflow.org/datasets/catalog/glue#gluemrpc.

### Preprocess the data and write it to TFRecord files

In [0]:
gs_folder_bert = "gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12"

# Set up the tokenizer used to generate the TensorFlow dataset
tokenizer = tokenization.FullTokenizer(
    vocab_file=os.path.join(gs_folder_bert, "vocab.txt"), do_lower_case=True)

# Set up the processor that reads GLUE MRPC sentence pairs from TFDS
processor = classifier_data_lib.TfdsProcessor(
    tfds_params="dataset=glue/mrpc,text_key=sentence1,text_b_key=sentence2",
    process_text_fn=tokenization.convert_to_unicode)

# Set up output paths for the training and evaluation TFRecord files
train_data_output_path = "./mrpc_train.tf_record"
eval_data_output_path = "./mrpc_eval.tf_record"

# Generate the training and evaluation data and save it to TFRecord files
input_meta_data = classifier_data_lib.generate_tf_record_from_data_file(
    processor=processor,
    data_dir=None,  # It is `None` because data is from tfds, not local dir.
    tokenizer=tokenizer,
    train_data_output_path=train_data_output_path,
    eval_data_output_path=eval_data_output_path,
    max_seq_length=128)
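
To make the record layout concrete, the sketch below builds the three feature vectors stored for one sentence pair, following the standard BERT input format (`[CLS] A [SEP] B [SEP]`, padded to `max_seq_length`). The toy vocabulary, the pre-split tokens, and the helper name `encode_pair` are hypothetical and exist only for illustration. The real pipeline uses `FullTokenizer` with the checkpoint's `vocab.txt`, and this sketch raises an error on long inputs instead of truncating them.

```python
def encode_pair(tokens_a, tokens_b, vocab, max_seq_length):
    """Builds input_ids, input_mask and segment_ids for one sentence pair.

    Hypothetical helper: a simplified illustration of the BERT input format,
    not the Model Garden implementation.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_seq_length:
        raise ValueError("Sentence pair is longer than max_seq_length.")

    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab[token] for token in tokens]
    input_mask = [1] * len(tokens)  # 1 marks real tokens, 0 marks padding.

    padding = [0] * (max_seq_length - len(tokens))
    return {
        "input_ids": input_ids + padding,
        "input_mask": input_mask + padding,
        "segment_ids": segment_ids + padding,
    }


# Toy vocabulary (an assumption; the real one has about 30,000 WordPiece entries).
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "he": 3, "left": 4, "departed": 5}

features = encode_pair(["he", "left"], ["he", "departed"], toy_vocab,
                       max_seq_length=10)
for name, values in features.items():
    print(name, values)
```

In the notebook, `max_seq_length` is 128, so each of the three features is a vector of 128 integers, which is what `tf.io.FixedLenFeature([seq_length], tf.int64)` expects when the records are read back.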

### Create tf.data.Dataset objects for training and evaluation

In [0]:
def create_classifier_dataset(file_path, seq_length, batch_size, is_training):
  """Creates an input dataset from TFRecord files for training or evaluation."""
  dataset = tf.data.TFRecordDataset(file_path)
  if is_training:
    # Shuffle with a small buffer and repeat indefinitely; `fit` bounds each
    # epoch with `steps_per_epoch`.
    dataset = dataset.shuffle(100)
    dataset = dataset.repeat()

  def decode_record(record):
    name_to_features = {
      'input_ids': tf.io.FixedLenFeature([seq_length], tf.int64),
      'input_mask': tf.io.FixedLenFeature([seq_length], tf.int64),
      'segment_ids': tf.io.FixedLenFeature([seq_length], tf.int64),
      'label_ids': tf.io.FixedLenFeature([], tf.int64),
    }
    return tf.io.parse_single_example(record, name_to_features)

  def _select_data_from_record(record):
    x = {
        'input_word_ids': record['input_ids'],
        'input_mask': record['input_mask'],
        'input_type_ids': record['segment_ids']
    }
    y = record['label_ids']
    return (x, y)

  dataset = dataset.map(decode_record,
                        num_parallel_calls=tf.data.experimental.AUTOTUNE)
  dataset = dataset.map(
      _select_data_from_record,
      num_parallel_calls=tf.data.experimental.AUTOTUNE)
  dataset = dataset.batch(batch_size, drop_remainder=is_training)
  dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
  return dataset

# Set up batch sizes
batch_size = 32
eval_batch_size = 32

# Build the training and evaluation datasets
training_dataset = create_classifier_dataset(
    train_data_output_path,
    input_meta_data['max_seq_length'],
    batch_size,
    is_training=True)

evaluation_dataset = create_classifier_dataset(
    eval_data_output_path,
    input_meta_data['max_seq_length'],
    eval_batch_size,
    is_training=False)


## Create, compile and train the model

### Construct a BERT model

Here, a BERT model is constructed from the parameters in a JSON config file. The `bert_config` defines the BERT encoder, and `BertClassifier` wraps that encoder in a Keras model that predicts *num_classes* outputs from inputs with maximum sequence length *max_seq_length*.

In [0]:
bert_config_file = os.path.join(gs_folder_bert, "bert_config.json")
bert_config = bert_configs.BertConfig.from_json_file(bert_config_file)

bert_encoder = networks.TransformerEncoder(
    vocab_size=bert_config.vocab_size,
    hidden_size=bert_config.hidden_size,
    num_layers=bert_config.num_hidden_layers,
    num_attention_heads=bert_config.num_attention_heads,
    intermediate_size=bert_config.intermediate_size,
    activation=tf_utils.get_activation(bert_config.hidden_act),
    dropout_rate=bert_config.hidden_dropout_prob,
    attention_dropout_rate=bert_config.attention_probs_dropout_prob,
    sequence_length=input_meta_data['max_seq_length'],
    max_sequence_length=bert_config.max_position_embeddings,
    type_vocab_size=bert_config.type_vocab_size,
    embedding_width=bert_config.embedding_size,
    initializer=tf.keras.initializers.TruncatedNormal(
        stddev=bert_config.initializer_range))

classifier_model = models.BertClassifier(
    bert_encoder,
    num_classes=input_meta_data['num_labels'],
    dropout_rate=bert_config.hidden_dropout_prob,
    initializer=tf.keras.initializers.TruncatedNormal(
        stddev=bert_config.initializer_range))
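
As a rough sanity check on what the config describes, the sketch below estimates the encoder's parameter count from the config fields alone. The helper name `bert_encoder_param_count` is hypothetical, and the numeric values are assumed to match the BERT-Base settings in `bert_config.json` (12 layers, hidden size 768). It also assumes the standard BERT layout of embeddings, transformer layers and a pooling layer, so the total may differ from what the library reports if its layers are arranged differently.

```python
def dense_params(inputs, outputs):
    """Weights plus biases of one fully connected layer."""
    return inputs * outputs + outputs


def bert_encoder_param_count(vocab_size, hidden_size, num_layers,
                             intermediate_size, max_position_embeddings,
                             type_vocab_size):
    """Estimates trainable parameters of a standard BERT encoder (hypothetical helper)."""
    layer_norm = 2 * hidden_size  # scale and offset

    # Word, position and segment-type embeddings, followed by a layer norm.
    embeddings = (vocab_size + max_position_embeddings
                  + type_vocab_size) * hidden_size + layer_norm

    # Query, key, value and output projections of self-attention.
    attention = 4 * dense_params(hidden_size, hidden_size)
    feed_forward = (dense_params(hidden_size, intermediate_size)
                    + dense_params(intermediate_size, hidden_size))
    transformer_layer = attention + feed_forward + 2 * layer_norm

    # Dense layer applied to the [CLS] position.
    pooler = dense_params(hidden_size, hidden_size)
    return embeddings + num_layers * transformer_layer + pooler


# Assumed BERT-Base values; read them from `bert_config` in the notebook.
total = bert_encoder_param_count(
    vocab_size=30522, hidden_size=768, num_layers=12,
    intermediate_size=3072, max_position_embeddings=512, type_vocab_size=2)
print(f"Estimated encoder parameters: {total:,}")
```

The number of attention heads does not appear in the estimate because the heads split the hidden dimension rather than add to it.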

### Initialize the encoder from a pretrained model

In [0]:
checkpoint = tf.train.Checkpoint(model=bert_encoder)
checkpoint.restore(
    os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()

### Set up an optimizer for the model

BERT adopts the Adam optimizer with weight decay.
It also employs a learning rate schedule that first warms up from 0 and then decays to 0.

In [0]:
# Set up epochs and steps
epochs = 3
train_data_size = input_meta_data['train_data_size']
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
# Warm up over roughly the first 10% of the training steps.
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)

# Create a learning rate schedule that first warms up from 0 and then decays to 0.
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=2e-5,
    decay_steps=num_train_steps,
    end_learning_rate=0)
lr_schedule = optimization.WarmUp(
    initial_learning_rate=2e-5,
    decay_schedule_fn=lr_schedule,
    warmup_steps=warmup_steps)
optimizer = optimization.AdamWeightDecay(
    learning_rate=lr_schedule,
    weight_decay_rate=0.01,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-6,
    exclude_from_weight_decay=['LayerNorm', 'layer_norm', 'bias'])
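
The shape of this schedule can be sketched in plain Python. The function name `learning_rate_at` is hypothetical. The sketch assumes a linear warm-up, a linear decay (the default `power=1.0` of `PolynomialDecay`), and that the decay is evaluated on the global step rather than on the step count after warm-up; the installed version of `optimization.WarmUp` may differ on that last point. The step counts are the values the cell above produces for 3668 training examples, a batch size of 32 and 3 epochs.

```python
def learning_rate_at(step, peak_lr=2e-5, warmup_steps=34, num_train_steps=342):
    """Approximates the warm-up plus linear-decay schedule (hypothetical helper)."""
    if step < warmup_steps:
        # Linear warm-up from 0 towards the peak learning rate.
        return peak_lr * step / warmup_steps
    # Linear decay to 0, clipped once training reaches num_train_steps.
    progress = min(step, num_train_steps) / num_train_steps
    return peak_lr * (1.0 - progress)


for step in (0, 17, 34, 171, 342):
    print(f"step {step:3d}: learning rate = {learning_rate_at(step):.3e}")
```

Under these assumptions the learning rate never quite reaches `2e-5`: by the time warm-up ends, the decay has already reduced the target by about 10%.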

### Define metric_fn and loss_fn

The metric is accuracy, and the loss is sparse categorical cross-entropy.

In [0]:
def metric_fn():
  return tf.keras.metrics.SparseCategoricalAccuracy(
      'accuracy', dtype=tf.float32)

def classification_loss_fn(labels, logits):
  return losses.weighted_sparse_categorical_crossentropy_loss(
    labels=labels, predictions=tf.nn.log_softmax(logits, axis=-1))
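
For intuition, the loss can be written out in plain Python. The sketch below applies a log-softmax to each row of logits, picks the log-probability of the true label, and averages the negated values over the batch. Both function names are hypothetical, and the unweighted mean is an assumption about how the library function behaves when no per-example weights are passed.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one row of logits."""
    largest = max(logits)
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return [v - log_sum for v in logits]


def sparse_categorical_crossentropy(labels, batch_logits):
    """Mean negative log-probability of the true labels (hypothetical helper)."""
    per_example = [-log_softmax(logits)[label]
                   for label, logits in zip(labels, batch_logits)]
    return sum(per_example) / len(per_example)


# Made-up logits for two examples with two classes each.
uncertain = sparse_categorical_crossentropy([0], [[0.0, 0.0]])
confident = sparse_categorical_crossentropy([0, 1], [[4.0, -4.0], [-4.0, 4.0]])
print(f"uncertain prediction: {uncertain:.4f}")
print(f"confident and correct: {confident:.4f}")
```

Equal logits for two classes give a loss of ln 2, which is a useful baseline: a loss near that value during training suggests the classifier is not yet doing better than guessing.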


### Compile and train the model

In [0]:
classifier_model.compile(optimizer=optimizer,
                         loss=classification_loss_fn,
                         metrics=[metric_fn()])
classifier_model.fit(
      x=training_dataset,
      validation_data=evaluation_dataset,
      steps_per_epoch=steps_per_epoch,
      epochs=epochs,
      validation_steps=int(input_meta_data['eval_data_size'] / eval_batch_size))
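
The step counts passed to `fit` follow from simple arithmetic on the dataset sizes listed earlier, sketched below. Because the evaluation dataset is batched without `drop_remainder`, it ends with a partial batch, and truncating `validation_steps` with `int(...)` leaves that batch out of validation. Rounding up with `math.ceil` is one possible alternative, shown here as a suggestion rather than as part of the original notebook.

```python
import math

# Dataset sizes from the MRPC description; batch sizes from the notebook.
train_data_size = 3668
eval_data_size = 408
batch_size = 32
eval_batch_size = 32

steps_per_epoch = train_data_size // batch_size
validation_steps_floor = eval_data_size // eval_batch_size
validation_steps_ceil = math.ceil(eval_data_size / eval_batch_size)

# Examples in the final partial batch, which the truncated count skips.
skipped = eval_data_size - validation_steps_floor * eval_batch_size

print("steps per epoch:", steps_per_epoch)
print("validation steps (truncated):", validation_steps_floor)
print("validation steps (rounded up):", validation_steps_ceil)
print("evaluation examples skipped when truncating:", skipped)
```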

### Save the model

In [0]:
classifier_model.save('./saved_model', include_optimizer=False, save_format='tf')

## Use the trained model to predict


In [0]:
eval_predictions = classifier_model.predict(evaluation_dataset)
for prediction in eval_predictions:
  print("Predicted label id: %s" % np.argmax(prediction))
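
The per-row loop can also be written as a single vectorized call. The sketch below uses made-up logits and labels in place of real model output, and assumes the classifier returns raw logits (the loss function in this notebook applies `log_softmax` itself, which suggests so). It also shows how to turn logits into probabilities and how to compute accuracy when labels are available.

```python
import numpy as np

# Made-up logits for three sentence pairs and two classes (illustration only).
logits = np.array([[2.0, -1.0],
                   [0.5, 1.5],
                   [-0.3, 0.3]])
labels = np.array([0, 1, 0])  # Made-up ground-truth label ids.

# Predicted label id for every row at once.
predicted = np.argmax(logits, axis=-1)

# Numerically stable softmax to convert logits into class probabilities.
shifted = logits - logits.max(axis=-1, keepdims=True)
probabilities = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

accuracy = float(np.mean(predicted == labels))

print("Predicted label ids:", predicted)
print("Probability of the predicted class:", probabilities.max(axis=-1))
print("Accuracy:", accuracy)
```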