# BERT Classifier Fine Tuning using IMDB Movie Reviews

This notebook uses the BERT BFloat16 classifier training scripts from the model zoo to
fine tune a BERT classifier. The [IMDB dataset](https://ai.stanford.edu/~amaas/data/sentiment/)
supplies the movie reviews used for sentiment analysis. For more information on the large movie
review dataset, please see the [ACL 2011 paper](https://aclanthology.org/P11-1015/).

Steps:
1. [Download the IMDB dataset](#Download-the-IMDB-dataset)
2. [Convert the dataset to .tsv files](#Convert-the-dataset-to-.tsv-files)
3. [Download the BERT pretrained model checkpoints](#Download-the-BERT-pretrained-model-checkpoints)
4. [BERT classifier fine tuning using the IMDB tsv files](#BERT-classifier-fine-tuning-using-the-IMDB-tsv-files)
5. [Export the saved model](#Export-the-saved-model)
6. [Load the saved model and make predictions](#Load-the-saved-model-and-make-predictions)

## Setup and initialization

Before starting, there are a few dependencies to install and variables to set up. This notebook assumes that TensorFlow has already been installed. Environment variables define the location where the [Model Zoo for Intel® Architecture](https://github.com/intelai/models) has been cloned, as well as the directories for checkpoint files, the dataset, and output.

In [None]:
%%bash
if [[ -n "$(which apt-get)" ]]; then
  apt-get -qq update && apt-get -qq install -y unzip wget
elif [[ -n "$(which yum)" ]]; then
  yum install -y unzip wget
else
  echo "Please install wget and unzip, or manually download and extract the BERT base checkpoints to ${CHECKPOINT_DIR}"
fi
pip install --upgrade -q pip && pip install -q pandas

In [None]:
import csv
import os
import random
import sys
import tensorflow as tf

The following cell has parameters that you may want to change.

In [None]:
# Define path to the model zoo directory, if the env var is not set
if "MODEL_ZOO_DIR" not in os.environ:
    os.environ["MODEL_ZOO_DIR"] = os.path.join(os.environ["HOME"], "intelai_models")
    
# Define the directory where BERT base checkpoints will be downloaded, if the env var is not set
if "CHECKPOINT_DIR" not in os.environ:
    os.environ["CHECKPOINT_DIR"] = os.path.join(os.environ["HOME"], "bert_base_checkpoints")

# Define the directory where the IMDB dataset will be downloaded/extracted, if the env var is not set
if "DATASET_DIR" not in os.environ:
    os.environ["DATASET_DIR"] = os.path.join(os.environ["HOME"], "imdb_dataset")

# Define the directory where output logs and checkpoints will be written, if the env var is not set
if "OUTPUT_DIR" not in os.environ:
    os.environ["OUTPUT_DIR"] = os.path.join(os.environ["HOME"], "bert_classifier_output")

# The number of reviews to use for training (the IMDB dataset has 25,000 training reviews)
num_train_reviews = 1000

# The number of reviews to use for evaluation (the IMDB dataset has 25,000 test reviews)
num_test_reviews = 1000

# The number of training epochs to run
num_train_epochs = 1

# Training batch size
batch_size = 32

# Learning rate
learning_rate = "3e-5"

# Maximum total input sequence length after WordPiece tokenization (longer sequences will be truncated)
max_seq_length = 128
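
The cell above repeats one pattern four times: keep the environment variable if it is already set, otherwise fall back to a folder under `$HOME`. As a sketch, that pattern, together with the directory creation done in the next cell, could be folded into a single helper. The name `resolve_dir` and the `BERT_NOTEBOOK_EXAMPLE_DIR` variable are hypothetical and not part of the model zoo.

```python
import os
import tempfile


def resolve_dir(env_var, default_path):
    """Return the directory named by env_var, falling back to default_path.

    The env var is set to the default if it was missing, and the directory
    is created if it does not exist yet.
    """
    path = os.environ.setdefault(env_var, default_path)
    os.makedirs(path, exist_ok=True)
    return path


# Usage with a throwaway location; BERT_NOTEBOOK_EXAMPLE_DIR is a made-up name
example_default = os.path.join(tempfile.mkdtemp(), "example_output")
example_dir = resolve_dir("BERT_NOTEBOOK_EXAMPLE_DIR", example_default)
print("BERT_NOTEBOOK_EXAMPLE_DIR:", example_dir)
```

An existing value of the variable always wins over the default, which matches the behavior of the `if ... not in os.environ` checks.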

In [None]:
# Location with the model zoo repo code
model_zoo_dir = os.environ["MODEL_ZOO_DIR"]
print("MODEL_ZOO_DIR:", model_zoo_dir)
if not os.path.isdir(model_zoo_dir):
    raise ValueError("The model zoo directory ({}) does not exist. This directory should "
                     "have a clone of the model zoo repo.".format(model_zoo_dir))

# Location where the bert base uncased checkpoints will be downloaded and extracted
bert_base_checkpoint_dir = os.environ["CHECKPOINT_DIR"]
print("CHECKPOINT_DIR:", bert_base_checkpoint_dir)
if not os.path.isdir(bert_base_checkpoint_dir):
    os.makedirs(bert_base_checkpoint_dir)

# Location where the dataset will be downloaded
dataset_dir = os.environ["DATASET_DIR"]
print("DATASET_DIR:", dataset_dir)
if not os.path.isdir(dataset_dir):
    os.makedirs(dataset_dir)

# Output directory for logs and checkpoints generated during training
output_dir = os.environ["OUTPUT_DIR"]
print("OUTPUT_DIR:", output_dir)
if not os.path.isdir(output_dir):
    os.makedirs(output_dir)

if len(os.listdir(output_dir)) != 0:
    print("\nWARNING: The OUTPUT_DIR is not empty.")
    print("To prevent previously generated checkpoint files from getting picked up, you may want "
          "to provide an empty output directory.")

## Download the IMDB dataset

The next section downloads and extracts the IMDB movie review dataset. This may take a few minutes, depending on your network speed and hardware. If the `aclImdb` folder is already found in the `DATASET_DIR`, the download will be skipped.

In [None]:
%%time

if os.path.isdir(os.path.join(dataset_dir, "aclImdb")):
    imdb_dataset_dir = os.path.join(dataset_dir, "aclImdb")
    print("Skipping download, since the imdb dataset folder was already found")
else:
    imdb_download_url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    imdb_file_name = os.path.basename(imdb_download_url)

    dataset_file = tf.keras.utils.get_file(imdb_file_name, imdb_download_url, untar=True,
                                           cache_dir=dataset_dir, cache_subdir="")
    imdb_dataset_dir = os.path.join(os.path.dirname(dataset_file), "aclImdb")

    if not os.path.isdir(imdb_dataset_dir):
        raise RuntimeError("The IMDB dataset folder could not be found at {}.".format(imdb_dataset_dir))
    
print("IMDB dataset dir:", imdb_dataset_dir)
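
As an optional sanity check after the download, the extracted folder can be scanned to confirm that it holds `train` and `test` folders, each with `neg` and `pos` subfolders of `.txt` reviews. The sketch below uses a hypothetical helper, `count_reviews`, and runs against a tiny stand-in folder tree so that it does not depend on the real dataset.

```python
import os
import tempfile


def count_reviews(imdb_dir):
    """Count the .txt review files per split and sentiment folder."""
    counts = {}
    for split in ("train", "test"):
        for sentiment in ("neg", "pos"):
            folder = os.path.join(imdb_dir, split, sentiment)
            if os.path.isdir(folder):
                counts[(split, sentiment)] = sum(
                    1 for name in os.listdir(folder) if name.endswith(".txt"))
            else:
                # A missing folder is reported as zero reviews
                counts[(split, sentiment)] = 0
    return counts


# Build a small stand-in folder tree with a single positive training review
demo_dir = tempfile.mkdtemp()
for split in ("train", "test"):
    for sentiment in ("neg", "pos"):
        os.makedirs(os.path.join(demo_dir, split, sentiment))
with open(os.path.join(demo_dir, "train", "pos", "0_9.txt"), "w") as review_file:
    review_file.write("A great film.")

demo_counts = count_reviews(demo_dir)
print(demo_counts)
```

With the real dataset, `count_reviews(imdb_dataset_dir)` should report 12,500 files in each of the four folders, since the 25,000 reviews in each split are evenly balanced between negative and positive.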

## Convert the dataset to .tsv files

The BERT training scripts expect the dataset to be in .tsv files. The downloaded dataset has folders for `train` and `test` files, and within those folders there are folders named `neg` and `pos` which have negative and positive movie reviews as `.txt` files.

The code below shuffles the list of text files and then, for each entry needed for training and testing in this example, randomly picks a positive or negative review.

We use the IMDB data that was just downloaded to create `train.tsv`, `dev.tsv`, and `test.tsv`. In these `.tsv` files, column 1 has the label (`0` for negative and `1` for positive) and column 3 has the movie review text. Columns 0 and 2 are unused.

In [None]:
tsv_dir = os.path.join(dataset_dir, "imdb_tsv")
if not os.path.isdir(tsv_dir):
    os.makedirs(tsv_dir)
    
for data_folder in ["train", "test"]:
    counts = [0, 0]
    file_list = [os.listdir(os.path.join(imdb_dataset_dir, data_folder, "neg")),
                 os.listdir(os.path.join(imdb_dataset_dir, data_folder, "pos"))]
    random.shuffle(file_list[0])
    random.shuffle(file_list[1])
    pos_neg = ["neg", "pos"]
    num_reviews = num_train_reviews if data_folder == "train" else num_test_reviews
    
    # Create a dev.tsv and test.tsv from the test data
    file_names = [data_folder]
    if data_folder == "test":
        file_names = ["test", "dev"]
    
    for file in file_names:
        tsv_file = os.path.join(tsv_dir, "{}.tsv".format(file))

        with open(tsv_file, "w") as out_tsv:
            tsv_writer = csv.writer(out_tsv, delimiter='\t')

            for _ in range(num_reviews):
                # Randomly pick a negative (0) or positive (1) review
                rand_int = random.randint(0, 1)
                label = str(rand_int)
                txt_file = os.path.join(imdb_dataset_dir, data_folder, pos_neg[rand_int],
                                        file_list[rand_int][counts[rand_int]])
                # Advance the per-class index so that no review is used twice
                counts[rand_int] += 1

                with open(txt_file, "r") as data_file:
                    tsv_writer.writerow(['', label, '', data_file.read()])
                
        print("Wrote {} reviews to {}".format(num_reviews, tsv_file))
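
To check the column layout described above, a generated file can be read back with the same `csv` module. The sketch below is illustrative only: `read_labeled_reviews` is a hypothetical helper, and it runs on a two-row sample file rather than on the real `train.tsv`.

```python
import csv
import os
import tempfile
from collections import Counter


def read_labeled_reviews(tsv_path):
    """Return (label, text) pairs from a four-column TSV file."""
    pairs = []
    with open(tsv_path, "r", newline="") as in_tsv:
        for row in csv.reader(in_tsv, delimiter="\t"):
            # Column 1 holds the label and column 3 holds the review text
            pairs.append((row[1], row[3]))
    return pairs


# Write a two-row sample file with the same layout as the notebook's .tsv files
sample_path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
with open(sample_path, "w", newline="") as out_tsv:
    writer = csv.writer(out_tsv, delimiter="\t")
    writer.writerow(["", "1", "", "Loved it."])
    writer.writerow(["", "0", "", "Dull and far too long."])

sample_pairs = read_labeled_reviews(sample_path)
label_counts = Counter(label for label, _ in sample_pairs)
print(label_counts)
```

Because each review is assigned to a class at random, the label counts for a real file should be roughly, but not exactly, balanced.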

## Download the BERT pretrained model checkpoints

Download the `uncased_L-12_H-768_A-12` checkpoints to the `CHECKPOINT_DIR` directory and extract the files. The download is skipped if the file already exists.

In [None]:
%%bash

mkdir -p ${CHECKPOINT_DIR}
cd ${CHECKPOINT_DIR}
if [ ! -f "uncased_L-12_H-768_A-12.zip" ]; then
    wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
else
    echo "Skipping download since uncased_L-12_H-768_A-12.zip already exists"
fi

if [ ! -d "uncased_L-12_H-768_A-12" ]; then
    unzip uncased_L-12_H-768_A-12.zip
fi
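
Later steps read `vocab.txt`, `bert_config.json`, and the `bert_model.ckpt` checkpoint prefix from the extracted folder, so it can help to confirm that they are present before training. The sketch below uses a hypothetical helper, `missing_checkpoint_files`, and a stand-in folder that is deliberately missing one file.

```python
import os
import tempfile


def missing_checkpoint_files(model_dir):
    """Return the names of expected checkpoint files that are absent from model_dir."""
    present = set(os.listdir(model_dir)) if os.path.isdir(model_dir) else set()
    missing = [name for name in ("vocab.txt", "bert_config.json") if name not in present]
    # bert_model.ckpt is a prefix shared by several files, not a single file
    if not any(name.startswith("bert_model.ckpt") for name in present):
        missing.append("bert_model.ckpt*")
    return missing


# Stand-in folder that has a vocab file and a checkpoint index, but no config
demo_model_dir = tempfile.mkdtemp()
for name in ("vocab.txt", "bert_model.ckpt.index"):
    open(os.path.join(demo_model_dir, name), "w").close()

print(missing_checkpoint_files(demo_model_dir))
```

For the real download, the call would be `missing_checkpoint_files(os.path.join(bert_base_checkpoint_dir, "uncased_L-12_H-768_A-12"))`, and an empty list means that all three items were found.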

## BERT classifier fine tuning using the IMDB tsv files

Run the `launch_benchmark.py` script to start BERT BFloat16 training, using the BERT base `uncased_L-12_H-768_A-12` checkpoints that were just downloaded as the initial checkpoints. The directory with the IMDB tsv files that were created earlier is used as the dataset directory. Checkpoints that are generated during training will be written to the `OUTPUT_DIR`.

In [None]:
# Set env vars to pass as parameters to the model zoo launch_benchmark.py script
os.environ["NUM_TRAIN_EPOCHS"] = str(num_train_epochs)
os.environ["BATCH_SIZE"] = str(batch_size)
os.environ["LEARNING_RATE"] = str(learning_rate)
os.environ["MAX_SEQ_LENGTH"] = str(max_seq_length)
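
For a sense of how long training will run, the number of optimizer steps can be estimated from these parameters. The formula below follows the original BERT `run_classifier.py` (examples divided by batch size, times epochs, truncated to an integer, with a default warmup proportion of 0.1). It is an assumption that the model zoo script computes the step count the same way, and `estimate_train_steps` is a hypothetical helper.

```python
def estimate_train_steps(num_examples, batch_size, num_epochs, warmup_proportion=0.1):
    """Estimate the total and warmup optimizer steps for a fine tuning run."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# Values matching this notebook's defaults: 1000 reviews, batch size 32, 1 epoch
train_steps, warmup_steps = estimate_train_steps(1000, 32, 1)
print("Training steps:", train_steps)
print("Warmup steps:", warmup_steps)
```

With the defaults this gives 31 training steps, 3 of which are warmup. Raising `num_train_reviews` or `num_train_epochs` scales the step count proportionally.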

In [None]:
%%time
!python ${MODEL_ZOO_DIR}/benchmarks/launch_benchmark.py \
  --model-name=bert_large \
  --precision=bfloat16 \
  --mode=training \
  --framework=tensorflow \
  --batch-size=${BATCH_SIZE} \
  --output-dir ${OUTPUT_DIR} \
  -- train-option=Classifier \
  task-name=cola \
  do-train=true \
  do-eval=true \
  data-dir=${DATASET_DIR}/imdb_tsv \
  vocab-file=${CHECKPOINT_DIR}/uncased_L-12_H-768_A-12/vocab.txt \
  config-file=${CHECKPOINT_DIR}/uncased_L-12_H-768_A-12/bert_config.json \
  init-checkpoint=${CHECKPOINT_DIR}/uncased_L-12_H-768_A-12/bert_model.ckpt \
  max-seq-length=${MAX_SEQ_LENGTH} \
  learning-rate=${LEARNING_RATE} \
  num-train-epochs=${NUM_TRAIN_EPOCHS} \
  optimized_softmax=True \
  experimental_gelu=False \
  do-lower-case=True

Check what files are in the output directory:

In [None]:
!ls -l $OUTPUT_DIR

## Export the saved model

Use the checkpoint files that were generated during fine tuning to export a `saved_model.pb`.

In [None]:
!rm -rf ${OUTPUT_DIR}/frozen
!python ${MODEL_ZOO_DIR}/models/language_modeling/tensorflow/bert_large/inference/export_classifier.py \
  --task_name=cola \
  --output_dir=${OUTPUT_DIR} \
  --precision=bfloat16 \
  --bert_config_file=${CHECKPOINT_DIR}/uncased_L-12_H-768_A-12/bert_config.json \
  --saved_model

Check to make sure that there's a saved model in the `$OUTPUT_DIR/frozen` directory:

In [None]:
!ls -l $OUTPUT_DIR/frozen

## Load the saved model and make predictions

Load the saved model from the output directory:

In [None]:
tf.compat.v1.enable_resource_variables()
reloaded_model = tf.saved_model.load(os.path.join(output_dir, "frozen"))

We use classes and functions from the model zoo's BERT training directory to set up the tokenizer, and define a function that creates input examples from the movie review sentences.

In [None]:
sys.path.append(os.path.join(os.environ["MODEL_ZOO_DIR"], "models", "language_modeling",
                             "tensorflow", "bert_large", "training", "bfloat16"))

import tokenization
from run_classifier import InputExample
from run_classifier import convert_examples_to_features
from run_classifier import convert_single_example

tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.WARN)
vocab_file = os.path.join(bert_base_checkpoint_dir, "uncased_L-12_H-768_A-12", "vocab.txt")
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)
        
# Function to create input examples from movie review sentences
def create_examples(sentences):
    examples = []
    
    for (i, sentence) in enumerate(sentences):
        # Pass the sentence (converted to unicode) as the text_a arg. The label is only a placeholder
        # when predicting, so always pass "0"
        text_a = tokenization.convert_to_unicode(sentence)
        examples.append(InputExample(guid=i, text_a=text_a, text_b=None, label="0"))
        
    return examples
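
To make the shape of the features concrete, here is a toy version of the single-sentence layout used by the original BERT `run_classifier.py`: `[CLS]`, the tokens, `[SEP]`, then zero padding up to the maximum sequence length. The whitespace tokenizer and the small vocabulary are stand-ins invented for illustration (the real pipeline uses WordPiece and `vocab.txt`), and it is an assumption that the model zoo's copy of the script keeps the same layout.

```python
def toy_features(sentence, vocab, max_seq_length):
    """Build input_ids, input_mask, and segment_ids for one sentence."""
    # Stand-in tokenizer: lowercase and split on whitespace (not WordPiece)
    tokens = sentence.lower().split()
    # Truncate so that two positions remain for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens[:max_seq_length - 2] + ["[SEP]"]

    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    input_mask = [1] * len(input_ids)   # 1 marks a real token
    segment_ids = [0] * len(input_ids)  # single-sentence input: all segment 0

    # Zero-pad every list up to max_seq_length
    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding, segment_ids + padding


# Hypothetical toy vocabulary; id 0 is used for padding
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "the": 4, "movie": 5}
ids, mask, segments = toy_features("The movie was fantastic", toy_vocab, max_seq_length=8)
print(ids)       # [2, 4, 5, 1, 1, 3, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Every feature list has exactly `max_seq_length` entries, which is why the inputs can be stacked into fixed-shape tensors for batch prediction.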

Next, we set up a list of movie review sentences to use for predicting. We convert these sentences to input examples, then convert the input examples to features. We send the features as a batch to the saved model's evaluation function and get back the prediction results.

In [None]:
# List of sample movie review sentences
movie_reviews = [
    "The movie was fantastic",
    "The worst movie ever!",
    "Captivating and creative",
    "Meh",
    "I'd rather have a cat claw out my eyes than watch that again",
    "Full of action and suspense",
    "Overall pretty boring"
]

label_list=["0", "1"]
labels=["Negative", "Positive"]
num_examples = len(movie_reviews)

# Convert the movie review sentences to examples and then convert the examples to features
input_examples = create_examples(movie_reviews)
features = convert_examples_to_features(input_examples, label_list, max_seq_length, tokenizer)

# Create lists for all the feature inputs so that we can do the prediction as a batch
all_input_ids = []
all_input_mask = []
all_segment_ids = []
all_label_ids = []

for feature in features:
    all_input_ids.append(feature.input_ids)
    all_input_mask.append(feature.input_mask)
    all_segment_ids.append(feature.segment_ids)
    all_label_ids.append(feature.label_id)

tf_input_ids = tf.constant(all_input_ids, shape=[num_examples, max_seq_length], dtype=tf.int32)
tf_input_mask = tf.constant(all_input_mask, shape=[num_examples, max_seq_length], dtype=tf.int32)
tf_segment_ids = tf.constant(all_segment_ids, shape=[num_examples, max_seq_length], dtype=tf.int32)
tf_label_ids = tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32)

# Use the 'eval' signature from the reloaded model to run predictions on the feature tensors
results = reloaded_model.signatures["eval"](
    input_ids=tf_input_ids, input_mask=tf_input_mask, segment_ids=tf_segment_ids, label_ids=tf_label_ids)

# Print the results
for (i, sentence) in enumerate(movie_reviews):
    print("Movie review:\t{}".format(sentence))
    print("Prediction:\t{}\n".format(labels[results["probabilities"][i].numpy().argmax()]))
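
The last step above picks, for each review, the class with the highest probability and maps its index to a label name. The same idea in plain Python, as a sketch: `label_predictions` is a hypothetical helper, and the probabilities are made up for illustration rather than taken from the fine-tuned model.

```python
def label_predictions(probabilities, labels):
    """Map each row of class probabilities to the label with the highest score."""
    predictions = []
    for row in probabilities:
        # Index of the largest probability in this row (the argmax)
        best_index = max(range(len(row)), key=lambda i: row[i])
        predictions.append((labels[best_index], row[best_index]))
    return predictions


# Made-up probabilities; each row is [P(negative), P(positive)]
example_probabilities = [[0.10, 0.90], [0.75, 0.25]]
for label, score in label_predictions(example_probabilities, ["Negative", "Positive"]):
    print("{}\t(probability {:.2f})".format(label, score))
```

Returning the score alongside the label makes it easier to spot low-confidence predictions, which are more likely after a short run on a small subset of the training reviews.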