# BERT Fine-Tuned Notebook
## W266 Final Project
### Game of Thrones Text Classification
### T. P. Goter
### Fall 2019

This notebook is used to perform the baseline, finetuned BERT supervised text classification. The original UDA process utilized a Python script wrapped in a bash shell script. This notebook was generated in order to better show and annotate the process.

## Import Data Libraries

In [2]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import json
import os
import tensorflow as tf

import uda
from bert import modeling
from utils import proc_data_utils
from utils import raw_data_utils

import yaml
import pprint

from absl import app
from absl import logging

## Define Some Options
This section replaces passing the input parameters as command line arguments. This section is very important. It controls the entire model. See the dictionary below.

### Task Options:
- **do_train:** Boolean of whether we are training
- **do_eval:** Boolean of whether we are just evaluating

### Training Options:
- **sup_train_data_dir:** Input directory for supervised data. This should be set to "./Data/proc_data/train_##" where the ## is one of the subsets of training data generated from the prepro_ALL.csh script.
- **eval_data_dir:**  The input data dir of the evaluation data. This should be the path to the development data with which we will do hyperparameter tuning. We can change this to the test data directory once we are ready for final evaluation. The dev data path is: "./Data/proc_data/dev"
- **unsup_data_dir:** The input data dir of the unsupervised data. Path for the unsupervised, augmented data. This should be equal to "./Data/proc_data/unsup"
- **bert_config_file:** Absolute path to the json file corresponding to the pre-trained BERT model. For us this is: "./bert_pretrained/bert_base/bert_config.json"
- **vocab_file:** The vocabulary file that the BERT model was trained on. This should be equal to "./bert_pretrained/bert_base/vocab.txt"
- **init_checkpoint:** Initial checkpoint from the pre-trained BERT model. This should be equal to: "./bert_pretrained/bert_base/bert_model.ckpt"
- **task_name:** The name of the task to train. This should be equal to "GoT"
- **model_dir:** The output directory where the model checkpoints will be written. This will be set to "models" followed by a case specific identifier.

### Model configuration
- **use_one_hot_embeddings:** Boolean, default: True, If True, tf.one_hot will be used for embedding lookups, otherwise tf.nn.embedding_lookup will be used. On TPUs, this should be True since it is much faster."
- **max_seq_length":** Integer, default = 128, The maximum total sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded. Note, GoT data was processed to be on-average close to this length to minimize lost data.
- **model_dropout:** Float, default = -1 (i.e., no dropout). Dropout rate for both the attention and the hidden states.

### Training hyper-parameters
- **train_batch_size:** Integer, default = 32. Based on the discussion here https://github.com/google-research/bert#out-of-memory-issues. 32 is probably the largest we can run with 11 GB of RAM while using BERT base with a maximum sequence length of 128.
- **eval_batch_size:** Integer, default = 8, "Base batch size for evaluation."
- **save_checkpoints_num:** Integer, default = 20, Number of checkpoints to save during training.
- **iterations_per_loop:** Integer, default = 200, Number of steps to make in each estimator call.
- **num_train_steps:** Integer, no default, number of training steps

### Optimizer hyperparameters
- **learning_rate:** Float, default = 2e-5, The initial learning rate for Adam Optimizer
- **num_warmup_steps:** Integer, no default, Number of warmup steps
- **clip_norm:** Float, default= 1.0, Gradient clip hyperparameter.

### UDA Options:
- **unsup_ratio:** Integer - ratio between unsupervised batch size and supervised batch size. If zero - dont use
- **aug_ops:** String - what augmentation procedure do you want to run
- **aug_copy:** Integer - how many augmentations per example are to be generated
- **uda_coeff:** Float - default 1 - This is the coefficient on the UDA loss. Basically you can rely more or less on the UDA loss during the supervised training. The UDA paper generally kept this at 1
- **tsa:** String - Annealing schedule to use. Options provided are "" none, linear_schedule, log_schedule, exp_schedule
- **uda_softmax_temp:** Float, default -1, A smaller temperature will accentuate differences in probabilities. Low temps were used in the UDA paper for cases with low numbers of labeled data, after masking out uncertain predictions.
- **uda_confidence_thresh:** Float, default -1, Threshold value above which the consistency loss term from the UDA is used. Basically ensures we are using loss from random guesses.

### TPU and GPU Options:
- **use_tpu:** Boolean - self-explanatory - it affects how the model is run. If we run in colab this could be important. False means use CPU or GPU. We will default to FALSE.
- **tpu_name:** String - address of the tpu
- **gcp_project:** String - project name when using TPU
- **tpu_zone:** String - can be set or detected
- **master:** Address of the TPU master, if applicable



### Defaults

The defaults below should not be changed. Note that a config file will be read in after this in order to update these if desired.

In [19]:
options = {
### Training Options:
'bert_config_file' : "./bert_pretrained/bert_base/bert_config.json",
'vocab_file' : "./bert_pretrained/bert_base/vocab.txt",
'init_checkpoint' : "./bert_pretrained/bert_base/bert_model.ckpt",
'task_name' : "GoT",

### Model configuration
'use_one_hot_embeddings' : True,
'max_seq_length' : 128,
'model_dropout' : -1 ,

### Training hyper-parameters
'train_batch_size' : 32,
'eval_batch_size' : 8,
'save_checkpoints_num' : 20,
'iterations_per_loop' : 200,

### Optimizer hyperparameters
'learning_rate' : 2e-5,
'clip_norm' : 1.0,

### UDA Options - only important if using UDA
'uda_coeff' : 1 ,
'tsa' : "" ,
'uda_softmax_temp' : -1,
'uda_confidence_thresh' : -1,

### TPU and GPU Options:
'use_tpu': False
}

## Set the Case to Run
This will ensure that different configurations are being controlled and saved separately. Just load in the correct yaml file that specifies all of the parameters.

In [27]:
# Set the config file to load - controls what is run
config = 'base_20'
with open('./config/' + config + '.yml', 'r') as config_in:
    options_from_file = yaml.safe_load(config_in)
    print()
    print("="*50 + "\nCase Specific Options: \n" + "="*50)
    pprint.pprint(options_from_file)

# merge dictionaries    
options.update(options_from_file)

#
print()
print("="*50 + "\nFull Listing of Options: \n" + "="*50)
pprint.pprint(options)


Case Specific Options: 
{'bert_config_file': './bert_pretrained/bert_base/bert_config.json',
 'do_eval': True,
 'do_train': True,
 'eval_data_dir': 'Data/proc_data/GoT/dev',
 'init_checkpoint': './bert_pretrained/bert_base/bert_model.ckpt',
 'learning_rate': '3e-05',
 'model_dir': 'model/base_20',
 'num_train_steps': 50,
 'num_warmup_steps': 20,
 'sup_train_data_dir': '/Data/proc_data/GoT/train_20',
 'task_name': 'GoT',
 'use_tpu': False,
 'vocab_file': './bert_pretrained/bert_base/vocab.txt'}

Full Listing of Options: 
{'bert_config_file': './bert_pretrained/bert_base/bert_config.json',
 'clip_norm': 1.0,
 'do_eval': True,
 'do_train': True,
 'eval_batch_size': 8,
 'eval_data_dir': 'Data/proc_data/GoT/dev',
 'init_checkpoint': './bert_pretrained/bert_base/bert_model.ckpt',
 'iterations_per_loop': 200,
 'learning_rate': '3e-05',
 'max_seq_length': 128,
 'model_dir': 'model/base_20',
 'model_dropout': -1,
 'num_train_steps': 50,
 'num_warmup_steps': 20,
 'save_checkpoints_num': 20,
 's

In [31]:
# Record informational logs
logging.set_verbosity(logging.INFO)

# Specify the task as that controls how the data is read and cleaned
processor = raw_data_utils.get_processor(options['task_name'])

# Read in the labels
label_list = processor.get_labels()

# Check the labels  -  they should be 1 through 5
print(label_list)

# Read the BertConfig File
bert_config = modeling.BertConfig.from_json_file(
      options['bert_config_file'],
      options['model_dropout'])

['1', '2', '3', '4', '5']


AttributeError: module 'tensorflow' has no attribute 'gfile'

In [28]:
  tf.io.gfile.makedirs(FLAGS.model_dir)

  flags_dict = app.flags.FLAGS.flag_values_dict()
  with tf.io.read_file(os.path.join(FLAGS.model_dir, "FLAGS.json"), "w") as ouf:
    json.dump(flags_dict, ouf)

  logging.info("warmup steps {}/{}".format(
      FLAGS.num_warmup_steps, FLAGS.num_train_steps))

  save_checkpoints_steps = FLAGS.num_train_steps // FLAGS.save_checkpoints_num
  logging.info("setting save checkpoints steps to {:d}".format(
      save_checkpoints_steps))

  FLAGS.iterations_per_loop = min(save_checkpoints_steps,
                                  FLAGS.iterations_per_loop)
  if FLAGS.use_tpu and FLAGS.tpu_name:
    tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
        FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
  else:
    tpu_cluster_resolver = None
  # if not FLAGS.use_tpu and FLAGS.num_gpu > 1:
  #   train_distribute = tf.contrib.distribute.MirroredStrategy(
  #       num_gpus=FLAGS.num_gpu)
  # else:
  #   train_distribute = None

  is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
  run_config = tf.contrib.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      master=FLAGS.master,
      model_dir=FLAGS.model_dir,
      save_checkpoints_steps=save_checkpoints_steps,
      keep_checkpoint_max=1000,
      # train_distribute=train_distribute,
      tpu_config=tf.contrib.tpu.TPUConfig(
          iterations_per_loop=FLAGS.iterations_per_loop,
          per_host_input_for_training=is_per_host))

  model_fn = uda.model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=FLAGS.init_checkpoint,
      learning_rate=FLAGS.learning_rate,
      clip_norm=FLAGS.clip_norm,
      num_train_steps=FLAGS.num_train_steps,
      num_warmup_steps=FLAGS.num_warmup_steps,
      use_tpu=FLAGS.use_tpu,
      use_one_hot_embeddings=FLAGS.use_one_hot_embeddings,
      num_labels=len(label_list),
      unsup_ratio=FLAGS.unsup_ratio,
      uda_coeff=FLAGS.uda_coeff,
      tsa=FLAGS.tsa,
      print_feature=False,
      print_structure=False,
  )

  # If TPU is not available, this will fall back to normal Estimator on CPU
  # or GPU.
  estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=FLAGS.use_tpu,
      model_fn=model_fn,
      config=run_config,
      params={"model_dir": FLAGS.model_dir},
      train_batch_size=FLAGS.train_batch_size,
      eval_batch_size=FLAGS.eval_batch_size)

  if FLAGS.do_train:
    logging.info("  >>> sup data dir : {}".format(FLAGS.sup_train_data_dir))
    if FLAGS.unsup_ratio > 0:
      logging.info("  >>> unsup data dir : {}".format(
          FLAGS.unsup_data_dir))

    train_input_fn = proc_data_utils.training_input_fn_builder(
        FLAGS.sup_train_data_dir,
        FLAGS.unsup_data_dir,
        FLAGS.aug_ops,
        FLAGS.aug_copy,
        FLAGS.unsup_ratio)

  if FLAGS.do_eval:
    logging.info("  >>> dev data dir : {}".format(FLAGS.eval_data_dir))
    eval_input_fn = proc_data_utils.evaluation_input_fn_builder(
        FLAGS.eval_data_dir,
        "clas")

    eval_size = processor.get_dev_size()
    eval_steps = int(eval_size / FLAGS.eval_batch_size)

  if FLAGS.do_train and FLAGS.do_eval:
    logging.info("***** Running training & evaluation *****")
    logging.info("  Supervised batch size = %d", FLAGS.train_batch_size)
    logging.info("  Unsupervised batch size = %d",
                    FLAGS.train_batch_size * FLAGS.unsup_ratio)
    logging.info("  Num steps = %d", FLAGS.num_train_steps)
    logging.info("  Base evaluation batch size = %d", FLAGS.eval_batch_size)
    logging.info("  Num steps = %d", eval_steps)
    best_acc = 0
    for _ in range(0, FLAGS.num_train_steps, save_checkpoints_steps):
      logging.info("*** Running training ***")
      estimator.train(
          input_fn=train_input_fn,
          steps=save_checkpoints_steps)
      logging.info("*** Running evaluation ***")
      dev_result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
      logging.info(">> Results:")
      for key in dev_result.keys():
        logging.info("  %s = %s", key, str(dev_result[key]))
        dev_result[key] = dev_result[key].item()
      best_acc = max(best_acc, dev_result["eval_classify_accuracy"])
    logging.info("***** Final evaluation result *****")
    logging.info("Best acc: {:.3f}\n\n".format(best_acc))
  elif FLAGS.do_train:
    logging.info("***** Running training *****")
    logging.info("  Supervised batch size = %d", FLAGS.train_batch_size)
    logging.info("  Unsupervised batch size = %d",
                    FLAGS.train_batch_size * FLAGS.unsup_ratio)
    logging.info("  Num steps = %d", FLAGS.num_train_steps)
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
  elif FLAGS.do_eval:
    logging.info("***** Running evaluation *****")
    logging.info("  Base evaluation batch size = %d", FLAGS.eval_batch_size)
    logging.info("  Num steps = %d", eval_steps)
    checkpoint_state = tf.train.get_checkpoint_state(FLAGS.model_dir)

    best_acc = 0
    for ckpt_path in checkpoint_state.all_model_checkpoint_paths:
      if not tf.io.gfile.exists(ckpt_path + ".data-00000-of-00001"):
        logging.info(
            "Warning: checkpoint {:s} does not exist".format(ckpt_path))
        continue
      logging.info("Evaluating {:s}".format(ckpt_path))
      dev_result = estimator.evaluate(
          input_fn=eval_input_fn,
          steps=eval_steps,
          checkpoint_path=ckpt_path,
      )
      logging.info(">> Results:")
      for key in dev_result.keys():
        logging.info("  %s = %s", key, str(dev_result[key]))
        dev_result[key] = dev_result[key].item()
      best_acc = max(best_acc, dev_result["eval_classify_accuracy"])
    logging.info("***** Final evaluation result *****")
    logging.info("Best acc: {:.3f}\n\n".format(best_acc))

IndentationError: unexpected indent (<ipython-input-28-aa809895a58a>, line 6)