## Introduction

This is a simple demonstration of fine-tuning BERT for this competition. BERT ships with a binary classification example (CoLA) in run_classifier.py that we can repurpose here. As far as I can tell, it is not possible to fine-tune BERT for a complete epoch on the entire data set within the 2-hour kernel limit, so I've had to make some adjustments to stay within the time limit:

* We only use 1/3 of the training data
* We use a maximum sequence length of 72

I think compute might be a big factor in this competition!
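
To make the sequence-length limit concrete, here is a minimal sketch of how a single comment ends up as a fixed-length input. It assumes the usual single-sentence layout (`[CLS]` + tokens + `[SEP]`, zero-padded). The whitespace split and the toy vocabulary are stand-ins I made up for illustration, not BERT's WordPiece tokenizer or its vocab.txt, and segment ids are left out.

```python
def build_inputs(tokens, max_seq_length, vocab):
    """Truncate, add special tokens, and pad one single-sentence example."""
    # Reserve two positions for [CLS] and [SEP].
    tokens = tokens[: max_seq_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Words missing from the toy vocabulary map to [UNK].
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    input_mask = [1] * len(input_ids)
    # Zero-pad both lists up to the fixed length.
    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding


# Toy vocabulary with hypothetical ids (NOT BERT's vocab.txt).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "this": 4, "is": 5, "a": 6, "comment": 7}

# A comment longer than the limit loses its tail.
ids, mask = build_inputs("this is a very long comment".split(), 6, toy_vocab)
print(ids)
print(mask)
```

With a limit of 72, anything past the first 70 tokens of a comment is simply dropped, which is the price paid for the shorter run time.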

Thanks to Jon Mischo (https://www.kaggle.com/supertaz) for uploading BERT Models + Scripts :)

## Libraries

We'll add the BERT repo to `sys.path` so we can import its modules directly.
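
If the path trick is unfamiliar, here is a small self-contained sketch of the idea. The temporary directory and the module name `toy_bert_module` are hypothetical stand-ins for the BERT repo folder and its scripts.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway directory holding a tiny module (hypothetical name).
repo_dir = tempfile.mkdtemp()
with open(os.path.join(repo_dir, "toy_bert_module.py"), "w") as f:
    f.write("def greet():\n    return 'imported from repo_dir'\n")

# Putting the directory first on sys.path lets a plain import find the module.
sys.path.insert(0, repo_dir)
importlib.invalidate_caches()
toy_bert_module = importlib.import_module("toy_bert_module")

print(toy_bert_module.greet())
```

Inserting at position 0 means this directory is searched before anything else, so a script there would shadow an installed package of the same name.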

In [None]:
import os
import sys
import collections
import csv
import time

import numpy as np
import pandas as pd
import tensorflow as tf

# BERT files

os.listdir("../input/pretrained-bert-including-scripts/master/bert-master")
sys.path.insert(0, '../input/pretrained-bert-including-scripts/master/bert-master')

from run_classifier import *
import modeling
import optimization
import tokenization

## Prepare Data

To keep things simple, we adapt the competition data to the format that BERT's CoLA processor expects.
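
As a rough illustration of that layout, the sketch below writes a couple of made-up rows as tab-separated lines (dummy, label, dummy, text) and reads them back. My understanding is that the CoLA processor takes the training label from column 1 and the text from column 3, so treat that as an assumption. The example rows and scores are invented.

```python
import csv
import io

# Hypothetical rows: (toxicity score, comment text).
rows = [(0.0, "Thanks for the helpful answer."),
        (0.83, "This comment is toxic.")]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for score, text in rows:
    label = 1 if score >= 0.5 else 0            # binarise the score at 0.5
    writer.writerow(["meh", label, "*", text])  # dummy, label, dummy, text

# Read the lines back the way a CoLA-style reader would:
# label from column 1, text from column 3.
buffer.seek(0)
parsed = [(line[1], line[3]) for line in csv.reader(buffer, delimiter="\t")]
print(parsed)
```

The two dummy columns carry no information; they only exist so the label and text land in the positions the reader expects.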

In [None]:
# import data

train=pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
test=pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv')

# remove new lines etc.

train['comment_text'] = train['comment_text'].replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\n',  ' ', regex=True)
test['comment_text'] = test['comment_text'].replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\n',  ' ', regex=True)

# force train into cola format, test is fine as it is

train['dummy_1'] = 'meh'
train['dummy_2'] = '*'

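train = train[['dummy_1','target','dummy_2','comment_text']].copy()  # copy avoids a SettingWithCopyWarning
train['target'] = np.where(train['target']>=0.5,1,0)  # binarise the toxicity score at 0.5

# keep a random third of the rows to fit the time limit
train = train.sample(frac=0.33)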

# export as tab separated

train.to_csv('train.tsv', sep='\t', index=False, header=False)
test.to_csv('test.tsv', sep='\t', index=False, header=True)

## Parameters

These mirror the flags defined in run_classifier.py; see https://github.com/google-research/bert/blob/master/run_classifier.py
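
To show what `learning_rate` and `warmup_proportion` do together, here is a plain-Python sketch of the schedule as I read it in the BERT repo's optimization.py: a linear ramp from zero during warmup, then a linear decay to zero over the whole run. Treat the exact shape as my assumption, and note that the total of 1000 steps is a hypothetical number chosen for readability.

```python
def learning_rate_at(step, base_lr, num_train_steps, warmup_proportion):
    """Learning rate at a global step: linear warmup, then linear decay."""
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    if step < num_warmup_steps:
        # Warmup: scale the base rate by the fraction of warmup completed.
        return base_lr * step / num_warmup_steps
    # Decay: fall linearly to zero at the final training step.
    step = min(step, num_train_steps)
    return base_lr * (1.0 - step / num_train_steps)


total_steps = 1000  # hypothetical run length
for step in (0, 50, 100, 500, 1000):
    print(step, learning_rate_at(step, 2e-5, total_steps, 0.1))
```

With `warmup_proportion = 0.1`, the first tenth of the steps runs at a reduced rate, which is meant to keep the pretrained weights from being disturbed too much early on.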

In [None]:
task_name = 'cola'
bert_config_file = '../input/pretrained-bert-including-scripts/uncased_l-12_h-768_a-12/uncased_L-12_H-768_A-12/bert_config.json'
vocab_file = '../input/pretrained-bert-including-scripts/uncased_l-12_h-768_a-12/uncased_L-12_H-768_A-12/vocab.txt'
init_checkpoint = '../input/pretrained-bert-including-scripts/uncased_l-12_h-768_a-12/uncased_L-12_H-768_A-12/bert_model.ckpt'
data_dir = './'
output_dir = './'
do_lower_case = True
max_seq_length = 72
do_train = True
do_eval = False
do_predict = False
train_batch_size = 32
eval_batch_size = 32
predict_batch_size = 32
learning_rate = 2e-5 
num_train_epochs = 1.0
warmup_proportion = 0.1
use_tpu = False
master = None
save_checkpoints_steps = 99999999 # <----- don't want to save any checkpoints
iterations_per_loop = 1000
num_tpu_cores = 8
tpu_cluster_resolver = None

## Fine Tuning

We'll run over the sampled training set (1/3 of the data) for a single epoch. The following code is lifted from run_classifier.py, apologies for the mess :)
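
The number of training and warmup steps follows directly from the number of examples, the batch size and the number of epochs, using the same arithmetic as the training cell in this section. The example count of 600,000 below is a hypothetical round number, not the real size of the sample.

```python
def training_steps(num_examples, batch_size, num_epochs, warmup_proportion):
    """Return (num_train_steps, num_warmup_steps) for a training run."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# Hypothetical example count; batch size, epochs and warmup as in Parameters.
steps, warmup = training_steps(600_000, 32, 1.0, 0.1)
print(steps, warmup)
```

Halving the batch size doubles the step count, so this is a quick way to sanity-check whether a configuration has any chance of fitting in the time limit.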

In [None]:
start = time.time()
print("--------------------------------------------------------")
print("Starting training ...")
print("--------------------------------------------------------")

In [None]:
bert_config = modeling.BertConfig.from_json_file(bert_config_file)

processor = ColaProcessor()
label_list = processor.get_labels()

tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

tpu_cluster_resolver = None
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2

run_config = tf.contrib.tpu.RunConfig(
  cluster=tpu_cluster_resolver,
  master=master,
  model_dir=output_dir,
  save_checkpoints_steps=save_checkpoints_steps,
  tpu_config=tf.contrib.tpu.TPUConfig(
      iterations_per_loop=iterations_per_loop,
      num_shards=num_tpu_cores,
      per_host_input_for_training=is_per_host))

train_examples = processor.get_train_examples(data_dir)
num_train_steps = int(len(train_examples) / train_batch_size * num_train_epochs)
num_warmup_steps = int(num_train_steps * warmup_proportion)

model_fn = model_fn_builder(
      bert_config=bert_config,
      num_labels=len(label_list),
      init_checkpoint=init_checkpoint,
      learning_rate=learning_rate,
      num_train_steps=num_train_steps,
      num_warmup_steps=num_warmup_steps,
      use_tpu=use_tpu,
      use_one_hot_embeddings=use_tpu)

estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=use_tpu,
      model_fn=model_fn,
      config=run_config,
      train_batch_size=train_batch_size)
      
      
train_file = os.path.join(output_dir, "train.tf_record")

file_based_convert_examples_to_features(
    train_examples, label_list, max_seq_length, tokenizer, train_file)

tf.logging.info("***** Running training *****")
tf.logging.info("  Num examples = %d", len(train_examples))
tf.logging.info("  Batch size = %d", train_batch_size)
tf.logging.info("  Num steps = %d", num_train_steps)

train_input_fn = file_based_input_fn_builder(
    input_file=train_file,
    seq_length=max_seq_length,
    is_training=True,
    drop_remainder=True)
    
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)


In [None]:
end = time.time()
print("--------------------------------------------------------")
print(f"Training complete in {end - start:.0f} seconds")
print("--------------------------------------------------------")

## Inference

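I've had issues with `batch_size` at inference: the stock input function reads it from `params["batch_size"]`, and I'm not quite sure where that value is set. A likely cause is that the estimator above is created with only `train_batch_size`, whereas run_classifier.py also passes `eval_batch_size` and `predict_batch_size`, but I haven't confirmed this. For now I redefine the input function below with the batch size hard coded, which should work fine. Note that inference therefore uses the hard-coded value (64), not `predict_batch_size`. As I spend more time with the code, hopefully it becomes clearer.

Inference should only take about 10 minutes for the public test set (~100k rows).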
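
One alternative to a hard-coded value is to fall back to a default only when the key is missing. The sketch below shows just that lookup in plain Python. It assumes `params` behaves like a dict that may lack `"batch_size"`, and `make_input_fn` is a hypothetical helper, not part of run_classifier.py.

```python
def make_input_fn(default_batch_size):
    """Build an input_fn that tolerates a missing params['batch_size']."""

    def input_fn(params):
        # Use the estimator-supplied value when present, else the fallback.
        batch_size = params.get("batch_size", default_batch_size)
        # A real input_fn would build and return a tf.data.Dataset here;
        # this sketch returns the resolved batch size so it can be inspected.
        return batch_size

    return input_fn


input_fn = make_input_fn(default_batch_size=32)
print(input_fn({}))                   # key missing: fallback is used
print(input_fn({"batch_size": 8}))    # key present: supplied value wins
```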


In [None]:
def file_based_input_fn_builder(input_file, seq_length, is_training,
                                drop_remainder):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""

  name_to_features = {
      "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
      "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
      "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
      "label_ids": tf.FixedLenFeature([], tf.int64),
      "is_real_example": tf.FixedLenFeature([], tf.int64),
  }

  def _decode_record(record, name_to_features):
    """Decodes a record to a TensorFlow example."""
    example = tf.parse_single_example(record, name_to_features)

    # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
    # So cast all int64 to int32.
    for name in list(example.keys()):
      t = example[name]
      if t.dtype == tf.int64:
        t = tf.to_int32(t)
      example[name] = t

    return example

  def input_fn(params):
    """The actual input function."""
    
    #batch_size = params["batch_size"]
    batch_size = 64 # <----- hardcoded batch_size added here 
    
    # For training, we want a lot of parallel reading and shuffling.
    # For eval, we want no shuffling and parallel reading doesn't matter.
    d = tf.data.TFRecordDataset(input_file)
    if is_training:
      d = d.repeat()
      d = d.shuffle(buffer_size=100)

    d = d.apply(
        tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features),
            batch_size=batch_size,
            drop_remainder=drop_remainder))

    return d

  return input_fn

In [None]:
start = time.time()
print("--------------------------------------------------------")
print("Starting inference ...")
print("--------------------------------------------------------")

In [None]:
predict_examples = processor.get_test_examples(data_dir)
num_actual_predict_examples = len(predict_examples)

predict_file = os.path.join(output_dir, "predict.tf_record")

file_based_convert_examples_to_features(predict_examples, label_list,
                                        max_seq_length, tokenizer,
                                        predict_file)

tf.logging.info("***** Running prediction *****")
tf.logging.info("  Num examples = %d (%d actual, %d padding)",
                len(predict_examples), num_actual_predict_examples,
                len(predict_examples) - num_actual_predict_examples)
tf.logging.info("  Batch size = %d", predict_batch_size)

predict_drop_remainder = True if use_tpu else False
predict_input_fn = file_based_input_fn_builder(
    input_file=predict_file,
    seq_length=max_seq_length,
    is_training=False,
    drop_remainder=predict_drop_remainder)

result = estimator.predict(input_fn=predict_input_fn)

output_predict_file = os.path.join(output_dir, "test_results.tsv")

with tf.gfile.GFile(output_predict_file, "w") as writer:
    num_written_lines = 0
    tf.logging.info("***** Predict results *****")
    for (i, prediction) in enumerate(result):
        probabilities = prediction["probabilities"]
        if i >= num_actual_predict_examples:
            break
        output_line = "\t".join(
            str(class_probability)
            for class_probability in probabilities) + "\n"
        writer.write(output_line)
        num_written_lines += 1


In [None]:
end = time.time()
print("--------------------------------------------------------")
print(f"Inference complete in {end - start:.0f} seconds")
print("--------------------------------------------------------")

## Submission

In [None]:
sample_submission = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/sample_submission.csv')
predictions = pd.read_csv('./test_results.tsv', header=None, sep='\t')

submission = pd.concat([sample_submission.iloc[:,0], predictions.iloc[:,1]], axis=1)
submission.columns = ['id','prediction']
submission.to_csv('submission.csv', index=False, header=True)

In [None]:
submission.head()