# BERT finetuning tasks in 5 minutes with Cloud TPU

<table class="tfo-notebook-buttons" align="left" >
 <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>


**BERT**, or **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The academic paper can be found here: https://arxiv.org/abs/1810.04805.

This Colab demonstrates using a free Colab Cloud TPU to fine-tune sentence and sentence-pair classification tasks built on top of pretrained BERT models.

**Note:** You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket for this Colab to run.

Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) to create a GCP account and a GCS bucket. You have [$300 free credit](https://cloud.google.com/free/) to get started with any GCP product. You can learn more about Cloud TPU at https://cloud.google.com/tpu/docs.

Once you finish the setup, let's start!

**Firstly**, we need to set up the Colab TPU runtime environment, verify that a TPU device is successfully connected, and upload credentials to the TPU so that it can access the GCS bucket.

In [2]:
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TPU address is grpc://10.84.255.202:8470
TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 10396890425686630942),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 16470429612696230446),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 1015792942436212038),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 495137551527194747),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 1039165987876656910),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 7506864652263850389),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 16799792537013114676),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 2091114872633841358),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 1386655203263

**Secondly**, prepare and import BERT modules.

In [3]:
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
  sys.path.append('bert_repo')

Cloning into 'bert_repo'...
remote: Enumerating objects: 325, done.
remote: Total 325 (delta 0), reused 0 (delta 0), pack-reused 325
Receiving objects: 100% (325/325), 258.56 KiB | 3.54 MiB/s, done.
Resolving deltas: 100% (184/184), done.


**Thirdly**, prepare for training:

*  Specify the task and download the training data.
*  Specify the pretrained BERT model.
*  Specify the GCS bucket and create the output directory for model checkpoints and eval results.



In [0]:
#!curl https://raw.githubusercontent.com/jaisong87/prDetect/master/Content/msr_paraphrase_train.txt -o glue_data/MRPC/train.tsv

In [0]:
#!curl https://raw.githubusercontent.com/jaisong87/prDetect/master/Content/msr_paraphrase_test.txt -o glue_data/MRPC/test.tsv

In [6]:
#TASK = 'MRPC' #@param {type:"string"}
TASK = 'CoLA' #@param {type:"string"}
assert TASK in ('MRPC', 'CoLA'), 'Only (MRPC, CoLA) are demonstrated here.'
# Download glue data.
! test -d download_glue_repo || git clone https://gist.github.com/60c2bdb54d156a41194446737ce03e2e.git download_glue_repo
!python download_glue_repo/download_glue_data.py --data_dir='glue_data' --tasks=$TASK
TASK_DATA_DIR = 'glue_data/' + TASK
print('***** Task data directory: {} *****'.format(TASK_DATA_DIR))
!ls $TASK_DATA_DIR

# Available pretrained model checkpoints:
#   uncased_L-12_H-768_A-12: uncased BERT base model
#   uncased_L-24_H-1024_A-16: uncased BERT large model
#   cased_L-12_H-768_A-12: cased BERT base model
BERT_MODEL = 'uncased_L-12_H-768_A-12' #@param {type:"string"}
BERT_PRETRAINED_DIR = 'gs://cloud-tpu-checkpoints/bert/' + BERT_MODEL
print('***** BERT pretrained directory: {} *****'.format(BERT_PRETRAINED_DIR))
!gsutil ls $BERT_PRETRAINED_DIR

BUCKET = 'bucket-w261' #@param {type:"string"}
assert BUCKET, 'Must specify an existing GCS bucket name'
OUTPUT_DIR = 'gs://{}/bert/models/{}'.format(BUCKET, TASK)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))


Cloning into 'download_glue_repo'...
remote: Enumerating objects: 21, done.
remote: Total 21 (delta 0), reused 0 (delta 0), pack-reused 21
Unpacking objects: 100% (21/21), done.
Downloading and extracting CoLA...
	Completed!
***** Task data directory: glue_data/CoLA *

In [7]:
!ls -l  $TASK_DATA_DIR/

total 528
-rw-r--r-- 1 root root  53717 Apr  8 02:50 dev.tsv
drwxr-xr-x 4 root root   4096 Apr  8 02:50 original
-rw-r--r-- 1 root root  48788 Apr  8 02:50 test.tsv
-rw-r--r-- 1 root root 428590 Apr  8 02:50 train.tsv


**Now, let's play!**

In [8]:
# Setup task specific model and TPU running config.

import modeling
import optimization
import run_classifier
import tokenization


# Model Hyper Parameters
TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 3.0
WARMUP_PROPORTION = 0.1
MAX_SEQ_LENGTH = 128
# Model configs
SAVE_CHECKPOINTS_STEPS = 1000
ITERATIONS_PER_LOOP = 1000
NUM_TPU_CORES = 8
VOCAB_FILE = os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt')
CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')
INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')
DO_LOWER_CASE = BERT_MODEL.startswith('uncased')

processors = {
  "cola": run_classifier.ColaProcessor,
  "mnli": run_classifier.MnliProcessor,
  "mrpc": run_classifier.MrpcProcessor,
}
processor = processors[TASK.lower()]()
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=DO_LOWER_CASE)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=OUTPUT_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=ITERATIONS_PER_LOOP,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

train_examples = processor.get_train_examples(TASK_DATA_DIR)
num_train_steps = int(
    len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

model_fn = run_classifier.model_fn_builder(
    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
    num_labels=len(label_list),
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=True,
    use_one_hot_embeddings=True)

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)

INFO:tensorflow:Using config: {'_model_dir': 'gs://bucket-w261/bert/models/CoLA', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.84.255.202:8470"
    }
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd49cd41898>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.84.255.202:8470', '_evaluation_master': 'grpc://10.84.255.202:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_
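
As a rough check of the step arithmetic in the cell above, the sketch below repeats the same two formulas with plain numbers. It assumes the CoLA training set has 8551 examples, which is the count the training log reports; the lowercase variable names are illustrative only.

```python
# Worked example of the step arithmetic, assuming 8551 training examples
# (the figure reported by "Writing example 0 of 8551" in the training log).
num_train_examples = 8551
train_batch_size = 32
num_train_epochs = 3.0
warmup_proportion = 0.1

# Same formulas as the notebook cell: int() truncates to whole steps.
num_train_steps = int(num_train_examples / train_batch_size * num_train_epochs)
num_warmup_steps = int(num_train_steps * warmup_proportion)

print(num_train_steps, num_warmup_steps)  # prints: 801 80
```

With these settings, the learning rate would warm up over roughly the first 80 of 801 steps.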

In [9]:
# Train the model.
print('MRPC/CoLA on BERT base model normally takes about 2-3 minutes. Please wait...')
train_features = run_classifier.convert_examples_to_features(
    train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
print('***** Started training at {} *****'.format(datetime.datetime.now()))
print('  Num examples = {}'.format(len(train_examples)))
print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))
tf.logging.info("  Num steps = %d", num_train_steps)
train_input_fn = run_classifier.input_fn_builder(
    features=train_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=True)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print('***** Finished training at {} *****'.format(datetime.datetime.now()))

MRPC/CoLA on BERT base model normally takes about 2-3 minutes. Please wait...
INFO:tensorflow:Writing example 0 of 8551
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: train-0
INFO:tensorflow:tokens: [CLS] our friends won ' t buy this analysis , let alone the next one we propose . [SEP]
INFO:tensorflow:input_ids: 101 2256 2814 2180 1005 1056 4965 2023 4106 1010 2292 2894 1996 2279 2028 2057 16599 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
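
The log lines above show each example padded to `MAX_SEQ_LENGTH`. The following is a minimal sketch of that padding, not the code in `run_classifier.py`: `pad_features` is a hypothetical helper, and the token ids are copied from the `train-0` log entry.

```python
MAX_SEQ_LENGTH = 128

# WordPiece ids copied from the log line for example train-0.
token_ids = [101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010,
             2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102]

def pad_features(ids, max_len):
    """Hypothetical helper: zero-pad ids and build the matching mask."""
    if len(ids) > max_len:
        raise ValueError('sequence longer than max_len')
    padding = [0] * (max_len - len(ids))
    input_ids = ids + padding
    # 1 marks a real token, 0 marks padding.
    input_mask = [1] * len(ids) + padding
    # Single-sentence task: every position belongs to segment 0.
    segment_ids = [0] * max_len
    return input_ids, input_mask, segment_ids

input_ids, input_mask, segment_ids = pad_features(token_ids, MAX_SEQ_LENGTH)
print(len(input_ids), sum(input_mask))  # prints: 128 19
```

The mask lets the model ignore the padded positions, which is why it has 19 ones for the 19 tokens of this sentence.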

In [10]:
# Eval the model.
eval_examples = processor.get_dev_examples(TASK_DATA_DIR)
eval_features = run_classifier.convert_examples_to_features(
    eval_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
print('***** Started evaluation at {} *****'.format(datetime.datetime.now()))
print('  Num examples = {}'.format(len(eval_examples)))
print('  Batch size = {}'.format(EVAL_BATCH_SIZE))
# Eval will be slightly WRONG on the TPU because it will truncate
# the last batch.
eval_steps = int(len(eval_examples) / EVAL_BATCH_SIZE)
eval_input_fn = run_classifier.input_fn_builder(
    features=eval_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=True)
result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
print('***** Finished evaluation at {} *****'.format(datetime.datetime.now()))
output_eval_file = os.path.join(OUTPUT_DIR, "eval_results.txt")
with tf.gfile.GFile(output_eval_file, "w") as writer:
  print("***** Eval results *****")
  for key in sorted(result.keys()):
    print('  {} = {}'.format(key, str(result[key])))
    writer.write("%s = %s\n" % (key, str(result[key])))

INFO:tensorflow:Writing example 0 of 1043
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: dev-0
INFO:tensorflow:tokens: [CLS] the sailors rode the breeze clear of the rocks . [SEP]
INFO:tensorflow:input_ids: 101 1996 11279 8469 1996 9478 3154 1997 1996 5749 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
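
The comment in the evaluation cell notes that the last batch is truncated. A small worked example, assuming the 1043 dev examples reported in the log above and the batch size of 8 set earlier, shows how many examples would be left out:

```python
num_eval_examples = 1043  # from "Writing example 0 of 1043"
eval_batch_size = 8

# Same formula as the notebook cell: only full batches are evaluated.
eval_steps = int(num_eval_examples / eval_batch_size)
evaluated = eval_steps * eval_batch_size
dropped = num_eval_examples - evaluated

print(eval_steps, evaluated, dropped)  # prints: 130 1040 3
```

Under these assumptions, 3 of the 1043 examples do not contribute to the reported metrics.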


In [12]:
type(estimator)

tensorflow.contrib.tpu.python.tpu.tpu_estimator.TPUEstimator

In [13]:
! echo "i quit smoking.  how do i go about removing the tobacco premium?" | tee $TASK_DATA_DIR/input_sentence_01.txt
! echo "I was wondering if you can tell me how I can remove the tobacco user surcharge from my coverage plan since I am not a tobacco user" | tee $TASK_DATA_DIR/input_sentence_02.txt
! echo "how frequently is an annual checkup is covered?" | tee $TASK_DATA_DIR/input_sentence_03.txt
! echo "I want to know if general health checkup is covered in my insurance plan?" | tee $TASK_DATA_DIR/input_sentence_04.txt
! echo "are breast pumps covered under my plan?" | tee $TASK_DATA_DIR/input_sentence_05.txt
! echo "I was wondering where I can find info on breast pumps and how to get one that's covered by my insurance." | tee $TASK_DATA_DIR/input_sentence_06.txt
! ls -l $TASK_DATA_DIR/input_*.txt

i quit smoking.  how do i go about removing the tobacco premium?
I was wondering if you can tell me how I can remove the tobacco user surcharge from my coverage plan since I am not a tobacco user
how frequently is an annual checkup is covered?
I want to know if general health checkup is covered in my insurance plan?
are breast pumps covered under my plan?
I was wondering where I can find info on breast pumps and how to get one that's covered by my insurance.
-rw-r--r-- 1 root root  65 Apr  8 02:51 glue_data/CoLA/input_sentence_01.txt
-rw-r--r-- 1 root root 131 Apr  8 02:51 glue_data/CoLA/input_sentence_02.txt
-rw-r--r-- 1 root root  48 Apr  8 02:51 glue_data/CoLA/input_sentence_03.txt
-rw-r--r-- 1 root root  74 Apr  8 02:51 glue_data/CoLA/input_sentence_04.txt
-rw-r--r-- 1 root root  40 Apr  8 02:51 glue_data/CoLA/input_sentence_05.txt
-rw-r--r-- 1 root root 105 Apr  8 02:51 glue_data/CoLA/input_sentence_06.txt


In [14]:
! gsutil -m cp -r gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12 .


Copying gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_config.json...
/ [0/6 files][    0.0 B/421.1 MiB]   0% Done                                    Copying gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt.index...
/ [0/6 files][    0.0 B/421.1 MiB]   0% Done                                    Copying gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt.meta...
Copying gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001...
/ [0/6 files][    0.0 B/421.1 MiB]   0% Done                                    / [0/6 files][    0.0 B/421.1 MiB]   0% Done                                    Copying gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/checkpoint...
Copying gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/vocab.txt...
/ [6/6 files][421.1 MiB/421.1 MiB] 100% Done                                    
Operation completed over 6 objects/421.1 MiB.                               

In [25]:
! for i in `seq 1 6`; do echo ${i}; done;









In [29]:
#input_sentence_01.txt

!python bert_repo/extract_features.py --input_file=glue_data/CoLA/input_sentence_01.txt   --output_file=glue_data/CoLA/output_sentence_01.jsonl   --vocab_file=uncased_L-12_H-768_A-12/vocab.txt   --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json   --init_checkpoint=uncased_L-12_H-768_A-12/bert_model.ckpt   --layers=-2   --max_seq_length=128   --batch_size=8


For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:*** Example ***
INFO:tensorflow:unique_id: 0
INFO:tensorflow:tokens: [CLS] i quit smoking . how do i go about removing the tobacco premium ? [SEP]
INFO:tensorflow:input_ids: 101 1045 8046 9422 1012 2129 2079 1045 2175 2055 9268 1996 9098 12882 1029 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In [30]:
#input_sentence_02.txt

!python bert_repo/extract_features.py --input_file=glue_data/CoLA/input_sentence_02.txt   --output_file=glue_data/CoLA/output_sentence_02.jsonl   --vocab_file=uncased_L-12_H-768_A-12/vocab.txt   --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json   --init_checkpoint=uncased_L-12_H-768_A-12/bert_model.ckpt   --layers=-2   --max_seq_length=128   --batch_size=8


For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:*** Example ***
INFO:tensorflow:unique_id: 0
INFO:tensorflow:tokens: [CLS] i was wondering if you can tell me how i can remove the tobacco user sur ##cha ##rge from my coverage plan since i am not a tobacco user [SEP]
INFO:tensorflow:input_ids: 101 1045 2001 6603 2065 2017 2064 2425 2033 2129 1045 2064 6366 1996 9098 5310 7505 7507 20800 2013 2026 6325 2933 2144 1045 2572 2025 1037 9098 5310 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In [31]:
#input_sentence_03.txt

!python bert_repo/extract_features.py --input_file=glue_data/CoLA/input_sentence_03.txt   --output_file=glue_data/CoLA/output_sentence_03.jsonl   --vocab_file=uncased_L-12_H-768_A-12/vocab.txt   --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json   --init_checkpoint=uncased_L-12_H-768_A-12/bert_model.ckpt   --layers=-2   --max_seq_length=128   --batch_size=8


For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:*** Example ***
INFO:tensorflow:unique_id: 0
INFO:tensorflow:tokens: [CLS] how frequently is an annual check ##up is covered ? [SEP]
INFO:tensorflow:input_ids: 101 2129 4703 2003 2019 3296 4638 6279 2003 3139 1029 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflo

In [32]:
#input_sentence_04.txt

!python bert_repo/extract_features.py --input_file=glue_data/CoLA/input_sentence_04.txt   --output_file=glue_data/CoLA/output_sentence_04.jsonl   --vocab_file=uncased_L-12_H-768_A-12/vocab.txt   --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json   --init_checkpoint=uncased_L-12_H-768_A-12/bert_model.ckpt   --layers=-2   --max_seq_length=128   --batch_size=8


For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:*** Example ***
INFO:tensorflow:unique_id: 0
INFO:tensorflow:tokens: [CLS] i want to know if general health check ##up is covered in my insurance plan ? [SEP]
INFO:tensorflow:input_ids: 101 1045 2215 2000 2113 2065 2236 2740 4638 6279 2003 3139 1999 2026 5427 2933 1029 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [34]:
#input_sentence_05.txt

!python bert_repo/extract_features.py --input_file=glue_data/CoLA/input_sentence_05.txt   --output_file=glue_data/CoLA/output_sentence_05.jsonl   --vocab_file=uncased_L-12_H-768_A-12/vocab.txt   --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json   --init_checkpoint=uncased_L-12_H-768_A-12/bert_model.ckpt   --layers=-2   --max_seq_length=128   --batch_size=8


For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:*** Example ***
INFO:tensorflow:unique_id: 0
INFO:tensorflow:tokens: [CLS] are breast pumps covered under my plan ? [SEP]
INFO:tensorflow:input_ids: 101 2024 7388 15856 3139 2104 2026 2933 1029 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_type_ids

In [35]:
#input_sentence_06.txt

!python bert_repo/extract_features.py --input_file=glue_data/CoLA/input_sentence_06.txt   --output_file=glue_data/CoLA/output_sentence_06.jsonl   --vocab_file=uncased_L-12_H-768_A-12/vocab.txt   --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json   --init_checkpoint=uncased_L-12_H-768_A-12/bert_model.ckpt   --layers=-2   --max_seq_length=128   --batch_size=8


For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:*** Example ***
INFO:tensorflow:unique_id: 0
INFO:tensorflow:tokens: [CLS] i was wondering where i can find info on breast pumps and how to get one that ' s covered by my insurance . [SEP]
INFO:tensorflow:input_ids: 101 1045 2001 6603 2073 1045 2064 2424 18558 2006 7388 15856 1998 2129 2000 2131 2028 2008 1005 1055 3139 2011 2026 5427 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
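
The six cells above differ only in the sentence number. One way to avoid the repetition is to build the argument list in a loop, as sketched below. `build_extract_command` is a hypothetical helper; the flags and paths are the ones used in the cells above, and the commands are only printed here, not executed.

```python
def build_extract_command(index, data_dir='glue_data/CoLA',
                          model_dir='uncased_L-12_H-768_A-12'):
    """Hypothetical helper returning the argument list for one sentence file."""
    return [
        'python', 'bert_repo/extract_features.py',
        '--input_file={}/input_sentence_{:02d}.txt'.format(data_dir, index),
        '--output_file={}/output_sentence_{:02d}.jsonl'.format(data_dir, index),
        '--vocab_file={}/vocab.txt'.format(model_dir),
        '--bert_config_file={}/bert_config.json'.format(model_dir),
        '--init_checkpoint={}/bert_model.ckpt'.format(model_dir),
        '--layers=-2',
        '--max_seq_length=128',
        '--batch_size=8',
    ]

commands = [build_extract_command(i) for i in range(1, 7)]
for cmd in commands:
    print(' '.join(cmd))
    # To actually run a command: subprocess.run(cmd, check=True)
```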

In [0]:
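embeddings = []
for i in range(1, 7):
  path = os.path.join(TASK_DATA_DIR, 'output_sentence_0{}.jsonl'.format(i))

  # Each output file holds a single JSON record (one input sentence per file).
  with open(path) as f:
    output = json.load(f)

  # Collect one vector per token from the extracted layer (--layers=-2).
  token_vectors = []
  for feature in output['features']:
    for layer in feature['layers']:
      token_vectors.append(layer['values'])
  embeddings.append(token_vectors)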
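
The cell above relies on each `.jsonl` file holding exactly one record. If an input file contained several sentences, the output would presumably hold one JSON record per line, and `json.load` would fail on it. The sketch below reads such a file line by line. `read_token_vectors` is a hypothetical helper, and the toy records only mimic the `features` / `layers` / `values` keys used above; the `token` and `index` fields are assumptions about the file layout.

```python
import io
import json

def read_token_vectors(lines):
    """Hypothetical helper: one list of token vectors per JSONL record."""
    sentences = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines such as a trailing newline
        record = json.loads(line)
        vectors = [layer['values']
                   for feature in record['features']
                   for layer in feature['layers']]
        sentences.append(vectors)
    return sentences

# Toy stand-in for an output file: two records with 2-dimensional vectors.
toy_file = io.StringIO(
    '{"features": [{"token": "[CLS]", "layers": [{"index": -2, "values": [0.1, 0.2]}]}]}\n'
    '{"features": [{"token": "[CLS]", "layers": [{"index": -2, "values": [0.3, 0.4]}]}]}\n'
)
print(read_token_vectors(toy_file))
```

Any iterable of lines works as input, so an open file handle can be passed in place of the `io.StringIO` object.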


In [39]:
len(embeddings)

6

In [0]:
import numpy as np

# Pool each sentence into a single vector by summing its token vectors.
sentences = []
for embed in embeddings:
  sentences.append(np.sum(embed, axis=0))


In [41]:
def compute_cosine_similarity(x, y):
  return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

cosine_similarities = []
for i in [0, 2, 4]:
  cs = compute_cosine_similarity(sentences[i], sentences[i+1])
  cosine_similarities.append(cs)
  print('Cosine Similarity for sentence {} and {}: {}'.format(i+1, i+2, cs))


Cosine Similarity for sentence 1 and 2: 0.7819853239942068
Cosine Similarity for sentence 3 and 4: 0.8574093574136543
Cosine Similarity for sentence 5 and 6: 0.8145740832188282


In [42]:
from scipy.spatial import distance
from math import sqrt

for i in [0, 2, 4]:
  # Library result and a manual computation, printed side by side as a sanity check.
  ed_1 = distance.euclidean(sentences[i], sentences[i+1])
  ed_2 = sqrt(sum((sentences[i] - sentences[i+1])**2))

  print('Euclidean distance for sentence {} and {}: {}'.format(i+1, i+2, ed_1))
  print('Manually computed Euclidean distance for sentence {} and {}: {}'.format(i+1, i+2, ed_2))

Euclidean distance for sentence 1 and 2: 287.2174552478757
Manually computed Euclidean distance for sentence 1 and 2: 287.2174552478758
Euclidean distance for sentence 3 and 4: 130.829968957146
Manually computed Euclidean distance for sentence 3 and 4: 130.82996895714606
Euclidean distance for sentence 5 and 6: 238.27785721564973
Manually computed Euclidean distance for sentence 5 and 6: 238.27785721564973
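
Because the sentence vectors here are sums over tokens, their magnitude grows with sentence length. Cosine similarity ignores magnitude, but Euclidean distance does not, so pairs of unequal length may look farther apart than their content suggests. The sketch below illustrates this with toy vectors in pure Python; the helper names and data are illustrative assumptions, and mean pooling is shown only as one possible alternative, not as what the notebook does.

```python
from math import sqrt

def pool(token_vectors, mean=False):
    """Sum (or average, if mean=True) a list of equal-length token vectors."""
    dim = len(token_vectors[0])
    total = [sum(vec[d] for vec in token_vectors) for d in range(dim)]
    return [x / len(token_vectors) for x in total] if mean else total

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Toy data: a 2-token and a 4-token "sentence" with identical token vectors.
short_sentence = [[1.0, 2.0]] * 2
long_sentence = [[1.0, 2.0]] * 4

for mean in (False, True):
    a, b = pool(short_sentence, mean), pool(long_sentence, mean)
    print('mean' if mean else 'sum',
          round(cosine(a, b), 6), round(euclidean(a, b), 6))
```

In this toy case the cosine similarity is the same under both pooling choices, while the Euclidean distance is nonzero for summed vectors and zero for averaged ones.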


In [43]:
! gsutil -m cp $TASK_DATA_DIR/input_*.txt gs://bucket-251/input
! gsutil -m cp $TASK_DATA_DIR/output_*.jsonl gs://bucket-251/output


Copying file://glue_data/CoLA/input_sentence_06.txt [Content-Type=text/plain]...
Copying file://glue_data/CoLA/input_sentence_01.txt [Content-Type=text/plain]...
Copying file://glue_data/CoLA/input_sentence_02.txt [Content-Type=text/plain]...
Copying file://glue_data/CoLA/input_sentence_03.txt [Content-Type=text/plain]...
/ [0/6 files][    0.0 B/  463.0 B]   0% Done                                    / [0/6 files][    0.0 B/  463.0 B]   0% Done                                    Copying file://glue_data/CoLA/input_sentence_04.txt [Content-Type=text/plain]...
Copying file://glue_data/CoLA/input_sentence_05.txt [Content-Type=text/plain]...
/ [6/6 files][  463.0 B/  463.0 B] 100% Done                                    
Operation completed over 6 objects/463.0 B.                                      
Copying file://glue_data/CoLA/output_sentence_01.jsonl [Content-Type=application/octet-stream]...
Copying file://glue_data/CoLA/output_sentence_02.jsonl [Content-Type=application/octet-stre