
# BERT finetuning tasks in 5 minutes with Cloud TPU



**BERT**, or **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The academic paper can be found here: https://arxiv.org/abs/1810.04805.

This Colab demonstrates using a free Colab Cloud TPU to fine-tune sentence and sentence-pair classification tasks built on top of pretrained BERT models.

**Note:** You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket for this Colab to run.

Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) to create a GCP account and a GCS bucket. You have [$300 free credit](https://cloud.google.com/free/) to get started with any GCP product. You can learn more about Cloud TPU at https://cloud.google.com/tpu/docs.

Once you finish the setup, let's start!

**Firstly**, we need to set up the Colab TPU runtime environment, verify that a TPU device is successfully connected, and upload credentials to the TPU for GCS bucket access.

In [0]:
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TPU address is grpc://10.33.249.18:8470
TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 5023830667552631994),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 10372112135816979123),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 884658457123060681),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 1483279485429802220),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 13291532716602769516),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 13663665642373204420),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 7008609742610233499),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 12351791256026920448),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 9029820107139

**Secondly**, prepare and import BERT modules.

In [0]:
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
  sys.path += ['bert_repo']

Cloning into 'bert_repo'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 306 (delta 0), reused 2 (delta 0), pack-reused 299
Receiving objects: 100% (306/306), 266.58 KiB | 3.60 MiB/s, done.
Resolving deltas: 100% (167/167), done.


**Thirdly**, prepare for training:

*  Specify the task and download the training data.
*  Specify the BERT pretrained model.
*  Specify the GCS bucket and create an output directory for model checkpoints and eval results.



In [0]:
TASK = 'MRPC' #@param {type:"string"}

TASK_DATA_DIR = 'glue_data/' + TASK
! mkdir -pv $TASK_DATA_DIR
print('***** Task data directory: {} *****'.format(TASK_DATA_DIR))
!ls $TASK_DATA_DIR

import zipfile
import urllib.request

! curl https://raw.githubusercontent.com/jaisong87/prDetect/master/Content/msr_paraphrase_train.txt --output glue_data/MRPC/msr_paraphrase_train.txt
! curl https://raw.githubusercontent.com/jaisong87/prDetect/master/Content/msr_paraphrase_test.txt --output glue_data/MRPC/msr_paraphrase_test.txt

MRPCURL='https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc'
task='MRPC'
mrpc_dir='glue_data/MRPC'
urllib.request.urlretrieve(MRPCURL, os.path.join(mrpc_dir, "dev_ids.tsv"))

mrpc_train_file = os.path.join(mrpc_dir, "msr_paraphrase_train.txt")
mrpc_test_file = os.path.join(mrpc_dir, "msr_paraphrase_test.txt")

dev_ids = []
with open(os.path.join(mrpc_dir, "dev_ids.tsv"), encoding="utf8") as ids_fh:
    for row in ids_fh:
        dev_ids.append(row.strip().split('\t'))

with open(mrpc_train_file, encoding="utf8") as data_fh, \
     open(os.path.join(mrpc_dir, "train.tsv"), 'w', encoding="utf8") as train_fh, \
     open(os.path.join(mrpc_dir, "dev.tsv"), 'w', encoding="utf8") as dev_fh:
    header = data_fh.readline()
    train_fh.write(header)
    dev_fh.write(header)
    for row in data_fh:
        label, id1, id2, s1, s2 = row.strip().split('\t')
        if [id1, id2] in dev_ids:
            dev_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2))
        else:
            train_fh.write("%s\t%s\t%s\t%s\t%s\n" % (label, id1, id2, s1, s2))

with open(mrpc_test_file, encoding="utf8") as data_fh, \
        open(os.path.join(mrpc_dir, "test.tsv"), 'w', encoding="utf8") as test_fh:
    header = data_fh.readline()
    test_fh.write("index\t#1 ID\t#2 ID\t#1 String\t#2 String\n")
    for idx, row in enumerate(data_fh):
        label, id1, id2, s1, s2 = row.strip().split('\t')
        test_fh.write("%d\t%s\t%s\t%s\t%s\n" % (idx, id1, id2, s1, s2))
print("\tCompleted!")

# Available pretrained model checkpoints:
#   uncased_L-12_H-768_A-12: uncased BERT base model
#   uncased_L-24_H-1024_A-16: uncased BERT large model
#   cased_L-12_H-768_A-12: cased BERT base model
BERT_MODEL = 'uncased_L-12_H-768_A-12' #@param {type:"string"}
BERT_PRETRAINED_DIR = 'gs://cloud-tpu-checkpoints/bert/' + BERT_MODEL
print('***** BERT pretrained directory: {} *****'.format(BERT_PRETRAINED_DIR))
!gsutil ls $BERT_PRETRAINED_DIR

BUCKET = 'ucb_w251' #@param {type:"string"}
assert BUCKET, 'Must specify an existing GCS bucket name'
OUTPUT_DIR = 'gs://{}/bert/models/{}'.format(BUCKET, TASK)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

***** Task data directory: glue_data/MRPC *****
dev_ids.tsv  msr_paraphrase_test.txt   test.tsv
dev.tsv      msr_paraphrase_train.txt  train.tsv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1002k  100 1002k    0     0  6774k      0 --:--:-- --:--:-- --:--:-- 6774k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  422k  100  422k    0     0  3950k      0 --:--:-- --:--:-- --:--:-- 3950k
	Completed!
***** BERT pretrained directory: gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12 *****
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_config.json
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt.index
gs://cloud-tpu-checkpoints/bert
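
The MRPC cell above decides whether a row goes to `dev.tsv` or `train.tsv` by testing its ID pair against a list, which scans the list for every row. A minimal sketch of the same split with a set of tuples, on made-up rows (the IDs and sentences below are hypothetical, not taken from the corpus):

```python
# Hypothetical rows in the MRPC column order: label, id1, id2, s1, s2.
rows = [
    ("1", "100", "101", "A cat sat.", "A cat was sitting."),
    ("0", "200", "201", "It rained.", "Stocks fell."),
    ("1", "300", "301", "He left.", "He departed."),
]

# Hypothetical dev ID pairs, stored as tuples in a set so that
# each membership test is a hash lookup instead of a list scan.
dev_id_pairs = {("200", "201")}

train_rows, dev_rows = [], []
for row in rows:
    label, id1, id2, s1, s2 = row
    target = dev_rows if (id1, id2) in dev_id_pairs else train_rows
    target.append(row)

print(len(train_rows), len(dev_rows))
```

In the real cell this would amount to building `dev_ids` as a set of tuples and testing `(id1, id2) in dev_ids`; the output files would be unchanged.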

In [0]:
#TASK = 'MRPC' #@param {type:"string"}
TASK = 'CoLA'
# assert TASK in ('MRPC', 'CoLA'), 'Only (MRPC, CoLA) are demonstrated here.'
# Download glue data.
! test -d download_glue_repo || git clone https://gist.github.com/60c2bdb54d156a41194446737ce03e2e.git download_glue_repo
! python download_glue_repo/download_glue_data.py --data_dir='glue_data' --tasks=$TASK

TASK_DATA_DIR = 'glue_data/' + TASK
print('***** Task data directory: {} *****'.format(TASK_DATA_DIR))
!ls $TASK_DATA_DIR

# Available pretrained model checkpoints:
#   uncased_L-12_H-768_A-12: uncased BERT base model
#   uncased_L-24_H-1024_A-16: uncased BERT large model
#   cased_L-12_H-768_A-12: cased BERT base model
BERT_MODEL = 'uncased_L-12_H-768_A-12' #@param {type:"string"}
BERT_PRETRAINED_DIR = 'gs://cloud-tpu-checkpoints/bert/' + BERT_MODEL
print('***** BERT pretrained directory: {} *****'.format(BERT_PRETRAINED_DIR))
!gsutil ls $BERT_PRETRAINED_DIR

BUCKET = 'ucb_w251' #@param {type:"string"}
assert BUCKET, 'Must specify an existing GCS bucket name'
OUTPUT_DIR = 'gs://{}/bert/models/{}'.format(BUCKET, TASK)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))


Downloading and extracting CoLA...
	Completed!
***** Task data directory: glue_data/CoLA *****
dev.tsv  original  test.tsv  train.tsv
***** BERT pretrained directory: gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12 *****
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_config.json
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt.index
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/bert_model.ckpt.meta
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/checkpoint
gs://cloud-tpu-checkpoints/bert/uncased_L-12_H-768_A-12/vocab.txt
***** Model output directory: gs://ucb_w251/bert/models/CoLA *****


**Now, let's play!**

In [0]:
# Setup task specific model and TPU running config.

import modeling
import optimization
import run_classifier
import tokenization


# Model Hyper Parameters
TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 3.0
WARMUP_PROPORTION = 0.1
MAX_SEQ_LENGTH = 128
# Model configs
SAVE_CHECKPOINTS_STEPS = 1000
ITERATIONS_PER_LOOP = 1000
NUM_TPU_CORES = 8
VOCAB_FILE = os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt')
CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')
INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')
DO_LOWER_CASE = BERT_MODEL.startswith('uncased')

processors = {
  "cola": run_classifier.ColaProcessor,
  "mnli": run_classifier.MnliProcessor,
  "mrpc": run_classifier.MrpcProcessor,
}
processor = processors[TASK.lower()]()
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=DO_LOWER_CASE)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=OUTPUT_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=ITERATIONS_PER_LOOP,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

train_examples = processor.get_train_examples(TASK_DATA_DIR)
num_train_steps = int(
    len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

model_fn = run_classifier.model_fn_builder(
    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
    num_labels=len(label_list),
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=True,
    use_one_hot_embeddings=True)

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)

INFO:tensorflow:Using config: {'_model_dir': 'gs://ucb_w251/bert/models/MRPC', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      value: "10.33.249.18:8470"
    }
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fce0252b668>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.33.249.18:8470', '_evaluation_master': 'grpc://10.33.249.18:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, p

In [0]:
# Train the model.
print('MRPC/CoLA on BERT base model normally takes about 2-3 minutes. Please wait...')
train_features = run_classifier.convert_examples_to_features(
    train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
print('***** Started training at {} *****'.format(datetime.datetime.now()))
print('  Num examples = {}'.format(len(train_examples)))
print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))
tf.logging.info("  Num steps = %d", num_train_steps)
train_input_fn = run_classifier.input_fn_builder(
    features=train_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=True)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print('***** Finished training at {} *****'.format(datetime.datetime.now()))

MRPC/CoLA on BERT base model normally takes about 2-3 minutes. Please wait...
INFO:tensorflow:Writing example 0 of 3668
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: train-1
INFO:tensorflow:tokens: [CLS] am ##ro ##zi accused his brother , whom he called " the witness " , of deliberately di ##stor ##ting his evidence . [SEP] referring to him as only " the witness " , am ##ro ##zi accused his brother of deliberately di ##stor ##ting his evidence . [SEP]
INFO:tensorflow:input_ids: 101 2572 3217 5831 5496 2010 2567 1010 3183 2002 2170 1000 1996 7409 1000 1010 1997 9969 4487 23809 3436 2010 3350 1012 102 7727 2000 2032 2004 2069 1000 1996 7409 1000 1010 2572 3217 5831 5496 2010 2567 1997 9969 4487 23809 3436 2010 3350 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
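
The training log above reports 3668 MRPC training examples. As a worked check of the step arithmetic in the configuration cell, this sketch recomputes the step counts from the notebook's hyperparameters (the lowercase variable names are local to this example):

```python
# Values copied from the notebook: 3668 examples (from the log),
# TRAIN_BATCH_SIZE = 32, NUM_TRAIN_EPOCHS = 3.0, WARMUP_PROPORTION = 0.1.
num_examples = 3668
train_batch_size = 32
num_train_epochs = 3.0
warmup_proportion = 0.1

# Same expressions as in the configuration cell; int() truncates toward zero.
num_train_steps = int(num_examples / train_batch_size * num_train_epochs)
num_warmup_steps = int(num_train_steps * warmup_proportion)

print(num_train_steps, num_warmup_steps)
```

Because `int()` truncates, the final partial step (343.875 rounds down to 343) is never run.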

In [0]:
# Eval the model.
eval_examples = processor.get_dev_examples(TASK_DATA_DIR)
eval_features = run_classifier.convert_examples_to_features(
    eval_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
print('***** Started evaluation at {} *****'.format(datetime.datetime.now()))
print('  Num examples = {}'.format(len(eval_examples)))
print('  Batch size = {}'.format(EVAL_BATCH_SIZE))
# Eval will be slightly WRONG on the TPU because it will truncate
# the last batch.
eval_steps = int(len(eval_examples) / EVAL_BATCH_SIZE)
eval_input_fn = run_classifier.input_fn_builder(
    features=eval_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=True)
result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
print('***** Finished evaluation at {} *****'.format(datetime.datetime.now()))
output_eval_file = os.path.join(OUTPUT_DIR, "eval_results.txt")
with tf.gfile.GFile(output_eval_file, "w") as writer:
  print("***** Eval results *****")
  for key in sorted(result.keys()):
    print('  {} = {}'.format(key, str(result[key])))
    writer.write("%s = %s\n" % (key, str(result[key])))

INFO:tensorflow:Writing example 0 of 408
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: dev-1
INFO:tensorflow:tokens: [CLS] he said the foods ##er ##vic ##e pie business doesn ' t fit the company ' s long - term growth strategy . [SEP] " the foods ##er ##vic ##e pie business does not fit our long - term growth strategy . [SEP]
INFO:tensorflow:input_ids: 101 2002 2056 1996 9440 2121 7903 2063 11345 2449 2987 1005 1056 4906 1996 2194 1005 1055 2146 1011 2744 3930 5656 1012 102 1000 1996 9440 2121 7903 2063 11345 2449 2515 2025 4906 2256 2146 1011 2744 3930 5656 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
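
The comment in the eval cell warns that evaluation truncates the last batch. The log above reports 408 dev examples, so a small sketch can show how many examples survive `drop_remainder=True` for a given batch size. `eval_coverage` is a hypothetical helper written for this illustration, not part of `run_classifier`:

```python
def eval_coverage(num_examples, batch_size):
    """Return (steps, evaluated, dropped) when the final partial batch is discarded."""
    steps = num_examples // batch_size
    evaluated = steps * batch_size
    return steps, evaluated, num_examples - evaluated

# EVAL_BATCH_SIZE = 8 as configured in this notebook.
print(eval_coverage(408, 8))
# A larger, hypothetical batch size for comparison.
print(eval_coverage(408, 32))
```

With the configured batch size of 8, the 408 MRPC dev examples divide evenly, so nothing is dropped in this particular run; other tasks or batch sizes may lose a few examples.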

In [0]:
! echo "i quit smoking.  how do i go about removing the tobacco premium?" | tee $TASK_DATA_DIR/input_sentence_01.txt
! echo "I was wondering if you can tell me how I can remove the tobacco user surcharge from my coverage plan since I am not a tobacco user" | tee $TASK_DATA_DIR/input_sentence_02.txt
! echo "how frequently is an annual checkup is covered?" | tee $TASK_DATA_DIR/input_sentence_03.txt
! echo "I want to know if general health checkup is covered in my insurance plan?" | tee $TASK_DATA_DIR/input_sentence_04.txt
! echo "are breast pumps covered under my plan?" | tee $TASK_DATA_DIR/input_sentence_05.txt
! echo "I was wondering where I can find info on breast pumps and how to get one that's covered by my insurance." | tee $TASK_DATA_DIR/input_sentence_06.txt
! ls -l $TASK_DATA_DIR/input_*.txt

i quit smoking.  how do i go about removing the tobacco premium?
I was wondering if you can tell me how I can remove the tobacco user surcharge from my coverage plan since I am not a tobacco user
how frequently is an annual checkup is covered?
I want to know if general health checkup is covered in my insurance plan?
are breast pumps covered under my plan?
I was wondering where I can find info on breast pumps and how to get one that's covered by my insurance.
-rw-r--r-- 1 root root  65 Feb 11 03:31 glue_data/MRPC/input_sentence_01.txt
-rw-r--r-- 1 root root 131 Feb 11 03:31 glue_data/MRPC/input_sentence_02.txt
-rw-r--r-- 1 root root  48 Feb 11 03:31 glue_data/MRPC/input_sentence_03.txt
-rw-r--r-- 1 root root  74 Feb 11 03:31 glue_data/MRPC/input_sentence_04.txt
-rw-r--r-- 1 root root  40 Feb 11 03:31 glue_data/MRPC/input_sentence_05.txt
-rw-r--r-- 1 root root 105 Feb 11 03:31 glue_data/MRPC/input_sentence_06.txt


In [0]:
#!gsutil -m cp -r $BERT_PRETRAINED_DIR .
! ls -l
! echo $TASK_DATA_DIR ;
! ls -l $TASK_DATA_DIR ;

total 24
-rw-r--r-- 1 root root 2711 Feb 11 03:05 adc.json
drwxr-xr-x 4 root root 4096 Feb 11 03:14 bert_repo
drwxr-xr-x 3 root root 4096 Feb 11 03:13 download_glue_repo
drwxr-xr-x 4 root root 4096 Feb 11 03:16 glue_data
drwxr-xr-x 1 root root 4096 Feb  6 17:31 sample_data
drwxr-xr-x 2 root root 4096 Feb 11 04:12 uncased_L-12_H-768_A-12
glue_data/MRPC
total 2892
-rw-r--r-- 1 root root    6222 Feb 11 03:17 dev_ids.tsv
-rw-r--r-- 1 root root  103255 Feb 11 03:17 dev.tsv
-rw-r--r-- 1 root root      65 Feb 11 03:31 input_sentence_01.txt
-rw-r--r-- 1 root root     131 Feb 11 03:31 input_sentence_02.txt
-rw-r--r-- 1 root root      48 Feb 11 03:31 input_sentence_03.txt
-rw-r--r-- 1 root root      74 Feb 11 03:31 input_sentence_04.txt
-rw-r--r-- 1 root root      40 Feb 11 03:31 input_sentence_05.txt
-rw-r--r-- 1 root root     105 Feb 11 03:31 input_sentence_06.txt
-rw-r--r-- 1 root root  432841 Feb 11 03:17 msr_paraphrase_test.txt
-rw-r--r-- 1 root root 1026746 Feb 11 03:17 msr_paraphrase_trai

In [0]:
! for i in `seq 1 6`; do python bert_repo/extract_features.py --input_file=glue_data/MRPC/input_sentence_0${i}.txt   --output_file=glue_data/MRPC/output_sentence_0${i}.jsonl   --vocab_file=uncased_L-12_H-768_A-12/vocab.txt   --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json   --init_checkpoint=uncased_L-12_H-768_A-12/bert_model.ckpt   --layers=-2   --max_seq_length=128   --batch_size=8; done;


For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:*** Example ***
INFO:tensorflow:unique_id: 0
INFO:tensorflow:tokens: [CLS] i quit smoking . how do i go about removing the tobacco premium ? [SEP]
INFO:tensorflow:input_ids: 101 1045 8046 9422 1012 2129 2079 1045 2175 2055 9268 1996 9098 12882 1029 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
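
The `input_ids` and `input_mask` lines in the log above follow a simple padding scheme: real tokens get mask value 1, and both lists are padded with zeros up to the maximum sequence length. A minimal sketch that rebuilds them for the first sentence, using the token IDs copied from the log (the padding logic is an illustration, not the BERT repository's code):

```python
MAX_SEQ_LENGTH = 128

# IDs for "[CLS] i quit smoking . how do i go about removing the
# tobacco premium ? [SEP]", copied from the log output above.
token_ids = [101, 1045, 8046, 9422, 1012, 2129, 2079, 1045,
             2175, 2055, 9268, 1996, 9098, 12882, 1029, 102]

# Real tokens are marked 1; padding positions are marked 0.
padding = MAX_SEQ_LENGTH - len(token_ids)
input_ids = token_ids + [0] * padding
input_mask = [1] * len(token_ids) + [0] * padding

print(len(input_ids), sum(input_mask))
```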

In [0]:
! ls -l $TASK_DATA_DIR ;
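# Shell variables set with `!` do not persist between lines, so collect the files in Python.
import glob
#from scipy.spatial import distance
EMBEDDING_FILES = sorted(glob.glob(os.path.join(TASK_DATA_DIR, 'output_*.jsonl')))
print(EMBEDDING_FILES)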

total 6444
-rw-r--r-- 1 root root    6222 Feb 11 03:17 dev_ids.tsv
-rw-r--r-- 1 root root  103255 Feb 11 03:17 dev.tsv
-rw-r--r-- 1 root root      65 Feb 11 03:31 input_sentence_01.txt
-rw-r--r-- 1 root root     131 Feb 11 03:31 input_sentence_02.txt
-rw-r--r-- 1 root root      48 Feb 11 03:31 input_sentence_03.txt
-rw-r--r-- 1 root root      74 Feb 11 03:31 input_sentence_04.txt
-rw-r--r-- 1 root root      40 Feb 11 03:31 input_sentence_05.txt
-rw-r--r-- 1 root root     105 Feb 11 03:31 input_sentence_06.txt
-rw-r--r-- 1 root root  432841 Feb 11 03:17 msr_paraphrase_test.txt
-rw-r--r-- 1 root root 1026746 Feb 11 03:17 msr_paraphrase_train.txt
-rw-r--r-- 1 root root  513627 Feb 11 04:15 output_sentence_01.jsonl
-rw-r--r-- 1 root root  995294 Feb 11 04:16 output_sentence_02.jsonl
-rw-r--r-- 1 root root  385010 Feb 11 04:16 output_sentence_03.jsonl
-rw-r--r-- 1 root root  577563 Feb 11 04:16 output_sentence_04.jsonl
-rw-r--r-- 1 root root  320838 Feb 11 04:16 output_sentence_05.jsonl
-rw

In [0]:
embeddings = []
for i in range(1, 7):
  path = os.path.join(TASK_DATA_DIR, 'output_sentence_0' + str(i) + '.jsonl')

  # Each file holds a single JSON record because each input file has one sentence.
  with open(path) as f:
    output = json.load(f)

  # Collect one vector per token (only layer -2 was extracted).
  token_vectors = []
  for feature in output['features']:
    for layer in feature['layers']:
      token_vectors.append(layer['values'])
  embeddings.append(token_vectors)

print('{}'.format(embeddings))


[[[0.228388, -0.452495, 0.074351, -0.444699, -0.454134, -0.451693, 0.339783, 0.322187, 0.596154, -0.730671, 0.306022, -0.02872, -0.118068, 0.066014, 0.422215, 0.038982, -0.42573, 0.073658, -0.208762, 0.282245, -0.27535, 0.218962, -0.005548, -0.424176, 0.050014, 0.2488, 0.250507, -0.327597, -0.472708, 0.665794, -0.219003, -0.49515, -0.13894, -0.342008, 0.185133, 0.119136, 0.386916, -0.406668, -0.376325, -0.216163, -1.162496, 0.084736, -0.018365, -0.064183, -0.094544, -0.240918, -3.471137, -0.104011, -0.204922, -0.414086, -0.079023, -0.311203, -0.230259, 0.392347, 0.429423, 0.041858, 0.094038, -0.618071, -0.413706, 0.074038, 0.271649, 0.281195, -0.41303, 0.005702, -0.131151, -0.300389, 0.113067, -0.000956, -0.226261, 0.368788, -0.200331, -0.170206, 0.208057, 0.120881, 0.029803, -0.48894, -0.288363, 0.079719, -0.575602, -0.443352, -0.666888, 1.16996, -0.063724, -0.259889, 0.188365, 0.674972, 0.072165, 0.143701, 0.209511, 0.245016, -0.610613, 0.09305, -0.046465, 0.413741, 0.360716, 0.01447
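
The nested lists printed above come from the per-token records written by `extract_features.py`. The sketch below parses a tiny hypothetical record with the same `features` / `layers` / `values` nesting that the loading cell relies on. The `token`, `index` and `linex_index` keys are assumptions about the script's output, and the three-element vectors stand in for the 768 values of the base model:

```python
import json

# Hypothetical two-token record; real records carry one entry per
# WordPiece token and 768 values per requested layer.
sample_line = json.dumps({
    "linex_index": 0,  # assumed key name, not used by the notebook
    "features": [
        {"token": "[CLS]", "layers": [{"index": -2, "values": [0.1, 0.2, 0.3]}]},
        {"token": "hello", "layers": [{"index": -2, "values": [0.4, 0.5, 0.6]}]},
    ],
})

record = json.loads(sample_line)
tokens = [feature["token"] for feature in record["features"]]
vectors = [feature["layers"][0]["values"] for feature in record["features"]]

print(tokens)
print(len(vectors), len(vectors[0]))
```

A file with several input sentences would contain one such record per line, in which case each line would need its own `json.loads` call rather than a single `json.load` on the file.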

In [0]:
import numpy as np

sentences = []

for embed in embeddings:
  print('{}'.format(embed))
  sentences.append(np.sum(embed, axis=0))

[[0.228388, -0.452495, 0.074351, -0.444699, -0.454134, -0.451693, 0.339783, 0.322187, 0.596154, -0.730671, 0.306022, -0.02872, -0.118068, 0.066014, 0.422215, 0.038982, -0.42573, 0.073658, -0.208762, 0.282245, -0.27535, 0.218962, -0.005548, -0.424176, 0.050014, 0.2488, 0.250507, -0.327597, -0.472708, 0.665794, -0.219003, -0.49515, -0.13894, -0.342008, 0.185133, 0.119136, 0.386916, -0.406668, -0.376325, -0.216163, -1.162496, 0.084736, -0.018365, -0.064183, -0.094544, -0.240918, -3.471137, -0.104011, -0.204922, -0.414086, -0.079023, -0.311203, -0.230259, 0.392347, 0.429423, 0.041858, 0.094038, -0.618071, -0.413706, 0.074038, 0.271649, 0.281195, -0.41303, 0.005702, -0.131151, -0.300389, 0.113067, -0.000956, -0.226261, 0.368788, -0.200331, -0.170206, 0.208057, 0.120881, 0.029803, -0.48894, -0.288363, 0.079719, -0.575602, -0.443352, -0.666888, 1.16996, -0.063724, -0.259889, 0.188365, 0.674972, 0.072165, 0.143701, 0.209511, 0.245016, -0.610613, 0.09305, -0.046465, 0.413741, 0.360716, 0.014472

In [0]:
def compute_cosine_similarity(x, y):
  return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

cosine_similarities = []
for i in [0, 2, 4]:
  cs = compute_cosine_similarity(sentences[i], sentences[i+1])
  cosine_similarities.append(cs)
  print('Cosine Similarity for sentence {} and {}: {}'.format(i+1, i+2, cs))


Cosine Similarity for sentence 1 and 2: 0.7819853239942068
Cosine Similarity for sentence 3 and 4: 0.8574093574136543
Cosine Similarity for sentence 5 and 6: 0.8145740832188282
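
As a sanity check on the cosine similarity formula used above, the same quantity can be written with `np.linalg.norm` and tried on toy vectors whose answer is known by hand. The vectors below are made up for this sketch:

```python
import numpy as np

def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a = [1.0, 0.0]
b = [1.0, 1.0]   # 45 degrees away from a
c = [3.0, 0.0]   # same direction as a, three times as long

print(cosine_similarity(a, b))  # cos(45 degrees), about 0.7071
print(cosine_similarity(a, c))  # same direction, so about 1.0
```

The second call illustrates that cosine similarity ignores vector length, which matters here because the sentence vectors are sums over a varying number of tokens.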


In [0]:
from scipy.spatial import distance
from math import sqrt

dst = distance.euclidean(sentences[0], sentences[1])

for i in [0, 2, 4]:
  ed_1 = distance.euclidean(sentences[i], sentences[i+1])
  ed_2 = sqrt(sum((sentences[i] - sentences[i+1])**2))

  print('Euclidean distance for sentence {} and {}: {}'.format(i+1, i+2, ed_1))
  print('Manually computed Euclidean distance for sentence {} and {}: {}'.format(i+1, i+2, ed_2))

Euclidean distance for sentence 1 and 2: 287.2174552478757
Manually computed Euclidean distance for sentence 1 and 2: 287.2174552478758
Euclidean distance for sentence 3 and 4: 130.829968957146
Manually computed Euclidean distance for sentence 3 and 4: 130.82996895714606
Euclidean distance for sentence 5 and 6: 238.27785721564973
Manually computed Euclidean distance for sentence 5 and 6: 238.27785721564973
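
One possible reading of the large distances above is that the sentence vectors are sums over tokens, so Euclidean distance also reflects how many tokens each sentence has. The sketch below uses made-up two-dimensional token embeddings to compare sum pooling with mean pooling; mean pooling is a swapped-in alternative for illustration, not what the notebook computes:

```python
import numpy as np

# Hypothetical token embeddings. The "long" sentence repeats the short
# sentence's tokens, so both have the same average token vector.
short_tokens = np.array([[1.0, 2.0], [3.0, 4.0]])
long_tokens = np.vstack([short_tokens, short_tokens])

sum_short, sum_long = short_tokens.sum(axis=0), long_tokens.sum(axis=0)
mean_short, mean_long = short_tokens.mean(axis=0), long_tokens.mean(axis=0)

sum_distance = np.linalg.norm(sum_short - sum_long)
mean_distance = np.linalg.norm(mean_short - mean_long)

print(sum_distance)   # non-zero: driven by the difference in token count
print(mean_distance)  # zero for this toy case
```

In the notebook this would correspond to replacing `np.sum(embed, axis=0)` with `np.mean(embed, axis=0)`; whether that gives more useful distances for the six insurance sentences has not been tested here.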


In [0]:
! gsutil -m cp $TASK_DATA_DIR/input_*.txt gs://ucb_w251/input
! gsutil -m cp $TASK_DATA_DIR/output_*.jsonl gs://ucb_w251/output

Copying file://glue_data/MRPC/input_sentence_01.txt [Content-Type=text/plain]...
Copying file://glue_data/MRPC/input_sentence_06.txt [Content-Type=text/plain]...
Copying file://glue_data/MRPC/input_sentence_02.txt [Content-Type=text/plain]...
/ [0/6 files][    0.0 B/  463.0 B]   0% Done                                    / [0/6 files][    0.0 B/  463.0 B]   0% Done                                    Copying file://glue_data/MRPC/input_sentence_04.txt [Content-Type=text/plain]...
/ [0/6 files][    0.0 B/  463.0 B]   0% Done                                    / [0/6 files][    0.0 B/  463.0 B]   0% Done                                    Copying file://glue_data/MRPC/input_sentence_03.txt [Content-Type=text/plain]...
Copying file://glue_data/MRPC/input_sentence_05.txt [Content-Type=text/plain]...
/ [6/6 files][  463.0 B/  463.0 B] 100% Done                                    
Operation completed over 6 objects/463.0 B.                                      
Copying file://glue_data/MR