## Augmentation Inspection
This notebook samples the augmented training data so that it can be inspected by eye.

In [1]:
import numpy as np
import os
import tensorflow as tf
import collections
import pandas as pd
from utils import tokenization
tf.enable_eager_execution()



### Get Paths to the Different Data Augmentation Directories

In [2]:
data_base_path = './Data/proc_Data/GoT/unsup'
prob_factors = np.arange(0.1,0.2,0.1)
copy_number = '0'
data_record_paths = [os.path.join(data_base_path, 'tf_idf-{:0.1f}'.format(x), copy_number, "tf_examples.tfrecord*") for x in prob_factors]
data_files = [tf.contrib.slim.parallel_reader.get_data_files(
          data_record_path) for data_record_path in data_record_paths]

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
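Since the notice above says `tf.contrib` will not be part of TensorFlow 2.0, the file lookup could also be done without it. The following is a minimal sketch, assuming `get_data_files` is only used here to expand a glob pattern on the local filesystem; `find_record_files` is a hypothetical helper, not part of the original notebook.

```python
import glob
import os


def find_record_files(base_path, prob_factors, copy_number="0",
                      pattern="tf_examples.tfrecord*"):
    """Return one sorted list of TFRecord paths per augmentation probability.

    Hypothetical helper: mirrors the directory layout used above
    (<base>/tf_idf-<p>/<copy>/tf_examples.tfrecord*), using only glob.
    """
    files_per_factor = []
    for p in prob_factors:
        record_glob = os.path.join(
            base_path, "tf_idf-{:0.1f}".format(p), copy_number, pattern)
        # sorted() makes the order deterministic; glob order is arbitrary.
        files_per_factor.append(sorted(glob.glob(record_glob)))
    return files_per_factor
```

A call such as `find_record_files('./Data/proc_Data/GoT/unsup', [0.1])` would then stand in for the `data_files` list built above. A probability with no matching directory yields an empty list rather than an error.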



### Feature Specifications: A Mapping of the "Columns" of Data Stored in the TFRecord Files

In [3]:
max_seq_len = 128
feature_specs = collections.OrderedDict()
feature_specs["ori_input_ids"] = tf.io.FixedLenFeature([max_seq_len], tf.int64)
feature_specs["ori_input_mask"] = tf.io.FixedLenFeature([max_seq_len], tf.int64)
feature_specs["ori_input_type_ids"] = tf.io.FixedLenFeature([max_seq_len], tf.int64)
feature_specs["aug_input_ids"] = tf.io.FixedLenFeature([max_seq_len], tf.int64)
feature_specs["aug_input_mask"] = tf.io.FixedLenFeature([max_seq_len], tf.int64)
feature_specs["aug_input_type_ids"] = tf.io.FixedLenFeature([max_seq_len], tf.int64)
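To illustrate how such a mapping is typically consumed, here is a sketch that decodes records with `tf.data.TFRecordDataset` and `tf.io.parse_single_example`. It assumes eager execution (enabled in the first cell) and a TensorFlow version that provides both calls. `unsup_feature_names` and `read_parsed_examples` are hypothetical helpers, and TensorFlow is imported lazily so the name helper can be used without it.

```python
import collections
import itertools


def unsup_feature_names(prefixes=("ori", "aug"),
                        fields=("input_ids", "input_mask", "input_type_ids")):
    """Build the feature keys used above, e.g. 'ori_input_ids'."""
    return ["{}_{}".format(p, f) for p, f in itertools.product(prefixes, fields)]


def read_parsed_examples(record_path, max_seq_len=128, limit=1):
    """Yield up to `limit` parsed examples as dicts of NumPy arrays.

    Assumes every feature is a fixed-length int64 vector, as in the
    specification above.
    """
    import tensorflow as tf  # imported lazily; requires eager execution

    specs = collections.OrderedDict(
        (name, tf.io.FixedLenFeature([max_seq_len], tf.int64))
        for name in unsup_feature_names())
    dataset = tf.data.TFRecordDataset(record_path).take(limit)
    for serialized in dataset:
        parsed = tf.io.parse_single_example(serialized, specs)
        yield {name: tensor.numpy() for name, tensor in parsed.items()}
```

Parsing against a specification has one advantage over indexing the protobuf by hand: a record whose vector is not exactly `max_seq_len` long fails loudly instead of being silently truncated.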


### Use the cell below to create mappings of words to ids and ids to words

In [4]:
vocab_file = "./bert_pretrained/bert_base/vocab.txt"

vocab = tokenization.load_vocab(vocab_file)
ids_dict = tokenization.load_ids(vocab)

In [5]:
for factor_idx, files in enumerate(data_files):
    # Inspect only the first record of the last shard for each probability factor.
    for example in tf.python_io.tf_record_iterator(files[-1]):
        a = tf.train.Example.FromString(example)
        # Copy the fixed-length id vectors out of the protobuf.
        orig_int_list = list(a.features.feature['ori_input_ids'].int64_list.value[:max_seq_len])
        aug_int_list = list(a.features.feature['aug_input_ids'].int64_list.value[:max_seq_len])
        # Map the ids back to WordPiece tokens for printing.
        orig_seq = tokenization.convert_ids_to_words(orig_int_list, ids_dict)
        aug_seq = tokenization.convert_ids_to_words(aug_int_list, ids_dict)
        print("Original Sequence:\n {}\n\n".format(" ".join(orig_seq)))
        print("Augmented Sequence with p={}:\n {}\n\n".format(prob_factors[factor_idx], " ".join(aug_seq)))
        break

Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
Original Sequence:
 [CLS] brien ##ne was moving , slow and wary , sword to hand ; step , turn , and listen . each step made a little splash . a cave lion ? dire ##wo ##lves ? some bear ? tell me , jaime . what lives here ? what lives in the darkness ? doom . no bear , he knew . no lion . only doom . in the cool silvery - blue light of the swords , the big wen ##ch looked pale and fierce . i mis ##like this place . i ’ m not fond of it myself . their blades made a little island of light , but all around them stretched a sea of darkness , une ##nding . [SEP] [PAD] [PAD]


Augmented Sequence with p=0.1:
 [CLS] moving , slow and wary , sword to hand ; step , turn , and listen figured bath ##house step made a wrists splash gods ##way a cave lion ? dire ##wo ##lves ? some bear ? tell tap , jaime . what lives here ? what lives in ( darkness trees doom . no bear , he knew . no moaning . only doom . in the cool
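The deprecation notice above points to `tf.data.TFRecordDataset`. When TensorFlow is not at hand, the raw records can still be pulled out with the standard library, because a TFRecord file is a sequence of frames: an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload. The sketch below skips CRC verification and assumes an uncompressed file; `iter_tfrecord_bytes` is a hypothetical helper, not something the notebook defines.

```python
import struct


def iter_tfrecord_bytes(path):
    """Yield the raw payload of each record in an uncompressed TFRecord file.

    Sketch only: the two CRC fields are read and discarded, so corrupted
    files are not detected.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 CRC of the length
            if len(header) < 12:
                return  # clean end of file (or a truncated header)
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            if len(payload) < length:
                return  # truncated record; stop instead of yielding garbage
            f.read(4)  # uint32 CRC of the payload, not verified here
            yield payload
```

Each yielded payload is the same serialized message that `tf.train.Example.FromString` consumes in the cell above, so decoding the features still requires the TensorFlow protobuf definitions.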

In [6]:
a = list(zip(orig_int_list, aug_int_list))
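Pairing the two lists position by position becomes hard to read once a token is inserted or dropped, because everything after it shifts. As an alternative sketch (not part of the original notebook), `difflib.SequenceMatcher` from the standard library aligns the two token lists and reports only the edited spans. `diff_tokens` is a hypothetical helper; the toy tokens are taken from the printed sequences above.

```python
import difflib


def diff_tokens(orig_tokens, aug_tokens):
    """Return (tag, original_span, augmented_span) for each changed region.

    tag is 'replace', 'delete', or 'insert', as reported by SequenceMatcher.
    """
    matcher = difflib.SequenceMatcher(a=orig_tokens, b=aug_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, orig_tokens[i1:i2], aug_tokens[j1:j2]))
    return changes


orig = "no bear , he knew . no lion . only doom .".split()
aug = "no bear , he knew . no moaning . only doom .".split()
for change in diff_tokens(orig, aug):
    print(change)
```

The same function works on the id lists directly, since `SequenceMatcher` only needs hashable elements.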

## Generate Consistent Training Examples for Baseline
For our baseline model, we want to use exactly the same examples as in BERT fine-tuning and UDA modeling. The code below therefore reads the examples from the TFRecord files and saves the tokenized sequences to a pickle file. The baseline models then take these pickle files as input, read in with pandas.

In [33]:
data_base_path = './Data/proc_Data/GoT/test/'
data_record_path = os.path.join(data_base_path, "tf_examples.tfrecord*")
data_files = tf.contrib.slim.parallel_reader.get_data_files(
          data_record_path)

In [34]:
max_seq_len = 128
feature_specs = collections.OrderedDict()
feature_specs["input_ids"] = tf.io.FixedLenFeature([max_seq_len], tf.int64)
feature_specs["input_mask"] = tf.io.FixedLenFeature([max_seq_len], tf.int64)
feature_specs["input_type_ids"] = tf.io.FixedLenFeature([max_seq_len], tf.int64)
feature_specs["label_ids"] = tf.io.FixedLenFeature([1], tf.int64)


In [35]:
labels = []
seqs = []
for i,infile in enumerate(data_files):
    for example in tf.python_io.tf_record_iterator(infile):
        a = tf.train.Example.FromString(example)
        temp_labels = a.features.feature['label_ids'].int64_list.value[0]
        orig_int_list = [a.features.feature['input_ids'].int64_list.value[i] for i in range(0,128)]
        orig_seq = tokenization.convert_ids_to_words(orig_int_list, ids_dict)
        labels.append(temp_labels)
        seqs.append(orig_seq)
        

In [37]:
# Convert each list of tokens to one string
seqs = [" ".join(seq) for seq in seqs]

# Convert the label ids back to book labels
labels = np.array(labels) + 1

In [38]:
df = pd.DataFrame()
df['seq'] = seqs
df['label'] = labels
df.to_pickle(os.path.join(data_base_path,'test.pkl'))

## Explore the XLNet TFRecords

In [13]:
import sentencepiece as spm

In [14]:
xl_spmodel = "./xlnet_pretrained/xlnet_base/spiece.model"

xl_features = collections.OrderedDict()
xl_features["input_ids"] = tf.io.FixedLenFeature([max_seq_len], tf.int64)
# input_mask is read through float_list in the loops below, so it is declared as float32 here.
xl_features["input_mask"] = tf.io.FixedLenFeature([max_seq_len], tf.float32)
xl_features["segment_ids"] = tf.io.FixedLenFeature([max_seq_len], tf.int64)
xl_features["label_ids"] = tf.io.FixedLenFeature([1], tf.int64)
xl_features["is_real_example"] = tf.io.FixedLenFeature([1], tf.int64)

In [15]:
sp = spm.SentencePieceProcessor()
sp.Load(xl_spmodel)

True

In [16]:
!ls ./Data/proc_Data/GoT_xlnet/train_20/

old.spiece.model.len-128.train.tf_record
tf_examples.tfrecord.0.0


In [42]:
xl_data_base_path = './Data/proc_Data/GoT_xlnet/test/'
xl_data_record_path = os.path.join(xl_data_base_path, "tf_examples.tfrecord*")
xl_data_files = tf.contrib.slim.parallel_reader.get_data_files(
          xl_data_record_path)

In [67]:
labels = []
seqs = []
for i,infile in enumerate(xl_data_files):
    for example in tf.python_io.tf_record_iterator(infile):
        a = tf.train.Example.FromString(example)
        id_list = [a.features.feature['input_ids'].int64_list.value[i] for i in range(0,128)]
        label = [a.features.feature['label_ids'].int64_list.value[i] for i in range(0,1)]
        labels.append(label[0]+1)
        mask = [a.features.feature['input_mask'].float_list.value[i] for i in range(0,128)]
        piece_list = [sp.IdToPiece(i) for i in id_list]
        seqs.append(piece_list)




In [69]:
len(labels)

1938

In [70]:
# Convert each list of tokens to one string
seqs = [" ".join(piece) for piece in seqs]

xldf = pd.DataFrame()
xldf['seq'] = seqs
xldf['label'] = labels
xldf.to_pickle(os.path.join(xl_data_base_path,'test.pkl'))

## Double check data didn't change
I changed the XLNet preprocessing to make it more consistent with the BERT preprocessing. The cells below should return the same results as the cells above.
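One way to make this check less dependent on eyeballing printed output is to hash every record before and after the preprocessing change and compare the digests. The sketch below assumes each record has already been reduced to an (input ids, label) pair, as in the loops above. `fingerprint_records` is a hypothetical helper, and the ids are toy values rather than real data.

```python
import hashlib


def fingerprint_records(records):
    """Return a SHA-256 hex digest over a sequence of (ids, label) pairs.

    Order-sensitive: the same records in a different order give a
    different digest.
    """
    digest = hashlib.sha256()
    for ids, label in records:
        line = ",".join(str(int(i)) for i in ids) + "|" + str(int(label)) + "\n"
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


# Toy data standing in for records read before and after the change.
before = [([5, 17, 42], 2), ([8, 3, 0], 1)]
after_same = [([5, 17, 42], 2), ([8, 3, 0], 1)]
after_changed = [([5, 17, 43], 2), ([8, 3, 0], 1)]

print(fingerprint_records(before) == fingerprint_records(after_same))     # True
print(fingerprint_records(before) == fingerprint_records(after_changed))  # False
```

Matching digests show the two files hold identical ids and labels in the same order. A mismatch only says that something differs, so a per-record comparison would still be needed to locate it.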

In [54]:
xl_data_base_path = './Data/proc_Data/GoT_xlnet/train_20/'
xl_data_record_path = os.path.join(xl_data_base_path, "spiece.model.len-128.train.tf_record")
xl_data_files = tf.contrib.slim.parallel_reader.get_data_files(
          xl_data_record_path)

In [55]:
labels = []
seqs = []
for i,infile in enumerate(xl_data_files):
    for example in tf.python_io.tf_record_iterator(infile):
        a = tf.train.Example.FromString(example)
        id_list = [a.features.feature['input_ids'].int64_list.value[i] for i in range(0,128)]
        label_list = [a.features.feature['label_ids'].int64_list.value[i] for i in range(0,1)]
#         print(id_list)
        piece_list = [sp.IdToPiece(i) for i in id_list]
        print(" ".join(piece_list))
        print(label_list)
        break

<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> ▁Gen dry ’ s ▁mare ▁lost ▁her ▁footing ▁in ▁the ▁mud ▁once , ▁going ▁down ▁hard ▁on ▁her ▁hind quarter s ▁and ▁spilling ▁him ▁from ▁the ▁saddle , ▁but ▁neither ▁horse ▁nor ▁rider ▁was ▁hurt , ▁and ▁Gen dry ▁got ▁that ▁stubborn ▁look ▁on ▁his ▁face ▁and ▁mounted ▁right ▁up ▁again . ▁Not ▁long ▁after , ▁they ▁came ▁upon ▁three ▁wolves ▁devour ing ▁the ▁corpse ▁of ▁a ▁f awn . ▁When ▁Hot ▁Pie ’ s ▁horse ▁caught ▁the ▁scent , ▁he ▁ shi ed ▁and ▁bolt ed . ▁Two ▁of ▁the ▁wolves ▁fled ▁as ▁well , ▁but ▁the ▁third ▁raised ▁his ▁head ▁and ▁bar ed ▁his ▁teeth , ▁prepared ▁to ▁defend ▁his ▁kill . <sep> <cls>
[2]
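The printed pieces are easier to read once the SentencePiece markers are undone. The following is a minimal sketch, assuming the leading `<unk>` pieces are left padding and that `<sep>` and `<cls>` are the only other special pieces present, as in the output above. `pieces_to_text` is a hypothetical helper.

```python
def pieces_to_text(pieces, specials=("<unk>", "<sep>", "<cls>")):
    """Join SentencePiece pieces into plain text.

    Drops the special pieces listed in `specials`, then turns the
    word-boundary marker (U+2581) back into a space.
    """
    kept = [p for p in pieces if p not in specials]
    return "".join(kept).replace("\u2581", " ").strip()


pieces = ["<unk>", "<unk>", "▁Gen", "dry", "’", "s", "▁mare", "▁lost",
          "▁her", "▁footing", "<sep>", "<cls>"]
print(pieces_to_text(pieces))
```

Note that a genuine `<unk>` in the middle of a sentence would be dropped as well, so this is meant for quick visual inspection rather than exact reconstruction.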


## Look at augmented XLNet Data


In [26]:
for tfidf in np.arange(0.1,0.7,0.2):
    unsup_data_path = './Data/proc_Data/GoT_xlnet/unsup/tf_idf-{:0.1f}/0/'.format(tfidf)
    us_data_record_path = os.path.join(unsup_data_path, "tf_examples.tfrecord*")
    us_data_files = tf.contrib.slim.parallel_reader.get_data_files(
              us_data_record_path)
    for i,infile in enumerate(us_data_files):
        for example in tf.python_io.tf_record_iterator(infile):
            a = tf.train.Example.FromString(example)

            ori_id_list = [a.features.feature['ori_input_ids'].int64_list.value[i] for i in range(0,128)]
            aug_id_list = [a.features.feature['aug_input_ids'].int64_list.value[i] for i in range(0,128)]

            ori_piece_list = [sp.IdToPiece(i) for i in ori_id_list]
            aug_piece_list = [sp.IdToPiece(i) for i in aug_id_list]

            ori_mask_list = [a.features.feature['ori_input_mask'].float_list.value[i] for i in range(0,128)]
            aug_mask_list = [a.features.feature['aug_input_mask'].float_list.value[i] for i in range(0,128)]

            print("Original Sequence:\n {}\n\n".format(" ".join(ori_piece_list)))
            print("Augmented Sequence with p={:0.1f}:\n {}\n\n".format(tfidf, " ".join(aug_piece_list)))
            break
        break


Original Sequence:
 <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> ▁It ▁was ▁strangely ▁comforting ▁to ▁see ▁Ed d ’ s ▁do ur ▁face ▁again . ▁How ▁goes ▁the ▁restoration ▁work ? ▁he ▁asked ▁his ▁old ▁steward . ▁Ten ▁more ▁years ▁should ▁do ▁it , ▁Tol lett ▁replied ▁in ▁his ▁usual ▁gloomy ▁tone . ▁Place ▁was ▁over run ▁with ▁rats ▁when ▁we ▁moved ▁in . ▁The ▁spear wi ve s ▁killed ▁the ▁nasty ▁bug gers . ▁Now ▁the ▁place ▁is ▁over run ▁with ▁spear wi ve s . ▁There ’ s ▁days ▁I ▁want ▁the ▁rats ▁back . ▁How ▁do ▁you ▁find ▁serving ▁under ▁Iron ▁Em met t ? ▁Jon ▁asked . ▁Most ly ▁it ’ s ▁Black ▁Mari s ▁serving ▁under ▁him , ▁ m ’ lord . <sep> <cls>


Augmented Sequence with p=0.1:
 ▁It ▁was ▁strangely ▁comforting ▁to ▁see ▁Ed d ▁ ’ ▁ s ▁do ur ▁face ▁again ▁Owen ▁How ▁goes ▁the ▁restoration ▁work ▁ ? ▁he ▁handled ▁his ▁old ▁steward ▁ . ▁Ten ▁more ▁years ▁should ▁do ▁it ▁ , ▁Tol lett ▁replied ▁Horn foot ▁his ▁usual ▁gloomy ▁voice ▁restrain ▁Place ▁was ▁over run ▁with ▁rats ▁O go ▁sworn ▁trem

In [41]:
unsup_data_path = './Data/proc_Data/GoT_xlnet/unsup/tf_idf-0.1/0/'
us_data_record_path = os.path.join(unsup_data_path, "tf_examples.tfrecord*")
us_data_files = tf.contrib.slim.parallel_reader.get_data_files(
          us_data_record_path)
for i,infile in enumerate(us_data_files):
    for example in tf.python_io.tf_record_iterator(infile):
        a = tf.train.Example.FromString(example)
        
        ori_id_list = [a.features.feature['ori_input_ids'].int64_list.value[i] for i in range(0,128)]
        aug_id_list = [a.features.feature['aug_input_ids'].int64_list.value[i] for i in range(0,128)]
        
        ori_piece_list = [sp.IdToPiece(i) for i in ori_id_list]
        aug_piece_list = [sp.IdToPiece(i) for i in aug_id_list]
        
        ori_mask_list = [a.features.feature['ori_input_mask'].float_list.value[i] for i in range(0,128)]
        aug_mask_list = [a.features.feature['aug_input_mask'].float_list.value[i] for i in range(0,128)]
        
        print("Original Sequence:\n {}\n\n".format(" ".join(ori_piece_list)))
        print("Augmented Sequence with p=0.1:\n {}\n\n".format(" ".join(aug_piece_list)))
        break

Original Sequence:
 <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> ▁It ▁was ▁strangely ▁comforting ▁to ▁see ▁Ed d ’ s ▁do ur ▁face ▁again . ▁How ▁goes ▁the ▁restoration ▁work ? ▁he ▁asked ▁his ▁old ▁steward . ▁Ten ▁more ▁years ▁should ▁do ▁it , ▁Tol lett ▁replied ▁in ▁his ▁usual ▁gloomy ▁tone . ▁Place ▁was ▁over run ▁with ▁rats ▁when ▁we ▁moved ▁in . ▁The ▁spear wi ve s ▁killed ▁the ▁nasty ▁bug gers . ▁Now ▁the ▁place ▁is ▁over run ▁with ▁spear wi ve s . ▁There ’ s ▁days ▁I ▁want ▁the ▁rats ▁back . ▁How ▁do ▁you ▁find ▁serving ▁under ▁Iron ▁Em met t ? ▁Jon ▁asked . ▁Most ly ▁it ’ s ▁Black ▁Mari s ▁serving ▁under ▁him , ▁ m ’ lord . <sep> <cls>


Augmented Sequence with p=0.1:
 ▁It ▁was ▁strangely ▁comforting ▁to ▁see ▁Ed d ▁ ’ ▁ s ▁do ur ▁face ▁again ▁Owen ▁How ▁goes ▁the ▁restoration ▁work ▁ ? ▁he ▁handled ▁his ▁old ▁steward ▁ . ▁Ten ▁more ▁years ▁should ▁do ▁it ▁ , ▁Tol lett ▁replied ▁Horn foot ▁his ▁usual ▁gloomy ▁voice ▁restrain ▁Place ▁was ▁over run ▁with ▁rats ▁O go ▁sworn ▁trem