## Import dataset

Data augmentation: we perform synonym and random
word replacement with NLTK and WordNet on the contexts
of the SQuAD dataset. Questions are left unchanged. We
explored different strategies of synonym replacement
(sampling rate, +random words, +stop words) and
injected different amounts of augmented data (x0.33, x1, x2,
x3) on top of the original data in our experiments.
- For each word in a context paragraph:
  - 20% of the time, call `replace_synonym`:
    - if synonyms exist, replace the word with a random synonym;
    - otherwise, replace it with a random word.
  - 80% of the time, leave the word unchanged.

http://web.stanford.edu/class/cs224n/posters/15845024.pdf
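
The replacement procedure quoted above can be sketched in a few lines. This is a minimal illustration, not the poster's implementation: `TOY_SYNONYMS` and `TOY_VOCABULARY` are hypothetical stand-ins for WordNet and for the pool of random words, and `augment` is a name made up for this example.

```python
import random

# Hypothetical stand-ins for WordNet and the random-word pool (assumptions).
TOY_SYNONYMS = {"big": ["large", "huge"], "happy": ["glad", "joyful"]}
TOY_VOCABULARY = ["apple", "river", "seven", "quiet"]


def augment(words, rate=0.2, rng=None):
    """Replace each word with probability `rate`, as in the list above."""
    rng = rng or random.Random()
    augmented = []
    for word in words:
        if rng.random() >= rate:
            # Most of the time (80% at rate=0.2) the word stays unchanged.
            augmented.append(word)
            continue
        synonyms = TOY_SYNONYMS.get(word)
        if synonyms:
            # Synonyms exist: pick one at random.
            augmented.append(rng.choice(synonyms))
        else:
            # No synonyms: fall back to a random word.
            augmented.append(rng.choice(TOY_VOCABULARY))
    return augmented


sentence = "the big dog was happy".split()
print(augment(sentence, rate=0.2, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps a run reproducible. The notebook below uses a simpler variant that has no random-word fallback.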

In [1]:
import tensorflow as tf
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle

tf.logging.set_verbosity(tf.logging.ERROR)  # suppress some deprecation warnings

device_name = tf.test.gpu_device_name()

if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')



In [2]:
import pandas as pd
import os

data_directory = './data'

data_val = pd.read_csv(os.path.join(data_directory, 'cloze_test_val__spring2016 - cloze_test_ALL_val.csv'), header='infer')
data_test = pd.read_csv(os.path.join(data_directory, 'cloze_test_test__spring2016 - cloze_test_ALL_test.csv'), header='infer')

In [3]:
from sklearn.model_selection import train_test_split
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from datetime import datetime
import numpy as np

W0525 16:20:21.492210 47019843706304 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


In [4]:
import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization

In [198]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /cluster/home/sanagnos/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /cluster/home/sanagnos/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /cluster/home/sanagnos/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [199]:
def create_dataset(data_val):
    contexts = list()
    last_sentences = list()
    classes = list()
    for pos in range(len(data_val)):
        story_start = data_val.iloc[pos][['InputSentence' + str(i) for i in [1, 2, 3, 4]]].values
        
        contexts.append(" ".join(story_start))
        last_sentences.append(data_val.iloc[pos]['RandomFifthSentenceQuiz1'])
        contexts.append(" ".join(story_start))
        last_sentences.append(data_val.iloc[pos]['RandomFifthSentenceQuiz2'])
        
        if data_val.iloc[pos]['AnswerRightEnding'] == 1:
            classes.append(0)
            classes.append(1)
        else:
            classes.append(1)
            classes.append(0)
            
    return pd.DataFrame({'story': contexts, 'ending': last_sentences, 'class': classes})

val_pd = create_dataset(data_val)
test_pd = create_dataset(data_test)
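
To make the row layout concrete, here is a small sketch of what `create_dataset` produces for one story, written without pandas. The helper name `expand_story` and the story text are made up for illustration. Class 0 marks the right ending and class 1 the wrong one, matching the branch on `AnswerRightEnding` above.

```python
def expand_story(sentences, ending_1, ending_2, right_ending):
    """Turn one cloze item into two (story, ending, class) rows."""
    context = " ".join(sentences)
    # Class 0 = correct ending, class 1 = wrong ending.
    first_class = 0 if right_ending == 1 else 1
    return [
        {"story": context, "ending": ending_1, "class": first_class},
        {"story": context, "ending": ending_2, "class": 1 - first_class},
    ]


# Made-up example item (not taken from the dataset).
rows = expand_story(
    ["Tom woke up late.", "He skipped breakfast.",
     "He ran to the bus stop.", "The bus was pulling away."],
    "Tom waved and the driver stopped for him.",
    "Tom went swimming in the bus.",
    right_ending=1,
)
for row in rows:
    print(row["class"], row["ending"])
```

Each story therefore contributes two consecutive rows. The later cells rely on this when they read labels with `[::2]` and step through predictions two at a time.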

In [200]:
from sklearn.utils import shuffle

train = shuffle(val_pd)
train_unshuffled = val_pd
test = test_pd

print(len(train))
print(len(test))

3742
3742


In [206]:
CONTEXT_COLUMN = 'story'
ENDING_COLUMN = 'ending'
LABEL_COLUMN = 'class'

# label_list is 0 for a true story and 1 for a false story
label_list = [0, 1]

REPLICATION_FACTOR = 5

train_InputExamples = pd.concat([train]*REPLICATION_FACTOR).apply(lambda x: 
                                                                  bert.run_classifier.InputExample(guid=None,
                                                                  text_a = x[CONTEXT_COLUMN], 
                                                                  text_b = x[ENDING_COLUMN], 
                                                                  label = x[LABEL_COLUMN]), axis = 1)

test_InputExamples = test.apply(lambda x: bert.run_classifier.InputExample(guid=None, 
                                                                   text_a = x[CONTEXT_COLUMN], 
                                                                   text_b = x[ENDING_COLUMN], 
                                                                   label = x[LABEL_COLUMN]), axis = 1)

In [108]:
# This is a path to an uncased (all lowercase) version of BERT
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
# BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-24_H-1024_A-16/1"

def create_tokenizer_from_hub_module():
    """Get the vocab file and casing info from the Hub module."""
    with tf.Graph().as_default():
        bert_module = hub.Module(BERT_MODEL_HUB)
        tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
        with tf.Session() as sess:
            vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                            tokenization_info["do_lower_case"]])
      
    return bert.tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()

In [207]:
from bert.run_classifier import PaddingInputExample, _truncate_seq_pair, InputFeatures
import nltk
from nltk.corpus import wordnet
import random

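def replace_with_synonym(token, tokenizer):
  """Return the first WordPiece of a random WordNet lemma for `token`.

  The token is returned unchanged when WordNet has no synset for it.
  The candidate list can include the token's own lemma, so a "replacement"
  may leave the token as it was.
  """
  synonyms = []
  for syn in wordnet.synsets(token):
    for lemma in syn.lemmas():
      synonyms.append(lemma.name())
  if not synonyms:
    return token
  # A lemma name may tokenize into several pieces; keep only the first one.
  pieces = tokenizer.tokenize(random.choice(synonyms))
  return pieces[0] if pieces else token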


def convert_examples_to_features(examples, label_list, max_seq_length,
                                 tokenizer, set_synonyms=False, percentage_synonyms=0.2):
  """Convert a set of `InputExample`s to a list of `InputFeatures`."""
  
  if not set_synonyms:
    percentage_synonyms = 0

  features = []
  for (ex_index, example) in enumerate(examples):
    if ex_index % 10000 == 0:
      tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

    feature = convert_single_example(ex_index, example, label_list,
                                     max_seq_length, tokenizer, percentage_synonyms)

    features.append(feature)
  return features


def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer, percentage_synonyms):
  """Converts a single `InputExample` into a single `InputFeatures`."""

  if isinstance(example, PaddingInputExample):
    return InputFeatures(
        input_ids=[0] * max_seq_length,
        input_mask=[0] * max_seq_length,
        segment_ids=[0] * max_seq_length,
        label_id=0,
        is_real_example=False)

  # Which tokens to replace with synonyms
  set_synonyms = np.random.choice([True, False], max_seq_length,
                                  p=[percentage_synonyms, 1 - percentage_synonyms])

  label_map = {}
  for (i, label) in enumerate(label_list):
    label_map[label] = i

  tokens_a = tokenizer.tokenize(example.text_a)
  tokens_b = None
  if example.text_b:
    tokens_b = tokenizer.tokenize(example.text_b)

  if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
  else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
      tokens_a = tokens_a[0:(max_seq_length - 2)]

  # The convention in BERT is:
  # (a) For sequence pairs:
  #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
  #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
  # (b) For single sequences:
  #  tokens:   [CLS] the dog is hairy . [SEP]
  #  type_ids: 0     0   0   0  0     0 0
  #
  # Where "type_ids" are used to indicate whether this is the first
  # sequence or the second sequence. The embedding vectors for `type=0` and
  # `type=1` were learned during pre-training and are added to the wordpiece
  # embedding vector (and position vector). This is not *strictly* necessary
  # since the [SEP] token unambiguously separates the sequences, but it makes
  # it easier for the model to learn the concept of sequences.
  #
  # For classification tasks, the first vector (corresponding to [CLS]) is
  # used as the "sentence vector". Note that this only makes sense because
  # the entire model is fine-tuned.
  tokens = []
  segment_ids = []
  tokens.append("[CLS]")
  segment_ids.append(0)
  index = 1
  for token in tokens_a:
    if set_synonyms[index]:
        tokens.append(replace_with_synonym(token, tokenizer))
    else:
        tokens.append(token)
    segment_ids.append(0)
    index += 1
  tokens.append("[SEP]")
  segment_ids.append(0)
  index += 1

  if tokens_b:
    for token in tokens_b:
      if set_synonyms[index]:
        tokens.append(replace_with_synonym(token, tokenizer))
      else:
        tokens.append(token)
      segment_ids.append(1)
      # Advance the position so each ending token gets its own random draw.
      index += 1
    tokens.append("[SEP]")
    segment_ids.append(1)

  input_ids = tokenizer.convert_tokens_to_ids(tokens)

  # The mask has 1 for real tokens and 0 for padding tokens. Only real
  # tokens are attended to.
  input_mask = [1] * len(input_ids)

  # Zero-pad up to the sequence length.
  while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

  assert len(input_ids) == max_seq_length
  assert len(input_mask) == max_seq_length
  assert len(segment_ids) == max_seq_length

  label_id = label_map[example.label]
  if ex_index < 5:
    tf.logging.info("*** Example ***")
    tf.logging.info("guid: %s" % (example.guid))
    tf.logging.info("tokens: %s" % " ".join(
        [tokenization.printable_text(x) for x in tokens]))
    tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
    tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
    tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
    tf.logging.info("label: %s (id = %d)" % (example.label, label_id))

  feature = InputFeatures(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids,
      label_id=label_id,
      is_real_example=True)
  return feature
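
The packing convention described in the comments above can be checked with a small pure-Python sketch. `pack_pair` is a hypothetical helper written for this illustration. It pads with the string `[PAD]` for readability, whereas the notebook pads `input_ids` with 0.

```python
def pack_pair(tokens_a, tokens_b, max_seq_length):
    """Build tokens, segment ids and input mask for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_seq_length:
        raise ValueError("pair is too long; truncate it first")
    # Segment 0 covers [CLS], the first sequence and its [SEP];
    # segment 1 covers the second sequence and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # 1 marks real tokens, 0 marks padding.
    input_mask = [1] * len(tokens)
    padding = max_seq_length - len(tokens)
    tokens = tokens + ["[PAD]"] * padding
    segment_ids = segment_ids + [0] * padding
    input_mask = input_mask + [0] * padding
    return tokens, segment_ids, input_mask


tokens_a = ["is", "this", "jack", "##son", "##ville", "?"]
tokens_b = ["no", "it", "is", "not", "."]
tokens, segment_ids, input_mask = pack_pair(tokens_a, tokens_b, 16)
print(tokens)
print(segment_ids)
print(input_mask)
```

Padding positions share segment id 0 with the story, so only `input_mask` distinguishes them from real story tokens.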

In [217]:
# We'll set sequences to be at most this many tokens long.
MAX_SEQ_LENGTH = 96
# Convert our train and test features to InputFeatures that BERT understands.
train_features = convert_examples_to_features(train_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer,
                                              set_synonyms=True, percentage_synonyms=0.2)

test_features = convert_examples_to_features(test_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)

In [218]:
tokens = tokenizer.tokenize("This here's an example of using the BERT tokenizer ||| test")
print(tokenizer.convert_tokens_to_ids(tokens))
tokens.append("[CLS]")
print(tokenizer.convert_tokens_to_ids(tokens))

[2023, 2182, 1005, 1055, 2019, 2742, 1997, 2478, 1996, 14324, 19204, 17629, 1064, 1064, 1064, 3231]
[2023, 2182, 1005, 1055, 2019, 2742, 1997, 2478, 1996, 14324, 19204, 17629, 1064, 1064, 1064, 3231, 101]


In [219]:
tf.reset_default_graph()

In [220]:
class DenseLayer(tf.keras.Model):
    def __init__(self, layers, dropout_keep_proba=0.9, activation=tf.nn.relu):
        super(DenseLayer, self).__init__()
        
        self.dense_layers = []
        self.dropout_keep_proba = dropout_keep_proba
        
        for i, layer_size in enumerate(layers):
            self.dense_layers.append(tf.keras.layers.Dense(layer_size, name='DenseLayer_' + str(i), use_bias=True, activation=activation))
    
    def call(self, logits):
        
        for layer in self.dense_layers:
            logits = layer(logits)
            logits = tf.nn.dropout(logits, keep_prob=self.dropout_keep_proba)

        return logits

In [221]:
def create_model(is_predicting, input_ids, input_mask, segment_ids, labels,
                 num_labels):
  """Creates a classification model."""

  bert_module = hub.Module(
      BERT_MODEL_HUB,
      trainable=True)
  bert_inputs = dict(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids)
  bert_outputs = bert_module(
      inputs=bert_inputs,
      signature="tokens",
      as_dict=True)

  output_layer = bert_outputs["pooled_output"]

  hidden_size = output_layer.shape[-1].value
  n_ctx = input_ids.shape[-1].value
 
  transformer_outputs = bert_outputs['sequence_output']

  # final output size
  weight_size_last_sentence = hidden_size * 3
  layers_for_last_sentence = [weight_size_last_sentence]
  assert weight_size_last_sentence == layers_for_last_sentence[-1]
  dense_last_sentence = DenseLayer(layers_for_last_sentence)
    
  segment_ids_expandend = tf.tile(tf.expand_dims(segment_ids, 2), [1, 1, weight_size_last_sentence])
  segment_ids_expandend = tf.cast(segment_ids_expandend, tf.float32)
  result = dense_last_sentence(transformer_outputs) * segment_ids_expandend

  output_layer_last_sentence = tf.reduce_sum(result, 1)
        
  output_layer = output_layer_last_sentence

  with tf.variable_scope("loss"):

    # Dropout helps prevent overfitting
    output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
    logits = tf.layers.dense(output_layer, 512, use_bias=True, activation=tf.nn.sigmoid)
    logits = tf.nn.dropout(logits, keep_prob=0.9)
    logits = tf.layers.dense(logits, num_labels, use_bias=True)

    log_probs = tf.nn.log_softmax(logits, axis=-1)
    
    # Convert labels into one-hot encoding
    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

    predicted_labels = tf.squeeze(tf.argmax(log_probs, axis=-1, output_type=tf.int32))
    # If we're predicting, we want the predicted labels and the log-probabilities.
    if is_predicting:
      return (predicted_labels, log_probs)

    # If we're train/eval, compute loss between predicted and actual label
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return (loss, predicted_labels, log_probs)
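
The loss in `create_model` is the usual cross-entropy computed from a log-softmax. A minimal pure-Python sketch for a single example follows. The helper names are made up for this illustration and the logits are arbitrary numbers.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for one example."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def cross_entropy(logits, label):
    """-sum(one_hot * log_probs), the same form as per_example_loss above."""
    log_probs = log_softmax(logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(len(logits))]
    return -sum(o * lp for o, lp in zip(one_hot, log_probs))


logits = [2.0, 0.5]  # arbitrary scores for classes 0 and 1
log_probs = log_softmax(logits)
predicted = max(range(len(log_probs)), key=lambda i: log_probs[i])
print(predicted, cross_entropy(logits, label=0), cross_entropy(logits, label=1))
```

Because the one-hot vector selects a single entry, the per-example loss is simply the negative log-probability of the true class.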


In [222]:
# model_fn_builder actually creates our model function
# using the passed parameters for num_labels, learning_rate, etc.
def model_fn_builder(num_labels, learning_rate, num_train_steps,
                     num_warmup_steps):
  """Returns `model_fn` closure for `tf.estimator.Estimator`."""
  def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
    """The `model_fn` for `tf.estimator.Estimator`."""

    input_ids = features["input_ids"]
    input_mask = features["input_mask"]
    segment_ids = features["segment_ids"]
    label_ids = features["label_ids"]

    is_predicting = (mode == tf.estimator.ModeKeys.PREDICT)
    
    # TRAIN and EVAL
    if not is_predicting:

      (loss, predicted_labels, log_probs) = create_model(
        is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)

      train_op = bert.optimization.create_optimizer(
          loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu=False)

      # Calculate evaluation metrics. 
      def metric_fn(label_ids, predicted_labels):
        accuracy = tf.metrics.accuracy(label_ids, predicted_labels)
        return {
            "eval_accuracy": accuracy,
        }

      eval_metrics = metric_fn(label_ids, predicted_labels)

      if mode == tf.estimator.ModeKeys.TRAIN:
        return tf.estimator.EstimatorSpec(mode=mode,
          loss=loss,
          train_op=train_op)
      else:
          return tf.estimator.EstimatorSpec(mode=mode,
            loss=loss,
            eval_metric_ops=eval_metrics)
    else:
      (predicted_labels, log_probs) = create_model(
        is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)

      predictions = {
          'probabilities': log_probs,
          'labels': predicted_labels
      }
      return tf.estimator.EstimatorSpec(mode, predictions=predictions)

  # Return the actual model function in the closure
  return model_fn


In [223]:
# Compute train and warmup steps from batch size
# These hyperparameters are copied from this colab notebook (https://colab.sandbox.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 3.0
# Warmup is a period during which the learning rate
# is small and gradually increases; this usually helps training.
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 50000
SAVE_SUMMARY_STEPS = 100000

OUTPUT_DIR = 'output_dir'
SAVE_RESULTS_DIR = 'results_predictions'

N_ESTIMATORS = 15

# Compute # train and warmup steps from batch size
num_train_steps = int(len(train_features) / REPLICATION_FACTOR / BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

assert REPLICATION_FACTOR == int(NUM_TRAIN_EPOCHS)

print('num_train_steps', num_train_steps)

# Specify the output directory and the checkpoint/summary intervals.
# The intervals exceed num_train_steps, which limits intermediate writes (disk quota).
run_config = tf.estimator.RunConfig(
    model_dir=OUTPUT_DIR,
    save_summary_steps=SAVE_SUMMARY_STEPS,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS)

model_fn = model_fn_builder(
  num_labels=len(label_list),
  learning_rate=LEARNING_RATE,
  num_train_steps=num_train_steps,
  num_warmup_steps=num_warmup_steps)

estimator = tf.estimator.Estimator(
  model_fn=model_fn,
  config=run_config,
  params={"batch_size": BATCH_SIZE})


test_input_fn = run_classifier.input_fn_builder(
    features=test_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=False)

num_train_steps 701
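
The printed value can be reproduced by hand from the numbers shown earlier (3742 story/ending pairs, replicated 5 times). The sketch below repeats the arithmetic with plain variables. `examples_seen` is an extra quantity added here for illustration.

```python
n_pairs = 3742            # rows in the training frame (printed above)
replication_factor = 5    # differently augmented copies of each row
batch_size = 16
num_train_epochs = 3.0
warmup_proportion = 0.1

n_features = n_pairs * replication_factor  # 18710 augmented features
# Dividing by the replication factor counts epochs over the original rows.
num_train_steps = int(n_features / replication_factor / batch_size * num_train_epochs)
num_warmup_steps = int(num_train_steps * warmup_proportion)
examples_seen = num_train_steps * batch_size

print(num_train_steps, num_warmup_steps, examples_seen)  # prints 701 70 11216
```

So a run consumes 11216 of the 18710 augmented features, which is about three passes over 3742 rows.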


In [224]:
def get_final_predictions(in_contexts, in_last_sentences):
    input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = y, label = 0) for x, y in zip(in_contexts, in_last_sentences)] # guid is unused here; label=0 is just a dummy label
    input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
    predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
    predictions = estimator.predict(predict_input_fn)
    predictions = [prediction['probabilities'] for prediction in predictions]

    return predictions

def combine_predictions(predictions):
    my_predictions = []

    i = 0
    while i < len(predictions):
        p_first = np.exp(predictions[i])
        p_second = np.exp(predictions[i + 1])

        p1 = p_first[0] + p_second[1]
        p2 = p_first[1] + p_second[0]

        if p1 > p2:
            my_predictions.append(1)
        else:
            my_predictions.append(2)
        i += 2
        
    return np.array(my_predictions)
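
A worked example of the pairing rule in `combine_predictions`, in pure Python. The function name `choose_ending` and the probabilities are made up for illustration. Each input is a pair of log-probabilities `[log P(class 0), log P(class 1)]`, where class 0 means "this ending is right".

```python
import math


def choose_ending(log_probs_first, log_probs_second):
    """Return 1 or 2, the ending favoured by the two rows taken together."""
    p_first = [math.exp(v) for v in log_probs_first]
    p_second = [math.exp(v) for v in log_probs_second]
    # Evidence for ending 1: first row says "right", second row says "wrong".
    score_1 = p_first[0] + p_second[1]
    # Evidence for ending 2: the opposite reading of both rows.
    score_2 = p_first[1] + p_second[0]
    return 1 if score_1 > score_2 else 2


# Both rows individually lean towards "right", which is inconsistent;
# the combined score still picks the more confident of the two.
first = [math.log(0.9), math.log(0.1)]
second = [math.log(0.6), math.log(0.4)]
print(choose_ending(first, second))  # prints 1
```

Scoring the two rows jointly guarantees exactly one ending per story, even when the classifier's two independent predictions contradict each other.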

In [None]:
!rm -rf {SAVE_RESULTS_DIR} || true
!mkdir {SAVE_RESULTS_DIR}

true_labels_train = train_unshuffled['class'].values[::2] + 1
true_labels_val = test['class'].values[::2] + 1

for i in range(N_ESTIMATORS):
    !rm -rf {OUTPUT_DIR} || true
    
    train_features = shuffle(train_features)
    
    # Create an input function for training. drop_remainder=True is only required on TPUs, so it stays False here.
    train_input_fn = bert.run_classifier.input_fn_builder(
        features=train_features,
        seq_length=MAX_SEQ_LENGTH,
        is_training=True,
        drop_remainder=False)

    
    print(f'========= BEGINNING TRAINING FOR CLASSIFIER {i:2d} =========')
    current_time = datetime.now()
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)

    print("Training took time ", datetime.now() - current_time)
        
    predictions = get_final_predictions(train_unshuffled['story'].values, train_unshuffled['ending'].values)
    print('Score on train is ', accuracy_score(true_labels_train, combine_predictions(predictions)))
    np.savetxt(os.path.join("./" + SAVE_RESULTS_DIR, "predictions_train_" + str(i) + '.csv'), predictions, delimiter=",")
    
    predictions = get_final_predictions(test['story'].values, test['ending'].values)
    val_score = accuracy_score(true_labels_val, combine_predictions(predictions))
    print('Score on val is ', val_score)
    np.savetxt(os.path.join("./" + SAVE_RESULTS_DIR, "predictions_test_" + str(val_score) + '_classifier_' + str(i) + '.csv'), predictions, delimiter=",")



  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Training took time  0:04:39.547921
Score on train is  0.9935863174772849
Score on val is  0.8818813468733298
Training took time  0:04:41.831737
Score on train is  0.997327632282202
Score on val is  0.8760021378941742
Training took time  0:04:39.743914
Score on train is  0.9941207910208445
Score on val is  0.882950293960449


In [None]:
ALL 12 EPOCHS unless specified otherwise

## just take pooled output from BERT:
0.8562266167824693
0.8722608230892571
0.8583645109567076
0.8599679315873864


## only last sentence layers = [hidden_size]:
0.8679850347407804
0.8711918760021379
0.877605558524853
0.8765366114377339
0.8674505611972207

## same but for 3 epochs
0.8760021378941742
0.8722608230892571
0.8743987172634955
0.8797434526990914
0.8797434526990914

## only last sentence layers = [hidden_size * 5] 3 epochs:
0.8813468733297701
0.8818813468733298
0.8733297701763763
0.8808123997862106
0.8733297701763763
0.8738642437199359

## only last sentence layers = [hidden_size * 5, hidden_size * 3, weight_size_last_sentence] 3 epochs:
0.8786745056119722
0.8717263495456975

## only last sentence layers = [hidden_size] 3 epochs with random_replacement 0.1:
0.8850881881346874
0.8743987172634955
0.8685195082843399
0.882950293960449

## only last sentence layers = [hidden_size * 3] 3 epochs with random_replacement 0.2:
0.8818813468733298
0.8760021378941742
0.882950293960449

## concat of last sentence and context [hidden_size], [hidden_size] 3 epochs:
0.8711918760021379
0.8711918760021379


In [145]:
# TEST 

mask = tf.constant([[0,0,1,1,1,0],
                    [0,0,0,1,1,0]])

mask1 = tf.constant([[1,1,1,1,1,0],
                     [1,1,1,1,1,0]])


features = tf.ones([2,6,3], dtype=tf.float32)

# weights = tf.get_variable('weights', [3,3])
weights = tf.constant([1,2,3], dtype=tf.float32)

mask_expanded = tf.tile(tf.expand_dims(mask, 2), [1,1,3])
mask_expanded = tf.cast(mask_expanded, tf.float32)
r = (features * weights) * mask_expanded

r = tf.reduce_sum(r, 1)

mask_expanded1 = tf.tile(tf.expand_dims((mask1 * (1 - mask)), 2), [1,1,3])
mask_expanded1 = tf.cast(mask_expanded1, tf.float32)
r1 = (features * weights) * mask_expanded1

r1 = tf.reduce_sum(r1, 1)

with tf.Session() as sess:
    print(sess.run(r))
    print(sess.run(r1))

[[3. 6. 9.]
 [2. 4. 6.]]
[[2. 4. 6.]
 [3. 6. 9.]]
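
The same check can be done without a TensorFlow session. The pure-Python sketch below repeats the masked sum for the first sequence of the test above. `masked_sum` is a name made up for this illustration.

```python
def masked_sum(features, mask):
    """Sum the feature vectors at positions where mask is 1."""
    width = len(features[0])
    total = [0.0] * width
    for vector, keep in zip(features, mask):
        if keep:
            for j in range(width):
                total[j] += vector[j]
    return total


# Every position carries the vector [1, 2, 3], i.e. ones * weights above.
features = [[1.0, 2.0, 3.0] for _ in range(6)]
ending_mask = [0, 0, 1, 1, 1, 0]    # first row of `mask`
non_pad_mask = [1, 1, 1, 1, 1, 0]   # first row of `mask1`
# Story positions: real tokens that are not part of the ending.
story_mask = [m1 * (1 - m) for m1, m in zip(non_pad_mask, ending_mask)]

print(masked_sum(features, ending_mask))  # prints [3.0, 6.0, 9.0]
print(masked_sum(features, story_mask))   # prints [2.0, 4.0, 6.0]
```

The two results match the first rows of the session output above: three ending positions and two story positions.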


In [19]:
from os import listdir
from os.path import isfile, join
from scipy import stats
import numpy as np

SAVE_RESULTS_DIR = 'results_predictions'


files = [f for f in listdir(SAVE_RESULTS_DIR) if isfile(join(SAVE_RESULTS_DIR, f))]

true_labels_train = train_unshuffled['class'].values[::2] + 1
true_labels_test = test['class'].values[::2] + 1

classifiers = [int(file.split("_")[2].split(".")[0]) for file in files]
num_classifiers = np.max(classifiers)

predictions_train = []
predictions_test = []
for i in range(num_classifiers + 1):
    predictions_file_train = np.genfromtxt(os.path.join("./" + SAVE_RESULTS_DIR, 
                                                        "predictions_train_" + str(i) + '.csv'), delimiter=',')
    predictions_train.append(predictions_file_train)
    
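    # Match the full suffix so that classifier 1 does not also match classifier 10, 11, ...
    test_file = [x for x in files if x.endswith('classifier_' + str(i) + '.csv')][0]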
    predictions_file_test = np.genfromtxt(os.path.join("./" + SAVE_RESULTS_DIR, test_file), delimiter=',')
    predictions_test.append(predictions_file_test)
    print(f'For classifier {i:2d} train accuracy {accuracy_score(true_labels_train, combine_predictions(predictions_file_train)):.4f} and test accuracy {accuracy_score(true_labels_test, combine_predictions(predictions_file_test)):.4f}')
    
def print_ensemble_predictions(predictions, true_labels):
    preds_mode = [combine_predictions(p) for p in predictions]
    preds_mode = np.array(preds_mode)
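    # Majority vote across classifiers; np.ravel yields a 1-D result
    # whether or not stats.mode keeps the reduced axis.
    preds_mode = np.ravel(stats.mode(preds_mode, axis=0)[0])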

    print('ENSEMBLE ACCURACY MODE')
    print(accuracy_score(true_labels, preds_mode))

    preds_prob = np.mean(predictions, axis=0)
    preds_prob = combine_predictions(preds_prob)

    print('ENSEMBLE ACCURACY PROB MEAN ON LOGS')
    print(accuracy_score(true_labels, preds_prob))


    preds_prob = np.log(np.mean(np.exp(predictions), axis=0))
    preds_prob = combine_predictions(preds_prob)

    print('ENSEMBLE ACCURACY PROB MEAN ON PROBS')
    print(accuracy_score(true_labels, preds_prob))
    print()
    
    
print_ensemble_predictions(predictions_train, true_labels_train)
print_ensemble_predictions(predictions_test, true_labels_test)

For classifier  0 train accuracy 1.0000 and test accuracy 0.8787
For classifier  1 train accuracy 1.0000 and test accuracy 0.8899
For classifier  2 train accuracy 1.0000 and test accuracy 0.8867
For classifier  3 train accuracy 1.0000 and test accuracy 0.8771
For classifier  4 train accuracy 0.9995 and test accuracy 0.8904
For classifier  5 train accuracy 1.0000 and test accuracy 0.8883
For classifier  6 train accuracy 1.0000 and test accuracy 0.8856
For classifier  7 train accuracy 1.0000 and test accuracy 0.8840
For classifier  8 train accuracy 1.0000 and test accuracy 0.8830
For classifier  9 train accuracy 1.0000 and test accuracy 0.8942
For classifier 10 train accuracy 1.0000 and test accuracy 0.8985
For classifier 11 train accuracy 1.0000 and test accuracy 0.8910
For classifier 12 train accuracy 1.0000 and test accuracy 0.8936
For classifier 13 train accuracy 1.0000 and test accuracy 0.8936
For classifier 14 train accuracy 1.0000 and test accuracy 0.8862
ENSEMBLE ACCURACY MODE
1.
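
The three ensemble rules reported above (mode, mean on logs, mean on probabilities) can disagree. The sketch below shows this for a single row of class log-probabilities from three hypothetical classifiers. It is a simplification: the notebook applies these rules to whole prediction arrays and then pairs the rows with `combine_predictions`. All numbers and helper names are made up for illustration.

```python
import math
from collections import Counter


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def majority_vote(rows):
    """Most frequent per-classifier decision (the 'mode' rule)."""
    votes = Counter(argmax(row) for row in rows)
    return votes.most_common(1)[0][0]


def mean_of_logs(rows):
    """Average log-probabilities per class (a geometric mean of probabilities)."""
    return [sum(row[c] for row in rows) / len(rows) for c in range(len(rows[0]))]


def log_of_mean_probs(rows):
    """Average probabilities per class, then return to log space."""
    return [math.log(sum(math.exp(row[c]) for row in rows) / len(rows))
            for c in range(len(rows[0]))]


# One very confident vote for class 1, two mild votes for class 0.
rows = [
    [math.log(0.01), math.log(0.99)],
    [math.log(0.80), math.log(0.20)],
    [math.log(0.80), math.log(0.20)],
]
print(majority_vote(rows))              # prints 0
print(argmax(mean_of_logs(rows)))       # prints 1
print(argmax(log_of_mean_probs(rows)))  # prints 0
```

Averaging in log space lets a single very confident classifier outweigh several mildly confident ones, which is one reason the "mean on logs" and "mean on probs" accuracies above are not always identical.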

In [67]:
###### OLD


small_bert (whole val and whole test)

output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
logits = tf.layers.dense(output_layer, 128, use_bias=True, activation=tf.nn.sigmoid)
logits = tf.nn.dropout(logits, keep_prob=0.9)
logits = tf.layers.dense(logits, num_labels, use_bias=True)

For classifier  0 train accuracy 1.0000 and test accuracy 0.8391
For classifier  1 train accuracy 1.0000 and test accuracy 0.8343
For classifier  2 train accuracy 1.0000 and test accuracy 0.8386
For classifier  3 train accuracy 1.0000 and test accuracy 0.8311
For classifier  4 train accuracy 1.0000 and test accuracy 0.8434
For classifier  5 train accuracy 1.0000 and test accuracy 0.8397
For classifier  6 train accuracy 1.0000 and test accuracy 0.8434
For classifier  7 train accuracy 1.0000 and test accuracy 0.8418
For classifier  8 train accuracy 1.0000 and test accuracy 0.8359
For classifier  9 train accuracy 1.0000 and test accuracy 0.8487
For classifier 10 train accuracy 1.0000 and test accuracy 0.8332
For classifier 11 train accuracy 1.0000 and test accuracy 0.8429
For classifier 12 train accuracy 1.0000 and test accuracy 0.8397
For classifier 13 train accuracy 1.0000 and test accuracy 0.8407
For classifier 14 train accuracy 1.0000 and test accuracy 0.8365
ENSEMBLE ACCURACY MODE
1.0
ENSEMBLE ACCURACY PROB MEAN ON LOGS
1.0
ENSEMBLE ACCURACY PROB MEAN ON PROBS
1.0

ENSEMBLE ACCURACY MODE
0.8722608230892571
ENSEMBLE ACCURACY PROB MEAN ON LOGS
0.8786745056119722
ENSEMBLE ACCURACY PROB MEAN ON PROBS
0.8786745056119722

big_bert (whole val and whole test)
For classifier  0 train accuracy 1.0000 and test accuracy 0.8691
For classifier  1 train accuracy 1.0000 and test accuracy 0.8664
For classifier  2 train accuracy 1.0000 and test accuracy 0.8680
For classifier  3 train accuracy 1.0000 and test accuracy 0.8562
For classifier  4 train accuracy 0.9984 and test accuracy 0.8397
For classifier  5 train accuracy 1.0000 and test accuracy 0.8696
For classifier  6 train accuracy 1.0000 and test accuracy 0.8381
For classifier  7 train accuracy 1.0000 and test accuracy 0.8589
For classifier  8 train accuracy 1.0000 and test accuracy 0.8632
For classifier  9 train accuracy 1.0000 and test accuracy 0.8600
For classifier 10 train accuracy 1.0000 and test accuracy 0.8701
For classifier 11 train accuracy 1.0000 and test accuracy 0.8696
For classifier 12 train accuracy 0.9947 and test accuracy 0.8268
For classifier 13 train accuracy 1.0000 and test accuracy 0.8653
For classifier 14 train accuracy 0.9995 and test accuracy 0.8044
ENSEMBLE ACCURACY MODE
1.0
ENSEMBLE ACCURACY PROB MEAN ON LOGS
1.0
ENSEMBLE ACCURACY PROB MEAN ON PROBS
1.0

ENSEMBLE ACCURACY MODE
0.8979155531801176
ENSEMBLE ACCURACY PROB MEAN ON LOGS
0.9021913415285944
ENSEMBLE ACCURACY PROB MEAN ON PROBS
0.8984500267236771

big bert [512]

For classifier  0 train accuracy 1.0000 and test accuracy 0.8792
For classifier  1 train accuracy 1.0000 and test accuracy 0.8787
For classifier  2 train accuracy 0.9989 and test accuracy 0.8952
For classifier  3 train accuracy 1.0000 and test accuracy 0.8803
For classifier  4 train accuracy 1.0000 and test accuracy 0.8632
For classifier  5 train accuracy 1.0000 and test accuracy 0.8717
For classifier  6 train accuracy 1.0000 and test accuracy 0.8797
For classifier  7 train accuracy 1.0000 and test accuracy 0.8733
For classifier  8 train accuracy 1.0000 and test accuracy 0.8781
For classifier  9 train accuracy 1.0000 and test accuracy 0.8712
For classifier 10 train accuracy 1.0000 and test accuracy 0.8616
For classifier 11 train accuracy 1.0000 and test accuracy 0.8669
For classifier 12 train accuracy 0.9995 and test accuracy 0.8589
For classifier 13 train accuracy 1.0000 and test accuracy 0.8632
For classifier 14 train accuracy 1.0000 and test accuracy 0.8803
ENSEMBLE ACCURACY MODE
1.0
ENSEMBLE ACCURACY PROB MEAN ON LOGS
1.0
ENSEMBLE ACCURACY PROB MEAN ON PROBS
1.0

ENSEMBLE ACCURACY MODE
0.9048637092463923
ENSEMBLE ACCURACY PROB MEAN ON LOGS
0.9075360769641903
ENSEMBLE ACCURACY PROB MEAN ON PROBS
0.9048637092463923

------------------------------------------------------------------------
one weight [hidden_size, hidden_size] all between first_sep_token and first_pad_token (15 epochs)

For classifier  0 train accuracy 1.0000 and test accuracy 0.8787
For classifier  1 train accuracy 1.0000 and test accuracy 0.8899
For classifier  2 train accuracy 1.0000 and test accuracy 0.8867
For classifier  3 train accuracy 1.0000 and test accuracy 0.8771
For classifier  4 train accuracy 0.9995 and test accuracy 0.8904
For classifier  5 train accuracy 1.0000 and test accuracy 0.8883
For classifier  6 train accuracy 1.0000 and test accuracy 0.8856
For classifier  7 train accuracy 1.0000 and test accuracy 0.8840
For classifier  8 train accuracy 1.0000 and test accuracy 0.8830
For classifier  9 train accuracy 1.0000 and test accuracy 0.8942
For classifier 10 train accuracy 1.0000 and test accuracy 0.8985
For classifier 11 train accuracy 1.0000 and test accuracy 0.8910
For classifier 12 train accuracy 1.0000 and test accuracy 0.8936
For classifier 13 train accuracy 1.0000 and test accuracy 0.8936
For classifier 14 train accuracy 1.0000 and test accuracy 0.8862
ENSEMBLE ACCURACY MODE
1.0
ENSEMBLE ACCURACY PROB MEAN ON LOGS
1.0
ENSEMBLE ACCURACY PROB MEAN ON PROBS
1.0

ENSEMBLE ACCURACY MODE
0.9102084446819882
ENSEMBLE ACCURACY PROB MEAN ON LOGS
0.9064671298770711
ENSEMBLE ACCURACY PROB MEAN ON PROBS
0.9075360769641903

array([[-0.85567284, -0.55338029],
       [-0.87034223, -0.54267442],
       [-0.60520196, -0.78957908],
       ...,
       [-1.29842089, -0.31877721],
       [-1.00223294, -0.4573779 ],
       [-0.17982265, -1.80434856]])