Natural Language Inference (NLI) is a classic NLP (Natural Language Processing) problem that involves taking two sentences (the _premise_ and the _hypothesis_) and deciding how they are related: whether the premise entails the hypothesis, contradicts it, or neither.

In this tutorial we'll look at the _Contradictory, My Dear Watson_ competition dataset, build a preliminary model using TensorFlow 2, Keras, and BERT, and prepare a submission file.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
os.environ["WANDB_API_KEY"] = "0" ## to silence warning

In [None]:
from transformers import BertTokenizer, TFBertModel
import matplotlib.pyplot as plt
import tensorflow as tf

# from tensorflow import keras
# import keras_tuner as kt

# Note: tf.set_random_seed() is not importable from the top-level tensorflow package in
# TensorFlow 2 (it raises ImportError: cannot import name 'set_random_seed' from 'tensorflow'),
# so tf.random.set_seed() is used further down instead.

import random as rn

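# The below is necessary in Python 3.2.3 onwards to
# have reproducible behavior for certain hash-based operations.
# Note that PYTHONHASHSEED is read when the interpreter starts, so setting it here
# affects child processes but not the hashing of this already-running kernel.
# See these references for further details:
# https://docs.python.org/3.4/using/cmdline.html#envvar-PYTHONHASHSEED
# https://github.com/fchollet/keras/issues/2280#issuecomment-306959926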

os.environ['PYTHONHASHSEED'] = '0'

# The below is necessary for starting Numpy generated random numbers
# in a well-defined initial state.

np.random.seed(42)

# The below is necessary for starting core Python generated random numbers
# in a well-defined state.

rn.seed(12345)

# The original TF 1.x recipe forced TensorFlow to use a single thread, because
# multiple threads are a potential source of non-reproducible results.
# It relies on tf.ConfigProto and tf.Session, which are TF 1.x APIs, so it is left disabled here.
# For further details, see: https://stackoverflow.com/questions/42022950/which-seeds-have-to-be-set-where-to-realize-100-reproducibility-of-training-res

# session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)

# from keras import backend as K  # only needed for the disabled K.set_session() call below

# tf.random.set_seed() (the TF 2 name for tf.set_random_seed()) makes random number
# generation in the TensorFlow backend start from a well-defined initial state.
# For further details, see: https://www.tensorflow.org/api_docs/python/tf/set_random_seed

tf.random.set_seed(1234)

# sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
# K.set_session(sess)
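
As a quick sanity check of what seeding buys us, here is a minimal sketch that uses only Python's built-in `random` module. It does not touch NumPy or TensorFlow, which are seeded separately in the cell above, and the helper name `draw_numbers` is made up for this illustration.

```python
import random

def draw_numbers(seed, count=3):
    """Return `count` pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator; the global state is left alone
    return [rng.random() for _ in range(count)]

# The same seed reproduces the same sequence on every call.
first = draw_numbers(12345)
second = draw_numbers(12345)
print(first == second)  # True

# A different seed starts the generator from a different state.
other = draw_numbers(54321)
print(first == other)
```

The same idea applies to `np.random.seed` and `tf.random.set_seed`: each library keeps its own generator state, which is why all of them are seeded.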

Let's set up our TPU.

In [None]:
try:
    # Detect and initialize the TPU; the resolver raises ValueError when no TPU is found.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy() # for CPU and single GPU

# Report the replica count in both cases, not only on the fallback path.
print('Number of replicas:', strategy.num_replicas_in_sync)

## Downloading Data

The training set contains a premise, a hypothesis, a label (0 = entailment, 1 = neutral, 2 = contradiction), and the language of the text. For more information about what these mean and how the data is structured, check out the data page: https://www.kaggle.com/c/contradictory-my-dear-watson/data

In [None]:
train = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")

We can use the pandas head() function to take a quick look at the training set.

In [None]:
train.head()

Let's look at one of the pairs of sentences.

In [None]:
train.premise.values[1]

In [None]:
train.hypothesis.values[1]

In [None]:
train.label.values[1]

These statements are contradictory, and the label shows that.
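
To make the integer labels easier to read while exploring, a small lookup helps. The mapping below comes from the label description earlier in this tutorial; the names `LABEL_NAMES` and `describe_label` are made up for this sketch.

```python
# Label encoding as described for the training set:
# 0 = entailment, 1 = neutral, 2 = contradiction.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

def describe_label(label):
    """Translate an integer label (plain int or NumPy integer) into its name."""
    return LABEL_NAMES[int(label)]

print([describe_label(label) for label in [0, 1, 2]])
```

In the notebook this could be called as `describe_label(train.label.values[1])`.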

Let's look at the distribution of languages in the training set.

In [None]:
labels, frequencies = np.unique(train.language.values, return_counts = True)

plt.figure(figsize = (10,10))
plt.pie(frequencies,labels = labels, autopct = '%1.1f%%')
plt.show()
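
If we want the percentages as numbers rather than as a chart, the same counts can be turned into shares directly. This is a minimal sketch using only the standard library; `language_shares` is a made-up helper and the sample list holds invented values, not figures from the dataset.

```python
from collections import Counter

def language_shares(languages):
    """Map each language to its percentage of the examples, most common first."""
    counts = Counter(languages)
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.most_common()}

# Invented values for illustration only.
sample = ["English", "English", "French", "Thai"]
shares = language_shares(sample)
for lang, share in shares.items():
    print(f"{lang}: {share:.1f}%")
```

In the notebook, `language_shares(train.language.values)` would be the equivalent call.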

## Preparing Data for Input

To start out, we can use a pretrained model. Here, we'll use a multilingual BERT model from Hugging Face. For more information about BERT, see: https://github.com/google-research/bert/blob/master/multilingual.md

First, we download the tokenizer.

In [None]:
model_name = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)

Tokenizers turn sequences of words into arrays of numbers. Let's look at an example:

In [None]:
def encode_sentence(s):
    """Tokenize one sentence, append [SEP], and return the token IDs."""
    tokens = list(tokenizer.tokenize(s))
    tokens.append('[SEP]')
    return tokenizer.convert_tokens_to_ids(tokens)

In [None]:
encode_sentence("I love machine learning")
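
To see roughly what happens inside the tokenizer, here is a toy sketch of greedy longest-match-first WordPiece splitting, the sub-word scheme BERT tokenizers use. The tiny vocabulary and its IDs are invented for this example, and the sketch skips the punctuation and whitespace handling that the real `BertTokenizer` performs first, so its output will not match the real model's IDs.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

def toy_encode(sentence, vocab):
    """Mirror encode_sentence: split into pieces, append [SEP], map to IDs."""
    tokens = [p for word in sentence.split() for p in wordpiece(word, vocab)]
    tokens.append("[SEP]")
    return [vocab[t] for t in tokens]

# Invented toy vocabulary; real BERT vocabularies are far larger.
toy_vocab = {"learn": 0, "##ing": 1, "machine": 2, "I": 3, "love": 4, "[UNK]": 5, "[SEP]": 6}

print(wordpiece("learning", toy_vocab))  # ['learn', '##ing']
print(toy_encode("I love machine learning", toy_vocab))  # [3, 4, 2, 0, 1, 6]
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover text in many languages.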

BERT uses three kinds of input data: input word IDs, input masks, and input type IDs.

These allow the model to know that the premise and hypothesis are distinct sentences, and also to ignore any padding from the tokenizer.

We add a [CLS] token to denote the beginning of the inputs, and a [SEP] token to denote the separation between the premise and the hypothesis. We also need to pad all of the inputs to be the same size. For more information about BERT inputs, see: https://huggingface.co/transformers/model_doc/bert.html#tfbertmodel
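
The layout described above can be sketched for a single premise/hypothesis pair with plain Python lists. This is an illustration, not the code used in this notebook: `build_bert_inputs` is a made-up helper, and the default `cls_id`, `sep_id`, and `pad_id` values are placeholders, since the real IDs come from the tokenizer.

```python
def build_bert_inputs(premise_ids, hypothesis_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Build word IDs, mask, and type IDs for one pair, fixed to max_len."""
    # [CLS] premise [SEP] hypothesis [SEP]
    word_ids = [cls_id] + premise_ids + [sep_id] + hypothesis_ids + [sep_id]
    # Type 0 covers [CLS], the premise, and its [SEP]; type 1 covers the rest.
    type_ids = [0] * (len(premise_ids) + 2) + [1] * (len(hypothesis_ids) + 1)

    # Truncate anything longer than max_len.
    word_ids = word_ids[:max_len]
    type_ids = type_ids[:max_len]

    # Mask is 1 for real tokens and 0 for padding.
    mask = [1] * len(word_ids)
    padding = max_len - len(word_ids)
    return {
        "input_word_ids": word_ids + [pad_id] * padding,
        "input_mask": mask + [0] * padding,
        "input_type_ids": type_ids + [0] * padding,
    }

example = build_bert_inputs([7, 8, 9], [5, 6], max_len=10)
print(example["input_word_ids"])  # [101, 7, 8, 9, 102, 5, 6, 102, 0, 0]
print(example["input_mask"])      # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(example["input_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
```

The mask tells the model which positions to ignore, and the type IDs tell it where the premise ends and the hypothesis begins.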

Now, we're going to encode all of our premise/hypothesis pairs for input into BERT.

In [None]:
max_len = 50

def bert_encode(premises, hypotheses, tokenizer):
    # Each sentence becomes a ragged row of token IDs ending in [SEP].
    sentence1 = tf.ragged.constant([encode_sentence(s) for s in np.array(premises)])
    sentence2 = tf.ragged.constant([encode_sentence(s) for s in np.array(hypotheses)])
    
    # Prepend one [CLS] token ID to every example.
    cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])]*sentence1.shape[0]
    input_word_ids = tf.concat([cls, sentence1, sentence2], axis=-1)
    # 1 marks a real token; to_tensor() fills the padded positions with 0.
    input_mask = tf.ones_like(input_word_ids).to_tensor()
    
    # Segment IDs: 0 for [CLS] and the first sentence, 1 for the second sentence.
    type_cls = tf.zeros_like(cls)
    type_s1 = tf.zeros_like(sentence1)
    type_s2 = tf.ones_like(sentence2)
    
    input_type_ids = tf.concat([type_cls, type_s1, type_s2], axis=-1).to_tensor()
    # to_tensor() pads only to the longest row, so the slices below truncate to max_len;
    # this assumes at least one example is max_len tokens or longer.
    inputs = {
        'input_word_ids': input_word_ids.to_tensor()[:,:max_len],
        'input_mask': input_mask[:,:max_len],
        'input_type_ids': input_type_ids[:,:max_len]
    }
    return inputs

In [None]:
train_input = bert_encode(train.premise.values, train.hypothesis.values, tokenizer)

## Creating & Training Model

Now, we can incorporate the BERT transformer into a Keras Functional Model. For more information about the Keras Functional API, see: https://www.tensorflow.org/guide/keras/functional.

This model was inspired by the model in this notebook: https://www.kaggle.com/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert#BERT-and-Its-Implementation-on-this-Competition, which is a wonderful introduction to NLP!

In [None]:
def build_model():
    bert_encoder = TFBertModel.from_pretrained(model_name)
    input_word_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    input_type_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_type_ids")

    embedding = bert_encoder([input_word_ids, input_mask, input_type_ids])[0] # last hidden state: one vector per token, not yet a single sentence vector
    # print('Embedding.shape is:', embedding.shape) # (None, 50, 768)
#     x = tf.keras.layers.BatchNormalization(
#         axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True,
#         beta_initializer='zeros', gamma_initializer='ones',
#         moving_mean_initializer='zeros',
#         moving_variance_initializer='ones', beta_regularizer=None,
#         gamma_regularizer=None, beta_constraint=None, gamma_constraint=None
#     )(embedding[:,0,:]) # find the weights
#     x = tf.keras.layers.Dropout(0.2, noise_shape=None, seed=None)(embedding[:,0,:]) # find the weights
#     x = tf.keras.layers.Dense(6, activation='relu')(embedding[:,0,:]) # find the weights
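    cls_vector = embedding[:,0,:] # vector at the [CLS] position, used to represent the whole sentence pair
    x = tf.keras.layers.Dense(8, activation='relu')(cls_vector) # first small hidden layer
    x = tf.keras.layers.Dense(6, activation='relu')(x) # second small hidden layer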
#     x = (embedding[:,0,:]) # find the weights
    output = tf.keras.layers.Dense(3, activation='softmax')(x)
    # Dense = linear layer
    # 3 because we want the prob of each category (we have 3 categories)
    # softmax transforms output vector (whose sum may not = 1) into probabilities for each category (sum = 1 and each prob >= 0)

    inputs = [input_word_ids, input_mask, input_type_ids]
    model = tf.keras.Model(inputs=inputs, outputs=output)

    # Tune the learning rate for the optimizer
#     hp_learning_rate = hp.Choice('learning_rate', values=[1e-4, 1e-5, 1e-6])
    
    # 27 July 2021 TODO: Try -4, -6.
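    model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    # knowing backpropagation helps in understanding what Adam, the loss, and the metrics are doing
    # loss(true labels, predicted probabilities) measures how far the predictions are from the truth
    # basic gradient-descent update: weights = weights - lr * (gradient of the loss w.r.t. the weights)
    # Adam additionally rescales each step using running averages of past gradients
    # we train by batch, but fitting one batch "perfectly" doesn't mean the model does well on the other
    # batches
    # lr determines how large a step the weights take on each update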
    # https://stats.stackexchange.com/questions/326065/cross-entropy-vs-sparse-cross-entropy-when-to-use-one-over-the-other
    
    return model
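
To make the comments on softmax and the loss concrete, here is a small pure-Python sketch of both. It is a simplified illustration rather than the Keras implementation: the function names are chosen for this example, the input scores are invented, and real frameworks typically add safeguards against taking the logarithm of zero.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that are positive and sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_label, probs):
    """Loss for one example whose label is an integer class index."""
    return -math.log(probs[true_label])

# Invented scores for the three classes (entailment, neutral, contradiction).
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# The loss is small when the true class gets high probability, large otherwise.
print(sparse_categorical_crossentropy(0, probs))
print(sparse_categorical_crossentropy(2, probs))
```

The "sparse" variant takes the label as a single integer, which matches the 0/1/2 labels in the training set; the non-sparse variant expects one-hot vectors instead.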

In [None]:
with strategy.scope():
    model = build_model()
    model.summary()

# tuner = kt.RandomSearch(
#     build_model,
#     objective='val_accuracy',
#     max_trials=3
# )
# stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

# tuner.search(train_input, train.label.values, epochs=5, validation_split=0.2, callbacks=[])

# # Get the optimal hyperparameters
# best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

# print(f"""
# The hyperparameter search is complete. The optimal number of units in the first densely-connected
# layer is {best_hps.get('units')} and the optimal learning rate for the optimizer
# is {best_hps.get('learning_rate')}.
# """)

# Can't use tuner because:
# UnimplementedError: File system scheme '[local]' not implemented (file: './untitled_project/trial_5e9d562af088aefa377fff1b7f1539ac/checkpoints/epoch_0/checkpoint_temp/part-00000-of-00001')
# 	Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.

In [None]:
# for (k, v) in train_input.items():
#     print(k, v.shape)

In [None]:
# Original call:
# model.fit(train_input, train.label.values, epochs = 2, verbose = 1, batch_size = 64, validation_split = 0.2)

# Choosing between accuracy, precision, recall etc can depend on tasks.
# But for tasks with no preferred metric, use ROC to compare model performances because one may have higher precision,
# others may have higher recall etc.

# These choices are all part of hyperparameter tuning.

# The sets should be split explicitly so that we can do a case study on what works and what doesn't.
# Overfitting appeared as the number of epochs increased. Consider increasing the validation split.

# 27 July 2021:
# TODO: Split sets explicitly. No need to vary validation_split too much. Use small amount for validation (10% or 20%).
# TODO: Random search batch size (start from 16, etc) and learning rate.
# TODO: Seed the initial values so that it's deterministic.
# TODO: Always compare models with the highest val_accuracy value.
# Model performance after each epoch should be similar if the model architecture is largely the same. Seed initial values.

# 03 Aug 2021: TODOS:
# Batch norm, dropout
# Add Linear layers
# Size of hidden state in linear layers
# Experiments
# If model fails to predict, look at the test cases. Case study. Print those that are wrong.

import matplotlib.pyplot as plt

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy', min_delta=0, patience=5, verbose=0,
    mode='max', baseline=None, restore_best_weights=False
)
fitted = model.fit(train_input, train.label.values, epochs = 20, verbose = 1, batch_size = 64, validation_split = 0.2, callbacks = [early_stopping])
print(fitted.history)
plt.plot(fitted.history['accuracy'])
plt.plot(fitted.history['val_accuracy'])
plt.plot(fitted.history['loss'])
plt.plot(fitted.history['val_loss'])
plt.title('Accuracy and loss')
plt.ylabel('Accuracy or loss')
plt.xlabel('Epoch')
plt.legend(['Accuracy', 'Validation Accuracy', 'Loss', 'Validation Loss'], loc='upper left')
plt.show()
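
The patience rule behind early stopping can be sketched in a few lines. This is a simplified illustration of the idea, not Keras's actual implementation; `early_stopping_epoch` is a made-up helper and the accuracy history is invented.

```python
def early_stopping_epoch(val_accuracies, patience=5, min_delta=0.0):
    """Return the 0-based epoch at which training stops, or the last epoch if it never triggers."""
    best = float("-inf")
    wait = 0  # epochs since the last improvement
    for epoch, value in enumerate(val_accuracies):
        if value - min_delta > best:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_accuracies) - 1

# Invented validation accuracies for ten epochs.
history = [0.50, 0.55, 0.60, 0.59, 0.58, 0.60, 0.57, 0.56, 0.59, 0.61]
print(early_stopping_epoch(history, patience=5))  # 7
print(early_stopping_epoch(history, patience=2))  # 4
```

With `patience=5` the run stops at epoch index 7, five epochs after the best value at index 2, and never reaches the later 0.61. Note also that with `restore_best_weights=False`, as in the call above, the model keeps the weights from the last epoch that ran rather than from the best one.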

In [None]:
# test = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")
# test_input = bert_encode(test.premise.values, test.hypothesis.values, tokenizer)

In [None]:
# test.head()

## Generating & Submitting Predictions

In [None]:
# predictions = [np.argmax(i) for i in model.predict(test_input)]

The submission file will consist of the ID column and a prediction column. We can just copy the ID column from the test file, make it a dataframe, and then add our prediction column.

In [None]:
# submission = test.id.copy().to_frame()
# submission['prediction'] = predictions

In [None]:
# submission.head()

In [None]:
# submission.to_csv("submission.csv", index = False)
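
The prediction and submission steps can also be sketched without pandas or a trained model. The example below is a stand-in for the cells above, using the standard `csv` module instead of a DataFrame. The helper names, the IDs, and the probabilities are invented, while the `id` and `prediction` column names follow the submission format described above.

```python
import csv
import io

def argmax(values):
    """Index of the largest value (the first one on ties, like np.argmax)."""
    return max(range(len(values)), key=lambda i: values[i])

def build_submission_csv(ids, probabilities):
    """Return CSV text with one predicted class per example ID."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "prediction"])
    for example_id, probs in zip(ids, probabilities):
        writer.writerow([example_id, argmax(probs)])
    return buffer.getvalue()

# Invented IDs and class probabilities for two test examples.
csv_text = build_submission_csv(["a1", "b2"], [[0.1, 0.7, 0.2], [0.2, 0.3, 0.5]])
print(csv_text)
```

Each row of model output holds three probabilities, and the predicted label is simply the position of the largest one.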

And now we've created our submission file, which can be submitted to the competition. Good luck!