### **Hey everyone! I want to give credit where it's due and explain the purpose of this notebook. I'm taking the Tutorial Notebook by [Ana Sofia Uzsoy](https://www.kaggle.com/anasofiauzsoy), [Amy Jang](https://www.kaggle.com/amyjang), & [Phil Culliton](https://www.kaggle.com/philculliton) and adding more explanation and background on why the code is set up the way it is. Originally I just wanted to understand NLI and this notebook, but I thought breaking things down further than the original could help others too. I've also made a few changes where the original notebook didn't work for me, and adjusted some of the model parameters to see whether they give better accuracy.**

#### **Anything I've added is in bold or appears as a comment in a code block, beginning at "First, let's import the libraries we'll need"**

##### *The original notebook can be found [here](https://www.kaggle.com/anasofiauzsoy/tutorial-notebook)*

Natural Language Inference (NLI) is a classic NLP (Natural Language Processing) problem that involves taking two sentences (the _premise_ and the _hypothesis_) and deciding how they are related: whether the premise entails the hypothesis, contradicts it, or neither.

In this tutorial we'll look at the _Contradictory, My Dear Watson_ competition dataset, build a preliminary model using TensorFlow 2, Keras, and BERT, and prepare a submission file.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will 
# list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output
# when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current
# session

In [None]:
os.environ["WANDB_API_KEY"] = "0" ## to silence warning

### **First, let's import the libraries we'll need**

In [None]:
# importing libraries

from transformers import BertTokenizer, TFBertModel
import matplotlib.pyplot as plt
import tensorflow as tf

Let's set up our TPU.

**The code below was taken from [Watson :: XLM-R & NLI :: inference](https://www.kaggle.com/alturutin/watson-xlm-r-nli-inference). This was the notebook I was originally looking to add additional explanations to and break down further, but it went a little over my head in most places so I decided to start with a simpler notebook instead.**

**If you go to Kaggle's documentation on [TPUs](https://www.kaggle.com/docs/tpu) you'll see that some of the code in the function is taken directly from there. TPUs, Tensor Processing Units, were specifically created to work with TensorFlow, an ML library. The original poster, [novichok](https://www.kaggle.com/alturutin), just turned it into a function that detects whether the notebook is running on a TPU (a hardware accelerator) and otherwise falls back to the default strategy for a single GPU or a plain CPU. I thought it would be a nice addition. Want more info on the differences between the three? Check out this [link](https://serverguy.com/comparison/cpu-vs-gpu-vs-tpu/)**

**Last note: I've tried my best to add pertinent info about each function and its purpose. I was taught that this is good practice, so you'll see a docstring (shown in red) between the start of each function and its first line of code.**

In [None]:
def init_strategy():
    '''Return a TPU distribution strategy if a TPU is available, otherwise the default CPU/GPU strategy'''
    try:
        # detect the TPU
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        
        # initiate the TPU
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        
        # instantiate a distribution strategy
        strategy = tf.distribute.experimental.TPUStrategy(tpu)
        print("Init TPU strategy")
    except ValueError:
        strategy = tf.distribute.get_strategy() # for CPU and single GPU
        print("Init CPU/GPU strategy")
    return strategy

strategy = init_strategy()
strategy

## Downloading Data

The training set contains a premise, a hypothesis, a label (0 = entailment, 1 = neutral, 2 = contradiction), and the language of the text. For more information about what these mean and how the data is structured, check out the data page: https://www.kaggle.com/c/contradictory-my-dear-watson/data
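To make the three label values concrete, here's a small sketch that maps them to names on a toy table. The sentences are invented for illustration and are not rows from `train.csv`; only the 0/1/2 meaning comes from the description above.

```python
import pandas as pd

# Label ids as described above (0 = entailment, 1 = neutral, 2 = contradiction)
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Hypothetical rows, invented for illustration (not taken from train.csv)
toy = pd.DataFrame({
    "premise": ["A man is playing a guitar on stage."] * 3,
    "hypothesis": [
        "A person is making music.",       # follows from the premise
        "The man is a famous musician.",   # could be true or false
        "The stage is empty.",             # conflicts with the premise
    ],
    "label": [0, 1, 2],
})

# Add a readable column next to the numeric label
toy["label_name"] = toy["label"].map(LABEL_NAMES)
print(toy[["hypothesis", "label_name"]])
```

The same `.map(LABEL_NAMES)` call should work on the real `train` DataFrame if you want readable labels while exploring.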

In [None]:
train = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")

We can use the pandas head() function to take a quick look at the training set.

**A very good practice when working with a new data set**

In [None]:
train.head()

Let's look at one of the pairs of sentences.

In [None]:
train.premise.values[1]

In [None]:
train.hypothesis.values[1]

In [None]:
train.label.values[1]

These statements are contradictory, and the label shows that.

Let's look at the distribution of languages in the training set.

In [None]:
labels, frequencies = np.unique(train.language.values, return_counts = True)

plt.figure(figsize = (10,10))
plt.pie(frequencies,labels = labels, autopct = '%1.1f%%')
plt.show()

**Observation: most of the statements are in English with all other languages appearing at almost the same rate (about 3%)**
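If you'd rather read exact percentages than squint at a pie chart, one possible approach is `value_counts(normalize=True)`. The sketch below uses a made-up language column, so the numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical language column, invented for illustration
toy_languages = pd.Series(["English"] * 6 + ["French"] * 2 + ["Thai"] + ["Urdu"])

# Fraction of rows per language, sorted from most to least common
shares = toy_languages.value_counts(normalize=True)

# Show the shares as percentages rounded to one decimal place
print((shares * 100).round(1))
```

On the real data this would be `train.language.value_counts(normalize=True)`.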

## Preparing Data for Input

To start out, we can use a pretrained model. Here, we'll use a multilingual BERT model from huggingface. For more information about BERT, see: https://github.com/google-research/bert/blob/master/multilingual.md

First, we download the tokenizer.

In [None]:
model_name = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)

Tokenizers turn sequences of words into arrays of numbers. Let's look at an example:

In [None]:
def encode_sentence(s):
    '''Tokenize a sentence, append [SEP], and return the list of token IDs'''
    tokens = list(tokenizer.tokenize(s))
    tokens.append('[SEP]')
    return tokenizer.convert_tokens_to_ids(tokens)

### **OK, now let's break the function down into smaller parts.**

**The first line of code in the function turns the statement you enter into a list of tokens. The tokenizer splits the sequence into tokens that exist in its vocabulary, in this case the multilingual BERT vocabulary. The tokens can be either whole words or subwords.**

**Let's look at the example below using the phrase _"I love machine learning"_**

In [None]:
tokens = list(tokenizer.tokenize("I love machine learning"))
tokens

**Oh look, our sentence is now broken up into a list of 4 words**
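Common words like these stay whole, but rarer words get split into subwords. BERT tokenizers use WordPiece, which greedily matches the longest piece found in the vocabulary. Here's a minimal sketch of that idea; the function name `wordpiece` and the tiny vocabulary are hypothetical, and the real tokenizer does more (text cleaning, punctuation splitting, and so on).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # pieces inside a word carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # nothing fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical mini-vocabulary, invented for illustration
toy_vocab = {"I", "love", "machine", "learn", "##ing", "token", "##izer", "##s"}

for w in ["love", "learning", "tokenizers", "xyz"]:
    print(w, "->", wordpiece(w, toy_vocab))
```

With this toy vocabulary, "learning" is split into `learn` and `##ing` because the full word isn't in the vocabulary, while "xyz" falls back to `[UNK]`.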

**The second line of code appends a [SEP] token. The original authors explain it further down, but if you want more detailed information you can check out this [link](https://github.com/google-research/bert/blob/master/run_classifier.py) around line 400.**


In [None]:
tokens.append('[SEP]')
tokens

**Hmmm, so we just added [SEP] to denote that it's the end of the statement**

**The last line of code uses the tokenizer's vocabulary to turn each token into an ID, which is the form the model understands.**

In [None]:
tokenizer.convert_tokens_to_ids(tokens)

**Would you look at that, all of the words have been turned into ID numbers. I also want to note that [SEP] is encoded as ID number 102, so whenever you see that number in this notebook it marks the end of a sentence or statement.**
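Conceptually, converting tokens to IDs is just a dictionary lookup. The sketch below is an illustration only: the ID 102 for `[SEP]` is what we saw above, the other special-token IDs follow the usual BERT layout (treat them as assumptions), and the IDs for ordinary words are made up.

```python
# 102 for [SEP] is shown above; the other special-token ids are assumed,
# and the ids for ordinary words are hypothetical.
toy_vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "I": 7, "love": 8, "machine": 9, "learning": 10}

def tokens_to_ids(tokens, vocab):
    """Look each token up in the vocabulary, falling back to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

ids = tokens_to_ids(["I", "love", "machine", "learning", "[SEP]"], toy_vocab)
print(ids)

# A token missing from the vocabulary maps to the [UNK] id
print(tokens_to_ids(["zebra"], toy_vocab))
```

The real vocabulary has far more entries, but the lookup idea is the same.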

In [None]:
# now, let's run the original function and see what we get
encode_sentence("I love machine learning")

BERT uses three kinds of input data: input word IDs, input masks, and input type IDs.

These allow the model to know that the premise and hypothesis are distinct sentences, and also to ignore any padding from the tokenizer.

We add a [CLS] token to denote the beginning of the inputs, and a [SEP] token to denote the separation between the premise and the hypothesis. We also need to pad all of the inputs to be the same size. For more information about BERT inputs, see: https://huggingface.co/transformers/model_doc/bert.html#tfbertmodel
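Before the TensorFlow version, it may help to see the layout for a single pair in plain Python. This is a sketch under assumptions: `build_pair_inputs` is a hypothetical helper, the word IDs are made up, and the special-token IDs other than 102 for `[SEP]` are assumed.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed ids; only 102 for [SEP] appears in this notebook

def build_pair_inputs(ids_a, ids_b, max_len):
    """Lay out [CLS] A [SEP] B [SEP], then pad all three inputs to max_len."""
    first = [CLS_ID] + ids_a + [SEP_ID]
    second = ids_b + [SEP_ID]
    word_ids = first + second
    if len(word_ids) > max_len:
        raise ValueError("pair is longer than max_len; truncate it first")
    type_ids = [0] * len(first) + [1] * len(second)  # 0 = first sentence, 1 = second
    mask = [1] * len(word_ids)                       # 1 = real token
    pad = max_len - len(word_ids)
    return {
        "input_word_ids": word_ids + [PAD_ID] * pad,
        "input_mask": mask + [0] * pad,              # 0 = padding
        "input_type_ids": type_ids + [0] * pad,
    }

# Hypothetical token ids for two short sentences
example = build_pair_inputs([11, 12], [21, 22, 23], max_len=10)
for name, values in example.items():
    print(name, values)
```

All three lists end up with the same length, which is what lets many pairs be stacked into one batch.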

Now, we're going to encode all of our premise/hypothesis pairs for input into BERT.

In [None]:
def bert_encode(hypotheses, premises, tokenizer):
    '''Function that formats the hypothesis and premise data so it can be input into the model'''
    
    num_examples = len(hypotheses)
    
    sentence1 = tf.ragged.constant([
        encode_sentence(s)
        for s in np.array(hypotheses)])
    
    sentence2 = tf.ragged.constant([
        encode_sentence(s)
        for s in np.array(premises)])
    
    cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])]*sentence1.shape[0]
    input_word_ids = tf.concat([cls, sentence1, sentence2], axis=-1)
    
    input_mask = tf.ones_like(input_word_ids).to_tensor()
    
    type_cls = tf.zeros_like(cls)
    type_s1 = tf.zeros_like(sentence1)
    type_s2 = tf.ones_like(sentence2)
    
    input_type_ids = tf.concat(
        [type_cls, type_s1, type_s2], axis=-1).to_tensor()
    
    inputs = {
        'input_word_ids': input_word_ids.to_tensor(),
        'input_mask': input_mask,
        'input_type_ids': input_type_ids}
    return inputs

### **You know the drill, let's break this function down!**

**First, I'm going to assign the hypothesis and premise values from the training data to variables. It'll just make it easier when running the code in each step.**

In [None]:
hypothesis = train.hypothesis.values
premises = train.premise.values

**Line 1: determines how many rows there are. It should be the same number whether you're looking at hypothesis or premise since they are present at a ratio of 1:1**

In [None]:
num_examples = len(premises)
num_examples

#### sentence1 and sentence2 are essentially all of the statements tokenized, with a [SEP] (ID number 102) at the end of each list; each tokenized sentence is then stored as one row of a ragged tensor

#### looking at TF documentation: https://www.tensorflow.org/guide/ragged_tensor

> Your data comes in many shapes; your tensors should too. Ragged tensors are the TensorFlow equivalent of nested variable-length lists. They make it easy to store and process data with non-uniform shapes, including: Variable-length features, such as the set of actors in a movie. Batches of variable-length sequential inputs, such as sentences or video clips. Hierarchical inputs, such as text documents that are subdivided into sections, paragraphs, sentences, and words. Individual fields in structured inputs, such as protocol buffers.

#### The simplest way to construct a ragged tensor is using tf.ragged.constant, which builds the RaggedTensor corresponding to a given nested Python list or numpy array

#### As with normal Tensors, the values in a RaggedTensor must all have the same type; and the values must all be at the same nesting depth (the rank of the tensor)
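The model itself needs rectangular inputs, which is why `bert_encode` calls `.to_tensor()` on its ragged tensors. That call right-pads shorter rows with zeros by default. Here's a plain-Python sketch of that padding step; `pad_rows` is a hypothetical helper and the token IDs are made up.

```python
def pad_rows(rows, pad_value=0):
    """Right-pad every row to the length of the longest one."""
    width = max(len(r) for r in rows)
    return [list(r) + [pad_value] * (width - len(r)) for r in rows]

# Hypothetical tokenized sentences of different lengths
ragged = [[101, 7, 8, 102], [101, 9, 102], [101, 102]]

dense = pad_rows(ragged)
for row in dense:
    print(row)
```

Every row now has the same length, so the result can be stored as an ordinary tensor.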

In [None]:
sentence1 = tf.ragged.constant([
        encode_sentence(s)
        for s in np.array(hypothesis)])

sentence1[0]

In [None]:
sentence2 = tf.ragged.constant([
      encode_sentence(s)
       for s in np.array(premises)])
sentence2[0]

**Great, looks like all of the hypotheses and premises are encoded/tokenized**

In [None]:
sentence1.shape[0]

In [None]:
sentence2.shape[0]

**Because we're doing sequence classification, the model requires two different sequences to be joined in a single “input_ids” entry, which usually is performed with the help of special tokens, such as the classifier ([CLS]) and separator ([SEP]) tokens. For example, the BERT model builds its two sequence input as such:**

```# [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]```

**We can use our tokenizer to automatically generate such a sentence by passing the two sequences to the tokenizer as two arguments (and not a list, like before) like this:**

```
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"
encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])
```

**which will return:**

```[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]```

**This is enough for some models to understand where one sequence ends and where another begins. However, for BERT we also need to deploy token type IDs (also called segment IDs). They are represented as a binary mask identifying the two types of sequence in the model.**

**The tokenizer returns this mask as the “token_type_ids” entry:**

```
encoded_dict['token_type_ids']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

**The first sequence, the “context” used for the question, has all its tokens represented by a ```0```, whereas the second sequence, corresponding to the “question”, has all its tokens represented by a ```1```.**

**This example comes from Hugging Face and can be found [here](https://huggingface.co/transformers/v2.4.0/glossary.html#token-type-ids)**

**I will break the ```cls``` variable into two parts.**

In [None]:
# just converting the CLS token into an id number 
cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])]

In [None]:
# converting the CLS token into an id number, then repeating it
# to equal the number of statements pairs 
cls = cls*sentence1.shape[0]
cls[0:10]

In [None]:
# Concatenates tensors along one dimension
input_word_ids = tf.concat([cls, sentence1, sentence2], axis=-1)
input_word_ids[0]

**The code above takes the [CLS] ID, which marks the beginning of each input, and concatenates it with the tokenized hypothesis and premise into a single row per statement pair**

In [None]:
input_mask = tf.ones_like(input_word_ids).to_tensor()
input_mask[0]

**The code above creates the input mask. It starts as a ragged tensor of ones with the same shape as input_word_ids, and converting it with to_tensor() pads every row with zeros so that all rows have the same length. The result contains a 1 anywhere input_word_ids holds a real token and a 0 anywhere it is padding, which lets the model cleanly differentiate between the content and the padding. (Telling the premise and hypothesis apart is the job of input_type_ids, coming up next.)**

In [None]:
# Creates a tensor with all elements, in this case CLS, set to zeros
type_cls = tf.zeros_like(cls)
type_cls

In [None]:
# Creates a tensor with all elements, in this case the hypotheses, set to zeros
type_s1 = tf.zeros_like(sentence1)
type_s1[0]

In [None]:
# Creates a tensor with all elements, in this case the premises, set to ones
type_s2 = tf.ones_like(sentence2)
type_s2[0]

In [None]:
input_type_ids = tf.concat(
    [type_cls, type_s1, type_s2], axis=-1).to_tensor()
input_type_ids[0]

**The code above does for input_type_ids what we did before for input_word_ids, except that inside the non-padded region each value is a 0 or a 1 indicating which sentence the token belongs to. Calling to_tensor() pads each row with zeros so that the input_type_ids rows are again all the same shape**

In [None]:
inputs = {
    'input_word_ids': input_word_ids.to_tensor(),
    'input_mask': input_mask,
    'input_type_ids': input_type_ids}
inputs

**Woohoo! We now have all of the information in the correct format to train and run our BERT model below**

## Creating & Training Model

Now, we can incorporate the BERT transformer into a Keras Functional Model. For more information about the Keras Functional API, see: https://www.tensorflow.org/guide/keras/functional.

This model was inspired by the model in this notebook: https://www.kaggle.com/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert#BERT-and-Its-Implementation-on-this-Competition, which is a wonderful introduction to NLP!

**I had issues with the original build_model function below. The hack directly below this cell came from Zakaria MESSIA and is directly linked [here](https://www.kaggle.com/anasofiauzsoy/tutorial-notebook/comments#1223205)**

In [None]:
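def bert_encode(hypotheses, premises, tokenizer, max_length=150):
    '''Tokenize sentence pairs and pad/truncate them to max_length'''

    # Passing the two lists separately lets the tokenizer insert [CLS]/[SEP]
    # itself and assign token type 0 to the first sentence and 1 to the second.
    # (Joining the strings with ' [SEP] ' would leave every token type at 0.)
    # Note: the calls in this notebook pass the premises first.
    x = tokenizer(
        np.array(hypotheses).tolist(),
        np.array(premises).tolist(),
        padding='max_length',  # always pad to max_length so shapes match the model inputs
        truncation=True,
        max_length=max_length)

    inputs = {
        'input_word_ids': tf.constant(x['input_ids']),
        'input_mask': tf.constant(x['attention_mask']),
        'input_type_ids': tf.constant(x['token_type_ids'])}

    return inputs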

**Since just having max_len in the model that we built below was causing errors, the code above specifies a max_len and a few other parameters when the statements are being tokenized. You can also check out this webpage by Hugging Face on preprocessing that explains the code above [here](https://huggingface.co/transformers/preprocessing.html#preprocessing-pairs-of-sentences).**
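When a tokenizer receives a sentence pair as two separate arguments, `truncation=True` uses a "longest first" strategy: it removes tokens one at a time from whichever sequence is currently longer until the pair fits. The sketch below illustrates that idea only. `truncate_pair` is a hypothetical helper, and the budget of 3 special tokens assumes the `[CLS] A [SEP] B [SEP]` layout.

```python
def truncate_pair(ids_a, ids_b, max_length, num_special=3):
    """Drop tokens from the end of the longer sequence until the pair fits."""
    a, b = list(ids_a), list(ids_b)
    budget = max_length - num_special  # room left after [CLS] and two [SEP]
    if budget < 0:
        raise ValueError("max_length is too small for the special tokens")
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()   # first sequence is longer, trim it
        else:
            b.pop()   # otherwise trim the second sequence
    return a, b

# Hypothetical token ids: a long first sentence and a short second one
long_a = list(range(10))
short_b = list(range(100, 104))

a, b = truncate_pair(long_a, short_b, max_length=11)
print(len(a), len(b))
```

The short sentence survives intact here, and only the long one loses tokens.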

In [None]:
train_input = bert_encode(train.premise.values, train.hypothesis.values, tokenizer)

**Now it's time to build our model! The code below is a pretty standard TF model set-up so I've decided not to break it down**
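For anyone curious about the last two lines of the model, here's a NumPy sketch of what the classification head computes: take the vector at the `[CLS]` position (`embedding[:,0,:]`), apply a dense layer with 3 outputs, and turn the result into probabilities with softmax. The sizes and random weights are stand-ins, not values from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, seq_len, hidden, num_classes = 2, 5, 8, 3  # toy sizes; BERT-base uses hidden=768

# Stand-in for the encoder output: one vector per token
sequence_output = rng.normal(size=(batch, seq_len, hidden))

# Random weights standing in for the Dense(3) layer
W = rng.normal(size=(hidden, num_classes))
b = np.zeros(num_classes)

cls_vector = sequence_output[:, 0, :]   # keep only the [CLS] position
logits = cls_vector @ W + b

# Softmax, shifted by the row maximum for numerical stability
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(probs.shape)
print(probs.sum(axis=1))
```

Each row of `probs` holds one probability per class (entailment, neutral, contradiction) and sums to 1.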

In [None]:
max_len = 150

def build_model():
    bert_encoder = TFBertModel.from_pretrained(model_name)
    input_word_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    input_type_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_type_ids")
    
    embedding = bert_encoder([input_word_ids, input_mask, input_type_ids])[0]
    output = tf.keras.layers.Dense(3, activation='softmax')(embedding[:,0,:])
    
    model = tf.keras.Model(inputs=[input_word_ids, input_mask, input_type_ids], outputs=output)
    model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return model

In [None]:
# build the model inside the strategy scope so it runs on the TPU when one is available
with strategy.scope():
    model = build_model()
    model.summary()

In [None]:
model.fit(train_input, train.label.values, epochs = 5, verbose = 1, batch_size = 128, validation_split = 0.3)

In [None]:
test = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")
test_input = bert_encode(test.premise.values, test.hypothesis.values, tokenizer)

In [None]:
test.head()

## Generating & Submitting Predictions

In [None]:
predictions = [np.argmax(i) for i in model.predict(test_input)]

The submission file will consist of the ID column and a prediction column. We can just copy the ID column from the test file, make it a dataframe, and then add our prediction column.
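As a small illustration of the two steps (picking the most likely class, then building the table), here's a sketch with made-up probabilities and IDs. Nothing in it comes from the real test set.

```python
import numpy as np
import pandas as pd

# Hypothetical softmax outputs for three test rows (one column per class)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])
ids = ["a1", "b2", "c3"]  # hypothetical row ids

# argmax along axis 1 picks the class with the highest probability per row
preds = probs.argmax(axis=1)

toy_submission = pd.DataFrame({"id": ids, "prediction": preds})
print(toy_submission)
```

Calling `argmax(axis=1)` on the whole array gives the same result as taking `np.argmax` of each row in a loop.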

In [None]:
submission = test.id.copy().to_frame()
submission['prediction'] = predictions

In [None]:
submission.head()

In [None]:
submission.to_csv("submission.csv", index = False)

And now we've created our submission file, which can be submitted to the competition. Good luck!