<h1> Question Natural Language Inference </h1>

This example fine-tunes a BERT model for one of the General Language Understanding Evaluation (GLUE) benchmark tasks.  The task is called Question Natural Language Inference (QNLI).  

Devlin, Chang, Lee, and Toutanova, authors of [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) define Question Natural Language Inference as follows:

"Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) which has been converted to a binary classification task (Wang et al., 2018). The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) from the same paragraph which do not contain the answer."

This Jupyter notebook contains an example of a natural language inference task that involves two sentences (question, sentence).  Here, the term *sentence* refers to an arbitrary span of contiguous text, rather than an actual linguistic sentence. 
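
To make the positive/negative pairing concrete, here is a minimal sketch of how (question, sentence) pairs could be labeled from one paragraph. It is only an illustration: the helper `make_qnli_pairs`, the naive sentence splitter, and the toy text are all assumptions, and the substring test is a simplification of the answer-span bookkeeping used to build the real data set.

```python
import re


def make_qnli_pairs(question, paragraph, answer):
    """Label each sentence of a paragraph by whether it contains the answer text."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
    # the real conversion does not necessarily split text this way).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
    return [
        (question, sentence, "entailment" if answer in sentence else "not_entailment")
        for sentence in sentences
    ]


# Toy data invented for illustration; it is not taken from the QNLI files.
pairs = make_qnli_pairs(
    "When was the tower completed?",
    "The tower stands in the city center. It was completed in 1889.",
    "1889",
)
for pair in pairs:
    print(pair)
```

Each paragraph yields one pair per sentence, which is why negative examples come "from the same paragraph" as the positive one.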

Fine-tuning a BERT model for QNLI requires multiple steps. The steps include integrated data preparation and model creation tasks. SAS DLPy provides detailed and documented examples of these steps in the companion tutorial Jupyter notebooks *bert_data_preparation_quickstart* and *bert_load_model_quickstart*. 

This end-to-end BERT QNLI example follows the sequence of steps shown below: 

* [Define SAS Viya Version of BERT Model](#Define-Viya-BERT)
* [Explore and Clean Training Data](#Data-Exploration-and-Cleaning-Training)
* [Tokenize Training Data and Upload as a SAS Viya Table](#Data-Tokenization-Training)
* [Compile the BERT Classification Model ](#Compile-BERT-Model)
* [Attach BERT Model Parameters](#Attach-BERT-Parms)
* [Fine-tune BERT Model for QNLI Task](#Fine-Tune-Train)
* [Explore and Clean Test Data](#Data-Exploration-and-Cleaning-Test)
* [Tokenize Test Data and Upload as a SAS Viya Table](#Data-Tokenization-Test)
* [Perform Inference Using the Test Data](#Inference-Test-Data)

The code below navigates each step and concludes by scoring a test data set that demonstrates the effectiveness of the fine-tuned BERT model.

<h2> Client and Server Terminology in this Example </h2>

SAS Viya literature and technical documentation often refer to client and server entities. In this scenario, the client is the computer that runs the Jupyter notebook. The server is the computer that runs the SAS Viya server. These two computers might (or might not) use the same operating system, and might (or might not) have access to a common file system.

This example assumes that the client and server use the same operating system, and that they have access to a common file system. If the client and server in your environment do not have access to a common file system, you will need to copy or transfer files between client and server during this example. This notebook uses comments in cells to indicate when a given file should be moved from the client to the server.

This example begins by configuring the computing environment for the modeling task to follow. 

Configuring the environment includes [importing required Python packages](#ImportPackages), [specifying client parameters](#Specify-Client-Parms), [connecting to the SAS Viya Server](#Connect-to-SAS-Viya-Server), and [specifying the modeling task parameters](#Set-Example-Parms). 

<a id="ImportPackages"></a>
<h2> Import Python Function Packages </h2>

In [1]:
# Import the necessary Python packages
import sys
import os
import numpy as np
import pandas as pd

<a id="Specify-Client-Parms"></a>
<h3> Specify Client Parameters </h3>

This section of code configures the notebook for use with Linux or Windows SAS Viya clients.

In [2]:
# Generalize the example Jupyter notebook so it can be used with Linux and Windows Viya clients

# Linux client parameters
if sys.platform.startswith('linux'):
    
    # path to HuggingFace cache directory
    cache_dir='/path-to/huggingface-cache-dir'  
    
# Windows client parameters    
elif sys.platform.startswith('win'):
            
    # path to HuggingFace cache directory
    cache_dir='Drive:\\path-to\\huggingface-cache-dir'

# Unsupported operating system client    
else:
    raise ValueError('Unrecognized operating system')
    
from dlpy.transformers.bert_model import BERT_Model

<a id="Connect-to-SAS-Viya-Server"></a>
<h3> Connect to the SAS Viya Server </h3>

This section of code imports the SAS Scripting Wrapper for Analytics Transfer ([SWAT](https://github.com/sassoftware/python-swat)), configures the Python matplotlib utility for Jupyter notebook output display, connects to a SAS Viya server, and loads the SAS CAS DeepLearn action set into memory.

In [3]:
# The SAS Scripting Wrapper for Analytics Transfer (SWAT)
# is a module used to connect to a SAS Viya server.
# Only the CAS connection class is needed in this example.
from swat import CAS

# Configure the Python matplotlib utility to display output in 
# Jupyter notebook cells.
%matplotlib inline

# Specify the name of your SAS Viya host
host = 'your-host-name'

# Specify your installation's port ID (5 digit number) and user ID parameters.
s = CAS(host, portID, 'userID')

# load the SAS Viya Deep Learning Action Sets
s.loadactionset('deepLearn')

NOTE: Added action set 'deepLearn'.


<a id="Set-Example-Parms"></a>
<h3> Set the Example Task Parameters </h3>

The task performed in this Jupyter notebook example is to determine whether the answer to a question is given in the sentence portion of a question/sentence pair.  This is an instance of a binary classification problem where

* *class 1*: sentence does not answer question (*not_entailment*)
* *class 2*: sentence answers question (*entailment*)

The variable ``n_classes`` in the cell below indicates that the final classification layer in the BERT model requires predictions for two classes. 

BERT models are composed of a number of [Transformer](http://jalammar.github.io/illustrated-transformer/) encoder blocks or layers.  For this binary classification task, the BERT<sub>base</sub> model described in the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) is used.  This model consists of twelve encoder layers.  You can choose to incorporate any number of encoder layers, up to the maximum that the model supports.  The variable ``num_encoder_layers`` in the cell below indicates that all twelve layers should be used. 
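
To get a feel for what choosing fewer encoder layers means, the sketch below counts the weights and biases in a standard Transformer encoder block using the BERT<sub>base</sub> dimensions from the paper (hidden size 768, feed-forward size 3072). The function `encoder_layer_params` is a hypothetical helper written for this illustration. The parameter total that the SAS Viya server reports can differ, because it depends on how the layers are mapped and on which embedding and head parameters are included.

```python
def encoder_layer_params(hidden_size=768, intermediate_size=3072):
    """Count weights and biases in one standard Transformer encoder block."""
    # Query, key, value, and attention-output projections (weights + biases).
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two feed-forward projections: hidden -> intermediate -> hidden.
    feed_forward = (hidden_size * intermediate_size + intermediate_size) + (
        intermediate_size * hidden_size + hidden_size
    )
    # Two layer-normalization steps, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden_size)
    return attention + feed_forward + layer_norms


per_layer = encoder_layer_params()
for n_layers in (1, 6, 12):
    print(f"{n_layers:2d} encoder layers: {n_layers * per_layer:,} parameters")
```

Because every encoder block has the same shape, the encoder parameter count grows linearly with the number of layers that you keep.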

In [4]:
# Task parameters for BERT model

# Final classification layer number of class predictions
n_classes = 2

# Number of BERT encoder layers used
num_encoder_layers = 12

<a id="Define-Viya-BERT"></a>
<h2> Define SAS Viya Version of BERT Model </h2>

First, create an instance of the ``BERT_Model`` class.  This class uses the ``bert-base-uncased`` model from the [HuggingFace](https://huggingface.co/transformers/installation.html) repository to build the SAS Viya BERT model.  The code in the notebook cell below assumes that you have already run the following code to download the necessary JSON files (see [pre-trained models](https://huggingface.co/transformers/pretrained_models.html)) to your local cache directory:


``model = BertModel.from_pretrained('bert-base-uncased',
                                    cache_dir='/path-to/your-cache-directory'
                                   )``

When the ``BERT_Model`` class defines a new BERT model, it also prepares the embedding table that is needed to specify the token, position, and segment inputs for the BERT encoder layers.  
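
As a rough illustration of how the three inputs are used, the published BERT architecture adds a token embedding, a position embedding, and a segment embedding for each input position. The sketch below uses tiny made-up lookup tables with two-dimensional vectors. All names and values are assumptions for illustration, and the layer normalization and dropout that follow the sum in the real model are omitted.

```python
# Toy lookup tables (invented values); real BERT-base vectors have 768 dimensions.
token_table = {"[CLS]": [0.1, 0.2], "what": [0.3, 0.1], "[SEP]": [0.0, 0.5]}
position_table = [[0.01, 0.02], [0.03, 0.04], [0.05, 0.06]]
segment_table = [[0.5, 0.5], [-0.5, -0.5]]


def embed(tokens, segment_ids):
    """Return one summed embedding vector per input position."""
    rows = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        parts = (token_table[token], position_table[position], segment_table[segment])
        # Element-wise sum of the token, position, and segment vectors.
        rows.append([sum(values) for values in zip(*parts)])
    return rows


vectors = embed(["[CLS]", "what", "[SEP]"], [0, 0, 0])
for vector in vectors:
    print(vector)
```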

In the following code block, note that verbose output is enabled. Verbose output provides additional feedback for the start, progress, and completion of some internal computation operations that are not typically surfaced. The additional process information is useful as a "heartbeat" indicator (*the computation is still processing*), because some internal BERT tasks for large models can require large amounts of time to complete.

**Note:** The new ``bert`` object is a fully functional SAS DLPy model object.  This means that you can train, score, and evaluate the BERT model just as you would any other DLPy model object.   

In [5]:
# instantiate a version of the HuggingFace BERT model
huggingface_name = 'bert-base-uncased'

bert = BERT_Model(s,
                  cache_dir,
                  huggingface_name,
                  n_classes,
                  num_hidden_layers = num_encoder_layers,
                  max_seq_len=256,
                  verbose=True)

NOTE: loading base BERT model bert-base-uncased ...
NOTE: base BERT model loaded.
NOTE: loading BERT embedding table ... 
NOTE: BERT embedding table loaded.


<a id="Download-the-Data"></a>
<h3> Download the Raw Data </h3>

In this section, you download the raw data that you will prepare for use with a BERT model.

Obtain the QNLI data ([QNLIv2.zip](https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQNLIv2.zip?alt=media&token=6fdcf570-0fc5-4631-8456-9505272d1601)) for the example, and unzip to a directory that is accessible from your Viya client. You should end up with three files: ``train.tsv``, ``test.tsv``, and ``dev.tsv``.

<a id="Data-Exploration-and-Cleaning-Training"></a>
<h2>Explore and Clean the Training Data</h2>

After downloading and unzipping ``QNLIv2.zip``, you can open ``train.tsv`` and examine the contents. You can use your favorite spreadsheet or text editor to open this file and scan the contents. Each observation contains a (question, sentence) pair, along with the correct label.   

The following code reads the ``train.tsv`` data for index, question (input), sentence (input), and label (target) into a Pandas data frame.

<a id="Read-Data-into-Memory"></a>
<h3>Read the Data into Pandas Data Frame</h3>

In [6]:
# Read in the training data set for QNLI classification
train_data = pd.read_csv('/path-to/qnli-data/train.tsv',
                         header=0,
                         sep='\t',
                         # Skip malformed rows silently (pandas >= 1.3; older
                         # versions used error_bad_lines/warn_bad_lines)
                         on_bad_lines='skip',
                         names=['index', 
                                'question',
                                'sentence',
                                'label'
                               ]
                         )
# Labels for inputs
input_a_label = 'question'
input_b_label = 'sentence'
# Labels for targets
target_label = 'label'

<h3> Extract Training Data and Assign Numeric Classes </h3>

To enable the BERT analysis, the target variable values must be translated from the original nominal values (*entailment* and *not_entailment*) to consistent numeric integer values.  The translation uses simple rules:

*not_entailment* <=> 1  
*entailment* <=> 2

You can use a different numeric translation convention if you desire, as long as you are consistent.  The nominal to numeric translation must be performed before the ``bert_prepare_data()`` statement is processed.  

**Note**: there are two inputs in the example: *question* and *sentence*.  The inputs are combined in a subsequent step.
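
The translation rule can also be written as a dictionary lookup. This is a minimal sketch, not part of DLPy: `LABEL_TO_CLASS` and `encode_labels` are hypothetical names, and raising an error on an unexpected label is a design choice made here so that a stray string never reaches the data preparation step.

```python
# Nominal-to-numeric convention used in this example.
LABEL_TO_CLASS = {"not_entailment": 1, "entailment": 2}


def encode_labels(labels):
    """Translate nominal QNLI labels to integer classes, failing on unknown values."""
    encoded = []
    for index, label in enumerate(labels):
        if label not in LABEL_TO_CLASS:
            raise ValueError(f"Unknown target label {label!r} at position {index}")
        encoded.append(LABEL_TO_CLASS[label])
    return encoded


print(encode_labels(["entailment", "not_entailment", "entailment"]))
```

Keeping the convention in one dictionary makes it easier to apply the same mapping to the training and evaluation data.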

In [7]:
# Define the two inputs and the target
input_a = train_data[input_a_label].to_list()
input_b = train_data[input_b_label].to_list()
targets = train_data[target_label].to_list()

# Translate nominal target values 
# to numeric integer values
for ii, tgt in enumerate(targets):
    if tgt == 'entailment':
        targets[ii] = 2
    elif tgt == 'not_entailment':
        targets[ii] = 1
    else:
        print('Unknown target label.')

<a id="Data-Tokenization-Training"></a>
<h2> Tokenize the Training Data and Upload as a SAS Viya Table </h2>

After you extract the input data, the data must be tokenized for use with BERT.

The input data consists of text strings that are composed of English words, but BERT networks cannot operate directly on text data. Instead, the text must be tokenized using the [WordPiece algorithm](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf).  

The WordPiece algorithm segments words into subword components, or tokens, for NLP tasks. The BERT model translates the WordPiece tokens to WordPiece embeddings internally. Each embedding is a vector of numeric values that represents a token. BERT combines the WordPiece embeddings with position and segment embeddings when forming the input to the first encoder layer.  
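
The segmentation step can be sketched as a greedy longest-match-first search over a vocabulary, which is how BERT-style tokenizers are commonly described. The function `wordpiece_split` and the three-entry vocabulary below are assumptions for illustration only; the real tokenizer uses the full ``bert-base-uncased`` vocabulary and also handles lowercasing and punctuation.

```python
def wordpiece_split(word, vocab, unknown_token="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest-match-first search."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches, so the whole word maps to the unknown token.
            return [unknown_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption), chosen to reproduce a split seen in this data.
toy_vocab = {"man", "##umi", "##ssion"}
print(wordpiece_split("manumission", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

The ``##`` prefix marks a piece that continues the previous piece, which is why tokenized text contains fragments such as ``##umi``.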

A full discussion of these embeddings is beyond the scope of this example, but you can find a tutorial-style explanation [here](http://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/).  

After the input data has been tokenized, the table is ready to be uploaded as a SAS Viya table.

The helper function ``bert_prepare_data()`` handles all of the data preparation and formatting that is required to upload the SAS Viya table, including combining the question/sentence into a single input of the form

    [CLS] question tokens [SEP] sentence tokens [SEP]

The last parameter in the ``bert_prepare_data`` specification requests verbose log output. This setting provides detailed user feedback that you can use to track the progress of the data preparation. Be warned that some large data sets might take a long time to complete.
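
The combined input layout can be sketched in a few lines. This is an illustration under stated assumptions, not the DLPy implementation: `build_bert_input` is a hypothetical helper, the tokens are assumed to be already split, and truncation simply keeps the first ``max_seq_len`` tokens. The exact formatting and truncation rules that ``bert_prepare_data`` applies may differ.

```python
def build_bert_input(question_tokens, sentence_tokens, max_seq_len=256):
    """Combine two token lists into BERT-style token, position, and segment lists."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + sentence_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the sentence and the final [SEP].
    segments = [0] * (len(question_tokens) + 2) + [1] * (len(sentence_tokens) + 1)
    # Keep only the first max_seq_len entries (simple head truncation).
    tokens = tokens[:max_seq_len]
    segments = segments[:max_seq_len]
    positions = list(range(len(tokens)))
    return tokens, positions, segments


tokens, positions, segments = build_bert_input(
    "what is another name ?".split(),
    "nobody was sentenced .".split(),
)
print(" ".join(tokens))
print(positions)
print(segments)
```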

In [9]:
# Import the DLPy bert_prepare_data function
from dlpy.transformers.bert_utils import bert_prepare_data

# Specify the bert_prepare_data function parameters
num_tgt_var, train = bert_prepare_data(s, 
                                       bert.get_tokenizer(), 
                                       bert.get_max_sequence_length(), 
                                       input_a=input_a, 
                                       input_b=input_b, 
                                       target=targets, 
                                       classification_problem=bert.get_problem_type(), 
                                       segment_vocab_size=bert.get_segment_size(), 
                                       verbose=True
                                      )

NOTE: 10% of the observations tokenized.
NOTE: 20% of the observations tokenized.
NOTE: 30% of the observations tokenized.
NOTE: 40% of the observations tokenized.
NOTE: 50% of the observations tokenized.
NOTE: 60% of the observations tokenized.
NOTE: 70% of the observations tokenized.
NOTE: 80% of the observations tokenized.
NOTE: 90% of the observations tokenized.
NOTE: 100% of the observations tokenized.
NOTE: all observations tokenized.

These observations have been truncated so that only the first 256 tokens are used.

NOTE: uploading data to table bert_data.
NOTE: there are 103106 observations in the data set.

NOTE: data set ready.



<h2> Review and Verify the BERT Data </h2>

According to the output, the example should have created a new train data table. Now, review the new table and verify that the new table has the expected number and type of columns, headings, and cell contents. After verifying proper column structures and contents in the new tables, you can also display the summary statistics to get a sense of the data.

* [Verify Training Table Columns](#Verify-Train-Columns)
* [Review Training Data](#Review-Train-Data)

<a id="Verify-Train-Columns"></a>
<h3> Verify the Training Table Columns </h3>

The ``bert_prepare_data`` function creates variables in the SAS Viya table that are used by the token, position, and segment embeddings.  If you are using data that has associated labels, the function also creates target and target length variables.  Since this example uses labeled data, the output of the *columninfo* action in the following code should contain at least 5 columns:

* \_tokens\_         => tokenized input text strings
* \_position\_       => position indication strings used by position embeddings
* \_segment\_        => segment indication strings used by segment embeddings
* \_target\_0\_      => target variable (only one for this example)
* \_target\_length\_ => target length variable
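
If you prefer a programmatic check over reading the table by eye, a small helper can compare the column names against the list above. This is a sketch only: `missing_columns` is a hypothetical function, and it expects a plain list of column names that you would extract yourself from the *columninfo* result.

```python
EXPECTED_COLUMNS = [
    "_tokens_",
    "_position_",
    "_segment_",
    "_target_0_",
    "_target_length_",
]


def missing_columns(actual_columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from actual_columns."""
    present = set(actual_columns)
    return [name for name in expected if name not in present]


# Column names copied from the expected list, with one deliberately left out.
print(missing_columns(["_tokens_", "_position_", "_segment_", "_target_0_"]))
```

An empty list means that every expected column is present; extra columns are ignored.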

In [10]:
# Display train table column info
s.columninfo(table=train)

Unnamed: 0,Column,Label,ID,Type,RawLength,FormattedLength,Format,NFL,NFD
0,_tokens_,_tokens_,1,varchar,1434,1434,,0,0
1,_position_,_position_,2,varchar,3217,3217,,0,0
2,_segment_,_segment_,3,varchar,2815,2815,,0,0
3,_target_0_,_target_0_,4,varchar,1,1,,0,0
4,_target_length_,_target_length_,5,double,8,12,,0,0


<a id="Review-Train-Data"></a>
<h3> Review the Training Data </h3>

This step is a good development practice, but is not strictly necessary. It is usually a good idea to review your pre-processed data in order to verify that there are no remaining artifacts.  The code and output cell below displays the token and target column values for a small number of random observations, so you can visually inspect them.  You can control what is displayed through the ``num_obs`` and ``columns`` arguments.  

In [11]:
from dlpy.transformers.bert_utils import display_obs

display_obs(s, 
            train, 
            num_obs=3, 
            columns=['_tokens_',
                     '_target_0_'
                    ]
            )

------- Observation:  101014 -------

_tokens_ :  [CLS] in 1840 , about how many african - americans lived in new york city ? [SEP] under such influential united states founders as alexander hamilton and john jay , the new york man ##umi ##ssion society worked for abolition and established the african free school to educate black children . [SEP]


_target_0_ :  1


------- Observation:  8736 -------

_tokens_ :  [CLS] what part of birds was found on an expedition to mt . everest ? [SEP] an expedition to mt . everest found skeletons of northern pin ##tail ana ##s ac ##uta and black - tailed god ##wi ##t limo ##sa limo ##sa at 5 , 000 m ( 16 , 000 ft ) on the k ##hum ##bu glacier . [SEP]


_target_0_ :  2


------- Observation:  31988 -------

_tokens_ :  [CLS] what is another name for the u ##b ##da ? [SEP] not every person accused of a political crime was convicted and nobody was sentenced to death for his or her pro - soviet feelings . [SEP]


_target_0_ :  1




<a id="Compile-BERT-Model"></a>
<h2> Compile the BERT Classification Model </h2>

We exploit the functional API approach by using SAS DLPy to create a BERT model that terminates with a classification head. This step maps the layer definitions from the HuggingFace BERT model to equivalent SAS Deep Learning layers.  Given these SAS layer definitions, a SAS equivalent BERT model is created and all layers are connected.  

**Note:** So far, the example has not assigned any model parameters. Model parameters are assigned after the BERT model is compiled, using the HDF5 file (``bert-base-uncased.kerasmodel.h5``) generated as part of this step.

In [12]:
# Compile the BERT HDF5 file
bert.compile(num_target_var=num_tgt_var)

NOTE: HDF5 file is /path-to/huggingface-cache-dir/bert-base-uncased.kerasmodel.h5
NOTE: Model compiled successfully.


<a id="Attach-BERT-Parms"></a>
<h2> Attach the BERT Model Parameters </h2>

After compiling the BERT classification model, it is time to attach the base BERT model parameters, as well as the randomly initialized parameters used for the model's classification head.  The code below specifies that the base BERT model layers should not be frozen when training (``freeze_base_model=False``), so all model layers will be updated when fine-tuning the model.  

If you choose to specify ``freeze_base_model=True``, then only the final classification layer in the BERT model will be trained.  The parameters for all other layers will be frozen.

If your client and server do not share a common file system, you **must** manually copy the file ``bert-base-uncased.kerasmodel.h5`` from your client to your server before you can run the following code block.  You must also provide the fully qualified file name on the server when you invoke the ``load_weights()`` function.  

The fully-qualified client file name is displayed in the output cell after a successful BERT model compile. The file name is part of the output message: 

``NOTE: HDF5 file is /path-to-local-cache-dir/bert-base-uncased.kerasmodel.h5.``

In [13]:
# Attach the BERT model parameters 
bert.load_weights(bert.get_hdf5_file_name(), 
                  num_target_var=num_tgt_var, 
                  freeze_base_model=False)

NOTE: Model weights attached successfully!


<a id="Fine-Tune-Train"></a>
<h2> Fine-tune the BERT Model for QNLI Task </h2>

The example uses the ``fit()`` member function to fine-tune the BERT model.  The BERT_Model class provides a number of convenience functions that enable you to set the required ``fit()`` parameters automatically:

* ``get_data_spec()`` - returns the data specification for input/output layers
* ``get_optimizer()`` - returns an Optimizer object suitable for fine-tuning training
* ``get_text_parameters()`` - returns a TextParms object needed for input data parsing

When you create a BERT model, an instance of an Optimizer object is defined, along with some default training settings.  The default settings were obtained from the **Fine-tuning Procedure** section of [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). The code below overrides the default learning rate and resets the value to 2x10<sup>-5</sup>.  Since the default mini-batch size per thread is 1, setting the number of threads to 32 ensures that the total mini-batch size is 32.
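
The mini-batch arithmetic can be checked with a short calculation. The sketch assumes a single worker, which is consistent with a total mini-batch size of 32, and uses the number of training observations reported when the training table was uploaded. The number of batches per epoch is derived here and is not a value printed by the training log.

```python
import math

n_threads = 32           # value passed to fit() as n_threads
batch_per_thread = 1     # default mini-batch size per thread
n_workers = 1            # assumption: a single worker
n_observations = 103106  # observations in the uploaded training table

# Total mini-batch size in synchronous mode.
total_batch_size = n_threads * batch_per_thread * n_workers
# Number of mini-batches needed to visit every observation once.
batches_per_epoch = math.ceil(n_observations / total_batch_size)

print(f"total mini-batch size: {total_batch_size}")
print(f"mini-batches per epoch: {batches_per_epoch}")
```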

The example code specified verbose output earlier when creating the BERT model, so the training process will also provide detailed information during training initiation and for each mini-batch.  A model summary and training history is also provided after the training completes.  If you don't want this level of detail, simply set ``verbose=False`` when instantiating your BERT model.

In [14]:
# Override default learning rate
bert.set_optimizer_parameters(learning_rate=2e-5)
    
# Fine-tune the BERT model
bert.fit(train, 
         data_specs=bert.get_data_spec(num_tgt_var),
         optimizer=bert.get_optimizer(), 
         text_parms=bert.get_text_parameters(), 
         seed=12345, 
         n_threads=32)

NOTE: Training based on existing weights.
NOTE:  Synchronous mode is enabled.
NOTE:  The total number of parameters is 86828546.
NOTE:  The approximate memory cost is 20536.00 MB.
NOTE:  Loading weights cost       0.48 (s).
NOTE:  Initializing each layer cost      11.97 (s).
NOTE:  The total number of threads on each worker is 32.
NOTE:  The total mini-batch size per thread on each worker is 1.
NOTE:  The maximum mini-batch size across all workers for the synchronous mode is 32.
NOTE:  Target variable: _target_0_
NOTE:  Number of levels for the target variable:      2
NOTE:  Levels for the target variable:
NOTE:  Level      0: 1
NOTE:  Level      1: 2
NOTE:  Number of input variables:     3
NOTE:  Number of text input variables:      3
NOTE:  Batch nUsed Learning Rate        Loss  Fit Error   Time(s) (Training)
NOTE:      0    32  0.00002           0.7411     0.4688    12.02
NOTE:      1    32  0.00002           0.7341        0.5    12.56
NOTE:      2    32  0.00002           0.7356   

Unnamed: 0,Descr,Value
0,Model Name,classification
1,Model Type,Recurrent Neural Network
2,Number of Layers,92
3,Number of Input Layers,3
4,Number of Output Layers,1
5,Number of Convolutional Layers,0
6,Number of Pooling Layers,0
7,Number of Fully Connected Layers,25
8,Number of Recurrent Layers,1
9,Number of Residual Layers,25

Unnamed: 0,Epoch,LearningRate,Loss,FitError
0,1,2e-05,0.235187,0.087729
1,2,2e-05,0.136159,0.046337
2,3,2e-05,0.094927,0.030193

Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,CASUSER(docair),classification_weights,86828546,3,"CASTable('classification_weights', caslib='CAS..."


<a id="Data-Exploration-and-Cleaning-Test"></a>
<h2>Explore and Clean the Test Data</h2>

This example step uses the **dev** data set to evaluate our trained model.  The **dev** set was originally intended to be used for hyperparameter tuning and validation, but the example uses the **dev** data set to test the trained model because it includes labels. (The **test** data set does not include labels.)

<h3> Read the Data Into Pandas Data Frame  </h3>

In [15]:
# Read the dev dataset for QNLI classification
test_data = pd.read_csv('/path-to/qnli-data/dev.tsv',
                         header=0,
                         sep='\t',
                         # Skip malformed rows silently (pandas >= 1.3; older
                         # versions used error_bad_lines/warn_bad_lines)
                         on_bad_lines='skip',
                         names=['index', 
                                'question',
                                'sentence',
                                'label'
                                ]
                        )

# Labels for the dev inputs and target
input_a_label = 'question'
input_b_label = 'sentence'
target_label = 'label'

<h3> Extract the Test Data and Assign Numeric Classes </h3>

This process follows the same procedure that was used with the training data.

In [16]:
# Define the two test inputs and the target
input_a = test_data[input_a_label].to_list()
input_b = test_data[input_b_label].to_list()
targets = test_data[target_label].to_list()

# Translate nominal target values into numeric values
for ii, tgt in enumerate(targets):
    if tgt == 'entailment':
        targets[ii] = 2
    elif tgt == 'not_entailment':
        targets[ii] = 1
    else:
        print('Unknown target label.')

<a id="Data-Tokenization-Test"></a>
<h2> Tokenize the Test Data and Upload as a SAS Viya Table </h2>

This process follows the same procedure that was used with the training data.

In [17]:
# Specify parameters for the bert_prepare_data function
num_tgt_var, test = bert_prepare_data(s, 
                                      bert.get_tokenizer(), 
                                      bert.get_max_sequence_length(), 
                                      input_a=input_a, 
                                      input_b=input_b, 
                                      target=targets, 
                                      classification_problem=bert.get_problem_type(), 
                                      segment_vocab_size=bert.get_segment_size(), 
                                      verbose=True)

NOTE: 10% of the observations tokenized.
NOTE: 20% of the observations tokenized.
NOTE: 30% of the observations tokenized.
NOTE: 40% of the observations tokenized.
NOTE: 50% of the observations tokenized.
NOTE: 60% of the observations tokenized.
NOTE: 70% of the observations tokenized.
NOTE: 80% of the observations tokenized.
NOTE: 90% of the observations tokenized.
NOTE: 100% of the observations tokenized.
NOTE: all observations tokenized.

These observations have been truncated so that only the first 256 tokens are used.

NOTE: uploading data to table bert_data.
NOTE: there are 5266 observations in the data set.

NOTE: data set ready.



<a id="Inference-Test-Data"></a>
<h2> Perform Inference Using the Test Data </h2>

Next, the example uses the ``predict()`` member function to perform inference.  Once more, the code uses the ``get_text_parameters()`` function to set the text parameters appropriately.

The results below show the misclassification rate and the loss error.  The misclassification rate is reported as a percentage (about 9.65%), so the correct classification rate is simply (1 - 0.0965)\*100 = 90.35%. The example's 90.35% classification rate is slightly better than the 90.1% classification rate achieved for the BERT<sub>base</sub> model in *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*.  

Remember that the misclassification and loss error results for this example were obtained using the **dev** data set, whereas the results shown in the original academic paper were obtained using the **test** data set. 
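
The conversion from misclassification error to classification rate is a one-line calculation. The sketch below uses the two values reported in the ``ScoreInfo`` output of this example; the counts of correct and incorrect predictions are derived from those values and are not printed by the scoring action.

```python
n_observations = 5266             # "Number of Observations Used"
misclassification_pct = 9.646791  # "Misclassification Error (%)"

# The error is already a percentage, so subtract it from 100.
accuracy_pct = 100.0 - misclassification_pct

# Derived counts (rounded to the nearest whole observation).
n_misclassified = round(n_observations * misclassification_pct / 100.0)
n_correct = n_observations - n_misclassified

print(f"classification rate: {accuracy_pct:.2f}%")
print(f"correct: {n_correct}, misclassified: {n_misclassified}")
```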

In [18]:
# Score the BERT dev data
res = bert.predict(test, 
                   text_parms=bert.get_text_parameters()
                   )

print(res['ScoreInfo'])

                         Descr         Value
0  Number of Observations Read          5266
1  Number of Observations Used          5266
2  Misclassification Error (%)      9.646791
3                   Loss Error      0.300679


<h2> Terminate the SAS Viya Session </h2>

In [19]:
s.endsession()