<h1> Prepare Data for BERT Model </h1>

The example in this Jupyter notebook walks through the process of preparing data for a Bidirectional Encoder Representations from Transformers (BERT) model. Before working with this notebook, review the **Input Representation** section of [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) (Devlin, Chang, Lee, and Toutanova) to understand the requirements for data used by BERT models.

This Jupyter notebook takes a step-by-step approach. 

First, prepare your computing environment and tools: 

* [Define and Configure the Computing Environment](#Define-and-Configure-the-Computing-Environment)
* [Connect to the SAS Viya Server](#Connect-to-SAS-Viya-Server)
* [Download the Raw Data](#Download-the-Data)

Next, perform the data preparation task for the BERT model using three steps:

1. [Explore and Clean the Data](#Data-Exploration-and-Cleaning)
2. [Tokenize the Data](#Data-Tokenization)
3. [Export the Data to a SAS Viya Table](#Export-Data-to-Viya-Table)

The output of this example is an organized and cleaned data set suitable for use with a BERT model.


<h2> Client and Server Terminology in this Example</h2>

SAS Viya literature and technical documentation often refer to client and server entities. In this example, the client is the computer that runs the Jupyter notebook, and the server is the computer that runs the SAS Viya server. These two computers might (or might not) use the same operating system, and might (or might not) have access to a common file system.

System configurations vary: the client and server operations in this example might be different in other environments. This example only requires that the data is accessible from the client computer.

<a id="Define-and-Configure-the-Computing-Environment"></a>
<h2> Define and Configure the Computing Environment </h2>

The example defines and configures the computing environment in the following steps:

* [Import Required Python Packages](#Import-Required-Python-Packages)
* [Specify Client Parameters](#Specify-Client-Parameters)
* [Connect to the SAS Viya Server](#Connect-to-SAS-Viya-Server)


<a id="Import-Required-Python-Packages"></a>
<h3> Import Required Python Packages </h3>

This section of code imports the Python packages that contain the functions required to perform the BERT data preparation.

**Note:** The SAS Deep Learning tools require the [Hugging Face implementation](https://github.com/huggingface/transformers) of BERT.

In [1]:
# Import necessary packages
# Import sys and os packages
import sys
import os

# Import Numpy scientific computing and 
# Pandas data structures libraries
import numpy as np
import pandas as pd

# Import BERT Tokenizer function
from transformers import BertTokenizer

<a id="Specify-Client-Parameters"></a>
<h3> Specify Client Parameters </h3>

This section of code configures the notebook for use with Linux or Windows SAS Viya clients.

In [2]:
# Generalize the example Jupyter notebook so it can be used with Linux and Windows Viya clients

# Linux client parameters
if sys.platform.startswith('linux'):    
    
    # path to data
    data_file_name = '/your-environment-path/to-model-data/imdb_master.csv'
    
    # path to HuggingFace cache directory
    cache_dir='/your-environment-path/to-cache_dir'   
    
# Windows client parameters    
elif sys.platform.startswith('win'):

    # path to data
    data_file_name = 'Drive:\\your-network-path\\to-model-data\\imdb_master.csv'
    
    # path to HuggingFace cache directory
    cache_dir='Drive:\\your-path\\to-cache-dir'

# Unsupported operating system client    
else:
    raise ValueError('Unrecognized operating system')
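
Before continuing, it can help to confirm that the paths set above exist on the client. The following is a minimal sketch that is not part of the original notebook. The helper name `check_client_paths` is hypothetical, and it uses only the Python standard library.

```python
import os


def check_client_paths(data_file, cache_directory):
    """Return a list of problems; an empty list means both paths exist."""
    problems = []
    # The data file must be a regular file that the client can see.
    if not os.path.isfile(data_file):
        problems.append(f"data file not found: {data_file}")
    # The cache location must be an existing directory.
    if not os.path.isdir(cache_directory):
        problems.append(f"cache directory not found: {cache_directory}")
    return problems
```

In this notebook, the call would be `check_client_paths(data_file_name, cache_dir)`. An empty list indicates that both locations were found.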

<a id="Connect-to-SAS-Viya-Server"></a>
<h3> Connect to the SAS Viya Server </h3>

This section of code configures the Python matplotlib utility for Jupyter notebook output display, imports the SAS Scripting Wrapper for Analytics Transfer (SWAT), connects to a SAS Viya server, and loads the SAS CAS DeepLearn action set into memory.

In [3]:
# Configure the Python matplotlib utility to display output in 
# Jupyter notebook cells.
%matplotlib inline

# The SAS Scripting Wrapper for Analytics Transfer (SWAT)
# is a module used to connect to a SAS Viya server.
from swat import * 

# Specify the name of your SAS Viya host
host = 'your-host-name'

# Specify your installation's port number and user ID
s = CAS(host, 99999, 'userID')

# Load the SAS Viya Deep Learning action set
s.loadactionset('deepLearn')

NOTE: Added action set 'deepLearn'.


<a id="Download-the-Data"></a>
<h3> Download the Raw Data </h3>

In this section, you download the raw data that you will prepare for use with a BERT model.

The model data in this example is the [Kaggle](http://kaggle.com) public-domain IMDB movie review data set named  [imdb_master.csv](https://www.kaggle.com/utathya/imdb-review-dataset). Download and store the ``imdb_master.csv`` file in a directory that your SAS Viya client can access. 

<a id="Data-Exploration-and-Cleaning"></a>
<h2>Explore and Clean the Data</h2>

After downloading the ``imdb_master.csv`` file, examine the contents. You can use your favorite spreadsheet to open the CSV file, scan the contents, and see the data schema for this table. Each observation represents a movie review, along with its associated metadata.  

While perusing the data, feel free to clean up any cell contents that might have formatting issues. For example, note that some of the movie review comments contain extraneous formatting tags (such as ``<br />``) that must be removed.  

**Note:** Cleaning the data in the spreadsheet is optional. When this example [extracts the training data](#Extract-and-Shuffle-Data), the code removes all occurrences of the string ``<br />`` from each review and converts the target values from nominal to numeric.
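
As an illustration of the tag removal described above, here is a minimal sketch that uses Pandas on two made-up reviews. It is not the code that the notebook runs later (that code uses a plain Python loop); the frame `sample` and its contents are assumptions for demonstration only.

```python
import pandas as pd

# Two made-up reviews that mimic the HTML residue in the raw file.
sample = pd.DataFrame({
    "review": ["Great film.<br /><br />Would watch again.", "No tags here."]
})

# regex=False treats "<br />" as a literal string rather than a pattern.
sample["review"] = sample["review"].str.replace("<br />", "", regex=False)

print(sample["review"].tolist())
```

Replacing the tag with an empty string joins the text on either side of it, which matches the behavior of the notebook code. Replacing it with a space instead would keep adjacent sentences separated.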

The following code reads the review type, review text (input), sentiment label (target), and file name columns of ``imdb_master.csv`` into a Pandas data frame.

<a id="Read-Data-into-Memory"></a>
<h3>Read the Data into a Pandas Data Frame</h3>



In [4]:
# Read the data set for IMDB movie review sentiment classification
# into a Pandas data frame on the client.
reviews = pd.read_csv(data_file_name,
                      header=0,
                      names=['type', 
                             'review',
                             'label',
                             'file'
                            ],
                      encoding='latin_1'
                     )

# The input data is the text in the review column.
input_label = 'review'       

# The target data is the sentiment text in the label column.
target_label = 'label'       

<a id="Extract-and-Shuffle-Data"></a>
<h3> Extract a Data Subset and Randomly Shuffle Observations</h3>

The data contains three types of reviews: **train**, **test**, and **unsup**.  This example uses only the **train** reviews because it is a precursor step to training a BERT model, so the code must extract that subset of the data.

We must also translate the raw data target variables from the nominal values *neg* and *pos* to consistent numeric values that are suitable for analysis.  We adopt a simple rule for the nominal-to-numeric translation:

*neg* <=> 1  
*pos* <=> 2

**Note:** If you do not perform the nominal-to-numeric translation, your Viya data table will be incorrect.  

In [5]:
# Import the random module
import random

# Define inputs and targets
# Keep observations where type = 'train'
t_idx1 = reviews['type'] == 'train'
# Exclude observations where label = 'unsup' 
t_idx2 = reviews['label'] != 'unsup'
inputs = reviews[t_idx1 & t_idx2][input_label].to_list()
targets = reviews[t_idx1 & t_idx2][target_label].to_list()

# Remove <br /> tags from the review text and change the 
# target values from nominal to numeric format
for ii,val in enumerate(targets):
    inputs[ii] = inputs[ii].replace("<br />","")
    if val == 'neg':
        targets[ii] = 1
    elif val == 'pos':
        targets[ii] = 2
        
# Shuffle the data
# You must use this random seed value if you   
# want to duplicate the example results.
random.seed(123454321)
map_index_position = list(zip(inputs, 
                              targets
                              )
                          )
random.shuffle(map_index_position)
inputs, targets = zip(*map_index_position)
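
The same subset, cleanup, label mapping, and shuffle can also be written with Pandas operations. The sketch below is an alternative formulation on a made-up miniature frame (`toy_reviews` and all `toy_*` names are hypothetical). Because it shuffles with `DataFrame.sample` rather than `random.shuffle`, it does not reproduce the observation order that the notebook obtains with its seed.

```python
import pandas as pd

# Hypothetical miniature of the reviews frame, for illustration only.
toy_reviews = pd.DataFrame({
    "type":   ["train", "train", "test", "train"],
    "review": ["bad<br />film", "good film", "fine", "unlabeled"],
    "label":  ["neg", "pos", "pos", "unsup"],
})

# Same nominal-to-numeric rule as the notebook: neg -> 1, pos -> 2.
label_to_number = {"neg": 1, "pos": 2}

# Keep only labeled training rows.
keep = (toy_reviews["type"] == "train") & (toy_reviews["label"] != "unsup")
subset = toy_reviews[keep].copy()

# Remove the literal "<br />" tags and translate the labels.
subset["review"] = subset["review"].str.replace("<br />", "", regex=False)
subset["label"] = subset["label"].map(label_to_number)

# Shuffle all rows; random_state makes the order repeatable.
shuffled = subset.sample(frac=1, random_state=0)
toy_inputs = shuffled["review"].tolist()
toy_targets = shuffled["label"].tolist()
```

Shuffling whole rows keeps each review paired with its label, which is the same purpose that `zip` serves in the notebook code.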

<a id="Data-Tokenization"></a>
<a id="Export-Data-to-Viya-Table"></a>
<h2> Tokenize the Data and Upload the Data as a SAS Viya Table </h2>

After the input data has been extracted and shuffled, it must be tokenized for use with BERT.

The input data consists of text strings that are composed of English words, but BERT networks cannot operate directly on text data. Instead, the text must be tokenized using WordPiece tokens.  

The WordPiece algorithm segments words into representative components, or tokens, for NLP tasks. The BERT model then translates these WordPiece tokens internally to WordPiece embeddings. An embedding is a vector of numeric values that represents a token. BERT combines the WordPiece embeddings with position and segment embeddings when forming the input to the first encoder layer.  

A full discussion of these embeddings is beyond the scope of this example, but you can find a tutorial-style explanation [here](http://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/).  
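
To make the WordPiece segmentation concrete, the following sketch applies greedy longest-match-first splitting to single words. It is a simplified illustration, not the Hugging Face tokenizer: the function name `wordpiece_tokenize` is hypothetical, the vocabulary is a toy set invented for this example, and the real tokenizer also normalizes text and splits punctuation before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece tokens by greedy longest match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry matches, so the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, invented for this illustration.
toy_vocab = {"wit", "##less", "spoil", "##ers", "movie"}

for word in ["movie", "witless", "spoilers", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, `witless` splits into `wit` and `##less`, and `spoilers` splits into `spoil` and `##ers`, which are the same pieces that appear in the tokenized observations displayed later in this notebook.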

After the input data has been tokenized, the table is ready to be uploaded as a SAS Viya table.

The helper function ``bert_prepare_data()`` handles all of the data preparation and formatting that is required to upload the SAS Viya table.  In this example, the code splits the **train** data into training and validation data sets by specifying ``train_fraction=0.9``.  This code puts 90\% of the data into the training data set, and assigns the remaining 10\% to the validation data set.  

The last parameter in the ``bert_prepare_data()`` call requests verbose output. This setting provides feedback so that you can track the progress of the data preparation. Large data sets might take a long time to process.


In [6]:
# Load the BERT data preparation helper function
from dlpy.transformers.bert_utils import bert_prepare_data

# Run the BERT data preparation helper function
num_tgt_var, train, valid = bert_prepare_data(s, 
                                              BertTokenizer.from_pretrained('bert-base-uncased',
                                                                            cache_dir=cache_dir
                                                                           ), 
                                              512, 
                                              input_a=list(inputs), 
                                              target=list(targets), 
                                              classification_problem=True,
                                              segment_vocab_size=2,
                                              
                                              # Partition the extracted data  
                                              # into 90% Train partition and 
                                              # 10% Validation partition.                                            
                                              train_fraction=0.9,
                                              
                                              # Verbose user feedback
                                              verbose=True
                                             )

NOTE: 10% of the observations tokenized.
NOTE: 20% of the observations tokenized.
NOTE: 30% of the observations tokenized.
NOTE: 40% of the observations tokenized.
NOTE: 50% of the observations tokenized.
NOTE: 60% of the observations tokenized.
NOTE: 70% of the observations tokenized.
NOTE: 80% of the observations tokenized.
NOTE: 90% of the observations tokenized.
NOTE: all observations tokenized.

These observations have been truncated so that only the first 512 tokens are used.

NOTE: uploading training data to CAS table bert_train_data.
NOTE: there are 22593 observations in the training data set.

NOTE: uploading test/validation data to CAS table bert_test_validation_data.
NOTE: there are 2407 observations in the test/validation data set.

NOTE: training and test/validation data sets ready.
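
As a quick check of the partition, the following arithmetic sketch computes the share of observations in each table from the counts reported in the log output above. The variable names are illustrative only.

```python
# Counts copied from the log output above.
n_train = 22593
n_valid = 2407
n_total = n_train + n_valid

train_share = n_train / n_total
valid_share = n_valid / n_total

print(f"total observations: {n_total}")        # 25000
print(f"training share:     {train_share:.4f}")  # 0.9037
print(f"validation share:   {valid_share:.4f}")  # 0.0963
```

The shares are close to, but not exactly, the requested 0.9 and 0.1. This example does not state how ``bert_prepare_data()`` assigns observations to each partition, so treat ``train_fraction`` as an approximate target.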



<h2> Review and Verify the Newly Prepared BERT Data </h2>

According to the output, the example should have created new train and validation data tables. Now we can review the new tables and verify that they have the expected number and type of columns, headings, and cell contents. After verifying proper column structures and contents in the new tables, you can also display the summary statistics to get a sense of the data.

* [Verify Training Table Columns](#Verify-Train-Columns)
* [Verify Validation Table Columns](#Verify-Valid-Columns)
* [Review Training Data](#Review-Train-Data)
* [Display Training Summary Statistics](#Display-Train-Summary)
* [Review Validation Data](#Review-Valid-Data)
* [Display Validation Summary Statistics](#Display-Valid-Summary)

<a id="Verify-Train-Columns"></a>
<h3> Verify Training Table Columns </h3>

The ``bert_prepare_data`` function creates variables in the Viya table that are used by the token, position, and segment embeddings.  If you are using data that has associated labels, the function also creates target and target length variables.  Since this example uses labeled data, the output of the *columninfo* action in the following code should contain at least five columns:

* \_tokens\_         => tokenized input text strings
* \_position\_       => position indication strings used by position embeddings
* \_segment\_        => segment indication strings used by segment embeddings
* \_target\_0\_      => target variable (only one for this example)
* \_target\_length\_ => target length variable

In [7]:
s.columninfo(table=train)

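|   | Column | Label | ID | Type | RawLength | FormattedLength | Format | NFL | NFD |
|---|--------|-------|----|------|-----------|-----------------|--------|-----|-----|
| 0 | \_tokens\_ | \_tokens\_ | 1 | varchar | 2893 | 2893 |  | 0 | 0 |
| 1 | \_position\_ | \_position\_ | 2 | varchar | 6545 | 6545 |  | 0 | 0 |
| 2 | \_segment\_ | \_segment\_ | 3 | varchar | 5631 | 5631 |  | 0 | 0 |
| 3 | \_target\_0\_ | \_target\_0\_ | 4 | varchar | 1 | 1 |  | 0 | 0 |
| 4 | \_target\_length\_ | \_target\_length\_ | 5 | double | 8 | 12 |  | 0 | 0 |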
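
The column check can also be automated. The sketch below compares a list of column names against the five expected names. The helper `missing_columns` is hypothetical, and the list `train_columns` is typed in from the output above rather than read from the server.

```python
EXPECTED_COLUMNS = [
    "_tokens_",
    "_position_",
    "_segment_",
    "_target_0_",
    "_target_length_",
]


def missing_columns(actual_columns, expected_columns=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from actual_columns."""
    present = set(actual_columns)
    return [name for name in expected_columns if name not in present]


# Column names copied from the columninfo output above.
train_columns = ["_tokens_", "_position_", "_segment_", "_target_0_", "_target_length_"]
print(missing_columns(train_columns))
```

An empty list means that every expected column is present. Against a live session, the names could presumably be read from the action result, for example with `s.columninfo(table=train)['ColumnInfo']['Column']`, assuming that the result table is returned under the `ColumnInfo` key.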


<a id="Verify-Valid-Columns"></a>
<h3> Verify Validation Table Columns </h3>

As with the training data, the ``bert_prepare_data`` function creates variables in the validation table that are used by the token, position, and segment embeddings.  If you are using data that has associated labels, the function also creates target and target length variables.  Since this example uses labeled data, the output of the *columninfo* action in the following code should contain at least five columns:

* \_tokens\_
* \_position\_
* \_segment\_
* \_target\_0\_
* \_target\_length\_

In [8]:
s.columninfo(table=valid)

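|   | Column | Label | ID | Type | RawLength | FormattedLength | Format | NFL | NFD |
|---|--------|-------|----|------|-----------|-----------------|--------|-----|-----|
| 0 | \_tokens\_ | \_tokens\_ | 1 | varchar | 2857 | 2857 |  | 0 | 0 |
| 1 | \_position\_ | \_position\_ | 2 | varchar | 6545 | 6545 |  | 0 | 0 |
| 2 | \_segment\_ | \_segment\_ | 3 | varchar | 5631 | 5631 |  | 0 | 0 |
| 3 | \_target\_0\_ | \_target\_0\_ | 4 | varchar | 1 | 1 |  | 0 | 0 |
| 4 | \_target\_length\_ | \_target\_length\_ | 5 | double | 8 | 12 |  | 0 | 0 |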


<a id="Review-Train-Data"></a>
<h3> Review Training Data </h3>

This step is a good development practice, but it is not strictly necessary. Reviewing your pre-processed data helps you verify that no artifacts remain.  The code cell below displays the token and target column values for a small number of random observations so that you can visually inspect them.  You can control what is displayed through the ``num_obs`` and ``columns`` arguments.  

In [9]:
# Import DLPy BERT display observations function
from dlpy.transformers.bert_utils import display_obs

# Display the token and target contents for three 
# observations from the training data
display_obs(s, train, num_obs=3, columns=['_tokens_', '_target_0_'])

------- Observation:  18302 -------

_tokens_ :  [CLS] this movie describes the life of somebody who grew up in the worst of circumstances but unlike many people he actually grew up to be a respectable person . what ##s more is that this is a true story . ant ##won ##e fisher is so innocent and yet he was abused such just because he was not white . ant ##won ##e fisher has been married to the same women for ten years and he never fooled around with women , coke , cigar ##s , weed , alcohol , or any of those things that are very popular in the places he was growing up . there is not much more to say about this movie it is excellent . the only rating i can give it is a 10 / 10 . [SEP]


_target_0_ :  2


------- Observation:  2230 -------

_tokens_ :  [CLS] this is a truly awful ["] b ["] movie . it is wit ##less and often embarrassing . the plot , the basic ["] making into show business ["] routine , is almost none ##xi ##sten ##t . in fact , the film is merely an excuse to push the war


<a id="Display-Train-Summary"></a>
<h3> Display Training Data Summary Statistics </h3>

This step is also a good development practice, but it is not strictly necessary. The ``bert_summary()`` function shows how the input data is distributed in terms of the number of input tokens.  The code below specifies ``full_table=False`` in order to base the statistics on a subset of the full table.  Using a table subset might be necessary with very large tables because otherwise the full table must be copied from the SAS Viya server to the client to compute summary statistics. 

**Note:** You should expect variations between summary statistics that are generated using a data subset and summary statistics that are generated using the full data set.


In [10]:
# Import DLPy BERT Summary Statistics Function
from dlpy.transformers.bert_utils import bert_summary

# Display training data summary statistics 
# using a subset of the full table
bert_summary(s, train, full_table=False)

NOTE: there are 22593 observations in the training data set.
NOTE: calculating summary statistics based on the first 10% of the table.

NOTE: minimum number of tokens in an observation = 31
NOTE: maximum number of tokens in an observation = 512
NOTE: average number of tokens in an observation = 270.2421425409473
NOTE: standard deviation of the number of tokens in an observation = 138.61731127606353
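
The statistics reported above can be understood as simple aggregates over per-observation token counts. The following sketch computes the same four quantities for a few made-up token strings. It is not the ``bert_summary()`` implementation: the helper `token_count_summary` is hypothetical, it assumes that tokens are separated by spaces as in the displayed observations, and it uses the population standard deviation (the NumPy default), which might differ from the formula that ``bert_summary()`` uses.

```python
import numpy as np


def token_count_summary(token_strings):
    """Summarize the number of space-separated tokens per observation."""
    counts = np.array([len(text.split()) for text in token_strings])
    return {
        "min": int(counts.min()),
        "max": int(counts.max()),
        "mean": float(counts.mean()),
        # np.std defaults to the population standard deviation (ddof=0).
        "std": float(counts.std()),
    }


# Made-up token strings in the same style as the _tokens_ column.
toy_tokens = [
    "[CLS] this movie is excellent . [SEP]",
    "[CLS] wit ##less and often embarrassing . [SEP]",
    "[CLS] bad [SEP]",
]

summary = token_count_summary(toy_tokens)
print(summary)
```

The three toy observations contain 7, 8, and 3 tokens, so the minimum is 3, the maximum is 8, and the mean is 6.0.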



<a id="Review-Valid-Data"></a>
<h3> Review Validation Data </h3>

The code cell below displays random columns for a small number of observations.  The previous invocation of ``display_obs()`` in the [Review Training Data](#Review-Train-Data) code displayed only two columns (the token and target values), but you can specify as many columns as you want.  

The only drawback of specifying multiple specific columns is that the returned output can be excessive.  Selecting random column output (the default) allows you to spot check all columns with a manageable amount of output.

In [11]:
# Display random columns from six 
# randomly selected observations
display_obs(s, valid, num_obs=6)

------- Observation:  380 -------

_tokens_ :  [CLS] this review is mostly all spoil ##ers . if you plan on enjoying this film , don ['] t read this review . that ['] s the problem with kids tv nowadays . it ['] s all so patron ##izing and conde ##sc ##ending . ` wow , that was fun , wasn ['] t it ? ['] no it wasn ['] t . and unfortunately it seems to have per ##me ##ated into children films as well . and that is what ['] flight of the reindeer ['] is all about . admitted ##ly i haven ['] t seen ['] flight of the reindeer ['] in a few years so i might be hazy on some points , but i remember being thoroughly un ##im ##pressed with it at the time . essentially , the story follows a lecturer who is given a book for christmas . now , the lecturer is an esteem ##ed scientist on the flying habits of some animal . i think it was bull ##fr ##og ##s . anyway , through this book , mr lecturer / family man learns that reindeer can fly in exactly the same way as bull ##fr ##og ##s . apparently thi

<a id="Display-Valid-Summary"></a>
<h3> Display Validation Data Summary Statistics </h3>

Here we compute summary statistics for the validation data. We use the full validation table (instead of using a subset) because it is significantly smaller in size than the training table.

In [12]:
# Display summary statistics for the validation data.
# This is a small table, so it is not necessary to 
# subset the validation data for summary reporting.
bert_summary(s, valid)

NOTE: there are 2407 observations in the training data set.
NOTE: minimum number of tokens in an observation = 35
NOTE: maximum number of tokens in an observation = 512
NOTE: average number of tokens in an observation = 268.1312837557125
NOTE: standard deviation of the number of tokens in an observation = 138.70136390205272



Both the training and validation data look good and are ready for use with a BERT model.