# Deploying BERT with Pytorch and NGC
Ever wondered how to efficiently deploy massive transformer networks in the cloud? It is surprisingly easy using AWS and Nvidia GPU Cloud (NGC) containers. For this tutorial, we are going to download a BERT base question answering model trained on the Stanford Question Answering Dataset and walk through the steps necessary to deploy it to a Sagemaker endpoint.

Our first step will be to download the pretrained question answering model from NGC.

In [1]:
!wget https://api.ngc.nvidia.com/v2/models/nvidia/bert_base_pyt_amp_ckpt_squad_qa1_1/versions/1/files/bert_base_qa.pt

--2020-05-09 14:35:51--  https://api.ngc.nvidia.com/v2/models/nvidia/bert_base_pyt_amp_ckpt_squad_qa1_1/versions/1/files/bert_base_qa.pt
Resolving api.ngc.nvidia.com (api.ngc.nvidia.com)... 54.186.237.130, 52.38.124.212
Connecting to api.ngc.nvidia.com (api.ngc.nvidia.com)|54.186.237.130|:443... connected.
HTTP request sent, awaiting response... 302 
Location: https://s3.us-west-2.amazonaws.com/prod-model-registry-ngc-bucket/org/nvidia/models/bert_base_pyt_amp_ckpt_squad_qa1_1/versions/1/files/bert_base_qa.pt?response-content-disposition=attachment%3B%20filename%3D%22bert_base_qa.pt%22&response-content-type=application%2Foctet-stream&X-Amz-Security-Token=IQoJb3JpZ2luX2VjELb%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJHMEUCID6w4sBlSpne0AXrnrTtN8YQQZJgAXIckSbRIVie6v7kAiEA1OGaAQzdBNWSullHGpOfVmOWrH4lhaKNMSTL0Jg2KFgqvQMI7%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARACGgw3ODkzNjMxMzUwMjciDMLEJ8LUVCRaCS4R%2ByqRA9VXFIcUPOSWAxN4xhWEAwOG68ukbXeT0Yiw7%2B522bC5IOg1KWybA5r3%2BFJhe43c%2BGGus3IIIkDlM73oB%2

In [2]:
import collections
import math
import torch
import os, tarfile, json
import time, datetime
from io import StringIO
import numpy as np
import sagemaker
from sagemaker.pytorch import estimator, PyTorchModel, PyTorchPredictor
from sagemaker.utils import name_from_base
import boto3
from modeling import BertForQuestionAnswering, BertConfig, WEIGHTS_NAME, CONFIG_NAME
from tokenization import (BasicTokenizer, BertTokenizer, whitespace_tokenize)
from types import SimpleNamespace
from helper_funcs import *
from file_utils import PYTORCH_PRETRAINED_BERT_CACHE

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket() # can replace with your own S3 bucket
prefix = 'bert_pytorch_ngc'
runtime_client = boto3.client('runtime.sagemaker')

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


## Run the model locally
Before deploying everything to an endpoint, let's run through how the model works and run inference locally.
We are first going to set some variables. 

In [3]:
# specify the vocabulary file
vocab_file='vocab'

# set variables that limit the maximum length of the context, query, and answer
max_seq_length, max_query_length, n_best_size, max_answer_length, null_score_diff_threshold = 384, 64, 1, 30, -11.0
do_lower_case, can_give_negative_answer = True, True

# initialize our tokenizer
tokenizer = BertTokenizer(vocab_file, do_lower_case=True, max_len=512)


Let's initialize the model architecture and load the weights.

In [21]:
# load a model configuration
config = BertConfig.from_json_file('bert_config.json')

# set up our model architecture
model = BertForQuestionAnswering(config)

# load our weights
model.load_state_dict(torch.load('bert_base_qa.pt', map_location='cpu')["model"])

# send out model to our device
model.to(device)
    
# set our model to evaluation mode for inference
model.eval()

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30528, 768)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (softmax): Softmax(dim=-1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.

Let's test out inference! For question answering, we need to supply a context statement that our model can "study" in order to answer the question. We then ask it a question about the context statement. Feel free to change the context statement and question!

In [25]:
# specify how many answers to return, here we are going to take the top answer only.
n_best_size=1
context='Danielle is a girl who really loves her cat, Steve. Steve is a large cat with a very furry belly. He gets very excited by the prospect of eating chicken covered in gravy.'
question='who loves Steve?'  # 'What kind of food does Steve like?'

# preprocessing
# split the context into tokens
doc_tokens = context.split()
# tokenize our query 
query_tokens = tokenizer.tokenize(question)
# generate features to feed to the model
feature = preprocess_tokenized_text(doc_tokens, 
                                    query_tokens, 
                                    tokenizer, 
                                    max_seq_length=max_seq_length, 
                                    max_query_length=max_query_length)
tensors_for_inference, tokens_for_postprocessing = feature

input_ids = torch.tensor(tensors_for_inference.input_ids, dtype=torch.long).unsqueeze(0)
segment_ids = torch.tensor(tensors_for_inference.segment_ids, dtype=torch.long).unsqueeze(0)
input_mask = torch.tensor(tensors_for_inference.input_mask, dtype=torch.long).unsqueeze(0)

# load tensors to device
input_ids = input_ids.to(device)
input_mask = input_mask.to(device)
segment_ids = segment_ids.to(device)

# run inference
with torch.no_grad():
    start_logits, end_logits = model(input_ids, segment_ids, input_mask)

# post-processing
start_logits = start_logits[0].detach().cpu().tolist()
end_logits = end_logits[0].detach().cpu().tolist()
# convert logits back to English
answer = get_predictions(doc_tokens, tokens_for_postprocessing, 
                         start_logits, end_logits, n_best_size, 
                         max_answer_length, do_lower_case, 
                         can_give_negative_answer, 
                         null_score_diff_threshold)

# print result
print(f'{question} : {answer[0]["text"]}')


who loves Steve? : Danielle


## Prepare the pretrained model
Now that you've gotten a chance to play with the model locally, let's deploy it to an endpoint! In order to deploy BERT to a sagemaker endpoint, we need to save the model as a tarball. 

In [24]:
# save the model as a tarball
with tarfile.open('bert.tar.gz', 'w:gz') as f:
    f.add('bert_base_qa.pt')

## Instantiate the model
Once we have saved our model we then upload to our S3 bucket where our Docker container can access it. We use transform_script.py to define how we load our model, handle our input data, perform inference, and pass our results back to the requester. 

Sagemaker has predefined functions for all of these operations aside from importing the model, however, for our specific case we are passing in multiple arrays as input (our question and our provided context). This means we need to specify custom functions for our input data and making predictions. These functions are named input_fn and predict_fn inside of transform_script.py. To learn more about how to deploy PyTorch models in sagemaker see the following documentation:

https://sagemaker.readthedocs.io/en/stable/using_pytorch.html#deploy-pytorch-models

In [104]:
# upload model data to S3
model_data = sagemaker_session.upload_data(path='bert.tar.gz',
                                           bucket=bucket,
                                           key_prefix =os.path.join(prefix, 'model'))

# instantiate model
torch_model = PyTorchModel(model_data=model_data,
                           role=role,
                          entry_point='transform_script.py',
                          framework_version='1.4.0')

## Deploy the model
Now that we have defined our model we can deploy it to an endpoint. We will need to give our endpoint a name, determine how many instances we want to run our endpoint, and the instance types. Here we are deploying this model to a g4dn instance that utilizes a Nvidia T4 card for inference.

In [105]:
# deploy endpoint, this part may take a bit
endpoint_name = f'bert-endpoint-{datetime.datetime.fromtimestamp(time.time()).strftime("%c").replace(" ","-").replace(":","-")}'
bert_end = torch_model.deploy(instance_type='ml.g4dn.4xlarge', initial_instance_count=1, 
                              endpoint_name=endpoint_name)

---------------!

## Get Predictions
For question answering, we pass in a context statement for the model to read and then we ask it a question. In this first case we are doing the pre-processing locally and then sending the prepped data to the model as an array:

In [116]:
%time
n_best_size=3
context='Danielle is a girl who really loves her cat, Steve. Steve is a large cat with a very furry belly. He gets very excited by the prospect of eating chicken covered in gravy.'
question='who loves Steve?'  # 'What kind of food does Steve like?'
doc_tokens = context.split()
query_tokens = tokenizer.tokenize(question)
feature = preprocess_tokenized_text(doc_tokens, 
                                    query_tokens, 
                                    tokenizer, 
                                    max_seq_length=max_seq_length, 
                                    max_query_length=max_query_length)
tensors_for_inference, tokens_for_postprocessing = feature

input_ids = np.array(tensors_for_inference.input_ids, dtype=np.int64)
segment_ids = np.array(tensors_for_inference.segment_ids, dtype=np.int64)
input_mask = np.array(tensors_for_inference.input_mask, dtype=np.int64)   

payload = np.concatenate([np.expand_dims(input_ids, axis=0), np.expand_dims(segment_ids, axis=0), np.expand_dims(input_mask, axis=0)])
try:
    response = bert_end.predict(payload.tobytes(), initial_args={'ContentType':'application/x-npy'}) 
except:
    print('using invoke_endpoint directly')
    response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                           ContentType='application/x-npy',
                                           Body=payload.tobytes())
    response = eval(response['Body'].read().decode('utf-8'))
answer = get_predictions(doc_tokens, tokens_for_postprocessing, 
                         response[0], response[1], n_best_size, 
                         max_answer_length, do_lower_case, 
                         can_give_negative_answer, 
                         null_score_diff_threshold)

# print result
print(f'{question} : {answer[0]["text"]}')
#print(f'inference took: {round(time.time()-t,4)} seconds')

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.87 µs
who loves Steve? : Danielle


Now let's have our endpoint handle the preprocessing and just pass it raw text:

In [112]:
%time
pass_in_data = {'context':context, 'question':question}
response = runtime_client.invoke_endpoint(EndpointName=bert_end.endpoint,
                                       ContentType='application/json',
                                       Body=json.dumps(pass_in_data))
response = eval(response['Body'].read().decode('utf-8'))
answer = get_predictions(doc_tokens, tokens_for_postprocessing, 
                         response[0], response[1], n_best_size, 
                         max_answer_length, do_lower_case, 
                         can_give_negative_answer, 
                         null_score_diff_threshold)
#print result
print(f'{question} : {answer[0]["text"]}')


CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.39 µs
who loves Steve? : Danielle


## Clean up endpoint


In [103]:
!rm bert_base_qa.pt
!rm bert.tar.gz
bert_end.delete_endpoint()

rm: cannot remove ‘bert_base_qa.pt’: No such file or directory
rm: cannot remove ‘bert.tar.gz’: No such file or directory
