## Natural Language Processing
In an era where computers, smartphones, and other electronic devices increasingly need to interact with humans, Natural Language Processing or NLP has become an indispensable technique for teaching devices how to communicate in natural languages in human-like ways. In this module we will understand what Natural Language Processing or NLP means and how we will teach devices to understand constructs of language like paragraphs, sentences and words.

In this notebook we will explore the ideas mentioned in the video section of this course. We will start with the definition of language modeling and where it can be used. Next we will explore BERT, a powerful NLP model that is used in many popular services like search, grammar correction, voice agents, and productivity software. We will learn what it takes to train these multi-million parameter models on state-of-the-art NVIDIA Graphics Processing Units (GPUs). Tensor Cores in GPUs are responsible for accelerating deep learning training, so we will take a quick look at how we can use them to reduce training time from days to hours with mixed precision training. We will take a pre-trained model and adapt it to our NLP task, question answering. Finally, we will deploy the trained BERT model as a SageMaker endpoint, enabling anyone to send a question answering request to your service.

This will be an exciting journey. By the end of it, you will have trained and deployed a state-of-the-art NLP model that can be used in personal projects such as building chatbots, grammar correction, sentiment analysis, or other language understanding tasks. But let's start with the basics: language modeling.

## Language modeling – the basics
## What is language modeling?
"Language modeling is the task of assigning a probability to sentences in a language. […] Besides assigning a probability to each sequence of words, the language models also assign a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words." Source: Page 105, Neural Network Methods in Natural Language Processing, 2017.

## Types of language models
There are primarily two types of Language Models:

Statistical Language Models: These models use traditional statistical techniques like N-grams, Hidden Markov Models (HMM), and certain linguistic rules to learn the probability distribution of words.

Neural Language Models: They use different kinds of Neural Networks to model language, and have surpassed the statistical language models in their effectiveness.

"We provide ample empirical evidence to suggest that connectionist language models are superior to standard n-gram techniques, except their high computational (training) complexity." Source: Recurrent neural network based language model, 2010.

Given the superior performance of neural language models, we include in the container two popular state-of-the-art neural language models: BERT and Transformer-XL.

## Why is language modeling important?
Language modeling is fundamental in modern NLP applications. It enables machines to understand qualitative information, and enables people to communicate with machines in the natural languages that humans use to communicate with each other.

Language modeling is used directly in a variety of industries, including tech, finance, healthcare, transportation, legal, military, government, and more -- actually, you probably have just interacted with a language model today, whether it be through Google search, engaging with a voice assistant, or using text autocomplete features.

## How does language modeling work?
The roots of modern language modeling can be traced back to 1948, when Claude Shannon published a paper titled "A Mathematical Theory of Communication", laying the foundation for information theory and language modeling. In the paper, Shannon detailed the use of a stochastic model called the Markov chain to create a statistical model for the sequences of letters in English text. The Markov models, along with n-gram, are still among the most popular statistical language models today.
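
To make the Markov idea concrete, here is a minimal sketch of a bigram (first-order Markov) language model. It approximates the probability of a sentence as $P(w_1, \dots, w_n) \approx \prod_{i} P(w_i \mid w_{i-1})$ and estimates each factor from raw counts. The toy corpus, the function names, and the `<s>` / `</s>` boundary markers are assumptions made for this illustration; they are not part of the course material.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate P(next_word | word) from raw counts (maximum likelihood)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> are hypothetical markers for the sentence boundaries
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {word: count / total for word, count in followers.items()}
    return model

def sentence_probability(model, sentence):
    """Multiply the bigram probabilities along the sentence."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        # A word pair that never occurred in training gets probability 0
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob

toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
bigram_model = train_bigram_model(toy_corpus)

seen_order = sentence_probability(bigram_model, "the cat sat on the rug")
unseen_order = sentence_probability(bigram_model, "the rug sat on the cat")
print("P(the cat sat on the rug) =", seen_order)
print("P(the rug sat on the cat) =", unseen_order)
```

The second sentence receives probability zero only because the pair "rug sat" never appeared in the toy corpus, which is a small-scale view of the sparsity problem that simple count-based models run into.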

However, simple statistical language models have serious drawbacks in scalability and fluency because of their sparse representation of language. Neural language models overcome this problem by representing language units (e.g. words, characters) as a non-linear, distributed combination of weights in continuous space, so they can learn to approximate words without being misled by rare or unknown values.
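
The difference between a sparse and a distributed representation can be sketched in a few lines. With one-hot vectors every word sits on its own axis, so no two words are ever similar. With dense vectors, related words can end up close together. The three-word vocabulary and the dense values below are hand-picked, hypothetical numbers chosen only for illustration; a real model learns them from data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vocab = ["cat", "dog", "car"]

# Sparse representation: one axis per word
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense representation: hypothetical values, not learned weights
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print("one-hot cat vs dog:", cosine(one_hot["cat"], one_hot["dog"]))
print("dense   cat vs dog:", round(cosine(dense["cat"], dense["dog"]), 3))
print("dense   cat vs car:", round(cosine(dense["cat"], dense["car"]), 3))
```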

Therefore, as mentioned above, we introduce two popular state-of-the-art neural language models, BERT and Transformer-XL, in TensorFlow and PyTorch. More details can be found in the NVIDIA Deep Learning Examples GitHub repository.

## What is BERT?
BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's implementation of BERT is an optimized version of the Hugging Face implementation, leveraging mixed precision arithmetic and Tensor Cores on Volta V100 and Ampere A100 GPUs for faster training times while maintaining target accuracy.

## What can BERT do?
With research organizations globally having conversational AI as the immediate goal in mind, BERT has made major breakthroughs in the field of NLP. In the past, basic voice interfaces like phone tree algorithms (used when you call your mobile phone company, bank, or internet provider) were transactional and had limited language understanding.

With transactional interfaces, the scope of the computer’s understanding is limited to a question at a time. This gives the computer a limited amount of required intelligence: only that related to the current action, a word or two or, further, possibly a single sentence. For more information, see What is Conversational AI?

But when people converse naturally, they refer to words and context introduced earlier in the conversation. Going beyond single sentences is where conversational AI comes in.

Here’s an example of using BERT to understand a passage and answer questions about it. This example is taken from NVIDIA's NGC site.

Passage: NGC Containers are designed to enable a software platform centered around minimal OS requirements, Docker and driver installation on the server or workstation, and provisioning of all application and SDK software in the NGC containers through the NGC container registry. NGC manages a catalog of fully integrated and optimized deep learning framework containers that take full advantage of NVIDIA GPUs in both single GPU and multi-GPU configurations.

Next we can ask our question: What configurations can NGC containers work with?

Soon you will see that the BERT model can take this paragraph and question as input and provide the answer. We should expect to see something like this:

Answer: 'single GPU and multi-GPU'

As discussed in the video section of this course, question answering is one of the GLUE benchmark metrics. A breakthrough was the development of the Stanford Question Answering Dataset (SQuAD), as it is key to robust and consistent training and to standardizing observations of learning performance. For more information, see SQuAD: 100,000+ Questions for Machine Comprehension of Text. In 2018, BERT became a popular deep learning model as it pushed the GLUE (General Language Understanding Evaluation) score to 80.5% (a 7.7 percentage point absolute improvement). For more information, see GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.

After the development of BERT at Google, it was not long before NVIDIA achieved a world record training time by using massively parallel processing to train BERT on many GPUs. With some additional work we can also measure the time required to answer a question with the BERT model and how many questions we can ask per second. For example, NVIDIA recently published similar results for BERT and a few other models. They were able to run about 17K questions on an NVIDIA A100 GPU. If you are interested in performance, you will enjoy this page with some fascinating performance numbers on NVIDIA deep learning training and inference. In the next few sections we will look at how GPUs can be used to accelerate BERT training and inference. Before going into that, let's take a quick look at the architecture of the BERT model.

The architecture and components of this model will give us a good idea on why GPUs can be used to accelerate this model.

## BERT Architecture


BERT stands for Bidirectional Encoder Representations from Transformers. In the video section, we already discussed the architecture of this model. Here is a quick recap.

BERT has three concepts embedded in its name: Bidirectional Encoder Representations from Transformers. Transformers are neural networks that learn human language using self-attention, where a segment of words is compared against itself. The model learns how a given word's meaning is derived from every other word in the segment. For example, let's look at the following statement, taken from https://jalammar.github.io/illustrated-transformer/

"The animal did not cross the road because it was too tired". In this statement, ‘it’ refers to the animal. Suppose we change the statement to "The animal did not cross the road because it was too wide". The sentence is the same except that “tired” has been replaced with “wide”. However, now it is more likely that ‘it’ refers to the road. Through the self-attention mechanism, transformers learn to derive meaning and references for each word in a sentence.

Second, bidirectional means that the neural network, which treats the words in a sentence as time-series data, is able to look at sentences from both directions. Older algorithms looked at words in a forward direction only, trying to predict the next word, which ignores the context and information provided by words occurring later in the sentence. BERT uses self-attention to look at the entire input sentence at one time. Any relationships before or after the word are accounted for.
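
The two ideas above, self-attention and bidirectionality, can be sketched together in plain Python. This is a deliberately simplified illustration: real transformer layers use learned query, key, and value projections and several attention heads, whereas here the queries and keys are the embeddings themselves. The four tokens and their 2-dimensional embeddings are hypothetical values chosen by hand.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_weights(vectors, bidirectional=True):
    """Scaled dot-product attention weights with queries = keys = vectors.

    With bidirectional=False, position i may only look at positions <= i.
    """
    dim = len(vectors[0])
    all_weights = []
    for i, query in enumerate(vectors):
        last_visible = len(vectors) if bidirectional else i + 1
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors[:last_visible]
        ]
        row = softmax(scores)
        # Positions that are not visible receive zero weight
        all_weights.append(row + [0.0] * (len(vectors) - last_visible))
    return all_weights

tokens = ["animal", "road", "it", "tired"]
# Hypothetical embeddings, not learned values
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1]]

both_directions = self_attention_weights(embeddings, bidirectional=True)
forward_only = self_attention_weights(embeddings, bidirectional=False)

it_index = tokens.index("it")
print("bidirectional weights for 'it':", [round(w, 3) for w in both_directions[it_index]])
print("forward-only weights for 'it': ", [round(w, 3) for w in forward_only[it_index]])
```

In the forward-only case the weight that "it" places on "tired" is exactly zero, because "tired" comes later in the sentence. In the bidirectional case that later word contributes to the representation of "it".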

Finally, an encoder is a component of the encoder-decoder structure. You encode the input language into latent space, and then reverse the process with a decoder trained to re-create the input in a different language. This is great for translation, as self-attention helps resolve the many differences that a language has in expressing the same ideas, such as the number of words or sentence structure.

In BERT, you just take the encoding idea to create that latent representation of the input, but then use that as a feature input into several, fully connected layers to learn a particular language task.

## How to use BERT?
There are two steps to making BERT learn to solve a problem for you. You first need to pretrain the transformer layers to be able to encode a given type of text into representations that contain the full underlying meaning. Then, you need to train the fully connected classifier structure to solve a particular problem, also known as fine-tuning.

Pretraining is a massive endeavor that can require supercomputer levels of compute time and equivalent amounts of data. The open-source datasets most often used are the articles on Wikipedia, which constitute 2.5 billion words, and BooksCorpus, which provides 11,000 free-use texts. This culminates in a dataset of about 3.3 billion words.

All that data can be fed into the network for the model to scan and extract the structure of language. At the end of this process, you should have a model that, in a sense, knows how to read. This model has a general understanding of the language, meaning of the words, context, and grammar.

To have this model customized for a particular domain, such as finance, more domain-specific data needs to be added on the pretrained model. This allows the model to understand and be more sensitive to domain-specific jargon and terms.

A word has several meanings, depending on the context. For example, a bear to a zoologist is an animal. To someone on Wall Street, it means a bad market. Adding specialized texts makes BERT customized to that domain. It’s a good idea to take the pretrained BERT offered on NGC and customize it by adding your domain-specific data.

Fine-tuning is much more approachable, requiring significantly smaller datasets on the order of tens of thousands of labelled examples. BERT can be trained to do a wide range of language tasks.

Despite the many different fine-tuning runs that you do to create specialized versions of BERT, they can all branch off the same base pretrained model. For this reason, the BERT approach is often referred to as an example of transfer learning, where model weights trained for one problem are used as a starting point for another. After fine-tuning, the BERT model has taken its ability to read and learned to solve a problem with it.

## BERT Pretraining, Fine-tuning & Inference
## Downloading Pre-Trained Model
The first thing we want to do is download a pre-trained model that we can use for fine-tuning in the next step. For the BERT model, training works in two steps: pre-training and fine-tuning. BERT is pretrained on unlabelled data. We hand it the downloaded text from all of Wikipedia and BooksCorpus, which includes 11,000 books across various genres, and hope that at the end of pre-training our model understands the structure of the English language as well as any human. While the data size is massive, the unlabelled nature is a huge benefit. The man-hours needed to generate labels for such a massive amount of data would be at best impractical and at worst impossible. So how does BERT extract understanding from the raw text?

BERT, like all neural networks, learns by minimizing a loss function for a sample task. The task BERT was designed for in the pre-training stage is twofold: Masked Language Modeling and Next Sentence Prediction. The Masked LM task takes an input segment and masks, or withholds, a random 15% of the words. The model then tries to predict each missing word and compares its prediction with the ground truth word that was removed. Over millions of learning steps, this gives the network the ability to predict words from the surrounding context. It learns not only word contexts but also grammatical structure. Next Sentence Prediction presents the model with sentence pairs. Half of the time the pair consists of two sentences that originally followed each other in the corpus. The other 50% are two random sentences, and the model is tasked with predicting whether the sentences go together or not. This provides the model with sentence-level comprehension for tasks like question answering. The key here is that the label the model needs for training, that is, to nudge all of the weights in the correct direction, is generated automatically. We present the data to the model and wait as it runs through it over and over, performing billions of matrix operations to provide the learning-to-read experience.
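
The point that labels are generated automatically is easy to see in code. The sketch below builds training examples for both tasks from raw text alone. It is a simplification under stated assumptions: every selected word is replaced by a `[MASK]` placeholder, words are split on whitespace rather than with BERT's tokenizer, and the function names and example sentences are made up for this illustration.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Hide a random fraction of tokens; return the masked list and the labels."""
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_fraction))
    positions = rng.sample(range(len(tokens)), num_to_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]   # ground truth comes from the text itself
        masked[pos] = MASK_TOKEN
    return masked, labels

def make_sentence_pair(sentences, index, rng):
    """Return (sentence_a, sentence_b, is_next) for Next Sentence Prediction.

    Assumes index is not the last sentence and that there are at least 3 sentences.
    """
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], True
    # Otherwise pick a sentence that is neither the first one nor its true successor
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return first, rng.choice(candidates), False

segment = ("the model reads a very large amount of raw text and learns "
           "to predict the words that were hidden from it").split()
masked_segment, mlm_labels = mask_tokens(segment)
print(" ".join(masked_segment))
print("labels:", mlm_labels)

corpus_sentences = [
    "BERT is pretrained on unlabelled text.",
    "The labels are created from the text itself.",
    "Fine-tuning needs far less data.",
    "GPUs speed up both stages.",
]
pair = make_sentence_pair(corpus_sentences, 0, random.Random(1))
print(pair)
```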

If we wanted to train the BERT model from scratch on an 8-GPU P3DN instance, our results would be ready after about 11 days. Luckily for us, NVIDIA has done the hard work and made a trained checkpoint available for download from NVIDIA GPU Cloud (NGC). Let's download and unzip the BERT model.

In [2]:
!wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/bert_pyt_ckpt_base_pretraining_amp_lamb/versions/19.09.0/zip -O bert_pyt_ckpt_base_pretraining_amp_lamb_19.09.0.zip

'wget' is not recognized as an internal or external command,
operable program or batch file.


In [3]:
# Extract the BERT model
!unzip -u bert_pyt_ckpt_base_pretraining_amp_lamb_19.09.0.zip

'unzip' is not recognized as an internal or external command,
operable program or batch file.


In [4]:
# Update SageMaker SDK 
!pip install -U sagemaker==2.60.0
import IPython
IPython.Application.instance().kernel.do_shutdown(True)  # automatically restarts the kernel after installation

Collecting sagemaker==2.60.0
  Downloading sagemaker-2.60.0.tar.gz (444 kB)
     -------------------------------------- 444.4/444.4 kB 4.0 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting boto3>=1.16.32
  Downloading boto3-1.28.28-py3-none-any.whl (135 kB)
     ---------------------------------------- 135.8/135.8 kB ? eta 0:00:00
Collecting google-pasta
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting protobuf>=3.1
  Downloading protobuf-4.24.0-cp38-cp38-win_amd64.whl (430 kB)
     ------------------------------------- 430.6/430.6 kB 13.6 MB/s eta 0:00:00
Collecting protobuf3-to-dict>=0.1.5
  Downloading protobuf3-to-dict-0.1.5.tar.gz (3.5 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting smdebug_rulesconfig==1.0.1
  Downloading smdebug_rulesconfig-1.0.1-py2.py3-none-any.whl (20 kB)
Collecting pathos
  Downloading patho


[notice] A new release of pip available: 22.3 -> 23.2.1
[notice] To update, run: python.exe -m pip install --upgrade pip


{'status': 'ok', 'restart': True}

In [5]:
# importing libraries required for this course

import collections
import math
import random
import torch
import os, tarfile, json
import time, datetime
from io import StringIO
import numpy as np
import boto3
import sagemaker
from sagemaker.pytorch import estimator, PyTorchModel, PyTorchPredictor, PyTorch
from sagemaker.utils import name_from_base
from model_utils.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
from model_utils.modeling import BertForQuestionAnswering, BertConfig, WEIGHTS_NAME, CONFIG_NAME
from model_utils.tokenization import (BasicTokenizer, BertTokenizer, whitespace_tokenize)
from types import SimpleNamespace
from helper_funcs import *
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
runtime_client = boto3.client('runtime.sagemaker')

bucket = sagemaker_session.default_bucket()
prefix = 'NLP-course'

!pip install nvidia-pyindex
!pip install nvidia-dllogger

ModuleNotFoundError: No module named 'torch'

## Downloading Dataset for Fine-Tuning
For the BERT model, training works in two steps: pretraining and fine-tuning. We will talk about these steps in a bit more detail in the next section. As we mentioned earlier, BERT is pretrained on unlabeled data from all of Wikipedia and BooksCorpus, which includes 11,000 books across various genres. This data set is publicly available, but due to its large size, it is impractical to download and run training from scratch. Instead, for this class, we will work on fine-tuning the BERT model with the SQuAD dataset.

SQuAD stands for Stanford Question Answering Dataset. This dataset is a collection of passages, each paired with questions. The answer to each question is in the passage, and our task is to predict it accurately. With the command shown in this cell, we can download the publicly available SQuAD dataset.

In [1]:
# Downloading the SQUAD dataset for fine-tuning
!cd data/squad/ && bash squad_download.sh

The system cannot find the path specified.


## Reviewing SQUAD Dataset
Now let's take a quick look at an example in this dataset. In this cell block, we pick a passage, a question, and its answer from the dataset. This passage is about Southern California. After the passage, the question is: What is Southern California often abbreviated as? The answer from within the passage is SoCal.

Our goal is to train the BERT model to find the answer to a given question within the accompanying passage from this dataset.
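
Before loading the real file, it may help to see the nesting that the cells below index into. Here is a minimal sketch of one SQuAD-style record and a helper that walks it. The shortened passage, the `id` value, and the helper name `iter_examples` are hypothetical and were written for this illustration; the field names follow the SQuAD v2.0 JSON layout.

```python
toy_context = "Southern California, often abbreviated SoCal, is a region of California."
toy_answer = "SoCal"

squad_like = {
    "version": "v2.0",
    "data": [
        {
            "title": "Southern_California",
            "paragraphs": [
                {
                    "context": toy_context,
                    "qas": [
                        {
                            "id": "hypothetical-id-0",
                            "question": "What is Southern California often abbreviated as?",
                            "is_impossible": False,
                            # answer_start is the character offset of the answer in the context
                            "answers": [
                                {"text": toy_answer, "answer_start": toy_context.find(toy_answer)}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

def iter_examples(dataset):
    """Yield (title, context, question, answers) for every question in the dataset."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield article["title"], paragraph["context"], qa["question"], qa["answers"]

for title, passage, toy_question, answers in iter_examples(squad_like):
    start = answers[0]["answer_start"]
    print(title, "|", toy_question, "->", passage[start:start + len(answers[0]["text"])])
```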

In [2]:
# load the v2.0 dev set
with open('data/squad/v2.0/dev-v2.0.json', 'r') as f:
    squad_data = json.load(f)

FileNotFoundError: [Errno 2] No such file or directory: 'data/squad/v2.0/dev-v2.0.json'

In [3]:
# Looking at specific instances of the SQUAD dataset
#ind = random.randint(0,34) #for a random paragraph and question, set this
ind = 2
sq = squad_data['data'][ind]
print('Paragraph title: ',sq['title'], '\n')
print(sq['paragraphs'][0]['context'],'\n')
print('Question:', sq['paragraphs'][0]['qas'][0]['question'])
print('Answer:', sq['paragraphs'][0]['qas'][0]['answers'][0]['text'])

NameError: name 'squad_data' is not defined

## View BERT input
BERT needs us to transform our text data into a numeric representation known as tokens. There are a variety of tokenizers available. We are going to use a tokenizer specially designed for BERT, which we will instantiate with our vocabulary file. Let's take a look at the transformed question and context that we will be supplying to BERT for inference.
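
To give a feel for what such a tokenizer does, here is a minimal sketch of greedy longest-match-first sub-word splitting in the style of WordPiece. It is not the `BertTokenizer` from `model_utils`; the tiny vocabulary and the function name are hypothetical, and the real tokenizer also handles casing, punctuation, and a much larger vocabulary file.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one lowercase word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]                 # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary mapping each piece to its integer id
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "what": 3, "is": 4,
             "token": 5, "##ization": 6, "##s": 7}

for word in ["tokenization", "tokens", "bert"]:
    pieces = wordpiece_tokenize(word, toy_vocab)
    print(word, "->", pieces, "->", [toy_vocab[p] for p in pieces])
```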

In [4]:
# Seeing the input words and the associated tokens from the vocabulary
doc_tokens = sq['paragraphs'][0]['context'].split()
tokenizer = BertTokenizer('vocab', do_lower_case=True, max_len=512)
query_tokens = tokenizer.tokenize(sq['paragraphs'][0]['qas'][0]['question'])

feature = preprocess_tokenized_text(doc_tokens, 
                                    query_tokens, 
                                    tokenizer, 
                                    max_seq_length=384, 
                                    max_query_length=64)

tensors_for_inference, tokens_for_postprocessing = feature
print(vars(tokens_for_postprocessing)["tokens"][0:9])
print(vars(tensors_for_inference)["input_ids"][0:9])

NameError: name 'sq' is not defined
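
The cell above relies on `preprocess_tokenized_text` from `helper_funcs`, whose internals are not shown in this notebook. As a rough sketch of the usual BERT input layout, and only under that assumption, the question and passage are packed into one sequence with `[CLS]` and `[SEP]` markers, segment ids that tell the two parts apart, and a mask that separates real tokens from padding. The function name `pack_inputs` is hypothetical.

```python
def pack_inputs(query_tokens, doc_tokens, max_seq_length):
    """Build tokens, segment ids, and input mask for one question/passage pair."""
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    if len(tokens) > max_seq_length:
        # A real pipeline would split the passage into several windows instead
        raise ValueError("input longer than max_seq_length")
    # Segment 0 covers [CLS], the question, and the first [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    input_mask = [1] * len(tokens)             # 1 marks a real token
    padding = max_seq_length - len(tokens)
    tokens = tokens + ["[PAD]"] * padding
    segment_ids = segment_ids + [0] * padding
    input_mask = input_mask + [0] * padding    # 0 marks padding
    return tokens, segment_ids, input_mask

packed_tokens, packed_segments, packed_mask = pack_inputs(
    ["what", "is", "socal", "?"], ["southern", "california"], max_seq_length=12
)
print(packed_tokens)
print(packed_segments)
print(packed_mask)
```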

## Finetuning with SQUAD Dataset
We have downloaded the BERT Model from NGC and also know about Mixed Precision for speeding-up training and fine-tuning. Now, let us fine-tune the BERT model to do well on the Question & Answering task.

We will run fine-tuning using run_squad.py. Let's look at the key inputs to this Python file.

output_dir - Specify the output directory for the results to be written to.
init_checkpoint - Specify the BERT model to be used as the starting model.
num_train_epochs - Each epoch typically consists of one round of training on the data. Specify the number of training rounds.
vocab_file - Specify the vocabulary used for tokenizing BERT input.
config_file - Specify the configuration details of the BERT model.
train_file - Specify the file with data for training the model.
predict_file - Specify the file that is used to test for unbiased prediction.
do_train - Specify if you want to train/fine-tune the model.
train_batch_size - Batch size for the input.
max_seq_length - The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded.
doc_stride - When splitting up a long document into chunks, how much stride to take between chunks.
seed - Random seed for the initialization.
fp16 - Use 16-bit precision.

You can get more details of various other options by running python run_squad.py --help
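
The interaction between a window size and `doc_stride` is easier to see with a small example. The sketch below splits a long passage into overlapping windows so that every token appears in at least one of them. It illustrates the general sliding-window idea only; the function name and the toy numbers are assumptions, and the exact splitting logic inside run_squad.py is not shown in this notebook.

```python
def split_into_windows(doc_tokens, max_tokens_per_window, doc_stride):
    """Return (start_offset, tokens) windows that together cover the whole document."""
    windows = []
    start = 0
    while start < len(doc_tokens):
        length = min(max_tokens_per_window, len(doc_tokens) - start)
        windows.append((start, doc_tokens[start:start + length]))
        if start + length == len(doc_tokens):
            break                          # the last window reached the end
        start += min(length, doc_stride)   # move forward by the stride
    return windows

toy_doc = [f"tok{i}" for i in range(10)]
toy_windows = split_into_windows(toy_doc, max_tokens_per_window=4, doc_stride=2)
for offset, window in toy_windows:
    print(offset, window)
```

A smaller stride gives more overlap between neighbouring windows, which makes it less likely that an answer is cut in half at a window boundary, at the cost of more windows to process.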

Now that we have learned about the various options for fine-tuning the BERT model, let's start fine-tuning. As the script runs, verbose logs report the training epoch, iteration, current loss, and learning rate. With the --fp16 flag, each epoch should take around 45 minutes on an AWS EC2 g4dn.xlarge instance, which contains an NVIDIA T4 GPU. If you run the fine-tuning with FP32, it might take 3-5 hours on the same GPU.
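
Using 16-bit numbers is faster, but the format can represent far fewer values than FP32. A technique commonly paired with mixed precision training is loss scaling: small values are multiplied by a constant before being stored in FP16 and divided by it again afterwards. The sketch below shows the effect on a single number using only the standard library. The gradient value and the scale factor are hypothetical, and whether run_squad.py applies exactly this scheme is an assumption, not something stated in this notebook.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

tiny_gradient = 1e-8   # hypothetical gradient, too small for FP16
loss_scale = 1024.0    # hypothetical scale factor

unscaled = to_fp16(tiny_gradient)              # stored directly in FP16
scaled = to_fp16(tiny_gradient * loss_scale)   # scaled up before storing
recovered = scaled / loss_scale                # scaled back down in higher precision

print("stored directly:    ", unscaled)
print("stored with scaling:", scaled)
print("recovered value:    ", recovered)
```

Stored directly, the value underflows to zero and its contribution to the weight update is lost. With scaling, it survives the trip through FP16 with only a small rounding error.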

In [5]:
# Fine tuning BERT with SQUAD dataset
# If this script fails due to out of memory errors, then reduce the "train_batch_size" value to 4 or 2
!python run_squad.py --bert_model bert_base_uncased \
            --output_dir './output' \
            --init_checkpoint bert_base.pt \
            --num_train_epochs 1 \
            --vocab_file './vocab' \
            --config_file  './bert_config.json' \
            --train_file './data/squad/v1.1/train-v1.1.json' \
            --predict_file './data/squad/v1.1/dev-v1.1.json' \
            --do_train \
            --train_batch_size 8 \
            --max_seq_length 512 \
            --doc_stride 128 \
            --seed 1 \
            --fp16

python: can't open file 'run_squad.py': [Errno 2] No such file or directory


## Loading the fine-tuned model
Now our model is trained and stored in the folder named output. Let us load the trained model.

In [7]:
#Loading the BERT model trained with Question-Answering task

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# specify the vocabulary file
vocab_file='./vocab'

# set variables that limit the maximum length of the context, query, and answer
max_seq_length, max_query_length, n_best_size, max_answer_length, null_score_diff_threshold = 384, 64, 1, 30, -11.0
do_lower_case, can_give_negative_answer = True, True

# initialize our tokenizer
tokenizer = BertTokenizer(vocab_file, do_lower_case=True, max_len=512)

# load a model configuration
config = BertConfig.from_json_file('bert_config.json')

# set up our model architecture
model = BertForQuestionAnswering(config)

# load our weights
model.load_state_dict(torch.load('./output/model.pth', map_location='cpu')["model"])

# send our model to our device
model.to(device)
    
# set our model to evaluation mode for inference
model.eval()

NameError: name 'torch' is not defined

## Testing the fine-tuned model
Our trained model is now loaded. Let us test it on our local instance. In the next couple of cells, we are going to provide it with a context paragraph and a question. Feel free to edit and change the context paragraph and the question.

After that we will pre-process the input data and feed it to our trained model.

In [8]:
# Testing the fine-tuned model.
# Provide your custom context and question. 

context = """
NGC Containers are designed to enable a software platform centered around minimal OS requirements, 
Docker and driver installation on the server or workstation, and provisioning of all application and SDK software 
in the NGC containers through the NGC container registry. NGC manages a catalog of fully integrated and optimized 
deep learning framework containers that take full advantage of NVIDIA GPUs in both single GPU and 
multi-GPU configurations. 
"""
question = "What configurations can NGC containers work with?"

In [9]:
# Pre-processing the input and feeding it into the BERT model

# specify how many answers to return, here we are going to take the top answer only.
n_best_size=1

# preprocessing
# split the context into tokens
doc_tokens = context.split()
# tokenize our query 
query_tokens = tokenizer.tokenize(question)
# generate features to feed to the model
feature = preprocess_tokenized_text(doc_tokens, 
                                    query_tokens, 
                                    tokenizer, 
                                    max_seq_length=max_seq_length, 
                                    max_query_length=max_query_length)
tensors_for_inference, tokens_for_postprocessing = feature

input_ids = torch.tensor(tensors_for_inference.input_ids, dtype=torch.long).unsqueeze(0)
segment_ids = torch.tensor(tensors_for_inference.segment_ids, dtype=torch.long).unsqueeze(0)
input_mask = torch.tensor(tensors_for_inference.input_mask, dtype=torch.long).unsqueeze(0)

# load tensors to device
input_ids = input_ids.to(device)
input_mask = input_mask.to(device)
segment_ids = segment_ids.to(device)

# run inference
with torch.no_grad():
    start_logits, end_logits = model(input_ids, segment_ids, input_mask)

# post-processing
start_logits = start_logits[0].detach().cpu().tolist()
end_logits = end_logits[0].detach().cpu().tolist()
# convert logits back to English
answer = get_predictions(doc_tokens, tokens_for_postprocessing, 
                         start_logits, end_logits, n_best_size, 
                         max_answer_length, do_lower_case, 
                         can_give_negative_answer, 
                         null_score_diff_threshold)

# print result
print(f'{question} : {answer[0]["text"]}')


NameError: name 'tokenizer' is not defined
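
The step labelled "convert logits back to English" is handled by `get_predictions` from `helper_funcs`, whose internals are not shown here. A common way to do it, sketched below under that caveat, is to score every candidate span as the sum of its start logit and end logit and keep the best one that does not exceed the maximum answer length. The tokens, the logit values, and the function name are hypothetical.

```python
def best_answer_span(start_logits, end_logits, max_answer_length):
    """Return (start, end, score) of the highest-scoring valid span."""
    best = (None, None, float("-inf"))
    for start, start_score in enumerate(start_logits):
        # The end must not come before the start, and the span must be short enough
        last_end = min(start + max_answer_length, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best

toy_tokens = ["both", "single", "GPU", "and", "multi-GPU", "configurations"]
toy_start_logits = [0.1, 4.0, 0.3, 0.0, 1.5, 0.2]   # hypothetical model outputs
toy_end_logits = [0.0, 0.5, 1.0, 0.1, 3.5, 2.0]     # hypothetical model outputs

span_start, span_end, span_score = best_answer_span(
    toy_start_logits, toy_end_logits, max_answer_length=30
)
print(" ".join(toy_tokens[span_start:span_end + 1]), "| score:", span_score)
```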

## Prepare to deploy as a SageMaker Endpoint.
Now that you've had a chance to play with the model locally, let's deploy it to an endpoint! To deploy BERT to a SageMaker endpoint, we need to save the model as a tarball. Once we have saved our model, we upload it to our S3 bucket, where our Docker container can access it. We use transform_script.py to define how we load our model, handle our input data, perform inference, and pass our results back to the requester.

SageMaker has predefined functions for all of these operations aside from importing the model. However, in our specific case we are passing in multiple arrays as input (our question and our provided context). This means we need to specify custom functions for handling our input data and making predictions. These functions are named input_fn and predict_fn inside transform_script.py. To learn more about how to deploy PyTorch models in SageMaker, see https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#deploy-pytorch-models
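
To show the shape of these handlers, here is a minimal sketch of how `input_fn`, `predict_fn`, and `output_fn` fit together. It is not the contents of transform_script.py: the bodies are hypothetical, only the JSON content type is handled, and a stand-in function plays the role of the loaded model so that the example runs without a GPU or an endpoint.

```python
import json

def input_fn(request_body, request_content_type):
    """Deserialize the request into the objects the model needs."""
    if request_content_type == "application/json":
        payload = json.loads(request_body)
        return payload["context"], payload["question"]
    raise ValueError(f"Unsupported content type: {request_content_type}")

def predict_fn(input_data, model):
    """Run the model on the deserialized input."""
    context, question = input_data
    return model(context, question)

def output_fn(prediction, accept):
    """Serialize the prediction before it is sent back to the requester."""
    return json.dumps(prediction)

def dummy_model(context, question):
    # Hypothetical stand-in: one fake start logit and end logit per whitespace token
    num_tokens = len(context.split())
    return [[0.0] * num_tokens, [0.0] * num_tokens]

request = json.dumps({"context": "NGC hosts containers", "question": "What does NGC host?"})
parsed = input_fn(request, "application/json")
reply = output_fn(predict_fn(parsed, dummy_model), "application/json")
print(reply)
```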

In [10]:
# Save the model as a tarball
with tarfile.open('bert.tar.gz', 'w:gz') as f:
    f.add('./output/model.pth')

# Upload the model data to S3 (S3 keys always use forward slashes)
model_data = sagemaker_session.upload_data(path='bert.tar.gz',
                                           bucket=bucket,
                                           key_prefix=f'{prefix}/model')

# Define the SageMaker model; transform_script.py holds the inference handlers
torch_model = PyTorchModel(model_data=model_data,
                           role=role,
                           entry_point='transform_script.py',
                           framework_version='1.5.0',
                           py_version='py3')

NameError: name 'tarfile' is not defined

## Deploy the model
Now that we have defined our model, we can deploy it to an endpoint. We need to give our endpoint a name and choose how many instances to run and which instance type to use. Here we are deploying this model to a g4dn instance that uses an NVIDIA T4 GPU for inference.

In [11]:
# Deploy endpoint, this part may take a bit
# Build a unique endpoint name from a locale-independent timestamp
endpoint_name = f'bert-endpoint-{datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")}'
bert_end = torch_model.deploy(instance_type='ml.g4dn.xlarge', initial_instance_count=1, 
                              endpoint_name=endpoint_name)

NameError: name 'datetime' is not defined

## Get Predictions
For question answering, we pass in a context statement for the model to read and then we ask it a question. In this first case we are doing the pre-processing locally and then sending the prepped data to the model as an array:

In [12]:
%%time

# Collect the context and question, pre-process the input strings and feed it to the deployed BERT model

n_best_size=3
doc_tokens = context.split()
query_tokens = tokenizer.tokenize(question)
feature = preprocess_tokenized_text(doc_tokens, 
                                    query_tokens, 
                                    tokenizer, 
                                    max_seq_length=max_seq_length, 
                                    max_query_length=max_query_length)
tensors_for_inference, tokens_for_postprocessing = feature

input_ids = np.array(tensors_for_inference.input_ids, dtype=np.int64)
segment_ids = np.array(tensors_for_inference.segment_ids, dtype=np.int64)
input_mask = np.array(tensors_for_inference.input_mask, dtype=np.int64)   

payload = np.concatenate([np.expand_dims(input_ids, axis=0), np.expand_dims(segment_ids, axis=0), np.expand_dims(input_mask, axis=0)])
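try:
    response = bert_end.predict(payload.tobytes(), initial_args={'ContentType': 'application/x-npy'})
except Exception:
    # Fall back to calling the endpoint through the low-level runtime client
    print('using invoke_endpoint directly')
    response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                              ContentType='application/x-npy',
                                              Body=payload.tobytes())
    # Parse the returned literal safely instead of calling eval() on remote data
    import ast
    response = ast.literal_eval(response['Body'].read().decode('utf-8'))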
answer = get_predictions(doc_tokens, tokens_for_postprocessing, 
                         response[0], response[1], n_best_size, 
                         max_answer_length, do_lower_case, 
                         can_give_negative_answer, 
                         null_score_diff_threshold)

# print result
print(f'{question} : {answer[0]["text"]}')
#print(f'inference took: {round(time.time()-t,4)} seconds')

NameError: name 'tokenizer' is not defined

In [13]:
%%time
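# Second case: send the raw context and question as JSON and let the endpoint do the pre-processing
pass_in_data = {'context': context, 'question': question}
response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                          ContentType='application/json',
                                          Body=json.dumps(pass_in_data))
# Parse the returned literal safely instead of calling eval() on remote data
import ast
response = ast.literal_eval(response['Body'].read().decode('utf-8'))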
answer = get_predictions(doc_tokens, tokens_for_postprocessing, 
                         response[0], response[1], n_best_size, 
                         max_answer_length, do_lower_case, 
                         can_give_negative_answer, 
                         null_score_diff_threshold)
#print result
print(f'{question} : {answer[0]["text"]}')

NameError: name 'runtime_client' is not defined

## Clean-up endpoint
Ensure you delete the endpoint if you are not using it.

In [15]:
#Delete endpoint
bert_end.delete_endpoint()

NameError: name 'bert_end' is not defined

## Clean-up rest of the files we created if needed

In [16]:
# Remove the model tarball that we created for deployment
!rm bert.tar.gz

'rm' is not recognized as an internal or external command,
operable program or batch file.


In [18]:
# Remove the zip file
!rm bert_pyt_ckpt_base_pretraining_amp_lamb_19.09.0.zip

'rm' is not recognized as an internal or external command,
operable program or batch file.


In [19]:
# Remove the SQUAD trained BERT Model
!rm -r ./output/*.*

'rm' is not recognized as an internal or external command,
operable program or batch file.


In [None]:
# Remove SQUAD files
!rm -r ./data/squad/v1.1/*.*
!rm -r ./data/squad/v2.0/*.*