<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 4.0 Using the Model

In this notebook, you'll "use" the server and see real-time inference in action with a question-answering NLP task.

**[4.1 The API Basics](#4.1-The-API-Basics)**<br>
**[4.2 Inference API Overview](#4.2-Inference-API-Overview)**<br>
**[4.3 Preparing the Request](#4.3-Preparing-the-Request)**<br>
**[4.4 Querying the Server](#4.4-Querying-the-Server)**<br>
**[4.5 Post-Processing the Response](#4.5-Post-Processing-the-Response)**<br>

Triton Inference Server exposes its services through HTTP and gRPC endpoints. As a consequence, you can query it with a wide range of tools (for example, [gRPC](https://grpc.io/docs/languages/) can be used with Java, C++, C#, Python, PHP, Ruby, and more). Triton does not implement its own serving standard; instead, it exposes its services using the [KFServing Predict protocol version 2](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2). This ensures compatibility with a range of current and future tools that implement the same protocol.

For further simplification of development, Triton exposes the server protocols through a number of APIs:
- [Python API](https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/python_api.html?highlight=grpc)
- [C++ API](https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/cpp_api/cpp_api_root.html)
- [Protobuf API](https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/protobuf_api/protobuf_api_root.html)

In this example, you will learn how to consume our question-answering service using the Python API.

# 4.1 The API Basics

Let us start by reviewing the basic components of the API. The key element is the <code>tritonhttpclient</code> module, which can be obtained either by downloading the NGC container with the <a href="https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver">Triton client utilities</a>, or by downloading it directly from the <a href="https://github.com/NVIDIA/triton-inference-server/releases/tag/v2.0.0">Triton GitHub Repository</a>. Once the client libraries are installed (they are already installed in this class), import the required modules:

In [None]:
import os
import json
import argparse
import numpy as np
import tritonhttpclient

The first step is to initialize the client by pointing it towards our server:

In [None]:
try:
    triton_client = tritonhttpclient.InferenceServerClient(url="triton:8000", verbose=True)
except Exception as e:
    print("channel creation failed: " + str(e))

Next, inspect the status of the server and the availability of our model:

In [None]:
modelName = "bertQA-torchscript"
print(triton_client.is_server_live())
print(triton_client.is_server_ready())
print(triton_client.is_model_ready(modelName,"1"))
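The client methods above are thin wrappers over plain HTTP endpoints defined by the version 2 predict protocol. As a rough sketch, the corresponding request paths can be assembled as shown here. The helper name `health_urls` is hypothetical and is not part of the Triton client; the host `triton:8000` is the one used in this class.

```python
def health_urls(base_url, model_name, model_version=""):
    """Return v2 protocol health-check URLs (hypothetical helper, not a Triton API)."""
    model_path = f"v2/models/{model_name}"
    if model_version:
        # The version segment is optional in the protocol.
        model_path += f"/versions/{model_version}"
    return {
        "server_live": f"http://{base_url}/v2/health/live",
        "server_ready": f"http://{base_url}/v2/health/ready",
        "model_ready": f"http://{base_url}/{model_path}/ready",
    }


urls = health_urls("triton:8000", "bertQA-torchscript", "1")
for check, url in urls.items():
    print(check, url)
```

A `GET` request to any of these URLs that returns HTTP status 200 indicates that the corresponding check passed.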

Finally, inspect the metadata returned by the server:

In [None]:
triton_client.get_server_metadata()

In addition to basic health checks and model inference, the API provides fine-grained control over the server, enabling actions such as loading and unloading models. For more information, refer to the <a href="https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/python_api.html">documentation</a> and the <a href="https://github.com/NVIDIA/triton-inference-server/tree/60c33d5593ad0d50716f04f69bb4b24ee3a7996d/src/clients/python/examples">API examples</a>.

# 4.2 Inference API Overview

Since we have been working with a neural network built to do question answering, we'll run an example query against our server. To start, let's investigate the shape of the input and output data that the server will use:

In [None]:
triton_client.get_model_metadata(modelName)

You should have received a response similar to the one below: <br/>
<img width=1000 src="images/DataFormat.png"/>

The server indicated that it expects three input tensors:
- input__0 being the input_ids
- input__1 being the segment_ids
- input__2 being the mask_ids

The server will respond with:
- output__0 being the start logits
- output__1 being the end logits
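The HTTP client returns the model metadata as a Python dictionary, so the tensor names and data types can also be read programmatically. The sketch below is one way to do that. The `example_metadata` value is a hand-written stand-in shaped like a version 2 protocol response; its shapes and output data type are assumptions, so substitute the real result of `get_model_metadata` in practice. The helper `describe_tensors` is hypothetical.

```python
# Hand-written stand-in for a metadata response (shapes and FP32 are assumed).
example_metadata = {
    "name": "bertQA-torchscript",
    "inputs": [
        {"name": "input__0", "datatype": "INT64", "shape": [-1, 384]},
        {"name": "input__1", "datatype": "INT64", "shape": [-1, 384]},
        {"name": "input__2", "datatype": "INT64", "shape": [-1, 384]},
    ],
    "outputs": [
        {"name": "output__0", "datatype": "FP32", "shape": [-1, 384]},
        {"name": "output__1", "datatype": "FP32", "shape": [-1, 384]},
    ],
}


def describe_tensors(metadata, key):
    """Return (name, datatype, shape) tuples for the 'inputs' or 'outputs' entry."""
    return [(t["name"], t["datatype"], t["shape"]) for t in metadata.get(key, [])]


for name, datatype, shape in describe_tensors(example_metadata, "inputs"):
    print(name, datatype, shape)
```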

We now need to preprocess our question and context into the format required by the server.

# 4.3 Preparing the Request

Start by creating the question and the context in which to look for the answer:

In [None]:
question = "Most antibiotics target bacteria and don't affect what class of organisms? "
# Each fragment ends with a space so that words are not fused when the strings are joined.
context = "Within the genitourinary and gastrointestinal tracts, commensal flora serve as biological barriers by " +\
        "competing with pathogenic bacteria for food and space and, in some cases, by changing the conditions in " +\
        "their environment, such as pH or available iron. This reduces the probability that pathogens will " +\
        "reach sufficient numbers to cause illness. However, since most antibiotics non-specifically target bacteria " +\
        "and do not affect fungi, oral antibiotics can lead to an overgrowth of fungi and cause conditions such as a " +\
        "vaginal candidiasis (a yeast infection). There is good evidence that re-introduction of probiotic flora, such " +\
        "as pure cultures of the lactobacilli normally found in unpasteurized yogurt, helps restore a healthy balance of " +\
        "microbial populations in intestinal infections in children and encouraging preliminary data in studies on bacterial " +\
        "gastroenteritis, inflammatory bowel diseases, urinary tract infection and post-surgical infections. "

Next, import some additional utilities that hide the boilerplate logic necessary for data transformation:

In [None]:
import sys
sys.path.insert(0,'/dli/task/client')
from tokenization import BertTokenizer
from inference import preprocess_tokenized_text,parse_answer

This section of code transforms the data into the required format:

In [None]:
tokenizer = BertTokenizer("/dli/task/vocab", do_lower_case=True, max_len=512) 
doc_tokens = context.split()
query_tokens = tokenizer.tokenize(question)

tensors_for_inference, tokens_for_postprocessing = preprocess_tokenized_text(doc_tokens, 
                                    query_tokens, 
                                    tokenizer, 
                                    max_seq_length=384, 
                                    max_query_length=64)

dtype = np.int64
input_ids = np.array(tensors_for_inference.input_ids, dtype=dtype)[None,...] # make bs=1
segment_ids = np.array(tensors_for_inference.segment_ids, dtype=dtype)[None,...] # make bs=1
input_mask = np.array(tensors_for_inference.input_mask, dtype=dtype)[None,...] # make bs=1
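To build intuition for the three tensors, here is a simplified sketch of the usual BERT question-answering input layout: `[CLS] question [SEP] context [SEP]`, followed by padding. It is not the course's `preprocess_tokenized_text` (which handles more cases); the function `build_bert_inputs` is hypothetical, and the special token ids 101, 102, and 0 are assumed values for `[CLS]`, `[SEP]`, and padding. The last lines show how `[None, ...]` adds the batch dimension.

```python
import numpy as np


def build_bert_inputs(query_ids, context_ids, max_seq_length,
                      cls_id=101, sep_id=102, pad_id=0):
    """Toy layout of input ids, segment ids and attention mask (hypothetical helper)."""
    ids = [cls_id] + query_ids + [sep_id] + context_ids + [sep_id]
    # Segment 0 covers [CLS] + question + [SEP]; segment 1 covers context + [SEP].
    segments = [0] * (len(query_ids) + 2) + [1] * (len(context_ids) + 1)
    mask = [1] * len(ids)  # 1 marks real tokens
    padding = max_seq_length - len(ids)
    if padding < 0:
        raise ValueError("sequence is longer than max_seq_length")
    ids += [pad_id] * padding
    segments += [0] * padding
    mask += [0] * padding  # 0 marks padding
    return ids, segments, mask


ids, segments, mask = build_bert_inputs([7, 8], [9, 10, 11], max_seq_length=10)
batched_ids = np.array(ids, dtype=np.int64)[None, ...]  # add a batch dimension of 1
print(ids)
print(segments)
print(mask)
print(batched_ids.shape)
```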

Finally, we copy the data into the structures required by Triton. Note that we use the tensor names, data types, and tensor dimensions reported by the Triton server earlier:

In [None]:
inputs = []
inputs.append(tritonhttpclient.InferInput('input__0', [1, len(input_ids[0])], "INT64"))
inputs.append(tritonhttpclient.InferInput('input__1', [1, len(segment_ids[0])], "INT64"))
inputs.append(tritonhttpclient.InferInput('input__2', [1, len(input_mask[0])], "INT64"))


inputs[0].set_data_from_numpy(input_ids, binary_data=False)
inputs[1].set_data_from_numpy(segment_ids, binary_data=False)
inputs[2].set_data_from_numpy(input_mask, binary_data=False)

Inspecting one of the inputs reveals the new data representation, which was tokenized and converted to the numerical format as required by the network:

In [None]:
inputs[0]._get_tensor()

Although it is possible to fetch all of the output tensors associated with the request, it is good practice to fetch only what you need in order to minimize bandwidth. We do that by specifying the requested outputs:

In [None]:
outputs = []
outputs.append(
        tritonhttpclient.InferRequestedOutput('output__0', binary_data=False))
outputs.append(
        tritonhttpclient.InferRequestedOutput('output__1', binary_data=False))
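For orientation, the version 2 predict protocol describes an inference request as a JSON document that is sent with `POST` to `v2/models/<model name>/infer`. The sketch below assembles a comparable body by hand from toy arrays. It is an illustration under assumptions rather than the client's actual serialization code: the helper `build_infer_body` is hypothetical, and it hard-codes the `INT64` data type used in this notebook.

```python
import json

import numpy as np


def build_infer_body(named_arrays, output_names):
    """Assemble a v2-style JSON inference request body (hypothetical helper)."""
    inputs = []
    for name, array in named_arrays.items():
        inputs.append({
            "name": name,
            "shape": list(array.shape),
            "datatype": "INT64",  # assumed: all inputs in this notebook are int64
            "data": array.flatten().tolist(),  # row-major flattened values
        })
    # Listing outputs by name asks the server to return only those tensors.
    return {"inputs": inputs, "outputs": [{"name": n} for n in output_names]}


toy_ids = np.array([[101, 7, 102]], dtype=np.int64)
body = build_infer_body(
    {
        "input__0": toy_ids,
        "input__1": np.zeros_like(toy_ids),
        "input__2": np.ones_like(toy_ids),
    },
    ["output__0", "output__1"],
)
print(json.dumps(body, indent=2))
```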

# 4.4 Querying the Server

Let us now issue a request to the server. The <code>outputs</code> parameter is optional. If it is not specified, all output tensors will be returned.

In [None]:
results = triton_client.infer(modelName,
                                  inputs,
                                  outputs=outputs)

Inspect the type of the <code>results</code> object returned by the server alongside the <code>outputs</code> list that was sent with the request:

In [None]:
print(type(results))
print(type(outputs))

# 4.5 Post-Processing the Response

The results in our case are just the logits of the start and end positions. Let's process those further to obtain a human-readable result. We start by copying the vectors to NumPy to make further processing easier:

In [None]:
# Copy the start and end logits from the response into NumPy arrays.
output0_data = results.as_numpy('output__0')
output1_data = results.as_numpy('output__1')
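Conceptually, turning a JSON response into NumPy arrays amounts to reading each output's `data`, `datatype`, and `shape` fields. The sketch below does this by hand for a made-up response. The `example_response` values are invented for illustration, the helper `outputs_to_numpy` is hypothetical, and the `FP32` output type is an assumption; the client's `as_numpy` remains the method to use in practice.

```python
import numpy as np

# Minimal mapping from protocol datatype strings to NumPy dtypes (only what is needed here).
DTYPES = {"FP32": np.float32, "INT64": np.int64}


def outputs_to_numpy(response):
    """Convert the 'outputs' of a v2-style JSON response into named NumPy arrays."""
    arrays = {}
    for out in response["outputs"]:
        dtype = DTYPES[out["datatype"]]
        arrays[out["name"]] = np.array(out["data"], dtype=dtype).reshape(out["shape"])
    return arrays


# Invented response with four token positions instead of a full sequence.
example_response = {
    "model_name": "bertQA-torchscript",
    "outputs": [
        {"name": "output__0", "datatype": "FP32", "shape": [1, 4],
         "data": [0.1, -1.0, 0.3, 2.5]},
        {"name": "output__1", "datatype": "FP32", "shape": [1, 4],
         "data": [-0.5, 0.2, 0.0, 3.1]},
    ],
}

arrays = outputs_to_numpy(example_response)
print(arrays["output__0"].shape, arrays["output__0"].dtype)
```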

Let's inspect the output...

In [None]:
output0_data

...and convert it into a human-readable format.

In [None]:
start_logits = output0_data[0].tolist()
end_logits = output1_data[0].tolist()

answer, answers = parse_answer(doc_tokens, tokens_for_postprocessing, 
                                 start_logits, end_logits)

# print result
print()
print(answer)
print()
print(json.dumps(answers, indent=4))

And ... TA-DA!  We have our results! Feel free to experiment with your own queries.
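For intuition about how start and end logits become an answer, here is a simplified sketch of span selection: pick the pair of positions with the highest combined score, with the end not before the start. This is not the course's `parse_answer`, which may rank several candidates and map tokens back to the original text; the function `best_span`, the toy tokens, and the logit values are all hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_length=30):
    """Return (start, end, score) of the highest-scoring valid span (simplified sketch)."""
    best = (None, None, float("-inf"))
    for start, start_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(len(end_logits), start + max_answer_length)
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


toy_tokens = ["do", "not", "affect", "fungi"]
toy_start_logits = [0.1, -1.0, 0.3, 2.5]
toy_end_logits = [-0.5, 0.2, 0.0, 3.1]

start, end, score = best_span(toy_start_logits, toy_end_logits)
toy_answer = " ".join(toy_tokens[start:end + 1])
print(start, end, toy_answer)
```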

<h3 style="color:green;">Congratulations!</h3><br>
You've completed the course! 

Please make sure you fill in the course survey and consider doing the student assessment to obtain a certificate.
