# Inference with Triton
- NVIDIA Triton Inference Server makes AI development a snap (NVIDIA Korea blog, in Korean): https://blogs.nvidia.co.kr/2021/06/04/simplifying-ai-inference-in-production-with-triton/
- Triton on SageMaker - NLP Bert: https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-triton/nlp_bert/triton_nlp_bert.ipynb

## Environment setup
- conda: `conda_python3`

In [1]:
!pip install -qU pip awscli boto3 sagemaker
!pip install nvidia-pyindex
!pip install tritonclient[http]

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 1.3.0 requires botocore<1.20.50,>=1.20.49, but you have botocore 1.23.41 which is incompatible.
Collecting nvidia-pyindex
  Downloading nvidia-pyindex-1.0.9.tar.gz (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: nvidia-pyindex
  Building wheel for nvidia-pyindex (setup.py) ... done
  Created wheel for nvidia-pyindex: filename=nvidia_pyindex-1.0.9-py3-none-any.whl size=8399 sha256=a5a0b8fe6ed8d3c2324278cf46c42e729885917be2bc852faab2f9fc7fdb4692
  Stored in directory: /home/ec2-user/.cache/pip/wheels/1a/79/65/9cb980b5f481843cd9896e1579abc1c1f608b5f9e60ca90e03
Successfully built nvidia-pyindex
Installing collected packages: nvidia-pyindex
Successfully installed nvidia-pyindex-1.0.9
Looking in indexes: https://pypi.org/simple, https:

In [2]:
import boto3, json, sagemaker, time
from sagemaker import get_execution_role

sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())

role = get_execution_role()

## Retrieve the Triton inference container image URI

In [11]:
# Not working yet...
# [Errno 2] No such file or directory: '/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/image_uri_config/triton.json'
# from sagemaker import image_uris
# image_uris.retrieve(framework='triton', region=boto3.Session().region_name, image_scope='inference', version='latest')

In [12]:
account_id_map = {
    'us-east-1': '785573368785',
    'us-east-2': '007439368137',
    'us-west-1': '710691900526',
    'us-west-2': '301217895009',
    'eu-west-1': '802834080501',
    'eu-west-2': '205493899709',
    'eu-west-3': '254080097072',
    'eu-north-1': '601324751636',
    'eu-south-1': '966458181534',
    'eu-central-1': '746233611703',
    'ap-east-1': '110948597952',
    'ap-south-1': '763008648453',
    'ap-northeast-1': '941853720454',
    'ap-northeast-2': '151534178276',
    'ap-southeast-1': '324986816169',
    'ap-southeast-2': '355873309152',
    'cn-northwest-1': '474822919863',
    'cn-north-1': '472730292857',
    'sa-east-1': '756306329178',
    'ca-central-1': '464438896020',
    'me-south-1': '836785723513',
    'af-south-1': '774647643957'
}

In [13]:
region = boto3.Session().region_name
if region not in account_id_map:
    # Raising a bare string is itself a TypeError; raise a real exception instead.
    raise ValueError(f"Unsupported region: {region}")

In [39]:
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:21.08-py3".format(
    account_id=account_id_map[region], region=region, base=base
)
triton_image_uri

'785573368785.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:21.08-py3'
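The region check and URI formatting above can be folded into a single helper. The following is a minimal sketch, not part of the original notebook: the function name `build_triton_image_uri` is hypothetical, and `ACCOUNT_IDS` is only a two-region subset of the account map shown earlier.

```python
# Two entries copied from the account-ID map above; extend with the other regions as needed.
ACCOUNT_IDS = {
    "us-east-1": "785573368785",
    "cn-north-1": "472730292857",
}


def build_triton_image_uri(region, account_ids=ACCOUNT_IDS, tag="21.08-py3"):
    """Return the SageMaker Triton image URI for `region` (hypothetical helper)."""
    if region not in account_ids:
        raise ValueError(f"Unsupported region: {region}")
    # China regions use a different DNS suffix.
    base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account_ids[region]}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:{tag}"


print(build_triton_image_uri("us-east-1"))
# 785573368785.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:21.08-py3
```

Keeping the tag as a parameter makes it easier to move to a newer Triton release later without touching the region logic.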

## Add utility methods for preparing request payload

The following functions convert the sample text used for inference into a payload that can be sent to the Triton server.

The `tritonclient` package provides utility methods that generate the payload without requiring knowledge of the request specification. **We'll use these methods to convert our inference request into a binary format, which provides lower inference latency.**
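To make the binary format less of a black box, here is a simplified, standard-library-only sketch of the layout as I understand it from Triton's binary tensor data extension: a JSON header followed by the raw little-endian tensor bytes. The function name `build_binary_request` is hypothetical, and the exact JSON that `tritonclient` emits may differ in detail; in practice `generate_request_body` should be used instead.

```python
import json
import struct


def build_binary_request(named_int32_rows, output_names):
    """Return (body, header_length): JSON header bytes followed by raw INT32 tensors."""
    header_inputs, raw_chunks = [], []
    for name, row in named_int32_rows:
        raw = struct.pack(f"<{len(row)}i", *row)  # little-endian int32 values
        header_inputs.append({
            "name": name,
            "shape": [1, len(row)],
            "datatype": "INT32",
            # Tells the server how many raw bytes belong to this tensor.
            "parameters": {"binary_data_size": len(raw)},
        })
        raw_chunks.append(raw)
    header = {
        "inputs": header_inputs,
        "outputs": [{"name": n, "parameters": {"binary_data": True}} for n in output_names],
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return header_bytes + b"".join(raw_chunks), len(header_bytes)


token_ids = [101, 7592, 102] + [0] * 125   # illustrative values padded to 128
attn_mask = [1, 1, 1] + [0] * 125
body, header_length = build_binary_request(
    [("INPUT__0", token_ids), ("INPUT__1", attn_mask)], ["OUTPUT__0"]
)
print(len(body) - header_length)  # 1024 raw bytes: 2 tensors x 128 values x 4 bytes
```

Because the tensors travel as raw bytes rather than JSON number lists, the server can skip parsing each value, which is where the latency saving comes from.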

TODO: The utility functions below need to be implemented on the client side (Lambda?).

In [33]:
import tritonclient.http as httpclient
from transformers import BertTokenizer
import numpy as np


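def tokenize_text(text):
    # Subword-based tokenizer; "uncased" means the text is lowercased before tokenizing.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", force_download=True)
    # Pad and truncate to exactly 128 tokens so the arrays match the [1, 128] input shape.
    encoded_text = tokenizer(text, padding="max_length", truncation=True, max_length=128)
    return encoded_text["input_ids"], encoded_text["attention_mask"]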


def _get_sample_tokenized_text_binary(text, input_names, output_names):
    inputs = []
    outputs = []
    inputs.append(httpclient.InferInput(input_names[0], [1, 128], "INT32"))
    inputs.append(httpclient.InferInput(input_names[1], [1, 128], "INT32"))
    indexed_tokens, attention_mask = tokenize_text(text)

    indexed_tokens = np.array(indexed_tokens, dtype=np.int32)
    indexed_tokens = np.expand_dims(indexed_tokens, axis=0)
    inputs[0].set_data_from_numpy(indexed_tokens, binary_data=True)

    attention_mask = np.array(attention_mask, dtype=np.int32)
    attention_mask = np.expand_dims(attention_mask, axis=0)
    inputs[1].set_data_from_numpy(attention_mask, binary_data=True)

    outputs.append(httpclient.InferRequestedOutput(output_names[0], binary_data=True))
    outputs.append(httpclient.InferRequestedOutput(output_names[1], binary_data=True))
    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )
    return request_body, header_length


def get_sample_tokenized_text_binary_pt(text):
    return _get_sample_tokenized_text_binary(
        text, ["INPUT__0", "INPUT__1"], ["OUTPUT__0", "1634__1"]
    )


def get_sample_tokenized_text_binary_trt(text):
    return _get_sample_tokenized_text_binary(text, ["token_ids", "attn_mask"], ["output", "1634"])
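The helpers above return `header_length` alongside the request body because the server needs it to separate the JSON header from the binary tensor data. On SageMaker this is passed through the request content type. The sketch below shows one way to assemble the arguments for `invoke_endpoint`; the helper name `build_invoke_kwargs` and the endpoint name are hypothetical, and the body used here is a stand-in rather than a real request.

```python
def build_invoke_kwargs(endpoint_name, request_body, header_length):
    """Assemble keyword arguments for `invoke_endpoint` (hypothetical helper)."""
    # The JSON header size goes into the content type so the server can split
    # the JSON header from the raw tensor bytes that follow it.
    content_type = (
        "application/vnd.sagemaker-triton.binary+json;"
        f"json-header-size={header_length}"
    )
    return {
        "EndpointName": endpoint_name,
        "ContentType": content_type,
        "Body": request_body,
    }


# Stand-in values; in the notebook these would come from
# get_sample_tokenized_text_binary_pt(text).
kwargs = build_invoke_kwargs("my-triton-endpoint", b"{}", 2)
print(kwargs["ContentType"])

# With a deployed endpoint (name assumed), the call would look like:
# response = runtime_sm_client.invoke_endpoint(**kwargs)
```

Building the arguments separately from the network call keeps this part easy to reuse from a Lambda function or any other client.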

In [42]:
!docker run --gpus=all --rm -it \
            -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:21.08-py3 \
            /bin/bash generate_models.sh

Unable to find image 'nvcr.io/nvidia/pytorch:21.08-py3' locally
21.08-py3: Pulling from nvidia/pytorch
