# I. Triton Inference Server Setup and Basic Run

## 0. Assumptions: the following are already installed and available

### Checks

In [4]:
# !docker
# !python --version
# !conda --version
!nvidia-smi

Sat Feb 25 21:19:22 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## 1. Triton Server Installation (Docker)

In [5]:
# Pull the Triton server image (reports "up to date" if it is already present locally)
!docker pull nvcr.io/nvidia/tritonserver:23.01-py3

23.01-py3: Pulling from nvidia/tritonserver
Digest: sha256:a9133b4f34aefaa2aebdb2009ae134cd220892cf16d3ed45e3e01362b094d732
Status: Image is up to date for nvcr.io/nvidia/tritonserver:23.01-py3
nvcr.io/nvidia/tritonserver:23.01-py3


## 2. Code repository structure

In [6]:
!tree ./model_repository_single_service -I '__pycache__'

./model_repository_single_service
└── sentiment-nltk-service
    ├── 1
    │   ├── __init__.py
    │   ├── model.py
    │   └── resources
    │       └── nltk
    │           └── sentiment
    │               └── vader_lexicon.zip
    ├── build_env.sh
    ├── config.pbtxt
    ├── requirements.txt
    └── sentinltkenv.tar.gz

5 directories, 7 files


In [7]:
!cat model_repository_single_service/sentiment-nltk-service/config.pbtxt

name: "sentiment-nltk-service"
backend: "python"  # PyTorch, TF, ONNX, TensorRT
max_batch_size: 8

dynamic_batching { }

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims:  [1]
  }
]

output [
  {
    name: "STATUS"
    data_type: TYPE_STRING
    dims: [1]
  },
  {
    name: "SCORE"
    data_type: TYPE_FP32
    dims: [1]
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/mnt/data/model_repository/sentiment-nltk-service/sentinltkenv.tar.gz"}
}

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]


## 3. Run Inference Server using docker

In [8]:
container_id=!(docker run -d \
                --shm-size=5G \
                -p8000:8000 -p8001:8001 -p8002:8002 \
                -v $PWD/model_repository_single_service:/mnt/data/model_repository \
                nvcr.io/nvidia/tritonserver:23.01-py3 \
                tritonserver \
                --model-repository=/mnt/data/model_repository \
                --log-verbose=1)

In [9]:
container_id

['6099717b6c56a70498a528171a34faf4b9cc776dc14851d36b3da19ee0cf876b']

In [10]:
!docker logs {container_id[0]}


== Triton Inference Server ==

NVIDIA Release 23.01 (build 52277748)
Triton Server Version 2.30.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

W0226 05:32:47.599411 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0226 05:32:47.650953 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
I0226 05:32:47.662495 1 model_c

## 4. Inference requests

### 4.1 Using Python requests to send an HTTP call

In [11]:
import requests
import json

url = "http://localhost:8000/v2/models/sentiment-nltk-service/versions/1/infer"

payload = json.dumps({
  "inputs": [
    {
      "name": "TEXT",
      "shape": [
        1,
        1
      ],
      "datatype": "BYTES",
      "data": [
        "Awesome"
      ]
    }
  ]
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)

json.loads(response.text)

{'model_name': 'sentiment-nltk-service',
 'model_version': '1',
 'outputs': [{'name': 'SCORE',
   'datatype': 'BYTES',
   'shape': [1],
   'data': ['0.6249']},
  {'name': 'STATUS', 'datatype': 'BYTES', 'shape': [1], 'data': ['Success']}]}
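
The request body above follows one fixed pattern per input tensor (name, shape `[1, 1]`, datatype `BYTES`, one string). As a minimal sketch, assuming every input is a single string as in this notebook, a small helper can build that body. `build_infer_payload` is a hypothetical name introduced here for illustration; it is not part of any Triton client library.

```python
import json


def build_infer_payload(inputs):
    """Build an inference request body for single-string inputs.

    `inputs` maps a tensor name (e.g. "TEXT") to one Python string.
    Every tensor is sent as BYTES with shape [1, 1], matching the
    payloads used in this notebook.
    """
    return json.dumps({
        "inputs": [
            {
                "name": name,
                "shape": [1, 1],
                "datatype": "BYTES",
                "data": [value],
            }
            for name, value in inputs.items()
        ]
    })


# Same payload as the cell above, built from a dict.
payload = build_infer_payload({"TEXT": "Awesome"})
print(payload)
```

The same helper would cover a model with several string inputs by passing more keys in the dict.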

### 4.2 Using the Python SDK to send a gRPC call

In [12]:
import numpy as np
import tritonclient.grpc as grpcclient

class SentimentNLTKService:
    def __init__(self):
        self.model_name = "sentiment-nltk-service"
        self.input_meta = ("TEXT", [1,1], "BYTES")
        self.output_fields = ['STATUS', 'SCORE']
    def grpc_infer_call(self, in_text):
        # 1. Client Initialisation
        triton_client = grpcclient.InferenceServerClient(url="localhost:8001", verbose=True)
        # 2. Input
        input_data = grpcclient.InferInput(*self.input_meta)
        input_text = np.array([in_text], dtype=object)
        input_text.shape = (1,1)
        input_data.set_data_from_numpy(input_text)
        inputs = [input_data]
        # 3. Outputs
        outputs = [grpcclient.InferRequestedOutput(field) for field in self.output_fields]
        # 4. Send request
        results = triton_client.infer(model_name=self.model_name,inputs=inputs,outputs=outputs)
        # 5. Return output
        return [{field: results.as_numpy(field)} for field in self.output_fields]

In [13]:
sentiment_service_client = SentimentNLTKService()
print(sentiment_service_client.grpc_infer_call("Awesome"))

infer, metadata ()
model_name: "sentiment-nltk-service"
inputs {
  name: "TEXT"
  datatype: "BYTES"
  shape: 1
  shape: 1
}
outputs {
  name: "STATUS"
}
outputs {
  name: "SCORE"
}
raw_input_contents: "\007\000\000\000Awesome"

model_name: "sentiment-nltk-service"
model_version: "1"
outputs {
  name: "SCORE"
  datatype: "BYTES"
  shape: 1
}
outputs {
  name: "STATUS"
  datatype: "BYTES"
  shape: 1
}
raw_output_contents: "\006\000\000\0000.6249"
raw_output_contents: "\007\000\000\000Success"

[{'STATUS': array([b'Success'], dtype=object)}, {'SCORE': array([b'0.6249'], dtype=object)}]
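
The `raw_input_contents` and `raw_output_contents` fields in the verbose log show how a `BYTES` tensor travels on the wire: each element is a 4-byte little-endian length followed by that many bytes (`\007\000\000\000Awesome` is length 7, then the text). The sketch below decodes that layout by hand, purely to illustrate the format; in practice `results.as_numpy(...)` already does this. `decode_bytes_tensor` is a hypothetical helper name.

```python
import struct


def decode_bytes_tensor(raw):
    """Split a serialized BYTES tensor into its elements.

    Each element is stored as a 4-byte little-endian length followed by
    that many bytes, as seen in raw_output_contents above.
    """
    elements = []
    offset = 0
    while offset < len(raw):
        (length,) = struct.unpack_from("<I", raw, offset)
        offset += 4
        elements.append(raw[offset:offset + length])
        offset += length
    return elements


# The two raw outputs from the log above, written as Python bytes.
print(decode_bytes_tensor(b"\x06\x00\x00\x000.6249"))
print(decode_bytes_tensor(b"\x07\x00\x00\x00Success"))
```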


## 5. Health and Monitoring

### 5.1. Health Check

In [14]:
import tritonclient.http as httpclient

triton_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=True)

In [15]:
triton_client.is_server_live()

GET /v2/health/live, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>


True

In [16]:
triton_client.is_server_ready()

GET /v2/health/ready, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>


True

In [17]:
triton_client.is_model_ready("sentiment-nltk-service")

GET /v2/models/sentiment-nltk-service/ready, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>


True

In [18]:
triton_client.get_model_metadata("sentiment-nltk-service")

GET /v2/models/sentiment-nltk-service, headers None
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '245'}>
bytearray(b'{"name":"sentiment-nltk-service","versions":["1"],"platform":"python","inputs":[{"name":"TEXT","datatype":"BYTES","shape":[-1,1]}],"outputs":[{"name":"STATUS","datatype":"BYTES","shape":[-1,1]},{"name":"SCORE","datatype":"FP32","shape":[-1,1]}]}')


{'name': 'sentiment-nltk-service',
 'versions': ['1'],
 'platform': 'python',
 'inputs': [{'name': 'TEXT', 'datatype': 'BYTES', 'shape': [-1, 1]}],
 'outputs': [{'name': 'STATUS', 'datatype': 'BYTES', 'shape': [-1, 1]},
  {'name': 'SCORE', 'datatype': 'FP32', 'shape': [-1, 1]}]}
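
Because the container is started in detached mode, the server may still be loading models when the first request is sent. A minimal sketch of a readiness poll, assuming any zero-argument callable that returns `True` once the server is ready; `wait_until_ready` and its parameters are hypothetical names, not part of `tritonclient`.

```python
import time


def wait_until_ready(probe, timeout_s=60.0, interval_s=1.0):
    """Call `probe()` until it returns True or `timeout_s` elapses.

    `probe` is any zero-argument callable; exceptions (for example a
    refused connection while the server is still starting) count as
    "not ready yet".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage with the client from this section (commented out so the sketch
# runs without a server):
# wait_until_ready(lambda: triton_client.is_model_ready("sentiment-nltk-service"))
```

Passing the probe as a callable keeps the loop independent of the HTTP or gRPC client in use.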

### 5.2. Metrics

In [19]:
triton_client.get_inference_statistics()

GET /v2/models/stats, headers None
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '592'}>
bytearray(b'{"model_stats":[{"name":"sentiment-nltk-service","version":"1","last_inference":1677389751622,"inference_count":2,"execution_count":2,"inference_stats":{"success":{"count":2,"ns":16411202},"fail":{"count":0,"ns":0},"queue":{"count":2,"ns":566035},"compute_input":{"count":2,"ns":402930},"compute_infer":{"count":2,"ns":14881326},"compute_output":{"count":2,"ns":538076},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"batch_stats":[{"batch_size":1,"compute_input":{"count":2,"ns":402930},"compute_infer":{"count":2,"ns":14881326},"compute_output":{"count":2,"ns":538076}}]}]}')


{'model_stats': [{'name': 'sentiment-nltk-service',
   'version': '1',
   'last_inference': 1677389751622,
   'inference_count': 2,
   'execution_count': 2,
   'inference_stats': {'success': {'count': 2, 'ns': 16411202},
    'fail': {'count': 0, 'ns': 0},
    'queue': {'count': 2, 'ns': 566035},
    'compute_input': {'count': 2, 'ns': 402930},
    'compute_infer': {'count': 2, 'ns': 14881326},
    'compute_output': {'count': 2, 'ns': 538076},
    'cache_hit': {'count': 0, 'ns': 0},
    'cache_miss': {'count': 0, 'ns': 0}},
   'batch_stats': [{'batch_size': 1,
     'compute_input': {'count': 2, 'ns': 402930},
     'compute_infer': {'count': 2, 'ns': 14881326},
     'compute_output': {'count': 2, 'ns': 538076}}]}]}
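
The statistics are cumulative: each entry holds a `count` and a total duration in nanoseconds (`ns`). Dividing one by the other gives the average time per request for each stage. A minimal sketch, using values copied from the output above; `average_latencies_ms` is a hypothetical helper name.

```python
def average_latencies_ms(model_stat):
    """Return the mean duration in milliseconds for each inference_stats entry.

    Entries with a zero count (for example `fail` in the output above)
    are skipped to avoid dividing by zero.
    """
    averages = {}
    for name, entry in model_stat["inference_stats"].items():
        if entry["count"] > 0:
            averages[name] = entry["ns"] / entry["count"] / 1e6
    return averages


# Values copied from the statistics output above.
sample = {
    "inference_stats": {
        "success": {"count": 2, "ns": 16411202},
        "fail": {"count": 0, "ns": 0},
        "queue": {"count": 2, "ns": 566035},
        "compute_infer": {"count": 2, "ns": 14881326},
    }
}
for name, ms in average_latencies_ms(sample).items():
    print(f"{name}: {ms:.3f} ms")
```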

In [20]:
import requests

url = "http://localhost:8002/metrics"

payload={}
headers = {}

response = requests.request("GET", url, headers=headers, data=payload)

print(response.text)

# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="sentiment-nltk-service",version="1"} 2
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="sentiment-nltk-service",version="1"} 0
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="sentiment-nltk-service",version="1"} 2
# HELP nv_inference_exec_count Number of model executions performed (does not include cached requests)
# TYPE nv_inference_exec_count counter
nv_inference_exec_count{model="sentiment-nltk-service",version="1"} 2
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_infer
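
The metrics endpoint returns plain text in the Prometheus exposition format: `# HELP` and `# TYPE` comment lines, then one `name{labels} value` sample per line. A rough sketch of turning that text into Python values follows. It handles only the simple label syntax seen above and is not a full Prometheus parser; `parse_metrics` is a hypothetical helper name.

```python
import re

# Matches lines such as: metric_name{label="x",other="y"} 2
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text-format samples into (name, labels, value) tuples.

    Comment lines (# HELP, # TYPE) and lines that do not look like a
    complete sample are ignored.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group("value"))
        except ValueError:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, value))
    return samples


# Two lines copied from the metrics output above.
example = (
    '# TYPE nv_inference_count counter\n'
    'nv_inference_count{model="sentiment-nltk-service",version="1"} 2\n'
)
for name, labels, value in parse_metrics(example):
    print(name, labels, value)
```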

## 6. Cleanups

In [21]:
!docker stop {container_id[0]}

6099717b6c56a70498a528171a34faf4b9cc776dc14851d36b3da19ee0cf876b


In [22]:
!echo y | docker container prune

Are you sure you want to continue? [y/N] Deleted Containers:
6099717b6c56a70498a528171a34faf4b9cc776dc14851d36b3da19ee0cf876b

Total reclaimed space: 12.55kB


# II. Non-Optimisation Features

## 1. Multi-version support

### 1.1. Code repository structure

In [23]:
!tree ./model_repository_multi_version -I '__pycache__'

./model_repository_multi_version
└── sentiment-service
    ├── 1
    │   ├── __init__.py
    │   ├── model.py
    │   └── resources
    │       └── nltk
    │           └── sentiment
    │               └── vader_lexicon.zip
    ├── 2
    │   ├── __init__.py
    │   ├── model.py
    │   └── resources
    │       └── spacy
    │           └── en_core_web_sm-3.3.0
    │               ├── LICENSE
    │               ├── LICENSES_SOURCES
    │               ├── README.md
    │               ├── accuracy.json
    │               ├── attribute_ruler
    │               │   └── patterns
    │               ├── config.cfg
    │               ├── lemmatizer
    │               │   └── lookups
    │               │       └── lookups.bin
    │               ├── meta.json
    │               ├

In [24]:
!cat model_repository_multi_version/sentiment-service/config.pbtxt

name: "sentiment-service"
backend: "python"
max_batch_size: 8

dynamic_batching { }

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims:  [1]
  }
]

output [
  {
    name: "STATUS"
    data_type: TYPE_STRING
    dims: [1]
  },
  {
    name: "SCORE"
    data_type: TYPE_FP32
    dims: [1]
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/mnt/data/model_repository/sentiment-service/sentimentenv.tar.gz"}
}

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

version_policy: { all: {}}
# Other possible values:
# version_policy: { latest: { num_versions: 2}}
# version_policy: { specific: { versions: [1,2]}} 
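
The three `version_policy` forms differ only in which version directories get loaded. The sketch below mimics that selection in plain Python to make the difference concrete. It is an illustration of the documented behaviour, not Triton code, and `select_versions` is a hypothetical name. When no policy is given, Triton's default is `latest` with `num_versions: 1`.

```python
def select_versions(available, policy, num_versions=1, versions=()):
    """Return the versions that would be served under a version_policy.

    `available` lists the numeric version directories in the model folder.
    `policy` is "all", "latest" or "specific".
    """
    available = sorted(available)
    if policy == "all":
        return available
    if policy == "latest":
        if num_versions < 1:
            raise ValueError("num_versions must be at least 1")
        # The numerically highest versions count as the latest ones.
        return available[-num_versions:]
    if policy == "specific":
        wanted = set(versions)
        return [v for v in available if v in wanted]
    raise ValueError(f"unknown policy: {policy}")


print(select_versions([1, 2], "all"))
print(select_versions([1, 2], "latest", num_versions=1))
print(select_versions([1, 2], "specific", versions=[1]))
```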


### 1.2 Run container

In [25]:
container_id=!(docker run -d \
                --shm-size=5G \
                -p8000:8000 -p8001:8001 -p8002:8002 \
                -v $PWD/model_repository_multi_version:/mnt/data/model_repository \
                nvcr.io/nvidia/tritonserver:23.01-py3 \
                tritonserver \
                --model-repository=/mnt/data/model_repository \
                --log-verbose=1)

In [26]:
container_id

['5cd70b4d91e8a2577bc43afae724703921b57c2affdd04c828ba31519c730a1f']

In [28]:
!docker logs {container_id[0]}


== Triton Inference Server ==

NVIDIA Release 23.01 (build 52277748)
Triton Server Version 2.30.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

W0226 05:44:36.503505 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0226 05:44:36.503652 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
I0226 05:44:36.506286 1 model_c

### 1.3 Inference Request

In [29]:
import requests
import json

url = "http://localhost:8000/v2/models/sentiment-service/versions/{}/infer"

payload = json.dumps({
  "inputs": [
    {
      "name": "TEXT",
      "shape": [
        1,
        1
      ],
      "datatype": "BYTES",
      "data": [
        "Awesome"
      ]
    }
  ]
})
headers = {
  'Content-Type': 'application/json'
}

In [30]:
print("--------------- V1 Request ---------------")
versioned_url = url.format("1")
print(f"URL:{versioned_url}")
response = requests.request("POST",versioned_url , headers=headers, data=payload)
json.loads(response.text)

--------------- V1 Request ---------------
URL:http://localhost:8000/v2/models/sentiment-service/versions/1/infer


{'model_name': 'sentiment-service',
 'model_version': '1',
 'outputs': [{'name': 'SCORE',
   'datatype': 'BYTES',
   'shape': [1],
   'data': ['0.6249']},
  {'name': 'STATUS', 'datatype': 'BYTES', 'shape': [1], 'data': ['Success']}]}

In [31]:
print("--------------- V2 Request ---------------")
versioned_url = url.format("2")
print(f"URL:{versioned_url}")
response = requests.request("POST",versioned_url , headers=headers, data=payload)
json.loads(response.text)

--------------- V2 Request ---------------
URL:http://localhost:8000/v2/models/sentiment-service/versions/2/infer


{'model_name': 'sentiment-service',
 'model_version': '2',
 'outputs': [{'name': 'SCORE',
   'datatype': 'BYTES',
   'shape': [1],
   'data': ['1.0']},
  {'name': 'STATUS', 'datatype': 'BYTES', 'shape': [1], 'data': ['Success']}]}

### 1.4 Cleanups

In [32]:
!docker stop {container_id[0]}

5cd70b4d91e8a2577bc43afae724703921b57c2affdd04c828ba31519c730a1f


In [33]:
!echo y | docker container prune

Are you sure you want to continue? [y/N] Deleted Containers:
5cd70b4d91e8a2577bc43afae724703921b57c2affdd04c828ba31519c730a1f

Total reclaimed space: 12.55kB


## 2. GPU support

### 2.1. Code repository structure

In [34]:
!tree ./model_repository_gpu_service -I '__pycache__'

./model_repository_gpu_service
└── translation-service
    ├── 1
    │   ├── helpers
    │   │   └── translation.py
    │   ├── model.py
    │   └── resources
    │       ├── __init__.py
    │       └── m2m100_418M
    │           ├── README.md
    │           ├── __init__.py
    │           ├── config.json
    │           ├── pytorch_model.bin
    │           ├── rust_model.ot
    │           ├── sentencepiece.bpe.model
    │           ├── special_tokens_map.json
    │           ├── tokenizer_config.json
    │           └── vocab.json
    ├── build_env.sh
    ├── config.pbtxt
    ├── requirements.txt
    └── translationenv.tar.gz

5 directories, 16 files


In [35]:
!cat model_repository_gpu_service/translation-service/config.pbtxt

name: "translation-service"
backend: "python"
max_batch_size: 8

dynamic_batching { }

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims:  [1]
  },
  {
    name: "SRCLANG"
    data_type: TYPE_STRING
    dims:  [1]
  },
  {
    name: "TARGETLANG"
    data_type: TYPE_STRING
    dims:  [1]
  }
]

output [
  {
    name: "TRANSLATION"
    data_type: TYPE_STRING
    dims: [1]
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/mnt/data/model_repository/translation-service/translationenv.tar.gz"}
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

response_cache {
  enable: true
}


### 2.2 Run container

In [36]:
container_id=!(docker run -d \
                --shm-size=5G \
                --gpus device=0 \
                -p8000:8000 -p8001:8001 -p8002:8002 \
                -v $PWD/model_repository_gpu_service:/mnt/data/model_repository \
                nvcr.io/nvidia/tritonserver:23.01-py3 \
                tritonserver \
                --model-repository=/mnt/data/model_repository \
                --log-verbose=1)

In [37]:
container_id

['5493dbf03be993438539ac657e913ac96b906081ea827d4ce6d23a31d9df2599']

In [42]:
!docker logs {container_id[0]}


== Triton Inference Server ==

NVIDIA Release 23.01 (build 52277748)
Triton Server Version 2.30.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.0 driver version 525.85.11 with kernel driver version 510.47.03.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I0226 05:46:55.714760 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f37fa000000' with size 268435456
I0226 05:46:55.717136 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0226 05:

### 2.3 Inference Request

In [43]:
import requests
import json

url = "http://localhost:8000/v2/models/translation-service/versions/1/infer"

payload = json.dumps({
  "inputs": [
    {
      "name": "TEXT",
      "shape": [
        1,
        1
      ],
      "datatype": "BYTES",
      "data": [
        "Merci d'avoir participé au webinaire !"
      ]
    },
    {
      "name": "SRCLANG",
      "shape": [
        1,
        1
      ],
      "datatype": "BYTES",
      "data": [
        "fr"
      ]
    },
    {
      "name": "TARGETLANG",
      "shape": [
        1,
        1
      ],
      "datatype": "BYTES",
      "data": [
        "en"
      ]
    }
  ]
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST",url , headers=headers, data=payload)
json.loads(response.text)

{'model_name': 'translation-service',
 'model_version': '1',
 'outputs': [{'name': 'TRANSLATION',
   'datatype': 'BYTES',
   'shape': [1],
   'data': ['Thank you for participating in the webinar!']}]}

### 2.4 Cleanups

In [44]:
!docker stop {container_id[0]}

5493dbf03be993438539ac657e913ac96b906081ea827d4ce6d23a31d9df2599


In [45]:
!echo y | docker container prune

Are you sure you want to continue? [y/N] Deleted Containers:
5493dbf03be993438539ac657e913ac96b906081ea827d4ce6d23a31d9df2599

Total reclaimed space: 76.87kB


# III. Optimisation Features

## 1. Response Cache

### 1.1 Setups

In [46]:
!cat model_repository_gpu_service/translation-service/config.pbtxt

name: "translation-service"
backend: "python"
max_batch_size: 8

dynamic_batching { }

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims:  [1]
  },
  {
    name: "SRCLANG"
    data_type: TYPE_STRING
    dims:  [1]
  },
  {
    name: "TARGETLANG"
    data_type: TYPE_STRING
    dims:  [1]
  }
]

output [
  {
    name: "TRANSLATION"
    data_type: TYPE_STRING
    dims: [1]
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/mnt/data/model_repository/translation-service/translationenv.tar.gz"}
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

response_cache {
  enable: true
}


### 1.2 Run container

In [47]:
container_id=!(docker run -d \
                --shm-size=5G \
                --gpus device=0 \
                -p8000:8000 -p8001:8001 -p8002:8002 \
                -v $PWD/model_repository_gpu_service:/mnt/data/model_repository \
                nvcr.io/nvidia/tritonserver:23.01-py3 \
                tritonserver \
                --model-repository=/mnt/data/model_repository \
                --response-cache-byte-size 1048576 \
                --log-verbose=1)

# Note:
# --response-cache-byte-size is deprecated; from release 23.03 onwards use --cache-config local,size=SIZE instead.
# In addition to local, redis and custom cache implementations are supported.

In [48]:
container_id

['bfc89896cb0c2dc2de4bc185bbb9edecd07a2a6ce945c3e80b20cd25020fc4b9']

In [50]:
!docker logs {container_id[0]}


== Triton Inference Server ==

NVIDIA Release 23.01 (build 52277748)
Triton Server Version 2.30.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.0 driver version 525.85.11 with kernel driver version 510.47.03.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I0226 05:52:11.893818 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f374e000000' with size 268435456
I0226 05:52:11.893995 1 response_cache.cc:115] Response Cache is created at '0x563ae1c61b50' with size 1048576
I0226 05:

### 1.3 Inference Request

In [51]:
import requests
import json

url = "http://localhost:8000/v2/models/translation-service/versions/1/infer"

payload = json.dumps({
  "inputs": [
    {
      "name": "TEXT",
      "shape": [
        1,
        1
      ],
      "datatype": "BYTES",
      "data": [
        "Merci d'avoir participé au webinaire !"
      ]
    },
    {
      "name": "SRCLANG",
      "shape": [
        1,
        1
      ],
      "datatype": "BYTES",
      "data": [
        "fr"
      ]
    },
    {
      "name": "TARGETLANG",
      "shape": [
        1,
        1
      ],
      "datatype": "BYTES",
      "data": [
        "en"
      ]
    }
  ]
})
headers = {
  'Content-Type': 'application/json'
}

In [52]:
# First call
response = requests.request("POST",url , headers=headers, data=payload)
json.loads(response.text)

{'model_name': 'translation-service',
 'model_version': '1',
 'outputs': [{'name': 'TRANSLATION',
   'datatype': 'BYTES',
   'shape': [1],
   'data': ['Thank you for participating in the webinar!']}]}

In [53]:
response.elapsed.total_seconds()

0.319397

In [54]:
# Second call with the same input; expected to be served from the response cache
response = requests.request("POST",url , headers=headers, data=payload)
json.loads(response.text)

{'model_name': 'translation-service',
 'model_version': '1',
 'outputs': [{'name': 'TRANSLATION',
   'datatype': 'BYTES',
   'shape': [1],
   'data': ['Thank you for participating in the webinar!']}]}

In [55]:
response.elapsed.total_seconds()

0.002314
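
To put the two timings side by side, the arithmetic below computes the speedup and the time saved per call. The numbers are single measurements copied from the cells above, so treat the result as indicative only; the first call may also include one-off warm-up cost.

```python
first_call_s = 0.319397   # elapsed time of the first (uncached) call above
second_call_s = 0.002314  # elapsed time of the repeated call above

speedup = first_call_s / second_call_s
saved_ms = (first_call_s - second_call_s) * 1000

print(f"speedup: {speedup:.0f}x, time saved per call: {saved_ms:.1f} ms")
```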

### 1.4 Cleanups

In [56]:
!docker stop {container_id[0]}

bfc89896cb0c2dc2de4bc185bbb9edecd07a2a6ce945c3e80b20cd25020fc4b9


In [57]:
!echo y | docker container prune

Are you sure you want to continue? [y/N] Deleted Containers:
bfc89896cb0c2dc2de4bc185bbb9edecd07a2a6ce945c3e80b20cd25020fc4b9

Total reclaimed space: 76.38kB


## 2. Instance groups

### 2.1 Setups

In [58]:
!cat model_repository_instance_group/sentiment-nltk-service-multi-instance/config.pbtxt

name: "sentiment-nltk-service-multi-instance"
backend: "python"
max_batch_size: 1

dynamic_batching { }

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims:  [1]
  }
]

output [
  {
    name: "STATUS"
    data_type: TYPE_STRING
    dims: [1]
  },
  {
    name: "SCORE"
    data_type: TYPE_FP32
    dims: [1]
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/mnt/data/model_repository/sentiment-nltk-service-multi-instance/sentinltkenv.tar.gz"}
}

instance_group [
  {
    count: 4
    kind: KIND_CPU
  }
]


### 2.2 Run container

In [59]:
container_id=!(docker run -d \
                --shm-size=5G \
                -p8000:8000 -p8001:8001 -p8002:8002 \
                -v $PWD/model_repository_instance_group:/mnt/data/model_repository \
                nvcr.io/nvidia/tritonserver:23.01-py3 \
                tritonserver \
                --model-repository=/mnt/data/model_repository \
                --response-cache-byte-size 1048576 \
                --log-verbose=1)

In [60]:
container_id

['07be1398dba9332f55919abb0d9f1d75d1c5fc1d4c13fdcd0aca8781ff28ac5a']

In [61]:
!docker logs {container_id[0]}


== Triton Inference Server ==

NVIDIA Release 23.01 (build 52277748)
Triton Server Version 2.30.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

W0226 05:55:14.933129 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0226 05:55:14.933282 1 response_cache.cc:115] Response Cache is created at '0x7fd5e2c21010' with size 1048576

### 2.3 Inference Request

In [62]:
import concurrent.futures
import requests
import json

def send_n_concurrent_requests(num_requests, url):
    
    payload = json.dumps({
      "inputs": [
        {
          "name": "TEXT",
          "shape": [
            1,
            1
          ],
          "datatype": "BYTES",
          "data": [
            """I think you did a great job when you ran the all-hands meeting.
            It showed that you are capable of getting people to work together and communicate effectively.
            I admire your communication skills.
            One of your most impactful moments was how you handled Project X.
            You showed the power of user testing in shaping a feature roadmap.
            Your efforts increased the likelihood that we satisfy and delight our users.
            I'd love to see you do more of this.
            Something I really appreciate about you is your aptitude for problem-solving.
            I really think you have a superpower around making new hires feel welcome.
            One of the things I admire about you is your ability to manage a team remotely.
            I can see you’re having a positive impact in your new office, people seem really happy to have you on their team."""
          ]
        }
      ]
    })
    headers = {
      'Content-Type': 'application/json'
    }
    
    def infer(request_num):
        # print(f"Processing: {request_num}")  # For debugging
        response = requests.request("POST",url , headers=headers, data=payload)
        return response.elapsed.total_seconds()

    response_times = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_requests) as executor:
        for result in executor.map(infer, range(num_requests)):
            response_times.append(result)

    # Note: this is the sum of the individual request latencies, not the wall-clock time of the run.
    print(f"Total Time taken: {sum(response_times)}")


In [63]:
send_n_concurrent_requests(1000,"http://localhost:8000/v2/models/sentiment-nltk-service/versions/1/infer")

Total Time taken: 154.6507970000001


In [64]:
send_n_concurrent_requests(1000,"http://localhost:8000/v2/models/sentiment-nltk-service-multi-instance/versions/1/infer")

Total Time taken: 13.099421999999999
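
The figure printed by `send_n_concurrent_requests` is the sum of each request's individual latency, so with many threads it can far exceed the time the whole run actually took. As a complementary sketch, the helper below also records wall-clock time, from which throughput follows. `measure_concurrent` and `fake_request` are hypothetical names; `fake_request` stands in for the HTTP call so the example runs without a server.

```python
import concurrent.futures
import time


def measure_concurrent(call, num_requests, max_workers=32):
    """Run `call(i)` for i in range(num_requests) on a thread pool.

    Returns (wall_clock_seconds, per_call_seconds).
    """
    def timed(i):
        start = time.perf_counter()
        call(i)
        return time.perf_counter() - start

    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        latencies = list(executor.map(timed, range(num_requests)))
    wall = time.perf_counter() - start
    return wall, latencies


# Stand-in for an HTTP request so the sketch runs without a server.
def fake_request(_):
    time.sleep(0.01)


wall, latencies = measure_concurrent(fake_request, num_requests=20, max_workers=10)
print(f"wall clock: {wall:.3f}s, summed latencies: {sum(latencies):.3f}s, "
      f"throughput: {len(latencies) / wall:.1f} req/s")
```

To use it against the server, `call` would wrap the `requests.request("POST", ...)` line from the cell above.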


### 2.4 Cleanups

In [65]:
!docker stop {container_id[0]}

07be1398dba9332f55919abb0d9f1d75d1c5fc1d4c13fdcd0aca8781ff28ac5a


In [66]:
!echo y | docker container prune

Are you sure you want to continue? [y/N] Deleted Containers:
07be1398dba9332f55919abb0d9f1d75d1c5fc1d4c13fdcd0aca8781ff28ac5a

Total reclaimed space: 12.55kB


# IV. Optimisation Features - Introductions

## 1. Dynamic Batching

In [67]:
!cat model_repository_single_service/sentiment-nltk-service/config.pbtxt

name: "sentiment-nltk-service"
backend: "python"  # PyTorch, TF, ONNX, TensorRT
max_batch_size: 8

dynamic_batching { }

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims:  [1]
  }
]

output [
  {
    name: "STATUS"
    data_type: TYPE_STRING
    dims: [1]
  },
  {
    name: "SCORE"
    data_type: TYPE_FP32
    dims: [1]
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/mnt/data/model_repository/sentiment-nltk-service/sentinltkenv.tar.gz"}
}

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
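
With `dynamic_batching { }` enabled, the server may combine individual requests waiting in the queue into one batch of at most `max_batch_size` before calling the model. The sketch below is a deliberately simplified illustration of that grouping; the real scheduler also accounts for arrival timing, preferred batch sizes and queue delay settings. `form_batches` is a hypothetical name.

```python
def form_batches(pending_requests, max_batch_size):
    """Group queued requests into batches of at most `max_batch_size`.

    Simplified model: everything in `pending_requests` is assumed to be
    waiting at the same moment.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        pending_requests[i:i + max_batch_size]
        for i in range(0, len(pending_requests), max_batch_size)
    ]


# 19 queued requests with max_batch_size: 8, as in the config above.
batches = form_batches([f"req-{i}" for i in range(19)], max_batch_size=8)
print([len(batch) for batch in batches])
```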


## 2. Ensemble Models and BLS (Business Logic Scripting)

<img src="Ensemble.jpg">

## 3. Model Analyzer

# That's all folks!

<i> <h3> Meet you in Advanced Triton Optimisation Features session! </h3> </i>