## Triton Server APIs
1. ### [Version](#1.-Version)
    - [`GET /v2`](#GET-/v2)

2. ### [Health Check](#2.-Health-Check)
    - [`GET /v2/health/live`](#GET-/v2/health/live)
    - [`GET /v2/health/ready`](#GET-/v2/health/ready)
    - [`GET /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready`](#GET-/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready)

3. ### [Model Repository](#3.-Model-Repository)
    - [`POST /v2/repository/index`](#POST-/v2/repository/index)
    - [`POST /v2/repository/models/${MODEL_NAME}/load`](#POST-/v2/repository/models/${MODEL_NAME}/load)
    - [`POST /v2/repository/models/${MODEL_NAME}/unload`](#POST-/v2/repository/models/${MODEL_NAME}/unload)

4. ### [Model](#4.-Model)
    - [`GET /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/config`](#GET-/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/config)
    - [`GET /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/stats`](#GET-/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/stats)
    - [`POST /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer`](#POST-/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer)
        - [`models/add_sub`](#POST-/v2/models/add_sub[/versions/${MODEL_VERSION}]/infer)
        - [`models/inception_graphdef`](#POST-/v2/models/inception_graphdef[/versions/${MODEL_VERSION}]/infer)
        - [`models/resnet18_onnx`](#POST-/v2/models/resnet18_onnx[/versions/${MODEL_VERSION}]/infer)
    
5. ### [GPU Memory](#5.-GPU-Memory)
    - [nvidia-smi](#nvidia-smi)

6. ### [Prometheus Metrics](#6.-Prometheus-Metrics)

7. ### [Stability Tests](#7.-Stability-Tests)
    - [Load & Unload Tests](#Stability-Tests-/-Load-&-Unload-Tests)
        - [One Model, Many Versions](#Stability-Tests-/-Load-&-Unload-Tests-/-One-Model,-Many-Versions)
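        - [Many Models, One Version per Model](#Stability-Tests-/-Load-&-Unload-Tests-/-Many-Models,-One-Version-per-Model)
        - [Many Models, One Version per Model-2](#Stability-Tests-/-Load-&-Unload-Tests-/-Many-Models,-One-Version-per-Model-2)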


## Global Variables

In [2]:
IP = '10.78.26.241'
HTTP_URL    = IP + ':9000'  # HTTP Service
GRPC_URL    = IP + ':9001'  # GRPC Inference Service
METRICS_URL = IP + ':9002'  # Metrics Service

# short URL
URL = HTTP_URL

## 1. Version

### `GET /v2`

In [343]:
!curl $URL/v2 | jq

{
  "name": "triton",
  "version": "2.11.0",
  "extensions": [
    "classification",
    "sequence",
    "model_repository",
    "model_repository(unload_dependents)",
    "schedule_policy",
    "model_configuration",
    "system_shared_memory",
    "cuda_shared_memory",
    "binary_tensor_data",
    "statistics"
  ]
}


## 2. Health Check

### `GET /v2/health/live`
- Failed:
  ```
  curl: (7) Failed to connect to 10.78.26.241 port 9000: Connection refused
  ```
- Successful
  <br>200 OK, empty

In [714]:
!curl $URL/v2/health/live

In [715]:
!curl -v $URL/v2/health/live

*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> GET /v2/health/live HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host 10.78.26.241 left intact


### `GET /v2/health/ready`
- Failed:
  ```
  curl: (7) Failed to connect to 10.78.26.241 port 9000: Connection refused
  ```
- Successful
  <br>200 OK, empty

In [610]:
!curl $URL/v2/health/ready

In [611]:
!curl -v $URL/v2/health/ready

*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host 10.78.26.241 left intact
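The same liveness and readiness checks can be done from Python without curl. A minimal standard-library sketch, assuming the server speaks plain HTTP as in the curl calls above; `check_endpoint`, `is_live` and `is_ready` are hypothetical helper names. Any connection error (such as the "Connection refused" case) or non-200 reply is reported as `False`:

```python
import http.client
import urllib.request

def check_endpoint(host_port, path, timeout=2.0):
    """Return True only if GET http://<host_port><path> answers 200 OK."""
    url = "http://{}{}".format(host_port, path)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # OSError covers refused connections, timeouts and urllib's
        # URLError/HTTPError (e.g. 400 Bad Request); HTTPException covers
        # replies that are not valid HTTP
        return False

def is_live(host_port):
    return check_endpoint(host_port, "/v2/health/live")

def is_ready(host_port):
    return check_endpoint(host_port, "/v2/health/ready")
```

In the notebook this would be called as `is_live(URL)` or `is_ready(URL)`.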


### `GET /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready`
- Not ready:
  <br>400 Bad Request
- Ready
  <br>200 OK, empty

In [6]:
MODEL_NAME = 'add_sub'
MODEL_VERSION = 1

In [325]:
!curl -v $URL/v2/models/$MODEL_NAME/ready

*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> GET /v2/models/inception_graphdef/ready HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 400 Bad Request
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host 10.78.26.241 left intact


In [7]:
!curl -v $URL/v2/models/$MODEL_NAME/versions/$MODEL_VERSION/ready

*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> GET /v2/models/add_sub/versions/1/ready HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host 10.78.26.241 left intact
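The optional `[/versions/${MODEL_VERSION}]` segment is shared by the `ready`, `config`, `stats` and `infer` endpoints. A small sketch of a path builder for that pattern; `model_endpoint` is a hypothetical helper:

```python
def model_endpoint(model_name, action, model_version=None):
    """Build /v2/models/<name>[/versions/<version>]/<action>."""
    path = "/v2/models/{}".format(model_name)
    if model_version is not None:
        path += "/versions/{}".format(model_version)
    return "{}/{}".format(path, action)

print(model_endpoint("add_sub", "ready"))     # /v2/models/add_sub/ready
print(model_endpoint("add_sub", "ready", 1))  # /v2/models/add_sub/versions/1/ready
```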


## 3. Model Repository
- References
  - [Model Repository Extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_repository.md#httprest)

### `POST /v2/repository/index`

In [244]:
!curl -v -X POST $URL/v2/repository/index | jq

*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/index HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 105
< 
{ [105 bytes data]
* Connection #0 to host 10.78.26.241 left intact
[
  {
    "name": "add_sub"
  },
  {
    "name": "inception_graphdef",
    "version": "2",
    "state": "READY"

### Load model names from `/v2/repository/index`

In [167]:
import json

# load the model index from '/v2/repository/index'
def get_model_index():
    # -sS: hide curl's progress meter (which would otherwise be captured
    # as extra lines) but still report connection errors
    responses = !curl -sS -X POST $URL/v2/repository/index
    response = ''.join(responses)
    
    # response:
    #   [{"name":"add_sub","version":"1","state":"READY"}, ...]
    try:
        return json.loads(response)
    except ValueError:
        # e.g. "curl: (7) Failed to connect ... Connection refused"
        raise Exception('\n' + '\n'.join(responses))

# load model names from 'v2/repository/index'
def get_model_names():
    model_index = get_model_index()
    
    model_names = []
    for model in model_index:
        if model['name'] not in model_names:
            model_names.append(model['name'])
    
    return model_names


print("model index:", json.dumps(get_model_index(), indent=4, sort_keys=False))
print("model names:", get_model_names())

model index: [
    {
        "name": "add_sub",
        "version": "1",
        "state": "UNAVAILABLE",
        "reason": "unloaded"
    },
    {
        "name": "inception_graphdef",
        "version": "1",
        "state": "UNAVAILABLE",
        "reason": "unloaded"
    },
    {
        "name": "resnet18_onnx",
        "version": "1",
        "state": "UNAVAILABLE",
        "reason": "unloaded"
    }
]
model names: ['add_sub', 'inception_graphdef', 'resnet18_onnx']


curl: (7) Failed to connect to 10.78.26.241 port 9000: Connection refused
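`get_model_names` above de-duplicates with a list lookup; a slightly shorter sketch uses `dict.fromkeys`, which keeps insertion order (guaranteed from Python 3.7). The hypothetical `ready_versions` helper also picks out loaded versions, which is handy before sending inference requests. The sample index combines entries like those printed above:

```python
def model_names(model_index):
    """Unique model names, keeping the order of first appearance."""
    return list(dict.fromkeys(entry["name"] for entry in model_index))

def ready_versions(model_index):
    """Map model name -> versions whose state is READY.

    Entries for models that were never loaded may carry only a "name"
    (see the `add_sub` entry above), so "state" and "version" are optional.
    """
    ready = {}
    for entry in model_index:
        if entry.get("state") == "READY":
            ready.setdefault(entry["name"], []).append(entry.get("version"))
    return ready

index = [
    {"name": "add_sub"},
    {"name": "inception_graphdef", "version": "2", "state": "READY"},
    {"name": "resnet18_onnx", "version": "1", "state": "UNAVAILABLE", "reason": "unloaded"},
]
print(model_names(index))     # ['add_sub', 'inception_graphdef', 'resnet18_onnx']
print(ready_versions(index))  # {'inception_graphdef': ['2']}
```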


### `POST /v2/repository/models/${MODEL_NAME}/load`
- Failed:
  - No response: the model was still in the 'UNLOADING' state.
  - The container was started with `--gpus='"device=0"'`, but `instance_group.gpus` was set to `1` (only one GPU, index 0, is visible inside the container)
    - server log: `unsupported gpu id 1`
    - client log: `{"error":"failed to load 'resnet18_onnx', no version is available"}`
  - Out of shared memory
    - server log:
      ```
      model_repository_manager.cc:1215] failed to load '${MODEL_NAME}' version ${MODEL_VERSION}: Internal: Unable to initialize shared memory key '/${MODEL_NAME}_0_CPU_0' to requested size (67108864 bytes). If you are running Triton inside docker, use '--shm-size' flag to control the shared memory region size. Each Python backend model instance requires at least 64MBs of shared memory. Flag '--shm-size=5G' should be sufficient for common usecases. Error: No such file or directory
      ```
- Successful
  <br>200 OK, empty

In [172]:
MODEL_NAME = ['add_sub', 'inception_graphdef', 'resnet18_onnx'][2]

!curl -v -X POST $URL/v2/repository/models/$MODEL_NAME/load

*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/models/resnet18_onnx/load HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 0
< 
* Connection #0 to host 10.78.26.241 left intact
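The load and unload calls are plain POSTs with no payload, so they are easy to issue from Python as well. A minimal sketch that only builds the request; `repository_request` is a hypothetical helper. Sending it with `urllib.request.urlopen(req, timeout=...)` would raise `urllib.error.HTTPError` for a non-2xx reply, such as the failures listed above:

```python
import urllib.request

def repository_request(host_port, model_name, action):
    """Build (but do not send) a POST to /v2/repository/models/<name>/<action>."""
    if action not in ("load", "unload"):
        raise ValueError("action must be 'load' or 'unload'")
    url = "http://{}/v2/repository/models/{}/{}".format(host_port, model_name, action)
    # an empty body; the curl calls above send no data either
    return urllib.request.Request(url, data=b"", method="POST")

req = repository_request("10.78.26.241:9000", "resnet18_onnx", "load")
print(req.get_method(), req.full_url)
```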


### `POST /v2/repository/models/${MODEL_NAME}/unload`
- Failed:
  
- Successful
  <br>200 OK, empty

In [174]:
MODEL_NAME = ['add_sub', 'inception_graphdef', 'resnet18_onnx'][2]

!curl -v -X POST $URL/v2/repository/models/$MODEL_NAME/unload

*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/models/resnet18_onnx/unload HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 0
< 
* Connection #0 to host 10.78.26.241 left intact


## 4. Model

### `GET /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/config`
- Not loaded yet (no state in the repository index)
  - `{ "error": "Request for unknown model: 'inception_graphdef' is not found" }`
- Not ready (state: "UNAVAILABLE", reason: "unloaded")
  - `{ "error": "Request for unknown model: '${MODEL_NAME}' has no available versions" }`
  - `{ "error": "Request for unknown model: '${MODEL_NAME}' version ${MODEL_VERSION} is not at ready state" }`
- Ready (state: "READY")
  <br>200 OK, JSON result

In [118]:
MODEL_NAME = ['add_sub', 'inception_graphdef', 'resnet18_onnx'][2]
MODEL_VERSION = 2

# current version
!curl -v $URL/v2/models/$MODEL_NAME/config | jq

# specified version
#!curl -v $URL/v2/models/$MODEL_NAME/versions/$MODEL_VERSION/config | jq

*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> GET /v2/models/resnet18_onnx/config HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 923
< 
{ [923 bytes data]
* Connection #0 to host 10.78.26.241 left intact
{
  "name": "resnet18_onnx",
  "platform": "onnxruntime_onnx",
  "backend": "onnxruntime",
  "version_policy": {
    "all"

### `GET /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/stats`
- Not ready:
  <br>`{ "error": "requested model '${MODEL_NAME}' is not available" }`
- Ready
  <br>200 OK, JSON result

In [117]:
MODEL_NAME = ['add_sub', 'inception_graphdef', 'resnet18_onnx'][2]
MODEL_VERSION = 2

# current version
!curl -v $URL/v2/models/$MODEL_NAME/stats | jq

# specified version
#!curl -v $URL/v2/models/$MODEL_NAME/versions/$MODEL_VERSION/stats | jq

*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> GET /v2/models/resnet18_onnx/stats HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 986
< 
{ [986 bytes data]
* Connection #0 to host 10.78.26.241 left intact
{
  "model_stats": [
    {
      "name": "resnet18_onnx",
      "version": "1",
      "last_inference": 0,
      "infe

### `POST /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer`

In [4]:
def execute_command(cmd):
    #print('cmd:', cmd, '\n')
    
    # method 1: IPython shell escape; $cmd expands the Python variable
    ! $cmd
    
    # method 2: os.system returns the exit status, not the command's output
    #   import os
    #   os.system(cmd)
    
    # method 3: capture the output and print it
    #   import subprocess
    #   output = subprocess.check_output(cmd, shell=True)
    #   print(output.decode('utf-8'))


In [63]:
def execute_image_client_py(model_name, model_version=None, image_filename='images/mug.jpg', 
                            verbose=False, no_output=False):
    cmd = '''
        python3.6 \
            "image_client.py" \
            -u {url} \
            -m {model_name} \
            {model_version} \
            {verbose} \
            -s INCEPTION \
            {image_filename} \
        '''
    cmd = cmd.format(
        url=URL, 
        model_name=model_name,
        model_version='-x ' + str(model_version) if model_version is not None else '',
        verbose='-v' if verbose else '',  # image_client.py -v: verbose output
        image_filename=image_filename
    )
    
    if no_output:
        # '2>&1 > /dev/null' discards stdout only; stderr is still shown in the cell
        cmd += ' 2>&1 > /dev/null'
    
    execute_command(cmd)
    
    if not no_output:
        print('\n')
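Building the command as one shell string means quoting and redirection are handled by the shell. An alternative sketch passes an argument list to `subprocess.run`, which avoids shell quoting and captures stdout and stderr separately. It mirrors the flags used in `execute_image_client_py` (`-u`, `-m`, `-x`, `-v`, `-s`); `image_client_args` and `run_image_client` are hypothetical names, and `stdout=PIPE`/`universal_newlines` are used instead of `capture_output`/`text` so it also runs on Python 3.6:

```python
import subprocess

def image_client_args(url, model_name, image_filename='images/mug.jpg',
                      model_version=None, scaling='INCEPTION', verbose=False):
    """Build the image_client.py argument list (no shell involved)."""
    args = ['python3.6', 'image_client.py', '-u', url, '-m', model_name, '-s', scaling]
    if model_version is not None:
        args += ['-x', str(model_version)]
    if verbose:
        args.append('-v')
    args.append(image_filename)
    return args

def run_image_client(**kwargs):
    """Run image_client.py; return (exit code, stdout, stderr) as text."""
    completed = subprocess.run(image_client_args(**kwargs),
                               stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                               universal_newlines=True)
    return completed.returncode, completed.stdout, completed.stderr
```

Usage in the notebook: `code, out, err = run_image_client(url=URL, model_name='inception_graphdef', model_version=1)`.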


### `POST /v2/models/add_sub[/versions/${MODEL_VERSION}]/infer`

- shell
```
curl -X POST 10.78.26.241:9000/v2/models/add_sub/infer \
  --data '{"inputs": [{"name": "INPUT0", "shape": [4], "datatype": "FP32", "data": [1, 2, 3, 4]}, {"name": "INPUT1", "shape": [4], "datatype": "FP32", "data": [1, 1, 1, 1]}]}'
```
or
```
curl -X POST 10.78.26.241:9000/v2/models/add_sub/infer \
  --data "{\"inputs\": [ \
    {\"name\": \"INPUT0\", \"shape\": [4], \"datatype\": \"FP32\", \"data\": [1, 2, 3, 4]}, \
    {\"name\": \"INPUT1\", \"shape\": [4], \"datatype\": \"FP32\", \"data\": [1, 1, 1, 1]} \
  ]}"
```

- shell in notebook
```
!curl -X POST $URL/v2/models/add_sub/infer \
  --data "{{\"inputs\": [ \
    {{ \"name\": \"INPUT0\", \"shape\": [4], \"datatype\": \"FP32\", \"data\": [1, 2, 3, 4] }}, \
    {{ \"name\": \"INPUT1\", \"shape\": [4], \"datatype\": \"FP32\", \"data\": [1, 1, 1, 1] }} \
    ]}}"
```
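- Notes on the notebook form:
  - `{` and `}` must be doubled (`{{`, `}}`): IPython treats `{expression}` in a `!` command as a Python expression to expand
  - wrap the JSON body in double quotes (escaping the inner quotes) rather than single quotes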
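Rather than escaping quotes and braces by hand, the request body can be generated with `json.dumps` and handed to curl (as `add_sub_infer` below does) or to any HTTP client. A minimal sketch for 1-D inputs only; `make_infer_body` is a hypothetical helper:

```python
import json

def make_infer_body(inputs):
    """Build a JSON inference request body from (name, datatype, values) tuples.

    Only 1-D tensors are handled: each shape is [len(values)].
    """
    return json.dumps({
        "inputs": [
            {"name": name, "shape": [len(values)], "datatype": datatype, "data": list(values)}
            for name, datatype, values in inputs
        ]
    })

body = make_infer_body([("INPUT0", "FP32", [1, 2, 3, 4]),
                        ("INPUT1", "FP32", [1, 1, 1, 1])])
print(body)
```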

In [55]:
import json

def add_sub_infer(version=None, verbose=False, no_output=False):
    data = {
        'inputs': [
            {
                'name': 'INPUT0',
                'shape': [4],
                'datatype': 'FP32',
                'data': [1, 2, 3, 4]
            },
            {
                'name': 'INPUT1',
                'shape': [4],
                'datatype': 'FP32',
                'data': [1, 1, 1, 1]
            }
        ]
    }
    
    if version is None:
        endpoint = URL + '/v2/models/add_sub/infer'
    else:
        endpoint = URL + '/v2/models/add_sub/versions/{}/infer'.format(version)
    
    # the body is single-quoted for the shell; json.dumps emits only double quotes
    cmd = "curl -s {verbose} -X POST {endpoint} --data '{data}'".format(
        verbose='-v' if verbose else '', 
        endpoint=endpoint, 
        data=json.dumps(data))
    
    if no_output:
        # discards stdout only; stderr (e.g. the -v trace) still reaches the cell
        cmd += ' 2>&1 > /dev/null'
    
    execute_command(cmd)
    
    if not no_output:
        print('\n')


add_sub_infer(no_output=False)
add_sub_infer(version=1,no_output=True)

{"model_name":"add_sub","model_version":"1","outputs":[{"name":"OUTPUT0","datatype":"FP32","shape":[4],"data":[2.0,3.0,4.0,5.0]},{"name":"OUTPUT1","datatype":"FP32","shape":[4],"data":[0.0,1.0,2.0,3.0]}]}
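To check the result programmatically, the response can be mapped from output name to data. A minimal sketch using the response printed above; `outputs_by_name` is a hypothetical helper. As that output shows, `OUTPUT0` is `INPUT0 + INPUT1` and `OUTPUT1` is `INPUT0 - INPUT1`:

```python
import json

def outputs_by_name(response_text):
    """Map each output tensor name to its (flattened) data list."""
    response = json.loads(response_text)
    return {output["name"]: output["data"] for output in response.get("outputs", [])}

# response printed by add_sub_infer() above
response_text = ('{"model_name":"add_sub","model_version":"1","outputs":['
                 '{"name":"OUTPUT0","datatype":"FP32","shape":[4],"data":[2.0,3.0,4.0,5.0]},'
                 '{"name":"OUTPUT1","datatype":"FP32","shape":[4],"data":[0.0,1.0,2.0,3.0]}]}')
outputs = outputs_by_name(response_text)

input0, input1 = [1, 2, 3, 4], [1, 1, 1, 1]
print(outputs["OUTPUT0"] == [a + b for a, b in zip(input0, input1)])  # True
print(outputs["OUTPUT1"] == [a - b for a, b in zip(input0, input1)])  # True
```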



### `POST /v2/models/inception_graphdef[/versions/${MODEL_VERSION}]/infer`
- python-based (via [image_client.py](https://github.com/triton-inference-server/client/blob/main/src/python/examples/image_client.py))
  ```
  python3.6  "image_client.py" \
    -u 10.78.26.241:9000 \
    -m inception_graphdef \
    -s INCEPTION \
    images/mug.jpg
  ```


In [66]:
import json

def inception_graphdef_infer(version=None, verbose=False, no_output=False):
    execute_image_client_py(
        model_name='inception_graphdef',
        model_version=version,
        verbose=verbose,
        no_output=no_output
    )

inception_graphdef_infer(no_output=True)
inception_graphdef_infer(version=1)

Request 1, batch size 1
    0.826453 (505) = COFFEE MUG
PASS
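When the client output is captured as text (for example with `subprocess`), the classification lines can be extracted with a regex. A minimal sketch, assuming the output keeps the `score (class index) = LABEL` format shown above; `parse_classifications` is a hypothetical helper:

```python
import re

# e.g. "    0.826453 (505) = COFFEE MUG"
CLASS_LINE = re.compile(r'^\s*(\d+(?:\.\d+)?)\s+\((\d+)\)\s+=\s+(.+?)\s*$')

def parse_classifications(output):
    """Return (score, class index, label) tuples found in image_client.py output."""
    results = []
    for line in output.splitlines():
        match = CLASS_LINE.match(line)
        if match:
            score, index, label = match.groups()
            results.append((float(score), int(index), label))
    return results

output = "Request 1, batch size 1\n    0.826453 (505) = COFFEE MUG\nPASS\n"
print(parse_classifications(output))  # [(0.826453, 505, 'COFFEE MUG')]
```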




### `POST /v2/models/resnet18_onnx[/versions/${MODEL_VERSION}]/infer`
- python-based (via [image_client.py](https://github.com/triton-inference-server/client/blob/main/src/python/examples/image_client.py))
  ```
  python3.6  "image_client.py" \
    -u 10.78.26.241:9000 \
    -m resnet18_onnx \
    -s INCEPTION \
    images/mug.jpg
  ```


In [173]:
import json

def resnet18_onnx_infer(version=None, verbose=False, no_output=False):
    execute_image_client_py(
        model_name='resnet18_onnx',
        model_version=version,
        verbose=verbose,
        no_output=no_output
    )

resnet18_onnx_infer(no_output=True)
resnet18_onnx_infer(version=2)

failed to retrieve the metadata: Request for unknown model: 'resnet18_onnx' version 2 is not found
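The error above presumably means version 2 of `resnet18_onnx` was not available in the repository used for this run (compare the model index printed earlier, which lists only version 1). A small guard can check the index before calling the client; `has_version` is a hypothetical helper. Versions are strings in the index, and a model that has never been loaded may be listed without a version, so `False` can also mean "not loaded yet":

```python
def has_version(model_index, model_name, version):
    """True if /v2/repository/index lists this model version (in any state)."""
    return any(entry.get("name") == model_name and entry.get("version") == str(version)
               for entry in model_index)

# model index printed by get_model_index() earlier (state fields omitted)
index = [
    {"name": "add_sub", "version": "1"},
    {"name": "inception_graphdef", "version": "1"},
    {"name": "resnet18_onnx", "version": "1"},
]
print(has_version(index, "resnet18_onnx", 2))  # False
print(has_version(index, "resnet18_onnx", 1))  # True
```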




## 5. GPU Memory

### `nvidia-smi`

In [99]:
!nvidia-smi

Thu Aug 26 11:03:33 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
| 27%   26C    P8     8W / 250W |   1096MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:03:00.0 Off |                  N/A |
| 20%   31C    P8     8W / 250W |   1231MiB / 11178MiB |      0%      Default |
|       

In [82]:
import re

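# process rows look roughly like:
# |    0   N/A  N/A    638528      C   tritonserver                      199MiB |
PROCESS_ROW = re.compile(r'^\|\s+(\d+)\s+\S+\s+\S+\s+(\d+).*?(\d+)MiB\s\|$')

def get_gpu_memory(gpu, pid):
    """Return the GPU memory (MiB) used by `pid` on GPU `gpu`, or None if not listed."""
    response = !nvidia-smi
    
    for line in response:
        matcher = PROCESS_ROW.search(line)
        
        if matcher is not None:
            info_gpu, info_pid, info_memory = matcher.groups()
            
            if int(gpu) == int(info_gpu) and int(pid) == int(info_pid):
                return int(info_memory)
    
    # note: callers format the result with %d, which fails on None
    return None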


gpu = input('Please input gpu-index, i.e. [0|1|2|3]:')
pid = input('Please input pid:')
get_gpu_memory(gpu=gpu, pid=pid)

Please input gpu-index, i.e. [0|1|2|3]:0
Please input pid:638528


199

True
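Parsing the human-readable `nvidia-smi` table depends on its column layout. An alternative sketch uses nvidia-smi's CSV query mode; the exact invocation (`-i <gpu> --query-compute-apps=pid,used_memory --format=csv,noheader,nounits`) is an assumption to verify with `nvidia-smi --help-query-compute-apps` on your driver. Only the parsing is exercised here, on sample lines built from the pid/MiB readings in this notebook; `parse_compute_apps` and `get_gpu_memory_csv` are hypothetical names:

```python
import subprocess

def parse_compute_apps(csv_text):
    """Map pid -> used GPU memory in MiB from 'pid, used_memory' CSV lines."""
    usage = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # skip malformed lines and values such as "[N/A]"
        usage[int(fields[0])] = int(fields[1])
    return usage

def get_gpu_memory_csv(gpu, pid):
    """Return MiB used by `pid` on GPU `gpu`, or None (assumes -i filters by GPU)."""
    output = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu),
         "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        universal_newlines=True)
    return parse_compute_apps(output).get(int(pid))

print(parse_compute_apps("638528, 199\n751546, 1017\n"))  # {638528: 199, 751546: 1017}
```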

## 7. Stability Tests

### Stability Tests / Load & Unload Tests

### Stability Tests / Load & Unload Tests / One Model, Many Versions
> A single model with multiple versions; repeatedly load / unload the last version

- `version_policy: { all { }}`
- Server logs: (for example)
  ```
  loading: inception_graphdef:1
  loading: inception_graphdef:2
  loading: inception_graphdef:3
  successfully loaded 'inception_graphdef' version 1
  successfully loaded 'inception_graphdef' version 2
  successfully loaded 'inception_graphdef' version 3
  unloading: inception_graphdef:1
  unloading: inception_graphdef:2
  unloading: inception_graphdef:3
  successfully unloaded 'inception_graphdef' version 2   (unload order is not guaranteed)
  successfully unloaded 'inception_graphdef' version 3
  successfully unloaded 'inception_graphdef' version 1
  ...
  (repeated)
  ```
- Failed case:
  - Python backend model with many versions

In [112]:
import json

# get the following info from nvidia-smi
gpu = input('Please input gpu-index, i.e. [0|1|2|3]: ')
pid = input('Please input gpu-pid: ')
print('Current GPU memory: %dMiB' % get_gpu_memory(gpu=gpu, pid=pid))
print()

MODEL_NAMES = get_model_names()
print("MODEL_NAMES:", MODEL_NAMES)
print()

MODEL_NAME = MODEL_NAMES[0]

print('Round:', end=' ')
print('[%dMiB] -> ' % get_gpu_memory(gpu=gpu, pid=pid), end='')
for i in range(10):
    print(i, end='')
    !curl -X POST $URL/v2/repository/models/$MODEL_NAME/load
    
    # warmup
    function_name = MODEL_NAME + '_infer'
    globals()[function_name](no_output=True)

    !curl -X POST $URL/v2/repository/models/$MODEL_NAME/unload
    
    print(' -> [%dMiB] -> ' % get_gpu_memory(gpu=gpu, pid=pid), end='')

print('\n')
print('Final state:')
print(json.dumps(get_model_index(), indent=4))
print('Done!')

Please input gpu-index, i.e. [0|1|2|3]: 0
Please input gpu-pid: 751546
Current GPU memory: 199MiB

MODEL_NAMES: ['resnet18_onnx']

Round: [199MiB] -> 0 -> [1017MiB] -> 1 -> [1017MiB] -> 2 -> [1017MiB] -> 3 -> [1017MiB] -> 4 -> [1017MiB] -> 5 -> [1017MiB] -> 6 -> [1017MiB] -> 7 -> [1017MiB] -> 8 -> [1017MiB] -> 9 -> [1017MiB] -> 

Final state:
[
    {
        "name": "resnet18_onnx",
        "version": "1",
        "state": "UNAVAILABLE",
        "reason": "unloaded"
    },
    {
        "name": "resnet18_onnx",
        "version": "2",
        "state": "UNAVAILABLE",
        "reason": "unloaded"
    },
    {
        "name": "resnet18_onnx",
        "version": "3",
        "state": "UNAVAILABLE",
        "reason": "unloaded"
    }
]
Done!
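A quick way to read the "Round:" lines is to compare memory after the first round with memory after the last one. A minimal sketch; `steady_state_growth` and `looks_leaky` are hypothetical helpers. Round 0 is treated as warm-up because, in the runs above, memory jumps once and then stays flat:

```python
def steady_state_growth(samples):
    """GPU memory change (MiB) from the end of round 0 to the end of the last round.

    samples[0] is the reading before round 0 and samples[i + 1] the
    reading after round i.
    """
    if len(samples) < 3:
        raise ValueError("need a baseline reading and at least two rounds")
    return samples[-1] - samples[1]

def looks_leaky(samples, tolerance_mib=0):
    """Flag a run whose memory keeps growing after the warm-up round."""
    return steady_state_growth(samples) > tolerance_mib

# readings from the one-model, many-versions run above
run = [199] + [1017] * 10
print(steady_state_growth(run), looks_leaky(run))  # 0 False
```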


### Stability Tests / Load & Unload Tests / Many Models, One Version per Model
> The repository holds several models (one version each)
> - after loading all models, repeatedly load / unload one of them


Triton server parameter:
- mount path
    ```
    -v ~/tj_tsai/workspace/infra/triton_server/workspace/models_v1:/models
    ```

In [107]:
import json

# get the following info from nvidia-smi
gpu = input('Please input gpu-index, i.e. [0|1|2|3]: ')
pid = input('Please input gpu-pid: ')
print('Current GPU memory: %dMiB' % get_gpu_memory(gpu=gpu, pid=pid))
print()

MODEL_NAMES = get_model_names()
print("MODEL_NAMES:", MODEL_NAMES)
print()

# STEP1: load all models
print('loading all models:')
print('>> ' '[%dMiB] ' % get_gpu_memory(gpu=gpu, pid=pid))
for MODEL_NAME in MODEL_NAMES:
    print('-', '[%dMiB] ->' % get_gpu_memory(gpu=gpu, pid=pid), MODEL_NAME, end=' ')
    !curl -X POST $URL/v2/repository/models/$MODEL_NAME/load
    print('-> [%dMiB] ' % get_gpu_memory(gpu=gpu, pid=pid))
print('<< ' '[%dMiB] ' % get_gpu_memory(gpu=gpu, pid=pid))


# STEP2: repeat loading/unloading a model 10 times
print()
print('load & infer & unload for each round:')
print('-' * 60)

for MODEL_NAME in MODEL_NAMES:
    print('Target model:', MODEL_NAME)
    print('Round:', end=' ')
    
    print('[%dMiB] -> ' % get_gpu_memory(gpu=gpu, pid=pid), end='')
    
    for i in range(10):
        print(i, end='')
        !curl -X POST $URL/v2/repository/models/$MODEL_NAME/load
        
        # warmup
        function_name = MODEL_NAME + '_infer'
        globals()[function_name](no_output=True)
        
        !curl -X POST $URL/v2/repository/models/$MODEL_NAME/unload
        
        print(' -> [%dMiB] -> ' % get_gpu_memory(gpu=gpu, pid=pid), end='')
    print('\n')

print('-' * 60)
print('Final state:')
print(json.dumps(get_model_index(), indent=4))
print('Done!')

Please input gpu-index, i.e. [0|1|2|3]: 0
Please input gpu-pid: 701226
Current GPU memory: 199MiB

MODEL_NAMES: ['add_sub', 'inception_graphdef', 'resnet18_onnx']

loading all models:
>> [199MiB] 
- [199MiB] -> add_sub -> [273MiB] 
- [273MiB] -> inception_graphdef -> [827MiB] 
- [827MiB] -> resnet18_onnx -> [1087MiB] 
<< [1087MiB] 

load & infer & unload for each round:
------------------------------------------------------------
Target model: add_sub
Round: [1087MiB] -> 0 -> [1087MiB] -> 1 -> [1087MiB] -> 2 -> [1087MiB] -> 3 -> [1087MiB] -> 4 -> [1087MiB] -> 5 -> [1087MiB] -> 6 -> [1087MiB] -> 7 -> [1087MiB] -> 8 -> [1087MiB] -> 9 -> [1087MiB] -> 

Target model: inception_graphdef
Round: [1087MiB] -> 0 -> [4497MiB] -> 1 -> [4497MiB] -> 2 -> [4497MiB] -> 3 -> [4497MiB] -> 4 -> [4497MiB] -> 5 -> [4497MiB] -> 6 -> [4497MiB] -> 7 -> [4497MiB] -> 8 -> [4497MiB] -> 9 -> [4497MiB] -> 

Target model: resnet18_onnx
Round: [4497MiB] -> 0 -> [4413MiB] -> 1 -> [4413MiB] -> 2 -> [4413MiB] -> 3 -> 

### Stability Tests / Load & Unload Tests / Many Models, One Version per Model-2
> The repository holds several models (one version each)
> - load / unload each model in turn (note: the server holds only one model at a time)

In [109]:
import json

# get the following info from nvidia-smi
gpu = input('Please input gpu-index, i.e. [0|1|2|3]: ')
pid = input('Please input gpu-pid: ')
print('Current GPU memory: %dMiB' % get_gpu_memory(gpu=gpu, pid=pid))
print()

MODEL_NAMES = get_model_names()
print("MODEL_NAMES:", MODEL_NAMES)
print()

print('-' * 60)
print('>> ' '[%dMiB] ' % get_gpu_memory(gpu=gpu, pid=pid))

# repeat loading/unloading each model 10 times
for i in range(10):
    print('Round-' + str(i) + ': ', end='')
    print('[%dMiB] -> ' % get_gpu_memory(gpu=gpu, pid=pid), end='')
    
    for MODEL_NAME in MODEL_NAMES:
        print(MODEL_NAME, end='')
        !curl -X POST $URL/v2/repository/models/$MODEL_NAME/load
        
        # warmup
        function_name = MODEL_NAME + '_infer'
        globals()[function_name](no_output=True)
        
        !curl -X POST $URL/v2/repository/models/$MODEL_NAME/unload
        
        print(' -> [%dMiB] -> ' % get_gpu_memory(gpu=gpu, pid=pid), end='')
    print()
    
print('<< ' '[%dMiB] ' % get_gpu_memory(gpu=gpu, pid=pid))

print('-' * 60)
print('Final state:')
!sleep 1 # wait for the 'add_sub' model to finish
print(json.dumps(get_model_index(), indent=4))
print('Done!')

Please input gpu-index, i.e. [0|1|2|3]: 0
Please input gpu-pid: 718066
Current GPU memory: 199MiB

MODEL_NAMES: ['add_sub', 'inception_graphdef', 'resnet18_onnx']

------------------------------------------------------------
>> [199MiB] 
Round-0: [199MiB] -> add_sub -> [273MiB] -> inception_graphdef -> [4391MiB] -> resnet18_onnx -> [4411MiB] -> 
Round-1: [4411MiB] -> add_sub -> [4411MiB] -> inception_graphdef -> [4411MiB] -> resnet18_onnx -> [4411MiB] -> 
Round-2: [4411MiB] -> add_sub -> [4411MiB] -> inception_graphdef -> [4411MiB] -> resnet18_onnx -> [4411MiB] -> 
Round-3: [4411MiB] -> add_sub -> [4411MiB] -> inception_graphdef -> [4411MiB] -> resnet18_onnx -> [4411MiB] -> 
Round-4: [4411MiB] -> add_sub -> [4411MiB] -> inception_graphdef -> [4411MiB] -> resnet18_onnx -> [4411MiB] -> 
Round-5: [4411MiB] -> add_sub -> [4411MiB] -> inception_graphdef -> [4411MiB] -> resnet18_onnx -> [4411MiB] -> 
Round-6: [4411MiB] -> add_sub -> [4411MiB] -> inception_graphdef -> [4411MiB] -> resnet18_on

In [261]:
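import os
from datetime import datetime

# metrics snapshot before the run
t0 = datetime.now()
print(t0)
!curl localhost:9002/metrics | grep inception
print('-' * 60)

# fire 1000 image_client.py requests; the trailing '&' backgrounds each one,
# so the loop does not wait for the responses
times = 1000
for i in range(times):
    if i % 100 == 0: print(i)
        
    cmd = '''
        python3.6 \
            "image_client.py" \
            -m inception_graphdef \
            -s INCEPTION \
            "/home/ocistn3/victorlw_chen/images/mug.jpg" -x 2 \
            -u localhost:9000 \
             2>&1 > /dev/null &
        '''
    #!$cmd
    os.system(cmd)
    #os.system('sleep 0.3')

# metrics snapshot after the run (background requests may still be in flight)
print('-' * 60)
t1 = datetime.now()
print(t1)
!curl localhost:9002/metrics | grep inception

print('-' * 60)
# time taken to launch the clients, not to serve every request
print('elapsed time:', (t1 - t0).total_seconds())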

2021-08-26 16:03:41.773755
nv_inference_request_success{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="inception_graphdef",version="2"} 2.000000
nv_inference_request_failure{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="inception_graphdef",version="2"} 0.000000
nv_inference_count{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="inception_graphdef",version="2"} 2.000000
nv_inference_exec_count{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="inception_graphdef",version="2"} 2.000000
nv_inference_request_duration_us{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="inception_graphdef",version="2"} 13917127.000000
nv_inference_queue_duration_us{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="
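To compare the two metrics snapshots without `grep`, the Prometheus text output can be parsed per model. A minimal sketch that handles only `name{labels} value` lines like those above (no escaped quotes in label values, no timestamps); `model_metrics` is a hypothetical helper, and values with the same metric name are summed across `gpu_uuid` labels. Subtracting the "before" snapshot from the "after" one gives the counts for the run itself:

```python
import re

METRIC_LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

def model_metrics(text, model, version=None):
    """Map metric name -> value for one model, summed across GPUs."""
    result = {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if not match:
            continue  # '# HELP' / '# TYPE' comments and unlabeled metrics
        name, labels_text, value = match.groups()
        labels = dict(LABEL.findall(labels_text))
        if labels.get("model") != model:
            continue
        if version is not None and labels.get("version") != str(version):
            continue
        result[name] = result.get(name, 0.0) + float(value)
    return result

snapshot = '''
nv_inference_request_success{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="inception_graphdef",version="2"} 2.000000
nv_inference_request_duration_us{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="inception_graphdef",version="2"} 13917127.000000
'''
metrics = model_metrics(snapshot, "inception_graphdef", version=2)
# cumulative duration divided by the success count (failures are 0 here)
print(metrics["nv_inference_request_duration_us"] / metrics["nv_inference_request_success"])  # 6958563.5
```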