# Pretrained  GPT2  Model Deployment Example

In this notebook, we will run an example of text generation using GPT2 model exported from HuggingFace and deployed with Seldon's Triton pre-packed server. the example also covers converting the model to ONNX format.
The implemented example below is of the Greedy approach for the next token prediction.
more info: https://huggingface.co/transformers/model_doc/gpt2.html?highlight=gpt2

After we have the module deployed to Kubernetes, we will run a simple load test to evaluate the module inference performance.


## Steps:
1. Download pretrained GPT2 model from hugging face
2. Convert the model to ONNX
3. Store it in MinIo bucket
4. Setup Seldon-Core in your kubernetes cluster
5. Deploy the ONNX model with Seldon’s prepackaged Triton server.
6. Interact with the model, run a greedy alg example (generate sentence completion)
7. Run load test using vegeta
8. Clean-up

## Basic requirements
* Helm v3.0.0+
* A Kubernetes cluster running v1.13 or above (minkube / docker-for-windows work well if enough RAM)
* kubectl v1.14+
* Python 3.6+
* install tensorflow and pytorch 

In [1]:
%%writefile requirements.txt
transformers==4.5.1
tokenizers<0.11,>=0.10.1
tf2onnx

Overwriting requirements.txt


In [2]:
!pip install --trusted-host=pypi.python.org --trusted-host=pypi.org --trusted-host=files.pythonhosted.org -r requirements.txt

Collecting transformers==4.5.1
  Downloading transformers-4.5.1-py3-none-any.whl (2.1 MB)
[K     |████████████████████████████████| 2.1 MB 32.9 MB/s eta 0:00:01
[?25hCollecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 154.0 MB/s eta 0:00:01
[?25hCollecting tf2onnx
  Downloading tf2onnx-1.11.1-py3-none-any.whl (440 kB)
[K     |████████████████████████████████| 440 kB 151.6 MB/s eta 0:00:01
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
[K     |████████████████████████████████| 880 kB 157.8 MB/s eta 0:00:01
Collecting joblib
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
[K     |████████████████████████████████| 306 kB 157.0 MB/s eta 0:00:01
[?25hBuilding wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... [?25ldone
[?25h  Created wheel for s

### Export HuggingFace TFGPT2LMHeadModel pre-trained model and save it locally

In [None]:
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained(
    "gpt2", from_pt=True, pad_token_id=tokenizer.eos_token_id
)
model.save_pretrained("./tfgpt2model", saved_model=True)

### Convert the TensorFlow saved model to ONNX

In [None]:
!python -m tf2onnx.convert --saved-model ./tfgpt2model/saved_model/1 --opset 11  --output model.onnx

### Copy your model to a local MinIo
#### Setup MinIo
Use the provided [notebook](https://docs.seldon.io/projects/seldon-core/en/latest/examples/minio_setup.html) to install MinIo in your cluster and configure `mc` CLI tool. Instructions also [online](https://docs.min.io/docs/minio-client-quickstart-guide.html).

-- Note: You can use your prefer remote storage server (google/ AWS etc.)

#### Create a Bucket and store your model

In [6]:
!mc mb minio/minio-seldon/onnx-gpt2/ -p
!mc cp ./model.onnx minio/minio-seldon/onnx-gpt2/1

zsh:1: command not found: mc
zsh:1: command not found: mc


### Run Seldon in your kubernetes cluster

Follow the [Seldon-Core Setup notebook](https://docs.seldon.io/projects/seldon-core/en/latest/examples/seldon_core_setup.html) to Setup a cluster with Ambassador Ingress or Istio and install Seldon Core

### Deploy your model with Seldon pre-packaged Triton server

In [4]:
%%writefile secret.yaml

apiVersion: v1
kind: Secret
metadata:
  name: seldon-init-container-secret
type: Opaque
stringData:
  RCLONE_CONFIG_S3_TYPE: s3
  RCLONE_CONFIG_S3_PROVIDER: minio
  RCLONE_CONFIG_S3_ENV_AUTH: "false"
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: minioadmin
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: minioadmin
  RCLONE_CONFIG_S3_ENDPOINT: http://minio.minio-system.svc.cluster.local:9000


Overwriting secret.yaml


In [11]:
%%writefile gpt2-deploy.yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: gpt2
spec:
  predictors:
  - graph:
      implementation: TRITON_SERVER
      logger:
        mode: all
      modelUri: s3://minio-seldon/onnx-gpt2/1
      envSecretRefName: seldon-init-container-secret
      name: gpt2
      type: MODEL
    name: default
    replicas: 1
  protocol: kfserving

Overwriting gpt2-deploy.yaml


In [12]:
!kubectl apply -f secret.yaml -n default
!kubectl apply -f gpt2-deploy.yaml -n default

secret/seldon-init-container-secret configured
seldondeployment.machinelearning.seldon.io/gpt2 configured


In [13]:
!kubectl rollout status deploy/$(kubectl get deploy -l seldon-deployment-id=gpt2 -o jsonpath='{.items[0].metadata.name}')

Waiting for deployment "gpt2-default-0-gpt2" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "gpt2-default-0-gpt2" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "gpt2-default-0-gpt2" rollout to finish: 1 old replicas are pending termination...
deployment "gpt2-default-0-gpt2" successfully rolled out


#### Interact with the model: get model metadata (a "test" request to make sure our model is available and loaded correctly)

In [15]:
!curl -v http://localhost:32000/seldon/defualt/gpt2/v2/models/gpt2

*   Trying ::1:32000...
* TCP_NODELAY set
* connect to ::1 port 32000 failed: Connection refused
*   Trying 127.0.0.1:32000...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 32000 (#0)
> GET /seldon/defualt/gpt2/v2/models/gpt2 HTTP/1.1
> Host: localhost:32000
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< date: Fri, 24 Jun 2022 16:11:18 GMT
< server: istio-envoy
< content-length: 0
< 
* Connection #0 to host localhost left intact


### Run prediction test: generate a sentence completion using GPT2 model  - Greedy approach


In [1]:
import json

import numpy as np
import requests
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = "I enjoy working in Seldon"
count = 0
max_gen_len = 10
gen_sentence = input_text
while count < max_gen_len:
    input_ids = tokenizer.encode(gen_sentence, return_tensors="tf")
    shape = input_ids.shape.as_list()
    payload = {
        "inputs": [
            {
                "name": "input_ids",
                "datatype": "INT32",
                "shape": shape,
                "data": input_ids.numpy().tolist(),
            },
            {
                "name": "attention_mask",
                "datatype": "INT32",
                "shape": shape,
                "data": np.ones(shape, dtype=np.int32).tolist(),
            },
        ]
    }

    ret = requests.post(
        "http://localhost:8004/seldon/seldon/gpt2/v2/models/gpt2/infer", json=payload
    )

    try:
        res = ret.json()
    except:
        continue

    # extract logits
    logits = np.array(res["outputs"][1]["data"])
    logits = logits.reshape(res["outputs"][1]["shape"])

    # take the best next token probability of the last token of input ( greedy approach)
    next_token = logits.argmax(axis=2)[0]
    next_token_str = tokenizer.decode(
        next_token[-1:], skip_special_tokens=True, clean_up_tokenization_spaces=True
    ).strip()
    gen_sentence += " " + next_token_str
    count += 1

print(f"Input: {input_text}\nOutput: {gen_sentence}")

  from .autonotebook import tqdm as notebook_tqdm
2022-05-06 18:33:58.271431: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-05-06 18:33:58.271481: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (k8s-cluster): /proc/driver/nvidia/version does not exist
2022-05-06 18:33:58.272792: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using sep_token, but it is not set yet.


### Run Load Test / Performance Test using vegeta

#### Install vegeta, for more details take a look in [vegeta](https://github.com/tsenart/vegeta#install) official documentation

In [8]:
!wget https://github.com/tsenart/vegeta/releases/download/v12.8.3/vegeta-12.8.3-linux-amd64.tar.gz
!tar -zxvf vegeta-12.8.3-linux-amd64.tar.gz
!chmod +x vegeta

#### Generate vegeta [target file](https://github.com/tsenart/vegeta#-targets) contains "post" cmd with payload in the requiered structure

In [9]:
import base64
import json
from subprocess import PIPE, Popen, run

import numpy as np
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = "I enjoy working in Seldon"
input_ids = tokenizer.encode(input_text, return_tensors="tf")
shape = input_ids.shape.as_list()
payload = {
    "inputs": [
        {
            "name": "input_ids",
            "datatype": "INT32",
            "shape": shape,
            "data": input_ids.numpy().tolist(),
        },
        {
            "name": "attention_mask",
            "datatype": "INT32",
            "shape": shape,
            "data": np.ones(shape, dtype=np.int32).tolist(),
        },
    ]
}

cmd = {
    "method": "POST",
    "header": {"Content-Type": ["application/json"]},
    "url": "http://localhost:80/seldon/default/gpt2/v2/models/gpt2/infer",
    "body": base64.b64encode(bytes(json.dumps(payload), "utf-8")).decode("utf-8"),
}

with open("vegeta_target.json", mode="w") as file:
    json.dump(cmd, file)
    file.write("\n\n")

In [10]:
!vegeta attack -targets=vegeta_target.json -rate=1 -duration=60s -format=json | vegeta report -type=text

Requests      [total, rate, throughput]         60, 1.02, 1.01
Duration      [total, attack, wait]             59.198s, 59s, 197.751ms
Latencies     [min, mean, 50, 90, 95, 99, max]  179.123ms, 280.177ms, 214.79ms, 325.753ms, 457.825ms, 1.936s, 2.009s
Bytes In      [total, mean]                     475783920, 7929732.00
Bytes Out     [total, mean]                     13140, 219.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:60  
Error Set:


### Clean-up

In [11]:
!kubectl delete -f gpt2-deploy.yaml -n default

seldondeployment.machinelearning.seldon.io "gpt2" deleted
