# Pretrained GPT2 Model Deployment Example

In this notebook, we run a text generation example using a GPT2 model exported from HuggingFace and deployed with Seldon's pre-packaged Triton server. The example also covers converting the model to ONNX format.
The example below implements the greedy approach to next-token prediction: at each step, the token with the highest score is appended to the sentence.

More info: https://huggingface.co/transformers/model_doc/gpt2.html?highlight=gpt2
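
The greedy approach can be sketched independently of GPT2. The snippet below is a minimal illustration, not the notebook's implementation: `toy_model` is a hypothetical stand-in for a language model over a 4-token vocabulary, and `greedy_generate` is a helper name introduced here for illustration only.

```python
from typing import Callable, List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Return the index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_generate(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    prompt_ids: List[int],
    max_new_tokens: int,
) -> List[int]:
    """Repeatedly append the highest-scoring next token to the sequence."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)  # one score per vocabulary entry
        ids.append(argmax(logits))       # greedy choice: no sampling, no beam
    return ids


def toy_model(ids: List[int]) -> List[float]:
    """Hypothetical model: always scores (last_token + 1) % 4 highest."""
    scores = [0.0] * 4
    scores[(ids[-1] + 1) % 4] = 1.0
    return scores


generated = greedy_generate(toy_model, prompt_ids=[0], max_new_tokens=5)
print(generated)
```

Greedy decoding is deterministic and cheap, which makes it convenient for a deployment smoke test, although it can produce repetitive text compared with sampling or beam search.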

## Steps:
1. Download the pretrained GPT2 model from HuggingFace
2. Convert the model to ONNX
3. Store it in a MinIO bucket
4. Set up Seldon-Core in your Kubernetes cluster
5. Deploy the ONNX model with Seldon's pre-packaged Triton server
6. Interact with the model: run a greedy algorithm example (generate a sentence completion)
7. Clean up

## Basic requirements
* Helm v3.0.0+
* A Kubernetes cluster running v1.13 or above (minikube or Docker for Windows work well, given enough RAM)
* kubectl v1.14+
* Python 3.6+ 

In [None]:
%%writefile requirements.txt
transformers==4.5.1
torch==1.8.1
tokenizers<0.11,>=0.10.1
tensorflow==2.4.1
tf2onnx

In [None]:
!pip install --trusted-host=pypi.python.org --trusted-host=pypi.org --trusted-host=files.pythonhosted.org -r requirements.txt


### Export HuggingFace TFGPT2LMHeadModel pre-trained model and save it locally

In [None]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2", from_pt=True, pad_token_id=tokenizer.eos_token_id)
model.save_pretrained("./tfgpt2model", saved_model=True)

### Convert the TensorFlow saved model to ONNX

In [None]:
!python -m tf2onnx.convert --saved-model ./tfgpt2model/saved_model/1 --opset 11  --output model.onnx

### Copy your model to a local MinIO
#### Set up MinIO
Use the provided [notebook](https://docs.seldon.io/projects/seldon-core/en/latest/examples/minio_setup.html) to install MinIO in your cluster and configure the `mc` CLI tool. Instructions are also available [online](https://docs.min.io/docs/minio-client-quickstart-guide.html).

Note: You can use your preferred remote storage service instead (Google, AWS, etc.).

#### Create a Bucket and store your model

In [None]:
!mc mb minio-seldon/onnx-gpt2 -p
!mc cp ./model.onnx minio-seldon/onnx-gpt2/gpt2/1/
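
The destination path `gpt2/1/` in the copy command follows Triton's model repository layout, `<model-name>/<version>/<model-file>`, where the model name (`gpt2`) is the name used later in the deployment and in the request URL. The sketch below only illustrates how that object key is composed; `triton_model_key` is a hypothetical helper, and the default file name `model.onnx` is assumed to be what Triton expects for ONNX models.

```python
import posixpath


def triton_model_key(model_name: str, version: int, filename: str = "model.onnx") -> str:
    """Build the object key of a model file inside a Triton model repository."""
    if version < 1:
        # Assumption: version directories are named with positive integers.
        raise ValueError("model version must be a positive integer")
    # Object-store keys always use forward slashes, hence posixpath.
    return posixpath.join(model_name, str(version), filename)


key = triton_model_key("gpt2", 1)
print(f"s3://onnx-gpt2/{key}")
```

With the bucket from this notebook, the printed location corresponds to the file uploaded by `mc cp`, and `s3://onnx-gpt2` is the repository root passed as `modelUri` in the deployment.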

### Run Seldon in your Kubernetes cluster

Follow the [Seldon-Core Setup notebook](https://docs.seldon.io/projects/seldon-core/en/latest/examples/seldon_core_setup.html) to set up a cluster with Ambassador Ingress or Istio and install Seldon Core.

### Deploy your model with Seldon pre-packaged Triton server

In [3]:
%%writefile secret.yaml

apiVersion: v1
kind: Secret
metadata:
  name: seldon-init-container-secret
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: minioadmin
  AWS_SECRET_ACCESS_KEY: minioadmin
  AWS_ENDPOINT_URL: http://minio.minio-system.svc.cluster.local:9000
  USE_SSL: "false"

Overwriting secret.yaml


In [4]:
%%writefile gpt2-deploy.yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: gpt2
spec:
  predictors:
  - graph:
      implementation: TRITON_SERVER
      logger:
        mode: all
      modelUri: s3://onnx-gpt2
      envSecretRefName: seldon-init-container-secret
      name: gpt2
      type: MODEL
    name: default
    replicas: 1
  protocol: kfserving

Overwriting gpt2-deploy.yaml


In [5]:
!kubectl apply -f secret.yaml
!kubectl apply -f gpt2-deploy.yaml

secret/seldon-init-container-secret configured
seldondeployment.machinelearning.seldon.io/gpt2 configured


In [6]:
!kubectl rollout status deploy/$(kubectl get deploy -l seldon-deployment-id=gpt2 -o jsonpath='{.items[0].metadata.name}')

deployment "gpt2-default-0-gpt2" successfully rolled out


#### Interact with the model: get model metadata (a "test" request to make sure our model is available and loaded correctly)

In [43]:
!curl -v http://localhost:80/seldon/seldon/gpt2/v2/models/gpt2

*   Trying 127.0.0.1:80...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /seldon/seldon/gpt2/v2/models/gpt2 HTTP/1.1
> Host: localhost
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< access-control-allow-headers: Accept, Accept-Encoding, Authorization, Content-Length, Content-Type, X-CSRF-Token
< access-control-allow-methods: GET,OPTIONS
< access-control-allow-origin: *
< content-type: application/json
< seldon-puid: 7e24a20b-3130-4f50-a86b-bda5a9c4c917
< x-content-type-options: nosniff
< date: Fri, 16 Apr 2021 15:19:28 GMT
< content-length: 336
< x-envoy-upstream-service-time: 1
< server: istio-envoy
< 
* Connection #0 to host localhost left intact
{"name":"gpt2","versions":["1"],"platform":"onnxruntime_onnx","inputs":[{"name":"input_ids:0","datatype":"INT32","shape":[-1,-1]},{"name":"attention_mask:0","datatype":"INT32","shape":[-1,-1]}],"outputs":[{"name":"past_
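
The metadata response lists the input tensor names and datatypes that an inference request has to supply. As a rough sketch, the request body can be derived from that metadata instead of hard-coding it. `build_infer_payload` is a hypothetical helper, the `metadata` dict below reproduces only the `inputs` part visible in the response above, and the token ids are arbitrary placeholders rather than real tokenizer output.

```python
from typing import Any, Dict, List


def build_infer_payload(metadata: Dict[str, Any], token_ids: List[int]) -> Dict[str, Any]:
    """Build a request body for a single sequence from model metadata."""
    shape = [1, len(token_ids)]  # batch of one sequence
    values = {
        "input_ids:0": [list(token_ids)],
        "attention_mask:0": [[1] * len(token_ids)],  # attend to every token
    }
    inputs = []
    for spec in metadata["inputs"]:
        name = spec["name"]
        if name not in values:
            raise KeyError(f"no value prepared for model input {name!r}")
        inputs.append(
            {"name": name, "datatype": spec["datatype"], "shape": shape, "data": values[name]}
        )
    return {"inputs": inputs}


# Only the fields visible in the metadata response above are reproduced here.
metadata = {
    "name": "gpt2",
    "inputs": [
        {"name": "input_ids:0", "datatype": "INT32", "shape": [-1, -1]},
        {"name": "attention_mask:0", "datatype": "INT32", "shape": [-1, -1]},
    ],
}
payload = build_infer_payload(metadata, token_ids=[11, 22, 33])  # placeholder ids
print(payload)
```

Deriving names and datatypes from the metadata means the client fails loudly if the exported model's inputs ever change, instead of sending a request the server rejects.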

### Run prediction test: generate a sentence completion using the GPT2 model (greedy approach)


In [7]:
import requests
import json
import numpy as np
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = 'I enjoy working in Seldon'
count = 0
max_gen_len = 10
gen_sentence = input_text
while count < max_gen_len:
    input_ids = tokenizer.encode(gen_sentence, return_tensors='tf')
    shape = input_ids.shape.as_list()
    payload = {
            "inputs": [
                {"name": "input_ids:0",
                 "datatype": "INT32",
                 "shape": shape,
                 "data": input_ids.numpy().tolist()
                 },
                {"name": "attention_mask:0",
                 "datatype": "INT32",
                 "shape": shape,
                 "data": np.ones(shape, dtype=np.int32).tolist()
                 }
                ]
            }

    ret = requests.post('http://localhost:80/seldon/seldon/gpt2/v2/models/gpt2/infer', json=payload)

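    try:
        res = ret.json()
    except ValueError:
        # Non-JSON body (e.g. a gateway error page): count the attempt so the
        # loop cannot spin forever, then retry.
        count += 1
        continue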

    # extract logits
    logits = np.array(res["outputs"][1]["data"])
    logits = logits.reshape(res["outputs"][1]["shape"])

    # greedy step: pick the highest-scoring token at each position; only the last position is used below
    next_token = logits.argmax(axis=2)[0]
    next_token_str = tokenizer.decode(next_token[-1:], skip_special_tokens=True,
                                      clean_up_tokenization_spaces=True).strip()
    gen_sentence += ' ' + next_token_str
    count += 1

print(f'Input: {input_text}\nOutput: {gen_sentence}')

Input: I enjoy working in Seldon
Output: I enjoy working in Seldon 's office , and I 'm glad to see that
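
The loop above reads the logits from `res["outputs"][1]`, which depends on the order in which the server returns its outputs. A slightly more defensive variant looks the output up by name and reads only the last position from the flattened data. This is a sketch under assumptions: the output name `"logits"` is hypothetical (the metadata output above is truncated, so the real name is not shown), and the data is assumed to be flattened in row-major order with shape `[batch, sequence, vocab]`, as the reshape in the loop implies.

```python
from typing import Any, Dict, List


def find_output(response: Dict[str, Any], name: str) -> Dict[str, Any]:
    """Return the output tensor with the given name from an inference response."""
    for output in response["outputs"]:
        if output["name"] == name:
            return output
    raise KeyError(f"output {name!r} not found in response")


def last_position_argmax(output: Dict[str, Any]) -> int:
    """Greedy pick for the first batch row: argmax over the last position's scores."""
    _batch, seq_len, vocab = output["shape"]
    flat: List[float] = output["data"]
    start = (seq_len - 1) * vocab  # offset of the last position in batch row 0
    row = flat[start:start + vocab]
    return max(range(vocab), key=row.__getitem__)


# Toy response with a 3-token vocabulary; the names and values are made up.
toy_response = {
    "outputs": [
        {"name": "other", "shape": [1], "data": [0.0]},
        {"name": "logits", "shape": [1, 2, 3], "data": [0.1, 0.9, 0.0, 0.2, 0.1, 0.7]},
    ]
}
next_token_id = last_position_argmax(find_output(toy_response, "logits"))
print(next_token_id)
```

Slicing out one row also avoids building a full `[batch, sequence, vocab]` array on the client when only the last position is needed.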


### Clean-up

In [None]:
!kubectl delete -f gpt2-deploy.yaml