<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/05%20-%20TensorFlow/05Tools%20-%20Prediction%20-%20Online.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A//raw.githubusercontent.com/statmike/vertex-ai-mlops/main/05%20-%20TensorFlow/05Tools%20-%20Prediction%20-%20Online.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/05%20-%20TensorFlow/05Tools%20-%20Prediction%20-%20Online.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https%3A//raw.githubusercontent.com/statmike/vertex-ai-mlops/main/05%20-%20TensorFlow/05Tools%20-%20Prediction%20-%20Online.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# 05Tools: Prediction - Online
Predictions from models created in the 05 series of notebooks.

This notebook is part of a collection of examples that showcase many ways to serve models:
- Online:
    - (**THIS NOTEBOOK**) Vertex AI Endpoints: Python, REST, CLI (gcloud): [05Tools - Prediction - Online.ipynb](./05Tools%20-%20Prediction%20-%20Online.ipynb)
    - Local with TensorFlow ModelServer: [05Tools - Prediction - Local.ipynb](./05Tools%20-%20Prediction%20-%20Local.ipynb)
    - Custom: Build a custom container with TensorFlow ModelServer: [05Tools - Prediction - Custom.ipynb](./05Tools%20-%20Prediction%20-%20Custom.ipynb)
        - Remote Service with Cloud Run
        - Local Service with Docker Run
- Batch: [05Tools - Prediction - Batch.ipynb](./05Tools%20-%20Prediction%20-%20Batch.ipynb)
    - BigQuery ML Model Import
    - Vertex AI Batch Prediction Jobs

**Prerequisites:**
-  At least 1 of the notebooks in this series [05, 05a-05i]

**Conceptual Flow & Workflow**

<p align="center">
  <img alt="Conceptual Flow" src="../architectures/slides/05tools_pred_arch.png" width="45%">
&nbsp; &nbsp; &nbsp; &nbsp;
  <img alt="Workflow" src="../architectures/slides/05tools_pred_console.png" width="45%">
</p>

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/05%20-%20TensorFlow/05Tools%20-%20Prediction%20-%20Online.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs

The list `packages` contains tuples of package import names and install names.  If the import name is not found, the install name is used to install the package quietly for the current user.

In [3]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform'),
    ('google.cloud.bigquery', 'google-cloud-bigquery')
]

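import importlib.util # import the submodule explicitly; `import importlib` alone does not guarantee `importlib.util` is loaded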
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
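
One caveat with this kind of check: `importlib.util.find_spec` raises `ModuleNotFoundError` instead of returning `None` when a parent package of a dotted name (for example `google.cloud`) is missing. A minimal sketch of a more defensive check follows; the helper name `is_installed` is hypothetical and not part of this notebook:

```python
import importlib.util

def is_installed(import_name):
    """Return True if import_name can be located, False otherwise."""
    try:
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # raised when a parent package of a dotted name is missing
        return False

print(is_installed('json'))
print(is_installed('not_a_real_parent_pkg.child'))
```

The same helper could replace the `find_spec` call in the loop, so that a missing parent package triggers an install instead of an exception.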

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [4]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

inputs:

In [5]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [6]:
REGION = 'us-central1'
EXPERIMENT = '05_predictions'
SERIES = '05'

# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'fraud'
BQ_TABLE = 'fraud_prepped'

# specify a GCS Bucket
GCS_BUCKET = PROJECT_ID

# Resources
DEPLOY_COMPUTE = 'n1-standard-4'
DEPLOY_IMAGE ='us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest'

# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id,splits' # add more variables to the string with comma delimiters

packages:

In [7]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

from google.cloud import aiplatform
from google.cloud import bigquery

import tensorflow as tf

from google.api import httpbody_pb2
from datetime import datetime
import json
import numpy as np

import asyncio
import time
import multiprocessing

clients:

In [8]:
aiplatform.init(project = PROJECT_ID, location = REGION)
bq = bigquery.Client(project = PROJECT_ID)

parameters:

In [9]:
DIR = f"temp/{EXPERIMENT}"

environment:

In [10]:
if not os.path.exists(DIR):
    os.makedirs(DIR)

---
## Get Vertex AI Endpoint

This project already has a model serving online predictions at a Vertex AI Endpoint.  This section will use the endpoint to retrieve the deployed model and get its information to use for online prediction methods in this notebook.

### Get Endpoint

[Endpoint Properties and Methods](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Endpoint):

```python
endpoint
endpoint.display_name
endpoint.resource_name
endpoint.traffic_split
endpoint.list_models()
```

In [11]:
endpoints = aiplatform.Endpoint.list(filter = f"labels.series={SERIES} AND display_name={SERIES}")
if endpoints:
    endpoint = endpoints[0]
    print(f"Endpoint Exists: {endpoints[0].resource_name}")
else:
    print(f"There does not appear to be an endpoint for SERIES = {SERIES}")

Endpoint Exists: projects/1026793852137/locations/us-central1/endpoints/725723853820526592


In [12]:
endpoint.display_name

'05'

In [13]:
print(f'Review the Endpoint in the Console:\nhttps://console.cloud.google.com/vertex-ai/locations/{REGION}/endpoints/{endpoint.name}?project={PROJECT_ID}')

Review the Endpoint in the Console:
https://console.cloud.google.com/vertex-ai/locations/us-central1/endpoints/725723853820526592?project=statmike-mlops-349915


In [14]:
endpoint.traffic_split

{'2423682068408958976': 100}
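
The traffic split maps each deployed model ID to the percentage of requests it receives. As a small sketch, the ID receiving the most traffic can be picked from such a dictionary; the variable name `primary_id` is illustrative only, and the dictionary is copied from the output above:

```python
# traffic split copied from the output above: deployed model ID -> percent of traffic
traffic_split = {'2423682068408958976': 100}

# the deployed model ID that receives the largest share of traffic
primary_id = max(traffic_split, key=traffic_split.get)
print(primary_id, traffic_split[primary_id])
```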

In [15]:
endpoint.list_models()[0]

dedicated_resources {
  machine_spec {
    machine_type: "n1-standard-4"
  }
  min_replica_count: 1
  max_replica_count: 1
}
id: "2423682068408958976"
model: "projects/1026793852137/locations/us-central1/models/model_05_05"
model_version_id: "18"
display_name: "05_05"
create_time {
  seconds: 1703181253
  nanos: 58849000
}

---
## Retrieve Records For Prediction

In [16]:
n = 10000
# Make a list of columns to omit
OMIT = [x for x in VAR_OMIT.split(',') if x != '']
samples = bq.query(
    query = f"""
        SELECT * EXCEPT({','.join([VAR_TARGET] + OMIT)})
        FROM {BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}
        WHERE splits='TEST'
        LIMIT {n}
        """
).to_dataframe()

In [17]:
samples.head(4)

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35337 | 1.092844 | -0.01323 | 1.359829 | 2.731537 | -0.707357 | 0.873837 | -0.79613 | 0.437707 | 0.39677 | ... | -0.240428 | 0.037603 | 0.380026 | -0.167647 | 0.027557 | 0.592115 | 0.219695 | 0.03697 | 0.010984 | 0.0 |
| 1 | 60481 | 1.238973 | 0.035226 | 0.063003 | 0.641406 | -0.260893 | -0.580097 | 0.049938 | -0.034733 | 0.405932 | ... | -0.26508 | -0.060003 | -0.053585 | -0.057718 | 0.104983 | 0.537987 | 0.589563 | -0.046207 | -0.006212 | 0.0 |
| 2 | 139587 | 1.870539 | 0.211079 | 0.224457 | 3.889486 | -0.380177 | 0.249799 | -0.577133 | 0.179189 | -0.120462 | ... | -0.374356 | 0.196006 | 0.656552 | 0.180776 | -0.060226 | -0.228979 | 0.080827 | 0.009868 | -0.036997 | 0.0 |
| 3 | 162908 | -3.368339 | -1.980442 | 0.153645 | -0.159795 | 3.847169 | -3.516873 | -1.209398 | -0.292122 | 0.760543 | ... | -0.923275 | -0.545992 | -0.252324 | -1.171627 | 0.214333 | -0.159652 | -0.060883 | 1.294977 | 0.120503 | 0.0 |


The query above already excluded the columns that are not model features.  Convert the remaining records to a list of dictionaries, one per instance:

In [33]:
newobs = samples.to_dict(orient='records')
newobs[0]

{'Time': 35337,
 'V1': 1.0928441854981998,
 'V2': -0.0132303486713432,
 'V3': 1.35982868199426,
 'V4': 2.7315370965921004,
 'V5': -0.707357349219652,
 'V6': 0.8738370029866129,
 'V7': -0.7961301510622031,
 'V8': 0.437706509544851,
 'V9': 0.39676985012996396,
 'V10': 0.587438102569443,
 'V11': -0.14979756231827498,
 'V12': 0.29514781622888103,
 'V13': -1.30382621882143,
 'V14': -0.31782283120234495,
 'V15': -2.03673231037199,
 'V16': 0.376090905274179,
 'V17': -0.30040350116459497,
 'V18': 0.433799615590844,
 'V19': -0.145082264348681,
 'V20': -0.240427548108996,
 'V21': 0.0376030733329398,
 'V22': 0.38002620963091405,
 'V23': -0.16764742731151097,
 'V24': 0.0275573495476881,
 'V25': 0.59211469704354,
 'V26': 0.219695164116351,
 'V27': 0.0369695108704894,
 'V28': 0.010984441006191,
 'Amount': 0.0}

In [35]:
newobs = [{k:[v] for k,v in newob.items()} for newob in newobs]

In [36]:
newobs[0]

{'Time': [35337],
 'V1': [1.0928441854981998],
 'V2': [-0.0132303486713432],
 'V3': [1.35982868199426],
 'V4': [2.7315370965921004],
 'V5': [-0.707357349219652],
 'V6': [0.8738370029866129],
 'V7': [-0.7961301510622031],
 'V8': [0.437706509544851],
 'V9': [0.39676985012996396],
 'V10': [0.587438102569443],
 'V11': [-0.14979756231827498],
 'V12': [0.29514781622888103],
 'V13': [-1.30382621882143],
 'V14': [-0.31782283120234495],
 'V15': [-2.03673231037199],
 'V16': [0.376090905274179],
 'V17': [-0.30040350116459497],
 'V18': [0.433799615590844],
 'V19': [-0.145082264348681],
 'V20': [-0.240427548108996],
 'V21': [0.0376030733329398],
 'V22': [0.38002620963091405],
 'V23': [-0.16764742731151097],
 'V24': [0.0275573495476881],
 'V25': [0.59211469704354],
 'V26': [0.219695164116351],
 'V27': [0.0369695108704894],
 'V28': [0.010984441006191],
 'Amount': [0.0]}

In [49]:
len(newobs)

10000
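
The reshaping above wraps every feature value in a single-element list, so each instance maps a feature name to a list. That shape is what the requests in this notebook send to the deployed model; it should be read as a property of this model's inputs, not as a general requirement. A standalone sketch with two abbreviated, made-up records:

```python
# abbreviated, illustrative records (the real ones carry all feature columns)
records = [
    {'Time': 35337, 'Amount': 0.0},
    {'Time': 60481, 'Amount': 0.0},
]

# wrap each value in a list so every feature is a one-element list
instances = [{k: [v] for k, v in record.items()} for record in records]
print(instances[0])
```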

---
## Online Predictions: Methods for Vertex AI Endpoints

There are multiple ways to interact with a Vertex AI Endpoint.  This notebook gives examples for Python (multiple versions and layers of the client), as well as REST and the `gcloud` CLI.  To better understand these clients, review the notes here: [aiplatform_notes.md](../Tips/aiplatform_notes.md).

>**Explanations**
>For each of the methods below, the `predict` part of the request can be exchanged for `explain` if the endpoint has a model deployed with explanations set up.  Note: This will not work for the raw prediction methods.  See more about setting up explainability in the explainability notebooks within this series.
>- [05Tools - Explainability - Example-Based.ipynb](./05Tools%20-%20Explainability%20-%20Example-Based.ipynb)
>- [05Tools - Explainability - Feature-Based.ipynb](./05Tools%20-%20Explainability%20-%20Feature-Based.ipynb)

---
### Get Predictions: Python Client

[aiplatform.Endpoint.predict()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Endpoint#google_cloud_aiplatform_Endpoint_predict)

In [25]:
prediction = endpoint.predict(instances = newobs[0:1])
prediction

Prediction(predictions=[[0.999759495, 0.000240550202]], deployed_model_id='2423682068408958976', metadata=None, model_version_id='18', model_resource_name='projects/1026793852137/locations/us-central1/models/model_05_05', explanations=None)

In [26]:
prediction.predictions[0]

[0.999759495, 0.000240550202]

In [27]:
np.argmax(prediction.predictions[0])

0

In [46]:
endpoint.predict(instances = newobs[0:2]).predictions

[[0.999759495, 0.000240550202], [0.999389887, 0.000610153715]]
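
Each prediction is a pair of class probabilities, and the position of the largest value is the predicted class. A minimal sketch without NumPy that also attaches a readable label; the label names follow the 0 (not fraud) and 1 (fraud) convention used later in this notebook, and the helper `to_label` is hypothetical:

```python
CLASS_LABELS = ['not fraud', 'fraud']  # index 0 and index 1

def to_label(probabilities):
    """Return (class index, label) for the largest probability."""
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index, CLASS_LABELS[index]

# probabilities copied from the output above
predictions = [[0.999759495, 0.000240550202], [0.999389887, 0.000610153715]]
labels = [to_label(p) for p in predictions]
print(labels)
```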

#### Raw Predictions

Send the request as a raw HTTP body with arbitrary headers.

[aiplatform.Endpoint.raw_predict()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Endpoint#google_cloud_aiplatform_Endpoint_raw_predict)

In [50]:
instances = {"instances": newobs[0:1], "signature_name": "serving_default"}
headers = {'Content-Type':'application/json'}

In [51]:
prediction = endpoint.raw_predict(
    body = json.dumps(instances).encode("utf-8"),
    headers = headers
)
prediction

<Response [200]>

In [52]:
prediction = json.loads(prediction.text)
prediction

{'predictions': [[0.999759495, 0.000240550202]]}

In [53]:
prediction['predictions'][0]

[0.999759495, 0.000240550202]

In [54]:
np.argmax(prediction['predictions'][0])

0
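
A raw prediction is a round trip of bytes: the request body is JSON encoded to UTF-8, and the response body is JSON text that has to be parsed by the caller. The sketch below shows only that encoding and decoding, with no network call; the instance is abbreviated and illustrative, and the response text mirrors the shape of the raw responses shown in this notebook:

```python
import json

# abbreviated, illustrative instance
instance = {'Time': [35337], 'Amount': [0.0]}

# request side: dict -> JSON string -> bytes
body = json.dumps(
    {"instances": [instance], "signature_name": "serving_default"}
).encode("utf-8")

# response side: JSON text -> dict
response_text = '{\n    "predictions": [[0.999759495, 0.000240550202]\n    ]\n}'
parsed = json.loads(response_text)
print(type(body).__name__, parsed['predictions'][0])
```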

#### Async Predictions

Make the call for a prediction an [awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables) coroutine.

[aiplatform.Endpoint.predict_async()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Endpoint#google_cloud_aiplatform_Endpoint_predict_async)

In [55]:
prediction = await endpoint.predict_async(instances = newobs[0:1])
prediction

Prediction(predictions=[[0.999759495, 0.000240550202]], deployed_model_id='2423682068408958976', metadata=None, model_version_id='18', model_resource_name='projects/1026793852137/locations/us-central1/models/model_05_05', explanations=None)

In [56]:
prediction.predictions[0]

[0.999759495, 0.000240550202]

In [57]:
np.argmax(prediction.predictions[0])

0
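
Awaiting a single coroutine behaves much like a synchronous call; the benefit appears when several awaitables are in flight together. A minimal sketch using `asyncio.gather`, where `fake_predict` is a hypothetical stand-in for `endpoint.predict_async` so that the example runs without an endpoint:

```python
import asyncio

async def fake_predict(instance):
    # stand-in for: await endpoint.predict_async(instances=[instance])
    await asyncio.sleep(0.01)
    return [0.9, 0.1]

async def predict_all(instances):
    # start every request, then wait for all of them; results keep input order
    return await asyncio.gather(*(fake_predict(i) for i in instances))

# in a script; inside a notebook cell use: results = await predict_all(...)
results = asyncio.run(predict_all([{'x': [1]}, {'x': [2]}, {'x': [3]}]))
print(len(results))
```

In a notebook an event loop is already running, which is why the cells above use `await` directly instead of `asyncio.run`.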

---
### Get Predictions: Python Client (gapic access to v1)

This is functionally the same as the Python Client V1 section below.

[aiplatform.gapic.PredictionServiceClient()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.services.prediction_service.PredictionServiceClient)

#### Client

In [58]:
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}

In [59]:
predictor = aiplatform.gapic.PredictionServiceClient(client_options = client_options)

#### Predictions

In [60]:
prediction = predictor.predict(
    endpoint = endpoint.resource_name,
    instances = newobs[0:1]
)
prediction

predictions {
  list_value {
    values {
      number_value: 0.999759495
    }
    values {
      number_value: 0.000240550202
    }
  }
}
deployed_model_id: "2423682068408958976"
model: "projects/1026793852137/locations/us-central1/models/model_05_05"
model_version_id: "18"
model_display_name: "05_05"

In [61]:
prediction.predictions[0]

[0.999759495, 0.000240550202]

In [62]:
np.argmax(prediction.predictions[0])

0

#### Raw Predictions

In [63]:
instances = {"instances": newobs[0:1], "signature_name": "serving_default"}
body = httpbody_pb2.HttpBody(
    data = json.dumps(instances).encode("utf-8"),
    content_type = "application/json"
)

In [64]:
prediction = predictor.raw_predict(
    endpoint = endpoint.resource_name,
    http_body = body
)
prediction

content_type: "application/json"
data: "{\n    \"predictions\": [[0.999759495, 0.000240550202]\n    ]\n}"

In [65]:
prediction = json.loads(prediction.data)
prediction

{'predictions': [[0.999759495, 0.000240550202]]}

In [66]:
prediction['predictions'][0]

[0.999759495, 0.000240550202]

In [67]:
np.argmax(prediction['predictions'][0])

0

#### Async Predictions

Make the call for a prediction an [awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables) coroutine.

[aiplatform.gapic.PredictionServiceAsyncClient()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.services.prediction_service.PredictionServiceAsyncClient#google_cloud_aiplatform_v1_services_prediction_service_PredictionServiceAsyncClient_predict)

In [68]:
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}

In [69]:
async_predictor = aiplatform.gapic.PredictionServiceAsyncClient(client_options = client_options)

In [70]:
prediction = await async_predictor.predict(
    endpoint = endpoint.resource_name,
    instances = newobs[0:1]
)
prediction

predictions {
  list_value {
    values {
      number_value: 0.999759495
    }
    values {
      number_value: 0.000240550202
    }
  }
}
deployed_model_id: "2423682068408958976"
model: "projects/1026793852137/locations/us-central1/models/model_05_05"
model_version_id: "18"
model_display_name: "05_05"

In [71]:
prediction.predictions[0]

[0.999759495, 0.000240550202]

In [72]:
np.argmax(prediction.predictions[0])

0

#### Async Raw Predictions

Make the call for a raw prediction an [awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables) coroutine.

[aiplatform.gapic.PredictionServiceAsyncClient()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.services.prediction_service.PredictionServiceAsyncClient#google_cloud_aiplatform_v1_services_prediction_service_PredictionServiceAsyncClient_predict)

In [73]:
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}

In [74]:
async_predictor = aiplatform.gapic.PredictionServiceAsyncClient(client_options = client_options)

In [75]:
instances = {"instances": newobs[0:1], "signature_name": "serving_default"}
body = httpbody_pb2.HttpBody(
    data = json.dumps(instances).encode("utf-8"),
    content_type = "application/json"
)

In [76]:
prediction = await async_predictor.raw_predict(
    endpoint = endpoint.resource_name,
    http_body = body
)
prediction

content_type: "application/json"
data: "{\n    \"predictions\": [[0.999759495, 0.000240550202]\n    ]\n}"

In [77]:
prediction = json.loads(prediction.data)
prediction

{'predictions': [[0.999759495, 0.000240550202]]}

In [78]:
prediction['predictions'][0]

[0.999759495, 0.000240550202]

In [79]:
np.argmax(prediction['predictions'][0])

0

---
### Get Predictions: Python Client V1

This is functionally the same as the Python Client gapic section above.

[aiplatform_v1.PredictionServiceClient()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.services.prediction_service.PredictionServiceClient)

#### Client

In [80]:
from google.cloud import aiplatform_v1

In [81]:
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}

In [82]:
predictor = aiplatform_v1.PredictionServiceClient(client_options = client_options)

#### Predictions

In [83]:
prediction = predictor.predict(
    endpoint = endpoint.resource_name,
    instances = newobs[0:1]
)
prediction

predictions {
  list_value {
    values {
      number_value: 0.999759495
    }
    values {
      number_value: 0.000240550202
    }
  }
}
deployed_model_id: "2423682068408958976"
model: "projects/1026793852137/locations/us-central1/models/model_05_05"
model_version_id: "18"
model_display_name: "05_05"

In [84]:
prediction.predictions[0]

[0.999759495, 0.000240550202]

In [85]:
np.argmax(prediction.predictions[0])

0

#### Raw Predictions

In [86]:
instances = {"instances": newobs[0:1], "signature_name": "serving_default"}
body = httpbody_pb2.HttpBody(
    data = json.dumps(instances).encode("utf-8"),
    content_type = "application/json"
)

In [87]:
prediction = predictor.raw_predict(
    endpoint = endpoint.resource_name,
    http_body = body
)
prediction

content_type: "application/json"
data: "{\n    \"predictions\": [[0.999759495, 0.000240550202]\n    ]\n}"

In [88]:
prediction = json.loads(prediction.data)
prediction

{'predictions': [[0.999759495, 0.000240550202]]}

In [89]:
prediction['predictions'][0]

[0.999759495, 0.000240550202]

In [90]:
np.argmax(prediction['predictions'][0])

0

#### Async Predictions

Make the call for a prediction an [awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables) coroutine.

[aiplatform_v1.PredictionServiceAsyncClient()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.services.prediction_service.PredictionServiceAsyncClient)

In [91]:
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}

In [92]:
async_predictor = aiplatform_v1.PredictionServiceAsyncClient(client_options = client_options)

In [93]:
prediction = await async_predictor.predict(
    endpoint = endpoint.resource_name,
    instances = newobs[0:1]
)
prediction

predictions {
  list_value {
    values {
      number_value: 0.999759495
    }
    values {
      number_value: 0.000240550202
    }
  }
}
deployed_model_id: "2423682068408958976"
model: "projects/1026793852137/locations/us-central1/models/model_05_05"
model_version_id: "18"
model_display_name: "05_05"

In [94]:
prediction.predictions[0]

[0.999759495, 0.000240550202]

In [95]:
np.argmax(prediction.predictions[0])

0

#### Async Raw Predictions

Make the call for a raw prediction an [awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables) coroutine.

[aiplatform_v1.PredictionServiceAsyncClient()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.services.prediction_service.PredictionServiceAsyncClient#google_cloud_aiplatform_v1_services_prediction_service_PredictionServiceAsyncClient_predict)

In [96]:
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}

In [97]:
async_predictor = aiplatform_v1.PredictionServiceAsyncClient(client_options = client_options)

In [98]:
instances = {"instances": newobs[0:1], "signature_name": "serving_default"}
body = httpbody_pb2.HttpBody(
    data = json.dumps(instances).encode("utf-8"),
    content_type = "application/json"
)

In [99]:
prediction = await async_predictor.raw_predict(
    endpoint = endpoint.resource_name,
    http_body = body
)
prediction

content_type: "application/json"
data: "{\n    \"predictions\": [[0.999759495, 0.000240550202]\n    ]\n}"

In [100]:
prediction = json.loads(prediction.data)
prediction

{'predictions': [[0.999759495, 0.000240550202]]}

In [101]:
prediction['predictions'][0]

[0.999759495, 0.000240550202]

In [102]:
np.argmax(prediction['predictions'][0])

0

---
### Get Predictions: Python Client V1 beta 1

[aiplatform_v1beta1.PredictionServiceClient()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1beta1.services.prediction_service.PredictionServiceClient)

#### Client

In [103]:
from google.cloud import aiplatform_v1beta1

In [104]:
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}

In [105]:
predictor = aiplatform_v1beta1.PredictionServiceClient(client_options = client_options)

#### Predictions

In [106]:
prediction = predictor.predict(
    endpoint = endpoint.resource_name,
    instances = newobs[0:1]
)
prediction

predictions {
  list_value {
    values {
      number_value: 0.999759495
    }
    values {
      number_value: 0.000240550202
    }
  }
}
deployed_model_id: "2423682068408958976"
model: "projects/1026793852137/locations/us-central1/models/model_05_05"
model_version_id: "18"
model_display_name: "05_05"

In [107]:
prediction.predictions[0]

[0.999759495, 0.000240550202]

In [108]:
np.argmax(prediction.predictions[0])

0

#### Raw Predictions

In [109]:
instances = {"instances": newobs[0:1], "signature_name": "serving_default"}
body = httpbody_pb2.HttpBody(
    data = json.dumps(instances).encode("utf-8"),
    content_type = "application/json"
)

In [110]:
prediction = predictor.raw_predict(
    endpoint = endpoint.resource_name,
    http_body = body
)
prediction

content_type: "application/json"
data: "{\n    \"predictions\": [[0.999759495, 0.000240550202]\n    ]\n}"

In [111]:
prediction = json.loads(prediction.data)
prediction

{'predictions': [[0.999759495, 0.000240550202]]}

In [112]:
prediction['predictions'][0]

[0.999759495, 0.000240550202]

In [113]:
np.argmax(prediction['predictions'][0])

0

#### Async Predictions

Make the call for a prediction an [awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables) coroutine.

[aiplatform_v1beta1.PredictionServiceAsyncClient()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1beta1.services.prediction_service.PredictionServiceAsyncClient)

In [114]:
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}

In [115]:
async_predictor = aiplatform_v1beta1.PredictionServiceAsyncClient(client_options = client_options)

In [116]:
prediction = await async_predictor.predict(
    endpoint = endpoint.resource_name,
    instances = newobs[0:1]
)
prediction

predictions {
  list_value {
    values {
      number_value: 0.999759495
    }
    values {
      number_value: 0.000240550202
    }
  }
}
deployed_model_id: "2423682068408958976"
model: "projects/1026793852137/locations/us-central1/models/model_05_05"
model_version_id: "18"
model_display_name: "05_05"

In [117]:
prediction.predictions[0]

[0.999759495, 0.000240550202]

In [118]:
np.argmax(prediction.predictions[0])

0

#### Async Raw Predictions

Make the call for a raw prediction an [awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables) coroutine.

[aiplatform_v1beta1.PredictionServiceAsyncClient()](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1beta1.services.prediction_service.PredictionServiceAsyncClient)

In [119]:
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}

In [120]:
async_predictor = aiplatform_v1beta1.PredictionServiceAsyncClient(client_options = client_options)

In [121]:
instances = {"instances": newobs[0:1], "signature_name": "serving_default"}
body = httpbody_pb2.HttpBody(
    data = json.dumps(instances).encode("utf-8"),
    content_type = "application/json"
)

In [122]:
prediction = await async_predictor.raw_predict(
    endpoint = endpoint.resource_name,
    http_body = body
)
prediction

content_type: "application/json"
data: "{\n    \"predictions\": [[0.999759495, 0.000240550202]\n    ]\n}"

In [123]:
prediction = json.loads(prediction.data)
prediction

{'predictions': [[0.999759495, 0.000240550202]]}

In [124]:
prediction['predictions'][0]

[0.999759495, 0.000240550202]

In [125]:
np.argmax(prediction['predictions'][0])

0

---
### Get Prediction: REST

REST Resource [v1.projects.locations.endpoints](https://cloud.google.com/vertex-ai/docs/reference/rest#rest-resource:-v1.projects.locations.endpoints)

#### Method 1: Command Line CURL

In [126]:
with open(f'{DIR}/request.json','w') as file:
    file.write(json.dumps({"instances": newobs[0:1]}))

In [127]:
prediction = !curl -s -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @{DIR}/request.json \
https://{REGION}-aiplatform.googleapis.com/v1/{endpoint.resource_name}:predict

prediction = json.loads(''.join([p.strip() for p in prediction]))
prediction

{'predictions': [[0.999759495, 0.000240550202]],
 'deployedModelId': '2423682068408958976',
 'model': 'projects/1026793852137/locations/us-central1/models/model_05_05',
 'modelDisplayName': '05_05',
 'modelVersionId': '18'}

In [128]:
prediction['predictions'][0]

[0.999759495, 0.000240550202]

In [129]:
np.argmax(prediction['predictions'][0])

0
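
The REST URL used above is assembled from the region, the API version, the endpoint resource name, and the method suffix. A small sketch of that assembly; the helper name `rest_url` is hypothetical, and the resource name is the one printed earlier in this notebook:

```python
def rest_url(region, resource_name, method='predict', version='v1'):
    """Build the regional REST URL for an endpoint method such as predict or rawPredict."""
    return f'https://{region}-aiplatform.googleapis.com/{version}/{resource_name}:{method}'

resource_name = 'projects/1026793852137/locations/us-central1/endpoints/725723853820526592'
print(rest_url('us-central1', resource_name))
print(rest_url('us-central1', resource_name, method='rawPredict'))
```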

##### Use CURL for Raw Predictions

In [130]:
with open(f'{DIR}/request.json','w') as file:
    file.write(json.dumps({"signature_name": "serving_default", "instances": [newobs[0]]}))

In [131]:
prediction = !curl -s -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @{DIR}/request.json \
https://{REGION}-aiplatform.googleapis.com/v1/{endpoint.resource_name}:rawPredict

prediction = json.loads(''.join([p.strip() for p in prediction]))
prediction

{'predictions': [[0.999759495, 0.000240550202]]}

In [132]:
prediction['predictions'][0]

[0.999759495, 0.000240550202]

In [133]:
np.argmax(prediction['predictions'][0])

0

#### Method 2: Python with requests

In [134]:
import requests

In [135]:
token = !gcloud auth application-default print-access-token
headers = {
    "content-type": "application/json; charset=utf-8",
    "Authorization": f'Bearer {token[0]}'
}
json_response = requests.post(
    f'https://{REGION}-aiplatform.googleapis.com/v1/{endpoint.resource_name}:predict',
    data = json.dumps({"instances": newobs[0:1]}),
    headers = headers
)

In [136]:
print(json_response.text)

{
  "predictions": [
    [
      0.999759495,
      0.000240550202
    ]
  ],
  "deployedModelId": "2423682068408958976",
  "model": "projects/1026793852137/locations/us-central1/models/model_05_05",
  "modelDisplayName": "05_05",
  "modelVersionId": "18"
}



In [137]:
predictions = json.loads(json_response.text)['predictions']
predictions

[[0.999759495, 0.000240550202]]

In [138]:
np.argmax(predictions[0])

0

##### Use Requests for Raw Predictions

In [139]:
import requests

In [140]:
token = !gcloud auth application-default print-access-token
headers = {
    "content-type": "application/json; charset=utf-8", 
    "Authorization": f'Bearer {token[0]}'
}
json_response = requests.post(
    f'https://{REGION}-aiplatform.googleapis.com/v1/{endpoint.resource_name}:rawPredict', 
    data = json.dumps({"signature_name": "serving_default", "instances": newobs[0:1]}),
    headers = headers
)

In [141]:
print(json_response.text)

{
    "predictions": [[0.999759495, 0.000240550202]
    ]
}


In [142]:
predictions = json.loads(json_response.text)['predictions']
predictions

[[0.999759495, 0.000240550202]]

In [143]:
np.argmax(predictions[0])

0

---
### Get Prediction: gcloud (CLI)

[gcloud ai endpoints](https://cloud.google.com/sdk/gcloud/reference/ai/endpoints)

In [144]:
with open(f'{DIR}/request.json','w') as file:
    file.write(json.dumps({"instances": newobs[0:1]}))

In [145]:
prediction = !gcloud ai endpoints predict {endpoint.name.rsplit('/',1)[-1]} --region={REGION} --json-request={DIR}/request.json
prediction

['Using endpoint [https://us-central1-prediction-aiplatform.googleapis.com/]',
 '[[0.999759495, 0.000240550202]]']

In [146]:
import ast
prediction = ast.literal_eval(prediction[1])
prediction[0]

[0.999759495, 0.000240550202]

In [147]:
np.argmax(prediction[0])

0

#### Use gcloud (CLI) For Raw Predictions

In [148]:
with open(f'{DIR}/request.json','w') as file:
    file.write(json.dumps({"signature_name": "serving_default", "instances": newobs[0:1]}))

In [149]:
prediction = !gcloud ai endpoints raw-predict {endpoint.name.rsplit('/',1)[-1]} --region={REGION} --format="json" --request=@{DIR}/request.json
prediction

['Using endpoint [https://us-central1-aiplatform.googleapis.com/]',
 '{',
 '  "predictions": [',
 '    [',
 '      0.999759495,',
 '      0.000240550202',
 '    ]',
 '  ]',
 '}']

In [150]:
prediction = json.loads("".join(prediction[1:]))
prediction

{'predictions': [[0.999759495, 0.000240550202]]}

In [151]:
prediction['predictions'][0]

[0.999759495, 0.000240550202]

In [152]:
np.argmax(prediction['predictions'][0])

0
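
The captured `gcloud` output mixes a status line with the JSON payload, and the cells above skip the first line by position. A slightly more defensive sketch locates the first line that looks like JSON; it assumes status lines never begin with `{` or `[`, and the helper name `parse_gcloud_json` is hypothetical:

```python
import json

def parse_gcloud_json(lines):
    """Drop leading status lines and parse the remaining lines as one JSON document."""
    start = next(
        i for i, line in enumerate(lines)
        if line.lstrip().startswith(('{', '['))
    )
    return json.loads(''.join(lines[start:]))

# output lines copied from the raw-predict cell above
raw_lines = [
    'Using endpoint [https://us-central1-aiplatform.googleapis.com/]',
    '{', '  "predictions": [', '    [', '      0.999759495,',
    '      0.000240550202', '    ]', '  ]', '}',
]
print(parse_gcloud_json(raw_lines))
```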

---
## Requesting Many Predictions: Synchronous and Asynchronous

There are times when you want to make many requests to an endpoint for predictions.  If you send requests one at a time (synchronously), the endpoint fulfills each request as it receives it.  An endpoint is also designed to handle simultaneous requests, which a client can send asynchronously.  If `max_replica_count` is set greater than 1 when the model is deployed, the endpoint can also scale up to handle the traffic.  
- [Configure compute resources for prediction](https://cloud.google.com/vertex-ai/docs/predictions/configure-compute)

Making concurrent requests from Python is possible and is covered in the tip notebook [Python Multiprocessing](../Tips/Python%20Multiprocessing.ipynb), which is helpful for understanding the method used below to make asynchronous requests concurrently using `asyncio`.
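
When many requests are sent concurrently, it is often useful to cap how many are in flight at once. The following is a sketch of one common way to do that with `asyncio.Semaphore`; it is not necessarily the approach this notebook takes, and `fake_request` is a hypothetical stand-in for a call to the endpoint:

```python
import asyncio

async def fake_request(i, limiter, active, peak):
    # only `max_concurrent` coroutines may be inside this block at once
    async with limiter:
        active[0] += 1
        peak[0] = max(peak[0], active[0])
        await asyncio.sleep(0.01)  # stand-in for the network round trip
        active[0] -= 1
        return i

async def run_bounded(n_requests, max_concurrent):
    limiter = asyncio.Semaphore(max_concurrent)
    active, peak = [0], [0]  # one-element lists used as mutable counters
    results = await asyncio.gather(
        *(fake_request(i, limiter, active, peak) for i in range(n_requests))
    )
    return results, peak[0]

# in a script; inside a notebook cell use: await run_bounded(20, 5)
results, peak = asyncio.run(run_bounded(20, 5))
print(len(results), peak)
```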

---
## Online Predictions: Synchronous Examples
Synchronous calls to the Vertex AI Endpoint with different batch sizes of instances.  This packages multiple prediction requests into a single call of the endpoint, with `batch_size` instances per call. Each request completes before the next request is submitted: sequential, synchronous.
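
As a minimal sketch of the batching idea on its own, the slicing can be separated from the endpoint call. The helper name `batches` is hypothetical and the instances here are plain integers standing in for prediction inputs.

```python
def batches(instances, batch_size=1):
    """Yield consecutive slices of `instances`, each holding at most `batch_size` items."""
    for start in range(0, len(instances), batch_size):
        yield instances[start:start + batch_size]

# ten stand-in instances split into batches of four: the last batch is shorter
instances = list(range(10))
sizes = [len(batch) for batch in batches(instances, batch_size=4)]
print(sizes)  # [4, 4, 2]
```

The final batch is simply smaller when `batch_size` does not divide the number of instances evenly, so no instance is dropped.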

In [130]:
len(newobs)

10000

In [131]:
def syncPredictions(instances, batch_size = 1):
    predictions = []
    start = time.perf_counter()
    # a loop where each step request predictions for batch_size number of instances - in a single request
    for p in range(0, len(instances), batch_size):
        #instances = [json_format.ParseDict(example, Value()) for example in newobs[p:p+batch_size]]
        preds = endpoint.predict(instances = instances[p:p+batch_size])
        predictions.extend(np.argmax(pred) for pred in preds.predictions)
    elapsed = time.perf_counter() - start
    print(f'{elapsed:0.5f} seconds for {len(instances)} instances in sychronous batches of size = {batch_size}')
    return predictions

In [132]:
# default batch_size = 1
predictions = syncPredictions(newobs)

155.82147 seconds for 10000 instances in sychronous batches of size = 1


In [133]:
# specify batch_size = 2 - expecting half the time if the endpoint can handle multiple at the same time
predictions = syncPredictions(newobs, batch_size = 2)

81.97352 seconds for 10000 instances in sychronous batches of size = 2


In [134]:
# specify batch_size = 10 - expecting 1/10 the time if the endpoint can handle this many at the same time
predictions = syncPredictions(newobs, batch_size = 10)

21.00186 seconds for 10000 instances in sychronous batches of size = 10
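
The three timings above can be compared with the ideal case, in which a batch of size `n` takes `1/n` of the time. The short calculation below uses only the elapsed times printed by these runs; they are single measurements, so treat the ratios as indicative. It shows roughly 1.9x for `batch_size = 2` and roughly 7.4x for `batch_size = 10`, so the gain per added instance shrinks as batches grow.

```python
# elapsed seconds reported by the three runs above, keyed by batch_size
elapsed = {1: 155.82147, 2: 81.97352, 10: 21.00186}

baseline = elapsed[1]
speedups = {}
for batch_size, seconds in elapsed.items():
    speedups[batch_size] = baseline / seconds
    efficiency = speedups[batch_size] / batch_size  # 1.0 would be perfect scaling
    print(f'batch_size = {batch_size:>2}: {speedups[batch_size]:.2f}x faster, '
          f'{efficiency:.0%} of the ideal {batch_size}x')
```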


In [135]:
# get a count of the number of predictions that resulted in 0 (not fraud) and 1 (fraud)
from collections import Counter
c = Counter(predictions)
c

Counter({0: 9980, 1: 20})

In [136]:
# get the index for the predictions that resulted in predicting = 1 (fraud)
[i for i, j in enumerate(predictions) if j == 1]

[53,
 576,
 3371,
 3958,
 3995,
 4028,
 4212,
 4292,
 4293,
 4350,
 4500,
 4532,
 4654,
 4707,
 4738,
 4892,
 5207,
 7052,
 8778,
 9070]

In [137]:
# review the inputs that lead to a prediction = 1 (fraud)
samples.iloc[[i for i, j in enumerate(predictions) if j == 1]]

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | 85285 | -7.030308 | 3.421991 | -9.525072 | 5.270891 | -4.02463 | -2.865682 | -6.989195 | 3.791551 | -4.62273 | ... | 0.545698 | 1.103398 | -0.541855 | 0.036943 | -0.355519 | 0.353634 | 1.042458 | 1.359516 | -0.272188 | 0.0 |
| 576 | 56887 | -0.075483 | 1.812355 | -2.566981 | 4.127549 | -1.628532 | -0.805895 | -3.390135 | 1.019353 | -2.451251 | ... | 0.338598 | 0.794372 | 0.270471 | -0.143624 | 0.013566 | 0.634203 | 0.213693 | 0.773625 | 0.387434 | 5.0 |
| 3371 | 43369 | -3.365319 | 2.426503 | -3.752227 | 0.276017 | -2.30587 | -1.961578 | -3.029283 | -1.674462 | 0.183961 | ... | -0.036837 | 2.070008 | -0.512626 | -0.248502 | 0.12655 | 0.104166 | -1.055997 | -1.200165 | -1.012066 | 88.0 |
| 3958 | 143354 | 1.118331 | 2.074439 | -3.837518 | 5.44806 | 0.071816 | -1.020509 | -1.808574 | 0.521744 | -2.032638 | ... | 0.163513 | 0.289861 | -0.172718 | -0.02191 | -0.37656 | 0.192817 | 0.114107 | 0.500996 | 0.259533 | 1.0 |
| 3995 | 93888 | -10.040631 | 6.139183 | -12.972972 | 7.740555 | -8.684705 | -3.837429 | -11.907702 | 5.833273 | -5.731054 | ... | -0.082275 | 2.823431 | 1.153005 | -0.567343 | 0.843012 | 0.549938 | 0.113892 | -0.307375 | 0.061631 | 1.0 |
| 4028 | 20332 | -15.271362 | 8.326581 | -22.338591 | 11.885313 | -8.721334 | -2.324307 | -16.196419 | 0.512882 | -6.333685 | ... | 0.993121 | -2.356896 | 1.068019 | 1.085617 | -1.039797 | -0.182006 | 0.649921 | 2.149247 | -1.406811 | 1.0 |
| 4212 | 7551 | 0.316459 | 3.809076 | -5.615159 | 6.047445 | 1.554026 | -2.651353 | -0.746579 | 0.055586 | -2.678679 | ... | 0.388307 | 0.208828 | -0.511747 | -0.583813 | -0.219845 | 1.474753 | 0.491192 | 0.518868 | 0.402528 | 1.0 |
| 4292 | 13126 | -2.880042 | 5.225442 | -11.06333 | 6.689951 | -5.759924 | -2.244031 | -11.199975 | 4.014722 | -3.429304 | ... | 1.191444 | 2.002883 | 0.351102 | 0.795255 | -0.778379 | -1.646815 | 0.487539 | 1.427713 | 0.583172 | 1.0 |
| 4293 | 152036 | -4.320609 | 3.199939 | -5.799736 | 6.50233 | 0.378479 | -1.948246 | -2.16786 | -0.728207 | -1.977238 | ... | -0.263686 | 0.47666 | 0.434278 | -0.13694 | -0.620072 | 0.642531 | 0.280717 | -2.649107 | 0.533641 | 1.0 |
| 4350 | 100298 | -22.341889 | 15.536133 | -22.865228 | 7.043374 | -14.183129 | -0.463145 | -28.215112 | -14.607791 | -9.481456 | ... | 4.100019 | -9.110423 | 4.158895 | 1.412928 | 0.382801 | 0.447154 | -0.632816 | -4.380154 | -0.467863 | 1.0 |


---
## Online Predictions: Asynchronous Examples

  

### What is Async?
If we make a request with the async method, `predict_async()`, the call returns a [coroutine](https://docs.python.org/3/glossary.html#term-coroutine) object rather than a prediction.  This means the method is implemented with an `async def` statement, which makes its result [awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables).

The following cells show using the method with, and without, an await expression:

In [153]:
len(newobs)

10000

In [169]:
endpoint.predict_async(instances = newobs[0:1])

<coroutine object Endpoint.predict_async at 0x7f51c9929850>

In [170]:
await endpoint.predict_async(instances = newobs[0:1])

Prediction(predictions=[[0.999759495, 0.000240550202]], deployed_model_id='2423682068408958976', metadata=None, model_version_id='18', model_resource_name='projects/1026793852137/locations/us-central1/models/model_05_05', explanations=None)
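
The same behavior can be reproduced without an endpoint. The sketch below uses a hypothetical stand-in, `fake_predict_async`, that sleeps briefly instead of calling Vertex AI; it is only meant to show that calling an `async def` function produces a coroutine object and that `await` is what runs it. A notebook cell can use `await` directly because a loop is already running, while a plain script has to start one with `asyncio.run()`.

```python
import asyncio
import inspect

async def fake_predict_async(instances):
    """Hypothetical stand-in for endpoint.predict_async: waits, then returns one row per instance."""
    await asyncio.sleep(0.01)  # simulates the network round trip
    return [[1.0, 0.0] for _ in instances]

async def main():
    pending = fake_predict_async(instances=[{"x": 1}])  # nothing has run yet
    was_coroutine = inspect.iscoroutine(pending)
    result = await pending  # runs the coroutine to completion
    return was_coroutine, result

# In a script there is no running event loop, so start one with asyncio.run().
# In a notebook cell, use `await main()` instead.
was_coroutine, result = asyncio.run(main())
print(was_coroutine, result)
```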

### How To Use Async Concurrently

The previous section showed that the `predict_async()` method returns a coroutine, which is an awaitable object.  When multiple coroutines are grouped together they can be awaited together - concurrently.

To group the coroutines together use [asyncio.gather()](https://docs.python.org/3/library/asyncio-task.html#running-tasks-concurrently):

In [171]:
responses = asyncio.gather(*[
    endpoint.predict_async(
        instances = [newobs[i]]
    ) for i in range(5)
])

In [172]:
type(responses)

asyncio.tasks._GatheringFuture

To run the requests concurrently and collect their results as a list, `await` the coroutine grouping:

In [173]:
responses = await asyncio.gather(*[
    endpoint.predict_async(
        instances = [newobs[i]]
    ) for i in range(5)
])

In [174]:
type(responses), len(responses)

(list, 5)

In [176]:
for response in responses:
    print(response.predictions)

[[0.999759495, 0.000240550202]]
[[0.999389887, 0.000610153715]]
[[0.999763429, 0.00023654796]]
[[0.999877572, 0.000122421348]]
[[0.999873042, 0.000126962143]]
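
Two properties of `asyncio.gather()` matter for prediction requests: the awaitables run concurrently, and the results come back in the order the awaitables were passed in, regardless of which finishes first. The sketch below illustrates both with a hypothetical stand-in, `fake_predict_async`, that sleeps for a chosen delay instead of calling an endpoint.

```python
import asyncio
import time

async def fake_predict_async(i, delay):
    """Hypothetical stand-in for a prediction request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return i

async def main():
    delays = [0.2, 0.05, 0.1]  # the first request is the slowest to finish
    start = time.perf_counter()
    results = await asyncio.gather(
        *[fake_predict_async(i, delay) for i, delay in enumerate(delays)]
    )
    elapsed = time.perf_counter() - start
    return results, elapsed, sum(delays)

# use `await main()` in a notebook cell
results, elapsed, sequential = asyncio.run(main())
print(results)  # [0, 1, 2]
print(f'{elapsed:.2f}s concurrently versus {sequential:.2f}s if run one after another')
```

The total time is close to the slowest single request rather than the sum of all three.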


### Managing Concurrency

In some cases, doing all the tasks concurrently can work. Usually there are limitations, though. Waiting on an API to respond does not put a burden on the local compute, so managing lots of requests may not be an issue on the client side.  It can still be helpful to put limits on concurrency for managing the requests.  A first step to limiting concurrency is using a tool like [asyncio.Semaphore](https://docs.python.org/3/library/asyncio-sync.html#semaphore) to manage a counter of current concurrent requests.

The following builds a function that manages the full list of requests and uses a semaphore to control the concurrency.  Think of this as the concurrency buffer limit.

In [154]:
async def asyncPredictions(instances, batch_size = 1, limit_concur_request = 10):
    limit = asyncio.Semaphore(limit_concur_request)
    # requests come back out of order so create an ordered structure to capture them
    predictions = [None] * len(instances)
    
    # function to make prediction requests
    async def predictor(p):
        async with limit:
            if limit.locked():
                await asyncio.sleep(.01)
            preds = await endpoint.predict_async(instances = instances[p:p+batch_size])
        predictions[p:p+batch_size] = [np.argmax(pred) for pred in preds.predictions]

    # manage tasks
    tasks = [asyncio.create_task(predictor(p)) for p in range(0, len(instances), batch_size)]
    start = time.perf_counter()
    responses = await asyncio.gather(*tasks)
    elapsed = time.perf_counter() - start
    print(f'{elapsed:0.5f} seconds for {len(instances)} instances in asynchronous batches of size = {batch_size} managed within {limit_concur_request} concurrent requests')
    
    return predictions
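
To check that a semaphore really caps the number of requests in flight, the pattern can be exercised with a stand-in request that only sleeps. This is a sketch under that assumption: `run_with_limit` and `fake_request` are hypothetical names, and the peak counter is bookkeeping added for the demonstration.

```python
import asyncio

async def run_with_limit(n_requests, limit_concur_request):
    limit = asyncio.Semaphore(limit_concur_request)
    in_flight = 0  # requests currently holding the semaphore
    peak = 0       # highest value in_flight ever reached

    async def fake_request(i):
        nonlocal in_flight, peak
        async with limit:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the network round trip
            in_flight -= 1
        return i

    results = await asyncio.gather(*[fake_request(i) for i in range(n_requests)])
    return results, peak

# use `await run_with_limit(...)` in a notebook cell
results, peak = asyncio.run(run_with_limit(n_requests=50, limit_concur_request=5))
print(f'{len(results)} requests completed, at most {peak} in flight at once')
```

All 50 coroutines are created up front, but only 5 are ever inside the `async with` block at the same time.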

In [156]:
# force synchronous request in batch_size = 1
predictions = await asyncPredictions(newobs, batch_size = 1, limit_concur_request = 1)

244.08676 seconds for 10000 instances in asynchronous batches of size = 1 managed within 1 concurrent requests


In [157]:
# force synchronous request in batch_size = 2
predictions = await asyncPredictions(newobs, batch_size = 2, limit_concur_request = 1)

126.50838 seconds for 10000 instances in asynchronous batches of size = 2 managed within 1 concurrent requests


In [158]:
# force synchronous request in batch_size = 10
predictions = await asyncPredictions(newobs, batch_size = 10, limit_concur_request = 1)

30.81778 seconds for 10000 instances in asynchronous batches of size = 10 managed within 1 concurrent requests


In [159]:
# force asynchronous with 2 concurrent and batch_size = 1
predictions = await asyncPredictions(newobs, batch_size = 1, limit_concur_request = 2)

123.37394 seconds for 10000 instances in asynchronous batches of size = 1 managed within 2 concurrent requests


In [160]:
# force asynchronous with 10 concurrent and batch_size = 1
predictions = await asyncPredictions(newobs, batch_size = 1, limit_concur_request = 10)

25.57906 seconds for 10000 instances in asynchronous batches of size = 1 managed within 10 concurrent requests


In [161]:
# force asynchronous with 20 concurrent and batch_size = 1
predictions = await asyncPredictions(newobs, batch_size = 1, limit_concur_request = 20)

14.61041 seconds for 10000 instances in asynchronous batches of size = 1 managed within 20 concurrent requests


In [162]:
# force asynchronous with 20 concurrent and batch_size = 2
predictions = await asyncPredictions(newobs, batch_size = 2, limit_concur_request = 20)

8.82831 seconds for 10000 instances in asynchronous batches of size = 2 managed within 20 concurrent requests


In [163]:
# 10 concurrent requests with batch_size = 100 (100 requests in total)
predictions = await asyncPredictions(newobs, batch_size = 100, limit_concur_request = 10)

4.09104 seconds for 10000 instances in asynchronous batches of size = 100 managed within 10 concurrent requests


In [164]:
# 100 concurrent requests with batch_size = 10 (1,000 requests in total)
predictions = await asyncPredictions(newobs, batch_size = 10, limit_concur_request = 100)

4.64933 seconds for 10000 instances in asynchronous batches of size = 10 managed within 100 concurrent requests


In [165]:
# get a count of the number of predictions that resulted in 0 (not fraud) and 1 (fraud)
from collections import Counter
c = Counter(predictions)
c

Counter({0: 9980, 1: 20})

In [166]:
# get the index for the predictions that resulted in predicting = 1 (fraud)
[i for i, j in enumerate(predictions) if j == 1]

[53,
 576,
 3371,
 3958,
 3995,
 4028,
 4212,
 4292,
 4293,
 4350,
 4500,
 4532,
 4654,
 4707,
 4738,
 4892,
 5207,
 7052,
 8778,
 9070]

In [167]:
# review the inputs that lead to a prediction = 1 (fraud)
samples.iloc[[i for i, j in enumerate(predictions) if j == 1]]

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | 85285 | -7.030308 | 3.421991 | -9.525072 | 5.270891 | -4.02463 | -2.865682 | -6.989195 | 3.791551 | -4.62273 | ... | 0.545698 | 1.103398 | -0.541855 | 0.036943 | -0.355519 | 0.353634 | 1.042458 | 1.359516 | -0.272188 | 0.0 |
| 576 | 56887 | -0.075483 | 1.812355 | -2.566981 | 4.127549 | -1.628532 | -0.805895 | -3.390135 | 1.019353 | -2.451251 | ... | 0.338598 | 0.794372 | 0.270471 | -0.143624 | 0.013566 | 0.634203 | 0.213693 | 0.773625 | 0.387434 | 5.0 |
| 3371 | 43369 | -3.365319 | 2.426503 | -3.752227 | 0.276017 | -2.30587 | -1.961578 | -3.029283 | -1.674462 | 0.183961 | ... | -0.036837 | 2.070008 | -0.512626 | -0.248502 | 0.12655 | 0.104166 | -1.055997 | -1.200165 | -1.012066 | 88.0 |
| 3958 | 143354 | 1.118331 | 2.074439 | -3.837518 | 5.44806 | 0.071816 | -1.020509 | -1.808574 | 0.521744 | -2.032638 | ... | 0.163513 | 0.289861 | -0.172718 | -0.02191 | -0.37656 | 0.192817 | 0.114107 | 0.500996 | 0.259533 | 1.0 |
| 3995 | 93888 | -10.040631 | 6.139183 | -12.972972 | 7.740555 | -8.684705 | -3.837429 | -11.907702 | 5.833273 | -5.731054 | ... | -0.082275 | 2.823431 | 1.153005 | -0.567343 | 0.843012 | 0.549938 | 0.113892 | -0.307375 | 0.061631 | 1.0 |
| 4028 | 20332 | -15.271362 | 8.326581 | -22.338591 | 11.885313 | -8.721334 | -2.324307 | -16.196419 | 0.512882 | -6.333685 | ... | 0.993121 | -2.356896 | 1.068019 | 1.085617 | -1.039797 | -0.182006 | 0.649921 | 2.149247 | -1.406811 | 1.0 |
| 4212 | 7551 | 0.316459 | 3.809076 | -5.615159 | 6.047445 | 1.554026 | -2.651353 | -0.746579 | 0.055586 | -2.678679 | ... | 0.388307 | 0.208828 | -0.511747 | -0.583813 | -0.219845 | 1.474753 | 0.491192 | 0.518868 | 0.402528 | 1.0 |
| 4292 | 13126 | -2.880042 | 5.225442 | -11.06333 | 6.689951 | -5.759924 | -2.244031 | -11.199975 | 4.014722 | -3.429304 | ... | 1.191444 | 2.002883 | 0.351102 | 0.795255 | -0.778379 | -1.646815 | 0.487539 | 1.427713 | 0.583172 | 1.0 |
| 4293 | 152036 | -4.320609 | 3.199939 | -5.799736 | 6.50233 | 0.378479 | -1.948246 | -2.16786 | -0.728207 | -1.977238 | ... | -0.263686 | 0.47666 | 0.434278 | -0.13694 | -0.620072 | 0.642531 | 0.280717 | -2.649107 | 0.533641 | 1.0 |
| 4350 | 100298 | -22.341889 | 15.536133 | -22.865228 | 7.043374 | -14.183129 | -0.463145 | -28.215112 | -14.607791 | -9.481456 | ... | 4.100019 | -9.110423 | 4.158895 | 1.412928 | 0.382801 | 0.447154 | -0.632816 | -4.380154 | -0.467863 | 1.0 |


### Handling Errors With Retries

In application design it is not always possible to control, at the per-minute level, every application that makes requests of a resource.

The two most common errors that will be faced are:
- `ServiceUnavailable: 503 Model server connection error. Please retry.`
- `ResourceExhausted: 429 Quota exceeded for `
    - There are [quotas and limits](https://cloud.google.com/vertex-ai/docs/quotas) that restrict the project's use of models, specifically the number of requests per minute.  Across all models in a project there is a very high limit for predictions per minute in a region (30,000).

Techniques are needed to detect these and retry after waiting.  The section below shows using asynchronous requests, as covered above, as well as forcing and handling errors in serving.

#### Force Error

Running the cell below, which sends 10,000 requests concurrently, will lead to a `503` service unavailable error:

In [177]:
try:
    predictions = await asyncPredictions(newobs[0:1]*10000, batch_size = 1, limit_concur_request = 10000)
except Exception as err:
    print(f"{type(err).__name__} was raised: {err}")

ServiceUnavailable was raised: 503 Model server connection error. Please retry. endpoint_id: 725723853820526592, deployed_model_id: 2423682068408958976


#### Detect Errors and Retry:

Recall the function above that managed concurrent requests and modify it to detect the errors and retry with graduated backoff intervals (exponential backoff):
- sets a limit on the retries, 20 in this case
- increments the wait time for each retry, exponential backoff in this case
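
The wait schedule for a capped exponential backoff can be sketched on its own. The helper name `backoff_seconds` and the cap of `2**5 = 32` seconds are assumptions for illustration. Note that Python's power operator is `**`; the `^` operator is a bitwise XOR, and using it in the same expression gives a schedule that does not grow steadily.

```python
def backoff_seconds(fail_count, cap_exponent=5):
    """Seconds to wait after the fail_count-th failure (1-based): 1, 2, 4, ... capped at 2**cap_exponent."""
    return 2 ** min(fail_count - 1, cap_exponent)

schedule = [backoff_seconds(n) for n in range(1, 9)]
print(schedule)  # [1, 2, 4, 8, 16, 32, 32, 32]

# the same expression written with ^ is a bitwise XOR, not a power
xor_schedule = [2 ^ min(n - 1, 5) for n in range(1, 9)]
print(xor_schedule)  # [2, 3, 0, 1, 6, 7, 7, 7]
```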

In [178]:
async def asyncPredictions(instances, batch_size = 1, limit_concur_request = 10):
    limit = asyncio.Semaphore(limit_concur_request)
    # requests come back out of order so create an ordered structure to capture them
    predictions = [None] * len(instances)
    
    # function to make prediction requests
    async def predictor(p):
        async with limit:
            if limit.locked():
                await asyncio.sleep(.01)
            ########################### ERROR HANDLING ###########################
            fail_count = 0
            while fail_count <= 20:
                try:
                    preds = await endpoint.predict_async(instances = instances[p:p+batch_size])
                    if fail_count > 0:
                        print(f'Item {p} succeed after fail count = {fail_count}')
                    break
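                except Exception:
                    fail_count += 1
                    print(f'Item {p} failed: current fail count = {fail_count}')
                    # exponential backoff: 1, 2, 4, 8, 16 seconds, then capped at 32 (** is power; ^ is XOR)
                    await asyncio.sleep(2**(min(fail_count, 6) - 1))
            else:
                # every attempt failed: raise instead of continuing with preds undefined
                raise RuntimeError(f'Item {p} failed after {fail_count} attempts')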
            ######################################################################
        predictions[p:p+batch_size] = [np.argmax(pred) for pred in preds.predictions]

    # manage tasks
    tasks = [asyncio.create_task(predictor(p)) for p in range(0, len(instances), batch_size)]
    start = time.perf_counter()
    responses = await asyncio.gather(*tasks)
    elapsed = time.perf_counter() - start
    print(f'{elapsed:0.5f} seconds for {len(instances)} instances in asynchronous batches of size = {batch_size} managed within {limit_concur_request} concurrent requests')
    
    return predictions

In [180]:
# 2000 concurrent requests, each with 1 instance, sustained for 20,000 requests
predictions = await asyncPredictions(newobs[0:1]*20000, batch_size = 1, limit_concur_request = 2000)

Item 156 failed: current fail count = 1
Item 8 failed: current fail count = 1
Item 82 failed: current fail count = 1
Item 128 failed: current fail count = 1
Item 53 failed: current fail count = 1
Item 13 failed: current fail count = 1
Item 120 failed: current fail count = 1
Item 151 failed: current fail count = 1
Item 13 succeed after fail count = 1
Item 8 succeed after fail count = 1
Item 128 succeed after fail count = 1
Item 156 succeed after fail count = 1
Item 120 succeed after fail count = 1
Item 82 succeed after fail count = 1
Item 151 succeed after fail count = 1
Item 53 succeed after fail count = 1
23.03266 seconds for 20000 instances in asynchronous batches of size = 1 managed within 2000 concurrent requests


In [181]:
len(predictions)

20000

In [182]:
predictions[0]

0
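
A variation worth considering is to retry only the errors known to be transient and to add random jitter so that many clients do not retry at the same instant. The sketch below is an illustration under stated assumptions, not the SDK's own retry mechanism: `call_with_retries` is a hypothetical helper, the two exception classes are local stand-ins (in real use the classes of the same names from `google.api_core.exceptions` would presumably take their place), and the delays are shortened so the example finishes quickly.

```python
import asyncio
import random

class ServiceUnavailable(Exception):
    """Local stand-in for the 503 error class (assumption for this sketch)."""

class ResourceExhausted(Exception):
    """Local stand-in for the 429 error class (assumption for this sketch)."""

RETRYABLE = (ServiceUnavailable, ResourceExhausted)

async def call_with_retries(make_call, max_retries=20, base=0.01, cap=0.32):
    """Await make_call(); on a retryable error, wait with capped exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except RETRYABLE:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            wait = min(cap, base * 2 ** attempt)
            await asyncio.sleep(wait * random.uniform(0.5, 1.0))

calls = {"count": 0}

async def flaky_call():
    """Fails twice with a 503-style error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ServiceUnavailable("503 Model server connection error. Please retry.")
    return [[0.9, 0.1]]

# use `await call_with_retries(flaky_call)` in a notebook cell
result = asyncio.run(call_with_retries(flaky_call))
print(result, calls["count"])
```

Any other exception type, such as a malformed request, propagates immediately instead of being retried.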