<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/legacy/Python%20Asynchronous%20API%20Calls.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2Flegacy%2FPython%2520Asynchronous%2520API%2520Calls.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/legacy/Python%20Asynchronous%20API%20Calls.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/legacy/Python%20Asynchronous%20API%20Calls.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

---

**File Move Notices**

This file moved locations:
- On 10/13/2024 (mm/dd/yyyy)
	- From: `Applied GenAI/Python Asynchronous API Calls.ipynb`
	- To: `Applied GenAI/legacy/Python Asynchronous API Calls.ipynb`
---
<!---end of move notices--->

# Python Asynchronous API Calls

Methods for making asynchronous API calls, along with managing concurrent requests and handling errors.

To illustrate the concepts, the Vertex AI SDK will be used to make synchronous and asynchronous requests to generative AI APIs for Gemini and PaLM.  The concepts and solutions for managing concurrency and errors apply to any API with an asynchronous client.

The example below starts by requesting a list of vocabulary words.  This is a good synchronous task because it is a single request.  It is followed by the task of requesting a definition for each word.  A synchronous approach to this would be time consuming.  Switching to an asynchronous approach allows requesting many definitions at the same time.  However, this introduces the need to manage concurrency: how many simultaneous requests are being made.  As more requests are made, the chance of hitting quota limits increases, especially in a shared environment where multiple applications make calls.  The approach to managing concurrency is then extended to include error handling and retries.

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Python%20Asynchronous%20API%20Calls.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs

The list `packages` contains tuples of package import names and install names, with an optional minimum version as a third element.  If the import name is not found then the install name is used to install quietly for the current user.

In [1]:
# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform')
]

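# import the submodules explicitly: `import importlib` alone does not guarantee they are loaded
import importlib.util
import importlib.metadata
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # installed versions are looked up by distribution (install) name, not import name
        # note: this is a plain string comparison of the version numbers
        if importlib.metadata.version(package[1]) < package[2]:
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user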
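
The minimum-version check above compares version strings directly, which can mis-order versions such as `1.9.0` and `1.38.0`.  The following is a minimal sketch of a numeric comparison using only the standard library.  The helper names `version_tuple` and `needs_update` are hypothetical and not part of this notebook, and pre-release suffixes are simply ignored; the `packaging.version.Version` class is a more complete option when the `packaging` library is available.

```python
def version_tuple(version: str) -> tuple:
    """Convert a dotted version such as '1.38.1' to (1, 38, 1).

    Only the leading digits of each part are used; suffixes like 'rc1' are ignored.
    """
    parts = []
    for part in version.split('.'):
        digits = ''
        for ch in part:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def needs_update(installed: str, minimum: str) -> bool:
    """Return True when the installed version is numerically below the minimum."""
    return version_tuple(installed) < version_tuple(minimum)


# string comparison goes character by character, so '9' sorts after '3'
print('1.9.0' < '1.38.0')
# numeric comparison treats 9 < 38 as expected
print(needs_update('1.9.0', '1.38.0'))
```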

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [2]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

inputs:

In [3]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [4]:
REGION = 'us-central1'
SERIES = 'tips'
EXPERIMENT = 'async-api'

packages:

In [5]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import asyncio

import vertexai.language_models # PaLM and Codey Models
import vertexai.generative_models # for Gemini Models

clients:

In [6]:
vertexai.init(project = PROJECT_ID, location = REGION)

---
## Synchronous Use Of APIs - Using Vertex AI Generative AI Models

To get started, the [Vertex AI SDK for Python](https://cloud.google.com/python/docs/reference/aiplatform/latest) will be used to make requests using the generative AI APIs for PaLM and Gemini.
- [Vertex AI SDK for Python](https://cloud.google.com/python/docs/reference/aiplatform/latest)
- [Gemini Class Overview](https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/sdk-for-gemini/gemini-sdk-overview-reference)
- [PaLM Text Model Classes](https://cloud.google.com/vertex-ai/docs/generative-ai/sdk-for-llm/sdk-use-text-models)

### Generate A List of Vocabulary Words - With Gemini

Connect to the Gemini Model API:

In [7]:
gemini_model = vertexai.generative_models.GenerativeModel("gemini-1.0-pro")

Request a list of vocabulary words:

In [8]:
vocab_words = gemini_model.generate_content(
    [
        "I need a long list of vocabulary words to study for the SAT.",
        "Respond with only a comma separated list of words."
    ],
    generation_config = dict(max_output_tokens = 8000, temperature = 0.5)
)

In [9]:
vocab_words.text

'abrogate, abscond, accede, acquiesce, acrimony, adjudicate, adumbrate, affable, alacrity, altruistic, ambiguous, ameliorate, anachronism, anathema, antithesis, apocryphal, apothegm, approbation, arcane, ardor, assiduous, audacious, autonomy, avarice, avidity, avuncular, benign, bombastic, bucolic, capricious, catharsis, circumlocution, clairvoyance, cogent, commensurate, complacence, compunction, condone, conflagration, congruity, connoisseur, contentious, contrite, copious, cosmopolitan, craven, credulous, crescendo, crux, cryptic, culpability, cynical, magnanimous, maudlin, mellifluous, mendacity, meticulous, miasma, myriad, nascent, nemesis, neophyte, obsequious, obstreperous, obtuse, odious, officious, ominous, opulent, opprobrium, osmotic, ostentatious, pacific, palliative, panacea, pandemonium, pariah, paucity, peccadillo, pedantic, penitent, perfidious, perfidy, perfunctory, peripatetic, perspicacious, philanderer, phlegmatic, plethora, poignant, polemic, portentous, prevaricat

Reformat the list of words as a Python list:

In [10]:
vocab_words = [word.strip() for word in vocab_words.text.split(',')]

In [11]:
vocab_words[0:10] + [f'... ({len(vocab_words) - 20} more words)'] + vocab_words[-10:]

['abrogate',
 'abscond',
 'accede',
 'acquiesce',
 'acrimony',
 'adjudicate',
 'adumbrate',
 'affable',
 'alacrity',
 'altruistic',
 '... (154 more words)',
 'vestige',
 'vicarious',
 'vicissitude',
 'vindictive',
 'virile',
 'vituperate',
 'voluble',
 'voracious',
 'whimsical',
 'zephyr']
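
A model response is not guaranteed to be a clean list: it may contain empty entries, repeated words, or stray line breaks.  The following is a small sketch of a more defensive parser.  The function name `parse_word_list` and the sample text are made up for illustration, and lowercasing every word is an assumption that may not suit every use.

```python
def parse_word_list(text: str) -> list:
    """Split a comma separated response into unique, non-empty, lowercase words."""
    seen = set()
    words = []
    for raw in text.split(','):
        word = raw.strip().lower()
        # skip empty entries and words already collected, keeping first-seen order
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


# made-up sample with an empty entry, a duplicate, and a line break
sample = 'abrogate, abscond, , Abscond,accede,\n zephyr'
print(parse_word_list(sample))
```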

### Get Word Definitions - With Gemini

Request a definition for the first vocabulary word:

In [23]:
print(
    gemini_model.generate_content(
        ['Describe the word', vocab_words[0], 'in a way that will make it easy to remember.']
    ).text
)

**Remember Abrogate as "Rocking the Boat"**

**Break down the root:**

* **"Rocking"** (ro-): rocking, shaking, or upsetting
* **"boat"** (gat-): something that conveys or supports

**Put it together:**

**Abrogate (v.)** means to repeal or cancel something formally, essentially "rocking the boat" and upsetting the established order.

**Mnemonic:**

Imagine a boat that's been firmly anchored. When someone abrogates a law or agreement, it's like they're suddenly rocking the boat and causing it to list or capsize. The consequences can be significant, especially if the boat represents something important or stable.


### Get Word Definitions - With PaLM

Connect to the PaLM Model API:

In [12]:
palm_model = vertexai.language_models.TextGenerationModel.from_pretrained("text-bison@002")

Request a definition for the first vocabulary word:

In [13]:
palm_model.predict(prompt = f'Describe the word {vocab_words[0]} in a way that will make it easy to remember.  Then, provide a definition of the word.', max_output_tokens = 500)

 **Mnemonic:** 
To remember the word "abrogate," think of a "broken gate." Just as a broken gate can no longer serve its purpose, something that is abrogated is no longer in effect.

**Definition:**
Abrogate means to repeal or annul a law, treaty, or agreement. It is a formal term often used in legal or political contexts to describe the official termination or cancellation of a previously established rule or arrangement.

### Definitions For Many Words

What if the task changes to require multiple calls, like requesting the definition for many words?  If the list is short or timing is not important then making synchronous calls, one at a time, may work.  In the following example the `predict_streaming()` method is used so that results appear as they are generated by the API.
- [Streaming text generation](https://cloud.google.com/vertex-ai/docs/generative-ai/sdk-for-llm/sdk-use-text-models#stream-text-generation-sdk)
- [`.predict_streaming()` method](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextGenerationModel#vertexai_language_models_TextGenerationModel_predict_streaming)

In [14]:
for word in vocab_words[0:5]:
    print(f'Results for {word}:')
    for r in palm_model.predict_streaming(
        prompt = f'Describe the word {word} in a way that will make it easy to remember.  Then, provide a definition of the word.',
        max_output_tokens = 500
    ):
        print(r)
    print('-'*100)

Results for abrogate:
 **Mnemonic:** 
To remember the word "abrogate," think of a "broken
 gate." Just as a broken gate can no longer serve its purpose, something that is abrogated is
 no longer in effect.

**Definition:**
Abrogate means to repeal or annul a law, treaty,
 or agreement. It is a formal term often used in legal or political contexts to describe the official termination
 or cancellation of a previously established rule or arrangement.
----------------------------------------------------------------------------------------------------
Results for abscond:
 **Mnemonic:** 
Imagine a thief quickly running away from the scene of a crime, leaving no trace behind.
 The word "abscond" sounds similar to "absent," which is what the thief is from the scene
.

**Definition:** 
To leave suddenly and secretly, especially in order to escape
 from danger or avoid arrest.
----------------------------------------------------------------------------------------------------
Results for accede

## Asynchronous Use of APIs - Using Vertex AI Generative AI Models

To request the definition for all words in the vocabulary list it will be beneficial to make the requests asynchronously, so that many are in flight at the same time.  Some APIs have separate clients for asynchronous requests.  In the case of the PaLM model APIs there is a helpful asynchronous method provided: `.predict_async()`.
- [`.predict_async()` method](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextGenerationModel#vertexai_language_models_TextGenerationModel_predict_async)

### What Exactly is Async?

If we make a request with the async method the response is a [coroutine](https://docs.python.org/3/glossary.html#term-coroutine) object.  This means the method is already implemented with an `async def` statement which makes it [awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables).
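
The same behavior can be seen without calling any API.  In this minimal sketch, `fake_predict` is a made-up stand-in for an async client method.  Calling it only creates a coroutine object; the body runs when the coroutine is awaited.  The sketch uses `asyncio.run()` so that it works as a plain script; in a notebook, which already runs an event loop, `await main()` would be used instead.

```python
import asyncio
import inspect


async def fake_predict(word: str) -> str:
    """Stand-in for an async API method: waits briefly, then returns text."""
    await asyncio.sleep(0.01)
    return f'definition of {word}'


async def main():
    pending = fake_predict('abrogate')       # creates a coroutine, nothing has run yet
    is_coro = inspect.iscoroutine(pending)   # confirm what kind of object came back
    result = await pending                   # awaiting runs the body and yields the value
    return is_coro, result


is_coro, result = asyncio.run(main())
print(is_coro, result)
```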

#### With PaLM:

The following cells show using the method with, and without, an await expression:

In [15]:
palm_model.predict_async(
    prompt = f'Describe the word {vocab_words[0]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
    max_output_tokens = 500
)

<coroutine object _TextGenerationModel.predict_async at 0x7ff75556e730>

In [16]:
await palm_model.predict_async(
    prompt = f'Describe the word {vocab_words[0]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
    max_output_tokens = 500
)

 **Mnemonic:** 
To remember the word "abrogate," think of a "broken gate." Just as a broken gate can no longer serve its purpose, something that is abrogated is no longer in effect.

**Definition:**
Abrogate means to repeal or annul a law, treaty, or agreement. It is a formal term often used in legal or political contexts to describe the official termination or cancellation of a previously established rule or arrangement.

#### With Gemini:

The following cells show using the method with, and without, an await expression:

In [25]:
gemini_model.generate_content_async(
    ['Describe the word', vocab_words[0], 'in a way that will make it easy to remember.']
)

<coroutine object _GenerativeModel.generate_content_async at 0x7ff7543822d0>

In [27]:
(await gemini_model.generate_content_async(
    ['Describe the word', vocab_words[0], 'in a way that will make it easy to remember.']
)).text

'**Abrogate** sounds like "a broken gate." Just like a broken gate can\'t keep something in or out, abrogate means to "do away with" or "cancel."'

### How To Use Async Concurrently

The previous section showed that the `predict_async()` method returns a coroutine, which is an awaitable object.  When multiple coroutines are grouped together they can be awaited together - concurrently.

To group the coroutines together use [asyncio.gather()](https://docs.python.org/3/library/asyncio-task.html#running-tasks-concurrently):

In [17]:
responses = asyncio.gather(*[
    palm_model.predict_async(
        prompt = f'Describe the word {word} in a way that will make it easy to remember.  Then, provide a definition of the word.',
        max_output_tokens = 500
    ) for word in vocab_words[0:5]
])

In [18]:
type(responses)

asyncio.tasks._GatheringFuture

To make the requests concurrent, `await` the coroutine grouping:

In [19]:
responses = await asyncio.gather(*[
    palm_model.predict_async(
        prompt = f'Describe the word {word} in a way that will make it easy to remember.  Then, provide a definition of the word.',
        max_output_tokens = 500
    ) for word in vocab_words[0:5]
])

In [20]:
type(responses), len(responses)

(list, 5)

In [21]:
for response in responses:
    print(response.text)
    print('-'*100)

 **Mnemonic:** 
To remember the word "abrogate," think of a "broken gate." Just as a broken gate can no longer serve its purpose, something that is abrogated is no longer in effect.

**Definition:**
Abrogate means to repeal or annul a law, treaty, or agreement. It is a formal term often used in legal or political contexts to describe the official termination or cancellation of a previously established rule or arrangement.
----------------------------------------------------------------------------------------------------
 **Mnemonic:** 
Imagine a thief quickly running away from the scene of a crime, leaving no trace behind. The word "abscond" sounds similar to "absent," which is what the thief is from the scene.

**Definition:** 
To leave suddenly and secretly, especially in order to escape from danger or avoid arrest.
----------------------------------------------------------------------------------------------------
 **Mnemonic:** 
*Ac*cede sounds like "**a** **see**d**."  Imagine pl
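
Two properties of `asyncio.gather()` matter for the rest of this notebook: the awaitables run concurrently, and the results come back in the order the awaitables were passed in, not the order in which they finished.  The following sketch shows both with simulated requests.  The function `fake_request` and its delays are made up for illustration.  As a script it uses `asyncio.run()`; in a notebook `await main()` would be used instead.

```python
import asyncio
import time


async def fake_request(word: str, delay: float) -> str:
    """Simulate an API call that takes `delay` seconds to respond."""
    await asyncio.sleep(delay)
    return word.upper()


async def main():
    words = ['abrogate', 'abscond', 'accede']
    delays = [0.3, 0.1, 0.2]   # the first request is the slowest to finish
    start = time.perf_counter()
    results = await asyncio.gather(
        *[fake_request(w, d) for w, d in zip(words, delays)]
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results)                        # same order as `words`
print(f'{elapsed:.2f} seconds')       # close to the longest delay, not the sum
```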

## Managing Concurrency

In some cases, doing all the tasks concurrently can work.  Usually, there are limitations though.  Waiting on an API to respond does not put a burden on the local compute, so managing lots of requests may not be an issue on the client side.  It can still be helpful to put limits on concurrency for managing the requests.  A first step to limiting concurrency is using a tool like [asyncio.Semaphore](https://docs.python.org/3/library/asyncio-sync.html#semaphore) to manage a counter of current concurrent requests.

The following builds a function that manages the full list of requests and uses a semaphore to control the concurrency.  Think of this as the concurrency buffer limit.

In [24]:
async def study_notes(instances, limit_concur_requests = 10):
    limit = asyncio.Semaphore(limit_concur_requests)
    results = [None] * len(instances)
    
    # make requests
    async def make_request(p):
        async with limit:
            if limit.locked():
                await asyncio.sleep(.01)
            result = await palm_model.predict_async(
                                prompt = f'Describe the word {instances[p]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
                                max_output_tokens = 500
                            )
        results[p] = (instances[p], result.text)
        
    # manage tasks
    tasks = [asyncio.create_task(make_request(p)) for p in range(len(instances))]
    responses = await asyncio.gather(*tasks)
    
    return results

In [25]:
responses = await study_notes(vocab_words[0:20])

In [26]:
type(responses), type(responses[0]), len(responses)

(list, tuple, 20)

In [27]:
print(responses[-1][0])
print(responses[-1][1])

approbation
 **Mnemonic:** Think of "approbation" as "approval with a pat on the back."

**Definition:** Approbation is the expression of approval or praise, often accompanied by a sense of admiration or respect. It is a positive evaluation or endorsement of someone or something, typically given for their actions, achievements, or qualities. Approbation can be conveyed through words, gestures, or actions that demonstrate appreciation, support, or encouragement.
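
To check that a semaphore really caps the number of requests in flight, the following sketch replaces the API call with a short sleep and records the peak number of simultaneously active requests.  The names `run_limited` and `fake_request` are made up for this illustration.  As a script it uses `asyncio.run()`; in a notebook `await run_limited(...)` would be used instead.

```python
import asyncio


async def run_limited(n_tasks: int, limit_concur_requests: int):
    limit = asyncio.Semaphore(limit_concur_requests)
    active = 0   # requests currently inside the semaphore
    peak = 0     # highest value of `active` observed

    async def fake_request(i: int) -> int:
        nonlocal active, peak
        async with limit:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)   # simulated API latency
            active -= 1
        return i * i

    results = await asyncio.gather(*[fake_request(i) for i in range(n_tasks)])
    return results, peak


results, peak = asyncio.run(run_limited(n_tasks=25, limit_concur_requests=10))
print(f'peak concurrent requests: {peak}')
print(results[:5])
```

All 25 tasks are created at once, but the peak never rises above the semaphore value.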


## Managing Concurrency - With Limits

Just managing the concurrency may not be enough.  In cases where an API has limits, the total requests need to stay under these limits to prevent errors.  In this example, the PaLM model is limited by requests per minute.  The default per project is 60 requests per minute for the model used here ('text-bison@002').  See [Quotas and limits](https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai).

The following modifies the previous function to also incorporate a time based limit for requests.

In [28]:
async def study_notes(instances, limit_concur_requests = 10, limit_per_minute = 60):
    limit = asyncio.Semaphore(limit_concur_requests)
    results = [None] * len(instances)
    
    # make requests
    async def make_request(p):
        
        # pause for time based limit
        if p >= limit_per_minute:
            await asyncio.sleep(60 * (p // limit_per_minute))
        
        async with limit:
            if limit.locked():
                await asyncio.sleep(.01)
            result = await palm_model.predict_async(
                                prompt = f'Describe the word {instances[p]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
                                max_output_tokens = 500
                            )
        results[p] = (instances[p], result.text)
        
    # manage tasks
    tasks = [asyncio.create_task(make_request(p)) for p in range(len(instances))]
    responses = await asyncio.gather(*tasks)
    
    return results
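
The time-based pause is easiest to see as arithmetic.  Each request waits `60 * (p // limit_per_minute)` seconds before it tries to acquire the semaphore, so requests are released in batches of `limit_per_minute`, one batch per minute.  The helper `start_delay` below is a hypothetical name that reproduces only that expression:

```python
from collections import Counter


def start_delay(p: int, limit_per_minute: int = 60) -> int:
    """Seconds that request number `p` sleeps before it may start."""
    return 60 * (p // limit_per_minute)


# how many of 180 requests share each start delay
delays = Counter(start_delay(p) for p in range(180))
print(dict(delays))   # {0: 60, 60: 60, 120: 60}
```

This spaces out when each batch starts; it does not track when requests actually complete.  Requests from one batch that are still waiting on the semaphore when the next batch is released may overlap with it, so occasional quota errors remain possible.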

Try the function under the limit:

In [29]:
responses = await study_notes(vocab_words[0:20])

In [30]:
type(responses), len(responses)

(list, 20)

In [31]:
print(responses[-1][0])
print(responses[-1][1])

approbation
 **Mnemonic:** Think of "approbation" as "approval with a pat on the back."

**Definition:** Approbation is the expression of approval or praise, often accompanied by a sense of admiration or respect. It is a positive evaluation or endorsement of someone or something, typically given for their actions, achievements, or qualities. Approbation can be conveyed through words, gestures, or actions that demonstrate appreciation, support, or encouragement.


Try the function just over the limit:

In [32]:
# wait a minute for the quota to clear - assumes no other activity in the project
await asyncio.sleep(60)

In [33]:
responses = await study_notes(vocab_words[0:65])

In [34]:
type(responses), len(responses)

(list, 65)

In [35]:
print(responses[-1][0])
print(responses[-1][1])

centrifugal
 **Mnemonic:** Think of a "centrifuge," which is a machine that uses centrifugal force to separate materials. The word "centrifugal" contains the root word "center," which is related to the idea of moving away from the center.

**Definition:** Centrifugal refers to the force or tendency of an object to move away from the center of rotation or curvature. It is the opposite of centripetal force, which pulls objects towards the center.


Try the function at triple the limit:

In [36]:
# wait a minute for the quota to clear - assumes no other activity in the project
await asyncio.sleep(60)

In [37]:
responses = await study_notes(vocab_words[0:180])

In [38]:
type(responses), len(responses)

(list, 180)

In [39]:
print(responses[-1][0])
print(responses[-1][1])

esoteric
 **Mnemonic:** Esoteric sounds like "secret."

**Definition:** Esoteric means difficult to understand or know; obscure.


## Managing Concurrency - With Limits And Error Handling

Sometimes handling concurrency and limits is still not enough.  For example, in a shared environment it may not be possible to know how many other applications are making requests in the same time frame.  In some cases clients have retry methods built in.  In other cases errors are returned and the calling application has to handle them.

The following will further modify the function to handle error responses by retrying with an increasing wait time.

First, force an error by exceeding the limit:

In [40]:
try:
    # setting the limit_per_minute to 80, higher than the actual limit of 60
    responses = await study_notes(vocab_words[0:80], limit_per_minute = 80)
except Exception as err:
    print(f"{type(err).__name__} was raised: {err}")

ResourceExhausted was raised: 429 Quota exceeded for aiplatform.googleapis.com/online_prediction_requests_per_base_model with base model: text-bison. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai.


Now, modify the function to capture the error and retry with incrementing wait times.  The method used below does two things:
- sets a limit on the retries, 20 in this case
- increments the wait time for each retry, exponential backoff in this case

In [41]:
async def study_notes(instances, limit_concur_requests = 10, limit_per_minute = 60):
    limit = asyncio.Semaphore(limit_concur_requests)
    results = [None] * len(instances)
    
    # make requests
    async def make_request(p):
        
        # pause for time based limit
        if p >= limit_per_minute:
            await asyncio.sleep(60 * (p // limit_per_minute))
        
        async with limit:
            if limit.locked():
                await asyncio.sleep(.01)
            ########## ERROR HANDLING ##################################
            fail_count = 0
            while fail_count <= 20:
                try:
                    result = await palm_model.predict_async(
                                        prompt = f'Describe the word {instances[p]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
                                        max_output_tokens = 500
                                    )
                    if fail_count > 0:
                        print(f'Item {p} succeed after fail count = {fail_count}')
                    break
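                except Exception:
                    fail_count += 1
                    print(f'Item {p} failed: current fail count = {fail_count}')
                    # exponential backoff: 1, 2, 4, ... capped at 32 seconds (** is power, ^ is bitwise XOR)
                    await asyncio.sleep(2 ** (min(fail_count, 6) - 1))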
            ############################################################
        results[p] = (instances[p], result.text)
        
    # manage tasks
    tasks = [asyncio.create_task(make_request(p)) for p in range(len(instances))]
    responses = await asyncio.gather(*tasks)
    
    return results
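
The retry pattern can be isolated from the API and tested with a simulated failure.  The following is a minimal sketch, not the notebook's function: `backoff_seconds`, `call_with_retries`, `flaky_call`, and the `time_scale` parameter are hypothetical names introduced here.  In Python `**` is the power operator while `^` is bitwise XOR, so the wait times are computed with `**`.  Unlike the function above, this sketch re-raises the last error once the retries are used up.

```python
import asyncio


def backoff_seconds(fail_count: int, cap: int = 6) -> int:
    """Exponential backoff: 1, 2, 4, ... seconds, capped at 2 ** (cap - 1)."""
    return 2 ** (min(fail_count, cap) - 1)


async def call_with_retries(make_call, max_retries: int = 20, time_scale: float = 1.0):
    """Await make_call() until it succeeds or max_retries failures are exceeded."""
    fail_count = 0
    while True:
        try:
            return await make_call()
        except Exception:
            fail_count += 1
            if fail_count > max_retries:
                raise   # give up and surface the last error
            await asyncio.sleep(time_scale * backoff_seconds(fail_count))


attempts = {'count': 0}


async def flaky_call() -> str:
    """Simulated request that fails twice, then succeeds."""
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RuntimeError('simulated quota error')
    return 'ok'


schedule = [backoff_seconds(n) for n in range(1, 9)]
# time_scale shrinks the waits so the demonstration finishes quickly
result = asyncio.run(call_with_retries(flaky_call, time_scale=0.001))
print(schedule)                     # [1, 2, 4, 8, 16, 32, 32, 32]
print(result, attempts['count'])    # ok 3
```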

Try 200 words with the correct limit:

In [42]:
# wait a minute for the quota to clear - assumes no other activity in the project
await asyncio.sleep(60)

In [43]:
responses = await study_notes(vocab_words[0:200])

Item 124 failed: current fail count = 1
Item 125 failed: current fail count = 1
Item 122 failed: current fail count = 1
Item 132 failed: current fail count = 1
Item 133 failed: current fail count = 1
Item 125 succeed after fail count = 1
Item 122 succeed after fail count = 1
Item 124 succeed after fail count = 1
Item 133 succeed after fail count = 1
Item 132 succeed after fail count = 1


In [44]:
type(responses), len(responses)

(list, 200)

In [45]:
print(responses[-1][0])
print(responses[-1][1])

felonious
 **Felonious**

**Mnemonic:** A felonious act is a serious crime, like a felony.

**Definition:** Relating to or constituting a felony.


Now, try 200 words but force errors by setting the limit higher than the actual (60):

In [46]:
# wait a minute for the quota to clear - assumes no other activity in the project
await asyncio.sleep(60)

In [47]:
# setting the limit_per_minute to 80, higher than the actual limit of 60
responses = await study_notes(vocab_words[0:200], limit_per_minute = 80)

Item 72 failed: current fail count = 1
Item 73 failed: current fail count = 1
Item 74 failed: current fail count = 1
Item 76 failed: current fail count = 1
Item 78 failed: current fail count = 1
Item 79 failed: current fail count = 1
Item 72 failed: current fail count = 2
Item 73 failed: current fail count = 2
Item 74 failed: current fail count = 2
Item 76 failed: current fail count = 2
Item 78 failed: current fail count = 2
Item 79 failed: current fail count = 2
Item 72 failed: current fail count = 3
Item 72 failed: current fail count = 4
Item 73 failed: current fail count = 3
Item 73 failed: current fail count = 4
Item 74 failed: current fail count = 3
Item 74 failed: current fail count = 4
Item 76 failed: current fail count = 3
Item 76 failed: current fail count = 4
Item 78 failed: current fail count = 3
Item 78 failed: current fail count = 4
Item 79 failed: current fail count = 3
Item 79 failed: current fail count = 4
Item 72 failed: current fail count = 5
Item 73 failed: current f

In [48]:
type(responses), len(responses)

(list, 200)

In [49]:
print(responses[-1][0])
print(responses[-1][1])

felonious
 **Felonious**

**Mnemonic:** A felonious act is a serious crime, like a felony.

**Definition:** Relating to or constituting a felony.
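
One case the retry loop does not cover is an item that fails on every attempt.  By default `asyncio.gather()` raises the first exception it encounters, which hides the results of the other tasks.  A possible alternative, sketched below with simulated requests, is the `return_exceptions=True` argument of `asyncio.gather()`, which returns exceptions in place of results so that successes and failures can be separated afterwards.  The function `fake_request` and its failure rule are made up for illustration.

```python
import asyncio


async def fake_request(i: int) -> str:
    """Simulated request where every third item fails."""
    await asyncio.sleep(0.01)
    if i % 3 == 0:
        raise RuntimeError(f'simulated failure for item {i}')
    return f'result {i}'


async def main():
    outcomes = await asyncio.gather(
        *[fake_request(i) for i in range(6)],
        return_exceptions=True,   # exceptions are returned, not raised
    )
    succeeded = [o for o in outcomes if not isinstance(o, Exception)]
    failed = [i for i, o in enumerate(outcomes) if isinstance(o, Exception)]
    return succeeded, failed


succeeded, failed = asyncio.run(main())
print(succeeded)   # ['result 1', 'result 2', 'result 4', 'result 5']
print(failed)      # [0, 3]
```

The indexes in `failed` could then be resubmitted later, for example after the quota window has passed.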


---
## Managing Concurrency - With Limits, Error Handling, and Regional Failover

What if the number of failed tries, `fail_count`, exceeds the threshold used?  This might be an indicator that the API is not responding.  A potential solution is switching to a different region.  The models provided by Vertex AI have [regional quota](https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai#quotas_by_region_and_model) for each region they are available in - [location](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/locations-genai). For instance, if you are working with PaLM models from `us-central1`, as in this example, then it might be feasible to also consider `us-west1`, `us-west4`, or `us-east4`. 

The following builds upon the error handling and tries an alternative region once the `fail_count` is exceeded.

In [42]:
# wait a minute for the quota to clear - assumes no other activity in the project
await asyncio.sleep(60)

First, try the current model that is being used:

In [51]:
await palm_model.predict_async(
    prompt = f'Describe the word {vocab_words[0]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
    max_output_tokens = 500
)

 **Mnemonic:** Think of a "riot" of colors.

**Definition:** A roit is a tumult or uproar, especially one involving a large number of people.

Make a backup connection to the same model:

In [52]:
vertexai.init(location = 'us-east4')
palm_model2 = vertexai.language_models.TextGenerationModel.from_pretrained("text-bison@002")

Now, try the backup model connection:

In [53]:
await palm_model2.predict_async(
    prompt = f'Describe the word {vocab_words[0]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
    max_output_tokens = 500
)

 **Mnemonic:** Think of a "riot" of colors.

**Definition:** A roit is a tumult or uproar, especially one involving a large number of people and typically characterized by violence and destruction.

What happens when trying to connect to the model in an unsupported location?

In [54]:
vertexai.init(location = 'us-east1')
try:
    palm_model2 = vertexai.language_models.TextGenerationModel.from_pretrained("text-bison@002")
except Exception as err:
    print(f"{type(err).__name__} was raised: {err}")

NotFound was raised: 404 Publisher Model `publishers/google/models/text-bison@002` is not found.


Can the primary model connection still be used?

In [55]:
await palm_model.predict_async(
    prompt = f'Describe the word {vocab_words[0]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
    max_output_tokens = 500
)

 **Mnemonic:** Think of a "riot" of colors.

**Definition:** A roit is a tumult or uproar, especially one involving a large number of people.

Now, officially set the backup to `us-east4` for use in the error handling function:

In [56]:
vertexai.init(location = 'us-east4')
palm_model2 = vertexai.language_models.TextGenerationModel.from_pretrained("text-bison@002")

Modify the function to try the backup model (location) after `fail_count >= region_check`:

In [57]:
async def study_notes(instances, limit_concur_requests = 10, limit_per_minute = 60):
    limit = asyncio.Semaphore(limit_concur_requests)
    results = [None] * len(instances)
    
    # make requests
    async def make_request(p):
        
        # pause for time based limit
        if p >= limit_per_minute:
            await asyncio.sleep(60 * (p // limit_per_minute))
        
        async with limit:
            if limit.locked():
                await asyncio.sleep(.01)
            ########## ERROR HANDLING ##################################
            fail_count = 0
            region_check = 3
            while fail_count <= 20:
                try:
                    result = await palm_model.predict_async(
                                        prompt = f'Describe the word {instances[p]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
                                        max_output_tokens = 500
                                    )
                    if fail_count > 0:
                        print(f'Item {p} succeed after fail count = {fail_count}')
                    break
                except Exception:
                    fail_count += 1
                    print(f'Item {p} failed: current fail count = {fail_count}')
                    ########## REGIONAL Failover check ########################################
                    if fail_count >= region_check:
                        try:
                            result = await palm_model2.predict_async(
                                                prompt = f'Describe the word {instances[p]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
                                                max_output_tokens = 500
                                            )
                            if fail_count > 0:
                                print(f'Item {p} succeed after fail count = {fail_count} by trying a backup region.')
                            break
                        except Exception:
                            print(f'Item {p} failed: current fail count = {fail_count}. This was a Regional Failover check.')
                    ###########################################################################
                    # exponential backoff: 1, 2, 4, ... capped at 32 seconds (** is power, ^ is bitwise XOR)
                    await asyncio.sleep(2 ** (min(fail_count, 6) - 1))
            ############################################################
        results[p] = (instances[p], result.text)
        
    # manage tasks
    tasks = [asyncio.create_task(make_request(p)) for p in range(len(instances))]
    responses = await asyncio.gather(*tasks)
    
    return results
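
The control flow of the failover can be exercised without any API by substituting simulated calls for the two model connections.  This is a simplified sketch under stated assumptions, not the function above: `call_with_failover`, `always_fails`, `always_works`, and `time_scale` are hypothetical names, and the primary is forced to fail on every attempt so that the backup path is reached.

```python
import asyncio


async def call_with_failover(primary, backup, max_retries=5, region_check=3, time_scale=1.0):
    """Retry primary(); once fail_count reaches region_check, also try backup()."""
    fail_count = 0
    while fail_count <= max_retries:
        try:
            return 'primary', await primary()
        except Exception:
            fail_count += 1
            if fail_count >= region_check:
                try:
                    return 'backup', await backup()
                except Exception:
                    pass   # backup also failed; fall through to the backoff wait
            await asyncio.sleep(time_scale * 2 ** (min(fail_count, 6) - 1))
    raise RuntimeError(f'no response after {fail_count} failed attempts')


async def always_fails() -> str:
    """Simulated primary region that is out of quota."""
    raise RuntimeError('simulated quota error in primary region')


async def always_works() -> str:
    """Simulated backup region that responds normally."""
    return 'definition from backup region'


# time_scale shrinks the waits so the demonstration finishes quickly
source, text = asyncio.run(
    call_with_failover(always_fails, always_works, time_scale=0.001)
)
print(source, text)   # backup definition from backup region
```

Returning which connection answered makes it possible to log how often the backup region is used.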

Now, try 200 words but force errors by setting the limit higher than the actual (60):

In [58]:
# wait a minute for the quota to clear - assumes no other activity in the project
await asyncio.sleep(60)

In [59]:
# setting the limit_per_minute to 80, higher than the actual limit of 60
responses = await study_notes(vocab_words[0:200], limit_per_minute = 80)

Item 71 failed: current fail count = 1
Item 72 failed: current fail count = 1
Item 75 failed: current fail count = 1
Item 77 failed: current fail count = 1
Item 78 failed: current fail count = 1
Item 79 failed: current fail count = 1
Item 71 failed: current fail count = 2
Item 72 failed: current fail count = 2
Item 75 failed: current fail count = 2
Item 77 failed: current fail count = 2
Item 78 failed: current fail count = 2
Item 79 failed: current fail count = 2
Item 72 failed: current fail count = 3
Item 75 failed: current fail count = 3
Item 77 failed: current fail count = 3
Item 71 succeed after fail count = 2
Item 72 succeed after fail count = 3 by trying a backup region.
Item 77 succeed after fail count = 3 by trying a backup region.
Item 79 succeed after fail count = 2
Item 78 succeed after fail count = 2
Item 75 succeed after fail count = 3 by trying a backup region.
Item 137 failed: current fail count = 1
Item 138 failed: current fail count = 1
Item 145 failed: current fail co

In [60]:
type(responses), len(responses)

(list, 200)

In [61]:
print(responses[-1][0])
print(responses[-1][1])

felonious
 **Felonious**

**Mnemonic:** A felonious act is a serious crime, like a felony.

**Definition:** Relating to or constituting a felony.
