
# Python Asynchronous API Calls

Methods for making asynchronous API calls, along with managing concurrent requests and handling errors.

To illustrate the concepts, the Vertex AI SDK will be used to make synchronous and asynchronous requests to the generative AI APIs for Gemini and PaLM.  The concepts and solutions for managing concurrency and errors apply to any API with an asynchronous client.

The example below starts by requesting a list of vocabulary words.  This is a good synchronous task because it is just a single request.  It is followed by the task of requesting a definition for each word.  A synchronous approach to this would be time consuming.  Switching to an asynchronous approach allows requesting many definitions at the same time.  However, this introduces the need to manage concurrency - how many simultaneous requests are being made.  As more requests are made the chance of hitting quota limits increases, especially in a shared environment with multiple applications making calls.  Managing concurrency is then extended to include error handling and retries.

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Getting%20Started%20-%20Vertex%20AI%20GenAI%20Python%20Client.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs

The list `packages` contains tuples of package import names and install names.  If the import name is not found then the install name is used to install quietly for the current user.

In [3]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform')
]

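# import the submodules explicitly: `import importlib` alone does not guarantee they are loaded
import importlib.util
import importlib.metadata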
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # metadata lookups use the install (distribution) name, not the import name
        if importlib.metadata.version(package[1]) < package[2]:
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user
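
The import-name check above can be tried on its own. The following is a minimal sketch, assuming a hypothetical helper named `is_installed` (not part of the notebook). It also guards against the `ModuleNotFoundError` that `find_spec` raises for a dotted name when the parent package is missing:

```python
import importlib.util

def is_installed(import_name):
    """Return True if the import name can be found (hypothetical helper)."""
    try:
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # raised for dotted names when a parent package is not installed
        return False

print(is_installed('asyncio'))
print(is_installed('no_such_package_xyz.submodule'))
```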

### Restart Kernel (If Installs Occurred)

After a kernel restart, code execution can resume from the cell that follows this one.

In [4]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

inputs:

In [5]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [6]:
REGION = 'us-central1'
SERIES = 'tips'
EXPERIMENT = 'async-api'

packages:

In [123]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import asyncio

import vertexai.language_models
import vertexai.preview.generative_models

clients:

In [8]:
vertexai.init(project = PROJECT_ID, location = REGION)

---
## Synchronous Use Of APIs - Using Vertex AI Generative AI Models

To get started, the [Vertex AI SDK for Python](https://cloud.google.com/python/docs/reference/aiplatform/latest) will be used to make requests using the generative AI APIs for PaLM and Gemini.
- [Vertex AI SDK for Python](https://cloud.google.com/python/docs/reference/aiplatform/latest)
- [Gemini Class Overview](https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/sdk-for-gemini/gemini-sdk-overview-reference)
- [PaLM Text Model Classes](https://cloud.google.com/vertex-ai/docs/generative-ai/sdk-for-llm/sdk-use-text-models)

### Generate A List of Vocabulary Words - With Gemini

Connect to the Gemini Model API:

In [83]:
gemini_model = vertexai.preview.generative_models.GenerativeModel("gemini-pro")

Request a list of vocabulary words:

In [103]:
vocab_words = gemini_model.generate_content(
    [
        "I need a long list of vocabulary words to study for the GMAT.",
        "Respond with only a comma separated list of words."
    ],
    generation_config = dict(max_output_tokens = 8000, temperature = 0.5)
)

In [104]:
vocab_words.text

'abrogate, acquiesce, adjudicate, aggregate, alleviate, allocate, ambivalent, ameliorate, amicable, amortize, anachronistic, analogous, anomalous, antagonistic, antiquated, apathetic, apprehensive, arbitrary, articulate, ascertain, ascetic, assiduous, audacious, auspicious, autonomous, avarice, aversion, benevolent, begrudging, bellicose, benignant, capricious, candid, cessation, circumspect, clandestine, coalesce, cognizant, commensurate, conciliatory, conducive, conundrum, copious, corroborate, covetous, credulous, dearth, debunk, decimate, decorous, deferential, definitive, deleterious, delineate, demagogue, deprecate, derogatory, descry, desolate, desuetude, didactic, diffident, diligent, disabuse, discerning, disincentive, disparage, disparate, dispassionate, disparate, disquisition, dogmatic, duplicity, eclectic, efface, efficacious, egregious, elegiac, elicit, elucidate, emaciated, empirical, enigmatic, endemic, ephemeral, equivocal, eradicate, errant, erstwhile, esoteric, ether

Reformat the list of words as a Python list:

In [105]:
vocab_words = [word.strip() for word in vocab_words.text.split(',')]

In [106]:
vocab_words[0:10] + [f'... ({len(vocab_words) - 20} more words)'] + vocab_words[-10:]

['abrogate',
 'acquiesce',
 'adjudicate',
 'aggregate',
 'alleviate',
 'allocate',
 'ambivalent',
 'ameliorate',
 'amicable',
 'amortize',
 '... (263 more words)',
 'vexatious',
 'viable',
 'vicissitude',
 'vindictive',
 'visage',
 'volatile',
 'voracious',
 'waver',
 'whimsical',
 'zealous']

### Get Word Definitions - With PaLM

Connect to the PaLM Model API:

In [76]:
palm_model = vertexai.language_models.TextGenerationModel.from_pretrained("text-bison@002")

Request a definition for the first vocabulary word:

In [109]:
palm_model.predict(prompt = f'Describe the word {vocab_words[0]} in a way that will make it easy to remember.  Then, provide a definition of the word.', max_output_tokens = 500)

 **Mnemonic:** 
Imagine a person trying to "abrogate" or "ab-rogate" a rug from under someone's feet. The rug is suddenly pulled out, causing the person to fall.

**Definition:**
To repeal or annul a law, treaty, or agreement.

### Definitions For Many Words

What if the task changes to require multiple calls, like requesting the definition of many words?  If the list is short or timing is not important then making synchronous calls, one at a time, may work.  In the following example the `predict_streaming()` method is used so that results appear as they are generated by the API.
- [Streaming text generation](https://cloud.google.com/vertex-ai/docs/generative-ai/sdk-for-llm/sdk-use-text-models#stream-text-generation-sdk)
- [`.predict_streaming()` method](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextGenerationModel#vertexai_language_models_TextGenerationModel_predict_streaming)

In [119]:
for word in vocab_words[0:5]:
    print(f'Results for {word}:')
    for r in palm_model.predict_streaming(
        prompt = f'Describe the word {word} in a way that will make it easy to remember.  Then, provide a definition of the word.',
        max_output_tokens = 500
    ):
        print(r)
    print('-'*100)

Results for abrogate:
 **Mnemonic:** 
Imagine a person trying to "abrogate" or "break away" from
 a rope that is tying them down.

**Definition:**
To repeal or annul a law,
 treaty, or agreement.
----------------------------------------------------------------------------------------------------
Results for acquiesce:
 **Mnemonic:** 
*Acqui*esce sounds like "I *acqu*ire peace
."

**Definition:** 
To comply without protest; to agree or consent, usually reluctantly.
----------------------------------------------------------------------------------------------------
Results for adjudicate:
 **Adjudicate** can be remembered as "**Ad**vocate **Jud**ge **
Cat**e**."

**Definition**: To make an official decision about who is right in a
 dispute or competition.
----------------------------------------------------------------------------------------------------
Results for aggregate:
 **Aggregate**

Imagine a large pile of sand. Each grain of sand is an individual particle,
 but when you look a

## Asynchronous Use of APIs - Using Vertex AI Generative AI Models

To request the definition for all words in the vocabulary list it will be beneficial to make requests asynchronously - at the same time.  Some APIs have separate clients for asynchronous requests.  In the case of the PaLM model APIs there is a helpful asynchronous method provided: `.predict_async()`.
- [`.predict_async()` method](https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextGenerationModel#vertexai_language_models_TextGenerationModel_predict_async)

### What Exactly is Async?

If we make a request with the async method, the return value is a [coroutine](https://docs.python.org/3/glossary.html#term-coroutine) object rather than a response.  This means the method is implemented with an `async def` statement, so the object it returns is [awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables).

The following cells show using the method with, and without, an await expression:

In [136]:
palm_model.predict_async(
    prompt = f'Describe the word {vocab_words[0]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
    max_output_tokens = 500
)

<coroutine object _TextGenerationModel.predict_async at 0x7fc844694f20>

In [137]:
await palm_model.predict_async(
    prompt = f'Describe the word {vocab_words[0]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
    max_output_tokens = 500
)

 **Mnemonic:** 
Imagine a person trying to "abrogate" or "break away" from a rope that is tying them down.

**Definition:**
To repeal or annul a law, treaty, or agreement.
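
The same behavior can be reproduced without calling an API. The following is a minimal sketch in which `fake_predict` is a hypothetical stand-in for `predict_async()`. Calling it only creates a coroutine object, and awaiting that object produces the result:

```python
import asyncio
import inspect

async def fake_predict(word):
    # hypothetical stand-in for palm_model.predict_async(); no API call is made
    await asyncio.sleep(0.01)
    return f'definition of {word}'

async def main():
    pending = fake_predict('abrogate')       # calling only creates a coroutine object
    is_coro = inspect.iscoroutine(pending)   # nothing has run yet
    value = await pending                    # awaiting runs it and returns the result
    return is_coro, value

# In a notebook, where an event loop is already running, use `await main()` instead.
is_coro, value = asyncio.run(main())
print(is_coro, value)
```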

### How To Use Async Concurrently

The previous section showed that the `predict_async()` method returns a coroutine, which is an awaitable object.  When multiple coroutines are grouped together they can be awaited together - concurrently.

To group the coroutines together use [asyncio.gather()](https://docs.python.org/3/library/asyncio-task.html#running-tasks-concurrently):

In [138]:
responses = asyncio.gather(*[
    palm_model.predict_async(
        prompt = f'Describe the word {word} in a way that will make it easy to remember.  Then, provide a definition of the word.',
        max_output_tokens = 500
    ) for word in vocab_words[0:5]
])

In [139]:
type(responses)

asyncio.tasks._GatheringFuture

To run the requests concurrently and collect the results, `await` the grouping:

In [140]:
responses = await asyncio.gather(*[
    palm_model.predict_async(
        prompt = f'Describe the word {word} in a way that will make it easy to remember.  Then, provide a definition of the word.',
        max_output_tokens = 500
    ) for word in vocab_words[0:5]
])

In [141]:
type(responses), len(responses)

(list, 5)

In [142]:
for response in responses:
    print(response.text)
    print('-'*100)

 **Mnemonic:** 
Imagine a person trying to "abrogate" or "break away" from a rope that is tying them down.

**Definition:**
To repeal or annul a law, treaty, or agreement.
----------------------------------------------------------------------------------------------------
 **Mnemonic:** 
*Acqui*esce sounds like "I'm *acquiring* peace."

**Definition:** 
To comply without protest; to agree or consent, usually reluctantly.
----------------------------------------------------------------------------------------------------
 **Adjudicate** can be remembered as "**Ad**vocate **Jud**ge **Cat**e**."

**Definition**: To make an official decision about who is right in a dispute or competition.
----------------------------------------------------------------------------------------------------
 **Aggregate**

Imagine a large pile of sand. Each grain of sand is an individual particle, but when you look at the pile as a whole, you see an aggregate of sand.

**Definition**

Aggregate means a mass o
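
To see why grouping with `asyncio.gather()` helps, the sketch below compares awaiting one request at a time against awaiting a group. It assumes a hypothetical `fake_predict` that simply sleeps to simulate API latency, so the timings are illustrative only:

```python
import asyncio
import time

async def fake_predict(word, delay=0.2):
    # hypothetical stand-in for an API call: wait, then return a value
    await asyncio.sleep(delay)
    return word.upper()

async def sequential(words):
    # each request is awaited before the next one starts
    return [await fake_predict(w) for w in words]

async def concurrent(words):
    # all requests are started together and awaited as a group
    return await asyncio.gather(*[fake_predict(w) for w in words])

async def main():
    words = ['abrogate', 'acquiesce', 'adjudicate', 'aggregate', 'alleviate']
    start = time.perf_counter()
    seq = await sequential(words)
    seq_time = time.perf_counter() - start
    start = time.perf_counter()
    con = await concurrent(words)
    con_time = time.perf_counter() - start
    return seq, con, seq_time, con_time

# In a notebook, where an event loop is already running, use `await main()` instead.
seq, con, seq_time, con_time = asyncio.run(main())
print(f'sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s')
```

Both approaches return results in the same order as the inputs. Only the total wait differs.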

## Managing Concurrency

In some cases, doing all the tasks concurrently can work.  Usually, there are limitations though.  Waiting on an API to respond does not put a burden on the local compute, so managing lots of requests may not be an issue on the client side.  It can still be helpful to set limits on concurrency for managing the requests.  A first step to limiting concurrency is using a tool like [asyncio.Semaphore](https://docs.python.org/3/library/asyncio-sync.html#semaphore) to manage a counter of current concurrent requests.

The following builds a function that manages the full list of requests and uses a semaphore to control the concurrency.  Think of this as the concurrency buffer limit.

In [161]:
async def study_notes(instances, limit_concur_requests = 10):
    limit = asyncio.Semaphore(limit_concur_requests)
    results = [None] * len(instances)
    
    # make requests
    async def make_request(p):
        async with limit:
            if limit.locked():
                await asyncio.sleep(.01)
            result = await palm_model.predict_async(
                                prompt = f'Describe the word {instances[p]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
                                max_output_tokens = 500
                            )
        results[p] = (instances[p], result.text)
        
    # manage tasks
    tasks = [asyncio.create_task(make_request(p)) for p in range(len(instances))]
    responses = await asyncio.gather(*tasks)
    
    return results

In [162]:
responses = await study_notes(vocab_words[0:20])

In [163]:
type(responses), type(responses[0]), len(responses)

(list, tuple, 20)

In [166]:
print(responses[-1][0])
print(responses[-1][1])

ascertain
 **Mnemonic:** "**A** **S**ail **C**rawls **E**very **R**ock **T**o **A**void **I**njury **N**ow"

**Definition:** To find out or determine with certainty; to establish as a fact.
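
To check that a semaphore really caps the number of in-flight requests, the sketch below tracks the peak number of simultaneously active calls. It assumes a hypothetical `fake_request` in place of the API call, and `run_with_limit` is an illustrative name rather than part of the notebook:

```python
import asyncio

async def run_with_limit(n_requests, limit_concur_requests):
    limit = asyncio.Semaphore(limit_concur_requests)
    active = 0   # requests currently holding the semaphore
    peak = 0     # highest value of `active` observed

    async def fake_request(i):
        nonlocal active, peak
        async with limit:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # simulated API latency
            active -= 1
        return i

    results = await asyncio.gather(*[fake_request(i) for i in range(n_requests)])
    return results, peak

# In a notebook, where an event loop is already running, use `await run_with_limit(50, 10)`.
results, peak = asyncio.run(run_with_limit(50, 10))
print(len(results), peak)
```

The peak stays at or below `limit_concur_requests` even though all 50 coroutines are scheduled at once.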


## Managing Concurrency - With Limits

Just managing the concurrency may not be enough.  In cases where APIs have limits, the total requests need to stay under these limits to prevent errors.  In this example, the PaLM model is limited by requests per minute.  The default per project is 60 requests per minute for the model used here ('text-bison@002').  See [Quotas and limits](https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai).

The following modifies the previous function to also incorporate a time based limit for requests.

In [186]:
async def study_notes(instances, limit_concur_requests = 10, limit_per_minute = 60):
    limit = asyncio.Semaphore(limit_concur_requests)
    results = [None] * len(instances)
    
    # make requests
    async def make_request(p):
        
        # pause for time based limit
        if p >= limit_per_minute:
            await asyncio.sleep(60 * (p // limit_per_minute))
        
        async with limit:
            if limit.locked():
                await asyncio.sleep(.01)
            result = await palm_model.predict_async(
                                prompt = f'Describe the word {instances[p]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
                                max_output_tokens = 500
                            )
        results[p] = (instances[p], result.text)
        
    # manage tasks
    tasks = [asyncio.create_task(make_request(p)) for p in range(len(instances))]
    responses = await asyncio.gather(*tasks)
    
    return results
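
The time-based pause in the function above depends only on the position `p` of each item. As a small worked example, the hypothetical helper `start_delay` below reproduces that calculation to show how items are split into one-minute batches:

```python
def start_delay(p, limit_per_minute=60):
    # seconds an item at position p waits before it may request (same rule as above)
    return 60 * (p // limit_per_minute) if p >= limit_per_minute else 0

delays = {p: start_delay(p) for p in (0, 59, 60, 119, 120, 179)}
print(delays)  # {0: 0, 59: 0, 60: 60, 119: 60, 120: 120, 179: 120}
```

Items 0-59 start immediately, items 60-119 wait one minute, and so on. This fixed schedule assumes no other activity is consuming the same quota.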

Try the function under the limit:

In [187]:
responses = await study_notes(vocab_words[0:20])

In [188]:
type(responses), len(responses)

(list, 20)

In [189]:
print(responses[-1][0])
print(responses[-1][1])

ascertain
 **Mnemonic:** "**A** **S**ure **C**ertain**"

**Definition:** To find out or determine with certainty; to establish as a fact.


Try the function just over the limit:

In [190]:
# wait a minute for the quota to clear - assumes no other activity in the project
await asyncio.sleep(60)

In [192]:
responses = await study_notes(vocab_words[0:65])

In [191]:
type(responses), len(responses)

(list, 20)

In [193]:
print(responses[-1][0])
print(responses[-1][1])

discerning
 **Mnemonic:** A discerning person is like a detective, carefully examining evidence and making thoughtful judgments.

**Definition:** Having or showing good judgment; able to make careful distinctions.


Try the function at triple the limit:

In [194]:
# wait a minute for the quota to clear - assumes no other activity in the project
await asyncio.sleep(60)

In [195]:
responses = await study_notes(vocab_words[0:180])

In [196]:
type(responses), len(responses)

(list, 180)

In [197]:
print(responses[-1][0])
print(responses[-1][1])

obviate
 **Mnemonic:** Obviate is like "obliterate" - it means to do away with something completely.

**Definition:** To make something unnecessary or no longer needed; to do away with.


## Managing Concurrency - With Limits And Error Handling

Sometimes handling concurrency and limits is still not enough.  For example, in a shared environment it may not be possible to know how many other applications are making requests in the same time frame.  In some cases clients have retry methods built in.  In other cases errors are returned and the calling application has to handle them.

The following will further modify the function to handle error responses by retrying with an increasing wait time.

First, force an error by exceeding the limit:

In [198]:
# setting the limit_per_minute to 80, higher than the actual limit of 60
responses = await study_notes(vocab_words[0:80], limit_per_minute = 80)

ResourceExhausted: 429 Quota exceeded for aiplatform.googleapis.com/online_prediction_requests_per_base_model with base model: text-bison. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/quotas.
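
When one request raises, a plain `await asyncio.gather(...)` surfaces that error to the caller, as seen above. One option for inspecting which requests failed is the `return_exceptions=True` argument of `asyncio.gather()`. The sketch below is illustrative only: `QuotaError` and `fake_predict` are hypothetical stand-ins, not part of the Vertex AI SDK.

```python
import asyncio

class QuotaError(Exception):
    """Hypothetical stand-in for a 429 quota error."""

async def fake_predict(i, fail_on):
    # hypothetical stand-in for an API call that fails for selected items
    await asyncio.sleep(0.01)
    if i in fail_on:
        raise QuotaError(f'item {i} exceeded quota')
    return f'result {i}'

async def main():
    # exceptions are returned in the results list instead of being raised
    outcomes = await asyncio.gather(
        *[fake_predict(i, fail_on={2, 4}) for i in range(6)],
        return_exceptions=True,
    )
    failed = [i for i, o in enumerate(outcomes) if isinstance(o, Exception)]
    return outcomes, failed

# In a notebook, where an event loop is already running, use `await main()` instead.
outcomes, failed = asyncio.run(main())
print(failed)  # [2, 4]
```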

Now, modify the function to capture the error and retry with incrementing wait times.  The method used below does two things:
- sets a limit on the retries, 20 in this case
- increments the wait time for each retry, exponential backoff in this case

In [220]:
async def study_notes(instances, limit_concur_requests = 10, limit_per_minute = 60):
    limit = asyncio.Semaphore(limit_concur_requests)
    results = [None] * len(instances)
    
    # make requests
    async def make_request(p):
        
        # pause for time based limit
        if p >= limit_per_minute:
            await asyncio.sleep(60 * (p // limit_per_minute))
        
        async with limit:
            if limit.locked():
                await asyncio.sleep(.01)
            ########## ERROR HANDLING ##################################
            fail_count = 0
            while fail_count <= 20:
                try:
                    result = await palm_model.predict_async(
                                        prompt = f'Describe the word {instances[p]} in a way that will make it easy to remember.  Then, provide a definition of the word.',
                                        max_output_tokens = 500
                                    )
                    if fail_count > 0:
                        print(f'Item {p} succeed after fail count = {fail_count}')
                    break
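                except Exception:
                    fail_count += 1
                    print(f'Item {p} failed: current fail count = {fail_count}')
                    # exponential backoff: 1, 2, 4, 8, 16, then 32 seconds (capped)
                    await asyncio.sleep(2 ** (min(fail_count, 6) - 1))
            else:
                # retries exhausted without a successful response
                raise RuntimeError(f'Item {p} failed after {fail_count} attempts')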
            ############################################################
        results[p] = (instances[p], result.text)
        
    # manage tasks
    tasks = [asyncio.create_task(make_request(p)) for p in range(len(instances))]
    responses = await asyncio.gather(*tasks)
    
    return results
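
The exponential backoff described above can be sketched on its own. This assumes the intended wait is 2 raised to the power of (fail count - 1) seconds, capped at 32 seconds; `backoff_seconds` is a hypothetical helper name. Note that in Python `^` is bitwise XOR, so exponentiation needs `**`:

```python
def backoff_seconds(fail_count, cap_exponent=6):
    # wait time doubles with each failure until the cap is reached
    return 2 ** (min(fail_count, cap_exponent) - 1)

schedule = [backoff_seconds(n) for n in range(1, 9)]
print(schedule)       # [1, 2, 4, 8, 16, 32, 32, 32]
print(2 ^ 3, 2 ** 3)  # 1 8  (XOR versus exponentiation)
```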

Try 200 words with the correct limit:

In [208]:
# wait a minute for the quota to clear - assumes no other activity in the project
await asyncio.sleep(60)

In [209]:
responses = await study_notes(vocab_words[0:200])

In [210]:
type(responses), len(responses)

(list, 200)

In [211]:
print(responses[-1][0])
print(responses[-1][1])

polemic
 **Polemic:** 

Imagine a heated debate or argument where people are passionately defending their viewpoints. The word "polemic" comes from the Greek word "polemos," which means "war" or "battle." Just like in a battle, a polemic is a controversial or provocative statement that often sparks intense debate and discussion.

**Definition:** 

A polemic is a strongly worded or controversial statement that is intended to provoke debate or argument. It is often used in political, religious, or social contexts to express strong opinions or challenge opposing viewpoints. Polemics are typically characterized by their passionate and persuasive tone, and they can be highly influential in shaping public opinion or promoting certain ideologies.


Now, try 200 words but force errors by setting the limit higher than the actual limit (60):

In [221]:
# wait a minute for the quota to clear - assumes no other activity in the project
await asyncio.sleep(60)

In [222]:
# setting the limit_per_minute to 80, higher than the actual limit of 60
responses = await study_notes(vocab_words[0:200], limit_per_minute = 80)

Item 75 failed: current fail count = 1
Item 76 failed: current fail count = 1
Item 77 failed: current fail count = 1
Item 79 failed: current fail count = 1
Item 75 failed: current fail count = 2
Item 76 failed: current fail count = 2
Item 77 failed: current fail count = 2
Item 79 failed: current fail count = 2
Item 75 failed: current fail count = 3
Item 76 failed: current fail count = 3
Item 77 failed: current fail count = 3
Item 79 failed: current fail count = 3
Item 75 failed: current fail count = 4
Item 76 failed: current fail count = 4
Item 77 failed: current fail count = 4
Item 79 failed: current fail count = 4
Item 75 failed: current fail count = 5
Item 76 failed: current fail count = 5
Item 77 failed: current fail count = 5
Item 79 succeed after fail count = 4
Item 76 succeed after fail count = 5
Item 75 succeed after fail count = 5
Item 77 succeed after fail count = 5
Item 126 failed: current fail count = 1
Item 127 failed: current fail count = 1
Item 139 failed: current fail c

In [223]:
type(responses), len(responses)

(list, 200)

In [224]:
print(responses[-1][0])
print(responses[-1][1])

polemic
 **Polemic:** 

Imagine a heated debate or argument where people are passionately defending their viewpoints. The word "polemic" comes from the Greek word "polemos," which means "war" or "battle." Just like in a battle, a polemic is a controversial or provocative statement that often sparks heated debate and discussion.

**Definition:** 

A polemic is a strongly worded or controversial statement that is intended to provoke debate or argument. It is often used in political, religious, or social contexts to express strong opinions or challenge opposing viewpoints. Polemics are typically characterized by their passionate and persuasive tone, and they often aim to influence or persuade the reader to adopt a particular point of view.
