## Multi-Region Load Balancing with Azure API Management (APIM)

Multi-region load balancing with Azure API Management (APIM) enables high availability, improved performance, and disaster recovery for your APIs by distributing traffic across multiple Azure regions.

### Key Benefits

- **High Availability:** Ensures your APIs remain accessible even if one region experiences an outage.
- **Performance Optimization:** Routes client requests to the healthiest region, reducing latency.
- **Disaster Recovery:** Provides failover capabilities in case of regional failures.

### Architecture Overview

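1. **Circuit Breaker Pattern:** Configure a load-balanced backend pool in APIM and apply a circuit breaker to each backend, so that a failing backend is temporarily taken out of rotation.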
2. **Backend Services:** Ensure backend APIs are also deployed in multiple regions or are accessible from all APIM instances.
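
The two ideas above can be illustrated with a small, self-contained simulation. This is only a sketch of the general pattern, not how APIM implements it: APIM applies load balancing and circuit breaking in the gateway configuration, and the `Backend` and `BackendPool` classes, the failure threshold, and the trip duration below are all hypothetical names and values chosen for illustration.

```python
import time


class Backend:
    """A hypothetical backend guarded by a simple circuit breaker."""

    def __init__(self, name, failure_threshold=3, trip_seconds=10.0):
        self.name = name
        self.failure_threshold = failure_threshold
        self.trip_seconds = trip_seconds
        self.failures = 0
        self.open_until = 0.0  # the circuit stays open (backend skipped) until this time

    def is_available(self, now):
        return now >= self.open_until

    def record(self, success, now):
        """Record the outcome of a call and trip the circuit after repeated failures."""
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Trip the circuit: skip this backend until the timer expires
            self.open_until = now + self.trip_seconds
            self.failures = 0


class BackendPool:
    """Round-robin selection over the backends whose circuit is closed."""

    def __init__(self, backends):
        self.backends = backends
        self._next = 0

    def pick(self, now=None):
        now = time.monotonic() if now is None else now
        count = len(self.backends)
        for offset in range(count):
            candidate = self.backends[(self._next + offset) % count]
            if candidate.is_available(now):
                self._next = (self._next + offset + 1) % count
                return candidate
        return None  # no viable backend; a gateway would answer with HTTP 503


pool = BackendPool([Backend("East US"), Backend("Sweden Central")])
print(pool.pick(now=0.0).name)
```

Passing `now` explicitly keeps the sketch deterministic; in real use the default monotonic clock would be used instead.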


![backend-pool-load-balancing.gif](backend-pool-load-balancing.gif)

In [None]:
%pip install openai utils requests python-dotenv pandas matplotlib

### 0️⃣ Initialize notebook variables

In [None]:
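import os
import requests
from dotenv import load_dotenv

# Load settings from a local .env file into the process environment
load_dotenv()
apim_endpoint = os.getenv("APIM_ENDPOINT")  # URL used for the direct HTTP test
apim_subscription_key = os.getenv("APIM_SUBSCRIPTION_KEY")  # sent as the api-key header
apim_gateway_url = os.getenv("APIM_GATEWAY_URL")  # endpoint passed to the Azure OpenAI SDK client
model_deployment_name = os.getenv("MODEL_DEPLOYMENT_NAME")
api_version = os.getenv("OPENAI_API_VERSION")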
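
Because `os.getenv` returns `None` for an unset variable, a missing entry in the `.env` file only surfaces later as a confusing request error. A minimal sketch of an early check follows; the helper `missing_settings` is a hypothetical name introduced here, and the variable list simply mirrors the names read above.

```python
import os

REQUIRED_SETTINGS = [
    "APIM_ENDPOINT",
    "APIM_SUBSCRIPTION_KEY",
    "APIM_GATEWAY_URL",
    "MODEL_DEPLOYMENT_NAME",
    "OPENAI_API_VERSION",
]


def missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


missing = missing_settings()
if missing:
    print(f"Missing settings: {', '.join(missing)}")
else:
    print("All required settings are present")
```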


### 🧪 Test the API using a direct HTTP call

Requests is an elegant and simple HTTP library for Python that will be used here to make raw API requests and inspect the responses.

You should not see HTTP 429 responses, because API Management's retry policy selects another available backend when one is throttled. If no backend is viable, API Management returns HTTP 503.
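
Since an HTTP 503 means that no backend was viable at that moment, a caller may choose to wait briefly and try again. The following is a minimal client-side sketch under that assumption; it is not part of the lab's gateway configuration, and the helper name `call_with_retry`, the attempt count, and the delays are hypothetical.

```python
import time


def call_with_retry(send, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call send() until it returns a non-503 status or the attempts run out.

    send is any zero-argument callable that returns an object with a
    status_code attribute, for example a wrapped session.post call.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 503:
            return response
        if attempt < max_attempts - 1:
            # Exponential backoff: base delay, then 2x, then 4x, ...
            sleep(base_delay_s * (2 ** attempt))
    return response  # still 503 after all attempts
```

In the test loop, this could wrap the request as `call_with_retry(lambda: session.post(url, json = messages))`.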

In [None]:
import json
import time

import requests

runs = 5
sleep_time_ms = 100
url = apim_endpoint

messages = {
    "messages": [
        {
            "role": "user",
            "content": "I am going to Paris, what should I see?"
        }
    ],
    "max_tokens": 1500,
    "temperature": 1,
    "top_p": 1,
    "model": model_deployment_name
}

api_runs = []

# Initialize a session for connection pooling and set any default headers
session = requests.Session()
session.headers.update({'api-key': apim_subscription_key})

try:
    for i in range(runs):
        print(f"▶️ Run {i+1}/{runs}:")

        start_time = time.time()
        response = session.post(url, json = messages)
        response_time = time.time() - start_time
        print(f"⌚ {response_time:.2f} seconds")

        print(f"Response code: {response.status_code}")

        # The x-ms-region header identifies the region of the backend that served the request
        region = response.headers.get('x-ms-region')
        if region:
            print(f"x-ms-region: \x1b[1;32m{region}\x1b[0m")
            api_runs.append((response_time, region))

        if response.status_code == 200:
            data = response.json()
            print(f"Token usage: {json.dumps(data.get('usage'), indent = 4)}\n")
            print(f"💬 {data.get('choices')[0].get('message').get('content')}\n")
        else:
            print(f"{response.text}\n")

        time.sleep(sleep_time_ms/1000)
finally:
    # Close the session to release the connection
    session.close()

### 🔍 Analyze Load Balancing results

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle as pltRectangle
import matplotlib as mpl

mpl.rcParams['figure.figsize'] = [15, 7]
df = pd.DataFrame(api_runs, columns = ['Response Time', 'Region'])
df['Run'] = range(1, len(df) + 1)

# Define a color map for each region
color_map = {'East US': 'lightpink', 'Sweden Central': 'lightyellow'}  # Add more regions and colors as needed

# Plot the dataframe with colored bars
ax = df.plot(kind = 'bar', x = 'Run', y = 'Response Time', color = [color_map.get(region, 'gray') for region in df['Region']], legend = False)

plt.title('Load Balancing results')
plt.xlabel('Run #')
plt.ylabel('Response Time (s)')
plt.xticks(rotation = 0)

# Draw the average response time as a dashed horizontal line
average = df['Response Time'].mean()
average_line = plt.axhline(y = average, color = 'r', linestyle = '--')

# Add legend: one colored patch per region, plus the average line
regions = list(df['Region'].unique())
legend_handles = [pltRectangle((0, 0), 1, 1, color = color_map.get(region, 'gray')) for region in regions]
ax.legend(legend_handles + [average_line], regions + [f'Average: {average:.2f} s'])

plt.show()
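
Alongside the chart, a short numeric summary makes it easier to see how the requests were split between regions. The sketch below uses only the standard library and expects tuples in the same `(response_time, region)` shape as `api_runs`; the function name `summarize_by_region` and the sample values are hypothetical and are not measured results.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_region(runs):
    """Group (response_time, region) tuples and report count and mean time per region."""
    times_by_region = defaultdict(list)
    for response_time, region in runs:
        times_by_region[region].append(response_time)
    return {
        region: {"requests": len(times), "mean_seconds": mean(times)}
        for region, times in times_by_region.items()
    }


# Hypothetical sample data in the same shape as api_runs
sample_runs = [(1.0, "East US"), (3.0, "East US"), (2.0, "Sweden Central")]
for region, stats in summarize_by_region(sample_runs).items():
    print(f"{region}: {stats['requests']} requests, mean {stats['mean_seconds']:.2f} s")
```

Calling `summarize_by_region(api_runs)` after the test loop would give the same breakdown for the real runs.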

### 🧪 Test the API using the Azure OpenAI Python SDK

Repeat the same test using the Python SDK to ensure compatibility. The parsed SDK response does not expose which region served the request, so the code calls `with_raw_response` to read the `x-ms-region` header before parsing the body.

In [None]:
import time
from openai import AzureOpenAI

runs = 5
sleep_time_ms = 100

client = AzureOpenAI(
    azure_endpoint = apim_gateway_url,
    api_key = apim_subscription_key,
    api_version = api_version
)

for i in range(runs):
    print(f"▶️ Run {i+1}/{runs}:")

    start_time = time.time()
    raw_response = client.chat.completions.with_raw_response.create(
        messages = [
            {
                "role": "user",
                "content": "I am going to Paris, what should I see?"
            }
        ],
        max_tokens = 1500,
        temperature = 1,
        top_p = 1,
        model = model_deployment_name
    )
    response_time = time.time() - start_time

    print(f"⌚ {response_time:.2f} seconds")
    print(f"x-ms-region: \x1b[1;32m{raw_response.headers.get('x-ms-region')}\x1b[0m") # this header is useful to determine the region of the backend that served the request

    response = raw_response.parse()

    if response.usage:
        print(f"Token usage:\n   Total tokens: {response.usage.total_tokens}\n   Prompt tokens: {response.usage.prompt_tokens}\n   Completion tokens: {response.usage.completion_tokens}\n")

    print(f"💬 {response.choices[0].message.content}\n")

    time.sleep(sleep_time_ms/1000)