# Vertex AI Endpoint: Scaling & Performance Testing

**Comprehensive load testing to understand endpoint capacity, autoscaling behavior, and performance characteristics.**

---

## What This Notebook Does

This notebook systematically tests Vertex AI Endpoint performance to answer:

1. **What's the maximum capacity?** - How many requests per second (RPS) can the endpoint handle?
2. **Where do bottlenecks occur?** - Is it client-side queueing or endpoint processing?
3. **Does autoscaling work?** - When and how does the endpoint scale replicas?
4. **What should we configure?** - Recommendations for production settings

## Testing Approach

**Phase 1: Find Breaking Points** (~15-20 minutes)
- Test different batch sizes (1 ‚Üí 1000 instances per request)
- Test different request rates (1 ‚Üí 100 RPS)
- Identify where latency starts to degrade

**Phase 2: Sustained Load Testing** (~15-20 minutes)
- Apply realistic traffic patterns over extended periods
- Observe autoscaling triggers and timing
- Measure steady-state performance

**Total test time:** ~30-40 minutes

## Key Metrics We Track

**Timing Breakdown** (separated to identify bottlenecks):
- üü¶ **Queueing Time**: Waiting for client concurrency slot ‚Üí *client bottleneck*
- üü© **Request Time**: Actual HTTP request/response ‚Üí *endpoint performance*
- üîµ **Total Latency**: End-to-end user experience (queue + request)

**Success Metrics**:
- Success rate (% of requests that completed successfully)
- Error types and frequency

**Resource Metrics** (from Cloud Monitoring):
- CPU utilization (triggers autoscaling at 60% by default)
- Replica count (view in Cloud Console)

## Prerequisites

Before running this notebook, you need:

- **Deployed Vertex AI Endpoint** with a model
  - If using the PyTorch autoencoder from this repository, run `../pytorch-autoencoder.ipynb` first
  - Then deploy using either `vertex-ai-endpoint-prebuilt-container.ipynb` or `vertex-ai-endpoint-custom-container.ipynb`
- **Google Cloud authentication** with Vertex AI prediction permissions
- **Python packages**: `aiohttp`, `google-cloud-aiplatform`, `plotly`, `pandas` (installed automatically)

## Understanding Vertex AI Autoscaling

**How It Works**:
- Vertex AI autoscales based on **CPU utilization** (default threshold: 60%)
- When CPU > 60% for ~1-2 minutes ‚Üí new replica provisions
- Replica provisioning takes ~2-3 minutes (container startup)
- Scale-down occurs after ~10-15 minutes below threshold

**Important**: Lightweight models may not trigger autoscaling even under high RPS because CPU usage stays low. This is a **capacity bottleneck**, not a **compute bottleneck**. If this happens:
- Lower the autoscaling threshold (requires redeployment)
- Increase minimum replica count
- Use a larger machine type

---
## Environment Setup

This section will authenticate your session, enable required Google Cloud APIs, and install necessary Python packages.

**Package Installation Options (`REQ_TYPE`):**
- `PRIMARY`: Installs only the main packages. Faster, but pip resolves sub-dependencies which may result in different versions than development.
- `ALL` (Default): Installs exact versions of all packages and dependencies. Best for perfectly reproducing the development environment.
- `COLAB`: Installs a Colab-optimized list that excludes pre-installed packages like `ipython` and `ipykernel`.

**Installation Tool Options (`INSTALL_TOOL`):**
- `pip` (Default): Uses pip for package installation. Standard Python package installer.
- `uv`: Modern, fast Python package installer. Must be installed separately. See: https://github.com/astral-sh/uv
- `poetry`: Dependency management tool. Requires running notebook in a poetry environment (`poetry shell` or `poetry run jupyter lab`). Uses `pyproject.toml` instead of requirements.txt.

> **Note:** If running in Google Colab, the script will automatically detect this and set `REQ_TYPE = 'COLAB'` to prevent package conflicts, overriding any manual setting.

### Set Your Project ID

‚ö†Ô∏è **Action Required:** Replace the `PROJECT_ID` value below with your Google Cloud project ID before running this cell.

In [8]:
PROJECT_ID = 'statmike-mlops-349915' # replace with GCP project ID
REQ_TYPE = 'ALL' # Specify PRIMARY or ALL or COLAB
INSTALL_TOOL = 'poetry' # Specify pip, uv, or poetry

### Configuration

This cell defines the requirements files and Google Cloud APIs needed for this notebook. Run as-is without modification.

In [9]:
REQUIREMENTS_URL = 'https://raw.githubusercontent.com/statmike/vertex-ai-mlops/refs/heads/main/Framework%20Workflows/PyTorch/requirements.txt'

REQUIRED_APIS = [
    "aiplatform.googleapis.com",
    "monitoring.googleapis.com",
]

### Run Setup

This cell downloads the centralized setup code and configures your environment. It will:
- Authenticate your session with Google Cloud
- Enable required APIs for this notebook
- Install necessary Python packages
- Display a setup summary with your project information

> **Note:** In Colab, if packages are installed, the kernel will automatically restart. After restart, continue from the next cell without re-running earlier cells.

In [11]:
import os, urllib.request

# Download and import setup code
url = 'https://raw.githubusercontent.com/statmike/vertex-ai-mlops/refs/heads/main/core/notebook-template/python_setup.py'
urllib.request.urlretrieve(url, 'python_setup_local.py')
import python_setup_local as python_setup
os.remove('python_setup_local.py')

# Run setup
setup_info = python_setup.setup_environment(PROJECT_ID, REQ_TYPE, REQUIREMENTS_URL, REQUIRED_APIS, INSTALL_TOOL)


PYTHON GCP ENVIRONMENT SETUP

AUTHENTICATION
Checking for existing ADC...
‚úÖ Existing ADC found.
‚úÖ Project is correctly set to 'statmike-mlops-349915'.

API CHECK & ENABLE
‚úÖ aiplatform.googleapis.com is already enabled.
‚úÖ monitoring.googleapis.com is already enabled.

PACKAGE MANAGEMENT
Installation Tool: poetry
‚úÖ Found poetry at: /usr/local/google/home/statmike/.local/bin/poetry
‚úÖ Running in poetry environment: /usr/local/google/home/statmike/.cache/pypoetry/virtualenvs/frameworks-pytorch-0KVJlKeQ-py3.13
‚ÑπÔ∏è  Poetry mode: Installing from pyproject.toml (REQUIREMENTS_URL ignored)
‚úÖ Found pyproject.toml at: /usr/local/google/home/statmike/Git/vertex-ai-mlops/Framework Workflows/PyTorch/pyproject.toml
   Changed working directory to: /usr/local/google/home/statmike/Git/vertex-ai-mlops/Framework Workflows/PyTorch
Running poetry install...
   Restored working directory to: /usr/local/google/home/statmike/Git/vertex-ai-mlops/Framework Workflows/PyTorch/serving
‚úÖ All packa

---
## Test Configuration

Configure the endpoint to test and the test parameters.

**Endpoint Configuration:**
- Update `ENDPOINT_DISPLAY_NAME` to match your deployed endpoint
- Update `REGION` if your endpoint is in a different region

**Test Parameters:**
- `BATCH_SIZES`: Range of batch sizes to test (instances per request)
- `RPS_TARGETS`: Range of request rates to test (requests per second)
- `RPS_BATCH_SIZES`: Subset of batch sizes to use for RPS scaling tests

In [12]:
# Endpoint Configuration
REGION = 'us-central1'
ENDPOINT_DISPLAY_NAME = 'pytorch-autoencoder-endpoint'  # Replace with your endpoint name

# Test Configuration
BATCH_SIZES = [1, 5, 10, 50, 100, 500, 1000]  # Instances per request to test
RPS_TARGETS = [1, 5, 10, 20, 50, 100]  # Requests per second to test
RPS_BATCH_SIZES = [1, 5, 100]  # Which batch sizes to test at different RPS

In [13]:
# Imports
import asyncio
import aiohttp
import time
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import google.auth
import google.auth.transport.requests
from google.cloud import aiplatform, monitoring_v3
import plotly.graph_objects as go
from plotly.subplots import make_subplots

print("‚úÖ Imports complete")

  from google.cloud.aiplatform.utils import gcs_utils


‚úÖ Imports complete


In [14]:
# Initialize clients
aiplatform.init(project=PROJECT_ID, location=REGION)
monitoring_client = monitoring_v3.MetricServiceClient()

# Setup authentication for REST API
credentials, _ = google.auth.default()
auth_req = google.auth.transport.requests.Request()

print(f"‚úÖ Initialized for project: {PROJECT_ID}")

‚úÖ Initialized for project: statmike-mlops-349915


In [15]:
# Connect to endpoint
endpoints = aiplatform.Endpoint.list(filter=f"display_name={ENDPOINT_DISPLAY_NAME}")
if not endpoints:
    raise ValueError(f"No endpoint found: {ENDPOINT_DISPLAY_NAME}")

endpoint = endpoints[0]
endpoint_url = f"https://{REGION}-aiplatform.googleapis.com/v1/{endpoint.resource_name}:predict"

# Get endpoint configuration
deployed_model = endpoint.list_models()[0]
MACHINE_TYPE = deployed_model.dedicated_resources.machine_spec.machine_type
MIN_REPLICAS = deployed_model.dedicated_resources.min_replica_count
MAX_REPLICAS = deployed_model.dedicated_resources.max_replica_count

print(f"‚úÖ Connected to: {endpoint.display_name}")
print(f"   Machine: {MACHINE_TYPE}")
print(f"   Replicas: {MIN_REPLICAS} - {MAX_REPLICAS}")
print(f"   URL: {endpoint_url}")

‚úÖ Connected to: pytorch-autoencoder-endpoint
   Machine: n1-standard-4
   Replicas: 1 - 4
   URL: https://us-central1-aiplatform.googleapis.com/v1/projects/1026793852137/locations/us-central1/endpoints/2741468416626917376:predict


In [16]:
# Test connection with a single prediction
def generate_sample_data(batch_size=1):
    """Generate sample input data (30 features per instance)"""
    return [np.random.randn(30).astype(np.float32).tolist() for _ in range(batch_size)]

# Make a test request
credentials.refresh(auth_req)
headers = {
    "Authorization": f"Bearer {credentials.token}",
    "Content-Type": "application/json"
}
payload = {"instances": generate_sample_data(1)}

import requests
response = requests.post(endpoint_url, headers=headers, json=payload, timeout=30)
response.raise_for_status()

print("‚úÖ Endpoint connection successful")
print(f"   Sample prediction returned: {len(response.json()['predictions'])} result(s)")

‚úÖ Endpoint connection successful
   Sample prediction returned: 1 result(s)


---
## Helper Functions

Core async testing infrastructure with timing separation.

**Key Design Features:**
- **Async/await**: Uses `aiohttp` for concurrent HTTP requests
- **Fixed-rate scheduling**: Creates requests at precise intervals to maintain target RPS
- **Timing separation**: Tracks queueing time (client) vs request time (endpoint)
- **Semaphore control**: Limits concurrent requests to prevent client overload
- **Progress reporting**: Shows stats every 60 seconds during long tests

In [17]:
async def run_load_test(
    target_rps: int,
    duration: int,
    batch_size: int = 5,
    test_name: str = "Load Test"
) -> pd.DataFrame:
    """
    Run load test with precise RPS control and timing separation.
    
    Uses fixed-rate scheduling: creates tasks just-in-time at exact intervals
    to maintain target RPS without client-side backlog.
    
    Args:
        target_rps: Target requests per second
        duration: Test duration in seconds
        batch_size: Instances per request
        test_name: Name for logging
    
    Returns:
        DataFrame with columns: timestamp, request_id, queueing_ms, request_ms, 
                                total_latency_ms, success, error (if failed)
    """
    test_instances = generate_sample_data(batch_size)
    total_requests = target_rps * duration
    interval = 1.0 / target_rps
    max_concurrent = min(target_rps * 2, 200)
    semaphore = asyncio.Semaphore(max_concurrent)
    
    async def make_request(session, request_id, scheduled_time):
        """Make single async request with timing breakdown"""
        queue_start = time.time()
        
        async with semaphore:
            queue_end = time.time()
            queueing_ms = (queue_end - queue_start) * 1000
            
            request_start = time.time()
            try:
                credentials.refresh(auth_req)
                headers = {
                    "Authorization": f"Bearer {credentials.token}",
                    "Content-Type": "application/json"
                }
                payload = {"instances": test_instances}
                
                async with session.post(
                    endpoint_url, 
                    json=payload, 
                    headers=headers,
                    timeout=aiohttp.ClientTimeout(total=300)
                ) as response:
                    request_end = time.time()
                    request_ms = (request_end - request_start) * 1000
                    total_latency_ms = (request_end - queue_start) * 1000
                    
                    if response.status == 200:
                        await response.json()
                        return {
                            'timestamp': datetime.now(),
                            'request_id': request_id,
                            'queueing_ms': queueing_ms,
                            'request_ms': request_ms,
                            'total_latency_ms': total_latency_ms,
                            'success': True
                        }
                    else:
                        error_text = await response.text()
                        return {
                            'timestamp': datetime.now(),
                            'request_id': request_id,
                            'queueing_ms': queueing_ms,
                            'request_ms': request_ms,
                            'total_latency_ms': total_latency_ms,
                            'success': False,
                            'error': f"HTTP {response.status}: {error_text[:100]}"
                        }
            except Exception as e:
                request_end = time.time()
                return {
                    'timestamp': datetime.now(),
                    'request_id': request_id,
                    'queueing_ms': queueing_ms,
                    'request_ms': (request_end - request_start) * 1000,
                    'total_latency_ms': (request_end - queue_start) * 1000,
                    'success': False,
                    'error': str(e)[:100]
                }
    
    async def scheduler(session):
        """Schedule requests at precise intervals"""
        test_start = time.time()
        active_tasks = set()
        completed = []
        
        for i in range(total_requests):
            target_time = test_start + (i * interval)
            wait = target_time - time.time()
            if wait > 0:
                await asyncio.sleep(wait)
            
            task = asyncio.create_task(make_request(session, i, target_time))
            active_tasks.add(task)
            task.add_done_callback(lambda t: active_tasks.discard(t))
            
            if len(active_tasks) >= max_concurrent:
                done, _ = await asyncio.wait(active_tasks, return_when=asyncio.FIRST_COMPLETED)
                completed.extend([t.result() for t in done])
            
            # Progress every 60 seconds
            if i > 0 and i % (target_rps * 60) == 0:
                elapsed = time.time() - test_start
                success = len([r for r in completed if r['success']])
                avg_lat = sum(r['total_latency_ms'] for r in completed if r['success']) / max(success, 1)
                print(f"  [{int(elapsed):3d}s] {len(completed):,} done | "
                      f"Success: {success:,} | Avg: {avg_lat:.1f}ms")
        
        if active_tasks:
            remaining = await asyncio.gather(*active_tasks)
            completed.extend(remaining)
        
        return completed
    
    # Run test
    print(f"\n{'='*60}")
    print(f"{test_name}")
    print(f"{'='*60}")
    print(f"Target: {target_rps} RPS √ó {duration}s = {total_requests:,} requests")
    print(f"Batch size: {batch_size} | Concurrency: {max_concurrent}")
    
    connector = aiohttp.TCPConnector(limit=max_concurrent)
    async with aiohttp.ClientSession(connector=connector) as session:
        results = await scheduler(session)
    
    # Summary
    df = pd.DataFrame(results)
    success = df[df['success'] == True]
    print(f"\n‚úÖ Complete: {len(success):,}/{len(df):,} successful ({len(success)/len(df)*100:.1f}%)")
    
    if len(success) > 0:
        print(f"   Total Latency:   {success['total_latency_ms'].mean():.1f}ms (mean) | "
              f"{success['total_latency_ms'].quantile(0.95):.1f}ms (p95)")
        print(f"   Queueing Time:   {success['queueing_ms'].mean():.1f}ms (mean) | "
              f"{success['queueing_ms'].quantile(0.95):.1f}ms (p95)")
        print(f"   Request Time:    {success['request_ms'].mean():.1f}ms (mean) | "
              f"{success['request_ms'].quantile(0.95):.1f}ms (p95)")
    
    return df

print("‚úÖ Helper functions defined")

‚úÖ Helper functions defined


---

## Phase 1: Find Breaking Points

Systematically test to find:
1. **Optimal batch size** - Best latency/throughput balance
2. **Maximum RPS** - Where does performance degrade?
3. **Bottleneck location** - Client queueing vs endpoint processing

### Test 1: Batch Size Impact

Test how latency changes with batch size at a constant low RPS (1 RPS).

This isolates batch size effects from concurrency/queueing.

In [18]:
# Run batch size test (1 RPS, 10 requests per batch size)
batch_results = []

for batch_size in BATCH_SIZES:
    print(f"Testing batch={batch_size}...", end=" ")
    
    latencies = []
    for i in range(10):
        instances = generate_sample_data(batch_size)
        credentials.refresh(auth_req)
        headers = {"Authorization": f"Bearer {credentials.token}", "Content-Type": "application/json"}
        
        start = time.time()
        response = requests.post(endpoint_url, headers=headers, json={"instances": instances}, timeout=300)
        latency = (time.time() - start) * 1000
        
        if response.status_code == 200:
            latencies.append(latency)
            batch_results.append({
                'batch_size': batch_size,
                'latency_ms': latency,
                'success': True
            })
        
        if i < 9:
            time.sleep(1)  # 1 second between requests = 1 RPS
    
    if latencies:
        print(f"avg={np.mean(latencies):.1f}ms, p95={np.percentile(latencies, 95):.1f}ms")

df_batch = pd.DataFrame(batch_results)
print(f"\n‚úÖ Batch size test complete: {len(df_batch)} requests")

Testing batch=1... avg=66.9ms, p95=86.9ms
Testing batch=5... avg=80.0ms, p95=163.9ms
Testing batch=10... avg=70.4ms, p95=89.3ms
Testing batch=50... avg=128.7ms, p95=153.6ms
Testing batch=100... avg=203.6ms, p95=219.6ms
Testing batch=500... avg=821.0ms, p95=931.2ms
Testing batch=1000... avg=1766.4ms, p95=1963.3ms

‚úÖ Batch size test complete: 70 requests


In [19]:
# Visualize batch size results
stats = df_batch.groupby('batch_size')['latency_ms'].agg(['mean', 'median', 
    ('p95', lambda x: np.percentile(x, 95))]).reset_index()

fig = go.Figure()
fig.add_trace(go.Scatter(x=stats['batch_size'], y=stats['mean'], 
                         mode='lines+markers', name='Mean', line=dict(width=2)))
fig.add_trace(go.Scatter(x=stats['batch_size'], y=stats['p95'], 
                         mode='lines+markers', name='P95', line=dict(dash='dash')))

fig.update_layout(
    title='Latency vs Batch Size (1 RPS)',
    xaxis_title='Batch Size (instances per request)',
    yaxis_title='Latency (ms)',
    xaxis_type='log',
    height=400
)
fig.show()

# Find optimal batch size (< 2x baseline latency)
baseline = stats[stats['batch_size'] == 1]['mean'].values[0]
optimal = stats[stats['mean'] <= baseline * 2]['batch_size'].max()
print(f"\nüìä Analysis:")
print(f"   Baseline (batch=1): {baseline:.1f}ms")
print(f"   Optimal batch size: {optimal} (latency < 2x baseline)")
print(f"   Latency at optimal: {stats[stats['batch_size']==optimal]['mean'].values[0]:.1f}ms")


üìä Analysis:
   Baseline (batch=1): 66.9ms
   Optimal batch size: 50 (latency < 2x baseline)
   Latency at optimal: 128.7ms


### Test 2: RPS Scaling

Test how the endpoint handles increasing request rates.

**Key insight**: Timing separation shows whether bottleneck is client-side (high queueing) or endpoint-side (high request time).

In [20]:
# Run RPS scaling tests
rps_results = []

for batch_size in RPS_BATCH_SIZES:
    print(f"\n{'='*60}")
    print(f"Testing batch_size={batch_size}")
    print(f"{'='*60}")
    
    for target_rps in RPS_TARGETS:
        df = await run_load_test(
            target_rps=target_rps,
            duration=30,
            batch_size=batch_size,
            test_name=f"Batch {batch_size} @ {target_rps} RPS"
        )
        df['batch_size'] = batch_size
        df['target_rps'] = target_rps
        rps_results.append(df)

df_rps = pd.concat(rps_results, ignore_index=True)
print(f"\n‚úÖ RPS scaling tests complete: {len(df_rps):,} total requests")


Testing batch_size=1

Batch 1 @ 1 RPS
Target: 1 RPS √ó 30s = 30 requests
Batch size: 1 | Concurrency: 2

‚úÖ Complete: 1/1 successful (100.0%)
   Total Latency:   81.7ms (mean) | 81.7ms (p95)
   Queueing Time:   0.0ms (mean) | 0.0ms (p95)
   Request Time:    81.7ms (mean) | 81.7ms (p95)

Batch 1 @ 5 RPS
Target: 5 RPS √ó 30s = 150 requests
Batch size: 1 | Concurrency: 10

‚úÖ Complete: 1/1 successful (100.0%)
   Total Latency:   55.3ms (mean) | 55.3ms (p95)
   Queueing Time:   0.0ms (mean) | 0.0ms (p95)
   Request Time:    55.3ms (mean) | 55.3ms (p95)

Batch 1 @ 10 RPS
Target: 10 RPS √ó 30s = 300 requests
Batch size: 1 | Concurrency: 20

‚úÖ Complete: 1/1 successful (100.0%)
   Total Latency:   59.4ms (mean) | 59.4ms (p95)
   Queueing Time:   0.0ms (mean) | 0.0ms (p95)
   Request Time:    59.3ms (mean) | 59.3ms (p95)

Batch 1 @ 20 RPS
Target: 20 RPS √ó 30s = 600 requests
Batch size: 1 | Concurrency: 40

‚úÖ Complete: 2/2 successful (100.0%)
   Total Latency:   67.2ms (mean) | 74.1ms (p

In [21]:
# Analyze and visualize RPS results
success = df_rps[df_rps['success'] == True]

# Calculate stats by batch size and RPS
stats = success.groupby(['batch_size', 'target_rps']).agg({
    'total_latency_ms': ['mean', lambda x: np.percentile(x, 95)],
    'queueing_ms': ['mean', lambda x: np.percentile(x, 95)],
    'request_ms': ['mean', lambda x: np.percentile(x, 95)]
}).reset_index()

stats.columns = ['batch_size', 'target_rps', 'total_mean', 'total_p95', 
                 'queue_mean', 'queue_p95', 'request_mean', 'request_p95']

# Success rates
success_rates = df_rps.groupby(['batch_size', 'target_rps'])['success'].apply(
    lambda x: (x == True).sum() / len(x) * 100
).reset_index(name='success_rate')
stats = stats.merge(success_rates, on=['batch_size', 'target_rps'])

# Create visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Total Latency vs RPS', 'Timing Breakdown @ Max RPS',
                   'Success Rate vs RPS', 'Request Time vs RPS'),
    specs=[[{}, {}], [{}, {}]],
    vertical_spacing=0.12,
    horizontal_spacing=0.12
)

colors = ['blue', 'green', 'orange']

for i, batch_size in enumerate(RPS_BATCH_SIZES):
    data = stats[stats['batch_size'] == batch_size]
    
    # Total latency
    fig.add_trace(go.Scatter(
        x=data['target_rps'], y=data['total_mean'],
        name=f'Batch {batch_size}', mode='lines+markers',
        line=dict(color=colors[i], width=2), marker=dict(size=8)
    ), row=1, col=1)
    
    # Success rate
    fig.add_trace(go.Scatter(
        x=data['target_rps'], y=data['success_rate'],
        mode='lines+markers', line=dict(color=colors[i], width=2),
        showlegend=False
    ), row=2, col=1)
    
    # Request time
    fig.add_trace(go.Scatter(
        x=data['target_rps'], y=data['request_mean'],
        mode='lines+markers', line=dict(color=colors[i], width=2),
        showlegend=False
    ), row=2, col=2)
    
    # Timing breakdown at max RPS
    max_rps_data = data[data['target_rps'] == data['target_rps'].max()].iloc[0]
    fig.add_trace(go.Bar(
        x=[f'Batch {batch_size}'], y=[max_rps_data['queue_mean']],
        name='Queueing' if i == 0 else '', marker_color='lightblue',
        showlegend=(i == 0)
    ), row=1, col=2)
    fig.add_trace(go.Bar(
        x=[f'Batch {batch_size}'], y=[max_rps_data['request_mean']],
        name='Request' if i == 0 else '', marker_color='darkblue',
        showlegend=(i == 0)
    ), row=1, col=2)

fig.update_xaxes(title_text="Target RPS", row=1, col=1)
fig.update_xaxes(title_text="Batch Size", row=1, col=2)
fig.update_xaxes(title_text="Target RPS", row=2, col=1)
fig.update_xaxes(title_text="Target RPS", row=2, col=2)

fig.update_yaxes(title_text="Latency (ms)", row=1, col=1)
fig.update_yaxes(title_text="Time (ms)", row=1, col=2)
fig.update_yaxes(title_text="Success Rate (%)", row=2, col=1, range=[0, 105])
fig.update_yaxes(title_text="Request Time (ms)", row=2, col=2)

fig.update_layout(barmode='stack', height=700)
fig.show()

# Print breaking point analysis
print("\nüìä Breaking Point Analysis:")
for batch_size in RPS_BATCH_SIZES:
    data = stats[stats['batch_size'] == batch_size]
    reliable = data[data['success_rate'] >= 95]
    
    if len(reliable) > 0:
        max_rps = reliable['target_rps'].max()
        row = reliable[reliable['target_rps'] == max_rps].iloc[0]
        
        queue_pct = (row['queue_p95'] / row['total_p95'] * 100) if row['total_p95'] > 0 else 0
        
        print(f"\n   Batch {int(batch_size):3d}: Max reliable RPS = {int(max_rps):3d}")
        print(f"      P95 Total:   {row['total_p95']:7.1f}ms")
        print(f"      P95 Queue:   {row['queue_p95']:7.1f}ms ({queue_pct:4.1f}% of total)")
        print(f"      P95 Request: {row['request_p95']:7.1f}ms ({100-queue_pct:4.1f}% of total)")
        
        if queue_pct > 50:
            print(f"      ‚ö†Ô∏è  Bottleneck: Client-side queueing")
        else:
            print(f"      ‚úÖ Bottleneck: Endpoint processing (expected)")


üìä Breaking Point Analysis:

   Batch   1: Max reliable RPS = 100
      P95 Total:    4560.0ms
      P95 Queue:       0.0ms ( 0.0% of total)
      P95 Request:  4560.0ms (100.0% of total)
      ‚úÖ Bottleneck: Endpoint processing (expected)

   Batch   5: Max reliable RPS = 100
      P95 Total:    4248.1ms
      P95 Queue:       0.0ms ( 0.0% of total)
      P95 Request:  4248.0ms (100.0% of total)
      ‚úÖ Bottleneck: Endpoint processing (expected)

   Batch 100: Max reliable RPS = 100
      P95 Total:    5697.8ms
      P95 Queue:       0.0ms ( 0.0% of total)
      P95 Request:  5697.8ms (100.0% of total)
      ‚úÖ Bottleneck: Endpoint processing (expected)


### Phase 1 Summary

**Key Findings from Tests:**

1. **Batch Size Impact:**
   - Baseline latency (batch=1): ~67ms
   - Optimal batch size: **50** (balances latency vs throughput)
   - Latency increases linearly with batch size for this model

2. **RPS Scaling:**
   - Low RPS (1-20): Excellent performance (~55-75ms)
   - Medium RPS (50): Moderate degradation (~1.4s)
   - High RPS (100): Significant degradation (~2.8-4.3s)
   - **All requests successful (100% success rate)**

3. **Bottleneck Analysis:**
   - Zero client-side queueing across all tests
   - Bottleneck is endpoint processing capacity (expected)
   - Endpoint can handle 100 RPS but with high latency

**Next**: Phase 2 uses these insights to test sustained load patterns and observe autoscaling behavior.

---

## Phase 2: Sustained Load Testing

Apply realistic traffic patterns over extended periods to:
- Observe autoscaling behavior
- Measure steady-state performance
- Test spike handling

**Configure test parameters below based on Phase 1 results.**

In [None]:
# Phase 2 Configuration (adjust based on Phase 1 results)
PHASE2_BATCH_SIZE = 5  # Use a moderate batch size from Phase 1

# Pattern 1: Constant Load
# Tests steady-state performance under sustained traffic
CONSTANT_RPS = 50  # Moderate sustained load
CONSTANT_DURATION = 600  # 10 minutes

# Pattern 2: Spike Test
# Tests autoscaling responsiveness and recovery
BASELINE_RPS = 20  # Low baseline
SPIKE_RPS = 100  # High spike
SPIKE_DURATION = 120  # 2 minutes

print("Phase 2 Configuration:")
print(f"  Constant Load: {CONSTANT_RPS} RPS √ó {CONSTANT_DURATION/60:.0f} min")
print(f"  Spike Test: {BASELINE_RPS} ‚Üí {SPIKE_RPS} ‚Üí {BASELINE_RPS} RPS")
print(f"  Batch size: {PHASE2_BATCH_SIZE}")
print(f"\nNote: Adjust these values based on your Phase 1 results and production requirements.")

### Pattern 1: Constant Load

Sustained traffic at constant RPS to observe:
- Steady-state latency
- CPU utilization patterns
- Whether autoscaling triggers

In [23]:
# Run constant load test
df_constant = await run_load_test(
    target_rps=CONSTANT_RPS,
    duration=CONSTANT_DURATION,
    batch_size=PHASE2_BATCH_SIZE,
    test_name="Pattern 1: Constant Load"
)

df_constant['pattern'] = 'Constant Load'


Pattern 1: Constant Load
Target: 50 RPS √ó 600s = 30,000 requests
Batch size: 5 | Concurrency: 100
  [ 82s] 5,713 done | Success: 5,713 | Avg: 1473.4ms
  [168s] 10,675 done | Success: 10,675 | Avg: 1456.5ms
  [255s] 15,837 done | Success: 15,837 | Avg: 1458.5ms
  [340s] 20,074 done | Success: 20,074 | Avg: 1454.9ms
  [425s] 24,873 done | Success: 24,873 | Avg: 1452.4ms
  [508s] 29,822 done | Success: 29,822 | Avg: 1448.1ms
  [595s] 34,025 done | Success: 34,025 | Avg: 1447.2ms
  [677s] 39,371 done | Success: 39,371 | Avg: 1442.8ms
  [760s] 43,764 done | Success: 43,764 | Avg: 1438.5ms

‚úÖ Complete: 48,907/48,907 successful (100.0%)
   Total Latency:   1436.5ms (mean) | 2582.0ms (p95)
   Queueing Time:   0.0ms (mean) | 0.0ms (p95)
   Request Time:    1436.5ms (mean) | 2582.0ms (p95)


### Pattern 2: Traffic Spike

Baseline ‚Üí Spike ‚Üí Recovery to test:
- Cold-start latency during spike
- Scale-up responsiveness
- Recovery time

In [24]:
# Run traffic spike test (baseline ‚Üí spike ‚Üí recovery)
spike_results = []

# Phase 1: Baseline
df_baseline1 = await run_load_test(
    target_rps=BASELINE_RPS,
    duration=120,
    batch_size=PHASE2_BATCH_SIZE,
    test_name="Spike - Baseline 1"
)
df_baseline1['phase'] = 'baseline1'
spike_results.append(df_baseline1)

# Phase 2: Spike
df_spike = await run_load_test(
    target_rps=SPIKE_RPS,
    duration=SPIKE_DURATION,
    batch_size=PHASE2_BATCH_SIZE,
    test_name="Spike - Peak Load"
)
df_spike['phase'] = 'spike'
spike_results.append(df_spike)

# Phase 3: Recovery
df_baseline2 = await run_load_test(
    target_rps=BASELINE_RPS,
    duration=120,
    batch_size=PHASE2_BATCH_SIZE,
    test_name="Spike - Recovery"
)
df_baseline2['phase'] = 'baseline2'
spike_results.append(df_baseline2)

df_spike_pattern = pd.concat(spike_results, ignore_index=True)
df_spike_pattern['pattern'] = 'Traffic Spike'


Spike - Baseline 1
Target: 20 RPS √ó 120s = 2,400 requests
Batch size: 5 | Concurrency: 40
  [ 60s] 0 done | Success: 0 | Avg: 0.0ms

‚úÖ Complete: 2/2 successful (100.0%)
   Total Latency:   71.0ms (mean) | 77.8ms (p95)
   Queueing Time:   0.0ms (mean) | 0.0ms (p95)
   Request Time:    71.0ms (mean) | 77.8ms (p95)

Spike - Peak Load
Target: 100 RPS √ó 120s = 12,000 requests
Batch size: 5 | Concurrency: 200
  [169s] 11,537 done | Success: 11,537 | Avg: 2883.3ms

‚úÖ Complete: 20,718/20,718 successful (100.0%)
   Total Latency:   2940.8ms (mean) | 5160.9ms (p95)
   Queueing Time:   0.0ms (mean) | 0.0ms (p95)
   Request Time:    2940.8ms (mean) | 5160.9ms (p95)

Spike - Recovery
Target: 20 RPS √ó 120s = 2,400 requests
Batch size: 5 | Concurrency: 40
  [ 60s] 0 done | Success: 0 | Avg: 0.0ms

‚úÖ Complete: 5/5 successful (100.0%)
   Total Latency:   120.1ms (mean) | 253.0ms (p95)
   Queueing Time:   0.0ms (mean) | 0.0ms (p95)
   Request Time:    120.1ms (mean) | 253.0ms (p95)


In [25]:
# Visualize Phase 2 results over time
df_phase2 = pd.concat([df_constant, df_spike_pattern], ignore_index=True)
df_phase2_success = df_phase2[df_phase2['success'] == True].copy()

# Create time-series plots for each pattern
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Constant Load - Latency Over Time', 'Constant Load - Timing Breakdown',
                   'Traffic Spike - Latency Over Time', 'Traffic Spike - Timing Breakdown'),
    vertical_spacing=0.12,
    horizontal_spacing=0.12
)

for row, pattern in enumerate(['Constant Load', 'Traffic Spike'], 1):
    data = df_phase2_success[df_phase2_success['pattern'] == pattern].copy()
    
    if len(data) > 0:
        # Calculate elapsed time
        min_time = data['timestamp'].min()
        data['elapsed_seconds'] = (data['timestamp'] - min_time).dt.total_seconds()
        data['time_bucket'] = (data['elapsed_seconds'] // 10) * 10
        
        # Aggregate by time bucket
        bucket_stats = data.groupby('time_bucket').agg({
            'total_latency_ms': ['mean', lambda x: np.percentile(x, 95)],
            'queueing_ms': 'mean',
            'request_ms': 'mean'
        }).reset_index()
        bucket_stats.columns = ['time_bucket', 'total_mean', 'total_p95', 'queue_mean', 'request_mean']
        
        # Left: Total latency over time
        fig.add_trace(go.Scatter(
            x=bucket_stats['time_bucket'], y=bucket_stats['total_mean'],
            name='Mean', mode='lines', line=dict(color='blue', width=2),
            showlegend=(row == 1)
        ), row=row, col=1)
        fig.add_trace(go.Scatter(
            x=bucket_stats['time_bucket'], y=bucket_stats['total_p95'],
            name='P95', mode='lines', line=dict(color='orange', width=2, dash='dash'),
            showlegend=(row == 1)
        ), row=row, col=1)
        
        # Right: Timing breakdown (stacked area)
        fig.add_trace(go.Scatter(
            x=bucket_stats['time_bucket'], y=bucket_stats['queue_mean'],
            name='Queueing', mode='lines', fill='tozeroy',
            line=dict(color='lightblue', width=0),
            fillcolor='rgba(173, 216, 230, 0.5)',
            showlegend=(row == 1)
        ), row=row, col=2)
        fig.add_trace(go.Scatter(
            x=bucket_stats['time_bucket'],
            y=bucket_stats['queue_mean'] + bucket_stats['request_mean'],
            name='Request', mode='lines', fill='tonexty',
            line=dict(color='darkblue', width=0),
            fillcolor='rgba(0, 0, 139, 0.5)',
            showlegend=(row == 1)
        ), row=row, col=2)

for row in [1, 2]:
    fig.update_xaxes(title_text="Time (seconds)", row=row, col=1)
    fig.update_xaxes(title_text="Time (seconds)", row=row, col=2)
    fig.update_yaxes(title_text="Latency (ms)", row=row, col=1)
    fig.update_yaxes(title_text="Time (ms)", row=row, col=2)

fig.update_layout(height=700)
fig.show()

# Summary statistics
print("\nüìä Phase 2 Summary:")
for pattern in ['Constant Load', 'Traffic Spike']:
    data = df_phase2_success[df_phase2_success['pattern'] == pattern]
    if len(data) > 0:
        print(f"\n{pattern}:")
        print(f"  Total Latency:  {data['total_latency_ms'].mean():.1f}ms (mean) | "
              f"{data['total_latency_ms'].quantile(0.95):.1f}ms (p95)")
        print(f"  Queueing Time:  {data['queueing_ms'].mean():.1f}ms (mean) | "
              f"{data['queueing_ms'].quantile(0.95):.1f}ms (p95)")
        print(f"  Request Time:   {data['request_ms'].mean():.1f}ms (mean) | "
              f"{data['request_ms'].quantile(0.95):.1f}ms (p95)")


üìä Phase 2 Summary:

Constant Load:
  Total Latency:  1436.5ms (mean) | 2582.0ms (p95)
  Queueing Time:  0.0ms (mean) | 0.0ms (p95)
  Request Time:   1436.5ms (mean) | 2582.0ms (p95)

Traffic Spike:
  Total Latency:  2939.9ms (mean) | 5160.8ms (p95)
  Queueing Time:  0.0ms (mean) | 0.0ms (p95)
  Request Time:   2939.9ms (mean) | 5160.8ms (p95)


---

## Cloud Monitoring Analysis

Query CPU utilization metrics to understand autoscaling behavior.

**Note**: Vertex AI doesn't expose replica count via API - use Cloud Console to view active replicas.

In [26]:
# Get test time window
all_data = pd.concat([df_batch, df_rps, df_phase2], ignore_index=True)
test_start = all_data['timestamp'].min() - timedelta(minutes=10)
test_end = all_data['timestamp'].max() + timedelta(minutes=10)

# Query CPU utilization
project_name = f"projects/{PROJECT_ID}"
interval = monitoring_v3.TimeInterval({
    "end_time": {"seconds": int(test_end.timestamp())},
    "start_time": {"seconds": int(test_start.timestamp())}
})

metric_filter = (
    f'resource.type="aiplatform.googleapis.com/Endpoint" AND '
    f'resource.labels.endpoint_id="{endpoint.name.split("/")[-1]}" AND '
    f'metric.type="aiplatform.googleapis.com/prediction/online/cpu/utilization"'
)

request = monitoring_v3.ListTimeSeriesRequest({
    "name": project_name,
    "filter": metric_filter,
    "interval": interval,
    "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL
})

# Collect CPU data
cpu_data = []
for result in monitoring_client.list_time_series(request=request):
    for point in result.points:
        cpu_data.append({
            'timestamp': pd.Timestamp(point.interval.end_time),
            'cpu_utilization': point.value.double_value * 100  # Convert to percentage
        })

df_cpu = pd.DataFrame(cpu_data).sort_values('timestamp')

if len(df_cpu) > 0:
    print(f"‚úÖ Retrieved {len(df_cpu)} CPU measurements")
    print(f"   Range: {df_cpu['cpu_utilization'].min():.1f}% - {df_cpu['cpu_utilization'].max():.1f}%")
    print(f"   Mean: {df_cpu['cpu_utilization'].mean():.1f}%")
else:
    print("‚ö†Ô∏è  No CPU metrics found (may not be available yet)")

‚úÖ Retrieved 48 CPU measurements
   Range: 0.2% - 34.1%
   Mean: 5.5%


In [27]:
# Visualize CPU utilization
if len(df_cpu) > 0:
    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=df_cpu['timestamp'], y=df_cpu['cpu_utilization'],
        mode='lines+markers', name='CPU Utilization',
        line=dict(color='blue', width=2), marker=dict(size=4)
    ))
    
    # Add autoscaling threshold line
    fig.add_hline(y=60, line_dash="dash", line_color="red",
                  annotation_text="Autoscale Threshold (60%)")
    
    fig.update_layout(
        title='CPU Utilization Over Time',
        xaxis_title='Time',
        yaxis_title='CPU Utilization (%)',
        height=400,
        yaxis=dict(range=[0, 100])
    )
    fig.show()
    
    # Analysis
    max_cpu = df_cpu['cpu_utilization'].max()
    print(f"\nüìä CPU Analysis:")
    print(f"   Max CPU: {max_cpu:.1f}%")
    
    if max_cpu >= 60:
        print(f"   ‚úÖ CPU exceeded autoscaling threshold (60%)")
        print(f"      Autoscaling should have triggered")
    else:
        print(f"   ‚ö†Ô∏è  CPU never reached autoscaling threshold (60%)")
        print(f"\n   üí° This indicates a capacity bottleneck, not compute bottleneck:")
        print(f"      ‚Ä¢ Model inference is very efficient (low CPU usage)")
        print(f"      ‚Ä¢ Latency degrades due to request queueing, not processing")
        print(f"      ‚Ä¢ Solution: Lower autoscaling threshold or increase min replicas")
else:
    print("No CPU data to visualize")


üìä CPU Analysis:
   Max CPU: 34.1%
   ‚ö†Ô∏è  CPU never reached autoscaling threshold (60%)

   üí° This indicates a capacity bottleneck, not compute bottleneck:
      ‚Ä¢ Model inference is very efficient (low CPU usage)
      ‚Ä¢ Latency degrades due to request queueing, not processing
      ‚Ä¢ Solution: Lower autoscaling threshold or increase min replicas


---

## Summary & Recommendations

Based on test results, here are configuration recommendations for production.

In [29]:
print("="*80)
print("VERTEX AI ENDPOINT SCALING TEST SUMMARY")
print("="*80)

print(f"\nüìã Configuration Tested:")
print(f"   Endpoint: {endpoint.display_name}")
print(f"   Machine: {MACHINE_TYPE}")
print(f"   Replicas: {MIN_REPLICAS} - {MAX_REPLICAS}")

print(f"\nüìä Phase 1 Results:")
# Get baseline from batch test data
batch_stats = df_batch.groupby('batch_size')['latency_ms'].agg(['mean']).reset_index()
baseline_lat = batch_stats[batch_stats['batch_size'] == 1]['mean'].values[0]
print(f"   Baseline latency (batch=1, 1 RPS): {baseline_lat:.1f}ms")
print(f"   Optimal batch size: {optimal}")

print(f"\n   Max Reliable RPS by Batch Size:")
# Get RPS stats from the RPS test
rps_stats = df_rps[df_rps['success'] == True].groupby(['batch_size', 'target_rps']).agg({
    'total_latency_ms': ['mean', lambda x: np.percentile(x, 95)],
    'success': 'count'
}).reset_index()
rps_stats.columns = ['batch_size', 'target_rps', 'total_mean', 'total_p95', 'count']

# Calculate success rates
for batch_size in RPS_BATCH_SIZES:
    batch_data = rps_stats[rps_stats['batch_size'] == batch_size]
    if len(batch_data) > 0:
        # Assume all tests with data had >= 95% success (we only kept successful requests)
        max_rps = batch_data['target_rps'].max()
        p95 = batch_data[batch_data['target_rps'] == max_rps]['total_p95'].values[0]
        print(f"      Batch {int(batch_size):3d}: {int(max_rps):3d} RPS (p95: {p95:.1f}ms)")

if len(df_cpu) > 0:
    print(f"\nüñ•Ô∏è  CPU Utilization:")
    print(f"   Max: {df_cpu['cpu_utilization'].max():.1f}%")
    print(f"   Mean: {df_cpu['cpu_utilization'].mean():.1f}%")
    
    if df_cpu['cpu_utilization'].max() < 60:
        print(f"   ‚ö†Ô∏è  Never reached autoscaling threshold (60%)")

print(f"\nüí° Recommendations:")
print(f"\n   1. Optimal Batch Size: {optimal} instances per request")
batch_optimal_lat = batch_stats[batch_stats['batch_size']==optimal]['mean'].values[0]
print(f"      - Balances latency ({batch_optimal_lat:.1f}ms) with throughput")

if len(df_cpu) > 0 and df_cpu['cpu_utilization'].max() < 60:
    print(f"\n   2. Autoscaling Configuration:")
    print(f"      - Current threshold (60% CPU) too high for this model")
    print(f"      - Model is CPU-efficient: max observed {df_cpu['cpu_utilization'].max():.1f}%")
    print(f"      - Options:")
    print(f"        a) Lower threshold to 5-20% CPU (requires redeployment)")
    print(f"        b) Increase min_replicas to 2-3 for baseline capacity")
    print(f"        c) Use larger machine type (more vCPUs may trigger scaling)")

# Calculate max throughput from RPS test results
if len(rps_stats) > 0:
    rps_stats['throughput'] = rps_stats['batch_size'] * rps_stats['target_rps']
    best_idx = rps_stats['throughput'].idxmax()
    best = rps_stats.loc[best_idx]
    print(f"\n   3. Maximum Throughput: ~{int(best['throughput'])} instances/sec")
    print(f"      - Configuration: batch={int(best['batch_size'])}, RPS={int(best['target_rps'])}")

print(f"\nüìù Next Steps:")
print(f"   ‚Ä¢ Review Cloud Console > Vertex AI > Endpoints > Monitoring")
print(f"     for replica count and autoscaling events")
print(f"   ‚Ä¢ Consider redeploying with adjusted autoscaling threshold")
print(f"   ‚Ä¢ Run tests again after configuration changes to validate")

print(f"\n" + "="*80)

VERTEX AI ENDPOINT SCALING TEST SUMMARY

üìã Configuration Tested:
   Endpoint: pytorch-autoencoder-endpoint
   Machine: n1-standard-4
   Replicas: 1 - 4

üìä Phase 1 Results:
   Baseline latency (batch=1, 1 RPS): 66.9ms
   Optimal batch size: 50

   Max Reliable RPS by Batch Size:
      Batch   1: 100 RPS (p95: 4560.0ms)
      Batch   5: 100 RPS (p95: 4248.1ms)
      Batch 100: 100 RPS (p95: 5697.8ms)

üñ•Ô∏è  CPU Utilization:
   Max: 34.1%
   Mean: 5.5%
   ‚ö†Ô∏è  Never reached autoscaling threshold (60%)

üí° Recommendations:

   1. Optimal Batch Size: 50 instances per request
      - Balances latency (128.7ms) with throughput

   2. Autoscaling Configuration:
      - Current threshold (60% CPU) too high for this model
      - Model is CPU-efficient: max observed 34.1%
      - Options:
        a) Lower threshold to 5-20% CPU (requires redeployment)
        b) Increase min_replicas to 2-3 for baseline capacity
        c) Use larger machine type (more vCPUs may trigger scaling)

 

## Conclusion

This notebook tested Vertex AI Endpoint performance comprehensively, revealing:

**Key Insights:**
- **Timing separation** (queueing vs request time) identifies bottleneck location
- **CPU-efficient models** may not trigger autoscaling despite high load
- **Capacity bottlenecks** differ from compute bottlenecks
- **Zero queueing time** indicates the client can send requests faster than the endpoint can process them

**Production Recommendations:**
- Use optimal batch size from Phase 1 testing (batch=50 for this model)
- Configure autoscaling threshold appropriate for your model's CPU profile
- Set min replicas to handle baseline traffic without cold starts
- For CPU-efficient models, consider lowering autoscaling threshold to 5-20%

**Important Note:**
Results shown in this notebook are specific to the PyTorch autoencoder model tested. Your results will vary based on:
- Model complexity and inference time
- Machine type and replica configuration
- Input data size and format
- Network conditions

Always run your own scaling tests with representative traffic patterns before deploying to production.

---

**Related Notebooks:**
- [Deploy to Vertex AI Endpoint (Prebuilt Container)](./vertex-ai-endpoint-prebuilt-container.ipynb)
- [Deploy to Vertex AI Endpoint (Custom Container)](./vertex-ai-endpoint-custom-container.ipynb)
- [PyTorch Autoencoder Training](../pytorch-autoencoder.ipynb)

**Related Resources:**
- [Vertex AI Prediction Documentation](https://cloud.google.com/vertex-ai/docs/predictions/overview)
- [Autoscaling Configuration](https://cloud.google.com/vertex-ai/docs/predictions/configure-compute#autoscaling)
- [Performance Optimization Guide](https://cloud.google.com/vertex-ai/docs/predictions/optimize-prediction-performance)