Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

## AKS Load Testing

Once a model has been deployed to production, it is important to confirm that the deployment target can support the expected load (number of users and expected response speed). This is critical for production recommendation systems, which must serve multiple users simultaneously. As the number of concurrent users grows, the load on the recommendation system can increase significantly, so understanding the limits of any operationalized system is necessary to avoid system failures or slow response times for users.

To perform this kind of load test we can use tools that simulate user requests at varying rates and establish how many requests per second the service can handle and what its average response time is. This notebook walks through the process of performing load testing for a deployed model on Azure Kubernetes Service (AKS).

This notebook assumes an AKS Webservice was used to deploy the model from an Azure Machine Learning service Workspace.
An example of this approach is provided in the [LightGBM Operationalization notebook](lightgbm_criteo_o16n.ipynb).

We use [Locust](https://docs.locust.io/en/stable/) to perform the load testing; see its documentation for more details about the tool.

In [1]:
import os
import subprocess
from tempfile import TemporaryDirectory
from urllib.parse import urlparse

import requests

from azureml.core import Workspace
from azureml.core import VERSION as azureml_version
from azureml.core.webservice import AksWebservice

from reco_utils.dataset.criteo import load_pandas_df

# Check core SDK version number
print("Azure ML SDK version: {}".format(azureml_version))

Azure ML SDK version: 1.0.18


In [2]:
# We increase the cell width to capture all the output from locust later
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

### Create a temporary directory for generated files

In [3]:
TMP_DIR = TemporaryDirectory()

### Retrieve the AKS service information

In [4]:
# this must match the service name that has been deployed
SERVICE_NAME = 'lightgbm-criteo'

In [5]:
ws = Workspace.get(
    name="<AZUREML-WORKSPACE-NAME>",
    subscription_id='<AZURE-SUBSCRIPTION-ID>',
    resource_group='<AZURE-RESOURCE-GROUP>',
)

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


Found the config file in: C:\Users\scgraham\repos\Recommenders\notebooks\05_operationalize\aml_config\config.json
Wrote the config file config.json to: C:\Users\scgraham\repos\Recommenders\notebooks\05_operationalize\aml_config\config.json


In [6]:
aks_service = AksWebservice(ws, name=SERVICE_NAME)

In [7]:
# Get the scoring URI
url = aks_service.scoring_uri
parsed_url = urlparse(url)

# Setup authentication using one of the keys from aks_service
headers = dict(Authorization='Bearer {}'.format(aks_service.get_keys()[0]))
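
The parsed URL is used in two places later on: the scheme and network location become the host that Locust targets, and the path becomes the endpoint each simulated user posts to. A minimal sketch of that split, using a made-up scoring URI (the address and path below are placeholders, not a real service):

```python
from urllib.parse import urlparse

# Hypothetical scoring URI; a real one comes from aks_service.scoring_uri
example_url = "http://203.0.113.10:80/api/v1/service/lightgbm-criteo/score"
parsed = urlparse(example_url)

# Scheme and network location form the host passed to the load test tool
host = "{url.scheme}://{url.netloc}".format(url=parsed)
# The path is the endpoint that each request targets on that host
score_path = parsed.path

print(host)
print(score_path)
```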

### Get Sample data for testing

In [8]:
# Grab some sample data
df = load_pandas_df(size='sample')

8.79MB [00:04, 1.93MB/s]


In [9]:
data = df.iloc[0, :].to_json()
print(data)

{"label":0,"int00":1.0,"int01":1,"int02":5.0,"int03":0.0,"int04":1382.0,"int05":4.0,"int06":15.0,"int07":2.0,"int08":181.0,"int09":1.0,"int10":2.0,"int11":null,"int12":2.0,"cat00":"68fd1e64","cat01":"80e26c9b","cat02":"fb936136","cat03":"7b4723c4","cat04":"25c83c98","cat05":"7e0ccccf","cat06":"de7995b8","cat07":"1f89b562","cat08":"a73ee510","cat09":"a8cd5504","cat10":"b2cb9c98","cat11":"37c9c164","cat12":"2824a5f6","cat13":"1adce6ef","cat14":"8ba8b39a","cat15":"891b62e7","cat16":"e5ba7672","cat17":"f54016b9","cat18":"21ddcdc9","cat19":"b1252a9d","cat20":"07b5194c","cat21":null,"cat22":"3a171ecb","cat23":"c5c50484","cat24":"e8b83407","cat25":"9727dd16"}


In [10]:
# Ensure the aks service is running and provides expected results
aks_service.run(data)

'{"result": 0.35952275816753043}'

In [11]:
# Make sure an HTTP request to the service will also work
response = requests.post(url=url, json=data, headers=headers)
print(response.json())

{"result": 0.35952275816753043}
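
Note that `data` is already a JSON string (it comes from `to_json()`), so passing it through the `json=` argument serializes it a second time: the request body is a JSON-encoded string rather than a JSON object. The sketch below reproduces that encoding with the standard library only. The shortened record is made up, and how the scoring script decodes the body is an assumption.

```python
import json

# Shortened, made-up record standing in for one row of the sample data
data = json.dumps({"label": 0, "int00": 1.0, "cat00": "68fd1e64"})

# requests serializes the json= argument with json.dumps, so a str is encoded again
body = json.dumps(data)

# Decoding once gives back the original string, not a dict
once = json.loads(body)
# Decoding twice recovers the record itself
twice = json.loads(once)

print(type(once).__name__, type(twice).__name__)
```

If a scoring script expects a JSON object in the body, sending the string through `data=` with a JSON content type would be the alternative to consider.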


### Setup LocustFile

Locust uses a locust file (`locustfile.py` by default) to control the behavior of simulated users.

In this example we create a `UserBehavior` class which encapsulates the tasks that a user performs each time it is started. We are only interested in ensuring that the service can handle a request with sample data, so the only task is `score`, a simple POST request like the one made manually above.

The next class defines how a user is instantiated. In this case we create a user that starts an HTTP session with the host server and executes the defined tasks. Each task is repeated after a short wait, whose length is drawn uniformly at random between the min and max wait times (in milliseconds).
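
As a rough illustration of that wait rule (a sketch of the idea, not Locust's own code), the pause between tasks can be modelled as a uniform draw between the minimum and maximum wait, converted to seconds. The function name `sample_wait_seconds` and the fixed seed are illustrative only.

```python
import random

MIN_WAIT_MS = 1000
MAX_WAIT_MS = 2000


def sample_wait_seconds(rng, min_wait_ms=MIN_WAIT_MS, max_wait_ms=MAX_WAIT_MS):
    """Draw one wait time uniformly between the bounds and return it in seconds."""
    return rng.uniform(min_wait_ms, max_wait_ms) / 1000.0


rng = random.Random(42)  # seeded so the sketch is repeatable
waits = [sample_wait_seconds(rng) for _ in range(10000)]
mean_wait = sum(waits) / len(waits)

# Every draw falls inside the configured bounds, and the mean sits near the midpoint
print(min(waits) >= 1.0, max(waits) <= 2.0)
print(round(mean_wait, 1))
```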

In [12]:
locustfile = """
from locust import HttpLocust, TaskSet, task


class UserBehavior(TaskSet):
    @task
    def score(self):
        self.client.post("{score_url}", json='{data}', headers={headers})


class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    # min and max time to wait before repeating task
    min_wait = 1000
    max_wait = 2000
""".format(data=data, headers=headers, score_url=parsed_url.path)

locustfile_path = os.path.join(TMP_DIR.name, 'locustfile.py')
with open(locustfile_path, 'w') as f:
    f.write(locustfile)
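
Because the locust file is built by string formatting, a value containing a single quote or backslash would produce invalid Python. One optional safeguard, sketched here with a made-up payload and path, is to embed strings with `!r` and check the rendered source with `ast.parse` before writing it. This is a suggestion under those assumptions, not part of the original notebook flow.

```python
import ast

template = """
from locust import HttpLocust, TaskSet, task


class UserBehavior(TaskSet):
    @task
    def score(self):
        self.client.post({score_url!r}, json={data!r}, headers={headers!r})


class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    min_wait = 1000
    max_wait = 2000
"""

# Made-up values; the quote in the payload would break a hand-quoted '{data}'
rendered = template.format(
    data='{"note": "it\'s a test"}',
    headers={"Authorization": "Bearer <KEY>"},
    score_url="/api/v1/service/example/score",
)

# ast.parse only checks syntax, so locust does not need to be installed here
tree = ast.parse(rendered)
class_names = [node.name for node in tree.body if isinstance(node, ast.ClassDef)]
print(class_names)
```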

The next step is to start the Locust load testing tool. It can be run with a web interface or directly from the command line. Here we run it from the command line and specify the number of concurrent users, how fast the users should spawn, and how long the test should run. All of these options can also be controlled through the web interface, which additionally provides more information on failures, so the documentation is worth reading for more advanced usage. In this notebook we simply run the test and capture the summary results.
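
For a sense of what those three settings imply, here is a back-of-the-envelope sketch using the values from the command in the next cell (200 users, 10 users per second, 1 minute) and the 1 to 2 second wait range from the locust file. The throughput figure is an idealized ceiling that assumes zero response time, and the variable names are illustrative.

```python
users = 200            # concurrent users (-c)
hatch_rate = 10        # users started per second (-r)
duration_s = 60        # total run time (-t 1m)
mean_wait_s = (1.0 + 2.0) / 2  # midpoint of min_wait and max_wait, in seconds

# Time until all users are running
ramp_up_s = users / hatch_rate
# Time left with every user active
full_load_s = duration_s - ramp_up_s
# Each user sends at most one request per wait period, so this is an upper bound
max_requests_per_s = users / mean_wait_s

print(ramp_up_s, full_load_s, round(max_requests_per_s, 1))
```

The 20 second ramp-up is consistent with the log timestamps in the output further down, where hatching begins at 12:36:31 and completes at 12:36:51.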

In [13]:
cmd = "locust -H {host} -f {path} --no-web -c {users} -r {rate} -t {duration} --only-summary".format(
    host='{url.scheme}://{url.netloc}'.format(url=parsed_url),
    path=locustfile_path,
    users=200,  # concurrent users
    rate=10,  # hatch rate (users / second)
    duration='1m',  # test duration
)
process = subprocess.run(cmd, shell=True, stderr=subprocess.PIPE)
print(process.stderr.decode('utf-8'))

[2019-05-28 12:36:31,630] 9821192-1116/INFO/locust.main: Run time limit set to 60 seconds
[2019-05-28 12:36:31,631] 9821192-1116/INFO/locust.main: Starting Locust 0.11.0
[2019-05-28 12:36:31,631] 9821192-1116/INFO/locust.runners: Hatching and swarming 200 clients at the rate 10 clients/s...
[2019-05-28 12:36:51,864] 9821192-1116/INFO/locust.runners: All locusts hatched: WebsiteUser: 200
[2019-05-28 12:37:30,701] 9821192-1116/INFO/locust.main: Time limit reached. Stopping Locust.
[2019-05-28 12:37:30,707] 9821192-1116/INFO/locust.main: Shutting down (exit code 0), bye.
[2019-05-28 12:37:30,707] 9821192-1116/INFO/locust.main: Cleaning up runner...
[2019-05-28 12:37:30,738] 9821192-1116/INFO/locust.main: Running teardowns...
 Name                                                          # reqs      # fails     Avg     Min     Max  |  Median   req/s
--------------------------------------------------------------------------------------------------------------------------------------------
 

### Load Test Results

Above you can see the number of requests, failures and statistics on response time, as well as the number of requests per second that the server is handling.

The second table shows the distribution of response times, which helps reveal how load affects response speed across all requests and whether outliers are hurting performance.
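
To see why both an average and a distribution are worth reading, the sketch below summarizes a handful of made-up response times that include one slow outlier. The numbers are invented for illustration, and the nearest-rank percentile used here is one common definition, not necessarily the one Locust applies.

```python
import math
import statistics

# Invented response times in milliseconds, with one slow outlier
response_ms = [100, 120, 130, 150, 900]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


avg = statistics.mean(response_ms)
median = statistics.median(response_ms)
p95 = percentile(response_ms, 95)

# The outlier drags the average well above the median and dominates the tail
print(avg, median, p95)
```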

### Cleanup temporary directory

In [14]:
TMP_DIR.cleanup()