# Anyscale Services HA and Canary Rollout Features

## High availability for web services and ML serving

High availability (HA) refers to the ability of a system to continue to function despite one or more component failures.

Traditional web service
* Load balancers or layer 7 switches sit at the front end and receive all external traffic
    * Load is distributed across multiple service instances
    * Spare capacity and the ability to add service instances provide HA
* Stateful services complicate this picture a bit
    * Data storage must also be HA, which involves at least temporary compromises (for example, in availability or latency) to keep the data consistent
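To make the stateless load-balancing idea concrete, here is a toy round-robin balancer that skips instances marked as failed. It is a minimal sketch for illustration only, not how any particular load balancer (or Anyscale) is implemented. `RoundRobinBalancer` and `make_instance` are hypothetical names, and a "service instance" is modeled as a plain Python function.

```python
import itertools
from typing import Callable, Dict


class RoundRobinBalancer:
    """Toy front-end load balancer: spread requests across instances,
    skipping any instance currently marked unhealthy."""

    def __init__(self, instances: Dict[str, Callable[[str], str]]):
        self.instances = instances
        self.healthy = set(instances)
        self._order = itertools.cycle(list(instances))

    def mark_down(self, name: str) -> None:
        self.healthy.discard(name)  # e.g. after a failed health check

    def mark_up(self, name: str) -> None:
        self.healthy.add(name)

    def handle(self, request: str) -> str:
        # Try each instance at most once per request before giving up
        for _ in range(len(self.instances)):
            name = next(self._order)
            if name in self.healthy:
                return self.instances[name](request)
        raise RuntimeError("no healthy instances available")


def make_instance(name: str) -> Callable[[str], str]:
    # Hypothetical stand-in for a real service instance
    return lambda req: f"{name} handled {req}"


lb = RoundRobinBalancer({n: make_instance(n) for n in ["a", "b", "c"]})
print([lb.handle(f"r{i}") for i in range(3)])
lb.mark_down("b")  # simulate an instance failure
print([lb.handle(f"r{i}") for i in range(3, 7)])
```

After `b` fails, requests keep flowing to `a` and `c`; the service only becomes unavailable if every instance is down.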

Scale-out clustered compute service for ML
* Modern machine learning workloads often leverage scale-out clustering technologies like Ray
* Cluster compute environments typically do *not* offer HA for performance and architecture reasons
* __Anyscale services__ enable HA for Ray Serve
    * *No single point of failure: even Ray's head node can fail without impacting service availability or capacity*
    * For the user of Anyscale services, this is available by default and requires no configuration
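Why a single point of failure matters can be shown with a back-of-envelope availability calculation. This sketch assumes independent failures and uses an illustrative 99% availability per component (not a measured figure): redundant replicas combine "in parallel", while a component that every request depends on (such as a single head node) combines "in series" and caps the whole system.

```python
def parallel_availability(a: float, n: int) -> float:
    """Availability of n redundant components where any one suffices
    (assumes independent failures)."""
    return 1 - (1 - a) ** n


def series_availability(*parts: float) -> float:
    """Availability of a chain where every part must be up."""
    result = 1.0
    for p in parts:
        result *= p
    return result


replica = 0.99  # assumed availability of one replica
head = 0.99     # assumed availability of one head node

# Three replicas behind a single head node: the head node caps availability
with_spof = series_availability(head, parallel_availability(replica, 3))
# The same replicas with no single point of failure in front of them
without_spof = parallel_availability(replica, 3)

print(f"with SPOF:    {with_spof:.6f}")
print(f"without SPOF: {without_spof:.6f}")
```

Adding replicas pushes `without_spof` toward 1, but `with_spof` can never exceed the head node's own availability.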

<img src='images/grafana-dash.png' width=800 />

## Canary service rollouts

The canary rollout feature allows zero-downtime upgrades as a live service transitions from one implementation to a new one
* Additional clusters are automatically provisioned by Anyscale for new service versions
    * service versions do *not* need to share config, dependencies, or even hardware requirements
    * the only things that stay the same are the (internal) name and the external endpoint
* Load is gradually shifted from the old service to the new one by Anyscale load balancers
    * rollout (changeover) schedule can be automatic, customized, or manually controlled
* The status of the old and new versions is visible and accessible simultaneously in the Anyscale UI
    * Grafana integration shows real-time statistics on the service transition
    * A __Rollback__ feature is available if it is necessary to abort the transition and return all traffic to the original service
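Gradual traffic shifting can be modeled as weighted random routing driven by a schedule of percentages. The sketch below is a conceptual simulation under that assumption, not Anyscale's load-balancer implementation; `rollout_schedule`, `route`, and the `[0, 10, 50, 100]` schedule are hypothetical.

```python
import random
from typing import List, Tuple


def rollout_schedule(steps: List[int]) -> List[Tuple[int, int]]:
    """Convert canary percentages for the new version into (old %, new %) pairs."""
    for pct in steps:
        if not 0 <= pct <= 100:
            raise ValueError(f"percentage out of range: {pct}")
    return [(100 - pct, pct) for pct in steps]


def route(new_pct: int, rng: random.Random) -> str:
    """Route one request: 'new' with probability new_pct / 100, else 'old'."""
    return "new" if rng.random() * 100 < new_pct else "old"


rng = random.Random(0)  # fixed seed so the simulation is repeatable
for old_pct, new_pct in rollout_schedule([0, 10, 50, 100]):
    sample = [route(new_pct, rng) for _ in range(1000)]
    observed = 100 * sample.count("new") / len(sample)
    print(f"target old/new = {old_pct}/{new_pct}, observed new = {observed:.1f}%")
```

In this model, a rollback is just a step back to 0% for the new version: `route(0, rng)` always returns `"old"`.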

### Demo of canary rollouts

We'll demonstrate rolling out multiple service versions while monitoring both externally and via Anyscale

#### Setup

The service versions are implemented in Python using standard Ray Serve APIs
* `1-hello-world.py` - simple "hello world" echo service
* `2-hello-chat.py` - skeleton for a chat service that generates responses in a trivial, static manner
* `3-llm-chat.py` - functional chatbot service built on Hugging Face and the `blenderbot-400M-distill` model
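The request-handling logic of a static chat skeleton can be written independently of Ray Serve. The function below is a hypothetical sketch in the spirit of `2-hello-chat.py`; it is not that file's contents. It assumes the payload shape used by the requests later in this notebook (`user_input` plus a `history` list) and assumes `history` is a flat list of strings. In a real service, this logic would sit inside a Ray Serve deployment class (decorated with `@serve.deployment`) whose `__call__` method reads the JSON body from the incoming Starlette request.

```python
from typing import Dict, List


def hello_chat(payload: Dict) -> Dict:
    """Hypothetical static chat handler: echo the input and extend the history."""
    user_input = payload.get("user_input", "")
    # Copy the history so the caller's payload is not mutated
    history: List[str] = list(payload.get("history", []))
    reply = f"You said: {user_input}"  # trivial, static response
    history.extend([user_input, reply])
    return {"response": reply, "history": history}


print(hello_chat({"user_input": "hello", "history": []}))
```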

Each service version has a corresponding YAML file used to deploy it (`1-service.yaml`, `2-service.yaml`, etc.). Further configuration is possible through these YAML files as well as through CLI parameters, but the examples here are kept minimal for clarity.

Prior to running the present notebook, a security token was obtained and stored in `token.txt`

In [None]:
with open('token.txt', 'r') as f:
    # Strip the trailing newline; requests rejects header values containing one
    token = f.read().strip()
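A slightly more defensive way to load the token is to prefer an environment variable and fall back to the file, failing early if the token is empty. This is a minimal sketch: `load_token` and the `SERVICE_TOKEN` variable name are hypothetical choices, not an Anyscale convention.

```python
import os
from pathlib import Path


def load_token(path: str = "token.txt", env_var: str = "SERVICE_TOKEN") -> str:
    """Return the bearer token, preferring a (hypothetically named)
    environment variable and falling back to a file."""
    token = os.environ.get(env_var)
    if token is None:
        token = Path(path).read_text()
    token = token.strip()  # remove whitespace and trailing newlines
    if not token:
        raise ValueError("bearer token is empty")
    return token
```

Keeping the token out of the notebook source (in a file or environment variable) also avoids committing it to version control.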

Each Anyscale service has a unique URL; calls to this URL are routed automatically during the version changeover.

In [None]:
base_url = "https://service-demo-xa4v3.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com"

We'll set up minimal code to make a request to our service

In [None]:
import requests

path = "/"
full_url = f"{base_url}{path}"
headers = {"Authorization": f"Bearer {token}"}
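For reference, here is what such an authenticated JSON POST looks like on the wire, built with only the standard library. It constructs the request without sending it. `build_request` is a hypothetical helper, and `https://example.com` is a placeholder URL. The payload dict is serialized exactly once. Passing an already-serialized string to `requests`' `json=` argument would encode it a second time, so the server would receive a JSON string rather than a JSON object.

```python
import json
import urllib.request


def build_request(base_url: str, path: str, token: str, payload: dict) -> urllib.request.Request:
    """Construct (but do not send) an authenticated JSON POST request."""
    body = json.dumps(payload).encode("utf-8")  # serialize the dict once
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request("https://example.com", "/", "my-token", {"user_input": "hello", "history": []})
print(req.full_url, req.get_method(), req.get_header("Authorization"))
# To actually send it: urllib.request.urlopen(req, timeout=30)
```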

In [None]:
# Pass a dict (not a JSON string) so `json=` serializes it only once
sample_json = {"user_input": "hello", "history": []}

requests.post(full_url, headers=headers, json=sample_json).json()

Currently, the service is not deployed, so this request does not return a valid response.

#### Initial service rollout

In [None]:
! anyscale service rollout -f 1-service.yaml

In the Anyscale UI (or via the logs), we can monitor the initial rollout.

> We can launch `load_test.py` in a console to generate a steady stream of requests to our service
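The contents of `load_test.py` are not shown here. The sketch below is a hypothetical load generator of the same kind: it calls a `send` function at a steady rate and appends one JSON line per request (timestamp, success flag, latency, response) to an output file. `run_load_test`, its parameters, and the JSON-lines format are assumptions. In this notebook, `send` could be something like `lambda: str(requests.post(full_url, headers=headers, json=sample_json).json())`.

```python
import json
import time
from typing import Callable


def run_load_test(send: Callable[[], str], out_path: str,
                  rate_per_s: float = 2.0, max_requests: int = 10) -> None:
    """Hypothetical load generator: call send() at a steady rate and append
    one JSON line per request to out_path."""
    interval = 1.0 / rate_per_s
    with open(out_path, "a") as out:
        for _ in range(max_requests):
            start = time.monotonic()
            try:
                result, ok = send(), True
            except Exception as exc:  # keep going while the service is unavailable
                result, ok = repr(exc), False
            latency = time.monotonic() - start
            row = {"ts": time.time(), "ok": ok,
                   "latency_s": round(latency, 4), "response": result}
            out.write(json.dumps(row) + "\n")
            out.flush()  # so `tail -f` shows each line immediately
            time.sleep(max(0.0, interval - latency))
```

Catching exceptions per request matters during a rollout: a failed call is recorded as a data point instead of stopping the test.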

In [None]:
requests.post(full_url, headers=headers, json=sample_json).json()

#### Upgrading the service

When we are ready to upgrade the service, we issue another CLI command

In [None]:
! anyscale service rollout -f 2-service.yaml

At this point, we may want to observe the canary rollout service changeover in
* the Anyscale service UI
* Grafana timeseries chart of all-version traffic
* changes in the service's responses to local requests (run `tail -f data.txt` to follow the stats written by the `load_test.py` script)
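To quantify the changeover from the client side, you can tally which version produced each recent response. This sketch assumes a hypothetical JSON-lines stats file with `ok` and `response` fields (the real `data.txt` format may differ). It also uses a hypothetical `classify` rule that tells the versions apart by their reply text.

```python
import json
from collections import Counter
from typing import Callable, Dict, Iterable


def version_share(lines: Iterable[str], classify: Callable[[str], str],
                  window: int = 100) -> Dict[str, float]:
    """Share of the most recent successful responses attributed to each version."""
    labels = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        if row.get("ok"):
            labels.append(classify(row["response"]))
    recent = labels[-window:]
    if not recent:
        return {}
    counts = Counter(recent)
    return {version: n / len(recent) for version, n in counts.items()}


def classify(response: str) -> str:
    # Hypothetical rule: assumes the chat version's replies contain "You said"
    return "hello-chat" if "You said" in response else "hello-world"
```

For example, `version_share(open("data.txt"), classify)` would show the share of traffic served by the new version rising during the rollout.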

In [None]:
requests.post(full_url, headers=headers, json=sample_json).json()

#### LLM chat service

To change over to a real LLM-backed chat service, we run another similar CLI command

Note that
* the LLM-backed service requires different software (Python libraries) and hardware (GPUs)
    * those changes are managed automatically, with no additional configuration steps
* because of model inference, the LLM chat service has higher latency than the "hello world" service
    * the capacity and throughput of the LLM chat service can be increased by adding replicas or enabling autoscaling for that deployment (using the standard Ray Serve APIs)
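How many replicas higher latency requires can be estimated with Little's law: the average number of in-flight requests equals the arrival rate times the latency. The rates, latencies, and per-replica concurrency below are illustrative assumptions, not measurements of these services. `replicas_needed` is a hypothetical helper. In Ray Serve, the resulting count would be set through the deployment's `num_replicas` option, or left to autoscaling.

```python
import math


def replicas_needed(requests_per_s: float, latency_s: float,
                    concurrency_per_replica: int) -> int:
    """Little's law: in-flight requests = arrival rate x latency.
    Divide by how many requests one replica can serve at once."""
    in_flight = requests_per_s * latency_s
    return max(1, math.ceil(in_flight / concurrency_per_replica))


# Assumed numbers: 10 requests/s, one request at a time per replica
print(replicas_needed(10, 0.01, 1))  # fast echo service
print(replicas_needed(10, 0.5, 1))   # slower LLM inference
```

At the same request rate, the 50x higher latency turns a single replica's worth of load into several replicas' worth.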

In [None]:
! anyscale service rollout -f 3-service.yaml

We can further manage the service via the Anyscale UI, the Python SDK, or the CLI.