Add serve ft nightly #19125

jiaodong · 2021-10-06T00:10:12Z

Create a Serve fault-tolerance (FT) test that runs both locally and as a product nightly test.

It uses s3://serve-nightly-tests/fault-tolerant-test-checkpoint as the checkpoint path, with a 7-day TTL; each checkpoint is 260 bytes in size.

Each test run uses a uuid4 namespace and a unique storage key. It goes through deploy() -> kill the controller, Ray, and the cluster -> resume in the same namespace -> check endpoint availability, then attempts to clean up the checkpoint file and exits.
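
The flow described above can be sketched without Ray at all. The following is a minimal, hypothetical simulation and not the test added in this PR: `LocalDiskKVStore`, `FakeController`, `run_fault_tolerance_check`, and the endpoint names are illustrative stand-ins, and "killing" the controller and cluster is modelled simply by discarding the in-memory object.

```python
import json
import os
import tempfile
import uuid


class LocalDiskKVStore:
    """Hypothetical stand-in for a checkpoint store backed by one JSON file."""

    def __init__(self, path):
        self.path = path

    def put(self, key, value):
        data = self._load()
        data[key] = value
        with open(self.path, "w") as f:
            json.dump(data, f)

    def get(self, key):
        return self._load().get(key)

    def delete(self):
        if os.path.exists(self.path):
            os.remove(self.path)

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)


class FakeController:
    """Hypothetical controller that checkpoints its endpoints on every deploy."""

    def __init__(self, store, storage_key):
        self.store = store
        self.storage_key = storage_key
        # Recover state from the checkpoint if one exists.
        self.endpoints = store.get(storage_key) or []

    def deploy(self, name):
        self.endpoints.append(name)
        self.store.put(self.storage_key, self.endpoints)


def run_fault_tolerance_check(checkpoint_dir):
    namespace = uuid.uuid4().hex  # unique namespace per run
    storage_key = f"serve-checkpoint-{namespace}"  # unique storage key per run
    store = LocalDiskKVStore(os.path.join(checkpoint_dir, f"{namespace}.json"))

    controller = FakeController(store, storage_key)
    controller.deploy("hello")
    controller.deploy("world")

    del controller  # simulate losing the controller and the cluster

    # Resume with the same namespace and storage key.
    recovered = FakeController(store, storage_key)
    endpoints = list(recovered.endpoints)

    store.delete()  # best-effort cleanup of the checkpoint file
    return endpoints


with tempfile.TemporaryDirectory() as tmp:
    recovered_endpoints = run_fault_tolerance_check(tmp)
    print(recovered_endpoints)
```

The property being checked is that state written before the simulated failure is visible to a fresh controller that only shares the storage key, which is the same property the real test checks against actual Serve endpoints.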

The test works with both local disk and S3 paths. Since we don't yet have sufficient infra for multi-cluster tests, both the local and product tests use the same cluster_utils to run a local cluster and differ only in their storage paths.
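
Selecting a storage backend from the checkpoint path could look roughly like the sketch below. The `CheckpointTarget` tuple, the `parse_checkpoint_path` helper, and the choice to treat scheme-less strings as local paths are assumptions made for illustration; they are not necessarily how `python/ray/serve/storage/checkpoint_path.py` parses paths.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class CheckpointTarget(NamedTuple):
    """Hypothetical parsed form of a checkpoint path."""

    backend: str  # "s3" or "local"
    bucket: str   # empty for local paths
    path: str     # object key prefix for S3, file path for local disk


def parse_checkpoint_path(checkpoint_path: str) -> CheckpointTarget:
    parsed = urlparse(checkpoint_path)
    if parsed.scheme == "s3":
        # s3://<bucket>/<prefix>
        return CheckpointTarget("s3", parsed.netloc, parsed.path.lstrip("/"))
    if parsed.scheme in ("", "file"):
        # Plain paths and file:// URLs both map to local disk (assumption).
        return CheckpointTarget("local", "", parsed.path)
    raise ValueError(f"Unsupported checkpoint path: {checkpoint_path}")


print(parse_checkpoint_path(
    "s3://serve-nightly-tests/fault-tolerant-test-checkpoint"))
print(parse_checkpoint_path("/tmp/serve_checkpoint.db"))
```

With a helper of this shape, the same test body can run against either backend by changing only the path string it is given.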

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/

Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

python/ray/serve/storage/checkpoint_path.py

jiaodong · 2021-10-06T00:11:41Z

python/ray/serve/storage/kv_store.py

-            aws_access_key_id=aws_access_key_id,
-            aws_secret_access_key=aws_secret_access_key,
-            aws_session_token=aws_session_token)
+            aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID", None),


I think previously we weren't actually getting these tokens in make_kv_store, so I'm adding them at the client level with the actual S3 integration.

boto should fall back to environ if we pass in None, so this is not necessary.

bump on this

python/ray/serve/storage/checkpoint_path.py

edoakes · 2021-10-06T16:01:37Z

release/serve_tests/workloads/serve_cluster_fault_tolerance.py

+For product testing, we skip the part of actually starting new cluster as
+it's Job Manager's responsibility, and only re-deploy to the same cluster
+with remote checkpoint.


Hmm, I think we should still fully test that path where the entire cluster goes down. If we just test the controller being killed there may be some hidden assumptions about state in the GCS or something.

Can you just use cluster_utils and blow away the whole cluster and create a new one?

I would also like to start a completely new Ray cluster if possible, but I think cluster_utils is for local tests. Is there an equivalent on product to blow away the current cluster and start a new one?

Hmm I'm not sure. cluster_utils would be a good first step, it at least clears all process-level state

Ask internally on slack if there's a best way to do this? Worst case you just use the SDK to kill the cluster and start a new one.

cluster_utils will run much faster though, and it'll be nice to be able to run it locally for testing. If possible, maybe make one test with both "backends" and start with just cluster_utils?

I asked a few product folks. Our SDK has start/terminate cluster APIs, but the issue is that the current releaser/e2e.py assumes it runs a single script, as if on a laptop, in a single Ray cluster. Doing this in our test script might not work well, and the most feasible way is probably to extend the existing test script's yaml fields to support a multi-cluster setup.

For now I'm trying to work with multiple local clusters on product, which should terminate quickly but also give us more coverage compared to the current revision.

simon-mo

@jiaodong please make sure to add them to preset dashboard once they are running!

jiaodong added 3 commits October 5, 2021 16:15

working locally

f1c4b97

add compute templates

4e28f41

travis

78a9426

jiaodong added the serve Ray Serve Related Issue label Oct 6, 2021

jiaodong assigned edoakes and simon-mo Oct 6, 2021

jiaodong commented Oct 6, 2021

edoakes requested changes Oct 6, 2021

jiaodong added 3 commits October 8, 2021 16:47

working both local and on s3

e9dbcfd

travis

70ec445

change redis port to work on product

4171c0c

simon-mo requested a review from edoakes October 11, 2021 16:52

jiaodong added 2 commits October 11, 2021 10:24

back to aws tokens as args without explictly fetching from os env

6fd37e6

travis

626b859

edoakes approved these changes Oct 11, 2021

simon-mo approved these changes Oct 12, 2021

simon-mo merged commit 85b8a6d into ray-project:master Oct 12, 2021
