Elastic failure handling #23

Merged on Dec 14, 2020
69 commits
366bc4b
Update API
Dec 1, 2020
d352737
Update readme
Dec 1, 2020
698ed80
Update tests
Dec 1, 2020
0a0c52a
Fix benchmark test
Dec 1, 2020
cc6a46d
Fix benchmark test
Dec 1, 2020
7f500f0
Fix benchmark test
Dec 1, 2020
d0cd1b2
Fix benchmark test
Dec 1, 2020
bfa6c6a
Fix lint
Dec 1, 2020
2a6cb0b
Simplify smoke test
Dec 1, 2020
839281e
Add manual checkpointing test
Dec 1, 2020
03322ac
almost equal
Dec 1, 2020
9005a5c
Finish checkpointing test
Dec 1, 2020
65ba0a0
Remove references to checkpoint_path
Dec 1, 2020
0815102
Merge branch 'master' into rabit-checkpointing
Dec 1, 2020
b716ca6
Re-order imports
Dec 1, 2020
5bbf145
Apply suggestions from code review
Dec 2, 2020
29afb75
Switch to ray callbacks
Dec 2, 2020
b0356f6
More actors per trial
Dec 2, 2020
8be0753
Replace Ray Tune callbacks with xgboost_ray-specific callbacks
Dec 4, 2020
cd9885a
Demo for failure handling
Dec 4, 2020
5bed369
Merge branch 'master' into tune-checkpointing
Dec 7, 2020
9bd4e62
New callbacks and legacy support
Dec 7, 2020
3a48e20
Try to make FT test less flaky
Dec 7, 2020
6041552
Flaky test debug
Dec 7, 2020
b317fae
Fix typing
Dec 7, 2020
fe058d0
Debug outputs
Dec 7, 2020
8bb4e66
Get queue after ray.get
Dec 7, 2020
d04f54b
Kill queue actors
Dec 7, 2020
d8d0b7b
4 CPUs
Dec 7, 2020
235817f
clean up experiment dir
Dec 7, 2020
8e14947
Only 1 trial
Dec 7, 2020
fd01b53
Fix flaky test, remove debug output
Dec 7, 2020
198ea20
Fix flaky test
Dec 7, 2020
4d96794
Merge branch 'flaky-test' into fix-flaky-test
Dec 7, 2020
db861b0
Merge branch 'master' into fix-flaky-test
Dec 7, 2020
b29b6d8
Merge branch 'tune-checkpointing' into failure-handling-polling
Dec 7, 2020
bb2fa57
Elastic failure handling with actor status polling
Dec 7, 2020
44d34d7
Avoid `with_parameters`
Dec 7, 2020
f47262e
Merge branch 'tune-checkpointing' into failure-handling-polling
Dec 7, 2020
7d6b586
Remove unneeded session function
Dec 7, 2020
fb6e050
Merge branch 'tune-checkpointing' into failure-handling-polling
Dec 7, 2020
84eaefc
Merge branch 'master' into failure-handling-polling
Dec 7, 2020
f66dcf3
Fix lint from merge
Dec 7, 2020
845aa1e
Merge branch 'master' into failure-handling-polling
Dec 8, 2020
e25a0ba
Set test timeouts, kill Event actor
Dec 8, 2020
1146241
Stop ray before running examples
Dec 8, 2020
ff63b50
Kill queue and actors, set training thread to daemon
Dec 9, 2020
7abc6b8
Merge branch 'master' into failure-handling-polling
Dec 9, 2020
a66b206
Ray init cpus
Dec 9, 2020
d14ed77
smoke test
Dec 9, 2020
ea12cd8
Only shutdown if an object
Dec 9, 2020
7f4c5c4
Unpin xgboost
Dec 10, 2020
add29cb
Move to new-style callback API
Dec 10, 2020
31e2e22
Keep compatibility with legacy tune callbacks
Dec 10, 2020
cf7f049
Keep compatibility with legacy tune callbacks
Dec 10, 2020
90d3768
Merge branch 'unpin-xgboost' into failure-handling-polling
Dec 10, 2020
e002388
Move to new-style callbacks
Dec 10, 2020
eb08abc
Merge branch 'master' into failure-handling-polling
Dec 10, 2020
03a04ce
logging scope and silent local training error
Dec 10, 2020
d6d2740
Refactor checkpoint callback
Dec 10, 2020
e75d6ba
Checkpoint after training
Dec 10, 2020
c16adb4
Doc additional RayParams
Dec 10, 2020
0868127
Remove eval metric
Dec 10, 2020
bc5e203
Apply suggestions from code review
Dec 11, 2020
07b9ade
Apply suggestions from code review
Dec 11, 2020
ff584a7
Update xgboost_ray/main.py
krfricke Dec 13, 2020
86e4397
Apply suggestions from code review (mostly docs)
Dec 13, 2020
c1fb8b1
Added a bunch of in-code documentation
Dec 13, 2020
96a402d
Separate logs for failure modes
Dec 14, 2020
15 changes: 9 additions & 6 deletions .github/workflows/test.yaml
@@ -5,6 +5,7 @@ on: [push, pull_request]
 jobs:
   test_lint:
     runs-on: ubuntu-latest
+    timeout-minutes: 3

     steps:
     - uses: actions/checkout@v2
@@ -22,8 +23,8 @@ jobs:
         ./format.sh --all

   test_linux_ray_master:
-
     runs-on: ubuntu-latest
+    timeout-minutes: 10

     steps:
     - uses: actions/checkout@v2
@@ -51,15 +52,16 @@ jobs:
         echo "running smoke test on benchmark_cpu_gpu.py" && python release/benchmark_cpu_gpu.py 2 10 20 --smoke-test
         popd
         pushd examples/
-        echo "running simple.py" && python simple.py
+        ray stop || true
+        echo "running simple.py" && python simple.py --smoke-test
         echo "running simple_predict.py" && python simple_predict.py
-        echo "running simple_tune.py" && python simple_tune.py
+        echo "running simple_tune.py" && python simple_tune.py --smoke-test
         echo "running train_on_test_data.py" && python train_on_test_data.py --smoke-test
         # for f in *.py; do echo "running $f" && python "$f" || exit 1 ; done

   test_linux_ray_release:
-
     runs-on: ubuntu-latest
+    timeout-minutes: 10

     steps:
     - uses: actions/checkout@v2
@@ -87,8 +89,9 @@ jobs:
         echo "running smoke test on benchmark_cpu_gpu.py" && python release/benchmark_cpu_gpu.py 2 10 20 --smoke-test
         popd
         pushd examples/
-        echo "running simple.py" && python simple.py
+        ray stop || true
+        echo "running simple.py" && python simple.py --smoke-test
         echo "running simple_predict.py" && python simple_predict.py
-        echo "running simple_tune.py" && python simple_tune.py
+        echo "running simple_tune.py" && python simple_tune.py --smoke-test
         echo "running train_on_test_data.py" && python train_on_test_data.py --smoke-test
         # for f in *.py; do echo "running $f" && python "$f" || exit 1 ; done
12 changes: 9 additions & 3 deletions examples/simple.py
@@ -3,6 +3,8 @@
 from sklearn import datasets
 from sklearn.model_selection import train_test_split

+import ray
+
 from xgboost_ray import RayDMatrix, train, RayParams


@@ -60,12 +62,16 @@ def main(cpus_per_actor, num_actors):
     parser.add_argument(
         "--num-actors",
         type=int,
-        default=1,
+        default=4,
         help="Sets number of xgboost workers to use.")
+    parser.add_argument(
+        "--smoke-test", action="store_true", default=False, help="gpu")

     args, _ = parser.parse_known_args()

-    import ray
-    ray.init(address=args.address)
+    if args.smoke_test:
+        ray.init(num_cpus=args.num_actors)
+    else:
+        ray.init(address=args.address)

     main(args.cpus_per_actor, args.num_actors)
10 changes: 8 additions & 2 deletions examples/simple_tune.py
@@ -5,6 +5,7 @@
 from sklearn.model_selection import train_test_split
 import xgboost as xgb

+import ray
 from ray import tune

 from xgboost_ray import train, RayDMatrix, RayParams
@@ -98,9 +99,14 @@ def main(cpus_per_actor, num_actors, num_samples):
         type=int,
         default=4,
         help="Number of samples to use for Tune.")
+    parser.add_argument(
+        "--smoke-test", action="store_true", default=False, help="gpu")
+
     args, _ = parser.parse_known_args()

-    import ray
-    ray.init(address=args.address)
+    if args.smoke_test:
+        ray.init(num_cpus=args.num_actors * args.num_samples)
+    else:
+        ray.init(address=args.address)

     main(args.cpus_per_actor, args.num_actors, args.num_samples)
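
Reviewer note: the `--smoke-test` branch added to `simple.py` and `simple_tune.py` can be summarized by the sketch below. It is an illustration, not code from this PR. The helper names `ray_init_kwargs` and `build_parser` are hypothetical, the `--address` option is assumed from the scripts' use of `args.address` (its definition is outside the hunks shown), and the help string is reworded for readability. The sketch only computes the arguments that would be passed to `ray.init`, so it runs without Ray installed.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options the example scripts read."""
    parser = argparse.ArgumentParser()
    # Assumed option: the scripts read args.address, defined outside the diff.
    parser.add_argument("--address", type=str, default=None)
    parser.add_argument("--num-actors", type=int, default=4)
    # Illustrative help text; the PR's own help string differs.
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=False,
        help="Start a small local Ray instance instead of connecting.")
    return parser


def ray_init_kwargs(args, num_samples=1):
    """Return the keyword arguments the scripts would hand to ray.init().

    num_samples=1 corresponds to simple.py; simple_tune.py multiplies the
    actor count by its number of Tune samples.
    """
    if args.smoke_test:
        # Local instance with as many CPUs as actors (times Tune samples).
        return {"num_cpus": args.num_actors * num_samples}
    # Otherwise connect to whatever address was given on the command line.
    return {"address": args.address}


if __name__ == "__main__":
    args, _ = build_parser().parse_known_args(["--smoke-test"])
    print(ray_init_kwargs(args))  # {'num_cpus': 4}
```

With the defaults above, `simple_tune.py --smoke-test` would correspond to `ray_init_kwargs(args, num_samples=4)`, that is, a 16-CPU local instance, which may matter for the 10-minute CI timeout added in `test.yaml`.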