Merge ray upstream into master (#35)
* [rllib] Remove dependency on TensorFlow (ray-project#4764)

* remove hard tf dep

* add test

* comment fix

* fix test

* Dynamic Custom Resources - create and delete resources (ray-project#3742)

* Update tutorial link in doc (ray-project#4777)

* [rllib] Implement learn_on_batch() in torch policy graph

* Fix `ray stop` by killing raylet before plasma (ray-project#4778)

* Fatal check if object store dies (ray-project#4763)

* [rllib] fix clip by value issue as TF upgraded (ray-project#4697)

* fix clip_by_value issue

* fix typo

* [autoscaler] Fix submit (ray-project#4782)

* Queue tasks in the raylet in between async callbacks (ray-project#4766)

* Add a SWAP TaskQueue so that we can keep track of tasks that are temporarily dequeued

* Fix bug where tasks that fail to be forwarded don't appear to be local by adding them to SWAP queue

* cleanups

* updates

* updates

* [Java][Bazel]  Refine auto-generated pom files (ray-project#4780)

* Bump version to 0.7.0 (ray-project#4791)

* [JAVA] setDefaultUncaughtExceptionHandler to log uncaught exception in user thread. (ray-project#4798)

* Add WorkerUncaughtExceptionHandler

* Fix

* revert bazel and pom

* [tune] Fix CLI test (ray-project#4801)

* Fix pom file generation (ray-project#4800)

* [rllib] Support continuous action distributions in IMPALA/APPO (ray-project#4771)

* [rllib] TensorFlow 2 compatibility (ray-project#4802)

* Change tagline in documentation and README. (ray-project#4807)

* Update README.rst, index.rst, tutorial.rst, and _config.yml

* [tune] Support non-arg submit (ray-project#4803)

* [autoscaler] rsync cluster (ray-project#4785)

* [tune] Remove extra parsing functionality (ray-project#4804)

* Fix Java worker log dir (ray-project#4781)

* [tune] Initial track integration (ray-project#4362)

Introduces a minimally invasive utility for logging experiment results. A broad requirement for this tool is that it should integrate seamlessly with Tune execution.
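
A minimal sketch of how such a logging utility might look from user code follows; the module path `ray.tune.track` and the function names `track.init` / `track.log` / `track.shutdown` are assumptions for illustration, not taken from this commit.

```python
# Hypothetical usage sketch of the track logging utility; the module path
# and function names below are assumptions, not confirmed by this commit.
from ray.tune import track

def train_func(config):
    track.init()                          # assumed: open a logging session
    for step in range(config["steps"]):
        loss = 1.0 / (step + 1)           # stand-in for a real training step
        track.log(step=step, loss=loss)   # assumed: record one result row
    track.shutdown()                      # assumed: flush and close the session

train_func({"steps": 10})
```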

* [rllib] [RFC] Dynamic definition of loss functions and modularization support (ray-project#4795)

* dynamic graph

* wip

* clean up

* fix

* document trainer

* wip

* initialize the graph using a fake batch

* clean up dynamic init

* wip

* spelling

* use builder for ppo pol graph

* add ppo graph

* fix naming

* order

* docs

* set class name correctly

* add torch builder

* add custom model support in builder

* cleanup

* remove underscores

* fix py2 compat

* Update dynamic_tf_policy_graph.py

* Update tracking_dict.py

* wip

* rename

* debug level

* rename policy_graph -> policy in new classes

* fix test

* rename ppo tf policy

* port appo too

* forgot grads

* default policy optimizer

* make default config optional

* add config to optimizer

* use lr by default in optimizer

* update

* comments

* remove optimizer

* fix tuple actions support in dynamic tf graph

* [rllib] Rename PolicyGraph => Policy, move from evaluation/ to policy/ (ray-project#4819)

This implements some of the renames proposed in ray-project#4813.
We leave behind backwards-compatibility aliases for *PolicyGraph and SampleBatch.
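
A backwards-compatibility alias of this kind is usually just a module-level assignment; the sketch below assumes the new `ray.rllib.policy` package layout implied by the rename rather than the exact shim in this commit.

```python
# Sketch of the backwards-compatibility shim described above; the exact
# module layout is inferred from the rename, not copied from the commit.
from ray.rllib.policy.policy import Policy
from ray.rllib.policy.tf_policy import TFPolicy

# Old names keep resolving for downstream code that still imports them.
PolicyGraph = Policy
TFPolicyGraph = TFPolicy
```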

* [Java] Dynamic resource API in Java (ray-project#4824)

* Add default values for Wgym flags

* Fix import

* Fix issue when starting `raylet_monitor` (ray-project#4829)

* Refactor ID Serial 1: Separate ObjectID and TaskID from UniqueID (ray-project#4776)

* Enable BaseId.

* Change TaskID and make python test pass

* Remove unnecessary functions, fix test failure, and change TaskID to 16 bytes.

* Java code change draft

* Refine

* Lint

* Update java/api/src/main/java/org/ray/api/id/TaskId.java

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update java/api/src/main/java/org/ray/api/id/BaseId.java

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update java/api/src/main/java/org/ray/api/id/BaseId.java

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update java/api/src/main/java/org/ray/api/id/ObjectId.java

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Address comment

* Lint

* Fix SINGLE_PROCESS

* Fix comments

* Refine code

* Refine test

* Resolve conflict

* Fix bug in which actor classes are not exported multiple times. (ray-project#4838)

* Bump Ray master version to 0.8.0.dev0 (ray-project#4845)

* Add section to bump version of master branch and cleanup release docs (ray-project#4846)

* Fix import

* Export remote functions when first used and also fix bug in which rem… (ray-project#4844)

* Export remote functions when first used, and also fix a bug in which remote functions and actor classes are not exported from workers during subsequent ray sessions (a short illustration follows at the end of this entry).

* Documentation update

* Fix tests.

* Fix grammar
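
The sketch below illustrates, with the standard `ray.remote` API, what lazy export means from the driver's point of view; the behavior comments restate the description above rather than this commit's internal implementation.

```python
# Illustration of lazy export using the public Ray API; comments restate
# the behavior described above, not the internal implementation details.
import ray

ray.init()

@ray.remote
def square(x):
    # Defining the function no longer ships it to the cluster; the pickled
    # definition is exported only when the function is first invoked.
    return x * x

# The first call triggers the export, and re-export also happens in
# subsequent ray sessions rather than only in the first one.
print(ray.get(square.remote(4)))  # prints 16
```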

* Update wheel versions in documentation to 0.8.0.dev0 and 0.7.0. (ray-project#4847)

* [tune] Later expansion of local_dir (ray-project#4806)

* [rllib] [RFC] Deprecate Python 2 / RLlib (ray-project#4832)

* Fix a typo in kubernetes yaml (ray-project#4872)

* Move global state API out of global_state object. (ray-project#4857)

* Install bazel in autoscaler development configs. (ray-project#4874)

* [tune] Fix up Ax Search and Examples (ray-project#4851)

* update Ax for cleaner API

* docs update

* [rllib] Update concepts docs and add "Building Policies in Torch/TensorFlow" section (ray-project#4821)

* wip

* fix index

* fix bugs

* todo

* add imports

* note on get ph

* note on get ph

* rename to building custom algs

* add rnn state info

* [rllib] Fix error getting kl when simple_optimizer: True in multi-agent PPO

* Replace ReturnIds with NumReturns in TaskInfo to reduce the size (ray-project#4854)

* Refine TaskInfo

* Fix

* Add a test to print task info size

* Lint

* Refine

* Update deps commits of opencensus to support building with bzl 0.25.x (ray-project#4862)

* Update deps to support bazel 0.25.x

* Fix

* Upgrade arrow to latest master (ray-project#4858)

* [tune] Auto-init Ray + default SearchAlg (ray-project#4815)

* Bump version from 0.8.0.dev0 to 0.7.1. (ray-project#4890)

* [rllib] Allow access to batches prior to postprocessing (ray-project#4871)

* [rllib] Fix Multidiscrete support (ray-project#4869)

* Refactor redis callback handling (ray-project#4841)

* Add CallbackReply

* Fix

* fix linting by format.sh

* Fix linting

* Address comments.

* Fix

* Initial high-level code structure of CoreWorker. (ray-project#4875)

* Drop duplicated string format (ray-project#4897)

This string format is unnecessary; java_worker_options is appended to the command line later.

* Refactor ID Serial 2: change all ID functions to `CamelCase` (ray-project#4896)

* Hotfix for change of from_random to FromRandom (ray-project#4909)

* [rllib] Fix documentation on custom policies (ray-project#4910)

* wip

* add docs

* lint

* todo sections

* fix doc

* [rllib] Allow Torch policies access to full action input dict in extra_action_out_fn (ray-project#4894)

* fix torch extra out

* preserve setitem

* fix docs

* [tune] Pretty print params json in logger.py (ray-project#4903)

* [sgd] Distributed Training via PyTorch (ray-project#4797)

Implements distributed SGD using distributed PyTorch.
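
The commit message does not show the trainer API itself, so the sketch below only illustrates the underlying `torch.distributed` data-parallel loop such a trainer wraps; the backend, address, and model here are illustrative choices, not Ray's API.

```python
# Sketch of the torch.distributed data-parallel pattern wrapped by a
# distributed SGD trainer; backend, address, and model are illustrative.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(rank, world_size):
    dist.init_process_group("gloo", rank=rank, world_size=world_size,
                            init_method="tcp://127.0.0.1:29500")
    model = DDP(torch.nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    data, target = torch.randn(32, 10), torch.randn(32, 1)
    for _ in range(10):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(data), target)
        loss.backward()   # DDP averages gradients across workers here
        optimizer.step()
    dist.destroy_process_group()
```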

* [rllib] Rough port of DQN to build_tf_policy() pattern (ray-project#4823)

* fetching objects in parallel in _get_arguments_for_execution (ray-project#4775)

* [tune] Disallow setting resources_per_trial when it is already configured (ray-project#4880)

* disallow it

* import fix

* fix example

* fix test

* fix tests

* Update mock.py

* fix

* make less convoluted

* fix tests

* [rllib] Rename PolicyEvaluator => RolloutWorker (ray-project#4820)

* Fix local cluster yaml (ray-project#4918)

* [tune] Directional metrics for components (ray-project#4120) (ray-project#4915)

* [Core Worker] implement ObjectInterface and add test framework (ray-project#4899)

* [tune] Make PBT Quantile fraction configurable (ray-project#4912)

* Better organize ray_common module (ray-project#4898)

* Fix error

* [tune] Add requirements-dev.txt and update docs for contributing (ray-project#4925)

* Add requirements-dev.txt and update docs.

* Update doc/source/tune-contrib.rst

Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>

* Unpin everything except for yapf.

* Fix compute actions return value

* Bump version from 0.7.1 to 0.8.0.dev1. (ray-project#4937)

* Update version number in documentation after release 0.7.0 -> 0.7.1 and 0.8.0.dev0 -> 0.8.0.dev1. (ray-project#4941)

* [doc] Update developer docs with bazel instructions (ray-project#4944)

* [C++] Add hash table to Redis-Module (ray-project#4911)

* Flush lineage cache on task submission instead of execution (ray-project#4942)

* [rllib] Add docs on how to use TF eager execution (ray-project#4927)

* [rllib] Port remainder of algorithms to build_trainer() pattern (ray-project#4920)

* Fix resource bookkeeping bug with acquiring unknown resource. (ray-project#4945)

* Update aws keys for uploading wheels to s3. (ray-project#4948)

* Upload wheels on Travis to branchname/commit_id. (ray-project#4949)

* [Java] Fix serializing issues of `RaySerializer` (ray-project#4887)

* Fix

* Address comment.

* fix (ray-project#4950)

* [Java] Add inner class `Builder` to build call options. (ray-project#4956)

* Add Builder class

* format

* Refactor by IDE

* Remove unnecessary dependency

* Make release stress tests work and improve them. (ray-project#4955)

* Use proper session directory for debug_string.txt (ray-project#4960)

* [core] Use int64_t instead of int to keep track of fractional resources (ray-project#4959)

* [core worker] add task submission & execution interface (ray-project#4922)

* [sgd] Add non-distributed PyTorch runner (ray-project#4933)

* Add non-distributed PyTorch runner

* use dist.is_available() instead of checking OS

* Nicer exception

* Fix bug in choosing port

* Refactor some code

* Address comments

* Address comments

* Flush all tasks from local lineage cache after a node failure (ray-project#4964)

* Remove typing from setup.py install_requirements. (ray-project#4971)

* [Java] Fix bug of `BaseID` in multi-threading case. (ray-project#4974)

* [rllib] Fix DDPG example (ray-project#4973)

* Upgrade CI clang-format to 6.0 (ray-project#4976)

* [Core worker] add store & task provider (ray-project#4966)

* Fix bugs in the a3c code template. (ray-project#4984)

* Inherit Function Docstrings and other metadata (ray-project#4985)

* Fix a crash when unknown worker registering to raylet (ray-project#4992)

* [gRPC] Use gRPC for inter-node-manager communication (ray-project#4968)
stefanpantic committed Jun 21, 2019
1 parent 59274f7 commit b850e14
Showing 150 changed files with 4,846 additions and 2,076 deletions.
2 changes: 2 additions & 0 deletions .bazelrc
@@ -2,3 +2,5 @@
build --compilation_mode=opt
build --action_env=PATH
build --action_env=PYTHON_BIN_PATH
# This workaround is needed due to https://github.com/bazelbuild/bazel/issues/4341
build --per_file_copt="external/com_github_grpc_grpc/.*@-DGRPC_BAZEL_BUILD"
2 changes: 2 additions & 0 deletions .travis.yml
@@ -1,5 +1,7 @@
language: generic

dist: xenial


services:
- docker
43 changes: 41 additions & 2 deletions BUILD.bazel
@@ -1,12 +1,37 @@
# Bazel build
# C/C++ documentation: https://docs.bazel.build/versions/master/be/c-cpp.html

load("@com_github_grpc_grpc//bazel:grpc_build_system.bzl", "grpc_proto_library")
load("@com_github_google_flatbuffers//:build_defs.bzl", "flatbuffer_cc_library")
load("@//bazel:ray.bzl", "flatbuffer_py_library")
load("@//bazel:cython_library.bzl", "pyx_library")

COPTS = ["-DRAY_USE_GLOG"]

# Node manager gRPC lib.
grpc_proto_library(
name = "node_manager_grpc_lib",
srcs = ["src/ray/protobuf/node_manager.proto"],
)

# Node manager server and client.
cc_library(
name = "node_manager_rpc_lib",
srcs = glob([
"src/ray/rpc/*.cc",
]),
hdrs = glob([
"src/ray/rpc/*.h",
]),
copts = COPTS,
deps = [
":node_manager_grpc_lib",
":ray_common",
"@boost//:asio",
"@com_github_grpc_grpc//:grpc++",
],
)

cc_binary(
name = "raylet",
srcs = ["src/ray/raylet/main.cc"],
@@ -89,6 +114,7 @@ cc_library(
":gcs",
":gcs_fbs",
":node_manager_fbs",
":node_manager_rpc_lib",
":object_manager",
":ray_common",
":ray_util",
@@ -111,13 +137,18 @@ cc_library(
srcs = glob(
[
"src/ray/core_worker/*.cc",
"src/ray/core_worker/store_provider/*.cc",
"src/ray/core_worker/transport/*.cc",
],
exclude = [
"src/ray/core_worker/*_test.cc",
"src/ray/core_worker/mock_worker.cc",
],
),
hdrs = glob([
"src/ray/core_worker/*.h",
"src/ray/core_worker/store_provider/*.h",
"src/ray/core_worker/transport/*.h",
]),
copts = COPTS,
deps = [
@@ -127,7 +158,15 @@
],
)

# This test is run by src/ray/test/run_core_worker_tests.sh
cc_binary(
name = "mock_worker",
srcs = ["src/ray/core_worker/mock_worker.cc"],
copts = COPTS,
deps = [
":core_worker_lib",
],
)

cc_binary(
name = "core_worker_test",
srcs = ["src/ray/core_worker/core_worker_test.cc"],
@@ -535,7 +574,7 @@ flatbuffer_py_library(
"ErrorTableData.py",
"ErrorType.py",
"FunctionTableData.py",
"GcsTableEntry.py",
"GcsEntry.py",
"HeartbeatBatchTableData.py",
"HeartbeatTableData.py",
"Language.py",
2 changes: 1 addition & 1 deletion README.rst
@@ -6,7 +6,7 @@
.. image:: https://readthedocs.org/projects/ray/badge/?version=latest
:target: http://ray.readthedocs.io/en/latest/?badge=latest

.. image:: https://img.shields.io/badge/pypi-0.7.0-blue.svg
.. image:: https://img.shields.io/badge/pypi-0.7.1-blue.svg
:target: https://pypi.org/project/ray/

|
4 changes: 4 additions & 0 deletions bazel/ray_deps_build_all.bzl
@@ -3,10 +3,14 @@ load("@com_github_nelhage_rules_boost//:boost/boost.bzl", "boost_deps")
load("@com_github_jupp0r_prometheus_cpp//:repositories.bzl", "prometheus_cpp_repositories")
load("@com_github_ray_project_ray//bazel:python_configure.bzl", "python_configure")
load("@com_github_checkstyle_java//:repo.bzl", "checkstyle_deps")
load("@com_github_grpc_grpc//bazel:grpc_deps.bzl", "grpc_deps")


def ray_deps_build_all():
gen_java_deps()
checkstyle_deps()
boost_deps()
prometheus_cpp_repositories()
python_configure(name = "local_config_python")
grpc_deps()

8 changes: 8 additions & 0 deletions bazel/ray_deps_setup.bzl
@@ -101,3 +101,11 @@ def ray_deps_setup():
# `https://github.com/jupp0r/prometheus-cpp/pull/225` getting merged.
urls = ["https://github.com/jovany-wang/prometheus-cpp/archive/master.zip"],
)

http_archive(
name = "com_github_grpc_grpc",
urls = [
"https://github.com/grpc/grpc/archive/7741e806a213cba63c96234f16d712a8aa101a49.tar.gz",
],
strip_prefix = "grpc-7741e806a213cba63c96234f16d712a8aa101a49",
)
@@ -9,7 +9,7 @@ pushd "$ROOT_DIR"

python -m pip install pytest-benchmark

pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev0-cp27-cp27mu-manylinux1_x86_64.whl
pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev1-cp27-cp27mu-manylinux1_x86_64.whl
python -m pytest --benchmark-autosave --benchmark-min-rounds=10 --benchmark-columns="min, max, mean" $ROOT_DIR/../../../python/ray/tests/perf_integration_tests/test_perf_integration.py

pushd $ROOT_DIR/../../../python
10 changes: 10 additions & 0 deletions ci/jenkins_tests/run_rllib_tests.sh
@@ -392,6 +392,16 @@ docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
/ray/ci/suppress_output python /ray/python/ray/rllib/examples/rollout_worker_custom_workflow.py

docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
/ray/ci/suppress_output python /ray/python/ray/rllib/examples/eager_execution.py --iters=2

docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
/ray/ci/suppress_output /ray/python/ray/rllib/train.py \
--env CartPole-v0 \
--run PPO \
--stop '{"training_iteration": 1}' \
--config '{"use_eager": true, "simple_optimizer": true}'

docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
/ray/ci/suppress_output python /ray/python/ray/rllib/examples/custom_tf_policy.py --iters=2

6 changes: 3 additions & 3 deletions ci/stress_tests/application_cluster_template.yaml
@@ -37,7 +37,7 @@ provider:
# Availability zone(s), comma-separated, that nodes may be launched in.
# Nodes are currently spread between zones by a round-robin approach,
# however this implementation detail should not be relied upon.
availability_zone: us-west-2a,us-west-2b
availability_zone: us-west-2b

# How Ray will authenticate with newly launched nodes.
auth:
@@ -90,8 +90,8 @@ file_mounts: {
# List of shell commands to run to set up nodes.
setup_commands:
- echo 'export PATH="$HOME/anaconda3/envs/tensorflow_<<<PYTHON_VERSION>>>/bin:$PATH"' >> ~/.bashrc
- ray || wget https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev0-<<<WHEEL_STR>>>-manylinux1_x86_64.whl
- rllib || pip install -U ray-0.8.0.dev0-<<<WHEEL_STR>>>-manylinux1_x86_64.whl[rllib]
- ray || wget https://s3-us-west-2.amazonaws.com/ray-wheels/releases/<<<RAY_VERSION>>>/<<<RAY_COMMIT>>>/ray-<<<RAY_VERSION>>>-<<<WHEEL_STR>>>-manylinux1_x86_64.whl
- rllib || pip install -U ray-<<<RAY_VERSION>>>-<<<WHEEL_STR>>>-manylinux1_x86_64.whl[rllib]
- pip install tensorflow-gpu==1.12.0
- echo "sudo halt" | at now + 60 minutes
# Consider uncommenting these if you also want to run apt-get commands during setup
88 changes: 56 additions & 32 deletions ci/stress_tests/run_application_stress_tests.sh
@@ -1,4 +1,11 @@
#!/usr/bin/env bash

# This script should be run as follows:
# ./run_application_stress_tests.sh <ray-version> <ray-commit>
# For example, <ray-version> might be 0.7.1
# and <ray-commit> might be bc3b6efdb6933d410563ee70f690855c05f25483. The commit
# should be the latest commit on the branch "releases/<ray-version>".

# This script runs all of the application tests.
# Currently includes an IMPALA stress test and a SGD stress test.
# on both Python 2.7 and 3.6.
@@ -10,26 +17,39 @@

# This script will exit with code 1 if the test did not run successfully.

# Show explicitly which commands are currently running. This should only be AFTER
# the private key is placed.
set -x

ROOT_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd)
RESULT_FILE=$ROOT_DIR/"results-$(date '+%Y-%m-%d_%H-%M-%S').log"

echo "Logging to" $RESULT_FILE
echo -e $RAY_AWS_SSH_KEY > /root/.ssh/ray-autoscaler_us-west-2.pem && chmod 400 /root/.ssh/ray-autoscaler_us-west-2.pem || true
touch "$RESULT_FILE"
echo "Logging to" "$RESULT_FILE"

if [[ -z "$1" ]]; then
echo "ERROR: The first argument must be the Ray version string."
exit 1
else
RAY_VERSION=$1
fi

# Show explicitly which commands are currently running. This should only be AFTER
# the private key is placed.
set -x
if [[ -z "$2" ]]; then
echo "ERROR: The second argument must be the commit hash to test."
exit 1
else
RAY_COMMIT=$2
fi

touch $RESULT_FILE
echo "Testing ray==$RAY_VERSION at commit $RAY_COMMIT."
echo "The wheels used will live under https://s3-us-west-2.amazonaws.com/ray-wheels/releases/$RAY_VERSION/$RAY_COMMIT/"

# This function identifies the right string for the Ray wheel.
_find_wheel_str(){
local python_version=$1
# echo "PYTHON_VERSION", $python_version
local wheel_str=""
if [ $python_version == "p27" ]; then
if [ "$python_version" == "p27" ]; then
wheel_str="cp27-cp27mu"
else
wheel_str="cp36-cp36m"
@@ -41,7 +61,7 @@ _find_wheel_str(){
# Actual test runtime is roughly 10 minutes.
test_impala(){
local PYTHON_VERSION=$1
local WHEEL_STR=$(_find_wheel_str $PYTHON_VERSION)
local WHEEL_STR=$(_find_wheel_str "$PYTHON_VERSION")

pushd "$ROOT_DIR"
local TEST_NAME="rllib_impala_$PYTHON_VERSION"
@@ -50,32 +70,34 @@

cat application_cluster_template.yaml |
sed -e "
s/<<<RAY_VERSION>>>/$RAY_VERSION/g;
s/<<<RAY_COMMIT>>>/$RAY_COMMIT/;
s/<<<CLUSTER_NAME>>>/$TEST_NAME/;
s/<<<HEAD_TYPE>>>/g3.16xlarge/;
s/<<<HEAD_TYPE>>>/p3.16xlarge/;
s/<<<WORKER_TYPE>>>/m5.24xlarge/;
s/<<<MIN_WORKERS>>>/5/;
s/<<<MAX_WORKERS>>>/5/;
s/<<<PYTHON_VERSION>>>/$PYTHON_VERSION/;
s/<<<WHEEL_STR>>>/$WHEEL_STR/;" > $CLUSTER
s/<<<WHEEL_STR>>>/$WHEEL_STR/;" > "$CLUSTER"

echo "Try running IMPALA stress test."
{
RLLIB_DIR=../../python/ray/rllib/
ray --logging-level=DEBUG up -y $CLUSTER &&
ray rsync_up $CLUSTER $RLLIB_DIR/tuned_examples/ tuned_examples/ &&
ray --logging-level=DEBUG up -y "$CLUSTER" &&
ray rsync_up "$CLUSTER" $RLLIB_DIR/tuned_examples/ tuned_examples/ &&
sleep 1 &&
ray --logging-level=DEBUG exec $CLUSTER "rllib || true" &&
ray --logging-level=DEBUG exec $CLUSTER "
ray --logging-level=DEBUG exec "$CLUSTER" "rllib || true" &&
ray --logging-level=DEBUG exec "$CLUSTER" "
rllib train -f tuned_examples/atari-impala-large.yaml --redis-address='localhost:6379' --queue-trials" &&
echo "PASS: IMPALA Test for" $PYTHON_VERSION >> $RESULT_FILE
} || echo "FAIL: IMPALA Test for" $PYTHON_VERSION >> $RESULT_FILE
echo "PASS: IMPALA Test for" "$PYTHON_VERSION" >> "$RESULT_FILE"
} || echo "FAIL: IMPALA Test for" "$PYTHON_VERSION" >> "$RESULT_FILE"

# Tear down cluster.
if [ "$DEBUG_MODE" = "" ]; then
ray down -y $CLUSTER
rm $CLUSTER
ray down -y "$CLUSTER"
rm "$CLUSTER"
else
echo "Not tearing down cluster" $CLUSTER
echo "Not tearing down cluster" "$CLUSTER"
fi
popd
}
@@ -93,32 +115,34 @@ test_sgd(){

cat application_cluster_template.yaml |
sed -e "
s/<<<RAY_VERSION>>>/$RAY_VERSION/g;
s/<<<RAY_COMMIT>>>/$RAY_COMMIT/;
s/<<<CLUSTER_NAME>>>/$TEST_NAME/;
s/<<<HEAD_TYPE>>>/g3.16xlarge/;
s/<<<WORKER_TYPE>>>/g3.16xlarge/;
s/<<<HEAD_TYPE>>>/p3.16xlarge/;
s/<<<WORKER_TYPE>>>/p3.16xlarge/;
s/<<<MIN_WORKERS>>>/3/;
s/<<<MAX_WORKERS>>>/3/;
s/<<<PYTHON_VERSION>>>/$PYTHON_VERSION/;
s/<<<WHEEL_STR>>>/$WHEEL_STR/;" > $CLUSTER
s/<<<WHEEL_STR>>>/$WHEEL_STR/;" > "$CLUSTER"

echo "Try running SGD stress test."
{
SGD_DIR=$ROOT_DIR/../../python/ray/experimental/sgd/
ray --logging-level=DEBUG up -y $CLUSTER &&
ray --logging-level=DEBUG up -y "$CLUSTER" &&
# TODO: fix submit so that args work
ray rsync_up $CLUSTER $SGD_DIR/mnist_example.py mnist_example.py &&
ray rsync_up "$CLUSTER" "$SGD_DIR/mnist_example.py" mnist_example.py &&
sleep 1 &&
ray --logging-level=DEBUG exec $CLUSTER "
ray --logging-level=DEBUG exec "$CLUSTER" "
python mnist_example.py --redis-address=localhost:6379 --num-iters=2000 --num-workers=8 --devices-per-worker=2 --gpu" &&
echo "PASS: SGD Test for" $PYTHON_VERSION >> $RESULT_FILE
} || echo "FAIL: SGD Test for" $PYTHON_VERSION >> $RESULT_FILE
echo "PASS: SGD Test for" "$PYTHON_VERSION" >> "$RESULT_FILE"
} || echo "FAIL: SGD Test for" "$PYTHON_VERSION" >> "$RESULT_FILE"

# Tear down cluster.
if [ "$DEBUG_MODE" = "" ]; then
ray down -y $CLUSTER
rm $CLUSTER
ray down -y "$CLUSTER"
rm "$CLUSTER"
else
echo "Not tearing down cluster" $CLUSTER
echo "Not tearing down cluster" "$CLUSTER"
fi
popd
}
@@ -130,6 +154,6 @@ do
test_sgd $PYTHON_VERSION
done

cat $RESULT_FILE
cat $RESULT_FILE | grep FAIL > test.log
cat "$RESULT_FILE"
cat "$RESULT_FILE" | grep FAIL > test.log
[ ! -s test.log ] || exit 1
