Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set RAY_FORCE_DIRECT=1 for run_rllib_tests, test_basic #6171

Merged
merged 25 commits into from
Nov 25, 2019

Conversation

ericl
Copy link
Contributor

@ericl ericl commented Nov 15, 2019

Why are these changes needed?

Related issue number

Closes #6264

Checks

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18508/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18510/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18568/
Test FAILed.

@ericl
Copy link
Contributor Author

ericl commented Nov 18, 2019

jenkins retest this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18570/
Test FAILed.

@ericl
Copy link
Contributor Author

ericl commented Nov 18, 2019

jenkins retest this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18569/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18696/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18697/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18693/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18708/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18711/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18714/
Test FAILed.

@ericl ericl changed the title [WIP] Try enabling direct calls for RLlib Set RAY_FORCE_DIRECT=1 for run_rllib_tests, test_basic, and test_actor Nov 24, 2019
@ericl ericl changed the title Set RAY_FORCE_DIRECT=1 for run_rllib_tests, test_basic, and test_actor Set RAY_FORCE_DIRECT=1 for run_rllib_tests, test_basic Nov 24, 2019
name = "test_actor_direct",
size = "medium",
srcs = ["test_actor_direct.py", "test_actor.py"],
tags = ["exclusive", "manual"],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_actor_load_balancing is flaky, so setting it to MANUAL for now

srcs = ["test_basic.py"],
tags = ["exclusive"],
deps = ["//:ray_lib"],
)

py_test(
name = "test_advanced",
size = "large",
srcs = ["test_advanced.py"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have been removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, fixed.

@@ -42,7 +42,7 @@ void DoInlineObjectValue(const ObjectID &obj_id, std::shared_ptr<RayObject> valu
RAY_CHECK(found) << "obj id " << obj_id << " not found";
}

void LocalDependencyResolver::ResolveDependencies(const TaskSpecification &task,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no more const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this mutates the task, it is weird to call it const.

return raylet_client_->SubmitTask(task_spec);
// TODO(ekl) if we moved actor creation to use direct call tasks, then we won't
// need to manually resolve direct call args here.
resolver_->ResolveDependencies(task_spec, [this, task_spec]() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, was there a test that was failing with the old version of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it causes rllib ES to hang, since we pass an object id to an actor constructor there.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18845/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18847/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18849/
Test FAILed.

@ericl
Copy link
Contributor Author

ericl commented Nov 25, 2019

Interestingly, the RLlib test is failing with corrupted sample data returned.

@@ -873,7 +874,7 @@ Status CoreWorker::BuildArgsForExecutor(const TaskSpecification &task,
metadata = std::make_shared<LocalMemoryBuffer>(
const_cast<uint8_t *>(task.ArgMetadata(i)), task.ArgMetadataSize(i));
}
args->at(i) = std::make_shared<RayObject>(data, metadata);
args->at(i) = std::make_shared<RayObject>(data, metadata, /*copy_data*/ true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @edoakes @stephanie-wang this fixes #6264

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18862/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18863/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18866/
Test PASSed.

@ericl
Copy link
Contributor Author

ericl commented Nov 25, 2019

hm //python/ray/tests:test_multi_node seems to be consistently failing in bazel, but passes locally.

@ericl
Copy link
Contributor Author

ericl commented Nov 25, 2019

multi_node_test errors seem to be due to breakage on master

@ericl ericl merged commit 64a3a72 into ray-project:master Nov 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memory corruption for inline args if RAY_FORCE_DIRECT=1
3 participants