Add failure tests to test_reference_counting #7400

Merged
merged 10 commits into from
Mar 17, 2020

@edoakes (Contributor) commented Mar 2, 2020

Why are these changes needed?

Need to test cases where workers fail. This is not comprehensive, but a good start.

Checks

@AmplabJenkins

Can one of the admins verify this patch?

@ray.remote
@pytest.mark.parametrize("failure", [False, True])
def test_basic_serialized_reference(one_worker_100MiB, failure):
@ray.remote(max_retries=0)
Contributor

Actually it would be great to also have some tests where max_retries > 0 so we can make sure that ref counting works when there are retries.

Contributor Author

Oh yeah that's a really good point - I just did this to make it run more quickly. Do you think it's worth parametrizing it and testing both with and without retries?

Contributor

Hmm it seems like testing with retries should cover all the cases without retries too, but testing both also seems fine.

Contributor

Is there a way we can update the config to make the retries faster?

Contributor Author

Let's just test the retry case with a short timeout. Will update.
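
To make the retry discussion above concrete, here is a minimal sketch of the test matrix being proposed: every combination of "worker fails or not" and "retries allowed or not", with the same reference-count check applied to each. It deliberately does not use Ray. `ToyRefCounter`, `WorkerDied`, `run_task`, and `check_case` are hypothetical stand-ins invented for illustration, not part of Ray or of this PR. In a real pytest suite the loop at the bottom would more likely be two stacked `@pytest.mark.parametrize` decorators.

```python
import itertools


class WorkerDied(Exception):
    """Hypothetical stand-in for a worker process dying mid-task."""


class ToyRefCounter:
    """Hypothetical reference counter: object id -> number of live references."""

    def __init__(self):
        self.counts = {}

    def add(self, obj_id):
        self.counts[obj_id] = self.counts.get(obj_id, 0) + 1

    def remove(self, obj_id):
        self.counts[obj_id] -= 1
        if self.counts[obj_id] == 0:
            del self.counts[obj_id]


def run_task(refs, obj_id, fail_first_n, max_retries):
    """Run a toy task that borrows obj_id; the first fail_first_n attempts die."""
    failed_attempts = 0
    while True:
        refs.add(obj_id)  # the task attempt takes a reference to its argument
        try:
            if failed_attempts < fail_first_n:
                raise WorkerDied()
            return "ok"
        except WorkerDied:
            if failed_attempts >= max_retries:
                return "failed"  # out of retries
            failed_attempts += 1  # otherwise loop around and retry
        finally:
            refs.remove(obj_id)  # the reference is released on every exit path


def check_case(failure, max_retries):
    """One cell of the matrix: run the task, then verify nothing leaked."""
    refs = ToyRefCounter()
    refs.add("obj")  # the driver holds one reference
    outcome = run_task(refs, "obj", fail_first_n=1 if failure else 0,
                       max_retries=max_retries)
    assert refs.counts == {"obj": 1}, "task attempts leaked a reference"
    refs.remove("obj")  # the driver drops its reference
    assert refs.counts == {}, "object was not released"
    return outcome


# Cover failure x retries, as suggested in the review thread.
results = {
    (failure, max_retries): check_case(failure, max_retries)
    for failure, max_retries in itertools.product([False, True], [0, 1])
}
```

The point of running all four cells is that the retry path releases and re-acquires the reference an extra time, so a leak that only shows up on retry would be missed by a `max_retries=0` test alone.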

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22623/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22651/

@edoakes (Contributor Author) commented Mar 4, 2020

@stephanie-wang I left TODOs for the two bugs the tests uncovered this morning. I think we should merge this and address those separately.

@edoakes (Contributor Author) commented Mar 6, 2020

FYI: the diff grew because I needed to split the tests to avoid timeouts in Bazel.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22793/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22818/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22999/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23056/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23250/

@edoakes edoakes merged commit c1b0f9c into ray-project:master Mar 17, 2020
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 17, 2020
stephanie-wang added a commit that referenced this pull request Mar 19, 2020
* enable

* Turn on eager eviction

* Shorten tests and drain ReferenceCounter

* Don't force kill actor handles that have gone out of scope, lint

* Fix locks

* Cleanup Plasma Async Callback (#7452)

* [rllib][tune] fix some nans (#7611)

* Change /tmp to platform-specific temporary directory (#7529)

* [Serve] UI Improvements (#7569)

* bugfix about test_dynres.py (#7615)

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>

* Java call Python actor method use actor.call (#7614)

* bug fix about useage of absl::flat_hash_map::erase and absl::flat_hash_set::erase (#7633)

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>

* [Java] Make both `RayActor` and `RayPyActor` inheriting from `BaseActor` (#7462)

* [Java] Fix the issue that the cached value in `RayObject` is serialized (#7613)

* Add failure tests to test_reference_counting (#7400)

* Fix typo in asyncio documentation (#7602)

* Fix segfault

* debug

* Force kill actor

* Fix test
3 participants