Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCS]Fix flaky testcase: ServiceBasedGcsClientTest #14248

Merged
merged 3 commits into from
Feb 24, 2021

Conversation

ffbin
Copy link
Contributor

@ffbin ffbin commented Feb 22, 2021

Why are these changes needed?

In the process of actor registration, if the callback function of WaitForActorOutOfScope is executed first, and then the callback function of ActorTable().Put is executed, the actor registration fails; otherwise, the actor registration succeeds. So we can't assert whether the actor is registered successfully.

Related issue number

#14245

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ffbin ffbin changed the title [GCS]Fix flaky testcase [GCS]Fix flaky testcase: ServiceBasedGcsClientTest.TestActorTableResubscribe Feb 22, 2021
@simon-mo
Copy link
Contributor

actually, the test still failed:


[2021-02-22 13:01:39,521 I 2081 2405] gcs_actor_manager.cc:114: Registered actor, job id = 01000000, actor id = 01a99e434e4b545de599f93f01000000
--
  | src/ray/gcs/gcs_client/test/service_based_gcs_client_test.cc:522: Failure
  | Value of: actor.state() == expected_state
  | Actual: false
  | Expected: true
  | src/ray/gcs/gcs_client/test/service_based_gcs_client_test.cc:522: Failure
  | Value of: actor.state() == expected_state
  | Actual: false
  | Expected: true

https://buildkite.com/ray-project/ray-builders-pr/builds/1671#cb77d52f-0752-4675-9f6e-a98b88b486e3/128-1019

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 23, 2021
@ffbin
Copy link
Contributor Author

ffbin commented Feb 23, 2021

actually, the test still failed:


[2021-02-22 13:01:39,521 I 2081 2405] gcs_actor_manager.cc:114: Registered actor, job id = 01000000, actor id = 01a99e434e4b545de599f93f01000000
--
  | src/ray/gcs/gcs_client/test/service_based_gcs_client_test.cc:522: Failure
  | Value of: actor.state() == expected_state
  | Actual: false
  | Expected: true
  | src/ray/gcs/gcs_client/test/service_based_gcs_client_test.cc:522: Failure
  | Value of: actor.state() == expected_state
  | Actual: false
  | Expected: true

https://buildkite.com/ray-project/ray-builders-pr/builds/1671#cb77d52f-0752-4675-9f6e-a98b88b486e3/128-1019

Thanks, i have fix it.

@@ -204,7 +204,7 @@ void ServiceBasedGcsClient::ReconnectGcsServer() {
if (last_reconnect_address_ == address &&
(current_sys_time_ms() - last_reconnect_timestamp_ms_) <
RayConfig::instance().minimum_gcs_reconnect_interval_milliseconds()) {
RAY_LOG(INFO)
RAY_LOG(DEBUG)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we keep this log level as INFO or even upgrade it to WARNING? It's useful when we try to figure out what happened while client reconnecting. (although it prints much logs)

Copy link
Contributor

@rkooo567 rkooo567 Feb 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid using WARNING logs unless the error message clearly indicates what to do for end users (so this sort of implementation details shouldn't be used for WARNING). All of the warning logs from GCS server and raylet will be streamed to drivers. (If we'd like to use WARNING more aggressively, we should change this behavior first).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. How about keeping it as INFO? @ffbin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It prints much logs, so DEBUG is better. If client reconnect, we have print log Reconnected to GCS server:.

Copy link
Contributor

@WangTaoTheTonic WangTaoTheTonic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ffbin ffbin removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 23, 2021
@simon-mo
Copy link
Contributor


[ RUN      ] ServiceBasedGcsClientTest.TestActorSubscribeAll
--
  | [2021-02-23 18:46:16,143 I 1825 1825] test_util.cc:61: Start redis command is: 'external/com_github_antirez_redis/redis-server' --loglevel warning --loadmodule './libray_redis_module.so' --port 4419
  | 1852:C 23 Feb 2021 18:46:16.146 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
  | 1852:C 23 Feb 2021 18:46:16.146 # Redis version=6.0.9, bits=64, commit=8b8d6b98, modified=1, pid=1852, just started
  | 1852:C 23 Feb 2021 18:46:16.146 # Configuration loaded
  | 1852:M 23 Feb 2021 18:46:16.147 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
  | 1852:M 23 Feb 2021 18:46:16.147 # Server initialized
  | 1852:M 23 Feb 2021 18:46:16.147 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
  | [2021-02-23 18:46:16,346 I 1825 1825] gcs_redis_failure_detector.cc:30: Starting redis failure detector.
  | [2021-02-23 18:46:16,346 I 1825 1825] gcs_init_data.cc:44: Loading job table data.
  | [2021-02-23 18:46:16,346 I 1825 1825] gcs_init_data.cc:56: Loading node table data.
  | [2021-02-23 18:46:16,346 I 1825 1825] gcs_init_data.cc:68: Loading object table data.
  | [2021-02-23 18:46:16,346 I 1825 1825] gcs_init_data.cc:81: Loading cluster resources table data.
  | [2021-02-23 18:46:16,346 I 1825 1825] gcs_init_data.cc:108: Loading actor table data.
  | [2021-02-23 18:46:16,346 I 1825 1825] gcs_init_data.cc:94: Loading placement group table data.
  | [2021-02-23 18:46:16,347 I 1825 1862] gcs_init_data.cc:48: Finished loading job table data, size = 0
  | [2021-02-23 18:46:16,347 I 1825 1862] gcs_init_data.cc:60: Finished loading node table data, size = 0
  | [2021-02-23 18:46:16,347 I 1825 1862] gcs_init_data.cc:73: Finished loading object table data, size = 0
  | [2021-02-23 18:46:16,347 I 1825 1862] gcs_init_data.cc:85: Finished loading cluster resources table data, size = 0
  | [2021-02-23 18:46:16,347 I 1825 1862] gcs_init_data.cc:112: Finished loading actor table data, size = 0
  | [2021-02-23 18:46:16,347 I 1825 1862] gcs_init_data.cc:99: Finished loading placement group table data, size = 0
  | [2021-02-23 18:46:16,356 I 1825 1862] grpc_server.cc:64: MockedGcsServer server started, listening on port 45431.
  | [2021-02-23 18:46:16,370 I 1825 1862] gcs_server.cc:267: Gcs server address = 172.18.0.3:45431
  | [2021-02-23 18:46:16,371 I 1825 1862] gcs_server.cc:271: Finished setting gcs server address: 172.18.0.3:45431
  | [2021-02-23 18:46:16,371 I 1825 1862] gcs_server.cc:357: GcsNodeManager: {RegisterNode request count: 0, UnregisterNode request count: 0, GetAllNodeInfo request count: 0, GetInternalConfig request count: 0}
  | GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
  | GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
  | GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
  | GcsPubSub:
  | - num channels subscribed to: 0
  | - total commands queued: 0
  | DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
  | [2021-02-23 18:46:16,379 I 1825 1862] gcs_actor_manager.cc:109: Registering actor, job id = 01000000, actor id = ec6b53a9fd78bfc9167c92d701000000
  | [2021-02-23 18:46:16,379 I 1825 1862] gcs_actor_manager.cc:114: Registered actor, job id = 01000000, actor id = ec6b53a9fd78bfc9167c92d701000000
  | [2021-02-23 18:46:16,380 I 1825 1862] gcs_actor_manager.cc:109: Registering actor, job id = 01000000, actor id = 67308ecfff7044de304b4c1401000000
  | [2021-02-23 18:46:16,380 I 1825 1862] gcs_actor_manager.cc:114: Registered actor, job id = 01000000, actor id = 67308ecfff7044de304b4c1401000000
  | [2021-02-23 18:46:16,380 I 1825 1862] gcs_actor_manager.cc:396: Worker NIL_ID failed, destroying actor child, job id = 01000000
  | [2021-02-23 18:46:16,380 I 1825 1862] gcs_actor_manager.cc:415: Destroying actor, actor id = 67308ecfff7044de304b4c1401000000, job id = 01000000
  | [2021-02-23 18:46:16,380 I 1825 1862] gcs_actor_manager.cc:396: Worker NIL_ID failed, destroying actor child, job id = 01000000
  | [2021-02-23 18:46:16,380 I 1825 1862] gcs_actor_manager.cc:415: Destroying actor, actor id = ec6b53a9fd78bfc9167c92d701000000, job id = 01000000
  | [2021-02-23 18:47:16,371 I 1825 1862] gcs_server.cc:357: GcsNodeManager: {RegisterNode request count: 0, UnregisterNode request count: 0, GetAllNodeInfo request count: 0, GetInternalConfig request count: 0}
  | GcsActorManager: {RegisterActor request count: 2, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, Registered actors count: 0, Destroyed actors count: 2, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
  | GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
  | GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
  | GcsPubSub:
  | - num channels subscribed to: 0
  | - total commands queued: 0
  | DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
  | src/ray/common/test_util.cc:193: Failure
  | Value of: WaitForCondition(condition, timeout_ms)
  | Actual: false
  | Expected: true
  | [2021-02-23 18:47:16,739 I 1825 1825] gcs_server.cc:127: Stopping GCS server.
  | [2021-02-23 18:47:16,743 I 1825 1825] gcs_server.cc:134: GCS server stopped.
  | [2021-02-23 18:47:16,743 I 1825 1825] test_util.cc:93: Cleaning up redis with command: 'external/com_github_antirez_redis/redis-cli' -p 4419 flushall
  | OK
  | [2021-02-23 18:47:16,846 I 1825 1825] test_util.cc:77: Stop redis command is: 'external/com_github_antirez_redis/redis-cli' -p 4419 shutdown
  | 1852:M 23 Feb 2021 18:47:16.849 # User requested shutdown...
  | 1852:M 23 Feb 2021 18:47:16.849 # Redis is now ready to exit, bye bye...
  | [  FAILED  ] ServiceBasedGcsClientTest.TestActorSubscribeAll (60807 ms)


https://buildkite.com/ray-project/ray-builders-pr/builds/1709#ff6775b6-5aa4-49d7-9416-faa4f4760199/124-310

@ffbin
Copy link
Contributor Author

ffbin commented Feb 24, 2021

r_manager.cc:109: Registering actor, job id = 01000000, actor id = ec6b53a9fd78bfc9167c92d701000000
  | [2021-02-23 18:46:16,379 I 1825 1862] gcs_actor_manager.cc:114: Registered a

This is another problem. I'm still locating the cause. I'll ping you after fix it, thanks.

@ffbin
Copy link
Contributor Author

ffbin commented Feb 24, 2021

@ffbin ffbin changed the title [GCS]Fix flaky testcase: ServiceBasedGcsClientTest.TestActorTableResubscribe [GCS]Fix flaky testcase: ServiceBasedGcsClientTest Feb 24, 2021
@simon-mo
Copy link
Contributor

Awesome! Feel free to merge whenever you think its ready.

@ffbin ffbin merged commit 482a002 into ray-project:master Feb 24, 2021
richardliaw added a commit to richardliaw/ray that referenced this pull request Feb 24, 2021
@WangTaoTheTonic WangTaoTheTonic deleted the fix_flaky_testcase branch February 25, 2021 01:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants