
repair: high latency of RBNO bootstrap #15505

Closed
Deexie opened this issue Sep 21, 2023 · 11 comments


Deexie commented Sep 21, 2023

Sometimes when a new node is bootstrapped with RBNO enabled,
the state change indicating bootstrap finish is delayed. Thus, tests
wait idly until a server is added, even though the logs show that
the bootstrap's already finished.

Deexie self-assigned this Sep 21, 2023
Deexie added a commit to Deexie/scylla that referenced this issue Sep 21, 2023
Before integration with the task manager, the state of a single shard
repair was kept in repair_info. The repair_info object was destroyed
immediately after the shard repair finished.

In the integration process, repair_info's fields were moved to
shard_repair_task_impl, as the two served similar purposes. However,
shard_repair_task_impl isn't destroyed immediately; it is kept in the
task manager for task_ttl seconds after it completes. Thus, some of
repair_info's fields have their lifetime prolonged, which delays the
repair state change.

Release shard_repair_task_impl's resources immediately after the shard
repair is finished.

Fixes: scylladb#15505.
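
To make the lifetime problem concrete, here is a minimal C++ sketch of the pattern the commit message describes. All names (`repair_state`, `shard_repair_task`, `done()`) are hypothetical stand-ins, not the actual Scylla classes:

```cpp
#include <memory>

// Hypothetical heavyweight per-shard repair state; in Scylla this is
// where fields moved from repair_info would live, e.g. an
// effective_replication_map_ptr that pins token metadata while it exists.
struct repair_state { /* ... */ };

// Hypothetical stand-in for shard_repair_task_impl: the task manager
// keeps this object alive for task_ttl seconds after completion so its
// status can still be queried, so its destructor runs long after the work.
class shard_repair_task {
    std::unique_ptr<repair_state> _state = std::make_unique<repair_state>();
public:
    void run() {
        // ... perform the shard repair using *_state ...

        // The shape of the fix: drop the heavyweight state as soon as
        // the work is done, instead of waiting for ~shard_repair_task
        // after task_ttl expires.
        _state.reset();
    }
    bool done() const { return !_state; } // status stays queryable
};
```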
mykaul added this to the 5.4 milestone Sep 21, 2023
mykaul added the Backport candidate and backport/5.2 labels Sep 21, 2023
Deexie added a commit to Deexie/scylla that referenced this issue Sep 21, 2023
Deexie added a commit to Deexie/scylla that referenced this issue Sep 22, 2023
Deexie added a commit to Deexie/scylla that referenced this issue Sep 25, 2023
avikivity commented

@Deexie does this merit a backport, or does it only impact tests? In any case I think the code is new enough that 5.2 doesn't have the problem.


Deexie commented Oct 28, 2023

> @Deexie does this merit a backport, or does it only impact tests? In any case I think the code is new enough that 5.2 doesn't have the problem.

It impacts users as well. And 5.2 contains shard_repair_task_impl, so I think it's worth backporting.


denesb commented Oct 30, 2023

>> @Deexie does this merit a backport, or does it only impact tests? In any case I think the code is new enough that 5.2 doesn't have the problem.
>
> It impacts users as well. And 5.2 contains shard_repair_task_impl, so I think it's worth backporting.

There are many conflicts, @Deexie please prepare a backport PR.


Deexie commented Oct 30, 2023

>>> @Deexie does this merit a backport, or does it only impact tests? In any case I think the code is new enough that 5.2 doesn't have the problem.
>>
>> It impacts users as well. And 5.2 contains shard_repair_task_impl, so I think it's worth backporting.
>
> There are many conflicts, @Deexie please prepare a backport PR.

The change for this issue modifies what was added in #15005, and the fix for the related issue #14966 is not backported. Of course, I can skip that part (and in this case that would be really easy). But the changes are closely related, and #14966 seems to be much more harmful than this one. @denesb, what do you think about opening a joint backport PR to branch-5.2?


denesb commented Oct 30, 2023

> The change for this issue modifies what was added in #15005, and the fix for the related issue #14966 is not backported. Of course, I can skip that part (and in this case that would be really easy). But the changes are closely related, and #14966 seems to be much more harmful than this one. @denesb, what do you think about opening a joint backport PR to branch-5.2?

Yes please.

Deexie added a commit to Deexie/scylla that referenced this issue Oct 30, 2023
(cherry picked from commit 0474e15)

Deexie commented Oct 30, 2023

Oh, never mind, I checked for compaction tasks in the middle of the cherry-pick... 5.2 does not contain #14966, sorry. I opened #15875 with this change only.


asias commented Nov 6, 2023

> Sometimes when a new node is bootstrapped with RBNO enabled, the state change indicating bootstrap finish is delayed. Thus, tests wait idly until a server is added, even though the logs show that the bootstrap's already finished.

It is not clear to me what the problem is, based on the description above. What "state change" are you referring to? I first thought it was about CQL read/write latency.


Deexie commented Nov 6, 2023

effective_replication_map_ptr is kept in memory longer than it used to be. From a comment on the erm class:

> Holding to this object keeps the associated token_metadata_ptr alive, keeping the token metadata version alive and seen as in use.
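
A minimal sketch of the mechanism that comment implies, using a hypothetical `version_tracker` type (not the real erm class): as long as any shared pointer to the object is alive, the version it pins stays registered as in use, so a coordinator waiting for old versions to be released cannot make progress:

```cpp
#include <iostream>
#include <memory>
#include <set>

// Hypothetical: pins a token-metadata version for the object's lifetime.
struct version_tracker {
    std::set<long>& live_versions;
    long version;
    version_tracker(std::set<long>& live, long v)
        : live_versions(live), version(v) {
        live_versions.insert(version); // version becomes "seen as in use"
    }
    version_tracker(const version_tracker&) = delete;
    ~version_tracker() { live_versions.erase(version); } // released
};

int main() {
    std::set<long> live; // versions still "seen as in use"

    auto erm = std::make_shared<version_tracker>(live, 42);
    auto lingering_holder = erm; // e.g., a task kept alive for task_ttl

    erm.reset();
    std::cout << "in use after first release: " << live.count(42) << "\n"; // 1

    lingering_holder.reset(); // the lingering task finally drops it
    std::cout << "in use after last release: " << live.count(42) << "\n"; // 0
}
```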

Deexie added a commit to Deexie/scylla that referenced this issue Nov 6, 2023
(cherry picked from commit 0474e15)

asias commented Nov 7, 2023

> effective_replication_map_ptr is kept in memory longer than it used to be. From a comment on the erm class:
>
>> Holding to this object keeps the associated token_metadata_ptr alive, keeping the token metadata version alive and seen as in use.

Yes, it helps to release the effective_replication_map_ptr earlier. But do you see any actual problem, e.g., repair taking longer to finish, or users facing higher CQL read/write latency?

You mentioned in the opening message:

> Thus, tests wait idly until a server is added, even though the logs show that the bootstrap's already finished.

Not releasing effective_replication_map_ptr should not have any impact on when the bootstrap is considered finished.

denesb pushed a commit that referenced this issue Nov 7, 2023
(cherry picked from commit 0474e15)

Closes #15875

Deexie commented Nov 7, 2023

>> effective_replication_map_ptr is kept in memory longer than it used to be. From a comment on the erm class:
>>
>>> Holding to this object keeps the associated token_metadata_ptr alive, keeping the token metadata version alive and seen as in use.
>
> Yes, it helps to release the effective_replication_map_ptr earlier. But do you see any actual problem, e.g., repair taking longer to finish, or users facing higher CQL read/write latency?
>
> You mentioned in the opening message:
>
>> Thus, tests wait idly until a server is added, even though the logs show that the bootstrap's already finished.
>
> Not releasing effective_replication_map_ptr should not have any impact on when the bootstrap is considered finished.

Yes, I saw the problem while writing a test in test/topology. It was a test related to the task manager, so I set task_ttl to a really big number; a 2-node cluster was created immediately, but creating a 3-node cluster took task_ttl.

I didn't really check the actual path; I focused on the symptoms instead. So I do not know exactly how the still-alive erm causes ManagerClient::server_add to hang, but I can find out if that's important.
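
A toy, self-contained simulation of just that symptom (assumed behavior, no actual Scylla or test-harness code): the work completes almost immediately, but an observer waiting for the pinned resources to be released blocks for the full task_ttl:

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    using namespace std::chrono;
    const auto task_ttl = seconds(3); // imagine this set "really big" in a test

    std::atomic<bool> resources_released{false};
    auto start = steady_clock::now();

    std::thread task([&] {
        // The repair work finishes almost immediately here...
        std::this_thread::sleep_for(task_ttl); // ...but the task object lingers
        resources_released = true;             // resources freed only at expiry
    });

    // Stand-in for the test's "wait for server added" loop.
    while (!resources_released) {
        std::this_thread::sleep_for(milliseconds(50));
    }
    task.join();

    std::cout << "observed completion after "
              << duration_cast<seconds>(steady_clock::now() - start).count()
              << "s\n"; // roughly task_ttl, not the actual work duration
}
```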


mykaul commented Dec 13, 2023

Backport to 5.2 is via #15875 - what else remains here?

mykaul removed the Backport candidate and backport/5.2 labels Dec 13, 2023