
storage_service: Replicate and advertise tokens early in the boot up process #4710

Merged
merged 1 commit into scylladb:next on Aug 1, 2019

Conversation

asias
Contributor

@asias asias commented Jul 15, 2019

When a node is restarted, there is a race between gossip starting (other
nodes will mark this node up again and send requests) and the tokens being
replicated to the other shards. Here is an example:

  • n1, n2
  • n2 is down, n1 thinks n2 is down
  • n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends
    reads/writes to n2, but n2 hasn't replicated the token_metadata to all
    the shards.
  • n2 complains:
    token_metadata - sorted_tokens is empty in first_token_index!
    token_metadata - sorted_tokens is empty in first_token_index!
    token_metadata - sorted_tokens is empty in first_token_index!
    token_metadata - sorted_tokens is empty in first_token_index!
    token_metadata - sorted_tokens is empty in first_token_index!
    token_metadata - sorted_tokens is empty in first_token_index!
    storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error
    (sorted_tokens is empty in first_token_index!)

The code path looks like this:

0 storage_service::init_server
1    prepare_to_join()
2          add gossip application state of NET_VERSION, SCHEMA and so on.
3         _gossiper.start_gossiping().get()
4    join_token_ring()
5           _token_metadata.update_normal_tokens(tokens, get_broadcast_address());
6           replicate_to_all_cores().get()
7           storage_service::set_gossip_tokens() which adds the gossip application state of TOKENS and STATUS

The race described above is between line 3 and line 6.

To fix this, we replicate the token_metadata early, right after it is filled
with the tokens read from the system table and before gossip starts. That
way, by the time other nodes think this restarting node is up, the tokens
are already replicated to all the shards.

In addition, this patch fixes the issue where other nodes might see a node
missing the TOKENS and STATUS application states in gossip if that node
failed in the middle of the restart process, i.e., it was killed after
line 3 and before line 7. As a result, we could not replace the node.
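
Roughly, the reordering looks like the sketch below. This is a simplified
pseudo-C++ illustration, not the actual patch: the helper get_saved_tokens()
is a hypothetical placeholder, and only the identifiers already named in the
code path above (update_normal_tokens, replicate_to_all_cores,
set_gossip_tokens, start_gossiping) come from the real code.

    // Sketch only (pseudo-C++), not the actual patch.
    // get_saved_tokens() is a hypothetical stand-in for reading this
    // node's own tokens back from the system table.
    void storage_service::prepare_to_join() {
        auto tokens = get_saved_tokens();
        if (!tokens.empty()) {
            // Fill token_metadata and replicate it to every shard *before*
            // gossip starts, so a remote node that marks us UP cannot hit a
            // shard whose sorted_tokens is still empty (#4709).
            _token_metadata.update_normal_tokens(tokens, get_broadcast_address());
            replicate_to_all_cores().get();
            // Advertise TOKENS and STATUS right away as well, so a node
            // killed after gossip starts but before join_token_ring()
            // finishes can still be replaced (#4723).
            set_gossip_tokens();
        }
        // add gossip application state of NET_VERSION, SCHEMA and so on
        _gossiper.start_gossiping().get();
    }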

Tests: update_cluster_layout_tests.py
Fixes: #4709
Fixes: #4723

@slivne slivne requested a review from tgrabiec July 15, 2019 14:41
@tgrabiec
Contributor

I don't understand this explanation. What's proxy_service? Why is a race between gossip and token replication causing problems? The replica side is not supposed to need token metadata. Maybe the problem is that there's a race between token metadata replication and CQL server init?

@asias
Contributor Author

asias commented Jul 16, 2019

I don't understand this explanation. What's proxy_service? Why is a race between gossip and token replication causing problems? The replica side is not supposed to need token metadata. Maybe the problem is that there's a race between token metadata replication and CQL server init?

Sorry, I meant storage_proxy.cc.

I will give you an example of the race:

n1, n2
n2 is down, n1 thinks n2 is down
n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends reads/writes to n2, but n2 hasn't replicated the token_metadata to all the shards.

n2 complains:

[shard 52] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 45] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 38] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 41] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 7] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 7] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 47] storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error (sorted_tokens is empty in first_token_index!)

@gleb-cloudius
Contributor

gleb-cloudius commented Jul 16, 2019 via email

@asias asias force-pushed the replicate_token_metadata_early branch from 89c4525 to e645b03 on July 16, 2019 09:26
@asias asias changed the title storage_service: Replicate tokens to all shards early in the boot up process storage_service: Replicate and advertise tokens early in the boot up process Jul 16, 2019
@asias
Contributor Author

asias commented Jul 16, 2019

I pushed a new branch.
Changes:

@bhalevy
Member

bhalevy commented Jul 24, 2019

@tgrabiec please re-review

@tgrabiec
Contributor

tgrabiec commented Jul 25, 2019

The fix for #4723 looks good.

I am not convinced that #4709 is fixed, though. What prevents requests from other nodes from arriving at the node before you load and replicate the tokens? You do it earlier, but the server is already up, right? The other nodes may not have noticed that the node is down before they sent the requests.

@asias
Contributor Author

asias commented Jul 25, 2019

The fix for #4723 looks good.

I am not convinced that #4709 is fixed, though. What prevents requests from other nodes from arriving at the node before you load and replicate the tokens? You do it earlier, but the server is already up, right? The other nodes may not have noticed that the node is down before they sent the requests.

If other nodes do not notice this node is down, they can send requests at any time while this node restarts, and this node might not be fully initialized. It won't be a big problem, because the request will fail since the node is not fully initialized.

This patch completely eliminates the race if the other nodes have noticed that the node is down, which is very likely, because a restart sends a SHUTDOWN message.

Even if they do not notice, this patch reduces the big race window to a tiny window.

gossip starts
wait for gossip to settle (multiple seconds)
_token_metadata.update_normal_tokens(tokens, get_broadcast_address())
replicate_to_all_cores().get()

@tgrabiec
Contributor

The fix for #4723 looks good.
I am not convinced that #4709 is fixed, though. What prevents requests from other nodes from arriving at the node before you load and replicate the tokens? You do it earlier, but the server is already up, right? The other nodes may not have noticed that the node is down before they sent the requests.

If other nodes do not notice this node is down, they can send requests at any time while this node restarts, and this node might not be fully initialized. It won't be a big problem, because the request will fail since the node is not fully initialized.

Is it guaranteed to fail? Or is it undefined behavior?

Isn't the fact that the request fails the problem which this pull request is trying to fix?

This patch completely eliminates the race if the other nodes have noticed that the node is down, which is very likely, because a restart sends a SHUTDOWN message.

Even if they do not notice, this patch reduces the big race window to a tiny window.

gossip starts
wait for gossip to settle (multiple seconds)
_token_metadata.update_normal_tokens(tokens, get_broadcast_address())
replicate_to_all_cores().get()

Well, I think we should eliminate the race completely rather than add complexity to reduce the likelihood of failure.

@asias
Contributor Author

asias commented Jul 26, 2019

The fix for #4723 looks good.
I am not convinced that #4709 is fixed, though. What prevents requests from other nodes from arriving at the node before you load and replicate the tokens? You do it earlier, but the server is already up, right? The other nodes may not have noticed that the node is down before they sent the requests.

If other nodes do not notice this node is down, they can send requests at any time while this node restarts, and this node might not be fully initialized. It won't be a big problem, because the request will fail since the node is not fully initialized.

Is it guaranteed to fail? Or is it undefined behavior?

If the request queries the token_metadata, it is guaranteed to fail, because we throw if it is not initialized.
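
For reference, that check behaves roughly like the sketch below (pseudo-C++; the signature and the sorted_tokens() accessor are approximations, only the error string and the std::runtime_error type come from the log output quoted earlier):

    // Rough sketch of the guard that produces the log lines above.
    // The exact signature is an approximation, not the real declaration.
    size_t token_metadata::first_token_index(const token& start) const {
        if (sorted_tokens().empty()) {
            // Hit on n2 while token_metadata has not yet been replicated
            // to the shard handling the request.
            throw std::runtime_error("sorted_tokens is empty in first_token_index!");
        }
        // ... normal lookup of the first token >= start ...
    }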

Isn't the fact that the request fails the problem which this pull request is trying to fix?

It tries to fix the node restart case.

This patch completely eliminates the race if the other nodes have noticed that the node is down, which is very likely, because a restart sends a SHUTDOWN message.
Even if they do not notice, this patch reduces the big race window to a tiny window.
gossip starts
wait for gossip to settle (multiple seconds)
_token_metadata.update_normal_tokens(tokens, get_broadcast_address())
replicate_to_all_cores().get()

Well, I think we should eliminate the race completely rather than add complexity to reduce the likelihood of failure.

This patch tries to fix the node restart case, and it fixes that case 100%.

For the case where a node does not notice that the node is down, it also helps, but not 100%. We can open a new issue to track it.

@tgrabiec
Contributor

tgrabiec commented Jul 29, 2019 via email

@asias
Contributor Author

asias commented Jul 29, 2019 via email

@asias
Contributor Author

asias commented Jul 29, 2019

I created #4771 to track the storage proxy initialization issue.

@asias
Contributor Author

asias commented Jul 31, 2019

@tgrabiec @slivne ping

@tgrabiec tgrabiec changed the base branch from master to next August 1, 2019 12:40
@tgrabiec tgrabiec merged commit 60e4c0d into scylladb:next Aug 1, 2019