
storage_service: Replicate and advertise tokens early in the boot up process #4710

Merged
merged 1 commit into scylladb:next on Aug 1, 2019

Conversation

asias
Contributor

@asias asias commented Jul 15, 2019

When a node is restarted, there is a race between gossip starting (other
nodes will mark this node up again and send requests) and the tokens being
replicated to the other shards. Here is an example:

  • n1, n2
  • n2 is down, n1 thinks n2 is down
  • n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends
    reads/writes to n2, but n2 hasn't replicated the token_metadata to all
    the shards.
  • n2 complains:
    token_metadata - sorted_tokens is empty in first_token_index!
    token_metadata - sorted_tokens is empty in first_token_index!
    token_metadata - sorted_tokens is empty in first_token_index!
    token_metadata - sorted_tokens is empty in first_token_index!
    token_metadata - sorted_tokens is empty in first_token_index!
    token_metadata - sorted_tokens is empty in first_token_index!
    storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error
    (sorted_tokens is empty in first_token_index!)

The code path looks like this:

0 storage_service::init_server
1    prepare_to_join()
2          add gossip application state of NET_VERSION, SCHEMA and so on.
3         _gossiper.start_gossiping().get()
4    join_token_ring()
5           _token_metadata.update_normal_tokens(tokens, get_broadcast_address());
6           replicate_to_all_cores().get()
7           storage_service::set_gossip_tokens() which adds the gossip application state of TOKENS and STATUS

The race described above is between line 3 and line 6.

To fix this, we replicate the token_metadata early, right after it is filled
with the tokens read from the system table and before gossip starts. That
way, by the time other nodes think this restarting node is up, the tokens
are already replicated to all the shards.

In addition, this patch fixes the issue where other nodes might see a node
missing the TOKENS and STATUS application states in gossip if that node
failed in the middle of the restart process, i.e., it was killed after
line 3 and before line 7. As a result, we could not replace the node.
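
Roughly, the reordering looks like the sketch below. This is a simplified
pseudo-C++ illustration, not the actual patch: the helper get_saved_tokens()
is a hypothetical placeholder, and only the identifiers already named in the
code path above (update_normal_tokens, replicate_to_all_cores,
set_gossip_tokens, start_gossiping) come from the real code.

    // Sketch only (pseudo-C++), not the actual patch.
    // get_saved_tokens() is a hypothetical stand-in for reading this
    // node's own tokens back from the system table.
    void storage_service::prepare_to_join() {
        auto tokens = get_saved_tokens();
        if (!tokens.empty()) {
            // Fill token_metadata and replicate it to every shard *before*
            // gossip starts, so a remote node that marks us UP cannot hit a
            // shard whose sorted_tokens is still empty (#4709).
            _token_metadata.update_normal_tokens(tokens, get_broadcast_address());
            replicate_to_all_cores().get();
            // Advertise TOKENS and STATUS right away as well, so a node
            // killed after gossip starts but before join_token_ring()
            // finishes can still be replaced (#4723).
            set_gossip_tokens();
        }
        // add gossip application state of NET_VERSION, SCHEMA and so on
        _gossiper.start_gossiping().get();
    }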

Tests: update_cluster_layout_tests.py
Fixes: #4709
Fixes: #4723

@slivne slivne requested a review from tgrabiec July 15, 2019 14:41
@tgrabiec
Contributor

I don't understand this explanation. What's proxy_service? Why is a race between gossip and token replication causing problems? The replica side is not supposed to need token metadata. Maybe the problem is that there's a race between token metadata replication and CQL server init?

@asias
Contributor Author

asias commented Jul 16, 2019

I don't understand this explanation. What's proxy_service? Why is a race between gossip and token replication causing problems? The replica side is not supposed to need token metadata. Maybe the problem is that there's a race between token metadata replication and CQL server init?

Sorry, I meant storage_proxy.cc.

I will give you an example of the race:

n1, n2
n2 is down, n1 thinks n2 is down
n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends reads/writes to n2, but n2 hasn't replicated the token_metadata to all the shards.

n2 complains:

[shard 52] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 45] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 38] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 41] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 7] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 7] token_metadata - sorted_tokens is empty in first_token_index! 
[shard 47] storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error (sorted_tokens is empty in first_token_index!)

@gleb-cloudius
Contributor

gleb-cloudius commented Jul 16, 2019 via email

@asias asias force-pushed the replicate_token_metadata_early branch from 89c4525 to e645b03 on July 16, 2019 09:26
@asias asias changed the title storage_service: Replicate tokens to all shards early in the boot up process storage_service: Replicate and advertise tokens early in the boot up process Jul 16, 2019
@asias
Contributor Author

asias commented Jul 16, 2019

I pushed a new branch.
Changes:

@bhalevy
Member

bhalevy commented Jul 24, 2019

@tgrabiec please re-review

@tgrabiec
Contributor

tgrabiec commented Jul 25, 2019

The fix for #4723 looks good.

I am not convinced that #4709 is fixed, though. What prevents requests from other nodes from arriving at the node before you load and replicate the tokens? You do it earlier, but the server is already up, right? The other nodes may not have noticed that the node is down before they sent the requests.

@asias
Contributor Author

asias commented Jul 25, 2019

The fix for #4723 looks good.

I am not convinced that #4709 is fixed, though. What prevents requests from other nodes from arriving at the node before you load and replicate the tokens? You do it earlier, but the server is already up, right? The other nodes may not have noticed that the node is down before they sent the requests.

If other nodes do not notice this node is down, they can send requests at any time while this node restarts, and this node might not be fully initialized. It won't be a big problem, because the request will fail since the node is not fully initialized.

This patch completely eliminates the race if the other nodes have noticed that the node is down, which is very likely, because a restart sends a SHUTDOWN message.

Even if they do not notice, this patch reduces the big race window to a tiny window.

gossip starts
wait for gossip to settle (multiple seconds)
_token_metadata.update_normal_tokens(tokens, get_broadcast_address())
replicate_to_all_cores().get()

@tgrabiec
Contributor

The fix for #4723 looks good.
I am not convinced that #4709 is fixed, though. What prevents requests from other nodes from arriving at the node before you load and replicate the tokens? You do it earlier, but the server is already up, right? The other nodes may not have noticed that the node is down before they sent the requests.

If other nodes do not notice this node is down, they can send requests at any time while this node restarts, and this node might not be fully initialized. It won't be a big problem, because the request will fail since the node is not fully initialized.

Is it guaranteed to fail? Or is it undefined behavior?

Isn't the fact that the request fails the problem which this pull request is trying to fix?

This patch completely eliminates the race if the other nodes have noticed that the node is down, which is very likely, because a restart sends a SHUTDOWN message.

Even if they do not notice, this patch reduces the big race window to a tiny window.

gossip starts
wait for gossip to settle (multiple seconds)
_token_metadata.update_normal_tokens(tokens, get_broadcast_address())
replicate_to_all_cores().get()

Well, I think we should eliminate the race completely rather than add complexity to reduce the likelihood of failure.

@asias
Contributor Author

asias commented Jul 26, 2019

The fix for #4723 looks good.
I am not convinced that #4709 is fixed, though. What prevents requests from other nodes from arriving at the node before you load and replicate the tokens? You do it earlier, but the server is already up, right? The other nodes may not have noticed that the node is down before they sent the requests.

If other nodes do not notice this node is down, they can send requests at any time while this node restarts, and this node might not be fully initialized. It won't be a big problem, because the request will fail since the node is not fully initialized.

Is it guaranteed to fail? Or is it undefined behavior?

If the request queries the token_metadata, it is guaranteed to fail, because we throw if it is not initialized.
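
For reference, that check behaves roughly like the sketch below (pseudo-C++; the signature and the sorted_tokens() accessor are approximations, only the error string and the std::runtime_error type come from the log output quoted earlier):

    // Rough sketch of the guard that produces the log lines above.
    // The exact signature is an approximation, not the real declaration.
    size_t token_metadata::first_token_index(const token& start) const {
        if (sorted_tokens().empty()) {
            // Hit on n2 while token_metadata has not yet been replicated
            // to the shard handling the request.
            throw std::runtime_error("sorted_tokens is empty in first_token_index!");
        }
        // ... normal lookup of the first token >= start ...
    }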

Isn't the fact that the request fails the problem which this pull request is trying to fix?

It tries to fix the node restart case.

This patch completely eliminates the race if the other nodes have noticed that the node is down, which is very likely, because a restart sends a SHUTDOWN message.
Even if they do not notice, this patch reduces the big race window to a tiny window.
gossip starts
wait for gossip to settle (multiple seconds)
_token_metadata.update_normal_tokens(tokens, get_broadcast_address())
replicate_to_all_cores().get()

Well, I think we should eliminate the race completely rather than add complexity to reduce the likelihood of failure.

This patch tries to fix the node restart case, and it fixes that case 100%.

For the case where a node does not notice that the node is down, it also helps, but not 100%. We can open a new issue to track it.

@tgrabiec
Contributor

tgrabiec commented Jul 29, 2019 via email

@asias
Contributor Author

asias commented Jul 29, 2019 via email

@asias
Contributor Author

asias commented Jul 29, 2019

I created #4771 to track the storage proxy initialization issue.

@asias
Contributor Author

asias commented Jul 31, 2019

@tgrabiec @slivne ping

@tgrabiec tgrabiec changed the base branch from master to next August 1, 2019 12:40
@tgrabiec tgrabiec merged commit 60e4c0d into scylladb:next Aug 1, 2019