Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relay crash? #9920

Open
Gerold103 opened this issue Apr 8, 2024 · 0 comments
Open

Relay crash? #9920

Gerold103 opened this issue Apr 8, 2024 · 0 comments
Labels
bug Something isn't working crash replication

Comments

@Gerold103
Copy link
Collaborator

There is this patch from picodata: https://git.picodata.io/picodata/tarantool/-/merge_requests/160/diffs?commit_id=78da447d444adfd118c51bec88f6f269ca0d0a0f

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 7450fa79c..39ee4a7d5 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -623,11 +623,16 @@ tx_status_update(struct cmsg *msg)
 	}
 	trigger_run(&replicaset.on_ack, &ack);
 
-	static const struct cmsg_hop route[] = {
-		{relay_status_update, NULL}
-	};
-	cmsg_init(msg, route);
-	cpipe_push(&status->relay->relay_pipe, msg);
+	if (status->relay->tx.is_paired) {
+		static const struct cmsg_hop route[] = {
+			{relay_status_update, NULL}
+		};
+		cmsg_init(msg, route);
+		cpipe_push(&status->relay->relay_pipe, msg);
+	} else {
+		assert(msg->hop->pipe == NULL);
+		cmsg_init(msg, NULL);
+	}
 }
 
 /**

It is claimed to fix a crash, apparently. I assume it targets a potential theoretical bug that the relay during the tx_endpoint unpair might receive a status response from tx, which will trigger sending the message back, and it would be sent back again after relay is already dead.

I was only able to reproduce it with a code hack:

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 3f3a8acb5..828bc8670 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -937,11 +937,11 @@ relay_check_status_needs_update(struct relay *relay)
 	/* Collect xlog files received by the replica. */
 	relay_schedule_pending_gc(relay, send_vclock);
 
-	double tx_idle = ev_monotonic_now(loop()) - relay->tx_seen_time;
-	if (vclock_sum(&status_msg->vclock) ==
-	    vclock_sum(send_vclock) && tx_idle <= replication_timeout &&
-	    status_msg->vclock_sync == last_recv_ack->vclock_sync)
-		return;
+	// double tx_idle = ev_monotonic_now(loop()) - relay->tx_seen_time;
+	// if (vclock_sum(&status_msg->vclock) ==
+	//     vclock_sum(send_vclock) && tx_idle <= replication_timeout &&
+	//     status_msg->vclock_sync == last_recv_ack->vclock_sync)
+	// 	return;
 	static const struct cmsg_hop route[] = {
 		{tx_status_update, NULL}
 	};

And then this script:

--
-- Instance 1
--
-- Step 1
--
box.cfg{
    listen = 3313,
    replication_timeout = 0.1,
}
box.schema.user.grant('guest', 'super')

--
-- Step 3
--
box.error.injection.set("ERRINJ_RELAY_FROM_TX_DELAY", true)



--
-- Instance 2
--
-- Step 2
--
box.cfg{
    listen = 3314,
    replication = 3313,
    replication_timeout = 0.1,
}

It seems we should not push the message when the endpoint is being unpaired. But I can't see how it could crash in real life without my unrealistic change in the code. tx_idle will be always zero when the message just arrived and there was no yields yet. send_vclock check wouldn't pass either - by the time of unpairing the relay isn't sending rows anymore. And the vclock_sync doesn't seem to be related either (don't remember why).

Other similar message handlers do filter out the relay being unpaired, somewhy. Why doesn't this place do the same? How to catch this crash normally?

@Gerold103 Gerold103 added crash bug Something isn't working replication labels Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working crash replication
Projects
None yet
Development

No branches or pull requests

1 participant