Relay crash? #9920

Gerold103 · 2024-04-08T22:40:49Z

There is this patch from picodata: https://git.picodata.io/picodata/tarantool/-/merge_requests/160/diffs?commit_id=78da447d444adfd118c51bec88f6f269ca0d0a0f

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 7450fa79c..39ee4a7d5 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -623,11 +623,16 @@ tx_status_update(struct cmsg *msg)
 	}
 	trigger_run(&replicaset.on_ack, &ack);
 
-	static const struct cmsg_hop route[] = {
-		{relay_status_update, NULL}
-	};
-	cmsg_init(msg, route);
-	cpipe_push(&status->relay->relay_pipe, msg);
+	if (status->relay->tx.is_paired) {
+		static const struct cmsg_hop route[] = {
+			{relay_status_update, NULL}
+		};
+		cmsg_init(msg, route);
+		cpipe_push(&status->relay->relay_pipe, msg);
+	} else {
+		assert(msg->hop->pipe == NULL);
+		cmsg_init(msg, NULL);
+	}
 }
 
 /**

It is claimed to fix a crash, apparently. I assume it targets a potential theoretical bug that the relay during the tx_endpoint unpair might receive a status response from tx, which will trigger sending the message back, and it would be sent back again after relay is already dead.

I was only able to reproduce it with a code hack:

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 3f3a8acb5..828bc8670 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -937,11 +937,11 @@ relay_check_status_needs_update(struct relay *relay)
 	/* Collect xlog files received by the replica. */
 	relay_schedule_pending_gc(relay, send_vclock);
 
-	double tx_idle = ev_monotonic_now(loop()) - relay->tx_seen_time;
-	if (vclock_sum(&status_msg->vclock) ==
-	    vclock_sum(send_vclock) && tx_idle <= replication_timeout &&
-	    status_msg->vclock_sync == last_recv_ack->vclock_sync)
-		return;
+	// double tx_idle = ev_monotonic_now(loop()) - relay->tx_seen_time;
+	// if (vclock_sum(&status_msg->vclock) ==
+	//     vclock_sum(send_vclock) && tx_idle <= replication_timeout &&
+	//     status_msg->vclock_sync == last_recv_ack->vclock_sync)
+	// 	return;
 	static const struct cmsg_hop route[] = {
 		{tx_status_update, NULL}
 	};

And then this script:

--
-- Instance 1
--
-- Step 1
--
box.cfg{
    listen = 3313,
    replication_timeout = 0.1,
}
box.schema.user.grant('guest', 'super')

--
-- Step 3
--
box.error.injection.set("ERRINJ_RELAY_FROM_TX_DELAY", true)



--
-- Instance 2
--
-- Step 2
--
box.cfg{
    listen = 3314,
    replication = 3313,
    replication_timeout = 0.1,
}

It seems we should not push the message when the endpoint is being unpaired. But I can't see how it could crash in real life without my unrealistic change in the code. tx_idle will be always zero when the message just arrived and there was no yields yet. send_vclock check wouldn't pass either - by the time of unpairing the relay isn't sending rows anymore. And the vclock_sync doesn't seem to be related either (don't remember why).

Other similar message handlers do filter out the relay being unpaired, somewhy. Why doesn't this place do the same? How to catch this crash normally?

The text was updated successfully, but these errors were encountered:

Gerold103 added crash bug Something isn't working replication labels Apr 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Relay crash? #9920

Relay crash? #9920

Gerold103 commented Apr 8, 2024

Relay crash? #9920

Relay crash? #9920

Comments

Gerold103 commented Apr 8, 2024