Inconsistent Behavior with Native Replication When a Data Node Fails and is Re-added #3502

Closed
TroyDey opened this issue Aug 19, 2021 · 3 comments


TroyDey commented Aug 19, 2021

Relevant system information:

  • OS: Alpine, timescale/timescaledb:2.4.0-pg13, Host: OS X Catalina
  • PostgreSQL version (output of postgres --version): 13.3
  • TimescaleDB version (output of \dx in psql): 2.4.0
  • Installation method: Docker

Describe the bug
Native replication has unexpected behavior when a data node fails.

  • When a data node goes down, both SELECTs and mutating commands fail.
  • After the data node is re-added, reattached, and its chunks are re-copied, SELECTs succeed when it goes down again, which is not consistent with the previous behavior.

To Reproduce
I have created a GitHub repo with a script that launches a Docker environment and runs the following high-level scenario.

  • Set up 1 access node and 3 data nodes
  • Create 1 distributed hypertable with a replication factor of 3 and push some data in
  • Check SELECT and chunk replication status
  • Bring down one of the data nodes
  • Test SELECT and INSERT
  • Detach and delete the down data node
  • Test SELECT and INSERT
  • Re-add and reattach the data node
  • Test SELECT and INSERT
  • Bring down the re-added data node
  • Test SELECT and INSERT

To run the test, clone timescale-multinode-test and run:

./run-test.sh

I have provided more details in the README and comments of run-test.sh
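For orientation, the core setup the script performs looks roughly like the following. This is a minimal sketch rather than a copy from the repo: the node names, hostnames, and the conditions table are illustrative.

```sql
-- On the access node: register the three data nodes (names/hosts are illustrative).
SELECT add_data_node('dn1', host => 'data-node-1');
SELECT add_data_node('dn2', host => 'data-node-2');
SELECT add_data_node('dn3', host => 'data-node-3');

-- Create a distributed hypertable replicated to all three data nodes.
CREATE TABLE conditions (time timestamptz NOT NULL, device int, temp float);
SELECT create_distributed_hypertable('conditions', 'time', replication_factor => 3);

-- Push some data in and check which data nodes hold each chunk.
INSERT INTO conditions
SELECT t, (random() * 10)::int, random() * 30
FROM generate_series(now() - interval '1 day', now(), interval '1 minute') AS t;

SELECT chunk_name, data_nodes
FROM timescaledb_information.chunks
WHERE hypertable_name = 'conditions';
```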

Expected behavior

  • If a data node goes down and its chunks are replicated elsewhere, SELECTs should still succeed.
  • If a data node is re-added, the behavior when it goes down again should match the initial behavior, i.e., both SELECTs and mutations fail (or, if the previous bullet is resolved, SELECTs succeed).

Actual behavior

  • If a data node goes down, SELECTs fail in addition to mutations.
  • If a data node is re-added and then goes down again, SELECTs succeed, which is not consistent with the initial behavior.
@mkindahl added the bug, multinode, and native-replication labels on Aug 24, 2021
@erimatnor self-assigned this on Oct 11, 2021
@erimatnor
Contributor

So, first, thank you for putting together a really helpful, reproducible test case and script using Docker Compose.

However, upon closer examination, I've concluded that there's nothing wrong with the current behavior. I submitted a PR to your test repo to illustrate: TroyDey/timescale-multinode-test#1

The reason the second SELECT succeeds while the node is down is that the node is no longer the "primary replica node" for any chunks (given that it was removed and then added again). The "primary replica node" is the node responsible for serving queries for a chunk. When you delete a data node, it is no longer the primary for any chunks, so primary status is reassigned to another node.
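To make that concrete, here is a rough sketch of the deletion path, assuming the down node is named 'dn3' and its chunks are fully replicated on the remaining nodes (names are illustrative, not taken from the test repo):

```sql
-- While dn3 is down, queries that touch chunks whose primary replica is on dn3 fail.
-- Force-deleting the node reassigns primary status for those chunks to surviving replicas:
SELECT delete_data_node('dn3', force => true);

-- Afterwards, SELECTs succeed again, and INSERTs succeed too, since dn3 no longer holds chunks.
```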

We do want to expose a way to reassign the designated primary replica node for a chunk, which would allow you to manually "fix" the first query.

Note, however, that for INSERTs we always want to fail if a node is down (and it holds chunks) because we need to insert into all replica chunks to keep them consistent.

I am therefore closing this issue. Please reopen if I missed something.

@TroyDey
Author

TroyDey commented Oct 11, 2021

Thank you very much for looking into this, and your explanation makes sense and was along the lines of what I was expecting.

To ensure I got it right, my understanding is:

  • If a data node goes down, queries that don't involve chunks whose primary replica is on the failed node will continue to succeed.
  • If a data node goes down, queries that involve chunks whose primary replica is on the failed node will fail, even though the data is replicated elsewhere, because the primary replica is the node that processes queries for those chunks.
  • If I delete the failed data node, another data node is assigned as primary and the chunks become accessible again. TimescaleDB does not automatically reassign primary status.

Based on the above, my two options appear to be:

  • Restore the original data node as quickly as possible
  • Manually intervene to delete the data node so that queries can continue while the failed node is restored (re-adding it later is sketched below)
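For reference, the re-add step from my script looks roughly like this ('dn3' and 'conditions' are placeholder names):

```sql
-- Once the failed node is back (or rebuilt), register it and attach it to the hypertable again.
SELECT add_data_node('dn3', host => 'data-node-3');
SELECT attach_data_node('dn3', hypertable => 'conditions');

-- New chunks will again be replicated to dn3; chunks created while it was gone still need
-- to be copied back to restore the full replication factor.
```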

Any other suggestions?

I know it is a very challenging problem to solve, but I'm curious whether you have plans to do automatic failover of the primary to another viable replica/data node?

Thanks again for taking the time to look into this and provide an explanation!

@erimatnor
Contributor

erimatnor commented Oct 12, 2021

> Thank you very much for looking into this, and your explanation makes sense and was along the lines of what I was expecting.
>
> To ensure I got it right, my understanding is:
>
> * If a data node goes down, queries that don't involve chunks whose primary replica is on the failed node will continue to succeed.

Correct.

> * If a data node goes down, queries that involve chunks whose primary replica is on the failed node will fail, even though the data is replicated elsewhere, because the primary replica is the node that processes queries for those chunks.

Correct.

> * If I delete the failed data node, another data node is assigned as primary and the chunks become accessible again. TimescaleDB does not automatically reassign primary status.

Correct.

We've had ideas about providing a function to reassign primary status to other nodes without having to fully delete a data node (e.g., during restarts or short downtimes), but I imagine the detection of failed nodes would still be left to an external monitoring system. Such a system is probably better at hooking into lifecycle events of nodes, e.g., in cloud providers or on-premise systems, than any DB-internal monitoring.

> Based on the above, my two options appear to be:
>
> * Restore the original data node as quickly as possible
>
> * Manually intervene to delete the data node so that queries can continue while the failed node is restored

Yes, those are the options so far. Note that deleting the data node also immediately restores write capability. In the future we'd like to support writes even when nodes are down, but that would entail either (1) syncing up chunks that were written to during the downtime, or (2) deleting the chunks written to during the downtime, followed by re-replication.

> Any other suggestions?

I think you covered most of it.

> I know it is a very challenging problem to solve, but I'm curious whether you have plans to do automatic failover of the primary to another viable replica/data node?

As stated above, we believe this is better handled by an external monitoring system. We do, however, aim to provide APIs that such a monitoring system can hook into for actions during failure events (e.g., to reassign primary status or delete a data node). It is not entirely unlikely that we will also provide an internal option for monitoring data nodes if we can figure out how to do it well, but this is not a priority for us right now, given that external monitoring systems already exist and do a good job (e.g., you get this for free in orchestration systems like Kubernetes).

> Thanks again for taking the time to look into this and provide an explanation!

My pleasure!
