Inconsistent Behavior with Native Replication When a Data Node Fails and is Re-added #3502

Closed
TroyDey opened this issue Aug 19, 2021 · 3 comments


TroyDey commented Aug 19, 2021

Relevant system information:

  • OS: Alpine, timescale/timescaledb:2.4.0-pg13, Host: OS X Catalina
  • PostgreSQL version (output of postgres --version): 13.3
  • TimescaleDB version (output of \dx in psql): 2.4.0
  • Installation method: Docker

Describe the bug
Native replication has unexpected behavior when a data node fails.

  • When a data node goes down, both SELECTs and mutating commands fail.
  • After the data node is re-added, reattached, and its chunks are re-copied, SELECTs succeed when it goes down again, which is not consistent with the previous behavior.

To Reproduce
I have created a GitHub repo with a script that launches a Docker environment and runs the following high-level scenario.

  • Set up 1 access node and 3 data nodes
  • Create 1 distributed hypertable with a replication factor of 3 and push some data in
  • Check SELECT and chunk replication status
  • Bring down one of the data nodes
  • Test SELECT and INSERT
  • Detach and delete the down data node
  • Test SELECT and INSERT
  • Re-add and reattach the data node
  • Test SELECT and INSERT
  • Bring down the re-added data node
  • Test SELECT and INSERT

To run the test, clone timescale-multinode-test and run:

./run-test.sh

I have provided more details in the README and comments of run-test.sh
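For orientation, the core setup the script performs looks roughly like the following. This is a minimal sketch rather than a copy from the repo: the node names, hostnames, and the conditions table are illustrative.

```sql
-- On the access node: register the three data nodes (names/hosts are illustrative).
SELECT add_data_node('dn1', host => 'data-node-1');
SELECT add_data_node('dn2', host => 'data-node-2');
SELECT add_data_node('dn3', host => 'data-node-3');

-- Create a distributed hypertable replicated to all three data nodes.
CREATE TABLE conditions (time timestamptz NOT NULL, device int, temp float);
SELECT create_distributed_hypertable('conditions', 'time', replication_factor => 3);

-- Push some data in and check which data nodes hold each chunk.
INSERT INTO conditions
SELECT t, (random() * 10)::int, random() * 30
FROM generate_series(now() - interval '1 day', now(), interval '1 minute') AS t;

SELECT chunk_name, data_nodes
FROM timescaledb_information.chunks
WHERE hypertable_name = 'conditions';
```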

Expected behavior

  • If a data node goes down and its chunks are replicated elsewhere, SELECTs should still succeed.
  • If a data node is re-added, the behavior when it goes down again should match the initial behavior, i.e., both SELECTs and mutations fail (or, if the previous bullet is resolved, SELECTs succeed).

Actual behavior

  • If a data node goes down, SELECTs fail in addition to mutations.
  • If a data node is re-added and then goes down again, SELECTs succeed, which is not consistent with the initial behavior.
@mkindahl added the bug, multinode, and native-replication labels on Aug 24, 2021
@erimatnor self-assigned this on Oct 11, 2021
@erimatnor
Contributor

So, first, thank you for putting together a really helpful, reproducible test case and script using Docker Compose.

However, upon closer examination, I've concluded that there's nothing wrong with the current behavior. I submitted a PR to your test repo to illustrate: TroyDey/timescale-multinode-test#1

The reason the second SELECT succeeds while the node is down is that the node is no longer the "primary replica node" for any chunks (given that it was removed and then added again). The "primary replica node" is the node responsible for serving queries for a chunk. When you delete a data node, it is no longer the primary for any chunks, so primary status is reassigned to another node.
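To make that concrete, here is a rough sketch of the deletion path, assuming the down node is named 'dn3' and its chunks are fully replicated on the remaining nodes (names are illustrative, not taken from the test repo):

```sql
-- While dn3 is down, queries that touch chunks whose primary replica is on dn3 fail.
-- Force-deleting the node reassigns primary status for those chunks to surviving replicas:
SELECT delete_data_node('dn3', force => true);

-- Afterwards, SELECTs succeed again, and INSERTs succeed too, since dn3 no longer holds chunks.
```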

We do want to expose a way to reassign the designated primary replica node for a chunk, which would allow you to manually "fix" the first query.

Note, however, that for INSERTs we always want to fail if a node is down (and it holds chunks) because we need to insert into all replica chunks to keep them consistent.

I am therefore closing this issue. Please reopen if I missed something.

@TroyDey
Author

TroyDey commented Oct 11, 2021

Thank you very much for looking into this, and your explanation makes sense and was along the lines of what I was expecting.

To ensure I got it right, my understanding is:

  • If a data node goes down, queries that don't involve chunks whose primary replica is on the failed node will continue to succeed.
  • If a data node goes down, queries that involve chunks whose primary replica is on the failed node will fail, even though the data is replicated elsewhere, because the primary replica is the node that processes queries for those chunks.
  • If I delete the failed data node, another data node is assigned as primary and the chunks become accessible again. TimescaleDB does not automatically reassign primary status.

Based on the above, my two options appear to be:

  • Restore the original data node as quickly as possible
  • Manually intervene to delete the data node so that queries can continue while the failed node is restored (re-adding it later is sketched below)
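For reference, the re-add step from my script looks roughly like this ('dn3' and 'conditions' are placeholder names):

```sql
-- Once the failed node is back (or rebuilt), register it and attach it to the hypertable again.
SELECT add_data_node('dn3', host => 'data-node-3');
SELECT attach_data_node('dn3', hypertable => 'conditions');

-- New chunks will again be replicated to dn3; chunks created while it was gone still need
-- to be copied back to restore the full replication factor.
```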

Any other suggestions?

I know it is a very challenging problem to solve, but I'm curious whether you have plans to do automatic failover of the primary to another viable replica/data node?

Thanks again for taking the time to look into this and provide an explanation!

@erimatnor
Contributor

erimatnor commented Oct 12, 2021

> Thank you very much for looking into this, and your explanation makes sense and was along the lines of what I was expecting.
>
> To ensure I got it right, my understanding is:
>
> * If a data node goes down, queries that don't involve chunks whose primary replica is on the failed node will continue to succeed.

Correct.

> * If a data node goes down, queries that involve chunks whose primary replica is on the failed node will fail, even though the data is replicated elsewhere, because the primary replica is the node that processes queries for those chunks.

Correct.

> * If I delete the failed data node, another data node is assigned as primary and the chunks become accessible again. TimescaleDB does not automatically reassign primary status.

Correct.

We've had ideas about providing a function to reassign primary status to other nodes without having to fully delete a data node (e.g., during restarts or short downtimes), but I imagine the detection of failed nodes would still be left to an external monitoring system. Such a system is probably better at hooking into lifecycle events of nodes, e.g., in cloud providers or on-premise systems, than any DB-internal monitoring.

> Based on the above, my two options appear to be:
>
> * Restore the original data node as quickly as possible
>
> * Manually intervene to delete the data node so that queries can continue while the failed node is restored

Yes, those are the options so far. Note that deleting the data node also immediately restores write capability. In the future we'd like to support writes even when nodes are down, but that would entail either (1) syncing up chunks that were written to during the downtime, or (2) deleting the chunks written to during the downtime, followed by re-replication.

> Any other suggestions?

I think you covered most of it.

> I know it is a very challenging problem to solve, but I'm curious whether you have plans to do automatic failover of the primary to another viable replica/data node?

As stated above, we believe this is better handled by an external monitoring system. We do, however, aim to provide APIs that such a monitoring system can hook into for actions during failure events (e.g., to reassign primary status or delete a data node). It is not entirely unlikely that we will also provide an internal option for monitoring data nodes if we can figure out how to do it well, but this is not a priority for us right now, given that external monitoring systems already exist and do a good job (e.g., you get this for free in orchestration systems like Kubernetes).

> Thanks again for taking the time to look into this and provide an explanation!

My pleasure!
