

Multinode HA: handle DML to chunks when target DN is down #4846

Closed
nikkhils opened this issue Oct 18, 2022 · 1 comment · Fixed by #4966

nikkhils (Contributor) commented Oct 18, 2022

What problem does the new feature solve?

https://github.com/timescale/product/issues/288 lays down the requirements for Datanode HA. We want to allow both read and write queries when a datanode (DN) goes down.

One way to allow write queries while a DN is down is to track which chunks on that DN are being written to and mark them as dirty.

What does the feature do?

Instead of tracking dirty chunks explicitly on the AN, we will remove the metadata on the AN that associates the down DN with each such dirty chunk. Since the metadata is removed on the AN, future queries will never be routed to this DN for that data, avoiding stale data fetches entirely.
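
To make the approach concrete, the sketch below shows the AN-side catalog involved. It is illustrative only: the actual implementation changes this metadata through internal code paths on the access node, not via hand-written SQL, and the chunk id and node name here are made up.

```sql
-- Conceptual sketch only; assumes chunk id 42 and a down data node 'dn3'.
-- The AN-side catalog that maps each chunk to the DNs holding a replica:
SELECT chunk_id, node_name
  FROM _timescaledb_catalog.chunk_data_node
 WHERE chunk_id = 42;

-- "Removing the metadata" for the down DN amounts to dropping this
-- association, so the AN stops routing reads and writes for that chunk there:
DELETE FROM _timescaledb_catalog.chunk_data_node
 WHERE chunk_id = 42
   AND node_name = 'dn3';
```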

When the DN comes back, the rebalancer microservice in TSC (Timescale Cloud) will kick in automatically and refresh these chunks on the DN. That will require changes to the copy_chunk/move_chunk functionality.
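
As a sketch of what that rebalancing step could look like, the experimental copy_chunk procedure re-replicates a chunk from a surviving DN to the recovered one. The chunk and node names below are hypothetical:

```sql
-- Re-replicate a chunk from a surviving DN ('dn1') to the recovered DN
-- ('dn3'); chunk and node names are made up for the example.
CALL timescaledb_experimental.copy_chunk(
    chunk            => '_timescaledb_internal._dist_hyper_1_1_chunk',
    source_node      => 'dn1',
    destination_node => 'dn3'
);
```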

When the DN comes back, it can also tell the AN about the chunks it holds, and the AN can use that information to clean up stale chunks on the DN.

Hat tip to @pmwkaa for the overall idea of removing metadata on the AN as the way to track dirtied chunks.

This issue tracks only the AN-side handling of the metadata removal.

Implementation challenges

No response

jfjoly commented Nov 1, 2022

Note on the behavior: when a DN is unavailable, new chunks will be created on the "surviving" nodes. The replication factor may remain below the desired value until all DNs come back online.
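
To make this concrete, under-replicated chunks can be spotted from the AN by comparing each chunk's data node list against the desired replication factor. A minimal sketch, assuming a distributed hypertable named conditions with replication factor 2:

```sql
-- List chunks that currently have fewer replicas than the desired
-- replication factor (assumed to be 2 for this example).
SELECT chunk_schema, chunk_name, data_nodes
  FROM timescaledb_information.chunks
 WHERE hypertable_name = 'conditions'
   AND array_length(data_nodes, 1) < 2;
```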

nikkhils added a commit to nikkhils/timescaledb that referenced this issue Nov 11, 2022
If a datanode goes down for whatever reason, then DML activity to
chunks residing on (or targeted to) that DN will start erroring out.
We now handle this by marking the target chunk as "stale" for this
DN by changing the metadata on the access node. This allows us to
continue to do DML to replicas of the same chunk data on other DNs
in the setup. This will obviously only work for chunks which have
"replication_factor" > 1. Note that chunks which do not undergo
any change will continue to carry the appropriate DN-related
metadata on the AN.

This means that such "stale" chunks will become underreplicated and
will need to be re-balanced using the copy_chunk functionality by a
microservice or similar process.

Fixes timescale#4846
@nikkhils nikkhils added this to the TimescaleDB 2.9 milestone Nov 11, 2022
@vineethapai vineethapai removed this from the TimescaleDB 2.9 milestone Nov 17, 2022
nikkhils added a commit to nikkhils/timescaledb that referenced this issue Nov 22, 2022
If a datanode goes down for whatever reason, then DML activity to
chunks residing on (or targeted to) that DN will start erroring out.
We now handle this by marking the target chunk as "stale" for this
DN by changing the metadata on the access node. This allows us to
continue to do DML to replicas of the same chunk data on other DNs
in the setup. This will obviously only work for chunks which have
"replication_factor" > 1. Note that chunks which do not undergo
any change will continue to carry the appropriate DN-related
metadata on the AN.

This means that such "stale" chunks will become underreplicated and
will need to be re-balanced using the copy_chunk functionality by a
microservice or similar process.

In passing, also fix SELECT behaviour on distributed caggs: we only
connect to available DNs for invalidation handling.

Fixes timescale#4846
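
For context on how this surfaces to users: assuming the data node availability API that shipped alongside this work (the available option of alter_data_node in TimescaleDB 2.9) and a hypothetical node and hypertable, a session could look like this:

```sql
-- Mark a (hypothetical) down data node as unavailable on the AN.
SELECT alter_data_node('dn3', available => false);

-- DML to a distributed hypertable with replication_factor > 1 now succeeds
-- against the surviving replicas; affected chunks are marked stale for dn3.
INSERT INTO conditions VALUES (now(), 'office', 21.5);

-- Once dn3 is back and its chunks have been re-balanced, mark it
-- available again.
SELECT alter_data_node('dn3', available => true);
```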
nikkhils added a commit that referenced this issue Nov 25, 2022
SachinSetiya pushed a commit that referenced this issue Nov 28, 2022
nikkhils added a commit to nikkhils/timescaledb that referenced this issue Mar 2, 2023
We added checks via timescale#4846 to handle DML HA when the replication
factor is greater than 1 and a datanode is down. Since each insert can go
to a different chunk with a different set of datanodes, we added checks on
every insert to see whether any DNs are unavailable. This increased CPU
consumption on the AN, leading to a performance regression for RF > 1
code paths.

This patch fixes the regression. We now track whether any DN is marked as
unavailable at the start of the transaction and use that information to
avoid unnecessary checks for each inserted row.
nikkhils added a commit that referenced this issue Mar 3, 2023
svenklemm pushed a commit that referenced this issue Mar 7, 2023