Multinode HA: handle DML to chunks when target DN is down #4846
Note on the behavior: when a DN is unavailable, new chunks will be created on the "surviving" nodes. The replication factor may be below the desired one until all DNs come back online.
nikkhils added a commit to nikkhils/timescaledb that referenced this issue on Nov 11, 2022

If a data node (DN) goes down for whatever reason, DML activity against chunks residing on (or targeted to) that DN starts erroring out. We now handle this by marking the target chunk as "stale" for that DN, by changing the metadata on the access node (AN). This allows DML to continue against replicas of the same chunk data on other DNs in the setup. This only works for chunks with "replication_factor" > 1. Note that chunks which do not undergo any change will continue to carry the appropriate DN-related metadata on the AN. Such "stale" chunks become under-replicated and need to be re-balanced using the copy_chunk functionality, driven by a microservice or some such process. Fixes timescale#4846
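The metadata change described in the commit message can be sketched as follows. This is a hypothetical illustration in Python, not TimescaleDB's actual catalog code: `mark_chunk_stale`, the dict-based replica map, and the DN names are all invented for the example.

```python
# Hypothetical sketch of the access-node metadata change: when DML
# targets a chunk on a down DN, the AN drops that chunk-to-DN
# association so the write proceeds against the surviving replicas.

def mark_chunk_stale(chunk_replicas, chunk, down_dn):
    """Drop down_dn from a chunk's replica list; return surviving DNs.

    Only possible when the chunk was created with replication_factor > 1;
    with a single replica the DML has nowhere else to go and must error.
    """
    dns = chunk_replicas[chunk]
    if down_dn not in dns:
        return dns
    if len(dns) == 1:
        raise RuntimeError(f"{chunk}: sole replica is on down node {down_dn}")
    dns.remove(down_dn)  # chunk is now "stale" on down_dn and under-replicated
    return dns

replicas = {"chunk_1": ["dn1", "dn2"]}
surviving = mark_chunk_stale(replicas, "chunk_1", "dn1")
# surviving == ["dn2"]; chunk_1 must later be re-balanced via copy_chunk
```

Removing the association (rather than adding a "dirty" flag) means subsequent reads and writes simply never route to the down DN for this chunk.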
nikkhils added a commit to nikkhils/timescaledb that referenced this issue on Nov 22, 2022

If a data node (DN) goes down for whatever reason, DML activity against chunks residing on (or targeted to) that DN starts erroring out. We now handle this by marking the target chunk as "stale" for that DN, by changing the metadata on the access node (AN). This allows DML to continue against replicas of the same chunk data on other DNs in the setup. This only works for chunks with "replication_factor" > 1. Note that chunks which do not undergo any change will continue to carry the appropriate DN-related metadata on the AN. Such "stale" chunks become under-replicated and need to be re-balanced using the copy_chunk functionality, driven by a microservice or some such process. In passing, also fix SELECT behaviour on distributed continuous aggregates: we now connect only to available DNs for invalidation handling. Fixes timescale#4846
nikkhils added a commit that referenced this issue on Nov 25, 2022

SachinSetiya pushed a commit that referenced this issue on Nov 28, 2022
nikkhils added a commit to nikkhils/timescaledb that referenced this issue on Mar 2, 2023

We added checks via timescale#4846 to handle DML HA when the replication factor is greater than 1 and a data node is down. Since each insert can go to a different chunk with a different set of data nodes, we checked on every insert whether any DNs were unavailable. This increased CPU consumption on the AN, leading to a performance regression for RF > 1 code paths. This patch fixes the regression: we now record at the start of the transaction whether any DN is marked unavailable, and use that information to avoid unnecessary checks for each inserted row.
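The optimization above amounts to hoisting the availability check out of the per-row path. A minimal sketch, with invented names (`Transaction`, `insert_row`) standing in for the real AN code paths:

```python
# Hypothetical sketch: snapshot DN availability once at transaction
# start; only fall back to per-row checks when some DN was down.

class Transaction:
    def __init__(self, dn_status):
        # Snapshot taken once when the transaction begins.
        self.any_dn_down = any(not up for up in dn_status.values())
        self.checks = 0  # counts expensive per-row availability checks

    def insert_row(self, target_dns, dn_status):
        if self.any_dn_down:
            # Slow path: verify each target DN for this row's chunk.
            self.checks += 1
            return [dn for dn in target_dns if dn_status[dn]]
        # Fast path: all DNs were up at transaction start, write everywhere.
        return list(target_dns)

status = {"dn1": True, "dn2": True}
txn = Transaction(status)
for _ in range(1000):
    txn.insert_row(["dn1", "dn2"], status)
assert txn.checks == 0  # healthy cluster: no per-row checks at all
```

In the common all-healthy case the per-row cost drops to a single boolean test, which is what removes the RF > 1 regression.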
nikkhils added a commit that referenced this issue on Mar 3, 2023

svenklemm pushed a commit that referenced this issue on Mar 7, 2023
What problem does the new feature solve?
https://github.com/timescale/product/issues/288 lays down the requirements for Datanode HA. We want to allow both read and write queries when a datanode (DN) goes down.
One way to allow write queries when a DN goes down is to track which chunks on the down DN are being written to, and to mark them accordingly.
What does the feature do?
Instead of tracking dirty chunks explicitly on the AN, we will remove the metadata on the AN that associates the down DN with each such dirty chunk. Since the metadata is removed on the AN, future queries will never go to this DN for this data, avoiding the issue of stale data fetches completely.
When the DN comes back, the rebalancer microservice in TSC (Timescale Cloud) will kick in automatically and refresh these chunks on the DN. That will need changes to the copy_chunk/move_chunk functionality.
When the DN comes back, it can also tell the AN about its chunks, and the AN can use that information to clean up the stale chunks on it.
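The recovery step sketched above is essentially a reconciliation between the DN's on-disk chunks and the AN's metadata. A hypothetical illustration (the `reconcile` function, the dict layout, and the assumed replication factor of 2 are all invented for the example; the real system would drive the copy step through copy_chunk):

```python
# Hypothetical reconciliation sketch: when a DN returns, compare the
# chunks it still holds against the AN's metadata. Chunks the AN no
# longer associates with this DN are stale and can be dropped there;
# under-replicated chunks are re-copied onto the returning DN.

def reconcile(an_metadata, dn_name, dn_chunks):
    """Return (chunks to drop on the DN, chunks to copy back to it)."""
    wanted = {c for c, dns in an_metadata.items() if dn_name in dns}
    drop = dn_chunks - wanted  # AN removed the association: stale on the DN
    copy_back = {
        c for c, dns in an_metadata.items()
        # Assumed desired replication factor of 2 for this example.
        if dn_name not in dns and len(dns) < 2
    }
    return drop, copy_back

# chunk_1 was written while dn1 was down, so dn1's copy is stale and
# the chunk is under-replicated; chunk_2 was untouched.
an = {"chunk_1": ["dn2"], "chunk_2": ["dn1", "dn3"]}
drop, copy_back = reconcile(an, "dn1", {"chunk_1", "chunk_2"})
# drop == {"chunk_1"} and copy_back == {"chunk_1"}: drop the stale
# copy, then copy a fresh replica back to restore the replication factor.
```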
Hat tip to @pmwkaa for this overall idea of removing metadata on the AN as the way to track the dirtied chunks.
This issue will only track the AN-side handling of the metadata removal.