Fix perf regression due to DML HA #5378
749d54b to ee60c63
Codecov Report
@@            Coverage Diff             @@
##             main    #5378      +/-   ##
==========================================
- Coverage   90.67%   90.67%   -0.01%
==========================================
  Files         226      226
  Lines       52538    52553      +15
==========================================
+ Hits        47640    47653      +13
- Misses       4898     4900       +2
tsl/src/remote/dist_copy.c
Outdated
@@ -1333,6 +1335,8 @@ remote_copy_process_and_send_data(RemoteCopyContext *context)
    Hypertable *ht = context->ht;
    const int n = context->batch_row_count;
    Assert(n <= MAX_BATCH_ROWS);
    static int32 chunk_id = INVALID_CHUNK_ID;
Why do these variables need to be static, given that they are reset when the function returns?
Hmm, yes, good point @pdipesh02. Will remove the statics!
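The point raised in this thread can be illustrated with a small, self-contained sketch. It is not the code from `dist_copy.c`: the function names and the `INVALID_ID` sentinel are hypothetical. It only shows that a `static` local which is reset before every return carries nothing across calls, so an automatic variable behaves identically.

```c
#include <assert.h>

#define INVALID_ID 0 /* hypothetical sentinel, chosen only for this sketch */

/* Static local that is reset before returning: nothing survives the call. */
static int
process_with_static(int id)
{
	static int chunk_id = INVALID_ID;
	int carried_over = chunk_id; /* value left behind by the previous call */

	assert(id != INVALID_ID);
	chunk_id = id;
	/* ... per-call work that uses chunk_id would go here ... */
	chunk_id = INVALID_ID; /* reset on the way out */
	return carried_over;
}

/* Automatic local: starts from the sentinel on every call by construction. */
static int
process_with_auto(int id)
{
	int chunk_id = INVALID_ID;
	int carried_over = chunk_id;

	assert(id != INVALID_ID);
	chunk_id = id;
	(void) chunk_id; /* ... per-call work would go here ... */
	return carried_over;
}
```

Both functions always observe the sentinel at entry, so the `static` qualifier adds shared mutable state without adding behavior.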
     * unavailable before we started this transaction. If not, then we know that every chunk's
     * datanode list is fine and no stale chunk metadata updates are needed.
     */
    if (context->dns_unavailable && found && ht->fd.replication_factor > 1)
Shouldn't we reset `context->dns_unavailable` for the remaining rows in the batch after removing the data nodes for the stale chunk?
No. `dns_unavailable` is global state for the entire transaction. We just want to avoid doing these expensive checks when all DNs are available. Also, the next INSERT can go into another chunk, so if that chunk also uses the same unavailable DN, we need to mark it as stale as well.
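A minimal sketch of the reasoning in this reply, under simplifying assumptions: every type and function below (`Chunk`, `TxnState`, `process_batch`, a single unavailable node, two replicas per chunk) is hypothetical and stands in for the real structures. The flag is only read while rows are processed, so a later row that lands in a different chunk on the same unavailable DN is still caught.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each chunk is replicated on exactly two data nodes. */
typedef struct Chunk
{
	int data_nodes[2];
	bool stale;
} Chunk;

/* Hypothetical transaction-wide state, set once and never cleared per row. */
typedef struct TxnState
{
	bool dns_unavailable;
	int unavailable_dn; /* one unavailable node, to keep the sketch short */
} TxnState;

static bool
chunk_uses_node(const Chunk *chunk, int dn)
{
	return chunk->data_nodes[0] == dn || chunk->data_nodes[1] == dn;
}

/* Returns how many chunks this batch newly marked as stale. */
static int
process_batch(const TxnState *txn, Chunk *chunks, size_t n_chunks,
			  const int *row_chunk_ids, size_t n_rows)
{
	int newly_stale = 0;

	for (size_t i = 0; i < n_rows; i++)
	{
		assert(row_chunk_ids[i] >= 0 && (size_t) row_chunk_ids[i] < n_chunks);
		Chunk *chunk = &chunks[row_chunk_ids[i]];

		/* The flag is consulted for every row but not reset after a hit. */
		if (txn->dns_unavailable && !chunk->stale &&
			chunk_uses_node(chunk, txn->unavailable_dn))
		{
			chunk->stale = true;
			newly_stale++;
		}
	}
	return newly_stale;
}
```

With rows routed to chunks `0, 0, 1, 2, 1` and node 2 unavailable, chunks 0 and 1 are both marked stale in this model. Clearing the flag after the first hit would have missed chunk 1.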
We added checks in timescale#4846 to handle DML HA when the replication factor (RF) is greater than 1 and a data node (DN) is down. Because each insert can go to a different chunk with a different set of data nodes, those checks ran on every insert to determine whether any DN was unavailable. This increased CPU consumption on the access node (AN), causing a performance regression on the RF > 1 code paths. This patch fixes the regression: we now record at the start of the transaction whether any DN is marked as unavailable, and use that information to skip the unnecessary checks for each inserted row.
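The idea in the description can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `DataNode`, `TxnContext`, `txn_begin`, `insert_rows`, and the `expensive_checks` counter are all hypothetical names introduced for this example. Availability is sampled once when the transaction starts, and the per-row path only pays for the check when that flag is set and RF > 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a data node's availability state. */
typedef struct DataNode
{
	bool available;
} DataNode;

/* Hypothetical per-transaction context. */
typedef struct TxnContext
{
	bool dns_unavailable;  /* computed once, at transaction start */
	int replication_factor;
	long expensive_checks; /* counts how often the costly path runs */
} TxnContext;

/* Scan the data nodes once and remember whether any of them is down. */
static void
txn_begin(TxnContext *ctx, const DataNode *nodes, size_t n_nodes, int rf)
{
	assert(rf >= 1);
	ctx->dns_unavailable = false;
	ctx->replication_factor = rf;
	ctx->expensive_checks = 0;

	for (size_t i = 0; i < n_nodes; i++)
	{
		if (!nodes[i].available)
		{
			ctx->dns_unavailable = true;
			break;
		}
	}
}

/* Placeholder for the costly per-chunk data node inspection. */
static void
check_chunk_data_nodes(TxnContext *ctx)
{
	ctx->expensive_checks++;
}

/* Insert n_rows rows; returns how many expensive checks were performed. */
static long
insert_rows(TxnContext *ctx, size_t n_rows)
{
	for (size_t i = 0; i < n_rows; i++)
	{
		/* Cheap gate: skip the check entirely when all DNs were up. */
		if (ctx->dns_unavailable && ctx->replication_factor > 1)
			check_chunk_data_nodes(ctx);
		/* ... routing the row to its chunk would happen here ... */
	}
	return ctx->expensive_checks;
}
```

In this model a transaction that starts with every node available performs no expensive checks regardless of row count, while one that starts with a node down still checks every row, which preserves the behavior needed for HA.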
This release contains bug fixes since the 2.10.0 release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#5364 Fix num_chunks inconsistency in hypertables view
* timescale#5362 Make copy fetcher more async
* timescale#5336 Use NameData and namestrcpy for names
* timescale#5317 Fix some incorrect memory handling
* timescale#5367 Rename columns in old-style continuous aggregates
* timescale#5343 Set PortalContext when starting job
* timescale#5360 Fix uninitialized bucket_info variable
* timescale#5367 Fix column name handling in old-style continuous aggregates
* timescale#5378 Fix multinode DML HA performance regression
* timescale#5384 Fix Hierarchical Continuous Aggregates chunk_interval_size
* timescale#5153 Fix concurrent locking with chunk_data_node table

**Thanks**
* @justinozavala for reporting an issue with PL/Python procedures in the background worker
* @Medvecrab for discovering an issue with copying NameData when forming heap tuples
* @pushpeepkmonroe for discovering an issue in upgrading old-style continuous aggregates with renamed columns
This release contains bug fixes since the 2.10.0 release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#5159 Support Continuous Aggregates names in hypertable_(detailed_)size
* timescale#5226 Fix concurrent locking with chunk_data_node table
* timescale#5317 Fix some incorrect memory handling
* timescale#5336 Use NameData and namestrcpy for names
* timescale#5343 Set PortalContext when starting job
* timescale#5360 Fix uninitialized bucket_info variable
* timescale#5362 Make copy fetcher more async
* timescale#5364 Fix num_chunks inconsistency in hypertables view
* timescale#5367 Fix column name handling in old-style continuous aggregates
* timescale#5378 Fix multinode DML HA performance regression
* timescale#5384 Fix Hierarchical Continuous Aggregates chunk_interval_size

**Thanks**
* @justinozavala for reporting an issue with PL/Python procedures in the background worker
* @Medvecrab for discovering an issue with copying NameData when forming heap tuples
* @pushpeepkmonroe for discovering an issue in upgrading old-style continuous aggregates with renamed columns