Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix perf regression due to DML HA #5378

Merged
merged 1 commit into from Mar 3, 2023
Merged

Conversation

nikkhils
Copy link
Contributor

@nikkhils nikkhils commented Mar 2, 2023

We added checks via #4846 to handle DML HA when replication factor is
greater than 1 and a datanode is down. Since each insert can go to a different
chunk with a different set of datanodes, we added checks on every insert to check if DNs are unavailable. This increased CPU consumption on the AN leading to a performance regression for RF > 1 code paths.

This patch fixes this regression. We now track if any DN is marked as unavailable at the start of the transaction and use that information to reduce unnecessary checks for each inserted row.

@nikkhils nikkhils requested review from akuzm and erimatnor March 2, 2023 07:59
@nikkhils nikkhils force-pushed the cpu_regress branch 2 times, most recently from 749d54b to ee60c63 Compare March 2, 2023 08:05
@nikkhils nikkhils added this to the TimescaleDB 2.10.1 milestone Mar 2, 2023
@codecov
Copy link

codecov bot commented Mar 2, 2023

Codecov Report

Merging #5378 (40d4d4d) into main (6be1442) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #5378      +/-   ##
==========================================
- Coverage   90.67%   90.67%   -0.01%     
==========================================
  Files         226      226              
  Lines       52538    52553      +15     
==========================================
+ Hits        47640    47653      +13     
- Misses       4898     4900       +2     
Impacted Files Coverage Δ
tsl/src/data_node.c 96.11% <100.00%> (+0.01%) ⬆️
tsl/src/dist_backup.c 92.42% <100.00%> (+0.11%) ⬆️
tsl/src/remote/dist_copy.c 89.48% <100.00%> (+0.24%) ⬆️
src/loader/bgw_launcher.c 89.51% <0.00%> (-2.55%) ⬇️
tsl/src/reorder.c 85.49% <0.00%> (-0.22%) ⬇️
tsl/src/bgw_policy/job.c 87.54% <0.00%> (-0.05%) ⬇️
src/loader/bgw_message_queue.c 88.63% <0.00%> (+2.27%) ⬆️
src/compat/compat.h 96.61% <0.00%> (+6.13%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@@ -1333,6 +1335,8 @@ remote_copy_process_and_send_data(RemoteCopyContext *context)
Hypertable *ht = context->ht;
const int n = context->batch_row_count;
Assert(n <= MAX_BATCH_ROWS);
static int32 chunk_id = INVALID_CHUNK_ID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this variables to be static as these are being reset when the function returns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, good point @pdipesh02 will remove the statics!

* unavailable before we started this transaction. If not, then we know that every chunk's
* datanode list is fine and no stale chunk metadata updates are needed.
*/
if (context->dns_unavailable && found && ht->fd.replication_factor > 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we make a call to reset context->dns_unavailable for the remaining rows in the batch after removing the data nodes for stale chunk?

Copy link
Contributor Author

@nikkhils nikkhils Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The dns_unavailable is a global state for the entire transaction. We just want to avoid doing these expensive checks if all DNs are available. Also, the next INSERT can be into another chunk, so we need to mark that new chunk as stale as well in that case if it goes to that same unavailable DN.

We added checks via timescale#4846 to handle DML HA when replication factor is
greater than 1 and a datanode is down. Since each insert can go to a
different chunk with a different set of datanodes, we added checks
on every insert to check if DNs are unavailable. This increased CPU
consumption on the AN leading to a performance regression for RF > 1
code paths.

This patch fixes this regression. We now track if any DN is marked as
unavailable at the start of the transaction and use that information to
reduce unnecessary checks for each inserted row.
@nikkhils nikkhils merged commit 1423b55 into timescale:main Mar 3, 2023
@nikkhils nikkhils deleted the cpu_regress branch March 3, 2023 13:04
svenklemm added a commit to svenklemm/timescaledb that referenced this pull request Mar 6, 2023
This release contains bug fixes since the 2.10.0 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#5364 Fix num_chunks inconsistency in hypertables view
* timescale#5362 Make copy fetcher more async
* timescale#5336 Use NameData and namestrcpy for names
* timescale#5317 Fix some incorrect memory handling
* timescale#5367 Rename columns in old-style continuous aggregates
* timescale#5336 Use NameData and namestrcpy for names
* timescale#5343 Set PortalContext when starting job
* timescale#5360 Fix uninitialized bucket_info variable
* timescale#5362 Make copy fetcher more async
* timescale#5364 Fix num_chunks inconsistency in hypertables view
* timescale#5367 Fix column name handling in old-style continuous aggregates
* timescale#5378 Fix multinode DML HA performance regression
* timescale#5384 Fix Hierarchical Continuous Aggregates chunk_interval_size
* timescale#5153 Fix concurrent locking with chunk_data_node table

**Thanks**
* @justinozavala for reporting an issue with PL/Python procedures in the background worker
* @Medvecrab for discovering an issue with copying NameData when forming heap tuples.
* @pushpeepkmonroe for discovering an issue in upgrading old-style
  continuous aggregates with renamed columns
* @pushpeepkmonroe for discovering an issue in upgrading old-style continuous aggregates with renamed columns
@svenklemm svenklemm mentioned this pull request Mar 6, 2023
svenklemm added a commit to svenklemm/timescaledb that referenced this pull request Mar 6, 2023
This release contains bug fixes since the 2.10.0 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#5364 Fix num_chunks inconsistency in hypertables view
* timescale#5362 Make copy fetcher more async
* timescale#5336 Use NameData and namestrcpy for names
* timescale#5317 Fix some incorrect memory handling
* timescale#5367 Rename columns in old-style continuous aggregates
* timescale#5336 Use NameData and namestrcpy for names
* timescale#5343 Set PortalContext when starting job
* timescale#5360 Fix uninitialized bucket_info variable
* timescale#5362 Make copy fetcher more async
* timescale#5364 Fix num_chunks inconsistency in hypertables view
* timescale#5367 Fix column name handling in old-style continuous aggregates
* timescale#5378 Fix multinode DML HA performance regression
* timescale#5384 Fix Hierarchical Continuous Aggregates chunk_interval_size
* timescale#5153 Fix concurrent locking with chunk_data_node table

**Thanks**
* @justinozavala for reporting an issue with PL/Python procedures in the background worker
* @Medvecrab for discovering an issue with copying NameData when forming heap tuples.
* @pushpeepkmonroe for discovering an issue in upgrading old-style
  continuous aggregates with renamed columns
* @pushpeepkmonroe for discovering an issue in upgrading old-style continuous aggregates with renamed columns
svenklemm added a commit to svenklemm/timescaledb that referenced this pull request Mar 6, 2023
This release contains bug fixes since the 2.10.0 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#5159 Support Continuous Aggregates names in hypertable_(detailed_)size
* timescale#5226 Fix concurrent locking with chunk_data_node table
* timescale#5317 Fix some incorrect memory handling
* timescale#5336 Use NameData and namestrcpy for names
* timescale#5343 Set PortalContext when starting job
* timescale#5360 Fix uninitialized bucket_info variable
* timescale#5362 Make copy fetcher more async
* timescale#5364 Fix num_chunks inconsistency in hypertables view
* timescale#5367 Fix column name handling in old-style continuous aggregates
* timescale#5378 Fix multinode DML HA performance regression
* timescale#5384 Fix Hierarchical Continuous Aggregates chunk_interval_size

**Thanks**
* @justinozavala for reporting an issue with PL/Python procedures in the background worker
* @Medvecrab for discovering an issue with copying NameData when forming heap tuples.
* @pushpeepkmonroe for discovering an issue in upgrading old-style
  continuous aggregates with renamed columns
* @pushpeepkmonroe for discovering an issue in upgrading old-style continuous aggregates with renamed columns
svenklemm added a commit that referenced this pull request Mar 7, 2023
This release contains bug fixes since the 2.10.0 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* #5159 Support Continuous Aggregates names in hypertable_(detailed_)size
* #5226 Fix concurrent locking with chunk_data_node table
* #5317 Fix some incorrect memory handling
* #5336 Use NameData and namestrcpy for names
* #5343 Set PortalContext when starting job
* #5360 Fix uninitialized bucket_info variable
* #5362 Make copy fetcher more async
* #5364 Fix num_chunks inconsistency in hypertables view
* #5367 Fix column name handling in old-style continuous aggregates
* #5378 Fix multinode DML HA performance regression
* #5384 Fix Hierarchical Continuous Aggregates chunk_interval_size

**Thanks**
* @justinozavala for reporting an issue with PL/Python procedures in the background worker
* @Medvecrab for discovering an issue with copying NameData when forming heap tuples.
* @pushpeepkmonroe for discovering an issue in upgrading old-style
  continuous aggregates with renamed columns
* @pushpeepkmonroe for discovering an issue in upgrading old-style continuous aggregates with renamed columns
svenklemm added a commit to svenklemm/timescaledb that referenced this pull request Mar 7, 2023
This release contains bug fixes since the 2.10.0 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#5159 Support Continuous Aggregates names in hypertable_(detailed_)size
* timescale#5226 Fix concurrent locking with chunk_data_node table
* timescale#5317 Fix some incorrect memory handling
* timescale#5336 Use NameData and namestrcpy for names
* timescale#5343 Set PortalContext when starting job
* timescale#5360 Fix uninitialized bucket_info variable
* timescale#5362 Make copy fetcher more async
* timescale#5364 Fix num_chunks inconsistency in hypertables view
* timescale#5367 Fix column name handling in old-style continuous aggregates
* timescale#5378 Fix multinode DML HA performance regression
* timescale#5384 Fix Hierarchical Continuous Aggregates chunk_interval_size

**Thanks**
* @justinozavala for reporting an issue with PL/Python procedures in the background worker
* @Medvecrab for discovering an issue with copying NameData when forming heap tuples.
* @pushpeepkmonroe for discovering an issue in upgrading old-style
  continuous aggregates with renamed columns
* @pushpeepkmonroe for discovering an issue in upgrading old-style continuous aggregates with renamed columns
svenklemm added a commit that referenced this pull request Mar 7, 2023
This release contains bug fixes since the 2.10.0 release.
We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* #5159 Support Continuous Aggregates names in hypertable_(detailed_)size
* #5226 Fix concurrent locking with chunk_data_node table
* #5317 Fix some incorrect memory handling
* #5336 Use NameData and namestrcpy for names
* #5343 Set PortalContext when starting job
* #5360 Fix uninitialized bucket_info variable
* #5362 Make copy fetcher more async
* #5364 Fix num_chunks inconsistency in hypertables view
* #5367 Fix column name handling in old-style continuous aggregates
* #5378 Fix multinode DML HA performance regression
* #5384 Fix Hierarchical Continuous Aggregates chunk_interval_size

**Thanks**
* @justinozavala for reporting an issue with PL/Python procedures in the background worker
* @Medvecrab for discovering an issue with copying NameData when forming heap tuples.
* @pushpeepkmonroe for discovering an issue in upgrading old-style
  continuous aggregates with renamed columns
* @pushpeepkmonroe for discovering an issue in upgrading old-style continuous aggregates with renamed columns
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants