Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark partialize_agg as parallel safe #4307

Merged
merged 1 commit into from May 31, 2022
Merged

Conversation

akuzm
Copy link
Member

@akuzm akuzm commented May 6, 2022

Just an experiment to see what happens.

UPD: the experiment turned out to be a success. Seems that we can mark the partialize_agg function as parallel safe always, and the parallel plans are generated depending on the underlying aggregate function. Postgres does the full expression tree traversal to check for parallel unsafe functions, as expected.

Fixes #4306

Commit message

Mark partialize_agg as parallel-safe

This function is used to push down aggregation to data nodes, and making it parallel-safe allows to use parallel plans on data nodes, if the underlying aggregate function is parallel-safe as well. Add negative and positive tests for this.

@codecov
Copy link

codecov bot commented May 6, 2022

Codecov Report

Merging #4307 (e6a4384) into main (1fbe2eb) will decrease coverage by 0.00%.
The diff coverage is n/a.

❗ Current head e6a4384 differs from pull request most recent head 3e9254f. Consider uploading reports for the commit 3e9254f to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4307      +/-   ##
==========================================
- Coverage   90.84%   90.84%   -0.01%     
==========================================
  Files         220      220              
  Lines       40505    40504       -1     
==========================================
- Hits        36798    36794       -4     
- Misses       3707     3710       +3     
Impacted Files Coverage Δ
src/bgw/scheduler.c 82.94% <0.00%> (-0.89%) ⬇️
tsl/src/nodes/data_node_dispatch.c 97.10% <0.00%> (-0.25%) ⬇️
tsl/src/bgw_policy/job.c 88.62% <0.00%> (-0.05%) ⬇️
src/planner/constify_now.c 98.79% <0.00%> (ø)
tsl/src/fdw/deparse.c 69.49% <0.00%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d0670e...3e9254f. Read the comment docs.

@akuzm akuzm marked this pull request as ready for review May 30, 2022 10:27
@akuzm akuzm requested a review from a team as a code owner May 30, 2022 10:27
@akuzm akuzm requested review from erimatnor and svenklemm and removed request for a team May 30, 2022 10:27
@@ -81,6 +81,7 @@ ${PSQL} -U ${TEST_PGUSER} \
sed -e '/<exclude_from_test>/,/<\/exclude_from_test>/d' \
-e 's!_[0-9]\{1,\}_[0-9]\{1,\}_chunk!_X_X_chunk!g' \
-e 's!^ \{1,\}QUERY PLAN \{1,\}$!QUERY PLAN!' \
-e 's!: actual rows!: actual rows!' \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not a fan of this. Can you get a stable test output without this change as the number of rows is useful for other tests so I would rather not have this removed globally.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm keeping the number, just replacing double space for single. The formatting was changed between postgres versions, so I'm normalizing it to avoid having separate references.

Copy link
Contributor

@erimatnor erimatnor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Could you expand in the PR description (commit message) on why it is OK to make the function parallel safe.

I assume it is because partial aggregates actually exists in PG to do parallel computation, so they are parallel safe by definition? But it would be nice to make that clear if this is the reason.

@akuzm
Copy link
Member Author

akuzm commented May 31, 2022

I assume it is because partial aggregates actually exists in PG to do parallel computation, so they are parallel safe by definition? But it would be nice to make that clear if this is the reason.

Yes, Postgres knows whether a given aggregate is parallel-safe and paralellizes based on that. We don't need to artificially disable parallel plans with our partial aggregation wrapper, because this wrapper is a pure function that produces the unfinished aggregation state as the result, and doesn't change parallel safety. I'll try to describe this in more detail...

Postgres knows whether a given aggregate is parallel-safe, and creates
parallel aggregation plans based on that. The `partialize_agg` is a
wrapper we use to perform partial aggregation on data nodes. It is a
pure function that produces serialized aggregation state as a result.
Being pure, it doesn't influence parallel safety. This means we don't
need to mark it parallel-unsafe to artificially disable the parallel
plans for partial aggregation. They will be chosen as usual based on
the parallel-safety of the underlying aggregate function.
@akuzm akuzm enabled auto-merge (rebase) May 31, 2022 09:11
@akuzm akuzm merged commit 5c0110c into timescale:main May 31, 2022
svenklemm added a commit to svenklemm/timescaledb that referenced this pull request Aug 31, 2022
This release adds major new features since the 2.7.2 release.
We deem it moderate priority for upgrading.

This release includes these noteworthy features:

* time_bucket now supports bucketing by month, year and timezone
* Improve performance of bulk SELECT and COPY for distributed hypertables
* 1 step CAgg policy management
* Migrate Continuous Aggregates to the new format

**Features**
* timescale#4188 Use COPY protocol in row-by-row fetcher
* timescale#4307 Mark partialize_agg as parallel safe
* timescale#4380 Enable chunk exclusion for space dimensions in UPDATE/DELETE
* timescale#4384 Add schedule_interval to policies
* timescale#4390 Faster lookup of chunks by point
* timescale#4393 Support intervals with day component when constifying now()
* timescale#4397 Support intervals with month component when constifying now()
* timescale#4405 Support ON CONFLICT ON CONSTRAINT for hypertables
* timescale#4412 Add telemetry about replication
* timescale#4415 Drop remote data when detaching data node
* timescale#4416 Handle TRUNCATE TABLE on chunks
* timescale#4425 Add parameter check_config to alter_job
* timescale#4430 Create index on Continuous Aggregates
* timescale#4439 Allow ORDER BY on continuous aggregates
* timescale#4443 Add stateful partition mappings
* timescale#4484 Use non-blocking data node connections for COPY
* timescale#4495 Support add_dimension() with existing data
* timescale#4502 Add chunks to baserel cache on chunk exclusion
* timescale#4545 Add hypertable distributed argument and defaults
* timescale#4552 Migrate Continuous Aggregates to the new format
* timescale#4556 Add runtime exclusion for hypertables
* timescale#4561 Change get_git_commit to return full commit hash
* timescale#4563 1 step CAgg policy management
* timescale#4641 Allow bucketing by month, year, century in time_bucket and time_bucket_gapfill
* timescale#4642 Add timezone support to time_bucket

**Bugfixes**
* timescale#4359 Create composite index on segmentby columns
* timescale#4374 Remove constified now() constraints from plan
* timescale#4416 Handle TRUNCATE TABLE on chunks
* timescale#4478 Synchronize chunk cache sizes
* timescale#4486 Adding boolean column with default value doesn't work on compressed table
* timescale#4512 Fix unaligned pointer access
* timescale#4519 Throw better error message on incompatible row fetcher settings
* timescale#4549 Fix dump_meta_data for windows
* timescale#4553 Fix timescaledb_post_restore GUC handling
* timescale#4573 Load TSL library on compressed_data_out call
* timescale#4575 Fix use of `get_partition_hash` and `get_partition_for_key` inside an IMMUTABLE function
* timescale#4577 Fix segfaults in compression code with corrupt data
* timescale#4580 Handle default privileges on CAggs properly
* timescale#4582 Fix assertion in GRANT .. ON ALL TABLES IN SCHEMA
* timescale#4583 Fix partitioning functions
* timescale#4589 Fix rename for distributed hypertable
* timescale#4601 Reset compression sequence when group resets
* timescale#4611 Fix a potential OOM when loading large data sets into a hypertable
* timescale#4624 Fix heap buffer overflow
* timescale#4627 Fix telemetry initialization
* timescale#4631 Ensure TSL library is loaded on database upgrades
* timescale#4646 Fix time_bucket_ng origin handling
* timescale#4647 Fix the error "SubPlan found with no parent plan" that occurred if using joins in RETURNING clause.

**Thanks**
* @AlmiS for reporting error on `get_partition_hash` executed inside an IMMUTABLE function
* @Creatation for reporting an issue with renaming hypertables
* @janko for reporting an issue when adding bool column with default value to compressed hypertable
* @jayadevanm for reporting error of TRUNCATE TABLE on compressed chunk
* @michaelkitson for reporting permission errors using default privileges on Continuous Aggregates
* @mwahlhuetter for reporting error in joins in RETURNING clause
* @ninjaltd and @mrksngl for reporting a potential OOM when loading large data sets into a hypertable
* @PBudmark for reporting an issue with dump_meta_data.sql on Windows
* @ssmoss for reporting an issue with time_bucket_ng origin handling
@svenklemm svenklemm mentioned this pull request Aug 31, 2022
svenklemm added a commit that referenced this pull request Aug 31, 2022
This release adds major new features since the 2.7.2 release.
We deem it moderate priority for upgrading.

This release includes these noteworthy features:

* time_bucket now supports bucketing by month, year and timezone
* Improve performance of bulk SELECT and COPY for distributed hypertables
* 1 step CAgg policy management
* Migrate Continuous Aggregates to the new format

**Features**
* #4188 Use COPY protocol in row-by-row fetcher
* #4307 Mark partialize_agg as parallel safe
* #4380 Enable chunk exclusion for space dimensions in UPDATE/DELETE
* #4384 Add schedule_interval to policies
* #4390 Faster lookup of chunks by point
* #4393 Support intervals with day component when constifying now()
* #4397 Support intervals with month component when constifying now()
* #4405 Support ON CONFLICT ON CONSTRAINT for hypertables
* #4412 Add telemetry about replication
* #4415 Drop remote data when detaching data node
* #4416 Handle TRUNCATE TABLE on chunks
* #4425 Add parameter check_config to alter_job
* #4430 Create index on Continuous Aggregates
* #4439 Allow ORDER BY on continuous aggregates
* #4443 Add stateful partition mappings
* #4484 Use non-blocking data node connections for COPY
* #4495 Support add_dimension() with existing data
* #4502 Add chunks to baserel cache on chunk exclusion
* #4545 Add hypertable distributed argument and defaults
* #4552 Migrate Continuous Aggregates to the new format
* #4556 Add runtime exclusion for hypertables
* #4561 Change get_git_commit to return full commit hash
* #4563 1 step CAgg policy management
* #4641 Allow bucketing by month, year, century in time_bucket and time_bucket_gapfill
* #4642 Add timezone support to time_bucket

**Bugfixes**
* #4359 Create composite index on segmentby columns
* #4374 Remove constified now() constraints from plan
* #4416 Handle TRUNCATE TABLE on chunks
* #4478 Synchronize chunk cache sizes
* #4486 Adding boolean column with default value doesn't work on compressed table
* #4512 Fix unaligned pointer access
* #4519 Throw better error message on incompatible row fetcher settings
* #4549 Fix dump_meta_data for windows
* #4553 Fix timescaledb_post_restore GUC handling
* #4573 Load TSL library on compressed_data_out call
* #4575 Fix use of `get_partition_hash` and `get_partition_for_key` inside an IMMUTABLE function
* #4577 Fix segfaults in compression code with corrupt data
* #4580 Handle default privileges on CAggs properly
* #4582 Fix assertion in GRANT .. ON ALL TABLES IN SCHEMA
* #4583 Fix partitioning functions
* #4589 Fix rename for distributed hypertable
* #4601 Reset compression sequence when group resets
* #4611 Fix a potential OOM when loading large data sets into a hypertable
* #4624 Fix heap buffer overflow
* #4627 Fix telemetry initialization
* #4631 Ensure TSL library is loaded on database upgrades
* #4646 Fix time_bucket_ng origin handling
* #4647 Fix the error "SubPlan found with no parent plan" that occurred if using joins in RETURNING clause.

**Thanks**
* @AlmiS for reporting error on `get_partition_hash` executed inside an IMMUTABLE function
* @Creatation for reporting an issue with renaming hypertables
* @janko for reporting an issue when adding bool column with default value to compressed hypertable
* @jayadevanm for reporting error of TRUNCATE TABLE on compressed chunk
* @michaelkitson for reporting permission errors using default privileges on Continuous Aggregates
* @mwahlhuetter for reporting error in joins in RETURNING clause
* @ninjaltd and @mrksngl for reporting a potential OOM when loading large data sets into a hypertable
* @PBudmark for reporting an issue with dump_meta_data.sql on Windows
* @ssmoss for reporting an issue with time_bucket_ng origin handling
svenklemm added a commit that referenced this pull request Aug 31, 2022
This release adds major new features since the 2.7.2 release.
We deem it moderate priority for upgrading.

This release includes these noteworthy features:

* time_bucket now supports bucketing by month, year and timezone
* Improve performance of bulk SELECT and COPY for distributed hypertables
* 1 step CAgg policy management
* Migrate Continuous Aggregates to the new format

**Features**
* #4188 Use COPY protocol in row-by-row fetcher
* #4307 Mark partialize_agg as parallel safe
* #4380 Enable chunk exclusion for space dimensions in UPDATE/DELETE
* #4384 Add schedule_interval to policies
* #4390 Faster lookup of chunks by point
* #4393 Support intervals with day component when constifying now()
* #4397 Support intervals with month component when constifying now()
* #4405 Support ON CONFLICT ON CONSTRAINT for hypertables
* #4412 Add telemetry about replication
* #4415 Drop remote data when detaching data node
* #4416 Handle TRUNCATE TABLE on chunks
* #4425 Add parameter check_config to alter_job
* #4430 Create index on Continuous Aggregates
* #4439 Allow ORDER BY on continuous aggregates
* #4443 Add stateful partition mappings
* #4484 Use non-blocking data node connections for COPY
* #4495 Support add_dimension() with existing data
* #4502 Add chunks to baserel cache on chunk exclusion
* #4545 Add hypertable distributed argument and defaults
* #4552 Migrate Continuous Aggregates to the new format
* #4556 Add runtime exclusion for hypertables
* #4561 Change get_git_commit to return full commit hash
* #4563 1 step CAgg policy management
* #4641 Allow bucketing by month, year, century in time_bucket and time_bucket_gapfill
* #4642 Add timezone support to time_bucket

**Bugfixes**
* #4359 Create composite index on segmentby columns
* #4374 Remove constified now() constraints from plan
* #4416 Handle TRUNCATE TABLE on chunks
* #4478 Synchronize chunk cache sizes
* #4486 Adding boolean column with default value doesn't work on compressed table
* #4512 Fix unaligned pointer access
* #4519 Throw better error message on incompatible row fetcher settings
* #4549 Fix dump_meta_data for windows
* #4553 Fix timescaledb_post_restore GUC handling
* #4573 Load TSL library on compressed_data_out call
* #4575 Fix use of `get_partition_hash` and `get_partition_for_key` inside an IMMUTABLE function
* #4577 Fix segfaults in compression code with corrupt data
* #4580 Handle default privileges on CAggs properly
* #4582 Fix assertion in GRANT .. ON ALL TABLES IN SCHEMA
* #4583 Fix partitioning functions
* #4589 Fix rename for distributed hypertable
* #4601 Reset compression sequence when group resets
* #4611 Fix a potential OOM when loading large data sets into a hypertable
* #4624 Fix heap buffer overflow
* #4627 Fix telemetry initialization
* #4631 Ensure TSL library is loaded on database upgrades
* #4646 Fix time_bucket_ng origin handling
* #4647 Fix the error "SubPlan found with no parent plan" that occurred if using joins in RETURNING clause.

**Thanks**
* @AlmiS for reporting error on `get_partition_hash` executed inside an IMMUTABLE function
* @Creatation for reporting an issue with renaming hypertables
* @janko for reporting an issue when adding bool column with default value to compressed hypertable
* @jayadevanm for reporting error of TRUNCATE TABLE on compressed chunk
* @michaelkitson for reporting permission errors using default privileges on Continuous Aggregates
* @mwahlhuetter for reporting error in joins in RETURNING clause
* @ninjaltd and @mrksngl for reporting a potential OOM when loading large data sets into a hypertable
* @PBudmark for reporting an issue with dump_meta_data.sql on Windows
* @ssmoss for reporting an issue with time_bucket_ng origin handling
svenklemm added a commit that referenced this pull request Aug 31, 2022
This release adds major new features since the 2.7.2 release.
We deem it moderate priority for upgrading.

This release includes these noteworthy features:

* time_bucket now supports bucketing by month, year and timezone
* Improve performance of bulk SELECT and COPY for distributed hypertables
* 1 step CAgg policy management
* Migrate Continuous Aggregates to the new format

**Features**
* #4188 Use COPY protocol in row-by-row fetcher
* #4307 Mark partialize_agg as parallel safe
* #4380 Enable chunk exclusion for space dimensions in UPDATE/DELETE
* #4384 Add schedule_interval to policies
* #4390 Faster lookup of chunks by point
* #4393 Support intervals with day component when constifying now()
* #4397 Support intervals with month component when constifying now()
* #4405 Support ON CONFLICT ON CONSTRAINT for hypertables
* #4412 Add telemetry about replication
* #4415 Drop remote data when detaching data node
* #4416 Handle TRUNCATE TABLE on chunks
* #4425 Add parameter check_config to alter_job
* #4430 Create index on Continuous Aggregates
* #4439 Allow ORDER BY on continuous aggregates
* #4443 Add stateful partition mappings
* #4484 Use non-blocking data node connections for COPY
* #4495 Support add_dimension() with existing data
* #4502 Add chunks to baserel cache on chunk exclusion
* #4545 Add hypertable distributed argument and defaults
* #4552 Migrate Continuous Aggregates to the new format
* #4556 Add runtime exclusion for hypertables
* #4561 Change get_git_commit to return full commit hash
* #4563 1 step CAgg policy management
* #4641 Allow bucketing by month, year, century in time_bucket and time_bucket_gapfill
* #4642 Add timezone support to time_bucket

**Bugfixes**
* #4359 Create composite index on segmentby columns
* #4374 Remove constified now() constraints from plan
* #4416 Handle TRUNCATE TABLE on chunks
* #4478 Synchronize chunk cache sizes
* #4486 Adding boolean column with default value doesn't work on compressed table
* #4512 Fix unaligned pointer access
* #4519 Throw better error message on incompatible row fetcher settings
* #4549 Fix dump_meta_data for windows
* #4553 Fix timescaledb_post_restore GUC handling
* #4573 Load TSL library on compressed_data_out call
* #4575 Fix use of `get_partition_hash` and `get_partition_for_key` inside an IMMUTABLE function
* #4577 Fix segfaults in compression code with corrupt data
* #4580 Handle default privileges on CAggs properly
* #4582 Fix assertion in GRANT .. ON ALL TABLES IN SCHEMA
* #4583 Fix partitioning functions
* #4589 Fix rename for distributed hypertable
* #4601 Reset compression sequence when group resets
* #4611 Fix a potential OOM when loading large data sets into a hypertable
* #4624 Fix heap buffer overflow
* #4627 Fix telemetry initialization
* #4631 Ensure TSL library is loaded on database upgrades
* #4646 Fix time_bucket_ng origin handling
* #4647 Fix the error "SubPlan found with no parent plan" that occurred if using joins in RETURNING clause.

**Thanks**
* @AlmiS for reporting error on `get_partition_hash` executed inside an IMMUTABLE function
* @Creatation for reporting an issue with renaming hypertables
* @janko for reporting an issue when adding bool column with default value to compressed hypertable
* @jayadevanm for reporting error of TRUNCATE TABLE on compressed chunk
* @michaelkitson for reporting permission errors using default privileges on Continuous Aggregates
* @mwahlhuetter for reporting error in joins in RETURNING clause
* @ninjaltd and @mrksngl for reporting a potential OOM when loading large data sets into a hypertable
* @PBudmark for reporting an issue with dump_meta_data.sql on Windows
* @ssmoss for reporting an issue with time_bucket_ng origin handling
fabriziomello added a commit to fabriziomello/timescaledb that referenced this pull request Jan 11, 2023
Previous PR timescale#4307 mark `partialize_agg` and `finalize_agg` as parallel
safe but this change is leading to incorrect results in some cases.

Those functions are supposed work in parallel but seems is not the case
and it is not evident yet the root cause and how to properly use it in
parallel queries so we decided to revert this change and provide correct
results to users.

Fixes timescale#4922
fabriziomello added a commit to fabriziomello/timescaledb that referenced this pull request Jan 11, 2023
Previous PR timescale#4307 mark `partialize_agg` and `finalize_agg` as parallel
safe but this change is leading to incorrect results in some cases.

Those functions are supposed work in parallel but seems is not the case
and it is not evident yet the root cause and how to properly use it in
parallel queries so we decided to revert this change and provide correct
results to users.

Fixes timescale#4922
fabriziomello added a commit to fabriziomello/timescaledb that referenced this pull request Jan 12, 2023
Previous PR timescale#4307 mark `partialize_agg` and `finalize_agg` as parallel
safe but this change is leading to incorrect results in some cases.

Those functions are supposed work in parallel but seems is not the case
and it is not evident yet the root cause and how to properly use it in
parallel queries so we decided to revert this change and provide correct
results to users.

Fixes timescale#4922
fabriziomello added a commit to fabriziomello/timescaledb that referenced this pull request Jan 12, 2023
Previous PR timescale#4307 mark `partialize_agg` and `finalize_agg` as parallel
safe but this change is leading to incorrect results in some cases.

Those functions are supposed work in parallel but seems is not the case
and it is not evident yet the root cause and how to properly use it in
parallel queries so we decided to revert this change and provide correct
results to users.

Fixes timescale#4922
fabriziomello added a commit to fabriziomello/timescaledb that referenced this pull request Jan 12, 2023
Previous PR timescale#4307 mark `partialize_agg` and `finalize_agg` as parallel
safe but this change is leading to incorrect results in some cases.

Those functions are supposed work in parallel but seems is not the case
and it is not evident yet the root cause and how to properly use it in
parallel queries so we decided to revert this change and provide correct
results to users.

Fixes timescale#4922
fabriziomello added a commit to fabriziomello/timescaledb that referenced this pull request Jan 12, 2023
Previous PR timescale#4307 mark `partialize_agg` and `finalize_agg` as parallel
safe but this change is leading to incorrect results in some cases.

Those functions are supposed work in parallel but seems is not the case
and it is not evident yet the root cause and how to properly use it in
parallel queries so we decided to revert this change and provide correct
results to users.

Fixes timescale#4922
fabriziomello added a commit to fabriziomello/timescaledb that referenced this pull request Jan 12, 2023
Previous PR timescale#4307 mark `partialize_agg` and `finalize_agg` as parallel
safe but this change is leading to incorrect results in some cases.

Those functions are supposed work in parallel but seems is not the case
and it is not evident yet the root cause and how to properly use it in
parallel queries so we decided to revert this change and provide correct
results to users.

Fixes timescale#4922
fabriziomello added a commit that referenced this pull request Jan 13, 2023
Previous PR #4307 mark `partialize_agg` and `finalize_agg` as parallel
safe but this change is leading to incorrect results in some cases.

Those functions are supposed work in parallel but seems is not the case
and it is not evident yet the root cause and how to properly use it in
parallel queries so we decided to revert this change and provide correct
results to users.

Fixes #4922
sb230132 pushed a commit that referenced this pull request Jan 24, 2023
Previous PR #4307 mark `partialize_agg` and `finalize_agg` as parallel
safe but this change is leading to incorrect results in some cases.

Those functions are supposed work in parallel but seems is not the case
and it is not evident yet the root cause and how to properly use it in
parallel queries so we decided to revert this change and provide correct
results to users.

Fixes #4922
sb230132 pushed a commit that referenced this pull request Jan 24, 2023
Previous PR #4307 mark `partialize_agg` and `finalize_agg` as parallel
safe but this change is leading to incorrect results in some cases.

Those functions are supposed work in parallel but seems is not the case
and it is not evident yet the root cause and how to properly use it in
parallel queries so we decided to revert this change and provide correct
results to users.

Fixes #4922
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enhancement]: Parallel Execution Plans on Data Nodes
3 participants