Use COPY protocol in row-by-row fetcher #4188
Codecov Report

@@           Coverage Diff            @@
##             main    #4188    +/-  ##
=======================================
- Coverage   90.86%   90.74%   -0.13%
=======================================
  Files         221      221
  Lines       40783    40888    +105
=======================================
+ Hits        37056    37102     +46
- Misses       3727     3786     +59

Continue to review full report at Codecov.
Nice idea. Have some comments though.
tsl/src/remote/row_by_row_fetcher.c
Outdated
values[att] = ReceiveFunctionCall(&attconv->conv_funcs[att],
                                  NULL,
                                  attconv->ioparams[att],
                                  attconv->typmods[att]);
Why do we need this function call if the value is null?
Turns out we should only call it if the function is not strict. I fixed it and added a citation from the Postgres docs about this.
Sorry, I meant that the result of the function call is NULL (or at least treated as NULL), as evidenced by the next line, nulls[att] = true. So why do we need to call the function and write the result to values[att] in that case? STRICTness has to do with whether one of the function arguments is NULL, not the result. It seems to me that this whole line can be elided.
As I understand it, the Postgres docs say that a non-strict receive function should still be called for NULL inputs:

Usually, a receive function should be declared STRICT; if it is not, it will be called with a NULL first parameter when reading a NULL input value. The function must still return NULL in this case, unless it raises an error. (This case is mainly meant to support domain receive functions, which might need to reject NULL inputs.)

It must return NULL, but it can also throw an error; that's why we should call it.
Quoting from here: https://www.postgresql.org/docs/current/sql-createtype.html
values[att] = ReceiveFunctionCall(&attconv->conv_funcs[att],
                                  &att_data,
                                  attconv->ioparams[att],
                                  attconv->typmods[att]);
nulls[att] = false;
I am wondering if we shouldn't put the tuple memory context tighter around this call. It seems to me that the data from copy_data_read_bytes is only temporary until we have converted the binary data to a datum. We only care about keeping the Datum, not the binary data, right?
Or, we could implement a per_tuple memory context that we clear in every iteration of the loop to keep memory usage low. Currently, I think the temporary tuple data is stored in the batch context, which only gets cleared with each batch. Ideally, we should only have the final Datums in the batch context and no cruft left after reading a full batch.
Data fetchers don't use a separate per-tuple context: https://github.com/timescale/timescaledb/blob/main/tsl/src/remote/data_fetcher.c#L33
Should I change it in this PR?
My main focus here was using the COPY protocol, that's why I tried not to change anything else, but maybe I should.
Tighter memory handling can be left for another PR if needed. Just noted that you were reimplementing a large part of this loop anyway.
I think I'll merge this one first: #4434
Force-pushed from 6a87195 to 51ebc43
This gives about 2x speedup for bulk data transfer queries.
With the new tests from #4434, the coverage is a little better, although many error handling paths are still proving hard to cover.
This release adds major new features since the 2.7.2 release. We deem it moderate priority for upgrading. This release includes these noteworthy features:

* time_bucket now supports bucketing by month, year and timezone
* Improve performance of bulk SELECT and COPY for distributed hypertables
* 1 step CAgg policy management
* Migrate Continuous Aggregates to the new format

**Features**
* timescale#4188 Use COPY protocol in row-by-row fetcher
* timescale#4307 Mark partialize_agg as parallel safe
* timescale#4380 Enable chunk exclusion for space dimensions in UPDATE/DELETE
* timescale#4384 Add schedule_interval to policies
* timescale#4390 Faster lookup of chunks by point
* timescale#4393 Support intervals with day component when constifying now()
* timescale#4397 Support intervals with month component when constifying now()
* timescale#4405 Support ON CONFLICT ON CONSTRAINT for hypertables
* timescale#4412 Add telemetry about replication
* timescale#4415 Drop remote data when detaching data node
* timescale#4416 Handle TRUNCATE TABLE on chunks
* timescale#4425 Add parameter check_config to alter_job
* timescale#4430 Create index on Continuous Aggregates
* timescale#4439 Allow ORDER BY on continuous aggregates
* timescale#4443 Add stateful partition mappings
* timescale#4484 Use non-blocking data node connections for COPY
* timescale#4495 Support add_dimension() with existing data
* timescale#4502 Add chunks to baserel cache on chunk exclusion
* timescale#4545 Add hypertable distributed argument and defaults
* timescale#4552 Migrate Continuous Aggregates to the new format
* timescale#4556 Add runtime exclusion for hypertables
* timescale#4561 Change get_git_commit to return full commit hash
* timescale#4563 1 step CAgg policy management
* timescale#4641 Allow bucketing by month, year, century in time_bucket and time_bucket_gapfill
* timescale#4642 Add timezone support to time_bucket

**Bugfixes**
* timescale#4359 Create composite index on segmentby columns
* timescale#4374 Remove constified now() constraints from plan
* timescale#4416 Handle TRUNCATE TABLE on chunks
* timescale#4478 Synchronize chunk cache sizes
* timescale#4486 Adding boolean column with default value doesn't work on compressed table
* timescale#4512 Fix unaligned pointer access
* timescale#4519 Throw better error message on incompatible row fetcher settings
* timescale#4549 Fix dump_meta_data for windows
* timescale#4553 Fix timescaledb_post_restore GUC handling
* timescale#4573 Load TSL library on compressed_data_out call
* timescale#4575 Fix use of `get_partition_hash` and `get_partition_for_key` inside an IMMUTABLE function
* timescale#4577 Fix segfaults in compression code with corrupt data
* timescale#4580 Handle default privileges on CAggs properly
* timescale#4582 Fix assertion in GRANT .. ON ALL TABLES IN SCHEMA
* timescale#4583 Fix partitioning functions
* timescale#4589 Fix rename for distributed hypertable
* timescale#4601 Reset compression sequence when group resets
* timescale#4611 Fix a potential OOM when loading large data sets into a hypertable
* timescale#4624 Fix heap buffer overflow
* timescale#4627 Fix telemetry initialization
* timescale#4631 Ensure TSL library is loaded on database upgrades
* timescale#4646 Fix time_bucket_ng origin handling
* timescale#4647 Fix the error "SubPlan found with no parent plan" that occurred if using joins in RETURNING clause.

**Thanks**
* @AlmiS for reporting error on `get_partition_hash` executed inside an IMMUTABLE function
* @Creatation for reporting an issue with renaming hypertables
* @janko for reporting an issue when adding bool column with default value to compressed hypertable
* @jayadevanm for reporting error of TRUNCATE TABLE on compressed chunk
* @michaelkitson for reporting permission errors using default privileges on Continuous Aggregates
* @mwahlhuetter for reporting error in joins in RETURNING clause
* @ninjaltd and @mrksngl for reporting a potential OOM when loading large data sets into a hypertable
* @PBudmark for reporting an issue with dump_meta_data.sql on Windows
* @ssmoss for reporting an issue with time_bucket_ng origin handling
This gives about 2x speedup on a test query: copy (select * from live_steps_originals where user_id in (1,4) and time between '2020-01-01' and '2020-01-31') to '/dev/null'. It returns about 770k rows.

Other parts of this PR:
#4273
#4205
#4319
#4327
#4434