
Add index scan to INSERT DML decompression #7048

Merged
merged 1 commit into timescale:main from insert-index-scan
Jun 25, 2024

Conversation

antekresic
Contributor

In order to verify constraints, we have to decompress batches that could contain duplicates of the tuples we are inserting. To find such batches, we currently use heap scans, which can be very expensive if the compressed chunk contains a lot of tuples. Doing an index scan instead makes much more sense in this scenario and gives significant performance benefits.

Additionally, we don't want to create the decompressor until we determine that we actually want to decompress a batch, so we initialize it lazily once a batch is found.
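
A minimal sketch of the idea in PostgreSQL C (illustrative only, not the actual patch; `RowDecompressor`, `build_decompressor_for`, and `decompress_batch_from_slot` are hypothetical placeholder names):

#include "postgres.h"

#include "access/genam.h"
#include "access/tableam.h"
#include "executor/tuptable.h"
#include "utils/snapmgr.h"

/*
 * Sketch only: probe an index on the compressed chunk for batches that
 * may conflict with the tuple being inserted, instead of scanning the
 * whole heap. The decompressor is created lazily, on the first match.
 */
static void
decompress_conflicting_batches(Relation in_rel, Relation index_rel,
							   ScanKey keys, int num_keys)
{
	RowDecompressor *decompressor = NULL;	/* hypothetical type */
	TupleTableSlot *slot = table_slot_create(in_rel, NULL);
	IndexScanDesc scan =
		index_beginscan(in_rel, index_rel, GetTransactionSnapshot(), num_keys, 0);

	index_rescan(scan, keys, num_keys, NULL, 0);

	while (index_getnext_slot(scan, ForwardScanDirection, slot))
	{
		/* Pay the decompressor setup cost only once a batch is found. */
		if (decompressor == NULL)
			decompressor = build_decompressor_for(in_rel);	/* hypothetical */

		decompress_batch_from_slot(decompressor, slot);	/* hypothetical */
	}

	index_endscan(scan);
	ExecDropSingleTupleTableSlot(slot);
}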

@antekresic antekresic self-assigned this Jun 19, 2024
@antekresic antekresic added the enhancement An enhancement to an existing feature for functionality label Jun 19, 2024
@antekresic antekresic added this to the TimescaleDB 2.16.0 milestone Jun 19, 2024
@antekresic antekresic force-pushed the insert-index-scan branch 3 times, most recently from 94407e8 to a42cbc4 on June 20, 2024 07:42

codecov bot commented Jun 20, 2024

Codecov Report

Attention: Patch coverage is 82.89474% with 39 lines in your changes missing coverage. Please review.

Project coverage is 81.86%. Comparing base (59f50f2) to head (0685126).
Report is 222 commits behind head on main.

Files                                           Patch %   Lines
tsl/src/compression/compression.c               83.03%    16 Missing and 22 partials ⚠️
src/nodes/chunk_dispatch/chunk_insert_state.c   75.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7048      +/-   ##
==========================================
+ Coverage   80.06%   81.86%   +1.79%     
==========================================
  Files         190      200      +10     
  Lines       37181    37297     +116     
  Branches     9450     9724     +274     
==========================================
+ Hits        29770    30533     +763     
+ Misses       2997     2861     -136     
+ Partials     4414     3903     -511     


@antekresic antekresic marked this pull request as ready for review June 20, 2024 07:59
Comment on lines +2277 to +2308
/* The index matches only if its last attribute is the
 * sequence number metadata column. */
if (index_rel->rd_index->indnatts - 1 == i)
{
	if (strcmp(attname, COMPRESSION_COLUMN_METADATA_SEQUENCE_NUM_NAME) == 0)
		matches = true;
	break;
}
Member

If some starting prefix of the index columns consists of segmentby columns, we can use the index for lookups by these segmentbys, no matter which index columns follow, right? Maybe we can simplify/generalize this condition accordingly.
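
A sketch of what that generalized check could look like (illustrative only; `segmentby_attnos` is a hypothetical bitmapset of segmentby attribute numbers):

#include "postgres.h"
#include "catalog/pg_index.h"
#include "nodes/bitmapset.h"
#include "utils/rel.h"

/* Hypothetical: the index can serve segmentby lookups as long as its
 * leading key column is a segmentby column, regardless of what follows. */
static bool
index_has_segmentby_prefix(Relation index_rel, Bitmapset *segmentby_attnos)
{
	Form_pg_index index = index_rel->rd_index;

	return index->indnkeyatts > 0 &&
		   bms_is_member(index->indkey.values[0], segmentby_attnos);
}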

Member

Hmm, ideally we want the index with the most columns matching the unique constraints, with bonus points for considering selectivity, and we could cache this selection for the chunk. But in the common case we only create one index on the compressed chunk, so maybe this is overkill.

Member

I think the general approach is even simpler than what we have now: count the length of the prefix that consists of segmentbys, then choose the index with the longest one. Or we could indeed factor in selectivity in the same way; that would be a nice addition, given that we have proper statistics for the segmentby columns.
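
A sketch of that selection policy (illustrative names; `segmentby_attnos` is again a hypothetical bitmapset of segmentby attribute numbers):

#include "postgres.h"
#include "access/genam.h"
#include "catalog/pg_index.h"
#include "nodes/bitmapset.h"
#include "nodes/pg_list.h"
#include "storage/lockdefs.h"
#include "utils/rel.h"

/* Hypothetical: length of the leading run of segmentby index columns. */
static int
segmentby_prefix_len(Form_pg_index index, Bitmapset *segmentby_attnos)
{
	int			len = 0;

	while (len < index->indnkeyatts &&
		   bms_is_member(index->indkey.values[len], segmentby_attnos))
		len++;
	return len;
}

/* Hypothetical: pick the index with the longest segmentby prefix. */
static Oid
choose_compressed_chunk_index(List *index_oids, Bitmapset *segmentby_attnos)
{
	Oid			best = InvalidOid;
	int			best_len = 0;
	ListCell   *lc;

	foreach(lc, index_oids)
	{
		Relation	index_rel = index_open(lfirst_oid(lc), AccessShareLock);
		int			len = segmentby_prefix_len(index_rel->rd_index,
											   segmentby_attnos);

		if (len > best_len)
		{
			best_len = len;
			best = RelationGetRelid(index_rel);
		}
		index_close(index_rel, AccessShareLock);
	}
	return best;
}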

Contributor Author

This will be removed by my upcoming work on removing sequence numbers, so I'll leave it as is for now.

Comment on lines 2259 to 2286
/* Must have at least two attributes. */
if (index_rel->rd_index->indnatts < 2)
{
	index_close(index_rel, AccessShareLock);
	continue;
}
Member

Should we filter this out specifically? There might be a user-created index on a particular segmentby column they need.

Contributor Author

Again, leaving this for the sequence number removal PR.

@antekresic antekresic merged commit 704bf53 into timescale:main Jun 25, 2024
40 checks passed
pallavisontakke added a commit to pallavisontakke/timescaledb that referenced this pull request Jul 18, 2024
This release contains performance improvements and bug fixes since
the 2.15.3 release. We recommend that you upgrade at the next
available opportunity.

**Features**
* timescale#6880: Add support for the array operators used for compressed DML batch filtering.
* timescale#6895: Improve the compressed DML expression pushdown.
* timescale#6897: Add support for replica identity on compressed hypertables.
* timescale#6918: Remove support for PG13.
* timescale#6920: Rework compression activity WAL markers.
* timescale#6989: Add support for foreign keys when converting plain tables to hypertables.
* timescale#7020: Add support for chunk column statistics tracking.
* timescale#7048: Add an index scan for INSERT DML decompression.
* timescale#7075: Reduce decompression on compressed INSERT.
* timescale#7101: Reduce decompressions for compressed UPDATE/DELETE.
* timescale#7108: Reduce decompressions for INSERTs with UNIQUE constraints.

**Bugfixes**
* timescale#7018: Fix `search_path` quoting in the compression defaults function.
* timescale#7046: Prevent locking for compressed tuples.
* timescale#7055: Fix the `scankey` for `segment by` columns, where the type of the `constant` is different from the `variable`.
* timescale#7064: Fix the bug in the default `order by` calculation in compression.
* timescale#7069: Fix the index column name usage.
* timescale#7074: Fix the bug in the default `segment by` calculation in compression.

@pallavisontakke pallavisontakke mentioned this pull request Jul 18, 2024