Optimize planning times when hypertables have many chunks #502
Conversation
cevian requested review from erimatnor, RobAtticus and davidkohn88 on Apr 17, 2018
cevian force-pushed the plan_expand_hypertables branch 5 times, most recently from 4c12952 to 931d7ca on Apr 17, 2018
This replaces #471.
RobAtticus reviewed on Apr 20, 2018
Quick review; will let others take a deeper dive.
chunk_scan_ctx_init(&ctx, hs, NULL);

/* Abort the scan when the chunk is found */
ctx.early_abort = false;
RobAtticus (Member) on Apr 20, 2018
Comment and code don't match, I think: it appears that you are not ending the scan when the chunk is found.
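If the intent really is to end the scan early (a guess; the thread doesn't say which side is wrong), the fix would just be to flip the flag:

/* Abort the scan when the chunk is found */
ctx.early_abort = true;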
find_children_oids(HypertableRestrictInfo *hri, Hypertable *ht, LOCKMODE lockmode)
{
	/*
	 * optimization: using the HRI only makes sense if we are not using all the
bool inhparent,
RelOptInfo *rel)
{
	RangeTblEntry *rte = rt_fetch(rel->relid, root->parse->rtable);
RobAtticus reviewed on Apr 24, 2018
Index rti = rel->relid;
List *appinfos = NIL;
HypertableRestrictInfo *hri;
PlanRowMark *oldrc;
RobAtticus (Member) on Apr 24, 2018
Actually never mind, I see it getting used, so I'm not sure why it complains.
RobAtticus (Member) on Apr 25, 2018
I guess because it's used in an Assert and nowhere else, it gets stripped out in Release builds. May need to use it in a no-op to get rid of the warning.
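For reference, a minimal sketch of the two usual fixes (assuming PostgreSQL's PG_USED_FOR_ASSERTS_ONLY macro, which marks a variable as read only inside Assert()):

/* Option 1: annotate the declaration so Release builds don't warn */
PlanRowMark *oldrc PG_USED_FOR_ASSERTS_ONLY;

/* Option 2: a no-op use that survives when Assert() compiles away */
(void) oldrc;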
goodkiller commented on Apr 25, 2018
Please fix this ASAP, because I have approx. 50 chunks and queries are very slow. What is the estimated fix date?
Hi @goodkiller, thanks for your interest in this PR. We're in the process of reviewing this new functionality and we aim to get it in for the next release. You mentioned ~50 chunks for your setup, which doesn't seem like a lot, actually. Wondering if you are experiencing some other issue?
goodkiller commented on Apr 25, 2018
Hi @erimatnor
cevian force-pushed the plan_expand_hypertables branch from 931d7ca to 2787ecd on Apr 25, 2018
@erimatnor @cevian Benchmark numbers look good to me. Big improvement for a dataset with 4000+ chunks (600ms -> 36ms) and a more modest improvement for one with only about 6 chunks (6.6ms -> 5.9-6ms). So even if it doesn't do a ton for the low end, it's not hurting performance and is a big boon for the many-chunks case. Pending the fixes I suggested, it has my approval.
erimatnor requested changes on Apr 26, 2018
Overall, I think the optimization is good. A bunch of nits and suggestions though.
 */
hri = hypertable_restrict_info_create(rel, ht);
hypertable_restrict_info_add(hri, root, restrictinfo);
inhOIDs = find_children_oids(hri, ht, lockmode);
foreach(l, inhOIDs)
{
	Oid childOID = lfirst_oid(l);
Oid childOID = lfirst_oid(l);
Relation newrelation;
RangeTblEntry *childrte;
Index childRTindex;
{
	RangeTblEntry *rte = rt_fetch(rel->relid, root->parse->rtable);
	List *inhOIDs;
	Oid parentOID = relationObjectId;
RelOptInfo *rel)
{
	RangeTblEntry *rte = rt_fetch(rel->relid, root->parse->rtable);
	List *inhOIDs;
				lockmode));
	return result;
}
else
{
	DimensionRestrictInfo *dri = dimension_restrict_info_create(&ht->space->dimensions[i]);

	res->dimension_restriction[AttrNumberGetAttrOffset(ht->space->dimensions[i].column_attno)] = dri;
erimatnor (Contributor) on Apr 26, 2018
Ohh, I see you are indexing by column_attno instead of dimension ID, so the array can be sparse, hence the pointer array. Is this ideal/necessary? Imagine a table with 100+ columns (which we've seen) where time is last; that would create a really sparse array.
Is it necessary to optimize getting the restriction from the array by attno? Without this, fetching would only be O(n) in the number of dimensions, and could be optimized with a hash table or tree if it became an issue (which it really wouldn't, unless there were a really large number of dimensions).
cevian (Author, Contributor) on Apr 29, 2018
I believe this is the most efficient representation of this structure because it is most often accessed by attribute number. There is a max number of attributes in PostgreSQL (1600), and each column only takes the size of a pointer, so I don't believe the size here is really an issue. I'd rather make the access as efficient as possible.
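For context, a minimal sketch of the lookup being debated (struct and function names here are illustrative, not the PR's actual code):

/*
 * Sparse array keyed by attribute number: O(1) lookup, at most
 * MaxHeapAttributeNumber pointer-sized slots, most of them NULL.
 */
typedef struct ExampleRestrictInfo
{
	int num_slots; /* large enough for the highest column_attno */
	DimensionRestrictInfo **dimension_restriction;
} ExampleRestrictInfo;

static DimensionRestrictInfo *
example_get_by_attno(ExampleRestrictInfo *eri, AttrNumber attno)
{
	/* AttrNumberGetAttrOffset converts the 1-based attno to a 0-based index */
	return eri->dimension_restriction[AttrNumberGetAttrOffset(attno)];
}

The alternative suggested in the review would instead loop over the (typically one or two) dimensions comparing column_attno, an O(n) scan that is effectively just as fast for such small n.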
foreach(lc, query->rtable)
{
	RangeTblEntry *rte = lfirst(lc);
	Hypertable *ht = hypertable_cache_get_entry(hc, rte->relid);
erimatnor (Contributor) on Apr 26, 2018
Is this guaranteed to be non-NULL? Maybe add an Assert() to make this clear.
cevian (Author, Contributor) on Apr 29, 2018
No, it can be NULL. plan_expand_hypertable_valid_hypertable handles the NULL case.
@@ -323,18 +404,53 @@ timescaledb_set_rel_pathlist(PlannerInfo *root,
	cache_release(hcache);
}

static void
timescaledb_get_relation_info_hook(PlannerInfo *root,
erimatnor (Contributor) on Apr 26, 2018
What is the reasoning behind expanding the append relation in this hook? Not saying it is wrong, but it seems non-obvious. At the least, there should be a comment explaining this, and what this hook function does in general (i.e., that it expands the hypertable).
 *
 * Slow planning times were previously seen because `expand_inherited_tables` expands all chunks of
 * a hypertable, without regard to constraints present in the query. Then, `get_relation_info` is
 * then called on all chunks before constraint exclusion. Getting the statistics on many chunks ends
cevian force-pushed the plan_expand_hypertables branch 2 times, most recently from 717fa69 to 856b2e5 on Apr 29, 2018
@RobAtticus @erimatnor Fixed all your comments (unless I replied directly to the message).
cevian force-pushed the plan_expand_hypertables branch 3 times, most recently from 7f00c74 to f0e23c2 on Apr 29, 2018
cevian referenced this pull request on Apr 30, 2018: change planner cost for (merge)append nodes #500 (Closed)
mfreed added this to the 0.10.0 milestone on May 7, 2018
mfreed referenced this pull request on May 7, 2018: Move space-partition exclusion to planner. #471 (Closed)
RobAtticus approved these changes on May 7, 2018
sspieser referenced this pull request on May 14, 2018: Performance issues when using 10,000s of chunks #515 (Closed)
erimatnor requested changes on May 14, 2018
chunk_scan_ctx_foreach_chunk(ctx, chunk_is_complete, 1);

return (ctx->data == NIL ? NULL : linitial(ctx->data));
erimatnor (Contributor) on May 14, 2018
Can we make this function simply a wrapper around ...get_chunk_list?
}

/* Get a list of chunks that each have N matching dimension constraints */
chunk_list = chunk_scan_ctx_get_chunk_list(&ctx);
erimatnor (Contributor) on May 14, 2018
Can't you just iterate the chunk scan context here with your own per-chunk handler instead of first creating a list? It seems you are adding new functionality when equivalent functionality already exists, iterating the information twice and doing unnecessary allocations.
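As a sketch, the suggestion amounts to something like the following (handler name hypothetical; the chunk_scan_ctx_foreach_chunk and chunk_is_complete signatures are assumed from the snippets quoted in this review):

/*
 * Hypothetical per-chunk handler: collect each complete chunk's OID
 * directly during the scan instead of materializing a Chunk list first.
 */
static bool
collect_complete_chunk(ChunkScanCtx *scanctx, Chunk *chunk)
{
	if (!chunk_is_complete(scanctx, chunk))
		return false;

	scanctx->data = lappend_oid(scanctx->data, chunk->table_id);
	return false; /* false = keep scanning */
}

/* One pass over the scan context does both the filtering and the collection */
chunk_scan_ctx_foreach_chunk(&ctx, collect_complete_chunk, 0);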
	return true;
}
else if (other->fd.range_start > coord &&
		 other->fd.range_start < to_cut->fd.range_end)
{
	/* Cut "after" the coordinate */
	to_cut->fd.range_end = other->fd.range_start;
	}
}

bool
{
	DimensionRestrictInfo *dri = dimension_restrict_info_create(&ht->space->dimensions[i]);

	res->dimension_restriction[AttrNumberGetAttrOffset(ht->space->dimensions[i].column_attno)] = dri;
erimatnor (Contributor) on May 14, 2018
Still not sure about this sparse array. I think the most common case by far is 1 or 2 dimensions, so lookup by iterating the dimensions shouldn't be much worse than array indexing, at least not in any way that matters. I think it is a lot more common to have many columns, potentially partitioning on a high attribute number, than having lots of dimensions. If this proves a problem in the future, we can optimize with a hashtable or similar.
cevian (Author, Contributor) on May 16, 2018
While I agree a list would probably not be /bad/, I think the sparse array is more efficient because of the O(1) lookup. Since we may have many clauses, I'm not sure why we wouldn't use this. The memory usage is limited, as I mentioned before.
erimatnor (Contributor) on May 18, 2018
There's really only a benefit to O(1) lookups when you have big data sets, not with one or two elements, which is the common case here. I mean, honestly, most of the time you are creating a sparse array with one single element! (Or am I missing something?) This seems like over-engineering of an otherwise very simple thing. I wouldn't push back if you had a strong argument here, like showing an important efficiency improvement (e.g., significantly faster planning times). But I think, when in doubt, we should go for simplicity and maintainability of the code, with the option of optimizing in the future.
Since this seems like a "won't fix", I guess you strongly believe this is an important efficiency/speed optimization, to the extent that it is worth pushing it through. Thus I won't block the PR on this.
dimension_restrict_info_closed_slices(DimensionRestrictInfoClosed *dri)
{
	if (dri->strategy == BTEqualStrategyNumber)
	{
/* Since baserestrictinfo is not yet set by the planner, we have to derive
 * it ourselves. It's safe for us to miss some restrict info clauses (this
 * will just result in more chunks being included), so this does not need
List *result;

/*
 * optimization: using the HRI only makes sense if we are not using all
erimatnor (Contributor) on May 14, 2018
Ambiguous comment: Is this optimization done now (doesn't look like it), or is it suggested?
Oid parent_oid = relation_objectid;
ListCell *l;
Relation oldrelation = heap_open(parent_oid, NoLock);
LOCKMODE lockmode = AccessShareLock;
erimatnor (Contributor) on May 14, 2018
Why does this need to be a variable? I don't see it set anywhere else.
{
	RangeTblEntry *rte = rt_fetch(rel->relid, root->parse->rtable);
	List *inh_oids;
	Oid parent_oid = relation_objectid;
erimatnor (Contributor) on May 14, 2018
Why this extra variable? I don't see it set anywhere. Is it a naming-clarity issue? If so, why not just use that name for the function parameter?
cevian force-pushed the plan_expand_hypertables branch from f0e23c2 to 67b28dd on May 16, 2018
cevian force-pushed the plan_expand_hypertables branch from 67b28dd to 7c611d0 on May 16, 2018
@erimatnor ready for another review
Build is broken @cevian
erimatnor requested changes on May 18, 2018
Only a few remaining things.
Assert(rti != parse->resultRelation);
oldrc = get_plan_rowmark(root->rowMarks, rti);
if (oldrc && RowMarkRequiresRowShareLock(oldrc->markType))
{
chunk_scan_ctx_destroy(&ctx);

foreach(lc, oid_list)
erimatnor (Contributor) on May 18, 2018
Why not also do this work (locking) in append_chunk_oid (which is what I meant in my previous comment)? You are still iterating twice here, and then I presume once more when creating the appendInfos. That's at least three iterations over the same data. Ideally, you'd do all the work in one iteration. Any reason not to?
cevian force-pushed the plan_expand_hypertables branch from 6e67bd1 to b014c87 on May 18, 2018
erimatnor approved these changes on May 25, 2018
Some nits.
append_chunk_oid(ChunkScanCtx *scanctx, Chunk *chunk)
{
	if (chunk_is_complete(scanctx, chunk))
	{
erimatnor (Contributor) on May 25, 2018
This is a bit of a style choice, and not a big issue for a small function, but I tend to favor early exits, in this case:

if (!chunk_is_complete(scanctx, chunk))
	return false;

This makes the code easier to read because there is less indentation and nesting, and you do not need to go to the end of the function to know whether the "negative" case means an exit or executing some other code.
}

static DimensionRestrictInfo *
hypertable_restrict_info_get(HypertableRestrictInfo *hri, int attno)
cevian commented on Apr 17, 2018
This planner optimization reduces planning times when a hypertable has many chunks. It does this by expanding hypertable chunks manually, eliding the expand_inherited_tables logic used by PG.

Slow planning times were previously seen because expand_inherited_tables expands all chunks of a hypertable, without regard to constraints present in the query. Then, get_relation_info is called on all chunks before constraint exclusion. Getting the statistics on many chunks ends up being expensive because RelationGetNumberOfBlocks has to open the file for each relation. This gets even worse under high concurrency.

This logic solves the problem by expanding only the chunks needed to fulfil the query instead of all chunks. In effect, it moves chunk exclusion up in the planning process. Note that we actually don't use constraint exclusion here, but rather a variant of range exclusion implemented by HypertableRestrictInfo.
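As a rough illustration of the mechanism (a minimal sketch built on PostgreSQL's standard get_relation_info_hook extension point; everything except the hook type and _PG_init is hypothetical and not the PR's actual code):

#include "postgres.h"
#include "fmgr.h"
#include "optimizer/plancat.h" /* get_relation_info_hook */

static get_relation_info_hook_type prev_get_relation_info_hook = NULL;

/*
 * Expand the hypertable ourselves: build restriction info from the query,
 * exclude non-matching chunks via range exclusion, and append only the
 * surviving chunks to the planner's append relation.
 */
static void
example_get_relation_info_hook(PlannerInfo *root, Oid relation_objectid,
							   bool inhparent, RelOptInfo *rel)
{
	if (prev_get_relation_info_hook != NULL)
		prev_get_relation_info_hook(root, relation_objectid, inhparent, rel);

	if (is_hypertable(relation_objectid)) /* hypothetical check */
		expand_hypertable_chunks(root, relation_objectid, rel); /* hypothetical */
}

void
_PG_init(void)
{
	prev_get_relation_info_hook = get_relation_info_hook;
	get_relation_info_hook = example_get_relation_info_hook;
}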