This repository has been archived by the owner on Jan 28, 2021. It is now read-only.
plan: compute all inner joins in memory if they fit #638
Merged
ajnavarro merged 1 commit into src-d:master from erizocosmico:feature/inmemjoin-small-tables on Mar 20, 2019
Conversation
45eb2f1
to
da68eee
Compare
kuba--
reviewed
Mar 15, 2019
ajnavarro
reviewed
Mar 19, 2019
LGTM after having a look at the requested changes. It would be great to have this documented in gitbase. Should we open an issue on gitbase so that we do not forget to update https://docs.sourced.tech/gitbase/using-gitbase/optimize-queries ?
da68eee
to
8f508ae
Compare
Fixed
ajnavarro
approved these changes
Mar 19, 2019
Shall we merge this? @ajnavarro
Fixes #577
Because we do not have a way to estimate the cost of each side of
a join, it is really difficult to know when we can compute one in
memory. But not doing so causes inner joins to be painfully slow,
as one of the branches is iterated multiple times.
This PR addresses this by ensuring that if the right branch of the
inner join fits in memory, it will be computed in memory even if
the in-memory mode has not been activated by the user.
A user can set the maximum amount of memory the gitbase server can
use before joins are no longer performed in memory, using the
`MAX_MEMORY_INNER_JOIN` environment variable or the
`max_memory_joins` session variable, specifying the number of
bytes. The default value is half of the available physical memory
on the operating system.
Because previously we had two iterators, `innerJoinIter` and
`innerJoinMemoryIter`, and now `innerJoinIter` must be able to do
the join in memory, `innerJoinMemoryIter` has been removed and
`innerJoinIter` replaced with a version that can work in three
modes:
- `unknownMode`: we don't know yet how to perform the join, so keep
  iterating until we can find out. By the end of the first full pass
  over the right branch, `unknownMode` will have switched to either
  `multipassMode` or `memoryMode`.
- `memoryMode`: computes the rest of the join in memory. The
  iterator can be in this mode before it starts iterating if the user
  activated the in-memory join via session or environment variables,
  in which case it loads the whole right side into memory before
  doing any further iteration. If instead the iterator started in
  `unknownMode` and switched to this mode, it is guaranteed to have
  already loaded the whole right side. From that point on, both cases
  work in exactly the same way.
- `multipassMode`: the previous default mode. Iterates the right
  side of the join for each row in the left side. More expensive,
  but less memory consuming. The iterator cannot start in this mode;
  it can only be switched to it from `unknownMode` when the memory
  used by the gitbase server exceeds the maximum amount of memory,
  either set by the user or the default.
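The mode transitions above can be sketched as follows. This is a simplified illustration under stated assumptions, not the `innerJoinIter` from this PR: it works on slices rather than row iterators, `source`, `row` and `innerJoin` are hypothetical names, and the memory check is replaced by a hypothetical `maxCachedRows` budget instead of measuring the memory used by the server.

```go
package main

import "fmt"

type joinMode int

const (
	unknownMode joinMode = iota
	memoryMode
	multipassMode
)

type row []int

// source stands in for the right branch; every call to scan is one full pass.
type source struct {
	rows   []row
	passes int
}

func (s *source) scan(visit func(row)) {
	s.passes++
	for _, r := range s.rows {
		visit(r)
	}
}

// innerJoin joins left with right and reports the mode it ended in.
// maxCachedRows is an assumed stand-in for the real memory limit.
func innerJoin(left []row, right *source, match func(l, r row) bool,
	mode joinMode, maxCachedRows int) ([]row, joinMode) {
	var cache, out []row
	emit := func(l, r row) {
		if match(l, r) {
			out = append(out, append(append(row{}, l...), r...))
		}
	}
	// Started in memoryMode: load the whole right side before iterating.
	if mode == memoryMode {
		right.scan(func(r row) { cache = append(cache, r) })
	}
	for _, l := range left {
		if mode == memoryMode {
			for _, r := range cache {
				emit(l, r)
			}
			continue
		}
		// unknownMode or multipassMode: read the right side again.
		right.scan(func(r row) {
			if mode == unknownMode {
				if len(cache) < maxCachedRows {
					cache = append(cache, r)
				} else {
					// Budget exceeded: give up caching for good.
					mode, cache = multipassMode, nil
				}
			}
			emit(l, r)
		})
		// A full pass finished within budget: the cache is complete.
		if mode == unknownMode {
			mode = memoryMode
		}
	}
	return out, mode
}

func main() {
	left := []row{{1}, {2}, {3}}
	eq := func(l, r row) bool { return l[0] == r[0] }

	small := &source{rows: []row{{2}, {3}}}
	out, mode := innerJoin(left, small, eq, unknownMode, 10)
	fmt.Println(out, mode == memoryMode, small.passes)

	big := &source{rows: []row{{2}, {3}}}
	out, mode = innerJoin(left, big, eq, unknownMode, 1)
	fmt.Println(out, mode == multipassMode, big.passes)
}
```

In this sketch the right side is read once when it fits in the budget and once per left row when it does not, which is the trade-off between the two final modes.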
Signed-off-by: Miguel Molina <miguel@erizocosmi.co>