This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

plan: compute all inner joins in memory if they fit #638

Merged on Mar 20, 2019 (1 commit)

Conversation

erizocosmico
Contributor

Fixes #577

Because we have no way to estimate the cost of each side of a join,
it is hard to know when one can be computed in memory. Not doing so,
however, makes inner joins painfully slow, because one of the
branches is iterated multiple times.

This PR addresses this by ensuring that if the right branch of the
inner join fits in memory, it will be computed in memory even if
the in-memory mode has not been activated by the user.

A user can set the maximum amount of memory the gitbase server may
use before joins are no longer performed in memory, using either the
`MAX_MEMORY_INNER_JOIN` environment variable or the
`max_memory_joins` session variable, given as a number of bytes.
The default value is half of the physical memory available on the
operating system.
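
As a rough illustration of the threshold lookup described above, here is a minimal sketch, not the code in this PR. The function name `maxMemoryForJoins`, its `lookup` parameter, and passing the total physical memory in as an argument are all assumptions made for the example; only the `MAX_MEMORY_INNER_JOIN` variable name and the "half of physical memory" default come from the description. The session variable is left out because its precedence relative to the environment variable is not stated here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Name of the environment variable, as given in the PR description.
const maxMemoryEnv = "MAX_MEMORY_INNER_JOIN"

// maxMemoryForJoins is a hypothetical helper: it returns the threshold in
// bytes. lookup has the same shape as os.LookupEnv so it can be faked in
// tests. totalPhysical is supplied by the caller (assumption: the real code
// obtains it from the operating system in some other way).
func maxMemoryForJoins(lookup func(string) (string, bool), totalPhysical uint64) uint64 {
	if v, ok := lookup(maxMemoryEnv); ok {
		// Accept only a positive integer number of bytes.
		if n, err := strconv.ParseUint(v, 10, 64); err == nil && n > 0 {
			return n
		}
	}
	// Unset or unparseable: fall back to half of physical memory.
	return totalPhysical / 2
}

func main() {
	// Pretend the machine has 8 GiB of physical memory.
	fmt.Println(maxMemoryForJoins(os.LookupEnv, 8<<30))
}
```

Falling back to the default on an unparseable value is a choice made for this sketch; rejecting the value with an error would be equally reasonable.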

Previously there were two iterators, `innerJoinIter` and
`innerJoinMemoryIter`. Since `innerJoinIter` must now be able to
perform the join in memory, `innerJoinMemoryIter` has been removed
and `innerJoinIter` replaced with a version that works in three
modes:

  • `unknownMode`: we do not know yet how to perform the join, so we
    keep iterating until we can find out. By the end of the first full
    pass over the right branch, the iterator will have switched from
    `unknownMode` to either `multipassMode` or `memoryMode`.
  • `memoryMode`: computes the rest of the join in memory. The
    iterator can start in this mode if the user activated the in-memory
    join via session or environment variables, in which case it loads
    the whole right side into memory before doing any further
    iteration. If the iterator instead started in `unknownMode` and
    switched to this mode, it is guaranteed to have already loaded the
    whole right side. From that point on, both cases work in exactly
    the same way.
  • `multipassMode`: the previous default mode. It iterates the right
    side of the join once for each row in the left side. This is more
    expensive but consumes less memory. The iterator cannot start in
    this mode; it can only switch to it from `unknownMode`, when the
    memory used by the gitbase server exceeds the maximum amount
    either set by the user or applied by default.
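
The mode transitions above can be sketched as follows. This is a simplified, eager illustration and not the iterator from this PR: the function `innerJoin`, the `row` type, the `scanRight` callback, and the join condition (equality on the first column) are hypothetical, and `maxCached` counts cached rows as a stand-in for the memory threshold, whereas the description refers to the memory used by the gitbase server. The forced start in `memoryMode` is omitted.

```go
package main

import "fmt"

type joinMode int

const (
	unknownMode joinMode = iota
	memoryMode
	multipassMode
)

type row []int

// innerJoin joins left with the rows returned by scanRight on l[0] == r[0].
// Each call to scanRight represents one full pass over the right branch.
// maxCached stands in for the memory threshold, counted in rows, not bytes.
func innerJoin(left []row, scanRight func() []row, maxCached int) (out []row, mode joinMode, passes int) {
	mode = unknownMode
	var cache []row
	for _, l := range left {
		var right []row
		if mode == memoryMode {
			right = cache // right side already fully loaded
		} else {
			right = scanRight() // unknownMode or multipassMode: iterate again
			passes++
		}
		for _, r := range right {
			if mode == unknownMode {
				if len(cache) < maxCached {
					cache = append(cache, r)
				} else {
					// Budget exceeded: drop the cache and stop caching.
					mode, cache = multipassMode, nil
				}
			}
			if l[0] == r[0] {
				out = append(out, append(append(row{}, l...), r...))
			}
		}
		if mode == unknownMode {
			// First full pass finished within budget.
			mode = memoryMode
		}
	}
	return out, mode, passes
}

func main() {
	left := []row{{1, 10}, {2, 20}, {3, 30}}
	right := []row{{1, 100}, {3, 300}}
	scan := func() []row { return right }

	out, mode, passes := innerJoin(left, scan, 2)
	fmt.Println(len(out), mode == memoryMode, passes)

	out, mode, passes = innerJoin(left, scan, 1)
	fmt.Println(len(out), mode == multipassMode, passes)
}
```

With a budget that fits the right side, the right branch is scanned once and later left rows are matched against the cache; with a budget that is too small, the sketch falls back to one scan per left row.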

Signed-off-by: Miguel Molina <miguel@erizocosmi.co>

@erizocosmico erizocosmico requested a review from a team March 15, 2019 13:25
@erizocosmico erizocosmico force-pushed the feature/inmemjoin-small-tables branch from 45eb2f1 to da68eee Compare March 15, 2019 13:31
sql/plan/innerjoin.go (outdated, resolved)
sql/plan/innerjoin.go (outdated, resolved)
Contributor

@ajnavarro ajnavarro left a comment

LGTM once the requested changes are addressed. It would be great to have this documented in gitbase. Should we open an issue on gitbase so we do not forget to update https://docs.sourced.tech/gitbase/using-gitbase/optimize-queries?

sql/plan/innerjoin.go (outdated, resolved)
@erizocosmico erizocosmico force-pushed the feature/inmemjoin-small-tables branch from da68eee to 8f508ae Compare March 19, 2019 09:58
@erizocosmico
Contributor Author

Fixed

@erizocosmico
Contributor Author

Shall we merge this? @ajnavarro

@ajnavarro ajnavarro merged commit 0093a75 into src-d:master Mar 20, 2019