plan: compute all inner joins in memory if they fit #638

erizocosmico · 2019-03-15T13:25:28Z

Fixes #577

Because we do not have a way to estimate the cost of each side of
a join, it is really difficult to know when we can compute one in
memory. Not doing so, however, makes inner joins painfully slow,
as one of the branches is iterated multiple times.

This PR addresses this by ensuring that if the right branch of the
inner join fits in memory, it will be computed in memory even if
the in-memory mode has not been activated by the user.

A user can set the maximum amount of memory the gitbase server
can use before joins are no longer performed in memory, using
either the `MAX_MEMORY_INNER_JOIN` environment variable or the
`max_memory_joins` session variable. Both take a number of bytes.
The default value is half of the available physical memory on
the operating system.
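
To illustrate how the threshold could be resolved, here is a minimal sketch. It is not the code from this PR: `resolveMaxMemory`, its `lookup` parameter and the hard-coded 8 GiB total are hypothetical, and the `max_memory_joins` session variable is not modeled because its precedence over the environment variable is not stated here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// maxMemoryEnvKey is the environment variable named in the PR description.
const maxMemoryEnvKey = "MAX_MEMORY_INNER_JOIN"

// resolveMaxMemory is a hypothetical helper, not the function from the PR.
// It returns the threshold in bytes: the value of the environment variable
// when it parses as a positive integer, otherwise half of totalPhysical.
func resolveMaxMemory(lookup func(string) (string, bool), totalPhysical uint64) uint64 {
	if raw, ok := lookup(maxMemoryEnvKey); ok {
		if n, err := strconv.ParseUint(raw, 10, 64); err == nil && n > 0 {
			return n
		}
	}
	return totalPhysical / 2
}

func main() {
	// 8 GiB stands in for the physical memory reported by the OS (assumption).
	const total uint64 = 8 << 30
	fmt.Println(resolveMaxMemory(os.LookupEnv, total))
}
```

Passing the lookup function as a parameter keeps the sketch testable without touching the real process environment.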

Previously we had two iterators, `innerJoinIter` and
`innerJoinMemoryIter`. Since `innerJoinIter` must now be able to do
the join in memory, `innerJoinMemoryIter` has been removed and
`innerJoinIter` has been replaced with a version that works in three
modes:

- `unknownMode`: we don't know yet how to perform the join, so we keep
  iterating until we can find out. By the end of the first full pass
  over the right branch, `unknownMode` will have switched to either
  `multipassMode` or `memoryMode`.
- `memoryMode`: computes the rest of the join in memory. The
  iterator can be in this mode before it starts iterating if the user
  activated in-memory joins via session or environment variables, in
  which case it loads the whole right side into memory before doing
  any further iteration. If instead the iterator started in
  `unknownMode` and switched to this mode, it is guaranteed to have
  already loaded the whole right side. From that point on, both cases
  work in exactly the same way.
- `multipassMode`: the previous default mode. It iterates the
  right side of the join once for each row in the left side, which is
  more expensive but consumes less memory. The iterator cannot start
  in this mode; it can only switch to it from `unknownMode`, when the
  memory used by the gitbase server exceeds the maximum amount,
  whether set by the user or by default.
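
The mode transitions described above can be sketched roughly as follows. This is a simplified illustration, not the `innerJoinIter` implementation from this PR: rows are plain structs joined on `key`, and `scanRight` (one full pass over the right branch) and `fits` (the memory check, counted in rows rather than bytes) are hypothetical stand-ins. Only the three mode names come from the description.

```go
package main

import "fmt"

type joinMode int

const (
	unknownMode joinMode = iota
	memoryMode
	multipassMode
)

type row struct {
	key int
	val string
}

type pair struct{ left, right row }

// innerJoin joins left and right rows on key and returns the final mode.
// Callers are expected to start in unknownMode or memoryMode.
func innerJoin(left []row, scanRight func() []row, fits func(n int) bool, mode joinMode) ([]pair, joinMode) {
	var cache []row
	if mode == memoryMode {
		// In-memory joins requested up front: load the whole right side first.
		cache = scanRight()
	}
	var out []pair
	for _, l := range left {
		right := cache
		if mode != memoryMode {
			right = scanRight() // one more pass over the right branch
		}
		for _, r := range right {
			if mode == unknownMode {
				if fits(len(cache) + 1) {
					cache = append(cache, r)
				} else {
					// Budget exceeded: drop the cache and stop trying.
					cache, mode = nil, multipassMode
				}
			}
			if l.key == r.key {
				out = append(out, pair{l, r})
			}
		}
		if mode == unknownMode {
			// The first full pass finished within budget.
			mode = memoryMode
		}
	}
	return out, mode
}

func main() {
	left := []row{{1, "a"}, {2, "b"}, {3, "c"}}
	rightData := []row{{2, "x"}, {3, "y"}, {3, "z"}}
	passes := 0
	scan := func() []row { passes++; return rightData }

	// Budget of 10 rows: the right side fits, so it is scanned only once.
	out, mode := innerJoin(left, scan, func(n int) bool { return n <= 10 }, unknownMode)
	fmt.Println(len(out), mode == memoryMode, passes)

	// Budget of 2 rows: the right side does not fit, so it is rescanned per left row.
	passes = 0
	out, mode = innerJoin(left, scan, func(n int) bool { return n <= 2 }, unknownMode)
	fmt.Println(len(out), mode == multipassMode, passes)
}
```

In this sketch the switch to `multipassMode` happens as soon as the budget is exceeded, while the switch to `memoryMode` happens only once the first pass has completed, so the cache is always complete when it is used.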

Signed-off-by: Miguel Molina <miguel@erizocosmi.co>

sql/plan/innerjoin.go

ajnavarro

LGTM after having a look at the requested changes. It would be great to have this documented in gitbase. Should we open an issue on gitbase so that we do not forget to update https://docs.sourced.tech/gitbase/using-gitbase/optimize-queries ?


erizocosmico · 2019-03-19T09:58:12Z

Fixed

erizocosmico · 2019-03-20T14:03:19Z

Shall we merge this? @ajnavarro

erizocosmico requested a review from a team March 15, 2019 13:25

erizocosmico force-pushed the feature/inmemjoin-small-tables branch from 45eb2f1 to da68eee Compare March 15, 2019 13:31

kuba-- reviewed Mar 15, 2019

ajnavarro reviewed Mar 19, 2019

erizocosmico force-pushed the feature/inmemjoin-small-tables branch from da68eee to 8f508ae Compare March 19, 2019 09:58

erizocosmico mentioned this pull request Mar 19, 2019

Update optimization guide with auto in-memory joins src-d/gitbase#734

Closed

ajnavarro approved these changes Mar 19, 2019

ajnavarro merged commit 0093a75 into src-d:master Mar 20, 2019
