Skip to content
This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

Use in-memory join when one of the tables is small enough #577

Closed
erizocosmico opened this issue Dec 21, 2018 · 3 comments
Closed

Use in-memory join when one of the tables is small enough #577

erizocosmico opened this issue Dec 21, 2018 · 3 comments
Assignees
Labels
performance Performance improvements proposal proposal for new additions or changes

Comments

@erizocosmico
Copy link
Contributor

erizocosmico commented Dec 21, 2018

This is just a future idea that I thought of leaving here for the future, as I was researching how to improve perf of the joins.

If we have two sides on a join we could try to use an in-memory cache for the smaller side instead of iterating it over and over. Depending on how expensive the underlying implementation is of iterating all over the rows, this might be a big performance boost.

What we need for this:

  • Be able to estimate the size of each side (which will be more accurate the less filters and so on you have) if it's possible (inner join of an inner join might be tricky, for example).
  • Optimisation rule that checks the size of each inner join side and if it can, replace it with an inner join that keeps the smaller side in memory instead of iterating it again and again.

Caveats:

  • Estimating the size will be real hard and it's not portable, each impl must provide their own.
  • Hard to decide a limit set in stone that's sensible for the in-memory join.

Sadly, this does not solve #454, because it can't and shouldn't be applied in every case. I've been digging on how to make a join not iterate one side N times, but it's the only way to do it without having a whole side in memory, which is not viable for large datasets.

@erizocosmico erizocosmico added performance Performance improvements proposal proposal for new additions or changes labels Dec 21, 2018
@erizocosmico
Copy link
Contributor Author

This should not be on TODO, it cannot be done without a cost estimation component.

@smola
Copy link
Collaborator

smola commented Jan 28, 2019

Does vitess parser support query hints?
Would it be feasible to implement a query hint for this in a comment?

@ajnavarro
Copy link
Contributor

@smola query hints are comments and we are removing them: https://github.com/src-d/go-mysql-server/blob/master/sql/parse/parse.go#L59 , so they might not be supported by vitess.

The other day I had an idea to avoid cost estimation component.

Right now we are iterating from every row of one side of the join all the rows from the other side.

We can check on the first pass if all the rows of the right will fit in memory (available memory will be defined by configuration) and if true we will put it on memory on the next pass.

The problem with that is that we don't know which part of the join is the smallest one, so we can only apply this logic to one of the join children always.

@erizocosmico erizocosmico self-assigned this Mar 15, 2019
erizocosmico added a commit to erizocosmico/go-mysql-server that referenced this issue Mar 15, 2019
Fixes src-d#577

Because we do not have a way to estimate the cost of each side of
a join, it is really difficult to know when we can compute one in
memory. But not doing so, causes inner joins to be painfully slow,
as one of the branches is iterated multiple times.

This PR addresses this by ensuring that if the right branch of the
inner join fits in memory, it will be computed in memory even if
the in-memory mode has not been activated by the user.

An user can set the maximum threshold of memory the gitbase server
can use before considering the joins should not be performed in
memory using the `MAX_MEMORY_INNER_JOIN` environment variable or
the `max_memory_joins` session variable specifying the number of
bytes. The default value for this is the half of the available
physical memory on the operating system.

Because previously we had two iterators: `innerJoinIter` and
`innerJoinMemoryIter`, and now `innerJoinIter` must be able to do
the join in memory, `innerJoinMemoryIter` has been removed and
`innerJoinIter` replaced with a version that can work with three
modes:
- `unknownMode` we don't know yet how to perform the join, so keep
iterating until we can find out. By the end of the first full pass
over the right branch `unknownMode` will either switch to
`multipassMode` or `memoryMode`.
- `memoryMode` which computes the rest of the join in memory. The
iterator can have this mode before starting iterating if the user
activated the in memory join via session or environment vars, in
which case it will load all the right side on memory before doing
any further iteration. Instead, if the iterator started in
`unknownMode` and switched to this mode, it's guaranteed to already
have loaded all the right side. From that point on, they work in
exactly the same way.
- `multipassMode`, which was the previous default mode. Iterate the
right side of the join for each row in the left side. More expensive,
but less memory consuming. The iterator can not start in this mode,
and can only be switched to it from `unknownMode` in case the
memory used by the gitbase server exceeds the maximum amount of memory
either set by the user or by default.

Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
erizocosmico added a commit to erizocosmico/go-mysql-server that referenced this issue Mar 15, 2019
Fixes src-d#577

Because we do not have a way to estimate the cost of each side of
a join, it is really difficult to know when we can compute one in
memory. But not doing so, causes inner joins to be painfully slow,
as one of the branches is iterated multiple times.

This PR addresses this by ensuring that if the right branch of the
inner join fits in memory, it will be computed in memory even if
the in-memory mode has not been activated by the user.

An user can set the maximum threshold of memory the gitbase server
can use before considering the joins should not be performed in
memory using the `MAX_MEMORY_INNER_JOIN` environment variable or
the `max_memory_joins` session variable specifying the number of
bytes. The default value for this is the half of the available
physical memory on the operating system.

Because previously we had two iterators: `innerJoinIter` and
`innerJoinMemoryIter`, and now `innerJoinIter` must be able to do
the join in memory, `innerJoinMemoryIter` has been removed and
`innerJoinIter` replaced with a version that can work with three
modes:
- `unknownMode` we don't know yet how to perform the join, so keep
iterating until we can find out. By the end of the first full pass
over the right branch `unknownMode` will either switch to
`multipassMode` or `memoryMode`.
- `memoryMode` which computes the rest of the join in memory. The
iterator can have this mode before starting iterating if the user
activated the in memory join via session or environment vars, in
which case it will load all the right side on memory before doing
any further iteration. Instead, if the iterator started in
`unknownMode` and switched to this mode, it's guaranteed to already
have loaded all the right side. From that point on, they work in
exactly the same way.
- `multipassMode`, which was the previous default mode. Iterate the
right side of the join for each row in the left side. More expensive,
but less memory consuming. The iterator can not start in this mode,
and can only be switched to it from `unknownMode` in case the
memory used by the gitbase server exceeds the maximum amount of memory
either set by the user or by default.

Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
erizocosmico added a commit to erizocosmico/go-mysql-server that referenced this issue Mar 19, 2019
Fixes src-d#577

Because we do not have a way to estimate the cost of each side of
a join, it is really difficult to know when we can compute one in
memory. But not doing so, causes inner joins to be painfully slow,
as one of the branches is iterated multiple times.

This PR addresses this by ensuring that if the right branch of the
inner join fits in memory, it will be computed in memory even if
the in-memory mode has not been activated by the user.

An user can set the maximum threshold of memory the gitbase server
can use before considering the joins should not be performed in
memory using the `MAX_MEMORY_INNER_JOIN` environment variable or
the `max_memory_joins` session variable specifying the number of
bytes. The default value for this is the half of the available
physical memory on the operating system.

Because previously we had two iterators: `innerJoinIter` and
`innerJoinMemoryIter`, and now `innerJoinIter` must be able to do
the join in memory, `innerJoinMemoryIter` has been removed and
`innerJoinIter` replaced with a version that can work with three
modes:
- `unknownMode` we don't know yet how to perform the join, so keep
iterating until we can find out. By the end of the first full pass
over the right branch `unknownMode` will either switch to
`multipassMode` or `memoryMode`.
- `memoryMode` which computes the rest of the join in memory. The
iterator can have this mode before starting iterating if the user
activated the in memory join via session or environment vars, in
which case it will load all the right side on memory before doing
any further iteration. Instead, if the iterator started in
`unknownMode` and switched to this mode, it's guaranteed to already
have loaded all the right side. From that point on, they work in
exactly the same way.
- `multipassMode`, which was the previous default mode. Iterate the
right side of the join for each row in the left side. More expensive,
but less memory consuming. The iterator can not start in this mode,
and can only be switched to it from `unknownMode` in case the
memory used by the gitbase server exceeds the maximum amount of memory
either set by the user or by default.

Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
performance Performance improvements proposal proposal for new additions or changes
Projects
None yet
Development

No branches or pull requests

3 participants