-
Notifications
You must be signed in to change notification settings - Fork 109
Use in-memory join when one of the tables is small enough #577
Comments
This should not be on TODO, it cannot be done without a cost estimation component. |
Does vitess parser support query hints? |
@smola query hints are comments and we are removing them: https://github.com/src-d/go-mysql-server/blob/master/sql/parse/parse.go#L59 , so they might not be supported by vitess. The other day I had an idea to avoid cost estimation component. Right now we are iterating from every row of one side of the join all the rows from the other side. We can check on the first pass if all the rows of the right will fit in memory (available memory will be defined by configuration) and if true we will put it on memory on the next pass. The problem with that is that we don't know which part of the join is the smallest one, so we can only apply this logic to one of the join children always. |
Fixes src-d#577 Because we do not have a way to estimate the cost of each side of a join, it is really difficult to know when we can compute one in memory. But not doing so, causes inner joins to be painfully slow, as one of the branches is iterated multiple times. This PR addresses this by ensuring that if the right branch of the inner join fits in memory, it will be computed in memory even if the in-memory mode has not been activated by the user. An user can set the maximum threshold of memory the gitbase server can use before considering the joins should not be performed in memory using the `MAX_MEMORY_INNER_JOIN` environment variable or the `max_memory_joins` session variable specifying the number of bytes. The default value for this is the half of the available physical memory on the operating system. Because previously we had two iterators: `innerJoinIter` and `innerJoinMemoryIter`, and now `innerJoinIter` must be able to do the join in memory, `innerJoinMemoryIter` has been removed and `innerJoinIter` replaced with a version that can work with three modes: - `unknownMode` we don't know yet how to perform the join, so keep iterating until we can find out. By the end of the first full pass over the right branch `unknownMode` will either switch to `multipassMode` or `memoryMode`. - `memoryMode` which computes the rest of the join in memory. The iterator can have this mode before starting iterating if the user activated the in memory join via session or environment vars, in which case it will load all the right side on memory before doing any further iteration. Instead, if the iterator started in `unknownMode` and switched to this mode, it's guaranteed to already have loaded all the right side. From that point on, they work in exactly the same way. - `multipassMode`, which was the previous default mode. Iterate the right side of the join for each row in the left side. More expensive, but less memory consuming. The iterator can not start in this mode, and can only be switched to it from `unknownMode` in case the memory used by the gitbase server exceeds the maximum amount of memory either set by the user or by default. Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
Fixes src-d#577 Because we do not have a way to estimate the cost of each side of a join, it is really difficult to know when we can compute one in memory. But not doing so, causes inner joins to be painfully slow, as one of the branches is iterated multiple times. This PR addresses this by ensuring that if the right branch of the inner join fits in memory, it will be computed in memory even if the in-memory mode has not been activated by the user. An user can set the maximum threshold of memory the gitbase server can use before considering the joins should not be performed in memory using the `MAX_MEMORY_INNER_JOIN` environment variable or the `max_memory_joins` session variable specifying the number of bytes. The default value for this is the half of the available physical memory on the operating system. Because previously we had two iterators: `innerJoinIter` and `innerJoinMemoryIter`, and now `innerJoinIter` must be able to do the join in memory, `innerJoinMemoryIter` has been removed and `innerJoinIter` replaced with a version that can work with three modes: - `unknownMode` we don't know yet how to perform the join, so keep iterating until we can find out. By the end of the first full pass over the right branch `unknownMode` will either switch to `multipassMode` or `memoryMode`. - `memoryMode` which computes the rest of the join in memory. The iterator can have this mode before starting iterating if the user activated the in memory join via session or environment vars, in which case it will load all the right side on memory before doing any further iteration. Instead, if the iterator started in `unknownMode` and switched to this mode, it's guaranteed to already have loaded all the right side. From that point on, they work in exactly the same way. - `multipassMode`, which was the previous default mode. Iterate the right side of the join for each row in the left side. More expensive, but less memory consuming. The iterator can not start in this mode, and can only be switched to it from `unknownMode` in case the memory used by the gitbase server exceeds the maximum amount of memory either set by the user or by default. Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
Fixes src-d#577 Because we do not have a way to estimate the cost of each side of a join, it is really difficult to know when we can compute one in memory. But not doing so, causes inner joins to be painfully slow, as one of the branches is iterated multiple times. This PR addresses this by ensuring that if the right branch of the inner join fits in memory, it will be computed in memory even if the in-memory mode has not been activated by the user. An user can set the maximum threshold of memory the gitbase server can use before considering the joins should not be performed in memory using the `MAX_MEMORY_INNER_JOIN` environment variable or the `max_memory_joins` session variable specifying the number of bytes. The default value for this is the half of the available physical memory on the operating system. Because previously we had two iterators: `innerJoinIter` and `innerJoinMemoryIter`, and now `innerJoinIter` must be able to do the join in memory, `innerJoinMemoryIter` has been removed and `innerJoinIter` replaced with a version that can work with three modes: - `unknownMode` we don't know yet how to perform the join, so keep iterating until we can find out. By the end of the first full pass over the right branch `unknownMode` will either switch to `multipassMode` or `memoryMode`. - `memoryMode` which computes the rest of the join in memory. The iterator can have this mode before starting iterating if the user activated the in memory join via session or environment vars, in which case it will load all the right side on memory before doing any further iteration. Instead, if the iterator started in `unknownMode` and switched to this mode, it's guaranteed to already have loaded all the right side. From that point on, they work in exactly the same way. - `multipassMode`, which was the previous default mode. Iterate the right side of the join for each row in the left side. More expensive, but less memory consuming. The iterator can not start in this mode, and can only be switched to it from `unknownMode` in case the memory used by the gitbase server exceeds the maximum amount of memory either set by the user or by default. Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
This is just a future idea that I thought of leaving here for the future, as I was researching how to improve perf of the joins.
If we have two sides on a join we could try to use an in-memory cache for the smaller side instead of iterating it over and over. Depending on how expensive the underlying implementation is of iterating all over the rows, this might be a big performance boost.
What we need for this:
Caveats:
Sadly, this does not solve #454, because it can't and shouldn't be applied in every case. I've been digging on how to make a join not iterate one side N times, but it's the only way to do it without having a whole side in memory, which is not viable for large datasets.
The text was updated successfully, but these errors were encountered: