Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance: Sorted Map KVs in memtable degrade import performance #1852

Closed
etiennedi opened this issue Mar 10, 2022 · 0 comments · Fixed by #1853
Closed

Performance: Sorted Map KVs in memtable degrade import performance #1852

etiennedi opened this issue Mar 10, 2022 · 0 comments · Fixed by #1853
Assignees

Comments

@etiennedi
Copy link
Member

On datasets that make heavy use of the inverted index, the changes introduced in #1832 degrade import performance.

This most likely (but not yet investigated) comes down to this method.

@etiennedi etiennedi self-assigned this Mar 10, 2022
etiennedi added a commit that referenced this issue Mar 10, 2022
The memtable for Map is a binary tree so it's always sorted. However,
since this is type 'Map' each "row key" holds a map. This map was
unsorted in the past. In #1832 we introduced a change that made sure
this change would always be sorted ON DISK, i.e. in the segments. It was
very natural to also keep it sorted in the memtable, as we did not have
to do any sorting when flushing. However, the performance tests on
imports that make heavy use of the inverted index had a large
performance degradation after #1832. In a test I did locally the import
time went up by over 30%.

This fix goes back to keeping the KV pairs unsorted and making each
change an append only operation. This means it now needs to be sorted in
just two places (as opposed to on every single insertion):

1. On a read query. Those should be rare on memtable, since memtables
   are mostly meant for writing. The added overhead here (minimal) is
   not a problem since it was also there before #1832
2. When flushing. Flushing is an async operation and the small overhead
   of sorting each row's Map KVs is neglible.

This new implementation has the same import speed as prior to #1832
while keeping all the runtime benefits of having the KV pairs sorted on
disk.

closes #1852
etiennedi added a commit that referenced this issue Mar 10, 2022
etiennedi added a commit that referenced this issue Mar 11, 2022
etiennedi added a commit that referenced this issue Mar 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant