Optimise queries search for a chain of OR strings #3250

ironage · 2019-03-02T01:12:57Z

This is a performance enhancement motivated by users who are generating queries with many string comparisons on a single column, for example from cocoa's "IN" queries; see https://github.com/realm/engineering/issues/22

The idea is to combine the string equality conditions from a single "OR" query node and store them in an unordered_set. With N elements to search and C conditions, the runtime changes from O(N*C) to O(N). The added benchmark run goes from about 30 seconds to about 2 seconds in total. This change does not try to optimise indexed columns, which should already be running in O(log(N)*C). The benchmark with indexes turned on runs in 3.5 seconds. Since N is likely the dominant term, using indexes should still be fastest in practice when compared to this optimisation.
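As a rough illustration of the complexity argument above, here is a minimal standalone sketch. It does not use the query engine's types: the column is modelled as a plain `std::vector<std::string>`, and the function names `count_matches_naive` and `count_matches_hashed` are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Naive OR chain: every row is compared against every condition, O(N*C).
std::size_t count_matches_naive(const std::vector<std::string>& column,
                                const std::vector<std::string>& needles)
{
    std::size_t matches = 0;
    for (const auto& value : column) {
        for (const auto& needle : needles) {
            if (value == needle) {
                ++matches;
                break; // one matching condition is enough for an OR
            }
        }
    }
    return matches;
}

// Combined form: build the set once, then do one lookup per row.
// Each lookup is O(1) on average, so the scan is O(N) overall.
std::size_t count_matches_hashed(const std::vector<std::string>& column,
                                 const std::vector<std::string>& needles)
{
    const std::unordered_set<std::string> lookup(needles.begin(), needles.end());
    std::size_t matches = 0;
    for (const auto& value : column) {
        if (lookup.count(value) != 0)
            ++matches;
    }
    return matches;
}
```

Both functions return the same count for the same input; only the cost per row differs.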

First sort the conditions by column id. This lets us pass through the array of conditions only once. Compare column indices before making dynamic casts, and exit early if the column has a search index.
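The sort-then-single-pass idea could be sketched as follows. This is an assumption-laden illustration, not the code from the PR: `Condition` and `group_by_column` are hypothetical names, and the search-index early-out and the dynamic casts are not modelled.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a string equality condition on one column.
struct Condition {
    std::size_t column;
    std::string value;
};

using ColumnGroup = std::pair<std::size_t, std::unordered_set<std::string>>;

// Sort by column id so that conditions on the same column become adjacent,
// then collect each run into one set in a single pass.
std::vector<ColumnGroup> group_by_column(std::vector<Condition> conditions)
{
    std::stable_sort(conditions.begin(), conditions.end(),
                     [](const Condition& a, const Condition& b) {
                         return a.column < b.column;
                     });

    std::vector<ColumnGroup> groups;
    for (const auto& condition : conditions) {
        // A cheap integer comparison decides whether a new group starts.
        if (groups.empty() || groups.back().first != condition.column)
            groups.emplace_back(condition.column, std::unordered_set<std::string>{});
        groups.back().second.insert(condition.value);
    }
    return groups;
}
```

Because the input is sorted first, each column appears in exactly one group, and duplicate values on the same column collapse inside the set.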

finnschiermer

Nice work, but see comment.

finnschiermer · 2019-03-04T16:15:14Z

src/realm/query_engine.hpp

+        auto it = m_conditions.begin();
+        while (it != m_conditions.end()) {
+            // Only try to optimize on StringNode<Equal> conditions without search index
+            if (bool(*it) && (first = dynamic_cast<StringNode<Equal>*>(it->get())) && !first->has_search_index()) {


Slightly confused by this bool(*it)... If it is necessary here, then how come we don't need it before dereferencing "next" in line 1691?

Perhaps some simplification is possible?

Thanks, I removed the check on bool(*it); it should be covered by the dynamic cast anyway.
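The reasoning behind dropping the explicit check is that `dynamic_cast` applied to a null pointer yields a null pointer, so an empty smart pointer and a node of the wrong type are rejected by the same test. A small sketch with hypothetical types (`Node`, `EqualNode`, `OtherNode`, and `as_equal` are illustrative only, not the engine's classes):

```cpp
#include <memory>

// Hypothetical polymorphic hierarchy standing in for the query nodes.
struct Node {
    virtual ~Node() = default;
};
struct EqualNode : Node {};
struct OtherNode : Node {};

// Returns the node as EqualNode*, or nullptr when the pointer is empty
// or points at a different node type. No separate null check is needed.
EqualNode* as_equal(const std::unique_ptr<Node>& node)
{
    return dynamic_cast<EqualNode*>(node.get());
}
```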

ironage · 2019-03-04T22:41:15Z

Here are the actual benchmark results for reference (100000 rows, 1000 query conditions):

Before:

QueryChainedOrStrings (MemOnly, EncryptionOff):   min 353.80ms     max 497.77ms     median 399.02ms     avg 394.69ms     stddev  44.38ms
QueryChainedOrStrings (MemOnly, EncryptionOn):    min 351.54ms     max 456.18ms     median 368.16ms     avg 375.97ms     stddev  32.92ms
QueryChainedOrStrings (Full   , EncryptionOff):   min 350.13ms     max 360.58ms     median 351.87ms     avg 352.84ms     stddev   3.24ms
QueryChainedOrStrings (Full   , EncryptionOn):    min 350.49ms     max 400.45ms     median 352.64ms     avg 357.29ms     stddev  15.26ms

test/benchmark-common-tasks/realm-benchmark-common-tasks  28.96s user 0.22s system 94% cpu 30.743 total

After:

QueryChainedOrStrings (MemOnly, EncryptionOff):     min      5ms (-98.59%)           max   7.27ms (-98.54%)           med   5.77ms (-98.55%)           avg   5.81ms (-98.53%)           stddev   658us (-98.52%)
QueryChainedOrStrings (MemOnly, EncryptionOn):      min   4.92ms (-98.60%)           max   5.77ms (-98.74%)           med   5.19ms (-98.59%)           avg   5.20ms (-98.62%)           stddev   242us (-99.26%)
QueryChainedOrStrings (Full   , EncryptionOff):     min   4.87ms (-98.61%)           max   5.38ms (-98.51%)           med   4.95ms (-98.59%)           avg      5ms (-98.58%)           stddev   134us (-95.88%)
QueryChainedOrStrings (Full   , EncryptionOn):      min   4.93ms (-98.59%)           max   5.43ms (-98.64%)           med   5.21ms (-98.52%)           avg   5.20ms (-98.54%)           stddev   158us (-98.97%)

test/benchmark-common-tasks/realm-benchmark-common-tasks  0.81s user 0.16s system 42% cpu 2.271 total

tgoyne · 2019-03-04T23:37:26Z

What's the difference like with two conditions? I'd expect this to be slower for a sufficiently small number of conditions, but whether or not that's anything worth caring about depends on how much slower it is and where the break even point is.

ironage · 2019-03-05T01:15:45Z

It's a good point: there is an overhead cost to computing a string hash. The majority of the time is spent computing our custom StringData hash (not sure how performant it tries to be). That being the case, I added a simple loop that iterates through the conditions and checks for matches without hashing anything. Based on the following tests, I found that the threshold up to which this list search beats the hash is around 20 conditions.

(Times are in milliseconds; the first column is the number of conditions.)

conditions	no optimisations	unordered_set (hashing)	list search
1
2	0.7	1.8	0.3
3	1	3.5	0.4
4	1.4	4	0.7
5	1.7	4	0.8
10	5.4	4	1.9
20	10.5	4	3.5
30	16	3.6	5
50	25.5	3.2	8.5
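A switch between the two strategies measured above might look roughly like this. It is only a sketch: the class name `StringOrMatcher`, its members, and the default `hash_threshold` of 20 (taken from the measurements in this comment) are assumptions, and `std::string` stands in for StringData.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

class StringOrMatcher {
public:
    explicit StringOrMatcher(std::vector<std::string> needles,
                             std::size_t hash_threshold = 20)
        : m_needles(std::move(needles))
    {
        // Only pay for hashing when there are enough conditions to amortise it.
        if (m_needles.size() >= hash_threshold)
            m_lookup.insert(m_needles.begin(), m_needles.end());
    }

    bool uses_hash() const
    {
        return !m_lookup.empty();
    }

    bool matches(const std::string& value) const
    {
        if (uses_hash())
            return m_lookup.count(value) != 0;
        // Few conditions: a plain linear scan avoids computing any hash.
        return std::find(m_needles.begin(), m_needles.end(), value) != m_needles.end();
    }

private:
    std::vector<std::string> m_needles;
    std::unordered_set<std::string> m_lookup;
};
```

The threshold is a tuning parameter; the break-even point in the table sits somewhere between 20 and 30 conditions and will likely vary with string length and hash cost.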


Optimise queries search for a chain of OR strings

d973a7c

ironage added the T-Enhancement label Mar 2, 2019

ironage self-assigned this Mar 2, 2019

ironage requested review from jedelbo and finnschiermer March 2, 2019 01:12

jedelbo force-pushed the js/or-chains branch from 47c8177 to 2f5bb0e Compare March 4, 2019 15:16

Optimizing a bit

1011e96


jedelbo force-pushed the js/or-chains branch from 2f5bb0e to 1011e96 Compare March 4, 2019 15:26

finnschiermer reviewed Mar 4, 2019


Remove an unnecessary cast

e4d5139

Further optimise for few conditions

24cfb23

jedelbo approved these changes Mar 5, 2019


jedelbo merged commit 993363c into master Mar 5, 2019

jedelbo deleted the js/or-chains branch March 5, 2019 10:12

Zhuinden mentioned this pull request May 20, 2019

Query with in() returns an empty RealmResults realm/realm-java#6522

Closed

tgoyne mentioned this pull request Sep 12, 2019

Filtering objects with IN query returning incorrect results realm/realm-swift#6249

Closed

github-actions bot locked as resolved and limited conversation to collaborators Mar 23, 2024
