Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix querying with a path into nested collections with wildcards #7404

Merged
merged 7 commits into from
Mar 7, 2024

Conversation

jedelbo
Copy link
Contributor

@jedelbo jedelbo commented Mar 1, 2024

Comparing a collection with a list could fail if there was wildcards in the path and therefore multiple collections to compare with right hand list.

Linklist is implicitly having wildcard in the path, so if linklists is in the path there will be a similar problem. This part is not solved.

What, How & Why?

Fixes #7393

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)
  • C-API, if public C++ API changed
  • bindgen/spec.yml, if public C++ API changed

Comparing a collection with a list could fail if there was wildcards
in the path and therefore multiple collections to compare with right
hand list.

Linklist is implicitly having wildcard in the path, so if linklists is
in the path there will be a similar problem. This part is not solved.
q = origin->query("link[0].list = {4, 5, 6}");
CHECK_EQUAL(q.count(), 1);
q = origin->query("link.list = {4, 5, 6}");
// CHECK_EQUAL(q.count(), 1); // Fails
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tricky thing is to define what the correct behaviour should be. The current behaviour is to try to do an element by element comparison on {4, 5, 6, 1, 2, 3} vs {4, 5, 6}. This is clearly broken and I think we should change it to use the same design that you have introduced in this PR. We could debate on if this is a breaking change or a fix, but I'd vote to call it a fix.

CHECK_EQUAL(q.count(), 1);
q = table->query("{3, 2, 1} = value[*][*]");
CHECK_EQUAL(q.count(), 1);
q = table->query("value[*][*] = dict[*][*]");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure that we also have meaningful comparisons when there is a expression comparison type involved as well. When comparing many-to-many lists, my intuition is that it is correct to apply ANY/ALL/NONE to the final list, not the amalgamation of all lists combined. I think this is how the code works as you have it written. But to be sure can you try adding a few combinations? eg:

ANY value[*][*] = ANY dict[*][*] // eg: {{1, 2, 3}, {4, 5, 6}} should match on {{1}, {10}} but not on {{4}, {1}}
ALL value[*][*] <= ANY dict[*][*] // eg: {{1, 2, 3}, {4, 5, 6}} should match on {{1, 10}, {10}} but not on {{2, 10}, {10}}
ALL value[*][*] = ALL dict[*][*]
NONE value[*][*] = ANY dict[*][*]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few more tests.

Copy link

coveralls-official bot commented Mar 4, 2024

Pull Request Test Coverage Report for Build jorgen.edelbo_134

Details

  • 323 of 332 (97.29%) changed or added relevant lines in 6 files are covered.
  • 84 unchanged lines in 17 files lost coverage.
  • Overall coverage increased (+0.005%) to 90.906%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/realm/query_expression.hpp 136 140 97.14%
src/realm/query_expression.cpp 51 56 91.07%
Files with Coverage Reduction New Missed Lines %
src/realm/query_expression.cpp 1 87.52%
test/fuzz_tester.hpp 1 59.58%
src/realm/array_string.cpp 2 85.25%
src/realm/cluster.cpp 2 77.52%
src/realm/sync/noinst/server/server_history.cpp 2 67.94%
test/test_sync.cpp 2 91.31%
src/realm/query_expression.hpp 3 93.76%
src/realm/sync/network/network.cpp 3 89.73%
src/realm/sync/transform.cpp 3 63.0%
src/realm/table.cpp 3 90.81%
Totals Coverage Status
Change from base Build 2106: 0.005%
Covered Lines: 238492
Relevant Lines: 262349

💛 - Coveralls

@jedelbo
Copy link
Contributor Author

jedelbo commented Mar 4, 2024

@ironage would you prefer that I fix the problem involving linklist before this is merged?

@jedelbo jedelbo requested a review from ironage March 5, 2024 15:40
@jedelbo
Copy link
Contributor Author

jedelbo commented Mar 5, 2024

@ironage I have now fixed the case where linklists are involved. What do you think about the cases, I have commented out?

Comment on lines 2217 to 2218
verify_query(test_context, t2, "ALL list.integers == 1", 2); // row 0 matches {1}. row 1 matches (any of) {} {1}
verify_query(test_context, t2, "NONE list.integers == 1", 1); // row 1 matches (any of) {}, {0}, {2}, {3} ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old behaviour is to combine all the lists along the chain into one big list, so the explanations for these results are wrong. I think the new behaviour is correct and much easier to reason about. I am in favour of changing this, I just think we should be very clear about it in the changelog in case someone was relying on it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, the right thing to do may be to increase the major version number again. Most SDKs haven't even released v14 yet so this breaking change would be lumped in to all the rest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is really weird that "ALL {} == 1" matches.

I am not sure we should bump the major. I would call this change a fix. The prior behavior was never really specified, was it? Also the fact that we only had to change 2 checks that had the comment that the result might be surpricing, suggest that this was just "works as implemented".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am quite in doubt how to describe what is breaking here. I think that in the case where you had ANY specified (implicitly or explicitly) and you compare with a single value or ANY {x,y,z}, then it gives the same results as before. And I think all other cases were broken.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. We can label it a fix, and say we never supported those other many-to-many lists comparisons. I agree that the old behaviour for those cases was strange/broken.

return m_destination_index < m_ctrl.matches.size();
}

void evaluate(size_t index, ValueBase& destination) override
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit weird to have two modes of evaluation like this where on one hand if the path is empty, the results are driven by the index parameter and the query node is stateless. Then in the other case, the index is ignored and the query node is advancing internal state at every call to evaluate. The other downside is that all query expressions have to pay the cost of two extra virtual function calls per evaluation with no benefit (more() and reset_path()).
What do you think about removing more() and reset_path() and then changing the design to something like this:

struct QueryIndex {
    size_t index; // advanced directly for most queries
    size_t sub_index; // usually 0: used by many-to-many lists
    size_t sub_size; // usually 1: but can be set by the query node upon a call to evaluate
};
...

void evaluate(QueryIndex& index, ValueBase& destination) override {
    if (index.index == m_last_evaluated_index) {
        // fetch new lists
    }
    m_last_evaluated_index = index.index
    auto& matches = m_ctrl.matches[index.sub_index]; 
    // set destination as before
}
... 
// main driver loop also changes to iterate on index.index and while (index.sub_index < index.sub_size) 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this suggestion. I have a tendency to think that virtual functions is the answer to all problems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated design looks much cleaner to me, thank you!

@jedelbo jedelbo requested a review from ironage March 6, 2024 13:25
// In case of wildcard query strings, we will get a value for every collection matching the path
// We need to match those separately against the other value - which might also come in multiple
// instances.
Subexpr::Index right_index(start);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks optimal to me, but since it is such a hot path, could you check how much performance is lost on normal queries that don't have any more()? I think the cost is two more conditionals per evaluation, so hopefully not too bad.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have a query comparing 2 integer properties, you will have a 2% degradation of performance. On the other hand, I think this change will speed up queries over linklists as we will only fetch values until we have a match. And we are also avoiding some copying to intermediate buffers.

Comment on lines -357 to +356
this->get_lists(index, list_refs, 1);
this->get_lists(index, list_refs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Odd that we never requested more than one list anywhere. There should be room for the multi-value row-by-row comparison in cases like this. Eg. for queries such as dictionary.@size() == my_int_property we can grab 8 lists at a time, and set destination.init(not_from_list, 8) and set all the values to the sizes of the lists. It is a niche optimization that only works for only_unary_links that end in a collection count. Not blocking because we didn't have that optimized before, just observing that I think this is the case that the extra parameter was intended for.

@jedelbo jedelbo merged commit edf7064 into master Mar 7, 2024
35 of 37 checks passed
@jedelbo jedelbo deleted the je/query-wildcard branch March 7, 2024 13:56
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 6, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Collections in mixed] Query not matching expected elements
2 participants