Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sorting stage 3 #6670

Merged
merged 22 commits into from
Oct 6, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
2864d8a
Add tests on BPlusTree upgrade
jedelbo May 17, 2023
8512a0d
change the sort order of strings
ironage May 24, 2023
4823705
Add test on upgrade of StringIndex and Set
jedelbo Sep 21, 2023
6ea018c
remove utf8_compare
ironage Mar 17, 2023
f16f728
Merge branch 'je/upgrade-from-23-tests' into js/sorting-stage-3
jedelbo Sep 21, 2023
e34e29d
Add upgrade functionality
jedelbo Sep 21, 2023
333982a
Avoid string index upgrade
jedelbo Sep 21, 2023
85a9397
Update test
jedelbo Sep 22, 2023
5ed056a
Merge branch 'je/upgrade-from-23-tests' into js/sorting-stage-3
jedelbo Sep 22, 2023
263e6b8
Merge branch 'next-major' into js/sorting-stage-3
jedelbo Sep 22, 2023
90acf11
Move Set::do_resort() to .cpp file
jedelbo Sep 22, 2023
79326f8
Generate test_upgrade_database_x_23.realm as on ARM
jedelbo Sep 25, 2023
1ca910e
Revert "Avoid string index upgrade"
jedelbo Sep 25, 2023
4fafb0e
Fix upgrade logic for string index
jedelbo Sep 25, 2023
704d0dc
Merge branch 'je/upgrade-from-23-tests' into js/sorting-stage-3
jedelbo Sep 25, 2023
f8d000d
memcmp is faster than std::lexi_cmp and doesn't require casting
ironage Sep 25, 2023
0d19ba4
optimize: only compare strings once
ironage Sep 25, 2023
a9ee4e2
Upgrade of fulltext index not needed
jedelbo Sep 26, 2023
64f7f50
migrate sets of binaries and better migration test
ironage Sep 26, 2023
aaf0064
generate migration Realms on a signed platform
ironage Sep 26, 2023
25e21bc
fix lint
ironage Sep 26, 2023
39b0eae
avoid a string index migration by using linear search
ironage Oct 4, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,12 +12,14 @@
* Fixed equality queries on a Mixed property with an index possibly returning the wrong result if values of different types happened to have the same StringIndex hash. ([6407](https://github.com/realm/realm-core/issues/6407) since v11.0.0-beta.5).
* If you have more than 8388606 links pointing to one specific object, the program will crash. ([#6577](https://github.com/realm/realm-core/issues/6577), since v6.0.0)
* Query for NULL value in Dictionary<Mixed> would give wrong results ([6748])(https://github.com/realm/realm-core/issues/6748), since v10.0.0)
* A Realm generated on a non-apple ARM 64 device and copied to another platform (and vice-versa) were non-portable due to a sorting order difference. This impacts strings or binaries that have their first difference at a non-ascii character. These items may not be found in a set, or in an indexed column if the strings had a long common prefix (> 200 characters). ([PR # 6670](https://github.com/realm/realm-core/pull/6670), since 2.0.0-rc7 for indexes, and since since the introduction of sets in v10.2.0)

### Breaking changes
* Support for upgrading from Realm files produced by RealmCore v5.23.9 or earlier is no longer supported.
* Remove `set_string_compare_method`, only one sort method is now supported which was previously called `STRING_COMPARE_CORE`.
* BinaryData and StringData are now strongly typed for comparisons and queries. This change is especially relevant when querying for a string constant on a Mixed property, as now only strings will be returned. If searching for BinaryData is desired, then that type must be specified by the constant. In RQL the new way to specify a binary constant is to use `mixed = bin('xyz')` or `mixed = binary('xyz')`. ([6407](https://github.com/realm/realm-core/issues/6407)).
* In the C API, `realm_collection_changes_get_num_changes` and `realm_dictionary_get_changes` have got an extra parameter to receive information on the deletion of the entire collection.
* Sorting order of strings has changed to use standard unicode codepoint order instead of grouping similar english letters together. A noticeable change will be from "aAbBzZ" to "ABZabz". ([2573](https://github.com/realm/realm-core/issues/2573))

### Compatibility
* Fileformat: Generates files with format v24. Reads and automatically upgrade from fileformat v10. If you want to upgrade from an earlier file format version you will have to use RealmCore v13.x.y or earlier.
Expand Down
6 changes: 5 additions & 1 deletion src/realm/binary_data.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -181,7 +181,11 @@ inline bool operator<(const BinaryData& a, const BinaryData& b) noexcept
if (a.is_null() || b.is_null())
return !a.is_null() < !b.is_null();

return std::lexicographical_compare(a.m_data, a.m_data + a.m_size, b.m_data, b.m_data + b.m_size);
// memcmp does comparison using unsigned characters which gives the correct ordering for utf8
int cmp = memcmp(a.m_data, b.m_data, std::min(a.size(), b.size()));
if (cmp == 0 && a.size() < b.size())
return true;
return cmp < 0;
}

inline bool operator>(const BinaryData& a, const BinaryData& b) noexcept
Expand Down
3 changes: 2 additions & 1 deletion src/realm/collection.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -980,6 +980,7 @@ struct CollectionIterator {
using pointer = const value_type*;
using reference = const value_type&;

CollectionIterator() noexcept = default;
CollectionIterator(const L* l, size_t ndx) noexcept
: m_list(l)
, m_ndx(ndx)
Expand Down Expand Up @@ -1070,7 +1071,7 @@ struct CollectionIterator {

private:
mutable value_type m_val;
const L* m_list;
const L* m_list = nullptr;
size_t m_ndx = size_t(-1);
};

Expand Down
1 change: 1 addition & 0 deletions src/realm/group.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -779,6 +779,7 @@ class Group : public ArrayParent {
/// 24 Variable sized arrays for Decimal128.
/// Nested collections
/// Backlinks in BPlusTree
/// Sort order of Strings changed (affects sets and the string index)
///
/// IMPORTANT: When introducing a new file format version, be sure to review
/// the file validity checks in Group::open() and DB::do_open, the file
Expand Down
119 changes: 88 additions & 31 deletions src/realm/index_string.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -104,7 +104,8 @@ int64_t IndexArray::from_list<index_FindFirst>(const Mixed& value, InternalFindR
SortedListComparator slc(column);

IntegerColumn::const_iterator it_end = key_values.cend();
IntegerColumn::const_iterator lower = std::lower_bound(key_values.cbegin(), it_end, value, slc);
IntegerColumn::const_iterator lower = slc.find_start_of_unsorted(value, key_values);

if (lower == it_end)
return null_key.value;

Expand All @@ -124,7 +125,7 @@ int64_t IndexArray::from_list<index_Count>(const Mixed& value, InternalFindResul
SortedListComparator slc(column);

IntegerColumn::const_iterator it_end = key_values.cend();
IntegerColumn::const_iterator lower = std::lower_bound(key_values.cbegin(), it_end, value, slc);
IntegerColumn::const_iterator lower = slc.find_start_of_unsorted(value, key_values);
if (lower == it_end)
return 0;

Expand All @@ -134,7 +135,7 @@ int64_t IndexArray::from_list<index_Count>(const Mixed& value, InternalFindResul
if (actual != value)
return 0;

IntegerColumn::const_iterator upper = std::upper_bound(lower, it_end, value, slc);
IntegerColumn::const_iterator upper = slc.find_end_of_unsorted(value, key_values, lower);
int64_t cnt = upper - lower;

return cnt;
Expand All @@ -147,7 +148,8 @@ int64_t IndexArray::from_list<index_FindAll_nocopy>(const Mixed& value, Internal
{
SortedListComparator slc(column);
IntegerColumn::const_iterator it_end = key_values.cend();
IntegerColumn::const_iterator lower = std::lower_bound(key_values.cbegin(), it_end, value, slc);
IntegerColumn::const_iterator lower = slc.find_start_of_unsorted(value, key_values);

if (lower == it_end)
return size_t(FindRes_not_found);

Expand Down Expand Up @@ -179,7 +181,7 @@ int64_t IndexArray::from_list<index_FindAll_nocopy>(const Mixed& value, Internal
// Last result is not equal, find the upper bound of the range of results.
// Note that we are passing upper which is cend() - 1 here as we already
// checked the last item manually.
upper = std::upper_bound(lower, upper, value, slc);
upper = slc.find_end_of_unsorted(value, key_values, lower);

result_ref.payload = from_ref(key_values.get_ref());
result_ref.start_ndx = lower.get_position();
Expand Down Expand Up @@ -339,7 +341,7 @@ void IndexArray::from_list_all(const Mixed& value, std::vector<ObjKey>& result,
SortedListComparator slc(column);

IntegerColumn::const_iterator it_end = rows.cend();
IntegerColumn::const_iterator lower = std::lower_bound(rows.cbegin(), it_end, value, slc);
IntegerColumn::const_iterator lower = slc.find_start_of_unsorted(value, rows);
if (lower == it_end)
return;

Expand All @@ -349,7 +351,7 @@ void IndexArray::from_list_all(const Mixed& value, std::vector<ObjKey>& result,
if (a != value)
return;

IntegerColumn::const_iterator upper = std::upper_bound(lower, it_end, value, slc);
IntegerColumn::const_iterator upper = slc.find_end_of_unsorted(value, rows, lower);

// Copy all matches into result column
size_t sz = result.size() + (upper - lower);
Expand Down Expand Up @@ -802,7 +804,7 @@ void StringIndex::insert_to_existing_list_at_lower(ObjKey key, Mixed value, Inte
// At this point there exists duplicates of this value, we need to
// insert value beside it's duplicates so that rows are also sorted
// in ascending order.
IntegerColumn::const_iterator upper = std::upper_bound(lower, list.cend(), value, slc);
IntegerColumn::const_iterator upper = slc.find_end_of_unsorted(value, list, lower);
// find insert position (the list has to be kept in sorted order)
// In most cases the refs will be added to the end. So we test for that
// first to see if we can avoid the binary search for insert position
Expand All @@ -812,6 +814,7 @@ void StringIndex::insert_to_existing_list_at_lower(ObjKey key, Mixed value, Inte
list.insert(upper.get_position(), key.value);
}
else {
// insert into the group of duplicates, keeping object keys sorted
IntegerColumn::const_iterator inner_lower = std::lower_bound(lower, upper, key.value);
list.insert(inner_lower.get_position(), key.value);
}
Expand All @@ -821,7 +824,7 @@ void StringIndex::insert_to_existing_list(ObjKey key, Mixed value, IntegerColumn
{
SortedListComparator slc(m_target_column);
IntegerColumn::const_iterator it_end = list.cend();
IntegerColumn::const_iterator lower = std::lower_bound(list.cbegin(), it_end, value, slc);
IntegerColumn::const_iterator lower = slc.find_start_of_unsorted(value, list);

if (lower == it_end) {
// Not found and everything is less, just append it to the end.
Expand Down Expand Up @@ -1176,7 +1179,7 @@ bool StringIndex::leaf_insert(ObjKey obj_key, key_type key, size_t offset, Strin
if (index_data == index_data_2 || suboffset > s_max_offset) {
// These strings have the same prefix up to this point but we
// don't want to recurse further, create a list in sorted order.
bool row_ndx_first = value.compare_signed(v2) < 0;
bool row_ndx_first = value < v2;
Array row_list(alloc);
row_list.create(Array::type_Normal); // Throws
row_list.add(row_ndx_first ? obj_key.value : obj_key2.value);
Expand Down Expand Up @@ -1213,7 +1216,7 @@ bool StringIndex::leaf_insert(ObjKey obj_key, key_type key, size_t offset, Strin
return reconstruct_string(offset, key, index_data) == value.get_string();
}
SortedListComparator slc(m_target_column);
lower = std::lower_bound(sub.cbegin(), it_end, value, slc);
lower = slc.find_start_of_unsorted(value, sub);

if (lower != it_end) {
Mixed lower_value = get(ObjKey(*lower));
Expand Down Expand Up @@ -1623,7 +1626,7 @@ bool has_duplicate_values(const Array& node, const ClusterColumn& target_col) no
SortedListComparator slc(target_col);
while (it != it_end) {
Mixed it_data = target_col.get_value(ObjKey(*it));
IntegerColumn::const_iterator next = std::upper_bound(it, it_end, it_data, slc);
IntegerColumn::const_iterator next = slc.find_end_of_unsorted(it_data, sub, it);
size_t count_of_value = next - it; // row index subtraction in `sub`
if (count_of_value > 1) {
return true;
Expand Down Expand Up @@ -1753,36 +1756,90 @@ void StringIndex::node_add_key(ref_type ref)
m_array->add(ref);
}

// Must return true if value of object(key) is less than needle.
bool SortedListComparator::operator()(int64_t key_value, Mixed needle) // used in lower_bound
// Must return true if value of object(key) is less than 'b'.
bool SortedListComparator::operator()(int64_t key_value, const Mixed& b) // used in lower_bound
{
Mixed a = m_column.get_value(ObjKey(key_value));
if (a.is_null() && !needle.is_null())
if (a.is_null() && !b.is_null())
return true;
else if (needle.is_null() && !a.is_null())
else if (b.is_null() && !a.is_null())
return false;
else if (a.is_null() && needle.is_null())
else if (a.is_null() && b.is_null())
return false;
return a.compare(b) < 0;
}

if (a == needle)
return false;

// The Mixed::operator< uses a lexicograpical comparison, therefore we
// cannot use our utf8_compare here because we need to be consistent with
// using the same compare method as how these strings were they were put
// into this ordered column in the first place.
return a.compare_signed(needle) < 0;
// Must return true if value of 'a' is less than value of object(key).
bool SortedListComparator::operator()(const Mixed& a, int64_t key_value) // used in upper_bound
{
Mixed b = m_column.get_value(ObjKey(key_value));
if (a.is_null() && !b.is_null())
return true;
else if (b.is_null() && !a.is_null())
return false;
else if (a.is_null() && b.is_null())
return false;
return a.compare(b) < 0;
}

// TODO: the next time the StringIndex is migrated (post version 23) and sorted
// properly in these lists replace this method with
// std::lower_bound(key_values.cbegin(), it_end, value, slc);
IntegerColumn::const_iterator SortedListComparator::find_start_of_unsorted(const Mixed& value,
const IntegerColumn& key_values) const
{
if (key_values.size() >= 2) {
Mixed first = m_column.get_value(ObjKey(key_values.get(0)));
Mixed last = m_column.get_value(ObjKey(key_values.get(key_values.size() - 1)));
if (first == last) {
if (value.compare(first) <= 0) {
return key_values.cbegin();
}
else {
return key_values.cend();
}
}
}

IntegerColumn::const_iterator it = key_values.cbegin();
IntegerColumn::const_iterator end = key_values.cend();
IntegerColumn::const_iterator first_greater = end;
while (it != end) {
Mixed val_it = m_column.get_value(ObjKey(*it));
int cmp = val_it.compare(value);
if (cmp == 0) {
return it;
}
if (cmp > 0 && first_greater == end) {
first_greater = it;
}
++it;
}
return first_greater;
}

// Must return true if value of needle is less than value of object(key).
bool SortedListComparator::operator()(Mixed needle, int64_t key_value) // used in upper_bound
// TODO: same as above, change to std::upper_bound(lower, it_end, value, slc);
IntegerColumn::const_iterator SortedListComparator::find_end_of_unsorted(const Mixed& value,
const IntegerColumn& key_values,
IntegerColumn::const_iterator begin) const
{
Mixed a = m_column.get_value(ObjKey(key_value));
if (needle == a) {
return false;
IntegerColumn::const_iterator end = key_values.cend();
if (begin != end && end - begin > 0) {
// optimization: check the last element first
Mixed last = m_column.get_value(ObjKey(*(key_values.cend() - 1)));
if (last == value) {
return key_values.cend();
}
}
while (begin != end) {
Mixed val_it = m_column.get_value(ObjKey(*begin));
if (value.compare(val_it) != 0) {
return begin;
}
++begin;
}
return !(*this)(key_value, needle);
return end;
}

// LCOV_EXCL_START ignore debug functions
Expand Down Expand Up @@ -1841,7 +1898,7 @@ void StringIndex::verify() const
while (it != it_end) {
Mixed it_data = get(ObjKey(*it));
size_t it_row = to_size_t(*it);
REALM_ASSERT(previous.compare_signed(it_data) <= 0);
REALM_ASSERT(previous <= it_data);
if (it != sub.cbegin() && previous == it_data) {
REALM_ASSERT_EX(it_row > last_row, it_row, last_row);
}
Expand Down
10 changes: 7 additions & 3 deletions src/realm/index_string.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
#include <set>

#include <realm/array.hpp>
#include <realm/column_integer.hpp>
#include <realm/search_index.hpp>

/*
Expand Down Expand Up @@ -248,7 +249,6 @@ class StringIndex : public SearchIndex {
void do_delete(ObjKey key, StringData, size_t offset);

Mixed get(ObjKey key) const;

void node_add_key(ref_type ref);

#ifdef REALM_DEBUG
Expand All @@ -267,8 +267,12 @@ class SortedListComparator {
{
}

bool operator()(int64_t key_value, Mixed needle);
bool operator()(Mixed needle, int64_t key_value);
bool operator()(int64_t key_value, const Mixed& b);
bool operator()(const Mixed& a, int64_t key_value);

IntegerColumn::const_iterator find_start_of_unsorted(const Mixed& value, const IntegerColumn& key_values) const;
IntegerColumn::const_iterator find_end_of_unsorted(const Mixed& value, const IntegerColumn& key_values,
IntegerColumn::const_iterator begin) const;

private:
const ClusterColumn m_column;
Expand Down
17 changes: 4 additions & 13 deletions src/realm/mixed.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -59,9 +59,10 @@ static const int sorting_rank[] = {

int compare_string(StringData a, StringData b)
{
// Observe! Changing these values breaks the file format for Set<Mixed> and the StringIndex
if (a == b)
return 0;
return utf8_compare(a, b) ? -1 : 1;
return a < b ? -1 : 1;
}

int compare_binary(BinaryData a, BinaryData b)
Expand Down Expand Up @@ -225,7 +226,7 @@ bool Mixed::accumulate_numeric_to(Decimal128& destination) const noexcept

int Mixed::compare(const Mixed& b) const noexcept
{
// Observe! Changing this function breaks the file format for Set<Mixed>
// Observe! Changing this function breaks the file format for Set<Mixed> and the StringIndex

if (is_null()) {
return b.is_null() ? 0 : -1;
Expand Down Expand Up @@ -347,17 +348,7 @@ int Mixed::compare(const Mixed& b) const noexcept
// Using rank table will ensure that all numeric values are kept together
return (sorting_rank[m_type] > sorting_rank[b.m_type]) ? 1 : -1;

// Observe! Changing this function breaks the file format for Set<Mixed>
}

int Mixed::compare_signed(const Mixed& b) const noexcept
{
if (is_type(type_String) && b.is_type(type_String)) {
auto a_val = get_string();
auto b_val = b.get_string();
return a_val == b_val ? 0 : a_val < b_val ? -1 : 1;
}
return compare(b);
// Observe! Changing this function breaks the file format for Set<Mixed> and the StringIndex
}

template <>
Expand Down
4 changes: 1 addition & 3 deletions src/realm/mixed.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -228,10 +228,8 @@ class Mixed {
bool accumulate_numeric_to(Decimal128& destination) const noexcept;
bool is_unresolved_link() const noexcept;
bool is_same_type(const Mixed& b) const noexcept;
// Will use utf8_compare for strings
// Will use unsigned lexicographical comparison for strings
int compare(const Mixed& b) const noexcept;
// Will compare strings as arrays of signed chars
int compare_signed(const Mixed& b) const noexcept;
friend bool operator==(const Mixed& a, const Mixed& b) noexcept
{
return a.compare(b) == 0;
Expand Down
Loading