Summary:
The symptom of the CHECK failure:
```
F20250514 00:37:44 /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang19-release/yugabyte-db/src/yb/tserver/tablet_server.cc:1650] Check failed: db_message_lists[i-1].first < db_message_lists[i].first (23 vs. 23) 1 6
```
This CHECK was added in recent commit e2fe6ee4a5184e7233179d05570bce8cac1cb658.
The db_message_lists should be a queue (implemented as a std::deque) of
(catalog_version, message_list) pairs sorted by catalog_version. The check
failure log indicates the queue has 6 pairs like
(23, m1), (23, m2), (24, m3), (25, m4), (26, m5)
The first two pairs have the same catalog version (we should have m1 == m2),
this isn't expected.
After debugging, I found it is because of the use of std::upper_bound where in
fact std::lower_bound should be used.
```
// The queue isn't empty, insert the new pair to the right position. Because db_message_lists
// is sorted, we can use std::upper_bound with a custom comparator function to find the right
// insertion point.
auto comp = [](uint64_t current_version,
const std::pair<uint64_t, std::optional<std::string>>& p) {
return current_version < p.first;
};
auto it3 = std::upper_bound(db_message_lists->begin(), db_message_lists->end(),
new_catalog_version, comp);
if (it3 != db_message_lists->end()) {
db_message_lists->insert(it3, std::make_pair(new_catalog_version, message_list));
} else {
// We can reach here if the last version in the queue is the same as new_catalog_version.
}
```
If the current queue has
```
(23, m1), (24, m3), (25, m4), (26, m5)
```
and we are trying to insert a new version pair (23, m2), it3 will point to (24,
m3), and we will insert new pair before (24, m3) so we end up with
```
(23, m1), (23, m2), (24, m3), (25, m4), (26, m5)
```
I tested with the following program to see the difference between upper_bound
and lower_bound:
```
tmp$ cat ub.cc
using namespace std;
int main() {
deque<std::pair<uint64_t, std::optional<std::string>>> v = {{23, nullopt}, {24, nullopt}};
auto comp = [](uint64_t current_version,
const std::pair<uint64_t, std::optional<std::string>>& p) {
return current_version < p.first;
};
// Finding upper bound for value 23 in deque v
auto it = upper_bound(v.begin(), v.end(), 23, comp);
if (it == v.end()) {
cout << "upper bound for 23 is not found" << endl;
} else {
cout << "upper bound for 23 is " << it->first << endl;
}
return 0;
}
tmp$ /opt/rh/gcc-toolset-11/root/bin/g++ -std=c++20 -o ub ub.cc
tmp$ cat lb.cc
using namespace std;
int main() {
deque<std::pair<uint64_t, std::optional<std::string>>> v = {{23, nullopt}, {24, nullopt}};
auto comp = [](const std::pair<uint64_t, std::optional<std::string>>& p,
uint64_t current_version) {
return p.first < current_version;
};
// Finding lower bound for value 23 in deque v
auto it = lower_bound(v.begin(), v.end(), 23, comp);
if (it == v.end()) {
cout << "lower bound for 23 is not found" << endl;
} else {
cout << "lower bound for 23 is " << it->first << endl;
}
return 0;
}
tmp$ /opt/rh/gcc-toolset-11/root/bin/g++ -std=c++20 -o lb lb.cc
tmp$ ./ub
upper bound for 23 is 24
tmp$ ./lb
lower bound for 23 is 23
```
We can see for the same deque, lower_bound finds 23, upper_bound finds 24.
Using lower_bound allows detection of duplication of 23.
I fixed the bug by using std::lower_bound instead of std::upper_bound.
Added a new unit test which would fail with the error before the diff (with
yb_test_delay_set_local_tserver_inval_message_ms added to force a repro of the
bug):
```
[ts-1] F0515 00:43:49.812948 924097 tablet_server.cc:1660] Check failed: db_message_lists[i-1].first < db_message_lists[i].first (2 vs. 2) 1 3
```
Jira: DB-16655
Test Plan: YB_EXTRA_TSERVER_FLAGS="--log_ysql_catalog_versions=true --vmodule=tablet_server=2" ./yb_build.sh debug --cxx-test pg_catalog_version-test --gtest_filter PgCatalogVersionTest.InvalMessageDuplicateVersion
Reviewers: kfranz, sanketh, mihnea
Reviewed By: kfranz
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D44000