2.25.0.0-b19

@myang2021 myang2021 tagged this 20 Sep 16:28
Summary:
The bug appeared in a recent integration test run and had the following symptom:

In ./Universe_logs/172.151.31.81/tserver/yb-tserver.ip-172-151-31-81.us-west-2.compute.internal.yugabyte.log.INFO.20240918-192635.1116559

```
W0918 19:33:33.059890 1123973 tablet_rpc.cc:497] Query error (yb/tserver/service_util.h:330): Failed Read(tablet: 00000000000000000000000000000000, num_ops: 1, num_attempts: 2, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none]) to tablet 00000000000000000000000000000000 on tablet server { uuid: b7b95ea542c642998d053ebba298a46a private: [host: "172.151.31.81" port: 7100] cloud_info: placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2a" after 2 attempt(s): The catalog snapshot used for this transaction has been invalidated: expected: 18446744073709551615, got: 131: MISMATCHED_SCHEMA (tablet server error 5)
```

Note that the expected last breaking catalog version is 18446744073709551615
(-1 when interpreted as int64), which is unreasonably large. The version check
is done by the tserver. The expected last breaking catalog version comes from
the map `ysql_db_catalog_version_map_`, keyed by `db_oid`. The map gets its
values from the tserver-master heartbeat response, which carries the contents
of the table `pg_yb_catalog_version`. The new contents of
`pg_yb_catalog_version` are merged into the existing
`ysql_db_catalog_version_map_`: an entry is inserted or updated only when the
new version is greater than the existing value.
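
The merge rule described above can be sketched as follows. This is a minimal illustration, not the actual tserver code: the types `CatalogVersionRow` and `CatalogVersionInfo`, the function `MergeCatalogVersionRow`, and the choice to compare on `current_version` are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for one pg_yb_catalog_version row carried in the
// heartbeat response.
struct CatalogVersionRow {
  uint32_t db_oid;
  uint64_t current_version;
  uint64_t last_breaking_version;
};

// Hypothetical value type for the per-database map.
struct CatalogVersionInfo {
  uint64_t current_version;
  uint64_t last_breaking_version;
};

using CatalogVersionMap = std::unordered_map<uint32_t, CatalogVersionInfo>;

// Merges one row into the map, keyed by db_oid. A missing entry is inserted;
// an existing entry is overwritten only if the incoming current_version is
// strictly greater (assumed comparison). Returns true if the map changed.
bool MergeCatalogVersionRow(const CatalogVersionRow& row,
                            CatalogVersionMap* map) {
  assert(map != nullptr);
  const CatalogVersionInfo incoming{row.current_version,
                                    row.last_breaking_version};
  auto it = map->find(row.db_oid);
  if (it == map->end()) {
    map->emplace(row.db_oid, incoming);
    return true;
  }
  if (row.current_version > it->second.current_version) {
    it->second = incoming;
    return true;
  }
  return false;
}
```

Under this sketch, once a row with version 18446744073709551615 is merged, no later row for the same `db_oid` can replace it, because no `uint64_t` value is greater. That would be consistent with the "expected: 18446744073709551615, got: 131" symptom, although the real code path may differ.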

I added a new gflag `--TEST_check_catalog_version_overflow`. When set to true,
it crashes the tserver if the new version read from the heartbeat response is
unreasonably large (i.e., becomes negative when cast to int64_t).
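
As a rough sketch of the check, the following uses the standard `assert` as a stand-in for the `CHECK`-style macro seen in the FATAL logs. The names `IsPlausibleCatalogVersion` and `CheckCatalogVersion`, and passing the flag as a `bool` parameter, are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// A version is treated as plausible if it is non-negative when viewed as a
// signed 64-bit integer, i.e. if its top bit is clear. On mainstream
// compilers this cast wraps modulo 2^64 (guaranteed from C++20 onward).
bool IsPlausibleCatalogVersion(uint64_t version) {
  return static_cast<int64_t>(version) >= 0;
}

// Sketch of the gated check: does nothing unless check_enabled is true.
// assert() stands in for the real crash-on-failure macro and is compiled
// out when NDEBUG is defined.
void CheckCatalogVersion(bool check_enabled, uint64_t version) {
  if (!check_enabled) {
    return;
  }
  assert(IsPlausibleCatalogVersion(version) && "catalog version overflow");
}
```

With this predicate, a version such as 131 passes, while 18446744073709551615 fails because it maps to -1 as `int64_t`.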

Similar debugging logic is added to the master side as well. When yb-master
reads the contents of `pg_yb_catalog_version` to prepare the heartbeat
response and a version read from the table is unreasonably large, the master
process crashes.

Also added `GUARDED_BY(lock_)` and `EXCLUDES(lock_)` to a few relevant functions.

The `--TEST_check_catalog_version_overflow` gflag is expected to be enabled in
the integration test that showed the bug. If the bug reproduces, the crash
should give a better clue about where the number 18446744073709551615 comes from.
Jira: DB-12909

Test Plan:
Manual test
(1) Create and start a local cluster with the new test gflag set:

```
./bin/yb-ctl create --rf 1 --tserver_flags TEST_check_catalog_version_overflow=true --master_flags TEST_check_catalog_version_overflow=true
```

(2) Run the following commands:
```
yugabyte=# select * from pg_yb_catalog_version;
 db_oid | current_version | last_breaking_version
--------+-----------------+-----------------------
      1 |               1 |                     1
  13254 |               1 |                     1
  13255 |               1 |                     1
  13257 |               1 |                     1
  13258 |               1 |                     1
(5 rows)
yugabyte=# SET yb_non_ddl_txn_for_sys_tables_allowed=1;
SET
yugabyte=# update pg_yb_catalog_version set current_version = -1, last_breaking_version = -1 where db_oid = 13257;
UPDATE 1
yugabyte=# \q
```

Looking in the yb-master log directory, I saw a FATAL:

```
F0919 20:44:08.961647 2712654 sys_catalog.cc:1063] Check failed: static_cast<int64_t>(current_version) >= 0 (-1 vs. 0) 18446744073709551615 db_oid: 13257
```
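
The FATAL message prints both `-1` and `18446744073709551615` for the same value. A small sketch of why, assuming the column is stored as a signed 64-bit integer and read back as `uint64_t` (the helper names here are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Reinterprets a signed 64-bit column value as an unsigned version number.
// The conversion is modulo 2^64, so -1 becomes the largest uint64_t value.
uint64_t AsUnsignedVersion(int64_t stored) {
  return static_cast<uint64_t>(stored);
}

// Formats the unsigned view of the stored value as decimal text.
std::string DescribeVersion(int64_t stored) {
  return std::to_string(AsUnsignedVersion(stored));
}

// Sanity checks for the round trip between the two views of the value.
void SelfCheck() {
  assert(AsUnsignedVersion(-1) == UINT64_MAX);
  assert(static_cast<int64_t>(AsUnsignedVersion(-1)) == -1);
  assert(DescribeVersion(-1) == "18446744073709551615");
}
```

The `(-1 vs. 0)` part of the log is the signed view used in the comparison, and the trailing number is the unsigned view of the same 64 bits.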

(3) Repeat the above test with the master-side check disabled by changing it as follows:
```
+      if (FLAGS_TEST_check_catalog_version_overflow && false) {
```

so that we can see the tserver FATAL:

```
F0919 20:52:29.832093 2715720 tablet_server.cc:968] Check failed: static_cast<int64_t>(new_version) >= 0 (-1 vs. 0) 18446744073709551615 db_oid: 13257 db_catalog_version_data: db_catalog_versions { db_oid: 1 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13254 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13255 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13257 current_version: 18446744073709551615 last_breaking_version: 18446744073709551615 } db_catalog_versions { db_oid: 13258 current_version: 1 last_breaking_version: 1 }
```

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D38240