[DocDB] [YCQL] Failed to load sys catalog after incorrect packing of liveness columns #18157

Closed
1 task done
zlareb1-yb opened this issue Jul 10, 2023 · 2 comments

zlareb1-yb commented Jul 10, 2023

Jira Link: DB-7197

Description

The upgrade from 2.19.1.0-b141 to 2.19.1.0-b203 failed with the following error:
org.yb.client.NonRecoverableException: Too many attempts: YRpc(method=GetMasterClusterConfig, service=yb.master.MasterService, tablet=null, attempt=22, maxAttempts=100, maxTimeoutMs=120000, elapsedTimeMs=116544). Master config (10.150.0.207:7100,10.150.0.208:7100,10.150.0.36:7100) has no leader.. Exceptions received: org.yb.client.ConnectionResetException: [Peer YB Master - 10.150.0.36:7100] Connection reset on [id: 0xd57c27e6, L:null ! R:/10.150.0.36:7100],org.yb.client.ConnectionResetException: [Peer YB Master - 10.150.0.208:7100] Connection reset on [id: 0xdc84ea0e, L:null ! R:/10.150.0.208:7100].

Error in master:
F0705 10:03:09.221561 37220 catalog_manager.cc:1143] T 00000000000000000000000000000000 P 7f806906d6f54b07a59dce6962ba59ee: Failed to load sys catalog: Corruption (yb/master/sys_catalog_writer.cc:71): Failed while visiting snapshots in sys catalog: System catalog snapshot is corrupted or built using different build type: Unexpected value type for metadata: 0, row: { 0 => { value: int8_value: 7 ttl_seconds: 0 write_time: kUninitializedWriteTime } 1 => { value: binary_value: "}~\212\312\324\361D\\\231\233\303s\321\255\315h" ttl_seconds: 0 write_time: kUninitializedWriteTime } 2 => { value: ttl_seconds: -1 write_time: 1687711964992311 } }, type: 7, id: 7D7E8ACAD4F1445C999BC373D1ADCD68
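When reading this log line, it can help to confirm that the escaped `binary_value` in column 1 is the same entry as the `id` printed at the end. The following is a minimal sketch, assuming the log uses C-style escapes (three-digit octal and `\\`); the helper name `decode_escaped` is made up for illustration.

```python
def decode_escaped(escaped: str) -> bytes:
    """Turn a C-style escaped string (octal \\ooo, \\\\) into raw bytes."""
    # unicode_escape understands \ooo and \\; latin-1 maps code points 0-255
    # back to single bytes one-to-one.
    return escaped.encode("ascii").decode("unicode_escape").encode("latin-1")


# binary_value copied from the master log line above (raw string keeps the backslashes).
binary_value = r'}~\212\312\324\361D\\\231\233\303s\321\255\315h'
entry_id = decode_escaped(binary_value).hex().upper()
print(entry_id)
```

The decoded value is 16 bytes long and matches the `id` in the log, so the row with the empty column 2 (metadata) is the snapshot entry being reported.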

To work around this, the upgrade was retried after manually setting the gflag ignore_null_sys_catalog_entries to true.
Master flags:

[yugabyte@yb-dev-pf-yba-installer-lru-n3 ~]$ cat master/conf/server.conf
--allow_insecure_connections=false
--callhome_collection_level=medium
--cert_node_filename=10.150.0.208
--certs_dir=/home/yugabyte/yugabyte-tls-config
--certs_for_cdc_dir=/home/yugabyte/yugabyte-tls-producer
--certs_for_client_dir=/home/yugabyte/yugabyte-client-tls-config
--cluster_uuid=f99c61b3-84b0-4334-8086-d426239f5da7
--cql_proxy_bind_address=10.150.0.208:9042
--cql_proxy_webserver_port=12000
--enable_ysql=true
--fs_data_dirs=/mnt/d0
--master_addresses=10.150.0.207:7100,10.150.0.208:7100,10.150.0.36:7100
--max_log_size=256
--metric_node_name=yb-dev-pf-yba-installer-lru-n3
--pgsql_proxy_bind_address=10.150.0.208:5433
--pgsql_proxy_webserver_port=13000
--placement_cloud=gcp
--placement_region=us-west1
--placement_uuid=8550515e-8832-4030-b700-6dc1bfa6ba38
--placement_zone=us-west1-a
--replication_factor=3
--rpc_bind_addresses=10.150.0.208:7100
--server_broadcast_addresses=
--start_cql_proxy=true
--txn_table_wait_min_ts_count=3
--undefok=enable_ysql
--use_cassandra_authentication=true
--use_client_to_server_encryption=true
--use_node_to_node_encryption=true
--webserver_ca_certificate_file=/home/yugabyte/yugabyte-tls-config/ca.crt
--webserver_certificate_file=/home/yugabyte/yugabyte-tls-config/node.10.150.0.208.crt
--webserver_interface=10.150.0.208
--webserver_port=7000
--webserver_private_key_file=/home/yugabyte/yugabyte-tls-config/node.10.150.0.208.key
--webserver_redirect_http_to_https=true
--ysql_enable_auth=true
--ysql_hba_conf_csv=local all yugabyte trust
--enable_stream_compression=true
--ignore_null_sys_catalog_entries=true
--stream_compression_algo=1
--ycql_enable_packed_row=true
--ysql_enable_packed_row=true
--ysql_enable_packed_row_for_colocated_table=true
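Since the behaviour differs between nodes, it may be useful to compare the effective flags on each master. The following is a small sketch that parses a flag file of the `--name=value` form shown above; the function name `parse_flagfile` is hypothetical, and the sample lines are copied from the listing rather than read from a real node.

```python
from typing import Dict, Iterable


def parse_flagfile(lines: Iterable[str]) -> Dict[str, str]:
    """Parse '--name=value' lines into a dict; other lines are skipped."""
    flags: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("--"):
            continue
        # Split on the first '=' only, so values may contain '=' or spaces.
        name, _, value = line[2:].partition("=")
        flags[name] = value
    return flags


# A few lines taken from the server.conf listing above.
sample = [
    "--server_broadcast_addresses=",
    "--ysql_hba_conf_csv=local all yugabyte trust",
    "--ignore_null_sys_catalog_entries=true",
    "--ycql_enable_packed_row=true",
]
flags = parse_flagfile(sample)
print(flags.get("ignore_null_sys_catalog_entries"))
```

Running the same parser over `master/conf/server.conf` from each node and comparing the resulting dictionaries would show whether a flag is missing on one of them.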

After applying this gflag, the master service started working on n1, but it still fails on the other two nodes.
Logs observed:
E0706 09:56:20.952315 106781 async_initializer.cc:95] Failed to initialize client: Timed out (yb/rpc/rpc.cc:220): Could not locate the leader master: GetLeaderMasterRpc(addrs: [10.150.0.207:7100, 10.150.0.208:7100, 10.150.0.36:7100, 10.150.0.207:7100, 10.150.0.208:7100, 10.150.0.36:7100], num_attempts: 53) passed its deadline 90239.702s (passed: 1.540s): Network error (yb/util/net/socket.cc:534): recvmsg got EOF from remote (system error 108)

Slack Discussion - https://yugabyte.slack.com/archives/C01CB38CZHU/p1688639798730339?thread_ts=1688552470.326609&cid=C01CB38CZHU

cc: @kripasreenivasan @Arjun-yb @renjith-yb

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@zlareb1-yb zlareb1-yb added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Jul 10, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jul 10, 2023
@rthallamko3
Contributor

@zlareb1-yb, can you clarify which version you were upgrading from?

@zlareb1-yb
Author

@rthallamko3 Universe upgrade was done from 2.19.1.0-b141 to 2.19.1.0-b203

@rthallamko3 rthallamko3 removed their assignment Jul 13, 2023
@yugabyte-ci yugabyte-ci added priority/high High Priority and removed status/awaiting-triage Issue awaiting triage priority/medium Medium priority issue labels Jul 25, 2023
@rthallamko3 rthallamko3 added the status/awaiting-triage Issue awaiting triage label Sep 11, 2023
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Sep 21, 2023
spolitov added a commit that referenced this issue Oct 15, 2023
Summary:
A packed row is interpreted as a row with a liveness column.
That is, if all columns of such a row are set to NULL, the row still exists and consists of NULLs.
But in YCQL a row can be inserted without a liveness column, so setting all of its columns to NULL should delete the row.

During compaction, a packed row could be generated for such a row.
This is incorrect, and this diff fixes the issue.
Jira: DB-7197

Test Plan:
BackupTxnTest.DeleteWithCompaction
CqlPackedRowTest.CompactWithoutLivenessColumn

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: rthallam, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29102
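
The semantics described in the commit summary can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Row` class and both compaction functions are hypothetical and do not reflect DocDB's storage format or the actual change in the diff. It assumes that packing makes a row read back as if it had a liveness column, and that leaving rows without a liveness column unpacked preserves their behaviour.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Row:
    has_liveness: bool                  # written with a liveness column?
    columns: Dict[str, Optional[int]]   # None stands for NULL


def exists(row: Row) -> bool:
    # With a liveness column the row survives even if every column is NULL;
    # without one it needs at least one non-NULL column.
    return row.has_liveness or any(v is not None for v in row.columns.values())


def set_all_null(row: Row) -> Row:
    return replace(row, columns={name: None for name in row.columns})


def compact_always_pack(row: Row) -> Row:
    # Problematic behaviour: the packed form is treated as having liveness.
    return replace(row, has_liveness=True)


def compact_pack_if_live(row: Row) -> Row:
    # Assumed safe behaviour: rows lacking a liveness column stay unpacked.
    return replace(row, has_liveness=True) if row.has_liveness else row


row = Row(has_liveness=False, columns={"v": 1})
print(exists(set_all_null(row)))                        # row is deleted
print(exists(set_all_null(compact_always_pack(row))))   # row wrongly survives
print(exists(set_all_null(compact_pack_if_live(row))))  # row is deleted
```

In this model, the row that should disappear once all its columns are NULL instead survives as a row of NULLs after being packed, which is consistent with the "Unexpected value type for metadata: 0" entry seen in the sys catalog.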
@rthallamko3 changed the title from "[After adding ignore_null_sys_catalog_entries] Failed to initialize client: Timed out (yb/rpc/rpc.cc:220): Could not locate the leader master: GetLeaderMasterRpc" to "[DocDB] [YCQL] Failed to load sys catalog after incorrect packing of liveness columns" on Oct 16, 2023
@rthallamko3 rthallamko3 added the area/ycql Yugabyte CQL (YCQL) label Oct 16, 2023
@yugabyte-ci yugabyte-ci removed the area/ycql Yugabyte CQL (YCQL) label Oct 16, 2023
spolitov added a commit that referenced this issue Oct 17, 2023
Summary:
A packed row is interpreted as a row with a liveness column.
That is, if all columns of such a row are set to NULL, the row still exists and consists of NULLs.
But in YCQL a row can be inserted without a liveness column, so setting all of its columns to NULL should delete the row.

During compaction, a packed row could be generated for such a row.
This is incorrect, and this diff fixes the issue.
Jira: DB-7197

Original commit: 4d5c482/D29102

Test Plan:
BackupTxnTest.DeleteWithCompaction
CqlPackedRowTest.CompactWithoutLivenessColumn

Reviewers: bogdan, rthallam

Reviewed By: bogdan, rthallam

Subscribers: ybase, rthallam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29386
spolitov added a commit that referenced this issue Oct 20, 2023
Summary:
The packed row is interpreted as row with liveness column.
I.e. if we set all columns of this row to NULL, it will be row consisting of NULLs.
But in YCQL we could insert row without liveness column, so setting all columns to NULL should result in deleting of such row.

During compaction we could generate packed row for such row.
It is incorrect and this diff fixes the issue.
Jira: DB-7197

Original commit: 4d5c482/D29102

Test Plan:
BackupTxnTest.DeleteWithCompaction
CqlPackedRowTest.CompactWithoutLivenessColumn

Reviewers: bogdan, rthallam

Reviewed By: rthallam

Subscribers: ybase, rthallam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29375