[DocDB][Packed Columns] Corruption (yb/master/sys_catalog_writer.cc:219): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed log replay. Reason: System catalog snapshot is corrupted or built using different build type: Fou #14369

Closed
def- opened this issue Oct 7, 2022 · 11 comments
Assignees
Labels
2.18 Backport Required area/docdb YugabyteDB core features kind/bug This issue is a bug long_running_universe priority/medium Medium priority issue QA QA filed bugs

Comments

@def-
Contributor

def- commented Oct 7, 2022

Jira Link: DB-3792

Description

During an upgrade of my puppy-food-arm-1 universe from 2.15.4.0-b54 to 2.15.4.0-b72, the master server fails to come up:

[yugabyte@ip logs]$ cat yb-master.FATAL.details.2022-10-07T07_13_16.pid3668968.txt
F20221007 07:13:16 ../../src/yb/master/master_main.cc:136] Corruption (yb/master/sys_catalog_writer.cc:219): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed log replay. Reason: System catalog snapshot is corrupted or built using different build type: Found wrong metadata type: 0 vs 10
    @     0xffff7ca59d7c  (unknown)
    @     0xffff7ca54874  (unknown)
    @     0xffff7ca551ac  (unknown)
    @     0xffff7ca57938  (unknown)
    @           0x21429c  main
    @     0xffff7b6e0de4  __libc_start_main
    @           0x213784  (unknown)

The master keeps failing like this every minute when it tries to start again.

The upgrade task failure also refers to this master server:

Failed to execute task {"sleepAfterMasterRestartMillis":180000,"sleepAfterTServerRestartMillis":180000,"nodeExporterUser":"prometheus","universeUUID":"25bb1fc6-d74c-41ec-afa0-b0b331bdecc8","enableYbc":false,"installYbc":false,"ybcInstalled":false,"encryptionAtRestConfig":{"encryptionAtRestEnabled":false,"opType":"UNDEFINED","type":"DATA_KEY"},"communicationPorts":{"masterHttpPort":7000,"masterRpcPort":7100,"tserverHttpPort":9000,"tserverRpcPort":9100,"ybControllerHttpPort":14000,"ybControllerrRpcPort":18018,"redisS..., hit error:

WaitForServer(25bb1fc6-d74c-41ec-afa0-b0b331bdecc8, yb-15-puppy-food-arm-1-n1, type=MASTER) did not respond in the set time..

This is with packed columns enabled for YSQL and YCQL, on both tserver and master.
I will leave the universe in its current state for further analysis. Tell me when I can destroy and recreate it.

@def- def- added area/docdb YugabyteDB core features priority/high High Priority status/awaiting-triage Issue awaiting triage labels Oct 7, 2022
@yugabyte-ci yugabyte-ci added the kind/bug This issue is a bug label Oct 7, 2022
@yugabyte-ci yugabyte-ci changed the title from "[DocDB][Packed Columns] Corruption (yb/master/sys_catalog_writer.cc:219): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed log replay. Reason: System catalog snapshot is corrupted or built using different build type: Found wrong metadata type: 0 vs 10" to "[DocDB][Packed Columns] Corruption (yb/master/sys_catalog_writer.cc:219): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed log replay. Reason: System catalog snapshot is corrupted or built using different build type: Fou" Oct 7, 2022
@def- def- added the QA QA filed bugs label Oct 7, 2022
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Oct 7, 2022
@def- def- added the 2.16.0_blocker 2.16.0 Release blocker defects label Oct 20, 2022
@rthallamko3
Contributor

@def-, does this repro with the latest fixes?

@def-
Contributor Author

def- commented Nov 2, 2022

I’m not sure whether something specific to the old version or to the new version caused the problem during the upgrade. I can retry the upgrade from 2.15.4.0-b54 to current master.

@def-
Contributor Author

def- commented Nov 2, 2022

I did not see this failure in that upgrade, so I will close the issue for now.

@def-
Contributor Author

def- commented Nov 9, 2022

I have even seen this issue now on 2.17.1.0-b146 without any upgrade, just by triggering a rolling restart on my puppy-food-arm-1 LRU (all packed columns options enabled):

[yugabyte@ip-10-9-105-218 logs]$ cat yb-master.FATAL.details.2022-11-09T08_35_57.pid352319.txt
F20221109 08:35:57 ../../src/yb/master/master_main.cc:138] Corruption (yb/master/sys_catalog_writer.cc:219): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed log replay. Reason: System catalog snapshot is corrupted or built using different build type: Found wrong metadata type: 0 vs 10
    @     0xffff7d2739c0  (unknown)
    @     0xffff7d274440  (unknown)
    @     0xffff7d276d3c  (unknown)
    @     0xaaaacb27935c  main
    @     0xffff7be3485c  __libc_start_main
    @     0xaaaacb2787f4  (unknown)

Initially I thought it was another instance of #14767, but the error message indicates this issue instead.
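
For triaging similar FATALs, the distinguishing part of the message is the trailing "Found wrong metadata type: 0 vs 10". A minimal sketch of matching that signature in a log line follows. The function name and regex are illustrative assumptions, not part of any YugabyteDB tooling, and the sketch does not interpret which of the two numbers is the expected type.

```python
import re

# Signature copied from the FATAL lines quoted in this issue.
WRONG_METADATA_TYPE = re.compile(r"Found wrong metadata type: (\d+) vs (\d+)")


def parse_wrong_metadata_type(log_line):
    """Return the two numbers from the error text as ints, or None if absent.

    Hypothetical helper for log triage only.
    """
    match = WRONG_METADATA_TYPE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


fatal_line = (
    "F20221109 08:35:57 ../../src/yb/master/master_main.cc:138] Corruption "
    "(yb/master/sys_catalog_writer.cc:219): Unable to initialize catalog manager: "
    "Failed to initialize sys tables async: Failed log replay. Reason: System catalog "
    "snapshot is corrupted or built using different build type: "
    "Found wrong metadata type: 0 vs 10"
)

print(parse_wrong_metadata_type(fatal_line))  # (0, 10)
```

A line without the signature returns None, which is enough to separate this failure from other master FATALs during a rolling restart.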

@def- def- removed the 2.16.0_blocker 2.16.0 Release blocker defects label Nov 16, 2022
@def-
Contributor Author

def- commented Nov 16, 2022

Removed from the 2.16 blockers since this requires YCQL packed columns, which will not be GA in 2.16 yet.

@def-
Contributor Author

def- commented Jan 17, 2023

This is still failing. My puppy-food-arm-2 universe just FATALed with this after a rolling restart:

F20230117 07:11:47 ../../src/yb/master/catalog_manager.cc:1058] T 00000000000000000000000000000000 P a0486171a8e74113a0619aa950d04494: Failed to load sys catalog: Corruption (yb/master/sys_catalog_writer.cc:219): Failed while visiting tables in sys catalog: System catalog snapshot is corrupted or built using different build type: Found wrong metadata type: 0 vs 10
    @     0xffff88ec2f00  (unknown)
    @     0xffff88ec3980  (unknown)
    @     0xffff88ec627c  (unknown)
    @     0xffff8d7d5104  yb::master::CatalogManager::LoadSysCatalogDataTask()
    @     0xffff892fa940  yb::ThreadPool::DispatchThread()
    @     0xffff892efe54  yb::Thread::SuperviseThread()
    @     0xffff87c278b8  start_thread
    @     0xffff87ac3afc  thread_start

@kripasreenivasan
Contributor

FATAL observed in master universe with YCQL packed columns enabled while upgrading the universe from 2.17.1.0-b323 to 2.17.2.0-b11:

F0118 09:06:13.384968 29839 master_main.cc:143] Corruption (yb/master/sys_catalog_writer.cc:219): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed log replay. Reason: System catalog snapshot is corrupted or built using different build type: Found wrong metadata type: 0 vs 10

@kripasreenivasan
Contributor

Ran into this again today with my LRU. I had YCQL packed columns enabled on my YSQL-only LRU. I added the ysql_enable_packed_row_for_colocated_table flag and restarted the universe.

F0210 10:49:20.849534 31466 master_main.cc:143] Corruption (yb/master/sys_catalog_writer.cc:219): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed log replay. Reason: System catalog snapshot is corrupted or built using different build type: Found wrong metadata type: 0 vs 10

@yugabyte-ci yugabyte-ci added priority/medium Medium priority issue and removed priority/high High Priority labels Feb 22, 2023
@rthallamko3 rthallamko3 added the area/ycql Yugabyte CQL (YCQL) label Mar 8, 2023
@kripasreenivasan
Contributor

Unable to repro this issue with the initial set of tests performed. Will allow the fix to soak in the long-running universes over more rolling restarts and upgrades, while workloads are executing, for another week before closing the issue.

CC: @rthallamko3

@shamanthchandra-yb

shamanthchandra-yb commented May 4, 2023

We were able to see this issue again, hence reopened. Seen in manual LRU pf-3g-ysql:

-rw-r--r--. 1 yugabyte yugabyte    1258 May  3 12:53 yb-master.ip-172-151-18-70.us-west-2.compute.internal.yugabyte.log.ERROR.20230503-125336.27356
F20230503 12:54:18 ../../src/yb/master/master_main.cc:147] Corruption (yb/master/sys_catalog_writer.cc:63): Unable to initialize catalog manager: Failed to initialize sys tables async: Failed log replay. Reason: System catalog snapshot is corrupted or built using different build type: Found wrong metadata type: 0 vs 10
    @     0x563def02ed57  google::LogMessage::SendToLog()
    @     0x563def02fc9d  google::LogMessage::Flush()
    @     0x563def030319  google::LogMessageFatal::~LogMessageFatal()
    @     0x563deefd7218  main
    @     0x7f52d1c12825  __libc_start_main
    @     0x563deed0e02e  _start

cc: @kripasreenivasan

spolitov added a commit that referenced this issue May 12, 2023
…enumeration

Summary:
We expect that system catalog entries always have a binary value.
But in rare cases we could get a NULL value for a system catalog entry.
Currently we don't know the real root cause of this issue.
Adding ignore_null_sys_catalog_entries to just ignore such entries (false by default), to have the ability to recover clusters with such a failure.
Jira: DB-3792

Test Plan: Jenkins

Reviewers: bogdan, qhu, rthallam

Reviewed By: rthallam

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D25201
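
The behavior described in the commit summary above (fail on a NULL entry by default, skip it when the flag is set) can be illustrated with a small sketch. This is not the YugabyteDB implementation: `visit_entries`, `CorruptionError`, and the `(entry_id, value)` data model are hypothetical. Only the flag name `ignore_null_sys_catalog_entries` and its default of false come from the commit message.

```python
class CorruptionError(Exception):
    """Hypothetical stand-in for the Corruption status seen in the FATAL logs."""


def visit_entries(entries, ignore_null_sys_catalog_entries=False):
    """Collect (entry_id, value) pairs whose value is present.

    `entries` is an iterable of (entry_id, value) pairs where value is
    bytes, or None to model a NULL system catalog entry. With the flag
    off (the default), a NULL value aborts the enumeration. With the
    flag on, the NULL entry is skipped and enumeration continues.
    """
    visited = []
    for entry_id, value in entries:
        if value is None:
            if ignore_null_sys_catalog_entries:
                continue  # Skip the NULL entry so loading can proceed.
            raise CorruptionError(f"NULL value for sys catalog entry {entry_id!r}")
        visited.append((entry_id, value))
    return visited


# Illustrative data only: one normal entry and one NULL entry.
entries = [("table-a", b"payload"), ("table-b", None)]

recovered = visit_entries(entries, ignore_null_sys_catalog_entries=True)
print(recovered)
```

With the flag left at its default, the same input raises `CorruptionError`, which mirrors why the master could not start before the flag existed.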
@rthallamko3 rthallamko3 removed the area/ycql Yugabyte CQL (YCQL) label Jul 28, 2023
@kripasreenivasan
Contributor

Observed this issue in my long-running universe, @spolitov.
Details are in the Jira ticket.

spolitov added a commit that referenced this issue Aug 18, 2023
… entries during enumeration

Summary:
We expect that system catalog entries always have a binary value.
But in rare cases we could get a NULL value for a system catalog entry.
Currently we don't know the real root cause of this issue.
Adding ignore_null_sys_catalog_entries to just ignore such entries (false by default), to have the ability to recover clusters with such a failure.

Original commit: 60e7777/D25201

Jira: DB-3792

Test Plan: Jenkins

Reviewers: bogdan, qhu, rthallam

Reviewed By: qhu, rthallam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D27429