[YSQL] Create database query fails with Namespace Create Failed: not onlined #22177

Open
1 task done
shishir2001-yb opened this issue Apr 29, 2024 · 2 comments
Labels: area/ysql, kind/bug, priority/medium, qa_stress, QA, status/awaiting-triage

Comments

shishir2001-yb commented Apr 29, 2024

Jira Link: DB-11104

Description

Tried on version 2024.1.0.0-b102
Logs:

A CREATE DATABASE query issued during a backup restore failed with the error below. A database with the same name had been dropped ~14 minutes earlier.
DB Name: postgres_99

Namespace Create Failed: not onlined.

Test details:

Test Description:
        1. Create a cluster with required g-flags
        2. Start the cross DB DDL workload which will execute DDLs and DMLs across databases concurrently (50 colocated
           database and 100 non-colocated database), run this for 20-30 mins
        3. Start a while loop and run it for 120 mins
          a. Create a backup of 1 random database
          b. Start the cross DB DDL workload and stop it after 10 mins
          c. Start the cross DB DDL workload and run it for 10 mins
          d. Drop the database 
          e. Sleep for ~14 mins 
          f. Restore the backup <<<<<<<<<<<<<<<<<< FAILS HERE >>>>>>>>>>>>>>>>>>>>>>>>>>.
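
For readers who want to reproduce the loop in step 3, the control flow can be sketched roughly as below. This is only an illustration of the sequence described above, not the actual test harness: `run_backup_restore_loop` and every callable passed into it (`backup`, `restore`, `drop_database`, `run_workload`, `sleep`, `now`) are hypothetical names that stand in for whatever the stress framework really uses.

```python
import random
from typing import Callable, Optional, Sequence


def run_backup_restore_loop(
    databases: Sequence[str],
    backup: Callable[[str], str],          # hypothetical: returns a backup id
    restore: Callable[[str], None],        # hypothetical: restores from a backup id
    drop_database: Callable[[str], None],  # hypothetical
    run_workload: Callable[[float], None],  # hypothetical: runs for N seconds
    sleep: Callable[[float], None],
    now: Callable[[], float],
    total_minutes: float = 120,
    rng: Optional[random.Random] = None,
) -> int:
    """Repeat steps 3a-3f until total_minutes have elapsed.

    Returns the number of completed iterations.
    """
    rng = rng or random.Random()
    deadline = now() + total_minutes * 60
    iterations = 0
    while now() < deadline:
        db = rng.choice(list(databases))
        backup_id = backup(db)   # 3a: back up one random database
        run_workload(10 * 60)    # 3b: workload for 10 minutes
        run_workload(10 * 60)    # 3c: workload for another 10 minutes
        drop_database(db)        # 3d
        sleep(14 * 60)           # 3e: roughly 14 minutes
        restore(backup_id)       # 3f: the step that failed in this report
        iterations += 1
    return iterations
```

Passing the clock and sleep functions in as parameters is just a convenience so the loop can be dry-run without waiting two hours.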

G-flags:

tserver_gflags={
    "ysql_enable_packed_row": "true",
    "ysql_enable_packed_row_for_colocated_table": "true",
    "enable_automatic_tablet_splitting": "true",
    "ysql_max_connections": "500",
    "client_read_write_timeout_ms": str(30 * 60 * 1000),
    "yb_client_admin_operation_timeout_sec": str(30 * 60),
    "consistent_restore": "true",
    "ysql_enable_db_catalog_version_mode": "true",
    "tablet_replicas_per_gib_limit": 0,
    "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true",
    "log_ysql_catalog_versions": "true"
},
master_gflags={
    "ysql_enable_packed_row": "true",
    "ysql_enable_packed_row_for_colocated_table": "true",
    "enable_automatic_tablet_splitting": "true",
    "consistent_restore": "true",
    "ysql_enable_db_catalog_version_mode": "true",
    "tablet_replicas_per_gib_limit": 0,
    "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true",
    "log_ysql_catalog_versions": "true"
}

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
shishir2001-yb added area/ysql Yugabyte SQL (YSQL) QA QA filed bugs status/awaiting-triage Issue awaiting triage qa_stress Bugs identified via Stress automation 2024.1_blocker labels Apr 29, 2024
yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Apr 29, 2024
Contributor

myang2021 commented Apr 29, 2024

Related logs:

  1. tserver log:
./Universe_logs/172.151.30.222/tserver/yb-tserver.ip-172-151-30-222.us-west-2.compute.internal.yugabyte.log.INFO.20240425-162955.31769:I0425 17:00:42.809218 38746 client_master_rpc.cc:77] 0x000017f674a51920 -> IsCreateNamespaceDone: Failed, got resp error: Internal error (yb/master/catalog_manager.cc:9366): Namespace Create Failed: not onlined.
  2. PG log:
./Universe_logs/172.151.30.222/tserver/postgresql-2024-04-25_164530.log:2024-04-25 17:00:42.818 UTC [911524] ERROR:  Namespace Create Failed: not onlined.

Apparently, the error is passed back from tserver to PG.

  3. master log:
./Universe_logs/172.151.30.222/master/yb-master.ip-172-151-30-222.us-west-2.compute.internal.yugabyte.log.INFO.20240425-165150.31325:W0425 17:00:41.807782 911535 catalog_manager.cc:9230] Service unavailable (yb/tablet/operations/operation_tracker.cc:190): Error copying PGSQL system tables for pending namespace: Operation of type kWrite failed: tablet 00000000000000000000000000000000 hit the limit 1622684467 of memory tracker 0x00002797bfda4520 -> root while trying to consume an additional 393916 bytes; the memory tracker had already given out 1735573504 bytes.

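The master log appears to show the root cause of the failure: the master hit the limit of its root memory tracker (1622684467 bytes) while copying the PGSQL system tables for the pending namespace, so the create namespace operation failed. The root limit is logged at master startup: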

./Universe_logs/172.151.30.222/master/yb-master.ip-172-151-30-222.us-west-2.compute.internal.yugabyte.log.INFO.20240425-132437.31325.gz:I0425 13:24:37.393837 31325 mem_tracker.cc:268] Root memory limit is 1622684467
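
To make the numbers in the master error easier to read, here is a small sketch that pulls the limit, the requested bytes, and the bytes already given out from the message quoted above. The regular expression is an assumption derived from this single log line, not a documented log format, and `parse_mem_limit_error` is a hypothetical helper name.

```python
import re

# Message text copied from the master log line quoted above.
MASTER_LOG_MESSAGE = (
    "Operation of type kWrite failed: tablet 00000000000000000000000000000000 "
    "hit the limit 1622684467 of memory tracker 0x00002797bfda4520 -> root "
    "while trying to consume an additional 393916 bytes; "
    "the memory tracker had already given out 1735573504 bytes."
)

# Assumed message shape, inferred from the one line in this issue.
PATTERN = re.compile(
    r"hit the limit (?P<limit>\d+) of memory tracker \S+ -> (?P<tracker>\S+) "
    r"while trying to consume an additional (?P<requested>\d+) bytes; "
    r"the memory tracker had already given out (?P<consumed>\d+) bytes"
)


def parse_mem_limit_error(message: str) -> dict:
    """Extract the byte counts from a memory-tracker limit error message."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("message does not look like a memory-tracker limit error")
    limit = int(match["limit"])
    consumed = int(match["consumed"])
    return {
        "tracker": match["tracker"],
        "limit": limit,
        "requested": int(match["requested"]),
        "consumed": consumed,
        # Positive when the tracker is already past its limit.
        "over_limit": consumed - limit,
    }


if __name__ == "__main__":
    info = parse_mem_limit_error(MASTER_LOG_MESSAGE)
    mib = 1024 * 1024
    print(f"tracker:    {info['tracker']}")
    print(f"limit:      {info['limit'] / mib:.1f} MiB")
    print(f"consumed:   {info['consumed'] / mib:.1f} MiB")
    print(f"over limit: {info['over_limit'] / mib:.1f} MiB")
```

For this line the tracker had already given out 112889037 bytes (about 107.7 MiB) more than the limit before the additional 393916-byte request arrived, so the master was past its cap rather than just touching it.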

@myang2021
Contributor

So, we create 150 databases (100 normal databases and 50 colocated databases), and continuously perform 20-25 parallel DDLs in this test. At the end of step 3, we drop the database and try to restore it from the backup created at the start of step 3.

Instance type: c6g.2xlarge

When running the test on a larger instance (c6g.4xlarge), we didn't see this issue.
On the smaller instance with the PITR steps included (the test originally also did PITR snapshots and restores), we didn't see this issue either.

One possible explanation: now that the test no longer performs PITR, we may no longer be removing DDL-related data from the master (which happened at PITR restore time). That might explain why the master runs out of memory once the PITR steps are removed.
