hc fallback when static group has unknown status #3620

Merged

@StekPerepolnen StekPerepolnen commented Apr 10, 2024

#3541

HC now handles 3 cases:

  • all static groups have at least degraded or full status (everything is fine): same behavior as before
  • some static groups have UNKNOWN or DISINTEGRATED status according to BSC:
    HC starts sending whiteboard requests to gather information on these specific groups
  • there is no response from BSC within half of the timeout period:
    HC also starts sending whiteboard requests to gather information on static groups, and a new RED STORAGE issue reports that BSC information was missing

The static configuration comes from NodeWarden; there is no BSConfig in AppData (it can change on the fly).
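
The three cases above can be sketched roughly as follows. This is only an illustration of the decision logic, not the code from this PR: the function name `plan_whiteboard_fallback`, the string status values, and the choice to treat a group missing from the BSC reply as UNKNOWN are all assumptions.

```python
from typing import Dict, Optional, Set, Tuple

# Statuses that trigger the whiteboard fallback (taken from the description above).
NEEDS_FALLBACK = {"UNKNOWN", "DISINTEGRATED"}


def plan_whiteboard_fallback(
    bsc_statuses: Optional[Dict[int, str]],
    static_group_ids: Set[int],
) -> Tuple[Set[int], bool]:
    """Return (groups to query via whiteboard, whether to raise the missing-BSC issue).

    bsc_statuses is None when BSC did not answer within half of the timeout.
    """
    if bsc_statuses is None:
        # Case 3: no BSC information, so query every static group and raise the RED issue.
        return set(static_group_ids), True
    # Cases 1 and 2: query only the groups BSC reports as UNKNOWN or DISINTEGRATED.
    # Assumption: a group missing from the BSC reply is treated as UNKNOWN.
    suspicious = {
        group_id
        for group_id in static_group_ids
        if bsc_statuses.get(group_id, "UNKNOWN") in NEEDS_FALLBACK
    }
    return suspicious, False


if __name__ == "__main__":
    print(plan_whiteboard_fallback({0: "FULL"}, {0}))
    print(plan_whiteboard_fallback({0: "UNKNOWN", 1: "FULL"}, {0, 1}))
    print(plan_whiteboard_fallback(None, {0, 1}))
```

In the first case the returned set is empty, so nothing changes compared to the previous behavior.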

Testing

HC report for a specific database when there is no BSC

{
  "self_check_result": "EMERGENCY",
  "issue_log": [
    {
      "id": "RED-1c0c-be81",
      "status": "RED",
      "message": "Database has storage issues",
      "location": {
        "database": {
          "name": "/slice/db"
        }
      },
      "reason": [
        "RED-1c0c-53b5"
      ],
      "type": "DATABASE",
      "level": 1
    },
    {
      "id": "RED-1c0c-53b5",
      "status": "RED",
      "message": "System tablet BSC didn't provide information",
      "location": {
        "database": {
          "name": "/slice/db"
        }
      },
      "type": "STORAGE",
      "level": 2
    }
  ],
  "location": {
    "id": 5,
    "host": "man0-0028.ydb-dev.nemax.nebiuscloud.net",
    "port": 19001
  }
}

  • there is a RED issue about the lack of BSC information

HC report for the root database when there is no BSC

{
  "self_check_result": "EMERGENCY",
  "issue_log": [
    {
      "id": "RED-27c3-70fb",
      "status": "RED",
      "message": "Database has multiple issues",
      "location": {
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "RED-27c3-4e47",
        "RED-27c3-53b5",
        "YELLOW-27c3-5321"
      ],
      "type": "DATABASE",
      "level": 1
    },
    {
      "id": "RED-27c3-4e47",
      "status": "RED",
      "message": "Compute has issues with system tablets",
      "location": {
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "RED-27c3-c138-BSController"
      ],
      "type": "COMPUTE",
      "level": 2
    },
    {
      "id": "RED-27c3-c138-BSController",
      "status": "RED",
      "message": "System tablet is unresponsive",
      "location": {
        "compute": {
          "tablet": {
            "type": "BSController",
            "id": [
              "72057594037989391"
            ]
          }
        },
        "database": {
          "name": "/slice"
        }
      },
      "type": "SYSTEM_TABLET",
      "level": 3
    },
    {
      "id": "RED-27c3-53b5",
      "status": "RED",
      "message": "System tablet BSC didn't provide information",
      "location": {
        "database": {
          "name": "/slice"
        }
      },
      "type": "STORAGE",
      "level": 2
    },
    {
      "id": "YELLOW-27c3-5321",
      "status": "YELLOW",
      "message": "Storage degraded",
      "location": {
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "YELLOW-27c3-595f-8d1d"
      ],
      "type": "STORAGE",
      "level": 2
    },
    {
      "id": "YELLOW-27c3-595f-8d1d",
      "status": "YELLOW",
      "message": "Pool degraded",
      "location": {
        "storage": {
          "pool": {
            "name": "static"
          }
        },
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "YELLOW-27c3-ef3e-0"
      ],
      "type": "STORAGE_POOL",
      "level": 3
    },
    {
      "id": "RED-84d8-3-3-1",
      "status": "RED",
      "message": "PDisk is not available",
      "location": {
        "storage": {
          "node": {
            "id": 3,
            "host": "man0-0026.ydb-dev.nemax.nebiuscloud.net",
            "port": 19001
          },
          "pool": {
            "group": {
              "vdisk": {
                "pdisk": [
                  {
                    "id": "3-1",
                    "path": "/dev/disk/by-partlabel/NVMEKIKIMR01"
                  }
                ]
              }
            }
          }
        }
      },
      "type": "PDISK",
      "level": 6
    },
    {
      "id": "RED-27c3-4847-3-0-1-0-2-0",
      "status": "RED",
      "message": "VDisk is not available",
      "location": {
        "storage": {
          "node": {
            "id": 3,
            "host": "man0-0026.ydb-dev.nemax.nebiuscloud.net",
            "port": 19001
          },
          "pool": {
            "name": "static",
            "group": {
              "vdisk": {
                "id": [
                  "0-1-0-2-0"
                ]
              }
            }
          }
        },
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "RED-84d8-3-3-1"
      ],
      "type": "VDISK",
      "level": 5
    },
    {
      "id": "YELLOW-27c3-ef3e-0",
      "status": "YELLOW",
      "message": "Group degraded",
      "location": {
        "storage": {
          "pool": {
            "name": "static",
            "group": {
              "id": [
                "0"
              ]
            }
          }
        },
        "database": {
          "name": "/slice"
        }
      },
      "reason": [
        "RED-27c3-4847-3-0-1-0-2-0"
      ],
      "type": "STORAGE_GROUP",
      "level": 4
    }
  ],
  "location": {
    "id": 5,
    "host": "man0-0028.ydb-dev.nemax.nebiuscloud.net",
    "port": 19001
  }
}

  • the unresponsive BSC tablet is reported
  • there is a RED issue about the lack of BSC information
  • the issues on static group disks are reported properly
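
For checking reports like these by hand, a small script can follow the "reason" links from the level 1 issues down to the issues that have no further reasons. This is a minimal sketch that assumes only the JSON layout shown above; the helper name `root_causes` is made up for illustration, and the embedded report is a trimmed copy of the specific database report.

```python
import json
from typing import List


def root_causes(report: dict) -> List[dict]:
    """Return issues reachable from level 1 issues that have no further "reason" entries."""
    by_id = {issue["id"]: issue for issue in report["issue_log"]}
    leaves, seen = [], set()
    stack = [issue["id"] for issue in report["issue_log"] if issue.get("level") == 1]
    while stack:
        issue_id = stack.pop()
        if issue_id in seen or issue_id not in by_id:
            continue  # skip repeats and ids that are absent from the log
        seen.add(issue_id)
        issue = by_id[issue_id]
        reasons = issue.get("reason", [])
        if reasons:
            stack.extend(reasons)
        else:
            leaves.append(issue)
    return leaves


# Trimmed copy of the specific database report (location fields omitted).
report = json.loads("""
{
  "self_check_result": "EMERGENCY",
  "issue_log": [
    {"id": "RED-1c0c-be81", "status": "RED", "message": "Database has storage issues",
     "reason": ["RED-1c0c-53b5"], "type": "DATABASE", "level": 1},
    {"id": "RED-1c0c-53b5", "status": "RED",
     "message": "System tablet BSC didn't provide information",
     "type": "STORAGE", "level": 2}
  ]
}
""")

for issue in root_causes(report):
    print(issue["type"], issue["message"])
```

For this trimmed report the only leaf is the STORAGE issue about missing BSC information.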

github-actions bot commented Apr 10, 2024

2024-04-10 07:07:32 UTC Pre-commit check for 1c69ed5 has started.
2024-04-10 07:07:34 UTC Build linux-x86_64-relwithdebinfo is running...
🟢 2024-04-10 07:10:36 UTC Build successful.
2024-04-10 07:12:17 UTC Tests are running...
🔴 2024-04-10 08:15:07 UTC Some tests failed, follow the links below.

Test history

TESTS  PASSED  ERRORS  FAILED  SKIPPED  MUTED?
10069  9991    0       5       61       12

github-actions bot commented Apr 10, 2024

2024-04-10 07:08:26 UTC Pre-commit check for 1c69ed5 has started.
2024-04-10 07:08:29 UTC Build linux-x86_64-release-clang14 is running...
🟢 2024-04-10 07:18:38 UTC Build successful.

github-actions bot commented Apr 10, 2024

2024-04-10 07:09:30 UTC Pre-commit check for 1c69ed5 has started.
2024-04-10 07:09:32 UTC Build linux-x86_64-release-asan is running...
🟢 2024-04-10 07:12:16 UTC Build successful.
2024-04-10 07:14:01 UTC Tests are running...
🔴 2024-04-10 08:42:47 UTC Some tests failed, follow the links below.

Test history

TESTS  PASSED  ERRORS  FAILED  SKIPPED  MUTED?
8915   8711    0       38      148      18

@StekPerepolnen StekPerepolnen changed the title hc fallback whiteboard hc fallback when static group has unknown status Apr 10, 2024
@StekPerepolnen StekPerepolnen force-pushed the hc-fallback-whiteboard branch 2 times, most recently from d3a57de to a371c9f Compare April 23, 2024 11:01

github-actions bot commented Apr 23, 2024

2024-04-23 11:01:24 UTC Pre-commit check for 9f9e48d has started.
2024-04-23 11:01:25 UTC Check cancelled

github-actions bot commented Apr 23, 2024

2024-04-23 11:02:37 UTC Pre-commit check for 0c75606 has started.
2024-04-23 11:02:39 UTC Build linux-x86_64-release-clang14 is running...
🟢 2024-04-23 11:04:41 UTC Build successful.

github-actions bot commented Apr 23, 2024

2024-04-23 11:02:40 UTC Pre-commit check for 0c75606 has started.
2024-04-23 11:02:43 UTC Build linux-x86_64-release-asan is running...
🟢 2024-04-23 11:05:33 UTC Build successful.
2024-04-23 11:09:09 UTC Tests are running...
🔴 2024-04-23 11:28:05 UTC Test run completed, no test results found for commit a371c9f. Please check build logs.
2024-04-23 11:28:09 UTC Check cancelled

github-actions bot commented Apr 23, 2024

2024-04-23 11:03:18 UTC Pre-commit check for 0c75606 has started.
2024-04-23 11:03:20 UTC Build linux-x86_64-relwithdebinfo is running...
🟢 2024-04-23 11:05:30 UTC Build successful.
2024-04-23 11:09:02 UTC Tests are running...
🔴 2024-04-23 11:28:03 UTC Test run completed, no test results found for commit a371c9f. Please check build logs.
2024-04-23 11:28:07 UTC Check cancelled

github-actions bot commented Apr 23, 2024

2024-04-23 11:31:33 UTC Pre-commit check for c74a287 has started.
2024-04-23 11:31:34 UTC Build linux-x86_64-release-asan is running...
🟢 2024-04-23 11:34:07 UTC Build successful.
2024-04-23 11:35:47 UTC Tests are running...
🔴 2024-04-23 13:14:33 UTC Some tests failed, follow the links below.

Test history

TESTS  PASSED  ERRORS  FAILED  SKIPPED  MUTED?
8921   8773    0       43      87       18

github-actions bot commented Apr 23, 2024

2024-04-23 11:31:57 UTC Pre-commit check for c74a287 has started.
2024-04-23 11:32:01 UTC Build linux-x86_64-release-clang14 is running...
🟢 2024-04-23 11:34:30 UTC Build successful.

github-actions bot commented Apr 23, 2024

2024-04-23 11:31:59 UTC Pre-commit check for c74a287 has started.
2024-04-23 11:32:03 UTC Build linux-x86_64-relwithdebinfo is running...
🟢 2024-04-23 11:34:38 UTC Build successful.
2024-04-23 11:36:25 UTC Tests are running...
🔴 2024-04-23 12:50:40 UTC Some tests failed, follow the links below.

Test history

TESTS  PASSED  ERRORS  FAILED  SKIPPED  MUTED?
12870  10978   0       15      1860     17

@StekPerepolnen StekPerepolnen merged commit ce48e6c into ydb-platform:main May 13, 2024
3 of 5 checks passed
@StekPerepolnen StekPerepolnen deleted the hc-fallback-whiteboard branch May 13, 2024 10:47
MrLolthe1st pushed a commit to MrLolthe1st/ydb that referenced this pull request May 28, 2024