[Platform] Update cluster_health script to screen out non-local volumes #5246

ajcaldera1 · 2020-07-28T18:38:59Z

In the cluster_health.py script, if non-local (e.g. NFS) volumes are present on a host, the script may fail when it encounters a stale NFS handle. We should enhance the script to examine only local volumes (df -l) or to exclude NFS (df -x nfs) so that false alerts are not generated.

Aha! Link: https://yugabyte-test.aha.io/features/PLATFORM-64
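The two options named in the description could be sketched roughly as below. This is only an illustration: `build_df_command` and its parameters are hypothetical names, not code from cluster_health.py, and it assumes a GNU coreutils `df`, where `-l` limits the listing to local file systems and `-x TYPE` excludes a file system type. Note that NFSv4 mounts are typically reported with type `nfs4`, so excluding `nfs` alone may not cover them.

```python
def build_df_command(local_only=True, exclude_types=()):
    """Build a df argument list (hypothetical helper, for illustration only).

    local_only    -> adds `-l`, so only local file systems are listed
    exclude_types -> adds one `-x TYPE` per entry, e.g. "nfs"
    """
    cmd = ["df"]
    if local_only:
        cmd.append("-l")
    for fs_type in exclude_types:
        cmd.extend(["-x", fs_type])
    return cmd


# Option 1 from the description: local volumes only.
local_cmd = build_df_command()

# Option 2 from the description: keep everything except NFS.
no_nfs_cmd = build_df_command(local_only=False, exclude_types=["nfs"])
```

Option 1 is the simpler of the two because it does not require keeping a list of network file system types up to date.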

… "safe" error in df #7402

Summary:
- cluster_health.py, check_disk_utilization() now skips the leading lines of the output if they start with 'df: ...' (like `df: ‘/mnt’: Input/output error` or `df: ‘/mnt’: Stale file handle`).

Later, as part of "[Platform] Update cluster_health script to screen out non-local volumes #5246", I'm going to exclude network resources from the output of the `df` command. I'm not doing it in this diff because I want to add protection against other possible errors that are not related to network resources but also appear as 'df: ...' messages.

Test Plan:
Test scenario:
1. Create a universe with three nodes. We will configure an NFS resource on node 1 and mount it on node 2.
2. Connect to node 1. Create the NFS resource:
```
yum install -y nfs-utils
mkdir -p /nfs/share
echo '/nfs/share *(rw)' >> /etc/exports
systemctl start nfs
```
3. Connect to node 2. Mount the created NFS resource:
```
mount -o soft,timeo=10 AAA.BBB.CCC.DDD:/nfs/share /mnt
```
where AAA.BBB.CCC.DDD is the IP address of node 1. Check that the mounted resource appears in the output of `df -h`.
4. Wait until the health check completes for the universe. Verify that the mounted resource appears in the report.
5. Connect to node 1. Turn off the NFS service:
```
systemctl stop nfs
```
6. Connect to node 2. Run `df -h`. Verify that the result contains the error string: `df: ‘/mnt’: Input/output error`
7. Wait for the next health check. Verify that it is OK and that the mounted resource has disappeared from the report.

Important: It is better to do step 5 close to the next health-check time (otherwise we will get `timeout occurred` for the command `df -h` instead of the expected error).

Reviewers: arnav, daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10787
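The behaviour described in the summary above (skipping leading `df: ...` lines before parsing) might look something like the following. This is a minimal sketch, not the actual check_disk_utilization() code: the function name `drop_leading_df_errors` is made up, and it assumes the script sees the `df` diagnostics in the same captured text as the regular output.

```python
def drop_leading_df_errors(lines):
    """Return `lines` without any leading 'df: ...' diagnostic lines.

    Hypothetical helper: only the leading run of diagnostics is skipped,
    matching the "skips the leading lines" wording of the summary.
    """
    idx = 0
    while idx < len(lines) and lines[idx].startswith("df: "):
        idx += 1
    return lines[idx:]


# Sample captured output with a stale NFS mount reported first.
sample = [
    "df: ‘/mnt’: Stale file handle",
    "Filesystem      Size  Used Avail Use% Mounted on",
    "/dev/sda1        50G   20G   30G  40% /",
]
cleaned = drop_leading_df_errors(sample)
```

After this step the first remaining line is the usual `Filesystem ...` header, so the existing parsing can proceed as if the error had not occurred.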

…l in case of "safe" error in df #7402

Summary:
cluster_health.py, check_disk_utilization() now suppresses all errors, including problems with NFS mounts (like `df: ‘/mnt’: Input/output error` or `df: ‘/mnt’: Stale file handle`).

Later, as part of "[Platform] Update cluster_health script to screen out non-local volumes #5246", I'm going to exclude network resources from the output of the `df` command. I'm not doing it in this diff because I want to add protection against other possible errors that are not related to network resources but also appear as 'df: ...' messages.

Original diff: https://phabricator.dev.yugabyte.com/D10787

Test Plan:
Jenkins: rebase: 2.4

Test scenario:
1. Create a universe with three nodes. We will configure an NFS resource on node 1 and mount it on node 2.
2. Connect to node 1. Create the NFS resource:
```
yum install -y nfs-utils
mkdir -p /nfs/share
echo '/nfs/share *(rw)' >> /etc/exports
systemctl start nfs
```
3. Connect to node 2. Mount the created NFS resource:
```
mount -o soft,timeo=10 AAA.BBB.CCC.DDD:/nfs/share /mnt
```
where AAA.BBB.CCC.DDD is the IP address of node 1. Check that the mounted resource appears in the output of `df -h`.
4. Wait until the health check completes for the universe. Verify that the mounted resource appears in the report.
5. Connect to node 1. Turn off the NFS service:
```
systemctl stop nfs
```
6. Connect to node 2. Run `df -h`. Verify that the result contains the error string: `df: ‘/mnt’: Input/output error`
7. Wait for the next health check. Verify that it is OK and that the mounted resource has disappeared from the report.

Important: It is better to do step 5 close to the next health-check time (otherwise we will get

Reviewers: daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10846

…l in case of "safe" error in df #7402

Summary:
cluster_health.py, check_disk_utilization() now suppresses all errors, including problems with NFS mounts (like `df: ‘/mnt’: Input/output error` or `df: ‘/mnt’: Stale file handle`).

Later, as part of "[Platform] Update cluster_health script to screen out non-local volumes #5246", I'm going to exclude network resources from the output of the `df` command. I'm not doing it in this diff because I want to add protection against other possible errors that are not related to network resources but also appear as 'df: ...' messages.

Original diff: https://phabricator.dev.yugabyte.com/D10787

Test Plan:
Jenkins: rebase: 2.2

Test scenario:
1. Create a universe with three nodes. We will configure an NFS resource on node 1 and mount it on node 2.
2. Connect to node 1. Create the NFS resource:
```
yum install -y nfs-utils
mkdir -p /nfs/share
echo '/nfs/share *(rw)' >> /etc/exports
systemctl start nfs
```
3. Connect to node 2. Mount the created NFS resource:
```
mount -o soft,timeo=10 AAA.BBB.CCC.DDD:/nfs/share /mnt
```
where AAA.BBB.CCC.DDD is the IP address of node 1. Check that the mounted resource appears in the output of `df -h`.
4. Wait until the health check completes for the universe. Verify that the mounted resource appears in the report.
5. Connect to node 1. Turn off the NFS service:
```
systemctl stop nfs
```
6. Connect to node 2. Run `df -h`. Verify that the result contains the error string: `df: ‘/mnt’: Input/output error`
7. Wait for the next health check. Verify that it is OK and that the mounted resource has disappeared from the report.

Important: It is better to do step 5 close to the next health-check time (otherwise we will get `timeout occurred` for the command `df -h` instead of the expected error).

Reviewers: daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10847

…l volumes

Summary:
- Added the `-l` flag to the `df` command.

Test Plan:
Test scenario:
1. Create a universe;
2. Create an NFS mount point on one of the nodes (see instructions in https://phabricator.dev.yugabyte.com/D10787);
3. Wait until the `health-check` completes;
4. Check that the "Disk utilization" section of the email report doesn't include the NFS mount.

Reviewers: daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10863

…non-local volumes

Summary:
- Added the `-l` flag to the `df` command.

Test Plan:
Test scenario:
1. Create a universe;
2. Create an NFS mount point on one of the nodes (see instructions in https://phabricator.dev.yugabyte.com/D10787);
3. Wait until the `health-check` completes;
4. Check that the "Disk utilization" section of the email report doesn't include the NFS mount.

Reviewers: daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10863
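For readers who want to see the effect of the `-l` change end to end, here is a rough sketch of running `df` for local volumes only and turning its output into per-mount utilization. The names `DF_LOCAL_CMD`, `parse_df_usage` and `local_disk_usage`, the `-h` flag and the 10-second timeout are all assumptions for illustration; the real check_disk_utilization() may be structured quite differently.

```python
import subprocess

# Assumed command line: `-l` restricts df to local file systems.
DF_LOCAL_CMD = ["df", "-l", "-h"]


def parse_df_usage(output):
    """Map mount point -> integer Use% from df text output.

    Lines that do not look like data rows (the header, 'df: ...'
    diagnostics, blank lines) are skipped rather than raising.
    """
    usage = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        try:
            percent = int(fields[4].rstrip("%"))
        except ValueError:
            # The header row has "Use%" in this column; ignore it.
            continue
        # Re-join the tail in case the mount point contains spaces.
        usage[" ".join(fields[5:])] = percent
    return usage


def local_disk_usage(timeout=10):
    """Run df for local volumes and parse it (timeout value is a guess)."""
    result = subprocess.run(
        DF_LOCAL_CMD, capture_output=True, text=True, timeout=timeout
    )
    return parse_df_usage(result.stdout)


sample_output = """Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   20G   30G  40% /
/dev/sdb1       100G   12G   88G  12% /mnt/d0
"""
sample_usage = parse_df_usage(sample_output)
```

Because the NFS mount is never listed with `-l`, it cannot appear in the resulting mapping, which is what step 4 of the test plan checks in the email report.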

ajcaldera1 added the area/platform Yugabyte Platform label Jul 28, 2020

ajcaldera1 added this to the v2.2.x milestone Jul 28, 2020

ajcaldera1 assigned Arnav15 Jul 28, 2020

bmatican added this to To do in Platform Aug 4, 2020

streddy-yb assigned SergeyPotachev and unassigned Arnav15 Sep 17, 2020

streddy-yb modified the milestones: v2.2.x, 2.5.x Nov 2, 2020

SergeyPotachev mentioned this issue Mar 1, 2021

[yb-platform] parsing of df output is fragile and may fail in case of "safe" error in df #7402

Closed

SergeyPotachev moved this from To do to In progress in Platform Mar 5, 2021

SergeyPotachev moved this from In progress to In Review in Platform Mar 10, 2021

SergeyPotachev moved this from In Review to Needs QA/Docs in Platform Mar 12, 2021

streddy-yb closed this as completed May 23, 2021

ajcaldera1 commented Jul 28, 2020 • edited by chirag-yb