[Platform] Update cluster_health script to screen out non-local volumes #5246
Comments
SergeyPotachev added a commit that referenced this issue on Mar 9, 2021
… "safe" error in df #7402

Summary:
- cluster_health.py: check_disk_utilization() now skips leading lines that start with 'df: ...' (for example `df: ‘/mnt’: Input/output error` or `df: ‘/mnt’: Stale file handle`). Later, as part of "[Platform] Update cluster_health script to screen out non-local volumes #5246", I'm going to exclude network resources from the output of the `df` command. I'm not doing it in this diff because I want additional protection against other possible errors that are not related to network resources but also appear as 'df: ...' messages.

Test Plan:
1. Create a universe with three nodes. We will configure an NFS resource on node 1 and mount it on node 2.
2. Connect to node 1. Create the NFS resource:
   ```
   yum install -y nfs-utils
   mkdir -p /nfs/share
   echo '/nfs/share *(rw)' >> /etc/exports
   systemctl start nfs
   ```
3. Connect to node 2. Mount the created NFS share:
   ```
   mount -o soft,timeo=10 AAA.BBB.CCC.DDD:/nfs/share /mnt
   ```
   where AAA.BBB.CCC.DDD is the IP address of node 1. Check that the mounted resource appears in the output of `df -h`.
4. Wait until the health check completes for the universe. Verify that the mounted resource is in the report.
5. Connect to node 1. Stop the NFS service:
   ```
   systemctl stop nfs
   ```
6. Connect to node 2. Run `df -h` and verify that the output contains the error string `df: ‘/mnt’: Input/output error`.
7. Wait for the next health check. Verify that it is OK and that the mounted resource has disappeared from the report.

Important: it is better to do step 5 close to the next health-check time (otherwise `df -h` reports `timeout occurred` instead of the expected error).

Reviewers: arnav, daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10787
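The skip-leading-error-lines behavior this commit describes can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not the actual cluster_health.py code:

```python
def strip_leading_df_errors(df_output):
    """Drop leading 'df: ...' error lines (e.g. stale NFS handles or
    I/O errors) so that only the real filesystem table is parsed."""
    lines = df_output.splitlines()
    start = 0
    # df prints error lines such as "df: '/mnt': Input/output error"
    # before the normal header; skip every leading line of that form.
    while start < len(lines) and lines[start].startswith("df:"):
        start += 1
    return lines[start:]
```

Filtering only the leading lines (rather than every line containing "df:") keeps the change conservative, which matches the stated goal of guarding against other 'df: ...' messages without touching the table itself.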
SergeyPotachev added a commit that referenced this issue on Mar 9, 2021
…l in case of "safe" error in df #7402

Summary: cluster_health.py: check_disk_utilization() now suppresses all errors, including problems with NFS mounts (such as `df: ‘/mnt’: Input/output error` or `df: ‘/mnt’: Stale file handle`). Later, as part of "[Platform] Update cluster_health script to screen out non-local volumes #5246", I'm going to exclude network resources from the output of the `df` command. I'm not doing it in this diff because I want additional protection against other possible errors that are not related to network resources but also appear as 'df: ...' messages.

Original diff: https://phabricator.dev.yugabyte.com/D10787

Test Plan: Jenkins: rebase: 2.4. The test scenario is the same as in the original diff (D10787).

Reviewers: daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10846
SergeyPotachev added a commit that referenced this issue on Mar 9, 2021
…l in case of "safe" error in df #7402

Summary: cluster_health.py: check_disk_utilization() now suppresses all errors, including problems with NFS mounts (such as `df: ‘/mnt’: Input/output error` or `df: ‘/mnt’: Stale file handle`). Later, as part of "[Platform] Update cluster_health script to screen out non-local volumes #5246", I'm going to exclude network resources from the output of the `df` command. I'm not doing it in this diff because I want additional protection against other possible errors that are not related to network resources but also appear as 'df: ...' messages.

Original diff: https://phabricator.dev.yugabyte.com/D10787

Test Plan: Jenkins: rebase: 2.2. The test scenario is the same as in the original diff (D10787).

Reviewers: daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10847
SergeyPotachev added a commit that referenced this issue on Mar 12, 2021
…l volumes

Summary: Added the `-l` flag to the `df` command.

Test Plan:
1. Create a universe.
2. Create an NFS mount point on one of the nodes (see the instructions in https://phabricator.dev.yugabyte.com/D10787).
3. Wait until the health check completes.
4. Check that the "Disk utilization" section of the email report does not include the mentioned NFS mount.

Reviewers: daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10863
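The shape of the `df -l`-based check can be sketched as below. The helper names are hypothetical and the real check_disk_utilization() in cluster_health.py may differ; the output-column positions assume GNU coreutils `df` with single-token device names:

```python
import subprocess

def parse_df_output(output):
    """Turn df's tabular output into (mount point, use%) pairs."""
    report = []
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 6:
            # Column 5 is "Mounted on", column 4 is "Use%".
            report.append((fields[5], fields[4]))
    return report

def local_disk_utilization():
    # -h: human-readable sizes; -l: local filesystems only, so NFS
    # mounts (and their stale-handle errors) never reach the parser.
    out = subprocess.run(["df", "-h", "-l"], capture_output=True,
                         text=True).stdout
    return parse_df_output(out)
```

Because `-l` removes network filesystems before `df` ever touches them, the parser no longer needs to defend against `df: ...` error lines from dead NFS servers.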
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue on May 26, 2021
…non-local volumes

Summary: Added the `-l` flag to the `df` command. (Same commit message and test plan as D10863 above.)

Differential Revision: https://phabricator.dev.yugabyte.com/D10863
In the cluster_health.py script, if there are non-local (e.g. NFS) volumes present on a host, the health check may fail when a stale NFS handle is encountered. We should enhance the script to examine only local volumes (`df -l`), or to exclude NFS mounts from consideration (`df -x nfs`), so that false alerts are not generated.
Aha! Link: https://yugabyte-test.aha.io/features/PLATFORM-64
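Either option proposed above can be expressed as a small command builder. A sketch with hypothetical names, assuming GNU coreutils `df` (which supports both `-l` and `-x TYPE`):

```python
def df_command(mode="local_only"):
    """Build a df invocation that keeps network volumes out of the
    health check, per the two options in this issue:
      - 'local_only':  df -l      (examine only local volumes)
      - 'exclude_nfs': df -x nfs  (exclude the nfs filesystem type)
    """
    if mode == "local_only":
        return ["df", "-h", "-l"]
    if mode == "exclude_nfs":
        return ["df", "-h", "-x", "nfs"]
    raise ValueError("unknown mode: %s" % mode)
```

`-l` is the broader net (it also drops nfs4, cifs, and other remote types in one flag), which is presumably why the committed fix chose it over per-type exclusion.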