[Platform] Update cluster_health script to screen out non-local volumes #5246

Closed
ajcaldera1 opened this issue Jul 28, 2020 · 0 comments
ajcaldera1 commented Jul 28, 2020

If non-local volumes (e.g. NFS) are mounted on a host, the cluster_health.py script may fail when it encounters a stale NFS handle. We should enhance the script to examine only local volumes (`df -l`) or to exclude NFS (`df -x nfs`) so that false alerts are not generated.
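
As an illustration of the two options above, here is a minimal sketch that builds the `df` argument list. The helper name `build_df_command` and its parameters are hypothetical and not taken from cluster_health.py; the `-h` flag is assumed only because the test plans in this issue use `df -h`.

```python
def build_df_command(local_only=True, exclude_types=()):
    """Return an argument list for df restricted to the wanted file systems.

    Hypothetical helper for illustration; not the actual cluster_health.py code.
    """
    cmd = ["df", "-h"]
    if local_only:
        # -l limits the listing to local file systems.
        cmd.append("-l")
    for fs_type in exclude_types:
        # -x TYPE excludes file systems of the given type (e.g. nfs).
        cmd.extend(["-x", fs_type])
    return cmd


print(build_df_command())
print(build_df_command(local_only=False, exclude_types=("nfs",)))
```

With `-l`, every network file system is skipped in one go, whereas `-x` needs one entry per type to exclude.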

Aha! Link: https://yugabyte-test.aha.io/features/PLATFORM-64

@ajcaldera1 ajcaldera1 added the area/platform Yugabyte Platform label Jul 28, 2020
@ajcaldera1 ajcaldera1 added this to the v2.2.x milestone Jul 28, 2020
@bmatican bmatican added this to To do in Platform Aug 4, 2020
@streddy-yb streddy-yb assigned SergeyPotachev and unassigned Arnav15 Sep 17, 2020
@streddy-yb streddy-yb modified the milestones: v2.2.x, 2.5.x Nov 2, 2020
@SergeyPotachev SergeyPotachev moved this from To do to In progress in Platform Mar 5, 2021
SergeyPotachev added a commit that referenced this issue Mar 9, 2021
… "safe" error in df #7402

Summary:
  - cluster_health.py: check_disk_utilization() now skips leading lines that start
    with 'df: ...' (such as `df: ‘/mnt’: Input/output error` or `df: ‘/mnt’: Stale file handle`).

Later, as part of "[Platform] Update cluster_health script to screen out non-local volumes #5246", I plan to exclude network resources from the output of the `df` command. I'm not doing it in this diff because I want additional protection against other possible errors that are unrelated to network resources but also appear as 'df: ...' messages.
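
The leading-line skip described in the summary could look roughly like the sketch below. The function name `skip_leading_df_errors` and the sample output are assumptions made for illustration; the real check_disk_utilization() may be structured differently.

```python
from itertools import dropwhile


def skip_leading_df_errors(lines):
    """Drop leading lines that look like df error messages ('df: ...')."""
    return list(dropwhile(lambda line: line.startswith("df: "), lines))


# Made-up sample output with one error line ahead of the regular table.
sample = [
    "df: ‘/mnt’: Stale file handle",
    "Filesystem      Size  Used Avail Use% Mounted on",
    "/dev/sda1        50G   20G   30G  40% /",
]
for line in skip_leading_df_errors(sample):
    print(line)
```

Only the leading run of `df: ` lines is removed, so the header and data rows that follow are left untouched.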

Test Plan:
Test scenario:
1. Create a universe with three nodes. We will configure an NFS share on node 1 and mount it on node 2.
2. Connect to node 1. Create the NFS share:
```
yum install -y nfs-utils
mkdir -p /nfs/share
echo '/nfs/share *(rw)' >> /etc/exports
systemctl start nfs
```
3. Connect to node 2. Mount the NFS share:
```
mount -o soft,timeo=10 AAA.BBB.CCC.DDD:/nfs/share /mnt
```
where AAA.BBB.CCC.DDD is the IP address of node 1.
Check that the mounted share appears in the output of `df -h`.

4. Wait until the health check completes for the universe. Verify that the mounted share is listed in the report.
5. Connect to node 1. Stop the NFS service:
```
systemctl stop nfs
```
6. Connect to node 2. Run `df -h`. Verify that the output contains the error string: `df: ‘/mnt’: Input/output error`
7. Wait for the next health check. Verify that it passes and that the mounted share no longer appears in the report.

Important:
Perform step 5 shortly before the next health check runs; otherwise the health check will report `timeout occurred` for the `df -h` command instead of the expected error.

Reviewers: arnav, daniel

Reviewed By: daniel

Subscribers: jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D10787
SergeyPotachev added a commit that referenced this issue Mar 9, 2021
…l in case of "safe" error in df #7402

Summary:
cluster_health.py: check_disk_utilization() now suppresses all errors, including problems with NFS mounts (such as `df: ‘/mnt’: Input/output error` or `df: ‘/mnt’: Stale file handle`).

Later, as part of "[Platform] Update cluster_health script to screen out non-local
volumes #5246", I plan to exclude network resources from the output of the `df` command.
I'm not doing it in this diff because I want additional protection against other
possible errors that are unrelated to network resources but also appear as 'df: ...' messages.
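
One way to suppress every `df` error at once is to discard the command's standard error stream and tolerate a non-zero exit code. The sketch below is an assumption about how that could be done with Python's `subprocess` module; the helper name `run_ignoring_stderr` and the 10-second timeout are hypothetical, and the actual diff may take a different approach.

```python
import subprocess


def run_ignoring_stderr(cmd, timeout_sec=10):
    """Run cmd and return its stdout lines.

    stderr is discarded and a non-zero exit code is tolerated, so messages
    such as "df: '/mnt': Stale file handle" never reach the caller.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
        timeout=timeout_sec,
    )
    return result.stdout.splitlines()
```

A call such as `run_ignoring_stderr(["df", "-h"])` would then return only the table that `df` prints to standard output. A hung mount can still trigger `subprocess.TimeoutExpired`, which this sketch deliberately leaves to the caller.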

Original diff: https://phabricator.dev.yugabyte.com/D10787

Test Plan:
Jenkins: rebase: 2.4

Test scenario:
1. Create a universe with three nodes. We will configure an NFS share on node 1 and mount it on node 2.
2. Connect to node 1. Create the NFS share:
```
yum install -y nfs-utils
mkdir -p /nfs/share
echo '/nfs/share *(rw)' >> /etc/exports
systemctl start nfs
```
3. Connect to node 2. Mount the NFS share:
```
mount -o soft,timeo=10 AAA.BBB.CCC.DDD:/nfs/share /mnt
```
where AAA.BBB.CCC.DDD is the IP address of node 1.
Check that the mounted share appears in the output of `df -h`.

4. Wait until the health check completes for the universe. Verify that the mounted share is listed in the report.
5. Connect to node 1. Stop the NFS service:
```
systemctl stop nfs
```
6. Connect to node 2. Run `df -h`. Verify that the output contains the error string: `df: ‘/mnt’: Input/output error`
7. Wait for the next health check. Verify that it passes and that the mounted share no longer appears in the report.

Important:
Perform step 5 shortly before the next health check runs; otherwise the health check will report `timeout occurred` for the `df -h` command instead of the expected error.

Reviewers: daniel

Reviewed By: daniel

Subscribers: jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D10846
SergeyPotachev added a commit that referenced this issue Mar 9, 2021
…l in case of "safe" error in df #7402

Summary:
cluster_health.py: check_disk_utilization() now suppresses all errors, including problems
with NFS mounts (such as `df: ‘/mnt’: Input/output error` or `df: ‘/mnt’: Stale file handle`).

Later, as part of "[Platform] Update cluster_health script to screen out non-local volumes #5246",
I plan to exclude network resources from the output of the `df` command. I'm not doing it in
this diff because I want additional protection against other possible errors that are unrelated
to network resources but also appear as 'df: ...' messages.

Original diff: https://phabricator.dev.yugabyte.com/D10787

Test Plan:
Jenkins: rebase: 2.2

Test scenario:
1. Create a universe with three nodes. We will configure an NFS share on node 1 and mount it on node 2.
2. Connect to node 1. Create the NFS share:
```
yum install -y nfs-utils
mkdir -p /nfs/share
echo '/nfs/share *(rw)' >> /etc/exports
systemctl start nfs
```
3. Connect to node 2. Mount the NFS share:
```
mount -o soft,timeo=10 AAA.BBB.CCC.DDD:/nfs/share /mnt
```
where AAA.BBB.CCC.DDD is the IP address of node 1.
Check that the mounted share appears in the output of `df -h`.

4. Wait until the health check completes for the universe. Verify that the mounted share is listed in the report.
5. Connect to node 1. Stop the NFS service:
```
systemctl stop nfs
```
6. Connect to node 2. Run `df -h`. Verify that the output contains the error string: `df: ‘/mnt’: Input/output error`
7. Wait for the next health check. Verify that it passes and that the mounted share no longer appears in the report.

Important:
Perform step 5 shortly before the next health check runs; otherwise the health check will report `timeout occurred` for the `df -h` command instead of the expected error.

Reviewers: daniel

Reviewed By: daniel

Subscribers: jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D10847
@SergeyPotachev SergeyPotachev moved this from In progress to In Review in Platform Mar 10, 2021
SergeyPotachev added a commit that referenced this issue Mar 12, 2021
…l volumes

Summary:
  - Added the `-l` option to the `df` command.

Test Plan:
Test scenario:

1. Create a universe;
2. Create an NFS mount point on one of the nodes (see the instructions in https://phabricator.dev.yugabyte.com/D10787);
3. Wait until the health check completes;
4. Check that the "Disk utilization" section of the email report does not list the NFS mount.
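
The check in step 4 can also be approximated on the node itself by parsing `df -l` style output and confirming that the NFS mount point is absent. This is only a sketch: `parse_df_usage` is a hypothetical helper, the sample text is made up, and mount points containing spaces are not handled.

```python
def parse_df_usage(df_output):
    """Map each mount point to its use percentage, skipping non-data lines."""
    usage = {}
    for line in df_output.splitlines():
        fields = line.split()
        # Data rows have at least six columns and a 'NN%' value before the mount point.
        if len(fields) < 6 or not fields[-2].endswith("%"):
            continue
        usage[fields[-1]] = int(fields[-2].rstrip("%"))
    return usage


# Made-up example of what local-only output might look like.
sample_output = """Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   20G   30G  40% /
/dev/sdb1       100G   71G   29G  71% /mnt/d0
"""
usage = parse_df_usage(sample_output)
print(usage)
print("/mnt" in usage)
```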

Reviewers: daniel

Reviewed By: daniel

Subscribers: jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D10863
@SergeyPotachev SergeyPotachev moved this from In Review to Needs QA/Docs in Platform Mar 12, 2021
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
…non-local volumes
