[BACKPORT 2.4] [Platform] parsing of df output is fragile and may fail in case of "safe" error in df #7402

Summary:
In cluster_health.py, check_disk_utilization() now suppresses all errors reported by `df`, including problems with NFS mounts (like `df: ‘/mnt’: Input/output error` or `df: ‘/mnt’: Stale file handle`).

Later, as part of "[Platform] Update cluster_health script to screen out non-local
volumes #5246", I'm going to exclude network resources from the output of the `df` command.
I'm not doing it in this diff because I want to add protection against other
possible errors that are unrelated to network resources but also appear as 'df: ...' messages.
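For illustration, here is a minimal sketch of why a stray 'df: ...' line is a problem for a line-oriented parser and how such lines can be screened out. The function `parse_df_output` and the sample text are hypothetical and are not taken from cluster_health.py; the actual parsing code is not shown in this diff.

```python
# Hypothetical sample: `df -h` output with a diagnostic line mixed in,
# as can happen when stdout and stderr are captured together.
SAMPLE = """Filesystem      Size  Used Avail Use% Mounted on
df: ‘/mnt’: Input/output error
/dev/sda1        50G   20G   30G  40% /
tmpfs           1.9G     0  1.9G   0% /dev/shm
"""


def parse_df_output(text):
    """Parse `df -h`-style text into a list of dicts, one per data row."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the header row, and 'df: ...' diagnostics.
        if not line or line.startswith("Filesystem") or line.startswith("df:"):
            continue
        fields = line.split(None, 5)
        if len(fields) != 6:
            continue  # not a well-formed data row
        filesystem, size, used, avail, use_pct, mount = fields
        pct = use_pct.rstrip("%")
        rows.append({
            "filesystem": filesystem,
            "size": size,
            "used": used,
            "avail": avail,
            # Some pseudo filesystems report '-' instead of a percentage.
            "use_percent": int(pct) if pct.isdigit() else None,
            "mount": mount,
        })
    return rows


rows = parse_df_output(SAMPLE)
for row in rows:
    print(row["mount"], row["use_percent"])
```

Filtering in the parser and discarding stderr at the command level are complementary: the first guards against diagnostics that still reach the captured text, the second keeps them out of it in the first place.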

Original diff: https://phabricator.dev.yugabyte.com/D10787

Test Plan:
Jenkins: rebase: 2.4

Test scenario:
1. Create a universe with three nodes. We will configure an NFS share on node 1 and mount it on node 2.
2. Connect to node 1. Create the NFS resource:
```
yum install -y nfs-utils
mkdir -p /nfs/share
echo '/nfs/share *(rw)' >> /etc/exports
systemctl start nfs
```
3. Connect to node 2. Mount the NFS share created in step 2:
```
mount -o soft,timeo=10 AAA.BBB.CCC.DDD:/nfs/share /mnt
```
where AAA.BBB.CCC.DDD is the IP address of node 1.
Check that the mounted resource appears in the output of `df -h`.

4. Wait until the health check completes for the universe. Verify that the mounted resource is listed in the report.
5. Connect to node 1. Turn off the NFS service:
```
systemctl stop nfs
```
6. Connect to node 2. Execute command `df -h`. Verify that the result contains the error string: `df: ‘/mnt’: Input/output error`
7. Wait for the next health check. Verify that it passes and that the mounted resource no longer appears in the report.

Important:
It is better to do step 5 closer to the next health-check time (otherwise we will get

Reviewers: daniel

Reviewed By: daniel

Subscribers: jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D10846
SergeyPotachev committed Mar 9, 2021
1 parent 33be21f commit 64d5af2
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion managed/devops/bin/cluster_health.py
@@ -242,7 +242,7 @@ def _remote_check_output(self, command):
         return output

     def get_disk_utilization(self):
-        remote_cmd = 'df -h'
+        remote_cmd = 'df -h 2>/dev/null'
         return self._remote_check_output(remote_cmd)

     def check_disk_utilization(self):
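The effect of appending `2>/dev/null` can be sketched locally with `subprocess`. This is only an illustration under stated assumptions: the real script runs the command on a remote node through `_remote_check_output`, whose implementation is not shown here. The helper `run_discarding_stderr` and the stand-in script `FAKE_DF` are hypothetical; the stand-in imitates a `df` run that prints one data row, writes a diagnostic to stderr, and exits with a non-zero status.

```python
import subprocess
import sys

# Hypothetical stand-in for a `df` run that hits a stale mount:
# one data row on stdout, one diagnostic on stderr, exit status 1.
FAKE_DF = (
    "import sys\n"
    "print('/dev/sda1 50G 20G 30G 40% /')\n"
    "sys.stderr.write('df: /mnt: Stale file handle\\n')\n"
    "sys.exit(1)\n"
)


def run_discarding_stderr(argv):
    """Run argv and return (stdout_text, exit_code); stderr is dropped."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # same effect as the shell's 2>/dev/null
        universal_newlines=True,
        check=False,                # do not raise on a non-zero exit status
    )
    return proc.stdout, proc.returncode


out, code = run_discarding_stderr([sys.executable, "-c", FAKE_DF])
print(repr(out), code)
```

Note that discarding stderr hides the message but not the exit status. If the caller treats a non-zero status as a failure, that case needs to be handled separately.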
