
Add pool health /proc entry, "SUSPENDED" pools #7563

Merged
merged 1 commit into openzfs:master on Jun 6, 2018

Conversation

@tonyhutter (Contributor) commented May 25, 2018

Description

Add a proc entry to display the pool's health:

$ cat /proc/spl/kstat/zfs/tank/health
ONLINE

This is done without using the spa config locks, so it will never hang.

Also, fix 'zpool status' and 'zpool list -o health' output to print "SUSPENDED" instead of "ONLINE" for suspended pools.
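For reference, a minimal sketch of how a lockless string kstat like this can be wired up through the SPL's raw kstat ops. The callback and helper names below are illustrative, not necessarily the ones in this PR, and the state-to-string helper is the spa_state_to_name() discussed later in the review:

/*
 * Sketch: register a per-pool "health" kstat whose read callback
 * formats the pool state on demand, without taking any spa config
 * locks. Teardown (kstat_delete) is omitted for brevity.
 */
static int
spa_health_data(char *buf, size_t size, void *data)
{
        spa_t *spa = (spa_t *)data;

        (void) snprintf(buf, size, "%s\n", spa_state_to_name(spa));
        return (0);
}

static void *
spa_health_addr(kstat_t *ksp, loff_t n)
{
        /* A single "row": hand the spa_t back to the data callback. */
        return (n == 0 ? ksp->ks_private : NULL);
}

static void
spa_health_init(spa_t *spa)
{
        char name[KSTAT_STRLEN];
        kstat_t *ksp;

        (void) snprintf(name, sizeof (name), "zfs/%s", spa_name(spa));
        ksp = kstat_create(name, 0, "health", "misc", KSTAT_TYPE_RAW,
            0, KSTAT_FLAG_VIRTUAL | KSTAT_FLAG_NO_HEADERS);
        if (ksp != NULL) {
                ksp->ks_private = spa;
                kstat_set_raw_ops(ksp, NULL, spa_health_data,
                    spa_health_addr);
                kstat_install(ksp);
        }
}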

Motivation and Context

#7331

How Has This Been Tested?

Added test case

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@@ -207,10 +208,53 @@ const char *zfs_history_event_names[ZFS_NUM_LEGACY_HISTORY_EVENTS] = {
"pool split",
};

#if defined(_KERNEL)
/* Dummy gettext() for kernel builds */
Contributor

The preferred method is to use:

#if defined(_KERNEL)
#define gettext(x) x
#endif

Contributor Author

Fixed in latest push

@tonyhutter force-pushed the proc-health branch 3 times, most recently from b9e2785 to c98aa0b on May 29, 2018 18:22
* "SUSPENDED", etc).
*/
const char *
zpool_get_health_str_from_zhp(zpool_handle_t *zhp)
Contributor

nit: I'd suggest dropping the _from_zhp suffix.

@@ -207,10 +208,48 @@ const char *zfs_history_event_names[ZFS_NUM_LEGACY_HISTORY_EVENTS] = {
"pool split",
};

#if defined(_KERNEL)
#define gettext(x) x
#endif
Contributor

We should avoid allowing the localization logic to seep down into the shared user/kernel code, which means in this case we're going to have to live with a little code duplication. I'd suggest moving zpool_state_to_name back to where it was and adding a new spa_state_to_name(spa_t *) function without gettext to spa_misc.c. This has the additional advantage of not disrupting the existing libzfs API.
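A sketch of what such a spa_state_to_name() could look like, assuming the pool state can be derived from the root vdev's state (the exact mapping below is illustrative):

const char *
spa_state_to_name(spa_t *spa)
{
        /*
         * No gettext() and no config locks, so this is safe to share
         * between kernel and userspace callers.
         */
        switch (spa->spa_root_vdev->vdev_state) {
        case VDEV_STATE_CLOSED:
        case VDEV_STATE_OFFLINE:
                return ("OFFLINE");
        case VDEV_STATE_REMOVED:
                return ("REMOVED");
        case VDEV_STATE_CANT_OPEN:
                return ("UNAVAIL");
        case VDEV_STATE_FAULTED:
                return ("FAULTED");
        case VDEV_STATE_DEGRADED:
                return ("DEGRADED");
        case VDEV_STATE_HEALTHY:
                return ("ONLINE");
        default:
                return ("UNKNOWN");
        }
}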

{
return (spa_suspended(spa) && (spa_get_failmode(spa)
!= ZIO_FAILURE_MODE_CONTINUE));
}
Contributor

After adding spa_state_to_name you might find it clearer to move this check and the associated strlcpy into the new function. Then you wouldn't need a conditional at all in spa_health_data.
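In other words, spa_state_to_name() could start with the suspended check itself (a partial sketch; the remainder is the vdev-state mapping shown earlier):

const char *
spa_state_to_name(spa_t *spa)
{
        /* Report SUSPENDED unless failmode=continue keeps I/O flowing. */
        if (spa_suspended(spa) &&
            (spa_get_failmode(spa) != ZIO_FAILURE_MODE_CONTINUE))
                return ("SUSPENDED");

        /* ... fall through to the vdev-state mapping above ... */
}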

@tonyhutter (Contributor Author)

@behlendorf thanks for looking at it. I included your changes in my latest push.

@tonyhutter force-pushed the proc-health branch 2 times, most recently from bb0e94f to 214c865 on May 30, 2018 00:14
@tonyhutter force-pushed the proc-health branch 2 times, most recently from 2558e24 to 0783f1b on May 30, 2018 00:38
@tonyhutter changed the base branch from proc-health to master on May 30, 2018 00:46
str = gettext("FAULTED");
} else if (status == ZPOOL_STATUS_IO_FAILURE_WAIT ||
status == ZPOOL_STATUS_IO_FAILURE_MMP) {
str = gettext("SUSPENDED");
Contributor

Since the suspended state is in addition to other states, do we need a separate kstat for suspension?

Contributor Author

I don't think so. The kstat is there to be a quick "is the pool healthy?" check for pacemaker to use. If you wanted to know why the pool was suspended, you could always look at the zpool status output.

Contributor

I wish it were that easy. In the zpool command, the suspension message is sent to stderr, so most screen-scrapers miss it.

An implementation could be as easy as exposing the value of spa->spa_suspended, and it can wait for another PR.

From a practical HA-implementation perspective, it is not always desirable to fail over if a pool is suspended. Hence we do want to know why, in addition to the rest of the state, of course.
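For example, such a follow-up kstat callback could be as simple as the following hypothetical sketch (spa_suspended_data() is not part of this PR):

static int
spa_suspended_data(char *buf, size_t size, void *data)
{
        spa_t *spa = (spa_t *)data;

        /* Raw suspension flag: 0 = not suspended, nonzero = suspended. */
        (void) snprintf(buf, size, "%u\n", (unsigned int)spa->spa_suspended);
        return (0);
}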

Contributor

> the zpool command suspension message is sent to stderr, so most screen-scrapers miss it.

@richardelling which zpool status suspension message are you referring to? My recollection is that this has always been a point of confusion for users because a pool may be suspended yet appear totally healthy in the zpool status output.

Contributor

Yes, it is confusing :-( zfs_standard_error[_fmt]() has many callers, and buried in there is the case where users are notified over stderr that the pool is suspended. In any case, I think exposing spa_suspended can be useful, but it belongs in another PR.

return ("DEGRADED");
case VDEV_STATE_HEALTHY:
return ("ONLINE");

Contributor

nit: extra whitespace

@@ -69,6 +69,7 @@
#define KSTAT_FLAG_WRITABLE 0x04
#define KSTAT_FLAG_PERSISTENT 0x08
#define KSTAT_FLAG_DORMANT 0x10
#define KSTAT_FLAG_NO_HEADERS 0x20
Contributor

We should keep this aligned with the Illumos ones for consistency and add yours to the end.

#define KSTAT_FLAG_VIRTUAL              0x01
#define KSTAT_FLAG_VAR_SIZE             0x02
#define KSTAT_FLAG_WRITABLE             0x04
#define KSTAT_FLAG_PERSISTENT           0x08
#define KSTAT_FLAG_DORMANT              0x10
#define KSTAT_FLAG_INVALID              0x20
#define KSTAT_FLAG_LONGSTRINGS          0x40

@tonyhutter (Contributor Author)

@behlendorf my latest push fixes your last two comments.

@@ -69,6 +69,9 @@
#define KSTAT_FLAG_WRITABLE 0x04
#define KSTAT_FLAG_PERSISTENT 0x08
#define KSTAT_FLAG_DORMANT 0x10
#define KSTAT_FLAG_INVALID 0x20
#define KSTAT_FLAG_LONGSTRINGS 0x40
#define KSTAT_FLAG_NO_HEADERS 0x60
Contributor

s/0x60/0x80
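(0x60 is 0x20 | 0x40, so it would alias KSTAT_FLAG_INVALID | KSTAT_FLAG_LONGSTRINGS rather than add a new flag bit; the next free bit is 0x80. The corrected define would be:)

#define KSTAT_FLAG_NO_HEADERS           0x80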

@@ -69,6 +69,9 @@
#define KSTAT_FLAG_WRITABLE 0x04
#define KSTAT_FLAG_PERSISTENT 0x08
#define KSTAT_FLAG_DORMANT 0x10
#define KSTAT_FLAG_INVALID 0x20
#define KSTAT_FLAG_LONGSTRINGS 0x40
#define KSTAT_FLAG_NO_HEADERS 0x60
#define KSTAT_FLAG_UNSUPPORTED \
(KSTAT_FLAG_VAR_SIZE | KSTAT_FLAG_WRITABLE | \
KSTAT_FLAG_PERSISTENT | KSTAT_FLAG_DORMANT)
Contributor

Can you also remove KSTAT_FLAG_UNSUPPORTED and its one use in __kstat_create?

@@ -304,6 +304,8 @@ typedef struct kstat32 {
#define KSTAT_FLAG_PERSISTENT 0x08
#define KSTAT_FLAG_DORMANT 0x10
#define KSTAT_FLAG_INVALID 0x20
#define KSTAT_FLAG_LONGSTRINGS 0x40
#define KSTAT_FLAG_NO_HEADERS 0x60
Contributor

s/0x60/0x80


@codecov

codecov bot commented Jun 1, 2018

Codecov Report

Merging #7563 into master will increase coverage by 0.08%.
The diff coverage is 85.48%.


@@            Coverage Diff             @@
##           master    #7563      +/-   ##
==========================================
+ Coverage   77.44%   77.53%   +0.08%     
==========================================
  Files         361      362       +1     
  Lines      110322   110338      +16     
==========================================
+ Hits        85442    85551     +109     
+ Misses      24880    24787      -93
Flag Coverage Δ
#kernel 78.14% <85.36%> (+0.16%) ⬆️
#user 66.5% <51.66%> (+0.12%) ⬆️


@behlendorf (Contributor) left a comment

Looks good after the ZTS fix.

@@ -0,0 +1,5 @@
pkgdatadir = $(datadir)/@PACKAGE@/zfs-tests/tests/functional/health
Contributor

Should be functional/kstat instead of functional/health, which caused them to be installed in the wrong location.
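That is, the Makefile.am line should read:

pkgdatadir = $(datadir)/@PACKAGE@/zfs-tests/tests/functional/kstat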

Contributor Author

Thanks, I'll fix that

@beren12 (Contributor) commented Jun 1, 2018

Would it be possible to use this to generate a proc file like mdstat, maybe /proc/zfsstat, that has some aggregated info on pool status? Maybe things like the number of vdevs and/or members, health, scrubbing vs. resilvering, degraded, etc.:

rpool: scrubbing [22%] zraid1 sdf1 sdj1 sda1 + mirror sdn1 sdd1
586006323200 512b blocks

Something more compact than zpool status but with more information than zpool list. If it gets too complex with expanders and all, maybe just list the number of devices present/total for each vdev.

@richardelling (Contributor)

@beren12 can you start a conversation on the mailing list regarding zpool status and procfs?

@tonyhutter (Contributor Author)

One more last-minute change: I'm going to rename the proc entry from health to state to match zpool status:

  pool: mypool
 state: ONLINE
  scan: none requested
config:

	NAME                             STATE     READ WRITE CKSUM
	mypool                           ONLINE       0     0     0
	  /home/hutter/proc-health/file  ONLINE       0     0     0

@ofaaland (Contributor) left a comment

Looks great! I have one question, about zpool_get_health_str().

I agree with the plan to call the kstat 'state' instead of 'health'.


if (zpool_get_state(zhp) == POOL_STATE_UNAVAIL) {
str = gettext("FAULTED");
} else if (status == ZPOOL_STATUS_IO_FAILURE_WAIT ||
Contributor

Are the contents of status really valid here? I don't see an assignment or a pointer being passed.

Contributor

Good gravy. The assignment is right there. Nevermind.
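For context, the surrounding code looks roughly like this sketch, with libzfs's zpool_get_status() supplying the status value in question:

        zpool_errata_t errata;
        zpool_status_t status;
        const char *str;

        /* The assignment referred to above. */
        status = zpool_get_status(zhp, NULL, &errata);

        if (zpool_get_state(zhp) == POOL_STATE_UNAVAIL) {
                str = gettext("FAULTED");
        } else if (status == ZPOOL_STATUS_IO_FAILURE_WAIT ||
            status == ZPOOL_STATUS_IO_FAILURE_MMP) {
                str = gettext("SUSPENDED");
        }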

@ofaaland (Contributor) left a comment

LGTM

1. Add a proc entry to display the pool's state:

$ cat /proc/spl/kstat/zfs/tank/state
ONLINE

This is done without using the spa config locks, so it will
never hang.

2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes openzfs#7331
@behlendorf behlendorf merged commit f0ed6c7 into openzfs:master Jun 6, 2018
@bmerchant

Can this be added to the 0.7.10 todo list?

tonyhutter added a commit to tonyhutter/zfs that referenced this pull request Aug 15, 2018
1. Add a proc entry to display the pool's state:

$ cat /proc/spl/kstat/zfs/tank/state
ONLINE

This is done without using the spa config locks, so it will
never hang.

2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes openzfs#7331
Closes openzfs#7563
tonyhutter added a commit to tonyhutter/zfs that referenced this pull request Aug 23, 2018
tonyhutter added a commit to tonyhutter/zfs that referenced this pull request Aug 27, 2018
tonyhutter added a commit to tonyhutter/zfs that referenced this pull request Sep 5, 2018
@gmelikov (Member)

JFYI: the updated ZFS resource agent was released in resource-agents v4.2.0: ClusterLabs/resource-agents@2bdeee4#diff-2f9687bda2dc6253e000b30aaea222d9
