Introduce read/write kstats per dataset #7705

sdimitro · 2018-07-11T15:26:05Z

Description

The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved
by using the dataset's objset ID number and the pool's load_guid.
As an aside, the pool's load_guid is also exposed to userland by
this change, in the same way as other pool properties, and the man page
has been updated to describe the property.
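
For readers unfamiliar with the idea behind aggregate sums, the following is a rough userland sketch of a sharded counter: writers add to one of several slots so that they rarely contend, and a reader sums all slots on demand. It is not the ZFS aggsum implementation; the names `sharded_counter_t`, `sharded_counter_add()` and `NSHARDS` are hypothetical and chosen only for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

#define	NSHARDS	8	/* hypothetical shard count, e.g. one per CPU */

/* Hypothetical sharded counter; NOT the real aggsum_t. */
typedef struct sharded_counter {
	_Atomic uint64_t sc_shard[NSHARDS];
} sharded_counter_t;

static void
sharded_counter_init(sharded_counter_t *sc)
{
	for (int i = 0; i < NSHARDS; i++)
		atomic_init(&sc->sc_shard[i], 0);
}

/*
 * Hot path: add delta to a single shard. The caller passes a hint
 * (for example a CPU or thread index) to spread writers out.
 */
static void
sharded_counter_add(sharded_counter_t *sc, unsigned int hint, uint64_t delta)
{
	atomic_fetch_add_explicit(&sc->sc_shard[hint % NSHARDS], delta,
	    memory_order_relaxed);
}

/* Cold path: sum every shard, as a kstat reader would do on demand. */
static uint64_t
sharded_counter_value(sharded_counter_t *sc)
{
	uint64_t sum = 0;

	for (int i = 0; i < NSHARDS; i++)
		sum += atomic_load_explicit(&sc->sc_shard[i],
		    memory_order_relaxed);
	return (sum);
}
```

The trade-off in this sketch is that updates stay cheap while reads cost one pass over all shards, which suits counters that are written on every I/O but read only occasionally.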

Note: A PR for this change has also been opened for OpenZFS:
openzfs/openzfs#664

Example output:

$ cat /proc/spl/kstat/zfs/testpool/objset-0x4e
46 1 0x01 5 1360 65332747959 103755672238
name                            type data
writes                          4    0
nwritten                        4    0
reads                           4    3
nread                           4    12288
dataset_name                    7    testpool/testvol
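
A minimal sketch of how a userland consumer might parse the data rows of the output above. The struct and function names are hypothetical, and the meaning of the type codes (4 for an unsigned 64-bit value, 7 for a string) is inferred from the example output rather than taken from a header.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Type codes as they appear in the example output (assumed meanings). */
#define	ROW_TYPE_UINT64	4
#define	ROW_TYPE_STRING	7

/* Hypothetical container for one parsed "name type data" row. */
typedef struct kstat_row {
	char		kr_name[64];
	int		kr_type;
	uint64_t	kr_u64;		/* valid for ROW_TYPE_UINT64 */
	char		kr_str[256];	/* valid for ROW_TYPE_STRING */
} kstat_row_t;

/*
 * Parse one row such as "nread    4    12288".
 * Returns 0 on success and -1 for anything that is not a recognized
 * data row (the numeric header line, the column-title line, ...).
 */
static int
parse_kstat_row(const char *line, kstat_row_t *row)
{
	char data[256];

	memset(row, 0, sizeof (*row));
	/* The data column runs to end of line so names may contain spaces. */
	if (sscanf(line, "%63s %d %255[^\n]", row->kr_name, &row->kr_type,
	    data) != 3)
		return (-1);

	switch (row->kr_type) {
	case ROW_TYPE_UINT64:
		if (sscanf(data, "%" SCNu64, &row->kr_u64) != 1)
			return (-1);
		return (0);
	case ROW_TYPE_STRING:
		snprintf(row->kr_str, sizeof (row->kr_str), "%s", data);
		return (0);
	default:
		return (-1);
	}
}
```

A caller would read the file line by line (for example with fgets()) and skip every line for which the function returns -1.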

Motivation and Context

The motivation for this change is to provide some preliminary statistics
on dataset usage/performance. These statistics can be used by higher
level tools in the userland for analysis and visualizations.

How Has This Been Tested?

The accuracy of the statistics has been tested manually, to a
certain extent, using both normal filesystems and volumes.

To verify that the read/write codepaths, which are generally pretty
hot, are not slowed down, I ran the sequential_reads_arc_cached
test from the perf-regression tests of the test suite. My setup was
an Ubuntu VM running 4.15.0-23-generic with 8 CPUs and 12 GB
of RAM, using an eager-zeroed 800 GB SSD. The VM was
running on VMware ESX. The performance was the same, even when
I was consuming the kstats every second while the tests were
running.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the ZFS on Linux code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
All new and existing tests passed.
All commit messages are properly formatted and contain Signed-off-by.
Change has been approved by a ZFS on Linux member.

bunder2015 · 2018-07-11T17:34:46Z

Just curious, are there any plans to expose this to userspace, ala a zfs iostat?

sdimitro · 2018-07-11T20:17:22Z

@bunder2015 No plans from my side as of now, but these stats could definitely be used as sources for a utility like that.

richardelling · 2018-07-11T20:18:53Z

module/zfs/zfs_vfsops.c

+	 * GUID from the pool's config which could change upona reguid).
+	 */
+	char kstat_name[KSTAT_STRLEN];
+	int n = snprintf(kstat_name, sizeof (kstat_name), "%llx-%llx",


In Linux, unlike illumos, there is already a subdirectory for the pool kstats, by name. This works in Linux because the kstat namespace is hierarchical, not flat like illumos. Rather than using the load guid here, let's use the pool's subdirectory.
I'm thinking something like:

int n = snprintf(kstat_name, sizeof (kstat_name), "%s/objset-0x%llx", spa_name(dmu_objset_spa(zfsvfs->z_os)), (unsigned long long)dmu_objset_id(zfsvfs->z_os));

we might be able to get more clever with the objset name, too

Yes. If you read my comment in the code I do mention that (and I clarify more on my other comment on github). The reason why I do this is to keep the codebase and the format of the key for retrieval the same between OSs.

I'm open to changing this to something else than what I have right now. That said, I think it would be easier to do this for now from the perspective of porting. Plus in the future if some utility ships in the zol userland that is ported to illumos and FreeBSD, the routines for formatting the key for the kstat retrieval will be the same across all platforms.

But it (kstats) already isn't the same in Linux. There is no kstat(1), so there isn't a way other than traversing the procfs tree to get there. Use the tree, Luke!

We want the code and the behavior of ZFS to be consistent and uniform between different OSes

Normally yes, but in this case I'm not convinced it's all that important. The way kstats are accessed on Linux is different enough from the other platforms that I don't really see any benefit from inheriting these old limitations. Any code that consumes the kstats is already going to need to be Linux specific unless someone ports the userspace kstat libraries to Linux too.

I think adopting @richardelling's proposed naming scheme for now would be reasonable. Since pools can't currently be renamed without re-importing, the pool name is functionally as good as the load_guid. If we wanted to get fancier, creating a subdirectory for each objset id would be nice, since we're inevitably going to want to add additional per-dataset kstats.

ok. I'll change the current design to the above naming scheme.

I'd suggest preserving everything but the final name to make it easy for the other OpenZFS implementations to use.

Review update changed the naming scheme to the one proposed above.

richardelling · 2018-07-11T20:19:35Z

module/zfs/zvol.c

+	 * naming scheme was decided.
+	 */
+	char kstat_name[KSTAT_STRLEN];
+	int n = snprintf(kstat_name, sizeof (kstat_name), "%llx-%llx",


see comment on pool naming in kstat space above

richardelling · 2018-07-11T20:21:00Z

module/zfs/zvol.c

+	kstat_named_t zk_nread;
+} zvol_kstats_t;
+
+static zvol_kstats_t empty_zvol_kstats = {


in theory, we could add the zvol name here

I'm not sure how useful that would be here.
If you were able to get the objsetID for a dataset, I assume you already had its name.
If there is a practical use-case that I'm missing though I'd be open to including this change.

richardelling · 2018-07-11T20:23:34Z

module/zfs/zfs_vfsops.c

@@ -972,6 +972,152 @@ zfs_id_overquota(zfsvfs_t *zfsvfs, uint64_t usedobj, uint64_t id)
 	    zfs_id_overobjquota(zfsvfs, usedobj, id));
 }

+static zfsvfs_kstats_t empty_zfsvfs_kstats = {


in theory, we could add the dataset name here

See my other comment

Review update added a field with the dataset name.

richardelling · 2018-07-11T20:27:32Z

module/zfs/zfs_vfsops.c

+	    (unsigned long long)dmu_objset_id(zfsvfs->z_os));
+
+	/*
+	 * At the time of this writing, KSTAT_STRLEN is 31 (in illumos),


KSTAT_STRLEN = 255 for Linux

If you read the comment all the way to the end, I do mention that KSTAT_STRLEN in Linux is currently 255, but also wanted to give some background on why I used the load_guid + objset ID in hex (again, trying to keep things uniform between Linux, Illumos, and BSD - 255 vs 31 & 31)

richardelling · 2018-07-11T20:28:03Z

module/zfs/zfs_vfsops.c

+	 * object ID should take 29 characters in total:
+	 * - pool GUID (64 bits - 16 chars in hex - 16 total chars)
+	 * - dash      ( 8 bits -  1 char         - 17 total chars)
+	 * - object id (48 bits - 12 chars in hex - 29 total chars)


see other comments on the kstat namespace

behlendorf · 2018-07-11T20:41:36Z

include/sys/fs/zfs.h

@@ -240,6 +240,7 @@ typedef enum {
 	ZPOOL_PROP_MAXDNODESIZE,
 	ZPOOL_PROP_MULTIHOST,
 	ZPOOL_PROP_CHECKPOINT,
+	ZPOOL_PROP_LOAD_GUID,


If we stick with the proposed naming scheme we should add a ZFS_PROP_ID dataset property. Otherwise users will need to resort to zdb to map dataset names to their matching kstat entry.

What about this?

$ zfs get -Hp -o value objsetid rpool/swap
203

Yes, that's exactly what we need! Since it's a hidden property I didn't even realize that existed, we may want to unhide it and add it to the documentation.

Review update made this a normal property
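
Note that `zfs get` prints the objsetid in decimal, while the kstat file name in the example output at the top of this PR uses hex (`objset-0x4e`). A small sketch of the conversion follows; the helper name `objset_kstat_path()` is hypothetical, and the path layout is taken only from that example output.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build the procfs path for a dataset's kstat from the pool name and
 * the (decimal) objsetid. Returns 0 on success, -1 if the result
 * would not fit in buf.
 */
static int
objset_kstat_path(char *buf, size_t buflen, const char *pool,
    unsigned long long objsetid)
{
	int n = snprintf(buf, buflen,
	    "/proc/spl/kstat/zfs/%s/objset-0x%llx", pool, objsetid);

	/* snprintf() reports the untruncated length; reject truncation. */
	if (n < 0 || (size_t)n >= buflen)
		return (-1);
	return (0);
}
```

For the rpool/swap example above, pool "rpool" and objsetid 203 give /proc/spl/kstat/zfs/rpool/objset-0xcb.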

behlendorf · 2018-07-11T21:02:52Z

module/zfs/zfs_vfsops.c

+	 * GUID from the pool's config which could change upona reguid).
+	 */
+	char kstat_name[KSTAT_STRLEN];
+	int n = snprintf(kstat_name, sizeof (kstat_name), "%llx-%llx",


We want the code and the behavior of ZFS to be consistent and uniform between different OSes

Normally yes, but in this case I'm not convinced it's all that important. The way kstats are accessed on Linux is different enough from the other platforms that I don't really see any benefit from inheriting these old limitations. Any code that consumes the kstats is already going to need to be Linux specific unless someone ports the userspace kstat libraries to Linux too.

I think adopting @richardelling's proposed naming scheme for now would be reasonable. Since pools can't currently be renamed without re-importing, the pool name is functionally as good as the load_guid. If we wanted to get fancier, creating a subdirectory for each objset id would be nice, since we're inevitably going to want to add additional per-dataset kstats.

behlendorf · 2018-07-13T00:02:43Z

module/zcommon/zpool_prop.c

@@ -93,6 +93,8 @@ zpool_prop_init(void)
 	    ZFS_TYPE_POOL, "<size>", "CAP");
 	zprop_register_number(ZPOOL_PROP_GUID, "guid", 0, PROP_READONLY,
 	    ZFS_TYPE_POOL, "<guid>", "GUID");
+	zprop_register_number(ZPOOL_PROP_LOAD_GUID, "load_guid", 0,
+	    PROP_READONLY, ZFS_TYPE_POOL, "<load_guid>", "LOAD_GUID");


The new properties will need to be added to zpool_get_002_pos test case.

Good catch! I'll update it

richardelling · 2018-07-13T01:21:25Z

module/zfs/zfs_vfsops.c

+	    KM_SLEEP);
+	bcopy(&empty_zfsvfs_kstats, zs, sizeof (empty_zfsvfs_kstats));
+
+	char *ds_name = kmem_alloc(ZFS_MAX_DATASET_NAME_LEN, KM_SLEEP);


nit: kmem_zalloc() is safer

Good catch!

richardelling · 2018-07-13T03:29:46Z

man/man8/zfs.8

+, the
+.Sy objsetid
+of a dataset is not transferred to other pools when the snapshot is copied
+with a send/receive operation.


This should rarely be used outside of the kstats case. Do you think it is worth mentioning here that the objsetid can be used to locate the kstats?

You might also mention that the objsetid can be reused (for a new dataset) after the dataset is deleted.

ahrens · 2018-07-18T20:26:17Z

include/sys/zvol.h

@@ -22,6 +22,7 @@
 /*
 * Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2016 Actifio, Inc. All rights reserved.
+ * Copyright (c) 2018 by Delphix. All rights reserved


There aren't any other changes to this file, so you don't need to update the copyright.

ahrens · 2018-07-18T20:27:07Z

man/man8/zfs.8

+, the
+.Sy objsetid
+of a dataset is not transferred to other pools when the snapshot is copied
+with a send/receive operation.


You might also mention that the objsetid can be reused (for a new dataset) after the dataset is deleted.

ahrens · 2018-07-18T20:48:29Z

module/zfs/zfs_vfsops.c

+	 *
+	 * Thus, pool names with lengths 1 to 242 should have valid
+	 * kstats created for their datasets. We do place an assertion
+	 * below to hopefully raise an issue when and if one of the


If you could have a valid pool name that's 250 bytes long, failing an assertion in this case seems wrong. A better behavior would be to silently not create the kstat (and maybe log a zfs_dbgmsg()).

However, the (current) maximum pool name length is 240 bytes:

int
pool_namecheck(const char *pool, namecheck_err_t *why, char *what)
{
	const char *c;

	/*
	 * Make sure the name is not too long.
	 * If we're creating a pool with version >= SPA_VERSION_DSL_SCRUB (v11)
	 * we need to account for additional space needed by the origin ds which
	 * will also be snapshotted: "poolname"+"/"+"$ORIGIN"+"@"+"$ORIGIN".
	 * Play it safe and enforce this limit even if the pool version is < 11
	 * so it can be upgraded without issues.
	 */
	if (strlen(pool) >= (ZFS_MAX_DATASET_NAME_LEN - 2 -
	    strlen(ORIGIN_DIR_NAME) * 2)) {
		if (why)
			*why = NAME_ERR_TOOLONG;
		return (-1);
	}

So the code you have is probably good, but we should update the comment to explain why this works (i.e. pool name is up to 240 chars).

EXCEPT, the above length analysis looks wrong because it doesn't account for the length of the literal "/objset-0x" in the name (it only counts 1 byte for the -).
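
The length analysis can be written out as a small worked calculation. This is only a sketch under stated assumptions: ZFS_MAX_DATASET_NAME_LEN is taken to be 256 and ORIGIN_DIR_NAME to be "$ORIGIN" (both assumed here, not shown in this thread), KSTAT_STRLEN is the Linux value of 255 cited earlier in the review, and the object id is taken as 12 hex characters per the quoted code comment. The helper names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define	ZFS_MAX_DATASET_NAME_LEN	256	/* assumed value */
#define	ORIGIN_DIR_NAME		"$ORIGIN"	/* assumed value */
#define	KSTAT_STRLEN		255	/* Linux value cited in this thread */
#define	OBJSET_ID_HEX_CHARS	12	/* 48-bit object id, per the comment */

/*
 * pool_namecheck() rejects strlen(pool) >= limit, so the longest
 * accepted pool name is limit - 1 bytes.
 */
static size_t
max_pool_name_len(void)
{
	size_t limit = ZFS_MAX_DATASET_NAME_LEN - 2 -
	    strlen(ORIGIN_DIR_NAME) * 2;

	return (limit - 1);
}

/* Longest "<pool>/objset-0x<id>" string, excluding the trailing NUL. */
static size_t
worst_case_kstat_name_len(void)
{
	return (max_pool_name_len() + strlen("/objset-0x") +
	    OBJSET_ID_HEX_CHARS);
}

/* Mirrors the "n >= KSTAT_STRLEN" truncation test from the patch. */
static int
worst_case_fits(void)
{
	return (worst_case_kstat_name_len() < KSTAT_STRLEN);
}
```

Under these assumptions the longest pool name is 239 bytes and the worst-case combined name is 261 bytes, which does not fit in 255. That would argue for skipping kstat creation on truncation rather than asserting.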

ahrens · 2018-07-18T20:48:42Z

module/zfs/zfs_vfsops.c

+	 * pool names:
+	 * - pool name (1 to 256 chars)
+	 * - dash      (1 char)
+	 * - object id (1 to 12  chars in hex)


looks like there are 2 spaces after 12

ahrens · 2018-07-18T20:58:14Z

module/zfs/zvol.c

+	{ "reads",	KSTAT_DATA_UINT64 },
+	{ "nread",	KSTAT_DATA_UINT64 },
+	{ "dataset_name",	KSTAT_DATA_STRING },
+};


Can we have a generic "dataset kstats" rather than separate zvol and zpl kstats? zvol vs zpl would need to call into the generic dataset kstats from different places, but I would think that we could use the same structs, and zvol_kstats_update(), zvol_kstats_create(), zvol_kstats_destroy(), and update funcs could be shared.

You could add a new .c / .h for the dataset kstats, since this doesn't seem closely tied in with e.g. objset code.
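
A rough userland sketch of what a shared structure could look like, with the same update helpers called from both the zvol and ZPL paths. The names here (`ds_stats_t`, `ds_stats_update_read()`, and so on) are hypothetical, and plain integers stand in for the kstat and aggregate-sum machinery of the real patch.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-dataset statistics shared by zvol and ZPL code. */
typedef struct ds_stats {
	char		ds_name[256];
	uint64_t	ds_writes;	/* number of write operations */
	uint64_t	ds_nwritten;	/* bytes written */
	uint64_t	ds_reads;	/* number of read operations */
	uint64_t	ds_nread;	/* bytes read */
} ds_stats_t;

static void
ds_stats_init(ds_stats_t *ds, const char *name)
{
	snprintf(ds->ds_name, sizeof (ds->ds_name), "%s", name);
	ds->ds_writes = ds->ds_nwritten = 0;
	ds->ds_reads = ds->ds_nread = 0;
}

/* Called once per completed read, from either consumer. */
static void
ds_stats_update_read(ds_stats_t *ds, int64_t nread)
{
	if (nread < 0)
		return;		/* ignore bogus byte counts in this sketch */
	ds->ds_reads++;
	ds->ds_nread += (uint64_t)nread;
}

/* Called once per completed write, from either consumer. */
static void
ds_stats_update_write(ds_stats_t *ds, int64_t nwritten)
{
	if (nwritten < 0)
		return;
	ds->ds_writes++;
	ds->ds_nwritten += (uint64_t)nwritten;
}
```

With this shape, the zvol and ZPL code would each embed one such structure and differ only in where they call the update helpers.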

behlendorf · 2018-07-26T23:55:36Z

include/sys/zfs_vfsops.h

+	kstat_named_t zk_nwritten;
+	kstat_named_t zk_reads;
+	kstat_named_t zk_nread;
+	kstat_named_t zk_ds_name;


nit: I'd suggest putting zk_ds_name first since we may end up adding entries to the end.

behlendorf · 2018-07-26T23:57:20Z

include/sys/zfs_vfsops.h

@@ -117,6 +134,8 @@ struct zfsvfs {
 	boolean_t	z_xattr_sa;	/* allow xattrs to be stores as SA */
 	uint64_t	z_version;	/* ZPL version */
 	uint64_t	z_shares_dir;	/* hidden shares dir */
+	kstat_t	*z_iokstat;	/* kstat of I/O to this zvol */


We should consider changing the name to something more generic like z_kstat. It seems inevitable we'll find more per-dataset properties or statistics that it would be useful to add.

nit: indent missing

behlendorf · 2018-07-27T00:26:48Z

module/zfs/zfs_vfsops.c

+	if (n >= KSTAT_STRLEN)
+		return;
+
+	kstat_t *iokstat = kstat_create("zfs", 0, kstat_name, "zfsvfs",


This works but isn't consistent with the other entries in the "zfs/<pool-name>" kstats. They only simulate the effect of a hierarchy under /proc by creating a new ks_module for each imported pool called "zfs/<pool-name>" to which entries can be added. Each module's namespace is still technically flat. Take a look at spa_txg_history_init() for an example.

behlendorf · 2018-07-27T00:27:13Z

module/zfs/zvol.c

+	{ "reads",	KSTAT_DATA_UINT64 },
+	{ "nread",	KSTAT_DATA_UINT64 },
+	{ "dataset_name",	KSTAT_DATA_STRING },
+};


behlendorf

Looks good! Just a few last things.

behlendorf · 2018-08-16T16:50:56Z

include/sys/dataset_kstats.h

+	kstat_named_t dkv_nread;
+} dataset_kstat_values_t;
+
+typedef struct datset_kstats {


s/datset_kstats/dataset_kstats

behlendorf · 2018-08-16T17:05:53Z

module/zfs/dataset_kstats.c

+dataset_kstats_destroy(dataset_kstats_t *dk)
+{
+	if (dk->dk_kstats == NULL)
+			return;


nit: extra tab

behlendorf · 2018-08-16T17:08:16Z

module/zfs/dataset_kstats.c

+	ASSERT3S(nwritten, >=, 0);
+
+	if (dk->dk_kstats == NULL)
+			return;


nit: extra tab

behlendorf · 2018-08-16T17:08:23Z

module/zfs/dataset_kstats.c

+	ASSERT3S(nread, >=, 0);
+
+	if (dk->dk_kstats == NULL)
+			return;


nit: extra tab

behlendorf · 2018-08-16T17:22:48Z

module/zfs/zfs_vfsops.c

@@ -1944,6 +1948,7 @@ zfs_umount(struct super_block *sb)
 		dmu_objset_disown(os, B_TRUE, zfsvfs);
 	}

+	dataset_kstats_destroy(&zfsvfs->z_kstat);


This will be called twice, once here and a second time in zfsvfs_free(). What do you think about moving the dataset_kstats_create() call into the if (mounting) { } conditional of zfsvfs_setup()? This way you'll only need the single call in zfsvfs_free() to cover all the cleanup paths.

behlendorf · 2018-08-16T17:40:16Z

module/zfs/zfs_vnops.c

@@ -937,6 +929,8 @@ zfs_write(struct inode *ip, uio_t *uio, int ioflag, cred_t *cr)
 	    zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
 		zil_commit(zilog, zp->z_id);

+	dataset_kstats_update_write_kstats(&zfsvfs->z_kstat,
+	    start_resid - uio->uio_resid);


In addition to the per-objset IO accounting this PR is adding, there already exists per-task IO accounting. The calls for this are task_io_account_read() and task_io_account_write() in zpl_read_common_iovec() and zpl_write_common_iovec() respectively. As an aside, this is where iotop(8) gets its data from.

In the interests of keeping all of the IO accounting in one place, let's move these calls into zfs_write() and zfs_read(). This will also have the effect that non-ZPL consumers, like zvols and Lustre servers, will get the IO properly accounted to the issuing process.

~~On second thought we probably should leave the task_io_* calls where they are or we'll end up double accounting volumes since this is already handled by submit_bio().~~

[edit] On closer inspection I see that volumes use the dmu_* interfaces so this shouldn't be a problem.
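
For context on the `start_resid - uio->uio_resid` expression in the diff above: the residual is the number of bytes still to be transferred, so the difference between its value at entry and at exit is the number of bytes actually moved. A minimal sketch with a hypothetical stand-in for the uio structure:

```c
#include <stdint.h>

/* Hypothetical stand-in for the residual field of a uio. */
typedef struct fake_uio {
	int64_t	fu_resid;	/* bytes still to be transferred */
} fake_uio_t;

/* Simulate moving up to n bytes, as a copy routine would. */
static void
fake_uio_move(fake_uio_t *uio, int64_t n)
{
	if (n > uio->fu_resid)
		n = uio->fu_resid;
	uio->fu_resid -= n;
}

/* Bytes transferred since start_resid was sampled. */
static int64_t
bytes_transferred(int64_t start_resid, const fake_uio_t *uio)
{
	return (start_resid - uio->fu_resid);
}
```

Accounting this way records what was really transferred, even when a request completes only partially.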

codecov · 2018-08-16T21:24:10Z

Codecov Report

Merging #7705 into master will increase coverage by 0.21%.
The diff coverage is 85.21%.

@@            Coverage Diff             @@
##           master    #7705      +/-   ##
==========================================
+ Coverage   78.26%   78.48%   +0.21%     
==========================================
  Files         373      374       +1     
  Lines      112791   112877      +86     
==========================================
+ Hits        88279    88587     +308     
+ Misses      24512    24290     -222

Flag	Coverage Δ
#kernel	`78.57% <85.21%> (+0.12%)`	⬆️
#user	`67.61% <100%> (+0.33%)`	⬆️


The following patch introduces a few statistics on reads and writes grouped by dataset. These statistics are implemented as kstats (backed by aggregate sums for performance) and can be retrieved by using the dataset objset ID number.

The motivation for this change is to provide some preliminary analytics on dataset usage/performance.

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>

behlendorf

Thanks for putting this together. LGTM.

gmelikov · 2020-02-06T13:17:59Z

@sdimitro did you have any plan about making something like zfs iostat from these stats? I'm thinking about some cli tools.

richardelling · 2020-02-07T02:10:11Z

FWIW, we do collect that with telegraf and node_exporter for graphing and capacity planning

sdimitro · 2020-02-21T00:29:31Z

@gmelikov sorry for the late reply - no I didn't, go for it

sdimitro requested review from behlendorf and ahrens July 11, 2018 15:26

richardelling reviewed Jul 11, 2018

behlendorf reviewed Jul 11, 2018

sdimitro force-pushed the dzol-LX-kstats branch from f8e37d2 to 2289f5b Compare July 12, 2018 20:43

behlendorf requested changes Jul 13, 2018

richardelling reviewed Jul 13, 2018

sdimitro force-pushed the dzol-LX-kstats branch from 2289f5b to 6987042 Compare July 16, 2018 15:35

ahrens reviewed Jul 18, 2018

behlendorf reviewed Jul 27, 2018

behlendorf added the Status: Work in Progress Not yet ready for general review label Jul 27, 2018

sdimitro force-pushed the dzol-LX-kstats branch 3 times, most recently from 703d0dd to d4b54e5 Compare August 15, 2018 20:08

behlendorf requested changes Aug 16, 2018

sdimitro force-pushed the dzol-LX-kstats branch from d4b54e5 to 93dfb13 Compare August 16, 2018 21:23

behlendorf approved these changes Aug 17, 2018

behlendorf added Reviewed and removed Status: Work in Progress Not yet ready for general review labels Aug 17, 2018

sdimitro mentioned this pull request Aug 20, 2018

9647 Introduce ZFS Read/Write kstats openzfs/openzfs#664

Closed

behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Aug 20, 2018

behlendorf merged commit a448a25 into openzfs:master Aug 20, 2018

gmelikov mentioned this pull request Feb 6, 2020

Feature request: zfs disk i/o monitoring tool on Ubuntu/Debian #6763

Open
