Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce read/write kstats per dataset #7705

Merged
merged 1 commit into from
Aug 20, 2018

Conversation

sdimitro
Copy link
Contributor

@sdimitro sdimitro commented Jul 11, 2018

Description

The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved
by using the dataset's objset ID number and the pool's load_guid.
As a side, the pool's load_guid is also exposed to the userland with
this change similarly to other pool properties and the man page has
been updated to describe the property.

Note: A PR for this change has been opened for OpenZFS in the
following link -> openzfs/openzfs#664

Example output:

$ cat /proc/spl/kstat/zfs/testpool/objset-0x4e
46 1 0x01 5 1360 65332747959 103755672238
name                            type data
writes                          4    0
nwritten                        4    0
reads                           4    3
nread                           4    12288
dataset_name                    7    testpool/testvol

Motivation and Context

The motivation for this change is to provide some preliminary statistics
on dataset usage/performance. These statistics can be used by higher
level tools in the userland for analysis and visualizations.

How Has This Been Tested?

The accuracy of the statistics has been tested manually to a
certain extend using both normal filesystems and volumes.

To verify that the performance of the read/write codepaths, that are
generally pretty hot, is not affected I ran the sequential_reads_arc_cached
test from the perf-regression tests of the test suite. My setup was
an Ubuntu VM running 4.15.0-23-generic with 8 CPUs and 12 GB
of RAM, that used an eager-zeroed SSD of 800GB. The VM was
running on VMware ESX. The performance was the same, even if
I was consuming the kstats every second while the tests where
running.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@bunder2015
Copy link
Contributor

Just curious, are there any plans to expose this to userspace, ala a zfs iostat?

@sdimitro
Copy link
Contributor Author

@bunder2015 No plans from my side as of now, but these stats could definitely be used as sources for a utility like that.

* GUID from the pool's config which could change upona reguid).
*/
char kstat_name[KSTAT_STRLEN];
int n = snprintf(kstat_name, sizeof (kstat_name), "%llx-%llx",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Linux, unlike illumos, there is already a subdirectory for the pool kstats, by name. This works in Linux because the kstat namespace is heirarchical, not flat like illumos. Rather than using the load guid here, let's use the pool's subdirectory.
I'm thinking something like:

int n = snprintf(kstat_name, sizeof (kstat_name), "%s/objset-0x%llx", 
     spa_name(dmu_objset_spa(zfsvfs->z_os)),
     (unsigned long long)dmu_objset_id(zfsvfs->z_os));

we might be able to get more clever with the objset name, too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If you read my comment in the code I do mention that (and I clarify more on my other comment on github). The reason why I do this is to keep the codebase and the format of the key for retrieval the same between OSs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to changing this to something else than what I have right now. That said, I think it would be easier to do this for now from the perspective of porting. Plus in the future if some utility ships in the zol userland that is ported to illumos and FreeBSD, the routines for formatting the key for the kstat retrieval will be the same across all platforms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it (kstats) already isn't the same in Linux. There is no kstat(1), so there isn't a way other than traversing the procfs tree to get there. Use the tree, Luke!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want the code and the behavior of ZFS to be consistent and uniform between different OSes

Normally yes, but in this case I'm not convinced it's all that important. The way kstats are accessed on Linux is different enough from the other platforms that I don't really see any benefit from inheriting these old limitations. Any code that consumes the kstats is already going to need to be Linux specific unless someone ports the userspace kstat libraries to Linux too.

I think adopting @richardelling's proposed naming scheme for now would be reasonable. Since pools can't currently be renamed without re-importing the pool name is functionally as good as the load_guid. If we wanted to get fancier creating a subdirectory for each objset id would be nice, we're going to inevitably want to add additional kstats per-dataset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. I'll change the current design to the above naming scheme.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest preserving everything but the final name to make it easy for the other OpenZFS implementations to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review update changed the naming scheme to the one proposed above.

* naming scheme was decided.
*/
char kstat_name[KSTAT_STRLEN];
int n = snprintf(kstat_name, sizeof (kstat_name), "%llx-%llx",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment on pool naming in kstat space above

kstat_named_t zk_nread;
} zvol_kstats_t;

static zvol_kstats_t empty_zvol_kstats = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in theory, we could add the zvol name here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how useful would that be here.
If you were able to get the objsetID for a dataset, I assume you already had its name.
If there is a practical use-case that I'm missing though I'd be open to including this change.

@@ -972,6 +972,152 @@ zfs_id_overquota(zfsvfs_t *zfsvfs, uint64_t usedobj, uint64_t id)
zfs_id_overobjquota(zfsvfs, usedobj, id));
}

static zfsvfs_kstats_t empty_zfsvfs_kstats = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in theory, we could add the dataset name here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review update added a field with the dataset name.

(unsigned long long)dmu_objset_id(zfsvfs->z_os));

/*
* At the time of this writing, KSTAT_STRLEN is 31 (in illumos),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KSTAT_STRLEN = 255 for Linux

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you read the comment all the way to the end, I do mention that KSTAT_STRLEN in Linux is currently 255, but also wanted to give some background on why I used the load_guid + objset ID in hex (again, trying to keep things uniform between Linux, Illumos, and BSD - 255 vs 31 & 31)

* object ID should take 29 characters in total:
* - pool GUID (64 bits - 16 chars in hex - 16 total chars)
* - dash ( 8 bits - 1 char - 17 total chars)
* - object id (48 bits - 12 chars in hex - 29 total chars)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see other comments on the kstat namespace

@@ -240,6 +240,7 @@ typedef enum {
ZPOOL_PROP_MAXDNODESIZE,
ZPOOL_PROP_MULTIHOST,
ZPOOL_PROP_CHECKPOINT,
ZPOOL_PROP_LOAD_GUID,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we stick with the proposed naming scheme we should add a ZFS_PROP_ID dataset property. Otherwise users will need to resort to zdb to map dataset names to their matching kstat entry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this?

$ zfs get -Hp -o value objsetid rpool/swap
203

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's exactly what we need! Since it's a hidden property I didn't even realize that existed, we may want to unhide it and add it to the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review update made this a normal property

* GUID from the pool's config which could change upona reguid).
*/
char kstat_name[KSTAT_STRLEN];
int n = snprintf(kstat_name, sizeof (kstat_name), "%llx-%llx",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want the code and the behavior of ZFS to be consistent and uniform between different OSes

Normally yes, but in this case I'm not convinced it's all that important. The way kstats are accessed on Linux is different enough from the other platforms that I don't really see any benefit from inheriting these old limitations. Any code that consumes the kstats is already going to need to be Linux specific unless someone ports the userspace kstat libraries to Linux too.

I think adopting @richardelling's proposed naming scheme for now would be reasonable. Since pools can't currently be renamed without re-importing the pool name is functionally as good as the load_guid. If we wanted to get fancier creating a subdirectory for each objset id would be nice, we're going to inevitably want to add additional kstats per-dataset.

@@ -93,6 +93,8 @@ zpool_prop_init(void)
ZFS_TYPE_POOL, "<size>", "CAP");
zprop_register_number(ZPOOL_PROP_GUID, "guid", 0, PROP_READONLY,
ZFS_TYPE_POOL, "<guid>", "GUID");
zprop_register_number(ZPOOL_PROP_LOAD_GUID, "load_guid", 0,
PROP_READONLY, ZFS_TYPE_POOL, "<load_guid>", "LOAD_GUID");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new properties will need to be added to zpool_get_002_pos test case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I'll update it

KM_SLEEP);
bcopy(&empty_zfsvfs_kstats, zs, sizeof (empty_zfsvfs_kstats));

char *ds_name = kmem_alloc(ZFS_MAX_DATASET_NAME_LEN, KM_SLEEP);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: kmem_zalloc() is safer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

, the
.Sy objsetid
of a dataset is not transferred to other pools when the snapshot is copied
with a send/receive operation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be rarely used, outside of the case for kstats. Do you think it is worth mentioning here that the objsetid can be referenced to the kstats?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might also mention that the objsetid can be reused (for a new dataset) after the dataset is deleted.

@@ -22,6 +22,7 @@
/*
* Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2016 Actifio, Inc. All rights reserved.
* Copyright (c) 2018 by Delphix. All rights reserved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There aren't any other changes to this file, so you don't need to update the copyright.

, the
.Sy objsetid
of a dataset is not transferred to other pools when the snapshot is copied
with a send/receive operation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might also mention that the objsetid can be reused (for a new dataset) after the dataset is deleted.

*
* Thus, pool names with lengths 1 to 242 should have valid
* kstats created for their datasets. We do place an assertion
* below to hopefully raise an issue when and if one of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you could have a valid pool name that's 250 bytes long, failing an assertion in this case seems wrong. A better behavior would be so silently not create the kstat (and maybe log a zfs_dbgmsg()).

However, the (current) maximum pool name length is 240 bytes:

int
pool_namecheck(const char *pool, namecheck_err_t *why, char *what)
{
	const char *c;

	/*
	 * Make sure the name is not too long.
	 * If we're creating a pool with version >= SPA_VERSION_DSL_SCRUB (v11)
	 * we need to account for additional space needed by the origin ds which
	 * will also be snapshotted: "poolname"+"/"+"$ORIGIN"+"@"+"$ORIGIN".
	 * Play it safe and enforce this limit even if the pool version is < 11
	 * so it can be upgraded without issues.
	 */
	if (strlen(pool) >= (ZFS_MAX_DATASET_NAME_LEN - 2 -
	    strlen(ORIGIN_DIR_NAME) * 2)) {
		if (why)
			*why = NAME_ERR_TOOLONG;
		return (-1);
	}

So the code you have is probably good, but we should update the comment to explain why this works (i.e. pool name is up to 240 chars).

EXCEPT, the above length analysis looks wrong because it doesn't account for the length of the literal "/objset-0x" in the name (it only counts 1 byte for the -).

* pool names:
* - pool name (1 to 256 chars)
* - dash (1 char)
* - object id (1 to 12 chars in hex)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like there are 2 spaces after 12

{ "reads", KSTAT_DATA_UINT64 },
{ "nread", KSTAT_DATA_UINT64 },
{ "dataset_name", KSTAT_DATA_STRING },
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a generic "dataset kstats" rather than separate zvol and zpl kstats? zvol vs zpl would need to call into the generic dataset kstats from different places, but I would think that we could use the same structs, and zvol_kstats_update(), zvol_kstats_create(), zvol_kstats_destroy(), and update funcs could be shared.

You could add a new .c / .h for the dataset kstats, since this doesn't seem closely tied in with e.g. objset code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

kstat_named_t zk_nwritten;
kstat_named_t zk_reads;
kstat_named_t zk_nread;
kstat_named_t zk_ds_name;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd suggest putting zk_ds_name first since we may end up adding entries to the end.

@@ -117,6 +134,8 @@ struct zfsvfs {
boolean_t z_xattr_sa; /* allow xattrs to be stores as SA */
uint64_t z_version; /* ZPL version */
uint64_t z_shares_dir; /* hidden shares dir */
kstat_t *z_iokstat; /* kstat of I/O to this zvol */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider changing the name to something more generic like z_kstat. It seems inevitable we'll find more per-dataset properties or statistics that it would be useful to add.

nit: indent missing

if (n >= KSTAT_STRLEN)
return;

kstat_t *iokstat = kstat_create("zfs", 0, kstat_name, "zfsvfs",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works but isn't consistent with the other entries in the "zfs/<pool-name>" kstats. They only simulate the effect of a hierarchy under /proc by creating a new ks_module for each imported pool called "zfs/ <pool-name>" to which entries can be added. Each modules namespace is still technically flat. Take a look at spa_txg_history_init() for an example.

{ "reads", KSTAT_DATA_UINT64 },
{ "nread", KSTAT_DATA_UINT64 },
{ "dataset_name", KSTAT_DATA_STRING },
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Jul 27, 2018
@sdimitro sdimitro force-pushed the dzol-LX-kstats branch 3 times, most recently from 703d0dd to d4b54e5 Compare August 15, 2018 20:08
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a few last things.

kstat_named_t dkv_nread;
} dataset_kstat_values_t;

typedef struct datset_kstats {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/datset_kstats/dataset_kstats

dataset_kstats_destroy(dataset_kstats_t *dk)
{
if (dk->dk_kstats == NULL)
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra tab

ASSERT3S(nwritten, >=, 0);

if (dk->dk_kstats == NULL)
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra tab

ASSERT3S(nread, >=, 0);

if (dk->dk_kstats == NULL)
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra tab

@@ -1944,6 +1948,7 @@ zfs_umount(struct super_block *sb)
dmu_objset_disown(os, B_TRUE, zfsvfs);
}

dataset_kstats_destroy(&zfsvfs->z_kstat);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be called twice, once here and a second time in zfsvfs_free(). What do you think about moving the dataset_kstats_create() call in to the if (mounting) { } conditional of zfsvfs_setup(). This way you'll only need the single call in zfsvfs_free() to cover all the cleanup paths.

@@ -937,6 +929,8 @@ zfs_write(struct inode *ip, uio_t *uio, int ioflag, cred_t *cr)
zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
zil_commit(zilog, zp->z_id);

dataset_kstats_update_write_kstats(&zfsvfs->z_kstat,
start_resid - uio->uio_resid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the per-objset IO accounting this PR is adding there already exists per-task IO accounting. The calls for this are task_io_account_read() and task_io_account_write() in zpl_read_common_iovec() and zpl_write_common_iovec() respectively. As an aside this, is where iotop(8) is getting its data from.

In the interests of keeping all of the IO accounting in one place let's move these calls in to zfs_write() and zfs_read(). This will also have the effect that non-ZPL consumers, like ZVOLS and Lustre servers will get the IO properly accounted to the issuing process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought we probably should leave the task_io_* calls where they are or we'll end up double accounting volumes since this is already handled by submit_bio().

[edit] On closer inspection I see that volumes use the dmu_* interfaces so this shouldn't be a problem.

@codecov
Copy link

codecov bot commented Aug 16, 2018

Codecov Report

Merging #7705 into master will increase coverage by 0.21%.
The diff coverage is 85.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7705      +/-   ##
==========================================
+ Coverage   78.26%   78.48%   +0.21%     
==========================================
  Files         373      374       +1     
  Lines      112791   112877      +86     
==========================================
+ Hits        88279    88587     +308     
+ Misses      24512    24290     -222
Flag Coverage Δ
#kernel 78.57% <85.21%> (+0.12%) ⬆️
#user 67.61% <100%> (+0.33%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4338c5c...93dfb13. Read the comment docs.

The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved by
using the dataset objset ID number. The motivation for this change is
to provide some preliminary analytics on dataset usage/performance.

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together. LGTM.

@behlendorf behlendorf added Reviewed and removed Status: Work in Progress Not yet ready for general review labels Aug 17, 2018
@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Aug 20, 2018
@behlendorf behlendorf merged commit a448a25 into openzfs:master Aug 20, 2018
@gmelikov
Copy link
Member

gmelikov commented Feb 6, 2020

@sdimitro did you have any plan about making something like zfs iostat from these stats? I'm thinking about some cli tools.

@richardelling
Copy link
Contributor

FWIW, we do collect that with telegraf and node_exporter for graphing and capacity planning

@sdimitro
Copy link
Contributor Author

@gmelikov sorry for the late reply - no I didn't, go for it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants