
OpenZFS 6950 - ARC should cache compressed data #4768

Closed

Conversation

@dpquigl
Contributor

dpquigl commented Jun 17, 2016

Authored by: George Wilson george.wilson@delphix.com
Reviewed by: Prakash Surya prakash.surya@delphix.com
Reviewed by: Dan Kimmel dan.kimmel@delphix.com
Reviewed by: Matt Ahrens mahrens@delphix.com
Reviewed by: Paul Dagnelie pcd@delphix.com
Ported by: David Quigley david.quigley@intel.com

This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently accessed data uncompressed.

I've added a new member to the l1 arc hdr called b_pdata. The b_pdata always
hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical
block for that DVA. The physical block may or may not be compressed. If
compressed arc is enabled and the on-disk block is compressed, then the b_pdata
will match the block on-disk and remain compressed in memory. If the block on
disk is not compressed, then neither is the b_pdata. Lastly, if compressed arc
is disabled, then b_pdata will always be an uncompressed version of the on-disk
block.
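
In other words (a minimal sketch; the helper below is hypothetical, while
zfs_compressed_arc_enabled and the HDR_GET_COMPRESS() accessor are the tunable
and macro this work introduces):

/*
 * Illustrative only: does b_pdata hold compressed bytes?
 */
static boolean_t
hdr_pdata_is_compressed(arc_buf_hdr_t *hdr)
{
        /* b_pdata mirrors the on-disk block only when compressed arc is on */
        return (zfs_compressed_arc_enabled &&
            HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
}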

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has an arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.
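
A rough sketch of that read-hit path (placeholder helper names; the real logic
lives in arc.c and differs in detail):

/*
 * Sketch of a read hit on a cached hdr; arc_buf_alloc() here is a
 * placeholder, not the exact function in the patch.
 */
static arc_buf_t *
arc_read_hit_sketch(arc_buf_hdr_t *hdr)
{
        arc_buf_t *buf = arc_buf_alloc(hdr);    /* uncompressed consumer buf */

        if (hdr->b_l1hdr.b_buf != NULL) {
                /* an uncompressed copy already exists; duplicate it */
                bcopy(hdr->b_l1hdr.b_buf->b_data, buf->b_data,
                    HDR_GET_LSIZE(hdr));
        } else if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF) {
                /* decompress the shared physical block into b_data */
                (void) zio_decompress_data(HDR_GET_COMPRESS(hdr),
                    hdr->b_l1hdr.b_pdata, buf->b_data,
                    HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
        } else {
                /* b_pdata is already uncompressed */
                bcopy(hdr->b_l1hdr.b_pdata, buf->b_data, HDR_GET_LSIZE(hdr));
        }
        return (buf);
}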

Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.
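
Sketched out (again with placeholder names):

/*
 * Sketch of the write side: the stale physical copy is dropped before
 * the dirty, uncompressed arc_buf_t goes down the zio pipeline.
 */
static void
arc_write_begin_sketch(arc_buf_hdr_t *hdr)
{
        /* the on-disk block is about to be rewritten; b_pdata is stale */
        if (hdr->b_l1hdr.b_pdata != NULL)
                arc_hdr_free_pdata(hdr);        /* placeholder name */

        /*
         * Later, the write-done path copies the final (possibly
         * compressed) physical block into a newly allocated b_pdata.
         */
}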

When an l2arc is in use it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled, the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining whether it is valid.
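
Conceptually the l2arc read check then reduces to something like this (a
hedged sketch; real validation happens inside the zio pipeline, and
checksum_compute() is a made-up stand-in):

/*
 * With compressed arc enabled the l2arc copy is byte-identical to the
 * main-pool block, so the bp's checksum can be verified directly.
 */
static int
l2arc_read_check_sketch(const blkptr_t *bp, void *pdata, uint64_t psize)
{
        zio_cksum_t actual;

        checksum_compute(BP_GET_CHECKSUM(bp), pdata, psize, &actual);
        return (ZIO_CHECKSUM_EQUAL(actual, bp->blk_cksum) ?
            0 : SET_ERROR(EIO));
}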

OpenZFS Issue: https://www.illumos.org/issues/6950

@dpquigl
Contributor Author

dpquigl commented Jun 17, 2016

Please note that for the purpose of merging we split out the regression test suite from the compressed arc patch. They are two different features and we will have a pull request coming shortly for the test suite.

@dpquigl
Contributor Author

dpquigl commented Jun 17, 2016

The failure for the built-in build is because we don't have support for a DTRACE_PROBE2 of the form (dmu_buf_impl_t, multilist_sublist_t). This needs to be added but I'm not sure how it's done. Would appreciate help with this.

@don-brady
Contributor

@behlendorf - Looks like ztest failed on the Ubuntu bot. Is there an easy way to get the core file(s)?

trap: SIGTERM: bad trap
+ TEST_ZTEST_TIMEOUT=900
+ TEST_ZTEST_DIR=/mnt
+ TEST_ZTEST_CORE_DIR=/mnt/zloop
+ TEST_ZTEST_OPTIONS=
+ sudo -E dmesg -c
+ sudo -E zfs.sh
+ CHILD=13461
+ wait 13461
+ sudo -E zloop.sh -t 900 -f /mnt -c /mnt/zloop
core dump directory (/mnt/zloop) does not exist, creating it.
06/17 02:30:56 /sbin/ztest -VVVVV -m 2 -r 0 -R 1 -v 2 -a 9 -T 90 -P 29 -s 128m -f /mnt 

@don-brady
Contributor

@behlendorf - any advice/tips on adding a new DTRACE probe definition, like in include/sys/trace_dbuf.h?
Not sure if these need to be hand-rolled or if there is a tool? Thanks

@dweeezil
Contributor

@don-brady You need to add a new template to include/sys/trace_dbuf.h by hand. The existing one takes an additional zio_t parameter. You need to make one that takes a multilist_sublist_t. The existing one doesn't do anything with the zio so I suppose the new one wouldn't need to do anything interesting with the multilist sublist (although it could).

@dweeezil
Contributor

Start by adding something like this to the bottom of include/sys/trace_dbuf.h:

DECLARE_EVENT_CLASS(zfs_dbuf_evict_one_class,
        TP_PROTO(dmu_buf_impl_t *db, multilist_sublist_t *mls),
        TP_ARGS(db, mls),
        TP_STRUCT__entry(DBUF_TP_STRUCT_ENTRY),
        TP_fast_assign(DBUF_TP_FAST_ASSIGN),
        TP_printk(DBUF_TP_PRINTK_FMT, DBUF_TP_PRINTK_ARGS)
);

#define DEFINE_DBUF_EVICT_ONE_EVENT(name) \
DEFINE_EVENT(zfs_dbuf_evict_one_class, name, \
        TP_PROTO(dmu_buf_impl_t *db, multilist_sublist_t *mls), \
        TP_ARGS(db, mls))
DEFINE_DBUF_EVICT_ONE_EVENT(zfs_dbuf__evict__one);

(completely untested, sorry) which should get things moving in the right direction.
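
For reference, the call site in module/zfs/dbuf.c would then look something
like this (assuming the SPL's DTRACE_PROBE2 wrapper maps onto the new trace
class):

        /* in the dbuf eviction path (illustrative placement) */
        DTRACE_PROBE2(dbuf__evict__one, dmu_buf_impl_t *, db,
            multilist_sublist_t *, mls);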

@behlendorf
Contributor

behlendorf commented Jun 17, 2016

@don-brady right now there's no way to get the crash dumps. But feel free to extend the script to better support this kind of thing. That said, it looks like something much worse happened during the testing.

Normally when ztest fails the worst thing that can happen is that the test case fails and the user-space process cores. However, in this case the entire test node went unresponsive and all subsequent tests failed. Normally that's a clue to check the console log from either this test case or an earlier one to see if you can find an earlier problem.

In this case, since none of the automated tests passed, I checked the console logs for CentOS 7 and found the following. That would explain all of the test wreckage.

[ 2015.078402] VERIFY(!list_is_empty(list)) failed
[ 2015.080684] PANIC at list.h:126:list_remove()
[ 2015.082965] Showing stack for process 28245
[ 2015.084986] CPU: 1 PID: 28245 Comm: zpool Tainted: P           OE  ------------   3.10.0-327.18.2.el7.x86_64 #1
[ 2015.089681] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/12/2016
[ 2015.092693]  ffffffffa06b37f7 000000007dc791d7 ffff8800d18c38e0 ffffffff81635a0c
[ 2015.096825]  ffff8800d18c38f0 ffffffffa0473d84 ffff8800d18c3a78 ffffffffa0473e4f
[ 2015.100655]  00000000000002c2 ffff880000000028 ffff8800d18c3a88 ffff8800d18c3a28
[ 2015.104539] Call Trace:
[ 2015.105779]  [<ffffffff81635a0c>] dump_stack+0x19/0x1b
[ 2015.108401]  [<ffffffffa0473d84>] spl_dumpstack+0x44/0x50 [spl]
[ 2015.111203]  [<ffffffffa0473e4f>] spl_panic+0xbf/0xf0 [spl]
[ 2015.114009]  [<ffffffff81647d71>] ? ftrace_call+0x5/0x2f
[ 2015.116693]  [<ffffffff81647d71>] ? ftrace_call+0x5/0x2f
[ 2015.119227]  [<ffffffffa05e05a1>] multilist_sublist_remove+0x91/0xa0 [zfs]
[ 2015.122515]  [<ffffffffa05e0643>] multilist_remove+0x93/0x150 [zfs]
[ 2015.125472]  [<ffffffffa058a710>] __dbuf_hold_impl+0x330/0x970 [zfs]
[ 2015.128755]  [<ffffffffa058adcb>] dbuf_hold_impl+0x7b/0xa0 [zfs]
[ 2015.132852]  [<ffffffffa058ae25>] dbuf_hold_level+0x35/0x60 [zfs]
[ 2015.136858]  [<ffffffffa058ae66>] dbuf_hold+0x16/0x20 [zfs]
[ 2015.140681]  [<ffffffffa05af5ed>] dnode_hold_impl+0xdd/0x820 [zfs]
[ 2015.144916]  [<ffffffffa059bfff>] dmu_object_claim+0x9f/0xf0 [zfs]
[ 2015.149041]  [<ffffffffa0624e9e>] zap_create_claim_norm+0x2e/0x60 [zfs]
[ 2015.153501]  [<ffffffffa0624e75>] ? zap_create_claim_norm+0x5/0x60 [zfs]
[ 2015.157744]  [<ffffffffa0624ef0>] zap_create_claim+0x20/0x30 [zfs]
[ 2015.161936]  [<ffffffffa05cc557>] dsl_pool_create+0x97/0x4d0 [zfs]
[ 2015.166267]  [<ffffffffa05f4e0f>] spa_create+0x59f/0xaf0 [zfs]
[ 2015.170134]  [<ffffffffa063acf2>] zfs_ioc_pool_create+0x152/0x270 [zfs]
[ 2015.174360]  [<ffffffffa063ba97>] zfsdev_ioctl+0x537/0x660 [zfs]
[ 2015.178273]  [<ffffffffa063b560>] ? pool_status_check+0xf0/0xf0 [zfs]
[ 2015.182335]  [<ffffffff811f2175>] do_vfs_ioctl+0x2e5/0x4c0
[ 2015.185965]  [<ffffffff811f23f1>] SyS_ioctl+0xa1/0xc0
[ 2015.189398]  [<ffffffff816461c9>] system_call_fastpath+0x16/0x1b

Regarding the dtrace probes, @dweeezil summed it up nicely.

@dpquigl
Contributor Author

dpquigl commented Jun 18, 2016

How do I build the tree with the dtrace macros enabled? I'm trying to add the new DTRACE_PROBE2 macro but I'm not sure how to test it.

@behlendorf
Contributor

@dpquigl since the dtrace macros depend on ftrace, which is GPL, the only way to enable them is in a local build. Change the License field in the top-level META file to GPL and re-run autogen+configure+make.
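
For example, from the top of the tree (one way to do it):

        $ sed -i 's/^License:.*/License:       GPL/' META
        $ ./autogen.sh && ./configure && make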

It also looks like the updated patch is hitting an ASSERT.

@dpquigl
Contributor Author

dpquigl commented Jun 20, 2016

Yeah, I hadn't tracked down the assert yet; I was trying to get the last of the builds to pass before I started working on that. I'll redo the build with the changes you suggested and try to get the dtrace/ftrace error ironed out first.

@don-brady
Contributor

@dpquigl @behlendorf I looked at the assert last week. The new dbuf cache is built with the multilist primitives (also used by L2 ARC). The state of multilist_link_active doesn't always seem to match reality. In addition, the dbuf_cache_size refcount seems off and will assert if the list-remove assert is suppressed. I'll keep digging since I have a reproducible case now.

behlendorf mentioned this pull request Jun 20, 2016
@dpquigl
Contributor Author

dpquigl commented Jun 21, 2016

@behlendorf Is there something in that pull request that you think will fix our issue with multilist_link_active? If so, is it a single commit I can extract or do we need to wait on the entire patch?

@behlendorf
Contributor

@dpquigl I just referenced this PR from the ZFS encryption PR so the developers working on encryption are aware of this work. I haven't looked into the ASSERT at all, but it looks like you've sorted out all the build issues, which is great.

@dpquigl
Contributor Author

dpquigl commented Jun 21, 2016

@behlendorf Yeah, I added the dbuf ftrace support and fixed the arc ftrace support to account for the struct changes. Don is working on tracking down the assert; I've built ZFS on an Ubuntu VM and will be tracking down the ztest failure there.

@dpquigl
Contributor Author

dpquigl commented Jun 22, 2016

@behlendorf Is there any way to get some assistance in tracking down some of these testing failures? I built an Ubuntu VM and ran the commands that supposedly failed, and they ran through and passed on my local VM.

@behlendorf
Contributor

There are still a couple of existing ztest failures which occur once in a while. I wouldn't worry about the ztest failure until the assert is addressed, particularly if you can't reproduce the failure locally.

@don-brady
Contributor

@behlendorf @dpquigl I found the issue with the ASSERT. I'll do some more testing but the root cause seems to be a missing multilist_link_init() in the dbuf kmem cache constructor.
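
In other words, something along these lines in the dbuf.c constructor
(paraphrased; the surrounding contents here are a sketch, not the exact code):

/*
 * dbuf kmem cache constructor, paraphrased: the new db_cache_link
 * multilist node must be initialized like the other embedded fields.
 */
static int
dbuf_cons(void *vdb, void *unused, int kmflag)
{
        dmu_buf_impl_t *db = vdb;
        bzero(db, sizeof (dmu_buf_impl_t));

        mutex_init(&db->db_mtx, NULL, MUTEX_DEFAULT, NULL);
        cv_init(&db->db_changed, NULL, CV_DEFAULT, NULL);
        multilist_link_init(&db->db_cache_link);        /* the missing init */
        refcount_create(&db->db_holds);
        return (0);
}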

@dpquigl
Contributor Author

dpquigl commented Jun 23, 2016

@behlendorf The patch has been rebased onto the latest master. I don't know why the tests are failing on the test machines; the patch clears our tests on CentOS, and when I run the failed commands on Ubuntu they pass for me. I'm not sure how you want to handle this now.

@tcaputi
Contributor

tcaputi commented Jun 24, 2016

@dpquigl It seems that most of the tests are failing because (1) the packaging is failing to include or properly place ztest and (2) there is a panic occurring in the zconfig test.

Also the debian 8 slave disconnected at some point (that happens occasionally).

@behlendorf
Contributor

behlendorf commented Jun 24, 2016

Right, what @tcaputi said! Also be aware that d0de2e8, which adds some test coverage, might have introduced a new ztest failure. I'm looking into it today and will see if it needs to be reverted.

@dpquigl
Contributor Author

dpquigl commented Jun 28, 2016

@behlendorf So the last two runs that went through came back pretty clean, with several of the machines coming up all green. I'm hoping this run with the tunables does the same so we can run down the last of the problems and get this merged. If we can do that, then I think it would be better to rebase the zfs resumable send patches on top of compressed arc instead of the other way around.

@behlendorf
Contributor

@dpquigl it's great to see this PR coming together. But we're going to need to run down any observed failures before this can be merged in order to keep the master branch in good shape for other developers. Ignoring some spurious buildbot infrastructure problems and a long-standing ztest failure, all the tests have been passing on all the builders. I've resubmitted the failures above in order to get clean test runs to determine if any new issues were introduced.

I'd also like to merge this PR after the resumable send changes since that's the same order they were merged in OpenZFS. This afternoon I resolved the last remaining issue with #4742 so as long as it passes the automated testing I expect to be able to merge it tomorrow. That will help bring a few key areas back in sync with the OpenZFS code.

After it's merged it would be great if you could rebase this against master. Then I'll be happy to review it and time permitting help run down any remaining outstanding issues.

@dpquigl
Contributor Author

dpquigl commented Jun 30, 2016

Rebased on top of latest master.

@dpquigl
Contributor Author

dpquigl commented Jul 6, 2016

So now that the zfs send patches have been merged and this has been rebased, can we work on looking into these test failures?

@dpquigl
Contributor Author

dpquigl commented Jul 11, 2016

@behlendorf Is there any way we can get a concerted effort to get this looked at and merged? I have a bunch of other work that depends on this, and I'd rather not keep carrying these patches forward.

@behlendorf
Contributor

behlendorf commented Jul 15, 2016

@dpquigl the finalized version of this feature was just merged to the OpenZFS repository, so we should be in a good position to get this merged. If you can refresh this against master, making sure to include any last-minute fixes made to the OpenZFS version, we can get this merged.

@dpquigl
Contributor Author

dpquigl commented Jul 16, 2016

@behlendorf The version currently up there contains all of George's fixes. According to the PR here there are no conflicts with master that would prevent you from merging the commit. We had a face-to-face meeting this week, and the compressed ARC code passed our internal test machines with 100% test success. I'm not sure why the testers here are failing, but we're seeing good results from the code internally. If you need the patch rebased I can try to do it this weekend, but it will most likely have to wait until Monday.

@behlendorf
Contributor

@dpquigl that's great news regarding your internal testing. Many of the test failures appear to have been due to problems in the master branch which have been addressed since this PR was last refreshed. For testing purposes I've opened #4863 with this patch rebased against the latest master code. If things go well there this should be ready to merge.

@behlendorf
Contributor

Minor issue when building without debugging enabled

/home/behlendo/src/git/zfs/module/zfs/arc.c: In function ‘arc_share_buf’:
/home/behlendo/src/git/zfs/module/zfs/arc.c:2266:15: warning: unused variable ‘state’ [-Wunused-variable]
  arc_state_t *state = hdr->b_l1hdr.b_state;
               ^
/home/behlendo/src/git/zfs/module/zfs/arc.c: In function ‘arc_unshare_buf’:
/home/behlendo/src/git/zfs/module/zfs/arc.c:2295:15: warning: unused variable ‘state’ [-Wunused-variable]
  arc_state_t *state = hdr->b_l1hdr.b_state;
               ^
/home/behlendo/src/git/zfs/module/zfs/dbuf.c: In function ‘dbuf_add_ref’:
/home/behlendo/src/git/zfs/module/zfs/dbuf.c:2801:10: warning: unused variable ‘holds’ [-Wunused-variable]
  int64_t holds = refcount_add(&db->db_holds, tag);
          ^

Also, just a heads-up: there will be some simple conflicts to resolve against master due to 25458cb, which was merged.

I'd like to get this merged so I'll be spending some time looking into the remaining test failures.

@behlendorf
Contributor

Original comments from @angstymeat which were posted to #4863. Thanks for testing this out and providing us some feedback.

I'm running this patch against the latest July 25 commits (g25458cb) and I'm seeing a weird thing on all three machines: my load average never drops below 1, and I can't find what is running to cause it.

This is almost certainly caused by the use of cv_timedwait_hires() in the new dbuf_evict_thread(). @dpquigl you'll want to update the code for Linux to use cv_timedwait_sig_hires() instead so this doesn't count against the load average.
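
For example, in dbuf_evict_thread() (a sketch; the cv/lock names and timeout
values are assumptions about the new code, not verified against it):

/*
 * The _sig_ variant sleeps interruptibly, so the idle eviction thread
 * no longer counts toward the Linux load average.
 */
(void) cv_timedwait_sig_hires(&dbuf_evict_cv, &dbuf_evict_lock,
    SEC2NSEC(1), MSEC2NSEC(1), 0);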

Two systems are running the 4.6.4-201.fc23.x86_64 kernel, and the third is running 4.6.4-301.fc24.x86_64. One is an Intel Xeon, one an AMD Opteron, and one an Intel i3. Memory in them ranged from 2GB to 32GB.

I'm also seeing radically different cache-hit and ARC-utilization statistics. The systems appear to be caching almost nothing but metadata, and cache hit-rates have dropped considerably (my graphs are showing rates at between 10% to 25% of what I had been seeing without this patch).

All three systems are running pools with LZ4 compression, and overall compression ratios range from 1.09x to 1.83x.

I don't know if the cache values are normal or not now that the ARC is storing compressed data, but the load average definitely seems wrong to me.

I can provide more information about the systems upon request.

@dpquigl
Contributor Author

dpquigl commented Jul 29, 2016

@behlendorf Made the change you requested and pushed it as a separate commit for now. Feel free to fold it into the first commit during the merge.

dpquigl closed this Jul 29, 2016
dpquigl reopened this Jul 29, 2016
@dpquigl
Contributor Author

dpquigl commented Jul 29, 2016

@behlendorf The builtin builds seem to fail because of a bug in SPL.

@behlendorf
Contributor

@dpquigl thanks. The built-in failures should be resolved fairly soon; they're due to changes in the latest kernel.org kernel.

@behlendorf
Contributor

Closing, replaced by #5009.

behlendorf closed this Aug 22, 2016
dpquigl deleted the openzfs-compressedarc branch September 13, 2016 20:35
@mmatuska
Contributor

mmatuska commented Aug 19, 2020

MRU/MFU pressure fix:
#10548
#10618
