
OpenZFS 6950 - ARC should cache compressed data #4768

Closed

Conversation

@dpquigl
Contributor

dpquigl commented Jun 17, 2016

Authored by: George Wilson george.wilson@delphix.com
Reviewed by: Prakash Surya prakash.surya@delphix.com
Reviewed by: Dan Kimmel dan.kimmel@delphix.com
Reviewed by: Matt Ahrens mahrens@delphix.com
Reviewed by: Paul Dagnelie pcd@delphix.com
Ported by: David Quigley david.quigley@intel.com

This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently accessed data uncompressed.

I've added a new member to the l1 arc hdr called b_pdata. The b_pdata always
hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical
block for that DVA. The physical block may or may not be compressed. If
compressed arc is enabled and the on-disk block is compressed, then the b_pdata
will match the block on-disk and remain compressed in memory. If the block on
disk is not compressed, then neither is the b_pdata. Lastly, if compressed arc
is disabled, then b_pdata will always be an uncompressed version of the on-disk
block.
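
In other words (a minimal sketch; the helper below is hypothetical, while
zfs_compressed_arc_enabled and the HDR_GET_COMPRESS() accessor are the tunable
and macro this work introduces):

/*
 * Illustrative only: does b_pdata hold compressed bytes?
 */
static boolean_t
hdr_pdata_is_compressed(arc_buf_hdr_t *hdr)
{
        /* b_pdata mirrors the on-disk block only when compressed arc is on */
        return (zfs_compressed_arc_enabled &&
            HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
}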

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has an arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.
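
A rough sketch of that read-hit path (placeholder helper names; the real logic
lives in arc.c and differs in detail):

/*
 * Sketch of a read hit on a cached hdr; arc_buf_alloc() here is a
 * placeholder, not the exact function in the patch.
 */
static arc_buf_t *
arc_read_hit_sketch(arc_buf_hdr_t *hdr)
{
        arc_buf_t *buf = arc_buf_alloc(hdr);    /* uncompressed consumer buf */

        if (hdr->b_l1hdr.b_buf != NULL) {
                /* an uncompressed copy already exists; duplicate it */
                bcopy(hdr->b_l1hdr.b_buf->b_data, buf->b_data,
                    HDR_GET_LSIZE(hdr));
        } else if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF) {
                /* decompress the shared physical block into b_data */
                (void) zio_decompress_data(HDR_GET_COMPRESS(hdr),
                    hdr->b_l1hdr.b_pdata, buf->b_data,
                    HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
        } else {
                /* b_pdata is already uncompressed */
                bcopy(hdr->b_l1hdr.b_pdata, buf->b_data, HDR_GET_LSIZE(hdr));
        }
        return (buf);
}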

Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.
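
Sketched out (again with placeholder names):

/*
 * Sketch of the write side: the stale physical copy is dropped before
 * the dirty, uncompressed arc_buf_t goes down the zio pipeline.
 */
static void
arc_write_begin_sketch(arc_buf_hdr_t *hdr)
{
        /* the on-disk block is about to be rewritten; b_pdata is stale */
        if (hdr->b_l1hdr.b_pdata != NULL)
                arc_hdr_free_pdata(hdr);        /* placeholder name */

        /*
         * Later, the write-done path copies the final (possibly
         * compressed) physical block into a newly allocated b_pdata.
         */
}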

When an l2arc is in use it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled, the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining whether it is valid.
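
Conceptually the l2arc read check then reduces to something like this (a
hedged sketch; real validation happens inside the zio pipeline, and
checksum_compute() is a made-up stand-in):

/*
 * With compressed arc enabled the l2arc copy is byte-identical to the
 * main-pool block, so the bp's checksum can be verified directly.
 */
static int
l2arc_read_check_sketch(const blkptr_t *bp, void *pdata, uint64_t psize)
{
        zio_cksum_t actual;

        checksum_compute(BP_GET_CHECKSUM(bp), pdata, psize, &actual);
        return (ZIO_CHECKSUM_EQUAL(actual, bp->blk_cksum) ?
            0 : SET_ERROR(EIO));
}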

OpenZFS Issue: https://www.illumos.org/issues/6950

@dpquigl
Contributor Author

dpquigl commented Jun 17, 2016

Please note that for the purpose of merging we split out the regression test suite from the compressed arc patch. They are two different features and we will have a pull request coming shortly for the test suite.

@dpquigl
Contributor Author

dpquigl commented Jun 17, 2016

The failure for the built-in build is because we don't have support for a DTRACE_PROBE2 of the form (dmu_buf_impl_t, multilist_sublist_t). This needs to be added but I'm not sure how it's done. Would appreciate help with this.

@don-brady
Contributor

@behlendorf - Looks like ztest failed on the Ubuntu bot. Is there an easy way to get the core file(s)?

trap: SIGTERM: bad trap
+ TEST_ZTEST_TIMEOUT=900
+ TEST_ZTEST_DIR=/mnt
+ TEST_ZTEST_CORE_DIR=/mnt/zloop
+ TEST_ZTEST_OPTIONS=
+ sudo -E dmesg -c
+ sudo -E zfs.sh
+ CHILD=13461
+ wait 13461
+ sudo -E zloop.sh -t 900 -f /mnt -c /mnt/zloop
core dump directory (/mnt/zloop) does not exist, creating it.
06/17 02:30:56 /sbin/ztest -VVVVV -m 2 -r 0 -R 1 -v 2 -a 9 -T 90 -P 29 -s 128m -f /mnt 

@don-brady
Contributor

@behlendorf - any advice/tips on adding a new DTRACE probe definition, like in include/sys/trace_dbuf.h?
Not sure if these need to be hand-rolled or if there is a tool? Thanks

@dweeezil
Contributor

@don-brady You need to add a new template to include/sys/trace_dbuf.h by hand. The existing one takes an additional zio_t parameter. You need to make one that takes a multilist_sublist_t. The existing one doesn't do anything with the zio so I suppose the new one wouldn't need to do anything interesting with the multilist sublist (although it could).

@dweeezil
Contributor

Start by adding something like this to the bottom of include/sys/trace_dbuf.h:

DECLARE_EVENT_CLASS(zfs_dbuf_evict_one_class,
        TP_PROTO(dmu_buf_impl_t *db, multilist_sublist_t *mls),
        TP_ARGS(db, mls),
        TP_STRUCT__entry(DBUF_TP_STRUCT_ENTRY),
        TP_fast_assign(DBUF_TP_FAST_ASSIGN),
        TP_printk(DBUF_TP_PRINTK_FMT, DBUF_TP_PRINTK_ARGS)
);

#define DEFINE_DBUF_EVICT_ONE_EVENT(name) \
DEFINE_EVENT(zfs_dbuf_evict_one_class, name, \
        TP_PROTO(dmu_buf_impl_t *db, multilist_sublist_t *mls), \
        TP_ARGS(db, mls))
DEFINE_DBUF_EVICT_ONE_EVENT(zfs_dbuf__evict__one);

(completely untested, sorry) which should get things moving in the right direction.
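
For reference, the call site in module/zfs/dbuf.c would then look something
like this (assuming the SPL's DTRACE_PROBE2 wrapper maps onto the new trace
class):

        /* in the dbuf eviction path (illustrative placement) */
        DTRACE_PROBE2(dbuf__evict__one, dmu_buf_impl_t *, db,
            multilist_sublist_t *, mls);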

@behlendorf
Contributor

behlendorf commented Jun 17, 2016

@don-brady right now there's no way to get the crash dumps. But feel free to extend the script to better support this kind of thing. That said, it looks like something much worse happened during the testing.

Normally when ztest fails the worst thing that can happen is that the test case fails and the user-space process cores. However, in this case the entire test node went unresponsive and all subsequent tests failed. Normally that's a clue to check the console log from either this test case or an earlier one to see if you can find an earlier problem.

In this case, since none of the automated tests passed, I checked the console logs for CentOS 7 and found the following. That would explain all of the test wreckage.

[ 2015.078402] VERIFY(!list_is_empty(list)) failed
[ 2015.080684] PANIC at list.h:126:list_remove()
[ 2015.082965] Showing stack for process 28245
[ 2015.084986] CPU: 1 PID: 28245 Comm: zpool Tainted: P           OE  ------------   3.10.0-327.18.2.el7.x86_64 #1
[ 2015.089681] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/12/2016
[ 2015.092693]  ffffffffa06b37f7 000000007dc791d7 ffff8800d18c38e0 ffffffff81635a0c
[ 2015.096825]  ffff8800d18c38f0 ffffffffa0473d84 ffff8800d18c3a78 ffffffffa0473e4f
[ 2015.100655]  00000000000002c2 ffff880000000028 ffff8800d18c3a88 ffff8800d18c3a28
[ 2015.104539] Call Trace:
[ 2015.105779]  [<ffffffff81635a0c>] dump_stack+0x19/0x1b
[ 2015.108401]  [<ffffffffa0473d84>] spl_dumpstack+0x44/0x50 [spl]
[ 2015.111203]  [<ffffffffa0473e4f>] spl_panic+0xbf/0xf0 [spl]
[ 2015.114009]  [<ffffffff81647d71>] ? ftrace_call+0x5/0x2f
[ 2015.116693]  [<ffffffff81647d71>] ? ftrace_call+0x5/0x2f
[ 2015.119227]  [<ffffffffa05e05a1>] multilist_sublist_remove+0x91/0xa0 [zfs]
[ 2015.122515]  [<ffffffffa05e0643>] multilist_remove+0x93/0x150 [zfs]
[ 2015.125472]  [<ffffffffa058a710>] __dbuf_hold_impl+0x330/0x970 [zfs]
[ 2015.128755]  [<ffffffffa058adcb>] dbuf_hold_impl+0x7b/0xa0 [zfs]
[ 2015.132852]  [<ffffffffa058ae25>] dbuf_hold_level+0x35/0x60 [zfs]
[ 2015.136858]  [<ffffffffa058ae66>] dbuf_hold+0x16/0x20 [zfs]
[ 2015.140681]  [<ffffffffa05af5ed>] dnode_hold_impl+0xdd/0x820 [zfs]
[ 2015.144916]  [<ffffffffa059bfff>] dmu_object_claim+0x9f/0xf0 [zfs]
[ 2015.149041]  [<ffffffffa0624e9e>] zap_create_claim_norm+0x2e/0x60 [zfs]
[ 2015.153501]  [<ffffffffa0624e75>] ? zap_create_claim_norm+0x5/0x60 [zfs]
[ 2015.157744]  [<ffffffffa0624ef0>] zap_create_claim+0x20/0x30 [zfs]
[ 2015.161936]  [<ffffffffa05cc557>] dsl_pool_create+0x97/0x4d0 [zfs]
[ 2015.166267]  [<ffffffffa05f4e0f>] spa_create+0x59f/0xaf0 [zfs]
[ 2015.170134]  [<ffffffffa063acf2>] zfs_ioc_pool_create+0x152/0x270 [zfs]
[ 2015.174360]  [<ffffffffa063ba97>] zfsdev_ioctl+0x537/0x660 [zfs]
[ 2015.178273]  [<ffffffffa063b560>] ? pool_status_check+0xf0/0xf0 [zfs]
[ 2015.182335]  [<ffffffff811f2175>] do_vfs_ioctl+0x2e5/0x4c0
[ 2015.185965]  [<ffffffff811f23f1>] SyS_ioctl+0xa1/0xc0
[ 2015.189398]  [<ffffffff816461c9>] system_call_fastpath+0x16/0x1b

Regarding the dtrace probes, @dweeezil summed it up nicely.

@dpquigl
Contributor Author

dpquigl commented Jun 18, 2016

How do I build the tree with the dtrace macros enabled? I'm trying to add the new DTRACE_PROBE2 macro but I'm not sure how to test it.

@behlendorf
Contributor

@dpquigl since the dtrace macros depend on ftrace, which is GPL, the only way to enable them is in a local build. Change the License field in the top-level META file to GPL and re-run autogen+configure+make.
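
For example, from the top of the tree (one way to do it):

        $ sed -i 's/^License:.*/License:       GPL/' META
        $ ./autogen.sh && ./configure && make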

It also looks like the updated patch is hitting an ASSERT.

@dpquigl
Contributor Author

dpquigl commented Jun 20, 2016

Yeah, I hadn't tracked down the assert yet; I was trying to get the last of the builds to pass before I started working on that. I'll redo the build with the changes you suggested and try to get the dtrace/ftrace error ironed out first.

@don-brady
Contributor

@dpquigl @behlendorf I looked at the assert last week. The new dbuf cache is built with the multilist primitives (also used by L2 ARC). The state of multilist_link_active doesn't always seem to match reality. In addition, the dbuf_cache_size refcount seems off and will assert if the list-remove assert is suppressed. I'll keep digging since I have a reproducible case now.

behlendorf mentioned this pull request Jun 20, 2016
@dpquigl
Contributor Author

dpquigl commented Jun 21, 2016

@behlendorf Is there something in that pull request that you think will fix our issue with multilist_link_active? If so, is it a single commit I can extract or do we need to wait on the entire patch?

@behlendorf
Contributor

@dpquigl I just referenced this PR from the ZFS encryption PR so the developers working on encryption are aware of this work. I haven't looked into the ASSERT at all, but it looks like you've sorted out all the build issues, which is great.

@dpquigl
Contributor Author

dpquigl commented Jun 21, 2016

@behlendorf Yeah, I added the dbuf ftrace support and fixed the arc ftrace support to account for the struct changes. Don is working on tracking down the assert; I've built ZFS on an Ubuntu VM and will be tracking down the ztest failure there.

@dpquigl
Contributor Author

dpquigl commented Jun 22, 2016

@behlendorf Is there any way to get some assistance in tracking down some of these testing failures? I built an Ubuntu VM and ran the commands that supposedly failed, and they ran through and passed on my local VM.

@behlendorf
Contributor

There are still a couple of existing ztest failures which occur once in a while. I wouldn't worry about the ztest failure until the assert is addressed, particularly if you can't reproduce the failure locally.

@don-brady
Contributor

@behlendorf @dpquigl I found the issue with the ASSERT. I'll do some more testing but the root cause seems to be a missing multilist_link_init() in the dbuf kmem cache constructor.
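
In other words, something along these lines in the dbuf.c constructor
(paraphrased; the surrounding contents here are a sketch, not the exact code):

/*
 * dbuf kmem cache constructor, paraphrased: the new db_cache_link
 * multilist node must be initialized like the other embedded fields.
 */
static int
dbuf_cons(void *vdb, void *unused, int kmflag)
{
        dmu_buf_impl_t *db = vdb;
        bzero(db, sizeof (dmu_buf_impl_t));

        mutex_init(&db->db_mtx, NULL, MUTEX_DEFAULT, NULL);
        cv_init(&db->db_changed, NULL, CV_DEFAULT, NULL);
        multilist_link_init(&db->db_cache_link);        /* the missing init */
        refcount_create(&db->db_holds);
        return (0);
}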

@dpquigl
Contributor Author

dpquigl commented Jun 23, 2016

@behlendorf The patch has been rebased onto the latest master. I don't know why the tests are failing on the test machines; the patch clears our tests on CentOS, and when I run the failed commands on Ubuntu they pass for me. I'm not sure how you want to handle this now.

@tcaputi
Contributor

tcaputi commented Jun 24, 2016

@dpquigl It seems that most of the tests are failing because (1) the packaging is failing to include or properly place ztest and (2) there is a panic occurring in the zconfig test.

Also the debian 8 slave disconnected at some point (that happens occasionally).

@behlendorf
Contributor

behlendorf commented Jun 24, 2016

Right, what @tcaputi said! Also be aware that d0de2e8, which adds some test coverage, might have introduced a new ztest failure. I'm looking into it today and will see if it needs to be reverted.

@dpquigl
Contributor Author

dpquigl commented Jun 28, 2016

@behlendorf So the last two runs that went through came back pretty clean, with several of the machines coming up all green. I'm hoping this run with the tunables does the same so we can run down the last of the problems and get this merged. If we can do that, then I think it would be better to rebase the zfs resumable send patches on top of compressed arc instead of the other way around.

@behlendorf
Contributor

@dpquigl it's great to see this PR coming together. But we're going to need to run down any observed failures before this can be merged in order to keep the master branch in good shape for other developers. Ignoring some spurious buildbot infrastructure problems and a long-standing ztest failure, all the tests have been passing on all the builders. I've resubmitted the failures above in order to get clean test runs to determine if any new issues were introduced.

I'd also like to merge this PR after the resumable send changes since that's the same order they were merged in OpenZFS. This afternoon I resolved the last remaining issue with #4742 so as long as it passes the automated testing I expect to be able to merge it tomorrow. That will help bring a few key areas back in sync with the OpenZFS code.

After it's merged it would be great if you could rebase this against master. Then I'll be happy to review it and time permitting help run down any remaining outstanding issues.

@dpquigl
Contributor Author

dpquigl commented Jun 30, 2016

Rebased on top of latest master.

@dpquigl
Contributor Author

dpquigl commented Jul 6, 2016

So now that the zfs send patches have been merged and this has been rebased, can we work on looking into these test failures?

@dpquigl
Contributor Author

dpquigl commented Jul 11, 2016

@behlendorf Is there any way we can get a concerted effort to get this looked at and merged? I have a bunch of other work that depends on this, and I'd rather not keep carrying these patches forward.

@behlendorf
Contributor

behlendorf commented Jul 15, 2016

@dpquigl the finalized version of this feature was just merged to the OpenZFS repository, so we should be in a good position to get this merged. If you can refresh this against master, making sure to include any last-minute fixes made to the OpenZFS version, we can get this merged.

@dpquigl
Contributor Author

dpquigl commented Jul 16, 2016

@behlendorf The version currently up there contains all of George's fixes. According to the PR here there are no conflicts with master that would prevent you from merging the commit. We had a face-to-face meeting this week, and the compressed ARC code passed our internal test machines with 100% test success. I'm not sure why the testers here are failing, but we're seeing good results from the code internally. If you need the patch rebased I can try to do it this weekend, but it will most likely have to wait until Monday.

@behlendorf
Contributor

@dpquigl that's great news regarding your internal testing. Many of the test failures appear to have been due to problems in the master branch which have been addressed since this PR was last refreshed. For testing purposes I've opened #4863 with this patch rebased against the latest master code. If things go well there this should be ready to merge.

@behlendorf
Contributor

Minor issue when building without debugging enabled

/home/behlendo/src/git/zfs/module/zfs/arc.c: In function ‘arc_share_buf’:
/home/behlendo/src/git/zfs/module/zfs/arc.c:2266:15: warning: unused variable ‘state’ [-Wunused-variable]
  arc_state_t *state = hdr->b_l1hdr.b_state;
               ^
/home/behlendo/src/git/zfs/module/zfs/arc.c: In function ‘arc_unshare_buf’:
/home/behlendo/src/git/zfs/module/zfs/arc.c:2295:15: warning: unused variable ‘state’ [-Wunused-variable]
  arc_state_t *state = hdr->b_l1hdr.b_state;
               ^
/home/behlendo/src/git/zfs/module/zfs/dbuf.c: In function ‘dbuf_add_ref’:
/home/behlendo/src/git/zfs/module/zfs/dbuf.c:2801:10: warning: unused variable ‘holds’ [-Wunused-variable]
  int64_t holds = refcount_add(&db->db_holds, tag);
          ^

Also, just a heads-up: there will be some simple conflicts to resolve against master due to 25458cb, which was merged.

I'd like to get this merged so I'll be spending some time looking into the remaining test failures.

@behlendorf
Contributor

Original comments from @angstymeat which were posted to #4863. Thanks for testing this out and providing us some feedback.

I'm running this patch against the latest July 25 commits (g25458cb) and I'm seeing a weird thing on all three machines: my load average never drops below 1, and I can't find what is running to cause it.

This is almost certainly caused by the use of cv_timedwait_hires() in the new dbuf_evict_thread(). @dpquigl you'll want to update the code for Linux to use cv_timedwait_sig_hires() instead so this doesn't count against the load average.
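
For example, in dbuf_evict_thread() (a sketch; the cv/lock names and timeout
values are assumptions about the new code, not verified against it):

/*
 * The _sig_ variant sleeps interruptibly, so the idle eviction thread
 * no longer counts toward the Linux load average.
 */
(void) cv_timedwait_sig_hires(&dbuf_evict_cv, &dbuf_evict_lock,
    SEC2NSEC(1), MSEC2NSEC(1), 0);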

Two systems are running the 4.6.4-201.fc23.x86_64 kernel, and the third is running 4.6.4-301.fc24.x86_64. One is an Intel Xeon, one an AMD Opteron, and one an Intel i3. Memory in them ranged from 2GB to 32GB.

I'm also seeing radically different cache-hit and ARC-utilization statistics. The systems appear to be caching almost nothing but metadata, and cache hit-rates have dropped considerably (my graphs are showing rates at between 10% to 25% of what I had been seeing without this patch).

All three systems are running pools with LZ4 compression, and overall compression ratios range from 1.09x to 1.83x.

I don't know if the cache values are normal or not now that the ARC is storing compressed data, but the load average definitely seems wrong to me.

I can provide more information about the systems upon request.

@dpquigl
Contributor Author

dpquigl commented Jul 29, 2016

@behlendorf Made the change you requested and pushed it as a separate commit for now. Feel free to fold it into the first commit during the merge.

dpquigl closed this Jul 29, 2016
dpquigl reopened this Jul 29, 2016
@dpquigl
Contributor Author

dpquigl commented Jul 29, 2016

@behlendorf The builtin builds seem to fail because of a bug in SPL.

@behlendorf
Contributor

@dpquigl thanks. The built-in failures should be resolved fairly soon; they're due to changes in the latest kernel.org kernel.

@behlendorf
Contributor

Closing, replaced by #5009.

behlendorf closed this Aug 22, 2016
dpquigl deleted the openzfs-compressedarc branch September 13, 2016 20:35
@mmatuska
Contributor

mmatuska commented Aug 19, 2020

MRU/MFU pressure fix:
#10548
#10618
