
Memory consumption beyond zfs_arc_max with dedup enabled #2083

Closed · adm-sim opened this issue Jan 27, 2014 · 6 comments
Labels
Component: Memory Management (kernel memory management)

Comments

adm-sim commented Jan 27, 2014

I've been experimenting with ZFS 0.6.2.1 on a machine with Ubuntu 12.10, 32GB RAM (non-ECC; the production system will have ECC) and a 2x2TB Linux-managed RAID1 (which will be moved to RAIDZ1 for production). I created the tank on the 2TB soft-RAID1 device, enabled compression and dedup, and stored a few hundred GB of data. I got a dedup ratio of about 3.5x, but there was no free memory left at all and the system became unusable. After restarting the system everything seemed fine; then I wrote a few GB of data, and the same thing happened.
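For reference, the setup boils down to something like this (the md device name is a placeholder, not the actual device):

    # pool on the Linux soft-RAID1 device, with compression and dedup
    zpool create tank /dev/md0
    zfs set compression=on tank
    zfs set dedup=on tank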

AFAIK, this DDT output:

DDT-sha256-zap-duplicate: 615271 entries, size 463 on disk, 149 in core
DDT-sha256-zap-unique: 846070 entries, size 494 on disk, 159 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     826K   83.5G   51.7G   52.9G     826K   83.5G   51.7G   52.9G
     2     363K   34.6G   17.8G   18.5G     869K   81.9G   41.3G   43.0G
     4     138K   14.1G   8.89G   9.11G     654K   66.4G   41.0G   42.1G
     8    49.0K   3.94G   2.25G   2.34G     580K   44.3G   25.3G   26.4G
    16    37.2K   3.96G   3.06G   3.10G     865K   90.1G   69.9G   70.8G
    32    9.81K    854M    471M    488M     464K   40.5G   21.9G   22.7G
    64    1.84K    160M   80.8M   85.1M     148K   11.8G   5.99G   6.33G
   128    1.13K   60.4M   24.7M   27.7M     218K   11.2G   4.70G   5.26G
   256      545   52.9M   30.9M   32.1M     169K   15.5G   9.00G   9.36G
   512      120   7.17M   4.19M   4.51M    84.5K   5.09G   2.96G   3.18G
    1K      368   40.0M   19.0M   19.7M     480K   52.2G   24.8G   25.7G
    2K       16    401K     23K     76K    46.4K   1.31G   73.5M    226M
    4K        8      5K      4K     32K    39.9K   24.6M   20.0M    160M
 Total    1.39M    141G   84.3G   86.6G    5.32M    504G    299G    308G

means that the whole table should only take a couple hundred MB of memory, so I don't get what's happening. I have the same setup on an identical server without dedup, and that one seems to work fine.
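A quick sanity check of that in-core footprint, using the entry counts and per-entry "in core" sizes from the output above (plain shell arithmetic):

    # entries * average in-core bytes per entry, summed over both DDTs
    echo $(( (615271 * 149 + 846070 * 159) / 1024 / 1024 ))    # ~215 (MiB)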

I then set zfs_arc_max to 12GB and it happened again. After restarting the server and mounting the volume:

             total       used       free     shared    buffers     cached
Mem:         32138        457      31680          0         19         66
-/+ buffers/cache:        372      31766
Swap:         7812          0       7812

and arcstats:

4 1 0x01 84 4032 7898070146 560489175172
name                            type data
hits                            4    1059
misses                          4    185
demand_data_hits                4    0
demand_data_misses              4    0
demand_metadata_hits            4    971
demand_metadata_misses          4    49
prefetch_data_hits              4    0
prefetch_data_misses            4    7
prefetch_metadata_hits          4    88
prefetch_metadata_misses        4    129
mru_hits                        4    476
mru_ghost_hits                  4    0
mfu_hits                        4    495
mfu_ghost_hits                  4    0
deleted                         4    9
recycle_miss                    4    0
mutex_miss                      4    0
evict_skip                      4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    0
evict_l2_ineligible             4    2048
hash_elements                   4    176
hash_elements_max               4    176
hash_collisions                 4    0
hash_chains                     4    0
hash_chain_max                  4    0
p                               4    6442450944
c                               4    12884901888
c_min                           4    1610612736
c_max                           4    12884901888
size                            4    1704536
hdr_size                        4    101424
data_size                       4    1448960
other_size                      4    154152
anon_size                       4    16384
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    1231872
mru_evict_data                  4    206336
mru_evict_metadata              4    849408
mru_ghost_size                  4    0
mru_ghost_evict_data            4    0
mru_ghost_evict_metadata        4    0
mfu_size                        4    200704
mfu_evict_data                  4    0
mfu_evict_metadata              4    4096
mfu_ghost_size                  4    16384
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    16384
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1498200
arc_meta_limit                  4    3221225472
arc_meta_max                    4    1449144

Played around a bit until the ARC hit zfs_arc_max (12GB):

4 1 0x01 84 4032 7898070146 1406380500230
name                            type data
hits                            4    7338384
misses                          4    117090
demand_data_hits                4    4841648
demand_data_misses              4    10072
demand_metadata_hits            4    2423640
demand_metadata_misses          4    35334
prefetch_data_hits              4    37879
prefetch_data_misses            4    65420
prefetch_metadata_hits          4    35217
prefetch_metadata_misses        4    6264
mru_hits                        4    2672085
mru_ghost_hits                  4    301
mfu_hits                        4    4615778
mfu_ghost_hits                  4    1183
deleted                         4    9
recycle_miss                    4    1022
mutex_miss                      4    17
evict_skip                      4    2
evict_l2_cached                 4    0
evict_l2_eligible               4    1977338368
evict_l2_ineligible             4    751589376
hash_elements                   4    166822
hash_elements_max               4    166828
hash_collisions                 4    59458
hash_chains                     4    21504
hash_chain_max                  4    4
p                               4    55022931
c                               4    12652319216
c_min                           4    1610612736
c_max                           4    12884901888
size                            4    12327222416
hdr_size                        4    55933440
data_size                       4    12149027328
other_size                      4    122261648
anon_size                       4    1056256
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    6481734656
mru_evict_data                  4    6220393984
mru_evict_metadata              4    188646912
mru_ghost_size                  4    1902724096
mru_ghost_evict_data            4    1871710720
mru_ghost_evict_metadata        4    31013376
mfu_size                        4    5666236416
mfu_evict_data                  4    5643978240
mfu_evict_metadata              4    16081408
mfu_ghost_size                  4    708022272
mfu_ghost_evict_data            4    680676352
mfu_ghost_evict_metadata        4    27345920
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    1947
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    462466704
arc_meta_limit                  4    3221225472
arc_meta_max                    4    465357280

and free -m showed what was to be expected. But playing around some more led to the system becoming unreasonably slow (minutes to copy 1GB), and:

             total       used       free     shared    buffers     cached
Mem:         32138      31923        215          0          6      15442
-/+ buffers/cache:      16473      15665
Swap:         7812          0       7812

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  1    308 3774708  27204 9464052    0    0   386   271   72  348  1  2 83 15

Unmounting the ZFS volume and unloading the kernel module frees up all the memory... So to me it really looks like some sort of memory leak: zfs_arc_max is set, and arcstats says this limit is observed (see the stats above), but ZFS somehow continues to eat up memory.

Is this behavior expected? Are there more limits I can/should set?
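For reference, the 12GB cap above corresponds to zfs_arc_max=12884901888 bytes; it can be set persistently via a modprobe options file or changed at runtime through sysfs:

    # persistent, applied when the zfs module loads
    echo "options zfs zfs_arc_max=12884901888" >> /etc/modprobe.d/zfs.conf

    # or on the fly
    echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_max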

DeHackEd (Contributor) commented

So in your kstat/zfs/arcstats you have:

c_max                           4    12884901888
size                            4    12327222416

...and from your free -m output ZFS couldn't be using more than about 16GB of memory. So ZFS hasn't actually exceeded the ARC setting at this point, and isn't going crazy either.
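Those two values can be re-checked at any time with something like:

    grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats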

As for the slow speeds, do remember that dedup is hell on memory usage, and thrashing is to be expected on deduplicated data.

It's important to note that right now there's a lot of overhead and fragmentation in the memory handling subsystem. ZFS thinks it's staying under the limit but could exceed it. Flushing ZFS hard by exporting the pool and/or unloading the module will clean it out thoroughly. ZFS is tested pretty extensively and we don't think there are any actual leaks in it.

There is a fair amount of overhead in the DDT allocations specifically, which will be improved dramatically in 0.6.3. You can get the effects immediately by using the Git version (warning: the kernel module and userspace tool versions must match or you'll get all kinds of weird errors), or if you want to try patching ZFS yourself see #1893 or ecf3d9b. Add .patch to the end of either of those URLs to get a patch file.
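For example, assuming the zfsonlinux/zfs repository path:

    # fetch pull request #1893 as a patch file
    curl -LO https://github.com/zfsonlinux/zfs/pull/1893.patch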

adm-sim (Author) commented Jan 27, 2014

ZFS not exceeding the ARC setting while the memory is still being used is exactly what was worrying me. Thanks for the info about fragmentation!
I might take a look at 0.6.3 or the patches. How production-safe should those be considered, or how far off do you think 0.6.3 is? There is no due date on the roadmap yet.

DeHackEd (Contributor) commented

0.6.3 will be released when all the issues are addressed. You can't put a date on stuff like that.

ZFS is kept in as close to a "good to release" state as possible. Every commit goes through a test suite at LLNL. I'm running the Git version on the workstation I'm typing this message on right now (give or take a couple of weeks' worth of commits).

AndCycle commented

I just tested dedup for my backup dataset on last week's git version, with 8GB for ARC metadata and 30GB for L2ARC.

I didn't experience any memory issues, but I was hit by the heavy random I/O that dedup brings: an rsync from the backup dataset to the dedup test pool got only 7MB/s after it reached the 100GB mark, with a dedup ratio of only 1.07 at that point. zpool iostat showed that most of the activity was reads on the dedup test pool, so I soon gave up because it wouldn't finish in my lifetime.

Even though I didn't get caught by the memory issue, I think a dedup pool eventually becomes unusable due to the crazy random I/O on rotating hard drives.

http://christopher-technicalmusings.blogspot.tw/2011/07/zfs-dedup-performance-real-world.html
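For anyone reproducing this, the per-pool read/write breakdown comes from watching the pool, e.g. (the pool name here is a placeholder):

    zpool iostat -v dduptest 5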

sempervictus (Contributor) commented

Interesting issue. I'm seeing the same behavior with deduplication after ~150GB, even with an SSD ZIL/L2ARC for the target disk and 16GB of ARC. I don't think the I/O is really random, though; it's just seeking for blocks which may be identical to the one being written. Oracle and IBM have documentation on this, but basically, if I remember correctly, the math is (number of blocks) * 320B = memory needed to dedup all the blocks. Given these numbers, it gets impractical pretty quickly, and we have memory demons running around under the SPL floorboards to make this a real party.
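Plugging numbers into that formula, a rough sketch for a 2TB pool, assuming a 64KB average block size (an assumption, not a measured value):

    # DDT RAM = (pool size / avg block size) * 320 bytes
    echo $(( (2 * 1024**4 / (64 * 1024)) * 320 / 1024**3 ))    # = 10 (GiB)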

behlendorf removed this from the 0.6.4 milestone Oct 30, 2014