root zfs freezes system #154

Closed
Rudd-O opened this Issue Mar 10, 2011 · 82 comments

Contributor

Rudd-O commented Mar 10, 2011

My system freezes when rsyncing large volumes of data.

The first rsync finished just fine, and copied over 80 GB of data from my closet.

However, a second rsync pass (which is nothing more than stat() calls) on the same files (my /home directory) will eventually -- about a minute or so into reading massive amounts of files -- grind the system to a halt. The top view freezes completely while kswapd sits at the top of the process list, and ps ax hangs in the middle of the process listing. Obviously I cannot provide a screenshot of that.

What workarounds can I apply to tell ZFS not to use so much memory? Even if it is slower, I need to see if this is a memory problem.

Swap rests on another partition of the same SSD. It is not swapping with a file on the zfs volume.

Contributor

Rudd-O commented Mar 10, 2011

It's memory, definitely. Free memory drops rapidly as the rsync goes, then the machine locks up. echo 3 > /proc/sys/vm/drop_caches will free the memory that got eaten, but it will take several seconds to complete.

Contributor

Rudd-O commented Mar 10, 2011

curiously the cached memory went down only about a hundred M, but the free memory went up like eight hundred M. So I don't know exactly what is being freed that is not tallied up in the cached memory counter.

Contributor

Rudd-O commented Mar 10, 2011

Also if I stop the rsync as the machine is about to hang, arc_reclaim kicks in with 10000000% percent of cpu (exaggeration of mine).

it seems the problem is that arc_reclaim is simply kicking in too late, and by that time the machine is already effectively hung. remember none of that memory is swappable.

Owner

behlendorf commented Mar 10, 2011

This is my number 1 issue to get fixed. I've just started to look at it now that things are pretty stable. Unfortunately, because of the way ZFS has to manage the ARC, cached data isn't reported under 'cached'; instead you'll see it under 'vmalloc_used'. That won't change soon (though I have long term plans), but in the short term we should be able to do something about the thrashing.

Contributor

Rudd-O commented Mar 10, 2011

how do i reduce the use of the arc in the meantime?

Contributor

Rudd-O commented Mar 10, 2011

what takes a long time is the sync previous to the drop_caches. during that time, the free memory increases.

Contributor

Rudd-O commented Mar 10, 2011

sorry, i take my last comment back. dropping the caches is what takes long.

Owner

behlendorf commented Mar 10, 2011

You can cap the ARC size by setting the 'zfs_arc_max' module option. However, after the module is loaded you won't be able to change this value through /sys/module/zfs/. Arguably that would be a nice thing to support.

Contributor

Rudd-O commented Mar 11, 2011

What units does the zfs_arc_max knob take? How do I set it? Only through modprobe? So I'd have to add it to the dracut module then. I think I will put the dracut thingie under version control and push it to my repo.

Contributor

Rudd-O commented Mar 11, 2011

my first question is, let's say I say zfs_arc_max=1024, what is that? kilobytes?

Owner

behlendorf commented Mar 11, 2011

Right now the arc cache is sized to a maximum 3/4 of all system memory. The zfs_arc_max module option actually takes a value in bytes because that's what the same hook on Solaris expects. We could make this as fancy as we need.

Contributor

fajarnugraha commented Mar 14, 2011

The unit for zfs_arc_max is bytes, and it will only matter if it's > 64M (otherwise the code ignores the setting). I set it to 65M (68157440) right now for reliable operation, and a (very rough) look shows memory usage is at most 512MB higher compared to without zfs.
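
For illustration, here is the bytes arithmetic that comment describes, sketched in shell (the modprobe.d path in the comment is the common convention, not prescribed by this thread):

```shell
# zfs_arc_max takes a value in bytes; per the comment above, values of
# 64 MiB or less are ignored by the code.
MIB=256
BYTES=$((MIB * 1024 * 1024))
echo "$BYTES"
# To persist across reboots, something along these lines (as root):
#   echo "options zfs zfs_arc_max=$BYTES" > /etc/modprobe.d/zfs.conf
```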

Contributor

Rudd-O commented Mar 14, 2011

~/Projects/Mine/zfs@karen.dragonfear α:
cat /etc/modprobe.d/zfs.conf 
options zfs zfs_arc_max=268435456 zfs_arc_min=0

~/Projects/Mine/zfs@karen.dragonfear α:
echo $(( 268435456 / 1024 / 1024 ))
256

Naaah, 64 MB is just TOO SLOW for a root file system to use. I tried that. So I went to 256 MB, and for the most part my machine only stalls two to three times a day (as opposed to right after boot with the unbounded setting). That's progress.

Now, if syscall response could be made faster... I feel like reading files that are already in cache is excruciatingly slow in ZFS compared to ext4, and it really affects application performance:

~/Projects/Mine/zfs@karen.dragonfear α:
sudo /home/rudd-o/Projects/Mine/fs-benchmarks/testers/tightloops read /boot/grub/grub.conf 
[sudo] password for rudd-o: 
Beginning to read() /boot/grub/grub.conf...
Stopping...
Speed: 3930443 read calls in 5 seconds, 786088/s

~/Projects/Mine/zfs@karen.dragonfear α:
sudo /home/rudd-o/Projects/Mine/fs-benchmarks/testers/tightloops read /etc/localtime
Beginning to read() /etc/localtime...
Stopping...
Speed: 464700 read calls in 5 seconds, 92940/s

An order of magnitude slower, NOT GOOD!

Contributor

fajarnugraha commented Mar 14, 2011

IIRC you shouldn't be able to use 64M unless you hack the code (64MB + 1 byte, maybe, but not 64MB) :D

Anyway, related to cached files, I assume you've seen this: behlendorf/zfs@450dc14

Contributor

Rudd-O commented Mar 14, 2011

FWIW, I tried 128, not 64. The code should recognize 64 and add +1 to it, or change to accept 64, since many people with small systems will try that. I remember hacking zfs-fuse deeply to have it accept 16, since I wanted to run it on a 128 MB box. IT WORKED.

Contributor

Rudd-O commented Mar 14, 2011

That commit improves performance for read()s, not for access()es, which constitute the VAST majority of file-related syscalls applications execute. Try an strace on kmail when it starts and you will quickly see what I mean.

What I don't understand is, access() / stat() are supposed to be served from the Linux kernel dentry cache. Does ZFS somehow bypass that and provide its own cache for dentries? That would be the only way I could understand access() / stat() being slower on ZFS.

Owner

behlendorf commented Mar 14, 2011

As you said, the default ARC behavior is to put a minimum bound of 64 MiB on the ARC, so setting it to one byte larger than this should work. Changing this check to a '>=' instead of a '>' seems reasonable to me. I really have no idea how badly it will behave with less than 64 MiB, but feel free to try it!

    if (zfs_arc_max > 64<<20 && zfs_arc_max < physmem * PAGESIZE)
            arc_c_max = zfs_arc_max;
    if (zfs_arc_min > 64<<20 && zfs_arc_min <= arc_c_max)
            arc_c_min = zfs_arc_min;

Regarding access() performance, you're right, an order of magnitude is NO GOOD! This can and will be improved, but in the interest of getting something working sooner rather than later I haven't optimized this code.

While the zfs code does use the Linux dentry and inode caches, the values stored in the inode are not 100% authoritative. Currently these values still need to be pulled from the znode. Finishing the unification of the znode and inode is listed as "Inode/Znode Refactoring" in the list of development items and I expect it will help this issue. There are also a couple of memory allocations which, if eliminated, I'm sure would improve things. Finally, I suspect using oprofile to profile getattr would reveal some other ways to improve things.

All of these changes move us a little further away from using the unmodified Solaris code, but I think that's a price we have to pay in the POSIX layer if we want good performance. Anyway, I'm happy to have help with this work. :) In the meanwhile I want to fix the other memory issues, which I feel I have a pretty good handle on now and with luck can get fixed this week.

Owner

behlendorf commented Mar 14, 2011

I'm having a horrible time reproducing this VM thrashing issue which has been reported. Nothing I've tried today has been able to recreate the issue. I've seen kswapd pop up briefly (fraction of a second) to a large CPU percentage but it quickly frees the required memory.

I want to get this fixed, but I need a good test case. Can someone who is seeing the problem determine exactly what is needed to recreate the issue? Also, if you do manage to recreate the issue (and have a somewhat responsive system), running the following will give me a lot of what I need to get it fixed.

echo t >/proc/sysrq-trigger
echo m >/proc/sysrq-trigger
cat /proc/spl/kstat/zfs/arcstats
cat /proc/meminfo
Contributor

Rudd-O commented Mar 15, 2011

It's simple to reproduce:

  1. run zfs as root file system
  2. use your system normally, load a lot of applications
  3. then start a disk-intensive operation such as a yum upgrade
  4. eventually it'll choke.

I will try to upgrade to a newer kernel today. Running 2.6.37-pre7

Owner

behlendorf commented Mar 15, 2011

Okay, then I'll work on reviewing and pulling your Dracut changes tomorrow. Then I can set up zfs as a root filesystem and see if I can't hit the issue. I know others have hit the problem with it as a non-root filesystem, but I wasn't able to. That's what I get for using test systems; there really isn't anything quite like real usage.

Contributor

Rudd-O commented Mar 15, 2011

Certainly!

devsk commented Mar 15, 2011

Same thing as reported in issue 149.

One way to reproduce this easily would be to boot your Linux system with a smaller amount of RAM than it actually has. Try booting with 512MB using mem=512M as a kernel parameter.

You don't need ZFS as rootfs to trigger this. A simple 'find' or 'du' on a fairly large FS with a small amount of RAM will trigger it.

I have run into this issue (eventual hard lockup!) so many times that I gave up on rootfs.

It's just a matter of the amount of RAM, e.g. given 48GB of RAM you will probably not run into this issue....:-)
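
A minimal sketch of that find-based reproduction (TARGET is a stand-in for the real pool mountpoint):

```shell
# First boot with restricted RAM, e.g. kernel parameter: mem=512M
# Then walk a large tree on the ZFS mount; the metadata traffic alone is
# enough to balloon memory usage on a small-RAM box.
TARGET=${TARGET:-/tank}
find "$TARGET" -xdev >/dev/null 2>&1 || true
echo "walk done"
```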

Owner

behlendorf commented Mar 15, 2011

I can try less RAM, I've been booting my box with 2 GiB to try and tickle the issue but thus far no luck. I'll try less.

Owner

behlendorf commented Mar 15, 2011

Ahh, found it! Or rather I re-found the issue. Ricardo was working on this issue with me; I thought it would only affect a Lustre/ZFS server but it looks like that's not the case. KQ has identified the same bug and opened the following issue with the upstream kernel. That's progress. :)

https://bugzilla.kernel.org/show_bug.cgi?id=30702

http://marc.info/?l=linux-mm&m=128942194520631&w=4

Ricardo had worked out a patch but it has yet to be merged into the upstream kernel. Joshi from KQ has also attached a proposed fix. Rudd-O, devsk, since you two aren't squeamish about rebuilding your kernel, if you like you can apply the proposed fix to your kernel and rebuild it. It should resolve the deadlock.

https://bugzilla.kernel.org/attachment.cgi?id=50802

Contributor

Rudd-O commented Mar 15, 2011

I am going to test the patch soon.

devsk commented Mar 16, 2011

I think that patch has nothing to do with the memory issue that we are facing with ZFS. I have seen a hard lock (which could be the deadlock mentioned above) as well, but mostly thrashing. We do need to concentrate on the thrashing aspect.

Owner

behlendorf commented Mar 16, 2011

Well my new home NAS hardware just arrived so once I get it installed (tonight maybe) hopefully I'll start seeing these real usage issues too. It's an Intel atom with only 4GiB of memory so I'm well motivated to make ZFS work well within those limits. :)

It's hard to say for sure what the problem is without stack traces. If anyone hits it again please run echo t >/proc/sysrq-trigger; dmesg. That way we'll get stacks from all the processes on the system and we should be able to see what it's thrashing on.

devsk commented Mar 16, 2011

One issue we may run into while doing echo t to sysrq is that there are so many tasks, the dmesg buffer may not be large enough to hold them all.

And of course, the system doesn't give you much time to do diagnostics when this happens.

Contributor

Rudd-O commented Mar 16, 2011

Kernel 2.6.38 here, no preempt, no thrashing anymore. So far. Still haven't gotten to test the patch. I will keep you informed.

devsk commented Mar 16, 2011

I have been running 2.6.38-rc's for a while and I still see thrashing. Unless something changed in the last week (between rc8 and release), I don't think thrashing is fixed.

Owner

behlendorf commented Mar 16, 2011

I doubt it; I highly doubt this is related to the exact kernel version. It could be related to CONFIG_PREEMPT_VOLUNTARY, I could absolutely see that causing issues like this. Unfortunately, I only made CONFIG_PREEMPT fatal at configure time and missed CONFIG_PREEMPT_VOLUNTARY; I'll have to fix that. It seems more kernels these days have it enabled by default, so I'll have to find time to make the code preempt-safe as well.

Contributor

Rudd-O commented Mar 16, 2011

It's the voluntary thing! :) i am quite sure.


devsk commented Mar 16, 2011

I have been running PREEMPT_NONE since the day we started on this mission...:-)

devsk commented Mar 16, 2011

Ok, I am running 2.6.38 with PREEMPT_NONE and the system has 4GB of RAM. I have configured the ARC low now (min and max both 368MB, and I can see the numbers matching that upon boot in /proc/spl/kstat/zfs/arcstats). But the end result is the same: a simple 'find / -xdev' consumes all 4GB of RAM, makes processes swap and then hard locks the system.
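
One way to double-check those bounds, sketched here against sample kstat lines standing in for the real file (368MB = 385875968 bytes; on a live system read /proc/spl/kstat/zfs/arcstats directly):

```shell
# Sample arcstats lines; c_min/c_max are reported in bytes in column 3.
arcstats='c_min 4 385875968
c_max 4 385875968
size 4 123456789'
printf '%s\n' "$arcstats" | awk '$1 == "c_min" || $1 == "c_max" { print $1, $3 }'
```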

Contributor

Rudd-O commented Mar 16, 2011

I will try exactly that as soon as I get home from the bar.


devsk commented Mar 16, 2011

Make sure to time it...:-D For me, it's an order of magnitude slower than ext4 (4-5 seconds vs 100-141 seconds; it's an SSD). The time varies depending on whether ZFS has already settled or not, i.e. if the login process is still going on, some processes may do some IO.

Contributor

Rudd-O commented Mar 16, 2011

Oh no, I endured rsyncing /usr at 5 MB per second. Reminds me of PIO drives. Hehehehehee.


Contributor

Rudd-O commented Mar 16, 2011

hung after today's rsync nightly. so yes, the bug persists. i will try the patch in question soon, see if that changes the situation.

Contributor

Rudd-O commented Mar 20, 2011

I am testing the patch. Running find on a large directory no longer hangs the machine in a matter of seconds, but I still see kswapd and arc_reclaim peak in CPU from time to time. However, this time the free memory watermark hovers around 20M ~ 50M, indicating that at least the memory freeing algorithm is working and I no longer have to manually drop caches to recover control of my machine.

This is good news. The patch is doing something.

devsk commented Mar 20, 2011

Which patch are you talking about? Sorry, lost the line of thought here.

Contributor

Rudd-O commented Mar 20, 2011

The kernel patch quoted above.

Also, the system is incredibly slow under load. The patch did not help with that.

Owner

behlendorf commented Mar 20, 2011

There are a couple of issues going on here. The kernel patch fixes a specific deadlock related to vmalloc(). That's well understood and is an upstream kernel bug; because the ZFS port makes heavy use of vmalloc() we tickle it fairly often.

The second issue is the kswapd issue which has been reported. I believe this is caused by using GFP_NOFS a bit too aggressively in the ZFS code. This flag prevents the kernel from reclaiming memory as part of other memory allocations while going about its usual business. Relaxing the use of GFP_NOFS has introduced a few new deadlocks which still need to be run down. The GFP_NOFS changes are currently in spl/zfs branches of the same name.

Getting a handle on both of these issues should, I'm fairly sure, significantly improve things.

Owner

behlendorf commented Mar 20, 2011

The GFP_NOFS branch now contains a workaround (needs testing) for the deadlock caused by the upstream kernel bug. Getting it fixed in the kernel is still the best course of action, but this hopefully will avoid the need for everyone to run a patched kernel. That's a deal breaker for most people in my experience.

Additional deadlocks were also introduced on this branch due to GFP_NOFS being used less. This should be good for kswapd but bad until we can resolve them. I have a good idea of how to fix this but no patches yet. If anyone's feeling brave I'd love to hear how the current branch works for you. It has worked for my light testing so far.

Contributor

Rudd-O commented Mar 21, 2011

git pull from my master to your master. i have merged your dracut branch changes.

Contributor

Rudd-O commented Mar 23, 2011

System hard froze again during rsync, using the latest code in spl and zfs. arc_reclaim activity seen when free memory drops to around a hundred fifty megabytes.

Contributor

Rudd-O commented Mar 23, 2011

Kernel oops when memory gets below 5MB. Cannot read the message as everything is.... wait... Lemme take a picture okay?

Contributor

Rudd-O commented Mar 23, 2011

http://imgur.com/7By6Z

Gotta love android phones.

Contributor

Rudd-O commented Mar 23, 2011

Gonna try again with primarycache=metadata.

Contributor

Rudd-O commented Mar 23, 2011

Remember, this is all with a zfs max arc limit of half a gigabyte! Still running out of memory??? Wtf

Contributor

Rudd-O commented Mar 23, 2011

Looks like even with primarycache=metadata it is going to die.......

yes. It died. Out of memory again.

Contributor

Rudd-O commented Mar 23, 2011

This time the machine died but reboot worked. Probably because of me not waiting to take and upload a picture.

Contributor

Rudd-O commented Mar 23, 2011

With primarycache=metadata, the system is exceedingly slow.

Contributor

Rudd-O commented Mar 23, 2011

Trying primarycache=none. I would really like to rely exclusively on the dentry and page caches, however smart the ARC is supposed to be.
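
For reference, the per-dataset knob being toggled in these comments is set like so (the dataset name is illustrative):

```shell
# Restrict or disable ARC caching for one dataset, then confirm:
zfs set primarycache=metadata tank/home
zfs set primarycache=none tank/home
zfs get primarycache tank/home
```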

Contributor

Rudd-O commented Mar 23, 2011

Gotta wonder, why does zfs_arc_max have no effect on the zfs arc size? (At least with primarycache all or metadata...)

Contributor

Rudd-O commented Mar 23, 2011

Looks like it is going to die anyway, even with primarycache off...

oh yes, it died.

Contributor

Rudd-O commented Mar 23, 2011

Trying with secondarycache=none....

devsk commented Mar 23, 2011

You are running into the very same issues I ran into.

https://github.com/behlendorf/zfs/issues/149/#issue/149/comment/854776

Contributor

Rudd-O commented Mar 23, 2011

Died again. And with arc disabled, it is excruciatingly slow too...

Owner

behlendorf commented Mar 23, 2011

OK, OK, OK. I can take a hint. :) With the kswapd thrashing behind us (right?), I'll take a look at keeping the memory usage within the specified limits.

Contributor

fajarnugraha commented Mar 23, 2011

the reason I use 65M for max arc is that even with that, zfs mem usage goes to around 256-384M. so if you have a 512M arc size I wouldn't be surprised if mem usage goes to around 2G.

Contributor

Rudd-O commented Mar 23, 2011

brian: I don't see any kswapd thrashing when free memory is available, but when it starts getting tight (~20M free), it surely appears again. But that is to be expected, as kswapd is scrambling to find pages to swap out?

fajarnugraha: the problem with a 65M max arc is that, well, eh, the system is excruciatingly slow.

I'd rather have a choice to just use the linux dentry and page caches, and forget about the ARC. I understand that presently ZFS VFS code will still be run even in cases where the dentry cache contains a dentry that userspace is trying to access, or the ARC contains block data that userspace is trying to read, so yeah, that is probably the reason we have this huge overhead even on cached data. That is not the case in other file systems -- as soon as something is cached in either the page or the dentry caches, none of the filesystem code needs to be invoked. Bummer for ZFS here.

Contributor

Rudd-O commented Mar 23, 2011

Also I would like to point out that this is not the case for ZFS-FUSE, where once data has been cached, the ZFS code is never invoked again. I wrote the patch to make this possible, because without it, ZFS-FUSE was much, much slower than even ZFS in kernel.

Owner

behlendorf commented Mar 23, 2011

I agree with pretty much everything you just said. :) Unfortunately, ZFS is a very different beast than every other Linux filesystem. I would love to integrate it more closely with the Linux dentry/inode/page caches to get exactly the performance improvement you suggest. I'm happy to have a detailed development discussion on exactly how this should be done; I would suggest the place for it would be the zfs-devel mailing list. But my first priority is getting the existing implementation stable.

Contributor

fajarnugraha commented Mar 23, 2011

Rudd-O: I know that running with arc=65M is slow :) My point is that currently there's a lot more to zfs memory usage than the arc, so I wouldn't be surprised if a 512M arc means much higher memory usage. Having a specified limit for all zfs usage would be good, but as Brian mentioned we don't have one right now.

Since your patch to zfs-fuse was good performance-wise, can you easily port this to in kernel zfs?

Contributor

Rudd-O commented Mar 23, 2011

No, I cannot. The patch I wrote merely told FUSE to start caching stuff in the dentry / pagecache. ZFS kernel has a different road to travel -- one that involves integrating znodes with inodes to enable proper dentry caching, and other types of work to enable reliance on pagecache alone.

Contributor

Rudd-O commented Mar 23, 2011

Gawd damn. http://pastebin.com/yZy2TVY4 Kernel BUGs galore, and it's always when checking that BAT0 file. ALWAYS. They have been happening since I patched my kernel. I will have to revert that patch and work with the vanilla kernel.

Owner

behlendorf commented Mar 23, 2011

Not a lot to go on there. The kernel patch shouldn't be needed anymore with the source from master so it will be interesting to see if you still see this with the vanilla kernel.

Contributor

Rudd-O commented Mar 23, 2011

With your latest code AND without the kernel patch, I see no freezes... yet. I haven't done the evil rsync that kills machines (TM). I do feel the machine stuttering when memory gets low, and I also do see the memory getting freed in like 300MB bunches when the memory watermark hits about ~15MB.


Owner

behlendorf commented Mar 23, 2011

You could try increasing /proc/sys/vm/min_free_kbytes. This is the threshold used by the kernel for how much memory it wants to keep free. Bumping it up a little for now might help with the stuttering by leaving some more headroom.
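
A sketch of that tweak (the 64 MiB headroom figure is an arbitrary example):

```shell
# min_free_kbytes is in KiB; pick ~64 MiB of headroom.
HEADROOM_KB=$((64 * 1024))
echo "$HEADROOM_KB"
# Inspect the current value, then (as root) raise it:
#   cat /proc/sys/vm/min_free_kbytes
#   echo "$HEADROOM_KB" > /proc/sys/vm/min_free_kbytes
```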

Contributor

Rudd-O commented Mar 23, 2011

interesting. i will try that if i continue seeing stuttering.


devsk commented Mar 23, 2011

The stuttering comes from the swapping code path. min_free_kbytes won't help with that; in fact, the larger min_free_kbytes is, the earlier the kernel will try to swap.

Unless we fix the loop and have zfs free memory inline instead of asynchronously in a separate thread (which may have scheduling trouble with kswapd hogging the CPU trying to find free pages), the kernel will continue to swap. There is a lot of potentially conflicting machinery at work here.

Owner

behlendorf commented Mar 23, 2011

devsk, you're right. Thanks for reminding me. Before I got distracted with the GFP_NOFS badness I started to work on a patch to do direct reclaim on the ARC. Right now all reclaim is done by the arc_reclaim thread, which basically just checks once a second and shrinks the ARC if memory looks low. That of course isn't fast enough for a dynamic environment like a desktop, so swapping kicks in. Adding a direct reclaim path (via a shrinker) should improve things... at least that's the current theory.

Contributor

Rudd-O commented Mar 23, 2011

YES! THAT IS EXACTLY THE PROBLEM. This is why you see kswapd and arc_reclaim contending for 100% CPU on both cores when this happens (both are desperately trying to free memory at all costs, both enter a very contended race, neither succeeds, and the kernel oopses and says it cannot allocate memory). I am sure that when ARC reclaim is done inline as needed instead of on a separate thread, this problem will be a thing of the past.

So when can we have that juicy bit? :-D


Contributor

fajarnugraha commented Mar 25, 2011

I tested a build from master on RHEL with a 2.6.32 kernel; the dracut change causes an error during "make rpm". The cause is simple: RHEL5 does not recognize the %{_datarootdir} macro in zfs.spec. I had to change it manually to %{_datadir}, then the build process completed successfully.

devsk commented Mar 26, 2011

So when can we have that juicy bit? :-D

Not anytime soon I guess...:-D

Surprise me pleasantly, Brian...;-)

Owner

behlendorf commented Mar 27, 2011

I have a branch now which implements much of this, but it still requires some tuning. Give me a few more days to chew on it and I'll make it public for others to test and see if it improves their workloads. I've been using your rsync/find test workload on a 2GiB Atom system; once it works there I'll verify it works well on some 16-core 128GiB memory systems.

In the process I have also gotten a good handle on your memory usage issue. There's no leak, but find does cause some pretty nasty fragmentation. I'll write up my understanding, perhaps in a bug or on the mailing list, next week. There may be a few easy things which can be done to improve things. There are also some much harder things which will have to wait. :)

Owner

behlendorf commented Mar 30, 2011

These changes are the result of my recent work on getting a handle on vmalloc() and memory usage. They are still a work in progress but available for testing. Once I'm happy with everything I'll make a detailed post to zfs-devel explaining the memory usage on Linux as it stands today. If you want to test these changes you must update both the spl and zfs source to the shrinker branch.

https://github.com/behlendorf/spl/tree/shrinker

https://github.com/behlendorf/zfs/tree/shrinker

Here's a high level summary of these changes:

  • Reduce spl slab fragmentation by halving the slab size. Excessive fragmentation for certain workloads (find) caused lots of wasted memory. Decreasing the slab size helps, but there is still considerable overhead which will be difficult to address.
  • More useful slab statistics, including the slab size (size) and how much of it is allocated to objects (alloc). This makes it easy to see which slabs are badly fragmented and how much memory that costs; see /proc/spl/kmem/slab. Additionally, there are now some slab usage summaries in /proc/spl/kmem/slab*.
  • Honor the arc_meta_limit. Previously this limit was not enforced which could result in meta data consuming your entire ARC cache which hurts performance. This limit is now enforced and can be set with the zfs_arc_meta_limit module option. It defaults to 1/4 of the ARC cache.
  • Show arc_meta_used and the associated arc stats. Previously these values were not visible in /proc/spl/kstat/zfs/arcstats. Additionally there are a memory_direct_count and a memory_indirect_count which show how often you hit the direct and indirect reclaim paths.
  • Added direct and indirect memory reclaim paths for the ARC. This should improve behavior issues under low memory conditions and prevent OOM events and arc_reclaim/kswapd thrashing.
  • Several bug fixes for issues exposed by testing in a low memory environment. Details in the commit logs.

There is one major and one minor outstanding issue I know of which are preventing these changes from being merged in to master.

  • Because the ARC now properly honors the arc_meta_limit there is additional pressure on the dcache. This additional pressure now causes a long standing bug to be hit more regularly on low memory systems (2 GiB). This needs to be fixed before this change can be merged.

kernel BUG at fs/inode.c:1333! [ in iput() ]
Putting away a reference on already cleared inode

  • There also remains the smaller issue of the ARC cache being dropped to arc_min when memory pressure is encountered. This only impacts performance, but needs to be explained and fixed; the ARC should reach a steady state.

devsk commented Mar 30, 2011

Wow! That's a lot of work for a week...:-)

Earliest I can test is the weekend though...:( Swamped with work of my own.

Owner

behlendorf commented Mar 30, 2011

I try and keep busy. No rush to get this tested; it's going to take some time to run down the iput() issue mentioned above. I just wanted to make what I'm thinking about public for comment.

Owner

behlendorf commented Apr 22, 2011

This was fixed in what will be 0.6.0-rc4. Closing issue, see comments in issue #149 starting here.

https://github.com/behlendorf/zfs/issues/149#issuecomment-1042925

@behlendorf behlendorf closed this Apr 22, 2011

Contributor

Rudd-O commented May 8, 2012

kswapd is back to thrashing again (in my observation, any time that c_max is set via kernel module option to 18 or more GB of RAM on a swapless system with 48 GB of memory).

both kswapd0 and kswapd1 will spin 100% CPU, pegging two cores of the eight-core machine.

This is bad, I have had to resort to limiting the ARC to around 15 GB and I am testing again with these params:

hash_elements_max 4 1150514
hash_chain_max 4 9
c 4 15372519301
c_min 4 3843129825
c_max 4 15372519301
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 17659
arc_meta_used 4 7688103744
arc_meta_limit 4 7686259650
arc_meta_max 4 7712895136

Owner

behlendorf commented May 10, 2012

Is this with the latest spl+zfs master source? Several recent VM changes were merged there which I expected to make this sort of thing much less likely. If in fact they have had an adverse impact I'd like to get it resolved right away. In particular, are you running with the following commits?

SPL
zfsonlinux/spl@f90096c Modify KM_PUSHPAGE to use GFP_NOIO instead of GFP_NOFS
zfsonlinux/spl@a9a7a01 Add SPLAT test to exercise slab direct reclaim
zfsonlinux/spl@b78d4b9 Ensure a minimum of one slab is reclaimed
zfsonlinux/spl@06089b9 Ensure direct reclaim forward progress
zfsonlinux/spl@c0e0fc1 Ignore slab cache age and delay in direct reclaim
zfsonlinux/spl@cef7605 Throttle number of freed slabs based on nr_to_scan

ZFS
518b487 Update ARC memory limits to account for SLUB internal fragmentation
302f753 Integrate ARC more tightly with Linux

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Mar 1, 2015

Linux 3.6 compat, kern_path_locked() added
The kern_path_parent() function was removed from Linux 3.6 because
it was observed that all the callers just want the parent dentry.
The simpler kern_path_locked() function replaces kern_path_parent()
and does the lookup while holding the ->i_mutex lock.

This is good news for the vn implementation because it removes the
need for us to handle the locking.  However, it makes it harder to
implement a single readable vn_remove()/vn_rename() function which
is usually what we prefer.

Therefore, we implement a new version of vn_remove()/vn_rename()
for Linux 3.6 and newer kernels.  This allows us to leave the
existing working implementation untouched, and to add a simpler
version for newer kernels.

Long term I would very much like to see all of the vn code removed
since what this code enabled is generally frowned upon in the kernel.
But that can't happen until we either abandon the zpool.cache file
or implement alternate infrastructure to update it correctly in
user space.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #154