root zfs freezes system #154
Comments
|
It's memory, definitely. Free memory drops rapidly as the rsync goes, then the machine locks up. echo 3 > /proc/sys/vm/drop_caches will free the memory that got eaten, but it will take several seconds to complete. |
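For anyone following along, the workaround described above boils down to two commands, roughly (a sketch; run as root, and expect the drop_caches write to stall for several seconds as noted):

    sync                                  # flush dirty data first
    echo 3 > /proc/sys/vm/drop_caches     # free pagecache, dentries and inodes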
|
curiously the cached memory went down only about a hundred M, but the free memory went up like eight hundred M. So I don't know exactly what is being freed that is not tallied up in the cached memory counter. |
|
Also if I stop the rsync as the machine is about to hang, arc_reclaim kicks in at something like 10000000% CPU (exaggeration mine). It seems the problem is that arc_reclaim is simply kicking in too late, and by that time the machine is already effectively hung. Remember, none of that memory is swappable. |
|
This is my number 1 issue to get fixed. I've just started to look at it now that things are pretty stable. Unfortunately, because of the way ZFS has to manage the ARC, cached data isn't reported under 'cached'; instead you'll see it under 'vmalloc_used'. That won't change soon (but I have long term plans), but in the short term we should be able to do something about the thrashing. |
|
how do i reduce the use of the arc in the meantime? |
|
what takes a long time is the sync previous to the drop_caches. during that time, the free memory increases. |
|
sorry, i take my last comment back. dropping the caches is what takes long. |
|
You can cap the ARC size by setting the 'zfs_arc_max' module option. However, after the module is loaded you won't be able to change this value through /sys/module/zfs/. Arguably that would be a nice thing to support. |
|
What units does the zfs_arc_max knob take? How do I set it? Only through modprobe? So I'd have to add it to the dracut module then. I think I will put the dracut thingie on version control and push it to my repo. |
|
my first question is, let's say I say zfs_arc_max=1024, what is that? kilobytes? |
|
Right now the ARC cache is sized to a maximum of 3/4 of all system memory. The zfs_arc_max module option actually takes a value in bytes because that's what the same hook on Solaris expects. We could make this as fancy as we need. |
|
The unit for zfs_arc_max is bytes, and it will only matter if it's > 64M (otherwise the code ignores the setting). I set it to 65M (68157440) right now for reliable operation, and a (very rough) look shows memory usage is at most 512MB higher compared to without zfs. |
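To make that setting stick across reboots, a minimal sketch (the 68157440-byte value is just the 65M example above, not a recommendation):

    # load-time: pass the option directly
    modprobe zfs zfs_arc_max=68157440
    # persistent: record it for future module loads
    echo "options zfs zfs_arc_max=68157440" > /etc/modprobe.d/zfs.conf

Note that if zfs is the root filesystem, the modprobe.d file also has to end up in the initramfs, hence the dracut discussion above.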
Naaah, 64 MB is just TOO SLOW for a root file system to use. I tried that. So I went to 256 MB, and for the most part my machine only stalls two to three times a day (as opposed to right after boot with the unbounded setting). That's progress. Now, if syscall response could be made faster... I feel like reading files that are already in cache is excruciatingly slow in ZFS compared to ext4, and it really affects application performance:
An order of magnitude slower, NOT GOOD! |
|
IIRC you shouldn't be able to use 64M unless you hack the code (64MB + 1 byte, maybe, but not 64MB) :D Anyway, related to cached files, I assume you've seen this: behlendorf/zfs@450dc14 |
|
FWIW, I tried 128, not 64. The code should recognize 64 and add a +1 to it, or be changed to accept 64, since many people with small systems will try that. I remember hacking zfs-fuse deeply to have it accept 16, since I wanted to run it on a 128 MB box. IT WORKED. |
|
That commit improves performance for read()s, not for access()es, which constitute the VAST majority of file-related syscalls applications execute. Try an strace on kmail when it starts and you will quickly see what I mean. What I don't understand is, access() / stat() are supposed to be cached in the Linux kernel dentry cache. Does ZFS somehow bypass that and provide its own cache for dentries? That would be the only way I could understand access() / stat() being slower on ZFS. |
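A quick, hedged way to see the access()/stat() weight for yourself (kmail is just the application mentioned above; any program works, and the exact syscall list may vary with your libc):

    # summarize syscall counts and time during startup; quit the app to get the table
    strace -c -f -e trace=access,stat,lstat kmail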
|
As you said, the default ARC behavior is to put a minimum bound of 64MiB on the ARC, so setting it to one byte larger than this should work. Changing this check to a '>=' instead of a '>' seems reasonable to me. I really have no idea how badly it will behave with less than 64MiB but feel free to try it!
Regarding access() performance, you're right, an order of magnitude is NO GOOD! This can and will be improved, but in the interests of getting something working sooner rather than later I haven't optimized this code. While the zfs code does use the Linux dentry and inode caches, the values stored in the inode are not 100% authoritative. Currently these values still need to be pulled from the znode. Finishing the unification of the znode and inode is listed as "Inode/Znode Refactoring" in the list of development items and I expect it will help this issue. There are also a couple of memory allocations which, if eliminated, I'm sure would improve things. Finally, I suspect using oprofile to profile getattr would reveal some other ways to improve things. All of these changes move us a little further away from using the unmodified Solaris code, but I think that's a price we have to pay in the Posix layer if we want good performance. Anyway, I'm happy to have help with this work. :) In the meanwhile I want to fix the other memory issues, which I feel I have a pretty good handle on now and with luck can get fixed this week. |
|
I'm having a horrible time reproducing this VM thrashing issue which has been reported. Nothing I've tried today has been able to recreate the issue. I've seen kswapd pop up briefly (a fraction of a second) to a large CPU percentage, but it quickly frees the required memory. I want to get this fixed, but I need a good test case. Can someone who is seeing the problem determine exactly what is needed to recreate the issue? Also, if you do manage to recreate the issue (and have a somewhat responsive system), running the following will give me a lot of what I need to get it fixed.
echo t >/proc/sysrq-trigger
echo m >/proc/sysrq-trigger
cat /proc/spl/kstat/zfs/arcstats
cat /proc/meminfo
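If the box is still somewhat responsive when it happens, a small untested sketch that captures all of the above into one file:

    {
        echo t > /proc/sysrq-trigger        # dump all task stacks to the kernel log
        echo m > /proc/sysrq-trigger        # dump memory state to the kernel log
        dmesg                               # collect what the sysrq triggers printed
        cat /proc/spl/kstat/zfs/arcstats
        cat /proc/meminfo
    } > zfs-debug.txt 2>&1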
|
It's simple to reproduce:
I will try to upgrade to a newer kernel today. Running 2.6.37-pre7 |
|
Okay, then I'll work on reviewing and pulling your Dracut changes tomorrow. Then I can set up zfs as a root filesystem and see if I can't hit the issue. I know others have hit the problem with it as a non-root filesystem but I wasn't able to. That's what I get for using test systems, however; there really isn't anything quite like real usage. |
|
Certainly! |
devsk
commented
Mar 15, 2011
|
Same thing as reported in issue 149. One way to reproduce this easily would be to boot your Linux system with a smaller amount of RAM than it actually has. Try booting with 512MB using mem=512M as a kernel parameter. You don't need ZFS as rootfs to trigger this. A simple 'find' or 'du' on a fairly large FS with a small amount of RAM will trigger this. I have run into this issue (eventual hard lockup!) so many times that I gave up on rootfs. It's just a matter of the amount of RAM, e.g. given 48GB of RAM, you will probably not run into this issue... :-) |
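A minimal reproduction along those lines (the pool path is a placeholder; adjust to your setup):

    # 1. boot with artificially small memory by appending to the kernel command line:
    #      mem=512M
    # 2. then walk a large tree on the ZFS filesystem:
    find /tank -xdev > /dev/null    # or: du -s /tank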
|
I can try less RAM, I've been booting my box with 2 GiB to try and tickle the issue but thus far no luck. I'll try less. |
|
Ahh, found it! Or rather I refound the issue. Ricardo was working on this issue with me; I thought it would only affect a Lustre/ZFS server but it looks like that's not the case. KQ has identified the same bug and opened the following issue with the upstream kernel. That's progress. :) https://bugzilla.kernel.org/show_bug.cgi?id=30702 http://marc.info/?l=linux-mm&m=128942194520631&w=4 Ricardo had worked out a patch but it has yet to be merged into the upstream kernel. Joshi from KQ has also attached a proposed fix. Rudd-O, Devsk, since you two aren't squeamish about rebuilding your kernel, if you like you can apply the proposed fix to your kernel and rebuild it. It should resolve the deadlock. |
|
I am going to test the patch soon. |
devsk
commented
Mar 16, 2011
|
I think that patch has nothing to do with the memory issue that we are facing with ZFS. I have seen a hard lock (which could be the deadlock mentioned above) as well, but mostly thrashing. We do need to concentrate on the thrashing aspect. |
|
Well my new home NAS hardware just arrived so once I get it installed (tonight maybe) hopefully I'll start seeing these real usage issues too. It's an Intel Atom with only 4GiB of memory so I'm well motivated to make ZFS work well within those limits. :) It's hard to say for sure what the problem is without stack traces from the problem. If anyone hits it again please run echo t >/proc/sysrq-trigger; dmesg. That way we'll get stacks from all the processes on the system and we should be able to see what it's thrashing on. |
devsk
commented
Mar 16, 2011
|
One issue we may run into while doing echo t to sysrq is that there are so many tasks, the dmesg buffer may not be big enough to hold them all. And of course, the system doesn't give you much time to do diagnostics when this happens. |
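One possible mitigation for the small dmesg buffer (my suggestion, not from the thread): boot with a larger kernel log buffer so the task dump is not truncated.

    # append to the kernel command line and reboot
    #   log_buf_len=2M
    dmesg -s 2097152    # ask dmesg to read with a matching buffer size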
|
Kernel 2.6.38 here, no preempt, no thrashing anymore. So far. Still haven't gotten to test the patch. I will keep you informed. |
devsk
commented
Mar 16, 2011
|
I have been running 2.6.38-rc's for a while and I still see thrashing. Unless something changed in the last week (between rc8 and release), I don't think thrashing is fixed. |
|
I doubt it, I highly doubt this is related to the exact kernel version. It could be related to CONFIG_PREEMPT_VOLUNTARY, I could absolutely see that causing issues like this. Unfortunately, I only made CONFIG_PREEMPT fatal at configure time and missed CONFIG_PREEMPT_VOLUNTARY, so I'll have to fix that. It seems more kernels these days have it enabled by default so I'll have to find time to make the code preempt-safe as well. |
It's the voluntary thing! :) I am quite sure. |
devsk
commented
Mar 16, 2011
|
I have been running PREEMPT_NONE since the day we started on this mission...:-) |
devsk
commented
Mar 16, 2011
|
Ok, I am running 2.6.38 with PREEMPT_NONE and the system has 4GB of RAM. I have configured the ARC low now (min and max both 368MB, and I can see the numbers matching that upon boot in /proc/spl/kstat/zfs/arcstats). But the end result is the same: a simple 'find / -xdev' consumes all 4GB of RAM, makes processes swap and then hard locks the system. |
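A quick way to confirm the configured limits actually took effect, using the arcstats file mentioned above (c_min and c_max are reported in bytes):

    grep -E '^c_(min|max)' /proc/spl/kstat/zfs/arcstats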
I will try exactly that as soon as I get home from the bar. |
devsk
commented
Mar 16, 2011
|
Make sure to time it...:-D For me, it's an order of magnitude slower than ext4 (4-5 seconds vs 100-141 seconds; it's an SSD). The time varies depending on whether ZFS has already settled or not, i.e. if the login process is still going on, some processes may do some IO. |
Oh, I know. I endured rsyncing /usr at 5 MB per second. Reminds me of PIO drives. Hehehehehee. |
|
Hung after today's nightly rsync. So yes, the bug persists. I will try the patch in question soon and see if that changes the situation. |
|
I am testing the patch. Running find on a large directory no longer hangs the machine in a matter of seconds, but I still see kswapd and arc_reclaim peak in CPU from time to time. However, this time the free memory watermark hovers around 20M ~ 50M, indicating that at least the memory freeing algorithm is working and I no longer have to manually drop caches to recover control of my machine. This is good news. The patch is doing something. |
devsk
commented
Mar 20, 2011
|
Which patch are you talking about? Sorry, lost the line of thought here. |
|
The kernel patch quoted above. Also, the system is incredibly slow under load. The patch did no help with that. |
|
There are a couple of issues going on here. The kernel patch fixes a specific deadlock related to vmalloc(). That's well understood and is an upstream kernel bug; because the ZFS port makes heavy use of vmalloc(), we tickle it fairly often. The second issue is the kswapd issue which has been reported. I believe this is caused by using GFP_NOFS a bit too aggressively in the ZFS code. This flag prevents the kernel from reclaiming memory as part of other memory allocations while going about its usual business. Relaxing the use of GFP_NOFS has introduced a few new deadlocks which still need to be run down. The GFP_NOFS changes are currently in a spl/zfs branch of the same name. Getting a handle on both of these issues, I'm fairly sure, should significantly improve things. |
|
The GFP_NOFS branch now contains a workaround (needs testing) for the deadlock caused by the upstream kernel bug. Getting it fixed in the kernel is still the best course of action, but this hopefully will avoid the need for everyone to run a patched kernel. That's a deal breaker for most people in my experience. Additional deadlocks were also introduced on this branch due to GFP_NOFS being used less. This should be good for kswapd but bad until we can resolve them. I have a good idea of how to do this, but no patches yet. If anyone's feeling brave I'd love to hear how the current branch works for you. It has worked for my light testing so far. |
|
git pull from my master to your master. i have merged your dracut branch changes. |
|
System hard froze again during rsync. Using the latest code in spl and zfs. arc_reclaim activity seen when free memory goes down to around a hundred fifty megabytes. |
|
Kernel oops when memory gets below 5MB. Cannot read the message as everything is.... wait... Lemme take a picture, okay? |
|
Gotta love android phones. |
|
Gonna try again with primarycache=metadata |
|
Remember this is all with a zfs max arc limit of half a gigabyte! Still running out of memory??? Wtf |
|
Looks like even with primarycache=metadata it is going to die....... yes. It died. Out of memory again. |
|
This time the machine died but reboot worked. Probably because of me not waiting to take and upload a picture. |
|
With primarycache=metadata, the system is exceedingly slow. |
|
Trying primarycache=none. I would really like to rely exclusively on the Linux dentry and page caches, however smart the ARC is supposed to be. |
|
Gotta wonder, why does zfs_arc_max have no effect on the ZFS ARC size? (At least with primarycache=all or metadata...) |
|
Looks like it is going to die anyway, even with primarycache off... oh yes, it died. |
|
Trying with secondarycache=none.... |
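For anyone following the last few comments, the cache settings being toggled are ZFS dataset properties, roughly like this (the dataset name is a placeholder):

    zfs set primarycache=metadata rpool/ROOT    # ARC caches metadata only
    zfs set primarycache=none rpool/ROOT        # ARC caches nothing for this dataset
    zfs set secondarycache=none rpool/ROOT      # no L2ARC caching either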
devsk
commented
Mar 23, 2011
|
You are running into the very same issues I ran into. https://github.com/behlendorf/zfs/issues/149/#issue/149/comment/854776 |
|
Died again. And with arc disabled, it is excruciatingly slow too... |
|
OK, OK, OK. I can take a hint. :) With the kswapd thrashing behind us (right?), I'll take a look at keeping the memory usage within the specified limits. |
|
The reason I use 65M for max ARC is that even with that, ZFS memory usage goes to around 256-384M. So if you have a 512M ARC size I wouldn't be surprised if memory usage goes to around 2G. |
|
brian: I don't see any kswapd thrashing when free memory is available, but when it starts getting tight (~20M free), it surely appears again. But that is to be expected, as kswapd is scrambling to find pages to swap out?
fajarnugraha: the problem with 65M max arc is that, well, eh, the system is excruciatingly slow. I'd rather have a choice to just use the Linux dentry and page caches, and forget about the ARC. I understand that presently ZFS VFS code will still be run even in cases where the dentry cache contains a dentry that userspace is trying to access, or the ARC contains block data that userspace is trying to read, so yeah, that is probably the reason we have this huge overhead even for cached data. That is not the case in other file systems -- as soon as something is cached in either the page or the dentry caches, none of the filesystem code needs to be invoked. Bummer for ZFS here. |
|
Also I would like to point out that this is not the case for ZFS-FUSE, for when data has already been cached, the ZFS code is never invoked again. I wrote the patch to make this possible, because without that, ZFS-FUSE was much much slower than even ZFS in kernel. |
|
I agree with pretty much everything you just said. :) Unfortunately, ZFS is a very different beast than every other Linux filesystem. I would love to integrate it more closely with the Linux dentry/inode/page caches to get exactly the performance improvement you suggest. I'm happy to have a detailed development discussion on exactly how this should be done. I would suggest the place for it would be on the zfs-devel mailing list. But my first priority is getting the existing implementation stable. |
|
Rudd-O: I know that running with arc=65M is slow :) My point is that currently there's a lot more to zfs memory usage than just the ARC, so I wouldn't be surprised if a 512M ARC means much higher memory usage. Having a specified limit for all zfs memory usage would be good, but as Brian mentioned we don't have one right now. Since your patch to zfs-fuse was good performance-wise, can you easily port it to in-kernel zfs? |
|
No, I cannot. The patch I wrote merely told FUSE to start caching stuff in the dentry / pagecache. ZFS kernel has a different road to travel -- one that involves integrating znodes with inodes to enable proper dentry caching, and other types of work to enable reliance on pagecache alone. |
|
Gawd damn. http://pastebin.com/yZy2TVY4 Kernel BUGs galore, and it's always when checking that BAT0 file. ALWAYS. They have been happening since I patched my kernel. I will have to revert that patch and work with the vanilla kernel. |
|
Not a lot to go on there. The kernel patch shouldn't be needed anymore with the source from master so it will be interesting to see if you still see this with the vanilla kernel. |
|
With your latest code AND without the kernel patch, I see no freezes... yet.
|
|
You could try increasing /proc/sys/vm/min_free_kbytes. This is the threshold used by the kernel for how much memory it wants to keep free. Bumping it up a little bit for now might help with the stuttering by leaving some more headroom. |
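A hedged sketch of that tuning (the 65536 value is only an illustration; the knob is in KiB and resets at reboot unless persisted through sysctl):

    cat /proc/sys/vm/min_free_kbytes             # check the current threshold
    echo 65536 > /proc/sys/vm/min_free_kbytes    # keep roughly 64 MiB free as headroom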
|
Interesting. I will try that if I continue seeing stuttering.
|
devsk
commented
Mar 23, 2011
|
Stuttering comes from the swapping code path. min_free_kbytes won't help with that. In fact, the larger the min_free_kbytes, the earlier the kernel will try to swap. Unless we fix the loop and have zfs free memory inline instead of asynchronously in a separate thread (which may have scheduling trouble with kswapd hogging the CPU trying to find free pages), the kernel will continue to swap. There is a lot of potentially conflicting machinery at work here. |
|
Devsk, you're right. Thanks for reminding me. Before I got distracted with the GFP_NOFS badness I started to work on a patch to do direct reclaim on the ARC. Right now all reclaim is done by the arc_reclaim thread and it basically just checks once a second and shrinks the ARC if memory looks low. That of course isn't fast enough for a dynamic environment like a desktop, so swapping kicks in. Adding the direct reclaim path (via a shrinker) should improve things... at least that's the current theory. |
|
YES! THAT IS EXACTLY THE PROBLEM. This is why you see kswapd and arc_reclaim. So when can we have that juicy bit? :-D
|
|
I tested a build from master on RHEL with a 2.6.32 kernel; the dracut change causes an error during "make rpm". The cause is simple: RHEL5 does not recognize the %{_datarootdir} macro in zfs.spec. I had to change it manually to %{_datadir}, then the build process completed successfully. |
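The workaround described is a one-line substitution in the spec file; something along these lines (assuming it is run from the top of the source tree):

    # RHEL5's rpm macros lack %{_datarootdir}; fall back to %{_datadir}
    sed -i 's/%{_datarootdir}/%{_datadir}/g' zfs.spec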
devsk
commented
Mar 26, 2011
Not anytime soon I guess...:-D Surprise me pleasantly, Brian...;-)
|
I have a branch now which implements much of this, but it still requires some tuning. Give me a few more days to chew on it and I'll make it public for others to test and see if it improves their workloads. I've been using your rsync/find test workload on a 2GiB Atom system; once it works there I'll verify it works well on some 16-core 128GiB memory systems. In the process I have also gotten a good handle on your memory usage issue. There's no leak, but find does cause some pretty nasty fragmentation. I'll write up my understanding, perhaps in a bug or on the mailing list, next week. There may be a few easy things which can be done to improve things. There are also some much harder things which will have to wait. :) |
|
These changes are the result of my recent work on getting a handle on vmalloc() and memory usage. They are still a work in progress but available for testing. Once I'm happy with everything I'll make a detailed post to zfs-devel explaining the memory usage on Linux as it stands today. If you want to test these changes you must update both the spl and zfs source to the shrinker branch (a checkout sketch follows below). https://github.com/behlendorf/spl/tree/shrinker https://github.com/behlendorf/zfs/tree/shrinker Here's a high level summary of these changes:
There is one major and one minor outstanding issue I know of which are preventing these changes from being merged in to master.
kernel BUG at fs/inode.c:1333! [ in iput() ]
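For anyone wanting to try the shrinker branches linked above, checking them out would look roughly like this (a sketch; the usual configure/make steps for spl and zfs are omitted):

    git clone https://github.com/behlendorf/spl.git && (cd spl && git checkout shrinker)
    git clone https://github.com/behlendorf/zfs.git && (cd zfs && git checkout shrinker)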
|
devsk
commented
Mar 30, 2011
|
Wow! That's a lot of work for a week...:-) The earliest I can test is the weekend though...:( Swamped with work of my own. |
|
I try and keep busy. No rush to get this tested; it's going to take some time to run down the iput() issue mentioned above. I just wanted to make what I'm thinking about public for comment. |
|
This was fixed in what will be 0.6.0-rc4. Closing issue, see comments in issue #149 starting here. https://github.com/behlendorf/zfs/issues/149#issuecomment-1042925 |
behlendorf
closed this
Apr 22, 2011
|
kswapd is back to thrashing again (in my observation, any time that c_max is set via kernel module option to 18 or more GB of RAM in a swapless system with 48 GB memory). Both kswapd0 and kswapd1 will spin at 100% CPU, pegging two cores of the eight-core machine. This is bad; I have had to resort to limiting the ARC to around 15 GB and I am testing again with these params: hash_elements_max 4 1150514 |
|
Is this with the latest spl+zfs master source? Several recent VM changes were merged in there which I expected to make this sort of thing much less likely. If in fact they have had an adverse impact I'd like to get it resolved right away. In particular, are you running with the following commits? SPL ZFS |
Rudd-O commented Mar 10, 2011
My system freezes when rsyncing large volumes of data.
The first rsync finished just fine, and copied over 80 GB of data from my closet.
However, a second rsync pass (which is nothing more than stat()s) on the same files (my /home directory), will eventually -- into about a minute or so of reading massive amounts of files -- grind the system to a halt. The top view freezes completely while kswapd is at the top of the process list, and ps ax hangs in the middle of the process listing. Obviously I cannot provide a screenshot of that.
What workarounds can I apply to tell ZFS not to use so much memory? Even if it is slower, I need to see if this is a memory problem.
Swap rests on another partition of the same SSD. It is not swapping with a file on the zfs volume.