Added CONFIG_COMPACTION=y to defconfigs #349

Closed
wants to merge 1 commit into
base: rpi-3.6.y
3 participants
@Ferroin
Contributor

Ferroin commented Aug 5, 2013

This adds a slight overhead to memory allocation (maybe 2us without
overclocking), but improves performance with workloads that use a lot of
big blocks of memory. It might also (theoretically) improve behavior
with respect to issues like #153; I've had such problems less frequently on
kernels with this set.

Added CONFIG_COMPACTION=y to defconfigs
This adds a slight overhead to memory allocation (maybe 2us without
overclocking), but improves performance with workloads that use a lot of
big blocks of memory.  It might also (theoretically) improve behavior
with respect to issues like #153.
@popcornmix

Collaborator

popcornmix commented Aug 5, 2013

@P33M any thoughts?

@P33M

Contributor

P33M commented Aug 5, 2013

That was one option that was on my to-do list to test. We need to get some testing input from people actually seeing this error though - if we are doing an atomic allocation then the delay can be a bit big... uvcvideo in particular is a bad offender for this because it will request a rather large coherent range for its URBs.

@Ferroin

Contributor Author

Ferroin commented Aug 5, 2013

Well, as I said, I personally see the error less with this enabled: I get errors like #153 maybe once a week, as opposed to roughly once a day without it.

@popcornmix

Collaborator

popcornmix commented Aug 15, 2013

Just noticed that CONFIG_COMPACTION is enabled in 3.9.y kernel tree (and so branch=next firmware), so I guess it doesn't have any widespread bad effect.

@Ferroin

Contributor Author

Ferroin commented Aug 16, 2013

Just finished some intensive profiling. With this enabled, malloc() is ~5us slower in most cases. The exception is when memory is almost full but the request would still fit: there it is marginally faster, because the kernel can compact memory to make a large enough space for the request instead of either killing something or paging to disk.

@popcornmix

Collaborator

popcornmix commented Aug 16, 2013

@Ferroin
So what's your view? Is the 5us a problem?
How long does memory allocation normally take? How many allocations/sec are you seeing?

@Ferroin

Contributor Author

Ferroin commented Aug 16, 2013

Outside of realtime use cases, 5us is probably not a problem; even then it might not be, depending on how the program is written (i.e., if it allocates all its memory on startup instead of allocating on the fly, this will have less impact).

As for the 'normal' time that allocation takes, it can vary hugely depending on a large number of factors. In general, any dynamic allocation call (e.g. malloc(), free(), calloc(), kmalloc() etc.) is going to be slow (on the order of at least 100us). The 5us figure is based on more than ten thousand samples taken over a few days during a wide variety of system states. The biggest overall increase was almost 2ms, when it was the only userspace process running, but this is a major outlier; most of the samples were in the range of about +13us to -6us difference with this option on. To do the actual testing, I used profiling on the following generic program:

for (int i = 0; i < 600; i++) {
    sleep_1_second();
    mem = malloc(random_multiple_of_4096);  /* the profiled call */
    sleep_1_second();
    free(mem);
}

As for the number of allocations per second, this is largely dependent on what programs you are using. In general, stuff that has very well defined memory usage, such as most of the GNU coreutils (with the big exception of dd) and a large majority of emulation software, will show fewer allocations per second. Simulations, games, and media software tend to have much higher allocation rates, but usually only in large bursts.

@popcornmix

Collaborator

popcornmix commented Aug 16, 2013

Obviously an increase of 5us for a function that is called 100,000 times per second would be a big problem.
If it's called fewer than 1000 times per second, then it would be hard to measure the impact.

Does this apply just to user malloc, or to kmallocs as well? Can the usb or network kernel drivers cause many allocations per second?
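To put rough numbers on the two call rates mentioned above, here is a back-of-the-envelope calculation. It assumes the full 5us penalty applies to every single call, which the profiling earlier in the thread suggests is a simplification:

```latex
\begin{align*}
100{,}000\ \text{calls/s} \times 5\,\mu\text{s} &= 500\ \text{ms of added allocation time per second} \\
1{,}000\ \text{calls/s} \times 5\,\mu\text{s} &= 5\ \text{ms of added allocation time per second}
\end{align*}
```

Under that assumption, the first case costs about 50% of one CPU's time and the second about 0.5%.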

@P33M

Contributor

P33M commented Aug 16, 2013

kmalloc is my biggest concern. dwc_otg in particular is very kmalloc-happy, which it mainly does with interrupts disabled.

Other kernel drivers may disable interrupts to do kmallocs (though this is rarer - they are usually sensibly written).

I did profile the URB enqueue function to determine its critical section time - there are two potential kmalloc calls within this, and the whole section usually takes 4us. If we add 5us to each, then it's a big deal.

As established previously, the USB subsystem is very sensitive to something disabling interrupts for more than a microframe's worth of time. I have a feeling that some high-bandwidth audio/video USB driver will do something silly with allocation, and with the added delay may cause problems.
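As a rough worked example of why that matters, take a high-speed USB microframe of 125us (from the USB 2.0 specification). Assume both kmalloc calls are taken and each incurs the full 5us that was measured for userspace malloc() earlier in the thread. Neither assumption has been verified for kmalloc.

```latex
\begin{align*}
t_{\text{critical}} &= 4\,\mu\text{s} + 2 \times 5\,\mu\text{s} = 14\,\mu\text{s} \\
\frac{t_{\text{critical}}}{t_{\text{microframe}}} &= \frac{14\,\mu\text{s}}{125\,\mu\text{s}} \approx 11\% \qquad \text{(versus } \tfrac{4}{125} \approx 3\% \text{ today)}
\end{align*}
```

Under those assumptions, the interrupts-off window for a single enqueue would grow from roughly 3% to roughly 11% of a microframe.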

@Ferroin

Contributor Author

Ferroin commented Aug 16, 2013

I'll rebuild my kernel with the appropriate options to check if it has a big effect on kmalloc and get back to you on it.

Even aside from this, if dwc_otg uses kmalloc a lot with interrupts off, that probably explains a lot of the USB multimedia problems, and possibly also the input latency problems some of my friends have seen.

@Ferroin

Contributor Author

Ferroin commented Aug 17, 2013

Just finished rebuilding the kernel. I'll do some intensive testing both with and without CONFIG_COMPACTION=y.

Personally, I haven't had any problems up to this point using a kernel with CONFIG_COMPACTION set, but then my only USB traffic is infrequent NFS and SSH network connections.

@P33M

Contributor

P33M commented Aug 30, 2013

We had a real-world use case unrelated to USB that benefited from this, as Rob and I found yesterday.

Extracting an xz-compressed tar on the SD card and then writing to the same SD card sped up by about a factor of 2.

As CONFIG_COMPACTION seems to squash the majority of allocation failures (and pesky kevent 2 spam), with a small static penalty for small allocations that require a compaction event, I recommend including it. I define "small" as "not a significant fraction of a USB microframe", which is the critical timing parameter for holding off interrupts.

I did some basic regression testing with a USB webcam - the poster child for testing interrupt holdoffs - with this enabled it seems that guvcview complains less about running out of space for its buffers, which is good. The webcam was still useable.

I would also recommend changing the memory allocator to SLUB for an added performance benefit. I think Arch has been using this as the default in their kernels for some time now.

@Ferroin

Contributor Author

Ferroin commented Aug 30, 2013

I've also found NFS to be faster with this enabled (makes it more likely that a buffer allocation will be contiguous).

@popcornmix

Collaborator

popcornmix commented Aug 30, 2013

@P33M
Just to confirm. Remove SLAB and add SLUB?
SLUB_DEBUG? SLUB_STATS?

@P33M

Contributor

P33M commented Aug 30, 2013

Remove CONFIG_SLAB=y
Add CONFIG_SLUB=y
Add CONFIG_SLUB_DEBUG=y - adds to compiled code size, but defaults to all features being turned off, and therefore has a very minimal impact on speed.

See http://www.mail-archive.com/git-commits-head@vger.kernel.org/msg12389.html

Kernel docs say to leave SLUB_STATS turned off for production use (it profiles the allocator).
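For reference, this is a sketch of how the options discussed in this thread would typically appear in a generated .config. It is not copied from the commit that was eventually pushed, and the exact lines in the defconfig files may differ (a minimal defconfig omits options that are at their default values).

```
CONFIG_COMPACTION=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLUB_STATS is not set
```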

@popcornmix

Collaborator

popcornmix commented Aug 30, 2013

Thanks. That's what I'm building with. If I don't spot any obvious problems I'll push out an update with these options this weekend.

@popcornmix

Collaborator

popcornmix commented Aug 30, 2013

CONFIG_COMPACTION and CONFIG_SLUB options have been committed.
@Ferroin thanks for the suggestion.

@Ferroin Ferroin closed this Aug 30, 2013

andyduller pushed a commit to afterthoughtsoftware/linux that referenced this pull request Oct 30, 2013

davet321 pushed a commit to davet321/rpi-linux that referenced this pull request Nov 18, 2013

Add config options
Increase to CONFIG_MMC_BLOCK_MINORS=32
and enable CONFIG_JUMP_LABEL
See: raspberrypi#348

Move to SLUB memory allocator.
See: raspberrypi#349
