
Change to 250hz and voluntary preemption #1216

Closed
robingroppe opened this Issue Dec 6, 2015 · 60 comments


robingroppe commented Dec 6, 2015

Please change the kernel to 250 Hz and voluntary preemption.

In everyday use, my Pi with the modified kernel can take on more work while still staying responsive to all other running tasks. For example, when I run the MumbleRubyPluginbot (which needs to be fed with data every 20 ms or so) on the stock kernel and start an apt upgrade of some packages, it quickly starts to lag.
With the modified kernel everything is fine.
You said you would need some evidence, so I ran UnixBench on both kernels.
I can't tell much about starting graphical apps because my Pi runs headless, but I guess you guys have cross-compilers set up and can easily build a modified kernel.
By the way, these two settings are also used in almost every stock kernel in most distributions (Debian, Ubuntu...), and in my opinion there is a reason for that.

Unixbench Stock Kernel: https://robingroppe.de/media/rpi2/orig.txt
Unixbench Modified Kernel: https://robingroppe.de/media/rpi2/mod.txt


P33M (Contributor) commented Dec 7, 2015

Preemption is enabled by default.

https://github.com/raspberrypi/linux/blob/rpi-4.1.y/arch/arm/configs/bcmrpi_defconfig#L42

We don't set CONFIG_HZ in the defconfig, so the default of 100 Hz is used; in theory this gives a 10 ms timeslice. Can you increase the priority of the critical process to realtime and see whether you get the same results?
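For example, a minimal sketch using the standard util-linux chrt tool (the process name here is only a placeholder for whatever your critical process is):

$ PID=$(pidof mumble-ruby-pluginbot)   # placeholder process name
$ sudo chrt --fifo -p 50 "$PID"        # move it to SCHED_FIFO, priority 50
$ chrt -p "$PID"                       # verify the new policy and priority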


robingroppe commented Dec 7, 2015

Yes, preemption is enabled, but it's the full low-latency model, while the rest of the kernel is not otherwise tuned for realtime.
Check the difference between PREEMPT_VOLUNTARY and PREEMPT.
For example, Ubuntu always uses PREEMPT_VOLUNTARY for desktops and servers.
Only the low-latency kernel uses PREEMPT combined with 1000 Hz and a few other tweaks.
Full preemption causes kernel overhead and reduces throughput.


robingroppe commented Dec 7, 2015

Here is my modified kernel in case you want to run a few tests.
http://robingroppe.de/media/rpi2/kernel7.xz

I followed this guide:
https://www.raspberrypi.org/documentation/linux/kernel/building.md

So all you have to do is jump into the linux directory and...

$ sudo cp arch/arm/boot/dts/*.dtb /boot/
$ sudo cp arch/arm/boot/dts/overlays/*.dtb* /boot/overlays/
$ sudo cp arch/arm/boot/dts/overlays/README /boot/overlays/
$ sudo scripts/mkknlimg arch/arm/boot/zImage /boot/kernel7.img
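For reference, a sketch of the config changes under discussion, using the kernel tree's scripts/config helper; double-check the Kconfig symbol names against your tree before building:

$ cd linux
$ scripts/config --disable PREEMPT --enable PREEMPT_VOLUNTARY
$ scripts/config --disable HZ_100 --enable HZ_250 --set-val HZ 250
$ make olddefconfig   # resolve any dependent options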


robingroppe commented Dec 7, 2015

Just found a discussion about it on the Arch Board:
http://archlinuxarm.org/forum/viewtopic.php?f=23&t=7907


popcornmix (Collaborator) commented Dec 7, 2015

@robingroppe Can you test with CONFIG_PREEMPT_VOLUNTARY and with CONFIG_HZ_250 separately and report on the behaviour with each?

Both these settings will increase overhead and so reduce throughput, so we need to be very sure of exactly what the benefits are before enabling.


robingroppe commented Dec 7, 2015

In this situation, changing to voluntary will actually improve throughput. But I can do that. Do you have any tests you want me to rely on?


popcornmix (Collaborator) commented Dec 7, 2015

No, it may help latency, but it won't improve throughput.

These new preemption points have been selected to reduce the maximum latency of rescheduling, providing faster application reactions, at the cost of slightly lower throughput.

http://cateee.net/lkddb/web-lkddb/PREEMPT_VOLUNTARY.html


robingroppe commented Dec 7, 2015

Okay. Voluntary causes throughput to drop slightly, but full preemption, like you are using right now, causes an even bigger drop in throughput plus additional kernel overhead.


Ferroin (Contributor) commented Dec 7, 2015

@popcornmix That's in comparison to PREEMPT_NONE; PREEMPT_VOLUNTARY came before PREEMPT_FULL (which is what the RPi is using). PREEMPT_NONE is the highest-throughput option, but introduces noticeable latency for many things that require interactive usage (which is why almost nobody who isn't doing exclusively HPC workloads uses it anymore). PREEMPT_FULL provides the lowest latency, but at the cost of significant throughput (and potential stability issues). PREEMPT_VOLUNTARY was originally a precursor to PREEMPT_FULL, but is now kept as a compromise between the two.


robingroppe commented Dec 7, 2015

I am compiling two more kernels: one with just 250 Hz and full preempt, and one with 100 Hz and voluntary preempt. This will take a while. Has anyone tested the modified kernel already?


popcornmix (Collaborator) commented Dec 7, 2015

@Ferroin Okay, so CONFIG_PREEMPT_VOLUNTARY will make latency worse compared to the current config. Possibly an issue for some of the hardware drivers (LIRC/I2C/I2S/SPI etc.).


Ferroin (Contributor) commented Dec 7, 2015

Unless the drivers are in user space, this should actually make things better for them. The latency impact is entirely in user space, and technically, reducing preemption should make timing-critical stuff work more reliably (assuming they have critical sections properly wrapped in preempt_disable() calls).


robingroppe commented Dec 7, 2015

But the kernel timer of 250 Hz, i.e. 4 ms, will reduce latency by more than you lose by switching to the other preemption model.


Ferroin (Contributor) commented Dec 7, 2015

That's pretty dependent on how the drivers handle preemption, as well as what you are doing in general on the system. I think for most use cases, if it doesn't fully offset the latency increase from PREEMPT_VOLUNTARY, it should come very close. Some people trying to do odd timing-specific things with their hardware from userspace (the DHT11 temperature sensor immediately comes to mind) may have issues, but if they really need low latency, they should probably be building their own kernel with HZ=1000 anyway.

It's worth keeping in mind that the higher timer frequency will also increase power consumption (although I doubt it will be more than a few micro-amps' difference between HZ=100 and HZ=250), so it may be worth testing that as well.


clivem (Contributor) commented Dec 7, 2015

I was going to stay out of this, as I cannot provide any hard evidence. I don't know whether years of experience count? LOL

IMHO, the best balance between latency and throughput for general usage is PREEMPT_VOLUNTARY and 250 Hz for headless, and PREEMPT_VOLUNTARY and 1000 Hz for desktop/GUI. But I wouldn't change the current RPi default configs. If people want to change the defaults for their specific use cases, they can compile their own kernels......


robingroppe commented Dec 7, 2015

I am not asking to change values for some specific use case.
I want a usable general-purpose kernel right out of the box.
Look at what the big distros are using...
There is a reason why they do it that way.


Ferroin (Contributor) commented Dec 7, 2015

Might as well add what I use on various systems as well.

In general, I use one of four different configurations, depending on what the system is for:

  1. HZ=100 PREEMPT_NONE (I use this for stuff that is solely for number crunching or other similarly processor-bound things; it provides the best overall throughput, but latency is horrible. The only systems I run this way are usually dedicated BOINC clients, and on occasion VMs for testing particular things).
  2. HZ=250 PREEMPT_VOLUNTARY (I use this for most systems when I don't have some particular reason to use anything else. When I use a custom kernel on the Pi, I usually use this configuration).
  3. HZ=300 PREEMPT_FULL (I use this on systems that I need to do multimedia work on but don't need true real-time performance. The particular reasoning being that 300 is exactly divisible by both PAL and NTSC frame rates, so it's a bit better for live video editing).
  4. HZ=1000 PREEMPT_FULL (I use this only on stuff that needs absolutely minimal latency, usually when I'm doing something that requires real-time guarantees. It's horribly energy-inefficient (about 20-30 W greater power consumption compared to HZ=250 PREEMPT_VOLUNTARY on an AMD FX-8320), and really trashes throughput for computations (most benchmarks are noticeably lower with this than with HZ=250 and PREEMPT_VOLUNTARY)).

Overall, the biggest impact of both of these options is how many mandatory context switches they cause. Context switches are expensive, even on really well-designed hardware, and are a large part of what hurts throughput in number-crunching stuff. A higher frequency on the global timer interrupt (a higher HZ value) increases the number of required context switches in direct proportion to its value (each time it fires, you get at minimum two context switches: one from the running task to the scheduler, and one from the scheduler to the new task it selected to run). It's harder to quantify what impact the PREEMPT options have, but the more preemption points are available, the more likely a context switch will happen.
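If you want to watch these rates on a live system, a minimal sketch with standard tools (column names vary slightly between vmstat versions):

$ vmstat 1 5             # 'in' = interrupts/s, 'cs' = context switches/s
$ grep ctxt /proc/stat   # total context switches since boot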


Ferroin (Contributor) commented Dec 7, 2015

@robingroppe 'All the big distros are doing it this way' is not a valid argument in general, and for embedded systems in particular.

Take, for example, Ubuntu's choice of which kernel to ship with their releases: almost always, it's not a version that is tagged upstream for long-term support.

On top of the numerous poor decisions that get made by big distros, you need to remember that, other than OpenWRT, Angstrom, and their friends, big distros are targeted at desktop or server systems, which have very different requirements from embedded systems.

Arguably, HZ=100's only advantages over HZ=250 are throughput (which is insignificant when you have horrible latency) and energy efficiency (which shouldn't be a primary consideration when using something with a 5-10 W nominal power draw).

As for PREEMPT_VOLUNTARY, that has a lot more potential to impact existing user code, but is much less significant than switching to PREEMPT_NONE.


robingroppe commented Dec 7, 2015

Isn't the Pi meant to be a desktop or a server?


popcornmix (Collaborator) commented Dec 7, 2015

My Ubuntu install does have CONFIG_PREEMPT_VOLUNTARY and CONFIG_HZ_250.
OpenELEC on Pi has CONFIG_PREEMPT_VOLUNTARY and CONFIG_HZ_300.
OSMC on Pi has CONFIG_PREEMPT and CONFIG_HZ_100.
Might be worth checking what Arch Linux on Pi uses.

Does anyone object to these settings?
Comments @pelwell @P33M @notro ?
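One way to check a running kernel, assuming it was built with CONFIG_IKCONFIG_PROC (otherwise look for a config file under /boot):

$ zcat /proc/config.gz | grep -E '^CONFIG_HZ=|^CONFIG_PREEMPT'
$ grep -E '^CONFIG_HZ=|^CONFIG_PREEMPT' /boot/config-$(uname -r)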


Ferroin (Contributor) commented Dec 7, 2015

@robingroppe The original intent was to be an educational tool, for teaching basic programming skills, as well as basic electrical design skills. It's obviously evolved far beyond that (because at the time of release, it was the absolute cheapest SBC available that was actually usable beyond IoT type applications), but that doesn't mean that it's not an embedded system by nature.


clivem (Contributor) commented Dec 7, 2015

@popcornmix IIRC, Fedora defaults to CONFIG_PREEMPT_VOLUNTARY and 200 Hz for ARM kernel builds. (Just another useless piece of data.....)


Ferroin (Contributor) commented Dec 7, 2015

@popcornmix I'd say that, given the particularly heavy use of the Pi for media-center type things, HZ=300 is probably slightly preferable to HZ=250. The way people seem to be using it, we should almost certainly be prioritizing minimal latency over maximal throughput, so I'd say it's still a tossup whether we really want PREEMPT_VOLUNTARY over PREEMPT_FULL.


pelwell (Contributor) commented Dec 7, 2015

I'm curious to see whether the difference in flat-out number-crunching performance between HZ=100 and HZ=250/300 is measurable on an otherwise idle system. But in general, given that OpenELEC seems OK with VOLUNTARY/300, I think we could give it a try.


robingroppe commented Dec 7, 2015

I can say that my Pi runs time echo "scale=5000; 4*a(1)" | bc -l (which computes pi to 5000 digits) in 2m07s on the stock kernel and 2m10s on a 1000 Hz rt kernel.


robingroppe commented Dec 7, 2015

So 250 or 300 Hz should be nothing to worry about.


popcornmix (Collaborator) commented Dec 7, 2015

127 seconds versus 130 s is 2.3%, which I wouldn't say is nothing to worry about.
We have spent a lot of time and effort on optimisations smaller than that.

Of course, this is 1000 Hz plus whatever changes "rt" implies, so the actual difference is likely smaller.
But in general we wouldn't accept a 1% performance loss without a very compelling reason.

It would be good to do the same test with just CONFIG_HZ_300 and with just CONFIG_PREEMPT_VOLUNTARY and report the changes.
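Something along these lines, run on each kernel, would do; it just repeats the bc test already used above a few times to average out noise:

$ for i in 1 2 3; do
>   time sh -c 'echo "scale=5000; 4*a(1)" | bc -l > /dev/null'
> done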


robingroppe commented Dec 7, 2015

Just checked: ArchARM uses 200 Hz and voluntary preemption. I will check the difference against a 250 Hz kernel.


robingroppe commented Dec 7, 2015

Stock:

  1. Attempt
    real 2m22.995s
    user 2m8.580s
    sys 0m0.050s
  2. Attempt
    real 2m8.763s
    user 2m8.760s
    sys 0m0.000s
  3. Attempt
    real 2m8.713s
    user 2m8.690s
    sys 0m0.000s

250hz-Voluntary:

  1. Attempt
    real 2m9.216s
    user 2m9.184s
    sys 0m0.012s
  2. Attempt
    real 2m9.370s
    user 2m9.360s
    sys 0m0.008s
  3. Attempt
    real 2m9.252s
    user 2m9.244s
    sys 0m0.008s

But don't forget that this is just one specific benchmark. The UnixBench results told another story.


pelwell (Contributor) commented Dec 7, 2015

Which was...?


JamesH65 (Contributor) commented Dec 8, 2015

Sorry to butt in; I've been following this thread with interest.

The standard kernel needs to cater to educational needs, so it needs to have a responsive desktop. Headless devices are not a major use case for education.

What is noticeable from the posts in this thread: no actual figures. Someone needs to determine a set of tests to run that are representative of the major use case, and try them at the different settings proposed. That way the Foundation can make a valid assessment of the right figures to use.

On 8 December 2015 at 12:00, Robin Groppe notifications@github.com wrote:

I also think the standard kernel should fit Cases 2 and 3. Most people I know are using it as a headless server for network applications or as a small desktop. The headless server use will mostly also benefit from these changes. So what do you think?


robingroppe commented Dec 8, 2015

I don't know much about scientific benchmarking. If you do, please give it a go; a download link to the modified kernel has been posted.
I can say that even on the terminal the system feels snappier.


Ferroin (Contributor) commented Dec 8, 2015

We can't really benchmark desktop responsiveness scientifically; it's way too subjective to measure properly. The closest we can get is probably latencytop, which needs its own config option turned on in the kernel to be usable (which will affect the results somewhat). unixbench may be worthwhile as a way to determine throughput, but that's still not great.

In general, I'd say the particular things to look at are:

  1. How long it takes to get to a login/desktop from the time power is applied (this is hard to measure properly; the best option is probably to use timestamps in the logs).
  2. What numbers we get from something like bonnie++ (the big limiting factor for most usage on the Pi is usually storage, so this has a significant impact on latency).
  3. How long it takes to render a reference web page (the reference page doesn't have to be really complex; something like acid3 should be fine). There's a frontend for WebKit that renders directly to a PNG image, which might be useful for testing this.
  4. How it impacts things that are memory-bound as opposed to CPU- or I/O-bound (a good test for this would be timing a long effect chain in SoX processing some reference audio; I'll look at throwing together a script to test this).

Ideally, we should get as many samples as reasonably possible, as these are all things that may be affected. (A sketch of the first two items follows.)
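As a rough sketch of items 1 and 2 on a systemd-based Raspbian (bonnie++ flags are the usual ones; adjust the size and user to your setup):

$ systemd-analyze                 # boot time, reconstructed from the logs
$ bonnie++ -d /tmp -s 2G -u pi    # basic disk throughput/latency run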


popcornmix (Collaborator) commented Dec 8, 2015

nbench is very simple for testing integer/floating-point and memory operations.
A more advanced GUI benchmark would be something like cairo-traces.
There seems to be a cut-down version more suitable for embedded platforms here:
https://github.com/ssvb/trimmed-cairo-traces


Ferroin (Contributor) commented Dec 8, 2015

Benchmarks are by nature synthetic workloads, so no single benchmark is going to give us a full picture of system performance.

Based on this, the wider the variety of benchmarks we test, the better. I do think we should at a minimum do something to test disk performance, as that is usually one of the biggest bottlenecks on most SBC-type systems. I think cairo-traces is definitely worth testing with. I still think benchmarking HTML rendering is worth doing as well; most people I personally know who use the Pi as a desktop primarily use it for web browsing, and rendering HTML is complex enough that it should show performance variance pretty well.


mr-berndt commented Dec 9, 2015

Thanks to this discussion I was able to solve a major issue I was having with an embedded system here: when building my kernels I had always used the very common recommendation for audio workstations (mostly x86, with some power behind them), which is 1000 Hz and PREEMPT.

Running squeezelite with SoX resampling, JACK, and BruteFIR at 96 kHz on the Pi (quite a few tasks) led to short crackling when the playlist changed from 96 kHz to 44.1 kHz.

Now I have changed my kernel to PREEMPT_VOLUNTARY and 250 Hz and the issue is gone.

So again, no benchmark, but a hint.. ;)
And thanks from my side for bringing this up!


pelwell (Contributor) commented Dec 9, 2015

Yours is an easier decision, since, as you have seen, it reduces the overhead. Going from 100 Hz to 250 Hz adds some overhead, but possibly not enough to worry about; we won't know without some proper benchmarking.


Ferroin (Contributor) commented Dec 9, 2015

This may be worth pointing out:

On the Pi, running with no overclocking and HZ=100, we get 7 million cycles between each timer interrupt. With HZ=250, it's approximately 2.8 million. Ignoring the scheduler overhead (because it's not deterministic, and I don't know what the minimum overhead for it is on ARM), that cuts the processor time per time slice to 40% of what it was. This also ignores two specific things, however:

  1. For a compute-heavy workload with one task per CPU, the only hit to performance is the added scheduler overhead, as there are still 700 million cycles in one second, no matter what the scheduling interrupt frequency is.
  2. For a realistic desktop workload (mostly idle, but with lots of processes and threads), the more frequent scheduling directly improves responsiveness, which improves the net productivity of the users themselves (you obviously get more work done the less you sit around waiting for your computer to respond).
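(The per-tick budget is simple shell arithmetic, assuming the 700 MHz clock used above:)

$ echo $((700000000 / 100))   # cycles per tick at HZ=100
7000000
$ echo $((700000000 / 250))   # cycles per tick at HZ=250
2800000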

pelwell (Contributor) commented Dec 9, 2015

"HZ=1000, we get 7 million cycles"

Surely that's a typo? 7 GHz would be an aggressive overclock.


Ferroin (Contributor) commented Dec 9, 2015

@pelwell Yes, I meant at HZ=100.


robingroppe commented Dec 9, 2015

Now think about this: every 10 ms the kernel looks for something to do.
What if a process finishes after, let's say, 1.5 million cycles?
The scheduler does not know that until the full 10 ms are over, so for the rest of the slice the CPU just sits there waiting for the scheduler to fire again.
This causes wait loops for the rest of the processes, which might already have been run if the scheduler had been aware of the situation.
I would not call this overhead, like the Arch guys did, but it is surely inefficiency.
But that's the situation on a desktop or server: a lot of processes, not one that crunches numbers all day.


Ferroin (Contributor) commented Dec 9, 2015

Actually, assuming the process does something that causes it to go into one of the sleep states (S or D in top and similar tools), the reschedule should happen almost immediately. The issue is when you have more runnable tasks than CPU cores to run them on. The lower the scheduler interrupt frequency, the more a task can do before it gets forcibly preempted by some other runnable task. (It's actually more complicated than this, because the Linux scheduler uses the nice value of a task not to prioritize its position in the queue of runnable tasks, but to adjust its scheduling time slice: lower nice values get longer time slices, and thus get to run more than higher nice values. This nicely solves many of the issues present in the traditional priority-queue schedulers found in older UNIX systems like SVR4, but makes it much harder to truly determine the impact of adjusting the scheduler interrupt frequency.)

This is why increasing the frequency reduces raw computational throughput but improves latency. Unlike most x86 desktops, the Pi has a slow enough processor that many things done on a desktop (for example, rendering this web page, or fetching an e-mail) take more than one scheduling period, which means they will tend to block other tasks from running for longer when the scheduling period is longer, thus hurting latency and responsiveness.


robingroppe commented Dec 9, 2015

Have you compared the raw performance of a 100 Hz kernel vs. a 1000 Hz kernel?


robingroppe commented Dec 9, 2015

Context switches and scheduler interrupts are pretty cheap relative to how much smoother the system feels.
I stopped the bc test at roughly 2m9.50s with a 1000 Hz kernel.


Ferroin (Contributor) commented Dec 9, 2015

That depends on what you mean by 'raw performance'. In a pure computational sense, HZ=1000 will give around 10x less computational performance than HZ=100, because you will have approximately 10x the scheduling overhead. As far as latency goes, I have no actual numbers, but I see a noticeable difference (probably subjective, though I did jump through hoops to do proper double-blind testing); that doesn't necessarily mean anything, and 1000 Hz is overkill for almost anything but gaming.


robingroppe commented Dec 9, 2015

Why do you think the performance will drop by about a factor of 10? How long does an interrupt take?


Ferroin (Contributor) commented Dec 9, 2015

Apparently I'm horrible at typing numbers today...

I'm not entirely sure what I intended to state there, but a 10x performance drop was definitely not it. The bit about the scheduling overhead still stands, though: if you're running the scheduler 10 times as often, you have 10 times the overhead.

In response to the statement before my comment:
Context switches are always expensive compared to almost anything else (you're saving a significant percentage of the processor state to memory, copying in some other saved state, and completely trashing the processor cache); that's part of why fork() is so slow on almost every system in existence, and why anybody doing real HPC work locks individual tasks to their own CPU core and then avoids syscalls at all costs. I don't remember the efficiency of the scheduler on Linux, but I'm pretty sure it's not O(1) and scales in some way with the number of tasks. And on top of that, the more frequently interrupts fire, the higher your power consumption. I feel the trade-off is probably worth it at HZ=250, but it's almost certainly not worth it at HZ=1000.


robingroppe commented Dec 9, 2015

I mentioned the 1000 Hz case to show the other extreme. Even there, the loss was not even 1%.
Maybe there are workloads where this really hurts. But as we are talking about 250 Hz, and I have tested it a bit, I have to say that I don't see a downside. I saw a loss of 0.048% in crunching numbers with bc but a massive gain in UnixBench, especially in multicore workloads. The Arch guys saw a massive gain on Linpack too.

popcornmix added a commit that referenced this issue Dec 15, 2015


popcornmix (Collaborator) commented Dec 15, 2015

We're happy enough with switching to CONFIG_PREEMPT_VOLUNTARY to match Ubuntu and OpenELEC. This is in latest rpi-update kernel.

We'll see if we get any positive or negative reports from this, and possibly increase the HZ value in a subsequent update.


mk01 commented Dec 24, 2015

Perhaps what needs to be understood:

  • voluntary vs. preemptive: more preemption points doesn't mean preemption will always happen, only that it can happen.
  • periodicity of ticks: periodic ticks have (IMHO) been obsolete for many years. Sure, there can be specific requirements, but honestly, they will be hard to find; in current kernels the cost of handling a tick is close to unmeasurable, so the tick rate alone has barely any effect. Arguing over 100 vs. 1000 makes little sense as long as ticks can be handled dynamically, and they can (seeing as few as 5-10 ticks per second is not uncommon, so why still generate those 100?).
  • what I agree with is the 3-5% throughput gain in the voluntary case, but again, 95% of the time our embedded devices run with ONE active task anyhow, so there will be no difference.
  • nobody has checked the other CONFIG params, especially HIGH_RES_TIMERS. Until that is taken into account, everything being said / set / expected is invalidated: depending on the platform's timer implementation (and which one is used for HR timers), it can be responsible for 2000-3000 wakeups per second, so why discuss 250 vs. 300? (A quick check is sketched below.)

mk01

btw: the kernel is not the same as it was 20 years ago, which is when all those distros set their .config params and never looked back.
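A quick sketch for checking the real timer wakeup rate (arch_timer is what a Pi 2 typically shows in /proc/interrupts; the name may differ on other boards):

$ grep arch_timer /proc/interrupts
$ sleep 10
$ grep arch_timer /proc/interrupts   # per-CPU delta / 10 = timer interrupts per second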


Ferroin (Contributor) commented Jan 4, 2016

@mk01 The point about preemption is perfectly valid, but it doesn't have much bearing on the fact that more preemption points mean lower latency for stuff that's latency-sensitive. As for tickless operation (which is what you appear to be referring to in the second and third points), that is all well and good, except that at least one CPU has to keep something running to provide timekeeping, and as a result a generation 1 Pi can't be run tickless at all (because it has only one CPU). On top of that, the timer frequency still matters when the system isn't sitting idle, because when the tick is running, that's the average frequency it runs at (it's only the average because of how Linux's scheduler works, but that is beyond the scope of this discussion). The bit about hrtimers is also worth considering, but there isn't as much variance there as you would think, and hrtimers only cause wakeups when something is directly using them.

As for the comment about distros not changing kernel configuration, that's blatantly wrong. Aside from the fact that the most widely used distros didn't exist 20 years ago, almost none of them set config options once and never change them. It doesn't happen often in most distros, but it does happen. Usually it's as new features become stable (BTRFS and F2FS are both included as modules in all major distros that ship precompiled kernels; they didn't even exist as config options 5 years ago, let alone 20). Less frequently, distros change config options for performance reasons (this is why Ubuntu ships a standard kernel, a virtualization-targeted kernel, a server-targeted kernel, and a low-latency kernel (which has HZ=1000 and PREEMPT_FULL), and why they switched from the CFQ I/O scheduler to the deadline I/O scheduler), or for security reasons (I know of at least a few distros that recently disabled vm86 support in their default configs; most jumped on disabling 16-bit segment support, and a lot of them quickly turn off any legacy syscall when an option to do so appears, etc.).

popcornmix added a commit that referenced this issue Jan 4, 2016

ffeldbauer pushed a commit to ffeldbauer/PandaRPiKernel that referenced this issue Jan 28, 2016

popcornmix added a commit that referenced this issue Feb 20, 2016


P33M (Contributor) commented May 17, 2017

We have settled on CONFIG_PREEMPT_VOLUNTARY and CONFIG_HZ=100 as the default. We have a microsecond-resolution timestamp source for precise userspace timing.

@P33M P33M closed this May 17, 2017
