
performance tuning - some improvements #96

Closed · interduo opened this issue Sep 13, 2022 · 15 comments

interduo commented Sep 13, 2022

@rchac could you check, on some medium/big setups, the impact on the following?

  • CPU usage with this performance tuning (look at si and sy in top)
  • total router performance

1. Change the NIC adapter queue (ring) size to the maximum (you can check the maximum using ethtool --show-ring netdev0):
ethtool -G netdev0 rx 8192 tx 8192
ethtool -G netdev1 rx 8192 tx 8192

I expect here: more throughput and less CPU usage, because the CPU can get more work done in a single interrupt.
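
As a side note, since the supported maximum differs per driver, one could read the pre-set maximums from ethtool -g and apply whatever the NIC reports instead of hard-coding 8192. A minimal sketch (netdev0 is a placeholder interface name; the awk parsing assumes the usual ethtool -g output layout):

IFACE=netdev0
# Parse the "Pre-set maximums" section of ethtool -g for the RX and TX limits
RX_MAX=$(ethtool -g "$IFACE" | awk '/Pre-set maximums/{f=1} f && /^RX:/{print $2; exit}')
TX_MAX=$(ethtool -g "$IFACE" | awk '/Pre-set maximums/{f=1} f && /^TX:/{print $2; exit}')
# Apply the driver-reported maximum ring sizes
ethtool -G "$IFACE" rx "$RX_MAX" tx "$TX_MAX"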

2. Changing **congestion control**: `sysctl -w net.ipv4.tcp_congestion_control=htcp`

This is a sender-side parameter, and in LibreQoS we do L2 bridging.
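
For completeness, a hedged sketch of checking which algorithms the kernel offers and persisting the change (the /etc/sysctl.d/libreqos.conf path follows the convention used later in this issue; htcp is usually built as a module):

# See which congestion control algorithms are currently available
sysctl net.ipv4.tcp_available_congestion_control
# Load the htcp module if it is not already loaded
modprobe tcp_htcp
# Persist the setting so it survives reboots
echo "net.ipv4.tcp_congestion_control=htcp" >> /etc/sysctl.d/libreqos.conf
sysctl --system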

3. Making the backlog bigger:
ip link set netdev0 qlen 10000
ip link set netdev1 qlen 10000
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_max_backlog=300000

I expect here: more throughput and less CPU usage, because the CPU should get more work done in one interrupt and can do it more smoothly, since it can drain the backlog out of the standard sequence.

4. Turn off adaptive coalescence (and set these recommended values):
ethtool -C netdev0 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122
ethtool -C netdev1 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122

Here I expect more throughput and constant latency (the defaults are: adaptive on, tx-usecs: 8, tx-frames: 128), so with the defaults the latency could jitter under bigger load.
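
It may be worth recording the driver defaults before changing anything, e.g. (output format is driver-dependent):

# Show the current interrupt coalescing settings for both shaping interfaces
ethtool -c netdev0
ethtool -c netdev1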

5. I think that Ubuntu doesn't set the CPU scaling_governor to performance by default, so do that with:
    echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    What do you get in your OSes by default?

I expect a constant, highest-available MHz rate on all cores (Turbo mode always on). This affects whole-system performance.
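
To double-check that the change applied to every core (note the echo above is not persistent across reboots, so it would have to be re-applied at boot):

# Count how many cores are on each governor; all should report "performance"
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c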

6. Don't allow the CPU to enter deep sleep states:
    echo 0 > /sys/module/intel_idle/parameters/max_cstate
    (if you get Permission denied, edit /etc/default/grub, add intel_idle.max_cstate=0 to GRUB_CMDLINE_LINUX_DEFAULT and run update-grub2)

Coming back from deep sleep costs the CPU a little work, so we don't allow the CPU to sleep deeply.
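
To see which idle states the kernel currently exposes and how often they are entered, the cpuidle sysfs tree can be inspected (cpu0 used as an example):

# List the idle states available to CPU 0
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
# Show how many times each state has been entered; deep states should stay near zero after the change
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/usage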

7. Turning off TCP Selective Acknowledgments (SACK):
    echo "net.ipv4.tcp_sack=0" >> /etc/sysctl.d/libreqos.conf
    Then run: sysctl --system

8. Change TCP memory usage from the defaults:
    append the contents below to the end of the /etc/sysctl.d/libreqos.conf file

net.core.rmem_max=16777216 
net.core.wmem_max=16777216 
net.core.rmem_default=16777216 
net.core.wmem_default=16777216 
net.core.optmem_max=16777216 
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216

and then run: sysctl --system

More throughput, more memory needed for traffic.

9. Use the host CPU type in the hypervisor
    (e.g. if you are running the VM on Proxmox, set the CPU type from "Common KVM processor" to "host")

This is important if you use LibreQoS in a VM on Proxmox - the difference is ~4-6% less CPU usage
(the instruction set of the host CPU is much wider than that of the common KVM CPU).
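
If the change is scripted rather than done in the Proxmox web UI, the equivalent on the Proxmox host should be something like this (100 is a placeholder VM ID):

# Switch the VM's CPU type to "host" so the guest sees the full host instruction set
qm set 100 --cpu host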

10. Turn off mitigations:
    Add the options below to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
    noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off
    then run update-grub2

Mitigations also affect CPU performance - turning them off also gives you a few percent less CPU usage.
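
After rebooting you can confirm which mitigations are actually disabled:

# Each vulnerability file reports whether its mitigation is active or "Vulnerable"
grep . /sys/devices/system/cpu/vulnerabilities/*
# The kernel command line actually in use
cat /proc/cmdline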


interduo commented Sep 13, 2022

Maybe @dtaht is one of the competent people who could help estimate and recommend these settings for better tuning with no impact on bufferbloat? (I am a fan of his bufferbloat project!)

Other people on this project, like
@richb-hanover @Belyivulk @tohojo @mjstekel @syadnom @mwangi-kiboi @axonxorz @thebracket @marsalans @micron10, might also be interested in testing these tuning settings?

@Belyivulk

Could you please elaborate on what impact you expect each change to make and why (particularly around the NIC driver settings and Linux changes)?


tohojo commented Sep 18, 2022

Most of these seem relatively pointless, and a few can be actively harmful. Specifically:

  1. Change Queue NIC Adapter Size to maximum

This can help with specific performance problems on certain setups, but will mostly make no difference. Would not recommend as a default.

3. Making bigger backlog:

The specific recommended settings will most likely not make any difference, as the qlen is not actually used by any of the newer qdiscs. If they did do anything, they would actively harm things (bigger queues are not better!), so I would not recommend setting this.

4. Turn off adaptive coalescence (and set those recommended values)

Not sure about this one; I think it's driver-specific how well the adaptive coalescing works. There may be an argument for consistent latency, but I'm not sure it's a good idea to set those values as defaults without backing that by specific benchmarking.

5. I think that Ubuntu doesn't set the CPU scaling_governor to performance by default, so do that with:

and

6. Don't allow CPU to enter deep sleep states

These can achieve better performance, but it's a trade-off against power usage. Not sure changing the default is a good idea.

7. Turning off TCP Selective Acknowledgments (SACK).

This is actively a bad idea! Doesn't actually achieve anything for forwarded traffic, though, so in that sense it's relatively harmless, but in general SACKs are a good thing!

8. Change TCP memory usage from defaults:

Doesn't affect forwarded traffic either, so relatively pointless.

9. Use host CPU type in Hypervisor

This is probably the only thing on this list that is a benefit with no downsides. However, it'll obviously depend on which host type CPU is actually running, so I'm not sure it's something that can be done as a default?

10. Turn off [spectre] mitigations:

This will indeed improve performance. However, the spectre mitigations are there for a reason, so they should be turned off only if the box doesn't run any other applications. Not sure I'd consider this safe as a default setting...


dtaht commented Sep 18, 2022

I pretty much agree with all of toke's comments. SACKs are really important. Also, very few of these parameters are needed for a "router". However, I do have a couple of suggestions for the ISP server side in general:

In a cake and fq_codel dominated network, enabling ECN can help. Although the IETF in its infinite wisdom has partially deprecated RFC3168-style ECN, it still reduces packet loss and retransmits when used. Both the client and server need to set

sudo sysctl -w net.ipv4.tcp_ecn=1

for this to happen; with cake/fq-codel/fq-pie in the middle, it will mark instead of drop.

Somewhat related to that is that monitoring tools need to measure both drop and marks going forward.

Secondly, we are generally seeing a HUGE benefit from setting TCP_NOTSENT_LOWAT to a fairly low value, especially in containers and Kubernetes. I'm seeing 32K used for short RTTs and low bandwidths. The default is infinite....
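
In sysctl terms, a minimal sketch of what that maps to on a server (the 32K value is the one mentioned above and should be tuned per deployment):

# Negotiate ECN on TCP connections (both endpoints need this)
sysctl -w net.ipv4.tcp_ecn=1
# System-wide equivalent of the TCP_NOTSENT_LOWAT socket option, in bytes
sysctl -w net.ipv4.tcp_notsent_lowat=32768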


dtaht commented Sep 18, 2022

In general, I have found that power states are just hell on real-time performance, and cause all kinds of anomalies, so if you can afford the bill, and are properly cooled, go with the performance governor.


dtaht commented Sep 18, 2022

I also tend to think that the largest RX queues possible - especially in virtualized environments at >10GigE - are probably a better idea than @tohojo thinks. But I would prefer to measure.


interduo commented Sep 19, 2022

@tohojo could you elaborate on:

This can help with specific performance problems on certain setups, but will mostly make no difference. Would not recommend as a default.

All the manuals/tuning tips I found for 40G network NICs contain this recommendation - in short: set the maximum TX/RX descriptors, ethtool -G ethN rx 8192 tx 8192. The reason the default value is so low is that some motherboard-integrated NICs support a smaller number of TX/RX descriptors.

for example: http://patchwork.ozlabs.org/project/netdev/patch/20140514141748.20309.83121.stgit@dragon/

Not sure about this one; I think it's driver-specific how well the adaptive coalescing works. There may be an argument for consistent latency, but I'm not sure it's a good idea to set those values as defaults without backing that by specific benchmarking.

What is more important for you:

  • constant latency - turn off adaptive coalescence and set a constant usec value,
  • throughput - turn on adaptive coalescence (or turn it off and set high values for its parameters, rx-usecs 62 and tx-usecs 122)

There is a very good paper for that here: https://01.org/linux-interrupt-moderation

use host CPU type in Hypervisor
This is probably the only thing on this list that is a benefit with no downsides. However, it'll obviously depend on which host type CPU is actually running, so I'm not sure it's something that can be done as a default?

This is for Proxmox KVM tuning - if you use LibreQoS as a VM, change the processor type from "Common KVM processor" to "host" and performance improves by >5%.

tohojo commented Sep 19, 2022

@tohojo could you elaborate on:

This can help with specific performance problems on certain setups, but will mostly make no difference. Would not recommend as a default.

All the manuals/tuning tips I found for 40G network NICs contain this recommendation - in short: set the maximum TX/RX descriptors, ethtool -G ethN rx 8192 tx 8192. The reason the default value is so low is that some motherboard-integrated NICs support a smaller number of TX/RX descriptors.

for example: http://patchwork.ozlabs.org/project/netdev/patch/20140514141748.20309.83121.stgit@dragon/

If you look at the discussion from that link, the conclusion was that increasing TX ring sizes was not actually a good idea. The TX ring is just a dumb FIFO on the NIC, you want that to be as small as possible (but it shouldn't run empty). Mostly, BQL will take care of this just fine on its own.

For the RX ring, there can be some cases where increasing the ring size makes sense. Specifically, on some VM setups, the VM scheduling delay can be long enough that the NIC RX ring fills up before the kernel gets a chance to empty it out. In this case, increasing the RX ring size can indeed help. That's what I meant with "in specific cases".
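
(For anyone who wants to watch BQL doing this, the per-queue limits are exposed in sysfs - eth0/tx-0 here is just an example queue:)

# Current BQL limit, its ceiling, and bytes in flight for the first TX queue
grep . /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*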

Not sure about this one; I think it's driver-specific how well the adaptive coalescing works. There may be an argument for consistent latency, but I'm not sure it's a good idea to set those values as defaults without backing that by specific benchmarking.

What is more important for You:

* constant latency - turn off adaptive coalescence and set a constant usec value,
* throughput - turn on adaptive coalescence (or turn it off and set high values for its parameters, `rx-usecs 62 and tx-usecs 122`)

There is a very good paper for that here: https://01.org/linux-interrupt-moderation

That document lists different trade-offs. You (arbitrarily) picked the "max throughput" setting, which may or may not be what you want for a specific deployment. My point is that this is something you can tune depending on how you run things, not something you set as a default that works well for all cases.

use host CPU type in Hypervisor
This is probably the only thing on this list that is a benefit with no downsides. However, it'll obviously depend on which host type CPU is actually running, so I'm not sure it's something that can be done as a default?

This is for Proxmox KVM tuning - if you use LibreQoS as a VM, change the processor type from "Common KVM processor" to "host" and performance improves by >5%.

Right. Seems like a bug in Proxmox that it doesn't do this by default?


interduo commented Sep 19, 2022

Right. Seems like a bug in Proxmox that it doesn't do this by default?

No - because of the live-migration function between hosts with different CPUs; that's why the default setting is the virtualized CPU called "Common KVM CPU".


rchac commented Oct 20, 2022

Note to people coming across this: probably avoid this one

ethtool -C netdev0 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122
ethtool -C netdev1 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122

#126 (comment)

@interduo

In my case this helped to get more performance from our setup. In the mentioned issue I asked about the results. I am curious why they are different from mine.


rchac commented Oct 20, 2022

Question for @dtaht and @tohojo

In the Wiki and setup guide - I have been advising operators to set up a bash script that runs on each boot to disable some NIC offloading features. The script contains:

#!/bin/sh
ethtool --offload eth1 gso off tso off lro off sg off gro off
ethtool --offload eth2 gso off tso off lro off sg off gro off

where eth1 and eth2 correspond to the shaping interfaces.

These are the only NIC configuration changes officially recommended in the Wiki, and are what most LibreQoS operators are working with currently. I recommended disabling these offloads based on the understanding that some of them break XDP. I couldn't find official XDP documentation on exactly which of these offloads need to be disabled. The ones listed above are pieced together based on various guides and threads about XDP found online.

I want to make sure these disabled offloading features are ones we actually need to be disabling, and see if there are others I failed to consider. Any recommendations here?
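
One easy sanity check is to read the offload state back after the script has run, e.g.:

# Confirm the relevant offloads are reported as "off" (feature names as printed by ethtool -k)
ethtool -k eth1 | grep -E 'segmentation-offload|scatter-gather|receive-offload'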


dtaht commented Nov 13, 2022

The rightest answers for this need to be collected into the doc...

@dtaht dtaht added this to the v1.3 milestone Nov 13, 2022
@dtaht dtaht modified the milestones: v1.3, v1.4 Dec 7, 2022

rchac commented Feb 5, 2023

Closed as stale. But if still needed please reopen.

@rchac rchac closed this as not planned Feb 5, 2023
@interduo interduo reopened this Feb 5, 2023

interduo commented Feb 5, 2023

I will reopen, add some things to the wiki, then close.

@rchac rchac closed this as completed Mar 24, 2023