
performance tuning - some improvements #96

Closed · interduo opened this issue Sep 13, 2022 · 15 comments

interduo commented Sep 13, 2022

@rchac could you check, on some medium/big setups, the impact on the following?

  • CPU usage with this performance tuning (look at si and sy in top)
  • total router performance

1. Change the NIC adapter queue (ring) size to the maximum (you can check the maximum using ethtool --show-ring netdev0):
ethtool -G netdev0 rx 8192 tx 8192
ethtool -G netdev1 rx 8192 tx 8192

I expect here: more throughput and less CPU usage, because the CPU can get more work done in a single interrupt.
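
As a side note, since the supported maximum differs per driver, one could read the pre-set maximums from ethtool -g and apply whatever the NIC reports instead of hard-coding 8192. A minimal sketch (netdev0 is a placeholder interface name; the awk parsing assumes the usual ethtool -g output layout):

IFACE=netdev0
# Parse the "Pre-set maximums" section of ethtool -g for the RX and TX limits
RX_MAX=$(ethtool -g "$IFACE" | awk '/Pre-set maximums/{f=1} f && /^RX:/{print $2; exit}')
TX_MAX=$(ethtool -g "$IFACE" | awk '/Pre-set maximums/{f=1} f && /^TX:/{print $2; exit}')
# Apply the driver-reported maximum ring sizes
ethtool -G "$IFACE" rx "$RX_MAX" tx "$TX_MAX"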

2. Changing **congestion control**: `sysctl -w net.ipv4.tcp_congestion_control=htcp`

This is a sender-side parameter, and in LibreQoS we do L2 bridging.
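
For completeness, a hedged sketch of checking which algorithms the kernel offers and persisting the change (the /etc/sysctl.d/libreqos.conf path follows the convention used later in this issue; htcp is usually built as a module):

# See which congestion control algorithms are currently available
sysctl net.ipv4.tcp_available_congestion_control
# Load the htcp module if it is not already loaded
modprobe tcp_htcp
# Persist the setting so it survives reboots
echo "net.ipv4.tcp_congestion_control=htcp" >> /etc/sysctl.d/libreqos.conf
sysctl --system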

3. Making the backlog bigger:
ip link set netdev0 qlen 10000
ip link set netdev1 qlen 10000
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_max_backlog=300000

I expect here: more throughput and less CPU usage, because the CPU should get more work done in one interrupt and can do it more smoothly, since it can drain the backlog out of the standard sequence.

4. Turn off adaptive coalescence (and set these recommended values):
ethtool -C netdev0 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122
ethtool -C netdev1 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122

Here I expect more throughput and constant latency (the defaults are: adaptive on, tx-usecs: 8, tx-frames: 128), so with the defaults the latency could jitter under bigger load.
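
It may be worth recording the driver defaults before changing anything, e.g. (output format is driver-dependent):

# Show the current interrupt coalescing settings for both shaping interfaces
ethtool -c netdev0
ethtool -c netdev1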

5. I think that Ubuntu doesn't set the CPU scaling_governor to performance by default, so do that with:
    echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    What do you get in your OSes by default?

I expect a constant, highest-available MHz rate on all cores (Turbo mode always on). This affects whole-system performance.
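
To double-check that the change applied to every core (note the echo above is not persistent across reboots, so it would have to be re-applied at boot):

# Count how many cores are on each governor; all should report "performance"
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c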

6. Don't allow the CPU to enter deep sleep states:
    echo 0 > /sys/module/intel_idle/parameters/max_cstate
    (if you get Permission denied, edit /etc/default/grub, add intel_idle.max_cstate=0 to GRUB_CMDLINE_LINUX_DEFAULT and run update-grub2)

Coming back from deep sleep costs the CPU a little work, so we don't allow the CPU to sleep deeply.
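
To see which idle states the kernel currently exposes and how often they are entered, the cpuidle sysfs tree can be inspected (cpu0 used as an example):

# List the idle states available to CPU 0
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
# Show how many times each state has been entered; deep states should stay near zero after the change
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/usage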

7. Turning off TCP Selective Acknowledgments (SACK):
    echo "net.ipv4.tcp_sack=0" >> /etc/sysctl.d/libreqos.conf
    Then run: sysctl --system

8. Change TCP memory usage from the defaults:
    append the contents below to the end of the /etc/sysctl.d/libreqos.conf file

net.core.rmem_max=16777216 
net.core.wmem_max=16777216 
net.core.rmem_default=16777216 
net.core.wmem_default=16777216 
net.core.optmem_max=16777216 
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216

and then run: sysctl --system

More throughput, more memory needed for traffic.

9. Use the host CPU type in the hypervisor
    (e.g. if you are running the VM on Proxmox, set the CPU type from "Common KVM processor" to "host")

This is important if you use LibreQoS in a VM on Proxmox - the difference is ~4-6% less CPU usage
(the instruction set of the host CPU is much wider than that of the common KVM CPU).
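
If the change is scripted rather than done in the Proxmox web UI, the equivalent on the Proxmox host should be something like this (100 is a placeholder VM ID):

# Switch the VM's CPU type to "host" so the guest sees the full host instruction set
qm set 100 --cpu host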

10. Turn off mitigations:
    Add the options below to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
    noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off
    then run update-grub2

Mitigations also affect CPU performance - turning them off also gives you a few percent less CPU usage.
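
After rebooting you can confirm which mitigations are actually disabled:

# Each vulnerability file reports whether its mitigation is active or "Vulnerable"
grep . /sys/devices/system/cpu/vulnerabilities/*
# The kernel command line actually in use
cat /proc/cmdline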


interduo commented Sep 13, 2022

Maybe @dtaht is one of the competent people who could help estimate and recommend these settings for better tuning with no impact on bufferbloat? (I am a fan of his bufferbloat project!)

Other people on this project, like
@richb-hanover @Belyivulk @tohojo @mjstekel @syadnom @mwangi-kiboi @axonxorz @thebracket @marsalans @micron10, might also be interested in testing these tuning settings?

@Belyivulk

Could you please elaborate on what impact you expect each change to make and why (particularly around the NIC driver settings and Linux changes)?


tohojo commented Sep 18, 2022

Most of these seem relatively pointless, and a few can be actively harmful. Specifically:

  1. Change Queue NIC Adapter Size to maximum

This can help with specific performance problems on certain setups, but will mostly make no difference. Would not recommend as a default.

3. Making bigger backlog:

The specific recommended settings will most likely not make any difference, as the qlen is not actually used by any of the newer qdiscs. If they did do anything, they would actively harm things (bigger queues are not better!), so I would not recommend setting this.

4. Turn off adaptive coalescence (and set those recommended values)

Not sure about this one; I think it's driver-specific how well the adaptive coalescing works. There may be an argument for consistent latency, but I'm not sure it's a good idea to set those values as defaults without backing that by specific benchmarking.

5. I think that Ubuntu doesn't set the CPU scaling_governor to performance by default, so do that with:

and

6. Don't allow CPU to enter deep sleep states

These can achieve better performance, but it's a trade-off against power usage. Not sure changing the default is a good idea.

7. Turning off TCP Selective Acknowledgments (SACK).

This is actively a bad idea! Doesn't actually achieve anything for forwarded traffic, though, so in that sense it's relatively harmless, but in general SACKs are a good thing!

8. Change TCP memory usage from defaults:

Doesn't affect forwarded traffic either, so relatively pointless.

9. Use host CPU type in Hypervisor

This is probably the only thing on this list that is a benefit with no downsides. However, it'll obviously depend on which host type CPU is actually running, so I'm not sure it's something that can be done as a default?

10. Turn off [spectre] mitigations:

This will indeed improve performance. However, the spectre mitigations are there for a reason, so they should be turned off only if the box doesn't run any other applications. Not sure I'd consider this safe as a default setting...


dtaht commented Sep 18, 2022

I pretty much agree with all of toke's comments. SACKs are really important. Also, very few of these parameters are needed for a "router". However, I do have a couple of suggestions for the ISP server side in general:

In a cake and fq_codel dominated network, enabling ECN can help. Although the IETF in its infinite wisdom has partially deprecated RFC3168-style ECN, it still reduces packet loss and retransmits when used. Both the client and server need to set

sudo sysctl -w net.ipv4.tcp_ecn=1

for this to happen; with cake/fq-codel/fq-pie in the middle, it will mark instead of drop.

Somewhat related to that is that monitoring tools need to measure both drop and marks going forward.

Secondly, we are generally seeing a HUGE benefit from setting TCP_NOTSENT_LOWAT to a fairly low value, especially in containers and Kubernetes. I'm seeing 32K used for short RTTs and low bandwidths. The default is infinite....
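
In sysctl terms, a minimal sketch of what that maps to on a server (the 32K value is the one mentioned above and should be tuned per deployment):

# Negotiate ECN on TCP connections (both endpoints need this)
sysctl -w net.ipv4.tcp_ecn=1
# System-wide equivalent of the TCP_NOTSENT_LOWAT socket option, in bytes
sysctl -w net.ipv4.tcp_notsent_lowat=32768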


dtaht commented Sep 18, 2022

In general, I have found that power states are just hell on real-time performance, and cause all kinds of anomalies, so if you can afford the bill, and are properly cooled, go with the performance governor.


dtaht commented Sep 18, 2022

I also tend to think that the largest RX queues possible - especially in virtualized environments at >10GigE - are probably a better idea than @tohojo thinks. But I would prefer to measure.


interduo commented Sep 19, 2022

@tohojo could you elaborate on:

This can help with specific performance problems on certain setups, but will mostly make no difference. Would not recommend as a default.

All the manuals/tuning tips I found for 40G network NICs contain this recommendation - in short: set the maximum TX/RX descriptors, ethtool -G ethN rx 8192 tx 8192. The reason the default value is so low is that some motherboard-integrated NICs support a smaller number of TX/RX descriptors.

for example: http://patchwork.ozlabs.org/project/netdev/patch/20140514141748.20309.83121.stgit@dragon/

Not sure about this one; I think it's driver-specific how well the adaptive coalescing works. There may be an argument for consistent latency, but I'm not sure it's a good idea to set those values as defaults without backing that by specific benchmarking.

What is more important for you:

  • constant latency - turn off adaptive coalescence and set a constant usec value,
  • throughput - turn on adaptive coalescence (or turn it off and set high values for its parameters, rx-usecs 62 and tx-usecs 122)

There is a very good paper for that here: https://01.org/linux-interrupt-moderation

use host CPU type in Hypervisor
This is probably the only thing on this list that is a benefit with no downsides. However, it'll obviously depend on which host type CPU is actually running, so I'm not sure it's something that can be done as a default?

This is for Proxmox KVM tuning - if you use LibreQoS as a VM, change the processor type from "Common KVM processor" to "host" and performance improves by >5%.

tohojo commented Sep 19, 2022

@tohojo could you elaborate on:

This can help with specific performance problems on certain setups, but will mostly make no difference. Would not recommend as a default.

All the manuals/tuning tips I found for 40G network NICs contain this recommendation - in short: set the maximum TX/RX descriptors, ethtool -G ethN rx 8192 tx 8192. The reason the default value is so low is that some motherboard-integrated NICs support a smaller number of TX/RX descriptors.

for example: http://patchwork.ozlabs.org/project/netdev/patch/20140514141748.20309.83121.stgit@dragon/

If you look at the discussion from that link, the conclusion was that increasing TX ring sizes was not actually a good idea. The TX ring is just a dumb FIFO on the NIC, you want that to be as small as possible (but it shouldn't run empty). Mostly, BQL will take care of this just fine on its own.

For the RX ring, there can be some cases where increasing the ring size makes sense. Specifically, on some VM setups, the VM scheduling delay can be long enough that the NIC RX ring fills up before the kernel gets a chance to empty it out. In this case, increasing the RX ring size can indeed help. That's what I meant with "in specific cases".
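
(For anyone who wants to watch BQL doing this, the per-queue limits are exposed in sysfs - eth0/tx-0 here is just an example queue:)

# Current BQL limit, its ceiling, and bytes in flight for the first TX queue
grep . /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*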

Not sure about this one; I think it's driver-specific how well the adaptive coalescing works. There may be an argument for consistent latency, but I'm not sure it's a good idea to set those values as defaults without backing that by specific benchmarking.

What is more important for You:

* constant latency - turn off adaptive coalescence and set a constant usec value,
* throughput - turn on adaptive coalescence (or turn it off and set high values for its parameters, `rx-usecs 62 and tx-usecs 122`)

There is a very good paper for that here: https://01.org/linux-interrupt-moderation

That document lists different trade-offs. You (arbitrarily) picked the "max throughput" setting, which may or may not be what you want for a specific deployment. My point is that this is something you can tune depending on how you run things, not something you set as a default that works well for all cases.

use host CPU type in Hypervisor
This is probably the only thing on this list that is a benefit with no downsides. However, it'll obviously depend on which host type CPU is actually running, so I'm not sure it's something that can be done as a default?

This is for Proxmox KVM tuning - if you use LibreQoS as a VM, change the processor type from "Common KVM processor" to "host" and performance improves by >5%.

Right. Seems like a bug in Proxmox that it doesn't do this by default?


interduo commented Sep 19, 2022

Right. Seems like a bug in Proxmox that it doesn't do this by default?

No - because of the live-migration function between hosts with different CPUs; that's why the default setting is the virtualized CPU called "Common KVM CPU".


rchac commented Oct 20, 2022

Note to people coming across this: probably avoid this one

ethtool -C netdev0 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122
ethtool -C netdev1 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 122

#126 (comment)

@interduo

In my case this helped to get more performance from our setup. In the mentioned issue I asked about the results. I am curious why they are different from mine.


rchac commented Oct 20, 2022

Question for @dtaht and @tohojo

In the Wiki and setup guide - I have been advising operators to set up a bash script that runs on each boot to disable some NIC offloading features. The script contains:

#!/bin/sh
ethtool --offload eth1 gso off tso off lro off sg off gro off
ethtool --offload eth2 gso off tso off lro off sg off gro off

where eth1 and eth2 correspond to the shaping interfaces.

These are the only NIC configuration changes officially recommended in the Wiki, and are what most LibreQoS operators are working with currently. I recommended disabling these offloads based on the understanding that some of them break XDP. I couldn't find official XDP documentation on exactly which of these offloads need to be disabled. The ones listed above are pieced together based on various guides and threads about XDP found online.

I want to make sure these disabled offloading features are ones we actually need to be disabling, and see if there are others I failed to consider. Any recommendations here?
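
One easy sanity check is to read the offload state back after the script has run, e.g.:

# Confirm the relevant offloads are reported as "off" (feature names as printed by ethtool -k)
ethtool -k eth1 | grep -E 'segmentation-offload|scatter-gather|receive-offload'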


dtaht commented Nov 13, 2022

The rightest answers for this need to be collected into the doc...

@dtaht dtaht added this to the v1.3 milestone Nov 13, 2022
@dtaht dtaht modified the milestones: v1.3, v1.4 Dec 7, 2022

rchac commented Feb 5, 2023

Closed as stale. But if still needed please reopen.

@rchac rchac closed this as not planned Feb 5, 2023
@interduo interduo reopened this Feb 5, 2023

interduo commented Feb 5, 2023

I will reopen, add some things to the wiki, then close.

@rchac rchac closed this as completed Mar 24, 2023