cake musings #121

Closed
dtaht opened this issue Sep 26, 2022 · 37 comments

@dtaht
Collaborator

dtaht commented Sep 26, 2022

could also read from standard in. (one day).

#119

@interduo
Collaborator

interduo commented Sep 26, 2022

@dtaht what problem would be solved by getting data from stdin instead of a file?

@dtaht
Collaborator Author

dtaht commented Sep 27, 2022

My understanding of the C code at present is that it takes one shell command per modification. The input could be a file, stdin, whatever, as basically it just has to loop through this:

https://github.com/xdp-project/xdp-cpumap-tc/blob/master/src/xdp_iphash_to_cpu_cmdline.c#L286
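A rough illustration of the "loop over stdin" idea, written in Python rather than the tool's C. The whitespace-separated `ip cpu classid` line format, the `Mapping` type, and `read_mappings` are assumptions made up for this sketch; they are not the actual interface of xdp_iphash_to_cpu_cmdline.

```python
import io
import ipaddress
from typing import Iterator, NamedTuple, TextIO, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


class Mapping(NamedTuple):
    """One hypothetical 'ip -> cpu/classid' modification."""
    ip: IPAddress
    cpu: int
    classid: str


def read_mappings(stream: TextIO) -> Iterator[Mapping]:
    """Yield one Mapping per non-empty line; '#' starts a comment."""
    for raw in stream:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        ip_text, cpu_text, classid = line.split()
        yield Mapping(ipaddress.ip_address(ip_text), int(cpu_text), classid)


# io.StringIO stands in for a pipe here; in real use pass sys.stdin instead.
batch = io.StringIO("100.64.1.1 0 1:5\n# a comment line\n100.64.1.2 1 2:7\n")
mappings = list(read_mappings(batch))
for m in mappings:
    print(m.ip, m.cpu, m.classid)
```

The point of the sketch is only that the same loop works for a file or a pipe, so a whole batch of modifications could be applied by one process instead of one shell command each.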

@interduo
Collaborator

So this issue should be created in the xdp-cpumap-tc repository, right?

@dtaht
Collaborator Author

dtaht commented Sep 27, 2022

We are in a situation with multiple disparate tools written in 5 different languages (C,BPF,Rust,Python,shell), flying in loose formation, with each "core" developer skilled in only one or two of those languages. While it certainly makes sense to continue to improve each tool, I find myself longing for somewhat tighter integration, and also to profile where the bottlenecks and pain points are in production. And also, given all the changes that have landed since July, I at least am wanting to freeze things, have a bigger, better test suite, get more users' feedback, etc, etc.

Cake was designed (2014) on a 600 MHz single-core MIPS, and at the time, the HTB locking problem seemed impossible to overcome - and yet here we are, pushing 10+ Gbps! What I think I know about the real bottlenecks and network behavior is completely obsolete, and my bedtime reading has been "Understanding Software Dynamics", Richard Sites's new book.

10 Gbps is probably enough for a LOT of smaller ISPs, and I'm reminded of the story of Boeing, who designed an airplane that could fly farther and faster and was more beautiful. When they presented an early 787 prototype to the airlines, the airlines didn't care - they mostly wanted something that could load and unload faster, which Boeing had not considered at all.

@interduo
Collaborator

Yeah, I started with a 36,6k modem in ~1999. For me a speed of 40 Gbit is awesome now.

@interduo
Collaborator

We are in a situation with multiple disparate tools written in 5 different languages (C,BPF,Rust,Python,shell), flying in loose formation, with each "core" developer skilled in only one or two of those languages.

Could you describe the problem you have with this now, and your recommendation for resolving it?

Well... I think the architecture is fairly clean right now. Every part plays its own role. The main advantage is that, if a bottleneck appears, you could replace XDP with DPDK or VPP in the future, or XDP with NFT, without complications and without totally rewriting the LQoS code.

And also, given all the changes that have landed since July, I at least am wanting to freeze things, have a bigger, better test suite, get more users' feedback, etc, etc.

v1.2 was released a few hours ago - this release got some primary features and was tested much more deeply by us.

could also read from standard in. (one day).

I didn't read between the lines that you would like to do a bulk/batch insert here. I thought you only wanted to get data from STDIN.

I ran into the same thing, so I opened an issue in the project responsible for that part of the machinery: xdp-project/xdp-cpumap-tc#8

@dtaht
Collaborator Author

dtaht commented Sep 28, 2022

+1. I am very happy to hear you've been testing extensively! What I was trying to describe was having some more automated tests and environments, where (selfishly, I guess), I could take apart and trace where the bottlenecks were nowadays. There are now a few features to cake I'd like to explore adding, as well. Ideally a flent suite transiting a couple containers.

@dtaht
Collaborator Author

dtaht commented Sep 28, 2022

And yes, being able to one day target dpdk/vpp is on my mind, as well as being able to output rules compatible with more cpe, such as openwrt and mikrotik, presently.

@interduo
Collaborator

@dtaht does cake have any repo on github?

@dtaht
Collaborator Author

dtaht commented Sep 28, 2022

our current out of tree version is here: https://github.com/dtaht/sch_cake/

There are a couple other branches floating about, including one with SCE support, and another with L4S.

@dtaht
Collaborator Author

dtaht commented Sep 28, 2022

@tohojo One example of something I'd been meaning to get on was to add DROP_REASON support to the linux qdiscs in general (congestive, overflow), as this is providing a useful tracepoint in many other subsystems today. Example: https://lwn.net/Articles/895346/ - another big idea is hooking the ack-filter code selectively to be able to do pping analysis within the kernel. The SCE debate was lost, and I don't know whether or not L4S will ever get off the ground - too many theoretical problems, IMHO. Not sure, besides "speed" what else could be added to cake.

There's another case I've been thinking about in the "emulated AP" case, where we have a bunch of cake instances for customers, and one for the AP, in that, carrying over the existing codel state from the customer qdiscs to the AP, would make a more intelligent drop decision should the AP also get backlogged. I think that the skb->cb area is untouched transiting cake -> htb -> cake, but I'd have to check, and that's technically a layer violation.

Lastly it makes sense to utilize the skb's hardware timestamp generated as early as possible (e.g on the xdp rx ring), rather than timestamp once it hits the qdisc. It's always made sense to do that, but requires hardware that can, and we've not tried to do that generally, since most don't.

There are a few other cake ideas over here: https://github.com/dtaht/sch_cake/issues

@interduo
Collaborator

interduo commented Sep 28, 2022

@dtaht wait until Trie support in xdp-cpumap-tc is merged into master, then do the next round of testing. This is a serious bottleneck now (and the only regression from our earlier QoS scripts).

@dtaht
Collaborator Author

dtaht commented Sep 28, 2022

Better explanation of DROP_REASON: https://lwn.net/Articles/885729/

@dtaht dtaht changed the title from "xdp setup code" to "cake musings" Sep 28, 2022
@dtaht
Collaborator Author

dtaht commented Sep 28, 2022

The other thing that we don't track well is packet rescheduling, the fair queuing bit, which in my opinion is far, far, far more relevant than drops. 10,000,000,000 packets rescheduled would be a better indicator of what that's doing.

@dtaht
Collaborator Author

dtaht commented Sep 28, 2022

@interduo are you saying the trie is slower?

@interduo
Collaborator

Implementing Trie in xdp-cpumap-tc will boost a LibreQoS machine by some performance factor N (N depends on how many IPs and how many cores you have).

With a Trie you have a maximum of 5 probes (IPv4). Without a Trie, the average number of probes can be estimated roughly as: (count of all IPs in ShapedDevices.csv) / 2 / (CPU core count).

Just look at my top now:
IMG_20220928_230722.jpg

In a few days I will paste a new one.

@dtaht
Collaborator Author

dtaht commented Sep 28, 2022

wow. That's still quite a lot of cpu. Somewhere in some thread (forgive me for forgetting which)? you'd posted more details about your network?

I note that before I got involved here, I was mostly looking at some DPU box, with a FPGA, to do much of this work. The nvidia one was about 2k, and claimed to do 100Gbit.

@tohojo

tohojo commented Sep 29, 2022 via email

@tohojo

tohojo commented Sep 29, 2022 via email

@interduo
Collaborator

interduo commented Sep 29, 2022

Why? This will really depend on your network; if you're creating rules on a subnet basis, then yeah, the trie lookup will allow you to collapse lots of rules into one (which may or may not improve forwarding performance depending on the number of rules and the size of the hash table).

@tohojo If I understand it correctly, the average probe count will be:
NOW: (count of circuits / CPU cores / 2), e.g. 5000/20/2 = 125 probes,
TRIE: 4 probes maximum.
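The estimate above can be written out as a small calculation. This is only the back-of-the-envelope model from this comment (rules split evenly across cores, then scanned linearly, with a match found on average halfway through); the function name and the constant are made up for illustration, and the trie figure is simply the one quoted here.

```python
def average_linear_probes(circuits: int, cpu_cores: int) -> float:
    """Average rule tests per packet under the even-split, linear-scan model."""
    return circuits / cpu_cores / 2


# Upper bound for the trie case, as quoted in the comment (not measured).
TRIE_MAX_PROBES_IPV4 = 4

linear = average_linear_probes(circuits=5000, cpu_cores=20)
print(linear)                         # 125.0
print(linear / TRIE_MAX_PROBES_IPV4)  # 31.25, i.e. ~31x fewer tests in this model
```

Fewer probes does not translate directly into the same CPU saving, since each probe type has a different cost; the numbers only compare how many tests each scheme performs.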

However, if your rules are for small subnets (e.g., single IPs), the trie-based lookup is not likely to improve anything; not sure if it'll make it worse either...

I remember when we implemented trie support in our earlier QoS scripts (CPU usage dropped by ~20%).
How much the CPU usage will drop here, time will tell.

https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.adv-filter.hashing.html

@tohojo

tohojo commented Sep 29, 2022 via email

@interduo
Collaborator

One probe means a single test in the rule set (in the firewall, tc filtering, iptables marking - depending on what you use for QoS filtering). There are two examples in the link which I pasted earlier (in the kernel stack, not XDP).

This will really depend on your network;

No big ISP has one single large subnet across its whole network, especially when using plain L2 Ethernet rather than PPPoE.

@dtaht
Collaborator Author

dtaht commented Sep 29, 2022

another big idea is hooking the ack-filter code selectively to be able to do pping analysis within the kernel.
Why? What's the use case, and why not just use something like epping? Putting random monitoring stuff into a qdisc does not sound like a good fit...

epping is not scaling past a gbit. Being able to monitor rtt for a select group of qdiscs, would be nice.

@dtaht
Collaborator Author

dtaht commented Sep 29, 2022

Lastly it makes sense to utilize the skb's hardware timestamp generated as early as possible (e.g on the xdp rx ring), rather than timestamp once it hits the qdisc. It's always made sense to do that, but requires hardware that can, and we've not tried to do that generally, since most don't.
While in principle you're right that an earlier timestamp is "better", I doubt it'll make any material difference on the timescales Codel reacts on (by default, anyway). When you're reacting to queuing in the several-millisecond range I don't think the handful of microseconds of processing the stack does before the packet hits qdisc_enqueue() is going to make much difference.

Talking past each other a bit -

  1. libreqos is kind of an appliance, where making hardware specific optimizations would help.

  2. I was thinking aloud, (and badly) about tying the ap-emulation to the customer qdisc in one qdisc - ubercake

  3. and I was also trying to save "time" as hardware can get a timestamp in 0ns, getns call is pretty expensive in comparison.

I can be convinced otherwise by benchmarks, of course :)

So do I. The thing that I really really really really want is some good ole-fashioned TCAM on the bloody ethernet card. But knowing what the bottlenecks and pain points really are to get past 40Gbit would be nicer still.

@interduo
Collaborator

interduo commented Sep 29, 2022

@tohojo I know that there are some situations where the Trie algorithm will be slower - but those are edge cases in big networks.

...if you don't operate a big network (> 200 hosts), you probably don't need LibreQoS. It would be sufficient to do all the routing/firewall/QoS things in one big black box (MikroTik or some Linux), and it would also be better to use only the kernel network stack (without touching XDP/DPDK/VPP), because there are fewer problems to solve.

@thebracket
Collaborator

Trie vs. Hash search complexity.

I'd expect Trie to be slightly slower in the specific 1:1 match case. A hashmap lookup has a fixed cost: hashing the IP address (not sure what scheme it uses), check if the resulting index is occupied, and either return a pointer or a null (no match). A trie starts at the top of the tree and picks the most-specific child on each lookup; there's no hash operation, but there are some compares - which very quickly zero in on a result.
Where it gets interesting is your CPU memory cache. If either structure fits in cache, then access is effectively free - so you're really just paying for the hash function. So it may well be the case that the two differ in performance based on the size of the stored data (and Trie lets you store less, if you have subnet matches)

Trie also brings to the table things you can't reasonably do with a hashmap - subnet matching (which is nice for IPv4 and essential for IPv6). So the potential performance penalty is paid for with functionality - I hope!
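To make the comparison concrete, here is a toy sketch of the two lookup styles in Python. It is not the kernel's LPM trie nor the xdp-cpumap-tc code; `BitTrie`, the one-node-per-bit layout, and the example prefixes and labels are all assumptions chosen for illustration. A dict stands in for the hashmap (exact match only), while the trie returns the most specific matching prefix.

```python
import ipaddress
from typing import Dict, Optional


class BitTrie:
    """Toy IPv4 binary trie: one level per prefix bit, longest match wins."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, cidr: str, value: str) -> None:
        net = ipaddress.ip_network(cidr)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1
            node = node.setdefault(bit, {})
        node["value"] = value

    def lookup(self, ip: str) -> Optional[str]:
        bits = int(ipaddress.ip_address(ip))
        node = self.root
        best = node.get("value")  # a /0 default route, if one was inserted
        for i in range(32):
            node = node.get((bits >> (31 - i)) & 1)
            if node is None:
                break
            best = node.get("value", best)  # remember the deepest match so far
        return best


trie = BitTrie()
trie.insert("100.64.0.0/16", "site-a")
trie.insert("100.64.1.0/24", "tower-1")
trie.insert("100.64.1.2/32", "customer-42")

exact: Dict[str, str] = {"100.64.1.2": "customer-42"}  # hashmap: 1:1 matches only

print(trie.lookup("100.64.1.2"))  # customer-42 (most specific entry)
print(trie.lookup("100.64.1.9"))  # tower-1 (falls back to the /24)
print(exact.get("100.64.1.9"))    # None (no subnet matching in a plain hashmap)
```

A real implementation would compress runs of bits rather than walk one node per bit, which is why the number of compares stays small in practice.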

Uber Cake

I love that name. I want to eat an uber cake. Anyway, I've been pondering this - and I can see some potential advantages to moving some of the logic into cpumap (the XDP part). The XDP program already knows which IP addresses you care about, and could give output pre-matched to queues (for easy parsing of results). It also already parses much of the IP header - which is redundant work if pping is doing the same thing. I could see a split "kernel space gathers data, userspace mangles it" setup being useful. The stack limits for XDP programs are pretty brutal, so we can't cram too much in there.

However, my big concern is that if cpumap becomes busier - that's adding latency to the overall setup, which is exactly the opposite of our general goal. Running pping in userspace and passively sniffing passing traffic loads the heck out of a core, but doesn't add latency to the fast packet path.

@rchac
Member

rchac commented Sep 30, 2022

A note regarding PPing resource use - I've been looking at using XDP PPing instead. It may already be possible to load XDP PPing after loading LibreQoS with xdp-loader. When I tried, I was unable to compile XDP PPing. If someone else can pull it off please let me know. It would really optimize tcp latency tracking.

@thebracket
Collaborator

@rchac I just tried building the latest pping on Ubuntu server, and ran into all manner of issues.

I could get it to work with:

sudo apt install pkg-config build-essential llvm libelf-dev libc6-dev-i386 m4 libpcap-dev linux-tools-common
wget https://launchpad.net/ubuntu/+archive/primary/+files/libmnl-dev_1.0.4-3build2_amd64.deb
sudo dpkg -i libmnl-dev_1.0.4-3build2_amd64.deb
git clone https://github.com/xdp-project/bpf-examples.git
cd bpf-examples
./configure
make clean
make pping
sudo ethtool --offload eth0 gso off tso off lro off sg off gro off
cd pping
sudo ./pping --interface eth0 --tcp

That gave me some errors at the beginning:

Starting ePPing in standard mode tracking TCP on eth0
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: Kernel error message: Exclusivity flag on, cannot modify

And then some usable data:

13:17:47.211659897 4.682621 ms 0.019659 ms TCP 172.17.27.200:41962+91.189.91.38:80
13:17:47.229314620 64.847197 ms 41.322040 ms TCP 91.189.91.38:80+172.17.27.200:41962
13:17:47.309967264 2.809065 ms 0.019659 ms TCP 172.17.27.200:41962+91.189.91.38:80
13:17:47.353418561 58.645372 ms 41.322040 ms TCP 91.189.91.38:80+172.17.27.200:41962

(I fired off an apt update to generate traffic)

@thebracket
Collaborator

A very quick and dirty test of the BPF pping is interesting (I haven't got it working with cpumap yet). Test environment: 3 hyper-V VMs (I'm running a Win11 box here; 12th gen i7, 32gb RAM). Two VMs are absolutely minimal Ubuntu servers, setup to talk to one another at 100.64.1.1 and 100.64.1.2. The third has a bridge setup, offloading disabled.

Without pping running, a simple single-stream iperf between the two gets 19.1 Gbits/sec. This leaves the intermediate VM about 10% utilized.

With pping watching the whole thing (sudo ./pping --interface eth1 --tcp), I still get 19.1 Gbits/sec - but I reached about 75% utilization (single CPU).

I'm not at all sure the rate limit/sample limit is working. Changing sample rate to 100ms increased CPU load and reduced throughput. I'd have expected the exact opposite?

@rchac
Member

rchac commented Oct 1, 2022

Wow that is surprising. Maybe we could pin it to the last CPU core or something. I see a bit about the concurrency issues here. I would guess the efficiency is still much better than non-BPF pping though since regular pping seems to choke past 1Gbps.

Do we know what the default rate/limit sample limit is in ms? That is odd it would do that.

@dtaht
Collaborator Author

dtaht commented Oct 1, 2022

This law ( https://en.wikipedia.org/wiki/Gustafson%27s_law ) and its predecessor dominate how I think about much in CS and in life's processes itself. Also, this new book is very good: https://www.amazon.com/Understanding-Software-Addison-Wesley-Professional-Computing/dp/0137589735/

@tohojo

tohojo commented Oct 1, 2022

Ping @simosund

@thebracket
Collaborator

Honestly, eating 75% of one (virtual) CPU for 19 gbit/s is pretty darned good.

Looking at the source code (I may be wrong; I haven't deep-dived yet), rate limiting seems to be applied after parsing the packet, finding the current flow's state (or marking as a new one) - the part that's being rate limited is the storage of events in the shared buffer (with the rate limit being applied to the timestamp of the last event for that flow). It looks like it defaults to a rate limit of 100ms and an "rtt-rate" of 0 - I used 1. So I think I need to rethink how I approach those parameters.

I was kind-of expecting a more general "only sample n% of packets" type of setup (I'm pretty sure that's what Preseem does; upside: it's fast. Downside: sometimes you don't have any data for mostly idle customers).

Looking at it from an integration point-of-view, I haven't tried to use xdp-load or similar to get both it and xdp-cpumap going at once. It should be possible, I'd recommend pping go first because it pretty much always just passes the packet - so you don't have to worry about it colliding with cpumap.

The beginnings of an idea I'll try and test a bit in the coming weeks:

  • If pping were integrated into the classifier (not sure you can chain those, so I may need to do some coding), after cpumap has done its magic then classifier execution is parallelized.
  • Since the packet is coming into the classifier on multiple CPUs, we'd need to use multi-CPU friendly maps. There are some per-cpu map options available for this (I'm 99% sure the current code would race-condition itself to an untimely death; I may be wrong)
  • Since the classifier already parses a fair chunk of the packet, some of the work is already done. I'm not sure how "heavy" the additional parsing and map work will be.
  • I'm pretty sure that adding a counter to each CPU would hurt performance, so the existing rate limiting may well work. Will investigate some kind of limit, maybe as primitive as "kernel time mod X" as a tunable to keep performance under control.
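One possible reading of the "kernel time mod X" idea from the last bullet, sketched in Python rather than BPF. `should_sample` and the parameter name `modulus` are hypothetical; the assumption is that packet arrival timestamps are spread evenly enough that their low digits behave like a cheap random source, so no shared counter is needed.

```python
def should_sample(now_ns: int, modulus: int) -> bool:
    """Keep roughly 1 in `modulus` packets, gated on the packet's timestamp."""
    return now_ns % modulus == 0


# Synthetic, evenly spread timestamps; real arrivals would be burstier.
timestamps = range(0, 1000)
sampled = [t for t in timestamps if should_sample(t, 10)]
print(len(sampled))  # 100 of 1000 packets kept
```

The trade-off is the one already noted for sampling in general: mostly idle customers may produce no samples at all for long stretches.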

Anyway, that's musing more than concrete ideas at this point.

@simosund

simosund commented Oct 3, 2022

Hi, as the developer of eBPF PPing (or ePPing as we like to refer to it internally), I just wanted to clarify a few things about it. First off, I think it's very cool that you are interested in using ePPing and if you run into issues or have ideas for improvements please don't hesitate to reach out to me or raise an issue at bpf-examples. That said, I'm working on ePPing as part of my PhD project, so please realize that it is a bit experimental/rough around the edges, even if @tohojo does an awesome job of reviewing everything that is merged into master. One example of its unpolished state is that documentation is pretty much non-existent at this point, so sorry if ex. the different command-line arguments are unclear.

A note regarding PPing resource use - I've been looking at using XDP PPing instead. It may already be possible to load XDP PPing after loading LibreQoS with xdp-loader. When I tried, I was unable to compile XDP PPing. If someone else can pull it off please let me know. It would really optimize tcp latency tracking.

@rchac if you have time please raise an issue at bpf-examples with whatever errors you got. I think the configure script is supposed to try and hint at any missing dependencies, and ePPing should generally have the same dependencies as other examples in the repository, so chances are any improvements that make it easier to compile ePPing would also benefit other examples.

That gave me some errors at the beginning:

Starting ePPing in standard mode tracking TCP on eth0
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: Kernel error message: Exclusivity flag on, cannot modify

The first two (about xdp_metadata) should be harmless. If I understand it correctly they're due to libxdp adding some ELF sections that libbpf doesn't understand, but @tohojo is the expert on that.

The last one should also be relatively harmless. Whenever ePPing starts it will try to attach a clsact qdisc to the interface to be able to attach tc BPF programs to it. If a clsact qdisc already exists on the interface libbpf will throw out that error, and ePPing will simply reuse the existing clsact instead. Note that by default ePPing will NOT tear down this clsact on shutdown to avoid risking pulling the rug from other programs that may use the clsact. So you will often end up seeing this message if you've run ePPing on the same interface once previously. If you use the -f or --force argument ePPing will try to tear down the clsact, but only if it was the process that created the clsact.

With pping watching the whole thing (sudo ./pping --interface eth1 --tcp), I still get 19.1 Gbits/sec - but I reached about 75% utilization (single CPU).

This is roughly in line with the performance I've observed from my own tests.

I'm not at all sure the rate limit/sample limit is working. Changing sample rate to 100ms increased CPU load and reduced throughput. I'd have expected the exact opposite?

This seems strange, the default sampling rate is already 100ms, so specifying 100ms sampling rate should not make any difference at all. Please note that ePPing expects the sampling rate in ms, so setting sampling rate to ex. 100ms (the default) would be -r 100.

Looking at the source code (I may be wrong; I haven't deep-dived yet), rate limiting seems to be applied after parsing the packet, finding the current flow's state (or marking as a new one) - the part that's being rate limited is the storage of events in the shared buffer (with the rate limit being applied to the timestamp of the last event for that flow). It looks like it defaults to a rate limit of 100ms and an "rtt-rate" of 0 - I used 1. So I think I need to rethink how I approach those parameters.

You are correct. The rate limit is simply rate-limiting the potential number of RTT samples ePPing computes per flow. So with the default of 100ms you would at most get 10 RTTs every second per flow (and with -r 0, i.e. no rate limiting, you would still "only" get up to 1000 RTTs/s due to TCP timestamps not being updated more frequently, so for a single flow I would not expect this to have much impact on the performance). The -R or --rtt-rate simply adapts this rate limit to flow's RTT instead, so with --rtt-rate 1 a flow with ex. 50ms RTT would give you up to ~20 RTT samples per second for that flow. The idea behind the rate limiting was mainly to prevent ePPing from spitting out an absurd amount of RTT samples when there are multiple concurrent flows (which adds quite a bit of overhead), but as the limit is applied per-flow this still won't work that well with a large amount of concurrent flows (ex. -r 100 with 10,000 flows may still net you upwards of 100,000 RTTs/s).
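The numbers in this explanation can be checked with a short sketch. This is not ePPing's code: `max_samples_per_second` and `PerFlowRateLimiter` are hypothetical names, and the limiter is just a minimal model of "compare against the timestamp of the last sample for that flow" as described above.

```python
from typing import Dict, Hashable


def max_samples_per_second(rate_limit_ms: float, flows: int = 1) -> float:
    """Upper bound on reported RTTs/s when each flow is limited independently."""
    return flows * 1000.0 / rate_limit_ms


class PerFlowRateLimiter:
    """Allow at most one sample per flow every `rate_limit_ns` nanoseconds."""

    def __init__(self, rate_limit_ns: int) -> None:
        self.rate_limit_ns = rate_limit_ns
        self.last_sample_ns: Dict[Hashable, int] = {}

    def allow(self, flow: Hashable, now_ns: int) -> bool:
        last = self.last_sample_ns.get(flow)
        if last is not None and now_ns - last < self.rate_limit_ns:
            return False  # too soon after this flow's previous sample
        self.last_sample_ns[flow] = now_ns
        return True


print(max_samples_per_second(100))          # 10.0 per flow at the 100 ms default
print(max_samples_per_second(50))           # 20.0 when the limit tracks a 50 ms RTT
print(max_samples_per_second(100, 10_000))  # 100000.0 across 10,000 flows

limiter = PerFlowRateLimiter(rate_limit_ns=100_000_000)  # 100 ms
first = limiter.allow("flow-a", 0)
too_soon = limiter.allow("flow-a", 50_000_000)
other_flow = limiter.allow("flow-b", 50_000_000)
after_limit = limiter.allow("flow-a", 100_000_000)
```

Because the state is kept per flow, the total output still grows linearly with the number of concurrent flows, which is the scaling problem described above.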

Right now I'm working on aggregating RTTs instead of reporting individual RTTs to make ePPing work better with larger numbers of flows. I'm very open for suggestions on what the most useful way to aggregate the RTTs would be. Right now I'm leaning towards using some form of histogram and periodically report the RTT distribution per ex. flow/IP/IP-subrange.
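As a way to picture the histogram idea, here is a minimal sketch of aggregating individual RTT samples into per-subnet buckets. Everything in it is an assumption for illustration: the 10 ms bucket width, grouping by /24, the function names, and the sample values (rounded from the output pasted earlier in this thread). It says nothing about how ePPing will actually implement aggregation.

```python
import ipaddress
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

BUCKET_MS = 10  # assumed bucket width


def bucket_for(rtt_ms: float) -> int:
    """Lower edge (in ms) of the bucket an RTT falls into."""
    return int(rtt_ms // BUCKET_MS) * BUCKET_MS


def aggregate(samples: Iterable[Tuple[str, float]],
              prefixlen: int = 24) -> Dict[str, Counter]:
    """Count RTT samples per (subnet, bucket) instead of reporting each one."""
    hist: Dict[str, Counter] = defaultdict(Counter)
    for ip, rtt_ms in samples:
        subnet = ipaddress.ip_network(f"{ip}/{prefixlen}", strict=False)
        hist[str(subnet)][bucket_for(rtt_ms)] += 1
    return hist


samples = [
    ("172.17.27.200", 4.68),
    ("91.189.91.38", 64.85),
    ("172.17.27.200", 2.81),
    ("91.189.91.38", 58.65),
]
hist = aggregate(samples)
for subnet, counts in sorted(hist.items()):
    print(subnet, sorted(counts.items()))
```

A periodic reporter would emit and reset these counters, so the output volume depends on the number of subnets and buckets rather than on the number of packets.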

I was kind-of expecting a more general "only sample n% of packets" type of setup (I'm pretty sure that's what Preseem does; upside: it's fast. Downside: sometimes you don't have any data for mostly idle customers).

I don't think it would be much work to add this if it's something you want. As you say it's a bit more general approach and should also help even for a single flow. However, please note that unless ePPing (or Kathleen's original pping for that matter) sees every packet in a flow they may mismatch packets and calculate RTTs that are off with up to +/- 1 ms (or however often the sender updates its TCP timestamps), so just sampling a random subset of packets may cause the RTTs to become less accurate.

I should note with delayed ACKs ePPing will include the delay in the RTT, and I've recently discovered that there are some edge cases with delayed ACK where ePPing may slightly overestimate the RTT in addition to the delay. So if you need highly accurate RTTs (reliable accuracy below 1ms), then ePPing may not provide that (and neither will Kathleen's pping). These problems are inherent to using TCP timestamps to calculate RTTs, so I have been thinking of switching to sequence and ACK numbers instead, but those have their own challenges and it's not the highest priority right now. But using sequence and ACK numbers still require seeing all packets in a flow if you want to be sure that you don't get inflated RTTs due to retransmissions, and may still include the delay from delayed ACKs (depending on if the ACK was triggered by receiving enough segments or from timing out).

Looking at it from an integration point-of-view, I haven't tried to use xdp-load or similar to get both it and xdp-cpumap going at once. It should be possible, I'd recommend pping go first because it pretty much always just passes the packet - so you don't have to worry about it colliding with cpumap.

It should certainly be possible to use ePPing together with other XDP/tc programs, although admittedly I haven't tried it myself. ePPing should ALWAYS pass on the packet to the kernel (or whatever XDP/tc programs come next).

@tohojo

tohojo commented Oct 11, 2022 via email

@dtaht
Collaborator Author

dtaht commented Oct 11, 2022

It is probable I misspoke and tied the pping result at 11gbit to a ebpf pping at 1gbit.

As for the concept:

You have a shared db of the sending and receiving qdisc(s) updated by ebpf. You report on the rtt by pulling it out of the ebpf table. Goal was to only monitor what you wanted to monitor in problematic queue trees. Not possible?

@dtaht dtaht added this to the v1.4 milestone Nov 13, 2022
@dtaht dtaht self-assigned this Jan 12, 2023
@dtaht
Collaborator Author

dtaht commented Jan 16, 2023

In reading this long thread, I do not see any "new work" to be done out of it, aside from validating the incredibly high speed xdp bridge and ebpf code we now have to be correct. If there is something on this enormous thread that still needed to be done, please let me know? Otherwise, everyone, please give git head a shot.

@dtaht dtaht closed this as completed Jan 16, 2023