---
title: "Performance of User-Space Kubernetes"
date: 2024-06-11 10:00:00
---

This is a story that goes along with our upcoming paper about running <a href="https://arxiv.org/abs/2406.06995" target="_blank">User Space Kubernetes</a> alongside an HPC resource manager, Flux Framework. In summary, early work we did for <a href="https://fosdem.org/2024/schedule/event/fosdem-2024-2590-kubernetes-and-hpc-bare-metal-bros/" target="_blank">FOSDEM earlier this year</a> demonstrated that we could run Flux alongside user-space Kubernetes, but the network took a hit. There is much more background to that - I had been working on getting a setup (any setup) working for almost 6 months, an effort that led to a push for <a href="https://twitter.com/_AkihiroSuda_/status/1699208132604698735" target="_blank">Usernetes Gen 2</a>, and a much easier deployment experience that allowed me to get the first version working in <a href="https://github.com/converged-computing/flux-lima/tree/main/usernetes" target="_blank">Lima</a>.

But there were some weaknesses in the setup. Specifically, <a href="https://arxiv.org/html/2402.00365v1" target="_blank">slirp4netns</a> (my friend I now call "slirpy") meant the packets went through a TAP device that would slow down our applications (MPI benchmarks and an HPC application) by about 50%, which you can see in the FOSDEM talk. It wasn't a terrible result - it meant the setup could be used for services that didn't require low latency, but it would need additional work beyond that. At that time I didn't have a setup beyond the private "Star Trek" cluster that I presented, and I hadn't been able to get it working on any cloud. Of course I'm stubborn as heck, so that would change. This small post is a story about my adventure working through some of the challenges, because I think others can learn from it, and it was definitely a journey of debugging (and pure stubbornness to keep trying). A lot of the detail and a more professional presentation is in <a href="https://arxiv.org/abs/2406.06995" target="_blank">our paper on ArXiv</a>, but I'll share some more thoughts and lessons learned here.

## A Journey of Usernetes on AWS

### Moving Earth with Terraform

The setup (using Terraform) had many gotchas, specifically related to ensuring that we deployed with two or more availability zones but gave only one to the managed node group, since the Usernetes nodes needed internal IP addresses to connect. They could not see one another across availability zones, but the deployment wouldn't work if you asked for only one zone. The next issue was the Flux broker needing predictable hostnames, and AWS not having any reliable way to create them (aside from a robust setup with Route 53, which was too much for a bring-up-and-throw-away cluster). A headless Service in Kubernetes solves this for us (we can predict the names), but here Flux runs directly on the VMs, where the hostnames are garbled strings and numbers. So instead, we have to query the API on startup to derive them using a selector with name and instance type, wait for the expected number of instances (from the Terraform config), and write the Flux broker file dynamically. That also handles variables for the cluster from the <a href="https://github.com/converged-computing/flux-usernetes/blob/4790b59b81e7350094f58a1e14243eb3a904b015/aws/tf/main.tf" target="_blank">terraform main file</a>, such as the network adapter, number of instances, and selectors. Next, the security group needs ingress and egress to itself, and the health check needs to be set to a huge value so your entire cluster isn't brought down by one node deemed unhealthy right in the middle (this happened to me in prototyping, and since I hadn't automated everything yet I'd fall to my hands and knees in anguish, just kidding). The Elastic Fabric Adapter (EFA) and placement group are fairly important for good networking, but in practice I've seen quite variable performance regardless. In other words, you can easily get a bad node, and more work is needed to understand this.
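
As a rough sketch of the hostname discovery (the tag name and filter values here are illustrative, not the exact ones in the Terraform config), the startup logic does something along these lines before templating the broker configuration:

```console
# Ask EC2 for the private DNS names of instances that match our selector
# (instance type plus a tag applied by Terraform), keep polling until the
# expected count from the Terraform config shows up, then write the
# resulting hostlist into the Flux broker (bootstrap) configuration.
$ aws ec2 describe-instances \
    --filters "Name=instance-type,Values=hpc7g.4xlarge" \
              "Name=tag:selector,Values=flux-usernetes" \
              "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].PrivateDnsName" \
    --output text
```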

### Memory Limits

The next problem was about memory, and perhaps this was the most challenging thing to figure out. It took me upwards of a month to get our applications running, because MPI continually threw up on me with an error about memory. It didn't make any sense - the hpc family of instances on AWS was configured not to set any limits on memory, and we had lots of it. I had tried every MPI variant and install under the sun. It wasn't until I talked with one of our brilliant Flux developers that he suggested I test what Flux sees - running a command to check the memory limit as a Flux job. And what did we see? It was limited! It turns out that starting Flux as a service (as this systemd setup did) was setting a limit. Why? Because <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#Process%20Properties" target="_blank">systemd is a jerk</a>. We never saw this in the Flux Operator because we don't use systemd there - we just start the brokers directly. The fix was to add "LimitMEMLOCK=infinity" to the flux service file, and then our HPC application ran! It was an amazing moment, because it meant we were a tiny bit closer to using this setup. I pinged a lot of AWS folks for help, and although we figured it out separately, I want to say a huge thank you for looking into possible issues for this bug! I hope we helped uncover this insight for future folks.
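
A minimal sketch of the fix and the check, assuming Flux is running as a systemd service named flux.service (your unit name and path may differ):

```console
# Make sure the service does not clamp the locked-memory limit
$ grep LimitMEMLOCK /etc/systemd/system/flux.service
LimitMEMLOCK=infinity
$ sudo systemctl daemon-reload && sudo systemctl restart flux

# Verify what a job launched under Flux actually sees
$ flux run -N1 bash -c 'ulimit -l'
unlimited
```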

### Elastic Fabric Adapter

With the above, we had Flux running on bare metal, and performantly with the Elastic Fabric Adapter. But now - how to spin up a user-space Kubernetes cluster running inside of Flux, and in that cluster, bring up the Flux Operator (another Flux cluster) that runs the same application? I first needed an automated way to set the entire thing up - ssh'ing into nodes by hand wasn't something I wanted to do for more than 2-3 of them. I was able to bring up the control plane on the lead broker node, then use flux exec and flux archive to send the join command over to the workers and issue a command to run a script that starts and connects each one. After installing the Flux Operator, we then needed to expose the EFA driver to the container. I had to customize the daemonset installer to work in my Usernetes cluster, which I did by removing a lot of the selectors that expected AWS instances for nodes. And then - everything worked! Almost...
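
In rough strokes, the bring-up looked like the sketch below. The wrapper script names are placeholders for the scripts in the flux-usernetes repository, and the flux archive flags are from memory, so double check them against the repo:

```console
# On the lead broker (rank 0): start the Usernetes control plane and save
# the generated join command to a file
$ ./start-control-plane.sh          # placeholder for the repo's script

# Stage the join file through Flux, then extract and run it on all other ranks
$ flux archive create --name=join join-command
$ flux exec -r all -x 0 flux archive extract --name=join
$ flux exec -r all -x 0 ./start-worker.sh
```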

### It's that slirpy guy, or is it?

We at first had really confusing results. The OSU latency benchmark was almost the same as on the "bare metal" VM, but every other collective call performed worse. I assumed we were still dealing with the TAP device. But then, late one night, I stayed up reading code for <a href="https://gitlab.freedesktop.org/slirp/libslirp/-/blob/master/src/slirp.c#L781" target="_blank">libslirp</a> and slirp4netns and documentation for <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-basics" target="_blank">EFA</a>, and I couldn't see how that was logical. I hadn't traced anything, but everything I read pointed to the idea that this path should be bypassed entirely with EFA. I then started to wonder if maybe my assumption was wrong - could it be something else?

The red herring revealed itself when I brought up a cluster, ran LAMMPS again, and looked carefully at the output. I noticed that the CPU utilization was 60-something percent, and I had assumed this was because of the network - it was making the CPU wait, so the CPU was never fully utilized. But was it? I decided to run on just one node. And then I knew it wasn't the network - the CPU utilization didn't change! 🤯️ When you run on one node, there is no inter-node communication, so if the network had been causing the issue, we would have seen that number jump. It didn't.
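
For the curious, the comparison was as simple as running the same problem on one node versus two and reading the CPU use line that LAMMPS prints (the binary name and input file below are placeholders, not the exact experiment setup):

```console
# One node: no inter-node communication at all
$ flux run -N1 -n16 lmp -in in.lammps | grep "CPU use"
# Two nodes: now the network is in the path
$ flux run -N2 -n32 lmp -in in.lammps | grep "CPU use"
```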

So I started to read again about how cgroups (and CPU usage) work in Kubernetes. Then I did what any good developer would do: brought up an interactive container (the same one that would run LAMMPS) in Usernetes and started to snoop around. What was limiting the CPU? Looking at "/proc/cpuinfo" I saw all the cores. Flux resource list showed them too. But I kept looking, specifically in the cgroup, which is a bunch of dumb text files located at "/sys/fs/cgroup." Here is where I saw something interesting. I found a file called "cpu.max" that looked like this:

```console
# cat /sys/fs/cgroup/cpu.max
1000000 100000
```

That seemed suspicious, because there was a value set, period. So I naively removed the resource requests and limits from the pod, shelled into the container, and looked again. And whoa!

```console
# cat /sys/fs/cgroup/cpu.max
max 100000
```

It said "max" this time. And guess what? I then ran LAMMPS, on both 1 and 2 nodes (that's all I had) and the CPU utilization...

```console
Performance: 0.249 ns/day, 96.476 hours/ns, 28.792 timesteps/s, 70.023 katom-step/s
99.2% CPU use with 16 MPI tasks x 1 OpenMP threads
```

...was close to 100% (it's usually 99 point something, and actually even 99.2 is lower than what I normally see). Reading more about this setting, the two values in "cpu.max" are a quota and a period, both in microseconds: the cgroup is allowed at most that much CPU time in each period. Our earlier value of "1000000 100000" works out to 10 CPUs' worth of time per period, which on a 16-core node caps utilization around 62% - right about where LAMMPS was sitting. So if you set resource limits on two pods, they might be running on the same physical resources, each getting just a slice of the time. This was the <a href="https://gist.github.com/rsms/5f9899fb4a7dcce0ca2479a27af55130#file-cgroup2-cpu-limit-sh-L32-L36" target="_blank">comment</a> that I referenced that night. I had no idea - I had assumed that setting the resource request and limit was just a strategy to map the pod 1:1 to the node, and that it didn't actually cap any resources. I was wrong!

Specifically, I was wrong that slirp4netns was an issue at all. My intuition was correct that it wasn't in the path for a large chunk of the networking in our runs; what was actually happening is that the limits/requests we set to enforce the 1:1 mapping of pods to nodes were limiting the CPU. I had incorrectly thought that "Burstable" QoS (quality of service) meant you could go over those limits and use the full node resources (and that only Guaranteed would set actual limits), but I was wrong. This meant that anything that required the CPU to do processing (a collective call, or LAMMPS, for example) was going to suffer, while anything that is a ping/pong and uses less CPU (like OSU latency) is less likely to. This is why our OSU latency results were so good the first time - and I think we also (by chance) got a pair of good nodes, which isn't always the case. Here is <a href="https://github.com/kubernetes/kubernetes/blob/9c5643f8fcda0ad8b08ee04774abd0cc70dcd43f/pkg/kubelet/cm/cgroup_manager_linux.go#L597-L625" target="_blank">the function where</a> that cpu.max is set in Kubernetes. There are equivalents for memory, etc.
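
If you want to check this on your own pods, something like the following works (the pod name and output here are illustrative); a pod with requests but no limits is still Burstable, and its cpu.max stays at "max":

```console
# Which QoS class did Kubernetes assign?
$ kubectl get pod flux-sample-0 -o jsonpath='{.status.qosClass}'
Burstable

# Is a CPU quota actually being enforced inside the container?
$ kubectl exec -it flux-sample-0 -- cat /sys/fs/cgroup/cpu.max
max 100000
```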

I then redid the experiments - twice, actually, because the second time I forgot the placement group. Each run was 10-12 hours - and yes, I have a pretty good ability to focus on detail-oriented work. But this was a really fun adventure, because there were many challenges and bugs and things to learn along the way, and although it isn't perfect (we have more work to do), I am really proud of the work that we've done. It's these kinds of satisfying challenges, and working with people that support and accept me, that keep me around. My team is the best, and I am inspired to work alongside them and do great work while we have fun. 🥰️ If you want to learn more, you can read our early paper on <a href="https://arxiv.org/abs/2406.06995" target="_blank">arXiv</a>, which we will be submitting somewhere soon.

### Why not an issue before?

The reason I think this hasn't caught our attention before comes down to scale. We have only 16 physical cores per node on the <a href="https://instances.vantage.sh/aws/ec2/hpc7g.4xlarge" target="_blank">hpc7g.4xlarge</a> instance (chosen to maximize memory per CPU), so setting a resource limit below that cuts off a much larger fraction of the node than, say, setting it to 94 of 96 cores, as we did in previous experiments for KubeCon and earlier publications. This means we were likely taking some performance hit before, we just didn't realize it. With affinity and devices that require one per node, we shouldn't need to use resource limits anymore (and indeed we do not).

The lesson: question your own assumptions, and never give up. And don't forget to have fun. 🥑️
