This document is for the OSDI'24 artifact evaluation.
In the paper, we make the following claim:

> We use Anvil to verify three Kubernetes controllers for managing ZooKeeper, RabbitMQ, and FluentBit, which can readily be deployed in Kubernetes platforms and are comparable in terms of features and performance to widely used unverified controllers.
The goal is to reproduce the key results that support this claim. Specifically, the key results are (1) the verification results in Table 1 in Section 7 and (2) the performance results in Table 3 in Section 7.
The entire artifact evaluation process can take about 9 hours (mostly machine time).
Following the kick-the-tires instructions, you will (1) verify one controller using the container image we prepared, and (2) run a small subset of the workloads used for evaluating the controller's performance.
The instructions in this section can be run on any machine with Docker installed.
Step 1: download the container image (~3.56GB)
```bash
docker pull ghcr.io/vmware-research/verifiable-controllers/anvil-ae:latest
```
Note: if you cannot download the image, try authenticating to the container registry and pulling again.
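For example, ghcr.io accepts a GitHub personal access token as the password; the username below is a placeholder:

```bash
# log in to GitHub Container Registry with a personal access token
# (read:packages scope); <github-username> is a placeholder
docker login ghcr.io -u <github-username>
docker pull ghcr.io/vmware-research/verifiable-controllers/anvil-ae:latest
```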
Step 2: run the container
```bash
docker run -it ghcr.io/vmware-research/verifiable-controllers/anvil-ae:latest bash
```
Now you should be in the `/anvil` directory inside the container.
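If you plan to come back to the same container later (the full evaluation reuses it), one option is to give the container a name so you can re-attach to it. This is plain Docker usage, not something our scripts require:

```bash
# optional: name the container so it can be re-entered later
docker run -it --name anvil-ae ghcr.io/vmware-research/verifiable-controllers/anvil-ae:latest bash
# after exiting, restart and re-attach to the same container
docker start -ai anvil-ae
```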
Step 3: verify the ZooKeeper controller in the container
```bash
VERUS_DIR=/verus ./verify-controller-only.sh zookeeper
```
It takes about 4 minutes to finish on a 2019 MacBook Pro.
Step 4: check the verification result
```bash
cat zookeeper.json | grep "errors"
```
The result should be `"errors": 0,`, meaning that all the proofs are verified. If you do not see the expected output, please let us know.
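If you want a slightly richer check, the Verus JSON output typically also records how many functions were verified; field names can vary across Verus versions, so treat this as a hedged example:

```bash
# also show the count of verified functions, if the field is present
grep -E '"(errors|verified)"' zookeeper.json
```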
The instructions in this section require some environment setup to run the controller workloads. We highly suggest you use a CloudLab machine with the profiles we prepared, as all the environment setup will then be a matter of running one script. If this is your first time using CloudLab, you can read the CloudLab doc for an overview of how artifact evaluation is generally conducted on CloudLab. If you do not already have a CloudLab account, please apply for one following this link, and ask the OSDI AEC chair to add you to the OSDI AEC project.
To make the evaluation process smooth, we have prepared CloudLab profiles that set up the environment for three hardware types: Clemson c6420, Wisconsin c220g5, and Wisconsin c220g2. Please note that these machines may not be available all the time; you can submit a resource reservation to guarantee the availability of a machine.
Please let us know if you have trouble accessing CloudLab; we can help set up the experiment and give you access.
Step 1: setup environment
Click one of the following three links: profile anvil-ae-c6420, profile anvil-ae-c220g5, or profile anvil-ae-c220g2, and keep hitting `Next` to create the experiment. You should see that CloudLab starts to provision the machine, which typically takes ~5 minutes. Please patiently wait for "Status" to become `Ready`. After the machine is ready, log into the machine using `ssh` (with the key provided to CloudLab) or using the shell provided by the CloudLab Web UI, and then run
```bash
bash /local/repository/scripts/cloudlab_startup_run_by_creator.sh
```
This command will take ~5 minutes to finish; it sets up the environment for you and installs Acto in the `workdir/acto` directory under your `$HOME` directory.
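As a quick sanity check (optional; just plain shell usage), you can confirm that the Acto directory is in place:

```bash
ls ~/workdir/acto
```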
Please note that by default you only have access to the machine for 16 hours. Although that is definitely enough to finish the kick-the-tires instructions, we suggest you extend your access (just click the green `Extend` button) to ensure you can finish the full evaluation with the same machine smoothly.
No available machine?
You can try to reserve a machine of type Clemson c6420, Wisconsin c220g5, or Wisconsin c220g2. If you still cannot get a machine, please contact us on HotCRP.
The script fails in the middle?
Sometimes the script might fail due to transient network issues. The script is supposed to be idempotent, so you can just rerun it. If it persistently fails, please contact us on HotCRP.
Still want to set up a local environment instead of using CloudLab?
- A Linux system with Docker support
- Python 3.10 or newer
- Install `pip3` by running `sudo apt install python3-pip`
- Install Golang
- Clone the repo recursively by running `git clone --recursive --branch anvil-dev https://github.com/xlab-uiuc/acto.git`
- Install Python dependencies by running `pip3 install -r requirements-dev.txt` in the project directory
- Install Kind by running `go install sigs.k8s.io/kind@v0.20.0`
- Install Kubectl by running `curl -LO https://dl.k8s.io/release/v1.22.9/bin/linux/amd64/kubectl` and then `sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl`
Run all the instructions below inside the cloned `acto` repo.
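For convenience, here is the same setup collected into one script. This is only a sketch: it assumes an Ubuntu-like host where `sudo`, `curl`, Docker, and Go are already available, so adapt it to your system.

```bash
#!/usr/bin/env bash
set -euo pipefail

# install pip3 (assumes an apt-based distribution)
sudo apt install -y python3-pip

# clone the repo recursively and enter it
git clone --recursive --branch anvil-dev https://github.com/xlab-uiuc/acto.git
cd acto

# install Python dependencies
pip3 install -r requirements-dev.txt

# install Kind
go install sigs.k8s.io/kind@v0.20.0

# install Kubectl
curl -LO https://dl.k8s.io/release/v1.22.9/bin/linux/amd64/kubectl
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
```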
If you encounter any problem, please contact us on HotCRP.
Step 2: run the workload
We suggest you use `tmux` when running on remote machines, as the command can take hours.
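If you are unfamiliar with `tmux`, a minimal session looks like this (standard tmux usage, nothing specific to our scripts):

```bash
tmux new -s anvil-ae      # start a named session, then run the workload inside it
# detach with Ctrl-b d; the workload keeps running in the background
tmux attach -t anvil-ae   # re-attach later to check progress
```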
```bash
cd ~/workdir/acto/
bash anvil-ae-one-controller.sh 0.05
```
It takes ~1.5 hours to finish on a CloudLab c6420. Note that if you chose to set up the environment manually, you need to replace `~/workdir/acto/` with the path to the cloned `acto` repo on your machine.
Step 3: check the performance result
```bash
cat anvil-table-3.txt
```
You should see a table like this:
| Controller | Verified (Anvil) Mean | Verified (Anvil) Max | Reference (unverified) Mean | Reference (unverified) Max |
|--------------|-------------------------|------------------------|-------------------------------|------------------------------|
| Zookeeper | 149.953 | 159.953 | 141.854 | 160.174 |
Note that the numbers are reconciliation execution times in milliseconds, and the absolute numbers depend on the platform. If you do not see the expected table, please let us know.
Following the full evaluation instructions, you will reproduce the verification results in Table 1 and the performance results in Table 3. These are the key results that support the claim in the paper. The absolute values of the time-related results heavily depend on the platform, but we will highlight the key patterns you should be able to observe.
Following the instructions, you will reproduce the key results that (1) the Anvil framework and the three controllers are verified, and (2) the code size and the time to verify are consistent with Table 1 in the paper.
You will reuse the container from the kick-the-tires instructions. Run the container with bash as you did before. Then, in the `/anvil` directory inside the container, run
```bash
VERUS_DIR=/verus ./reproduce-verification-result.sh
```
This command runs Verus to verify the Anvil framework and the three controllers, and it collects statistics including code sizes and time to verify. It takes about 12 minutes to finish on a 2019 MacBook Pro. After it's done, you should see `Presenting verification results from Verus.` followed by output like the following, with 0 errors for Anvil and each of the three controllers, meaning everything is verified:
```
+ cat anvil.json
+ grep errors
"errors": 0,
+ cat fluent.json
+ grep errors
"errors": 0,
+ grep errors
+ cat rabbitmq.json
"errors": 0,
+ cat zookeeper.json
+ grep errors
"errors": 0,
```
The most important results are the `"errors": 0,` lines, meaning that the Anvil framework and the three controllers are verified.
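Equivalently, you can check all four result files with one `grep` invocation:

```bash
grep '"errors"' anvil.json fluent.json rabbitmq.json zookeeper.json
```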
To see the generated Table 1, run

```bash
cat tools/t1.txt
```
and you should see a generated table like this:
| | Trusted (LoC) | Exec (LoC) | Proof (LoC) | Time to Verify (seconds) |
|---------------------------|-----------------|--------------|---------------|----------------------------|
| ZooKeeper controller | | | | |
| |---Liveness | 94 | -- | 7245 | 297.872 |
| |---Conformance | 5 | -- | 172 | 5.891 |
| |---Controller model | -- | -- | 935 | -- |
| |---Controller impl | -- | 1134 | -- | -- |
| |---Trusted wrapper | 515 | -- | -- | -- |
| |---Trusted ZooKeeper API | 318 | -- | -- | -- |
| |---Trusted entry point | 19 | -- | -- | -- |
| |---Total | 951 | 1134 | 8352 | 303.763 (172.959) |
| RabbitMQ controller | | | | |
| |---Liveness | 144 | -- | 5211 | 163.285 |
| |---Safety | 22 | -- | 358 | 14.561 |
| |---Conformance | 5 | -- | 290 | 10.48 |
| |---Controller model | -- | -- | 1369 | -- |
| |---Controller impl | -- | 1598 | -- | -- |
| |---Trusted wrapper | 359 | -- | -- | -- |
| |---Trusted entry point | 19 | -- | -- | -- |
| |---Total | 549 | 1598 | 7228 | 188.326 (209.879) |
| Fluent controller | | | | |
| |---Liveness | 115 | -- | 7079 | 216.977 |
| |---Conformance | 10 | -- | 201 | 6.741 |
| |---Controller model | -- | -- | 1115 | -- |
| |---Controller impl | -- | 1208 | -- | -- |
| |---Trusted wrapper | 681 | -- | -- | -- |
| |---Trusted entry point | 24 | -- | -- | -- |
| |---Total | 830 | 1208 | 8395 | 223.718 (168.815) |
| Total(all) | 2330 | 3940 | 23975 | 715.807 (551.653) |
When comparing this generated table to the original Table 1 in the paper, please note that:
- The numbers in the "Time to Verify" column heavily depend on the platform. The numbers we show above differ from those in the paper because the platform configuration and the solver version have changed since the submission. You might find that the absolute numbers generated on your platform differ from the numbers shown above, which is expected. Regardless of the platform, you should still be able to observe that most of the time (>70%) is spent on the "Liveness" rows; e.g., in the sample output above, the ZooKeeper liveness proofs account for 297.872 of the 303.763 total seconds, about 98%.
- The numbers in the "Trusted", "Exec", and "Proof" columns should be deterministic. You might notice some minor differences when comparing them to the numbers reported in the paper; this is because we have slightly updated the controllers' implementations and proofs since the submission.
Following the instructions, you will reproduce the key result that the verified controllers achieve performance comparable to the unverified reference controllers, as shown in Table 3.
You will reuse the CloudLab machine as in the Kick-the-tires Instructions.
We suggest you use `tmux` when running on remote machines, as the command will take hours.
In the `~/workdir/acto/` directory on your CloudLab machine, run
```bash
bash anvil-ae-sampled.sh 0.05
```
This command runs 5% of the workloads for the three controllers and their unverified references. It takes ~7 hours to finish on a CloudLab c6420. After it's done, to see the generated Table 3, run
```bash
cat anvil-table-3.txt
```
and you should see a generated table like this:
| Controller | Verified (Anvil) Mean | Verified (Anvil) Max | Reference (unverified) Mean | Reference (unverified) Max |
|--------------|-------------------------|------------------------|-------------------------------|------------------------------|
| Zookeeper | 149.953 | 159.953 | 141.854 | 160.174 |
| RabbitMQ | 201.167 | 356.158 | 202.159 | 356.013 |
| FluentBit | 32.087 | 33.049 | 29.634 | 33.26 |
The numbers are the execution time (in milliseconds) it takes for the verified/reference controllers to perform reconciliation. The absolute numbers depend on the platform. You might observe that the execution times are shorter than the numbers reported in the paper; this is because the machine configuration and Acto (the tool we use to run workloads) have changed since the submission. Regardless of the platform, you should still be able to observe that the verified controllers are NOT significantly slower than their unverified references: the execution time of each verified controller should be within 2.5X of that of the corresponding reference controller, in terms of both mean and max time. In fact, in most cases the differences are negligible (as shown above).
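If you want a quick mechanical check of the 2.5X bound, a one-liner like the following computes the verified/reference ratios from the generated table. This is a hedged sketch: it assumes the exact '|'-separated column layout shown above (header on line 1, separator on line 2), so adjust the field numbers if your `anvil-table-3.txt` differs.

```bash
# print mean and max verified/reference ratios for each controller row;
# assumes the exact column layout shown above
awk -F'|' 'NR > 2 { gsub(/ /, ""); if ($2 != "") printf "%s: mean %.2fx, max %.2fx\n", $2, $3/$5, $4/$6 }' anvil-table-3.txt
```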
Want to run all the workloads?
Note that this will take 100+ hours to finish. If you really want to do that, run
```bash
bash anvil-ae-sampled.sh 1
```