# Benchmark

Spegel's performance is measured with the Benchmark tool to give an idea of the performance Spegel can be expected to deliver. The tool provides a generic method for measuring image pull performance under different deployment conditions in Kubernetes.

## Method

The benchmarks were run on AKS v1.29 with 50 Standard_D4ds_v5 nodes. The environment was set up using the provided Terraform configuration. Spegel v0.0.22 was installed in the cluster using the default configuration.
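For reference, a minimal sketch of such an installation, assuming the chart is consumed from Spegel's OCI registry as described in the project README:

```sh
# Install Spegel v0.0.22 with the default chart values.
helm upgrade --create-namespace --namespace spegel --install --version v0.0.22 \
  spegel oci://ghcr.io/spegel-org/helm-charts/spegel
```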

The measurements are done using the generated benchmark images. Each image is provided in a v1 and a v2 variant to simulate a rolling upgrade; a sketch of the simulated upgrade follows the list below.

- 10 MB 1 layer
- 10 MB 4 layers
- 100 MB 1 layer
- 100 MB 4 layers
- 1 GB 1 layer
- 1 GB 4 layers
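The rolling upgrade that the tool simulates is roughly equivalent to the following steps. This is a sketch of the presumed flow, and the daemonset and container names are hypothetical:

```sh
# Deploy the v1 image to every node as a daemonset (hypothetical names).
cat <<EOF | kubectl apply --namespace spegel-benchmark -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: benchmark
spec:
  selector:
    matchLabels:
      app: benchmark
  template:
    metadata:
      labels:
        app: benchmark
    spec:
      containers:
        - name: benchmark
          image: ghcr.io/spegel-org/benchmark:v1-10MB-1
EOF
kubectl rollout status daemonset/benchmark --namespace spegel-benchmark

# Switching the image to v2 triggers the rolling upgrade.
kubectl set image daemonset/benchmark benchmark=ghcr.io/spegel-org/benchmark:v2-10MB-1 --namespace spegel-benchmark
kubectl rollout status daemonset/benchmark --namespace spegel-benchmark
```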

The measurement for the 10 MB 1 layer image is run with the following command. The same measurement is done for each of the image size and layer combinations, as scripted in the sketch after the command.

```sh
benchmark measure --result-dir $RESULT_DIR --kubeconfig $KUBECONFIG --namespace spegel-benchmark --images ghcr.io/spegel-org/benchmark:v1-10MB-1 ghcr.io/spegel-org/benchmark:v2-10MB-1
```
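For completeness, the full sweep over all six combinations can be scripted. This is a sketch that assumes every image tag follows the same pattern as `v1-10MB-1`:

```sh
# Run the measurement for every image size and layer combination.
for size in 10MB 100MB 1GB; do
  for layers in 1 4; do
    benchmark measure \
      --result-dir $RESULT_DIR \
      --kubeconfig $KUBECONFIG \
      --namespace spegel-benchmark \
      --images ghcr.io/spegel-org/benchmark:v1-$size-$layers ghcr.io/spegel-org/benchmark:v2-$size-$layers
  done
done
```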

Afterwards, all of the results are analyzed to generate the corresponding graphs.

```sh
benchmark analyze --path $RESULT_DIR
```

## Results

The results are compared to baseline results, measured with the same setup but without Spegel running in the cluster.

| Image | Baseline | Spegel |
| --- | --- | --- |
| 10 MB 1 layer | (pull duration graph) | (pull duration graph) |
| 10 MB 4 layers | (pull duration graph) | (pull duration graph) |
| 100 MB 1 layer | (pull duration graph) | (pull duration graph) |
| 100 MB 4 layers | (pull duration graph) | (pull duration graph) |
| 1 GB 1 layer | (pull duration graph) | (pull duration graph) |
| 1 GB 4 layers | (pull duration graph) | (pull duration graph) |

## Analysis

The image pull durations for the v1 and v2 versions differ in shape. This is due to how Kubernetes rolls out pods for a new daemonset compared to when an existing one is updated. For the v1 images, pods are created in batches with no check that the previous batch has started successfully. An image and its layers are not advertised until the whole image has been pulled, so when nodes pull the same image in parallel they will all fetch it from the original registry. Because each new batch of pods is created before the previous batch has pulled the image, those pods also have to pull the image from the source registry. Batched pod creation is a known weakness of Spegel that currently has no solution, and any performance increase seen in the graphs is most likely coincidental, caused by the time at which the benchmarks were run.

On the other hand, it can be observed that performance is at times worse with Spegel. The reduction comes from waiting for the router to time out when the image does not exist in the cluster; the timeout currently defaults to 5 seconds. A lower value would remove these issues, at the risk of the timeout firing even when an image does exist within the cluster. The default was deliberately set high initially, but will be revised in a future release to bring performance on par with the baseline. Until then, the timeout can be tuned at install time, as sketched below.
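As a hedged example, assuming the chart exposes the router timeout as the Helm value `spegel.mirrorResolveTimeout` (verify the exact key against the chart's values):

```sh
# Lower the resolve timeout so pulls fall back to the source registry sooner.
helm upgrade --namespace spegel --install --version v0.0.22 \
  spegel oci://ghcr.io/spegel-org/helm-charts/spegel \
  --set spegel.mirrorResolveTimeout=1s
```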

The v2 images, however, see a much greater performance improvement, because during a rolling update each pod waits for the previous one to finish pulling the image. The first pod has to pull the image from the source registry, but after that the next pod should be able to pull it from the node where the first pod is running. The table below shows the average pull duration for the baseline and Spegel benchmarks. The improvement percentage is calculated as (baseline - spegel) / baseline * 100; a worked example follows the table. The largest performance increase is seen with small images. Images with more layers appear to be slower than their single layer counterparts; one explanation is that multiple layers require multiple requests, and in Spegel's case multiple layer discovery calls.

| Image | Baseline (avg) | Spegel (avg) | Improvement |
| --- | --- | --- | --- |
| 10 MB 1 layer | 1220 ms | 181 ms | 85.16% |
| 10 MB 4 layers | 1409 ms | 407 ms | 71.11% |
| 100 MB 1 layer | 1725 ms | 559 ms | 67.59% |
| 100 MB 4 layers | 1573 ms | 526 ms | 66.56% |
| 1 GB 1 layer | 8429 ms | 6942 ms | 17.64% |
| 1 GB 4 layers | 7310 ms | 5478 ms | 18.32% |

While still better than the baseline, the performance improvement shrinks as the images get larger. The best explanation is that the disk bandwidth is getting saturated. Spegel serves image layers from disk and relies on the OS to copy from disk to the TCP socket. Ignoring the overhead of discovering layers, the next bottleneck is the available network and disk bandwidth. The benchmark used Standard_D4ds_v5 VMs with ephemeral disks, which have a non-guaranteed throughput of 250 MB/s. A rough calculation, 1024 MB / 4.392 s = 233 MB/s, shows the pulls approaching the maximum performance of the disk.
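This ceiling can be sanity checked directly on a node, for example with `dd` using direct I/O to bypass the page cache. This is a rough estimate rather than a rigorous disk benchmark, and the target path is only an example:

```sh
# Write 1 GB sequentially with direct I/O; dd reports the achieved throughput.
dd if=/dev/zero of=/mnt/ddtest bs=1M count=1024 oflag=direct
rm /mnt/ddtest
```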

## Conclusion

The benchmarks have shown that Spegel works best during updates of existing deployments and daemonsets rather than when new ones are created. Future work is needed to allow advertising image pulls that are still in progress to increase performance. The results also show that disk performance eventually matters when pulling large images. For this reason it could be beneficial to deploy fewer, larger VMs with higher disk throughput rather than many smaller ones.