# Benchmark

Spegel's performance is measured with the Benchmark tool to give an idea of the performance Spegel can be expected to deliver. The tool provides a generic method for measuring image pull performance under different deployment conditions in Kubernetes.

## Method

The benchmarks were run on AKS v1.29 with 50 Standard_D4ds_v5 nodes. The environment was set up using the provided Terraform configuration. Spegel v0.0.22 was installed in the cluster using the default configuration.
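For reference, a minimal sketch of such an installation, assuming the chart is consumed from Spegel's OCI registry as described in the project README:

```sh
# Install Spegel v0.0.22 with the default chart values.
helm upgrade --create-namespace --namespace spegel --install --version v0.0.22 \
  spegel oci://ghcr.io/spegel-org/helm-charts/spegel
```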

The measurements are done using the generated benchmark images. Each image is provided in a v1 and a v2 variant to simulate a rolling upgrade; a sketch of the simulated upgrade follows the list below.

- 10 MB 1 layer
- 10 MB 4 layers
- 100 MB 1 layer
- 100 MB 4 layers
- 1 GB 1 layer
- 1 GB 4 layers
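The rolling upgrade that the tool simulates is roughly equivalent to the following steps. This is a sketch of the presumed flow, and the daemonset and container names are hypothetical:

```sh
# Deploy the v1 image to every node as a daemonset (hypothetical names).
cat <<EOF | kubectl apply --namespace spegel-benchmark -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: benchmark
spec:
  selector:
    matchLabels:
      app: benchmark
  template:
    metadata:
      labels:
        app: benchmark
    spec:
      containers:
        - name: benchmark
          image: ghcr.io/spegel-org/benchmark:v1-10MB-1
EOF
kubectl rollout status daemonset/benchmark --namespace spegel-benchmark

# Switching the image to v2 triggers the rolling upgrade.
kubectl set image daemonset/benchmark benchmark=ghcr.io/spegel-org/benchmark:v2-10MB-1 --namespace spegel-benchmark
kubectl rollout status daemonset/benchmark --namespace spegel-benchmark
```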

The measurement for the 10 MB 1 layer image is run with the following command. The same measurement is done for each of the image size and layer combinations, as scripted in the sketch after the command.

```sh
benchmark measure --result-dir $RESULT_DIR --kubeconfig $KUBECONFIG --namespace spegel-benchmark --images ghcr.io/spegel-org/benchmark:v1-10MB-1 ghcr.io/spegel-org/benchmark:v2-10MB-1
```
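For completeness, the full sweep over all six combinations can be scripted. This is a sketch that assumes every image tag follows the same pattern as `v1-10MB-1`:

```sh
# Run the measurement for every image size and layer combination.
for size in 10MB 100MB 1GB; do
  for layers in 1 4; do
    benchmark measure \
      --result-dir $RESULT_DIR \
      --kubeconfig $KUBECONFIG \
      --namespace spegel-benchmark \
      --images ghcr.io/spegel-org/benchmark:v1-$size-$layers ghcr.io/spegel-org/benchmark:v2-$size-$layers
  done
done
```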

Afterwards, all of the results are analyzed to generate the corresponding graphs.

```sh
benchmark analyze --path $RESULT_DIR
```

## Results

The results are compared to baseline results, measured with the same setup but without Spegel running in the cluster.

| Image | Baseline | Spegel |
| --- | --- | --- |
| 10 MB 1 layer | (pull duration graph) | (pull duration graph) |
| 10 MB 4 layers | (pull duration graph) | (pull duration graph) |
| 100 MB 1 layer | (pull duration graph) | (pull duration graph) |
| 100 MB 4 layers | (pull duration graph) | (pull duration graph) |
| 1 GB 1 layer | (pull duration graph) | (pull duration graph) |
| 1 GB 4 layers | (pull duration graph) | (pull duration graph) |

## Analysis

The image pull durations for the v1 and v2 versions differ in shape. This is due to how Kubernetes rolls out pods for a new daemonset compared to when an existing one is updated. For the v1 images, pods are created in batches with no check that the previous batch has started successfully. An image and its layers are not advertised until the whole image has been pulled, so when nodes pull the same image in parallel they will all fetch it from the original registry. Because each new batch of pods is created before the previous batch has pulled the image, those pods also have to pull the image from the source registry. Batched pod creation is a known weakness of Spegel that currently has no solution, and any performance increase seen in the graphs is most likely coincidental, caused by the time at which the benchmarks were run.

On the other hand, it can be observed that performance is at times worse with Spegel. The reduction comes from waiting for the router to time out when the image does not exist in the cluster; the timeout currently defaults to 5 seconds. A lower value would remove these issues, at the risk of the timeout firing even when an image does exist within the cluster. The default was deliberately set high initially, but will be revised in a future release to bring performance on par with the baseline. Until then, the timeout can be tuned at install time, as sketched below.
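As a hedged example, assuming the chart exposes the router timeout as the Helm value `spegel.mirrorResolveTimeout` (verify the exact key against the chart's values):

```sh
# Lower the resolve timeout so pulls fall back to the source registry sooner.
helm upgrade --namespace spegel --install --version v0.0.22 \
  spegel oci://ghcr.io/spegel-org/helm-charts/spegel \
  --set spegel.mirrorResolveTimeout=1s
```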

The v2 images, however, see a much greater performance improvement, because during a rolling update each pod waits for the previous one to finish pulling the image. The first pod has to pull the image from the source registry, but after that the next pod should be able to pull it from the node where the first pod is running. The table below shows the average pull duration for the baseline and Spegel benchmarks. The improvement percentage is calculated as (baseline - spegel) / baseline * 100; a worked example follows the table. The largest performance increase is seen with small images. Images with more layers appear to be slower than their single layer counterparts; one explanation is that multiple layers require multiple requests, and in Spegel's case multiple layer discovery calls.

| Image | Baseline (avg) | Spegel (avg) | Improvement |
| --- | --- | --- | --- |
| 10 MB 1 layer | 1220 ms | 181 ms | 85.16% |
| 10 MB 4 layers | 1409 ms | 407 ms | 71.11% |
| 100 MB 1 layer | 1725 ms | 559 ms | 67.59% |
| 100 MB 4 layers | 1573 ms | 526 ms | 66.56% |
| 1 GB 1 layer | 8429 ms | 6942 ms | 17.64% |
| 1 GB 4 layers | 7310 ms | 5478 ms | 18.32% |

While still better than the baseline, the performance improvement shrinks as the images get larger. The best explanation is that the disk bandwidth is getting saturated. Spegel serves image layers from disk and relies on the OS to copy from disk to the TCP socket. Ignoring the overhead of discovering layers, the next bottleneck is the available network and disk bandwidth. The benchmark used Standard_D4ds_v5 VMs with ephemeral disks, which have a non-guaranteed throughput of 250 MB/s. A rough calculation, 1024 MB / 4.392 s = 233 MB/s, shows the pulls approaching the maximum performance of the disk.
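This ceiling can be sanity checked directly on a node, for example with `dd` using direct I/O to bypass the page cache. This is a rough estimate rather than a rigorous disk benchmark, and the target path is only an example:

```sh
# Write 1 GB sequentially with direct I/O; dd reports the achieved throughput.
dd if=/dev/zero of=/mnt/ddtest bs=1M count=1024 oflag=direct
rm /mnt/ddtest
```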

## Conclusion

The benchmarks have shown that Spegel works best during updates of existing deployments and daemonsets rather than when new ones are created. Future work is needed to allow advertising image pulls that are still in progress to increase performance. The results also show that disk performance eventually matters when pulling large images. For this reason it could be beneficial to deploy fewer, larger VMs with higher disk throughput rather than many smaller ones.