Test-driving ARM cloud servers for development
ARM is coming to the cloud. Looking at some marketing materials, it seems the primary target is so-called "scale-out workloads", like lightweight web servers, that need to stay up but aren't exactly CPU-intensive. It's slower than x86, but it's also cheaper than x86! That seems to be the slogan.
My use case is the opposite. I am interested in using ARM cloud servers as a development environment, to troubleshoot porting issues to the ARM architecture. This often involves building large amounts of code, so CPU performance is important. On the other hand, compilation parallelizes well, so with enough cores (or vCPUs, as they say on the cloud) single-core performance is less important.
In this post, I will compare ARM cloud server offerings from Scaleway, AWS, and Packet. To benchmark them as a development environment, I chose building LLVM 9.0.0 in Release configuration as a workload. If you know of other ARM cloud server offerings I should try, please let me know.
LLVM was configured as follows:
$ cmake -G Ninja -D CMAKE_BUILD_TYPE=Release ..
To provide the context for numbers below, here is the result for my local build server at home, 4 cores of Intel Core i7-6700K with 16 GB of RAM.
$ time ninja -j4 real 23m9.424s user 88m45.267s sys 3m51.048s
4.2% spent in kernel, with 100.0% utilization. Took 92.6 minutes in single-core basis.
Local build server runs Debian GNU/Linux unstable, amd64. I chose Ubuntu 18.04 arm64 port for cloud servers. All systems were idle except for the build (and systemd and sshd). Everything else was the default unless specified otherwise. Note that cloud pricing is ever-shifting: pricing information below is as of the publication of this article.
Scaleway's ARM64 instances come in seven different kinds, from
ARM64-128GB. I chose
ARM64-4GB for a comparison. It comes with 6 vCPUs of Cavium ThunderX and 4 GB of RAM. Price was 0.012 EUR per hour.
I found that building with 6 jobs in parallel results in out-of-memory error. It seems about 1 GB of RAM per job is required, so I used 4 jobs instead.
$ time ninja -j4 real 256m33.810s user 992m21.012s sys 32m41.956s
3.2% spent in kernel, with 99.9% utilization. Took 1025.0 minutes in single-core basis.
AWS's A1 instances come in five different kinds, from
a1.4xlarge. I chose
a1.large for a comparison. It comes with 2 vCPUs of AWS Graviton (using ARM Neoverse cores) and 4 GB of RAM. Price was 0.0642 USD per hour in Tokyo region, which is the closest to me. (Seoul region is closer, but A1 instances are not available in Seoul.)
$ time ninja -j2 real 170m18.482s user 326m17.396s sys 14m3.670s
4.1% spent in kernel, with 99.9% utilization. Took 340.4 minutes in single-core basis.
c1.large.arm has 96 cores of Cavium ThunderX (the same chip used by Scaleway) and 128 GB of RAM. Price was 0.5 USD per hour.
c2.large.arm has 32 cores of Ampere eMAG and 128 GB of RAM. Price was 1 USD per hour.
$ time ninja -j96 real 12m35.765s user 883m14.923s sys 42m58.907s
4.6% spent in kernel, with 76.6% utilization. Took 926.2 minutes in single-core basis. LLVM build is parallel, but not 96 jobs parallel. As expected, the result is comparable to Scaleway.
$ time ninja -j32 real 13m38.964s user 404m19.502s sys 15m52.135s
3.8% spent in kernel, with 96.2% utilization. Took 420.2 minutes in single-core basis. LLVM build is 32 jobs parallel!
I used 1 EUR = 1.1 USD in the table below. Price is per core per hour in USD. Build time is per core in minutes.
|Scaleway ARM64||Cavium ThunderX||0.0033||1025.0|
|AWS A1||AWS Graviton||0.0321||340.4|
|Packet c1.large.arm||Cavium ThunderX||0.0052||926.2|
|Packet c2.large.arm||Ampere eMAG||0.0313||420.2|
Here it is again in relative numbers standardized at Scaleway = 1. Performance is reciprocal of build time.
Local x86 is >10x faster than Scaleway and >3x faster than AWS. While AWS is 3x faster than Scaleway, it's also ~10x expensive. Packet c1.large.arm is comparable to Scaleway and c2.large.arm is comparable to AWS, but they can't be rented in slices.