# tf benchmarks


Tensorflow (TF) benchmarks will tell if the Terminator comes sooner or later. Read the news, artificial intelligence is the biggest threat to humanity since cockroaches invaded the earth (together with those darn squirrels).

Based on the ML principle "don't believe benchmarks that you did not falsify yourself", I will try to benchmark my own systems. Unfortunately the initial WINDOWS snubbing by tensorflow limit(ed) my benchmarks to VMs (update: TF now works under Windows). Didn't Bill Gates recently save Google? They could be a bit more graceful. Or was it Apple? And no GPUs in virtual machines. And did I mention the missing AMD and OpenCL support already? Anyway, here is a collection of benchmarks from different sources.

Benchmarks can be geared towards accuracy, standard deviations/errors, memory consumption, performance and parallel scaling. It becomes clear very soon that using tensorflow with CPUs only may be cheap, but it is not very fast; see for example the cifar10 results with dual-GPU setups below. TensorFlow may not be the fastest or most accurate framework, but it may scale well with the use of additional GPUs.


## cifar10

This collection of tensorflow cifar10 performance numbers covers GPU and CPU results starting from November 2015; sources are added below. Some explanation: the cifar10 program tries to distinguish camels from pigeons, cats from dogs, etc. For examples/sec higher is better, for sec/batch lower is better. The table cannot be sorted, thanks to the ... markdown language.
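
As a quick orientation, the two speed columns are tied together by the batch size (128 images per batch for the cifar10 example), so one can be derived from the other. A tiny sanity check in Python:

```python
# Sanity check: examples/sec and sec/batch are two views of the same number,
# linked by the batch size (128 images per batch in cifar10_train.py).
batch_size = 128

for platform, sec_per_batch in [("GTX 1080", 0.072), ("Core i7-2600K", 0.555)]:
    print("%s: ~%.0f examples/sec" % (platform, batch_size / sec_per_batch))

# GTX 1080:      ~1778 examples/sec  (table reports 1780.0)
# Core i7-2600K:  ~231 examples/sec  (table reports 230.8)
```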

| Num | Platform | examples/sec | sec/batch | Price | Perf/$ |
|----:|----------|-------------:|----------:|-------|-------:|
| -1 | 2x GTX 1080 Ti | 12486.5 | 0.010 | $1600 | 7.80 |
| 0 | GTX 1080 | 1780.0 | 0.072 | $814 | 2.19 |
| 1 | GTX 1070 | 1733.1 | 0.074 | $449 | 3.85 |
| 2 | 2x GeForce Titan X | 796.7 | 0.161 | $2060 | 0.38 |
| 3 | 1x GeForce Titan X | 550.1 | 0.233 | $1030 | 0.53 |
| 4 | i7-3770K + GTX 970 | 641.4 | 0.200 | $630 (CPU+GPU) | 1.01 |
| 5 | Xeon E5-2670 + 1 GPU | 325.2 | 0.394 | Amazon g2.2xlarge | 0.12 |
| 6 | Xeon E5-2670 + 4 GPUs | 337.9 | 0.379 | Amazon g2.8xlarge | 0.06 |
| 7 | Tesla K40 | 350.0 | 0.250 | $3000 | 0.16 |
| 8 | Tesla K20 | NA | 0.350 | $2000 | NA |
| 9 | Core i7-2600K 4.2 GHz | 230.8 | 0.555 | $330 | 0.69 |
| 10 | 4x NVIDIA Tesla K20M | NA | 0.100 | $10996 | NA |

One interesting conclusion from the table above is that the Tesla K40 carries a premium price tag, but its performance does not really hold up to the roughly three-fold higher price. The desktop computer with a Core i7-2600K is the cheapest but, running on the CPU only, also the slowest, around 54 times slower than the fastest entry!

The fastest computer contains 2x GTX 1080 Ti Founders Edition with 11 GB of RAM each, but also uses a fast i7-6900K CPU and an Intel server SSD. The data needs to be force-fed from the CPU into the GPU, so an SSD + a fast CPU + a modern GPU with lots of GPU RAM all help. However, the cifar10 example currently does not scale very well: the same computer with just one GTX 1080 Ti Founders Edition (11 GB) actually runs at double the performance.

Now the table above is just an academic exercise; system costs are at least double the price, and of course energy costs will eat away personal budgets very quickly. A Titan X under full power consumes around 250 watts, meaning four cards under full load will draw 1 kW. If the computer runs at full load for 20 h a day, it will consume 20 kWh (kilowatt hours). Energy costs in the US will be roughly 20 kWh × $0.05/kWh = $1.00 per day, or about $360 for 360 days. That's just the GPUs and not the whole system.
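
For the record, here is the same back-of-the-envelope calculation as a few lines of Python (250 W per card, 20 h/day and $0.05/kWh are just the assumptions from above, so adjust for your own hardware and electricity rate):

```python
# Back-of-the-envelope GPU energy cost, using the assumptions from the text:
# 250 W per Titan X, 4 cards, 20 h/day at full load, $0.05 per kWh.
watts_per_gpu = 250
num_gpus = 4
hours_per_day = 20
usd_per_kwh = 0.05

kwh_per_day = watts_per_gpu * num_gpus / 1000.0 * hours_per_day  # 20 kWh
usd_per_day = kwh_per_day * usd_per_kwh                          # $1.00
usd_per_year = usd_per_day * 360                                 # ~$360 for 360 days
print(kwh_per_day, usd_per_day, usd_per_year)
```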

Source -1:
Core i7-6900K (8 core 3.2 Ghz), 2x NVIDIA® GeForce® GTX 1080 Ti Founders Edition 11GB GDDR5X (1xHDMI, 3xDP) each 3584 CUDA Cores; total 7168 CUDA cores; 64 GByte RAM PC4-17000 2133MHz DDR4; 480GB Intel® SSD Pro 5400s; System price: $4500 Source

Source 0:
Zotac GTX 1080 AMP Extreme, 2560 CUDA cores, 1771 MHz core clock, 10000 MHz mem clock. i7 930 3.8 GHz boost clock. step 100000, loss = 0.72 (1780.0 examples/sec; 0.072 sec/batch); time: 2h 5m. Source

Source 1:
Asus GTX 1070 Strix, 1920 CUDA cores, 1860 MHz core clock, 8000 MHz mem clock. i7 6700k 4.2 GHz boost clock. Source

Source 2 and 3:
GeForce Titan X, core clock 1127 MHz; 3072 CUDA cores; no CPU info. Source

Source 4:
GeForce GTX 970 (4 Gbyte RAM and 1.253 GHz clock rate, 7Ghz memory clock rate) with 1664 CUDA Cores
CPU: i7-3770K CPU @ 3.50GHz CPU has 8 threads and 4 cores
Source

Source 5 & 6:
On a g2.2xlarge: step 100, loss = 4.50 (325.2 examples/sec; 0.394 sec/batch)
1x NVIDIA GRID K520 GPU with 1,536 CUDA cores and 8 cores of an Intel Xeon E5-2670

On a g2.8xlarge: step 100, loss = 4.49 (337.9 examples/sec; 0.379 sec/batch)
4x NVIDIA GRID K520 GPUs, each with 1,536 CUDA cores, and 32 threads of an Intel Xeon E5-2670

It doesn't seem to be able to use the 4 GPU cards, unfortunately :(
Source and EC2 defs

Source 7 & 8:
On a single Tesla K40, cifar10_train.py processes a single batch of 128 images in 0.25-0.35 sec (i.e. 350 - 600 images /sec). The model reaches ~86% accuracy after 100K steps in 8 hours of training time. (source)

With batch_size 128.

| System | Step Time (sec/batch) | Accuracy |
|--------|-----------------------|----------|
| 1 Tesla K20m | 0.35-0.60 | ~86% at 60K steps (5 hours) |
| 1 Tesla K40m | 0.25-0.35 | ~86% at 100K steps (4 hours) |
Source

Source 9:
Core i7-2600K 4.2 GHz (2011), CPU only, running Oracle VirtualBox with Ubuntu 13 (EOL)

Source 10:
Four Tesla K20m each with 2496 CUDA Cores
source


## alexnet

This is the alexnet inference benchmark. The standard TF benchmark conditions are a batch size of 128 measured across 100 steps. The benchmark works on CPUs as well as GPUs. It contains a forward and a forward-backward pass. The forward pass is highly efficient on CPUs (80-90% core utilization), the forward-backward pass less so (30-50%). The benchmark reports results in ms/batch (milliseconds per batch; lower is better).
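
Under the hood the benchmark simply times repeated session runs of the forward graph and of the forward plus gradient graph. The following is a stripped-down sketch of that idea using a single toy convolution layer (TF 1.x style; this is not the original alexnet_benchmark.py code):

```python
import time
import tensorflow as tf  # TF 1.x style API, like the original benchmark

batch_size = 128
images = tf.random_normal([batch_size, 224, 224, 3])             # dummy input batch
conv = tf.layers.conv2d(images, 64, 11, strides=4, activation=tf.nn.relu)
loss = tf.reduce_mean(conv)                                       # toy "loss"
grads = tf.gradients(loss, tf.trainable_variables())              # backward-pass ops

def time_run(sess, target, steps=100):
    """Average wall-clock time per sess.run() over `steps` runs (in ms)."""
    sess.run(target)                                               # warm-up run
    start = time.time()
    for _ in range(steps):
        sess.run(target)
    return (time.time() - start) / steps * 1000

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print("forward:          %.1f ms/batch" % time_run(sess, loss))
    print("forward-backward: %.1f ms/batch" % time_run(sess, [loss] + grads))
```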

The benchmark is invoked by calling:

```
$ python tensorflow/models/image/alexnet/alexnet_benchmark.py
```

| Num | Platform | fwd (ms/batch) | fwd-bw (ms/batch) | Price ($) |
|----:|----------|---------------:|------------------:|----------:|
| 0 | GeForce GTX 1080 Ti | 25 | 76 | 800 |
| 1 | Titan X | 70 | 244 | 1000 |
| 2 | Tesla K40c | 145 | 480 | 3000 |
| 3 | GeForce TITAN X | 91 | 301 | 1000 |
| 4 | GeForce GTX TITAN X | 100 | 328 | 1000 |
| 5 | i7-2600K 4.2 GHz | 2456 | 9981 | 330 |
| 6 | i7-6900K 3.2 GHz | 932 | 2864 | 900 |

Total run time for the GeForce GTX 1080 Ti is 16 seconds(!), while the Core i7-6900K (8-core) CPU takes 7 minutes. That is a roughly 26-fold speed advantage for the GPU.

The Sandy Bridge Core i7-2600K CPU has only 4 cores and 8 threads running at 4.2 GHz, with a maximum DDR3 memory bandwidth of 21 GB/sec. In comparison, the GeForce Titan X has around 3072 CUDA cores running at 1.1 GHz and a maximum memory bandwidth of 336.5 GB/sec. So while the CPU has a roughly 4:1 clock-speed advantage and a 3:1 price advantage, it has a 768:1(!) core-count disadvantage and a 16:1 memory-bandwidth disadvantage. The overall performance disadvantage for the CPU is about 40:1; basically, the single GeForce Titan X GPU is 40 times faster than the CPU.

A performance-per-dollar comparison is probably not really useful here, because even with a 38-fold performance-per-dollar advantage you will also be 40-fold slower in absolute terms. Plus, this benchmark runs in the millisecond range and has to be considered a synthetic micro-benchmark. The most expensive CUDA bottleneck is the CPU-GPU bottleneck, i.e. efficiently moving gigabytes of data from the CPU to the GPU for processing. We can see that in the cifar10 benchmark, where the CUDA speed advantage shrinks to zero compared to a similarly priced Xeon CPU.
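
One common way to fight that feeding bottleneck is to overlap input preparation on the CPU with compute on the GPU, for example with TensorFlow's tf.data input pipeline. A minimal, self-contained sketch (dummy in-memory data stands in for real decoded images):

```python
import numpy as np
import tensorflow as tf

# Minimal tf.data sketch (TF 1.x style): the CPU prepares the next batch
# while the GPU is still busy with the current one.
images = np.random.rand(1000, 32, 32, 3).astype("float32")   # dummy image data
labels = np.random.randint(0, 10, size=1000).astype("int64")  # dummy labels

dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .shuffle(buffer_size=1000)
           .batch(128)
           .prefetch(1))            # keep one prepared batch queued for the GPU

next_images, next_labels = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    batch = sess.run(next_images)
    print(batch.shape)              # (128, 32, 32, 3)
```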

For those compute centers that are not located next to a nuclear power plant, performance per watt may play a role, so more efficient designs such as the Jetson TK1 or good old FPGAs may be interesting. But then again, you will be 10-fold slower and pay a 10-fold premium for saving a bunch of electrons (and the earth).

It might be interesting to note that the TensorFlow-based alexnet benchmark is still 10-fold slower than the Torch and neon deep learning kits.

- Source 0: Core i7-6900K (8 core 3.2 GHz), 2x NVIDIA® GeForce® GTX 1080 Ti Founders Edition 11GB GDDR5X (1xHDMI, 3xDP), each 3584 CUDA cores, total 7168 CUDA cores; 64 GByte RAM PC4-17000 2133MHz DDR4; 480GB Intel® SSD Pro 5400s; system price: $4500. Source

- Source 1 & 2: Titan X and Tesla K40c from tensorflow's alexnet_benchmark.py

- Source 3: 2x GeForce GTX TITAN X; major: 5 minor: 2; memoryClockRate (GHz) 1.2155; total memory: 12.00GiB; 3072 CUDA cores. Source

- Source 4: GeForce GTX TITAN X; major: 5 minor: 2; memoryClockRate (GHz) 1.076; total memory: 12.00GiB; 3072 CUDA cores. Source

- Source 5: i7-2600K 4.2 GHz (4 cores, 8 threads). Source: own

- Source 6: Core i7-6900K (8 core 3.2 GHz), 2x NVIDIA® GeForce® GTX 1080 Ti Founders Edition 11GB GDDR5X (1xHDMI, 3xDP), each 3584 CUDA cores, total 7168 CUDA cores; 64 GByte RAM PC4-17000 2133MHz DDR4; 480GB Intel® SSD Pro 5400s; system price: $4500. Source


## mnist

This is the LeNet-5-like convolutional MNIST model example. It achieves a test error of 0.8% (lower is better) and a validation error of 0.9% (lower is better). The TensorFlow tutorials can be downloaded from the TF models repository. Loading the data onto the GPU takes around 1 minute. Training for 10 epochs with a batch size of 64 on a GeForce GTX 1080 Ti takes an additional 54 seconds. Only one GPU is utilized by this example. When running the TF mnist example on the 8-core CPU only, the runtime increases to 7 minutes, so the GPU run is roughly 4 times faster overall (even though the per-batch time is roughly 10-fold faster). The CPU-only TF build was not compiled for SSE4.1, SSE4.2, AVX, AVX2 or FMA.
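
For orientation, a rough tf.keras sketch of a LeNet-5-style convnet of about the same shape (an illustrative approximation, not the tutorial's exact code, which uses the lower-level TF API):

```python
import tensorflow as tf

# Rough LeNet-5-style sketch in tf.keras -- an approximation of the shape of
# tutorials/image/mnist/convolutional.py, not a line-for-line copy.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu",
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```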

The benchmark is invoked by calling:

```
$ python tutorials/image/mnist/convolutional.py
```

| Num | Platform | time per batch (ms) | run time (min) | Price ($) |
|----:|----------|--------------------:|---------------:|----------:|
| 1 | GeForce GTX 1080 Ti | 5.7 ms | 1:54 min | $800 |
| 2 | Core i7-6900K | 52.4 ms | 7:36 min | $900 |

Source 1:
Core i7-6900K (8 core 3.2 Ghz), 2x NVIDIA® GeForce® GTX 1080 Ti Founders Edition 11GB GDDR5X (1xHDMI, 3xDP) each 3584 CUDA Cores; total 7168 CUDA cores; 64 GByte RAM PC4-17000 2133MHz DDR4; 480GB Intel® SSD Pro 5400s; System price: $4500 Source

Source 2:
Core i7-6900K (8 core 3.2 Ghz), 2x NVIDIA® GeForce® GTX 1080 Ti Founders Edition 11GB GDDR5X (1xHDMI, 3xDP) each 3584 CUDA Cores; total 7168 CUDA cores; 64 GByte RAM PC4-17000 2133MHz DDR4; 480GB Intel® SSD Pro 5400s; System price: $4500 Source


## word2vec

This is the implementation of the continuous bag-of-words (CBOW) and skip-gram architectures for computing vector representations of words. A tensorflow GPU implementation is missing; other implementations in Theano or Keras may use GPU acceleration. Hence the current benchmark runs on the CPU only. For the 15 epochs in the example the accuracy is 36.5%.
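
As a reminder of what skip-gram actually trains on: each target word is paired with the words inside a small context window. A tiny, framework-free sketch of that pair generation (toy sentence and window size chosen only for illustration):

```python
# Toy skip-gram pair generation: for every target word, emit (target, context)
# pairs for the neighbours inside a fixed window. Illustration only.
text = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, target in enumerate(text):
    for j in range(max(0, i - window), min(len(text), i + window + 1)):
        if j != i:
            pairs.append((target, text[j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```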

| Num | Platform | run time (min) | Price ($) |
|----:|----------|---------------:|----------:|
| 1 | Core i7-6900K (8 core) | 23:17 min | $900 |
| 2 | Core i7-2600K (4 core) | 36:04 min | $300 |

## Links

ConvNet Benchmarks - benchmarking convolutional neural network implementations such as Nervana neon or Caffe

TF performance@Quora - Why is the (first release) tensorflow performance poor?

Pick a DL algorithm - deep learning frameworks surveyed by VentureBeat

Samsung Veles - Benchmarks for Samsung Veles deep learning

Comparing DeepLearning Kits - Comparative Study of Caffe, Neon, Theano, and Torch for Deep Learning

AI gains - Looking back at performance gains in deep learning systems by Soumith Chintala from Facebook AI

BIDMach - Benchmarks from a CPU/GPU machine learning package

LambdaLabs - Benchmarks from RTX 2080 Ti, RTX 2080, GTX 1080 Ti, Titan V, Tesla V100 (Oct 2018) via LambdaLabs [XLS]
