
Performance

AlexNet on ILSVRC 2012 (Distributed GPU Training)

  • Objective: We train AlexNet on the ILSVRC 2012 dataset.

  • Environment: Throughput is measured on a distributed GPU cluster in which each node is equipped with one NVIDIA K20 GPU and 40 Gigabit Ethernet (GbE). Training data are partitioned and stored on the local HDD of each node. cuDNN R2 is enabled.

  • Setting: See the net prototxt and solver. The training script and PS settings are provided here.

Throughput

The following figure shows PMLS-Caffe's throughput speedup when training AlexNet with different staleness values and numbers of nodes. For the 1-node setting, the performance of the original Caffe is reported. Throughput is evaluated with cuDNN R2 and CUDA 6.5.
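
The staleness setting refers to the bounded-staleness (SSP) consistency model of the PMLS parameter server: a worker may run at most a fixed number of iterations ahead of the slowest worker before it must wait for fresher parameters, trading a small amount of consistency for higher throughput. The Python sketch below illustrates the idea only; the class and method names are hypothetical, not the PMLS API.

```python
# Minimal sketch of bounded-staleness (SSP) coordination between workers.
# Illustrative only; names do not correspond to the PMLS implementation.
import threading

class StalenessGate:
    """Lets a worker advance only while it is at most `staleness` iterations
    ahead of the slowest worker."""

    def __init__(self, num_workers, staleness):
        self.staleness = staleness
        self.clocks = [0] * num_workers
        self.cond = threading.Condition()

    def advance(self, worker_id):
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()
            # Block while this worker is more than `staleness` iterations
            # ahead of the slowest worker.
            while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                self.cond.wait()
```

With staleness 0 this reduces to fully synchronous execution; larger values let fast workers keep computing while slower workers catch up, which is why throughput improves with the staleness setting.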

Convergence

On our cluster, when training AlexNet with 8 nodes, PMLS-Caffe takes only 1 day to converge (compared to 5-7 days with single-machine Caffe) and achieves 56.5% top-1 accuracy on the validation set.

The following figures show how the validation error decreases with training time and with the number of iterations. For the 1-node setting, the performance of the original Caffe is reported.
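
Top-1 accuracy, as reported above, is the fraction of validation images whose highest-scoring predicted class equals the ground-truth label. A minimal NumPy illustration (array names are placeholders, not part of the training pipeline):

```python
import numpy as np

def top1_accuracy(scores, labels):
    """scores: (num_images, num_classes) class scores; labels: (num_images,)."""
    predictions = scores.argmax(axis=1)
    return (predictions == labels).mean()

# Toy usage: 5 images, 3 classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.8, 0.1, 0.1],
                   [0.3, 0.3, 0.4],
                   [0.2, 0.5, 0.3],
                   [0.6, 0.2, 0.2]])
labels = np.array([1, 0, 2, 0, 0])
print(top1_accuracy(scores, labels))  # 0.8
```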

GoogLeNet on ILSVRC 2012 (Distributed GPU Training)

  • Objective: We train GoogLeNet on the ILSVRC 2012 dataset.

  • Environment: Throughput is measured on a distributed GPU cluster in which each node is equipped with one NVIDIA K20 GPU and 40 Gigabit Ethernet (GbE). Training data are partitioned and stored on the local HDD of each node. cuDNN R2 is enabled.

  • Setting: See the net prototxt and solver. The training script and PS settings are provided here.

Throughput

The following figure shows PMLS-Caffe's throughput speedup when training GoogLeNet with different staleness values and numbers of nodes, compared to single-machine Caffe. Throughput is evaluated with cuDNN R2 and CUDA 6.5.

Convergence

When training GoogLeNet with 8 nodes, PMLS-Caffe takes less than 48 hours to reach 50% top-1 accuracy, less than 75 hours to reach 57% top-1 accuracy, and finally reaches 67.1% top-1 accuracy. This is roughly a 4x speedup over single-machine Caffe, which usually takes 15-20 days to converge, as shown in the following figures.

ImageNet 22K (Distributed GPU Training)

  • Objective and dataset: We train a CNN using all available images in ImageNet: 14,197,087 labeled images from 21,841 categories. We randomly split the whole set into two parts, using the first 7.1 million images for training and the remainder for testing. The whole dataset is about 3.2 TB, with 1.6 TB for training and 1.6 TB for testing.

  • Environment: We train the CNN with full data parallelism on a GPU cluster with 8 nodes, each equipped with one NVIDIA K20 GPU and 40 Gigabit Ethernet (GbE). Training data are partitioned and stored on the local HDD of each node; see the sketch after this list. cuDNN R2 is enabled.

  • Settings: The network and solver configurations will be released soon.
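
As a rough illustration of the fully data-parallel setup described above, each node computes gradients on mini-batches drawn from its local data shard and synchronizes weights through a central parameter server. The sketch below uses a toy linear model so it runs end to end; all names are hypothetical and do not reflect the PMLS-Caffe training script.

```python
# Illustrative data-parallel SGD through a parameter server. Hypothetical
# names; not the PMLS-Caffe implementation.
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.weights = np.zeros(dim)

    def pull(self):
        return self.weights.copy()

    def push(self, gradient, lr=0.01):
        # Apply a worker's gradient as soon as it arrives.
        self.weights -= lr * gradient

def worker_step(ps, local_batch, local_labels, grad_fn):
    """One iteration on one node: pull weights, compute the gradient on a
    mini-batch from the node's local shard, push the gradient back."""
    w = ps.pull()
    g = grad_fn(w, local_batch, local_labels)
    ps.push(g)

# Toy linear model so the sketch is runnable.
def linear_grad(w, x, y):
    return x.T @ (x @ w - y) / len(y)

ps = ParameterServer(dim=10)
rng = np.random.default_rng(0)
for step in range(100):
    x = rng.normal(size=(32, 10))   # mini-batch from this node's local shard
    y = x @ np.ones(10)             # synthetic targets
    worker_step(ps, x, y, linear_grad)
```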

Convergence

The following table compares our result to those of previous work on ImageNet 22K, in terms of experimental settings, machine resources, training time, and train/test accuracy. Note that prediction performance depends primarily on the chosen CNN architecture, and could therefore be substantially improved by using a different or improved model.

| Framework | Data (train/test) | # machines / cores | Time | Train accuracy | Test accuracy |
|---|---|---|---|---|---|
| PMLS-Caffe | 7.1M / 7.1M | 8 machines / 8 GPUs | 3 days | 41% | 23.7% |
| Adam | 7.1M / 7.1M | 62 machines / ? | 10 days | N/A | 29.8% |
| Le et al., w/ pretrain | 7.1M + 10M unlabeled images / 7.1M | 1,000 machines / 16,000 cores | 3 days | N/A | 15.8% |
| MxNet | 14.2M / no test | 1 machine / 4 GPUs | 8.5 days | 37.19% | N/A |