Merge branch 'master' into lr-warmup-gammainit
hma02 committed Aug 2, 2017
2 parents 80580c0 + 645d183 commit f8e9c9a
Showing 6 changed files with 25 additions and 24 deletions.
12 changes: 8 additions & 4 deletions README.md
```diff
@@ -105,16 +105,20 @@ More examples can be found [here](https://github.com/uoguelph-mlrg/Theano-MPI/tr
 
 ## Example Performance
 
-###BSP tested on up to eight Tesla K80 GPUs
-Time per 5120 images in seconds: [allow_gc = True]
+### BSP tested on up to eight Tesla K80 GPUs
+
+Training (+communication) time per 5120 images in seconds: [allow_gc = True, using nccl32 on [copper](https://www.sharcnet.ca/my/systems/show/108)]
 
 | Model | 1GPU | 2GPU | 4GPU | 8GPU |
 | :---: | :---: | :---: | :---: | :---: |
 | AlexNet-128b | 20.50 | 10.35+0.78 | 5.13+0.54 | 2.63+0.61 |
 | GoogLeNet-32b | 63.89 | 31.40+1.00 | 15.51+0.71 | 7.69+0.80 |
 | VGG16-16b | 358.29 | 176.08+13.90 | 90.44+9.28 | 55.12+12.59 |
-| VGG16-32b | 343.37 | 169.12+7.14 | 86.97++4.80 | 43.29+5.41 |
+| VGG16-32b | 343.37 | 169.12+7.14 | 86.97+4.80 | 43.29+5.41 |
 | ResNet50-64b | 163.15 | 80.09+0.81 | 40.25+0.56 | 20.12+0.57 |
 
+More details on the benchmark can be found in this [notebook](https://github.com/uoguelph-mlrg/Theano-MPI/blob/master/examples/speedup-n_workers.ipynb).
+
 <img src=https://github.com/uoguelph-mlrg/Theano-MPI/raw/master/show/val_a.png width=500/>
 <img src=https://github.com/uoguelph-mlrg/Theano-MPI/raw/master/show/val_g.png width=500/>
```
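As a quick sanity check on the table above: each multi-GPU entry is training time plus communication time, so the implied speedup is the 1-GPU time divided by their sum. A minimal sketch of that arithmetic for the AlexNet-128b row, with the values copied straight from the table:

```python
# Speedup implied by the AlexNet-128b row of the table above.
# All times are seconds per 5120 images; multi-GPU entries are train + comm.
t_1gpu = 20.50
t_8gpu = 2.63 + 0.61              # training + communication on 8 GPUs
speedup = t_1gpu / t_8gpu         # ~6.3x over a single GPU
efficiency = speedup / 8.0        # ~79% scaling efficiency
print('speedup: %.1fx, efficiency: %.0f%%' % (speedup, efficiency * 100))
```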

```diff
@@ -123,7 +127,7 @@ Time per 5120 images in seconds: [allow_gc = True]
 
 * You may want to use those helper functions in `/theanompi/lib/opt.py` to construct optimizers in order to avoid common pitfalls mentioned in (#22) and get better convergence.
 
-* To get the best running speed performance, the memory cache may need to be cleaned before running.
+* Binding cores according to your NUMA topology may give better performance. Try the `-bind` option with the launcher (needs the [hwloc](https://www.open-mpi.org/projects/hwloc/) dependency).
 
 * Binding cores according to your NUMA topology may give better performance. Try the `-bind` option with the launcher (needs the [hwloc](https://www.open-mpi.org/projects/hwloc/) dependency).
```
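The first tip above points at `theanompi/lib/opt.py` without showing it, and issue #22 is not quoted in this diff, so the following is only a hedged sketch of the kind of pitfall such optimizer helpers typically guard against: building momentum SGD with the learning rate held in a Theano shared variable, so a warmup or decay schedule can adjust it between iterations without recompiling the training function. The function and variable names here are illustrative assumptions, not the module's actual API.

```python
# Hypothetical sketch (not the real opt.py API): momentum-SGD updates with the
# learning rate in a shared variable, so a warmup/decay schedule can call
# lr.set_value(...) at run time without recompiling the train function.
import numpy as np
import theano

lr = theano.shared(np.float32(0.01))  # adjust via lr.set_value(...) per schedule

def momentum_sgd_updates(params, grads, lr, mu=0.9):
    updates = []
    for p, g in zip(params, grads):
        v = theano.shared(np.zeros_like(p.get_value()))  # per-param momentum buffer
        v_new = mu * v - lr * g
        updates.append((v, v_new))      # update the buffer
        updates.append((p, p + v_new))  # take the step
    return updates
```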
27 changes: 14 additions & 13 deletions examples/speedup-n_workers.ipynb

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion theanompi/models/alex_net.py
```diff
@@ -520,7 +520,6 @@ def val_iter(self, count,recorder):
 
         else:
 
-
             img_mean = self.data.rawdata[4]
             img_std = self.data.rawdata[5]
             import hickle as hkl
```
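The four model files in this commit all touch the same `val_iter` code path, whose visible context shows the shared pattern: a validation batch is loaded from a hickle file and standardized with the dataset-wide mean and std kept in `data.rawdata`. A minimal sketch of that pattern follows; the file path and array layout are assumptions for illustration, not the repo's exact code:

```python
# Sketch of the val_iter pattern visible in these hunks: batches are hickle
# (HDF5) dumps, standardized with the dataset mean/std. The path format and
# array layout below are illustrative assumptions.
import hickle as hkl

def load_val_batch(path, img_mean, img_std):
    batch = hkl.load(path).astype('float32')  # e.g. shape (3, 256, 256, batch_size)
    return (batch - img_mean) / img_std       # standardize as in the models above

# usage, mirroring the context lines above:
# img_mean, img_std = self.data.rawdata[4], self.data.rawdata[5]
# batch = load_val_batch('/path/to/a/val_batch.hkl', img_mean, img_std)
```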
1 change: 0 additions & 1 deletion theanompi/models/googlenet.py
```diff
@@ -866,7 +866,6 @@ def val_iter(self, count,recorder):
 
         else:
 
-
             img_mean = self.data.rawdata[4]
             img_std = self.data.rawdata[5]
             import hickle as hkl
```
3 changes: 1 addition & 2 deletions theanompi/models/lasagne_model_zoo/resnet50.py
```diff
@@ -597,8 +597,7 @@ def val_iter(self, count,recorder):
 
 
         else:
-
-
+
            img_mean = self.data.rawdata[4]
            img_std = self.data.rawdata[5]
 
```
5 changes: 2 additions & 3 deletions theanompi/models/lasagne_model_zoo/vgg16.py
```diff
@@ -392,7 +392,7 @@ def train_iter(self, count,recorder):
 
 
         else:
-
+
            img_mean = self.data.rawdata[4]
            img_std = self.data.rawdata[5]
            import hickle as hkl
@@ -492,8 +492,7 @@ def val_iter(self, count,recorder):
 
 
         else:
-
-
+
            img_mean = self.data.rawdata[4]
            img_std = self.data.rawdata[5]
            import hickle as hkl
```
