Commit c428dc1

Merge remote-tracking branch 'origin/feature/20171211-file-format-converter' into 1-onnx-nnabla
Masato Hori committed Feb 15, 2018
2 parents: 2ba045a + b74bba4

Showing 5 changed files with 236 additions and 324 deletions.
doc/python/tutorial/multi_device_training.rst (18 changes: 9 additions & 9 deletions)

@@ -7,11 +7,11 @@ multiple devices. It is normally used for gradients exchange in data
parallel distributed training. Basically, there are two types of
distributed trainings in Neural Network literature: Data Parallel and
Model Parallel. Here we only focus on the former, Data Parallel
-Training. Data Parallel Distributed Training are based on the very
-simple equation in the optimization for a neural network called
+Training. Data Parallel Distributed Training is based on the very
+simple equation used for the optimization of a neural network called
(Mini-Batch) Stochastic Gradient Descent.

-In the oprimization process, the objective one tries to minimize is
+In the optimization process, the objective one tries to minimize is

.. math::
@@ -44,7 +44,7 @@ data points.
+ \frac{1}{B} \sum_{i=B \times (N-1) + 1}^{B \times N} \nabla_{\mathbf{w}} \ell (\mathbf{w}, \mathbf{x}_i)
\right)
-In data parallel distributed training, the follwoing steps are peformed
+In data parallel distributed training, the following steps are performed
according to the above equation,

1. each term, summation of derivatives (gradients) divided by batch size
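
As a concrete check of the averaging above, here is a minimal NumPy sketch
(the least-squares loss, shapes, and seed are made up for illustration)
showing that the mean of per-device mini-batch gradients equals the gradient
computed over the full batch of B x N data points:

.. code:: python

    import numpy as np

    # Toy least-squares loss, so the gradient has a closed form.
    rng = np.random.RandomState(0)
    N, B, D = 4, 8, 3      # devices, per-device batch size, feature dimension
    w = rng.randn(D)
    X = rng.randn(N * B, D)
    y = rng.randn(N * B)

    def grad(w, X, y):
        # Gradient of 0.5 * mean((X w - y)^2) with respect to w.
        return X.T.dot(X.dot(w) - y) / len(y)

    # Per-device gradients on disjoint mini-batches, averaged across devices ...
    per_device = [grad(w, X[n * B:(n + 1) * B], y[n * B:(n + 1) * B]) for n in range(N)]
    averaged = np.mean(per_device, axis=0)

    # ... match the gradient computed over the whole batch in one go.
    assert np.allclose(averaged, grad(w, X, y))
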
@@ -70,7 +70,7 @@ Cluster on Ipython Clusters tab.
Launch client
-------------

-This codes are **only** needed for this turoial on **Jupyter Notebook**.
+This code is **only** needed for this tutorial via **Jupyter Notebook**.

.. code:: python
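
For reference, launching such a client can look like the following minimal
sketch, assuming ``ipyparallel`` and a cluster running under an MPI profile
named ``mpi`` (the profile name and the ``activate()`` call are assumptions,
not taken from this tutorial):

.. code:: python

    import ipyparallel as ipp

    # Connect to the IPython cluster started from the Clusters tab.
    rc = ipp.Client(profile='mpi')   # profile name is an assumption
    dv = rc[:]                       # a DirectView over all engines
    dv.activate()                    # registers the %%px cell magic used below
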
@@ -174,7 +174,7 @@ Create data points and a very simple neural network
pred = PF.affine(h, n_class, w_init=w_init)
loss = F.mean(F.softmax_cross_entropy(pred, y))
-**Important notice** here is that ``w_init`` is passed to parametric
+**Important to notice** here is that ``w_init`` is passed to parametric
functions to let the network on each GPU start from the same values of
trainable parameters in the optimization process.
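
For instance, one simple way to make ``w_init`` identical on every process is
to draw it from a NumPy RNG seeded with the same value everywhere; the seed
and shapes below are only illustrative:

.. code:: python

    import numpy as np

    # Every process seeds the same RNG, so each one builds the identical
    # initial weight matrix (shapes here are hypothetical).
    n_hidden, n_class = 64, 10
    w_init = (np.random.RandomState(426).randn(n_hidden, n_class) * 0.01).astype(np.float32)

    # It is then handed to the parametric function exactly as in the snippet above:
    # pred = PF.affine(h, n_class, w_init=w_init)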

@@ -320,10 +320,10 @@ Update weights,
%%px
solver.update()
-That's all for the usage of ``C.MultiProcessDataParalellCommunicator`` in the
-sense of Data Parallel Distributed Training.
+This concludes the usage of ``C.MultiProcessDataParalellCommunicator`` for
+Data Parallel Distributed Training.
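
Written out, the step that ``solver.update()`` applies on every device after
the gradients have been exchanged is, assuming the plain SGD solver with
learning rate :math:`\eta` (other solvers modify this rule accordingly),

.. math::

   \mathbf{w} \leftarrow \mathbf{w}
     - \eta \, \frac{1}{N} \sum_{n=1}^{N}
       \frac{1}{B} \sum_{i=B \times (n - 1) + 1}^{B \times n}
       \nabla_{\mathbf{w}} \ell (\mathbf{w}, \mathbf{x}_i)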

-Now you got the picture of using ``C.MultiProcessDataParalellCommunicator``, go to
+Now that you have an understanding of how to use ``C.MultiProcessDataParalellCommunicator``, go to
the cifar10 example,

1. **multi\_device\_multi\_process\_classification.sh**
