Merge pull request #93 from kkruups/patch-1
Update of multi_device_training.rst
KazukiYoshiyama-sony committed Jan 19, 2018
2 parents 37b58b6 + fc9f1a2 commit 1551d89
Showing 1 changed file with 9 additions and 9 deletions.
18 changes: 9 additions & 9 deletions doc/python/tutorial/multi_device_training.rst
@@ -7,11 +7,11 @@ multiple devices. It is normally used for gradients exchange in data
parallel distributed training. Basically, there are two types of
distributed trainings in Neural Network literature: Data Parallel and
Model Parallel. Here we only focus on the former, Data Parallel
-Training. Data Parallel Distributed Training are based on the very
-simple equation in the optimization for a neural network called
+Training. Data Parallel Distributed Training is based on the very
+simple equation used for the optimization of a neural network called
(Mini-Batch) Stochastic Gradient Descent.

-In the oprimization process, the objective one tries to minimize is
+In the optimization process, the objective one tries to minimize is

.. math::
@@ -44,7 +44,7 @@ data points.
+ \frac{1}{B} \sum_{i=B \times (N-1) + 1}^{B \times N} \nabla_{\mathbf{w}} \ell (\mathbf{w}, \mathbf{x}_i)
\right)
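
A quick NumPy check of this identity, using a simple squared loss (an
illustrative sketch only; the loss, shapes, and seed here are arbitrary
choices, not part of the tutorial):

.. code:: python

    import numpy as np

    np.random.seed(0)
    B, N, D = 8, 4, 3                      # per-device batch size, devices, dims
    w = np.random.randn(D)
    x = np.random.randn(B * N, D)
    t = np.random.randn(B * N)

    def grad(w, x, t):
        # Gradient of 0.5 * mean((x @ w - t)**2) with respect to w.
        return x.T @ (x @ w - t) / len(t)

    # Full-batch gradient over all B * N points ...
    full = grad(w, x, t)
    # ... equals the average of the N per-device mini-batch gradients.
    per_device = [grad(w, x[i * B:(i + 1) * B], t[i * B:(i + 1) * B])
                  for i in range(N)]
    assert np.allclose(full, np.mean(per_device, axis=0))
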
-In data parallel distributed training, the follwoing steps are peformed
+In data parallel distributed training, the following steps are performed
according to the above equation,

1. each term, summation of derivatives (gradients) divided by batch size
@@ -70,7 +70,7 @@ Cluster on Ipython Clusters tab.
Launch client
-------------

-This codes are **only** needed for this turoial on **Jupyter Notebook**.
+This code is **only** needed for this tutorial via **Jupyter Notebook**.

.. code:: python
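
    # Illustrative sketch of a typical ipyparallel client launch; the "mpi"
    # profile name and these calls are assumptions, not the tutorial's
    # original cell.
    import ipyparallel as ipp

    rc = ipp.Client(profile='mpi')   # connect to the engines started above
    dview = rc[:]                    # a direct view over all engines
    dview.activate()                 # register the %%px magic for this view
    print(rc.ids)                    # one engine id per MPI process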
@@ -174,7 +174,7 @@ Create data points and a very simple neural network
pred = PF.affine(h, n_class, w_init=w_init)
loss = F.mean(F.softmax_cross_entropy(pred, y))
-**Important notice** here is that ``w_init`` is passed to parametric
+**Important to notice** here is that ``w_init`` is passed to parametric
functions to let the network on each GPU start from the same values of
trainable parameters in the optimization process.
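
For example, ``w_init`` can be built from a fixed seed so that every process
draws identical initial weights (a minimal sketch; the seed, limits, and layer
sizes here are assumptions, not the tutorial's actual values):

.. code:: python

    import numpy as np
    import nnabla as nn
    import nnabla.parametric_functions as PF
    import nnabla.initializer as I

    # Same seed on every process, so every GPU draws identical initial weights.
    rng = np.random.RandomState(310)
    w_init = I.UniformInitializer((-0.1, 0.1), rng=rng)

    x = nn.Variable([32, 128])
    pred = PF.affine(x, 10, w_init=w_init)  # identical starting point on each device
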

@@ -320,10 +320,10 @@ Update weights,
%%px
solver.update()
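
Putting the pieces together, one data-parallel step roughly combines the
gradient exchange with this update (a sketch only; ``comm``, ``loss``, and
``solver`` are assumed to be set up as earlier in the tutorial, and the
``all_reduce`` signature is an assumption):

.. code:: python

    %%px
    # Rough sketch of a single data-parallel training step.
    solver.zero_grad()
    loss.forward()
    loss.backward()
    # Average gradients across devices before applying the update.
    comm.all_reduce([p.grad for p in nn.get_parameters().values()],
                    division=True, inplace=True)
    solver.update()
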
-That's all for the usage of ``C.MultiProcessDataParalellCommunicator`` in the
-sense of Data Parallel Distributed Training.
+This concludes the usage of ``C.MultiProcessDataParalellCommunicator`` for
+Data Parallel Distributed Training.

-Now you got the picture of using ``C.MultiProcessDataParalellCommunicator``, go to
+Now that you have an understanding of how to use ``C.MultiProcessDataParalellCommunicator``, go to
the cifar10 example,

1. **multi\_device\_multi\_process\_classification.sh**
