Add AllReduce distributed strategy design #373

Open · wants to merge 5 commits into develop

Conversation

@QiJune (Collaborator) commented Oct 27, 2020

Here it is for better review.

@QiJune changed the title from "[WIP] Add AllReduce distributed strategy design" to "Add AllReduce distributed strategy design" on Oct 28, 2020
codecov bot commented Oct 28, 2020

Codecov Report

Merging #373 into develop will decrease coverage by 0.06%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           develop     #373      +/-   ##
===========================================
- Coverage    87.75%   87.69%   -0.07%     
===========================================
  Files           33       33              
  Lines         1503     1503              
===========================================
- Hits          1319     1318       -1     
- Misses         121      122       +1     
  Partials        63       63              
| Impacted Files | Coverage Δ |
|---|---|
| vision/imageloader/imageloader.go | 90.41% <0.00%> (-0.69%) ⬇️ |


Single-process multi-GPU is not the recommended mode,
because of its overhead of scatter/gather and GIL contention in every forward pass.
So, let's focus on DistributedDataParallel.
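
For reference, here is a minimal PyTorch sketch (not GoTorch code) of the DistributedDataParallel mode recommended above: one process per GPU, launched by torchrun, with gradients synchronized via AllReduce rather than per-iteration scatter/gather.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).cuda(local_rank)
    # Gradients are synchronized with AllReduce during backward();
    # there is no per-iteration scatter of inputs or gather of outputs.
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):
        optimizer.zero_grad()
        x = torch.randn(32, 10).cuda(local_rank)
        loss = ddp_model(x).sum()
        loss.backward()  # AllReduce of gradients happens here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```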
Collaborator:

GoTorch does not have a GIL. Does single-process multi-GPU mode fit GoTorch?

@QiJune (Collaborator, Author) replied on Oct 29, 2020:

The answer is no. There are two reasons (see the sketch after this list):

  • The overhead of scatter/gather is non-negligible. We once used scatter/parallel-do/gather to support multi-GPU AllReduce in Paddle with C++, and from that experience the speedup ratio was not very good.

  • Scatter/gather only works across the GPUs of a single node; it cannot scale to multi-node multi-GPU training.
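
For contrast, a minimal PyTorch sketch of the single-process multi-GPU pattern being argued against (nn.DataParallel): every forward pass scatters the input batch to per-GPU replicas and gathers the outputs back onto one GPU, and the whole scheme is confined to a single node.

```python
import torch
import torch.nn as nn

# Replicates the module across the listed GPUs inside one process.
model = nn.Linear(10, 10).cuda(0)
dp_model = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(64, 10).cuda(0)
y = dp_model(x)   # scatter(x) -> per-GPU forward -> gather(y) onto GPU 0
loss = y.sum()
loss.backward()   # gradients are reduced back onto the GPU-0 replica
```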
