Regrading Gpipe #48

Open
Raviteja1996 opened this issue Mar 19, 2019 · 7 comments
Comments

@Raviteja1996

Hi, I want to test how GPipe works. When I searched the web, I found the Lingvo repository. Could you tell me how to run it? I didn't find any documentation, so I was a little confused.

@drpngx
Contributor

drpngx commented Mar 19, 2019

@bignamehyp any further comments?

@bignamehyp
Member

We will provide better instructions for running GPipe in the near future.

An example of running GPipe is provided in the comments here:
https://github.com/tensorflow/lingvo/blob/master/lingvo/tasks/lm/params/one_billion_wds.py#L180.

Once you have modified the OneBWdsGPipeTransformer hparams, you can start the trainer on 8 GPUs:

bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=8
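
As a rough sketch only, the command above can be assembled for a different GPU count with a small Python helper. The function name `build_trainer_command` is hypothetical and not part of Lingvo. It assumes, as in this thread, that `--worker_split_size` is set to the number of GPUs. The model's hparams presumably have to be consistent with that number too, but which ones to change is not spelled out here.

```python
import shlex


def build_trainer_command(
    num_gpus,
    logdir="/tmp/mnist/log",
    model="lm.one_billion_wds.OneBWdsGPipeTransformer",
):
    """Return the trainer invocation from this thread as an argv list.

    Hypothetical helper: it only re-assembles the flags quoted above and
    assumes --worker_split_size should equal the number of GPUs.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "bazel-bin/lingvo/trainer",
        "--run_locally=gpu",
        "--mode=sync",
        f"--model={model}",
        f"--logdir={logdir}",
        "--logtostderr",
        f"--worker_split_size={num_gpus}",
    ]


if __name__ == "__main__":
    # Print the command for a 4-GPU machine without executing it.
    print(shlex.join(build_trainer_command(4)))
```

The returned list can be passed to `subprocess.run` from the Lingvo checkout once the trainer has been built with bazel.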

General instructions for installing and running Lingvo models are provided at
https://github.com/tensorflow/lingvo/blob/master/README.md

@WonderAndMaps

Will there be tutorials for image classification, e.g. AmoebaNet? Thanks

@Raviteja1996
Author

Raviteja1996 commented Apr 2, 2019

Hi, above you mentioned changing the OneBWdsGPipeTransformer hparams and then running on 8 GPUs, and gave the command to run. I did not understand what those parameters are. Could I get help with which parameters fit my system? I am using a machine with 4 GPUs. Whatever parameters I change, I get a segmentation fault (core dumped). I am also attaching my system info (GPU).

command : bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=4

segmentation fault.txt

system info:
GPU:
sys_info.txt

@Raviteja1996
Author

Hi, is there any update on my post above?

@bignamehyp
Member

Is it still an open issue?

@feiwang3311

feiwang3311 commented Jul 9, 2019

It would be great if more guidance (tutorials) could be offered for running GPipe on image classification models, such as the AmoebaNet models evaluated in the GPipe arXiv paper and blog :)

Fei
