Issue with lm.one_billion_wds.OneBWdsGPipeTransformer #68

Closed
Raviteja1996 opened this issue Apr 22, 2019 · 7 comments

@Raviteja1996

Hi, I am trying to run the above-mentioned model in Docker. I ran into an error with the following command:

**Command:** bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=4

I have a 4-GPU system, so I am using --worker_split_size=4. When I asked in issue #48 how to try out GPipe, I was given this command and was also asked to modify the OneBWdsGPipeTransformer hparams. I haven't made any hparams changes yet; is the following error because of that? If I do need to change something, could you tell me which hparams to change? I am also posting the error log below:

**Error log:**

err.txt
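
For reference, the usual way to change hparams for a registered Lingvo model is to subclass its params class and register the variant under a new name, following the pattern used throughout Lingvo's params files. The sketch below only illustrates that mechanism; the class name and the overridden field are assumptions for illustration, not values suggested anywhere in this thread.

```python
# Hypothetical sketch: register a variant of OneBWdsGPipeTransformer with
# locally overridden hparams. Names below are illustrative only.
from lingvo import model_registry
from lingvo.tasks.lm.params import one_billion_wds


@model_registry.RegisterSingleTaskModel
class OneBWdsGPipeTransformerCustom(one_billion_wds.OneBWdsGPipeTransformer):
  """OneBWdsGPipeTransformer with locally overridden hparams."""

  def Task(self):
    p = super(OneBWdsGPipeTransformerCustom, self).Task()
    # Example override (field assumed here for illustration):
    p.train.learning_rate = 1e-4
    return p
```

The variant would then be selected through the trainer's --model flag under its own registry key instead of lm.one_billion_wds.OneBWdsGPipeTransformer.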

@jonathanasdf
Contributor

Please try

bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --controller_gpus=4 --worker_gpus=4 --worker_split_size=4

(Having to specify controller_gpus is a bug that we will fix)

@jonathanasdf
Contributor

There also seems to be a failing assertion with that model right now; we will look into that too.

@Raviteja1996
Author

Hi, I tried the command you gave in the comment above. It progressed further, but at some point it hit Aborted (core dumped). I am attaching the error log:

**Error log:**

error.txt

@jonathanasdf
Contributor

Yes, there is currently an error with the model configuration. We are sorry about the problem and will update this issue when it is resolved.

@fangelyuan

@jonathanasdf
Hello, when I run /lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=4 --worker_gpus=4 --worker_split_size=4
I get the following error. Can you tell me how to resolve it?

```
I0530 07:26:44.508102 140140756334336 trainer.py:305] Load from checkpoint /tmp/mnist/log/train/ckpt-00000000.
I0530 07:26:44.509429 140140756334336 saver.py:1276] Restoring parameters from /tmp/mnist/log/train/ckpt-00000000
I0530 07:26:45.732462 140140747941632 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
     [[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 421, in Start
    self._RunLoop('trainer', self._Loop)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 196, in _RunLoop
    loop_func(*loop_args)
Traceback for above exception (most recent call last):
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 455, in _WaitTillInit
    global_step = sess.run(self._model.global_step)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 948, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1171, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1368, in _do_call
    raise type(e)(node_def, op, message)
```
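
A generic way to sanity-check the checkpoint the trainer is failing on (the path comes from the log above) is to list its variables with standard TensorFlow tooling and confirm that global_step was actually written before the earlier run aborted. This is only a diagnostic sketch, not a step suggested elsewhere in this thread.

```python
# Diagnostic sketch: list the variables stored in the checkpoint that the
# trainer tried to restore, to check whether global_step is present.
import tensorflow as tf

ckpt = '/tmp/mnist/log/train/ckpt-00000000'  # path taken from the log above
for name, shape in tf.train.list_variables(ckpt):
  print('%s %s' % (name, shape))
```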

@bignamehyp
Member

The VOCAB_SIZE was incorrectly set. We will fix it asap.
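
Once the fix lands, one generic way to confirm the corrected value is to dump the params registered under this model key and look at the vocabulary-related fields. This is only a sketch; the exact location of the vocab size inside the params tree is not spelled out here, so search the printed output for it.

```python
# Hedged sketch: print the params registered under the --model key used in
# this thread and inspect the vocabulary-related fields in the output.
from lingvo import model_registry
from lingvo.tasks.lm.params import one_billion_wds  # importing registers the LM models

p = model_registry.GetParams('lm.one_billion_wds.OneBWdsGPipeTransformer', 'Train')
print(p.task)
```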

@bignamehyp
Member

This issue should have been fixed. Please close it if there is no further issue.
