Issue with lm.one_billion_wds.OneBWdsGPipeTransformer #68

Closed
Raviteja1996 opened this issue Apr 22, 2019 · 7 comments

@Raviteja1996

Hi, I am trying to run the above-mentioned model in Docker. I ran into an error with the following command:

**Command:** bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=4

I have a 4-GPU system, so I am using --worker_split_size=4. When I asked in issue #48 how to try out GPipe, I was given this command and was also asked to modify the OneBWdsGPipeTransformer hparams. I haven't made any hparams changes yet; is the following error because of that? If I do need to change something, could you tell me which hparams to change? I am also posting the error log below:

**Error log:**

err.txt
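
For reference, the usual way to change hparams for a registered Lingvo model is to subclass its params class and register the variant under a new name, following the pattern used throughout Lingvo's params files. The sketch below only illustrates that mechanism; the class name and the overridden field are assumptions for illustration, not values suggested anywhere in this thread.

```python
# Hypothetical sketch: register a variant of OneBWdsGPipeTransformer with
# locally overridden hparams. Names below are illustrative only.
from lingvo import model_registry
from lingvo.tasks.lm.params import one_billion_wds


@model_registry.RegisterSingleTaskModel
class OneBWdsGPipeTransformerCustom(one_billion_wds.OneBWdsGPipeTransformer):
  """OneBWdsGPipeTransformer with locally overridden hparams."""

  def Task(self):
    p = super(OneBWdsGPipeTransformerCustom, self).Task()
    # Example override (field assumed here for illustration):
    p.train.learning_rate = 1e-4
    return p
```

The variant would then be selected through the trainer's --model flag under its own registry key instead of lm.one_billion_wds.OneBWdsGPipeTransformer.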

@jonathanasdf
Contributor

Please try

bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --controller_gpus=4 --worker_gpus=4 --worker_split_size=4

(Having to specify controller_gpus is a bug that we will fix)

@jonathanasdf
Contributor

There also seems to be a failing assertion with that model right now; we will look into that too.

@Raviteja1996
Author

Hi, I tried the command you gave in the comment above. It progressed further, but at some point it hit Aborted (core dumped). I am attaching the error log:

**Error log:**

error.txt

@jonathanasdf
Contributor

Yes, there is currently an error with the model configuration. We are sorry about the problem and will update this issue when it is resolved.

@fangelyuan

@jonathanasdf
Hello, when I run /lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=4 --worker_gpus=4 --worker_split_size=4
I get the following error. Can you tell me how to resolve it?

```
I0530 07:26:44.508102 140140756334336 trainer.py:305] Load from checkpoint /tmp/mnist/log/train/ckpt-00000000.
I0530 07:26:44.509429 140140756334336 saver.py:1276] Restoring parameters from /tmp/mnist/log/train/ckpt-00000000
I0530 07:26:45.732462 140140747941632 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
     [[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 421, in Start
    self._RunLoop('trainer', self._Loop)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 196, in _RunLoop
    loop_func(*loop_args)
Traceback for above exception (most recent call last):
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 455, in _WaitTillInit
    global_step = sess.run(self._model.global_step)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 948, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1171, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1368, in _do_call
    raise type(e)(node_def, op, message)
```
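
A generic way to sanity-check the checkpoint the trainer is failing on (the path comes from the log above) is to list its variables with standard TensorFlow tooling and confirm that global_step was actually written before the earlier run aborted. This is only a diagnostic sketch, not a step suggested elsewhere in this thread.

```python
# Diagnostic sketch: list the variables stored in the checkpoint that the
# trainer tried to restore, to check whether global_step is present.
import tensorflow as tf

ckpt = '/tmp/mnist/log/train/ckpt-00000000'  # path taken from the log above
for name, shape in tf.train.list_variables(ckpt):
  print('%s %s' % (name, shape))
```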

@bignamehyp
Member

The VOCAB_SIZE was incorrectly set. We will fix it asap.
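
Once the fix lands, one generic way to confirm the corrected value is to dump the params registered under this model key and look at the vocabulary-related fields. This is only a sketch; the exact location of the vocab size inside the params tree is not spelled out here, so search the printed output for it.

```python
# Hedged sketch: print the params registered under the --model key used in
# this thread and inspect the vocabulary-related fields in the output.
from lingvo import model_registry
from lingvo.tasks.lm.params import one_billion_wds  # importing registers the LM models

p = model_registry.GetParams('lm.one_billion_wds.OneBWdsGPipeTransformer', 'Train')
print(p.task)
```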

@bignamehyp
Member

This issue should have been fixed. Please close it if there is no further issue.
