I am attempting to run the Librispeech training example (only the 100-hour data, not the 360 or 500). Lingvo begins training successfully and gets through one or two epochs. Then training stalls with no error or warning printed to the screen, and the process exits back to the command-line prompt (inside the Docker container).
Is this suggestive of an OOM? When I run nvidia-smi after this happens, that command stalls as well and no information about the GPU appears.
I am running the following command on a single Tesla K80 on Google Cloud:
bazel run -c opt --config=cuda //lingvo:trainer -- --logtostderr \
  --model=asr.librispeech.Librispeech960Grapheme --mode=sync \
  --logdir=/tmp/lingvo/log --saver_max_to_keep=2 \
  --run_locally=gpu 2>&1 | tee run.log
Any ideas on how to debug this? Is there a setting I could tweak to fix it?
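As a generic first check (not Lingvo-specific), a host-side OOM kill usually leaves a trace in the kernel log, and host and GPU memory can be watched from a second shell while training runs; a minimal sketch with standard tools:
  # Look for the kernel OOM killer after the trainer dies.
  dmesg | grep -i -E 'out of memory|oom'
  # Watch host and GPU memory every few seconds during training.
  watch -n 5 'free -h; nvidia-smi'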
In librispeech.py, I have halved p.bucket_batch_limit (to [48, 24, 24, 24, 24, 24, 24, 24]) and modified
def Train(self):
  p = self._CommonInputParams(is_eval=False)
  p.file_datasource.file_pattern = 'train/train.tfrecords-*'
  p.num_samples = 28539
  return p
so that num_samples agrees with the reduced number of samples in the LibriSpeech 100-hour data, as opposed to the full 960 hours.
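For reference, the bucket_batch_limit change itself is just the one line below (a sketch showing only the changed line; in librispeech.py it sits with the other bucket settings, presumably inside _CommonInputParams):
  p.bucket_batch_limit = [48, 24, 24, 24, 24, 24, 24, 24]  # halved from the defaults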
It's also possible that it's a CPU OOM due to the input pipeline. In addition to making the bucket_batch_limit even smaller, you can also try setting file_buffer_size=1 (default 10000) and file_parallelism=1 (default 16).
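For example, combined with the edits already described, the Train() override might look roughly like this (a minimal sketch, assuming file_buffer_size and file_parallelism are set directly on the same input params object as num_samples; check the attribute names against your Lingvo version):
  def Train(self):
    p = self._CommonInputParams(is_eval=False)
    p.file_datasource.file_pattern = 'train/train.tfrecords-*'
    p.num_samples = 28539
    # Shrink the input pipeline's memory footprint.
    p.file_buffer_size = 1   # default 10000
    p.file_parallelism = 1   # default 16
    return p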
Here is the output I get from the training: https://github.com/argideritzalpea/lingvo/blob/master/run.log