Exits training #206

Open

argideritzalpea opened this issue Apr 23, 2020 · 1 comment

argideritzalpea commented Apr 23, 2020

I am attempting to run the LibriSpeech training example using only the 100-hour subset (not the 360- or 500-hour sets). Lingvo successfully begins training and gets through one or two epochs, but then training stalls with no error or warning printed to the screen, and the process exits back to the command-line prompt (inside the Docker container).

Here is the output I get from the training: https://github.com/argideritzalpea/lingvo/blob/master/run.log

Is this suggestive of an OOM? When I run nvidia-smi after this happens, the command hangs and no GPU information is printed.

I am running the following command on a single Tesla K80 on Google Cloud: `bazel run -c opt --config=cuda //lingvo:trainer -- --logtostderr --model=asr.librispeech.Librispeech960Grapheme --mode=sync --logdir=/tmp/lingvo/log --saver_max_to_keep=2 --run_locally=gpu 2>&1 |& tee run.log`. Any ideas on how to debug this? Is there a setting I could tweak to fix it?
In librispeech.py, I have halved p.bucket_batch_limit (to `[48, 24, 24, 24, 24, 24, 24, 24]`) and modified

def Train(self):
    p = self._CommonInputParams(is_eval=False)
    p.file_datasource.file_pattern = 'train/train.tfrecords-*'
    p.num_samples = 28539
    return p

so that num_samples matches the smaller LibriSpeech 100-hour training set rather than the full 960 hours.
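
For completeness, the combined change looks roughly like this (a sketch of my edit rather than a verbatim diff; the exact place where bucket_batch_limit is set inside librispeech.py may differ):

```python
# Sketch of my edits to librispeech.py: halved bucket batch limits plus
# the reduced sample count for the 100-hour subset.
def Train(self):
    p = self._CommonInputParams(is_eval=False)
    p.file_datasource.file_pattern = 'train/train.tfrecords-*'
    p.num_samples = 28539  # train-clean-100 only, instead of the full 960h set
    p.bucket_batch_limit = [48, 24, 24, 24, 24, 24, 24, 24]  # halved
    return p
```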

jonathanasdf (Contributor)

I think that's likely some kind of OOM, but I have no idea why it would just quit without printing any kind of error.

If it runs fine with --run_locally=cpu, then I guess it's a GPU OOM.

You can try setting report_tensor_allocations_upon_oom=True in the trainer's sess.run call, i.e. the one that starts with:

    (_, eval_metrics, per_example_tensors) = sess.run([

Example: tensorflow/tensorflow#17076
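
For reference, here is a minimal, self-contained sketch of how that option is typically passed to a TF1-style sess.run (the toy graph below is just a placeholder; in Lingvo you would add the same `options=` argument to the existing sess.run call quoted above):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Ask TensorFlow to report which tensors were holding memory if an
# allocation fails, instead of just aborting with a bare OOM error.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Toy graph standing in for the trainer's fetches.
x = tf.random.normal([1024, 1024])
y = tf.matmul(x, x)

with tf.Session() as sess:
  # In the trainer, the same options=run_options argument would be added
  # to the sess.run call shown above.
  result = sess.run(y, options=run_options)
  print(result.shape)
```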

It's also possible that it's a CPU OOM due to the input pipeline. In addition to making the bucket_batch_limit even smaller, you can also try setting file_buffer_size=1 (default 10000) and file_parallelism=1 (default 16).
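
Roughly, applied to the Train() params shown earlier (a sketch; these are fields on the input generator params, so the assumption here is that they can be set directly on p):

```python
# Sketch: shrink the input pipeline's host-memory footprint, on top of a
# smaller bucket_batch_limit, by buffering fewer records and reading
# fewer files in parallel.
def Train(self):
    p = self._CommonInputParams(is_eval=False)
    p.file_datasource.file_pattern = 'train/train.tfrecords-*'
    p.num_samples = 28539
    p.file_buffer_size = 1   # default 10000
    p.file_parallelism = 1   # default 16
    return p
```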
