Exits training #206

Open

argideritzalpea opened this issue Apr 23, 2020 · 1 comment

argideritzalpea commented Apr 23, 2020

I am attempting to run the LibriSpeech training example using only the 100-hour subset (not the 360- or 500-hour sets). Lingvo successfully begins training and gets through one or two epochs, but then training stalls with no error or warning printed to the screen, and the process exits back to the command-line prompt (inside the Docker container).

Here is the output I get from the training: https://github.com/argideritzalpea/lingvo/blob/master/run.log

Is this suggestive of an OOM? When I run nvidia-smi after this happens, the command hangs and no GPU information is printed.

I am running the following command on a single Tesla K80 on Google Cloud: `bazel run -c opt --config=cuda //lingvo:trainer -- --logtostderr --model=asr.librispeech.Librispeech960Grapheme --mode=sync --logdir=/tmp/lingvo/log --saver_max_to_keep=2 --run_locally=gpu 2>&1 |& tee run.log`. Any ideas on how to debug this? Is there a setting I could tweak to fix it?
In librispeech.py, I have halved p.bucket_batch_limit (to `[48, 24, 24, 24, 24, 24, 24, 24]`) and modified

def Train(self):
    p = self._CommonInputParams(is_eval=False)
    p.file_datasource.file_pattern = 'train/train.tfrecords-*'
    p.num_samples = 28539
    return p

so that num_samples matches the smaller LibriSpeech 100-hour training set rather than the full 960 hours.
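
For completeness, the combined change looks roughly like this (a sketch of my edit rather than a verbatim diff; the exact place where bucket_batch_limit is set inside librispeech.py may differ):

```python
# Sketch of my edits to librispeech.py: halved bucket batch limits plus
# the reduced sample count for the 100-hour subset.
def Train(self):
    p = self._CommonInputParams(is_eval=False)
    p.file_datasource.file_pattern = 'train/train.tfrecords-*'
    p.num_samples = 28539  # train-clean-100 only, instead of the full 960h set
    p.bucket_batch_limit = [48, 24, 24, 24, 24, 24, 24, 24]  # halved
    return p
```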

jonathanasdf (Contributor)

I think that's likely some kind of OOM, but I have no idea why it would just quit without printing any kind of error.

If it runs fine with --run_locally=cpu, then I guess it's a GPU OOM.

You can try setting report_tensor_allocations_upon_oom=True in the trainer's sess.run call, i.e. the one that starts with:

    (_, eval_metrics, per_example_tensors) = sess.run([

Example: tensorflow/tensorflow#17076
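
For reference, here is a minimal, self-contained sketch of how that option is typically passed to a TF1-style sess.run (the toy graph below is just a placeholder; in Lingvo you would add the same `options=` argument to the existing sess.run call quoted above):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Ask TensorFlow to report which tensors were holding memory if an
# allocation fails, instead of just aborting with a bare OOM error.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Toy graph standing in for the trainer's fetches.
x = tf.random.normal([1024, 1024])
y = tf.matmul(x, x)

with tf.Session() as sess:
  # In the trainer, the same options=run_options argument would be added
  # to the sess.run call shown above.
  result = sess.run(y, options=run_options)
  print(result.shape)
```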

It's also possible that it's a CPU OOM due to the input pipeline. In addition to making the bucket_batch_limit even smaller, you can also try setting file_buffer_size=1 (default 10000) and file_parallelism=1 (default 16).
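
Roughly, applied to the Train() params shown earlier (a sketch; these are fields on the input generator params, so the assumption here is that they can be set directly on p):

```python
# Sketch: shrink the input pipeline's host-memory footprint, on top of a
# smaller bucket_batch_limit, by buffering fewer records and reading
# fewer files in parallel.
def Train(self):
    p = self._CommonInputParams(is_eval=False)
    p.file_datasource.file_pattern = 'train/train.tfrecords-*'
    p.num_samples = 28539
    p.file_buffer_size = 1   # default 10000
    p.file_parallelism = 1   # default 16
    return p
```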
