Slow training due to TensorBoard verbosity #26

Closed
Flandan opened this issue May 9, 2017 · 1 comment
Flandan commented May 9, 2017

Hey @Sentdex, watching the latest YouTube video got me wondering why your two Titans were training at the same rate. I thought it might be a CPU bottleneck somewhere in the code, so I had a look through, and the only obvious thing I could see was the TensorBoard output.

After a quick test with short training sessions, changing the verbosity from 2 to 0 gave me around a 20x speedup.
The relevant line in alexnet.py:

model = tflearn.DNN(network, checkpoint_path='model_alexnet',
                    max_checkpoints=1, tensorboard_verbose=2, tensorboard_dir='log')

To:

model = tflearn.DNN(network, checkpoint_path='model_alexnet',
                    max_checkpoints=1, tensorboard_verbose=0, tensorboard_dir='log')

I can now train a reasonable model in under an hour. My guess is that all the data processing/saving for TensorBoard happens on the CPU, with an extra slowdown from writing it to the drive, so it can't keep up. If you don't need the extra verbosity, it's a good idea to turn it down.

If anyone can confirm this I would appreciate it; I was a bit shocked and skeptical when I saw the epochs flying by!

Edit: It also results in a log file that is a few megabytes instead of a few hundred, and TensorBoard no longer struggles to load it. Hopefully that will stop those crashes too.


Sentdex (Owner) commented May 9, 2017

This is a great point. I like the extra data from TensorBoard, but this will indeed make a huge difference for our larger model that takes days/weeks to train. I'll go ahead and change this in the official code. It's good to leave TensorBoard verbose when first validating a model, but a bad idea for the long haul.
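
For anyone who wants both behaviors, here is a minimal sketch of one way to switch between them. The DEBUG flag is just illustrative (it is not part of the repo), and tflearn is assumed to be imported and network to be the AlexNet graph defined earlier in alexnet.py:

# True while first validating a model, False for long training runs
DEBUG = False

model = tflearn.DNN(network, checkpoint_path='model_alexnet',
                    max_checkpoints=1,
                    tensorboard_verbose=2 if DEBUG else 0,
                    tensorboard_dir='log')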

Sentdex closed this as completed May 9, 2017