Hey @Sentdex, watching your latest YouTube video got me wondering why your two Titans were training at the same rate. I thought there might be a CPU bottleneck somewhere in the code, so I had a look through, and the only obvious culprit I could see was the TensorBoard output.
After a quick test with short training sessions, changing the verbosity from 2 to 0 got me around a 20x speedup.
The line in alexnet.py:
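The change is in the tflearn.DNN call (a reconstructed sketch; the surrounding arguments here may differ from the repo, but tensorboard_verbose is the one that matters). From:

```python
# 'network' is the AlexNet graph built earlier in alexnet.py.
model = tflearn.DNN(network, checkpoint_path='model_alexnet',
                    max_checkpoints=1,
                    tensorboard_verbose=2,  # also logs gradients and weights each step
                    tensorboard_dir='log')
```

To:

```python
model = tflearn.DNN(network, checkpoint_path='model_alexnet',
                    max_checkpoints=1,
                    tensorboard_verbose=0,  # logs only loss and accuracy (best speed)
                    tensorboard_dir='log')
```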
I can now train a reasonable model in under an hour. My guess is that all the data processing and saving for TensorBoard happens on the CPU, with the added slowdown of writing it to the drive, so it can't keep up. If you don't need the extra verbosity, it's a good idea to turn it down.
If anyone can confirm this, I'd appreciate it; I was a bit shocked and skeptical when I saw the epochs flying by!
Edit: it also results in a log file that is a few megabytes instead of a few hundred, and TensorBoard no longer struggles to load it. Hopefully that will stop those crashes too.
This is a great point. I like the extra data from TensorBoard, but this will indeed make a huge difference for our larger model that takes days/weeks to train. I'll go ahead and change this in the official code. It's good to leave TensorBoard verbose when first validating a model, but a bad idea for the long haul.
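For the official code, a simple way to keep both modes on hand is to gate the verbosity behind a flag. A minimal sketch, assuming tflearn; the VALIDATING flag and the stand-in network are illustrative, not the repo's actual code:

```python
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression

# Illustrative toggle: True while sanity-checking a new architecture,
# False for long training runs to avoid the TensorBoard CPU/disk cost.
VALIDATING = False

# Stand-in graph; alexnet.py builds the full AlexNet here instead.
net = input_data(shape=[None, 80, 60, 1])
net = fully_connected(net, 9, activation='softmax')
net = regression(net, optimizer='momentum', loss='categorical_crossentropy')

# tensorboard_verbose levels in tflearn: 0 = loss/accuracy only (fastest),
# 2 also records gradients and weights, 3 adds activations and sparsity.
model = tflearn.DNN(net,
                    tensorboard_verbose=2 if VALIDATING else 0,
                    tensorboard_dir='log')
```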