tensorboard won't deploy #59
Comments
So, I just tested GKE cluster creation and tensorboard deployed without a problem... I know you've observed the hanging during image pulling before too, @willgraf. Maybe it's an intermittent issue? I looked here. In general, is there a best practice regarding whether or not to pull? Until I hear back from you, @willgraf, I'm just going to ignore this as long as it agrees to stay hidden.
You can set the timeout in the helmfile.
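For reference, a minimal sketch of what that could look like. This is not this project's actual helmfile: the release name, chart path, and the specific timeout values are assumptions chosen for illustration. It relies on helmfile's `helmDefaults.timeout` and per-release `timeout` settings, both given in seconds.

```yaml
# helmfile.yaml (illustrative sketch only)
helmDefaults:
  wait: true       # block until the release's resources report ready
  timeout: 1200    # seconds; default for every release (assumed value, ~20 minutes)

releases:
  - name: tensorboard             # hypothetical release name
    chart: ./charts/tensorboard   # hypothetical chart path
    timeout: 1800                 # per-release override, in seconds (assumed value)
```

A per-release override keeps the longer wait limited to the slow deployment instead of raising the limit for everything. It only works around the slow image pull; it does not explain it.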
Also, when calling |
The tensorboard instance seems to run well; however, it cannot get the data from the bucket. This may be due to a conflict with the NodeJS/Express server, since they use the same routing. In the tensorboard console I see several errors:
and
and
Maybe these errors are causing tensorboard to stop loading data (or are evidence of tensorboard fetching the data but being unable to render it). This Stack Overflow post makes me think it may have to do with our express server and our tensorboard being on the same load balancer. I can't explain how that would cause a failure, but the top answer describes a setup similar to our own.

UPDATE: This is an ingress problem, caused by not including the trailing "/" in the URL. The ingress problem is unrelated to the current issue of tensorboard not deploying.
Resolved by #70 |
The tensorboard code that we merged into `master` this past week isn't fully functional. I observed, when deploying the current `master` branch on GKE, that the helmfile deployment would ultimately fail because the `tensorboard` deployment held things up long enough for helmfile to time out. Upon closer inspection, it looks like the tensorboard container gets stuck pulling the image for up to 20 minutes... If there's no problem pulling this image (`tensorflow/tensorflow:latest`) in other settings, then I suspect this is some sort of cluster resource issue, perhaps insufficient disk space on some node.