tensorboard won't deploy #59

Closed
dylanbannon opened this issue Nov 11, 2018 · 6 comments
Labels
bug Something isn't working question Further information is requested

Comments

@dylanbannon
Contributor

The tensorboard code that we merged into master this past week isn't fully functional. When deploying the current master branch on GKE, I observed that the helmfile deployment would ultimately fail because the tensorboard deployment held things up long enough for helmfile to time out. On closer inspection, it looks like the tensorboard container gets stuck pulling the image for up to 20 minutes. If there's no problem pulling this image (tensorflow/tensorflow:latest) in other settings, then I suspect this is some sort of cluster resource issue, perhaps insufficient disk space on some node.

@dylanbannon dylanbannon added the bug Something isn't working label Nov 11, 2018
@dylanbannon
Contributor Author

So, I just tested GKE cluster creation and tensorboard deployed without a problem... I know you've observed the hanging during image pulling before too, @willgraf. Maybe it's an intermittent issue?

I looked here
https://hub.docker.com/r/tensorflow/tensorflow/tags/
and noticed that the latest image for TensorFlow is only 480MB. I remember it being 3GB, though, right?

In general, is there a best practice regarding whether or not to pull latest versions of Docker images?

Until I hear back from you, @willgraf, I'm just going to ignore this as long as it agrees to stay hidden.

@dylanbannon dylanbannon added the question Further information is requested label Nov 11, 2018
@osterman
Contributor

You can set the timeout in the helmfile.

helmDefaults:
  timeout: 1200

@osterman
Contributor

Also, when calling helm directly you can pass --timeout=1200, which is what the helmDefaults timeout setting in the helmfile passes for you.

@willgraf
Contributor

willgraf commented Nov 16, 2018

The tensorboard instance seems to run well; however, it cannot get the data from the bucket. This may be due to some conflict with the NodeJS/Express server, since they use the same routing.

In the console of tensorboard I see several errors:

Failed to decode downloaded font: http://35.230.25.91/font-roboto/oMMgfZMQthOryQo9n22dcuvvDin1pK8aKteLpeZ5c0A.woff2

and

Uncaught SyntaxError: Unexpected token < in JSON at position 0
    at JSON.parse (<anonymous>)
    at XMLHttpRequest.req.onload (tensorboard:39466)

and

OTS parsing error: invalid version tag

Maybe these errors are causing tensorboard to stop loading data (or are evidence of tensorboard fetching the data but being unable to render it).
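
For what it's worth, the "Unexpected token <" error is what JSON.parse typically raises when the response body is an HTML page rather than JSON, which would fit a data request being answered by the wrong server. A minimal sketch of that behavior follows; the helper name and both payloads are invented for illustration and are not taken from the tensorboard source:

```typescript
// Hypothetical helper: reports whether a response body is valid JSON.
function parsesAsJson(body: string): boolean {
  try {
    JSON.parse(body);
    return true;
  } catch (err) {
    // JSON.parse throws a SyntaxError on malformed input; rethrow anything else.
    if (err instanceof SyntaxError) {
      return false;
    }
    throw err;
  }
}

// Invented payloads for illustration only.
const jsonBody = '{"scalars": []}';
const htmlBody = "<!DOCTYPE html><html><body>Not Found</body></html>";

console.log(parsesAsJson(jsonBody)); // true
console.log(parsesAsJson(htmlBody)); // false
```

If the same check fails against the body that tensorboard actually receives, that would suggest the request is reaching something that serves HTML, not the tensorboard data endpoint.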

This Stack Overflow post makes me think it may have to do with our express server and our tensorboard being on the same load balancer. I can't explain how that would cause a failure, but the top answer describes a setup similar to our own.


UPDATE: This issue is due to an ingress problem. When the trailing "/" is left off the URL /tensorboard/, the express server attempts to process the page as well as the tensorboard server. This is not quite understood yet, but there is a separate issue for it, #68.

This ingress issue is unrelated to the current issue of tensorboard not deploying.
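
One possible reading of the trailing-slash behavior, sketched under the assumption that the ingress matches paths by plain string prefix on /tensorboard/ and sends everything else to the express server. The routing table, backend names, and matching logic below are all hypothetical; the actual ingress rules are not shown in this thread:

```typescript
// Hypothetical routing rule: the prefix and backend names are assumptions.
type Rule = { prefix: string; backend: string };

const rules: Rule[] = [{ prefix: "/tensorboard/", backend: "tensorboard" }];
const defaultBackend = "express";

// Returns the backend of the first rule whose prefix starts the path,
// falling back to the default backend when no rule matches.
function pickBackend(path: string): string {
  for (const rule of rules) {
    if (path.indexOf(rule.prefix) === 0) {
      return rule.backend;
    }
  }
  return defaultBackend;
}

console.log(pickBackend("/tensorboard/")); // tensorboard
console.log(pickBackend("/tensorboard")); // express
console.log(pickBackend("/tensorboard/data/runs")); // tensorboard
```

Under that assumption, a request for /tensorboard without the slash never matches the tensorboard rule, so the express server answers it. Whether the real ingress behaves this way is for #68 to establish.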

@willgraf
Contributor

willgraf commented Nov 17, 2018

I think this issue will be resolved by #70 as it tackled many of the tensorboard issues.


UPDATE: I believe this issue has been resolved by #70. I'll wait a few days, but if there is no further activity, this issue will be closed.

@willgraf
Contributor

Resolved by #70

3 participants