dlibenzi (Collaborator) commented Sep 5, 2019

… to be setup.

Use a single init for TPU mesh setup.

dlibenzi requested a review from jysohn23 on September 5, 2019, 18:59

jysohn23 (Collaborator) left a comment

Have we tested this works on a pod slice?

dlibenzi (Collaborator, Author) commented Sep 5, 2019

> Have we tested this works on a pod slice?

I did on Borg.

dlibenzi merged commit cd8743c into master on Sep 6, 2019

jysohn23 (Collaborator) commented Sep 6, 2019

> > Have we tested this works on a pod slice?
>
> I did on Borg.

Bad news, I get the error message:

2019-09-06 04:18:22.859156: E tensorflow/compiler/xla/xla_client/xla_util.cc:72] The TPU system has not been initialized.

Full client-side logs: https://gist.github.com/jysohn23/d6631b3cde065500ebf9a4d462492ef4.

Have you tried testing with a slice that had not been initialized before? On a TPU slice that had already been initialized the old way (N^2 inits), the single-init approach worked for me, which was misleading. Would you mind reproducing the failure on a fresh slice? (Not that it matters, but I tried on v3-32 and v3-512.)

dlibenzi (Collaborator, Author) commented Sep 6, 2019 via email

jysohn23 (Collaborator) commented Sep 6, 2019

> Yes, freshly handed out by Borg. I pushed the CL this afternoon, are you sure it has been picked up by nightly?

Yes, I triggered an on-demand nightly build after your PR was merged and checked pip freeze, and we don't see the serial inits one by one on each worker.

dlibenzi (Collaborator, Author) commented Sep 6, 2019 via email

jysohn23 (Collaborator) commented Sep 6, 2019

> PR? You meant CL? The TF mesh CL needs to be in the TFRC servers on the TPU VMs.

Oh, yes, you are right, thanks for reminding me. I forgot about the TFRC update, of course! I'll try this tomorrow.
