About Evaluator in TF_CONFIG #30121
Comments
I'm facing the same issue. I failed to run distributed evaluation when setting
I think it is clearer to have the evaluator job in the cluster spec if you want to run side-car evaluation. Does that work for Estimator + distribution strategy?
@yuefengz can you give an example of how they should set up their TF_CONFIG?
Yes, it is better to give an example. Thanks.
Another example would look like:
Thanks for your reply.
@yuefengz
I'm facing exactly the same issue. Any hints/answers?
@wangsiyu @liyi193328 @sahiltyagi4 I guess all of you have explicitly offered the
    if self._task_type == TaskType.MASTER:
        device_filters = ['/job:ps', '/job:master']
    elif self._task_type == TaskType.CHIEF:
        device_filters = ['/job:ps', '/job:chief']
    elif self._task_type == TaskType.WORKER:
        device_filters = ['/job:ps', '/job:worker/task:%d' % self._task_id]
    elif self._task_type == TaskType.PS:
        device_filters = ['/job:ps', '/job:worker', '/job:chief', '/job:master']
Thus, the solution to prevent unnecessary sync between all instances is easy. Just provide a correct device filter to your declared
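For readers asking what such a filter looks like in practice, the branching quoted above can be wrapped in a standalone helper. This is only a minimal sketch and not TensorFlow's own code: `default_device_filters` is a hypothetical name, the task types are assumed to be the plain strings `master`, `chief`, `worker` and `ps`, and returning `None` for any other task type (such as `evaluator`) is an assumption of this sketch rather than something stated in this thread.

```python
from typing import List, Optional


def default_device_filters(task_type: str, task_id: int = 0) -> Optional[List[str]]:
    """Hypothetical helper mirroring the branching quoted in the comment above."""
    if task_type == "master":
        return ["/job:ps", "/job:master"]
    if task_type == "chief":
        return ["/job:ps", "/job:chief"]
    if task_type == "worker":
        # A worker only sees the parameter servers and itself.
        return ["/job:ps", "/job:worker/task:%d" % task_id]
    if task_type == "ps":
        return ["/job:ps", "/job:worker", "/job:chief", "/job:master"]
    # Assumption of this sketch: no filter for any other task type.
    return None


if __name__ == "__main__":
    for task_type, task_id in [("chief", 0), ("worker", 2), ("evaluator", 0)]:
        print(task_type, task_id, default_device_filters(task_type, task_id))
```

The resulting list would then presumably be set on the `device_filters` field of the `tf.ConfigProto` that is passed as `session_config` to `RunConfig`; treat that wiring as an assumption to verify against your TensorFlow version.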
@shishaochen Got it, thanks a lot. Nice of you.
Hi, did you solve the problem? I ran into the same problem, but I don't understand how to set tf.ConfigProto. Can you give me an example? Thanks.
Agreed with @lebinlebin, the solution is unclear. Can someone explicitly provide an example of:
@yuefengz please take a look at the TF_CONFIG env issue in kubeflow/tf-operator
Hi @yuefengz, any update on this issue? kubeflow/training-operator#1139
Hi, I am using an older version of tf-operator, and I have the same issue. Has it been fixed yet?
Hey, did you find a solution? I hit the same issue with version 1.14. Another confusing thing is that my chief node has no variable files, just meta-graph files. The evaluator also told me "Estimator not trained yet". Thanks.
Hi there, we are checking to see if you still need help on this issue, as you are using an older version of TensorFlow (1.x), which is officially considered end of life. We recommend that you upgrade to 2.4 or a later version and let us know if the issue still persists in newer versions. We will get you the right help. Thanks!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.
System information
Describe the current behavior
The evaluator in TF_CONFIG confuses me. In the RunConfig document, I found that the evaluator should not be in cluster. For example,
This means the evaluator is not part of the training cluster, so it is not going to be in the cluster_spec.
However, in DistributionStrategy there is a check to find the evaluator node in cluster_spec. For example, in tensorflow/python/distribute/multi_worker_util.py, there is a function named _validate_cluster_spec.
That means if the task_type is evaluator, it should be in cluster_spec. This is inconsistent with what is stated in the document above. So what should TF_CONFIG look like when I want to use DistributionStrategy in training and single mode in evaluation? I tried many times but failed because of this check. It would be better to give an example of how to construct TF_CONFIG. Thanks very much.
@yuefengz @anj-s
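To make the two layouts under discussion concrete, here is a minimal sketch that builds both TF_CONFIG variants with the standard `cluster` / `task` JSON layout. The host addresses and the helper names `make_tf_config` and `task_in_cluster` are hypothetical. `task_in_cluster` is a simplified stand-in for the kind of check described above, not the actual `_validate_cluster_spec`. Whether the second variant works with Estimator plus a distribution strategy is the open question of this issue, and the sketch does not settle it.

```python
import json
import os

# Hypothetical host:port addresses, used only for illustration.
TRAINING_CLUSTER = {
    "chief": ["host0:2222"],
    "ps": ["host1:2222"],
    "worker": ["host2:2222", "host3:2222"],
}


def make_tf_config(cluster, task_type, task_index):
    """Serialize a TF_CONFIG value with the usual cluster/task layout."""
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


def task_in_cluster(tf_config_json):
    """Simplified stand-in for the check described above (not TensorFlow's code)."""
    config = json.loads(tf_config_json)
    task = config["task"]
    addresses = config["cluster"].get(task["type"], [])
    return 0 <= task["index"] < len(addresses)


# Variant A: the evaluator is outside the cluster, as the RunConfig document says.
evaluator_outside = make_tf_config(TRAINING_CLUSTER, "evaluator", 0)

# Variant B: the evaluator is listed as its own job in the cluster spec.
cluster_with_evaluator = dict(TRAINING_CLUSTER, evaluator=["host4:2222"])
evaluator_inside = make_tf_config(cluster_with_evaluator, "evaluator", 0)

if __name__ == "__main__":
    # The evaluator process would export one of the two variants before starting.
    os.environ["TF_CONFIG"] = evaluator_inside
    print("variant A passes the check:", task_in_cluster(evaluator_outside))
    print("variant B passes the check:", task_in_cluster(evaluator_inside))
```

Variant A fails a check of this shape because the `evaluator` key is absent from `cluster`, which matches the failure reported here; variant B passes it only because the evaluator address was added to the spec.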