
About Evaluator in TF_CONFIG #30121

Closed · wangsiyu opened this issue Jun 25, 2019 · 20 comments
Labels
comp:dist-strat (Distribution Strategy related issues), stale (to be closed automatically if no activity), stat:awaiting response (Status - Awaiting response from author), type:bug (Bug)

Comments

@wangsiyu (Contributor) commented Jun 25, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): master
  • Python version: 2.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

Describe the current behavior
The evaluator entry in TF_CONFIG confuses me. The RunConfig documentation says the evaluator should not be in the cluster. For example:

import json
import os

cluster = {'chief': ['host0:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'evaluator', 'index': 0}})

This means the evaluator is not part of the training cluster, so it will not be in the cluster_spec.
However, DistributionStrategy has a check that looks for the evaluator node in the cluster_spec. For example, tensorflow/python/distribute/multi_worker_util.py has a function named _validate_cluster_spec:

  if task_type not in ("chief", "worker", "evaluator", "ps"):
    raise ValueError(
        "Unrecognized task_type: %r, valid task types are: \"chief\", "
        "\"worker\", \"evaluator\" and \"ps\"." % task_type)

  if task_type and task_type not in cluster_spec:
    raise ValueError("`task_type` %r not found in cluster_spec." % task_type)

That means that if the task_type is evaluator, it must be in the cluster_spec, which is inconsistent with the documentation quoted above. So what should TF_CONFIG look like when I want to use DistributionStrategy for training and single-machine mode for evaluation? I have tried many times but keep failing because of this check. It would be better to give an example of how to construct TF_CONFIG. Thanks very much.
@yuefengz @anj-s

gadagashwini-zz self-assigned this Jun 26, 2019
gadagashwini-zz added the comp:dist-strat (Distribution Strategy related issues) and type:bug labels Jun 26, 2019
gadagashwini-zz removed their assignment Jul 5, 2019
@hfzhang31 commented:

I'm facing the same issue. I failed to run distributed evaluation when setting cluster_spec as the documentation describes. Now I'm using train_and_evaluate for my model. The train_and_evaluate documentation says that when using train_and_evaluate, the evaluator should not use cluster_spec but remote_cluster instead. That does eliminate the ValueError, but the model still doesn't work as expected. I wonder whether the RunConfig documentation is mistaken?

@yuefengz (Contributor) commented Aug 6, 2019

I guess it is clearer to have an evaluator job in the cluster spec if you want to run side-car evaluation. Does that work for Estimator + distribution strategy?

@guptapriya (Contributor) commented:

@yuefengz can you give an example of how they should setup their TF_CONFIG?

@wangsiyu (Contributor, Author) commented Aug 7, 2019

Yes, it would be better to give an example. Thanks.

@yuefengz (Contributor) commented Sep 6, 2019

Another example would look like:

cluster = {'chief': ['host0:2222'],
           'evaluator': ['host6:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
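For the evaluator process itself, the corresponding TF_CONFIG would presumably pair this cluster dict with an 'evaluator' task entry. A minimal sketch (host names are placeholders; this follows the example above rather than any official documentation):

import json
import os

cluster = {'chief': ['host0:2222'],
           'evaluator': ['host6:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}

# On the evaluator machine: the task type matches the 'evaluator' job
# that is now listed in the cluster dict.
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'evaluator', 'index': 0}})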

@wangsiyu (Contributor, Author) commented:

Another example would look like:

cluster = {'chief': ['host0:2222'],
           'evaluator': ['host6:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}

Thanks for your reply.

@liyi193328 commented Oct 10, 2019

@yuefengz
If the evaluator is in the cluster, the chief worker waits for the evaluator to start a session ("CreateSession still waiting for response from worker: /job:evaluator/replica:0/task:0").
But the evaluator waits for the chief's checkpoint file and cannot start ("Waiting 100.000000 secs before starting eval", "Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.").
Am I missing something? How can I solve this? Thanks.

@sahiltyagi4 commented:

@yuefengz
If the evaluator is in the cluster, the chief worker waits for the evaluator to start a session ("CreateSession still waiting for response from worker: /job:evaluator/replica:0/task:0").
But the evaluator waits for the chief's checkpoint file and cannot start ("Waiting 100.000000 secs before starting eval", "Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.").
Am I missing something? How can I solve this? Thanks.

I'm facing exactly the same issue. Any hints/answers?

@shishaochen (Contributor) commented Nov 20, 2019

@wangsiyu @liyi193328 @sahiltyagi4 I guess all of you have explicitly provided the session_config parameter when constructing a tf.estimator.RunConfig.
According to the source at run_config.py#L589, a tf.estimator.Estimator automatically adds device filters to the created tf.ConfigProto in distributed training when the session_config field is unset:

if self._task_type == TaskType.MASTER:
  device_filters = ['/job:ps', '/job:master']
elif self._task_type == TaskType.CHIEF:
  device_filters = ['/job:ps', '/job:chief']
elif self._task_type == TaskType.WORKER:
  device_filters = ['/job:ps', '/job:worker/task:%d' % self._task_id]
elif self._task_type == TaskType.PS:
  device_filters = ['/job:ps', '/job:worker', '/job:chief', '/job:master']

Thus, the solution to prevent unnecessary synchronization between all instances is easy: just provide a correct device filter in your tf.ConfigProto.
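For example, a minimal sketch for the chief process, mirroring the filters in the snippet above (other task types would use their own filters; this is an illustration rather than a verified fix):

import tensorflow as tf

# Reproduce the Estimator's default behavior by hand: the chief only needs
# to reach the parameter servers and itself, so creating its session does
# not wait on the evaluator.
session_config = tf.ConfigProto(device_filters=['/job:ps', '/job:chief'])

run_config = tf.estimator.RunConfig(session_config=session_config)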

@liyi193328 commented:

@shishaochen Got it, thanks a lot. Very kind of you.

@lebinlebin commented:

Hi, did you solve the problem? I encountered the same problem, but I don't understand how to set tf.ConfigProto. Can you give me an example? Thanks.

@mckinziebrandon commented:

Agreed with @lebinlebin, the solution is unclear. Can someone explicitly provide an example of:

  1. The TF_CONFIG set on the worker nodes.
  2. The TF_CONFIG set on the evaluator node.
  3. The full tf.estimator.RunConfig used.
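Not an authoritative answer, but pulling together the earlier comments, a rough sketch of those three pieces might look like the following. Host names are placeholders, the choice of ParameterServerStrategy is an assumption, and listing the evaluator in the cluster follows yuefengz's example above rather than the RunConfig docs:

import json
import os

import tensorflow as tf

cluster = {'chief': ['host0:2222'],
           'evaluator': ['host6:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222']}

# 1. On worker 0 (the chief, ps, and other workers are analogous):
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster, 'task': {'type': 'worker', 'index': 0}})

# 2. On the evaluator node (same layout as the sketch earlier in the thread):
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster, 'task': {'type': 'evaluator', 'index': 0}})

# 3. RunConfig: leave session_config unset so the Estimator adds its own
#    device filters, or pass explicit filters as in the previous comment.
run_config = tf.estimator.RunConfig(
    train_distribute=tf.distribute.experimental.ParameterServerStrategy())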

@meibenjin commented Mar 23, 2020

@yuefengz please take a look at TF_CONFIG env issue in kubeflow/tf-operator

@meibenjin commented:

@yuefengz please take a look at TF_CONFIG env issue in kubeflow/tf-operator

Hi @yuefengz, any update on this issue? kubeflow/training-operator#1139

@Mesilenceki commented:

Hi, I am using an older version of tf-operator and I have the same issue. Has it been fixed yet?

@joelxiangnanchen commented:

@shishaochen Got it, thanks a lot. Very kind of you.

Hey, did you find a solution? I hit the same issue with version 1.14. Another confusing thing is that my chief node has no variables file, only meta-graph files, and the evaluator also tells me "Estimator is not trained yet". Thanks.

@sushreebarsa (Contributor) commented:
Hi There,

We are checking to see if you still need help on this issue, as you are using an older version of TensorFlow (1.x), which is officially end of life. We recommend that you upgrade to 2.4 or a later version and let us know if the issue still persists in newer versions; we will get you the right help. Thanks!

sushreebarsa added the stat:awaiting response (Status - Awaiting response from author) label Jul 1, 2021
@google-ml-butler bot commented:

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler bot added the stale (to be closed automatically if no activity) label Jul 8, 2021
@google-ml-butler bot commented:

Closing as stale. Please reopen if you'd like to work on this further.

