
About Evaluator in TF_CONFIG #30121

Closed · wangsiyu opened this issue Jun 25, 2019 · 20 comments
Labels
comp:dist-strat (Distribution Strategy related issues), stale (to be closed automatically if no activity), stat:awaiting response (Status - Awaiting response from author), type:bug (Bug)

Comments

@wangsiyu (Contributor) commented Jun 25, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): master
  • Python version: 2.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

Describe the current behavior
The evaluator entry in TF_CONFIG confuses me. The RunConfig documentation says the evaluator should not be in the cluster. For example:

import json
import os

cluster = {'chief': ['host0:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'evaluator', 'index': 0}})

This means the evaluator is not part of the training cluster, so it will not be in the cluster_spec.
However, DistributionStrategy has a check that looks for the evaluator node in the cluster_spec. For example, tensorflow/python/distribute/multi_worker_util.py has a function named _validate_cluster_spec:

  if task_type not in ("chief", "worker", "evaluator", "ps"):
    raise ValueError(
        "Unrecognized task_type: %r, valid task types are: \"chief\", "
        "\"worker\", \"evaluator\" and \"ps\"." % task_type)

  if task_type and task_type not in cluster_spec:
    raise ValueError("`task_type` %r not found in cluster_spec." % task_type)

That means that if the task_type is evaluator, it must be in the cluster_spec, which is inconsistent with the documentation quoted above. So what should TF_CONFIG look like when I want to use DistributionStrategy for training and single-machine mode for evaluation? I have tried many times but keep failing because of this check. It would be better to give an example of how to construct TF_CONFIG. Thanks very much.
@yuefengz @anj-s

gadagashwini-zz self-assigned this Jun 26, 2019
gadagashwini-zz added the comp:dist-strat (Distribution Strategy related issues) and type:bug labels Jun 26, 2019
gadagashwini-zz removed their assignment Jul 5, 2019
@hfzhang31 commented:

I'm facing the same issue. I failed to run distributed evaluation when setting cluster_spec as the documentation describes. Now I'm using train_and_evaluate for my model. The train_and_evaluate documentation says that when using train_and_evaluate, the evaluator should not use cluster_spec but remote_cluster instead. That does eliminate the ValueError, but the model still doesn't work as expected. I wonder whether the RunConfig documentation is mistaken?

@yuefengz (Contributor) commented Aug 6, 2019

I guess it is clearer to have an evaluator job in the cluster spec if you want to run side-car evaluation. Does that work for Estimator + distribution strategy?

@guptapriya (Contributor) commented:

@yuefengz can you give an example of how they should setup their TF_CONFIG?

@wangsiyu (Contributor, Author) commented Aug 7, 2019

Yes, it would be better to give an example. Thanks.

@yuefengz (Contributor) commented Sep 6, 2019

Another example would look like:

cluster = {'chief': ['host0:2222'],
           'evaluator': ['host6:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
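For the evaluator process itself, the corresponding TF_CONFIG would presumably pair this cluster dict with an 'evaluator' task entry. A minimal sketch (host names are placeholders; this follows the example above rather than any official documentation):

import json
import os

cluster = {'chief': ['host0:2222'],
           'evaluator': ['host6:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}

# On the evaluator machine: the task type matches the 'evaluator' job
# that is now listed in the cluster dict.
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'evaluator', 'index': 0}})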

@wangsiyu (Contributor, Author) commented:

Another example would look like:

cluster = {'chief': ['host0:2222'],
           'evaluator': ['host6:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}

Thanks for your reply.

@liyi193328 commented Oct 10, 2019

@yuefengz
If the evaluator is in the cluster, the chief worker waits for the evaluator to start a session ("CreateSession still waiting for response from worker: /job:evaluator/replica:0/task:0").
But the evaluator waits for the chief's checkpoint file and cannot start ("Waiting 100.000000 secs before starting eval", "Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.").
Am I missing something? How can I solve this? Thanks.

@sahiltyagi4 commented:

@yuefengz
If the evaluator is in the cluster, the chief worker waits for the evaluator to start a session ("CreateSession still waiting for response from worker: /job:evaluator/replica:0/task:0").
But the evaluator waits for the chief's checkpoint file and cannot start ("Waiting 100.000000 secs before starting eval", "Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.").
Am I missing something? How can I solve this? Thanks.

I'm facing exactly the same issue. Any hints/answers?

@shishaochen (Contributor) commented Nov 20, 2019

@wangsiyu @liyi193328 @sahiltyagi4 I guess all of you have explicitly provided the session_config parameter when constructing a tf.estimator.RunConfig.
According to the source at run_config.py#L589, a tf.estimator.Estimator automatically adds device filters to the created tf.ConfigProto in distributed training when the session_config field is unset:

if self._task_type == TaskType.MASTER:
  device_filters = ['/job:ps', '/job:master']
elif self._task_type == TaskType.CHIEF:
  device_filters = ['/job:ps', '/job:chief']
elif self._task_type == TaskType.WORKER:
  device_filters = ['/job:ps', '/job:worker/task:%d' % self._task_id]
elif self._task_type == TaskType.PS:
  device_filters = ['/job:ps', '/job:worker', '/job:chief', '/job:master']

Thus, the solution to prevent unnecessary synchronization between all instances is easy: just provide a correct device filter in your tf.ConfigProto.
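For example, a minimal sketch for the chief process, mirroring the filters in the snippet above (other task types would use their own filters; this is an illustration rather than a verified fix):

import tensorflow as tf

# Reproduce the Estimator's default behavior by hand: the chief only needs
# to reach the parameter servers and itself, so creating its session does
# not wait on the evaluator.
session_config = tf.ConfigProto(device_filters=['/job:ps', '/job:chief'])

run_config = tf.estimator.RunConfig(session_config=session_config)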

@liyi193328 commented:

@shishaochen Got it, thanks a lot. Very kind of you.

@lebinlebin commented:

Hi, did you solve the problem? I encountered the same problem, but I don't understand how to set tf.ConfigProto. Can you give me an example? Thanks.

@mckinziebrandon commented:

Agreed with @lebinlebin, the solution is unclear. Can someone explicitly provide an example of:

  1. The TF_CONFIG set on the worker nodes.
  2. The TF_CONFIG set on the evaluator node.
  3. The full tf.estimator.RunConfig used.
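Not an authoritative answer, but pulling together the earlier comments, a rough sketch of those three pieces might look like the following. Host names are placeholders, the choice of ParameterServerStrategy is an assumption, and listing the evaluator in the cluster follows yuefengz's example above rather than the RunConfig docs:

import json
import os

import tensorflow as tf

cluster = {'chief': ['host0:2222'],
           'evaluator': ['host6:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222']}

# 1. On worker 0 (the chief, ps, and other workers are analogous):
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster, 'task': {'type': 'worker', 'index': 0}})

# 2. On the evaluator node (same layout as the sketch earlier in the thread):
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster, 'task': {'type': 'evaluator', 'index': 0}})

# 3. RunConfig: leave session_config unset so the Estimator adds its own
#    device filters, or pass explicit filters as in the previous comment.
run_config = tf.estimator.RunConfig(
    train_distribute=tf.distribute.experimental.ParameterServerStrategy())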

@meibenjin commented Mar 23, 2020

@yuefengz please take a look at TF_CONFIG env issue in kubeflow/tf-operator

@meibenjin commented:

@yuefengz please take a look at TF_CONFIG env issue in kubeflow/tf-operator

Hi @yuefengz, any update on this issue? kubeflow/training-operator#1139

@Mesilenceki commented:

Hi, I am using an older version of tf-operator and I have the same issue. Has it been fixed yet?

@joelxiangnanchen commented:

@shishaochen Got it, thanks a lot. Very kind of you.

Hey, did you find a solution? I hit the same issue with version 1.14. Another confusing thing is that my chief node has no variables file, only meta-graph files, and the evaluator also tells me "Estimator is not trained yet". Thanks.

@sushreebarsa (Contributor) commented:
Hi There,

We are checking to see if you still need help on this issue, as you are using an older version of TensorFlow (1.x), which is officially end of life. We recommend that you upgrade to 2.4 or a later version and let us know if the issue still persists in newer versions; we will get you the right help. Thanks!

sushreebarsa added the stat:awaiting response (Status - Awaiting response from author) label Jul 1, 2021
@google-ml-butler bot commented:

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler bot added the stale (to be closed automatically if no activity) label Jul 8, 2021
@google-ml-butler bot commented:

Closing as stale. Please reopen if you'd like to work on this further.

