Error in merge_accumulators when using keras metrics on dataflow #158

Open
zywind opened this issue Aug 20, 2022 · 3 comments

zywind commented Aug 20, 2022

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow Model Analysis): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): GCP Dataflow Apache Beam Python 3.7 SDK 2.39.0
  • TensorFlow Model Analysis installed from (source or binary): binary
  • TensorFlow Model Analysis version (use command below): 0.33
  • Python version: 3.7
  • Jupyter Notebook version: Jupyter lab 3.2.8
  • Exact command to reproduce:

I am using TFX's Evaluator component:

import tensorflow as tf
import tensorflow_model_analysis as tfma
from tfx.components import Evaluator

# model_specs, slicing_specs, model, transform_examples, and context
# are defined earlier in the notebook.
eval_config = tfma.EvalConfig(
  model_specs=model_specs,
  metrics_specs=tfma.metrics.specs_from_metrics([
      tf.keras.metrics.AUC(curve='ROC', name='ROCAUC'),
      tf.keras.metrics.AUC(curve='PR', name='PRAUC'),
      tf.keras.metrics.Precision(),
      tf.keras.metrics.Recall(),
      tf.keras.metrics.BinaryAccuracy(),
    ]),
  slicing_specs=slicing_specs
)

evaluator = Evaluator(
  eval_config=eval_config,
  model=model,
  examples=transform_examples,
)

context.run(evaluator)

Describe the problem

Running the same evaluation locally with Beam's DirectRunner does not cause any error, but whenever I run it on Dataflow and Dataflow spawns more than one worker, I get the following error:

    output.with_value(self.phased_combine_fn.apply(output.value)):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/combiners.py", line 882, in merge_only
    return self.combine_fn.merge_accumulators(accumulators)
  File "/home/sandbox/.pex/install/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl.06f7ceb62380d1c704d774a5096a04f953de60c9/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl/apache_beam/transforms/combiners.py", line 665, in merge_accumulators
    a in zip(self._combiners, zip(*accumulators_batch))
  File "/home/sandbox/.pex/install/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl.06f7ceb62380d1c704d774a5096a04f953de60c9/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl/apache_beam/transforms/combiners.py", line 665, in <listcomp>
    a in zip(self._combiners, zip(*accumulators_batch))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 560, in merge_accumulators
    for metric_index in range(len(self._metrics[output_name])):
TypeError: 'NoneType' object is not subscriptable

According to the Dataflow log, the failing steps were:

  • ExtractEvaluateAndWriteResults/ExtractAndEvaluate/EvaluateMetricsAndPlots/ComputeMetricsAndPlots()/CombineMetricsPerSlice/CombinePerKey(PreCombineFn)/Combine
  • ExtractEvaluateAndWriteResults/ExtractAndEvaluate/EvaluateMetricsAndPlots/ComputeMetricsAndPlots()/CombineMetricsPerSlice/CombinePerKey(PreCombineFn)/GroupByKey
  • ExtractEvaluateAndWriteResults/ExtractAndEvaluate/EvaluateMetricsAndPlots/ComputeMetricsAndPlots()/CombineMetricsPerSlice/CombinePerKey(PostCombineFn)/GroupByKey

I see that you have this commit, which appears to address this problem, but it was immediately rolled back. Have you seen similar issues, and what would you recommend to fix the error?
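The last frame of the traceback shows self._metrics being None at the moment merge_accumulators runs. One plausible reading (an assumption, not confirmed in this thread) is that the combiner builds its metrics lazily and that this initialization had not happened on the worker doing the merge. The sketch below reproduces that failure pattern with plain Python. LazyMetricsCombiner and GuardedCombiner are hypothetical stand-ins, not TFMA's or Beam's actual classes, and the guard is one possible mitigation rather than the maintainers' fix.

```python
class LazyMetricsCombiner:
    """Hypothetical combiner whose metrics are created in setup(), not __init__."""

    def __init__(self, metric_names):
        self._metric_names = list(metric_names)
        self._metrics = None  # stays None until setup() is called

    def setup(self):
        # Stand-in for expensive per-worker initialization.
        self._metrics = {"": list(self._metric_names)}

    def merge_accumulators(self, accumulators, output_name=""):
        # Raises TypeError if setup() never ran: None is not subscriptable.
        merged = [0.0] * len(self._metrics[output_name])
        for accumulator in accumulators:
            for index, value in enumerate(accumulator):
                merged[index] += value
        return merged


class GuardedCombiner(LazyMetricsCombiner):
    """Same combiner, but it initializes itself on demand before merging."""

    def merge_accumulators(self, accumulators, output_name=""):
        if self._metrics is None:
            self.setup()
        return super().merge_accumulators(accumulators, output_name)
```

Calling merge_accumulators on a fresh LazyMetricsCombiner fails with the same TypeError as in the traceback, while GuardedCombiner merges normally. This would be consistent with the job passing on a single worker if the merge there always happens on an instance that was already initialized.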

zywind commented Aug 20, 2022

I tried setting Dataflow's max_num_workers to 1 and the job succeeded, so the problem does seem to be specific to running on Dataflow with multiple workers.
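For anyone who needs the same workaround, here is a minimal sketch of the Beam flags that cap a Dataflow job at one worker. The helper name dataflow_pipeline_args and the project, region, and bucket values are placeholders, not values from this job.

```python
def dataflow_pipeline_args(project, region, temp_location, max_num_workers=1):
    """Build Beam pipeline flags that cap the number of Dataflow workers."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        # Capping at 1 keeps the whole combine on a single worker.
        f"--max_num_workers={max_num_workers}",
    ]


# Placeholder values for illustration only.
args = dataflow_pipeline_args("my-project", "us-central1", "gs://my-bucket/tmp")
```

How the list reaches the job depends on the setup; TFX accepts a list of this shape as beam_pipeline_args. Note that this only sidesteps the error and gives up parallelism for the evaluation step.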

singhniraj08 self-assigned this Aug 22, 2022

singhniraj08 commented

Hi @zywind,

As mentioned here, for distributed evaluation we use tfma.ExtractEvaluateAndWriteResults. Please refer to this example notebook and let me know if this resolves your issue.

Thank you.
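For reference, a rough sketch of what calling tfma.ExtractEvaluateAndWriteResults directly in a Beam pipeline can look like. It assumes the input is a TFRecord file of serialized tf.Example records; the function name run_distributed_eval and its parameters are made up for illustration and are not taken from the linked notebook.

```python
def run_distributed_eval(model_path, data_pattern, output_path, eval_config,
                         pipeline_args=None):
    """Evaluate a saved model with TFMA inside a Beam pipeline (sketch)."""
    # Imports are deferred so this module loads even where Beam/TFMA are absent.
    import apache_beam as beam
    import tensorflow_model_analysis as tfma
    from apache_beam.options.pipeline_options import PipelineOptions

    eval_shared_model = tfma.default_eval_shared_model(
        eval_saved_model_path=model_path, eval_config=eval_config)

    # pipeline_args carries runner flags such as --runner=DataflowRunner.
    options = PipelineOptions(pipeline_args or [])
    with beam.Pipeline(options=options) as pipeline:
        _ = (
            pipeline
            | "ReadExamples" >> beam.io.ReadFromTFRecord(data_pattern)
            | "EvaluateAndWrite" >> tfma.ExtractEvaluateAndWriteResults(
                eval_shared_model=eval_shared_model,
                eval_config=eval_config,
                output_path=output_path)
        )
```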

zywind commented Aug 22, 2022

Hi @singhniraj08,

I'm using the official TFX Evaluator, which internally uses tfma.ExtractEvaluateAndWriteResults as you can see here.
