Evaluator hangs while training #589

Open
jiqiujia opened this issue Aug 2, 2022 · 1 comment

Comments

@jiqiujia

jiqiujia commented Aug 2, 2022

Environment:

  • Python version 3.7
  • Spark version 2.4
  • TensorFlow version 2.5
  • TensorFlowOnSpark version 2.2.3
  • Cluster version hadoop

Describe the bug:
I found that the evaluator node stops working after some time, while the training nodes keep working fine and the cluster as a whole doesn't crash. Training runs for 80000 steps in total, but the evaluator only gets through 10000+ steps. After that, it outputs no more logs.
[Two images attached]

@leewyang
Contributor

leewyang commented Aug 2, 2022

I don't see anything obvious in your logs. Given that the evaluator process appears to have stalled or quit, I'd check CPU and memory usage on that node while it's running to get more clues. You can also try running the TF cluster at a smaller scale on a single node without Spark, by running the code in separate processes using TF_CONFIG, i.e. using distributed TF by itself. With local processes, it should be easier to debug the evaluator node and see why it may be stalling.
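As a rough illustration of the single-node suggestion, here is a minimal sketch of how the TF_CONFIG value for each local process could be built. The cluster layout, the ports, and the helper names (`build_tf_config`, `local_task_envs`) are assumptions made up for this example and are not taken from the issue. It also assumes the common convention that the evaluator is a separate task not listed in the cluster spec, which may not match every training setup.

```python
import json


def build_tf_config(cluster, task_type, task_index):
    """Return the JSON string to place in the TF_CONFIG environment variable."""
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


# Hypothetical single-host layout; the roles and ports are arbitrary choices.
cluster = {
    "chief": ["localhost:2222"],
    "worker": ["localhost:2223"],
    "ps": ["localhost:2224"],
}


def local_task_envs(cluster):
    """Map "type:index" to a TF_CONFIG string for every process to launch."""
    tasks = [
        (task_type, index)
        for task_type, addresses in cluster.items()
        for index in range(len(addresses))
    ]
    # Assumption: the evaluator runs as its own task and is not part of
    # the cluster spec itself.
    tasks.append(("evaluator", 0))
    return {
        f"{task_type}:{index}": build_tf_config(cluster, task_type, index)
        for task_type, index in tasks
    }


if __name__ == "__main__":
    for name, tf_config in local_task_envs(cluster).items():
        print(name, tf_config)
```

Each value would then be set as TF_CONFIG in the environment of one local process running the same training script. Starting the evaluator in its own terminal makes it possible to attach a debugger or watch its CPU and memory use directly.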
