Job submitter stays in running status after successful submission #138
Comments
It seems the default job mode has changed. Previously it was "detached". flink-on-k8s-operator/controllers/flinkcluster_converter.go Lines 629 to 635 in 16a1830
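To illustrate why the job mode matters for the submitter's lifetime, here is a minimal sketch, not the operator's actual code. The `JobMode` type, its constants, and `submitArgs` are hypothetical names made up for this example; the only real detail assumed is that `flink run` accepts `--detached`, which makes the client return once the job is submitted instead of waiting for it to finish.

```go
package main

import "fmt"

// JobMode is a hypothetical stand-in for the operator's job mode setting.
type JobMode string

const (
	JobModeDetached JobMode = "Detached" // client exits after submission
	JobModeBlocking JobMode = "Blocking" // client waits for the job to finish
)

// submitArgs builds the argument list for `flink run`.
// Only the detached mode adds --detached; in any other mode the client
// process (and therefore the submitter pod) keeps running with the job.
func submitArgs(mode JobMode, jarPath string) []string {
	args := []string{"run"}
	if mode == JobModeDetached {
		args = append(args, "--detached")
	}
	return append(args, jarPath)
}

func main() {
	fmt.Println(submitArgs(JobModeDetached, "/opt/job.jar")) // [run --detached /opt/job.jar]
	fmt.Println(submitArgs(JobModeBlocking, "/opt/job.jar")) // [run /opt/job.jar]
}
```

Under this assumption, a changed default that no longer selects the detached mode would leave the submitter in running status for as long as a streaming job runs.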
On the other hand, if a new job pod was spun up, that is unexpected behavior. flink-on-k8s-operator/controllers/flinkcluster_converter.go Lines 719 to 726 in 16a1830
However, when I tested forcibly deleting the submitter pod, it was recreated. Therefore I guess the pod was deleted and recreated for some reason. For example, the submitter pod might have been OOM-killed and then recreated, but I don't know whether an OOM kill can cause the pod of a k8s Job to be recreated. Anyway, it looks like the pod must be prevented from being recreated for any reason in |
Once you specify |
Thanks @elanv for pointing to the job mode. I described two issues: 1) the job submitter stays in running status after job submission, and 2) multiple jobs can be submitted in some scenarios (job restart or chart update). They seem related, but I can't be sure. However, I don't see either issue at v0.2.1, so I think this could be related to the webhooks bug being fixed? @regadas I guess we can close this, but maybe it's better to confirm the root cause. |
Hi @pjthepooh! yeah, these issues should be addressed now. This was due to a bug introduced in the admission webhooks when migrating to I'll close this issue for now. Please re-open if there are still issues. |
With v0.2.0, the job submitter stays in "running" status instead of "completed" as in the previous version. This doesn't seem to cause any problem, except that I hit a case where a new pod could be spun up to submit a new job while the current job is recovering/restarting from an exception. Then more than one job is in the JM and only one of them can be in running state, while the others keep restarting. The operator only recognizes the original job. The job submitter hangs at this log message and doesn't signal job completion to the operator.
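For anyone trying to confirm the duplicate-submission symptom, here is a rough sketch of how the extra jobs could be spotted. It assumes you already have the JobManager's job list (for example from Flink's REST endpoint `/jobs/overview`) and the ID of the job the operator tracks. The `Job` type and `untrackedJobs` are hypothetical helpers for illustration, not part of the operator.

```go
package main

import "fmt"

// Job is a hypothetical, trimmed-down view of one entry in the
// JobManager's job list.
type Job struct {
	ID    string
	Name  string
	State string // e.g. "RUNNING", "RESTARTING"
}

// untrackedJobs returns every job whose ID differs from the one the
// operator recorded, i.e. the jobs a second submitter pod may have created.
func untrackedJobs(jobs []Job, trackedID string) []Job {
	var extra []Job
	for _, j := range jobs {
		if j.ID != trackedID {
			extra = append(extra, j)
		}
	}
	return extra
}

func main() {
	// Sample data made up for the example.
	jobs := []Job{
		{ID: "a1", Name: "my-job", State: "RUNNING"},
		{ID: "b2", Name: "my-job", State: "RESTARTING"},
	}
	for _, j := range untrackedJobs(jobs, "a1") {
		fmt.Printf("untracked job %s (%s) is %s\n", j.ID, j.Name, j.State)
	}
}
```

A non-empty result while the tracked job is healthy would match the situation described above, where the additional jobs keep restarting.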