Job submitter stays in running status after successful submission #138

Closed
pjthepooh opened this issue Oct 19, 2021 · 4 comments
Comments

@pjthepooh
Contributor

With v0.2.0, the job submitter stays in Running status instead of Completed as in the previous version. This doesn't seem to cause any issues by itself, except that I hit a case where a new pod can be spun up to submit a new job while the current job is recovering/restarting from an exception. Then more than one job ends up in the JobManager, only one of them can be in running state, and the others keep restarting. The operator only recognizes the original job.

The job submitter hangs at this log message and doesn't signal the operator to indicate job completion:

Job has been submitted with JobID <job_id>
@elanv
Contributor

elanv commented Oct 20, 2021

It seems the default job mode has changed; previously it was "detached".
Therefore the submitter does not complete when the job mode is not specified.

if jobSpec.Mode != nil {
	switch *jobSpec.Mode {
	case v1beta1.JobModeBlocking:
	case v1beta1.JobModeDetached:
		jobArgs = append(jobArgs, "--detached")
	}
}
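
To make this concrete, here is a small, self-contained sketch (stand-in types, not the operator's actual code) of how the mode maps to the submitter arguments; with Blocking, or with no mode set at all, no --detached flag is added, so the submitter stays attached and its pod remains in Running status:

package main

import "fmt"

// JobMode and the constants below are stand-ins for the operator's v1beta1
// definitions quoted above, declared locally so the sketch is self-contained.
type JobMode string

const (
	JobModeBlocking JobMode = "Blocking"
	JobModeDetached JobMode = "Detached"
)

// submitterArgs mirrors the switch above: only Detached appends --detached,
// so with Blocking or an unset (nil) mode the submitter stays attached to
// the Flink job until it finishes.
func submitterArgs(mode *JobMode) []string {
	var jobArgs []string
	if mode != nil && *mode == JobModeDetached {
		jobArgs = append(jobArgs, "--detached")
	}
	return jobArgs
}

func main() {
	detached := JobModeDetached
	fmt.Println(submitterArgs(nil))       // [] -> submitter blocks, pod stays Running
	fmt.Println(submitterArgs(&detached)) // [--detached] -> submitter exits after submission
}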

On the other hand, if a new submitter pod was spun up, that is unexpected behavior.
Even if the submitter job fails, the k8s Job does not spin up a new pod, because backoffLimit is set to 0, as shown below.

// Disable the retry mechanism of k8s Job, all retries should be initiated
// by the operator based on the job restart policy. This is because Flink
// jobs are stateful, if a job fails after running for 10 hours, we probably
// don't want to start over from the beginning, instead we want to resume
// the job from the latest savepoint which means strictly speaking it is no
// longer the same job as the previous one because the `--fromSavepoint`
// parameter has changed.
var backoffLimit int32 = 0
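
For context, here is a self-contained sketch (using the standard k8s.io/api types; names and values are hypothetical, this is not the operator's actual builder) of a submitter Job that Kubernetes itself never retries:

package main

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// submitterJob builds a Job whose pods Kubernetes never retries: backoffLimit 0
// plus restartPolicy Never leaves all retry/resume decisions to the operator.
// Note that backoffLimit only counts failed pods; if the pod is deleted
// externally, the Job controller may still create a replacement, which matches
// the recreation described below.
func submitterJob(image string, args []string) *batchv1.Job {
	var backoffLimit int32 = 0
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "example-job-submitter"}, // hypothetical name
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "submitter", Image: image, Args: args},
					},
				},
			},
		},
	}
}

func main() {
	_ = submitterJob("flink:1.14", []string{"run", "--detached"}) // hypothetical image/args
}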

However, when I tested deleting the submitter pod forcibly, it was recreated. So I guess the pod was deleted and recreated for some reason. For example, the submitter pod might have been OOM-killed and then recreated, though I don't know whether an OOM kill can cause the pod of a k8s Job to be recreated. In any case, it looks like the pod must be prevented from being recreated for any reason in Blocking mode.

@elanv
Contributor

elanv commented Oct 20, 2021

Once you set spec.job.mode to Detached, I think creating multiple jobs will be prevented, because the submitter job completes right after the job submission.
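
Roughly, in code form (JobMode and JobSpec here are local stand-ins for the operator's v1beta1 types; the corresponding field in the custom resource manifest is spec.job.mode):

package main

// Local stand-ins for the operator's v1beta1 job types quoted earlier.
type JobMode string

const JobModeDetached JobMode = "Detached"

type JobSpec struct {
	Mode *JobMode
	// other job fields omitted
}

func main() {
	mode := JobModeDetached
	job := JobSpec{Mode: &mode} // corresponds to spec.job.mode: Detached
	_ = job
}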

@pjthepooh
Contributor Author

Thanks @elanv for pointing to the job mode. There are two issues I described: 1) the job submitter stays in Running status after submitting the job, and 2) multiple jobs can be submitted in some scenarios (job restart or chart update). They seem related, but I can't be sure.

However, I don't see either issue in v0.2.1, so I think this could be related to the webhooks bug that was fixed? @regadas I guess we can close this, but it might be better to confirm the root cause.

@regadas
Contributor

regadas commented Oct 21, 2021

Hi @pjthepooh! Yeah, these issues should be addressed now. This was due to a bug introduced in the admission webhooks when migrating to v1 (#136). I recommend using the latest version, v0.2.2.

I'll close this issue for now. Please re-open if there are still issues.

@regadas regadas closed this as completed Oct 21, 2021