Job submitter stays in running status after successful submission #138

Closed
pjthepooh opened this issue Oct 19, 2021 · 4 comments
Comments

@pjthepooh
Contributor

With v0.2.0, the job submitter stays in Running status instead of Completed as in the previous version. This doesn't seem to cause any issues by itself, except that I hit a case where a new pod can be spun up to submit a new job while the current job is recovering/restarting from an exception. Then more than one job ends up in the JobManager, only one of them can be in running state, and the others keep restarting. The operator only recognizes the original job.

The job submitter hangs at this log message and doesn't signal the operator to indicate job completion:

Job has been submitted with JobID <job_id>
@elanv
Contributor

elanv commented Oct 20, 2021

It seems the default job mode has changed; previously it was "detached".
Therefore the submitter does not complete when the job mode is not specified.

if jobSpec.Mode != nil {
	switch *jobSpec.Mode {
	case v1beta1.JobModeBlocking:
	case v1beta1.JobModeDetached:
		jobArgs = append(jobArgs, "--detached")
	}
}
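
To make this concrete, here is a small, self-contained sketch (stand-in types, not the operator's actual code) of how the mode maps to the submitter arguments; with Blocking, or with no mode set at all, no --detached flag is added, so the submitter stays attached and its pod remains in Running status:

package main

import "fmt"

// JobMode and the constants below are stand-ins for the operator's v1beta1
// definitions quoted above, declared locally so the sketch is self-contained.
type JobMode string

const (
	JobModeBlocking JobMode = "Blocking"
	JobModeDetached JobMode = "Detached"
)

// submitterArgs mirrors the switch above: only Detached appends --detached,
// so with Blocking or an unset (nil) mode the submitter stays attached to
// the Flink job until it finishes.
func submitterArgs(mode *JobMode) []string {
	var jobArgs []string
	if mode != nil && *mode == JobModeDetached {
		jobArgs = append(jobArgs, "--detached")
	}
	return jobArgs
}

func main() {
	detached := JobModeDetached
	fmt.Println(submitterArgs(nil))       // [] -> submitter blocks, pod stays Running
	fmt.Println(submitterArgs(&detached)) // [--detached] -> submitter exits after submission
}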

On the other hand, if a new submitter pod was spun up, that is unexpected behavior.
Even if the submitter job fails, the k8s Job does not spin up a new pod, because backoffLimit is set to 0, as shown below.

// Disable the retry mechanism of k8s Job, all retries should be initiated
// by the operator based on the job restart policy. This is because Flink
// jobs are stateful, if a job fails after running for 10 hours, we probably
// don't want to start over from the beginning, instead we want to resume
// the job from the latest savepoint which means strictly speaking it is no
// longer the same job as the previous one because the `--fromSavepoint`
// parameter has changed.
var backoffLimit int32 = 0
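
For context, here is a self-contained sketch (using the standard k8s.io/api types; names and values are hypothetical, this is not the operator's actual builder) of a submitter Job that Kubernetes itself never retries:

package main

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// submitterJob builds a Job whose pods Kubernetes never retries: backoffLimit 0
// plus restartPolicy Never leaves all retry/resume decisions to the operator.
// Note that backoffLimit only counts failed pods; if the pod is deleted
// externally, the Job controller may still create a replacement, which matches
// the recreation described below.
func submitterJob(image string, args []string) *batchv1.Job {
	var backoffLimit int32 = 0
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "example-job-submitter"}, // hypothetical name
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "submitter", Image: image, Args: args},
					},
				},
			},
		},
	}
}

func main() {
	_ = submitterJob("flink:1.14", []string{"run", "--detached"}) // hypothetical image/args
}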

However, when I tested deleting the submitter pod forcibly, it was recreated. So I guess the pod was deleted and recreated for some reason. For example, the submitter pod might have been OOM-killed and then recreated, though I don't know whether an OOM kill can cause the pod of a k8s Job to be recreated. In any case, it looks like the pod must be prevented from being recreated for any reason in Blocking mode.

@elanv
Contributor

elanv commented Oct 20, 2021

Once you set spec.job.mode to Detached, I think creating multiple jobs will be prevented, because the submitter job completes right after the job submission.
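
Roughly, in code form (JobMode and JobSpec here are local stand-ins for the operator's v1beta1 types; the corresponding field in the custom resource manifest is spec.job.mode):

package main

// Local stand-ins for the operator's v1beta1 job types quoted earlier.
type JobMode string

const JobModeDetached JobMode = "Detached"

type JobSpec struct {
	Mode *JobMode
	// other job fields omitted
}

func main() {
	mode := JobModeDetached
	job := JobSpec{Mode: &mode} // corresponds to spec.job.mode: Detached
	_ = job
}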

@pjthepooh
Contributor Author

Thanks @elanv for pointing to the job mode. There are two issues I described: 1) the job submitter stays in Running status after submitting the job, and 2) multiple jobs can be submitted in some scenarios (job restart or chart update). They seem related, but I can't be sure.

However, I don't see either issue in v0.2.1, so I think this could be related to the webhooks bug that was fixed? @regadas I guess we can close this, but it might be better to confirm the root cause.

@regadas
Contributor

regadas commented Oct 21, 2021

Hi @pjthepooh! Yeah, these issues should be addressed now. This was due to a bug introduced in the admission webhooks when migrating to v1 (#136). I recommend using the latest version, v0.2.2.

I'll close this issue for now. Please re-open if there are still issues.

@regadas regadas closed this as completed Oct 21, 2021