
Elaborate savepoint and update features #107

Merged
merged 11 commits into spotify:master on Oct 15, 2021

Conversation

elanv commented Sep 16, 2021

Purpose of this PR

Currently, savepoint handling and its related routines are scattered across several places, which makes it difficult to enhance this operator. This PR organizes them so that savepoint-related routines can be improved and extended in the future. It also improves the update, cancel, and recovery features that depend on the savepoint routines.

resolve #84
fix #85
fix #95
fix #115

Changes

  • Make the job deploy phase explicit with new job states.
  • Organize savepoint routines.
  • Fix some savepoint-related issues.
  • Improve update stability.
  • Change the job stop process applied when updating and cancelling a job.
  • Elaborate the update/restart strategy.

Details

  • Organize and fix savepoint routine

    • Organize Savepoint handling and related routines in one place
    • Auto savepoint
      • Delete lastSavepointTriggerTime and lastSavepointTriggerID: they duplicate status.savepoint
      • Change the first trigger to be based on status.job.startTime and delete SavepointTriggerReasonScheduledInitial
    • Savepoint state
      • Add a routine to derive the savepoint state from the HTTP status code
      • Get rid of the operator's own savepoint timeout error
  • Change job stop behavior when updating and cancelling a job

    • Flink 1.9 introduced the stop API, which supports exactly-once semantics, but for compatibility with versions up to 1.8, "cancel with savepoint" is applied first. In the future, add a flinkVersion field and support "stop with savepoint" for 1.9 or higher.
    • Apply "cancel with savepoint" API
  • Improve update process

  • Elaborate update/restart strategy
    Limit the age of the job state from which a job can be restarted when auto-restarting from failure, updating a stopped job, or updating a running job with takeSavepointOnUpdate set to false

    • Add a field to limit the maximum savepoint age that a job may be restored from on restart (see the sketch after the list of fixes below).
      • MaxStateAgeToRestoreSeconds
  • Add new job deployment states

    • Deploying, DeployFailed, Restarting
      [image: job deployment state transition diagram]

- Fix handling of failed auto savepoints
- Fix validations and tests related to these changes
- Improve the update routine
- Change the behavior of handling unexpected jobs
- Add an update constraint: when takeSavepointOnUpdate is true, the latest savepoint age should be less than maxStateAgeToRestore
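For reference, a rough sketch of what the new spec field and job deployment states could look like, in the style of api/v1beta1/flinkcluster_types.go. Apart from MaxStateAgeToRestoreSeconds, JobStateDeploying, and the state names listed above, the field type, JSON tag, and constant names are assumptions, not the exact code in this PR.

```go
package v1beta1

type JobSpec struct {
	// ... existing fields ...

	// Maximum allowed age, in seconds, of the job state (savepoint) that a job
	// may be restored from on restart or update. Pointer type and JSON tag are
	// assumptions.
	MaxStateAgeToRestoreSeconds *int32 `json:"maxStateAgeToRestoreSeconds,omitempty"`
}

// New job deployment states introduced by this PR; the full set of existing
// job states is omitted here.
const (
	JobStateDeploying    = "Deploying"
	JobStateDeployFailed = "DeployFailed"
	JobStateRestarting   = "Restarting"
)
```
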
elanv commented Sep 16, 2021

Hello @regadas.
The work is almost done, but there are a lot of changes.
I am going to do a little more testing and fixing.

elanv commented Sep 17, 2021

Currently this PR does not work with the "blocking" job mode.
In this PR, the job is tracked by an ID obtained from the job submitter.

It would be nice if we could discuss issue #110 too.

hjwalt commented Sep 24, 2021

@elanv @regadas I have some questions about the savepoint status:

  1. Is it better to move the savepoint status into JobStatus? It could be useful for supporting multiple job submissions (if that makes sense to do).

  2. From my understanding, savepoint failures will also be recorded in the savepoint status, so how will restart work with a failed savepoint if savepoints are taken periodically? Since this PR also adds MaxStateAgeToRestoreSeconds, does it make sense to also add a LastSuccessfulSavepoint savepoint status and use that for restarting the job, if it still satisfies the age limit or the limit is nil? This is relevant for execution.checkpointing.tolerable-failed-checkpoints, since savepoint failures also count toward that configuration.

The operator can only automatically restart a failed job when there is a savepoint recorded in the job status, whether it was taken automatically or manually; otherwise, the job will stay in the failed state.
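For illustration only, a minimal sketch of the age check being discussed; the function name, parameters, and exact semantics are assumptions, not the operator's actual code.

```go
package controllers

import "time"

// canRestoreFromSavepoint sketches the restart-eligibility rule discussed above:
// a failed job can only be restarted automatically if a savepoint is recorded,
// and, when maxStateAgeToRestoreSeconds is set, only if that savepoint is young
// enough. Names and signature are assumptions.
func canRestoreFromSavepoint(savepointLocation string, savepointTime time.Time, maxAgeSeconds *int32, now time.Time) bool {
	if savepointLocation == "" {
		return false // no savepoint recorded: the job stays in the failed state
	}
	if maxAgeSeconds == nil {
		return true // no age limit configured
	}
	return now.Sub(savepointTime) <= time.Duration(*maxAgeSeconds)*time.Second
}
```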

elanv commented Sep 24, 2021

Hi @hjwalt.
Thanks for your review.

For the first one, I would also like a feature like managing multiple jobs on a single session cluster. Do you mean this feature too? I wrote an issue about it in the GCP operator repo. However, as I wrote in that issue, I think it is better to introduce a new CRD than to add this feature to the FlinkCluster CRD.
(note: GoogleCloudPlatform/flink-on-k8s-operator#303)

For the second, only the last successful savepoint is recorded in status.components.job, and the latest savepoint status is recorded in status.savepoint. So it works as you understand.
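Roughly, the status layout described here might be pictured like this; only the split between status.components.job (last successful savepoint) and status.savepoint (latest savepoint attempt) comes from this discussion, and the field names other than SavepointLocation are assumptions.

```go
package v1beta1

// Illustrative only; field names other than SavepointLocation are assumptions.
type JobStatus struct {
	// Location of the last *successful* savepoint; this is what restarts restore from.
	SavepointLocation string `json:"savepointLocation,omitempty"`
}

type SavepointStatus struct {
	// State of the latest savepoint attempt, whether it succeeded or failed.
	State string `json:"state,omitempty"`
}

type ComponentsStatus struct {
	Job *JobStatus `json:"job,omitempty"` // status.components.job
}

type FlinkClusterStatus struct {
	Components ComponentsStatus `json:"components"`
	Savepoint  *SavepointStatus `json:"savepoint,omitempty"` // status.savepoint
}
```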

hjwalt commented Sep 24, 2021

@elanv thank you for the explanation. Yes, it is as you described: managing multiple jobs on one session cluster. And yes, it's not relevant for this PR.

hjwalt commented Sep 25, 2021

@elanv some more thoughts after reading through this PR:

  1. The CRD changes don't look backward compatible to me, which will break existing clusters when upgrading. I think this is better done with a new CRD definition version, and, as you put it in another issue, time may be better invested in writing a better CRD and reconciler. Fixing the current problems while maintaining backward compatibility seems pretty difficult to me.

  2. If we are to refactor the CRD and reconciler, I think it would be good to look into using finite state machines (either self-implemented or via a library) with direct and observed transitions. That way, new states and transitions can be added in the future with minimal changes to existing states and transitions. We can start with your changes and work upward to a new CRD.

current state -> transition action (direct transition) -> intermediate state -> background work completed (observed transition) -> final state

The current confusion in the recovery mechanism affects me too. If you are open to coordinating the effort for a new CRD version, I can help implement some of the features.
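Purely as an illustration of the direct vs. observed transition pattern above; none of these types or functions exist in the operator.

```go
package main

import "fmt"

type State string

const (
	Running    State = "Running"
	Cancelling State = "Cancelling" // intermediate state, entered by a direct transition
	Cancelled  State = "Cancelled"  // final state, entered by an observed transition
)

// requestCancel is a direct transition: the controller acts and records the
// intermediate state immediately.
func requestCancel(s State) State {
	if s == Running {
		return Cancelling
	}
	return s
}

// observeCancelled is an observed transition: the controller only moves to the
// final state once the background work is seen to be complete.
func observeCancelled(s State, jobGone bool) State {
	if s == Cancelling && jobGone {
		return Cancelled
	}
	return s
}

func main() {
	s := requestCancel(Running)
	s = observeCancelled(s, true)
	fmt.Println(s) // Cancelled
}
```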

elanv commented Sep 25, 2021

1. The CRD changes don't look backward compatible to me, which will break existing clusters when upgrading. I think this is better done with a new CRD definition version, and, as you put it in another issue, time may be better invested in writing a better CRD and reconciler. Fixing the current problems while maintaining backward compatibility seems pretty difficult to me.

When the updated operator observes an existing FlinkCluster, it records the calculated state according to the changed CRD data structure and then reconciles based on it, so if you change the CRD carefully, you can switch to the new CRD smoothly. And since the code of the GCP operator has not been well maintained, there are many critical bugs covered in this PR. If there are no critical problems, I think it is better to stabilize even if there are small breaking changes. #85, #95, and #115 seem related to this PR.

Could you explain more about your problem?

2. If we are to refactor the CRD and reconciler, I think it would be good to look into using finite state machines (either self-implemented or via a library) with direct and observed transitions. That way, new states and transitions can be added in the future with minimal changes to existing states and transitions. We can start with your changes and work upward to a new CRD.

In my opinion, since a k8s controller operates on the mechanism "observe --> calculate desired state --> reconcile", it is not appropriate to apply a state machine. For example, depending on the status at the time of observation, a state may sometimes have to be skipped. As far as I know, k8s native controllers are not implemented as state machines for these reasons.

I attached a diagram indicating the state transitions of the job to this PR, but it is not intended to describe a state machine. Those are only states observed by the operator for the reconciliation stage. Therefore, states may be skipped, because they are recorded at the time of observation.
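As a contrast to the FSM idea, a minimal sketch of deriving the state from an observation; this is not the operator's actual code, and the state names here are illustrative. Any state that was not observed is simply never recorded.

```go
package controllers

// deriveJobState computes the state purely from what was observed at reconcile
// time; intermediate states can be skipped if they were never observed.
// Illustrative only.
func deriveJobState(submitterRunning, jobRunning, jobFailed bool) string {
	switch {
	case jobFailed:
		return "Failed"
	case jobRunning:
		return "Running"
	case submitterRunning:
		return "Deploying"
	default:
		return "Pending"
	}
}
```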

hjwalt commented Sep 27, 2021

Could you explain more about your problem?

Just putting it out there that I don't have a lot of experience testing k8s operators :). It just seems to me that old savepoint information on existing clusters will be lost when upgrading the operator. I didn't test it, so I could be wrong.

In my opinion, since a k8s controller operates on the mechanism "observe --> calculate desired state --> reconcile", it is not appropriate to apply a state machine. For example, depending on the status at the time of observation, a state may sometimes have to be skipped. As far as I know, k8s native controllers are not implemented as state machines for these reasons.

Yes, we can't attach it to the reconcile cycle; I'm referring more to using an FSM in the internal logic for job status. It's just a thought that an FSM is better for maintenance sanity. k8s works with fairly simple status transitions (like waiting, running, terminated, completed, error, crashloopbackoff, failed). The job status you proposed is much more complicated than that.

elanv commented Sep 27, 2021

Just putting it out there that I don't have a lot of experience testing k8s operators :). It just seems to me that old savepoint information on existing clusters will be lost when upgrading the operator. I didn't test it, so I could be wrong.

It seems good to leave the fields you pointed out in v1beta1.

Yes, we can't attach it to the reconcile cycle; I'm referring more to using an FSM in the internal logic for job status. It's just a thought that an FSM is better for maintenance sanity. k8s works with fairly simple status transitions (like waiting, running, terminated, completed, error, crashloopbackoff, failed). The job status you proposed is much more complicated than that.

Understood. The updater's job status routine is particularly complex, and it would be nice if it could be improved.

And to support the blocking mode, the job tracking routine needs to be reverted, but some time is needed to verify that it works well.

elanv commented Oct 9, 2021

@hjwalt The existing savepoint location does not seem to be lost. The savepoint location is recorded as SavepointLocation and remains the same. However, duplicates of some other savepoint-related information have been removed or changed. You can check the CRD changes by looking at the diff of api/v1beta1/flinkcluster_types.go.

elanv commented Oct 9, 2021

@regadas Almost finished. Could you review this PR? There are some changes in the functions for extracting logs.

regadas commented Oct 13, 2021

Awesome! Thanks @elanv I'll have a look at this later today!

regadas left a comment


This LGTM; most of the comments can actually be addressed in follow-up PRs. @elanv Are you adding more changes to this PR, or can I go ahead and merge?

@@ -376,12 +377,20 @@ type JobSpec struct {
// Allow non-restored state, default: false.
AllowNonRestoredState *bool `json:"allowNonRestoredState,omitempty"`

// Should take savepoint before upgrading the job, default: false.
TakeSavepointOnUpgrade *bool `json:"takeSavepointOnUpgrade,omitempty"`

Could we remove this in a future release? Keeping it backward compatible for now.

I guess we can break it now; I don't give guarantees with v1beta1.

Comment on lines +278 to +280
var recordedJob = recorded.Components.Job
var extractLog = recordedJob != nil && recordedJob.State == v1beta1.JobStateDeploying
err = observer.observeSubmitter(extractLog, &submitter)
🎨 We could move the extractLog logic inside observeSubmitter to make the method's intent clearer.
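For illustration, one way the suggested refactor might look; the signature, body, and stub types here are assumptions standing in for the operator's real types, not code from this PR.

```go
package controllers

// Stub types standing in for the operator's real ones; illustration only.
type JobStatus struct{ State string }

type Submitter struct{}

type ClusterStateObserver struct{}

const JobStateDeploying = "Deploying"

// observeSubmitter decides internally whether to extract the submitter log,
// based on the recorded job state, instead of the caller computing extractLog.
func (o *ClusterStateObserver) observeSubmitter(recordedJob *JobStatus, submitter *Submitter) error {
	extractLog := recordedJob != nil && recordedJob.State == JobStateDeploying
	if extractLog {
		// ... fetch and attach the submitter pod log here ...
	}
	// ... observe the submitter job/pod as before ...
	return nil
}
```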

var jobSpec = reconciler.observed.cluster.Spec.Job
var jobStatus = reconciler.observed.cluster.Status.Components.Job
var savepointStatus = reconciler.observed.cluster.Status.Savepoint
func (reconciler *ClusterReconciler) shouldTakeSavepoint() string {
It would be nice to return a proper reason type here.
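For example, a dedicated reason type might look something like this; the exact values are assumptions, loosely following the SavepointTriggerReason* naming seen elsewhere in this PR.

```go
package v1beta1

// SavepointTriggerReason is a sketch of a dedicated reason type that
// shouldTakeSavepoint could return instead of a bare string. The values are
// assumptions, not the actual constants in this PR.
type SavepointTriggerReason string

const (
	SavepointTriggerReasonScheduled   SavepointTriggerReason = "Scheduled"
	SavepointTriggerReasonUpdate      SavepointTriggerReason = "Update"
	SavepointTriggerReasonJobCancel   SavepointTriggerReason = "JobCancel"
	SavepointTriggerReasonUserRequest SavepointTriggerReason = "UserRequest"
)
```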

elanv commented Oct 15, 2021

This LGTM; most of the comments can actually be addressed in follow-up PRs. @elanv Are you adding more changes to this PR, or can I go ahead and merge?

@regadas Thanks! There are no more commits to add. I'll make a new PR for the comments if it's okay.

regadas merged commit 6019051 into spotify:master on Oct 15, 2021