Elaborate savepoint and update features #107
Conversation
- fix handling of failed auto savepoints
- fix validations and tests related to the changes
- improve the update routine
- change the behavior of handling unexpected jobs
- add a constraint for update: when `takeSavepointOnUpdate` is true, the latest savepoint age should be less than `maxStateAgeToRestore`
Hello @regadas.
Currently this PR does not work with the "blocking" job mode. It would be nice if we could discuss issue #110 too.
@elanv @regadas I have some questions about the savepoint status:
Hi @hjwalt. For the first one, I also hope for a feature like managing multiple jobs on a single session cluster. Do you mean this feature too? I wrote an issue about it in the gcp operator repo. However, I think it is better to introduce a new CRD, as I wrote in that issue, than to add this feature to the existing CRD. For the second, only the last successful savepoint is recorded in the cluster status.
@elanv thank you for the explanation. Yes, it is as you described, having multiple job management on one session cluster, and yes, it's not relevant for this PR.
@elanv more thoughts after reading through this PR:
The current confusion in the recovery mechanism affects me too. If you are open to coordinating the effort for a new CRD version, I can help implement some of the features.
When the updated operator observes the existing FlinkCluster, it records the calculated state according to the changed CRD data structure and then reconciles based on it, so if you change the CRD carefully, you can switch to the new CRD smoothly. And since the code of the gcp operator has not been well maintained, there are many critical bugs covered by this PR. If there are no critical problems, I think it is better to stabilize even if there are small breaking changes. #85, #95, and #115 seem related to this PR. Could you explain more about your problem?
In my opinion, since a k8s controller operates on the mechanism "observe --> calculate desired state --> reconcile", it is not appropriate to apply a state machine. For example, depending on the status at the time of observation, a state may sometimes have to be skipped. As far as I know, k8s native controllers are not implemented as state machines for these reasons. I attached a diagram indicating the state transitions of the job to this PR, but it is not intended to implement a state machine. Those are only the states observed by the operator for the reconciliation stage. Therefore, states may be skipped because they are recorded at the time of observation.
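To make the contrast concrete, here is a toy Go sketch (not the operator's actual code) of one reconcile cycle: the desired state is recomputed from each observation snapshot, so intermediate states are simply never visited rather than transitioned through.

```go
package main

import "fmt"

// Toy stand-ins for the operator's observed and desired state.
type Observed struct {
	JobRunning     bool
	SavepointFresh bool
}

type Desired struct {
	TriggerSavepoint bool
}

// The desired state is derived afresh from the snapshot each cycle;
// no transition table is consulted, so if the world changed between
// observations, intermediate "states" are skipped entirely.
func computeDesired(o Observed) Desired {
	return Desired{TriggerSavepoint: o.JobRunning && !o.SavepointFresh}
}

func reconcile(d Desired) {
	if d.TriggerSavepoint {
		fmt.Println("triggering savepoint")
	}
}

func main() {
	// One cycle: observe -> calculate desired state -> reconcile.
	observed := Observed{JobRunning: true, SavepointFresh: false}
	reconcile(computeDesired(observed))
}
```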
Just putting it out there that I don't have a lot of experience testing k8s operators :) It just seems to me that old savepoint information on existing clusters will be lost when upgrading the operator. I didn't test it, so I could be wrong.
Yes, we can't attach it to the reconcile cycle; I'm referring more to using an FSM in the internal logic for job status. It's just a thought that an FSM is better for maintenance sanity. k8s works on fairly simple status transitions (like waiting, running, terminated, completed, error, crashloopbackoff, failed). The job status you proposed is much more complicated than that. A rough sketch of what I mean is below.
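For illustration, a minimal Go sketch of an explicit transition table; the states and transitions here are illustrative, not taken from the operator:

```go
package main

import "fmt"

type JobState string

const (
	Pending   JobState = "Pending"
	Deploying JobState = "Deploying"
	Running   JobState = "Running"
	Failed    JobState = "Failed"
)

// An explicit transition table makes every legal move auditable in one
// place, which is the maintenance benefit being argued for here.
var transitions = map[JobState][]JobState{
	Pending:   {Deploying},
	Deploying: {Running, Failed},
	Running:   {Failed},
}

func canTransition(from, to JobState) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Pending, Running)) // false: must pass through Deploying
}
```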
It seems good to leave the fields you pointed out in v1beta1.
Understood. The updater's job status routine is particularly complex, and it would be nice if it could be improved. To support the blocking mode, the job tracking routine needs to be reverted, but it seems time is needed to verify that it works well.
@hjwalt It does not seem to lose the existing savepoint location. The savepoint location is recorded in the cluster status.
@regadas Almost finished. Could you review this PR? There are some changes in the functions for extracting logs.
Awesome! Thanks @elanv, I'll have a look at this later today!
This LGTM; most of the comments can actually be addressed in follow-up PRs. @elanv Are you adding more changes to this PR, or can I go ahead and merge?
In `type JobSpec struct`:

```go
// Allow non-restored state, default: false.
AllowNonRestoredState *bool `json:"allowNonRestoredState,omitempty"`

// Should take savepoint before upgrading the job, default: false.
TakeSavepointOnUpgrade *bool `json:"takeSavepointOnUpgrade,omitempty"`
```
Could we remove this in a future release? Keeping it backward compatible for now.
I guess we can break it now; I don't give guarantees with v1beta1.
```go
var recordedJob = recorded.Components.Job
var extractLog = recordedJob != nil && recordedJob.State == v1beta1.JobStateDeploying
err = observer.observeSubmitter(extractLog, &submitter)
```
🎨 we can move the `extractLog` logic inside `observeSubmitter` to make the method's intent clearer.
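A self-contained sketch of that suggestion, with stub types standing in for the operator's real `v1beta1` and observer types (the names and signatures here are assumptions, not the PR's code):

```go
package main

import "fmt"

// Stubs so the sketch compiles on its own; the real definitions live
// in the operator's api and controller packages.
const JobStateDeploying = "Deploying"

type JobStatus struct{ State string }
type Submitter struct{ Log string }

type observer struct {
	recordedJob *JobStatus
}

// observeSubmitter now derives extractLog internally from the recorded
// job state, so callers no longer compute it themselves.
func (o *observer) observeSubmitter(submitter *Submitter) error {
	extractLog := o.recordedJob != nil && o.recordedJob.State == JobStateDeploying
	if extractLog {
		fmt.Println("extracting submitter log:", submitter.Log)
	}
	return nil
}

func main() {
	o := &observer{recordedJob: &JobStatus{State: JobStateDeploying}}
	_ = o.observeSubmitter(&Submitter{Log: "job submitted"})
}
```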
```go
var jobSpec = reconciler.observed.cluster.Spec.Job
var jobStatus = reconciler.observed.cluster.Status.Components.Job
var savepointStatus = reconciler.observed.cluster.Status.Savepoint

func (reconciler *ClusterReconciler) shouldTakeSavepoint() string {
```
It would be nice to return a proper reason type here.
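For example, a minimal sketch of what a typed reason could look like; the type name, the reason constants, and the stubbed `ClusterReconciler` are illustrative assumptions, not the PR's actual code:

```go
package main

import "fmt"

// SavepointReason replaces the bare string return; these reason values
// are illustrative, not the operator's actual constants.
type SavepointReason string

const (
	SavepointReasonNone      SavepointReason = ""
	SavepointReasonScheduled SavepointReason = "ScheduledSavepoint"
	SavepointReasonUpdate    SavepointReason = "UpdateJob"
	SavepointReasonJobCancel SavepointReason = "CancelJob"
)

// ClusterReconciler is stubbed so the sketch compiles on its own.
type ClusterReconciler struct{ savepointDue bool }

// shouldTakeSavepoint returns a typed reason instead of a raw string,
// so callers can switch over well-known values.
func (r *ClusterReconciler) shouldTakeSavepoint() SavepointReason {
	if r.savepointDue {
		return SavepointReasonScheduled
	}
	return SavepointReasonNone
}

func main() {
	r := &ClusterReconciler{savepointDue: true}
	fmt.Println(r.shouldTakeSavepoint()) // ScheduledSavepoint
}
```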
Purpose of this PR
Currently, savepoint-related routines are scattered across several places, which makes it difficult to enhance this operator. This PR organizes them so that savepoint-related routines can be improved and extended in the future. It also improves the update, cancel, and recovery features that depend on the savepoint routines.
resolve #84
fix #85
fix #95
fix #115
Changes
Details
- Organize and fix the savepoint routine
  - Remove `lastSavepointTriggerTime` and `lastSavepointTriggerID`: duplicated with `status.savepoint`
  - Base savepoint scheduling on `status.job.startTime` and delete `SavepointTriggerReasonScheduledInitial`
- Change job stop behavior when updating and cancelling a job
- Improve the update process
  - `takeSavepointOnUpdate` field: rename `takeSavepointOnUpgrade` to `takeSavepointOnUpdate`
  - When `takeSavepointOnUpdate` is true, the latest savepoint age should be less than `maxStateAgeToRestore`
- Elaborate the update/restart strategy
  - Limit the job state age from which the job can be restarted, via `MaxStateAgeToRestoreSeconds`, when auto-restarting from failure, updating a stopped job, and updating a running job with `takeSavepointOnUpdate` false (see the sketch below)
- Add new job deployment states
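A hedged sketch of the age constraint described above; the field names follow the PR description, while the surrounding types and helper function are stand-ins rather than the operator's real API:

```go
package main

import (
	"fmt"
	"time"
)

// Stand-ins for the relevant spec and status fields.
type JobSpec struct {
	TakeSavepointOnUpdate       *bool
	MaxStateAgeToRestoreSeconds *int32
}

type SavepointStatus struct {
	CompletionTime time.Time
}

// canRestoreFromSavepoint checks the constraint described in the PR:
// the latest savepoint must be younger than MaxStateAgeToRestoreSeconds
// for the job to be restarted or updated from it.
func canRestoreFromSavepoint(spec JobSpec, sp *SavepointStatus, now time.Time) bool {
	if spec.MaxStateAgeToRestoreSeconds == nil {
		return true // no limit configured
	}
	if sp == nil {
		return false // nothing to restore from
	}
	maxAge := time.Duration(*spec.MaxStateAgeToRestoreSeconds) * time.Second
	return now.Sub(sp.CompletionTime) < maxAge
}

func main() {
	maxAge := int32(300)
	spec := JobSpec{MaxStateAgeToRestoreSeconds: &maxAge}
	sp := &SavepointStatus{CompletionTime: time.Now().Add(-10 * time.Minute)}
	fmt.Println(canRestoreFromSavepoint(spec, sp, time.Now())) // false: savepoint too old
}
```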