Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a BSL controller to handle validation + update BSL status phase #2490

Closed
wants to merge 6 commits into from

Conversation

carlisia
Copy link
Contributor

@carlisia carlisia commented May 1, 2020

  • add a controller-runtime controller that continuously validates BSLs and marks them as available (valid) or unavailable (invalid), and logs accordingly.
  • remove BSL validation from the server startup

Signed-off-by: Carlisia carlisia@vmware.com

@carlisia carlisia changed the title Add BSL status phase based on validation and log according; remove validation from the server Add BSL status phase based on validation and log accordingly; remove validation from the server May 1, 2020
@carlisia carlisia self-assigned this May 1, 2020
Copy link
Contributor

@nrb nrb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the whole this looks good! The main question I have right now is around resync periods.

Carlisia added 4 commits May 4, 2020 08:54
Signed-off-by: Carlisia <carlisia@vmware.com>
Signed-off-by: Carlisia <carlisia@vmware.com>
Signed-off-by: Carlisia <carlisia@vmware.com>
Signed-off-by: Carlisia <carlisia@vmware.com>
@carlisia
Copy link
Contributor Author

carlisia commented May 4, 2020

I added:

  • StoreValidationPeriod
  • LastValidationTime

Much in line with what's in the BackupSynchController.

I also added this, for the state between creation and validation:

  • BackupStorageLocationPhaseUnverified

However, although I have verified this new phase was included in the CRD, it does not mark it as the default. Not sure if I'm doing this wrong:
https://github.com/vmware-tanzu/velero/pull/2490/files#diff-7c4f4164f55721d9493c9824d4735f60R101

After this PR is merged, I'm going to add these columns (StoreValidationPeriod and LastValidationTime) to the BSL output: #2421

Copy link
Member

@skriss skriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to look at the body of the controller in a little more detail, but so far so good!

We also need to think about how other controllers/packages should incorporate this information. For example, AFAIK, even if a BSL is marked "Unavailable", the backup controller will still try to process backups targeting that location. Those backups will theoretically fail somewhere along the line, at least at the upload step, but ideally we'd check up-front if the BSL is valid or not, and fail validation if it's invalid.

This brings up another nuance, which is - if we add a validation check in the backup controller to ensure the BSL is "Available" - should we rely on the CR field, or should we actually try to connect to the object store (e.g. via the backupStore.IsValid() method)? Relying on the CR field is easier/more efficient, but it could theoretically be out-of-date.

newBackupStore: persistence.NewObjectBackupStore,
}

c.resyncFunc = c.run
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in addition to the periodic validation via this resync func, we should also add OnAdd/OnUpdate event handlers to the backup storage location informer so that whenever a BSL is created or updated, it immediately triggers a validation. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea.

Would we need both the handlers and the BSL validation though?

I'm thinking add the validation to the handlers (create/update), and keep the BSL looping and checking the CR field just so it does the proper periodic logging. Would this work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This brings up another nuance, which is - if we add a validation check in the backup controller to ensure the BSL is "Available" - should we rely on the CR field, or should we actually try to connect to the object store (e.g. via the backupStore.IsValid() method)? Relying on the CR field is easier/more efficient, but it could theoretically be out-of-date.

If we have the handler validating right on update then it will never be out-of-date. Right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to do the validation both when a BSL is added/updated, and periodically (via the resync func). That way, we immediately get an updated phase when the BSL is created or changed, but we are also regularly checking to ensure we can still connect to the object store.

Copy link
Member

@skriss skriss May 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could still be out-of-date: the minio instance goes down, the S3 credentials are expired, the bucket is deleted, etc. None of those would involve a change to the BSL, but could still mean that it can't be connected to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point!

@carlisia
Copy link
Contributor Author

carlisia commented May 4, 2020

Ok recapping, I will:

  1. add handlers to do additional validation/update of a BSL status on BSL create/update
  2. keep the validation on the BSL controller so it also validates/updates (and logs) it there
  3. add a validation to backup creation to check if the corresponding BSL is valid

This brings up another nuance, which is - if we add a validation check in the backup controller to ensure the BSL is "Available" - should we rely on the CR field, or should we actually try to connect to the object store (e.g. via the backupStore.IsValid() method)? Relying on the CR field is easier/more efficient, but it could theoretically be out-of-date.

This question pertains to item # 3. My thinking below, taking into account @nrb's comment:

We may want to make this tuneable, as there can be monetary access costs associated with hitting services like S3. This change moves the validation from a one-time startup operation to a recurring one, so I don't want to surprise people with new bills :)

I think doing the validation on create/update + periodically could be good enough and checking for the BSL status only (as opposed to the store is valid) at backup creation is the way to go. Since there are potential costs associated with it, we should aim for less connections to the storage rather than more. Another factor is that each BSL can have its validation period decreased for more frequent checks: https://github.com/vmware-tanzu/velero/pull/2490/files#diff-0528e5cf4cf25ec346aa95f0bd1b2fd4R79 so the user has this level of control.

WDYT? We could go with this and add it to the backup creation later if we find out we need it.

Updated from the original

@skriss
Copy link
Member

skriss commented May 4, 2020

That all sounds reasonable to me! Just a couple clarifications/related comments:

  • for item # 1, those should go inside the same controller as item # 2 - so you'll have both event handlers on the informer, and a periodic resync function, in that one controller. The restic repository controller does something very similar: https://github.com/vmware-tanzu/velero/blob/master/pkg/controller/restic_repository_controller.go#L78-L87
  • there are probably other controllers beyond the backup controller where we want to check the phase of the BSL up-front. Probably need to just go through them one-by-one and see if it makes sense to add a check on that phase somewhere at the start of their processing

@carlisia
Copy link
Contributor Author

carlisia commented May 4, 2020

Oh ok, I was putting the handlers somewhere else, haha.

Great heads up all around. I will comb the controllers. Thanks!

Bucket string
Prefix string
BackupSyncPeriod time.Duration
StoreValidationPeriod time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would validationFrequency be a better name for this field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, yes indeed!

@@ -107,6 +109,10 @@ func (o *CreateOptions) Validate(c *cobra.Command, args []string, f client.Facto
return errors.New("--backup-sync-period must be non-negative")
}

if o.StoreValidationPeriod < 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we set this to default and continue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. What if the user for some reason expects that value to take, and we give them another value. Could this be possible?

newPluginManager func(logrus.FieldLogger) clientmgmt.Manager,
logger logrus.FieldLogger,
) Interface {
if defaultStoreValidationPeriod <= 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if defaultStoreValidationPeriod == 0 doesn't that mean validation is disabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes so, this is confusing because of two flags with the same name. Mainly I set this up following the backup-sync-controller. We can remove this or change the condition here.

This right here is a flag on the server. It is set so that IF location.Spec.StoreValidationPeriod further below is not set, we'll use this value. And so, if there wasn't a value passed thru the server either, it is set to 1min.

We could change this to match the BSL flag, in other words, if <=0 then don't ever validate. But I don't think we want to never validate. And I know that's not what you are saying either. Since we do have to validate at some interval, we might as well make it configurable thru the server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addendum: *we don't want to never validate a BSL >unless< the user specifically configures it that way.

for _, location := range locations {
locationName := location.Name
log = c.logger.WithField("backupLocation", locationName)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is location.Spec.StoreValidationPeriod == nil also a condition where we should skip validation?

Copy link
Contributor Author

@carlisia carlisia May 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Only a specific value of zero is meant to skip validation. Any other value will set the frequency to that value (obvs), and nil means they didn't set it, which then the frequency will be set to the server value, either one set by the user, or the default. Pls be sure to see my comment up above.

Signed-off-by: Carlisia <carlisia@vmware.com>
@carlisia
Copy link
Contributor Author

carlisia commented May 5, 2020

At any rate.. relying on len(unavailableErrors) == len(locations) no longer works because due to the validation only occurring based on the validation frequency, the validation might be skipped for that reason so we don't know at the end of the iteration what that number truly is.

I'm going to add a method to return all BSLs that are in each phase, or maybe just the count for each, so this logging logic will be much cleaner and the rules clearer.

It might be something we'll want to add to the server status anyway, the listing of each existing BSL, grouped by status/phase.

Signed-off-by: Carlisia <carlisia@vmware.com>
@skriss
Copy link
Member

skriss commented May 8, 2020

@carlisia the kubebuilder convo reminded me that i do have a couple recommendations on how to structure this that (a) I think would make a potential migration to kubebuilder easier, and (b) would make testing easier, regardless of whether we move to kubebuilder or not. I'm not going to have time to write detailed notes until next week though, but here's a quick summary:

The main idea would be to keep the core "business logic" out of the controller package/struct, and extract it to its own package with a clearly defined interface. the controller should just handle responding to informer events, putting things on a work queue, taking things off the work queue, and then calling into the "business logic" package to actually handle the processing. This would be similar to (a) the server status request controller; and (b) the work I was doing in #2317. Having all the business logic in the controller package and structs IMHO is not the pattern we want to be following going fwd.

@carlisia
Copy link
Contributor Author

carlisia commented May 8, 2020

Yes, I started thinking this very thing when I was thinking about the webhooks and, regardless, you're right that the structure needs to have the logic not in the controller. Glad for the nudge.

@carlisia carlisia marked this pull request as draft May 27, 2020 23:56
@carlisia
Copy link
Contributor Author

Moved it to Draft because likely will be layering this work on top of #2561.

@carlisia carlisia changed the title Add BSL status phase based on validation and log accordingly; remove validation from the server Add a BSL controller to handle validation + update BSL status phase Jun 3, 2020
@carlisia
Copy link
Contributor Author

carlisia commented Jun 3, 2020

Closing this PR. I'll be redoing this in a new PR using the kubebuilder/controller-runtime framekwork as per #2597.

@carlisia carlisia closed this Jun 3, 2020
@carlisia carlisia mentioned this pull request Jun 3, 2020
21 tasks
@carlisia carlisia added the Enhancement/Dev Internal or Developer-focused Enhancement to Velero label Jun 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement/Dev Internal or Developer-focused Enhancement to Velero
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants