Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if PV/PVC Access Modes are Supported by the Volume Provider #139

Merged
merged 41 commits into from Dec 14, 2022

Conversation

rrpolanco
Copy link
Contributor

@rrpolanco rrpolanco commented Nov 21, 2022

Currently, when migrating data, we do not consider whether the source PV's access modes are supported by the destination's volume provider. Not having this check makes the migrator prone to errors during the migration which is not the optimal experience. E.g. if migration from a volume that supports access modes RWX(Minio) to a volume which only supports RWO (local volume). For cases such as this, we want to warn the users about such errors prior to the migration so they can fix them ahead of time and have a smoother migration experience.

Solution

pvmigrate client and library will support running volume access modes validation checks on the destination PV. The checks will verify if the access modes on any of the source PVCs are supported by the destination volume provider. This is accomplished by creating temporary headless pods which create and consume the destination storage class volume with the access modes of the source PVCs. Any failures will be shown to the user via the terminal.

How it Works

At a high level, the validation does the following:

  1. Retrieve all PVs for the source storage class
  2. For each of the PVs gathered, get their associated PVC reference
  3. For each source PVC:
    1. Create a temporary PVC using source PVC volume access modes
    2. Create a temporary POD to consume the temp PVC
    3. Wait for temp POD to be ready, if not ready then it's highly likely there was an error creating the volume
      1. Get the volume provisioning error by inspecting the events for the PVC
  4. Collect all the PVC errors and print them in a table format
NAMESPACE               PVC                             SOURCE                        REASON                           MESSAGE
---------               ---                             -----------                   --------------                   ----------
default                 rook-rwx-block-claim            openebs.io/local_openebs-localpv-provisioner-77c9bcfd96-v4dnj_311601fb-2832-4f3d-b464-e78d3895b046        ProvisioningFailed               failed to provision volume with StorageClass "openebs-localpv": Only support ReadWriteOnce access mode

Ref: Internal ticket tracker [sc-61144]

@rrpolanco
Copy link
Contributor Author

[sc-61144]

resources: []runtime.Object{
&storagev1.StorageClass{
ObjectMeta: metav1.ObjectMeta{
Name: "srcSc",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we try to add names here that clarifies WHEN.. THEN .. EXPECTED?
more info: https://dev.to/smyrman/test-with-expect-a-bdd-style-go-naming-pattern-5eh5

Example: when has x should not fail

Copy link
Contributor Author

@rrpolanco rrpolanco Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. I'll give the TWE naming convention a try.


// checkVolumeAccessModeValid checks if the access modes of a pv are supported by the
// destination storage class.
func (p *PVMigrator) checkVolumeAccessModes(pvc corev1.PersistentVolumeClaim) (PVCError, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we not here return error only?
That is little strange return PVCError and error. is not PVCError an error as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could return (error, error) however, I would have to deal with the type conversation later on. I rather use the concrete type directly, PVCErrror, which represents the error from the PVC event.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just return error, and have the cause of the PVCError be the parent error if one exists?

case <-waitInterval.C:
continue
case <-migrationTimeout.C:
return fmt.Errorf("migration pod %s failed to go into Running phase: timedout", createdPod.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have we a CR that represents the service? if so, could we use use conditional status?
So that, we can properly set the status of this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a CR. This project is both a CLI and library for doing PVC migrations.
What do you mean by

if so, could we use use conditional status?
So that, we can properly set the status of this one?

@@ -61,6 +85,9 @@ func Cli() {
flag.BoolVar(&options.SetDefaults, "set-defaults", false, "change default storage class from source to dest")
flag.BoolVar(&options.VerboseCopy, "verbose-copy", false, "show output from the rsync command used to copy data between PVCs")
flag.BoolVar(&options.SkipSourceValidation, "skip-source-validation", false, "migrate from PVCs using a particular StorageClass name, even if that StorageClass does not exist")
flag.BoolVar(&options.SkipPVAccessModeValidation, "skip-pv-access-mode-validation", false, "skip the volume access modes validation on the destination storage provider")
flag.IntVar(&options.PvcCopyTimeout, "timeout", 300, "length of time to wait (in seconds) when transferring data from the source to the destination storage volume")
flag.IntVar(&options.PodReadyTimeout, "pod-ready-timeout", 60, "length of time to wait (in seconds) for volume validation pod(s) to go into Ready phase")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: how we have been using this project?
Do we distribute it like a bin and call in the kURL?
Where and how it is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used as a library in kURL's version of pvmigrate CLI: https://github.com/replicatedhq/kURL/blob/main/kurl_util/cmd/pvmigrate/main.go#L14

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also distributed as a binary publicly, and is installable with krew

go.mod Outdated
@@ -3,16 +3,20 @@ module github.com/replicatedhq/pvmigrate
go 1.19

require (
github.com/google/uuid v1.3.0
github.com/replicatedhq/kurl v0.0.0-20221116172021-8ac37b7b9867
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like it begs to be a circular dep. can we move the method into this project?

Copy link
Contributor Author

@rrpolanco rrpolanco Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm specifically importing github.com/replicatedhq/kurl/pkg/k8sutil. However, nothing in k8sutil package imports this (migrate) package, so no import cycle there. I'd rather no duplicate code here if not absolutely required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then lets move the code to pvmigrate. this project should absolutely not depend on kurl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emosbaugh To sever the kURL dependency I can copy those functions into this project or refactor so as to not depend on those functions per your original suggestion. Agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strongly agree
We should NOT be importing kurl here

go.mod Outdated
k8s.io/api v0.25.4
k8s.io/apimachinery v0.25.4
k8s.io/client-go v0.25.4
k8s.io/kubernetes v1.25.3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v1.25.4

scaleAnnotation = baseAnnotation + "-scale"
kindAnnotation = baseAnnotation + "-kind"
sourceNsAnnotation = baseAnnotation + "-sourcens"
sourcePvcAnnotation = baseAnnotation + "-sourcepvc"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for consistency

Suggested change
sourcePvcAnnotation = baseAnnotation + "-sourcepvc"
sourcePVCAnnotation = baseAnnotation + "-sourcepvc"

VerboseCopy bool
SkipSourceValidation bool
SkipPVAccessModeValidation bool
PvcCopyTimeout int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
PvcCopyTimeout int
PVCCopyTimeout int

}
}

// start the migration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you see any value in adding a --dry-run flag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! Namely for standalone CLI users.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's been requested before, actually: #104

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dry run of course could not create pods

var waitFor []string
propagation := metav1.DeletePropagationForeground
delopts := metav1.DeleteOptions{PropagationPolicy: &propagation}
if err := p.k8scli.CoreV1().PersistentVolumeClaims("default").Delete(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be in the same namespace as the src pvc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temp pod is created in default ns. See #139 (comment)

Containers: []corev1.Container{
{
Name: "busybox",
Image: "busybox",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work for airgap customers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! This image would need to be loaded into Docker or containerd unless it's been loaded already by another image which builds atop of busybox.

Hmm, @ricardomaraschini node disk space validator would also run into the same issue if it was integrated in the kURL installer. The disk space validator relies on the rsync image to be available. It's probably a good idea for both to use the same container image. I need to think about how to best handle this. I'm currently thinking we should pick an image to use, archive it and then load it during install.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already require an image to be passed via the CLI to do the migration with
Use that image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flag.StringVar(&options.RsyncImage, "rsync-image", "eeacms/rsync:2.3", "the image to use to copy PVCs - must have 'rsync' on the path")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved: We can use the kurl util image and specify it as part of the CLI as show here: https://github.com/replicatedhq/kURL/blob/aea79861716d66787f0b31670f1fc74a7ee16d1f/scripts/common/rook.sh#L153

propagation := metav1.DeletePropagationForeground
delopts := metav1.DeleteOptions{PropagationPolicy: &propagation}
if pod != nil {
if err := p.k8scli.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, delopts); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err := p.k8scli.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, delopts); err != nil {
// Cleanup should use background context so as not to fail if context has already been canceled
if err := p.k8scli.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, delopts); err != nil {

}

// getPvcError returns the error event for why a PVC is in Pending status
func (p *PVMigrator) getPvcError(pvc *corev1.PersistentVolumeClaim) (PVCError, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (p *PVMigrator) getPvcError(pvc *corev1.PersistentVolumeClaim) (PVCError, error) {
func (p *PVMigrator) getPVCError(pvc *corev1.PersistentVolumeClaim) (PVCError, error) {

}
waitFor = append(waitFor, pvc.Name)

timeout := time.NewTicker(5 * time.Minute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
timeout := time.NewTicker(5 * time.Minute)
timeout := time.After(5 * time.Minute)

@@ -79,6 +106,37 @@ func Cli() {

output := log.New(os.Stdout, "", 0) // this has no time prefix etc

// escape hatch - for DEV/TEST ONLY
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is for dev/test only, that should be mentioned in the flag description - or the flag should be hidden.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the comment. I added this in the case where there is a rare scenario where someone may a justifiable reason to ignore a PVC access volume validation.

if !options.SkipPVAccessModeValidation {
srcPVs, err := kurlutils.PVSByStorageClass(context.TODO(), clientset, options.SourceSCName)
if err != nil {
fmt.Printf("failed to get volumes using storage class %s: %s", options.SourceSCName, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fmt.Printf("failed to get volumes using storage class %s: %s", options.SourceSCName, err)
output.Printf("failed to get volumes using storage class %s: %s", options.SourceSCName, err)

ethan commented the same elsewhere, but let's avoid fmt.printf when we have output available

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree.

@@ -134,6 +192,82 @@ func Migrate(ctx context.Context, w *log.Logger, clientset k8sclient.Interface,
return nil
}

// NewPVMigrator returns a PV migration context
func NewPVMigrator(cfg *rest.Config, log *log.Logger, srcSC, dstSC string, podReadyTimeout int) (*PVMigrator, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this used by?
What is the use case?

Aren't you initializing the struct directly elsewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this whole function is unused and should not exist

I think there should be a single function to run the whole 'can I migrate from sc A to B without having broken access modes' check, but it should take all the args it needs directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truthfully what I'm getting from that is 'oh dear lord why did we start reimplementing this logic in kURL, that was the entire reason that the CLI function exists'

I still think the solution is to move the entire 'can I make proper dest PVCs' logic inside of 'Migrate'

@@ -79,6 +106,37 @@ func Cli() {

output := log.New(os.Stdout, "", 0) // this has no time prefix etc

// escape hatch - for DEV/TEST ONLY
if !options.SkipPVAccessModeValidation {
srcPVs, err := kurlutils.PVSByStorageClass(context.TODO(), clientset, options.SourceSCName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm relatively certain we have this logic in pvmigrate already - refactoring it into a function would be better than using kurlutils (even besides the 'let's not do circular deps' issue, having the same logic use two different implementations will cause issues)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we do. I can make use of that instead of relying the kurl k8s package.

Comment on lines 59 to 60
PvcCopyTimeout int
PodReadyTimeout int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
PvcCopyTimeout int
PodReadyTimeout int
PvcCopyTimeout time.Duration
PodReadyTimeout time.Duration

Why are we passing around unitless integers when there is literally a type for that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought on this was to make easy on clients...just give me a time in seconds (int) and the implementation will convert it to the type(time.Duration) it needs.

@@ -103,7 +161,7 @@ func Migrate(ctx context.Context, w *log.Logger, clientset k8sclient.Interface,
return fmt.Errorf("failed to scale down pods: %w", err)
}

err = copyAllPVCs(ctx, w, clientset, options.SourceSCName, options.DestSCName, options.RsyncImage, updatedMatchingPVCs, options.VerboseCopy, time.Second)
err = copyAllPVCs(ctx, w, clientset, options.SourceSCName, options.DestSCName, options.RsyncImage, updatedMatchingPVCs, options.VerboseCopy, time.Duration(options.PvcCopyTimeout)*time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just.... no
I've requested a change to make the option actually a duration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will accept that change.

// ValidateVolumeAccessModes checks whether the provided persistent volumes support the access modes
// of the destination storage class.
// returns a map of pvc errors indexed by namespace
func (p *PVMigrator) ValidateVolumeAccessModes(pvs map[string]corev1.PersistentVolume) (map[string]map[string]PVCError, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this function public?

If we're going to keep it public let's move it (and all supporting infa) to a new package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one benefit I see for keeping this public and separating the validation from the migration behaviour is that in the near future we could rollup all the different validation checks into one package or method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but even if we do that we probably don't want to encourage calling one individual validator.

Better to keep this private until we 100% have a need for it externally

Comment on lines 368 to 369
timeoutCtx, cancelCtx := context.WithTimeout(ctx, timeout)
defer cancelCtx()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this doesn't work great in a for loop.

Also, just more generally, this isn't an OK change IMO - the current logic is 'keep copying as long as it's not stuck', and since some copies may legitimately take hours I would rather not have to wait hours to find out that something was really stuck

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you're doubling up on timeouts here - you've added timeout logic to copyOnePVC but also to the context it's being passed. Choose one (but preferably choose none in this PR), if we need to do that it needs more discussion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be fleshed out bit a more. Namely, figure out what constitutes as "stuck" and how to detect that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created internal tracking ticket for handling timeouts in this context.

@@ -229,7 +376,7 @@ func copyAllPVCs(ctx context.Context, w *log.Logger, clientset k8sclient.Interfa
return nil
}

func copyOnePVC(ctx context.Context, w *log.Logger, clientset k8sclient.Interface, ns string, sourcePvcName string, destPvcName string, rsyncImage string, verboseCopy bool, waitTime time.Duration, nodeName string) error {
func copyOnePVC(ctx context.Context, w *log.Logger, clientset k8sclient.Interface, ns string, sourcePvcName string, destPvcName string, rsyncImage string, verboseCopy bool, nodeName string, timeout time.Duration) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please back out all of the changes for this function. They do not appear related to the main thrust of the PR and require more discussion (as they prevent legitimately slow/large transfers from working, or prevent stalled transfers from being detected early, depending on configuration)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll back this out.

@@ -1146,3 +1296,249 @@ func resetReclaimPolicy(ctx context.Context, w *log.Logger, clientset k8sclient.

return nil
}

// buildPVConsumerPod creates a pod spec for consuming a pvc
func (p *PVMigrator) buildPVConsumerPod(pvcName string) *corev1.Pod {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we pass in options instead of using an object?
It's not the patten the project has used thus far.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laverya you want buildPVConsumerPod() to accept the name override as an argument instead of pulling it from the receiver?

Comment on lines 1307 to 1309
if len(podName) > 63 {
podName = podName[0:31] + podName[len(podName)-32:]
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I swear this logic is present elsewhere in the code
could you factor them into a common function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup.

Comment on lines 1358 to 1360
if len(pvcName) > 63 {
pvcName = pvcName[0:31] + pvcName[len(pvcName)-32:]
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another instance of this logic to factor out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will address this.

Comment on lines 1403 to 1412
defer func() {
if err = p.deleteTmpPVC(tmpPVC); err != nil {
p.log.Printf("failed to delete tmp claim: %s", err)
}
}()
defer func() {
if err = p.deletePVConsumerPod(pvConsumerPod); err != nil {
p.log.Printf("failed to delete pv consumer pod %s: %s", pvConsumerPod.Name, err)
}
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO these should be the same defer func, not two

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can consolidate these.

Comment on lines 110 to 130
if !options.SkipPVAccessModeValidation {
srcPVs, err := kurlutils.PVSByStorageClass(context.TODO(), clientset, options.SourceSCName)
if err != nil {
fmt.Printf("failed to get volumes using storage class %s: %s", options.SourceSCName, err)
os.Exit(1)
}

pvMigrator := PVMigrator{
ctx: context.TODO(),
log: output,
k8scli: clientset,
srcSc: options.SourceSCName,
dstSc: options.DestSCName,
deletePVTimeout: 5 * time.Minute,
podReadyTimeout: time.Duration(options.PodReadyTimeout) * time.Second,
}
unsupportedPVCs, err := pvMigrator.ValidateVolumeAccessModes(srcPVs)
if err != nil {
fmt.Printf("failed to validate volume access modes for destination storage class %s", options.DestSCName)
os.Exit(1)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would greatly prefer for this logic to be contained within 'Migrate' (after the 'do these storageclasses exist' check, for instance), not independent

If it must be independent, it should be callable as a single function that returns an error and should handle getting the list of PVCs itself.


// buildPVC creates a temporary PVC requesting for 1Mi of storage for a provided storage class name.
func (p *PVMigrator) buildTmpPVC(pvc corev1.PersistentVolumeClaim, sc string) *corev1.PersistentVolumeClaim {
pvcName := p.pvcNameOverride
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of having pvcNameOverride, I'd generally recommend using the source PVC's name with a suffix and then trimmed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And name the pod based on the PVC being tested too)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes things much clearer to any cluster operator that encounters this in progress

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name override is meant for testing since we need a deterministic name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name should ALWAYS be deterministic, never random

The way to do that is to take something like my-app-pvc-0 and turn it into my-app-pvc-0-pvcmigrate-accessmodes

Copy link
Member

@laverya laverya Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what is done elsewhere in the code (and in fact that's why we need the 'trim from middle' code)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make this change.

Comment on lines 1376 to 1378
// for testing/mocking
// This fields gets set and updated by the kube api server
Status: corev1.PersistentVolumeClaimStatus{Phase: pvc.Status.Phase},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused - why are you setting status here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're doing this so that you can set the claim status for something else to read in a test function, I'd recommend looking at the backgroundfunc stuff in some of the other test functions - that way you can wait for 1/10th of a second or whatever and ensure that the logic does not require status to be immediately perfect

cmd/main.go Outdated

if len(failures) != 0 {
preflight.PrintValidationFailures(output.Writer(), failures)
os.Exit(2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any possibility for the purposes of kurl we run validate and then migrate separately? then we dont have to have this alternate exit code. if we cant, can we try to map this exit code to standard linux exit codes? also should we use a constant here?

Copy link
Contributor Author

@rrpolanco rrpolanco Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention was to only run validate() as part of the openebs migration prompt here and abort the install if the validate failed. Also, AFAIK there are no standard/conventional exit codes. We could model this after a linux command or come up with our own scheme. I lean towards the latter.

cmd/main.go Outdated
err = migrate.Migrate(context.TODO(), output, clientset, options)
if err != nil {
output.Printf("migration failed: %s", err)
os.Exit(3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be 1

cmd/main.go Outdated
flag.BoolVar(&options.SetDefaults, "set-defaults", false, "change default storage class from source to dest")
flag.BoolVar(&options.VerboseCopy, "verbose-copy", false, "show output from the rsync command used to copy data between PVCs")
flag.BoolVar(&options.SkipSourceValidation, "skip-source-validation", false, "migrate from PVCs using a particular StorageClass name, even if that StorageClass does not exist")
flag.IntVar(&options.PvcCopyTimeout, "pvc-copy-timeout", 300, "length of time to wait (in seconds) when transferring data from the source to the destination storage volume")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
flag.IntVar(&options.PvcCopyTimeout, "pvc-copy-timeout", 300, "length of time to wait (in seconds) when transferring data from the source to the destination storage volume")
flag.IntVar(&options.PVCCopyTimeout, "pvc-copy-timeout", 300, "length of time to wait (in seconds) when transferring data from the source to the destination storage volume")


allpvs, err := cli.CoreV1().PersistentVolumes().List(ctx, metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("failed to get persistent volumes: %w", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("failed to get persistent volumes: %w", err)
return nil, fmt.Errorf("failed to list persistent volumes: %w", err)

pkg/k8sutil/storage.go Show resolved Hide resolved
// validate access modes for all PVCs using the d source storage class
pvcs, err := pvcsForStorageClass(ctx, w, clientset, options.SourceSCName, options.Namespace)
if err != nil {
return nil, fmt.Errorf("failed to get PVCs for storage %s: %s", options.SourceSCName, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("failed to get PVCs for storage %s: %s", options.SourceSCName, err)
return nil, fmt.Errorf("failed to get PVCs for storage %s: %w", options.SourceSCName, err)

}
pvcAccesModeFailures, err := validateVolumeAccessModes(ctx, w, clientset, options.DestSCName, time.Duration(options.PodReadyTimeout), pvcs)
if err != nil {
return nil, fmt.Errorf("failed validate PVC access modes: %s", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("failed validate PVC access modes: %s", err)
return nil, fmt.Errorf("failed validate PVC access modes: %w", err)


pvcEvents, err := client.CoreV1().Events(pvc.Namespace).Search(scheme.Scheme, pvc)
if err != nil {
return nil, fmt.Errorf("failed to list events for PVC %s", pvc.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("failed to list events for PVC %s", pvc.Name)
return nil, fmt.Errorf("failed to list events for PVC %s: %w", pvc.Name, err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. I've addressed this.

return srcPVCs, nil
}

func newK8sName(prefix, original string) string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to k8sutil package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. thanks.

pkg/preflight/validate.go Show resolved Hide resolved
@rrpolanco
Copy link
Contributor Author

Modified the output of the preflight validation to be generic.

When there is a validation failure when migrating from Rook to OpenEBS 👇

⚙  Waiting for OpenEBS CustomResourceDefinitions to be ready
✔ OpenEBS CustomResourceDefinitions are ready
Existing default storage class that is not OpenEBS LocalPV detected.
OpenEBS LocalPV will be installed as the non-default storage class.
storageclass.storage.k8s.io/openebs-localpv unchanged
awaiting openebs-localpv-provisioner deployment
Running Rook to OpenEBS Migration Checks ...

The following resources cannot be migrated:

NAMESPACE		RESOURCE			MESSAGE
---------		--------			-------
default			pvc/rook-rwx-claim		failed to provision volume with StorageClass "openebs-localpv": Only support ReadWriteOnce access mode
Cannot upgrade from Rook to OpenEBS due to previous errors.

@rrpolanco rrpolanco force-pushed the rrpolanco/sc-61144/check-pvc-accessmode branch from 7ae9c3d to 7c8d9e5 Compare December 5, 2022 21:04
}

if len(failures) != 0 {
preflight.PrintValidationFailures(os.Stdout, failures)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be stderr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to separate the output of the validation failures to go to stdout and program errors to stderr.

cmd/main.go Outdated
Comment on lines 26 to 42
// Cli uses CLI options to run Migrate
var options migrate.Options
var skipPreflightValidation bool
var preflightValidationOnly bool
flag.StringVar(&options.SourceSCName, "source-sc", "", "storage provider name to migrate from")
flag.StringVar(&options.DestSCName, "dest-sc", "", "storage provider name to migrate to")
flag.StringVar(&options.RsyncImage, "rsync-image", "eeacms/rsync:2.3", "the image to use to copy PVCs - must have 'rsync' on the path")
flag.StringVar(&options.Namespace, "namespace", "", "only migrate PVCs within this namespace")
flag.BoolVar(&options.SetDefaults, "set-defaults", false, "change default storage class from source to dest")
flag.BoolVar(&options.VerboseCopy, "verbose-copy", false, "show output from the rsync command used to copy data between PVCs")
flag.BoolVar(&options.SkipSourceValidation, "skip-source-validation", false, "migrate from PVCs using a particular StorageClass name, even if that StorageClass does not exist")
flag.IntVar(&options.PodReadyTimeout, "pod-ready-timeout", 60, "length of time to wait (in seconds) for volume validation pod(s) to go into Ready phase")
flag.BoolVar(&skipPreflightValidation, "skip-preflight-validation", false, "skip the volume access modes validation on the destination storage provider")
flag.BoolVar(&preflightValidationOnly, "preflight-validation-only", false, "skip the migration and run preflight validation only")

flag.Parse()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have we removed the CLI function?

I suppose that kurl isn't using it anymore, so it might not be useful...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored this to avoid import cycles from pvmigrate package. Moreover, handling CLI flags in the pvmigrate package is not the right place to do this.

Comment on lines +42 to +43
// Validate runs preflight check on storage volumes returning a list of failures
func Validate(ctx context.Context, w *log.Logger, clientset k8sclient.Interface, options migrate.Options) ([]ValidationFailure, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should validateStorageClasses - which validates that storageclasses are present in the cluster - be part of Validate instead of Migrate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation of storage classes will happen in both the validation and migration paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This the output when the validation can't find the storage class(es):

$ ./kurl/bin/pvmigrate --source-sc "non-default" --dest-sc "longhorn" --rsync-image "replicated/ku
rl-util:alpha" --skip-free-space-check --preflight-validation-only
The following resources cannot be migrated:

NAMESPACE		RESOURCE		MESSAGE
---------		--------		-------
			sc/non-default		Resource not found
			sc/longhorn		Resource not found

fmt.Printf("%s\n", err.Error())
os.Exit(1)
}
PodReadyTimeout int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be a time.Duration, not an int

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


// NewPrefixedName returns a name prefixed by prefix and with length that is no longer than 63
// chars
func NewPrefixedName(prefix, original string) string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind using this in Migrate? (specifically in newPvcName)

(This might require changing this to be NewSuffixedName for consistency)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will take this suggestion as a future TODO.

laverya
laverya previously approved these changes Dec 6, 2022
Copy link
Member

@laverya laverya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, but there's one other thing I'd like to see -

PLEASE update the readme with the additional CLI flags and process steps.

@rrpolanco rrpolanco marked this pull request as ready for review December 7, 2022 19:37
@rrpolanco rrpolanco changed the title Check if PV Access Modes are Supported by the Volume Provider Check if PV/PVC Access Modes are Supported by the Volume Provider Dec 7, 2022
@@ -98,6 +109,34 @@ func validateVolumeAccessModes(ctx context.Context, l *log.Logger, client k8scli
return volAccessModeFailures, nil
}

// validateStorageClasses returns any failures encountered when discovering the source and destination
// storage classes
func validateStorageClasses(ctx context.Context, l *log.Logger, clientset k8sclient.Interface, sourceSCName, destSCName string) ([]ValidationFailure, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you copy the logic from the migrate version more precisely? This drops support for the 'skip-source-validation' flag

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

emosbaugh
emosbaugh previously approved these changes Dec 8, 2022
pkg/preflight/validate.go Outdated Show resolved Hide resolved
pkg/preflight/validate.go Show resolved Hide resolved
Comment on lines 322 to 328
if _, err := client.CoreV1().PersistentVolumes().Get(
ctx, pv.Name, metav1.GetOptions{},
); err != nil && !k8serrors.IsNotFound(err) {
l.Printf("failed to get pv for temp pvc %s: %s", pvc, err)
} else if err != nil && k8serrors.IsNotFound(err) {
break
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you return nil here?

Suggested change
if _, err := client.CoreV1().PersistentVolumes().Get(
ctx, pv.Name, metav1.GetOptions{},
); err != nil && !k8serrors.IsNotFound(err) {
l.Printf("failed to get pv for temp pvc %s: %s", pvc, err)
} else if err != nil && k8serrors.IsNotFound(err) {
break
}
if _, err := client.CoreV1().PersistentVolumes().Get(
ctx, pv.Name, metav1.GetOptions{},
); err != nil {
if k8serrors.IsNotFound(err) {
return nil
}
l.Printf("failed to get pv for temp pvc %s: %s", pvc, err)
}

case <-interval.C:
continue
case <-timeout:
return fmt.Errorf("failed to delete pvs: timeout")
}
}
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remove this return statement, Go requires a return.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this felt like unreachable code in my head based on the other change requested in my review. but ill take your word for it

Copy link
Member

@emosbaugh emosbaugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as i dont feel that strongly about my suggestion ill approve

@rrpolanco rrpolanco merged commit 19f941d into main Dec 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants