
Fix cant initiate failover of applicationSet based apps after hub recovery #861

Merged

Conversation

GowthamShanmugam
Contributor

@GowthamShanmugam GowthamShanmugam commented May 23, 2023

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2209288

Fix explanation:

  1. Previously, the Failover/Relocate component used the placement ownerRef to match a placement with its placement decision. This fails in some cases, so we are switching to label-based matching.

Other errors observed on the QE setup after fixing the above:
2. The failover/relocation readiness text in the UI should not rely only on the DRPC peer-ready and available conditions. These conditions are common to all errors, so if any error occurs the readiness should display "Not Ready".
3. While failing over/relocating, ramen removes the cluster name from the placement rule and adds it back once the failover/relocation operation completes. If the cluster goes down in between and the user wants to trigger a failover, the UI is unable to determine the target cluster and the deployment cluster.
The fix: find the deployment cluster name from the DRPC, which is only possible while a failover or relocation operation is ongoing.

  • If the status is FailingOver, the target cluster can be found from the failoverCluster spec of the DRPC.
  • If the status is Relocating, the target cluster can be found from the preferredCluster spec of the DRPC.
  • Once a target cluster is found, the deployment cluster can be derived from the list of DR clusters.
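The bullet points above can be sketched roughly as follows. This is a minimal illustration, not the actual PR code: the type names, statuses, and helper are simplified stand-ins for the real odf-console/MCO definitions.

```typescript
// Simplified stand-ins for the real MCO types (assumptions, not the PR's code).
type DRPCStatus = 'FailingOver' | 'Relocating';

interface DRPC {
  spec?: { failoverCluster?: string; preferredCluster?: string };
}

// Pick the target cluster from the DRPC based on the current status, then
// derive the deployment cluster as the other member of the DR cluster pair.
const findTargetAndDeploymentCluster = (
  drpc: DRPC,
  currStatus: DRPCStatus,
  drClusters: string[]
): { targetCluster?: string; deploymentCluster?: string } => {
  let targetCluster: string | undefined;
  if (currStatus === 'FailingOver') {
    // While failing over, failoverCluster is the target cluster.
    targetCluster = drpc?.spec?.failoverCluster;
  } else if (currStatus === 'Relocating') {
    // While relocating, preferredCluster is the target cluster.
    targetCluster = drpc?.spec?.preferredCluster;
  }
  // The deployment cluster is the DR cluster that is not the target.
  const deploymentCluster =
    targetCluster !== undefined
      ? drClusters.find((c) => c !== targetCluster)
      : undefined;
  return { targetCluster, deploymentCluster };
};
```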

@GowthamShanmugam
Contributor Author

/hold

@openshift-ci
Contributor

openshift-ci bot commented May 23, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: GowthamShanmugam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@GowthamShanmugam GowthamShanmugam force-pushed the bz_2209288 branch 5 times, most recently from 3fb47ac to 3ce32ef Compare May 24, 2023 13:08
@GowthamShanmugam
Contributor Author

/hold cancel

@GowthamShanmugam GowthamShanmugam changed the title Fix cant initiate failover of applicationSet based apps after hub recovery - WIP Fix cant initiate failover of applicationSet based apps after hub recovery May 24, 2023
Comment on lines 508 to 512
// While failing over failoverCluster cluster is the target cluster
// Find deployment cluster using dr cluster list
} else if (currStatus === DRPC_STATUS.FailingOver) {
Collaborator

nits

Suggested change
-  // While failing over failoverCluster cluster is the target cluster
-  // Find deployment cluster using dr cluster list
-  } else if (currStatus === DRPC_STATUS.FailingOver) {
+  } else if (currStatus === DRPC_STATUS.FailingOver) {
+  // While failing over failoverCluster cluster is the target cluster
+  // Find deployment cluster using dr cluster list

Comment on lines 506 to 509
const cluster = findCluster(
drClusters,
drpc?.spec?.preferredCluster,
false
Collaborator

This is confusing:

  1. The "matchCluster" argument of the "findCluster" function is false by default, so there is no need to pass "false" again.
  2. The second argument of the "findCluster" function is named "deploymentClusterName", but in reality you want to figure out the deployment cluster; the parameter you are passing is the target cluster (the naming convention is wrong/confusing).

Contributor Author

Correct, the name is wrong.

Comment on lines 515 to 518
const cluster = findCluster(
drClusters,
drpc?.spec?.failoverCluster,
false
Collaborator

same 2 points as above...
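For context, here is a minimal sketch of the findCluster shape this thread is describing. The signature and parameter names are assumptions inferred from the review comments, not the actual source:

```typescript
// Hypothetical stand-in for the real MCO DRCluster type; only the field used here.
interface DRCluster {
  metadata: { name: string };
}

// Assumed shape of findCluster: with matchCluster=false (the default) it returns
// the DR cluster that does NOT match the given name, i.e. the peer cluster;
// with matchCluster=true it returns the cluster with that exact name.
const findCluster = (
  drClusters: DRCluster[],
  clusterName: string,
  matchCluster = false
): DRCluster | undefined =>
  drClusters.find((c) =>
    matchCluster ? c.metadata.name === clusterName : c.metadata.name !== clusterName
  );
```

Because matchCluster defaults to false, `findCluster(drClusters, name)` is equivalent to `findCluster(drClusters, name, false)`, which is the reviewer's first point.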

@@ -488,3 +489,36 @@ export const filterPVCDataUsingAppsets = (

export const filterDRAlerts = (alert: Alert) =>
alert?.annotations?.alert_type === 'DisasterRecovery';

export const findDeploymentClusterName = (
plsDecision: ACMPlacementDecisionKind,
Collaborator

No issue with the current approach, but if "drpc" is the best way to determine the deployment cluster, do we even need the check for "plsDecision"? Can't we always check based upon "drpc" only?

Contributor Author

  • PlacementDecision is our source of truth for finding the deployment cluster.
  • Finding the deployment cluster name from the DRPC won't always work; the user may accidentally change the "preferredCluster"/"failoverCluster" in the DRPC spec without changing the "action" spec.
  • Also, if we derive it from the DRPC, these are the challenges:
    1. If the current DRPC status is Deployed, the deployment cluster is the preferred cluster spec.
    2. If the current DRPC status is Initiating, then:
      If the action is Failover, the deployment cluster is the opposite cluster from the failover cluster.
      If the action is Relocate, the deployment cluster is the opposite cluster from the preferred cluster.
    3. If the current DRPC status is FailingOver, the deployment cluster is the opposite cluster from the failover cluster.
    4. If the current DRPC status is Relocating, the deployment cluster is the opposite cluster from the preferred cluster.
    5. If the current DRPC status is FailedOver, the deployment cluster is the preferred cluster.
    6. If the current DRPC status is Relocated, the deployment cluster is the preferred cluster.
  • If any new state is added, we would need to handle that as well; and if the user changes the preferred cluster/failover cluster/action spec, it is a problem.
  • Instead of checking this many conditions, I decided to check the PlacementRule, which always gives the correct deployment cluster name.
  • But one corner case remains: ramen removes the cluster names from the PlacementRule, so I check the DRPC only for this corner case (i.e. Relocating/FailingOver). I am assuming that while Relocating/FailingOver, users won't change the DRPC spec.
  • An alternative solution is to ask the user to input the cluster name in the UI, but if we introduce bulk failover and relocate in the future, this will be a problem.
  • Again, it is not a solid solution; we need to discuss this with the DR team.
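The case analysis enumerated above (the DRPC-only derivation the author argues is fragile) can be sketched as follows. The type and field names are simplified assumptions, and each case mirrors the enumeration exactly as written:

```typescript
// Simplified stand-ins (assumptions, not the PR's types).
type DRPCStatus =
  | 'Deployed'
  | 'Initiating'
  | 'FailingOver'
  | 'Relocating'
  | 'FailedOver'
  | 'Relocated';
type Action = 'Failover' | 'Relocate';

interface DRPCSpec {
  preferredCluster: string;
  failoverCluster: string;
  action?: Action;
}

// drClusters is assumed to be the two-member DR cluster pair.
const peerOf = (drClusters: string[], cluster: string): string =>
  drClusters.find((c) => c !== cluster)!;

// The DRPC-only derivation, case by case as enumerated in the comment above.
const deploymentClusterFromDRPC = (
  spec: DRPCSpec,
  status: DRPCStatus,
  drClusters: string[]
): string => {
  switch (status) {
    case 'Deployed':
      return spec.preferredCluster;
    case 'Initiating':
      return spec.action === 'Failover'
        ? peerOf(drClusters, spec.failoverCluster)
        : peerOf(drClusters, spec.preferredCluster);
    case 'FailingOver':
      return peerOf(drClusters, spec.failoverCluster);
    case 'Relocating':
      return peerOf(drClusters, spec.preferredCluster);
    case 'FailedOver':
    case 'Relocated':
      return spec.preferredCluster;
    default:
      // Any new state would need explicit handling -- the fragility noted above.
      throw new Error('unhandled DRPC status');
  }
};
```

The switch makes the author's point visible: six status branches, plus the action sub-cases, all of which silently break if the user edits the spec mid-operation.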

Contributor Author

This fix will at least save us from blocker BZs like "failover/relocation is blocked" or "the UI is not displaying the target cluster name".

@@ -53,7 +53,7 @@ export const FailoverRelocateModal: React.FC<FailoverRelocateModalProps> = (
ModalFooterStatus.INITIAL
);
const [placement, setPlacement] = React.useState<PlacementProps>({});
-  const [canInitiate, setCanInitiate] = React.useState(false);
+  const [canInitiate, setCanInitiate] = React.useState<Boolean>(undefined);
Collaborator

Nit: if we are defining the type as Boolean, shouldn't we initialise it with false?
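As an aside, a tri-state value can be useful when the UI must distinguish "not yet evaluated" from "evaluated to false". A hypothetical (non-React) illustration, not code from this PR; note that in TypeScript the primitive type boolean is generally preferred over the wrapper type Boolean:

```typescript
// Hypothetical helper: `undefined` means "readiness not yet computed",
// which is distinct from an explicit `false` ("computed, and not ready").
const readinessLabel = (canInitiate: boolean | undefined): string => {
  if (canInitiate === undefined) return 'Checking...';
  return canInitiate ? 'Ready' : 'Not Ready';
};
```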

@GowthamShanmugam GowthamShanmugam force-pushed the bz_2209288 branch 2 times, most recently from a594bde to 1f0ea80 Compare May 25, 2023 08:06
@GowthamShanmugam
Contributor Author

/test images

@SanjalKatiyar
Collaborator

/lgtm

@GowthamShanmugam
Contributor Author

/test odf-console-e2e-aws

1 similar comment
@GowthamShanmugam
Contributor Author

/test odf-console-e2e-aws

@openshift-merge-robot openshift-merge-robot merged commit 11f4070 into red-hat-storage:master May 25, 2023
6 checks passed
@SanjalKatiyar
Collaborator

/cherry-pick release-4.13

@SanjalKatiyar
Collaborator

/cherry-pick release-4.13-compatibility

@openshift-cherrypick-robot

@SanjalKatiyar: #861 failed to apply on top of branch "release-4.13":

Applying: Fix cant initiate failover of applicationSet based apps after hub recovery
Using index info to reconstruct a base tree...
M	packages/mco/components/modals/app-failover-relocate/argo-application-set.tsx
M	packages/mco/components/modals/app-failover-relocate/failover-relocate-modal-body.tsx
Falling back to patching base and 3-way merge...
Auto-merging packages/mco/components/modals/app-failover-relocate/failover-relocate-modal-body.tsx
Auto-merging packages/mco/components/modals/app-failover-relocate/argo-application-set.tsx
CONFLICT (content): Merge conflict in packages/mco/components/modals/app-failover-relocate/argo-application-set.tsx
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Fix cant initiate failover of applicationSet based apps after hub recovery
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@SanjalKatiyar: #861 failed to apply on top of branch "release-4.13-compatibility":

Applying: Fix cant initiate failover of applicationSet based apps after hub recovery
Using index info to reconstruct a base tree...
M	packages/mco/components/modals/app-failover-relocate/argo-application-set.tsx
M	packages/mco/components/modals/app-failover-relocate/failover-relocate-modal-body.tsx
Falling back to patching base and 3-way merge...
Auto-merging packages/mco/components/modals/app-failover-relocate/failover-relocate-modal-body.tsx
Auto-merging packages/mco/components/modals/app-failover-relocate/argo-application-set.tsx
CONFLICT (content): Merge conflict in packages/mco/components/modals/app-failover-relocate/argo-application-set.tsx
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Fix cant initiate failover of applicationSet based apps after hub recovery
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.13-compatibility

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
