feat(conformance) add EPP unavailable fail-open test #999

Merged: 4 commits into kubernetes-sigs:main, Jun 24, 2025

zetxqx
Contributor

@zetxqx zetxqx commented Jun 17, 2025

This pull request introduces a new conformance test, epp_unavailable_fail_open, to validate the behavior of the extensionRef.failureMode==FailOpen setting.

  • A new conformance test, epp_unavailable_fail_open, has been added.
  • The test simulates an EPP becoming unavailable by deleting the EPP deployment and verifies that the gateway fails open and routes traffic to the backend service.
  • This PR builds on #961 (feat(conformance): Add EPP conformance test for Gateway routing), which added the initial EPP conformance test for Gateway routing.

Verification

gke-l7

go test -v ./conformance -args -debug -gateway-class gke-l7-regional-external-managed -cleanup-base-resources=false -allow-crds-mismatch=true -run-test EppUnAvailableFailOpen
=== NAME  TestConformance/EppUnAvailableFailOpen
    apply.go:283: 2025-06-17T01:19:39.311334464Z: Deleting epp-to-inference-model-reader RoleBinding
    apply.go:283: 2025-06-17T01:19:39.403450067Z: Deleting inference-model-reader Role
    apply.go:283: 2025-06-17T01:19:39.482466834Z: Deleting infra-backend-epp Deployment
    apply.go:283: 2025-06-17T01:19:39.494211326Z: Deleting infra-backend-endpoint-picker Service
    apply.go:283: 2025-06-17T01:19:39.61129453Z: Deleting httproute-for-primary-gw HTTPRoute
    apply.go:283: 2025-06-17T01:19:39.679690563Z: Deleting normal-gateway-pool InferencePool
    apply.go:283: 2025-06-17T01:19:39.746823836Z: Deleting conformance-fake-model-server InferenceModel
    apply.go:283: 2025-06-17T01:19:39.823244608Z: Deleting infra-backend-deployment Deployment
=== RUN   TestConformance/GatewayFollowingEPPRouting
    conformance.go:68: Skipping GatewayFollowingEPPRouting: test explicitly skipped
=== RUN   TestConformance/HTTPRouteInvalidInferencePoolRef
    conformance.go:68: Skipping HTTPRouteInvalidInferencePoolRef: test explicitly skipped
=== RUN   TestConformance/InferencePoolAccepted
    conformance.go:68: Skipping InferencePoolAccepted: test explicitly skipped
=== RUN   TestConformance/InferencePoolResolvedRefsCondition
    conformance.go:68: Skipping InferencePoolResolvedRefsCondition: test explicitly skipped
--- PASS: TestConformance (71.37s)
    --- PASS: TestConformance/EppUnAvailableFailOpen (68.24s)
        --- PASS: TestConformance/EppUnAvailableFailOpen/Phase_1:_Verify_baseline_connectivity_with_EPP_available (48.22s)
        --- PASS: TestConformance/EppUnAvailableFailOpen/Phase_2:_Verify_fail-open_behavior_after_EPP_becomes_unavailable (0.16s)
    --- SKIP: TestConformance/GatewayFollowingEPPRouting (0.00s)
    --- SKIP: TestConformance/HTTPRouteInvalidInferencePoolRef (0.00s)
    --- SKIP: TestConformance/InferencePoolAccepted (0.00s)
    --- SKIP: TestConformance/InferencePoolResolvedRefsCondition (0.00s)
PASS
ok  	sigs.k8s.io/gateway-api-inference-extension/conformance	71.575s

istio

go test -v ./conformance -args -debug -gateway-class istio -cleanup-base-resources=false -allow-crds-mismatch=true -run-test EppUnAvailableFailOpen
=== RUN   TestConformance/InferencePoolResolvedRefsCondition
    conformance.go:68: Skipping InferencePoolResolvedRefsCondition: test explicitly skipped
--- PASS: TestConformance (11.52s)
    --- PASS: TestConformance/EppUnAvailableFailOpen (8.80s)
        --- PASS: TestConformance/EppUnAvailableFailOpen/Phase_1:_Verify_baseline_connectivity_with_EPP_available (7.06s)
        --- PASS: TestConformance/EppUnAvailableFailOpen/Phase_2:_Verify_fail-open_behavior_after_EPP_becomes_unavailable (0.09s)
    --- SKIP: TestConformance/GatewayFollowingEPPRouting (0.00s)
    --- SKIP: TestConformance/HTTPRouteInvalidInferencePoolRef (0.00s)
    --- SKIP: TestConformance/InferencePoolAccepted (0.00s)
    --- SKIP: TestConformance/InferencePoolResolvedRefsCondition (0.00s)
PASS
ok  	sigs.k8s.io/gateway-api-inference-extension/conformance	11.715s

netlify bot commented Jun 17, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 707c5d7
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/6859792fd9f4aa000837f22c
😎 Deploy Preview https://deploy-preview-999--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 17, 2025
@k8s-ci-robot
Contributor

Hi @zetxqx. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 17, 2025
@zetxqx zetxqx changed the title from "feat(conformance) add epp_unavailable_fail_open test to test the extensionRef.failureMode==FailOpen behavior" to "feat(conformance) add EPP unavailable fail-open test" Jun 17, 2025
@zetxqx zetxqx mentioned this pull request Jun 17, 2025
12 tasks
Member

@robscott robscott left a comment

Thanks @zetxqx! A few small comments, otherwise LGTM.

type: PathPrefix
value: /primary-gateway-test
---
# --- Conformance EPP service Definition ---
Member

Would really like to see everything below this in base manifests + reused to simplify test cases.

Contributor Author

Rebased on #982 and reused as much as possible. PTAL.

Comment on lines 51 to 62
# --- InferenceModel Definition ---
# Service for the infra-backend-deployment.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: conformance-fake-model-server
  namespace: gateway-conformance-app-backend
spec:
  modelName: conformance-fake-model
  criticality: Critical # Mark it as critical to bypass the saturation check, since the model server is fake and doesn't have such metrics.
  poolRef:
    name: normal-gateway-pool
Member

Recommend removing - follow up issue is fine since I know it requires corresponding EPP changes

Contributor Author

Got it. We can follow up in #1002; I also added a TODO.

Headers: map[string]string{eppSelectionHeaderName: targetPodIP},
Method: http.MethodPost,
Body: requestBody,
Backend: appPodBackendPrefix,
Member

In the first test don't we need to ensure that the Gateway routed to a specific Pod, while in the second test any Pod with this prefix is sufficient?

Contributor Author

Updated: the first request now checks for the specific pod, and the second request checks only the prefix.
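
The distinction discussed in this thread (an exact pod in the first request, any pod with the shared prefix in the second) can be sketched as a small matcher. The `podExpectation` type and the pod names are hypothetical and are not taken from the PR's code; only the Deployment name `infra-backend-deployment` appears in the test logs above.

```go
package main

import (
	"fmt"
	"strings"
)

// podExpectation is a hypothetical helper describing which responding
// pods a request may legitimately be served by.
type podExpectation struct {
	name  string // full pod name, or a name prefix when exact is false
	exact bool
}

// matches reports whether the pod that served the response is acceptable.
func (e podExpectation) matches(respondingPod string) bool {
	if e.exact {
		return respondingPod == e.name
	}
	return strings.HasPrefix(respondingPod, e.name)
}

func main() {
	// Phase 1: the EPP is up, so only the pod it selected is acceptable.
	phase1 := podExpectation{name: "infra-backend-deployment-0", exact: true}
	// Phase 2: the EPP is gone, so any pod of the backend deployment will do.
	phase2 := podExpectation{name: "infra-backend-deployment", exact: false}

	fmt.Println(phase1.matches("infra-backend-deployment-0")) // true
	fmt.Println(phase1.matches("infra-backend-deployment-1")) // false
	fmt.Println(phase2.matches("infra-backend-deployment-1")) // true
}
```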


// MakeRequestAndExpectEventuallyConsistentResponse makes a request using the parameters
// from the Request struct and waits for the response to consistently match the expectations.
func MakeRequestAndExpectEventuallyConsistentResponse(
Member

Please make sure you're converging with @SinaChavoshi on these functions across PRs.

Contributor Author

Sure, I'll have a follow up PR to refactor this.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 17, 2025
@zetxqx
Contributor Author

zetxqx commented Jun 19, 2025

@robscott I've rebased this PR on #982, and it is ready for review. Note that it is still based on #961, which has not been merged yet.

@robscott
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 21, 2025
@zetxqx
Contributor Author

zetxqx commented Jun 23, 2025

@robscott, I rebased onto main, so the PR is cleaner and ready for review.

I re-ran the test. Results:

    conformance.go:68: Skipping HTTPRouteInvalidInferencePoolRef: test explicitly skipped
=== RUN   TestConformance/InferencePoolAccepted
    conformance.go:68: Skipping InferencePoolAccepted: test explicitly skipped
=== RUN   TestConformance/InferencePoolHTTPRoutePortValidation
    conformance.go:68: Skipping InferencePoolHTTPRoutePortValidation: test explicitly skipped
=== RUN   TestConformance/InferencePoolInvalidEPPService
    conformance.go:68: Skipping InferencePoolInvalidEPPService: test explicitly skipped
=== RUN   TestConformance/HTTPRouteMultipleRulesDifferentPools
    conformance.go:68: Skipping HTTPRouteMultipleRulesDifferentPools: test explicitly skipped
=== RUN   TestConformance/InferencePoolResolvedRefsCondition
    conformance.go:68: Skipping InferencePoolResolvedRefsCondition: test explicitly skipped
--- PASS: TestConformance (30.13s)
    --- PASS: TestConformance/EppUnAvailableFailOpen (19.74s)
        --- PASS: TestConformance/EppUnAvailableFailOpen/Phase_1:_Verify_baseline_connectivity_with_EPP_available (8.09s)
        --- PASS: TestConformance/EppUnAvailableFailOpen/Phase_2:_Verify_fail-open_behavior_after_EPP_becomes_unavailable (0.13s)
    --- SKIP: TestConformance/GatewayFollowingEPPRouting (0.00s)
    --- SKIP: TestConformance/HTTPRouteInvalidInferencePoolRef (0.00s)
    --- SKIP: TestConformance/InferencePoolAccepted (0.00s)
    --- SKIP: TestConformance/InferencePoolHTTPRoutePortValidation (0.00s)
    --- SKIP: TestConformance/InferencePoolInvalidEPPService (0.00s)
    --- SKIP: TestConformance/HTTPRouteMultipleRulesDifferentPools (0.00s)
    --- SKIP: TestConformance/InferencePoolResolvedRefsCondition (0.00s)
PASS
ok  	sigs.k8s.io/gateway-api-inference-extension/conformance	30.381s

Member

@robscott robscott left a comment

Thanks @zetxqx!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 23, 2025
@danehans
Contributor

/approve

Thanks @zetxqx!

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danehans, robscott, zetxqx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 24, 2025
@k8s-ci-robot k8s-ci-robot merged commit 5afe58b into kubernetes-sigs:main Jun 24, 2025
9 checks passed
shmuelk pushed a commit to shmuelk/gateway-api-inference-extension that referenced this pull request Jun 24, 2025
…#999)

* Add test for epp becoming unavailable and the extensionRef.failureMode is set to failOpen.

* resolve minor comments.

* format.

* import format.
rlakhtakia pushed a commit to rlakhtakia/gateway-api-inference-extension that referenced this pull request Jun 26, 2025
…#999)

* Add test for epp becoming unavailable and the extensionRef.failureMode is set to failOpen.

* resolve minor comments.

* format.

* import format.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
4 participants