
UpstreamGroup continues to send traffic to unhealthy upstream. #8715

Open
AkshayAdsul opened this issue Sep 25, 2023 · 12 comments
Labels
Committed: 1.18 Geo: APAC go-to-prod Priority: High Required in next 3 months to make progress, bugs that affect multiple users, or very bad UX release/1.16 release/1.17 Type: Bug Something isn't working

Comments

@AkshayAdsul

AkshayAdsul commented Sep 25, 2023

Gloo Edge Product

Open Source

Gloo Edge Version

v1.13.6 (but encountered this in other versions as well)

Kubernetes Version

v1.25

Describe the bug

We have an UpstreamGroup whose Upstreams have health checks configured, but the UpstreamGroup continues to send traffic to an unhealthy upstream.

Expected Behavior

The expectation is that if an upstream is unhealthy, the upstream group will not route requests through it. But it seems that the upstream group is not aware of upstream health.

Steps to reproduce the bug

  1. Create two services, e.g.:
    kubectl create ns echo
    kubectl -n echo apply -f https://raw.githubusercontent.com/solo-io/workshops/master/gloo-edge/data/echo-service.yaml
    kubectl -n echo apply -f https://raw.githubusercontent.com/solo-io/workshops/master/gloo-edge/data/echo-v2-service.yaml
  2. Create the Upstream and UpstreamGroup as below:
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  labels:
    discovered_by: kubernetesplugin
  name: echo-echo-v2-8080
  namespace: gloo-system
spec:
  discoveryMetadata:
    labels:
      app: echo-v2
  healthChecks:
  - alwaysLogHealthCheckFailures: true
    eventLogPath: /dev/stdout
    healthyThreshold: 3
    httpHealthCheck:
      path: /health
    interval: 10s
    reuseConnection: false
    timeout: 10s
    unhealthyThreshold: 1
  kube:
    selector:
      app: echo-v2
    serviceName: echo-v2
    serviceNamespace: echo
    servicePort: 8080
  loadBalancerConfig:
    healthyPanicThreshold: 0
    ringHash: {}
---
apiVersion: gloo.solo.io/v1
kind: UpstreamGroup
metadata:
  name: my-service-group
  namespace: gloo-system
spec:
  destinations:
  - destination:
      upstream:
        name: echo-echo-v1-8080
        namespace: gloo-system
    weight: 5
  - destination:
      upstream:
        name: echo-echo-v2-8080
        namespace: gloo-system
    weight: 5
  3. Update the Virtual Service configuration:
 routes:
   - delegateAction:
       ref:
         name: echo-routetable
         namespace: echo
     matchers:
     - headers:
       - name: :authority
         value: echo.solo.io
       prefix: /
  4. Update the Route Table config:
spec:
  routes:
  - matchers:
    - headers:
      - name: :authority
        value: echo.solo.io
      prefix: /
    routeAction:
      upstreamGroup:
        name: my-service-group
        namespace: gloo-system
  5. Make one of the upstreams unhealthy (e.g. scale the echo app's deployment to 1 replica)

  6. Send traffic:
    for i in {1..10};
    do
    curl -H "Host: echo.solo.io" $(glooctl proxy url)/
    done

  7. Observe `no healthy upstream` errors on half the requests.

Additional Environment Detail

No response

Additional Context

No response

Related Issues:

@AkshayAdsul AkshayAdsul added Type: Bug Something isn't working Priority: High Required in next 3 months to make progress, bugs that affect multiple users, or very bad UX labels Sep 25, 2023
@SantoDE
Contributor

SantoDE commented Sep 27, 2023

Could that be a dupe of #6647 @AkshayAdsul ?

@AkshayAdsul
Author

In this bug we get a `no healthy upstream` error rather than `upstream connect error or disconnect/reset before headers. reset reason: connection termination`. So Envoy detects the upstream as unhealthy, but the UpstreamGroup still continues to send traffic to it.

@SantoDE
Contributor

SantoDE commented Sep 28, 2023

@AkshayAdsul did that already work in a previous version? Or is that new?

@AkshayAdsul
Author

AkshayAdsul commented Sep 29, 2023

I tried this on the 1.13.6 and 1.15.x versions and saw the same issue; I don't think this ever worked.
Kas looked at the code and mentioned that we incorrectly write empty endpoints for the upstream. It is not an issue when managing subsets in a single upstream, so it is specific to the UpstreamGroup.

@jbohanon
Contributor

jbohanon commented Sep 29, 2023

I'd guess `healthy_panic_threshold: {}` might have something to do with this in their env. In your sample steps, what happens if you turn off UDS and manually define the v1 service upstream with the setting explicitly set, as v2 shows above?

The healthy panic threshold needs to be zero to properly get the `no_healthy_upstream` response flag/details (link)

@jbohanon
Contributor

jbohanon commented Sep 29, 2023

In this bug we get a no healthy upstream error rather than upstream connect error or disconnect/reset before headers. reset reason: connection termination . So it detects the upstream as unhealthy but UpStreamGroup still continues to send traffic to it.

If you're receiving these response details, then Envoy should never have sent the upstream request (link)

@AkshayAdsul
Author

When we set `healthyPanicThreshold: 0` on an Upstream, I see in the config dump that the value is rendered as `"healthy_panic_threshold": {}` rather than `0`, which means it falls back to the default value of 50 instead of 0.
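For reference, Envoy's panic threshold lives on the cluster's `common_lb_config` as a `Percent` message; per Envoy's v3 cluster API, an unset message falls back to the 50% default, so disabling panic routing requires an explicit zero. A sketch of the cluster fragment one would expect to see in the config dump (illustrative only, not what Gloo currently emits):

```yaml
common_lb_config:
  healthy_panic_threshold:
    value: 0.0   # explicit 0 disables panic mode; leaving the message unset means 50%
```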

@day0ops
Contributor

day0ops commented Oct 4, 2023

I don't know about the healthyPanicThreshold, but have a look at the endpoints this creates. I think that when multiple upstreams are involved with health checks, it doesn't quite manage those endpoints correctly.

This works fine with a single upstream with multiple subsets, by the way.

@solo-io solo-io deleted a comment from AkshayAdsul Oct 5, 2023
@nfuden
Contributor

nfuden commented Oct 9, 2023

After more investigation I think we may have a fundamental mismatch in use case here.
I believe what we are looking at is UpstreamGroup -> weighted_cluster on routing, which is explicitly fine with some requests failing, while what is being attempted here is load balancing that also avoids sending to unhealthy destinations.

I am still confirming what the actual behavior is, but I think we may have to introduce a new concept or flag to actually provide this type of functionality. For example, instead of just having the route action with weights, the destinations of those weights could be aggregate clusters with the other upstream group members as secondary/tertiary locations.

Disclaimer: this is just the result of a quick initial investigation.
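For illustration, the aggregate-cluster idea above could look roughly like the Envoy configuration below. The cluster names come from this issue, but the aggregate cluster itself is a sketch of a possible future approach, not something Gloo Edge emits today:

```yaml
clusters:
- name: echo-aggregate
  connect_timeout: 1s
  lb_policy: CLUSTER_PROVIDED          # required by the aggregate cluster type
  cluster_type:
    name: envoy.clusters.aggregate
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
      clusters:                        # ordered fallback: prefer v1, fail over to v2
      - echo-echo-v1-8080
      - echo-echo-v2-8080
```

With this shape, Envoy only spills traffic to the next cluster in the list when the preferred one has no healthy hosts, which is the behavior the reporter expected from the UpstreamGroup.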

@nfuden nfuden assigned SantoDE and unassigned davidjumani and nfuden Oct 10, 2023
@nfuden
Contributor

nfuden commented Oct 10, 2023

Alright, I have confirmed the prior statements: upstream groups don't respect health across the upstream clusters, because each cluster contains its own concept of health and an UpstreamGroup is a route-level paradigm rather than a cluster-level configuration.

@SantoDE will follow up with the currently affected user, and we will determine whether there is an acceptable configuration that reworks the upstreams into a single upstream with some load balancing setup (such as locality-based), or whether we should approach adding aggregate clusters as a concept in Gloo Edge.
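To make the route-level vs. cluster-level distinction concrete: an UpstreamGroup translates (roughly) to an Envoy `weighted_clusters` route action like the sketch below. The weight split is applied when the route is chosen, before any per-cluster health is consulted, so an entirely unhealthy cluster still receives its share of requests. Cluster names here are illustrative of Gloo's `name_namespace` convention:

```yaml
route:
  weighted_clusters:
    clusters:
    - name: echo-echo-v1-8080_gloo-system
      weight: 5
    - name: echo-echo-v2-8080_gloo-system
      weight: 5
```

Health checking then happens independently inside each selected cluster, which is why half the requests in the repro steps land on the unhealthy cluster and fail.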

@jbohanon
Contributor

jbohanon commented Oct 10, 2023 via email

@nfuden
Contributor

nfuden commented Oct 10, 2023

Yep, that's one of the steps we are taking :P
