
UpstreamGroup continues to send traffic to unhealthy upstream. #8715

Open
AkshayAdsul opened this issue Sep 25, 2023 · 12 comments
Labels
Committed: 1.18 Geo: APAC go-to-prod Priority: High Required in next 3 months to make progress, bugs that affect multiple users, or very bad UX release/1.16 release/1.17 Type: Bug Something isn't working

Comments

@AkshayAdsul

AkshayAdsul commented Sep 25, 2023

Gloo Edge Product

Open Source

Gloo Edge Version

v1.13.6 (but encountered this in other versions as well)

Kubernetes Version

v1.25

Describe the bug

We have an UpstreamGroup whose Upstreams have health checks configured, but the UpstreamGroup continues to send traffic to an unhealthy upstream.

Expected Behavior

The expectation is that if an upstream is unhealthy, the upstream group will not route requests through it. But it seems that the upstream group is not aware of upstream health.

Steps to reproduce the bug

  1. Create two services, e.g.:
    kubectl create ns echo
    kubectl -n echo apply -f https://raw.githubusercontent.com/solo-io/workshops/master/gloo-edge/data/echo-service.yaml
    kubectl -n echo apply -f https://raw.githubusercontent.com/solo-io/workshops/master/gloo-edge/data/echo-v2-service.yaml
  2. Create the Upstream and UpstreamGroup as below:
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  labels:
    discovered_by: kubernetesplugin
  name: echo-echo-v2-8080
  namespace: gloo-system
spec:
  discoveryMetadata:
    labels:
      app: echo-v2
  healthChecks:
  - alwaysLogHealthCheckFailures: true
    eventLogPath: /dev/stdout
    healthyThreshold: 3
    httpHealthCheck:
      path: /health
    interval: 10s
    reuseConnection: false
    timeout: 10s
    unhealthyThreshold: 1
  kube:
    selector:
      app: echo-v2
    serviceName: echo-v2
    serviceNamespace: echo
    servicePort: 8080
  loadBalancerConfig:
    healthyPanicThreshold: 0
    ringHash: {}
---
apiVersion: gloo.solo.io/v1
kind: UpstreamGroup
metadata:
  name: my-service-group
  namespace: gloo-system
spec:
  destinations:
  - destination:
      upstream:
        name: echo-echo-v1-8080
        namespace: gloo-system
    weight: 5
  - destination:
      upstream:
        name: echo-echo-v2-8080
        namespace: gloo-system
    weight: 5
  3. Update the Virtual Service configuration:
 routes:
   - delegateAction:
       ref:
         name: echo-routetable
         namespace: echo
     matchers:
     - headers:
       - name: :authority
         value: echo.solo.io
       prefix: /
  4. Update the Route Table config:
spec:
  routes:
  - matchers:
    - headers:
      - name: :authority
        value: echo.solo.io
      prefix: /
    routeAction:
      upstreamGroup:
        name: my-service-group
        namespace: gloo-system
  5. Make one of the upstreams unhealthy (e.g. scale the echo app's deployment to 1 replica)

  6. Send traffic:
    for i in {1..10};
    do
    curl -H "Host: echo.solo.io" $(glooctl proxy url)/
    done

  7. Observe `no healthy upstream` errors on half the requests.

Additional Environment Detail

No response

Additional Context

No response

Related Issues:

@AkshayAdsul AkshayAdsul added Type: Bug Something isn't working Priority: High Required in next 3 months to make progress, bugs that affect multiple users, or very bad UX labels Sep 25, 2023
@SantoDE
Contributor

SantoDE commented Sep 27, 2023

Could that be a dupe of #6647 @AkshayAdsul ?

@AkshayAdsul
Author

In this bug we get a `no healthy upstream` error rather than `upstream connect error or disconnect/reset before headers. reset reason: connection termination`. So Envoy detects the upstream as unhealthy, but the UpstreamGroup still continues to send traffic to it.

@SantoDE
Contributor

SantoDE commented Sep 28, 2023

@AkshayAdsul did that already work in a previous version? Or is that new?

@AkshayAdsul
Author

AkshayAdsul commented Sep 29, 2023

I tried this on the 1.13.6 and 1.15.x versions and saw the same issue; I don't think this ever worked.
Kas looked at the code and mentioned that we incorrectly write empty endpoints for the upstream. It is not an issue when managing subsets in a single upstream, so it is specific to the UpstreamGroup.

@jbohanon
Contributor

jbohanon commented Sep 29, 2023

I'd guess `healthy_panic_threshold: {}` might have something to do with this in their env. In your sample steps, what happens if you turn off UDS and manually define the v1 service upstream with the setting explicitly set, as v2 shows above?

The healthy panic threshold needs to be zero to properly get the `no_healthy_upstream` response flag/details (link)

@jbohanon
Contributor

jbohanon commented Sep 29, 2023

In this bug we get a no healthy upstream error rather than upstream connect error or disconnect/reset before headers. reset reason: connection termination . So it detects the upstream as unhealthy but UpStreamGroup still continues to send traffic to it.

If you're receiving these response details, then Envoy should never have sent the upstream request (link)

@AkshayAdsul
Author

When we set `healthyPanicThreshold: 0` on an Upstream, I see in the config dump that the value is rendered as `"healthy_panic_threshold": {}` rather than `0`, which means it falls back to the default value of 50 instead of 0.
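For reference, Envoy's panic threshold lives on the cluster's `common_lb_config` as a `Percent` message; per Envoy's v3 cluster API, an unset message falls back to the 50% default, so disabling panic routing requires an explicit zero. A sketch of the cluster fragment one would expect to see in the config dump (illustrative only, not what Gloo currently emits):

```yaml
common_lb_config:
  healthy_panic_threshold:
    value: 0.0   # explicit 0 disables panic mode; leaving the message unset means 50%
```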

@day0ops
Contributor

day0ops commented Oct 4, 2023

I don't know about the healthyPanicThreshold, but have a look at the endpoints this creates. I think that when multiple upstreams are involved with health checks, it doesn't quite manage those endpoints correctly.

This works fine with a single upstream with multiple subsets, by the way.

@solo-io solo-io deleted a comment from AkshayAdsul Oct 5, 2023
@nfuden
Contributor

nfuden commented Oct 9, 2023

After more investigation I think we may have a fundamental mismatch in use case here.
I believe what we are looking at is UpstreamGroup -> weighted_cluster on routing, which is explicitly fine with some requests failing, while what is being attempted here is load balancing that also avoids sending to unhealthy destinations.

I am still confirming what the actual behavior is, but I think we may have to introduce a new concept or flag to actually provide this type of functionality. For example, instead of just having the route action with weights, the destinations of those weights could be aggregate clusters with the other upstream group members as secondary/tertiary locations.

Disclaimer: this is just the result of a quick initial investigation.
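For illustration, the aggregate-cluster idea above could look roughly like the Envoy configuration below. The cluster names come from this issue, but the aggregate cluster itself is a sketch of a possible future approach, not something Gloo Edge emits today:

```yaml
clusters:
- name: echo-aggregate
  connect_timeout: 1s
  lb_policy: CLUSTER_PROVIDED          # required by the aggregate cluster type
  cluster_type:
    name: envoy.clusters.aggregate
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
      clusters:                        # ordered fallback: prefer v1, fail over to v2
      - echo-echo-v1-8080
      - echo-echo-v2-8080
```

With this shape, Envoy only spills traffic to the next cluster in the list when the preferred one has no healthy hosts, which is the behavior the reporter expected from the UpstreamGroup.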

@nfuden nfuden assigned SantoDE and unassigned davidjumani and nfuden Oct 10, 2023
@nfuden
Contributor

nfuden commented Oct 10, 2023

Alright, I have confirmed the prior statements: upstream groups don't respect health across the upstream clusters, because each cluster contains its own concept of health and an UpstreamGroup is a route-level paradigm rather than a cluster-level configuration.

@SantoDE will follow up with the currently affected user, and we will determine whether there is an acceptable configuration that reworks the upstreams into a single upstream with some load balancing setup (such as locality-based), or whether we should approach adding aggregate clusters as a concept in Gloo Edge.
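To make the route-level vs. cluster-level distinction concrete: an UpstreamGroup translates (roughly) to an Envoy `weighted_clusters` route action like the sketch below. The weight split is applied when the route is chosen, before any per-cluster health is consulted, so an entirely unhealthy cluster still receives its share of requests. Cluster names here are illustrative of Gloo's `name_namespace` convention:

```yaml
route:
  weighted_clusters:
    clusters:
    - name: echo-echo-v1-8080_gloo-system
      weight: 5
    - name: echo-echo-v2-8080_gloo-system
      weight: 5
```

Health checking then happens independently inside each selected cluster, which is why half the requests in the repro steps land on the unhealthy cluster and fail.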

@jbohanon
Contributor

jbohanon commented Oct 10, 2023 via email

@nfuden
Contributor

nfuden commented Oct 10, 2023

Yep, that's one of the steps we are taking :P
