AKS and ACI reporting slow to metrics server. Need faster scale out than 5 min. #119

luddskunk · 2019-10-01T13:49:46Z

Environment summary

Provider: ACI

Version: v1.13.1-vk-v0.9.0-1

K8s Master Info: AKS

Install Method: Azure Portal

Issue Details

I have set up a new AKS cluster with Virtual Kubelet enabled and run a load test against my pods with JMeter. Together with an HPA, I successfully autoscale pods onto ACI instances. However, I have noticed that the metrics server does not get any metrics from an ACI instance until ~5 minutes after it starts. Only then is the HPA updated and a new scale-out performed. If the load increases during the ~5 minute wait, the HPA is not updated until the next ~5 minute threshold.
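For context, here is a sketch of the replica calculation described in the Kubernetes HPA documentation, which may partly explain why missing ACI metrics hold back scale-out. When a pod has no metrics and the HPA is scaling up, the documented behavior is to recompute the average conservatively, treating that pod as using 0% of its request. The numbers below are illustrative assumptions, not measurements from this cluster, and the example ignores the tolerance band and any scale-up rate limit.

```latex
% General HPA rule (from the Kubernetes documentation)
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]

% Illustrative case: 2 replicas, target 5% CPU.
% One pod reports 20%; the ACI pod has no metrics yet and is counted as 0%.
\[
\begin{aligned}
\text{currentMetricValue} &= \frac{20\% + 0\%}{2} = 10\% \\
\text{desiredReplicas} &= \left\lceil 2 \times \frac{10\%}{5\%} \right\rceil = 4
\end{aligned}
\]

% Had the ACI pod also reported 20%, the same rule would give
\[
\text{desiredReplicas} = \left\lceil 2 \times \frac{20\%}{5\%} \right\rceil = 8
\]
```

Under these assumptions, each pod without metrics pulls the computed average down, so the HPA asks for fewer replicas than the real load would justify until the ACI metrics arrive.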

Is there any way to reduce this delay to less than ~5 minutes? I want to be more resilient to bursts of traffic to my pods.

Repro Steps

1. Set up an AKS cluster with Virtual Kubelet.
2. Deploy the example pod (see the attached manifests for the ones I used).
3. Run JMeter and perform a load test.

Example output from metrics-server when no metrics are found:
1 reststorage.go:93] No metrics for pod default/php-apache-86ddb69d6f-9fjwj

HPA.yaml

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
  namespace: default
spec:
  maxReplicas: 100
  minReplicas: 1
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: php-apache
  targetCPUUtilizationPercentage: 5

php-apache.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    run: php-apache
  name: php-apache
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - image: k8s.gcr.io/hpa-example
        imagePullPolicy: Always
        name: php-apache
        ports:
        - containerPort: 80
          protocol: TCP
        resources:
          limits:
            cpu: 200m
          requests:
            cpu: 200m
      nodeSelector:
        type: virtual-kubelet
      tolerations:
      - key: virtual-kubelet.io/provider
        operator: Equal
        value: azure
        effect: NoSchedule

service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    run: php-apache
  name: php-apache
  namespace: default
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: php-apache
  type: ClusterIP

Any ideas or feedback would be greatly appreciated.
Thanks for an otherwise awesome product!


cpuguy83 · 2019-10-01T21:31:49Z

metrics-server collects metrics from nodes pretty much on demand.
It's likely the issue is related to slow or even missed collections from ACI itself (limitation in the platform).

Currently the best resolution we can even get from ACI is 1 minute (https://github.com/virtual-kubelet/azure-aci/blob/master/client/aci/metrics.go#L44), and that's in a best case scenario. Collections can get missed (within ACI) just due to low priority of the job.

It's probably best to look at other metrics (such as requests per second?) for scale-out until ACI's metrics collection is more robust.
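As a rough sketch of what scaling on requests per second could look like, the manifest below replaces the CPU target with a per-pod custom metric. It assumes a custom metrics adapter (for example a Prometheus adapter) is installed and exposes a metric through the custom metrics API; the metric name `http_requests_per_second` and the target of 10 are hypothetical placeholders, not values from this thread. It uses `autoscaling/v2`, which is only available on newer clusters; older clusters such as the one in this report would need `autoscaling/v2beta2` instead.

```yaml
# Sketch only: HPA driven by a per-pod request rate instead of CPU.
# Requires a custom metrics adapter serving custom.metrics.k8s.io.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 100
  metrics:
  - type: Pods
    pods:
      metric:
        # Hypothetical metric name; use whatever your adapter exposes.
        name: http_requests_per_second
      target:
        type: AverageValue
        # Hypothetical target: average of 10 requests/second per pod.
        averageValue: "10"
```

Because the request rate would come from the ingress or application side rather than from ACI, this approach should not depend on ACI's metrics publishing interval.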

luddskunk · 2019-10-02T14:40:27Z

Hello,

Thanks for the quick reply. Concerning the resolution, do you mean the --full-resync-period duration flag (https://virtual-kubelet.io/docs/usage/#flags)?

I just want to make sure we are on the same page about the ~5 minute delay we are experiencing. Does this mean we cannot change this value ourselves and would have to turn to Azure ACI support?

Looking forward to your response!

cpuguy83 · 2019-10-02T15:34:14Z

No, --full-resync-period is related to the k8s client.
The delay is almost certainly internal to ACI itself.

When metrics-server requests metrics, the ACI VK provider fetches them from the ACI API immediately, but ACI itself doesn't publish live metrics, only a summary over a time interval (1 minute being the shortest interval).
ACI's internal metrics collection/publishing is low priority on the system and may not even run during that interval, so the metrics will be old (and discarded by metrics-server).

In terms of initial metrics, there may even be a longer delay here.

/cc @ibabou

abengtss · 2019-10-02T17:03:25Z

Thanks @cpuguy83 for the great input. It sounds as if you are recommending the following page as a solution to this problem:
https://github.com/Azure-Samples/virtual-node-autoscale?source=post_page-----f66b908661c1----------------------

Cheers

helayoty · 2023-03-16T16:56:15Z

This has been solved by using realtime metrics as the default metrics.

pires transferred this issue from virtual-kubelet/virtual-kubelet Jan 29, 2021

feiskyer added the dependency/aci label Feb 8, 2021

helayoty added this to Needs triage in Bug Triage Apr 18, 2022

helayoty closed this as completed Mar 16, 2023
