AKS and ACI reporting slow to metrics server. Need faster scale out than 5 min. #119

Closed
luddskunk opened this issue Oct 1, 2019 · 5 comments

@luddskunk

Environment summary

Provider: ACI

Version: v1.13.1-vk-v0.9.0-1

K8s Master Info: AKS

Install Method: Azure Portal

Issue Details

I have set up a new AKS cluster with Virtual Kubelet enabled and run a load test against my pods with JMeter. Together with an HPA, I successfully autoscale pods onto ACI instances. However, I have noted that the metrics server does not receive any metrics from a new ACI instance until ~5 minutes have passed. Only then is the HPA updated and a new scale-out performed. If the load increases during that ~5 minute wait, the HPA is not updated until the next ~5 minute threshold.

Is there any way to reduce this delay to less than ~5 minutes? I want to be more resilient to bursts of traffic to my pods.
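
To make the scale-out step concrete, here is a minimal sketch of the basic replica calculation documented for the Horizontal Pod Autoscaler, ceil(currentReplicas * currentMetric / targetMetric). The function names are hypothetical, the 0.1 tolerance is assumed to be the controller default, and the handling of pods with missing metrics is left out entirely. The 5% target and the 1 to 100 replica bounds come from HPA.yaml below.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     tolerance=0.1):
    """Basic HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    ratio = current_utilization / target_utilization
    # If the ratio is within the tolerance of 1.0, the replica count is kept.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def clamp(replicas, min_replicas=1, max_replicas=100):
    """Apply the minReplicas/maxReplicas bounds from the HPA spec."""
    return max(min_replicas, min(max_replicas, replicas))


# One pod at 50% CPU against a 5% target asks for ten pods.
print(clamp(desired_replicas(1, 50, 5)))
```

The calculation can only run when a current utilization value exists, which is why a pod without metrics holds back the next scale-out decision.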

Repro Steps

  1. Set up an AKS cluster with Virtual Kubelet
  2. Deploy the example pod (see the manifests below for the ones I used)
  3. Run JMeter and perform a load test

Example output from metrics-server when no metrics are found:
1 reststorage.go:93] No metrics for pod default/php-apache-86ddb69d6f-9fjwj

HPA.yaml

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
  namespace: default
spec:
  maxReplicas: 100
  minReplicas: 1
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: php-apache
  targetCPUUtilizationPercentage: 5

php-apache.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    run: php-apache
  name: php-apache
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - image: k8s.gcr.io/hpa-example
        imagePullPolicy: Always
        name: php-apache
        ports:
        - containerPort: 80
          protocol: TCP
        resources:
          limits:
            cpu: 200m
          requests:
            cpu: 200m
      nodeSelector:
        type: virtual-kubelet
      tolerations:
      - key: virtual-kubelet.io/provider
        operator: Equal
        value: azure
        effect: NoSchedule

service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    run: php-apache
  name: php-apache
  namespace: default
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: php-apache
  type: ClusterIP

Any ideas or feedback would be greatly appreciated.
Thanks for an otherwise awesome product!

@cpuguy83
Contributor

cpuguy83 commented Oct 1, 2019

metrics-server collects metrics from nodes pretty much on demand.
It's likely the issue is related to slow or even missed collections from ACI itself (limitation in the platform).

Currently the best resolution we can even get from ACI is 1 minute (https://github.com/virtual-kubelet/azure-aci/blob/master/client/aci/metrics.go#L44), and that's in a best case scenario. Collections can get missed (within ACI) just due to low priority of the job.

It's probably best to look at other metrics (such as requests per second?) for scale-out until ACI's metrics collection is more robust.
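
A minimal sketch of what scaling on requests per second could look like, assuming an average-value target per pod. The function name and the 50 requests/second target are hypothetical and chosen only for illustration; the replica bounds mirror HPA.yaml from the issue. The request rate would have to come from a source other than ACI's CPU metrics, such as an ingress controller or the load balancer.

```python
import math


def replicas_for_rps(total_rps, target_rps_per_pod,
                     min_replicas=1, max_replicas=100):
    """Replicas needed so that each pod serves at most target_rps_per_pod."""
    if target_rps_per_pod <= 0:
        raise ValueError("target_rps_per_pod must be positive")
    wanted = math.ceil(total_rps / target_rps_per_pod)
    # Respect the same bounds an HPA would enforce.
    return max(min_replicas, min(max_replicas, wanted))


# 450 requests/second at a hypothetical 50 requests/second per pod.
print(replicas_for_rps(450, 50))
```

Because the input is measured at the traffic source, the decision does not wait for per-pod metrics from newly started ACI instances.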

@luddskunk
Author

Hello,

Thanks for the quick reply. Concerning the resolution, do you mean the flag (https://virtual-kubelet.io/docs/usage/#flags) --full-resync-period duration ?

I just want to make sure we are on the same page about the ~5 minute delay we are experiencing. Does that mean we cannot affect this value unless we turn to Azure ACI support?

Looking forward to your response!

@cpuguy83
Contributor

cpuguy83 commented Oct 2, 2019

No, --full-resync-period is related to the k8s client.
The delay is almost certainly internal to ACI itself.

When metrics-server requests metrics, the ACI VK provider fetches them from the ACI API immediately, but ACI itself doesn't publish live metrics, only a summary over a time interval (1 minute being the shortest interval).
ACI's internal metrics collection/publishing is low priority on the system and may not even run during that interval, so metrics will be old (and discarded by metrics-server).

In terms of initial metrics, there may even be a longer delay here.
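
To illustrate the staleness problem described above, here is a minimal sketch. It is not the provider's or metrics-server's actual code: the function name, the tuple layout, and the 2 minute freshness limit are all assumptions made for the example. It shows how a consumer that rejects old samples ends up with no usable metric once a few one-minute summaries are missed.

```python
from datetime import datetime, timedelta, timezone


def latest_fresh_sample(samples, now, max_age):
    """Return the newest (window_end, cpu_millicores) sample, or None if stale.

    max_age is a hypothetical freshness limit, not a documented setting.
    """
    if not samples:
        return None
    newest = max(samples, key=lambda s: s[0])
    if now - newest[0] > max_age:
        return None  # too old: treated the same as "no metrics for pod"
    return newest


now = datetime(2019, 10, 2, 12, 10, tzinfo=timezone.utc)
# One-minute summaries whose collection stopped five minutes ago.
samples = [(now - timedelta(minutes=m), 180) for m in (7, 6, 5)]
print(latest_fresh_sample(samples, now, max_age=timedelta(minutes=2)))
```

With the newest summary five minutes old, the lookup returns None, which matches the "No metrics for pod" symptom in the original report.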

/cc @ibabou

@abengtss

abengtss commented Oct 2, 2019

Thanks @cpuguy83 for the great input. It sounds as if you are recommending the following page as a solution to this problem:
https://github.com/Azure-Samples/virtual-node-autoscale?source=post_page-----f66b908661c1----------------------

Cheers

@pires pires transferred this issue from virtual-kubelet/virtual-kubelet Jan 29, 2021
@helayoty helayoty added this to Needs triage in Bug Triage Apr 18, 2022
@helayoty
Member

This has been solved by using realtime metrics as the default metrics.
