
[telemetry] Limit resource usage of ST and disable OOM killer #10017

Closed
wants to merge 1 commit

Conversation

yozhao101
Contributor

Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it

Recently we observed memory spikes in streaming telemetry caused by issues on the gNMI client side. This change therefore puts a hard limit on the container's memory usage. At the same time, the OOM killer is disabled in the streaming telemetry container, since the OOM killer would panic the kernel if an OOM condition actually occurred.

In other words, if an OOM condition occurs in streaming telemetry, we will rely on Monit to catch the issue and restart the streaming telemetry container. Another open PR (#10008) makes the corresponding change to the Monit scripts.

How I did it

I added several container runtime parameters to limit memory and CPU usage and disabled the OOM killer for the streaming telemetry container.
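
A minimal sketch of what such runtime options look like in the telemetry Makefile, in the same style as the diff further down. Only the 450m memory cap is visible in this PR's diff; the CPU flag, its value, and the exact OOM-killer line here are illustrative assumptions, not the PR's actual lines:

```makefile
# Hypothetical sketch of container runtime limits for the telemetry docker.
# The 450m memory cap matches the diff in this PR; the CPU limit value and
# the --oom-kill-disable line are assumptions for illustration only.
$(DOCKER_TELEMETRY)_RUN_OPT += --memory 450m        # hard memory cap
$(DOCKER_TELEMETRY)_RUN_OPT += --cpus 1.0           # example CPU limit (assumed value)
$(DOCKER_TELEMETRY)_RUN_OPT += --oom-kill-disable   # rely on Monit restart instead of the kernel OOM killer
```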

How to verify it

I verified this change on DUT str-msn27000-20.
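
For reference, one way to confirm the applied limits on the DUT is to inspect the running container (a sketch; assumes docker CLI access on the switch):

```shell
# Check the configured memory cap (in bytes) and the OOM-killer setting
docker inspect telemetry --format '{{.HostConfig.Memory}} {{.HostConfig.OomKillDisable}}'

# Observe current memory/CPU usage against the cap
docker stats --no-stream telemetry
```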

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • [x] 202012
  • 202106
  • 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
yozhao101 changed the title from "[telemetry] Limit memory usage of ST and disable OOM killer." to "[telemetry] Limit resource usage of ST and disable OOM killer" on Feb 18, 2022
@@ -29,6 +29,9 @@ endif

$(DOCKER_TELEMETRY)_CONTAINER_NAME = telemetry
$(DOCKER_TELEMETRY)_RUN_OPT += --privileged -t
$(DOCKER_TELEMETRY)_RUN_OPT += --memory 450m
Contributor

Please make sure this scenario does not happen:

  1. memory_checker appears to run inside the telemetry docker, and its threshold is 400m: https://github.com/sonic-net/sonic-buildimage/blob/master/dockers/docker-sonic-telemetry/base_image_files/monit_telemetry

  2. memory_checker runs its first check while current memory usage is 390m, so the check passes. The next check runs one minute later.

  3. Before the next check runs, telemetry grows to use more than 450M of memory.

  4. Because memory can no longer be allocated, memory_checker fails. The telemetry container cannot be restarted; it will hang there and the memory will never be released.

https://docs.docker.com/engine/reference/run/#user-memory-constraints
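
For illustration, the flag combination under discussion behaves roughly as follows (a sketch; the image name is a placeholder, not the real SONiC telemetry image):

```shell
# With a hard memory cap and the OOM killer disabled, a process that exceeds
# the cap is not killed; its allocations block instead, which is the
# "container hangs and memory is never released" case described above.
docker run -d --name telemetry-test \
  --memory 450m \
  --oom-kill-disable \
  telemetry-image   # placeholder image name

# The external check (memory_checker with a 400m threshold, per the linked
# monit config) therefore has to fire before usage reaches the 450m cap,
# otherwise memory_checker itself may fail to allocate inside the container.
```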

Contributor

With #19179, we are already using Monit to monitor the runtime memory usage of the docker. @qiluo-msft, do we still need this docker hard limit as double insurance?

qiluo-msft (Collaborator) left a comment


It seems dangerous to merge this feature. If a slow memory leak happens in the telemetry container, we still need the telemetry functionality in order to detect the memory issue, instead of breaking the telemetry feature, so I am blocking this PR.

qiluo-msft closed this on Jun 7, 2024