
[telemetry] Limit resource usage of ST and disable OOM killer #10017

Closed
wants to merge 1 commit

Conversation

yozhao101
Contributor

Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it

Recently we observed memory spikes in streaming telemetry caused by issues on the gNMI client side. This change therefore puts a hard limit on the container's memory usage. At the same time, the OOM killer is disabled in the streaming telemetry container, since the OOM killer would panic the kernel if an OOM condition actually occurred.

In other words, if an OOM condition occurs in streaming telemetry, we will rely on Monit to catch the issue and restart the streaming telemetry container. Another open PR (#10008) makes the corresponding change to the Monit scripts.

How I did it

I added several container runtime parameters to limit memory and CPU usage and disabled the OOM killer for the streaming telemetry container.
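
A minimal sketch of what such runtime options look like in the telemetry Makefile, in the same style as the diff further down. Only the 450m memory cap is visible in this PR's diff; the CPU flag, its value, and the exact OOM-killer line here are illustrative assumptions, not the PR's actual lines:

```makefile
# Hypothetical sketch of container runtime limits for the telemetry docker.
# The 450m memory cap matches the diff in this PR; the CPU limit value and
# the --oom-kill-disable line are assumptions for illustration only.
$(DOCKER_TELEMETRY)_RUN_OPT += --memory 450m        # hard memory cap
$(DOCKER_TELEMETRY)_RUN_OPT += --cpus 1.0           # example CPU limit (assumed value)
$(DOCKER_TELEMETRY)_RUN_OPT += --oom-kill-disable   # rely on Monit restart instead of the kernel OOM killer
```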

How to verify it

I verified this change on DUT str-msn27000-20.
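
For reference, one way to confirm the applied limits on the DUT is to inspect the running container (a sketch; assumes docker CLI access on the switch):

```shell
# Check the configured memory cap (in bytes) and the OOM-killer setting
docker inspect telemetry --format '{{.HostConfig.Memory}} {{.HostConfig.OomKillDisable}}'

# Observe current memory/CPU usage against the cap
docker stats --no-stream telemetry
```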

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • [x] 202012
  • 202106
  • 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
yozhao101 changed the title from "[telemetry] Limit memory usage of ST and disable OOM killer." to "[telemetry] Limit resource usage of ST and disable OOM killer" on Feb 18, 2022
@@ -29,6 +29,9 @@ endif

$(DOCKER_TELEMETRY)_CONTAINER_NAME = telemetry
$(DOCKER_TELEMETRY)_RUN_OPT += --privileged -t
$(DOCKER_TELEMETRY)_RUN_OPT += --memory 450m
Contributor

Please make sure this scenario does not happen:

  1. memory_checker appears to run inside the telemetry docker, and its threshold is 400m: https://github.com/sonic-net/sonic-buildimage/blob/master/dockers/docker-sonic-telemetry/base_image_files/monit_telemetry

  2. memory_checker runs its first check while current memory usage is 390m, so the check passes. The next check runs one minute later.

  3. Before the next check runs, telemetry grows to use more than 450M of memory.

  4. Because memory can no longer be allocated, memory_checker fails. The telemetry container cannot be restarted; it will hang there and the memory will never be released.

https://docs.docker.com/engine/reference/run/#user-memory-constraints
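
For illustration, the flag combination under discussion behaves roughly as follows (a sketch; the image name is a placeholder, not the real SONiC telemetry image):

```shell
# With a hard memory cap and the OOM killer disabled, a process that exceeds
# the cap is not killed; its allocations block instead, which is the
# "container hangs and memory is never released" case described above.
docker run -d --name telemetry-test \
  --memory 450m \
  --oom-kill-disable \
  telemetry-image   # placeholder image name

# The external check (memory_checker with a 400m threshold, per the linked
# monit config) therefore has to fire before usage reaches the 450m cap,
# otherwise memory_checker itself may fail to allocate inside the container.
```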

Contributor

With #19179, we are already using Monit to monitor the runtime memory usage of the docker. @qiluo-msft, do we still need this docker hard limit as double insurance?

qiluo-msft (Collaborator) left a comment


It seems dangerous to merge this feature. If a slow memory leak happens in the telemetry container, we still need the telemetry functionality in order to detect the memory issue, instead of breaking the telemetry feature, so I am blocking this PR.

qiluo-msft closed this on Jun 7, 2024