[Monit] Repeat restarting culprit container if Monit can't reset its counter #10288

yozhao101 · 2022-03-20T07:19:07Z

Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it

This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.

Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following:

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"

If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.

The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok.

Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry:

    Program 'container_memory_telemetry'
         status                             Status ok
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Sat, 19 Mar 2022 19:56:26

Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times
within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service:

    Program 'container_memory_telemetry'
         status                             Status failed
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Tue, 01 Feb 2022 22:52:55

After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok.

How I did it

In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles.

How to verify it

I verified this change on lab device str-s6000-acs-12. Another pytest PR (sonic-net/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.

Which release branch to backport (provide reason below if selected)

201811
201911
[x ] 202006
[x ] 202012
[x ] 202106
[x ] 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

qiluo-msft · 2022-04-06T16:57:35Z

files/image_config/monit/memory_checker

@@ -85,6 +85,8 @@ def check_memory_usage(container_name, threshold_value):
        if mem_usage_bytes > threshold_value:
            print("[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!"


print

Do you really need print? #Closed

The message text from statement print(...) will be appended to the end of Monit alerting message in syslog. Please review the following example:

Jan 2 02:34:56.029949 DSM07-0101-0613-10BT0 ERR monit[496]: 'container_memory_telemetry' status failed (3) -- [telemetry]: Memory usage (489475276.8 Bytes) is larger than the threshold (419430400 Bytes)!

qiluo-msft · 2022-04-06T17:05:34Z

dockers/docker-sonic-telemetry/base_image_files/monit_telemetry

@@ -2,4 +2,4 @@
 ## Monit configuration for telemetry container
 ###############################################################################
 check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
-    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
+    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles


2

How do you choose the magic number 2? #Closed

repeat every 2 cycles at here means if Monit failed to reset its counter and memory usage of telemetry is still larger than the threshold (at here 400MB) for 2 minutes, then telemetry container will be restarted.

I selected the number 2 since memory usage of telemetry possibly increased very quickly from MB to around GB within 2 minutes and telemetry should be restarted ASAP.

Other numbers can be selected as well and do you have any suggestion?

qiluo-msft · 2022-04-06T17:06:41Z

dockers/docker-sonic-telemetry/base_image_files/monit_telemetry

@@ -2,4 +2,4 @@
 ## Monit configuration for telemetry container
 ###############################################################################
 check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
-    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
+    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles


exec

Please explain how do you test this PR? The test could be either manual or ideally automated. You need to make sure the old code fails the test and new code passes the test. #Closed

As we discussed at last week's meeting, I am working on the PR of pytest testcase to make sure the old code will fail while new one will succeed.

Pytest PR is submitted for review: sonic-net/sonic-mgmt#5492.

qiluo-msft · 2022-04-06T17:07:22Z

dockers/docker-sonic-telemetry/base_image_files/monit_telemetry

@@ -2,4 +2,4 @@
 ## Monit configuration for telemetry container
 ###############################################################################
 check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
-    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
+    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles


repeat

You mentioned this is a monit feature. Please make sure monit on all the backport branches have this feature. #Closed

I will double check whether this new syntax existed on 201811, 201911 and 202012 images or not.

In 201811, 201911 and 202012 images, Monit has the same version 5.20.0. I checked this syntax of Monit with 20191130.83 image on lab device and it did work.

…10288) Signed-off-by: Yong Zhao <yozhao@microsoft.com> Why I did it This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container. Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following: check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400" if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted. Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window. The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok. Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry: Program 'container_memory_telemetry' status Status ok monitoring status Monitored monitoring mode active on reboot start last exit value 0 last output - data collected Sat, 19 Mar 2022 19:56:26 Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service: Program 'container_memory_telemetry' status Status failed monitoring status Monitored monitoring mode active on reboot start last exit value 0 last output - data collected Tue, 01 Feb 2022 22:52:55 After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok. How I did it In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles. How to verify it I verified this change on lab device str-s6000-acs-12. Another pytest PR (sonic-net/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.

…w syntax as a workaround (#5492) What is the motivation for this PR? This PR adds two new test cases to test a Monit issue fixed by the PR (sonic-net/sonic-buildimage#10288) in SONiC image repo. How did you do it? Specifically the 'stress' utility is leveraged to increase memory usage of a container continuously, then Test whether that container can be restarted by the script ran by Monit. Test whether that container can be restarted by the script ran by Moni; If that container was restarted, then test the script ran by Monit was unable to restart the container anymore due to Monit failed to reset its internal counter. Test whether that container can be restarted by the script ran by Moni; If that container was restarted, then test the script ran by Monit was able to restart the container with the help of new Monit syntax although Monit failed to reset its internal counter. How did you verify/test it? I tested manually against the DuT str-s6000-0n-5 with 20201231.54 and 20191130.83 images. and following is testing results with 20201231.54 image.

Related work items: #49, #58, #107, sonic-net#247, sonic-net#249, sonic-net#277, sonic-net#593, sonic-net#597, sonic-net#1035, sonic-net#2130, sonic-net#2150, sonic-net#2165, sonic-net#2169, sonic-net#2178, sonic-net#2179, sonic-net#2187, sonic-net#2188, sonic-net#2191, sonic-net#2195, sonic-net#2197, sonic-net#2198, sonic-net#2200, sonic-net#2202, sonic-net#2206, sonic-net#2209, sonic-net#2211, sonic-net#2216, sonic-net#7909, sonic-net#8927, sonic-net#9681, sonic-net#9733, sonic-net#9746, sonic-net#9850, sonic-net#9967, sonic-net#10104, sonic-net#10152, sonic-net#10168, sonic-net#10228, sonic-net#10266, sonic-net#10288, sonic-net#10294, sonic-net#10313, sonic-net#10394, sonic-net#10403, sonic-net#10404, sonic-net#10421, sonic-net#10431, sonic-net#10437, sonic-net#10445, sonic-net#10457, sonic-net#10458, sonic-net#10465, sonic-net#10467, sonic-net#10469, sonic-net#10470, sonic-net#10474, sonic-net#10477, sonic-net#10478, sonic-net#10482, sonic-net#10485, sonic-net#10488, sonic-net#10489, sonic-net#10492, sonic-net#10494, sonic-net#10498, sonic-net#10501, sonic-net#10509, sonic-net#10512, sonic-net#10514, sonic-net#10516, sonic-net#10517, sonic-net#10523, sonic-net#10525, sonic-net#10531, sonic-net#10532, sonic-net#10538, sonic-net#10555, sonic-net#10557, sonic-net#10559, sonic-net#10561, sonic-net#10565, sonic-net#10572, sonic-net#10574, sonic-net#10576, sonic-net#10578, sonic-net#10581, sonic-net#10585, sonic-net#10587, sonic-net#10599, sonic-net#10607, sonic-net#10611, sonic-net#10616, sonic-net#10618, sonic-net#10619, sonic-net#10623, sonic-net#10624, sonic-net#10633, sonic-net#10646, sonic-net#10655, sonic-net#10660, sonic-net#10664, sonic-net#10680, sonic-net#10683

[Monit] Fix the issue which shows Monit can not reset its counter.

db375a9

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

yozhao101 changed the title ~~[Monit] Restart culprit container if Monit can't reset its counter~~ [Monit] Repeat restarting culprit container if Monit can't reset its counter Mar 20, 2022

yozhao101 added Bug 🐛 Request for 202012 Branch labels Mar 20, 2022

yozhao101 marked this pull request as ready for review March 21, 2022 18:14

yozhao101 requested a review from lguohan as a code owner March 21, 2022 18:14

yozhao101 requested a review from qiluo-msft March 21, 2022 18:25

qiluo-msft reviewed Apr 6, 2022

View reviewed changes

yozhao101 mentioned this pull request Apr 8, 2022

[Pytest] Test inability of Monit to reset internal counter and its new syntax as a workaround sonic-net/sonic-mgmt#5492

Merged

3 tasks

qiluo-msft requested a review from yxieca April 18, 2022 21:40

qiluo-msft approved these changes Apr 18, 2022

View reviewed changes

yozhao101 merged commit e24fe9b into sonic-net:master Apr 21, 2022

yozhao101 deleted the fix_monitor_telemetry_memory branch April 21, 2022 01:08

qiluo-msft added the Included in 202012 Branch label Apr 22, 2022

yozhao101 added the Request for 201911 Branch label Apr 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Monit] Repeat restarting culprit container if Monit can't reset its counter #10288

[Monit] Repeat restarting culprit container if Monit can't reset its counter #10288

yozhao101 commented Mar 20, 2022 •

edited

Loading

qiluo-msft Apr 6, 2022 •

edited

Loading

yozhao101 Apr 6, 2022 •

edited

Loading

qiluo-msft Apr 6, 2022 •

edited

Loading

yozhao101 Apr 6, 2022

qiluo-msft Apr 6, 2022 •

edited

Loading

yozhao101 Apr 6, 2022

yozhao101 Apr 8, 2022

qiluo-msft Apr 6, 2022 •

edited

Loading

yozhao101 Apr 8, 2022

yozhao101 Apr 21, 2022

		@@ -85,6 +85,8 @@ def check_memory_usage(container_name, threshold_value):
		if mem_usage_bytes > threshold_value:
		print("[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!"

[Monit] Repeat restarting culprit container if Monit can't reset its counter #10288

[Monit] Repeat restarting culprit container if Monit can't reset its counter #10288

Conversation

yozhao101 commented Mar 20, 2022 • edited Loading

Why I did it

How I did it

How to verify it

Which release branch to backport (provide reason below if selected)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

qiluo-msft Apr 6, 2022 • edited Loading

Choose a reason for hiding this comment

yozhao101 Apr 6, 2022 • edited Loading

Choose a reason for hiding this comment

qiluo-msft Apr 6, 2022 • edited Loading

Choose a reason for hiding this comment

yozhao101 Apr 6, 2022

Choose a reason for hiding this comment

qiluo-msft Apr 6, 2022 • edited Loading

Choose a reason for hiding this comment

yozhao101 Apr 6, 2022

Choose a reason for hiding this comment

yozhao101 Apr 8, 2022

Choose a reason for hiding this comment

qiluo-msft Apr 6, 2022 • edited Loading

Choose a reason for hiding this comment

yozhao101 Apr 8, 2022

Choose a reason for hiding this comment

yozhao101 Apr 21, 2022

Choose a reason for hiding this comment

yozhao101 commented Mar 20, 2022 •

edited

Loading

qiluo-msft Apr 6, 2022 •

edited

Loading

yozhao101 Apr 6, 2022 •

edited

Loading

qiluo-msft Apr 6, 2022 •

edited

Loading

qiluo-msft Apr 6, 2022 •

edited

Loading

qiluo-msft Apr 6, 2022 •

edited

Loading