-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[memory_monitoring] Enhance monitoring the memory usage of containers #19179
base: master
Are you sure you want to change the base?
Conversation
if total_memory_usage > threshold_value: | ||
print("[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!" | ||
.format(container_name, total_memory_usage, threshold_value)) | ||
sys.exit(3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FengPan-Frank could you add more details about the changes in the description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
if not container_id: | ||
syslog.syslog(syslog.LOG_ERR, "[memory_checker] Failed to get contianer ID of '{}'! Exiting ..." | ||
.format(container_name)) | ||
sys.exit(4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we define what certain error codes are and use those, such as for example, ERROR_CONTAINER_ID_NOT_FOUND? Maybe we can reuse certain error codes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
|
||
return memory_usage_in_bytes | ||
|
||
def get_cache_usage(container_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe we can refer to this as getting inactive cache usage since we want to calculate container memory usage as total memory usage - inactive cache usage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
|
||
cache_usage_in_bytes = get_cache_usage(container_id) | ||
syslog.syslog(syslog.LOG_INFO, "[memory_checker] The cache usage of container '{}' is '{}' Bytes!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be beneficial to also see what the active cache usage is? Since here you are defining cache usage as only the inactive cache usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be just identical to the logic of "docker stats", refer https://docs.docker.com/reference/cli/docker/container/stats/#extended-description and below snippet calculation.
…ic-buildimage into memory_checker
…ic-buildimage into memory_checker
…ic-buildimage into memory_checker
""" | ||
container_id = "" | ||
|
||
get_container_info_cmd = "docker ps --no-trunc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall that in the get_command_result()
function, when using subprocess with shell=False, the command should be provided as a list of strings rather than a single string. Could you double-check this? Same for other places
Quick test
admin@vlab-01:~$ sudo cat test.py
import subprocess
def get_command_result(command):
command_stdout = ""
proc_instance = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
command_stdout, command_stderr = proc_instance.communicate()
if proc_instance.returncode != 0:
sys.exit(1)
return command_stdout.strip()
print(get_command_result("docker ps --no-trunc"))
admin@vlab-01:~$ sudo python3 test.py
Traceback (most recent call last):
File "/home/admin/test.py", line 12, in <module>
print(get_command_result("docker ps --no-trunc"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/admin/test.py", line 6, in get_command_result
proc_instance = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/subprocess.py", line 1024, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.11/subprocess.py", line 1901, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'docker ps --no-trunc'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fixed and added verification test result.
Why I did it
We need to restrict memory usage of container specifically, and the reliable option is to read cgroup subsystem files instead of using "docker stats" commands, since the commands will be no response if containers hits hard limit.
Work item tracking
How I did it
Instead of depending on the output of docker stats, the background script memory_checker will calculate the memory usage of a container based on values reading from the cgroup subsystem files /sys/fs/cgroup/memory/docker/<container_name>/memory.usage_in_bytes and /sys/fs/cgroup/memory/docker/<container_name>/memory.stats.
Refer to this Docker official document (https://docs.docker.com/engine/reference/commandline/stats/#extended-description) to make sure the memory usage of a specific container reading from command output of docker stats is equal to the value subtracting cache usage from the total memory usage.
How to verify it
Local verified, since it's just internal enhancement for getting memory usage of container, below are comparison between new memory_check and previous implementation based on "docker stats --no-stream --format {{.MemUsage}} telemetry"
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)