Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[memory_monitoring] Enhance monitoring the memory usage of containers #19179

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

FengPan-Frank
Copy link
Contributor

@FengPan-Frank FengPan-Frank commented Jun 3, 2024

Why I did it

We need to restrict memory usage of container specifically, and the reliable option is to read cgroup subsystem files instead of using "docker stats" commands, since the commands will be no response if containers hits hard limit.

Work item tracking
  • Microsoft ADO (number only):28334202

How I did it

Instead of depending on the output of docker stats, the background script memory_checker will calculate the memory usage of a container based on values reading from the cgroup subsystem files /sys/fs/cgroup/memory/docker/<container_name>/memory.usage_in_bytes and /sys/fs/cgroup/memory/docker/<container_name>/memory.stats.

Refer to this Docker official document (https://docs.docker.com/engine/reference/commandline/stats/#extended-description) to make sure the memory usage of a specific container reading from command output of docker stats is equal to the value subtracting cache usage from the total memory usage.

How to verify it

Local verified, since it's just internal enhancement for getting memory usage of container, below are comparison between new memory_check and previous implementation based on "docker stats --no-stream --format {{.MemUsage}} telemetry"

image

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

if total_memory_usage > threshold_value:
print("[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!"
.format(container_name, total_memory_usage, threshold_value))
sys.exit(3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FengPan-Frank could you add more details about the changes in the description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

if not container_id:
syslog.syslog(syslog.LOG_ERR, "[memory_checker] Failed to get contianer ID of '{}'! Exiting ..."
.format(container_name))
sys.exit(4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we define what certain error codes are and use those, such as for example, ERROR_CONTAINER_ID_NOT_FOUND? Maybe we can reuse certain error codes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.


return memory_usage_in_bytes

def get_cache_usage(container_id):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe we can refer to this as getting inactive cache usage since we want to calculate container memory usage as total memory usage - inactive cache usage

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.


cache_usage_in_bytes = get_cache_usage(container_id)
syslog.syslog(syslog.LOG_INFO, "[memory_checker] The cache usage of container '{}' is '{}' Bytes!"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be beneficial to also see what the active cache usage is? Since here you are defining cache usage as only the inactive cache usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be just identical to the logic of "docker stats", refer https://docs.docker.com/reference/cli/docker/container/stats/#extended-description and below snippet calculation.

image

"""
container_id = ""

get_container_info_cmd = "docker ps --no-trunc"
Copy link
Contributor

@maipbui maipbui Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"docker ps --no-trunc"

I recall that in the get_command_result() function, when using subprocess with shell=False, the command should be provided as a list of strings rather than a single string. Could you double-check this? Same for other places

Quick test

admin@vlab-01:~$ sudo cat test.py
import subprocess

def get_command_result(command):
    command_stdout = ""

    proc_instance = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
    command_stdout, command_stderr = proc_instance.communicate()
    if proc_instance.returncode != 0:
        sys.exit(1)
    return command_stdout.strip()

print(get_command_result("docker ps --no-trunc"))

admin@vlab-01:~$ sudo python3 test.py
Traceback (most recent call last):
  File "/home/admin/test.py", line 12, in <module>
    print(get_command_result("docker ps --no-trunc"))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/admin/test.py", line 6, in get_command_result
    proc_instance = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 1024, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.11/subprocess.py", line 1901, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'docker ps --no-trunc'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fixed and added verification test result.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants