[system-health] No longer check critical process/service status via monit (#9068)

HLD updated here: sonic-net/SONiC#887

#### Why I did it

The command `monit summary -B` can no longer display the status of each critical process, so system-health should not depend on it and needs another way to monitor critical processes. This PR addresses that. monit is still used by system-health for the file system check and for customized checks.

#### How I did it

1. Get container names from the FEATURE table
2. For each container, collect critical process names from the file critical_processes
3. Use `docker exec -it <container_name> bash -c 'supervisorctl status'` to get the process status inside the container, parse the output, and check whether any critical process has exited
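
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code added by this commit: the function names are hypothetical, the `critical_processes` file is assumed to hold one `program:<name>` entry per line, and the `supervisorctl status` output is assumed to start each line with the process name followed by its state. The sketch also calls `docker exec` without `-it`, since no TTY is needed when the output is captured.

```python
import subprocess


def parse_critical_processes(text):
    """Return the process names listed in a critical_processes file.

    Assumption: one entry per line in the form "program:<name>". Other
    entries (for example "group:<name>") are ignored in this sketch.
    """
    names = []
    for line in text.splitlines():
        kind, _, name = line.strip().partition(':')
        if kind == 'program' and name:
            names.append(name)
    return names


def parse_supervisorctl_status(output):
    """Map process name to state, parsed from `supervisorctl status` output."""
    states = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # Grouped processes are printed as "<group>:<name>"; keep the name.
            states[fields[0].split(':')[-1]] = fields[1]
    return states


def find_not_running(critical, states):
    """Return critical processes that are missing or not in RUNNING state."""
    return [name for name in critical if states.get(name) != 'RUNNING']


def check_container(container, critical_text):
    """Query supervisorctl inside a container and list unhealthy processes."""
    # check=False: supervisorctl may exit non-zero when a process is down,
    # and that case is exactly what we want to inspect.
    result = subprocess.run(
        ['docker', 'exec', container, 'supervisorctl', 'status'],
        capture_output=True, text=True, check=False)
    return find_not_running(parse_critical_processes(critical_text),
                            parse_supervisorctl_status(result.stdout))
```

Keeping the parsing separate from the `docker exec` call lets the parsing be unit tested with canned output, without a running container.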

#### How to verify it

1. Add unit test case to cover it
2. Adjust sonic-mgmt cases to cover it
3. Manual test
Junchao-Mellanox committed Nov 23, 2021
1 parent 240596e commit 11a93d2
Showing 7 changed files with 624 additions and 29 deletions.
1 change: 1 addition & 0 deletions rules/system-health.mk
@@ -4,6 +4,7 @@ SYSTEM_HEALTH = system_health-1.0-py3-none-any.whl
 $(SYSTEM_HEALTH)_SRC_PATH = $(SRC_PATH)/system-health
 $(SYSTEM_HEALTH)_PYTHON_VERSION = 3
 $(SYSTEM_HEALTH)_DEPENDS = $(SONIC_PY_COMMON_PY3) $(SONIC_CONFIG_ENGINE_PY3)
+$(SYSTEM_HEALTH)_DEBS_DEPENDS = $(LIBSWSSCOMMON) $(PYTHON3_SWSSCOMMON)
 SONIC_PYTHON_WHEELS += $(SYSTEM_HEALTH)

 export system_health_py3_wheel_path="$(addprefix $(PYTHON_WHEELS_PATH)/,$(SYSTEM_HEALTH))"
21 changes: 12 additions & 9 deletions src/system-health/health_checker/manager.py
@@ -1,3 +1,11 @@
+from . import utils
+from .config import Config
+from .health_checker import HealthChecker
+from .service_checker import ServiceChecker
+from .hardware_checker import HardwareChecker
+from .user_defined_checker import UserDefinedChecker
+
+
 class HealthCheckerManager(object):
     """
     Manage all system health checkers and system health configuration.
@@ -10,7 +18,6 @@ def __init__(self):
         self._checkers = []
         self._state = self.STATE_BOOTING

-        from .config import Config
         self.config = Config()
         self.initialize()

@@ -19,8 +26,6 @@ def initialize(self):
         Initialize the manager. Create service checker and hardware checker by default.
         :return:
         """
-        from .service_checker import ServiceChecker
-        from .hardware_checker import HardwareChecker
         self._checkers.append(ServiceChecker())
         self._checkers.append(HardwareChecker())

@@ -31,7 +36,6 @@ def check(self, chassis):
         :return: A tuple. The first element indicate the status of the checker; the second element is a dictionary that
                  contains the status for all objects that was checked.
         """
-        from .health_checker import HealthChecker
         HealthChecker.summary = HealthChecker.STATUS_OK
         stats = {}
         self.config.load_config()
@@ -45,7 +49,6 @@ def check(self, chassis):
             self._do_check(checker, stats)

         if self.config.user_defined_checkers:
-            from .user_defined_checker import UserDefinedChecker
             for udc in self.config.user_defined_checkers:
                 checker = UserDefinedChecker(udc)
                 self._do_check(checker, stats)
@@ -71,20 +74,20 @@ def _do_check(self, checker, stats):
             else:
                 stats[category].update(info)
         except Exception as e:
-            from .health_checker import HealthChecker
+            HealthChecker.summary = HealthChecker.STATUS_NOT_OK
             error_msg = 'Failed to perform health check for {} due to exception - {}'.format(checker, repr(e))
             entry = {str(checker): {
                 HealthChecker.INFO_FIELD_OBJECT_STATUS: HealthChecker.STATUS_NOT_OK,
-                HealthChecker.INFO_FIELD_OBJECT_MSG: error_msg
+                HealthChecker.INFO_FIELD_OBJECT_MSG: error_msg,
+                HealthChecker.INFO_FIELD_OBJECT_TYPE: "Internal"
             }}
             if 'Internal' not in stats:
                 stats['Internal'] = entry
             else:
                 stats['Internal'].update(entry)

     def _is_system_booting(self):
-        from .utils import get_uptime
-        uptime = get_uptime()
+        uptime = utils.get_uptime()
         if not self.boot_timeout:
             self.boot_timeout = self.config.get_bootup_timeout()
         booting = uptime < self.boot_timeout