SONiC FM (Fault Manager) HLD #1527

Open: wants to merge 10 commits into base: master


@shyam77git shyam77git commented Nov 29, 2023

Code PRs (corresponding to this FM HLD PR):
Fault Manager daemon (faultmgrd): sonic-platform-daemons: sonic-net/sonic-platform-daemons#421
Reboot: sonic-utilities repo: sonic-net/sonic-utilities#3154

Basic Information (context)
Any failure (or error) impacting a system/chassis or a sub-system is regarded as a fault.
Faults are broadly classified into SW (Software) and HW (Hardware) faults:

  • SW faults are the ones that can occur during SW processing of a workflow at the process/sub-system or system level
  • HW faults are those that can occur during SW or HW processing of a workflow at the HW (board) level - e.g. HW component/device etc.

They may occur at any of the following stages of the system's operation:

  • system configuration, bring-up
  • feature enablement/configuration
  • during steady state
  • feature disablement/unconfiguration
  • while going down (config reload, reboot etc.)

Present State
In SONiC, a fault is represented via an Event or an Alarm.
SONiC has an Event Framework HLD, which lets an event detector publish its events to the eventD redisDB.
However, there is no Fault Manager/Handler that can take the needed, platform-specified action(s) to recover the system from the generated fault.

Need for this feature
This feature aims to add a generic FM (Fault Management) infrastructure that can do the following:

  • Abstract the platform/HWSKU nuances from an open source NOS (i.e. SONiC) by publishing a platform-specific 'Fault-Action Policy table'
  • Fetch these events (alarms/faults) from the eventD (based on published YANG/schema)
  • Analyze them (in a generic way) against the above-mentioned Policy Table
  • Take action based on the lookup/match in the Policy Table
    The action could be either generic or platform-specific
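
The fetch/analyze/act steps above can be sketched roughly as follows. This is only an illustration, not the faultmgrd implementation: the entries reuse the `type`/`severity`/`action` shape of the policy samples in the HLD, while `POLICY`, `DEFAULT_ACTIONS`, `lookup_actions`, and the choice to fall back to `["syslog"]` when nothing matches are hypothetical and made up for this example.

```python
# Hypothetical in-memory Fault-Action Policy Table. The entries mirror the
# type/severity/action shape of the samples in the HLD; the structure as a
# flat Python list is an assumption for illustration only.
POLICY = [
    {"type": "TEMPERATURE_EXCEEDED", "severity": "CRITICAL",
     "action": ["syslog", "obfl", "reload"]},
    {"type": "CUSTOM_EVPROFILE_CHANGE", "severity": "MAJOR",
     "action": ["syslog"]},
]

# Assumption: when no policy entry matches, only log the fault.
DEFAULT_ACTIONS = ["syslog"]


def lookup_actions(event, policy=POLICY):
    """Return the action list for an event, matched on type and severity."""
    for entry in policy:
        if (entry["type"] == event.get("type")
                and entry["severity"] == event.get("severity")):
            return list(entry["action"])
    return list(DEFAULT_ACTIONS)


if __name__ == "__main__":
    fault = {"type": "TEMPERATURE_EXCEEDED", "severity": "CRITICAL"}
    for action in lookup_actions(fault):
        # A real handler would dispatch to generic or platform-specific code.
        print("would run action:", action)
```

Keeping the lookup a pure function of the event and the table makes it easy to test separately from whatever actually carries out a reload or shutdown.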

Benefits
The platform-supplied 'Fault-Action Policy table' has a holistic, system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover from the fault. It can either go with the recommended action (provided by the fault source/detector) or override it with a system-level one.
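
One way to picture the recommended-versus-override choice is the small sketch below. It assumes, purely for illustration, that an event carries a `recommended_action` list from its detector and that the matching policy entry may be absent; neither that field name nor `resolve_action` comes from the HLD.

```python
def resolve_action(event, policy_entry=None):
    """Pick the platform's action if the policy defines one, else the detector's.

    Both the `recommended_action` event field and this function are
    hypothetical; they only illustrate the override idea.
    """
    if policy_entry and policy_entry.get("action"):
        # The platform's system-level view overrides the detector's suggestion.
        return list(policy_entry["action"])
    # No platform override: go with what the fault source recommended.
    return list(event.get("recommended_action", []))


if __name__ == "__main__":
    event = {"type": "FANS MISSING", "severity": "CRITICAL",
             "recommended_action": ["syslog"]}
    entry = {"type": "FANS MISSING", "severity": "CRITICAL",
             "action": ["syslog", "obfl", "shutdown"]}
    print(resolve_action(event, entry))
    print(resolve_action(event))
```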

Mistakenly added right under SONiC/ instead of SONiC/doc
Generic Fault Management Infra document
Enhanced the HLD with following:
Updated workflows
Added Fault-Action Policy Table sample
Added a section describing all the steps in the block diagram.
Added FM use-cases table.
Added Revision as 1.0 (as this revision is an Initial Draft for External review)
Updated the Revision number for Initial Draft (for review)
@shyam77git shyam77git marked this pull request as ready for review December 4, 2023 20:50
Added Fault's end-to-end WorkFlow sequence section.

shiraez commented Jan 24, 2024

Perhaps you can add special handling to avoid endless reboots and shutdowns.
Is there an option to temporarily disable this feature, or to disable a specific action?

{
"type" : "TEMPERATURE_EXCEEDED",
"severity" : "CRITICAL",
"action" : ["syslog", "obfl", "reload"]
Review comment (Contributor):

I am not sure obfl is supported by all vendors, so you might want to rename it to a generic term such as "platform-log". The same comment applies to other places in the doc where obfl is mentioned.

Review comment (Collaborator):

Are we planning to store faults in a separate table with action performed on them? This will be helpful to know the faults over time in the switch.

- action may range from logging (disk, OBFL flash etc.) to reload/shutdown etc.
- Taking action would either be by itself (i.e. in its own micro-service) or by delegating it to the action's owner
7. Tabulate event entry (along with action taken) for book-keeping purposes

Review comment (Contributor):

Would be useful to show the Alarm/fault entry schema as represented in EventDB.

# Fault's End-to-End WorkFlow Sequence
The following workflow depicts the end-to-end fault (event) flow from fault generation to fault handling
![Fault Management (FM) Workflow sequence](https://github.com/shyam77git/SONiC/assets/69485234/2b453a1b-6e14-48c6-bf61-ab978e62a3bf)

Review comment (Contributor):

Would be useful to mention some examples of processes/daemons which act as FDR.


{

"chassis": {
Review comment (Collaborator):

Please add a config-db schema for the faults' action configuration, along with a SONiC YANG model.

Review comment (Collaborator):

We need to consider the VS platform as well. Maybe, by default, no generic "fault_action_policy.json" is populated; platform files can provide the default actions, and the user can override them if required.

Review comment (Collaborator):

Are you planning to have a fault-manager enable/disable config knob as well? A global config knob would be useful to disable all actions.

- https://github.com/sonic-net/sonic-buildimage/tree/master/src/sonic-yang-models/yang-models
- sonic-events-swss.yang
- sonic-events-host.yang
- sonic-events-bgp.yang etc.
Review comment (Collaborator):

Are we planning to take any actions for the events (legacy ones) via fault manager?

3) Analyze them (in a generic way) against the above-mentioned Policy Table
4) Take action based on the lookup/match in Policy Table
5) Action could either be generic or platform specific

Review comment (Collaborator):

Please add an "out of scope" section covering controller-driven fault manager, FM in chassis, etc.

{
"type": "FANS MISSING",
"severity": "CRITICAL",
"action" : ["syslog", "obfl", "shutdown"]
Review comment (Collaborator):

Can this action also be to collect "tech-support" or to execute some script? E.g., in case of a critical event, the user may want to log all the states for later analysis.

{

"chassis": {
"name": "PID or HWSKU",
Review comment (Collaborator):

Why do you need this PID or HWSKU? A PID may change dynamically; do you want to provide the config knob at process-level granularity?


"type" : "CUSTOM_EVPROFILE_CHANGE",
"severity" : "MAJOR",
"action" : ["syslog"]
Review comment (Collaborator):

syslog is the default action, correct? Do we need it as part of the fault-manager action?

1. Formulate platform/HWSKU specific Fault-Action Policy Table (json or yaml file)
- There would be a generic (default) table if none is provided by the platform
- A platform-supplied file would override the default one
2. Introduce a new micro-service (fault_manager) at host (Linux Kernel)
Review comment (Contributor):

What is the plan? Is the fault manager a dedicated docker container or service, or is it colocated with the EventD docker container?

5. Analyze them against Fault-Action Policy Table (file)
- Take fault_type and fault_severity as input from the fetched event and perform lookup
on these fields in Fault-Action Policy Table to determine the action(s) needed
6. Handle the fault (i.e. take action) based on action(s) specified in Fault-Action Policy Table
Review comment (Contributor):

Can external controllers override the fault manager policies/actions?

@zhangyanzhao (Collaborator):

The 202405 release fork date is coming. Can you please accelerate the code PR review and merge the PR by the end of 5/30? Thanks.

@liat-grozovik (Collaborator):

@shyam77git can you please update the PR description with the list of the code PRs?
Also, in order to approve the code PRs we need a test plan review in the SONiC test group. Was that done? Can you share the test plan PR as well?

@zhangyanzhao (Collaborator):

@liat-grozovik will help follow up with the reviewers; if there is no update, we will defer it to a future release.

@zhangyanzhao (Collaborator):

The HLD is not approved; moving it to the backlog.

Labels: None yet
Project status: MovedToBacklog
Linked issues: None yet
7 participants