system-health service_checker should check containers based on asic presence #13497
@abdosi Please assign folks for review.
@prgeor can you please review/approve this?
@spilkey-cisco have you tested the scenario where a fabric card, once inserted, still has hardware issues that prevent its ASIC from being detected? Or is this not a valid scenario?
That scenario should be covered by the checks in pcied, which validate that the expected devices exist.
@lguohan None of us has permission to merge. Please help review and, if approved, merge the PR.
Cherry-pick PR to 202211: #13857
@yxieca can you help cherry-pick this to the 202205 branch?
@spilkey-cisco PR conflicts with the 202205 branch.
@spilkey-cisco please submit a separate PR for this in the 202205 branch, as it could not be cherry-picked cleanly.
Created #13966
…rmat (#7649) What is the motivation for this PR? test_service_checker_with_process_exit failed due to this change: sonic-net/sonic-buildimage#13497 How did you do it? Modify the format of the log being verified. How did you verify/test it? Run test Signed-off-by: Yaqiang Zhu <yaqiangzhu@microsoft.com>
@spilkey-cisco this PR broke the test case test_service_checker_with_process_exit because of the log format change. Please run the sonic-mgmt test suite whenever printed log messages change, since that can break test cases. +@rlhui for visibility.
I was not aware of this requirement or impact; I will keep it in mind in the future. Perhaps this PR should not have been approved or merged without requesting these test results. Are there any contributor guidelines you can point me to? I'd like to be sure I meet all these requirements before raising PRs or asking for approval in the future.
@StormLiangMS Thanks for catching this. Can you confirm whether this is also causing a "syslog" format change? If so, we will need to revert the portion of code that made the log format change.
The log format change was made because the same process name can exist in multiple containers, especially on multi-asic systems. It can certainly be reverted, but the system-health errors then become ambiguous. Please let me know if it needs to be reverted, and whether you need me to open a PR for that or you will handle the change.
@spilkey-cisco I think the container name is already included, so why do we need the latter one? It could cause differences between releases; please revert this.
@yaqiangz when this one is reverted, I think we should change the testcases back too?
I thought it changed the syslog, but from what Yaqiang found, it is a DB info change. @gechiang
… new format (#7681) What is the motivation for this PR? The test failure introduced by sonic-net/sonic-buildimage#13497 has been fixed by #7649, but the test will still fail on a DUT with an old image. How did you do it? Update check_system_health_info to support both formats. How did you verify/test it? Run test Signed-off-by: Yaqiang Zhu <yaqiangzhu@microsoft.com>
The container name was not displayed in the system-health CLI, only the final string argument
With the log change, the CLI tells us which container they belong to:
Why I did it
On a supervisor card in a chassis, dockers such as syncd, teamd, swss, and lldp are created for each Switch Fabric card. However, not every chassis has all of its Switch Fabric cards present; in that case, dockers are created only for the Switch Fabric cards that are present.
system-health reports errors in this scenario because it expects dockers for all Switch Fabric cards (based on NUM_ASIC defined in the asic.conf file).
The system-health process error messages were also altered to indicate which container had the issue. Multiple containers may run processes with the same name, which can produce identical, and therefore ambiguous, system-health error messages.
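The ambiguity described above can be illustrated with a minimal sketch. The function name `format_process_error`, the message strings, and the container and process names below are hypothetical and chosen only for illustration; they are not taken from the system-health source.

```python
def format_process_error(process_name, container_name=None):
    """Hypothetical formatter; the message wording is an assumption."""
    if container_name is None:
        # Message without container context: the same text is produced
        # for every container that runs a process with this name.
        return "'{}' is not running".format(process_name)
    # Message with container context: unique per (container, process) pair.
    return "Process '{}' in container '{}' is not running".format(
        process_name, container_name)


# Two per-ASIC containers running a process with the same name
# (illustrative names only).
failures = [("swss0", "orchagent"), ("swss1", "orchagent")]

ambiguous = [format_process_error(proc) for _, proc in failures]
distinct = [format_process_error(proc, ctr) for ctr, proc in failures]

for message in distinct:
    print(message)
```

Without the container name both failures collapse into one indistinguishable message; with it, each failure can be traced to its container.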
How I did it
Port container_checker logic from #11442 into service_checker for system-health.
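As a rough sketch of the idea (not the ported code itself), the expected container set can be built only from ASICs that are actually present, instead of from every index below NUM_ASIC. The helper names, the `<feature><asic_id>` container naming, and the way the set of present ASICs is supplied are all assumptions made for this example.

```python
def expected_containers(global_features, per_asic_features,
                        num_asics, present_asics):
    """Return container names expected to be running.

    Hypothetical helper: per-ASIC containers are assumed to be named
    '<feature><asic_id>', and 'present_asics' is assumed to be supplied
    by the platform (how presence is detected is not shown here).
    """
    expected = set(global_features)
    for asic_id in range(num_asics):
        if asic_id not in present_asics:
            # Fabric card absent: no dockers are created for this ASIC,
            # so none should be expected.
            continue
        for feature in per_asic_features:
            expected.add("{}{}".format(feature, asic_id))
    return expected


def missing_containers(expected, running):
    """Containers that are expected but not running, sorted for stable output."""
    return sorted(set(expected) - set(running))


# Example: NUM_ASIC is 4, but only the cards carrying ASIC 0 and 2 are inserted.
running = ["database", "pmon", "syncd0", "swss0", "syncd2", "swss2"]

naive = expected_containers(["database", "pmon"], ["syncd", "swss"],
                            num_asics=4, present_asics={0, 1, 2, 3})
presence_aware = expected_containers(["database", "pmon"], ["syncd", "swss"],
                                     num_asics=4, present_asics={0, 2})

print(missing_containers(naive, running))
print(missing_containers(presence_aware, running))
```

In this example the NUM_ASIC-only expectation flags the dockers for ASICs 1 and 3 as missing, while the presence-aware expectation reports nothing missing.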
How to verify it
Bring up a Supervisor card with one or more fabric cards missing. Execute 'show system-health summary'. The command should not report failures due to missing dockers for the ASICs on fabric cards that are not present.