Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

system-health service_checker should check containers based on asic presence #13497

Merged
merged 1 commit into from
Feb 11, 2023

Conversation

spilkey-cisco
Copy link
Contributor

Why I did it

On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.

system-health indicates errors in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).

system-health process error messages were also altered to indicate which container had the issue; multiple containers may run processes with the same name, which can result in identical system-health error messages, causing ambiguity.

How I did it

Port container_checker logic from #11442 into service_checker for system-health.

How to verify it

Bringup Supervisor card with one or more missing fabric cards. Execute 'show system-health summary'. The command should not report failure due to missing dockers for the asics on the fabric cards which are not present.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Description for the changelog

Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jan 24, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@bmridul
Copy link
Contributor

bmridul commented Feb 1, 2023

@abdosi Pls assign folks for review.

@shivuv
Copy link

shivuv commented Feb 8, 2023

@abdosi Pls assign folks for review.

@abdosi @gechiang : Can you have someone review this?

@abdosi
Copy link
Contributor

abdosi commented Feb 8, 2023

@prgeor can you please review/approve this.

@prgeor
Copy link
Contributor

prgeor commented Feb 9, 2023

@spilkey-cisco have you tested the scenario where if the fabric card once inserted, can still have hardware issues causing ASIC from detecting? or this is not a valid scenario?

@spilkey-cisco
Copy link
Contributor Author

@spilkey-cisco have you tested the scenario where if the fabric card once inserted, can still have hardware issues causing ASIC from detecting? or this is not a valid scenario?

That scenario should be covered by the checks in pcied, validating that expected devices exist.

@shivuv
Copy link

shivuv commented Feb 10, 2023

@abdosi @gechiang @prgeor : If you are ok with this change, please approve and merge it. Also requesting to cherry pick this into 202205 branch.

@gechiang
Copy link
Collaborator

@lguohan None of us has permission to merge. Please help review and if approved, merge the PR.
Thanks!

@rlhui rlhui merged commit ad679a0 into sonic-net:master Feb 11, 2023
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Feb 17, 2023
…onic-net#13497)

Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.

system-health indicates errors in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).

system-health process error messages were also altered to indicate which container had the issue; multiple containers may run processes with the same name, which can result in identical system-health error messages, causing ambiguity.

How I did it
Port container_checker logic from sonic-net#11442 into service_checker for system-health.

How to verify it
Bringup Supervisor card with one or more missing fabric cards. Execute 'show system-health summary'. The command should not report failure due to missing dockers for the asics on the fabric cards which are not present.
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202211: #13857

@gechiang
Copy link
Collaborator

@yxieca can you help cherry-pick this to 202205 branch?
Thanks!

mssonicbld pushed a commit that referenced this pull request Feb 17, 2023
…13497)

Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.

system-health indicates errors in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).

system-health process error messages were also altered to indicate which container had the issue; multiple containers may run processes with the same name, which can result in identical system-health error messages, causing ambiguity.

How I did it
Port container_checker logic from #11442 into service_checker for system-health.

How to verify it
Bringup Supervisor card with one or more missing fabric cards. Execute 'show system-health summary'. The command should not report failure due to missing dockers for the asics on the fabric cards which are not present.
@mssonicbld
Copy link
Collaborator

@spilkey-cisco PR conflicts with 202205 branch

@gechiang
Copy link
Collaborator

@spilkey-cisco please submit a separate PR for this in 202205 branch as it could not be cherry-picked cleanly...

spilkey-cisco added a commit to spilkey-cisco/sonic-buildimage that referenced this pull request Feb 23, 2023
…onic-net#13497)

Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.

system-health indicates errors in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).

system-health process error messages were also altered to indicate which container had the issue; multiple containers may run processes with the same name, which can result in identical system-health error messages, causing ambiguity.

How I did it
Port container_checker logic from sonic-net#11442 into service_checker for system-health.

How to verify it
Bringup Supervisor card with one or more missing fabric cards. Execute 'show system-health summary'. The command should not report failure due to missing dockers for the asics on the fabric cards which are not present.
@spilkey-cisco
Copy link
Contributor Author

@spilkey-cisco please submit a separate PR for this in 202205 branch as it could not be cherry-picked cleanly...

Created #13966

yxieca pushed a commit that referenced this pull request Feb 28, 2023
…13497) (#13966)

Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.

system-health indicates errors in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).

system-health process error messages were also altered to indicate which container had the issue; multiple containers may run processes with the same name, which can result in identical system-health error messages, causing ambiguity.

How I did it
Port container_checker logic from #11442 into service_checker for system-health.

How to verify it
Bringup Supervisor card with one or more missing fabric cards. Execute 'show system-health summary'. The command should not report failure due to missing dockers for the asics on the fabric cards which are not present.
yaqiangz added a commit to sonic-net/sonic-mgmt that referenced this pull request Mar 6, 2023
…rmat (#7649)

What is the motivation for this PR?
test_service_checker_with_process_exit failed due to this change: sonic-net/sonic-buildimage#13497

How did you do it?
Modify format of verfiying log.

How did you verify/test it?
Run test

Signed-off-by: Yaqiang Zhu <yaqiangzhu@microsoft.com>
wangxin pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Mar 7, 2023
…rmat (#7649)

What is the motivation for this PR?
test_service_checker_with_process_exit failed due to this change: sonic-net/sonic-buildimage#13497

How did you do it?
Modify format of verfiying log.

How did you verify/test it?
Run test

Signed-off-by: Yaqiang Zhu <yaqiangzhu@microsoft.com>
@StormLiangMS
Copy link
Contributor

@spilkey-cisco this PR break the test case test_service_checker_with_process_exit, because the log format change, pls run sonic-mgmt test suite if any print log changes, that could break test cases.

+@rlhui for vis

@spilkey-cisco
Copy link
Contributor Author

@spilkey-cisco this PR break the test case test_service_checker_with_process_exit, because the log format change, pls run sonic-mgmt test suite if any print log changes, that could break test cases.

+@rlhui for vis

I was not aware of this requirement or impact, I will keep it in mind in the future. Perhaps this PR should not have been approved or merged without requesting these test results. Are there any contributor guidelines you can point me to? I'd like to be sure I meet all these requirements before raising PRs or asking for approval in the future.

@gechiang
Copy link
Collaborator

gechiang commented Mar 8, 2023

@StormLiangMS Thanks for catching this. Can you confirm if this is also causing "syslog" format change as well? If so, we will need to revert the portion of code that made the log format change.

@spilkey-cisco
Copy link
Contributor Author

@StormLiangMS Thanks for catching this. Can you confirm if this is also causing "syslog" format change as well? If so, we will need to revert the portion of code that made the log format change.

The log format change was done because the same process name can exist in multiple containers, especially for multi-asic. It certainly can be reverted, but the system-health errors become ambiguous. Please let me know if it needs to be reverted and if you need me to open a PR for that, or if you will handle the change.

@StormLiangMS
Copy link
Contributor

@StormLiangMS Thanks for catching this. Can you confirm if this is also causing "syslog" format change as well? If so, we will need to revert the portion of code that made the log format change.

The log format change was done because the same process name can exist in multiple containers, especially for multi-asic. It certainly can be reverted, but the system-health errors become ambiguous. Please let me know if it needs to be reverted and if you need me to open a PR for that, or if you will handle the change.

@spilkey-cisco I think the container name is already in, why we need the later one? It could cause the difference among different releases, pls revert this back.
self.set_object_not_ok('Process', '{}:{}'.format(container_name, process_name), "Process '{}' in container '{}' is not running".format(process_name, container_name))

@StormLiangMS
Copy link
Contributor

StormLiangMS commented Mar 8, 2023

@yaqiangz when this one is reverted, I think we should change the testcases back too?

@StormLiangMS Thanks for catching this. Can you confirm if this is also causing "syslog" format change as well? If so, we will need to revert the portion of code that made the log format change.

I thought it changes the syslog, but from what Yaqiang found, it is a DB info change. @gechiang

yaqiangz added a commit to sonic-net/sonic-mgmt that referenced this pull request Mar 8, 2023
… new format (#7681)

What is the motivation for this PR?
Test failed introduced by this PR sonic-net/sonic-buildimage#13497 has been fixed by this PR #7649. But in DUT with old image, it will failed.

How did you do it?
Add support for check_system_health_info to support both two formats

How did you verify/test it?
Run test

Signed-off-by: Yaqiang Zhu <yaqiangzhu@microsoft.com>
@spilkey-cisco
Copy link
Contributor Author

spilkey-cisco commented Mar 8, 2023

@spilkey-cisco I think the container name is already in, why we need the later one? It could cause the difference among different releases, pls revert this back. self.set_object_not_ok('Process', '{}:{}'.format(container_name, process_name), "Process '{}' in container '{}' is not running".format(process_name, container_name))

The container name was not displayed in the system-health CLI, only the final string argument '<process>' is not running. On multi-asic systems especially where duplicates of containers exist, the CLI (ie. show system-health summary) does not display the container name:

	     'syncd' is not running
	     'teammgrd' is not running
	     'syncd' is not running
	     'syncd' is not running
	     'vxlanmgrd' is not running
	     'vlanmgrd' is not running
	     'orchagent' is not running
	     'teammgrd' is not running
	     'vxlanmgrd' is not running
	     'vlanmgrd' is not running
	     'orchagent' is not running
	     'vxlanmgrd' is not running
	     'vlanmgrd' is not running
	     'orchagent' is not running
	     'syncd' is not running
	     'vxlanmgrd' is not running
	     'vlanmgrd' is not running
	     'orchagent' is not running
	     'teammgrd' is not running
	     'teammgrd' is not running

With the log change, the CLI tells us which container they belong to:

	     Process 'syncd' in container 'syncd0' is not running
	     Process 'syncd' in container 'syncd12' is not running
	     Process 'teammgrd' in container 'teamd0' is not running
	     Process 'teammgrd' in container 'teamd1' is not running
	     Process 'syncd' in container 'syncd1' is not running
	     Process 'syncd' in container 'syncd13' is not running
	     Process 'teamsyncd' in container 'teamd12' is not running
	     Process 'teammgrd' in container 'teamd12' is not running
	     Process 'teammgrd' in container 'teamd13' is not running
	     Process 'vxlanmgrd' in container 'swss12' is not running
	     Process 'vlanmgrd' in container 'swss12' is not running
	     Process 'vxlanmgrd' in container 'swss1' is not running
	     Process 'vlanmgrd' in container 'swss1' is not running
	     Process 'vxlanmgrd' in container 'swss0' is not running
	     Process 'vlanmgrd' in container 'swss0' is not running
	     Process 'vxlanmgrd' in container 'swss13' is not running
	     Process 'vlanmgrd' in container 'swss13' is not running

wangxin pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Mar 9, 2023
… new format (#7681)

What is the motivation for this PR?
Test failed introduced by this PR sonic-net/sonic-buildimage#13497 has been fixed by this PR #7649. But in DUT with old image, it will failed.

How did you do it?
Add support for check_system_health_info to support both two formats

How did you verify/test it?
Run test

Signed-off-by: Yaqiang Zhu <yaqiangzhu@microsoft.com>
StormLiangMS pushed a commit to StormLiangMS/sonic-buildimage that referenced this pull request Mar 28, 2023
Related work items: sonic-net#276, sonic-net#305, sonic-net#332, sonic-net#338, sonic-net#339, sonic-net#1188, sonic-net#1192, sonic-net#1197, sonic-net#1206, sonic-net#1685, sonic-net#1690, sonic-net#1696, sonic-net#1699, sonic-net#1709, sonic-net#1727, sonic-net#1737, sonic-net#1741, sonic-net#1742, sonic-net#2511, sonic-net#2512, sonic-net#2532, sonic-net#2559, sonic-net#2626, sonic-net#2638, sonic-net#2645, sonic-net#2649, sonic-net#2660, sonic-net#2669, sonic-net#2670, sonic-net#2678, sonic-net#10084, sonic-net#11442, sonic-net#11873, sonic-net#12047, sonic-net#12110, sonic-net#12207, sonic-net#12529, sonic-net#12678, sonic-net#13235, sonic-net#13287, sonic-net#13372, sonic-net#13395, sonic-net#13456, sonic-net#13497, sonic-net#13522, sonic-net#13545, sonic-net#13547, sonic-net#13552, sonic-net#13569, sonic-net#13572, sonic-net#13578, sonic-net#13591, sonic-net#13611, sonic-net#13647, sonic-net#13649, sonic-net#13660, sonic-net#13710, sonic-net#13716, sonic-net#13724, sonic-net#13726, sonic-net#13732, sonic-net#13735, sonic-net#13739, sonic-net#13757, sonic-net#13786, sonic-net#13792, sonic-net#13800, sonic-net#13801, sonic-net#13802, sonic-net#13805, sonic-net#13806, sonic-net#13812, sonic-net#13814, sonic-net#13822, sonic-net#13831, sonic-net#13834, sonic-net#13847, sonic-net#13870, sonic-net#13882, sonic-net#13884, sonic-net#13885, sonic-net#13894, sonic-net#13895, sonic-net#13926, sonic-net#13932, sonic-net#13935, sonic-net#13942, sonic-net#13951, sonic-net#13953, sonic-net#13964
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

10 participants