Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added PCIe transaction check for all peripherals on the bus #331

Merged
merged 10 commits into from
Jul 12, 2023

Conversation

assrinivasan
Copy link
Contributor

@assrinivasan assrinivasan commented Jan 13, 2023

Description

MSFT ADO: 14671564

Added code to carry out PCI transactions with all the peripherals on the device. This is to determine if there are any unreachable devices on the PCIe bus. The gold list of PCI peripherals is a static file that varies depending on the platform. BDF information is parsed from this YAML file. Transactions are then carried out to determine the health of each peripheral.

Motivation and Context

Fixes: Generate syslog alert for missing PCIe devices

How Has This Been Tested?

Manually disabled a PCI peripheral and verified that the code changes were able to pick it up:

admin@str-s6000-on-6:~$ lspci -tv
-[0000:00]-+-00.0  Intel Corporation Atom Processor S1200 Internal
           +-01.0-[01]----00.0  Broadcom Inc. and subsidiaries Broadcom BCM56850 Switch ASIC
           +-02.0-[02]----00.0  Marvell Technology Group Ltd. Device 9170
           +-03.0-[03-07]----00.0-[04-07]--+-01.0-[05]--
           |                               +-02.0-[06]--
           |                               \-03.0-[07]--+-00.0  Pericom Semiconductor PI7C9X442SL USB OHCI Controller
           |                                            +-00.1  Pericom Semiconductor PI7C9X442SL USB OHCI Controller
           |                                            \-00.2  Pericom Semiconductor PI7C9X442SL USB EHCI Controller
           +-04.0-[08]----00.0  Intel Corporation 82574L Gigabit Network Connection
           +-0e.0  Intel Corporation Atom Processor S1200 Internal
           +-13.0  Intel Corporation Atom Processor S1200 SMBus 2.0 Controller 0
           +-13.1  Intel Corporation Atom Processor S1200 SMBus 2.0 Controller 1
           +-14.0  Intel Corporation Atom Processor S1200 UART
           \-1f.0  Intel Corporation Atom Processor S1200 Integrated Legacy Bus
admin@str-s6000-on-6:~$ 
admin@str-s6000-on-6:~$ 
admin@str-s6000-on-6:~$ setpci -s 0000:00:01.0 BRIDGE_CONTROL
0013
admin@str-s6000-on-6:~$ sudo setpci -s 0000:00:01.0 BRIDGE_CONTROL=0053
admin@str-s6000-on-6:~$ 
admin@str-s6000-on-6:~$ 
admin@str-s6000-on-6:~$ lspci -tv
-[0000:00]-+-00.0  Intel Corporation Atom Processor S1200 Internal
           +-01.0-[01]----00.0  Broadcom Inc. and subsidiaries Broadcom BCM56850 Switch ASIC (ACTUALLY MISSING)
           +-02.0-[02]----00.0  Marvell Technology Group Ltd. Device 9170
           +-03.0-[03-07]----00.0-[04-07]--+-01.0-[05]--
           |                               +-02.0-[06]--
           |                               \-03.0-[07]--+-00.0  Pericom Semiconductor PI7C9X442SL USB OHCI Controller
           |                                            +-00.1  Pericom Semiconductor PI7C9X442SL USB OHCI Controller
           |                                            \-00.2  Pericom Semiconductor PI7C9X442SL USB EHCI Controller
           +-04.0-[08]----00.0  Intel Corporation 82574L Gigabit Network Connection
           +-0e.0  Intel Corporation Atom Processor S1200 Internal
           +-13.0  Intel Corporation Atom Processor S1200 SMBus 2.0 Controller 0
           +-13.1  Intel Corporation Atom Processor S1200 SMBus 2.0 Controller 1
           +-14.0  Intel Corporation Atom Processor S1200 UART
           \-1f.0  Intel Corporation Atom Processor S1200 Integrated Legacy Bus
admin@str-s6000-on-6:~$ lspci -v -s 0000:01:00.0
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries Broadcom BCM56850 Switch ASIC (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: linux-kernel-bde
        Kernel modules: linux_kernel_bde

admin@str-s6000-on-6:~$ setpci -s 0000:01:00.0 0.l
ffffffff
admin@str-s6000-on-6:~$ 




pcied logs:

Mar  8 03:34:10.205062 str-s6000-acs-12 INFO pmon#supervisord 2023-03-08 03:34:10,202 INFO spawned: 'pcied' with pid 921
Mar  8 03:34:11.363019 str-s6000-acs-12 NOTICE pmon#pcied[921]: Failed to load platform Pcie module. Error : No module named 'sonic_platform.pcie', Fallback to default module
Mar  8 03:34:20.238683 str-s6000-acs-12 INFO pmon#supervisord 2023-03-08 03:34:20,236 INFO success: pcied entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
Mar  8 03:35:11.650058 str-s6000-acs-12 ERR pmon#pcied[921]: PCIe device 01:00.0 missing.
Mar  8 03:36:12.067868 str-s6000-acs-12 ERR pmon#pcied[921]: PCIe device 01:00.0 missing.
Mar  8 03:37:12.568690 str-s6000-acs-12 ERR pmon#pcied[921]: PCIe device 01:00.0 missing.

Additional Information (Optional)

This is to determine if there are any unreachable devices on the PCIe bus.
The gold list of PCI peripherals is a static file that varies depending on the platform. BDF information is parsed from this YAML file. Transcations are then carried out to determine the health of each peripheral.
Modified cmd to be executed to setpci (does pci transaction)
Added exception handling
@prgeor
Copy link
Collaborator

prgeor commented Jan 13, 2023

@assrinivasan can you remove the ADO link. its not accessible for public

@assrinivasan
Copy link
Contributor Author

@assrinivasan can you remove the ADO link. its not accessible for public

My apologies -- done.

Added specific check for ASIC missing to differentiate from device mismatch
Added yaml support
@assrinivasan
Copy link
Contributor Author

@prgeor added testing details, please take a look when convenient. Thanks.

prgeor
prgeor previously approved these changes Mar 17, 2023
@prgeor
Copy link
Collaborator

prgeor commented Mar 17, 2023

@assrinivasan code coverage is not met

sonic-pcied/scripts/pcied Outdated Show resolved Hide resolved
prgeor
prgeor previously approved these changes Mar 20, 2023
@prgeor
Copy link
Collaborator

prgeor commented Mar 20, 2023

@assrinivasan please fix the test failure

@assrinivasan
Copy link
Contributor Author

@prgeor added unit tests

@assrinivasan assrinivasan merged commit d73808c into sonic-net:master Jul 12, 2023
4 checks passed
yxieca pushed a commit that referenced this pull request Jul 13, 2023
* Added PCIe transaction check for all peripherals on the bus

This is to determine if there are any unreachable devices on the PCIe bus.
The gold list of PCI peripherals is a static file that varies depending on the platform. BDF information is parsed from this YAML file. Transcations are then carried out to determine the health of each peripheral.

* Modified code to read pcie devices from the pre-configured YAML file
Modified cmd to be executed to setpci (does pci transaction)
Added exception handling

* Formalized syslog error messages
Added specific check for ASIC missing to differentiate from device mismatch
Added yaml support

* Made changes to the way setpci command is called per SA review comments

* Added comments per prgeor review comments

* Added relevant comments per prgeor review comments

* Fixed a bug that was causing UnboundLocalError.

* Added unit tests for the check_pcie_devices method
prgeor added a commit that referenced this pull request Jul 14, 2023
yxieca added a commit that referenced this pull request Jul 19, 2023
@yxieca
Copy link

yxieca commented Jul 19, 2023

This is a feature. Should not be back ported into feature branches.

assrinivasan added a commit to assrinivasan/sonic-platform-daemons that referenced this pull request Aug 8, 2023
prgeor pushed a commit that referenced this pull request Aug 8, 2023
* Revert "Fixes for the issues uncovered by sonic-pcied unit tests (#389)"

This reverts commit 76baca3.

* Revert "Added PCIe transaction check for all peripherals on the bus (#331)"

This reverts commit d73808c.

* Fixes for the issues uncovered by sonic-pcied unit tests (#389)

* Fixes for the following issues:

	- Lack of getKeys() impl in mock swsscommon table class in sonic-pcied
	- Fixed a 'set' bug in pcied that was uncovered by new code flows

* Removed mocked table instances per prgeor review comments
tshalvi pushed a commit to tshalvi/sonic-platform-daemons that referenced this pull request Sep 11, 2023
* Revert "Fixes for the issues uncovered by sonic-pcied unit tests (sonic-net#389)"

This reverts commit 76baca3.

* Revert "Added PCIe transaction check for all peripherals on the bus (sonic-net#331)"

This reverts commit d73808c.

* Fixes for the issues uncovered by sonic-pcied unit tests (sonic-net#389)

* Fixes for the following issues:

	- Lack of getKeys() impl in mock swsscommon table class in sonic-pcied
	- Fixed a 'set' bug in pcied that was uncovered by new code flows

* Removed mocked table instances per prgeor review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants