
Add collector for PCIe devices with link information #3339


Open: naoki9911 wants to merge 2 commits into base: master

@naoki9911 naoki9911 commented May 29, 2025

The link status of PCIe devices sometimes changes, for example through link or speed downgrades, and devices can disappear entirely. This patch collects PCIe devices' link information to detect such failures.

As a first step, this collector exports the following for each PCIe device:

  • Device information (vendor_id, device_id, etc.)
  • Parent PCIe device (e.g. PCIe bridge, PCIe switch)
  • Link status (max_link_{speed|width}, current_link_{speed|width})

This depends on prometheus/procfs#728
https://groups.google.com/g/prometheus-developers/c/0GJTs2OjvCs

@y1r

y1r commented Jul 3, 2025

I am really interested in this feature to check PCIe health in our Kubernetes cluster.

@SuperQ Sorry to pull you in. Could you take a look at this PR? If you are not the right person to ask, could you point me to someone who is?

naoki9911 added 2 commits July 4, 2025 13:11
The link status of PCIe devices sometimes changes,
like link or speed downgrades, and devices disappear.
This patch collects PCIe devices' link information to detect such failures.

As a first step, this collector exports PCIe devices'
- Device information (vendor_id, device_id, etc.)
- Parent PCIe device (e.g. PCIe bridge, PCIe switch)
- Link status (max_link_{speed|width}, current_link_{speed|width})

Signed-off-by: Naoki MATSUMOTO <m.naoki9911@gmail.com>
@SuperQ
Member

SuperQ commented Jul 4, 2025

I've created a separate procfs library update PR, #3355, in order to also update the test fixtures.

@naoki9911
Author

Thank you for releasing procfs v0.17.0!
The e2e test passed in my Linux environment with your PR.

I also found that my PR lacks a fixture test.
I will add it in a follow-up commit.

@SuperQ
Member

SuperQ commented Jul 4, 2025

I would just wait till the other PR is merged and rebase here.
