merge master into this branch by vince-weka · Pull Request #168 · weka/wekachecker

vince-weka · 2025-04-24T15:14:05Z

No description provided.

Raise a warning if the current NFSW FD usage is >=90% of the configured maximum.
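As a rough illustration of the threshold described above, the comparison can be sketched in Python. This is not the checker's actual code: the function name and parameters are hypothetical, and how the open and maximum FD counts are obtained is not stated in this PR.

```python
def nfsw_fd_usage_warning(open_fds: int, max_fds: int, threshold_pct: int = 90) -> bool:
    """Return True when open_fds is at or above threshold_pct percent of max_fds.

    Hypothetical helper: the names and the way the counts are gathered are
    assumptions, only the >=90% rule comes from the commit message.
    """
    if max_fds <= 0:
        # A non-positive limit makes the percentage meaningless.
        raise ValueError("max_fds must be a positive integer")
    # Integer arithmetic avoids floating-point rounding right at the boundary.
    return open_fds * 100 >= threshold_pct * max_fds


if __name__ == "__main__":
    if nfsw_fd_usage_warning(open_fds=950, max_fds=1000):
        print("WARN: NFSW FD usage is >=90% of the configured maximum")
```

Using an inclusive comparison matches the ">=90%" wording, so a process sitting exactly at 90% of its limit already triggers the warning.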

Changes for Weka4 - OS, IOMMU and NTP & OFED

Added parallel-compare scripts

merge-conflicts

- Provide IPv6 support in FIPs sanity, mgmt IP, netmask and SBR checks
- Remove jq dependency in NATS check

IPv6 updates and code cleanup

Query NFSW FD usage

jackchallen · 2025-05-07T12:52:13Z

@vince-weka I don't think you need us to do anything here, right? Looks like you're rebasing your branch to include changes from master.

vince-weka · 2025-05-07T14:29:44Z

Exactly.


Clusters with 0 hot spares can potentially be configured to allocate and use ~100% of SSD capacity. If they lose a failure domain, they lose a proportion of FS space (known as shrinkage). The proportion lost depends on the number of data disks, and if "too much" is lost then writes can fail with ENOSPC. Essentially, we should warn customers who have no hot spares configured.

Add a very basic capacity check in the case of 0 hot-spare clusters

In https://wekaio.atlassian.net/browse/WEKAPP-482528 we saw that the link speed was lower than expected, but there were no warnings. We should check that. The only plausible way I can find of doing this is by parsing the text-based output of ethtool. Until jq and "ethtool --json" are available everywhere, or the kernel interface to ethtool-netlink is exposed in /sys, I can't see any other way of doing it. :(
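To make the text-parsing approach concrete, here is a minimal Python sketch, not the checker's actual implementation. It assumes the usual `ethtool <iface>` layout, with a `Speed: <N>Mb/s` line and a `Supported link modes:` field that continues over indented lines. The function names and the sample text are illustrative only; the sample was not captured from a real host.

```python
import re

# Illustrative sample in the usual `ethtool <iface>` text layout (not real output).
SAMPLE = """Settings for eth0:
    Supported ports: [ FIBRE ]
    Supported link modes:   25000baseCR/Full
                            100000baseCR4/Full
    Supported pause frame use: Symmetric
    Speed: 25000Mb/s
    Duplex: Full
"""


def parse_ethtool_speeds(text):
    """Return (current_mbps, max_supported_mbps); either may be None if absent."""
    match = re.search(r"^\s*Speed:\s*(\d+)Mb/s", text, re.MULTILINE)
    current = int(match.group(1)) if match else None  # e.g. "Speed: Unknown!" -> None

    supported = []
    in_supported = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Supported link modes:"):
            in_supported = True
            stripped = stripped.split(":", 1)[1]
        elif in_supported and ":" in stripped:
            # The next "Key: value" field ends the continuation lines.
            in_supported = False
        if in_supported:
            # Modes look like "100000baseCR4/Full"; the leading number is Mb/s.
            supported += [int(s) for s in re.findall(r"(\d+)base", stripped)]

    return current, (max(supported) if supported else None)


def link_speed_below_max(text):
    """True only when both speeds are known and current < maximum supported."""
    current, maximum = parse_ethtool_speeds(text)
    return current is not None and maximum is not None and current < maximum


if __name__ == "__main__":
    print(parse_ethtool_speeds(SAMPLE), link_speed_below_max(SAMPLE))
```

The sketch deliberately reports nothing when either value is missing, so a link that is down or a driver that does not list supported modes does not produce a false warning.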

We don't want Weka cluster buckets becoming too full or too imbalanced. Ordinarily the RAID stripe allocation takes care of this for us, but in at least one case (https://wekaio.atlassian.net/browse/WEKAPP-488736) network interruptions left us unable to find free stripes. This in turn led to buckets becoming full and thus FS writes stalling.

Stupid typo fix


Add a check to examine bucket fill levels

Basic RDMA errors check as per #weka-platform Slack

Basic checker to compare current NIC link speed with maximum

+3 statistics as per internal slack channel

Check to ensure cluster drives have consistent block sizes, and if not, raise a warning.

NVME block size check
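The consistency check described above could look roughly like the following Python sketch. It is an assumption-laden illustration: the PR does not say how block sizes are collected, so the input here is a hypothetical mapping of drive name to block size in bytes, and treating the most common size as the reference is a choice made for this example.

```python
from collections import Counter


def inconsistent_block_sizes(block_sizes):
    """Return the drives whose block size differs from the most common one.

    block_sizes: hypothetical mapping of drive name -> block size in bytes.
    An empty result means every drive reports the same block size.
    """
    if not block_sizes:
        return {}
    # The most frequently seen size is used as the reference value.
    reference, _count = Counter(block_sizes.values()).most_common(1)[0]
    return {drive: size for drive, size in block_sizes.items() if size != reference}


if __name__ == "__main__":
    drives = {"nvme0n1": 4096, "nvme1n1": 4096, "nvme2n1": 512}
    outliers = inconsistent_block_sizes(drives)
    if outliers:
        print(f"WARN: drives with inconsistent block sizes: {outliers}")
```

Reporting the outlying drives, rather than just a yes/no result, makes the warning actionable for whoever reads the checker output.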

This was RCA'd down to too many connections to Ganesha, so we should start checking this.

In WEKAPP-502848 we saw that NFS service was failing over

Should exclude dataservice containers from these checks

adamuk01 and others added 18 commits April 2, 2025 19:07

Changes for Weka4 - NTP,OS,IOMMU & OFED

6391e6b

Changes to NTP and IOMMU check

c7f891e

Update NTP check on client

06eccb6

Query NFSW FD usage

aea9f3c

Raise a warning if the current NFSW FD usage is >=90% of the configured maximum.

Merge pull request #163 from weka/Weka4Updates

778c57f

Changes for Weka4 - OS, IOMMU and NTP & OFED

Merge branch 'vince/os-kernel-updates' into master

606433f

Added parallel-compare scripts

1b1e6dc

resolve merge conflicts

2e62b01

Merge pull request #169 from weka/vince/2025-03-03

cbe4a53

Added parallel-compare scripts

resolve merge conflicts

b416b28

resolve merge conflicts

6d5c854

resolve merge conflicts

d73708c

Merge branch 'master' into merge-conflicts

c7f405e

Merge pull request #171 from weka/merge-conflicts

907f82c

merge-conflicts

IPv6 updates and code cleanup

904b177

- Provide IPv6 support in FIPs sanity, mgmt IP, netmask and SBR checks
- Remove jq dependency in NATS check

Minor typo fixes

4301aef

Merge pull request #172 from weka/cst_vragosta_ipv6

3814fb6

IPv6 updates and code cleanup

Merge pull request #167 from weka/cst_vragosta_nfsw_fds

a0591e4

Query NFSW FD usage

jackchallen and others added 10 commits May 7, 2025 16:56

Improve wording, fix typos

cadc0d6

Merge pull request #173 from weka/cst_jack_check_hot_spare_capacity

8485bf3

Add a very basic capacity check in the case of 0 hot-spare clusters

Stupid typo fix

19d6534

Merge pull request #175 from weka/cstjack-typo

01f6b25

Stupid typo fix

Clumsily avoid divide by zero risk while still allowing calculation to proceed

10f2dfe

Merge pull request #176 from weka/cst_jack_check_bucket_fill_levels

400e068

Add a check to examine bucket fill levels

Basic RDMA errors check as per #weka-platform Slack

65b2b4d

vrragosta and others added 11 commits May 15, 2025 07:58

Merge pull request #177 from weka/cst_jack_rdma_network_errors

956c368

Basic RDMA errors check as per #weka-platform Slack

Check --stable instead, as those are the resources actually in use

67a8b56

Merge pull request #174 from weka/cst_jack_check_backend_link_speeds

787a6d1

Basic checker to compare current NIC link speed with maximum

+3 statistics as per internal slack channel

943bdbc

Merge pull request #178 from weka/cst_jack_rdma_network_errors

22beea8

+3 statistics as per internal slack channel

NVME block size check

fc70123

Check to ensure cluster drives have consistent block sizes, and if not, raise a warning.

Merge pull request #179 from weka/cst_vragosta_nvme_bs

b47322d

NVME block size check

In WEKAPP-502848 we saw that NFS service was failing over

edcbd94

This was RCA'd down to too many connections to Ganesha, so we should start checking this.

Merge pull request #180 from weka/cst_jack_count_nfs_connections

173f532

In WEKAPP-502848 we saw that NFS service was failing over

Should exclude dataservice containers from these checks

ed96851

Merge pull request #182 from weka/cst_jack_skip_dataserv_and_s3_containers_in_gateway_check

500dbcd

Should exclude dataservice containers from these checks

vince-weka merged commit 645a404 into vince/os-kernel-updates Jun 23, 2025
4 checks passed


vince-weka merged 39 commits into vince/os-kernel-updates from master


4 participants
