
merge master into this branch#168

Merged
vince-weka merged 39 commits into vince/os-kernel-updates from master
Jun 23, 2025

Conversation

@vince-weka
Contributor

No description provided.

@jackchallen
Collaborator

@vince-weka I don't think you need us to do anything here, right? Looks like you're updating your branch to include changes from master.

@vince-weka
Contributor Author

vince-weka commented May 7, 2025 via email

jackchallen and others added 10 commits May 7, 2025 16:56
Clusters with 0 hot-spare can potentially be configured to allocate
and use ~100% of SSD capacity. If they lose a failure domain,
they'll lose a proportion of FS space (known as shrinkage).
The proportion lost depends on the number of data disks, and
if "too much" is lost then writes can fail with ENOSPC.

Essentially, we should warn customers without hot spares configured.
Add a very basic capacity check in the case of 0 hot-spare clusters
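
The capacity check described above could be sketched roughly as follows. This is an illustration only, not the check added in this commit: the function name, its parameters, and the 95% threshold are all assumptions, and the real shrinkage proportion (which depends on the number of data disks) is not modelled.

```python
def zero_hot_spare_warning(hot_spare_count, total_ssd_bytes, allocated_bytes,
                           warn_fraction=0.95):
    """Return a warning string, or None when there is nothing to report.

    All names and the default threshold are hypothetical.
    """
    # Clusters with hot spares configured are out of scope for this check.
    if hot_spare_count > 0 or total_ssd_bytes <= 0:
        return None

    allocated = allocated_bytes / total_ssd_bytes
    if allocated < warn_fraction:
        return None

    return (f"0 hot spares and {allocated:.0%} of SSD capacity allocated: "
            "losing a failure domain may shrink FS space and cause ENOSPC")
```

Returning a string rather than printing keeps the sketch easy to call from whatever reporting wrapper the checks use.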
In https://wekaio.atlassian.net/browse/WEKAPP-482528 we saw that
the link speed was lower than expected, but there were no warnings.
We should check that.

The only plausible way I can find of doing this is by parsing the
text-based output of ethtool. Until jq and "ethtool --json" are
available everywhere, or the kernel interface to ethtool-netlink is
exposed in /sys, I can't see any other way of doing it. :(
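
A minimal sketch of the text-parsing approach, assuming the usual `ethtool <iface>` layout (a `Speed: <n>Mb/s` line and a `Supported link modes:` block that wraps over several lines). The function names and the sample text are made up for illustration; the actual check in this commit may well be written differently.

```python
import re

# Abbreviated, made-up sample in the shape of `ethtool <iface>` output.
SAMPLE = """Settings for eth0:
    Supported ports: [ FIBRE ]
    Supported link modes:   25000baseCR/Full
                            50000baseCR2/Full
                            100000baseCR4/Full
    Supported pause frame use: Symmetric
    Speed: 25000Mb/s
    Duplex: Full
"""

def parse_ethtool_speeds(text):
    """Return (current_mbps, max_supported_mbps); either may be None."""
    current = None
    supported = []
    in_supported = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Supported link modes:"):
            in_supported = True
            stripped = stripped.split(":", 1)[1]
        elif in_supported and ":" in stripped:
            # Any new "Key: value" line ends the wrapped link-mode block.
            in_supported = False
        if in_supported:
            supported += [int(s) for s in re.findall(r"(\d+)base", stripped)]
        match = re.match(r"Speed:\s*(\d+)Mb/s", stripped)
        if match:
            current = int(match.group(1))
    return current, (max(supported) if supported else None)

def link_speed_warning(text):
    """Warn when the negotiated speed is below the fastest supported mode."""
    current, maximum = parse_ethtool_speeds(text)
    if current is None or maximum is None or current >= maximum:
        return None
    return f"link speed {current}Mb/s is below supported maximum {maximum}Mb/s"
```

A `Speed: Unknown!` line (link down) does not match the regex, so the sketch reports nothing in that case rather than guessing.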
We don't want weka cluster buckets becoming too full, or too
imbalanced. Ordinarily the RAID stripe allocation takes care of
this for us, but in at least one case (https://wekaio.atlassian.net/browse/WEKAPP-488736)
network interruptions left us unable to find free stripes. This in
turn led to buckets becoming full and thus FS writes stalling.
Add a check to examine bucket fill levels
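
One way such a fill-level check might look, as a sketch only. It assumes the per-bucket fill levels have already been collected as fractions between 0 and 1; the function name, the input shape, and both thresholds are hypothetical and not taken from this commit.

```python
def check_bucket_fill(fill_levels, max_fill=0.90, max_spread=0.25):
    """Return a list of problem strings for a {bucket_id: fill_fraction} map.

    Thresholds are illustrative assumptions, not values from the source.
    """
    problems = []
    if not fill_levels:
        return problems

    # "Too full": any single bucket at or above the fill threshold.
    for bucket, level in sorted(fill_levels.items()):
        if level >= max_fill:
            problems.append(f"bucket {bucket} is {level:.0%} full")

    # "Too imbalanced": gap between the emptiest and fullest bucket.
    fullest = max(fill_levels.values())
    emptiest = min(fill_levels.values())
    if fullest - emptiest > max_spread:
        problems.append(
            f"bucket fill levels are imbalanced: {emptiest:.0%} to {fullest:.0%}")
    return problems
```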
vrragosta and others added 11 commits May 15, 2025 07:58
Basic RDMA errors check as per #weka-platform Slack
Basic checker to compare current NIC link speed with maximum
+3 statistics as per internal Slack channel
Check to ensure cluster drives have consistent block sizes, and if not, raise a warning.
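
A consistency check of this kind can be sketched as below, assuming the block size of each drive has already been gathered into a mapping. The function name and input shape are hypothetical.

```python
from collections import Counter

def block_size_warning(drive_block_sizes):
    """Return a warning if drives report more than one block size.

    `drive_block_sizes` is an assumed {drive_id: block_size_bytes} mapping.
    """
    counts = Counter(drive_block_sizes.values())
    if len(counts) <= 1:
        return None
    summary = ", ".join(f"{size} B x {n}" for size, n in sorted(counts.items()))
    return f"inconsistent drive block sizes: {summary}"
```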
In WEKAPP-502848 we saw that NFS service was failing over.
This was RCA'd down to too many connections to Ganesha, so we
should start checking this.
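
As a rough illustration of counting connections, the sketch below tallies established TCP sockets on the standard NFS port (2049) from `/proc/net/tcp`-style text. How the real check counts Ganesha connections is not stated here, so the approach, the function name, and the sample data are all assumptions. IPv6 sockets live in `/proc/net/tcp6` and would need the same treatment.

```python
NFS_PORT = 2049
TCP_ESTABLISHED = "01"  # state code used in /proc/net/tcp

# Made-up sample: one listener, two established NFS clients, one SSH session.
SAMPLE = """  sl  local_address rem_address   st
   0: 00000000:0801 00000000:0000 0A
   1: 0A00000A:0801 0B00000A:C350 01
   2: 0A00000A:0801 0C00000A:C351 01
   3: 0A00000A:0016 0B00000A:C352 01
"""

def count_nfs_connections(proc_net_tcp_text, port=NFS_PORT):
    """Count established connections whose local port is `port`."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address is HEXIP:HEXPORT
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port and fields[3] == TCP_ESTABLISHED:
            count += 1
    return count
```

On a live host the text would come from reading `/proc/net/tcp`, and the count would be compared against whatever connection limit is considered safe.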
…iners_in_gateway_check

Should exclude dataservice containers from these checks
@vince-weka vince-weka merged commit 645a404 into vince/os-kernel-updates Jun 23, 2025
4 checks passed
4 participants