Adjust Lifetime / wear levelling trigger #2

Open
danboid opened this issue Feb 19, 2024 · 4 comments


danboid commented Feb 19, 2024

In my experience of using Samsung SSDs as members of ZFS pools, when one disk's wear levelling count gets to about 9 or 10%, that disk can bring the I/O performance of the whole pool to its knees, so I'm going to create a trigger for my Samsung SSDs that fires when they exceed 7% wear levelling. It seems that this template doesn't include a trigger for wear levelling by default, so I'd like to see one added.

Thanks


danboid commented Feb 21, 2024

I realised today that your template already has a trigger for wear levelling, but it's called Lifetime and it's configured like so:
last([/S.M.A.R.T. SSD Samsung/ssd.v177[{#SSDDISK}]])<10

So I think all I need to do is change this to:

last([/S.M.A.R.T. SSD Samsung/ssd.v177[{#SSDDISK}]])<7

or maybe

last([/S.M.A.R.T. SSD Samsung/ssd.v177[{#SSDDISK}]])>7

To make it alert sooner.

Even 90% wear levelling is too late in my experience. Waiting for it to get below 10% would be much too late for a timely alert.
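
For anyone weighing up the candidate expressions above, here is a minimal Python sketch of which ones would fire for a given last value. It assumes (as stated later in this thread) that the item holds a value that starts at 100 on a new disk and counts down as the disk wears. The names `CANDIDATES`, `fires` and `firing_table` are made up for illustration and are not part of Zabbix or the template.

```python
import operator

# Candidate trigger conditions from this comment, as (comparison, threshold).
# Assumption: the monitored value starts at 100 on a new disk and decreases.
CANDIDATES = {
    "<10": (operator.lt, 10),  # the template's default Lifetime condition
    "<7": (operator.lt, 7),
    ">7": (operator.gt, 7),
}


def fires(condition: str, last_value: int) -> bool:
    """Return True if the given condition holds for the last collected value."""
    compare, threshold = CANDIDATES[condition]
    return compare(last_value, threshold)


def firing_table(values):
    """Map each sample value to the list of conditions that would fire for it."""
    return {v: [c for c in CANDIDATES if fires(c, v)] for v in values}


if __name__ == "__main__":
    for value, conditions in firing_table([100, 90, 9, 5]).items():
        print(value, conditions)
```

Under that assumption, `<7` fires even later than the default `<10`, while `>7` is already true for a brand new disk, so neither makes the alert arrive at a useful point.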


danboid commented Feb 23, 2024

Today I tested the wear levelling / Lifetime monitoring on Zabbix 6.0 using a known-bad (heavily worn) Samsung SSD, with no luck. Two questions:

Does a disk need to be mounted for this template to work? I did mount my faulty disk but it still didn't trigger the Lifetime alert.

Do I need to "Unlink and clear" a template every time I change a macro or trigger?

@danboid danboid changed the title Add a default trigger for wear levelling Adjust Lifetime / wear levelling trigger Feb 26, 2024

danboid commented Feb 26, 2024

I think this might actually be working under Zabbix 6, but I got the trigger prototype expression for Lifetime wrong. I think I should be using this:

last(/S.M.A.R.T. SSD Samsung/ssd.v177[{#SSDDISK}])<93

The problem is that I now need to wait 12 hours to find out whether that is correct, because I don't know how to change the interval or manually trigger a re-check. I have inserted a Samsung SSD with a wear level value of 85, so it should cause this template to trigger, even though it's not mounted, I presume.

I have changed the item prototype for [{#SSDDISK} Wear Leveling Count] to use a 1hr interval, but that doesn't update it every hour. I suspect that's because the Discovery rule for this template is still set to a 12hr interval, but I've not yet worked out how to adjust that interval.
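
While waiting for the next check, the value for SMART attribute 177 can be read by hand to confirm what the trigger should see. The sketch below is only an illustration: it assumes smartmontools is installed and that `smartctl -A` prints the usual ten-column attribute table. How this template actually collects `ssd.v177` is not shown in this thread, so the Zabbix item may not match this exactly. The `SAMPLE` text is invented test data, and `parse_attribute` and `read_wear_level` are hypothetical helper names.

```python
import subprocess


def parse_attribute(smartctl_output: str, attr_id: int = 177):
    """Return (normalized VALUE, RAW_VALUE) for attr_id, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with the numeric ID.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[3]), fields[9]
    return None


def read_wear_level(device: str):
    """Run `smartctl -A` on a device (usually needs root) and parse attribute 177."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses a bitmask exit status, so non-zero is not always fatal
    )
    return parse_attribute(result.stdout)


# Invented sample output, used only to exercise the parser.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       20000
177 Wear_Leveling_Count     0x0013   085   085   000    Pre-fail  Always       -       300
"""

if __name__ == "__main__":
    print(parse_attribute(SAMPLE))
```

SMART data is read from the device itself rather than from a filesystem, so a call such as `read_wear_level("/dev/sdX")` (device path is a placeholder) does not depend on the disk being mounted.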


danboid commented Feb 27, 2024

Good news! This template does indeed work fine for Zabbix 6.0 when using Samsung SSDs attached to a RAID controller running in HBA mode.

This is what I've added to my Zabbix notes about this template:

In its default configuration, this template is no use for monitoring ZFS pools built from Samsung SATA SSDs, because it doesn't alert until the wear levelling value gets as low as 9%. A new disk starts at 100% levelling. In my experience, if just one disk in a ZFS pool gets down to about 91% or 90% wear levelling, it can tank the performance of the whole pool, so we want to alert when a disk reaches 8% use, i.e. 92% as the alert point.

This is how you configure the Samsung SSD Zabbix template to alert much sooner:

Configuration -> Templates
Search for samsung; you should find S.M.A.R.T. SSD Samsung
Click the "Discovery" link next to the Samsung template to edit its Discovery rules, then select "Trigger prototypes" and click on "{#SSDDISK} -- Lifetime"
Change the Expression value for the Lifetime trigger to:

last(/S.M.A.R.T. SSD Samsung/ssd.v177[{#SSDDISK}])<93

Then click Update. The trigger will now alert much sooner, when the SSD's wear levelling count gets to 92%, instead of at the template's default of 9% (<10).
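
The relationship between the chosen alert point and the number in the expression can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming integer values that count down from 100. `lifetime_threshold` and `lifetime_expression` are hypothetical helper names, not anything provided by Zabbix or the template.

```python
def lifetime_threshold(alert_at_value: int) -> int:
    """Smallest N for which `last(...) < N` is true at alert_at_value or lower.

    Assumes the item reports whole numbers from 100 (new) down to 0.
    """
    if not 0 <= alert_at_value <= 100:
        raise ValueError("alert_at_value must be between 0 and 100")
    return alert_at_value + 1


def lifetime_expression(
    alert_at_value: int,
    template: str = "S.M.A.R.T. SSD Samsung",
    key: str = "ssd.v177[{#SSDDISK}]",
) -> str:
    """Build the trigger prototype expression text for a given alert point."""
    return f"last(/{template}/{key})<{lifetime_threshold(alert_at_value)}"


if __name__ == "__main__":
    print(lifetime_expression(92))  # alert point used in these notes
    print(lifetime_expression(9))   # the template's default behaviour
```

With an alert point of 92 this produces the `<93` expression shown above, and with 9 it reproduces the default `<10`.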

You may want to add something like that to the README if you don't want to change the default Lifetime trigger expression to something closer to mine.
