Implementation of a Monitoring Daemon for storage devices in SONiC switches #433

Merged: 31 commits into sonic-net:master on May 31, 2024

@assrinivasan assrinivasan commented Feb 9, 2024

Description

This commit adds a daemon that monitors storage device attributes on devices running SONiC.
SONiC Storage Monitoring Daemon HLD

Motivation and Context

Storage devices degrade over time due to factors such as cumulative disk writes, bad-block management, lack of free space, sub-optimal operating temperature, and ordinary wear and tear, all of which reflect the overall health of the disk.

The goal of the Storage Monitoring Daemon (storagemond) is to provide meaningful metrics for these issues and to enable streaming telemetry for them, so that preventative measures can be triggered when performance degrades.

How Has This Been Tested?

Manually tested on the following platforms:

7050cx3.txt
S6100.txt
SN2700.txt

Additional Information (Optional)

@assrinivasan assrinivasan changed the title Implementation of a Storage Monitoring Daemon for storage devices in SONiC switches Implementation of a Monitoring Daemon for storage devices in SONiC switches Feb 9, 2024
@assrinivasan
Contributor Author

@assrinivasan please add more details for manual testing.

SONiC image upgrade, reboot, crash, fast/warm reboot

Added to the PR.

linux-foundation-easycla bot commented May 30, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@assrinivasan
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

Azure Pipelines successfully started running 1 pipeline(s).


STORAGEUTIL_LOAD_ERROR = 127

log = syslogger.SysLogger(SYSLOG_IDENTIFIER)
Collaborator

@assrinivasan can we move this inside the daemon class?

Contributor Author

Done in latest


if value is None: self.log_warning("{}:{} value = None in StateDB".format(storage_device, field))

self.statedb_storage_info_loaded = True
Collaborator

@assrinivasan what if the value is None? In that case we should fall back to the .json file on disk.

Contributor Author

Fixed in the latest revision. I also added a None check to the _load_fsio_rw_json function. If both StateDB and the JSON file hold junk values, the daemon treats this as an init case.
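
To illustrate the kind of None check described in this reply, here is a minimal sketch, not the merged code. The standalone function name `load_fsio_rw_json`, the field names in `FSIO_FIELDS`, and the file layout (a JSON object keyed by device name) are all assumptions made for illustration; the daemon's real keys and structure may differ.

```python
import json

# Hypothetical field names; the real daemon's keys may differ.
FSIO_FIELDS = ("total_fsio_reads", "total_fsio_writes")


def load_fsio_rw_json(path):
    """Return the parsed backup data, or None if the file is unusable.

    A missing file, invalid JSON, an empty object, or any None (JSON null)
    value in an expected field all count as "unusable", so the caller can
    treat the situation as an init case.
    """
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, ValueError):
        # OSError: file missing or unreadable; ValueError: invalid JSON.
        return None

    if not isinstance(data, dict) or not data:
        return None

    for fields in data.values():
        if not isinstance(fields, dict):
            return None
        for field in FSIO_FIELDS:
            if fields.get(field) is None:
                return None
    return data
```

Returning None for every unusable case keeps the caller's decision simple: either a complete baseline is available from the file, or it is not.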

Comment on lines 191 to 200
if self.statedb_storage_info_loaded == False and self.fsio_json_file_loaded == True:
    self.use_fsio_json_baseline = True
    self.use_statedb_baseline = False

# If stormond is coming back up after a daemon crash, storage information would be saved in the
# STATE_DB. In that scenario, we use the STATE_DB information as the SoT and reconcile the FSIO
# reads and writes values.
elif self.statedb_storage_info_loaded == True:
    self.use_fsio_json_baseline = False
    self.use_statedb_baseline = True
Collaborator

@assrinivasan can you make the logic clearer, i.e., if the stats are available in STATE_DB, use them, and otherwise fall back to the .json values from the backup?

Contributor Author

Done in latest
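
The precedence requested in this thread (STATE_DB first, the .json backup as fallback, otherwise a fresh start) could be expressed roughly as follows. This is a hypothetical sketch, not the code that was merged: the function `select_baseline` and its string return values are invented for illustration, and only the two "loaded" flags correspond to attributes shown in the diff above.

```python
def select_baseline(statedb_loaded: bool, fsio_json_loaded: bool) -> str:
    """Pick the source of truth for the FSIO read/write baseline.

    Hypothetical helper: the return values are illustrative labels only.
    """
    if statedb_loaded:
        # Daemon restarted after a crash: STATE_DB still holds the stats.
        return "statedb"
    if fsio_json_loaded:
        # STATE_DB is empty, but the on-disk backup is usable.
        return "fsio_json"
    # Neither source is usable: treat this as an init case.
    return "init"


if __name__ == "__main__":
    for flags in [(True, True), (True, False), (False, True), (False, False)]:
        print(flags, "->", select_baseline(*flags))
```

Checking STATE_DB first makes the fallback order explicit, and the final return covers the case where both sources are missing or hold junk values.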

@assrinivasan
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

Azure Pipelines successfully started running 1 pipeline(s).

prgeor
prgeor previously approved these changes May 30, 2024
@prgeor prgeor merged commit f41ecca into sonic-net:master May 31, 2024
5 checks passed

5 participants