markgoddard

  • Add SMART Monitoring with dash and alerts
  • Increase job timeout for kolla image build GHA
  • Fix oom-killer graph
  • Rephrase the match logic for interfaces monitored for packet drops

technowhizz and others added 4 commits December 19, 2022 12:40
Enabled the textfile collector for node exporter in kolla/globals.yml

Added the smartmon script unmodified from the prometheus-community GitHub
repository, then removed its NVMe support in favour of the nvme-cli script,
which has also been added because it provides better metrics than the
smartmon script does. The script also adds the serial number of the disk as
a label to all SMART metrics.

Added a Kayobe custom playbook to deploy the scripts and the associated
cron job. The playbook installs smartmontools and nvme-cli, copies the
scripts to the hosts, and sets up a cron job that runs the scripts and
stores the metrics in the Docker volume for node exporter. The playbook
also changes how the metrics are saved to a file: the finished file is put
in place with the mv command, which is atomic when the source and
destination are on the same filesystem. This was needed because Prometheus
would at times read a partially written file.
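
The write-then-rename pattern described above can be sketched in Python. This is an illustration of the technique only, not the playbook's cron job (which uses the shell mv command); the function name and the temporary-file suffix are assumptions made for this example.

```python
import os
import tempfile


def write_metrics_atomically(metrics_text: str, dest_path: str) -> None:
    """Write metrics to a temporary file, then rename it over dest_path.

    Hypothetical helper: readers of dest_path see either the old file or the
    complete new file, never a partially written one.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temporary file in the destination directory so that the
    # final rename stays on one filesystem, which is what makes it atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(metrics_text)
        # mkstemp creates the file with mode 0600; relax it so that another
        # user (for example the collector process) can read the result.
        os.chmod(tmp_path, 0o644)
        # os.replace performs the rename, overwriting any existing file.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing the temporary file next to the destination, rather than under /tmp, is the important design choice: a move across filesystems falls back to copy-and-delete and loses the atomicity.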

Added a Prometheus alert that fires when a drive is reported as not healthy
for more than 10 minutes.
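
As a rough model of the "unhealthy for more than 10 minutes" condition, the sketch below checks whether a drive has been continuously unhealthy for a hold period. It is a simplified illustration, not the alert rule added by this commit; the function name, the sample layout, and the use of `>=` at the threshold are assumptions.

```python
from typing import List, Tuple


def alert_fires(samples: List[Tuple[float, bool]],
                hold_seconds: float = 600.0) -> bool:
    """Return True if the drive has been unhealthy for at least hold_seconds.

    samples holds (timestamp in seconds, healthy) pairs in chronological
    order. Any healthy sample resets the timer, so only an unbroken unhealthy
    streak ending at the latest sample counts.
    """
    unhealthy_since = None
    for timestamp, healthy in samples:
        if healthy:
            unhealthy_since = None
        elif unhealthy_since is None:
            unhealthy_since = timestamp
    if unhealthy_since is None:
        return False
    return samples[-1][0] - unhealthy_since >= hold_seconds
```

Requiring the condition to hold for a period avoids alerting on a single bad scrape.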

Added a Grafana dashboard to display the number of healthy and unhealthy
drives reported in Prometheus.

(cherry picked from commit d83ecde)

Add docs for SMART Monitoring

(cherry picked from commit 595429a)

Update doc/source/configuration/monitoring.rst

Fix kayobe command

Co-authored-by: Will Szumski <will@stackhpc.com>
(cherry picked from commit 9a5fc53)

Update doc/source/configuration/monitoring.rst

Fix Spelling

Co-authored-by: Will Szumski <will@stackhpc.com>
(cherry picked from commit ef25d6f)

Add release note

(cherry picked from commit 3d4d011)

Amend docs and add release note

(cherry picked from commit b6cb511)

Move SMART prometheus alert to own file

(cherry picked from commit b353fd3)

Fix typo

(cherry picked from commit 611f2fb)

fixup

Changes the oom-killer graph from a smoothed irate to a discrete delta
function.

Change-Id: I2e4a8576c628610409ade4aad2bd98754bec3860
(cherry picked from commit ef1a449)
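
To illustrate why the two functions draw different graphs, the sketch below computes simplified versions of both over one window of samples. It ignores counter resets and the extrapolation that PromQL applies to delta, and the function names are made up for this example; it is not the dashboard query itself.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second rate computed from the last two samples in the window."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


def simple_delta(samples: List[Sample]) -> float:
    """Difference between the last and first values in the window."""
    return samples[-1][1] - samples[0][1]


# Two OOM kills happen between the second and third scrape.
window = [(0.0, 3.0), (60.0, 3.0), (120.0, 5.0)]
print(simple_irate(window))  # fractional per-second rate
print(simple_delta(window))  # whole number of events in the window
```

For a rare event, a per-second rate is a small fraction that is hard to read, whereas the delta is a count of events in the window.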

OVS bridge interfaces drop packets during normal operation. Change
the regex to filter out interfaces that don't matter for packet
drops.

(cherry picked from commit 9c3f15a)
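
A minimal sketch of this kind of interface filtering follows. The exclusion pattern and the interface names are hypothetical, chosen only to illustrate the idea; the regex actually changed by this commit is not shown in this conversation.

```python
import re
from typing import Iterable, List

# Hypothetical exclusion pattern: bridge-style names and the OVS system
# device. This is NOT the regex from the commit.
IGNORED_INTERFACES = re.compile(r"^(br-.*|ovs-system)$")


def monitored_interfaces(names: Iterable[str]) -> List[str]:
    """Return the interface names that should be watched for packet drops."""
    return [name for name in names if not IGNORED_INTERFACES.match(name)]


print(monitored_interfaces(["eth0", "br-int", "ovs-system", "bond0"]))
```

Anchoring the pattern with `^` and `$` keeps it from accidentally excluding interfaces that merely contain one of the ignored names.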
@markgoddard markgoddard requested a review from a team as a code owner December 19, 2022 12:43
@markgoddard markgoddard self-assigned this Dec 19, 2022
@markgoddard markgoddard merged commit 7660c05 into stackhpc/xena Dec 19, 2022
@markgoddard markgoddard deleted the xena-backports branch December 19, 2022 12:53