There are no notifications for service issues #4

rmeissner · 2021-11-23T14:45:40Z

Problem Owner

What is the Problem

Currently our services (especially the indexing service) require a lot of database resources. This causes the following issues:

Longer response time when service-related issues occur

Why is it relevant?

Longer response time when service-related issues occur

As service issues normally persist, this has a big impact on our user experience because the services are not available.

How can we track the problem? (What is the KPI?)

Uptime

fmrsabino · 2022-01-26T16:43:16Z

Currently our services (especially the indexing service) require a lot of database resources. This causes the following issues:
Longer response time when service-related issues occur

I would say that this is more related to #3, since it concerns the performance of the service.

Regarding the issue: There are no notifications for service issues

We do have notifications in place to alert us about potential ongoing issues with the services. Unfortunately, the reported issues are too broad and there is no real category to distinguish them. This increases the number of issues reported (noise), which leaves most or all of them ignored until there is a more insightful report from our users.

Sub-problem 1 – Lack of issue Classification

The issues that are reported to Kibana and Slack do not have any meaningful category that can distinguish their severity. Currently, evaluating a specific issue requires human intervention: someone has to read the stream of error messages and filter out what is considered normal behaviour. This does not scale well, not only because of the number of messages that need to be evaluated but also because we need to keep track of the alerts currently being sent.

Action: We therefore first need to classify the reported issues by creating categories and assigning them to the issues.

Once issues can be easily distinguished, we can move on to creating notifications and alerts.

Sub-problem 2 – Notify on issues that require immediate attention

After creating a classification system for the issues reported by our services, we should create rules to notify the team depending on the severity of those issues.

Some issues would require immediate attention (like service unavailability), while others might be more feature specific. Additionally, we should consider rules that are triggered on different thresholds, e.g. number of pending tasks > X or number of 500s returned in the last Y minutes.
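A minimal sketch of the two threshold rules mentioned above, written in plain Python to illustrate the idea rather than in any specific monitoring tool. The names `ErrorRateRule` and `pending_tasks_exceeded` are hypothetical, and X and Y are passed in as `max_pending`/`max_errors` and `window_seconds`.

```python
from collections import deque


class ErrorRateRule:
    """Fires when more than `max_errors` HTTP 5xx responses fall inside a sliding window."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._timestamps = deque()  # times (in seconds) of recent 5xx responses

    def record(self, status_code, now):
        """Record one response seen at time `now`; return True if the rule is firing."""
        if 500 <= status_code <= 599:
            self._timestamps.append(now)
        # Drop errors that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors


def pending_tasks_exceeded(pending_tasks, max_pending):
    """Fires when the number of pending tasks is strictly above the threshold."""
    return pending_tasks > max_pending
```

With `ErrorRateRule(max_errors=2, window_seconds=60)`, the rule starts firing on the third 5xx response seen within one minute and stops once older errors age out of the window.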

Action Points (to be discussed)

Come up with categories for issues. Example:

Availability and Basic Functionality
Latency
Correctness
Feature specific issues
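As a rough illustration of how these example categories could be attached to incoming error messages, here is a keyword-based sketch. The keyword lists, the `classify` function, and the choice of which categories need immediate attention are all assumptions made for illustration; real rules would have to come out of the discussion.

```python
from enum import Enum


class Category(Enum):
    AVAILABILITY = "Availability and Basic Functionality"
    LATENCY = "Latency"
    CORRECTNESS = "Correctness"
    FEATURE = "Feature specific issues"


# Hypothetical keyword rules; the first matching rule wins.
RULES = [
    (("connection refused", "503", "unavailable"), Category.AVAILABILITY),
    (("timeout", "slow"), Category.LATENCY),
    (("mismatch", "invalid"), Category.CORRECTNESS),
]

# Assumption: only availability issues page the team immediately.
IMMEDIATE = {Category.AVAILABILITY}


def classify(message):
    """Return the first category whose keywords appear in the message."""
    text = message.lower()
    for keywords, category in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    # Anything unmatched is treated as feature specific.
    return Category.FEATURE


def requires_immediate_attention(message):
    return classify(message) in IMMEDIATE
```

Keeping the categories in one enum means the notification rules can be written against a category instead of against raw log text.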

Evaluate Black Box monitoring and White Box monitoring

My personal take on this is that we should prioritise Black Box monitoring given our past incidents.

Black box monitoring is performed by monitoring our systems from the perspective of a client, i.e. it monitors the symptoms of all our services.
E.g. since our apps rely on the Safe Client Gateway as their entry point, we can monitor it with a fake client that performs a number of different actions every X minutes and reports the results.
Provided that the correct rules and alerting are in place, this would report issues that are user-facing and thus might require immediate attention.
White box monitoring is performed by monitoring internal metrics in our system, i.e. it monitors the causes of the issues in our services.
E.g. disk space, memory usage, disk I/O, etc.
This would provide valuable information for anticipating issues, i.e. the issues reported here might not be critical (from the user's perspective) but might cause problems at a later stage if left unattended.
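The fake client used for black box monitoring could look roughly like the sketch below, which uses only the Python standard library. The names `probe`, `http_status` and `ProbeResult` are hypothetical, and any endpoint passed in would be a placeholder, not an actual Safe Client Gateway route. Scheduling the probe every X minutes and forwarding results to an alerting channel are left out.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    ok: bool
    status: Optional[int]  # None when no HTTP response was received
    elapsed_seconds: float


def http_status(url, timeout=10.0):
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises 4xx/5xx responses as HTTPError


def probe(url: str, fetch: Callable[[str], int] = http_status) -> ProbeResult:
    """Perform one client-side check and report what a user would observe."""
    started = time.monotonic()
    try:
        status = fetch(url)
    except OSError:
        # URLError (DNS failure, refused connection) and timeouts are OSError subclasses.
        status = None
    elapsed = time.monotonic() - started
    ok = status is not None and 200 <= status < 300
    return ProbeResult(url, ok, status, elapsed)
```

The `fetch` parameter exists so the probe can be exercised without network access, for example `probe("https://gateway.example/health", fetch=lambda url: 200)`.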

Evaluate tools to accomplish the above

After a brief discussion with the team, we should look into Prometheus (https://prometheus.io) for rule creation (as we are possibly moving away from Kibana).

As for alerting, we should consider email and Slack, since these tools are already available to us (additional channels can also be considered).

Sources

rmeissner added draft infra labels Nov 23, 2021

rmeissner mentioned this issue Nov 23, 2021

There is an overflow of service related logs #11

Closed

rmeissner removed the draft label Dec 13, 2021

rmeissner assigned fmrsabino Jan 20, 2022
