There are no notifications for service issues #4
Comments
I would say this is more closely related to #3, since it concerns the performance of the service.
Regarding the issue: we do have notifications in place that tell us about potential ongoing issues with the services. Unfortunately these notifications are too broad and there is no real category to distinguish them. This increases the number of issues reported (noise), so most or all of them are ignored until a more insightful report comes in from our users.
Sub-problem 1: Lack of issue classification
The issues that are reported to Kibana and Slack do not have any meaningful category that distinguishes their severity. Currently, evaluating a specific issue requires human intervention: someone has to read the stream of error messages and filter out what is considered normal behaviour. This does not scale well, both because of the number of messages that need to be evaluated and because we need to keep track of the alerts currently being sent.
Action: We first need to classify the reported issues by creating categories and assigning them. Once issues can be easily distinguished, we can move on to creating notifications and alerts.
Sub-problem 2: Notify on issues that require immediate attention
After creating a classification system for the issues reported by our services, we should create rules that notify the team depending on the severity of those issues. Some issues require immediate attention (like service unavailability) while others might be more feature specific. Additionally, we should consider rules that are triggered on different thresholds, e.g. number of pending tasks > X, or number of 500s returned in the last Y minutes.
Action Points (to be discussed)
Come up with categories for issues. Example:
Evaluate Black Box monitoring and White Box monitoring
My personal take on this is that we should prioritise Black Box monitoring given our past incidents.
Evaluate tools to accomplish the above
After a brief discussion with the team, we should look into Prometheus (https://prometheus.io) for the rule creation (as we are possibly moving away from Kibana). As for alerting, we should consider email and Slack, since these tools are already available to us (but additional channels can be considered).
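To make the threshold rules from Sub-problem 2 more concrete, here is a minimal sketch in Python of the two examples given there: "number of 500s returned in the last Y minutes" and "number of pending tasks > X". This is only an illustration of the idea, not how Prometheus or any existing tool implements it. The names `WindowedCountRule` and `pending_tasks_rule` are hypothetical, and the thresholds used are placeholders for the X and Y values that still need to be agreed on.

```python
import time
from collections import deque
from typing import Deque, Optional


class WindowedCountRule:
    """Hypothetical rule: fires when more than `threshold` events
    (e.g. HTTP 500 responses) were recorded in the last `window_seconds`.

    Assumes events are recorded in non-decreasing timestamp order.
    """

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: Deque[float] = deque()

    def record(self, timestamp: Optional[float] = None) -> None:
        # Store the event time; default to the current wall-clock time.
        self._events.append(time.time() if timestamp is None else timestamp)

    def is_firing(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        # Drop events that have fallen out of the sliding window.
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events) > self.threshold


def pending_tasks_rule(pending_tasks: int, threshold: int) -> bool:
    """Hypothetical rule: fires when the pending task count exceeds `threshold`."""
    return pending_tasks > threshold


# Example: more than 3 errors within 5 minutes (placeholder values).
rule = WindowedCountRule(threshold=3, window_seconds=300)
for t in (0, 10, 20, 30):
    rule.record(t)

print(rule.is_firing(now=60))    # True: 4 events inside the window
print(rule.is_firing(now=400))   # False: all events are older than 300 s
print(pending_tasks_rule(pending_tasks=120, threshold=100))  # True
```

In practice the recorded events would come from whatever the services already report, and a firing rule would be routed to email or Slack according to the category assigned to the issue.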
Problem Owner
@fmrsabino
What is the Problem
Currently our services (especially the indexing service) require a lot of database resources. This causes the following issues:
Why is it relevant?
As the service issues normally persist, this has a big impact on our user experience because the services are not available.
How can we track the problem? (What is the KPI?)