Scaling opni alerting as a microservice OEP #1551

enhancements/alerting/20230704-microservice.md (new file, +99 lines)
# Title

Moving Opni Alerting to a microservice architecture

## Summary

Opni Alerting requires functionality that is outside the scope of a gateway plugin; the gateway
should only be responsible for forwarding requests to their destinations.

## Use case

- High availability & scaling of Opni Alerting
  - Each (functional) component of alerting has different requirements for maintaining consistency & availability
- Required for high availability & scaling of the Opni gateway

## Benefits

- Easier to manage the scalability of alerting components
- Greatly reduces the complexity of the Opni gateway's responsibilities
- Failure recovery for individual components
- Easier to test & manage growing alerting features

## Impact

- Addresses production use cases of alerting

## Implementation details

Currently, the Opni Alerting architecture looks like the gateway monolith in the supporting documents.

The subcomponents of each system are functionally separate but are part of the same lifecycle
as the gateway (or the Alertmanager cluster, in the case of the config reconciler & emitters).

The components in yellow specifically belong outside the scope of the gateway, since they need to
track stateful information and operate outside of the gateway's lifecycle.

In this proposal, we want to split the gateway alerting plugin into separate services:

- CRUD for configs
- Functional APIs & config syncing
- Dispatching alerts
- Evaluating internal status
- Emitting alerts to a default destination (cache / Kubernetes events)

In essence, each subcomponent is separated out so that it can scale independently and recover from potential failures independently.
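
As a rough sketch, these boundaries could map onto Go interfaces like the following. Every name here is hypothetical and purely illustrative of the split, not Opni's actual APIs; the real services would presumably be exposed over gRPC:

```go
package alerting

import "context"

// Hypothetical boundaries for the four services; the method sets mirror the
// responsibilities listed above, not Opni's actual APIs.

// ConfigStore is the data plane: CRUD over persisted alerting configuration.
type ConfigStore interface {
	PutEndpoint(ctx context.Context, id string, spec []byte) error
	GetEndpoint(ctx context.Context, id string) ([]byte, error)
	DeleteEndpoint(ctx context.Context, id string) error
}

// Syncer is the control plane: functional APIs plus config syncing.
type Syncer interface {
	// SyncConfig rebuilds the Alertmanager routing tree from stored configs.
	SyncConfig(ctx context.Context) error
	Silence(ctx context.Context, conditionID string) error
}

// Evaluator tracks the status of internal Opni components over time.
type Evaluator interface {
	EvaluateStatus(ctx context.Context) error
}

// Emitter dispatches received alerts to a default destination.
type Emitter interface {
	Emit(ctx context.Context, alert []byte) error
}
```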

## Acceptance criteria

Separate the existing APIs into the following services:

- [ ] A **control plane** deployment, which manages the "sync tasks" of the alerting plugin.
  - [ ] Applies/removes alerting configurations on their respective backends
  - [ ] Builds the Alertmanager configuration required to route alerts to their respective endpoints
  - [ ] Applies functional APIs to their respective backends, such as silencing endpoints and sending notifications to endpoints
  - [ ] Its APIs will be served by the gateway
  - This can scale according to a leader election process (see the leader-election sketch after this list)
- [ ] A **data plane** deployment, which manages persisting the Opni Alerting configurations.
  - [ ] Stores alerting endpoints / conditions / SLOs
  - [ ] Its APIs will be served by the gateway
  - This can scale according to a distributed locking mechanism (a lock-acquisition sketch appears under Dependencies below)
- [ ] An **evaluator** deployment, which takes on the role of tracking the state of internal Opni components, alerting on things that are mission-critical for Opni
  - [ ] Evaluates the status of agents / Opni backends over time
  - [ ] Sends alerts based on loaded user configs that track Opni status
  - This can scale via the same leader election process
- [ ] An **emitter** deployment, which handles sending all received notifications to a backend such as:
  - [ ] The alerting message cache
  - [ ] The Kubernetes events API (see the event-recorder sketch after this list)
  - In the future, OpenSearch
  - This can scale behind the built-in Kubernetes service load balancer, since it is completely stateless
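
Both the control plane and the evaluator are meant to scale via leader election. A minimal sketch of that pattern using client-go's `leaderelection` package follows; the lease name, namespace, `POD_NAME` env var, and `runSyncTasks` are illustrative assumptions, not Opni's actual wiring:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A Lease object shared by all replicas of the deployment; whichever
	// replica holds it runs the sync tasks, the rest stay hot-standby.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{Name: "opni-alerting-controlplane", Namespace: "opni"},
		Client:    client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("POD_NAME"), // unique per replica
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { runSyncTasks(ctx) },
			// Exit on lost leadership; Kubernetes restarts the pod cleanly.
			OnStoppedLeading: func() { os.Exit(0) },
		},
	})
}

// runSyncTasks stands in for applying/removing alerting configurations.
func runSyncTasks(ctx context.Context) { <-ctx.Done() }
```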

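For the emitter's Kubernetes events backend, here is a minimal sketch of wiring a client-go event recorder; the `opni` namespace, component name, reason, and message are illustrative assumptions:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/record"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Fan notifications out to the Kubernetes events API; the emitter itself
	// holds no state, which is what lets it scale behind a plain service LB.
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: client.CoreV1().Events("opni"),
	})
	recorder := broadcaster.NewRecorder(scheme.Scheme,
		corev1.EventSource{Component: "opni-alerting-emitter"})

	// Attach a received alert to a reference object, e.g. the opni namespace.
	ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: "opni"}}
	recorder.Event(ns, corev1.EventTypeWarning, "OpniAlert", "agent cluster-1 disconnected")
}
```
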
## Supporting documents

### Gateway monolith

![Gateway monolith](images/AlertingMiscroserviceGatewayMonolith.png)

### Microservice

![Microservice](images/AlertingMiscroserviceMicroservice.png)

### Dependencies

- Downstream Agent: https://github.com/rancher/opni/pull/1493
- HA consistency: https://github.com/rancher/opni/pull/1408
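
The data plane's distributed-lock scaling (see the acceptance criteria) presumably builds on the HA consistency work above. As a hedged sketch of the pattern only, here is how a config write could be serialized across replicas with etcd's `concurrency` package; whether Opni's locking layer is etcd-backed, and every name here, are assumptions:

```go
package main

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// withConfigLock runs fn while holding a distributed mutex on key, so that
// concurrent data-plane replicas serialize writes to the same alerting config.
func withConfigLock(ctx context.Context, endpoints []string, key string, fn func(context.Context) error) error {
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints})
	if err != nil {
		return err
	}
	defer cli.Close()

	// The session keeps a lease alive; if this replica dies, the lock is
	// released automatically and another replica can proceed.
	session, err := concurrency.NewSession(cli)
	if err != nil {
		return err
	}
	defer session.Close()

	mu := concurrency.NewMutex(session, key)
	if err := mu.Lock(ctx); err != nil {
		return err
	}
	defer mu.Unlock(ctx)

	return fn(ctx)
}
```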

## Risks and Contingencies

N/A

## Level of Effort

2 weeks

- 1 week separating the servers into Opni commands + pods
- 1 week testing

## Resources

- 1 Upstream and 1 Downstream alerting cluster