Architecture SLA SLI SLO
Service-level agreements, indicators, objectives. Error budgets. Burn rate alerts. Reporting cadence. Companion to Architecture-Observability, Architecture-Resilience.
| Term | Meaning |
|---|---|
| SLA (Service Level Agreement) | External commitment to a customer with consequences (refunds, credits) on breach. |
| SLO (Service Level Objective) | Internal target. We aim for it; reliability decisions reference it. |
| SLI (Service Level Indicator) | The metric that measures whether we're meeting an SLO. |
| Error budget | (1 - SLO) × period - the allowable amount of unreliability before action. |
| Burn rate | Speed at which the error budget is consumed. |
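The error budget formula in the table can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a 30-day period; the function name `error_budget` is illustrative and not part of the project.

```python
from datetime import timedelta


def error_budget(slo: float, period: timedelta = timedelta(days=30)) -> timedelta:
    """Allowed unreliability for the period: (1 - SLO) x period."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    # Round to milliseconds to hide floating-point noise in (1 - slo).
    seconds = round((1.0 - slo) * period.total_seconds(), 3)
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    for target in (0.995, 0.999, 0.9995, 0.9999):
        print(target, error_budget(target))
```

For a 99.9% target over 30 days this gives 43 minutes 12 seconds of allowable unreliability.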
FinCore Engine OSS does not publish SLAs - adopters define their own with their customers. We publish SLO targets as best-effort recommendations based on production fintech industry norms.
Sources for the targets below:
- Google SRE Book - SLO chapter
- PSD2 RTS - Strong Customer Authentication API SLA requirements
- ISO 20022 - payment messaging norms
- NIST SP 800-53 - fintech control framework
- SOC 2 Trust Services Criteria
- Public status pages: Stripe, Wise, Adyen, Plaid, Revolut
| SLI | Definition | Metric expression |
|---|---|---|
| Ledger Post Availability | % of POST /v1/transactions returning non-5xx | 1 - (rate(http_server_requests_total{path="/v1/transactions",status=~"5.."}[5m]) / rate(http_server_requests_total{path="/v1/transactions"}[5m])) |
| Read API Availability | % of GET requests returning non-5xx | similar with GETs |
| Payment Initiate Availability | % of POST /v1/payments returning non-5xx | similar |
| Decision Engine Availability | % of POST /v1/decision/evaluate returning non-5xx | similar |
| Webhook Subscription Delivery | % of webhooks delivered within 30s of event | rate(webhook_delivered_within_30s[5m]) / rate(webhook_total_attempts[5m]) |
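The availability expressions above all reduce to the same ratio: one minus the share of requests that returned a 5xx. A minimal sketch of that arithmetic on raw counts follows. The function name and the treatment of a zero-traffic window are assumptions made for illustration, not behavior taken from the project.

```python
def availability(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if total_requests < 0 or not 0 <= server_errors <= max(total_requests, 0):
        raise ValueError("counts must satisfy 0 <= server_errors <= total_requests")
    if total_requests == 0:
        # Assumption: a window with no traffic is treated as fully available.
        return 1.0
    return 1.0 - server_errors / total_requests


if __name__ == "__main__":
    # 12 failures out of 20,000 requests in the window
    print(f"{availability(20_000, 12):.4%}")
```

The Prometheus expressions compute the same ratio over per-second rates rather than raw counts; the result is identical when both rates cover the same window.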
| SLI | Definition | Metric expression |
|---|---|---|
| Ledger Post p99 | p99 of POST /v1/transactions server-side latency | histogram_quantile(0.99, rate(http_server_requests_seconds_bucket{path="/v1/transactions"}[5m])) |
| Get Balance p99 | p99 of GET /v1/accounts/{id}/balance (cached MV) | similar |
| Payment Initiate p99 | p99 of POST /v1/payments (full sync flow) | similar |
| Decision Eval p99 | p99 of decision engine evaluation | histogram_quantile(0.99, rate(decision_evaluation_duration_bucket[5m])) |
| Outbox Publish Lag p99 | p99 time from outbox row creation to Kafka publish | histogram_quantile(0.99, rate(outbox_dispatcher_lag_seconds_bucket[5m])) |
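To make the latency expressions less opaque, here is a simplified sketch of how a quantile is estimated from cumulative histogram buckets by linear interpolation, which is the idea behind `histogram_quantile` for classic histograms. It is an illustration only: the bucket bounds and counts are invented, and it leaves out details of the real function such as operating on bucket rates and handling malformed buckets.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by bound,
    ending with (math.inf, total_count).
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket:
                # report the highest finite bound.
                return buckets[-2][0]
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


if __name__ == "__main__":
    # Hypothetical latency buckets (seconds) for one 5-minute window.
    sample = [(0.1, 900), (0.25, 980), (0.5, 995), (1.0, 1000), (math.inf, 1000)]
    print(f"p99 ~ {histogram_quantile(0.99, sample):.3f}s")
```

Because the estimate is interpolated, its accuracy depends on bucket placement: a target such as 500ms is only measured reliably if a bucket boundary sits at or near 0.5s.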
| SLI | Definition | Metric expression |
|---|---|---|
| Ledger Invariant Compliance | % of transactions that satisfied SUM=0 invariant | 1 - (rate(ledger_invariant_violation_total[1h])) (zero violations expected; any violation = breach) |
| Idempotency Correctness | % of duplicate-key requests returning cached response | rate(idempotency_replay_total[5m]) / rate(idempotency_check_total[5m]) |
| Outbox No-Loss | % of business writes that have a corresponding outbox row | offline reconciliation; alert on any miss |
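Two of the correctness SLIs are simple to state as checks. The sketch below shows one possible shape for the SUM=0 test and for the offline outbox reconciliation. The function names, the string-encoded amounts, and the `tx-…` identifiers are hypothetical and chosen only for illustration.

```python
from decimal import Decimal


def violates_sum_zero(signed_amounts: list[str]) -> bool:
    """True if the signed entries of one transaction do not sum to zero."""
    # Decimal avoids binary floating-point drift on monetary amounts.
    total = sum((Decimal(a) for a in signed_amounts), Decimal("0"))
    return total != 0


def missing_outbox_rows(business_write_ids: list[str],
                        outbox_source_ids: list[str]) -> list[str]:
    """IDs of business writes that have no corresponding outbox row."""
    return sorted(set(business_write_ids) - set(outbox_source_ids))


if __name__ == "__main__":
    print(violates_sum_zero(["100.00", "-60.00", "-40.00"]))
    print(missing_outbox_rows(["tx-1", "tx-2", "tx-3"], ["tx-1", "tx-3"]))
```

Any non-empty result from either check would count as a breach under the zero-tolerance targets.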
| SLI | Definition |
|---|---|
| Postgres RPO | Max data loss window in catastrophic failure |
| Backup Recoverability | Quarterly DR drill demonstrates restore-from-backup |
| Decision Log Retention | Decision logs retained 7+ years per regulatory requirement |
Conservative; we don't promise more than we can measure.
| SLI | SLO Target | Error budget per 30 days | Source |
|---|---|---|---|
| Ledger Post Availability | 99.9% | 43m 12s | initial baseline |
| Read API Availability | 99.95% | 21m 36s | initial baseline |
| Ledger Post p99 latency | < 500ms | n/a (per-request budget) | architectural target |
| Get Balance p99 (cached) | < 100ms | n/a | architectural target |
| Decision Eval p99 | < 50ms | n/a | architectural target (relaxed for v0.1) |
| Webhook delivery within 30s | 99.5% | 3h 36m | initial baseline |
| Outbox publish lag p99 | < 5s | n/a | architectural target |
| Ledger invariant compliance | 100% | 0 (no tolerance) | regulatory |
| Idempotency correctness | 100% | 0 (no tolerance) | correctness |
More aggressive; adopters can rely on these.
| SLI | SLO Target | Error budget per 30 days |
|---|---|---|
| Ledger Post Availability | 99.95% | 21m 36s |
| Read API Availability | 99.99% | 4m 19s |
| Ledger Post p99 latency | < 300ms | - |
| Get Balance p99 | < 50ms | - |
| Decision Eval p99 | < 10ms | - |
| Webhook delivery within 30s | 99.9% | 43m 12s |
| Outbox publish lag p99 | < 1s | - |
| Payment durability (no in-flight loss) | 99.9999% | < 2.6s (about 31.5s/year) |
| Ledger invariant compliance | 100% | 0 |
| Idempotency correctness | 100% | 0 |
| KYC verification turnaround p95 | < 10 min auto, < 24h manual | - |
| SLI | SLO Target |
|---|---|
| Ledger Post Availability | 99.99% |
| Read API Availability | 99.999% (5 nines) |
Each SLO has a 30-day rolling error budget.
| Budget | Action |
|---|---|
| Healthy (>50% remaining) | Normal release cadence, all features ship |
| Cautious (10-50% remaining) | Code review demands explicit reliability discussion. New risky features delayed. |
| Frozen (<10% remaining) | Feature releases blocked. Only reliability fixes. Post-mortem required for budget consumption. |
| Exhausted (<0%) | Page on-call. All hands on reliability. Post-mortem mandatory. SLO temporarily downgraded to allow recovery. |
This policy is published, applies to maintainers, and is reviewed quarterly.
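The policy table can be read as a small classification function over the remaining budget. The following is a sketch under stated assumptions: the function names are illustrative, unreliability is measured in "bad seconds", and the boundary values (exactly 50% and exactly 10% remaining) are assigned to Cautious, which the table leaves open.

```python
PERIOD_SECONDS = 30 * 24 * 3600  # 30-day rolling window


def remaining_fraction(slo: float, bad_seconds: float,
                       period_seconds: float = PERIOD_SECONDS) -> float:
    """Share of the error budget still unspent; negative once overspent."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be in (0, 1); a 100% target has no budget")
    budget_seconds = (1.0 - slo) * period_seconds
    return 1.0 - bad_seconds / budget_seconds


def budget_state(remaining: float) -> str:
    """Map the remaining budget fraction to the policy state."""
    if remaining < 0.0:
        return "Exhausted"
    if remaining < 0.10:
        return "Frozen"
    if remaining <= 0.50:  # assumption: boundaries count as Cautious
        return "Cautious"
    return "Healthy"


if __name__ == "__main__":
    left = remaining_fraction(0.999, bad_seconds=2000)
    print(f"{left:.1%} remaining -> {budget_state(left)}")
```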
- Stop the bleeding (whatever caused the breach)
- Post-mortem within 48 hours
- Action items prioritized over feature work
- SLO target temporarily relaxed (not the SLI; we still measure honestly)
- Once budget is healthy for 7 consecutive days, restore SLO to original target
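The restoration rule in the last step can be expressed as a check over daily budget snapshots. This is a sketch only: it assumes "healthy" means more than 50% of the budget remaining, matching the Healthy row of the policy table, and the function name is hypothetical.

```python
def can_restore_slo(daily_remaining: list[float],
                    healthy_threshold: float = 0.50,
                    required_days: int = 7) -> bool:
    """True if the most recent `required_days` snapshots were all healthy.

    daily_remaining: remaining budget fraction per day, oldest first.
    """
    recent = daily_remaining[-required_days:]
    return (len(recent) == required_days
            and all(value > healthy_threshold for value in recent))


if __name__ == "__main__":
    history = [0.05, 0.30, 0.55, 0.60, 0.62, 0.65, 0.70, 0.72, 0.75]
    print(can_restore_slo(history))
```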
Per Google SRE workbook - multi-window, multi-burn-rate alerts.
For SLO 99.9% (error budget = 0.1%):
| Burn rate | Detection window | Action |
|---|---|---|
| 14.4× | 1 hour | P0 - burns 30d budget in ~2d (2% of budget per hour) |
| 6× | 6 hours | P1 - burns 30d budget in 5d |
| 1× | 3 days | P2 - burns 30d budget in 30d (track only) |
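The burn-rate figures follow from simple arithmetic on the 30-day window. A minimal sketch, with illustrative function names: time to exhaustion is the period divided by the burn rate, the budget consumed during a detection window is burn rate × window / period, and the alert threshold on the error ratio is burn rate × (1 - SLO).

```python
PERIOD_DAYS = 30


def days_to_exhaustion(burn_rate: float, period_days: float = PERIOD_DAYS) -> float:
    """Days until the whole budget is gone at a constant burn rate."""
    return period_days / burn_rate


def budget_consumed(burn_rate: float, window_hours: float,
                    period_days: float = PERIOD_DAYS) -> float:
    """Fraction of the period's budget spent during one detection window."""
    return burn_rate * window_hours / (period_days * 24)


def error_ratio_threshold(burn_rate: float, slo: float) -> float:
    """Error ratio that corresponds to the given burn rate for this SLO."""
    return burn_rate * (1.0 - slo)


if __name__ == "__main__":
    for rate, window_h in ((14.4, 1), (6, 6), (1, 72)):
        print(f"{rate}x: exhausted in {days_to_exhaustion(rate):.2f}d, "
              f"{budget_consumed(rate, window_h):.0%} of budget per window, "
              f"error ratio > {error_ratio_threshold(rate, 0.999):.4f}")
```

With these inputs the three tiers consume 2%, 5%, and 10% of the 30-day budget within their detection windows.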
Implementation:
```yaml
- alert: AvailabilityBurnRateFast
  expr: |
    (
      sum(rate(http_server_requests_total{application="ledger-service",status=~"5.."}[1h]))
      / sum(rate(http_server_requests_total{application="ledger-service"}[1h]))
    ) > (14.4 * 0.001)  # 14.4× burn against 99.9% SLO
  for: 5m
  labels: { severity: P0, slo: ledger_post_availability }
- alert: AvailabilityBurnRateMedium
  expr: |
    (
      sum(rate(http_server_requests_total{application="ledger-service",status=~"5.."}[6h]))
      / sum(rate(http_server_requests_total{application="ledger-service"}[6h]))
    ) > (6 * 0.001)  # 6× burn against 99.9% SLO
  for: 30m
  labels: { severity: P1, slo: ledger_post_availability }
```

Single Grafana dashboard with one panel per SLO:
- Current SLI value
- 30-day SLO target line
- Error budget remaining (visualized as fuel gauge)
- Burn rate over time
- Recent SLO breaches
JSON in deploy/grafana/dashboards/slo.json.
| Audience | Cadence | Format |
|---|---|---|
| Maintainers (internal) | Weekly | Slack post with SLI numbers + budget status |
| Maintainer team / contributors | Monthly | Release notes / forum post (transparency) |
| Public adopters | Quarterly | Status page update |
| Regulatory (eventually) | Annual | SOC 2 audit deliverable |
- No public SLA in OSS - adopters bring their own
- No 24/7 on-call from upstream maintainers (community contracts only)
- No multi-region active-active in v0.1 (RTO 30 min, RPO 1 sec via PITR)
- No real-time fraud signal SLA - that's the adopter's responsibility through the `RiskScorer` plug-in
- No data residency guarantee in v0.1 - single-region deployment
These show up in the roadmap as adopters demand them.
For adopters defining SLAs with their own customers, this template encodes our SLOs as adopter-side baselines:
```markdown
## Service Level Agreement (SLA)

We commit to the following service levels, measured monthly:

### Availability
- Ledger and balance APIs: 99.95% uptime
- Payment API: 99.95% uptime
- Decision and compliance APIs: 99.9% uptime

### Latency (p99)
- Get balance: < 200ms
- Post transaction: < 500ms
- Initiate payment: < 1s
- Decision evaluation: < 100ms

### Webhooks
- Delivered within 30 seconds: 99.9%
- Maximum delivery attempts: 7 over 7 days

### Compensation
- Service availability < 99.9% in a billing month: 10% credit
- Service availability < 99% in a billing month: 25% credit
- Service availability < 95% in a billing month: 50% credit

### Exclusions
- Scheduled maintenance windows (announced 7 days in advance)
- Force majeure
- Customer-controlled infrastructure (their bank, their KYC provider, their LLM)
```

Adopters customize and present to their customers.
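The compensation tiers in the template map a month's measured availability to a credit percentage. A minimal sketch of that lookup follows, assuming availability is already computed net of the listed exclusions; the function name is illustrative.

```python
def sla_credit_percent(monthly_availability: float) -> int:
    """Credit owed for a billing month, per the template's compensation tiers."""
    if not 0.0 <= monthly_availability <= 1.0:
        raise ValueError("availability must be a fraction in [0, 1]")
    # Check the most severe tier first so the largest applicable credit wins.
    if monthly_availability < 0.95:
        return 50
    if monthly_availability < 0.99:
        return 25
    if monthly_availability < 0.999:
        return 10
    return 0


if __name__ == "__main__":
    for measured in (0.9995, 0.998, 0.98, 0.90):
        print(measured, sla_credit_percent(measured))
```

Note that under these tiers a month at 99.92% misses the 99.95% commitment but earns no credit, since credits start below 99.9%; adopters may want to decide whether that gap is intentional.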
Private-vertical SLOs are documented privately. Not part of OSS commitments.
- Architecture-Observability - SLI metric definitions and dashboards
- Architecture-Resilience - what we do when SLOs are at risk
- Runbook - incident response when alerts fire
- Incident-Response - post-mortem workflow