Operations Bundle
Operations Bundle - Getting Started, Deployment, Helm, Monitoring, Runbook, Incident Response, FAQ, Troubleshooting
Operational Wiki pages bundled for compactness. Each section can be split into a standalone page if it grows.
- Docker 24+ and Docker Compose v2
- 8 GB RAM minimum (4 for the stack + 4 for your dev environment)
- Java 21 (only if running services outside Docker)
- curl, jq (for the demo script)
git clone https://github.com/tiana-code/fincore-engine
cd fincore-engine
docker compose up -d
# Wait for services to be healthy (~30 seconds)
docker compose ps
# Run the demo
./scripts/demo.sh
The demo:
- Creates two accounts (USER_WALLET, EUR)
- Posts a 100 EUR transfer (double-entry transaction)
- Verifies balances match expected
- Demonstrates time-travel balance query
- Reverses the transaction
- Verifies idempotency (retry produces 1 transaction, not 2)
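The double-entry property the demo exercises can be checked offline with a few lines of shell. This is a minimal sketch, not the ledger's own verification: the input format (`txn_id currency amount`, with signed integer amounts in minor units) and the function name `check_double_entry` are assumptions made for illustration.

```shell
# check_double_entry: reads "txn_id currency amount_minor" lines on stdin
# (hypothetical export format; amounts are signed integers in minor units)
# and fails if any (transaction, currency) group does not sum to zero.
check_double_entry() {
  awk '
    { sum[$1 " " $2] += $3 }
    END {
      bad = 0
      for (k in sum) {
        if (sum[k] != 0) { print "unbalanced: " k " sum=" sum[k]; bad = 1 }
      }
      exit bad
    }'
}

# A 100 EUR transfer as two entries: debit one wallet, credit the other.
printf '%s\n' 'tx-1 EUR -10000' 'tx-1 EUR 10000' | check_double_entry && echo "balanced"
```

Working in integer minor units avoids floating-point rounding when summing amounts.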
After docker compose up:
| Service | URL | Default credentials |
|---|---|---|
| Ledger Service | http://localhost:8080 | get JWT from Keycloak |
| Swagger UI | http://localhost:8080/swagger-ui.html | - |
| Keycloak Admin | http://localhost:8081 | admin / admin |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | - |
| Loki | http://localhost:3100 | - |
| Postgres | localhost:5432 | fincore / fincore |
| Redpanda Console | http://localhost:8888 | - |
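Rather than waiting a fixed 30 seconds, a script can poll until a service answers. The helper below is a sketch: `wait_until` is a hypothetical name, and the health URL in the commented example assumes the Spring Boot Actuator health endpoint is exposed on the Ledger Service, which this page does not confirm.

```shell
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (assumed endpoint, adjust to your setup):
# wait_until 30 2 curl -fsS http://localhost:8080/actuator/health
```

The function returns non-zero when the attempts run out, so it can gate the demo script with `wait_until ... && ./scripts/demo.sh`.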
TOKEN=$(curl -s -X POST http://localhost:8081/realms/fincore/protocol/openid-connect/token \
-d "grant_type=client_credentials" \
-d "client_id=fincore-api-client" \
-d "client_secret=demo-secret" \
| jq -r '.access_token')
# Use it
curl http://localhost:8080/v1/accounts/some-id \
-H "Authorization: Bearer $TOKEN"
docker compose down # stops, keeps volumes
docker compose down -v # stops + removes volumes (loses data)
| Option | When to use |
|---|---|
| Helm chart on Kubernetes | Recommended - most adopters |
| Plain docker compose | Small / single-host POC |
| Manual JAR deployment | Air-gapped environments |
| Cloud-managed (EKS, GKE, AKS) | Enterprise - combine with Helm |
# Add the chart repo
helm repo add fincore https://tiana-code.github.io/fincore-helm-charts
helm repo update
# Install with production values
helm install fincore-engine fincore/fincore-engine \
--namespace fincore-engine --create-namespace \
--values values-prod.yaml
# Upgrade (same namespace as the install)
helm upgrade fincore-engine fincore/fincore-engine \
--namespace fincore-engine \
--values values-prod.yaml
values-prod.yaml template:
global:
  environment: production
  image:
    registry: ghcr.io/tiana-code
    pullPolicy: IfNotPresent
  imagePullSecrets:
    - name: ghcr-cred
ledger:
  replicaCount: 3
  hpa:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
  resources:
    requests: { memory: "1Gi", cpu: "500m" }
    limits: { memory: "2Gi", cpu: "2" }
  config:
    spring:
      profiles:
        active: prod
      datasource:
        url: ${VAULT_DB_URL}
        username: ${VAULT_DB_USER}
        password: ${VAULT_DB_PASS}
postgres:
  enabled: false # use external managed Postgres
  externalUrl: jdbc:postgresql://prod-postgres.example.com:5432/fincore
redpanda:
  enabled: false # use external Strimzi Kafka or MSK
  bootstrapServers: kafka-bootstrap.example.com:9092
keycloak:
  enabled: false # use external Keycloak
  externalIssuerUri: https://auth.example.com/realms/fincore
resilience:
  circuitBreaker:
    bankAdapter:
      failureRateThreshold: 50
  rateLimit:
    perIp: 100
    perUser: 1000
observability:
  prometheus:
    enabled: true
    serviceMonitor: true
  loki:
    enabled: true
  tempo:
    enabled: true
security:
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 65532
  networkPolicy:
    enabled: true
    egressAllowedTo:
      - postgres
      - kafka
      - keycloak
      - external-providers # configure CIDRs
Before going live:
- External Postgres (HA primary + replicas, PITR enabled, encryption at rest)
- External Kafka (Strimzi / MSK / Confluent)
- External Keycloak (HA + DB)
- External Redis (HA)
- TLS certs for ingress (cert-manager + Let's Encrypt or commercial)
- mTLS in cluster (Istio or Linkerd recommended)
- Network policies (default-deny + explicit allows)
- Secrets via Vault / AWS SM / GCP SM (not env vars)
- Image pull from your private registry (re-tag from ghcr.io/tiana-code)
- Backup strategy: WAL archiving, nightly snapshots, quarterly DR drill
- Monitoring + alerting wired (PagerDuty / Slack)
- Runbook reviewed
- Pen test scheduled
- SOC 2 readiness assessment (if regulated)
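To sanity-check the `hpa` settings in the values-prod.yaml template (minReplicas 3, maxReplicas 10, 70% CPU target) before going live, the documented Kubernetes autoscaling formula, desired = ceil(current × currentMetric / targetMetric), can be evaluated by hand. The sketch below uses a hypothetical helper name, `hpa_desired`, and ignores the controller's tolerance band and stabilization windows, so treat its output as an approximation.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_CPU_PCT TARGET_CPU_PCT MIN MAX
# desired = ceil(current * currentMetric / targetMetric), clamped to [MIN, MAX]
hpa_desired() {
  current=$1
  cpu=$2
  target=$3
  min=$4
  max=$5
  # Integer ceiling of (current * cpu) / target
  desired=$(( (current * cpu + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

hpa_desired 3 90 70 3 10   # prints 4
```

With 3 replicas at 90% average CPU the autoscaler would ask for 4 pods; at 35% it would stay at the floor of 3.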
Raw manifests live in deploy/kubernetes/ for adopters who prefer kustomize or a no-Helm setup. They are maintained, but Helm is the primary path.
deploy/helm/fincore-engine/
├── Chart.yaml
├── values.yaml # defaults (Redpanda + Keycloak bundled, dev mode)
├── values-prod.yaml # production overrides
├── values-kafka.yaml # Apache Kafka via Strimzi instead of Redpanda
├── values-observability.yaml # full Grafana stack
└── templates/
    ├── _helpers.tpl
    ├── ledger-deployment.yaml
    ├── ledger-service.yaml
    ├── ledger-configmap.yaml
    ├── ledger-secret.yaml
    ├── ledger-hpa.yaml
    ├── ledger-pdb.yaml # PodDisruptionBudget
    ├── ledger-servicemonitor.yaml
    ├── ledger-networkpolicy.yaml
    ├── postgres-statefulset.yaml
    ├── postgres-service.yaml
    ├── redpanda-statefulset.yaml
    ├── keycloak-deployment.yaml
    ├── keycloak-service.yaml
    ├── ingress.yaml
    └── (per-service templates for v0.2+)
global:
  environment: dev
  image:
    registry: ghcr.io/tiana-code
    tag: 0.1.0
    pullPolicy: IfNotPresent
ledger:
  enabled: true
  replicaCount: 1
  service:
    type: ClusterIP
    port: 8080
  config:
    spring:
      profiles:
        active: dev
postgres:
  enabled: true
  persistence:
    size: 10Gi
  credentials:
    database: fincore
    username: fincore
    password: fincore
redpanda:
  enabled: true
  resources:
    requests: { memory: "512Mi", cpu: "250m" }
    limits: { memory: "1Gi", cpu: "1" }
keycloak:
  enabled: true
  realmImport: /opt/keycloak/data/import/fincore-realm.json
  adminUser: admin
  adminPassword: admin # CHANGE IN PROD
observability:
  enabled: false # toggle for Grafana stack
# Disable bundled Postgres, use external
helm install fincore fincore/fincore-engine \
--set postgres.enabled=false \
--set ledger.config.spring.datasource.url=jdbc:postgresql://prod-db:5432/fincore
# Use Apache Kafka instead of Redpanda
helm install fincore fincore/fincore-engine -f values-kafka.yaml
# Enable observability
helm install fincore fincore/fincore-engine -f values-observability.yaml
# Production deployment
helm install fincore fincore/fincore-engine -f values-prod.yaml
helm lint deploy/helm/fincore-engine
helm template deploy/helm/fincore-engine | kubectl apply --dry-run=client -f -
helm test fincore-engine # runs chart-tests
CI (.github/workflows/helm-test.yml) runs all of the above on every PR.
See Architecture-Observability for the full list. Minimal must-haves:
- Service Health Overview - RPS, error rate, p99 latency, pod count, heap, GC
- Ledger Throughput - transactions posted/sec, balance read p99, invariant compliance
- Outbox & Event Flow - pending events, dispatcher lag, consumer lag, DLQ depth
- Resilience - circuit breaker states, retry counts, saga interventions
Dashboard JSONs in deploy/grafana/dashboards/.
Defined as Prometheus rules in deploy/prometheus/rules.yaml. Severity matrix:
| Severity | Channel | Ack SLA |
|---|---|---|
| P0 | PagerDuty + phone | 5 min |
| P1 | PagerDuty | 15 min |
| P2 | Slack #engineering | 1 hour |
| P3 | Slack #monitoring | next business day |
Each alert links to a runbook entry (see §5 below).
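The severity matrix can be kept as a small lookup so that scripts and notification hooks agree with the table. This is only an illustrative sketch: the function name `alert_route` and the `channel|ack SLA` output format are assumptions, not part of the repository.

```shell
# alert_route SEVERITY
# Prints "channel|ack SLA" for a severity from the matrix above.
alert_route() {
  case "$1" in
    P0) echo "pagerduty+phone|5 min" ;;
    P1) echo "pagerduty|15 min" ;;
    P2) echo "slack:#engineering|1 hour" ;;
    P3) echo "slack:#monitoring|next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

alert_route P1   # prints: pagerduty|15 min
```

Failing on unknown severities keeps a mistyped label from being silently routed nowhere.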
Each FinCore service exposes Micrometer metrics at /actuator/prometheus. Business metric naming:
- ledger.transactions.posted.total (counter, labels: currency)
- ledger.invariant.violation.total (counter - alert on any > 0)
- payments.completed.total (counter)
- decision.evaluation.duration (histogram)
- outbox.events.pending (gauge per schema)
- webhook.delivery.success.total (counter)
Full catalog in Architecture-Observability§custom-counters--timers.
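The dotted Micrometer names listed here appear in Prometheus queries with underscores, as in the `ledger_invariant_violation_total` query used in the runbook example. A minimal sketch of that mapping follows; `to_prometheus_name` is a hypothetical helper, and it covers only the dot-to-underscore step. The Prometheus registry may also append type or unit suffixes (for example on timers and histograms), so confirm the final name at /actuator/prometheus.

```shell
# to_prometheus_name METRIC
# Converts a dotted Micrometer metric name to its underscore form.
to_prometheus_name() {
  printf '%s\n' "$1" | tr '.' '_'
}

to_prometheus_name ledger.invariant.violation.total
# prints: ledger_invariant_violation_total
```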
Each runbook entry lives in runbooks/<topic>.md of the repo and is linked from alert annotations.
| Topic | Severity | When |
|---|---|---|
| ledger-invariant-violation.md | P0 | Any invariant violation reported |
| service-down.md | P0 | Service health probe fails |
| outbox-backlog.md | P1 | Pending events > 1000 for 5 min |
| consumer-lag.md | P1 | Consumer lag > 10000 for 5 min |
| circuit-breaker-open.md | P1 | Bank/KYC adapter circuit OPEN > 2 min |
| dlq-non-zero.md | P2 | DLQ has messages |
| idempotency-conflict-rate.md | P2 | Conflict rate > 1% |
| dispatcher-failed.md | P0 | Outbox dispatcher cannot publish |
| db-connection-pool-exhausted.md | P1 | HikariCP saturation > 90% |
| db-disk-full.md | P0 | Postgres disk usage > 90% |
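Several conditions in the table are sustained thresholds ("> 1000 for 5 min"): the alert fires only when every sample in the window exceeds the limit, so a single spike does not page anyone. The sketch below illustrates that semantics on a list of samples. `sustained_above` is a hypothetical helper, and the real rules are evaluated by Prometheus, not by a script like this.

```shell
# sustained_above THRESHOLD SAMPLE...
# Succeeds only if at least one sample is given and every sample
# is strictly greater than THRESHOLD (integers only).
sustained_above() {
  threshold=$1
  shift
  [ "$#" -gt 0 ] || return 1
  for sample in "$@"; do
    [ "$sample" -gt "$threshold" ] || return 1
  done
  return 0
}

# Five one-minute samples of pending outbox events, all above 1000.
sustained_above 1000 1200 1500 1100 1300 1250 && echo "fire outbox-backlog"
```

A single dip to or below the threshold inside the window makes the function fail, which mirrors how a pending alert resets.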
# Runbook: <topic>
## Severity: P0 - <one-line description>
## What this means
<2-3 sentences explaining the alert>
## Detection
- Alert: <alert name>
- Metric: <prometheus query>
- Logs: <loki query>
## Immediate actions
1. <step 1>
2. <step 2>
3. <step 3>
## Investigation
- <hypotheses>
## Recovery
- <recovery steps>
## Post-incident
- <follow-up requirements>
# Runbook: Ledger Invariant Violation
## Severity: P0 - Customer-impacting, possible data corruption
## What this means
The deferred trigger detected entries that don't sum to zero per currency.
This should be impossible - investigate immediately.
## Detection
- Alert: LedgerInvariantViolation
- Metric: rate(ledger_invariant_violation_total[1m]) > 0
- Logs: application="ledger-service" level=ERROR message=~"invariant"
## Immediate actions
1. Confirm the metric is real (cross-instance check)
2. Identify offending transaction:
SELECT t.* FROM transactions t WHERE id IN (
SELECT transaction_id FROM entries
GROUP BY transaction_id, currency
HAVING SUM(amount) <> 0
);
3. If recent deploy: roll back. If not: page eng manager.
## Investigation
- Was the trigger disabled? `\df+ verify_double_entry_invariant`
- Was the materialized view stale?
- Was there a data import bypassing the trigger?
## Recovery
- DO NOT delete offending entries (immutable journal).
- Determine intended correct state.
- Post a compensating transaction if mathematically valid.
- Otherwise escalate to data-integrity working group.
## Post-incident
- Mandatory post-mortem within 48h.
- Root cause must include: how the invariant was bypassed and why tests didn't catch it.
DETECTED → TRIAGED → DIAGNOSED → MITIGATED → RESOLVED → POST-MORTEM
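The incident lifecycle is a strictly linear state machine, which makes it easy to enforce in tooling such as a status-page updater or chat bot. A minimal sketch, assuming a hypothetical helper named `next_incident_state`:

```shell
# next_incident_state STATE
# Prints the state that follows STATE in the lifecycle.
# Fails for POST-MORTEM (terminal) and for unknown states.
next_incident_state() {
  case "$1" in
    DETECTED)  echo TRIAGED ;;
    TRIAGED)   echo DIAGNOSED ;;
    DIAGNOSED) echo MITIGATED ;;
    MITIGATED) echo RESOLVED ;;
    RESOLVED)  echo POST-MORTEM ;;
    *)         return 1 ;;
  esac
}

next_incident_state MITIGATED   # prints RESOLVED
```

Because there is exactly one successor per state, skipping a step (for example going from DETECTED straight to RESOLVED) has to be an explicit decision, not an accident.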
- Incident Commander (IC) - coordinates, makes decisions, communicates
- Subject Matter Expert (SME) - debugs, applies fixes
- Communicator - keeps stakeholders informed (status page, Slack, email)
- Scribe - captures timeline, decisions, action items
A solo maintainer takes all four roles; larger teams assign them to separate people.
| Audience | Channel | Cadence |
|---|---|---|
| Internal (maintainers, on-call) | Slack #incidents | Continuous |
| Public (status page) | status.fincore.dev | Every 30 min |
| GitHub Discussions | "Incident" category | At start, mid, end |
| Sponsors (if affected) | | At start, end |
| Adopters (paid) | Email + status page | Continuous |
# Post-mortem: <incident title>
## Date / time
- Detected: ...
- Resolved: ...
- Duration: ...
## Severity
- P0/P1/P2
## Impact
- Customer-facing: ...
- Internal: ...
- Data: ...
## Timeline (UTC)
- HH:MM - <what happened>
- HH:MM - <what we noticed>
- HH:MM - <what we tried>
- HH:MM - <what fixed it>
## Root cause
- <single sentence>
- <detailed analysis>
## What went well
- ...
## What went poorly
- ...
## Action items (owner + ETA)
- [ ] <action> - Maintainer - 2026-MM-DD
- [ ] <action> - ...
## Lessons learned
- ...
Every P0/P1 incident gets a post-mortem within 48 hours, published publicly when the incident is non-sensitive (transparency).
- Focus on what failed, not who failed
- Mistakes are systemic problems, not individual ones
- Action items must be process or tooling, not "be more careful"
Q: Is FinCore Engine a fintech?
No. It's open-source infrastructure for building fintech apps. We don't hold money, process payments, or have regulatory presence.
Q: Do I need a license to use FinCore Engine?
Not for non-production use, embedded use in your own product, evaluation, or contributions: BSL 1.1 grants free use for these. Only a "competing managed service" requires a commercial license. See ADR-0002.
Q: When does my use convert to Apache 2.0?
Each release auto-converts 4 years after its release date. v0.1.0 (June 2026) → Apache 2.0 in June 2030.
Q: Can I fork it?
Yes. Per the BSL, you can fork, modify, and use freely under the same terms.
Q: Is FinCore Engine production-ready?
v0.1.0 is an MVP, suitable for production only on low-risk workloads. v1.0.0 (target Q2 2027) is the production-stable milestone with public SLOs and a SOC 2 readiness checklist.
Q: Can I use my own database (not Postgres)?
v0.1.0 is Postgres-only. v0.3.0 introduces a TigerBeetle adapter. Other databases are out of scope and not supported.
Q: Can I use Apache Kafka instead of Redpanda?
Yes. Same Kafka client code. Helm chart has values-kafka.yaml. Just point bootstrap-servers at your cluster.
Q: Can I run without Keycloak?
Yes. Use any OIDC-compatible provider (Auth0, Okta, Cognito, internal). Just change issuer-uri.
Q: What's the difference between Decision Engine and Drools?
Decision Engine is JSON-DSL, deterministic, audit-first, lightweight (Maven JAR), Kotlin-native. Drools is .drl-DSL, BRMS-heavy, JBoss-coupled. See ADR-0008 for the full rationale.
Q: How do I add a new payment provider?
Implement the BankProvider interface, register as Spring bean, configure in application.yml. See services/payment/.../external/SandboxBankAdapter.kt as reference.
Q: How do I add ML?
Implement RiskScorer or AnomalyDetector interface, register as Spring bean, configure. ML models stay private - only interfaces are in OSS.
Q: Can I use FinCore Engine for crypto?
No mature support. The ledger stores amounts as NUMERIC(38,18), which is precise enough for most token amounts, so technically yes. But custodial wallets, KYC for crypto, and on-chain interaction are out of scope.
Q: Does FinCore handle FX (currency conversion)?
No. FinCore is single-currency per transaction. Cross-currency = your bank provider does the conversion; FinCore tracks both sides via separate transactions.
Q: When is v1.0?
Q2 2027. See Roadmap.
Q: Will there be a hosted SaaS?
Not in 2026. Maybe in 2028+ depending on signal. Self-hosted is the supported path for the foreseeable future.
Q: Can I influence the roadmap?
Yes - issues, discussions, sponsorship at higher tiers. See Roadmap#how-priorities-are-decided.
Q: Are PRs welcome?
Yes. See CONTRIBUTING.md. For Y1, bug reports and integration testing are more valuable than feature PRs.
Q: Do I need to sign a CLA?
For substantial PRs, yes - standard for BSL projects. The CLA grants the project the right to relicense your contribution as part of the project (necessary for the BSL → Apache 2.0 auto-conversion to work).
Q: Can I report security issues?
See SECURITY.md. 48h ack, 30-day fix SLA for HIGH/CRITICAL.
docker compose up fails: port already in use
Another service is using port 8080, 5432, 9092, 8081, or 3000. Stop it or override:
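# docker-compose.override.yml (auto-loaded)
services:
  ledger:
    ports: ["18080:8080"]
Compose appends ports entries from an override file to the base list, so if the original host port is still bound after this change, replace the list with the !override YAML tag (recent Compose releases) or edit the mapping in docker-compose.yml.
Ledger service won't start: "Liquibase lock"
A previous startup left a lock. Clear it:
docker compose exec postgres psql -U fincore -c "UPDATE databasechangeloglock SET locked = FALSE;"
Keycloak auth: "invalid_client"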
Default sandbox client secret is demo-secret. Production needs your own.
Kafka producer: "TimeoutException"
Redpanda not yet healthy. Wait 30s. If persistent: docker compose logs redpanda for errors.
Test fails: "Container failed to start"
Testcontainers can't pull images. Ensure Docker is running. Pull manually:
docker pull postgres:17-alpine
docker pull redpandadata/redpanda:v24.3.1
./gradlew build: out of memory
Increase Gradle heap:
export GRADLE_OPTS="-Xmx4g"
./gradlew build
Hibernate: LazyInitializationException
Lazy collection accessed outside transaction. Use @EntityGraph or fetch eagerly. See Code-Rules§5.
docker compose down -v lost my data
Yes, -v removes volumes. Don't use unless intentional.
MapStruct: "Mapper not generated"
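MapStruct is a Java annotation processor, so in a Kotlin build it runs through kapt rather than KSP; check that build.gradle.kts applies the kapt plugin and has kapt("org.mapstruct:mapstruct-processor:1.6.3"). Run ./gradlew kaptKotlin.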
Demo script fails: "jq not found"
Install jq: brew install jq / apt install jq.
- Check this troubleshooting page
- Search GitHub Discussions
- Open a new Discussion (preferred) or Issue (for bugs)
- For commercial inquiries: email per SECURITY.md
- Home - Wiki entry point
- Architecture-Resilience - what's behind these operational behaviors
- Architecture-Observability - how we monitor
- Architecture-Security - how we secure
- Risk-Register - what can go wrong