kronos

Durable workflows you can debug like regular programs.

Time-travel debugger for durable workflows. Record every step, scrub the timeline, replay in your IDE with breakpoints, fork failed runs from any midpoint.

Kronos is a Postgres-native durable workflow engine with a time-travel debugger. Single Go binary, zero external dependencies beyond Postgres. 2,800+ workflow steps/sec.

Why Kronos?

Debugging failed workflows in production is painful. You get stack traces and logs, but you can't actually step through what happened. Kronos fixes this:

  • Time-travel debugger — scrub through any workflow run's timeline, inspect exact inputs/outputs at each step
  • Local replay with breakpoints — kronos replay <run-id> re-executes production runs locally; attach Delve or your IDE debugger
  • Fork from any midpoint — failed at step 5 of 10? Fork from step 5 with a fixed handler, skip re-executing steps 1–4
  • Single binary — no cluster, no Elasticsearch, no separate services. Just Postgres.

Comparison

                       Kronos   Temporal   Hatchet   River
Time-travel debugger     ✅        ❌         ❌        ❌
Local IDE replay         ✅        ❌         ❌        ❌
Fork from midpoint       ✅        ❌         ❌        ❌
Single binary            ✅        ❌         ❌        ✅
Postgres-native          ✅        ✅*        ✅        ✅
Workflow versioning      ✅        ✅         ❌        ❌
Zero external deps       ✅        ❌         ❌        ✅

* Temporal supports Postgres but requires a separate Temporal server cluster + Elasticsearch for production visibility.

Quick Start

# Start Postgres
docker-compose up -d

# Build and run
make build && ./bin/kronos

Open the debugger UI at http://localhost:8080/debugger/

# Submit a workflow
grpcurl -plaintext -d '{
  "workflow_name": "example-workflow",
  "input": "{\"message\": \"hello\"}"
}' localhost:50051 kronos.v1.KronosService/StartWorkflow
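
The same call can be made from Go using the generated stubs in gen/kronos/v1. A minimal sketch follows; the import path and message type names are assumptions based on the proto layout, so check the generated code for the exact identifiers:

package main

import (
    "context"
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    kronosv1 "github.com/RahulModugula/kronos/gen/kronos/v1" // assumed import path
)

func main() {
    // grpc.NewClient requires grpc-go v1.63+; older versions use grpc.Dial.
    conn, err := grpc.NewClient("localhost:50051",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    client := kronosv1.NewKronosServiceClient(conn)
    // Field names mirror the grpcurl payload above; exact Go names may differ.
    resp, err := client.StartWorkflow(context.Background(), &kronosv1.StartWorkflowRequest{
        WorkflowName: "example-workflow",
        Input:        `{"message": "hello"}`,
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("started: %v", resp)
}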

Define a Workflow

wf := workflow.NewWorkflow("data-pipeline", "v1").
    AddStep("fetch-user", fetchUserData, nil).
    AddStep("enrich", enrichProfile, []string{"fetch-user"}).
    AddStep("validate", validate, []string{"enrich"}).
    AddStep("notify-1", sendEmail, []string{"enrich"}).
    AddStep("notify-2", slackMessage, []string{"enrich"}).
    AddStep("finish", summarize, []string{"validate", "notify-1", "notify-2"})

registry.Register(wf)
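
Read as a DAG: enrich depends on fetch-user; validate, notify-1, and notify-2 then fan out in parallel from enrich; finish fans in once all three complete.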

Step functions accept a context and JSON input:

func enrichProfile(ctx context.Context, input json.RawMessage) (json.RawMessage, error) {
    var data struct{ UserID string }
    if err := json.Unmarshal(input, &data); err != nil {
        return nil, err
    }
    // ... do work, producing result ...
    return json.Marshal(result)
}

Replay & Debug

# Replay a production run locally
kronos replay <run-id>

# Pause at a step — attach Delve or VS Code
kronos replay <run-id> --break-at enrich

# Fork from a step — test your fix without re-running upstream
kronos replay <run-id> --fork-from enrich

Output:

[replay] run abc-123 (workflow: data-pipeline v1)
[replay] fetched 24 events

[replay] step 1/6: fetch-user  OK (12ms, diff: 0 bytes)
[replay] step 2/6: enrich      BREAK
[replay] pid=48291 — attach your debugger now, press enter to continue

Architecture

       SubmitWorkflow RPC
              │
              ▼
   ┌──────────────────────┐
   │  workflow_runs row   │      RunStarted event
   │   (status=running)   │              │
   └──────────┬───────────┘              ▼
              │              ┌──────────────────────┐
              │              │  workflow_events     │
              │              │   (append-only log)  │
              ▼              └──────────────────────┘
   For each ready step:
              │
              ▼
   ┌──────────────────────┐
   │   jobs row inserted  │  ◄── existing scheduler/worker
   │ type=kronos.workflow │      infrastructure handles this
   │       _step          │      (zero code changes)
   └──────────┬───────────┘
              ▼
   workflow_step handler:
     • load workflow definition (pinned version)
     • assemble step input from upstream outputs
     • invoke user's StepFunc
     • write events transactionally
     • schedule downstream steps

The workflow engine layers on top of the existing job system — no changes to the scheduler, worker pool, retries, dead letter, or cron. Workflows execute as ordinary jobs through the production-proven pipeline.
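
The debugger's timeline scrubbing falls directly out of this design: the state of a run at any point in time is a fold over its event log up to that point. An illustrative sketch of the idea, with simplified stand-in types rather than Kronos's actual internals:

// Event is a simplified stand-in for a workflow_events row.
type Event struct {
    Seq  int    // position in the append-only log
    Step string // step name
    Kind string // "StepStarted", "StepCompleted", "StepFailed" (illustrative)
}

// StateAt folds events up to sequence seq into a per-step status map,
// assuming events are ordered by Seq. The timeline scrubber is,
// conceptually, this function called with different seq values as you drag.
func StateAt(events []Event, seq int) map[string]string {
    state := make(map[string]string)
    for _, e := range events {
        if e.Seq > seq {
            break
        }
        switch e.Kind {
        case "StepStarted":
            state[e.Step] = "running"
        case "StepCompleted":
            state[e.Step] = "completed"
        case "StepFailed":
            state[e.Step] = "failed"
        }
    }
    return state
}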

Core Features

Durable Workflows

  • DAG-based execution — steps form a directed acyclic graph with parallel branches and fan-in
  • Per-step checkpointing — step outputs are durably recorded; retries resume from the last completed step
  • Event sourcing — every run is an append-only log of events
  • Workflow versioning — in-flight runs pin their definition; new runs use the latest version
  • Blob deduplication — large step inputs/outputs are content-addressed and shared across forks
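
Content addressing, as used for blob deduplication above, means a payload's storage key is a hash of its bytes: identical inputs and outputs resolve to the same key and are stored once, however many runs or forks reference them. A minimal sketch of the scheme (the hash choice here is an assumption):

import (
    "crypto/sha256"
    "encoding/hex"
)

// blobKey derives a storage key from the payload itself. A fork that reuses
// steps 1-4 produces the same keys for their outputs and writes nothing new.
func blobKey(payload []byte) string {
    sum := sha256.Sum256(payload)
    return hex.EncodeToString(sum[:])
}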

Time-Travel Debugger UI

  • Runs list — filter by workflow/status, card layout with duration and status badges
  • Run detail — 3-pane layout: step tree, timeline scrubber, step inspector
  • Timeline scrubber — click/drag to rewind through events, see step state at any point
  • Step inspection — syntax-highlighted JSON for inputs, outputs, errors
  • Live streaming — SSE auto-updates for in-progress runs
  • Fork interface — one click to fork a failed run from any step

Local Replay & IDE Debugging

  • kronos replay <run-id> — re-executes a production run locally using recorded outputs
  • kronos replay <run-id> --break-at step_name — pause at a step; attach Delve / VS Code / GoLand
  • kronos replay <run-id> --fork-from step_name — test a fix without re-executing upstream steps
  • Output diffs surface divergence from production
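
One simple way to detect divergence is to normalize the JSON before comparing recorded and replayed outputs; a sketch of that idea (the real diff in internal/replay reports byte counts per the output above and is likely richer):

import (
    "encoding/json"
    "reflect"
)

// diverged reports whether a replayed output differs from the recorded one,
// ignoring JSON key order and whitespace.
func diverged(recorded, replayed json.RawMessage) (bool, error) {
    var a, b any
    if err := json.Unmarshal(recorded, &a); err != nil {
        return true, err
    }
    if err := json.Unmarshal(replayed, &b); err != nil {
        return true, err
    }
    return !reflect.DeepEqual(a, b), nil
}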

Production Infrastructure

  • PostgreSQL persistence — single source of truth, survives crashes
  • Leader election — advisory locks ensure one active scheduler per cluster (see the sketch after this list)
  • Multi-node clustering — stateless API nodes, one scheduler; scale horizontally
  • Prometheus metrics — workflow throughput, step latency, queue depth
  • gRPC health checks — ready for Kubernetes liveness/readiness probes
  • Graceful shutdown — drains in-flight steps on SIGTERM
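
The advisory-lock pattern behind leader election is standard Postgres: whichever node holds pg_try_advisory_lock on an agreed key is the leader, and the lock vanishes with the session if that node dies. A minimal sketch; the key value and connection handling are illustrative, not Kronos's actual implementation:

import (
    "context"
    "database/sql"
)

const leaderKey int64 = 0x6b726f6e6f73 // illustrative; any value all nodes agree on

// tryLead attempts to acquire leadership. Advisory locks are session-scoped,
// so this must run on a dedicated *sql.Conn (from db.Conn), not a pooled
// query; the lock is released automatically when the connection drops.
func tryLead(ctx context.Context, conn *sql.Conn) (bool, error) {
    var got bool
    err := conn.QueryRowContext(ctx,
        "SELECT pg_try_advisory_lock($1)", leaderKey).Scan(&got)
    return got, err
}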

Built-in Job Queue

For simple one-off background work, Kronos also functions as a traditional job queue:

  • Job priorities — 1–10 scale
  • Cron scheduling — recurring jobs
  • Exponential backoff with jitter — retries spread out exponentially with randomized delays to avoid thundering herds (see the sketch after this list)
  • Dead-letter handling — failed jobs preserved for manual inspection
  • Idempotency keys — deduplication across retries
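
The usual shape for backoff with jitter is "full jitter": double a base delay per attempt, cap it, then draw uniformly from [0, cap). A sketch of that formula; Kronos's actual parameters live in internal/retry and may differ:

import (
    "math/rand/v2" // Go 1.22+
    "time"
)

// backoff returns the delay before retry number attempt (0-indexed),
// using full jitter: uniform in [0, min(base*2^attempt, max)).
func backoff(attempt int, base, max time.Duration) time.Duration {
    d := base << uint(attempt)
    if d <= 0 || d > max { // d <= 0 guards shift overflow
        d = max
    }
    return time.Duration(rand.Int64N(int64(d)))
}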

Performance

Worker pool throughput (CPU-bound, no Postgres):

  • 10 workers: 1,256,838 jobs/sec (795ns/op)
  • 50 workers: 1,135,330 jobs/sec (880ns/op)
  • 100 workers: 1,117,641 jobs/sec (894ns/op)

End-to-end throughput (with Postgres 17, 50 workers):

  • ~2,800 workflow steps/sec

Benchmark locally:

docker-compose up -d
make bench BENCH_FLAGS="-jobs=10000 -workers=50"

Configuration

Environment Variable      Default                                                          Description
DATABASE_URL              postgres://kronos:kronos@localhost:5432/kronos?sslmode=disable   PostgreSQL DSN
GRPC_ADDR                 :50051                                                           gRPC API listen address
ADMIN_ADDR                :8080                                                            HTTP admin server (debugger UI, dead-letter)
METRICS_ADDR              :2112                                                            Prometheus metrics address
WORKER_CONCURRENCY        10                                                               Worker pool size
SCHEDULER_POLL_INTERVAL   2s                                                               Job polling interval
SCHEDULER_BATCH_SIZE      10                                                               Max jobs claimed per poll
JOB_TIMEOUT               30m                                                              Maximum execution time per step

Running Tests

make test                # Unit tests (race detector enabled)
make test-integration    # Integration tests (requires Docker)

Project Structure

kronos/
├── cmd/
│   ├── kronos/              main entrypoint + replay subcommand
│   └── bench/               throughput benchmark binary
├── internal/
│   ├── api/                 gRPC server (jobs + workflows)
│   ├── config/              environment-based configuration
│   ├── cron/                cron scheduling
│   ├── debugger/            web UI for time-travel debugging (HTMX + Alpine + Tailwind)
│   ├── health/              gRPC health protocol
│   ├── leader/              PostgreSQL advisory lock leader election
│   ├── metrics/             Prometheus instrumentation
│   ├── middleware/          gRPC interceptors
│   ├── replay/              local replay client + replayer engine + JSON diff
│   ├── retry/               exponential backoff
│   ├── scheduler/           job polling loop
│   ├── store/               job persistence (Store interface + Postgres)
│   ├── worker/              goroutine pool + handler registry
│   └── workflow/            workflow engine, event log, step dispatcher, fork
├── proto/kronos/v1/         protobuf definitions
├── gen/kronos/v1/           generated gRPC code
└── migrations/              SQL migrations

Roadmap

v1 (current):

  • Durable workflow engine with DAG support
  • Time-travel debugger UI
  • Local replay and IDE debugging
  • Fork from any midpoint
  • Workflow versioning

v1.1:

  • Typed step I/O via Go generics
  • Embeddable library mode (no separate binary)
  • Workflow signals & queries
  • Human-in-the-loop approval steps

v2:

  • AI step primitives (LLM calls, token counting, cost tracking)
  • Python / TypeScript client SDKs
  • Nested fan-out (map steps)

License

MIT
