Skill goes in. Durable workflow comes out. Free, offline, and deterministic.
npx tenure ./skill # fix it
npx tenure score ./skill # see what's broken
npx tenure openclaw-skills # compile to .skillx
npx tenure run ./skill # print localhost Temporal run command

All commands run locally. No account. No API key. No network calls. No tokens consumed.
Your 10-step agent succeeds 60% of the time. Tool calls fail 3–15% in production. At 95% per-step reliability, a 10-step workflow lands at 60% end-to-end. A 20-step workflow hits 36%. Every framework treats this as an edge case. It's not. It's math.
No framework declares whether a step is safe to retry, what happens when a mutation fails halfway, or whether the agent should ask a human before executing. Every skill runs with identical execution guarantees — a web search retries the same way as a payment. tenure fixes this.
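The end-to-end number is just compounded per-step reliability, which is easy to check directly:

```typescript
// End-to-end success probability of an n-step workflow where each
// step independently succeeds with probability p: p^n.
function endToEnd(p: number, n: number): number {
  return Math.pow(p, n);
}

console.log(endToEnd(0.95, 10).toFixed(2)); // "0.60" (10 steps at 95%)
console.log(endToEnd(0.95, 20).toFixed(2)); // "0.36" (20 steps at 95%)
```

The curve only gets worse as agents grow more capable and chains get longer.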
npx tenure score ./social-post
tenure score · social-post v1.0
Score: 18/100 · probationary
✗ Crash Recovery 0/20 No checkpoint decomposition. Single wrapper.
✗ Idempotency 0/20 3 mutations, 0 dedup keys.
✗ Compensation 0/20 No rollback on any mutation.
✓ HITL Gates 15/20 Approval prompt detected in step 4.
✗ Budget 3/20 No token cap. Unbounded inference.
Gaps
→ step 2 "generate image with DALL-E" — no retry policy, 3% call failure rate
→ step 3 "post to Twitter" — no idempotency key, retry = duplicate post
→ step 5 "post to LinkedIn" — no compensation, partial failure = inconsistent state
→ no budget enforcement before LLM dispatch
Run npx tenure ./social-post to fix
The agent posted the same tweet four times yesterday. The image generation burned $12 in retries. The LinkedIn post went out but Twitter failed, so the content exists on one platform but not the other with no rollback. The developer found out from a customer.
npx tenure ./social-post
Compiling social-post...
step 3 "post to LinkedIn" uses linkedin.posts.create
Tool not in registry. How should this step be classified?
1. Read — retrieves data, no side effects, safe to retry
2. Compute — pure transformation, no external interaction
3. Mutation — modifies external state, needs idempotency
4. Human — requires approval before executing
5. Skip — classify later
> 3
What reverses this action?
1. linkedin.posts.delete — delete the post
2. Irreversible — cannot be undone (like sending email)
3. Don't know — figure out later
> 1
✓ linkedin.posts.create → mutation, compensation: linkedin.posts.delete
✓ Written to registry. Future compilations will use this classification.
...
✓ Crash Recovery 18/20 5 Activities, checkpointed
✓ Idempotency 14/20 Keys: platform+content_hash+date, img_prompt+params
✓ Compensation 16/20 Compensate: delete post. Irreversible: image generation.
✓ HITL Gates 15/20 Signal gate on publish step
✓ Budget 20/20 Cap: 100k tokens. Pre-dispatch enforcement.
Score: 18 → 84/100 · compiled
Written: ./social-post.skillx
◢ View workflow → http://localhost:8233/workflows/social-post-d4f7a1
Click the link. Watch the workflow execute in Temporal's web UI. Kill the process. Watch it resume from the last checkpoint. The tweet posts exactly once.
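The exactly-once behavior rests on idempotency-key dedup. A minimal sketch of the mechanism, with a hypothetical key mirroring the report's platform + content_hash + date format:

```typescript
// Hypothetical sketch of idempotency-key dedup: a mutation executes
// only if its key has not been seen; retries return the cached result.
const seen = new Map<string, string>();

function postOnce(key: string, doPost: () => string): string {
  const cached = seen.get(key);
  if (cached !== undefined) return cached; // dedup fires, no duplicate post
  const result = doPost();
  seen.set(key, result);
  return result;
}

const key = "twitter:3f9a1c:2026-02-01"; // platform + content_hash + date
let calls = 0;
const first = postOnce(key, () => { calls += 1; return "tweet-1001"; });
const retry = postOnce(key, () => { calls += 1; return "tweet-1002"; });
// calls === 1 and retry === first: the tweet posts exactly once
```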
The five properties (Crash Recovery, Idempotency, Compensation, HITL Gates, Budget) and their weights aren't opinions. They're derived from a peer-reviewed empirical study. We compiled 108 agent skills from three independent ecosystems (Anthropic, OpenAI, OpenClaw). 66 produced valid structural reports. We then ran 346 failure injection tests — SIGKILL between steps, SIGKILL during mutations, replay after state change, budget exhaustion, gate removal, repeated execution with random crashes — across 31 skills with fully controllable external state.
The regression connected structural metrics to injection outcomes: which gaps predicted which failures, and how strongly. The weights in tenure score are those regression coefficients, not expert guesses.
The formal definition of durability (D1–D5) is the first academic formalization of what "durable execution" means for agent skills. The five properties compose from Garcia-Molina's saga theory (1987), Sozer's decomposition theory (2009), and Rabanser's agent reliability framework (ICLR 2026). Two independent derivations from the same literature converged on the same five properties.
Findings from the corpus: 89% of skills lack idempotency on mutations. 95% lack compensation. 100% lack budget enforcement. 57% couldn't be compiled into durable workflows at all.
Paper: Lalge & Basili, "A Formal Definition of Durability for Autonomous Agent Skills," 2026. Preprint, data, and regression code at github.com/tenure/durability-study.
The agent runs a nightly data pipeline. It crashes at step 7 of 10. On restart, it starts over from step 1. Seven steps of work — API calls, database writes, file transformations — gone. The three database writes from steps 3, 5, and 6 already committed. Now they'll commit again. Duplicated rows, corrupted state, silent data loss.
tenure score · nightly-pipeline v2.1
Score: 12/100 · probationary
✗ Crash Recovery 0/20 No decomposition. 10 steps in single execution.
✗ Idempotency 4/20 2 of 5 mutations missing dedup key.
✗ Compensation 0/20 No rollback on 3 database writes.
✗ HITL Gates 0/20 Destructive migration in step 9 has no approval gate.
✗ Budget 8/20 Hard timeout but no token-level cap.
Gaps
→ steps 1–10 execute as monolith — crash at any point restarts all
→ step 3 "insert records" — no idempotency key, restart = duplicate rows
→ step 5 "update balances" — no compensation, partial update = inconsistent
→ step 9 "run migration" — destructive, no human approval gate
tenure · nightly-pipeline v2.1
Score: 12 → 91/100 · reviewed
✓ Crash Recovery 20/20 10 Activities, each checkpointed independently
✓ Idempotency 18/20 Keys: record_hash+batch_id, balance_id+date
✓ Compensation 17/20 Compensate: soft-delete rows, reverse balance delta
✓ HITL Gates 18/20 Signal gate on migration step. Zero compute while waiting.
✓ Budget 18/20 Cap: 200k tokens. Per-Activity tracking.
Written: ./nightly-pipeline.skillx
◢ View workflow → http://localhost:8233/workflows/nightly-pipeline-b8c2e5
Kill the worker between step 6 and step 7. Restart. It resumes at step 7. Steps 1–6 are not re-executed. Database writes are not duplicated. Open the Temporal UI and watch replay produce identical results.
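The resume-from-checkpoint idea can be sketched in a few lines. This is an illustration of the concept, not Temporal's actual replay machinery:

```typescript
// Completed steps persist their results; on restart, persisted steps
// are replayed from storage instead of re-executed.
type Checkpoints = Map<number, string>;

function runPipeline(steps: Array<() => string>, cp: Checkpoints): string[] {
  return steps.map((step, i) => {
    if (cp.has(i)) return cp.get(i)!; // checkpointed: replay, don't re-run
    const out = step();               // first execution: run and checkpoint
    cp.set(i, out);
    return out;
  });
}

// Simulate a crash after step 2 of 3: steps 0 and 1 already checkpointed.
const cp: Checkpoints = new Map([[0, "extract"], [1, "transform"]]);
let executed = 0;
const out = runPipeline(
  [() => "extract", () => "transform", () => { executed += 1; return "load"; }],
  cp,
);
// executed === 1: only the unfinished step runs; no duplicate writes
```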
The agent sends Slack notifications, updates a CRM, and emails a summary. The Slack API returns a timeout. The framework retries. The message was actually delivered — the timeout was on the response, not the request. The customer gets two identical Slack messages. The CRM has two entries. The email went out twice. Nobody knows which copy is the original.
tenure score · raas-video-pipeline v1.0
Weights: paper v1.0 · N=31 behavioral · directional
Score: 18/100 · probationary
✗ D1 Crash Recovery 3/20 Monolithic. 9 steps, 0 checkpointed boundaries.
✗ D2 Deterministic Replay 2/20 3 non-deterministic tools, no event history.
✗ D3 Idempotency 0/20 5 mutations, 0 dedup keys.
✗ D4 Compensation 4/20 Asset cache for Whisper. Avatar/VO/Image irreversible.
✗ D5 Budget 9/20 No per-step token cap. No $ ceiling per video.
Cross-cutting
○ HITL Gate 0/10 No approval gate on HeyGen ($0.30–1.50/clip).
Gaps
→ step 2 "generate_vo.py" — no idempotency key, retry = duplicate ElevenLabs charge
→ step 4 "generate_images.py" — no dedup key, 4 scenes × retry = 4× Replicate charge
→ step 5 "generate_avatar.py" — no idempotency key, HeyGen call not cached by (vo_hash, start, duration)
→ step 5 "generate_avatar.py" — no HITL gate on most expensive step in pipeline
→ step 7 "remotion render" — no checkpoint, crash mid-render = full re-encode
→ step 3b "word boundary QA" — manual gate, no Signal primitive, blocks workflow thread
→ retry policy global — same policy for Tesseract OCR (read) as HeyGen (expensive mutation)
→ no budget cap — runaway script could burn $50+ in API calls before halting
Classification confidence
14 steps classified high · 3 medium · 1 unknown (word boundary QA shape)
Run npx tenure ./raas-video-pipeline to fix
View audit log → raas-video-pipeline.audit.json
$ npx tenure ./raas-video-pipeline
Compiling raas-video-pipeline...
Reading SKILL.md... 11 pipeline steps detected.
Reading compiler cache... 4 tools known, 7 tools unknown.
─────────────────────────────────────────────────────
Step 1 "Generate script JSON"
Detected: artifact_write (file output)
Is this step idempotent on re-run?
1. Yes — same input always produces same script
2. No — LLM call, output varies
3. Yes if seeded
> 3
What's the idempotency key?
1. script:{title_hash}:{seed} (recommended)
2. Custom
> 1
✓ Key: script:{title_hash}:{seed}
✓ Classification: idempotent_mutation (high confidence)
─────────────────────────────────────────────────────
Step 2 "Generate VO via ElevenLabs"
Detected: external API call (elevenlabs.com)
Tool not in registry. How should this step be classified?
1. Read — retrieves data, no side effects, safe to retry
2. Compute — pure transformation, no external interaction
3. Mutation — modifies external state, needs idempotency
4. Human — requires approval before executing
5. Skip — classify later
> 3
What's the unique key for this call?
1. script_hash + voice_id (recommended — same script + voice = same audio)
2. Custom
> 1
What reverses this action?
1. Cache the MP3 (idempotent re-fetch from local store)
2. Delete from ElevenLabs (no such API)
3. Irreversible
> 1
Detected HITL marker: "⚠️ Send clip to user for approval before continuing"
Adding approval gate after Step 2.
✓ Idempotency: vo:{script_hash}:{voice_id}
✓ Compensation: asset cache
✓ HITL gate: signal wait on "vo_approved"
✓ Registry updated. Future skills using elevenlabs.tts inherit this.
─────────────────────────────────────────────────────
Step 3 "Whisper canon extraction"
Detected: compute (local faster-whisper)
Deterministic on same audio input.
✓ Classification: deterministic_computation (high confidence)
✓ Cache key: whisper:{audio_hash}
─────────────────────────────────────────────────────
Step 3b "Word boundary QA"
Detected HITL marker: "REQUIRED before Step 4"
Is this a blocking human review?
1. Yes — workflow halts until human confirms
2. No — automated QA check
> 1
✓ HITL gate: signal wait on "boundaries_confirmed"
✓ Zero-compute while waiting
─────────────────────────────────────────────────────
Step 4 "Generate background images via Replicate"
Detected: external API call (api.replicate.com)
Tool not in registry. Classification?
1. Read
2. Compute
3. Mutation
4. Human
> 3
Unique key?
1. prompt_hash + model_version + seed (recommended)
2. Custom
> 1
Reversal?
1. Cache the image
2. Irreversible
> 1
✓ Idempotency: image:{prompt_hash}:{model_version}:{seed}
✓ Compensation: asset cache
✓ Registry updated.
─────────────────────────────────────────────────────
Step 4b "OCR extraction + karaoke plan"
Detected: compute (tesseract + deterministic mapping)
✓ Classification: deterministic_computation (high confidence)
✓ Cache key: ocr:{image_hash}
─────────────────────────────────────────────────────
Step 5 "Generate avatars via HeyGen"
Detected: external API call (api.heygen.com)
Tool not in registry. Classification?
1. Read
2. Compute
3. Mutation
4. Human
> 3
Unique key?
1. vo_hash + start + duration (recommended)
2. scene_id + vo_hash
3. Custom
> 1
Reversal?
1. Cache the clip
2. Delete from HeyGen (DELETE /v2/video/{id})
3. Irreversible
> 1
Budget concern: HeyGen charges per minute of avatar video.
Set per-call budget cap?
1. $5 per avatar clip (recommended for 60s max)
2. Custom
3. No cap
> 1
✓ Idempotency: avatar:{vo_hash}:{start}:{duration}
✓ Compensation: asset cache
✓ Budget: $5/call
✓ Registry updated.
─────────────────────────────────────────────────────
Step 6 "Assemble Remotion props"
Detected: compute (TypeScript file generation)
✓ Classification: deterministic_computation (high confidence)
─────────────────────────────────────────────────────
Step 7 "Render MP4 via Remotion"
Detected: artifact_write (local MP4)
Is render deterministic?
1. Yes — same props always produce same MP4
2. No
> 1
✓ Idempotency: render:{props_hash}
✓ Compensation: cached MP4
─────────────────────────────────────────────────────
Step 8 "Git commit"
Detected: mutation (git commit)
Unique key?
1. tree_hash + message_hash (recommended)
2. Custom
> 1
Reversal?
1. Revert commit (git revert {sha})
2. Irreversible
> 1
✓ Idempotency: commit:{tree_hash}:{message_hash}
✓ Compensation: git revert
─────────────────────────────────────────────────────
Workflow-level budget
Estimated per-run cost: $8–12 (ElevenLabs $0.30 + Replicate $0.50 + HeyGen $7–11)
Set workflow budget cap?
1. $15 per video (recommended — 30% margin)
2. $25 per video
3. Custom
> 1
✓ Budget: $15/video, pre-dispatch enforcement
─────────────────────────────────────────────────────
Compiling .skillx...
✓ Crash Recovery 18/20 11 Activities, checkpointed at artifact boundaries
✓ Idempotency 18/20 6 mutation keys assigned
✓ Compensation 15/20 5 cache + 1 git revert; 0 unresolved
✓ HITL Gates 16/20 2 approval gates (VO approval, boundary QA)
✓ Budget 20/20 $15/video cap, per-call cap on HeyGen
Score: 18 → 87/100 · reviewed
Registry contributions:
elevenlabs.tts → mutation, cache-compensable
replicate.predict → mutation, cache-compensable
heygen.video.generate → mutation, cache-compensable, cost-bounded
Written: ./raas-video-pipeline.skillx
◢ View workflow → http://localhost:8233/workflows/raas-video-pipeline-a7c3e1
Share these classifications with the registry? (y/n)
> y
Replay the workflow from Temporal history. The Slack message is not re-sent. The CRM entry is not re-created. Idempotency keys match, dedup fires, mutations are skipped. The email sends exactly once.
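Budget enforcement in these reports is pre-dispatch: the cap is checked before a step spends tokens or dollars, not after the bill arrives. A minimal sketch with a hypothetical API:

```typescript
// Hypothetical sketch of pre-dispatch budget enforcement.
class Budget {
  private spent = 0;
  constructor(private cap: number) {}
  charge(estimate: number): void {
    if (this.spent + estimate > this.cap) {
      throw new Error(`budget exceeded: ${this.spent} + ${estimate} > ${this.cap}`);
    }
    this.spent += estimate; // reserve before dispatch
  }
}

const budget = new Budget(100_000); // 100k-token cap, as in the report
budget.charge(60_000);              // step 1: allowed
budget.charge(30_000);              // step 2: allowed, 90k spent
// budget.charge(20_000) would throw before any tokens are consumed
```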
The agent processes expense reports. It reads receipts, categorizes expenses, and submits reimbursements to the payment system. The CFO asks: which expenses were auto-approved vs. human-approved? What happens if the payment API fails after the expense is marked as processed? Can you prove the agent didn't submit the same reimbursement twice? There is no answer. There is no audit trail. There is no proof.
tenure score · expense-processor v3.0
Score: 29/100 · probationary
✗ Crash Recovery 8/20 Some decomposition. Payment step not isolated.
✗ Idempotency 6/20 Receipt read is cached. Payment has no dedup key.
✗ Compensation 0/20 No reversal on failed reimbursement.
✗ HITL Gates 0/20 Payment submitted without human approval.
✗ Budget 15/20 Token cap exists. No per-step tracking.
Gaps
→ step 5 "submit reimbursement" — no idempotency key, retry = double payment
→ step 5 "submit reimbursement" — no human approval gate on financial action
→ step 5 "submit reimbursement" — no compensation (refund) on downstream failure
→ audit log does not record classification decisions
tenure · expense-processor v3.0
Score: 29 → 94/100 · reviewed
✓ Crash Recovery 18/20 7 Activities, payment step fully isolated
✓ Idempotency 20/20 Keys: receipt_hash+employee_id, payment_ref+amount
✓ Compensation 18/20 Compensate: initiate refund. Irreversible: receipt scan.
✓ HITL Gates 20/20 Signal gate on reimbursement. Zero compute while waiting.
✓ Budget 18/20 Cap: 75k tokens. Per-Activity tracking.
Written: ./expense-processor.skillx
◢ View workflow → http://localhost:8233/workflows/expense-processor-c7d4b2
The .skillx audit log records every classification decision. Which steps are mutations. Which have compensation actions. Which require human approval. The CFO gets a document, not a promise.
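The compensation model is saga-style: each completed mutation records its reversal, and a downstream failure runs the recorded reversals in reverse order. A minimal sketch with hypothetical step names:

```typescript
// Hypothetical sketch of saga-style compensation: completed mutations
// record reversals; a failure unwinds them newest-first.
type SagaStep = { run: () => void; compensate: () => void };

function runSaga(steps: SagaStep[]): string {
  const done: SagaStep[] = [];
  try {
    for (const s of steps) { s.run(); done.push(s); }
    return "committed";
  } catch {
    for (const s of done.reverse()) s.compensate(); // undo in reverse order
    return "compensated";
  }
}

const log: string[] = [];
const outcome = runSaga([
  { run: () => { log.push("mark processed"); },
    compensate: () => { log.push("unmark processed"); } },
  { run: () => { throw new Error("payment API down"); },
    compensate: () => { log.push("initiate refund"); } },
]);
// outcome === "compensated"; only "unmark processed" runs as compensation,
// since the failed payment never committed and needs no refund
```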
A .skillx is a skill manifest, not a script. It describes what to execute and how to survive failure — but contains no executable code itself. The runtime decides how to execute it.
| Runtime | Status | Notes |
|---|---|---|
| Temporal | Native | Full primitive support. Default target. |
| Inngest | Beta | Event-driven resolution. Core primitives mapped. |
| Restate | Beta | Virtual object resolution. Core primitives mapped. |
| Self-hosted | Yes | .skillx is JSON. Write your own resolver. |
A .skillx declares an execution graph with typed steps, classified primitives, and resolution hints. It's portable JSON — readable, diffable, transportable over HTTP or cat. A .skillx from an untrusted source cannot run arbitrary code. It can only request primitives from the runtime's own catalog.
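To make the shape concrete, here is a sketch of what a .skillx manifest might contain. The field names are illustrative assumptions, not tenure's actual schema:

```typescript
// Hypothetical .skillx manifest shape (illustrative field names).
interface SkillxStep {
  id: string;
  primitive: "read" | "compute" | "mutation" | "human";
  idempotencyKey?: string; // key template for mutations
  compensation?: string;   // reversing call, or "irreversible"
}

interface Skillx {
  name: string;
  version: string;
  budget: { maxTokens: number };
  steps: SkillxStep[];
}

const manifest: Skillx = {
  name: "social-post",
  version: "1.0",
  budget: { maxTokens: 100_000 },
  steps: [
    {
      id: "post-linkedin",
      primitive: "mutation",
      idempotencyKey: "platform+content_hash+date",
      compensation: "linkedin.posts.delete",
    },
    { id: "approve-publish", primitive: "human" },
  ],
};
// Portable JSON: JSON.stringify(manifest) is the artifact; no code inside.
```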
SKILL.md is the source code. The A10 compiler is the compiler. The .skillx is the build artifact.
A10 classifies every step deterministically — no LLM calls, no inference, no probability. Tool lookup, mapping lookup, verb matching. Same input always produces the same output. The classification taxonomy grows with every compilation: novel tool classifications write back to the mapping, so the compiler improves by compiling more skills.
The compilation report logs every decision: which sources were consulted, which classification was chosen, what confidence level was assigned, and why. When the compiler doesn't know, it says unknown — not a defect, a measurement of the compiler's knowledge boundary.
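The lookup-then-verb-match pipeline can be sketched as follows. The registry entries and verb lists here are hypothetical stand-ins for the real mapping in src/mapping/:

```typescript
// Hypothetical sketch of deterministic classification: registry lookup
// first, then verb matching on the tool name; "unknown" when neither
// source matches. No LLM calls; same input always yields same output.
const registry = new Map<string, string>([
  ["linkedin.posts.create", "mutation"], // learned from a prior compilation
]);

const verbHints: Array<[RegExp, string]> = [
  [/^(get|list|search|fetch|read)/, "read"],
  [/^(create|post|insert|update|delete|send)/, "mutation"],
];

function classify(tool: string): string {
  const known = registry.get(tool);
  if (known !== undefined) return known;      // registry hit
  const verb = tool.split(".").pop() ?? tool; // last segment as the verb
  for (const [re, cls] of verbHints) {
    if (re.test(verb)) return cls;            // verb-match fallback
  }
  return "unknown";                           // knowledge boundary, logged as such
}

console.log(classify("linkedin.posts.create"));  // "mutation" (registry)
console.log(classify("twitter.timeline.fetch")); // "read" (verb match)
console.log(classify("acme.widget.frobnicate")); // "unknown"
```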
Temporal is the kernel. Tenure is the compiler. Every other agent framework is an interpreted language running directly on the kernel without a compilation step, hoping the kernel figures out the right execution strategy at runtime. Tenure is the first tool that analyzes the skill before it hits the kernel and tells the kernel exactly how to run it durably.
Not a framework. Keep CrewAI, LangGraph, AutoGen, ADK, Pydantic AI — whatever you use to decide what to do. Tenure decides how to survive doing it.
Not a runtime. Temporal, Inngest, and Restate are runtimes. Tenure compiles skills into artifacts those runtimes execute. Different layer.
Not an agent. Tenure does not reason, plan, or call LLMs. It classifies steps and assigns durability primitives. Deterministic in, deterministic out.
Not an LLM wrapper. The compiler uses zero LLM calls. Classification is structural analysis — tool lists, verb matching, mapping lookup. No tokens consumed during compilation.
# Compile any skill package into a portable .skillx artifact (no Temporal required)
npx tenure openclaw-skills ./my-skill
# Compile with explicit output path
npx tenure openclaw-skills ./my-skill --output ./my-skill.skillx
# Start a local Temporal server (one-time setup)
temporal server start-dev
# Score a skill
npx tenure score ./my-skill
# Fix it and deploy
npx tenure ./my-skill
# Watch it run
open http://localhost:8233

The classification taxonomy is open. The tool lists, verb categories, and durability mapping are in src/mapping/ — contributions that improve classification coverage directly improve every future compilation.
If you score a skill and the classifications look wrong, that's a contribution: file an issue with the skill and the audit log. Wrong classifications with good audit trails are how the compiler improves.
MIT
Three commands. Diagnose, fix, compile. Zero install. Portable .skillx output.