Skill goes in. Durable workflow comes out. Free, offline, and deterministic.
npx tenure ./skill # fix it
npx tenure score ./skill # see what's broken
npx tenure openclaw-skills # compile to .skillx
npx tenure run ./skill # print localhost Temporal run command

All commands run locally. No account. No API key. No network calls. No tokens consumed.
Your 10-step agent succeeds 60% of the time. Tool calls fail 3–15% in production. At 95% per-step reliability, a 10-step workflow lands at 60% end-to-end. A 20-step workflow hits 36%. Every framework treats this as an edge case. It's not. It's math.
No framework declares whether a step is safe to retry, what happens when a mutation fails halfway, or whether the agent should ask a human before executing. Every skill runs with identical execution guarantees — a web search retries the same way as a payment. tenure fixes this.
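The end-to-end number is just compounded per-step reliability, which is easy to check directly:

```typescript
// End-to-end success probability of an n-step workflow where each
// step independently succeeds with probability p: p^n.
function endToEnd(p: number, n: number): number {
  return Math.pow(p, n);
}

console.log(endToEnd(0.95, 10).toFixed(2)); // "0.60" (10 steps at 95%)
console.log(endToEnd(0.95, 20).toFixed(2)); // "0.36" (20 steps at 95%)
```

The curve only gets worse as agents grow more capable and chains get longer.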
npx tenure score ./social-post
tenure score · social-post v1.0
Score: 18/100 · probationary
✗ Crash Recovery 0/20 No checkpoint decomposition. Single wrapper.
✗ Idempotency 0/20 3 mutations, 0 dedup keys.
✗ Compensation 0/20 No rollback on any mutation.
✓ HITL Gates 15/20 Approval prompt detected in step 4.
✗ Budget 3/20 No token cap. Unbounded inference.
Gaps
→ step 2 "generate image with DALL-E" — no retry policy, 3% call failure rate
→ step 3 "post to Twitter" — no idempotency key, retry = duplicate post
→ step 5 "post to LinkedIn" — no compensation, partial failure = inconsistent state
→ no budget enforcement before LLM dispatch
Run npx tenure ./social-post to fix
The agent posted the same tweet four times yesterday. The image generation burned $12 in retries. The LinkedIn post went out but Twitter failed, so the content exists on one platform but not the other with no rollback. The developer found out from a customer.
npx tenure ./social-post
Compiling social-post...
step 3 "post to LinkedIn" uses linkedin.posts.create
Tool not in registry. How should this step be classified?
1. Read — retrieves data, no side effects, safe to retry
2. Compute — pure transformation, no external interaction
3. Mutation — modifies external state, needs idempotency
4. Human — requires approval before executing
5. Skip — classify later
> 3
What reverses this action?
1. linkedin.posts.delete — delete the post
2. Irreversible — cannot be undone (like sending email)
3. Don't know — figure out later
> 1
✓ linkedin.posts.create → mutation, compensation: linkedin.posts.delete
✓ Written to registry. Future compilations will use this classification.
...
✓ Crash Recovery 18/20 5 Activities, checkpointed
✓ Idempotency 14/20 Keys: platform+content_hash+date, img_prompt+params
✓ Compensation 16/20 Compensate: delete post. Irreversible: image generation.
✓ HITL Gates 15/20 Signal gate on publish step
✓ Budget 20/20 Cap: 100k tokens. Pre-dispatch enforcement.
Score: 18 → 84/100 · compiled
Written: ./social-post.skillx
◢ View workflow → http://localhost:8233/workflows/social-post-d4f7a1
Click the link. Watch the workflow execute in Temporal's web UI. Kill the process. Watch it resume from the last checkpoint. The tweet posts exactly once.
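The exactly-once behavior rests on idempotency-key dedup. A minimal sketch of the mechanism, with a hypothetical key mirroring the report's platform + content_hash + date format:

```typescript
// Hypothetical sketch of idempotency-key dedup: a mutation executes
// only if its key has not been seen; retries return the cached result.
const seen = new Map<string, string>();

function postOnce(key: string, doPost: () => string): string {
  const cached = seen.get(key);
  if (cached !== undefined) return cached; // dedup fires, no duplicate post
  const result = doPost();
  seen.set(key, result);
  return result;
}

const key = "twitter:3f9a1c:2026-02-01"; // platform + content_hash + date
let calls = 0;
const first = postOnce(key, () => { calls += 1; return "tweet-1001"; });
const retry = postOnce(key, () => { calls += 1; return "tweet-1002"; });
// calls === 1 and retry === first: the tweet posts exactly once
```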
The five properties (Crash Recovery, Idempotency, Compensation, HITL Gates, Budget) and their weights aren't opinions. They're derived from a peer-reviewed empirical study. We compiled 108 agent skills from three independent ecosystems (Anthropic, OpenAI, OpenClaw). 66 produced valid structural reports. We then ran 346 failure injection tests — SIGKILL between steps, SIGKILL during mutations, replay after state change, budget exhaustion, gate removal, repeated execution with random crashes — across 31 skills with fully controllable external state.
The regression connected structural metrics to injection outcomes: which gaps predicted which failures, and how strongly. The weights in tenure score are those regression coefficients, not expert guesses.
The formal definition of durability (D1–D5) is the first academic formalization of what "durable execution" means for agent skills. The five properties compose from Garcia-Molina's saga theory (1987), Sozer's decomposition theory (2009), and Rabanser's agent reliability framework (ICLR 2026). Two independent derivations from the same literature converged on the same five properties.
Findings from the corpus: 89% of skills lack idempotency on mutations. 95% lack compensation. 100% lack budget enforcement. 57% couldn't be compiled into durable workflows at all.
Paper: Lalge & Basili, "A Formal Definition of Durability for Autonomous Agent Skills," 2026. Preprint, data, and regression code at github.com/tenure/durability-study.
The agent runs a nightly data pipeline. It crashes at step 7 of 10. On restart, it starts over from step 1. Seven steps of work — API calls, database writes, file transformations — gone. The three database writes from steps 3, 5, and 6 already committed. Now they'll commit again. Duplicated rows, corrupted state, silent data loss.
tenure score · nightly-pipeline v2.1
Score: 12/100 · probationary
✗ Crash Recovery 0/20 No decomposition. 10 steps in single execution.
✗ Idempotency 4/20 2 of 5 mutations missing dedup key.
✗ Compensation 0/20 No rollback on 3 database writes.
✗ HITL Gates 0/20 Destructive migration in step 9 has no approval gate.
✗ Budget 8/20 Hard timeout but no token-level cap.
Gaps
→ steps 1–10 execute as monolith — crash at any point restarts all
→ step 3 "insert records" — no idempotency key, restart = duplicate rows
→ step 5 "update balances" — no compensation, partial update = inconsistent
→ step 9 "run migration" — destructive, no human approval gate
tenure · nightly-pipeline v2.1
Score: 12 → 91/100 · reviewed
✓ Crash Recovery 20/20 10 Activities, each checkpointed independently
✓ Idempotency 18/20 Keys: record_hash+batch_id, balance_id+date
✓ Compensation 17/20 Compensate: soft-delete rows, reverse balance delta
✓ HITL Gates 18/20 Signal gate on migration step. Zero compute while waiting.
✓ Budget 18/20 Cap: 200k tokens. Per-Activity tracking.
Written: ./nightly-pipeline.skillx
◢ View workflow → http://localhost:8233/workflows/nightly-pipeline-b8c2e5
Kill the worker between step 6 and step 7. Restart. It resumes at step 7. Steps 1–6 are not re-executed. Database writes are not duplicated. Open the Temporal UI and watch replay produce identical results.
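The resume-from-checkpoint idea can be sketched in a few lines. This is an illustration of the concept, not Temporal's actual replay machinery:

```typescript
// Completed steps persist their results; on restart, persisted steps
// are replayed from storage instead of re-executed.
type Checkpoints = Map<number, string>;

function runPipeline(steps: Array<() => string>, cp: Checkpoints): string[] {
  return steps.map((step, i) => {
    if (cp.has(i)) return cp.get(i)!; // checkpointed: replay, don't re-run
    const out = step();               // first execution: run and checkpoint
    cp.set(i, out);
    return out;
  });
}

// Simulate a crash after step 2 of 3: steps 0 and 1 already checkpointed.
const cp: Checkpoints = new Map([[0, "extract"], [1, "transform"]]);
let executed = 0;
const out = runPipeline(
  [() => "extract", () => "transform", () => { executed += 1; return "load"; }],
  cp,
);
// executed === 1: only the unfinished step runs; no duplicate writes
```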
The agent sends Slack notifications, updates a CRM, and emails a summary. The Slack API returns a timeout. The framework retries. The message was actually delivered — the timeout was on the response, not the request. The customer gets two identical Slack messages. The CRM has two entries. The email went out twice. Nobody knows which copy is the original.
tenure score · raas-video-pipeline v1.0
Weights: paper v1.0 · N=31 behavioral · directional
Score: 18/100 · probationary
✗ D1 Crash Recovery 3/20 Monolithic. 9 steps, 0 checkpointed boundaries.
✗ D2 Deterministic Replay 2/20 3 non-deterministic tools, no event history.
✗ D3 Idempotency 0/20 5 mutations, 0 dedup keys.
✗ D4 Compensation 4/20 Asset cache for Whisper. Avatar/VO/Image irreversible.
✗ D5 Budget 9/20 No per-step token cap. No $ ceiling per video.
Cross-cutting
○ HITL Gate 0/10 No approval gate on HeyGen ($0.30–1.50/clip).
Gaps
→ step 2 "generate_vo.py" — no idempotency key, retry = duplicate ElevenLabs charge
→ step 4 "generate_images.py" — no dedup key, 4 scenes × retry = 4× Replicate charge
→ step 5 "generate_avatar.py" — no idempotency key, HeyGen call not cached by (vo_hash, start, duration)
→ step 5 "generate_avatar.py" — no HITL gate on most expensive step in pipeline
→ step 7 "remotion render" — no checkpoint, crash mid-render = full re-encode
→ step 3b "word boundary QA" — manual gate, no Signal primitive, blocks workflow thread
→ retry policy global — same policy for Tesseract OCR (read) as HeyGen (expensive mutation)
→ no budget cap — runaway script could burn $50+ in API calls before halting
Classification confidence
14 steps classified high · 3 medium · 1 unknown (word boundary QA shape)
Run npx tenure ./raas-video-pipeline to fix
View audit log → raas-video-pipeline.audit.json
$ npx tenure ./raas-video-pipeline
Compiling raas-video-pipeline...
Reading SKILL.md... 11 pipeline steps detected.
Reading compiler cache... 4 tools known, 7 tools unknown.
─────────────────────────────────────────────────────
Step 1 "Generate script JSON"
Detected: artifact_write (file output)
Is this step idempotent on re-run?
1. Yes — same input always produces same script
2. No — LLM call, output varies
3. Yes if seeded
> 3
What's the idempotency key?
1. script:{title_hash}:{seed} (recommended)
2. Custom
> 1
✓ Key: script:{title_hash}:{seed}
✓ Classification: idempotent_mutation (high confidence)
─────────────────────────────────────────────────────
Step 2 "Generate VO via ElevenLabs"
Detected: external API call (elevenlabs.com)
Tool not in registry. How should this step be classified?
1. Read — retrieves data, no side effects, safe to retry
2. Compute — pure transformation, no external interaction
3. Mutation — modifies external state, needs idempotency
4. Human — requires approval before executing
5. Skip — classify later
> 3
What's the unique key for this call?
1. script_hash + voice_id (recommended — same script + voice = same audio)
2. Custom
> 1
What reverses this action?
1. Cache the MP3 (idempotent re-fetch from local store)
2. Delete from ElevenLabs (no such API)
3. Irreversible
> 1
Detected HITL marker: "⚠️ Send clip to user for approval before continuing"
Adding approval gate after Step 2.
✓ Idempotency: vo:{script_hash}:{voice_id}
✓ Compensation: asset cache
✓ HITL gate: signal wait on "vo_approved"
✓ Registry updated. Future skills using elevenlabs.tts inherit this.
─────────────────────────────────────────────────────
Step 3 "Whisper canon extraction"
Detected: compute (local faster-whisper)
Deterministic on same audio input.
✓ Classification: deterministic_computation (high confidence)
✓ Cache key: whisper:{audio_hash}
─────────────────────────────────────────────────────
Step 3b "Word boundary QA"
Detected HITL marker: "REQUIRED before Step 4"
Is this a blocking human review?
1. Yes — workflow halts until human confirms
2. No — automated QA check
> 1
✓ HITL gate: signal wait on "boundaries_confirmed"
✓ Zero-compute while waiting
─────────────────────────────────────────────────────
Step 4 "Generate background images via Replicate"
Detected: external API call (api.replicate.com)
Tool not in registry. Classification?
1. Read
2. Compute
3. Mutation
4. Human
> 3
Unique key?
1. prompt_hash + model_version + seed (recommended)
2. Custom
> 1
Reversal?
1. Cache the image
2. Irreversible
> 1
✓ Idempotency: image:{prompt_hash}:{model_version}:{seed}
✓ Compensation: asset cache
✓ Registry updated.
─────────────────────────────────────────────────────
Step 4b "OCR extraction + karaoke plan"
Detected: compute (tesseract + deterministic mapping)
✓ Classification: deterministic_computation (high confidence)
✓ Cache key: ocr:{image_hash}
─────────────────────────────────────────────────────
Step 5 "Generate avatars via HeyGen"
Detected: external API call (api.heygen.com)
Tool not in registry. Classification?
1. Read
2. Compute
3. Mutation
4. Human
> 3
Unique key?
1. vo_hash + start + duration (recommended)
2. scene_id + vo_hash
3. Custom
> 1
Reversal?
1. Cache the clip
2. Delete from HeyGen (DELETE /v2/video/{id})
3. Irreversible
> 1
Budget concern: HeyGen charges per minute of avatar video.
Set per-call budget cap?
1. $5 per avatar clip (recommended for 60s max)
2. Custom
3. No cap
> 1
✓ Idempotency: avatar:{vo_hash}:{start}:{duration}
✓ Compensation: asset cache
✓ Budget: $5/call
✓ Registry updated.
─────────────────────────────────────────────────────
Step 6 "Assemble Remotion props"
Detected: compute (TypeScript file generation)
✓ Classification: deterministic_computation (high confidence)
─────────────────────────────────────────────────────
Step 7 "Render MP4 via Remotion"
Detected: artifact_write (local MP4)
Is render deterministic?
1. Yes — same props always produce same MP4
2. No
> 1
✓ Idempotency: render:{props_hash}
✓ Compensation: cached MP4
─────────────────────────────────────────────────────
Step 8 "Git commit"
Detected: mutation (git commit)
Unique key?
1. tree_hash + message_hash (recommended)
2. Custom
> 1
Reversal?
1. Revert commit (git revert {sha})
2. Irreversible
> 1
✓ Idempotency: commit:{tree_hash}:{message_hash}
✓ Compensation: git revert
─────────────────────────────────────────────────────
Workflow-level budget
Estimated per-run cost: $8–12 (ElevenLabs $0.30 + Replicate $0.50 + HeyGen $7–11)
Set workflow budget cap?
1. $15 per video (recommended — 30% margin)
2. $25 per video
3. Custom
> 1
✓ Budget: $15/video, pre-dispatch enforcement
─────────────────────────────────────────────────────
Compiling .skillx...
✓ Crash Recovery 18/20 11 Activities, checkpointed at artifact boundaries
✓ Idempotency 18/20 6 mutation keys assigned
✓ Compensation 15/20 5 cache + 1 git revert; 0 unresolved
✓ HITL Gates 16/20 2 approval gates (VO approval, boundary QA)
✓ Budget 20/20 $15/video cap, per-call cap on HeyGen
Score: 18 → 87/100 · reviewed
Registry contributions:
elevenlabs.tts → mutation, cache-compensable
replicate.predict → mutation, cache-compensable
heygen.video.generate → mutation, cache-compensable, cost-bounded
Written: ./raas-video-pipeline.skillx
◢ View workflow → http://localhost:8233/workflows/raas-video-pipeline-a7c3e1
Share these classifications with the registry? (y/n)
> y
Replay the workflow from Temporal history. The Slack message is not re-sent. The CRM entry is not re-created. Idempotency keys match, dedup fires, mutations are skipped. The email sends exactly once.
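Budget enforcement in these reports is pre-dispatch: the cap is checked before a step spends tokens or dollars, not after the bill arrives. A minimal sketch with a hypothetical API:

```typescript
// Hypothetical sketch of pre-dispatch budget enforcement.
class Budget {
  private spent = 0;
  constructor(private cap: number) {}
  charge(estimate: number): void {
    if (this.spent + estimate > this.cap) {
      throw new Error(`budget exceeded: ${this.spent} + ${estimate} > ${this.cap}`);
    }
    this.spent += estimate; // reserve before dispatch
  }
}

const budget = new Budget(100_000); // 100k-token cap, as in the report
budget.charge(60_000);              // step 1: allowed
budget.charge(30_000);              // step 2: allowed, 90k spent
// budget.charge(20_000) would throw before any tokens are consumed
```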
The agent processes expense reports. It reads receipts, categorizes expenses, and submits reimbursements to the payment system. The CFO asks: which expenses were auto-approved vs. human-approved? What happens if the payment API fails after the expense is marked as processed? Can you prove the agent didn't submit the same reimbursement twice? There is no answer. There is no audit trail. There is no proof.
tenure score · expense-processor v3.0
Score: 29/100 · probationary
✗ Crash Recovery 8/20 Some decomposition. Payment step not isolated.
✗ Idempotency 6/20 Receipt read is cached. Payment has no dedup key.
✗ Compensation 0/20 No reversal on failed reimbursement.
✗ HITL Gates 0/20 Payment submitted without human approval.
✗ Budget 15/20 Token cap exists. No per-step tracking.
Gaps
→ step 5 "submit reimbursement" — no idempotency key, retry = double payment
→ step 5 "submit reimbursement" — no human approval gate on financial action
→ step 5 "submit reimbursement" — no compensation (refund) on downstream failure
→ audit log does not record classification decisions
tenure · expense-processor v3.0
Score: 29 → 94/100 · reviewed
✓ Crash Recovery 18/20 7 Activities, payment step fully isolated
✓ Idempotency 20/20 Keys: receipt_hash+employee_id, payment_ref+amount
✓ Compensation 18/20 Compensate: initiate refund. Irreversible: receipt scan.
✓ HITL Gates 20/20 Signal gate on reimbursement. Zero compute while waiting.
✓ Budget 18/20 Cap: 75k tokens. Per-Activity tracking.
Written: ./expense-processor.skillx
◢ View workflow → http://localhost:8233/workflows/expense-processor-c7d4b2
The .skillx audit log records every classification decision. Which steps are mutations. Which have compensation actions. Which require human approval. The CFO gets a document, not a promise.
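The compensation model is saga-style: each completed mutation records its reversal, and a downstream failure runs the recorded reversals in reverse order. A minimal sketch with hypothetical step names:

```typescript
// Hypothetical sketch of saga-style compensation: completed mutations
// record reversals; a failure unwinds them newest-first.
type SagaStep = { run: () => void; compensate: () => void };

function runSaga(steps: SagaStep[]): string {
  const done: SagaStep[] = [];
  try {
    for (const s of steps) { s.run(); done.push(s); }
    return "committed";
  } catch {
    for (const s of done.reverse()) s.compensate(); // undo in reverse order
    return "compensated";
  }
}

const log: string[] = [];
const outcome = runSaga([
  { run: () => { log.push("mark processed"); },
    compensate: () => { log.push("unmark processed"); } },
  { run: () => { throw new Error("payment API down"); },
    compensate: () => { log.push("initiate refund"); } },
]);
// outcome === "compensated"; only "unmark processed" runs as compensation,
// since the failed payment never committed and needs no refund
```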
A .skillx is a skill manifest, not a script. It describes what to execute and how to survive failure — but contains no executable code itself. The runtime decides how to execute it.
| Runtime | Status | Notes |
|---|---|---|
| Temporal | Native | Full primitive support. Default target. |
| Inngest | Beta | Event-driven resolution. Core primitives mapped. |
| Restate | Beta | Virtual object resolution. Core primitives mapped. |
| Self-hosted | Yes | .skillx is JSON. Write your own resolver. |
A .skillx declares an execution graph with typed steps, classified primitives, and resolution hints. It's portable JSON — readable, diffable, transportable over HTTP or cat. A .skillx from an untrusted source cannot run arbitrary code. It can only request primitives from the runtime's own catalog.
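To make the shape concrete, here is a sketch of what a .skillx manifest might contain. The field names are illustrative assumptions, not tenure's actual schema:

```typescript
// Hypothetical .skillx manifest shape (illustrative field names).
interface SkillxStep {
  id: string;
  primitive: "read" | "compute" | "mutation" | "human";
  idempotencyKey?: string; // key template for mutations
  compensation?: string;   // reversing call, or "irreversible"
}

interface Skillx {
  name: string;
  version: string;
  budget: { maxTokens: number };
  steps: SkillxStep[];
}

const manifest: Skillx = {
  name: "social-post",
  version: "1.0",
  budget: { maxTokens: 100_000 },
  steps: [
    {
      id: "post-linkedin",
      primitive: "mutation",
      idempotencyKey: "platform+content_hash+date",
      compensation: "linkedin.posts.delete",
    },
    { id: "approve-publish", primitive: "human" },
  ],
};
// Portable JSON: JSON.stringify(manifest) is the artifact; no code inside.
```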
SKILL.md is the source code. The A10 compiler is the compiler. The .skillx is the build artifact.
A10 classifies every step deterministically — no LLM calls, no inference, no probability. Tool lookup, mapping lookup, verb matching. Same input always produces the same output. The classification taxonomy grows with every compilation: novel tool classifications write back to the mapping, so the compiler improves by compiling more skills.
The compilation report logs every decision: which sources were consulted, which classification was chosen, what confidence level was assigned, and why. When the compiler doesn't know, it says unknown — not a defect, a measurement of the compiler's knowledge boundary.
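The lookup-then-verb-match pipeline can be sketched as follows. The registry entries and verb lists here are hypothetical stand-ins for the real mapping in src/mapping/:

```typescript
// Hypothetical sketch of deterministic classification: registry lookup
// first, then verb matching on the tool name; "unknown" when neither
// source matches. No LLM calls; same input always yields same output.
const registry = new Map<string, string>([
  ["linkedin.posts.create", "mutation"], // learned from a prior compilation
]);

const verbHints: Array<[RegExp, string]> = [
  [/^(get|list|search|fetch|read)/, "read"],
  [/^(create|post|insert|update|delete|send)/, "mutation"],
];

function classify(tool: string): string {
  const known = registry.get(tool);
  if (known !== undefined) return known;      // registry hit
  const verb = tool.split(".").pop() ?? tool; // last segment as the verb
  for (const [re, cls] of verbHints) {
    if (re.test(verb)) return cls;            // verb-match fallback
  }
  return "unknown";                           // knowledge boundary, logged as such
}

console.log(classify("linkedin.posts.create"));  // "mutation" (registry)
console.log(classify("twitter.timeline.fetch")); // "read" (verb match)
console.log(classify("acme.widget.frobnicate")); // "unknown"
```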
Temporal is the kernel. Tenure is the compiler. Every other agent framework is an interpreted language running directly on the kernel without a compilation step, hoping the kernel figures out the right execution strategy at runtime. Tenure is the first tool that analyzes the skill before it hits the kernel and tells the kernel exactly how to run it durably.
Not a framework. Keep CrewAI, LangGraph, AutoGen, ADK, Pydantic AI — whatever you use to decide what to do. Tenure decides how to survive doing it.
Not a runtime. Temporal, Inngest, and Restate are runtimes. Tenure compiles skills into artifacts those runtimes execute. Different layer.
Not an agent. Tenure does not reason, plan, or call LLMs. It classifies steps and assigns durability primitives. Deterministic in, deterministic out.
Not an LLM wrapper. The compiler uses zero LLM calls. Classification is structural analysis — tool lists, verb matching, mapping lookup. No tokens consumed during compilation.
# Compile any skill package into a portable .skillx artifact (no Temporal required)
npx tenure openclaw-skills ./my-skill
# Compile with explicit output path
npx tenure openclaw-skills ./my-skill --output ./my-skill.skillx
# Start a local Temporal server (one-time setup)
temporal server start-dev
# Score a skill
npx tenure score ./my-skill
# Fix it and deploy
npx tenure ./my-skill
# Watch it run
open http://localhost:8233

The classification taxonomy is open. The tool lists, verb categories, and durability mapping are in src/mapping/ — contributions that improve classification coverage directly improve every future compilation.
If you score a skill and the classifications look wrong, that's a contribution: file an issue with the skill and the audit log. Wrong classifications with good audit trails are how the compiler improves.
MIT
Three commands. Diagnose, fix, compile. Zero install. Portable .skillx output.