castwright

Generate synthetic instruction-tuning data that doesn't look synthetic.

castwright takes a handful of seed examples and produces thousands of new instruction-output pairs using any LLM API. It handles the annoying parts — prompt engineering, JSON parsing, deduplication, quality filtering — so you can focus on the model you're actually training.

from castwright import generate, load_seeds, save_results, GenerationConfig
from castwright import OpenAIProvider

seeds = load_seeds("seeds.jsonl")
provider = OpenAIProvider(model="gpt-4o-mini")

result = generate(seeds, provider, GenerationConfig(n=500, temperature=0.9))
save_results(result, "training_data.jsonl")
print(f"Saved {len(result.examples)} examples ({result.n_filtered} filtered)")

Why castwright?

Building a fine-tuning dataset by hand is slow. Getting an LLM to generate training data sounds easy until you deal with:

Refusals showing up in your training set
Repetitive examples that add nothing
Models talking about generating data instead of actually doing it
Raw JSON extraction from markdown blocks
Deduplication against your seeds so the model doesn't just copy them

distilabel tried to solve this but became a pipeline framework. Alpaca_eval is evaluation-only. Self-instruct is a research repo, not a library.

castwright is the missing middle ground: a pip-installable library with a clean API, built-in quality filters, and output in every format fine-tuning frameworks expect.

What you get:

Pluggable LLM backends: OpenAI, Anthropic, or any OpenAI-compatible API
Six fast heuristic filters that catch bad generations before they hit your training set
Automatic dedup against your seed data
Output in Alpaca, ShareGPT, or OpenAI chat format
Multi-turn conversation generation
A CLI for quick generation runs without writing Python
Zero required dependencies (provider SDKs are optional extras)

Install

pip install castwright

With OpenAI support:

pip install castwright[openai]

With Anthropic support:

pip install castwright[anthropic]

Everything (both providers + CLI):

pip install castwright[all]

Seed file format

Create a JSONL file with your seed examples. You need at least a few good ones — castwright uses them to teach the LLM what you want:

{"instruction": "Explain the difference between TCP and UDP", "output": "TCP is a connection-oriented protocol that guarantees delivery..."}
{"instruction": "Write a Python function to flatten a nested list", "output": "def flatten(lst):\n    result = []\n    for item in lst:\n        if isinstance(item, list):\n            result.extend(flatten(item))\n        else:\n            result.append(item)\n    return result"}
{"instruction": "What causes a segfault?", "input": "In C/C++ programs", "output": "A segmentation fault occurs when a program tries to access memory..."}

Also accepts JSON arrays and prompt/response field names.

Usage

Basic generation

from castwright import generate, GenerationConfig, Seed
from castwright import OpenAIProvider

seeds = [
    Seed(instruction="Explain recursion", output="Recursion is when a function calls itself..."),
    Seed(instruction="What is a hash table?", output="A hash table is a data structure that maps keys to values..."),
]

provider = OpenAIProvider(model="gpt-4o-mini")
config = GenerationConfig(n=100, temperature=0.9, diversity_factor=0.7)

result = generate(seeds, provider, config)
print(f"Generated: {result.n_generated}, Filtered: {result.n_filtered}, Kept: {len(result.examples)}")

Output formats

from castwright import save_results, OutputFormat

# Alpaca format (default) — works with axolotl, LLaMA-Factory
save_results(result, "data.jsonl", OutputFormat.ALPACA)

# ShareGPT format — works with FastChat, LLaMA-Factory
save_results(result, "data.jsonl", OutputFormat.SHAREGPT)

# OpenAI chat format — works with OpenAI fine-tuning API
save_results(result, "data.jsonl", OutputFormat.OPENAI)

Multi-turn conversations

from castwright import generate_multiturn, Seed
from castwright import OpenAIProvider

seeds = [Seed(instruction="Help me debug this Python code", output="Let me look at that...")]
provider = OpenAIProvider()

result = generate_multiturn(seeds, provider, n=50, turns=4)

Custom providers

Any OpenAI-compatible API works out of the box:

from castwright import OpenAIProvider

# vLLM, Ollama, Together, etc.
provider = OpenAIProvider(
    model="meta-llama/Llama-3-70B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

Quality filters

castwright applies six filters by default:

Filter	What it catches
`not_empty`	Blank instruction or output
`min_length`	Instructions shorter than 10 characters
`not_repetitive`	Output with >30% consecutive word repeats
`not_refusal`	"I'm sorry, I can't..." responses
`no_meta_talk`	"Here's an example..." meta-commentary
`balanced_formatting`	Unclosed code blocks

You can also pass your own:

from castwright import filter_examples, GeneratedExample

def my_filter(ex: GeneratedExample) -> bool:
    return len(ex.output) > 100

filtered = filter_examples(result.examples, filters=[my_filter])

Generation config

GenerationConfig(
    n=100,                    # Number of examples to generate
    model="gpt-4o-mini",     # Model name (passed to provider)
    temperature=0.9,          # Sampling temperature (0.0-2.0)
    max_retries=3,            # Retries on parse failure
    diversity_factor=0.7,     # 0.0=similar to seeds, 1.0=very diverse
    output_format=OutputFormat.ALPACA,
)

CLI

# Generate from seed file
castwright gen seeds.jsonl -n 200 -m gpt-4o-mini -o output.jsonl --provider openai

# Use Anthropic
castwright gen seeds.jsonl -n 100 -m claude-sonnet-4-20250514 -o output.jsonl --provider anthropic

# Preview your seed examples
castwright preview seeds.jsonl

# Test without API calls
castwright gen seeds.jsonl -n 10 -o test.jsonl --provider mock

Comparison with alternatives

	castwright	distilabel	self-instruct	manual
pip install	yes	yes	clone repo	-
Simple API	3 lines	pipeline DSL	scripts	-
Quality filters	built-in	separate step	none	human
Multi-provider	OpenAI, Anthropic, any compatible	varies	OpenAI only	-
Format output	Alpaca, ShareGPT, OpenAI	custom	Alpaca	any
Maintained	active	founders left	archived	-

License

Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github/workflows		.github/workflows
assets		assets
scripts		scripts
src/castwright		src/castwright
templates		templates
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Project	What it does
tokonomics	Token counting & cost management for LLM APIs
datacrux	Training data quality — dedup, PII, contamination
datamix	Dataset mixing & curriculum optimization
toksight	Tokenizer analysis & comparison
trainpulse	Training health monitoring
ckpt	Checkpoint inspection, diffing & merging
quantbench	Quantization quality analysis
infermark	Inference benchmarking
modeldiff	Behavioral regression testing
vibesafe	AI-generated code safety scanner
injectionguard	Prompt injection detection

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

castwright

Why castwright?

Install

Seed file format

Usage

Basic generation

Output formats

Multi-turn conversations

Custom providers

Quality filters

Generation config

CLI

Comparison with alternatives

See Also

License

About

Uh oh!

Releases 3

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

castwright

Why castwright?

Install

Seed file format

Usage

Basic generation

Output formats

Multi-turn conversations

Custom providers

Quality filters

Generation config

CLI

Comparison with alternatives

See Also

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages