data-forge

data-forge is a framework for building high-quality synthetic dataset pipelines.

The core idea is simple: generator models are useful for producing candidate data, but the moat is the quality system around them. Rows are generated, executed or validated, judged against explicit rubrics, reviewed by humans when needed, signed off, and only then exported for training.

This repository is designed to be cloned and adapted for new niches. A niche can be SQL, coding, customer-support tools, browser tasks, legal classification, logistics reasoning, or any other domain where data quality can be measured with clear gates.

What It Provides

A reusable storage layer with local:// and gdrive:// backends.
A pattern for niche-specific generation prompts, validators, reports, review packets, and dataset exports.
Sharded generation for running independent workers in parallel without write races.
Static HTML review packets for human approval without running a server.
Signoff enforcement before fine-tuning exports.
Testable quality gates instead of trust-based synthetic data.
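
As a rough illustration of how sharded generation can avoid write races, the sketch below gives each worker its own output file and merges the shards afterwards with a content-hash dedupe. The function names (`write_shard`, `merge_shards`, `row_key`) and the `shard-NNNN.jsonl` naming scheme are assumptions made for this example, not the framework's actual API.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def shard_path(out_dir: Path, shard_index: int) -> Path:
    # Hypothetical naming scheme: one file per worker, so no two
    # workers ever write to the same path.
    return out_dir / f"shard-{shard_index:04d}.jsonl"


def write_shard(out_dir: Path, shard_index: int, rows: list) -> Path:
    path = shard_path(out_dir, shard_index)
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, sort_keys=True) + "\n")
    return path


def row_key(row: dict) -> str:
    # Hash of the canonical JSON form, used as the dedupe key.
    canonical = json.dumps(row, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_shards(out_dir: Path) -> list:
    seen = set()
    merged = []
    # Sorted order keeps the merge deterministic across runs.
    for path in sorted(out_dir.glob("shard-*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            row = json.loads(line)
            key = row_key(row)
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Two workers produce overlapping rows in separate shard files.
    write_shard(out_dir, 0, [{"question": "a"}, {"question": "b"}])
    write_shard(out_dir, 1, [{"question": "b"}, {"question": "c"}])
    merged_rows = merge_shards(out_dir)
```

Exact-match hashing only catches identical rows. A real niche would likely want a looser key, such as a normalized question string, to catch near-duplicates.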

Repository Layout

data-forge/
  docs/                 # Framework-level architecture and storage docs
  niches/               # Domain-specific dataset factories
  src/data_forge/core/  # Reusable storage, scoring, and JSON helpers
  src/data_forge/niches/ # Python implementation for niche packs
  tests/                # Core and niche tests

Niche-specific scripts and docs live inside each niche folder. The current example niche is under niches/text-to-sql/.

Quick Start

Clone and run tests:

git clone <repo-url>
cd data-forge
python3 -m unittest discover -s tests
for dir in niches/*/tests; do python3 -m unittest discover -s "$dir"; done

Install the package and its dependencies in editable mode:

python3 -m pip install -e .

Use local storage during development:

export DATA_FORGE_STORAGE=local

Use Google Drive as the shared data store:

export DATA_FORGE_STORAGE=gdrive
export DATA_FORGE_DRIVE_ROOT_ID=<google-drive-folder-id>
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

Cloud agents can supply the service-account credentials inline instead of as a file path:

export GOOGLE_APPLICATION_CREDENTIALS_JSON='<raw service account json>'

See docs/google_drive_storage.md for Drive setup.
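
The environment variables above could be resolved into a storage configuration along the following lines. This is only a sketch: `resolve_storage_config` is a hypothetical helper, and the fallback to local storage when `DATA_FORGE_STORAGE` is unset, as well as the preference for inline JSON over a key file, are assumptions rather than documented behavior.

```python
import json
from typing import Mapping


def resolve_storage_config(env: Mapping[str, str]) -> dict:
    # Assumption: fall back to local storage when the variable is unset.
    backend = env.get("DATA_FORGE_STORAGE", "local")
    if backend == "local":
        return {"backend": "local"}
    if backend != "gdrive":
        raise ValueError(f"unknown DATA_FORGE_STORAGE value: {backend!r}")

    root_id = env.get("DATA_FORGE_DRIVE_ROOT_ID")
    if not root_id:
        raise ValueError("gdrive storage needs DATA_FORGE_DRIVE_ROOT_ID")

    raw_json = env.get("GOOGLE_APPLICATION_CREDENTIALS_JSON")
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if raw_json:
        # Fail early if the inline credentials are not valid JSON.
        json.loads(raw_json)
        source = "inline-json"
    elif key_path:
        source = "key-file"
    else:
        raise ValueError("gdrive storage needs service-account credentials")
    return {"backend": "gdrive", "root_id": root_id, "credentials": source}


local_cfg = resolve_storage_config({"DATA_FORGE_STORAGE": "local"})
drive_cfg = resolve_storage_config(
    {
        "DATA_FORGE_STORAGE": "gdrive",
        "DATA_FORGE_DRIVE_ROOT_ID": "example-folder-id",
        "GOOGLE_APPLICATION_CREDENTIALS": "/path/to/service-account.json",
    }
)
```

In a real process the function would be called with `os.environ`. Taking a plain mapping keeps the check easy to unit test.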

Core Workflow

Define a niche pack with a row contract, generation prompts, validators, review UI, and export format.
Generate raw candidate rows with a cheap or high-throughput generator model.
Run deterministic gates and rubric scoring.
Archive rejected rows with reasons.
For larger runs, generate multiple independent shards and merge/dedupe accepted rows.
Build static HTML review packets for accepted rows.
Apply human review decisions.
Create a signoff manifest.
Export only approved rows into training-ready datasets.
Evaluate the trained model against public or private benchmarks.
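
The gating, rejection-archive, and signoff steps above can be sketched as follows. Everything here is illustrative: the row fields (`id`, `prompt`, `answer`), the two example gates, and the manifest keys (`signed_off`, `approved_ids`) are assumptions, not the repository's actual row contract or manifest format.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A gate returns None when the row passes, or a rejection reason string.
Gate = Callable[[dict], Optional[str]]


def has_required_fields(row: dict) -> Optional[str]:
    missing = [name for name in ("prompt", "answer") if not row.get(name)]
    return f"missing fields: {', '.join(missing)}" if missing else None


def answer_not_too_short(row: dict) -> Optional[str]:
    if len(row.get("answer", "")) < 3:
        return "answer shorter than 3 characters"
    return None


@dataclass
class GateResult:
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (row, reasons) pairs


def run_gates(rows: list, gates: list) -> GateResult:
    result = GateResult()
    for row in rows:
        reasons = []
        for gate in gates:
            reason = gate(row)
            if reason is not None:
                reasons.append(reason)
        if reasons:
            # Rejected rows are kept together with every reason they failed.
            result.rejected.append((row, reasons))
        else:
            result.accepted.append(row)
    return result


def export_approved(accepted: list, signoff: dict) -> list:
    # Export is refused outright unless a signoff manifest approves it.
    if not signoff.get("signed_off"):
        raise PermissionError("export blocked: no signoff manifest")
    approved_ids = set(signoff.get("approved_ids", []))
    return [row for row in accepted if row["id"] in approved_ids]


candidate_rows = [
    {"id": "r1", "prompt": "List all users", "answer": "SELECT * FROM users;"},
    {"id": "r2", "prompt": "", "answer": "ok"},
]
gate_result = run_gates(candidate_rows, [has_required_fields, answer_not_too_short])
exported = export_approved(
    gate_result.accepted, {"signed_off": True, "approved_ids": ["r1"]}
)
```

Returning a reason string from each gate, rather than a bare boolean, is what makes the rejection archive useful: each discarded row records why it failed.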

Building a New Niche

A niche should include:

README.md: domain goal, benchmark target, and usage.
config.json: domains, skills, thresholds, and prompt paths.
prompts/: orchestrator, generator, and judge instructions.
examples/: one accepted row and one rejected row.
scripts/: niche-specific generation, review, signoff, and export commands.
Python gates under src/data_forge/niches/<niche_name>/.
Tests covering acceptance, rejection, review, signoff, and export.
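
A new niche's config.json could be sanity-checked before any generation run with something like the sketch below. The key names (`domains`, `skills`, `thresholds`, `prompts`) are guesses derived from the list above, and `validate_niche_config` is a hypothetical helper. The real schema may differ.

```python
import json

# Assumed top-level keys, inferred from the description of config.json.
REQUIRED_KEYS = ("domains", "skills", "thresholds", "prompts")


def validate_niche_config(config: dict) -> list:
    # Collect every problem instead of stopping at the first one.
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    for name, value in config.get("thresholds", {}).items():
        if not isinstance(value, (int, float)):
            problems.append(f"threshold {name!r} must be a number")
    return problems


example_config = json.loads(
    """
    {
      "domains": ["retail", "logistics"],
      "skills": ["joins", "aggregation"],
      "thresholds": {"min_judge_score": 0.8},
      "prompts": {"generator": "prompts/generator.md", "judge": "prompts/judge.md"}
    }
    """
)
config_problems = validate_niche_config(example_config)

bad_problems = validate_niche_config(
    {"domains": [], "thresholds": {"min_judge_score": "high"}}
)
```

A check like this fits naturally into the niche's test suite, so a malformed config fails fast instead of partway through a generation run.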

Keep generated datasets, benchmark downloads, model outputs, adapters, and service-account credentials out of git.

Design Principle

Fine-tuning is downstream. The asset is the reviewed dataset and the repeatable process that created it.
