Build, label, and monetize AI training datasets collaboratively — powered by Shelby Protocol + Aptos blockchain.
AI model quality is directly tied to training data quality. But building high-quality datasets is expensive, slow, and opaque:
- No fair attribution: Multiple contributors build a dataset, but only the organization that owns the repo gets credit — and revenue.
- No provenance: When a model underperforms or causes harm, there is no way to trace which data caused it.
- No monetization layer: Data contributors have no mechanism to earn from downstream model usage.
- Centralized bottleneck: Dataset hosting on S3 or HuggingFace is fast to read but offers no cryptographic guarantees about what was read, when, or by whom.
Existing tools (Label Studio, Scale AI, Roboflow) solve the labeling UX but ignore the economic and provenance layer entirely.
Datavault is a collaborative dataset platform where every contribution is tracked on-chain, every training read generates a cryptographic receipt, and contributors earn micropayments when their data is actually used to train a model.
Contributor uploads / labels data
↓
Data stored on Shelby Protocol (hot storage, S3-compatible)
↓
Contribution metadata written to Aptos (contributor address, timestamp, data hash)
↓
AI team pulls dataset via tRPC API
↓
Shelby generates cryptographic receipt per read
↓
Receipt anchored to Aptos → micropayment to contributors triggered
↓
Model card automatically populated with verifiable data lineage
- Upload raw data (text, image, audio) or label existing datasets
- Every contribution timestamped and hashed on Aptos at upload
- Real-time dashboard: see exactly which models used your data and when
- Automatic micropayment when your contribution is pulled for training
- S3-compatible API — drop-in replacement for existing data pipelines
- Immutable provenance receipt per dataset read (Shelby receipt → Aptos transaction)
- Auto-generated model card with full data lineage
- EU AI Act audit support: every training read leaves an auditable on-chain record
- Open dataset marketplace: list datasets with licensing terms enforced at storage layer
- Access control via Aptos smart contracts (pay-per-read, subscription, token-gated)
- Every dataset version is independently verifiable
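
As a rough illustration of the "hashed at upload" and "independently verifiable" features listed above, the sketch below hashes raw bytes and re-checks them later. It is a minimal sketch under assumptions: SHA-256 is assumed (the hash function is not specified here), and `ContributionRecord`, `hashData`, `buildContribution`, and `verifyContribution` are hypothetical names, not part of the Shelby or Aptos SDKs.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape; the fields mirror the contribution metadata
// described above (contributor address, timestamp, data hash).
interface ContributionRecord {
  contributorAddress: string;
  dataHash: string; // hex-encoded digest of the uploaded bytes
  timestamp: number; // Unix seconds
}

// Hash the raw bytes. SHA-256 is an assumption made for this sketch.
function hashData(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Build the metadata that would be written on-chain at upload time.
function buildContribution(
  contributorAddress: string,
  data: Uint8Array,
  now: Date = new Date(),
): ContributionRecord {
  return {
    contributorAddress,
    dataHash: hashData(data),
    timestamp: Math.floor(now.getTime() / 1000),
  };
}

// Re-hash bytes read back from storage and compare with the recorded hash.
function verifyContribution(record: ContributionRecord, data: Uint8Array): boolean {
  return hashData(data) === record.dataHash;
}
```

Anyone holding the bytes and the recorded hash can run the check without trusting the party that served the data, which is the property the feature list relies on.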
┌─────────────────────────────────────────────────────────┐
│                        Datavault                        │
│                                                         │
│  ┌─────────────┐    ┌──────────────┐     ┌──────────┐   │
│  │ Contributor │    │   AI Team    │     │  Public  │   │
│  │   Portal    │    │ Data Client  │     │   Docs   │   │
│  └──────┬──────┘    └──────┬───────┘     └────┬─────┘   │
│         │                  │                  │         │
│  ┌──────▼──────────────────▼──────────────────▼─────┐   │
│  │           TanStack Start + tRPC Router           │   │
│  │  upload   label   dataset   receipts   payments  │   │
│  └──────────────────────┬───────────────────────────┘   │
│                         │                               │
│         ┌───────────────┼───────────────┐               │
│         ▼               ▼               ▼               │
│  ┌─────────────┐  ┌───────────┐  ┌──────────────┐       │
│  │   Shelby    │  │   Aptos   │  │   SQLite +   │       │
│  │  Protocol   │  │ Blockchain│  │ Drizzle ORM  │       │
│  │(hot storage)│  │(receipts) │  │  (metadata)  │       │
│  └─────────────┘  └───────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────┘
- All dataset files stored on Shelby (shelbynet)
- S3-compatible API for seamless integration with existing ML pipelines
- Every read operation produces a cryptographic receipt at the storage layer — not a log, not a proxy, actual proof from within the storage protocol
- DoubleZero fiber backbone: read latency optimized for high-frequency AI training workloads
- Egress cost ~70% lower than AWS S3
- Contribution events written as Aptos transactions at upload time
- Shelby receipts anchored to Aptos at read time
- Smart contracts enforce access rules and trigger micropayments
- Move module handles: contributor registry, dataset versioning, payment distribution
- Fast local queries for dataset listings, contributor stats, receipt history
- Drizzle ORM with type-safe schema — datasets, contributions, receipts, users tables
- SQLite for zero-config local dev, easily swappable for Turso in production
| Layer | Technology |
|---|---|
| Frontend | TanStack Start (SSR-first React framework) |
| API | tRPC (end-to-end type-safe, no REST boilerplate) |
| Auth | Better Auth (session management, wallet-compatible) |
| Database | SQLite via Drizzle ORM |
| Storage | Shelby Protocol (@shelby-protocol/sdk) |
| Blockchain | Aptos (@aptos-labs/ts-sdk), Move smart contracts |
| Monorepo | Turborepo (parallel builds, shared packages) |
| Linting | Biome (fast, zero-config formatter + linter) |
| Docs | Fumadocs (integrated documentation site) |
| MCP | Shelby MCP Server + custom Datavault MCP tools |
| Package Manager | pnpm |
datavault/ # Turborepo monorepo
├── apps/
│ ├── web/ # TanStack Start app
│ │ ├── app/
│ │ │ ├── routes/
│ │ │ │ ├── index.tsx # Landing
│ │ │ │ ├── contribute/ # Upload + labeling UI
│ │ │ │ ├── datasets/ # Marketplace
│ │ │ │ ├── receipts/ # Provenance explorer
│ │ │ │ └── dashboard/ # Contributor earnings
│ │ │ ├── server/
│ │ │ │ └── trpc/
│ │ │ │ ├── router.ts # Root tRPC router
│ │ │ │ ├── dataset.ts # Dataset procedures
│ │ │ │ ├── receipt.ts # Provenance procedures
│ │ │ │ └── payment.ts # Micropayment procedures
│ │ │ └── lib/
│ │ │ ├── shelby.ts # Shelby client
│ │ │ └── aptos.ts # Aptos client + receipt anchor
│ │ └── package.json
│ └── fumadocs/ # Fumadocs documentation site
│ ├── content/
│ │ ├── quickstart.mdx
│ │ ├── api-reference.mdx
│ │ └── provenance.mdx
│ └── package.json
├── packages/
│ ├── db/ # Drizzle schema + migrations
│ │ ├── schema/
│ │ │ ├── datasets.ts
│ │ │ ├── contributions.ts
│ │ │ ├── receipts.ts
│ │ │ └── users.ts
│ │ ├── drizzle.config.ts
│ │ └── package.json
│ ├── shelby/ # Shelby SDK wrapper
│ │ ├── src/
│ │ │ ├── client.ts # ShelbyNodeClient init
│ │ │ ├── upload.ts # Dataset upload
│ │ │ ├── download.ts # Dataset read + receipt capture
│ │ │ └── types.ts
│ │ └── package.json
│ ├── aptos/ # Aptos + Move contract client
│ │ ├── src/
│ │ │ ├── account.ts
│ │ │ ├── receipt.ts # anchor_read_receipt
│ │ │ └── payment.ts # distribute_payment
│ │ └── package.json
│ ├── api/ # API layer / business logic
│ ├── auth/ # Authentication configuration & logic
│ ├── config/ # Configuration package
│ ├── env/ # Environment variables validation
│ ├── ui/ # Shared shadcn/ui components and styles
│ └── mcp/ # Datavault MCP server
│ ├── src/
│ │ ├── tools/
│ │ │ ├── upload-dataset.ts
│ │ │ ├── query-receipts.ts
│ │ │ └── verify-provenance.ts
│ │ └── index.ts
│ └── package.json
├── contracts/ # Move smart contracts
│ ├── sources/
│ │ ├── dataset_registry.move
│ │ └── payment_splitter.move
│ └── Move.toml
├── biome.json # Biome config (linting + formatting)
├── turbo.json # Turborepo pipeline config
├── pnpm-workspace.yaml
└── package.json
// packages/db/schema/datasets.ts
import { integer, sqliteTable, text } from 'drizzle-orm/sqlite-core'

export const datasets = sqliteTable('datasets', {
  id: text('id').primaryKey(),
  name: text('name').notNull(),
  description: text('description'),
  shelbyBlobAddress: text('shelby_blob_address').notNull(),
  ownerAddress: text('owner_address').notNull(),
  version: integer('version').default(1),
  licenseType: text('license_type').notNull(), // pay-per-read | subscription | open
  pricePerRead: integer('price_per_read').default(0),
  createdAt: integer('created_at', { mode: 'timestamp' }).notNull(),
})

// packages/db/schema/contributions.ts
import { integer, sqliteTable, text } from 'drizzle-orm/sqlite-core'
import { datasets } from './datasets'

export const contributions = sqliteTable('contributions', {
  id: text('id').primaryKey(),
  datasetId: text('dataset_id').references(() => datasets.id),
  contributorAddress: text('contributor_address').notNull(),
  dataHash: text('data_hash').notNull(),
  aptosTxHash: text('aptos_tx_hash').notNull(), // on-chain proof
  weight: integer('weight').default(1), // relative share used for payment splitting
  createdAt: integer('created_at', { mode: 'timestamp' }).notNull(),
})

// packages/db/schema/receipts.ts
import { integer, sqliteTable, text } from 'drizzle-orm/sqlite-core'
import { datasets } from './datasets'

export const receipts = sqliteTable('receipts', {
  id: text('id').primaryKey(),
  datasetId: text('dataset_id').references(() => datasets.id),
  readerAddress: text('reader_address').notNull(),
  shelbyReceiptHash: text('shelby_receipt_hash').notNull(), // storage-layer proof
  aptosTxHash: text('aptos_tx_hash').notNull(), // blockchain anchor
  paid: integer('paid', { mode: 'boolean' }).default(false),
  createdAt: integer('created_at', { mode: 'timestamp' }).notNull(),
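})

// apps/web/app/server/trpc/dataset.ts
import { createHash, randomUUID } from 'node:crypto'
import { eq } from 'drizzle-orm'
import { z } from 'zod'
// db, the schema tables, shelbyClient, aptosClient, router, and the procedure
// helpers come from the workspace packages (import paths not shown here).

export const datasetRouter = router({
  upload: protectedProcedure
    .input(z.object({
      name: z.string(),
      file: z.instanceof(Buffer),
      license: z.enum(['pay-per-read', 'subscription', 'open']),
    }))
    .mutation(async ({ input, ctx }) => {
      const datasetId = randomUUID()
      const now = new Date()
      // 1. Upload to Shelby
      const blobAddress = await shelbyClient.upload(input.file)
      // 2. Register contribution on Aptos
      const aptosTx = await aptosClient.registerContribution(blobAddress, ctx.user.address)
      // 3. Save metadata to SQLite via Drizzle. The Aptos tx hash is stored on
      //    the contribution row; the datasets table has no aptos_tx_hash column.
      await db.insert(datasets).values({
        id: datasetId,
        name: input.name,
        shelbyBlobAddress: blobAddress,
        ownerAddress: ctx.user.address,
        licenseType: input.license,
        createdAt: now,
      })
      await db.insert(contributions).values({
        id: randomUUID(),
        datasetId,
        contributorAddress: ctx.user.address,
        dataHash: createHash('sha256').update(input.file).digest('hex'),
        aptosTxHash: aptosTx,
        createdAt: now,
      })
      return { datasetId, shelbyBlobAddress: blobAddress, aptosTxHash: aptosTx }
    }),
  read: protectedProcedure
    .input(z.object({ datasetId: z.string() }))
    .query(async ({ input, ctx }) => {
      // 1. Fetch from Shelby; the receipt is generated at the storage layer
      const { data, receiptHash } = await shelbyClient.download(input.datasetId)
      // 2. Anchor receipt to Aptos
      const aptosTx = await aptosClient.anchorReceipt(receiptHash, input.datasetId)
      // 3. Log receipt in SQLite, filling every NOT NULL column in the schema
      await db.insert(receipts).values({
        id: randomUUID(),
        datasetId: input.datasetId,
        readerAddress: ctx.user.address,
        shelbyReceiptHash: receiptHash,
        aptosTxHash: aptosTx,
        createdAt: new Date(),
      })
      return { data, receiptHash, aptosTxHash: aptosTx }
    }),
  listReceipts: publicProcedure
    .input(z.object({ datasetId: z.string() }))
    .query(({ input }) =>
      db.select().from(receipts).where(eq(receipts.datasetId, input.datasetId))
    ),
})

module datavault::dataset_registry {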
struct Contribution has key {
contributor: address,
dataset_id: vector<u8>,
data_hash: vector<u8>,
timestamp: u64,
weight: u64,
}
struct ReadReceipt has key {
shelby_receipt_hash: vector<u8>,
dataset_id: vector<u8>,
reader: address,
timestamp: u64,
paid: bool,
}
public entry fun register_contribution(
contributor: &signer,
dataset_id: vector<u8>,
data_hash: vector<u8>,
weight: u64,
) { ... }
public entry fun anchor_read_receipt(
reader: &signer,
shelby_receipt_hash: vector<u8>,
dataset_id: vector<u8>,
) { ... }
public entry fun distribute_payment(
dataset_id: vector<u8>,
amount: u64,
) { ... }
}

1. AI team calls tRPC dataset.read({ datasetId })
2. Better Auth verifies session / wallet signature
3. tRPC handler fetches dataset from Shelby via @shelby-protocol/sdk
4. Shelby generates cryptographic receipt (storage-layer proof)
5. Handler anchors receipt to Aptos: anchor_read_receipt(receipt_hash, dataset_id)
6. Aptos transaction confirmed (~600ms finality)
7. Receipt + aptos_tx_hash saved to SQLite via Drizzle
8. tRPC returns { data, receiptHash, aptosTxHash } to client
9. Verify: explorer.aptoslabs.com/txn/{aptosTxHash}?network=testnet
10. Payment distribution triggered per contributor weight
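
Step 10 distributes a payment according to contributor weight. The sketch below shows one way to do that split off-chain with integer amounts. It is an illustration under assumptions, not the logic of the Move contract: `ContributorShare` and `splitPayment` are hypothetical names, amounts are assumed to be in the smallest currency unit, and `amount * weight` is assumed to stay within JavaScript's safe integer range.

```typescript
// Hypothetical shape; `weight` mirrors the `weight` column on contributions.
interface ContributorShare {
  address: string;
  weight: number; // non-negative integer
}

// Split `amount` proportionally to weight. Flooring leaves a small remainder,
// which is handed out one unit at a time in input order so that the payouts
// always sum to exactly `amount`.
function splitPayment(amount: number, shares: ContributorShare[]): Record<string, number> {
  if (!Number.isInteger(amount) || amount < 0) {
    throw new Error("amount must be a non-negative integer");
  }
  let totalWeight = 0;
  for (const s of shares) {
    if (!Number.isInteger(s.weight) || s.weight < 0) {
      throw new Error("weights must be non-negative integers");
    }
    totalWeight += s.weight;
  }
  if (totalWeight === 0) throw new Error("total weight must be positive");

  const payouts: Record<string, number> = {};
  let distributed = 0;
  for (const s of shares) {
    const part = Math.floor((amount * s.weight) / totalWeight);
    payouts[s.address] = (payouts[s.address] ?? 0) + part;
    distributed += part;
  }

  // The remainder is always smaller than the number of positive-weight shares,
  // so a single pass is enough to hand it out.
  let remainder = amount - distributed;
  for (const s of shares) {
    if (remainder === 0) break;
    if (s.weight === 0) continue;
    payouts[s.address] = (payouts[s.address] ?? 0) + 1;
    remainder -= 1;
  }
  return payouts;
}
```

For example, splitting 100 units across three contributors of equal weight yields 34, 33, and 33, so no unit is lost to rounding.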
- Shelby package: upload / download / receipt capture
- Aptos package: account management + anchor_read_receipt
- Drizzle schema + migrations: datasets, contributions, receipts
- tRPC procedures: dataset.upload, dataset.read, receipt.list
- Milestone: First verifiable read receipt on Aptos testnet
- TanStack Start UI: contributor portal + dataset marketplace
- Better Auth: wallet-based sign-in + session management
- Dataset versioning (immutable versions on Shelby)
- Provenance explorer page: receipts by dataset / contributor
- Move contract: micropayment distribution
- Fumadocs: API reference + quickstart guide
- Milestone: End-to-end flow — contribute → train → receipt → payment
- Public dataset marketplace with on-chain licensing
- Datavault MCP server: AI agents can query + verify datasets directly
- HuggingFace Hub integration: push/pull with provenance
- Model card auto-generation from receipt history
- Milestone: Public beta, first paying dataset consumer
| Requirement | AWS S3 | IPFS / Filecoin | Shelby |
|---|---|---|---|
| Cryptographic read proof | ❌ | ❌ | ✅ |
| Low-latency AI reads | ✅ | ❌ | ✅ |
| S3-compatible API | ✅ | ❌ | ✅ |
| Micropayment native | ❌ | ❌ | ✅ |
| Decentralized | ❌ | ✅ | ✅ |
| Egress cost | High | Variable | ~70% lower |
The cryptographic receipt at the storage layer is the only mechanism that makes trustless payment splitting possible. Without Shelby, contributor payments require trusting whoever runs the server. With Shelby, the receipt is proof — no trust needed.
# Clone
git clone https://github.com/YOUR_USERNAME/datavault
cd datavault
# Install all dependencies (pnpm workspaces + Turborepo)
pnpm install
# Set up environment variables
cp apps/web/.env.example apps/web/.env
# Fill in: APTOS_API_KEY, SHELBY_API_KEY, BETTER_AUTH_SECRET
# Run database migrations
pnpm --filter @datavault/db migrate
# Start all apps in parallel (Turborepo)
pnpm dev
# Lint + format with Biome
pnpm biome check .
# Build all packages
pnpm build

- Real-world receipt latency under high-frequency dataset reads (AI training workloads)
- TypeScript SDK ergonomics in a full-stack tRPC + TanStack Start context
- S3-compatibility edge cases for ML frameworks
- Micropayment channel behavior under bulk read operations
- MCP tool integration patterns for AI agent dataset access
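
For the receipt-latency item in the list above, a small timing harness is one way to collect numbers. This is a hedged sketch: `measureReads`, `summarizeLatencies`, and `percentile` are hypothetical helpers, and `read` stands in for whatever async dataset read is being measured (for example, a call to the dataset.read procedure).

```typescript
// Nearest-rank percentile over an ascending-sorted array of millisecond samples.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p / 100) * sortedMs.length);
  const index = Math.min(sortedMs.length - 1, Math.max(0, rank - 1));
  return sortedMs[index];
}

// Summarize raw latency samples without mutating the input array.
function summarizeLatencies(samplesMs: number[]) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  return {
    count: sorted.length,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    max: sorted[sorted.length - 1],
  };
}

// Run `read` sequentially and time each call. `read` is a placeholder for any
// async dataset read; Date.now() gives millisecond resolution only.
async function measureReads(read: () => Promise<unknown>, iterations: number) {
  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = Date.now();
    await read();
    samples.push(Date.now() - start);
  }
  return summarizeLatencies(samples);
}
```

Sequential timing measures per-read latency including receipt anchoring; measuring throughput under concurrent reads would need a different harness.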
MIT