redtoali/datavault

Datavault — Collaborative AI Training Dataset Builder

Build, label, and monetize AI training datasets collaboratively — powered by Shelby Protocol + Aptos blockchain.


The Problem

AI model quality is directly tied to training data quality. But building high-quality datasets is expensive, slow, and opaque:

  • No fair attribution: Multiple contributors build a dataset, but only the organization that owns the repo gets credit — and revenue.
  • No provenance: When a model underperforms or causes harm, there is no way to trace which data caused it.
  • No monetization layer: Data contributors have no mechanism to earn from downstream model usage.
  • Centralized bottleneck: Dataset hosting on S3 or HuggingFace is fast to read but offers no cryptographic guarantees about what was read, when, or by whom.

Existing tools (Label Studio, Scale AI, Roboflow) solve the labeling UX but ignore the economic and provenance layer entirely.


The Solution

Datavault is a collaborative dataset platform where every contribution is tracked on-chain, every training read generates a cryptographic receipt, and contributors earn micropayments when their data is actually used to train a model.

Contributor uploads / labels data
        ↓
Data stored on Shelby Protocol (hot storage, S3-compatible)
        ↓
Contribution metadata written to Aptos (contributor address, timestamp, data hash)
        ↓
AI team pulls dataset via tRPC API
        ↓
Shelby generates cryptographic receipt per read
        ↓
Receipt anchored to Aptos → micropayment triggered for the contributor
        ↓
Model card automatically populated with verifiable data lineage

Core Features

For Contributors

  • Upload raw data (text, image, audio) or label existing datasets
  • Every contribution timestamped and hashed on Aptos at upload
  • Real-time dashboard: see exactly which models used your data and when
  • Automatic micropayment when your contribution is pulled for training

For AI Teams

  • S3-compatible API — drop-in replacement for existing data pipelines
  • Immutable provenance receipt per dataset read (Shelby receipt → Aptos transaction)
  • Auto-generated model card with full data lineage
  • Audit trail suited to EU AI Act documentation requirements: every training read is auditable
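
As a rough illustration of how a model card's lineage section could be derived from receipt history, here is a minimal sketch. The `ReceiptRow` shape mirrors the `receipts` table fields used in this README; the `renderLineage` helper and its output layout are assumptions made for illustration, not an existing Datavault API.

```typescript
// Hypothetical receipt shape, mirroring the columns of the `receipts` table.
interface ReceiptRow {
  datasetId: string;
  shelbyReceiptHash: string;
  aptosTxHash: string;
  createdAt: Date;
}

// Group receipts by dataset and render a plain-text lineage section.
function renderLineage(modelName: string, receipts: ReceiptRow[]): string {
  const byDataset = new Map<string, ReceiptRow[]>();
  for (const r of receipts) {
    const rows = byDataset.get(r.datasetId) ?? [];
    rows.push(r);
    byDataset.set(r.datasetId, rows);
  }

  const lines: string[] = [`Model: ${modelName}`, "Data lineage:"];
  byDataset.forEach((rows, datasetId) => {
    lines.push(`  ${datasetId} (${rows.length} read(s))`);
    for (const r of rows) {
      lines.push(
        `    ${r.createdAt.toISOString()}  receipt=${r.shelbyReceiptHash}  tx=${r.aptosTxHash}`
      );
    }
  });
  return lines.join("\n");
}
```

Each line pairs the storage-layer receipt hash with its Aptos transaction hash, so a reader of the model card can check both independently.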

For the Ecosystem

  • Open dataset marketplace: list datasets with licensing terms enforced at storage layer
  • Access control via Aptos smart contracts (pay-per-read, subscription, token-gated)
  • Every dataset version is independently verifiable
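
The access modes above could be mirrored off-chain as a small decision function, for example to reject a request early before any on-chain call. This is a sketch under stated assumptions: the `AccessRequest` fields and the `canRead` helper are hypothetical, and the authoritative check would still live in the Aptos contracts.

```typescript
// Hypothetical off-chain mirror of the access rules; all names are illustrative.
type License = "pay-per-read" | "subscription" | "token-gated" | "open";

interface AccessRequest {
  license: License;
  pricePerRead: number;           // price in the smallest currency unit
  offeredPayment: number;         // amount the reader attached to this read
  subscriptionExpiresAt?: number; // unix seconds, if the reader holds a subscription
  tokenBalance: number;           // reader's balance of the gating token
  now: number;                    // unix seconds
}

function canRead(req: AccessRequest): boolean {
  switch (req.license) {
    case "open":
      return true;
    case "pay-per-read":
      return req.offeredPayment >= req.pricePerRead;
    case "subscription":
      return req.subscriptionExpiresAt !== undefined && req.subscriptionExpiresAt > req.now;
    case "token-gated":
      return req.tokenBalance > 0;
  }
}
```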

Architecture

┌─────────────────────────────────────────────────────────┐
│                      Datavault                          │
│                                                         │
│   ┌─────────────┐    ┌──────────────┐    ┌──────────┐  │
│   │ Contributor │    │   AI Team    │    │  Public  │  │
│   │   Portal    │    │ Data Client  │    │   Docs   │  │
│   └──────┬──────┘    └──────┬───────┘    └────┬─────┘  │
│          │                  │                  │        │
│   ┌──────▼──────────────────▼──────────────────▼─────┐  │
│   │       TanStack Start + tRPC Router               │  │
│   │   upload  label  dataset  receipts  payments     │  │
│   └──────────────────────┬───────────────────────────┘  │
│                          │                              │
│          ┌───────────────┼───────────────┐              │
│          ▼               ▼               ▼              │
│   ┌─────────────┐ ┌───────────┐ ┌──────────────┐       │
│   │   Shelby    │ │  Aptos    │ │  SQLite +    │       │
│   │  Protocol   │ │ Blockchain│ │  Drizzle ORM │       │
│   │(hot storage)│ │(receipts) │ │  (metadata)  │       │
│   └─────────────┘ └───────────┘ └──────────────┘       │
└─────────────────────────────────────────────────────────┘

Storage Layer — Shelby Protocol

  • All dataset files stored on Shelby (shelbynet)
  • S3-compatible API for seamless integration with existing ML pipelines
  • Every read operation produces a cryptographic receipt at the storage layer — not a log, not a proxy, actual proof from within the storage protocol
  • DoubleZero fiber backbone: read latency optimized for high-frequency AI training workloads
  • Egress cost ~70% lower than AWS S3

Provenance Layer — Aptos Blockchain

  • Contribution events written as Aptos transactions at upload time
  • Shelby receipts anchored to Aptos at read time
  • Smart contracts enforce access rules and trigger micropayments
  • Move module handles: contributor registry, dataset versioning, payment distribution
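
To make the payment distribution concrete, here is a minimal off-chain sketch of a weight-proportional split. The `splitPayment` helper and the `Share` type are hypothetical; the README does not specify how the Move module rounds, so the rule used here (round down, then give the remainder to the first contributor) is an assumption.

```typescript
// Hypothetical helper: split `amount` (in the smallest currency unit)
// across contributors in proportion to their integer weights.
interface Share {
  contributor: string;
  weight: number;
}

function splitPayment(amount: number, shares: Share[]): Map<string, number> {
  const totalWeight = shares.reduce((sum, s) => sum + s.weight, 0);
  if (totalWeight <= 0) {
    throw new Error("no contributor weight to split against");
  }

  const payouts = new Map<string, number>();
  let distributed = 0;
  for (const s of shares) {
    // Integer division: each cut is rounded down.
    const cut = Math.floor((amount * s.weight) / totalWeight);
    payouts.set(s.contributor, (payouts.get(s.contributor) ?? 0) + cut);
    distributed += cut;
  }

  // Assumed rounding rule: the leftover goes to the first contributor,
  // so the payouts always sum to exactly `amount`.
  const first = shares[0].contributor;
  payouts.set(first, (payouts.get(first) ?? 0) + (amount - distributed));
  return payouts;
}
```

With an amount of 100 and weights 1 and 2, the cuts are 33 and 66, and the leftover unit goes to the first contributor (34 and 66).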

Local Metadata Layer — SQLite + Drizzle

  • Fast local queries for dataset listings, contributor stats, receipt history
  • Drizzle ORM with type-safe schema — datasets, contributions, receipts, users tables
  • SQLite for zero-config local dev, easily swappable for Turso in production

Tech Stack

Layer            Technology
Frontend         TanStack Start (SSR-first React framework)
API              tRPC (end-to-end type-safe, no REST boilerplate)
Auth             Better Auth (session management, wallet-compatible)
Database         SQLite via Drizzle ORM
Storage          Shelby Protocol (@shelby-protocol/sdk)
Blockchain       Aptos (@aptos-labs/ts-sdk), Move smart contracts
Monorepo         Turborepo (parallel builds, shared packages)
Linting          Biome (fast, zero-config formatter + linter)
Docs             Fumadocs (integrated documentation site)
MCP              Shelby MCP Server + custom Datavault MCP tools
Package Manager  pnpm

Project Structure

datavault/                          # Turborepo monorepo
├── apps/
│   ├── web/                        # TanStack Start app
│   │   ├── app/
│   │   │   ├── routes/
│   │   │   │   ├── index.tsx       # Landing
│   │   │   │   ├── contribute/     # Upload + labeling UI
│   │   │   │   ├── datasets/       # Marketplace
│   │   │   │   ├── receipts/       # Provenance explorer
│   │   │   │   └── dashboard/      # Contributor earnings
│   │   │   ├── server/
│   │   │   │   └── trpc/
│   │   │   │       ├── router.ts   # Root tRPC router
│   │   │   │       ├── dataset.ts  # Dataset procedures
│   │   │   │       ├── receipt.ts  # Provenance procedures
│   │   │   │       └── payment.ts  # Micropayment procedures
│   │   │   └── lib/
│   │   │       ├── shelby.ts       # Shelby client
│   │   │       └── aptos.ts        # Aptos client + receipt anchor
│   │   └── package.json
│   └── fumadocs/                   # Fumadocs documentation site
│       ├── content/
│       │   ├── quickstart.mdx
│       │   ├── api-reference.mdx
│       │   └── provenance.mdx
│       └── package.json
├── packages/
│   ├── db/                         # Drizzle schema + migrations
│   │   ├── schema/
│   │   │   ├── datasets.ts
│   │   │   ├── contributions.ts
│   │   │   ├── receipts.ts
│   │   │   └── users.ts
│   │   ├── drizzle.config.ts
│   │   └── package.json
│   ├── shelby/                     # Shelby SDK wrapper
│   │   ├── src/
│   │   │   ├── client.ts           # ShelbyNodeClient init
│   │   │   ├── upload.ts           # Dataset upload
│   │   │   ├── download.ts         # Dataset read + receipt capture
│   │   │   └── types.ts
│   │   └── package.json
│   ├── aptos/                      # Aptos + Move contract client
│   │   ├── src/
│   │   │   ├── account.ts
│   │   │   ├── receipt.ts          # anchor_read_receipt
│   │   │   └── payment.ts          # distribute_payment
│   │   └── package.json
│   ├── api/                        # API layer / business logic
│   ├── auth/                       # Authentication configuration & logic
│   ├── config/                     # Configuration package
│   ├── env/                        # Environment variables validation
│   ├── ui/                         # Shared shadcn/ui components and styles
│   └── mcp/                        # Datavault MCP server
│       ├── src/
│       │   ├── tools/
│       │   │   ├── upload-dataset.ts
│       │   │   ├── query-receipts.ts
│       │   │   └── verify-provenance.ts
│       │   └── index.ts
│       └── package.json
├── contracts/                      # Move smart contracts
│   ├── sources/
│   │   ├── dataset_registry.move
│   │   └── payment_splitter.move
│   └── Move.toml
├── biome.json                      # Biome config (linting + formatting)
├── turbo.json                      # Turborepo pipeline config
├── pnpm-workspace.yaml
└── package.json

Database Schema (Drizzle + SQLite)

// packages/db/schema/datasets.ts
import { integer, sqliteTable, text } from 'drizzle-orm/sqlite-core'

export const datasets = sqliteTable('datasets', {
  id: text('id').primaryKey(),
  name: text('name').notNull(),
  description: text('description'),
  shelbyBlobAddress: text('shelby_blob_address').notNull(),
  ownerAddress: text('owner_address').notNull(),
  version: integer('version').default(1),
  licenseType: text('license_type').notNull(), // pay-per-read | subscription | open
  pricePerRead: integer('price_per_read').default(0),
  createdAt: integer('created_at', { mode: 'timestamp' }).notNull(),
})

// packages/db/schema/contributions.ts
import { integer, sqliteTable, text } from 'drizzle-orm/sqlite-core'
import { datasets } from './datasets'

export const contributions = sqliteTable('contributions', {
  id: text('id').primaryKey(),
  datasetId: text('dataset_id').references(() => datasets.id),
  contributorAddress: text('contributor_address').notNull(),
  dataHash: text('data_hash').notNull(),
  aptosTxHash: text('aptos_tx_hash').notNull(), // on-chain proof
  weight: integer('weight').default(1),
  createdAt: integer('created_at', { mode: 'timestamp' }).notNull(),
})

// packages/db/schema/receipts.ts
import { integer, sqliteTable, text } from 'drizzle-orm/sqlite-core'
import { datasets } from './datasets'

export const receipts = sqliteTable('receipts', {
  id: text('id').primaryKey(),
  datasetId: text('dataset_id').references(() => datasets.id),
  readerAddress: text('reader_address').notNull(),
  shelbyReceiptHash: text('shelby_receipt_hash').notNull(), // storage-layer proof
  aptosTxHash: text('aptos_tx_hash').notNull(),             // blockchain anchor
  paid: integer('paid', { mode: 'boolean' }).default(false),
  createdAt: integer('created_at', { mode: 'timestamp' }).notNull(),
})
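
The `dataHash` column needs a deterministic digest of the contributed bytes. The README does not name the hash function, so the sketch below assumes hex-encoded SHA-256 via Node's built-in `crypto` module; `hashContribution` and `matchesRecordedHash` are hypothetical helper names.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: hex-encoded SHA-256 of the raw contribution bytes.
// SHA-256 is an assumption; the README does not specify the hash function.
function hashContribution(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Re-hash the bytes and compare against the value stored in `data_hash`.
function matchesRecordedHash(data: Uint8Array | string, recordedHash: string): boolean {
  return hashContribution(data) === recordedHash.toLowerCase();
}
```

Recomputing the digest later and comparing it with the stored value is one way a third party could confirm that a blob is the one that was registered.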

tRPC Router Design

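// apps/web/app/server/trpc/dataset.ts
import { createHash, randomUUID } from 'node:crypto'
import { eq } from 'drizzle-orm'
import { z } from 'zod'
// Assumed export surface and module paths; adjust to the actual packages
import { contributions, datasets, db, receipts } from '@datavault/db'
import { aptosClient } from '../../lib/aptos'
import { shelbyClient } from '../../lib/shelby'
import { protectedProcedure, publicProcedure, router } from './init'

export const datasetRouter = router({
  upload: protectedProcedure
    .input(z.object({
      name: z.string(),
      // A Buffer does not survive plain JSON transport; this assumes a binary-capable caller
      file: z.instanceof(Buffer),
      license: z.enum(['pay-per-read', 'subscription', 'open']),
    }))
    .mutation(async ({ input, ctx }) => {
      const datasetId = randomUUID()
      const createdAt = new Date()
      // 1. Upload to Shelby
      const blobAddress = await shelbyClient.upload(input.file)
      // 2. Register contribution on Aptos
      const aptosTx = await aptosClient.registerContribution(blobAddress, ctx.user.address)
      // 3. Save metadata to SQLite via Drizzle; the tx hash goes on the
      //    contribution row because `datasets` has no aptosTxHash column
      await db.insert(datasets).values({
        id: datasetId, name: input.name, shelbyBlobAddress: blobAddress,
        ownerAddress: ctx.user.address, licenseType: input.license, createdAt,
      })
      await db.insert(contributions).values({
        id: randomUUID(), datasetId, contributorAddress: ctx.user.address,
        dataHash: createHash('sha256').update(input.file).digest('hex'),
        aptosTxHash: aptosTx, createdAt,
      })
      return { datasetId, shelbyBlobAddress: blobAddress, aptosTxHash: aptosTx }
    }),

  read: protectedProcedure
    .input(z.object({ datasetId: z.string() }))
    .query(async ({ input, ctx }) => {
      // 1. Fetch from Shelby; the receipt is generated at the storage layer
      const { data, receiptHash } = await shelbyClient.download(input.datasetId)
      // 2. Anchor receipt to Aptos
      const aptosTx = await aptosClient.anchorReceipt(receiptHash, input.datasetId)
      // 3. Log receipt in SQLite, filling every NOT NULL column
      await db.insert(receipts).values({
        id: randomUUID(), datasetId: input.datasetId, readerAddress: ctx.user.address,
        shelbyReceiptHash: receiptHash, aptosTxHash: aptosTx, createdAt: new Date(),
      })
      return { data, receiptHash, aptosTxHash: aptosTx }
    }),

  listReceipts: publicProcedure
    .input(z.object({ datasetId: z.string() }))
    .query(({ input }) =>
      db.select().from(receipts).where(eq(receipts.datasetId, input.datasetId))
    ),
})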

Smart Contract Design (Move)

module datavault::dataset_registry {
    struct Contribution has key {
        contributor: address,
        dataset_id: vector<u8>,
        data_hash: vector<u8>,
        timestamp: u64,
        weight: u64,
    }

    struct ReadReceipt has key {
        shelby_receipt_hash: vector<u8>,
        dataset_id: vector<u8>,
        reader: address,
        timestamp: u64,
        paid: bool,
    }

    public entry fun register_contribution(
        contributor: &signer,
        dataset_id: vector<u8>,
        data_hash: vector<u8>,
        weight: u64,
    ) { ... }

    public entry fun anchor_read_receipt(
        reader: &signer,
        shelby_receipt_hash: vector<u8>,
        dataset_id: vector<u8>,
    ) { ... }

    public entry fun distribute_payment(
        dataset_id: vector<u8>,
        amount: u64,
    ) { ... }
}
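
The entry functions above take `vector<u8>` arguments, while the SQLite schema stores IDs and hashes as text, so a client has to convert between the two. The helpers below are a sketch of that conversion; `hexToBytes` and `utf8ToBytes` are hypothetical names, and treating hashes as hex and dataset IDs as UTF-8 is an assumption about the encoding.

```typescript
// Hypothetical helpers for preparing `vector<u8>` arguments on the client.

// Decode a hex string (with or without a 0x prefix) into raw bytes.
function hexToBytes(hex: string): Uint8Array {
  const clean = hex.startsWith("0x") ? hex.slice(2) : hex;
  if (clean.length % 2 !== 0 || /[^0-9a-fA-F]/.test(clean)) {
    throw new Error(`invalid hex string: ${hex}`);
  }
  const out = new Uint8Array(clean.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(clean.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

// Encode a textual ID as UTF-8 bytes.
function utf8ToBytes(value: string): Uint8Array {
  return new TextEncoder().encode(value);
}
```

Rejecting malformed hex on the client avoids submitting a transaction whose hash argument would not match the stored `data_hash` or `shelby_receipt_hash`.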

Data Flow: Training Read with Provenance

1. AI team calls tRPC dataset.read({ datasetId })
2. Better Auth verifies session / wallet signature
3. tRPC handler fetches dataset from Shelby via @shelby-protocol/sdk
4. Shelby generates cryptographic receipt (storage-layer proof)
5. Handler anchors receipt to Aptos: anchor_read_receipt(receipt_hash, dataset_id)
6. Aptos transaction confirmed (~600ms finality)
7. Receipt + aptos_tx_hash saved to SQLite via Drizzle
8. tRPC returns { data, receiptHash, aptosTxHash } to client
9. Verify: explorer.aptoslabs.com/txn/{aptosTxHash}?network=testnet
10. Payment distribution triggered per contributor weight
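
Steps 3 to 9 above can be sketched as one function with the Shelby, Aptos, and database calls injected, which keeps the flow testable without network access. The `ReadDeps` interface and `readWithProvenance` are hypothetical names for illustration; the real client signatures may differ.

```typescript
// Hypothetical ports standing in for the Shelby client, Aptos client, and database.
interface ReadDeps {
  download(datasetId: string): Promise<{ data: Uint8Array; receiptHash: string }>;
  anchorReceipt(receiptHash: string, datasetId: string): Promise<string>; // resolves to the tx hash
  saveReceipt(row: {
    datasetId: string;
    readerAddress: string;
    shelbyReceiptHash: string;
    aptosTxHash: string;
  }): Promise<void>;
}

async function readWithProvenance(deps: ReadDeps, datasetId: string, readerAddress: string) {
  // Steps 3-4: fetch the data; the storage layer returns a receipt hash with it.
  const { data, receiptHash } = await deps.download(datasetId);
  // Steps 5-6: anchor the receipt on-chain and wait for the transaction hash.
  const aptosTxHash = await deps.anchorReceipt(receiptHash, datasetId);
  // Step 7: record the receipt locally.
  await deps.saveReceipt({ datasetId, readerAddress, shelbyReceiptHash: receiptHash, aptosTxHash });
  // Step 9: link the caller can open to verify the anchor.
  const explorerUrl = `https://explorer.aptoslabs.com/txn/${aptosTxHash}?network=testnet`;
  // Step 8: return the data together with both proofs.
  return { data, receiptHash, aptosTxHash, explorerUrl };
}
```

If anchoring fails, nothing is written locally, so the SQLite receipt table never holds a row without an on-chain counterpart.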

Development Roadmap

Phase 1 — Core Infrastructure (Weeks 1–3)

  • Shelby package: upload / download / receipt capture
  • Aptos package: account management + anchor_read_receipt
  • Drizzle schema + migrations: datasets, contributions, receipts
  • tRPC procedures: dataset.upload, dataset.read, receipt.list
  • Milestone: First verifiable read receipt on Aptos testnet

Phase 2 — MVP Product (Weeks 4–8)

  • TanStack Start UI: contributor portal + dataset marketplace
  • Better Auth: wallet-based sign-in + session management
  • Dataset versioning (immutable versions on Shelby)
  • Provenance explorer page: receipts by dataset / contributor
  • Move contract: micropayment distribution
  • Fumadocs: API reference + quickstart guide
  • Milestone: End-to-end flow — contribute → train → receipt → payment

Phase 3 — Marketplace & MCP (Weeks 9–12)

  • Public dataset marketplace with on-chain licensing
  • Datavault MCP server: AI agents can query + verify datasets directly
  • HuggingFace Hub integration: push/pull with provenance
  • Model card auto-generation from receipt history
  • Milestone: Public beta, first paying dataset consumer

Why Shelby — Not S3, Not IPFS

Requirement               AWS S3   IPFS / Filecoin   Shelby
Cryptographic read proof
Low-latency AI reads
S3-compatible API
Micropayment native
Decentralized
Egress cost               High     Variable          ~70% lower

The cryptographic receipt at the storage layer is the only mechanism that makes trustless payment splitting possible. Without Shelby, contributor payments require trusting whoever runs the server. With Shelby, the receipt is proof — no trust needed.


Getting Started

# Clone
git clone https://github.com/redtoali/datavault
cd datavault

# Install all dependencies (pnpm workspaces + Turborepo)
pnpm install

# Set up environment variables
cp apps/web/.env.example apps/web/.env
# Fill in: APTOS_API_KEY, SHELBY_API_KEY, BETTER_AUTH_SECRET

# Run database migrations
pnpm --filter @datavault/db migrate

# Start all apps in parallel (Turborepo)
pnpm dev

# Lint + format with Biome
pnpm biome check .

# Build all packages
pnpm build
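
Since the setup depends on three environment variables, a startup check can fail fast when one is missing. This is a dependency-free sketch; `validateEnv` is a hypothetical helper and may not match what `packages/env` actually does.

```typescript
// Hypothetical validator; the variable names come from the setup step above.
const REQUIRED_ENV = ["APTOS_API_KEY", "SHELBY_API_KEY", "BETTER_AUTH_SECRET"] as const;
type EnvKey = (typeof REQUIRED_ENV)[number];

function validateEnv(env: Record<string, string | undefined>): Record<EnvKey, string> {
  // Treat unset and whitespace-only values as missing.
  const missing = REQUIRED_ENV.filter((key) => (env[key] ?? "").trim() === "");
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const validated = {} as Record<EnvKey, string>;
  for (const key of REQUIRED_ENV) {
    validated[key] = env[key] as string;
  }
  return validated;
}
```

Calling `validateEnv(process.env)` once at server start would surface a missing key before the first Shelby or Aptos request.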

What Datavault Delivers as Feedback to Shelby

  • Real-world receipt latency under high-frequency dataset reads (AI training workloads)
  • TypeScript SDK ergonomics in a full-stack tRPC + TanStack Start context
  • S3-compatibility edge cases for ML frameworks
  • Micropayment channel behavior under bulk read operations
  • MCP tool integration patterns for AI agent dataset access

License

MIT
