db-toolkit

Arrow-based data I/O for Go — like io.Reader/io.Writer, but for databases and file formats.

Overview

db-toolkit provides a unified streaming interface for moving data between databases and file formats using Apache Arrow as the in-memory representation. It follows a pull model: readers produce BatchStreams of Arrow records, writers consume them, and dio.Copy wires them together.

    Reader           BatchStream           Writer
┌────────────┐    ┌──────────────┐    ┌────────────┐
│ Source     │───▶│ Schema()     │───▶│ Destination│
│ Schema()   │    │ Next(ctx)    │    │            │
│ Read(ctx)  │    │ Close()      │    │ WriteStream│
└────────────┘    └──────────────┘    └────────────┘

Features:

  • Pull-model streaming — readers produce batches on demand, keeping memory usage bounded
  • Filter pushdown — push predicates down to the source (SQL WHERE clauses, Parquet row-group pruning)
  • Column projection — read only the columns you need
  • Platform registry — database platforms self-register via init(), activated with a blank import

Package Layout

Package            Description
dio                Core interfaces: Reader, Writer, BatchStream, Filter, Copy
dio/csv            CSV reader and writer
dio/parquet        Parquet reader and writer with row-group pruning via filter pushdown
dio/gen            Synthetic data generation for testing
platform           Platform registry (Register, Get, List)
platform/postgres  PostgreSQL reader, writer, and filter pushdown

Quick Start

package main

import (
    "context"
    "os"

    "github.com/rnestertsov/db-toolkit/dio"
    "github.com/rnestertsov/db-toolkit/dio/csv"
    "github.com/rnestertsov/db-toolkit/platform"
    "github.com/rnestertsov/db-toolkit/platform/postgres"
)

func main() {
    ctx := context.Background()

    pg, err := platform.Get("postgres")
    if err != nil {
        panic(err)
    }

    conn, err := pg.OpenConnection(ctx, postgres.ConnectionConfig{
        Name: "mydb",
        User: "myuser",
        Host: "localhost",
        Port: "5432",
    })
    if err != nil {
        panic(err)
    }
    defer conn.Close(ctx)

    reader, err := conn.Query(ctx, "SELECT id, name, email FROM users")
    if err != nil {
        panic(err)
    }

    writer := csv.NewWriter(os.Stdout)

    if err := dio.Copy(ctx, reader, writer); err != nil {
        panic(err)
    }
}

Key Concepts

BatchStream

A BatchStream is a pull-based iterator over Arrow record batches. Call Next(ctx) to get the next batch; it returns io.EOF when exhausted. Callers must call Release() on each returned record.

stream, err := reader.Read(ctx)
if err != nil {
    return err
}
defer stream.Close()

for {
    rec, err := stream.Next(ctx)
    if errors.Is(err, io.EOF) {
        break
    }
    if err != nil {
        return err
    }
    // process rec
    rec.Release()
}

Filter Pushdown

Pass filters as ReadOptions to push predicates to the source. SQL databases translate them into WHERE clauses; Parquet readers use them for row-group pruning.

stream, err := reader.Read(ctx,
    dio.WithFilter([]dio.Filter{
        dio.Eq{Column: 0, Value: 42},
        dio.GtEq{Column: 1, Value: "2024-01-01"},
    }),
)

Readers optionally implement dio.FilterClassifier to report how precisely they handle each filter (FilterExact, FilterInexact, or FilterUnsupported).
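As an illustration, a SQL-backed reader might classify filter types it can translate into WHERE clauses as exact and anything it does not recognize as unsupported. The concrete types below are a hedged sketch — the names mirror the ones mentioned above, but the real dio definitions may differ:

```go
package main

import "fmt"

// FilterKind mirrors the classification levels the README names:
// FilterExact, FilterInexact, FilterUnsupported. Shapes are assumptions.
type FilterKind int

const (
	FilterExact FilterKind = iota
	FilterInexact
	FilterUnsupported
)

type Filter interface{ isFilter() }

// Eq and GtEq are sketches of the predicate types used in the example above.
type Eq struct {
	Column int
	Value  any
}

type GtEq struct {
	Column int
	Value  any
}

func (Eq) isFilter()   {}
func (GtEq) isFilter() {}

// classify reports how precisely a hypothetical SQL reader applies each
// filter: Eq and GtEq translate directly into WHERE clauses (exact), and
// anything it does not recognize is left for the caller (unsupported).
func classify(f Filter) FilterKind {
	switch f.(type) {
	case Eq, GtEq:
		return FilterExact
	default:
		return FilterUnsupported
	}
}

func main() {
	filters := []Filter{
		Eq{Column: 0, Value: 42},
		GtEq{Column: 1, Value: "2024-01-01"},
	}
	for _, f := range filters {
		fmt.Printf("%T -> %v\n", f, classify(f))
	}
}
```

A Parquet reader would instead report FilterInexact for predicates it applies via row-group min/max statistics, since pruning can skip row groups but not individual rows.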

Column Projection

Read only the columns you need:

stream, err := reader.Read(ctx,
    dio.WithProjection([]int{0, 2, 5}),
)
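For a source with no native projection support, a reader can apply the index list itself. A minimal sketch over plain string rows — the real library selects Arrow columns, so this is only the shape of the operation:

```go
package main

import "fmt"

// project selects the given column indices from a row, in order.
// A stand-in for Arrow column selection; out-of-range indices would
// panic, matching slice semantics.
func project(row []string, cols []int) []string {
	out := make([]string, 0, len(cols))
	for _, c := range cols {
		out = append(out, row[c])
	}
	return out
}

func main() {
	row := []string{"id", "name", "email", "phone", "city", "zip"}
	fmt.Println(project(row, []int{0, 2, 5})) // [id email zip]
}
```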

Platform Registry

Platforms register themselves at import time. Activate a platform with a blank import, then retrieve it by name:

import _ "github.com/rnestertsov/db-toolkit/platform/postgres"

pg, err := platform.Get("postgres")
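A minimal version of such a registry — a sketch, not the actual platform package — is a mutex-guarded map populated from each platform package's init function, which is why a blank import is enough to activate a platform:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Platform is a stand-in for whatever interface the real platform
// package defines; here it only carries a name.
type Platform interface{ Name() string }

var (
	mu        sync.RWMutex
	platforms = map[string]Platform{}
)

// Register is called from a platform package's init, so importing the
// package (even blankly) makes the platform available.
func Register(p Platform) {
	mu.Lock()
	defer mu.Unlock()
	platforms[p.Name()] = p
}

// Get looks up a previously registered platform by name.
func Get(name string) (Platform, error) {
	mu.RLock()
	defer mu.RUnlock()
	p, ok := platforms[name]
	if !ok {
		return nil, fmt.Errorf("platform %q not registered", name)
	}
	return p, nil
}

// List returns the registered platform names in sorted order.
func List() []string {
	mu.RLock()
	defer mu.RUnlock()
	names := make([]string, 0, len(platforms))
	for n := range platforms {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

type postgresPlatform struct{}

func (postgresPlatform) Name() string { return "postgres" }

// In the real layout this init lives in platform/postgres.
func init() { Register(postgresPlatform{}) }

func main() {
	pg, err := Get("postgres")
	if err != nil {
		panic(err)
	}
	fmt.Println(pg.Name(), List()) // postgres [postgres]
}
```

This is the same pattern the standard library uses for database/sql drivers, where sql.Register is called from each driver's init.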

Supported Platforms

Platform    Package            Status
PostgreSQL  platform/postgres  Supported

Adding a new platform? See docs/adding-a-platform.md.

Development

Prerequisites

  • Go 1.24+
  • Docker (for integration tests using testcontainers)

Build

make build

Test

make test

Integration tests spin up real database instances via testcontainers, so Docker must be running.

License

This project is licensed under the MIT License — see the LICENSE file for details.
