Arrow-based data I/O for Go — like io.Reader/io.Writer, but for databases and file formats.
db-toolkit provides a unified streaming interface for moving data between databases and file formats using Apache Arrow as the in-memory representation. It follows a pull model: readers produce BatchStreams of Arrow records, writers consume them, and dio.Copy wires them together.
```
   Reader               BatchStream            Writer
┌────────────┐        ┌──────────────┐      ┌────────────┐
│ Source     │ ─────▶ │ Schema()     │ ───▶ │ Destination│
│  Schema()  │        │ Next(ctx)    │      │            │
│  Read(ctx) │        │ Close()      │      │ WriteStream│
└────────────┘        └──────────────┘      └────────────┘
```
Features:
- Pull-model streaming — readers produce batches on demand, keeping memory usage bounded
- Filter pushdown — push predicates down to the source (SQL WHERE clauses, Parquet row-group pruning)
- Column projection — read only the columns you need
- Platform registry — database platforms self-register via init(), activated with a blank import
| Package | Description |
|---|---|
| dio | Core interfaces: Reader, Writer, BatchStream, Filter, Copy |
| dio/csv | CSV reader and writer |
| dio/parquet | Parquet reader and writer with row-group pruning via filter pushdown |
| dio/gen | Synthetic data generation for testing |
| platform | Platform registry (Register, Get, List) |
| platform/postgres | PostgreSQL reader, writer, and filter pushdown |
Quick start — stream a Postgres query to CSV on stdout:

```go
package main

import (
	"context"
	"os"

	"github.com/rnestertsov/db-toolkit/dio"
	"github.com/rnestertsov/db-toolkit/dio/csv"
	"github.com/rnestertsov/db-toolkit/platform"
	"github.com/rnestertsov/db-toolkit/platform/postgres"
)

func main() {
	ctx := context.Background()

	pg, err := platform.Get("postgres")
	if err != nil {
		panic(err)
	}

	conn, err := pg.OpenConnection(ctx, postgres.ConnectionConfig{
		Name: "mydb",
		User: "myuser",
		Host: "localhost",
		Port: "5432",
	})
	if err != nil {
		panic(err)
	}
	defer conn.Close(ctx)

	reader, err := conn.Query(ctx, "SELECT id, name, email FROM users")
	if err != nil {
		panic(err)
	}

	writer := csv.NewWriter(os.Stdout)
	if err := dio.Copy(ctx, reader, writer); err != nil {
		panic(err)
	}
}
```

A BatchStream is a pull-based iterator over Arrow record batches. Call Next(ctx) to get the next batch; it returns io.EOF when exhausted. Callers must call Release() on each returned record.
```go
stream, err := reader.Read(ctx)
if err != nil {
	panic(err)
}
defer stream.Close()

for {
	rec, err := stream.Next(ctx)
	if err == io.EOF {
		break
	}
	if err != nil {
		panic(err)
	}
	// process rec
	rec.Release()
}
```

Pass filters as ReadOptions to push predicates to the source. SQL databases translate them into WHERE clauses; Parquet readers use them for row-group pruning.
```go
stream, err := reader.Read(ctx,
	dio.WithFilter([]dio.Filter{
		dio.Eq{Column: 0, Value: 42},
		dio.GtEq{Column: 1, Value: "2024-01-01"},
	}),
)
```

Readers optionally implement dio.FilterClassifier to report how precisely they handle each filter (FilterExact, FilterInexact, or FilterUnsupported).
Read only the columns you need:
```go
stream, err := reader.Read(ctx,
	dio.WithProjection([]int{0, 2, 5}),
)
```

Platforms register themselves at import time. Activate a platform with a blank import, then retrieve it by name:
```go
import _ "github.com/rnestertsov/db-toolkit/platform/postgres"

pg, err := platform.Get("postgres")
```

| Platform | Package | Status |
|---|---|---|
| PostgreSQL | platform/postgres | Supported |
Adding a new platform? See docs/adding-a-platform.md.
- Go 1.24+
- Docker (for integration tests using testcontainers)
```sh
make build
make test
```

Integration tests spin up real database instances via testcontainers, so Docker must be running.
This project is licensed under the MIT License — see the LICENSE file for details.