flowforge - Type‑safe-first Data Engineering

Let's be honest: most data pipeline frameworks treat types as suggestions. Config files are strings. Schemas are "validated" at runtime. Data quality is an afterthought. So let's do it differently.

Build pipelines that won’t even compile when contracts drift. Keep transformations pure, put effects at the edges, and run on Spark and Flink.

Why (beliefs)

  • Runtime schema drift burns weekends. We believe failures should move left - into the compiler.
  • Side‑effects inside transforms amplify retries/speculation. We believe effects belong at the edges and must be idempotent.
  • Engineers deserve fast, local feedback. We believe pure transformations and compile‑fail tests make data engineering joyful again.

A story:

"A partner team removed a nullable column late Friday. We couldn’t roll back in time; both teams were up all night. If that change had been a compile error, we would have slept."

For Python/ETL folks (dbt/Airflow/Informatica/Talend):

Think "contracts like Pydantic/Avro - but enforced before jobs run," "pure functions you can test without a cluster," and "connectors/engines that make IO explicit and safe."

For EMs / Staff Data Architects:

You get compile‑time guarantees (not CI or runtime heuristics), a small opinionated surface, and batteries‑included defaults with escape hatches.

How (principles)

  • Compile‑time contracts: SchemaConforms[Out, Contract, Policy] proves compatibility; policies include Exact, Backward, Forward (+ Ordered/CI/ByPosition). See docs/how-it-fails.md.
  • Typestate builder: build() exists only when source, transforms, and sink are present. Incomplete pipelines are unbuildable (see the sketch after this list).
  • Pure vs effect boundary: transforms are pure functions; F[_] only at IO edges; engines plug into a single algebra.
  • Pictures over prose: see flowchart.svg and optionality.md.
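To make the typestate idea concrete, here is a minimal sketch of the pattern (hypothetical names; this is not FlowForge's PipelineBuilder): build() only resolves once the type parameters prove every required step was supplied.

// Typestate sketch: phantom type parameters track which steps have been provided.
sealed trait Presence
sealed trait Present extends Presence
sealed trait Absent  extends Presence

final case class SketchBuilder[Src <: Presence, Snk <: Presence](steps: List[String]) {
  def withSource(path: String): SketchBuilder[Present, Snk] =
    SketchBuilder[Present, Snk](steps :+ s"source($path)")
  def withSink(path: String): SketchBuilder[Src, Present] =
    SketchBuilder[Src, Present](steps :+ s"sink($path)")
  // build() needs compile-time evidence that both a source and a sink were added.
  def build()(implicit src: Src =:= Present, snk: Snk =:= Present): List[String] = steps
}

object SketchBuilder {
  def start(name: String): SketchBuilder[Absent, Absent] = SketchBuilder[Absent, Absent](List(name))
}

// SketchBuilder.start("demo").build()                                    // ❌ does not compile: missing steps
// SketchBuilder.start("demo").withSource("/in").withSink("/out").build() // ✅ compiles

FlowForge's builder applies the same principle across source, transforms, and sink; the exact phantom types and evidence it uses may differ.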

What (The framework)

  • Core: contracts, builder, EffectSystem, DataAlgebra.
  • Engines: Spark (primary engine for 1.0), Flink (Scala 2.12 only).
  • Connectors: filesystem, JDBC, GCS (more coming).
  • Data Quality: native checks by default; optional Deequ when present.
  • Template: flowforge.g8 for new projects.

Diagrams (pictures > words)

Compile‑time contracts flow

Field vs Element Optionality

Scala 2 Magnolia UML

Scala 3 Mirrors UML

Quick links

Module status (coverage)

Nightly runs provide broader integration coverage.

  • Core: Core Coverage
  • Contracts: Contracts Coverage
  • Connectors: Connectors Coverage
  • Infrastructure: Infrastructure Coverage

Guarantees (Non‑negotiables)

  • Compile‑fail contracts for typed endpoints under policy lattice
  • Typestate builder: build() only when complete - incomplete pipelines can’t compile
  • Pure transforms; effectful edges; idempotent side‑effects by design (sketch after this list)
  • See: docs/design/framework-behaviors.md
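A minimal sketch of the shape this implies (illustrative code, not FlowForge's API): transforms stay deterministic and testable without a cluster, while IO sits at the edge and is written so a retry or speculative re‑run does not change the outcome.

import cats.effect.IO

// Pure transform: no IO, deterministic, unit-testable without a cluster.
final case class Purchase(userId: Long, amount: BigDecimal)
def dailyTotals(purchases: List[Purchase]): Map[Long, BigDecimal] =
  purchases.groupMapReduce(_.userId)(_.amount)(_ + _)

// Effectful edge: IO only at the boundary, idempotent by construction
// (overwrite a deterministic target rather than append).
def overwritePartition(path: String, rows: Map[Long, BigDecimal]): IO[Unit] =
  IO(println(s"overwrite $path with ${rows.size} rows")) // stand-in for a real sink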

10‑Minute quickstart

Prereq: JDK 17+, sbt 1.9+

1) Clone & build

git clone https://github.com/vim89/flowforge.git && cd flowforge
sbt compile

2) See a compile‑time contract failure (red → green)

// Paste in REPL or a scratch test to feel it
import com.flowforge.core.contracts._
final case class Out(id: Long)
final case class Contract(id: Long, email: String)
implicitly[SchemaConforms[Out, Contract, SchemaPolicy.Exact]] // ❌ compile‑time error (missing email)

Relax the policy to Backward (allows extra producer fields and missing optional/defaults):

implicitly[SchemaConforms[Out, Contract, SchemaPolicy.Backward]] // ✅ compiles under Backward

3) Build a pipeline - typestate forbids incomplete builds

import cats.effect.IO
import com.flowforge.core.PipelineBuilder
import com.flowforge.core.types._
import com.flowforge.core.contracts._

final case class User(id: Long, email: String)
val src  = TypedSource[User](LocalDataSource("/tmp/in", DataFormat.Parquet))
val sink = TypedSink[User](LocalDataSink("/tmp/out", DataFormat.Parquet))

PipelineBuilder[IO]("demo")
  .addTypedSource[User, User, SchemaPolicy.Exact](src, _ => IO.pure(User(1, "a@b")))
  .noTransform
  .addTypedSink[User, SchemaPolicy.Exact](sink, (_, _) => IO.unit)
  .build() // ✅ build is available only now
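
For contrast, dropping a required step leaves build() unreachable. A sketch of the idea (the exact compiler message depends on the builder's encoding):

// Illustrative only: with no sink added, build() is not available, so this does not compile.
// PipelineBuilder[IO]("demo")
//   .addTypedSource[User, User, SchemaPolicy.Exact](src, _ => IO.pure(User(1, "a@b")))
//   .noTransform
//   .build() // ❌ rejected at compile time: the pipeline is incomplete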

4) Explore diagrams and failure messages

Quickstart paths

  • Path A - Examples: try locally (no cluster) with sbt ffDev (compile + focused tests) and sbt ffRunSpark (Spark local[*]).
  • Path B - Red→Green: see a compile‑time error, then fix it - use the snippet above and run sbt compile.
  • Path C - New project: scaffold with g8 via sbt new flowforge.g8 --name="ff-demo" --organization="com.acme", then sbt test / sbt run.

Compatibility

  • JDK: 17+ (CI pinned to 17; Spark 3.5.x compatibility)
  • sbt: 1.9+
  • Scala: 2.13 (primary); Scala 3 for core only (no Spark deps)
  • Spark: 3.5.x (runs on Java 17)
  • Flink: Scala 2.12 only (Scala API constraints)

Flink (2.12)

Flink’s Scala API is 2.12‑only. The root build excludes Flink from the default aggregate so that +compile, +test:compile, and +test stay green for 2.13 (and Scala 3 where applicable). Build/test Flink explicitly when you need it:

# Compile Flink (Scala 2.12)
sbt "++2.12.* enginesFlink/compile"

# Run Flink tests (Scala 2.12)
sbt "++2.12.* enginesFlink/test"

References: Flink documents binary incompatibility across Scala lines and the need to select the matching _2.12 artifacts for the Scala API. See Flink’s docs on Scala versions and sbt cross‑build guidance.

Architecture (at a glance)

The diagrams above summarize derivation and policy checks; see also docs/diagrams/compile-time-contracts/guide.md for narrative.

Examples & demos

  • Examples module: modules/examples (runnable demos)
  • Optional Deequ mode: add -Dff.quality.mode=deequ (auto‑enables when on classpath)
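
For example, to try the Deequ mode against the local Spark task from the quickstart paths (a sketch, assuming the system property reaches the run's JVM):

# Enable Deequ-backed quality checks for a local run
sbt -Dff.quality.mode=deequ ffRunSpark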

Documentation map

Release & versioning

FAQ

  • Scala 3?
    • Core compiles on Scala 3; engines depend on Spark/Flink ecosystem (Spark 3.x limits Scala 3 today).
  • Why compile‑time vs tests?
    • Tests are sampled; compile‑time proofs are exhaustive for shapes and policy compatibility.
  • How does this compare to Databricks DLT/Dagster/dbt?
    • They perform runtime/CI checks; FlowForge enforces compile‑time gates and a typestate builder. See docs/evidence for deeper comparisons.

Contributing

We welcome folks from Python/ETL backgrounds and JVM veterans alike. Start with docs/contributing/HANDBOOK.md, then pick an issue. Please run sbt scalafmtAll and sbt scalafixAll before submitting.

License

Apache 2.0


Flowforge Hybrid Licensing Model

Flowforge adopts a hybrid licensing structure combining open innovation and IP protection.

  • Legacy / historical releases remain under MIT (for transparency and ecosystem continuity).
  • Active and future releases (v1.0 and onward) are licensed under AGPLv3 with additional Flowforge terms (“RESTRICTED COMMERCIAL & DERIVATIVE TERMS FOR FLOWFORGE” in LICENSE).
  • Commercial usage (offering as SaaS, embedding in proprietary systems, or internal closed-source deployments) requires a separate commercial license. See COMMERCIAL_LICENSE.md for template.
  • Contributor License Agreement (CLA) in CLA.md governs contribution terms, ensuring compatibility with the hybrid licensing framework.
  • Commercial exceptions and dual-licensing are handled directly by Vitthal Mirji for partners and enterprise use.

The goal: protect Flowforge’s compile-time innovation while keeping community use free and open.
