Skip to content

tonivanhala/putki

Repository files navigation

putki

Get out the crane

Construction time again

What is it this time?

We’re laying a pipeline

— Depeche Mode
Pipeline (1983)

Overview

putki allows you to specify and run pipelines that process streaming data.

⚠️
Pre-alpha software

putki is currently experimental software which may change unexpectedly and without notice. Documentation may lag or describe future plans for the library. Use at your own risk!

Main design criteria

  • leverage data-driven approach to pipeline management, so that pipelines can be configured and modified using composition and core data functions.

  • support complex pipelines with branching, joining, and feedback cycles.

  • allow modifying pipelines while they are running.

  • provide tools for debugging pipelines and tracing the data flow.

  • provide a default runtime using a JVM thread pool, while allowing extension to other runtimes.

Future (stretch) goals

  • provide alternative runtimes for JVM, e.g., based on core.async.

  • define and run pipelines on NodeJS using ClojureScript.

  • provide alternative schedulers for orchestrating jobs.

  • performance tests for different configurations.

  • provide implementations for configuring pipelines with automatic retry and crash recovery, e.g., provide pipe implementation that persists intermediate data.

Getting started

Pipelines can be specified in hiccup-like syntax using nested vectors. Call putki.core/start! to start the pipeline, and emit a value with putki.core/emit to feed it through the pipeline.

(def graph [inc
            [#(/ % 3)
             [#(Math/ceil %)
              [println]]]])
(def pipeline (putki.core/start! graph))
(putki.core/emit pipeline 7)
3.0

Internally, putki.core/start! walks through the pipeline graph and transforms it into a workflow map. The map is fed into a workflow runner. The same actions can be split into explicit, separate, steps as below.

(def graph [inc ; (1)
            [#(/ % 3) ; (2)
             [#(Math/ceil %)
              [println]]]])
(def workflow (putki.core/init graph)) ; (3)
(def pipeline (putki.core/run! ; (4)
               (putki.core/local-thread-runner) ; (5)
               workflow))
  1. A pipeline graph can be defined with a nested vector.

  2. Any function taking a single argument can be directly used as a job.

  3. putki.core/init turns the graph into a map specifying jobs and their connections.

  4. putki.core/run! takes a runner and starts the workflow according to the map.

  5. The default workflow runner uses a local thread-pool to run and orchestrate jobs.

⚠️
There is no stopping

Stopping and resetting a running pipeline is currently not possible.

Examples

There shall be examples of how to use putki in common and educational scenarios.

TODO

  • ❏ Streaming JSON document

  • ❏ Implementing an ARMA filter

  • ❏ Processing nested data with a feedback loop

  • ❏ Simple Web server

  • ❏ Managing a pipeline as Integrant component

  • ❏ Using Integrant components as part of a pipeline

  • Streamz — Python data streaming

  • manifold — Clojure lib for asynchronous programming with a stream abstraction

  • Datasplash — Clojure API for Google Cloud Dataflow

  • Apache Spark Streaming — high-throughput distributed stream processing

License

putki is licensed under the Eclipse Public License v2.0.

About

Streaming data processing with pipes as data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages