Skip to content

Concepts

Eugene Lazutkin edited this page Jun 6, 2026 · 6 revisions

This page explains the ideas behind stream-chain — the problem it solves, how it solves it, and the handful of concepts everything else builds on. If you would rather learn by reading code, Intro by examples shows the same ideas as runnable pipelines.

The problem

stream-chain processes streams of objects — records, not raw bytes — and is built for the big cases: multi-gigabyte database dumps, append-only logs, message queues, and other feeds that are generated continuously or are simply too large to hold in memory. You want to process them one record at a time: transform, filter, and enrich as they flow.

That sounds easy — chain a few functions and you are done. Real pipelines are harder, for four reasons:

  • Filtering — a stage may want to drop a record (produce nothing).
  • Fan-out — a stage may want to turn one record into several.
  • Asynchronous work — stages talk to databases, networks, and files, so they have to await.
  • Backpressure — stages run at different speeds. If a fast producer outruns a slow consumer, records pile up in memory until the process dies. Keeping memory flat means making the fast side wait for the slow side — the essence of backpressure.

Node.js streams solve backpressure, but wiring a pipeline out of them by hand is verbose and easy to get wrong.

Why streams?

Given the problem, why build on streams at all — rather than arrays or plain callbacks? Two reasons, both about scale:

  • I/O lives at the edges. Your data sits in files, sockets, and pipes — byte and text streams that are read sequentially, which is exactly what the runtime and the OS are fastest at. Streams meet the data where it already is.
  • Backpressure. When stages run at different speeds, streams let the slow side push back on the fast side, so records never pile up in memory. This is the one feature that makes constant-memory processing of an unbounded feed possible, and it is the main reason stream-chain is built on streams.

Everything else — the functions, the generators, the composition — exists to make streams pleasant to use without giving up either of these.

The idea

Write each stage as an ordinary function (or generator). stream-chain wires your functions — alongside any streams you already have — into a single pipeline that handles the streaming, the backpressure, and the lifecycle for you. The result is a standard Duplex stream (or a Web Streams pair, or a bare async iterable — your choice of substrate), so it drops into pipelines you already have.

You think about what each stage does to one record. The library handles how the records flow.

import chain from 'stream-chain';

const pipeline = chain([
  record => (record.active ? record : chain.none), // filter
  async record => ({...record, rate: await lookup(record.id)}), // async enrich
  record => chain.many([record, summarize(record)]) // fan-out
]);

The model: one input, zero-to-many outputs

The one idea to internalize: a stage maps each input to any number of outputs — zero, one, or many. This is the generator model, and it is what makes pipelines expressive enough for real work.

A generator stage expresses it directly and lazily:

function* (record) {
  if (!record.active) return; // zero outputs: drop it
  yield record; // one output...
  if (record.flagged) yield audit(record); // ...or more
}

A generator yields values one at a time and never accumulates, so memory stays flat no matter how much a stage fans out.

The generator is the pure expression of the model. So why write a plain function at all? Performance. A plain function — even an async one — is faster than an async generator, which carries per-step overhead, so most stages are leaner written plain. stream-chain makes plain functions first-class and gives them a small vocabulary to express the same zero-to-many shape:

  • Return a value — pass it on.
  • Return chain.none — produce nothing (drop the record).
  • Return chain.many([...]) — produce several values at once.

chain.none and chain.many are the transport that lets a plain function say what a generator says natively — so you get the generator model's expressiveness at a plain function's speed. many() collects its values in an array, so it holds them all in memory — fine for small, bounded fan-out, but for large or unbounded fan-out a generator stage is leaner. (many([]) can express "no values", but chain.none is the cheaper, clearer way to say it — which is why it exists.)

Two more, for advanced control (they work only inside gen() / fun() segments, not raw streams):

  • chain.finalValue(x) — emit x and skip the remaining stages of the block: the way to short-circuit a function pipeline when a stage hits an error or a sentinel (it is not passed to the rest of the chain).
  • chain.stop — stop the pipeline early (useful for terminating otherwise-infinite sources).

Any stage may also be async (return a Promise) — the pipeline waits for it, holding backpressure while it does.

A note on null / undefined

At the stream boundary — inside chain() and asStream() — returning null or undefined also drops the record, because Node streams reserve those values for end-of-stream signalling. But the lower-level compositors gen() and fun() pass null/undefined through as ordinary values. So chain.none is the one signal that means "drop" everywhere — prefer it over returning null when you want the same behaviour in every context.

Backpressure, handled

Backpressure is what keeps memory flat on a huge or never-ending stream. When a downstream stage is slow, stream-chain propagates the pressure back up the chain, so upstream stages — including async ones mid-await — pause until there is room. You never buffer the whole stream; only a small, bounded number of records are in flight at once.

This is why the library is built for huge streams specifically: the same pipeline that processes a 10 KB file processes a 10 TB one in constant memory. The native Web Streams adapter (asWebStream()) extends this to per-item backpressure. The mechanics, the knobs, and the highWaterMark setting are covered in highWaterMark.

Building blocks: transducers and adapters

Beneath chain() are two kinds of primitive.

Transducers compose a list of functions into one composed function, with no stream substrate involved:

  • gen() — composes the functions into an async generator: feed it one input, iterate zero-to-many outputs, lazily. It is the memory-flat default, and it is what chain() groups your consecutive functions into for speed.
  • fun() — composes them into a plain function that returns the whole batch of outputs for one input (collected in a many()). It predates async generators and is slightly faster on purely synchronous pipelines, but it holds every output per input in memory — so it is an explicit import, not a default. Reach for it only when per-input output is small and bounded.

Adapters lift a single function onto a stream substrate, when you need a standalone, reusable stream component:

  • asStream() — wraps a function as a Node Duplex.
  • asWebStream() — wraps a function as a Web Streams {readable, writable} pair, with per-item backpressure.

chain() ties it together: it takes an array of functions, arrays, and streams and returns one Duplex you compose with .pipe() like any other stream. Crucially, it groups runs of consecutive functions into a single gen() instead of giving each its own stream: a Duplex per stage would pay Node's stream-machinery cost at every boundary, so merging the function-only runs keeps those boundaries off the hot path. (How performance shaped this collects the reasoning.)

Three substrates: node, web, core

The same composition model runs on three substrates. The code you write is the same; pick by environment:

  • stream-chain / stream-chain/node (default) — Node.js streams. The right choice for server-side work.
  • stream-chain/web — native Web Streams, no Node-stream dependency. Browser-safe. (/node and /web are interchangeable at the chain() level — the same model and the same code inside the chain — but the streams they hand back have different APIs, so switching substrates is a real port of the surrounding code, not just an import change; see Performance.)
  • stream-chain/core — substrate-free async iterables: no node:stream, no Web Streams, just for await...of. The leanest base for embedding.

See the subpath split for the details and migration notes.

Getting objects in and out

stream-chain works in object mode: it moves records, not raw text or bytes. Turning bytes into objects, and objects back into bytes, happens at the edges of the pipeline. A source or sink can be anything that yields or accepts objects — a database cursor, an HTTP response, stream-json assembling tokens into objects, or your own generator.

For the most common on-disk case, JSONL (JSON Lines) ships in the box: a parser and a stringer, plus fused file-edge composites parseFile() / stringerToFile() that read and write JSONL files as a single pipeline stage. JSONL is the batteries-included I/O for object streams — its page covers the full surface.

TypeScript: a first-class binding

stream-chain is implemented in JavaScript, but its TypeScript bindings are not an afterthought — since 3.x they are part of the design, and one of the larger investments in the library. The type machinery is deliberately sophisticated: it deduces the input and output types of a pipeline from the stages you compose — including chains nested inside chains — so the type of chain([...]) reflects what actually flows through it (see typed streams).

This pays off twice. It helps you write correct TypeScript — a stage whose input does not match the previous stage's output is caught at compile time, not at runtime. And it pays off in a modern editor even when you write plain JavaScript: the same types drive autocompletion, signature hints, and inline documentation, so typing and inspecting stream-chain code becomes a source of self-help rather than a trip to the docs.

How performance shaped this

The model is simple; the implementation is shaped almost entirely by one fact — stream-chain is built for huge data, where a tiny per-record cost multiplied by billions of records becomes hours or days. Small, compounding gains are worth real design effort, and three of the choices above come straight from that:

  • Plain functions are first-class, not only generators — because a plain function (even async) outruns an async generator.
  • Consecutive functions are merged into one segment — because a stream per stage would pay Node's stream-machinery cost at every boundary.
  • Two stream substrates, not one — Web Streams are portable and convenient but currently slower than Node streams, so stream-chain keeps both rather than standardising on the slower one.

None of this changes what you write — it stays "a function (or generator) per stage." It only changes what runs underneath. For the hands-on side — when streaming helps, how to keep stages on the fast path, and the numbers — see Performance and Benchmarks.

Where to go next

  • Intro by examples — these concepts as runnable pipelines.
  • chain() — the full factory API and every return-value rule.
  • JSONL — reading and writing object streams from files.
  • highWaterMark — tuning the flow and backpressure.
  • Performance — hints and best practices for going fast.
  • Benchmarks — how performance is measured.

Clone this wiki locally