Releases · thanhtham010891/agora-etl

16 Jun 15:42

v0.4.0

8290969

Release v0.4.0 Latest

Latest

agora-etl-rs becomes Agora's optional
acceleration substrate for measured core hot paths while Python keeps runtime
semantics.

Added

agora.core.acceleration as the single optional acceleration boundary.
DeliveryConfig(acceleration_mode=...) with auto, off, and required.
Declarative [performance] config with acceleration and profile.
Acceleration status in Pipeline.explain(), run summaries, tracing
attributes, agora run --plan, agora doctor, and agora doctor --json.
Performance profiles that resolve to explicit runtime settings while keeping
manual overrides authoritative.
Benchmark matrix runner for 0.3.3 baseline, pure Python 0.4.0, and
Rust-enabled 0.4.0 runs.

Changed

Runtime modules no longer import agora_rs directly.
Existing Rust primitives are selected through the acceleration facade.
Direct flush, Rust prefetch, checkpoint state, metrics accumulator, and linear
batch buffer decisions are visible in runtime metrics.
throughput profile is the opt-in gate for Rust source prefetch outside
buffered lanes until release benchmarks justify broader activation.

Removed

agora.core.runtime.RecordDeliveryCoordinator
agora.core.sink._WRITE_OK
agora.core.context compatibility exports:
_NOOP_SPAN_SCOPE, _BoundLogger, _NoopSpanScope, _PipelineSpanScope,
and _normalize_trace_value
Legacy source/sink data-plane bool flag inference
BaseSource.runtime_counters() compatibility shim

Release Rule

Performance claims must be backed by benchmark JSON. Candidate Rust expansions,
including JSONL batch encoding and Arrow IPC transport, remain benchmark-gated.

Assets 2

12 Jun 15:08

thanhtham010891

v0.3.3

b82bbe2

Release v0.3.3

Focus

0.3.3 is a short additive release that makes the 0.3.x AI and
observability story easier to trust in real worker deployments.

This release is about tightening existing extension paths rather than opening a
new platform phase:

the AI provider contract now matches real provider capabilities
Anthropic is now part of the official plugin bundle
config-driven workers can enable OpenTelemetry through the existing tracing path
short-term release docs and packaging now match the shipped support story

Highlights

AI provider capabilities are now explicit

Agora now treats completion and embeddings as separate capabilities instead of
implying every provider must implement both.

AIProvider remains as the compatibility-facing completion contract for
existing middleware, while the codebase now exposes clearer capability helpers:

CompletionProvider
EmbeddingProvider
require_completion_provider(...)
require_embedding_provider(...)

This makes completion-only providers such as Anthropic first-class without
forcing fake embedding implementations, and it gives embedding-dependent paths a
clear failure mode when the wrong provider is supplied.

Anthropic is now part of the official first-party bundle

Anthropic support has been promoted into agora-etl-plugins through the new
anthropic extra.

The official import and install story is now:

pip install "agora-etl-plugins[anthropic]"
from agora_plugins.anthropic import AnthropicProvider

The supported scope stays intentionally narrow and honest:

completion
structured JSON output
completion-driven AI middlewares

It still does not promise:

embeddings
embedding-based classification
embedding stores

This keeps the public bundle story truthful while still making Anthropic an
official first-party integration.

Config-driven workers can now turn on OTel without custom bootstrap code

Worker startup can now be driven directly from agora/v1 config, including
top-level worker settings and per-pipeline schedules.

Tracing config also gained an auto_configure path for OpenTelemetry. When
enabled, Agora now:

reuses an existing global tracer provider if one is already configured
otherwise auto-configures an OTLP exporter when the optional OTel SDK and
exporter packages are installed

This keeps observability on the existing tracing path instead of introducing a
second tracing system.

Docs and packaging now tell a cleaner release story

The 0.3.3 docs now align better with the real public support boundary:

plugin docs position Anthropic as part of the official bundle
worker and observability docs show the config-only OTel path
changelog and release metadata now match the promoted plugin surface

Upgrade notes

For AI users

completion-only providers are now valid without embedding methods
embedding-based features still require an embedding-capable provider
Anthropic should now be installed from agora-etl-plugins[anthropic]

For worker operators

agora worker --config ... can now build a WorkerPool directly from TOML
each config-driven worker pipeline must define [pipelines.<name>.schedule]
OTel config can now stay declarative when the optional OTel packages are
installed

For plugin users

agora-etl-plugins now targets agora-etl>=0.3.3
Anthropic is now part of the official bundle instead of a separate release
story

Validation

Release validation for 0.3.3 includes:

full package-local coverage for the Anthropic official bundle path
worker/config/tracing regression coverage for OTel auto-wiring
strict docs builds for the main Agora documentation site
full monorepo CI for agora-etl and agora-etl-plugins

Assets 2

10 Jun 10:22

thanhtham010891

v0.3.2

d781bd1

Release v0.3.2

Focus

0.3.2 turns the refactored 0.3.x core into a tighter release contract.

This release is about hardening what already exists:

explicit public boundaries
preservation-backed runtime behavior
clearer plugin contracts
stronger diagnostics, security posture, and performance guardrails

Highlights

Public API boundaries are now frozen more deliberately

Agora now treats the package facades as explicit release contracts:

agora
agora.core
agora.core.<domain>
agora.core.runtime as an advanced public facade

Architecture tests now fail when public surfaces drift, when compatibility
exports are undocumented, or when plugin-facing code leaks into underscore
support modules by accident.

Runtime guarantees are preservation-backed end to end

The 0.3.x runtime guarantee matrix is now fully preservation-backed. This
covers lane selection, ordering, checkpoint cadence, DLQ routing, retry
boundaries, shutdown/cancellation behavior, and process/buffered invariants.

This release is therefore much safer for continued internal refactor: the code
can keep moving without reopening ambiguity about preserved runtime semantics.

Observability and health diagnostics are safer

Built-in health payloads now redact and truncate last_error before exposing it
through metrics/health surfaces.

Plugin diagnostics are also clearer:

agora plugins list now shows broken entry-points explicitly
entry-point key collisions with built-ins/public keys are surfaced as
diagnostics instead of silently winning by import order
agora doctor is more explicit about its trust boundary during config import
and pipeline-build checks

Plugin contract cleanup is now reflected in code, tests, and docs

The official agora-etl-plugins bundle and the incubating plugin packages now
follow the updated contract more consistently:

MANIFEST compatibility binds to AGORA_PLUGIN_MANIFEST_VERSION instead of
hard-coded stale values
package metadata points at the real agora-etl core artifact
plugin contract tests now fail when plugin packages depend on internal Agora
support modules or on legacy module paths that now have public replacements

For 0.3.x, two advanced public plugin modules remain intentionally supported:

agora.core.registry for MANIFEST compatibility binding
agora.core.retry for shared retry helpers

Benchmark coverage now has a concrete maintenance budget

0.3.2 records the first post-refactor benchmark baseline and a simple perf
budget for future maintenance work.

The benchmark split now covers:

file-backed lane throughput
runtime orchestration-heavy shapes such as buffered execution, sink fan-out,
and observability overhead
Arrow process transport microbenchmarks

This makes it possible to catch hot-path regressions intentionally instead of
relying on anecdotal before/after impressions.

Upgrade notes

For builders

Prefer agora and documented agora.core.<domain> facades over file-level
modules.
Avoid underscore-prefixed imports even if a compatibility export still exists
in 0.3.x.

For plugin authors

Bind MANIFEST compatibility from agora.core.registry.
Prefer agora.state, agora.runner, and the documented plugin-contract
modules in docs/plugins/contract.md.
Treat agora.core.registry and agora.core.retry as advanced public plugin
contracts in 0.3.x.

Deprecated aliases retained through `0.3.x`

agora.core.runtime.RecordDeliveryCoordinator
agora.core.sink._WRITE_OK
agora.core.context._NOOP_SPAN_SCOPE
agora.core.context._BoundLogger
agora.core.context._NoopSpanScope
agora.core.context._PipelineSpanScope
agora.core.context._normalize_trace_value

These remain compatibility bridges in 0.3.x and are still targeted for
removal in 0.4.0.

Validation

Release validation for 0.3.2 includes:

public surface / architecture tests
preservation coverage for runtime guarantees and plugin contract behavior
package-local CI for agora-etl-plugins
benchmark baselines for the canonical Phase 5 scenarios

Assets 2

09 Jun 06:41

thanhtham010891

v0.3.1

b133c18

Release v0.3.1

Focus

0.3.1 hardens the public runtime contract shipped in 0.3.0.

This release closes semantic gaps between runtime docs and real execution,
tightens lane/data-plane planning, improves recovery explainability at the CLI,
and adds preservation coverage so these guarantees do not drift silently.

Highlights

Runtime planning and delivery semantics are more stable

Agora now keeps source, planner, writer, and sink decisions aligned more
consistently across the public runtime entrypoints and internal lane
orchestration.

Key hardening areas include:

lane selection and data-plane validation across pipeline.py,
executor.py, source.py, sink.py, and _plan.py
buffered and ordered delivery behavior in _buffered.py,
_delivery.py, and _lanes.py
startup and shutdown rollback so partially opened sink graphs do not leak
state when later components fail to open
checkpoint and source-delivery gating so acknowledgement remains tied to
successful downstream handling

This release is intentionally about making 0.3.x safer to trust under stress,
not about adding a brand-new execution model.

Public data-plane contract is now easier for plugin authors to follow

Agora now exposes and documents the preferred contract helpers:

source_data_plane_spec()
sink_data_plane_spec()
SourceDataPlaneSpec
SinkDataPlaneSpec

The legacy capability booleans still work in 0.3.x, but docs now steer
custom source and sink authors toward explicit data-plane specs instead of
guessing from older flags.

Related docs were tightened in:

docs/source/custom.md
docs/sink/custom.md
docs/plugins/contract.md

Recovery UX is more explainable from the CLI

Two recovery-oriented CLI surfaces are now much more actionable:

agora checkpoint inspect
agora diagnose

They now expose runtime recovery posture more directly, including details such
as recovery support, resume keys, checkpoint granularity, resume behavior, and
warnings when resuming large file-backed sources may be costly.

This makes it easier to answer practical operator questions like:

can this pipeline actually resume?
what unit of progress is persisted?
what does "resume" mean for this source?
is recovery likely to be cheap or expensive?

`agora doctor` is now a stronger preflight tool

agora doctor now goes beyond basic configuration shape checks.

It can validate:

selected pipeline resolution from config
import references used by config-backed components
pipeline buildability before execution
recovery posture
DLQ replay support
required environment variables for the selected pipeline

This helps catch miswired configs earlier, before a real run discovers them the
hard way.

Spawn safety and test isolation are tighter

The CLI path bootstrapping helper now ignores mocked or otherwise non-pathlike
ctx.cwd values when adding project paths to sys.path.

That closes a subtle failure mode where test doubles could pollute process-spawn
state and later break process-batch execution with pickling failures.

Tests

Release validation for 0.3.1 includes:

full package CI from the checked-out source tree
preservation coverage for runtime guarantees, recovery edge cases, plugin
contract, and baseline behaviors
targeted CLI coverage for doctor, checkpoint inspect, and diagnose
public API coverage for the data-plane helper surface

New or expanded test areas include:

tests/core/test_doctor_cli.py
tests/core/test_recovery_cli.py
tests/core/test_data_plane_public_api.py
tests/preservation/test_runtime_guarantees.py
tests/preservation/test_recovery_edge_cases.py

Assets 2

05 Jun 16:02

thanhtham010891

v0.3.0

5e490ff

Release v0.3.0

Focus

Process-isolated batch transforms for CPU-heavy or memory-isolated workloads,
with recovery semantics strong enough for long-running batch pipelines.

Highlights

Source-bound `max_records` limits

run(max_records=N) now applies a source wrapper before execution starts
instead of stopping later inside the runtime lanes.

This makes bounded runs behave the same across:

.build()
.fan_out()
.route()
.run()
.run_sync()

For batch and Arrow sources, the final emitted batch is now trimmed at the
source boundary instead of overshooting and stopping after a full batch has
already been processed.

Middleware data-plane validation is now explicit

Agora now validates source and middleware data planes during runtime planning.

Supported shapes:

Arrow-emitting source plus all-Arrow middleware chain
Arrow-emitting source plus all Python-row middleware chain
Python-row source plus all Python-row middleware chain

Rejected shape:

any chain that mixes Arrow middleware with Python-row/list-batch middleware

This now fails fast with a PipelineError instead of silently switching data
shape mid-chain.

Arrow fan-out now chooses the best path per sink

When the pipeline stays on the Arrow chain and the writer is a fan-out:

sinks with write_arrow_batch() receive the original RecordBatch
sinks without Arrow support receive list[dict] fallback only at the sink boundary

This keeps Arrow-native sinks fast without forcing every sink in the fan-out to
implement the same write contract.

Execution semantics now expose first-class data planes

Agora now resolves one shared vocabulary across source, middleware, writer, and
sink boundaries:

python_rows
python_batches
arrow_batches

The runtime plan now records:

the plane emitted by the source
the plane that actually entered the writer
which sinks had to downgrade at the writer/sink boundary

That contract is also introspectable from user code through
BaseSource.data_plane_spec(), BaseSource.emitted_data_plane, and
BaseSink.data_plane_spec().

The shared vocabulary is now part of the root public API too:

from agora import DataPlane
from agora import SourceDataPlaneSpec, SinkDataPlaneSpec

Legacy batch/Arrow bool flags are now deprecated

The older data-plane booleans still work in 0.3.x, but Agora now treats them
as compatibility shims instead of the primary extension contract.

Deprecated source flags:

supports_batch_emit
emits_arrow_batches

Deprecated sink flags:

batch_writable_native
arrow_passthrough_native

Deprecated capability fields:

SinkCapabilities.batch_writable_native
SinkCapabilities.arrow_passthrough_native

Preferred replacement:

sources override data_plane_spec() and return SourceDataPlaneSpec
sinks advertise accepted_data_planes / native_data_planes, or return
explicit planes from sink_capabilities()

When Agora has to infer a non-row plane from the legacy booleans, it now emits
DeprecationWarning once per source/sink class. The compatibility bridge is
planned for removal in 0.4.0.

Built pipelines can now explain their planned execution shape

BoundPipeline.explain() now returns a PipelineExplain snapshot without
starting the source. This surfaces:

planned lane selection
source and writer data planes
middleware compatibility matrix
per-sink selected plane and downgrade markers

`CsvSink` now reports Arrow-native writes vs downgrade-to-rows

CsvSink.write_arrow_batch() still prefers pyarrow.csv.write_csv(), but it
now downgrades safely to Python-row CSV writing when Arrow CSV rejects the
batch, such as with nested types or invalid UTF-8 payloads.

The run summary and runtime metrics now expose:

csv_arrow_native_batch_count
csv_arrow_native_row_count
csv_arrow_downgrade_batch_count
csv_arrow_downgrade_row_count

This makes it much easier to distinguish:

Arrow lane selected by the planner
Arrow path active in the runtime
CSV sink actually staying native for the full run

`ProcessBatchMiddleware` for process-isolated batch execution

Agora now supports ProcessBatchMiddleware, a batch-lane middleware that runs a
sync transform in a managed ProcessPoolExecutor while keeping checkpointing,
DLQ routing, and sink delivery in the main runtime process.

This gives batch pipelines a clean path for:

CPU-heavy transforms
blocking native-library calls
memory isolation between orchestration and compute

`ArrowProcessBatchMiddleware` for columnar process execution

Agora now also supports ArrowProcessBatchMiddleware, an Arrow-native sibling
that keeps batches columnar across the process boundary.

This path is intended for:

Arrow-native sources and sinks
vectorized transforms using pyarrow.compute
CPU-heavy columnar work where Python-object batches would add too much
serialization overhead

The first cut transports Arrow batches as Arrow IPC bytes and currently requires
the worker transform to preserve input row count.

Timeout recovery now recycles the worker pool

If a process batch exceeds timeout_s, Agora treats that batch as failed and
recycles the worker pool before accepting the next batch.

This avoids a poisoned-pool failure mode where later batches would otherwise
queue behind a stuck worker and fail in a cascade.

In ordered pipelined mode, unresolved sibling batches from that recycled
worker-pool generation are failed too, instead of being committed from stale
worker state.

Ordered pipelined process execution is now active

When max_workers > 1 and max_in_flight_batches > 1, both
ProcessBatchMiddleware and ArrowProcessBatchMiddleware now keep multiple
batches in flight while still committing them in source order.

This turns max_in_flight_batches into a real throughput control for ordered
process-batch pipelines.

Batch-only contract is explicit

ProcessBatchMiddleware now rejects per-record execution directly. It must be
paired with a source that emits batches (supports_batch_emit=True).

This keeps the public contract honest and avoids silent per-record process hops.

Stability Status

ProcessBatchMiddleware is intended as a stable 0.3.0 feature for
Python-object batch pipelines when all of the following are true:

the source is batch-capable
the batch payload is pickleable
the transform function is pickleable
the pipeline keeps source-order commit semantics
if pipelined process execution is enabled, it uses ordered=True

ArrowProcessBatchMiddleware is intended as a stable 0.3.0 feature for
Arrow-native process-batch pipelines when all of the following are true:

the source emits pa.RecordBatch
the worker transform is pickleable
the worker transform returns pa.RecordBatch
the transform preserves row count
the pipeline keeps source-order commit semantics
if pipelined process execution is enabled, it uses ordered=True

The following mode is not available in 0.3.0:

ordered=False for pipelined process execution

Ordered pipelined execution is active when max_workers > 1 and
max_in_flight_batches > 1. Unordered pipelined commit is not available yet
and is rejected explicitly when both ordered=False and
max_in_flight_batches > 1.

Release Validation

Release validation for this feature should include:

targeted process-batch unit coverage
end-to-end batch-lane coverage for success, DLQ, timeout recovery, and
checkpoint gating
ordered pipelined coverage for both Python-object and Arrow-native process
batches
adjacent batch/runtime preservation coverage
package CI from source checkout with PYTHONPATH=src

Recommended commands:

PYTHONPATH=src ./.venv/bin/pytest -q \
  tests/core/test_process_batch_middleware.py \
  tests/integration/test_process_batch_integration.py

PYTHONPATH=src ./.venv/bin/pytest -q \
  tests/core/test_batch_pipeline.py \
  tests/preservation/test_runtime_guarantees.py -k "batch or checkpoint or dlq"

Release Criteria

timed-out process work does not poison later batches in the same run
checkpoint advancement remains gated on downstream handling after process
execution
failed process batches route to the DLQ or stop the run according to the
configured failure policy
package docs clearly state batch-only usage and timeout semantics
the targeted process-batch and adjacent batch/runtime suites pass from the
source tree

Assets 2

03 Jun 10:13

thanhtham010891

v0.2.2

98a0ab2

Release v0.2.2

Focus

Operational visibility for long-lived workers, safer partial-batch delivery,
and a more production-shaped Kafka worker example.

Highlights

Live worker observability for long-lived pipelines

WorkerPool now registers pipelines with the metrics collector before the first
run completes, so /health and /metrics can show long-lived consumers while
they are still running.

Live health payloads now expose:

registered pipeline metadata and schedule
running lifecycle state and active run id
live throughput and elapsed runtime
live records_consumed, records_written, records_dropped, records_errored
records_pending for in-flight records that have not reached a terminal outcome yet

This makes the health surface useful for continuous workers that may not finish
a run for hours or days.

Timed batch flushes for long-lived sources

DeliveryConfig now supports batch_flush_interval_ms for batched delivery.
This allows long-lived sources to flush partial sink batches after a bounded
wait, instead of holding records indefinitely until the batch fills or the run
ends.

The implementation uses a single-owner flush task for pending_writes, which
avoids duplicate delivery when timeout-based flushing and new arrivals happen
close together.

Live metrics no longer under-report buffered source counters

Live /health snapshots now include hot-path source counters that have not yet
been flushed into the main PipelineMetrics object. This fixes confusing
observability states where records_written could temporarily appear higher
than records_consumed during a live run.

Kafka worker example updated for long-lived operation

The etl-order-reliability example now models long-lived Kafka consumers more
closely:

consumer pipelines stay attached to their Kafka group instead of exiting
after short chunked runs
the worker no longer auto-runs the sample producer
Kafka UI is included in the local Docker stack for topic, group, and lag inspection
the example uses timed batch flushing to avoid stuck partial batches at low traffic

Tests

Release validation includes:

worker observability and health response coverage for running pipelines
regression coverage for timed partial-batch flushes
example worker tests for the Kafka reliability example
full package CI from the source tree

Assets 2

01 Jun 10:59

thanhtham010891

v0.2.1

7fa62f5

v0.2.1

Focus

Correctness hardening, performance improvements, and Arrow-native sink expansion.
0.2.1 fixes several bugs introduced in 0.2.0's batch architecture, improves
throughput by ~2x on per-record pipelines via delivery batching, and extends the
Arrow fast path to CsvSink and JsonLinesSink.

Performance

Delivery batching — ~2x throughput on per-record pipelines

DeliveryConfig(batch_size=N) now delivers measurable throughput gains on the
linear lane. Profiling shows delivery overhead (write dispatch, checkpoint
decision, metrics flush) accounts for ~40% of per-record pipeline time. Batching
reduces the number of delivery function calls from O(records) to O(records/N).

Recommended default for per-record pipelines: batch_size=100.

# Before (0.2.0 default — batch_size=1)
Pipeline(source).build(sink).run()                    # ~124K rec/s JSONL

# After (0.2.1 recommended)
Pipeline(source).build(sink, config=DeliveryConfig(batch_size=100)).run()  # ~279K rec/s JSONL

Measured on 100K rows, median of 3 runs, NullSink:

Lane	batch_size=1	batch_size=100	Gain
JSONL direct	124K rec/s	279K rec/s	2.2x
CSV direct	89K rec/s	164K rec/s	1.8x

Arrow-native `CsvSink` and `JsonLinesSink`

Both sinks now implement write_arrow_batch(batch: pa.RecordBatch) and satisfy
the ArrowNativeSink protocol. Arrow pipelines that previously fell back to
to_pylist() at the sink boundary now stay columnar end-to-end.

# Now stays on the Arrow fast path — no to_pylist() at sink
Pipeline(ArrowCsvSource("input.csv")).build(CsvSink("output.csv", row_mapper=lambda r: r)).run()
Pipeline(ArrowJsonLinesSource("input.jsonl")).build(JsonLinesSink("output.jsonl")).run()

CsvSink.write_arrow_batch() uses pyarrow.csv.write_csv() (C extension) —
faster than Python csv.DictWriter for large batches.
JsonLinesSink.write_arrow_batch() uses to_pylist() + batch JSON serialization
in a single write syscall, bypassing the per-record serializer call.

Bug fixes

`BatchMiddleware` missing `process()` and `on_error()` fallbacks

BatchMiddleware subclasses raised AttributeError when used with a non-batch
source (linear lane). The linear lane calls process() per record — but
BatchMiddleware only defined process_batch().

Fixed: BatchMiddleware now provides a process() fallback that wraps the
single record in a list, calls process_batch([record], ctx), and returns the
first result. An on_error() fallback is also added for consistency.

This means BatchMapMiddleware and BatchFilterMiddleware now work correctly
with any source, not just batch-emit sources.

`MiddlewareChain.process_batch()` — OCP violation fixed via double-dispatch

process_batch() previously used a chain of isinstance checks to dispatch
to BatchMiddleware, ArrowBatchMiddleware, MapMiddleware, and
FilterMiddleware. Adding a new middleware type required editing this method.

Fixed: each middleware type now implements apply_in_batch() — a double-dispatch
hook that process_batch() calls directly. The method is now a clean loop with
no isinstance dispatch.

`DeliveryEngine` — duplicate flush logic eliminated

flush_pending_writes() and flush_batch_direct() shared ~200 lines of
duplicated batch-failure handling logic. Both methods now delegate to shared
_flush_batch_outcomes() and _commit_outcomes() helpers.

write_batch_result() (155 lines, 3 code paths) split into three focused
private methods: _write_batch_middleware_failure(), _write_batch_passthrough(),
_write_batch_filtered().

`assert isinstance(outcome, CheckpointedOutcome)` replaced

Production code in _delivery.py used assert for type narrowing — stripped
by python -O. Replaced with raise TypeError guards.

`ScheduledPipelineRunner` — private attribute access removed

ScheduledPipelineRunner accessed self._pipeline._state, self._pipeline._factory,
self._pipeline._max_errors, etc. directly. All private attributes are now
exposed as public properties on ScheduledPipeline:

New properties: state, max_records, max_consecutive_errors, backoff_policy,
pre_run_hook, on_run_complete, observers, and build() method.

`Schema.from_dict` — hash mismatch policy

Schema.from_dict() previously logged a warning on hash mismatch and continued
loading — a tampered or corrupted schema under FREEZE contract went undetected.

Fixed: Schema.from_dict(data, strict_hash=True) now raises ValueError on
hash mismatch. BackendSchemaStore automatically enables strict_hash=True
when constructed with contract=SchemaContract.FREEZE.

`SchemaEvolution.merge()` — hash short-circuit

merge() previously rebuilt the merged columns dict from scratch on every call,
even when the schema had not changed. Fixed: early return when
old.hash == new.hash.

`SchemaProcessor` — `SchemaInferrer` reuse

_infer_record_schema() previously allocated a new SchemaInferrer per record.
Fixed: a single SchemaInferrer is created in __init__ and reused via
reset() before each record.

`EmbeddingStore.mark_if_new()` — atomic check-and-add

The default DedupStore.mark_if_new() is documented as non-atomic. Under
concurrent execution (buffered lane), two coroutines could both see
exists() == False and both call add().

Fixed: EmbeddingStore now overrides mark_if_new() with an asyncio.Lock
that embeds the key once and performs check + append atomically.

Import path validation in declarative configs

ImportRefConfig now validates that {"import": "..."} values in TOML configs
are valid dotted Python identifiers with an optional :attribute suffix.
Malformed paths (absolute paths, shell metacharacters, traversal) are rejected
at config parse time with a clear error message.

New: Rust `CheckpointState` (agora-rs)

CheckpointState is now implemented in Rust (agora-rs) and used automatically
when the extension is installed. The Python implementation remains as a fallback.

The Rust implementation eliminates per-record Python attribute access for
increment(), should_save(), and mark_saved() — the three hot-path
checkpoint operations called on every delivered record.

make_checkpoint_state() is the new factory function — returns the Rust
implementation when available, Python otherwise.

New: Benchmark system (`benchmarks/`)

A new subprocess-isolated benchmark suite replaces the previous ad-hoc scripts.

# Run all lanes (csv, jsonl, parquet) with median of 3 runs
python benchmarks/run.py --rows 100000

# Run specific lanes
python benchmarks/run.py --rows 100000 --only csv jsonl

# Increase statistical confidence
python benchmarks/run.py --rows 100000 --median 5

# Save results to JSON
python benchmarks/run.py --rows 100000 --save

Each case runs in an isolated subprocess — no GC noise or event-loop state
leaks between measurements. Fixture data is regenerated before each run.
Results show median elapsed and throughput across --median runs.

Cases per lane:

Case	CSV	JSONL	Parquet
direct	✓	✓	✓
map	✓	✓	✓
filter	✓	✓	✓
map_filter	✓	✓	✓
batch_map	✓	✓	—
batch_filter	✓	✓	—
arrow	✓	✓	✓
arrow_map	✓	✓	✓
arrow_filter	✓	✓	✓
arrow_to_csv	✓	—	—
arrow_to_jsonl	—	✓	—

New public exports

Added to from agora.core.runtime import ...:

make_checkpoint_state — factory for Rust-backed or Python CheckpointState

Added to from agora.config.pipeline import ...:

validate_import_path — validates declarative import path format

Tests

690 tests pass (up from 649 in 0.2.0).

New test coverage:

tests/core/test_pipeline.py — batch_size delivery batching, direct flush,
delivery success hooks, BatchMiddleware on linear lane
tests/core/test_tracing.py — tracing span lifecycle, InMemoryTracer
tests/core/test_observability_metrics.py — metrics collector, Prometheus exporter
tests/core/test_batch_pipeline.py — extended batch pipeline coverage

Assets 2

30 May 19:02

thanhtham010891

v0.2.0

6fa43cd

Release v0.2.0

Focus

Batch-native architecture. 0.2.0 introduces a new execution lane that allows
sources to emit native batches, middlewares to process batches, and sinks to
consume batches — eliminating the per-record overhead that dominated 0.1.x
throughput.

Performance

The Arrow fast path activates when ParquetSource(use_arrow_batches=True) is
paired with ParquetSink and no per-record middleware transforms the batch.
In this configuration, pa.RecordBatch objects flow from source to sink with
zero Python object allocation per row.

New: Batch-native execution lane

Protocols (`src/agora/core/batch.py`)

Three new opt-in protocols:

BatchableSource[T] — a source that can emit native batches via
stream_batches(). The runtime calls this instead of stream() when
supports_batch_emit = True. Each yielded value is a list[T] or a
pa.RecordBatch (for Arrow-native sources).

BatchMiddleware[T, U] — a middleware that processes a whole batch at
once via process_batch(records, ctx) -> list[U | None]. Return None at
position i to drop records[i]. Raise to fail the entire batch.

ArrowNativeSink — a sink that can consume pa.RecordBatch directly via
write_arrow_batch(batch). When both source and sink are Arrow-native and no
middleware transforms the batch, the runtime uses the shortest path with zero
row materialization.

Failure policy for BatchMiddleware (Option A): if process_batch() raises,
the entire batch is routed to the DLQ (if configured) or the run aborts. There
is no per-record fallback.

Runtime batch lane (`_buffered.py`)

New ExecutionCoordinator.run_batch_pipeline() method. Activated
automatically when is_batch_capable_source(source) returns True. The
existing per-record and buffered paths are unchanged.

Guarantees preserved in the batch lane:

Batches are committed in source order.
Checkpoint advances once per batch, after the batch is durably written.
Sink fail-closed and DLQ policies are honored per batch.
CancelledError aborts without committing the in-flight batch.

`MiddlewareChain` additions

has_batch_stages() — returns True if any middleware is a BatchMiddleware.
process_batch(records, ctx) — runs the batch through the chain. For
BatchMiddleware stages: calls process_batch(). For regular Middleware
stages: applies per-record within the batch loop.

Delivery layer additions

New DeliveryEngine.write_batch_result() — delivers a processed
batch to the sink and advances the checkpoint once. Handles batch failure
(Option A), sink errors, DLQ routing, and LOG_AND_CONTINUE policy.
(RecordDeliveryCoordinator remains as a deprecated alias for DeliveryEngine.)

`ParquetSource` — Arrow batch mode

ParquetSource gains a new use_arrow_batches=False constructor parameter.
When True:

supports_batch_emit is set to True.
stream_batches() yields pa.RecordBatch objects directly — no
to_pylist() call, no Python dict allocation per row.
row_mapper is bypassed in batch mode.
Checkpoint still tracks row_number across batches for resume support.
Resume uses Arrow batch slicing instead of row-by-row skipping.

The default is False — existing pipelines using row_mapper are unaffected.

`ParquetSink` — Arrow batch mode

ParquetSink gains write_arrow_batch(batch: pa.RecordBatch) — writes an
Arrow batch directly via pa.Table.from_batches([batch]) without calling
row_mapper or _rows_to_columns(). The existing write() and
write_batch() paths are unchanged.

Breaking changes

`AGORA_API_VERSION` removed

The AGORA_API_VERSION alias (deprecated in 0.1.9) has been removed from
agora.core.registry. Use AGORA_PLUGIN_MANIFEST_VERSION instead.

# Before (0.1.9 — deprecated)
from agora.core.registry import AGORA_API_VERSION

# After (0.2.0)
from agora.core.registry import AGORA_PLUGIN_MANIFEST_VERSION

`AGORA_PLUGIN_MANIFEST_VERSION` bumped to `"0.4"`

The manifest schema version bumps from "0.3" to "0.4". Plugin packages
that declare MANIFEST.agora_api_version = "0.3" will be excluded from the
active registry and shown as incompatible in agora plugins list.

Update your plugin's MANIFEST:

MANIFEST = _Manifest(
    package="my-plugin",
    version="...",
    agora_api_version="0.4",  # was "0.3"
)

New public exports

Added to from agora import ...:

BatchableSource
BatchMiddleware
BatchProcessResult
BatchFailure
ArrowNativeSink
is_batch_capable_source
is_arrow_native_sink

Tests

New: batch pipeline unit tests

Added tests/core/test_batch_pipeline.py (13 tests):

is_batch_capable_source() detection
is_arrow_native_sink() detection
BatchMiddleware.process_batch() dispatch
Dropped records (None results)
Source order across batches
Checkpoint advances once per batch
Batch failure (Option A) → entire batch to DLQ
Batch failure without DLQ aborts run
No-middleware batch pipeline
max_records respected in batch mode

New: batch-path preservation tests

Added 4 batch-path variants to tests/preservation/test_runtime_guarantees.py:

Source order in batch mode
Checkpoint advances after each batch is durably written
Sink fail-closed in batch mode
DLQ routing on batch failure (Option A)

All 649 tests pass.

Packaging

Bumped agora-etl to 0.2.0.
Removed AGORA_API_VERSION alias.
Bumped AGORA_PLUGIN_MANIFEST_VERSION to "0.4".

Assets 2

28 May 08:37

thanhtham010891

v0.1.9

aed2f3c

Release v0.1.9

Documentation

New: Plugin Contract

Added plugins/contract.md — single source of
truth for plugin authors. Covers:

stability labels (stable / provisional / internal) for every
entry-point group
what each label promises through the 0.1.x → 0.2.x transition
plugin author obligations (base class requirements, internal path
restrictions)
what the runtime does with incompatible plugins (excluded from active
registry, still in diagnostics)
stable base class shapes for sources, sinks, middlewares, dedup stores,
and dedup strategies

New: Manifest Contract

Added plugins/manifest.md — explains the
two-version model that plugin authors frequently confuse:

AGORA_PLUGIN_MANIFEST_VERSION = "0.3" tracks the MANIFEST metadata
schema, not the agora-etl package version
full compatibility matrix (matching / mismatching / no MANIFEST)
how to read the registry_entrypoint_incompatible warning
deprecation notice for AGORA_API_VERSION alias (removed in 0.2.0)
when AGORA_PLUGIN_MANIFEST_VERSION bumps vs when the package bumps

Updated: plugins/index.md and plugins/developing.md

plugins/index.md now links to contract.md and manifest.md from the
"Start here" section.
plugins/developing.md "Manifest compatibility" section now points to
manifest.md instead of carrying its own short paragraph.

MkDocs nav

Added Plugin Contract and Manifest Contract to the Plugins section in
mkdocs.yml.

Code

Deprecation: `AGORA_API_VERSION`

Added a deprecation docstring to AGORA_API_VERSION in
agora/core/registry.py. The alias still works and resolves to the same
value as AGORA_PLUGIN_MANIFEST_VERSION. It will be removed in 0.2.0.

Migration is a one-line change:

# Before
from agora.core.registry import AGORA_API_VERSION

# After
from agora.core.registry import AGORA_PLUGIN_MANIFEST_VERSION

Clearer incompatibility warning

The registry_entrypoint_incompatible log message now includes a hint
field that tells operators exactly what to update and where to find the
compatibility model. Behavior is unchanged — only the message is more
actionable.

Tests

Preservation: plugin contract

Added tests/preservation/test_plugin_contract.py (19 tests):

AGORA_API_VERSION alias equals AGORA_PLUGIN_MANIFEST_VERSION
each stable entry-point group has a Registry at the declared module
path and attribute, with load_entrypoints() callable (6 groups)
each provisional entry-point group has a Registry at the declared
path (4 groups)
manifest matrix: matching version → loaded, compatible=True
manifest matrix: mismatching version → excluded, compatible=False,
still in diagnostics
manifest matrix: no MANIFEST → loaded, compatible=None
incompatible plugin does not abort discovery of other plugins in the
same group
built-in sources registered under their declared names
built-in sinks registered under their declared names
built-in runners registered under their declared names
describe_items() returns RegistryItemInfo objects

All 72 preservation tests pass (53 from 0.1.8 + 19 new).

Assets 2

28 May 08:01

thanhtham010891

v0.1.8

fe93dbc

Release v0.1.8

Documentation

New: Runtime Guarantees

Added guides/runtime-guarantees.md — the
single source of truth for what the runtime promises. Covers:

source-order delivery in linear and buffered modes
sink fail-closed semantics and when each policy applies
the full checkpoint advancement table (success, drop, DLQ, fail-closed,
log-and-continue)
DLQ routing and DLQ failure policy
DLQ replay acknowledgement
cancellation and lifecycle ordering
intentional non-guarantees: at-least-once delivery, no two-phase commit,
untrusted-config caveat, public-edge hardening, cross-source ordering

The page is now the authoritative contract. architecture.md and
failure-handling.md cross-link into it instead of restating guarantees in
prose.

New: Recovery Support Matrix

Added guides/recovery-matrix.md — the
per-source resume contract. Covers:

which built-in sources are checkpointable and what their resume key looks
like
the three conditions a source must satisfy to be treated as checkpointable
per-source notes for JsonLinesSource, CsvSource, ParquetSource,
HTTPSource, and IterableSource
what resume does and does not promise
checkpoint failure handling for both the load and save paths
a minimal recipe for adding resume support to a custom source

sources.md and guides/checkpointing.md now cross-link into the matrix
instead of carrying their own short tables.

Tests

Preservation: runtime guarantees

Added tests/preservation/test_runtime_guarantees.py (12 tests). Each test
maps to one declared guarantee in runtime-guarantees.md:

linear-mode source-order commit
buffered-mode source-order commit when the buffered stage resolves futures
in reverse order
sink fail-closed propagates without DLQ
checkpoint advances on successful write
checkpoint advances when middleware DLQs the record
DLQ routing carries stage="middleware" and middleware name
middleware failure without DLQ continues the pipeline and counts the error
sink failure with DLQ advances the checkpoint
is_checkpoint_capable requires the explicit supports_checkpoint flag
DLQFailurePolicy.RAISE propagates DLQ write errors
DLQFailurePolicy.LOG_ONLY (default) swallows DLQ write errors
SinkFailurePolicy.LOG_AND_CONTINUE advances the checkpoint over a failed
write

Preservation: recovery edge cases

Added tests/preservation/test_recovery_edge_cases.py (10 tests):

corrupted checkpoint payload (missing fields) raises a typed TypeError —
protects the 0.1.7 fix
non-dict checkpoint payload raises a typed TypeError
prepare_resume failure aborts the run under FAIL_CLOSED
prepare_resume failure under LOG_AND_CONTINUE starts from scratch
checkpoint load() failure aborts the run under FAIL_CLOSED
checkpoint load() failure under LOG_AND_CONTINUE starts from scratch
checkpoint save() failure under LOG_AND_CONTINUE does not retry-storm —
protects the 0.1.7 mark_saved fix
non-checkpointable source paired with a checkpoint store runs without
checkpointing instead of raising
two-run resume restores progress from the saved position
SQLiteCheckpointStore survives close/reopen — basis for cross-process
resume

Assets 2

Releases: thanhtham010891/agora-etl

Release v0.4.0

Added

Changed

Removed

Release Rule

Uh oh!

Release v0.3.3

Focus

Highlights

AI provider capabilities are now explicit

Anthropic is now part of the official first-party bundle

Config-driven workers can now turn on OTel without custom bootstrap code

Docs and packaging now tell a cleaner release story

Upgrade notes

For AI users

For worker operators

For plugin users

Validation

Uh oh!

Release v0.3.2

Focus

Highlights

Public API boundaries are now frozen more deliberately

Runtime guarantees are preservation-backed end to end

Observability and health diagnostics are safer

Plugin contract cleanup is now reflected in code, tests, and docs

Benchmark coverage now has a concrete maintenance budget

Upgrade notes

For builders

For plugin authors

Deprecated aliases retained through 0.3.x

Validation

Uh oh!

Release v0.3.1

Focus

Highlights

Runtime planning and delivery semantics are more stable

Public data-plane contract is now easier for plugin authors to follow

Recovery UX is more explainable from the CLI

agora doctor is now a stronger preflight tool

Spawn safety and test isolation are tighter

Tests

Uh oh!

Release v0.3.0

Focus

Highlights

Source-bound max_records limits

Middleware data-plane validation is now explicit

Arrow fan-out now chooses the best path per sink

Execution semantics now expose first-class data planes

Legacy batch/Arrow bool flags are now deprecated

Built pipelines can now explain their planned execution shape

CsvSink now reports Arrow-native writes vs downgrade-to-rows

ProcessBatchMiddleware for process-isolated batch execution

ArrowProcessBatchMiddleware for columnar process execution

Timeout recovery now recycles the worker pool

Ordered pipelined process execution is now active

Batch-only contract is explicit

Stability Status

Release Validation

Release Criteria

Uh oh!

Release v0.2.2

Focus

Highlights

Live worker observability for long-lived pipelines

Timed batch flushes for long-lived sources

Live metrics no longer under-report buffered source counters

Kafka worker example updated for long-lived operation

Tests

Uh oh!

v0.2.1

Focus

Performance

Delivery batching — ~2x throughput on per-record pipelines

Arrow-native CsvSink and JsonLinesSink

Bug fixes

BatchMiddleware missing process() and on_error() fallbacks

MiddlewareChain.process_batch() — OCP violation fixed via double-dispatch

Deprecated aliases retained through `0.3.x`

`agora doctor` is now a stronger preflight tool

Source-bound `max_records` limits

`CsvSink` now reports Arrow-native writes vs downgrade-to-rows

`ProcessBatchMiddleware` for process-isolated batch execution

`ArrowProcessBatchMiddleware` for columnar process execution

Arrow-native `CsvSink` and `JsonLinesSink`

`BatchMiddleware` missing `process()` and `on_error()` fallbacks

`MiddlewareChain.process_batch()` — OCP violation fixed via double-dispatch

`DeliveryEngine` — duplicate flush logic eliminated

`assert isinstance(outcome, CheckpointedOutcome)` replaced

`ScheduledPipelineRunner` — private attribute access removed

`Schema.from_dict` — hash mismatch policy

`SchemaEvolution.merge()` — hash short-circuit

`SchemaProcessor` — `SchemaInferrer` reuse

`EmbeddingStore.mark_if_new()` — atomic check-and-add

New: Rust `CheckpointState` (agora-rs)

New: Benchmark system (`benchmarks/`)

Protocols (`src/agora/core/batch.py`)

Runtime batch lane (`_buffered.py`)

`MiddlewareChain` additions

`ParquetSource` — Arrow batch mode

`ParquetSink` — Arrow batch mode

`AGORA_API_VERSION` removed