Introduce `transform::logging::manager` #16301

oleiman · 2024-01-26T03:37:32Z

This PR introduces `transform::logging::manager`, whose primary responsibility is to buffer transform logs and periodically flush them to some `transform::logging::sink` as a `chunked_fifo` of JSON-serialized, OpenTelemetry-compatible log events.

Closes https://github.com/redpanda-data/core-internal/issues/1035
Closes https://github.com/redpanda-data/core-internal/issues/998

Backports Required

Release Notes

none

oleiman · 2024-01-26T03:40:35Z

/dt

vbotbuildovich · 2024-01-26T06:07:17Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44335#018d441a-b421-495d-98f2-2d2b5a2c7b80

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44344#018d44e8-9c10-4d81-ad43-d3590cbc0891

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44344#018d44f8-c1ab-4d8d-98ba-2f898ceffb4f

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44412#018d52f0-6bc6-4d86-9230-2b37c1bbd33e

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44511#018d5ecd-6675-4267-a9e2-b538a9704dc2

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44511#018d5ecd-6679-4873-8e42-09638df305df

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44511#018d60bd-9df6-436c-bc3f-c22ec5b771d1

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44561#018d6220-ed1d-4d10-aa48-18b6d009f697

oleiman · 2024-01-26T06:38:28Z

/dt

rockwotj

Nice tests. A good first stab. The only substantial feedback is that I think we should do more aggressive batching in the manager.

src/v/config/configuration.cc

rockwotj · 2024-01-26T18:41:43Z

src/v/config/configuration.cc

+      "Flush interval for transform logs. When a timer expires, pending logs "
+      "are collected and published to the transform_logs topic.",
+      {.needs_restart = needs_restart::no, .visibility = visibility::tunable},
+      500ms)


We may need to do some quick math on at what scale this overwhelms a single partition. One of the reasons for the 3s commit interval for data transform progress is so that we don't have to worry too much about additional partitions at that commit interval (we would need a lot of cores for that). I agree with the faster publish/flush rate here, but let's derive some limits for this internally and figure out what partition counts we need for our cloud tiers based on this.

ya, strongly agree. this value is totally arbitrary and frankly an artifact of a previous iteration (before I wired in manual_clock).

https://github.com/redpanda-data/core-internal/issues/1040
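
A rough way to frame the "quick math" discussed in this thread is a worst-case bound: if every shard fills its whole buffer between flushes, it publishes at most `buffer_capacity / flush_interval` bytes per second. The sketch below is only that arithmetic. The function names, the shard count, and the per-partition byte budget are hypothetical inputs, not values or APIs from this PR, and the bound assumes one buffer per shard.

```cpp
#include <cassert>
#include <cstdint>

// Worst-case bytes/second one shard can publish, assuming it fills its
// entire buffer every flush interval (an assumption, not a measurement).
constexpr std::uint64_t per_shard_bytes_per_sec(
  std::uint64_t buffer_capacity_bytes, std::uint64_t flush_interval_ms) {
    assert(flush_interval_ms > 0);
    return buffer_capacity_bytes * 1000 / flush_interval_ms;
}

// Minimum partition count such that no partition exceeds the given budget,
// assuming load spreads evenly. `partition_budget_bytes_per_sec` is a
// hypothetical number that would have to come from benchmarking.
constexpr std::uint64_t min_partitions(
  std::uint64_t shards,
  std::uint64_t buffer_capacity_bytes,
  std::uint64_t flush_interval_ms,
  std::uint64_t partition_budget_bytes_per_sec) {
    assert(partition_budget_bytes_per_sec > 0);
    const std::uint64_t total
      = shards
        * per_shard_bytes_per_sec(buffer_capacity_bytes, flush_interval_ms);
    // Ceiling division.
    return (total + partition_budget_bytes_per_sec - 1)
           / partition_budget_bytes_per_sec;
}
```

With a 100 KiB buffer and a 500 ms interval, the bound is 200 KiB/s per shard. For an illustrative 32 shards and a 1 MiB/s per-partition budget, `min_partitions(32, 100 * 1024, 500, 1024 * 1024)` evaluates to 7.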

rockwotj · 2024-01-26T18:42:03Z

src/v/config/configuration.h

@@ -74,6 +74,11 @@ struct configuration final : public config_store {
    bounded_property<size_t> data_transforms_per_function_memory_limit;
    property<std::chrono::milliseconds> data_transforms_runtime_limit_ms;
    bounded_property<size_t> data_transforms_binary_max_size;
+    bounded_property<size_t> data_transforms_logging_buffer_capacity_bytes;
+    // TODO(oren): bounded?


I don't think we need to be that prescriptive personally - it's a tunable after all.

src/v/transform/logging/logger.cc

src/v/transform/logging/log_manager.cc

rockwotj · 2024-01-26T18:56:09Z

src/v/transform/logging/log_manager.h

+        event event;
+        ssx::semaphore_units units;
+    };
+    using queue_t = ss::chunked_fifo<log_event>;


I think chunked_vector will probably be a better structure once #16257 is merged

src/v/transform/logging/tests/log_manager_test.cc

oleiman · 2024-01-29T00:49:06Z

force push contents:

various minor review feedback
large-ish refactor of flushing logic
- jitter
- cond-var based timing and direct wakeup
- group log events by partition ID before forwarding to client
- ^^ separate "flusher" class for most of that
extend client API to look up partition ID for a transform_name
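
The "group log events by partition ID" step listed above can be sketched roughly as follows. This is a standalone illustration, not the code from the PR: `log_event`, `compute_partition_id`, and the hash-modulo mapping are hypothetical stand-ins, and the real client may map transform names to partitions differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a serialized log event.
struct log_event {
    std::string transform_name;
    std::string json;
};

using partition_id = std::int32_t;

// Hypothetical lookup: hash the transform name onto a partition.
inline partition_id
compute_partition_id(const std::string& name, partition_id n_partitions) {
    assert(n_partitions > 0);
    return static_cast<partition_id>(
      std::hash<std::string>{}(name)
      % static_cast<std::size_t>(n_partitions));
}

// Bucket events by destination partition so that each partition receives
// a single batched write instead of one write per event.
inline std::map<partition_id, std::vector<log_event>>
group_by_partition(std::vector<log_event> events, partition_id n_partitions) {
    std::map<partition_id, std::vector<log_event>> grouped;
    for (auto& ev : events) {
        // Compute the key before moving the event into its bucket.
        const auto pid = compute_partition_id(ev.transform_name, n_partitions);
        grouped[pid].push_back(std::move(ev));
    }
    return grouped;
}
```

Events from the same transform always land in the same bucket, so per-transform ordering within a flush is preserved.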

rockwotj

Have not looked super close at the tests yet, but this is looking really good!

src/v/transform/logging/log_manager.h

rockwotj · 2024-01-29T01:37:39Z

src/v/transform/logging/log_manager.cc

+                if constexpr (std::is_same_v<ClockType, ss::manual_clock>) {
+                    co_await _wakeup_signal.wait(
+                      ClockType::now() + _jitter.base_duration());
+                } else {
+                    co_await _wakeup_signal.wait(_jitter());
+                }


Does it simplify things if we always do:

co_await _wakeup_signal.wait<ClockType>(ClockType::now() + _jitter.base_duration());

I think the else branch you're using here calls timer::clock::now(), which defaults to steady_clock; lowres should be fine here.

oh, yeah you're absolutely right. I got mixed up associating the Clock template param with the class rather than the wait methods.
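
The idea behind templating the wait on the clock type can be shown without Seastar. In the sketch below, `manual_clock` is a minimal hand-rolled test clock that only mimics the idea of `ss::manual_clock`, and `next_flush_deadline` and `deadline_passed` are hypothetical helpers. It does not reproduce the condition-variable API or the jitter handling discussed in this thread.

```cpp
#include <cassert>
#include <chrono>

// Minimal manual clock for tests: time only moves when advance() is called.
struct manual_clock {
    using duration = std::chrono::milliseconds;
    using rep = duration::rep;
    using period = duration::period;
    using time_point = std::chrono::time_point<manual_clock, duration>;
    static constexpr bool is_steady = false;

    static time_point now() noexcept { return time_point(current()); }
    static void advance(duration d) noexcept {
        assert(d.count() >= 0); // manual time never goes backwards
        current() += d;
    }

private:
    static duration& current() noexcept {
        static duration elapsed{0};
        return elapsed;
    }
};

// The deadline is computed from the same clock the waiter observes, so
// production code (e.g. std::chrono::steady_clock) and tests
// (manual_clock) share a single code path.
template<typename Clock>
typename Clock::time_point
next_flush_deadline(typename Clock::duration interval) {
    return Clock::now() + interval;
}

template<typename Clock>
bool deadline_passed(typename Clock::time_point deadline) {
    return Clock::now() >= deadline;
}
```

In a test, the deadline only expires once `manual_clock::advance` has moved time past it, which makes flush timing deterministic.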

src/v/transform/logging/log_manager.cc

rockwotj · 2024-01-29T01:43:46Z

src/v/transform/logging/log_manager.cc

+template<typename ClockType>
+ss::future<> manager<ClockType>::stop() {
+    _as.request_abort();
+    co_await _flusher->stop();


Should we be closing the gate first?

I think we need to break the semaphore first because the flush fiber is holding the gate open, right?

incidentally, the fact that this looks strange is probably a bit of an API smell for the flusher thing...

Why does this class even need the gate?

I think the suggestion is "push the gate down into flusher", which sounds right to me, I've done that.

It was a question, but a reasonable answer is that it doesn't and it can be pushed into the flusher :)

src/v/transform/logging/log_manager.cc

oleiman · 2024-01-29T06:23:27Z

force push:

fix concurrency bug (absl map iterator stability)
few minor cleanups

oleiman · 2024-01-29T18:35:51Z

force push contents:

hold onto sem units in client::write (added to io::json_batch)
For primary log buffers: flat_hash_map -> btree_map
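
Holding semaphore units until the write completes means buffer capacity is only returned once the records have actually been produced. A minimal standard-library sketch of that ownership pattern follows. `byte_budget` and `units` are hypothetical stand-ins for the semaphore and `ssx::semaphore_units`, and unlike a real semaphore this version never waits; it simply refuses when capacity is exhausted.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

class byte_budget;

// Move-only token for reserved bytes; returns them to the budget when
// destroyed.
class units {
public:
    units(byte_budget& b, std::size_t n) : _budget(&b), _n(n) {}
    units(units&& o) noexcept
      : _budget(std::exchange(o._budget, nullptr))
      , _n(std::exchange(o._n, 0)) {}
    units(const units&) = delete;
    units& operator=(const units&) = delete;
    units& operator=(units&&) = delete;
    ~units();

    std::size_t count() const { return _n; }

private:
    byte_budget* _budget;
    std::size_t _n;
};

class byte_budget {
public:
    explicit byte_budget(std::size_t capacity)
      : _capacity(capacity), _available(capacity) {}

    // Reserve n bytes, or return nullopt if the buffer is too full.
    std::optional<units> try_acquire(std::size_t n) {
        if (n > _available) {
            return std::nullopt;
        }
        _available -= n;
        return units(*this, n);
    }

    std::size_t available() const { return _available; }

private:
    friend class units;
    void release(std::size_t n) {
        assert(_available + n <= _capacity);
        _available += n;
    }
    std::size_t _capacity;
    std::size_t _available;
};

inline units::~units() {
    if (_budget != nullptr) {
        _budget->release(_n);
    }
}
```

Moving a `units` object into whatever performs the write keeps the bytes reserved for the lifetime of that write. Capacity reappears in `available()` only when the moved-to object is destroyed.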

rockwotj

🔥

Just some documentation, and some possible code cleanup. Otherwise this LGTM.

rockwotj · 2024-01-30T01:11:08Z

src/v/transform/logging/io.h

+
+} // namespace io
+
+class client {


nit: please add documentation for the semantics of this interface.

rockwotj · 2024-01-30T01:15:40Z

src/v/transform/logging/log_manager.cc

+
+    auto validate_msg =
+      [&msg_len](std::string_view message) -> std::optional<ss::sstring> {
+        auto sub_view = message.substr(0, msg_len(message));


Technically this can leave invalid UTF-8 if a multi-byte character is cut in the middle, but personally I don't think we need to worry about it.
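
For reference, truncating on a character boundary is cheap if it ever becomes necessary. The sketch below is not part of this PR, and `truncate_utf8` is a hypothetical helper name. It relies on UTF-8 continuation bytes always having the bit pattern `10xxxxxx`, and it assumes the input is already valid UTF-8 rather than validating it.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool is_utf8_continuation(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

// Truncate to at most max_bytes without splitting a multi-byte sequence.
inline std::string_view
truncate_utf8(std::string_view s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) {
        return s;
    }
    std::size_t cut = max_bytes;
    // s[cut] is the first byte that would be dropped. If it is a
    // continuation byte, the cut lands inside a character, so back up to
    // that character's lead byte and drop the character whole.
    while (cut > 0
           && is_utf8_continuation(static_cast<unsigned char>(s[cut]))) {
        --cut;
    }
    assert(cut <= max_bytes);
    return s.substr(0, cut);
}
```

The result may be up to three bytes shorter than `max_bytes`, since a UTF-8 sequence is at most four bytes long.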

rockwotj · 2024-01-30T01:16:20Z

src/v/transform/logging/log_manager.cc

+template<typename ClockType>
+ss::future<> manager<ClockType>::stop() {
+    _as.request_abort();
+    co_await _flusher->stop();


Why does this class even need the gate?

rockwotj · 2024-01-30T01:17:07Z

src/v/transform/logging/log_manager.cc

+      event{_self, event::clock_type::now(), level, std::move(*b)},
+      std::move(*units));
+
+    if (check_lwm() && _flusher != nullptr) {


When can _flusher be nullptr?

If enqueue_log were called before manager::start. I think this shouldn't happen, but without all the wiring in place I'm not 100% sure. Maybe an assertion would express this better.

Can't we create the flusher in the constructor and wait to call flusher.start until start?

Probably? I know for things that initialize metrics there's a good reason to construct in start (global side effects?). Not an issue here, but I've seen this pattern on a couple other classes. TBH in this case I'm just cargo-culting a bit.

Oh, yeah transform::service does this because other dependent sharded services are not initialized yet - kind of annoying. I think in this case, we don't have this concern because the memory is owned by the manager, so we can simplify.
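
The simplification discussed in this thread (construct the flusher with the manager, defer the background work to `start()`) removes the need for a null check on the enqueue path. The following is a toy, single-threaded sketch of that shape only. The class bodies, the low-water-mark rule, and the drop-when-full behavior are assumptions made for illustration, not the PR's implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the flusher: cheap to construct, with all
// background work deferred until start().
class flusher {
public:
    void start() { _running = true; }
    void stop() { _running = false; }
    // Requests an early flush; ignored unless the flusher is running.
    void wakeup() {
        if (_running) {
            ++_wakeups;
        }
    }
    std::size_t wakeups() const { return _wakeups; }

private:
    bool _running{false};
    std::size_t _wakeups{0};
};

class manager {
public:
    manager(std::size_t capacity, std::size_t low_water_mark)
      : _available(capacity), _lwm(low_water_mark) {
        assert(low_water_mark <= capacity);
    }

    void start() { _flusher.start(); }
    void stop() { _flusher.stop(); }

    // Returns false if the event does not fit (assumed to be dropped).
    bool enqueue_log(std::size_t bytes) {
        if (bytes > _available) {
            return false;
        }
        _available -= bytes;
        if (_available < _lwm) {
            _flusher.wakeup(); // held by value, so never null
        }
        return true;
    }

    std::size_t flusher_wakeups() const { return _flusher.wakeups(); }

private:
    flusher _flusher; // constructed together with the manager
    std::size_t _available;
    std::size_t _lwm;
};
```

Calling `enqueue_log` before `start()` is then well defined: the event is buffered and the wakeup is simply a no-op.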

rockwotj · 2024-01-30T01:17:37Z

src/v/transform/logging/log_manager.h

+}
+
+template<typename ClockType = ss::lowres_clock>
+class manager {


Can we add some documentation for the purpose and semantics of this class? Also details that it's expected to be an instance per core, etc.

rockwotj · 2024-01-30T01:18:30Z

src/v/transform/logging/log_manager.cc

+                // timepoint overload template has a clocktype parameter, so we
+                // use that one to get the behavior we want for testing
+                co_await _wakeup_signal.wait<ClockType>(
+                  ClockType::now() + _jitter.base_duration());


Why base_duration?

🤦 for the manual clock specialization. got a bit overzealous removing code...we still need that constexpr conditional here.

rockwotj · 2024-01-30T01:19:27Z

src/v/transform/logging/log_manager.cc

+        });
+    }
+
+    template<typename BuffersT>


Why is BuffersT templated?

To limit the scope of buffer_entry, buffer type, map type. I want to keep the details internal to manager and flusher out of the header. Maybe there's an angle I'm not seeing?

+1 to keeping the flusher out of the header. This is fine, just mostly curious.

rockwotj · 2024-01-30T01:20:08Z

src/v/transform/logging/log_manager.cc

+          });
+
+        vlog(tlg_log.trace, "Processed {} log events", n_events);
+        co_return;


nit: co_return isn't needed.

yeah, cruft. i had a status code ~~there~~ in a previous iteration, then removed it, added it back, removed it.

rockwotj · 2024-01-30T01:21:31Z

src/v/transform/logging/log_manager.cc

+        // released (i.e. buffer capacity freed) only once those records have
+        // been produced.
+        co_await _client->write(pid, std::move(events));
+        co_return;


nit: remove this co_return it's not needed.

rockwotj · 2024-01-30T01:23:21Z

src/v/transform/logging/tests/log_manager_test.cc

+    // This will cause a reactor stall in Debug mode but NOT
+    // in Release


Where is this from? Is that because enqueue_log is called multiple times synchronously? There isn't anything inherently in the manager that should cause stalls right? This is just an artifact of the test and Debug mode slowness?

> isn't anything inherently in the manager that should cause stalls right?

correct. doesn't look like it to me, anyway.

> just an artifact of the test and debug mode slowness

I think so. It definitely stalls in enqueue_log, but it's a sparse backtrace with some asan symbols and some fmtlib (???) symbols in. I was more confident about this yesterday...backtrace decoding seems like it could be wrong.

Eh, seems like the cause is doing a bunch of slow ASan mallocs in a tight loop? I threw an ss::maybe_yield().get() in the test function and the stall went away. That would explain the garbled backtrace, I think.

oleiman · 2024-01-31T07:57:19Z

force push minor cleanups and some light documentation on client, manager, manager::enqueue_log

- data_transforms_logging_buffer_capacity_bytes : integer (default 100_KiB)
- data_transforms_logging_flush_interval_ms : integer (default 500ms)
- data_transforms_logging_line_max_bytes : integer (default 1_KiB)

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>

oleiman · 2024-01-31T23:11:13Z

empty force push for commit signoff

rockwotj

LGTM

vbotbuildovich · 2024-02-01T02:47:45Z

/backport v23.3.x

vbotbuildovich · 2024-02-01T02:48:40Z

Oops! Something went wrong.

Workflow run logs.

gousteris · 2024-02-01T16:58:26Z

/backport v23.3.x

oleiman self-assigned this Jan 26, 2024

github-actions bot added area/redpanda area/wasm WASM Data Transforms labels Jan 26, 2024

oleiman force-pushed the xfm-logging/log-manager2 branch from 7d4d08c to d329120 Compare January 26, 2024 03:40

oleiman force-pushed the xfm-logging/log-manager2 branch from d329120 to 59370b3 Compare January 26, 2024 06:37

oleiman marked this pull request as ready for review January 26, 2024 06:38

oleiman force-pushed the xfm-logging/log-manager2 branch from 59370b3 to 9124aa5 Compare January 26, 2024 07:17

oleiman requested review from rockwotj and dotnwat January 26, 2024 15:14

rockwotj reviewed Jan 26, 2024

oleiman force-pushed the xfm-logging/log-manager2 branch from 9124aa5 to 97fffa3 Compare January 29, 2024 00:41

rockwotj reviewed Jan 29, 2024

oleiman force-pushed the xfm-logging/log-manager2 branch from 97fffa3 to 2c4ba7e Compare January 29, 2024 06:19

oleiman force-pushed the xfm-logging/log-manager2 branch from 2c4ba7e to 817dfe5 Compare January 29, 2024 18:33

oleiman requested a review from rockwotj January 29, 2024 23:25

rockwotj reviewed Jan 30, 2024

oleiman mentioned this pull request Jan 31, 2024

config: Transform logging configs #16155

Closed

7 tasks

oleiman force-pushed the xfm-logging/log-manager2 branch from 817dfe5 to 0df59c4 Compare January 31, 2024 07:56

oleiman requested a review from rockwotj January 31, 2024 23:08

oleiman added 4 commits January 31, 2024 15:10

transform/logging: Introduce transform::logging::client

ffcf549

Initial API for determining the partition ID for some transform's logs and writing those logs to the transform_logs topic. Abstract interface for testing purposes. Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>

transform/logging: Add ss::logger for transform log subsystem

3cab46d

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>

utils: Add sstring comparator to absl_sstring_hash

520c9f4

`struct sstring_less` Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>

oleiman added 3 commits January 31, 2024 15:10

transform/logging: Introduce transform::logging::manager

678081d

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>

mock_property: Add call operators

0019fcc

For convenience. Call straight into the underlying property. Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>

transform/logging: Tests for log manager

5b8f00a

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>

oleiman force-pushed the xfm-logging/log-manager2 branch from 0df59c4 to 5b8f00a Compare January 31, 2024 23:10

rockwotj approved these changes Feb 1, 2024

oleiman merged commit 99e7c2e into redpanda-data:dev Feb 1, 2024
17 checks passed

vbotbuildovich mentioned this pull request Feb 1, 2024

[v23.3.x] Introduce transform::logging::manager #16420

Closed

vbotbuildovich mentioned this pull request Feb 1, 2024

[v23.3.x] Introduce transform::logging::manager #16436

Merged
