Improve metrics (and some other things) #3390

Merged 34 commits on Jul 24, 2023

@dominiklohmann dominiklohmann commented Jul 23, 2023

This PR is based on @Dakostu's excellent #3376 that laid the groundwork for metrics.

It comes with the following metrics-related changes:

  • We now calculate the approximate number of bytes in execution nodes when they handle events. The calculated number is an underestimate and (1) never includes the byte size of the schema, (2) excludes some array data buffers depending on the Arrow version used.
  • tenzir --dump-metrics '<pipeline>' prints metrics to stderr after the pipeline has finished.
  • A set of fixes for metrics collection bugs that were still present in #3376 (Expose pipeline operator metrics in execution node and pipeline executor).

I also snuck in the following change that is unrelated to metrics, but that I stumbled over while developing this PR:

  • Sorting now works for ip and enum fields.

This is what --dump-metrics looks like:

❯ # Using the M57 trace from the Get Started guide:
❯ cat Zeek/*.log | tenzir --dump-metrics 'read zeek-tsv | to file /dev/null'
operator #1 (load)
  elapsed: 4.02s
  scheduled: 553.13ms (13.75%)
  running: 484.85ms (12.05%)
  outbound:
    bytes: 199,898,907 at a rate of 49677650.77/s
    batches: 12,201 (16383.81 bytes/batch)
operator #2 (read)
  elapsed: 6.06s
  scheduled: 1.84s (30.34%)
  running: 1.78s (29.39%)
  inbound:
    bytes: 199,898,907 at a rate of 32995890.01/s
    batches: 12,201 (16383.81 bytes/batch)
  outbound:
    events: 953,677 at a rate of 157416.68/s
    bytes: 1,997,471 at a rate of 329708.32/s (estimate)
    batches: 54 (17660.69 events/batch)
operator #3 (write)
  elapsed: 7.96s
  scheduled: 7.67s (96.45%)
  running: 7.66s (96.24%)
  inbound:
    events: 953,677 at a rate of 119873.34/s
    bytes: 1,997,471 at a rate of 251074.02/s (estimate)
    batches: 54 (17660.69 events/batch)
  outbound:
    bytes: 541,837,222 at a rate of 68106745.97/s
    batches: 570 (950591.62 bytes/batch)
operator #4 (save)
  elapsed: 7.96s
  scheduled: 6.31ms (0.08%)
  running: 3.94ms (0.05%)
  inbound:
    bytes: 541,837,222 at a rate of 68102061.06/s
    batches: 570 (950591.62 bytes/batch)
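
As a rough cross-check of the output above, the derived figures can be recomputed from the raw totals. This is a minimal Python sketch, not part of the PR: it assumes that the per-batch value is the total divided by the batch count and that each rate is the total divided by the (unrounded) elapsed time. The helper names `per_batch` and `implied_elapsed` are hypothetical.

```python
# Figures copied from the --dump-metrics output above.
load_outbound = {"bytes": 199_898_907, "batches": 12_201, "rate": 49_677_650.77}
read_outbound = {"events": 953_677, "batches": 54}


def per_batch(total: int, batches: int) -> float:
    """Average payload per batch, rounded to two decimals like the output."""
    return round(total / batches, 2)


def implied_elapsed(total: int, rate: float) -> float:
    """Elapsed seconds implied by a total and its printed rate.

    Assumption: rate = total / elapsed, which the numbers above are
    consistent with but which the PR does not state explicitly.
    """
    return total / rate


print(per_batch(load_outbound["bytes"], load_outbound["batches"]))   # 16383.81
print(per_batch(read_outbound["events"], read_outbound["batches"]))  # 17660.69
print(round(implied_elapsed(load_outbound["bytes"], load_outbound["rate"]), 2))  # 4.02
```

Recomputing a rate from the printed elapsed time (for example 4.02s) gives a slightly different number than the printed rate, presumably because the elapsed time is rounded for display.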

@dominiklohmann dominiklohmann added the feature (New functionality) and engine (Core pipeline and storage engine) labels Jul 23, 2023
@dominiklohmann dominiklohmann requested a review from Dakostu July 23, 2023 17:32
This takes the internal rebatching operator of the rebuild command and
makes it available as the `batch <limit>` operator. This is particularly
useful for formats like Feather or Parquet or connectors like Kafka.
This resolves a long-standing todo from the sort operator: We now
rebatch its outputs transparently. In very much non-scientific
measurements this increased the throughput of later operators in the
pipeline significantly.
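
To illustrate the rebatching idea described in the commit message, here is a minimal Python sketch. It is not Tenzir's implementation: it assumes that `<limit>` means a maximum number of events per output batch, models a batch as a plain list, and ignores anything the real operator may do in addition (for example handling different schemas or timeouts). The function name `rebatch` is hypothetical.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def rebatch(batches: Iterable[List[T]], limit: int) -> Iterator[List[T]]:
    """Regroup incoming batches into batches of at most `limit` events.

    Events keep their original order. Every output batch except possibly
    the last one contains exactly `limit` events.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    buffer: List[T] = []
    for batch in batches:
        buffer.extend(batch)
        # Emit full batches as soon as enough events are buffered.
        while len(buffer) >= limit:
            yield buffer[:limit]
            buffer = buffer[limit:]
    # Flush the remainder once the input is exhausted.
    if buffer:
        yield buffer


# Many small input batches become fewer, larger output batches.
print(list(rebatch([[1], [2, 3], [4], [5, 6, 7]], limit=3)))
# [[1, 2, 3], [4, 5, 6], [7]]
```

Fewer, larger batches reduce per-batch overhead in downstream operators, which is one plausible reason for the throughput improvement mentioned above.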

@Dakostu Dakostu left a comment

Thank you for the addition, and the sort fix seems to come in just at the right time 💪

I left some minor comments.

dominiklohmann and others added 7 commits July 24, 2023 14:39
Base automatically changed from topic/expose-pipeline-metrics to main July 24, 2023 16:49
@dominiklohmann dominiklohmann merged commit c0408d6 into main Jul 24, 2023
@dominiklohmann dominiklohmann deleted the topic/estimated-byte-sizes branch July 24, 2023 21:33