
Add more metrics for processing latency #5974

Merged
merged 3 commits into develop from dd-metrics on Dec 14, 2020

Conversation

deepthidevaki (Contributor)

Description

Since we have been speculating about several root causes/solutions for performance bottlenecks, I thought it would be good to add the following metrics (a rough sketch of possible definitions follows the list):

  • Time taken for processing a record
  • Latency to write a record to the log
  • Latency to commit a record
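
A minimal sketch of how such histograms might be defined with the Prometheus Java client; the metric and label names below are illustrative assumptions, not necessarily the ones this PR introduces:

    import io.prometheus.client.Histogram;

    // Hypothetical definitions; the PR's actual metric/label names may differ.
    static final Histogram PROCESSING_DURATION =
        Histogram.build()
            .namespace("zeebe")
            .name("stream_processor_processing_duration")
            .help("Time taken for processing a record (seconds)")
            .labelNames("recordType", "partition")
            .register();

    // The write and commit latency histograms would follow the same pattern,
    // e.g. "stream_processor_write_latency" and "stream_processor_commit_latency".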

Related issues

Definition of Done

Depending on the issue and the pull request, not all items need to be done.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/0.25) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with future versions
  • The behavior is tested manually
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement

@deepthidevaki (Contributor Author)

[screenshot attachment: panels for the new latency metrics]

@npepinpe (Member) commented Dec 7, 2020

Writing has a bunch of outliers that are much slower than I expected 😅

@npepinpe (Member) left a comment

🎉

I had some questions; one's a comment, and the other is: is there a reason we can't filter by pod on the new metrics in Grafana?

But no blockers 🙂

    -    metrics.processingLatency(
    -        metadata.getRecordType(), event.getTimestamp(), ActorClock.currentTimeMillis());
    +    final long processingStartTime = ActorClock.currentTimeMillis();
    +    metrics.processingLatency(metadata.getRecordType(), event.getTimestamp(), processingStartTime);
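
For context, a plausible shape for the processingLatency helper referenced in this hunk, assuming a Prometheus histogram named PROCESSING_LATENCY and millisecond timestamps (an illustrative sketch, not the PR's exact code):

    // Records the gap between when a record was written (its event timestamp)
    // and when its processing starts. PROCESSING_LATENCY is an assumed histogram.
    public void processingLatency(
        final RecordType recordType, final long written, final long processed) {
      PROCESSING_LATENCY
          .labels(recordType.name(), partitionIdLabel)
          .observe((processed - written) / 1000f); // milliseconds -> seconds
    }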
@npepinpe (Member)

Can you explain the choice to include deserialization as part of the processing time?

@deepthidevaki (Contributor Author)

As a first step, I did not want to narrow the scope of the metrics. If the processing duration is higher than we expect, then we can narrow the metrics down to smaller blocks. Would you like to take deserialization out of the processing time?

@npepinpe (Member) Dec 7, 2020

I mostly wanted to know if it was a conscious choice, and if so, why. Do you think it might be confusing/unexpected that deserializing is part of the processing time? I think it might be a little unexpected, but once you know, it doesn't sound out of place imho; you can make a case that it is part of the processing.

@deepthidevaki (Contributor Author)

Actually, I wanted to include the complete processing - from when this event is ready to process until the next event is ready. This would give us an idea of how much time we spend in the StreamProcessor. So it should include the steps updateState and writeEvent, which are not currently included in the processing time. Wdyt? Shall I update it? Then it wouldn't be weird to have deserializing also be part of the processing time.

@npepinpe (Member)

That makes more sense - we can make it more granular by adding a metric per step later on (e.g. updateState, writeFollowUpEvents, etc.). If we add one here, can we also add one in reprocessing? Both will be very useful when refactoring how we do stream processing next quarter 👍
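
To illustrate the widened span being agreed on here, a hedged sketch of where the measurement points could sit; deserialize, processRecord, and writeFollowUpEvents are placeholder names, with updateState taken from the discussion above:

    // Illustrative only: measure from when the event is ready to process
    // until the follow-up work is done, so deserialization, updateState and
    // writing follow-up events all fall inside the measured span.
    final long processingStartTime = ActorClock.currentTimeMillis();
    final var typedEvent = deserialize(event);
    processRecord(typedEvent);
    updateState();
    writeFollowUpEvents();
    metrics.processingDuration(
        metadata.getRecordType(), processingStartTime, ActorClock.currentTimeMillis());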

@deepthidevaki (Contributor Author)

> I had some questions; one's a comment, and the other is: is there a reason we can't filter by pod on the new metrics in Grafana?

I just copied the existing panels for processing latency and adjusted the descriptions. I guess they were not filtering by pods. But I can add it to both the old and new metrics.

@deepthidevaki (Contributor Author)

@npepinpe I have updated the processing duration calculation.

@npepinpe (Member) left a comment

🎉

    public void processingDuration(
        final RecordType recordType, final long started, final long processed) {
      PROCESSING_DURATION
          .labels(recordType.name(), partitionIdLabel)
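
The snippet above is cut off in the page capture; a plausible completion, assuming the histogram observes the elapsed time converted from milliseconds to seconds:

    public void processingDuration(
        final RecordType recordType, final long started, final long processed) {
      PROCESSING_DURATION
          .labels(recordType.name(), partitionIdLabel)
          // assumed completion: observe elapsed wall-clock time in seconds
          .observe((processed - started) / 1000f);
    }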
@npepinpe (Member)

I think it would be interesting to find hotspots by adding value type/intent as labels - however, I'm not sure if that creates too many dimensions/a data explosion. We can definitely do that as a second iteration though.

@deepthidevaki (Contributor Author)

Yes. We can improve it when we need it.
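
For reference, a hedged illustration of the label extension discussed above and why its cardinality is a concern; the extra valueType/intent labels are assumptions, not part of this PR:

    // Hypothetical: adding valueType and intent labels multiplies the number
    // of time series roughly by (value types x intents) per record type and
    // partition - the "data explosion" risk mentioned above. The histogram
    // would also need to be registered with these extra labelNames.
    PROCESSING_DURATION
        .labels(
            recordType.name(),
            partitionIdLabel,
            metadata.getValueType().name(),
            metadata.getIntent().name())
        .observe((processed - started) / 1000f);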

@deepthidevaki (Contributor Author)

bors r+

@zeebe-bors zeebe-bors bot merged commit c4ab987 into develop Dec 14, 2020
@zeebe-bors zeebe-bors bot deleted the dd-metrics branch December 14, 2020 13:50
@zeebe-bors bot commented Dec 14, 2020

Build succeeded:
