Add more metrics for processing latency #5974
Conversation
Writing has a bunch of outliers that are much slower than I expected 😅
🎉
I had some questions, one's a comment, and the other is: is there a reason we can't filter by pod on the new metrics in Grafana?
But no blockers 🙂
metrics.processingLatency(
    metadata.getRecordType(), event.getTimestamp(), ActorClock.currentTimeMillis());
final long processingStartTime = ActorClock.currentTimeMillis();
metrics.processingLatency(metadata.getRecordType(), event.getTimestamp(), processingStartTime);
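For context, here is a minimal sketch of what such a processingLatency helper could look like with the Prometheus Java client. The class shape, metric name, buckets, and import path are assumptions for illustration, not the actual Zeebe implementation; only RecordType, the partition label, and the millisecond timestamps come from the diff above.

import io.prometheus.client.Histogram;
import io.zeebe.protocol.record.RecordType; // package path may differ per Zeebe version

final class StreamProcessorMetricsSketch {

  // Histogram of the time between a record being written and its processing starting,
  // labeled by record type and partition. Name and help text are illustrative only.
  private static final Histogram PROCESSING_LATENCY =
      Histogram.build()
          .namespace("zeebe")
          .name("stream_processor_latency")
          .help("Time between a record being written and processing starting, in seconds")
          .labelNames("recordType", "partition")
          .register();

  private final String partitionIdLabel;

  StreamProcessorMetricsSketch(final int partitionId) {
    this.partitionIdLabel = String.valueOf(partitionId);
  }

  void processingLatency(
      final RecordType recordType, final long writtenMs, final long processingStartedMs) {
    // both timestamps are epoch milliseconds; Prometheus histograms conventionally observe seconds
    PROCESSING_LATENCY
        .labels(recordType.name(), partitionIdLabel)
        .observe((processingStartedMs - writtenMs) / 1000f);
  }
}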
Can you explain the choice to include deserialization as part of the processing time?
As a first step, I did not want to narrow the scope of the metric. If the processing duration is higher than what we expect, then we can narrow the metric down to smaller blocks. Would you like to take deserialization out of the processing time?
I mostly wanted to know if it was a conscious choice, and if so, why. Do you think it might be confusing/unexpected that deserializing is part of the processing time? I think it might be a little unexpected, but once you know, it doesn't sound out of place imho; you can make a case that it is part of the processing.
Actually, I wanted to include the complete processing, from when this event is ready to process until the next event is ready. This would give us an idea of how much time we spend in the StreamProcessor. So that should include the steps updateState and writeEvent, which are not currently included in the processing time. Wdyt? Shall I update it? Then it wouldn't be weird to have deserializing also be part of the processing time.
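Roughly, such a wider measurement could look like the fragment below, where the timer spans everything from the record being ready until the follow-up work is done. Method and type names like deserializeMetadata, processRecord, updateState, and writeFollowUpEvents are hypothetical placeholders for the actual steps, not the real StreamProcessor code; only ActorClock and metrics.processingDuration come from this PR.

// Hypothetical shape of the processing loop: the measured span starts before
// deserialization and ends only after the state update and follow-up writes, so the
// whole time spent per record in the stream processor is captured.
private void processEvent(final LoggedEvent event) {
  final long processingStartTime = ActorClock.currentTimeMillis();

  final RecordMetadata metadata = deserializeMetadata(event); // deserialization included
  final ProcessingResult result = processRecord(event, metadata);
  updateState(result);          // included in the measured span
  writeFollowUpEvents(result);  // included in the measured span

  metrics.processingDuration(
      metadata.getRecordType(), processingStartTime, ActorClock.currentTimeMillis());
}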
That makes more sense - we can make it more granular by adding a metric per step later on (i.e. updateState, writeFollowUpEvents, etc.). If we add one here, can we also add one in reprocessing? Both will be very useful when refactoring how we do stream processing next quarter 👍
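One possible shape for that later, more granular iteration (purely a sketch, not something in this PR) is a single histogram with a step label instead of a separate metric per step; all names below are assumptions.

import io.prometheus.client.Histogram;

final class StepDurationSketch {

  // One duration histogram labeled by processing step, so dashboards can break the
  // total processing time down into e.g. "updateState" and "writeFollowUpEvents".
  private static final Histogram STEP_DURATION =
      Histogram.build()
          .namespace("zeebe")
          .name("stream_processor_step_duration")
          .help("Duration of a single stream processing step, in seconds")
          .labelNames("step", "partition")
          .register();

  void observeStep(
      final String step, final String partitionIdLabel, final long startMs, final long endMs) {
    STEP_DURATION.labels(step, partitionIdLabel).observe((endMs - startMs) / 1000f);
  }
}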
I just copied the existing panels for processing latency and adjusted the descriptions. I guess they were not filtering by pod. But I can add it on both the old and new metrics.
@npepinpe I have updated the processing duration calculation.
🎉
public void processingDuration(
    final RecordType recordType, final long started, final long processed) {
  PROCESSING_DURATION
      .labels(recordType.name(), partitionIdLabel)
      // the diff excerpt ends here; presumably the histogram observes the elapsed time in seconds
      .observe((processed - started) / 1000f);
}
I think it would be interesting to find hotspots by adding value type/intent as labels; however, I'm not sure if that creates too many dimensions / a data explosion. We can definitely do that as a second iteration though.
Yes. We can improve it when we need it.
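For reference, adding value type and intent would only be a small change to the label set; the sketch below is hypothetical (names are not from this PR) and mainly shows why cardinality grows: the series count becomes recordType × valueType × intent × partition.

import io.prometheus.client.Histogram;

final class LabelledDurationSketch {

  // Hypothetical second iteration: extra valueType/intent labels make hotspots visible,
  // but every additional label multiplies the number of time series Prometheus has to store.
  private static final Histogram PROCESSING_DURATION =
      Histogram.build()
          .namespace("zeebe")
          .name("stream_processor_processing_duration")
          .help("Processing duration per record, in seconds")
          .labelNames("recordType", "valueType", "intent", "partition")
          .register();

  void processingDuration(
      final String recordType,
      final String valueType,
      final String intent,
      final String partition,
      final long startedMs,
      final long processedMs) {
    PROCESSING_DURATION
        .labels(recordType, valueType, intent, partition)
        .observe((processedMs - startedMs) / 1000f);
  }
}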
Force-pushed from 81740e0 to 05dc428.
bors r+
Build succeeded:
Description
Since we have been speculating about several root causes/solutions for performance bottlenecks, I thought it would be good to add the following metrics:
Related issues
Definition of Done
Not all items need to be done, depending on the issue and the pull request.
Code changes:
The change is backported by adding the label backport stable/0.25 to the PR; in case that fails, you need to create backports manually.
Testing:
Documentation: