-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(file): optimize metrics in File source #16178
base: master
Are you sure you want to change the base?
feat(file): optimize metrics in File source #16178
Conversation
- optimize file metrics. With this optimization I achieve 2.5x performance boost in the scenarion "File source -> Blackhole sink" Tested: - Local runs
✅ Deploy Preview for vector-project ready!
To edit notification comments on pull requests, go to your Netlify site settings. |
✅ Deploy Preview for vrl-playground canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, thanks for this impressive improvement, @zamazan4ik. I do have some requests for improvement below but the basic design is what I would expect.
Just a note about the current implementation. Right now it has some kind of a "memory leak": since we create a mapping "per-file", this mapping is not cleared when we are finished with a file and did not restart the source. Maybe we need something like a GC for these mappings or integrate somehow this flush when we are finished with a file. However, I am certainly sure that it's not critical stuff and it shouldn't block the PR. |
🤔 Yeah, I don't feel great about the leak and we've had a number of "memory leak" style problems around the file server (unlimited cardinality, yay). I don't see a super simple way to hook these maps into when we stop reading from files. @bruceg what do you think here, it's definitely an improvement but I don't like the idea of causing a slow crash as we accumulate more and more files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is better. One more required event change and a question. I will comment on garbage collection separately.
src/internal_events/file.rs
Outdated
"protocol" => "file", | ||
"file" => self.file.to_owned() | ||
); | ||
impl RegisterInternalEvent for FileEventsReceived { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would still prefer you use registered_event!
for this too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/sources/file.rs
Outdated
let mut file_id_to_metrics_mapping: HashMap<FileFingerprint, FileBytesReceivedHandle> = | ||
HashMap::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two notes after looking at this again:
- The generic name for the handle is
Registered<FileBytesReceived>
. This is a simple type alias that avoids needing to import the actual name of the handle. - Could this table hold both of the handles? i.e.
HashMap<FileFingerprint, (Registered<FileBytesReceived>, Registered<FileEventsReceived>)>
That way, when we need to garbage collect, there is only one table to manage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Fixed.
- I divided them into separate maps due to "use-after-move" issues. As you see, both maps are used in different lambdas with
move
keyword. Of course, we could mitigate it as usual withRc
/Arc
but I didn't want to do so if I could just divide maps and avoid this problem. If there is a better way to merge the maps into one and still have maintainable code and not fight with the borrow checker - please tell me.
What is needed is for the file server to emit an enum of events, instead of just lines, consisting of open file, emit line(s), close file. This will allow the file source to track state with the server. I too am not excited about adding a change like this that is known to have unbounded memory growth, given that we have had problems like that in the past. It is unlikely for this to be a fast memory "leak", but it is still a problem to avoid. |
Do we want to introduce this change in this PR or in another PR, that will fix this "slow" memory leak? I am asking because such modifications to the file server do not seem trivial to me and I need to dig a little into the implementation. |
Regression Test Results
Run ID: b4accdcf-c628-4fbb-adcc-ac8264662684 Explanation
A regression test is an integrated performance test for
The table below, if present, lists those experiments that have experienced a
statistically significant change in their Changes in
Fine details of change detection per experiment.
|
I'd personally be hesitant to merge a known "bug" into the main branch. At the same time I don't want to ask too much of a community contributor, would you mind evaluating the scope of the changes needed and if you're up to making the required changes? We're happy to take ownership of the PR if you'd prefer to pass it off. |
Agreed we should resolve the memory leak before this is merged. For more context: we like to keep master in a releasable state to avoid releases being delayed to fix issues and to give users running nightlies a better experience. |
Just to note: I don't think that this is a release blocker. Since the memory leak would be too slow. But I agree that it's a problem. |
Related to the discussion #15977
Here I reimplemented a few metrics in File source to the new
register!
pattern. This is done for achieving more performance in the File source scenario. According to my local benchmarks, these changes boost throughput up to 2-2.5x compared to the master branch.The ugliest thing in this PR is a map for mapping files to the corresponding metric handle. If you have a better idea, that is easy to implement - that would be awesome. Probably, somewhere when Vector detects a new file, it could create a corresponding metrics set with the corresponding filename, and pass a handle to the metric with some internal structure (like
Line
) but that is just an idea without an actual deep knowledge about the whole internal File source pipeline.