
feat(file): optimize metrics in File source #16178

Open · wants to merge 4 commits into base: master

Conversation

@zamazan4ik (Contributor) commented Jan 29, 2023

Related to the discussion #15977

Here I reimplemented a few metrics in the File source using the new register! pattern. This is done to improve performance in the File source scenario. According to my local benchmarks, these changes boost throughput by 2-2.5x compared to the master branch.

The ugliest thing in this PR is the map from files to their corresponding metric handles. If you have a better idea that is easy to implement, that would be awesome. Perhaps, when Vector detects a new file, it could create a corresponding metric set with that filename and pass a handle to the metric through some internal structure (like Line), but that is just an idea; I don't have deep knowledge of the whole internal File source pipeline.

- optimize file metrics. With this optimization I achieve a 2.5x
  performance boost in the "File source -> Blackhole sink" scenario

Tested:
- Local runs
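For readers unfamiliar with the pattern under discussion: the gist of the register! approach is to pay the registry lookup and label-hashing cost once per metric, then emit through a cheap pre-registered handle on the hot path. A minimal self-contained sketch of the idea follows; the RegisteredCounter type here is hypothetical and only stands in for the handle types produced by Vector's macros.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical pre-registered counter handle; a stand-in for the handle
// types produced by Vector's `register!` / `registered_event!` macros.
#[derive(Clone)]
struct RegisteredCounter(Arc<AtomicU64>);

impl RegisteredCounter {
    fn new() -> Self {
        RegisteredCounter(Arc::new(AtomicU64::new(0)))
    }

    // Emitting through an existing handle is a single atomic add:
    // no label hashing or registry lookup on the hot path.
    fn emit(&self, byte_size: u64) {
        self.0.fetch_add(byte_size, Ordering::Relaxed);
    }

    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    // Register once, outside the per-line loop...
    let bytes_received = RegisteredCounter::new();

    // ...then emit cheaply for every line read from the file.
    for line in ["foo", "barbaz"] {
        bytes_received.emit(line.len() as u64);
    }
    assert_eq!(bytes_received.get(), 9);
}
```

The speedup reported in this PR comes from moving that one-time registration out of the per-event path.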
netlify bot commented Jan 29, 2023

Deploy Preview for vector-project ready!

🔨 Latest commit: 0db3779
🔍 Latest deploy log: https://app.netlify.com/sites/vector-project/deploys/63d9d2c8e939e10008df32ca
😎 Deploy Preview: https://deploy-preview-16178--vector-project.netlify.app

netlify bot commented Jan 29, 2023

Deploy Preview for vrl-playground canceled.

🔨 Latest commit: 0db3779
🔍 Latest deploy log: https://app.netlify.com/sites/vrl-playground/deploys/63d9d2c82210870008a9d01a

@github-actions github-actions bot added the domain: sources label Jan 29, 2023
@bruceg bruceg requested review from bruceg and a team January 30, 2023 14:42
@bruceg bruceg added the source: file, type: tech debt, and domain: performance labels Jan 30, 2023
@spencergilbert spencergilbert requested review from spencergilbert and removed request for a team January 30, 2023 15:32
@bruceg (Member) left a comment

This looks good; thanks for this impressive improvement, @zamazan4ik. I do have some requests for improvement below, but the basic design is what I would expect.

src/internal_events/file.rs (outdated; resolved)
src/internal_events/file.rs (outdated; resolved)
src/internal_events/file.rs (outdated; resolved)
src/sources/file.rs (outdated; resolved)
src/sources/file.rs (outdated; resolved)
@zamazan4ik zamazan4ik requested review from bruceg and removed request for spencergilbert January 30, 2023 21:43
@zamazan4ik (Contributor, Author) commented

Just a note about the current implementation: right now it has a kind of "memory leak". Since we create a per-file mapping, the mapping is not cleared when we are finished with a file, unless the source is restarted.

Maybe we need something like a GC for these mappings, or some way to flush an entry when we are finished with its file.

However, I am fairly sure this is not critical and shouldn't block the PR.
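The eviction idea being floated here can be sketched in miniature: drop a file's handle entry as soon as the source learns the file is finished. FileFingerprint, MetricHandle, and on_file_closed below are hypothetical stand-ins, not actual Vector APIs.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for Vector's fingerprint and handle types.
type FileFingerprint = u64;

#[derive(Clone, Debug, PartialEq)]
struct MetricHandle(&'static str);

// The GC idea in miniature: evict the per-file handle as soon as the
// source is told the file is finished, instead of keeping it forever.
fn on_file_closed(
    handles: &mut HashMap<FileFingerprint, MetricHandle>,
    fingerprint: FileFingerprint,
) {
    handles.remove(&fingerprint);
}

fn main() {
    let mut handles = HashMap::new();
    handles.insert(1u64, MetricHandle("file-1"));
    handles.insert(2u64, MetricHandle("file-2"));

    on_file_closed(&mut handles, 1);

    assert_eq!(handles.len(), 1);
    assert!(handles.contains_key(&2));
}
```

The hard part, as the rest of the thread notes, is not the removal itself but getting a reliable "finished with this file" signal from the file server.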

@spencergilbert (Contributor) commented

🤔 Yeah, I don't feel great about the leak, and we've had a number of "memory leak"-style problems around the file server (unlimited cardinality, yay). I don't see a simple way to hook these maps into the point where we stop reading from files.

@bruceg, what do you think here? It's definitely an improvement, but I don't like the idea of causing a slow crash as we accumulate more and more files.

@bruceg (Member) left a comment

This is better. One more required event change and a question. I will comment on garbage collection separately.

"protocol" => "file",
"file" => self.file.to_owned()
);
impl RegisterInternalEvent for FileEventsReceived {
@bruceg (Member):

Would still prefer you use registered_event! for this too.

@zamazan4ik (Contributor, Author):

Fixed.

Comment on lines 585 to 586
let mut file_id_to_metrics_mapping: HashMap<FileFingerprint, FileBytesReceivedHandle> =
HashMap::new();
@bruceg (Member):

Two notes after looking at this again:

  1. The generic name for the handle is Registered<FileBytesReceived>. This is a simple type alias that avoids needing to import the actual name of the handle.
  2. Could this table hold both of the handles? i.e. HashMap<FileFingerprint, (Registered<FileBytesReceived>, Registered<FileEventsReceived>)> That way, when we need to garbage collect, there is only one table to manage.
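The suggestion in point 2 can be sketched as follows; the BytesHandle and EventsHandle unit structs are hypothetical placeholders for Registered<FileBytesReceived> and Registered<FileEventsReceived>.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for Vector's types.
type FileFingerprint = u64;

#[derive(Clone)]
struct BytesHandle;

#[derive(Clone)]
struct EventsHandle;

fn main() {
    // One table holding both handles per file, so that garbage
    // collection only has a single map to manage.
    let mut handles: HashMap<FileFingerprint, (BytesHandle, EventsHandle)> =
        HashMap::new();

    // Register both handles the first time a fingerprint is seen;
    // later lookups reuse the existing entry.
    handles.entry(42).or_insert_with(|| (BytesHandle, EventsHandle));
    handles.entry(42).or_insert_with(|| (BytesHandle, EventsHandle));

    assert_eq!(handles.len(), 1);
}
```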

@zamazan4ik (Contributor, Author):

  1. Fixed.
  2. I divided them into separate maps due to "use-after-move" issues. As you can see, both maps are used in different closures with the move keyword. Of course, we could mitigate this as usual with Rc/Arc, but I didn't want to do that if I could just split the maps and avoid the problem. If there is a better way to merge the maps into one while keeping the code maintainable and not fighting the borrow checker, please tell me.
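The Rc/Arc mitigation mentioned above might look like this: a single table behind Arc<Mutex<..>>, with each move closure capturing its own clone of the Arc. This is a sketch of the borrow-checker workaround with simplified stand-in types, not the actual source code.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for Vector's types.
type FileFingerprint = u64;

#[derive(Clone)]
struct BytesHandle;

#[derive(Clone)]
struct EventsHandle;

fn main() {
    // One shared table behind Arc<Mutex<..>>, so two independent
    // `move` closures can each capture their own clone of the Arc.
    let table: Arc<Mutex<HashMap<FileFingerprint, (BytesHandle, EventsHandle)>>> =
        Arc::new(Mutex::new(HashMap::new()));

    let writer = {
        let table = Arc::clone(&table);
        move |fp: FileFingerprint| {
            table.lock().unwrap().insert(fp, (BytesHandle, EventsHandle));
        }
    };

    let reader = {
        let table = Arc::clone(&table);
        move |fp: FileFingerprint| table.lock().unwrap().contains_key(&fp)
    };

    writer(7);
    assert!(reader(7));
    assert!(!reader(8));
}
```

The trade-off the author describes is real: two plain maps avoid the lock and the extra indirection, at the cost of two structures to garbage-collect.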

@bruceg (Member) commented Jan 31, 2023

What is needed is for the file server to emit an enum of events, instead of just lines, consisting of open file, emit line(s), close file. This will allow the file source to track state with the server. I too am not excited about adding a change like this that is known to have unbounded memory growth, given that we have had problems like that in the past. It is unlikely for this to be a fast memory "leak", but it is still a problem to avoid.
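A rough sketch of what such an event enum might look like follows. The names here are entirely hypothetical; the real file server API would differ. The point is that explicit Open/Close events bracket the lines, so the source can create per-file metric handles on Open and drop them on Close, bounding memory growth.

```rust
use std::collections::HashMap;

// Hypothetical shape of the file-server event stream described above.
enum FileServerEvent {
    Open { fingerprint: u64, path: String },
    Lines { fingerprint: u64, lines: Vec<String> },
    Close { fingerprint: u64 },
}

fn main() {
    let events = vec![
        FileServerEvent::Open { fingerprint: 1, path: "/var/log/app.log".into() },
        FileServerEvent::Lines { fingerprint: 1, lines: vec!["hello".into()] },
        FileServerEvent::Close { fingerprint: 1 },
    ];

    // Stand-in for the per-file metric handle table.
    let mut handles: HashMap<u64, String> = HashMap::new();

    for event in events {
        match event {
            FileServerEvent::Open { fingerprint, path } => {
                // Register the per-file handles exactly once.
                handles.insert(fingerprint, path);
            }
            FileServerEvent::Lines { fingerprint, lines } => {
                // Emit via the already-registered handle.
                assert!(handles.contains_key(&fingerprint));
                let _ = lines;
            }
            FileServerEvent::Close { fingerprint } => {
                // Garbage-collect the handle; no unbounded growth.
                handles.remove(&fingerprint);
            }
        }
    }

    assert!(handles.is_empty());
}
```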

@zamazan4ik (Contributor, Author) commented Feb 1, 2023

> What is needed is for the file server to emit an enum of events, instead of just lines, consisting of open file, emit line(s), close file. This will allow the file source to track state with the server. I too am not excited about adding a change like this that is known to have unbounded memory growth, given that we have had problems like that in the past. It is unlikely for this to be a fast memory "leak", but it is still a problem to avoid.

Do we want to introduce this change in this PR, or in another PR that will fix this "slow" memory leak? I am asking because such modifications to the file server do not seem trivial to me, and I would need to dig into the implementation a little.

github-actions bot commented Feb 1, 2023

Regression Test Results

Run ID: b4accdcf-c628-4fbb-adcc-ac8264662684
Baseline: bd70509
Comparison: 0db3779
Total vector CPUs: 7

Explanation

A regression test is an integrated performance test for vector in a repeatable rig, with varying configuration for vector. What follows is a statistical summary of a brief vector run for each configuration across the SHAs given above. The goal of these tests is to determine, quickly, whether vector performance is changed, and to what degree, by a pull request. Where appropriate, units are scaled per-core.

The table below, if present, lists those experiments that have experienced a statistically significant change in their bytes_written_per_cpu_second performance between baseline and comparison SHAs, with 90.0% confidence OR have been detected as newly erratic. Negative values mean that baseline is faster, positive comparison. Results that do not exhibit more than a ±5% change in mean bytes_written_per_cpu_second are discarded. An experiment is erratic if its coefficient of variation is greater than 0.1. The abbreviated table will be omitted if no interesting changes are observed.

Changes in bytes_written_per_cpu_second with confidence ≥ 90.00% and absolute Δ mean >= ±5%:

| experiment | Δ mean | Δ mean % | confidence |
| --- | --- | --- | --- |
| syslog_regex_logs2metric_ddmetrics | -248.62KiB/CPU-s | -6.71 | 100.00% |
Fine details of change detection per experiment.
| experiment | Δ mean | Δ mean % | confidence | baseline mean | baseline stdev | baseline stderr | baseline outlier % | baseline CoV | comparison mean | comparison stdev | comparison stderr | comparison outlier % | comparison CoV | erratic | declared erratic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| http_text_to_http_json | 799.41KiB/CPU-s | 3.23 | 100.00% | 24.15MiB/CPU-s | 930.0KiB/CPU-s | 12.01KiB/CPU-s | 0.0 | 0.037602 | 24.93MiB/CPU-s | 591.31KiB/CPU-s | 7.63KiB/CPU-s | 0.0 | 0.02316 | False | False |
| syslog_log2metric_splunk_hec_metrics | 239.24KiB/CPU-s | 2.57 | 100.00% | 9.1MiB/CPU-s | 284.68KiB/CPU-s | 3.68KiB/CPU-s | 0.0 | 0.030532 | 9.34MiB/CPU-s | 323.37KiB/CPU-s | 4.18KiB/CPU-s | 0.0 | 0.033814 | False | False |
| datadog_agent_remap_blackhole_acks | 381.78KiB/CPU-s | 1.22 | 100.00% | 30.53MiB/CPU-s | 1.55MiB/CPU-s | 20.45KiB/CPU-s | 0.0 | 0.050653 | 30.91MiB/CPU-s | 1.17MiB/CPU-s | 15.41KiB/CPU-s | 0.0 | 0.037719 | False | False |
| syslog_humio_logs | 94.24KiB/CPU-s | 1.04 | 100.00% | 8.89MiB/CPU-s | 323.17KiB/CPU-s | 4.17KiB/CPU-s | 0.0 | 0.035494 | 8.98MiB/CPU-s | 243.21KiB/CPU-s | 3.14KiB/CPU-s | 0.0 | 0.026438 | False | False |
| datadog_agent_remap_blackhole | 284.61KiB/CPU-s | 0.91 | 100.00% | 30.63MiB/CPU-s | 1.21MiB/CPU-s | 15.96KiB/CPU-s | 0.0 | 0.039393 | 30.91MiB/CPU-s | 423.69KiB/CPU-s | 5.47KiB/CPU-s | 0.0 | 0.013385 | False | False |
| otlp_http_to_blackhole | 13.49KiB/CPU-s | 0.87 | 100.00% | 1.52MiB/CPU-s | 117.35KiB/CPU-s | 1.51KiB/CPU-s | 0.0 | 0.075344 | 1.53MiB/CPU-s | 117.82KiB/CPU-s | 1.52KiB/CPU-s | 0.0 | 0.074998 | False | False |
| socket_to_socket_blackhole | 73.75KiB/CPU-s | 0.55 | 100.00% | 13.11MiB/CPU-s | 336.32KiB/CPU-s | 4.34KiB/CPU-s | 0.0 | 0.025048 | 13.18MiB/CPU-s | 212.74KiB/CPU-s | 2.75KiB/CPU-s | 0.0 | 0.015758 | False | False |
| splunk_hec_route_s3 | 23.27KiB/CPU-s | 0.2 | 95.26% | 11.49MiB/CPU-s | 643.71KiB/CPU-s | 8.3KiB/CPU-s | 0.0 | 0.054729 | 11.51MiB/CPU-s | 642.63KiB/CPU-s | 8.3KiB/CPU-s | 0.0 | 0.05453 | False | False |
| http_to_http_acks | 10.0KiB/CPU-s | 0.19 | 15.51% | 5.21MiB/CPU-s | 2.76MiB/CPU-s | 36.42KiB/CPU-s | 0.0 | 0.528649 | 5.22MiB/CPU-s | 2.72MiB/CPU-s | 35.93KiB/CPU-s | 0.0 | 0.520458 | True | False |
| enterprise_http_to_http | 12.25KiB/CPU-s | 0.09 | 99.11% | 13.61MiB/CPU-s | 329.51KiB/CPU-s | 4.25KiB/CPU-s | 0.0 | 0.023638 | 13.62MiB/CPU-s | 151.79KiB/CPU-s | 1.96KiB/CPU-s | 0.0 | 0.010879 | False | False |
| splunk_hec_to_splunk_hec_logs_noack | 5.73KiB/CPU-s | 0.04 | 85.90% | 13.62MiB/CPU-s | 233.84KiB/CPU-s | 3.02KiB/CPU-s | 0.0 | 0.016765 | 13.63MiB/CPU-s | 190.64KiB/CPU-s | 2.46KiB/CPU-s | 0.0 | 0.013662 | False | False |
| splunk_hec_to_splunk_hec_logs_acks | -433.59B/CPU-s | -0.0 | 5.06% | 13.62MiB/CPU-s | 361.5KiB/CPU-s | 4.66KiB/CPU-s | 0.0 | 0.025925 | 13.62MiB/CPU-s | 369.72KiB/CPU-s | 4.77KiB/CPU-s | 0.0 | 0.026516 | False | False |
| fluent_elasticsearch | 423.34B/CPU-s | 0.0 | 55.48% | 45.41MiB/CPU-s | 30.18KiB/CPU-s | 394.32B/CPU-s | 0.0 | 0.000649 | 45.41MiB/CPU-s | 29.83KiB/CPU-s | 389.85B/CPU-s | 0.0 | 0.000641 | False | False |
| http_to_http_json | 338.47B/CPU-s | 0.0 | 6.85% | 13.62MiB/CPU-s | 212.44KiB/CPU-s | 2.74KiB/CPU-s | 0.0 | 0.015228 | 13.62MiB/CPU-s | 209.33KiB/CPU-s | 2.7KiB/CPU-s | 0.0 | 0.015005 | False | False |
| splunk_hec_indexer_ack_blackhole | -1.1KiB/CPU-s | -0.01 | 19.40% | 13.62MiB/CPU-s | 242.86KiB/CPU-s | 3.13KiB/CPU-s | 0.0 | 0.017414 | 13.62MiB/CPU-s | 250.16KiB/CPU-s | 3.23KiB/CPU-s | 0.0 | 0.017939 | False | False |
| file_to_blackhole | -6.53KiB/CPU-s | -0.01 | 23.79% | 54.49MiB/CPU-s | 1.07MiB/CPU-s | 14.18KiB/CPU-s | 0.0 | 0.019695 | 54.49MiB/CPU-s | 1.23MiB/CPU-s | 16.24KiB/CPU-s | 0.0 | 0.022575 | False | False |
| http_to_http_noack | -3.97KiB/CPU-s | -0.03 | 60.64% | 13.62MiB/CPU-s | 223.25KiB/CPU-s | 2.88KiB/CPU-s | 0.0 | 0.016006 | 13.62MiB/CPU-s | 283.34KiB/CPU-s | 3.66KiB/CPU-s | 0.0 | 0.02032 | False | False |
| otlp_grpc_to_blackhole | -1.14KiB/CPU-s | -0.11 | 80.95% | 1.04MiB/CPU-s | 41.67KiB/CPU-s | 550.78B/CPU-s | 0.0 | 0.039309 | 1.03MiB/CPU-s | 53.34KiB/CPU-s | 704.64B/CPU-s | 0.0 | 0.050369 | False | False |
| syslog_log2metric_humio_metrics | -12.51KiB/CPU-s | -0.21 | 99.05% | 5.76MiB/CPU-s | 216.97KiB/CPU-s | 2.8KiB/CPU-s | 0.0 | 0.036782 | 5.75MiB/CPU-s | 303.78KiB/CPU-s | 3.92KiB/CPU-s | 0.0 | 0.051609 | False | False |
| syslog_splunk_hec_logs | -22.46KiB/CPU-s | -0.25 | 100.00% | 8.93MiB/CPU-s | 166.61KiB/CPU-s | 2.15KiB/CPU-s | 0.0 | 0.018225 | 8.91MiB/CPU-s | 245.3KiB/CPU-s | 3.17KiB/CPU-s | 0.0 | 0.026898 | False | False |
| datadog_agent_remap_datadog_logs_acks | -299.02KiB/CPU-s | -0.87 | 100.00% | 33.48MiB/CPU-s | 1.18MiB/CPU-s | 15.56KiB/CPU-s | 0.0 | 0.035169 | 33.18MiB/CPU-s | 1.16MiB/CPU-s | 15.36KiB/CPU-s | 0.0 | 0.035025 | False | False |
| datadog_agent_remap_datadog_logs | -405.77KiB/CPU-s | -1.16 | 100.00% | 34.15MiB/CPU-s | 1.28MiB/CPU-s | 16.9KiB/CPU-s | 0.0 | 0.037466 | 33.75MiB/CPU-s | 1.26MiB/CPU-s | 16.7KiB/CPU-s | 0.0 | 0.037462 | False | False |
| syslog_loki | -143.9KiB/CPU-s | -1.61 | 100.00% | 8.73MiB/CPU-s | 218.98KiB/CPU-s | 2.83KiB/CPU-s | 0.0 | 0.024485 | 8.59MiB/CPU-s | 235.17KiB/CPU-s | 3.04KiB/CPU-s | 0.0 | 0.026727 | False | False |
| syslog_regex_logs2metric_ddmetrics | -248.62KiB/CPU-s | -6.71 | 100.00% | 3.62MiB/CPU-s | 343.58KiB/CPU-s | 4.44KiB/CPU-s | 0.0 | 0.092735 | 3.38MiB/CPU-s | 376.71KiB/CPU-s | 4.86KiB/CPU-s | 0.0 | 0.108992 | True | False |

@spencergilbert (Contributor) commented

I'd personally be hesitant to merge a known "bug" into the main branch. At the same time, I don't want to ask too much of a community contributor. Would you mind evaluating the scope of the changes needed, and whether you're up to making them?

We're happy to take ownership of the PR if you'd prefer to pass it off.

@jszwedko (Member) commented Feb 2, 2023

Agreed we should resolve the memory leak before this is merged. For more context: we like to keep master in a releasable state to avoid releases being delayed to fix issues and to give users running nightlies a better experience.

@zamazan4ik (Contributor, Author) commented

> For more context: we like to keep master in a releasable state to avoid releases being delayed to fix issues and to give users running nightlies a better experience.

Just to note: I don't think this is a release blocker, since the memory leak would be very slow. But I agree that it's a problem.

@jszwedko jszwedko added the meta: awaiting author label Mar 24, 2023
Labels
domain: performance (Anything related to Vector's performance), domain: sources (Anything related to the Vector's sources), meta: awaiting author (Pull requests that are awaiting their author.), source: file (Anything `file` source related), type: tech debt (A code change that does not add user value.)

4 participants