
feat(file): optimize metrics in File source #16178

Open · wants to merge 4 commits into base: master

Conversation

@zamazan4ik (Contributor) commented Jan 29, 2023

Related to the discussion #15977

Here I reimplemented a few metrics in the File source using the new register! pattern. This is done to improve performance in the File source scenario. According to my local benchmarks, these changes boost throughput by 2-2.5x compared to the master branch.

The ugliest thing in this PR is the map from files to their corresponding metric handles. If you have a better idea that is easy to implement, that would be awesome. Perhaps, when Vector detects a new file, it could create a corresponding metric set with that filename and pass a handle to the metric through some internal structure (like Line), but that is just an idea; I don't have deep knowledge of the whole internal File source pipeline.

- optimize file metrics. With this optimization I achieve a 2.5x
  performance boost in the "File source -> Blackhole sink" scenario

Tested:
- Local runs
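For readers unfamiliar with the pattern under discussion: the gist of the register! approach is to pay the registry lookup and label-hashing cost once per metric, then emit through a cheap pre-registered handle on the hot path. A minimal self-contained sketch of the idea follows; the RegisteredCounter type here is hypothetical and only stands in for the handle types produced by Vector's macros.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical pre-registered counter handle; a stand-in for the handle
// types produced by Vector's `register!` / `registered_event!` macros.
#[derive(Clone)]
struct RegisteredCounter(Arc<AtomicU64>);

impl RegisteredCounter {
    fn new() -> Self {
        RegisteredCounter(Arc::new(AtomicU64::new(0)))
    }

    // Emitting through an existing handle is a single atomic add:
    // no label hashing or registry lookup on the hot path.
    fn emit(&self, byte_size: u64) {
        self.0.fetch_add(byte_size, Ordering::Relaxed);
    }

    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    // Register once, outside the per-line loop...
    let bytes_received = RegisteredCounter::new();

    // ...then emit cheaply for every line read from the file.
    for line in ["foo", "barbaz"] {
        bytes_received.emit(line.len() as u64);
    }
    assert_eq!(bytes_received.get(), 9);
}
```

The speedup reported in this PR comes from moving that one-time registration out of the per-event path.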
netlify bot commented Jan 29, 2023

Deploy Preview for vector-project ready!

🔨 Latest commit: 0db3779
🔍 Latest deploy log: https://app.netlify.com/sites/vector-project/deploys/63d9d2c8e939e10008df32ca
😎 Deploy Preview: https://deploy-preview-16178--vector-project.netlify.app

netlify bot commented Jan 29, 2023

Deploy Preview for vrl-playground canceled.

🔨 Latest commit: 0db3779
🔍 Latest deploy log: https://app.netlify.com/sites/vrl-playground/deploys/63d9d2c82210870008a9d01a

@github-actions github-actions bot added the domain: sources label Jan 29, 2023
@bruceg bruceg requested review from bruceg and a team January 30, 2023 14:42
@bruceg bruceg added the source: file, type: tech debt, and domain: performance labels Jan 30, 2023
@spencergilbert spencergilbert requested review from spencergilbert and removed request for a team January 30, 2023 15:32
@bruceg (Member) left a comment

This looks good; thanks for this impressive improvement, @zamazan4ik. I do have some requests for improvement below, but the basic design is what I would expect.

src/internal_events/file.rs (outdated; resolved)
src/internal_events/file.rs (outdated; resolved)
src/internal_events/file.rs (outdated; resolved)
src/sources/file.rs (outdated; resolved)
src/sources/file.rs (outdated; resolved)
@zamazan4ik zamazan4ik requested review from bruceg and removed request for spencergilbert January 30, 2023 21:43
@zamazan4ik (Contributor, Author) commented

Just a note about the current implementation: right now it has a kind of "memory leak". Since we create a per-file mapping, the mapping is not cleared when we are finished with a file, unless the source is restarted.

Maybe we need something like a GC for these mappings, or some way to flush an entry when we are finished with its file.

However, I am fairly sure this is not critical and shouldn't block the PR.
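The eviction idea being floated here can be sketched in miniature: drop a file's handle entry as soon as the source learns the file is finished. FileFingerprint, MetricHandle, and on_file_closed below are hypothetical stand-ins, not actual Vector APIs.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for Vector's fingerprint and handle types.
type FileFingerprint = u64;

#[derive(Clone, Debug, PartialEq)]
struct MetricHandle(&'static str);

// The GC idea in miniature: evict the per-file handle as soon as the
// source is told the file is finished, instead of keeping it forever.
fn on_file_closed(
    handles: &mut HashMap<FileFingerprint, MetricHandle>,
    fingerprint: FileFingerprint,
) {
    handles.remove(&fingerprint);
}

fn main() {
    let mut handles = HashMap::new();
    handles.insert(1u64, MetricHandle("file-1"));
    handles.insert(2u64, MetricHandle("file-2"));

    on_file_closed(&mut handles, 1);

    assert_eq!(handles.len(), 1);
    assert!(handles.contains_key(&2));
}
```

The hard part, as the rest of the thread notes, is not the removal itself but getting a reliable "finished with this file" signal from the file server.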

@spencergilbert (Contributor) commented

🤔 Yeah, I don't feel great about the leak, and we've had a number of "memory leak"-style problems around the file server (unlimited cardinality, yay). I don't see a simple way to hook these maps into the point where we stop reading from files.

@bruceg, what do you think here? It's definitely an improvement, but I don't like the idea of causing a slow crash as we accumulate more and more files.

@bruceg (Member) left a comment

This is better. One more required event change and a question. I will comment on garbage collection separately.

"protocol" => "file",
"file" => self.file.to_owned()
);
impl RegisterInternalEvent for FileEventsReceived {
@bruceg (Member):

Would still prefer you use registered_event! for this too.

@zamazan4ik (Contributor, Author):

Fixed.

Comment on lines 585 to 586
let mut file_id_to_metrics_mapping: HashMap<FileFingerprint, FileBytesReceivedHandle> =
HashMap::new();
@bruceg (Member):

Two notes after looking at this again:

  1. The generic name for the handle is Registered<FileBytesReceived>. This is a simple type alias that avoids needing to import the actual name of the handle.
  2. Could this table hold both of the handles? i.e. HashMap<FileFingerprint, (Registered<FileBytesReceived>, Registered<FileEventsReceived>)> That way, when we need to garbage collect, there is only one table to manage.
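The suggestion in point 2 can be sketched as follows; the BytesHandle and EventsHandle unit structs are hypothetical placeholders for Registered<FileBytesReceived> and Registered<FileEventsReceived>.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for Vector's types.
type FileFingerprint = u64;

#[derive(Clone)]
struct BytesHandle;

#[derive(Clone)]
struct EventsHandle;

fn main() {
    // One table holding both handles per file, so that garbage
    // collection only has a single map to manage.
    let mut handles: HashMap<FileFingerprint, (BytesHandle, EventsHandle)> =
        HashMap::new();

    // Register both handles the first time a fingerprint is seen;
    // later lookups reuse the existing entry.
    handles.entry(42).or_insert_with(|| (BytesHandle, EventsHandle));
    handles.entry(42).or_insert_with(|| (BytesHandle, EventsHandle));

    assert_eq!(handles.len(), 1);
}
```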

@zamazan4ik (Contributor, Author):

  1. Fixed.
  2. I divided them into separate maps due to "use-after-move" issues. As you can see, both maps are used in different closures with the move keyword. Of course, we could mitigate this as usual with Rc/Arc, but I didn't want to do that if I could just split the maps and avoid the problem. If there is a better way to merge the maps into one while keeping the code maintainable and not fighting the borrow checker, please tell me.
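The Rc/Arc mitigation mentioned above might look like this: a single table behind Arc<Mutex<..>>, with each move closure capturing its own clone of the Arc. This is a sketch of the borrow-checker workaround with simplified stand-in types, not the actual source code.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for Vector's types.
type FileFingerprint = u64;

#[derive(Clone)]
struct BytesHandle;

#[derive(Clone)]
struct EventsHandle;

fn main() {
    // One shared table behind Arc<Mutex<..>>, so two independent
    // `move` closures can each capture their own clone of the Arc.
    let table: Arc<Mutex<HashMap<FileFingerprint, (BytesHandle, EventsHandle)>>> =
        Arc::new(Mutex::new(HashMap::new()));

    let writer = {
        let table = Arc::clone(&table);
        move |fp: FileFingerprint| {
            table.lock().unwrap().insert(fp, (BytesHandle, EventsHandle));
        }
    };

    let reader = {
        let table = Arc::clone(&table);
        move |fp: FileFingerprint| table.lock().unwrap().contains_key(&fp)
    };

    writer(7);
    assert!(reader(7));
    assert!(!reader(8));
}
```

The trade-off the author describes is real: two plain maps avoid the lock and the extra indirection, at the cost of two structures to garbage-collect.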

@bruceg (Member) commented Jan 31, 2023

What is needed is for the file server to emit an enum of events, instead of just lines, consisting of open file, emit line(s), close file. This will allow the file source to track state with the server. I too am not excited about adding a change like this that is known to have unbounded memory growth, given that we have had problems like that in the past. It is unlikely for this to be a fast memory "leak", but it is still a problem to avoid.
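A rough sketch of what such an event enum might look like follows. The names here are entirely hypothetical; the real file server API would differ. The point is that explicit Open/Close events bracket the lines, so the source can create per-file metric handles on Open and drop them on Close, bounding memory growth.

```rust
use std::collections::HashMap;

// Hypothetical shape of the file-server event stream described above.
enum FileServerEvent {
    Open { fingerprint: u64, path: String },
    Lines { fingerprint: u64, lines: Vec<String> },
    Close { fingerprint: u64 },
}

fn main() {
    let events = vec![
        FileServerEvent::Open { fingerprint: 1, path: "/var/log/app.log".into() },
        FileServerEvent::Lines { fingerprint: 1, lines: vec!["hello".into()] },
        FileServerEvent::Close { fingerprint: 1 },
    ];

    // Stand-in for the per-file metric handle table.
    let mut handles: HashMap<u64, String> = HashMap::new();

    for event in events {
        match event {
            FileServerEvent::Open { fingerprint, path } => {
                // Register the per-file handles exactly once.
                handles.insert(fingerprint, path);
            }
            FileServerEvent::Lines { fingerprint, lines } => {
                // Emit via the already-registered handle.
                assert!(handles.contains_key(&fingerprint));
                let _ = lines;
            }
            FileServerEvent::Close { fingerprint } => {
                // Garbage-collect the handle; no unbounded growth.
                handles.remove(&fingerprint);
            }
        }
    }

    assert!(handles.is_empty());
}
```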

@zamazan4ik (Contributor, Author) commented Feb 1, 2023

> What is needed is for the file server to emit an enum of events, instead of just lines, consisting of open file, emit line(s), close file. This will allow the file source to track state with the server. I too am not excited about adding a change like this that is known to have unbounded memory growth, given that we have had problems like that in the past. It is unlikely for this to be a fast memory "leak", but it is still a problem to avoid.

Do we want to introduce this change in this PR, or in another PR that will fix this "slow" memory leak? I am asking because such modifications to the file server do not seem trivial to me, and I would need to dig into the implementation a little.

github-actions bot commented Feb 1, 2023

Regression Test Results

Run ID: b4accdcf-c628-4fbb-adcc-ac8264662684
Baseline: bd70509
Comparison: 0db3779
Total vector CPUs: 7

Explanation

A regression test is an integrated performance test for vector in a repeatable rig, with varying configuration for vector. What follows is a statistical summary of a brief vector run for each configuration across the SHAs given above. The goal of these tests is to determine, quickly, whether vector performance is changed, and to what degree, by a pull request. Where appropriate, units are scaled per-core.

The table below, if present, lists those experiments that have experienced a statistically significant change in their bytes_written_per_cpu_second performance between baseline and comparison SHAs, with 90.0% confidence OR have been detected as newly erratic. Negative values mean that baseline is faster, positive comparison. Results that do not exhibit more than a ±5% change in mean bytes_written_per_cpu_second are discarded. An experiment is erratic if its coefficient of variation is greater than 0.1. The abbreviated table will be omitted if no interesting changes are observed.

Changes in bytes_written_per_cpu_second with confidence ≥ 90.00% and absolute Δ mean >= ±5%:

| experiment | Δ mean | Δ mean % | confidence |
| --- | --- | --- | --- |
| syslog_regex_logs2metric_ddmetrics | -248.62KiB/CPU-s | -6.71 | 100.00% |
Fine details of change detection per experiment.
| experiment | Δ mean | Δ mean % | confidence | baseline mean | baseline stdev | baseline stderr | baseline outlier % | baseline CoV | comparison mean | comparison stdev | comparison stderr | comparison outlier % | comparison CoV | erratic | declared erratic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| http_text_to_http_json | 799.41KiB/CPU-s | 3.23 | 100.00% | 24.15MiB/CPU-s | 930.0KiB/CPU-s | 12.01KiB/CPU-s | 0.0 | 0.037602 | 24.93MiB/CPU-s | 591.31KiB/CPU-s | 7.63KiB/CPU-s | 0.0 | 0.02316 | False | False |
| syslog_log2metric_splunk_hec_metrics | 239.24KiB/CPU-s | 2.57 | 100.00% | 9.1MiB/CPU-s | 284.68KiB/CPU-s | 3.68KiB/CPU-s | 0.0 | 0.030532 | 9.34MiB/CPU-s | 323.37KiB/CPU-s | 4.18KiB/CPU-s | 0.0 | 0.033814 | False | False |
| datadog_agent_remap_blackhole_acks | 381.78KiB/CPU-s | 1.22 | 100.00% | 30.53MiB/CPU-s | 1.55MiB/CPU-s | 20.45KiB/CPU-s | 0.0 | 0.050653 | 30.91MiB/CPU-s | 1.17MiB/CPU-s | 15.41KiB/CPU-s | 0.0 | 0.037719 | False | False |
| syslog_humio_logs | 94.24KiB/CPU-s | 1.04 | 100.00% | 8.89MiB/CPU-s | 323.17KiB/CPU-s | 4.17KiB/CPU-s | 0.0 | 0.035494 | 8.98MiB/CPU-s | 243.21KiB/CPU-s | 3.14KiB/CPU-s | 0.0 | 0.026438 | False | False |
| datadog_agent_remap_blackhole | 284.61KiB/CPU-s | 0.91 | 100.00% | 30.63MiB/CPU-s | 1.21MiB/CPU-s | 15.96KiB/CPU-s | 0.0 | 0.039393 | 30.91MiB/CPU-s | 423.69KiB/CPU-s | 5.47KiB/CPU-s | 0.0 | 0.013385 | False | False |
| otlp_http_to_blackhole | 13.49KiB/CPU-s | 0.87 | 100.00% | 1.52MiB/CPU-s | 117.35KiB/CPU-s | 1.51KiB/CPU-s | 0.0 | 0.075344 | 1.53MiB/CPU-s | 117.82KiB/CPU-s | 1.52KiB/CPU-s | 0.0 | 0.074998 | False | False |
| socket_to_socket_blackhole | 73.75KiB/CPU-s | 0.55 | 100.00% | 13.11MiB/CPU-s | 336.32KiB/CPU-s | 4.34KiB/CPU-s | 0.0 | 0.025048 | 13.18MiB/CPU-s | 212.74KiB/CPU-s | 2.75KiB/CPU-s | 0.0 | 0.015758 | False | False |
| splunk_hec_route_s3 | 23.27KiB/CPU-s | 0.2 | 95.26% | 11.49MiB/CPU-s | 643.71KiB/CPU-s | 8.3KiB/CPU-s | 0.0 | 0.054729 | 11.51MiB/CPU-s | 642.63KiB/CPU-s | 8.3KiB/CPU-s | 0.0 | 0.05453 | False | False |
| http_to_http_acks | 10.0KiB/CPU-s | 0.19 | 15.51% | 5.21MiB/CPU-s | 2.76MiB/CPU-s | 36.42KiB/CPU-s | 0.0 | 0.528649 | 5.22MiB/CPU-s | 2.72MiB/CPU-s | 35.93KiB/CPU-s | 0.0 | 0.520458 | True | False |
| enterprise_http_to_http | 12.25KiB/CPU-s | 0.09 | 99.11% | 13.61MiB/CPU-s | 329.51KiB/CPU-s | 4.25KiB/CPU-s | 0.0 | 0.023638 | 13.62MiB/CPU-s | 151.79KiB/CPU-s | 1.96KiB/CPU-s | 0.0 | 0.010879 | False | False |
| splunk_hec_to_splunk_hec_logs_noack | 5.73KiB/CPU-s | 0.04 | 85.90% | 13.62MiB/CPU-s | 233.84KiB/CPU-s | 3.02KiB/CPU-s | 0.0 | 0.016765 | 13.63MiB/CPU-s | 190.64KiB/CPU-s | 2.46KiB/CPU-s | 0.0 | 0.013662 | False | False |
| splunk_hec_to_splunk_hec_logs_acks | -433.59B/CPU-s | -0.0 | 5.06% | 13.62MiB/CPU-s | 361.5KiB/CPU-s | 4.66KiB/CPU-s | 0.0 | 0.025925 | 13.62MiB/CPU-s | 369.72KiB/CPU-s | 4.77KiB/CPU-s | 0.0 | 0.026516 | False | False |
| fluent_elasticsearch | 423.34B/CPU-s | 0.0 | 55.48% | 45.41MiB/CPU-s | 30.18KiB/CPU-s | 394.32B/CPU-s | 0.0 | 0.000649 | 45.41MiB/CPU-s | 29.83KiB/CPU-s | 389.85B/CPU-s | 0.0 | 0.000641 | False | False |
| http_to_http_json | 338.47B/CPU-s | 0.0 | 6.85% | 13.62MiB/CPU-s | 212.44KiB/CPU-s | 2.74KiB/CPU-s | 0.0 | 0.015228 | 13.62MiB/CPU-s | 209.33KiB/CPU-s | 2.7KiB/CPU-s | 0.0 | 0.015005 | False | False |
| splunk_hec_indexer_ack_blackhole | -1.1KiB/CPU-s | -0.01 | 19.40% | 13.62MiB/CPU-s | 242.86KiB/CPU-s | 3.13KiB/CPU-s | 0.0 | 0.017414 | 13.62MiB/CPU-s | 250.16KiB/CPU-s | 3.23KiB/CPU-s | 0.0 | 0.017939 | False | False |
| file_to_blackhole | -6.53KiB/CPU-s | -0.01 | 23.79% | 54.49MiB/CPU-s | 1.07MiB/CPU-s | 14.18KiB/CPU-s | 0.0 | 0.019695 | 54.49MiB/CPU-s | 1.23MiB/CPU-s | 16.24KiB/CPU-s | 0.0 | 0.022575 | False | False |
| http_to_http_noack | -3.97KiB/CPU-s | -0.03 | 60.64% | 13.62MiB/CPU-s | 223.25KiB/CPU-s | 2.88KiB/CPU-s | 0.0 | 0.016006 | 13.62MiB/CPU-s | 283.34KiB/CPU-s | 3.66KiB/CPU-s | 0.0 | 0.02032 | False | False |
| otlp_grpc_to_blackhole | -1.14KiB/CPU-s | -0.11 | 80.95% | 1.04MiB/CPU-s | 41.67KiB/CPU-s | 550.78B/CPU-s | 0.0 | 0.039309 | 1.03MiB/CPU-s | 53.34KiB/CPU-s | 704.64B/CPU-s | 0.0 | 0.050369 | False | False |
| syslog_log2metric_humio_metrics | -12.51KiB/CPU-s | -0.21 | 99.05% | 5.76MiB/CPU-s | 216.97KiB/CPU-s | 2.8KiB/CPU-s | 0.0 | 0.036782 | 5.75MiB/CPU-s | 303.78KiB/CPU-s | 3.92KiB/CPU-s | 0.0 | 0.051609 | False | False |
| syslog_splunk_hec_logs | -22.46KiB/CPU-s | -0.25 | 100.00% | 8.93MiB/CPU-s | 166.61KiB/CPU-s | 2.15KiB/CPU-s | 0.0 | 0.018225 | 8.91MiB/CPU-s | 245.3KiB/CPU-s | 3.17KiB/CPU-s | 0.0 | 0.026898 | False | False |
| datadog_agent_remap_datadog_logs_acks | -299.02KiB/CPU-s | -0.87 | 100.00% | 33.48MiB/CPU-s | 1.18MiB/CPU-s | 15.56KiB/CPU-s | 0.0 | 0.035169 | 33.18MiB/CPU-s | 1.16MiB/CPU-s | 15.36KiB/CPU-s | 0.0 | 0.035025 | False | False |
| datadog_agent_remap_datadog_logs | -405.77KiB/CPU-s | -1.16 | 100.00% | 34.15MiB/CPU-s | 1.28MiB/CPU-s | 16.9KiB/CPU-s | 0.0 | 0.037466 | 33.75MiB/CPU-s | 1.26MiB/CPU-s | 16.7KiB/CPU-s | 0.0 | 0.037462 | False | False |
| syslog_loki | -143.9KiB/CPU-s | -1.61 | 100.00% | 8.73MiB/CPU-s | 218.98KiB/CPU-s | 2.83KiB/CPU-s | 0.0 | 0.024485 | 8.59MiB/CPU-s | 235.17KiB/CPU-s | 3.04KiB/CPU-s | 0.0 | 0.026727 | False | False |
| syslog_regex_logs2metric_ddmetrics | -248.62KiB/CPU-s | -6.71 | 100.00% | 3.62MiB/CPU-s | 343.58KiB/CPU-s | 4.44KiB/CPU-s | 0.0 | 0.092735 | 3.38MiB/CPU-s | 376.71KiB/CPU-s | 4.86KiB/CPU-s | 0.0 | 0.108992 | True | False |

@spencergilbert (Contributor) commented

I'd personally be hesitant to merge a known "bug" into the main branch. At the same time, I don't want to ask too much of a community contributor. Would you mind evaluating the scope of the changes needed, and whether you're up to making them?

We're happy to take ownership of the PR if you'd prefer to pass it off.

@jszwedko (Member) commented Feb 2, 2023

Agreed we should resolve the memory leak before this is merged. For more context: we like to keep master in a releasable state to avoid releases being delayed to fix issues and to give users running nightlies a better experience.

@zamazan4ik (Contributor, Author) commented

> For more context: we like to keep master in a releasable state to avoid releases being delayed to fix issues and to give users running nightlies a better experience.

Just to note: I don't think this is a release blocker, since the memory leak would be very slow. But I agree that it's a problem.

@jszwedko jszwedko added the meta: awaiting author label Mar 24, 2023
Labels
domain: performance (Anything related to Vector's performance), domain: sources (Anything related to the Vector's sources), meta: awaiting author (Pull requests that are awaiting their author.), source: file (Anything `file` source related), type: tech debt (A code change that does not add user value.)

4 participants