Memory leak while using kubernetes_logs source #6673
Meanwhile I updated to vector version 0.12-debian and the memory leak stayed the same.
The discovery I made is that when I disabled the HTTP sink and replaced it with the blackhole sink, the memory leaking stopped.
vector-kube-collector configuration file (reads Kubernetes logs from files)
vector-aggregator configuration file
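Neither configuration file survived the copy into this thread. For context, a minimal sketch of the kind of two-tier setup being described might look like the following; the topology, component names, and addresses are assumptions for illustration, not the actual configs:

```toml
# --- Hypothetical vector-kube-collector (agent) config ---
# Read pod logs from the node and forward them to the aggregator.
[sources.kubernetes]
type = "kubernetes_logs"

[sinks.to_aggregator]
type = "vector"                     # vector-to-vector transport; assumed topology
inputs = ["kubernetes"]
address = "vector-aggregator:9000"  # placeholder address

# --- Hypothetical vector-aggregator config ---
# Receive events from the agents and ship them to Graylog over HTTP.
[sources.agents]
type = "vector"
address = "0.0.0.0:9000"

[sinks.graylog]
type = "http"                       # swapping this for `type = "blackhole"` is what stopped the growth
inputs = ["agents"]
uri = "http://graylog:12201/gelf"   # placeholder GELF HTTP input
encoding.codec = "json"
```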
I haven't had a chance to triage but this does strike me as relevant to the conversation in #6781.
As this memory leak is the last blocker for me to deploy vector to my production environment, I continued experimenting with different versions and images. All graphs represent "Average pod memory usage".
0.12.1-distroless-libc + HTTP sink
@blt if there is any combination that you would like me to test, let me know. I will continue experimenting with the HTTP sink buffer settings and see whether they have any effect on the memory leak.
Some further testing revealed that enabling prometheus metrics with the following configuration made things a lot worse... could be related to #6781.
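The prometheus snippet referenced above also didn't make it into the thread; enabling vector's internal metrics with a Prometheus exporter generally looks roughly like this (names and the listen address are illustrative):

```toml
# Hypothetical internal-metrics setup of the kind referred to above.
[sources.internal]
type = "internal_metrics"

[sinks.prometheus]
type = "prometheus_exporter"   # this sink was called `prometheus` in older releases
inputs = ["internal"]
address = "0.0.0.0:9598"       # endpoint for Prometheus to scrape
```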
+1 on the issue. Is there any expected workaround or anything we could look forward to?
@KHiis we are actively investigating and will report back as we learn. We're hoping to get to the root cause next week.
@karlmartink @KHiis apologies for the delay in getting back to you, and thank you for the detailed issue. Sinking into the blackhole not demonstrating this issue would suggest to me that vector is buffering as it backs up into the HTTP sink, but the growth when you enable prometheus is surprising to me. @karlmartink out of curiosity, if you enable prometheus and sink into the blackhole, does the issue still persist? Also, if you have the prometheus sink switched on, I would be very curious to know about the following bits of telemetry for each step in your config:
If indeed one of the steps is backing up and causing buffering, I'd expect this triplet to fall off steeply for one of them. We are also aware of #6305, where vector will hold file descriptors open if the kubernetes log source is unable to drain quickly enough. Do you have a measure of your open FDs? We have also, in 0.12, introduced debug logging that shows which steps in the config have high load. If you enable debug logging --
@blt here is some more telemetry. I grouped the graphs by "component_type" to visualise things accurately, as I have 3 pods running for vector-aggregator.
0.12.1-distroless-libc + HTTP sink + Prometheus exporter
CPU usage graph for the same period
Some utilization info from debug output:
I enabled the prometheus exporter together with the blackhole sink. Will post results soon.
0.12.1-distroless-libc
As can be seen, sinking into the blackhole again does not demonstrate the issue. Unfortunately I don't have an open-FD metric to provide. But as the graphs show, kubernetes log collecting works without issues.
@karlmartink this is great, thank you. @eeyun it strikes me that this might actually be related to the work you're doing these days.
I really hope this fixes the issue. |
@blt Any progress on this issue or any other metrics I can provide?
@karlmartink this has been a confounding one. I don't have any news yet but it's still being worked on.
Alright. I have been able to reproduce the issue in a very minimal setup. Here's my vector config:
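The config itself wasn't preserved here. Based on the description that follows (a deliberately small batch size feeding an HTTP sink, with any web server on the receiving end), a minimal reproduction could look something like this; the source type, names, and values are assumptions for illustration:

```toml
# Hypothetical minimal reproduction: a steady stream of events into an HTTP sink.
[sources.in]
type = "stdin"                   # any source that produces a steady event stream will do

[sinks.out]
type = "http"
inputs = ["in"]
uri = "http://127.0.0.1:8080/"   # any web server will do
encoding.codec = "json"
batch.max_bytes = 512            # tiny batches to force frequent outbound requests
```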
Which I sink into an HTTP server, though any web server will do. My run script is:
I can observe that vector does gradually increase its memory by a few bytes here and there, which does seem roughly correlated to the number of requests out that vector makes, which is why I have the batch size set so low in my config. Unfortunately there's no direct correlation, and the accumulation process does take a while, as your graphs show, so we do apologize for the slow progress here. We've just finished an upgrade of vector's tokio which, as @lukesteensen pointed out to me, should resolve some known sources of fragmentation. I'll be running vector with the above config and setup under valgrind; the profile should be clearer now, but it will take some time to get results.
The upgrade to tokio 1.0 has really paid off. @jszwedko, relevant to our recent work: if you pull this massif dump open you'll see it recording vector at 1.5 GB of heap, and the majority of that is in the tracing subsystem.
Massif out:
Screenshot:
@karlmartink we have an undocumented environment variable
@blt I appreciate the feedback. I had some time to do some testing, but unfortunately I did not see any improvements on the memory usage side with 0.12.2-distroless-libc.
@karlmartink Alright, good news. If you pop open this massif dump there's a very clear indication that vector is putting a lot of memory into metrics-rs histograms. This is almost surely an unfortunate consequence of the bucket sizes chosen by metrics-rs and the granularity of the data we're pushing into them. I hope to have a fix here today or tomorrow.
Ah, yep, I believe I see the problem. We bridge our internal and metrics-rs metrics here: The important bit is that we're calling
Seems you have made solid progress on the issue, which I am really glad about.
This commit introduces our own metrics-rs Handle implementation, one that takes constant space per metric type. This significantly reduces our happy-path memory use under load and should resolve the memory leaks noted in the closing issues.

The leak culprit is the metrics-rs Histogram combined with tracing integration. Their implementation relies on an atomic linked list of fixed-size blocks to store samples. The storage required by this method grows until a quiescent period is found, a period that happens regularly when metrics-rs is not integrated with tracing and, for reasons unknown, does not happen at present. However, in our testing we've seen a single histogram under heavy load consume approximately 200 MB before being garbage collected, a burden vector doesn't need to bear, as our own needs for histograms are served by a simple counting-bucket arrangement. Our own histogram uses a fixed, boxed slice of atomic ints as buckets. This is the shape of the data that vector was exposing already, and we remove a little internal translation. I believe this PR will significantly reduce the memory use of a loaded vector, leak notwithstanding.

REF #6673
Closes #6461 #6456
Signed-off-by: Brian L. Troutwine brian@troutwine.us
@karlmartink well, good news. I am reasonably certain that #7014 will address a good portion, if not the whole, of your problem. Our nightly build job starts at 4 AM UTC, so by 6 AM UTC there ought to be a build available with #7014 included. When you get a chance, can you try out the nightly and let us know how it goes?
@blt that's awesome!
It seems that the issue is fixed! @blt thank you for the support.
Nice! This seems to have fixed a large portion of the unbounded memory growth. However, looking at those new graphs, it seems like there is still steady memory growth, just much slower.
@karlmartink excellent news! If you're satisfied I'll close this ticket out, then. @jszwedko yes, agreed. I think some of that is explicable by allocator fragmentation, but I'll do a long massif run to get more concrete details, at least for the minimal configs we used to diagnose the more serious leak here.
This can be closed from my side.
Vector Version
running on Kubernetes
Vector Configuration File
Debug Output
https://gist.github.com/karlmartink/095979bec3dea7d91430c91a842d3927
Expected Behavior
Stable memory usage, or a slight increase until some maximum level is reached.
Actual Behavior
Container memory and CPU usage grow slowly over time until the maximum limits are reached and the service is killed, while the number of processed events remains the same.
Example Data
POD CPU and Memory usage

Processed events

Additional Context
I am using vector mostly to collect Kubernetes logs and manipulate them with some transforms. After that, they are forwarded to Graylog via HTTP to a GELF input.