Vector randomly stops shipping certain k8s logs #12014
Comments
I will add that restarting Vector seems to solve the issue, but that's not a sustainable solution.
It's also hard to get debug logs because this is not something I can reproduce quickly; it's usually something we notice after 3-10 days, when we try to look at logs and there are none.
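For reference, one way to capture more detail when the problem recurs is to raise Vector's own log level via the VECTOR_LOG environment variable on the DaemonSet. A minimal sketch (the container name and manifest layout are assumptions, not taken from this thread, and debug logging is verbose):

```yaml
# Fragment of the Vector DaemonSet pod template (names are illustrative).
# VECTOR_LOG raises the verbosity of Vector's internal logs so that
# file-discovery activity becomes visible.
spec:
  template:
    spec:
      containers:
        - name: vector
          env:
            - name: VECTOR_LOG
              value: "debug"
```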
Hi @danthegoodman1! It seems like some of the incoming events are not JSON. I might suggest augmenting your
As an aside, you are currently parsing
instead
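For readers skimming the thread, here is a minimal sketch of the kind of remap augmentation being suggested, assuming the source is named `kubernetes_logs` and the JSON payload arrives in `.message`. The transform name, the `.app` key, and the `.parse_failed` flag are illustrative, not taken from the original configuration:

```yaml
transforms:
  parse_app_logs:
    type: remap
    inputs: ["kubernetes_logs"]
    # Keep events whose VRL program errors instead of dropping them.
    drop_on_error: false
    source: |
      # Capture the parse error rather than aborting on non-JSON lines
      # (panics, startup output) so those events still flow downstream.
      parsed, err = parse_json(.message)
      if err == null {
        .app = parsed          # parsed payload kept under a dedicated key
      } else {
        .parse_failed = true   # mark non-JSON lines for later inspection
      }
```

Merging the parsed fields into the event root is another common choice; the point is that parse failures are captured instead of aborting the program.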
Thanks for the quick response @jszwedko. Very few of our logs are not JSON (some panics, some startup ones), but the vast majority are JSON logs and they are the ones not coming through. Will add the suggestions in the meantime.
@danthegoodman1 ah, I see. The log messages you shared above just indicate JSON parsing failures. Are there more error logs?
No, that is what is so strange @jszwedko; it makes it really hard to debug 🙃
Also getting this, trying to work through why this is
Will update with the solution when I find it
Updated to
Here are some updated logs; seems like some of the JSON logs are from the pods we don't control in EKS :P
This has just started happening again, no error or warning logs. Anything else I should be looking for or trying? We can't be dropping our production logs.
@jszwedko one thing I am noticing is that we don't get the logs about Vector finding new files. This is easier to reproduce with things that come and go relatively quickly, like cron jobs (we leave the completed pod around so logs can be collected).
One thing to add: We checked the mounted
@danthegoodman1 If new logs were not "found" in time, maybe try to reduce glob_minimum_cooldown_ms? It's
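For context, a minimal sketch of that option on the source, assuming a kubernetes_logs source and using the 3s value mentioned later in the thread (the value is illustrative; the default discussed here is 60 s):

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # How long Vector waits between globbing for new or rotated log files.
    # Lowering it discovers short-lived pods' files sooner, at the cost of
    # extra CPU spent globbing.
    glob_minimum_cooldown_ms: 3000
```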
@wtchangdm will try, but it's the case for long-running pods too, where it just stops recording logs for those pods. Those files definitely stay around longer than 60s.
@danthegoodman1 BTW, there is an issue I am also tracking: #8616. Not sure if it's related, though.
Interesting, but we don't get any warnings or errors like that issue does.
@danthegoodman1 could you try with the latest release, 0.21.1? The
Though, in this case, it does seem like the issue is with file discovery rather than metadata enrichment.
@jszwedko I will say that so far we haven't noticed any issues since we dropped the glob cooldown to
Will update now and see how it works
@jszwedko This did not go well, it was printing this extremely quickly:
The
yep, just made that change :)
After sending ~50 logs/s I can see the following errors show up, then logs stop shipping. It prints a ton of the error logs. It is able to ship them for about 30 seconds before it stops sending logs.
It does not affect watching other pods, and CPU never exceeds ~20 mCPU and memory never exceeds 33 MB for the Vector pods.
Seems related to #10122 (comment)
Possibly - the other answer would be that
It seems reducing the glob time to 3s is working at this rate. I also reduced the number of logs printed to reduce the frequency of log rotation.
Hi, I faced the same issue on
Does anyone know the answer to this problem? Or is there any option in
Hi, I've encountered a similar issue
Similar sentiment after reading the comments from @imcheck and @Genry1 - in my case, it happens with a bunch of StatefulSets every time there's a rolling upgrade, and of course the pod name is reused. It might be related to the lock on the file descriptor opened by the vector-agent container at the host level. I cannot think of any workaround other than deleting the agent on the node whenever I spot this occurrence.
We also see this issue if a pod is recreated on the same node with the same name. It's pretty much a show stopper for us.
Thanks for the additional context @kitos9112 @Genry1 and @lpugoy! I split this off into a separate issue since it seems to be different from the original report here. See: #13467
@jszwedko We also faced this issue after we started rolling out Vector to our production. We identified that the safest version for use with
With the newer versions (0.20.x) the only working solution right now is to restart the Vector pods after one of these issues starts happening - this helps in both cases. Both issues make Vector completely unusable in production environments with
@jszwedko are there any updates regarding this issue? No longer receiving logs from k8s is a pretty critical issue for us.
@zamazan4ik fluentbit is a good alternative
I know a lot of log forwarders (fluentbit, rsyslog, whatever else). But I am especially interested in Vector :)
Thanks for all of the thoughts, everyone. I can appreciate the severity of these issues and believe this area (the
There seems to be a mix of issues reported in the comments above. I think they can be categorized into:
If you don't think your issue is represented in the above set, please leave an additional comment!
@jszwedko we have our glob cooldown at 2 seconds and have still observed it. Ultimately we have to move to something that doesn't drop logs, because we depend on logs to know when errors occur. I can't imagine that k8s is not a massive user base for Vector. We aren't logging very quickly either, maybe 10-20/s. What would we need to do to get more urgency behind improving the Kubernetes experience? I truly want to use Vector but can't.
I'm in the same boat as @danthegoodman1. We're currently using Vector in most of our K8s clusters and I love what it brings to the table, but the lack of attention and priority to this specific issue is concerning.
We're likely going to switch over to something else like fluent-bit for log collection until this issue is resolved.
I was able to find some settings that are now giving me a much more reliable deployment on Kubernetes after doing the following:
The combination of these two changes has led to zero dropped/missing logs over the past couple of weeks. Previously I was using a lower
I was still receiving annotation failures though. After deep-diving into the commit history I saw that a seemingly unannounced workaround was in place for #13467 in the latest version, and since upgrading I haven't seen any annotation issues.
Wow, that's good to know. I hope to see comments from the Vector dev team about possible fixes for these issues.
Great! The glob time drop really did it for us.
Just realised that
In my case (145 nodes in the cluster), where I haven't yet faced the described issue (I copied this config snippet from another cluster where such an issue existed), dropping
I am still checking that dropping this parameter hasn't led to missing logs (because such a big resource consumption change seems wrong to me).
@dm3ch I noticed this on my setup too, but in my scenario I think the "increased" CPU usage is just a result of the service actually processing logs :) Dropping this setting for me results in very low CPU usage as well, but I was only receiving about 10% of the logs I should have been.
@CharlieC3 Which version of Vector are you using?
@dm3ch I haven't needed to upgrade since my last comment, so I'm still running Docker image
Still seeing these errors with the above config changes
Any workaround would really help... we are using 0.35.0
2 years ago... Doesn't feel remotely appreciated. I'm wondering if the Datadog acquisition is to blame.
@sumeet-zuora seems like you have some other error that's causing logs to get caught up
This is still an issue
A note for the community
Problem
After a while, our Vector DaemonSet will randomly stop shipping logs for a select service (some other pods will keep shipping logs)
Configuration
Example Data
Additional Context
Only logs from vector:
References
Similar to this issue I've had in the past: #8634