Segmentation Fault on all nodes in OpenShift 4.9.33 #838
Comments
Hello @balpert89, thank you for reporting this. I am looking into it.
Short update: I have also tried with the slim collector, but no luck either:
You are probably running in offline mode, and your probes might need an update. Using the slim image, could you try updating your kernel support package to the latest? Then you will have to restart Sensor to flush the caches.
The containers are able to reach endpoints on the internet; there is no proxy in front of the environment. I am also uploading the package; however, as I understand the documentation, the collector should download the support package from
Is there some debug message in central / sensor / collector indicating the attempt to download the probes?
I have uploaded the support package
And recreated the collector pod, still experiencing issues:
Just to be sure, did you restart the Sensor pod for this cluster?
Yes, I have done that; to be sure, I have redone the command
You are right that the probe is pulled from the internet in this case. However, Collector still goes through Sensor to do so; we can only see this request in Collector logs:
The segfault with this kernel-module probe is being investigated, and I hope that we can fix it soon. In the meantime, I would suggest switching to the eBPF collection method, which is now the default.
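For reference, a minimal sketch of what switching the collection method can look like on the collector DaemonSet. The `COLLECTION_METHOD` env var is the collector's switch for this; the container name and surrounding layout shown here are assumptions for a default StackRox install, not taken from this thread:

```yaml
# Excerpt of the collector DaemonSet pod spec; only the relevant env entry shown.
spec:
  template:
    spec:
      containers:
        - name: collector
          env:
            - name: COLLECTION_METHOD   # "ebpf" instead of "kernel-module"
              value: ebpf
```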
I have switched it; the segfault occurs nonetheless :-(
This is the current environment variable list of the collector DaemonSet; maybe this helps in narrowing down the issue:
@balpert89 is there a chance you can grab the core dump for us? I guess you would need to enable it via env variable [1].
I have set the environment variable; however, how do I get ahold of the dump? And how do I get it out of the container, since there is neither
It involves some manual steps. The dump should be created at the directory pointed to by
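For context, the kernel setting being discussed here is almost certainly `/proc/sys/kernel/core_pattern` (the file whose content is referenced a few comments below). A small sketch for inspecting it on a node, assuming a Linux host:

```shell
# Show where the kernel sends core dumps: either a file path template or,
# when the value starts with '|', a helper program such as systemd-coredump.
pattern=$(cat /proc/sys/kernel/core_pattern)
case "$pattern" in
  \|*) echo "core dumps are piped to a helper: ${pattern#|}" ;;
  *)   echo "core dumps are written using the template: $pattern" ;;
esac
```

On RHCOS/OpenShift nodes the value typically pipes to `systemd-coredump`, which is why `coredumpctl` on the node is the place to look.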
I have mounted the hostPath /proc in the
Yes, sorry for not being clear: the content of the file you've posted is the path where the core dump is stored. In this case it's going to be handled by the systemd-coredump tool and should be stored at
Ok, sorry for that. I have the core dump on my local machine; how can I provide it to you? I feel uncomfortable dropping this onto a public GitHub issue.
@ovalenti can you please provide me your mail address so I can send you the uncompressed coredump? |
Yes please, here it is: ovalenti at redhat dot com. I really appreciate your cooperation on this issue. For now, we suspect that the root cause has something to do with the memory allocation code. And since the crash does not reproduce on our own clusters, this might be triggered by the specifics of your workload. Let me analyse the core dump to see if we can sort this out.
Thank you for the core dump, @balpert89. Unfortunately, I could not get to the root cause using it. To bisect the issue further, could you please try to temporarily disable the scraper? This can be achieved by adding the following stanza to the COLLECTOR_CONFIG JSON document:
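The stanza itself did not survive extraction; based on the `turnOffScrape` field mentioned later in this thread, it was presumably along these lines (a reconstruction, not the maintainer's verbatim snippet):

```json
{
  "turnOffScrape": true
}
```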
After disabling the scraping the collector does not experience issues. |
Excellent, this narrows down the research area quite a lot. I am going to bring in an expert on the scraper.
We found a likely cause for this issue! A fix has been written and is under test. Hopefully we can provide a service release soon.
Nice, if you have some test image I can provide it on the environment experiencing the issues. |
Thank you @balpert89 for testing the latest 3.11.x image. It looks like the fix is working. The patch is on master now and has been backported to 3.11 (quay.io/stackrox-io/collector:3.11.0-1-g0db9c01c9b-slim)
Thank you as well :) Only one question remains: we currently use the
We are preparing for it, but I don't have a date yet. |
How can we turn off the scraper? I tried to add turnOffScrape: true into the following field in the SecureCluster YAML, but it is deleted automatically.
Hi @bengukaraalioglu, I would suggest using the latest images, which include the fix, instead.
3.71: quay.io/stackrox-io/collector:3.71.2
The official release of the full StackRox 3.72.1 & 3.71.2 should happen early next week (@balpert89)
Both releases including the fix are now live!
Hi,
Disclaimer: I have opened the same issue at stackrox/stackrox#3195 because I am not sure in which repository this should be tracked, as here we have an area/collector label. Please close the one that is in the wrong location.
we are experiencing crashes in collector containers across all nodes in one of our OpenShift clusters.
Debug Log:
I am not sure how to debug this, as all DaemonSet containers experience this problem.
We are using StackRox 3.71.0. I have tried with collector images 3.9.0 and 3.11.0. Please reach out for any missing information.