Segmentation Fault on all nodes in OpenShift 4.9.33 #838

Closed
balpert89 opened this issue Sep 26, 2022 · 29 comments

@balpert89

balpert89 commented Sep 26, 2022

Hi,

Disclaimer: I have opened the same issue at stackrox/stackrox#3195 because I am not sure in which repository this should be tracked, as here we have an area/collector label. Please close whichever one is in the wrong location.

We are experiencing crashes in collector containers across all nodes in one of our OpenShift clusters.

Debug Log:

Collector Version: 3.9.0
OS: Red Hat Enterprise Linux CoreOS 49.84.202205050701-0 (Ootpa)
Kernel Version: 4.18.0-305.45.1.el8_4.x86_64
Starting StackRox Collector...
[I 20220926 112218 HostInfo.cpp:126] Hostname: '<redacted>'
[I 20220926 112218 CollectorConfig.cpp:119] User configured logLevel=debug
[I 20220926 112218 CollectorConfig.cpp:149] User configured collection-method=kernel_module
[I 20220926 112218 CollectorConfig.cpp:206] Afterglow is enabled
[D 20220926 112218 HostInfo.cpp:200] EFI directory exist, UEFI boot mode
[D 20220926 112218 HostInfo.h:100] identified kernel release: '4.18.0-305.45.1.el8_4.x86_64'
[D 20220926 112218 HostInfo.h:101] identified kernel version: '#1 SMP Wed Apr 6 13:48:37 EDT 2022'
[D 20220926 112218 HostInfo.cpp:297] SecureBoot status is 2
[D 20220926 112218 collector.cpp:254] Core dump not enabled
[I 20220926 112218 collector.cpp:302] Module version: 2.0.1
[I 20220926 112218 collector.cpp:329] Attempting to download kernel module - Candidate kernel versions:
[I 20220926 112218 collector.cpp:331] 4.18.0-305.45.1.el8_4.x86_64
[D 20220926 112218 GetKernelObject.cpp:148] Checking for existence of /kernel-modules/collector-4.18.0-305.45.1.el8_4.x86_64.ko.gz and /kernel-modules/collector-4.18.0-305.45.1.el8_4.x86_64.ko
[D 20220926 112218 GetKernelObject.cpp:151] Found existing compressed kernel object.
[I 20220926 112218 collector.cpp:262]
[I 20220926 112218 collector.cpp:263] This product uses kernel module and ebpf subcomponents licensed under the GNU
[I 20220926 112218 collector.cpp:264] GENERAL PURPOSE LICENSE Version 2 outlined in the /kernel-modules/LICENSE file.
[I 20220926 112218 collector.cpp:265] Source code for the kernel module and ebpf subcomponents is available upon
[I 20220926 112218 collector.cpp:266] request by contacting support@stackrox.com.
[I 20220926 112218 collector.cpp:267]
[I 20220926 112218 collector.cpp:162] Inserting kernel module /module/collector.ko with indefinite removal and retry if required.
[D 20220926 112218 collector.cpp:109] Kernel module arguments: s_syscallIds=26,27,56,57,246,247,248,249,94,95,14,15,156,157,216,217,222,223,4,5,22,23,12,13,154,155,172,173,214,215,230,231,282,283,288,289,292,293,96,97,182,183,218,219,224,225,16,186,234,194,195,192,193,200,201,198,199,36,37,18,19,184,185,220,221,226,227,-1 verbose=0 exclude_selfns=1 exclude_initns=1
[I 20220926 112218 collector.cpp:183] Done inserting kernel module /module/collector.ko.
[I 20220926 112218 collector.cpp:215] gRPC server=sensor.mcs-security.svc:443
[I 20220926 112218 CollectorService.cpp:50] Config: collection_method:kernel_module, useChiselCache:1, snapLen:0, scrape_interval:30, turn_off_scrape:0, hostname:<redacted>, logLevel:DEBUG
[I 20220926 112218 CollectorService.cpp:79] Network scrape interval set to 30 seconds
[I 20220926 112218 CollectorService.cpp:82] Waiting for GRPC server to become ready ...
[I 20220926 112218 CollectorService.cpp:87] GRPC server connectivity is successful
[D 20220926 112218 ConnTracker.cpp:314] ignored l4 protocol and port pairs
[D 20220926 112218 ConnTracker.cpp:316] udp/9
[I 20220926 112218 NetworkStatusNotifier.cpp:187] Started network status notifier.
[I 20220926 112218 NetworkStatusNotifier.cpp:203] Established network connection info stream.
[D 20220926 112218 SysdigService.cpp:262] Updating chisel and flushing chisel cache
[D 20220926 112218 SysdigService.cpp:263] New chisel:
args = {}
function on_event()
    return true
end
function on_init()
    filter = "not container.id = 'host'\n"
    chisel.set_filter(filter)
    return true
end

[I 20220926 112218 SignalServiceClient.cpp:43] Trying to establish GRPC stream for signals ...
[I 20220926 112218 SignalServiceClient.cpp:61] Successfully established GRPC stream for signals.
[D 20220926 112219 ConnScraper.cpp:406] Could not open process directory 1626873: No such file or directory
[D 20220926 112219 ConnScraper.cpp:406] Could not open process directory 1626877: No such file or directory
[W 20220926 112219 ProtoAllocator.h:41] Allocating a memory block on the heap for the arena, this is inefficient and usually avoidable
collector[0x44746d]
/lib64/libc.so.6(+0x4eb20)[0x7f8425ceeb20]
Caught signal 11 (SIGSEGV): Segmentation fault
/bootstrap.sh: line 94:    11 Segmentation fault      (core dumped) eval exec "$@"
Collector kernel module has already been loaded.
Removing so that collector can insert it at startup.

I am not sure how to debug this as all daemonSet containers experience this problem.

We are using StackRox 3.71.0. I have tried with collector images 3.9.0 and 3.11.0. Please reach out for any missing information.

@ovalenti
Contributor

Hello @balpert89, thank you for reporting this. I am looking into it.

@ovalenti ovalenti self-assigned this Sep 26, 2022
@balpert89
Author

Short update, I have also tried with the slim collector, but no luck either:

Collector Version: 3.11.0
OS: Red Hat Enterprise Linux CoreOS 49.84.202205050701-0 (Ootpa)
Kernel Version: 4.18.0-305.45.1.el8_4.x86_64
Starting StackRox Collector...
[I 20220926 131746 HostInfo.cpp:126] Hostname: '<redacted>'
[I 20220926 131746 CollectorConfig.cpp:119] User configured logLevel=debug
[I 20220926 131746 CollectorConfig.cpp:149] User configured collection-method=kernel_module
[I 20220926 131746 CollectorConfig.cpp:206] Afterglow is enabled
[D 20220926 131746 HostInfo.cpp:200] EFI directory exist, UEFI boot mode
[D 20220926 131746 HostInfo.h:100] identified kernel release: '4.18.0-305.45.1.el8_4.x86_64'
[D 20220926 131746 HostInfo.h:101] identified kernel version: '#1 SMP Wed Apr 6 13:48:37 EDT 2022'
[D 20220926 131746 HostInfo.cpp:297] SecureBoot status is 2
[D 20220926 131746 collector.cpp:254] Core dump not enabled
[I 20220926 131746 collector.cpp:302] Module version: 2.1.0
[I 20220926 131746 collector.cpp:329] Attempting to download kernel module - Candidate kernel versions: 
[I 20220926 131746 collector.cpp:331] 4.18.0-305.45.1.el8_4.x86_64
[D 20220926 131746 GetKernelObject.cpp:148] Checking for existence of /kernel-modules/collector-4.18.0-305.45.1.el8_4.x86_64.ko.gz and /kernel-modules/collector-4.18.0-305.45.1.el8_4.x86_64.ko
[I 20220926 131746 GetKernelObject.cpp:180] Local storage does not contain collector-4.18.0-305.45.1.el8_4.x86_64.ko
[D 20220926 131746 GetKernelObject.cpp:50] Attempting to download kernel object from https://sensor.mcs-security.svc:443/kernel-objects/2.1.0/collector-4.18.0-305.45.1.el8_4.x86_64.ko.gz?cid=collector
[D 20220926 131746 FileDownloader.cpp:53] Set HTTP status code to '200'
[D 20220926 131746 GetKernelObject.cpp:55] Downloaded kernel object from https://sensor.mcs-security.svc:443/kernel-objects/2.1.0/collector-4.18.0-305.45.1.el8_4.x86_64.ko.gz?cid=collector
[I 20220926 131746 GetKernelObject.cpp:194] Successfully downloaded and decompressed /module/collector.ko
[I 20220926 131746 collector.cpp:262] 
[I 20220926 131746 collector.cpp:263] This product uses kernel module and ebpf subcomponents licensed under the GNU
[I 20220926 131746 collector.cpp:264] GENERAL PURPOSE LICENSE Version 2 outlined in the /kernel-modules/LICENSE file.
[I 20220926 131746 collector.cpp:265] Source code for the kernel module and ebpf subcomponents is available upon
[I 20220926 131746 collector.cpp:266] request by contacting support@stackrox.com.
[I 20220926 131746 collector.cpp:267] 
[I 20220926 131746 collector.cpp:162] Inserting kernel module /module/collector.ko with indefinite removal and retry if required.
[D 20220926 131746 collector.cpp:109] Kernel module arguments: s_syscallIds=26,27,56,57,246,247,248,249,94,95,14,15,156,157,216,217,222,223,4,5,22,23,12,13,154,155,172,173,214,215,230,231,282,283,288,289,292,293,96,97,182,183,218,219,224,225,16,186,234,194,195,192,193,200,201,198,199,36,37,18,19,184,185,220,221,226,227,-1 verbose=0 exclude_selfns=1 exclude_initns=1
[I 20220926 131746 collector.cpp:183] Done inserting kernel module /module/collector.ko.
[I 20220926 131746 collector.cpp:215] gRPC server=sensor.mcs-security.svc:443
[I 20220926 131746 CollectorService.cpp:50] Config: collection_method:kernel_module, useChiselCache:1, snapLen:0, scrape_interval:30, turn_off_scrape:0, hostname:<redacted>, logLevel:DEBUG
[I 20220926 131746 CollectorService.cpp:79] Network scrape interval set to 30 seconds
[I 20220926 131746 CollectorService.cpp:82] Waiting for GRPC server to become ready ...
[I 20220926 131746 CollectorService.cpp:87] GRPC server connectivity is successful
[D 20220926 131746 ConnTracker.cpp:314] ignored l4 protocol and port pairs
[D 20220926 131746 ConnTracker.cpp:316] udp/9
[I 20220926 131746 NetworkStatusNotifier.cpp:168] Started network status notifier.
[I 20220926 131746 NetworkStatusNotifier.cpp:182] Established network connection info stream.
[D 20220926 131746 SysdigService.cpp:262] Updating chisel and flushing chisel cache
[D 20220926 131746 SysdigService.cpp:263] New chisel: 
args = {}
function on_event()
    return true
end
function on_init()
    filter = "not container.id = 'host'\n"
    chisel.set_filter(filter)
    return true
end

[I 20220926 131746 SignalServiceClient.cpp:43] Trying to establish GRPC stream for signals ...
[I 20220926 131746 SignalServiceClient.cpp:61] Successfully established GRPC stream for signals.
[W 20220926 131746 ProtoAllocator.h:41] Allocating a memory block on the heap for the arena, this is inefficient and usually avoidable
collector[0x4486bd]
/lib64/libc.so.6(+0x4eb20)[0x7fc0d6a24b20]
Caught signal 11 (SIGSEGV): Segmentation fault
/bootstrap.sh: line 94:    10 Segmentation fault      (core dumped) eval exec "$@"
Collector kernel module has already been loaded.
Removing so that collector can insert it at startup.

@ovalenti
Contributor

ovalenti commented Sep 26, 2022

You are probably running in offline mode, and your probes might need an update.

Using the slim image, could you update your kernel support package to the latest version, using the --overwrite option:
https://docs.openshift.com/acs/3.71/configuration/enable-offline-mode.html#update-kernel-support-packages_enable-offline-mode

Then you will have to restart Sensor to flush the caches.
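
As a rough sketch of those two steps: the helper name `refresh_probes` is hypothetical, `stackrox` as the default namespace is an assumption (use the namespace your Sensor runs in), and `roxctl` is expected to be already authenticated against Central.

```shell
# Hypothetical helper: re-upload a kernel support package and restart Sensor.
# Assumes roxctl and oc are on PATH and ROX_CENTRAL_ADDRESS is set.
refresh_probes() {
  pkg="$1"                 # path to the downloaded support package zip
  ns="${2:-stackrox}"      # namespace of the Sensor deployment (assumption)

  # Replace any probes that were uploaded earlier.
  roxctl collector support-packages upload "$pkg" --overwrite \
    -e "$ROX_CENTRAL_ADDRESS" || return 1

  # Restart Sensor so that its cached probes are flushed.
  oc -n "$ns" rollout restart deployment sensor
}
```

It could be called as `refresh_probes ./support-pkg.zip mcs-security`, where the file name is a placeholder.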

@balpert89
Author

The containers are able to reach endpoints on the internet, there is no proxy in front of the environment.

image

I am also uploading the package, however as I understand the documentation, the collector should download the support package from

> A Red Hat-operated server available on the internet. Collector uses Central’s network connection to check and download the probes.

Is there some debug message in central / sensor / collector indicating the attempt to download the probes?

@balpert89
Author

I have uploaded the support package

[balpert@mgmt ~]$ ./roxctl-3.71 collector support-packages upload support-pkg-2.0.1-20220924015354.zip -e "$ROX_CENTRAL_ADDRESS"
INFO:	Uploading 9178 files from support package ...

1.1 GiB / 1.1 GiB [======================================================================================================================] 54.3 MiB/s Uploading... 00:00
INFO:	Successfully uploaded 9178 files from support package.

And recreated the collector pod, still experiencing issues:

Collector Version: 3.9.0
OS: Red Hat Enterprise Linux CoreOS 49.84.202205050701-0 (Ootpa)
Kernel Version: 4.18.0-305.45.1.el8_4.x86_64
Starting StackRox Collector...
[I 20220927 074703 HostInfo.cpp:126] Hostname: '<redacted>'
[I 20220927 074703 CollectorConfig.cpp:119] User configured logLevel=debug
[I 20220927 074703 CollectorConfig.cpp:149] User configured collection-method=kernel_module
[I 20220927 074703 CollectorConfig.cpp:206] Afterglow is enabled
[D 20220927 074703 HostInfo.cpp:200] EFI directory exist, UEFI boot mode
[D 20220927 074703 HostInfo.h:100] identified kernel release: '4.18.0-305.45.1.el8_4.x86_64'
[D 20220927 074703 HostInfo.h:101] identified kernel version: '#1 SMP Wed Apr 6 13:48:37 EDT 2022'
[D 20220927 074703 HostInfo.cpp:297] SecureBoot status is 2
[D 20220927 074703 collector.cpp:254] Core dump not enabled
[I 20220927 074703 collector.cpp:302] Module version: 2.0.1
[I 20220927 074703 collector.cpp:329] Attempting to download kernel module - Candidate kernel versions: 
[I 20220927 074703 collector.cpp:331] 4.18.0-305.45.1.el8_4.x86_64
[D 20220927 074703 GetKernelObject.cpp:148] Checking for existence of /kernel-modules/collector-4.18.0-305.45.1.el8_4.x86_64.ko.gz and /kernel-modules/collector-4.18.0-305.45.1.el8_4.x86_64.ko
[I 20220927 074703 GetKernelObject.cpp:180] Local storage does not contain collector-4.18.0-305.45.1.el8_4.x86_64.ko
[D 20220927 074703 GetKernelObject.cpp:50] Attempting to download kernel object from https://sensor.mcs-security.svc:443/kernel-objects/2.0.1/collector-4.18.0-305.45.1.el8_4.x86_64.ko.gz
[D 20220927 074703 FileDownloader.cpp:53] Set HTTP status code to '200'
[D 20220927 074703 GetKernelObject.cpp:55] Downloaded kernel object from https://sensor.mcs-security.svc:443/kernel-objects/2.0.1/collector-4.18.0-305.45.1.el8_4.x86_64.ko.gz
[I 20220927 074703 GetKernelObject.cpp:194] Successfully downloaded and decompressed /module/collector.ko
[I 20220927 074703 collector.cpp:262] 
[I 20220927 074703 collector.cpp:263] This product uses kernel module and ebpf subcomponents licensed under the GNU
[I 20220927 074703 collector.cpp:264] GENERAL PURPOSE LICENSE Version 2 outlined in the /kernel-modules/LICENSE file.
[I 20220927 074703 collector.cpp:265] Source code for the kernel module and ebpf subcomponents is available upon
[I 20220927 074703 collector.cpp:266] request by contacting support@stackrox.com.
[I 20220927 074703 collector.cpp:267] 
[I 20220927 074703 collector.cpp:162] Inserting kernel module /module/collector.ko with indefinite removal and retry if required.
[D 20220927 074703 collector.cpp:109] Kernel module arguments: s_syscallIds=26,27,56,57,246,247,248,249,94,95,14,15,156,157,216,217,222,223,4,5,22,23,12,13,154,155,172,173,214,215,230,231,282,283,288,289,292,293,96,97,182,183,218,219,224,225,16,186,234,194,195,192,193,200,201,198,199,36,37,18,19,184,185,220,221,226,227,-1 verbose=0 exclude_selfns=1 exclude_initns=1
[I 20220927 074703 collector.cpp:183] Done inserting kernel module /module/collector.ko.
[I 20220927 074703 collector.cpp:215] gRPC server=sensor.mcs-security.svc:443
[I 20220927 074703 CollectorService.cpp:50] Config: collection_method:kernel_module, useChiselCache:1, snapLen:0, scrape_interval:30, turn_off_scrape:0, hostname:<redacted>, logLevel:DEBUG
[I 20220927 074703 CollectorService.cpp:79] Network scrape interval set to 30 seconds
[I 20220927 074703 CollectorService.cpp:82] Waiting for GRPC server to become ready ...
[I 20220927 074703 CollectorService.cpp:87] GRPC server connectivity is successful
[D 20220927 074703 ConnTracker.cpp:314] ignored l4 protocol and port pairs
[D 20220927 074703 ConnTracker.cpp:316] udp/9
[I 20220927 074703 NetworkStatusNotifier.cpp:187] Started network status notifier.
[I 20220927 074703 NetworkStatusNotifier.cpp:203] Established network connection info stream.
[D 20220927 074703 SysdigService.cpp:262] Updating chisel and flushing chisel cache
[D 20220927 074703 SysdigService.cpp:263] New chisel: 
args = {}
function on_event()
    return true
end
function on_init()
    filter = "not container.id = 'host'\n"
    chisel.set_filter(filter)
    return true
end

[I 20220927 074703 SignalServiceClient.cpp:43] Trying to establish GRPC stream for signals ...
[I 20220927 074703 SignalServiceClient.cpp:61] Successfully established GRPC stream for signals.
[W 20220927 074703 ProtoAllocator.h:41] Allocating a memory block on the heap for the arena, this is inefficient and usually avoidable
collector[0x44746d]
/lib64/libc.so.6(+0x4eb20)[0x7f032f574b20]
Caught signal 11 (SIGSEGV): Segmentation fault
/bootstrap.sh: line 94:    10 Segmentation fault      (core dumped) eval exec "$@"
Collector kernel module has already been loaded.
Removing so that collector can insert it at startup.

@ovalenti
Contributor

ovalenti commented Sep 27, 2022

Just to be sure, did you restart the Sensor pod for this cluster?

oc -n stackrox rollout restart deployment sensor

@balpert89
Author

Yes, I have done that. To be sure, I have rerun the commands:

$ oc rollout restart deployment sensor
deployment.apps/sensor restarted

$ oc delete pod -lapp.kubernetes.io/component=collector
pod "collector-46jlk" deleted
pod "collector-4dtwk" deleted
pod "collector-4ffgp" deleted
pod "collector-57l4b" deleted
pod "collector-5bmwp" deleted
pod "collector-5p9wt" deleted
pod "collector-6c879" deleted
pod "collector-9xnwg" deleted
pod "collector-b6j2g" deleted
pod "collector-c7fqn" deleted
pod "collector-cmrf2" deleted
pod "collector-dddls" deleted
pod "collector-hkflx" deleted
pod "collector-hqzhs" deleted
pod "collector-jcj5s" deleted
pod "collector-jrrdt" deleted
pod "collector-l8rrs" deleted
pod "collector-n46ks" deleted
pod "collector-n9tcl" deleted
pod "collector-rhb9w" deleted
pod "collector-twhmg" deleted

$ oc get pods
NAME                                 READY   STATUS    RESTARTS     AGE
admission-control-646f65f6f7-46gfn   1/1     Running   0            23h
admission-control-646f65f6f7-7zg8s   1/1     Running   0            23h
admission-control-646f65f6f7-gn6g4   1/1     Running   0            23h
collector-2cp4l                      1/2     Error     1 (5s ago)   8s
collector-4nsxj                      1/2     Error     0            9s
collector-5j4kq                      1/2     Error     1 (3s ago)   8s
collector-77jch                      1/2     Error     1 (3s ago)   10s
collector-79cdt                      1/2     Error     1 (4s ago)   8s
collector-8bs8p                      1/2     Error     1 (3s ago)   9s
collector-8qql5                      2/2     Running   1 (2s ago)   9s
collector-9zhf9                      1/2     Error     0            9s
collector-b79k8                      1/2     Error     0            9s
collector-bhxcc                      2/2     Running   1 (2s ago)   9s
collector-cmv7w                      2/2     Running   1 (2s ago)   9s
collector-ctfzl                      2/2     Running   0            8s
collector-fb89g                      1/2     Error     0            8s
collector-fqnz8                      1/2     Error     1 (3s ago)   8s
collector-hdm5k                      2/2     Running   0            8s
collector-jqlj6                      1/2     Error     1 (5s ago)   9s
collector-kft7q                      2/2     Running   0            8s
collector-l9sfm                      1/2     Error     1 (4s ago)   9s
collector-nw69q                      1/2     Error     0            9s
collector-p6bzb                      2/2     Running   1 (2s ago)   9s
collector-sgfxv                      1/2     Error     1 (4s ago)   8s
scanner-9548f9d56-9vd4p              1/1     Running   0            23h
scanner-9548f9d56-tswhz              1/1     Running   0            23h
scanner-db-6dc8b9d494-c2h6r          1/1     Running   0            23h
sensor-6bdf4489d7-bgmr2              1/1     Running   0            2m8s

@ovalenti
Contributor

> Is there some debug message in central / sensor / collector indicating the attempt to download the probes?

You are right that the probe is pulled from the internet in this case. However, Collector still goes through Sensor to do so, so the request is only visible in the Collector logs:
Downloaded kernel object from https://sensor.mcs-security.svc:443/kernel-objects/2.0.1/collector-4.18.0-305.45.1.el8_4.x86_64.ko.gz
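
To pull the relevant lines out of a longer log, a filter along these lines can help. It is only a sketch: the function name `probe_events` is hypothetical, and the patterns are taken from the log excerpts in this thread rather than from a documented log format.

```shell
# Hypothetical filter: keep only the probe download and crash lines
# from a collector log read on standard input.
probe_events() {
  grep -E 'Attempting to download kernel object|Downloaded kernel object|Set HTTP status code|Caught signal'
}
```

Typical use would be `oc logs <collector-pod> -c collector | probe_events`, assuming the container is named `collector`.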

@ovalenti
Contributor

The segfault with this kernel-module probe is being investigated, and I hope that we can fix it soon.

In the meantime, I would suggest switching to the eBPF collection method, which is now the default.
From the logs, it looks like the method is currently set explicitly to "kernel_module", which forces use of the kernel module.
Could you try setting it to "ebpf", please?
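
One possible way to make that change directly on the DaemonSet is sketched below. The helper name `set_collection_method`, the `COLLECTION_METHOD` variable as the place where the method is configured, and the `stackrox` default namespace are assumptions; if the DaemonSet is managed by an operator or a Helm chart, a direct edit may be reverted and the setting should be changed there instead.

```shell
# Hypothetical helper: set the collection method on the collector DaemonSet.
# Assumes oc is on PATH and logged in to the cluster.
set_collection_method() {
  method="$1"              # for example EBPF
  ns="${2:-stackrox}"      # namespace of the collector DaemonSet (assumption)
  oc -n "$ns" set env daemonset/collector COLLECTION_METHOD="$method"
}
```

For example: `set_collection_method EBPF mcs-security`.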

@balpert89
Author

I have switched it, the segfault occurs nonetheless :-(

Collector Version: 3.9.0
OS: Red Hat Enterprise Linux CoreOS 49.84.202205050701-0 (Ootpa)
Kernel Version: 4.18.0-305.45.1.el8_4.x86_64
Starting StackRox Collector...
[I 20220927 104054 HostInfo.cpp:126] Hostname: '<redacted>'
[I 20220927 104054 CollectorConfig.cpp:119] User configured logLevel=debug
[I 20220927 104054 CollectorConfig.cpp:149] User configured collection-method=ebpf
[I 20220927 104054 CollectorConfig.cpp:206] Afterglow is enabled
[D 20220927 104054 HostInfo.h:100] identified kernel release: '4.18.0-305.45.1.el8_4.x86_64'
[D 20220927 104054 HostInfo.h:101] identified kernel version: '#1 SMP Wed Apr 6 13:48:37 EDT 2022'
[D 20220927 104054 collector.cpp:254] Core dump not enabled
[I 20220927 104054 collector.cpp:302] Module version: 2.0.1
[I 20220927 104054 collector.cpp:329] Attempting to download eBPF probe - Candidate kernel versions: 
[I 20220927 104054 collector.cpp:331] 4.18.0-305.45.1.el8_4.x86_64
[D 20220927 104054 GetKernelObject.cpp:148] Checking for existence of /kernel-modules/collector-ebpf-4.18.0-305.45.1.el8_4.x86_64.o.gz and /kernel-modules/collector-ebpf-4.18.0-305.45.1.el8_4.x86_64.o
[I 20220927 104054 GetKernelObject.cpp:180] Local storage does not contain collector-ebpf-4.18.0-305.45.1.el8_4.x86_64.o
[D 20220927 104054 GetKernelObject.cpp:50] Attempting to download kernel object from https://sensor.mcs-security.svc:443/kernel-objects/2.0.1/collector-ebpf-4.18.0-305.45.1.el8_4.x86_64.o.gz
[D 20220927 104054 FileDownloader.cpp:53] Set HTTP status code to '200'
[D 20220927 104054 GetKernelObject.cpp:55] Downloaded kernel object from https://sensor.mcs-security.svc:443/kernel-objects/2.0.1/collector-ebpf-4.18.0-305.45.1.el8_4.x86_64.o.gz
[I 20220927 104054 GetKernelObject.cpp:194] Successfully downloaded and decompressed /module/collector-ebpf.o
[I 20220927 104054 collector.cpp:262] 
[I 20220927 104054 collector.cpp:263] This product uses kernel module and ebpf subcomponents licensed under the GNU
[I 20220927 104054 collector.cpp:264] GENERAL PURPOSE LICENSE Version 2 outlined in the /kernel-modules/LICENSE file.
[I 20220927 104054 collector.cpp:265] Source code for the kernel module and ebpf subcomponents is available upon
[I 20220927 104054 collector.cpp:266] request by contacting support@stackrox.com.
[I 20220927 104054 collector.cpp:267] 
[I 20220927 104054 collector.cpp:215] gRPC server=sensor.mcs-security.svc:443
[I 20220927 104054 CollectorService.cpp:50] Config: collection_method:ebpf, useChiselCache:1, snapLen:0, scrape_interval:30, turn_off_scrape:0, hostname:<redacted>, logLevel:DEBUG
[I 20220927 104054 CollectorService.cpp:79] Network scrape interval set to 30 seconds
[I 20220927 104054 CollectorService.cpp:82] Waiting for GRPC server to become ready ...
[I 20220927 104054 CollectorService.cpp:87] GRPC server connectivity is successful
[D 20220927 104054 ConnTracker.cpp:314] ignored l4 protocol and port pairs
[D 20220927 104054 ConnTracker.cpp:316] udp/9
[I 20220927 104054 NetworkStatusNotifier.cpp:187] Started network status notifier.
[I 20220927 104054 NetworkStatusNotifier.cpp:203] Established network connection info stream.
[D 20220927 104054 SysdigService.cpp:262] Updating chisel and flushing chisel cache
[D 20220927 104054 SysdigService.cpp:263] New chisel: 
args = {}
function on_event()
    return true
end
function on_init()
    filter = "not container.id = 'host'\n"
    chisel.set_filter(filter)
    return true
end

[I 20220927 104054 SignalServiceClient.cpp:43] Trying to establish GRPC stream for signals ...
[I 20220927 104054 SignalServiceClient.cpp:61] Successfully established GRPC stream for signals.
[D 20220927 104054 ConnScraper.cpp:406] Could not open process directory 1736785: No such file or directory
[W 20220927 104054 ProtoAllocator.h:41] Allocating a memory block on the heap for the arena, this is inefficient and usually avoidable
collector[0x44746d]
/lib64/libc.so.6(+0x4eb20)[0x7f8ab42bcb20]
Caught signal 11 (SIGSEGV): Segmentation fault
/bootstrap.sh: line 94:    10 Segmentation fault      (core dumped) eval exec "$@"

This is the current list of environment variables on the collector daemonSet; maybe it helps in narrowing down the issue:

env:
- name: COLLECTOR_CONFIG
  value: '{"logLevel":"debug","tlsConfig": {"caCertPath":"/var/run/secrets/stackrox.io/certs/ca.pem","clientCertPath":"/var/run/secrets/stackrox.io/certs/cert.pem","clientKeyPath":"/var/run/secrets/stackrox.io/certs/key.pem"}}'
- name: COLLECTION_METHOD
  value: EBPF
- name: GRPC_SERVER
  value: 'sensor.mcs-security.svc:443'
- name: SNI_HOSTNAME
  value: sensor.mcs-security.svc
- name: ROX_COMPLIANCE_OPERATOR_INTEGRATION
  value: 'true'
- name: ROX_CSV_EXPORT
  value: 'false'
- name: ROX_DECOMMISSIONED_CLUSTER_RETENTION
  value: 'false'
- name: ROX_ECR_AUTO_INTEGRATION
  value: 'true'
- name: ROX_ENABLE_ROLLBACK
  value: 'true'
- name: ROX_FRONTEND_VM_UDPATES
  value: 'false'
- name: ROX_INTEGRATIONS_AS_CONFIG
  value: 'false'
- name: ROX_LOCAL_IMAGE_SCANNING
  value: 'true'
- name: ROX_NETPOL_FIELDS
  value: 'true'
- name: ROX_NETWORK_DETECTION_BASELINE_SIMULATION
  value: 'true'
- name: ROX_NEW_POLICY_CATEGORIES
  value: 'false'
- name: ROX_POLICIES_PATTERNFLY
  value: 'true'
- name: ROX_POSTGRES_DATASTORE
  value: 'false'
- name: ROX_SECURITY_METRICS_PHASE_ONE
  value: 'true'
- name: ROX_SYSTEM_HEALTH_PF
  value: 'false'
- name: ROX_VERIFY_IMAGE_SIGNATURE
  value: 'true'

@erthalion
Contributor

erthalion commented Sep 27, 2022

@balpert89 is there a chance you can grab the core dump for us? I guess you would need to enable it via env variable [1].

@balpert89
Author

I have set the environment variable, however how do I get ahold of the dump? And how do I get it out of the container, since there is neither tar nor rsync or any other network tool at my disposal in the container?

@erthalion
Contributor

> I have set the environment variable, however how do I get ahold of the dump? And how do I get it out of the container, since there is neither tar nor rsync or any other network tool at my disposal in the container?

It involves some manual steps. The dump should be created in the location given by /proc/sys/kernel/core_pattern; you can fetch it directly from the pod with the kubectl/oc cp command. If it is erased when the pod crashes, mount a persistent volume at the core_pattern path. Would that work for you?

@balpert89
Author

I have mounted the hostPath /proc in the compliance container, but the only content in /host/proc/sys/kernel/core_pattern is:

|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e

@erthalion
Contributor

erthalion commented Sep 27, 2022

> I have mounted the hostPath /proc in the compliance container, but the only content in /host/proc/sys/kernel/core_pattern is:

Yes, sorry for not being clear: the content of the file you've posted determines where the core dump is stored. In this case the dump is handled by the systemd-coredump tool and should be stored in /var/lib/systemd/coredump/ (see the "Core dumps and systemd" section in the documentation).
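
How the kernel reads `core_pattern` can be sketched as follows: a value that begins with `|` names a program that receives the dump on standard input, and anything else is a file name template. The function name `core_pattern_kind` is hypothetical.

```shell
# Hypothetical helper: classify a core_pattern value.
# Reads /proc/sys/kernel/core_pattern when no argument is given.
core_pattern_kind() {
  pattern="${1:-$(cat /proc/sys/kernel/core_pattern)}"
  case "$pattern" in
    '|'*)
      # Strip the leading pipe to get the handler command line.
      handler=${pattern#\|}
      echo "pipe: $handler"
      ;;
    *)
      echo "file: $pattern"
      ;;
  esac
}
```

With the value posted above, this reports a pipe to `/usr/lib/systemd/systemd-coredump`, which is why the dump has to be looked up on the node under /var/lib/systemd/coredump/ rather than inside the container.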

@balpert89
Author

Ok, sorry for that.

I have the coredump on my local machine, how can I provide it to you? I feel uncomfortable dropping this onto a public GitHub issue.

@balpert89
Author

@ovalenti can you please provide me your mail address so I can send you the uncompressed coredump?

@ovalenti
Contributor

Yes please, here it is: ovalenti at redhat dot com

I really appreciate your cooperation on this issue. For now, we suspect that the root cause has something to do with the memory allocation code. And since the crash does not reproduce on our own clusters, this might be triggered by the specifics of your workload.

Let me analyse the core dump to see if we can sort this out.

@ovalenti
Contributor

Thank you for the core-dump, @balpert89. Unfortunately, I could not get to the root cause using it.

To bisect the issue further, could you please try temporarily disabling the scraper?

This can be achieved by adding the following key to the COLLECTOR_CONFIG JSON document:
"turnOffScrape": true
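
For illustration, the key can be spliced into an existing COLLECTOR_CONFIG value like this. This is a sketch under assumptions: `with_scrape_disabled` is a hypothetical helper, it only handles a compact JSON object that starts with `{`, and it does not check whether the key is already present.

```shell
# Hypothetical helper: print the given JSON object with
# "turnOffScrape": true added as its first key.
with_scrape_disabled() {
  case "$1" in
    '{}')
      printf '%s\n' '{"turnOffScrape":true}'
      ;;
    '{'*)
      rest=${1#?}          # everything after the opening brace
      printf '{"turnOffScrape":true,%s\n' "$rest"
      ;;
    *)
      echo "expected a JSON object" >&2
      return 1
      ;;
  esac
}
```

The output would then be set as the new value of the COLLECTOR_CONFIG environment variable on the collector DaemonSet.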

@balpert89
Author

After disabling the scraping the collector does not experience issues.

@ovalenti
Contributor

Excellent, this narrows the search area quite a lot. I am going to bring in an expert on the scraper.

@ovalenti
Contributor

We found a likely cause for this issue!

A fix has been written and is under test. Hopefully we can provide a service release soon.

@balpert89
Author

Nice. If you have a test image, I can deploy it in the environment experiencing the issues.

@ovalenti
Contributor

Thank you @balpert89 for testing the latest 3.11.x image. It looks like the fix is working.

The patch is on master now and has been backported to 3.11 (quay.io/stackrox-io/collector:3.11.0-1-g0db9c01c9b-slim).

@balpert89
Author

Thank you as well :)

One question remains: we currently use the 3.72.0 tag for the collector. Is there an upcoming patch release (3.72.1?) that includes the fix?

@ovalenti
Contributor

ovalenti commented Oct 5, 2022

We are preparing for it, but I don't have a date yet.

@bengukaraalioglu

> Thank you for the core-dump, @balpert89. Unfortunately, I could not get to the root cause using it.
>
> To bisect the issue further, could you please try temporarily disabling the scraper?
>
> This can be achieved by adding the following key to the COLLECTOR_CONFIG JSON document: "turnOffScrape": true

How can we turn off the scraper? I tried adding turnOffScrape: true to the following field in the secure cluster YAML, but it is deleted automatically.

collector:
  collection: KernelModule
  turnOffScrape: true

@ovalenti
Contributor

Hi @bengukaraalioglu, I would suggest using the latest images, which include the fix, instead.

3.71: quay.io/stackrox-io/collector:3.71.2
3.72: quay.io/stackrox-io/collector:3.72.1-rc.2 (no collector related change planned until release)

The official release of StackRox 3.72.1 and 3.71.2 should happen early next week (@balpert89).
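
A sketch of switching the DaemonSet to one of these images by hand, for anyone who cannot wait for the release: the helper name `use_collector_image`, the container name `collector`, and the `stackrox` default namespace are assumptions, and an operator or Helm chart that manages the DaemonSet may revert the change.

```shell
# Hypothetical helper: point the collector container at a given image.
# Assumes oc is on PATH and logged in to the cluster.
use_collector_image() {
  image="$1"
  ns="${2:-stackrox}"      # namespace of the collector DaemonSet (assumption)
  # "collector" as the container name is an assumption; check the DaemonSet spec.
  oc -n "$ns" set image daemonset/collector collector="$image"
}
```

For example: `use_collector_image quay.io/stackrox-io/collector:3.71.2`.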

@ovalenti
Contributor

Both releases including the fix are now live!
