Camera frontend timeout fires too aggressively (crashing libcamerasrc on RPi5) #139

sgiessl · 2024-06-02T21:03:22Z

libcamerasrc sporadically encounters a timeout scenario. This timeout is NOT caused by any critical hardware condition. As a result, the gstreamer pipeline state becomes unrecoverable (and rpicam-apps might possibly trigger slow camera reset handlers)

(this report is about avoiding premature timeout-triggers, NOT about the error handling of the gstreamer plugin as that's topic of another discussion)

My affected system configuration is Raspberry Pi 5 with Raspberrypi Camera Module 3, under certain conditions as described below.

libcamera version

I have reproduced this in a recent checkout of the libcamera release branch (tag v0.2.0+rpt20240418, reported libcamera version 0.2.0+120-eb00c13d);
The same issue has been observed in older v0.1.0 versions, and using the raspberrypi-official debian bookworm packages.

How to reproduce

I'm running the following pipeline:

LD_LIBRARY_PATH='/usr/local/lib/aarch64-linux-gnu/' \
        LIBCAMERA_LOG_LEVELS="*:0" \
        GST_PLUGIN_PATH=/usr/local/lib/aarch64-linux-gnu/gstreamer-1.0/ \
        GST_DEBUG="libcamera*:6,*:7" \
        gst-launch-1.0 -v libcamerasrc camera-name=/base/axi/pcie@120000/rp1/i2c@${I2C_NODE:-88000}/imx708@1a ! video/x-raw,format=RGBx,width=2304,height=1296,framerate=50/1 ! jpegenc quality=85 ! matroskamux timecodescale=1000 streamable=true ! filesink location=test.${CAM_ID:-0}.mkv > test.${CAM_ID:-0}.mkv.log 2>&1

This will usually start recording the video successfully, but eventually, it will stop generating frames, then log a timeout error message (WARN V4L2 v4l2_videodevice.cpp:2007 /dev/video12[20:cap]: Dequeue timer of 1000000.00us has expired, ERROR RPI pipeline_base.cpp:1374 Camera frontend has timed out!), followed by around 10 seconds of request cancellation errors (DEBUG libcamerasrc gstlibcamerasrc.cpp:231:requestCompleted:<libcamerasrc0> Request was cancelled), and eventually a segmentation fault message and process halt (Caught SIGSEGV).

It can take seconds or hours before the error shows up; the following system stressing conditions seem to increase the error rate:

Running two gst-launch processes, recording separate cameras, in parallel
Using a SSD as the storage device, which is accessed via PCIe (through an inofficial PCIe adapter board)
Particularly, when I'm using it in combination with a SSD that's throwing sporadic dmesg errors (PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID))
Including CPU-intense and potentially blocking gstreamer pipeline elements such as webrtc or file sinks (whether or not using large buffered gstreamer queue elements).

I have the feeling that the issue is less grave when storing to non-blocking output such as ramdisks; however, it cannot be entirely avoided. Note that I do not know whehter the issue can be reproduced with no PCIe SSD adapter connected to the system at all.

How to work around the problem

Activating a custom config file using LIBCAMERA_RPI_CONFIG_FILE=${PWD}/pisp_config.yaml, and with the following config content, the issue does not seem to show up anymore (stress test still running)

{
        "version": 1.0,
        "target": "pisp",

        "pipeline_handler":
        {
                "camera_timeout_value_ms": 100000,
        }
}

(I had to spend many hours tracing logs and reading the code before identifying this straight forward work-around; the logs of libcamerasrc initially hinted me in the direction of framebuffer- or request-pool management issues, because timeouts as the root cause are exposed only by the lower level libcamera debug messages; I hope this report will help others who run into the same issue :) )

Preliminary analysis

By activating the custom configuration value, we override the default CFE timeout value of one second (log Setting CFE timeout to 1000000.00us)

This CFE timeout value is computed dynamically by PiSPCameraData::setCameraTimeout() which is invoked with a frame length argument of 19.9875 ms. This function computes the timeout as std::max<utils::Duration>(1s, 5 * maxFrameLengthMs * 1ms) - it seems to me that the magic number 5 has been chosen arbitrarily; the intermediate value of 100 ms is clamped to 1s which also appears to have been chosen arbitrarily.

Thus, I suspect we could provide an even better experience to new users by tweaking the formula to provide stable operations even under heavy system load?

Open question: Why is this only triggered in gstreamer/libcamerasrc? As far as I can tell from my experiments, rpicam-vid is using identical default CFE timeout parameters, however I have not been able to trigger the frontend timeout there. The most obvious difference is that rpicam-vid requires less CPU because it's encoding the MJPEG stream directly from YUV data (rather than doing a expensive RGBx -> YUV conversion inside the jpegenc element above). However, even with when hacking libcamerasrc to output YUV for lower CPU usage, the gstreamer pipeline is still likely to cause the issue. Thus, there might be some hidden optimization potential in libcamerasrc...? And maybe tweaking the timeout formula might be the wrong way to go...?

Please let me know your thoughts. I'm glad to provide more info (particular excerpts of the logs, process core dump, ...?) and to run further experiments. I'm also offering to propose patches once there is agreement about the solution.

The text was updated successfully, but these errors were encountered:

naushir · 2024-06-05T07:56:25Z

Thank you for the detailed explanation, there are certainly things worth discussing on this topic.

Firstly, depending on the application/framework you are running with, the camera timeout does also include frames/requests recycled back from the application. So if an application holds onto framebuffers for a very long period of time, starving the sensor this timeout will hit. Ideally I would like the timeout only to encompass the sensor hardware, but this is actually very tricky to do.

So assuming the timeout hits because the gstreamer pipeline is not recycling frames, the question is why? A 1 second timeout means that we are dropping > 30 frames from the sensor waiting for something to happen, and this does not feel good. Increasing the timeout to compensate would only hide the underlying problem I feel.

sgiessl · 2024-06-05T08:54:28Z

Thank you very much for your thoughts.

I have not been aware that this flow is impacted by the application, blocked by to be recycled requests. My understanding was that the timeout was primarily related to the process of preparing/passing requests/frames to/from the hardware.

Then it is a good reminder that even the short timeout of 1 second will likely cause (silent) frame drop. Thus, I would share the sentiment that we should better understand the root cause of the timeouts. (...on the other hand, one could argue that for some applications, including mine, occasional silent frame drops may be less of an issue than unpredictable, silent, full stalls or SIGSEGV; but as you said this is likely better to be decided by the application/framework side - let's put this aside and focus on the core issue)

I think the best path forward is still to better understand where the time is actually spent (client side, libcamera core, or PISP) ...?

Are there any ways to extract and distinguish between these timings from logs or other debugging means? This would help in gstreamer framework or -application investigations.
How feasible, in principle, is it to better distinguish between timeouts caused within the sensor hardware vs. starving the sensor of framebuffers? You are saying it's tricky, and I don't even know if it's possible in a API backwards-compatible way, but the benefit might be that app developers could chose their trade-offs in a more fine-grained way. And it would be easier to "defend" libcamera/PISP against claims caused by bad-behaving applications.

Where would you recommend I could best report the gstreamer-related issue (upstream libcamera gstreamer-plugin? or gstreamer developers?)

naushir · 2024-06-05T09:13:54Z

I have not been aware that this flow is impacted by the application, blocked by to be recycled requests. My understanding was that the timeout was primarily related to the process of preparing/passing requests/frames to/from the hardware.

I might have to correct myself here. The timeout actually might already only encompass the hardware framing cycle, and application delays should not cause it to timeout. But I'll have to re-read the code carefully to understand what I did many years ago.

Are there any ways to extract and distinguish between these timings from logs or other debugging means? This would help in gstreamer framework or -application investigations.

Perhaps you can start by turning on verbose libcamera logging and trigger the timeout. To do this, set LIBCAMERA_LOG_LEVELS=*:0 as an environment variable in the terminal window before running your pipeline.

How feasible, in principle, is it to better distinguish between timeouts caused within the sensor hardware vs. starving the sensor of framebuffers? You are saying it's tricky, and I don't even know if it's possible in a API backwards-compatible way, but the benefit might be that app developers could chose their trade-offs in a more fine-grained way. And it would be easier to "defend" libcamera/PISP against claims caused by bad-behaving applications.

See above, this may already be the case, but I need to look at the code in more detail.

Where would you recommend I could best report the gstreamer-related issue (upstream libcamera gstreamer-plugin? or gstreamer developers?)

If it is gstreamer related, the best approach would be to post a message to the libcamera-devel mailing list to start with.

naushir · 2024-06-05T13:20:47Z

I might have to correct myself here. The timeout actually might already only encompass the hardware framing cycle, and application delays should not cause it to timeout. But I'll have to re-read the code carefully to understand what I did many years ago.

So I think I can confirm that this is indeed the case - the timeout only accounts for the sensor frame recycling loop. So then the question comes down if this is an actual problem with the camera hardware and gets provoked more when the system is busy?

ricokritzer · 2024-06-05T20:11:53Z

I run into the same issue every time, using an imx219 sensor. Unfortunately increasing the timeout doesn't solve the problem.

In former versions raspicam worked fine. So this might be a bug.

LIBCAMERA_LOG_LEVELS=*:0

sudo rpicam-hello

[15:20:09.325250564] [4708] INFO Camera camera_manager.cpp:284 libcamera v0.2.0+120-eb00c13d
[15:20:09.369254017] [4711] WARN RPiSdn sdn.cpp:40 Using legacy SDN tuning - please consider moving SDN inside rpi.denoise
[15:20:09.371157376] [4711] WARN RPI vc4.cpp:392 Mismatch between Unicam and CamHelper for embedded data usage!
[15:20:09.371820847] [4711] INFO RPI vc4.cpp:446 Registered camera /base/soc/i2c0mux/i2c@1/imx219@10 to Unicam device /dev/media3 and ISP device /dev/media1
[15:20:09.371863216] [4711] INFO RPI pipeline_base.cpp:1102 Using configuration file '/usr/share/libcamera/pipeline/rpi/vc4/rpi_apps.yaml'
Preview window unavailable
Mode selection for 1640:1232:12:P
SRGGB10_CSI2P,640x480/0 - Score: 4504.81
SRGGB10_CSI2P,1640x1232/0 - Score: 1000
SRGGB10_CSI2P,1920x1080/0 - Score: 1541.48
SRGGB10_CSI2P,3280x2464/0 - Score: 1718
SRGGB8,640x480/0 - Score: 5504.81
SRGGB8,1640x1232/0 - Score: 2000
SRGGB8,1920x1080/0 - Score: 2541.48
SRGGB8,3280x2464/0 - Score: 2718
Stream configuration adjusted
[15:20:09.439330447] [4708] INFO Camera camera.cpp:1183 configuring streams: (0) 1640x1232-YUV420 (1) 1640x1232-SBGGR10_CSI2P
[15:20:09.439737552] [4711] INFO RPI vc4.cpp:621 Sensor: /base/soc/i2c0mux/i2c@1/imx219@10 - Selected sensor format: 1640x1232-SBGGR10_1X10 - Selected unicam format: 1640x1232-pBAA
[15:20:19.502458979] [4711] WARN V4L2 v4l2_videodevice.cpp:2007 /dev/video0[13:cap]: Dequeue timer of 10000000.00us has expired!
[15:20:19.502607699] [4711] ERROR RPI pipeline_base.cpp:1334 Camera frontend has timed out!
[15:20:19.502648717] [4711] ERROR RPI pipeline_base.cpp:1335 Please check that your camera sensor connector is attached securely.
[15:20:19.502689661] [4711] ERROR RPI pipeline_base.cpp:1336 Alternatively, try another cable and/or sensor.
ERROR: Device timeout detected, attempting a restart!!!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Camera frontend timeout fires too aggressively (crashing libcamerasrc on RPi5) #139

Camera frontend timeout fires too aggressively (crashing libcamerasrc on RPi5) #139

sgiessl commented Jun 2, 2024

naushir commented Jun 5, 2024

sgiessl commented Jun 5, 2024

naushir commented Jun 5, 2024

naushir commented Jun 5, 2024

ricokritzer commented Jun 5, 2024 •

edited

Loading

Camera frontend timeout fires too aggressively (crashing libcamerasrc on RPi5) #139

Camera frontend timeout fires too aggressively (crashing libcamerasrc on RPi5) #139

Comments

sgiessl commented Jun 2, 2024

naushir commented Jun 5, 2024

sgiessl commented Jun 5, 2024

naushir commented Jun 5, 2024

naushir commented Jun 5, 2024

ricokritzer commented Jun 5, 2024 • edited Loading

ricokritzer commented Jun 5, 2024 •

edited

Loading