Irregular GPIO latency #497
I have medium fast data acquisition hardware / code that uses the falling edge of a GPIO input to signal an FIQ. The FIQ code preempts a PREEMPT_RT Linux kernel and stores ADC data from other GPIO pins in a buffer. This code works as expected on the Raspberry Pi 1 B using older firmware and slightly modified USB drivers for OSADL’s old 3.12.24-rt38 patch set from last year.
To push higher data rates and modernize the system, I recently redesigned the hardware to take advantage of the new 40 pin header and ported my FIQ code to the RPi 2's newer kernel. I immediately discovered irregular GPIO latency issues, which I initially attributed to multi-core bus contention. I tried a few different troubleshooting "fixes" (busy-waiting the other cores during an FIQ, disabling DMA, reviewing the newest USB FIQ code for incompatibilities, etc.), but they were all dead ends.
Running out of ideas with the RPi 2, I purchased an RPi 1 B+ since it was compatible with my working BCM2835 code and had the new 40 pin header. I found that the irregular latency issue was also present in recent firmware (including commit 1f5c0d7), so I then proceeded to work backwards in commits to find when the issue appeared.
After a fair amount of testing, I found that my code runs with zero GPIO latency issues for 9+ hours using the Oct 6, 2014 c786f85 commit but within minutes shows GPIO latency issues on the Oct 12, 2014 e4afeda commit. This suggests to me that the cause is the low-level firmware since I used the same 3.12.24-rt38 kernel to test each commit.
Attached below are screenshots of the observed behavior using the RPi 1 B+ and an oscilloscope with infinite persistence turned on. Infinite persistence overlays all signal traces captured during a given time window, which makes it useful for catching infrequent, irregular signal behavior.
The falling edge that the FIQ is responding to is shown on channel 1 (yellow) and the ADC data read strobe driven by FIQ code is shown on channel 3 (purple). Commits 1f5c0d7 and e4afeda show the undesirable, irregular GPIO latency behavior and commit c786f85 does not.
Can someone with access to the firmware source code please look into why this latency issue might be occurring and perhaps offer a fix? I would greatly appreciate it.
I would be happy to help test new firmware fixes. I can also provide the kernel I'm using, but it is of limited utility since reproducing the latency issue is somewhat involved (it requires an oscilloscope / logic analyzer and a method of generating a square wave). I unfortunately do not have a minimal example of my kernel code for others to reproduce this issue, but I can try to provide one if it'd be helpful. My code is tied into a lot of other logic, which makes it difficult to create a minimal example (a high level overview of the code is provided in the diagram below).
My cmdline.txt is:
dwc_otg.speed=1 dwc_otg.fiq_fix_enable=0 dwc_otg.fiq_split_enable=0 root=/dev/mmcblk0p2 rootfstype=ext4 rootwait elevator=noop
My config.txt file is:
I run my RPi without Xorg using the minimal raspbian-ua-netinst, but I did try increasing gpu_mem to make the config.txt more standard. It did not fix the issue.
9 hours with c786f85:
15 minutes with e4afeda:
10 minutes with 1f5c0d7:
There are a small number of commits in the firmware tree between Oct 6th and Oct 12th 2014. None of them look particularly obvious for causing this.
@popcornmix Thank you so much for the firmware files! I ran through the list a few times and observed the following behavior:
I need to look more into the nature of the kernel panic. The panic seems to occur within seconds to an hour. Sometimes I’ve just stopped the test to move on to another firmware file if no panic has occurred and the latency has looked normal for a while (>20 minutes).
The above was done on an RPi 1 B+ with bootcode.bin from c786f85. I used dd if=/dev/zero of=/dev/null and my userspace DAQ code to generate a high CPU load of about 2.
Was anything changed with respect to disable_pvt in the above firmware files?
@P33M Yes, the issue is infrequent -- it can take anywhere from a few seconds to a couple minutes before the first latency issue shows up.
It's a change to order of initialisation rather than any deliberate change in behaviour.
It's unfortunate that there's no clear explanation beyond a change in initialization order, but I suppose this is still progress: the GPIO issue is now tied to a GPIO-related firmware change.
Perhaps I should start working on minimal example code that reproduces the latency issue? It might help in further isolating what's going on. Also, here are the logs from vcdbg for 5ee9b56e and 32a704ea:
I thought they might be useful.
Finally, I spent some time this evening re-running firmware tests to re-confirm we've isolated the right firmware change. I definitely see the latency problem in firmwares 1f438843 to 5ee9b56e, but not in 32a704ea to a363c9d4. I also removed my DAQ circuitry and drove the GPIO pin directly (with source termination) on the off chance some issue with my hardware was causing false FIQ fires. No luck.
I've stepped through the code before and after that commit and the problem is that the config.txt parsing code is now disabling pvt before powerman is initialised (which is enabling it again).
I've moved the handling of disable_pvt to later. Here is a top-of-tree firmware with the fix added. Can you test?
So far, so good. There hasn't been an FIQ GPIO glitch (or kernel panic) in the past 45 minutes. I'm going to let it run longer on the RPi 1 B+ and then switch over to the RPi 2 code. I'll probably have that done by Monday (I haven't touched my RPi 2 code in a while). Then I guess we can close this issue and have a beer?
Thanks so much for looking into this -- I really appreciate it.
@subspclr4 - very cool project (and great drill down)!
If I may tack on a question: I am interested in the hardware-to-usercode notification path using fasync(). What latency and jitter do you observe on the path from external event to user-task notification?
(I could make good use of a low-latency notification path for this particular purpose: machinekit/machinekit#687. Any chance of that part (kernel module, userland code fragment) being shared?)
@mhaberler Thanks for the interest. Latency and jitter on fasync notification haven't been an issue for my application, so I haven't measured them in a while. If I remember correctly, the worst case latency was on par with the results I got from cyclictest:
Looking at my old notes, I unfortunately only have end-to-end plots that confound the fasync latency measurement with 300ish microseconds of control algorithm execution time. They show the total chain of signaling: FIQ » high priority IRQ at specific crank angle(s) » fasync » user space control algorithm with floating point math » ioctl back to kernel » my SPI driver » CAN packet to ECU. From those plots the upper bound on fasync is roughly consistent with the cyclictest results above.
I have been rethinking the entire code architecture, and fasync is actually on the chopping block. I want to explore using a plain, threaded blocking read instead of a signal since it will be simpler, will (hopefully) give similar performance, and will be way more standard than relying on signaling.
The results are, however, still TBD. I'll let you know how it goes (and put together some plots / code). As for my current code, it's pretty much exactly what's in LDD. I am not ready to release the entire code base, but it will happen soon. My goal is to rework it into something not tied to a niche automotive application.
@subspclr4 - interesting; I was wondering about the rationale for the fasync use, which I had not seen before.
I did some loosely related work on a BeagleBone, propagating a GPIO flip to a Xenomai thread using an RTDM driver, and I scoped the result: about 5 µs until the kernel code starts executing, and about 15 µs until userland gets unblocked.
While this is Xenomai/RTDM-specific, I think a plain Linux driver with userland doing, say, a read() might not show drastically different results; and since PREEMPT_RT will likely be the go-to kernel for RT support, I am really interested in your results!
@mhaberler Sorry for the delay writing back. I did run some preliminary tests last month, but I didn't see much of a difference between fasync and a blocking read() on the scope. I was hoping to refine those tests with hard latency numbers by now, but I haven't had the chance yet. Anyway, I figured it'd be good to at least mention that fasync is probably not a path to improved notification performance on machinekit.