
Irregular GPIO latency #497

Closed
subspclr4 opened this Issue Nov 18, 2015 · 15 comments

@subspclr4

subspclr4 commented Nov 18, 2015

Hello,

I have medium fast data acquisition hardware / code that uses the falling edge of a GPIO input to signal an FIQ. The FIQ code preempts a PREEMPT_RT Linux kernel and stores ADC data from other GPIO pins in a buffer. This code works as expected on the Raspberry Pi 1 B using older firmware and slightly modified USB drivers for OSADL’s old 3.12.24-rt38 patch set from last year.

To push higher data rates and modernize the system, I recently redesigned the hardware to take advantage of the new 40 pin header and ported my FIQ code to the RPi 2’s newer kernel. I immediately discovered irregular GPIO latency issues, which I initially attributed to multi-core bus contention. I tried a few different troubleshooting “fixes” (busy-waiting the other cores during an FIQ, disabling DMA, reviewing the newest USB FIQ code for incompatibilities, etc.), but they were all dead ends.

Running out of ideas with the RPi 2, I purchased an RPi 1 B+ since it was compatible with my working BCM2835 code and had the new 40 pin header. I found that the irregular latency issue was also present in recent firmware (including commit 1f5c0d7), so I then proceeded to work backwards in commits to find when the issue appeared.

After a fair amount of testing, I found that my code runs with zero GPIO latency issues for 9+ hours using the Oct 6, 2014 c786f85 commit but within minutes shows GPIO latency issues on the Oct 12, 2014 e4afeda commit. This suggests to me that the cause is the low-level firmware since I used the same 3.12.24-rt38 kernel to test each commit.

Attached below are screenshots of the observed behavior using the RPi 1 B+ and an oscilloscope with infinite persistence turned on. Infinite persistence is the sum of all signal traces observed during a given time window and is useful for catching infrequent, irregular signal behavior.

The falling edge that the FIQ is responding to is shown on channel 1 (yellow) and the ADC data read strobe driven by FIQ code is shown on channel 3 (purple). Commits 1f5c0d7 and e4afeda show the undesirable, irregular GPIO latency behavior and commit c786f85 does not.

Can someone with access to the firmware source code please look into why this latency issue might be occurring and perhaps offer a fix? I would greatly appreciate it.

I would be happy to help test new firmware fixes. I can also provide the kernel I'm using, but it is of limited utility since reproducing the latency issue is somewhat involved (it requires an oscilloscope / logic analyzer and a method of generating a square wave). I unfortunately do not have a minimal example of my kernel code for others to reproduce this issue, but I can try to provide one if it'd be helpful. My code is tied into a lot of other logic, which makes it difficult to create a minimal example (a high level overview of the code is provided in the diagram below).

My cmdline.txt is:

dwc_otg.speed=1 dwc_otg.fiq_fix_enable=0 dwc_otg.fiq_split_enable=0 root=/dev/mmcblk0p2 rootfstype=ext4 rootwait elevator=noop

My config.txt file is:

gpu_mem=16
disable_l2cache=0
disable_pvt=1
force_turbo=1
arm_freq=700
gpu_freq=250
core_freq=250
sdram_freq=400
kernel=kernel.img

I run my RPi without Xorg using the minimal raspbian-ua-netinst, but I did try increasing gpu_mem to make the config.txt more standard. It did not fix the issue.

Thanks!

-Adam

9 hours with c786f85:

(screenshot)

15 minutes with e4afeda:

(screenshot)

10 minutes with 1f5c0d7:

(screenshot)

Code layout:

(diagram)

@popcornmix

popcornmix (Contributor) commented Nov 18, 2015

There are a small number of commits in the firmware tree between Oct 6th and Oct 12th 2014. None of them looks particularly obvious as the cause of this.
However, I can produce a sequence of firmware builds between these two points, from which you can hopefully identify the exact commit that causes your issue.

@subspclr4

subspclr4 commented Nov 18, 2015

Great, thank you for taking the time to look into this. I can test the firmware builds as soon as they become available.

@P33M

P33M commented Nov 19, 2015

The infinite persistence trace is interesting: if that's 10 minutes of persistence, then your FIQ is only being disrupted in a small percentage of cases.

If you use, say, 1 second of persistence, do any anomalous FIQ latencies appear?

@subspclr4

subspclr4 commented Nov 20, 2015

@popcornmix Thank you so much for the firmware files! I ran through the list a few times and observed the following behavior:

| # | File | Result |
|---|------|--------|
| 1 | firmware_1f438843.zip | Latency issue with possible kernel panic |
| 2 | firmware_231a26c9.zip | Latency issue with possible kernel panic |
| 3 | firmware_433ac48f.zip | Latency issue with possible kernel panic |
| 4 | firmware_530617f5.zip | Latency issue with possible kernel panic |
| 5 | firmware_1e7dd98d.zip | Latency issue with possible kernel panic |
| 6 | firmware_27beaee2.zip | Latency issue with possible kernel panic |
| 7 | firmware_5ee9b56e.zip | Latency issue with possible kernel panic |
| 8 | firmware_32a704ea.zip | No latency issue while working, but possible kernel panic |
| 9 | firmware_192a1fb8.zip | No latency issue while working, but possible kernel panic |
| 10 | firmware_4fb0f0a9.zip | No latency issue while working, but possible kernel panic |
| 11 | firmware_a363c9d4.zip | No latency issue while working, but possible kernel panic |
| N/A | commit c786f85 | No latency issue and no kernel panic in 9+ hours (tested twice) |

I need to look more into the nature of the kernel panic. It seems to occur anywhere from a few seconds to an hour in. Sometimes I’ve stopped a test and moved on to the next firmware file when no panic had occurred and the latency had looked normal for a while (>20 minutes).

The above was done on an RPi 1 B+ with bootcode.bin from c786f85. I used dd if=/dev/zero of=/dev/null and my userspace DAQ code to generate a high CPU load of about 2.

Was anything changed with respect to disable_pvt in the above firmware files?

@P33M Yes, the issue is infrequent -- it can take anywhere from a few seconds to a couple of minutes before the first latency issue shows up.

@popcornmix

popcornmix (Contributor) commented Nov 20, 2015

Commit 5ee9b56e is "platform: Reorganise so config.txt is parsed before gpioman is set up".

It's a change to the order of initialisation rather than any deliberate change in behaviour.
It is possible that disable_pvt is getting set up after it's first been used. I'll see if I can find anything wrong when stepping through the code.

@subspclr4

subspclr4 commented Nov 21, 2015

It's unfortunate that there's no clear explanation beyond just a change in the initialization order, but I suppose this is still progress because the GPIO issue is now associated with a GPIO-related firmware change.

Perhaps I should start working on minimal example code that reproduces the latency issue? It might help in further isolating what's going on. Also, here are the logs from vcdbg for 5ee9b56e and 32a704ea -- I thought they might be useful:

firmware_5ee9b56e.txt
firmware_32a704ea.txt

Finally, I spent some time this evening re-running firmware tests to re-confirm we've isolated the right firmware change. I definitely see the latency problem in firmwares 1f438843 through 5ee9b56e, but not in 32a704ea through a363c9d4. I also removed my DAQ circuitry and drove the GPIO pin directly (with source termination) on the off chance that some issue with my hardware was causing false FIQ fires. No luck.

@popcornmix

popcornmix (Contributor) commented Nov 21, 2015

I've stepped through the code before and after that commit and the problem is that the config.txt parsing code is now disabling pvt before powerman is initialised (which is enabling it again).

I've moved the handling of disable_pvt to later. Here is a top-of-tree firmware with the fix added. Can you test?
https://dl.dropboxusercontent.com/u/3669512/temp/firmware_08c8feab9.zip

@subspclr4

subspclr4 commented Nov 21, 2015

So far, so good. There hasn't been an FIQ GPIO glitch (or kernel panic) in the past 45 minutes. I'm going to let it run longer on the RPi 1 B+ and then switch over to the RPi 2 code. I'll probably have that done by Monday (I haven't touched my RPi 2 code in a while). Then I guess we can close this issue and have a beer?

Thanks so much for looking into this -- I really appreciate it.

@subspclr4

subspclr4 commented Nov 22, 2015

My RPi 2 code worked as-is, so I was able to run the test overnight. Everything looks good after 11 hours for both the RPi 1 B+ and RPi 2. Should I close this issue?

@popcornmix

popcornmix (Contributor) commented Nov 23, 2015

I'll close it. The fix is in our firmware source tree, so it will be in any future firmware builds.
Thanks for helping narrow down the issue.

popcornmix closed this Nov 23, 2015

popcornmix added a commit that referenced this issue Nov 25, 2015

khronos: Include ARGB_8888_TF in fast path of mipmap generation
See: http://forum.kodi.tv/showthread.php?tid=231092&pid=2168210#pid2168210

platform: Apply disable_pvt flag later - it was getting overwritten during init
See: #497

TC358762: Disable ulps mode which can cause a hang when closing display
See: #501
@mhaberler

mhaberler commented Nov 29, 2015

@subspclr4 - very cool project (and great drill-down)!

If I may tack on a question: I am interested in the hardware-to-usercode notification path using fasync(). What latency and jitter do you observe on the "external event -> user task gets notified" path?

I could make good use of a low-latency notification path for this particular purpose: machinekit/machinekit#687. Any chance of that part (kernel module, userland code fragment) being shared?

@subspclr4

subspclr4 commented Nov 29, 2015

@mhaberler Thanks for the interest. Latency and jitter on fasync notification haven't been an issue for my application, so I haven't measured them in a while. If I remember correctly, the worst-case latency was on par with the results I got from cyclictest:

(cyclictest results screenshot)

Looking at my old notes, I unfortunately only have end-to-end plots that confound the fasync latency measurement with roughly 300 microseconds of control algorithm execution time. They show the total chain of signaling: FIQ » high-priority IRQ at specific crank angle(s) » fasync » user space control algorithm with floating point math » ioctl back to kernel » my SPI driver » CAN packet to ECU. From those plots, the upper bound on fasync latency is roughly consistent with the cyclictest results above.

I have been rethinking the entire code architecture, and fasync is actually on the chopping block. I want to explore using a plain, threaded blocking read instead of a signal, since it will be simpler, will (hopefully) give similar performance, and will be far more standard than relying on signaling.

The results are, however, still TBD. I'll let you know how it goes (and put together some plots / code). As for my current code, it's pretty much exactly what's in LDD. I am not ready to release the entire code base, but it will happen soon. My goal is to rework it into something not tied to a niche automotive application.

@mhaberler

mhaberler commented Nov 29, 2015

@subspclr4 - interesting; I was wondering about the rationale for the fasync use, which I had not seen before.

I did some loosely related work on a BeagleBone: propagating a GPIO flip to a Xenomai thread using an RTDM driver. I scoped the result -- about 5 µs until the kernel code starts executing, and about 15 µs until userland gets unblocked.

While this is Xenomai/RTDM-specific, I think a plain Linux driver with userland doing, say, a read() might not show drastically different results; and since RT-PREEMPT will likely be the go-to kernel for RT support, I am really interested in your results!

popcornmix added a commit to Hexxeh/rpi-firmware that referenced this issue Nov 30, 2015

khronos: Include ARGB_8888_TF in fast path of mipmap generation
See: http://forum.kodi.tv/showthread.php?tid=231092&pid=2168210#pid2168210

platform: Apply disable_pvt flag later - it was getting overwritten during init
See: raspberrypi/firmware#497

TC358762: Disable ulps mode which can cause a hang when closing display
See: raspberrypi/firmware#501
@subspclr4

subspclr4 commented Jan 28, 2016

@mhaberler Sorry for the delay writing back. I did run some preliminary tests last month, but I didn't see much of a difference between fasync and a blocking read() on the scope. I was hoping to refine those tests with hard latency numbers by now, but I haven't had a chance yet. Anyway, I figured it'd be good to at least mention that fasync is probably not a mechanism for improved notification performance on machinekit.

neuschaefer pushed a commit to neuschaefer/raspi-binary-firmware that referenced this issue Feb 27, 2017

khronos: Include ARGB_8888_TF in fast path of mipmap generation
See: http://forum.kodi.tv/showthread.php?tid=231092&pid=2168210#pid2168210

platform: Apply disable_pvt flag later - it was getting overwritten during init
See: raspberrypi#497

TC358762: Disable ulps mode which can cause a hang when closing display
See: raspberrypi#501