How to do fast and battery efficient DMA-driven SPI transfers? #992

Open
juj opened this issue May 11, 2018 · 12 comments

@juj

juj commented May 11, 2018

I am trying to do fast continuous SPI transfers out from a Pi (working on a Model 3 B and a Zero W) and while the transfers are running, I'd like to make the main CPU idle until the transfers are finished.

The BCM core has an idle frequency of 250 MHz, and when the CPU is under load, it boosts up to 400 MHz. It looks like the CPU frequency is linked to this same turbo: it idles at 600 MHz, and turbos up to 1200 MHz on Pi 3 and 1000 MHz on Zero.

Originally I was doing SPI Polled Mode transfers, busy spinning the CPU in a loop pushing bytes out to the FIFO, and reading back from it when the read bytes become available. This gave me a nice 400/6=66.67 Mbit/s transfer rate with CDIV=6, but the issue was that this busy spinning kills the battery and ties up one hardware thread, so this was not feasible on the Zero.

Then after migrating to DMA instead of Polled Mode, I see I get the same 400/6=66.67 Mbit/s transfer rate as long as I busy spin the CPU to wait until the DMA transfer is complete. However, after switching from busy spinning to actually sleeping the CPU to wait until the DMA completes, the BCM core frequency drops down to 250 MHz, and my transfer rate drops to 250/6=41.67 Mbit/s, a dramatic -37.5% reduction in SPI throughput.
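The arithmetic above can be sketched in a few lines. This is only an illustration of the relationship described in this comment (SPI clock = core clock / CDIV); the function name is made up for the example and is not part of any driver.

```python
def spi_clock_mhz(core_mhz: float, cdiv: int) -> float:
    """SPI clock obtained by dividing the core clock by CDIV."""
    return core_mhz / cdiv


turbo = spi_clock_mhz(400, 6)  # core clock while the CPU is busy
idle = spi_clock_mhz(250, 6)   # core clock while the CPU sleeps
change_pct = (idle - turbo) / turbo * 100

print(f"turbo: {turbo:.2f} MHz, idle: {idle:.2f} MHz, change: {change_pct:.1f}%")
```

Note that the relative drop depends only on the ratio of the two core clocks (250/400), not on the CDIV value chosen.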

It seems that heavy SPI activity by itself in the absence of CPU activity does not cause the BCM core to trigger itself to turbo up to increase the SPI transfer speed, but the turbo is controlled only by activity on the main CPU core.

Ideally, what I would like to achieve is to have the BCM core automatically trigger itself to turbo up whenever there exists heavy SPI activity in the FIFO (or perhaps when there are active DMA writes to the SPI TX or RX PER_MAPs ongoing?), ideally keeping the main CPU core frequency at idle, so the system would bump itself up from 600MHz/250MHz to 600MHz/400MHz when SPI transfers are ongoing.

If such a "600MHz/400MHz" turbo mode is not technically possible and the main CPU and BCM core clocks have to turbo at the same time, I'd then like the system to automatically detect SPI activity and turbo up to 1200MHz/400MHz while it is going on, while user code could still run a usleep() or a futex/mutex wait for a signal/interrupt to occur.

Are either of the above technically feasible?

As a third fallback option, it would be possible in my application to manually control turbo via some kind of hinting, if such a method might be feasible. My DMA transfers can be anywhere from a few bytes up to 480 * 320 * 2 bytes in size at a time, and before I start a DMA transfer, I could add in a hint trigger to tell the system e.g. "please keep BCM core turbo up for the next 0.7/1.3/2.5 msecs". This kind of hinting would allow the BCM core to get a breather immediately when the application does not need to do any SPI transfers, dropping back to idle to save power.
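To give a feel for how long such a hint would need to last, here is a minimal sketch that estimates a lower bound on the transfer time for a given byte count. It assumes one bit per SPI clock cycle and ignores inter-byte gaps, DMA setup and display protocol overhead, so real transfers take longer; the helper name is hypothetical.

```python
def transfer_time_ms(num_bytes: int, spi_clock_hz: float) -> float:
    """Lower-bound transfer time in ms, assuming one bit per SPI clock cycle."""
    return num_bytes * 8 / spi_clock_hz * 1000.0


FULL_FRAME_BYTES = 480 * 320 * 2  # largest transfer size mentioned above

for core_mhz in (400, 250):
    spi_hz = core_mhz * 1e6 / 6  # CDIV=6, as in the measurements above
    ms = transfer_time_ms(FULL_FRAME_BYTES, spi_hz)
    print(f"core {core_mhz} MHz: full frame takes at least {ms:.2f} ms")
```

An application could compute a value like this just before starting each DMA transfer and pass it along as the hint duration.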

My application is about implementing a power and performance efficient display driver for SPI connected displays, you can find the fbcp-ili9341 project here:


A demo video of Quake running at 60fps here.

The transfer footprint of my application ranges between long periods of heavy activity, to short bursts of heavy activity, to long periods of no activity, depending on how much pixel animation there is on screen in particular content. Ideally I'd be able to turbo up the BCM core quickly when SPI transfers are performed, and drop it back to idle when there are none.

As a workaround to avoid busy spinning the main CPU just to make the SPI transfers keep up, I have added force_turbo=1 in /boot/config.txt, and in that way, I can keep the main CPU asleep but still have the BCM core running at 400 MHz. This lets the CPU schedule other processes on the Pi Zero W to keep things running smoothly. However this is not a feasible solution, as I understand booting with force_turbo=1 irrevocably sets a "warranty void" bit on the device, and it's likely excessive to have the main CPU core run at 1200MHz (1000MHz on Zero W) even if it is sleeping idle for most of those cycles.

Any thoughts on what would be the best way to proceed? Thanks in advance!

@pelwell
Contributor

pelwell commented May 11, 2018

As you've discovered, the SPI interfaces, like the I2C, SDHOST and a few others, share the main VPU core clock - that's a fundamental hardware restriction - so when the core clock changes so do the slaved clocks. The downstream SDHOST driver gives control of its clock divisor to the VPU, but the drivers for the other interfaces just calculate their divisors based on the turbo frequency, leading to the effect you are seeing.

What is the minimum SPI clock rate you need to hit? It looks like the SPI clock divisor has to be even, but 250MHz/4 = 62.5MHz, which you could achieve by setting "core_freq=250". Setting the core clock maximum to the same value as the minimum has the effect of keeping the core clock constant, decoupling it from the ARM's turbo changes.

With regard to the "warranty void" bit, from https://www.raspberrypi.org/documentation/configuration/config-txt/overclocking.md:

NOTE: Setting any overclocking parameters to values other than those used by raspi-config may set a permanent bit within the SoC, making it possible to detect that your Pi has been overclocked. The specific circumstances where the overclock bit is set are if force_turbo is set to 1 and any of the over_voltage_* options are set to a value > 0. See the blog post on Turbo Mode for more information.

In other words, if the core can run at the chosen turbo frequency without requiring over-voltage then the over-voltage bit won't be set. So if you are prepared to lower your turbo frequency rather than raising it, you should be able to use a CDIV of 4 to hit 66.75MHz:

force_turbo=1
core_freq=267

@juj
Author

juj commented May 11, 2018

Thanks @pelwell for the quick reply, super informative as always!

What is the minimum SPI clock rate you need to hit?

In order to sustain a 60fps update rate on fast moving content, an SPI frequency of around 66MHz is needed. A CDIV=4 with 250MHz/4=62.5MHz is a decent idea, and I think that would be close enough as well, although in that mode, I'd have to prevent the core from turboing to any faster speed, or otherwise the display would not be able to keep up with e.g. a 400MHz/4=100MHz bus speed. That would then possibly impact overall performance on some rendering heavy applications, since the GPU would be prevented from running at a higher speed.

Setting the core clock maximum to the same value as the minimum has the effect of keeping the core clock constant, decoupling it from the ARM's turbo changes.

Is it possible to force the core frequency to a fixed 400 MHz, while letting the ARM CPU turbo up and down between 600MHz and 1200MHz by itself? I.e.

core_freq_min=400
core_freq=400
force_turbo=0

and run SPI with CDIV=6. (I'll try this kind of scheme out tonight.) If this lets the ARM CPU turbo up and down on its own, it might be a workable middle ground, assuming the bulk of the power consumption is due to the main ARM CPU frequency, and not as much due to the core frequency?

@pelwell
Contributor

pelwell commented May 11, 2018

Unfortunately the firmware contains a restriction that core_freq_min cannot be greater than the default minimum clock speed (250MHz), just as core_freq cannot be less than the default. There have been several occasions where it would have been useful to bypass those restrictions, so perhaps we could introduce a special case for when core_freq and core_freq_min are the same - what do you think, @popcornmix ?

@juj
Author

juj commented May 11, 2018

Unfortunately the firmware contains a restriction that core_freq_min cannot be greater than the default minimum clock speed (250MHz), just as core_freq cannot be less than the default.

Ah, gotcha. Perhaps there might be a way to allow some leeway to these, e.g. core_freq_min <= 400 and core_freq >= core_freq_min.

The downstream SDHOST driver gives control of its clock divisor to the VPU

Btw, how does SDHOST achieve this? Is this a software/firmware/fixed in hardware based thing? If it was possible to specify the SPI0 controller to seamlessly switch to CDIV=6 when BCM core goes to 400MHz turbo state, and CDIV=4 when BCM core goes to 250MHz idle state, that would allow holding a 62.5MHz - 66MHz transfer rate, which would be nice for this use case. Then the system could idle and turbo all it desires, and CDIV would just flip on the fly to target fast transfer frequencies.

Or even something like /boot/config.txt:

core_freq_min=266
core_freq=400
force_turbo=0
spi0_cdiv_idle=4
spi0_cdiv_turbo=6

which would result in a 266/4=66.5MHz SPI0 transfer rate in idle mode, and a 400/6=66.67MHz SPI0 transfer rate in turbo mode - that would be stellar for retaining flexible power saving, and would also allow general flexibility for use cases targeting other SPI0 transfer rates. (and perhaps omitting spi0_cdiv_idle and spi0_cdiv_turbo would make the system operate as before, without the firmware(?) poking at CDIV)

Would it be feasible to release SPI0 CDIV over to the firmware(?) to control this way? Or is the behavior fixed in hardware?

@pelwell
Contributor

pelwell commented May 11, 2018

It would be possible to implement something like that - it's just a matter of which processor writes to the register. For the SDHOST interface there is a mailbox message to indicate the preferred SD clock speed, and the firmware calculates the correct divisor for the current core clock, taking care never to exceed the requested value.
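The divisor calculation described here can be sketched as follows. This is not the firmware's actual code, only a minimal illustration of "pick the divisor for the current core clock without exceeding the requested clock", combined with the even-divisor restriction mentioned earlier in the thread. The function name and the 67 MHz target are assumptions made for the example.

```python
def pick_even_divisor(core_hz: int, max_spi_hz: int) -> int:
    """Smallest even divisor d such that core_hz / d <= max_spi_hz."""
    divisor = -(-core_hz // max_spi_hz)  # integer ceiling division
    if divisor % 2:
        divisor += 1                     # round odd results up to the next even value
    return max(divisor, 2)


REQUESTED_HZ = 67_000_000  # hypothetical upper bound for the SPI clock

for core_hz in (250_000_000, 400_000_000):
    d = pick_even_divisor(core_hz, REQUESTED_HZ)
    print(f"core {core_hz // 1_000_000} MHz -> CDIV={d}, SPI {core_hz / d / 1e6:.2f} MHz")
```

With these inputs the sketch selects CDIV=4 at 250 MHz and CDIV=6 at 400 MHz, matching the pair of values discussed above.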

Sadly this shared clock control only happens in the downstream SDHOST driver - upstream keeps it simple, with the same result as you've seen with SPI - and to add it to SPI we'd have to patch the upstream SPI driver - something we try to avoid.

@juj
Author

juj commented May 13, 2018

Thanks, that makes sense.

Ran some power consumption numbers to estimate how much more power such a core_freq_min=400; core_freq=400 middle ground mode would consume under idle and moderate single core'ish load (running that Quake 1 demo I've been toying with, which seems to go at about 50% of single thread CPU consumption), and got the following on Pi 3B:

with

core_freq=250
arm_freq=600
(SPI0 CDIV at 4 to transfer at 62.5 Mbps)

I got

idle: 66mA
quake load: 176mA

and then

core_freq=400
arm_freq=600
(SPI0 CDIV at 6 to transfer at 66.67 Mbps)

gave

idle: 72mA (+9.09%)
quake load: 212mA (+20.45%)

The numbers may be a bit rough: I used a cheap USB power consumption tester, integrated the consumed current over 30 minutes of operation, and then derived the average mA consumption from that.
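For reference, the derivation behind these figures can be sketched like this. The mAh reading in the example is hypothetical (chosen so that it yields the 66mA idle figure); only the mA values and percentages come from the measurements above.

```python
def average_current_ma(charge_mah: float, duration_min: float) -> float:
    """Average current from the integrated charge reported by a USB power meter."""
    return charge_mah / (duration_min / 60.0)


def percent_increase(baseline: float, value: float) -> float:
    """Relative increase of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Hypothetical meter reading: 33 mAh accumulated over a 30 minute run.
print(f"average: {average_current_ma(33.0, 30.0):.0f} mA")

# Figures reported above for core_freq=250 vs core_freq=400.
print(f"idle:  +{percent_increase(66, 72):.2f}%")
print(f"quake: +{percent_increase(176, 212):.2f}%")
```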

So if the limitation on core_freq_min and core_freq was lifted, allowing a

core_freq_min=400
core_freq=400
force_turbo=0

setting, it seems that it would consume roughly 10% to 20% more power compared to a hypothetical

core_freq_min=250
core_freq=400
force_turbo=0
spi0_cdiv_idle=4
spi0_cdiv_turbo=6

option where the firmware was able to control SPI0 CDIV on the fly, rather than locking core_freq to 400MHz even when idle. It would definitely be great to have such a live CDIV scaling mode, but I could also see the locking behavior being workable, assuming these rough numbers are in the right ballpark and the test was valid.

@pelwell
Contributor

pelwell commented May 22, 2018

@popcornmix Any thoughts on allowing a single core frequency, either above or below 250MHz, to be specified, e.g. by detecting that core_freq and core_freq_min are the same, while allowing the ARM frequency to vary with load as normal?

@popcornmix
Contributor

The only issue is that voltage will presumably be at minimum when arm freq is at 600MHz, so core_freq may not be reliable when over 250MHz.

But if we're willing to treat it as an overclock style "it may work for you" then I guess we could allow it.

@pelwell
Contributor

pelwell commented May 22, 2018

That was the idea - this will be for specialist applications, and we can document the concerns about guaranteeing the voltage is adequate.

juj added a commit to juj/fbcp-ili9341 that referenced this issue Jun 8, 2018
@juj
Author

juj commented Aug 10, 2019

Any chance Raspberry Pi 4 would support specifying desired SPI CDIV values in /boot/config.txt separately for idle state and turbo state? I am assuming here that Pi 4 would also have two power states like the Pi 3, is that right?

@juj
Author

juj commented Aug 25, 2019

Got my hands on a Pi 4B, and I observe that it does still have two performance states, with idle clock speeds of 600MHz CPU & 250MHz SoC, whereas the turbo speeds are 1500MHz CPU & 500MHz SoC (vs 250<->400 scaling on the Pi 3B).

Since the gap is now bigger, the Pi 4B needs the ability to set different SPI CDIVs for idle and turbo even more in order to get good SPI bus performance.
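One way to observe these states from user space is the `vcgencmd measure_clock core` command. The sketch below is an illustration only: it assumes the command prints a line of the form `frequency(N)=<Hz>`, and the helper names are made up for the example.

```python
import re
import subprocess


def parse_measure_clock(output: str) -> int:
    """Extract the frequency in Hz from a line such as 'frequency(1)=250000000'."""
    match = re.search(r"frequency\(\d+\)=(\d+)", output)
    if match is None:
        raise ValueError(f"unexpected vcgencmd output: {output!r}")
    return int(match.group(1))


def read_core_clock_hz() -> int:
    """Query the current core clock; only works on a Pi with vcgencmd installed."""
    result = subprocess.run(
        ["vcgencmd", "measure_clock", "core"],
        capture_output=True, text=True, check=True,
    )
    return parse_measure_clock(result.stdout)


# Parsing a sample line (assumed format), so this runs without Pi hardware.
print(parse_measure_clock("frequency(1)=250000000") // 1_000_000, "MHz")
```

Calling `read_core_clock_hz()` once while the system is idle and once under load should show the two core clock values.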

@juj
Author

juj commented Jun 14, 2020

Hello - I'd like to revisit this topic if possible? There are a lot of SPI-based peripherals that are used with the Pi, and Pi-based portable gaming devices in particular are extremely popular. Since its creation, fbcp-ili9341 has become ubiquitous for driving SPI-based displays on Pi gaming devices, with thousands of users.

There are Kickstarter projects and commercial products out there that rely on fast SPI bus speeds on the Pi. The related bug item raspberrypi/userland#440 gained 124 thumbs up requests, which is likely by a long margin the largest amount of feedback that any single bug against the Pi has ever received.

In #992 (comment) it was mentioned that the SDHOST clock divisor is already controlled by the code that governs the turbo up/down scaling. In a later comment #992 (comment) it was mentioned that it would be possible for the same behavior to be applied to the SPI bus, but that it was just not done due to implementation complexity.

Given the large userbase, I would like to ask to revisit the possibility of tackling this complexity head-on? Ideally, SPI bus speeds would also be controllable in sync with the turbo state switches. Reiterating the proposal from #992 (comment), if one could set fields in /boot/config.txt with e.g.

spi0_cdiv_idle=4
spi0_cdiv_turbo=6

that would tell the firmware which SPI bus speeds it should be applying for both power states. Then for example a default value -1 would disable this feature, for compatibility.

(Alternative specification might be to apply a desired bus speed that should not be exceeded: e.g. via spi0_max_clock=67000000, choosing the smallest divider in each power state that will not exceed the specified bus speed)

That way use cases that dedicate the SPI bus for a single peripheral could specify appropriate targets for the SPI bus speed, instead of having to severely undershoot the bus, which leads to ~ -37.5% available SPI bandwidth on the Pi3B, and -50% available SPI bandwidth on the Pi4B. This would greatly improve the display performance of all these popular Pi projects.
