Interprocessor (& Inter-OS) Communication Latency #92

Open
hermanhermitage opened this Issue · 11 comments

6 participants

@hermanhermitage

This is a placeholder to discuss tackling the interprocessor comms latencies.
(From: #84 (comment))
"The mailbox/property interface takes about 350us to send a set_clock_rate message and get a response back.
I believe the clock changing is a small fraction of that time. I believe the majority of the time is taken by ARM side context switches."

My view is that this is a major showstopper for fine-grained dispatch, and thus problematic for non-batch (scene-at-a-time) workloads. Experience with other architectures shows you can get down to low-microsecond uni-directional dispatch of jobs, and <50us when notification/sync is required.

I'm not across Linux, nor the impact of virtual indexing on context switching on the ARM, nor the VideoCore background workload, nor whether all work must be scheduled via tasks or whether worker functions can be pumped directly by interrupts (e.g. dedicating one VideoCore scalar processor to executing work tasks scheduled straight from interrupts, or to watching a ring buffer, without the need for context switching).

It feels like software architecture is getting in the way of what the hardware can do. The CPUs are 1 AXI clock away, right? And there is a custom mailbox (FIFO) to avoid paying SDRAM latencies on interprocessor comms?

As a starting point, is there some way I can break down and measure the elements of a typical message/response in terms of where the time is taken up?
(arm user -> context switch -> arm driver -> mailbox -> gpu interrupt -> gpu driver -> context switch -> gpu work -> ... return message...)

What's the best userland component to look into for understanding the latency?

Thanks.

@popcornmix
Owner

The CPUs are not 1 AXI clock away. Peripheral accesses are typically 20-30 cycles from the GPU, and I'd imagine they are longer from the ARM. Some ARM latency numbers are here:
http://www.raspberrypi.org/phpBB3/viewtopic.php?f=2&t=3042#p40366

You can measure the microsecond timer (STC) from userland with gettimeofday (or by reading it directly with mmap).
You could add a printk of that from the kernel mailbox driver when a message is sent and when the response comes back. I could add a mailbox message that stores the STC when the GPU receives the message.

That should provide some idea of where the time goes.
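
For example, here's a minimal userland sketch of reading the STC directly via /dev/mem (the 0x20003000 system timer base and the 0x04 CLO offset are from the BCM2835 peripherals datasheet; it needs root):

```c
/* Minimal sketch: reading the free-running 1MHz STC from userland by
 * mmap()ing the BCM2835 system timer through /dev/mem. Base address and
 * CLO offset are from the BCM2835 peripherals datasheet. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define ST_BASE 0x20003000  /* BCM2835 system timer (physical address) */
#define ST_CLO  0x04        /* lower 32 bits of the 1MHz counter */

int main(void)
{
    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    volatile uint32_t *st = mmap(NULL, 4096, PROT_READ, MAP_SHARED,
                                 fd, ST_BASE);
    if (st == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t t0 = st[ST_CLO / 4];
    /* ... issue the mailbox transaction here ... */
    uint32_t t1 = st[ST_CLO / 4];
    printf("elapsed: %u us\n", t1 - t0);

    munmap((void *)st, 4096);
    close(fd);
    return 0;
}
```

Bracketing the mailbox transaction with two of those reads gives you the round-trip time without any syscall overhead in the measurement itself.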

You might also want to look into /opt/vc/bin/vcdbg. There's an option to get a gnuplot graph of GPU interrupts and task switches (again the timing unit will be STC).

@popcornmix
Owner

I've measured this more closely. I've added a mailbox message that returns the STC.
I measure the STC on the ARM (in the kernel driver), send the mailbox message, then read it back, and look at the time from ARM->GPU and the time from GPU->ARM. These are the times in microseconds I measured on an idle system:

(47,64)
(52,59)
(38,74)
(21,40)
(41,74)
(21,38)
(37,71)
(21,37)
(38,77)
(20,933)
(33,76)
(21,42)
(38,75)
(22,39)
(39,75)
(22,38)
(38,76)
(22,38)
(40,76)
(21,37)
(39,74)
(22,38)
(39,74)
(22,40)
(37,77)
(22,39)

So, about 60us for a message with response in the best case. The GPU is rather quicker than the ARM at handling the interrupt/task switch.
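
For anyone wanting to reproduce this, the shape of the measurement looks roughly like the sketch below. The helper names (read_stc, mbox_send, mbox_recv) are stand-ins for whatever the kernel driver actually provides, and the GPU is assumed to stamp its own STC into the response payload; since both sides read the same 1MHz timer, the differences are directly comparable.

```c
/* Hypothetical sketch of the round-trip measurement inside the ARM
 * kernel mailbox driver. read_stc(), mbox_send() and mbox_recv() are
 * stand-ins for the driver's real timer-read and mailbox helpers. */
#include <linux/kernel.h>
#include <linux/types.h>

extern u32 read_stc(void);                /* hypothetical */
extern void mbox_send(int chan, u32 msg); /* hypothetical */
extern u32 mbox_recv(int chan);           /* hypothetical */

void measure_mbox_round_trip(int chan, u32 msg)
{
    u32 t_send, t_gpu, t_recv;

    t_send = read_stc();          /* ARM timestamp before sending  */
    mbox_send(chan, msg);         /* request carrying the STC tag  */
    t_gpu  = mbox_recv(chan);     /* GPU-stamped STC from response */
    t_recv = read_stc();          /* ARM timestamp after response  */

    printk(KERN_INFO "mbox: arm->gpu %u us, gpu->arm %u us\n",
           t_gpu - t_send, t_recv - t_gpu);
}
```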

@hermanhermitage

Those numbers are pretty positive. I'm still digging into Linux & X to get a handle on how it all hangs together; it's a complex problem, and so much of the graphics stack (on Linux) seems to have been written in a hardware vacuum.

How deep is the mailbox? Is there any storage other than SDRAM that could be used for a (shared) ring buffer?

@popcornmix
Owner

Mailboxes are 8x32-bit words. I don't believe there is any shared storage apart from SDRAM/L2.
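
So a shared ring buffer would have to live in SDRAM/L2, with the 8-word mailbox used only as a doorbell. A minimal single-producer/single-consumer sketch of that layout (the names are illustrative, and real code would also need the right cache aliases or flushes for memory shared between the ARM and the GPU):

```c
/* Illustrative SPSC ring buffer in shared SDRAM; the mailbox is used
 * only to notify the other side that the ring has changed. */
#include <stdint.h>

#define RING_SLOTS 64                   /* power of two */

struct ring {
    volatile uint32_t head;             /* written by producer only */
    volatile uint32_t tail;             /* written by consumer only */
    uint32_t slot[RING_SLOTS];          /* job descriptors (bus addresses) */
};

/* Producer side (e.g. ARM): 0 on success, -1 if the ring is full. */
static int ring_push(struct ring *r, uint32_t job)
{
    uint32_t head = r->head;
    if (head - r->tail == RING_SLOTS)
        return -1;                      /* full */
    r->slot[head % RING_SLOTS] = job;
    __sync_synchronize();               /* publish slot before head */
    r->head = head + 1;
    return 0;                           /* then ring the mailbox doorbell */
}

/* Consumer side (e.g. GPU): 0 on success, -1 if empty. */
static int ring_pop(struct ring *r, uint32_t *job)
{
    uint32_t tail = r->tail;
    if (tail == r->head)
        return -1;                      /* empty */
    __sync_synchronize();               /* see head before slot data   */
    *job = r->slot[tail % RING_SLOTS];
    __sync_synchronize();               /* finish read before freeing  */
    r->tail = tail + 1;
    return 0;
}
```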

@swarren

Presumably there is some embedded RAM within the SoC that the boot ROM uses for state before the SDRAM is initialized (in fact, I wonder whether SDRAM init happens inside the firmware loaded from the SD card, so quite late). Is that RAM accessible from both CPUs?

@ghollingworth
@Ferroin

Have you tried clocking the GPU so that it runs at an exact divisor/multiple of the CPU clock? I have personally found that the whole system works more efficiently when you do, so that might help somewhat with the communication latencies.
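
For reference, that sort of ratio is set through the firmware's config.txt; something like the fragment below keeps the core clock at an exact 2:1 divisor of the ARM clock (the arm_freq/core_freq options are standard, but the specific values are just an example, not a recommendation):

```
# Illustrative config.txt fragment: GPU core clock at an exact 2:1
# divisor of the ARM clock. Values are examples only.
arm_freq=800
core_freq=400
```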

@brunogm0

Would anyone care to try enabling the ARM TCM on the RPi? Some patches use it on the ARM1176 for the U300 and s3c64xx.
IIRC the ITCM and DTCM are 16k total each, divided into 4k zones? Also, can anyone confirm that two DMA channels exist to communicate to/from the TCMs?

@swarren

I believe the TCM is optional, and does not exist in the RPi. At least, when someone asked me about it before and I checked the ARM registers that report the TCM presence/size, those registers said there wasn't one.
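
For anyone who wants to repeat that check: on the ARM1176 the TCM presence is reported by the CP15 TCM Status register, readable only from a privileged mode (i.e. in the kernel). A minimal sketch (the field positions are my reading of the ARM1176JZF-S TRM, so treat them as assumptions):

```c
/* Minimal kernel-mode sketch: read the ARM1176 CP15 TCM Status register
 * (c0, c0, 2), which reports how many ITCM and DTCM banks are present.
 * Zero in both fields means no TCM, as reported for the RPi. */
#include <stdint.h>

static inline uint32_t read_tcm_status(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c0, c0, 2" : "=r"(v));
    return v;
}

/* Per my reading of the ARM1176JZF-S TRM, bits [2:0] give the ITCM bank
 * count and bits [18:16] the DTCM bank count (assumed field positions). */
```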

@popcornmix
Owner

Yes, there is no TCM in the RPi (nor any ARM-side DMA).
