Interprocessor (&Inter OS) Communication Latency #92

Closed
hermanhermitage opened this Issue Sep 12, 2012 · 12 comments


This is a placeholder to discuss tackling the interprocessor comms latencies.
(From: #84 (comment))
"The mailbox/property interface takes about 350us to send a set_clock_rate message and get a response back.
I believe the clock changing is a small fraction of that time. I believe the majority of the time is taken by ARM side context switches."

My thoughts are that this is a major showstopper for fine-grained dispatch, and thus problematic for non-batch (scene-at-a-time) workloads. Experience with other architectures shows you can get down to low-microsecond unidirectional dispatch of jobs, and <50us when notification/sync is required.

I'm not across Linux, nor the impact of virtual indexing on ARM context switches, nor the VideoCore background workload, nor whether all work must be scheduled via tasks or whether work can be pumped directly by interrupts (e.g. dedicating one VideoCore scalar processor to executing work items scheduled straight from interrupts, or to watching a ring buffer, without the need for context switching).

It feels like the software architecture is getting in the way of what the hardware can do. The CPUs are 1 AXI clock away, right? And there is a custom mailbox (FIFO) to avoid paying SDRAM latencies on interprocessor comms?

As a starting point, is there some way I can break down and measure the elements of a typical message/response in terms of where the time is taken up?
(arm user -> context switch -> arm driver -> mailbox -> gpu interrupt -> gpu driver -> context switch -> gpu work -> ... return message...)

What's the best userland component to look into for understanding the latency?

Thanks.

Contributor

popcornmix commented Sep 13, 2012

The CPUs are not 1 AXI clock away. Peripheral accesses are typically 20-30 cycles from the GPU; I'd imagine they are longer from the ARM. Some ARM latency numbers are here:
http://www.raspberrypi.org/phpBB3/viewtopic.php?f=2&t=3042#p40366

You can measure the microsecond timer (STC) from userland with gettimeofday (or by reading it directly with mmap).
You could add a printk of that to the kernel mailbox driver when a message is sent and when the response comes back. I could add a mailbox message that stores the STC when the GPU receives the message.

That should provide some idea of where the time goes.
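
For the mmap route, here's a minimal sketch, assuming the BCM2835 system timer block at physical address 0x20003000 (the original Pi's peripheral base) with the low 32 bits of the counter, CLO, at offset 0x04; it needs root for /dev/mem:

```c
/* Minimal sketch: read the free-running 1MHz system timer (STC)
 * directly by mmapping the BCM2835 system timer block via /dev/mem.
 * Assumes the original Pi's peripheral base (0x20003000). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define ST_BASE 0x20003000  /* BCM2835 system timer block */

int main(void)
{
    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    volatile uint32_t *st = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, ST_BASE);
    if (st == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t t0 = st[1];   /* CLO at offset 0x04: microseconds since boot */
    usleep(1000);
    uint32_t t1 = st[1];
    printf("elapsed: %u us\n", (unsigned)(t1 - t0));
    return 0;
}
```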

You might also want to look into /opt/vc/bin/vcdbg. There's an option to get a gnuplot graph of GPU interrupts and task switches (again the timing unit will be STC).

Contributor

popcornmix commented Sep 16, 2012

I've measured this more closely. I've added a mailbox message that returns the STC.
I measure the STC on the ARM (in the kernel driver), send the mailbox message and read it back, then look at the time from ARM->GPU and the time from GPU->ARM. These are the (ARM->GPU, GPU->ARM) times in microseconds, measured on an idle system:

(47,64)
(52,59)
(38,74)
(21,40)
(41,74)
(21,38)
(37,71)
(21,37)
(38,77)
(20,933)
(33,76)
(21,42)
(38,75)
(22,39)
(39,75)
(22,38)
(38,76)
(22,38)
(40,76)
(21,37)
(39,74)
(22,38)
(39,74)
(22,40)
(37,77)
(22,39)

So, about 60us for a message with a response in the best case. The GPU is rather quicker than the ARM to handle the interrupt/task switch.
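
For illustration, a rough userland sketch of timing the full round trip: it assumes the /dev/vcio character device and the mailbox property ioctl used by the Raspberry Pi userland mailbox.c, and uses the standard GET_FIRMWARE_REVISION tag (0x00000001) as a cheap request. Splitting the time into ARM->GPU and GPU->ARM halves needs the STC-returning message described above, which isn't shown here.

```c
/* Rough sketch: time a full mailbox property round trip from userland.
 * Assumes /dev/vcio and the property ioctl as in the Raspberry Pi
 * userland mailbox.c; GET_FIRMWARE_REVISION serves as a cheap request. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/time.h>
#include <unistd.h>

#define IOCTL_MBOX_PROPERTY _IOWR(100, 0, char *)

int main(void)
{
    int fd = open("/dev/vcio", O_RDWR);
    if (fd < 0) { perror("open /dev/vcio"); return 1; }

    uint32_t msg[8] = {
        sizeof(msg),  /* total buffer size in bytes */
        0,            /* request code: process request */
        0x00000001,   /* tag: GET_FIRMWARE_REVISION */
        4,            /* value buffer size */
        0,            /* tag request code */
        0,            /* value buffer (response lands here) */
        0,            /* end tag */
        0
    };

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    if (ioctl(fd, IOCTL_MBOX_PROPERTY, msg) < 0) { perror("ioctl"); return 1; }
    gettimeofday(&t1, NULL);

    long us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    printf("revision %08x, round trip %ld us\n", (unsigned)msg[5], us);
    close(fd);
    return 0;
}
```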

Those are pretty positive. I'm still digging into Linux & X to get a handle on how it all hangs together; it's a complex problem, and so much of the graphics stack (on Linux) seems to have been written in a hardware vacuum.

How deep is the mailbox? Is there any storage other than SDRAM that could be used for a (shared) ring buffer?

Contributor

popcornmix commented Sep 17, 2012

Mailboxes are 8x32-bit words. I don't believe there is any shared storage apart from SDRAM/L2.
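
For illustration, a bare-metal-style sketch of how the ARM side drives that 8-entry FIFO, assuming the widely documented BCM2835 mailbox registers at 0x2000B880 (channel in the low 4 bits of each word, data in the upper 28); under Linux the kernel mailbox driver does this for you:

```c
/* Sketch of the ARM-side view of the 8-entry mailbox FIFO on BCM2835.
 * Register offsets per the widely documented 0x2000B880 block. */
#include <stdint.h>

#define MBOX_BASE   0x2000B880u
#define MBOX_READ   (*(volatile uint32_t *)(MBOX_BASE + 0x00))
#define MBOX_STATUS (*(volatile uint32_t *)(MBOX_BASE + 0x18))
#define MBOX_WRITE  (*(volatile uint32_t *)(MBOX_BASE + 0x20))
#define MBOX_FULL   0x80000000u  /* FIFO has no space for another word */
#define MBOX_EMPTY  0x40000000u  /* FIFO has nothing to read */

void mbox_write(uint8_t channel, uint32_t data28)
{
    while (MBOX_STATUS & MBOX_FULL)
        ;                        /* spin until one of the 8 slots frees up */
    MBOX_WRITE = (data28 & ~0xFu) | (channel & 0xFu);
}

uint32_t mbox_read(uint8_t channel)
{
    for (;;) {
        while (MBOX_STATUS & MBOX_EMPTY)
            ;                    /* spin until a word arrives */
        uint32_t v = MBOX_READ;
        if ((v & 0xFu) == channel)   /* skip words for other channels */
            return v & ~0xFu;
    }
}
```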

swarren commented Oct 21, 2012

Presumably there is some embedded ram within the SoC that the boot ROM uses for state before the SDRAM is initialized (and in fact I wonder if SDRAM init doesn't happen inside the firmware loaded from SD card, so quite late). Is that RAM accessible from both CPUs?

No, only from the VPU...

SDRAM initialisation is done in the second stage...

1. Boot code stored in ROM boots the VPU at reset; this loads a small bootloader (bootcode.bin) from the SD card into the L2 cache.
2. The VPU then sets up the SDRAM and loads the main application (i.e. start.elf) into SDRAM, jumping directly into the SDRAM (after flushing the cache, of course!).
3. start.elf then boots the ARM image.

Ferroin commented Oct 21, 2012

Have you tried clocking the GPU at an exact divisor/multiple of the CPU clock? I have personally found that the whole system works more efficiently when you do, so that might help some with the communication latencies.

brunogm0 commented Mar 8, 2015

Would anyone care to enable the ARM TCM on the RPi? Some patches use it on the ARM1176 for the U300 and s3c64xx.
IIRC, ITCM and DTCM are 16k each, divided into 4k zones? Also, can anyone confirm that two DMA channels exist to communicate to/from the TCMs?

swarren commented Mar 8, 2015

I believe the TCM is optional and does not exist on the RPi. At least, when someone asked me about it before and I checked the ARM registers that report TCM presence/size, those registers said there wasn't one.
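
For anyone wanting to repeat that check, a sketch of reading the ARM1176 TCM Status Register (CP15 c0, c0, 2) from privileged code, e.g. a small kernel module; field positions per the ARM1176JZF-S TRM:

```c
/* Sketch: query TCM presence on ARM1176 via the CP15 TCM Status
 * Register (MRC p15, 0, <Rd>, c0, c0, 2). Must run in a privileged
 * mode, e.g. from a kernel module; reads as zero TCMs on the RPi. */
#include <stdint.h>

static inline uint32_t read_tcm_status(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c0, c0, 2" : "=r"(v));
    return v;
}

/* Bits [18:16]: number of data TCMs; bits [2:0]: instruction TCMs. */
```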

Contributor

popcornmix commented Mar 8, 2015

Yes, there is no TCM in the RPi (nor any ARM-side DMA).

Ruffio commented Jun 24, 2015

@hermanhermitage is this still an issue?
