
Hard cores #47

Closed
hartytp opened this issue Nov 1, 2016 · 21 comments

Comments

@hartytp
Collaborator

hartytp commented Nov 1, 2016

What is the current thinking on putting hard cores on the Metlino/Sayma/etc.?

@jordens
Member

jordens commented Nov 1, 2016

Do you mean those aluminum core PCBs or do you mean proprietary IP blocks in the gateware/silicon?

@hartytp
Collaborator Author

hartytp commented Nov 1, 2016

The latter -- although I was thinking more along the lines of an external ARM core connected to the FPGA (IIRC, this was discussed previously).

@jordens
Member

jordens commented Nov 1, 2016

This is m-labs/artiq#535.
There are numerous issues that make "putting hard cores" onto those boards very difficult. An incomplete list:

a) Everybody says "I want an ARM core" but we've always had problems figuring out exactly why or how they want to use it.
b) They have high latencies.
c) Zynq-style silicon also takes away ethernet, RAM etc. from the fabric. That sabotages our design.
d) It's unclear whether our dual CPU architecture translates at all.
e) It's too late in the process to incorporate it.
f) Gateware accelerated solutions for those DSP applications are likely better solutions for the problems at hand.

But yeah. If somebody still wants them, we are open to discussing the details.

@sbourdeauducq
Member

m-labs/artiq#535 is about adding floating point instructions to the gateware mor1kx.

As for putting hard CPU cores on the Sinara hardware, they would go on an FMC, with something like this: http://www.4dsp.com/FMC645.php
But yes, someone has to figure out how to interact with it and what it is supposed to do.

@cjbe
Member

cjbe commented Nov 2, 2016

@jordens To answer (a), here is the problem I am worried about; if there is a solution that does not involve a hard core, I am equally happy with that.

Many of the experiments I am writing at the moment are running into pulse rate limitations.
For release 2.0, the sustained pulse rate for a single TTL (as per the PulseRate test) corresponds to ~600 ns per pulse. This means that by the time one has a little more control flow in the loop and uses 5-10 TTLs, the minimum loop execution time is ~10 µs.

To give an example of a problem we have in the lab right now:
We want to spin in a tight loop of state prep, excite, and branch if there was a detection event, otherwise repeat. This involves ~10 TTL signals. Our apparatus would allow us to repeat this loop every ~1 µs, but ARTIQ limits us to ≳10 µs.
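
To make this concrete, here is a simplified sketch of the kind of loop I mean (fewer TTL channels than the real sequence; the device names pump, excite and pmt are placeholders, and the gate_rising()/count() calls assume the ARTIQ 2.x TTLInOut API):

```python
from artiq.experiment import *

class HeraldedLoop(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("pump")    # TTLOut for state preparation (placeholder name)
        self.setattr_device("excite")  # TTLOut for the excitation pulse (placeholder name)
        self.setattr_device("pmt")     # TTLInOut wired to the detector (placeholder name)

    @kernel
    def run(self):
        self.core.reset()
        while True:
            self.pump.pulse(300*ns)       # state preparation
            self.excite.pulse(100*ns)     # excitation
            self.pmt.gate_rising(300*ns)  # detection window
            if self.pmt.count() > 0:      # branch on a detection event
                break
            delay(5*us)  # padding to regain slack; in practice this is what limits the loop rate
```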

Even for experiments that do not involve frequent branching, and hence allow the processor to fill up the FIFOs during the slack times, we often run into underflow errors which require padding delays.

These problems are only going to get worse as experiments get more complicated, and require more branching, with more complicated decisions at the branch.

My understanding (correct me if I am wrong) is that we can push the soft-core from the current 125 MHz to perhaps 200 or 250 MHz, but no further - to go faster than this we would need a hard CPU.

Suggestions?

@gkasprow
Member

gkasprow commented Nov 2, 2016

This FMC DSP module requires an HPC connector and has either an EMIF or a gigabit SRIO interface.

For deterministic access, EMIF seems to be the better choice, but to integrate this module with Sayma we would need the detailed datasheet to check which pins are used for EMIF.

On the other hand, we have experience with FPGA-DSP communication over SRIO, but that requires transceivers, which we don't have connected to the FMC.

@sbourdeauducq
Member

The DSP card is just an example of a CPU mounted on an FMC; I'm not proposing this particular one.

@jordens
Member

jordens commented Nov 2, 2016

@hartytp

  • You seem to be silently assuming that branching latency is inversely proportional to CPU speed. This is most likely not the case. Branching is impacted by branch prediction and DRAM latency.
  • Your case seems to be well suited for DMA. Just prepare a DMA segment for those 10 TTL pulses and replay it using < 1 µs of CPU time (see the sketch after this list).
  • It is also unclear whether a hard CPU would help you at all. The timing could be limited by the way we have designed the RTIO register interface. And in the case of a hard CPU, the fabric-to-CPU bus and the clock domain crossings that would be required certainly will not be beneficial to the overall latency.
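
A minimal sketch of what that could look like, assuming a record/replay interface along the lines of the core_dma API in later ARTIQ releases (ttl0 is a placeholder device name):

```python
from artiq.experiment import *

class PulseBurst(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("core_dma")
        self.setattr_device("ttl0")  # placeholder TTLOut channel

    @kernel
    def record(self):
        # Record the pulse pattern once into a named DMA segment.
        with self.core_dma.record("pulses"):
            for _ in range(10):
                self.ttl0.pulse(100*ns)
                delay(100*ns)

    @kernel
    def run(self):
        self.core.reset()
        self.record()
        handle = self.core_dma.get_handle("pulses")
        self.core.break_realtime()
        # Replaying the pre-recorded segment costs far less CPU time
        # than submitting each event individually.
        self.core_dma.playback_handle(handle)
```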

@gkasprow
Member

gkasprow commented Nov 2, 2016

One can use an ARM CPU with large on-chip SRAM, which should be more deterministic than external SDRAM, provided that 10 MB of memory is enough.

https://www.renesas.com/en-eu/products/microcontrollers-microprocessors/rz/rza/rza1h.html

@dtcallcock
Member

@jordens

Are there ways ARTIQ performance could be improved (via changes to gateware or additional hardware) other than through DMA or application-specific gateware acceleration? I'm not saying there's a need for it or that it's a sensible thing to do; I just want a sense of what it'd involve.

@jordens
Member

jordens commented Nov 3, 2016

AFAICS there are a bunch of different performance areas with different solutions:

  1. Raw event rate. The best approach would be DMA (orders of magnitude), some other kind of tailored gateware support (an order of magnitude), or an improved register layout and tweaked event submission code (factors of a few).
  2. Floating point math. Add an FPU, or use/add a hard CPU. Could give 1-2 orders of magnitude speed-up if the interface doesn't eat up the gain.
  3. Integer math. Gateware support, or a faster/external CPU if the interface doesn't eat up the gain.
  4. Actual event round-trip latency (if you want to react to something). Better register layout, maybe a faster CPU if latency or bus crossings don't eat up all the advantage. This will get worse with DRTIO by maybe 100-200 ns per layer. The only other way around that is local feedback with gateware support.
  5. Experiment/kernel startup. That's the time it takes to spawn a Python process or compile the kernel. Improvements could be in the linker, or better pooling/reuse of workers and caching of kernels on the host or on the core device.

@sbourdeauducq
Member

Potentially, there is also the option of a tighter coupling of the RTIO core with the CPU, so it doesn't have to program all the RTIO CSRs through external bus transactions at every event.

@jordens
Member

jordens commented Nov 3, 2016

Yes, that's what I meant by "improved register layout and tweaked event submission code".

@cjbe
Member

cjbe commented Dec 1, 2016

@jordens The issues I am primarily worried about are:

  1. Raw event rate: I am happy with using DMA to fix the occasional tight part of a sequence. However, I am already having to add padding delays all over the place in my current experiments to fix the timing. I definitely do not want to end up using DMA for everything - this is conceptually messy, and pushes more work to the user to manage the flow.

  2. Maths: I agree that this could be handled by an external processor / FPU, modulo the added complexity for the user of marking up where different bits of code need to run.

  3. Reaction latency: A lot of this is in gateware, which can be improved. However, for all but the simplest use cases one needs to do some maths, which hurts currently.

I understand that the difficulty with a hard CPU is getting a low-latency / high-bandwidth interface, which pushes us to either an external co-processor (with potentially high latency) or an FPGA with a hard core (e.g. Zynq).

It seems like there are firm advantages to using an FPGA with a hard core (low latency, decent maths performance, no need for nasty 'offload CPU' complexity for the user). Are there good technical reasons, apart from a general dislike for closed-source black boxes, to not strongly consider e.g. a Zynq?

This obviously involves significant changes to the gateware, but it feels like this is not grossly different from the effort required to write a soft FPU, or a mechanism to pass jobs to an offload CPU.

What am I missing?

@dtcallcock
Member

@jordens Would you be willing to write a for-contract issue over on the artiq repository for "improved register layout and tweaked event submission code"? I agree with @cjbe that it'd be nice not to have to use DMA everywhere to get a raw event rate out of ARTIQ that's adequate for the bulk of experiments. It's not clear whether 'factors of a few' will be enough, but I feel it might be, so it seems worth exploring.

@sbourdeauducq
Member

I definitely do not want to end up using DMA for everything - this is conceptually messy, and pushes more work to the user to manage the flow.

The last point can be improved if the compiler extracts and pre-computes DMA sequences, either automatically or with another context manager that you simply wrap around the sequence (no manual/explicit preparation and replaying of DMA segments).
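
For illustration only, the user-facing code could then look something like this (dma_precompute is a made-up name for such a wrapper, not an existing API):

```python
@kernel
def run(self):
    self.core.reset()
    # Hypothetical wrapper: everything inside this block would be extracted
    # and pre-computed into a DMA segment by the compiler/runtime, then
    # replayed here without per-event CPU work.
    with dma_precompute(self.core_dma):
        for _ in range(10):
            self.ttl0.pulse(100*ns)
            delay(100*ns)
```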

Are there good technical reasons, apart from a general dislike for closed-source black boxes, to not strongly consider e.g. a Zynq?

As mentioned or hinted above:

  • Low-latency is unclear (do you have hard numbers on this? One could try mapping the ARTIQ RTIO core onto the AXI bus and executing C routines that represent typical kernels). The Zynq ARM cores involve clock domain transfers that can have high latency (for example, look at the latency of the GTX transceivers, which move data in a simpler way than CPU buses do).
  • The current two-CPU system (non-realtime comms and management + real-time kernel) may not be doable.
  • Some Zynq ARM cores (e.g. the "realtime" one) only run at 600 MHz.
  • Zynq is not straightforward: 1) drivers for Ethernet, UART, etc. would need to be developed; 2) the compiler will have to produce ARM instructions; 3) I expect the usual "wizard" reverse engineering, bugs, problems and workarounds for braindead designs that come with every hard Xilinx block.

which pushes us to either an external co-processor (with potentially high latency)

If the external processor has a good (synchronous, high bandwidth, low latency) external bus interface, it could potentially run the kernels better than a Zynq core does.

@dnadlinger
Member

dnadlinger commented Dec 1, 2016

Low-latency is unclear

With a simple bare-metal C program that sets up an edge-triggered interrupt on a pin on the hard GPIO controller and mirrors the state to a pin bound to a simple AXI-mapped register, I've previously measured ~80 ns pin-to-pin latency on a Zynq 7010. This was in a different setting without any care taken to optimise the code, but it might be useful as an upper bound.

I believe ETH are getting 70-80 ns overall branching latency on a Zynq 7010 as well (end of TTL input window -> AXI -> CPU -> AXI -> TTL). The same caveat applies, though; I'm pretty sure minimising that has not been a focus there either.

the current two-CPU system (non-realtime comms and management + real-time kernel) may not be doable

Why would it not be?

Zynq is not straightforward: 1) drivers for Ethernet, UART, etc. would need to be developed

The Xilinx drivers already exist (and are solid, if generally meh).

the compiler will have to produce ARM instructions

Trivial. The only interesting part would be the finer details of matching the C ABI chosen.

@sbourdeauducq
Member

sbourdeauducq commented Dec 1, 2016

With a simple bare-metal C program that sets up an edge-triggered interrupt on a pin on the hard GPIO controller and mirrors the state to a pin bound to a simple AXI-mapped register, I've previously measured ~80 ns pin-to-pin latency on a Zynq 7010.

Ok, for this simple task, the soft CPU may not be that different. How many bus transfers per edge was that? The RTIO core needs quite a few, plus a read after the event is posted to check the status (underflow, etc.), which incurs a full bus round trip.

Zynq is not straightforward: 1) drivers for Ethernet, UART, etc. would need to be developed
The Xilinx drivers already exist (and are solid, if generally meh).

Meh, yes, and will they integrate well with the rest of the code?

the compiler will have to produce ARM instructions
Trivial. The only interesting part would be the finer details of matching the C ABI chosen.

Yes, that plus the usual collection of bugs and other problems that manifest themselves every single time you use software (the ARTIQ compiler, LLVM, and the unwinder) in a way it has not been used before.

@jordens
Member

jordens commented Dec 1, 2016

@cjbe @dtcallcock I would first like to see a diagnosis and a profile of what is actually slow and why. This renews our request from a few years ago to see test cases and actual code. This does not mean that the improvements above are not good or not needed. It's just to ensure (and do so in a CI fashion) that there are no bugs/obvious fixes that would improve things.

With little effort I had gotten around 120 ns (IIRC) of TTL round-trip latency with my old ventilator code which did hard timestamping. I have no idea how much tweaking the ETH guys applied and whether this was actually RTIO-like. They don't seem to publish their code.

@gkasprow
Member

gkasprow commented Dec 3, 2016 via email

@hartytp
Collaborator Author

hartytp commented Feb 17, 2017

Closing this issue: adding Zynq etc. to Sayma/Metlino is impractical at this point, and many of the above concerns should be dealt with by DMA...

@hartytp hartytp closed this as completed Feb 17, 2017
@jbqubit jbqubit removed this from planning in dashboard-britton Jun 8, 2017
@jbqubit jbqubit added this to planning in dashboard-britton Jun 8, 2017