Hard cores #47
Comments
Do you mean those aluminum core PCBs or do you mean proprietary IP blocks in the gateware/silicon? |
The latter -- although, I was thinking more along the lines of an external ARM core connected to the FPGA (IIRC, this was discussed previously) |
This is m-labs/artiq#535. a) Everybody says "I want an ARM core" but we've always had problems figuring out exactly why or how they want to use it. But yeah. If somebody still wants them, we are open to discuss the details. |
m-labs/artiq#535 is about adding floating point instructions to the gateware mor1kx. As for putting hard CPU cores on the Sinara hardware, they would go on FMC with something like this: http://www.4dsp.com/FMC645.php |
@jordens to answer (a) here is the problem I am worrying about - if there is a solution that does not involve a hard core I am equally happy with that. Many of the experiments I am writing at the moment are running into pulse rate limitations. To give an example of a problem we have in the lab right now: Even for experiments that do not involve frequent branching, and hence allow the processor to fill up the FIFOs during the slack times, we often run into underflow errors which require padding delays. These problems are only going to get worse as experiments get more complicated, and require more branching, with more complicated decisions at the branch. My understanding (correct me if I am wrong) is that we can push the soft-core from the current 125 MHz to perhaps 200 or 250 MHz, but no further - to go faster than this we would need a hard CPU. Suggestions? |
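The underflow behaviour described above can be illustrated with a toy model in plain Python (this is not ARTIQ code). It assumes each event costs the CPU a fixed `submit_cost_ns` to post and that events sit `period_ns` apart on the timeline. Those parameter names and the functions `run_pulses` and `min_padding_ns` are hypothetical, chosen only for this sketch. Real RTIO submission cost is not constant, so this is only a rough picture of why padding delays become necessary once the CPU is slower than the pulse rate.

```python
class RTIOUnderflow(Exception):
    """Raised when an event's timestamp is already in the past at submission."""


def run_pulses(n_events, period_ns, submit_cost_ns, initial_slack_ns):
    """Simulate posting n_events timestamped events spaced period_ns apart.

    Toy model (assumption): the CPU needs a fixed submit_cost_ns per event.
    Returns the slack (event timestamp minus wall-clock time) at each submission.
    """
    wall_ns = 0                   # wall-clock time consumed by the CPU so far
    cursor_ns = initial_slack_ns  # timestamp of the next event (padding delay)
    slacks = []
    for i in range(n_events):
        wall_ns += submit_cost_ns          # CPU work needed to post this event
        slack = cursor_ns - wall_ns
        if slack < 0:
            raise RTIOUnderflow(f"event {i}: slack {slack} ns")
        slacks.append(slack)
        cursor_ns += period_ns             # advance the timeline cursor
    return slacks


def min_padding_ns(n_events, period_ns, submit_cost_ns):
    """Smallest initial padding that avoids underflow in this toy model.

    Slack at event i is padding + i*period - (i+1)*cost, so the worst case is
    the last event when cost > period, and the first event otherwise.
    """
    deficit_per_event = max(submit_cost_ns - period_ns, 0)
    return submit_cost_ns + (n_events - 1) * deficit_per_event
```

In this model a burst of 3 events at 100 ns spacing with a 150 ns submission cost needs 250 ns of padding, and the required padding grows linearly with burst length. If the submission cost is below the event spacing, the padding only has to cover the first submission.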
This FMC DSP module requires an HPC board and has either an EMIF or a gigabit SRIO interface. For deterministic access EMIF seems the better choice, but to integrate this module with Sayma we would need a detailed datasheet to check which pins are used for EMIF. On the other hand, we have experience with FPGA-DSP communication over SRIO, but that requires transceivers, which we do not have connected to the FMC. |
The DSP card is just an example of a CPU mounted on an FMC; I'm not proposing this particular one. |
One could use an ARM CPU with large on-chip SRAM, which should be more deterministic than external SDRAM, provided that 10 MB of memory is enough. https://www.renesas.com/en-eu/products/microcontrollers-microprocessors/rz/rza/rza1h.html |
Are there ways ARTIQ performance could be improved (via changes to gateware or additional hardware) other than through DMA or application-specific gateware acceleration? I'm not saying there's a need for it or that it's a sensible thing to do, I just want a sense of what it'd involve. |
AFAICS there are a bunch of different performance areas with different solutions:
|
Potentially, there is also the option of a tighter coupling of the RTIO core with the CPU, so it doesn't have to program all the RTIO CSRs through external bus transactions at every event. |
Yes. That's what I meant with "improved register layout and tweaked event submission code". |
@jordens The issues I am primarily worried about are:
I understand that the difficulty with a hard CPU is getting a low latency / high bandwidth interface, which pushes us to either an external co-processor (with potentially high latency) or an FPGA with a hard core (e.g. Zynq). It seems like there are firm advantages to using an FPGA with a hard core (low latency, decent maths performance, no need for nasty 'offload CPU' complexity for the user). Are there good technical reasons, apart from a general dislike for closed-source black boxes, to not strongly consider e.g. a Zynq? This obviously involves significant changes to the gateware, but it feels that this is not grossly different from the effort required to write a soft FPU, or write a mechanism to pass jobs to an offload CPU. What am I missing? |
@jordens Would you be willing to write a for-contract issue over on the artiq repository for "improved register layout and tweaked event submission code"? I agree with cjbe that it'd be nice not to have to use DMA everywhere to get a raw event rate out of artiq that's adequate for the bulk of experiments. It's not clear whether 'factors of a few' will be enough but I feel it might be so it seems worth exploring. |
The last point can be improved if the compiler extracts and pre-computes DMA sequences, either automatically or with another context manager that you simply wrap around the sequence (no manual/explicit preparation and replaying of DMA segments).
As mentioned or hinted above:
If the external processor has a good (synchronous, high bandwidth, low latency) external bus interface, it could potentially run the kernels better than a Zynq core does. |
With a simple bare-metal C program that sets up an edge-triggered interrupt on a pin on the hard GPIO controller and mirrors the state to a pin bound to a simple AXI-mapped register, I've previously measured ~80 ns pin-to-pin latency on a Zynq 7010. This was in a different setting without any care taken to optimise the code, but it might be useful as an upper bound. I believe ETH are getting 70-80 ns overall branching latency on a Zynq 7010 as well (end of TTL input window -> AXI -> CPU -> AXI -> TTL). The same caveat applies, though; I'm pretty sure minimising that has not been a focus there either.
Why would it not be?
The Xilinx drivers already exist (and are solid, if generally meh).
Trivial. The only interesting part would be the finer details of matching the C ABI chosen. |
Ok, for this simple task, the soft CPU may not be that different. How many bus transfers per edge was that? The RTIO core needs quite a bit, plus a read after the event is posted to check the status (underflow, etc.) that incurs a full bus round trip.
Meh, yes, and will they integrate well with the rest of the code?
Yes, that plus the usual collection of bugs and other problems that manifest themselves every single time you use software (the ARTIQ compiler, LLVM, and the unwinder) in a way it has not been used before. |
@cjbe @dtcallcock I would first like to see a diagnosis and a profile of what is actually slow and why. This renews our request from a few years ago to see test cases and actual code. This does not mean that the improvements above are not good or unneeded. It's just to ensure (and do so in a CI fashion) that there are no bugs/obvious fixes that would improve things. With little effort I had gotten around 120 ns (IIRC) of TTL round-trip latency with my old ventilator code which did hard timestamping. I have no idea how much tweaking the ETH guys applied and whether this was actually RTIO-like. They don't seem to publish their code. |
If you want to use Zynq some time in the future, I am preparing hardware which is essentially a Sayma AMC but with a Zynq UltraScale+ chip. It will have a second FMC instead of SFPs, but up to 4 SFPs can be installed on the FMC. I will keep RTM compatibility with the Sayma AMC. This board will be used for another project related to video processing, but can be used for ARTIQ as well. |
Closing this issue as: adding Zynq etc. to Sayma/Metlino is impractical at this point; and many of the above concerns should be dealt with by DMA... |
What is the current thinking on putting hard cores on the Metlino/Sayma/etc.?