Timing optimization #3

gabriel-tenma-white · 2019-06-24T17:45:36Z

I've been playing with synthesizing your FFT core for various FPGAs (I'm surveying all FPGA FFT implementations I can find). On Xilinx 7-series FPGAs I'm seeing some timing issues for the following parameters:

Device: XC7K160T
N: 1024, 24 bit input/output, 24 bit twiddle
1 sample per clock
Maximal hardware multiplier usage

The Fmax achieved was ~207MHz, but it seems to be caused by a single timing bottleneck which is insufficient pipelining in the multipliers. Two DSP48E1 are inferred per multiply (each one can only do 18x25 multiply), and the cascading path is failing timing: https://imgur.com/a/GpcUdNE
DRC is showing:

DPOP 1 Warning DSP fft/stage_1024/HWBFLY.bfly/CKPCE_ONE.rp_one0 output fft/stage_1024/HWBFLY.bfly/CKPCE_ONE.rp_one0/P[47:0] is not pipelined (PREG=0). Pipelining the DSP48 output will improve performance and often saves power so it is suggested whenever possible to fully pipeline this function. If this DSP48 function was inferred, it is suggested to describe an additional register stage after this function. If the DSP48 was instantiated in the design, it is suggested to set the PREG attribute to 1.

There just needs to be one more register after the multiply, but I'm really not familiar with Verilog and not sure what else needs to be changed after adding the delay. Do you want to look into fixing this and see what the actual Fmax your design can achieve is?

Also the generated code is broken when output reorderer is disabled, I had to edit the generated main file and connect the last stage to the output.

Related: check out my FPGA FFT implementation (also programmatically generated) and tell me what you think.

ZipCPU · 2019-06-24T23:52:45Z

Ahm, okay, wow, 207MHz?? That's AWESOME! I'm glad you've been able to run the FFT that fast.

There are two "official" answer to the issue of running the FFT faster.

Keep the internal bitwidth from expanding, by setting a maximum bitwidth at or less than 18 bits. (Might need to be 16 or 17-bits) so that only a single DSP is used per multiply.
The two samples per clock option was specifically built for the purpose of being able to handle larger and larger FFTs but at lower clock rates.

Actually adjusting the logic so that the multiplies are pipelined one into the next has a couple problems. I doubt any are unsolvable, but they'd need some work:

Not all solutions would want to split the multiplies into two, so an option would need to be presented to decide when to do this and when not to.
The code itself isn't necessarily that complicated, although my own example breaks a single multiply into four instead of two.
There's also the pipeline hassle that would need to be dealt with. A multiply that takes more clocks needs to be properly scheduled so that the result is able to be matched up with the rest of the butterfly.

So, it's doable, but it would take some work to get it done right.

Now, can you tell me what happened without the output re-order, and why you think it's broken?

Dan

gabriel-tenma-white · 2019-06-25T05:13:51Z

207MHz on a Kintex-7 isn't considered fast ;) I'm pretty sure you can get your design to meet timing at double that if you do a little bit of timing whack-a-mole ;) A 2-sample-per-cycle core would double the throughput but also increase resource usage, it would be much nicer to get that 2x Fmax boost too.

The code currently describes a single multiplier. It is vivado that is splitting that into two multiplies. However vivado is saying that if you simply describe one more register on the output of the RTL multiply, it will insert the right pipeline registers at the right places. I would say maybe add a configurable option to enable an extra register. The way I'm planning to deal with this is to have many multiplier implementations that the user can choose from. Currently I have two implementations, a simple and naive straightforward description (that infers well for multiplies below 25x18), and a hand optimized one designed specifically for 25x35 complex multiplies using 8 DSP48E1s in Xilinx devices with minimal LUT usage.

Yes the rest of the logic needs to be delayed to accommodate the additional multiplier delay, which is the part I'm unsure about. In my implementation I have a "phase" signal that goes through the core and is used to coordinate everything. When there are pipeline delays I would simply subtract the delay from phase and pass on the new phase signal downstream (with registers of course). Since phase is monotonic I can do tricks like phase_out <= phase+1 when rising_edge(clk) which adds a register but keeps the phase value unchanged.

The code just has a compilation error when output reorder is disabled. The br_o_result signal is undeclared and nothing wires br_sample to the output.

ZipCPU · 2019-07-09T00:27:00Z

I fixed the bug when no bit-reversal stage was present. Feel free to inspect the latest code.

ZipCPU closed this as completed Jul 9, 2019

Skriba98 mentioned this issue May 24, 2021

Don't care output #10

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Timing optimization #3

Timing optimization #3

gabriel-tenma-white commented Jun 24, 2019 •

edited

Loading

ZipCPU commented Jun 24, 2019

gabriel-tenma-white commented Jun 25, 2019

ZipCPU commented Jul 9, 2019

Timing optimization #3

Timing optimization #3

Comments

gabriel-tenma-white commented Jun 24, 2019 • edited Loading

ZipCPU commented Jun 24, 2019

gabriel-tenma-white commented Jun 25, 2019

ZipCPU commented Jul 9, 2019

gabriel-tenma-white commented Jun 24, 2019 •

edited

Loading