Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timing optimization #3

Closed
gabriel-tenma-white opened this issue Jun 24, 2019 · 3 comments
Closed

Timing optimization #3

gabriel-tenma-white opened this issue Jun 24, 2019 · 3 comments

Comments

@gabriel-tenma-white
Copy link

gabriel-tenma-white commented Jun 24, 2019

I've been playing with synthesizing your FFT core for various FPGAs (I'm surveying all FPGA FFT implementations I can find). On Xilinx 7-series FPGAs I'm seeing some timing issues for the following parameters:

  • Device: XC7K160T
  • N: 1024, 24 bit input/output, 24 bit twiddle
  • 1 sample per clock
  • Maximal hardware multiplier usage

The Fmax achieved was ~207MHz, but it seems to be caused by a single timing bottleneck which is insufficient pipelining in the multipliers. Two DSP48E1 are inferred per multiply (each one can only do 18x25 multiply), and the cascading path is failing timing: https://imgur.com/a/GpcUdNE
DRC is showing:

DPOP 1 Warning DSP fft/stage_1024/HWBFLY.bfly/CKPCE_ONE.rp_one0 output fft/stage_1024/HWBFLY.bfly/CKPCE_ONE.rp_one0/P[47:0] is not pipelined (PREG=0). Pipelining the DSP48 output will improve performance and often saves power so it is suggested whenever possible to fully pipeline this function. If this DSP48 function was inferred, it is suggested to describe an additional register stage after this function. If the DSP48 was instantiated in the design, it is suggested to set the PREG attribute to 1.

There just needs to be one more register after the multiply, but I'm really not familiar with Verilog and not sure what else needs to be changed after adding the delay. Do you want to look into fixing this and see what the actual Fmax your design can achieve is?

Also the generated code is broken when output reorderer is disabled, I had to edit the generated main file and connect the last stage to the output.

Related: check out my FPGA FFT implementation (also programmatically generated) and tell me what you think.

@ZipCPU
Copy link
Owner

ZipCPU commented Jun 24, 2019

Ahm, okay, wow, 207MHz?? That's AWESOME! I'm glad you've been able to run the FFT that fast.

There are two "official" answer to the issue of running the FFT faster.

  1. Keep the internal bitwidth from expanding, by setting a maximum bitwidth at or less than 18 bits. (Might need to be 16 or 17-bits) so that only a single DSP is used per multiply.

  2. The two samples per clock option was specifically built for the purpose of being able to handle larger and larger FFTs but at lower clock rates.

Actually adjusting the logic so that the multiplies are pipelined one into the next has a couple problems. I doubt any are unsolvable, but they'd need some work:

  1. Not all solutions would want to split the multiplies into two, so an option would need to be presented to decide when to do this and when not to.

  2. The code itself isn't necessarily that complicated, although my own example breaks a single multiply into four instead of two.

  3. There's also the pipeline hassle that would need to be dealt with. A multiply that takes more clocks needs to be properly scheduled so that the result is able to be matched up with the rest of the butterfly.

So, it's doable, but it would take some work to get it done right.

Now, can you tell me what happened without the output re-order, and why you think it's broken?

Dan

@gabriel-tenma-white
Copy link
Author

207MHz on a Kintex-7 isn't considered fast ;) I'm pretty sure you can get your design to meet timing at double that if you do a little bit of timing whack-a-mole ;) A 2-sample-per-cycle core would double the throughput but also increase resource usage, it would be much nicer to get that 2x Fmax boost too.

The code currently describes a single multiplier. It is vivado that is splitting that into two multiplies. However vivado is saying that if you simply describe one more register on the output of the RTL multiply, it will insert the right pipeline registers at the right places. I would say maybe add a configurable option to enable an extra register. The way I'm planning to deal with this is to have many multiplier implementations that the user can choose from. Currently I have two implementations, a simple and naive straightforward description (that infers well for multiplies below 25x18), and a hand optimized one designed specifically for 25x35 complex multiplies using 8 DSP48E1s in Xilinx devices with minimal LUT usage.

Yes the rest of the logic needs to be delayed to accommodate the additional multiplier delay, which is the part I'm unsure about. In my implementation I have a "phase" signal that goes through the core and is used to coordinate everything. When there are pipeline delays I would simply subtract the delay from phase and pass on the new phase signal downstream (with registers of course). Since phase is monotonic I can do tricks like phase_out <= phase+1 when rising_edge(clk) which adds a register but keeps the phase value unchanged.

The code just has a compilation error when output reorder is disabled. The br_o_result signal is undeclared and nothing wires br_sample to the output.

@ZipCPU
Copy link
Owner

ZipCPU commented Jul 9, 2019

I fixed the bug when no bit-reversal stage was present. Feel free to inspect the latest code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants