-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Timing optimization #3
Comments
Ahm, okay, wow, 207MHz?? That's AWESOME! I'm glad you've been able to run the FFT that fast. There are two "official" answer to the issue of running the FFT faster.
Actually adjusting the logic so that the multiplies are pipelined one into the next has a couple problems. I doubt any are unsolvable, but they'd need some work:
So, it's doable, but it would take some work to get it done right. Now, can you tell me what happened without the output re-order, and why you think it's broken? Dan |
207MHz on a Kintex-7 isn't considered fast ;) I'm pretty sure you can get your design to meet timing at double that if you do a little bit of timing whack-a-mole ;) A 2-sample-per-cycle core would double the throughput but also increase resource usage, it would be much nicer to get that 2x Fmax boost too. The code currently describes a single multiplier. It is vivado that is splitting that into two multiplies. However vivado is saying that if you simply describe one more register on the output of the RTL multiply, it will insert the right pipeline registers at the right places. I would say maybe add a configurable option to enable an extra register. The way I'm planning to deal with this is to have many multiplier implementations that the user can choose from. Currently I have two implementations, a simple and naive straightforward description (that infers well for multiplies below 25x18), and a hand optimized one designed specifically for 25x35 complex multiplies using 8 DSP48E1s in Xilinx devices with minimal LUT usage. Yes the rest of the logic needs to be delayed to accommodate the additional multiplier delay, which is the part I'm unsure about. In my implementation I have a "phase" signal that goes through the core and is used to coordinate everything. When there are pipeline delays I would simply subtract the delay from phase and pass on the new phase signal downstream (with registers of course). Since phase is monotonic I can do tricks like phase_out <= phase+1 when rising_edge(clk) which adds a register but keeps the phase value unchanged. The code just has a compilation error when output reorder is disabled. The br_o_result signal is undeclared and nothing wires br_sample to the output. |
I fixed the bug when no bit-reversal stage was present. Feel free to inspect the latest code. |
I've been playing with synthesizing your FFT core for various FPGAs (I'm surveying all FPGA FFT implementations I can find). On Xilinx 7-series FPGAs I'm seeing some timing issues for the following parameters:
The Fmax achieved was ~207MHz, but it seems to be caused by a single timing bottleneck which is insufficient pipelining in the multipliers. Two DSP48E1 are inferred per multiply (each one can only do 18x25 multiply), and the cascading path is failing timing: https://imgur.com/a/GpcUdNE
DRC is showing:
There just needs to be one more register after the multiply, but I'm really not familiar with Verilog and not sure what else needs to be changed after adding the delay. Do you want to look into fixing this and see what the actual Fmax your design can achieve is?
Also the generated code is broken when output reorderer is disabled, I had to edit the generated main file and connect the last stage to the output.
Related: check out my FPGA FFT implementation (also programmatically generated) and tell me what you think.
The text was updated successfully, but these errors were encountered: