dual port ram changes migration #1

Open
mjao1 wants to merge 1 commit into sifferman:main from mjao1:dual-port-vreg

mjao1 commented May 15, 2026

Dual Port RAM

Using dual port RAM for certain FUs is beneficial because many operations in ternip_rms.sv and ternip_rowwise_operation.sv issue back-to-back reads and writes.

The single memory request path was replaced with a dual port memory backend, and ternip_core and the affected FUs were rewired so that reads and writes can be issued in parallel:

New memory

  • rtl/common/ternip_dual_port_mem.sv was added
  • It is a shared MEM array with two independent request/read pipelines (A and B), each using a handshake similar to that of ternip_pipelined_mem
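
To make the "one shared array, two independent pipelines" idea concrete, here is a minimal behavioral sketch in Python. It is not the RTL in ternip_dual_port_mem.sv: the class name, the request tuples, the one-cycle read latency, and the read-before-write ordering are all assumptions made for illustration, and the handshake signals are left out entirely.

```python
class DualPortMem:
    """Toy cycle-level model: one shared array, two independent ports (A and B).

    Hypothetical model for illustration only; names and timing are assumptions.
    """

    def __init__(self, depth, latency=1):
        assert latency >= 1
        self.mem = [0] * depth
        # One read-return pipeline per port; None means "no read data in this slot".
        self.pipes = {"A": [None] * latency, "B": [None] * latency}

    def step(self, req_a=None, req_b=None):
        """Advance one clock cycle.

        Each request is None, ("read", addr) or ("write", addr, data).
        Returns the read data leaving each port's pipeline this cycle (or None).
        """
        out = {}
        for port, req in (("A", req_a), ("B", req_b)):
            pipe = self.pipes[port]
            out[port] = pipe.pop()  # oldest entry leaves the pipeline
            entering = None
            if req is not None and req[0] == "read":
                entering = self.mem[req[1]]  # sample before this cycle's writes
            pipe.insert(0, entering)
        # Writes land after reads are sampled (read-before-write, an assumption).
        for req in (req_a, req_b):
            if req is not None and req[0] == "write":
                self.mem[req[1]] = req[2]
        return out


mem = DualPortMem(depth=8)
mem.step(req_a=("write", 3, 42))
mem.step(req_a=("read", 3), req_b=("write", 4, 7))  # read and write in the same cycle
first = mem.step(req_a=("read", 4))                 # data for address 3 returns on port A
second = mem.step()                                 # data for address 4 returns on port A
```

In this model a same-address write on both ports in one cycle resolves in favor of port B, which is only an artifact of the loop order; the real memory may behave differently.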

Vector register file uses dual port backend

  • ternip_vector_registers.sv now instantiates ternip_dual_port_mem instead of ternip_pipelined_mem and exposes:
    - port A: request_* + read_*
    - port B: request2_* + read2_*

Core wiring/arbitration

  • ternip_core.sv:
    - existing FU request network still drives port A (vector_request_*)
    - new port B network vector_request2_* is arbitrated between RMS and rowwise_operation
    - read2_* is currently tied off (port B is used for extra request bandwidth, primarily writes)

FU logic updated to use port B where it matters most

  • ternip_rms.sv: in NORM, reads stay on port A, writes use vector_request2_*
  • ternip_rowwise_operation.sv: in ADD/SUB/MUL/DIV/SIG/CSIG/SILU, reads stay on port A, writes use vector_request2_*
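
A rough way to see why moving writes to port B helps these phases: if every element needs one read and one write, a single request port has to alternate between them, while two ports can overlap the write of one element with the read of the next. The sketch below counts request cycles under those assumptions. The function name, the one-cycle read latency, and the "one read plus one write per element" loop shape are hypothetical simplifications; compute latency, arbitration, and handshake stalls are ignored.

```python
def phase_cycles(n_elems, dual_port, read_latency=1):
    """Count cycles for a loop that reads each element once and writes it back once.

    Hypothetical model: at most one request per port per cycle, and a result can
    be written back read_latency cycles after its read was issued.
    """
    reads_issued = 0
    writes_issued = 0
    ready = []  # ready[i] = cycle at which element i can be written back
    cycle = 0
    while writes_issued < n_elems:
        can_write = writes_issued < len(ready) and ready[writes_issued] <= cycle
        can_read = reads_issued < n_elems
        if dual_port:
            # Read on port A and write on port B in the same cycle.
            if can_write:
                writes_issued += 1
            if can_read:
                ready.append(cycle + read_latency)
                reads_issued += 1
        else:
            # One shared request path: a pending write-back takes the slot first.
            if can_write:
                writes_issued += 1
            elif can_read:
                ready.append(cycle + read_latency)
                reads_issued += 1
        cycle += 1
    return cycle


single = phase_cycles(1000, dual_port=False)  # 2 * n_elems in this model
dual = phase_cycles(1000, dual_port=True)     # n_elems + read_latency in this model
print(single, dual, single / dual)
```

In this idealized model the speedup approaches 2× as the element count grows, so it only gives an upper bound for a pure read/write streaming loop, not a prediction for any specific phase.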

Cycle count

Measured speedups (generic, Verilator)

| workload | baseline cycles | dual port cycles | end-to-end speedup |
| --- | --- | --- | --- |
| rms_tb | 3118122 | 3118122 | 1.00× |
| rowwise_operation_tb | 391050 | 365550 | 1.07× |

Measured speedups (xc7a200t_D=1024_OneCore, Verilator)

| workload | baseline cycles | dual port cycles | end-to-end speedup |
| --- | --- | --- | --- |
| rms_tb | 464386 | 412186 | 1.13× |
| rowwise_operation_tb | 287250 | 237450 | 1.21× |

Measured speedups (xc7a200t_D=1024_MaxCores, Verilator)

| workload | baseline cycles | dual port cycles | end-to-end speedup |
| --- | --- | --- | --- |
| rms_tb | 5392277 | 4950317 | 1.09× |
| rowwise_operation_tb | 3641850 | 3233550 | 1.13× |

Measured speedups (xcu250_D=1024_OneCore, Verilator)

| workload | baseline cycles | dual port cycles | end-to-end speedup |
| --- | --- | --- | --- |
| rms_tb | 381301 | 329101 | 1.16× |
| rowwise_operation_tb | 229950 | 180150 | 1.28× |

Measured speedups (xcu250_D=1024_MaxCores, Verilator)

| workload | baseline cycles | dual port cycles | end-to-end speedup |
| --- | --- | --- | --- |
| rms_tb | 714834 | 606954 | 1.18× |
| rowwise_operation_tb | 454050 | 352950 | 1.29× |

Measured speedups (xcu250_D=2048_OneCore, Verilator)

| workload | baseline cycles | dual port cycles | end-to-end speedup |
| --- | --- | --- | --- |
| rms_tb | 1327676 | 1117508 | 1.19× |
| rowwise_operation_tb | 901700 | 698550 | 1.29× |

Measured speedups (xcu250_D=2048_MaxCores, Verilator)

| workload | baseline cycles | dual port cycles | end-to-end speedup |
| --- | --- | --- | --- |
| rms_tb | 1327676 | 1117508 | 1.19× |
| rowwise_operation_tb | 901700 | 698550 | 1.29× |

Measured speedups (xcu250_D=2560_OneCore, Verilator)

| workload | baseline cycles | dual port cycles | end-to-end speedup |
| --- | --- | --- | --- |
| rms_tb | 1628574 | 1368190 | 1.19× |
| rowwise_operation_tb | 1125800 | 871350 | 1.29× |

Measured speedups (xcu250_D=2560_MaxCores, Verilator)

| workload | baseline cycles | dual port cycles | end-to-end speedup |
| --- | --- | --- | --- |
| rms_tb | 1628574 | 1368190 | 1.19× |
| rowwise_operation_tb | 1125800 | 871350 | 1.29× |

(Excluding generic)
Average rms speedup: 1.165×
Average rowwise speedup: 1.259×
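
For anyone who wants to re-derive these numbers, the sketch below recomputes the per-configuration speedups and their averages from the cycle counts in the tables above. The averages appear to be the arithmetic mean of the speedups after rounding each to two decimals; that is an inference from the numbers, not something stated in this PR, and the helper names are made up for illustration.

```python
# (baseline cycles, dual port cycles) copied from the tables above, generic excluded.
RMS = [
    (464386, 412186), (5392277, 4950317),    # xc7a200t D=1024 OneCore, MaxCores
    (381301, 329101), (714834, 606954),      # xcu250 D=1024 OneCore, MaxCores
    (1327676, 1117508), (1327676, 1117508),  # xcu250 D=2048 OneCore, MaxCores
    (1628574, 1368190), (1628574, 1368190),  # xcu250 D=2560 OneCore, MaxCores
]
ROWWISE = [
    (287250, 237450), (3641850, 3233550),
    (229950, 180150), (454050, 352950),
    (901700, 698550), (901700, 698550),
    (1125800, 871350), (1125800, 871350),
]


def speedup(baseline, dual):
    """End-to-end speedup as baseline cycles divided by dual port cycles."""
    return baseline / dual


def mean_rounded_speedup(pairs):
    """Mean of the per-configuration speedups, each rounded to two decimals first."""
    rounded = [round(speedup(b, d), 2) for b, d in pairs]
    return sum(rounded) / len(rounded)


rms_avg = mean_rounded_speedup(RMS)
rowwise_avg = mean_rounded_speedup(ROWWISE)
print(f"rms: {rms_avg:.3f}  rowwise: {rowwise_avg:.3f}")
```

Averaging the unrounded ratios instead gives a slightly different third decimal, so the rounding order matters if these figures are quoted elsewhere.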

Per-FU phase speedups (xcu250_D=1024_OneCore, Verilator)

| phase | baseline cycles | dual port cycles | speedup |
| --- | --- | --- | --- |
| RMS NORM | 111,360 | 59,160 | 1.88× |
| Rowwise ADD | 25,800 | 13,050 | 1.98× |
| Rowwise SUB | 25,800 | 13,050 | 1.98× |
| Rowwise MUL | 19,450 | 13,200 | 1.47× |
| Rowwise SIG | 12,800 | 6,800 | 1.88× |
| Rowwise CSIG | 12,800 | 6,800 | 1.88× |
| Rowwise SILU | 13,000 | 6,950 | 1.87× |
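
As a consistency check, the cycles saved in these phases can be compared against the end-to-end savings for the same configuration (xcu250_D=1024_OneCore). The sketch below does that subtraction with the numbers from the tables; the dictionary layout and helper name are just for illustration. It assumes each listed phase contributes once to its testbench's total, which this PR does not state explicitly.

```python
# (baseline, dual port) cycle counts for xcu250_D=1024_OneCore, copied from the tables.
PHASES = {
    "RMS NORM": (111360, 59160),
    "Rowwise ADD": (25800, 13050),
    "Rowwise SUB": (25800, 13050),
    "Rowwise MUL": (19450, 13200),
    "Rowwise SIG": (12800, 6800),
    "Rowwise CSIG": (12800, 6800),
    "Rowwise SILU": (13000, 6950),
}
END_TO_END = {
    "rms_tb": (381301, 329101),
    "rowwise_operation_tb": (229950, 180150),
}


def cycles_saved(pair):
    """Baseline cycles minus dual port cycles."""
    baseline, dual = pair
    return baseline - dual


rms_phase_saved = cycles_saved(PHASES["RMS NORM"])
rowwise_phase_saved = sum(
    cycles_saved(pair) for name, pair in PHASES.items() if name.startswith("Rowwise")
)

print("rms:", rms_phase_saved, "vs end-to-end", cycles_saved(END_TO_END["rms_tb"]))
print("rowwise:", rowwise_phase_saved, "vs end-to-end",
      cycles_saved(END_TO_END["rowwise_operation_tb"]))
```

With these inputs the per-phase savings add up to the end-to-end savings for both testbenches, which suggests the listed phases account for the whole improvement in this configuration.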

Timing

xcu250_D=1024_OneCore

| metric | single port | dual port | diff |
| --- | --- | --- | --- |
| Clock period constraint | 3.333 ns | 3.333 ns | |
| Target frequency | 300.03 MHz | 300.03 MHz | |
| WNS | 0.038 ns | 0.014 ns | -0.024 ns |
| TNS | 0.000 ns | 0.000 ns | |
| Setup failing endpoints | 0 | 0 | |
| WHS | 0.013 ns | 0.010 ns | -0.003 ns |
| THS | 0.000 ns | 0.000 ns | |
| Hold failing endpoints | 0 | 0 | |
| WPWS | 1.124 ns | 1.124 ns | |
| TPWS | 0.000 ns | 0.000 ns | |
| Timing status | Met | Met | |

xc7a200t_D=1024_OneCore

| metric | single port | dual port | diff |
| --- | --- | --- | --- |
| Clock period constraint | 10.000 ns | 10.000 ns | |
| Target frequency | 100.000 MHz | 100.000 MHz | |
| WNS | 0.715 ns | 0.848 ns | +0.133 ns |
| TNS | 0.000 ns | 0.000 ns | |
| Setup failing endpoints | 0 | 0 | |
| WHS | 0.065 ns | 0.065 ns | |
| THS | 0.000 ns | 0.000 ns | |
| Hold failing endpoints | 0 | 0 | |
| WPWS | 3.950 ns | 3.950 ns | |
| TPWS | 0.000 ns | 0.000 ns | |
| Timing status | Met | Met | |
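
To put the WNS changes in perspective, a common rule of thumb estimates the highest clock frequency the implemented design could meet as 1 / (clock period − WNS). The sketch below applies that estimate to the numbers above; the function name is hypothetical, and the result is only a first-order estimate, since a rerun at a tighter constraint can place and route differently.

```python
def fmax_mhz(period_ns, wns_ns):
    """First-order Fmax estimate in MHz from the clock period constraint and WNS."""
    return 1000.0 / (period_ns - wns_ns)


# (clock period ns, single port WNS ns, dual port WNS ns) from the timing tables.
TIMING = {
    "xcu250_D=1024_OneCore": (3.333, 0.038, 0.014),
    "xc7a200t_D=1024_OneCore": (10.000, 0.715, 0.848),
}

for config, (period, wns_single, wns_dual) in TIMING.items():
    print(
        f"{config}: single port ~{fmax_mhz(period, wns_single):.1f} MHz, "
        f"dual port ~{fmax_mhz(period, wns_dual):.1f} MHz"
    )
```

By this estimate the xcu250 build loses a little headroom with the dual port memory and the xc7a200t build gains a little, but both stay above their target frequencies.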

Utilization

xcu250_D=1024_OneCore

| metric | single port | dual port | diff |
| --- | --- | --- | --- |
| CLB LUTs (used) | 33 208 | 33 508 | +300 |
| LUT util % | 1.92% | 1.94% | +0.02 pp |
| LUT as Logic | 28 046 | 28 346 | +300 |
| LUT as Memory (total) | 5 162 | 5 162 | |
| CLB Registers (FF) | 31 683 | 31 835 | +152 |
| FF util % | 0.92% | 0.92% | |
| CARRY8 | 1 765 | 1 765 | |
| RAMB36 tiles | 4 | 6 | +2 |
| BRAM tile util % | 0.15% | 0.22% | +0.07 pp |
| DSP48E2 | 120 | 120 | |
| DSP util % | 0.98% | 0.98% | |

xc7a200t_D=1024_OneCore

| metric | single port | dual port | diff |
| --- | --- | --- | --- |
| Slice LUTs (used) | 19 260 | 19 337 | +77 |
| LUT util % | 14.31% | 14.37% | +0.06 pp |
| LUT as Logic | 17 212 | 17 289 | +77 |
| LUT as Memory (total) | 2 048 | 2 048 | |
| Slice Registers (FF) | 9 057 | 9 328 | +271 |
| FF util % | 3.36% | 3.47% | +0.11 pp |
| Slices (used) | 5 903 | 5 822 | −81 |
| Slice util % | 17.54% | 17.30% | −0.24 pp |
| RAMB36 | 18 | 20 | +2 |
| RAMB18 | 1 | 1 | |
| BRAM tile util % | 5.07% | 5.62% | +0.55 pp |
| DSP48E1 | 120 | 120 | |
| DSP util % | 16.22% | 16.22% | |

Power

xcu250_D=1024_OneCore

| metric | single port | dual port | diff |
| --- | --- | --- | --- |
| Total on-chip (W) | 3.584 | 3.593 | +0.009 |
| Dynamic (W) | 0.628 | 0.637 | +0.009 |
| Device static (W) | 2.956 | 2.956 | |
| Clocks (W) | 0.487 | 0.489 | +0.002 |
| CLB logic (W) | 0.063 | 0.066 | +0.003 |
| Signals (W) | 0.038 | 0.040 | +0.002 |
| Block RAM (W) | 0.004 | 0.005 | +0.001 |
| DSPs (W) | 0.036 | 0.036 | |

xc7a200t_D=1024_OneCore

| metric | single port | dual port | diff |
| --- | --- | --- | --- |
| Total on-chip (W) | 0.244 | 0.232 | −0.012 |
| Dynamic (W) | 0.121 | 0.109 | −0.012 |
| Device static (W) | 0.123 | 0.123 | |
| Clocks (W) | 0.050 | 0.052 | +0.002 |
| Slice logic (W) | 0.026 | 0.023 | −0.003 |
| Signals (W) | 0.025 | 0.022 | −0.003 |
| Block RAM (W) | 0.001 | <0.001 | |
| DSPs (W) | 0.019 | 0.012 | −0.007 |

Routing

xcu250_D=1024_OneCore

| metric | single port | dual port | diff |
| --- | --- | --- | --- |
| Logical nets | 185232 | 185540 | +308 |
| Nets not needing routing | 117989 | 118271 | +282 |
| Internally routed nets | 100252 | 100534 | +282 |
| Nets with no loads | 15824 | 15824 | |
| Implicitly routed ports | 1913 | 1913 | |
| Routable nets | 67243 | 67269 | +26 |
| Fully routed nets | 67243 | 67269 | +26 |
| Nets with routing errors | 0 | 0 | |

xc7a200t_D=1024_OneCore

| metric | single port | dual port | diff |
| --- | --- | --- | --- |
| Logical nets | 54081 | 54626 | +545 |
| Nets not needing routing | 18972 | 19235 | +263 |
| Internally routed nets | 17312 | 17575 | +263 |
| Nets with no loads | 892 | 892 | |
| Implicitly routed ports | 768 | 768 | |
| Routable nets | 35109 | 35391 | +282 |
| Fully routed nets | 35109 | 35391 | +282 |
| Nets with routing errors | 0 | 0 | |

Note: Single port baselines were measured from ternary_matmul commit 8ddfcc6378f4aeb1daa2b684609b62b91bf13c8d
