dual port ram changes migration by mjao1 · Pull Request #1 · sifferman/ternip

mjao1 · 2026-05-15T21:07:17Z

Dual Port RAM

Dual port RAM benefits several FUs because many operations in ternip_rms.sv and ternip_rowwise_operation.sv issue back-to-back reads and writes, which a single memory request path has to serialize.

This PR replaces the single memory request path with a dual port memory backend, then wires ternip_core and the affected FUs so that reads and writes can be issued in parallel:

New memory

rtl/common/ternip_dual_port_mem.sv was added.
It is a single shared MEM array with two independent request/read pipelines (A and B), each using a handshake similar to that of ternip_pipelined_mem.
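
To make the two-pipeline structure concrete, here is a minimal behavioral sketch in Python (not the RTL). The class name, the (addr, write_enable, write_data) request tuple, the 2-cycle read latency, and the read-before-write ordering are all assumptions made for illustration; the actual handshake in ternip_dual_port_mem.sv may differ.

```python
from collections import deque


class DualPortMemModel:
    """Toy model: one shared array, two independent pipelined ports (A and B)."""

    def __init__(self, depth, read_latency=2):
        self.mem = [0] * depth            # single storage array shared by both ports
        self.read_latency = read_latency  # assumed pipeline depth, not taken from the RTL
        # Per-port queue of in-flight reads; each entry is [cycles_left, data].
        self.pending = {"A": deque(), "B": deque()}

    def step(self, req_a=None, req_b=None):
        """Advance one cycle.

        Each request is (addr, write_enable, write_data) or None.
        Returns the read data completing this cycle on each port (or None).
        """
        out = {"A": None, "B": None}
        # Age in-flight reads and retire the oldest one once its latency has elapsed.
        for port in ("A", "B"):
            queue = self.pending[port]
            for entry in queue:
                entry[0] -= 1
            if queue and queue[0][0] == 0:
                out[port] = queue.popleft()[1]
        # Accept at most one new request per port; reads sample before writes land.
        writes = []
        for port, req in (("A", req_a), ("B", req_b)):
            if req is None:
                continue
            addr, write_enable, write_data = req
            if write_enable:
                writes.append((addr, write_data))
            else:
                self.pending[port].append([self.read_latency, self.mem[addr]])
        for addr, write_data in writes:
            self.mem[addr] = write_data
        return out


mem = DualPortMemModel(depth=8)
mem.step(req_b=(3, True, 42))                         # port B writes 42 to address 3
mem.step(req_a=(3, False, None), req_b=(4, True, 7))  # A reads while B writes, same cycle
mem.step()
result = mem.step()                                   # the read issued on A returns here
```

The point of the model is only that both ports see the same storage while each keeps its own request and read-return pipeline, so a read on A and a write on B can be accepted in the same cycle.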

Vector register file uses dual port backend

ternip_vector_registers.sv now instantiates ternip_dual_port_mem instead of ternip_pipelined_mem and exposes:
- port A: request_* + read_*
- port B: request2_* + read2_*

Core wiring/arbitration

ternip_core.sv:
- existing FU request network still drives port A (vector_request_*)
- new port B network vector_request2_* is arbitrated between RMS and rowwise_operation
- read2_* is currently tied off (port B is used for extra request bandwidth, primarily writes)
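
The port B arbitration can be pictured as a two-input arbiter. The sketch below is a Python illustration with a fixed priority (RMS first). That priority is an assumption: this description does not state which policy ternip_core.sv uses, and the function name and return shape are hypothetical.

```python
def arbitrate_port_b(rms_request, rowwise_request):
    """Pick which FU drives vector_request2_* this cycle.

    Requests are arbitrary objects, or None when the FU is idle.
    Returns (granted_request, rms_granted, rowwise_granted).
    """
    if rms_request is not None:          # assumed priority: RMS wins when both ask
        return rms_request, True, False
    if rowwise_request is not None:
        return rowwise_request, False, True
    return None, False, False            # port B idle this cycle


both = arbitrate_port_b("rms_write", "rowwise_write")
only_rowwise = arbitrate_port_b(None, "rowwise_write")
idle = arbitrate_port_b(None, None)
```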

FU logic updated to use port B where it matters most

ternip_rms.sv: in NORM, reads stay on port A, writes use vector_request2_*
ternip_rowwise_operation.sv: in ADD/SUB/MUL/DIV/SIG/CSIG/SILU, reads stay on port A, writes use vector_request2_*
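
Why moving writes to port B helps can be seen in a toy cycle model. This is a sketch under assumptions, not the FU logic: it assumes one request per port per cycle, a fixed read latency, no compute time between read and write, and a single-port policy that prefers a pending write over the next read.

```python
def stream_cycles(n, read_latency, dual_port):
    """Cycles needed to read n elements and write n results back."""
    reads_issued = 0
    writes_done = 0
    ready_at = []  # cycle at which each read's result becomes writable
    cycle = 0
    while writes_done < n:
        write_ready = writes_done < len(ready_at) and ready_at[writes_done] <= cycle
        if dual_port:
            # Port A issues a read and port B issues a write in the same cycle.
            if reads_issued < n:
                ready_at.append(cycle + read_latency)
                reads_issued += 1
            if write_ready:
                writes_done += 1
        else:
            # One shared request slot: a ready write goes first, otherwise read.
            if write_ready:
                writes_done += 1
            elif reads_issued < n:
                ready_at.append(cycle + read_latency)
                reads_issued += 1
        cycle += 1
    return cycle


single = stream_cycles(1000, read_latency=2, dual_port=False)
dual = stream_cycles(1000, read_latency=2, dual_port=True)
print(f"single port: {single} cycles, dual port: {dual} cycles, ratio {single / dual:.2f}")
```

In this model the single-port case needs about 2n request slots and the dual-port case about n plus the read latency, so the speedup approaches but never reaches 2×, which is in line with the per-FU phase speedups reported further down.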

Cycle count

Measured speedups (generic, Verilator)

workload	baseline cycles	dual port cycles	end-to-end speedup
rms_tb	3118122	3118122	1.00×
rowwise_operation_tb	391050	365550	1.07×

Measured speedups (xc7a200t_D=1024_OneCore, Verilator)

workload	baseline cycles	dual port cycles	end-to-end speedup
rms_tb	464386	412186	1.13×
rowwise_operation_tb	287250	237450	1.21×

Measured speedups (xc7a200t_D=1024_MaxCores, Verilator)

workload	baseline cycles	dual port cycles	end-to-end speedup
rms_tb	5392277	4950317	1.09×
rowwise_operation_tb	3641850	3233550	1.13×

Measured speedups (xcu250_D=1024_OneCore, Verilator)

workload	baseline cycles	dual port cycles	end-to-end speedup
rms_tb	381301	329101	1.16×
rowwise_operation_tb	229950	180150	1.28×

Measured speedups (xcu250_D=1024_MaxCores, Verilator)

workload	baseline cycles	dual port cycles	end-to-end speedup
rms_tb	714834	606954	1.18×
rowwise_operation_tb	454050	352950	1.29×

Measured speedups (xcu250_D=2048_OneCore, Verilator)

workload	baseline cycles	dual port cycles	end-to-end speedup
rms_tb	1327676	1117508	1.19×
rowwise_operation_tb	901700	698550	1.29×

Measured speedups (xcu250_D=2048_MaxCores, Verilator)

workload	baseline cycles	dual port cycles	end-to-end speedup
rms_tb	1327676	1117508	1.19×
rowwise_operation_tb	901700	698550	1.29×

Measured speedups (xcu250_D=2560_OneCore, Verilator)

workload	baseline cycles	dual port cycles	end-to-end speedup
rms_tb	1628574	1368190	1.19×
rowwise_operation_tb	1125800	871350	1.29×

Measured speedups (xcu250_D=2560_MaxCores, Verilator)

workload	baseline cycles	dual port cycles	end-to-end speedup
rms_tb	1628574	1368190	1.19×
rowwise_operation_tb	1125800	871350	1.29×

Averages over the eight non-generic configurations (arithmetic mean of the rounded per-configuration speedups above):
Average rms speedup: 1.165×
Average rowwise speedup: 1.259×
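
For reference, the averages can be reproduced from the cycle counts in the tables. The snippet below is a small Python check; the variable and function names are illustrative and not part of the repository.

```python
# (baseline cycles, dual port cycles) per configuration, copied from the tables above.
# The generic configuration is left out, matching the averages in this description.
RMS_TB = [
    (464386, 412186),    # xc7a200t_D=1024_OneCore
    (5392277, 4950317),  # xc7a200t_D=1024_MaxCores
    (381301, 329101),    # xcu250_D=1024_OneCore
    (714834, 606954),    # xcu250_D=1024_MaxCores
    (1327676, 1117508),  # xcu250_D=2048_OneCore
    (1327676, 1117508),  # xcu250_D=2048_MaxCores
    (1628574, 1368190),  # xcu250_D=2560_OneCore
    (1628574, 1368190),  # xcu250_D=2560_MaxCores
]
ROWWISE_TB = [
    (287250, 237450), (3641850, 3233550), (229950, 180150), (454050, 352950),
    (901700, 698550), (901700, 698550), (1125800, 871350), (1125800, 871350),
]


def mean_speedup(pairs, round_first):
    """Arithmetic mean of baseline/dual; optionally round each ratio to 2 places first."""
    ratios = [base / dual for base, dual in pairs]
    if round_first:
        ratios = [round(r, 2) for r in ratios]
    return sum(ratios) / len(ratios)


rms_avg = mean_speedup(RMS_TB, round_first=True)
rowwise_avg = mean_speedup(ROWWISE_TB, round_first=True)
rms_avg_exact = mean_speedup(RMS_TB, round_first=False)
rowwise_avg_exact = mean_speedup(ROWWISE_TB, round_first=False)
print(f"rms: {rms_avg:.3f} (rounded inputs), {rms_avg_exact:.3f} (unrounded)")
print(f"rowwise: {rowwise_avg:.3f} (rounded inputs), {rowwise_avg_exact:.3f} (unrounded)")
```

Averaging the unrounded ratios gives slightly lower values (about 1.164 for rms and 1.258 for rowwise); the difference comes only from rounding each speedup to two decimals before averaging.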

Per-FU phase speedups (xcu250_D=1024_OneCore, Verilator)

phase	baseline cycles	dual port cycles	speedup
RMS NORM	111,360	59,160	1.88×
Rowwise ADD	25,800	13,050	1.98×
Rowwise SUB	25,800	13,050	1.98×
Rowwise MUL	19,450	13,200	1.47×
Rowwise SIG	12,800	6,800	1.88×
Rowwise CSIG	12,800	6,800	1.88×
Rowwise SILU	13,000	6,950	1.87×
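
The phase numbers can be cross-checked against the end-to-end counts for the same configuration (xcu250_D=1024_OneCore): the cycles saved in the listed phases should add up to the cycles saved overall. A small Python check, with illustrative variable names:

```python
# (baseline, dual port) cycles per phase, from the per-FU table above.
RMS_PHASES = {"NORM": (111360, 59160)}
ROWWISE_PHASES = {
    "ADD": (25800, 13050),
    "SUB": (25800, 13050),
    "MUL": (19450, 13200),
    "SIG": (12800, 6800),
    "CSIG": (12800, 6800),
    "SILU": (13000, 6950),
}
# End-to-end (baseline, dual port) cycles for xcu250_D=1024_OneCore.
END_TO_END = {
    "rms_tb": (381301, 329101),
    "rowwise_operation_tb": (229950, 180150),
}


def cycles_saved(phases):
    """Total cycles removed across the given phases."""
    return sum(base - dual for base, dual in phases.values())


rms_saved = cycles_saved(RMS_PHASES)
rowwise_saved = cycles_saved(ROWWISE_PHASES)
rms_total_saved = END_TO_END["rms_tb"][0] - END_TO_END["rms_tb"][1]
rowwise_total_saved = (
    END_TO_END["rowwise_operation_tb"][0] - END_TO_END["rowwise_operation_tb"][1]
)
print(rms_saved, rms_total_saved)
print(rowwise_saved, rowwise_total_saved)
```

Both sums match the end-to-end differences exactly (52,200 cycles for rms_tb and 49,800 for rowwise_operation_tb), so the listed phases account for the whole improvement. The end-to-end speedup is smaller than the per-phase speedup only because the remaining cycles of each testbench are unchanged.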

Timing

xcu250_D=1024_OneCore

metric	single port	dual port	diff
Clock period constraint	3.333 ns	3.333 ns
Target frequency	300.03 MHz	300.03 MHz
WNS	0.038 ns	0.014 ns	-0.024 ns
TNS	0.000 ns	0.000 ns
Setup failing endpoints	0	0
WHS	0.013 ns	0.010 ns	-0.003 ns
THS	0.000 ns	0.000 ns
Hold failing endpoints	0	0
WPWS	1.124 ns	1.124 ns
TPWS	0.000 ns	0.000 ns
Timing status	Met	Met

xc7a200t_D=1024_OneCore

metric	single port	dual port	diff
Clock period constraint	10.000 ns	10.000 ns
Target frequency	100.000 MHz	100.000 MHz
WNS	0.715 ns	0.848 ns	+0.133 ns
TNS	0.000 ns	0.000 ns
Setup failing endpoints	0	0
WHS	0.065 ns	0.065 ns
THS	0.000 ns	0.000 ns
Hold failing endpoints	0	0
WPWS	3.950 ns	3.950 ns
TPWS	0.000 ns	0.000 ns
Timing status	Met	Met
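
As a rough way to read the WNS numbers, the setup slack can be converted into an implied maximum frequency with the common rule of thumb 1 / (period - WNS). The Python sketch below applies it to the four results above; it is only an estimate, since place-and-route stops optimizing once the constraint is met, and the function and dictionary names are illustrative.

```python
def implied_fmax_mhz(period_ns, wns_ns):
    """Rough max frequency implied by setup slack: 1 / (period - WNS), in MHz."""
    return 1000.0 / (period_ns - wns_ns)


# (clock period in ns, WNS in ns) from the timing tables above.
TIMING = {
    ("xcu250", "single port"): (3.333, 0.038),
    ("xcu250", "dual port"): (3.333, 0.014),
    ("xc7a200t", "single port"): (10.000, 0.715),
    ("xc7a200t", "dual port"): (10.000, 0.848),
}

fmax = {key: implied_fmax_mhz(*values) for key, values in TIMING.items()}
for (device, variant), mhz in fmax.items():
    print(f"{device} {variant}: about {mhz:.1f} MHz")
```

By this estimate the dual port change costs a little over 2 MHz of headroom on xcu250 and gains a little under 2 MHz on xc7a200t, and both devices still meet their constraints.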

Utilization

xcu250_D=1024_OneCore

metric	single port	dual port	diff
CLB LUTs (used)	33 208	33 508	+300
LUT util %	1.92%	1.94%	+0.02 pp
LUT as Logic	28 046	28 346	+300
LUT as Memory (total)	5 162	5 162
CLB Registers (FF)	31 683	31 835	+152
FF util %	0.92%	0.92%
CARRY8	1 765	1 765
RAMB36 tiles	4	6	+2
BRAM tile util %	0.15%	0.22%	+0.07 pp
DSP48E2	120	120
DSP util %	0.98%	0.98%

xc7a200t_D=1024_OneCore

metric	single port	dual port	diff
Slice LUTs (used)	19 260	19 337	+77
LUT util %	14.31%	14.37%	+0.06 pp
LUT as Logic	17 212	17 289	+77
LUT as Memory (total)	2 048	2 048
Slice Registers (FF)	9 057	9 328	+271
FF util %	3.36%	3.47%	+0.11 pp
Slices (used)	5 903	5 822	−81
Slice util %	17.54%	17.30%	−0.24 pp
RAMB36	18	20	+2
RAMB18	1	1
BRAM tile util %	5.07%	5.62%	+0.55 pp
DSP48E1	120	120
DSP util %	16.22%	16.22%

Power

xcu250_D=1024_OneCore

metric	single port	dual port	diff
Total on-chip (W)	3.584	3.593	+0.009
Dynamic (W)	0.628	0.637	+0.009
Device static (W)	2.956	2.956
Clocks (W)	0.487	0.489	+0.002
CLB logic (W)	0.063	0.066	+0.003
Signals (W)	0.038	0.040	+0.002
Block RAM (W)	0.004	0.005	+0.001
DSPs (W)	0.036	0.036

xc7a200t_D=1024_OneCore

metric	single port	dual port	diff
Total on-chip (W)	0.244	0.232	−0.012
Dynamic (W)	0.121	0.109	−0.012
Device static (W)	0.123	0.123
Clocks (W)	0.050	0.052	+0.002
Slice logic (W)	0.026	0.023	−0.003
Signals (W)	0.025	0.022	−0.003
Block RAM (W)	0.001	<0.001
DSPs (W)	0.019	0.012	−0.007
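
Combining the power and cycle tables gives a rough energy-per-run comparison. The Python sketch below assumes that the reported total on-chip power holds for the whole rms_tb run and that the design runs at the constraint clock period; both are assumptions, since the power reports are tool estimates and not measurements of these testbenches.

```python
def energy_mj(power_w, cycles, period_ns):
    """Energy in millijoules for a run of `cycles` clock cycles at `period_ns`."""
    seconds = cycles * period_ns * 1e-9
    return power_w * seconds * 1e3


# rms_tb on the two D=1024 OneCore targets: (total on-chip W, cycles, period ns).
RUNS = {
    ("xcu250", "single port"): (3.584, 381301, 3.333),
    ("xcu250", "dual port"): (3.593, 329101, 3.333),
    ("xc7a200t", "single port"): (0.244, 464386, 10.000),
    ("xc7a200t", "dual port"): (0.232, 412186, 10.000),
}

energy = {key: energy_mj(*values) for key, values in RUNS.items()}
for (device, variant), mj in energy.items():
    print(f"{device} {variant}: about {mj:.3f} mJ per rms_tb run")
```

Under these assumptions the small power increase on xcu250 is outweighed by the shorter runtime, so the estimated energy per run goes down on both devices.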

Routing

xcu250_D=1024_OneCore

metric	single port	dual port	diff
Logical nets	185232	185540	+308
Nets not needing routing	117989	118271	+282
Internally routed nets	100252	100534	+282
Nets with no loads	15824	15824
Implicitly routed ports	1913	1913
Routable nets	67243	67269	+26
Fully routed nets	67243	67269	+26
Nets with routing errors	0	0

xc7a200t_D=1024_OneCore

metric	single port	dual port	diff
Logical nets	54081	54626	+545
Nets not needing routing	18972	19235	+263
Internally routed nets	17312	17575	+263
Nets with no loads	892	892
Implicitly routed ports	768	768
Routable nets	35109	35391	+282
Fully routed nets	35109	35391	+282
Nets with routing errors	0	0

Note: Single port baselines were measured from ternary_matmul commit 8ddfcc6378f4aeb1daa2b684609b62b91bf13c8d

Commit 9838c96: dual port changes migration

mjao1 wants to merge 1 commit into sifferman:main from mjao1:dual-port-vreg
