
rr_graph: implement more robust delay normalization calculation #1576

Merged
merged 15 commits into from
Nov 19, 2020

Conversation

acomodi
Collaborator

@acomodi acomodi commented Oct 23, 2020

Signed-off-by: Alessandro Comodi acomodi@antmicro.com

Description

This PR aims at making the delay normalization factor used in the base cost calculation more representative of the actual underlying device fabric.

Prior to this PR, the calculation of the cost indices' delays/resistance/capacitance was based on heuristics, which proved to be insufficient for complex RR graphs such as the Series7 one.

All nodes in the RR graph are now checked and their switch delays averaged, so every node is taken into account instead of picking a few as samples (a short sketch of this per-node averaging is shown after the list below).

The main differences from the previous indexed data calculation are:

  • Short switches no longer increase the total delay values.
  • If a node does not have any switch (as a consequence of the previous point), its delay contribution is not added to the overall sum.
  • The delay normalization calculation is based solely on the previously computed rr_graph indexed data instead of on fixed values (such as clb_dist).
    • If no nodes were added for a cost index and it therefore has null values, it does not contribute to the final delay normalization value.
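Below is a minimal sketch of the per-node averaging described above; the types and field names are hypothetical, simplified stand-ins rather than the actual VPR data structures:

#include <vector>

// Hypothetical, simplified types for illustration only (not VPR's).
struct SwitchInfo { double Tdel; bool is_short; };
struct NodeInfo   { int cost_index; std::vector<SwitchInfo> out_switches; };

// Average every RR node's outgoing switch delays into its cost index,
// skipping short switches, and skipping nodes left with no switches.
// Cost indices that receive no nodes keep a zero count, so they can be
// excluded later from the delay normalization.
void accumulate_delays(const std::vector<NodeInfo>& nodes,
                       std::vector<double>& Tdel_sum,
                       std::vector<int>& num_nodes_of_index) {
    for (const NodeInfo& node : nodes) {
        double sum = 0.0;
        int cnt = 0;
        for (const SwitchInfo& sw : node.out_switches) {
            if (sw.is_short) continue;  // shorts no longer inflate the delay
            sum += sw.Tdel;
            ++cnt;
        }
        if (cnt == 0) continue;         // node has no (non-short) switches
        Tdel_sum[node.cost_index] += sum / cnt;
        ++num_nodes_of_index[node.cost_index];
    }
}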

Motivation and Context

Base costs for the Series7 device had seen a 15x reduction, making the delay and base costs no longer comparable.
The 15x reduction was due to an uncaught bug in the VtR/SymbiFlow fork, which was caught as soon as I tested the upstream master against the SymbiFlow tests.

How Has This Been Tested?

Currently, all strong and basic tests run to completion, but some of the strong tests report a change in the observed metrics, which in all cases but one seems to be a positive change (lower CPD, wire length, etc.).

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

@acomodi acomodi marked this pull request as draft October 23, 2020 12:05
@acomodi acomodi changed the title rr_graph: implement more robust delay normalization calculation WIP: rr_graph: implement more robust delay normalization calculation Oct 23, 2020
@acomodi
Collaborator Author

acomodi commented Oct 23, 2020

@litghost @vaughnbetz @HackerFoo FYI, I have tested this against the SymbiFlow tests and it seems to solve the base cost value reduction and the consequent runtime degradation.

With this PR, the base cost reported for the A35T device is now only ~1.06x lower than the previous values, and I have seen a restoration of the runtime results, as well as a general improvement of the CPD.
Another thing I have observed is that the base cost now seems to be consistent across different RR graphs (for instance the ROI-based one and the 50T).
Previously, the base costs reported for these two RR graphs differed by 5x.

@acomodi
Collaborator Author

acomodi commented Oct 23, 2020

Failing tests include:

regression_tests/vtr_reg_strong/strong_custom_grid...[Fail]
[Fail]
non_column.xml/raygentop.v/common min_chan_width_routing_area_total relative value 0.7979175341181404 outside of range [0.8,1.3] and not equal to golden value: 3434390.0
[Fail]
non_column.xml/raygentop.v/common min_chan_width_routing_area_per_tile relative value 0.7979173735061246 outside of range [0.8,1.3] and not equal to golden value: 3153.71

regression_tests/vtr_reg_strong/strong_power...[Fail]
[Fail]
k6_frac_N10_mem32K_40nm.xml/ch_intrinsics.v/common total_power relative value 0.8824649298597196 outside of range [0.9,1.1] and not equal to golden value: 0.00998

regression_tests/vtr_reg_strong/strong_bidir...[Fail]
[Fail]
k4_n4_v7_longline_bidir.xml/styr.blif/common critical_path_delay relative value 1.4492353123212178 outside of range [0.5,1.4] and not equal to golden value: 8.75717
[Fail]
k4_n4_v7_longline_bidir.xml/styr.blif/common geomean_nonvirtual_intradomain_critical_path_delay relative value 1.4492353123212178 outside of range [0.5,1.4] and not equal to golden value: 8.75717
[Fail]
k4_n4_v7_longline_bidir.xml/styr.blif/common setup_WNS relative value 1.4492353123212178 outside of range [0.5,1.4] and not equal to golden value: -8.75717

regression_tests/vtr_reg_strong/strong_clock_modeling...[Fail]
[Fail]
timing/k6_N10_40nm.xml/verilog/mkPktMerge.v/common_--clock_modeling_ideal_--route_chan_width_60 routed_wirelength relative value 0.47368421052631576 outside of range [0.6,1.5], above absolute threshold 5.0 and not equal to golden value: 38.0

I am also running the quick_titan QoR benchmarks to have more data.

@acomodi acomodi changed the title WIP: rr_graph: implement more robust delay normalization calculation rr_graph: implement more robust delay normalization calculation Oct 23, 2020
@litghost
Collaborator

The QoR changes look like a mixed bag. I think a Titan QoR comparison will be required to evaluate the overall impact of these changes.

Tdel_sum += Tdel / (float)clb_dist;
if (rr_indexed_data[cost_index].number_of_nodes == 0 || T_value == 0.0) continue;

Tdel_vector.push_back(T_value * 1e10);
Collaborator

Where did this magic number come from? It should be a constexpr double kNameOfConstant = 1e10, along with a comment about what the value should be.

Collaborator Author

Right, I forgot to fix that. Anyway, this was needed to temporarily scale the delay values: during the geometric mean computation, the repeated multiplication of the delays would eventually drive the product to zero.
Unsure whether this is the best way to achieve this, though.

Collaborator

Floating point doesn't need shifts like this. Rather than using 1e10, just use a double during the intermediate calculation.

Collaborator Author

Actually, I tried a double for the intermediate calculation, but it actually went beyond the limit, which is around 1e-300.

Collaborator

So maybe the issue is that product(Tdel_vector) is the wrong math to use here? I think you may have a numerical convergence error.

Collaborator Author

The fact is that I wanted to use the geometric mean to take into account possible delay value outliers of some segment types, which, I have seen, may differ by two orders of magnitude.

Given that delay values are usually on the order of 1e-10, I applied the 1e10 correction for the intermediate step and then removed it at the end of the geo mean calculation.
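To make the underflow and the 1e10 correction concrete, here is a standalone illustration with made-up delay values (this is not the PR's actual code): the naive running product of a few dozen delays around 1e-10 drops below what a double can represent and collapses to zero, while pre-scaling each term by 1e10 keeps the intermediate product in a safe range, and the scale is divided back out at the end.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Made-up segment delays, all on the order of 1e-10 s.
    std::vector<double> Tdel(40, 1.5e-10);

    // Naive geometric mean: the running product reaches ~1e-393, far below
    // the smallest representable double, so it underflows to 0.
    double prod = 1.0;
    for (double t : Tdel) prod *= t;
    std::printf("naive product  = %g\n", prod);  // prints 0

    // Pre-scaling by 1e10 keeps the intermediate product near 1.0;
    // the scale factor is removed again after taking the N-th root.
    double scaled_prod = 1.0;
    for (double t : Tdel) scaled_prod *= t * 1e10;
    double geomean = std::pow(scaled_prod, 1.0 / Tdel.size()) * 1e-10;
    std::printf("scaled geomean = %g\n", geomean);  // ~1.5e-10
    return 0;
}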

Collaborator

@litghost litghost Oct 23, 2020

The solution here is to change from (product(T_del))^(1/len(T_del)) to e^((1/len(T_del)) * sum(log(T_del))) (see https://en.wikipedia.org/wiki/Geometric_mean#Relationship_with_logarithms). log(0) is a violation, but those entries need to be skipped in either case. Again, the problem is that the naive geomean algorithm always converges to zero (or infinity) as N increases. This is numerically degenerate, and you can always increase N (i.e. len(T_del)) enough to drive the product to zero.
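A compact sketch of the log-domain formulation described above (the function name and signature are mine, not existing VPR code):

#include <cmath>
#include <vector>

// Geometric mean via logarithms: exp((1/N) * sum(log(x_i))).
// Zero (or negative) entries are skipped, since log(0) is undefined; this
// keeps the computation stable no matter how many values are averaged.
double geomean_log_domain(const std::vector<double>& values) {
    double log_sum = 0.0;
    int count = 0;
    for (double v : values) {
        if (v <= 0.0) continue;  // skip degenerate entries
        log_sum += std::log(v);
        ++count;
    }
    return count == 0 ? 0.0 : std::exp(log_sum / count);
}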

Contributor

I don't think the geometric mean is the right approach here. It does fairly average numbers with widely varying magnitudes (it treats them all as equally significant), but it is very dangerous if the numbers can reach or approach 0. We could get some small/weird wires or fragments with delay near 0, and we really do not want those dropping the average way down. I think that's why you're special-casing 0 values -- otherwise the geomean blows up. But a value of epsilon is almost as destructive. I'd change to the arithmetic mean and things should be a lot more stable. It is much less prone to producing tiny averages due to a few tiny values, and since overly small base costs are the problem, I think we want to stop using an average that weights small values very heavily (infinitely so for 0).
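As a small numeric illustration of this point (the delay values below are made up): a single near-zero entry pulls the geometric mean far below the typical delay, while the arithmetic mean barely moves.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Two typical wire delays plus one near-zero "fragment" delay.
    std::vector<double> Tdel = {1.0e-10, 2.0e-10, 1.0e-15};

    double sum = 0.0, log_sum = 0.0;
    for (double t : Tdel) {
        sum += t;
        log_sum += std::log(t);
    }

    double arith = sum / Tdel.size();              // ~1.0e-10, barely affected
    double geo = std::exp(log_sum / Tdel.size());  // ~2.7e-12, dragged way down
    std::printf("arithmetic mean = %g\ngeometric mean  = %g\n", arith, geo);
    return 0;
}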

Collaborator Author

OK, I have changed the code to use the arithmetic mean instead.

if (inode == -1)
continue;
cost_index = device_ctx.rr_nodes[inode].cost_index();
frac_num_seg = clb_dist * device_ctx.rr_indexed_data[cost_index].inv_length;
Collaborator

The inv_length doesn't seem to be accounted for anymore?

Collaborator Author

I was unsure about this. Basically, the inv_length is taken into account in a later step of the base_cost calculation, and IMO it should be taken into account based on the base_cost_type parameter value.
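For illustration only, here is a hedged sketch of what taking inv_length into account based on the base_cost_type parameter could look like; the enum and function are hypothetical, simplified stand-ins and not the PR's or VPR's actual code:

// Hypothetical sketch: whether a segment's inv_length scales its base cost
// depends on the selected base cost type.
enum class BaseCostType { DELAY_NORMALIZED, DELAY_NORMALIZED_LENGTH, DEMAND_ONLY };

double compute_base_cost(double T_value, double delay_norm, double inv_length,
                         BaseCostType base_cost_type) {
    switch (base_cost_type) {
        case BaseCostType::DELAY_NORMALIZED:
            return T_value / delay_norm;                 // length-independent
        case BaseCostType::DELAY_NORMALIZED_LENGTH:
            return (T_value / delay_norm) * inv_length;  // scale by wire length
        case BaseCostType::DEMAND_ONLY:
        default:
            return 1.0;                                  // delay ignored entirely
    }
}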

vpr/src/route/rr_node.h (outdated review comment, resolved)
Collaborator

@litghost litghost left a comment

Comments in. @vaughnbetz needs to do a detailed review as well.

@vaughnbetz
Contributor

The QoR regtest failures above look OK. 3 of 4 are improvements, and the last one is a 44% critical path delay degradation on a small circuit on a bidirectional architecture; that's known to be a noisy test, and it's on a less important architecture.

@acomodi
Collaborator Author

acomodi commented Oct 26, 2020

@vaughnbetz @litghost The quick Titan QoR runs have finished; they were run with the current state of this PR against master:

A results summary is presented in this colab page.

From what I can see, there seems to be a general degradation of CPD and placement runtime, in favor of a reduced routing runtime. I'll investigate why this is so.

In general, would it be acceptable to add yet another parameter to switch between different rr_indexed_data (and consequently also the delay normalization) computation methods?

@litghost
Collaborator

The geomean table should probably be transposed, as it is hard to read in its current form.

@litghost
Collaborator

litghost commented Oct 26, 2020

While the routing time is down ~20%, the overall flow time is up ~6%. Where did the time go?

@litghost
Collaborator

While the routing time is down ~20%, the overall flow time is up ~6%. Where did the time go?

Place time is up ~16% :(

@litghost
Collaborator

So overall:

  • CPD is up 8% (bad)
  • Route wire length about the same
  • Flow time is up 6% (bad)
  • Place time is up 16% (bad)
  • Route time is down 20% (good)

I'd say this change seems pretty bad on the Titan benchmarks. What are the latest SymbiFlow results?

@acomodi
Collaborator Author

acomodi commented Oct 27, 2020

@litghost The SymbiFlow test results seemed promising, achieving better runtime as well as CPD w.r.t. the baseline. I am currently gathering more consistent data and will add it to the colab page, so that the SymbiFlow test results are displayed as well.

I think the issue might be related to the missing inv_length during the delay normalization, as you noticed in one of the review comments. I am investigating whether this is the case.

@acomodi
Collaborator Author

acomodi commented Oct 27, 2020

@litghost @vaughnbetz The latest hydra run, with inv_length added to the delay normalization factor, has finished (colab page with updated results).

Summary:

CPD: improved by ~1.2%
Route Time: improved by ~15%
Place Time: worsened by ~15%
Total VTR time: worsened by ~5.8%
Wirelength: equal

It is worth noting that most of the biggest designs saw fairly good improvements:

e.g.

bitcoin miner:

  • CPD:

    • baseline: 13.35
    • changes: 8.74
  • Route Time:

    • baseline: 2837.24
    • changes: 1454.27
  • Total VTR time:

    • baseline: 24612 s
    • changes: 24132 s

directrf:

  • CPD:

    • baseline: 11.3487
    • changes: 11.0106
  • Route Time:

    • baseline: 2182.41
    • changes: 846.19
  • Total VTR time:

    • baseline: 25547.40
    • changes: 19810.19

@litghost
Collaborator

@vaughnbetz A ~6% runtime jump but better QoR using a more robust and simpler algorithm seems like a reasonable trade-off to me. Thoughts?

@vaughnbetz
Contributor

Yes, this does seem good to me too. I'm a bit baffled as to why placement time would go up, though -- did the placement delay matrix computation get slower? Otherwise I wouldn't expect placement to be affected. If the placement delay matrix computation slowed down, we may be able to claw it back, as it doesn't seem fundamental. In any case, the placement delay matrix can be cached, so it could be saved for some flows. Can you parse out the place time components to see where the slowdown comes from?

@acomodi
Collaborator Author

acomodi commented Oct 28, 2020

@vaughnbetz I think I cannot access the log files from the hydra runs, but I have run some of the smaller designs locally and, indeed, the place delay matrix computation time has increased:

cholesky_mc benchmark:

Baseline:

  • Lookahead time: 78.59
  • Place delay matrix: 42.18
  • Place time: 169.10

Changes:

  • Lookahead time: 78.04
  • Place delay matrix: 117.00
  • Place time: 242.40

@acomodi acomodi marked this pull request as ready for review October 28, 2020 12:01
@vaughnbetz
Contributor

Thanks. That's really strange. I can't figure out why a different base cost normalization would slow down the place delay matrix calculation. Presumably the routing paths being explored are different, but I'm still not sure how that leads to this behaviour. It's probably worth double-checking that any data (rr_indexed_data?) needed by the place_delay_matrix is properly loaded, and printing out the base costs of all resources to make sure none have funny data (they should all just be scaled up or down somewhat, uniformly) at the point where the place delay matrix is calculated.

@acomodi
Collaborator Author

acomodi commented Oct 29, 2020

@vaughnbetz The place_delay_matrix calculation defaults to the astar method. This means the delay calculation is based on the router, which by default uses the lookahead computed in the previous step.
I think the delay matrix runtime is affected because the lookahead itself was computed with the new base_cost values.

I have double-checked that the base costs in the indexed data have no funny values prior to calculating the delay matrix.

@vaughnbetz
Contributor

Thanks. The astar search uses a high criticality, so it is looking for low-delay paths, and the base cost shouldn't really matter much to it. So something odd is happening to slow it down so much.

Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
@acomodi
Collaborator Author

acomodi commented Nov 16, 2020

@litghost @vaughnbetz The latest runs completed. I have separated them into the following colab pages:

The best results are obtained with the following settings:

  • A higher astar_fac during the place_delay_matrix calculation: this parameter is adjustable, and I think that a lower value could yield good enough results as well, probably achieving better CPD estimates and final CPDs.
  • Doubling of initial_pres_fac.
  • New indexed data calculation method.

@vaughnbetz
Contributor

Thanks @acomodi .
Just to check: the circuit name in the tables is not meaningful (all values are averages over multiple circuits), correct?

Titan results show all 3 changes are beneficial.
VTR results show the new normalization is beneficial (slightly lower Wmin, faster runtime). Critical path delay is slightly worse, but that could be due to the smaller Wmin (126 vs. 130). Flow runtime is also down.
VTR results show the higher astar_fac and doubled pres_fac are not beneficial: higher Wmin, and no improvement in runtime or cpd vs. the new normalization alone. They aren't too far off, though (Wmin = 136 instead of the default 130 with all 3 changes, and cpd is up ~2%).

So the new normalization should go in. The higher astar_fac and doubled pres_fac are beneficial for Titan, and may be mildly negative for VTR. Could we set different defaults for those?

@acomodi
Collaborator Author

acomodi commented Nov 16, 2020

So the new normalization should go in. Higher astar_fac and doubled pres_fac is beneficial for Titan, may be mildly negative for VTR. Could set different defaults for those?

@vaughnbetz OK. Just to be sure, you mean changing the default router_delay_profiler astar_fac and the pres_fac to get the VTR tests in line?

Just to check: the circuit name in the tables is not meaningful (all values are averages over multiple circuits), correct?

So, there are two different tables:

  • the first one lets you select the various circuits' results, to get more detailed information.
  • the last one is a collection of normalized data across all the tests, to show the overall trends.

I have also double-checked that, with the current implementation, some of the strong tests do not meet QoR, but they do when lowering the initial_pres_fac (the initial_pres_fac default is 1.0). Would it be better to update the golden results, or to change the initial_pres_fac for the failing strong tests?

@vaughnbetz
Contributor

Ah, got it, thanks.

Results are good for all 3 changes for Titan (every change improves things).
New normalization scheme is a slight improvement for VTR; higher astar_fac during place_delay_matrix and higher pres_fac are slight degradations. So I'd recommend having Titan run with the higher astar_fac and pres_fac (set at command line), but leaving the defaults for both alone so VTR uses the old values. Hopefully that resolves the QoR failures too (although there might still be some random ones).

@acomodi
Collaborator Author

acomodi commented Nov 16, 2020

OK, I will run another set of tests on Hydra to get the final results.

In the meantime, I collected some SymbiFlow tests with and without the changes in this PR. The results are collected here.

NOTE:

  • the tests with the changes seem to have way too high runtimes, probably because there was some additional load on my local machine that slowed everything down. I will re-run everything tomorrow and update the results table.
  • the tests have been run on the 50T and 100T devices.

@litghost FYI

Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
@acomodi
Collaborator Author

acomodi commented Nov 17, 2020

@vaughnbetz The final runs on Hydra have completed; more details are in the colab pages linked below:

Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
@acomodi acomodi changed the title rr_graph: implement more robust delay normalization calculation WIP: rr_graph: implement more robust delay normalization calculation Nov 18, 2020
@acomodi acomodi changed the title WIP: rr_graph: implement more robust delay normalization calculation rr_graph: implement more robust delay normalization calculation Nov 18, 2020
@vaughnbetz
Contributor

The VTR and Titan results look good. The SymbiFlow results look OK to me too, but I haven't analyzed them as much / am not as familiar with them.

Contributor

@vaughnbetz vaughnbetz left a comment

Suggest some additional comments; otherwise looks good.

t_rr_type rr_type,
int nodes_per_chan,
const t_rr_node_indices& L_rr_node_indices) {
static void load_rr_indexed_data_T_values() {
/* Loads the average propagation times through segments of each index type *
Contributor

This comment looks out of date. I think you're now going through all rr_nodes and getting medians for their delay. It would be good to update this comment and check for other out-of-date comments. It should also describe what we're trying to do here (get typical values of T_linear etc. for different wire types, which are used for things like constructing lookaheads and place delay matrices).

R_total = (float*)vtr::calloc(device_ctx.rr_indexed_data.size(), sizeof(float));
std::vector<int> num_nodes_of_index(rr_indexed_data.size(), 0);
std::vector<std::vector<float>> C_total(rr_indexed_data.size());
std::vector<std::vector<float>> R_total(rr_indexed_data.size());

/* August 2014: Not all wire-to-wire switches connecting from some wire segment will
Contributor

Again, I think we're averaging / taking a median over all rr_nodes, not just some in one chosen channel. The comment should be updated.

@@ -453,44 +409,41 @@ static void load_rr_indexed_data_T_values(int index_start,
// the combined capacitance of the node and internal capacitance of the switch. The
// second transient response is the result of the Rnode being distributed halfway along a
Contributor

Change "second transient response" to "multiplication by the second term by 0.5"

}

static void calculate_average_switch(int inode, double& avg_switch_R, double& avg_switch_T, double& avg_switch_Cinternal, int& num_switches, short& buffered) {
static void calculate_average_switch(int inode, double& avg_switch_R, double& avg_switch_T, double& avg_switch_Cinternal, int& num_switches, short& buffered, vtr::vector<RRNodeId, std::vector<RREdgeId>>& fan_in_list) {
Contributor

Add comment saying what this routine does, and why it's useful.

Collaborator

@litghost litghost left a comment

LGTM. @vaughnbetz, we good to merge?

@acomodi
Collaborator Author

acomodi commented Nov 19, 2020

We can potentially merge this and then I will open a follow-up PR with the additional detailed comments.

@vaughnbetz
Contributor

Sounds good, merging.

@vaughnbetz vaughnbetz merged commit 18b7ca6 into verilog-to-routing:master Nov 19, 2020
Labels
libvtrutil, VPR (VPR FPGA Placement & Routing Tool)