
rr_graph: implement more robust delay normalization calculation #1576

Merged
merged 15 commits into from
Nov 19, 2020

Conversation

acomodi
Collaborator

@acomodi acomodi commented Oct 23, 2020

Signed-off-by: Alessandro Comodi acomodi@antmicro.com

Description

This PR aims at making the delay normalization factor used in the base cost calculation more representative of the actual underlying device fabric.

Prior to this PR, the calculation of the cost indices' delays/resistance/capacitance was based on heuristics, which proved to be insufficient for complex RR graphs such as the Series7 one.

All nodes in the RR graph are now checked and their switch delays averaged, so every node is taken into account instead of picking a few as samples (a short sketch of this per-node averaging is shown after the list below).

The main differences from the previous indexed data calculation are:

  • Short switches no longer increase the total delay values.
  • If a node does not have any switch (as a consequence of the previous point), its delay contribution is not added to the overall sum.
  • The delay normalization calculation is based solely on the previously computed rr_graph indexed data instead of on fixed values (such as clb_dist).
    • If no nodes were added for a cost index and it therefore has null values, it does not contribute to the final delay normalization value.
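Below is a minimal sketch of the per-node averaging described above; the types and field names are hypothetical, simplified stand-ins rather than the actual VPR data structures:

#include <vector>

// Hypothetical, simplified types for illustration only (not VPR's).
struct SwitchInfo { double Tdel; bool is_short; };
struct NodeInfo   { int cost_index; std::vector<SwitchInfo> out_switches; };

// Average every RR node's outgoing switch delays into its cost index,
// skipping short switches, and skipping nodes left with no switches.
// Cost indices that receive no nodes keep a zero count, so they can be
// excluded later from the delay normalization.
void accumulate_delays(const std::vector<NodeInfo>& nodes,
                       std::vector<double>& Tdel_sum,
                       std::vector<int>& num_nodes_of_index) {
    for (const NodeInfo& node : nodes) {
        double sum = 0.0;
        int cnt = 0;
        for (const SwitchInfo& sw : node.out_switches) {
            if (sw.is_short) continue;  // shorts no longer inflate the delay
            sum += sw.Tdel;
            ++cnt;
        }
        if (cnt == 0) continue;         // node has no (non-short) switches
        Tdel_sum[node.cost_index] += sum / cnt;
        ++num_nodes_of_index[node.cost_index];
    }
}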

Motivation and Context

Base costs for the Series7 device had seen a 15x reduction, making the delay and base costs no longer comparable.
The 15x reduction was due to an uncaught bug in the VtR/SymbiFlow fork, which was caught as soon as I tested the upstream master against the SymbiFlow tests.

How Has This Been Tested?

Currently, all strong and basic tests run to completion, but some of the strong tests report a change in the observed metrics, which in all cases but one seems to be a positive change (lower CPD, wire length, etc.).

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

@acomodi acomodi marked this pull request as draft October 23, 2020 12:05
@acomodi acomodi changed the title rr_graph: implement more robust delay normalization calculation WIP: rr_graph: implement more robust delay normalization calculation Oct 23, 2020
@acomodi
Collaborator Author

acomodi commented Oct 23, 2020

@litghost @vaughnbetz @HackerFoo FYI, I have tested this against the SymbiFlow tests and it seems to solve the base cost value reduction and the consequent runtime degradation.

With this PR, the base cost reported for the A35T device is now only ~1.06x lower than the previous values, and I have seen a restoration of the runtime results, as well as a general improvement of the CPD.
Another thing I have observed is that the base cost now seems to be consistent across different RR graphs (for instance the ROI-based one and the 50T).
Previously, the base costs reported for these two RR graphs differed by 5x.

@acomodi
Collaborator Author

acomodi commented Oct 23, 2020

Failing tests include:

regression_tests/vtr_reg_strong/strong_custom_grid...[Fail]
[Fail]
non_column.xml/raygentop.v/common min_chan_width_routing_area_total relative value 0.7979175341181404 outside of range [0.8,1.3] and not equal to golden value: 3434390.0
[Fail]
non_column.xml/raygentop.v/common min_chan_width_routing_area_per_tile relative value 0.7979173735061246 outside of range [0.8,1.3] and not equal to golden value: 3153.71

regression_tests/vtr_reg_strong/strong_power...[Fail]
[Fail]
k6_frac_N10_mem32K_40nm.xml/ch_intrinsics.v/common total_power relative value 0.8824649298597196 outside of range [0.9,1.1] and not equal to golden value: 0.00998

regression_tests/vtr_reg_strong/strong_bidir...[Fail]
[Fail]
k4_n4_v7_longline_bidir.xml/styr.blif/common critical_path_delay relative value 1.4492353123212178 outside of range [0.5,1.4] and not equal to golden value: 8.75717
[Fail]
k4_n4_v7_longline_bidir.xml/styr.blif/common geomean_nonvirtual_intradomain_critical_path_delay relative value 1.4492353123212178 outside of range [0.5,1.4] and not equal to golden value: 8.75717
[Fail]
k4_n4_v7_longline_bidir.xml/styr.blif/common setup_WNS relative value 1.4492353123212178 outside of range [0.5,1.4] and not equal to golden value: -8.75717

regression_tests/vtr_reg_strong/strong_clock_modeling...[Fail]
[Fail]
timing/k6_N10_40nm.xml/verilog/mkPktMerge.v/common_--clock_modeling_ideal_--route_chan_width_60 routed_wirelength relative value 0.47368421052631576 outside of range [0.6,1.5], above absolute threshold 5.0 and not equal to golden value: 38.0

I am also running the quick_titan QoR benchmarks to have more data.

@acomodi acomodi changed the title WIP: rr_graph: implement more robust delay normalization calculation rr_graph: implement more robust delay normalization calculation Oct 23, 2020
@litghost
Collaborator

The QoR changes look like a mixed bag. I think a Titan QoR comparison will be required to evaluate the overall impact of these changes.

Tdel_sum += Tdel / (float)clb_dist;
if (rr_indexed_data[cost_index].number_of_nodes == 0 || T_value == 0.0) continue;

Tdel_vector.push_back(T_value * 1e10);
Collaborator

Where did this magic number come from? It should be a constexpr double kNameOfConstant = 1e10, along with a comment about what the value should be.

Collaborator Author

Right, I forgot to fix that. Anyway, this was needed to temporarily scale the delay values: during the geometric mean computation, the repeated multiplication of the delays would eventually drive the product to zero.
Unsure whether this is the best way to achieve this, though.

Collaborator

Floating point doesn't need shifts like this. Rather than using 1e10, just use a double during the intermediate calculation.

Collaborator Author

Actually, I tried a double for the intermediate calculation, but it actually went beyond the limit, which is around 1e-300.

Collaborator

So maybe the issue is that product(Tdel_vector) is the wrong math to use here? I think you may have a numerical convergence error.

Collaborator Author

The fact is that I wanted to use the geometric mean to take into account possible delay value outliers of some segment types, which, I have seen, may differ by two orders of magnitude.

Given that delay values are usually on the order of 1e-10, I applied the 1e10 correction for the intermediate step and then removed it at the end of the geo mean calculation.
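To make the underflow and the 1e10 correction concrete, here is a standalone illustration with made-up delay values (this is not the PR's actual code): the naive running product of a few dozen delays around 1e-10 drops below what a double can represent and collapses to zero, while pre-scaling each term by 1e10 keeps the intermediate product in a safe range, and the scale is divided back out at the end.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Made-up segment delays, all on the order of 1e-10 s.
    std::vector<double> Tdel(40, 1.5e-10);

    // Naive geometric mean: the running product reaches ~1e-393, far below
    // the smallest representable double, so it underflows to 0.
    double prod = 1.0;
    for (double t : Tdel) prod *= t;
    std::printf("naive product  = %g\n", prod);  // prints 0

    // Pre-scaling by 1e10 keeps the intermediate product near 1.0;
    // the scale factor is removed again after taking the N-th root.
    double scaled_prod = 1.0;
    for (double t : Tdel) scaled_prod *= t * 1e10;
    double geomean = std::pow(scaled_prod, 1.0 / Tdel.size()) * 1e-10;
    std::printf("scaled geomean = %g\n", geomean);  // ~1.5e-10
    return 0;
}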

Collaborator

@litghost litghost Oct 23, 2020

The solution here is to change from (product(T_del))^(1/len(T_del)) to e^((1/len(T_del)) * sum(log(T_del))) (see https://en.wikipedia.org/wiki/Geometric_mean#Relationship_with_logarithms). log(0) is a violation, but those entries need to be skipped in either case. Again, the problem is that the naive geomean algorithm always converges to zero (or infinity) as N increases. This is numerically degenerate, and you can always increase N (i.e. len(T_del)) enough to drive the product to zero.
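A compact sketch of the log-domain formulation described above (the function name and signature are mine, not existing VPR code):

#include <cmath>
#include <vector>

// Geometric mean via logarithms: exp((1/N) * sum(log(x_i))).
// Zero (or negative) entries are skipped, since log(0) is undefined; this
// keeps the computation stable no matter how many values are averaged.
double geomean_log_domain(const std::vector<double>& values) {
    double log_sum = 0.0;
    int count = 0;
    for (double v : values) {
        if (v <= 0.0) continue;  // skip degenerate entries
        log_sum += std::log(v);
        ++count;
    }
    return count == 0 ? 0.0 : std::exp(log_sum / count);
}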

Contributor

I don't think the geometric mean is the right approach here. It does fairly average numbers with widely varying magnitudes (it treats them all as equally significant), but it is very dangerous if the numbers can reach or approach 0. We could get some small/weird wires or fragments with delay near 0, and we really do not want those dropping the average way down. I think that's why you're special-casing 0 values -- otherwise the geomean blows up. But a value of epsilon is almost as destructive. I'd change to the arithmetic mean and things should be a lot more stable. It is much less prone to producing tiny averages due to a few tiny values, and since overly small base costs are the problem, I think we want to stop using an average that weights small values very heavily (infinitely so for 0).
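As a small numeric illustration of this point (the delay values below are made up): a single near-zero entry pulls the geometric mean far below the typical delay, while the arithmetic mean barely moves.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Two typical wire delays plus one near-zero "fragment" delay.
    std::vector<double> Tdel = {1.0e-10, 2.0e-10, 1.0e-15};

    double sum = 0.0, log_sum = 0.0;
    for (double t : Tdel) {
        sum += t;
        log_sum += std::log(t);
    }

    double arith = sum / Tdel.size();              // ~1.0e-10, barely affected
    double geo = std::exp(log_sum / Tdel.size());  // ~2.7e-12, dragged way down
    std::printf("arithmetic mean = %g\ngeometric mean  = %g\n", arith, geo);
    return 0;
}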

Collaborator Author

OK, I have changed the code to use the arithmetic mean instead.

if (inode == -1)
continue;
cost_index = device_ctx.rr_nodes[inode].cost_index();
frac_num_seg = clb_dist * device_ctx.rr_indexed_data[cost_index].inv_length;
Collaborator

The inv_length doesn't seem to be accounted for anymore?

Collaborator Author

I was unsure about this. Basically, the inv_length is taken into account in a later step of the base_cost calculation, and IMO it should be taken into account based on the base_cost_type parameter value.
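For illustration only, here is a hedged sketch of what taking inv_length into account based on the base_cost_type parameter could look like; the enum and function are hypothetical, simplified stand-ins and not the PR's or VPR's actual code:

// Hypothetical sketch: whether a segment's inv_length scales its base cost
// depends on the selected base cost type.
enum class BaseCostType { DELAY_NORMALIZED, DELAY_NORMALIZED_LENGTH, DEMAND_ONLY };

double compute_base_cost(double T_value, double delay_norm, double inv_length,
                         BaseCostType base_cost_type) {
    switch (base_cost_type) {
        case BaseCostType::DELAY_NORMALIZED:
            return T_value / delay_norm;                 // length-independent
        case BaseCostType::DELAY_NORMALIZED_LENGTH:
            return (T_value / delay_norm) * inv_length;  // scale by wire length
        case BaseCostType::DEMAND_ONLY:
        default:
            return 1.0;                                  // delay ignored entirely
    }
}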

vpr/src/route/rr_node.h (outdated review comment, resolved)
Collaborator

@litghost litghost left a comment

Comments in. @vaughnbetz needs to do a detailed review as well.

@vaughnbetz
Contributor

The QoR regtest failures above look OK. 3 of 4 are improvements, and the last one is a 44% critical path delay degradation on a small circuit on a bidirectional architecture; that's known to be a noisy test, and it's on a less important architecture.

@acomodi
Collaborator Author

acomodi commented Oct 26, 2020

@vaughnbetz @litghost The quick Titan QoR runs have finished; they were run with the current state of this PR against master:

A results summary is presented in this colab page.

From what I can see, there seems to be a general degradation of CPD and placement runtime, in favor of a reduced routing runtime. I'll investigate why this is so.

In general, would it be acceptable to add yet another parameter to switch between different rr_indexed_data (and consequently also the delay normalization) computation methods?

@litghost
Collaborator

The geomean table should probably be transposed, as it is hard to read in its current form.

@litghost
Collaborator

litghost commented Oct 26, 2020

While the routing time is down ~20%, the overall flow time is up ~6%. Where did the time go?

@litghost
Collaborator

While the routing time is down ~20%, the overall flow time is up ~6%. Where did the time go?

Place time is up ~16% :(

@litghost
Collaborator

So overall:

  • CPD is up 8% (bad)
  • Route wire length about the same
  • Flow time is up 6% (bad)
  • Place time is up 16% (bad)
  • Route time is down 20% (good)

I'd say this change seems pretty bad on the Titan benchmarks. What are the latest SymbiFlow results?

@acomodi
Collaborator Author

acomodi commented Oct 27, 2020

@litghost The SymbiFlow test results seemed promising, achieving better runtime as well as CPD w.r.t. the baseline. I am currently gathering more consistent data and will add it to the colab page, so that the SymbiFlow test results are displayed as well.

I think the issue might be related to the missing inv_length during the delay normalization, as you noticed in one of the review comments. I am investigating whether this is the case.

@acomodi
Collaborator Author

acomodi commented Oct 27, 2020

@litghost @vaughnbetz The latest hydra run, with inv_length added to the delay normalization factor, has finished (colab page with updated results).

Summary:

CPD: improved by ~1.2%
Route Time: improved by ~15%
Place Time: worsened by ~15%
Total VTR time: worsened by ~5.8%
Wirelength: equal

It is worth noting that most of the biggest designs saw fairly good improvements:

e.g.

bitcoin miner:

  • CPD:

    • baseline: 13.35
    • changes: 8.74
  • Route Time:

    • baseline: 2837.24
    • changes: 1454.27
  • Total VTR time:

    • baseline: 24612 s
    • changes: 24132 s

directrf:

  • CPD:

    • baseline: 11.3487
    • changes: 11.0106
  • Route Time:

    • baseline: 2182.41
    • changes: 846.19
  • Total VTR time:

    • baseline: 25547.40
    • changes: 19810.19

@litghost
Collaborator

@vaughnbetz A ~6% runtime jump but better QoR using a more robust and simpler algorithm seems like a reasonable trade-off to me. Thoughts?

@vaughnbetz
Contributor

Yes, this does seem good to me too. I'm a bit baffled as to why placement time would go up, though -- did the placement delay matrix computation get slower? Otherwise I wouldn't expect placement to be affected. If the placement delay matrix computation slowed down, we may be able to claw it back, as it doesn't seem fundamental. In any case, the placement delay matrix can be cached, so it could be saved for some flows. Can you parse out the place time components to see where the slowdown comes from?

@acomodi
Collaborator Author

acomodi commented Oct 28, 2020

@vaughnbetz I think I cannot access the log files from the hydra runs, but I have run some of the smaller designs locally and, indeed, the place delay matrix computation time has increased:

cholesky_mc benchmark:

Baseline:

  • Lookahead time: 78.59
  • Place delay matrix: 42.18
  • Place time: 169.10

Changes:

  • Lookahead time: 78.04
  • Place delay matrix: 117.00
  • Place time: 242.40

@acomodi acomodi marked this pull request as ready for review October 28, 2020 12:01
@vaughnbetz
Contributor

Thanks. That's really strange. I can't figure out why a different base cost normalization would slow down the place delay matrix calculation. Presumably the routing paths being explored are different, but I'm still not sure how that leads to this behaviour. It's probably worth double-checking that any data (rr_indexed_data?) needed by the place_delay_matrix is properly loaded, and printing out the base costs of all resources to make sure none have funny data (they should all just be scaled up or down somewhat, uniformly) at the point where the place delay matrix is calculated.

@acomodi
Collaborator Author

acomodi commented Oct 29, 2020

@vaughnbetz The place_delay_matrix calculation defaults to the astar method. This means the delay calculation is based on the router, which by default uses the lookahead computed in the previous step.
I think the delay matrix runtime is affected because the lookahead itself was computed with the new base_cost values.

I have double-checked that the base costs in the indexed data have no funny values prior to calculating the delay matrix.

@vaughnbetz
Contributor

Thanks. The astar search uses a high criticality, so it is looking for low-delay paths, and the base cost shouldn't really matter much to it. So something odd is happening to slow it down so much.

Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
@acomodi
Collaborator Author

acomodi commented Nov 16, 2020

@litghost @vaughnbetz The latest runs completed. I have separated them into the following colab pages:

The best results are obtained with the following settings:

  • A higher astar_fac during the place_delay_matrix calculation: this parameter is adjustable, and I think that a lower value could yield good enough results as well, probably achieving better CPD estimates and final CPDs.
  • Doubling of initial_pres_fac.
  • New indexed data calculation method.

@vaughnbetz
Contributor

Thanks @acomodi .
Just to check: the circuit name in the tables is not meaningful (all values are averages over multiple circuits), correct?

Titan results show all 3 changes are beneficial.
VTR results show the new normalization is beneficial (slightly lower Wmin, faster runtime). Critical path delay is slightly worse, but that could be due to the smaller Wmin (126 vs. 130). Flow runtime is also down.
VTR results show the higher astar_fac and doubled pres_fac are not beneficial: higher Wmin, and no improvement in runtime or cpd vs. the new normalization alone. They aren't too far off, though (Wmin = 136 instead of the default 130 with all 3 changes, and cpd is up ~2%).

So the new normalization should go in. The higher astar_fac and doubled pres_fac are beneficial for Titan, and may be mildly negative for VTR. Could we set different defaults for those?

@acomodi
Collaborator Author

acomodi commented Nov 16, 2020

So the new normalization should go in. Higher astar_fac and doubled pres_fac is beneficial for Titan, may be mildly negative for VTR. Could set different defaults for those?

@vaughnbetz OK. Just to be sure, you mean changing the default router_delay_profiler astar_fac and the pres_fac to get the VTR tests in line?

Just to check: the circuit name in the tables is not meaningful (all values are averages over multiple circuits), correct?

So, there are two different tables:

  • the first one lets you select the various circuits' results, to get more detailed information.
  • the last one is a collection of normalized data across all the tests, to show the overall trends.

I have also double-checked that, with the current implementation, some of the strong tests do not meet QoR, but they do when lowering the initial_pres_fac (the initial_pres_fac default is 1.0). Would it be better to update the golden results, or to change the initial_pres_fac for the failing strong tests?

@vaughnbetz
Contributor

Ah, got it, thanks.

Results are good for all 3 changes for Titan (every change improves things).
New normalization scheme is a slight improvement for VTR; higher astar_fac during place_delay_matrix and higher pres_fac are slight degradations. So I'd recommend having Titan run with the higher astar_fac and pres_fac (set at command line), but leaving the defaults for both alone so VTR uses the old values. Hopefully that resolves the QoR failures too (although there might still be some random ones).

@acomodi
Collaborator Author

acomodi commented Nov 16, 2020

OK, I will run another set of tests on Hydra to get the final results.

In the meantime, I collected some SymbiFlow tests with and without the changes in this PR. The results are collected here.

NOTE:

  • the tests with the changes seem to have way too high runtimes, probably because there was some additional load on my local machine that slowed everything down. I will re-run everything tomorrow and update the results table.
  • the tests have been run on the 50T and 100T devices.

@litghost FYI

Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
@acomodi
Collaborator Author

acomodi commented Nov 17, 2020

@vaughnbetz The final runs on Hydra have completed; more details are in the colab pages linked below:

Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
@acomodi acomodi changed the title rr_graph: implement more robust delay normalization calculation WIP: rr_graph: implement more robust delay normalization calculation Nov 18, 2020
@acomodi acomodi changed the title WIP: rr_graph: implement more robust delay normalization calculation rr_graph: implement more robust delay normalization calculation Nov 18, 2020
@vaughnbetz
Contributor

The VTR and Titan results look good. The SymbiFlow results look OK to me too, but I haven't analyzed them as much / am not as familiar with them.

Contributor

@vaughnbetz vaughnbetz left a comment

Suggest some additional comments; otherwise looks good.

t_rr_type rr_type,
int nodes_per_chan,
const t_rr_node_indices& L_rr_node_indices) {
static void load_rr_indexed_data_T_values() {
/* Loads the average propagation times through segments of each index type *
Contributor

This comment looks out of date. I think you're now going through all rr_nodes and getting medians for their delay. It would be good to update this comment and check for other out-of-date comments. It should also describe what we're trying to do here (get typical values of T_linear etc. for different wire types, which are used for things like constructing lookaheads and place delay matrices).

R_total = (float*)vtr::calloc(device_ctx.rr_indexed_data.size(), sizeof(float));
std::vector<int> num_nodes_of_index(rr_indexed_data.size(), 0);
std::vector<std::vector<float>> C_total(rr_indexed_data.size());
std::vector<std::vector<float>> R_total(rr_indexed_data.size());

/* August 2014: Not all wire-to-wire switches connecting from some wire segment will
Contributor

Again, I think we're averaging / taking a median over all rr_nodes, not just some in one chosen channel. The comment should be updated.

@@ -453,44 +409,41 @@ static void load_rr_indexed_data_T_values(int index_start,
// the combined capacitance of the node and internal capacitance of the switch. The
// second transient response is the result of the Rnode being distributed halfway along a
Contributor

Change "second transient response" to "multiplication by the second term by 0.5"

}

static void calculate_average_switch(int inode, double& avg_switch_R, double& avg_switch_T, double& avg_switch_Cinternal, int& num_switches, short& buffered) {
static void calculate_average_switch(int inode, double& avg_switch_R, double& avg_switch_T, double& avg_switch_Cinternal, int& num_switches, short& buffered, vtr::vector<RRNodeId, std::vector<RREdgeId>>& fan_in_list) {
Contributor

Add comment saying what this routine does, and why it's useful.

Collaborator

@litghost litghost left a comment

LGTM. @vaughnbetz, we good to merge?

@acomodi
Collaborator Author

acomodi commented Nov 19, 2020

We can potentially merge this and then I will open a follow-up PR with the additional detailed comments.

@vaughnbetz
Contributor

Sounds good, merging.

@vaughnbetz vaughnbetz merged commit 18b7ca6 into verilog-to-routing:master Nov 19, 2020
Labels
libvtrutil, VPR (VPR FPGA Placement & Routing Tool)