rr_graph: implement more robust delay normalization calculation #1576
Conversation
@litghost @vaughnbetz @HackerFoo FYI, I have tested this against SymbiFlow tests and it seems to solve the base cost value reduction and the consequent runtime degradation. With this PR, the base cost reported for the A35T device is now only ~1.06x lower than the previous values, and I have seen runtime results restored, as well as a general improvement of the CPD.
Failing tests include:
I am also running the
The QoR changes look like a mixed bag. I think a Titan QoR comparison will be required to evaluate the overall impact of these changes.
Tdel_sum += Tdel / (float)clb_dist;
if (rr_indexed_data[cost_index].number_of_nodes == 0 || T_value == 0.0) continue;
Tdel_vector.push_back(T_value * 1e10);
Where did this magic number come from? It should be constexpr double kNameOfConstant = 1e10, along with a comment about what the value should be.
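A minimal sketch of the suggested naming (kDelayScale is a placeholder name, not from the PR):

```cpp
#include <cassert>

// Hypothetical named constant replacing the magic number. The comment
// documents why the scale exists: delays (~1e-10 s) are multiplied up
// before the geometric mean so the intermediate product stays well
// inside double range.
constexpr double kDelayScale = 1e10;
```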
Right, I forgot to fix that. Anyway, this was needed to temporarily shift the delay values: during the geometric mean computation, the repeated multiplication of the delays would eventually drive the running product to zero.
Unsure whether this is the best way to achieve this though.
Floating point doesn't need shifts like this. Rather than using 1e10, just use a double during the intermediate calculation.
Actually I tried with a double for the intermediate calculation, but the product still underflowed past the double limit, which is around 1e-308.
So the issue is that maybe product(Tdel_vector) is the wrong math to use here? I think you may have a numerical convergence error here.
The fact is that I wanted to use the geometric mean to account for possible delay value outliers of some segment types, which, I have seen, may differ by two orders of magnitude.
Given that delay values are usually on the order of 1e-10, I applied the 1e10 correction for the intermediate step and then restored it at the end of the geomean calculation.
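To see why some correction was needed at all: a naive running product of delays on the order of 1e-10 underflows double precision after a few dozen terms. A standalone sketch (not VTR code):

```cpp
#include <cassert>
#include <vector>

// Naive geomean intermediate: the running product of N values near 1e-10
// is ~1e-10N, which falls below even the subnormal double range
// (~4.9e-324) once N exceeds roughly 32, and silently becomes 0.0.
double naive_product(const std::vector<double>& vals) {
    double prod = 1.0;
    for (double v : vals) prod *= v;
    return prod;
}
```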
The solution here is to change from (product(T_del))^(1/len(T_del)) to e^((1/len(T_del)) * sum(map(log, T_del))) (see https://en.wikipedia.org/wiki/Geometric_mean#Relationship_with_logarithms). log(0) is undefined, but those entries need to be skipped in either case. Again, the problem here is that the naive geomean algorithm always converges to zero (or infinity) as N increases. This is numerically degenerate: you can always increase N (e.g. len(T_del)) enough to drive the product to zero.
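A sketch of that log-space formulation (a standalone example, not the PR's actual code); zero entries are skipped, and the accumulator stays around -23 per term instead of shrinking toward underflow:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Geometric mean via exp(mean(log(x))): numerically stable for any N,
// since log(1e-10) is only about -23, far from double limits.
double geomean_log(const std::vector<double>& vals) {
    double log_sum = 0.0;
    std::size_t n = 0;
    for (double v : vals) {
        if (v == 0.0) continue; // log(0) undefined; skipped in either formulation
        log_sum += std::log(v);
        ++n;
    }
    return n == 0 ? 0.0 : std::exp(log_sum / n);
}
```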
I don't think the geometric mean is the right approach here. It does fairly average numbers with widely varying magnitudes (treating them all as equally significant), but it is very dangerous if numbers can reach or approach 0. We could get some small/weird wires or fragments with delay near 0, and we really do not want those dropping the average way down. I think that's why you're special-casing 0 values -- otherwise the geomean blows up. But a value of epsilon is almost as destructive. I'd change to the arithmetic mean and things should be a lot more stable. It is much less prone to producing tiny averages due to a few tiny values, and since overly small base costs are the problem, I think we want to stop using an average that weights small values very heavily (infinitely for 0).
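A small numeric illustration of that sensitivity (the delay values here are made up): one near-zero outlier among typical ~1e-10 delays pulls the geometric mean more than an order of magnitude below the typical value, while the arithmetic mean stays at the same order.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean: dominated by the typical values.
double arith_mean(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / v.size();
}

// Geometric mean (log form): weights small values heavily.
double geo_mean(const std::vector<double>& v) {
    double log_sum = 0.0;
    for (double x : v) log_sum += std::log(x);
    return std::exp(log_sum / v.size());
}
```

For {1e-10, 1e-10, 1e-10, 1e-15} the arithmetic mean is ~7.5e-11, the same order as the typical delays, while the geometric mean is ~5.6e-12, already more than 10x low -- and it goes to 0 outright if any entry is exactly 0.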
Ok, I have changed the code to use the arithmetic mean instead.
if (inode == -1)
    continue;
cost_index = device_ctx.rr_nodes[inode].cost_index();
frac_num_seg = clb_dist * device_ctx.rr_indexed_data[cost_index].inv_length;
The inv_length doesn't seem to be accounted for anymore?
I was unsure about this. Basically, the inv_length is taken into account in a later step of the base_cost calculation, and IMO whether it is applied should depend on the base_cost_type parameter value.
Comments in. @vaughnbetz needs to do a detailed review as well.
QoR regtest failures above look OK. 3 of 4 are improvements, and the last one is a 44% critical path delay degradation on a small circuit on a bidirectional architecture; that's known to be a noisy test, and on a less important architecture.
@vaughnbetz @litghost The Titan quick QoR runs have finished; they were run with the current PR state against master. The results summary is presented in this colab page. From what I see, there seems to be a general degradation of CPD and placement runtime, in favor of a reduced routing runtime. I'll investigate why this is so. In general, would it be acceptable to add yet another parameter to switch between different rr_indexed_data (and consequently also delay normalization) computation methods?
The geomean table should probably be transposed, as it is hard to read in its current form.
While the routing time is down ~20%, the overall flow time is up ~6%. Where did the time go?
Place time is up ~16% :(
So overall:
I'd say this change seems pretty bad on the Titan benchmarks. What are the latest SymbiFlow results?
@litghost The SymbiFlow test results seemed promising, achieving better runtime as well as CPD w.r.t. the baseline. I am currently gathering more consistent data and will add it to the colab page, so that it also displays the SymbiFlow test results. I think the issue might be related to the missing inv_length during the delay normalization, as you noticed in one of the review comments. Investigating whether this is the case.
(force-pushed 9d04721 to 94605cc)
@litghost @vaughnbetz The new results from the latest hydra run, with the inv_length addition to the delay normalization factor, have finished (colab page with updated results). Summary: CPD improved by ~1.2%. Worth noting that most of the biggest designs got fairly good improvements, e.g. bitcoin miner:
directrf:
@vaughnbetz A ~6% runtime jump but better QoR using a more robust and simpler algorithm seems like a reasonable trade-off to me. Thoughts?
Yes, this does seem good to me too. I'm a bit baffled as to why placement time would go up though -- did the placement delay matrix computation get slower? Otherwise I wouldn't expect placement to be affected. If the placement delay matrix computation slowed down, we may be able to claw it back, as it seems like it wouldn't be fundamental. In any case, the placement delay matrix can be cached, so it could be saved for some flows. Can you parse out the place time components to see where the slowdown comes from?
@vaughnbetz I think I cannot access the log files from the hydra runs, but I have locally run some of the smaller designs and indeed, the place_delay matrix computation time has increased. cholesky_mc benchmark: Baseline:
Changes:
(force-pushed 94605cc to 66d9b87)
Thanks. That's really strange. I can't figure out why a different base cost normalization would slow down the place delay matrix calculation. Presumably the routing paths being explored are different, but I'm still not sure how that leads to this behaviour. Probably worth double-checking that any data (rr_indexed_data?) needed by the place_delay_matrix is properly loaded, and printing out the base costs of all resources to make sure none have funny data (they should all just be scaled up or down somewhat, uniformly) at the point where the place delay matrix is calculated.
@vaughnbetz The default for the
I have double-checked that the base costs in the indexed data have no funny values prior to calculating the delay matrix.
Thanks. The A* search uses a high criticality, so it is looking for low-delay paths; the base cost shouldn't really matter much to it. So something odd is happening to slow it down so much.
Signed-off-by: Alessandro Comodi <acomodi@antmicro.com>
(force-pushed 64da26f to cfc4e6d)
@litghost @vaughnbetz The latest runs completed. I have separated them into the following colab pages: The best results are obtained with the following settings:
Thanks @acomodi. Titan results show all 3 changes are beneficial, so the new normalization should go in. A higher astar_fac and doubled pres_fac are beneficial for Titan, but may be mildly negative for VTR. Could we set different defaults for those?
@vaughnbetz Ok. Just to be sure, you mean changing the default router_delay_profiler astar_fac and the pres_fac to get VTR tests in line?
So, there are two different tables:
I have also double-checked that, with the current implementation, some of the strong tests do not meet QoR, but they do by lowering the initial_pres_fac (where the initial_pres_fac default is 1.0). Would it be better to update the golden results, or to change the initial_pres_fac for the failing strong tests?
Ah, got it, thanks. Results are good for all 3 changes for Titan (every change improves things).
Ok, I will run another set of tests on Hydra to get the final results. In the meantime, I collected some SymbiFlow tests with and without the changes in this PR. The results are collected here. NOTE:
@litghost FYI
(force-pushed cfc4e6d to 40f8a4e)
@vaughnbetz The final runs on Hydra have completed; more details in the colab pages linked below:
(force-pushed d168bdd to ed6c456)
VTR and Titan results look good. SymbiFlow results look OK to me too, but I haven't analyzed them as much / am not as familiar with them.
Suggest some additional comments; otherwise looks good.
t_rr_type rr_type,
int nodes_per_chan,
const t_rr_node_indices& L_rr_node_indices) {
static void load_rr_indexed_data_T_values() {
    /* Loads the average propagation times through segments of each index type */
This comment looks out of date. I think now you're going through all rr_nodes and getting medians for their delay. Good to update this comment and check for other out-of-date comments. Should also describe what we're trying to do here (get typical values of T_linear etc. for different wire types, which are used for things like constructing lookaheads and place delay matrices).
R_total = (float*)vtr::calloc(device_ctx.rr_indexed_data.size(), sizeof(float));
std::vector<int> num_nodes_of_index(rr_indexed_data.size(), 0);
std::vector<std::vector<float>> C_total(rr_indexed_data.size());
std::vector<std::vector<float>> R_total(rr_indexed_data.size());
/* August 2014: Not all wire-to-wire switches connecting from some wire segment will
Again, I think we're averaging / doing a median over all rr_nodes, not just some in one chosen channel. Should update the comment.
@@ -453,44 +409,41 @@ static void load_rr_indexed_data_T_values(int index_start,
// the combined capacitance of the node and internal capacitance of the switch. The
// second transient response is the result of the Rnode being distributed halfway along a
Change "second transient response" to "multiplication of the second term by 0.5"
}
static void calculate_average_switch(int inode, double& avg_switch_R, double& avg_switch_T, double& avg_switch_Cinternal, int& num_switches, short& buffered) {
static void calculate_average_switch(int inode, double& avg_switch_R, double& avg_switch_T, double& avg_switch_Cinternal, int& num_switches, short& buffered, vtr::vector<RRNodeId, std::vector<RREdgeId>>& fan_in_list) {
Add comment saying what this routine does, and why it's useful.
LGTM. @vaughnbetz, we good to merge?
We can potentially merge this, and then I will open a follow-up PR with the additional detailed comments.
Sounds good, merging.
Signed-off-by: Alessandro Comodi acomodi@antmicro.com
Description
This PR aims at making the delay normalization factor used in the base cost calculation more representative of the real underlying device fabric.
Prior to this PR, the calculation of the cost indices' delays/resistances/capacitances was based on heuristics, which proved to be insufficient for complex RR graphs such as the Series7 one.
All the nodes in the RR graph are now checked and their switch delays averaged; every node is now considered instead of picking a few as samples.
The main differences from the previous indexed data calculation are:
- short switches no longer increase the total delay values.
- delays are normalized by the CLB distance (clb_dist).
).Motivation and Context
Base costs for the Series7 device had a 15x reduction, making the delay and base costs no longer comparable.
The 15x reduction was due to an uncaught bug in the VtR/SymbiFlow fork, which got caught as soon as I tested the upstream master against SymbiFlow tests.
How Has This Been Tested?
Currently, all strong and basic tests run through to completion, but some of the strong tests do report a change in the observed metrics, which in all cases but one seems to be a positive change (lower CPD, wire length, etc.).
Types of changes
Checklist: