Description
In a revision of an existing architecture to clean up RR graph bugs, the router wirelength and runtime have both blown up for test circuits on the order of thousands of LUTs. The particulars are detailed under Current Behaviour below.
Both RR graphs are generated from scratch (not derived/revised from VPR's RR graph generator). The QoR of the original graph was benchmarked against the RR graph VPR generates from the arch XML and found to be roughly equal (VPR graph wirelength: 71017; runtime also ~15 seconds). The graph VPR generates from the arch XML differs substantially from the custom graph; those differences are also discussed under Current Behaviour.
The salient differences between the two custom graphs are:
Original graph:
- One of the two IOB types contains an extra IPIN node at ptc=0, side=TOP that does not correspond to hardware. There are 8 instances of this IOB type and 72 IOBs of the other type.
- The routing channel permutation algorithm has a bug such that the ptc_num values of the CHANX/CHANY nodes with edges to a given IPIN do not necessarily form pairs of consecutive ptc_num values. For example, the lowest ptc values assigned to CLB IPIN ptc=0 are CHANX ptc=1 and CHANX ptc=8.
Revised graph:
- The IOB IPINs with side=TOP, and the edges from CHANX/CHANY leading to them, are removed.
- The routing channel permutation bug is fixed, so pairs of consecutive ptc_nums (INC_DIR and DEC_DIR) are assigned; e.g. CHANX ptc=0 and ptc=1 have edges to the same IPIN.
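To make the pairing convention concrete, here is a minimal sketch (not the actual generator code) of the fixed behavior, assuming tracks alternate INC_DIR on even ptc values and DEC_DIR on odd ptc values, so a logical track choice `t` maps to the consecutive ptc pair `(2t, 2t+1)`:

```python
# Hedged sketch of the ptc pairing convention described above.
# Assumption: even ptc = INC_DIR track, odd ptc = DEC_DIR track, so a
# connection-box track choice t yields the consecutive pair (2*t, 2*t + 1).

def track_ptc_pair(track_index):
    """Return the (INC_DIR, DEC_DIR) ptc pair for a logical track choice."""
    return (2 * track_index, 2 * track_index + 1)

# In the fixed graph, an IPIN connected to logical track 0 sees the
# consecutive pair ptc=0 and ptc=1:
assert track_ptc_pair(0) == (0, 1)
# The buggy permutation instead produced non-consecutive ptcs such as 1 and 8.
```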
Expected Behaviour
Because the graphs are so similar, the expected behavior is to see similar wirelength and router results across both RR graph XML versions.
Current Behaviour
The architecture is a customized eFPGA with some unusual grid properties:
* The perimeter X/Y locations (x=0, y=0, y=30, x=36) are all EMPTY.
* There are two IO block types that differ only in the number of I/Os they contain:
  a. two copies of an IOB with 8 I/Os per IOB are located on each of the four sides of the array. These are the IOBs with the spurious TOP-side IPIN in the original graph.
  b. 24 copies of an IOB with 50 I/Os per IOB are located on three of the four sides of the array.
* Leftover boundary tile locations (x=1, y=1, y=29, x=35) are populated with CLBs.
* fc_in=0.15 and fc_out=0.15 for all pb_types.
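For reference, a fractional Fc means each pin connects to roughly `fc * chan_width` tracks. A minimal sketch of that arithmetic follows; the exact rounding mode (round to nearest, minimum of 1) is an assumption for illustration and may not match VPR's internal rounding:

```python
# Hedged sketch: approximate number of tracks a pin connects to under a
# fractional Fc. The rounding policy here is an assumption, not VPR's
# documented behavior.

def tracks_per_pin(fc, chan_width):
    """Approximate connection-box flexibility for a fractional Fc."""
    return max(1, round(fc * chan_width))

# e.g. with fc_in = 0.15 and a channel width of 60 tracks, each input pin
# would connect to about 9 tracks.
```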
example test circuit: synthetic finite state machine, packs into 631 CLBs with 8 4-LUTs each.
original graph wirelength: 70173
original graph runtime: ~15 seconds
revised graph wirelength: n/a (routing fails)
revised graph runtime: > 30 minutes
I have also shown that if I modify the original RR graph by manually deleting the edges from the CHANX/CHANY nodes to the IOB IPINs with side=TOP (but not deleting the nodes), the wirelength and runtime are effectively the same as the original RR graph.
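The manual experiment above (deleting edges into the TOP-side IPINs while keeping the nodes) can be sketched as a small script against VPR's rr_graph XML schema. This is an illustrative reconstruction, not the script actually used; the file paths are placeholders, and it assumes the standard `<node>`/`<loc>`/`<rr_edges>`/`<edge>` element names from VPR's rr_graph file format:

```python
# Hedged sketch: strip edges whose sink is a side=TOP IPIN from an
# rr_graph XML, leaving the IPIN nodes themselves in place.
import xml.etree.ElementTree as ET

def strip_top_ipin_edges(in_path, out_path):
    tree = ET.parse(in_path)
    root = tree.getroot()
    # Collect ids of IPIN nodes whose <loc> has side="TOP".
    top_ipins = set()
    for node in root.iter("node"):
        if node.get("type") == "IPIN":
            loc = node.find("loc")
            if loc is not None and loc.get("side") == "TOP":
                top_ipins.add(node.get("id"))
    # Remove edges that sink into those nodes.
    edges = root.find("rr_edges")
    for edge in list(edges):
        if edge.get("sink_node") in top_ipins:
            edges.remove(edge)
    tree.write(out_path)
    return len(top_ipins)
```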
Regarding a comparison to a VPR-generated RR graph derived from the original architecture XML, the following observations can be made:
* VPR inserts an additional row and column of routing channels at x=0 and y=0 that are not present in the custom RR graph.
* The custom RR graph compensates for the missing row/column by looping channels that would continue to x=0 or y=0 back at x=1 or y=1, respectively.
* VPR's routing channel index permutation algorithm for creating edges from CHANX/CHANY to IPINs differs from the custom graph generator's algorithm (both the original and the bug-fixed version).
Possible Solution
If the mere existence of a second side for the IPIN node helps VPR in some way (for example, by fooling the router lookahead into making better choices), it would be nice to understand why this is the case, to help motivate modifying my graph generator to add these nodes systematically instead of as part of a bug.
Additionally, if there is risk in an architecture XML that implies a different RR graph than the one loaded with --read_rr_graph, it would be good to know what problems this kind of deviation can cause.
Finally, some observations from other experiments I have run attempting to improve the health of the architecture post-bug fix:
- Increasing the flexibility of the connection boxes partially mitigates but does not fully resolve the issue.
- Trials on smaller versions of the architecture with smaller test circuits produce similar (scaled) results.
Steps to Reproduce
The problem can be reproduced on request. The original graph is available in Zero ASIC's logik release, and revised graphs for the same architecture can be furnished for collaboration purposes if deemed appropriate.
Context
The goal of this work is to generate an eFPGA core to insert into an SoC. A custom RR graph generator is needed to properly support genfasm and to ensure CAD correspondence to the eFPGA hardware. For these reasons, we rely on generating a custom RR graph rather than reverse-engineering VPR's auto-generated RR graph. Given that, whatever can be done to keep the router stable against small perturbations of the routing graph is beneficial.
Your Environment
- VTR revision used: commit hash de31f09
- Operating System and version: Ubuntu 20.04
- Compiler version: GCC 10.0