Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Utilize dominators for constructing extended basic blocks #142

Closed
jserv opened this issue Jun 5, 2023 · 2 comments
Closed

Utilize dominators for constructing extended basic blocks #142

jserv opened this issue Jun 5, 2023 · 2 comments
Assignees

Comments

@jserv
Copy link
Contributor

jserv commented Jun 5, 2023

Quoted from Basic Blocks and CFG, the definition of extended basic block (EBB):

  • Extended basic block a maximal sequence of instructions beginning with a leader that contains no join nodes other than its first node.
  • Has a single entry, but possible multiple exit points.
  • Some optimizations are more effective on extended basic blocks.

We can identify loops by using dominators:

  • a node A in the flowgraph dominates a node B if every path from entry node to B includes A.
  • This relations is antisymmetric, reflexive, and transitive.

back edge: An edge in the flow graph, whose head dominates its tail (example - edge from B6 to B4).

A loop consists of all nodes dominated by its entry node (head of the back edge) and having exactly one back edge in it.

Intercept contains an effective dominator implementation. See

Usage:

void codegen_optimise(CodegenContext *ctx) {
  opt_inline_global_vars(ctx);
  opt_analyse_functions(ctx);

  /// Optimise each function individually.
  do {
    foreach_ptr (IRFunction*, f, ctx->functions) {
      if (f->is_extern) continue;

      DominatorInfo dom = {0};
      do {
        build_dominator_tree(f, &dom, true);
        opt_reorder_blocks(f, &dom);
      } while (
          opt_const_folding_and_strengh_reduction(f) ||
          opt_dce(f) ||
          opt_mem2reg(f) ||
          opt_jump_threading(f, &dom) ||
          opt_tail_call_elim(f)
      );
      free_dominator_info(&dom);
    }
  }

  /// Cross-function optimisations.
  while (opt_inline_global_vars(ctx) || opt_analyse_functions(ctx));
}

Similarly, blink comes with an approach to detect loops during code generation.

qwe661234 pushed a commit to qwe661234/rv32emu that referenced this issue Aug 30, 2023
Previously, we employed recursive jump translation to implement an extended
basic block. Nevertheless, this approach makes it challenging to detect
loop paths since we position the loop's entry block inside the block, and
its using frequency would not be updated.

Based on this observation, it becomes necessary to eliminate the recursive
jump translation. By doing so, we can accurately update the using frequency
of the loop's entry block.

See: sysprog21#142
qwe661234 pushed a commit to qwe661234/rv32emu that referenced this issue Aug 30, 2023
Previously, we employed recursive jump translation to implement an extended
basic block. Nevertheless, this approach makes it challenging to detect
loop paths since we position the loop's entry block inside the block, and
its using frequency would not be updated.

Based on this observation, it becomes necessary to eliminate the recursive
jump translation. By doing so, we can accurately update the using frequency
of the loop's entry block.

As shown in the performance results below, we gain 4% performance improvement
when running coreMark and SciMark2 and lost 1% performance when running
dhrysone.

* Intel Core i7-11700

|   Metric   | origin  | proposed |Speedup|
|------------+---------+----------+-------|
| CoreMark   | 2193.28 | 2289.26  |  +4%  |
| SciMark2   |   13.45 |   18.48  |  +4%  |
| Dhrystone  | 1413.11 | 1447.11  |  -1%  |

See: sysprog21#142
@qwe661234
Copy link
Collaborator

Previously, we employed recursive jump translation to implement an extended basic block. Nevertheless, this approach makes it challenging to detect loop paths since we position the loop's entry block inside the block, and its using frequency would not be updated.

Based on this observation, it becomes necessary to eliminate the recursive jump translation. By doing so, we can accurately update the using frequency of the loop's entry block.

image

As shown in the performance results below, we gain 4% performance improvement when running coreMark and SciMark2 and lost 1% performance when running dhrysone.

  • Intel Core i7-11700
Metric origin proposed Speedup
CoreMark 2193.28 2289.26 +4%
SciMark2 13.45 18.48 +4%
Dhrystone 1413.11 1447.11 -1%

qwe661234 pushed a commit to qwe661234/rv32emu that referenced this issue Aug 30, 2023
Previously, we employed recursive jump translation to implement an extended
basic block. Nevertheless, this approach makes it challenging to detect
loop paths since we position the loop's entry block inside the block, and
its using frequency would not be updated.

Based on this observation, it becomes necessary to eliminate the recursive
jump translation. By doing so, we can accurately update the using frequency
of the loop's entry block.

As shown in the performance results below, we gain 4% performance improvement
when running coreMark and 37% when running SciMark2, but we lost 1% performance
when running dhrysone.

* Intel Core i7-11700

| Metric     | origin  | proposed |Speedup|
|------------+---------+----------+-------|
| CoreMark   | 2193.28 |  2289.26 |   +4% |
| SciMark2   |   13.45 |    18.48 |  +37% |
| Dhrystone  | 1413.11 |  1447.11 |   -1% |

See: sysprog21#142
@jserv
Copy link
Contributor Author

jserv commented Sep 17, 2023

The effectiveness has been confirmed. We shall look for further faster approaches for loop detection.

@jserv jserv closed this as completed Sep 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants